Category Archives: R programming

Modifiying Data Tables in R VIDEO

In the video below, we look at how to modify data tables in R.

Modifiying Data Tables in R

In this post, we will look at how you can modify data tables in R. Specifically, we will look at how to add columns, fix errors, and calculate values. Below is the initial code to prepare the data we will use.

library(data.table)
mtcars<-data.table(mtcars)

In the code above, we load the data.table package. We then convert the “mtcars” dataset to be a data.table in the second line. Below is a look at all the columns and the first few rows of data

head(mtcars)
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Adding a New Column

Adding a new column involves a simple process. In the code below, we call “mtcars” and inside the brackets, we place a comma first. This tells R that we want all rows of data.

After the comma, we create a name for our new column called “distance_travel.” After this, we place the := notation to indicate that we are calculating a value. After the :=, we write mpg * 4. This means multiply the mpg column by 4.

# Add a new column, travel distance
mtcars[, distance_travel := mpg*4]

Below is the output for the code above. To save space, we will not print every column. Instead, we will subset what we want as shown below.

mtcars[1:3,c(1,10:12)]
     mpg  gear  carb distance_travel
   <num> <num> <num>           <num>
1:  21.0     4     4            84.0
2:  21.0     4     4            84.0
3:  22.8     4     1            91.2

You can see the calculated value in the fourth column. This value is mpg multiplied by 4.

Fixing Errors

We can also fix errors for this example. Let’s assume that any value less than 21.5 in the mpg column is a mistake. We can replace those values with NA with the code below.

#fix errors change mpg less than 21.5 to NA
mtcars[mpg <21.5, mpg := NA]

In this code, you can see that we use brackets again and indicate that we are looking for all mpg values that are less than 21.5. These values will be rewritten with NA. Below is the output for select columns and rows only.

mtcars[1:3,1:3]
     mpg   cyl  disp
   <num> <num> <num>
1:    NA     6   160
2:    NA     6   160
3:  22.8     4   108

The NAs are where there used to be mpg values less than 21.5

Adding Columns for Groups

Another value we may want to calculate is to count values by groups. In the code below, we count the number of cars that have an automatic or manual transmission. To do this, we use brackets again. Inside the brackets, we place a comma. After the comma, we name our new variable “total_am” followed by the := notation. After the := notation, we place .N. The .N notation tells are to count all rows. In this case, we are counting all rows by the variable “am,” which stands for automatic transmission, and yes or no. Below is the code and output

# Add a new column equal to total cars with automatic transmission
 mtcars[, total_am := .N, by = am]

mtcars[1:5,c(1,9,13)]
     mpg    am total_am
   <num> <num>    <int>
1:    NA     1       13
2:    NA     1       13
3:  22.8     1       13
4:    NA     0       19
5:    NA     0       19

You can see that there are 13 cars with automatic transmissions and 19 cars without automatic transmissions. Each row has a 13 if it has an automatic transmission and a 19 if it does not. This calculation is something similar to what a windowing function does in SQL.

Calculate Values of Groups

It is also possible to calculate other values. In the code below, we calculate the average mpg by the number of cylinders a car has. The syntax for this code should be looking familiar by now. Notice how after the := notation, it is possible to use a function. In our case, we are using the mean() function.

# Calculate the mean mpg by cyl 
mtcars[, mean_mpg:=mean(mpg,na.rm=TRUE), 
            by = cyl]

mtcars[1:3,c(1,2,13)]
     mpg   cyl mean_mpg
   <num> <num>    <num>
1:  21.0     6 19.74286
2:  21.0     6 19.74286
3:  22.8     4 26.66364

The results are similar to the previous example, except this time we calculate the mean of mpg by cyl.

Using LHS := RHS Form

LHS := RHS Form is another way to indicate to R what you want to do. On the left-hand side, you create the names of your columns. After the := sign you place the functions you are using. Notice how the functions are wrapped inside parentheses with a period on the outside. Also note the comma at the beginning of the square brackets and in front of the “by” argument.

# Add columns using the LHS := RHS form
mtcars[, c("mean_mpg", "median_mpg") := .(mean(mpg), median(mpg)), 
        by = cyl]

In the code above, we calculate the mean and the median mpg by the number of cylinders. Below is the output

mtcars[1:3,c(1,2,13,14)]
     mpg   cyl mean_mpg median_mpg
   <num> <num>    <num>      <num>
1:  21.0     6 19.74286       19.7
2:  21.0     6 19.74286       19.7
3:  22.8     4 26.66364       26.0

You can clearly see the two new columns that show the mean and median for mpg.

Functional Form

Functional form is another way to get the same results. In the code below, we are still using the square brackets. Inside the square brackets, we first have a comma. Next, you have our := symbol, but this time the := symbol is inside grave accents (“). The grave accent is next to the number 1 on a standard keyboard and is also home to the tilde sign (~). After the := symbol you create the column name,s followed by the function you are using, separated by commas. After all of this, you indicate the grouping using the “by” argument.

# Add columns using the functional form
mtcars[, `:=`(mean_mpg_func = mean(mpg), 
               median_mpg_func = median(mpg)), 
        by = cyl]

mtcars[1:3,c(1,2,15,16)]
     mpg   cyl mean_mpg_func median_mpg_func
   <num> <num>         <num>           <num>
1:  21.0     6      19.74286            19.7
2:  21.0     6      19.74286            19.7
3:  22.8     4      26.66364            26.0

The results speak for themselves.

Functional Form with Complex Grouping

So far, we have been grouping with only one column in the “by” argument. However, it is possible to have more than one column in the “by” argument while also having another filter in place. In the code below, we filter for mpg greater than 21 while also grouping by the number of cylinders, and whether the car has an automatic transmission or not.

# Add the mean_duration column
mtcars[mpg>21,ave_mpg :=mean(mpg),
        by = .(cyl,am)]

mtcars[1:3,c(1,2,9,17)]
     mpg   cyl    am ave_mpg
   <num> <num> <num>   <num>
1:  21.0     6     1      NA
2:  21.0     6     1      NA
3:  22.8     4     1  28.075

The reason for the NA is that there are no cars that meet the criteria. In other words, there was not more than one car that had an mpg greater than 21 that was 6-cylinder and also had an automatic transmission.

COnclusion

Data.table is just another way to manipulate data inside R. Generally, it is considered faster when dealing with large datasets. The purpose here was only to explore the potential of this package if it is needed.

Data Manipulation with Data.Table in R VIDEO

Leave a reply

The video below provides examples of ways to manipulate data and conduct various calculations using data.table in R

Data Table Basics in R VIDEO

Leave a reply

Data tables provide an efficient way to work with and manipulate data. In the video below, examples are provided of the strengths and benefits of using data tables.

Data Tables Basics in R

Leave a reply

Data frames are the default way that data is often stored in R. However, another option for storing data in R is using data tables. As we will see, data tables allow you to accomplish much more than data frames. For now, we will focus on some basic features of data tables and data frames before moving to actions that are easier to perform with data tables.

Loading Packages and Data Preparation

We will start by loading the package data.table and preparing our data. The data.table package is loaded using the library() function. We will use the mtcars and iris datasets for the various examples. Both of these datasets are available by default within R. Since our focus is on data tables, we will convert both the mtcars and iris datasets into data tables and store them in objects with the same name. Below is the code.

library(data.table)
mtcars<-data.table(mtcars)
iris<-data.table(iris)

Next, we will use the head () function to examine the mtcars and iris datasets quickly.

The mtcars dataset has data about cars while the iris dataset has data bout various features of flowers.

Subsetting Basics

The first five examples can be performed data frames or data tables. We will begin by subsetting a single row from a data table as shown below.

#filtering with positive integers
row_2 <- mtcars[2,]
row_2
  mpg cyl disp  hp drat    wt  qsec vs am gear carb
2  21   6  160 110  3.9 2.875 17.02  0  1    4    4

In the code above, we filter for the second row in the mtcars data table. This is done using brackets followed by a number for the row we want. The common after the number 2 in the brackets would allow us to select a column. Since there is no number after the comma, this indicates that R should select all rows. This is why we have all the data from row number 2.

In the example below, we will select multiple rows at once using the c() function and a colon.

#multiple rows
> rows_1_5 <- mtcars[c(1:5),]
> rows_1_5
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

The main difference in the code above is the use of the c() function or the concatenate function. Inside this function, we tell R we want the first 5 rows and columns. However, it is not necessary to pull consecutive rows, as we can also pull whatever rows we want specifically.

#filtering non consecutive rows
rows_1_3_5 <- mtcars[c(1,3,5),]
rows_1_3_5
   mpg cyl disp  hp drat   wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
3 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
5 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

In the next example above, inside the c() function, we indicate that we want the 1st, 3rd, and 5th rows along with all of the columns. In the next two examples, we will learn how to leave rows rather than include them.

> only_last_two <- mtcars[-c(1:30),]
> only_last_two
    mpg cyl disp  hp drat   wt qsec vs am gear carb
31 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
32 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

You can use a minus sign in front of your subset to remove everything that is inside the brackets. For example, in the code above, we place a minus sign in front of rows 1 to 30 to indicate to R to remove rows 1 to 30. This is why in the output, only rows 31 and 32 are available. Just as in the other examples, the numbers do not have to be consecutive, as shown below

> exclude_some <- mtcars[-c(1:10,12:32),]
> exclude_some
    mpg cyl  disp  hp drat   wt qsec vs am gear carb
11 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4

In the above example, we exclude rows 1 to 10 and rows 12 to 32, leaving only row 11.

Using data.table

We will now do three examples that require the use of data.table. The first example below removes the first 30 rows and the last row of 32, which means only row 31 is displayed

> not_first_last <- mtcars[-c(1:30,.N)]
> not_first_last
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:    15     8   301   335  3.54  3.57  14.6     0     1     5     8

If you look closely, the output is different. There is a 1 next to all of the data, which gives the impression that this is row 1 from the dataset. However, this is not row 1 of the dataset but rather the first row of the subsetted data. In addition, you can see the <num> above all columns, which means this data is numerical. Lastly, in the code, you see a .N, which tells R to remove the last row of the data.

In the next example, we are going to subset the data so that only cars with an automatic transmission appear (am==1).

> am_1 <- mtcars[am == 1]
> am_1
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 5:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 6:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 7:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 8:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 9:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
10:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
11:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
12:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
13:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Within the brackets, you simply indicate what values you want for the column that is being used for the filtering. Naturally, you can create more complex queries as shown below.

> am_1_mpg_25 <- mtcars[am==1 & mpg > 25]
> am_1_mpg_25
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
3:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
4:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
5:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
6:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2

IN the last example, we filtered the data for cars with automatic transmissions and with mpg above 25.

Conclusion

Data tables are highly flexible and allow a user to do things in a way that is much more efficient, depending on the situation. This is yet another excellent tool that can be deployed by an R enthusiast.

Extracting & matching in R VIDEO

Leave a reply

This video provides examples of how you can extract and match patterns using regular expressions in R. This is a great tool for manipulating strings as needed.

Fuzzy Joins with R

Leave a reply

In this post, we will look at how you can make joins between datasets using a fuzzy criteria in which the matches between the datasets are not 100% the same. In order to do this, we will use the following packages in R.

library(stringdist)
library(stringr)
library(fuzzyjoin)

The “stringdist” package will be used to measure the differences between various strings that we will use. The “stringr” package will be used to manipulate some text. Lastly, the “fuzzyjoin” package will be used to join datasets.

String Distance

The stringdist() function is used to measure the differences between strings. This is measured in many different ways. For our purposes, the distance is measured by the number of changes the function has to make so that the second string is the same as the original string of comparison. Below is code that uses three different methods each to compare the strings in the function.

> stringdist("darrin", "darren", method = "lv")
[1] 1
> stringdist("darrin", "darren", method = "dl")
[1] 1
> stringdist("darrin", "darren", method = "osa")
[1] 1

This code is simple. First, we call the function. Inside the function, the first string is the ground truth string which is the string everything else is compared to. The second string is the other string that is compared to the first one. The method is how the difference is measured. For each example, we picked a different method. “lv” stands for Levenshtein distance, “dl” stands for full Damerau-Levenshtein distance, and “osa” stands for Optimal String Alignment distance. The details of each of these methods can be found by looking at the documentation for the “stringdist” package. Also, note that there are other methods beyond this that are available as well.

The value for these methods is 1, which means that only one change is needed to convert “darren” to “darrin”. Most of the time the methods are highly similar in their results but just to demonstrate, below is an example where the methods disagree.

> stringdist("darrin", "dorirn", method = "lv")
[1] 3
> stringdist("darrin", "dorirn", method = "dl")
[1] 2
> stringdist("darrin", "dorirn", method = "osa")
[1] 2

Now, the values are different. The reason behind these differences is explained in the documentation.

amatch()

The amatch() function allows you to compare multiple strings to the ground truth and indicate which one is closest to the original ground truth string. Below is the code and output from this function.

amatch(
  x = "Darrin",
  table = c("Darren", "Daren", "Darien"),
  maxDist = 1,
  method = "lv" 
) 
[1] 1

Here is what we did.

The x argument is the ground truth. In other words, all other strings are compared to the value of x.
The “table” argument contains all the strings that are being compared to the x argument.
“maxDist” is how far away or how many changes max can be made in order for the strings in the “table” to be considered the best match
“method” is the method used to calculate the “maxdist”
The output is 1. This means that the first string in the table “Darren” has a max distance of 1

Fuzzy Join

The fuzzy join is used to join tables that have columns that are similar but not the same. Normally, joins work on exact matches but the fuzzy join does not require this. Before we use this function we have to prepare some data. We will modify the “Titanic” dataset to run this example. The “Titanic” dataset is a part of R by default and there is no need for any packages. Below is the code for the data preparation.

Titanic_1<-as.data.frame(Titanic)
Titanic_2<-as.data.frame(Titanic)

Titanic_1$Sex<-str_to_lower(Titanic_1$Sex)
Titanic_1$Age<-str_to_lower(Titanic_1$Age)

Here is what we did.

We saved two copies of the “Titanic” dataset as dataframes. This was done because the fuzzy join function needs dataframes.
Next, we made clear differences between the two datasets. For “Titanic_1” we lowercase the sex, and age columns so that there was not an exact match when joining these two dataframes with the fuzzy join function.

We will now use the fuzzy join function. Below is the code followed by the output.

stringdist_join(
  Titanic_1,
  Titanic_2,
  by = c("Age" = "Age","Sex"="Sex"),
  method = "lv",
  max_dist = 1,
  distance_col = "distance"
)

The stringdist_join() function is used to perform the fuzzy join. “Titanic_1” is the x dataset and “Titanic_2” is the y dataset. The “by” argument tells the function which columns are being used in the join. The “method” argument indicates how the distance is calculated between the rows in each dataset. The “max_dist” argument is the criteria by which a join is made. In other words, if the distance is greater than one no join will take place. Lastly, the “distance_col” argument creates new columns that show the distance between the compared columns.

The output was a full join. All columns from both datasets are present. The columns with “.x” are from the “Titanic_1” while the columns with “.y” are from “Titanic_2”. The “.distance” column tells us the difference when that row of data was compared from each dataset. For example, in row 1 the “Age.distance” is 1. This means that the difference in “Age.x” and “Age.y” is 1. The only difference is that “Age.x” is lowercase while “Age.y” is capitalized.

Conclusion

The tools mentioned here allow you to match data that is different with a clear metric of the difference. This can be powerful when you have to join data that does not have a matching column in both datasets. Therefore, there is a place for the tools in the life of any data analyst who deals with fuzzy data like this.

Using glue_collapse() in R VIDEO

Leave a reply

This video will provide examples of how to use the glue_collapse() function from the glue package in R. This function is often used for dealing with strings and we will see a few other tricks as well with this function

Glue Collapse in R

Leave a reply

The glue_collapse() function is another powerful tool from the glue package. In this post, we will look at practical ways to use this function.

Collapsing Text

As you can probably tell from the name, the glue_collapse() function is useful for collapsing a string. In the code below, we will create an object containing several names and use glue_collapse to collapse the values into one string.

> library(glue)
> many_people<-c("Mike","Bob","James","Sam")
> glue_collapse(many_people)
MikeBobJamesSam

In the code above we called the glue package. Next, we created an object called “many_people” which contains several names separated by commas. Lastly, we called the glue_collapse() function which removes the quotes and commas from the string.

Separate & Last

Another great tool of the glue_collapse() function is the ability to separate strings and have a specific last argument. This technique helps to make the output highly readable. Below is the code

> glue_collapse(many_people,sep = ", ", last = ", and ")
Mike, Bob, James, and Sam

In the code above we separate each string with a comma followed by a space in the “sep” argument. The “last” argument tells R what to do before the last word in the string. In this example, we have a comma followed by a space and the word and.

Collapse a Dataframe

The glue_collapse() function can also be used with data frames. In the example, below we will take a column from a dataframe and collapse it.

> head(iris$Species)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
> glue_collapse(iris$Species)
setosasetosasetosasetosasetosasetosaseto.....

In the code above, we first take a look at the “Species” column from the “iris” dataset using the head() function. Next, we use the glue_collapse() function in the “Species” column. YOu can see how the rows are all collapsed into one long string in this example.

glue and glue_collapse working together

You can also use the glue() and glue_collapse function together as a team.

> glue(
+   "Hi {more_people}.",
+   more_people = glue_collapse(
+     many_people,sep = ", ",last = ", and "
+   )
+ )
Hi Mike, Bob, James, and Sam.

This code is a little bit more complicated but here is what happened.

On the outside, we start with the glue() function in the first line.
Inside the glue() function we create a string that contains the word Hi and a temporary variable called “more_people”.
Next, we define the temporary variable “more_people with the help of the glue_collapse() function.
Inside the glue_collapse() function we separate the strings inside the “many_people” object.

As you can see, the use of the glue_collapse() and glue() functions can be endless.

Conclusion

The glue_collapse() function is another useful tool that can be used with text data. Knowing what options are available for data analysis makes the process much easier.

Using Glue in R VIDEO

Leave a reply

The glue package provides a lot of great tools for working with strings in R. In this video, we will focus on one particular function within this package that is highly useful for various purposes.

Using Glue in R

Leave a reply

The glue package in R provides a lot of great tools for using regular expressions and manipulating data. In this post, we will look at examples of using just the glue() function from this package.

Paste vs Glue

The paste() function is an older way of achieving the same things that we can achieve with the glue() function. paste() allows you to combine strings. Below we will load our packages and execute a command with the paste() function.

> library(glue)
> library(dplyr)

> people<-"Dan"
> paste("Hello",people)
[1] "Hello Dan"

In the code above, we load the glue and the dplyr package (we will need dplyr later). We then create an object called “people” that contains the string “Dan”. We then used the past function to combine the “people” vector with the string “Hello”. The output is at the bottom of the code.

Below is an example of the same output but using the glue() function

> glue("Hello {people}")
Hello Dan

Inside the glue() function everything is inside parentheses. However, the object “people” is inside curly braces and this indicates to the glue() function to look for what “people” represents. The printout is the same but without parentheses.

Multiple Strings

Below is an example of including multiple strings in the same glue() function

> people<-"Dan"
> people_2<-"Darrell"
> glue("Hello {people} and {people_2}")
Hello Dan and Darrell

In the first two lines above we make our objects. In line 3 we used the glue() function again and inside we included both objects in curly braces.

In another example using multiple strings we will replace text if it meets a certain criteria.

> people<-"Dan"
> people_2<-NA
> glue("Hi {people} and {people_2}",.na="What")
Hi Dan and What

In the code above we start by creating two objects. The second object (people_2) has stored NA. The code in the third line is the same with the exception of the “.na” argument. The “.na” argument is set to the string “What” which tells R to replace any NA values with the string “What”. The output is in the final line.

Temporary Variables

It is also possible to make variables that are temporary. The temporary variable can be named or unnamed. Below is an example with a named variable.

> glue("Dan is {height} cm tall.",height=175)
Dan is 175 cm tall.

The temporary variable “height” is inside the curly braces. The value for “height” is set inside the function to 175.

It is also possible to have unnamed variables inside the function. Below we will use a function inside the curly braces.

> glue("The average number is {mean(c(2,3,4,5))}")
The average number is 3.5

The example is self-explanatory. We used the mean() function inside the curly braces to get a calculated value. As you can see, the potential is endless.

Using Dataframes

In our last example, we will see how you can create a data frame and using input from one column to create a new column.

> df<-data.frame(column_1="Dan")
> df
  column_1
1      Dan
> df %>% mutate(new_column = glue("Hi {column_1}"))
  column_1 new_column
1      Dan     Hi Dan

Here is what we did.

We made a dataframe called df. This dataframe contains one column called column_1. In column_1 we have the string Dan.
In line 2 we display the values of the dataframe.
Next, we use the mutate() function to create a new column. Inside the mutate function we use the glue function and set it to create a string that uses the word “Hi” in front of the values of column_1.
Lastly, we print out the results.

Conclusion

The glue package provides many powerful tools for manipulating data. The examples provided here only focus on one function. As such, this package is sure to provide useful ways in which data analyst can modify their data.

Regular Expressions with R VIDEO

Leave a reply

The video below will provide examples of using regular expressions with R. Regular expressions are a great tool for finding various information within strings.

R Markdown: Table of Contents

Leave a reply

A table of contents can be useful when having to navigate a large document. In the video below, you will learn how to create a table of contents when using R Markdown

R Markdown: Tables & More

Leave a reply

In the video below, we will learn more about R Markdown. In particular, we will learn how to utilize tables along with ways to format text.

R Markdown Tips VIDEO

Leave a reply

In the video below we will look at various ways to use R Markdown to report your results from RStudio.

Intro to R Markdown VIDEO

Leave a reply

R Markdown is a tool within RStudio that is beneficial for reporting results from an analysis that was done in RStudio. The video below provides several basics ways that R Markdown can be used in a document.

Linking Plots in Plotly with R ViDEO

Leave a reply

Linking plots involves allowing the action you take in one plot to affect another. Doing this can allow the user to uncover various patterns that may not be apparent otherwise. Using plotly, it is possible to link plots and this is shown in the video below.

Animation with Plotly in R

Leave a reply

It is possible to animate your visualizations in R using Plotly. The video below provides some tips for how to make animations a reality in your visualizations.

Making Scatterplots Using Plotly in R

Leave a reply

Making scatterplots is a part of the data analyst’s life. The video below shows how a scatterplot can be created using ploty in the R environment.

Making Bar Graphs with Plotly in R VIDEO

Leave a reply

The video below provides examples of how to make bar graphs using plotly in R.

Histograms Using Plotly in R

Leave a reply

Plotly is a library that allows you to make interactive charts in R. The example in the video below will focus on making histograms with this library.

Aggregating Data with dplyr VIDEO

Leave a reply

Aggregating data is a critical step in the data analysis process. In the video below, you will see an example of how to aggregate data using the dplyr package from R.

More Selecting and Transforming with dplyR

Leave a reply

In this post, we are going to learn some more advance ways to work with functions in the dplyr package. Let’s load our libraries

library(dplyr)
library(gapminder)

Our dataset is the gapminder dataset which provides information about countries and continents related to gdp, life expectancy, and population. Here is what the data looks like as a refresher.

glimpse(gapminder)

## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

select

You can use the colon symbol to select multiple columns at once. Doing this is a great way to save time when selecting variables.

gapminder%>%
        select(lifeExp:gdpPercap)

## # A tibble: 1,704 x 3
##    lifeExp      pop gdpPercap
##      <dbl>    <int>     <dbl>
##  1    28.8  8425333      779.
##  2    30.3  9240934      821.
##  3    32.0 10267083      853.
##  4    34.0 11537966      836.
##  5    36.1 13079460      740.
##  6    38.4 14880372      786.
##  7    39.9 12881816      978.
##  8    40.8 13867957      852.
##  9    41.7 16317921      649.
## 10    41.8 22227415      635.
## # … with 1,694 more rows

You can see that by using the colon we were able to select the last three columns.

There are also arguments called “select helpers.” Select helpers help you find columns in really large data sets. For example, let’s say we want columns that contain the string “life” in them. To find this we would use the contain argument as shown below.

gapminder%>%
        select(contains('life'))

## # A tibble: 1,704 x 1
##    lifeExp
##      <dbl>
##  1    28.8
##  2    30.3
##  3    32.0
##  4    34.0
##  5    36.1
##  6    38.4
##  7    39.9
##  8    40.8
##  9    41.7
## 10    41.8
## # … with 1,694 more rows

Only the column that contains the string life is selected. There are other help selectors that you can try on your own such as starts_with, ends_with and more.

To remove a variable from a dataset you simply need to put a minus sign in front of it as shown below.

gapminder %>%
        select(-lifeExp, -gdpPercap)

## # A tibble: 1,704 x 4
##    country     continent  year      pop
##    <fct>       <fct>     <int>    <int>
##  1 Afghanistan Asia       1952  8425333
##  2 Afghanistan Asia       1957  9240934
##  3 Afghanistan Asia       1962 10267083
##  4 Afghanistan Asia       1967 11537966
##  5 Afghanistan Asia       1972 13079460
##  6 Afghanistan Asia       1977 14880372
##  7 Afghanistan Asia       1982 12881816
##  8 Afghanistan Asia       1987 13867957
##  9 Afghanistan Asia       1992 16317921
## 10 Afghanistan Asia       1997 22227415
## # … with 1,694 more rows

In the output above you can see that life expectancy and per capa GDP are missing.

rename

Another function is the rename function which allows you to rename a variable. Below is an example in which the variable “pop” is renamed “population.”

gapminder %>%
        select(country, year, pop) %>%
        rename(population=pop)

## # A tibble: 1,704 x 3
##    country      year population
##    <fct>       <int>      <int>
##  1 Afghanistan  1952    8425333
##  2 Afghanistan  1957    9240934
##  3 Afghanistan  1962   10267083
##  4 Afghanistan  1967   11537966
##  5 Afghanistan  1972   13079460
##  6 Afghanistan  1977   14880372
##  7 Afghanistan  1982   12881816
##  8 Afghanistan  1987   13867957
##  9 Afghanistan  1992   16317921
## 10 Afghanistan  1997   22227415
## # … with 1,694 more rows

You can see that the “pop” variable has been renamed. Remember that the new name goes on the left of the equal sign while the old name goes on the right of the equal sign.

There is a shortcut to this and it involves renaming variables inside the select function. In the example below, we rename the pop variable population inside the select function.

gapminder %>%
        select(country, year, population=pop)

## # A tibble: 1,704 x 3
##    country      year population
##    <fct>       <int>      <int>
##  1 Afghanistan  1952    8425333
##  2 Afghanistan  1957    9240934
##  3 Afghanistan  1962   10267083
##  4 Afghanistan  1967   11537966
##  5 Afghanistan  1972   13079460
##  6 Afghanistan  1977   14880372
##  7 Afghanistan  1982   12881816
##  8 Afghanistan  1987   13867957
##  9 Afghanistan  1992   16317921
## 10 Afghanistan  1997   22227415
## # … with 1,694 more rows

transmute

The transmute function allows you to select and mutate variables at the same time. For example, let’s say that we want to know total gdp we could find this by multplying the population by gdp per capa. This is done with the transmute function below.

gapminder %>%
        transmute(country, year, total_gdp = pop * gdpPercap)

## # A tibble: 1,704 x 3
##    country      year    total_gdp
##    <fct>       <int>        <dbl>
##  1 Afghanistan  1952  6567086330.
##  2 Afghanistan  1957  7585448670.
##  3 Afghanistan  1962  8758855797.
##  4 Afghanistan  1967  9648014150.
##  5 Afghanistan  1972  9678553274.
##  6 Afghanistan  1977 11697659231.
##  7 Afghanistan  1982 12598563401.
##  8 Afghanistan  1987 11820990309.
##  9 Afghanistan  1992 10595901589.
## 10 Afghanistan  1997 14121995875.
## # … with 1,694 more rows

Conclusion

With these basic tools it is now a little easier to do some data analysis when using R. There is so much more than can be learned but this will have to wait for the future.

Transform Variables with dplyR VIDEO

Leave a reply

Handling data is a key skill for a data analyst. In the video below examples are provided of how to transform variables using functions in the dplyr package.

Data Aggregation with dplyr

Leave a reply

In this post, we will learn about data aggregation with the dplyr package. Data aggregation is primarily a tool for summarizing the information you have collected. Let’s start by loading our packages.

library(dplyr)
library(gapminder)

dplyr is for the data manipulation while gapminder provides us with the data. We will learn the following functions from the dplyr package

count()
summarize
group_by()
top_n()

count

The count function allows you to count the number of observations in your dataset as shown below.

gapminder %>%
        count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1704

The output tells us that there are over 1700 rows of data. However, the count function can do much more. For example, we can also count values in a specific column. Below, we calculated how many rows of data we have by continent.

gapminder %>%
        count(continent)

## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24

The output speaks for its self. There are two columns the left is continent and the right is how many times that particular continent appears in the dataset. You can also sort this data by adding the argument called sort as shown below.

gapminder %>%
        count(continent, sort =TRUE)

## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Asia        396
## 3 Europe      360
## 4 Americas    300
## 5 Oceania      24

There is another argument we can add and this is called the weight or wt argument. The wt argument adds up the values of the population in our example and we can now see how many respondents there were from each continent. Below is the code an example

gapminder %>% 
        count(continent, wt=pop, sort=TRUE)

## # A tibble: 5 x 2
##   continent           n
##   <fct>           <dbl>
## 1 Asia      30507333901
## 2 Americas   7351438499
## 3 Africa     6187585961
## 4 Europe     6181115304
## 5 Oceania     212992136

You can see that we now know how many people from each continent were in the dataset.

summarize

The summarize function takes many rows of data and reduce it to a single output. For example, if we want to know the total number of people in the dataset we could run the code below.

gapminder %>%
        summarize(total_pop=sum(pop))

## # A tibble: 1 x 1
##     total_pop
##         <dbl>
## 1 50440465801

You can also continue to add more and more things you want to know be separating them with a comma. In the code below, we add to it the average GDP.

gapminder %>%
        summarize(total_pop=sum(pop), average_gdp=mean(gdpPercap))

## # A tibble: 1 x 2
##     total_pop average_gdp
##         <dbl>       <dbl>
## 1 50440465801       7215.

group_by

The group by function allows you to aggregate data by groups. For example, if we want to know the total population and the average gdp by continent the code below would help to learn this.

gapminder %>%
        group_by(continent) %>%
        summarize(total_pop=sum(pop), mean_gdp=mean(gdpPercap)) %>%
        arrange(desc(total_pop))

## # A tibble: 5 x 3
##   continent   total_pop mean_gdp
##   <fct>           <dbl>    <dbl>
## 1 Asia      30507333901    7902.
## 2 Americas   7351438499    7136.
## 3 Africa     6187585961    2194.
## 4 Europe     6181115304   14469.
## 5 Oceania     212992136   18622.

It is also possible to group by more than one column. However, to do this we need to create another categorical variable. We are going to use mutate to create a categorical variable that breaks the data into two parts. Before 1980 and after 1980. Then we will group by country and whether the mean of the gdp was collected before or after 1980. Below is the code

gapminder %>%
        mutate(before_1980=if_else(year < 1980, "yes","no")) %>%
        group_by(country, before_1980) %>%
        summarize(mean_gdp=mean(gdpPercap))

## # A tibble: 284 x 3
## # Groups:   country [142]
##    country     before_1980 mean_gdp
##    <fct>       <chr>          <dbl>
##  1 Afghanistan no              803.
##  2 Afghanistan yes             803.
##  3 Albania     no             3934.
##  4 Albania     yes            2577.
##  5 Algeria     no             5460.
##  6 Algeria     yes            3392.
##  7 Angola      no             2944.
##  8 Angola      yes            4270.
##  9 Argentina   no             9998.
## 10 Argentina   yes            7913.
## # … with 274 more rows

top_n

The top_n function allows you to find the most extreme values when looking at groups. For example, we could find which countries has the highest life expectancy by continent. The answer is below

gapminder %>%
        group_by(continent) %>%
        top_n(1, lifeExp)

## # A tibble: 5 x 6
## # Groups:   continent [5]
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Australia Oceania    2007    81.2  20434176    34435.
## 2 Canada    Americas   2007    80.7  33390141    36319.
## 3 Iceland   Europe     2007    81.8    301931    36181.
## 4 Japan     Asia       2007    82.6 127467972    31656.
## 5 Reunion   Africa     2007    76.4    798094     7670.

As an example, Japan has the highest life expectancy in Asia. Canada has the highest life expectancy in the Americas. Naturally you are not limited to the top 1. This number can be changed to whatever you want. For example, below we change the number to 3.

gapminder %>%
        group_by(continent) %>%
        top_n(3, lifeExp)

## # A tibble: 15 x 6
## # Groups:   continent [5]
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Australia        Oceania    2002    80.4  19546792    30688.
##  2 Australia        Oceania    2007    81.2  20434176    34435.
##  3 Canada           Americas   2002    79.8  31902268    33329.
##  4 Canada           Americas   2007    80.7  33390141    36319.
##  5 Costa Rica       Americas   2007    78.8   4133884     9645.
##  6 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  7 Iceland          Europe     2007    81.8    301931    36181.
##  8 Japan            Asia       2002    82   127065841    28605.
##  9 Japan            Asia       2007    82.6 127467972    31656.
## 10 New Zealand      Oceania    2007    80.2   4115771    25185.
## 11 Reunion          Africa     1997    74.8    684810     6072.
## 12 Reunion          Africa     2002    75.7    743981     6316.
## 13 Reunion          Africa     2007    76.4    798094     7670.
## 14 Spain            Europe     2007    80.9  40448191    28821.
## 15 Switzerland      Europe     2007    81.7   7554661    37506.

Transform Data with dplyr

Leave a reply

In this post, we will be exposed to tools for wrangling and manipulating data in R.

Let’s begin by loading the libraries we will be using. We will use the dplyr package and the gapminder package. dplyr is for manipulating the data and gapminder provides the dataset.

library(dplyr)
library(gapminder)

You can look at the data briefly by using a function called “glimpse” as shown below.

glimpse(gapminder)

## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

You can see that we have six columns or variables and over 1700 rows of data. This data provides information about countries and various demographic statistics.

select()

The select function allows you to grab only the variables you want for analysis. This becomes exceptionally important when you have a large number of variables. In our next example, we will select 4 variables from the gapminder dataset. Below is the code to achieve this.

gapminder %>% 
        select(country,continent, pop, lifeExp)

## # A tibble: 1,704 x 4
##    country     continent      pop lifeExp
##    <fct>       <fct>        <int>   <dbl>
##  1 Afghanistan Asia       8425333    28.8
##  2 Afghanistan Asia       9240934    30.3
##  3 Afghanistan Asia      10267083    32.0
##  4 Afghanistan Asia      11537966    34.0
##  5 Afghanistan Asia      13079460    36.1
##  6 Afghanistan Asia      14880372    38.4
##  7 Afghanistan Asia      12881816    39.9
##  8 Afghanistan Asia      13867957    40.8
##  9 Afghanistan Asia      16317921    41.7
## 10 Afghanistan Asia      22227415    41.8
## # … with 1,694 more rows

The strange symbol %>% is called a “pipe” and allows you to continuously build your code. You can also save this information by assigning a name to an object like any other variable in r.

country_data<-gapminder %>% 
        select(country,continent, pop, lifeExp)

arrange

The arrange verb sorts your data based on one or more variables. For example, let’s say we want to know which country has the highest population. The code below provides the answer.

country_data %>%
        arrange(pop)

## # A tibble: 1,704 x 4
##    country               continent   pop lifeExp
##    <fct>                 <fct>     <int>   <dbl>
##  1 Sao Tome and Principe Africa    60011    46.5
##  2 Sao Tome and Principe Africa    61325    48.9
##  3 Djibouti              Africa    63149    34.8
##  4 Sao Tome and Principe Africa    65345    51.9
##  5 Sao Tome and Principe Africa    70787    54.4
##  6 Djibouti              Africa    71851    37.3
##  7 Sao Tome and Principe Africa    76595    56.5
##  8 Sao Tome and Principe Africa    86796    58.6
##  9 Djibouti              Africa    89898    39.7
## 10 Sao Tome and Principe Africa    98593    60.4
## # … with 1,694 more rows

To complete this task we had to use the arrange function and place the name of the variable we want to sort by inside the parentheses. However, this is not exactly what we want. What we have found is the countries with the smallest population. To sort from largest to smallest you must use the desc function as well and this is shown below.

country_data %>%
        arrange(desc(pop))

## # A tibble: 1,704 x 4
##    country continent        pop lifeExp
##    <fct>   <fct>          <int>   <dbl>
##  1 China   Asia      1318683096    73.0
##  2 China   Asia      1280400000    72.0
##  3 China   Asia      1230075000    70.4
##  4 China   Asia      1164970000    68.7
##  5 India   Asia      1110396331    64.7
##  6 China   Asia      1084035000    67.3
##  7 India   Asia      1034172547    62.9
##  8 China   Asia      1000281000    65.5
##  9 India   Asia       959000000    61.8
## 10 China   Asia       943455000    64.0
## # … with 1,694 more rows

Now, this is what we want. China claims several of the top spots. The reason a country is on the list more than once is that the data was collected several different years.

filter

The filter function is used to obtain only specific values that meet the criteria. For example, what if we want to know the population of only India in descending order. Below is the code for how to do this.

country_data %>%
        arrange(desc(pop)) %>%
        filter(country=='India')

## # A tibble: 12 x 4
##    country continent        pop lifeExp
##    <fct>   <fct>          <int>   <dbl>
##  1 India   Asia      1110396331    64.7
##  2 India   Asia      1034172547    62.9
##  3 India   Asia       959000000    61.8
##  4 India   Asia       872000000    60.2
##  5 India   Asia       788000000    58.6
##  6 India   Asia       708000000    56.6
##  7 India   Asia       634000000    54.2
##  8 India   Asia       567000000    50.7
##  9 India   Asia       506000000    47.2
## 10 India   Asia       454000000    43.6
## 11 India   Asia       409000000    40.2
## 12 India   Asia       372000000    37.4

Now we have only data that relates to India. All we did was include one more pipe and the filter function. We had to tell R which country by placing the information above in the parentheses.

filter is not limited to text searches. You can also search based on numerical values. For example, what if we only want countries with a life expectancy of 81 or higher

country_data %>%
        arrange(desc(pop)) %>%
        filter(lifeExp >= 81)

## # A tibble: 7 x 4
##   country          continent       pop lifeExp
##   <fct>            <fct>         <int>   <dbl>
## 1 Japan            Asia      127467972    82.6
## 2 Japan            Asia      127065841    82  
## 3 Australia        Oceania    20434176    81.2
## 4 Switzerland      Europe      7554661    81.7
## 5 Hong Kong, China Asia        6980412    82.2
## 6 Hong Kong, China Asia        6762476    81.5
## 7 Iceland          Europe       301931    81.8

You can see the results for yourself. It is also possible to combine multiply conditions for whatever functions are involved. For example, if I want to arrange my data by population and country while also filtering it by a population greater than 100,000,000,000 and with a life expectancy of less than 45. This is shown below

country_data %>%
        arrange(desc(pop, country)) %>%
        filter(pop>100000000, lifeExp<45)

## # A tibble: 5 x 4
##   country continent       pop lifeExp
##   <fct>   <fct>         <int>   <dbl>
## 1 China   Asia      665770000    44.5
## 2 China   Asia      556263527    44  
## 3 India   Asia      454000000    43.6
## 4 India   Asia      409000000    40.2
## 5 India   Asia      372000000    37.4

mutate

The mutate function is for manipulating variables and creating new ones. For example, the gdpPercap variable is highly skewed. We can create a variable of gdpercap that is the log of this variable. Using the log will help the data to assume the characteristics of a normal distribution. Below is the code for this.

gapminder %>% 
        select(country,continent, pop, gdpPercap) %>%
        mutate(log_gdp=log(gdpPercap))

## # A tibble: 1,704 x 5
##    country     continent      pop gdpPercap log_gdp
##    <fct>       <fct>        <int>     <dbl>   <dbl>
##  1 Afghanistan Asia       8425333      779.    6.66
##  2 Afghanistan Asia       9240934      821.    6.71
##  3 Afghanistan Asia      10267083      853.    6.75
##  4 Afghanistan Asia      11537966      836.    6.73
##  5 Afghanistan Asia      13079460      740.    6.61
##  6 Afghanistan Asia      14880372      786.    6.67
##  7 Afghanistan Asia      12881816      978.    6.89
##  8 Afghanistan Asia      13867957      852.    6.75
##  9 Afghanistan Asia      16317921      649.    6.48
## 10 Afghanistan Asia      22227415      635.    6.45
## # … with 1,694 more rows

In the code above we had to select our variables again and then we create the new variable “log_gdp”. This new variable appears all the way to the right in the dataset. Naturally, we can extend our code by using our new variable in other functions as shown below.

Conclusion

This post was longer than normal but several practical things were learned. You now know some basic techniques for wrangling data using the dplyr package in R.

Principal Component Analysis with Python VIDEO

Leave a reply

Principal component analysis is a tool for reducing the number of variables in a dataset without losing too much information. This is a great way to summarize information or to simplify things for a more complex analysis. The video provides a simple example of how to do this.

T-SNE Visualization and R

Leave a reply

It is common in research to want to visualize data in order to search for patterns. When the number of features increases, this can often become even more important. Common tools for visualizing numerous features include principal component analysis and linear discriminant analysis. Not only do these tools work for visualization they can also be beneficial in dimension reduction.

However, the available tools for us are not limited to these two options. Another option for achieving either of these goals is t-Distributed Stochastic Embedding. This relative young algorithm (2008) is the focus of the post. We will explain what it is and provide an example using a simple dataset from the Ecdat package in R.

t-sne Defined

t-sne is a nonlinear dimension reduction visualization tool. Essentially what it does is identify observed clusters. However, it is not a clustering algorithm because it reduces the dimensions (normally to 2) for visualizing. This means that the input features are not longer present in their original form and this limits the ability to make inference. Therefore, t-sne is often used for exploratory purposes.

T-sne non-linear characteristic is what makes it often appear to be superior to PCA, which is linear. Without getting too technical t-sne takes simultaneously a global and local approach to mapping points while PCA can only use a global approach.

The downside to t-sne approach is that it requires a large amount of calculations. The calculations are often pairwise comparisons which can grow exponential in large datasets.

Initial Packages

We will use the “Rtsne” package for the analysis, and we will use the “Fair” dataset from the “Ecdat” package. The “Fair” dataset is data collected from people who had cheated on their spouse. We want to see if we can find patterns among the unfaithful people based on their occupation. Below is some initial code.

library(Rtsne)
library(Ecdat)

Dataset Preparation

To prepare the data, we first remove in rows with missing data using the “na.omit” function. This is saved in a new object called “train”. Next, we change or outcome variable into a factor variable. The categories range from 1 to 9

Farm laborer, day laborer,
Unskilled worker, service worker,
Machine operator, semiskilled worker,
Skilled manual worker, craftsman, police,
Clerical/sales, small farm owner,
Technician, semiprofessional, supervisor,
Small business owner, farm owner, teacher,
Mid-level manager or professional,
Senior manager or professional.

Below is the code.

train<-na.omit(Fair)
train$occupation<-as.factor(train$occupation)

Visualization Preparation

Before we do the analysis we need to set the colors for the different categories. This is done with the code below.

colors<-rainbow(length(unique(train$occupation)))
names(colors)<-unique(train$occupation)

We can now do are analysis. We will use the “Rtsne” function. When you input the dataset you must exclude the dependent variable as well as any other factor variables. You also set the dimensions and the perplexity. Perplexity determines how many neighbors are used to determine the location of the datapoint after the calculations. Verbose just provides information during the calculation. This is useful if you want to know what progress is being made. max_iter is the number of iterations to take to complete the analysis and check_duplicates checks for duplicates which could be a problem in the analysis. Below is the code.

tsne<-Rtsne(train[,-c(1,4,7)],dims=2,perplexity=30,verbose=T,max_iter=1500,check_duplicates=F)

## Performing PCA
## Read the 601 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.05 seconds (sparsity = 0.190597)!
## Learning embedding...
## Iteration 1450: error is 0.280471 (50 iterations in 0.07 seconds)
## Iteration 1500: error is 0.279962 (50 iterations in 0.07 seconds)
## Fitting performed in 2.21 seconds.

Below is the code for making the visual.

plot(tsne$Y,t='n',main='tsne',xlim=c(-30,30),ylim=c(-30,30))
text(tsne$Y,labels=train$occupation,col = colors[train$occupation])
legend(25,5,legend=unique(train$occupation),col = colors,,pch=c(1))

You can see that there are clusters however, the clusters are all mixed with the different occupations. What this indicates is that the features we used to make the two dimensions do not discriminant between the different occupations.

Conclusion

T-SNE is an improved way to visualize data. This is not to say that there is no place for PCA anymore. Rather, this newer approach provides a different way of quickly visualizing complex data without the limitations of PCA.

Checkout ERT online courses

Quadratic Discriminant Analysis with Python

Leave a reply

Quadratic discriminant analysis allows for the classifier to assess non -linear relationships. This of course something that linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a qda analysis using Python. The steps that will be conducted are as follows

Data preparation
Model training
Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data “Wages1” it comes from the “pydataset” library. After loading the data, we will use the .head() method to look at it briefly.

We need to transform the variable ‘sex’, our dependent variable, into a dummy variable using numbers instead of text. We will use the .getdummies() method to make the dummy variables and then add them to the dataset using the .concat() method. The code for this is below.

In the code below we have the histogram for the continuous independent variables. We are using the .distplot() method from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)
sns.set(font_scale=1.4)
sns.distplot(df['exper'],color='black',ax=axs[0])
sns.distplot(df['school'],color='black',ax=axs[1])
sns.distplot(df['wage'],color='black',ax=axs[2])

The variables look reasonable normal. Below is the proportions of the categorical dependent variable.

round(df.groupby('sex').count()/3294,2)
Out[247]: 
exper school wage female male
sex 
female 0.48 0.48 0.48 0.48 0.48
male 0.52 0.52 0.52 0.52 0.52

About half male and half female.

We will now make the correlational matrix

corrmat=df.corr(method='pearson')
f,ax=plt.subplots(figsize=(12,12))
sns.set(font_scale=1.2)
sns.heatmap(round(corrmat,2),
vmax=1.,square=True,
cmap="gist_gray",annot=True)

There appears to be no major problems with correlations. The last thing we will do is set up our train and test datasets.

X=df[['exper','school','wage']]
y=df['male']
X_train,X_test,y_train,y_test=train_test_split(X,y,
test_size=.2, random_state=50)

We can now move to model development

Model Development

To create our model we will instantiate an instance of the quadratic discriminant analysis function and use the .fit() method.

qda_model=QDA()
qda_model.fit(X_train,y_train)

There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means Below are the group means.

	exper	school	wage
Female	7.73	11.84	5.14
Male	8.28	11.49	6.38

You can see from the table that mean generally have more experience, higher wages, but slightly less education.

We will now use the qda_model we create to predict the classifications for the training set. This information will be used to make a confusion matrix.

cm = confusion_matrix(y_train, y_pred)
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
cmap=ListedColormap(['gray']), linewidths=2.5)

The information in the upper-left corner are the number of people who were female and correctly classified as female. The lower-right corner is for the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as females. Below is the actually accuracy of our model

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

 y_pred=qda_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
from matplotlib.colors import ListedColormap
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
sns.heatmap(cm, cbar=False, square=True,annot=True,fmt='g',
cmap=ListedColormap(['gray']),linewidths=2.5)

The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

About the same, our model generalizes even though it performs somewhat poorly.

Conclusion

This post provided an explanation of how to do a quadratic discriminant analysis using python. This is just another potential tool that may be useful for the data scientist.

Data Exploration Case Study: Credit Default

Leave a reply

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset is looking at how is likely to default on their credit. The following steps will be conducted in this analysis.

Load the libraries and dataset
Deal with missing data
Some descriptive stats
Normality check
Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long

df_train.columns
df_train.head()

Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
 
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function if you want to see how many variables are left.

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe())
Out[8]: 
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT']

round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]: 
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL']

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variable’s

df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)

df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)

df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)

df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)

df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)

df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)

df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)

df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)

df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR,''Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])

There is a clear outlier there. Below is another boxplot with a different variable

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)

The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005
FLAG_MOBIL FLAG_DOCUMENT_12 0.000005
Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005

The list is to long to show here but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)

This is not normal

sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)

This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make are X and y dataset.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']

The code below fits are model and makes the predictions

clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0                
0       280873  18493
1         1813   6332
accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy of the training set. It actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.

Conclusion

Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this but the simplistic nature of this analysis yielded strong results

RANSAC Regression in Python

Leave a reply

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

The algorithm randomly selects a random amount of samples to be inliers in the model.
All data is used to fit the model and samples that fall with a certain tolerance are relabeled as inliers.
Model is refitted with the new inliers
Error of the fitted model vs the inliers is calculated
Terminate or go back to step 1 if a certain criterion of iterations or performance is not met.

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
Model 2 will use multiple regression and includes several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

Data preparation
Simple Regression Model fit
Simple regression visualization
Multiple regression model fit
Multiple regression visualization

Below are the packages we will need for this example

import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preparation

For the data preparation, we need to do the following

Load the data
Create X and y dataframes
Convert several categorical variables to dummy variables
Drop the original categorical variables from the X dataframe

Below is the code for these steps

df=data('tips')
X,y=df[['total_bill','sex','size','smoker','time']],df['tip']
male=pd.get_dummies(X['sex'])
X['male']=male['Male']
smoker=pd.get_dummies(X['smoker'])
X['smoker']=smoker['Yes']
dinner=pd.get_dummies(X['time'])
X['dinner']=dinner['Dinner']
X=X.drop(['sex','time'],1)

Most of this is self-explanatory, we first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we converted the sex, smoker, and dinner variables into dummy variables, and then we dropped the original categorical variables.

We can now move to fitting the first model that uses simple regression.

Simple Regression Model

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

Instantiate an instance of the RANSACRegressor. We the call LinearRegression function, and we also set the residual_threshold to 2 indicate how far an example has to be away from 2 units away from the line.
Next we fit the model
We predict the values
We calculate the r square the mean absolute error

Below is the code for all of this.

ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg1.fit(X[['total_bill']],y)
prediction1=ransacReg1.predict(X[['total_bill']])

r2_score(y,prediction1)
Out[150]: 0.4381748268686979

mean_absolute_error(y,prediction1)
Out[151]: 0.7552429811944833

The r-square is 44% while the MAE is 0.75. These values are most comparative and will be looked at again when we create the multiple regression model.

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression. It also identifies which samples are inliers and outliers. Te coding will not be explained because of the complexity of it.

inlier=ransacReg1.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(3,51,2)
line_y=ransacReg1.predict(line_X[:,np.newaxis])
plt.scatter(X[['total_bill']][inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(X[['total_bill']][outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')

Plot is self-explanatory as a handful of samples were considered outliers. We will now move to creating our multiple regression model.

Multiple Regression Model Development

The steps for making the model are mostly the same. The real difference takes place in make the plot which we will discuss in a moment. Below is the code for developing the model.

ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg2.fit(X,y)
prediction2=ransacReg2.predict(X)

r2_score(y,prediction2)
Out[154]: 0.4298703800652126

mean_absolute_error(y,prediction2)
Out[155]: 0.7649733201032204

Things have actually gotten slightly worst in terms of r-square and MAE.

For the visualization, we cannot plot directly several variables t once. Therefore, we will compare the predicted values with the actual values. The better the correlated the better our prediction is. Below is the code for the visualization

inlier=ransacReg2.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(1,8,1)
line_y=(line_X[:,np.newaxis])
plt.scatter(prediction2[inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(prediction2[outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')

The plots are mostly the same as you cans see for yourself.

Conclusion

This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively this is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be use din machine learning.

Data Science Pipeline

Leave a reply

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

What to do
When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know how to do it.

In this post, we will look at a process that that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat, this process is focused primarily on supervised learning which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

Data preparation (data mugging)
Model training
Model testing

Off course, it is much more complicated than this but this is the minimum. Within each of these steps there are several substeps, However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary but general not all of them happen ever analysis. Below is a list of steps at this level

Data mugging
Scaling
Normality
Dimension reduction/feature extraction/feature selection
Train, test, validation split

Data mugging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization reduces the variables to a range of 0 – 1. Standardization involves converting the examples in the variable to their respective z-score. Which one you use depends on the situation but normally it is expected to do this.

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved checking for normality is doable with a few tests and some visualizations. If normality is violated various transformations can be used to deal with this problem. Keep mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reduce the number of variables that will be included in the final analysis. This is done through factor analysis or principal component analysis. This reduction in the number of variables is also an example of feature extraction. In some context, feature extraction is the in goal in itself. Some algorithms make their own features such as neural networks through the use of hidden layer(s)

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization such or in smaller datasets with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analyst split the data into three parts to avoid overfitting the data to the test set. There is now a training a set, a validation set, and a testing set. The validation set allows you to check the model performance several times. Once you are satisfied you use the test set once at the end.

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

Determine the metric(s) for success
Creating a grid of several hyperparameter values
Cross-validation
Selection of the most appropriate hyperparameter values

The first thing you have to do and this is probably required is determined how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification or mean squared error for a regression model or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameters settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Using cross-validation always you to stabilize the results through averaging the results of the model over several folds of the data if you are using k-folds cross-validation. This also helps to improve the results of the hyperparameters as well. There are several types of cross-validation but k-folds is probably best initially.

The information for the metric, hyperparameters, and cross-validation are usually put into a grid that then runs the model. Whether you are using R or Python the printout will tell you which combination of hyperparameters is the best based on the metric you determined.

Validation test

When you know what your hyperparameters are you can now move your model to validation or straight to testing. If you are using a validation set you asses your models performance by using this new data. If the results are satisfying based on your metric you can move to testing. If not, you may move back and forth between training and the validation set making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.

Conclusion

This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture for a somewhat complex process

Gradient Boosting Classification in Python

Leave a reply

Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than adaboost. Some differences between the two algorithms is that gradient boosting uses optimization for weight the estimators. Like adaboost, gradient boosting can be used for most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number the purer the classification become. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up until the whole number 1. If the value is set to 1 it becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

Data preparation
Baseline decision tree model
Hyperparameter tuning
Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier
 from sklearn import tree
 from sklearn.model_selection import GridSearchCV
 import numpy as np
 from pydataset import data
 import pandas as pd
 from sklearn.model_selection import cross_val_score
 from sklearn.model_selection import KFold

Data Preparation

The data preparation is simple in this situtation. All we need to do is load are dataset, dropping missing values, and create our X dataset and y dataset. All this happens in the code below.

df=data('cancer').dropna()
 X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
 y=df['status']

We will now develop our baseline decision tree model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
     tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
     if tree_classifier.fit(X,y).tree_.max_depth<depth:
         break
     score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
     print(depth, score)
 1 0.71875
 2 0.6477941176470589
 3 0.6768382352941177
 4 0.6698529411764707
 5 0.6584558823529412
 6 0.6525735294117647
 7 0.6283088235294118
 8 0.6573529411764706
 9 0.6577205882352941

It appears that when the max depth is limited to 1 that we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm

Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are as follows

number of estimators
learning rate
subsample
max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values from mad earlier, and n_jobs all together in one place. Below is the code for this.

GBC=GradientBoostingClassifier()
 search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then as second run through is done to find the learning rate. In our example, we are doing everything at once which is why it takes longer. Below is the code with the out for best parameters and best score.

search.fit(X,y)
search.best_params_
 Out[11]: 
 {'learning_rate': 0.01,
  'max_depth': 5,
  'n_estimators': 2000,
  'random_state': 1,
  'subsample': 0.75}
 search.best_score_
 Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.

Gradient Boosting Model

Below is the code and results for the model with the predetermined hyperparameter values.

ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
 score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
 score
 Out[17]: 0.742279411764706

You can see that the results are similar. This is just additional information that the gradient boosting model does outperform the baseline decision tree model.

Conclusion

This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python

1 Reply

This post will share how to use the adaBoost algorithm for regression in Python. What boosting does is that it makes multiple models in a sequential manner. Each newer model tries to successful predict what older models struggled with. For regression, the average of the models are used for the predictions. It is often most common to use boosting with decision trees but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting, is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

Data preparation
Regression decision tree baseline model
Hyperparameter tuning of Adaboost regression model
AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor
 from sklearn import tree
 from sklearn.model_selection import GridSearchCV
 import numpy as np
 from pydataset import data
 import pandas as pd
 from sklearn.model_selection import cross_val_score
 from sklearn.model_selection import train_test_split
 from sklearn.model_selection import KFold
 from sklearn.metrics import mean_squared_error

Data Preparation

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.

df=data('cancer').dropna()
 X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
 y=df['wt.loss']

We will now proceed to creating the baseline regression decision tree model.

Baseline Regression Tree Model

The purpose of the baseline model is to compare it to the performance of our model that utilizes adaBoost. To make this model we need to Initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop to create several trees that vary based on their depth. By depth, it is meant how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
     tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
     if tree_regressor.fit(X,y).tree_.max_depth<depth:
         break
     score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
     print(depth, score)
 1 -193.55304528235052
 2 -176.27520747356175
 3 -209.2846723461564
 4 -218.80238479654003
 5 -222.4393459885871
 6 -249.95330609042858
 7 -286.76842138165705
 8 -294.0290706405905
 9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

Hyperparameter Tuning

For hyperparameter tuning we need to start by initiating our AdaBoostRegresor() class. Then we need to create our grid. The grid will address two hyperparameters which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning indicates how each tree contributes to the overall results. There is one more parameter which is random_state, but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function, you have to set the estimator, which is adaBoostRegressor, the parameter grid which we just made, the cross-validation which we made when we created the baseline model, and the n_jobs, which allocates resources for the calculation. Below is the code.

ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.

search.fit(X,y)
search.best_params_
 Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
 search.best_score_
 Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with refined hyperparameters.

AdaBoost Regression Model

Below is our model, but this time with the refined hyperparameters.

ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
 score
 Out[36]: -174.52604137201791

You can see the score is not as good but it is within reason.

Conclusion

In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help to strengthen a model in many ways at times.

Web Scraping with R

1 Reply

In this post we are going to learn how to do web scrapping with R.Web scraping is a process for extracting data from a website. We have all done web scraping before. For example, whenever you copy and paste something from a website into another document such as Word this is an example of web scraping. Technically, this is an example of manual web scraping. The main problem with manual web scraping is that it is labor intensive and takes a great deal of time.

Another problem with web scraping is that the data can come in an unstructured manner. This means that you have to organize it in some way in order to conduct a meaningful analysis. This also means that you must have a clear purpose for what you are scraping along with answerable questions. Otherwise, it is easy to become confused quickly when web scraping

Therefore, we will learn how to automate this process using R. We will need the help of the “rest” and “xml2” packages to do this. Below is some initial code

library(rvest);library(xml2)

For our example, we are going to scrape the titles and prices of books from a webpage on Amazon. When simply want to make an organized data frame. The first thing we need to do is load the URL into R and have R read the website using the “read_html” function. The code is below.

url<-'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books'
webpage<-read_html(url)

We now need to specifically harvest the titles from the webpage as this is one of our goals. There are at least two ways to do this. If you are an expert in HTML you can find the information by inspecting the page’s HTML. Another way is to the selectorGadget extension available in Chrome. When using this extension you simply click on the information you want to inspect and it gives you the CSS selector for that particular element. This is shown below

The green highlight is the CSS selector that you clicked on. The yellow represents all other elements that have the same CSS selector. The red represents what you do not want to be included. In this picture, I do not want the price because I want to scrape this separately.

Once you find your information you want to copy the CSS element information in the bar at the bottom of the picture. This information is then pasted into R and use the “html_nodes” function to pull this specific information from the webpage.

bookTitle<- html_nodes(webpage,'.a-link-normal .a-size-base')

We now need to convert this information to text and we are done.

title <- html_text(bookTitle, trim = TRUE)

Next, we repeat this process for the price.

bookPrice<- html_nodes(webpage,'.acs_product-price__buying')
price <- html_text(bookPrice, trim = TRUE)

Lastly, we make our data frame with all of our information.

books<-as.data.frame(title)
books$price<-price

With this done we can do basic statistical analysis such as the mean, standard deviation, histogram, etc. This was not a complex example but the basics of pulling data was provided. Below is what the first few entries of the data frame look like.

head(books)
##                                   title  price
## 1                          Silent Child $17.95
## 2 Say You're Sorry (Morgan Dane Book 1)  $4.99
## 3                     A Wrinkle in Time $19.95
## 4                       The Whiskey Sea  $3.99
## 5            Speaking in Bones: A Novel  $2.99
## 6 Harry Potter and the Sorcerer's Stone  $8.99

Conclusion

Web scraping using automated tools saves time and increases the possibilities of data analysis. The most important thing to remember is to understand what exactly it is you want to know. Otherwise, you will quickly become lost due to the overwhelming amounts of available information.

aggregate Function in R VIDEO

Leave a reply

Using the aggregate function in R.

Creating Subgroups of Data in R VIDEO

Leave a reply

Create subgroups in R

Subsetting Data in R VIDEO

Leave a reply

Subsetting data in R

Getting Data Out of R Video

Leave a reply

Getting data out of R

Importing Data into R VIDEO

Leave a reply

Importing data into R

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

select

rename

transmute

Share this:

Like this:

Share this:

Like this:

count

summarize

group_by

top_n

Share this:

Like this:

select()

arrange

filter

mutate

Share this:

Like this:

Share this:

Like this:

t-sne Defined

Initial Packages

Dataset Preparation

Visualization Preparation

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this: