Data Munging with Dplyr

Data preparation aka data munging is what most data scientist spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr”” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base r for our analysis. Below is some initial code.

library(dplyr)
data("attitude")
str(attitude)
## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

You can see we have seven variables and only 30 observations. Our first function that we will learn to use is the “select” function. This function allows you to select columns of data you want to use. In order to use this feature, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select”” function.

names(attitude)[1:3]
## [1] "rating"     "complaints" "privileges"
smallset<-select(attitude,rating:privileges)
head(smallset)
##   rating complaints privileges
## 1     43         51         30
## 2     63         64         51
## 3     71         70         68
## 4     61         63         45
## 5     81         78         56
## 6     43         55         49

The difference is probably obvious. Using the “select” function we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative in front of the names of the columns. Below is the code

head(select(attitude,-(rating:privileges)))
##   learning raises critical advance
## 1       39     61       92      45
## 2       54     63       73      47
## 3       69     76       86      48
## 4       47     54       84      35
## 5       66     71       83      47
## 6       44     54       49      34

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of the “rating” to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right

attitude<-rename(attitude,rates=rating)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34

The “select”” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

s_set<-head(select(attitude,ends_with("s")))
s_set
##   rates complaints privileges raises
## 1    43         51         30     61
## 2    63         64         51     63
## 3    71         70         68     76
## 4    61         63         45     54
## 5    81         78         56     71
## 6    43         55         49     54

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a 75 or higher in the “raises” variable.

bigraise<-filter(attitude,raises>75)
bigraise
##   rates complaints privileges learning raises critical advance
## 1    71         70         68       69     76       86      48
## 2    77         77         54       72     79       77      46
## 3    74         85         64       69     79       79      63
## 4    66         77         66       63     88       76      72
## 5    78         75         58       74     80       78      49
## 6    85         85         71       71     77       74      55

If you look closely all values in the “raise” column are greater than 75. Of course, you can have more than one criteria. IN the code below there are two.

filter(attitude, raises>70 & learning<67)
##   rates complaints privileges learning raises critical advance
## 1    81         78         56       66     71       83      47
## 2    65         70         46       57     75       85      46
## 3    66         77         66       63     88       76      72

The “arrange” function allows you to sort the order of the rows. In the code below we first sort the data ascending by the “critical” variable. Then we sort it descendingly by adding the “desc” function.

ascCritical<-arrange(attitude, critical)
head(ascCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         55         49       44     54       49      34
## 2    81         90         50       72     60       54      36
## 3    40         37         42       58     50       57      49
## 4    69         62         57       42     55       63      25
## 5    50         40         33       34     43       64      33
## 6    71         75         50       55     70       66      41
descCritical<-arrange(attitude, desc(critical))
head(descCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    71         70         68       69     76       86      48
## 3    65         70         46       57     75       85      46
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    72         82         72       67     71       83      31

The “mutate” function is useful for engineering features. In the code below we will transform the “learning” variable by subtracting its mean from its self

attitude<-mutate(attitude,learningtrend=learning-mean(learning))
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend
## 1    -17.366667
## 2     -2.366667
## 3     12.633333
## 4     -9.366667
## 5      9.633333
## 6    -12.366667

You can also create logical variables with the “mutate” function.In the code below, we create a logical variable that is true when the “critical” variable” is higher than 80 and false when “critical”” is less than 80. The new variable is called “highCritical”

attitude<-mutate(attitude,highCritical=critical>=80)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend highCritical
## 1    -17.366667         TRUE
## 2     -2.366667        FALSE
## 3     12.633333         TRUE
## 4     -9.366667         TRUE
## 5      9.633333         TRUE
## 6    -12.366667        FALSE

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in R. This function works in combination with the “summarize” function for our purposes here. We will group our data by the “highCritical” variable. This means our data will be viewed as either TRUE for “highCritical” or FALSE. The results of this function will be saved in an object called “hcgroups”

hcgroups<-group_by(attitude,highCritical)
head(hcgroups)
## # A tibble: 6 x 9
## # Groups:   highCritical [2]
##   rates complaints privileges learning raises critical advance
##                            
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
## # ... with 2 more variables: learningtrend , highCritical 

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics by using the “summarize” function. We specifically want to know the mean of the “complaint” variable based on the variable “highCritical.” Below is the code

summarize(hcgroups,complaintsAve=mean(complaints))
## # A tibble: 2 x 2
##   highCritical complaintsAve
##                   
## 1        FALSE      67.31579
## 2         TRUE      65.36364

Of course, you could have learned this through doing a t.test but this is another approach.

Conclusion

The “dplyr” package is one powerful tool for wrestling with data. There is nothing new in this package. Instead, the coding is simpler than what you can excute using base r.

1 thought on “Data Munging with Dplyr

Leave a Reply