brown wooden floor

Transform Data with dplyr

In this post, we will be exposed to tools for wrangling and manipulating data in R.

Let’s begin by loading the libraries we will be using. We will use the dplyr package and the gapminder package. dplyr is for manipulating the data and gapminder provides the dataset.

library(dplyr)
library(gapminder)

You can look at the data briefly by using a function called “glimpse” as shown below.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

You can see that we have six columns or variables and over 1700 rows of data. This data provides information about countries and various demographic statistics.

select()

The select function allows you to grab only the variables you want for analysis. This becomes exceptionally important when you have a large number of variables. In our next example, we will select 4 variables from the gapminder dataset. Below is the code to achieve this.

gapminder %>% 
        select(country,continent, pop, lifeExp)
## # A tibble: 1,704 x 4
##    country     continent      pop lifeExp
##    <fct>       <fct>        <int>   <dbl>
##  1 Afghanistan Asia       8425333    28.8
##  2 Afghanistan Asia       9240934    30.3
##  3 Afghanistan Asia      10267083    32.0
##  4 Afghanistan Asia      11537966    34.0
##  5 Afghanistan Asia      13079460    36.1
##  6 Afghanistan Asia      14880372    38.4
##  7 Afghanistan Asia      12881816    39.9
##  8 Afghanistan Asia      13867957    40.8
##  9 Afghanistan Asia      16317921    41.7
## 10 Afghanistan Asia      22227415    41.8
## # … with 1,694 more rows

The strange symbol %>% is called a “pipe” and allows you to continuously build your code. You can also save this information by assigning a name to an object like any other variable in r.

country_data<-gapminder %>% 
        select(country,continent, pop, lifeExp)

arrange

The arrange verb sorts your data based on one or more variables. For example, let’s say we want to know which country has the highest population. The code below provides the answer.

country_data %>%
        arrange(pop)
## # A tibble: 1,704 x 4
##    country               continent   pop lifeExp
##    <fct>                 <fct>     <int>   <dbl>
##  1 Sao Tome and Principe Africa    60011    46.5
##  2 Sao Tome and Principe Africa    61325    48.9
##  3 Djibouti              Africa    63149    34.8
##  4 Sao Tome and Principe Africa    65345    51.9
##  5 Sao Tome and Principe Africa    70787    54.4
##  6 Djibouti              Africa    71851    37.3
##  7 Sao Tome and Principe Africa    76595    56.5
##  8 Sao Tome and Principe Africa    86796    58.6
##  9 Djibouti              Africa    89898    39.7
## 10 Sao Tome and Principe Africa    98593    60.4
## # … with 1,694 more rows

To complete this task we had to use the arrange function and place the name of the variable we want to sort by inside the parentheses. However, this is not exactly what we want. What we have found is the countries with the smallest population. To sort from largest to smallest you must use the desc function as well and this is shown below.

country_data %>%
        arrange(desc(pop))
## # A tibble: 1,704 x 4
##    country continent        pop lifeExp
##    <fct>   <fct>          <int>   <dbl>
##  1 China   Asia      1318683096    73.0
##  2 China   Asia      1280400000    72.0
##  3 China   Asia      1230075000    70.4
##  4 China   Asia      1164970000    68.7
##  5 India   Asia      1110396331    64.7
##  6 China   Asia      1084035000    67.3
##  7 India   Asia      1034172547    62.9
##  8 China   Asia      1000281000    65.5
##  9 India   Asia       959000000    61.8
## 10 China   Asia       943455000    64.0
## # … with 1,694 more rows

Now, this is what we want. China claims several of the top spots. The reason a country is on the list more than once is that the data was collected several different years.

filter

The filter function is used to obtain only specific values that meet the criteria. For example, what if we want to know the population of only India in descending order. Below is the code for how to do this.

country_data %>%
        arrange(desc(pop)) %>%
        filter(country=='India')        
## # A tibble: 12 x 4
##    country continent        pop lifeExp
##    <fct>   <fct>          <int>   <dbl>
##  1 India   Asia      1110396331    64.7
##  2 India   Asia      1034172547    62.9
##  3 India   Asia       959000000    61.8
##  4 India   Asia       872000000    60.2
##  5 India   Asia       788000000    58.6
##  6 India   Asia       708000000    56.6
##  7 India   Asia       634000000    54.2
##  8 India   Asia       567000000    50.7
##  9 India   Asia       506000000    47.2
## 10 India   Asia       454000000    43.6
## 11 India   Asia       409000000    40.2
## 12 India   Asia       372000000    37.4

Now we have only data that relates to India. All we did was include one more pipe and the filter function. We had to tell R which country by placing the information above in the parentheses.

filter is not limited to text searches. You can also search based on numerical values. For example, what if we only want countries with a life expectancy of 81 or higher

country_data %>%
        arrange(desc(pop)) %>%
        filter(lifeExp >= 81)
## # A tibble: 7 x 4
##   country          continent       pop lifeExp
##   <fct>            <fct>         <int>   <dbl>
## 1 Japan            Asia      127467972    82.6
## 2 Japan            Asia      127065841    82  
## 3 Australia        Oceania    20434176    81.2
## 4 Switzerland      Europe      7554661    81.7
## 5 Hong Kong, China Asia        6980412    82.2
## 6 Hong Kong, China Asia        6762476    81.5
## 7 Iceland          Europe       301931    81.8

You can see the results for yourself. It is also possible to combine multiply conditions for whatever functions are involved. For example, if I want to arrange my data by population and country while also filtering it by a population greater than 100,000,000,000 and with a life expectancy of less than 45. This is shown below

country_data %>%
        arrange(desc(pop, country)) %>%
        filter(pop>100000000, lifeExp<45)
## # A tibble: 5 x 4
##   country continent       pop lifeExp
##   <fct>   <fct>         <int>   <dbl>
## 1 China   Asia      665770000    44.5
## 2 China   Asia      556263527    44  
## 3 India   Asia      454000000    43.6
## 4 India   Asia      409000000    40.2
## 5 India   Asia      372000000    37.4

mutate

The mutate function is for manipulating variables and creating new ones. For example, the gdpPercap variable is highly skewed. We can create a variable of gdpercap that is the log of this variable. Using the log will help the data to assume the characteristics of a normal distribution. Below is the code for this.

gapminder %>% 
        select(country,continent, pop, gdpPercap) %>%
        mutate(log_gdp=log(gdpPercap))
## # A tibble: 1,704 x 5
##    country     continent      pop gdpPercap log_gdp
##    <fct>       <fct>        <int>     <dbl>   <dbl>
##  1 Afghanistan Asia       8425333      779.    6.66
##  2 Afghanistan Asia       9240934      821.    6.71
##  3 Afghanistan Asia      10267083      853.    6.75
##  4 Afghanistan Asia      11537966      836.    6.73
##  5 Afghanistan Asia      13079460      740.    6.61
##  6 Afghanistan Asia      14880372      786.    6.67
##  7 Afghanistan Asia      12881816      978.    6.89
##  8 Afghanistan Asia      13867957      852.    6.75
##  9 Afghanistan Asia      16317921      649.    6.48
## 10 Afghanistan Asia      22227415      635.    6.45
## # … with 1,694 more rows

In the code above we had to select our variables again and then we create the new variable “log_gdp”. This new variable appears all the way to the right in the dataset. Naturally, we can extend our code by using our new variable in other functions as shown below.

Conclusion

This post was longer than normal but several practical things were learned. You now know some basic techniques for wrangling data using the dplyr package in R.

Leave a Reply