In this post, we will be exposed to tools for wrangling and manipulating data in R.
Let’s begin by loading the libraries we will be using. We will use the dplyr package and the gapminder package. dplyr is for manipulating the data and gapminder provides the dataset.
library(dplyr)
library(gapminder)
You can look at the data briefly by using a function called “glimpse” as shown below.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
You can see that we have six columns or variables and over 1700 rows of data. This data provides information about countries and various demographic statistics.
select()
The select function allows you to grab only the variables you want for analysis. This becomes exceptionally important when you have a large number of variables. In our next example, we will select 4 variables from the gapminder dataset. Below is the code to achieve this.
gapminder %>%
select(country,continent, pop, lifeExp)
## # A tibble: 1,704 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 Afghanistan Asia 8425333 28.8
## 2 Afghanistan Asia 9240934 30.3
## 3 Afghanistan Asia 10267083 32.0
## 4 Afghanistan Asia 11537966 34.0
## 5 Afghanistan Asia 13079460 36.1
## 6 Afghanistan Asia 14880372 38.4
## 7 Afghanistan Asia 12881816 39.9
## 8 Afghanistan Asia 13867957 40.8
## 9 Afghanistan Asia 16317921 41.7
## 10 Afghanistan Asia 22227415 41.8
## # … with 1,694 more rows
The strange symbol %>% is called a “pipe” and allows you to continuously build your code. You can also save this information by assigning a name to an object like any other variable in r.
country_data<-gapminder %>%
select(country,continent, pop, lifeExp)
arrange
The arrange verb sorts your data based on one or more variables. For example, let’s say we want to know which country has the highest population. The code below provides the answer.
country_data %>%
arrange(pop)
## # A tibble: 1,704 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 Sao Tome and Principe Africa 60011 46.5
## 2 Sao Tome and Principe Africa 61325 48.9
## 3 Djibouti Africa 63149 34.8
## 4 Sao Tome and Principe Africa 65345 51.9
## 5 Sao Tome and Principe Africa 70787 54.4
## 6 Djibouti Africa 71851 37.3
## 7 Sao Tome and Principe Africa 76595 56.5
## 8 Sao Tome and Principe Africa 86796 58.6
## 9 Djibouti Africa 89898 39.7
## 10 Sao Tome and Principe Africa 98593 60.4
## # … with 1,694 more rows
To complete this task we had to use the arrange function and place the name of the variable we want to sort by inside the parentheses. However, this is not exactly what we want. What we have found is the countries with the smallest population. To sort from largest to smallest you must use the desc function as well and this is shown below.
country_data %>%
arrange(desc(pop))
## # A tibble: 1,704 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 China Asia 1318683096 73.0
## 2 China Asia 1280400000 72.0
## 3 China Asia 1230075000 70.4
## 4 China Asia 1164970000 68.7
## 5 India Asia 1110396331 64.7
## 6 China Asia 1084035000 67.3
## 7 India Asia 1034172547 62.9
## 8 China Asia 1000281000 65.5
## 9 India Asia 959000000 61.8
## 10 China Asia 943455000 64.0
## # … with 1,694 more rows
Now, this is what we want. China claims several of the top spots. The reason a country is on the list more than once is that the data was collected several different years.
filter
The filter function is used to obtain only specific values that meet the criteria. For example, what if we want to know the population of only India in descending order. Below is the code for how to do this.
country_data %>%
arrange(desc(pop)) %>%
filter(country=='India')
## # A tibble: 12 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 India Asia 1110396331 64.7
## 2 India Asia 1034172547 62.9
## 3 India Asia 959000000 61.8
## 4 India Asia 872000000 60.2
## 5 India Asia 788000000 58.6
## 6 India Asia 708000000 56.6
## 7 India Asia 634000000 54.2
## 8 India Asia 567000000 50.7
## 9 India Asia 506000000 47.2
## 10 India Asia 454000000 43.6
## 11 India Asia 409000000 40.2
## 12 India Asia 372000000 37.4
Now we have only data that relates to India. All we did was include one more pipe and the filter function. We had to tell R which country by placing the information above in the parentheses.
filter is not limited to text searches. You can also search based on numerical values. For example, what if we only want countries with a life expectancy of 81 or higher
country_data %>%
arrange(desc(pop)) %>%
filter(lifeExp >= 81)
## # A tibble: 7 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 Japan Asia 127467972 82.6
## 2 Japan Asia 127065841 82
## 3 Australia Oceania 20434176 81.2
## 4 Switzerland Europe 7554661 81.7
## 5 Hong Kong, China Asia 6980412 82.2
## 6 Hong Kong, China Asia 6762476 81.5
## 7 Iceland Europe 301931 81.8
You can see the results for yourself. It is also possible to combine multiply conditions for whatever functions are involved. For example, if I want to arrange my data by population and country while also filtering it by a population greater than 100,000,000,000 and with a life expectancy of less than 45. This is shown below
country_data %>%
arrange(desc(pop, country)) %>%
filter(pop>100000000, lifeExp<45)
## # A tibble: 5 x 4
## country continent pop lifeExp
## <fct> <fct> <int> <dbl>
## 1 China Asia 665770000 44.5
## 2 China Asia 556263527 44
## 3 India Asia 454000000 43.6
## 4 India Asia 409000000 40.2
## 5 India Asia 372000000 37.4
mutate
The mutate function is for manipulating variables and creating new ones. For example, the gdpPercap variable is highly skewed. We can create a variable of gdpercap that is the log of this variable. Using the log will help the data to assume the characteristics of a normal distribution. Below is the code for this.
gapminder %>%
select(country,continent, pop, gdpPercap) %>%
mutate(log_gdp=log(gdpPercap))
## # A tibble: 1,704 x 5
## country continent pop gdpPercap log_gdp
## <fct> <fct> <int> <dbl> <dbl>
## 1 Afghanistan Asia 8425333 779. 6.66
## 2 Afghanistan Asia 9240934 821. 6.71
## 3 Afghanistan Asia 10267083 853. 6.75
## 4 Afghanistan Asia 11537966 836. 6.73
## 5 Afghanistan Asia 13079460 740. 6.61
## 6 Afghanistan Asia 14880372 786. 6.67
## 7 Afghanistan Asia 12881816 978. 6.89
## 8 Afghanistan Asia 13867957 852. 6.75
## 9 Afghanistan Asia 16317921 649. 6.48
## 10 Afghanistan Asia 22227415 635. 6.45
## # … with 1,694 more rows
In the code above we had to select our variables again and then we create the new variable “log_gdp”. This new variable appears all the way to the right in the dataset. Naturally, we can extend our code by using our new variable in other functions as shown below.
Conclusion
This post was longer than normal but several practical things were learned. You now know some basic techniques for wrangling data using the dplyr package in R.