In this post, we are going to learn some more advanced ways to work with functions in the dplyr package. Let’s load our libraries.
library(dplyr)
library(gapminder)
Our dataset is the gapminder dataset, which provides information about countries and continents, including GDP per capita, life expectancy, and population. As a refresher, here is what the data looks like.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
select
You can use the colon symbol to select a range of adjacent columns at once. Doing this is a great way to save time when selecting variables.
gapminder %>%
select(lifeExp:gdpPercap)
## # A tibble: 1,704 x 3
## lifeExp pop gdpPercap
## <dbl> <int> <dbl>
## 1 28.8 8425333 779.
## 2 30.3 9240934 821.
## 3 32.0 10267083 853.
## 4 34.0 11537966 836.
## 5 36.1 13079460 740.
## 6 38.4 14880372 786.
## 7 39.9 12881816 978.
## 8 40.8 13867957 852.
## 9 41.7 16317921 649.
## 10 41.8 22227415 635.
## # … with 1,694 more rows
You can see that by using the colon we were able to select the last three columns.
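As a small sketch of the same idea, a colon range can be combined with individually named columns, and a range can also be given by column position. The object names range_cols and first_three are made up for this illustration.

```r
library(dplyr)
library(gapminder)

# Combine one named column with a range of adjacent columns
range_cols <- gapminder %>%
  select(country, year:pop)

# Ranges also work by column position (here, the first three columns)
first_three <- gapminder %>%
  select(1:3)

names(range_cols)
names(first_three)
```

Selecting by name is usually the safer choice, since positions change if columns are added or reordered.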
There are also functions called “select helpers.” Select helpers help you find columns in really large data sets. For example, let’s say we want columns that contain the string “life” in their names. To find them, we would use the contains helper as shown below.
gapminder %>%
select(contains('life'))
## # A tibble: 1,704 x 1
## lifeExp
## <dbl>
## 1 28.8
## 2 30.3
## 3 32.0
## 4 34.0
## 5 36.1
## 6 38.4
## 7 39.9
## 8 40.8
## 9 41.7
## 10 41.8
## # … with 1,694 more rows
Only the column whose name contains the string “life” is selected. There are other select helpers that you can try on your own, such as starts_with, ends_with, and more.
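To give a feel for the other helpers, here is a short sketch using starts_with and ends_with on the same data. The object names co_cols and p_cols are only placeholders for this example.

```r
library(dplyr)
library(gapminder)

# Columns whose names begin with "co"
co_cols <- gapminder %>%
  select(starts_with("co"))

# Columns whose names end with "p"
p_cols <- gapminder %>%
  select(ends_with("p"))

names(co_cols)
names(p_cols)
```

By default these helpers ignore case when matching, so it is worth checking the selected names when column names differ only in capitalization.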
To remove a variable from a dataset you simply need to put a minus sign in front of it as shown below.
gapminder %>%
select(-lifeExp, -gdpPercap)
## # A tibble: 1,704 x 4
## country continent year pop
## <fct> <fct> <int> <int>
## 1 Afghanistan Asia 1952 8425333
## 2 Afghanistan Asia 1957 9240934
## 3 Afghanistan Asia 1962 10267083
## 4 Afghanistan Asia 1967 11537966
## 5 Afghanistan Asia 1972 13079460
## 6 Afghanistan Asia 1977 14880372
## 7 Afghanistan Asia 1982 12881816
## 8 Afghanistan Asia 1987 13867957
## 9 Afghanistan Asia 1992 16317921
## 10 Afghanistan Asia 1997 22227415
## # … with 1,694 more rows
In the output above you can see that life expectancy and per capita GDP have been removed.
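The minus sign can also be combined with the other selection tools from this post. The sketch below drops a whole range of columns and then drops columns matched by a select helper. The names no_range and no_gdp are made up for illustration.

```r
library(dplyr)
library(gapminder)

# Drop a range of adjacent columns; the parentheses make the minus
# apply to the whole range rather than only to lifeExp
no_range <- gapminder %>%
  select(-(lifeExp:gdpPercap))

# Drop every column whose name starts with "gdp"
no_gdp <- gapminder %>%
  select(-starts_with("gdp"))

names(no_range)
names(no_gdp)
```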
rename
Another function is the rename function which allows you to rename a variable. Below is an example in which the variable “pop” is renamed “population.”
gapminder %>%
select(country, year, pop) %>%
rename(population=pop)
## # A tibble: 1,704 x 3
## country year population
## <fct> <int> <int>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # … with 1,694 more rows
You can see that the “pop” variable has been renamed. Remember that the new name goes on the left of the equal sign while the old name goes on the right of the equal sign.
There is a shortcut to this and it involves renaming variables inside the select function. In the example below, we rename the pop variable population inside the select function.
gapminder %>%
select(country, year, population=pop)
## # A tibble: 1,704 x 3
## country year population
## <fct> <int> <int>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # … with 1,694 more rows
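One difference between the two approaches is worth seeing side by side: rename keeps every column in the dataset, while renaming inside select keeps only the columns you list. The sketch below also renames two variables in one call and uses rename_with to apply a function to all column names. The new names (life_expectancy, renamed, upper_names) are chosen only for this example.

```r
library(dplyr)
library(gapminder)

# rename() keeps all six columns and can change several names at once
renamed <- gapminder %>%
  rename(population = pop, life_expectancy = lifeExp)

# select() with a rename keeps only the columns that are listed
selected <- gapminder %>%
  select(country, year, population = pop)

# rename_with() applies a function to the column names
upper_names <- gapminder %>%
  rename_with(toupper)

names(renamed)
names(selected)
names(upper_names)
```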
transmute
The transmute function allows you to select and mutate variables at the same time. For example, let’s say that we want to know total GDP. We could find this by multiplying the population by GDP per capita. This is done with the transmute function below.
gapminder %>%
transmute(country, year, total_gdp = pop * gdpPercap)
## # A tibble: 1,704 x 3
## country year total_gdp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 6567086330.
## 2 Afghanistan 1957 7585448670.
## 3 Afghanistan 1962 8758855797.
## 4 Afghanistan 1967 9648014150.
## 5 Afghanistan 1972 9678553274.
## 6 Afghanistan 1977 11697659231.
## 7 Afghanistan 1982 12598563401.
## 8 Afghanistan 1987 11820990309.
## 9 Afghanistan 1992 10595901589.
## 10 Afghanistan 1997 14121995875.
## # … with 1,694 more rows
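To make the difference from mutate concrete, the sketch below runs the same calculation both ways. mutate adds total_gdp to the existing columns, while transmute keeps only what is named in the call. The last step assumes dplyr 1.0.0 or later, where mutate has a .keep argument; recent dplyr documentation marks transmute as superseded in favor of mutate(.keep = "none"). The object names are made up for this example.

```r
library(dplyr)
library(gapminder)

# mutate() keeps the six original columns and adds total_gdp
with_mutate <- gapminder %>%
  mutate(total_gdp = pop * gdpPercap)

# transmute() keeps only the columns named in the call
with_transmute <- gapminder %>%
  transmute(country, year, total_gdp = pop * gdpPercap)

# Assumes dplyr >= 1.0.0: .keep = "none" drops columns not named here
with_keep_none <- gapminder %>%
  mutate(country, year, total_gdp = pop * gdpPercap, .keep = "none")

ncol(with_mutate)
ncol(with_transmute)
names(with_keep_none)
```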
Conclusion
With these basic tools it is now a little easier to do some data analysis when using R. There is so much more that can be learned, but this will have to wait for the future.