computer program language text

More Selecting and Transforming with dplyR

In this post, we are going to learn some more advance ways to work with functions in the dplyr package. Let’s load our libraries

library(dplyr)
library(gapminder)

Our dataset is the gapminder dataset which provides information about countries and continents related to gdp, life expectancy, and population. Here is what the data looks like as a refresher.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

select

You can use the colon symbol to select multiple columns at once. Doing this is a great way to save time when selecting variables.

gapminder%>%
        select(lifeExp:gdpPercap)
## # A tibble: 1,704 x 3
##    lifeExp      pop gdpPercap
##      <dbl>    <int>     <dbl>
##  1    28.8  8425333      779.
##  2    30.3  9240934      821.
##  3    32.0 10267083      853.
##  4    34.0 11537966      836.
##  5    36.1 13079460      740.
##  6    38.4 14880372      786.
##  7    39.9 12881816      978.
##  8    40.8 13867957      852.
##  9    41.7 16317921      649.
## 10    41.8 22227415      635.
## # … with 1,694 more rows

You can see that by using the colon we were able to select the last three columns.

There are also arguments called “select helpers.” Select helpers help you find columns in really large data sets. For example, let’s say we want columns that contain the string “life” in them. To find this we would use the contain argument as shown below.

gapminder%>%
        select(contains('life'))
## # A tibble: 1,704 x 1
##    lifeExp
##      <dbl>
##  1    28.8
##  2    30.3
##  3    32.0
##  4    34.0
##  5    36.1
##  6    38.4
##  7    39.9
##  8    40.8
##  9    41.7
## 10    41.8
## # … with 1,694 more rows

Only the column that contains the string life is selected. There are other help selectors that you can try on your own such as starts_with, ends_with and more.

To remove a variable from a dataset you simply need to put a minus sign in front of it as shown below.

gapminder %>%
        select(-lifeExp, -gdpPercap)
## # A tibble: 1,704 x 4
##    country     continent  year      pop
##    <fct>       <fct>     <int>    <int>
##  1 Afghanistan Asia       1952  8425333
##  2 Afghanistan Asia       1957  9240934
##  3 Afghanistan Asia       1962 10267083
##  4 Afghanistan Asia       1967 11537966
##  5 Afghanistan Asia       1972 13079460
##  6 Afghanistan Asia       1977 14880372
##  7 Afghanistan Asia       1982 12881816
##  8 Afghanistan Asia       1987 13867957
##  9 Afghanistan Asia       1992 16317921
## 10 Afghanistan Asia       1997 22227415
## # … with 1,694 more rows

In the output above you can see that life expectancy and per capa GDP are missing.

rename

Another function is the rename function which allows you to rename a variable. Below is an example in which the variable “pop” is renamed “population.”

gapminder %>%
        select(country, year, pop) %>%
        rename(population=pop)
## # A tibble: 1,704 x 3
##    country      year population
##    <fct>       <int>      <int>
##  1 Afghanistan  1952    8425333
##  2 Afghanistan  1957    9240934
##  3 Afghanistan  1962   10267083
##  4 Afghanistan  1967   11537966
##  5 Afghanistan  1972   13079460
##  6 Afghanistan  1977   14880372
##  7 Afghanistan  1982   12881816
##  8 Afghanistan  1987   13867957
##  9 Afghanistan  1992   16317921
## 10 Afghanistan  1997   22227415
## # … with 1,694 more rows

You can see that the “pop” variable has been renamed. Remember that the new name goes on the left of the equal sign while the old name goes on the right of the equal sign.

There is a shortcut to this and it involves renaming variables inside the select function. In the example below, we rename the pop variable population inside the select function.

gapminder %>%
        select(country, year, population=pop)
## # A tibble: 1,704 x 3
##    country      year population
##    <fct>       <int>      <int>
##  1 Afghanistan  1952    8425333
##  2 Afghanistan  1957    9240934
##  3 Afghanistan  1962   10267083
##  4 Afghanistan  1967   11537966
##  5 Afghanistan  1972   13079460
##  6 Afghanistan  1977   14880372
##  7 Afghanistan  1982   12881816
##  8 Afghanistan  1987   13867957
##  9 Afghanistan  1992   16317921
## 10 Afghanistan  1997   22227415
## # … with 1,694 more rows

transmute

The transmute function allows you to select and mutate variables at the same time. For example, let’s say that we want to know total gdp we could find this by multplying the population by gdp per capa. This is done with the transmute function below.

gapminder %>%
        transmute(country, year, total_gdp = pop * gdpPercap)
## # A tibble: 1,704 x 3
##    country      year    total_gdp
##    <fct>       <int>        <dbl>
##  1 Afghanistan  1952  6567086330.
##  2 Afghanistan  1957  7585448670.
##  3 Afghanistan  1962  8758855797.
##  4 Afghanistan  1967  9648014150.
##  5 Afghanistan  1972  9678553274.
##  6 Afghanistan  1977 11697659231.
##  7 Afghanistan  1982 12598563401.
##  8 Afghanistan  1987 11820990309.
##  9 Afghanistan  1992 10595901589.
## 10 Afghanistan  1997 14121995875.
## # … with 1,694 more rows
ad

Conclusion

With these basic tools it is now a little easier to do some data analysis when using R. There is so much more than can be learned but this will have to wait for the future.

Leave a Reply