shallow focus photo of man holding floor brush ceramic figurine

Data Aggregation with dplyr

In this post, we will learn about data aggregation with the dplyr package. Data aggregation is primarily a tool for summarizing the information you have collected. Let’s start by loading our packages.

library(dplyr)
library(gapminder)
ad

dplyr is for the data manipulation while gapminder provides us with the data. We will learn the following functions from the dplyr package

  • count()
  • summarize
  • group_by()
  • top_n()

count

The count function allows you to count the number of observations in your dataset as shown below.

gapminder %>%
        count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1704

The output tells us that there are over 1700 rows of data. However, the count function can do much more. For example, we can also count values in a specific column. Below, we calculated how many rows of data we have by continent.

gapminder %>%
        count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24

The output speaks for its self. There are two columns the left is continent and the right is how many times that particular continent appears in the dataset. You can also sort this data by adding the argument called sort as shown below.

gapminder %>%
        count(continent, sort =TRUE)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Asia        396
## 3 Europe      360
## 4 Americas    300
## 5 Oceania      24

There is another argument we can add and this is called the weight or wt argument. The wt argument adds up the values of the population in our example and we can now see how many respondents there were from each continent. Below is the code an example

gapminder %>% 
        count(continent, wt=pop, sort=TRUE)
## # A tibble: 5 x 2
##   continent           n
##   <fct>           <dbl>
## 1 Asia      30507333901
## 2 Americas   7351438499
## 3 Africa     6187585961
## 4 Europe     6181115304
## 5 Oceania     212992136

You can see that we now know how many people from each continent were in the dataset.

summarize

The summarize function takes many rows of data and reduce it to a single output. For example, if we want to know the total number of people in the dataset we could run the code below.

gapminder %>%
        summarize(total_pop=sum(pop))
## # A tibble: 1 x 1
##     total_pop
##         <dbl>
## 1 50440465801

You can also continue to add more and more things you want to know be separating them with a comma. In the code below, we add to it the average GDP.

gapminder %>%
        summarize(total_pop=sum(pop), average_gdp=mean(gdpPercap))
## # A tibble: 1 x 2
##     total_pop average_gdp
##         <dbl>       <dbl>
## 1 50440465801       7215.

group_by

The group by function allows you to aggregate data by groups. For example, if we want to know the total population and the average gdp by continent the code below would help to learn this.

gapminder %>%
        group_by(continent) %>%
        summarize(total_pop=sum(pop), mean_gdp=mean(gdpPercap)) %>%
        arrange(desc(total_pop))
## # A tibble: 5 x 3
##   continent   total_pop mean_gdp
##   <fct>           <dbl>    <dbl>
## 1 Asia      30507333901    7902.
## 2 Americas   7351438499    7136.
## 3 Africa     6187585961    2194.
## 4 Europe     6181115304   14469.
## 5 Oceania     212992136   18622.

It is also possible to group by more than one column. However, to do this we need to create another categorical variable. We are going to use mutate to create a categorical variable that breaks the data into two parts. Before 1980 and after 1980. Then we will group by country and whether the mean of the gdp was collected before or after 1980. Below is the code

gapminder %>%
        mutate(before_1980=if_else(year < 1980, "yes","no")) %>%
        group_by(country, before_1980) %>%
        summarize(mean_gdp=mean(gdpPercap))
## # A tibble: 284 x 3
## # Groups:   country [142]
##    country     before_1980 mean_gdp
##    <fct>       <chr>          <dbl>
##  1 Afghanistan no              803.
##  2 Afghanistan yes             803.
##  3 Albania     no             3934.
##  4 Albania     yes            2577.
##  5 Algeria     no             5460.
##  6 Algeria     yes            3392.
##  7 Angola      no             2944.
##  8 Angola      yes            4270.
##  9 Argentina   no             9998.
## 10 Argentina   yes            7913.
## # … with 274 more rows

top_n

The top_n function allows you to find the most extreme values when looking at groups. For example, we could find which countries has the highest life expectancy by continent. The answer is below

gapminder %>%
        group_by(continent) %>%
        top_n(1, lifeExp)
## # A tibble: 5 x 6
## # Groups:   continent [5]
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Australia Oceania    2007    81.2  20434176    34435.
## 2 Canada    Americas   2007    80.7  33390141    36319.
## 3 Iceland   Europe     2007    81.8    301931    36181.
## 4 Japan     Asia       2007    82.6 127467972    31656.
## 5 Reunion   Africa     2007    76.4    798094     7670.

As an example, Japan has the highest life expectancy in Asia. Canada has the highest life expectancy in the Americas. Naturally you are not limited to the top 1. This number can be changed to whatever you want. For example, below we change the number to 3.

gapminder %>%
        group_by(continent) %>%
        top_n(3, lifeExp)
## # A tibble: 15 x 6
## # Groups:   continent [5]
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Australia        Oceania    2002    80.4  19546792    30688.
##  2 Australia        Oceania    2007    81.2  20434176    34435.
##  3 Canada           Americas   2002    79.8  31902268    33329.
##  4 Canada           Americas   2007    80.7  33390141    36319.
##  5 Costa Rica       Americas   2007    78.8   4133884     9645.
##  6 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  7 Iceland          Europe     2007    81.8    301931    36181.
##  8 Japan            Asia       2002    82   127065841    28605.
##  9 Japan            Asia       2007    82.6 127467972    31656.
## 10 New Zealand      Oceania    2007    80.2   4115771    25185.
## 11 Reunion          Africa     1997    74.8    684810     6072.
## 12 Reunion          Africa     2002    75.7    743981     6316.
## 13 Reunion          Africa     2007    76.4    798094     7670.
## 14 Spain            Europe     2007    80.9  40448191    28821.
## 15 Switzerland      Europe     2007    81.7   7554661    37506.

Leave a Reply