In this post, we will learn about data aggregation with the dplyr package. Data aggregation is primarily a tool for summarizing the information you have collected. Let’s start by loading our packages.
library(dplyr)
library(gapminder)
dplyr is for the data manipulation while gapminder provides us with the data. We will learn the following functions from the dplyr package
- count()
- summarize
- group_by()
- top_n()
count
The count function allows you to count the number of observations in your dataset as shown below.
gapminder %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1704
The output tells us that there are over 1700 rows of data. However, the count function can do much more. For example, we can also count values in a specific column. Below, we calculated how many rows of data we have by continent.
gapminder %>%
count(continent)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
The output speaks for its self. There are two columns the left is continent and the right is how many times that particular continent appears in the dataset. You can also sort this data by adding the argument called sort as shown below.
gapminder %>%
count(continent, sort =TRUE)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Asia 396
## 3 Europe 360
## 4 Americas 300
## 5 Oceania 24
There is another argument we can add and this is called the weight or wt argument. The wt argument adds up the values of the population in our example and we can now see how many respondents there were from each continent. Below is the code an example
gapminder %>%
count(continent, wt=pop, sort=TRUE)
## # A tibble: 5 x 2
## continent n
## <fct> <dbl>
## 1 Asia 30507333901
## 2 Americas 7351438499
## 3 Africa 6187585961
## 4 Europe 6181115304
## 5 Oceania 212992136
You can see that we now know how many people from each continent were in the dataset.
summarize
The summarize function takes many rows of data and reduce it to a single output. For example, if we want to know the total number of people in the dataset we could run the code below.
gapminder %>%
summarize(total_pop=sum(pop))
## # A tibble: 1 x 1
## total_pop
## <dbl>
## 1 50440465801
You can also continue to add more and more things you want to know be separating them with a comma. In the code below, we add to it the average GDP.
gapminder %>%
summarize(total_pop=sum(pop), average_gdp=mean(gdpPercap))
## # A tibble: 1 x 2
## total_pop average_gdp
## <dbl> <dbl>
## 1 50440465801 7215.
group_by
The group by function allows you to aggregate data by groups. For example, if we want to know the total population and the average gdp by continent the code below would help to learn this.
gapminder %>%
group_by(continent) %>%
summarize(total_pop=sum(pop), mean_gdp=mean(gdpPercap)) %>%
arrange(desc(total_pop))
## # A tibble: 5 x 3
## continent total_pop mean_gdp
## <fct> <dbl> <dbl>
## 1 Asia 30507333901 7902.
## 2 Americas 7351438499 7136.
## 3 Africa 6187585961 2194.
## 4 Europe 6181115304 14469.
## 5 Oceania 212992136 18622.
It is also possible to group by more than one column. However, to do this we need to create another categorical variable. We are going to use mutate to create a categorical variable that breaks the data into two parts. Before 1980 and after 1980. Then we will group by country and whether the mean of the gdp was collected before or after 1980. Below is the code
gapminder %>%
mutate(before_1980=if_else(year < 1980, "yes","no")) %>%
group_by(country, before_1980) %>%
summarize(mean_gdp=mean(gdpPercap))
## # A tibble: 284 x 3
## # Groups: country [142]
## country before_1980 mean_gdp
## <fct> <chr> <dbl>
## 1 Afghanistan no 803.
## 2 Afghanistan yes 803.
## 3 Albania no 3934.
## 4 Albania yes 2577.
## 5 Algeria no 5460.
## 6 Algeria yes 3392.
## 7 Angola no 2944.
## 8 Angola yes 4270.
## 9 Argentina no 9998.
## 10 Argentina yes 7913.
## # … with 274 more rows
top_n
The top_n function allows you to find the most extreme values when looking at groups. For example, we could find which countries has the highest life expectancy by continent. The answer is below
gapminder %>%
group_by(continent) %>%
top_n(1, lifeExp)
## # A tibble: 5 x 6
## # Groups: continent [5]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 2007 81.2 20434176 34435.
## 2 Canada Americas 2007 80.7 33390141 36319.
## 3 Iceland Europe 2007 81.8 301931 36181.
## 4 Japan Asia 2007 82.6 127467972 31656.
## 5 Reunion Africa 2007 76.4 798094 7670.
As an example, Japan has the highest life expectancy in Asia. Canada has the highest life expectancy in the Americas. Naturally you are not limited to the top 1. This number can be changed to whatever you want. For example, below we change the number to 3.
gapminder %>%
group_by(continent) %>%
top_n(3, lifeExp)
## # A tibble: 15 x 6
## # Groups: continent [5]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 2002 80.4 19546792 30688.
## 2 Australia Oceania 2007 81.2 20434176 34435.
## 3 Canada Americas 2002 79.8 31902268 33329.
## 4 Canada Americas 2007 80.7 33390141 36319.
## 5 Costa Rica Americas 2007 78.8 4133884 9645.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 Iceland Europe 2007 81.8 301931 36181.
## 8 Japan Asia 2002 82 127065841 28605.
## 9 Japan Asia 2007 82.6 127467972 31656.
## 10 New Zealand Oceania 2007 80.2 4115771 25185.
## 11 Reunion Africa 1997 74.8 684810 6072.
## 12 Reunion Africa 2002 75.7 743981 6316.
## 13 Reunion Africa 2007 76.4 798094 7670.
## 14 Spain Europe 2007 80.9 40448191 28821.
## 15 Switzerland Europe 2007 81.7 7554661 37506.