Data frames are the default structure for storing tabular data in R. Another option is the data table, provided by the data.table package. As we will see, data tables build on data frames and add a more concise syntax for tasks such as filtering rows. We will first cover basic subsetting that works the same way on data frames and data tables, then move on to actions that are easier to perform with data tables.
Loading Packages and Data Preparation
We will start by loading the data.table package and preparing our data. The package is loaded using the library() function. We will use the mtcars and iris datasets for the various examples. Both of these datasets are available by default within R. Since our focus is on data tables, we will convert both the mtcars and iris datasets into data tables and store them in objects with the same names. Note that the conversion drops the car names that mtcars stores as row names, because data tables do not keep row names. Below is the code.
library(data.table)
mtcars <- data.table(mtcars)
iris <- data.table(iris)
Next, we will use the head() function to examine the mtcars and iris datasets quickly.
The mtcars dataset has data about cars, while the iris dataset has data about various features of flowers.
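The head() calls mentioned above could look like the following minimal sketch. The object names cars_dt and iris_dt are hypothetical, chosen here so that the built-in datasets are not overwritten; the rest of this post keeps the names mtcars and iris.

```r
library(data.table)

# Hypothetical names (not from the original post) so the built-in
# datasets stay untouched
cars_dt <- as.data.table(datasets::mtcars)
iris_dt <- as.data.table(datasets::iris)

# head() returns the first six rows by default
head(cars_dt)

# The n argument changes how many rows are returned
head(iris_dt, n = 3)

# dim() reports the number of rows, then the number of columns
dim(cars_dt)
dim(iris_dt)
```

Calling dim() alongside head() is a quick way to confirm the conversion kept every row and column.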
Subsetting Basics
The first five examples can be performed on either data frames or data tables. We will begin by subsetting a single row from a data table as shown below.
#filtering with positive integers
row_2 <- mtcars[2,]
row_2
mpg cyl disp hp drat wt qsec vs am gear carb
2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
In the code above, we filter for the second row in the mtcars data table. This is done using brackets containing the number of the row we want. The comma after the number 2 separates the row selection from the column selection. Since nothing follows the comma, R selects all columns. This is why we have all the data from row number 2.
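To illustrate what the position after the comma does, here is a small sketch that fills it in. The name cars_dt is hypothetical, and the .() shorthand in the last line is data.table syntax, so that line would not work on a plain data frame.

```r
library(data.table)

# Hypothetical name; a fresh copy of the built-in dataset
cars_dt <- as.data.table(datasets::mtcars)

# Row 2, all columns (nothing after the comma)
cars_dt[2, ]

# Row 2, only the mpg column, returned as a plain vector
cars_dt[2, mpg]

# Row 2, two columns, returned as a data.table
cars_dt[2, .(mpg, cyl)]
```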
In the example below, we will select multiple rows at once using the c() function and a colon.
#multiple rows
rows_1_5 <- mtcars[c(1:5),]
rows_1_5
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
The main difference in the code above is the use of the c() function, which combines values into a vector. Inside this function, 1:5 tells R we want the first 5 rows, and the empty position after the comma again returns all columns. The colon already produces a vector, so mtcars[1:5,] returns the same result. It is not necessary to pull consecutive rows, as we can also pull whatever rows we want specifically.
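As a quick check of how c() and the colon interact, the sketch below compares the two forms and then mixes a range with a single row number. The object names are hypothetical.

```r
library(data.table)

# Hypothetical name; a fresh copy of the built-in dataset
cars_dt <- as.data.table(datasets::mtcars)

# 1:5 is already a vector, so wrapping it in c() changes nothing
with_c    <- cars_dt[c(1:5), ]
without_c <- cars_dt[1:5, ]
all.equal(with_c, without_c)

# c() becomes necessary when a range is combined with other row numbers
mixed <- cars_dt[c(1:3, 10), ]
mixed
```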
#filtering non consecutive rows
rows_1_3_5 <- mtcars[c(1,3,5),]
rows_1_3_5
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
In the example above, inside the c() function, we indicate that we want the 1st, 3rd, and 5th rows along with all of the columns. In the next two examples, we will learn how to exclude rows rather than include them.
only_last_two <- mtcars[-c(1:30),]
only_last_two
mpg cyl disp hp drat wt qsec vs am gear carb
31 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
You can place a minus sign in front of a set of row numbers to remove those rows. For example, in the code above, the minus sign in front of rows 1 to 30 tells R to drop rows 1 to 30. This is why only rows 31 and 32 appear in the output. Just as in the other examples, the numbers do not have to be consecutive, as shown below.
exclude_some <- mtcars[-c(1:10,12:32),]
exclude_some
mpg cyl disp hp drat wt qsec vs am gear carb
11 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
In the above example, we exclude rows 1 to 10 and rows 12 to 32, leaving only row 11.
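Excluding rows and selecting the remaining rows are two routes to the same subset. The sketch below, which uses hypothetical object names, confirms this for the example above.

```r
library(data.table)

# Hypothetical name; a fresh copy of the built-in dataset
cars_dt <- as.data.table(datasets::mtcars)

# Drop rows 1 to 10 and 12 to 32 ...
dropped <- cars_dt[-c(1:10, 12:32), ]

# ... or select row 11 directly
kept <- cars_dt[11, ]

# Both approaches return the same single row
all.equal(dropped, kept)
```

Negative indexing tends to be the shorter option when only a few rows need to go; positive indexing is shorter when only a few rows need to stay.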
Using data.table
We will now do three examples that rely on data.table syntax. The first example below removes the first 30 rows and the last row (row 32), which means only row 31 is displayed.
not_first_last <- mtcars[-c(1:30,.N)]
not_first_last
mpg cyl disp hp drat wt qsec vs am gear carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 15 8 301 335 3.54 3.57 14.6 0 1 5 8
If you look closely, the output is different. There is a 1: to the left of the data, which gives the impression that this is row 1 of the dataset. However, it is the first row of the subsetted data, not of the original data. In addition, the <num> beneath each column name shows that the column is numeric. Lastly, the code uses .N, a special data.table symbol that holds the number of rows in the table (32 here). Placing it inside the negated vector therefore removes the last row. Notice also that a data table does not need a comma after the row selection.
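Since .N does a lot of work in that one line, here is a short sketch of how it behaves in a few positions. The name cars_dt is hypothetical.

```r
library(data.table)

# Hypothetical name; a fresh copy of the built-in dataset
cars_dt <- as.data.table(datasets::mtcars)

# After the comma, .N returns the number of rows
cars_dt[, .N]

# In the row position, .N stands for the index of the last row
cars_dt[.N]

# First and last rows together
cars_dt[c(1, .N)]

# Same subset as in the example above: drop rows 1 to 30 and the last row
cars_dt[-c(1:30, .N)]
```

Using .N instead of a hard-coded 32 means the same code keeps working if the number of rows changes.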
In the next example, we are going to subset the data so that only cars with a manual transmission appear (am == 1). In mtcars, am is coded 0 for automatic and 1 for manual.
am_1 <- mtcars[am == 1]
am_1
mpg cyl disp hp drat wt qsec vs am gear carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
7: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
8: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
9: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
10: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
11: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
12: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
13: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Within the brackets, you refer to the column by name and state the condition its values must meet. Unlike with a data frame, there is no need to write mtcars$am. Naturally, you can create more complex queries as shown below.
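For comparison, the sketch below runs the same filter on a data frame and on a data table. The object names are hypothetical; the point is only the difference in what goes inside the brackets.

```r
library(data.table)

# Hypothetical names; one data frame and one data table with the same data
cars_df <- datasets::mtcars
cars_dt <- as.data.table(cars_df)

# Data frame: the column needs the cars_df$ prefix and a trailing comma
manual_df <- cars_df[cars_df$am == 1, ]

# Data table: the column name alone is enough
manual_dt <- cars_dt[am == 1]

# Both subsets contain the same number of cars
nrow(manual_df)
nrow(manual_dt)
```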
am_1_mpg_25 <- mtcars[am==1 & mpg > 25]
am_1_mpg_25
mpg cyl disp hp drat wt qsec vs am gear carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
2: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
3: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
4: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
5: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
6: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
In the last example, we filtered the data for cars with manual transmissions and with mpg above 25.
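Conditions can be combined in other ways as well. The following sketch, with hypothetical object names, shows a few variations on the query above using standard R operators, plus .N to count the matching rows.

```r
library(data.table)

# Hypothetical name; a fresh copy of the built-in dataset
cars_dt <- as.data.table(datasets::mtcars)

# & keeps rows where both conditions hold
both <- cars_dt[am == 1 & mpg > 25]

# | keeps rows where at least one condition holds
either <- cars_dt[am == 1 | mpg > 25]

# %in% matches a column against several values at once
four_or_six <- cars_dt[cyl %in% c(4, 6)]

# Adding .N after the comma counts the matching rows instead of printing them
cars_dt[am == 1 & mpg > 25, .N]
```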
Conclusion
Data tables are flexible and often let you express a subset more concisely than a data frame does, as the .N symbol and the column-name filters above show. This is yet another excellent tool that can be deployed by an R enthusiast.
