In this post, we will look at how you can modify data tables in R. Specifically, we will look at how to add columns, fix errors, and calculate values. Below is the initial code to prepare the data we will use.
library(data.table)
mtcars<-data.table(mtcars)
In the code above, we load the data.table package. We then convert the “mtcars” dataset to be a data.table in the second line. Below is a look at all the columns and the first few rows of data
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Adding a New Column
Adding a new column involves a simple process. In the code below, we call “mtcars” and inside the brackets, we place a comma first. This tells R that we want all rows of data.
After the comma, we create a name for our new column called “distance_travel.” After this, we place the := notation to indicate that we are calculating a value. After the :=, we write mpg * 4. This means multiply the mpg column by 4.
# Add a new column, travel distance
mtcars[, distance_travel := mpg*4]
Below is the output for the code above. To save space, we will not print every column. Instead, we will subset what we want as shown below.
mtcars[1:3,c(1,10:12)]
mpg gear carb distance_travel
<num> <num> <num> <num>
1: 21.0 4 4 84.0
2: 21.0 4 4 84.0
3: 22.8 4 1 91.2
You can see the calculated value in the fourth column. This value is mpg multiplied by 4.
Fixing Errors
We can also fix errors for this example. Let’s assume that any value less than 21.5 in the mpg column is a mistake. We can replace those values with NA with the code below.
#fix errors change mpg less than 21.5 to NA
mtcars[mpg <21.5, mpg := NA]
In this code, you can see that we use brackets again and indicate that we are looking for all mpg values that are less than 21.5. These values will be rewritten with NA. Below is the output for select columns and rows only.
mtcars[1:3,1:3]
mpg cyl disp
<num> <num> <num>
1: NA 6 160
2: NA 6 160
3: 22.8 4 108
The NAs are where there used to be mpg values less than 21.5
Adding Columns for Groups
Another value we may want to calculate is to count values by groups. In the code below, we count the number of cars that have an automatic or manual transmission. To do this, we use brackets again. Inside the brackets, we place a comma. After the comma, we name our new variable “total_am” followed by the := notation. After the := notation, we place .N. The .N notation tells are to count all rows. In this case, we are counting all rows by the variable “am,” which stands for automatic transmission, and yes or no. Below is the code and output
# Add a new column equal to total cars with automatic transmission
mtcars[, total_am := .N, by = am]
mtcars[1:5,c(1,9,13)]
mpg am total_am
<num> <num> <int>
1: NA 1 13
2: NA 1 13
3: 22.8 1 13
4: NA 0 19
5: NA 0 19
You can see that there are 13 cars with automatic transmissions and 19 cars without automatic transmissions. Each row has a 13 if it has an automatic transmission and a 19 if it does not. This calculation is something similar to what a windowing function does in SQL.
Calculate Values of Groups
It is also possible to calculate other values. In the code below, we calculate the average mpg by the number of cylinders a car has. The syntax for this code should be looking familiar by now. Notice how after the := notation, it is possible to use a function. In our case, we are using the mean() function.
# Calculate the mean mpg by cyl
mtcars[, mean_mpg:=mean(mpg,na.rm=TRUE),
by = cyl]
mtcars[1:3,c(1,2,13)]
mpg cyl mean_mpg
<num> <num> <num>
1: 21.0 6 19.74286
2: 21.0 6 19.74286
3: 22.8 4 26.66364
The results are similar to the previous example, except this time we calculate the mean of mpg by cyl.
Using LHS := RHS Form
LHS := RHS Form is another way to indicate to R what you want to do. On the left-hand side, you create the names of your columns. After the := sign you place the functions you are using. Notice how the functions are wrapped inside parentheses with a period on the outside. Also note the comma at the beginning of the square brackets and in front of the “by” argument.
# Add columns using the LHS := RHS form
mtcars[, c("mean_mpg", "median_mpg") := .(mean(mpg), median(mpg)),
by = cyl]
In the code above, we calculate the mean and the median mpg by the number of cylinders. Below is the output
mtcars[1:3,c(1,2,13,14)]
mpg cyl mean_mpg median_mpg
<num> <num> <num> <num>
1: 21.0 6 19.74286 19.7
2: 21.0 6 19.74286 19.7
3: 22.8 4 26.66364 26.0
You can clearly see the two new columns that show the mean and median for mpg.
Functional Form
Functional form is another way to get the same results. In the code below, we are still using the square brackets. Inside the square brackets, we first have a comma. Next, you have our := symbol, but this time the := symbol is inside grave accents (“). The grave accent is next to the number 1 on a standard keyboard and is also home to the tilde sign (~). After the := symbol you create the column name,s followed by the function you are using, separated by commas. After all of this, you indicate the grouping using the “by” argument.
# Add columns using the functional form
mtcars[, `:=`(mean_mpg_func = mean(mpg),
median_mpg_func = median(mpg)),
by = cyl]
mtcars[1:3,c(1,2,15,16)]
mpg cyl mean_mpg_func median_mpg_func
<num> <num> <num> <num>
1: 21.0 6 19.74286 19.7
2: 21.0 6 19.74286 19.7
3: 22.8 4 26.66364 26.0
The results speak for themselves.
Functional Form with Complex Grouping
So far, we have been grouping with only one column in the “by” argument. However, it is possible to have more than one column in the “by” argument while also having another filter in place. In the code below, we filter for mpg greater than 21 while also grouping by the number of cylinders, and whether the car has an automatic transmission or not.
# Add the mean_duration column
mtcars[mpg>21,ave_mpg :=mean(mpg),
by = .(cyl,am)]
mtcars[1:3,c(1,2,9,17)]
mpg cyl am ave_mpg
<num> <num> <num> <num>
1: 21.0 6 1 NA
2: 21.0 6 1 NA
3: 22.8 4 1 28.075
The reason for the NA is that there are no cars that meet the criteria. In other words, there was not more than one car that had an mpg greater than 21 that was 6-cylinder and also had an automatic transmission.
COnclusion
Data.table is just another way to manipulate data inside R. Generally, it is considered faster when dealing with large datasets. The purpose here was only to explore the potential of this package if it is needed.







































































