Modifiying Data Tables in R

In this post, we will look at how you can modify data tables in R. Specifically, we will look at how to add columns, fix errors, and calculate values. Below is the initial code to prepare the data we will use.

library(data.table)
mtcars<-data.table(mtcars)

In the code above, we load the data.table package. We then convert the “mtcars” dataset to be a data.table in the second line. Below is a look at all the columns and the first few rows of data

head(mtcars)
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Adding a New Column

Adding a new column involves a simple process. In the code below, we call “mtcars” and inside the brackets, we place a comma first. This tells R that we want all rows of data.

After the comma, we create a name for our new column called “distance_travel.” After this, we place the := notation to indicate that we are calculating a value. After the :=, we write mpg * 4. This means multiply the mpg column by 4.

# Add a new column, travel distance
mtcars[, distance_travel := mpg*4]

Below is the output for the code above. To save space, we will not print every column. Instead, we will subset what we want as shown below.

mtcars[1:3,c(1,10:12)]
     mpg  gear  carb distance_travel
   <num> <num> <num>           <num>
1:  21.0     4     4            84.0
2:  21.0     4     4            84.0
3:  22.8     4     1            91.2

You can see the calculated value in the fourth column. This value is mpg multiplied by 4.

Fixing Errors

We can also fix errors for this example. Let’s assume that any value less than 21.5 in the mpg column is a mistake. We can replace those values with NA with the code below.

#fix errors change mpg less than 21.5 to NA
mtcars[mpg <21.5, mpg := NA]

In this code, you can see that we use brackets again and indicate that we are looking for all mpg values that are less than 21.5. These values will be rewritten with NA. Below is the output for select columns and rows only.

mtcars[1:3,1:3]
     mpg   cyl  disp
   <num> <num> <num>
1:    NA     6   160
2:    NA     6   160
3:  22.8     4   108

The NAs are where there used to be mpg values less than 21.5

Adding Columns for Groups

Another value we may want to calculate is to count values by groups. In the code below, we count the number of cars that have an automatic or manual transmission. To do this, we use brackets again. Inside the brackets, we place a comma. After the comma, we name our new variable “total_am” followed by the := notation. After the := notation, we place .N. The .N notation tells are to count all rows. In this case, we are counting all rows by the variable “am,” which stands for automatic transmission, and yes or no. Below is the code and output

# Add a new column equal to total cars with automatic transmission
 mtcars[, total_am := .N, by = am]
mtcars[1:5,c(1,9,13)]
     mpg    am total_am
   <num> <num>    <int>
1:    NA     1       13
2:    NA     1       13
3:  22.8     1       13
4:    NA     0       19
5:    NA     0       19

You can see that there are 13 cars with automatic transmissions and 19 cars without automatic transmissions. Each row has a 13 if it has an automatic transmission and a 19 if it does not. This calculation is something similar to what a windowing function does in SQL.

Calculate Values of Groups

It is also possible to calculate other values. In the code below, we calculate the average mpg by the number of cylinders a car has. The syntax for this code should be looking familiar by now. Notice how after the := notation, it is possible to use a function. In our case, we are using the mean() function.

# Calculate the mean mpg by cyl 
mtcars[, mean_mpg:=mean(mpg,na.rm=TRUE), 
            by = cyl]
mtcars[1:3,c(1,2,13)]
mpg cyl mean_mpg
<num> <num> <num>
1: 21.0 6 19.74286
2: 21.0 6 19.74286
3: 22.8 4 26.66364

The results are similar to the previous example, except this time we calculate the mean of mpg by cyl.

Using LHS := RHS Form

LHS := RHS Form is another way to indicate to R what you want to do. On the left-hand side, you create the names of your columns. After the := sign you place the functions you are using. Notice how the functions are wrapped inside parentheses with a period on the outside. Also note the comma at the beginning of the square brackets and in front of the “by” argument.

# Add columns using the LHS := RHS form
mtcars[, c("mean_mpg", "median_mpg") := .(mean(mpg), median(mpg)), 
        by = cyl]

In the code above, we calculate the mean and the median mpg by the number of cylinders. Below is the output

mtcars[1:3,c(1,2,13,14)]
     mpg   cyl mean_mpg median_mpg
   <num> <num>    <num>      <num>
1:  21.0     6 19.74286       19.7
2:  21.0     6 19.74286       19.7
3:  22.8     4 26.66364       26.0

You can clearly see the two new columns that show the mean and median for mpg.

Functional Form

Functional form is another way to get the same results. In the code below, we are still using the square brackets. Inside the square brackets, we first have a comma. Next, you have our := symbol, but this time the := symbol is inside grave accents (“). The grave accent is next to the number 1 on a standard keyboard and is also home to the tilde sign (~). After the := symbol you create the column name,s followed by the function you are using, separated by commas. After all of this, you indicate the grouping using the “by” argument.

# Add columns using the functional form
mtcars[, `:=`(mean_mpg_func = mean(mpg), 
               median_mpg_func = median(mpg)), 
        by = cyl]
mtcars[1:3,c(1,2,15,16)]
     mpg   cyl mean_mpg_func median_mpg_func
   <num> <num>         <num>           <num>
1:  21.0     6      19.74286            19.7
2:  21.0     6      19.74286            19.7
3:  22.8     4      26.66364            26.0

The results speak for themselves.

Functional Form with Complex Grouping

So far, we have been grouping with only one column in the “by” argument. However, it is possible to have more than one column in the “by” argument while also having another filter in place. In the code below, we filter for mpg greater than 21 while also grouping by the number of cylinders, and whether the car has an automatic transmission or not.

# Add the mean_duration column
mtcars[mpg>21,ave_mpg :=mean(mpg),
        by = .(cyl,am)]
mtcars[1:3,c(1,2,9,17)]
     mpg   cyl    am ave_mpg
   <num> <num> <num>   <num>
1:  21.0     6     1      NA
2:  21.0     6     1      NA
3:  22.8     4     1  28.075

The reason for the NA is that there are no cars that meet the criteria. In other words, there was not more than one car that had an mpg greater than 21 that was 6-cylinder and also had an automatic transmission.

COnclusion

Data.table is just another way to manipulate data inside R. Generally, it is considered faster when dealing with large datasets. The purpose here was only to explore the potential of this package if it is needed.

Leave a Reply