Monthly Archives: February 2026

Column Computation with Data Table in R

The data table data structure is a great way to manipulate your data to address various questions you may have. In this post, we will learn about filtering, dealing with text, and more complex numerical calculations.

Packages and Data Preparation

We will begin by loading our package data.table and converting our datasets mtcars and iris, into data tables. Both mtcars and iris are preinstalled on R. Below is the code.

library(data.table)
mtcars<-data.table(mtcars)
iris<-data.table(iris)

Next, we will quickly examine both datasets using the head() function to understand what each one is about.

We now move to filtering.

Filtering for Not

Our first exercise is the use of NOT logic in filtering. With NOT logic, you are filtering for what is not included in your code. For example, in the code below, we are telling R to display all cars that do not have a transmission. The code for NOT is != which means “does not equal”. Below is the code and example.

> # Filter all rows where am is not 0
> not_0_am <- mtcars[am !=0]
> not_0_am
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 5:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 6:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 7:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 8:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 9:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
10:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
11:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
12:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
13:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2
>

Of course, you can have more than one argument within your code, as we will see in the next example.

Multiple Commands for Not

It is also possible to include multiple commands. In the example below, we are filtering for cars with an automatic transmission (am==1) but do not have 6 cylinders (cyl != 6). The output matches the criteria that were set

> # Filter all rows where am is 0 AND cyl is not 6
> am_cyl <- mtcars[am==1 & cyl != 6]
> am_cyl
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 2:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 3:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 4:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 5:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 6:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 7:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
 8:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
 9:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
10:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Searching Text

It is also possible to search for text and even numbers. In the code below, we are searching the iris dataset for the species “setosa” and for petal lengths that are less than 1.3

> #with text
> iris[Species=="setosa" & Petal.Length<1.3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          4.3         3.0          1.1         0.1  setosa
2:          5.8         4.0          1.2         0.2  setosa
3:          4.6         3.6          1.0         0.2  setosa
4:          5.0         3.2          1.2         0.2  setosa

We can also search for text when unsure what we are looking for. In the example below, we use the %like% argument to search the Specias column for text containing the letter v. Since the results are rather long, we use the head() function to see the first few rows.

> # Filter all rows where Species contains "V"
> any_v <- iris[Species %like% "v"]
> head(any_v)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          7.0         3.2          4.7         1.4 versicolor
2:          6.4         3.2          4.5         1.5 versicolor
3:          6.9         3.1          4.9         1.5 versicolor
4:          5.5         2.3          4.0         1.3 versicolor
5:          6.5         2.8          4.6         1.5 versicolor
6:          5.7         2.8          4.5         1.3 versicolor

Another way to search text is by looking for words that end with something. In the example below, we are looking for words in the Species column that end with the word “color.” We indicate this to are by using the %like% argument again and the word “color” with a dollar sign at the end of it. The dollar sign tells R to look for this word at the end of a word in the Species column.

> # Filter all rows where Species ends with "color"
> end_flowers <- iris[Species %like% "color$"]
> head(end_flowers)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          7.0         3.2          4.7         1.4 versicolor
2:          6.4         3.2          4.5         1.5 versicolor
3:          6.9         3.1          4.9         1.5 versicolor
4:          5.5         2.3          4.0         1.3 versicolor
5:          6.5         2.8          4.6         1.5 versicolor
6:          5.7         2.8          4.5         1.3 versicolor

Multiple Numerical Arguments

Multiple numerical arguments are also possible. In the example shown below, we are looking for all cars in the mtcars dataset that are 4 or 6 cylinders. We achieve this by listing the variable we are searching “cyl” followed by the %in% argument, and lastly we use the c() function and include our values inside it. Below is the code and output.

> # Filter all rows where cyl is 4 or 6
> filter_cyl <- mtcars[cyl %in% c(4, 6)]
> filter_cyl
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 6:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 7:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
 8:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
 9:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4
10:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
11:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
12:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
13:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
14:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
15:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
16:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
17:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
18:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

In this last example, we learn to find data that meets a range rather than just specific values. In the code below, we are looking for cars that have an mpg between 20 and 22. The new argument in this example is the %between% argument, which is used to tell R to search for a range of values. Below is the code, followed by the output

> # Filter all rows where mpg is between [20, 22]
> mpg_20_22 <- mtcars[mpg %between% c(20,22)]
> mpg_20_22
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
5:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Conclusion

Data tables provide a different way of pulling insights from data. The value of this approach becomes clearer when dealing with large datasets in which speed becomes important.

Comparing Data with Python VIDEO

Leave a reply

The comparison of data can be useful to determine if it is necessary to use additional statistical tests to confirm a significant difference. In the video below, we look at several simple ways to compare data using Python.

Annotating Visualizations in Python

Leave a reply

Annotating data allows you to communicate vital information in a visualization for an audience. In the example below, we will look at how to annotate a visualization while using Python.

Libraries and Data Preparation

We will begin by loading the needed libraries and preparing the data. In the code below, lines 1 and 3 load our visualization libraries. Line 2 loads the function we will need to load our data.

import seaborn as sns
from pydataset import data
import matplotlib.pyplot as plt

In the code below, we use the data() function to load the Prestige data from pydataset into an object called df. Then, we display the head of this data using the .head() method.

df=data('Prestige')
df.head()

Our dataset contains various jobs measured on five dimensions. In our code below, we will focus on using the education, income, and prestige variables.

Making a Comment

Now we will add a comment to our visualization. Specifically, we will point out the highest income value. Below is the code, followed by the visualization

# Draw basic scatter plot of education data and income 
sns.scatterplot(x = 'education', y = 'income', data = df)

# Label highest income value with text annotation
plt.text(6, 25000,
         'The max income is over 25000', 
         # Set the font to large
         fontdict = {'ha': 'left', 'size': 'large'})
plt.show()

The first step was to make a basic scatterplot. We use the .scatterplot() method from seaborn and plot education and income from our df dataset. Next, we set up our text using the .text() method from matplotlib. For the text, we set an x and y value and then indicate what the text should say. Below that, we adjust the font to come from the left and top to be large in size.

Arrow Annotation

Using an arrow is another way to bring attention to data in a visualization. In the code below, we will use an arrow that will point to the same data point that we used in the previous example. Below is the code, followed by the visualization.

# Query and filter to General Managers
women_census = df.query("(women  ==  4.02) & (census  ==  1130)")
prestige_type = df.query("(prestige  ==  69.1) & (type  ==  'prof')")

sns.scatterplot(x = 'education', y = 'income',
                data = df)

# Point arrow to General Managers 
plt.annotate('General Managers',
             xy = (women_census.education, prestige_type.income),
             xytext = (6.5, 15000), 
             # Shrink the arrow to avoid occlusion
             arrowprops = {'facecolor':'gray', 'width': 3, 'shrink': 0.03},
             backgroundcolor = 'white')
plt.show()

Here is what we did,

We create two queries to locate the data point we want the arrow to point to. All the values in the .query() method for both the woman_census and the prestige_type are values from the general manager row. As shown below,

These two objects are used to locate general managers in the dataset.

2. We make the same scatterplot as shown before

3. The .annotate() method is used. We start by writing in quotes what we want to appear in the scatterplot. Next, we set the x and y coordinates of the data point we want to highlight using the women_census and prestige_type queries we did previously. From there, we have to set the text location. After this, we set the arrow properties in terms of the color, width, and size. Lastly, the background color is set.

Annotation with Color & Text

Color annotation provides a contrast based on color. Below is the code and the visualization when this approach is used.

# Make a vector where prof is orangered; else lightgray
prof = ['orangered' if type  ==  'prof' else 'lightgray' for type in df['type']]

# Map facecolors to the list prof and set alpha to 0.3
sns.regplot(x = 'education',
            y = 'income',
            data = df,
            fit_reg = False,
            scatter_kws = {'facecolors':prof, 'alpha': 0.3})

# Add annotation to plot
plt.text(11, 23000, 'General Managers')
plt.show()

This approach is simpler compared to the last one. We begin by separating the data based on type. Professionals are colored orange, and the rest are colored light gray. Next, we create our scatterplot using the .regplot() method this time. Education and income are the x and y axes, the regression line is removed, and the color of the dots is set using the scatter_kws argument. The “prof” argument provides the coloring rules, and the alpha is set to make the points transparent. The next step uses the .text() method to set the x and y coordinates for the text.

Conclusion

Annotation is one of many ways to bring attention to crucial insights in a visualization. The examples above provide some of the many ways this tool can be used to provide crucial information when using Python

Highlighting Data Points with Python VIDEO

Leave a reply

The video below provides two methods that can be used to highlight data points using Python. Which method to use depends on the context, but an analyst should be familiar with both.