Fuzzy Joins with R

In this post, we will look at how to join datasets using fuzzy criteria, in which the values being matched are similar but not identical. To do this, we will use the following packages in R.

library(stringdist)
library(stringr)
library(fuzzyjoin)

The “stringdist” package will be used to measure the differences between various strings that we will use. The “stringr” package will be used to manipulate some text. Lastly, the “fuzzyjoin” package will be used to join datasets.

String Distance

The stringdist() function measures the difference between strings, and it supports several distance metrics. For our purposes, the distance is the number of edits needed to turn one string into the other. Below is code that compares the same pair of strings using three different methods.

> stringdist("darrin", "darren", method = "lv")
[1] 1
> stringdist("darrin", "darren", method = "dl")
[1] 1
> stringdist("darrin", "darren", method = "osa")
[1] 1

This code is simple. Inside the function, the first string is the ground truth string that everything else is compared to, and the second string is the one compared against it. The “method” argument sets how the difference is measured, and each example uses a different one: “lv” stands for Levenshtein distance, “dl” for full Damerau-Levenshtein distance, and “osa” for Optimal String Alignment distance. The details of each method can be found in the documentation for the “stringdist” package, which also provides other methods beyond these three.

The value for these methods is 1, which means that only one change is needed to convert “darren” to “darrin”. Most of the time the methods are highly similar in their results but just to demonstrate, below is an example where the methods disagree.

> stringdist("darrin", "dorirn", method = "lv")
[1] 3
> stringdist("darrin", "dorirn", method = "dl")
[1] 2
> stringdist("darrin", "dorirn", method = "osa")
[1] 2

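Now, the values are different. Levenshtein distance counts only insertions, deletions, and substitutions, so it needs one substitution for “a” to “o” plus two more to turn “ri” into “ir”, for a total of 3. The “osa” and “dl” methods also count a swap of two adjacent characters as a single edit, so they report 2. Further details are explained in the documentation.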
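To see where the three methods part ways, here is a minimal sketch using two toy string pairs that are not part of the examples above. The first pair differs only by a swap of adjacent letters. The second is a commonly cited case in which “osa” and “dl” give different answers, because “osa” does not allow the same substring to be edited more than once.

```r
library(stringdist)

# Toy pair: the two letters are simply swapped.
# Levenshtein needs two substitutions, while osa and dl
# count the swap as one transposition.
swap <- sapply(
  c("lv", "osa", "dl"),
  function(m) stringdist("ab", "ba", method = m)
)
print(swap)

# Toy pair where osa and dl differ from each other.
osa_ca <- stringdist("ca", "abc", method = "osa")
dl_ca  <- stringdist("ca", "abc", method = "dl")
print(c(osa = osa_ca, dl = dl_ca))
```

In the first comparison “lv” returns 2 while the other two return 1. In the second, “osa” returns 3 and “dl” returns 2.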

amatch()

The amatch() function allows you to compare multiple strings to the ground truth and indicate which one is closest to the original ground truth string. Below is the code and output from this function.

amatch(
  x = "Darrin",
  table = c("Darren", "Daren", "Darien"),
  maxDist = 1,
  method = "lv" 
) 
[1] 1

Here is what we did.

The “x” argument is the ground truth. In other words, all other strings are compared to the value of x.
The “table” argument contains all the strings that are being compared to the x argument.
“maxDist” is the maximum distance, or number of changes, allowed for a string in the “table” to be considered a match.
“method” is the method used to calculate the distance.
The output is 1. This is the position of the best match, which means that the first string in the table, “Darren”, is the closest to “Darrin” and is within the maximum distance of 1.
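Since amatch() returns a position rather than the string itself, it can help to look at the underlying distances and then use the position to pull out the match. The sketch below reuses the strings from the example. The variable names are only for illustration.

```r
library(stringdist)

candidates <- c("Darren", "Daren", "Darien")

# Distance from "Darrin" to every candidate at once
dists <- stringdist("Darrin", candidates, method = "lv")
print(dists)

# amatch() gives the position of the closest candidate within maxDist
idx <- amatch("Darrin", candidates, maxDist = 1, method = "lv")
best <- candidates[idx]
print(best)

# If no candidate is within maxDist, the result is NA
none <- amatch("Darrin", c("Daren", "Darien"), maxDist = 1, method = "lv")
print(none)
```

The distances are 1, 2, and 2, so only “Darren” qualifies when “maxDist” is 1. With “Darren” removed from the table, nothing is close enough and amatch() returns NA.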

Fuzzy Join

The fuzzy join is used to join tables that have columns that are similar but not the same. Normally, joins work on exact matches but the fuzzy join does not require this. Before we use this function we have to prepare some data. We will modify the “Titanic” dataset to run this example. The “Titanic” dataset is a part of R by default and there is no need for any packages. Below is the code for the data preparation.

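# Two identical copies of the built-in Titanic table, as dataframes
Titanic_1 <- as.data.frame(Titanic)
Titanic_2 <- as.data.frame(Titanic)

# Lowercase two columns in one copy so the values no longer match exactly
Titanic_1$Sex <- str_to_lower(Titanic_1$Sex)
Titanic_1$Age <- str_to_lower(Titanic_1$Age)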

Here is what we did.

We saved two copies of the “Titanic” dataset as dataframes. This was done because the fuzzy join function needs dataframes.
Next, we made clear differences between the two datasets. For “Titanic_1” we lowercased the “Sex” and “Age” columns so that there would not be an exact match when joining these two dataframes with the fuzzy join function.
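As a quick sanity check, the sketch below repeats the preparation and confirms that none of the lowercased values appear in the untouched copy, which is why an exact join on these columns would find nothing. The overlap variables are only for illustration.

```r
library(stringr)

Titanic_1 <- as.data.frame(Titanic)
Titanic_2 <- as.data.frame(Titanic)
Titanic_1$Sex <- str_to_lower(Titanic_1$Sex)
Titanic_1$Age <- str_to_lower(Titanic_1$Age)

# Distinct values on each side of the join
print(unique(Titanic_1$Sex))
print(levels(Titanic_2$Sex))

# TRUE would mean at least one value still matches exactly
sex_overlap <- any(Titanic_1$Sex %in% Titanic_2$Sex)
age_overlap <- any(Titanic_1$Age %in% Titanic_2$Age)
print(c(sex = sex_overlap, age = age_overlap))
```

Both checks come back FALSE, so any match between the two dataframes on these columns has to come from the fuzzy comparison.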

We will now use the fuzzy join function. Below is the code, followed by a description of the output.

stringdist_join(
  Titanic_1,
  Titanic_2,
  by = c("Age" = "Age", "Sex" = "Sex"),
  method = "lv",
  max_dist = 1,
  distance_col = "distance"
)

The stringdist_join() function is used to perform the fuzzy join. “Titanic_1” is the x dataset and “Titanic_2” is the y dataset. The “by” argument tells the function which columns are being used in the join. The “method” argument indicates how the distance is calculated between the values in those columns. The “max_dist” argument is the threshold for a match. In other words, if the distance is greater than one, no join will take place for that pair of rows. Lastly, the “distance_col” argument creates new columns that show the distance between the compared columns.
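The same arguments are easier to follow on a very small example. The sketch below uses two made-up dataframes, “people” and “roster”, which are not part of the Titanic example. It joins on a single column and sets “mode” to "inner" explicitly, which, as far as I know, is also what stringdist_join() uses when “mode” is left out.

```r
library(fuzzyjoin)

# Hypothetical toy data: names are spelled slightly differently on each side
people <- data.frame(name = c("darrin", "maria"), score = c(90, 85))
roster <- data.frame(name = c("darren", "marie", "john"),
                     team = c("a", "b", "c"))

joined <- stringdist_join(
  people,
  roster,
  by = "name",
  method = "lv",
  max_dist = 1,               # pairs more than one edit apart are dropped
  mode = "inner",             # keep only the pairs that matched
  distance_col = "distance"   # record how far apart each matched pair is
)
print(joined)
```

Only “darrin”/“darren” and “maria”/“marie” are within one edit of each other, so the result has two rows. “john” has no close match and is dropped by the inner join.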

All columns from both datasets are present in the output. The columns ending in “.x” are from “Titanic_1”, while the columns ending in “.y” are from “Titanic_2”. Note that stringdist_join() performs an inner join unless the “mode” argument says otherwise, so only pairs of rows that fall within “max_dist” are kept. The columns ending in “.distance” tell us the difference found when the values from each dataset were compared. For example, in row 1 the “Age.distance” is 1. This means that the difference between “Age.x” and “Age.y” is 1. The only difference is that “Age.x” is lowercase while “Age.y” is capitalized.

Conclusion

The tools mentioned here allow you to match data that is similar but not identical, with a clear metric of how different the matched values are. This can be powerful when you have to join datasets that do not share an exactly matching column. Therefore, there is a place for these tools in the life of any data analyst who deals with fuzzy data like this.

educational research techniques
