Regular Expression with R

Regular expressions are used for a variety of purposes. One of the main ones is finding data in your dataset that meets specific criteria. In this post, we will use regular expressions to match the start and end of strings, arbitrary characters, digits, and letters.

Initial Setup

The only package we need is the stringr package. We will also create a vector of names that will serve as our data for the first few examples. Below is the code.

library(stringr)
people<-c("Bob","Brad","Dan","Jason","Tony","Tom")

Commonly Used Symbols

We are going to use the people vector for our data. The first function we will use is the str_detect() function. This function checks each string in your data against a pattern and returns TRUE or FALSE depending on whether the string meets your criteria. The str_detect() function takes the data as the first argument and a pattern as the second argument.

What we are going to do is subset the people vector using str_detect(). We want to find all words that start with the letter B. To tell R to look for words that start with B, we must use the caret (^) symbol in front of the letter B in the pattern argument. Below is the code and the output.

> people[str_detect(people,pattern = "^B")]
[1] "Bob"  "Brad"

The code starts with the vector people. Next, we place all of the code for searching inside brackets. The brackets are used in this example for subsetting the data, that is, for keeping only the data that meets our criteria. Inside the brackets, we are using the str_detect() function. Inside the function is the data we are searching, followed by the pattern argument. Inside the quotes, we have the caret symbol, which means “at the beginning”, followed by the letter B. Our output shows the two words that meet these criteria.
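To make the subsetting step more concrete, here is a minimal sketch that splits the one-liner above into two steps. The variable names starts_with_b and b_names are ours and only serve as illustration.

```r
library(stringr)

people <- c("Bob", "Brad", "Dan", "Jason", "Tony", "Tom")

# str_detect() returns one TRUE/FALSE value per element of the input vector.
starts_with_b <- str_detect(people, pattern = "^B")
print(starts_with_b)

# Subsetting with that logical vector keeps only the positions that are TRUE.
b_names <- people[starts_with_b]
print(b_names)
```

Storing the logical vector separately can be handy when the same condition is reused for several subsetting steps.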

The caret symbol anchors the pattern to the beginning of a string. In the same way, the dollar sign “$” anchors the pattern to the end of a string. Below is the code and output for this symbol.

> people[str_detect(people,pattern = "n$")]
[1] "Dan"   "Jason"

The code is mostly the same as in the previous example. The only difference is the pattern, which shows we want words that end with the letter “n”. The output shows two words that meet these criteria.
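To see what the dollar sign contributes, the sketch below compares the pattern with and without it on the same people vector. The variable names contains_n and ends_with_n are ours.

```r
library(stringr)

people <- c("Bob", "Brad", "Dan", "Jason", "Tony", "Tom")

# Without the anchor, any word containing "n" anywhere is kept.
contains_n <- people[str_detect(people, pattern = "n")]
print(contains_n)

# With the anchor, the "n" must be the last character of the word.
ends_with_n <- people[str_detect(people, pattern = "n$")]
print(ends_with_n)
```

“Tony” appears only in the first result because its “n” is not at the end of the word.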

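The next symbol we will learn is the period “.”. A period matches any single character (other than a line break), so it is used when you need some character to be present but do not care which one. In the example below, the pattern “a.” finds strings that contain the letter “a” followed by at least one more character. Below is the code and output.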

> people[str_detect(people,pattern = "a.")]
[1] "Brad"  "Dan"   "Jason"

Again, the only difference is the pattern. We told R we want to find any words that have the letter “a” followed by any other character. Three words meet these criteria. A word that ended in “a” would not match, because the period needs a character to stand in for.
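The effect of the period is easier to see with a word that ends in “a”. The sketch below uses a small made-up vector, names_demo, in which “Eva” is our own addition and not part of the original data.

```r
library(stringr)

# "Eva" is a hypothetical extra name added only to show the difference.
names_demo <- c("Brad", "Dan", "Eva")

# "a" alone matches any word that contains the letter a.
has_a <- str_detect(names_demo, pattern = "a")
print(has_a)

# "a." also requires one more character after the a, so "Eva" fails.
has_a_then_char <- str_detect(names_demo, pattern = "a.")
print(has_a_then_char)
```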

Multiple Criteria

All of the previous examples looked for one specific character. However, there are several shortcuts that match whole groups of characters, such as any digit or any letter. For the next examples, we need to make a different vector of data, and we will now be using the str_match_all() function, which returns every match of the pattern inside each string.

In the code below, we create a new vector that has words and numbers as data. Because a vector can hold only one type of data, R stores the numbers as strings. Next, we will use the str_match_all() function to find all the digits in each string. To find digits we will use the “\\d” expression.

> people_and_numbers<-c("Bob","Brad","Dan",1,2,3)
> str_match_all(people_and_numbers,"\\d")
[[1]]
     [,1]

[[2]]
     [,1]

[[3]]
     [,1]

[[4]]
     [,1]
[1,] "1" 

[[5]]
     [,1]
[1,] "2" 

[[6]]
     [,1]
[1,] "3"

The output is a little strange at first because it is a list. Since there are six strings in our original vector, there are six items in our list, and each item is a matrix of matches. The first three items are empty because the first three entries in our vector do not contain any digits. The last three items each contain a digit because these are the numbers contained in the original vector.
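If the list output is awkward to work with, one option is to reduce it to a simple count or fall back on str_detect(). The following is a sketch under that assumption; the variable names are ours.

```r
library(stringr)

people_and_numbers <- c("Bob", "Brad", "Dan", 1, 2, 3)

# c() converts the numbers to strings, so the whole vector is character.
vector_class <- class(people_and_numbers)
print(vector_class)

# Each list item is a matrix with one row per match, so nrow() counts matches.
matches <- str_match_all(people_and_numbers, "\\d")
digit_counts <- vapply(matches, nrow, integer(1))
print(digit_counts)

# str_detect() with the same pattern gives a logical vector for subsetting.
with_digits <- people_and_numbers[str_detect(people_and_numbers, "\\d")]
print(with_digits)
```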

The next expression we will learn is “\\w”, which finds word characters. A word character is any letter, digit, or underscore. Below is an example.

> str_match_all(people_and_numbers,"\\w")
[[1]]
     [,1]
[1,] "B" 
[2,] "o" 
[3,] "b" 

[[2]]
     [,1]
[1,] "B" 
[2,] "r" 
[3,] "a" 
[4,] "d" 

[[3]]
     [,1]
[1,] "D" 
[2,] "a" 
[3,] "n" 

[[4]]
     [,1]
[1,] "1" 

[[5]]
     [,1]
[1,] "2" 

[[6]]
     [,1]
[1,] "3"

Notice how the output splits apart the characters in each word, because “\\w” matches one character at a time. Besides this, the output is as expected.
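To show what does and does not count as a word character, here is a small sketch on a made-up string. The string "a_1 b!" is ours and is not part of the original data.

```r
library(stringr)

# Hypothetical test string mixing a letter, underscore, digit, space, and "!".
sample_text <- "a_1 b!"

# str_match_all() returns a list; [[1]] is the match matrix for the one string,
# and [, 1] pulls the matches out as a plain character vector.
word_chars <- str_match_all(sample_text, "\\w")[[1]][, 1]
print(word_chars)
```

The space and the exclamation mark are skipped, while the underscore is kept along with the letters and the digit.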

We can also indicate that we want only letters. This is done by using brackets and dashes: the brackets define a set of allowed characters, and a dash defines a range such as A-Z. Below is the code and output.

> str_match_all(people_and_numbers,"[A-Za-z]")
[[1]]
     [,1]
[1,] "B" 
[2,] "o" 
[3,] "b" 

[[2]]
     [,1]
[1,] "B" 
[2,] "r" 
[3,] "a" 
[4,] "d" 

[[3]]
     [,1]
[1,] "D" 
[2,] "a" 
[3,] "n" 

[[4]]
     [,1]

[[5]]
     [,1]

[[6]]
     [,1]

The output is mostly the same. The first three words are split apart. However, the last three items are empty because the numbers do not contain letters.
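Ranges inside brackets can also be narrowed. As a sketch, the code below keeps only uppercase letters with “[A-Z]” and then pastes the matches for each item back into one string. The helper variable names are ours.

```r
library(stringr)

people_and_numbers <- c("Bob", "Brad", "Dan", 1, 2, 3)

# "[A-Z]" matches uppercase letters only, so lowercase letters are ignored.
upper_matches <- str_match_all(people_and_numbers, "[A-Z]")

# Collapse each match matrix into a single string; empty matrices give "".
upper_letters <- vapply(
  upper_matches,
  function(m) paste(m[, 1], collapse = ""),
  character(1)
)
print(upper_letters)
```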

You can put almost anything inside the brackets. In the example below, we are only looking for vowels.

> str_match_all(people_and_numbers,"[aeiou]")
[[1]]
     [,1]
[1,] "o" 

[[2]]
     [,1]
[1,] "a" 

[[3]]
     [,1]
[1,] "a" 

[[4]]
     [,1]

[[5]]
     [,1]

[[6]]
     [,1]

All six items are still in the list, but only the first three, which contain vowels, have any matches.
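When only the number of matches matters, stringr's str_count() function, which is not used elsewhere in this post, gives one count per string. The sketch below applies it to the people vector from the start of the post. The name “Amy” is a made-up example of ours, added to show that the pattern is case sensitive.

```r
library(stringr)

people <- c("Bob", "Brad", "Dan", "Jason", "Tony", "Tom")

# Count the lowercase vowels in each name.
vowel_counts <- str_count(people, "[aeiou]")
print(vowel_counts)

# "Amy" is a hypothetical name: its only vowel is uppercase, so the
# lowercase-only set finds nothing until the uppercase vowels are added.
lower_only <- str_count("Amy", "[aeiou]")
both_cases <- str_count("Amy", "[AEIOUaeiou]")
print(c(lower_only, both_cases))
```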

Conclusion

These are just some of the things that regular expressions allow you to do. Whenever you need to wrestle with text, it is worth remembering how regular expressions can help you.