Text Mining in R

Text mining is descriptive analysis tool that is applied to unstructured textual data. By unstructured, it is meant data that is not stored in relational databases. The majority of data on the Internet and the business world, in general, is of an unstructured nature. As such, the use of text mining tools has grown in importance over the past two decades.

In this post, we will use some text mining tools to analyze religious/philosophical text the five texts we will look at are

  • The King James Bible
  • The Quran
  • The Book Of Mormon
  • The Gospel of Buddha
  • Meditations, by Marcus Aurelius

The link for access to these five text files is as follows

https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip

Once you unzip it you will need to rename each file appropriately.

The actual process of text mining is rather simple and does not involve a great deal of complex coding compared to other machine learning applications. Primarily you need to do the follow Prep the data by first scanning it into r, converting it to ASCII format, and creating the write table for each text Create a corpus that is then cleaned of unnecessary characters Conduct the actual descriptive analysis

We will now begin the actual analysis. The package we need or “tm” for text mining, “wordcloud”, and “RColorBrewer” for visuals. Below is some initial code.

library(tm);library(wordcloud);library(RColorBrewer)

Data Preparation

We need to do three things for each text file

  • Paste
  •  convert it
  • write a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different as the location of the file on your computer will be different. The “what” argument tells are what to take from the file and the “Collapse” argument deals with whitespace

bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")

The last step of the preparation is the creation of tables. Primarily you are taken the objects you have already created and moved them to their own folder. The text files need to be alone in order to conduct the analysis. Below is the code.

write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")

For fun, you can see a snippet of each object by simply typing its name into r as shown below.

bible
##[1] "x 1 The Project Gutenberg EBook of The King James Bible This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: The King James Bible Release Date: March 2, 2011 [EBook #10] [This King James Bible was orginally posted by Project Gutenberg in late 1989] Language: English *** START OF THIS PROJECT

Corpus Creation

We are now ready to create the corpus. This is the object we use to clean the text together rather than individually as before. First, we need to make the corpus object, below is the code. Notice how it contains the directory where are tables are

docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))

There are many different ways to prepare the corpus. For our example, we will do the following…

  • lower case all letters-This avoids the same word be counted separately (ie sheep and Sheep)
  • Remove numbers
  • Remove punctuation-Simplifies the document Remove whitespace-Simplifies the document
  • Remove stopwords-Words that have a function but not a meaning (ie to, the, this, etc)
  • Remove custom words-Provides additional clarity

Below is the code for this

docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
#docs<-tm_map(docs,stemDocument)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
                                "thee","even","yet","every","said","this","can","unto","upon",
                                "cant",'shall',"will","that","weve","dont","wont"))

We now need to create the matrix. The document matrix is what r will actually analyze. We will then remove sparse terms. Sparse terms are terms that do not occur are a certain percentage in the matrix. For our purposes, we will set the sparsity to .60. This means that a word must appear in 3 of the 5 books of our analysis. Below is the code. The ‘dim’ function will allow you to see how the number of terms is reduced drastically. This is done without losing a great deal of data will speeding up computational time.

dtm<-DocumentTermMatrix(docs)
dim(dtm)
## [1]     5 24368
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
## [1]    5 5265

Analysis

We now can explore the text. First, we need to make a matrix that has the sum of the columns od the document term matrix. Then we need to change the order of the matrix to have the most frequent terms first. Below is the code for this.

freq<-colSums(as.matrix(dtm))
ord<-order(-freq)#changes the order to descending

We can now make a simple bar plot to see what the most common words are. Below is the code

barplot(freq[head(ord)])

1

As expected with religious text. The most common terms are religious terms. You can also determine what words appeared least often with the code below.

freq[tail(ord)]
##   posting   secured    smiled      sway swiftness worthless 
##         3         3         3         3         3         3

Notice how each word appeared 3 times. This may mean that the 3 terms appear once in three of the five books. Remember we set the sparsity to .60 or 3/5.

Another analysis is to determine how many words appear a certain number of times. For example, how many words appear 200 times or 300. Below is the code.

head(table(freq))
## freq
##   3   4   5   6   7   8 
## 117 230 172 192 191 187

Using the “head” function and the “table” function gives us the six most common values of word frequencies. Three words appear 117 times, four appear 230 times, etc. Remember the “head” gives the first few values regardless of their amount

The “findFreqTerms” function allows you to set a cutoff point of how frequent a word needs to be. For example, if we want to know how many words appeared 3000 times we would use the following code.

findFreqTerms(dtm,3000)
##  [1] "behold" "came"   "come"   "god"    "land"   "lord"   "man"   
##  [8] "now"    "one"    "people"

The “findAssocs” function finds the correlation between two words in the text. This provides insight into how frequently these words appear together. For our example, we will see which words are associated with war, which is a common subject in many religious texts. We will set the correlation high to keep the list short for the blog post. Below is the code

findAssocs(dtm,"war",corlimit =.998) 
## $war
##     arrows      bands   buildeth    captive      cords     making 
##          1          1          1          1          1          1 
##  perisheth prosperity      tower      wages      yield 
##          1          1          1          1          1

The interpretation of the results can take many forms. It makes sense for ‘arrows’ and ‘captives’ to be associated with ‘war’ but ‘yield’ seems confusing. We also do not know the sample size of the associations.

Our last technique is the development of a word cloud. This allows you to see word frequency based on where the word is located in the cloud as well as its size. For our example, we will set it so that a word must appear at least 1000 times in the corpus with more common words in the middle. Below is the code.

wordcloud(names(freq),freq,min.freq=1000,scale=c(3,.5),colors=brewer.pal(6,"Dark2"),random.color = F,random.order = F)

1.png

Conclusion

This post provided an introduction to text mining in R. There are many more complex features that are available for the more serious user of R than what is described here

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s