Topics Models in R

Topic models is a tool that can group text by their main themes. It involves the use of probability based on word frequencies. The algorithm that does this is called the Latent Dirichlet Allocation algorithm.

IN this post, we will use some text mining tools to analyze religious/philosophical text the five texts we will look at are The King James Bible The Quran The Book Of Mormon The Gospel of Buddha Meditations, by Marcus Aurelius

The link for access to these five text files is as follows https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip

Once you unzip it you will need to rename each file appropriately.

The next few paragraphs are almost verbatim from the post text mining in R. This is because the data preparation is essentially the same. Small changes were made but original material is found in the analysis section of this post.

We will now begin the actual analysis. The package we need or “tm” and “topicmodels” Below is some initial code.

library(tm);library(topicmodels)

Data Preparation

We need to do three things for each text file

  1. Paste it
  2. convert it
  3. write a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different as the location of the file on your computer will be different. The “what” argument tells are what to take from the file and the “Collapse” argument deals with whitespace

bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")

The last step of the preparation is the creation of tables. What you are doing is you are taking the objects you have already created and are moving them to their own folder. The text files need to be alone in order to conduct the analysis. Below is the code.

write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")

Corpus Development

We are now ready to create the corpus. This is the object we use to clean the text together rather than individually as before. First, we need to make the corpus object, below is the code. Notice how it contains the directory where are tables are

docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))

There are many different ways to prepare the corpus. For our example, we will do the following…

lower case all letters-This avoids the same word be counted separately (ie sheep and Sheep)

  • Remove numbers
  • Remove punctuation-Simplifies the document
  • Remove whitespace-Simplifies the document
  • Remove stopwords-Words that have a function but not a meaning (ie to, the, this, etc)
  • Remove custom words-Provides additional clarity

Below is the code for this

docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
                                "thee","even","yet","every","said","this","can","unto","upon",
                                "cant",'shall',"will","that","weve","dont","wont"))

We now need to create the matrix. The document matrix is what r will actually analyze. We will then remove sparse terms. Sparse terms are terms that do not occur are a certain percentage in the matrix. For our purposes, we will set the sparsity to .60. This means that a word must appear in 3 of the 5 books of our analysis. Below is the code. The ‘dim’ function will allow you to see how the number of terms is reduced drastically. This is done without losing a great deal of data will speeding up computational time.

dtm<-DocumentTermMatrix(docs)
dim(dtm)
## [1]     5 24368
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
## [1]    5 5265

Analysis

We will now create our topics or themes. If there is no a priori information on how many topics to make it os up to you to decide how many. We will create three topics. The “LDA” function is used and the argument “k” is set to three indicating we want three topics. Below is the code

set.seed(123)
lda3<-LDA(dtm,k=3)

We can see which topic each book was assigned to using the “topics” function. Below is the code.

topics(lda3)
##       bible.txt      buddha.txt meditations.txt      mormon.txt 
##               2               3               3               1 
##       quran.txt 
##               3

According to the results. The book of Mormon and the Bible were so unique that they each had their own topic (1 and 3). The other three text (Buddha, Meditations, and the Book of Mormon) were all placed in topic 2. It’s surprising that the Bible and the Book of Mormon were in separate topics since they are both Christian text. It is also surprising the Book by Buddha, Meditations, and the Quran are all under the same topic as it seems that these texts have nothing in common.

We can also use the “terms” function to see what the most common words are for each topic. The first argument in the function is the model name followed by the number of words you want to see. We will look at 10 words per topic.

terms(lda3, 10)
##       Topic 1  Topic 2  Topic 3 
##  [1,] "people" "lord"   "god"   
##  [2,] "came"   "god"    "one"   
##  [3,] "god"    "israel" "things"
##  [4,] "behold" "man"    "say"   
##  [5,] "pass"   "son"    "truth" 
##  [6,] "lord"   "king"   "man"   
##  [7,] "yea"    "house"  "lord"  
##  [8,] "land"   "one"    "life"  
##  [9,] "now"    "come"   "see"   
## [10,] "things" "people" "good"

Interpreting these results takes qualitative skills and is subjective. They all seem to be talking about the same thing. Topic 3 (Bible) seems to focus on Israel and Lord while topic 1 (Mormon) is about God and people. Topic 2 (Buddha, Meditations, and Quran) speak of god as well but the emphasis has moved to truth and the word one.

Conclusion

This post provided insight into developing topic models using R. The results of a topic model analysis is highly subjective and will often require strong domain knowledge. Furthermore, the number of topics is highly flexible as well and in the example in this post we could have had different numbers of topics for comparative purposes.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s