Topic models is a tool that can group text by their main themes. It involves the use of probability based on word frequencies. The algorithm that does this is called the Latent Dirichlet Allocation algorithm.
IN this post, we will use some text mining tools to analyze religious/philosophical text the five texts we will look at are The King James Bible The Quran The Book Of Mormon The Gospel of Buddha Meditations, by Marcus Aurelius
The link for access to these five text files is as follows https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip
Once you unzip it you will need to rename each file appropriately.
The next few paragraphs are almost verbatim from the post text mining in R. This is because the data preparation is essentially the same. Small changes were made but original material is found in the analysis section of this post.
We will now begin the actual analysis. The package we need or “tm” and “topicmodels” Below is some initial code.
library(tm);library(topicmodels)
Data Preparation
We need to do three things for each text file
- Paste it
- convert it
- write a table
Below is the code for pasting the text into R. Keep in mind that your code will be slightly different as the location of the file on your computer will be different. The “what” argument tells are what to take from the file and the “Collapse” argument deals with whitespace
bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")
Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.
bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")
The last step of the preparation is the creation of tables. What you are doing is you are taking the objects you have already created and are moving them to their own folder. The text files need to be alone in order to conduct the analysis. Below is the code.
write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")
Corpus Development
We are now ready to create the corpus. This is the object we use to clean the text together rather than individually as before. First, we need to make the corpus object, below is the code. Notice how it contains the directory where are tables are
docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))
There are many different ways to prepare the corpus. For our example, we will do the following…
lower case all letters-This avoids the same word be counted separately (ie sheep and Sheep)
- Remove numbers
- Remove punctuation-Simplifies the document
- Remove whitespace-Simplifies the document
- Remove stopwords-Words that have a function but not a meaning (ie to, the, this, etc)
- Remove custom words-Provides additional clarity
Below is the code for this
docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
"thee","even","yet","every","said","this","can","unto","upon",
"cant",'shall',"will","that","weve","dont","wont"))
We now need to create the matrix. The document matrix is what r will actually analyze. We will then remove sparse terms. Sparse terms are terms that do not occur are a certain percentage in the matrix. For our purposes, we will set the sparsity to .60. This means that a word must appear in 3 of the 5 books of our analysis. Below is the code. The ‘dim’ function will allow you to see how the number of terms is reduced drastically. This is done without losing a great deal of data will speeding up computational time.
dtm<-DocumentTermMatrix(docs)
dim(dtm)
## [1] 5 24368
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
## [1] 5 5265
Analysis
We will now create our topics or themes. If there is no a priori information on how many topics to make it os up to you to decide how many. We will create three topics. The “LDA” function is used and the argument “k” is set to three indicating we want three topics. Below is the code
set.seed(123)
lda3<-LDA(dtm,k=3)
We can see which topic each book was assigned to using the “topics” function. Below is the code.
topics(lda3)
## bible.txt buddha.txt meditations.txt mormon.txt
## 2 3 3 1
## quran.txt
## 3
According to the results. The book of Mormon and the Bible were so unique that they each had their own topic (1 and 3). The other three text (Buddha, Meditations, and the Book of Mormon) were all placed in topic 2. It’s surprising that the Bible and the Book of Mormon were in separate topics since they are both Christian text. It is also surprising the Book by Buddha, Meditations, and the Quran are all under the same topic as it seems that these texts have nothing in common.
We can also use the “terms” function to see what the most common words are for each topic. The first argument in the function is the model name followed by the number of words you want to see. We will look at 10 words per topic.
terms(lda3, 10)
## Topic 1 Topic 2 Topic 3
## [1,] "people" "lord" "god"
## [2,] "came" "god" "one"
## [3,] "god" "israel" "things"
## [4,] "behold" "man" "say"
## [5,] "pass" "son" "truth"
## [6,] "lord" "king" "man"
## [7,] "yea" "house" "lord"
## [8,] "land" "one" "life"
## [9,] "now" "come" "see"
## [10,] "things" "people" "good"
Interpreting these results takes qualitative skills and is subjective. They all seem to be talking about the same thing. Topic 3 (Bible) seems to focus on Israel and Lord while topic 1 (Mormon) is about God and people. Topic 2 (Buddha, Meditations, and Quran) speak of god as well but the emphasis has moved to truth and the word one.
Conclusion
This post provided insight into developing topic models using R. The results of a topic model analysis is highly subjective and will often require strong domain knowledge. Furthermore, the number of topics is highly flexible as well and in the example in this post we could have had different numbers of topics for comparative purposes.