In this post, we will learn how to conduct a diversity and lexical dispersion analysis in R. Diversity analysis measures the breadth of an author’s vocabulary in a text. R provides several calculations of this in its output.
Lexical dispersion shows where or when certain words are used in a text. This is useful for identifying patterns, if that is a goal of your data exploration.
We will conduct our two analyses by comparing two famous philosophical texts:
- The Analects
- The Prince
These books are available at Project Gutenberg. You can go to the site, type in the titles, and download them to your computer.
We will use the “qdap” package to complete the diversity and lexical dispersion analysis. Below is some initial code.
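The setup code did not survive in this copy of the post. A minimal sketch, assuming qdap has already been installed from CRAN, would simply load the package:

```r
# Assumption: qdap is already installed, e.g. via install.packages("qdap")
library(qdap)
```

Everything that follows relies on functions from this package, so it needs to be loaded once per R session.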
Below are the steps we need to take to prepare the data
- Paste the text files into R
- Convert the text files to ASCII format
- Convert the ASCII format to data frames
- Split the sentences in the data frame
- Add a variable that indicates the book name
- Combine the two books into one dataframe
We now need to prepare the two texts. First, we move them into R using the “scan” and “paste” functions.
analects <- paste(scan(file = "C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt", what = 'character'), collapse = " ")
prince <- paste(scan(file = "C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt", what = 'character'), collapse = " ")
We must convert the text files to ASCII format so that R is able to interpret them.
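The conversion code is missing here. One way to do it, sketched with base R's “iconv” function, is shown below. The source encoding ("UTF-8") is an assumption about how the files were saved, and the short strings are toy stand-ins for the full books so that the snippet runs on its own.

```r
# Toy stand-ins for the book strings (skip these two lines with the real books)
analects <- "The Master said, \u201cIs it not pleasant to learn?\u201d"
prince <- "All states are either republics or principalities \u2013 so it is said."

# Assumption: the text is UTF-8. sub = "" drops every byte that has no
# ASCII equivalent instead of returning NA for the whole string.
analects <- iconv(analects, from = "UTF-8", to = "ASCII", sub = "")
prince <- iconv(prince, from = "UTF-8", to = "ASCII", sub = "")
```

If the downloaded files use a different encoding, such as "latin1", the “from” argument would need to change accordingly.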
For each book, we need to make a dataframe. The argument “texts” gives each dataframe one variable called “texts”, which contains all the words in the book. Below is the code.
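The original code for this step is not shown, so the following is a sketch using base R's “data.frame” function. The two short strings are toy stand-ins for the full books.

```r
# Toy stand-ins for the cleaned book strings (skip with the real books)
analects <- "The Master said learning is pleasant. Virtue is never alone."
prince <- "All states are republics or principalities. A prince must study war."

# One row and one column named "texts" that holds the whole book as a string
analects <- data.frame(texts = analects, stringsAsFactors = FALSE)
prince <- data.frame(texts = prince, stringsAsFactors = FALSE)

str(analects)
```

Setting stringsAsFactors = FALSE keeps the text as character data, which matters on R versions older than 4.0 where strings became factors by default.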
With the dataframes completed, we can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.
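A sketch of the sentence split with qdap's “sentSplit”, which takes a dataframe and the name of its text column, might look as follows. The toy dataframes are stand-ins for the real ones.

```r
library(qdap)

# Toy stand-ins for the one-row book dataframes (skip with the real books)
analects <- data.frame(
  texts = "The Master said learning is pleasant. Virtue is never alone.",
  stringsAsFactors = FALSE
)
prince <- data.frame(
  texts = "All states are republics or principalities. A prince must study war.",
  stringsAsFactors = FALSE
)

# Split the "texts" column so that each sentence gets its own row
analects <- sentSplit(analects, "texts")
prince <- sentSplit(prince, "texts")

head(analects)
```

The result should have one row per sentence, with the sentence text still stored in “texts”.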
Next, we add the variable “book” to each dataframe. For each row, or sentence, in the dataframe, the “book” variable tells you which book the sentence came from. This will be valuable for comparative purposes.
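Adding the label can be sketched in base R as below. The label values "analects" and "prince" match the book names in the diversity output later in the post; the toy dataframes stand in for the sentence-split books.

```r
# Toy stand-ins for the sentence-split dataframes (skip with the real books)
analects <- data.frame(
  texts = c("Learning is pleasant.", "Virtue is never alone."),
  stringsAsFactors = FALSE
)
prince <- data.frame(
  texts = c("All states are republics or principalities.", "A prince must study war."),
  stringsAsFactors = FALSE
)

# The single label is recycled so that every row carries its book name
analects$book <- "analects"
prince$book <- "prince"
```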
Now we combine the two books into one dataframe. The data preparation is now complete.
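The combining step can be sketched with base R's “rbind”. The name “twobooks” matches the object used in the analysis code later in the post; the toy dataframes stand in for the real ones.

```r
# Toy stand-ins for the labelled dataframes (skip with the real books)
analects <- data.frame(
  texts = c("Learning is pleasant.", "Virtue is never alone."),
  book = "analects",
  stringsAsFactors = FALSE
)
prince <- data.frame(
  texts = c("All states are republics or principalities.", "A prince must study war."),
  book = "prince",
  stringsAsFactors = FALSE
)

# Stack the rows; both dataframes must have the same column names
twobooks <- rbind(analects, prince)

table(twobooks$book)  # number of sentences per book
```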
We will begin with the diversity analysis.
div<-diversity(twobooks$texts,twobooks$book)
div
      book    wc simpson shannon collision berger_parker brillouin
1 analects 30995   0.989   6.106     4.480         0.067     5.944
2   prince 52105   0.989   6.324     4.531         0.059     6.177
For most of the metrics, the diversity in the use of vocabulary is nearly the same (the Simpson index is 0.989 for both), even though the two books come from different eras in history. How these numbers are calculated is beyond the scope of this post.
Next, we will calculate the lexical dispersion of the two books. We will look at three common themes in history: money, war, and marriage.
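The plotting code is missing from this copy. A sketch using qdap's “dispersion_plot” function could look like the following, where the colour settings are illustrative choices and the toy dataframe stands in for the real “twobooks”.

```r
library(qdap)

# Toy stand-in for the combined dataframe (skip with the real books)
twobooks <- data.frame(
  texts = c("Money is not the aim of learning.", "A ruler avoids war.",
            "A prince needs money for war.", "Marriage can seal an alliance."),
  book = c("analects", "analects", "prince", "prince"),
  stringsAsFactors = FALSE
)

# The three themes to locate in each book
themes <- c("money", "war", "marriage")

# One panel per book; each tick marks a position where a theme word occurs
dispersion_plot(twobooks$texts, themes, grouping.var = twobooks$book,
                color = "black", bg.color = "white")
```

Grouping by “book” is what places the two texts in separate panels so their patterns can be compared.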
The tick marks show when each word appears. For example, money appears only at the beginning of the Analects but is more spread out in The Prince. War is evenly dispersed in both books, and marriage appears only in The Prince.
This analysis showed additional tools that can be used to analyze text in R.