Diversity and Lexical Dispersion Analysis in R

In this post, we will learn how to conduct a diversity and lexical dispersion analysis in R. Diversity analysis is a measure of the breadth of an author’s vocabulary in a text. Are provides several calculations of this in their output

Lexical dispersion is used for knowing where or when certain words are used in a text. This is useful for identifying patterns if this is a goal of your data exploration.

We will conduct our two analysis by comparing two famous philosophical texts

  • Analects
  • The Prince

These books are available at the Gutenberg Project. You can go to the site type in the titles and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

library(qdap)

Data Preparation

Below are the steps we need to take to prepare the data

  1. Paste the text files into R
  2. Convert the text files to ASCII format
  3. Convert the ASCII format to data frames
  4. Split the sentences in the data frame
  5. Add a variable that indicates the book name
  6. Combine the two books into one dataframe

We now need to prepare the three text. First, we move them into R using the “paste” function.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")

We must convert the text files to ASCII format see that R is able to interpret them.

analects<-iconv(analects,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code
data frame

analects<-data.frame(texts=analects)
prince<-data.frame(texts=prince)

With the dataframes completed. We can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

analects<-sentSplit(analects,'texts')
prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be valuable for comparative purposes.

analects$book<-"analects"
prince$book<-"prince"

Now we combine the two books into one dataframe. The data preparation is now complete.

twobooks<-rbind(analects,prince)

Data Analysis

We will begin with the diversity analysis

div<-diversity(twobooks$texts,twobooks$book)
div
           book wc simpson shannon collision berger_parker brillouin
1 analects 30995    0.989   6.106   4.480     0.067         5.944
2 prince   52105    0.989   6.324   4.531     0.059         6.177

For most of the metrics, the diversity in the use of vocabulary is the same despite being different books from different eras in history. How these numbers are calculated is beyond the scope of this post.

Next, we will calculate the lexical dispersion of the two books. Will look at three common themes in history money, war, and marriage.

dispersion_plot(twobooks$texts,grouping.var=twobooks$book,c("money","war",'marriage'))

1

The tick marks show when each word appears. For example, money appears at the beginning of Analects only but is more spread out in tThe PRince. War is evenly dispersed in both books and marriage only appears in The Prince

Conclusion

This analysis showed additional tools that can be used to analyze text in R.

2 thoughts on “Diversity and Lexical Dispersion Analysis in R

  1. D. B.

    Hi Darren, I’m afraid that the graph this method produces is misleading. For example, it makes it seem as though the token ‘war’ appears only in the first third of the document ‘analects’ and only in the last two thirds of ‘prince’. Further more it appears that there is no overlap in the graph between the two books and the three terms. Actually ‘war’ is relatively evenly distributed in both.

    The problem is caused because qdap is made to process transcripts, and so it interprets your variable ‘book’ as a speaker label–‘analects’ as speaker one and ‘prince’ as speaker two. This distorts the data representation because qdap is treating them as one ‘conversation’ (which is also clear from the ‘all’ line).

    This graph would be much more easily and accurately accomplished in Quanteda using the textplot_xray function in a corpus with the book metadata saved as docvars. An additional advantage is that the documents are also scaled for length, which makes the data more comparable across multiple documents.

    Reply
    1. Dr. Darrin Post author

      Maybe I’m reading the graph wrong but I thought the first half of the plot reading from left to right refers to analects (in blue) while the second part refers to the book the prince (in blue also but below). This gives the impression that the word is not spread evenly.

      THe bottom plot is just combining the top two plots into one.

      Reply

Leave a Reply