Readability and Formality Analysis in R

In this post, we will look at how to assess the readability and formality of a text using R. By readability, we mean the use of a formula that will provide us with the grade level at which the text is roughly written. This is highly useful information in the field of education and even medicine.

Formality provides insights into how the text relates to the reader. The more formal the writing the greater the distance between author and reader. Formal words are nouns, adjectives, prepositions, and articles while informal (contextual) words are pronouns, verbs, adverbs, and interjections.

The F-measure counts and calculates a score of formality based on the proportions of the formal and informal words.

We will conduct our two analysis by comparing two famous philosophical texts

  • Analects
  • The Prince

These books are available at the Gutenberg Project. You can go to the site type in the titles and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

library(qdap)

Data Preparation

Below are the steps we need to take to prepare the data

  1. Paste the text files into R
  2. Convert the text files to ASCII format
  3. Convert the ASCII format to data frames
  4. Split the sentences in the data frame
  5. Add a variable that indicates the book name
  6. Combine the two books into one dataframe

We now need to prepare the two text. The “paste” function will move the text into the R environment.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")

The text need to be converted to the ASCII format and the code below does this.

analects<-iconv(analects,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code data frame

analects<-data.frame(texts=analects)
prince<-data.frame(texts=prince)

With the dataframes completed. We can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

analects<-sentSplit(analects,'texts')
prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be useful for comparative purposes.

analects$book<-"analects"
prince$book<-"prince"

Lastly, we combine the two books into one dataframe. The data preparation is now complete.

twobooks<-rbind(analects,prince)

Data Analysis

We will begin with the readability. The “automated_readbility_index” function will calculate this for us.

ari<-automated_readability_index(twobooks$texts,twobooks$book)
ari
##       book word.count sentence.count character.count Automated_Readability_Index
## 1 analects      30995           3425          132981                       3.303
## 2   prince      52105           1542          236605                      16.853

Analects is written on a third-grade level but The Prince is written at grade 16. This is equivalent to a Senior in college. As such, The Prince is a challenging book to read.

Next we will calcualte the formality of the two books. The “formality” function is used for this.

form<-formality(twobooks$texts,twobooks$book)
form
##       book word.count formality
## 1   prince      52181     60.02
## 2 analects      31056     58.36

The books are mildly formal. The code below gives you the break down of the word use by percentage.

form$form.prop.by
##       book word.count  noun  adj  prep articles pronoun  verb adverb
## 1 analects      31056 25.05 8.63 14.23     8.49   10.84 22.92   5.86
## 2   prince      52181 21.51 9.89 18.42     7.59   10.69 20.74   5.94
##   interj other
## 1   0.05  3.93
## 2   0.00  5.24

The proportions are consistent when the two books are compared. Below is a visual of the table we just examined.

plot(form)

1.png

Conclusion

Readability and formality are additional text mining tools that can provide insights for Data Scientist. Both of these analysis tools can provide suggestions that may be needed in order to enhance communication or compare different authors and writing styles.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s