In this post, we will look at analyzing tweets from Twitter using R. Before beginning, if you plan to replicate this on your own, you will need to set up a developer account with Twitter. Below are the steps:
- Go to https://dev.twitter.com/apps
- Create a Twitter account if you do not already have one
- Next, click “create new app”
- After entering the requested information, be sure to keep the following information for R: consumer key, consumer secret, request token URL, authorize URL, access token URL
The instructions here are primarily for users of Linux. If you are using a Windows machine, you also need to download a cacert.pem file.
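The download code is not shown in this post, so here is a minimal sketch of that step. It assumes the certificate bundle published by the curl project at https://curl.se/ca/cacert.pem is the file you want; the variable name "cacert.path" and the destination are placeholders, not part of the original setup.

```r
# Assumption: the curl project's CA bundle is the cacert.pem you need.
# cacert.path is a hypothetical placeholder; point it at a folder you can write to.
cacert.path <- "cacert.pem"
download.file(url = "https://curl.se/ca/cacert.pem", destfile = cacert.path)
```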
Save this file in an appropriate location, since you will need its path later. We begin the analysis by loading the appropriate libraries.
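The exact library list is not shown, so the following is a sketch based on the functions used later in the post. “ROAuth” and “tm” are named in the text; “twitteR” and “wordcloud” are inferred from the function calls and are assumptions.

```r
library(twitteR)   # setup_twitter_oauth, searchTwitter, twListToDF
library(ROAuth)    # OAuthFactory
library(tm)        # Corpus, tm_map, TermDocumentMatrix, findAssocs
library(wordcloud) # wordcloud (assumed; used for the word cloud step)
```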
Next, we need to use all of the application information we generated when we created the developer account at Twitter. We will save the information in objects to use in R. In the example code below, “XXXX” is used where you should provide your own information. Sharing this kind of information would allow others to use my Twitter developer account. Below is the code:
my.key<-"XXXX" #consumer key
my.secret<-"XXXX" #consumer secret
my.accesstoken<-'XXXX' #access token
my.accesssecret<-'XXXX' #access secret
Some of the information we just stored now needs to be passed to the “OAuthFactory” function of the “ROAuth” package. We will be passing “my.key” and “my.secret”. We also need to add the request URL, access URL, and auth URL.
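A minimal sketch of this step follows. It assumes “my.key” and “my.secret” were defined as above and that the three endpoint URLs are Twitter's standard OAuth 1.0a endpoints; use the URLs listed on your own application page if they differ.

```r
library(ROAuth)

# Assumption: these are the request, access, and authorize URLs shown
# on the application page of your developer account.
cred <- OAuthFactory$new(consumerKey = my.key,
                         consumerSecret = my.secret,
                         requestURL = "https://api.twitter.com/oauth/request_token",
                         accessURL = "https://api.twitter.com/oauth/access_token",
                         authURL = "https://api.twitter.com/oauth/authorize")
```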
If you are a Windows user, you need the cacert.pem file at this point. Use “cred$handshake(cainfo="LOCATION OF CACERT.PEM")” to complete the setup process. Make sure to save your authentication, and then use “registerTwitterOAuth(cred)” to finish. For Linux users, the code is below.
setup_twitter_oauth(my.key, my.secret, my.accesstoken, my.accesssecret)
We can now begin the analysis. We are going to search Twitter for the term “Data Science.” We will look for 1,500 of the most recent tweets that contain this term. To do this we will use the “searchTwitter” function.
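The search call might look like the sketch below. It assumes authentication has already succeeded; the exact search string and its capitalization are assumptions, and the API may return fewer than the requested number of tweets.

```r
library(twitteR)

# Request up to 1,500 recent tweets matching the search term.
# The result is a list of status objects, stored as ds_tweets.
ds_tweets <- searchTwitter("data science", n = 1500)
```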
We now need to do some things that are a little more complicated. First, we need to convert our “ds_tweets” object to a dataframe. This is just to save our search so we don’t have to search again.
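One way to do the conversion is with “twListToDF” from the “twitteR” package, sketched below. The object name “ds_tweets.df” is hypothetical, chosen to match the naming used elsewhere in this post.

```r
library(twitteR)

# Convert the list of status objects into a data frame, one row per tweet.
# ds_tweets.df is a hypothetical name; any name will do.
ds_tweets.df <- twListToDF(ds_tweets)
```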
Second, we need to find all the text in our “ds_tweets” object and convert this into a list. We will use the “sapply” function along with the “getText” method of each tweet.
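A sketch of the extraction, assuming “ds_tweets” is the list returned by “searchTwitter”:

```r
# Call getText() on every status object and collect the results.
# sapply returns a character vector with one element per tweet.
ds_tweets.list <- sapply(ds_tweets, function(x) x$getText())
```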
Third, we need to turn our “ds_tweets.list” into a corpus.
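With the “tm” package, this can be sketched as follows, assuming “ds_tweets.list” holds the tweet text from the previous step:

```r
library(tm)

# Wrap the character vector in a source, then build the corpus from it.
# Each tweet becomes one document in the corpus.
ds_tweets.corpus <- Corpus(VectorSource(ds_tweets.list))
```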
Now we need to do a lot of cleaning of the text. In particular, we need to
- make all words lower case
- remove punctuation
- get rid of funny characters (e.g. #, /, etc.)
- remove stopwords (words that carry little meaning, such as “the”)
To do this we need to use a combination of functions in the “tm” package as well as some personally made functions
ds_tweets.corpus<-tm_map(ds_tweets.corpus,content_transformer(tolower)) #lower case first so capitalized stop words are matched later
ds_tweets.corpus<-tm_map(ds_tweets.corpus,removePunctuation)
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x) #remove garbage terms
ds_tweets.corpus<-tm_map(ds_tweets.corpus,content_transformer(removeSpecialChars)) #application of custom function
ds_tweets.corpus<-tm_map(ds_tweets.corpus,removeWords,stopwords()) #removes stop words
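To see what the custom function does on its own, here is a small check in base R on a made-up string. The sample text is purely illustrative and does not come from the tweets.

```r
# Keep only letters, digits, and spaces; drop everything else.
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

# Made-up example input containing characters the regex should strip.
sample.text <- "big#data/2017"
cleaned.text <- removeSpecialChars(sample.text)
print(cleaned.text)
```

Note that the characters are deleted rather than replaced with a space, so words joined by a symbol are merged into a single term.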
We can now make a word cloud for fun.
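A sketch using the “wordcloud” package, which accepts a corpus directly when no frequencies are supplied. The “max.words” and “random.order” settings are illustrative choices, not values from the original analysis.

```r
library(tm)
library(wordcloud)

# Draw the cloud straight from the cleaned corpus.
# max.words caps the number of terms plotted; random.order = FALSE
# places the most frequent terms in the center.
wordcloud(ds_tweets.corpus, max.words = 100, random.order = FALSE)
```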
We now need to convert our corpus to a matrix for further analysis. In addition, we need to remove sparse terms as this reduces the size of the matrix without losing much information. The value to set it to is at the discretion of the researcher. Below is the code
ds_tweets.tdm<-TermDocumentMatrix(ds_tweets.corpus)
ds_tweets.tdm<-removeSparseTerms(ds_tweets.tdm,sparse = .8) #remove sparse terms
We’ve looked at how to find the most frequent terms in another post. Below are the 15 most common words.
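The code for this step is not shown, so here is a minimal sketch that ranks terms by their total count across all tweets. It assumes “ds_tweets.tdm” from the previous step; “ds_tweets.freq” is a hypothetical name. The original call may have differed (for example, “findFreqTerms” from “tm” filters by a frequency threshold instead), so the ordering may not match the output shown.

```r
library(tm)

# Sum each term's counts across all documents, then sort high to low.
ds_tweets.freq <- sort(rowSums(as.matrix(ds_tweets.tdm)), decreasing = TRUE)

# Names of the 15 most frequent terms.
names(head(ds_tweets.freq, 15))
```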
##  "datasto" "demonstrates" "download" "executed"
##  "hard" "key" "leaka" "locally"
##  "memory" "mitchellvii" "now" "portable"
##  "science" "similarly" "data"
Below are words that are highly correlated with the term “key”.
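Correlations like these can be obtained with “findAssocs” from the “tm” package. The sketch below assumes “ds_tweets.tdm” from the earlier step; the correlation cutoff of 0.95 is an assumption, since the value originally used is not shown.

```r
library(tm)

# List terms whose correlation with "key" is at least the cutoff.
# The 0.95 cutoff is a hypothetical choice.
findAssocs(ds_tweets.tdm, "key", 0.95)
```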
## $key
## demonstrates     download     executed        leaka      locally
##         0.99         0.99         0.99         0.99         0.99
##       memory      datasto         hard  mitchellvii     portable
##         0.99         0.98         0.98         0.98         0.98
##    similarly
##         0.98
For the final trick, we will perform hierarchical agglomerative clustering. This will clump words that are more similar next to each other. We first need to convert our current “ds_tweets.tdm” into a regular matrix. Then we need to scale it so that the distances are computed on standardized values.
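A sketch of the conversion and scaling, assuming “ds_tweets.tdm” from above. The name “ds_tweets.mat” is hypothetical; “ds_tweets.mat.scale” matches the object used in the distance calculation.

```r
library(tm)

# Convert the term-document matrix to an ordinary matrix
# (rows are terms, columns are tweets).
ds_tweets.mat <- as.matrix(ds_tweets.tdm)

# scale() centers each column and divides it by its standard deviation.
ds_tweets.mat.scale <- scale(ds_tweets.mat)
```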
Now we need to calculate the distances between the terms.
ds_tweets.dist<-dist(ds_tweets.mat.scale,method = 'euclidean')
At last, we can make the clusters.
ds_tweets.fit<-hclust(ds_tweets.dist,method = 'ward.D') #'ward' was renamed 'ward.D' in R 3.1.0
Looking at the chart, it appears we have six main clusters. We can highlight them using the code below.
plot(ds_tweets.fit)
groups<-cutree(ds_tweets.fit,k=6)
rect.hclust(ds_tweets.fit,k=6)
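If you want to try the clustering steps without pulling any tweets, the same sequence can be run on a small made-up matrix. Everything in this sketch is hypothetical: the term names, the counts, and the choice of two clusters are for illustration only.

```r
# Made-up term-by-document counts: rows are terms, columns are tweets.
toy.mat <- matrix(c(5, 4, 0, 0,
                    4, 5, 0, 1,
                    0, 0, 6, 5,
                    0, 1, 5, 6),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("data", "science", "memory", "key"),
                                  paste0("tweet", 1:4)))

toy.scale <- scale(toy.mat)                       # standardize each column
toy.dist <- dist(toy.scale, method = "euclidean") # distances between terms
toy.fit <- hclust(toy.dist, method = "ward.D")    # agglomerative clustering
toy.groups <- cutree(toy.fit, k = 2)              # cut the tree into 2 groups
print(toy.groups)
```

Terms that appear in the same tweets end up in the same group, which is the behavior the dendrogram rectangles highlight in the full analysis.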
This post provided an example of how to pull data from Twitter for text analysis. There are many steps, but useful insights can be gained from this sort of research.