Binary Recommendation Engines in R

In this post, we will look at recommendation engines that use binary information. A binary recommendation engine requires data that rates each product as good/bad or with some other system in which only two responses are possible. The “recommenderlab” package is needed for this analysis, and we will use the movie ratings from grouplens.org for this post.

http://grouplens.org/datasets/movielens/latest/

If you want to follow along, download the “small” dataset from the link above and use the “ratings.csv” and “movies.csv” files. We will then merge these two datasets based on the variable “movieId”. Below is the initial code

library(recommenderlab)
ratings <- read.csv("~/Downloads/ml-latest-small/ratings.csv") #load ratings data
movies <- read.csv("~/Downloads/ml-latest-small/movies.csv")#load movies data
movieRatings<-merge(ratings, movies, by='movieId')#merge movies and ratings data

We now need to convert our “movieRatings” data frame to a matrix that “recommenderlab” can use. After doing this, we need to indicate that we are building a binary engine by setting the minimum rating to 2.5. What this means is that any rating of 2.5 or higher goes into one category and anything below 2.5 goes into the other category. We use the “binarize” function to do this. Below is the code

movieRatings<-as(movieRatings,"realRatingMatrix")
movie.bin<-binarize(movieRatings,minRating=2.5)

We need to use a subset of our data in which each row has a certain minimum number of ratings. For this analysis, we keep only the rows with more than ten ratings. Below is the code for this.

movie.bin<-movie.bin[rowCounts(movie.bin)>10]
movie.bin
## 1817 x 671 rating matrix of class 'binaryRatingMatrix' with 68643 ratings.
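
As an optional sanity check (not part of the original analysis), you can summarize how many ratings each remaining user has with the “rowCounts” function used above:

summary(rowCounts(movie.bin)) #distribution of ratings per user after filtering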

Next, we need to set up the evaluation scheme. We use the “evaluationScheme” function and plug in the data, the method of evaluation, the number of folds, and the given number of ratings. The code is as follows.

set.seed(456)
e.bin<-evaluationScheme(movie.bin,method='cross-validation',k=5,given=10)

We now make a list that holds all the models we want to run. We will run four models: “POPULAR”, “RANDOM”, “UBCF”, and “IBCF”. We will then use the “evaluate” function to see how accurate our models are when recommending 5, 10, 15, and 20 items.

algorithms.bin<-list(POPULAR=list(name="POPULAR",param=NULL),
                     RAND=list(name="RANDOM"),UBCF=list(name="UBCF"),IBCF=list(name="IBCF")) 
results.bin<-evaluate(e.bin,algorithms.bin,n=c(5,10,15,20))

The “avg” function will help us see how our models did. Below are the results

avg(results.bin)
## $POPULAR
##          TP        FP       FN       TN precision     recall        TPR
## 5  1.518356  3.481644 26.16877 629.8312 0.3036712 0.09293487 0.09293487
## 10 2.792329  7.207671 24.89479 626.1052 0.2792329 0.15074799 0.15074799
## 15 3.916164 11.083836 23.77096 622.2290 0.2610776 0.20512093 0.20512093
## 20 4.861370 15.138630 22.82575 618.1742 0.2430685 0.24831787 0.24831787
##            FPR
## 5  0.005426716
## 10 0.011221837
## 15 0.017266489
## 20 0.023608749
## 
## $RAND
##           TP        FP       FN       TN  precision      recall
## 5  0.2120548  4.787945 27.47507 628.5249 0.04241096 0.007530989
## 10 0.4104110  9.589589 27.27671 623.7233 0.04104110 0.015611349
## 15 0.6241096 14.375890 27.06301 618.9370 0.04160731 0.023631305
## 20 0.8460274 19.153973 26.84110 614.1589 0.04230137 0.033130430
##            TPR         FPR
## 5  0.007530989 0.007559594
## 10 0.015611349 0.015146399
## 15 0.023631305 0.022702057
## 20 0.033130430 0.030246522
## 
## $UBCF
##          TP        FP       FN       TN precision    recall       TPR
## 5  2.175890  2.824110 25.51123 630.4888 0.4351781 0.1582319 0.1582319
## 10 3.740274  6.259726 23.94685 627.0532 0.3740274 0.2504990 0.2504990
## 15 5.054795  9.945205 22.63233 623.3677 0.3369863 0.3182356 0.3182356
## 20 6.172603 13.827397 21.51452 619.4855 0.3086301 0.3748969 0.3748969
##            FPR
## 5  0.004387006
## 10 0.009740306
## 15 0.015492088
## 20 0.021557381
## 
## $IBCF
##          TP        FP       FN       TN precision     recall        TPR
## 5  1.330411  3.669589 26.35671 629.6433 0.2660822 0.08190126 0.08190126
## 10 2.442192  7.557808 25.24493 625.7551 0.2442192 0.13786523 0.13786523
## 15 3.532603 11.467397 24.15452 621.8455 0.2355068 0.19010813 0.19010813
## 20 4.546301 15.453699 23.14082 617.8592 0.2273151 0.23494969 0.23494969
##            FPR
## 5  0.005727386
## 10 0.011801682
## 15 0.017900255
## 20 0.024124329

The results are pretty poor for all models; the TPR (true positive rate) never reaches .4. We can visualize the results by plotting an ROC curve (TPR against FPR) as well as a precision/recall curve.

plot(results.bin,legend="topleft",annotate=T)

[Figure: ROC curves (TPR vs. FPR) for the four binary models]

plot(results.bin,"prec",legend="topleft",annotate=T)

[Figure: precision/recall curves for the four binary models]

The visual makes it clear that the UBCF model is the best.

Conclusion

This post provided an example of the development of an algorithm for binary recommendations.


Developing Standardized Tests

For better or worse, standardized testing is a part of the educational experience of most students and teachers. The purpose here is not to attack or defend its use. Instead, in this post, we will look at how standardized tests are developed.

There are five main steps in developing a standardized test. These steps are

  1. Determine the goals
  2. Develop the specifications
  3. Create and evaluate test items
  4. Determine scoring and reporting
  5. Continue further development

Determining Goals

The goals of a standardized test are similar to the purpose statement of a research paper in that they determine the scope of the test. By scope, we mean what the test will and perhaps will not do. This is important in setting the direction for the rest of the project.

For example, the TOEFL’s purpose is to evaluate English proficiency. This means that the TOEFL does not deal with science, math, or other subjects. This may seem obvious to many, but this purpose makes it clear what the TOEFL is about.

Develop the Specifications

Specifications have to do with the structure of the test. For example, a test can have multiple-choice, short answer, essay, fill in the blank, etc. The structure of the test needs to be determined in order to decide what types of items to create.

Most standardized tests are primarily multiple-choice. This is due to the scale on which the tests are given. However, some language tests now include a writing component as well.

Create Test Items

Once the structure is set, it is necessary to develop the actual items for the test. This involves a great deal of item response theory (IRT) and statistics. There is also a need to ensure that the items measure the actual constructs of the subject domain.

For example, the TOEFL must be sure that it is really measuring language skills. This is done through consulting experts as well as statistical analysis to know for certain they are measuring English proficiency. The items come from a bank and are tested and retested.

Determine Scoring and Reporting

The scoring and reporting need to be considered. How many points is each item worth? What is the weight of one section of the test? Is the test norm-referenced or criterion-referenced? How many people will mark each test? These are some of the questions to consider.

The scoring and reporting matter a great deal because the scores can affect a person’s life significantly. Therefore, this aspect of standardized testing is treated with great care.

Further Development

A completed standardized test needs to be continuously reevaluated. Ideas and theories in a body of knowledge change frequently and this needs to be taken into account as the test goes forward.

For example, over the years the SAT has changed its scoring as well as added a writing component. This was done in reaction to concerns about the test.

Conclusion

The concepts behind developing standardized tests can be useful even for teachers making their own assessments. There is no need to follow this process as rigorously. However, familiarity with this strict format can help guide assessment development in many different situations.

Recommendation Engines in R

In this post, we will look at how to make a recommendation engine. We will use data on movie ratings to make recommendations about movies, and we will use the “recommenderlab” package to build several different engines. The data comes from

http://grouplens.org/datasets/movielens/latest/

At this link, you need to download the “ml-latest-small.zip” file. From there, we will use the “ratings” and “movies” files in this post. The ratings file provides the ratings of the movies, while the movies file provides their titles. Before going further, it is important to know that “recommenderlab” offers several techniques for developing recommendation engines; we will use five of them (IBCF, UBCF, POPULAR, RANDOM, & SVD) for comparative purposes. Below is the code for getting started.

library(recommenderlab)
ratings <- read.csv("~/Downloads/ml-latest-small/ratings.csv")
movies <- read.csv("~/Downloads/ml-latest-small/movies.csv")

We now need to merge the two datasets so that they become one. This way the titles and ratings are in one place. We will then coerce our “movieRatings” dataframe into a “realRatingMatrix” in order to continue our analysis. Below is the code

movieRatings<-merge(ratings, movies, by='movieId') #merge two files
movieRatings<-as(movieRatings,"realRatingMatrix") #coerce to realRatingMatrix

We will now create two histograms of the ratings. The first uses the raw data and the second uses normalized data. The “getRatings” function is used in combination with the “hist” function to make the histogram. For the normalized version, we also wrap the data in the “normalize” function. Below is the code.

hist(getRatings(movieRatings),breaks =10)

[Figure: histogram of the raw ratings]

hist(getRatings(normalize(movieRatings)),breaks =10)

[Figure: histogram of the normalized ratings]

We are now ready to create the evaluation scheme for our analysis. In this object we need to set the data (movieRatings), the method we want to use (cross-validation), the proportion of the data we want to use for the training set (80%), and how many ratings the algorithm is given for each user in the test set (1), with the rest being used to compute the error. We also need to tell R what counts as a good rating (4 or higher) and the number of folds for the cross-validation (10). Below is the code for all of this.

set.seed(123)
eSetup<-evaluationScheme(movieRatings,method='cross-validation',train=.8,given=1,goodRating=4,k=10)

Below is the code for developing our models. To do this we need to use the “Recommender” function and the “getData” function to get the dataset. Remember, we are using five modeling techniques.

ubcf<-Recommender(getData(eSetup,"train"),"UBCF")
ibcf<-Recommender(getData(eSetup,"train"),"IBCF")
svd<-Recommender(getData(eSetup,"train"),"SVD")
popular<-Recommender(getData(eSetup,"train"),"POPULAR")
random<-Recommender(getData(eSetup,"train"),"RANDOM")

The models have been created. We can now make our predictions using the “predict” function in addition to the “getData” function. We also need to set the argument “type” to “ratings”. Below is the code.

ubcf_pred<-predict(ubcf,getData(eSetup,"known"),type="ratings")
ibcf_pred<-predict(ibcf,getData(eSetup,"known"),type="ratings")
svd_pred<-predict(svd,getData(eSetup,"known"),type="ratings")
pop_pred<-predict(popular,getData(eSetup,"known"),type="ratings")
rand_pred<-predict(random,getData(eSetup,"known"),type="ratings")

We can now look at the accuracy of the models. We will do this in two steps. First, we will look at the error rates. After completing this, we will do a more detailed analysis of the stronger models. Below is the code for the first step

ubcf_error<-calcPredictionAccuracy(ubcf_pred,getData(eSetup,"unknown")) #calculate error
ibcf_error<-calcPredictionAccuracy(ibcf_pred,getData(eSetup,"unknown"))
svd_error<-calcPredictionAccuracy(svd_pred,getData(eSetup,"unknown"))
pop_error<-calcPredictionAccuracy(pop_pred,getData(eSetup,"unknown"))
rand_error<-calcPredictionAccuracy(rand_pred,getData(eSetup,"unknown"))
error<-rbind(ubcf_error,ibcf_error,svd_error,pop_error,rand_error) #combine objects into one data frame
rownames(error)<-c("UBCF","IBCF","SVD","POP","RAND") #give names to rows
error
##          RMSE      MSE       MAE
## UBCF 1.278074 1.633473 0.9680428
## IBCF 1.484129 2.202640 1.1049733
## SVD  1.277550 1.632135 0.9679505
## POP  1.224838 1.500228 0.9255929
## RAND 1.455207 2.117628 1.1354987

The results indicate that the “RAND” and “IBCF” models are clearly worse than the remaining three. We will now move to the second step and take a closer look at the “UBCF”, “SVD”, and “POP” models. We will do this by making a list and using the “evaluate” function to get other model evaluation metrics. We will make a list called “algorithms” that stores the three strongest models. Then we will make an object called “evlist”; in this object we use the “evaluate” function and pass it the evaluation scheme (“eSetup”), the list (“algorithms”), and the number of movies to assess (5, 10, 15, 20).

algorithms<-list(POPULAR=list(name="POPULAR"),SVD=list(name="SVD"),UBCF=list(name="UBCF"))
evlist<-evaluate(eSetup,algorithms,n=c(5,10,15,20))
avg(evlist)
## $POPULAR
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.3010965  3.033333 4.917105 661.7485 0.09028443 0.07670381 0.07670381
## 10 0.4539474  6.214912 4.764254 658.5669 0.06806016 0.11289681 0.11289681
## 15 0.5953947  9.407895 4.622807 655.3739 0.05950450 0.14080354 0.14080354
## 20 0.6839912 12.653728 4.534211 652.1281 0.05127635 0.16024740 0.16024740
##            FPR
## 5  0.004566269
## 10 0.009363021
## 15 0.014177091
## 20 0.019075070
## 
## $SVD
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.1025219  3.231908 5.115680 661.5499 0.03077788 0.00968336 0.00968336
## 10 0.1808114  6.488048 5.037390 658.2938 0.02713505 0.01625454 0.01625454
## 15 0.2619518  9.741338 4.956250 655.0405 0.02620515 0.02716656 0.02716656
## 20 0.3313596 13.006360 4.886842 651.7754 0.02486232 0.03698768 0.03698768
##            FPR
## 5  0.004871678
## 10 0.009782266
## 15 0.014689510
## 20 0.019615377
## 
## $UBCF
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.1210526  2.968860 5.097149 661.8129 0.03916652 0.01481106 0.01481106
## 10 0.2075658  5.972259 5.010636 658.8095 0.03357173 0.02352752 0.02352752
## 15 0.3028509  8.966886 4.915351 655.8149 0.03266321 0.03720717 0.03720717
## 20 0.3813596 11.978289 4.836842 652.8035 0.03085246 0.04784538 0.04784538
##            FPR
## 5  0.004475151
## 10 0.009004466
## 15 0.013520481
## 20 0.018063361

Well, the numbers indicate that all the models perform poorly. True positives, false positives, false negatives, true negatives, precision, recall, true positive rate, and false positive rate are all weak for every model. Remember that these values are averages across the cross-validation folds. For example, for the “POPULAR” model, when looking at the top five movies, the average number of true positives was .3.

Even though the numbers are poor, the “POPULAR” model consistently performed the best. We can also view the ROC curves with the code below

plot(evlist,legend="topleft",annotate=T)

[Figure: ROC curves for the POPULAR, SVD, and UBCF models]

We can now determine individual recommendations. We first need to build a model using the POPULAR algorithm. Below is the code.

Rec1<-Recommender(movieRatings,method="POPULAR")
Rec1
## Recommender of type 'POPULAR' for 'realRatingMatrix' 
## learned using 9066 users.

We will now pull the top five recommendations for the first five raters and convert them to a list. The numbers are the movie IDs and not the actual titles.

recommend<-predict(Rec1,movieRatings[1:5],n=5)
as(recommend,"list")
## $`1`
## [1] "78"  "95"  "544" "102" "4"  
## 
## $`2`
## [1] "242" "232" "294" "577" "95" 
## 
## $`3`
## [1] "654" "242" "30"  "232" "287"
## 
## $`4`
## [1] "564" "654" "242" "30"  "232"
## 
## $`5`
## [1] "242" "30"  "232" "287" "577"

Below we can see the predicted scores for specific movies. The movie titles correspond to the first three movie IDs in the original “movies” dataset.

rating<-predict(Rec1,movieRatings[1:5],type='ratings')
rating
## 5 x 671 rating matrix of class 'realRatingMatrix' with 2873 ratings.
movieresult<-as(rating,'matrix')[1:5,1:3]
colnames(movieresult)<-c("Toy Story","Jumanji","Grumpier Old Men")
movieresult
##   Toy Story  Jumanji Grumpier Old Men
## 1  2.859941 3.822666         3.724566
## 2  2.389340 3.352066         3.253965
## 3  2.148488 3.111213         3.013113
## 4  1.372087 2.334812         2.236711
## 5  2.255328 3.218054         3.119953

These are the ratings the model thinks each person would give each movie. The error is calculated from the difference between these predicted values and the actual ratings. In addition, because we used type = “ratings”, only unrated movies receive predictions; a movie the person has already rated shows up as NA in that spot.

Conclusion

This was a lot of work. However, with additional work, you can have your own recommendation system based on data that was collected.

Understanding Recommendation Engines

Recommendation engines are used to make predictions about what future users would like based on prior users’ feedback. Whenever you provide numerical feedback on a product or service, this information can be used to provide recommendations in the future.

This post will look at various ways in which recommendation engines derive their conclusions.

Ways of Recommending

There are two common ways to develop a recommendation engine in a machine learning context: collaborative filtering and content-based filtering. Content-based recommendations rely solely on the data provided by the user. A user develops a profile through their activity, and the engine recommends products or services based on it. The main problem is that when there is little data on a user, poor recommendations are made.

Collaborative filtering produces crowd-based recommendations: the data of many users is used to make recommendations for one user. This bypasses the lack-of-data concern that can arise with content-based recommendations.

There are four common approaches to developing collaborative filters, and they are as follows

  • User-based collaborative filtering
  • Item-based collaborative filtering
  • Singular value decomposition and principal component analysis

User-based Collaborative Filtering (UBCF)

UBCF uses a similarity measure, such as Pearson correlation, with a k-nearest-neighbor approach to predict the missing ratings for a user. Once the neighbors are determined, the algorithm averages the neighbors’ ratings to predict the missing information for the user.

The predicted value can then be used to determine whether a user will like a particular product or service: items with low predicted values are not recommended, while items with high values may be. A major weakness of UBCF is that calculating the similarities between users requires keeping all the data in memory, which is a computational challenge.
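
To make the idea concrete, here is a minimal, hypothetical sketch in base R (this is not how “recommenderlab” implements UBCF) that fills in one missing rating using a user’s two most similar neighbors. The matrix, the target user, and the neighborhood size are all made-up assumptions.

ratings <- matrix(c(5, 4, NA, 1,
                    4, 5,  5, 2,
                    5, 4,  4, 1,
                    1, 2,  1, 5),
                  nrow = 4, byrow = TRUE) #4 users x 4 items, user 1 has not rated item 3
target <- 1 #the user with the missing rating
item   <- 3 #the item to predict
#Pearson correlation between the target user and every other user, using items both rated
sims <- sapply(2:nrow(ratings), function(u) {
  both <- !is.na(ratings[target, ]) & !is.na(ratings[u, ])
  cor(ratings[target, both], ratings[u, both])
})
#take the two most similar neighbors and average their ratings for the item
neighbors <- order(sims, decreasing = TRUE)[1:2] + 1
mean(ratings[neighbors, item], na.rm = TRUE) #predicted rating for user 1, item 3

In practice the average is often weighted by similarity, but the basic logic is the same.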

Item-based Collaborative Filtering (IBCF)

IBCF uses the similarity between items to make recommendations. This is calculated with the same kinds of measures as before (k-nearest neighbor, Pearson correlation, etc.). After finding the most similar items, the algorithm averages the individual user’s ratings of those similar items to predict the rating the user would give the unknown item.

In order to ensure accuracy, a huge number of items must have their similarities calculated. This leads to the same computational problems mentioned earlier.
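
Below is a toy sketch of the item-item similarity idea using cosine similarity between two item columns of a small, made-up ratings matrix. It only illustrates the similarity calculation, not the full IBCF algorithm.

rmat <- matrix(c(5, 4, 4, 1,
                 4, 5, 4, 2,
                 1, 2, 1, 5),
               nrow = 3, byrow = TRUE) #3 users x 4 items, made-up values
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(rmat[, 1], rmat[, 3]) #similarity between items 1 and 3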

Singular Value Decomposition and Principal Component Analysis (SVD, PCA)

When the dataset is too big for the first two options, SVD or PCA can be an appropriate choice. Put simply, each of these two methods reduces the dimensionality of the data by creating latent variables. Doing this reduces the computational effort as well as the noise in the data.

With SVD, we can reduce the data to a handful of factors. The retained factors can be used to approximately reproduce the original values, which can then be used to predict missing values.
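
Here is a toy sketch of that idea with base R’s “svd” function; the random matrix and the choice of two factors are purely illustrative, and this is not how “recommenderlab” handles missing values.

set.seed(1)
m <- matrix(sample(1:5, 30, replace = TRUE), nrow = 6) #6 users x 5 items, made-up ratings
s <- svd(m)
k <- 2 #keep two latent factors
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k]) #low-rank reconstruction
round(approx[1, ], 2) #approximate ratings for user 1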

For PCA, items are combined into components, and items that load on the same component can be used to make predictions for a user’s unknown data points.

Conclusion

Recommendation engines play a critical part in generating sales for many companies. This post provided an insight into how they are created. Understanding this can allow you to develop recommendation engines based on data.

Political Intrigue and the English Language

Political intrigue has played a role in shaping the English language. In this post, we will look at one example from history of politics shaping the direction of the language.

Henry VIII (1491-1547), King of England, had a major problem. He desperately needed a male heir for his kingdom. In order to achieve this, Henry VIII annulled several marriages or had his wife executed. This, of course, did not go over well with the leaders of Europe, as nobles tended to marry nobles.

In order to have his first marriage (to Catherine of Aragon) annulled, Henry VIII needed the permission of the Pope, as divorce is normally not tolerated in Roman Catholicism. However, the Pope did not grant permission because he was facing political pressure from Catherine’s family, who ruled over Spain.

Henry VIII was not one to take no for an answer. Stressed over his lack of a male heir and the fact that Catherine was already 40, he banished Catherine and married his second wife, Anne Boleyn, much to the chagrin of his in-laws in Spain.

This resulted in the Pope excommunicating Henry VIII from the church. In response, Henry VIII started his own church, or at least laid the foundation for Protestantism in England, with the help of German Lutheran princes. It was this event that played a role in the development of the English language.

Protestantism in England

Until the collapse of his first marriage, Henry VIII was a strong supporter of the Catholic Church and was even given the title “Defender of the Faith.” However, he now took steps to flood England with Bibles in response to what he saw as a betrayal by the Catholic Church in denying him the annulment that he sought for his first marriage.

During this same general period, William Tyndale had completed a translation of the New Testament from Greek into English. His translation was the first from the original language into English. Tyndale also completed about half of the Old Testament.

Although Tyndale was tried and burned for translating the Bible, within a few years of his death, Henry VIII was permitting the publishing of the Bible. Tyndale’s translation was combined with the work of Miles Coverdale to create the first complete English version of the Bible called the “Great Bible” in 1539.

With the Bible in the hands of so many people, the English language began to flow into religion and worship services. Words such as “beautiful”, “two-edged”, “landlady”, and “broken-heart” were established as new words. These words are taken for granted today, but it was in translating phrases from the biblical languages that we gained these new forms of expression. How many men have called their woman beautiful? Or how many people have complained about their landlady? This is possible thanks to Tyndale’s translating and Henry VIII’s support.

All this happened, not necessarily for the right motives, yet the foundation for the use of the common tongue in worship, as well as the audacity of translating scripture into other languages, was established by a king seeking an heir.

Conclusion

Henry VIII was probably not the most religious man. He did not have any problem with divorcing or killing wives or with committing adultery. However, he used religion to achieve his political goals. By doing so, he inadvertently influenced the language of his country.

Clustering Mixed Data in R

One of the major problems with hierarchical and k-means clustering is that they cannot handle nominal data. The reality is that most data is mixed or a combination of both interval/ratio data and nominal/ordinal data.

One of many ways to deal with this problem is by using the Gower coefficient. This coefficient compares the cases in the dataset pairwise and calculates a dissimilarity between them, computed as a weighted mean of the variable-by-variable differences for that pair.

Once the dissimilarity calculations are completed using the gower coefficient (there are naturally other choices), you can then use regular kmeans clustering (there are also other choices) to find the traits of the various clusters. In this post, we will use the “MedExp” dataset from the “Ecdat” package. Our goal will be to cluster the mixed data into four clusters. Below is some initial code.

library(cluster);library(Ecdat);library(compareGroups)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

You can clearly see that our data is mixed with both numerical and factor variables. Therefore, the first thing we must do is calculate the gower coefficient for the dataset. This is done with the “daisy” function from the “cluster” package.

disMat<-daisy(MedExp,metric = "gower")

Now we can use the “kmeans” function to make our clusters. This is possible because the factor variables have been incorporated into a numeric dissimilarity matrix. We will set the number of clusters to 4. Below is the code.

set.seed(123)
mixedClusters<-kmeans(disMat, centers=4)

We can now look at a table of the clusters

table(mixedClusters$cluster)
## 
##    1    2    3    4 
## 1960 1342 1356  916

The groups seem reasonably balanced. We now need to add the results of the kmeans to the original dataset. Below is the code

MedExp$cluster<-mixedClusters$cluster

We can now build a descriptive table that gives us the descriptive statistics for each variable within each cluster. To do this we need to use the “compareGroups” function. We will then take the output of the “compareGroups” function and use it in the “createTable” function to get our actual descriptive stats.

group<-compareGroups(cluster~.,data=MedExp)
clustab<-createTable(group)
clustab
## 
## --------Summary descriptives table by 'cluster'---------
## 
## __________________________________________________________________________ 
##                    1            2            3            4      p.overall 
##                  N=1960       N=1342       N=1356       N=916              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## med            211 (1119)   68.2 (333)   269 (820)   83.8 (210)   <0.001   
## lc            4.07 (0.60)  4.05 (0.60)  0.04 (0.39)  0.03 (0.34)   0.000   
## idp:                                                              <0.001   
##     no        1289 (65.8%) 922 (68.7%)  1123 (82.8%) 781 (85.3%)           
##     yes       671 (34.2%)  420 (31.3%)  233 (17.2%)  135 (14.7%)           
## lpi           5.72 (1.94)  5.90 (1.73)  3.27 (2.91)  3.05 (2.96)  <0.001   
## fmde          6.82 (0.99)  6.93 (0.90)  0.00 (0.12)  0.00 (0.00)   0.000   
## physlim:                                                          <0.001   
##     no        1609 (82.1%) 1163 (86.7%) 1096 (80.8%) 789 (86.1%)           
##     yes       351 (17.9%)  179 (13.3%)  260 (19.2%)  127 (13.9%)           
## ndisease      11.5 (8.26)  10.2 (2.97)  12.2 (8.50)  10.6 (3.35)  <0.001   
## health:                                                           <0.001   
##     excellent 910 (46.4%)  880 (65.6%)  615 (45.4%)  612 (66.8%)           
##     good      828 (42.2%)  382 (28.5%)  563 (41.5%)  261 (28.5%)           
##     fair      183 (9.34%)   74 (5.51%)  137 (10.1%)  42 (4.59%)            
##     poor       39 (1.99%)   6 (0.45%)    41 (3.02%)   1 (0.11%)            
## linc          8.68 (1.22)  8.61 (1.37)  8.75 (1.17)  8.78 (1.06)   0.005   
## lfam          1.05 (0.57)  1.49 (0.34)  1.08 (0.58)  1.52 (0.35)  <0.001   
## educdec       12.1 (2.87)  11.8 (2.58)  12.0 (3.08)  11.8 (2.73)   0.005   
## age           36.5 (12.0)  9.26 (5.01)  37.0 (12.5)  9.29 (5.11)   0.000   
## sex:                                                              <0.001   
##     male      893 (45.6%)  686 (51.1%)  623 (45.9%)  482 (52.6%)           
##     female    1067 (54.4%) 656 (48.9%)  733 (54.1%)  434 (47.4%)           
## child:                                                             0.000   
##     no        1960 (100%)   0 (0.00%)   1356 (100%)   0 (0.00%)            
##     yes        0 (0.00%)   1342 (100%)   0 (0.00%)   916 (100%)            
## black:                                                            <0.001   
##     yes       1623 (82.8%) 986 (73.5%)  1148 (84.7%) 730 (79.7%)           
##     no        337 (17.2%)  356 (26.5%)  208 (15.3%)  186 (20.3%)           
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The table speaks for itself. Results for factor variables are shown as counts with proportions. For example, in cluster 1, 1289 people, or 65.8%, responded “no” to having an individual deductible plan (idp). Numerical variables show the mean with the standard deviation in parentheses. For example, in cluster 1 the average value of “lfam” was 1.05 with a standard deviation of 0.57.

Conclusion

Mixed data can be partitioned into clusters with the help of the Gower or another coefficient. In addition, kmeans is not the only way to cluster the data. There are other choices, such as partitioning around medoids (PAM), sketched briefly below. The example provided here simply serves as a basic introduction to this.
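
If you want to try the PAM alternative, a minimal sketch that reuses the Gower dissimilarities from “daisy” might look like the following (it can be slow on all 5,574 observations):

pamClusters <- pam(disMat, k = 4, diss = TRUE) #PAM on the Gower dissimilarities
table(pamClusters$clustering) #cluster sizes under PAM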

Item Indices for Multiple Choice Questions

Many teachers use multiple choice questions to assess students’ knowledge of a subject. This is especially true if the class is large and marking essays would prove to be impractical.

Even if best practices are used in making multiple choice exams, it can still be difficult to know if the questions are doing the work they are supposed to. Fortunately, there are several quantitative measures that can be used to assess the quality of a multiple choice question.

This post will look at three ways that you can determine the quality of your multiple choice questions using quantitative means. These three items are

  • Item facility
  • Item discrimination
  • Distractor efficiency

Item Facility

Item facility measures the difficulty of a particular question. This is determined by the following formula

Item facility = (number of students who answered the item correctly) / (total number of students who answered the item)

This formula simply calculates the proportion of students who answered the question correctly. There is no fixed cutoff for a good or bad item facility score. Your goal should be to separate the high-ability from the low-ability students in your class with challenging items that have a low item facility score. In addition, there should be several easier items with a high item facility score to support the weaker students as well as serve as warm-ups for the stronger students.
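
A quick, hypothetical example of the calculation in R (the scores are made up):

item_scores <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1) #1 = correct, 0 = incorrect
sum(item_scores) / length(item_scores) #item facility = 0.7, so 70% answered correctly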

Item Discrimination

Item discrimination measures a question’s ability to separate the strong students from the weak ones.

Item discrimination = (# of correct answers in the strong group - # of correct answers in the weak group) / (1/2 x total number of students in the two groups)

The first thing that needs to be done in order to calculate the item discrimination is to divide the class into three groups by rank. The top 1/3 is the strong group, the middle third is the average group and the bottom 1/3 is the weak group. The middle group is removed and you use the data on the strong and the weak to determine the item discrimination.

The results of the item discrimination range from zero (no discrimination) to 1 (perfect discrimination). There are no hard cutoff points for item discrimination. However, items with values near zero are generally removed, while a range of higher values is expected on an exam.
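
Continuing with hypothetical numbers, the calculation for a strong group and a weak group of ten students each might look like this:

strong <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1) #9 of 10 correct in the strong group
weak   <- c(1, 0, 0, 1, 0, 0, 1, 0, 0, 0) #3 of 10 correct in the weak group
(sum(strong) - sum(weak)) / (0.5 * (length(strong) + length(weak))) #item discrimination = 0.6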

Distractor Efficiency

Distractor efficiency looks at the individual responses that the students select in a multiple choice question. For example, if a multiple choice has four possible answers, there should be a reasonable distribution of students who picked the various possible answers.

Distractor efficiency is tabulated by simply counting which answer each student selects for a question. Again, there are no hard rules for removal. However, if nobody selects a distractor, it may not be a good one.
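
A simple tabulation (with made-up selections from 20 students) is all that is needed:

choices <- c("A","A","B","A","C","A","D","B","A","C",
             "A","B","A","A","C","A","B","A","A","C") #hypothetical selections on one item
table(choices) #a distractor that is never chosen may need revision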

Conclusion

Assessing multiple choice questions becomes much more important as class sizes grow or as a test needs to be reused multiple times in various contexts. The information covered here is only an introduction to the much broader subject of item response theory.

Hierarchical Clustering in R

Hierarchical clustering is a form of unsupervised learning. What this means is that the data points lack any form of label, and the purpose of the analysis is to generate labels for them. In other words, we have no Y values in our data.

Hierarchical clustering is commonly an agglomerative technique. This means that each data point starts as its own cluster and clusters are merged over iterations. This is great for small datasets but is difficult to scale. In addition, you need to set the linkage, which determines how the distance between clusters is measured when observations are merged. There are several choices (Ward, complete, single, etc.), and the best choice depends on context; a small sketch comparing two of them follows.
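
To get a feel for how the linkage choice changes the tree, here is a small, self-contained sketch on the built-in “mtcars” data (used purely for illustration; the post’s own analysis of “MedExp” is below).

d.toy <- dist(scale(mtcars[, c("mpg", "hp", "wt")])) #toy distance matrix on three scaled variables
plot(hclust(d.toy, method = "complete"), labels = FALSE, main = "Complete linkage")
plot(hclust(d.toy, method = "ward.D2"), labels = FALSE, main = "Ward linkage")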

In this post, we will conduct a hierarchical clustering analysis of the “MedExp” data from the “Ecdat” package. We are trying to identify distinct subgroups in the sample. The actual hierarchical clustering creates what is called a dendrogram. Below is some initial code.

library(cluster);library(compareGroups);library(NbClust);library(HDclassif);library(sparcl);library(Ecdat)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

For the purposes of this post, the dataset is too big. If we try to do the analysis with over 5,500 observations, it will take a long time. Therefore, we will only use the first 1,000 observations. In addition, we need to remove the factor variables, as this form of hierarchical clustering cannot analyze factor variables. Below is the code.

MedExp_small<-MedExp[1:1000,]
MedExp_small$sex<-NULL
MedExp_small$idp<-NULL
MedExp_small$child<-NULL
MedExp_small$black<-NULL
MedExp_small$physlim<-NULL
MedExp_small$health<-NULL

We now need to scale our data. This is important because different scales will cause different variables to have more or less influence on the results. Below is the code

MedExp_small_df<-as.data.frame(scale(MedExp_small))

We now need to determine how many clusters to create. There is no strict rule on this, but we can use statistical analysis to help us. The “NbClust” package will conduct several different analyses to provide a suggested number of clusters. You have to set the distance, the min/max number of clusters, the method, and the index. The graphs can be understood by looking for the bend or elbow in them; that point suggests the best number of clusters.

numComplete<-NbClust(MedExp_small_df,distance = 'euclidean',min.nc = 2,max.nc = 8,method = 'ward.D2',index = c('all'))

[Figure: Hubert index plot from NbClust]

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

[Figure: D index plot from NbClust]

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 6 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
numComplete$Best.nc
##                     KL       CH Hartigan     CCC    Scott      Marriot
## Number_clusters 2.0000   2.0000   6.0000  8.0000    3.000 3.000000e+00
## Value_Index     2.9814 292.0974  56.9262 28.4817 1800.873 4.127267e+24
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters      6.0   6.0000   3.0000  6.0000  2.000 3.0000
## Value_Index     166569.3 265.6967   5.3929 -0.0913  0.112 1.0987
##                 Silhouette   Duda PseudoT2  Beale Ratkowsky     Ball
## Number_clusters     2.0000 2.0000   2.0000 2.0000    6.0000    3.000
## Value_Index         0.2809 0.9567  16.1209 0.2712    0.2707 1435.833
##                 PtBiserial Frey McClain   Dunn Hubert SDindex Dindex
## Number_clusters     6.0000    1   3.000 3.0000      0  3.0000      0
## Value_Index         0.4102   NA   0.622 0.1779      0  1.9507      0
##                   SDbw
## Number_clusters 3.0000
## Value_Index     0.5195

A simple majority indicates that three clusters is most appropriate. However, four clusters are probably just as good. Every time you do the analysis you may get slightly different results unless you set the seed.

To make our actual clusters we need to calculate the distances between observations using the “dist” function while also specifying how to calculate them. We will calculate distance using the “euclidean” method. Then we will take the distance information and perform the actual clustering using the “hclust” function. Below is the code.

distance<-dist(MedExp_small_df,method = 'euclidean')
hiclust<-hclust(distance,method = 'ward.D2')

We can now plot the results. We will plot “hiclust” and set hang to -1, which places the observation labels at the bottom of the plot. Next, we use the “cutree” function to identify 4 clusters and store this in the “comp” variable. Lastly, we use the “ColorDendrogram” function to highlight our actual clusters.

plot(hiclust,hang=-1, labels=F)
comp<-cutree(hiclust,4)
ColorDendrogram(hiclust,y=comp,branchlength = 100)

[Figure: colored dendrogram showing the four clusters]

We can also create some descriptive stats such as the number of observations per cluster.

table(comp)
## comp
##   1   2   3   4 
## 439 203 357   1

We can also make a table that looks at the descriptive stats by cluster by using the “aggregate” function.

aggregate(MedExp_small_df,list(comp),mean)
##   Group.1         med         lc        lpi       fmde     ndisease
## 1       1  0.01355537 -0.7644175  0.2721403 -0.7498859  0.048977122
## 2       2 -0.06470294 -0.5358340 -1.7100649 -0.6703288 -0.105004408
## 3       3 -0.06018129  1.2405612  0.6362697  1.3001820 -0.002099968
## 4       4 28.66860936  1.4732183  0.5252898  1.1117244  0.564626907
##          linc        lfam    educdec         age
## 1  0.12531718 -0.08861109  0.1149516  0.12754008
## 2 -0.44435225  0.22404456 -0.3767211 -0.22681535
## 3  0.09804031 -0.01182114  0.0700381 -0.02765987
## 4  0.18887531 -2.36063161  1.0070155 -0.07200553

Keep in mind that these are scaled values. Cluster 4 contains a single observation that stands out with an extremely high medical cost (‘med’) along with the highest coinsurance (‘lc’) and education (‘educdec’) values. Among the three larger clusters, cluster 1 has the oldest members (‘age’) on average, cluster 2 has the lowest annual incentive payment (‘lpi’) and income (‘linc’), and cluster 3 has the highest coinsurance (‘lc’). You can make boxplots of each of the stats above. Below is just an example of age by cluster.

MedExp_small_df$cluster<-comp
boxplot(age~cluster,MedExp_small_df)

[Figure: boxplot of age by cluster]

Conclusion

Hierarchical clustering is one way in which to provide labels for data that does not have labels. The main challenge is determining how many clusters to create. However, this can be dealt with through using recommendations that come from various functions in R.

Disease and the English Language

There are many different factors that have influenced and shaped the English language. In this post, we will look specifically at the role of disease in shaping the English language. We will do this by looking at one example from the 14th century, the Bubonic plague.

Background

During this time period, English was an oppressed, second-tier language. The French were in control, even though they were no longer truly French after Philip II of France seized the Norman lands in northern France. Normandy was where the French-English came from, and their original home was taken from them in 1295. Nevertheless, with the French in control, English was little more than a low-level language in a diglossic context.

Latin was another major language of the country, but primarily in the religious sphere. Virtually all priests were fluent in Latin. Mass and other religious ceremonies were also conducted in Latin.

The Plague

The “Black Death,” as it was called, swept through what is now known as England in the mid-14th century. The disease killed about a third of the population of the country. Proportionally, this is the equivalent of almost 2.5 billion people dying today.

The clergy of England were especially hard hit by the dreaded disease. This has to do with the role of the priest in society. The duties of a priest include visiting the sick (gasp!), anointing the dying (gasp!), and performing funerals (gasp!). As such, priests were called on to come into direct contact with those suffering from the plague, and they began to die in large numbers themselves. Those who did not die often ran away.

The loss of the professionally trained clergy led to replacements who were not of the same caliber. In other words, new priests came along who knew only English and had no knowledge of Latin. This meant religious ceremonies and services were now being performed in English and not Latin, not for Protestant reasons but simply out of necessity.

So many people died that it also led to economic chaos. Land prices collapsed as homes of the wealthy sat empty after entire families were wiped out. Fewer workers meant wages skyrocketed, and this caused people to abandon their feudal lords and work in the towns and cities.

The rich who survived did not care for paying higher wages and tried to force the peasants back into their feudal servitude. This led to the Peasants’ Revolt of 1381. The rebellion was only crushed when the shrewd Richard II spoke to the peasants in English to calm them and then quickly betrayed them.

Richard II was not the first to use English to rally the people. This began in 1295 when Normandy was taken. However, in the context of disease and death, there was a recommitment to the use of English.

Conclusion 

Disease led to economic and political catastrophes in 14th century England. Yet there was also an influence on the language itself. Latin decreased in use due to the loss of clergy. In addition, the French-English were being forced more and more to use English to deal with the unruly locals who wanted more freedom.

Using H2o Deep Learning in R

Deep learning is a complex machine learning approach in which new features are created from the variables that were inputted. These new features are used for classifying labeled data. This is all done mostly with artificial neural networks that are multiple layers deep and can involve regularization.

If interpretability is not important and you are in search of the most accurate classification possible, deep learning is a useful tool. It is nearly impossible to explain to the typical stakeholder and is best for just getting the job done.

One of the most accessible packages for deep learning is the “h2o” package. This package connects R to an H2O cluster (started locally by default), which does the heavy computation and returns the results to R. This allows a researcher to do analytics on a much larger scale than base R alone can handle. In this post, we will use deep learning to predict the gender of the head of household in the “VietNamH” dataset from the “Ecdat” package. Below is some initial code.

Data Preparation

library(h2o);library(Ecdat);library(corrplot)
data("VietNamH")
str(VietNamH)
## 'data.frame':    5999 obs. of  11 variables:
##  $ sex     : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ age     : int  68 57 42 72 73 66 73 46 50 45 ...
##  $ educyr  : num  4 8 14 9 1 13 2 9 12 12 ...
##  $ farm    : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ urban   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hhsize  : int  6 6 6 6 8 7 9 4 5 4 ...
##  $ lntotal : num  10.1 10.3 10.9 10.3 10.5 ...
##  $ lnmed   : num  11.23 8.51 8.71 9.29 7.56 ...
##  $ lnrlfood: num  8.64 9.35 10.23 9.26 9.59 ...
##  $ lnexp12m: num  11.23 8.51 8.71 9.29 7.56 ...
##  $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...
corrplot(cor(na.omit(VietNamH[,c(-1,-4,-5,-11)])),method = 'number')

[Figure: correlation plot of the numeric variables]

We need to remove the “commune”, “lnexp12m”, and “lntotal” variables. The “commune” variable should be removed because it doesn’t provide much information. The “lntotal” variable should be removed because it represents the family’s total expenditures, which is already captured by other variables such as food (“lnrlfood”), with which “lntotal” correlates highly. The “lnexp12m” variable should be removed because it has a perfect correlation with “lnmed”. Below is the code

VietNamH$commune<-NULL
VietNamH$lnexp12m<-NULL
VietNamH$lntotal<-NULL

Save as CSV file

We now need to save our modified dataset as a csv file that we can send to h2o. The code is as follows.

write.csv(VietNamH, file="viet.csv",row.names = F)

Connect to H2O

Now we can connect to H2o and start what is called an instance.

localH2O<-h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         50 minutes 18 seconds 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    27 days  
##     H2O cluster name:           H2O_started_from_R_darrin_hsl318 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.44 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)

The output indicates that we are connected. The next step is where it really gets complicated. We need to upload our data to h2o as an h2o dataframe, which is different from a regular data frame. We also need to indicate the location of the csv file on our computer that needs to be converted. All of this is done in the code below.

viet.hex<-h2o.uploadFile(path="/home/darrin/Documents/R working directory/blog/blog/viet.csv",destination_frame = "viet.hex")

In the code above we create an object called “viet.hex”. This object uses the “h2o.uploadFile” function to send our csv to h2o. We can check if everything worked by using the “class” function and the “str” function on “viet.hex”.

class(viet.hex)
## [1] "H2OFrame"
str(viet.hex)
## Class 'H2OFrame' <environment: 0x71f7c18> 
##  - attr(*, "op")= chr "Parse"
##  - attr(*, "id")= chr "viet.hex"
##  - attr(*, "eval")= logi FALSE
##  - attr(*, "nrow")= int 5999
##  - attr(*, "ncol")= int 8
##  - attr(*, "types")=List of 8
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "real"
##   ..$ : chr "enum"
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "real"
##   ..$ : chr "real"
##  - attr(*, "data")='data.frame': 10 obs. of  8 variables:
##   ..$ sex     : Factor w/ 2 levels "female","male": 1 1 2 1 1 1 1 2 2 2
##   ..$ age     : num  68 57 42 72 73 66 73 46 50 45
##   ..$ educyr  : num  4 8 14 9 1 13 2 9 12 12
##   ..$ farm    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
##   ..$ urban   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2
##   ..$ hhsize  : num  6 6 6 6 8 7 9 4 5 4
##   ..$ lnmed   : num  11.23 8.51 8.71 9.29 7.56 ...
##   ..$ lnrlfood: num  8.64 9.35 10.23 9.26 9.59 ...

The “summary” function also provides insight into the data.

summary(viet.hex)
##  sex          age             educyr           farm      urban    
##  male  :4375  Min.   :16.00   Min.   : 0.000   yes:3438  no :4269 
##  female:1624  1st Qu.:37.00   1st Qu.: 3.982   no :2561  yes:1730 
##               Median :46.00   Median : 6.996                      
##               Mean   :48.01   Mean   : 7.094                      
##               3rd Qu.:58.00   3rd Qu.: 9.988                      
##               Max.   :95.00   Max.   :22.000                      
##  hhsize           lnmed            lnrlfood        
##  Min.   : 1.000   Min.   : 0.000   Min.   : 6.356  
##  1st Qu.: 4.000   1st Qu.: 4.166   1st Qu.: 8.372  
##  Median : 5.000   Median : 5.959   Median : 8.689  
##  Mean   : 4.752   Mean   : 5.266   Mean   : 8.680  
##  3rd Qu.: 6.000   3rd Qu.: 7.171   3rd Qu.: 9.001  
##  Max.   :19.000   Max.   :12.363   Max.   :11.384

Create Training and Testing Sets

We now need to create our train and test sets. We need to use slightly different syntax to do this with h2o. The code below is how it is done to create a 70/30 split in the data.

rand<-h2o.runif(viet.hex,seed = 123)
train<-viet.hex[rand<=.7,]
train<-h2o.assign(train, key = "train")
test<-viet.hex[rand>.7,]
test<-h2o.assign(test, key = "test")

Here is what we did

  1. We created an object called “rand” that holds random numbers for our “viet.hex” dataset.
  2. All rows with values less than or equal to .7 were assigned to the “train” variable.
  3. The train variable was given the key name “train” in order to use it in the h2o framework.
  4. All rows with values greater than .7 were assigned to test, and test was given a key name as well.

You can check the proportions of the train and test sets using the “h2o.table” function.

h2o.table(train$sex)
##      sex Count
## 1 female  1146
## 2   male  3058
## 
## [2 rows x 2 columns]
h2o.table(test$sex)
##      sex Count
## 1 female   478
## 2   male  1317
## 
## [2 rows x 2 columns]

Model Development

We can now create our model.

vietdlmodel<-h2o.deeplearning(x=2:8,y=1,training_frame = train,validation_frame = test,seed=123,variable_importances = T)

Here is what the code above means.

  1. We created an object called “vietdlmodel”
  2. We used the “h2o.deeplearning” function.
  3.  x = 2:8 is all the independent variables in the dataframe and y=1 is the first variable “sex”
  4. We set the training and validation frame to “train” and “test” and set the seed.
  5. Finally, we indicated that we want to know the variable importance.

We can check the performance of the model with the code below.

vietdlmodel
## Model Details:
## training
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        female male    Error       Rate
## female    435  711 0.620419  =711/1146
## male      162 2896 0.052976  =162/3058
## Totals    597 3607 0.207659  =873/4204

## testing
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        female male    Error       Rate
## female    151  327 0.684100   =327/478
## male       60 1257 0.045558   =60/1317
## Totals    211 1584 0.215599  =387/1795

There is a lot of output here. For simplicity, we will focus on the confusion matrices for the training and testing sets. The error rate for the training set is about 20.8% and for the testing set it is about 21.6%. Below we can see which variables were most useful.
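
If you want more than the confusion matrices, h2o’s “h2o.performance” function can pull the full set of metrics computed on the validation frame (output not shown here):

h2o.performance(vietdlmodel, valid = TRUE) #AUC, logloss, confusion matrix, etc. on the validation data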

vietdlmodel@model$variable_importances
## Variable Importances: 
##             variable relative_importance scaled_importance percentage
## 1           urban.no            1.000000          1.000000   0.189129
## 2          urban.yes            0.875128          0.875128   0.165512
## 3            farm.no            0.807208          0.807208   0.152666
## 4           farm.yes            0.719517          0.719517   0.136081
## 5                age            0.451581          0.451581   0.085407
## 6             hhsize            0.410472          0.410472   0.077632
## 7           lnrlfood            0.386189          0.386189   0.073039
## 8             educyr            0.380398          0.380398   0.071944
## 9              lnmed            0.256911          0.256911   0.048589
## 10  farm.missing(NA)            0.000000          0.000000   0.000000
## 11 urban.missing(NA)            0.000000          0.000000   0.000000

The numbers speak for themselves. “Urban” and “farm” are the most important variables for predicting sex. Below is the code for obtaining the predicted results and placing them into a dataframe. This is useful if you need to submit final results to a data science competition such as those found at Kaggle.

vietdlPredict<-h2o.predict(vietdlmodel,newdata = test)
vietdlPredict
##   predict     female      male
## 1    male 0.06045560 0.9395444
## 2    male 0.10957121 0.8904288
## 3    male 0.27459108 0.7254089
## 4    male 0.14721353 0.8527865
## 5    male 0.05493486 0.9450651
## 6    male 0.10598351 0.8940165
## 
## [1795 rows x 3 columns]
vietdlPred<-as.data.frame(vietdlPredict)
head(vietdlPred)
##   predict     female      male
## 1    male 0.06045560 0.9395444
## 2    male 0.10957121 0.8904288
## 3    male 0.27459108 0.7254089
## 4    male 0.14721353 0.8527865
## 5    male 0.05493486 0.9450651
## 6    male 0.10598351 0.8940165

Conclusion

This was a complicated experience. However, we learned how to upload data to and download results from h2o.

Tips for Developing Tests

Assessment is a critical component of education. One form of assessment  that is commonly used is testing. In this post, we will look at several practical tips for developing tests.

Consider the Practicality

When developing a test, it is important to consider the time constraints as well as the time it will take to mark the test. For example, essays are a great form of assessment that really encourages critical thinking. However, if the class has 50 students, the practicality of essay tests quickly disappears.

The point is that the teaching context shapes what is considered practical. What is practical can change from year to year as you adjust to new students.

Think about the Reliability

Reliability is the consistency of the score that the student earns. This can be affected by the setting of the test as well as the person who marks it. It is difficult to maintain consistency when marking subjective answers such as short answer or essay questions. However, it is important that this is still done.

Consider Validity

Validity in this context has to do with whether the test covers objectives that were addressed in the actual teaching. Assessing this is subjective but needs to be considered. What is taught is what should be on the test. This is easier said than done, as poor planning can lead to severely poor testing.

The students also need to be somewhat convinced that the testing is appropriate. If not, it can lead to problems and complaints. Furthermore, a test that students see as invalid can lead to cheating, as students will cheat in order to survive.

Make it Authentic

Tests should mimic real-world behaviors whenever possible. This enhances relevance and validity for students. One of the main problems with authentic assessment is what to do when it is time to mark it, since real-world behaviors cannot always be reduced to a single letter grade. This concern is closely related to practicality.

Washback

Washback is the learning a student gains from an assessment. This normally entails some sort of feedback that the teacher provides to the student. This personal attention encourages reflection, which aids comprehension. Often, it will happen after the test as the answers are reviewed.

Conclusion

Tests can be improved by keeping in mind the concepts addressed in this post. Teachers and students can have better experiences with testing by maintaining practical assessments that are valid and that provide authentic experiences as well as insights into how to improve.

Gradient Boosting With Random Forest Classification in R

In this blog, we have already discussed what gradient boosting is. However, for a brief recap, gradient boosting improves model performance by first developing an initial model, called the base learner, using whatever algorithm you choose (linear, tree, etc.).

Gradient boosting then looks at the error in the first model and develops a second model using what is called the loss function. The loss function measures the difference between the current predictions and the desired ones, whether that is accuracy for classification or error for regression. This process is repeated, with the creation of additional models, until a certain level of accuracy or reduction in error is attained.
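To make the idea concrete, below is a small hand-rolled sketch of boosting with a squared-error loss on made-up data (the toy data and object names are purely illustrative and are not used in the example that follows). Each new stump is fit to the residuals of the current prediction and added in with a shrinkage weight:

library(rpart)
set.seed(1)
toy<-data.frame(x = runif(200,0,10))
toy$y<-sin(toy$x)+rnorm(200,sd = 0.3)
pred<-rep(mean(toy$y),nrow(toy)) #base learner: just the overall mean
shrink<-0.1 #shrinkage controls how much each new tree contributes
for(i in 1:100){
  toy$resid<-toy$y-pred #where the current model is still wrong
  stump<-rpart(resid~x,data = toy,maxdepth = 1,cp = 0) #small tree fit to the errors
  pred<-pred+shrink*predict(stump,toy) #add the new tree's contribution
}
mean((toy$y-pred)^2) #training error shrinks as stumps are added

The “gbm” model trained below does essentially this, with a loss function suited to a two-class outcome and many more options.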

This post will provide an example of the use of gradient boosting in random forest classification. Specifically, we will try to predict a person’s labor participation based on several independent variables.

library(randomForest);library(gbm);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data Preparation

We need to transform the ‘age’ variable by multiplying by ten so that the ages are realistic. In addition, we need to convert “lnnlinc” from the log of salary to regular salary. Below is the code to transform these two variables.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We can now create our train and test datasets

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We now need to create our grid and control. The grid allows us to fit several different models with various parameter settings, which is important because the most appropriate model can only be identified by comparison. For gradient boosting we need to set the number of trees we desire, the depth of the trees, the shrinkage (which controls the influence of each tree), and the minimum number of observations in a node. The control allows us to set the cross-validation. Below is the code for the creation of the grid and control.

grid<-expand.grid(.n.trees=seq(200,500,by=200),.interaction.depth=seq(1,3,by=2),.shrinkage=seq(.01,.09,by=.04),
                   .n.minobsinnode=seq(1,5,by=2)) #grid features
control<-trainControl(method="CV",number = 10) #control

Parameter Selection

Now we set our seed and run the gradient boosted model.

set.seed(123)
gbm.lfp.train<-train(lfp~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.lfp.train
## Stochastic Gradient Boosting 
## 
## 636 samples
##   6 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 573, 573, 571, 572, 573, 572, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
##   0.01       1                  1               200      0.6666026
##   0.01       1                  1               400      0.6823306
##   0.01       1                  3               200      0.6588637
##   0.01       1                  3               400      0.6854804
##   0.01       1                  5               200      0.6792769
##   0.01       1                  5               400      0.6823306
##   0.01       3                  1               200      0.6730044
##   0.01       3                  1               400      0.6572051
##   0.01       3                  3               200      0.6793273
##   0.01       3                  3               400      0.6697787
##   0.01       3                  5               200      0.6682914
##   0.01       3                  5               400      0.6650416
##   0.05       1                  1               200      0.6759558
##   0.05       1                  1               400      0.6508040
##   0.05       1                  3               200      0.6681426
##   0.05       1                  3               400      0.6602286
##   0.05       1                  5               200      0.6680441
##   0.05       1                  5               400      0.6570788
##   0.05       3                  1               200      0.6493662
##   0.05       3                  1               400      0.6603518
##   0.05       3                  3               200      0.6540545
##   0.05       3                  3               400      0.6366911
##   0.05       3                  5               200      0.6712428
##   0.05       3                  5               400      0.6445299
##   0.09       1                  1               200      0.6461405
##   0.09       1                  1               400      0.6634768
##   0.09       1                  3               200      0.6571036
##   0.09       1                  3               400      0.6320765
##   0.09       1                  5               200      0.6554922
##   0.09       1                  5               400      0.6540755
##   0.09       3                  1               200      0.6523920
##   0.09       3                  1               400      0.6430140
##   0.09       3                  3               200      0.6430666
##   0.09       3                  3               400      0.6447749
##   0.09       3                  5               200      0.6540522
##   0.09       3                  5               400      0.6524416
##   Kappa    
##   0.3210036
##   0.3611194
##   0.3032151
##   0.3667274
##   0.3472079
##   0.3603046
##   0.3414686
##   0.3104335
##   0.3542736
##   0.3355582
##   0.3314006
##   0.3258459
##   0.3473532
##   0.2961782
##   0.3310251
##   0.3158762
##   0.3308353
##   0.3080692
##   0.2940587
##   0.3170198
##   0.3044814
##   0.2692627
##   0.3378545
##   0.2844781
##   0.2859754
##   0.3214156
##   0.3079460
##   0.2585840
##   0.3062307
##   0.3044324
##   0.3003943
##   0.2805715
##   0.2827956
##   0.2861825
##   0.3024944
##   0.3002135
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 400,
##  interaction.depth = 1, shrinkage = 0.01 and n.minobsinnode = 3.

Gradient boosting provides us with the recommended parameters for our training model as shown above as well as the accuracy and kappa of each model. We also need to recode the dependent variable as 0 and 1 for the ‘gbm’ function.

Model Training

train$lfp=ifelse(train$lfp=="no",0,1)
gbm.lfp<-gbm(lfp~., distribution = 'bernoulli',data=train,n.trees = 400,interaction.depth = 1,shrinkage=.01,n.minobsinnode = 3)

You can see a summary of the most important variables for prediction as well as a plot by using the “summary” function.

summary(gbm.lfp)


##             var   rel.inf
## lnnlinc lnnlinc 28.680447
## age         age 27.451474
## foreign foreign 23.307932
## nyc         nyc 18.375856
## educ       educ  2.184291
## noc         noc  0.000000

Salary (lnnlinc), age, and foreigner status are the most important predictors, followed by the number of younger children (nyc) and, lastly, education. The number of older children (noc) has no effect. We can now test our model on the test set.

Model Testing

gbm.lfp.test<-predict(gbm.lfp,newdata = test,type = 'response', n.trees = 400)

Our test model returns a set of probabilities. We need to convert this to a simple yes or no and this is done in the code below.

gbm.class<-ifelse(gbm.lfp.test<0.5,'no','yes')

We can now look at a table to see how accurate our model is as well as calculate the accuracy.

table(gbm.class,test$lfp)
##          
## gbm.class no yes
##       no  91  39
##       yes 39  67
(accuracy<-(91+67)/(91+67+39+39))
## [1] 0.6694915
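If you would rather not compute the accuracy by hand, the “caret” package (already loaded for the tuning step) has a confusionMatrix() function that reports accuracy along with sensitivity and specificity. An optional one-liner:

confusionMatrix(factor(gbm.class,levels = c("no","yes")),test$lfp,positive = "yes")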

The model is not great. However, you now have an example of how to use gradient boosting to develop a random forest classification model.

Washback

Washback is the effect that testing has on teaching and learning. The term is commonly used in language assessment, but it is not limited to that field. One of the primary concerns of many teachers is developing assessments that provide washback, that is, assessments that enhance students’ learning and understanding of the ideas in a class.

This post will discuss three ways in which washback can be improved in a class. The three ways are…

  • Written feedback on exams
  • Go over the results as a class
  • Meetings with students on exam performance

Written Feedback

Exams or assignments that are highly subjective (i.e., essays) require written feedback in order to provide washback. This means specific, personalized feedback for each student. That is a daunting task for most teachers, especially as classes get larger. However, if your goal is to improve washback, providing written comments is one way to achieve it.

The letter grade or numerical score a student receives on a test does not provide insights into how the student can improve. The reasoning behind what is right or wrong can be provided in the written feedback.

Go Over Answers in Class

Perhaps the most common way to enhance washback is to go over the test in class. This allows the students to learn what the correct answer is, as well as why that answer is correct. In addition, students are given time to ask questions and seek clarification of the reasoning behind the teacher’s marking.

If there were common points of confusion, going over the answers in this way allows the teacher to reteach the confusing concepts. In many ways, the test reveals what was unclear, and the teacher is then able to provide support to achieve mastery.

One-on-One Meetings

For highly complex and extremely subjective forms of assessment (i.e., a research paper), one-on-one meetings may be the most appropriate. This approach requires a more personal touch and a great deal of time.

During the meeting, students can have their questions addressed and learn what they need to do in order to improve. This is a useful method for assignments that require several rounds of feedback in order to be completed.

Conclusion

Washback, if done properly, can help with motivation, autonomy, and self-confidence of students. What this means is that assessment should not only be used for grades but also to develop learning skills.

Understanding Testing

Testing is standard practice in most educational contexts. A teacher needs a way to determine what level of knowledge the students currently have or have gained through the learning experience. However, identifying what testing is and is not has not always been clear.

In this post, we will look at exactly what testing is. In general, testing is a way of measuring a person’s ability and/or knowledge in a given area of study. Specifically, there are five key characteristics of a test, and they are…

  • Systematic
  • Quantifiable
  • Individualistic
  • Competence
  • Domain specific

Systematic

A test must be well organized and structured. For example, the multiple-choice questions are in one section while the short answers are in a different section. If an essay is required, there is a rubric for grading. Directions for all sections are included in the test to explain the expectations to the students.

This is not as easy or as obvious as some may believe. Developing a test takes a great deal of planning before the actual creation of the test.

Quantifiable

Tests are intended to measure something. A test can measure general knowledge, such as a proficiency test of English, or it can be specific, such as a test that only looks at vocabulary memorization. Either way, it is important for both the student and the teacher to know what is being measured.

Another obvious point that test makers sometimes miss is the reporting of results. How many points each section, and even each question, is worth is important for students to know when taking a test. This information is also critical for the person who is responsible for grading the tests.

Individualistic 

Tests are primarily designed to assess a student’s individual knowledge or performance. This is a Western concept of the responsibility of a person to have individual expertise in a field of knowledge.

There are examples of groups working together on tests. However, group work is normally left to projects and not formal modes of assessment such as testing.

Competence

As has already been alluded to, tests assess competence either through the knowledge a person has about a subject or through their performance doing something. For example, a vocabulary test assesses knowledge of words, while a speaking test assesses a person’s ability to use words, that is, their performance.

Generally, a test is either knowledge-based or performance-based. It is possible to blend the two; however, mixing styles raises the complexity not only for the student but also for the person who is responsible for marking the results.

Domain Specific

A test needs to be focused on a specific area of knowledge. A language test, for example, is specific to language. A teacher needs to know in what specific area they are trying to assess students’ knowledge or performance. This is not always easy to define, as there are not only domains but also sub-domains and many other ways to divide up the information in a given course.

Therefore, a teacher needs to identify what students need to know as well as what they should know and assess this information when developing a test. This helps to focus the test on relevant content for the students.

Conclusion

There is art and science to testing. There is no simple solution for how to set up tests that help students. However, the five concepts here provide a framework that can help a teacher get started in developing tests.

Gradient Boosting Of Regression Trees in R

Gradient boosting is a machine learning tool for “boosting” or improving model performance. How it works is that you first develop an initial model, called the base learner, using whatever algorithm you choose (linear, tree, etc.).

Gradient boosting then looks at the error and develops a second model using what is called a loss function. The loss function measures the difference between the current predictions and the desired ones, whether that is accuracy for classification or error for regression. This process of making additional models focused on the remaining error continues until the desired level of accuracy or reduction in error is reached.

Gradient boosting is also stochastic. This means that it randomly draws from the sample as it iterates over the data. This helps to improve accuracy and or reduce error.

In this post, we will use gradient boosting for regression trees. In particular, we will use the “Sacramento” dataset from the “caret” package. Our goal is to predict a house’s price based on the available variables. Below is some initial code

library(caret);library(gbm);library(corrplot)
data("Sacramento")
str(Sacramento)
## 'data.frame':    932 obs. of  9 variables:
##  $ city     : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
##  $ zip      : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
##  $ beds     : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths    : num  1 1 1 1 1 1 2 1 2 2 ...
##  $ sqft     : int  836 1167 796 852 797 1122 1104 1177 941 1146 ...
##  $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price    : int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
##  $ latitude : num  38.6 38.5 38.6 38.6 38.5 ...
##  $ longitude: num  -121 -121 -121 -121 -121 ...

Data Preparation

There are already a few steps we need to take. We need to remove the variables “city” and “zip” because they both have a large number of factor levels. Next, we need to remove “latitude” and “longitude” because these values are hard to interpret in a housing price model. Let’s run the correlations before removing this information.

corrplot(cor(Sacramento[,c(-1,-2,-6)]),method = 'number')


There also appears to be a high correlation between “sqft” and the number of beds and bathrooms. As such, we will remove “sqft” from the model. Below is the code for the revised variables remaining in the model.

sacto.clean<-Sacramento
sacto.clean[,c(1,2,5)]<-NULL
sacto.clean[,c(5,6)]<-NULL
str(sacto.clean)
## 'data.frame':    932 obs. of  4 variables:
##  $ beds : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths: num  1 1 1 1 1 1 2 1 2 2 ...
##  $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price: int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...

We will now develop our training and testing sets

set.seed(502)
ind=sample(2,nrow(sacto.clean),replace=T,prob=c(.7,.3))
train<-sacto.clean[ind==1,]
test<-sacto.clean[ind==2,]

We need to create a grid in order to develop the many different potential models available. We have to tune three different parameters for gradient boosting: the number of trees, the interaction depth, and the shrinkage. The number of trees is how many trees gradient boosting will make, the interaction depth is the number of splits per tree, and the shrinkage controls the contribution of each tree and stump to the final model. We also have to determine the type of cross-validation using the “trainControl” function. Below is the code for the grid.

grid<-expand.grid(.n.trees=seq(100,500,by=200),.interaction.depth=seq(1,4,by=1),.shrinkage=c(.001,.01,.1),
                  .n.minobsinnode=10)
control<-trainControl(method = "CV")

Model Training

We now can train our model

gbm.train<-train(price~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.train

 

Stochastic Gradient Boosting 

685 samples
  4 predictors

No pre-processing
Resampling: Cross-Validated (25 fold) 
Summary of sample sizes: 659, 657, 658, 657, 657, 657, ... 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  RMSE       Rsquared 
  0.001      1                  100      128372.32  0.4850879
  0.001      1                  300      120272.16  0.4965552
  0.001      1                  500      113986.08  0.5064680
  0.001      2                  100      127197.20  0.5463527
  0.001      2                  300      117228.42  0.5524074
  0.001      2                  500      109634.39  0.5566431
  0.001      3                  100      126633.35  0.5646994
  0.001      3                  300      115873.67  0.5707619
  0.001      3                  500      107850.02  0.5732942
  0.001      4                  100      126361.05  0.5740655
  0.001      4                  300      115269.63  0.5767396
  0.001      4                  500      107109.99  0.5799836
  0.010      1                  100      103554.11  0.5286663
  0.010      1                  300       90114.05  0.5728993
  0.010      1                  500       88327.15  0.5838981
  0.010      2                  100       97876.10  0.5675862
  0.010      2                  300       88260.16  0.5864650
  0.010      2                  500       86773.49  0.6007150
  0.010      3                  100       96138.06  0.5778062
  0.010      3                  300       87213.34  0.5975438
  0.010      3                  500       86309.87  0.6072987
  0.010      4                  100       95260.93  0.5861798
  0.010      4                  300       86962.20  0.6011429
  0.010      4                  500       86380.39  0.6082593
  0.100      1                  100       86808.91  0.6022690
  0.100      1                  300       86081.65  0.6100963
  0.100      1                  500       86197.52  0.6081493
  0.100      2                  100       86810.97  0.6036919
  0.100      2                  300       87251.66  0.6042293
  0.100      2                  500       88396.21  0.5945206
  0.100      3                  100       86649.14  0.6088309
  0.100      3                  300       88565.35  0.5942948
  0.100      3                  500       89971.44  0.5849622
  0.100      4                  100       86922.22  0.6037571
  0.100      4                  300       88629.92  0.5894188
  0.100      4                  500       91008.39  0.5718534

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 300, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

The printout shows the values for each potential model. At the bottom are the recommended parameters, which we use to create the model that we will apply to the test data.

gbm.price<-gbm(price~.,data=train,n.trees = 300,interaction.depth = 1,
              shrinkage = .1,distribution = 'gaussian')
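The “stochastic” part of stochastic gradient boosting, mentioned at the start of this post, is controlled in “gbm” by the bag.fraction argument, the share of training rows randomly sampled for each tree (0.5 by default). A hypothetical variant letting each tree see 70% of the rows would look like the sketch below (gbm.price.70 is just an illustrative name and is not used later):

gbm.price.70<-gbm(price~.,data=train,n.trees = 300,interaction.depth = 1,
              shrinkage = .1,distribution = 'gaussian',bag.fraction = .7) #subsample 70% per tree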

Test Model

Now we use the test data. Below, we predict, calculate the error, and make a plot.

gbm.test<-predict(gbm.price,newdata = test,n.trees = 300)
gbm.resid<-gbm.test-test$price
mean(gbm.resid^2)
## [1] 8721772767
plot(gbm.test,test$price)


The actual value for the mean squared error is relative and means nothing by itself; it is only useful when comparing one model to another. The plot, however, looks good and indicates that our model may be doing well.
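One quick way to provide such a comparison is to fit an ordinary linear regression on the same training data and compute its test error next to the boosted model’s. A minimal sketch (lm.price and lm.resid are illustrative names):

lm.price<-lm(price~.,data=train)
lm.resid<-predict(lm.price,newdata = test)-test$price
mean(lm.resid^2) #compare with the boosted model's mean squared error above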

English Language and the Church

The English language during the Middle Ages had a serious struggle with the church of its time. Church officials insisted that the bible should only be published in Latin. This meant that a large number of people had no idea what was happening during a worship service, even though church attendance was mandatory.

One response to this problem was the development of “mystery plays.” These were theatrical performances based on the bible. The topics ranged from Genesis to Revelation and were performed in local languages. However, watching pseudo-movies and reading the text for yourself are widely different experiences.

This post will look at several prominent people’s responses to the suppression of English in religious texts.

John Wycliffe

The lack of scripture in the English language led to John Wycliffe translating the Latin Vulgate into English. Naturally, this was illegal and Wycliffe faced significant trouble over doing this. Despite this, his translation was one of the first translations of the bible into what was called at the time a “vulgar” language.

Wycliffe’s translation was not from the original text but rather from the Latin. This means it was a translation of a translation which nearly destroys the comprehensibility of the text.

William Tyndale

William Tyndale attempted to deal with the challenges of the Wycliffe translation by translating the bible from the original Greek and Hebrew. Tyndale’s translation heavily influenced the English language, as he literally had to create words to capture the meaning of the text. Such phrases as “scapegoat”, “sea-shore”, and “my brother’s keeper” were developed by Tyndale to communicate ideas within the bible. It took him about four years to complete his work.

Naturally, many were not happy with what Tyndale had accomplished, and for his work, Tyndale was put to death.

King James Bible

However, the move away from Latin to English was made complete with the development of the 1611 King James bible. The KJV is named after King James I of England, who sponsored the translation of the bible for political reasons. By the 17th century, there were so many versions of the bible that scholars wanted a definitive translation, and King James I sponsored this effort.

Over fifty scholars worked on this translation for five years. Despite all this work, the 1611 KJV is 60-80% based on Tyndale’s work from a century prior. This makes Tyndale’s achievement all the more amazing: he did the work of fifty scholars in roughly the same amount of time. From this moment, English became known as the language of the preacher.

Conclusion

The role of English in religious matters today is due in part to the work of Wycliffe, Tyndale, and the scholars of the KJV. Their efforts led to supplanting Latin as the language of worship while also contributing many idioms to the English language.

Random Forest Classification in R

This post will cover the use of random forest for classification. Random forest involves the use of many decision trees in the development of a classification or regression model. For classification, each tree casts a vote and the most common class becomes the final prediction for an example; for regression, the trees’ predictions are averaged. The use of an ensemble helps in dealing with the bias-variance tradeoff.

In this example of random forest classification, we will use the “Participation” dataset from the “Ecdat” package. We want to classify people by their labor participation based on the other variables available in the dataset. Below is some initial code.

library(randomForest);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

For the data preparation, we need to multiply age by ten, as the current values imply small children. Furthermore, we need to change the “lnnlinc” variable from the log of salary to the regular salary. After completing these two steps, we need to split our data into training and testing sets. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
#split data
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We will now create our classification model using random forest.

set.seed(123)
rf.lfp<-randomForest(lfp~.,data = train)
rf.lfp
## 
## Call:
##  randomForest(formula = lfp ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 32.39%
## Confusion matrix:
##      no yes class.error
## no  248  93   0.2727273
## yes 113 182   0.3830508

The output is mostly self-explanatory. It includes the number of trees, the number of variables tried at each split, the out-of-bag error rate, and the confusion matrix. In general, our error rate is poor, and we are having a hard time distinguishing between those who work and those who do not based on the variables in the dataset. However, this is based on having all 500 trees in the analysis. Having this many trees is probably not necessary, but we need to confirm this.

We can also plot the error by tree using the “plot” function as shown below.

plot(rf.lfp)


It looks as though the error levels off after a few hundred trees. We can find the exact minimum using the “which.min” function on the “err.rate” component of our model.

which.min(rf.lfp$err.rate[,1])
## [1] 242

According to “which.min”, the out-of-bag error reaches its lowest point at 242 trees. We will now create a new model that contains 395 trees.

rf.lfp2<-randomForest(lfp~.,data = train,ntree=395)
rf.lfp2
## 
## Call:
##  randomForest(formula = lfp ~ ., data = train, ntree = 395) 
##                Type of random forest: classification
##                      Number of trees: 395
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 31.92%
## Confusion matrix:
##      no yes class.error
## no  252  89   0.2609971
## yes 114 181   0.3864407

The results are mostly the same. There is a small decline in error but not much to get excited about. We will now run our model on the test set.

rf.lfptest<-predict(rf.lfp2,newdata=test,type = 'response')
table(rf.lfptest,test$lfp)
##           
## rf.lfptest no yes
##        no  93  48
##        yes 37  58
(93+58)/(93+58+48+37) #calculate accuracy
## [1] 0.6398305

Still disappointing. There is one last chart we should examine, and that is the variable importance plot. It shows which variables are most useful in the prediction process. Below is the code.

varImpPlot(rf.lfp2)


This plot clearly indicates that salary (“lnnlinc”), age, and education are the strongest features for classifying by labor activity. However, the overall model is probably not useful.
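One setting we left at its default is mtry, the number of variables sampled at each split (2 in the output above). The “randomForest” package includes a tuneRF() helper that searches over mtry values using the out-of-bag error; a hedged sketch of what that next step could look like (tune.mtry is just an illustrative name):

set.seed(123)
tune.mtry<-tuneRF(train[,-1],train$lfp,mtryStart = 2,ntreeTry = 395,
                  stepFactor = 2,improve = 0.01) #compares OOB error across mtry values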

Conclusion

This post explained and demonstrated how to conduct a random forest analysis. This form of analysis is powerful in dealing with large datasets with nonlinear relationships among the variables.

Post Norman Conquest Decline of the French Language

After the Norman Conquest of England in 1066, French dominated England for three hundred years. The decline of French can be traced to at least two main reasons, which are…

  • War/politics
  • Disease

This post will examine briefly the role of these two phenomena in shaping the decline of the French language in England as well as the reemergence of English.

War/Politics

The King of Normandy was also the King of England. In 1204, John, King of Normandy and England, lost his Norman territory to the King of France. This left a large number of Norman nobles living in England without any property back in France unless they swore allegiance to the King of France, Philip II. The consequence of this forced choice of loyalty was the development of an English identity among the elite.

In 1295, Philip IV, King of France, threatened to invade England. Edward I, King of England, communicated with the people in English in order to unite them. While speaking to the people in English, Edward I stated that Philip IV intended to destroy the English language. When the French invasion never came, Edward set aside his use of English.

Disease

In the mid-1300s, the Bubonic plague spread through England and wiped out a third of the population. The plague was particularly hard on the clergy, killing almost half of them and removing much of the influence of Latin on English. The replacement clergy used English.

The loss of so many people allowed English-speaking peasants to take over empty homes and demand higher wages. The price of land fell, as there was no one to work the fields, nor was there as much demand for products with so many dead. The bonds of serfdom were severely weakened.

When the nobility tried to push the peasants back onto the land as serfs, it led to several revolts. In these exchanges, both the nobility and the peasants used English. The nobility used English to make promises that were not kept and to put down resistance to their rule.

Aftermath

Through war and disease, English rose to prominence again. By the 1400s, English was the language of education and official business. In 1399, Henry IV was sworn in as king using the English language. After three centuries of suppression, the English language emerged once again as the language of the elite as well as the commoner.

Norman Conquest and the English Language

The year 1066 is highly significant for the English language. This is the year that William, the Duke of Normandy, conquered most of what today is known as Great Britain. The effect of this on the English language was significant.

Background

As background, when the King of England, Edward the Confessor, died, he named William, the Duke of Normandy, as King of England. Edward was childless, but his mother was from Normandy, which is located in France. As such, the English court was already full of French-speaking Normans among Edward’s supporters.

Naming a Norman to the throne of England did not sit well with one of Edward’s biggest rivals, Earl Harold Godwineson. Harold quickly led a rebellion against William but was defeated, and William of Normandy became known as William the Conqueror and was crowned King of England on Christmas Day of 1066.

Aftermath

Over the next three centuries under French rule, the English language absorbed as many as 10,000 French words, among them “city”, “bacon”, “biscuit”, and “felony”, to name a few.

The English court quickly became a French court. All positions of power were taken by Normans. This was not only because of conquest but also because most of the English nobility and leadership had been killed in the Battle of Hastings.

The only way to get ahead in this context was to learn French and leave English at home. In many ways, French became the high language and English was relegated to a low language, almost a diglossic situation. English was the language of the poor and French of the elite. Most documents during this time were produced in French, and even written English was pushed aside.

The division by class has led some to argue that this is what kept English alive. That is to say, the rich and the poor had their own separate languages, and each worked to preserve its own manner of communication.

Conclusion

War is yet another factor to consider when looking at the development of a language. Even without intending to do so, William the Conqueror made a major impact on the English language simply by sticking to his mother tongue of French when he took the English throne. To this day, loan words from French play a major role in communication in the English language.

Random Forest Regression Trees in R

Random forest involves creating multiple decision trees and combining their results. Each tree is built from a bootstrap sample of the data, which contains roughly two-thirds of the distinct observations, and this is done dozens, hundreds, or more times, so every tree is created from a slightly different sample. The results of all these trees are then averaged together. This process of sampling is called bootstrap aggregation, or bagging for short.

While the random forest algorithm is developing these different samples, it also randomly selects which variables are used in each tree. By randomizing the sample and the features used in the tree, random forest is able to reduce both bias and variance in a model. In addition, random forest is robust against outliers and collinearity. Lastly, keep in mind that random forest can be used for both regression and classification trees. The short sketch below illustrates the bootstrap sampling idea.
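Sampling with replacement means that, on average, only about two-thirds of the distinct rows appear in any one bootstrap sample; a quick toy check (the numbers here are purely illustrative):

set.seed(1)
n<-1000
boot.rows<-sample(1:n,n,replace = TRUE) #one bootstrap sample of row indices
length(unique(boot.rows))/n #roughly 0.63 of the rows appear at least once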

In our example, we will use the “Participation” dataset from the “Ecdat” package. We will create a random forest regression tree to predict income of people. Below is some initial code

library(randomForest);library(rpart);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

We now need to prepare the data. We need to transform “lnnlinc” from the log of salary to the actual salary. In addition, we need to multiply “age” by ten, as values such as 3 and 4.5 do not make sense as ages. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

Now we create our training and testing sets.

set.seed(123)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We are now ready to create our model. Below is the code

set.seed(123)
rf.pros<-randomForest(lnnlinc~.,data = train)
rf.pros
## 
## Call:
##  randomForest(formula = lnnlinc ~ ., data = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 529284177
##                     % Var explained: 13.74

As you can see from calling “rf.pros”, the variance explained is low at around 14%. The output also tells us how many trees were created. Fitting more trees than necessary mainly adds computation without improving the model, so we can determine how many trees are optimal by looking at a plot and then using the “which.min” function. Below is a plot of the number of trees by the mean squared error.

plot(rf.pros)


As you can see, as there are more trees, there is less error, up to a certain point. It looks as though about 50 trees is enough. To confirm this guess, we use the “which.min” function. Below is the code.

which.min(rf.pros$mse)
## [1] 45

We need 45 trees to have the lowest error. We will now rerun the model, adding the “ntree” argument to indicate the number of trees we want to generate.

set.seed(123)
rf.pros.45<-randomForest(lnnlinc~.,data = train,ntree=45)
rf.pros.45
## 
## Call:
##  randomForest(formula = lnnlinc ~ ., data = train, ntree = 45) 
##                Type of random forest: regression
##                      Number of trees: 45
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 520705601
##                     % Var explained: 15.13

This model is still not great. We explain a little bit more of the variance and the error decreased slightly. We can now see which of the features in our model are the most useful by using the “varImpPlot” function. Below is the code.

varImpPlot(rf.pros.45)


The higher the IncNodePurity, the more important the variable. As you can see, education is the most important, followed by age and then the number of older children. The raw scores for each variable can be examined using the “importance” function. Below is the code.

importance(rf.pros.45)
##         IncNodePurity
## lfp       16678498398
## age       66716765357
## educ      72007615063
## nyc        9337131671
## noc       31951386811
## foreign   10205305287

We are now ready to test our model with the test set. We will then calculate the residuals and the mean squared error

rf.pros.test<-predict(rf.pros.45,newdata = test)
rf.resid<-rf.pros.test-test$lnnlinc
mean(rf.resid^2)
## [1] 381850711
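This number is hard to judge in isolation. One way to give it context is to fit a single, un-ensembled regression tree on the same split (the “rpart” package was loaded above) and compare test errors; tree.pros.single and tree.resid are illustrative names:

tree.pros.single<-rpart(lnnlinc~.,data = train) #one ordinary regression tree
tree.resid<-predict(tree.pros.single,newdata = test)-test$lnnlinc
mean(tree.resid^2) #compare with the random forest's mean squared error above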

Remember that the mean squared error calculated here is only useful in comparison to other models. Random forest provides a way in which to remove the weaknesses of one decision tree by averaging the results of many. This form of ensemble learning is one of the more powerful algorithms in machine learning.

Understanding Classification Trees Using R

Classification trees are similar to regression trees except that the quality of a split is judged not by the residual sum of squares but by node purity. The strange thing about classification trees is that you can continue to gain information by splitting the tree without necessarily improving the misclassification rate. This is done by calculating a measure of impurity called the Gini index.

The Gini index for a node is calculated from the class proportions in that node: one minus the sum of the squared proportions. For example, if a single node holds 10 examples, 8 from one class and 2 from the other, its Gini index is calculated as follows

n0gini<- 1 - (((8/10)^2) + ((2/10)^2)) 
n0gini
## [1] 0.32

Now, if we split this node into one child with 4 examples, all from the majority class, and a second child with the remaining 6 examples (4 majority, 2 minority), notice the change in the weighted Gini index

n1gini<-1-(((4/4)^2)+((0/4)^2))
n2gini<-1-(((4/6)^2)+((2/6)^2))
newgini<-(.4*n1gini) + (.6*n2gini)
newgini
## [1] 0.2666667

The lower the Gini index, the better, as it measures purity. In this example, the accuracy does not improve, because the majority class is predicted before and after the split and two examples are misclassified either way, yet the weighted Gini index falls from 0.32 to about 0.27. Therefore, classification is about purity and not the residual sum of squares.
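When we grow trees with “rpart” later in this post, the Gini index is the default splitting criterion for classification. It can be stated explicitly, or swapped for information gain (entropy), through the parms argument; a minimal sketch on the built-in iris data (fit.gini and fit.info are illustrative names only):

library(rpart)
fit.gini<-rpart(Species~.,data = iris,parms = list(split = "gini")) #the default criterion
fit.info<-rpart(Species~.,data = iris,parms = list(split = "information")) #entropy instead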

In this post, we will make a classification tree to predict if someone is participating in the labor market. We will do this using the “Participation” dataset from the “Ecdat” package. Below is some initial code to get started.

library(Ecdat);library(rpart);library(partykit)
data(Participation)
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The ‘age’ feature needs to be transformed, since it is doubtful that the survey was conducted among 4- and 5-year-olds; we need to multiply this variable by ten. In addition, the “lnnlinc” feature is the log of income, and we want the actual income, so we will exponentiate this information. Below is the code for these two steps.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We will now create our training and testing datasets with the code below.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We can now create our classification tree and take a look at the output

tree.pros<-rpart(lfp~.,data = train)
tree.pros
## n= 636 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 636 295 no (0.5361635 0.4638365)  
##     2) foreign=no 471 182 no (0.6135881 0.3864119)  
##       4) nyc>=0.5 99  21 no (0.7878788 0.2121212) *
##       5) nyc< 0.5 372 161 no (0.5672043 0.4327957)  
##        10) age>=49.5 110  25 no (0.7727273 0.2272727) *
##        11) age< 49.5 262 126 yes (0.4809160 0.5190840)  
##          22) lnnlinc>=46230.43 131  50 no (0.6183206 0.3816794)  
##            44) noc>=0.5 102  34 no (0.6666667 0.3333333) *
##            45) noc< 0.5 29  13 yes (0.4482759 0.5517241)  
##              90) lnnlinc>=47910.86 22  10 no (0.5454545 0.4545455)  
##               180) lnnlinc< 65210.78 12   3 no (0.7500000 0.2500000) *
##               181) lnnlinc>=65210.78 10   3 yes (0.3000000 0.7000000) *
##              91) lnnlinc< 47910.86 7   1 yes (0.1428571 0.8571429) *
##          23) lnnlinc< 46230.43 131  45 yes (0.3435115 0.6564885) *
##     3) foreign=yes 165  52 yes (0.3151515 0.6848485)  
##       6) lnnlinc>=56365.39 16   5 no (0.6875000 0.3125000) *
##       7) lnnlinc< 56365.39 149  41 yes (0.2751678 0.7248322) *

In the text above, the first split is made on the feature “foreign”, which is a yes-or-no variable: 471 people were not foreigners while 165 were. The proportions in parentheses show that these nodes are far from pure, about 61% “no” among the non-foreigners and about 68% “yes” among the foreigners. For the 165 classified as foreigners, the next split is by their income, and so on. This is hard to follow as text, so below is a diagram of the tree above.

plot(as.party(tree.pros))


We now need to determine whether pruning the tree is beneficial. We do this by looking at the cost complexity table. Below is the code.

tree.pros$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.20677966      0 1.0000000 1.0000000 0.04263219
## 2 0.04632768      1 0.7932203 0.7932203 0.04122592
## 3 0.02033898      4 0.6542373 0.6677966 0.03952891
## 4 0.01016949      5 0.6338983 0.6881356 0.03985120
## 5 0.01000000      8 0.6033898 0.6915254 0.03990308

The “rel error” column indicates that our model is weak no matter how many splits are used; even with 8 splits, the relative error is still about 60%. Below is a plot of the table above.

plotcp(tree.pros)


Based on the table and the plot, we will prune the tree to five splits, taking the complexity parameter (CP) from row four of the table. Below is the code for pruning the tree, followed by a plot of the modified tree.

cp<-tree.pros$cptable[4,"CP"] #complexity parameter for the five-split tree
pruned.tree.pros<-prune(tree.pros,cp=cp)
plot(as.party(pruned.tree.pros))


If you compare the two trees we have developed, one of the main differences is that the pruned tree is missing the “noc” (number of older children) variable. There are also fewer splits on the income variable (lnnlinc). We can now use the pruned tree with the test data set.

party.pros.test<-predict(pruned.tree.pros,newdata=test,type="class")
table(party.pros.test,test$lfp)
##                
## party.pros.test no yes
##             no  90  41
##             yes 40  65

Now for the accuracy

(90+65) / (90+41+40+65)
## [1] 0.6567797

This is surprisingly high compared to the results for the training set, but 65% is not great. However, this is fine for a demonstration.

Conclusion

Classification trees are one of many useful tools available for data analysis. When developing classification trees, one of the key ideas to keep in mind is pruning, as it affects the complexity of the model.

Numeric Prediction with Support Vector Machines in R

In this post, we will look at support vector machines (SVM) for numeric prediction. SVM is used for both classification and numeric prediction. The advantage of SVM for numeric prediction is that SVM will automatically create higher-dimensional versions of the features and summarize this in the output. In other words, unlike in regression, where you have to decide for yourself how to modify your features, SVM does this automatically using different kernels.

Different kernels transform the features in different ways. And the cost function determines the penalty for an example being on the wrong side of the margin developed by the kernel. Remember that SVM draws lines and separators to divide the examples. Examples on the wrong side are penalized as determined by the researcher.

Just like with regression, the model with the least amount of error is generally the best model. As such, the purpose of this post is to use SVM to predict income in the “Mroz” dataset from the “Ecdat” package. We will use several different kernels that transform the features in different ways and calculate the mean squared error to determine the most appropriate model. Below is some initial code.

library(caret);library(e1071);library(corrplot);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

We need to place the factor variables next to each other, as this makes them easier to remove when we scale the data. We must scale the data because SVM is based on distance when making calculations. If the features are on different scales, the larger scale will have more influence on the results. Below is the code.

mroz.scale<-Mroz[,c(17,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18)]
mroz.scale<-as.data.frame(scale(mroz.scale[,c(-1,-2)])) #remove factor variables for scaling
mroz.scale$city<-Mroz$city # add factor variable back into the dataset
mroz.scale$work<-Mroz$work # add factor variable back into the dataset
#mroz.cor<-cor(mroz.scale[,-17:-18])
#corrplot(mroz.cor,method='number', col='black')

Below is the code for creating the train and test datasets.

set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]

Linear Kernel

Our first kernel is the linear kernel. Below is the code. We use the “tune.svm” function from the “e1071” package. We set the kernel to “linear” and pick our own values for the cost function. The numbers for the cost function can be whatever you want. Also, keep in mind that R will produce six different models because we have six different values in the “cost” argument.

The process we are using to develop the models is as follows

  1. Set the seed
  2. Develop the initial model by setting the formula, dataset, kernel, cost function, and other needed information.
  3. Select the best model for the test set
  4. Predict with the best model
  5. Plot the predicted and actual results
  6. Calculate the mean squared error

The first time we will go through this process step-by-step. However, all future models will just have the code followed by an interpretation.

linear.tune<-tune.svm(income~.,data=train,kernel="linear",cost = c(.001,.01,.1,1,5,10))
summary(linear.tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.3492453 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-03 0.6793025  0.2285748
## 2 1e-02 0.3769298  0.1800839
## 3 1e-01 0.3500734  0.1626964
## 4 1e+00 0.3494828  0.1618478
## 5 5e+00 0.3493379  0.1611353
## 6 1e+01 0.3492453  0.1609774

The best model had a cost = 10 with a performance of .35. We will select the best model and use this on our test data. Below is the code.

best.linear<-linear.tune$best.model
tune.test<-predict(best.linear,newdata=test)

Now we will create a plot so we can see how well our model predicts. In addition, we will calculate the mean squared error to have an actual number for our model’s performance.

plot(tune.test,test$income)


tune.test.resid<-tune.test-test$income
mean(tune.test.resid^2)
## [1] 0.215056

The model looks good in the plot. However, we cannot tell whether the error number is decent until it is compared to those of other models.

Polynomial Kernel

The next kernel we will use is the polynomial kernel. This kernel requires two parameters: the degree of the polynomial (3, 4, 5, etc.) as well as the kernel coefficient. Below is the code.

set.seed(123)
poly.tune<-tune.svm(income~.,data = train,kernel="polynomial",degree = c(3,4,5),coef0 = c(.1,.5,1,2,3,4))
best.poly<-poly.tune$best.model
poly.test<-predict(best.poly,newdata=test)
plot(poly.test,test$income)


poly.test.resid<-poly.test-test$income
mean(poly.test.resid^2)
## [1] 0.2453022

The polynomial kernel adds only an insignificant amount of additional error.

Radial Kernel

Next, we will use the radial kernel. One thing that is new here is the need for a parameter called gamma. Below is the code.

set.seed(123)
rbf.tune<-tune.svm(income~.,data=train,kernel="radial",gamma = c(.1,.5,1,2,3,4))
summary(rbf.tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma
##    0.1
## 
## - best performance: 0.5225952 
## 
## - Detailed performance results:
##   gamma     error dispersion
## 1   0.1 0.5225952  0.4183170
## 2   0.5 0.9743062  0.5293211
## 3   1.0 1.0475714  0.5304482
## 4   2.0 1.0582550  0.5286129
## 5   3.0 1.0590367  0.5283465
## 6   4.0 1.0591208  0.5283059
best.rbf<-rbf.tune$best.model
rbf.test<-predict(best.rbf,newdata=test)
plot(rbf.test,test$income)


rbf.test.resid<-rbf.test-test$income
mean(rbf.test.resid^2)
## [1] 0.3138517

The radial kernel performs worse than the linear and polynomial kernels. However, there is not much difference in the performance of the models so far.

Sigmoid Kernel

Next, we will try the sigmoid kernel. The sigmoid kernel relies on a “gamma” parameter and a “coef0” coefficient. Below is the code.

set.seed(123)
sigmoid.tune<-tune.svm(income~., data=train,kernel="sigmoid",gamma = c(.1,.5,1,2,3,4),coef0 = c(.1,.5,1,2,3,4))
summary(sigmoid.tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma coef0
##    0.1     3
## 
## - best performance: 0.8759507 
## 
## - Detailed performance results:
##    gamma coef0        error  dispersion
## 1    0.1   0.1   27.0808221   6.2866615
## 2    0.5   0.1  746.9235624 129.0224096
## 3    1.0   0.1 1090.9660708 198.2993895
## 4    2.0   0.1 1317.4497885 214.7997608
## 5    3.0   0.1 1339.8455047 180.3195491
## 6    4.0   0.1 1299.7469190 201.6901577
## 7    0.1   0.5  151.6070833  38.8450961
## 8    0.5   0.5 1221.2396575 335.4320445
## 9    1.0   0.5 1225.7731007 190.7718103
## 10   2.0   0.5 1290.1784238 216.9249899
## 11   3.0   0.5 1338.1069460 223.3126800
## 12   4.0   0.5 1261.8861304 300.0001079
## 13   0.1   1.0  162.6041229  45.3216740
## 14   0.5   1.0 2276.4330973 330.1739559
## 15   1.0   1.0 2036.4791854 335.8051736
## 16   2.0   1.0 1626.4347749 290.6445164
## 17   3.0   1.0 1333.0626614 244.4424896
## 18   4.0   1.0 1343.7617925 194.2220729
## 19   0.1   2.0   19.2061993   9.6767496
## 20   0.5   2.0 2504.9271757 583.8943008
## 21   1.0   2.0 3296.8519140 542.7903751
## 22   2.0   2.0 2376.8169815 398.1458855
## 23   3.0   2.0 1949.9232179 319.6548059
## 24   4.0   2.0 1758.7879267 313.2581011
## 25   0.1   3.0    0.8759507   0.3812578
## 26   0.5   3.0 1405.9712578 389.0822797
## 27   1.0   3.0 3559.4804854 843.1905348
## 28   2.0   3.0 3159.9549029 492.6072149
## 29   3.0   3.0 2428.1144437 412.2854724
## 30   4.0   3.0 1997.4596435 372.1962595
## 31   0.1   4.0    0.9543167   0.5170661
## 32   0.5   4.0  746.4566494 201.4341061
## 33   1.0   4.0 3277.4331302 527.6037421
## 34   2.0   4.0 3643.6413379 604.2778089
## 35   3.0   4.0 2998.5102806 471.7848740
## 36   4.0   4.0 2459.7133632 439.3389369
best.sigmoid<-sigmoid.tune$best.model
sigmoid.test<-predict(best.sigmoid,newdata=test)
plot(sigmoid.test,test$income)

 


sigmoid.test.resid<-sigmoid.test-test$income
mean(sigmoid.test.resid^2)
## [1] 0.8004045

The sigmoid kernel performed much worse than the other models based on the error metric. You can further see the problems with this model in the plot above.

Conclusion

The final test mean squared errors are as follows (a short sketch for pulling these numbers together in code appears after the list):

  • Linear kernel .21
  • Polynomial kernel .24
  • Radial kernel .31
  • Sigmoid kernel .80
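These numbers can also be collected directly from the residual objects computed above; a small convenience sketch (svm.mse is just an illustrative name):

svm.mse<-c(linear = mean(tune.test.resid^2),poly = mean(poly.test.resid^2),
           radial = mean(rbf.test.resid^2),sigmoid = mean(sigmoid.test.resid^2))
sort(svm.mse) #smallest test error first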

Which model to select depends on the goals of the study. However, it definitely looks as though you would be picking from among the first three models. The power of SVM is the ability to use different kernels to uncover different results without having to really modify the features yourself.

The Beginnings of English

What we now know as English today has a long and complex history. With any subject that is complex, it is necessary to pick a starting point and work from there. For this post, we will date the origins of English from the early 5th century.

Early History

English was not born in England. Rather, it came to England through the invasion of Germanic warriors. These “barbarian” hordes pushed the indigenous Celts and Britons almost into the ocean.

However, it was not only war and conquest that brought English. The roots of English also arrived through the immigration of farmers. Either way, English slowly grew to be one of the prominent languages of England.

In the late sixth century, the Roman Catholic Church came to England. This left a mark on English in the various words taken from Latin and Greek. Such words as “angels”, “pope”, and “minister” all arrived through the Catholic Church.

Vikings and Alfred the Great

By the 8th and 9th centuries the Vikings were invading lands all over Europe. It was the Danes in particular who almost wiped out the inhabitants of England. However, thanks to the craftiness of Alfred the Great, the Danes were defeated, and their leader Guthrum was so shocked at Alfred's comeback victory that he was baptized and became a Christian.

Alfred set to work using the English language to unite the people. He supported education in the English language and the use of the language in general. Furthermore, to try to prevent future conflicts with the Danes, Alfred gave them some territory known as the "Danelaw" where they could live. Naturally, staying in the area meant that the Danish language had an effect on English as well.

Alfred also supported religion. Thanks to the Viking invasions, there were almost no priests left in the entire country. Alfred could barely find a priest who could read Latin. Without religious scholarship, there could be no passing on of religious teachings. This led Alfred to encourage the translation of books in other languages (such as Latin) into English.

Conclusion

The story of English is not one continuous rise to prominence. The language experienced several ups and downs in England. For example, there was a time when the French language almost overran the country. Yet this is a story for another day.

Regression Tree Development in R

In this post, we will take a look at regression trees. Regression trees use a concept called recursive partitioning. Recursive partitioning involves splitting features in a way that reduces the error the most.

The splitting is also greedy, which means that the algorithm will partition the data at one point without considering how it will affect future partitions. Ignoring how a current split affects future splits can lead to unnecessary branches with high variance and low bias.

One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can be hurt when a particular example is assigned the mean of a node. This forced assignment is a loss of information, similar to turning continuous variables into categorical variables.

In this post, we will use the "Participation" dataset from the "Ecdat" package to predict income based on the other variables in the dataset. Below is some initial code.

library(rpart);library(partykit);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

There are several things we need to do to make the results easier to interpret. The "age" variable needs to be multiplied by ten as it currently shows such results as 4.5, 3, etc. Common sense indicates that a four-year-old and a three-year-old are not earning an income.

In addition, we need to convert our income variable (lnnlinc) from the log of income to regular income. This will also make the results easier to understand. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

The next step is to create our training and testing data sets. Below is the code.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We can now develop our model. Printing the model object will then show the splits that were made.

reg.tree<-rpart(lnnlinc~.,data = train)

Below is a printout of the current tree

reg.tree
## n= 636 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 636 390503700000  48405.08  
##     2) educ< 11.5 473 127460900000  43446.69  
##       4) educ< 9.5 335  70269440000  40758.25  
##         8) foreign=yes 129  10617380000  36016.12 *
##         9) foreign=no 206  54934520000  43727.84 *
##       5) educ>=9.5 138  48892370000  49972.98 *
##     3) educ>=11.5 163 217668400000  62793.52  
##       6) age< 34.5 79  34015680000  51323.86  
##        12) age< 25.5 12    984764800  34332.97 *
##        13) age>=25.5 67  28946170000  54367.01 *
##       7) age>=34.5 84 163486000000  73580.46  
##        14) lfp=yes 36  23888410000  58916.66 *
##        15) lfp=no 48 126050900000  84578.31  
##          30) educ< 12.5 29  86940400000  74425.51  
##            60) age< 38.5 8    763764600  57390.34 *
##            61) age>=38.5 21  82970650000  80915.10  
##             122) age>=44 14  34091840000  68474.57 *
##             123) age< 44 7  42378600000 105796.20 *
##          31) educ>=12.5 19  31558550000 100074.70 *

I will not interpret all of this, but here is a brief description using nodes 2, 4, and 8. If a person has less than 11.5 years of education (473 qualify), and less than 9.5 years of education (335 of the 473 qualify), and is a foreigner (129 of the 335 qualify), then their average income is 36,016.12 dollars.

Perhaps now you can see how some information is lost. The average income for people in this node is 36,016.12 dollars, but probably nobody earns exactly this amount.

If what I said does not make sense, here is an actual plot of the current regression tree.

plot(as.party(reg.tree))


The little boxes at the bottom are boxplots of the income values in each terminal node.

Tree modification

We now will make modifications to the tree. We will begin by examining the cptable. Below is the code

reg.tree$cptable
##           CP nsplit rel error    xerror      xstd
## 1 0.11619458      0 1.0000000 1.0026623 0.1666662
## 2 0.05164297      1 0.8838054 0.9139383 0.1434768
## 3 0.03469034      2 0.8321625 0.9403669 0.1443843
## 4 0.02125215      3 0.7974721 0.9387060 0.1433101
## 5 0.01933892      4 0.7762200 0.9260030 0.1442329
## 6 0.01242779      5 0.7568810 0.9097011 0.1434606
## 7 0.01208066      7 0.7320255 0.9166627 0.1433779
## 8 0.01046022      8 0.7199448 0.9100704 0.1432901
## 9 0.01000000      9 0.7094846 0.9107869 0.1427025

The cptable contains a lot of information. First, "CP" stands for complexity parameter and is the column furthest to the left. This number decreases as the tree becomes more complex. "nsplit" indicates the number of splits in the tree. "rel error" is the error, based on the residual sum of squares (RSS), relative to the root node. The "xerror" and "xstd" columns are the cross-validated average error and the standard deviation of that error, respectively.

One thing we can see from the cptable is that nine splits gives the lowest relative error, but the cross-validated error ("xerror") barely improves after the first split or two.

We will now make a plot of the complexity parameter to determine at what point to prune the tree. Pruning helps to remove unnecessary splits that do not improve the model much. The plot is simply a visual version of the cptable. Below is the code.

plotcp(reg.tree)


It appears that a tree of size 2 is the best, but this is boring. The next lowest dip is around a tree of size 8. Therefore, we will prune our tree back to eight splits (row 8 of the cptable). First, we need to pull the corresponding complexity parameter (cp) from that row. Then we use the "prune" function to make the modified tree.

cp<-min(reg.tree$cptable[8,]) #the cp value happens to be the smallest entry in row 8, the row with eight splits
pruned.reg.tree<-prune(reg.tree,cp=cp) #prune the tree using that cp
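
As an aside, if you would rather not read the cp value off the cptable by hand, you can let R pick the split count with the lowest cross-validated error. The sketch below reuses the "reg.tree" object; we will stick with the hand-picked cp for the rest of the post.

opt<-which.min(reg.tree$cptable[,"xerror"]) #row with the lowest cross-validated error
cp.auto<-reg.tree$cptable[opt,"CP"] #complexity parameter for that row
pruned.auto<-prune(reg.tree,cp=cp.auto) #prune to the automatically selected cp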

We will now plot our modified tree

plot(as.party(pruned.reg.tree))


The only difference is the loss of the age node that split at 25.5.

Model Test

We can now test our tree to see how well it performs.

reg.tree.test<-predict(pruned.reg.tree,newdata=test)
reg.tree.resid<-reg.tree.test-test$lnnlinc
mean(reg.tree.resid^2)
## [1] 431928030

The number we calculated is the mean squared error. This number must be compared to models that are developed differently in order to assess the current model. By itself it means little.
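
One simple point of comparison is a baseline that ignores the predictors and always guesses the mean income of the training set. Below is a quick sketch using the objects we already have; if the tree's mean squared error is not clearly lower than this, the tree is not adding much.

baseline<-mean(train$lnnlinc) #mean income in the training data
baseline.resid<-baseline-test$lnnlinc #residuals from always predicting the mean
mean(baseline.resid^2) #baseline mean squared error for comparison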

Conclusion

This post exposed you to regression trees. This type of tree can be used to make numeric predictions with nonlinear data. However, this comes with a loss of information, as the uniqueness of each example is lost when it is placed in a node.

Statistical Models

In research, the term "model" is employed frequently. Normally, a model is some sort of description or explanation of a real-world phenomenon. In data science, we use statistical models. Statistical models use numbers to help us understand something that happens in the real world.


In the real world, quantitative research relies on numeric observations of some phenomenon, behavior, or perception. For example, let's say we have the quiz results of 20 students as shown below.

32 60 95 15 43 22 45 14 48 98 79 97 49 63 50 11 26 52 39 97

This is great information, but what if we want to go beyond how these students did and try to understand how students in the population would do on the quiz? Doing this requires the development of a model.

A model simply tries to describe how the data are generated in terms of whatever we are measuring while allowing for randomness. It helps in summarizing a large collection of numbers while also providing structure to them.

One commonly used model is the normal model. This is the famous bell-curve model that most of us are familiar with. To fit this model we need to calculate the mean and standard deviation, which gives a plot similar to the one below.

[Figure: normal curve based on the mean and standard deviation of the quiz scores]

Now, this model is not completely perfect. For example, a student cannot normally get a score above 100 or below 0 on a quiz. Despite this weakness, the normal distribution gives us an indication of what the population looks like.

With this, we can also calculate the probability of getting a specific score on the quiz. For example, if we want to calculate the probability that a student would get a score of 70 or higher, we can do a simple calculation and find that it is about 26%.
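
Here is a rough sketch of that calculation in R using the twenty quiz scores listed above. The mean and standard deviation summarize the normal model, and "pnorm" gives the probability of falling below a given score.

scores<-c(32,60,95,15,43,22,45,14,48,98,79,97,49,63,50,11,26,52,39,97)
m<-mean(scores) #about 51.75
s<-sd(scores) #about 28.8
1-pnorm(70,mean=m,sd=s) #probability of a score of 70 or higher, roughly 0.26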

Other Options

The normal model is not the only model. There are many different models to match different types of data, such as the gamma, Student's t, binomial, and chi-square distributions. Determining which model to use requires examining the distribution of your data and matching it to an appropriate model.

Another option is to transform the data. This is normally done to make data conform to a normal distribution. Which transformation to employ depends on how the data looks when it is plotted.
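
As a small illustration, the sketch below simulates some right-skewed data (made up purely for demonstration) and shows how a log transformation pulls it toward the familiar bell shape.

set.seed(123)
x<-rlnorm(1000) #simulated right-skewed data
hist(x) #clearly skewed to the right
hist(log(x)) #after the log transform the histogram looks approximately normal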

Conclusion

Modeling helps to bring order to data that has been collected for analysis. By using a model such as the normal distribution, you can begin to make inferences about what the population is like. This allows you to take a handful of data to better understand the world.

K Nearest Neighbor in R

K-nearest neighbor is one of many nonlinear algorithms that can be used in machine learning. By non-linear I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that naturally does not meet the assumptions of linearity.

KNN is also known as a "lazy learner". This means that there are no coefficients or parameter estimates. When doing regression we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is use the K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the value of K that is most accurate based on the different criteria for assessing the model.

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

library(class);library(kknn);library(caret);library(corrplot)
library(reshape2);library(ggplot2);library(pROC);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

Mroz$work<-NULL
mroz.melt<-melt(Mroz,id.var='city')
Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
Mroz_plots


From the plots, it appears there are no clear differences in how the variables act based on whether someone is from the city or not. This may be a flag that classification may not work well.

We now need to scale our data otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set because it’s a factor and then reinsert it into the data set. Below is the code.

mroz.scale<-as.data.frame(scale(Mroz[,-16]))
mroz.scale$city<-Mroz$city

We will now look at our box-plots a second time but this time with scaled data.

mroz.scale.melt<-melt(mroz.scale,id.var="city")
mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
mroz_plot2


This second plot is easier to read but there is still little indication of difference.

We can now move to checking the correlations among the variables. Below is the code

mroz.cor<-cor(mroz.scale[,-17])
corrplot(mroz.cor,method = 'number')


There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem.

We will now divide our dataset into the training and testing sets

set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]

Before creating a model we need to create a grid. We do not know the value of k yet so we have to run multiple models with different values of k in order to determine this for our model. As such we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation to get a better estimate of k as well using the “trainControl” function. The code is below.

grid<-expand.grid(.k=seq(2,20,by=1))
control<-trainControl(method="cv")

Now we train our model.

knn.train<-train(city~.,train,method="knn",trControl=control,tuneGrid=grid)
knn.train
## k-Nearest Neighbors 
## 
## 540 samples
##  16 predictors
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    2  0.6000095  0.1213920
##    3  0.6368757  0.1542968
##    4  0.6424325  0.1546494
##    5  0.6386252  0.1275248
##    6  0.6329998  0.1164253
##    7  0.6589619  0.1616377
##    8  0.6663344  0.1774391
##    9  0.6663681  0.1733197
##   10  0.6609510  0.1566064
##   11  0.6664018  0.1575868
##   12  0.6682199  0.1669053
##   13  0.6572111  0.1397222
##   14  0.6719586  0.1694953
##   15  0.6571425  0.1263937
##   16  0.6664367  0.1551023
##   17  0.6719573  0.1588789
##   18  0.6608811  0.1260452
##   19  0.6590979  0.1165734
##   20  0.6609510  0.1219624
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 14.

Looking at the output, accuracy was used to select the optimal model and the final value chosen was k = 14, although values such as k = 16 and k = 17 performed almost identically. The kappa statistic is a measurement of the accuracy of a model while taking chance agreement into account. For the test set we switch to the "knn" function, which does not use a model formula with the ~ sign the way regression does. Instead, it takes the training predictors, the test predictors, the training labels (a factor), and a number for k; here we use k = 16, one of the nearly equivalent values. This will make more sense when you see the code. We will then look at the confusion table and the accuracy of the model.

knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #column 17 is the outcome 'city'; it is dropped from the predictors and passed separately as the labels
table(knn.test,test$city)
##         
## knn.test  no yes
##      no   19   8
##      yes  61 125
prob.agree<-(19+125)/213 #correct predictions (the diagonal of the table) divided by the total
prob.agree
## [1] 0.6760563

Accuracy is about 68%, which is consistent with what we found when tuning k. We can also calculate the kappa by hand. To do this we need the probability of agreement, which we already stored in "prob.agree", and the probability that the predictions and the actual labels would agree by chance. The chance agreement sums, for each class, the product of the proportion predicted in that class and the proportion actually in that class. Lastly, we calculate the kappa.

prob.chance<-((19+8)/213)*((19+61)/213)+((61+125)/213)*((8+125)/213) #chance agreement summed over both classes
kap<-(prob.agree-prob.chance)/(1-prob.chance)
kap
## [1] 0.2043203

A kappa of about .20 indicates only fair agreement once chance agreement is accounted for, which is in line with the kappa values caret reported during cross-validation.
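
If you would rather not do this arithmetic by hand, the "caret" package we already loaded has a "confusionMatrix" function that reports the accuracy, the kappa, and several other statistics directly from the predictions and the true labels. A quick cross-check using the objects above:

confusionMatrix(knn.test,test$city) #accuracy, kappa, sensitivity, and more in one call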

The example we just did was with unweighted k nearest neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighting methods. "Rectangular" is unweighted and is the one that we used. The other two are "triangular" and "epanechnikov". How these calculate the weights is beyond the scope of this post. In the code below, the "distance" argument is the Minkowski distance parameter: 1 gives absolute (Manhattan) distance and 2 gives Euclidean distance.

kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular",
                                                                      "epanechnikov"))
plot(kknn.train)


kknn.train
## 
## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov"))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14

If you look at the plot, you can see which value of k is best by finding the lowest point on the graph, which is just before 15. The legend indicates that this point belongs to the "rectangular" kernel, which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although this is a slightly different k than the 16 we used on the test set, the misclassification rate is about the same.
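
If you want to apply the weighted model to the test set, the "kknn" package provides a predict method for "train.kknn" objects that uses the best kernel and k found above. Below is a short sketch assuming the "kknn.train" and "test" objects from this post.

kknn.pred<-predict(kknn.train,newdata=test) #predictions from the best kernel and k
table(kknn.pred,test$city) #confusion table on the test set
mean(kknn.pred==test$city) #test accuracy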

Conclusion

In this post, we explored both weighted and unweighted KNN. This algorithm allows you to deal with data that does not meet the assumptions of regression by ignoring the need for parameters. However, because there are no numbers really attached to the results beyond accuracy it can be difficult to explain what is happening in the model to people. As such, perhaps the biggest drawback is communicating results when using KNN.

Common Speech Functions

Functions of speech are different ways of communicating. The differences among the speech functions have to do with the intention of the communication. A different intention or goal leads to the use of a different function of speech. There are many different functions of speech, but we will look at the six that are listed below.

  • Referential
  • Directive
  • Expressive
  • Phatic
  • Poetic
  • Metalinguistic

Referential

Referential speech provides information. For example, a person might share the time with someone (“It’s five o’clock” ). Referential speech can often provide information to a question (“what time is it?”).

Directive

Directives are commands that try to get someone to do something. Examples include "turn left" or "sit down". The context of a directive is one in which something needs to or should be done. As such, one person tries to make one or more other persons do something. Even children use directives with their parents ("give me the ball").

Expressive

Expressive speech shares a person’s feelings. An example would be “I feel happy today!”. Expressive communication can at times provide clear evidence of how someone is doing.

Phatic

Phatic speech is closely related to expressive speech. However, the main difference is that phatic speech is focused on the well-being of others while expressive speech focuses on the feelings of the person speaking.

An example of phatic speech is saying "how are you?". This is clearly a question, but it is focused on how the other person is doing. Another phrase might be "I hope you get well soon." Again, the focus is on the welfare of someone else.

Poetic

Poetic speech is speech that is highly aesthetic. Songs and poetry are examples of language that is poetic in nature. An example would be the famous nursery rhyme "Roses are red, violets are blue…". Poetic speech often has a powerful emotional effect as well.

Metalinguistic 

Metalinguistic speech is communication about language. For example, this entire blog post would be considered by many to be metalinguistic because I am talking about language rather than using language as described in the other functions of speech.

Exceptions

There are many more categories than the ones presented. In addition, the categories presented are not mutually exclusive. Many phrases can be correctly classified into several different categories. For example, if someone says "I love you", you could argue that it is expressive, poetic, or even phatic. What is missing is the context in which such a statement is made.

Conclusion

The ways in which we communicate have been briefly explained here. Understanding how people communicate helps us to better understand those around us and to improve our own style of communicating.

Elastic Net Regression in R

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression have only one.
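
To make the two tuning parameters concrete, the penalty that glmnet (used below) adds to the loss has the form

lambda * ( (1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|) )

so alpha = 0 gives ridge regression, alpha = 1 gives the lasso, and values in between blend the two, while lambda controls how strongly the penalty is applied.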

In this post, we will go through an example of the use of elastic net using the “VietnamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis

library(Ecdat);library(corrplot);library(caret);library(glmnet)
data("VietNamI")
str(VietNamI)
## 'data.frame':    27765 obs. of  12 variables:
##  $ pharvis  : num  0 0 0 1 1 0 0 0 2 3 ...
##  $ lnhhexp  : num  2.73 2.74 2.27 2.39 3.11 ...
##  $ age      : num  3.76 2.94 2.56 3.64 3.3 ...
##  $ sex      : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
##  $ married  : num  1 0 0 1 1 1 1 0 1 1 ...
##  $ educ     : num  2 0 4 3 3 9 2 5 2 0 ...
##  $ illness  : num  1 1 0 1 1 0 0 0 2 1 ...
##  $ injury   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ illdays  : num  7 4 0 3 10 0 0 0 4 7 ...
##  $ actdays  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ insurance: num  0 0 1 1 0 1 1 1 0 0 ...
##  $ commune  : num  192 167 76 123 148 20 40 57 49 170 ...
##  - attr(*, "na.action")=Class 'omit'  Named int 27734
##   .. ..- attr(*, "names")= chr "27734"

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. Code is below.

p.cor<-cor(VietNamI[,-4])
corrplot.mixed(p.cor)


No major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

VietNamI$commune<-NULL
VietNamI_reduced<-VietNamI[1:1000,]
ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3))
train<-VietNamI_reduced[ind==1,]
test<-VietNamI_reduced[ind==2,]

We need to create a grid that will allow us to investigate different models with different combinations of alpha and lambda. This is done using the "expand.grid" function in combination with the "seq" function. Below is the code.

grid<-expand.grid(.alpha=seq(0,1,by=.5),.lambda=seq(0,0.2,by=.1))

We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the "trainControl" function from the "caret" package. In the code below "LOOCV" stands for "leave one out cross-validation".

control<-trainControl(method = "LOOCV")

We are now ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters

enet.train<-train(illdays~.,train,method="glmnet",trControl=control,tuneGrid=grid)
enet.train
## glmnet 
## 
## 694 samples
##  10 predictors
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE      Rsquared 
##   0.0    0.0     5.229759  0.2968354
##   0.0    0.1     5.229759  0.2968354
##   0.0    0.2     5.229759  0.2968354
##   0.5    0.0     5.243919  0.2954226
##   0.5    0.1     5.225067  0.2985989
##   0.5    0.2     5.200415  0.3038821
##   1.0    0.0     5.244020  0.2954519
##   1.0    0.1     5.203973  0.3033173
##   1.0    0.2     5.182120  0.3083819
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2.

The output lists every combination of alpha and lambda that we set in the "grid" variable, along with its RMSE and R-squared, and tells us which combination was best: alpha = 1 and lambda = 0.2. For our purposes, we will instead use alpha = .5 and lambda = .2, which performed almost as well and keeps a genuine mixture of the ridge and lasso penalties.

We will now fit our model and later run it on the test set. We have to convert the "sex" variable to a dummy variable for the "glmnet" function. We next have to make matrices for the predictor variables and for our outcome variable "illdays"

train$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable 
test$sex<-model.matrix( ~ sex - 1, data=test )
predictor_variables<-as.matrix(train[,-9])
days_ill<-as.matrix(train$illdays)
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)

We can now look at the specific coefficients by using the "coef" function.

enet.coef<-coef(enet,lambda=.2,alpha=.5,exact=T)
enet.coef
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   -1.304263895
## pharvis        0.532353361
## lnhhexp       -0.064754000
## age            0.760864404
## sex.sexfemale  0.029612290
## sex.sexmale   -0.002617404
## married        0.318639271
## educ           .          
## illness        3.103047473
## injury         .          
## actdays        0.314851347
## insurance      .

You can see for yourself that several variables were dropped from the model. Education, injury, and insurance have coefficients of zero, so they play no role in predicting the number of days ill, while household expenditure (lnhhexp) and sex are kept but with very small coefficients.

With our model developed, we can now test it using the predict function. However, we first need to convert our test data frame into a matrix and remove the outcome variable from it

test.matrix<-as.matrix(test[,-9])
enet.y<-predict(enet, newx = test.matrix, type = "response", s = .2) #for glmnet objects the lambda value is supplied through the s argument

Let’s plot our results

plot(enet.y)


This does not look good. Let’s check the mean squared error

enet.resid<-enet.y-test$illdays
mean(enet.resid^2)
## [1] 20.18134

We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it.

set.seed(317)
enet.cv<-cv.glmnet(predictor_variables,days_ill,alpha=.5)
plot(enet.cv)


You can see that as the number of features is reduced (see the numbers at the top of the plot) the MSE increases (y-axis). In addition, as lambda increases, the error also increases, but only once the number of variables starts to drop as well.

The dotted vertical lines in the plot represent the lambda with the minimum MSE (on the left) and the lambda one standard error from the minimum (on the right). You can extract these two lambda values using the code below.

enet.cv$lambda.min
## [1] 0.3082347
enet.cv$lambda.1se
## [1] 2.874607
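
Either of these lambda values can be passed to "predict" or "coef" through the "s" argument. As a quick sketch, here is how you could score the test set at the minimum-error lambda for comparison; it reuses the "test.matrix" object created earlier.

enet.y.min<-predict(enet.cv,newx=test.matrix,s="lambda.min") #predictions at the lambda with the lowest cross-validated error
mean((enet.y.min-test$illdays)^2) #test mean squared error at lambda.min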

We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict.

coef(enet.cv,s="lambda.1se")
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)   2.34116947
## pharvis       0.003710399       
## lnhhexp       .       
## age           .       
## sex.sexfemale .       
## sex.sexmale   .       
## married       .       
## educ          .       
## illness       1.817479480
## injury        .       
## actdays       .       
## insurance     .

Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information.

enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',s="lambda.1se") #for cv.glmnet objects the lambda is chosen with the s argument
enet.cv.resid<-enet.y.cv-test$illdays
mean(enet.cv.resid^2)
## [1] 25.47966

Unfortunately, the error actually got slightly worse (about 25.5 versus 20.2). Our model is a mess, but this post served as an example of how to conduct an analysis using elastic net regression.

Exploratory Data Analysis

In data science, exploratory data analysis serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analysis. The steps include the following.

  • Consult your questions
  • Check the structure of the dataset
  • Use visuals

Consult Your Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analysis to conduct as well.

During the data exploration stage, the purpose of a research question is not analysis but rather to determine whether your data can actually provide answers to the question. For example, if you want to know what the average height of men in America is and your data tells you the salary of office workers, there is a problem. Your question (average height) cannot be answered with the data that you have (office workers' salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question that moves away from generalities and deals with a clearly defined phenomenon. For example, "What is the average height of males aged 20-30 in the United States?" This question clearly identifies what we want to know (average height) and among whom (20-30-year-old male Americans).

Not only can you confirm whether your questions are answerable, you can also decide whether you need to be more or less specific with your questions. Returning to our average height question, we may find that we can be more specific and check average height by state if we want. Or we might learn that we can only determine the average height for a region. All this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns are in the dataset and what the sample size is, as well as looking for missing and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analysis, and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the data set. If the stakeholders tell you that there should be several million rows in the data set and you check and there are only several thousand you know there is a problem. This concept also applies to the number of features you expect as well.
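
In R, a few one-liners cover most of this step. The sketch below uses the built-in "iris" data purely as a stand-in for whatever dataset you are exploring.

data(iris)
dim(iris) #number of rows and columns
str(iris) #variable types and a preview of the values
summary(iris) #basic summary statistics for each variable
colSums(is.na(iris)) #count of missing values in each column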

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as look for deviations or outliers. Tables are an excellent way of summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions, tables of summary statistics with bar plots might be sufficient. For comparison questions, summary statistics and boxplots may be enough. For relationship questions, summary statistics with a scatterplot may be enough. Keep in mind that in practice it is often more complicated than this.
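
Again using the built-in "iris" data as a placeholder, here is a minimal sketch of the kinds of tables and plots described above.

table(iris$Species) #frequency table for a categorical variable
boxplot(Sepal.Length~Species,data=iris) #comparison question: a numeric variable split by group
plot(iris$Petal.Length,iris$Petal.Width) #relationship question: scatterplot of two numeric variables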

Conclusion

Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.

Accommodation Theory in Language Communication

Often when people communicate, they will make a subconscious or even a conscious decision to adjust their speech so that it sounds more like, or less like, the speech of the person they are talking with. This is known as accommodation.

In this post, we will look at the following concepts related to accommodation

  • Speech convergence
  • Speech divergence

Speech Convergence

Speech convergence is when people's speech starts to sound similar. Often, this is a sign that the speakers are being polite to each other, like each other, or that one speaker wants to please the other.

Speech convergence is not only for social reasons. Another reason that a person will modify their speech is for the sake of removing technical jargon when dealing with people who are not familiar with it. For example, when a mechanic speaks to a doctor about what is wrong with their car or when a medical doctor speaks to a patient about the patient’s health. The modification happens so that the other person can understand.

Speech convergence can be overdone in terms of how hearers perceive it. For example, if a foreigner sounds too much like a native it can raise suspicion. Furthermore, over-convergence can be perceived as insulting or as making fun of others. As such, some difference is probably wise.

Speech Divergence

Speech divergence happens when people deliberately choose not to mirror each other's speaking styles. The message sent by doing this is that the speakers do not want to accommodate or seem polite, or perhaps that they do not like the people they are communicating with.

Examples of this often involve minority groups who desire to maintain their own cultural identity. Such a group will use their language judiciously, especially around the local dominant culture, as a sign of independence.

Accent divergence is also possible. For example, two people from the same country but different socioeconomic standings may deliberately choose to maintain their specific style of communication to indicate the differences between them.

Conclusion

Convergence and divergence in communication can send many different messages. It is difficult to predict how people will respond when someone converges with or diverges from their speaking style. However, the main motivation for accommodation appears to be how such behavior benefits the communicator.