# Gradient Boosting With Random Forest Classification in R

This blog has already discussed what gradient boosting is. As a brief recap, gradient boosting improves model performance by first developing an initial model, called the base learner, using the algorithm of your choice (linear, tree, etc.).

Gradient boosting then looks at the error of the first model and develops a second model guided by what is called the loss function. The loss function measures the gap between the current predictions and the desired values, whether that is misclassification in classification or error in regression. This process is repeated with the creation of additional models until a certain level of accuracy or reduction in error is attained.
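The loop described above can be sketched in a few lines. This is a minimal illustration for squared-error loss, not the internals of the “gbm” package: the simulated data, the variable names, and the choice of one-split trees from “rpart” as the additional models are all assumptions made for the example.

```r
# Minimal sketch of gradient boosting for squared-error loss.
# The simulated data and all names here are hypothetical.
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))

n_rounds  <- 50    # number of models added after the base learner
shrinkage <- 0.1   # how much each new model contributes

# Base learner: predict the mean of the outcome for everyone
pred <- rep(mean(dat$y), n)
mse_start <- mean((dat$y - pred)^2)

for (i in seq_len(n_rounds)) {
  # For squared-error loss the negative gradient is simply the residual
  dat$resid <- dat$y - pred
  # Fit a one-split tree (a stump) to the current residuals
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0))
  # Add a shrunken version of the stump's prediction to the running prediction
  pred <- pred + shrinkage * predict(stump, newdata = dat)
}

mse_end <- mean((dat$y - pred)^2)
c(start = mse_start, end = mse_end)
```

Each pass fits a small model to what the previous models got wrong, which is why the training error keeps falling as rounds are added. The shrinkage value controls how quickly that happens.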

This post will provide an example of the use of gradient boosting in random forest classification. Specifically, we will try to predict a person’s labor participation based on several independent variables.

library(randomForest);library(gbm);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data Preparation

We need to transform the ‘age’ variable by multiplying by ten so that the ages are realistic. In addition, we need to convert “lnnlinc” from the log of salary to regular salary. Below is the code to transform these two variables.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We can now create our train and test datasets.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We now need to create our grid and control. The grid allows us to create several different models with various parameter settings. This is important because the most appropriate model can only be determined by comparing candidates. Because the model is built from decision trees, we need to set the number of trees we desire, the depth of the trees, the shrinkage which controls the influence of each tree, and the minimum number of observations in a node. The control will allow us to set the cross-validation. Below is the code for the creation of the grid and control.

grid<-expand.grid(.n.trees=seq(200,500,by=200),.interaction.depth=seq(1,3,by=2),.shrinkage=seq(.01,.09,by=.04),
.n.minobsinnode=seq(1,5,by=2)) #grid features
control<-trainControl(method="CV",number = 10) #control

Parameter Selection

Now we set our seed and run the gradient boosted model.

set.seed(123)
gbm.lfp.train<-train(lfp~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.lfp.train
## Stochastic Gradient Boosting
##
## 636 samples
##   6 predictors
##   2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 573, 573, 571, 572, 573, 572, ...
## Resampling results across tuning parameters:
##
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy   Kappa
##   0.01       1                  1               200      0.6666026  0.3210036
##   0.01       1                  1               400      0.6823306  0.3611194
##   0.01       1                  3               200      0.6588637  0.3032151
##   0.01       1                  3               400      0.6854804  0.3667274
##   0.01       1                  5               200      0.6792769  0.3472079
##   0.01       1                  5               400      0.6823306  0.3603046
##   0.01       3                  1               200      0.6730044  0.3414686
##   0.01       3                  1               400      0.6572051  0.3104335
##   0.01       3                  3               200      0.6793273  0.3542736
##   0.01       3                  3               400      0.6697787  0.3355582
##   0.01       3                  5               200      0.6682914  0.3314006
##   0.01       3                  5               400      0.6650416  0.3258459
##   0.05       1                  1               200      0.6759558  0.3473532
##   0.05       1                  1               400      0.6508040  0.2961782
##   0.05       1                  3               200      0.6681426  0.3310251
##   0.05       1                  3               400      0.6602286  0.3158762
##   0.05       1                  5               200      0.6680441  0.3308353
##   0.05       1                  5               400      0.6570788  0.3080692
##   0.05       3                  1               200      0.6493662  0.2940587
##   0.05       3                  1               400      0.6603518  0.3170198
##   0.05       3                  3               200      0.6540545  0.3044814
##   0.05       3                  3               400      0.6366911  0.2692627
##   0.05       3                  5               200      0.6712428  0.3378545
##   0.05       3                  5               400      0.6445299  0.2844781
##   0.09       1                  1               200      0.6461405  0.2859754
##   0.09       1                  1               400      0.6634768  0.3214156
##   0.09       1                  3               200      0.6571036  0.3079460
##   0.09       1                  3               400      0.6320765  0.2585840
##   0.09       1                  5               200      0.6554922  0.3062307
##   0.09       1                  5               400      0.6540755  0.3044324
##   0.09       3                  1               200      0.6523920  0.3003943
##   0.09       3                  1               400      0.6430140  0.2805715
##   0.09       3                  3               200      0.6430666  0.2827956
##   0.09       3                  3               400      0.6447749  0.2861825
##   0.09       3                  5               200      0.6540522  0.3024944
##   0.09       3                  5               400      0.6524416  0.3002135
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 400,
##  interaction.depth = 1, shrinkage = 0.01 and n.minobsinnode = 3.

Gradient boosting provides us with the recommended parameters for our training model, as shown above, as well as the accuracy and kappa of each model. We also need to recode the dependent variable as 0 and 1 for the ‘gbm’ function.

Model Training

train$lfp=ifelse(train$lfp=="no",0,1)
gbm.lfp<-gbm(lfp~., distribution = 'bernoulli',data=train,n.trees = 400,interaction.depth = 1,shrinkage=.01,n.minobsinnode = 3)

You can see a summary of the most important variables for prediction, as well as a plot, by using the “summary” function.

summary(gbm.lfp)
##             var   rel.inf
## lnnlinc lnnlinc 28.680447
## age         age 27.451474
## foreign foreign 23.307932
## nyc         nyc 18.375856
## educ       educ  2.184291
## noc         noc  0.000000

Salary (lnnlinc), age and foreigner status are the most important predictors, followed by number of younger children (nyc) and lastly education. Number of older children (noc) has no effect. We can now test our model on the test set.

Model Testing

gbm.lfp.test<-predict(gbm.lfp,newdata = test,type = 'response', n.trees = 400)

Our test model returns a set of probabilities. We need to convert this to a simple yes or no, and this is done in the code below.

gbm.class<-ifelse(gbm.lfp.test<0.5,'no','yes')

We can now look at a table to see how accurate our model is, as well as calculate the accuracy.

table(gbm.class,test$lfp)
##
## gbm.class no yes
##       no  91  39
##       yes 39  67
(accuracy<-(91+67)/(91+67+39+39))
## [1] 0.6694915
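The accuracy above was typed in by hand from the table. As a small sketch, the same figure, along with the kappa statistic that caret reported during tuning, can be computed from the confusion matrix directly. The matrix below is copied from the printed table. The object names are hypothetical, and the kappa formula is the standard Cohen's kappa rather than anything taken from caret's source.

```r
# Confusion matrix copied from the printed table (rows = predicted, columns = actual)
conf <- matrix(c(91, 39, 39, 67), nrow = 2,
               dimnames = list(predicted = c("no", "yes"),
                               actual    = c("no", "yes")))

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected  <- sum(rowSums(conf) * colSums(conf)) / sum(conf)^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
```

With a live model you would pass `table(gbm.class,test$lfp)` in place of the hand-typed matrix.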

The model is not great. However, you now have an example of how to use gradient boosting to develop a random forest classification model.

# Gradient Boosting Of Regression Trees in R

Gradient boosting is a machine learning tool for “boosting” or improving model performance. It works by first developing an initial model, called the base learner, using the algorithm of your choice (linear, tree, etc.).

Gradient boosting looks at the error and develops a second model using what is called a loss function. The loss function measures the gap between the current predictions and the desired values, whether that is misclassification in classification or error in regression. This process of fitting additional models to the remaining errors continues until the desired level of accuracy is reached.

Gradient boosting is also stochastic. This means that it randomly draws from the sample as it iterates over the data. This helps to improve accuracy and/or reduce error.

In this post, we will use gradient boosting for regression trees. In particular, we will use the “Sacramento” dataset from the “caret” package. Our goal is to predict a house’s price based on the available variables. Below is some initial code.

library(caret);library(gbm);library(corrplot)
data("Sacramento")
str(Sacramento)
## 'data.frame':    932 obs. of  9 variables:
##  $ city     : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
##  $ zip      : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
##  $ beds     : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths    : num  1 1 1 1 1 1 2 1 2 2 ...
##  $ sqft     : int  836 1167 796 852 797 1122 1104 1177 941 1146 ...
##  $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price    : int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
##  $ latitude : num  38.6 38.5 38.6 38.6 38.5 ...
##  $ longitude: num  -121 -121 -121 -121 -121 ...

Data Preparation

Already there are some actions that need to be taken. We need to remove the variables “city” and “zip” because they both have a large number of levels. Next, we need to remove “latitude” and “longitude” because these values are hard to interpret in a housing price model. Let’s run the correlations before removing this information.

corrplot(cor(Sacramento[,c(-1,-2,-6)]),method = 'number')

There also appears to be a high correlation between “sqft” and beds and bathrooms. As such, we will remove “sqft” from the model. Below is the code for the revised variables remaining for the model.

sacto.clean<-Sacramento
sacto.clean[,c(1,2,5)]<-NULL
sacto.clean[,c(5,6)]<-NULL
str(sacto.clean)
## 'data.frame':    932 obs. of  4 variables:
##  $ beds : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths: num  1 1 1 1 1 1 2 1 2 2 ...
##  $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price: int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...

We will now develop our training and testing sets.

set.seed(502)
ind=sample(2,nrow(sacto.clean),replace=T,prob=c(.7,.3))
train<-sacto.clean[ind==1,]
test<-sacto.clean[ind==2,]

We need to create a grid in order to develop the many different potential models available. We have to tune three different parameters for gradient boosting: number of trees, interaction depth, and shrinkage. Number of trees is how many trees gradient boosting will make, interaction depth is the number of splits, and shrinkage controls the contribution of each tree or stump to the final model. We also have to determine the type of cross-validation using the “trainControl” function. Below is the code for the grid.

grid<-expand.grid(.n.trees=seq(100,500,by=200),.interaction.depth=seq(1,4,by=1),.shrinkage=c(.001,.01,.1),
.n.minobsinnode=10)
control<-trainControl(method = "CV")

Model Training

We now can train our model.

gbm.train<-train(price~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.train
## Stochastic Gradient Boosting
##
## 685 samples
##   4 predictors
##
## No pre-processing
## Resampling: Cross-Validated (25 fold)
## Summary of sample sizes: 659, 657, 658, 657, 657, 657, ...
## Resampling results across tuning parameters:
##
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared
##   0.001      1                  100      128372.32  0.4850879
##   0.001      1                  300      120272.16  0.4965552
##   0.001      1                  500      113986.08  0.5064680
##   0.001      2                  100      127197.20  0.5463527
##   0.001      2                  300      117228.42  0.5524074
##   0.001      2                  500      109634.39  0.5566431
##   0.001      3                  100      126633.35  0.5646994
##   0.001      3                  300      115873.67  0.5707619
##   0.001      3                  500      107850.02  0.5732942
##   0.001      4                  100      126361.05  0.5740655
##   0.001      4                  300      115269.63  0.5767396
##   0.001      4                  500      107109.99  0.5799836
##   0.010      1                  100      103554.11  0.5286663
##   0.010      1                  300       90114.05  0.5728993
##   0.010      1                  500       88327.15  0.5838981
##   0.010      2                  100       97876.10  0.5675862
##   0.010      2                  300       88260.16  0.5864650
##   0.010      2                  500       86773.49  0.6007150
##   0.010      3                  100       96138.06  0.5778062
##   0.010      3                  300       87213.34  0.5975438
##   0.010      3                  500       86309.87  0.6072987
##   0.010      4                  100       95260.93  0.5861798
##   0.010      4                  300       86962.20  0.6011429
##   0.010      4                  500       86380.39  0.6082593
##   0.100      1                  100       86808.91  0.6022690
##   0.100      1                  300       86081.65  0.6100963
##   0.100      1                  500       86197.52  0.6081493
##   0.100      2                  100       86810.97  0.6036919
##   0.100      2                  300       87251.66  0.6042293
##   0.100      2                  500       88396.21  0.5945206
##   0.100      3                  100       86649.14  0.6088309
##   0.100      3                  300       88565.35  0.5942948
##   0.100      3                  500       89971.44  0.5849622
##   0.100      4                  100       86922.22  0.6037571
##   0.100      4                  300       88629.92  0.5894188
##   0.100      4                  500       91008.39  0.5718534
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 300,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

The printout shows you the values for each potential model. At the bottom of the printout are the recommended parameters for our model. We take the values at the bottom to create our model for the test data.

gbm.price<-gbm(price~.,data=train,n.trees = 300,interaction.depth = 1,
shrinkage = .1,distribution = 'gaussian')

Test Model

Now we use the test data. Below we predict, calculate the error, and make a plot.

gbm.test<-predict(gbm.price,newdata = test,n.trees = 300)
gbm.resid<-gbm.test-test$price
mean(gbm.resid^2)
## [1] 8721772767
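The mean squared error is in squared dollars, which is why the number looks so large. A short sketch of one way to make it easier to read: take the square root so the error is back on the scale of price, and set it next to the cross-validated RMSE that caret reported for the selected model. Both numbers are copied from the output in this post, and the object names are hypothetical.

```r
mse  <- 8721772767   # test-set mean squared error printed above
rmse <- sqrt(mse)    # back on the scale of price (dollars)
rmse

# Cross-validated RMSE of the selected model, copied from the caret printout
cv_rmse <- 86081.65
rmse / cv_rmse       # ratio of test error to cross-validated error
```

A ratio close to 1 suggests the test error is in line with what cross-validation estimated on the training data.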
plot(gbm.test,test$price)

The actual value of the mean squared error is relative and means nothing by itself; it is only useful when comparing one model to another. The plot, however, looks good and indicates that our model may be doing well.

# Random Forest Classification in R

This post will cover the use of random forest for classification. Random forest involves the use of many decision trees in the development of a classification or regression model. The results of the individual trees are combined, by averaging for regression or by majority vote for classification, to produce the final prediction for an example. The use of an ensemble helps in dealing with the bias-variance tradeoff.

In the example of random forest classification, we will use the “Participation” dataset from the “Ecdat” package. We want to classify people by their labor participation based on the other variables available in the dataset. Below is some initial code.

library(randomForest);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

For the data preparation, we need to multiply age by ten as the current values imply small children. Furthermore, we need to change the “lnnlinc” variable from the log of salary to just the regular salary. After completing these two steps, we need to split our data into training and testing sets. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
#split data
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
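It is worth noting what the split above does. `sample(2, ...)` assigns every row to group 1 or 2 independently, so the 70/30 proportions are only approximate. The sketch below shows that approach next to an alternative that draws exactly 70% of the row indices. The alternative is an illustration rather than the method used in this post, and the object names are hypothetical.

```r
n <- 872  # number of rows in Participation, from the str() output

# Approach used in the post: each row is assigned to group 1 or 2 independently,
# so the 70/30 proportions are only approximate
set.seed(502)
ind <- sample(2, n, replace = TRUE, prob = c(.7, .3))
prop.table(table(ind))

# Alternative (an assumption, not the post's method): draw exactly 70% of the row indices
set.seed(502)
train_rows <- sample.int(n, size = floor(0.7 * n))
test_rows  <- setdiff(seq_len(n), train_rows)
c(train = length(train_rows), test = length(test_rows))
```

Either approach is fine for a demonstration. The exact version is handy when you need the training set size to be predictable.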

We will now create our classification model using random forest.

set.seed(123)
rf.lfp<-randomForest(lfp~.,data = train)
rf.lfp
##
## Call:
##  randomForest(formula = lfp ~ ., data = train)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 32.39%
## Confusion matrix:
##      no yes class.error
## no  248  93   0.2727273
## yes 113 182   0.3830508

The output is mostly self-explanatory. It includes the number of trees, number of variables at each split, error rate, and the confusion matrix. In general, our error rate is poor and we are having a hard time distinguishing between those who work and do not work based on the variables in the dataset. However, this is based on having all 500 trees in the analysis. Having this many trees is probably not necessary but we need to confirm this.
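To see where the error figures in the printout come from, here is a small sketch that recomputes them from the confusion matrix. The counts are copied from the output above and the object names are hypothetical. In this matrix the rows are the actual classes and the columns are the out-of-bag predictions.

```r
# OOB confusion matrix copied from the printout (rows = actual class, columns = predicted)
oob <- matrix(c(248, 113, 93, 182), nrow = 2,
              dimnames = list(actual    = c("no", "yes"),
                              predicted = c("no", "yes")))

# Per-class error: misclassified cases divided by the row total
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal cases divided by the total
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(100 * oob_error, 2)
```

The per-class errors show that the model has more trouble with the “yes” class than with the “no” class.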

We can also plot the error by tree using the “plot” function as shown below.

plot(rf.lfp)

It looks as though the error is lowest with around 400 trees. We can confirm this using the “which.min” function and call information from “err.rate” in our model.

which.min(rf.lfp$err.rate[,1])
## [1] 242

We need 395 trees in order to reduce the error rate to its lowest level. We will now create a new model that contains 395 trees in it.

rf.lfp2<-randomForest(lfp~.,data = train,ntree=395)
rf.lfp2
##
## Call:
##  randomForest(formula = lfp ~ ., data = train, ntree = 395)
##                Type of random forest: classification
##                      Number of trees: 395
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 31.92%
## Confusion matrix:
##      no yes class.error
## no  252  89   0.2609971
## yes 114 181   0.3864407

The results are mostly the same. There is a small decline in error but not much to get excited about. We will now run our model on the test set.

rf.lfptest<-predict(rf.lfp2,newdata=test,type = 'response')
table(rf.lfptest,test$lfp)
##
## rf.lfptest no yes
##        no  93  48
##        yes 37  58
(92+63)/(92+63+43+38) #calculate accuracy
## [1] 0.6567797

This is still disappointing. There is one last chart we should examine, and that is the variable importance plot. It shows which variables are most useful in the prediction process. Below is the code.

varImpPlot(rf.lfp2)

This plot clearly indicates that salary (“lnnlinc”), age, and education are the strongest features for classifying by labor activity. However, the overall model is probably not useful.

Conclusion

This post explained and demonstrated how to conduct a random forest analysis. This form of analysis is powerful in dealing with large datasets with nonlinear relationships among the variables.

# Random Forest Regression Trees in R

Random forest involves creating multiple decision trees and combining their results. R does this by drawing a bootstrap sample, which contains roughly 2/3 of the distinct observations in the data set, to develop each decision tree. This is done dozens, hundreds, or more times. Every tree is created with a slightly different sample. The results of all these trees are then averaged together. This process of sampling is called bootstrap aggregation, or bagging for short.
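The “2/3 of the data” figure comes from sampling with replacement. A quick sketch, using a sample size that matches the dataset used later in this post (the function name is hypothetical), shows that a bootstrap sample of size n contains about 63% of the distinct observations.

```r
set.seed(123)
n <- 872  # size of the Participation data set used later in this post

# Share of distinct observations that end up in one bootstrap sample
unique_share <- function(n) {
  boot <- sample.int(n, size = n, replace = TRUE)
  length(unique(boot)) / n
}

shares <- replicate(500, unique_share(n))
mean(shares)   # close to 1 - exp(-1)
1 - exp(-1)
```

The observations left out of a given sample are the “out-of-bag” cases, which random forest uses to estimate error without a separate validation set.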

While the random forest algorithm is developing different samples, it also randomly selects which variables are to be used in each tree that is developed. By randomizing the sample and the features used in the tree, random forest is able to reduce both bias and variance in a model. In addition, random forest is robust against outliers and collinearity. Lastly, keep in mind that random forest can be used for regression and classification trees.

In our example, we will use the “Participation” dataset from the “Ecdat” package. We will create a random forest regression tree to predict income of people. Below is some initial code.

library(randomForest);library(rpart);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

We now need to prepare the data. We need to transform “lnnlinc” from a log of salary to the actual salary. In addition, we need to multiply “age” by ten, as values such as 3.4 and 4.5 do not make sense as ages. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

Now we create our training and testing sets.

set.seed(123)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We are now ready to create our model. Below is the code.

set.seed(123)
rf.pros<-randomForest(lnnlinc~.,data = train)
rf.pros
##
## Call:
##  randomForest(formula = lnnlinc ~ ., data = train)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 529284177
##                     % Var explained: 13.74

As you can see from calling “rf.pros”, the variance explained is low at around 14%. The output also tells us how many trees were created. More trees than necessary add computation without much benefit, so it is worth checking how many are needed. We can determine how many trees are optimal by looking at a plot and then using the “which.min” function. Below is a plot of the number of trees by the mean squared error.

plot(rf.pros)

As you can see, as there are more trees there is less error up to a certain point. It looks as though about 50 trees is enough. To confirm this guess, we use the “which.min” function. Below is the code.

which.min(rf.pros$mse)
## [1] 45

We need 45 trees to have the lowest error. We will now rerun the model and add an argument called “ntree” to indicate the number of trees we want to generate.

set.seed(123)
rf.pros.45<-randomForest(lnnlinc~.,data = train,ntree=45)
rf.pros.45
##
## Call:
##  randomForest(formula = lnnlinc ~ ., data = train, ntree = 45)
##                Type of random forest: regression
##                      Number of trees: 45
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 520705601
##                     % Var explained: 15.13
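The two summary lines in the printout are related. A minimal sketch of that relationship, assuming the usual definition of variance explained as one minus the mean squared error divided by the variance of the outcome; the helper name and the toy numbers are made up, and “randomForest” computes its own figures from out-of-bag predictions rather than from a separate set of predictions like this one.

```r
# Hypothetical helper: percent of variance explained from observed values and predictions
pct_var_explained <- function(y, pred) {
  mse   <- mean((y - pred)^2)      # mean of squared residuals
  var_y <- mean((y - mean(y))^2)   # variance of the outcome (divisor n)
  100 * (1 - mse / var_y)
}

# Toy example with made-up numbers
y    <- c(10, 12, 15, 20, 23)
pred <- c(11, 13, 14, 18, 24)
pct_var_explained(y, pred)

# Predicting the mean for everyone explains 0% of the variance
pct_var_explained(y, rep(mean(y), length(y)))
```

Read this way, a value of about 15% means the model's squared error is only about 15% smaller than what you would get by predicting the average income for everyone.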

This model is still not great. We explain a little bit more of the variance and the error decreased slightly. We can now see which of the features in our model are the most useful by using the “varImpPlot” function. Below is the code.

varImpPlot(rf.pros.45)

The higher the IncNodePurity the more important the variable. As you can see, education is most important followed by age and then the number of older children. The raw scores for each variable can be examined using the “importance” function. Below is the code.

importance(rf.pros.45)
##         IncNodePurity
## lfp       16678498398
## age       66716765357
## educ      72007615063
## nyc        9337131671
## noc       31951386811
## foreign   10205305287
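The raw IncNodePurity scores are hard to compare at a glance. As a small sketch, the values can be sorted and expressed as a share of the total. The numbers are copied from the output above and the object names are hypothetical.

```r
# IncNodePurity values copied from the importance() output
imp <- c(lfp = 16678498398, age = 66716765357, educ = 72007615063,
         nyc = 9337131671, noc = 31951386811, foreign = 10205305287)

# Rank the variables and express each as a share of the total
imp_sorted <- sort(imp, decreasing = TRUE)
round(100 * imp_sorted / sum(imp_sorted), 1)
```

With a live model, `importance(rf.pros.45)[, "IncNodePurity"]` would take the place of the hand-typed vector.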

We are now ready to test our model with the test set. We will then calculate the residuals and the mean squared error.

rf.pros.test<-predict(rf.pros.45,newdata = test)
rf.resid<-rf.pros.test-test$lnnlinc
mean(rf.resid^2)
## [1] 381850711

Remember that the mean squared error calculated here is only useful in comparison to other models. Random forest provides a way in which to remove the weaknesses of one decision tree by averaging the results of many. This form of ensemble learning is one of the more powerful algorithms in machine learning.

# Understanding Classification Trees Using R

Classification trees are similar to regression trees except that the determinant of success is not the residual sum of squares but rather the error rate. The strange thing about classification trees is that you can continue to gain information in splitting the tree without necessarily improving the misclassification rate. This is done through calculating a measure of error called the Gini coefficient.

The Gini coefficient is calculated from the proportions of correct and incorrect classifications in a node: one minus the sum of the squared proportions. For example, if we have a node that is 80% accurate with a 20% error rate, the Gini coefficient is calculated as follows for a single node.

n0gini<- 1 - (((8/10)^2) + ((2/10)^2))
n0gini
## [1] 0.32

Now if we split this into two nodes, one with 6 observations (5 correct) and one with 4 observations (3 correct), notice the change in the Gini coefficient.

n1gini<-1-(((5/6)^2)+((1/6)^2))
n2gini<-1-(((3/4)^2)+((1/4)^2))
newgini<-(.6*n1gini) + (.4*n2gini)
newgini
## [1] 0.3166667

The lower the Gini coefficient the better, as it measures purity. In the example, there is no improvement in the accuracy (8 of 10 are correct before and after the split), yet there is an improvement in the Gini coefficient. Therefore, classification is about purity and not the residual sum of squares.

In this post, we will make a classification tree to predict if someone is participating in the labor market. We will do this using the “Participation” dataset from the “Ecdat” package. Below is some initial code to get started.

library(Ecdat);library(rpart);library(partykit)
data(Participation)
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The ‘age’ feature needs to be transformed. Since it is doubtful that the survey was conducted among 4 and 5-year-olds, we need to multiply this variable by ten. In addition, the “lnnlinc” feature is the log of income and we want the actual income, so we will exponentiate this information. Below is the code for these two steps.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

We will now create our training and testing datasets with the code below.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]

We can now create our classification tree and take a look at the output.

tree.pros<-rpart(lfp~.,data = train)
tree.pros
## n= 636
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##   1) root 636 295 no (0.5361635 0.4638365)
##     2) foreign=no 471 182 no (0.6135881 0.3864119)
##       4) nyc>=0.5 99  21 no (0.7878788 0.2121212) *
##       5) nyc< 0.5 372 161 no (0.5672043 0.4327957)
##        10) age>=49.5 110  25 no (0.7727273 0.2272727) *
##        11) age< 49.5 262 126 yes (0.4809160 0.5190840)
##          22) lnnlinc>=46230.43 131  50 no (0.6183206 0.3816794)
##            44) noc>=0.5 102  34 no (0.6666667 0.3333333) *
##            45) noc< 0.5 29  13 yes (0.4482759 0.5517241)
##              90) lnnlinc>=47910.86 22  10 no (0.5454545 0.4545455)
##               180) lnnlinc< 65210.78 12   3 no (0.7500000 0.2500000) *
##               181) lnnlinc>=65210.78 10   3 yes (0.3000000 0.7000000) *
##              91) lnnlinc< 47910.86 7   1 yes (0.1428571 0.8571429) *
##          23) lnnlinc< 46230.43 131  45 yes (0.3435115 0.6564885) *
##     3) foreign=yes 165  52 yes (0.3151515 0.6848485)
##       6) lnnlinc>=56365.39 16   5 no (0.6875000 0.3125000) *
##       7) lnnlinc< 56365.39 149  41 yes (0.2751678 0.7248322) *
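Each row of the printout lists the number of observations in the node (n), the number that do not belong to the predicted class (loss), the predicted class, and the class proportions. A short sketch, using three nodes copied from the printout (the object names are hypothetical), shows how the proportions follow from n and loss.

```r
# n and loss for three nodes, copied from the printout above
nodes <- data.frame(
  node = c("root", "foreign=no", "foreign=yes"),
  n    = c(636, 471, 165),
  loss = c(295, 182, 52)
)

# loss / n is the share of the node that does not belong to the predicted class
nodes$error    <- nodes$loss / nodes$n
nodes$accuracy <- 1 - nodes$error
nodes
```

The error column reproduces the smaller of the two proportions shown in parentheses for each node, for example 0.4638 for the root.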

In the text above, the first split is made on the feature “foreign”, which is a yes or no possibility. 471 were not foreigners while 165 were foreigners. The purity here is not great: 61% of those who are not foreigners fall in the predicted “no” class, and 68% of the foreigners fall in the predicted “yes” class. For the 165 that are foreigners, the next split is by their income, etc. This is hard to understand. Below is an actual diagram of the text above.

plot(as.party(tree.pros))

We now need to determine if pruning the tree is beneficial. We do this by looking at the cost complexity. Below is the code.

tree.pros$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.20677966      0 1.0000000 1.0000000 0.04263219
## 2 0.04632768      1 0.7932203 0.7932203 0.04122592
## 3 0.02033898      4 0.6542373 0.6677966 0.03952891
## 4 0.01016949      5 0.6338983 0.6881356 0.03985120
## 5 0.01000000      8 0.6033898 0.6915254 0.03990308

The “rel error” column indicates that our model is weak no matter how many splits. Even with 8 splits the relative error is still about 60% of the error at the root node. Below is a plot of the table above.

plotcp(tree.pros)

Based on the table, we will try to prune the tree to 5 splits. The plot above provides a visual of the cross-validation error (xerror). In the table shown, xerror is lowest at four splits (row 3), with five splits (row 4) close behind. We take the CP value from row 4 to prune to five splits. Below is the code for pruning the tree followed by a plot of the modified tree.

cp<-tree.pros$cptable[4,"CP"] #CP value for the five-split tree
pruned.tree.pros<-prune(tree.pros,cp=cp)
plot(as.party(pruned.tree.pros))
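Rather than reading the row number off the table by eye, the CP value can be picked programmatically. The sketch below uses the values copied from the printed table and shows two common rules: the row with the smallest cross-validated error, and the “one standard error” rule. The object names are hypothetical, and neither rule is the one used to choose row 4 in this post.

```r
# cptable values copied from the printout above
cpt <- data.frame(
  CP     = c(0.20677966, 0.04632768, 0.02033898, 0.01016949, 0.01000000),
  nsplit = c(0, 1, 4, 5, 8),
  xerror = c(1.0000000, 0.7932203, 0.6677966, 0.6881356, 0.6915254),
  xstd   = c(0.04263219, 0.04122592, 0.03952891, 0.03985120, 0.03990308)
)

# Row with the smallest cross-validated error
best <- which.min(cpt$xerror)

# One-standard-error rule: smallest tree whose xerror is within one xstd of the minimum
threshold <- cpt$xerror[best] + cpt$xstd[best]
one_se    <- min(which(cpt$xerror <= threshold))

cpt[c(best, one_se), c("CP", "nsplit", "xerror")]
```

With these numbers both rules point to the four-split tree, one row above the five-split tree pruned here. The difference in xerror between the two is small relative to xstd, so either choice is defensible for a demonstration.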

If you compare the two trees we have developed, one of the main differences is that the pruned tree is missing the “noc” (number of older children) variable. There are also fewer splits on the income variable (lnnlinc). We can now use the pruned tree with the test data set.

party.pros.test<-predict(pruned.tree.pros,newdata=test,type="class")
table(party.pros.test,test$lfp) ## ## party.pros.test no yes ## no 90 41 ## yes 40 65 Now for the accuracy (90+65) / (90+41+40+65) ## [1] 0.6567797 This is surprisingly high compared to the results for the training set but 65% is not great, However, this is fine for a demonstration. Conclusion Classification trees are one of many useful tools available for data analysis. When developing classification trees one of the key ideas to keep in mind is the aspect of prunning as this affects the complexity of the model. # Numeric Prediction with Support Vector Machines in R In this post, we will look at support vector machines for numeric prediction. SVM is used for both classification and numeric prediction. The advantage of SVM for numeric prediction is that SVM will automatically create higher dimensions of the features and summarizes this in the output. In other words, unlike in regression where you have to decide for yourself how to modify your features, SVM does this automatically using different kernels. Different kernels transform the features in different ways. And the cost function determines the penalty for an example being on the wrong side of the margin developed by the kernel. Remember that SVM draws lines and separators to divide the examples. Examples on the wrong side are penalized as determined by the researcher. Just like with regression, generally, the model with the least amount of error may be the best model. As such, the purpose of this post is to use SVM to predict income in the “Mroz” dataset from the “Ecdat” package. We will use several different kernels that will transformation the features different ways and calculate the mean-squared error to determine the most appropriate model. Below is some initial code. library(caret);library(e1071);library(corrplot);library(Ecdat) data(Mroz) str(Mroz) ## 'data.frame': 753 obs. of 18 variables: ##$ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ... ##$ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $child618 : int 0 2 3 3 2 0 2 0 2 2 ... ##$ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $educw : int 12 12 12 12 14 12 16 12 12 12 ... ##$ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ... ##$ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ageh : int 34 30 40 53 32 57 37 53 52 43 ... ##$ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $wageh : num 4.03 8.44 3.58 3.54 10 ... ##$ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $educwm : int 12 7 12 7 12 14 14 3 7 7 ... ##$ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ... ##$ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $experience: int 14 5 15 6 7 33 11 35 24 21 ... We need to place the factor variables next to each other as it helps in having to remove them when we need to scale the data. We must scale the data because SVM is based on distance when making calculations. If there are different scales the larger scale will have more influence on the results. Below is the code mroz.scale<-Mroz[,c(17,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18)] mroz.scale<-as.data.frame(scale(mroz.scale[,c(-1,-2)])) #remove factor variables for scaling mroz.scale$city<-Mroz$city # add factor variable back into the dataset mroz.scale$work<-Mroz$work # add factor variable back into the dataset #mroz.cor<-cor(mroz.scale[,-17:-18]) #corrplot(mroz.cor,method='number', col='black') Below is the code for creating the train and test datasets. set.seed(502) ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3)) train<-mroz.scale[ind==1,] test<-mroz.scale[ind==2,] Linear Kernel Our first kernel is the linear kernel. Below is the code. We use the “tune.svm” function from the “e1071” package. We set the kernel to “linear” and we pick our own values for the cost function. The numbers for the cost function can be whatever you want. Also, keep in mind that r will produce six different models because we have six different values in the “cost” argument. The process we are using to develop the models is as follows 1. Set the seed 2. Develop the initial model by setting the formula, dataset, kernel, cost function, and other needed information. 3. Select the best model for the test set 4. Predict with the best model 5. Plot the predicted and actual results 6. Calculate the mean squared error The first time we will go through this process step-by-step. However, all future models will just have the code followed by an interpretation. 
linear.tune<-tune.svm(income~.,data=train,kernel="linear",cost = c(.001,.01,.1,1,5,10))
summary(linear.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##    10
##
## - best performance: 0.3492453
##
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-03 0.6793025  0.2285748
## 2 1e-02 0.3769298  0.1800839
## 3 1e-01 0.3500734  0.1626964
## 4 1e+00 0.3494828  0.1618478
## 5 5e+00 0.3493379  0.1611353
## 6 1e+01 0.3492453  0.1609774

The best model had a cost = 10 with a cross-validated error of .35. We will select the best model and use this on our test data. Below is the code.

best.linear<-linear.tune$best.model
tune.test<-predict(best.linear,newdata=test)

Now we will create a plot so we can see how well our model predicts. In addition, we will calculate the mean squared error so that we have an actual number for our model’s performance.

plot(tune.test,test$income)
tune.test.resid<-tune.test-test$income
mean(tune.test.resid^2)
## [1] 0.215056

The model looks good in the plot. However, we cannot tell if the error number is decent until it is compared to other models.
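Since the same error calculation is repeated for every kernel in this post, it could be wrapped in a small helper. The sketch below is only an illustration: the function name “mse” and the toy vectors are made up and are not part of the original analysis.

```r
# Hypothetical helper: mean squared error between predictions and actual values
mse <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean((predicted - actual)^2)
}

# Toy vectors (made up for illustration only)
predicted <- c(1.0, 2.0, 3.0)
actual    <- c(1.5, 2.0, 2.0)
mse(predicted, actual)
```

With a helper like this, “mse(tune.test, test$income)” would be equivalent to the two lines used above for the linear kernel.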

Polynomial Kernel

The next kernel we will use is the polynomial one. The kernel requires two parameters: the degree of the polynomial (3, 4, 5, etc.) and the kernel coefficient (“coef0”). Below is the code.

set.seed(123)
poly.tune<-tune.svm(income~.,data = train,kernel="polynomial",degree = c(3,4,5),coef0 = c(.1,.5,1,2,3,4))
best.poly<-poly.tune$best.model
poly.test<-predict(best.poly,newdata=test)
plot(poly.test,test$income)

poly.test.resid<-poly.test-test$income
mean(poly.test.resid^2)
## [1] 0.2453022

The polynomial kernel has only slightly more error than the linear kernel (.25 compared to .22).

Radial Kernel

Next, we will use the radial kernel. One thing that is new here is the need for a parameter called gamma. Below is the code.

set.seed(123)
rbf.tune<-tune.svm(income~.,data=train,kernel="radial",gamma = c(.1,.5,1,2,3,4))
summary(rbf.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma
##    0.1
##
## - best performance: 0.5225952
##
## - Detailed performance results:
##   gamma     error dispersion
## 1   0.1 0.5225952  0.4183170
## 2   0.5 0.9743062  0.5293211
## 3   1.0 1.0475714  0.5304482
## 4   2.0 1.0582550  0.5286129
## 5   3.0 1.0590367  0.5283465
## 6   4.0 1.0591208  0.5283059
best.rbf<-rbf.tune$best.model
rbf.test<-predict(best.rbf,newdata=test)
plot(rbf.test,test$income)
rbf.test.resid<-rbf.test-test$income
mean(rbf.test.resid^2)
## [1] 0.3138517

The radial kernel is worse than the linear and polynomial kernels. However, there is not much difference in the performance of the models so far.
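To get a feel for what gamma does, here is a minimal sketch of the radial kernel written by hand. It assumes the form given in the “e1071” documentation, exp(-gamma*|u-v|^2). The function name “rbf_kernel” and the two toy points are made up for illustration.

```r
# Radial (RBF) kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

# Two toy points (made up for illustration)
a <- c(0, 0)
b <- c(1, 1)

# Similarity between the same two points for each gamma we tuned over
sapply(c(.1, .5, 1, 2, 3, 4), function(g) rbf_kernel(a, b, g))
```

The similarity between the same two points drops quickly as gamma grows, so with a large gamma each training example only influences its immediate neighborhood. This is one plausible reason the larger gamma values did worse in the tuning results above.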

Sigmoid Kernel

Next, we will try the sigmoid kernel. The sigmoid kernel relies on a “gamma” parameter and a kernel coefficient (“coef0”). Below is the code.

set.seed(123)
sigmoid.tune<-tune.svm(income~., data=train,kernel="sigmoid",gamma = c(.1,.5,1,2,3,4),coef0 = c(.1,.5,1,2,3,4))
summary(sigmoid.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma coef0
##    0.1     3
##
## - best performance: 0.8759507
##
## - Detailed performance results:
##    gamma coef0        error  dispersion
## 1    0.1   0.1   27.0808221   6.2866615
## 2    0.5   0.1  746.9235624 129.0224096
## 3    1.0   0.1 1090.9660708 198.2993895
## 4    2.0   0.1 1317.4497885 214.7997608
## 5    3.0   0.1 1339.8455047 180.3195491
## 6    4.0   0.1 1299.7469190 201.6901577
## 7    0.1   0.5  151.6070833  38.8450961
## 8    0.5   0.5 1221.2396575 335.4320445
## 9    1.0   0.5 1225.7731007 190.7718103
## 10   2.0   0.5 1290.1784238 216.9249899
## 11   3.0   0.5 1338.1069460 223.3126800
## 12   4.0   0.5 1261.8861304 300.0001079
## 13   0.1   1.0  162.6041229  45.3216740
## 14   0.5   1.0 2276.4330973 330.1739559
## 15   1.0   1.0 2036.4791854 335.8051736
## 16   2.0   1.0 1626.4347749 290.6445164
## 17   3.0   1.0 1333.0626614 244.4424896
## 18   4.0   1.0 1343.7617925 194.2220729
## 19   0.1   2.0   19.2061993   9.6767496
## 20   0.5   2.0 2504.9271757 583.8943008
## 21   1.0   2.0 3296.8519140 542.7903751
## 22   2.0   2.0 2376.8169815 398.1458855
## 23   3.0   2.0 1949.9232179 319.6548059
## 24   4.0   2.0 1758.7879267 313.2581011
## 25   0.1   3.0    0.8759507   0.3812578
## 26   0.5   3.0 1405.9712578 389.0822797
## 27   1.0   3.0 3559.4804854 843.1905348
## 28   2.0   3.0 3159.9549029 492.6072149
## 29   3.0   3.0 2428.1144437 412.2854724
## 30   4.0   3.0 1997.4596435 372.1962595
## 31   0.1   4.0    0.9543167   0.5170661
## 32   0.5   4.0  746.4566494 201.4341061
## 33   1.0   4.0 3277.4331302 527.6037421
## 34   2.0   4.0 3643.6413379 604.2778089
## 35   3.0   4.0 2998.5102806 471.7848740
## 36   4.0   4.0 2459.7133632 439.3389369
best.sigmoid<-sigmoid.tune$best.model
sigmoid.test<-predict(best.sigmoid,newdata=test)
plot(sigmoid.test,test$income)

sigmoid.test.resid<-sigmoid.test-test$income
mean(sigmoid.test.resid^2)
## [1] 0.8004045

The sigmoid kernel performed much worse than the other models based on the error metric. You can further see the problems with this model in the plot above.

Conclusion

The final results are as follows

• Linear kernel .22
• Polynomial kernel .25
• Radial kernel .31
• Sigmoid kernel .80

Which model to select depends on the goals of the study. However, it definitely looks as though you would be picking from among the first three models. The power of SVM is the ability to use different kernels to uncover different results without having to really modify the features yourself.

# Regression Tree Development in R

In this post, we will take a look at regression trees. Regression trees use a concept called recursive partitioning. Recursive partitioning involves splitting features in a way that reduces the error the most.

The splitting is also greedy, which means that the algorithm will partition the data at one point without considering how it will affect future partitions. Ignoring how a current split affects the future splits can lead to unnecessary branches with high variance and low bias.

One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can be hurt when a particular example is assigned the mean of a node. This forced assignment is a loss of information, much like turning a continuous variable into a categorical variable.

In this post, we will use the “Participation” dataset from the “Ecdat” package to predict income based on the other variables in the dataset. Below is some initial code.

library(rpart);library(partykit);library(caret);library(Ecdat)
data("Participation")
str(Participation)
## 'data.frame':    872 obs. of  7 variables:
##  $ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  $ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  $ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  $ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  $ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

There are several things we need to do to make the results easier to interpret. The “age” variable needs to be multiplied by ten as it currently shows values such as 4.5, 3, etc. Common sense indicates that a four-year-old and a three-year-old are not earning an income.

In addition, we need to convert our income variable (lnnlinc) from the log of income to regular income. This will also help us to understand the results. Below is the code.

Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log

The next step is to create our training and testing data sets. Below is the code.

set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
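Note that this way of splitting assigns each row to group 1 with probability .7 and to group 2 with probability .3, independently of the other rows. The split is therefore only approximately 70/30. The sketch below repeats just the sampling step so the group sizes can be inspected. It hard-codes 872 rows (the number reported by “str” above) rather than loading the data, which is an assumption made to keep the example self-contained.

```r
set.seed(502)
n <- 872  # number of rows in Participation, per the str() output above
ind <- sample(2, n, replace = TRUE, prob = c(.7, .3))

# Counts and proportions in each group: close to, but not exactly, 70/30
table(ind)
prop.table(table(ind))
```

If an exact 70/30 split is wanted, sampling row indices without replacement, for example with “sample(n, size = round(.7 * n))”, is one alternative.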

We can now develop our model with the “rpart” function. Afterward, we will print the resulting tree.

reg.tree<-rpart(lnnlinc~.,data = train)

Below is a printout of the current tree

reg.tree
## n= 636
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##   1) root 636 390503700000  48405.08
##     2) educ< 11.5 473 127460900000  43446.69
##       4) educ< 9.5 335  70269440000  40758.25
##         8) foreign=yes 129  10617380000  36016.12 *
##         9) foreign=no 206  54934520000  43727.84 *
##       5) educ>=9.5 138  48892370000  49972.98 *
##     3) educ>=11.5 163 217668400000  62793.52
##       6) age< 34.5 79  34015680000  51323.86
##        12) age< 25.5 12    984764800  34332.97 *
##        13) age>=25.5 67  28946170000  54367.01 *
##       7) age>=34.5 84 163486000000  73580.46
##        14) lfp=yes 36  23888410000  58916.66 *
##        15) lfp=no 48 126050900000  84578.31
##          30) educ< 12.5 29  86940400000  74425.51
##            60) age< 38.5 8    763764600  57390.34 *
##            61) age>=38.5 21  82970650000  80915.10
##             122) age>=44 14  34091840000  68474.57 *
##             123) age< 44 7  42378600000 105796.20 *
##          31) educ>=12.5 19  31558550000 100074.70 *

I will not interpret all of this, but here is a brief description using nodes 2, 4, and 8. If the person has less than 11.5 years of education (473 qualify), has less than 9.5 years of education (335 of the 473 qualify), and is a foreigner (129 of the 335 qualify), then their predicted income is 36,016.12 dollars, which is the average for that node.

Perhaps now you can see how some information is lost. The average income for people in this node is 36,016.12 dollars, but probably nobody earns exactly this amount.
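The idea that every example in a node receives the node mean can be sketched with a few lines of base R. The data frame below is made up for illustration and only mimics the kind of path described above (low education, foreigner). It is not the Participation data.

```r
# Toy data (made up): education, foreign status, and income
toy <- data.frame(
  educ    = c(8, 9, 9, 8, 12, 13),
  foreign = c("yes", "yes", "no", "no", "no", "yes"),
  income  = c(30000, 40000, 45000, 41000, 60000, 65000)
)

# Follow a path like nodes 2, 4 and 8: educ < 11.5, educ < 9.5, foreign = yes
node <- toy[toy$educ < 11.5 & toy$educ < 9.5 & toy$foreign == "yes", ]

# Every example that lands in this node gets the same prediction: the node mean
node_prediction <- mean(node$income)
node_prediction

# The individual incomes differ from the prediction; this is the information lost
node$income - node_prediction
```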

If what I said does not make sense, here is an actual plot of the current regression tree.

plot(as.party(reg.tree))

The little boxes at the bottom are boxplots of that node.

Tree modification

We will now make modifications to the tree. We will begin by examining the cptable. Below is the code.

reg.tree$cptable
##           CP nsplit rel error    xerror      xstd
## 1 0.11619458      0 1.0000000 1.0026623 0.1666662
## 2 0.05164297      1 0.8838054 0.9139383 0.1434768
## 3 0.03469034      2 0.8321625 0.9403669 0.1443843
## 4 0.02125215      3 0.7974721 0.9387060 0.1433101
## 5 0.01933892      4 0.7762200 0.9260030 0.1442329
## 6 0.01242779      5 0.7568810 0.9097011 0.1434606
## 7 0.01208066      7 0.7320255 0.9166627 0.1433779
## 8 0.01046022      8 0.7199448 0.9100704 0.1432901
## 9 0.01000000      9 0.7094846 0.9107869 0.1427025

The cptable shares a lot of information. First, cp stands for cost complexity and this is the column furthest to the left. This number decreases as the tree becomes more complex. “nsplit” indicates the number of splits in the tree. “rel error” is the relative error, which is the residual sum of squares (RSS) scaled so that the root node equals 1. The ‘xerror’ and ‘xstd’ are the cross-validated average error and the standard deviation of that error respectively.

One thing we can see from the cptable is that 9 splits has the lowest relative error, while the cross-validated error is lowest at 5 splits (0.9097), with 8 splits (0.9101) and 1 split (0.9139) close behind.

We will now make a plot of the complexity parameter to determine at what point to prune the tree. Pruning helps in removing unnecessary splits that do not improve the model much. Below is the code. The information in the plot is a visual of the “cptable”.

plotcp(reg.tree)

The plot shows an early dip at a tree of size 2 (a single split), but such a small tree is boring. Another low point is at eight splits (row 8 of the cptable). Therefore, we will prune our tree to have eight splits. First, we need to create an object that contains the cp value for the number of splits we want. Then we use the “prune” function to make the actual modified tree.

cp<-min(reg.tree$cptable[8,])
pruned.reg.tree<-prune(reg.tree,cp=cp)
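Taking the minimum of a whole cptable row returns the CP value only when CP happens to be the smallest number in that row, which is the case for row 8 here. A slightly more explicit alternative is to select the “CP” column by name. The sketch below types the cptable from above in by hand as a plain matrix so that it runs without fitting the tree. The object names “cptab”, “cp_8_splits”, and “cp_min_xerror” are made up for illustration.

```r
# The cptable printed above, typed in by hand as a matrix
cptab <- matrix(
  c(0.11619458, 0, 1.0000000, 1.0026623, 0.1666662,
    0.05164297, 1, 0.8838054, 0.9139383, 0.1434768,
    0.03469034, 2, 0.8321625, 0.9403669, 0.1443843,
    0.02125215, 3, 0.7974721, 0.9387060, 0.1433101,
    0.01933892, 4, 0.7762200, 0.9260030, 0.1442329,
    0.01242779, 5, 0.7568810, 0.9097011, 0.1434606,
    0.01208066, 7, 0.7320255, 0.9166627, 0.1433779,
    0.01046022, 8, 0.7199448, 0.9100704, 0.1432901,
    0.01000000, 9, 0.7094846, 0.9107869, 0.1427025),
  ncol = 5, byrow = TRUE,
  dimnames = list(as.character(1:9),
                  c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# CP for the row with eight splits, selected by column name
cp_8_splits <- cptab[cptab[, "nsplit"] == 8, "CP"]
cp_8_splits

# CP for the row with the lowest cross-validated error
cp_min_xerror <- cptab[which.min(cptab[, "xerror"]), "CP"]
cp_min_xerror
```

On the fitted tree the same idea would read “reg.tree$cptable[8,"CP"]”, which gives the same value as the “min” approach for this particular table.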

We will now plot our modified tree.

plot(as.party(pruned.reg.tree))

The only difference is the loss of the age node for greater or less than 25.5.

Model Test

We can now test our tree to see how well it performs.

reg.tree.test<-predict(pruned.reg.tree,newdata=test)
reg.tree.resid<-reg.tree.test-test$lnnlinc
mean(reg.tree.resid^2)
## [1] 431928030

The number we calculated is the mean squared error. This number must be compared to models that are developed differently in order to assess the current model. By itself it means nothing.

Conclusion

This post exposed you to regression trees. This type of tree can be used to make numeric predictions with nonlinear data. However, with the assignment of examples to nodes comes a loss of information, as the uniqueness of each example is lost when it is placed in a node.

# K Nearest Neighbor in R

K-nearest neighbor (KNN) is one of many nonlinear algorithms that can be used in machine learning. By nonlinear I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that naturally does not meet the assumptions of linearity.

KNN is also known as a “lazy learner”. This means that there are no coefficients or parameter estimates. When doing regression we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is use the K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the number of K neighbors to use that is most accurate based on the different criteria for assessing the models.

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

library(class);library(kknn);library(caret);library(corrplot)
library(reshape2);library(ggplot2);library(pROC);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

Mroz$work<-NULL
mroz.melt<-melt(Mroz,id.var='city')
Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
Mroz_plots

From the plots, it appears there are no differences in how the variables act whether someone is from the city or not. This may be a flag that classification may not work.

We now need to scale our data, otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set, because it’s a factor, and then reinsert it into the data set. Below is the code.

mroz.scale<-as.data.frame(scale(Mroz[,-16]))
mroz.scale$city<-Mroz$city
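For readers unfamiliar with “scale”, the sketch below shows what it does by default: each column has its mean subtracted and is then divided by its standard deviation. The small data frame is made up for illustration (the values are borrowed from the first few rows of the “str” output above), and the object names are hypothetical.

```r
# Toy data frame (values borrowed from the first rows of hoursw and wagew)
toy <- data.frame(hours = c(1610, 1656, 1980, 456),
                  wage  = c(2.65, 2.65, 4.04, 3.25))

# Built-in scaling
scaled <- as.data.frame(scale(toy))

# The same calculation done by hand: (x - mean) / standard deviation
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Both approaches agree, and each scaled column has mean 0 and sd 1
all.equal(scaled$hours, manual$hours)
colMeans(scaled)
sapply(scaled, sd)
```

After scaling, a variable measured in thousands (hours) and one measured in single digits (wage) contribute to distance calculations on equal footing.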

We will now look at our box-plots a second time but this time with scaled data.

mroz.scale.melt<-melt(mroz.scale,id.var="city")
mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
mroz_plot2

This second plot is easier to read but there is still little indication of difference.

We can now move to checking the correlations among the variables. Below is the code

mroz.cor<-cor(mroz.scale[,-17])
corrplot(mroz.cor,method = 'number')

There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem.

We will now divide our dataset into the training and testing sets

set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]

Before creating a model we need to create a grid. We do not know the value of k yet so we have to run multiple models with different values of k in order to determine this for our model. As such we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation to get a better estimate of k as well using the “trainControl” function. The code is below.

grid<-expand.grid(.k=seq(2,20,by=1))
control<-trainControl(method="cv")
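To make concrete what k controls, here is a minimal base R sketch of the KNN idea: find the k training points closest to a new point and let them vote. The function “knn_predict” and the toy data are made up for illustration. This is not how the “class” or “caret” packages implement it internally, and ties are simply resolved in favor of the first factor level.

```r
# Classify one new point by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[1:k]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data (made up): two well separated groups
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("no", "no", "no", "yes", "yes", "yes"))

knn_predict(train_x, train_y, c(0.5, 0.5), k = 3)
knn_predict(train_x, train_y, c(5.5, 5.5), k = 3)
```

A small k makes the prediction depend on very few neighbors, while a large k smooths over more of the training data. This is why we try a range of values in the grid.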

Now we make our model.

knn.train<-train(city~.,train,method="knn",trControl=control,tuneGrid=grid)
knn.train
## k-Nearest Neighbors
##
## 540 samples
##  16 predictors
##   2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ...
## Resampling results across tuning parameters:
##
##   k   Accuracy   Kappa
##    2  0.6000095  0.1213920
##    3  0.6368757  0.1542968
##    4  0.6424325  0.1546494
##    5  0.6386252  0.1275248
##    6  0.6329998  0.1164253
##    7  0.6589619  0.1616377
##    8  0.6663344  0.1774391
##    9  0.6663681  0.1733197
##   10  0.6609510  0.1566064
##   11  0.6664018  0.1575868
##   12  0.6682199  0.1669053
##   13  0.6572111  0.1397222
##   14  0.6719586  0.1694953
##   15  0.6571425  0.1263937
##   16  0.6664367  0.1551023
##   17  0.6719573  0.1588789
##   18  0.6608811  0.1260452
##   19  0.6590979  0.1165734
##   20  0.6609510  0.1219624
##
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 14.

caret selects k = 14 because it has the highest cross-validated accuracy, although several nearby values, including the k = 16 we use below, perform almost as well. The output also reports the kappa statistic, which is a measurement of the accuracy of a model while taking into account chance agreement. With the “knn” function from the “class” package we don’t have a model in the sense that we do not use the ~ sign like we do with regression. Instead, we supply a train set, a test set, a factor variable with the training labels, and a number for k. This will make more sense when you see the code. We will then look at the table and the accuracy of the model on our test dataset.

knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #-17 removes the dependent variable 'city'
table(knn.test,test$city)
##
## knn.test  no yes
##      no   19   8
##      yes  61 125
prob.agree<-(19+125)/213
prob.agree
## [1] 0.6760563

Accuracy is 67%, which is consistent with what we found when determining k. We can also calculate the kappa. This is done by calculating the probability of agreement and the probability of agreement by chance, and then doing some subtraction and division. We already know the accuracy as we stored it in the variable “prob.agree”. We now need the probability that agreement happens by chance, which is the sum over both classes of the predicted proportion times the actual proportion. Lastly, we calculate the kappa.

prob.chance<-((19+8)/213)*((19+61)/213)+((61+125)/213)*((8+125)/213)
kap<-(prob.agree-prob.chance)/(1-prob.chance)
kap

This gives a kappa of about .20, which is far less impressive than the accuracy suggests and is broadly in line with the cross-validated kappa values reported by caret above. Most of the apparent accuracy comes from the model predicting “yes” for the majority of examples.

The example we just did was with unweighted k neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighting methods. “Rectangular” is unweighted and is the one that we used. The other two are “triangular” and “epanechnikov”. How these calculate the weights is beyond the scope of this post. In the code below the argument “distance” is the Minkowski distance parameter, which can be set to 1 for absolute (Manhattan) distance and 2 for Euclidean distance.

kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular", "epanechnikov"))
plot(kknn.train)

kknn.train
##
## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14

If you look at the plot you can see which value of k is the best by looking at the point that is the lowest on the graph, which is right before 15. The legend indicates that this point is the “rectangular” estimate, which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although this is a different value of k than the 16 we used above, the misclassification rate (about 33%) is about the same.

Conclusion

In this post, we explored both weighted and unweighted KNN.
This algorithm allows you to deal with data that does not meet the assumptions of regression by ignoring the need for parameters. However, because there are no numbers really attached to the results beyond accuracy, it can be difficult to explain what is happening in the model to people. As such, perhaps the biggest drawback is communicating results when using KNN.

# Elastic Net Regression in R

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression only have one.

In this post, we will go through an example of the use of elastic net using the “VietNamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis.

library(Ecdat);library(corrplot);library(caret);library(glmnet)
data("VietNamI")
str(VietNamI)
## 'data.frame':    27765 obs. of  12 variables:
##  $ pharvis  : num  0 0 0 1 1 0 0 0 2 3 ...
##  $ lnhhexp  : num  2.73 2.74 2.27 2.39 3.11 ...
##  $ age      : num  3.76 2.94 2.56 3.64 3.3 ...
##  $ sex      : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
##  $ married  : num  1 0 0 1 1 1 1 0 1 1 ...
##  $ educ     : num  2 0 4 3 3 9 2 5 2 0 ...
##  $ illness  : num  1 1 0 1 1 0 0 0 2 1 ...
##  $ injury   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ illdays  : num  7 4 0 3 10 0 0 0 4 7 ...
##  $ actdays  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ insurance: num  0 0 1 1 0 1 1 1 0 0 ...
##  $ commune  : num  192 167 76 123 148 20 40 57 49 170 ...
##  - attr(*, "na.action")=Class 'omit'  Named int 27734
##   .. ..- attr(*, "names")= chr "27734"

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. Code is below.

p.cor<-cor(VietNamI[,-4])
corrplot.mixed(p.cor)

There are no major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

VietNamI$commune<-NULL
VietNamI_reduced<-VietNamI[1:1000,]
ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3))
train<-VietNamI_reduced[ind==1,]
test<-VietNamI_reduced[ind==2,]

We need to create a grid that will allow us to investigate different models with different combinations of alpha and lambda. This is done using the “expand.grid” function in combination with the “seq” function. Below is the code.

grid<-expand.grid(.alpha=seq(0,1,by=.5),.lambda=seq(0,0.2,by=.1))
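To see what the two tuning parameters do, here is a minimal sketch of the elastic net penalty in the form used by the “glmnet” package: lambda times a mix of the ridge penalty (weighted by 1 - alpha) and the lasso penalty (weighted by alpha). The function name “enet_penalty” and the coefficient values are made up for illustration.

```r
# Elastic net penalty: lambda * [ (1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|) ]
enet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1, 0)  # made-up coefficients

enet_penalty(beta, alpha = 0,  lambda = .2)  # ridge penalty only
enet_penalty(beta, alpha = 1,  lambda = .2)  # lasso penalty only
enet_penalty(beta, alpha = .5, lambda = .2)  # an even mix of the two
```

In other words, alpha decides the blend between ridge and lasso, while lambda decides how strongly the blended penalty is applied. The grid above tries three values of each.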

We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the “trainControl” function from the “caret” package. In the code below “LOOCV” stands for “leave one out cross-validation”.

control<-trainControl(method = "LOOCV")
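Leave-one-out cross-validation fits the model once per row, each time holding out a single observation and predicting it with a model trained on the remaining n - 1 rows. The sketch below shows the idea by hand with a simple linear model on simulated data. The data and object names are made up for illustration, and this is not what “caret” runs internally.

```r
set.seed(1)
# Simulated toy data: y depends linearly on x plus noise
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)

# For each row: fit on all other rows, then predict the held-out row
errors <- sapply(seq_len(nrow(toy)), function(i) {
  fit <- lm(y ~ x, data = toy[-i, ])
  predict(fit, newdata = toy[i, , drop = FALSE]) - toy$y[i]
})

# One error per observation, summarized as a root mean squared error
length(errors)
loocv_rmse <- sqrt(mean(errors^2))
loocv_rmse
```

Because the model is refit once for every row and every grid combination, LOOCV can be slow, which is one reason for limiting the data to 1000 rows.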

We are now ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters.

enet.train<-train(illdays~.,train,method="glmnet",trControl=control,tuneGrid=grid)
enet.train
## glmnet
##
## 694 samples
##  10 predictors
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ...
## Resampling results across tuning parameters:
##
##   alpha  lambda  RMSE      Rsquared
##   0.0    0.0     5.229759  0.2968354
##   0.0    0.1     5.229759  0.2968354
##   0.0    0.2     5.229759  0.2968354
##   0.5    0.0     5.243919  0.2954226
##   0.5    0.1     5.225067  0.2985989
##   0.5    0.2     5.200415  0.3038821
##   1.0    0.0     5.244020  0.2954519
##   1.0    0.1     5.203973  0.3033173
##   1.0    0.2     5.182120  0.3083819
##
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2.

The output lists all the combinations of alpha and lambda values that we set in the “grid” variable. It even tells us which combination was the best: alpha = 1 and lambda = .2. Since an alpha of 1 is simply lasso regression, for our purposes we will set alpha to .5 and lambda to .2 so that the model remains an elastic net. The r-square is also included.

We will now fit our model and then run it on the test set. We have to convert the “sex” variable to a dummy variable for the “glmnet” function. We next have to make matrices for the predictor variables and for our outcome variable “illdays”.

train$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable
test$sex<-model.matrix( ~ sex - 1, data=test )
predictor_variables<-as.matrix(train[,-9])
days_ill<-as.matrix(train$illdays)
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)

We can now look at specific coefficients by using the “coef” function.

enet.coef<-coef(enet,s=.2)
enet.coef
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   -1.304263895
## pharvis        0.532353361
## lnhhexp       -0.064754000
## age            0.760864404
## sex.sexfemale  0.029612290
## sex.sexmale   -0.002617404
## married        0.318639271
## educ           .
## illness        3.103047473
## injury         .
## actdays        0.314851347
## insurance      .

You can see for yourself that several variables were removed from the model. Education, injury, and insurance have coefficients of zero and so do not play a role in predicting the number of days ill for an individual in Vietnam. Household expenditure (lnhhexp) and sex remain in the model, but their coefficients are very small.

With our model developed, we can now test it using the “predict” function. However, we first need to convert our test dataframe into a matrix and remove the outcome variable from it.

test.matrix<-as.matrix(test[,-9])
enet.y<-predict(enet, newx = test.matrix, type = "response", s=.2)

Let’s plot our results.

plot(enet.y)

This does not look good. Let’s check the mean squared error.

enet.resid<-enet.y-test$illdays
mean(enet.resid^2)
## [1] 20.18134

We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it.

set.seed(317)
enet.cv<-cv.glmnet(predictor_variables,days_ill,alpha=.5)
plot(enet.cv)

You can see that as the number of features is reduced (see the numbers on the top of the plot) the MSE increases (y-axis). In addition, as the lambda increases, there is also an increase in the error, but only when the number of variables is reduced as well.

The dotted vertical lines in the plot represent the minimum MSE for a set lambda (on the left) and the one standard error from the minimum (on the right). You can extract these two lambda values using the code below.

enet.cv$lambda.min
## [1] 0.3082347
enet.cv$lambda.1se
## [1] 2.874607
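The logic behind these two values can be sketched by hand. “lambda.min” is the lambda with the lowest cross-validated error, and “lambda.1se” is the largest lambda whose error is still within one standard error of that minimum. The numbers below are made up for illustration. The vector names “cvm” and “cvsd” mirror the fields that “cv.glmnet” returns, but nothing here comes from the actual model.

```r
# Made-up cross-validation results, ordered from largest to smallest lambda
lambda <- c(3.0, 2.0, 1.0, 0.5, 0.3, 0.1)
cvm    <- c(30.0, 27.5, 26.0, 25.2, 25.0, 25.1)  # mean cross-validated error
cvsd   <- c(2.9, 2.8, 2.7, 2.6, 2.6, 2.6)        # standard error of that mean

# lambda.min: the lambda with the smallest mean error
best <- which.min(cvm)
lambda_min <- lambda[best]

# lambda.1se: the largest lambda whose error is within one SE of the minimum
threshold <- cvm[best] + cvsd[best]
lambda_1se <- max(lambda[cvm <= threshold])

lambda_min
lambda_1se
```

The one standard error choice trades a little accuracy for a simpler, more heavily penalized model.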

We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict.

coef(enet.cv,s="lambda.1se")
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)   2.34116947
## pharvis       0.003710399
## lnhhexp       .
## age           .
## sex.sexfemale .
## sex.sexmale   .
## married       .
## educ          .
## illness       1.817479480
## injury        .
## actdays       .
## insurance     .

Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information.

enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',s="lambda.1se")
enet.cv.resid<-enet.y.cv-test$illdays
mean(enet.cv.resid^2)
## [1] 25.47966

The error is actually higher than before (25.48 compared to 20.18), so the one standard error lambda did not improve the model. Our model is a mess but this post served as an example of how to conduct an analysis using elastic net regression.

# Lasso Regression in R

In this post, we will conduct an analysis using lasso regression. Remember that lasso regression can actually eliminate variables by shrinking their coefficients to zero through how the shrinkage penalty is applied.

We will use the dataset “nlschools” from the “MASS” package to conduct our analysis. We want to see if we can predict language test scores “lang” with the other available variables. Below is some initial code to begin the analysis.

library(MASS);library(corrplot);library(glmnet)
data("nlschools")
str(nlschools)
## 'data.frame':    2287 obs. of  6 variables:
##  $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
##  $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
##  $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.

nlschools$class<-NULL
p.cor<-cor(nlschools[,-5])
corrplot.mixed(p.cor)

There are no problems with collinearity. We will now set up our training and testing sets.

ind<-sample(2,nrow(nlschools),replace=T,prob = c(0.7,0.3))
train<-nlschools[ind==1,]
test<-nlschools[ind==2,]

Remember that the ‘glmnet’ function does not like factor variables, so we need to convert our “COMB” variable to a dummy variable. In addition, the “glmnet” function does not accept data frames, so we need to make two matrices. The first will include the predictor variables and the second will include only the outcome variable. Below is the code.

train$COMB<-model.matrix( ~ COMB - 1, data=train ) #convert to dummy variable
test$COMB<-model.matrix( ~ COMB - 1, data=test )
predictor_variables<-as.matrix(train[,2:4])
language_score<-as.matrix(train$lang) We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression. lasso<-glmnet(predictor_variables,language_score,family="gaussian",alpha=1) Now we need to look at the results using the “print” function. This function prints a lot of information as explained below. • Df = number of variables including in the model (this is always the same number in a ridge model) • %Dev = Percent of deviance explained. The higher the better • Lambda = The lambda used to obtain the %Dev When you use the “print” function for a lasso model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we will use the “print” function but, I only printed the first 5 and last 5 models in order to reduce the size of the printout. Fortunately, it only took 60 models to converge. print(lasso) ## ## Call: glmnet(x = predictor_variables, y = language_score, family = "gaussian", alpha = 1) ## ## Df %Dev Lambda ## [1,] 0 0.00000 5.47100 ## [2,] 1 0.06194 4.98500 ## [3,] 1 0.11340 4.54200 ## [4,] 1 0.15610 4.13900 ## [5,] 1 0.19150 3.77100 ............................ ## [55,] 3 0.39890 0.03599 ## [56,] 3 0.39900 0.03280 ## [57,] 3 0.39900 0.02988 ## [58,] 3 0.39900 0.02723 ## [59,] 3 0.39900 0.02481 ## [60,] 3 0.39900 0.02261 The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results we can set the lambda at 0.02 because this explains the highest amount of deviance at .39. The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers next to the coefficient lines refers to the actual coefficient of a particular variable as it changes from using different lambda values. 
Each number corresponds to a variable going from left to right in a dataframe/matrix using the “View” function. For example, 1 in the plot refers to “IQ”, 2 refers to “GS”, etc.

plot(lasso,xvar="lambda",label=T)

As you can see, as lambda increases the coefficients decrease in value. This is how regularized regression works. However, unlike ridge regression, which never reduces a coefficient to zero, lasso regression does reduce coefficients to zero. For example, coefficient 3 (SES variable) and coefficient 2 (GS variable) are reduced to zero when lambda is near 1.

You can also look at the coefficient values at a specific lambda value. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to see the results.

lasso.coef<-coef(lasso,s=.02,exact = T)
lasso.coef
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  9.35736325
## IQ           2.34973922
## GS          -0.02766978
## SES          0.16150542

Results indicate that for a 1 unit increase in IQ there is a 2.35 point increase in language score. When GS (class size) goes up 1 unit there is a .03 point decrease in language score. Finally, when SES (socioeconomic status) increases 1 unit, language score improves .16 point.

The second plot shows us the deviance explained on the x-axis. On the y-axis are the coefficients of the predictor variables. Below is the code.

plot(lasso,xvar='dev',label=T)

If you look carefully, you can see that the two plots mirror each other. Increasing lambda causes a decrease in the coefficients, whereas increasing the fraction of deviance explained goes with an increase in the coefficients. You may remember seeing this when we used the “print” function. As lambda became smaller there was an increase in the deviance explained.

Now, we will assess our model using the test data. We need to convert the test dataset to a matrix.
Then we will use the “predict” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.

test.matrix<-as.matrix(test[,2:4])
lasso.y<-predict(lasso,newx = test.matrix,type = 'response',s=.02)
plot(lasso.y,test$lang)

The visual looks promising. The last thing we need to do is calculate the mean squared error. By itself this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code.

lasso.resid<-lasso.y-test$lang
mean(lasso.resid^2)
## [1] 46.74314

Knowing this number, we can, if we wanted, develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better while keeping in mind the complexity of the model.

# Ridge Regression in R

In this post, we will conduct an analysis using ridge regression. Ridge regression is a type of regularized regression. By applying a shrinkage penalty, we are able to reduce the coefficients of many variables almost to zero while still retaining them in the model. This allows us to develop models that have many more variables in them compared to models using best subset or stepwise regression.

In the example used in this post, we will use the “SAheart” dataset from the “ElemStatLearn” package. We want to predict systolic blood pressure (sbp) using all of the other variables available as predictors. Below is some initial code that we need to begin.

library(ElemStatLearn);library(car);library(corrplot)
library(leaps);library(glmnet);library(caret)
data(SAheart)
str(SAheart)
## 'data.frame':    462 obs. of  10 variables:
##  $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
##  $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
##  $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $ obesity  : num  25.3 28.9 29.1 32 26 ...
##  $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

A look at the object using the “str” function indicates that one variable, “famhist”, is a factor variable. The “glmnet” function that does the ridge regression analysis cannot handle factors so we need to convert this to a dummy variable. However, there are two things we need to do before this. First, we need to check the correlations to make sure there are no major issues with multi-collinearity. Second, we need to create our training and testing data sets. Below is the code for the correlation plot.

p.cor<-cor(SAheart[,-5])
corrplot.mixed(p.cor)

First we created a variable called “p.cor”. The -5 in brackets means we removed the 5th column from the “SAheart” data set, which is the factor variable “famhist”. The correlation plot indicates that there is one strong relationship, between adiposity and obesity. However, one common cut-off for collinearity is 0.8 and this value is 0.72, which is not a problem.

We will now create our training and testing sets and convert “famhist” to a dummy variable.

ind<-sample(2,nrow(SAheart),replace=T,prob = c(0.7,0.3))
train<-SAheart[ind==1,]
test<-SAheart[ind==2,]
train$famhist<-model.matrix( ~ famhist - 1, data=train ) #convert to dummy variable
test$famhist<-model.matrix( ~ famhist - 1, data=test )

We are still not done preparing our data yet. “glmnet” cannot use data frames; it can only use matrices. Therefore, we now need to convert our data frames to matrices. We have to create two matrices, one with all of the predictor variables and a second with the outcome variable of blood pressure. Below is the code.

predictor_variables<-as.matrix(train[,2:10])
blood_pressure<-as.matrix(train$sbp)

We are now ready to create our model. We use the “glmnet” function and insert our two matrices. The family is set to Gaussian because “blood pressure” is a continuous variable. Alpha is set to 0 as this indicates ridge regression. Below is the code

ridge<-glmnet(predictor_variables,blood_pressure,family = 'gaussian',alpha = 0)

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

• Df = number of variables included in the model (this is always the same number in a ridge model)
• %Dev = Percent of deviance explained. The higher the better
• Lambda = The lambda used to attain the %Dev

When you use the “print” function for a ridge model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we have the “print” function. However, I have only printed the first 5 and last 5 models in order to save space.

print(ridge)
##
## Call:  glmnet(x = predictor_variables, y = blood_pressure, family = "gaussian",      alpha = 0)
##
##        Df      %Dev    Lambda
##   [1,] 10 7.622e-37 7716.0000
##   [2,] 10 2.135e-03 7030.0000
##   [3,] 10 2.341e-03 6406.0000
##   [4,] 10 2.566e-03 5837.0000
##   [5,] 10 2.812e-03 5318.0000
................................
##  [95,] 10 1.690e-01    1.2290
##  [96,] 10 1.691e-01    1.1190
##  [97,] 10 1.692e-01    1.0200
##  [98,] 10 1.693e-01    0.9293
##  [99,] 10 1.693e-01    0.8468
## [100,] 10 1.694e-01    0.7716

The results from the “print” function are useful in setting the lambda for the “test” dataset. Based on the results we can set the lambda at 0.83 because this explains close to the highest amount of deviance, about .17.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers refer to the actual coefficient of a particular variable. Inside the plot, each number corresponds to a variable going from left to right in a data-frame/matrix using the “View” function. For example, 1 in the plot refers to “tobacco” 2 refers to “ldl” etc. Across the top of the plot is the number of variables used in the model. Remember this number never changes when doing ridge regression.

plot(ridge,xvar="lambda",label=T)

As you can see, as lambda increases the coefficients decrease in value. This is how ridge regression works, yet no coefficient ever goes to exactly 0.
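To make the shrinkage concrete, here is a minimal sketch in base R on simulated data (not the SAheart data). It uses the textbook closed-form ridge solution rather than “glmnet”, whose lambda scale and standardization differ, so the numbers are only illustrative. The helper “ridge_coef” and the simulated variables are made up for this example.

```r
# Simulated, centred data: two predictors, so no intercept is needed
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 2), ncol = 2), center = TRUE, scale = FALSE)
y <- drop(X %*% c(3, -2) + rnorm(n))
y <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# Length of the coefficient vector for each lambda
coef_norm <- sqrt(colSums(coefs^2))
print(round(coef_norm, 3))
```

In this sketch the coefficients get smaller with every increase in lambda but stay different from zero, which is the behaviour the plot shows.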

You can also look at the coefficient values at a specific lambda value. The values are unstandardized but they provide a useful insight when determining final model selection. In the code below the lambda is set to .83 and we use the “coef” function to do this

ridge.coef<-coef(ridge,s=.83,exact = T)
ridge.coef
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## (Intercept)            105.69379942
## tobacco                 -0.25990747
## ldl                     -0.13075557
## famhist.famhistAbsent    0.42532887
## famhist.famhistPresent  -0.40000846
## typea                   -0.01799031
## obesity                  0.29899976
## alcohol                  0.03648850
## age                      0.43555450
## chd                     -0.26539180

The second plot shows us the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code

plot(ridge,xvar='dev',label=T)

The two plots mirror each other. Increasing lambda causes a decrease in the coefficients, while increasing the fraction of deviance explained goes with an increase in the coefficients. You can also see this in the output of the “print” function: as lambda became smaller there was an increase in the deviance explained.

We now can begin testing our model on the test data set. We need to convert the test dataset to a matrix and then we will use the predict function while setting our lambda to .83 (remember a lambda of .83 explained the most of the deviance). Lastly, we will plot the results. Below is the code.

test.matrix<-as.matrix(test[,2:10])
ridge.y<-predict(ridge,newx = test.matrix,type = 'response',s=.83)
plot(ridge.y,test$sbp)

The last thing we need to do is calculate the mean squared error. By itself this number is not very informative. However, it provides a benchmark for comparing the current model with any other models you may develop. Below is the code.

ridge.resid<-ridge.y-test$sbp
mean(ridge.resid^2)
## [1] 372.4431

Knowing this number, we can develop other models using other methods of analysis to try to reduce it as much as possible.
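Since the mean squared error is reused as a benchmark, it can help to wrap it in a small function. The sketch below uses base R and toy numbers that are not from the SAheart data. The names “mse” and “rmse” are hypothetical helpers. The square root puts the error back in the units of the outcome; for the model above, the square root of 372.44 is about 19.3 points of blood pressure.

```r
# Hypothetical helpers: mean squared error and its square root
mse <- function(predicted, actual) mean((predicted - actual)^2)
rmse <- function(predicted, actual) sqrt(mse(predicted, actual))

# Toy vectors, purely for illustration
actual <- c(120, 135, 150, 110)
predicted <- c(125, 130, 140, 118)

mse(predicted, actual)   # average of 25, 25, 100 and 64
rmse(predicted, actual)  # same units as the outcome

# A naive benchmark: always predict the mean of the outcome
baseline <- mse(rep(mean(actual), length(actual)), actual)
baseline
```

Comparing a model's error against a naive benchmark like this gives the number some meaning even before a second model exists.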

# Primary Tasks in Data Analysis

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people,  plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least 5 core tasks that you need to consider when preparing to analyze data. These five tasks are

1. Developing your question(s)
2. Data exploration
3. Developing a statistical model
4. Interpreting the results
5. Sharing the results

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

Data Exploration

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set, such as missing data or strange variables, and, if necessary, to develop a data dictionary so you know the characteristics of the variables.

Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist’s time and is not easy at all. Mastery of this in many ways means being a master of data science.

Develop a Statistical Model

Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that can affect this are whether your problem is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the most fun part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s).

Interpreting the Results

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.

With interpretation you are trying to determine “what does this answer mean to the stakeholders?”  For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

Communication of Results

Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication, it is necessary to consider the following:

• The audience or stakeholders
• The actual answers to the questions
• The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be  different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

Conclusion

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

# Linear VS Quadratic Discriminant Analysis in R

In this post we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Another commonly used option is logistic regression but there are differences between logistic regression and discriminant analysis. Both LDA and QDA are used in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physics, and chemistry classes a respondent took. Below is some initial code to start our analysis.

library(MASS);library(Ecdat)
data("Mathlevel")

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed using the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function using a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code.

math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]
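One detail of this kind of split is that “sample” with “replace=T” labels each row independently, so the result is only approximately 70/30. The sketch below, in base R with a made-up row count “n”, compares this with an alternative that draws an exact 70% of the row indices. It illustrates the difference and is not a change to the analysis in this post.

```r
set.seed(123)
n <- 600  # hypothetical number of rows

# Approach used in this post: each row is independently labelled 1 or 2
ind <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
table(ind) / n  # close to, but usually not exactly, 0.7 / 0.3

# Alternative: draw exactly 70% of the row indices for training
train_rows <- sample(n, size = round(0.7 * n))
test_rows <- setdiff(seq_len(n), train_rows)
length(train_rows)  # 420
length(test_rows)   # 180
```

Either approach works for a demonstration; the exact version is mainly useful when the split sizes need to be predictable.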

Now we will make our model, which is called “lda.math”, and it will include all available variables in the “math.train” dataset. Next we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.

lda.math<-lda(sex~.,math.train)
lda.math
## Call:
## lda(sex ~ ., data = math.train)
##
## Prior probabilities of groups:
##      male    female
## 0.5986079 0.4013921
##
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
##
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610
plot(lda.math,type='both')

Calling “lda.math” gives us the details of our model. It starts by indicating the prior probabilities of someone being male or female. Next are the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.
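To see how the coefficients act like a regression equation, the sketch below rebuilds the discriminant score by hand. It uses a two-class subset of the built-in “iris” data as a stand-in, since the point is the arithmetic rather than the “Mathlevel” results. The centring step is my reading of how “MASS” computes the score, so treat it as an assumption that the final comparison checks.

```r
library(MASS)

# Two-class subset of the built-in iris data, used only as a stand-in
two <- droplevels(subset(iris, Species != "setosa"))
fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = two)

# predict() returns the discriminant score (x), the class and the posterior
pred <- predict(fit)

# Rebuild the score by hand: centre the predictors at the prior-weighted
# average of the group means, then multiply by the LD1 coefficients
X <- as.matrix(two[, c("Sepal.Length", "Sepal.Width")])
centre <- colSums(fit$prior * fit$means)
manual <- scale(X, center = centre, scale = FALSE) %*% fit$scaling

all.equal(as.numeric(manual), as.numeric(pred$x))
head(cbind(manual = as.numeric(manual), predicted = as.numeric(pred$x)), 3)
```

The score is a weighted sum of the centred predictors, and the class probabilities are then derived from where that score falls.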

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between male and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between male and females. However, since this is for demonstrating purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with the actual classification. Below is the code.

math.lda.predict<-predict(lda.math)
math.train$lda<-math.lda.predict$class
table(math.train$lda,math.train$sex)
##
##          male female
##   male    219    100
##   female   39     73
mean(math.train$lda==math.train$sex)
## [1] 0.6774942

As you can see, we have a lot of misclassification happening. The largest group of errors is the 100 females classified as male, along with 39 males classified as female. The overall accuracy is only 68%, which is not much better than the 60% we would get by always guessing male.
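The numbers in the table can be turned into a few summary rates with base R. The sketch below copies the counts from the training table above (rows are predictions, columns are the actual sex). The object names are made up for this example.

```r
# Counts copied from the training table (rows = predicted, columns = actual)
cm <- matrix(c(219, 39, 100, 73), nrow = 2,
             dimnames = list(predicted = c("male", "female"),
                             actual = c("male", "female")))

accuracy <- sum(diag(cm)) / sum(cm)        # share of correct predictions
baseline <- max(colSums(cm)) / sum(cm)     # always guess the larger class
recall_by_class <- diag(cm) / colSums(cm)  # correct share within each actual class

round(c(accuracy = accuracy, baseline = baseline), 3)
round(recall_by_class, 3)
```

The per-class rates make the problem visible: most males are classified correctly while fewer than half of the females are.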

We will now conduct the same analysis on the test data set. Below is the code.

lda.math.test<-predict(lda.math,math.test)
math.test$lda<-lda.math.test$class
table(math.test$lda,math.test$sex)
##
##          male female
##   male     92     43
##   female   23     20
mean(math.test$lda==math.test$sex)
## [1] 0.6292135

As you can see the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better

QDA allows for each class in the dependent variable to have its own covariance matrix rather than a shared covariance matrix as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train$qda<-math.qda.predict$class
table(math.train$qda,math.train$sex)
##
##          male female
##   male    215     84
##   female   43     89
mean(math.train$qda==math.train$sex)
## [1] 0.7053364

You can see there is almost no difference. Below is the code for the test data.

math.qda.test<-predict(math.qda.fit,math.test)
math.test$qda<-math.qda.test$class
table(math.test$qda,math.test$sex)
##
##          male female
##   male     91     43
##   female   24     20
mean(math.test$qda==math.test$sex)
## [1] 0.6235955

Still disappointing. However, in this post we reviewed linear discriminant analysis and learned about the use of quadratic discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes a covariance matrix shared across the dependent variable categories, while QDA allows each category in the dependent variable to have its own covariance matrix.
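The difference in covariance assumptions can be sketched with base R on simulated data. The two classes “a” and “b” below are invented so that their spreads clearly differ. The per-class matrices are what a QDA-style model works with, and the pooled matrix is the usual within-class estimate behind an LDA-style model.

```r
set.seed(42)
# Simulated two-class data in which the classes have different spreads
a <- cbind(x1 = rnorm(200, 0, 1), x2 = rnorm(200, 0, 1))
b <- cbind(x1 = rnorm(150, 2, 3), x2 = rnorm(150, 2, 0.5))

# QDA-style: one covariance matrix per class
cov_a <- cov(a)
cov_b <- cov(b)

# LDA-style: a single pooled (within-class) covariance matrix
pooled <- ((nrow(a) - 1) * cov_a + (nrow(b) - 1) * cov_b) /
  (nrow(a) + nrow(b) - 2)

round(cov_a, 2)
round(cov_b, 2)
round(pooled, 2)
```

When the per-class matrices look as different as they do here, the shared-covariance assumption is doubtful and QDA has more room to help.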

# Validating a Logistic Model in R

In this post, we are going to continue our analysis of the logistic regression model from the post on logistic regression in R. We need to rerun all of the code from the last post to be ready to continue. As such, the code from the last post is all below.

library(MASS);library(bestglm);library(reshape2);library(corrplot);
library(ggplot2);library(ROCR)
data(survey)
survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)
pm<-melt(survey, id.var="Sex")
ggplot(pm,aes(Sex,value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)

pc<-cor(survey[,2:5])
corrplot.mixed(pc)

set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]
fit<-glm(Sex~.,binomial,train)
exp(coef(fit))

train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
mean(train$predict==train$Sex)
test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
mean(test$predict==test$Sex)

Model Validation

We will now do a K-fold cross validation in order to further see how our model is doing. We cannot use the factor variable “Sex” with the K-fold code so we need to create a dummy variable. First, we create a variable called “y” that has 123 spaces, which is the same size as the “train” dataset. Second, we fill “y” with 1 in every example that is coded “male” in the “Sex” variable.
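As a rough illustration of what K-fold cross-validation does, here is a hand-rolled sketch in base R on simulated data. It only estimates the accuracy of a single logistic model, whereas “bestglm” uses cross-validation to choose among subsets of variables, so this is a simplified picture and not a replacement for the code in this post. All names and the simulated data are invented.

```r
set.seed(123)
# Simulated stand-in data: a binary outcome driven by one predictor
n <- 200
height <- rnorm(n, 170, 10)
y <- rbinom(n, 1, plogis(0.25 * (height - 170)))
dat <- data.frame(y = y, height = height)

k <- 10
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

fold_acc <- sapply(seq_len(k), function(i) {
  # Fit on every fold except i, then predict the held-out fold i
  fit <- glm(y ~ height, family = binomial, data = dat[fold != i, ])
  p <- predict(fit, newdata = dat[fold == i, ], type = "response")
  mean((p > 0.5) == (dat$y[fold == i] == 1))  # accuracy on the held-out fold
})

round(fold_acc, 2)
mean(fold_acc)  # cross-validated accuracy estimate
```

Every row is used for testing exactly once, which is why the averaged accuracy is a fairer estimate than accuracy on the training data.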

In addition, we also need to create a new dataset and remove some variables from our prior analysis otherwise we will confuse the functions that we are going to use. We will remove “predict”, “Sex”, and “probs”

train$y<-rep(0,123)
train$y[train$Sex=="Male"]=1
my.cv<-train[,-8]
my.cv$Sex<-NULL
my.cv$probs<-NULL

We now can do our K-fold analysis. The code is complicated, so you can either trust it or double check it on your own.

bestglm(Xy=my.cv,IC="CV",CVArgs = list(Method="HTF",K=10,REP=1),family = binomial)
## Morgan-Tatar search since family is non-gaussian.
## CV(K = 10, REP = 1)
## BICq equivalent for q in (6.66133814775094e-16, 0.0328567092272112)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -45.2329733 7.80146036 -5.798014 6.710501e-09
## Height        0.2615027 0.04534919  5.766425 8.097067e-09

The results confirm what we already knew: only the “Height” variable is valuable in predicting Sex. We will now create our new model using only the recommendation of the K-fold validation analysis. Then we check the new model against the train dataset and the test dataset. The code below is a repeat of prior code but based on the cross-validation.

reduce.fit<-glm(Sex~Height, family=binomial,train)
train$cv.probs<-predict(reduce.fit,type='response')
train$cv.predict<-rep('Female',123)
train$cv.predict[train$cv.probs>0.5]='Male'
table(train$cv.predict,train$Sex)
##
##          Female Male
##   Female     61   11
##   Male        7   44
mean(train$cv.predict==train$Sex)
## [1] 0.8536585
test$cv.probs<-predict(reduce.fit,test,type = 'response')
test$cv.predict<-rep('Female',46)
test$cv.predict[test$cv.probs>0.5]='Male'
table(test$cv.predict,test$Sex)
##
##          Female Male
##   Female     16    7
##   Male        1   22
mean(test$cv.predict==test$Sex)
## [1] 0.826087

The results are consistent for both the train and test dataset. We are now going to create the ROC curve. This will provide a visual and the AUC number to further help us to assess our model. However, a model is only good when it is compared to another model. Therefore, we will create a really bad model in order to compare it to the original model and the cross-validated model.

We will first make a bad model and store the probabilities in the “test” dataset. The bad model will use “Age” to predict “Sex”, which doesn’t make any sense at all. Below is the code followed by the ROC curve of the bad model.

bad.fit<-glm(Sex~Age,family = binomial,test)
test$bad.probs<-predict(bad.fit,type='response')
pred.bad<-prediction(test$bad.probs,test$Sex)
perf.bad<-performance(pred.bad,'tpr','fpr') #true positive rate vs. false positive rate
plot(perf.bad,col=1)

The closer the line is to the diagonal, the worse the model is. As we can see, the bad model is really bad.

What we just did with the bad model we will now repeat for the full model and the cross-validated model. As before, we need to store the predictions in a way that the ROCR package can use them. We will use the “prediction” function to create a variable called “pred.full”, which begins the process of graphing the original full model from the last blog post. Next, we will create the “perf.full” variable to store the performance of the model. Notice the arguments ‘tpr’ and ‘fpr’ for true positive rate and false positive rate. Lastly, we plot the results.

pred.full<-prediction(test$prob,test$Sex)
perf.full<-performance(pred.full,'tpr','fpr')
plot(perf.full, col=2)

We repeat this process for the cross-validated model

pred.cv<-prediction(test$cv.probs,test$Sex)
perf.cv<-performance(pred.cv,'tpr','fpr')
plot(perf.cv,col=3)

Now let’s put all the different models on one plot

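plot(perf.bad,col=1)
plot(perf.full,col=2,add=TRUE) #overlay the full model
plot(perf.cv,col=3,add=TRUE) #overlay the cross-validated model
legend(.7,.4,c("BAD","FULL","CV"), 1:3)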

Finally, we can calculate the AUC for each model

auc.bad<-performance(pred.bad,'auc')
auc.bad@y.values
## [[1]]
## [1] 0.4766734
auc.full<-performance(pred.full,"auc")
auc.full@y.values
## [[1]]
## [1] 0.959432
auc.cv<-performance(pred.cv,'auc')
auc.cv@y.values
## [[1]]
## [1] 0.9107505

The higher the AUC the better. As such, the full model with all variables is superior to the cross-validated or bad model. This is despite the fact that there are many high correlations in the full model as well. Another point to consider is that the cross-validated model is simpler so this may be a reason to pick it over the full model. As such, the statistics provide support for choosing a model but they do not trump the ability of the researcher to pick based on factors beyond just numbers.
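For intuition about what the AUC number measures, it can be computed by hand in base R. The sketch below uses the rank-based formula, which gives the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The helper “auc_by_rank” and the toy scores are invented, and this is not how the ROCR package is implemented internally.

```r
# Hypothetical helper: AUC from scores and 0/1 labels via the rank formula
auc_by_rank <- function(score, label) {
  r <- rank(score)  # ties receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c(0, 0, 1, 0, 1, 1)
good <- c(0.1, 0.3, 0.4, 0.6, 0.8, 0.9)  # positives mostly score higher
flat <- rep(0.5, 6)                       # uninformative scores

auc_by_rank(good, label)  # 8 of the 9 positive/negative pairs are ordered correctly
auc_by_rank(flat, label)  # 0.5, the same as guessing
```

An AUC near 0.5, like the bad model above, means the scores carry almost no information about the outcome.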

# Logistic Regression in R

In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical independent variables. In our example, we want to predict Sex (male or female) using several continuous variables from the “survey” dataset in the “MASS” package.

library(MASS);library(bestglm);library(reshape2);library(corrplot)
data(survey)
?MASS::survey #explains the variables in the study

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove any examples that are missing data so we use the “na.omit” function for this. Below is the code.

survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)

We now need to check for collinearity using the “corrplot.mixed” function from the “corrplot” package.

pc<-cor(survey[,2:5])
corrplot.mixed(pc)

We have extreme correlation between “Wr.Hnd” and “NW.Hnd”. This makes sense because people’s hands are normally the same size. Since this blog post is a demonstration of logistic regression we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for reproducibility. First we need to make a variable that we call “ind” that randomly assigns each row of survey a 1 (with probability 70%) or a 2 (with probability 30%). We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is approximately 70% train and 30% test. Below is the code.

set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

fit<-glm(Sex~.,family=binomial,train)
summary(fit)
##
## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9875  -0.5466  -0.1395   0.3834   3.4443
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.42175    8.74961  -5.306 1.12e-07 ***
## Wr.Hnd       -0.43499    0.66357  -0.656    0.512
## NW.Hnd        1.05633    0.70034   1.508    0.131
## Pulse        -0.02406    0.02356  -1.021    0.307
## Height        0.21062    0.05208   4.044 5.26e-05 ***
## Age           0.00894    0.05368   0.167    0.868
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 169.14  on 122  degrees of freedom
## Residual deviance:  81.15  on 117  degrees of freedom
## AIC: 93.15
##
## Number of Fisher Scoring iterations: 6
exp(coef(fit))
##  (Intercept)       Wr.Hnd       NW.Hnd        Pulse       Height
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00
##          Age
## 1.008980e+00

The results indicate that only height is useful in predicting if someone is a male or female. The second piece of code shares the odds ratios. The odds ratio tells how a one unit increase in the independent variable changes the odds of being male in our model. For example, for every one unit increase in height the odds of a particular example being male are multiplied by 1.23, which is a 23% increase.
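The arithmetic behind the odds ratio can be checked in a few lines of base R. The sketch below copies the rounded Height coefficient from the summary output above. The variable names and the helper “odds_to_prob” are made up for illustration.

```r
# Coefficient for Height copied from the summary output above
b_height <- 0.21062

or_1 <- exp(b_height)        # odds multiplier for +1 unit of height
or_10 <- exp(10 * b_height)  # odds multiplier for +10 units of height

c(or_1 = or_1, percent_change = (or_1 - 1) * 100, or_10 = or_10)

# Odds multipliers compound: ten 1-unit steps equal one 10-unit step
all.equal(or_1^10, or_10)

# Converting odds to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)
odds_to_prob(c(1, or_1))  # even odds, then the same odds after +1 unit
```

The effect is multiplicative on the odds, so the change in probability from one extra unit depends on where you start.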

We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
##
##          Female Male
##   Female     61    7
##   Male        7   48
mean(train$predict==train$Sex)
## [1] 0.8861789
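The code above hard-codes 123, the number of rows in “train”. A slightly more general way to write the same thresholding step, sketched below with base R and made-up probabilities, is to let “ifelse” size the result automatically. This is a suggested variation and not the code that produced the output in this post.

```r
# Hypothetical predicted probabilities and true labels
probs <- c(0.91, 0.12, 0.55, 0.49, 0.50, 0.73)
truth <- c("Male", "Female", "Male", "Male", "Female", "Male")

# ifelse() returns one label per probability, so no row count is hard-coded
pred <- ifelse(probs > 0.5, "Male", "Female")

# Fixing the factor levels keeps the table square even if a class is never predicted
lv <- c("Female", "Male")
table(predicted = factor(pred, levels = lv), actual = factor(truth, levels = lv))
mean(pred == truth)
```

Note that a probability of exactly 0.5 is labelled “Female” here, the same as in the code above, because the comparison is strictly greater than 0.5.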

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
##
##          Female Male
##   Female     17    3
##   Male        0   26
mean(test$predict==test$Sex)
## [1] 0.9347826

As you can see, we do even better on the test set with an accuracy of 93.5%. Our model is looking pretty good and height is an excellent predictor of sex which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess its quality.
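Accuracy alone hides where the mistakes fall. As a minimal sketch, the test-set table above can be re-entered by hand to get sensitivity and specificity as well. Treating “Male” as the positive class and the object names used here are our own assumptions.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted, columns = actual)
conf <- matrix(c(17,  3,
                  0, 26),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("Female", "Male"),
                               actual    = c("Female", "Male")))

accuracy <- sum(diag(conf)) / sum(conf)

# Share of actual males that the model labels "Male"
sensitivity <- conf["Male", "Male"] / sum(conf[, "Male"])

# Share of actual females that the model labels "Female"
specificity <- conf["Female", "Female"] / sum(conf[, "Female"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

All three test-set errors are males predicted as female, which the single accuracy number does not show.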

# Best Subset Regression in R

In this post, we will take a look at best subset regression. Best subset regression fits a model for all possible feature or variable combinations and the decision for the most appropriate model is made by the analyst based on judgment or some statistical criteria.

Best subset regression is an alternative to both Forward and Backward stepwise regression. Forward stepwise selection adds one variable at a time based on the lowest residual sum of squares until no more variables continue to lower the residual sum of squares. Backward stepwise regression starts with all variables in the model and removes variables one at a time. The concern with stepwise methods is they can produce biased regression coefficients, conflicting models, and inaccurate confidence intervals.

Best subset regression bypasses these weaknesses of stepwise models by creating all models possible and then allowing you to assess which variables should be included in your final model. The one drawback to best subset is that a large number of variables means a large number of potential models, which can make it difficult to make a decision among several choices.
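To see how quickly the number of potential models grows, the short sketch below counts every possible subset of predictors. The value of 8 matches the number of predictors in the example that follows; the variable names are our own.

```r
p <- 8                       # number of candidate predictors

# Each predictor is either in or out of the model, so there are 2^p subsets
n_models <- 2^p

# The same total broken down by model size (0 predictors up to p predictors)
per_size <- choose(p, 0:p)

n_models
per_size
```

With 20 predictors the same calculation gives more than a million candidate models, which is why the criteria discussed below are needed to narrow the choice.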

In this post, we will use the “Fair” dataset from the “Ecdat” package to predict marital satisfaction based on age, Sex, the presence of children, years married, religiosity, education, occupation, and number of affairs in the past year. Below is some initial code.

library(leaps);library(Ecdat);library(car);library(lmtest)
data(Fair)

We begin our analysis by building the initial model with all variables in it. Below is the code

fit<-lm(rate~.,Fair)
summary(fit)
##
## Call:
## lm(formula = rate ~ ., data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2049 -0.6661  0.2298  0.7705  2.2292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.522875   0.358793   9.819  < 2e-16 ***
## sexmale     -0.062281   0.099952  -0.623  0.53346
## age         -0.009683   0.007548  -1.283  0.20005
## ym          -0.019978   0.013887  -1.439  0.15079
## childyes    -0.206976   0.116227  -1.781  0.07546 .
## religious    0.042142   0.037705   1.118  0.26416
## education    0.068874   0.021153   3.256  0.00119 **
## occupation  -0.015606   0.029602  -0.527  0.59825
## nbaffairs   -0.078812   0.013286  -5.932 5.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 592 degrees of freedom
## Multiple R-squared:  0.1405, Adjusted R-squared:  0.1289
## F-statistic:  12.1 on 8 and 592 DF,  p-value: 4.487e-16

The initial results are already interesting even though the r-square is low. Couples with children have less marital satisfaction than couples without children when controlling for the other factors. This is the largest regression weight, although it is only marginally significant (p = 0.075). In addition, more education is associated with an increase in marital satisfaction. Lastly, as the number of affairs increases there is a decrease in marital satisfaction. Keep in mind that the “rate” variable goes from 1 to 5 with one meaning a terrible marriage and five a great one. The intercept of 3.52 is the predicted marital satisfaction when all of the other variables are equal to zero.

We will now create our subset models. Below is the code.

sub.fit<-regsubsets(rate~.,Fair)
best.summary<-summary(sub.fit)

In the code above we create the sub models using the “regsubsets” function from the “leaps” package and save them in the variable called “sub.fit”. We then save the summary of “sub.fit” in the variable “best.summary”. We will use the “best.summary” and “sub.fit” variables several times to determine which model to use.

There are many different ways to assess the model. We will use the following statistical methods that come with the results from the “regsubsets” function.

• Mallows’ Cp
• Bayesian Information Criteria

We will make two charts for each of the criteria above. The plot to the left will show how many features to include in the model. The plot to the right will tell you which variables to include. It is important to note that for both of these methods, the lower the score the better the model. Below is the code for Mallows’ Cp.

par(mfrow=c(1,2))
plot(best.summary$cp)
plot(sub.fit,scale = "Cp")

The plot on the left suggests that a four feature model is the most appropriate. However, this chart does not tell me which four features. The chart on the right is read in reverse order. The high numbers are at the bottom and the low numbers are at the top when looking at the y-axis. Knowing this, we can conclude that the most appropriate variables to include in the model are age, children presence, education, and number of affairs. Below are the results using the Bayesian Information Criterion

par(mfrow=c(1,2))
plot(best.summary$bic)
plot(sub.fit,scale = "bic")

These results indicate that a three feature model is appropriate. The variables or features are years married, education, and number of affairs. Presence of children was not considered beneficial. Since presence of children was close to significant in our original model and was selected by Mallows’ Cp, we will include it for now.
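For readers who want to see what the two criteria measure, here is a minimal base R sketch that computes Mallows’ Cp by hand and uses the built-in “BIC” function for a few candidate models. The “mtcars” data, the candidate formulas, and the “mallows_cp” helper are our own assumptions. The scores may be on a different scale than those reported by “regsubsets”, but in both cases lower is better.

```r
# Full model supplies the error variance estimate used by Cp
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)
sigma2_full <- sum(residuals(full)^2) / df.residual(full)

# Cp = RSS / sigma2_full - n + 2 * (number of parameters, intercept included)
mallows_cp <- function(model) {
  p <- length(coef(model))
  sum(residuals(model)^2) / sigma2_full - n + 2 * p
}

# A few hand-picked candidate models (illustrative only)
candidates <- list(
  wt    = lm(mpg ~ wt, data = mtcars),
  wt_hp = lm(mpg ~ wt + hp, data = mtcars),
  full  = full
)

data.frame(cp  = sapply(candidates, mallows_cp),
           bic = sapply(candidates, BIC))
```

By construction the full model's Cp equals its number of parameters, so smaller models are judged by whether their Cp comes in at or below their own parameter count.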

Below is the code for the model based on the subset regression.

fit2<-lm(rate~age+child+education+nbaffairs,Fair)
summary(fit2)
##
## Call:
## lm(formula = rate ~ age + child + education + nbaffairs, data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2172 -0.7256  0.1675  0.7856  2.2713
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.861154   0.307280  12.566  < 2e-16 ***
## age         -0.017440   0.005057  -3.449 0.000603 ***
## childyes    -0.261398   0.103155  -2.534 0.011531 *
## education    0.058637   0.017697   3.313 0.000978 ***
## nbaffairs   -0.084973   0.012830  -6.623 7.87e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.029 on 596 degrees of freedom
## Multiple R-squared:  0.1352, Adjusted R-squared:  0.1294
## F-statistic: 23.29 on 4 and 596 DF,  p-value: < 2.2e-16

The results look ok. The older a person is the less satisfied they are with their marriage. If children are present the marriage is less satisfying. The more educated a person is the more satisfied they are. Lastly, a higher number of affairs indicates less marital satisfaction. However, before we get excited we need to check for collinearity and homoscedasticity. Below is the code

vif(fit2)
##       age     child education nbaffairs
##  1.249430  1.228733  1.023722  1.014338
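For intuition about where these numbers come from, the VIF of a predictor can be computed by regressing it on the other predictors and taking 1 / (1 - R²). The sketch below does this by hand in base R for numeric predictors. The “manual_vif” helper and the use of “mtcars” are our own assumptions for illustration and are not part of the “car” package.

```r
# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
manual_vif <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_values <- manual_vif(mtcars, c("wt", "hp", "qsec"))
round(vif_values, 2)
```

A value of 1 means the predictor is uncorrelated with the others, and the value grows as the other predictors explain more of it.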

No issues with collinearity. VIF values above 5 or 10 indicate a problem. Let’s check for homoscedasticity

par(mfrow=c(2,2))
plot(fit2)

The normal qqplot and residuals vs leverage plot can be used for locating outliers. The residual vs fitted and the scale-location plot do not look good as there appears to be a pattern in the dispersion, which indicates heteroscedasticity. To confirm this we will use the Breusch-Pagan test from the “lmtest” package. Below is the code

bptest(fit2)
##
##  studentized Breusch-Pagan test
##
## data:  fit2
## BP = 16.238, df = 4, p-value = 0.002716

There you have it. The p-value of 0.0027 leads us to reject the null hypothesis of constant variance, so our model violates the assumption of homoscedasticity. However, this model was developed for demonstration purposes to provide an example of subset regression.
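To show what the studentized Breusch-Pagan test is doing, here is a hedged base R sketch. It regresses the squared residuals on the predictors and uses n times the R² of that auxiliary regression as a chi-squared statistic. The “mtcars” model and the object names are our own assumptions; this is a sketch of the idea and not the “lmtest” source code.

```r
# A small demonstration model (illustrative data and formula)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(fit_demo)^2
aux <- lm(sq_res ~ wt + hp, data = mtcars)

# Studentized (Koenker) version: n * R^2, chi-squared with one df per predictor
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value means the size of the residuals can be predicted from the independent variables, which is exactly what constant variance rules out.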

# Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This process is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in R or Excel. If you created it in Excel just save it as a csv and load it into R. Below is the initial code

library(readr)
apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This is a small dataset with the columns of “weight” and “location”. Here are some of the problems

• Weights are in different units
• Weights are written in different ways
• Location is not consistent

In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project

• Convert the “weight” variable to a numerical variable instead of character
• Remove the text and have only numbers in the “weight” variable
• Change weights in grams to kilograms
• Convert the “location” variable to a factor variable instead of character
• Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

1. We created a variable called “corrected.weight”
2. We used the function “as.numeric”, which turns whatever results inside it into a numerical variable
3. Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another.
4. Inside “gsub” we used the argument pattern and set it to “[[:alpha:]]” and “”. This told R to look for any lower or uppercase letters and replace them with nothing, which removes them. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

1. We created a variable called “gram.error”
2. We used the grep function to search the “weight” variable in the apple data frame for input that contains 4 digits in a row; this is what the “[[:digit:]]{4}” argument means. We do not change any values yet, we just store their positions in “gram.error”
3. Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable.
4. We tell R to take any value in the “corrected.weight” variable at the positions stored in “gram.error” and divide it by 1000. Dividing by 1000 converts the value from grams to kilograms.
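The four-digit rule works for this table, but it would miss a value such as “950 grams”. As an alternative sketch, and not the approach used above, the grams can be flagged by the unit text itself before the letters are stripped. The small “weight” vector and the variable names are our own assumptions.

```r
# A few raw values in the same style as the apple data (illustrative)
weight <- c("3.2", "4.2kg", "1.3 kg", "7200 grams", "42")

# Flag entries recorded in grams by looking for the unit, not the digit count
is_grams <- grepl("gram", weight, ignore.case = TRUE)

# Keep only digits and the decimal point, then convert to numeric
value <- as.numeric(gsub("[^0-9.]", "", weight))

# Convert the flagged entries from grams to kilograms
value[is_grams] <- value[is_grams] / 1000
value
```

Either way, entries without a unit such as “42” are left untouched, so they still need a decision from the analyst.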

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain but in short we used the “agrep” function to tell R to search the “location” variable for values similar to our term “europe”. The other arguments set how many insertions and deletions are allowed, so that values close to the term, such as “europee”, are also matched. This process is repeated for the term “us”. We then store the location variable from the “apple” dataframe in a new variable called “corrected.location”. We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next we have to make some code to deal with “United States” and apply this using the “gsub” function.

We are almost done, now we combine our two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor. This is the behavior of R versions before 4.0.0; in later versions you need to pass “stringsAsFactors = TRUE” to “data.frame” to get the same result. This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

# Principal Component Analysis in R

This post will demonstrate the use of principal component analysis (PCA). PCA is useful for several reasons. First, it allows you to place your examples into groups similar to linear discriminant analysis but you do not need to know beforehand what the groups are. Second, PCA is used for the purpose of dimension reduction.
For example, if you have 50 variables PCA can allow you to reduce this while retaining a certain threshold of variance. If you are working with a large dataset this can greatly reduce the computational time and general complexity of your models.

Keep in mind that there really is not a dependent variable as this is unsupervised learning. What you are trying to see is how different examples can be mapped in space based on whatever independent variables are used. For our example, we will use the “Carseats” dataset from the “ISLR” package. Our goal is to understand the relationship among the variables when examining the shelf location of the car seat. Below is the initial code to begin the analysis

library(ggplot2)
library(ISLR)
data("Carseats")

We first need to rearrange the data and remove the variables we are not going to use in the analysis. Below is the code.

Carseats1<-Carseats
Carseats1<-Carseats1[,c(1,2,3,4,5,6,8,9,7,10,11)]
Carseats1$Urban<-NULL
Carseats1$US<-NULL

Here is what we did

1. We made a copy of the “Carseats” data called “Carseats1”
2. We rearranged the order of the variables so that the factor variables are at the end. This will make sense later
3. We removed the “Urban” and “US” variables from the table as they will not be a part of our analysis

We will now do the PCA. We need to scale and center our data otherwise the larger numbers will have a much stronger influence on the results than smaller numbers. Fortunately, the “prcomp” function has a “scale.” and a “center” argument. We will also use only the first 7 columns for the analysis as “ShelveLoc” is not useful for this analysis. If we hadn’t moved “ShelveLoc” to the end of the dataframe it would cause some headache. Below is the code.

Carseats.pca<-prcomp(Carseats1[,1:7],scale. = T,center = T)
summary(Carseats.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.3315 1.1907 1.0743 0.9893 0.9260 0.80506 0.41320
## Proportion of Variance 0.2533 0.2026 0.1649 0.1398 0.1225 0.09259 0.02439
## Cumulative Proportion  0.2533 0.4558 0.6207 0.7605 0.8830 0.97561 1.00000

The summary of “Carseats.pca” tells us how much of the variance each component explains. Keep in mind that the number of components is equal to the number of variables. The “proportion of variance” tells us the contribution each component makes and the “cumulative proportion” is the running total of those contributions. If your goal is dimension reduction then the number of components to keep depends on the threshold you set. For example, if you need around 90% of the variance you would keep the first 5 components. If you need 95% or more of the variance you would keep the first six. To actually use the components you would take the “Carseats.pca$x” data and move it to your data frame.
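Picking the number of components from a threshold can also be done in code. The sketch below first applies a cutoff to the cumulative proportions reported above, then repeats the idea on a fitted “prcomp” object and keeps the matching columns of the scores. The “n_components” helper, the thresholds, and the use of the built-in “USArrests” data as a stand-in are our own assumptions.

```r
# Cumulative proportions copied from summary(Carseats.pca) above
cum_var <- c(0.2533, 0.4558, 0.6207, 0.7605, 0.8830, 0.97561, 1.00000)

# Smallest number of components whose cumulative variance reaches a threshold
n_components <- function(cum, threshold) which(cum >= threshold)[1]

n_components(cum_var, 0.88)   # "around 90%"
n_components(cum_var, 0.95)

# Same idea on a fitted prcomp object (USArrests is a stand-in dataset)
pca  <- prcomp(USArrests, scale. = TRUE, center = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)
k    <- n_components(cumsum(prop), 0.90)

# Keep only the first k columns of the scores for later modeling
reduced <- as.data.frame(pca$x[, 1:k, drop = FALSE])
```

Note that five components give 88.3% of the variance in the summary above, so a strict 90% cutoff would require six.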

Keep in mind that the actual components have no conceptual meaning. Each is a numerical representation of a combination of several variables that were reduced using PCA to fewer variables, such as going from 7 variables to 5 variables.

This means that PCA is great for reducing variables for prediction purpose but is much harder for explanatory studies unless you can explain what the new components represent.

For our purposes, we will keep 5 components. This means that we have reduced our dimensions from 7 to 5 while still keeping almost 90% of the variance. Graphing our results is tricky because we have 5 dimensions but the human mind can only conceptualize 3 at best and normally 2. As such we will plot the first two components and color them by shelf location using ggplot2. Below is the code

scores<-as.data.frame(Carseats.pca$x)
pcaplot<-ggplot(scores,(aes(PC1,PC2,color=Carseats1$ShelveLoc)))+geom_point()
pcaplot

From the plot you can see there is little separation when using the first two components of the PCA analysis. This makes sense as we can only graph two components so we are missing a lot of the variance. However for demonstration purposes the analysis is complete.

# Linear Discriminant Analysis in R

In this post we will look at an example of linear discriminant analysis (LDA). LDA is used to develop a statistical model that classifies examples in a dataset. In the example in this post, we will use the “Star” dataset from the “Ecdat” package. What we will do is try to predict the type of class the students learned in (regular, small, regular with aide) using their math scores, reading scores, and the teaching experience of the teacher. Below is the initial code

library(Ecdat)
library(MASS)
data(Star)

We first need to examine the data by using the “str” function

str(Star)
## 'data.frame':    5748 obs. of  8 variables:
##  $ tmathssk: int  473 536 463 559 489 454 423 500 439 528 ...
##  $ treadssk: int  447 450 439 448 447 431 395 451 478 455 ...
##  $ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $ totexpk : int  7 21 0 16 5 8 17 3 11 10 ...
##  $ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $ race    : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
##  $ schidkn : int  63 20 19 69 79 5 16 56 11 66 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
##   .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...

We will use the following variables

• dependent variable = classk (class type)
• independent variable = tmathssk (Math score)
• independent variable = treadssk (Reading score)
• independent variable = totexpk (Teaching experience)

We now need to examine the data visually by looking at histograms for our independent variables and a table for our dependent variable

hist(Star$tmathssk)
hist(Star$treadssk)
hist(Star$totexpk)
prop.table(table(Star$classk))
##
##           regular       small.class regular.with.aide
##         0.3479471         0.3014962         0.3505567

The data mostly looks good. The results of the “prop.table” function will help us when we develop our training and testing datasets. The only problem is with the “totexpk” variable. It is not anywhere near to being normally distributed. To deal with this we will use the square root for teaching experience. Below is the code

star.sqrt<-Star
star.sqrt$totexpk.sqrt<-sqrt(star.sqrt$totexpk)
hist(sqrt(star.sqrt$totexpk))

Much better. We now need to check the correlation among the variables as well and we will use the code below.

cor.star<-data.frame(star.sqrt$tmathssk,star.sqrt$treadssk,star.sqrt$totexpk.sqrt)
cor(cor.star)
##                        star.sqrt.tmathssk star.sqrt.treadssk
## star.sqrt.tmathssk             1.00000000          0.7135489
## star.sqrt.totexpk.sqrt         0.08647957          0.1045353
##                        star.sqrt.totexpk.sqrt
## star.sqrt.tmathssk                 0.08647957
## star.sqrt.totexpk.sqrt             1.00000000

None of the correlations are too bad. We can now develop our model using linear discriminant analysis. First, we need to scale our scores because the test scores and the teaching experience are measured differently. Then, we need to divide our data into a train and test set as this will allow us to determine the accuracy of the model. Below is the code.

star.sqrt$tmathssk<-scale(star.sqrt$tmathssk)
star.sqrt$treadssk<-scale(star.sqrt$treadssk)
star.sqrt$totexpk.sqrt<-scale(star.sqrt$totexpk.sqrt)
train.star<-star.sqrt[1:4000,]
test.star<-star.sqrt[4001:5748,]
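Taking the first 4000 rows works as long as the row order is unrelated to class type. If you are not sure about that, a random split is a safer alternative. The sketch below only builds the row indices so that it runs without the data; the seed and the object names are our own assumptions.

```r
set.seed(1)                            # arbitrary seed for reproducibility
n <- 5748                              # number of rows in Star, from str() above

# Draw 4000 row numbers at random for training; the rest form the test set
train_idx <- sample(n, size = 4000)
test_idx  <- setdiff(seq_len(n), train_idx)

# With the data loaded, the split would then be:
# train.star <- star.sqrt[train_idx, ]
# test.star  <- star.sqrt[test_idx, ]

c(train = length(train_idx), test = length(test_idx))
```

The sizes match the split used in this post, so the rest of the code would run unchanged, although the numeric results would differ from those shown here.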

Now we develop our model. In the code below the “prior” argument indicates what we expect the probabilities to be. In our data the distribution of the three class types is about the same which means that the a priori probability is 1/3 for each class type.

train.lda<-lda(classk~tmathssk+treadssk+totexpk.sqrt, data =
train.star,prior=c(1,1,1)/3)
train.lda
## Call:
## lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star,
##     prior = c(1, 1, 1)/3)
##
## Prior probabilities of groups:
##           regular       small.class regular.with.aide
##         0.3333333         0.3333333         0.3333333
##
## Group means:
## regular           -0.04237438 -0.05258944  -0.05082862
## small.class        0.13465218  0.11021666  -0.02100859
## regular.with.aide -0.05129083 -0.01665593   0.09068835
##
## Coefficients of linear discriminants:
##                      LD1         LD2
## tmathssk      0.89656393 -0.04972956
## totexpk.sqrt -0.49061950  0.80051026
##
## Proportion of trace:
##    LD1    LD2
## 0.7261 0.2739

The printout is mostly readable. At the top is the actual code used to develop the model followed by the probabilities of each group. The next section shares the means of the groups. The coefficients of linear discriminants are the values used to classify each example. The coefficients are similar to regression coefficients. The computer places each example in both equations and probabilities are calculated. Whichever class has the highest probability is the winner. In addition, the higher the absolute value of a coefficient the more weight it has. For example, “tmathssk” is the most influential on LD1 with a coefficient of about 0.90.

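The proportion of trace is similar to the proportion of variance in principal component analysis. It is the share of the between-group variance explained by each discriminant function, so here LD1 accounts for about 73% and LD2 for about 27%.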

Now we will take the trained model and see how it does with the test set. We create a new object called “predict.lda” using our “train.lda” model and the test data called “test.star”

predict.lda<-predict(train.lda,newdata = test.star)

We can use the “table” function to see how well our model has done. We can do this because we actually know what class our data is beforehand because we divided the dataset. What we need to do is compare this to what our model predicted. Therefore, we compare the “classk” variable of our “test.star” dataset with the “class” predicted by the “predict.lda” model.

table(test.star$classk,predict.lda$class)
##
##                     regular small.class regular.with.aide
##   regular               155         182               249
##   small.class           145         198               174
##   regular.with.aide     172         204               269

The results are pretty bad. For example, in the first row called “regular” we have 155 examples that were actually “regular” and predicted as “regular” by the model. In the next column, 182 examples were actually “regular” but predicted as “small.class”, etc. To find out how well our model did you add together the examples along the diagonal from left to right and divide by the total number of examples. Below is the code

(155+198+269)/1748
## [1] 0.3558352
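The same calculation can be done from the table itself, which avoids typing the counts by hand. The sketch below re-enters the table above as a matrix and also computes a simple baseline, the accuracy from always guessing the most common class in the test set. The object names are our own assumptions.

```r
classes <- c("regular", "small.class", "regular.with.aide")

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
conf <- matrix(c(155, 182, 249,
                 145, 198, 174,
                 172, 204, 269),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Baseline: always predict the largest actual class
baseline <- max(rowSums(conf)) / sum(conf)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

Comparing against a baseline like this gives the accuracy figure some context, since with three roughly equal classes even blind guessing gets about a third right.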

Only 36% accurate, which is terrible but ok for a demonstration of linear discriminant analysis. Since we only have two functions, or two dimensions, we can plot our model. Below I provide a visual of the first 50 examples classified by the predict.lda model.

plot(predict.lda$x[1:50])
text(predict.lda$x[1:50],as.character(predict.lda$class[1:50]),col=as.numeric(predict.lda$class[1:50]))
abline(h=0,col="blue")
abline(v=0,col="blue")

The first function, which is the vertical line, doesn’t seem to discriminate anything as it is off to the side and not separating any of the data. However, the second function, which is the horizontal one, does a good job of dividing the “regular.with.aide” from the “small.class”. Yet, there are problems with distinguishing the class “regular” from either of the other two groups. In order to improve our model we need additional independent variables to help to distinguish the groups in the dependent variable.
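One detail worth keeping in mind when reading the plot is how R indexes a matrix. The scores in “predict.lda$x” are a matrix with one column per discriminant function, and indexing a matrix with a single range such as “[1:50]” returns values from the first column only, which “plot” then draws against the observation number. To plot one function against the other you would index by rows instead. The sketch below shows the difference on a tiny made-up matrix; the numbers and names are our own assumptions.

```r
# A tiny stand-in for a matrix of discriminant scores (made-up values)
scores <- cbind(LD1 = c(0.5, -1.2, 0.3, 2.0),
                LD2 = c(1.1,  0.4, -0.7, -0.2))

# Single-range indexing walks down the first column and returns a plain vector
first_col_only <- scores[1:3]

# Row indexing keeps both columns, so each row is still an (LD1, LD2) pair
both_cols <- scores[1:3, ]

# A two-column matrix is plotted as column 1 against column 2
if (interactive()) {
  plot(both_cols, type = "n")
  text(both_cols, labels = c("a", "b", "c"))
}
```

With row indexing, a vertical line at 0 marks where the first function changes sign and a horizontal line at 0 does the same for the second.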

# Generalized Additive Models in R

In this post, we will learn how to create a generalized additive model (GAM). GAMs extend generalized linear models by letting the linear predictor include smooth, non-parametric functions of the predictor variables. As such, you do not need to specify the functional relationship between the response and the continuous variables. This allows you to explore the data for potential relationships that can be more rigorously tested with other statistical models.

In our example, we will use the “Auto” dataset from the “ISLR” package and use the variables “mpg”,“displacement”,“horsepower”,and “weight” to predict “acceleration”. We will also use the “mgcv” package. Below is some initial code to begin the analysis

library(mgcv)
library(ISLR)
data(Auto)

We will now make the model. We want to understand the response of “acceleration” to the explanatory variables of “mpg”, “displacement”, “horsepower”, and “weight”. After setting the model we will examine the summary. Below is the code

model1<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight),data=Auto)
summary(model1)
##
## Family: gaussian
##
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.54133    0.07205   215.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                   edf Ref.df      F  p-value
## s(mpg)          6.382  7.515  3.479  0.00101 **
## s(displacement) 1.000  1.000 36.055 4.35e-09 ***
## s(horsepower)   4.883  6.006 70.187  < 2e-16 ***
## s(weight)       3.785  4.800 41.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.733   Deviance explained = 74.4%
## GCV = 2.1276  Scale est. = 2.0351    n = 392

All of the explanatory variables are significant and the adjusted r-squared is .73 which is excellent. edf stands for “effective degrees of freedom”. This modified version of the degrees of freedom is due to the smoothing process in the model. GCV stands for generalized cross validation and this number is useful when comparing models. The model with the lowest number is the better model.
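To see how the GCV score relates to the other numbers in the summary, the sketch below rebuilds it from the reported scale estimate and effective degrees of freedom. It assumes the usual Gaussian form GCV = n × RSS / (n − edf)² together with scale = RSS / (n − edf), which reduces to scale × n / (n − edf). The variable names are our own, and small differences come from rounding in the printed output.

```r
n         <- 392       # sample size from the summary
scale_est <- 2.0351    # "Scale est." from the summary

# Total effective degrees of freedom: intercept plus the four smooth terms
edf_total <- 1 + 6.382 + 1.000 + 4.883 + 3.785

# GCV = n * RSS / (n - edf)^2, and RSS = scale_est * (n - edf)
gcv <- scale_est * n / (n - edf_total)
round(gcv, 4)
```

The result lines up with the GCV reported in the summary, which shows that the score penalizes a model for using more effective degrees of freedom.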

We can also examine the model visually by using the “plot” function. This will allow us to examine if the curvature fitted by the smoothing process was useful or not for each variable. Below is the code.

plot(model1)

We can also look at a 3d graph that shows the linear predictor against two of the predictors. This is done with the “vis.gam” function, which by default uses the first two suitable terms in the model. Below is the code

vis.gam(model1)

If multiple models are developed, you can compare the GCV values to determine which model is the best. In addition, another way to compare models is with the “AIC” function. In the code below, we will create an additional model that includes “year”, compare the GCV scores, and calculate the AIC. Below is the code.

model2<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight)+s(year),data=Auto)
summary(model2)
##
## Family: gaussian
##
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) +
##     s(year)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.54133    0.07203   215.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value
## s(mpg)          5.578  6.726  2.749  0.0106 *
## s(displacement) 2.251  2.870 13.757 3.5e-08 ***
## s(horsepower)   4.936  6.054 66.476 < 2e-16 ***
## s(weight)       3.444  4.397 34.441 < 2e-16 ***
## s(year)         1.682  2.096  0.543  0.6064
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.733   Deviance explained = 74.5%
## GCV = 2.1368  Scale est. = 2.0338    n = 392
#model1 GCV
model1$gcv.ubre
##   GCV.Cp
## 2.127589
#model2 GCV
model2$gcv.ubre
##   GCV.Cp
## 2.136797

As you can see, the second model has a higher GCV score when compared to the first model. This indicates that the first model is a better choice. This makes sense because in the second model the variable “year” is not significant. To confirm this we will calculate the AIC scores using the AIC function.

AIC(model1,model2)
##              df      AIC
## model1 18.04952 1409.640
## model2 19.89068 1411.156

Again, you can see that model1 is better due to its fewer degrees of freedom and slightly lower AIC score.

Conclusion

GAMs are most commonly used for exploring potential relationships in your data because they are difficult to interpret and summarize. Therefore, it is normally better to develop a generalized linear model over a GAM due to the difficulty in understanding what the data is trying to tell you when using GAMs.

# Proportion Test in R

Proportions are a fraction or “portion” of a total amount. For example, if there are ten men and ten women in a room the proportion of men in the room is 50% (10 / 20). There are times when doing an analysis that you want to evaluate proportions in your data rather than individual measurements of mean, correlation, standard deviation etc.

In this post we will learn how to do a test of proportions using R. We will use the dataset “Default” which is found in the “ISLR” package. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started.

library(ISLR)
data("Default")

We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function.

table(Default$student)
##
##   No  Yes
## 7056 2944

We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample. We can get this by summing the results from the table. Below is the code.

sum(table(Default$student))
## [1] 10000

There are 10000 people in the sample. To determine the proportion of students we take 2944 / 10000, which equals 0.2944 or 29.44%. Below is the code to calculate this

table(Default$student) / sum(table(Default$student))
##
##     No    Yes
## 0.7056 0.2944

The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is that 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so it is better to state specifically what our hypotheses are.

• NULL = The proportion of students in the population is 50%, so the 29.44% found in the sample differs from it only by chance
• ALTERNATIVE = The proportion of students in the population is NOT 50%

Below is the code to complete the z-test.

prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE)
##
##  1-sample proportions test without continuity correction
##
## data:  2944 out of 10000, null probability 0.5
## X-squared = 1690.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2855473 0.3034106
## sample estimates:
##      p
## 0.2944

Here is what the code means.

1. prop.test is the function used
2. The first value of 2944 is the total number of students in the sample
3. n = is the sample size
4. p = 0.5 is the theoretical proportion
5. alternative = “two.sided” means we want a two-tail test
6. correct = FALSE means we do not want a continuity correction applied to the z-test. The correction is useful for small sample sizes but not for our sample of 10000

The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proportion of students in our sample is different from a theoretical proportion of 50% in the population.
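The statistic reported by “prop.test” can be reproduced by hand, which may help show what the z-test is doing. The sketch below uses only the counts shown above (2944 students out of 10000) and the standard one-sample z formula. The variable names are our own and are not part of the original analysis.

```r
# Counts taken from the table above
students <- 2944
n <- 10000
p0 <- 0.5                      # theoretical proportion under the null

pHat <- students / n           # sample proportion
se <- sqrt(p0 * (1 - p0) / n)  # standard error assuming the null is true
z <- (pHat - p0) / se          # z statistic
pValue <- 2 * pnorm(-abs(z))   # two-sided p-value

z
z^2      # the square of z should agree with the X-squared value from prop.test
pValue
```

Squaring the z statistic gives the chi-squared statistic because, without the continuity correction, the two tests are equivalent for a single proportion.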

Below is the same analysis using the binomial exact test.

binom.test(2944, n = 10000, p = 0.5)
##
##  Exact binomial test
##
## data:  2944 and 10000
## number of successes = 2944, number of trials = 10000, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2854779 0.3034419
## sample estimates:
## probability of success
##                 0.2944

The results are essentially the same. Whether to use “prop.test” or “binom.test” is a matter of debate among statisticians. The purpose here was to provide an example of the use of both.
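With 10000 observations the two tests can hardly disagree. To get a feel for where the choice matters more, here is a small sketch with hypothetical counts (7 successes in 20 trials, which are made up and not taken from the “Default” data) that compares the z-test with and without the continuity correction against the exact test.

```r
# Hypothetical small sample: 7 "successes" out of 20 trials (not from the Default data)
x <- 7
n <- 20

zNoCorrection <- prop.test(x, n = n, p = 0.5, correct = FALSE)
zCorrection   <- prop.test(x, n = n, p = 0.5, correct = TRUE)
exact         <- binom.test(x, n = n, p = 0.5)

# Collect the three p-values side by side
pValues <- c(z = zNoCorrection$p.value,
             zCorrected = zCorrection$p.value,
             exact = exact$p.value)
pValues
```

In a sample this small the three p-values are no longer interchangeable, which is the situation the “correct” argument is meant for.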

# Theoretical Distribution and R

This post will explore an example of testing if a dataset fits a specific theoretical distribution. This is a very important aspect of statistical modeling as it allows us to understand the normality of the data and the steps needed to prepare it for analysis.

In our example, we will use the “Auto” dataset from the “ISLR” package. We will check if the horsepower of the cars in the dataset is normally distributed or not. Below is some initial code to begin the process.

library(ISLR)
library(nortest)
library(fBasics)
data("Auto")

Determining if a dataset is normally distributed is simple in R. This is normally done visually by making a Quantile-Quantile plot (Q-Q plot). It involves using two functions, “qqnorm” and “qqline”. Below is the code for the Q-Q plot.

qqnorm(Auto$horsepower)

We now need to add the Q-Q line to see how our distribution lines up with the theoretical normal one. Below is the code. Note that we have to repeat the code above in order to get the completed plot.

qqnorm(Auto$horsepower)
qqline(Auto$horsepower, distribution = qnorm, probs=c(.25,.75))

The “qqline” function needs the data you want to test as well as the distribution and probability. The distribution we wanted is normal and is indicated by the argument “qnorm”. The “probs” argument gives the two probabilities whose quantiles the line is drawn through. The default values are .25 and .75. The resulting graph indicates that “horsepower” in the “Auto” dataset is not normally distributed. There are particular problems with the lower and upper values.

We can confirm our suspicion by running a statistical test. The Anderson-Darling test from the “nortest” package will allow us to test whether our data is normally distributed or not. The code is below.

ad.test(Auto$horsepower)
##  Anderson-Darling normality test
##
## data:  Auto$horsepower
## A = 12.675, p-value < 2.2e-16

From the results, we can conclude that the data is not normally distributed. This could mean that we may need to use non-parametric tools for statistical analysis.

We can further explore our distribution in terms of its skew and kurtosis. Skew measures how far to the left or right the data leans and kurtosis measures how peaked or flat the data is. This is done with the “fBasics” package and the functions “skewness” and “kurtosis”.

First we will deal with skewness. Below is the code for calculating skewness.

horsepowerSkew<-skewness(Auto$horsepower)
horsepowerSkew
## [1] 1.079019
## attr(,"method")
## [1] "moment"

We now need to determine if this value of skewness is significantly different from zero. This is done with a simple t-test. We must calculate the t-value before calculating the probability. The t-value is the skew divided by its standard error, and the standard error of the skew is approximated as the square root of six divided by the sample size. The code is below.

stdErrorHorsepower<-horsepowerSkew/(sqrt(6/length(Auto$horsepower)))
stdErrorHorsepower
## [1] 8.721607
## attr(,"method")
## [1] "moment"

Now we take this t-value (stored, despite its name, in “stdErrorHorsepower”) and plug it into the “pt” function (t probability) with the degrees of freedom (sample size – 1 = 391). We then subtract the result from 1 to get the upper-tail probability. Below is the code.

1-pt(stdErrorHorsepower,391)
## [1] 0
## attr(,"method")
## [1] "moment"

The value zero means that we reject the null hypothesis that the skew is not significantly different from zero and conclude that the skew is different from zero. However, the value of the skew was only about 1.1, which is not extremely non-normal.

We will now repeat this process for the kurtosis. The only difference is that the standard error is the square root of 24 divided by the sample size, rather than 6 divided by the sample size, as in the example below.

horsepowerKurt<-kurtosis(Auto$horsepower)
horsepowerKurt
## [1] 0.6541069
## attr(,"method")
## [1] "excess"
stdErrorHorsepowerKurt<-horsepowerKurt/(sqrt(24/length(Auto$horsepower)))
stdErrorHorsepowerKurt
## [1] 2.643542
## attr(,"method")
## [1] "excess"
1-pt(stdErrorHorsepowerKurt,391)
## [1] 0.004267199
## attr(,"method")
## [1] "excess"

Again the p-value is small (about 0.004), which means that the kurtosis is significantly different from zero. With an excess kurtosis of 0.65 and a t-value of 2.64 this is not that bad. However, when both skew and kurtosis depart from normality it explains why our overall distribution was not normal either.

Conclusion

This post provided insights into assessing the normality of a dataset. Visual inspection can take place using Q-Q plots. Statistical inspection can be done through hypothesis testing along with checking skew and kurtosis.

# Probability Distribution and Graphs in R

In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.

At a busing company the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this, complete the following.

• Calculate the interval value to use using the 68-95-99.7 rule
• Calculate the density curve
• Graph the normal curve
• Evaluate the probability of a bus having less than 65 stops
• Evaluate the probability of a bus having more than 93 stops

Calculate the Interval Value

Our first step is to calculate the interval value. This is the range within which 99.7% of the values fall. Doing this requires knowing the mean and the standard deviation, and then subtracting three times the standard deviation from the mean and adding three times the standard deviation to it. Below is the code for this.

busStopMean<-81
busStopSD<-7.9
busStopMean+3*busStopSD
## [1] 104.7
busStopMean-3*busStopSD
## [1] 57.3

The values above mean that we can set our interval between 55 and 110 with 100 buses in the data.

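As a quick check on the 68-95-99.7 rule used above, the sketch below asks “pnorm” how much of a normal distribution with this mean and standard deviation falls within one, two, and three standard deviations of the mean. The names “k” and “within” are our own additions for illustration.

```r
busStopMean <- 81
busStopSD <- 7.9

# Probability of falling within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within <- pnorm(busStopMean + k * busStopSD, mean = busStopMean, sd = busStopSD) -
  pnorm(busStopMean - k * busStopSD, mean = busStopMean, sd = busStopSD)
round(within, 4)
```

The three values come out close to 68%, 95%, and 99.7%, which is where the rule gets its name.
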
Below is the code to set the interval.

interval<-seq(55,110, length=100) #length here represents 100 fictitious buses

Density Curve

The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.

densityCurve<-dnorm(interval,mean=81,sd=7.9)

We will now plot the normal curve of our data using ggplot. Before we do, we need to put our “interval” and “densityCurve” variables in a dataframe. We will call the dataframe “normal” and then we will create the plot. Below is the code.

library(ggplot2)
normal<-data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve))+geom_line()+ggtitle("Number of Stops for Buses")

Probability Calculation

We now want to determine the probability of a bus having less than 65 stops. To do this we use the “pnorm” function in R and include the value 65, along with the mean, standard deviation, and tell R we want the lower tail only. Below is the code for completing this.

pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
## [1] 0.02141744

As you can see, at about 2% it would be unusual for a bus to have fewer than 65 stops. We can also plot this using ggplot. First, we need to compute a cumulative curve using the “pnorm” function. We then combine this with our “interval” variable in a dataframe and use this information to make a plot in ggplot2. Below is the code.

CumulativeProb<-pnorm(interval, mean=81,sd=7.9,lower.tail = TRUE)
pnormal<-data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb))+geom_line()+ggtitle("Cumulative Density of Stops for Buses")

Second Probability Problem

We will now calculate the probability of a bus having 93 or more stops. To make it more interesting we will create a plot that shades the area under the curve for 93 or more stops. The code is a little too complex to explain so just enjoy the visual.

pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
## [1] 0.06438284
x<-interval
ytop<-dnorm(93,81,7.9)
MyDF<-data.frame(x=x,y=densityCurve)
p<-ggplot(MyDF,aes(x,y))+geom_line()+scale_x_continuous(limits = c(50, 110))+
ggtitle("Probability of 93 Stops or More is 6.4%")
shade <- rbind(c(93,0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "x"], 0)) #the column name is lowercase "x"
p + geom_segment(aes(x=93,y=0,xend=93,yend=ytop)) + geom_polygon(data = shade, aes(x, y))

Conclusion

A lot of work was done but all in a practical manner. Looking at a realistic problem, we were able to calculate several different probabilities and graph them accordingly.

# Using Maps in ggplot2

It seems as though there are no limits to what can be done with ggplot2. Another example of this is the use of maps in presenting data. If you are trying to share information that depends on location then this is an important feature to understand.

This post will provide some basic explanation for understanding how to use maps with ggplot2.

The Maps Package

One of several packages available for using maps with ggplot2 is the “maps” package. This package contains a limited number of maps along with several databases that contain information that can be used to create data-filled maps.

The “maps” package cooperates with ggplot2 through the use of the “borders” function and by mapping longitude and latitude in the “aes” function. After you have installed the “maps” package you can run the example code below.

library(ggplot2);library(maps)
ggplot(us.cities,aes(long,lat))+geom_point()+borders("state")

In the code above we told R to use the data from “us.cities” which comes with the “maps” package. We then told R to graph the latitude and longitude and to do this by placing a point for each city. Lastly, the “borders” function was used to place this information on the state map of the US.

There are several points way off of the map. These represent data points for cities in Alaska and Hawaii.

Below is an example that is limited to one state in America. To do this we first must subset the data to only include one state.

tx_cities<-subset(us.cities,country.etc=="TX")
ggplot(tx_cities,aes(long,lat))+geom_point()+borders(database = "state",regions = "texas")

The map shows all the cities in the state of Texas that are pulled from the “us.cities” dataset.

We can also play with the colors of the maps just like any other ggplot2 output. Below is an example.

data("world.cities")
Thai_cities<-subset(world.cities, country.etc=="Thailand")
ggplot(Thai_cities,aes(long,lat))+borders("world","Thailand", fill="light blue",col="dark blue")+geom_point(aes(size=pop),col="dark red")

In the example above, we took all of the cities in Thailand and saved them into the variable “Thai_cities”. We then made a plot of Thailand but we played with the color and fill features. Lastly, we plotted the cities by location and we indicated that the size of each data point should depend on the population. In this example, all the data points were the same size which means that all the cities in Thailand in the dataset are about the same size.

We can also add text to maps. In the example below, we will use a subset of the data from Thailand and add the names of cities to the map.

Big_Thai_cities<-subset(Thai_cities, pop>100000)
ggplot(Big_Thai_cities,aes(long,lat))+borders("world","Thailand", fill="light blue",col="dark blue")+geom_point(aes(size=pop),col="dark red")+geom_text(aes(long,lat,label=name),hjust=-.2,size=3)

In this plot there is a messy part in the middle where Bangkok is, along with several other large cities. However, you can see the flexibility in the plot by adding the “geom_text” function which has been discussed previously. In the “geom_text” function we added some aesthetics as well as the “name” of the city as the label.

Conclusion

In this post, we looked at some of the basic ways of using maps with ggplot2. There are many more ways and features that can be explored in future posts.

# Axis and Title Modifications in ggplot2

This post will provide explanation on how to customize the axis and title of a plot that utilizes ggplot2. We will use the “Computers” dataset from the “Ecdat” package, looking specifically at the difference in price of computers based on the inclusion of a cd-rom. Below is some code needed to be prepared for the examples along with a printout of our initial boxplot.

library(ggplot2);library(grid);library("Ecdat")
data("Computers")
theBoxplot<-ggplot(Computers,aes(cd, price, fill=cd))+geom_boxplot()
theBoxplot

In the example below, we change the color of the tick mark labels to purple and we bold them. This all involves the use of the “axis.text” argument in the “theme” function.

theBoxplot + theme(axis.text=element_text(color="purple",face="bold"))

In the example below, the y label “price” is rotated 90 degrees so that it reads horizontally, in line with the other text. This is accomplished using the “axis.title.y” argument along with additional code.

theBoxplot+theme(axis.title.y=element_text(size=rel(1.5),angle = 0))

Below is an example that includes a title with a change to the default size and color.

theBoxplot+labs(title="The Example")+theme(plot.title=element_text(size=rel(1.5),color="orange"))

You can also remove the axis label. In the example below, we remove the x axis label along with its text and tick marks.

theBoxplot+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.title.x=element_blank())

It is also possible to modify the plot background as well. In the example below, we change the background color to blue, the color of the major grid lines to green, and the color of the minor grid lines to yellow. This is not an attractive plot but it does provide an example of the various options available in ggplot2.

theBoxplot+theme(panel.background=element_rect(fill="blue"), panel.grid.major=element_line(color="green", size = 3),panel.grid.minor=element_line(color="yellow",linetype="solid",size=2))

All of the tricks we have discussed so far can also apply when faceting data.

Below we make a scatterplot using the same background as before but comparing trend and price.

theScatter<-ggplot(Computers,aes(trend, price, color=cd))+geom_point()
theScatter1<-theScatter+facet_grid(.~cd)+theme(panel.background=element_rect(fill="blue"), panel.grid.major=element_line(color="green", size = 3),panel.grid.minor=element_line(color="yellow",linetype="solid",size=2))
theScatter1

Right now the plots are too close to each other. We can account for this by modifying the panel margins. Note that more recent releases of ggplot2 call this argument “panel.spacing”.

theScatter1 +theme(panel.margin=unit(2,"cm"))

Conclusion

These examples provide further evidence of the endless variety that is available when using ggplot2. Whatever your purposes are, it is highly probable that ggplot2 has some sort of data visualization answer.

# Modifying Legends in ggplot2

This post will provide information on fine tuning the legend of a graph using ggplot2. We will be using the “Wage” dataset from the “ISLR” package. Below is some initial code that is needed to complete the examples. The initial plot is saved as a variable to save time and avoid repeating the same code.

library(ggplot2);library(ISLR); library(grid)
myBoxplot<-ggplot(Wage, aes(education, wage,fill=education))+geom_boxplot()
myBoxplot

The default ggplot has a grey background with grey text. By adding the “theme_bw” function to a plot you can create a plot that has a white background with black text. The code is below.

myBoxplot+theme_bw()

If you desire, you can also add a rectangle around the legend with the “legend.background” argument. You can even specify the color of the rectangle as shown below.

myBoxplot+theme(legend.background=element_rect(color="blue"))

It is also possible to add a highlighting color to the keys in the legend. In the code below we highlight the keys with the color red using the “legend.key” argument.

myBoxplot+theme(legend.key=element_rect(fill="red"))

The code below provides an example of how to change the amount of space around the legend.

myBoxplot+theme(legend.margin= unit(2, "cm"))

This example demonstrates how to modify the text in a legend. This requires the use of “legend.text”, along with several other arguments and functions. The code below does the following.

• Size 15 font
• Dark red font color
• Text at 35 degree angle
• Italic font

myBoxplot + theme(legend.text = element_text(size = 15,color="dark red",angle= 35, face="italic"))

Lastly, you can even move the legend around the plot. The first example moves the legend to the top of the plot using the “legend.position” argument. The second example moves the legend based on numerical input. The first number moves the legend from left to right, with 0 being the left and 1 being all the way to the right. The second number moves the legend from bottom to top, with 0 being the bottom and 1 being the top.

myBoxplot+theme(legend.position="top")
myBoxplot+theme(legend.position=c(.6,.7))

Conclusion

The examples provided here show how much control over plots is possible when using ggplot2. In many ways this is just an introduction to the nuanced control that is available.

# Axis and Labels in ggplot2

In this post, we will look at how to manipulate the labels and positioning of the data when using ggplot2. We will use the “Wage” data from the “ISLR” package. Below is initial code needed to begin.

library(ggplot2);library(ISLR)
data("Wage")

Manipulating Labels

Our first example involves adding labels for the x and y axes as well as a title. To do this we will create a histogram of the wage variable and save it as a variable in R. Saving the histogram as a variable saves time, as we do not have to recreate all of the code but only add the additional information. After creating the histogram and saving it to a variable we will add the code for creating the labels.

Below is the code.

myHistogram<-ggplot(Wage, aes(wage, fill=..count..))+geom_histogram()
myHistogram+labs(title="This is My Histogram", x="Salary as a Wage", y="Number")

By using the “labs” function you can add a title and information for the x and y axis. If your title is really long you can use the code “\n” to break the information into separate lines as shown below.

myHistogram+labs(title="This is the Longest Title for a Histogram \n that I have ever Seen in My Entire Life", x="Salary as a Wage", y="Number")

Discrete Axis Scale

We will now turn our attention to working with discrete scales. Discrete scales deal with categorical data, as found in boxplots and bar charts. First, we will store a boxplot of the wages subsetted by level of education in a variable and we will display it.

myBoxplot<-ggplot(Wage, aes(education, wage,fill=education))+geom_boxplot()
myBoxplot

Now, by using the “scale_x_discrete” function along with the “limits” argument we are able to change the order of the groups as shown below.

myBoxplot+scale_x_discrete(limits=c("5. Advanced Degree","2. HS Grad","1. < HS Grad","4. College Grad","3. Some College"))

Continuous Scale

The most common modification to a continuous scale is to modify the range. In the code below, we change the default range of “myBoxplot” to something that is larger.

myBoxplot+scale_y_continuous(limits=c(0,400))

Conclusion

This post provided some basic insights into modifying plots using ggplot2.

# Pie Charts and More Using ggplot2

This post will explain several types of visuals that can be developed using ggplot2. In particular, we are going to make three specific types of charts and they are…

• Pie chart
• Bullseye chart
• Coxcomb diagram

To complete this task, we will use the “Wage” dataset from the “ISLR” package. We will be using the “education” variable which has five factors in it. Below is the initial code to get started.

library(ggplot2);library(ISLR)
data("Wage")

Pie Chart

In order to make a pie chart, we first need to make a bar chart and add several pieces of code to change it into a pie chart. Below is the code for making a regular bar plot.

ggplot(Wage, aes(education, fill=education))+geom_bar()

We will now modify two parts of the code. First, we do not want separate bars. Instead we want one bar. The reason is that we only want one pie chart, so before that we need one bar. Therefore, for the x value in the “aes” function we will use the argument “factor(1)” which tells R to force the data as one factor on the chart, thus making one bar. We also need to add “width=1” inside the “geom_bar” function. This helps with spacing. Below is the code for this.

ggplot(Wage, aes(factor(1), fill=education))+geom_bar(width=1)

To make the pie chart, we need to add the “coord_polar” function to the code which adjusts the mapping. We will include the argument “theta="y"” which tells R that the size of the slice a factor gets depends on the number of people in that factor. Below is the code for the pie chart.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=1)+coord_polar(theta="y")

By changing the “width” argument you can place a circle in the middle of the chart as shown below.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=.5)+coord_polar(theta="y")

Bullseye Chart

A bullseye chart is a pie chart that shares the information in a concentric way. The coding is mostly the same except that you remove the “theta” argument from the “coord_polar” function. The thicker the circle the more respondents within it. Below is the code.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=1)+coord_polar()

Coxcomb Diagram

The Coxcomb Diagram is similar to the pie chart but the data is not normalized to fit the entire area of the circle.

To make this plot we have to modify the code by removing the “factor(1)” argument and replacing it with the name of the variable, and by re-adding the “coord_polar” function. Below is the code.

ggplot(Wage, aes(education, fill=education))+
geom_bar(width=1)+coord_polar()

Conclusion

These are just some of the many forms of visualizations available using ggplot2. Which to use depends on many factors from personal preference to the needs of the audience.

# Adding text and Lines to Plots in R

There are times when a researcher may want to add annotated information to a plot. Examples of annotation include text and/or different lines to clarify information. In this post we will learn how to add lines and text to a plot. For the lines, we are speaking of lines that are added manually and not through some sort of statistical transformation such as regression or smoothing.

In order to do this we will use the “Caschool” data set from the “Ecdat” package and will make several histograms that will display test scores. Below is initial coding information that is needed.

library(ggplot2);library(Ecdat)
data("Caschool")

There are three lines that can be added manually using ggplot2. They are…

• geom_vline = vertical line
• geom_hline = horizontal line
• geom_abline = slope/intercept line

In the code below, we are going to make a histogram of the test scores in the “Caschool” dataset. We are also going to add a vertical yellow line that is set where the median is. Below is the code.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")

By adding aesthetic information to the “geom_vline” function we add the line depicting the median. We will now use the same code but add a horizontal line. Below is the code.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")+
geom_hline(aes(yintercept=15), color="blue")

The horizontal line we added was at the arbitrary point of 15 on the y axis. We could have set it anywhere we wanted by specifying a value for the y-intercept.

In the next histogram we are going to add text to the graph. Text provides further explanation about what is happening in the plot. We are going to use the same code as before but we are going to provide additional information about the yellow median line. We are going to explain that the yellow line is the median and we will provide the value of the median.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")+
geom_hline(aes(yintercept=15), color="blue")+
geom_text(aes(x=median(Caschool$testscr),
y=30),label="Median",hjust=1, size=9)+
geom_text(aes(x=median(Caschool$testscr),
y=30,label=round(median(testscr),digits=0)),hjust=-0.5, size=9)

Most of the code above is review but we did add the “geom_text” function. Here is what’s happening. Inside the function we need to add aesthetic information. We indicate that the label “Median” should be placed at the median of the test scores for the x value and at the arbitrary point of 30 for the y value. We also offset the placement by using the “hjust” argument.

For the second label we calculate the actual median and have it rounded so that the decimal digits are removed. This result is also offset slightly. Lastly, for both labels we set the text size to 9 to make it easier to read.

Our next example involves annotating. Using ggplot2 we can actually highlight a specific area of the histogram. In the example below we highlight the middle two quartiles.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")+
geom_hline(aes(yintercept=15), color="blue")+
geom_text(aes(x=median(Caschool$testscr),y=30),
label="Median",hjust=1, size=9)+
geom_text(aes(x=median(Caschool$testscr),y=30,
label=round(median(testscr),digits=0)),hjust=-0.5, size=9)+
annotate("rect",xmin=quantile(Caschool$testscr, probs = 0.25),
xmax = quantile(Caschool$testscr, probs=0.75),ymin=0, ymax=45, alpha=.2, fill="red")

The information inside the “annotate” function includes the “rect” argument which indicates that the added annotation is a rectangle. Next, we indicate that we want the xmin value to be the 25% quartile and the xmax to be the 75% quartile. We also indicate the values for the y axis as well as some transparency with the “alpha” argument, along with the color of the annotated area, which is red.

Our final example involves the use of facets. We are going to split the data by school district type and show how you can add lines to one plot while not adding lines to a different plot. The second plot will include a line based on the median while the first plot will not.

ggplot(Caschool,aes(testscr, fill=grspan))+geom_histogram()+
geom_vline(data=subset(Caschool, grspan=="KK-08"), aes(xintercept=median(testscr)), color="yellow")+
geom_text(data=subset(Caschool, grspan=="KK-08"),
aes(x=median(Caschool$testscr),y=35), label=round(median(Caschool$testscr), digits=0), hjust=-0.2, size=9)+
geom_text(data=subset(Caschool,grspan=="KK-08"),
aes(x=median(Caschool$testscr), y=35),label="Median",
hjust=1,size=9)+facet_grid(.~grspan)

Conclusion

Adding lines and text and understanding how to annotate provides additional tools for those who need to communicate data in a visual way.

# Histograms and Colors with ggplot2

In this post, we will look at how ggplot2 is able to create variables for the purpose of providing aesthetic information for a histogram. Specifically, we will look at how ggplot2 calculates the bin sizes and then assigns colors to each bin depending on the count or density of that particular bin.

To do this we will use a dataset called “Star” from the “Ecdat” package. From the dataset, we will look at total math score and make several different histograms. Below is the initial code you need to begin.

library(ggplot2);library(Ecdat)
data(Star)

We will now create our initial histogram. What is new in the code below is the “..count..” for the “fill” argument. This information tells R to fill the bins based on their count, or the number of data points that fall in each bin. By doing this, we get a gradation of colors. With the default color scale, lighter colors indicate more data points and darker colors indicate fewer data points. The code is as follows.

ggplot(Star, aes(tmathssk, fill=..count..))+geom_histogram()

As you can see, we have a nice histogram that uses color to indicate how common data in a specific bin is. We can also make a histogram that has a line that indicates the density of the data using a kernel function. This is similar to adding a LOESS line on a plot. The code is below.

ggplot(Star, aes(tmathssk)) + geom_histogram(aes(y=..density.., fill=..density..))+geom_density()

The code is mostly the same but we moved the “fill” argument inside “geom_histogram” function and added a second “aes” function. We also included a y argument inside the second “aes” function. Instead of using the “..count..” information we used “..density..” as this is needed to create the line. Lastly, we added the “geom_density” function.
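Recent versions of ggplot2 (3.4.0 and later, as far as we are aware) mark the “..density..” notation as deprecated in favor of “after_stat()”. The sketch below shows the same idea in the newer notation. It uses simulated scores rather than the “Star” data so that it runs without “Ecdat”, and the data frame and variable names are made up for illustration.

```r
library(ggplot2)

# Simulated stand-in for a test score variable (not the Star data)
set.seed(1)
scores <- data.frame(score = rnorm(500, mean = 485, sd = 48))

# Same layering as above: histogram scaled to density, then a kernel density line
p <- ggplot(scores, aes(score)) +
  geom_histogram(aes(y = after_stat(density), fill = after_stat(density)), bins = 30) +
  geom_density()
p
```

Because the histogram is scaled to density, the bar areas sum to one, which is what lets the density line sit on the same y axis.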

The chart below uses the “alpha” argument to add transparency to the histogram. This allows us to communicate additional information. In the histogram below we can see information about gender and how common a particular gender and bin are in the data.

ggplot(Star, aes(tmathssk, col=sex, fill=sex, alpha=..count..))+geom_histogram()

Conclusion

What we have learned in this post is some of the basic features of ggplot2 for creating various histograms. Through the use of colors a researcher is able to display useful information in an interesting way.

# Linear Regression Lines and Facets in ggplot2

In this post, we will look at how to add a regression line to a plot using the “ggplot2” package. This is mostly a review of what we learned in the post on adding a LOESS line to a plot. The main difference is that a regression line is a straight line that represents the relationship between the x and y variable while a LOESS line is used mostly to identify trends in the data.

One new wrinkle we will add to this discussion is the use of faceting when developing plots. Faceting is the development of multiple plots simultaneously with each sharing different information about the data.

The data we will use is the “Housing” dataset from the “Ecdat” package. We will examine how lotsize affects housing price when also considering whether the house has central air conditioning or not. Below is the initial code in order to be prepared for analysis.

library(ggplot2);library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
##     Orange
data("Housing")

The first plot we will make is the basic plot of lotsize and price with the data being distinguished by having central air or not, without a regression line. The code is as follows

ggplot(data=Housing, aes(x=lotsize, y=price, col=airco))+geom_point()

We will now add the regression line to the plot. We will make a new plot with an additional piece of code.

ggplot(data=Housing, aes(x=lotsize, y=price, col=airco))+geom_point()+stat_smooth(method='lm')

As you can see we get two lines for the two conditions of the data. If we want to see the overall regression line we use the code that is shown below.

ggplot()+geom_point(data=Housing, aes(x=lotsize, y=price, col=airco))+stat_smooth(data=Housing, aes(x=lotsize, y=price ),method='lm')

We will now experiment with a technique called faceting. Faceting allows you to split the data by various subgroups and display the result via plot simultaneously. For example, below is the code for splitting the data by central air for examining the relationship between lot size and price.

ggplot(data=Housing, aes(lotsize, price, col=airco))+geom_point()+stat_smooth(method='lm')+facet_grid(.~airco)

By adding the “facet_grid” function we can subset the data by the categorical variable “airco”.

In the code below we have three plots. The first two show the relationship between lotsize and price based on central air and the last plot shows the overall relationship.

ggplot(data=Housing, aes(lotsize, price, col=airco))+geom_point()+stat_smooth(method="lm")+facet_grid(.~airco, margins = TRUE)

By adding the argument “margins” and setting it to true we are able to add the third plot that shows the overall results.

So far all of our faceted plots have used the same statistical transformation, a regression line. However, we can actually mix the types of transformation that are applied when faceting the results. This is shown below.

ggplot(data=Housing, aes(lotsize, price, col=airco))+geom_point()+stat_smooth(data=subset(Housing, airco=="yes"))+stat_smooth(data=subset(Housing, airco=="no"), method="lm")+facet_grid(.~airco)

In the code we needed to use two calls to “stat_smooth” and indicate the subset of data to transform inside each call. The plot to the left is a regression line for houses without central air and the plot to the right is a LOESS line for houses that have central air.

Conclusion

In this post we explored the use of regression lines and advanced faceting techniques. Communicating data with ggplot2 is one of many ways in which a data analyst can portray valuable information.

# Adding LOESS Lines to Plots in R

A common goal of statistics is to try to identify trends in the data as well as to predict what may happen. Both of these goals can be partially achieved through the development of graphs and/or charts.

In this post, we will look at adding a smooth line to a scatterplot using the “ggplot2” package.

To accomplish this, we will use the “Carseats” dataset from the “ISLR” package. We will explore the relationship between the price of carseats and actual sales, along with whether the carseat was purchased in an urban location or not. Below is some initial code to prepare for the analysis.

library(ggplot2);library(ISLR)
data("Carseats")

We are going to use a layering approach in this example. This means we will add one piece of code at a time until we have the complete plot. We are now going to plot the initial scatterplot. We simply want a scatterplot depicting the relationship between Price and Sales of carseats.

ggplot(data=Carseats, aes(x=Price, y=Sales, col=Urban))+geom_point()

The general trend appears to be negative. As price increases, sales decrease regardless of whether the carseat was purchased in an urban setting or not.

We will now add a LOESS line to the graph. LOESS stands for “locally estimated scatterplot smoothing”, a form of locally weighted smoothing that is commonly used in regression analysis. The addition of a LOESS line makes it much easier to identify trends visually. Below is the code.

ggplot(data=Carseats, aes(x=Price, y=Sales, col=Urban))+geom_point()+
stat_smooth()

Unlike a regression line, which is strictly straight, a LOESS line curves with the data. As you look at the graph, the LOESS lines are mostly straight, with curves at the extremes and a small rise and fall in the middle for carseats purchased in urban areas.
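The contrast between a straight regression line and a LOESS curve can also be checked numerically. The sketch below is only an illustration: it uses base R's “lm” and “loess” functions on the built-in “cars” dataset (a stand-in for “Carseats”, chosen so no extra package is needed) and compares the slope of each fitted line along a grid of x values. The names “x_grid”, “slope_lm”, and “slope_lo” are hypothetical.

```r
# Fit a straight line and a LOESS curve to the same built-in data
fit_lm <- lm(dist ~ speed, data = cars)
fit_lo <- loess(dist ~ speed, data = cars)

# Evaluate both fits on an evenly spaced grid inside the observed range
x_grid  <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))
pred_lm <- predict(fit_lm, newdata = x_grid)
pred_lo <- predict(fit_lo, newdata = x_grid)

# Slope between neighbouring grid points: constant for lm, varying for loess
slope_lm <- diff(pred_lm) / diff(x_grid$speed)
slope_lo <- diff(pred_lo) / diff(x_grid$speed)

range(slope_lm)
range(slope_lo)
```

The regression slopes should all be the same value, while the LOESS slopes cover a range, which is what “curving with the data” means in practice.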

So far we have created LOESS lines by the categorical variable Urban. We can actually make a graph with three LOESS lines. One for Yes urban, another for No Urban, and a last one that is an overall line that does not take into account the Urban variable. Below is the code.

ggplot()+ geom_point(data=Carseats, aes(x=Price, y=Sales, col=Urban))+ stat_smooth(data=Carseats, aes(x=Price, y=Sales))+stat_smooth(data=Carseats, aes(x=Price, y=Sales, col=Urban))

Notice that the code is slightly different, with the information being mostly outside of the “ggplot” function. You can barely see the third line in the graph, but if you look closely you will see a new blue line that was not there previously. This is the overall trend line. If you want, you can see only the overall trend line with the code below.

ggplot()+ geom_point(data=Carseats, aes(x=Price, y=Sales, col=Urban))+ stat_smooth(data=Carseats, aes(x=Price, y=Sales))

The very first graph we generated in this post only contained points. This is because we used the “geom_point” function. Any of the graphs we created could be generated without points by removing the “geom_point” function and only using the “stat_smooth” function as shown below.

ggplot(data=Carseats, aes(x=Price, y=Sales, col=Urban))+
stat_smooth()

Conclusion

This post provided an introduction to adding LOESS lines to a graph using ggplot2. For presenting data in a visually appealing way, adding lines can help in identifying key characteristics in the data.

# Intro to the Grammar of Graphics

In developing graphs, there are certain core principles that need to be addressed in order to provide a graph that communicates meaning clearly to an audience. Many of these core principles are addressed in the book “The Grammar of Graphics” by Leland Wilkinson.

The concepts of Wilkinson’s book were used to create the “ggplot2” R package by Hadley Wickham. This post will explain some of the core principles needed in developing high-quality visualizations. In particular we will look at the following.

• Aesthetic attributes
• Geometric objects
• Statistical transformations
• Scales
• Coordinates
• Faceting

One important point to mention is that when using ggplot2 not all of these concepts have to be addressed in the code as R will auto-select certain features if you do not specify them.

Aesthetic Attributes and Geometric Objects

Aesthetic attributes are about how the data is perceived. This generally involves arguments in the “ggplot” function relating to the x/y coordinates as well as the actual data that is being used. Aesthetic attributes are mandatory information for making a graph.

Geometric objects determine what type of plot is generated. There are many different examples such as bar, point, boxplot, and histogram.

To use the “ggplot” function you must provide the aesthetic and geometric object information to generate a plot. Below is code containing only this information.

library(ggplot2)
ggplot(Orange, aes(circumference))+geom_histogram()
## stat_bin() using bins = 30. Pick better value with binwidth.

The code is broken down as follows: ggplot(data, aesthetic attribute(x-axis data at least)) + geometric object()

Statistical Transformation

Statistical transformation involves combining the data in one way or another to get a general sense of the data. Examples of statistical transformation include adding a smooth line, a regression line, or even binning the data for histograms. This feature is optional but can provide additional explanation of the data.

Below we are going to look at two variables on one plot. For this we will need a different geometric object, as we will use points instead of a histogram. We will also use a statistical transformation. In particular, the statistical transformation is a regression line. The code is as follows.

ggplot(Orange, aes(circumference, age))+geom_point()+stat_smooth(method="lm")

The code is broken down as follows: ggplot(data, aesthetic attribute(x-axis data at least)) + geometric object() + statistical transformation(type of transformation)
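The layering described above can be inspected directly, since each “+” adds a layer to the plot object. The following is a small sketch using the same “Orange” data; the variable names “layers_before” and “layers_after” are only illustrative, and reading the “layers” element of the plot object is used here as a convenient way to peek inside rather than as a recommended workflow.

```r
library(ggplot2)

# Data and aesthetic mapping only: nothing is drawn yet
p <- ggplot(Orange, aes(circumference, age))
layers_before <- length(p$layers)

# Add a geometric object and a statistical transformation as separate layers
p <- p + geom_point() + stat_smooth(method = "lm")
layers_after <- length(p$layers)

c(before = layers_before, after = layers_after)
```

The count should go from no layers to two, one for the points and one for the regression line.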

Scales

Scales is a rather complicated feature. For simplicity, scales have to do with labelling the title and the x and y axes, creating a legend, as well as the coloring of data points. The use of this feature is optional.

Below is a simple example using the “labs” function in the plot we developed in the previous example.

ggplot(Orange, aes(circumference,age))+geom_point()+stat_smooth(method="lm") +  labs(title="Example Plot", x="circumference of the tree", y="age of the tree")

The plot now has a title and clearly labelled x and y axes.

Coordinates

Coordinates is another complex feature. This feature allows for the adjusting of the mapping of the data. Two common mapping features are cartesian and polar. Cartesian is commonly used for plots in 2D while polar is often used for pie charts.

In the example below, we will use the same data but this time use a polar mapping approach. The plot doesn’t make much sense but is really just an example of using this feature. This feature is also optional.

ggplot(Orange, aes(circumference, age))+geom_point()+stat_smooth(method="lm")+labs(title="Example Plot",x="circumference of the tree", y="age of the tree")+coord_polar()

The last feature is faceting. Faceting allows you to group data in subsets. This allows you to look at your data from the perspective of various subgroups in the sample population.

In the example below, we will look at the relationship between circumference and age by tree type.

ggplot(Orange, aes(circumference, age))+geom_point()+stat_smooth(method="lm")+labs(title="Example Plot",x="circumference of the tree", y="age of the tree")+facet_grid(Tree~.)

Now we can see the relationship between the two variables based on the type of tree. One important thing to note about the “facet_grid” function is where the “.” sits in the formula. If the categorical variable comes before the tilde, as in “Tree~.”, the charts will be stacked on top of each other as in the previous example.

However, if the categorical variable comes after the tilde, as in “.~Tree”, the plots will be placed next to each other as in the example below.

ggplot(Orange, aes(circumference, age))+geom_point()+stat_smooth(method="lm")+labs(title="Example Plot",x="circumference of the tree", y="age of the tree")+facet_grid(.~Tree)
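If the formula notation is hard to remember, “facet_grid” also accepts explicit “rows” and “cols” arguments built with “vars”. The sketch below shows what should be the same two layouts written that way; the object names “base_plot”, “stacked”, and “side_by_side” are made up for this illustration.

```r
library(ggplot2)

base_plot <- ggplot(Orange, aes(circumference, age)) + geom_point()

# Panels stacked on top of each other, like facet_grid(Tree ~ .)
stacked <- base_plot + facet_grid(rows = vars(Tree))

# Panels placed next to each other, like facet_grid(. ~ Tree)
side_by_side <- base_plot + facet_grid(cols = vars(Tree))
```

Printing “stacked” or “side_by_side” draws the corresponding plot.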

Conclusion

This post provided an introduction to the grammar of graphics. Appreciating the art of data visualization requires understanding how the different principles of graphics work together to communicate information in a visual manner to an audience.

# Using Qplots for Graphs in R

In this post, we will explore the use of the “qplot” function from the “ggplot2” package. One of the major advantages of “ggplot2” when compared to the base graphics package in R is that the “ggplot2” plots are generally considered more visually appealing. This will make more sense when we explore the grammar of graphics. For now we will just make plots to get used to using the “qplot” function.

We are going to use the “Carseats” dataset from the “ISLR” package in the examples. This dataset has data about the purchase of carseats for babies. Below is the initial code you will need to make the various plots.

library(ggplot2);library(ISLR)
data("Carseats")

In the first scatterplot, we are going to compare the price of a carseat with the volume of sales. Below is the code.

qplot(Price, Sales,data=Carseats)

Most of this coding format should be familiar. “Price” is the x variable, “Sales” is the y variable, and the data used is “Carseats”. From the plot, we can see that as the price of the carseat increases there is normally a decline in the number of sales.

For our next plot, we will compare sales based on shelf location. This requires the use of a boxplot. Below is the code

qplot(ShelveLoc, Sales, data=Carseats, geom="boxplot")

The new argument in the code is the “geom” argument. This argument indicates what type of plot is drawn.

The boxplot appears to indicate that a “good” shelf location has the best sales. However, this would need to be confirmed with a statistical test.
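Recent ggplot2 releases mark “qplot” as deprecated, so it can help to know how the same kinds of plot are written with the full “ggplot” syntax. The sketch below is an assumption-laden illustration rather than a translation of the exact plots in this post: it uses the built-in “mtcars” data as a stand-in for “Carseats”, and the object names are hypothetical.

```r
library(ggplot2)

# Stand-in data: treat the number of cylinders as a categorical variable
df <- transform(mtcars, cyl = factor(cyl))

# qplot(x, y, data = ..., geom = "boxplot") corresponds roughly to:
box_plot <- ggplot(df, aes(x = cyl, y = mpg)) + geom_boxplot()

# qplot(x, data = ..., geom = "bar") corresponds roughly to:
bar_plot <- ggplot(df, aes(x = cyl)) + geom_bar()

# qplot(x, data = ..., geom = "histogram") corresponds roughly to:
hist_plot <- ggplot(df, aes(x = mpg)) + geom_histogram(bins = 10)
```

Each object can be printed to draw the plot. The “geom” argument of “qplot” simply becomes a “geom_” function added with “+”.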

Perhaps you are wondering how many of the carseats were in the bad, medium, and good shelf locations. To find out, we will make a barplot as shown in the code below.

qplot(ShelveLoc, data=Carseats, geom="bar")

The most common location was medium, with bad and good being almost equal.

Lastly, we will now create a histogram using the “qplot” function. We want to see the distribution of “Sales”. Below is the code

qplot(Sales, data=Carseats, geom="histogram")
## stat_bin() using bins = 30. Pick better value with binwidth.

The distribution appears to be normal, but again, knowing for certain requires a statistical test. For one last trick, we will add the median to the plot by using the following code.

qplot(Sales, data=Carseats, geom="histogram") + geom_vline(xintercept = median(Carseats$Sales), colour="blue")
## stat_bin() using bins = 30. Pick better value with binwidth.

To add the median, all we needed to do was add an additional layer called “geom_vline”, which adds a vertical line to a plot. Inside this function we had to indicate where to draw the line by supplying the median of “Sales” from the “Carseats” dataset.

Conclusion

This post provided an introduction to the use of the “qplot” function in the “ggplot2” package. Understanding the basics of “qplot” is beneficial in providing visually appealing graphics.

# Making Graphics in R

Data visualization is a critical component of communicating results to an audience. Fortunately, R provides many different ways to present numerical data in a clear way visually. This post will look specifically at making data visualizations with the base R package “graphics”.

Generally, functions available in the “graphics” package can be either high-level functions or low-level functions. High-level functions actually make the plot. Examples of high-level functions include “hist” (histogram), “boxplot” (boxplot), and “barplot” (bar plot).

Low-level functions are used to add additional information to a plot. Some commonly used low-level functions include “legend” (add legend) and “text” (add text). When coding, we always call high-level functions before low-level functions, because a low-level function needs an existing plot to draw on and R raises an error otherwise.

We are going to begin with a simple graph. We are going to use the “Caschool” dataset from the “Ecdat” package. For now, we are going to plot the average number of computers per student against the average expenditure per student. Keep in mind that we are only plotting the data, so we are only using a high-level function (plot). Below is the code.

library(Ecdat)
data("Caschool")
plot(compstu~expnstu, data=Caschool)

The plot is not “beautiful” but it is a start in plotting data. Next, we are going to add a low-level function to our code. In particular, we will add a regression line to try to see the direction of the relationship between the two variables via a straight line. In addition, we will use the “loess.smooth” function. This function will allow us to see the general shape of the data. The regression line is green and the loess smooth line is blue. The coding is mostly familiar, but the “lwd” argument allows us to make the line thicker.

plot(compstu~expnstu, data=Caschool)
abline(lm(compstu~expnstu, data=Caschool), col="green", lwd=5)
lines(loess.smooth(Caschool$expnstu, Caschool$compstu), col="blue", lwd=5)

Boxplots allow you to show data that has been subsetted in some way. This allows for the comparison of groups. In addition, one or more boxplots can be used to identify outliers. In the plot below, the student-to-teacher ratios of the K-6 and K-8 grade spans are displayed.

boxplot(str~grspan, data=Caschool)

As you look at the data you can see there is very little difference. However, one major difference is that the K-8 group has many more extreme values than K-6.

Histograms are an excellent way to display information about one continuous variable. In the plot below, we can see the spread of the expenditure per student.

hist(Caschool$expnstu)

We will now add the median to the plot by calling the low-level function “abline”. Below is the code.

hist(Caschool$expnstu)
abline(v=median(Caschool$expnstu), col="green", lwd=5)
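The other low-level functions mentioned earlier, “legend” and “text”, follow the same pattern of being called after a high-level function. Here is a brief sketch on the built-in “cars” dataset (used as a stand-in so that “Ecdat” is not required). It draws to a null PDF device so it can run in a non-interactive session; remove the “pdf(NULL)” and “dev.off()” lines to see the plot on screen. The variable “med_speed” is only an illustrative name.

```r
pdf(NULL)  # null device: nothing is written to disk

# High-level function: creates the plot
plot(dist ~ speed, data = cars, main = "Stopping distance by speed")

# Low-level functions: add to the existing plot
abline(lm(dist ~ speed, data = cars), col = "green", lwd = 2)  # regression line
med_speed <- median(cars$speed)
abline(v = med_speed, col = "blue", lty = 2)                   # vertical median line
legend("topleft", legend = c("regression line", "median speed"),
       col = c("green", "blue"), lty = c(1, 2))
text(x = med_speed, y = max(cars$dist), labels = "median", pos = 4)

invisible(dev.off())
```

Calling “abline”, “legend”, or “text” before “plot” would fail, since there would be no plot to add to.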

Conclusion

In this post, we learned some of the basic structures of creating plots using the “graphics” package. Plots can combine high-level and low-level functions that work together to draw and provide additional information for communicating data in a visual manner.

# Bagging in R

In this post, we will explore the potential of bagging. Bagging is a process in which the original data is bootstrapped to make several different datasets. Each of these datasets is used to generate a model, and voting is used to classify an example or averaging is used for numeric prediction.

Bagging is especially useful for unstable learners. These are algorithms that generate models that can change a great deal when the data is modified a small amount. In order to complete this example, you will need to load the following packages, set the seed, as well as load the dataset “Wages1”. We will be using a decision tree that was developed in an earlier post. Below is the initial code.
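Before using a package, it may help to see the bootstrap-then-vote idea written out by hand. The following is a minimal sketch under several assumptions: it uses two classes of the built-in “iris” data instead of “Wages1”, a logistic regression from “glm” as a stand-in base learner (not the decision tree used in this post), and illustrative names such as “n_bags” and “votes”.

```r
set.seed(1)

# Two-class stand-in data
dat <- droplevels(subset(iris, Species != "setosa"))
n_bags <- 25

# Each column holds one model's 0/1 vote for every row of the data
votes <- sapply(seq_len(n_bags), function(i) {
  idx <- sample(nrow(dat), replace = TRUE)             # bootstrap sample
  fit <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = dat[idx, ], family = binomial)     # one model per bag
  as.integer(predict(fit, newdata = dat, type = "response") > 0.5)
})

# Majority vote across the bags (1 = second factor level, "virginica")
bagged <- ifelse(rowMeans(votes) > 0.5, "virginica", "versicolor")
table(bagged, dat$Species)
```

With an odd number of bags there are no tied votes. For numeric prediction the vote would be replaced by averaging the predictions across the models.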

library(caret); library(Ecdat);library(ipred);library(vcd)
set.seed(1)
data(Wages1)

We will now use the “bagging” function from the “ipred” package to create our model as well as tell R how many bags to make.

theBag<-bagging(sex~.,data=Wages1,nbagg=25)

Next, we will make our predictions. Then we will check the accuracy of the model by looking at a confusion matrix and the kappa statistic. The “Kappa” function comes from the “vcd” package.

bagPred<-predict(theBag, Wages1)
keep<-table(bagPred, Wages1$sex)
keep
##
## bagPred  female male
##   female   1518   52
##   male       51 1673
Kappa(keep)
##             value      ASE     z Pr(>|z|)
## Unweighted 0.9373 0.006078 154.2        0
## Weighted   0.9373 0.006078 154.2        0

The results appear exciting, with almost 97% accuracy. In addition, the Kappa was almost 0.94, indicating a well-fitted model. However, these numbers were measured on the same data that was used to build the model. In order to further confirm the result we can cross-validate the bagged model. Therefore we will do a 10-fold cross-validation using the functions from the “caret” package. Below is the code.

ctrl<-trainControl(method="cv", number=10)
trainModel<-train(sex~.,data=Wages1, method="treebag",trControl=ctrl)
trainModel
## Bagged CART
##
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2965, 2965, 2965, 2964, 2964, 2964, ...
## Resampling results
##
##   Accuracy   Kappa       Accuracy SD  Kappa SD
##   0.5504128  0.09712194  0.02580514   0.05233441
##
##

Now the results are not so impressive. In many ways the model is terrible. The accuracy has fallen significantly and the kappa is almost 0. Remember that cross-validation is an indicator of future performance. This means that our current model would not generalize well to other datasets.

Bagging is not limited to decision trees and can be used with other machine learning models. The example used in this post was one that required the least time to run. For real datasets, the processing time can be quite long for the average laptop.

# Ensemble Learning for Machine Models

One way to improve a machine learning model is to not make just one model. Instead, you can make several models that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is known as ensemble learning.
This post will provide insights into ensemble learning as it is used in developing machine learning models.

The Major Challenge

The biggest challenge in creating an ensemble of models is deciding what models to develop and how the various models are combined to make predictions. Dealing with these challenges involves the use of training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive a subset of the data, or the function may limit how many features each model can use. However, if several different algorithms are used, the allocation function may pass all the data to each model without making any changes.

After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function.

The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used, which means that if 5 models were developed and 3 vote “yes” then the example is classified as a yes. Another option is to weight the models so that some have more influence than others in the final predictions.

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning can improve the generalizability of your model. With the combined strengths of many different models and/or algorithms it is difficult to go wrong.

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy of machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.

Conclusion

Ensemble learning is yet another critical tool in the data scientist’s toolkit.
The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.

# Developing a Customized Tuning Process in R

In this post, we will learn how to develop customized criteria for tuning a machine learning model using the “caret” package. There are two things that need to be done in order to completely assess a model using customized features. These two steps are…

• Determine the model evaluation criteria
• Create a grid of parameters to optimize

The model we are going to tune is the decision tree model made in a previous post with the C5.0 algorithm. Below is code for loading some prior information.

library(caret); library(Ecdat)
data(Wages1)

DETERMINE the MODEL EVALUATION CRITERIA

We are going to begin by using the “trainControl” function to indicate to R what re-sampling method we want to use, the number of folds in the sample, and the method for determining the best model. Remember that there are many more options, but these are the ones we will use. All this information must be saved into a variable using the “trainControl” function. Later, the information we place into the variable will be used when we rerun our model.

For our example, we are going to code the following information into a variable we will call “chck”. For re-sampling we will use k-fold cross-validation. The number of folds will be set to 10. The criterion for selecting the best model will be the “oneSE” method. The “oneSE” method selects the simplest model within one standard error of the best performance. Below is the code for our variable “chck”.

chck<-trainControl(method = "cv",number = 10, selectionFunction = "oneSE")

For now this information is stored to be used later.

CREATE GRID OF PARAMETERS TO OPTIMIZE

We now need to create a grid of parameters. The grid is essentially the characteristics of each model.
For the C5.0 model we need to optimize the model, the number of trials, and whether winnowing is used. Therefore we will do the following.

• For model, we want decision trees only
• Trials will take the value 1 and then go from 5 to 35 in increments of 5
• For winnowing, we do not want any winnowing to take place.

In all we are developing 8 models. We know this based on the trial parameter, which is set to 1, 5, 10, 15, 20, 25, 30, 35. To make the grid we use the “expand.grid” function. Below is the code.

modelGrid<-expand.grid(.model ="tree", .trials= c(1,5,10,15,20,25,30,35), .winnow="FALSE")

CREATE THE MODEL

We are now ready to generate our model. We will use the kappa statistic to evaluate each model’s performance.

set.seed(1)
customModel<- train(sex ~., data=Wages1, method="C5.0", metric="Kappa", trControl=chck, tuneGrid=modelGrid)
customModel
## C5.0
##
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2966, 2965, 2964, 2964, 2965, 2964, ...
## Resampling results across tuning parameters:
##
##   trials  Accuracy   Kappa      Accuracy SD  Kappa SD
##    1      0.5922991  0.1792161  0.03328514   0.06411924
##    5      0.6147547  0.2255819  0.03394219   0.06703475
##   10      0.6077693  0.2129932  0.03113617   0.06103682
##   15      0.6077693  0.2129932  0.03113617   0.06103682
##   20      0.6077693  0.2129932  0.03113617   0.06103682
##   25      0.6077693  0.2129932  0.03113617   0.06103682
##   30      0.6077693  0.2129932  0.03113617   0.06103682
##   35      0.6077693  0.2129932  0.03113617   0.06103682
##
## Tuning parameter 'model' was held constant at a value of tree
##
## Tuning parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using the one SE rule.
## The final values used for the model were trials = 5, model = tree
## and winnow = FALSE.

The actual output is similar to the model that “caret” can automatically create. The difference here is that the criteria were set by us rather than automatically.
A close look reveals that all of the models perform poorly but that there is no change in performance after ten trials.

CONCLUSION

This post provided a brief explanation of developing a customized way of assessing a model’s performance. To complete this, you need to configure your options as well as set up your grid in order to assess a model. Understanding the customization process for evaluating machine learning models is a useful way to develop accurate models that retain generalizability.

# Developing an Automatically Tuned Model in R

In this post, we are going to learn how to use the “caret” package to automatically tune a machine learning model. This is perhaps the simplest way to evaluate the performance of several models. In a later post, we will explore how to perform custom tuning of a model.

The model we are trying to tune is the decision tree we made using the C5.0 algorithm in a previous post. Specifically we were trying to predict sex based on the variables available in the “Wages1” dataset in the “Ecdat” package.

In order to accomplish our goal we will need to load the “caret” and “Ecdat” packages, load the “Wages1” dataset, as well as set the seed. Setting the seed will allow us to reproduce our results. Below is the code for these steps.

library(caret); library(Ecdat)
data(Wages1)
set.seed(1)

We will now build and display our model using the code below.

tuned_model<-train(sex ~., data=Wages1, method="C5.0")
tuned_model
## C5.0
##
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3294, 3294, 3294, 3294, 3294, 3294, ...
## Resampling results across tuning parameters:
##
##   model  winnow  trials  Accuracy   Kappa      Accuracy SD  Kappa SD
##   rules  FALSE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules  FALSE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules  FALSE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules   TRUE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   tree   FALSE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree   FALSE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree   FALSE   20      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree    TRUE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE   20      0.5930511  0.1855230  0.01637060   0.03177075
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = rules
## and winnow = TRUE.

There is a lot of information that is printed out. The first column is the type of model developed. Two types of models were developed, either a rules-based classification model or a normal decision tree. Next is the winnow column. This column indicates if a winnowing process was used to remove poor predictor variables. The third column is the number of trials, which is the number of boosting iterations. The next two columns are accuracy and kappa, which have been explained previously. The last two columns are the standard deviations of accuracy and kappa.

None of the models are that good, but the purpose here is for teaching. At the bottom of the printout, R tells you which model was the best. For us, the best model was the fifth model from the top, which was a rule-based, 10 trial model with winnow set to “TRUE”.

We will now use the best model (the caret package automatically picks it) to make predictions on the training data. We will also look at the confusion matrix of the classifications followed by their proportions. Below is the code.
predict_model<-predict(tuned_model, Wages1)
table(predict_model, Wages1$sex)
##
## predict_model female male
##        female    936  590
##        male      633 1135
prop.table(table(predict_model, Wages1$sex))
##
## predict_model    female      male
##        female 0.2841530 0.1791135
##        male   0.1921676 0.3445659

In terms of prediction, the model was correct about 63% of the time (.284 + .345 = .629). If we want, we can also see the probabilities for each example using the following code.

probTable<-(predict(tuned_model, Wages1, type="prob"))
head(probTable)
##      female       male
## 1 0.6191287 0.38087132
## 2 0.2776770 0.72232303
## 3 0.2975327 0.70246734
## 4 0.7195866 0.28041344
## 5 1.0000000 0.00000000
## 6 0.9092993 0.09070072


Conclusion

In this post, we looked at an automated way to determine the best model among many using the “caret” package. Understanding how to improve the performance of a model is a critical skill in machine learning.

# Improving the Performance of Machine Learning Model

For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do, how to specify the model, among other things is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance.

This post will serve as an introduction to understanding how to improve model performance. In particular, we will look at the following.

• When it is necessary to improve performance
• Parameter tuning

When to Improve

It is not always necessary to try to improve the performance of a model. There are times when a model does well and you know this through evaluating it. If the commonly used measures are adequate there is no cause for concern.

However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle/unclear relationships can make improvement necessary. Normally, real-world data has these problems, so model improvement is usually necessary.

Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times and also brute trial-and-error effort as well. The point is that there is no singular agreed upon way to improve a model. It is better to focus on explaining how you did it if necessary.

Parameter Tuning

Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in R through the use of the “caret” package.

When trying to decide what to do when tuning parameters it is important to remember the following.

• What machine learning model and algorithm you are using for your data.
• Which parameters you can adjust.
• What criteria you are using to evaluate the model

Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning: those that classify, those that employ regression, and those that can do both. Understanding this helps you to make decisions about what you are trying to do.

Next, you need to understand what exactly you or R are adjusting when analyzing the model. For example, for C5.0 decision trees “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved.
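One way to see which parameters are being adjusted is to write the candidate settings out as a grid. The sketch below uses base R's “expand.grid” with the three C5.0 parameters discussed on this blog (“model”, “trials”, “winnow”); the particular values are illustrative assumptions, and the object name “candidates” is hypothetical.

```r
# Every combination of the candidate settings becomes one row, i.e. one model to evaluate
candidates <- expand.grid(model  = c("tree", "rules"),
                          trials = c(1, 10, 20),
                          winnow = c(FALSE, TRUE),
                          stringsAsFactors = FALSE)

nrow(candidates)   # 2 model types x 3 trial values x 2 winnow settings
head(candidates)
```

The grid grows multiplicatively with each extra value, which is worth keeping in mind because every row means another model to fit and resample.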

Lastly, it is important to know what criteria you are using to compare the various models. For classifying models you can look at the kappa, the ROC curve, and the various information derived from the confusion matrix. For regression-based models you may look at the R-squared or the RMSE (root mean squared error).
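For the regression criteria mentioned above, both the RMSE and the R-squared can be computed directly from a fitted model's residuals. This is a small sketch on the built-in “cars” dataset, chosen only as a convenient example; the variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# RMSE: typical size of a prediction error, in the units of the outcome
rmse <- sqrt(mean(res^2))

# R-squared: share of the outcome's variance explained by the model
r2_manual  <- 1 - sum(res^2) / sum((cars$dist - mean(cars$dist))^2)
r2_summary <- summary(fit)$r.squared

c(rmse = rmse, r2_manual = r2_manual, r2_summary = r2_summary)
```

Note that these values are computed on the training data; when comparing tuned models, the same measures would normally be taken from resampling, as “caret” does.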

Conclusion

As you can perhaps tell there is an incredible amount of choice and options in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making.

In a future post, we will look at an example of model improvement.