Tag Archives: machine learning

K Nearest Neighbor in R

K-nearest neighbor is one of many nonlinear algorithms that can be used in machine learning. By non-linear I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that naturally does not meet the assumptions of linearity.

KNN is also known as a “lazy learner”. This means that there are known coefficients or parameter estimates. When doing regression we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is used K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the number of K neighbors to use that is most accurate based on the different criteria for assessing the models.

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)


From the plots, it appears there are no differences in how the variable act whether someone is from the city or not. This may be a flag that classification may not work.

We now need to scale our data otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set because it’s a factor and then reinsert it into the data set. Below is the code.


We will now look at our box-plots a second time but this time with scaled data.

mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)


This second plot is easier to read but there is still little indication of difference.

We can now move to checking the correlations among the variables. Below is the code

corrplot(mroz.cor,method = 'number')


There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem.

We will now divide our dataset into the training and testing sets


Before creating a model we need to create a grid. We do not know the value of k yet so we have to run multiple models with different values of k in order to determine this for our model. As such we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation to get a better estimate of k as well using the “trainControl” function. The code is below.


Now we make our model,

## k-Nearest Neighbors 
## 540 samples
##  16 predictors
##   2 classes: 'no', 'yes' 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ... 
## Resampling results across tuning parameters:
##   k   Accuracy   Kappa    
##    2  0.6000095  0.1213920
##    3  0.6368757  0.1542968
##    4  0.6424325  0.1546494
##    5  0.6386252  0.1275248
##    6  0.6329998  0.1164253
##    7  0.6589619  0.1616377
##    8  0.6663344  0.1774391
##    9  0.6663681  0.1733197
##   10  0.6609510  0.1566064
##   11  0.6664018  0.1575868
##   12  0.6682199  0.1669053
##   13  0.6572111  0.1397222
##   14  0.6719586  0.1694953
##   15  0.6571425  0.1263937
##   16  0.6664367  0.1551023
##   17  0.6719573  0.1588789
##   18  0.6608811  0.1260452
##   19  0.6590979  0.1165734
##   20  0.6609510  0.1219624
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 14.

R recommends that k = 16. This is based on a combination of accuracy and the kappa statistic. The kappa statistic is a measurement of the accuracy of a model while taking into account chance. We don’t have a model in the sense that we do not use the ~ sign like we do with regression. Instead, we have a train and a test set a factor variable and a number for k. This will make more sense when you see the code. Finally, we will use this information on our test dataset. We will then look at the table and the accuracy of the model.

knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #-17 removes the dependent variable 'city
## knn.test  no yes
##      no   19   8
##      yes  61 125
## [1] 0.6760563

Accuracy is 67% which is consistent with what we found when determining the k. We can also calculate the kappa. This done by calculating the probability and then do some subtraction and division. We already know the accuracy as we stored it in the variable “prob.agree” we now need the probability that this is by chance. Lastly, we calculate the kappa.

## [1] 0.664827

A kappa of .66 is actual good.

The example we just did was with unweighted k neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighing methods. “Rectangular” is unweighted and is the one that we used. The other two are “triangular” and “epanechnikov”. How these calculate the weights is beyond the scope of this post. In the code below the argument “distance” can be set to 1 for euclidean and 2 for absolute distance.

kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular",


## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov"))
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14

If you look at the plot you can see which value of k is the best by looking at the point that is the lowest on the graph which is right before 15. Looking at the legend it indicates that the point is the “rectangular” estimate which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although it recommends a different value for k our misclassification was about the same.


In this post, we explored both weighted and unweighted KNN. This algorithm allows you to deal with data that does not meet the assumptions of regression by ignoring the need for parameters. However, because there are no numbers really attached to the results beyond accuracy it can be difficult to explain what is happening in the model to people. As such, perhaps the biggest drawback is communicating results when using KNN.


Elastic Net Regression in R

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression only has 1.

In this post, we will go through an example of the use of elastic net using the “VietnamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis

## 'data.frame':    27765 obs. of  12 variables:
##  $ pharvis  : num  0 0 0 1 1 0 0 0 2 3 ...
##  $ lnhhexp  : num  2.73 2.74 2.27 2.39 3.11 ...
##  $ age      : num  3.76 2.94 2.56 3.64 3.3 ...
##  $ sex      : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
##  $ married  : num  1 0 0 1 1 1 1 0 1 1 ...
##  $ educ     : num  2 0 4 3 3 9 2 5 2 0 ...
##  $ illness  : num  1 1 0 1 1 0 0 0 2 1 ...
##  $ injury   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ illdays  : num  7 4 0 3 10 0 0 0 4 7 ...
##  $ actdays  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ insurance: num  0 0 1 1 0 1 1 1 0 0 ...
##  $ commune  : num  192 167 76 123 148 20 40 57 49 170 ...
##  - attr(*, "na.action")=Class 'omit'  Named int 27734
##   .. ..- attr(*, "names")= chr "27734"

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. Code is below.



No major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3))

We need to create a grid that will allow us to investigate different models with different combinations of alpha ana lambda. This is done using the “expand.grid” function. In combination with the “seq” function below is the code


We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the “trainControl” function” from the “caret” package. In the code below “LOOCV” stands for “leave one out cross-validation”.

control<-trainControl(method = "LOOCV")

We are no ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters

## glmnet 
## 694 samples
##  10 predictors
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ... 
## Resampling results across tuning parameters:
##   alpha  lambda  RMSE      Rsquared 
##   0.0    0.0     5.229759  0.2968354
##   0.0    0.1     5.229759  0.2968354
##   0.0    0.2     5.229759  0.2968354
##   0.5    0.0     5.243919  0.2954226
##   0.5    0.1     5.225067  0.2985989
##   0.5    0.2     5.200415  0.3038821
##   1.0    0.0     5.244020  0.2954519
##   1.0    0.1     5.203973  0.3033173
##   1.0    0.2     5.182120  0.3083819
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2.

The output list all the possible alpha and lambda values that we set in the “grid” variable. It even tells us which combination was the best. For our purposes, the alpha will be .5 and the lambda .2. The r-square is also included.

We will set our model and run it on the test set. We have to convert the “sex” variable to a dummy variable for the “glmnet” function. We next have to make matrices for the predictor variables and a for our outcome variable “illdays”

train$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable 
test$sex<-model.matrix( ~ sex - 1, data=test )
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)

We can now look at specific coefficient by using the “coef” function.

## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   -1.304263895
## pharvis        0.532353361
## lnhhexp       -0.064754000
## age            0.760864404
## sex.sexfemale  0.029612290
## sex.sexmale   -0.002617404
## married        0.318639271
## educ           .          
## illness        3.103047473
## injury         .          
## actdays        0.314851347
## insurance      .

You can see for yourself that several variables were removed from the model. Medical expenses (lnhhexp), sex, education, injury, and insurance do not play a role in the number of days ill for an individual in Vietnam.

With our model developed. We now can test it using the predict function. However, we first need to convert our test dataframe into a matrix and remove the outcome variable from it

enet.y<-predict(enet, newx = test.matrix, type = "response", lambda=.2,alpha=.5)

Let’s plot our results



This does not look good. Let’s check the mean squared error

## [1] 20.18134

We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it.



You can see that as the number of features are reduce (see the numbers on the top of the plot) the MSE increases (y-axis). In addition, as the lambda increases, there is also an increase in the error but only when the number of variables is reduced as well.

The dotted vertical lines in the plot represent the minimum MSE for a set lambda (on the left) and the one standard error from the minimum (on the right). You can extract these two lambda values using the code below.

## [1] 0.3082347
## [1] 2.874607

We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict.

## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)   2.34116947
## pharvis       0.003710399       
## lnhhexp       .       
## age           .       
## sex.sexfemale .       
## sex.sexmale   .       
## married       .       
## educ          .       
## illness       1.817479480
## injury        .       
## actdays       .       
## insurance     .

Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information.

enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',lambda="lambda.1se", alpha = .5)
## [1] 25.47966

A small improvement.  Our model is a mess but this post served as an example of how to conduct an analysis using elastic net regression.

Lasso Regression in R

In this post, we will conduct an analysis using the lasso regression. Remember lasso regression will actually eliminate variables by reducing them to zero through how the shrinkage penalty can be applied.

We will use the dataset “nlschools” from the “MASS” packages to conduct our analysis. We want to see if we can predict language test scores “lang” with the other available variables. Below is some initial code to begin the analysis

## 'data.frame':    2287 obs. of  6 variables:
##  $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
##  $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
##  $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.



No problems with collinearity. We will now setup are training and testing sets.

ind<-sample(2,nrow(nlschools),replace=T,prob = c(0.7,0.3))

Remember that the ‘glmnet’ function does not like factor variables. So we need to convert our “COMB” variable to a dummy variable. In addition, “glmnet” function does not like data frames so we need to make two data frames. The first will include all the predictor variables and the second we include only the outcome variable. Below is the code

train$COMB<-model.matrix( ~ COMB - 1, data=train ) #convert to dummy variable 
test$COMB<-model.matrix( ~ COMB - 1, data=test )

We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression.


Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

  • Df = number of variables including in the model (this is always the same number in a ridge model)
  • %Dev = Percent of deviance explained. The higher the better
  • Lambda = The lambda used to obtain the %Dev

When you use the “print” function for a lasso model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we will use the “print” function but, I only printed the first 5 and last 5 models in order to reduce the size of the printout. Fortunately, it only took 60 models to converge.

## Call:  glmnet(x = predictor_variables, y = language_score, family = "gaussian",      alpha = 1) 
##       Df    %Dev  Lambda
##  [1,]  0 0.00000 5.47100
##  [2,]  1 0.06194 4.98500
##  [3,]  1 0.11340 4.54200
##  [4,]  1 0.15610 4.13900
##  [5,]  1 0.19150 3.77100
## [55,]  3 0.39890 0.03599
## [56,]  3 0.39900 0.03280
## [57,]  3 0.39900 0.02988
## [58,]  3 0.39900 0.02723
## [59,]  3 0.39900 0.02481
## [60,]  3 0.39900 0.02261

The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results we can set the lambda at 0.02 because this explains the highest amount of deviance at .39.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers next to the coefficient lines refers to the actual coefficient of a particular variable as it changes from using different lambda values. Each number corresponds to a variable going from left to right in a dataframe/matrix using the “View” function. For example, 1 in the plot refers to “IQ” 2 refers to “GS” etc.



As you can see, as lambda increase the coefficient decrease in value. This is how regularized regression works. However, unlike ridge regression which never reduces a coefficient to zero, lasso regression does reduce a coefficient to zero. For example, coefficient 3 (SES variable) and coefficient 2 (GS variable) are reduced to zero when lambda is near 1.

You can also look at the coefficient values at a specific lambda values. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to do see the results

lasso.coef<-coef(lasso,s=.02,exact = T)
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  9.35736325
## IQ           2.34973922
## GS          -0.02766978
## SES          0.16150542

Results indicate that for a 1 unit increase in IQ there is a 2.41 point increase in language score. When GS (class size) goes up 1 unit there is a .03 point decrease in language score. Finally, when SES (socioeconomic status) increase  1 unit language score improves .13 point.

The second plot shows us the deviance explained on the x-axis. On the y-axis is the coefficients of the predictor variables. Below is the code



If you look carefully, you can see that the two plots are completely opposite to each other. increasing lambda cause a decrease in the coefficients. Furthermore, increasing the fraction of deviance explained leads to an increase in the coefficient. You may remember seeing this when we used the “print”” function. As lambda became smaller there was an increase in the deviance explained.

Now, we will assess our model using the test data. We need to convert the test dataset to a matrix. Then we will use the “predict”” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.

lasso.y<-predict(lasso,newx = test.matrix,type = 'response',s=.02)


The visual looks promising. The last thing we need to do is calculated the mean squared error. By its self this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code

## [1] 46.74314

Knowing this number, we can, if we wanted, develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better while keeping in mind the complexity of the model.

Ridge Regression in R

In this post, we will conduct an analysis using ridge regression. Ridge regression is a type of regularized regression. By applying a shrinkage penalty, we are able to reduce the coefficients of many variables almost to zero while still retaining them in the model. This allows us to develop models that have many more variables in them compared to models using best subset or stepwise regression.

In the example used in this post, we will use the “SAheart” dataset from the “ElemStatLearn” package. We want to predict systolic blood pressure (sbp) using all of the other variables available as predictors. Below is some initial code that we need to begin.

## 'data.frame':    462 obs. of  10 variables:
##  $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
##  $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
##  $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $ obesity  : num  25.3 28.9 29.1 32 26 ...
##  $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

A look at the object using the “str” function indicates that one variable “famhist” is a factor variable. The “glmnet” function that does the ridge regression analysis cannot handle factors so we need to converts this to a dummy variable. However, there are two things we need to do before this. First, we need to check the correlations to make sure there are no major issues with multi-collinearity Second, we need to create our training and testing data sets. Below is the code for the correlation plot.



First we created a variable called “p.cor” the -5 in brackets means we removed the 5th column from the “SAheart” data set which is the factor variable “Famhist”. The correlation plot indicates that there is one strong relationship between adiposity and obesity. However, one common cut-off for collinearity is 0.8 and this value is 0.72 which is not a problem.

We will now create are training and testing sets and convert “famhist” to a dummy variable.

ind<-sample(2,nrow(SAheart),replace=T,prob = c(0.7,0.3))
train$famhist<-model.matrix( ~ famhist - 1, data=train ) #convert to dummy variable 
test$famhist<-model.matrix( ~ famhist - 1, data=test )

We are still not done preparing our data yet. “glmnet” cannot use data frames, instead, it can only use matrices. Therefore, we now need to convert our data frames to matrices. We have to create two matrices, one with all of the predictor variables and a second with the outcome variable of blood pressure. Below is the code


We are now ready to create our model. We use the “glmnet” function and insert our two matrices. The family is set to Gaussian because “blood pressure” is a continuous variable. Alpha is set to 0 as this indicates ridge regression. Below is the code

ridge<-glmnet(predictor_variables,blood_pressure,family = 'gaussian',alpha = 0)

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

  •  Df = number of variables including in the model (this is always the same number in a ridge model)
  •  %Dev = Percent of deviance explained. The higher the better
  • Lambda = The lambda used to attain the %Dev

When you use the “print” function for a ridge model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we have the “print” function. However, I have only printed the first 5 and last 5 models in order to save space.

## Call:  glmnet(x = predictor_variables, y = blood_pressure, family = "gaussian",      alpha = 0) 
##        Df      %Dev    Lambda
##   [1,] 10 7.622e-37 7716.0000
##   [2,] 10 2.135e-03 7030.0000
##   [3,] 10 2.341e-03 6406.0000
##   [4,] 10 2.566e-03 5837.0000
##   [5,] 10 2.812e-03 5318.0000
##  [95,] 10 1.690e-01    1.2290
##  [96,] 10 1.691e-01    1.1190
##  [97,] 10 1.692e-01    1.0200
##  [98,] 10 1.693e-01    0.9293
##  [99,] 10 1.693e-01    0.8468
## [100,] 10 1.694e-01    0.7716

The results from the “print” function are useful in setting the lambda for the “test” dataset. Based on the results we can set the lambda at 0.83 because this explains the highest amount of deviance at .20.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers refer to the actual coefficient of a particular variable. Inside the plot, each number corresponds to a variable going from left to right in a data-frame/matrix using the “View” function. For example, 1 in the plot refers to “tobacco” 2 refers to “ldl” etc. Across the top of the plot is the number of variables used in the model. Remember this number never changes when doing ridge regression.



As you can see, as lambda increase the coefficient decrease in value. This is how ridge regression works yet no coefficient ever goes to absolute 0.

You can also look at the coefficient values at a specific lambda value. The values are unstandardized but they provide a useful insight when determining final model selection. In the code below the lambda is set to .83 and we use the “coef” function to do this

ridge.coef<-coef(ridge,s=.83,exact = T)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## (Intercept)            105.69379942
## tobacco                 -0.25990747
## ldl                     -0.13075557
## adiposity                0.29515034
## famhist.famhistAbsent    0.42532887
## famhist.famhistPresent  -0.40000846
## typea                   -0.01799031
## obesity                  0.29899976
## alcohol                  0.03648850
## age                      0.43555450
## chd                     -0.26539180

The second plot shows us the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code



The two plots are completely opposite to each other. Increasing lambda cause a decrease in the coefficients while increasing the fraction of deviance explained leads to an increase in the coefficient. You can also see this when we used the “print” function. As lambda became smaller there was an increase in the deviance explained.

We now can begin testing our model on the test data set. We need to convert the test dataset to a matrix and then we will use the predict function while setting our lambda to .83 (remember a lambda of .83 explained the most of the deviance). Lastly, we will plot the results. Below is the code.

ridge.y<-predict(ridge,newx = test.matrix,type = 'response',s=.83)


The last thing we need to do is calculated the mean squared error. By it’s self this number is useless. However, it provides a benchmark for comparing the current model with any other models you may develop. Below is the code

## [1] 372.4431

Knowing this number, we can develop other models using other methods of analysis to try to reduce it as much as possible.

Regularized Linear Regression

Traditional linear regression has been a tried and true  model for making predictions for decades. However, with the growth of Big Data and datasets with 100’s of variables  problems have begun to arise. For example, using stepwise or best subset method with regression could take hours if not days to converge in even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

  • Definition of regularized regression
  • Ridge regression
  • Lasso regression
  • Elastic net regression


Regularization involves the use of a shrinkage penalty in order to reduce the residual sum of squares (RSS). This is done through selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said

RSS + λ(normalized coefficients)

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of fitting k-1 models when k is the number of variables available (for example, 50 variables would lead 49 models!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollenarity in a model. As such, the benefits of using regularization make it clear that this should be considering when working with larger data sets.

Ridge Regression

Ridge regression involves the normalization of the squared weights or as shown in the equation below

RSS + λ(normalized coefficients^2)

This is also refered to as the L2-norm. As lambda increase in value the coefficients in the model are shrunk towards 0 but never reach 0. This is how the error is shrunk. The higher the lambda the lower the value of the coefficients as they are reduce more and more thus reducing the RSS.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.


Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm which is the sum of the absolute value of the coefficients or as shown in the equation below

RSS + λ(Σ|normalized coefficients|)

This shrinkage penalty will reduce a coefficient to 0 which is another way of saying that variables will be removed from the model. One problem is that highly correlated variables that need to be  in your model my be removed when Lasso shrinks coefficients. This is one reason why ridge regression is still used.

Elastic Net

Elastic net is the best of ridge and Lasso without the weaknesses of either. It combines extracts variables like Lasso and Ridge does not while also group variables like Ridge does but Lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression and if alpha is set to 1 it is the same as lasso regression. If you are able to appreciate it below is the formula used for elastic net regression

(RSS + l[(1 – alpha)(S|normalized coefficients|2)/2 + alpha(S|normalized coefficients|)])/N)

As such when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model.


Regularized regressio was developed as an answer to the growth in the size and number of variables in a data set today. Ridge, lasso an elastic net all provide solutions to converging over large datasets and selecting features.

Linear VS Quadratic Discriminant Analysis in R

In this post we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Another commonly used option is logistic regression but there are differences between logistic regression and discriminant analysis. Both LDA and QDA are used in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physic, and chemistry classes a respondent took. Below is some initial code to start our analysis.


The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed use the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function using a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code

math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))

Now we will make our model and it is called “lda.math” and it will include all available variables in the “math.train” dataset. Next we will check the results by calling the modle. Finally, we will examine the plot to see how our model is doing. Below is the code.

## Call:
## lda(sex ~ ., data = math.train)
## Prior probabilities of groups:
##      male    female 
## 0.5986079 0.4013921 
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610


Calling “lda.math” gives us the details of our model. It starts be indicating the prior probabilities of someone being male or female. Next is the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between male and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between male and females. However, since this is for demonstrating purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with thee actual classification. Below is the code.

##          male female
##   male    219    100
##   female   39     73
## [1] 0.6774942

As you can see, we have a lot of misclassification happening. A large amount of false negatives which is a lot of males being classified as female. The overall accuracy us only 59% which is not much better than chance.

We will now conduct the same analysis on the test data set. Below is the code.

##          male female
##   male     92     43
##   female   23     20
## [1] 0.6292135

As you can see the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better

QDA allows for each class in the dependent variable to have it’s own covariance rather than a shared covariance as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

##          male female
##   male    215     84
##   female   43     89
## [1] 0.7053364

You can see there is almost no difference. Below is the code for the test data.

##          male female
##   male     91     43
##   female   24     20
## [1] 0.6235955

Still disappointing. However, in this post we reviewed linear discriminant analysis as well as learned about the use of quadratic linear discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes shared covariance in the dependent variable categories will QDA allows for each category in the dependent variable to have it’s own variance.

Validating a Logistic Model in R

In this post, we are going to continue are analysis of the logistic regression model from the the post on logistic regression  in R. We need to rerun all of the code from the last post to be ready to continue. As such the code form the last post is all below

pm<-melt(survey, id.var="Sex")
ggplot(pm,aes(Sex,value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)


set.seed(123) ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3)) train<-survey[ind==1,] test<-survey[ind==2,] fit<-glm(Sex~.,binomial,train) exp(coef(fit))

train$probs<-predict(fit, type = 'response')
test$prob<-predict(fit,newdata = test, type = 'response')

Model Validation

We will now do a K-fold cross validation in order to further see how our model is doing. We cannot use the factor variable “Sex” with the K-fold code so we need to create a dummy variable. First, we create a variable called “y” that has 123 spaces, which is the same size as the “train” dataset. Second, we fill “y” with 1 in every example that is coded “male” in the “Sex” variable.

In addition, we also need to create a new dataset and remove some variables from our prior analysis otherwise we will confuse the functions that we are going to use. We will remove “predict”, “Sex”, and “probs”


We now can do our K-fold analysis. The code is complicated so you can trust it and double check on your own.

bestglm(Xy=my.cv,IC="CV",CVArgs = list(Method="HTF",K=10,REP=1),family = binomial)
## Morgan-Tatar search since family is non-gaussian.
## CV(K = 10, REP = 1)
## BICq equivalent for q in (6.66133814775094e-16, 0.0328567092272112)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -45.2329733 7.80146036 -5.798014 6.710501e-09
## Height        0.2615027 0.04534919  5.766425 8.097067e-09

The results confirm what we alreaedy knew that only the “Height” variable is valuable in predicting Sex. We will now create our new model using only the recommendation of the kfold validation analysis. Then we check the new model against the train dataset and with the test dataset. The code below is a repeat of prior code but based on the cross-validation

reduce.fit<-glm(Sex~Height, family=binomial,train)
##          Female Male
##   Female     61   11
##   Male        7   44
## [1] 0.8536585
test$cv.probs<-predict(reduce.fit,test,type = 'response')
##          Female Male
##   Female     16    7
##   Male        1   22
## [1] 0.826087

The results are consistent for both the train and test dataset. We are now going to create the ROC curve. This will provide a visual and the AUC number to further help us to assess our model. However, a model is only good when it is compared to another model. Therefore, we will create a really bad model in order to compare it to the original model, and the cross validated model. We will first make a bad model and store the probabilities in the “test” dataset. The bad model will use “age” to predict “Sex” which doesn’t make any sense at all. Below is the code followed by the ROC curve of the bad model.

bad.fit<-glm(Sex~Age,family = binomial,test)


The more of a diagonal the line is the worst it is. As, we can see the bad model is really bad.

What we just did with the bad model we will now repeat for the full model and the cross-validated model.  As before, we need to store the prediction in a way that the ROCR package can use them. We will create a variable called “pred.full” to begin the process of graphing the the original full model from the last blog post. Then we will use the “prediction” function. Next, we will create the “perf.full” variable to store the performance of the model. Notice, the arguments ‘tpr’ and ‘fpr’ for true positive rate and false positive rate. Lastly, we plot the results

plot(perf.full, col=2)


We repeat this process for the cross-validated model



Now let’s put all the different models on one plot

plot(perf.full, col=2, add=T)
legend(.7,.4,c("BAD","FULL","CV"), 1:3)


Finally, we can calculate the AUC for each model

## [[1]]
## [1] 0.4766734
## [[1]]
## [1] 0.959432
## [[1]]
## [1] 0.9107505

The higher the AUC the better. As such, the full model with all variables is superior to the cross-validated or bad model. This is despite the fact that there are many high correlations in the full model as well. Another point to consider is that the cross-validated model is simpler so this may be a reason to pick it over the full model. As such, the statistics provide support for choosing a model but the do not trump the ability of the research to pick based on factors beyond just numbers.

Logistic Regression in R

In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical dependent variables. In our example, we want to predict Sex (male or female) when using several continuous variables from the “survey” dataset in the “MASS” package.

?MASS::survey #explains the variables in the study

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove in examples that are missing data so we use the “na.omit” function for this. Below is the code


We now need to check for collinearity using the “corrplot.mixed” function form the “corrplot” package.



We have extreme correlation between “We.Hnd” and “NW.Hnd” this makes sense because people’s hands are normally the same size. Since this blog post  is a demonstration of logistic regression we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for. First we need to make a variable that we call “ind” that is randomly assigns 70% of the number of rows of survey 1 and 30% 2. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is 70% train and 30% test. Below is the code

ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9875  -0.5466  -0.1395   0.3834   3.4443  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -46.42175    8.74961  -5.306 1.12e-07 ***
## Wr.Hnd       -0.43499    0.66357  -0.656    0.512    
## NW.Hnd        1.05633    0.70034   1.508    0.131    
## Pulse        -0.02406    0.02356  -1.021    0.307    
## Height        0.21062    0.05208   4.044 5.26e-05 ***
## Age           0.00894    0.05368   0.167    0.868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 169.14  on 122  degrees of freedom
## Residual deviance:  81.15  on 117  degrees of freedom
## AIC: 93.15
## Number of Fisher Scoring iterations: 6
##  (Intercept)       Wr.Hnd       NW.Hnd        Pulse       Height 
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00 
##          Age 
## 1.008980e+00

The results indicate that only height is useful in predicting if someone is a male or female. The second piece of code shares the odds ratios. The odds ratio tell how a one unit increase in the independent variable leads to an increase in the odds of being male in our model. For example, for every one unit increase in height there is a 1.23 increase in the odds of a particular example being male.

We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

train$probs<-predict(fit, type = 'response')
##          Female Male
##   Female     61    7
##   Male        7   48
## [1] 0.8861789

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

test$prob<-predict(fit,newdata = test, type = 'response')
##          Female Male
##   Female     17    3
##   Male        0   26
## [1] 0.9347826

As you can see, we do even better on the test set with an accuracy of 93.4%. Our model is looking pretty good and height is an excellent predictor of sex which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess the quality of it.

Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This experience is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in r or excel. If you created it in excel just save it as a csv and load it into r. Below is the initial code

apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This a small dataset with the columns of “weight” and “location”. Here are some of the problems

  • Weights are in different units
  • Weights are written in different ways
  • Location is not consistent

In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project

  • Convert the “Weight variable” to a numerical variable instead of character
  • Remove the text and have only numbers in the “weight variable”
  • Change weights in grams to kilograms
  • Convert the “location” variable to a factor variable instead of character
  • Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

  1. We created a variable called “corrected.weight”
  2. We use the function “as.numeric” this makes whatever results inside it to be a numerical variable
  3. Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another.
  4. Inside “gsub” we used the argument pattern and set it to “[[alpha:]]” and “” this told r to look for any lower or uppercase letters and replace with nothing or remove it. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

  1. We created a variable called “gram.error”
  2. We used the grep function to search are the “weight” variable in the apple data frame for input that is a digit and is 4 digits in length this is what the “[[:digit:]]{4}” argument means. We do not change any values yet we just store them in “gram.error”
  3. Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable.
  4. We tell r to save into the “corrected.weight” variable any value that is changeable according to the criteria set in “gram.error” and to divided it by 1000. Dividing it by 1000 converts the value from grams to kilograms.

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-gsub(pattern = "United States","US",corrected.location)
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain but in short We used the “agrep” function to tell r to search the “location” to look for values similar to our term “europe”. The other arguments provide some exceptions that r should change because the exceptions are close to the term europe. This process is repeated for the term “us”. We then store are the location variable from the “apple” dataframe in a new variable called “corrected.location” We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next we have to make some code to deal with “United States” and apply this using the “gsub” function.

We are almost done, now we combine are two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

Principal Component Analysis in R

This post will demonstrate the use of principal component analysis (PCA). PCA is useful for several reasons. One it allows you place your examples into groups similar to linear discriminant analysis but you do not need to know beforehand what the groups are. Second, PCA is used for the purpose of dimension reduction. For example, if you have 50 variables PCA can allow you to reduce this while retaining a certain threshold of variance. If you are working with a large dataset this can greatly reduce the computational time and general complexity of your models.

Keep in mind that there really is not a dependent variable as this is unsupervised learning. What you are trying to see is how different examples can be mapped in space based on whatever independent variables are used. For our example, we will use the “Carseats” dataset form the “ISLR”. Our goal is to understanding the relationship among the variables when examining the shelve location of the car seat. Below is the initial code to begin the analysis


We first need to rearrange the data and remove the variables we are not going to use in the analysis. Below is the code.


Here is what we did 1. We made a copy of the “Carseats” data called “Careseats1” 2. We rearranged the order of the variables so that the factor variables are at the end. This will make sense later 3.We removed the “Urban” and “US” variables from the table as they will not be a part of our analysis

We will now do the PCA. We need to scale and center our data otherwise the larger numbers will have a much stronger influence on the results than smaller numbers. Fortunately, the “prcomp” function has a “scale” and a “center” argument. We will also use only the first 7 columns for the analysis  as “sheveLoc” is not useful for this analysis. If we hadn’t moved “shelveLoc” to the end of the dataframe it would cause some headache. Below is the code.

Carseats.pca<-prcomp(Carseats1[,1:7],scale. = T,center = T)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.3315 1.1907 1.0743 0.9893 0.9260 0.80506 0.41320
## Proportion of Variance 0.2533 0.2026 0.1649 0.1398 0.1225 0.09259 0.02439
## Cumulative Proportion  0.2533 0.4558 0.6207 0.7605 0.8830 0.97561 1.00000

The summary of “Carseats.pca” Tells us how much of the variance each component explains. Keep in mind that number of components is equal to the number of variables. The “proportion of variance” tells us the contribution each component makes and the “cumulative proportion”.

If your goal is dimension reduction than the number of components to keep depends on the threshold you set. For example, if you need around 90% of the variance you would keep the first 5 components. If you need 95% or more of the variance you would keep the first six. To actually use the components you would take the “Carseats.pca$x” data and move it to your data frame.

Keep in mind that the actual components have no conceptual meaning but is a numerical representation of a combination of several variables that were reduce using PCA to fewer variables such as going form 7 variables to 5 variables.

This means that PCA is great for reducing variables for prediction purpose but is much harder for explanatory studies unless you can explain what the new components represent.

For our purposes, we will keep 5 components. This means that we have reduce our dimensions from 7 to 5 while still keeping almost 90% of the variance. Graphing our results is tricky because we have 5 dimensions but the human mind can only conceptualize 3 at the best and normally 2. As such we will plot the first two components and label them by shelf location using ggplot2. Below is the code



From the plot you can see there is little separation when using the first two components of the PCA analysis. This makes sense as we can only graph to components so we are missing a lot of the variance. However for demonstration purposes the analysis is complete.

Linear Discriminant Analysis in R

In this post we will look at an example of linear discriminant analysis (LDA). LDA is used to develop a statistical model that classifies examples in a dataset. In the example in this post, we will use the “Star” dataset from the “Ecdat” package. What we will do is try to predict the type of class the students learned in (regular, small, regular with aide) using their math scores, reading scores, and the teaching experience of the teacher. Below is the initial code


We first need to examine the data by using the “str” function

## 'data.frame':    5748 obs. of  8 variables:
##  $ tmathssk: int  473 536 463 559 489 454 423 500 439 528 ...
##  $ treadssk: int  447 450 439 448 447 431 395 451 478 455 ...
##  $ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $ totexpk : int  7 21 0 16 5 8 17 3 11 10 ...
##  $ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $ race    : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
##  $ schidkn : int  63 20 19 69 79 5 16 56 11 66 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
##   .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...

We will use the following variables

  • dependent variable = classk (class type)
  • independent variable = tmathssk (Math score)
  • independent variable = treadssk (Reading score)
  • independent variable = totexpk (Teaching experience)

We now need to examine the data visually by looking at histograms for our independent variables and a table for our dependent variable







##           regular       small.class regular.with.aide 
##         0.3479471         0.3014962         0.3505567

The data mostly looks good. The results of the “prop.table” function will help us when we develop are training and testing datasets. The only problem is with the “totexpk” variable. IT is not anywhere near to be normally distributed. TO deal with this we will use the square root for teaching experience. Below is the code



Much better. We now need to check the correlation among the variables as well and we will use the code below.

##                        star.sqrt.tmathssk star.sqrt.treadssk
## star.sqrt.tmathssk             1.00000000          0.7135489
## star.sqrt.treadssk             0.71354889          1.0000000
## star.sqrt.totexpk.sqrt         0.08647957          0.1045353
##                        star.sqrt.totexpk.sqrt
## star.sqrt.tmathssk                 0.08647957
## star.sqrt.treadssk                 0.10453533
## star.sqrt.totexpk.sqrt             1.00000000

None of the correlations are too bad. We can now develop our model using linear discriminant analysis. First, we need to scale are scores because the test scores and the teaching experience are measured differently. Then, we need to divide our data into a train and test set as this will allow us to determine the accuracy of the model. Below is the code.


Now we develop our model. In the code before the “prior” argument indicates what we expect the probabilities to be. In our data the distribution of the the three class types is about the same which means that the apriori probability is 1/3 for each class type.

train.lda<-lda(classk~tmathssk+treadssk+totexpk.sqrt, data = 
## Call:
## lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star, 
##     prior = c(1, 1, 1)/3)
## Prior probabilities of groups:
##           regular       small.class regular.with.aide 
##         0.3333333         0.3333333         0.3333333 
## Group means:
##                      tmathssk    treadssk totexpk.sqrt
## regular           -0.04237438 -0.05258944  -0.05082862
## small.class        0.13465218  0.11021666  -0.02100859
## regular.with.aide -0.05129083 -0.01665593   0.09068835
## Coefficients of linear discriminants:
##                      LD1         LD2
## tmathssk      0.89656393 -0.04972956
## treadssk      0.04337953  0.56721196
## totexpk.sqrt -0.49061950  0.80051026
## Proportion of trace:
##    LD1    LD2 
## 0.7261 0.2739

The printout is mostly readable. At the top is the actual code used to develop the model followed by the probabilities of each group. The next section shares the means of the groups. The coefficients of linear discriminants are the values used to classify each example. The coefficients are similar to regression coefficients. The computer places each example in both equations and probabilities are calculated. Whichever class has the highest probability is the winner. In addition, the higher the coefficient the more weight it has. For example, “tmathssk” is the most influential on LD1 with a coefficient of 0.89.

The proportion of trace is similar to principal component analysis

Now we will take the trained model and see how it does with the test set. We create a new model called “predict.lda” and use are “train.lda” model and the test data called “test.star”

predict.lda<-predict(train.lda,newdata = test.star)

We can use the “table” function to see how well are model has done. We can do this because we actually know what class our data is beforehand because we divided the dataset. What we need to do is compare this to what our model predicted. Therefore, we compare the “classk” variable of our “test.star” dataset with the “class” predicted by the “predict.lda” model.

##                     regular small.class regular.with.aide
##   regular               155         182               249
##   small.class           145         198               174
##   regular.with.aide     172         204               269

The results are pretty bad. For example, in the first row called “regular” we have 155 examples that were classified as “regular” and predicted as “regular” by the model. In rhe next column, 182 examples that were classified as “regular” but predicted as “small.class”, etc. To find out how well are model did you add together the examples across the diagonal from left to right and divide by the total number of examples. Below is the code

## [1] 0.3558352

Only 36% accurate, terrible but ok for a demonstration of linear discriminant analysis. Since we only have two-functions or two-dimensions we can plot our model.  Below I provide a visual of the first 50 examples classified by the predict.lda model.



The first function, which is the vertical line, doesn’t seem to discriminant anything as it off to the side and not separating any of the data. However, the second function, which is the horizontal one, does a good of dividing the “regular.with.aide” from the “small.class”. Yet, there are problems with distinguishing the class “regular” from either of the other two groups.  In order improve our model we need additional independent variables to help to distinguish the groups in the dependent variable.

Bagging in R

In this post, we will explore the potential of bagging. Bagging is a process in which the original data is boostrapped to make several different datasets. Each of these datasets are used to generate a model and voting is used to classify an example or averaging is used for numeric prediction.

Bagging is especially useful for unstable learners. These are algorithms who generate models that can change a great deal when the data is modified a small amount. In order to complete this example, you will need to load the following packages, set the seed, as well as load the dataset “Wages1”. We will be using a decision tree that was developed in an earlier post. Below is the initial code

library(caret); library(Ecdat);library(ipred);library(vcd)

We will now use the “bagging” function from the “ipred” package to create our model as well as tell R how many bags to make.


Next, we will make our predictions. Then we will check the accuracy of the model looking at a confusion matrix and the kappa statistic. The “kappa” function comes from the “vcd” package.

bagPred<-predict(theBag, Wages1)
keep<-table(bagPred, Wages1$sex)
## bagPred  female male
##   female   1518   52
##   male       51 1673
##             value      ASE     z Pr(>|z|)
## Unweighted 0.9373 0.006078 154.2        0
## Weighted   0.9373 0.006078 154.2        0

The results appearing exciting with almost 97% accuracy. In addition, the Kappa was almost 0.94 indicating a well-fitted model. However, in order to further confirm this we can cross-validate the model instead of using bootstrap aggregating as bagging does. Therefore we will do a 10-fold cross-validation using the functions from the “caret” package. Below is the code.

ctrl<-trainControl(method="cv", number=10)
trainModel<-train(sex~.,data=Wages1, method="treebag",trControl=ctrl)
## Bagged CART 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2965, 2965, 2965, 2964, 2964, 2964, ... 
## Resampling results
##   Accuracy   Kappa       Accuracy SD  Kappa SD  
##   0.5504128  0.09712194  0.02580514   0.05233441

Now the results are not so impressive. In many ways the model is terrible. The accuracy has fallen significantly and the kappa is almost 0. Remeber that cross-validation is an indicator of future performance. This means that our current model would not generalize well to other datasets.

Bagging is not limited to decision trees and can be used for all machine learning models. The example used in this post was one that required the least time to run. For real datasets, the processing time can be quite long for the average laptop.

Ensemble Learning for Machine Models

One way to improve a machine learning model is to not make just one model. Instead, you can make several models  that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is know as ensemble learning. This post will provide insights into ensemble learning as they are used in developing machine models.

The Major Challenge

The biggest challenges in creating an ensemble of models is deciding what models to develop and how the various models are combined to make predictions. To deal with these challenges involves the use of training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive a subset of the data or limit how many features each model can use. However, if several different algorithms are used the allocation function may pass all the data to each model with making any changes.

After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function.

The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used which means that if 5 models where develop and 3 vote “yes” than the example is classified as a yes. Another option is to weight the models so that some have more influence then others in the final predictions.

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. With the combine strengths of many different models and or algorithms it is difficult to go wrong

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.


Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.


Developing a Customize Tuning Process in R

In this post, we will learn how to develop customize criteria for tuning a machine learning model using the “caret” package. There are two things that need to be done in order to complete assess a model using customized features. These two steps are…

  • Determine the model evaluation criteria
  • Create a grid of parameters to optimize

The model we are going to tune is the decision tree model made in a previous post with the C5.0 algorithm. Below is code for loading some prior information.

library(caret); library(Ecdat)


We are going to begin by using the “trainControl” function to indicate to R what re-sampling method we want to use, the number of folds in the sample, and the method for determining the best model. Remember, that there are many more options but these are the onese we will use. All this information must be saved into a variable using the “trainControl” function. Later, the information we place into the variable will be used when we rerun our model.

For our example, we are going to code the following information into a variable we will call “chck” for re sampling we will use k-fold cross-validation. The number of folds will be set to 10. The criteria for selecting the best model will be the through the use of the “oneSE” method. The “oneSE” method selects the simplest model within one standard error of the best performance. Below is the code for our variable “chck”

chck<-trainControl(method = "cv",number = 10, selectionFunction = "oneSE")

For now this information is stored to be used later


We now need to create a grid of parameters. The grid is essential the characteristics of each model. For the C5.0 model we need to optimize the model, number of trials, and if winnowing was used. Therefore we will do the following.

  • For model, we want decision trees only
  • Trials will go from 1-35 by increments of 5
  • For winnowing, we do not want any winnowing to take place.

In all we are developing 8 models. We know this based on the trial parameter which is set to 1, 5, 10, 15, 20, 25, 30, 35. To make the grid we use the “expand.grid” function. Below is the code.

modelGrid<-expand.grid(.model ="tree", .trials= c(1,5,10,15,20,25,30,35), .winnow="FALSE")


We are now ready to generate our model. We will use the kappa statistic to evaluate each model’s performance

customModel<- train(sex ~., data=Wages1, method="C5.0", metric="Kappa", trControl=chck, tuneGrid=modelGrid)
## C5.0 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2966, 2965, 2964, 2964, 2965, 2964, ... 
## Resampling results across tuning parameters:
##   trials  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    1      0.5922991  0.1792161  0.03328514   0.06411924
##    5      0.6147547  0.2255819  0.03394219   0.06703475
##   10      0.6077693  0.2129932  0.03113617   0.06103682
##   15      0.6077693  0.2129932  0.03113617   0.06103682
##   20      0.6077693  0.2129932  0.03113617   0.06103682
##   25      0.6077693  0.2129932  0.03113617   0.06103682
##   30      0.6077693  0.2129932  0.03113617   0.06103682
##   35      0.6077693  0.2129932  0.03113617   0.06103682
## Tuning parameter 'model' was held constant at a value of tree
## Tuning parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 5, model = tree
##  and winnow = FALSE.

The actually output is similar to the model that “caret” can automatically create. The difference here is that the criteria was set by us rather than automatically. A close look reveals that all of the models perform poorly but that there is no change in performance after ten trials.


This post provided a brief explanation of developing a customize way of assessing a models performance. To complete this, you need configure your options as well as setup your grid in order to assess a model. Understanding the customization process for evaluating machine learning models is one of the strongest ways to develop supremely accurate models that retain generalizability.

Developing an Automatically Tuned Model in R

In this post, we are going to learn how to use the “caret” package to automatically tune a machine learning model. This is perhaps the simplest way to evaluate the performance of several models. In a later post, we will explore how to perform custom tuning to a model.

The model we are trying to tune is the decision tree we made using the C5.0 algorithm in a previous post. Specifically we were trying to predict sex based on the variables available in the “Wages1” dataset in the “Ecdat” package.

In order to accomplish our goal we will need to load the “caret” and “Ecdat”package, load the “Wages1” dataset as well as set the seed. Setting the seed will allow us to reproduce our results. Below is the code for these steps.

library(caret); library(Ecdat)

We will now build and display our model using the code below.

tuned_model<-train(sex ~., data=Wages1, method="C5.0")
## C5.0 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3294, 3294, 3294, 3294, 3294, 3294, ... 
## Resampling results across tuning parameters:
##   model  winnow  trials  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   rules  FALSE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules  FALSE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules  FALSE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules   TRUE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   tree   FALSE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree   FALSE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree   FALSE   20      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree    TRUE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE   20      0.5930511  0.1855230  0.01637060   0.03177075
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were trials = 10, model = rules
##  and winnow = TRUE.

There is a lot of information that is printed out. The first column is the type of model developed. Two types of models were developed either a rules-based classification tree or a normal decision tree. Next, is the winnow column. This column indicates if a winnowing process was used to remove poor predictor variables.

The next two columns are accuracy and kappa which have been explained previously. The last two columns are the standard deviations of accuarcy and kappa. None of the models are that good but the purpose here is for teaching.

At the bottom of the printout, r tells you which model was the best. For us, the best model was the fifth model from the top which was a rule-based, 10 trial model with winnow set to “TRUE”.

We will now use the best model (the caret package automatically picks it) to make predictions on the training data. We will also look at the confusion matrix of the correct classification followed by there proportions. Below is the code.

predict_model<-predict(tuned_model, Wages1)
table(predict_model, Wages1$sex)
## predict_model female male
##        female    936  590
##        male      633 1135
prop.table(table(predict_model, Wages1$sex))
## predict_model    female      male
##        female 0.2841530 0.1791135
##        male   0.1921676 0.3445659

In term of prediction, the model was correct 62% of the time (.28 + .34 = .62). If we want to know, can also see the probabilities for each example using the following code.

probTable<-(predict(tuned_model, Wages1, type="prob"))
##      female       male
## 1 0.6191287 0.38087132
## 2 0.2776770 0.72232303
## 3 0.2975327 0.70246734
## 4 0.7195866 0.28041344
## 5 1.0000000 0.00000000
## 6 0.9092993 0.09070072


In this post, we looked at an automated way to determine the best model among many using the “caret” package. Understanding how to improve the performance of a model is critical skill in machine learning.

Improving the Performance of Machine Learning Model

For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do, how to specify the model, among other things is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance.

This post will serve as an introduction to understanding how to improving model performance. In particular, we will look at the following

  • When it is necessary to improve performance
  • Parameter tuning

When to Improve

It is not always necessary to try and improve the performance of a model. There are times when a model does well and you know this through the evaluating it. If the commonly used measures are adequate there is no cause for concern.

However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle/unclear relationships can make improvement necessary. Normally, real-world data has the problems so model improvement is usually necessary.

Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times and also brute trial-and-error effort as well. The point is that there is no singular agreed upon way to improve a model. It is better to focus on explaining how you did it if necessary.

Parameter Tuning

Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in r through the use of the “caret” package.

When trying to decide what to do when tuning parameters it is important to remember the following.

  • What machine learning model and algorithm you are using for your data.
  • Which parameters you can adjust.
  • What criteria you are using to evaluate the model

Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning, those that classify, those that employ regression, and those that can do both. Understanding this helps you to make decision about what you are trying to do.

Next, you need to understand what exactly you or r are adjusting when analyzing the model. For example, for C5.0 decision trees “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved.

Lastly, it is important to know what criteria you are using to compare the various models. For classifying models you can look at the kappa and the various information derived from the confusion matrix. For regression based models you may look at the r-square, the RMSE (Root mean squared error), or the ROC curve.


As you can perhaps tell there is an incredible amount of choice and options in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making.

In a future post, we will look at an example of model improvement.

K-Fold Cross-Validation

In this post, we are going to look at k-fold cross-validation and its use in evaluating models in machine learning.

K-fold cross-validation is use for determining the performance of statistical models. How it works is the data is divided into a pre-determined number of folds (called ‘k’). One fold is used to determine the model estimates and the other folds are used for evaluating. This is done k times and the results are average based on a statistic such as kappa to see how the model performs.

In our example we are going to review a model we made using the C5.0 algorithm. In that post, we were trying to predict gender based on several other features.

First, we need to load several packages into R as well as the dataset we are going to use. All of this is shared in the code below


We now need to set the seed. This is important for allowing us to reproduce the results. Every time a k-fold is perform the results can be slightly different but setting the seed prevents this. The code is as follows


We will now create are folds. How many folds to create is up to the researcher. For us, we are going to create ten folds. What this means is that R will divide are sample into ten equal parts. To do this we use the “createFolds” function from the “caret” package. After creating the folds, we will view the results using the “str” function which will tell us how many examples are in each fold. Below is the code to complete this.

folds<-createFolds(Wages1$sex, k=10)
## List of 10
##  $ Fold01: int [1:328] 8 13 18 37 39 57 61 67 78 90 ...
##  $ Fold02: int [1:329] 5 27 47 48 62 72 76 79 85 93 ...
##  $ Fold03: int [1:330] 2 10 11 31 32 34 36 64 65 77 ...
##  $ Fold04: int [1:330] 19 24 40 45 55 58 81 96 99 102 ...
##  $ Fold05: int [1:329] 6 14 28 30 33 84 88 91 95 97 ...
##  $ Fold06: int [1:330] 4 15 16 38 43 52 53 54 56 63 ...
##  $ Fold07: int [1:330] 1 3 12 22 29 50 66 73 75 82 ...
##  $ Fold08: int [1:330] 7 21 23 25 26 46 51 59 60 83 ...
##  $ Fold09: int [1:329] 9 20 35 44 49 68 74 94 100 105 ...
##  $ Fold10: int [1:329] 17 41 42 71 101 107 117 165 183 184 ...

As you can see, we normally have about 330 examples per fold.

In order to get the results that we need. We have to take fold 1 to make the model and fold 2-10 to evaluate it. We repeat this process until every combination possible is used. First fold 1 is used and 2-10 are the test data, then fold 2 is used and then folds 1, 3-10 are the test data etc.  Manually coding this would take great deal of time. To get around this we will use the “lapply” function.

Using “lapply” we will create a function that takes “x” (one of our folds) and makes it the “training set” shown here as “Wages1_train”. Next we assigned the rest of the folds to be the “test” (Wages1_test) set as depicted with the “-x”. The next two lines of code should look familiar as it is the code for developing a decision tree. The “Wages_actual” are the actual labels for gender in the “Wages_1” testing set.

The “kappa2” function is new and it comes form the “irr” package. The kappa statistic is a measurement of accuracy of a  model while taking into account chance. The closer the value is to 1 the better. Below is the code for what has been discussed.

results<-lapply(folds, function(x) {
        Wages1_train<-Wages1[x, ]
        Wages1_test<-Wages1[-x, ]
        Wages1_pred<-predict(Wages1_model, Wages1_test)
        Wages1_kappa<-kappa2(data.frame(Wages1_actual, Wages1_pred))$value

To get our results, we will use the “str” function again to display them. This will tell us the kappa for each fold. To really see how are model does we need to calculate the mean kappa of the ten models. This is done with the “unlist” and “mean” function as shown below

## List of 10
##  $ Fold01: num 0.205
##  $ Fold02: num 0.186
##  $ Fold03: num 0.19
##  $ Fold04: num 0.193
##  $ Fold05: num 0.202
##  $ Fold06: num 0.208
##  $ Fold07: num 0.196
##  $ Fold08: num 0.202
##  $ Fold09: num 0.194
##  $ Fold10: num 0.204
## [1] 0.1978915

The final mean kappa was 0.19 which is really poor. It indicates that the model is no better at predicting then chance alone. However, for illustrative purposes we now understand how to perfrom a k-fold cross-validation

Receiver Operating Characteristic Curve

The receiver operating characteristic curve (ROC curve) is a tool used in statistical research to assess the trade-off of detecting true positives and true negatives. The origins of this tool goes all the way back to WWII when engineers were trying to distinguish between true and false alarms. Now this technique is used in machine learning

This post will explain the ROC curve and provide and example using R.

Below is a diagram of an ROC curve


On the X axis we have the false positive rate. As you move to the right the false positive rate increases which is bad. We want to be as close to zero as possible.

On the y axis we have the true positive rate. Unlike the x axis we want the true positive rate to be as close to 100 as possible. In general we want a low value on the x-axis and a high value on the y-axis.

In the diagram above, the diagonal line called “Test without diagnostic benefit” represents a model that cannot tell the difference between true and false positives. Therefore, it is not useful for our purpose.

The L-shaped curve call “Good diagnostic test” is an example of an excellent model. This is because  all the true positives are detected .

Lastly, the curved-line called “Medium diagonistic test” represents an actually model. This model is a balance between the perfect L-shaped model and the useless straight-line model. The curved-line model is able to moderately distinguish between false and true positives.

Area Under the ROC Curve

The area under an ROC curve is literally called the “Area Under the Curve” (AUC). This area is calculated with a standardized value ranging from 0 – 1. The closer to 1 the better the model

We will now look at an analysis of a model using the ROC curve and AUC. This is based on the results of a post using the KNN algorithm for nearest neighbor classification. Below is the code

predCollege <- ifelse(College_test_pred=="Yes", 1, 0)
realCollege <- ifelse(College_test_labels=="Yes", 1, 0)
pr <- prediction(predCollege, realCollege)
collegeResults <- performance(pr, "tpr", "fpr")
plot(collegeResults, main="ROC Curve for KNN Model", col="dark green", lwd=5)
abline(a=0,b=1, lwd=1, lty=2)
aucOfModel<-performance(pr, measure="auc")
  1. The first to variables (predCollege & realCollege) is just for converting the values of the prediction of the model and the actual results to numeric variables
  2. The “pr” variable is for storing the actual values to be used for the ROC curve. The “prediction” function comes from the “ROCR” package
  3. With the information information of the “pr” variable we can now analyze the true and false positives, which are stored in the “collegeResults” variable. The “performance” function also comes from the “ROCR” package.
  4. The next two lines of code are for plot the ROC curve. You can see the results below


6. The curve looks pretty good. To confirm this we use the last two lines of code to calculate the actually AUC. The actual AUC is 0.88 which is excellent. In other words, the model developed does an excellent job of discerning between true and false positives.


The ROC curve provides one of many ways in which to assess the appropriateness of a model. As such, it is yet another tool available for a person who is trying to test models.


Using Confusion Matrices to Evaluate Performance

The data within a confusion matrix can be used to calculate several different statistics that can indicate the usefulness of a statistical model in machine learning. In this post, we will look at several commonly used measures, specifically…

  • accuracy
  • error
  • sensitivity
  • specificity
  • precision
  • recall
  • f-measure


Accuracy is probably the easiest statistic to understand. Accuracy is the total number of items correctly classified divided by the total number of items below is the equation

accuracy =   TP + TN
                          TP + TN + FP  + FN

TP =  true positive, TN =  true negative, FP = false positive, FN = false negative

Accuracy can range in value from 0-1 with one representing 100% accuracy. Normally, you don’t want perfect accuracy as this is an indication of overfitting and your model will probably not do well with other data.


Error is the opposite of accuracy and represent the percentage of examples that are incorrectly classified it’s equation is as follows.

error =   FP + FN
                          TP + TN + FP  + FN

The lower the error the better in general. However, if error is 0 it indicates overfitting. Keep in mind that error is the inverse of accuracy. As one increases the other decreases.


Sensitivity is the proportion of true positives that were correctly classified.The formula is as follows

sensitivity =       TP
                       TP + FN

This may sound confusing but high sensitivity is useful for assessing a negative result. In other words, if I am testing people for a disease and my model has a high sensitivity. This means that the model is useful telling me a person does not have a disease.


Specificity measures the proportion of negative examples that were correctly classified. The formula is below

specificity =       TN
                       TN + FP

Returning to the disease example, a high specificity is a good measure for determining if someone has a disease if they test positive for it. Remember that no test is foolproof and there are always false positives and negatives happening. The role of the researcher is to maximize the sensitivity or specificity based on the purpose of the model.


Precision is the proportion of examples that are really positive. The formula is as follows

precision =       TP
                       TP + FP

 The more precise a model is the more trustworthy it is. In other words, high precision indicates that the results are relevant.


Recall is a measure of the completeness of the results of a model. It is calculated as follows

recall =       TP
                       TP + FN

This formula is the same as the formula for sensitivity. The difference is in the interpretation. High recall means that the results have a breadth to them such as in search engine results.


The f-measure uses recall and precision to develop another way to assess a model. The formula is below

sensitivity =      2 * TP
                       2 * TP + FP + FN

The f-measure can range from 0 – 1 and is useful for comparing several potential models using one convenient number.


This post provide a basic explanation of various statistics that can be used to determine the strength of a model. Through using a combination of statistics a researcher can develop insights into the strength of a model. The only mistake is relying exclusively on any single statistical measurement.

Using Probability of the Prediction to Evaluate a Machine Learning Model

In evaluating a model when employing machine learning techniques, there are three common types of data used for evaluation.

  • The actual classification values
  • The predicted classification values
  • The estimated probability of the prediction

The first two types of data (actual and predicted) are used for assessing the accuracy of a model in several different ways such as error rate, sensitivity, specificity, etc.

The benefit of the probabilities of prediction is that it is a measure of a model’s confidence in its prediction. If you need to compare to models and one is more confident in it’s prediction of its classification of examples, the more confident model is the better learner.

In this post, we will look at examples of the probability predictions of several models that have been used in this blog in the past.

Prediction Probabilities for Decision Trees

Our first example come from the decision tree we made using the C5.0 algorithm. Below is the code for calculating the probability of the correct classification of each example in the model followed by an output of the first

Wage_pred_prob<-predict(Wage_model, Wage_test, type="prob")
        female      male
497  0.2853016 0.7146984
1323 0.2410568 0.7589432
1008 0.5770177 0.4229823
947  0.6834378 0.3165622
695  0.5871323 0.4128677
1368 0.4303364 0.5696636

The argument “type” is added to the “predict” function so that R calculates the probability that the example is classified correctly. A close look at the results using the “head” function provides a list of 6 examples from the model.

  • For example 497, there is a 28.5% probability that this example is female and a 71.5% probability that this example is male. Therefore, the model  predicts that this example is male.
  • For example 1322, there is a 24% probability that this example is female and a 76% probability that this example is male. Therefore, the model  predicts that this example is male.
  • etc.

Prediction Probabilities for KNN Nearest Neighbor

Below is the code for finding the probilities for KNN algorithm.

College_test_pred_prob<-knn(train=College_train, test=College_test, 
+                        cl=College_train_labels, k=27, prob=TRUE)

The print for this is rather long. However, you can match the predict level with the actual probability by looking carefully at the data.

  • For example 1, there is a 77% probability that this example is a yes and a 23% probability that this example is a no. Therefore, the model  predicts that this example as yes.
  • For example 2, there is a 71% probability that this example is no and a 29% probability that this example is yes. Therefore, the model  predicts that this example is a no.


One of the primary purposes of the probabilities option is in comparing various models that are derived from the same data. This information combined with other techniques for evaluating models can help a researcher in determining the most appropriate model of analysis.

Kmeans Analysis in R

In this post, we will conduct a kmeans analysis on some data on student alcohol consumption. The data is available at the UC Irvine Machine Learning Repository and is available at https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

We want to see what segments are within the sample of students who participated in this study on several factors in addition to alcohol consumption. Understanding the characteristics that groups of students have in common could be useful in reaching out to them for various purposes.

We will begin by loading the “stats” package for the kmeans analysis. Then we will combine the data at it is in two different files and we will explore the data using tables and histograms. I will not print the results of the exploration of the data here as there are too many variables. Lastly, we need to set the seed in order to get the same results each time. The code is still found below.

student.mat <- read.csv("~/Documents/R working directory/student-mat.csv", sep=";")
student.por <- read.csv("~/Documents/R working directory/student-por.csv", sep=";")
student_alcohol <- rbind(student.mat, student.por)
options(digits = 2)
  • str(student_alcohol)
  • hist(student_alcoholage)
  • table(studentalcoholage)
  • table(studentalcoholaddress)
  • table(student_alcoholfamsize)
  • table(studentalcoholfamsize)
  • table(studentalcoholPstatus)
  • hist(student_alcoholMedu)
  • hist(studentalcoholMedu)
  • hist(studentalcoholFedu)
  • hist(student_alcoholtraveltime)
  • hist(studentalcoholtraveltime)
  • hist(studentalcoholstudytime)
  • hist(student_alcoholfailures)
  • table(studentalcoholfailures)
  • table(studentalcoholschoolsup)
  • table(student_alcoholfamsup)
  • table(studentalcoholfamsup)
  • table(studentalcoholpaid)
  • table(student_alcoholactivities)
  • table(studentalcoholactivities)
  • table(studentalcoholnursery)
  • table(student_alcoholhigher)
  • table(studentalcoholhigher)
  • table(studentalcoholinternet)
  • hist(student_alcoholfamrel)
  • hist(studentalcoholfamrel)
  • hist(studentalcoholfreetime)
  • hist(student_alcoholgoout)
  • hist(studentalcoholgoout)
  • hist(studentalcoholDalc)
  • hist(student_alcoholWalc)
  • hist(studentalcoholWalc)
  • hist(studentalcoholhealth)
  • hist(student_alcohol$absences)

The details about the variables can be found at the website link in the first paragraph of this post. The study look at students alcohol use and other factors related to school and family life.

Before we do the actual kmeans clustering we need to normalize the variables. This is because are variables are measured using different scales. Some are Likert with 5 steps, while others are numeric going from 0 to over 300. The different ranges have an influence on the results. To deal with this problem we will use the “scale” for the variables that will be included in the analysis. Below is the code


We also need to create a matrix in order to deal with the factor variables. All factor variables need to be converted so that they have dummy variables for the analysis. To do this we use the “matrix” function as shown in the code below.

student_alcohol_clean_matrix<-(model.matrix(~.+0, data=student_alcohol_clean))

We are now ready to conduct our kmeans cluster analysis using the “kmeans” function. We have to determine how many clusters to develop before the analysis. There are statistical ways to do this but another method is domain knowledge. Since we are dealing with teenagers, it is probably that there will be about four distinct groups because of how high school is structured. Therefore, we will use four segments for our analysis. Our code is below.

alcohol_cluster<-kmeans(student_alcohol_clean_matrix, 4)

To view the results we need to view two variables in are “alcohol_cluster” list. The “size” variable will tell us how many people are in each cluster and the “centers” variable describes a clusters characteristics on a particular variable. Below is the code

alcohol_cluster$size # size of clusters
## [1] 191 381 325 147
alcohol_cluster$centers #center of clusters
##   schoolGP schoolMS sexM   age address famsize Pstatus  Medu  Fedu
## 1     0.74     0.26 0.70  0.06  0.0017   0.265  -0.030  0.25  0.29
## 2     0.69     0.31 0.27 -0.15 -0.1059  -0.056  -0.031 -0.53 -0.45
## 3     0.88     0.12 0.45 -0.13  0.3363   0.005   0.016  0.73  0.58
## 4     0.56     0.44 0.48  0.59 -0.4712  -0.210   0.086 -0.55 -0.51
##   Mjobhealth Mjobother Mjobservices Mjobteacher Fjobhealth Fjobother
## 1      0.079      0.30         0.27       0.199     0.0471      0.54
## 2      0.031      0.50         0.17       0.024     0.0210      0.62
## 3      0.154      0.26         0.28       0.237     0.0708      0.48
## 4      0.034      0.46         0.22       0.041     0.0068      0.60
##   Fjobservices Fjobteacher reasonhome reasonother reasonreputation
## 1         0.33       0.042       0.27       0.120             0.20
## 2         0.25       0.031       0.27       0.113             0.20
## 3         0.28       0.123       0.24       0.077             0.34
## 4         0.29       0.034       0.17       0.116             0.15
##   guardianmother guardianother traveltime studytime failures schoolsup
## 1           0.69         0.079       0.17     -0.32    -0.12    -0.079
## 2           0.70         0.052       0.10      0.10    -0.26     0.269
## 3           0.73         0.040      -0.37      0.29    -0.35    -0.213
## 4           0.65         0.170       0.33     -0.49     1.60    -0.123
##   famsup   paid activities nurseryyes higheryes internet romanticyes
## 1 -0.033  0.253     0.1319       0.79      0.92    0.228        0.34
## 2 -0.095 -0.098    -0.2587       0.76      0.93   -0.387        0.35
## 3  0.156  0.079     0.2237       0.88      1.00    0.360        0.31
## 4 -0.057 -0.250     0.0047       0.73      0.70   -0.091        0.49
##   famrel freetime goout  Dalc  Walc health absences    G1    G2     G3
## 1 -0.184     0.43  0.76  1.34  1.29  0.273    0.429 -0.23 -0.17 -0.129
## 2 -0.038    -0.31 -0.35 -0.37 -0.40 -0.123   -0.042 -0.17 -0.12 -0.053
## 3  0.178    -0.01 -0.14 -0.41 -0.35 -0.055   -0.184  0.90  0.87  0.825
## 4 -0.055     0.25  0.22  0.11  0.14  0.087   -0.043 -1.24 -1.40 -1.518

The size of the each cluster is about the same which indicates reasonable segmentation of the sample. The output for “centers” tells us how much above or below the mean a particular cluster is. For example, for the variable “age” we see the following

1    0.06
2   -0.15   
3   -0.13 
4    0.59

What this means is that people in cluster one have an average age 0.06 standard deviations above the mean, cluster two is -0.14 standard deviations below the mean etc. To give our clusters meaning we have to look at the variables and see which one the clusters are extremely above or below the mean. Below is my interpretation of the clusters. The words in parenthesis is the variable from which I made my conclusion

Cluster 1 doesn’t study much (studytime), lives in the biggest families (famsize), requires litle school support (schoolsup), has a lot of free time (freetime), and consumes the most alcohol (Dalc, Walc), lives in an urban area (address), loves to go out the most (goout), and has the most absences. This is the underachieving party alcoholics of the sample

Cluster 2 have parents that are much less educated (Medu, Fedu), requires the most school support (schoolsup), while receiving the less family support (famsup), have the least involvement in extra-curricular activities (activities), has the least internet access at home (internet), socialize the least (goout), and lowest alcohol consumption. (Dalc, Walc) This cluster is the unsupported non-alcoholic loners of the sample

Cluster 3 has the most educated parents (Medu, Fedu), live in an urban setting (address) choose their school based on reputation (reasonreputation), have the lowest travel time to school (traveltime), study the most (studytime), rarely fail a course (failures), have the lowest support from the school while having the highest family support and family relationship (schoolsup, famsup, famrel), most involve in extra-curricular activities (activities), best internet acess at home (internet), least amount of free time (freetime) low alcohol consumption (Dalc, Walc). This cluster represents the affluent non-alcoholic high achievers.

CLuster 4 is the oldest (age), live in rural setting (address), has the smallest families (famsize), the least educated parents (Medu, Fedu), spends the most time traveling to school (traveltime), doesnt study much (studytime), has the highest failure rate (failures), never pays for extra classes (paid), most likely to be in a relationship (romanticyes), consumes alcohol moderately (Dalc, Walc), does poorly in school (G3). These students are the students in the greatest academic danger.

To get better insights, we can add the cluster results to our original dataset that was not normalize we can then identify what cluster each student belongs to ad calculate unstandardized means if we wanted.

student_alcohol$cluster<-alcohol_cluster$cluster # add clusters back to original data normalize does not mean much
aggregate(data=student_alcohol, G3~cluster, mean)
##   cluster   G3
## 1       1 10.8
## 2       2 11.1
## 3       3 14.5
## 4       4  5.5

The “aggregate function” tells us the average for each cluster on this question. We could do this for all of our variables to learn more about our clusters.

Kmeans provides a researcher with an understanding of the homogeneous characteristics of individuals within a sample. This information can be used to develop intervention plans in education or marketing plans in business. As such, kmeans is another powerful tool in machine learning

K-Means Clustering

There are times in research when you neither want to predict nor classify examples. Rather, you want to take a dataset and segment the examples within the dataset so that examples with similar characteristics are gather together.

When your goal is to great several homogeneous groups within a sample it involves clustering. One type of clustering used in machine learning is k-means clustering. In this post we will learn the following about k-means clustering.

  • The purpose of k-means
  • Pros and cons of k-means

The Purpose of K-means

K-means has a number of applications. In the business setting, k-means has been used to segment customers. Businesses use this information to adjusts various aspects of their strategy for reaching their customers.

Another purpose for k-means is in simplifying large datasets so that you have several smaller datasets with similar characteristics. This subsetting could be useful in finding distinct patterns

K-means is a form of unsupervised classification. This means that the results label examples that the researcher must give meaning too. When R gives the results of an analysis it just labels the clusters as 1,2,3 etc. It is the researchers job to look at the clusters and give a qualitative meaning to them.

Pros and Cons of K-Means

The pros of k-means is that it is simple, highly flexible, and efficient. The simplicity of k-means makes it easy to explain the results in contrast to artificial neural networks or support vector machines. The flexibility of k-means allows for easy adjust if there are problems. Lastly, the efficiency of k-means implies that the algorithm is good at segmenting a dataset.

Some drawbacks to k-means is that it does not allows develop the most optimal set of clusters and that the number of clusters to make must be decided before the analysis. When doing the analysis, the k-means algorithm will randomly selecting several different places from which to develop clusters. This can be good or bad depending on where the algorithm chooses to begin at. From there, the center of the clusters is recalculated until an adequate “center” is found for the number of clusters requested.

How many clusters to include is left at the discretion of the researcher. This involves a combination of common sense, domain knowledge, and statistical tools. Too many clusters tells you nothing because the groups becoming very small and there are too many of them.

There are statistical tools that measure within group homogeneity and with group heterogeneity. In addition, there is a technique called a dendrogram. The results of a dendrogram analysis provides a recommendation of how many clusters to use. However, calculating a dendrogram for a large dataset could potential crash a computer due to the computational load and the limits of RAM.


K-means is an analytical tool that helps to separate apples from oranges to give you one example. If you are in need of labeling examples based on the features in the dataset this method can be useful.

Market Basket Analysis in R

In this post, we will conduct a market basket analysis on the shopping habits of people at a grocery store. Remember that a market basket analysis provides insights through indicating relationships among items that are commonly purchased together.

The first thing we need to do is load the package that makes association rules, which is the “arules” package. Next, we need to load our dataset groceries. This dataset is commonly used as a demonstration for market basket analysis.

However, you don’t won’t to load this dataset as dataframe because it leads to several technical issues during the analysis. Rather you want to load it as a sparse matrix. The function for this is “read.transactions” and is available in the “arules” pacakge

## Loading required package: Matrix
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##     abbreviate, write
#make sparse matrix
groceries<-read.transactions("/home/darrin/Documents/R working directory/Machine-Learning-with-R-datasets-master/groceries.csv", sep = ",")

Please keep in mind that the location of the file on your computer will be different from my hard drive.

We will now explore the data set by using several different functions. First, we will use the “summary” function as indicated below.

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The output tells us the number of rows in our dataset (9835) columns (169) as well as the density, which is the percentage of columns that are not empty (2.6%). This may seem small but remember that the number of purchases varies from person to person so this affects how many empty columns there are.

Next, we have the most commonly purchased items. Milk and other vegetables were the two most common followed by other foods. After the most frequent items we have the size of each transaction. For example, 2159 people purchased one item during a transaction. While one person purchased 32 items in a transaction.

Lastly, we summary statistics about transactions. On average, a person would purchased 4.4 items per transaction.

We will now look at the support of different items. Remember, that the support is the frequency of an item in the dataset. We will use the “itemFrequencyPlot” function to do this and we will add the argument “topN” to sort the items from most common to less for the 15 most frequent transactions. Below is the code

itemFrequencyPlot(groceries, topN=15)

download The plot that is produce gives you an idea of what people were purchasing. We will now attempt to develop association rules using the “apriori” function.

For now we will use the default settings for support and confidence (confidence is the proportion of transactions that have they same item(s)). The default for support is 0.1 and for confidence it is 0.8. Below is the code.

## Apriori
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## Absolute minimum support count: 983 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

As you can see from the printout, nothing meets the criteria of a support of 0.1 and confidence of 0.8. How to play with these numbers is a matter of experience as there are few strong rules for this matter. Below, I set the support to 0.006, confidence to 0.25, and the minimum number of rules items to 2. The support of 0.006 means that this item must have been purchased at least 60 times out of 9835 items and the confidence of 0.25 means that rule needs to be accurate 25% of the time. Lastly, I want at least two items in each rule that is produce as indicated by minlen = 2. Below is the code with the “summary” as well.

groceriesrules<-apriori(groceries, parameter = list(support=0.006, confidence = 0.25, minlen=2))
## Apriori
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE   0.006      2     10
##  target   ext
##   rules FALSE
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## Absolute minimum support count: 59 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 463 rules
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

Are current analysis has 463 rules. This is a major improvement from 0. We can also see how many rules contain 2 (150), 3 (297), and 4 items (16). We also have a summary of of the average number of items per rule. The next is descriptive stats on the support and confidence of the rules generated.

Something that is new for us is the “lift” column. Lift is a measure how much more likely an item is to be purchased above its number rate. Anything above 1 means that the likelihood of purchase is higher than chance.

We are now going to look for useful rules from our dataset. We are looking for rules that we can use to make decisions. It takes industry experiences in the field of your data to really glean useful rules. For now, we can only determine this statistically by sorting the rules by their lift. Below is the code for this

inspect(sort(groceriesrules, by="lift")[1:7])
##   lhs             rhs                     support    confidence  lift
## 1 {herbs}         => {root vegetables}    0.007015760  0.4312500 3.956
## 2 {berries}       => {whipped/sour cream} 0.009049314  0.2721713 3.796
## 3 {other vegetables,                                                  ##    tropical fruit,                                                    
##    whole milk}    => {root vegetables}    0.007015760  0.4107143 3.768
## 4 {beef,                                                              
##    other vegetables} => {root vegetables} 0.007930859  0.4020619 3.688
## 5 {other vegetables,                                                  
##    tropical fruit} => {pip fruit}         0.009456024  0.2634561 3.482
## 6 {beef,                                                              
##    whole milk}    => {root vegetables}    0.008032537  0.3779904 3.467
## 7 {other vegetables,                                                  
##    pip fruit}     => {tropical fruit}     0.009456024  0.3618677 3.448

The first three rules rules are translated into simply English below as ordered by lift.

  1. If herbs are purchased then root vegetables are purchased
  2. If berries are purchased then whipped sour cream is purchased
  3. If other vegetables, tropical fruit and whole milk are purchased then root vegetables are purchased


Since we are making no predictions, there is no way to really objectively improve the model. This is normal when the learning is unsupervised. If we had to make a recommendation based on the results we could say that the store should place all vegetables near each other.

The power of market basket analysis is allowing the researcher to identify relationships that may not have been noticed any other way. Naturally, insights gained from this approach must be use for practical actions in the setting in which they apply.

Understanding Market Basket Analysis

Market basket analysis a machine learning approach that attempts to find relationships among a group of items in a data set. For example, a famous use of this method was when retailers discovered an association between beer and diapers.

Upon closer examination, the retailers found that when men came to purchase diapers for their babies they would often buy beer in the same trip. With this knowledge, the retailers placed beer and diapers next to each other in the store and this further increased sales.

In addition, many of the recommendation systems we experience when shopping online use market basket analysis results to suggest additional products to us. As such, market basket analysis is an intimate part of our lives with us even knowing.

In this post, we will look at some of the details of market basket analysis such as association rules, apriori, and the role of support and confidence.

Association Rules

The heart of market basket analysis are association rules. Association rules explain patterns of relationship  among items. Below is an example

{rice, seaweed} -> {soy sauce}

Everything in curly braces { } is an itemset, which is some form of data that occurs often in the dataset based on a criteria. Rice and seaweed is our itemset on the left and soy sauce is our itemset on the right. The arrow -> indicates what comes first as we read from left to right. If we put this association rule in simply English it would say “if someone buys rice and seaweed then they will buy soy sauce”.

The practical application of this rule is to place rice, seaweed and soy sauce near each other in order to reinforce this rule when people come to shop.

The Algorithm

Market basket analysis uses a apriori algorithm. This algorithm is useful for unsupervised learning that does not require any training and thus no predictions. The apriori algorithm is especially useful with large datasets but it employs simple procedures to find useful relationships among the items.

The shortcut that this algorithm uses is the “apriori property” which states that all sugsets of a frequent itemset must also be frequent. What this means in simply English is that the items in an itemset need to be common in the overall dataset. This simple rule saves a tremendous amount of computational time.

Support and Confidence

To key pieces of information that can further refine the work of the apriori algorithm is support and confidence. Support is a measure of the frequency of an itemset ranging from 0 (no support) to 1 (highest support). High support indicates the importance of the itemset in the data and contributes to the itemset being used to generate association rule(s).

Returning to our rice, seaweed, and soy sauce example. We can say that the support for soy sauce is 0.4. This means that soy sauce appears in 40% of the purchases in the dataset which is pretty high.

Confidence is a measure of the accuracy of an association rule which is measured from 0 to 1. The higher the confidence the more accurate the association rule. If we say that our rice, seaweed, and soy sauce rule has a confidence of 0.8 we are saying that when rice and seaweed are purchased together, 80% of the time soy sauce is purchased as well.

Support and confidence can be used to influence the apriori algorithm by setting cutoff values to be searched for. For example, if we setting a minimum support of 0.5 and a confidence of 0.65 we are telling the computer to only report to us association rules that are above these cutoff points. This helps to remove useless rules that are obvious or useless.


Market basket analysis is a useful tool for mining information from large datasets. The rules are easy to understanding. In addition, market basket analysis can be used in many fields beyond shopping and can include relationships within DNA, and other forms of human behavior. As such, care must be made so that unsound conclusions are not drawn from random patterns in the data

Support Vector Machines in R

In this post, we will use support vector machine analysis to look at some data available on kaggle. In particular, we will predict what number a person wrote by analyze the pixels that were used to make the number. The file for this example is available at https://www.kaggle.com/c/digit-recognizer/data

To do this analysis you will need to use the ‘kernlab’ package. While playing with this dataset I noticed a major problem, doing the analysis with the full data set of 42000 examples took forever. To alleviate this problem. We are going to practice with a training set of 7000 examples and a test set of 3000. Below is the code for the first few sets. Remember that the dataset was download separately

#load packages
#split data 
#explore data
## 'data.frame':    7000 obs. of  785 variables:
##  $ label   : int  1 0 1 4 0 0 7 3 5 3 ...
##  $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel3  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel4  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel5  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel6  : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

From the “str” function you can tell we have a lot of variables (785). This is what slowed the analysis down so much when I tried to run the full 42000 examples in the original dataset.

SVM need a factor variable as the predictor if possible. We are trying to predict the “label” variable so we are going to change this to a factor variable because that is what it really is. Below is the code

#convert label variable to factor

Before we continue with the analysis we need to scale are variables. This makes all variables to be within the same given range which helps to equalizes the influence of them. However, we do not want to change our “label” variable as this is the predictor variable and scaling it would make the results hard to understand. Therefore, we are going to temporarily remove the “label” variable from both of our data sets and save them in a temporary data frame. The code is below.

#temporary dataframe for the label results
#null label variable in both datasets

Next, we scale the remaining variable and reinsert the label variables for each data set as show in our code below.

digitRedux[is.na(digitRedux)]<- 0 #replace NA with 0
digitReduxTest[is.na(digitReduxTest)]<- 0
#add back label

Now we make our model using the “ksvm” function in the “kernlab” package. We set the kernel to “vanilladot” which is a linear kernel. We will aslo print the results. However, the results do not make any sense on their own and the model can only be assess througn other means. Below is the code. If you get a warning message about scaling do not worry about this as we scaled the data ourselves.

#make the model
number_classify<-ksvm(label~.,  data=digitRedux, 
##  Setting default kernel parameters
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
#look at the results
## Support Vector Machine object of class "ksvm" 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## Linear (vanilla) kernel function. 
## Number of Support Vectors : 2218 
## Objective Function Value : -0.0623 -0.207 -0.1771 -0.0893 -0.3207 -0.4304 -0.0764 -0.2719 -0.2125 -0.3575 -0.2776 -0.1618 -0.3408 -0.1108 -0.2766 -1.0657 -0.3201 -1.0509 -0.2679 -0.4565 -0.2846 -0.4274 -0.8681 -0.3253 -0.1571 -2.1586 -0.1488 -0.2464 -2.9248 -0.5689 -0.2753 -0.2939 -0.4997 -0.2429 -2.336 -0.8108 -0.1701 -2.4031 -0.5086 -0.0794 -0.2749 -0.1162 -0.3249 -5.0495 -0.8051 
## Training error : 0

We now need to use the “predict” function so that we can determine the accuracy of our model. Remember that for predicting, we use the answers in the test data and compare them to what our model would guess based on what it knows.

number_predict<-predict(number_classify, digitReduxTest)
table(number_predict, digitReduxTest$label)
## number_predict   0   1   2   3   4   5   6   7   8   9
##              0 297   0   3   3   1   4   6   1   1   1
##              1   0 307   4   1   0   4   2   5  11   1
##              2   0   2 268  10   5   1   3  10  12   3
##              3   0   1   7 291   1  11   0   1   8   3
##              4   0   1   3   0 278   4   3   2   0   9
##              5   2   0   1  10   1 238   4   1  11   1
##              6   2   1   1   0   2   1 287   1   0   0
##              7   0   1   1   0   1   0   0 268   3  10
##              8   1   3   4  10   0  11   1   0 236   2
##              9   0   0   2   2   9   2   0  14   2 264
accuracy<-number_predict == digitReduxTest$label
## accuracy
##      FALSE       TRUE 
## 0.08866667 0.91133333

The table allows you to see how many were classified correctly and how they were misclassified. The prop.table allows you to see an overall percentage. This particular model was highly accurate at 91%. It would be difficult to improve further. Below is code for a model that is using a different kernel with results that are barely better. However, if you ever enter a data science competition any improve ususally helps even if it is not practical for everyday use.

number_classify_rbf<-ksvm(label~.,  data=digitRedux, 
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
#evaluate improved model
number_predict_rbf<-predict(number_classify_rbf, digitReduxTest)
table(number_predict_rbf, digitReduxTest$label)
## number_predict_rbf   0   1   2   3   4   5   6   7   8   9
##                  0 294   0   2   3   1   1   2   1   1   0
##                  1   0 309   1   0   1   0   1   3   4   2
##                  2   4   2 277  12   4   4   8   9   9   8
##                  3   0   1   3 297   1   3   0   1   3   3
##                  4   0   1   3   0 278   4   1   5   0   6
##                  5   0   0   1   6   1 254   4   0   9   1
##                  6   2   1   0   0   2   6 289   0   2   0
##                  7   0   0   3   2   3   0   0 277   2  13
##                  8   2   2   4   3   0   3   1   0 253   3
##                  9   0   0   0   4   7   1   0   7   1 258
accuracy_rbf<-number_predict_rbf == digitReduxTest$label
## accuracy_rbf
##      FALSE       TRUE 
## 0.07133333 0.92866667


From this demonstration we can see the power of support vector machines with numerical data. This type of analysis can be used for things beyond the conventional analysis and can be used to predict things such as hand written numbers. As such, SVM is yet another tool available for the data scientist.

Basics of Support Vector Machines

Support vector machines (SVM) is another one of those mysterious black box methods in machine learning. This post will try to explain in simple terms what SVM are and their strengths and weaknesses.


SVM is a combination of nearest neighbor and linear regression. For the nearest neighbor, SVM uses the traits of an identified example to classify an unidentified one. For regression, a line is drawn that divides the various groups.It is preferred that the line is straight but this is not always the case

This combination of using the nearest neighbor along with the development of a line leads to the development of a hyperplane. The hyperplane is drawn in a place that creates the greatest amount of distance among the various groups identified.

The examples in each group that are closest to the hyperplane are the support vectors. They support the vectors by providing the boundaries for the various groups.

If for whatever reason a line cannot be straight because the boundaries are not nice and night. R will still draw a straight line but make accommodations through the use of a slack variable, which allow for error and or for examples to be in the wrong group.

Another trick used in SVM analysis is the kernel trick. A kernel will add a new dimension or feature to the analysis by combining features that were measured in the data. For example, latitude and lonigitude might be combine mathematically to make altitude. This new feature is now used to develop the hyperplane for the data.

There are several different types of kernel tricks that achieve their goal using various mathematics. There is no rule for which one to use and playing different choices is the only strategy currently.

Pros and Cons

The pros of SVM is their flexibility of use as they can be used to predict numbers or classify. SVM are also able to deal with nosy data and are easier to use than artificial neural networks. Lastly, SVM are often able to resist overfitting and are usually highly accurate.

Cons of SVM include they are still complex as they are a member of black box machine learning methods even if they are simpler than artificial neural networks. The lack of a criteria over kernel selection makes it difficult to determine which model is the best.


SVM provide yet another approach to analyzing data in a machine learning context. Success with this approach depends on determining specifically what the goals of a project are.

Developing an Artificial Neural Network in R

In this post, we are going make an artificial neural network (ANN) by analyzing some data about computers. Specifically, we are going to make an ANN to predict the price of computers.

We will be using the “ecdat” package and the data set “Computers” from this pacakge. In addition, we are going to use the “neuralnet” pacakge to conduct the ANN analysis. Below is the code for the packages and dataset we are using

## Loading required package: Ecfun
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##     Orange
## Loading required package: grid
## Loading required package: MASS
## Attaching package: 'MASS'
## The following object is masked from 'package:Ecdat':
##     SP500
#load data set

Explore the Data

The first step is always data exploration. We will first look at nature of the data using the “str” function and then used the “summary” function. Below is the code.

## 'data.frame':    6259 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 1849 3295 ...
##  $ speed  : num  25 33 25 25 33 66 25 50 50 50 ...
##  $ hd     : num  80 85 170 170 340 340 170 85 210 210 ...
##  $ ram    : num  4 2 4 8 16 16 4 2 8 4 ...
##  $ screen : num  14 14 15 14 14 14 14 14 14 15 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...
lapply(Computers, summary)
## $price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     949    1794    2144    2220    2595    5399 
## $speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   50.00   52.01   66.00  100.00 
## $hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    80.0   214.0   340.0   416.6   528.0  2100.0 
## $ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   8.000   8.287   8.000  32.000 
## $screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   14.00   14.00   14.61   15.00   17.00 
## $cd
##   no  yes 
## 3351 2908 
## $multi
##   no  yes 
## 5386  873 
## $premium
##   no  yes 
##  612 5647 
## $ads
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    39.0   162.5   246.0   221.3   275.0   339.0 
## $trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   16.00   15.93   21.50   35.00

ANN is primarily for numerical data and not categorical or factors. As such, we will remove the factor variables cd,multi, and premium from further analysis. Below are is the code and histograms of the remaining variables.





Clean the Data

Looking at the summary combined with the histograms indicates that we need to normalize are data as this is a requirement for ANN. We want all of the variables to have an equal influence initially. Below is the code for the function we wil use to normalize are variables.

normalize<-function(x) {
        return((x-min(x)) / (max(x)))

Explore the Data Again

We now need to make a new dataframe that has only the variables we are going to use for the analysis. Then we will use our “normalize” function to scale and center the variables appropriately. Lastly, we will re-explore the data to make sure it is ok using the “str” “summary” and “hist” functions. Below is the code.

#make dataframe without factor variables
#make a normalize dataframe of the data
Computers_norm<-as.data.frame(lapply(Computers_no_factors, normalize))
#reexamine the normalized data
lapply(Computers_norm, summary)
## $Computers.price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1565  0.2213  0.2353  0.3049  0.8242 
## $Computers.speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0800  0.2500  0.2701  0.4100  0.7500 
## $Computers.hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.06381 0.12380 0.16030 0.21330 0.96190 
## $Computers.ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0625  0.1875  0.1965  0.1875  0.9375 
## $Computers.screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03581 0.05882 0.17650 
## $Computers.ad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3643  0.6106  0.5378  0.6962  0.8850 
## $Computers.trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2571  0.4286  0.4265  0.5857  0.9714








Develop the Model

Everything looks good, so we will now split are data into a training and testing set and develop the ANN model that will predict computer price. Below is the code for this.

#split data into training and testing
                                  Computers.ad+Computers.trend, Computer_train)

Our intial model is a simple feedforward networm with a single hidden node. You can visualize the model using the “plot” function as shown in the code below.


Screenshot from 2016-07-11 14:35:35.png

We now need to evaluate our model’s performance. We will use the “compute” function to generate predictions. The predictions generated by the “compute” function will then be compared to the actual prices in the test data set using a pearson correlation. Since we are not classifying we can’t measure accuracy with a confusion matrix but rather with correlation. Below is the code followed by the results

evaluate_model<-compute(computer_model, Computers_test[2:7])
cor(predicted_price, Computers_test$Computers.price)
##              [,1]
## [1,] 0.8809571295

The correlation between the predict results and the actual results is 0.88 which is a strong relationship. THis indicates that our model does an excellent job in predicting the price of a computer based on ram, screen size, speed, hard drive size, advertising, and trends.

Develop Refined Model

Just for fun we are going to make a more complex model with three hidden nodes and see how the results change below is the code.

                           Computer_train, hidden =3)
evaluate_model2<-compute(computer_model2, Computers_test[2:7])
cor(predicted_price2, Computers_test$Computers.price)

Screenshot from 2016-07-11 14:36:34.png

##              [,1]
## [1,] 0.8963994092

The correlation improves to 0.89. As such, the increased complexity did not yield much of an improvement in the overall correlation. Therefore, a single node model is more appropriate.


In this post we explored an application of artificial neural networks. This black box method is useful for making powerful predictions in highly complex data.

Black Box Method-Artificial Neural Networks

In machine learning, there are a set of analytical techniques know as black box methods. What is meant by black box methods is that the actual models developed are derived from complex mathematical processes that are difficult to understand and interpret. This difficulty in understanding them is what makes them mysterious.

One black method is artificial neural network (ANN). This method tries to imitate mathematically the behavior of neurons in the nervous system of humans. This post will attempt to explain ANN in simplistic terms.

The Human Mind and the Artificial One

We will begin by looking at how real neurons work before looking at ANN. In simple terms, as this is not a biology blog, neurons send and receive signals. They receive signals through their dendrites, process information in the soma, and the send signals through their axon terminal. Below is a picture of a neuron.


An ANN works in a highly similar manner. The x variables are the dendrites that are providing information to the cell body we they are summed. Different dendrites or x variables can have different weights (w). Next, the summation of the x variables is passed to an activation function before moving to the output or dependent variable y. Below is a picture of this process.

Diagram1If you compare the two pictures they are similar yet different. ANN mimics the mind in a way that has fascinated people for over 50 years.

Activation Function (f)

The activation function purpose is to determine if there should be an activation. In the human body, activation takes place one the nerve cell sends the message to the next cell. This indicates that the message was strong enough to have it move forward.

The same concept applies in ANN. A signal will not be passed on unless it meets a minimum threshold. This threshold can vary depending on how the ANN is model.

ANN Design

The makeup of a ANNs can vary great. Some models have more than one output variable as shown below.


Two outputs

Other models have what are called hidden layers. These are variables that are both input and output variables. They could be seen as mediating variables. Below is a visual example.


Hidden layer is the two circles in the middle

How many layers to developed is left to the researcher. When models become really complex with several hidden layers and or outcome variables it is referred to as deep learning in the machine learning community.

Another complexity of ANN is the direction of information. Just as in the human body information can move forward and backward in an ANN. This provides for opportunities to model highly complex data and relationships.


ANN can be used for classifying virtually anything. They are a highly accurate model as well that is not bogged down by many assumptions. However, ANN’s are so hard to understand that it makes it difficult to use them despite their advantages. As such, this form of analysis can be beneficial if the user is able to explain the results.


Making Regression and Modal Trees in R

In this post, we will look at an example of regression trees. Regression trees use decision tree-like approach to develop predicition models involving numerical data. In our example we will be trying to predict how many kids a person has based on several independent variables in the “PSID” data set in the “Ecdat” package.

Lets begin by loading the necessary packages and data set. The code is below

## 'data.frame':    4856 obs. of  8 variables:
##  $ intnum  : int  4 4 4 4 5 6 6 7 7 7 ...
##  $ persnum : int  4 6 7 173 2 4 172 4 170 171 ...
##  $ age     : int  39 35 33 39 47 44 38 38 39 37 ...
##  $ educatn : int  12 12 12 10 9 12 16 9 12 11 ...
##  $ earnings: int  77250 12000 8000 15000 6500 6500 7000 5000 21000 0 ...
##  $ hours   : int  2940 2040 693 1904 1683 2024 1144 2080 2575 0 ...
##  $ kids    : int  2 2 1 2 5 2 3 4 3 5 ...
##  $ married : Factor w/ 7 levels "married","never married",..: 1 4 1 1 1 1 1 4 1 1 ...
##      intnum        persnum            age           educatn     
##  Min.   :   4   Min.   :  1.00   Min.   :30.00   Min.   : 0.00  
##  1st Qu.:1905   1st Qu.:  2.00   1st Qu.:34.00   1st Qu.:12.00  
##  Median :5464   Median :  4.00   Median :38.00   Median :12.00  
##  Mean   :4598   Mean   : 59.21   Mean   :38.46   Mean   :16.38  
##  3rd Qu.:6655   3rd Qu.:170.00   3rd Qu.:43.00   3rd Qu.:14.00  
##  Max.   :9306   Max.   :205.00   Max.   :50.00   Max.   :99.00  
##                                                  NA's   :1      
##     earnings          hours           kids                 married    
##  Min.   :     0   Min.   :   0   Min.   : 0.000   married      :3071  
##  1st Qu.:    85   1st Qu.:  32   1st Qu.: 1.000   never married: 681  
##  Median : 11000   Median :1517   Median : 2.000   widowed      :  90  
##  Mean   : 14245   Mean   :1235   Mean   : 4.481   divorced     : 645  
##  3rd Qu.: 22000   3rd Qu.:2000   3rd Qu.: 3.000   separated    : 317  
##  Max.   :240000   Max.   :5160   Max.   :99.000   NA/DF        :   9  
##                                                   no histories :  43

The variables “intnum” and “persnum” are for identification and are useless for our analysis. We will now explore our dataset with the following code.











##       married never married       widowed      divorced     separated 
##          3071           681            90           645           317 
##         NA/DF  no histories 
##             9            43

Almost all of the variables are non-normal. However, this is not a problem when using regression trees. There are some major problems with the “kids” and “educatn” variables. Each of these variables have values at 98 and 99. When the data for this survey was collected 98 meant the respondent did not know the answer and a 99 means they did not want to say. Since both of these variables are numerical we have to do something with them so they do not ruin our analysis.

We are going to recode all values equal to or greater than 98 as 3 for the “kids” variable. The number 3 means they have 3 kids. This number was picked because it was the most common response for the other respondents. For the “educatn” variable all values equal to or greater than 98 are recoded as 12, which means that they completed 12th grade. Again this was the most frequent response. Below is the code.

PSID$kids[PSID$kids >= 98] <- 3
PSID$educatn[PSID$educatn >= 98] <- 12

Another peek at the histograms for these two variables and things look much better.





Make Model and Visualization

Now that everything is cleaned up we now need to make are training and testing data sets as seen in the code below.


We will now make our model and also create a visual of it. Our goal is to predict the number of children a person has based on their age, education, earnings, hours worked, marital status. Below is the code

#make model
PSID_Model<-rpart(kids~age+educatn+earnings+hours+married, PSID_train)
#make visualization
rpart.plot(PSID_Model, digits=3, fallen.leaves = TRUE,type = 3, extra=101)


The first split on the tree is by income. On the left we have those who make more than 20k and on the right those who make less than 20k. On the left the next split is by marriage, those who are never married or not applicable have on average 0.74 kids. Those who are married, widowed, divorced, separated, or have no history have on average 1.72.

The left side of the tree is much more complicated and I will not explain all of it. The after making less than 20k the next split is by marriage. Those who are married, widowed, divorced, separated, or no history with less than 13.5 years of education have 2.46 on average.

Make Prediction Model and Conduct Evaluation

Our next task is to make the prediction model. We will do this with the folllowing code

PSID_pred<-predict(PSID_Model, PSID_test)

We will now evaluate the model. We will do this three different ways. The first involves looking at the summary statistics of the prediction model and the testing data. The numbers should be about the same. After that we will calculate the correlation between the prediction model and the testing data. Lastly, we will use a technique called the mean absolute error. Below is the code for the summary statistics and correlation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.735   2.041   2.463   2.226   2.463   2.699
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   2.494   3.000  10.000
cor(PSID_pred, PSID_test$kids)
## [1] 0.308116

Looking at the summary stats are model has a hard time predicting extreme values because the max value of the two models are vey far a part. However, how often do people have ten kids? As such, this is not a major concern.

A look at the correlation finds that it is pretty low (0.30) this means that the two models have little in common and this means we need to make some changes. The mean absolute error is a measure of the difference between the predicted and actual values in a model. We need to make a function first before we analyze our model.

MAE<-function(actual, predicted){

We now assess the model with the code below

MAE(PSID_pred, PSID_test$kids)
## [1] 1.134968

The results indicate that on average the difference between our model’s prediction of the number of kids and the actual number of kids was 1.13 on a scale of 0 – 10. That’s a lot of error. However, we need to compare this number to how well the mean does to give us a benchmark. The code is below.

MAE(ave_kids, PSID_test$kids)
## [1] 1.178909

Model Tree

Our model with a score of 1.13 is slightly better than using the mean which is 1.17. We will try to improve our model by switching from a regression tree to a model tree which uses a slightly different approach for prediction. In a model tree each node in the tree ends in a linear regression model. Below is the code.

PSIDM5<- M5P(kids~age+educatn+earnings+hours+married, PSID_train)
## M5 pruned model tree:
## (using smoothed linear models)
## earnings <= 20754 : 
## |   earnings <= 2272 : 
## |   |   educatn <= 12.5 : LM1 (702/111.555%)
## |   |   educatn >  12.5 : LM2 (283/92%)
## |   earnings >  2272 : LM3 (1509/88.566%)
## earnings >  20754 : LM4 (1147/82.329%)
## LM num: 1
## kids = 
##  0.0385 * age 
##  + 0.0308 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.0187 * married=married,divorced,widowed,separated,no histories 
##  + 0.2986 * married=divorced,widowed,separated,no histories 
##  + 0.0082 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 0.7181
## LM num: 2
## kids = 
##  0.002 * age 
##  - 0.0028 * educatn 
##  + 0.0002 * earnings 
##  - 0 * hours 
##  + 0.7854 * married=married,divorced,widowed,separated,no histories 
##  - 0.3437 * married=divorced,widowed,separated,no histories 
##  + 0.0154 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 1.4075
## LM num: 3
## kids = 
##  0.0305 * age 
##  - 0.1362 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.9028 * married=married,divorced,widowed,separated,no histories 
##  + 0.2151 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 2.0218
## LM num: 4
## kids = 
##  0.0393 * age 
##  - 0.0658 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.8845 * married=married,divorced,widowed,separated,no histories 
##  + 0.3666 * married=widowed,separated,no histories 
##  + 0.0037 * married=separated,no histories 
##  + 0.4712
## Number of Rules : 4

It would take too much time to explain everything. You can read part of this model as follows earnings greater than 20754 use linear model 4earnings less than 20754 and less than 2272 and less than 12.5 years of education use linear model 1 earnings less than 20754 and less than 2272 and greater than 12.5 years of education use linear model 2 earnings less than 20754 and greater than 2272 linear model 3 The print out then shows each of the linear model. Lastly, we will evaluate our model tree with the following code

PSIDM5_Pred<-predict(PSIDM5, PSID_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3654  2.0490  2.3400  2.3370  2.6860  4.4220
cor(PSIDM5_Pred, PSID_test$kids)
## [1] 0.3486492
MAE(PSID_test$kids, PSIDM5_Pred)
## [1] 1.088617

This model is slightly better. FOr example, it is better at predict extreme values at 4.4 compare to 2.69 for the regression tree model. The correlation is 0.34 which is better than 0.30 for the regression tree model. Lastly. the mean absolute error shows a slight improve to 1.08 compared to 1.13 in the regression tree model


This provide examples of the use of regression trees and model trees. Both of these models make prediction a key component of their analysis.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories. Such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

The use of trees in making predictions numeric involves the use of regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with the use of trees.

Regression Trees and Modal Trees

Regression trees have been around since the 1980’s. They work by predicting the average value of specific examples that reach a given leaf in the tree. Despite their name, there is no regression involved with regression trees. Regression trees are straightforward  to interpret but at the expense of accuracy.

Modal trees are similar to regression trees but employs multiple regression with the examples at each leaf in a tree. This leads to many different regression models being used to split the data throughout a tree. This makes model trees hard to interpret and understand in comparison to regression trees. However, they are normally much more accurate than regression trees.

Both types of trees have the goal of making groups that are as homogeneous as possible. For decision trees, entropy is used to measure the homogeneity of groups. For numeric decision trees, the standard deviation reduction (SDR) is used. The detail of SDR are somewhat complex and technical and will be avoided for that reason.

Strengths of Numeric Prediction Trees

Numeric prediction trees do not have the assumptions of linear regression. As such, they can be used to model non-normal and or non-linear data. In addition, if a dataset has a large number of feature variables, a numeric prediction tree can easily select the most appropriate ones automatically. Lastly, numeric prediction trees also do not need the the model to be specific in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amounts of data in the training set in order to develop a testable model. It is also hard to tell which variables are most important in shaping the outcome. Lastly, sometimes numeric prediction trees are hard to interpret. This naturally limits there usefulness among people who lack statistical training.


Numeric prediction trees combine the strength of decision tress with the ability to digest la large amount of numerical variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, when possible, it is usually wise to allows try to use simpler methods if permissible.

Classification Rules in Machine Learning

Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. Antecedent is the before ad consequent is after. For example, I may have the following rule.

If the students studies 5 hours a week then they will pass the class with an A

This simple rule can be broken down into the following antecedent and consequent.

  • Antecedent–If the student studies 5 hours a week
  • Consequent-then they will pass the class with an A

The antecedent determines if the consequent takes place. For example, the student must study 5 hours a week to get an A. This is the rule in this particular context.

This post will further explain the characteristic and traits of classification rules.

Classification Rules and Decision Trees

Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step  process to make a decision.

Classification rules are stand alone rules that are abstracted from a process. To appreciate a classification rule you do not need to be familiar with the process that created it. While with decision trees you do need to be familiar with the process that generated the decision.

One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for large amounts of numeric variables. This is not a problem with decision trees.

The Algorithm

Classification rules use algorithms that employ a separate and conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subset by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics.

Common algorithms used in classification rules include the One Rule Algorithm and the RIPPER Algorithm. The One Rule Algorithm analyzes data and generates one all-encompassing rule. This algorithm works be finding the single rule that contains the less amount of error. Despite its simplicity it is surprisingly accurate.

The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that in no longer helps to purify the various groups the rule is pruned or the part of the rule that is not beneficial is removed. This process of growing and pruning rules is continued until there is no further benefit.

RIPPER algorithm rules are more complex than One Rule Algorithm. This allows for the development of complex models. The drawback is that the rules can become to complex to make practical sense.


Classification rules are a useful way to develop clear principles as found in the data. The advantages of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules.

Understanding Decision Trees

Decision trees are yet another method in machine learning that is used for classifying outcomes. Decision trees are very useful for, as you can guess, making decisions based on the characteristics of the data.

In this post we will discuss the following

  • Physical traits of decision trees
  • How decision trees work
  • Pros and cons of decision trees

Physical Traits of a Decision Tree

Decision trees consist of what is called a tree structure. The tree structure consist of a root node, decision nodes, branches and leaf nodes.

A root node is the initial decision made in the tree. This depends on which feature the algorithm selects first.

Following the root node the tree splits into various branches. Each branch leads to an additional decision node where the data is further subdivided. When you reach the bottom of a tree at the terminal node(s) these are also called leaf nodes.

How Decision Trees Work

Decision trees use a heuristic called recursive partitioning. What this does is it splits the overall data set into smaller and smaller subsets until each subset is as close to pure (having the same characteristics) as possible. This process is also know as divide and conquer.

The mathematics for deciding how to split the data is based on an equation called entropy, which measures the purity of a potential decision node. The lower the entropy score the more pure the decision node is. The entropy can range from 0 (most pure) to 1 (most impure).

One of the most popular algorithms for developing decision trees is the C5.0 algorithm. This algorithm in particular uses entropy to assess potential decision nodes.

Pros and Cons

The prose of decision trees include it versatile nature. Decision trees can deal with all types of data as well as missing data. Furthermore, this approach learns automatically and only uses the most important features. Lastly, a deep understanding of mathematics is not necessary to use this method in comparison to more complex models.

Some problems with decision trees is that the can easily overfit the data. This means that the tree does not generalize well to other datasets. In addition, a large complex tree can be hard to interpret, which may be yet another indication of overfitting.


Decision trees provide another vehicle that researchers can use to empower decision making. This model is most useful particular when a decision that was made needs to be explained and defended. For example, when rejecting a person’s loan application. Complex models made provide stronger mathematical reasons but would be difficult to explain to an irate customer.

Therefore, for complex calculation presented in an easy to follow format. Decision trees are one possibility.

Conditional Probability & Bayes’ Theorem

In a prior post, we look at some of the basics of probability. The prior forms pf probability we looked at focused on independent events, which are events that are unrelated to each other.

In this post we will look at conditional probability which involves calculating probabilities for events that are dependent on each other. We will understand conditional probability through the use of Bayes’ theorem.

Conditional Probability 

If all events were independent in it would be impossible to predict anything because there would be no relationships between features. However, there are many examples of on event affecting another. For example, thunder and lighting can be used to predictors of rain and lack of study can be used as a predictor for test performance.

Thomas Bayes develop a theorem to understand conditional probability. A theorem is a statement that can be proven true through the use of math. Bayes’ theorem is written as follows

P(A | B)

This complex notation simply means

The probability of event A given event B occurs

Calculating probabilities using Bayes’ theorem can be somewhat confusing when done by hand. There are a few terms however that you need to be exposed too.

  • prior probability is the probability of an event without a conditional event
  • likelihood is the probability of a given event
  • posterior probability is the probability of an event given that another event occurred. the calculation or posterior probability is the application of Bayes’ theorem

Naives Bayes Algorithm

Bayes’ theorem has been used to develop the Naive Bayes Algorithm. This algorithm is particularly useful in classifying text data, such as emails. This algorithm is fast, good with missing data, and powerful with large or small data sets. However, naive bayes struggles with large amounts of numeric data and it has a problem with assuming that all features are of equal value, which is rarely the case.


Probability is a core component of prediction. However, prediction cannot truly take place with events being dependent. Thanks to the work of Thomas Bayes, we have one approach to making prediction through the use of his theorem.

In a future post, we will use naive Bayes algorithm to make predictions about text.


Nearest Neighbor Classification in R

In this post, we will conduct a nearest neighbor classification using R. In a previous post, we discussed nearest neighbor classification. To summarize, nearest neighbor uses the traits of a known example to classify an unknown example. The classification is determine by the closets known example(s) to the unknown example. There are essentially four steps in order to complete a nearest neighbor classification 1. Find a dataset 2. Explore/prepare the Dataset 3. Train the model 4. Evaluate the model Find a data set For this example, we will use the college data set from the ISLR package. Our goal will be to predict which colleges are private or not private based on the feature “Private”. Below is the code for this.


Step 2 Exploring the Data

We now need to explore and prep the data for analysis. Exploration helps us to find any problems in the data set. Below is the code, you can see the results in your own computer if you are following along.


The “str” function gave us an understanding of the different types of variables and some of there initial values. We have 18 variables in all. We want to predict “Private” which is a categorical feature Nearest neighbor predicts categorical features only. We will use all of the other numerical variables to predict “Private”. The prop.table give us the proportion of private and not private colleges. About 30% of the data is not private and about 70% is a private college

Lastly, the “summary” function gives us some descriptive stats. If you look closely at the descriptive stats, there is a problem. The variables use different scales. For example, the “Apps” feature goes from 81 t0 48094 while the “Grad.Rate” feature goes from 10 to 100. If we do the analysis with these different scales the “App” feature will be a much stronger influence on the prediction. Therefore, we need to rescale the features or normalize them. Below is the code to do this


We made a function called “normal” that normalizes or scales are variables so that all values are between 0-1. This makes all of the features equal in their influence. We now run code to normalize are data using the “normal” function we created. We will also look at the summary stats. In addition, we will leave out the prediction feature “Private” because we do not want to have that in the new data set because we want to predict this. In a real world example, you would not know this before the analysis anyway.


The name of the new dataset is “College_New” we used the “lapply” function to make R normalize all the features in the “College” dataset. Summary stats are good as all values are between 0 and 1. The last part of this step involves dividing our data into training and testing sets as seen in the code below and creating the labels. The labels are the actually information from the “Private” feature. Remember, we remove this from the “College_New” dataset but we need to put this information in their own features to allow us to check the accuracy of our results later.

College_train<-College_New[1:677, ]
 #make labels
 College_train_labels<-College[1:677, 1]

Step 3 Model Training

Train the model We will now train the model. You will need to download the “class” package and load it. After that, you need to run the following code.

 College_test_pred<-knn(train=College_train, test=College_Test,
 cl=College_train_labels, k=25)

Here is what we did

  • We created a variable called “College_test_pred” to store are results
  • We used the “knn” function to predict the examples. Inside the “knn” function we used “College_train” to teach the model.
  • We then identify “College_test” as the dataset we want to test. For the ‘cl’ argument we used the “College_train_labels” dataset which contains which colleges are private in the “College_train” dataset. The “k” is the number of neighbors that knn uses to determine what class to label an unknown example. In this code, we are using the 25 nearest neighbors to label the unknown example The number of “k” to use can vary but a rule of thumb is to take the square root of the total number of examples in the training data. Our training data has 677 and the square root of this is about 25. What happens is that each of the 25 closest neighbors are calculated for an example. If 20 are not private and 5 are private the unknown example is labeled private because the majority of its neighbors are not private.

Step 4 Evaluating the Model

We will now evaluate the model using the following code

 CrossTable(x=College_test_labels, y=College_test_pred, prop.chisq = FALSE)

The results can be seen by clicking here.  The box in the top left  predicts which colleges are private correctly while the box in the bottom right classifies which colleges are not private correctly. For example, 28 colleges that are not private were classed as not private while 64 colleges that are not private were classed as not private. Adding these two numbers together gives us 92 (28+64) which is our accuracy rate. In the top right box we have are false positives. These are colleges which are private but  were predicted as private, there were 8 of these. Lastly, we have our false negatives in the bottom left box, which are colleges that  are private but label as not private, there were 0 of these.


The results are pretty good for this example. Nearest neighbor classification allows you to predict an unknown example based on the classification of the known neighbors. This is a simple way to identify examples that are clump among neighbors with the same characteristics.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of the nearest example. This based on the assumption that if two examples are next to each other they must be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.


Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features leads to a two-dimensional feature space, three features leads to a three dimensional feature space, etc. In this feature space all the examples are placed based on their respective features.

The label of the unknown examples are determined by who the closet neighbor is or are. This calculation is based on Euclidean distance, which is the shortest distance possible. The number of neighbors that are used to calculate the distance varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use. The more neighbors used the more complicated the classification becomes.

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provide by the model to understand why there are certain relationships. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model. This deprives us of insights into the relationships in the data. Another concern is the headache of missing data.  This forces you to spend time cleaning the data more thoroughly.  One final issue is that the classification phase of a project is slow and cumbersome because of the messy nature of the data.


Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your way of approaching your problem, data collection and analysis that is acceptable to peers.

This level of freedom leads to people literally freezing and not completing a project. Now imagine have several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simply six step process to approaching data science. The process involves the following six steps

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to access data for further analysis. Not always, but often data scientist are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research in which you are often involved from the beginning to the end, in data science you jump to analyze a mess of data that others collected. This is challenging as it may not be clear what people what to know are what exactly the collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have, the level of data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats from structured to unstructured. It is critical to take a look at the data through using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This involves dealing with missing data, transforming features, addressing outliers, and other necessary processes for preparing analysis. The primary goal is to organize the data for analysis

This is a critical step as various machine learning models have different assumptions that must be met. Some models can handle missing data some cannot. Some models are affected by outliers some are not.

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis needs to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well-spoken explanations.

Steps 6 Apply the Results

This is the chance to actual use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications by people interested in applying for a home loan.


The steps in this process is just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

Types of Machine Learning

Machine learning is a tool used in analytics for using data to make decision for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.

This post will provide information on basic types of machine learning which includes predictive models, supervised learning, descriptive models, and unsupervised learning.

Predictive Models and Supervised Learning

Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is mostly likely to buy a plane ticket or purchase a specific book.

Predictive models are not limited to the future. They can also be used to predict something that has already happen but we are not sure when. For example, data can be collect from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth .

Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it they need to learn and how to do it.

For example, if we want to predict who will be accept or rejected for a home loan we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or reject for the home loan. The supervisors in this example or the features (salary, gender, credit score) used to predict the target feature (home loan).

The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, type of car, etc. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive. At least in machine learning, someone can only be classified as male or female, current algorithms cannot place a person in both classes.

A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.

Descriptive Models and Unsupervised Learning

Descriptive models summarizes data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict there are no supervisors or specific features that are used to predict the target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to model as to what to do per say.

Descriptive models are very useful for discovering patterns. For example, one descriptive model analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they often would be beer for themselves and diapers for their small children. Stores used this information and they placed beer and diapers next to each in the stores. This led to an increase in profits as men could now find beer and diapers together. This kind of relationship can only be found through machine learning techniques.


The model you used depends on what you want to know. Prediction is for, as you can guess, predicting. With this model you are not as concern about relationships as you are about understanding what affects specifically the target feature. If you want to explore relationships then descriptive models can be of use. Machine learning models are tools that are appropriate for different situations.

Intro to Using Machine Learning

Machine learning is about using data to take action. This post will explain common steps that are taking when using machine learning in the analysis of data. In general, there are five steps when applying machine learning.

  1. Collecting data
  2. Preparing data
  3. Exploring data
  4. Training a model on the data
  5. Assessing the performance of the model
  6. Improving the model

We will go through each step briefly

Step One, Collecting Data

Data can come from almost anywhere. Data can come from a database in a structured format like mySQL. Data can also come unstructured such as tweets collected from twitter. However you get the data, you need to develop a way to clean and process it so that it is ready for analysis.

There are some distinct terms used in machine learning that some coming from traditional research my not be familiar.

  • example-An example is one set of data. In an excel spreadsheet, an example would be one row. In empirical social science research we would call an example a respondent or participant.
  • Unit of observation-This is how an example is measured. The units can be time, money, height, weight, etc.
  • feature-A feature is a characteristic of an example. In other forms of research we normally call a feature a variable. For example, ‘salary’ would be a feature in machine learning but a variable in traditional research.

Step Two, Preparing Data

This is actually the most difficult step in machine learning analysis. It can take up to 80% of the time. With data coming from multiple sources and in multiple formats it is a real challenge to get everything where it needs to be for an analysis.

Missing data needs to be addressed, duplicate records, and other issues are a part of this process. Once these challenges are dealt with it is time to explore the data.

Step Three, Explore the Data

Before analyzing the data, it is critical that the data is explored. This is often done visually in the form of plots and graphs but also with summary statistics. You are looking for some insights into the data and the characteristics of different features. You are also looking out for things that might be unusually such as outliers. There are also times when a variable needs to be transformed because there are issues with normality.

Step Four, Training a Model

After exploring the data, you should have an idea of what you want to know if you did not already know. Determining what you want to know helps you to decide which algorithm is most appropriate for developing a model.

To develop a model, we normally split the data into a training and testing set. This allows us to assess the model for its strengths and weaknesses.

Step Five, Assessing the Model

The strength of the model is deteremined. Every model has certain biases that limits its usefulness. How to assess a model depends on what type of model is developed and the purpose of the analysis. Whenever suitable, we want to try and improve the model.

Step Six, Improving the Model

Improve can happen in many ways. You might decide to normalize the variables in a different way. Or you may choose to add or remove features from the model. Perhaps you may switch to a different model.


Success in data analysis involves have a clear path for achieving your goals. The steps presented here provide one way of tackling machine learning.

Machine Learning: Getting Machines to Learn

One of the most interesting fields in quantitative research today is machine learning. Machine Learning serves the purpose of developing models through the use of algorithms in order to inform action.

Examples of machine learning are all around us. Machine learning is used to make recommendations about what you should purchase at most retail websites. Machine learning is also used to filter spam from your email inbox. Each of these two examples involves making prediction about previous behavior. This is what machines learning does. It learns what will happen based on what has happen. However, the question remains how does a machine actually learn.

The purpose of this post is to explain how machines actually “learn.” The process of learning involves three steps which are…

  1. Data input
  2. Abstraction
  3. Generalization

Data Input

Data input is simply having access to some for of data whether it be numbers, text, pictures or something else. The data is the factual information that is used to develop insights for future action.

The majority of a person’s time in conducting machine learning is involve in organize and cleaning data. This process is actually called data mugging by many. Once the data is ready for analysis, it is time for the abstraction process to begin.


Clean beautiful data still does not provide any useful information yet. Abstraction is the process of deriving meaning from the data. The raw data represent knowledge but as we already are aware the problem is how it is currently represented. Numbers and text mean little to us.

Abstraction takes all of the numbers and or text and develops a model. What a model does is summarize data and provides explanation about the data. For example, if you are doing a study for a a bank who wants to know who will default on loans. You might discover from the data that a person is more likely to default if they are a student with a credit card balance over $2,000. How this information is shared with the researcher can be in one of the following forms

  • equation
  • diagram
  • logic rules

This kind of information is hard to find manually and is normally found through the use of an algorithm. Abstraction involves the use of some sort of algorithm. An algorithm is a systemic step-by-step procedure for solving a problem. It is very technical to try and understand algorithms unless you have a graduate degree in statistics.   The point to remember is that algorithms are what the computer uses to learn and develop models.

Developing a model involves training. Training is achieved when the abstraction process is complete. The completion of this process depends on the criteria of what a “good” model is. This varies depending on the requirements of the model and the preference of the researcher.


A machine has not actually learned anything until the model it developed is assessed for bias. Bias is a result of the educated guesses (heuristics) that an algorithm makes to develop a model that are systematically wrong.

For example, let’s say an algorithm learns to identify a man by the length of their hair. If a man has really long hair or if a woman has really short hair the algorithm will misclassify them because each person does not fit the educated guess the algorithm developed. The challenge is that these guesses work most of the time but they struggle with exceptions.

The main way of dealing with this is to develop a model on one set of data and test it on another set of data. This will inform the researcher as to what changes are needed for the model.

Another problem is noise. Noise is caused by measurement error data reporting issues trying to make your model deal with noise can lead to overfitting which means that your model only works for your data and cannot be applied to other data sets. This can also be addressed by testing your model on other data sets.


A machine learns through the use of an algorithm to explain relationships in a data set in a specific manner. This process involves the three steps of data input, abstraction and generalization. The results of a machine learning model is a model that can be use to make prediction about the future with a certain degree of accuracy.

Decisions Trees with Titanic

In this post, we will make predictions about Titanic survivors using decision trees. The advantage of decisions trees is that they split the data into clearly defined groups. This process continues until the data is divided into extremely small subsets. This subsetting is used for making predictions.

We are assuming you have the data and have viewed the previous machine learning post on this blog. If not please click here.


You need to install the ‘rpart’ package from the CRAN repository as the package contains a decision tree function.You will also need to install the following packages

  • ratttle
  • rpart.plot
  • RColorNrewer

Each of these packages plays a role in developing decision trees

Building the Model

Once you have installed the packages. You need to develop the model below is the code. The model uses most of the variables in the data set for predicting survivors.

tree <- rpart(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,data=train, method=’class’)

The ‘rpart’ function is used for making the classification tree aka decision tree.

We now need to see the tree we do this with the code below


Plot makes the tree and ‘text’ adds names download

You can probably tell that this is an ugly plot. To improve the appearance we will run the following code.


Below is the revised plot


This looks much better

How to Read

Here is one way to read a decision tree.

  1. At the top node, keep in mind we are predicting Survival rate. There is a 0 or 1 in all of the ‘buckets’. This number represents how the bucket voted. If more than 50% perish than the bucket votes ‘0’ or no survivors
  2. Still looking at the top bucket, 62% of the passengers die while 38% survived before we even split the data.The number under the node tells what percentage of the sample is in this “bucket”. For the first bucket 100% of the sample is represented.
  3. The first split is based on sex. If the person is male you look to the left. For males, 81% of them died compared to 19% who survived and the bucket votes 0 for death. 65% of the sample is in this bucket
  4. For those who are not male (female) we look to the right and see that only 26% died compared to 74% who survived leading to this bucket voting 1 for survival. This bucket represents 35% of the entire sample.
  5. This process continues all the way down to the bottom


Decisions trees are useful for making ‘decisions’ about what is happening in data. For those who are looking for a simple prediction algorithm, decision trees is one place to begin