Category Archives: machine learning

Ensemble Learning for Machine Learning Models

One way to improve a machine learning model is not to rely on a single model. Instead, you can build several models that have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is known as ensemble learning. This post will provide an overview of ensemble learning as it is used in developing machine learning models.

The Major Challenge

The biggest challenges in creating an ensemble of models are deciding which models to develop and how to combine their predictions. Dealing with these challenges involves the training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is to apply some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive only a subset of the rows or only a subset of the features. However, if several different algorithms are used, the allocation function may pass all of the data to each model without making any changes.

After the data is allocated, the models are built. From there, the next step is to determine how to combine the models. This decision is made with a combination function.

The combination function can take one of several approaches for determining the final predictions. For example, a simple majority vote can be used, which means that if 5 models were developed and 3 vote “yes”, then the example is classified as a yes. Another option is to weight the models so that some have more influence than others in the final predictions.
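
To make the allocation and combination functions concrete, below is a minimal sketch that is not from the original post. It trains three C5.0 trees on bootstrap samples of the “Wages1” data used elsewhere on this blog (the allocation step) and combines their predictions with a simple majority vote (the combination step).

library(C50); library(Ecdat)
data(Wages1)
set.seed(1)
# Allocation: give each model a different bootstrap sample of the rows
boots <- lapply(1:3, function(i) Wages1[sample(nrow(Wages1), replace = TRUE), ])
models <- lapply(boots, function(d) C5.0(sex ~ ., data = d))
# Combination: majority vote across the three sets of predictions
votes <- sapply(models, function(m) as.character(predict(m, Wages1)))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
table(majority, Wages1$sex)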

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. Combining the strengths of many different models and/or algorithms reduces the risk that any single model’s weaknesses dominate the predictions.

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.

Conclusion

Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a single model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.

Developing a Customized Tuning Process in R

In this post, we will learn how to develop customized criteria for tuning a machine learning model using the “caret” package. There are two things that need to be done in order to completely assess a model using customized features. These two steps are…

  • Determine the model evaluation criteria
  • Create a grid of parameters to optimize

The model we are going to tune is the decision tree model made in a previous post with the C5.0 algorithm. Below is code for loading some prior information.

library(caret); library(Ecdat)
data(Wages1)

DETERMINE THE MODEL EVALUATION CRITERIA

We are going to begin by using the “trainControl” function to indicate to R what resampling method we want to use, the number of folds in the sample, and the method for determining the best model. Remember that there are many more options, but these are the ones we will use. All this information must be saved into a variable using the “trainControl” function. Later, the information we place into the variable will be used when we rerun our model.

For our example, we are going to store the following information in a variable we will call “chck”. For resampling, we will use k-fold cross-validation. The number of folds will be set to 10. The criterion for selecting the best model will be the “oneSE” method, which selects the simplest model within one standard error of the best performance. Below is the code for our variable “chck”.

chck<-trainControl(method = "cv",number = 10, selectionFunction = "oneSE")

For now, this information is stored to be used later.

CREATE GRID OF PARAMETERS TO OPTIMIZE

We now need to create a grid of parameters. The grid is essentially the set of parameter combinations to try for each candidate model. For the C5.0 model we need to set the model type, the number of trials, and whether winnowing is used. Therefore we will do the following.

  • For model, we want decision trees only
  • Trials will go from 1-35 by increments of 5
  • For winnowing, we do not want any winnowing to take place.

In all, we are developing 8 models. We know this based on the trials parameter, which is set to 1, 5, 10, 15, 20, 25, 30, and 35. To make the grid we use the “expand.grid” function. Below is the code.

modelGrid<-expand.grid(.model ="tree", .trials= c(1,5,10,15,20,25,30,35), .winnow=FALSE)

CREATE THE MODEL

We are now ready to generate our model. We will use the kappa statistic to evaluate each model’s performance.

set.seed(1)
customModel<- train(sex ~., data=Wages1, method="C5.0", metric="Kappa", trControl=chck, tuneGrid=modelGrid)
customModel
## C5.0 
## 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2966, 2965, 2964, 2964, 2965, 2964, ... 
## Resampling results across tuning parameters:
## 
##   trials  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    1      0.5922991  0.1792161  0.03328514   0.06411924
##    5      0.6147547  0.2255819  0.03394219   0.06703475
##   10      0.6077693  0.2129932  0.03113617   0.06103682
##   15      0.6077693  0.2129932  0.03113617   0.06103682
##   20      0.6077693  0.2129932  0.03113617   0.06103682
##   25      0.6077693  0.2129932  0.03113617   0.06103682
##   30      0.6077693  0.2129932  0.03113617   0.06103682
##   35      0.6077693  0.2129932  0.03113617   0.06103682
## 
## Tuning parameter 'model' was held constant at a value of tree
## 
## Tuning parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 5, model = tree
##  and winnow = FALSE.

The actual output is similar to the model that “caret” can create automatically. The difference here is that the criteria were set by us rather than automatically. A close look reveals that all of the models perform poorly and that there is no change in performance after ten trials.
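
As a side note, the object returned by “train” also stores the winning parameter combination and the final model, so you can reuse them directly. This is not part of the original post, but a quick check might look like the following.

customModel$bestTune # the parameter combination chosen by the oneSE rule
head(predict(customModel, Wages1)) # predictions from the final model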

CONCLUSION

This post provided a brief explanation of developing a customized way of assessing a model’s performance. To complete this, you need to configure your options as well as set up your grid in order to assess a model. Understanding the customization process for evaluating machine learning models is one of the best ways to develop accurate models that retain generalizability.

Developing an Automatically Tuned Model in R

In this post, we are going to learn how to use the “caret” package to automatically tune a machine learning model. This is perhaps the simplest way to evaluate the performance of several models. In a later post, we will explore how to perform custom tuning to a model.

The model we are trying to tune is the decision tree we made using the C5.0 algorithm in a previous post. Specifically, we were trying to predict sex based on the variables available in the “Wages1” dataset in the “Ecdat” package.

In order to accomplish our goal we will need to load the “caret” and “Ecdat” packages, load the “Wages1” dataset, and set the seed. Setting the seed will allow us to reproduce our results. Below is the code for these steps.

library(caret); library(Ecdat)
data(Wages1)
set.seed(1)

We will now build and display our model using the code below.

tuned_model<-train(sex ~., data=Wages1, method="C5.0")
tuned_model
## C5.0 
## 
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3294, 3294, 3294, 3294, 3294, 3294, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   rules  FALSE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules  FALSE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules  FALSE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE    1      0.5892713  0.1740587  0.01262945   0.02526656
##   rules   TRUE   10      0.5938071  0.1861964  0.01510209   0.03000961
##   rules   TRUE   20      0.5938071  0.1861964  0.01510209   0.03000961
##   tree   FALSE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree   FALSE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree   FALSE   20      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE    1      0.5841768  0.1646881  0.01255853   0.02634012
##   tree    TRUE   10      0.5930511  0.1855230  0.01637060   0.03177075
##   tree    TRUE   20      0.5930511  0.1855230  0.01637060   0.03177075
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were trials = 10, model = rules
##  and winnow = TRUE.

There is a lot of information printed out. The first column is the type of model developed. Two types of models were developed: a rules-based classifier and a standard decision tree. Next is the winnow column. This column indicates whether a winnowing process was used to remove poor predictor variables.

The next two columns are accuracy and kappa, which have been explained previously. The last two columns are the standard deviations of accuracy and kappa. None of the models are that good, but the purpose here is for teaching.

At the bottom of the printout, R tells you which model was the best. For us, the best model was the fifth model from the top: a rules-based model with 10 trials and winnow set to “TRUE”.

We will now use the best model (the caret package automatically picks it) to make predictions on the training data. We will also look at the confusion matrix of the classifications, followed by their proportions. Below is the code.

predict_model<-predict(tuned_model, Wages1)
table(predict_model, Wages1$sex)
##              
## predict_model female male
##        female    936  590
##        male      633 1135
prop.table(table(predict_model, Wages1$sex))
##              
## predict_model    female      male
##        female 0.2841530 0.1791135
##        male   0.1921676 0.3445659

In terms of prediction, the model was correct about 63% of the time (0.284 + 0.345 = 0.629). If we want, we can also see the probabilities for each example using the following code.

probTable<-(predict(tuned_model, Wages1, type="prob"))
head(probTable)
##      female       male
## 1 0.6191287 0.38087132
## 2 0.2776770 0.72232303
## 3 0.2975327 0.70246734
## 4 0.7195866 0.28041344
## 5 1.0000000 0.00000000
## 6 0.9092993 0.09070072

Conclusion

In this post, we looked at an automated way to determine the best model among many using the “caret” package. Understanding how to improve the performance of a model is a critical skill in machine learning.

Improving the Performance of a Machine Learning Model

For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do, how to specify the model, among other things, is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance.

This post will serve as an introduction to understanding how to improve model performance. In particular, we will look at the following

  • When it is necessary to improve performance
  • Parameter tuning

When to Improve

It is not always necessary to try and improve the performance of a model. There are times when a model does well and you know this through evaluating it. If the commonly used measures are adequate, there is no cause for concern.

However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle or unclear relationships can make improvement necessary. Real-world data normally has these problems, so model improvement is usually necessary.

Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times as well as brute trial-and-error effort. The point is that there is no single agreed-upon way to improve a model. It is better to focus on being able to explain what you did and why.

Parameter Tuning

Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in R through the use of the “caret” package.

When trying to decide what to do when tuning parameters it is important to remember the following.

  • What machine learning model and algorithm you are using for your data.
  • Which parameters you can adjust.
  • What criteria you are using to evaluate the model.

Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning: those that classify, those that employ regression, and those that can do both. Understanding this helps you to make a decision about what you are trying to do.

Next, you need to understand what exactly you (or R) are adjusting when tuning the model. For example, for C5.0 decision trees, “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved.
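
If you are not sure which parameters a given method exposes, the “caret” package includes a “modelLookup” function that lists them. As a quick sketch, checking the C5.0 method would look like this.

library(caret)
modelLookup("C5.0") # lists the tunable parameters for C5.0: trials, model, and winnow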

Lastly, it is important to know what criteria you are using to compare the various models. For classification models, you can look at kappa, the ROC curve, and the various statistics derived from the confusion matrix. For regression-based models, you may look at R-squared or the RMSE (root mean squared error).

Conclusion

As you can perhaps tell, there is an enormous amount of choice in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making.

In a future post, we will look at an example of model improvement.

K-Fold Cross-Validation

In this post, we are going to look at k-fold cross-validation and its use in evaluating models in machine learning.

K-fold cross-validation is used for estimating the performance of statistical models. The data is divided into a predetermined number of folds (called ‘k’). The model is fit on all but one of the folds, and the held-out fold is used for evaluation. This is repeated k times, so that each fold serves as the test set once, and the results are averaged using a statistic such as kappa to see how the model performs.

In our example, we are going to review a model we made using the C5.0 algorithm. In that post, we were trying to predict gender based on several other features.

First, we need to load several packages into R as well as the dataset we are going to use. All of this is shared in the code below

library(caret);library(C50);library(irr);library(Ecdat)
data("Wages1")

We now need to set the seed. This is important for allowing us to reproduce the results. Every time a k-fold is performed the results can be slightly different but setting the seed prevents this. The code is as follows

set.seed(1)

We will now create our folds. How many folds to create is up to the researcher. For us, we are going to create ten folds, which means that R will divide our sample into ten roughly equal parts. To do this we use the “createFolds” function from the “caret” package. After creating the folds, we will view the results using the “str” function, which will tell us how many examples are in each fold. Below is the code to complete this.

folds<-createFolds(Wages1$sex, k=10)
str(folds)
## List of 10
##  $ Fold01: int [1:328] 8 13 18 37 39 57 61 67 78 90 ...
##  $ Fold02: int [1:329] 5 27 47 48 62 72 76 79 85 93 ...
##  $ Fold03: int [1:330] 2 10 11 31 32 34 36 64 65 77 ...
##  $ Fold04: int [1:330] 19 24 40 45 55 58 81 96 99 102 ...
##  $ Fold05: int [1:329] 6 14 28 30 33 84 88 91 95 97 ...
##  $ Fold06: int [1:330] 4 15 16 38 43 52 53 54 56 63 ...
##  $ Fold07: int [1:330] 1 3 12 22 29 50 66 73 75 82 ...
##  $ Fold08: int [1:330] 7 21 23 25 26 46 51 59 60 83 ...
##  $ Fold09: int [1:329] 9 20 35 44 49 68 74 94 100 105 ...
##  $ Fold10: int [1:329] 17 41 42 71 101 107 117 165 183 184 ...

As you can see, we normally have about 330 examples per fold.

In order to get the results we need, each fold takes a turn as the held-out test data while the model is built on the remaining nine folds. First, folds 2-10 build the model and fold 1 is the test data; then folds 1 and 3-10 build the model and fold 2 is the test data, and so on. Manually coding this would take a great deal of time. To get around this we will use the “lapply” function.

Using “lapply”, we create a function that takes “x” (one of our folds) and uses everything except that fold as the training set, shown here as “Wages1_train” and selected with the “-x” indexing. The held-out fold “x” becomes the test set (Wages1_test). The next two lines of code should look familiar, as they are the code for developing a decision tree. “Wages1_actual” holds the actual labels for sex in the “Wages1” test set.

The “kappa2” function is new; it comes from the “irr” package. The kappa statistic is a measurement of the accuracy of a model while taking chance agreement into account. The closer the value is to 1, the better. Below is the code for what has been discussed.

results<-lapply(folds, function(x) {
        Wages1_train<-Wages1[-x, ]
        Wages1_test<-Wages1[x, ]
        Wages1_model<-C5.0(sex~., data=Wages1_train)
        Wages1_pred<-predict(Wages1_model, Wages1_test)
        Wages1_actual<-Wages1_test$sex
        Wages1_kappa<-kappa2(data.frame(Wages1_actual, Wages1_pred))$value
        return(Wages1_kappa)
})

To get our results, we will use the “str” function again to display them. This will tell us the kappa for each fold. To really see how our model does, we need to calculate the mean kappa of the ten models. This is done with the “unlist” and “mean” functions as shown below.

str(results)
## List of 10
##  $ Fold01: num 0.205
##  $ Fold02: num 0.186
##  $ Fold03: num 0.19
##  $ Fold04: num 0.193
##  $ Fold05: num 0.202
##  $ Fold06: num 0.208
##  $ Fold07: num 0.196
##  $ Fold08: num 0.202
##  $ Fold09: num 0.194
##  $ Fold10: num 0.204
mean(unlist(results))
## [1] 0.1978915

The final mean kappa was about 0.20, which is quite poor. It indicates that the model is only slightly better at predicting than chance alone. However, for illustrative purposes, we now understand how to perform k-fold cross-validation.

Receiver Operating Characteristic Curve

The receiver operating characteristic curve (ROC curve) is a tool used in statistical research to assess the trade-off between detecting true positives and avoiding false positives. The origins of this tool go all the way back to WWII, when engineers were trying to distinguish between true and false alarms. Now this technique is used in machine learning.

This post will explain the ROC curve and provide an example using R.

Below is a diagram of a ROC curve

[Figure: example ROC curve]

On the X axis, we have the false positive rate. As you move to the right the false positive rate increases which is bad. We want to be as close to zero as possible.

On the y-axis, we have the true positive rate. Unlike the x-axis, we want the true positive rate to be as close to 100% as possible. In general, we want a low value on the x-axis and a high value on the y-axis.

In the diagram above, the diagonal line called “Test without diagnostic benefit” represents a model that cannot tell the difference between true and false positives. Therefore, it is not useful for our purpose.

The L-shaped curve called “Good diagnostic test” is an example of an excellent model. This is because all the true positives are detected while incurring almost no false positives.

Lastly, the curved line called “Medium diagnostic test” represents a typical real model. This model is a balance between the perfect L-shaped model and the useless straight-line model. The curved-line model is able to moderately distinguish between false and true positives.

Area Under the ROC Curve

The area under a ROC curve is literally called the “Area Under the Curve” (AUC). This area is a standardized value ranging from 0 to 1. The closer to 1, the better the model; a value of 0.5 corresponds to the diagonal line with no diagnostic benefit.

We will now look at an analysis of a model using the ROC curve and AUC. This is based on the results of a post using the KNN algorithm for nearest neighbor classification. Below is the code

library(ROCR)
predCollege <- ifelse(College_test_pred=="Yes", 1, 0)
realCollege <- ifelse(College_test_labels=="Yes", 1, 0)
pr <- prediction(predCollege, realCollege)
collegeResults <- performance(pr, "tpr", "fpr")
plot(collegeResults, main="ROC Curve for KNN Model", col="dark green", lwd=5)
abline(a=0,b=1, lwd=1, lty=2)
aucOfModel<-performance(pr, measure="auc")
unlist(aucOfModel@y.values)
  1. The first two variables (predCollege & realCollege) are just for converting the model’s predictions and the actual results to numeric variables
  2. The “pr” variable stores the values to be used for the ROC curve. The “prediction” function comes from the “ROCR” package
  3. With the information in the “pr” variable we can now analyze the true and false positive rates, which are stored in the “collegeResults” variable. The “performance” function also comes from the “ROCR” package
  4. The next two lines of code plot the ROC curve. You can see the result below

[Plot: ROC curve for the KNN model]

The curve looks pretty good. To confirm this, we use the last two lines of code to calculate the actual AUC, which is 0.88. In other words, the model developed does an excellent job of discerning between true and false positives.

Conclusion 

The ROC curve provides one of many ways in which to assess the appropriateness of a model. As such, it is yet another tool available for a person who is trying to test models.

Using Confusion Matrices to Evaluate Performance

The data within a confusion matrix can be used to calculate several different statistics that can indicate the usefulness of a statistical model in machine learning. In this post, we will look at several commonly used measures, specifically…

  • accuracy
  • error
  • sensitivity
  • specificity
  • precision
  • recall
  • f-measure

Accuracy

Accuracy is probably the easiest statistic to understand. Accuracy is the total number of items correctly classified divided by the total number of items. Below is the equation.

accuracy = (TP + TN) / (TP + TN + FP + FN)

TP =  true positive, TN =  true negative, FP = false positive, FN = false negative

Accuracy can range in value from 0 to 1, with 1 representing 100% accuracy. Normally, you don’t expect perfect accuracy; on training data it is usually an indication of overfitting, and your model will probably not do well with other data.

Error

Error is the opposite of accuracy and represents the percentage of examples that are incorrectly classified. Its equation is as follows.

error = (FP + FN) / (TP + TN + FP + FN)

The lower the error, the better in general. However, an error of 0 on training data likely indicates overfitting. Keep in mind that error is the complement of accuracy (error = 1 - accuracy); as one increases, the other decreases.

Sensitivity 

Sensitivity is the proportion of actual positive examples that were correctly classified. The formula is as follows.

sensitivity = TP / (TP + FN)

This may sound confusing, but high sensitivity is useful for trusting a negative result. In other words, if I am testing people for a disease and my model has high sensitivity, it rarely misses true cases, so a negative result is good evidence that a person does not have the disease.

Specificity

Specificity measures the proportion of negative examples that were correctly classified. The formula is below

specificity = TN / (TN + FP)

Returning to the disease example, high specificity means few false alarms, so a positive result is strong evidence that someone actually has the disease. Remember that no test is foolproof and there are always false positives and negatives. The role of the researcher is to prioritize sensitivity or specificity based on the purpose of the model.

Precision

Precision is the proportion of predicted positive examples that are really positive. The formula is as follows.

precision = TP / (TP + FP)

The more precise a model is, the more trustworthy its positive predictions are. In other words, high precision indicates that the results returned are relevant.

Recall

Recall is a measure of the completeness of the results of a model. It is calculated as follows

recall = TP / (TP + FN)

This formula is the same as the formula for sensitivity. The difference is in the interpretation. High recall means that the results have a breadth to them such as in search engine results.

F-Measure

The f-measure combines recall and precision (it is their harmonic mean) into another way to assess a model. The formula is below.

f-measure = (2 * TP) / (2 * TP + FP + FN)

The f-measure can range from 0 to 1 and is useful for comparing several potential models using one convenient number.
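
As a quick illustration of how these formulas fit together, the counts below are made up for demonstration purposes and do not come from any model on this blog.

# Hypothetical confusion matrix counts (for illustration only)
TP <- 60; TN <- 50; FP <- 10; FN <- 20
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.79
error       <- (FP + FN) / (TP + TN + FP + FN)  # 0.21
sensitivity <- TP / (TP + FN)                   # 0.75 (same value as recall)
specificity <- TN / (TN + FP)                   # 0.83
precision   <- TP / (TP + FP)                   # 0.86
f_measure   <- (2 * TP) / (2 * TP + FP + FN)    # 0.80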

Conclusion

This post provided a basic explanation of various statistics that can be used to determine the strength of a model. Through using a combination of statistics a researcher can develop insights into the strength of a model. The only mistake is relying exclusively on any single statistical measurement.

Using Probability of the Prediction to Evaluate a Machine Learning Model

In evaluating a model when employing machine learning techniques, there are three common types of data used for evaluation.

  • The actual classification values
  • The predicted classification values
  • The estimated probability of the prediction

The first two types of data (actual and predicted) are used for assessing the accuracy of a model in several different ways such as error rate, sensitivity, specificity, etc.

The benefit of the prediction probabilities is that they measure a model’s confidence in its predictions. If two models are similarly accurate but one is more confident in its correct classifications, the more confident model is generally the stronger learner.

In this post, we will look at examples of the probability predictions of several models that have been used in this blog in the past.

Prediction Probabilities for Decision Trees

Our first example comes from the decision tree we made using the C5.0 algorithm. Below is the code for calculating the predicted class probabilities for each example in the model, followed by the output for the first six examples.

Wage_pred_prob<-predict(Wage_model, Wage_test, type="prob")
 head(Wage_pred_prob)
        female      male
497  0.2853016 0.7146984
1323 0.2410568 0.7589432
1008 0.5770177 0.4229823
947  0.6834378 0.3165622
695  0.5871323 0.4128677
1368 0.4303364 0.5696636

The argument “type” is added to the “predict” function so that R returns the predicted probability for each class rather than just the predicted label. A close look at the results using the “head” function provides a list of 6 examples from the model.

  • For example 497, there is a 28.5% probability that this example is female and a 71.5% probability that this example is male. Therefore, the model predicts that this example is male.
  • For example 1323, there is a 24% probability that this example is female and a 76% probability that this example is male. Therefore, the model predicts that this example is male.
  • etc.

Prediction Probabilities for KNN Nearest Neighbor

Below is the code for finding the probabilities for the KNN algorithm.

library(class)
College_test_pred_prob<-knn(train=College_train, test=College_test, 
                       cl=College_train_labels, k=27, prob=TRUE)
College_test_pred_prob

The printout for this is rather long. However, you can match each predicted label with its vote proportion by looking carefully at the output (a short sketch for extracting these values follows the examples below).

  • For example 1, there is a 77% probability that this example is a yes and a 23% probability that this example is a no. Therefore, the model predicts that this example as yes.
  • For example 2, there is a 71% probability that this example is no and a 29% probability that this example is yes. Therefore, the model predicts that this example is a no.
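
When “prob=TRUE” is used, the knn function stores the winning-class vote proportions as an attribute of the result. The sketch below, which assumes the objects from the earlier KNN post are still loaded, shows one way to pull them out with the base “attr” function.

vote_prop <- attr(College_test_pred_prob, "prob") # proportion of votes for the winning class
head(vote_prop)
head(data.frame(prediction = College_test_pred_prob, proportion = vote_prop))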

Conclusion 

One of the primary purposes of the probabilities option is in comparing various models that are derived from the same data. This information combined with other techniques for evaluating models can help a researcher in determining the most appropriate model of analysis.

Kmeans Analysis in R

In this post, we will conduct a kmeans analysis on some data on student alcohol consumption. The data is available at the UC Irvine Machine Learning Repository and is available at https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

We want to see what segments are within the sample of students who participated in this study on several factors in addition to alcohol consumption. Understanding the characteristics that groups of students have in common could be useful in reaching out to them for various purposes.

We will begin by loading the “stats” package for the kmeans analysis. Then we will combine the data, as it is split across two different files, and explore it using tables and histograms. I will not print the results of the exploration here as there are too many variables. Lastly, we need to set the seed in order to get the same results each time. The code is found below.

library(stats)
student.mat <- read.csv("~/Documents/R working directory/student-mat.csv", sep=";")
student.por <- read.csv("~/Documents/R working directory/student-por.csv", sep=";")
student_alcohol <- rbind(student.mat, student.por)
set.seed(123)
options(digits = 2)
str(student_alcohol)
hist(student_alcohol$age)
table(student_alcohol$address)
table(student_alcohol$famsize)
table(student_alcohol$Pstatus)
hist(student_alcohol$Medu)
hist(student_alcohol$Fedu)
hist(student_alcohol$traveltime)
hist(student_alcohol$studytime)
hist(student_alcohol$failures)
table(student_alcohol$schoolsup)
table(student_alcohol$famsup)
table(student_alcohol$paid)
table(student_alcohol$activities)
table(student_alcohol$nursery)
table(student_alcohol$higher)
table(student_alcohol$internet)
hist(student_alcohol$famrel)
hist(student_alcohol$freetime)
hist(student_alcohol$goout)
hist(student_alcohol$Dalc)
hist(student_alcohol$Walc)
hist(student_alcohol$health)
hist(student_alcohol$absences)

The details about the variables can be found at the website link in the first paragraph of this post. The study looks at students’ alcohol use and other factors related to school and family life.

Before we do the actual kmeans clustering we need to normalize the variables. This is because our variables are measured using different scales. Some are Likert scales with 5 steps, while others are numeric with much wider ranges. The different ranges would influence the results. To deal with this problem we will use the “scale” function on the variables that will be included in the analysis. Below is the code.

student_alcohol_clean<-as.data.frame(student_alcohol)
student_alcohol_clean$age<-scale(student_alcohol$age)
student_alcohol_clean$address<-scale(as.numeric(student_alcohol$address))
student_alcohol_clean$famsize<-scale(as.numeric(student_alcohol$famsize))
student_alcohol_clean$Pstatus<-scale(as.numeric(student_alcohol$Pstatus))
student_alcohol_clean$Medu<-scale(student_alcohol$Medu)
student_alcohol_clean$Fedu<-scale(student_alcohol$Fedu)
student_alcohol_clean$traveltime<-scale(student_alcohol$traveltime)
student_alcohol_clean$studytime<-scale(student_alcohol$studytime)
student_alcohol_clean$failures<-scale(student_alcohol$failures)
student_alcohol_clean$schoolsup<-scale(as.numeric(student_alcohol$schoolsup))
student_alcohol_clean$famsup<-scale(as.numeric(student_alcohol$famsup))
student_alcohol_clean$paid<-scale(as.numeric(student_alcohol$paid))
student_alcohol_clean$activities<-scale(as.numeric(student_alcohol$activities))
student_alcohol_clean$internet<-scale(as.numeric(student_alcohol$internet))
student_alcohol_clean$famrel<-scale(student_alcohol$famrel)
student_alcohol_clean$freetime<-scale(student_alcohol$freetime)
student_alcohol_clean$goout<-scale(student_alcohol$goout)
student_alcohol_clean$Dalc<-scale(student_alcohol$Dalc)
student_alcohol_clean$Walc<-scale(student_alcohol$Walc)
student_alcohol_clean$health<-scale(student_alcohol$health)
student_alcohol_clean$absences<-scale(student_alcohol$absences)
student_alcohol_clean$G1<-scale(student_alcohol$G1)
student_alcohol_clean$G2<-scale(student_alcohol$G2)
student_alcohol_clean$G3<-scale(student_alcohol$G3)

We also need to create a matrix in order to deal with the factor variables. All factor variables need to be converted into dummy variables for the analysis. To do this we use the “model.matrix” function as shown in the code below.

student_alcohol_clean_matrix<-(model.matrix(~.+0, data=student_alcohol_clean))

We are now ready to conduct our kmeans cluster analysis using the “kmeans” function. We have to determine how many clusters to develop before the analysis. There are statistical ways to do this, but another method is domain knowledge. Since we are dealing with teenagers, it is probable that there will be about four distinct groups because of how high school is structured. Therefore, we will use four segments for our analysis. Our code is below.

alcohol_cluster<-kmeans(student_alcohol_clean_matrix, 4)

To view the results we need to look at two components of our “alcohol_cluster” list. The “size” component tells us how many people are in each cluster and the “centers” component describes a cluster’s characteristics on each variable. Below is the code.

alcohol_cluster$size # size of clusters
## [1] 191 381 325 147
alcohol_cluster$centers #center of clusters
##   schoolGP schoolMS sexM   age address famsize Pstatus  Medu  Fedu
## 1     0.74     0.26 0.70  0.06  0.0017   0.265  -0.030  0.25  0.29
## 2     0.69     0.31 0.27 -0.15 -0.1059  -0.056  -0.031 -0.53 -0.45
## 3     0.88     0.12 0.45 -0.13  0.3363   0.005   0.016  0.73  0.58
## 4     0.56     0.44 0.48  0.59 -0.4712  -0.210   0.086 -0.55 -0.51
##   Mjobhealth Mjobother Mjobservices Mjobteacher Fjobhealth Fjobother
## 1      0.079      0.30         0.27       0.199     0.0471      0.54
## 2      0.031      0.50         0.17       0.024     0.0210      0.62
## 3      0.154      0.26         0.28       0.237     0.0708      0.48
## 4      0.034      0.46         0.22       0.041     0.0068      0.60
##   Fjobservices Fjobteacher reasonhome reasonother reasonreputation
## 1         0.33       0.042       0.27       0.120             0.20
## 2         0.25       0.031       0.27       0.113             0.20
## 3         0.28       0.123       0.24       0.077             0.34
## 4         0.29       0.034       0.17       0.116             0.15
##   guardianmother guardianother traveltime studytime failures schoolsup
## 1           0.69         0.079       0.17     -0.32    -0.12    -0.079
## 2           0.70         0.052       0.10      0.10    -0.26     0.269
## 3           0.73         0.040      -0.37      0.29    -0.35    -0.213
## 4           0.65         0.170       0.33     -0.49     1.60    -0.123
##   famsup   paid activities nurseryyes higheryes internet romanticyes
## 1 -0.033  0.253     0.1319       0.79      0.92    0.228        0.34
## 2 -0.095 -0.098    -0.2587       0.76      0.93   -0.387        0.35
## 3  0.156  0.079     0.2237       0.88      1.00    0.360        0.31
## 4 -0.057 -0.250     0.0047       0.73      0.70   -0.091        0.49
##   famrel freetime goout  Dalc  Walc health absences    G1    G2     G3
## 1 -0.184     0.43  0.76  1.34  1.29  0.273    0.429 -0.23 -0.17 -0.129
## 2 -0.038    -0.31 -0.35 -0.37 -0.40 -0.123   -0.042 -0.17 -0.12 -0.053
## 3  0.178    -0.01 -0.14 -0.41 -0.35 -0.055   -0.184  0.90  0.87  0.825
## 4 -0.055     0.25  0.22  0.11  0.14  0.087   -0.043 -1.24 -1.40 -1.518

The clusters are reasonably balanced in size, which suggests a sensible segmentation of the sample. The output for “centers” tells us how far above or below the mean a particular cluster is. For example, for the variable “age” we see the following.

age 
1    0.06
2   -0.15   
3   -0.13 
4    0.59

What this means is that people in cluster one have an average age 0.06 standard deviations above the mean, cluster two is 0.15 standard deviations below the mean, and so on. To give our clusters meaning we have to look at the variables and see on which ones the clusters are well above or below the mean. Below is my interpretation of the clusters. The words in parentheses are the variables from which I drew each conclusion.

Cluster 1 doesn’t study much (studytime), lives in the biggest families (famsize), requires little school support (schoolsup), has a lot of free time (freetime), consumes the most alcohol (Dalc, Walc), lives in an urban area (address), loves to go out the most (goout), and has the most absences. This cluster represents the underachieving, hard-partying drinkers of the sample.

Cluster 2 has parents that are much less educated (Medu, Fedu), requires the most school support (schoolsup) while receiving the least family support (famsup), has the least involvement in extracurricular activities (activities), has the least internet access at home (internet), socializes the least (goout), and has the lowest alcohol consumption (Dalc, Walc). This cluster is the unsupported, non-drinking loners of the sample.

Cluster 3 has the most educated parents (Medu, Fedu), lives in an urban setting (address), chooses schools based on reputation (reasonreputation), has the lowest travel time to school (traveltime), studies the most (studytime), rarely fails a course (failures), has the lowest support from the school while having the highest family support and family relationships (schoolsup, famsup, famrel), is the most involved in extracurricular activities (activities), has the best internet access at home (internet), has the least amount of free time (freetime), and has low alcohol consumption (Dalc, Walc). This cluster represents the affluent, non-drinking high achievers.

Cluster 4 is the oldest (age), lives in rural settings (address), has the smallest families (famsize) and the least educated parents (Medu, Fedu), spends the most time traveling to school (traveltime), doesn’t study much (studytime), has the highest failure rate (failures), is the least likely to pay for extra classes (paid), is the most likely to be in a relationship (romanticyes), consumes alcohol moderately (Dalc, Walc), and does poorly in school (G3). These are the students in the greatest academic danger.

To get better insights, we can add the cluster assignments back to our original, unscaled dataset. We can then identify which cluster each student belongs to and calculate unstandardized means if we wanted.

student_alcohol$cluster<-alcohol_cluster$cluster # add cluster assignments back to the original (unscaled) data
View(student_alcohol)
aggregate(data=student_alcohol, G3~cluster, mean)
##   cluster   G3
## 1       1 10.8
## 2       2 11.1
## 3       3 14.5
## 4       4  5.5

The “aggregate” function tells us the average final grade (G3) for each cluster. We could do this for any of our variables to learn more about our clusters.

Kmeans provides a researcher with an understanding of the homogeneous characteristics of individuals within a sample. This information can be used to develop intervention plans in education or marketing plans in business. As such, kmeans is another powerful tool in machine learning

K-Means Clustering

There are times in research when you neither want to predict nor classify examples. Rather, you want to take a dataset and segment the examples within the dataset so that examples with similar characteristics are gathered together.

When your goal is to create several homogeneous groups within a sample, you are clustering. One type of clustering used in machine learning is k-means clustering. In this post, we will learn the following about k-means clustering.

  • The purpose of k-means
  • Pros and cons of k-means

The Purpose of K-means

K-means has a number of applications. In the business setting, k-means has been used to segment customers. Businesses use this information to adjust various aspects of their strategy for reaching their customers.

Another purpose for k-means is in simplifying large datasets so that you have several smaller datasets with similar characteristics. This subsetting could be useful in finding distinct patterns.

K-means is a form of unsupervised classification. This means that the algorithm labels the examples and the researcher must give those labels meaning. When R gives the results of an analysis it just labels the clusters as 1, 2, 3, etc. It is the researcher’s job to look at the clusters and give a qualitative meaning to them.

Pros and Cons of K-Means

The pros of k-means are that it is simple, highly flexible, and efficient. The simplicity of k-means makes it easy to explain the results in contrast to artificial neural networks or support vector machines. The flexibility of k-means allows for easy adjustment if there are problems. Lastly, the efficiency of k-means implies that the algorithm is good at segmenting a dataset.

Some drawbacks of k-means are that it is not guaranteed to find the most optimal set of clusters and that the number of clusters must be decided before the analysis. When doing the analysis, the k-means algorithm randomly selects several starting points from which to develop clusters. This can be good or bad depending on where the algorithm chooses to begin. From there, the centers of the clusters are recalculated until an adequate “center” is found for the number of clusters requested.

How many clusters to include is left to the discretion of the researcher. This involves a combination of common sense, domain knowledge, and statistical tools. Too many clusters tell you nothing because the groups become very small and there are too many of them.

There are statistical tools that measure within-group homogeneity and between-group heterogeneity. In addition, there is a technique called a dendrogram, which comes from hierarchical clustering. The results of a dendrogram analysis provide a recommendation of how many clusters to use. However, calculating a dendrogram for a large dataset could potentially crash a computer due to the computational load and the limits of RAM.
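
Another common statistical aid, not covered in the original post, is the “elbow” plot: run kmeans for several values of k and plot the total within-cluster sum of squares, looking for the point where adding more clusters stops helping. A minimal sketch, assuming the scaled matrix built in the “Kmeans Analysis in R” post above is still loaded, is shown below. The “nstart” argument reruns the random initialization several times and keeps the best result, which addresses the random starting points mentioned earlier.

set.seed(123)
# total within-cluster sum of squares for k = 1 through 10
wss <- sapply(1:10, function(k) {
  kmeans(student_alcohol_clean_matrix, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")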

Conclusion

K-means is an analytical tool that helps to separate the apples from the oranges, so to speak. If you need to group examples based on the features in the dataset, this method can be useful.

Market Basket Analysis in R

In this post, we will conduct a market basket analysis on the shopping habits of people at a grocery store. Remember that a market basket analysis provides insights through indicating relationships among items that are commonly purchased together.

The first thing we need to do is load the package that makes association rules, which is the “arules” package. Next, we need to load our dataset groceries. This dataset is commonly used as a demonstration for market basket analysis.

However, you don’t want to load this dataset as a data frame because that leads to several technical issues during the analysis. Rather, you want to load it as a sparse matrix. The function for this is “read.transactions” and is available in the “arules” package.

library(arules)
#make sparse matrix
groceries<-read.transactions("/home/darrin/Documents/R working directory/Machine-Learning-with-R-datasets-master/groceries.csv", sep = ",")

Please keep in mind that the location of the file on your computer will be different from my hard drive.

We will now explore the data set by using several different functions. First, we will use the “summary” function as indicated below.

summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The output tells us the number of rows in our dataset (9835) and columns (169) as well as the density, which is the percentage of cells in the matrix that are not empty (2.6%). This may seem small, but remember that the number of items purchased varies from person to person, so this affects how many empty cells there are.

Next, we have the most commonly purchased items. Whole milk and other vegetables were the two most common, followed by other foods. After the most frequent items, we have the size of each transaction. For example, 2159 people purchased one item during a transaction, while one person purchased 32 items in a transaction.

Lastly, we have summary statistics about the transactions. On average, a person purchased 4.4 items per transaction.

We will now look at the support of different items. Remember that support is the frequency of an item in the dataset. We will use the “itemFrequencyPlot” function to do this, adding the argument “topN” to display the 15 most frequent items sorted from most common to least. Below is the code.

itemFrequencyPlot(groceries, topN=15)

The plot that is produced gives you an idea of what people were purchasing. We will now attempt to develop association rules using the “apriori” function.

For now, we will use the default settings for support and confidence (confidence is the proportion of transactions containing the items on the left-hand side of a rule that also contain the item on the right-hand side). The default for support is 0.1 and for confidence it is 0.8. Below is the code.

apriori(groceries)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

As you can see from the printout, nothing meets the criteria of a support of 0.1 and a confidence of 0.8. How to adjust these numbers is a matter of experience, as there are few strong rules for this. Below, I set the support to 0.006, the confidence to 0.25, and the minimum number of items per rule to 2. The support of 0.006 means that an itemset must appear in roughly 59 of the 9,835 transactions, and the confidence of 0.25 means that a rule needs to be accurate 25% of the time. Lastly, I want at least two items in each rule that is produced, as indicated by minlen = 2. Below is the code with the “summary” as well.

groceriesrules<-apriori(groceries, parameter = list(support=0.006, confidence = 0.25, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE   0.006      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(groceriesrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

Our current analysis has 463 rules. This is a major improvement from 0. We can also see how many rules contain 2 (150), 3 (297), and 4 (16) items, along with a summary of the average number of items per rule. Next are descriptive statistics on the support, confidence, and lift of the rules generated.

Something that is new for us is the “lift” column. Lift measures how much more likely an item is to be purchased given the rule than its overall base rate would suggest; it is calculated as lift(X -> Y) = confidence(X -> Y) / support(Y). Anything above 1 means the items appear together more often than would be expected by chance.

We are now going to look for useful rules in our dataset, meaning rules that we can use to make decisions. It takes industry experience in the field your data comes from to really glean useful rules. For now, we can only judge this statistically by sorting the rules by their lift. Below is the code for this.

inspect(sort(groceriesrules, by="lift")[1:7])
##   lhs                    rhs                     support     confidence lift
## 1 {herbs}             => {root vegetables}       0.007015760 0.4312500  3.956
## 2 {berries}           => {whipped/sour cream}    0.009049314 0.2721713  3.796
## 3 {other vegetables,
##    tropical fruit,
##    whole milk}        => {root vegetables}       0.007015760 0.4107143  3.768
## 4 {beef,
##    other vegetables}  => {root vegetables}       0.007930859 0.4020619  3.688
## 5 {other vegetables,
##    tropical fruit}    => {pip fruit}             0.009456024 0.2634561  3.482
## 6 {beef,
##    whole milk}        => {root vegetables}       0.008032537 0.3779904  3.467
## 7 {other vegetables,
##    pip fruit}         => {tropical fruit}        0.009456024 0.3618677  3.448

The first three rules are translated into simple English below as ordered by lift.

  1. If herbs are purchased then root vegetables are purchased
  2. If berries are purchased then whipped sour cream is purchased
  3. If other vegetables, tropical fruit, and whole milk are purchased then root vegetables are purchased

Conclusion

Since we are making no predictions, there is no way to really objectively improve the model. This is normal when the learning is unsupervised. If we had to make a recommendation based on the results we could say that the store should place all vegetables near each other.

The power of market basket analysis is allowing the researcher to identify relationships that may not have been noticed any other way. Naturally, insights gained from this approach must be used for practical actions in the setting in which they apply.

Understanding Market Basket Analysis

Market basket analysis is a machine learning approach that attempts to find relationships among a group of items in a dataset. For example, a famous use of this method was when retailers discovered an association between beer and diapers.

Upon closer examination, the retailers found that when men came to purchase diapers for their babies they would often buy beer in the same trip. With this knowledge, the retailers placed beer and diapers next to each other in the store and this further increased sales.

In addition, many of the recommendation systems we experience when shopping online use market basket analysis results to suggest additional products to us. As such, market basket analysis is an intimate part of our lives without us even knowing it.

In this post, we will look at some of the details of market basket analysis such as association rules, apriori, and the role of support and confidence.

Association Rules

The heart of market basket analysis is the association rule. Association rules describe patterns of relationship among items. Below is an example.

{rice, seaweed} -> {soy sauce}

Everything in curly braces { } is an itemset, which is a group of items that appears in the dataset with some frequency. Rice and seaweed are the itemset on the left and soy sauce is the itemset on the right. The arrow -> indicates the direction of the rule as we read from left to right. If we put this association rule in simple English it would say “if someone buys rice and seaweed, then they will buy soy sauce”.

The practical application of this rule is to place rice, seaweed and soy sauce near each other in order to reinforce this rule when people come to shop.

The Algorithm

Market basket analysis uses the apriori algorithm. This is a form of unsupervised learning, so it requires no training labels and makes no predictions. The apriori algorithm is especially useful with large datasets because it employs simple procedures to find useful relationships among the items.

The shortcut that this algorithm uses is the “apriori property”, which states that all subsets of a frequent itemset must also be frequent. What this means in simple English is that the items in an itemset need to be common in the overall dataset. This simple rule saves a tremendous amount of computational time.

Support and Confidence

Two key pieces of information that can further refine the work of the apriori algorithm are support and confidence. Support is a measure of the frequency of an itemset, ranging from 0 (no support) to 1 (highest support). High support indicates the importance of the itemset in the data and contributes to the itemset being used to generate association rule(s).

Returning to our rice, seaweed, and soy sauce example. We can say that the support for soy sauce is 0.4. This means that soy sauce appears in 40% of the purchases in the dataset which is pretty high.

Confidence is a measure of the accuracy of an association rule, which is also measured from 0 to 1. The higher the confidence, the more accurate the association rule. If we say that our rice, seaweed, and soy sauce rule has a confidence of 0.8, we are saying that when rice and seaweed are purchased together, 80% of the time soy sauce is purchased as well.
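
To make the arithmetic concrete, suppose (these counts are invented for illustration) the dataset contains 1,000 transactions, soy sauce appears in 400 of them, the pair {rice, seaweed} appears in 100, and all three items appear together in 80. Then:

support(soy sauce) = 400 / 1,000 = 0.4
support(rice, seaweed) = 100 / 1,000 = 0.1
support(rice, seaweed, soy sauce) = 80 / 1,000 = 0.08
confidence({rice, seaweed} -> {soy sauce}) = 0.08 / 0.1 = 0.8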

Support and confidence can be used to influence the apriori algorithm by setting cutoff values to be searched for. For example, if we set a minimum support of 0.5 and a confidence of 0.65, we are telling the computer to only report association rules that are above these cutoff points. This helps to remove rules that are obvious or not useful.

Conclusion

Market basket analysis is a useful tool for mining information from large datasets. The rules are easy to understand. In addition, market basket analysis can be used in many fields beyond shopping, including relationships within DNA and other patterns of behavior. As such, care must be taken so that unsound conclusions are not drawn from random patterns in the data.

Support Vector Machines in R

In this post, we will use support vector machine analysis to look at some data available on kaggle. In particular, we will predict what number a person wrote by analyzing the pixels that were used to make the number. The file for this example is available at https://www.kaggle.com/c/digit-recognizer/data

To do this analysis you will need the ‘kernlab’ package. While playing with this dataset I noticed a major problem: doing the analysis with the full data set of 42000 examples took forever. To alleviate this problem, we are going to practice with a training set of 7000 examples and a test set of 3000. Below is the code for the first few steps. Remember that the dataset was downloaded separately.

#load packages
library(kernlab)
digitTrain<-read.csv(file="digitTrain.csv", header=TRUE, sep=",")
#split data 
digitRedux<-digitTrain[1:7000,]
digitReduxTest<-digitTrain[7001:10000,]
#explore data
str(digitRedux)
## 'data.frame':    7000 obs. of  785 variables:
##  $ label   : int  1 0 1 4 0 0 7 3 5 3 ...
##  $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel3  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel4  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel5  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pixel6  : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

From the “str” function you can tell we have a lot of variables (785). This is what slowed the analysis down so much when I tried to run the full 42000 examples in the original dataset.

SVMs need a factor variable as the outcome if possible. We are trying to predict the “label” variable, so we are going to change it to a factor variable because that is what it really is. Below is the code.

#convert label variable to factor
digitRedux$label<-as.factor(digitRedux$label)
digitReduxTest$label<-as.factor(digitReduxTest$label)

Before we continue with the analysis we need to scale our variables. This puts all of the variables within the same range, which helps to equalize their influence. However, we do not want to change our “label” variable, as this is the outcome variable and scaling it would make the results hard to understand. Therefore, we are going to temporarily remove the “label” variable from both of our data sets and save it in a temporary data frame. The code is below.

#temporary dataframe for the label results
keep<-as.data.frame(digitRedux$label)
keeptest<-as.data.frame(digitReduxTest$label)
#null label variable in both datasets
digitRedux$label<-NA
digitReduxTest$label<-NA

Next, we scale the remaining variables and reinsert the label variable for each data set, as shown in the code below.

digitRedux<-as.data.frame(scale(digitRedux))
digitRedux[is.na(digitRedux)]<- 0 #replace NA with 0
digitReduxTest<-as.data.frame(scale(digitReduxTest))
digitReduxTest[is.na(digitReduxTest)]<- 0
#add back label
digitRedux$label<-keep$`digitRedux$label`
digitReduxTest$label<-keeptest$`digitReduxTest$label`

Now we make our model using the “ksvm” function in the “kernlab” package. We set the kernel to “vanilladot”, which is a linear kernel. We will also print the results. However, the results do not make much sense on their own; the model can only be assessed through other means. Below is the code. If you get a warning message about scaling, do not worry about it, as we scaled the data ourselves.

#make the model
number_classify<-ksvm(label~.,  data=digitRedux, 
                      kernel="vanilladot")
#look at the results
number_classify
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 2218 
## 
## Objective Function Value : -0.0623 -0.207 -0.1771 -0.0893 -0.3207 -0.4304 -0.0764 -0.2719 -0.2125 -0.3575 -0.2776 -0.1618 -0.3408 -0.1108 -0.2766 -1.0657 -0.3201 -1.0509 -0.2679 -0.4565 -0.2846 -0.4274 -0.8681 -0.3253 -0.1571 -2.1586 -0.1488 -0.2464 -2.9248 -0.5689 -0.2753 -0.2939 -0.4997 -0.2429 -2.336 -0.8108 -0.1701 -2.4031 -0.5086 -0.0794 -0.2749 -0.1162 -0.3249 -5.0495 -0.8051 
## Training error : 0

We now need to use the “predict” function so that we can determine the accuracy of our model. Remember that for predicting, we compare what our model guesses for the test data to the actual answers in the test data.

number_predict<-predict(number_classify, digitReduxTest)
table(number_predict, digitReduxTest$label)
##               
## number_predict   0   1   2   3   4   5   6   7   8   9
##              0 297   0   3   3   1   4   6   1   1   1
##              1   0 307   4   1   0   4   2   5  11   1
##              2   0   2 268  10   5   1   3  10  12   3
##              3   0   1   7 291   1  11   0   1   8   3
##              4   0   1   3   0 278   4   3   2   0   9
##              5   2   0   1  10   1 238   4   1  11   1
##              6   2   1   1   0   2   1 287   1   0   0
##              7   0   1   1   0   1   0   0 268   3  10
##              8   1   3   4  10   0  11   1   0 236   2
##              9   0   0   2   2   9   2   0  14   2 264
accuracy<-number_predict == digitReduxTest$label
prop.table(table(accuracy))
## accuracy
##      FALSE       TRUE 
## 0.08866667 0.91133333

The table allows you to see how many examples were classified correctly and how the rest were misclassified. The prop.table gives an overall percentage. This particular model was highly accurate at 91%, which would be difficult to improve on much further. Below is the code for a model that uses a different kernel, with results that are barely better. However, if you ever enter a data science competition, any improvement usually helps, even if it is not practical for everyday use.

number_classify_rbf<-ksvm(label~.,  data=digitRedux, 
                          kernel="rbfdot")
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
#evaluate improved model
number_predict_rbf<-predict(number_classify_rbf, digitReduxTest)
table(number_predict_rbf, digitReduxTest$label)
##                   
## number_predict_rbf   0   1   2   3   4   5   6   7   8   9
##                  0 294   0   2   3   1   1   2   1   1   0
##                  1   0 309   1   0   1   0   1   3   4   2
##                  2   4   2 277  12   4   4   8   9   9   8
##                  3   0   1   3 297   1   3   0   1   3   3
##                  4   0   1   3   0 278   4   1   5   0   6
##                  5   0   0   1   6   1 254   4   0   9   1
##                  6   2   1   0   0   2   6 289   0   2   0
##                  7   0   0   3   2   3   0   0 277   2  13
##                  8   2   2   4   3   0   3   1   0 253   3
##                  9   0   0   0   4   7   1   0   7   1 258
accuracy_rbf<-number_predict_rbf == digitReduxTest$label
prop.table(table(accuracy_rbf))
## accuracy_rbf
##      FALSE       TRUE 
## 0.07133333 0.92866667

Conclusion

From this demonstration we can see the power of support vector machines with numerical data. This type of analysis can be used for problems beyond conventional analysis, such as predicting handwritten numbers. As such, SVM is yet another tool available for the data scientist.

Basics of Support Vector Machines

Support vector machines (SVM) are another one of those mysterious black box methods in machine learning. This post will try to explain in simple terms what SVMs are and their strengths and weaknesses.

Definition

SVM is a combination of nearest neighbor and linear regression. From nearest neighbor, SVM uses the traits of identified examples to classify unidentified ones. From regression, a line is drawn that divides the various groups. It is preferred that the line is straight, but this is not always the case.

This combination of using the nearest neighbor along with the development of a line leads to the development of a hyperplane. The hyperplane is drawn in a place that creates the greatest amount of distance among the various groups identified.

The examples in each group that are closest to the hyperplane are the support vectors. They support the hyperplane by providing the boundaries for the various groups.

If for whatever reason a line cannot be straight because the boundaries are not nice and neat, R will still draw a straight line but make accommodations through the use of a slack variable, which allows for error and/or for examples to be in the wrong group.

Another trick used in SVM analysis is the kernel trick. A kernel will add a new dimension or feature to the analysis by combining features that were measured in the data. For example, latitude and longitude might be combined mathematically to make altitude. This new feature is now used to develop the hyperplane for the data.

There are several different types of kernel tricks that achieve this goal using different mathematics. There is no rule for which one to use, and trying different choices is currently the only strategy.
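
As a rough sketch of what trying different choices might look like, the code below fits an SVM with several of the kernels that come with the “kernlab” package on R’s built-in iris data. The dataset and the particular kernels chosen are just stand-ins for illustration.

library(kernlab)
#fit the same SVM with several kernels and compare the printouts
for (k in c("vanilladot", "polydot", "rbfdot", "tanhdot")) {
        fit<-ksvm(Species~., data=iris, kernel=k)
        print(k)
        print(fit)   #each printout includes the training error for that kernel
}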

Pros and Cons

The pros of SVM include their flexibility, as they can be used to predict numbers or to classify. SVMs are also able to deal with noisy data and are easier to use than artificial neural networks. Lastly, SVMs are often able to resist overfitting and are usually highly accurate.

The cons of SVM are that they are still complex, as they are members of the black box family of machine learning methods, even if they are simpler than artificial neural networks. In addition, the lack of criteria for kernel selection makes it difficult to determine which model is the best.

Conclusion

SVM provide yet another approach to analyzing data in a machine learning context. Success with this approach depends on determining specifically what the goals of a project are.

Developing an Artificial Neural Network in R

In this post, we are going to make an artificial neural network (ANN) by analyzing some data about computers. Specifically, we are going to make an ANN to predict the price of computers.

We will be using the “Ecdat” package and the data set “Computers” from this package. In addition, we are going to use the “neuralnet” package to conduct the ANN analysis. Below is the code for the packages and dataset we are using.

library(Ecdat);library(neuralnet)
#load data set
data("Computers")

Explore the Data

The first step is always data exploration. We will first look at the nature of the data using the “str” function and then use the “summary” function. Below is the code.

str(Computers)
## 'data.frame':    6259 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 1849 3295 ...
##  $ speed  : num  25 33 25 25 33 66 25 50 50 50 ...
##  $ hd     : num  80 85 170 170 340 340 170 85 210 210 ...
##  $ ram    : num  4 2 4 8 16 16 4 2 8 4 ...
##  $ screen : num  14 14 15 14 14 14 14 14 14 15 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...
lapply(Computers, summary)
## $price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     949    1794    2144    2220    2595    5399 
## 
## $speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   50.00   52.01   66.00  100.00 
## 
## $hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    80.0   214.0   340.0   416.6   528.0  2100.0 
## 
## $ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   8.000   8.287   8.000  32.000 
## 
## $screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   14.00   14.00   14.61   15.00   17.00 
## 
## $cd
##   no  yes 
## 3351 2908 
## 
## $multi
##   no  yes 
## 5386  873 
## 
## $premium
##   no  yes 
##  612 5647 
## 
## $ads
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    39.0   162.5   246.0   221.3   275.0   339.0 
## 
## $trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   16.00   15.93   21.50   35.00

ANN is primarily for numerical data, not categorical data or factors. As such, we will remove the factor variables cd, multi, and premium from further analysis. Below are the code and histograms of the remaining variables.

hist(Computers$price);hist(Computers$speed);hist(Computers$hd);hist(Computers$ram)


hist(Computers$screen);hist(Computers$ads);hist(Computers$trend)


Clean the Data

Looking at the summary combined with the histograms indicates that we need to normalize our data, as this is a requirement for ANN. We want all of the variables to have an equal influence initially. Below is the code for the function we will use to normalize our variables.

normalize<-function(x) {
        #shift each value so the minimum becomes 0, then divide by the original maximum
        return((x-min(x)) / (max(x)))
}
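
As a quick sanity check (not part of the original analysis), we can run the function on a simple vector to see what it returns.

normalize(c(1, 2, 3, 4, 5))
## [1] 0.0 0.2 0.4 0.6 0.8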

Explore the Data Again

We now need to make a new dataframe that has only the variables we are going to use for the analysis. Then we will use our “normalize” function to rescale the variables. Lastly, we will re-explore the data to make sure everything is okay using the “summary” and “hist” functions. Below is the code.

#make dataframe without factor variables
Computers_no_factors<-data.frame(Computers$price,Computers$speed,Computers$hd,
                              Computers$ram,Computers$screen,Computers$ad,
                              Computers$trend)
#make a normalized dataframe of the data
Computers_norm<-as.data.frame(lapply(Computers_no_factors, normalize))
#reexamine the normalized data
lapply(Computers_norm, summary)
## $Computers.price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1565  0.2213  0.2353  0.3049  0.8242 
## 
## $Computers.speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0800  0.2500  0.2701  0.4100  0.7500 
## 
## $Computers.hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.06381 0.12380 0.16030 0.21330 0.96190 
## 
## $Computers.ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0625  0.1875  0.1965  0.1875  0.9375 
## 
## $Computers.screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03581 0.05882 0.17650 
## 
## $Computers.ad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3643  0.6106  0.5378  0.6962  0.8850 
## 
## $Computers.trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2571  0.4286  0.4265  0.5857  0.9714
hist(Computers_norm$Computers.price);hist(Computers_norm$Computers.speed);


hist(Computers_norm$Computers.hd);hist(Computers_norm$Computers.ram);


hist(Computers_norm$Computers.screen);hist(Computers_norm$Computers.ad);


hist(Computers_norm$Computers.trend)


Develop the Model

Everything looks good, so we will now split our data into a training and testing set and develop the ANN model that will predict computer price. Below is the code for this.

#split data into training and testing
Computer_train<-Computers_norm[1:4694,]
Computers_test<-Computers_norm[4695:6259,]
#model
computer_model<-neuralnet(Computers.price~Computers.speed+Computers.hd+
                                  Computers.ram+Computers.screen+
                                  Computers.ad+Computers.trend, Computer_train)

Our initial model is a simple feedforward network with a single hidden node. You can visualize the model using the “plot” function as shown in the code below.

plot(computer_model)

Screenshot from 2016-07-11 14:35:35.png

We now need to evaluate our model’s performance. We will use the “compute” function to generate predictions, which will then be compared to the actual prices in the test data set using a Pearson correlation. Since we are predicting a number rather than classifying, we cannot measure accuracy with a confusion matrix, so we use correlation instead. Below is the code followed by the results.

evaluate_model<-compute(computer_model, Computers_test[2:7])
predicted_price<-evaluate_model$net.result
cor(predicted_price, Computers_test$Computers.price)
##              [,1]
## [1,] 0.8809571295

The correlation between the predicted results and the actual results is 0.88, which is a strong relationship. This indicates that our model does an excellent job of predicting the price of a computer based on RAM, screen size, speed, hard drive size, advertising, and trend.

Develop Refined Model

Just for fun, we are going to make a more complex model with three hidden nodes and see how the results change. Below is the code.

computer_model2<-neuralnet(Computers.price~Computers.speed+Computers.hd+
                                   Computers.ram+Computers.screen+ 
                                   Computers.ad+Computers.trend, 
                           Computer_train, hidden =3)
plot(computer_model2)
evaluate_model2<-compute(computer_model2, Computers_test[2:7])
predicted_price2<-evaluate_model2$net.result
cor(predicted_price2, Computers_test$Computers.price)

Screenshot from 2016-07-11 14:36:34.png

##              [,1]
## [1,] 0.8963994092

The correlation improves slightly to 0.89. As such, the increased complexity did not yield much of an improvement in the overall correlation. Therefore, the simpler single-node model is more appropriate.

Conclusion

In this post, we explored an application of artificial neural networks. This black box method is useful for making powerful predictions in highly complex data.

Black Box Method-Artificial Neural Networks

In machine learning, there is a set of analytical techniques known as black box methods. What is meant by black box methods is that the actual models developed are derived from complex mathematical processes that are difficult to understand and interpret. This difficulty in understanding them is what makes them mysterious.

One black box method is the artificial neural network (ANN). This method tries to imitate mathematically the behavior of neurons in the human nervous system. This post will attempt to explain ANNs in simple terms.

The Human Mind and the Artificial One

We will begin by looking at how real neurons work before looking at ANNs. In simple terms, as this is not a biology blog, neurons send and receive signals. They receive signals through their dendrites, process the information in the soma, and then send signals out through their axon terminals. Below is a picture of a neuron.

download.jpg

An ANN works in a highly similar manner. The x variables act as the dendrites, providing information that is summed in the cell body. Different dendrites, or x variables, can have different weights (w). Next, the weighted sum of the x variables is passed to an activation function before moving to the output, or dependent variable, y. Below is a picture of this process.

Diagram1

If you compare the two pictures they are similar yet different. ANN mimics the mind in a way that has fascinated people for over 50 years.

Activation Function (f)

The activation function’s purpose is to determine whether activation should take place. In the human body, activation takes place when a nerve cell sends a message to the next cell. This indicates that the message was strong enough to move forward.

The same concept applies in an ANN. A signal will not be passed on unless it meets a minimum threshold. This threshold can vary depending on how the ANN is modeled.
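
The sketch below, which is just an illustration with made-up numbers rather than anything from a real network, shows the basic arithmetic a single artificial neuron performs: a weighted sum of the inputs passed through an activation function.

x<-c(0.5, 0.2, 0.9)                          #input (x) variables
w<-c(0.4, 0.3, 0.6)                          #weights (w) on each input
activation<-function(z) { 1/(1 + exp(-z)) }  #a sigmoid activation function
y<-activation(sum(w * x))                    #the output signal
y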

ANN Design

The makeup of an ANN can vary greatly. Some models have more than one output variable, as shown below.

Diagram2

Two outputs

Other models have what are called hidden layers. These are nodes that receive signals from the input variables and pass their results on toward the output variables. They could be seen as mediating variables. Below is a visual example.

Diagram2

The hidden layer is the two circles in the middle

How many layers to develop is left to the researcher. When models become really complex, with several hidden layers and/or outcome variables, this is referred to as deep learning in the machine learning community.
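
For readers who want to see what adding hidden layers looks like in code, below is a hypothetical sketch using the “neuralnet” package from the post above; the tiny simulated data frame and the choice of 5 and 3 nodes are assumptions made only for illustration.

library(neuralnet)
set.seed(1)
#a small simulated dataset just for illustration
df<-data.frame(x1=runif(100), x2=runif(100))
df$y<-df$x1 + df$x2 + rnorm(100, sd=0.05)
#hidden=c(5, 3) requests two hidden layers with 5 and 3 nodes respectively
deep_model<-neuralnet(y~x1+x2, data=df, hidden=c(5, 3))
plot(deep_model)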

Another complexity of ANN is the direction of information. Just as in the human body, information can move forward and backward in an ANN. This provides opportunities to model highly complex data and relationships.

Conclusion

ANNs can be used for classifying virtually anything. They are also highly accurate models that are not bogged down by many assumptions. However, ANNs are so hard to understand that it can be difficult to use them despite their advantages. As such, this form of analysis is most beneficial when the user is able to explain the results.

Making Regression and Model Trees in R

In this post, we will look at an example of regression trees. Regression trees use a decision tree-like approach to develop prediction models for numerical data. In our example, we will be trying to predict how many kids a person has based on several independent variables in the “PSID” data set in the “Ecdat” package.

Let’s begin by loading the necessary packages and data set. The code is below

library(Ecdat);library(rpart);library(rpart.plot)
library(RWeka)
data(PSID)
str(PSID)
## 'data.frame':    4856 obs. of  8 variables:
##  $ intnum  : int  4 4 4 4 5 6 6 7 7 7 ...
##  $ persnum : int  4 6 7 173 2 4 172 4 170 171 ...
##  $ age     : int  39 35 33 39 47 44 38 38 39 37 ...
##  $ educatn : int  12 12 12 10 9 12 16 9 12 11 ...
##  $ earnings: int  77250 12000 8000 15000 6500 6500 7000 5000 21000 0 ...
##  $ hours   : int  2940 2040 693 1904 1683 2024 1144 2080 2575 0 ...
##  $ kids    : int  2 2 1 2 5 2 3 4 3 5 ...
##  $ married : Factor w/ 7 levels "married","never married",..: 1 4 1 1 1 1 1 4 1 1 ...
summary(PSID)
##      intnum        persnum            age           educatn     
##  Min.   :   4   Min.   :  1.00   Min.   :30.00   Min.   : 0.00  
##  1st Qu.:1905   1st Qu.:  2.00   1st Qu.:34.00   1st Qu.:12.00  
##  Median :5464   Median :  4.00   Median :38.00   Median :12.00  
##  Mean   :4598   Mean   : 59.21   Mean   :38.46   Mean   :16.38  
##  3rd Qu.:6655   3rd Qu.:170.00   3rd Qu.:43.00   3rd Qu.:14.00  
##  Max.   :9306   Max.   :205.00   Max.   :50.00   Max.   :99.00  
##                                                  NA's   :1      
##     earnings          hours           kids                 married    
##  Min.   :     0   Min.   :   0   Min.   : 0.000   married      :3071  
##  1st Qu.:    85   1st Qu.:  32   1st Qu.: 1.000   never married: 681  
##  Median : 11000   Median :1517   Median : 2.000   widowed      :  90  
##  Mean   : 14245   Mean   :1235   Mean   : 4.481   divorced     : 645  
##  3rd Qu.: 22000   3rd Qu.:2000   3rd Qu.: 3.000   separated    : 317  
##  Max.   :240000   Max.   :5160   Max.   :99.000   NA/DF        :   9  
##                                                   no histories :  43

The variables “intnum” and “persnum” are for identification and are useless for our analysis. We will now explore our dataset with the following code.

hist(PSID$age)

Rplot.jpeg

hist(PSID$educatn)

Rplot06.jpeg

hist(PSID$earnings)

Rplot02

hist(PSID$hours)

Rplot03

hist(PSID$kids)

Rplot04

table(PSID$married)
## 
##       married never married       widowed      divorced     separated 
##          3071           681            90           645           317 
##         NA/DF  no histories 
##             9            43

Almost all of the variables are non-normal. However, this is not a problem when using regression trees. There are some major problems with the “kids” and “educatn” variables. Each of these variables has values at 98 and 99. When the data for this survey was collected 98 meant the respondent did not know the answer and a 99 means they did not want to say. Since both of these variables are numerical we have to do something with them so they do not ruin our analysis.

We are going to recode all values equal to or greater than 98 as 3 for the “kids” variable. The number 3 means they have 3 kids. This number was picked because it was the most common response for the other respondents. For the “educatn” variable all values equal to or greater than 98 are recoded as 12, which means that they completed 12th grade. Again this was the most frequent response. Below is the code.

PSID$kids[PSID$kids >= 98] <- 3
PSID$educatn[PSID$educatn >= 98] <- 12

Another peek at the histograms for these two variables and things look much better.

hist(PSID$kids)

Rplot05.jpeg

hist(PSID$educatn)

Rplot01

Make Model and Visualization

Now that everything is cleaned up, we need to make our training and testing data sets, as seen in the code below.

PSID_train<-PSID[1:3642,]
PSID_test<-PSID[3643:4856,]

We will now make our model and also create a visual of it. Our goal is to predict the number of children a person has based on their age, education, earnings, hours worked, and marital status. Below is the code.

#make model
PSID_Model<-rpart(kids~age+educatn+earnings+hours+married, PSID_train)
#make visualization
rpart.plot(PSID_Model, digits=3, fallen.leaves = TRUE,type = 3, extra=101)

Rplot07

The first split in the tree is on earnings: those who make more than 20k go one way and those who make less than 20k go the other. For those making more than 20k, the next split is on marriage. Those who are never married or not applicable have 0.74 kids on average, while those who are married, widowed, divorced, separated, or have no history have 1.72 on average.

The branch for those making less than 20k is much more complicated, and I will not explain all of it. After the earnings split, the next split is again on marriage. For example, those who are married, widowed, divorced, separated, or have no history and have less than 13.5 years of education have 2.46 kids on average.

Make Prediction Model and Conduct Evaluation

Our next task is to make the prediction model. We will do this with the following code

PSID_pred<-predict(PSID_Model, PSID_test)

We will now evaluate the model. We will do this three different ways. The first involves looking at the summary statistics of the prediction model and the testing data. The numbers should be about the same. After that, we will calculate the correlation between the prediction model and the testing data. Lastly, we will use a technique called the mean absolute error. Below is the code for the summary statistics and correlation.

summary(PSID_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.735   2.041   2.463   2.226   2.463   2.699
summary(PSID_test$kids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   2.494   3.000  10.000
cor(PSID_pred, PSID_test$kids)
## [1] 0.308116

Looking at the summary stats, our model has a hard time predicting extreme values, because the maximum values of the predictions and the test data are far apart. However, how often do people have ten kids? As such, this is not a major concern.

A look at the correlation finds that it is pretty low (0.30), which means the predictions and the actual values have little in common and that we need to make some changes. The mean absolute error is a measure of the average difference between the predicted and actual values in a model. We need to write a function for it before we can assess our model.

MAE<-function(actual, predicted){
        #mean absolute error: the average absolute difference between actual and predicted values
        mean(abs(actual-predicted))
}

We now assess the model with the code below

MAE(PSID_pred, PSID_test$kids)
## [1] 1.134968

The results indicate that on average the difference between our model’s prediction of the number of kids and the actual number of kids was 1.13 on a scale of 0 to 10. That’s a lot of error. However, we need to compare this number to how well the mean does in order to have a benchmark. The code is below.

ave_kids<-mean(PSID_train$kids)
MAE(ave_kids, PSID_test$kids)
## [1] 1.178909

Model Tree

Our model, with a score of 1.13, is slightly better than using the mean, which scores 1.17. We will try to improve our model by switching from a regression tree to a model tree, which uses a slightly different approach to prediction. In a model tree, each leaf of the tree ends in a linear regression model. Below is the code.

PSIDM5<- M5P(kids~age+educatn+earnings+hours+married, PSID_train)
PSIDM5
## M5 pruned model tree:
## (using smoothed linear models)
## 
## earnings <= 20754 : 
## |   earnings <= 2272 : 
## |   |   educatn <= 12.5 : LM1 (702/111.555%)
## |   |   educatn >  12.5 : LM2 (283/92%)
## |   earnings >  2272 : LM3 (1509/88.566%)
## earnings >  20754 : LM4 (1147/82.329%)
## 
## LM num: 1
## kids = 
##  0.0385 * age 
##  + 0.0308 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.0187 * married=married,divorced,widowed,separated,no histories 
##  + 0.2986 * married=divorced,widowed,separated,no histories 
##  + 0.0082 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 0.7181
## 
## LM num: 2
## kids = 
##  0.002 * age 
##  - 0.0028 * educatn 
##  + 0.0002 * earnings 
##  - 0 * hours 
##  + 0.7854 * married=married,divorced,widowed,separated,no histories 
##  - 0.3437 * married=divorced,widowed,separated,no histories 
##  + 0.0154 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 1.4075
## 
## LM num: 3
## kids = 
##  0.0305 * age 
##  - 0.1362 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.9028 * married=married,divorced,widowed,separated,no histories 
##  + 0.2151 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 2.0218
## 
## LM num: 4
## kids = 
##  0.0393 * age 
##  - 0.0658 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.8845 * married=married,divorced,widowed,separated,no histories 
##  + 0.3666 * married=widowed,separated,no histories 
##  + 0.0037 * married=separated,no histories 
##  + 0.4712
## 
## Number of Rules : 4

It would take too much time to explain everything, but you can read part of this model as follows: if earnings are greater than 20754, use linear model 4; if earnings are less than 20754 and less than 2272 with 12.5 or fewer years of education, use linear model 1; if earnings are less than 20754 and less than 2272 with more than 12.5 years of education, use linear model 2; and if earnings are less than 20754 but greater than 2272, use linear model 3. The printout then shows each of the linear models. Lastly, we will evaluate our model tree with the following code.

PSIDM5_Pred<-predict(PSIDM5, PSID_test)
summary(PSIDM5_Pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3654  2.0490  2.3400  2.3370  2.6860  4.4220
cor(PSIDM5_Pred, PSID_test$kids)
## [1] 0.3486492
MAE(PSID_test$kids, PSIDM5_Pred)
## [1] 1.088617

This model is slightly better. For example, it is better at predicting extreme values, reaching 4.42 compared to 2.70 for the regression tree model. The correlation is 0.35, which is better than 0.30 for the regression tree model. Lastly, the mean absolute error shows a slight improvement at 1.09 compared to 1.13 for the regression tree model.

Conclusion

This post provided examples of the use of regression trees and model trees. Both of these models make prediction a key component of their analysis.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories. Such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

Using trees to make numeric predictions involves regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with trees.

Regression Trees and Model Trees

Regression trees have been around since the 1980s. They work by predicting the average value of the examples that reach a given leaf in the tree. Despite their name, there is no regression involved in regression trees. Regression trees are straightforward to interpret, but at the expense of accuracy.

Model trees are similar to regression trees but fit a multiple regression to the examples at each leaf in the tree. This leads to many different regression models being used throughout a tree, which makes model trees harder to interpret and understand than regression trees. However, they are normally much more accurate.

Both types of trees have the goal of making groups that are as homogeneous as possible. For classification decision trees, entropy is used to measure the homogeneity of groups. For numeric prediction trees, the standard deviation reduction (SDR) is used. The details of SDR are somewhat complex and technical and will be avoided for that reason.

Strengths of Numeric Prediction Trees

Numeric prediction trees do not carry the assumptions of linear regression. As such, they can be used to model non-normal and/or non-linear data. In addition, if a dataset has a large number of features, a numeric prediction tree can automatically select the most appropriate ones. Lastly, numeric prediction trees do not require the model to be specified in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amount of training data in order to develop a testable model. It can also be hard to tell which variables are most important in shaping the outcome. Lastly, numeric prediction trees are sometimes hard to interpret, which naturally limits their usefulness among people who lack statistical training.

Conclusion

Numeric prediction trees combine the strengths of decision trees with the ability to digest a large number of numerical variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, when possible, it is usually wise to try simpler methods first.

Developing Classification Rules in R

library(Ecdat)
library(RWeka)
data("Males")

1. Explore the Data

The first step as always is to explore the data. We need to determine what kind of variables we are working with by using the “str” function

str(Males)
## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...

The first two variables “nr” and “year” are not too useful for us. “nr” is an identification number and “year” is the year the data was collected. We shouldn’t expect much change in marriage rates over a few years and the identification number has no meaning in our analysis. So we will ignore these two variables.

Next, we visualize the data with some tables and histograms. Integer variables will be visualized with histograms and factor variables with tables. The code and results are below.

prop.table(table(Males$maried))
## 
##        no       yes 
## 0.5610092 0.4389908
prop.table(table(Males$union))
## 
##        no       yes 
## 0.7559633 0.2440367
prop.table(table(Males$ethn))
## 
##     other     black      hisp 
## 0.7284404 0.1155963 0.1559633
prop.table(table(Males$industry))
## 
##                     Agricultural                           Mining 
##                       0.03211009                       0.01559633 
##                     Construction                            Trade 
##                       0.07500000                       0.26811927 
##                   Transportation                          Finance 
##                       0.06559633                       0.03692661 
##      Business_and_Repair_Service                 Personal_Service 
##                       0.07591743                       0.01674312 
##                    Entertainment                    Manufacturing 
##                       0.01513761                       0.28233945 
## Professional_and_Related Service            Public_Administration 
##                       0.07637615                       0.04013761
prop.table(table(Males$health))
## 
##         no        yes 
## 0.98302752 0.01697248
prop.table(table(Males$residence))
## 
##      rural_area      north_east nothern_central           south 
##      0.02728732      0.23531300      0.30947030      0.42792937
prop.table(table(Males$occupation))
## 
## Professional, Technical_and_kindred Managers, Officials_and_Proprietors 
##                          0.10389908                          0.09151376 
##                       Sales_Workers                Clerical_and_kindred 
##                          0.05344037                          0.11146789 
##      Craftsmen, Foremen_and_kindred              Operatives_and_kindred 
##                          0.21422018                          0.20206422 
##                Laborers_and_farmers           Farm_Laborers_and_Foreman 
##                          0.09197248                          0.01467890 
##                     Service_Workers 
##                          0.11674312
#explore histograms
hist(Males$school)

Rplot10

hist(Males$exper)

Rplot

hist(Males$wage)

Rplot01

There is no time or space to explain the tables and histograms in detail. Only two things are worth mentioning. 1. Our classifying variable, “maried”, is mostly balanced between those who are married and those who are not (56% to 44%). 2. The “health” variable is horribly unbalanced and needs to be removed (98% no vs 2% yes).

2. Develop and Evaluate the Model

We can now train our first model. The first model will be a one-rule model, which means that R will develop a single rule for classification purposes. In this post, we are not doing any prediction, since we simply want to make a rule for the sample data. Below is the code.

Males_1R<-OneR(maried~ethn+union+industry+school+exper+
                       occupation+residence, data=Males)

We used the “OneR” function to create the model. This function analyzes the data and makes a single rule for it. We will now evaluate the model by first looking at the rule that was generated.

Males_1R
## exper:
##  < 7.5   -> no
##  < 12.5  -> yes
##  < 13.5  -> no
##  >= 13.5 -> yes
## (1973/3115 instances correct)

The “exper” variable was selected for generating the rule. Stated plainly, the rule literally means “if a man has less than 7.5 years of experience he is not married; if he has more than 7.5 but less than 12.5 years of experience he is married; if he has more than 12.5 but less than 13.5 years of experience he is not married; and if he has more than 13.5 years of experience he is married.”

Explaining this can take many interpretations. Young guys have less experience, so they aren’t ready to marry. After about 8 years they marry. However, after about 12 years of experience males are suddenly not married, which is probably due to divorce. After 13 years, the typical male is married again. This may be because his marriage survived the scary 12th year, or it may be due to remarriage.

However, as we look at the accuracy of the model, we will see some problems, as you will notice below after typing in the following code.

summary(Males_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        1973               63.3387 %
## Incorrectly Classified Instances      1142               36.6613 %
## Kappa statistic                          0.2287
## Mean absolute error                      0.3666
## Root mean squared error                  0.6055
## Relative absolute error                 75.2684 %
## Root relative squared error            122.6945 %
## Coverage of cases (0.95 level)          63.3387 %
## Mean rel. region size (0.95 level)      50      %
## Total Number of Instances             3115     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  1351  457 |    a = no
##   685  622 |    b = yes

We only correctly classified 63% of the data. This is pretty bad. Perhaps if we change our approach and develop more than one rule we will have more success.

We will now use the “JRip” function to develop multiple classification rules. Below is the code.

Males_JRip<-JRip(maried~ethn+union+school+exper+industry+occupation+residence, data=Males)
Males_JRip
## JRIP rules:
## ===========
## 
## (exper >= 7) and (occupation = Craftsmen, Foremen_and_kindred) and (school >= 9) and (residence = south) and (exper >= 8) and (union = yes) => maried=yes (28.0/3.0)
## (exper >= 6) and (exper >= 8) and (school >= 11) and (ethn = other) => maried=yes (649.0/238.0)
## (exper >= 6) and (residence = south) and (ethn = hisp) => maried=yes (102.0/36.0)
## (exper >= 5) and (school >= 14) and (school >= 15) => maried=yes (76.0/25.0)
## (exper >= 5) and (ethn = other) and (occupation = Craftsmen, Foremen_and_kindred) => maried=yes (232.0/93.0)
##  => maried=no (2028.0/615.0)
## 
## Number of Rules : 6

There are six rules altogether; their meaning is below.

  1. If a male has at least eight years of experience, works as a craftsman or foreman, has at least nine years of school, lives in the south, and belongs to a union, then he is married.
  2. If a male has at least eight years of experience, has at least 11 years of school, and his ethnicity is other, then he is married.
  3. If a male has at least six years of experience, resides in the south, and his ethnicity is Hispanic, then he is married.
  4. If a male has at least five years of experience and at least 15 years of school, then he is married.
  5. If a male has at least five years of experience, his ethnicity is other, and his occupation is craftsman or foreman, then he is married.
  6. If none of these apply, then he is not married.

Notice how all of the generated rules begin with “exper”. This is one reason why the “OneR” function made its rule on experience; experience is the best predictor of marriage in this dataset. However, our accuracy has not improved much, as you will see in the following code.

summary(Males_JRip)
## 
## === Summary ===
## 
## Correctly Classified Instances        2105               67.5762 %
## Incorrectly Classified Instances      1010               32.4238 %
## Kappa statistic                          0.3184
## Mean absolute error                      0.4351
## Root mean squared error                  0.4664
## Relative absolute error                 89.3319 %
## Root relative squared error             94.5164 %
## Coverage of cases (0.95 level)         100      %
## Mean rel. region size (0.95 level)     100      %
## Total Number of Instances             3115     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  1413  395 |    a = no
##   615  692 |    b = yes

We are only at 67%, which is not much better. Since this is a demonstration, the actual numbers do not matter as much.

Conclusion 

Classification rules provide easy-to-understand rules for organizing data into homogeneous groups. This is yet another way to analyze data with machine learning approaches.

Classification Rules in Machine Learning

Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. The antecedent is the before and the consequent is the after. For example, I may have the following rule.

If the student studies 5 hours a week then they will pass the class with an A.

This simple rule can be broken down into the following antecedent and consequent.

  • Antecedent: If the student studies 5 hours a week
  • Consequent: then they will pass the class with an A

The antecedent determines whether the consequent takes place. In this example, the student must study 5 hours a week to get an A. This is the rule in this particular context.
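
As a small illustration (the variable name and cutoff here are made up for this example), the rule above could be written in code as a simple if-else statement, with the antecedent as the condition and the consequent as the result.

hours_studied<-5
if (hours_studied >= 5) {        #antecedent
        grade<-"A"               #consequent
} else {
        grade<-"not an A"
}
grade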

This post will further explain the characteristics and traits of classification rules.

Classification Rules and Decision Trees

Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step process to make a decision.

Classification rules are stand-alone rules that are abstracted from a process. To appreciate a classification rule, you do not need to be familiar with the process that created it, while with decision trees you do need to be familiar with the process that generated the decision.

One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for large amounts of numeric variables. This is not a problem with decision trees.

The Algorithm

Classification rules use algorithms that employ a separate-and-conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subsets by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics.

Common algorithms used for classification rules include the One Rule algorithm and the RIPPER algorithm. The One Rule algorithm analyzes the data and generates a single, all-encompassing rule. It works by finding the single rule that contains the least amount of error. Despite its simplicity, it is surprisingly accurate.

The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that it no longer helps to purify the various groups, the rule is pruned; that is, the part of the rule that is not beneficial is removed. This process of growing and pruning rules continues until there is no further benefit.

RIPPER rules are more complex than those of the One Rule algorithm. This allows for the development of more nuanced models. The drawback is that the rules can become too complex to make practical sense.

Conclusion

Classification rules are a useful way to develop clear principles from the data. The advantage of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules.

Making a Decision Tree in R

In this post, we are going to learn how to use the C5.0 algorithm to make a classification tree in order to make predictions about gender based on wage, education, and job experience using a data set in the “Ecdat” package in R. Below is some code to get started.

library(Ecdat); library(C50); library(gmodels)
 data(Wages1)

We now will explore the data to get a sense of what is happening in it. Below is the code for this

str(Wages1)
 ## 'data.frame': 3294 obs. of 4 variables:
 ## $ exper : int 9 12 11 9 8 9 8 10 12 7 ...
 ## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
 ## $ school: int 13 12 11 14 14 14 12 12 10 12 ...
 ## $ wage : num 6.32 5.48 3.64 4.59 2.42 ...
 hist(Wages1$exper)

Rplot02

summary(Wages1$exper)
 ## Min. 1st Qu. Median Mean 3rd Qu. Max.
 ## 1.000 7.000 8.000 8.043 9.000 18.000

hist(Wages1$wage)

Rplot

summary(Wages1$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07656 3.62200 5.20600 5.75800 7.30500 39.81000

hist(Wages1$school)

Rplot01

summary(Wages1$school)
 ## Min. 1st Qu. Median Mean 3rd Qu. Max.
 ## 3.00 11.00 12.00 11.63 12.00 16.00

table(Wages1$sex)
## female male
## 1569 1725

As you can see, we have four features (exper, sex, school, wage) in the “Wages1” data set. The histogram for “exper” indicates that it is roughly normally distributed. The “wage” feature is highly skewed to the right and almost bimodal. This is not a big deal, as classification trees are robust against non-normality. The “school” feature is mostly normally distributed. Lastly, the “sex” feature is categorical, and there is almost an equal number of men and women in the data. All of the summary outputs are listed above.

Create Training and Testing Sets

We now need to create our training and testing data sets. In order to do this, we need to first randomly reorder our data set. For example, if the data is sorted by one of the features, to split it now would lead to extreme values all being lumped together in one data set.

We also need to set our seed. This allows us to replicate our results. Below is the code for doing this.

set.seed(12345)
 Wage_rand<-Wages1[order(runif(3294)),]

What we did is explained as follows:

  1. We set the seed using the ‘set.seed’ function (we randomly picked the number 12345).
  2. We created the variable ‘Wage_rand’ and assigned it the reordered data described below.
  3. We used the ‘runif’ function to generate 3294 random numbers, one for each of the 3294 examples in the ‘Wages1’ dataset.
  4. We passed these random numbers to the ‘order’ function, which returns the row positions sorted by those random numbers, giving us a random ordering of the rows.
  5. We then used this random ordering to shuffle the examples in the ‘Wages1’ dataset.

We will now create our training and testing sets using the code below.

Wage_train<-Wage_rand[1:2294,]
 Wage_test<-Wage_rand[2295:3294,]

Make the Model

We can now begin training a model. Below is the code.

Wage_model<-C5.0(Wage_train[-2], Wage_train$sex)

The coding for making the model should be familiar by now. One thing that is new is the brackets with the -2 inside. This tells R to ignore the second column in the dataset. We are doing this because we want to predict sex; if it were part of the independent variables, we could not predict it. We can now examine the results of our model by using the following code.

Wage_model
##
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
##
## Classification Tree
## Number of samples: 2294
## Number of predictors: 3
##
## Tree size: 9
##
## Non-standard options: attempt to group attributes
summary(Wage_model)
##
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 2294 cases (4 attributes) from undefined.data
##
## Decision tree:
##
## wage <= 3.985179:
## :...school > 11: female (345/109)
## :   school <= 11:
## :   :...exper <= 8: female (224/96)
## :       exper > 8: male (143/59)
## wage > 3.985179:
## :...wage > 9.478313: male (254/61)
##     wage <= 9.478313:
##     :...school > 12: female (320/132)
##         school <= 12:
##         :...school <= 10: male (246/70)
##             school > 10:
##             :...school <= 11: male (265/114)
##                 school > 11:
##                 :...exper <= 6: female (83/35)
##                     exper > 6: male (414/173)
##
##
## Evaluation on training data (2294 cases):
##
## Decision Tree
## —————-
## Size Errors
##
## 9 849(37.0%) <<
##
##
## (a) (b)
## ---- ----
## 600 477 (a): class female
## 372 845 (b): class male
##
##
## Attribute usage:
##
## 100.00% wage
## 88.93% school
## 37.66% exper
##
##
## Time: 0.0 secs

The “Wage_model” output indicates a small decision tree of only 9 decisions. The “summary” function shows the actual decision tree. It’s somewhat complicated, but I will explain the beginning part of the tree.

If wage is less than or equal to 3.985 THEN

If school is greater than 11 the person is female ELSE

If school is less than or equal to 11 THEN

If experience is less than or equal to 8 the person is female ELSE

If experience is greater than 8 the person is male, and so on.

The next part of the output shows the amount of error. This model misclassified 37% of the examples, which is pretty high. Looking at the confusion matrix, 477 women were misclassified as men and 372 men were misclassified as women.

Predict with the Model

We will now see how well this model predicts gender in the testing set. Below is the code

Wage_pred<-predict(Wage_model, Wage_test)

CrossTable(Wage_test$sex, Wage_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual sex', 'predicted sex'))

The output will not display properly here. Please see C50 for a pdf of this post and go to page 7

Again, this code should be mostly familiar by now. For the table, we are comparing the actual sex in the test set with the predicted sex. The overall model was correct (269 + 346)/1000 times, a 61.5% accuracy rate, which is pretty bad.

Improve the Model

There are two ways we are going to try and improve our model. The first is adaptive boosting and the second is error cost.

Adaptive boosting involves making several models that “vote” on how to classify an example. To do this, you need to add the ‘trials’ parameter to the code. The ‘trials’ parameter sets the upper limit on the number of models R will iterate through if necessary. Below is the code for this along with the results.

Wage_boost10<-C5.0(Wage_train[-2], Wage_train$sex, trials = 10)
 #view boosted model
 summary(Wage_boost10)
 ##
 ## Call:
 ## C5.0.default(x = Wage_train[-2], y = Wage_train$sex, trials = 10)
 ##
 ##
 ## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
 ## -------------------------------
 ##
 ## Class specified by attribute `outcome'
 ##
 ## Read 2294 cases (4 attributes) from undefined.data
 ##
 ## ----- Trial 0: -----
 ##
 ## Decision tree:
 ##
 ## wage <= 3.985179:
 ## :...school > 11: female (345/109)
 ## :   school <= 11:
 ## :   :...exper <= 8: female (224/96)
 ## :       exper > 8: male (143/59)
 ## wage > 3.985179:
 ## :...wage > 9.478313: male (254/61)
 ##     wage <= 9.478313:
 ##     :...school > 12: female (320/132)
 ##         school <= 12:
 ##         :...school <= 10: male (246/70)
 ##             school > 10:
 ##             :...school <= 11: male (265/114)
 ##                 school > 11:
 ##                 :...exper <= 6: female (83/35)
 ##                     exper > 6: male (414/173)
 ##
 ## ----- Trial 1: -----
 ##
 ## Decision tree:
 ##
 ## wage > 6.848846: male (663.6/245)
 ## wage <= 6.848846:
 ## :...school <= 10: male (413.9/175)
 ##     school > 10: female (1216.5/537.6)
 ##
 ## ----- Trial 2: -----
 ##
 ## Decision tree:
 ##
 ## wage <= 3.234474: female (458.1/192.9)
 ## wage > 3.234474: male (1835.9/826.2)
 ##
 ## ----- Trial 3: -----
 ##
 ## Decision tree:
 ##
 ## wage > 9.478313: male (234.8/82.1)
 ## wage <= 9.478313:
 ## :...school <= 11: male (883.2/417.8)
 ##     school > 11: female (1175.9/545.1)
 ##
 ## ----- Trial 4: -----
 ##
 ## Decision tree:
 ## male (2294/1128.1)
 ##
 ## *** boosting reduced to 4 trials since last classifier is very inaccurate
 ##
 ##
 ## Evaluation on training data (2294 cases):
 ##
 ## Trial Decision Tree
 ## ----- ----------------
 ## Size Errors
 ##
 ## 0 9 849(37.0%)
 ## 1 3 917(40.0%)
 ## 2 2 958(41.8%)
 ## 3 3 949(41.4%)
 ## boost 864(37.7%) <<
 ##
 ##
 ## (a) (b)
 ## ---- ----
 ## 507 570 (a): class female
 ## 294 923 (b): class male
 ##
 ##
 ## Attribute usage:
 ##
 ## 100.00% wage
 ## 88.93% school
 ## 37.66% exper
 ##
 ##
 ## Time: 0.0 secs

R only created 4 models, as boosting stopped once the last classifier became too inaccurate. You can see each model in the printout. The overall results on the training data are similar to our original model that was not boosted. We will now see how well our boosted model predicts with the code below.

Wage_boost_pred10<-predict(Wage_boost10, Wage_test)
 CrossTable(Wage_test$sex, Wage_boost_pred10, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex Boost', 'predicted Sex Boost'))

Our boosted model has an accuracy rate of (223 + 379)/1000 = 60.2%, which is actually slightly worse than our unboosted model (61.5%). As such, boosting the model was not useful (see page 11 of the pdf for the table printout).

Our next effort will be through the use of a cost matrix. A cost matrix allows you to impose a penalty on false positives and false negatives at your discretion. This is useful if certain mistakes are too costly for the learner to make. In our example, we are going to make it four times more costly to misclassify a female as a male (false negative) than to misclassify a male as a female (false positive). Below is the code.

#cost matrix reflecting the 4 to 1 penalty described above
error_cost<-matrix(c(0, 4, 1, 0), nrow = 2,
                   dimnames = list(c("female", "male"), c("female", "male")))
Wage_cost<-C5.0(Wage_train[-2], Wage_train$sex, costs = error_cost)
Wage_cost_pred<-predict(Wage_cost, Wage_test)
CrossTable(Wage_test$sex, Wage_cost_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex EC', 'predicted Sex EC'))

With this small change our model is 100% accurate (see page 12 of the pdf).

Conclusion

This post provided an example of decision trees. Such a model allows someone to predict a given outcome when given specific information.

Understanding Decision Trees

Decision trees are yet another method of machine learning that is used for classifying outcomes. Decision trees are very useful for, as you can guess, making decisions based on the characteristics of the data.

In this post, we will discuss the following

  • Physical traits of decision trees
  • How decision trees work
  • Pros and cons of decision trees

Physical Traits of a Decision Tree

Decision trees consist of what is called a tree structure. The tree structure consists of a root node, decision nodes, branches and leaf nodes.

A root node is an initial decision made in the tree. This depends on which feature the algorithm selects first.

Following the root node, the tree splits into various branches. Each branch leads to an additional decision node where the data is further subdivided. When you reach the bottom of a tree at the terminal node(s) these are also called leaf nodes.

How Decision Trees Work

Decision trees use a heuristic called recursive partitioning. What this does is it splits the overall dataset into smaller and smaller subsets until each subset is as close to pure (having the same characteristics) as possible. This process is also known as divide and conquer.

The mathematics for deciding how to split the data is based on a measure called entropy, which captures the purity of a potential decision node. The lower the entropy score, the purer the decision node. For a two-class split, entropy ranges from 0 (most pure) to 1 (most impure).
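
As a rough sketch of the idea, here is a small R function that computes entropy from the class proportions in a node. The function name and the example proportions are mine for illustration only.

entropy <- function(p) {
  p <- p[p > 0]          #drop zero proportions to avoid log2(0)
  -sum(p * log2(p))
}
entropy(c(1, 0))         #a perfectly pure node returns 0
entropy(c(0.5, 0.5))     #an evenly mixed two-class node returns 1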

One of the most popular algorithms for developing decision trees is the C5.0 algorithm. This algorithm, in particular, uses entropy to assess potential decision nodes.

Pros and Cons

The pros of decision trees include their versatile nature. Decision trees can deal with all types of data as well as missing data. Furthermore, this approach learns automatically and uses only the most important features. Lastly, a deep understanding of mathematics is not necessary to use this method in comparison to more complex models.

Some problems with decision trees are that they can easily overfit the data. This means that the tree does not generalize well to other datasets. In addition, a large complex tree can be hard to interpret, which may be yet another indication of overfitting.

Conclusion

Decision trees provide another vehicle that researchers can use to empower decision making. This model is most useful when a decision that was made needs to be explained and defended, for example, when rejecting a person’s loan application. Complex models may provide stronger mathematical reasons but would be difficult to explain to an irate customer.

Therefore, when a complex calculation needs to be presented in an easy-to-follow format, decision trees are one possibility.

Conditional Probability & Bayes’ Theorem

In a prior post, we looked at some of the basics of probability. The prior forms of probability we looked at focused on independent events, which are events that are unrelated to each other.

In this post, we will look at conditional probability which involves calculating probabilities for events that are dependent on each other. We will understand conditional probability through the use of Bayes’ theorem.

Conditional Probability 

If all events were independent of each other, it would be impossible to predict anything because there would be no relationships between features. However, there are many examples of one event affecting another. For example, thunder and lightning can be used as predictors of rain, and lack of study can be used as a predictor of test performance.

Thomas Bayes developed a theorem to understand conditional probability. A theorem is a statement that can be proven true through the use of math. Conditional probability is written as follows

P(A | B)

This notation simply means

The probability of event A given that event B occurs

Bayes’ theorem states that this probability can be calculated from the reverse conditional probability and the individual probabilities of the events:

P(A | B) = P(B | A) * P(A) / P(B)

Calculating probabilities using Bayes’ theorem can be somewhat confusing when done by hand. There are a few terms, however, that you need to be exposed to.

  • prior probability is the probability of an event before any conditional evidence is considered
  • likelihood is the probability of observing the evidence given that the event occurred
  • posterior probability is the probability of an event given that another event occurred; calculating the posterior probability is the application of Bayes’ theorem (a small worked example follows this list)
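
To make this concrete, here is a small worked example in R. The numbers are hypothetical and chosen only for illustration: suppose 20% of emails are spam (the prior), 90% of spam emails contain the word “free” (the likelihood), and 10% of non-spam emails also contain it.

prior      <- 0.20                      #P(spam), a made-up prior
likelihood <- 0.90                      #P(word | spam), a made-up likelihood
p_word     <- 0.90*0.20 + 0.10*0.80     #P(word) by the rule of total probability
posterior  <- likelihood*prior/p_word   #Bayes' theorem: P(spam | word)
posterior
## 0.6923077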

Naive Bayes Algorithm

Bayes’ theorem has been used to develop the naive Bayes algorithm. This algorithm is particularly useful in classifying text data, such as emails. This algorithm is fast, good with missing data, and powerful with large or small data sets. However, naive Bayes struggles with large amounts of numeric data, and it assumes that all features are equally important and independent, which is rarely the case.

Conclusion

Probability is a core component of prediction. However, prediction cannot truly take place unless events depend on one another. Thanks to the work of Thomas Bayes, we have one approach to making predictions through the use of his theorem.

In a future post, we will use the naive Bayes algorithm to make predictions about text.

Nearest Neighbor Classification in R

In this post, we will conduct a nearest neighbor classification using R. In a previous post, we discussed nearest neighbor classification. To summarize, nearest neighbor uses the traits of known examples to classify an unknown example. The classification is determined by the closest known example(s) to the unknown example. There are essentially four steps to complete a nearest neighbor classification

  1. Find a dataset
  2. Explore/prepare the Dataset
  3. Train the model
  4. Evaluate the model

For this example, we will use the College dataset from the ISLR package. Our goal will be to predict whether a college is private or not, as recorded in the feature “Private”. Below is the code for this.

library(ISLR)
 data("College")

Step 2 Exploring the Data

We now need to explore and prep the data for analysis. Exploration helps us to find any problems in the data set. Below is the code; you can see the results on your own computer if you are following along.

str(College)
prop.table(table(College$Private))
summary(College)

The “str” function gave us an understanding of the different types of variables and some of their initial values. We have 18 variables in all. We want to predict “Private”, which is a categorical feature; nearest neighbor predicts categorical features only. We will use all of the other numerical variables to predict “Private”. The prop.table gives us the proportion of private and not private colleges. About 30% of the colleges are not private and about 70% are private.

Lastly, the “summary” function gives us some descriptive stats. If you look closely at the descriptive stats, there is a problem. The variables use different scales. For example, the “Apps” feature ranges from 81 to 48094 while the “Grad.Rate” feature ranges from 10 to 100. If we do the analysis with these different scales, the “Apps” feature will have a much stronger influence on the prediction. Therefore, we need to rescale, or normalize, the features. Below is the code to do this.

normal<-function(x){
  return((x-min(x))/(max(x)-min(x)))
}

We made a function called “normal” that normalizes, or rescales, our variables so that all values are between 0 and 1. This makes all of the features equal in their influence. We now run code to normalize our data using the “normal” function we created. We will also look at the summary stats. In addition, we will leave out the prediction feature “Private” because this is what we want to predict. In a real-world example, you would not know this before the analysis anyway.

College_New<-as.data.frame(lapply(College[2:18],normal))
summary(College_New)

The name of the new dataset is “College_New”. We used the “lapply” function to make R normalize all the features in the “College” dataset. The summary stats look good, as all values are between 0 and 1. The last part of this step involves dividing our data into training and testing sets, as seen in the code below, and creating the labels. The labels are the actual information from the “Private” feature. Remember, we removed this from the “College_New” dataset, but we need to keep this information in its own vectors to allow us to check the accuracy of our results later.

College_train<-College_New[1:677, ]
 College_Test<-College_New[678:777,]
 #make labels
 College_train_labels<-College[1:677, 1]
 College_test_labels<-College[678:777,1]

Step 3 Model Training

We will now train the model. You will need to install the “class” package and load it. After that, you need to run the following code.

library(class)
 College_test_pred<-knn(train=College_train, test=College_Test,
 cl=College_train_labels, k=25)

Here is what we did

  • We created a variable called “College_test_pred” to store our results.
  • We used the “knn” function to predict the examples. Inside the “knn” function we used “College_train” to teach the model.
  • We then identified “College_Test” as the dataset we want to make predictions for. For the ‘cl’ argument we used “College_train_labels”, which contains the actual private/not private labels for the colleges in “College_train”.
  • The “k” argument is the number of neighbors knn uses to decide how to label an unknown example. In this code, we are using the 25 nearest neighbors. The value of “k” can vary, but a rule of thumb is to take the square root of the total number of examples in the training data; our training data has 677 examples and the square root of this is about 26, so 25 is a reasonable choice (see the short sketch after this list). For each unknown example, the 25 closest neighbors are found; if 20 are not private and 5 are private, the example is labeled not private because the majority of its neighbors are not private.
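
Here is a quick sketch of that rule of thumb. It is not part of the original example; it simply shows where a starting value for “k” might come from.

k_guess <- round(sqrt(nrow(College_train)))   #square root of the 677 training examples
k_guess
## 26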

Step 4 Evaluating the Model

We will now evaluate the model using the following code

 library(gmodels)
 CrossTable(x=College_test_labels, y=College_test_pred, prop.chisq = FALSE)

The results can be seen by clicking here.  The box in the top left shows the colleges that are not private and were correctly classified as not private, while the box in the bottom right shows the colleges that are private and were correctly classified as private. For example, 28 colleges that are not private were classified as not private, while 64 colleges that are private were classified as private. Adding these two numbers together gives us 92 (28 + 64), which is our accuracy rate out of 100 test examples. In the top right box, we have our false positives. These are colleges that are not private but were predicted as private; there were 8 of these. Lastly, we have our false negatives in the bottom left box, which are colleges that are private but were labeled as not private; there were 0 of these.

Conclusion

The results are pretty good for this example. Nearest neighbor classification allows you to predict an unknown example based on the classification of the known neighbors. This is a simple way to identify examples that are clumped together with neighbors that share the same characteristics.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of the nearest labeled example. This is based on the assumption that if two examples are next to each other, they are likely to be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.

Characteristics

Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features lead to a two-dimensional feature space, three features lead to a three-dimensional feature space, etc. In this feature space, all the examples are placed based on their respective features.

The label of an unknown example is determined by its closest known neighbor or neighbors. This calculation is usually based on Euclidean distance, which is the straight-line distance between two points. The number of neighbors used varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use; the more neighbors used, the more complicated the classification becomes.
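
For the curious, Euclidean distance is simply the square root of the summed squared differences between two examples’ features. Below is a minimal sketch in R; the two example vectors are made up purely for illustration.

a <- c(2, 4, 1)            #hypothetical example one, measured on three features
b <- c(5, 0, 1)            #hypothetical example two
sqrt(sum((a - b)^2))       #Euclidean distance between them
## 5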

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provided by the model to help understand why certain relationships exist. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model, which deprives us of insights into the relationships in the data. Another concern is the headache of missing data, which forces you to spend more time cleaning the data thoroughly. One final issue is that the classification phase is slow and computationally demanding because all of the work happens at prediction time rather than during training.

Conclusion

Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your own way of approaching your problem, data collection, and analysis that is acceptable to peers.

This level of freedom leads to some people freezing up and never completing a project. Now imagine having several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simple six-step process for approaching data science. The process involves the following six steps

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious, but it needs to be said. The first step is to access data for further analysis. Not always, but often, data scientists are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research, in which you are often involved from the beginning to the end, in data science you jump in to analyze a mess of data that others collected. This is challenging, as it may not be clear what people want to know or what exactly they collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have, the level of data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats, from structured to unstructured, it is critical to take a look at the data using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This includes dealing with missing data, transforming features, addressing outliers, and other steps necessary to prepare the data for analysis. The primary goal is to organize the data for analysis.

This is a critical step, as various machine learning models have different assumptions that must be met. Some models can handle missing data and some cannot. Some models are affected by outliers and some are not.

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis need to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well as spoken explanations.

Step 6 Apply the Results

This is the chance to actually use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications from people interested in applying for a home loan.

Conclusion

The steps in this process are just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

Types of Machine Learning

Machine learning is a tool used in analytics for using data to make a decision for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.

This post will provide information on basic types of machine learning which includes predictive models, supervised learning, descriptive models, and unsupervised learning.

Predictive Models and Supervised Learning

Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is most likely to buy a plane ticket or purchase a specific book.

Predictive models are not limited to the future. They can also be used to predict something that has already happened but we are not sure when. For example, data can be collected from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth.

Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it needs to learn and how to do it.

For example, if we want to predict who will be accepted or rejected for a home loan, we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or rejected for the home loan. The supervisors in this example are the features (salary, gender, credit score) used to predict the target feature (home loan).

The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, or type of car. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive; at least in machine learning, someone can only be classified as male or female, as current algorithms cannot place a person in both classes.

A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.

Descriptive Models and Unsupervised Learning

Descriptive models summarize data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict, there are no supervising features used to predict a target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to the model as to what to do, per se.

Descriptive models are very useful for discovering patterns. For example, one descriptive analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they would often buy beer for themselves and diapers for their small children. Stores used this information and placed beer and diapers next to each other in the stores. This led to an increase in profits as men could now find beer and diapers together. This kind of relationship is difficult to find without machine learning techniques.

Conclusion

The model you use depends on what you want to know. Prediction is for, as you can guess, predicting. With this kind of model, you are not as concerned about relationships as you are about understanding what specifically affects the target feature. If you want to explore relationships, then descriptive models can be of use. Machine learning models are tools, and each is appropriate for different situations.

Intro to Using Machine Learning

Machine learning is about using data to take action. This post will explain common steps that are taken when using machine learning in the analysis of data. In general, there are six steps when applying machine learning.

  1. Collecting data
  2. Preparing data
  3. Exploring data
  4. Training a model on the data
  5. Assessing the performance of the model
  6. Improving the model

We will go through each step briefly

Step One, Collecting Data

Data can come from almost anywhere. Data can come from a database in a structured format like MySQL. Data can also come unstructured, such as tweets collected from Twitter. However you get the data, you need to develop a way to clean and process it so that it is ready for analysis.

There are some distinct terms used in machine learning that people coming from traditional research may not be familiar with.

  • Example-An example is one set of data. In an Excel spreadsheet, an example would be one row. In empirical social science research, we would call an example a respondent or participant.
  • Unit of observation-This is how an example is measured. The units can be time, money, height, weight, etc.
  • Feature-A feature is a characteristic of an example. In other forms of research, we normally call a feature a variable. For example, ‘salary’ would be a feature in machine learning but a variable in traditional research.

Step Two, Preparing Data

This is actually the most difficult step in machine learning analysis. It can take up to 80% of the time. With data coming from multiple sources and in multiple formats it is a real challenge to get everything where it needs to be for an analysis.

Missing data, duplicate records, and other issues need to be addressed as part of this process. Once these challenges are dealt with, it is time to explore the data.

Step Three, Explore the Data

Before analyzing the data, it is critical that the data is explored. This is often done visually in the form of plots and graphs but also with summary statistics. You are looking for insights into the data and the characteristics of different features. You are also looking out for things that might be unusual, such as outliers. There are also times when a variable needs to be transformed because there are issues with normality.

Step Four, Training a Model

After exploring the data, you should have an idea of what you want to know, if you did not already. Determining what you want to know helps you to decide which algorithm is most appropriate for developing a model.

To develop a model, we normally split the data into a training and testing set. This allows us to assess the model for its strengths and weaknesses.
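
As a minimal sketch of what such a split can look like in R (the data frame, column names, and 70/30 proportion here are made up purely for illustration):

set.seed(123)
mydata <- data.frame(x = rnorm(100), y = rnorm(100))        #hypothetical dataset
train_rows <- sample(nrow(mydata), size = 0.7*nrow(mydata)) #70% of rows chosen at random
training   <- mydata[train_rows, ]                          #training set
testing    <- mydata[-train_rows, ]                         #testing set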

Step Five, Assessing the Model

The strength of the model is determined. Every model has certain biases that limit its usefulness. How to assess a model depends on what type of model is developed and the purpose of the analysis. Whenever suitable, we want to try and improve the model.

Step Six, Improving the Model

Improvement can happen in many ways. You might decide to normalize the variables in a different way. You may choose to add or remove features from the model. Perhaps you may switch to a different model altogether.

Conclusion

Success in data analysis involves having a clear path for achieving your goals. The steps presented here provide one way of tackling machine learning.

Machine Learning: Getting Machines to Learn

One of the most interesting fields of quantitative research today is machine learning. Machine Learning serves the purpose of developing models through the use of algorithms in order to inform action.

Examples of machine learning are all around us. Machine learning is used to make recommendations about what you should purchase at most retail websites. Machine learning is also used to filter spam from your email inbox. Each of these two examples involves making a prediction based on previous behavior. This is what machine learning does. It learns what will happen based on what has happened. However, the question remains: how does a machine actually learn?

The purpose of this post is to explain how machines actually “learn.” The process of learning involves three steps which are…

  1. Data input
  2. Abstraction
  3. Generalization

Data Input

Data input is simply having access to some form of data whether it be numbers, text, pictures or something else. The data is the factual information that is used to develop insights for future action.

The majority of a person’s time in conducting machine learning is spent organizing and cleaning data. This process is often called data munging. Once the data is ready for analysis, it is time for the abstraction process to begin.

Abstraction

Clean, beautiful data still does not provide any useful information by itself. Abstraction is the process of deriving meaning from the data. The raw data represent knowledge, but the problem is how that knowledge is currently represented. Numbers and text mean little to us on their own.

Abstraction takes all of the numbers and or text and develops a model. What a model does is summarize data and provide an explanation about the data. For example, if you are doing a study for a bank that wants to know who will default on loans, you might discover from the data that a person is more likely to default if they are a student with a credit card balance over $2,000. How this information is shared with the researcher can be in one of the following forms

  • equation
  • diagram
  • logic rules

This kind of information is hard to find manually and is normally found through the use of an algorithm. Abstraction involves the use of some sort of algorithm. An algorithm is a systematic, step-by-step procedure for solving a problem. Understanding the inner workings of an algorithm can be very technical unless you have a strong background in statistics. The point to remember is that algorithms are what the computer uses to learn and develop models.

Developing a model involves training. Training is achieved when the abstraction process is complete. The completion of this process depends on the criteria of what a “good” model is. This varies depending on the requirements of the model and the preference of the researcher.

Generalization

A machine has not actually learned anything until the model it developed is assessed for bias. Bias results when the educated guesses (heuristics) an algorithm makes lead to a model that is systematically wrong.

For example, let’s say an algorithm learns to identify a man by the length of their hair. If a man has really long hair or if a woman has really short hair the algorithm will misclassify them because each person does not fit the educated guess the algorithm developed. The challenge is that these guesses work most of the time but they struggle with exceptions.

The main way of dealing with this is to develop a model on one set of data and test it on another set of data. This will inform the researcher as to what changes are needed for the model.

Another problem is noise. Noise is caused by measurement error, data reporting issues, and similar problems. Trying to make your model fit the noise can lead to overfitting, which means that your model only works for your data and cannot be applied to other data sets. This can also be addressed by testing your model on other data sets.

Conclusion

A machine learns through the use of an algorithm that explains relationships in a data set in a specific manner. This process involves the three steps of data input, abstraction, and generalization. The result of machine learning is a model that can be used to make predictions about the future with a certain degree of accuracy.

Decision Trees with Titanic

In this post, we will make predictions about Titanic survivors using decision trees. The advantage of decision trees is that they split the data into clearly defined groups. This process continues until the data is divided into extremely small subsets. This subsetting is used for making predictions.

We are assuming you have the data and have viewed the previous machine learning post on this blog. If not please click here.

Beginnings

You need to install the ‘rpart’ package from the CRAN repository as the package contains a decision tree function. You will also need to install the following packages

  • rattle
  • rpart.plot
  • RColorBrewer

Each of these packages plays a role in developing decision trees

Building the Model

Once you have installed the packages, you need to develop the model. Below is the code. The model uses most of the variables in the data set for predicting survivors.

library(rpart)
tree <- rpart(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=training, method='class')

The ‘rpart’ function is used for making the classification tree aka decision tree.

We now need to see the tree. We do this with the code below.

plot(tree)
text(tree)

‘plot’ makes the tree and ‘text’ adds the labels. Below is the result.

download

You can probably tell that this is an ugly plot. To improve the appearance we will run the following code.

library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(tree)

Below is the revised plot

download

This looks much better

How to Read

Here is one way to read a decision tree.

  1. At the top node, keep in mind we are predicting survival. There is a 0 or 1 in all of the ‘buckets’. This number represents how the bucket voted. If more than 50% of the passengers in a bucket perished, then the bucket votes ‘0’ for no survival.
  2. Still looking at the top bucket, 62% of the passengers died while 38% survived before we even split the data. The number under the node tells what percentage of the sample is in this “bucket”. For the first bucket, 100% of the sample is represented.
  3. The first split is based on sex. If the person is male, you look to the left. For males, 81% died compared to 19% who survived, so the bucket votes 0 for death. 65% of the sample is in this bucket.
  4. For those who are not male (female) we look to the right and see that only 26% died compared to 74% who survived leading to this bucket voting 1 for survival. This bucket represents 35% of the entire sample.
  5. This process continues all the way down to the bottom

Conclusion

Decision trees are useful for making ‘decisions’ about what is happening in data. For those who are looking for a simple prediction algorithm, decision trees are one place to begin.

Data Science Application: Titanic Survivors

This post will provide a practical application of some of the basics of data science using data from the sinking of the Titanic. In this post, in particular, we will explore the dataset and see what we can uncover.

Loading the Dataset

The first thing that we need to do is load the actual datasets into R. In machine learning, there are always at least two datasets. One dataset is the training dataset and the second dataset is the testing dataset. The training is used for developing a model and the testing is used for checking the accuracy of the model on a different dataset. Downloading both datasets can be done through the use of the following code.

url_train <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
url_test <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
training <- read.csv(url_train)
testing <- read.csv(url_test)

What Happened?

  1. We created the variable “url_train” and put the web link in quotes. We then repeat this for the test data set
  2. Next, we create the variable “training” and use the function “read.csv” for our “url_train” variable. This tells R to read the csv file at the web address in ‘url_train’. We then repeat this process for the testing variable

Exploration

We will now do some basic data exploration. This will help us to see what is happening in the data. What to look for is endless. Therefore, we will look at a few basics things that we might need to know. Below are some questions with answers.

  1. What variables are in the dataset?

This can be found by using the code below

str(training)

The output reveals 12 variables

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 354 273 16 555 516 625 413 577 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Here is a better explanation of each

  • PassengerID-ID number of each passenger
  • Survived-Who survived as indicated by a 1 and who did not by a 0
  • Pclass-Class of passenger 1st class, 2nd class, 3rd class
  • Name-The full name of a passenger
  • Sex-Male or female
  • Age-How old a passenger was
  • SibSp-Number of siblings and spouse a passenger had on board
  • Parch-Number of parents and or children a passenger had
  • Ticket-Ticket number
  • Fare-How much a ticket cost a passenger
  • Cabin-Where they slept on the titanic
  • Embarked-What port they came from

2. How many people survived in the training data?

The answer for this is found by running the following code in the R console. Keep in mind that 0 means died and 1 means survived

> table(training$Survived)
  0   1 
549 342

Unfortunately, unless you are really good at math these numbers do not provide much context. It is better to examine this using percentages as can be found in the code below.

> prop.table(table(training$Survived))

        0         1 
0.6161616 0.3838384 

These results indicate that about 62% of the passengers died while 38% survived in the training data.

3. What percentage of men and women survived?

This information can help us to determine how to set up a model for predicting who will survive. The answer is below. This time we only look at percentages.

> prop.table(table(training$Sex, training$Survived), 1)
        
                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081

You can see that being male was not good on the titanic. Men died in much higher proportions compared to women.

4. Who survived by class?

The code for this is below

> prop.table(table(training$Pclass, training$Survived), 1)
   
            0         1
  1 0.3703704 0.6296296
  2 0.5271739 0.4728261
  3 0.7576375 0.2423625

Here is the code for a plot of this information, followed by the plot. The ‘deathbyclass’ table is created first because the plot needs it.

deathbyclass <- table(training$Pclass, training$Survived)
plot(deathbyclass, main="Passenger Fate by Traveling Class", shade=FALSE, 
     color=TRUE, xlab="Pclass", ylab="Survived")
Rplot

3rd class had the highest mortality rate. This makes sense as 3rd class had the cheapest tickets.

5. Does age make a difference in survival?

We want to see if age matters in survival. It would make sense that younger people would be more likely to survive. This might be due to parents giving their kids a seat on the lifeboats and younger singles pushing their way past older people.

We cannot use a plot like the previous one because we would have several dozen columns on the x-axis, one for each year of life. Instead, we will use a box plot of age grouped by whether a passenger survived or died. Below is the code followed by a visual.

boxplot(training$Age ~ training$Survived, 
        main="Passenger Fate by Age",
        xlab="Survived", ylab="Age")

Rplot01

As you can see, there is little difference in terms of who survives based on age. This means that age may not be a useful predictor of survival.

Conclusion

Here is what we know so far

  • Sex makes a major difference in survival
  • Class makes a major difference in survival
  • Age may not make a difference in survival

There is so much more we can explore in the data. However, this is enough for beginning to lay down criteria for developing a model.

Big Data & Data Mining

Dealing with large amounts of data has been a problem throughout most of human history. Ancient civilizations had to keep large amounts of clay tablets, papyrus, steles, parchments, scrolls etc. to keep track of all the details of an empire.

However, whenever it seemed as though there would be no way to hold any more information, a new technology would be developed to alleviate the problem. When people could not handle keeping records on stone, paper scrolls were invented. When scrolls were no longer practical, books were developed. When hand-copying books was too much, the printing press came along.

By the mid 20th century there were concerns that we would not be able to have libraries large enough to keep all of the books that were being developed. With this problem came the solution of the computer. One computer could hold the information of several dozen if not hundreds of libraries.

Now even a single computer can no longer cope with all of the information that is constantly being developed for just a single subject. This has led to computers working together in networks to share the task of storing information. With data spread across several computers, analyzing the data becomes much more challenging. It is now necessary to mine for useful information in a way similar to how people mined for gold in the 19th century.

Big data is data that is too large to fit within the memory of a single computer. Analyzing data that is spread across a network of databases takes skills different from traditional statistical analysis. This post will explain some of the characteristics of big data as well as data mining.

Big Data Traits

The three main traits of big data are volume, velocity, and variety. Volume describes the size of big data, meaning data too big to fit on only one computer. Velocity is about how fast the data can be processed. Lastly, variety refers to the different types of data. Common sources of big data include the following

  • Metadata from visual sources such as cameras
  • Data from sensors such as in medical equipment
  • Social media data such as information from Google, YouTube, or Facebook

Data Mining

Data mining is the process of discovering a model in a big dataset. Through the development of an algorithm, we can find specific information that helps us to answer our questions. Generally, there are two ways to mine data and these are extraction and summarization.

Extraction is the process of pulling specific information from a big dataset. For example, if we want all the addresses of people who bought a specific book from Amazon the result would be an extraction from a big data set.

Summarization is reducing a dataset in order to describe it. We might do a cluster analysis, in which similar examples are grouped together based on shared characteristics. For example, if we analyze all the books people ordered through Amazon last year, we might notice that one cluster of customers buys mostly religious books while another buys investment books.
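
As a toy sketch of summarization through clustering (the purchase counts below are invented, not Amazon data), k-means in R groups similar customers together:

set.seed(1)
purchases <- data.frame(religion  = c(9, 8, 7, 0, 1, 2),   #made-up book purchase counts
                        investing = c(1, 0, 2, 9, 8, 7))
kmeans(purchases, centers = 2)$cluster                     #two clusters of similar buyers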

Conclusion

Big data will only continue to get bigger. Currently, the response has been to use more computers and servers. As such, there is now a need for finding information across many computers and servers. This is the purpose of data mining, which is to find pertinent information that answers stakeholders’ questions.

Boosting in R

Boosting is a technique used to sort through many predictors in order to find the strongest ones by weighting them. In order to do this, you tell R to use a specific classifier such as a tree or regression model. R then makes multiple models or trees while trying to reduce the error in each model as much as possible. The weight of each predictor is based on the amount of error it reduces as an average across the models.

We will now go through an example of boosting using the “College” dataset from the “ISLR” package.

Load Packages and Setup Training/Testing Sets

First, we will load the required packages and create the needed datasets. Below is the code for this.

library(ISLR); data("College");library(ggplot2);
library(caret)
intrain<-createDataPartition(y=College$Grad.Rate, p=0.7,
 list=FALSE)
trainingset<-College[intrain, ]; testingset<-
 College[-intrain, ]

Develop the Model

We will now create the model. We are going to use all of the variables in the dataset for this example to predict graduation rate. To use all available variables requires the use of the “.” symbol instead of listing every variable name in the model. Below is the code.

Model <- train(Grad.Rate ~., method='gbm', 
 data=trainingset, verbose=FALSE)

The method we used is ‘gbm’, which stands for gradient boosting machine. This means that we are using boosting to build an ensemble of decision trees.

Once the model is created you can check the results by using the following code

summary(Model)

The output is as follows (for the first 5 variables only).

                    var    rel.inf
Outstate       Outstate 36.1745640
perc.alumni perc.alumni 14.0532312
Top25perc     Top25perc 13.0194117
Apps               Apps  5.7415103
F.Undergrad F.Undergrad  5.7016602

These results tell you which variables are the most influential in predicting graduation rate. The strongest predictor was “Outstate”, which is out-of-state tuition. This suggests that as out-of-state tuition increases, the graduation rate also tends to increase. You can check this by running a correlation test between ‘Outstate’ and ‘Grad.Rate’.
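
If you want to run that check, here is a quick sketch using the training set created above; a positive value would support the interpretation.

cor(trainingset$Outstate, trainingset$Grad.Rate)   #correlation between out-of-state tuition and grad rate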

The next two variables are the percentage of alumni who donate (“perc.alumni”) and the percentage of new students from the top 25% of their high school class (“Top25perc”). The more donating alumni, the higher the grad rate, and the more students from the top 25%, the higher the grad rate.

A Visual

We will now make a plot comparing the predicted grad rate with the actual grad rate. Below is the code followed by the plot.

qplot(predict(Model, testingset), Grad.Rate, data = testingset)

Rplot10

The model looks sound based on the visual inspection.

Conclusion

Boosting is a useful way to find out which predictors are strongest. It is an excellent way to explore a model to determine this for future processing.

Random Forest in R

Random forest is a machine learning approach similar to decision trees. The main difference is that with random forest each tree is grown from a bootstrapped sample of the data and only a random subset of the variables is considered at each node. In addition, several different trees are made and the results of the trees are averaged. This means that there is no individual tree to analyze but rather a ‘forest’ of trees.

The primary advantages of random forest are its accuracy and its resistance to overfitting. In this post, we will look at an application of random forest in R. We will use the ‘College’ data from the ‘ISLR’ package to predict whether a college is public or private.

Preparing the Data

First, we need to split our data into a training and testing set as well as load the various packages that we need. We have run this code several times when using machine learning. Below is the code to complete this.

library(ggplot2);library(ISLR)
library(caret)
data("College")
forTrain<-createDataPartition(y=College$Private, p=0.7, list=FALSE)
trainingset<-College[forTrain, ]
testingset<-College[-forTrain, ]

Develop the Model

Next, we need to set up the model we want to run using random forest. The coding is similar to that which is used for regression. Below is the code.

Model1<-train(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, method='rf',prox=TRUE)

We are using 7 variables to predict whether a university is private or not. The method is ‘rf’ which stands for “Random Forest”. By now, I am assuming you can read code and understand what the model is trying to do. For a refresher on reading code for a model please click here.

Reading the Output

If you type “Model1” followed by pressing enter, you will receive the output for the random forest

Random Forest 

545 samples
 17 predictors
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 545, 545, 545, 545, 545, 545, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
  2     0.8957658  0.7272629  0.01458794   0.03874834
  4     0.8969672  0.7320475  0.01394062   0.04050297
  7     0.8937115  0.7248174  0.01536274   0.04135164

Accuracy was used to select the optimal model using 
 the largest value.
The final value used for the model was mtry = 4.

Most of this is self-explanatory. The main focus is on the mtry, accuracy, and Kappa.

The output shows several different models that the computer generated. Each model reports its accuracy as well as its Kappa. The accuracy states how well the model correctly predicted whether a university was public or private. The Kappa shares similar information but takes into account how well the model would have done by chance or luck alone. As such, the Kappa is normally lower than the accuracy.
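
If you are curious where Kappa comes from, here is a rough sketch using made-up confusion matrix counts (not the output above). Kappa compares the observed accuracy to the accuracy expected by chance alone.

tab <- matrix(c(40, 10, 5, 45), nrow = 2)                 #hypothetical counts for 100 cases
observed <- sum(diag(tab))/sum(tab)                       #observed accuracy = 0.85
expected <- sum(rowSums(tab)*colSums(tab))/sum(tab)^2     #accuracy expected by chance = 0.50
(observed - expected)/(1 - expected)                      #Kappa
## 0.7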

At the bottom of the output, the computer tells which mtry was the best. For our example, the best mtry was number 4. If you look closely, you will see that mtry 4 had the highest accuracy and Kappa as well.

Confusion Matrix for the Training Data

Below is the confusion matrix for the training data using the model developed by the random forest. As you can see, the results are different from the random forest output above. This is because these are predictions on the training data itself rather than the bootstrapped resampling estimates.

> predNew<-predict(Model1, trainingset)
> trainingset$predRight<-predNew==trainingset$Private
> table(predNew, trainingset$Private)
       
predNew  No Yes
    No  149   0
    Yes   0 396

Results of the Testing Data

We will now use the testing data to check the accuracy of the model we developed on the training data. Below is the code followed by the output

pred <- predict(Model1, testingset)
testingset$predRight<-pred==testingset$Private
table(pred, testingset$Private)
pred   No Yes
  No   48  11
  Yes  15 158

For the most part, the model we developed to predict if a university is private or not is accurate.

How Important is a Variable

You can calculate how important an individual variable is in the model by using the following code

library(randomForest)
Model1RF<-randomForest(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, importance=TRUE)
importance(Model1RF)

The output tells you roughly how much the accuracy of the model is reduced if the information in that variable is taken away. As such, the higher the number, the more valuable the variable is in improving accuracy.

Conclusion

This post exposed you to the basics of random forest. Random forest is a technique that develops a forest of decision trees through resampling. The results of all of these trees are then averaged to give you an idea of which variables are most useful in prediction.

Decision Trees in R

Decision trees are useful for splitting data into smaller, distinct groups based on criteria you establish. This post will attempt to explain how to develop decision trees in R.

We are going to use the ‘Wage’ dataset found in the “ISLR” package, since the variables we need (education, age, and wage) come from that dataset. Once you load the package, you need to split the data into a training and testing set as shown in the code below. We want to divide the data based on education level, age, and wage.

library(ISLR); library(ggplot2); library(caret)
data("Wage")
inTrain<-createDataPartition(y=Wage$education, p=0.7, list=FALSE)
trainingset <- Wage[inTrain, ]
testingset <- Wage[-inTrain, ]

Visualize the Data

We will now make a plot of the data with education as the groups and age and wage as the x and y variables. Below is the code followed by the plot. Please note that education is divided into 5 groups as indicated in the chart.

qplot(age, wage, colour=education, data=trainingset)
Rplot10
Create the Model

We are now going to develop the model for the decision tree. We will use age and wage to predict education as shown in the code below.

TreeModel<-train(education~age+wage, method='rpart', data=trainingset)

Create Visual of the Model

We now need to create a visual of the model. This requires the package called ‘rattle’, which you can install separately yourself. After loading it, below is the code for the tree plot followed by the diagram.

library(rattle)
fancyRpartPlot(TreeModel$finalModel)

Rplot02

Here is what the chart means

  1. At the top is node 1, which is labeled “HS Grad”. The decimals underneath are the percentage of the data that falls within the “HS Grad” category. As the highest node, everything is classified as “HS Grad” until we begin to apply our criteria.
  2. Underneath node 1 is a decision about wage. If a person makes less than 112, you go to the left; if they make more, you go to the right.
  3. Node 2 indicates the percentage of the sample that was classified as “HS Grad” regardless of actual education. 14% of those with less than a HS diploma were classified as HS Grad based on wage, and 43% of those with a HS diploma were classified as HS Grad based on wage. The percentage underneath the decimals indicates the total amount of the sample placed in the HS Grad category, which was 57%.
  4. This process is repeated for each node until the data is divided as much as possible.

Predict

You can predict individual values in the dataset by using the ‘predict’ function with the test data as shown in the code below.

predict(TreeModel, newdata = testingset)

Conclusion

Decision trees are a unique feature in data analysis for determining how well data can be divided into subsets. They also provide a visual of how to move through data sequentially based on characteristics in the data.

Multiple Regression Prediction in R

In this post, we will learn how to predict using multiple regression in R. In a previous post, we learned how to predict with simple regression. This post will largely repeat that earlier post, with the addition of using more than one predictor variable. We will use the “College” dataset and we will try to predict graduation rate with the following variables

  • Student to faculty ratio
  • Percentage of faculty with PhD
  • Expenditures per student

Preparing the Data

First, we need to load several packages and divide the dataset into training and testing sets. This is not new for this blog. Below is the code for this.

library(ISLR); library(ggplot2); library(caret)
data("College")
inTrain<-createDataPartition(y=College$Grad.Rate, 
 p=0.7, list=FALSE)
trainingset <- College[inTrain, ]
testingset <- College[-inTrain, ]
dim(trainingset); dim(testingset)

Visualizing the Data

We now need to get a visual idea of the data. Since we are using several variables the code for this is slightly different so we can look at several charts at the same time. Below is the code followed by the plots

> featurePlot(x=trainingset[,c("S.F.Ratio","PhD","Expend")],y=trainingset$Grad.Rate, plot="pairs")
Rplot10

To make these plots we did the following

  1. We used the ‘featurePlot’ function and told R to use the ‘trainingset’ data set, subsetting the data to the three independent variables.
  2. Next, we told R what the y variable was and told R to plot the data in pairs

Developing the Model

We will now develop the model. Below is the code for creating the model. How to interpret this information is in another post.

> TrainingModel <-lm(Grad.Rate ~ S.F.Ratio+PhD+Expend, data=trainingset)
> summary(TrainingModel)

As you look at the summary, you can see that all of our variables are significant and that the current model explains 18% of the variance of graduation rate.

Visualizing the Multiple Regression Model

We cannot use a regular plot because our model involves more than two dimensions. To get around this problem and still see our model, we will graph the fitted values against the residuals. Fitted values are the predicted values, while residuals are the differences between the actual values and the predicted values. Below is the code followed by the plot.

> CheckModel<-train(Grad.Rate~S.F.Ratio+PhD+Expend, method="lm", data=trainingset)
> DoubleCheckModel<-CheckModel$finalModel
> plot(DoubleCheckModel, 1, pch=19, cex=0.5)
Rplot01

Here is what happened

  1. We created the variable ‘CheckModel’.  In this variable, we used the ‘train’ function to create a linear model with all of our variables
  2. We then created the variable ‘DoubleCheckModel’, which extracts the ‘finalModel’ component from ‘CheckModel’
  3. Lastly, we plot ‘DoubleCheckModel’

The regression line was automatically added for us. As you can see, the model does not predict much but shows some linearity.

Predict with Model

We will now do one prediction. We want to know the graduation rate when we have the following information

  • Student-to-faculty ratio = 33
  • Phd percent = 76
  • Expenditures per Student = 11000

Here is the code with the answer

> newdata<-data.frame(S.F.Ratio=33, PhD=76, Expend=11000)
> predict(TrainingModel, newdata)
       1 
57.04367

To put it simply, if the student-to-faculty ratio is 33, the percentage of PhD faculty is 76%, and the expenditures per student is 11,000, we can expect 57% of the students to graduate.

Testing

We will now test our model with the testing dataset. We will calculate the error as the square root of the sum of squared errors. Below is the code for creating the testing model followed by the code for calculating each error value.

> TestingModel<-lm(Grad.Rate~S.F.Ratio+PhD+Expend, data=testingset)
> sqrt(sum((TrainingModel$fitted-trainingset$Grad.Rate)^2))
[1] 369.4451
> sqrt(sum((TestingModel$fitted-testingset$Grad.Rate)^2))
[1] 219.4796

Here is what happened

  1. We created the ‘TestingModel’ by using the same model as before but using the ‘testingset’ instead of the ‘trainingset’.
  2. The next two lines of codes should look familiar.
  3. From this output the performance of the model improvement on the testing set since the RMSE is lower than compared to the training results.
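
One note on the error measure: sqrt(sum(...)) gives the square root of the total squared error, so it grows with the number of observations. A true root mean square error divides by the number of observations first. Here is a small sketch using the same model objects (the resulting values are not shown here).

sqrt(mean((TrainingModel$fitted - trainingset$Grad.Rate)^2))   #RMSE on the training set
sqrt(mean((TestingModel$fitted - testingset$Grad.Rate)^2))     #RMSE on the testing set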

Conclusion

This post attempted to explain how to predict and assess models with multiple variables. Although complex for some, prediction is a valuable statistical tool in many situations.

Using Regression for Prediction in R

In the last post about R, we looked at plotting information to make predictions. We will now look at an example of making predictions using regression.

We will use the same data as last time with the help of the ‘caret’ package as well. The code below sets up the seed and the training and testing set we need.

> library(caret); library(ISLR); library(ggplot2)
> data("College");set.seed(1)
> PracticeSet<-createDataPartition(y=College$Grad.Rate, p=0.5, list=FALSE)
> TrainingSet<-College[PracticeSet, ]; TestingSet<-College[-PracticeSet, ]
> head(TrainingSet)

The code above should look familiar from the previous post.

Make the Scatterplot

We will now create the scatterplot showing the relationship between “S.F. Ratio” and “Grad.Rate” with the code below and the scatterplot.

> plot(TrainingSet$S.F.Ratio, TrainingSet$Grad.Rate, pch=19, col="green", 
xlab="Student Faculty Ratio", ylab="Graduation Rate")

[Plot: graduation rate against student-faculty ratio for the training set]

Here is what we did

  1. We used the ‘plot’ function to make this scatterplot. The x variable was ‘S.F.Ratio’ of the ‘TrainingSet’ and the y variable was ‘Grad.Rate’.
  2. We picked the type of dot with the ‘pch’ argument, choosing ’19’.
  3. Next, we chose a color and labeled each axis.

Fitting the Model

We will now develop the linear model. This model will help us to predict future values. Furthermore, we will compare how the model performs on the Training Set and the Testing Set. Below is the code for developing the model.

> TrainingModel<-lm(Grad.Rate~S.F.Ratio, data=TrainingSet)
> summary(TrainingModel)

How to interpret this information was presented in a previous post. However, to summarize, we can say that when the student-to-faculty ratio increases by one, the graduation rate decreases by about 1.29 percentage points. In other words, an increase in the student-to-faculty ratio leads to a decrease in the graduation rate.
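If you want the exact slope rather than reading it off the summary printout, the coefficients can be pulled out of the model directly. A quick sketch:

# Extract the regression coefficients; the S.F.Ratio entry is the slope
coef(TrainingModel)
coef(summary(TrainingModel))["S.F.Ratio", ]   # estimate, std. error, t value, p value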

Adding the Regression Line to the Plot

Below is the code for adding the regression line followed by the scatterplot

> plot(TrainingSet$S.F.Ratio, TrainingSet$Grad.Rate, pch=19, col="green", xlab="Student Faculty Ratio", ylab="Graduation Rate")
> lines(TrainingSet$S.F.Ratio, TrainingModel$fitted, lwd=3)

[Plot: scatterplot with the fitted regression line added]

Predicting New Values

With our model complete we can now predict values. For our example, we will only predict one value. We want to know what the graduation rate would be if we have a student to faculty ratio of 33. Below is the code for this with the answer

> newdata<-data.frame(S.F.Ratio=33)
> predict(TrainingModel, newdata)
      1 
40.6811

Here is what we did

  1. We made a variable called ‘newdata’ and stored a data frame in it with a variable called ‘S.F.Ratio’ set to a value of 33. This is the x value.
  2. Next, we used the ‘predict’ function (base R’s predict method for lm objects) to determine what the graduation rate would be when the student-to-faculty ratio is 33. To do this we told R to use the ‘TrainingModel’ we developed with regression and to run it on the information in the ‘newdata’ data frame. A short sketch after this list shows how to score several ratios at once.
  3. The answer was 40.68. This means that if the student-to-faculty ratio at a university is 33, then the graduation rate would be about 41%.
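The same call can score several hypothetical ratios at once by placing more than one value in the data frame. A minimal sketch:

# Predict graduation rates for several student-to-faculty ratios in one call
several <- data.frame(S.F.Ratio = c(10, 20, 33))
predict(TrainingModel, several)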

Testing the Model

We will now test the model we made with the training set against the testing set. First, we will visualize both data sets using the “plot” function. Below is the code followed by the plots.

par(mfrow=c(1,2))
plot(TrainingSet$S.F.Ratio, TrainingSet$Grad.Rate, pch=19, col="green",
     xlab="Student Faculty Ratio", ylab="Graduation Rate")
lines(TrainingSet$S.F.Ratio, predict(TrainingModel), lwd=3)
plot(TestingSet$S.F.Ratio, TestingSet$Grad.Rate, pch=19, col="purple",
     xlab="Student Faculty Ratio", ylab="Graduation Rate")
lines(TestingSet$S.F.Ratio, predict(TrainingModel, newdata = TestingSet), lwd=3)

[Plot: training set (left) and testing set (right) with the training model’s regression line]

In the code, all that is new is the “par” function, which allows us to see two plots at the same time. We also used the ‘predict’ function to generate the fitted lines. As you can see, the two plots differ somewhat on visual inspection. To determine how much, we need to calculate the error. This is done by computing the square root of the summed squared errors, as shown below.

> sqrt(sum((TrainingModel$fitted-TrainingSet$Grad.Rate)^2))
[1] 328.9992
> sqrt(sum((predict(TrainingModel, newdata=TestingSet)-TestingSet$Grad.Rate)^2))
[1] 315.0409

The main takeaway from this calculation is the two numbers, 328.9992 and 315.0409. These numbers tell you the amount of error in the training model and testing model, and the lower the number the better. Since the error for the testing set is no worse than the training error, the model appears to generalize to new data rather than overfitting, which means it is useful for assessing graduation rates. If there were problems we might consider adding other variables to the model. A per-observation version of this error is shown in the sketch below.
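The ‘caret’ package also wraps this calculation up for you. postResample() takes a vector of predictions and a vector of observed values and returns a per-observation RMSE along with R-squared. A minimal sketch:

# caret's postResample() computes RMSE and R-squared from predicted vs. observed values
library(caret)
postResample(pred = predict(TrainingModel, newdata = TestingSet),
             obs  = TestingSet$Grad.Rate)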

Conclusion

This post shared ways to develop a regression model for the purpose of prediction and for model testing.

Using Plots for Prediction in R

It is common in machine learning to look at the training set of your data visually. This helps you to decide what to do as you begin to build your model.  In this post, we will make several different visual representations of data using datasets available in several R packages.

We are going to explore data in the “College” dataset in the “ISLR” package. If you have not done so already, you need to download the “ISLR” package along with “ggplot2” and the “caret” package.

Once these packages are installed and loaded in R, you can look at a summary of the variables using the summary function as shown below.

summary(College)

You should get a printout of information about 18 different variables. Based on this printout, we want to explore the relationship between graduation rate “Grad.Rate” and student to faculty ratio “S.F.Ratio”. This is the objective of this post.

Next, we need to create a training and testing dataset. Below is the code to do this.

> library(ISLR);library(ggplot2);library(caret)
> data("College")
> PracticeSet<-createDataPartition(y=College$Enroll, p=0.7, list=FALSE)
> trainingSet<-College[PracticeSet,]
> testSet<-College[-PracticeSet,]
> dim(trainingSet); dim(testSet)
[1] 545  18
[1] 232  18

The explanation behind this code was covered in predicting with caret so we will not explain it again. You just need to know that the dataset you will use for the rest of this post is called “trainingSet”.

Developing a Plot

We now want to explore the relationship between graduation rates and the student-to-faculty ratio. We will use the ‘ggplot2’ package to do this. Below is the code followed by the plot.

qplot(S.F.Ratio, Grad.Rate, data=trainingSet)
[Plot: graduation rate against student-faculty ratio]
As you can see, there appears to be a negative relationship between student-faculty ratio and graduation rate. In other words, as the ratio of students to faculty increases, the graduation rate decreases.

Next, we will color the plots on the graph based on whether they are a public or private university to get a better understanding of the data. Below is the code for this followed by the plot.

> qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
[Plot: the same scatterplot coloured by private vs. public status]
It appears that private colleges usually have lower student-faculty ratios and higher graduation rates than public colleges.

Add Regression Line

We will now plot the same data but will add a regression line. This will provide us with a visual of the slope. Below is the code followed by the plot.

> collegeplot<-qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
> collegeplot+geom_smooth(method = 'lm', formula = y~x)
[Plot: scatterplot coloured by Private with per-group regression lines]
Most of this code should be familiar to you. We saved the plot as the variable ‘collegeplot’. In the second line of code, we add a ‘ggplot2’ layer, geom_smooth, to draw the regression lines. ‘lm’ means linear model, and the formula argument specifies the regression of y on x.
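The same figure can also be written with the fuller ggplot() interface, which makes each layer explicit. A minimal sketch, equivalent to the qplot call above:

# Scatterplot coloured by Private with a per-group linear regression line
library(ggplot2)
ggplot(trainingSet, aes(x = S.F.Ratio, y = Grad.Rate, colour = Private)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)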

Cutting the Data

We will now divide the data into three equal-sized groups based on the student-faculty ratio to look for additional trends. To do this you need the “Hmisc” package. Below is the code followed by the table.

> library(Hmisc)
> divide_College<-cut2(trainingSet$S.F.Ratio, g=3)
> table(divide_College)
divide_College
[ 2.9,12.3) [12.3,15.2) [15.2,39.8] 
        185         179         181

Our data is now divided into three groups of roughly equal size.
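If you would rather not install ‘Hmisc’, a similar three-way split can be approximated with base R’s quantile() and cut(). This is a sketch, and the exact group boundaries may differ slightly from cut2’s.

# Approximate cut2(x, g = 3) with base R: split at the 1/3 and 2/3 quantiles
breaks <- quantile(trainingSet$S.F.Ratio, probs = c(0, 1/3, 2/3, 1))
divide_base <- cut(trainingSet$S.F.Ratio, breaks = breaks, include.lowest = TRUE)
table(divide_base)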

Box Plots

Lastly, we will make a box plot with our three equal size groups based on student-faculty ratio. Below is the code followed by the box plot

> CollegeBP<-qplot(divide_College, Grad.Rate, data=trainingSet, fill=divide_College, geom=c("boxplot"))
> CollegeBP
[Plot: box plots of graduation rate for the three student-faculty ratio groups]
As you can see, the negative relationship continues even when the student-faculty ratio is divided into three equally sized groups. However, our information about private and public colleges is missing. To add it, we need to make a table as shown in the code below.

> CollegeTable<-table(divide_College, trainingSet$Private)
> CollegeTable
              
divide_College  No Yes
   [ 2.9,12.3)  14 171
   [12.3,15.2)  27 152
   [15.2,39.8] 106  75

This table tells you how many public and private colleges there are in each of the three student-faculty ratio groups. We can also convert these counts to row proportions with the following code.

> prop.table(CollegeTable, 1)
              
divide_College         No        Yes
   [ 2.9,12.3) 0.07567568 0.92432432
   [12.3,15.2) 0.15083799 0.84916201
   [15.2,39.8] 0.58563536 0.41436464

In this post, we found that there is a negative relationship between student-faculty ratio and graduation rate. We also found that private colleges have a lower student-faculty ratio and a higher graduation rate than public colleges. In other words, the status of a university as public or private moderates the relationship between student-faculty ratio and graduation rate.

You can probably tell by now that R can be a lot of fun with some basic knowledge of coding.

Predicting with Caret

In this post, we will explore the use of the caret package for developing algorithms for use in machine learning. The caret package is particularly useful for processing data before the actual analysis of the algorithm.

When developing algorithms, it is common practice to divide the data into training and testing subsamples. The training subsample is used to develop the algorithm, while the testing sample is used to assess its predictive power. There are many different ways to divide a sample into a testing and training set, and one of the main benefits of the “caret” package is how it handles this division.

In this example, we will return to the “kernlab” spam data and develop an algorithm after subsetting the sample into a training data set and a testing data set.

First, you need to install the ‘caret’ and ‘kernlab’ packages if you have not done so. After that, below is the code for subsetting the ‘spam’ data from the ‘kernlab’ package.

library(caret); library(kernlab)
data(spam)    # load the spam data set

inTrain<-createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)

Here is what we did

  1. We created the variable ‘inTrain’
  2. In the variable ‘inTrain’ we told R to make a partition in the data using the ‘createDataPartition’ function. In the parentheses we told R to look at the dataset ‘spam’ and to examine the variable ‘type’. Then we told R to select the row indices for 75% of the data, sampled within the levels of ‘type’, and store them in the ‘inTrain’ variable we created. ‘list=FALSE’ tells R to return a matrix of indices rather than a list. If you look closely, you will see that the variable ‘type’ is set as the y variable. This means that all the other variables in the data set will be used as predictors. Also, remember that the ‘type’ variable has two outcomes, “spam” or “nonspam”.
  3. Next, we created the variable ‘training’, which is the dataset we will use for developing our algorithm. To make this we take the original ‘spam’ data and subset it with the rows selected by ‘inTrain’.
  4. Finally, we create the ‘testing’ variable, which will be used for testing the algorithm. To make this variable, we tell R to take every row that was not assigned to ‘inTrain’ and put it into ‘testing’. This is done through the use of a negative sign.
  5. The ‘dim’ function just tells us how many rows and columns we have as shown below.
[1] 3451   58

As you can see, we have 3451 rows and 58 columns. Rows are for different observations and columns are for the variables in the data set.
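Because ‘type’ was used as the y variable, createDataPartition samples within each class, so the spam/nonspam proportions should be nearly identical in both subsets. A quick check, as a minimal sketch:

# The class proportions in the two subsets should closely match,
# since createDataPartition stratifies on the outcome variable
prop.table(table(training$type))
prop.table(table(testing$type))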

Now to make the model. We are going to bootstrap our sample. Bootstrapping involves random sampling from the sample with replacement in order to assess the stability of the results. Below is the code for the bootstrap and model development followed by an explanation.

set.seed(32343)
SpamModel<-train(type ~., data=training, method="glm")
SpamModel

Here is what we did,

  1. Whenever you bootstrap, it is wise to set the seed. This allows you to reproduce the same results each time. For us, we set the seed to 32343
  2. Next, we developed the actual model. We gave the model the name “SpamModel” and used the ‘train’ function. Inside the parentheses, we tell R to set “type” as the y variable and then use ~ ., which is shorthand for using all other variables in the model as predictors. Then we set the data to the ‘training’ data set and indicate that the method is ‘glm’, which means generalized linear model.
  3. Printing ‘SpamModel’ displays the re-sampling results for the analysis.

There is a lot of information, but the most important figure for us is the accuracy of the model, which is 91.3%. The kappa statistic, 0.817, compares the observed accuracy with the accuracy that would be expected by chance alone; a value this high means the model does much better than chance agreement.

For our final trick, we will develop a confusion matrix to assess the accuracy of our model using the ‘testing’ sample we made earlier. Below is the code

SpamPredict<-predict(SpamModel, newdata=testing)
confusionMatrix(SpamPredict, testing$type)

Here is what we did,

  1. We made a variable called ‘SpamPredict’. We use the function ‘predict’ using the ‘SpamModel’ with the new data called ‘testing’.
  2. Next, we make a matrix using the ‘confusionMatrix’ function, comparing the predictions in ‘SpamPredict’ with the actual ‘type’ values from the ‘testing’ data. Below is the output.

          Reference
Prediction nonspam spam
   nonspam     657   35
   spam         40  418
                                              
                   Accuracy : 0.9348          
                     95% CI : (0.9189, 0.9484)
        No Information Rate : 0.6061          
        P-Value [Acc > NIR] : <2e-16          
                                              
                      Kappa : 0.8637          
     Mcnemar's Test P-Value : 0.6442          
                                              
                Sensitivity : 0.9426          
                Specificity : 0.9227          
             Pos Pred Value : 0.9494          
             Neg Pred Value : 0.9127          
                 Prevalence : 0.6061          
             Detection Rate : 0.5713          
       Detection Prevalence : 0.6017          
          Balanced Accuracy : 0.9327          
                                              
           'Positive' Class : nonspam

The accuracy of the model actually improved to about 93% on the test data. The other values are also informative: sensitivity is the proportion of actual “nonspam” messages (the positive class here) that were correctly identified, while specificity is the proportion of actual spam messages that were correctly identified. As you can see, machine learning is a somewhat complex experience. A short sketch below shows where the accuracy and kappa figures come from.
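Both headline numbers can be recomputed by hand from the four counts in the confusion matrix. A minimal sketch:

# Reproduce accuracy and kappa from the confusion matrix counts
tab <- matrix(c(657, 40, 35, 418), nrow = 2,
              dimnames = list(Prediction = c("nonspam", "spam"),
                              Reference  = c("nonspam", "spam")))
n        <- sum(tab)
accuracy <- sum(diag(tab)) / n                      # observed agreement, about 0.9348
expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
kappa    <- (accuracy - expected) / (1 - expected)  # about 0.8637
accuracy; kappa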

Simple Prediction

Prediction is one of the key concepts of machine learning. Machine learning is a field of study that is focused on the development of algorithms that can be used to make predictions.

Anyone who has shopped online has experienced machine learning. When you make a purchase at an online store, the website will recommend additional purchases for you. Often these recommendations are based on what you have purchased or clicked on while at the site.

There are two common forms of machine learning, unsupervised and supervised learning. Unsupervised learning uses data that is not labeled, and the goal is to find patterns within the data. Since the data is not labeled, there is no indication of what is right or wrong.

Supervised machine learning uses cleaned and properly labeled data. Since the data is labeled, there is some indication of whether the model that is developed is accurate or not. If the model is incorrect, then you need to make adjustments to it. In other words, the model learns based on its ability to accurately predict results. However, it is up to the human to make adjustments to the model in order to improve its accuracy.

In this post, we will look at using R for supervised machine learning. The definition presented so far will make more sense with an example.

The Example

We are going to make a simple prediction about whether emails are spam or not using data from the ‘kernlab’ package.

The first thing that you need to do is to install and load the “kernlab” package using the following code

install.packages("kernlab")
library(kernlab)

If you use the “View” function to examine the data you will see that there are several columns. Each column tells you the frequency of a word that kernlab found in a collection of emails. We are going to use the word/variable “money” to predict whether an email is spam or not. First, we need to plot the density of the use of the word “money” when the email was not coded as spam. Below is the code for this.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")

This is an advanced R post, so I am assuming you can read the code. The plot should look like the following.

[Plot: density of the frequency of ‘money’ in nonspam emails]

As you can see, ‘money’ is not used too frequently in the nonspam emails in this dataset. However, this tells us little on its own; we need to compare how often ‘money’ appears in nonspam emails with how often it appears in spam emails. To do this, we add a second line showing the density of ‘money’ in emails classified as spam. The code for this is below, with the prior code included.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")

Your new plot should look like the following

[Plot: density of ‘money’ in nonspam (blue) and spam (red) emails]

If you look closely at the plot, the point where the red line for spam rises above the blue line for nonspam suggests a cutoff for deciding whether an email is spam. In other words, emails whose frequency of ‘money’ falls below that point are more likely nonspam, while emails above it are more likely spam.

The next code and graph show that this cutoff point is around 0.1. This means that any email with a frequency of the word ‘money’ greater than 0.1 will be classified as spam under this simple rule. Below is the code and the graph, with the cutoff point indicated by a vertical black line.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
abline(v=0.1, col="black", lw= 3)

[Plot: density plot with the 0.1 cutoff marked by a vertical black line]

Now we need to calculate the accuracy of using the word ‘money’ to predict spam. For our current example, we will simply use the ‘ifelse’ function: if the frequency is greater than 0.1, the email is predicted to be spam; otherwise it is predicted to be nonspam.

We then need to make a table to see the results. The code for the ‘ifelse’ function and the table is below, followed by the output.

predict<-ifelse(spam$money > 0.1, "spam","nonspam")
table(predict, spam$type)/length(spam$type)
predict       nonspam        spam
  nonspam 0.596392089 0.266898500
  spam    0.009563138 0.127146273

Based on the table, which I am assuming you can read, our rule correctly classifies an email about 72% of the time (0.596 + 0.127) using whether the frequency of the word ‘money’ is greater than 0.1.

Of course, for this to be true machine learning we would repeat this process, trying to improve the accuracy of the prediction. However, this is an adequate introduction to the topic. As a flavour of what that refinement might look like, a small sketch follows.
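This sketch, which is not from the original post, scans several candidate cutoffs for the ‘money’ frequency and reports the overall accuracy of each.

# Try several candidate cutoffs and compute the overall accuracy of each
library(kernlab)
data(spam)

cutoffs  <- c(0.05, 0.1, 0.25, 0.5, 1)
accuracy <- sapply(cutoffs, function(cut) {
  guess <- ifelse(spam$money > cut, "spam", "nonspam")
  mean(guess == spam$type)        # proportion of emails classified correctly
})
data.frame(cutoff = cutoffs, accuracy = round(accuracy, 3))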