# Statistical Models

In research, the term ‘model’ is employed frequently. Normally, a model is some sort of a description or explanation of a real world phenomenon. In data science, we employ the use of statistical models. Statistical models used numbers to help us to understand something that happens in the real world.

A statistical model used numbers to help us to understand something that happens in the real world.

In the real world, quantitative research relies on numeric observations of some phenomenon, behavior, and or perception. For example, let’s say we have the quiz results of 20 students as show below.

32 60 95 15 43 22 45 14 48 98 79 97 49 63 50 11 26 52 39 97

This is great information but what if we want to go beyond how these students did and try to understand how students in the population would do on the quizzes. Doing this requires the development of a model.

A model is simply trying to describe how the data is generated in terms of whatever we are mesuring while allowing for randomness. It helps in summarizing a large collection of numbers while also providing structure to it.

One commonly used model is the normal model. This model is the famous bell-curve model that most of us are familiar with. To calculate this model we need to calculate the mean and standard deviation to get a plot similar to the one below

Now, this model is not completely perfect. For example, a student cannot normally get a score above 100 or below 0 on a quiz. Despite this weakness, normal distribution gives is an indication of what the population looks like.

With this, we can also calculate the probability of getting a specific score on the quiz. For example, if we want to calculate the probability that a student would get a score of 70  or higher we can do a simple test and find that it is about 26%.

Other Options

The normal model is not the only model. There are many different models to match different types of data. There are the gamma, student t, binomial, chi-square, etc. To determine which model to use requires examining the distribution of your data and match it to an appropriate model.

Another option is to transform the data. This is normally done to make data conform to a normal distribution. Which transformation to employ depends on how the data looks when it is plotted.

Conclusion

Modeling helps to bring order to data that has been collected for analysis. By using a model such as the normal distribution, you can begin to make inferences about what the population is like. This allows you to take a handful of data to better understand the world.

# K Nearest Neighbor in R

K-nearest neighbor is one of many nonlinear algorithms that can be used in machine learning. By non-linear I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that naturally does not meet the assumptions of linearity.

KNN is also known as a “lazy learner”. This means that there are known coefficients or parameter estimates. When doing regression we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is used K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the number of K neighbors to use that is most accurate based on the different criteria for assessing the models.

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

library(class);library(kknn);library(caret);library(corrplot)
library(reshape2);library(ggplot2);library(pROC);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ... ##$ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $child6 : int 1 0 1 0 1 0 0 0 0 0 ... ##$ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $agew : int 32 30 35 34 31 54 37 54 48 39 ... ##$ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $hearnw : num 3.35 1.39 4.55 1.1 4.59 ... ##$ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ... ##$ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $educh : int 12 9 12 10 12 11 12 8 4 12 ... ##$ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ... ##$ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $educwf : int 7 7 7 7 14 7 7 3 7 7 ... ##$ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ... ##$ experience: int  14 5 15 6 7 33 11 35 24 21 ...

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

Mroz$work<-NULL mroz.melt<-melt(Mroz,id.var='city') Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4) Mroz_plots From the plots, it appears there are no differences in how the variable act whether someone is from the city or not. This may be a flag that classification may not work. We now need to scale our data otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set because it’s a factor and then reinsert it into the data set. Below is the code. mroz.scale<-as.data.frame(scale(Mroz[,-16])) mroz.scale$city<-Mroz$city We will now look at our box-plots a second time but this time with scaled data. mroz.scale.melt<-melt(mroz.scale,id.var="city") mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4) mroz_plot2 This second plot is easier to read but there is still little indication of difference. We can now move to checking the correlations among the variables. Below is the code mroz.cor<-cor(mroz.scale[,-17]) corrplot(mroz.cor,method = 'number') There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem. We will now divide our dataset into the training and testing sets set.seed(502) ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3)) train<-mroz.scale[ind==1,] test<-mroz.scale[ind==2,] Before creating a model we need to create a grid. We do not know the value of k yet so we have to run multiple models with different values of k in order to determine this for our model. As such we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation to get a better estimate of k as well using the “trainControl” function. The code is below. grid<-expand.grid(.k=seq(2,20,by=1)) control<-trainControl(method="cv") Now we make our model, knn.train<-train(city~.,train,method="knn",trControl=control,tuneGrid=grid) knn.train ## k-Nearest Neighbors ## ## 540 samples ## 16 predictors ## 2 classes: 'no', 'yes' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ... ## Resampling results across tuning parameters: ## ## k Accuracy Kappa ## 2 0.6000095 0.1213920 ## 3 0.6368757 0.1542968 ## 4 0.6424325 0.1546494 ## 5 0.6386252 0.1275248 ## 6 0.6329998 0.1164253 ## 7 0.6589619 0.1616377 ## 8 0.6663344 0.1774391 ## 9 0.6663681 0.1733197 ## 10 0.6609510 0.1566064 ## 11 0.6664018 0.1575868 ## 12 0.6682199 0.1669053 ## 13 0.6572111 0.1397222 ## 14 0.6719586 0.1694953 ## 15 0.6571425 0.1263937 ## 16 0.6664367 0.1551023 ## 17 0.6719573 0.1588789 ## 18 0.6608811 0.1260452 ## 19 0.6590979 0.1165734 ## 20 0.6609510 0.1219624 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was k = 14. R recommends that k = 16. This is based on a combination of accuracy and the kappa statistic. The kappa statistic is a measurement of the accuracy of a model while taking into account chance. We don’t have a model in the sense that we do not use the ~ sign like we do with regression. Instead, we have a train and a test set a factor variable and a number for k. This will make more sense when you see the code. Finally, we will use this information on our test dataset. We will then look at the table and the accuracy of the model. knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #-17 removes the dependent variable 'city table(knn.test,test$city)
##
## knn.test  no yes
##      no   19   8
##      yes  61 125
prob.agree<-(15+129)/213
prob.agree
## [1] 0.6760563

Accuracy is 67% which is consistent with what we found when determining the k. We can also calculate the kappa. This done by calculating the probability and then do some subtraction and division. We already know the accuracy as we stored it in the variable “prob.agree” we now need the probability that this is by chance. Lastly, we calculate the kappa.

prob.chance<-((15+4)/213)*((15+65)/213)
kap<-(prob.agree-prob.chance)/(1-prob.chance)
kap
## [1] 0.664827

A kappa of .66 is actual good.

The example we just did was with unweighted k neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighing methods. “Rectangular” is unweighted and is the one that we used. The other two are “triangular” and “epanechnikov”. How these calculate the weights is beyond the scope of this post. In the code below the argument “distance” can be set to 1 for euclidean and 2 for absolute distance.

kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular",
"epanechnikov"))
plot(kknn.train)

kknn.train
##
## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14

If you look at the plot you can see which value of k is the best by looking at the point that is the lowest on the graph which is right before 15. Looking at the legend it indicates that the point is the “rectangular” estimate which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although it recommends a different value for k our misclassification was about the same.

Conclusion

In this post, we explored both weighted and unweighted KNN. This algorithm allows you to deal with data that does not meet the assumptions of regression by ignoring the need for parameters. However, because there are no numbers really attached to the results beyond accuracy it can be difficult to explain what is happening in the model to people. As such, perhaps the biggest drawback is communicating results when using KNN.

# Elastic Net Regression in R

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression only has 1.

In this post, we will go through an example of the use of elastic net using the “VietnamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis

library(Ecdat);library(corrplot);library(caret);library(glmnet)
data("VietNamI")
str(VietNamI)
## 'data.frame':    27765 obs. of  12 variables:
##  $pharvis : num 0 0 0 1 1 0 0 0 2 3 ... ##$ lnhhexp  : num  2.73 2.74 2.27 2.39 3.11 ...
##  $age : num 3.76 2.94 2.56 3.64 3.3 ... ##$ sex      : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
##  $married : num 1 0 0 1 1 1 1 0 1 1 ... ##$ educ     : num  2 0 4 3 3 9 2 5 2 0 ...
##  $illness : num 1 1 0 1 1 0 0 0 2 1 ... ##$ injury   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $illdays : num 7 4 0 3 10 0 0 0 4 7 ... ##$ actdays  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $insurance: num 0 0 1 1 0 1 1 1 0 0 ... ##$ commune  : num  192 167 76 123 148 20 40 57 49 170 ...
##  - attr(*, "na.action")=Class 'omit'  Named int 27734
##   .. ..- attr(*, "names")= chr "27734"

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. Code is below.

p.cor<-cor(VietNamI[,-4])
corrplot.mixed(p.cor)

No major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

VietNamI$commune<-NULL VietNamI_reduced<-VietNamI[1:1000,] ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3)) train<-VietNamI_reduced[ind==1,] test<-VietNamI_reduced[ind==2,] We need to create a grid that will allow us to investigate different models with different combinations of alpha ana lambda. This is done using the “expand.grid” function. In combination with the “seq” function below is the code grid<-expand.grid(.alpha=seq(0,1,by=.5),.lambda=seq(0,0.2,by=.1)) We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the “trainControl” function” from the “caret” package. In the code below “LOOCV” stands for “leave one out cross-validation”. control<-trainControl(method = "LOOCV") We are no ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters enet.train<-train(illdays~.,train,method="glmnet",trControl=control,tuneGrid=grid) enet.train ## glmnet ## ## 694 samples ## 10 predictors ## ## No pre-processing ## Resampling: Leave-One-Out Cross-Validation ## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ... ## Resampling results across tuning parameters: ## ## alpha lambda RMSE Rsquared ## 0.0 0.0 5.229759 0.2968354 ## 0.0 0.1 5.229759 0.2968354 ## 0.0 0.2 5.229759 0.2968354 ## 0.5 0.0 5.243919 0.2954226 ## 0.5 0.1 5.225067 0.2985989 ## 0.5 0.2 5.200415 0.3038821 ## 1.0 0.0 5.244020 0.2954519 ## 1.0 0.1 5.203973 0.3033173 ## 1.0 0.2 5.182120 0.3083819 ## ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were alpha = 1 and lambda = 0.2. The output list all the possible alpha and lambda values that we set in the “grid” variable. It even tells us which combination was the best. For our purposes, the alpha will be .5 and the lambda .2. The r-square is also included. We will set our model and run it on the test set. We have to convert the “sex” variable to a dummy variable for the “glmnet” function. We next have to make matrices for the predictor variables and a for our outcome variable “illdays” train$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable
test$sex<-model.matrix( ~ sex - 1, data=test ) predictor_variables<-as.matrix(train[,-9]) days_ill<-as.matrix(train$illdays)
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)

We can now look at specific coefficient by using the “coef” function.

enet.coef<-coef(enet,lambda=.2,alpha=.5,exact=T)
enet.coef
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   -1.304263895
## pharvis        0.532353361
## lnhhexp       -0.064754000
## age            0.760864404
## sex.sexfemale  0.029612290
## sex.sexmale   -0.002617404
## married        0.318639271
## educ           .
## illness        3.103047473
## injury         .
## actdays        0.314851347
## insurance      .

You can see for yourself that several variables were removed from the model. Medical expenses (lnhhexp), sex, education, injury, and insurance do not play a role in the number of days ill for an individual in Vietnam.

With our model developed. We now can test it using the predict function. However, we first need to convert our test dataframe into a matrix and remove the outcome variable from it

test.matrix<-as.matrix(test[,-9])
enet.y<-predict(enet, newx = test.matrix, type = "response", lambda=.2,alpha=.5)

Let’s plot our results

plot(enet.y)

This does not look good. Let’s check the mean squared error

enet.resid<-enet.y-test$illdays mean(enet.resid^2) ## [1] 20.18134 We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it. set.seed(317) enet.cv<-cv.glmnet(predictor_variables,days_ill,alpha=.5) plot(enet.cv) You can see that as the number of features are reduce (see the numbers on the top of the plot) the MSE increases (y-axis). In addition, as the lambda increases, there is also an increase in the error but only when the number of variables is reduced as well. The dotted vertical lines in the plot represent the minimum MSE for a set lambda (on the left) and the one standard error from the minimum (on the right). You can extract these two lambda values using the code below. enet.cv$lambda.min
## [1] 0.3082347
enet.cv$lambda.1se ## [1] 2.874607 We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict. coef(enet.cv,s="lambda.1se") ## 12 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 2.34116947 ## pharvis 0.003710399 ## lnhhexp . ## age . ## sex.sexfemale . ## sex.sexmale . ## married . ## educ . ## illness 1.817479480 ## injury . ## actdays . ## insurance . Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information. enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',lambda="lambda.1se", alpha = .5) enet.cv.resid<-enet.y.cv-test$illdays
mean(enet.cv.resid^2)
## [1] 25.47966

A small improvement.  Our model is a mess but this post served as an example of how to conduct an analysis using elastic net regression.

# Data Science Research Questions

Developing research questions is an absolute necessity in completing any research project. The questions you ask help to shape the type of analysis that you need to conduct.

The type of questions you ask in the context of analytics and data science are similar to those found in traditional quantitative research. Yet data science, like any other field, has its own distinct traits.

In this post, we will look at six different types of questions that are used frequently in the context of the field of data science. The six questions are…

1. Descriptive
2. Exploratory/Inferential
3. Predictive
4. Causal
5. Mechanistic

Understanding the types of question that can be asked will help anyone involved in data science to determine what exactly it is that they want to know.

Descriptive

A descriptive question seeks to describe a characteristic of the dataset. For example, if I collect the GPA of 100 university student I may want to what the average GPA of the students is. Seeking the average is one example of a descriptive question.

With descriptive questions, there is no need for a hypothesis as you are not trying to infer, establish a relationship, or generalize to a broader context. You simply want to know a trait of the dataset.

Exploratory/Inferential

Exploratory questions seek to identify things that may be “interesting” in the dataset. Examples of things that may be interesting include trends, patterns, and or relationships among variables.

Exploratory questions generate hypotheses. This means that they lead to something that may be more formal questioned and tested. For example, if you have GPA and hours of sleep for university students. You may explore the potential that there is a relationship between these two variables.

Inferential questions are an extension of exploratory questions. What this means is that the exploratory question is formally tested by developing an inferential question. Often, the difference between an exploratory and inferential question is the following

1. Exploratory questions are usually developed first
2. Exploratory questions generate inferential questions
3. Inferential questions are tested often on a different dataset from exploratory questions

In our example, if we find a relationship between GPA and sleep in our dataset. We may test this relationship in a different, perhaps larger dataset. If the relationship holds we can then generalize this to the population of the study.

Causal

Causal questions address if a change in one variable directly affects another. In analytics, A/B testing is one form of data collection that can be used to develop causal questions. For example, we may develop two version of a website and see which one generates more sales.

In this example, the type of website is the independent variable and sales is the dependent variable. By controlling the type of website people see we can see if this affects sales.

Mechanistic

Mechanistic questions deal with how one variable affects another. This is different from causal questions that focus on if one variable affects another. Continuing with the website example, we may take a closer look at the two different websites and see what it was about them that made one more succesful in generating sales. It may be that one had more banners than another or fewer pictures. Perhaps there were different products offered on the home page.

All of these different features, of course, require data that helps to explain what is happening. This leads to an important point that the questions that can be asked are limited by the available data. You can’t answer a question that does not contain data that may answer it.

Conclusion

Answering questions is essential what research is about. In order to do this, you have to know what your questions are. This information will help you to decide on the analysis you wish to conduct. Familiarity with the types of research questions that are common in data science can help you to approach and complete analysis much faster than when this is unclear

# Lasso Regression in R

In this post, we will conduct an analysis using the lasso regression. Remember lasso regression will actually eliminate variables by reducing them to zero through how the shrinkage penalty can be applied.

We will use the dataset “nlschools” from the “MASS” packages to conduct our analysis. We want to see if we can predict language test scores “lang” with the other available variables. Below is some initial code to begin the analysis

library(MASS);library(corrplot);library(glmnet)
data("nlschools")
str(nlschools)
## 'data.frame':    2287 obs. of  6 variables:
##  $lang : int 46 45 33 46 20 30 30 57 36 36 ... ##$ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ... ##$ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  $SES : int 23 10 15 23 10 10 23 10 13 15 ... ##$ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.

nlschools$class<-NULL p.cor<-cor(nlschools[,-5]) corrplot.mixed(p.cor) No problems with collinearity. We will now setup are training and testing sets. ind<-sample(2,nrow(nlschools),replace=T,prob = c(0.7,0.3)) train<-nlschools[ind==1,] test<-nlschools[ind==2,] Remember that the ‘glmnet’ function does not like factor variables. So we need to convert our “COMB” variable to a dummy variable. In addition, “glmnet” function does not like data frames so we need to make two data frames. The first will include all the predictor variables and the second we include only the outcome variable. Below is the code train$COMB<-model.matrix( ~ COMB - 1, data=train ) #convert to dummy variable
test$COMB<-model.matrix( ~ COMB - 1, data=test ) predictor_variables<-as.matrix(train[,2:4]) language_score<-as.matrix(train$lang)

We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression.

lasso<-glmnet(predictor_variables,language_score,family="gaussian",alpha=1)

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

• Df = number of variables including in the model (this is always the same number in a ridge model)
• %Dev = Percent of deviance explained. The higher the better
• Lambda = The lambda used to obtain the %Dev

When you use the “print” function for a lasso model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we will use the “print” function but, I only printed the first 5 and last 5 models in order to reduce the size of the printout. Fortunately, it only took 60 models to converge.

print(lasso)
##
## Call:  glmnet(x = predictor_variables, y = language_score, family = "gaussian",      alpha = 1)
##
##       Df    %Dev  Lambda
##  [1,]  0 0.00000 5.47100
##  [2,]  1 0.06194 4.98500
##  [3,]  1 0.11340 4.54200
##  [4,]  1 0.15610 4.13900
##  [5,]  1 0.19150 3.77100
............................
## [55,]  3 0.39890 0.03599
## [56,]  3 0.39900 0.03280
## [57,]  3 0.39900 0.02988
## [58,]  3 0.39900 0.02723
## [59,]  3 0.39900 0.02481
## [60,]  3 0.39900 0.02261

The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results we can set the lambda at 0.02 because this explains the highest amount of deviance at .39.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers next to the coefficient lines refers to the actual coefficient of a particular variable as it changes from using different lambda values. Each number corresponds to a variable going from left to right in a dataframe/matrix using the “View” function. For example, 1 in the plot refers to “IQ” 2 refers to “GS” etc.

plot(lasso,xvar="lambda",label=T)

As you can see, as lambda increase the coefficient decrease in value. This is how regularized regression works. However, unlike ridge regression which never reduces a coefficient to zero, lasso regression does reduce a coefficient to zero. For example, coefficient 3 (SES variable) and coefficient 2 (GS variable) are reduced to zero when lambda is near 1.

You can also look at the coefficient values at a specific lambda values. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to do see the results

lasso.coef<-coef(lasso,s=.02,exact = T)
lasso.coef
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  9.35736325
## IQ           2.34973922
## GS          -0.02766978
## SES          0.16150542

Results indicate that for a 1 unit increase in IQ there is a 2.41 point increase in language score. When GS (class size) goes up 1 unit there is a .03 point decrease in language score. Finally, when SES (socioeconomic status) increase  1 unit language score improves .13 point.

The second plot shows us the deviance explained on the x-axis. On the y-axis is the coefficients of the predictor variables. Below is the code

plot(lasso,xvar='dev',label=T)

If you look carefully, you can see that the two plots are completely opposite to each other. increasing lambda cause a decrease in the coefficients. Furthermore, increasing the fraction of deviance explained leads to an increase in the coefficient. You may remember seeing this when we used the “print”” function. As lambda became smaller there was an increase in the deviance explained.

Now, we will assess our model using the test data. We need to convert the test dataset to a matrix. Then we will use the “predict”” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.

test.matrix<-as.matrix(test[,2:4])
lasso.y<-predict(lasso,newx = test.matrix,type = 'response',s=.02)
plot(lasso.y,test$lang) The visual looks promising. The last thing we need to do is calculated the mean squared error. By its self this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code lasso.resid<-lasso.y-test$lang
mean(lasso.resid^2)
## [1] 46.74314

Knowing this number, we can, if we wanted, develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better while keeping in mind the complexity of the model.

# Ridge Regression in R

In this post, we will conduct an analysis using ridge regression. Ridge regression is a type of regularized regression. By applying a shrinkage penalty, we are able to reduce the coefficients of many variables almost to zero while still retaining them in the model. This allows us to develop models that have many more variables in them compared to models using best subset or stepwise regression.

In the example used in this post, we will use the “SAheart” dataset from the “ElemStatLearn” package. We want to predict systolic blood pressure (sbp) using all of the other variables available as predictors. Below is some initial code that we need to begin.

library(ElemStatLearn);library(car);library(corrplot)
library(leaps);library(glmnet);library(caret)
data(SAheart)
str(SAheart)
## 'data.frame':    462 obs. of  10 variables:
##  $sbp : int 160 144 118 170 134 132 142 114 114 132 ... ##$ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  $ldl : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ... ##$ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  $famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ... ##$ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $obesity : num 25.3 28.9 29.1 32 26 ... ##$ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  $age : int 52 63 46 58 49 45 38 58 29 53 ... ##$ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

A look at the object using the “str” function indicates that one variable “famhist” is a factor variable. The “glmnet” function that does the ridge regression analysis cannot handle factors so we need to converts this to a dummy variable. However, there are two things we need to do before this. First, we need to check the correlations to make sure there are no major issues with multi-collinearity Second, we need to create our training and testing data sets. Below is the code for the correlation plot.

p.cor<-cor(SAheart[,-5])
corrplot.mixed(p.cor)

First we created a variable called “p.cor” the -5 in brackets means we removed the 5th column from the “SAheart” data set which is the factor variable “Famhist”. The correlation plot indicates that there is one strong relationship between adiposity and obesity. However, one common cut-off for collinearity is 0.8 and this value is 0.72 which is not a problem.

We will now create are training and testing sets and convert “famhist” to a dummy variable.

ind<-sample(2,nrow(SAheart),replace=T,prob = c(0.7,0.3))
train<-SAheart[ind==1,]
test<-SAheart[ind==2,]
train$famhist<-model.matrix( ~ famhist - 1, data=train ) #convert to dummy variable test$famhist<-model.matrix( ~ famhist - 1, data=test )

We are still not done preparing our data yet. “glmnet” cannot use data frames, instead, it can only use matrices. Therefore, we now need to convert our data frames to matrices. We have to create two matrices, one with all of the predictor variables and a second with the outcome variable of blood pressure. Below is the code

predictor_variables<-as.matrix(train[,2:10])
blood_pressure<-as.matrix(train$sbp) We are now ready to create our model. We use the “glmnet” function and insert our two matrices. The family is set to Gaussian because “blood pressure” is a continuous variable. Alpha is set to 0 as this indicates ridge regression. Below is the code ridge<-glmnet(predictor_variables,blood_pressure,family = 'gaussian',alpha = 0) Now we need to look at the results using the “print” function. This function prints a lot of information as explained below. • Df = number of variables including in the model (this is always the same number in a ridge model) • %Dev = Percent of deviance explained. The higher the better • Lambda = The lambda used to attain the %Dev When you use the “print” function for a ridge model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we have the “print” function. However, I have only printed the first 5 and last 5 models in order to save space. print(ridge) ## ## Call: glmnet(x = predictor_variables, y = blood_pressure, family = "gaussian", alpha = 0) ## ## Df %Dev Lambda ## [1,] 10 7.622e-37 7716.0000 ## [2,] 10 2.135e-03 7030.0000 ## [3,] 10 2.341e-03 6406.0000 ## [4,] 10 2.566e-03 5837.0000 ## [5,] 10 2.812e-03 5318.0000 ................................ ## [95,] 10 1.690e-01 1.2290 ## [96,] 10 1.691e-01 1.1190 ## [97,] 10 1.692e-01 1.0200 ## [98,] 10 1.693e-01 0.9293 ## [99,] 10 1.693e-01 0.8468 ## [100,] 10 1.694e-01 0.7716 The results from the “print” function are useful in setting the lambda for the “test” dataset. Based on the results we can set the lambda at 0.83 because this explains the highest amount of deviance at .20. The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers refer to the actual coefficient of a particular variable. Inside the plot, each number corresponds to a variable going from left to right in a data-frame/matrix using the “View” function. For example, 1 in the plot refers to “tobacco” 2 refers to “ldl” etc. Across the top of the plot is the number of variables used in the model. Remember this number never changes when doing ridge regression. plot(ridge,xvar="lambda",label=T) As you can see, as lambda increase the coefficient decrease in value. This is how ridge regression works yet no coefficient ever goes to absolute 0. You can also look at the coefficient values at a specific lambda value. The values are unstandardized but they provide a useful insight when determining final model selection. In the code below the lambda is set to .83 and we use the “coef” function to do this ridge.coef<-coef(ridge,s=.83,exact = T) ridge.coef ## 11 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 105.69379942 ## tobacco -0.25990747 ## ldl -0.13075557 ## adiposity 0.29515034 ## famhist.famhistAbsent 0.42532887 ## famhist.famhistPresent -0.40000846 ## typea -0.01799031 ## obesity 0.29899976 ## alcohol 0.03648850 ## age 0.43555450 ## chd -0.26539180 The second plot shows us the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code plot(ridge,xvar='dev',label=T) The two plots are completely opposite to each other. Increasing lambda cause a decrease in the coefficients while increasing the fraction of deviance explained leads to an increase in the coefficient. You can also see this when we used the “print” function. As lambda became smaller there was an increase in the deviance explained. We now can begin testing our model on the test data set. We need to convert the test dataset to a matrix and then we will use the predict function while setting our lambda to .83 (remember a lambda of .83 explained the most of the deviance). Lastly, we will plot the results. Below is the code. test.matrix<-as.matrix(test[,2:10]) ridge.y<-predict(ridge,newx = test.matrix,type = 'response',s=.83) plot(ridge.y,test$sbp)

The last thing we need to do is calculated the mean squared error. By it’s self this number is useless. However, it provides a benchmark for comparing the current model with any other models you may develop. Below is the code

ridge.resid<-ridge.y-test$sbp mean(ridge.resid^2) ## [1] 372.4431 Knowing this number, we can develop other models using other methods of analysis to try to reduce it as much as possible. # Regularized Linear Regression Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with 100’s of variables problems have begun to arise. For example, using stepwise or best subset method with regression could take hours if not days to converge in even some of the best computers. To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts • Definition of regularized regression • Ridge regression • Lasso regression • Elastic net regression Regularization Regularization involves the use of a shrinkage penalty in order to reduce the residual sum of squares (RSS). This is done through selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed. The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said RSS + λ(normalized coefficients) The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of fitting k-1 models when k is the number of variables available (for example, 50 variables would lead 49 models!), with regularization only one model is developed for each value of lambda you specify. Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially. Finally, regularization can help to reduce or eliminate any multicollenarity in a model. As such, the benefits of using regularization make it clear that this should be considering when working with larger data sets. Ridge Regression Ridge regression involves the normalization of the squared weights or as shown in the equation below RSS + λ(normalized coefficients^2) This is also refered to as the L2-norm. As lambda increase in value the coefficients in the model are shrunk towards 0 but never reach 0. This is how the error is shrunk. The higher the lambda the lower the value of the coefficients as they are reduce more and more thus reducing the RSS. The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain. Lasso Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm which is the sum of the absolute value of the coefficients or as shown in the equation below RSS + λ(Σ|normalized coefficients|) This shrinkage penalty will reduce a coefficient to 0 which is another way of saying that variables will be removed from the model. One problem is that highly correlated variables that need to be in your model my be removed when Lasso shrinks coefficients. This is one reason why ridge regression is still used. Elastic Net Elastic net is the best of ridge and Lasso without the weaknesses of either. It combines extracts variables like Lasso and Ridge does not while also group variables like Ridge does but Lasso does not. This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression and if alpha is set to 1 it is the same as lasso regression. If you are able to appreciate it below is the formula used for elastic net regression (RSS + l[(1 – alpha)(S|normalized coefficients|2)/2 + alpha(S|normalized coefficients|)])/N) As such when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model. Conclusion Regularized regressio was developed as an answer to the growth in the size and number of variables in a data set today. Ridge, lasso an elastic net all provide solutions to converging over large datasets and selecting features. # Primary Tasks in Data Analysis Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do. In light of this challenge, there are at least 5 core task that you need to consider when preparing to analyze data. These five task are 1. Developing your question(s) 2. Data exploration 3. Developing a statistical model 4. Interpreting the results 5. Sharing the results Developing Your Question(s) You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s). There are several types of research questions. The point is you need to ask them in order to answer them. Data Exploration Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible. In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables. Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist time and is not easy at all. Mastery of this in many ways means being a master of data science Develop a Statistical Model Your research questions and the data exploration process helps you to determine what kind of model to develop. The factors that can affect this is whether your data is supervised or unsupervised and whether you want to classify or predict numerical values. This is probably the funniest part of data analysis and is much easier then having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s) Interpreting the Results Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders. With interpretation you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out? Communication of Results Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following • The audience or stakeholders • The actual answers to the questions • The benefits of knowing this You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way. Conclusion The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis. # Logistic Regression in R In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical dependent variables. In our example, we want to predict Sex (male or female) when using several continuous variables from the “survey” dataset in the “MASS” package. library(MASS);library(bestglm);library(reshape2);library(corrplot) data(survey) ?MASS::survey #explains the variables in the study The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove in examples that are missing data so we use the “na.omit” function for this. Below is the code survey$Clap<-NULL
survey$W.Hnd<-NULL survey$Fold<-NULL
survey$Exer<-NULL survey$Smoke<-NULL
survey$M.I<-NULL survey<-na.omit(survey) We now need to check for collinearity using the “corrplot.mixed” function form the “corrplot” package. pc<-cor(survey[,2:5]) corrplot.mixed(pc) corrplot.mixed(pc) We have extreme correlation between “We.Hnd” and “NW.Hnd” this makes sense because people’s hands are normally the same size. Since this blog post is a demonstration of logistic regression we will not worry about this too much. We now need to divide our dataset into a train and a test set. We set the seed for. First we need to make a variable that we call “ind” that is randomly assigns 70% of the number of rows of survey 1 and 30% 2. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is 70% train and 30% test. Below is the code set.seed(123) ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3)) train<-survey[ind==1,] test<-survey[ind==2,] We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios. fit<-glm(Sex~.,family=binomial,train) summary(fit) ## ## Call: ## glm(formula = Sex ~ ., family = binomial, data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9875 -0.5466 -0.1395 0.3834 3.4443 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -46.42175 8.74961 -5.306 1.12e-07 *** ## Wr.Hnd -0.43499 0.66357 -0.656 0.512 ## NW.Hnd 1.05633 0.70034 1.508 0.131 ## Pulse -0.02406 0.02356 -1.021 0.307 ## Height 0.21062 0.05208 4.044 5.26e-05 *** ## Age 0.00894 0.05368 0.167 0.868 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 169.14 on 122 degrees of freedom ## Residual deviance: 81.15 on 117 degrees of freedom ## AIC: 93.15 ## ## Number of Fisher Scoring iterations: 6 exp(coef(fit)) ## (Intercept) Wr.Hnd NW.Hnd Pulse Height ## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00 ## Age ## 1.008980e+00 The results indicate that only height is useful in predicting if someone is a male or female. The second piece of code shares the odds ratios. The odds ratio tell how a one unit increase in the independent variable leads to an increase in the odds of being male in our model. For example, for every one unit increase in height there is a 1.23 increase in the odds of a particular example being male. We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code. train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123) train$predict[train$probs>0.5]<-"Male" table(train$predict,train$Sex) ## ## Female Male ## Female 61 7 ## Male 7 48 mean(train$predict==train$Sex) ## [1] 0.8861789 Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset. test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46) test$predict[test$prob>0.5]<-"Male" table(test$predict,test$Sex) ## ## Female Male ## Female 17 3 ## Male 0 26 mean(test$predict==test$Sex) ## [1] 0.9347826 As you can see, we do even better on the test set with an accuracy of 93.4%. Our model is looking pretty good and height is an excellent predictor of sex which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess the quality of it. # Probability,Odds, and Odds Ratio In logistic regression, there are three terms that are used frequently but can be confusing if they are not thoroughly explained. These three terms are probability, odds, and odds ratio. In this post, we will look at these three terms and provide an explanation of them. Probability Probability is probably (no pun intended) the easiest of these three terms to understand. Probability is simply the likelihood that a certain even will happen. To calculate the probability in the traditional sense you need to know the number of events and outcomes to find the probability. Bayesian probability uses prior probabilities to develop a posterior probability based on new evidence. For example, at one point during Super Bowl LI the Atlanta Falcons had a 99.7% chance of winning. This was base don such factors as the number points they were ahead and the time remaining. As these changed, so did the probability of them winning. yet the Patriots still found a way to win with less then a 1% chance Bayesian probability was also used for predicting who would win the 2016 US presidential race. It is important to remember that probability is an expression of confidence and not a guarantee as we saw in both examples. Odds Odds are the expression of relative probabilities. Odds are calculated using the following equation probability of the event ⁄ 1 – probability of the event For example, at one point during Super Bowl LI the odds of the Atlanta Falcons winning were as follows 0.997 ⁄ 1 – 0.997 = 332 This can be interpreted as the odds being 332 to 1! This means that Atlanta was 332 times more likely to win the Super Bowl then loss the Super Bowl. Odds are commonly used in gambling and this is probably (again no pun intended) where most of us have heard the term before. The odds is just an extension of probabilities and the are most commonly expressed as a fraction such as one in four, etc. Odds Ratio A ratio is the comparison of of two numbers and indicates how many times one number is contained or contains another number. For example, a ration of boys to girls is 5 to 1 it means that there are five boys for every one girl. By extension odds ratio is the comparison of two different odds. For example, if the odds of Team A making the playoffs is 45% and the odds of Team B making the playoffs is 35% the odds ratio is calculated as follows. 0.45 ⁄ 0.35 = 1.28 Team A is 1.28 more likely to make the playoffs then Team B. The value of the odds and the odds ratio can sometimes be the same. Below is the odds ratio of the Atlanta Falcons winning and the New Patriots winning Super Bowl LI 0.997⁄ 0.003 = 332 As such there is little difference between odds and odds ratio except that odds ratio is the ratio of two odds ratio. As you can tell, there is a lot of confusion about this for the average person. However, understanding these terms is critical to the application of logistic regression. # Best Subset Regression in R In this post, we will take a look at best subset regression. Best subset regression fits a model for all possible feature or variable combinations and the decision for the most appropriate model is made by the analyst based on judgment or some statistical criteria. Best subset regression is an alternative to both Forward and Backward stepwise regression. Forward stepwise selection adds one variable at a time based on the lowest residual sum of squares until no more variables continues to lower the residual sum of squares. Backward stepwise regression starts with all variables in the model and removes variables one at a time. The concern with stepwise methods is they can produce biased regression coefficients, conflicting models, and inaccurate confidence intervals. Best subset regression bypasses these weaknesses of stepwise models by creating all models possible and then allowing you to assess which variables should be including in your final model. The one drawback to best subset is that a large number of variables means a large number of potential models, which can make it difficult to make a decision among several choices. In this post, we will use the “Fair” dataset from the “Ecdat” package to predict marital satisfaction based on age, Sex, the presence of children, years married, religiosity, education, occupation, and number of affairs in the past year. Below is some initial code. library(leaps);library(Ecdat);library(car);library(lmtest) data(Fair) We begin our analysis by building the initial model with all variables in it. Below is the code fit<-lm(rate~.,Fair) summary(fit) ## ## Call: ## lm(formula = rate ~ ., data = Fair) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.2049 -0.6661 0.2298 0.7705 2.2292 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.522875 0.358793 9.819 < 2e-16 *** ## sexmale -0.062281 0.099952 -0.623 0.53346 ## age -0.009683 0.007548 -1.283 0.20005 ## ym -0.019978 0.013887 -1.439 0.15079 ## childyes -0.206976 0.116227 -1.781 0.07546 . ## religious 0.042142 0.037705 1.118 0.26416 ## education 0.068874 0.021153 3.256 0.00119 ** ## occupation -0.015606 0.029602 -0.527 0.59825 ## nbaffairs -0.078812 0.013286 -5.932 5.09e-09 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.03 on 592 degrees of freedom ## Multiple R-squared: 0.1405, Adjusted R-squared: 0.1289 ## F-statistic: 12.1 on 8 and 592 DF, p-value: 4.487e-16 The initial results are already interesting even though the r-square is low. When couples have children the have less martial satisfaction than couples without children when controlling for the other factors and this is the strongest regression weight. In addition, the more education a person has there is an increase in marital satisfaction. Lastly, as the number of affairs increases there is also a decrease in martial satisfaction. Keep in mind that the “rate” variable goes from 1 to 5 with one meaning a terrible marriage to five being a great one. The mean marital satisfaction was 3.52 when controlling for the other variables. We will now create our subset models. Below is the code. sub.fit<-regsubsets(rate~.,Fair) best.summary<-summary(sub.fit) In the code above we create the sub models using the “regsubsets” function from the “leaps” package and saved it in the variable called “sub.fit”. We then saved the summary of “sub.fit” in the variable “best.summary”. We will use the “best.summary” “sub.fit variables several times to determine which model to use. There are many different ways to assess the model. We will use the following statistical methods that come with the results from the “regsubset” function. • Mallow’ Cp • Bayesian Information Criteria We will make two charts for each of the criteria above. The plot to the left will explain how many features to include in the model. The plot to the right will tell you which variables to include. It is important to note that for both of these methods, the lower the score the better the model. Below is the code for Mallow’s Cp. par(mfrow=c(1,2)) plot(best.summary$cp)
plot(sub.fit,scale = "Cp")

The plot on the left suggests that a four feature model is the most appropriate. However, this chart does not tell me which four features. The chart on the right is read in reverse order. The high numbers are at the bottom and the low numbers are at the top when looking at the y-axis. Knowing this, we can conclude that the most appropriate variables to include in the model are age, children presence, education, and number of affairs. Below are the results using the Bayesian Information Criterion

par(mfrow=c(1,2))
plot(best.summary$bic) plot(sub.fit,scale = "bic") These results indicate that a three feature model is appropriate. The variables or features are years married, education, and number of affairs. Presence of children was not considered beneficial. Since our original model and Mallow’s Cp indicated that presence of children was significant we will include it for now. Below is the code for the model based on the subset regression. fit2<-lm(rate~age+child+education+nbaffairs,Fair) summary(fit2) ## ## Call: ## lm(formula = rate ~ age + child + education + nbaffairs, data = Fair) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.2172 -0.7256 0.1675 0.7856 2.2713 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.861154 0.307280 12.566 < 2e-16 *** ## age -0.017440 0.005057 -3.449 0.000603 *** ## childyes -0.261398 0.103155 -2.534 0.011531 * ## education 0.058637 0.017697 3.313 0.000978 *** ## nbaffairs -0.084973 0.012830 -6.623 7.87e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.029 on 596 degrees of freedom ## Multiple R-squared: 0.1352, Adjusted R-squared: 0.1294 ## F-statistic: 23.29 on 4 and 596 DF, p-value: < 2.2e-16 The results look ok. The older a person is the less satisfied they are with their marriage. If children are present the marriage is less satisfying. The more educated the more satisfied they are. Lastly, the higher the number of affairs indicate less marital satisfaction. However, before we get excited we need to check for collinearity and homoscedasticity. Below is the code vif(fit2) ## age child education nbaffairs ## 1.249430 1.228733 1.023722 1.014338 No issues with collinearity.For vif values above 5 or 10 indicate a problem. Let’s check for homoscedasticity par(mfrow=c(2,2)) plot(fit2) The normal qqplot and residuals vs leverage plot can be used for locating outliers. The residual vs fitted and the scale-location plot do not look good as there appears to be a pattern in the dispersion which indicates homoscedasticity. To confirm this we will use Breusch-Pagan test from the “lmtest” package. Below is the code bptest(fit2) ## ## studentized Breusch-Pagan test ## ## data: fit2 ## BP = 16.238, df = 4, p-value = 0.002716 There you have it. Our model violates the assumption of homoscedasticity. However, this model was developed for demonstration purpose to provide an example of subset regression. # Data Wrangling in R Collecting and preparing data for analysis is the primary job of a data scientist. This experience is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in r or excel. If you created it in excel just save it as a csv and load it into r. Below is the initial code library(readr) apple <- read_csv("~/Desktop/apple.csv") ## # A tibble: 10 × 2 ## weight location ## <chr> <chr> ## 1 3.2 Europe ## 2 4.2kg europee ## 3 1.3 kg U.S. ## 4 7200 grams USA ## 5 42 United States ## 6 2.3 europee ## 7 2.1kg Europe ## 8 3.1kg USA ## 9 2700 grams U.S. ## 10 24 United States This a small dataset with the columns of “weight” and “location”. Here are some of the problems • Weights are in different units • Weights are written in different ways • Location is not consistent In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project • Convert the “Weight variable” to a numerical variable instead of character • Remove the text and have only numbers in the “weight variable” • Change weights in grams to kilograms • Convert the “location” variable to a factor variable instead of character • Have consistent spelling for Europe and United States in the “location” variable We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

1. We created a variable called “corrected.weight”
2. We use the function “as.numeric” this makes whatever results inside it to be a numerical variable
3. Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another.
4. Inside “gsub” we used the argument pattern and set it to “[[alpha:]]” and “” this told r to look for any lower or uppercase letters and replace with nothing or remove it. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight) corrected.weight[gram.error]<-corrected.weight[gram.error]/1000 corrected.weight ## [1] 3.2 4.2 1.3 7.2 42.0 2.3 2.1 3.1 2.7 24.0 Here is what we did 1. We created a variable called “gram.error” 2. We used the grep function to search are the “weight” variable in the apple data frame for input that is a digit and is 4 digits in length this is what the “[[:digit:]]{4}” argument means. We do not change any values yet we just store them in “gram.error” 3. Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable. 4. We tell r to save into the “corrected.weight” variable any value that is changeable according to the criteria set in “gram.error” and to divided it by 1000. Dividing it by 1000 converts the value from grams to kilograms. We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to United States. Below is the code. europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0)) corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain but in short We used the “agrep” function to tell r to search the “location” to look for values similar to our term “europe”. The other arguments provide some exceptions that r should change because the exceptions are close to the term europe. This process is repeated for the term “us”. We then store are the location variable from the “apple” dataframe in a new variable called “corrected.location” We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next we have to make some code to deal with “United States” and apply this using the “gsub” function.

We are almost done, now we combine are two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

# Principal Component Analysis in R

This post will demonstrate the use of principal component analysis (PCA). PCA is useful for several reasons. One it allows you place your examples into groups similar to linear discriminant analysis but you do not need to know beforehand what the groups are. Second, PCA is used for the purpose of dimension reduction. For example, if you have 50 variables PCA can allow you to reduce this while retaining a certain threshold of variance. If you are working with a large dataset this can greatly reduce the computational time and general complexity of your models.

Keep in mind that there really is not a dependent variable as this is unsupervised learning. What you are trying to see is how different examples can be mapped in space based on whatever independent variables are used. For our example, we will use the “Carseats” dataset form the “ISLR”. Our goal is to understanding the relationship among the variables when examining the shelve location of the car seat. Below is the initial code to begin the analysis

library(ggplot2)
library(ISLR)
data("Carseats")

We first need to rearrange the data and remove the variables we are not going to use in the analysis. Below is the code.

Carseats1<-Carseats
Carseats1<-Carseats1[,c(1,2,3,4,5,6,8,9,7,10,11)]
Carseats1$Urban<-NULL Carseats1$US<-NULL

Here is what we did 1. We made a copy of the “Carseats” data called “Careseats1” 2. We rearranged the order of the variables so that the factor variables are at the end. This will make sense later 3.We removed the “Urban” and “US” variables from the table as they will not be a part of our analysis

We will now do the PCA. We need to scale and center our data otherwise the larger numbers will have a much stronger influence on the results than smaller numbers. Fortunately, the “prcomp” function has a “scale” and a “center” argument. We will also use only the first 7 columns for the analysis  as “sheveLoc” is not useful for this analysis. If we hadn’t moved “shelveLoc” to the end of the dataframe it would cause some headache. Below is the code.

Carseats.pca<-prcomp(Carseats1[,1:7],scale. = T,center = T)
summary(Carseats.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.3315 1.1907 1.0743 0.9893 0.9260 0.80506 0.41320
## Proportion of Variance 0.2533 0.2026 0.1649 0.1398 0.1225 0.09259 0.02439
## Cumulative Proportion  0.2533 0.4558 0.6207 0.7605 0.8830 0.97561 1.00000

The summary of “Carseats.pca” Tells us how much of the variance each component explains. Keep in mind that number of components is equal to the number of variables. The “proportion of variance” tells us the contribution each component makes and the “cumulative proportion”.

If your goal is dimension reduction than the number of components to keep depends on the threshold you set. For example, if you need around 90% of the variance you would keep the first 5 components. If you need 95% or more of the variance you would keep the first six. To actually use the components you would take the “Carseats.pca$x” data and move it to your data frame. Keep in mind that the actual components have no conceptual meaning but is a numerical representation of a combination of several variables that were reduce using PCA to fewer variables such as going form 7 variables to 5 variables. This means that PCA is great for reducing variables for prediction purpose but is much harder for explanatory studies unless you can explain what the new components represent. For our purposes, we will keep 5 components. This means that we have reduce our dimensions from 7 to 5 while still keeping almost 90% of the variance. Graphing our results is tricky because we have 5 dimensions but the human mind can only conceptualize 3 at the best and normally 2. As such we will plot the first two components and label them by shelf location using ggplot2. Below is the code scores<-as.data.frame(Carseats.pca$x)
pcaplot<-ggplot(scores,(aes(PC1,PC2,color=Carseats1$ShelveLoc)))+geom_point() pcaplot From the plot you can see there is little separation when using the first two components of the PCA analysis. This makes sense as we can only graph to components so we are missing a lot of the variance. However for demonstration purposes the analysis is complete. # Developing a Data Analysis Plan It is extremely common for beginners and perhaps even experience researchers to lose track of what they are trying to achieve or do when trying to complete a research project. The open nature of research allows for a multitude of equally acceptable ways to complete a project. This leads to an inability to make decision and or stay on course when doing research. One way to reduce and eliminate the roadblock to decision making and focus in research is to develop a plan. In this post we will look at one version of a data analysis plan. Data Analysis Plan A data analysis plan includes many features of a research project in it with a particular emphasis on mapping out how research questions will be answered and what is necessary to answer the question. Below is a sample template of the analysis plan. The majority of this diagram should be familiar to someone who has ever done research. At the top, you state the problem, this is the overall focus of the paper. Next comes the purpose, the purpose is the over-arching goal of a research project. After purpose comes the research questions. The research questions are questions about the problem that are answerable. People struggle with developing clear and answerable research questions. It is critical that research questions are written in a way that they can be answered and that the questions are clearly derived from the problem. Poor questions means poor or even no answers. After the research questions it is important to know what variables are available for the entire study and specifically what variables can be used to answer each research question. Lastly, you must indicate what analysis or visual you will develop in order to answer your research questions about your problem. This requires you to know how you will answer your research questions Example Below is an example of a completed analysis plan for simple undergraduate level research paper In the example above, the student want to understand the perceptions of university students about the cafeteria food quality and their satisfaction with the university. There were four research questions, a demographic descriptive question, a descriptive question about the two main variables, a comparison question, and lastly a relationship question. The variables available for answering the questions are listed of to the left side. Under that, the student indicates the variables needed to answer each question. For example, the demographic variables of sex, class level, and major are needed to answer the question about the demographic profile. The last section is the analysis. For the demographic profile the student found the percentage of the population in each sub group of the demographic variables. Conclusion A data analysis plan provides an excellent way to determine what needs to be done to complete a study. It also helps a researcher to clearly understand what they are trying to do and provides a visuals for those who the research wants to communicate with about the progress of a study. # Linear Discriminant Analysis in R In this post we will look at an example of linear discriminant analysis (LDA). LDA is used to develop a statistical model that classifies examples in a dataset. In the example in this post, we will use the “Star” dataset from the “Ecdat” package. What we will do is try to predict the type of class the students learned in (regular, small, regular with aide) using their math scores, reading scores, and the teaching experience of the teacher. Below is the initial code library(Ecdat) library(MASS) data(Star) We first need to examine the data by using the “str” function str(Star) ## 'data.frame': 5748 obs. of 8 variables: ##$ tmathssk: int  473 536 463 559 489 454 423 500 439 528 ...
##  $treadssk: int 447 450 439 448 447 431 395 451 478 455 ... ##$ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $totexpk : int 7 21 0 16 5 8 17 3 11 10 ... ##$ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ... ##$ race    : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
##  $schidkn : int 63 20 19 69 79 5 16 56 11 66 ... ## - attr(*, "na.action")=Class 'omit' Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ... ## .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ... We will use the following variables • dependent variable = classk (class type) • independent variable = tmathssk (Math score) • independent variable = treadssk (Reading score) • independent variable = totexpk (Teaching experience) We now need to examine the data visually by looking at histograms for our independent variables and a table for our dependent variable hist(Star$tmathssk)

hist(Star$treadssk) hist(Star$totexpk)

prop.table(table(Star$classk)) ## ## regular small.class regular.with.aide ## 0.3479471 0.3014962 0.3505567 The data mostly looks good. The results of the “prop.table” function will help us when we develop are training and testing datasets. The only problem is with the “totexpk” variable. IT is not anywhere near to be normally distributed. TO deal with this we will use the square root for teaching experience. Below is the code star.sqrt<-Star star.sqrt$totexpk.sqrt<-sqrt(star.sqrt$totexpk) hist(sqrt(star.sqrt$totexpk))

Much better. We now need to check the correlation among the variables as well and we will use the code below.

cor.star<-data.frame(star.sqrt$tmathssk,star.sqrt$treadssk,star.sqrt$totexpk.sqrt) cor(cor.star) ## star.sqrt.tmathssk star.sqrt.treadssk ## star.sqrt.tmathssk 1.00000000 0.7135489 ## star.sqrt.treadssk 0.71354889 1.0000000 ## star.sqrt.totexpk.sqrt 0.08647957 0.1045353 ## star.sqrt.totexpk.sqrt ## star.sqrt.tmathssk 0.08647957 ## star.sqrt.treadssk 0.10453533 ## star.sqrt.totexpk.sqrt 1.00000000 None of the correlations are too bad. We can now develop our model using linear discriminant analysis. First, we need to scale are scores because the test scores and the teaching experience are measured differently. Then, we need to divide our data into a train and test set as this will allow us to determine the accuracy of the model. Below is the code. star.sqrt$tmathssk<-scale(star.sqrt$tmathssk) star.sqrt$treadssk<-scale(star.sqrt$treadssk) star.sqrt$totexpk.sqrt<-scale(star.sqrt$totexpk.sqrt) train.star<-star.sqrt[1:4000,] test.star<-star.sqrt[4001:5748,] Now we develop our model. In the code before the “prior” argument indicates what we expect the probabilities to be. In our data the distribution of the the three class types is about the same which means that the apriori probability is 1/3 for each class type. train.lda<-lda(classk~tmathssk+treadssk+totexpk.sqrt, data = train.star,prior=c(1,1,1)/3) train.lda ## Call: ## lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star, ## prior = c(1, 1, 1)/3) ## ## Prior probabilities of groups: ## regular small.class regular.with.aide ## 0.3333333 0.3333333 0.3333333 ## ## Group means: ## tmathssk treadssk totexpk.sqrt ## regular -0.04237438 -0.05258944 -0.05082862 ## small.class 0.13465218 0.11021666 -0.02100859 ## regular.with.aide -0.05129083 -0.01665593 0.09068835 ## ## Coefficients of linear discriminants: ## LD1 LD2 ## tmathssk 0.89656393 -0.04972956 ## treadssk 0.04337953 0.56721196 ## totexpk.sqrt -0.49061950 0.80051026 ## ## Proportion of trace: ## LD1 LD2 ## 0.7261 0.2739 The printout is mostly readable. At the top is the actual code used to develop the model followed by the probabilities of each group. The next section shares the means of the groups. The coefficients of linear discriminants are the values used to classify each example. The coefficients are similar to regression coefficients. The computer places each example in both equations and probabilities are calculated. Whichever class has the highest probability is the winner. In addition, the higher the coefficient the more weight it has. For example, “tmathssk” is the most influential on LD1 with a coefficient of 0.89. The proportion of trace is similar to principal component analysis Now we will take the trained model and see how it does with the test set. We create a new model called “predict.lda” and use are “train.lda” model and the test data called “test.star” predict.lda<-predict(train.lda,newdata = test.star) We can use the “table” function to see how well are model has done. We can do this because we actually know what class our data is beforehand because we divided the dataset. What we need to do is compare this to what our model predicted. Therefore, we compare the “classk” variable of our “test.star” dataset with the “class” predicted by the “predict.lda” model. table(test.star$classk,predict.lda$class) ## ## regular small.class regular.with.aide ## regular 155 182 249 ## small.class 145 198 174 ## regular.with.aide 172 204 269 The results are pretty bad. For example, in the first row called “regular” we have 155 examples that were classified as “regular” and predicted as “regular” by the model. In rhe next column, 182 examples that were classified as “regular” but predicted as “small.class”, etc. To find out how well are model did you add together the examples across the diagonal from left to right and divide by the total number of examples. Below is the code (155+198+269)/1748 ## [1] 0.3558352 Only 36% accurate, terrible but ok for a demonstration of linear discriminant analysis. Since we only have two-functions or two-dimensions we can plot our model. Below I provide a visual of the first 50 examples classified by the predict.lda model. plot(predict.lda$x[1:50])
text(predict.lda$x[1:50],as.character(predict.lda$class[1:50]),col=as.numeric(predict.lda$class[1:100])) abline(h=0,col="blue") abline(v=0,col="blue") The first function, which is the vertical line, doesn’t seem to discriminant anything as it off to the side and not separating any of the data. However, the second function, which is the horizontal one, does a good of dividing the “regular.with.aide” from the “small.class”. Yet, there are problems with distinguishing the class “regular” from either of the other two groups. In order improve our model we need additional independent variables to help to distinguish the groups in the dependent variable. # Generalized Additive Models in R In this post, we will learn how to create a generalized additive model (GAM). GAMs are non-parametric generalized linear models. This means that linear predictor of the model uses smooth functions on the predictor variables. As such, you do not need to specific the functional relationship between the response and continuous variables. This allows you to explore the data for potential relationships that can be more rigorously tested with other statistical models In our example, we will use the “Auto” dataset from the “ISLR” package and use the variables “mpg”,“displacement”,“horsepower”,and “weight” to predict “acceleration”. We will also use the “mgcv” package. Below is some initial code to begin the analysis library(mgcv) library(ISLR) data(Auto) We will now make the model we want to understand the response of “accleration” to the explanatory variables of “mpg”,“displacement”,“horsepower”,and “weight”. After setting the model we will examine the summary. Below is the code model1<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight),data=Auto) summary(model1) ## ## Family: gaussian ## Link function: identity ## ## Formula: ## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.54133 0.07205 215.7 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(mpg) 6.382 7.515 3.479 0.00101 ** ## s(displacement) 1.000 1.000 36.055 4.35e-09 *** ## s(horsepower) 4.883 6.006 70.187 < 2e-16 *** ## s(weight) 3.785 4.800 41.135 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.733 Deviance explained = 74.4% ## GCV = 2.1276 Scale est. = 2.0351 n = 392 All of the explanatory variables are significant and the adjust r-squared is .73 which is excellent. edf stands for “effective degrees of freedom”. This modified version of the degree of freedoms is due to the smoothing process in the model. GCV stands for generalized cross validation and this number is useful when comparing models. The model with the lowest number is the better model. We can also examine the model visually by using the “plot” function. This will allow us to examine if the curvature fitted by the smoothing process was useful or not for each variable. Below is the code. plot(model1) We can also look at a 3d graph that includes the linear predictor as well as the two strongest predictors. This is done with the “vis.gam” function. Below is the code vis.gam(model1) If multiple models are developed. You can compare the GCV values to determine which model is the best. In addition, another way to compare models is with the “AIC” function. In the code below, we will create an additional model that includes “year” compare the GCV scores and calculate the AIC. Below is the code. model2<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight)+s(year),data=Auto) summary(model2) ## ## Family: gaussian ## Link function: identity ## ## Formula: ## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) + ## s(year) ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.54133 0.07203 215.8 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(mpg) 5.578 6.726 2.749 0.0106 * ## s(displacement) 2.251 2.870 13.757 3.5e-08 *** ## s(horsepower) 4.936 6.054 66.476 < 2e-16 *** ## s(weight) 3.444 4.397 34.441 < 2e-16 *** ## s(year) 1.682 2.096 0.543 0.6064 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.733 Deviance explained = 74.5% ## GCV = 2.1368 Scale est. = 2.0338 n = 392 #model1 GCV model1$gcv.ubre
##   GCV.Cp
## 2.127589
#model2 GCV
model2$gcv.ubre ## GCV.Cp ## 2.136797 As you can see, the second model has a higher GCV score when compared to the first model. This indicates that the first model is a better choice. This makes sense because in the second model the variable “year” is not significant. To confirm this we will calculate the AIC scores using the AIC function. AIC(model1,model2) ## df AIC ## model1 18.04952 1409.640 ## model2 19.89068 1411.156 Again, you can see that model1 s better due to its fewer degrees of freedom and slightly lower AIC score. Conclusion Using GAMs is most common for exploring potential relationships in your data. This is stated because they are difficult to interpret and to try and summarize. Therefore, it is normally better to develop a generalized linear model over a GAM due to the difficulty in understanding what the data is trying to tell you when using GAMs. # Generalized Models in R Generalized linear models are another way to approach linear regression. The advantage of of GLM is that allows the error to follow many different distributions rather than only the normal distribution which is an assumption of traditional linear regression. Often GLM is used for response or dependent variables that are binary or represent count data. THis post will provide a brief explanation of GLM as well as provide an example. Key Information There are three important components to a GLM and they are • Error structure • Linear predictor • Link function The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling such as binomial, gaussian, poission, etc. Each distribution comes with certain assumptions that govern their use. The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions and the best link function is the one that reduces the residual deviances the most. In our example, we will try to predict if a house will have air conditioning based on the interactioon between number of bedrooms and bathrooms, number of stories, and the price of the house. To do this, we will use the “Housing” dataset from the “Ecdat” package. Below is some initial code to get started. library(Ecdat) data("Housing") The dependent variable “airco” in the “Housing” dataset is binary. This calls for us to use a GLM. To do this we will use the “glm” function in R. Furthermore, in our example, we want to determine if there is an interaction between number of bedrooms and bathrooms. Interaction means that the two independent variables (bathrooms and bedrooms) influence on the dependent variable (aircon) is not additive, which means that the combined effect of the independnet variables is different than if you just added them together. Below is the code for the model followed by a summary of the results model<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family=binomial)
summary(model)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + ## Housing$stories + Housing$price, family = binomial) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.7069 -0.7540 -0.5321 0.8073 2.4217 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -6.441e+00 1.391e+00 -4.632 3.63e-06 ## Housing$bedrooms                  8.041e-01  4.353e-01   1.847   0.0647
## Housing$bathrms 1.753e+00 1.040e+00 1.685 0.0919 ## Housing$stories                   3.209e-01  1.344e-01   2.388   0.0170
## Housing$price 4.268e-05 5.567e-06 7.667 1.76e-14 ## Housing$bedrooms:Housing$bathrms -6.585e-01 3.031e-01 -2.173 0.0298 ## ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 681.92 on 545 degrees of freedom ## Residual deviance: 549.75 on 540 degrees of freedom ## AIC: 561.75 ## ## Number of Fisher Scoring iterations: 4 To check how good are model is we need to check for overdispersion as well as compared this model to other potential models. Overdispersion is a measure to determine if there is too much variablity in the model. It is calcualted by dividing the residual deviance by the degrees of freedom. Below is the solution for this 549.75/540 ## [1] 1.018056 Our answer is 1.01, which is pretty good because the cutoff point is 1, so we are really close. Now we will make several models and we will compare the results of them Model 2 #add recroom and garagepl model2<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family=binomial)
summary(model2)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + ## Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, ## family = binomial) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.6733 -0.7522 -0.5287 0.8035 2.4239 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -6.369e+00 1.401e+00 -4.545 5.51e-06 ## Housing$bedrooms                  7.830e-01  4.391e-01   1.783   0.0745
## Housing$bathrms 1.702e+00 1.047e+00 1.626 0.1039 ## Housing$stories                   3.286e-01  1.378e-01   2.384   0.0171
## Housing$price 4.204e-05 6.015e-06 6.989 2.77e-12 ## Housing$recroomyes                1.229e-01  2.683e-01   0.458   0.6470
## Housing$garagepl 2.555e-03 1.308e-01 0.020 0.9844 ## Housing$bedrooms:Housing$bathrms -6.430e-01 3.054e-01 -2.106 0.0352 ## ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 681.92 on 545 degrees of freedom ## Residual deviance: 549.54 on 538 degrees of freedom ## AIC: 565.54 ## ## Number of Fisher Scoring iterations: 4 #overdispersion calculation 549.54/538 ## [1] 1.02145 Model 3 model3<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family=binomial) summary(model3) ## ## Call: ## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$fullbase +
##     Housing$garagepl, family = binomial) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.6629 -0.7436 -0.5295 0.8056 2.4477 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -6.424e+00 1.409e+00 -4.559 5.14e-06 ## Housing$bedrooms                  8.131e-01  4.462e-01   1.822   0.0684
## Housing$bathrms 1.764e+00 1.061e+00 1.662 0.0965 ## Housing$stories                   3.083e-01  1.481e-01   2.082   0.0374
## Housing$price 4.241e-05 6.106e-06 6.945 3.78e-12 ## Housing$recroomyes                1.592e-01  2.860e-01   0.557   0.5778
## Housing$fullbaseyes -9.523e-02 2.545e-01 -0.374 0.7083 ## Housing$garagepl                 -1.394e-03  1.313e-01  -0.011   0.9915
## Housing$bedrooms:Housing$bathrms -6.611e-01  3.095e-01  -2.136   0.0327
##
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.40  on 537  degrees of freedom
## AIC: 567.4
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.4/537
## [1] 1.023091

Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test.

anova(model, model2, model3, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
##     Housing$price ## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$garagepl
## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
##     Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       540     549.75
## 2       538     549.54  2  0.20917   0.9007
## 3       537     549.40  1  0.14064   0.7076

The results of the anova indicate that the models are all essentially the same as there is no statistical difference. The only criteria on which to select a model is the measure of overdispersion. The first model has the lowest rate of overdispersion and so is the best when using this criteria. Therefore, determining if a hous has air conditioning depends on examining number of bedrooms and bathrooms simultenously as well as the number of stories and the price of the house.

Conclusion

The post explained how to use and interpret GLM in R. GLM can be used primarilyy for fitting data to disrtibutions that are not normal.

# Proportion Test in R

Proportions are are a fraction or “portion” of a total amount. For example, if there are ten men and ten women in a room the proportion of men in the room is 50% (5 / 10). There are times when doing an analysis that you want to evaluate proportions in our data rather than individual measurements of mean, correlation, standard deviation etc.

In this post we will learn how to do a test of proportions using R. We will use the dataset “Default” which is found in the “ISLR” pacakage. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started.

library(ISLR)
data("Default")

We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function.

table(Default$student) ## ## No Yes ## 7056 2944 We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample. If we sum the results from the table below is the code. sum(table(Default$student))
## [1] 10000

There are 10000 people in the sample. To determine the proprtion of students we take the number 2944 / 10000 which equals 29.44 or 29.44%. Below is the code to calculate this

table(Default$student) / sum(table(Default$student))
##
##     No    Yes
## 0.7056 0.2944

The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so it is better to state specificallt what are hypotheses are. NULL = The value of 29.44% of the sample being students is the same as 50% found in the population ALTERNATIVE = The value of 29.44% of the sample being students is NOT the same as 50% found in the population.

Below is the code to complete the z-test.

prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE)
##
##  1-sample proportions test without continuity correction
##
## data:  2944 out of 10000, null probability 0.5
## X-squared = 1690.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2855473 0.3034106
## sample estimates:
##      p
## 0.2944

Here is what the code means. 1. prop.test is the function used 2. The first value of 2944 is the total number of students in the sample 3. n = is the sample size 4. p= 0.5 is the theoretical proportion 5. alternative =“two.sided” means we want a two-tail test 6. correct = FALSE means we do not want a correction applied to the z-test. This is useful for small sample sizes but not for our sample of 10000

The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proprtion of students in our sample is different from a theortical proprition of 50% in the population.

Below is the same analysis using the binomial exact test.

binom.test(2944, n = 10000, p = 0.5)
##
##  Exact binomial test
##
## data:  2944 and 10000
## number of successes = 2944, number of trials = 10000, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2854779 0.3034419
## sample estimates:
## probability of success
##                 0.2944

The results are the same. Whether to use the “prop.test”” or “binom.test” is a major argument among statisticians. The purpose here was to provide an example of the use of both

# Theoretical Distribution and R

This post will explore an example of testing if a dataset fits a specific theoretical distribution. This is a very important aspect of statistical modeling as it allows to understand the normality of the data and the appropriate steps needed to take to prepare for analysis.

In our example, we will use the “Auto” dataset from the “ISLR” package. We will check if the horsepower of the cars in the dataset is normally distributed or not. Below is some initial code to begin the process.

library(ISLR)
library(nortest)
library(fBasics)
data("Auto")

Determining if a dataset is normally distributed is simple in R. This is normally done visually through making a Quantile-Quantile plot (Q-Q plot). It involves using two functions the “qnorm” and the “qqline”. Below is the code for the Q-Q plot

qqnorm(Auto$horsepower) We now need to add the Q-Q line to see how are distribution lines up with the theoretical normal one. Below is the code. Note that we have to repeat the code above in order to get the completed plot. qqnorm(Auto$horsepower)
qqline(Auto$horsepower, distribution = qnorm, probs=c(.25,.75)) The “qqline” function needs the data you want to test as well as the distribution and probability. The distribution we wanted is normal and is indicated by the argument “qnorm”. The probs argument means probability. The default values are .25 and .75. The resulting graph indicates that the distribution of “horsepower”, in the “Auto” dataset is not normally distributed. That are particular problems with the lower and upper values. We can confirm our suspicion by running a statistical test. The Anderson-Darling test from the “nortest” package will allow us to test whether our data is normally distributed or not. The code is below ad.test(Auto$horsepower)
##  Anderson-Darling normality test
##
## data:  Auto$horsepower ## A = 12.675, p-value < 2.2e-16 From the results, we can conclude that the data is not normally distributed. This could mean that we may need to use non-parametric tools for statistical analysis. We can further explore our distribution in terms of its skew and kurtosis. Skew measures how far to the left or right the data leans and kurtosis measures how peaked or flat the data is. This is done with the “fBasics” package and the functions “skewness” and “kurtosis”. First we will deal with skewness. Below is the code for calculating skewness. horsepowerSkew<-skewness(Auto$horsepower)
horsepowerSkew
## [1] 1.079019
## attr(,"method")
## [1] "moment"

We now need to determine if this value of skewness is significantly different from zero. This is done with a simple t-test. We must calculate the t-value before calculating the probability. The standard error of the skew is defined as the square root of six divided by the total number of samples. The code is below

stdErrorHorsepower<-horsepowerSkew/(sqrt(6/length(Auto$horsepower))) stdErrorHorsepower ## [1] 8.721607 ## attr(,"method") ## [1] "moment" Now we take the standard error of Horsepower and plug this into the “pt” function (t probability) with the degrees of freedom (sample size – 1 = 391) we also put in the number 1 and subtract all of this information. Below is the code 1-pt(stdErrorHorsepower,391) ## [1] 0 ## attr(,"method") ## [1] "moment" The value zero means that we reject the null hypothesis that the skew is not significantly different form zero and conclude that the skew is different form zero. However, the value of the skew was only 1.1 which is not that non-normal. We will now repeat this process for the kurtosis. The only difference is that instead of taking the square root divided by six we divided by 24 in the example below. horsepowerKurt<-kurtosis(Auto$horsepower)
horsepowerKurt
## [1] 0.6541069
## attr(,"method")
## [1] "excess"
stdErrorHorsepowerKurt<-horsepowerKurt/(sqrt(24/length(Auto$horsepower))) stdErrorHorsepowerKurt ## [1] 2.643542 ## attr(,"method") ## [1] "excess" 1-pt(stdErrorHorsepowerKurt,391) ## [1] 0.004267199 ## attr(,"method") ## [1] "excess" Again the pvalue is essentially zero, which means that the kurtosis is significantly different from zero. With a value of 2.64 this is not that bad. However, when both skew and kurtosis are non-normally it explains why our overall distributions was not normal either. Conclusion This post provided insights into assessing the normality of a dataset. Visually inspection can take place using Q-Q plots. Statistical inspection can be done through hypothesis testing along with checking skew and kurtosis. # Probability Distribution and Graphs in R In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario. At a busing company the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this complete the following. • Calculate the interval value to use using the 68-95-99.7 rule • Calculate the density curve • Graph the normal curve • Evaluate the probability of a bus having less then 65 stops • Evaluate the probability of a bus having more than 93 stops Calculate the Interval Value Our first step is to calculate the interval value. This is the range in which 99.7% of the values falls within. Doing this requires knowing the mean and the standard deviation and subtracting/adding the standard deviation as it is multiplied by three from the mean. Below is the code for this. busStopMean<-81 busStopSD<-7.9 busStopMean+3*busStopSD ## [1] 104.7 busStopMean-3*busStopSD ## [1] 57.3 The values above mean that we can set are interval between 55 and 110 with 100 buses in the data. Below is the code to set the interval. interval<-seq(55,110, length=100) #length here represents 100 fictitious buses Density Curve The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this. densityCurve<-dnorm(interval,mean=81,sd=7.9) We will now plot the normal curve of our data using ggplot. Before we need to put our “interval” and “densityCurve” variables in a dataframe. We will call the dataframe “normal” and then we will create the plot. Below is the code. library(ggplot2) normal<-data.frame(interval, densityCurve) ggplot(normal, aes(interval, densityCurve))+geom_line()+ggtitle("Number of Stops for Buses") Probability Calculation We now want to determine what is the provability of a bus having less than 65 stops. To do this we use the “pnorm” function in R and include the value 65, along with the mean, standard deviation, and tell R we want the lower tail only. Below is the code for completing this. pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE) ## [1] 0.02141744 As you can see, at 2% it would be unusually to. We can also plot this using ggplot. First, we need to set a different density curve using the “pnorm” function. Combine this with our “interval” variable in a dataframe and then use this information to make a plot in ggplot2. Below is the code. CumulativeProb<-pnorm(interval, mean=81,sd=7.9,lower.tail = TRUE) pnormal<-data.frame(interval, CumulativeProb) ggplot(pnormal, aes(interval, CumulativeProb))+geom_line()+ggtitle("Cumulative Density of Stops for Buses") Second Probability Problem We will now calculate the probability of a bus have 93 or more stops. To make it more interesting we will create a plot that shades the area under the curve for 93 or more stops. The code is a little to complex to explain so just enjoy the visual. pnorm(93,mean=81,sd=7.9,lower.tail = FALSE) ## [1] 0.06438284 x<-interval ytop<-dnorm(93,81,7.9) MyDF<-data.frame(x=x,y=densityCurve) p<-ggplot(MyDF,aes(x,y))+geom_line()+scale_x_continuous(limits = c(50, 110)) +ggtitle("Probabilty of 93 Stops or More is 6.4%") shade <- rbind(c(93,0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "X"], 0)) p + geom_segment(aes(x=93,y=0,xend=93,yend=ytop)) + geom_polygon(data = shade, aes(x, y)) Conclusion A lot of work was done but all in a practical manner. Looking at realistic problem. We were able to calculate several different probabilities and graph them accordingly. # A History of Structural Equation Modeling Structural Equation Modeling (SEM) is complex form of multiple regression that is commonly used in social science research. In many ways, SEM is an amalgamation of factor analysis and path analysis as we shall see. The history of this data analysis approach can be traced all the way back to the beginning of the 20th century. This post will provide a brief overview of SEM. Specifically, we will look at the role of factory and path analysis in the development of SEM. The Beginning with Factor and Path Analysis The foundation of SEM was laid with the development of Spearman’s work with intelligence in the early 20th century. Spearman was trying to trace the various dimensions of intelligence back to a single factor. In the 1930’s Thurstone developed multi-factor analysis as he saw intelligence not as a a single factor as Spearman but rather as several factors. Thurstone also bestowed the gift of factor rotation on the statistical community. Around the same time (1920’s-1930’s), Wright was developing path analysis. Path analysis relies on manifest variables with the ability to model indirect relationships among variables. This is something that standard regression normally does not do. In economics, a econometrics was using many of the same ideas as Wright. It was in the early 1950’s that econometricians saw what Wright was doing in his discipline of biometrics. SEM is Born In the 1970’s, Joreskog combined the measurement powers of factor analysis with the regression modeling power of path analysis. The factor analysis capabilities of SEM allow it to assess the accuracy of the measurement of the model. The path analysis capabilities of SEM allow it to model direct and indirect relationships among latent variables. From there, there was an explosion in ways to assess models as well as best practice suggestions. In addition, there are many different software available for conducting SEM analysis. Examples include the LISREL which was the first software available, AMOS which allows the use of a graphical interface. One software worthy of mentioning is Lavaan. Lavaan is a r package that performs SEM. The primary benefit of Lavaan is that it is available for free. Other software can be exceedingly expensive but Lavaan provides the same features for a price that cannot be beat. Conclusion SEM is by far not new to the statistical community. With a history that is almost 100 years old, SEM has been in many ways with the statistical community since the birth of modern statistics. # Ensemble Learning for Machine Models One way to improve a machine learning model is to not make just one model. Instead, you can make several models that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions. The use of multiple models is know as ensemble learning. This post will provide insights into ensemble learning as they are used in developing machine models. The Major Challenge The biggest challenges in creating an ensemble of models is deciding what models to develop and how the various models are combined to make predictions. To deal with these challenges involves the use of training data and several different functions. The Process Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive a subset of the data or limit how many features each model can use. However, if several different algorithms are used the allocation function may pass all the data to each model with making any changes. After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function. The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used which means that if 5 models where develop and 3 vote “yes” than the example is classified as a yes. Another option is to weight the models so that some have more influence then others in the final predictions. Benefits of Ensemble Learning Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. With the combine strengths of many different models and or algorithms it is difficult to go wrong Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model. Conclusion Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step. # Developing a Customize Tuning Process in R In this post, we will learn how to develop customize criteria for tuning a machine learning model using the “caret” package. There are two things that need to be done in order to complete assess a model using customized features. These two steps are… • Determine the model evaluation criteria • Create a grid of parameters to optimize The model we are going to tune is the decision tree model made in a previous post with the C5.0 algorithm. Below is code for loading some prior information. library(caret); library(Ecdat) data(Wages1) DETERMINE the MODEL EVALUATION CRITERIA We are going to begin by using the “trainControl” function to indicate to R what re-sampling method we want to use, the number of folds in the sample, and the method for determining the best model. Remember, that there are many more options but these are the onese we will use. All this information must be saved into a variable using the “trainControl” function. Later, the information we place into the variable will be used when we rerun our model. For our example, we are going to code the following information into a variable we will call “chck” for re sampling we will use k-fold cross-validation. The number of folds will be set to 10. The criteria for selecting the best model will be the through the use of the “oneSE” method. The “oneSE” method selects the simplest model within one standard error of the best performance. Below is the code for our variable “chck” chck<-trainControl(method = "cv",number = 10, selectionFunction = "oneSE") For now this information is stored to be used later CREATE GRID OF PARAMETERS TO OPTIMIZE We now need to create a grid of parameters. The grid is essential the characteristics of each model. For the C5.0 model we need to optimize the model, number of trials, and if winnowing was used. Therefore we will do the following. • For model, we want decision trees only • Trials will go from 1-35 by increments of 5 • For winnowing, we do not want any winnowing to take place. In all we are developing 8 models. We know this based on the trial parameter which is set to 1, 5, 10, 15, 20, 25, 30, 35. To make the grid we use the “expand.grid” function. Below is the code. modelGrid<-expand.grid(.model ="tree", .trials= c(1,5,10,15,20,25,30,35), .winnow="FALSE") CREATE THE MODEL We are now ready to generate our model. We will use the kappa statistic to evaluate each model’s performance set.seed(1) customModel<- train(sex ~., data=Wages1, method="C5.0", metric="Kappa", trControl=chck, tuneGrid=modelGrid) customModel ## C5.0 ## ## 3294 samples ## 3 predictors ## 2 classes: 'female', 'male' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold) ## Summary of sample sizes: 2966, 2965, 2964, 2964, 2965, 2964, ... ## Resampling results across tuning parameters: ## ## trials Accuracy Kappa Accuracy SD Kappa SD ## 1 0.5922991 0.1792161 0.03328514 0.06411924 ## 5 0.6147547 0.2255819 0.03394219 0.06703475 ## 10 0.6077693 0.2129932 0.03113617 0.06103682 ## 15 0.6077693 0.2129932 0.03113617 0.06103682 ## 20 0.6077693 0.2129932 0.03113617 0.06103682 ## 25 0.6077693 0.2129932 0.03113617 0.06103682 ## 30 0.6077693 0.2129932 0.03113617 0.06103682 ## 35 0.6077693 0.2129932 0.03113617 0.06103682 ## ## Tuning parameter 'model' was held constant at a value of tree ## ## Tuning parameter 'winnow' was held constant at a value of FALSE ## Kappa was used to select the optimal model using the one SE rule. ## The final values used for the model were trials = 5, model = tree ## and winnow = FALSE. The actually output is similar to the model that “caret” can automatically create. The difference here is that the criteria was set by us rather than automatically. A close look reveals that all of the models perform poorly but that there is no change in performance after ten trials. CONCLUSION This post provided a brief explanation of developing a customize way of assessing a models performance. To complete this, you need configure your options as well as setup your grid in order to assess a model. Understanding the customization process for evaluating machine learning models is one of the strongest ways to develop supremely accurate models that retain generalizability. # Improving the Performance of Machine Learning Model For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do, how to specify the model, among other things is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance. This post will serve as an introduction to understanding how to improving model performance. In particular, we will look at the following • When it is necessary to improve performance • Parameter tuning When to Improve It is not always necessary to try and improve the performance of a model. There are times when a model does well and you know this through the evaluating it. If the commonly used measures are adequate there is no cause for concern. However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle/unclear relationships can make improvement necessary. Normally, real-world data has the problems so model improvement is usually necessary. Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times and also brute trial-and-error effort as well. The point is that there is no singular agreed upon way to improve a model. It is better to focus on explaining how you did it if necessary. Parameter Tuning Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in r through the use of the “caret” package. When trying to decide what to do when tuning parameters it is important to remember the following. • What machine learning model and algorithm you are using for your data. • Which parameters you can adjust. • What criteria you are using to evaluate the model Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning, those that classify, those that employ regression, and those that can do both. Understanding this helps you to make decision about what you are trying to do. Next, you need to understand what exactly you or r are adjusting when analyzing the model. For example, for C5.0 decision trees “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved. Lastly, it is important to know what criteria you are using to compare the various models. For classifying models you can look at the kappa and the various information derived from the confusion matrix. For regression based models you may look at the r-square, the RMSE (Root mean squared error), or the ROC curve. Conclusion As you can perhaps tell there is an incredible amount of choice and options in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making. In a future post, we will look at an example of model improvement. # Receiver Operating Characteristic Curve The receiver operating characteristic curve (ROC curve) is a tool used in statistical research to assess the trade-off of detecting true positives and true negatives. The origins of this tool goes all the way back to WWII when engineers were trying to distinguish between true and false alarms. Now this technique is used in machine learning This post will explain the ROC curve and provide and example using R. Below is a diagram of an ROC curve On the X axis we have the false positive rate. As you move to the right the false positive rate increases which is bad. We want to be as close to zero as possible. On the y axis we have the true positive rate. Unlike the x axis we want the true positive rate to be as close to 100 as possible. In general we want a low value on the x-axis and a high value on the y-axis. In the diagram above, the diagonal line called “Test without diagnostic benefit” represents a model that cannot tell the difference between true and false positives. Therefore, it is not useful for our purpose. The L-shaped curve call “Good diagnostic test” is an example of an excellent model. This is because all the true positives are detected . Lastly, the curved-line called “Medium diagonistic test” represents an actually model. This model is a balance between the perfect L-shaped model and the useless straight-line model. The curved-line model is able to moderately distinguish between false and true positives. Area Under the ROC Curve The area under an ROC curve is literally called the “Area Under the Curve” (AUC). This area is calculated with a standardized value ranging from 0 – 1. The closer to 1 the better the model We will now look at an analysis of a model using the ROC curve and AUC. This is based on the results of a post using the KNN algorithm for nearest neighbor classification. Below is the code predCollege <- ifelse(College_test_pred=="Yes", 1, 0) realCollege <- ifelse(College_test_labels=="Yes", 1, 0) pr <- prediction(predCollege, realCollege) collegeResults <- performance(pr, "tpr", "fpr") plot(collegeResults, main="ROC Curve for KNN Model", col="dark green", lwd=5) abline(a=0,b=1, lwd=1, lty=2) aucOfModel<-performance(pr, measure="auc") unlist(aucOfModel@y.values) 1. The first to variables (predCollege & realCollege) is just for converting the values of the prediction of the model and the actual results to numeric variables 2. The “pr” variable is for storing the actual values to be used for the ROC curve. The “prediction” function comes from the “ROCR” package 3. With the information information of the “pr” variable we can now analyze the true and false positives, which are stored in the “collegeResults” variable. The “performance” function also comes from the “ROCR” package. 4. The next two lines of code are for plot the ROC curve. You can see the results below 6. The curve looks pretty good. To confirm this we use the last two lines of code to calculate the actually AUC. The actual AUC is 0.88 which is excellent. In other words, the model developed does an excellent job of discerning between true and false positives. Conclusion The ROC curve provides one of many ways in which to assess the appropriateness of a model. As such, it is yet another tool available for a person who is trying to test models. # Using Confusion Matrices to Evaluate Performance The data within a confusion matrix can be used to calculate several different statistics that can indicate the usefulness of a statistical model in machine learning. In this post, we will look at several commonly used measures, specifically… • accuracy • error • sensitivity • specificity • precision • recall • f-measure Accuracy Accuracy is probably the easiest statistic to understand. Accuracy is the total number of items correctly classified divided by the total number of items below is the equation accuracy = TP + TN TP + TN + FP + FN TP = true positive, TN = true negative, FP = false positive, FN = false negative Accuracy can range in value from 0-1 with one representing 100% accuracy. Normally, you don’t want perfect accuracy as this is an indication of overfitting and your model will probably not do well with other data. Error Error is the opposite of accuracy and represent the percentage of examples that are incorrectly classified it’s equation is as follows. error = FP + FN TP + TN + FP + FN The lower the error the better in general. However, if error is 0 it indicates overfitting. Keep in mind that error is the inverse of accuracy. As one increases the other decreases. Sensitivity Sensitivity is the proportion of true positives that were correctly classified.The formula is as follows sensitivity = TP TP + FN This may sound confusing but high sensitivity is useful for assessing a negative result. In other words, if I am testing people for a disease and my model has a high sensitivity. This means that the model is useful telling me a person does not have a disease. Specificity Specificity measures the proportion of negative examples that were correctly classified. The formula is below specificity = TN TN + FP Returning to the disease example, a high specificity is a good measure for determining if someone has a disease if they test positive for it. Remember that no test is foolproof and there are always false positives and negatives happening. The role of the researcher is to maximize the sensitivity or specificity based on the purpose of the model. Precision Precision is the proportion of examples that are really positive. The formula is as follows precision = TP TP + FP The more precise a model is the more trustworthy it is. In other words, high precision indicates that the results are relevant. Recall Recall is a measure of the completeness of the results of a model. It is calculated as follows recall = TP TP + FN This formula is the same as the formula for sensitivity. The difference is in the interpretation. High recall means that the results have a breadth to them such as in search engine results. F-Measure The f-measure uses recall and precision to develop another way to assess a model. The formula is below sensitivity = 2 * TP 2 * TP + FP + FN The f-measure can range from 0 – 1 and is useful for comparing several potential models using one convenient number. Conclusion This post provide a basic explanation of various statistics that can be used to determine the strength of a model. Through using a combination of statistics a researcher can develop insights into the strength of a model. The only mistake is relying exclusively on any single statistical measurement. # Understanding Confusion Matrices A confusion matrix is a table that is used to organize the predictions made during an analysis of data. Without making a joke confusion matrices can be confusing especially for those who are new to research. In this post, we will look at how confusion matrices are setup as well as what the information in them means. Actual Vs Predicted Class The most common confusion matrix is a two class matrix. This matrix compares the actual class of an example with the predicted class of the model. Below is an example Two Class Matrix Predicted Class A B Correctly classified as A Incorrectly classified as B Incorrectly classified as A Correctly classified as B Actual class is along the vertical side Looking at the table there are four possible outcomes. • Correctly classified as A-This means that the example was a part of the A category and the model predicted it as such • Correctly classified as B-This means that the example was a part of the B category and the model predicted it as such • Incorrectly classified as A-This means that the example was a part of the B category but the model predicted it to be a part of the A group • Incorrectly classified as B-This means that the example was a part of the A category but the model predicted it to be a part of the B group These four types of classifications have four different names which are true positive, true negative, false positive, and false negative. We will look at another example to understand these four terms. Two Class Matrix Predicted Lazy Students Lazy Not Lazy 1. Correctly classified as lazy 2. Incorrectly classified as not Lazy 3. Incorrectly classified as Lazy 4. Correctly classified as not lazy Actual class is along the vertical side In the example above, we want to predict which students are lazy. Group one, is the group in which students who are lazy are correctly classified as lazy. This is called true positive. Group 2 are those who are lazy but are predicted as not being lazy. This is known as a false negative also known as a type II error in statistics. This is a problem because if the student is misclassified they may not get the support they need. Group three is students who are not lazy but are classified as such. This is known as a false positive or type I error. In this example, being labeled lazy is a major headache for the students but not as dangerous perhaps as a false negative. Lastly, group four are students who are not lazy and are correctly classified as such. This is known as a true negative. Conclusion The primary purpose of a confusion matrix is to display this information visually. In future post we will see that there is even more information found in a confusion matrix than what was cover briefly here. # Understanding Market Basket Analysis Market basket analysis a machine learning approach that attempts to find relationships among a group of items in a data set. For example, a famous use of this method was when retailers discovered an association between beer and diapers. Upon closer examination, the retailers found that when men came to purchase diapers for their babies they would often buy beer in the same trip. With this knowledge, the retailers placed beer and diapers next to each other in the store and this further increased sales. In addition, many of the recommendation systems we experience when shopping online use market basket analysis results to suggest additional products to us. As such, market basket analysis is an intimate part of our lives with us even knowing. In this post, we will look at some of the details of market basket analysis such as association rules, apriori, and the role of support and confidence. Association Rules The heart of market basket analysis are association rules. Association rules explain patterns of relationship among items. Below is an example {rice, seaweed} -> {soy sauce} Everything in curly braces { } is an itemset, which is some form of data that occurs often in the dataset based on a criteria. Rice and seaweed is our itemset on the left and soy sauce is our itemset on the right. The arrow -> indicates what comes first as we read from left to right. If we put this association rule in simply English it would say “if someone buys rice and seaweed then they will buy soy sauce”. The practical application of this rule is to place rice, seaweed and soy sauce near each other in order to reinforce this rule when people come to shop. The Algorithm Market basket analysis uses a apriori algorithm. This algorithm is useful for unsupervised learning that does not require any training and thus no predictions. The apriori algorithm is especially useful with large datasets but it employs simple procedures to find useful relationships among the items. The shortcut that this algorithm uses is the “apriori property” which states that all sugsets of a frequent itemset must also be frequent. What this means in simply English is that the items in an itemset need to be common in the overall dataset. This simple rule saves a tremendous amount of computational time. Support and Confidence To key pieces of information that can further refine the work of the apriori algorithm is support and confidence. Support is a measure of the frequency of an itemset ranging from 0 (no support) to 1 (highest support). High support indicates the importance of the itemset in the data and contributes to the itemset being used to generate association rule(s). Returning to our rice, seaweed, and soy sauce example. We can say that the support for soy sauce is 0.4. This means that soy sauce appears in 40% of the purchases in the dataset which is pretty high. Confidence is a measure of the accuracy of an association rule which is measured from 0 to 1. The higher the confidence the more accurate the association rule. If we say that our rice, seaweed, and soy sauce rule has a confidence of 0.8 we are saying that when rice and seaweed are purchased together, 80% of the time soy sauce is purchased as well. Support and confidence can be used to influence the apriori algorithm by setting cutoff values to be searched for. For example, if we setting a minimum support of 0.5 and a confidence of 0.65 we are telling the computer to only report to us association rules that are above these cutoff points. This helps to remove useless rules that are obvious or useless. Conclusion Market basket analysis is a useful tool for mining information from large datasets. The rules are easy to understanding. In addition, market basket analysis can be used in many fields beyond shopping and can include relationships within DNA, and other forms of human behavior. As such, care must be made so that unsound conclusions are not drawn from random patterns in the data # Basics of Support Vector Machines Support vector machines (SVM) is another one of those mysterious black box methods in machine learning. This post will try to explain in simple terms what SVM are and their strengths and weaknesses. Definition SVM is a combination of nearest neighbor and linear regression. For the nearest neighbor, SVM uses the traits of an identified example to classify an unidentified one. For regression, a line is drawn that divides the various groups.It is preferred that the line is straight but this is not always the case This combination of using the nearest neighbor along with the development of a line leads to the development of a hyperplane. The hyperplane is drawn in a place that creates the greatest amount of distance among the various groups identified. The examples in each group that are closest to the hyperplane are the support vectors. They support the vectors by providing the boundaries for the various groups. If for whatever reason a line cannot be straight because the boundaries are not nice and night. R will still draw a straight line but make accommodations through the use of a slack variable, which allow for error and or for examples to be in the wrong group. Another trick used in SVM analysis is the kernel trick. A kernel will add a new dimension or feature to the analysis by combining features that were measured in the data. For example, latitude and lonigitude might be combine mathematically to make altitude. This new feature is now used to develop the hyperplane for the data. There are several different types of kernel tricks that achieve their goal using various mathematics. There is no rule for which one to use and playing different choices is the only strategy currently. Pros and Cons The pros of SVM is their flexibility of use as they can be used to predict numbers or classify. SVM are also able to deal with nosy data and are easier to use than artificial neural networks. Lastly, SVM are often able to resist overfitting and are usually highly accurate. Cons of SVM include they are still complex as they are a member of black box machine learning methods even if they are simpler than artificial neural networks. The lack of a criteria over kernel selection makes it difficult to determine which model is the best. Conclusion SVM provide yet another approach to analyzing data in a machine learning context. Success with this approach depends on determining specifically what the goals of a project are. # Black Box Method-Artificial Neural Networks In machine learning, there are a set of analytical techniques know as black box methods. What is meant by black box methods is that the actual models developed are derived from complex mathematical processes that are difficult to understand and interpret. This difficulty in understanding them is what makes them mysterious. One black method is artificial neural network (ANN). This method tries to imitate mathematically the behavior of neurons in the nervous system of humans. This post will attempt to explain ANN in simplistic terms. The Human Mind and the Artificial One We will begin by looking at how real neurons work before looking at ANN. In simple terms, as this is not a biology blog, neurons send and receive signals. They receive signals through their dendrites, process information in the soma, and the send signals through their axon terminal. Below is a picture of a neuron. An ANN works in a highly similar manner. The x variables are the dendrites that are providing information to the cell body we they are summed. Different dendrites or x variables can have different weights (w). Next, the summation of the x variables is passed to an activation function before moving to the output or dependent variable y. Below is a picture of this process. ￼If you compare the two pictures they are similar yet different. ANN mimics the mind in a way that has fascinated people for over 50 years. Activation Function (f) The activation function purpose is to determine if there should be an activation. In the human body, activation takes place one the nerve cell sends the message to the next cell. This indicates that the message was strong enough to have it move forward. The same concept applies in ANN. A signal will not be passed on unless it meets a minimum threshold. This threshold can vary depending on how the ANN is model. ANN Design The makeup of a ANNs can vary great. Some models have more than one output variable as shown below. Two outputs Other models have what are called hidden layers. These are variables that are both input and output variables. They could be seen as mediating variables. Below is a visual example. Hidden layer is the two circles in the middle How many layers to developed is left to the researcher. When models become really complex with several hidden layers and or outcome variables it is referred to as deep learning in the machine learning community. Another complexity of ANN is the direction of information. Just as in the human body information can move forward and backward in an ANN. This provides for opportunities to model highly complex data and relationships. Conclusion ANN can be used for classifying virtually anything. They are a highly accurate model as well that is not bogged down by many assumptions. However, ANN’s are so hard to understand that it makes it difficult to use them despite their advantages. As such, this form of analysis can be beneficial if the user is able to explain the results. # Classification Rules in Machine Learning Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. Antecedent is the before ad consequent is after. For example, I may have the following rule. If the students studies 5 hours a week then they will pass the class with an A This simple rule can be broken down into the following antecedent and consequent. • Antecedent–If the student studies 5 hours a week • Consequent-then they will pass the class with an A The antecedent determines if the consequent takes place. For example, the student must study 5 hours a week to get an A. This is the rule in this particular context. This post will further explain the characteristic and traits of classification rules. Classification Rules and Decision Trees Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step process to make a decision. Classification rules are stand alone rules that are abstracted from a process. To appreciate a classification rule you do not need to be familiar with the process that created it. While with decision trees you do need to be familiar with the process that generated the decision. One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for large amounts of numeric variables. This is not a problem with decision trees. The Algorithm Classification rules use algorithms that employ a separate and conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subset by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics. Common algorithms used in classification rules include the One Rule Algorithm and the RIPPER Algorithm. The One Rule Algorithm analyzes data and generates one all-encompassing rule. This algorithm works be finding the single rule that contains the less amount of error. Despite its simplicity it is surprisingly accurate. The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that in no longer helps to purify the various groups the rule is pruned or the part of the rule that is not beneficial is removed. This process of growing and pruning rules is continued until there is no further benefit. RIPPER algorithm rules are more complex than One Rule Algorithm. This allows for the development of complex models. The drawback is that the rules can become to complex to make practical sense. Conclusion Classification rules are a useful way to develop clear principles as found in the data. The advantages of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules. # Making a Decision Tree in R In this post, we are going to learn how to use the C5.0 algorithm to make a classification tree in order to make predictions about gender based on wage, education, and job experience using a data set in the “Ecdat” package in R. Below is some code to get started. library(Ecdat); library(C50); library(gmodels) data(Wages1) We now will explore the data to get a sense of what is happening in the it. Below is the code for this str(Wages1) ## 'data.frame': 3294 obs. of 4 variables: ##$ exper : int 9 12 11 9 8 9 8 10 12 7 ...
## $sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ... ##$ school: int 13 12 11 14 14 14 12 12 10 12 ...
## $wage : num 6.32 5.48 3.64 4.59 2.42 ... hist(Wages1$exper)

summary(Wages1$exper) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.000 7.000 8.000 8.043 9.000 18.000  hist(Wages1$wage)

summary(Wages1$wage) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.07656 3.62200 5.20600 5.75800 7.30500 39.81000 hist(Wages1$school)

summary(Wages1$school) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 11.00 12.00 11.63 12.00 16.00 table(Wages1$sex)
## female male
## 1569 1725

As you can see, we have four features (exper, sex, school, wage) in the “Wages1” data set. The histogram for “exper” indicates that it is normally distributed. The “wage” feature is highly left skewed and almost bimodal. This is not a big deal as classification trees are robust against non-normality. The ‘school’ feature is mostly normally distributed. Lastly, the ‘sex’ feature is categorical but there is almost an equal number of men and women in the data. All of the outputs for the means are listed above.

Create Training and Testing Sets

We now need to create are training and testing data sets. In order to do this we need to first randomly reorder our data set. For example, if the data is sorted by one of the features, to split it now would lead to extreme values all being lumped together in one data set.

To make things more confusing, we also need to set our seed. This allows us to be able to replicate our results. Below is the code for doing this.

set.seed(12345)
Wage_rand<-Wages1[order(runif(3294)),]

What we did is explained as follows

1. set the seed using the ‘set.seed’ function (We randomly picked the number 12345)
2. We created the variable ‘Wage_rand’ and we assigned the following
3. From the ‘Wages1’ dataset we used the ‘runif’ function to create a list of 3294 numbers (1-3294) we did this because there are a total of 3294 examples in the dataset.
4. After generating the 3294 numbers we then order sequentially using the “order” function.
5. We then assigned each example in the “Wages1” dataset one of the numbers we created

We will now created are training and testing set using the code below.

Wage_train<-Wage_rand[1:2294,]
Wage_test<-Wage_rand[2295:3294,]

Make the Model
We can now begin training a model below is the code.

Wage_model<-C5.0(Wage_train[-2], Wage_train$sex) The coding for making the model should be familiar by now. One thing that is new is the brackets with the -2 inside. This tells r to ignore the second column in the dataset. We are doing this because we want to predict sex. If it is a part of the indepedent variables we cannot predict it. We can now examine the results of our model by using the following code. Wage_model ## ## Call: ## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
##
## Classification Tree
## Number of samples: 2294
## Number of predictors: 3
##
## Tree size: 9
##
## Non-standard options: attempt to group attributes
summary(Wage_model)
##
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex) ## ## ## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016 ## ——————————- ## ## Class specified by attribute outcome’ ## ## Read 2294 cases (4 attributes) from undefined.data ## ## Decision tree: ## ## wage <= 3.985179: ## :…school > 11: female (345/109) ## : school <= 11: ## : :…exper <= 8: female (224/96) ## : exper > 8: male (143/59) ## wage > 3.985179: ## :…wage > 9.478313: male (254/61) ## wage <= 9.478313: ## :…school > 12: female (320/132) ## school <= 12: ## :…school <= 10: male (246/70) ## school > 10: ## :…school <= 11: male (265/114) ## school > 11: ## :…exper <= 6: female (83/35) ## exper > 6: male (414/173) ## ## ## Evaluation on training data (2294 cases): ## ## Decision Tree ## —————- ## Size Errors ## ## 9 849(37.0%) << ## ## ## (a) (b) ## —- —- ## 600 477 (a): class female ## 372 845 (b): class male ## ## ## Attribute usage: ## ## 100.00% wage ## 88.93% school ## 37.66% exper ## ## ## Time: 0.0 secs The “Wage_model” indicates a small decision tree of only 9 decisions. The “summary” function shows the actual decision tree. It’s somewhat complicated but I will explain the beginning part of the tree. If wages are less than or equal to 3.98 then the person is female THEN If the school is greater than 11 then the person is female ELSE If the school is less then or equal to 11 THEN If The experience of the person is less than or equal to 8 the person is female ELSE If the experience is greater than 8 the person is male etc. The next part of the output shows the amount of error. This model misclassified 37% of the examples which is pretty high. 477 men were misclassified as women and 372 women were misclassified as men. Predict with the Model We will now see how well this model predicts gender in the testing set. Below is the code Wage_pred<-predict(Wage_model, Wage_test) CrossTable(Wage_test$sex, Wage_pred, prop.c = FALSE,
prop.r = FALSE, dnn=c('actual sex', 'predicted sex'))

The output will not display properly here. Please see C50 for a pdf of this post and go to page 7

Again, this code should be mostly familiar for the prediction model. For the table we are comparing the test set sex with predicted sex. The overall model was correct 269 + 346/1000 for 61.5% accuracy rate, which is pretty bad.

Improve the Model

There are two way we are going to try and improve our model. The first is adaptive boosting and the second is error cost.

Adaptive boosting involves making several models that “vote” how to classify an example. To do this you need to add the ‘trials’ parameter to the code. The ‘trial’ parameter sets the upper limit of the number of models R will iterate if necessary. Below is the code for this and the code for the results.

Wage_boost10<-C5.0(Wage_train[-2], Wage_train$sex, trials = 10) #view boosted model summary(Wage_boost10) ## ## Call: ## C5.0.default(x = Wage_train[-2], y = Wage_train$sex, trials = 10)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
## -------------------------------
##
## Class specified by attribute outcome'
##
## Read 2294 cases (4 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## wage <= 3.985179: ## :...school > 11: female (345/109)
## : school <= 11:
## : :...exper <= 8: female (224/96) ## : exper > 8: male (143/59)
## wage > 3.985179:
## :...wage > 9.478313: male (254/61)
## wage <= 9.478313: ## :...school > 12: female (320/132)
## school <= 12:
## :...school <= 10: male (246/70) ## school > 10:
## :...school <= 11: male (265/114) ## school > 11:
## :...exper <= 6: female (83/35) ## exper > 6: male (414/173)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## wage > 6.848846: male (663.6/245)
## wage <= 6.848846:
## :...school <= 10: male (413.9/175) ## school > 10: female (1216.5/537.6)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## wage <= 3.234474: female (458.1/192.9) ## wage > 3.234474: male (1835.9/826.2)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## wage > 9.478313: male (234.8/82.1)
## wage <= 9.478313:
## :...school <= 11: male (883.2/417.8) ## school > 11: female (1175.9/545.1)
##
## ----- Trial 4: -----
##
## Decision tree:
## male (2294/1128.1)
##
## *** boosting reduced to 4 trials since last classifier is very inaccurate
##
##
## Evaluation on training data (2294 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 9 849(37.0%)
## 1 3 917(40.0%)
## 2 2 958(41.8%)
## 3 3 949(41.4%)
## boost 864(37.7%) <<
##
##
## (a) (b) ## ---- ----
## 507 570 (a): class female
## 294 923 (b): class male
##
##
## Attribute usage:
##
## 100.00% wage
## 88.93% school
## 37.66% exper
##
##
## Time: 0.0 secs

R only created 4 models as there was no additional improvement after this. You can see each model in the printout. The overall results are similar to our original model that was not boosted. We will now see how well our boosted model predicts with the code below.

Wage_boost_pred10<-predict(Wage_boost10, Wage_test)
CrossTable(Wage_test$sex, Wage_boost_pred10, prop.c = FALSE, prop.r = FALSE, dnn=c('actual Sex Boost', 'predicted Sex Boost')) Our boosted model has an accuracy rate 223+379/1000 = 60.2% which is about 1% better the our unboosted model (59.1%). As such, boosting the model was not useful (see page 11 of the pdf for the table printout.) Our next effort will be through the use of a cost matrix. A cost matrix allows you to impose a penalty on false positives and negatives at your discretion. This is useful if certain mistakes are too costly for the learner to make. IN our example, we are going to make it 4 times more costly misclassify a female as a male (false negative) and 1 times for costly to misclassify a male as a female (false positive). Below is the code error_cost Wage_cost<-C5.0(Wage_train[-21], Wage_train$sex, cost = error_cost)
Wage_cost_pred<-predict(Wage_cost, Wage_test)
CrossTable(Wage_test$sex, Wage_cost_pred, prop.c = FALSE, prop.r = FALSE, dnn=c('actual Sex EC', 'predicted Sex EC')) With this small change are model is 100% accurate (see page 12 of the pdf). Conclusion This post provided an example of decision trees. Such a model allows someone to predict a given outcome when given specific information. # Understanding Decision Trees Decision trees are yet another method in machine learning that is used for classifying outcomes. Decision trees are very useful for, as you can guess, making decisions based on the characteristics of the data. In this post we will discuss the following • Physical traits of decision trees • How decision trees work • Pros and cons of decision trees Physical Traits of a Decision Tree Decision trees consist of what is called a tree structure. The tree structure consist of a root node, decision nodes, branches and leaf nodes. A root node is the initial decision made in the tree. This depends on which feature the algorithm selects first. Following the root node the tree splits into various branches. Each branch leads to an additional decision node where the data is further subdivided. When you reach the bottom of a tree at the terminal node(s) these are also called leaf nodes. How Decision Trees Work Decision trees use a heuristic called recursive partitioning. What this does is it splits the overall data set into smaller and smaller subsets until each subset is as close to pure (having the same characteristics) as possible. This process is also know as divide and conquer. The mathematics for deciding how to split the data is based on an equation called entropy, which measures the purity of a potential decision node. The lower the entropy score the more pure the decision node is. The entropy can range from 0 (most pure) to 1 (most impure). One of the most popular algorithms for developing decision trees is the C5.0 algorithm. This algorithm in particular uses entropy to assess potential decision nodes. Pros and Cons The prose of decision trees include it versatile nature. Decision trees can deal with all types of data as well as missing data. Furthermore, this approach learns automatically and only uses the most important features. Lastly, a deep understanding of mathematics is not necessary to use this method in comparison to more complex models. Some problems with decision trees is that the can easily overfit the data. This means that the tree does not generalize well to other datasets. In addition, a large complex tree can be hard to interpret, which may be yet another indication of overfitting. Conclusion Decision trees provide another vehicle that researchers can use to empower decision making. This model is most useful particular when a decision that was made needs to be explained and defended. For example, when rejecting a person’s loan application. Complex models made provide stronger mathematical reasons but would be difficult to explain to an irate customer. Therefore, for complex calculation presented in an easy to follow format. Decision trees are one possibility. # Types of Mixed Method Design In a previous post, we looked at mix methods and some examples of this design. Mixed methods is focused on combining quantitative and qualitative methods to study a research problem. In this post we will look at several additional mixed method designs. Specifically, we will look at the follow designs • Embedded design • Transformative design • Multi-phase design Embedded Design Embedded design is the simultaneous collection of quantitative and qualitative data with one form of data by supportive to the other. The supportive data augments the conclusions of the main data collection. The benefits of this design is that allows for one method to lead the analysis with the secondary method provide additional information. For example, quantitative measures are excellent at recording the results of an experiment. Qualitative measures would be useful in determining how participants perceived their experience in the experiment. A downside to this approach making sure the secondary method is truly supporting the overall research. Quantitative and qualitative methods natural answer different research questions. Therefore, the research questions of a study must be worded in a way that allows for cooperation between qualitative and quantitative methods. Transformative Design The transformative design is more of a philosophy than a mixed method design. This design can employ any other mixed method design. The main difference that transformative designs focus on helping a marginalized population with the goal of bringing about change. For example, a researcher might do a study Asian students facing discrimination in a predominately African American high school. The goal of the study would be to document the experiences of Asian students in order to provide administrators with information on the extent of this problem. Such a focus on the oppress is drawn heavily from Critical Theory which exposes how oppression takes place through education. The emphasis on change is derived from Dewy and progressivism. Multiphase Design Multiphase design is actually the use of several designs over several studies. This is a massive and supremely complex process. You would need to tie together several different mixed method studies under one general research problem. From this you can see that this is not a commonly used design. For example, you may decide to continue doing research into Asian student discrimination at African American high schools. The first study might employ an explanatory design. The second study might employ and exploratory design. The last study might be a transformative design. After completing all this work, you would need to be able to articulate the experiences with discrimination of the Asian students. This is not an easy task by any means. As such, if and when this design is used, it often requires the teamwork of several researchers. Conclusion Mixed method designs requires a different way of thinking when it comes to research. The uniqueness of this approach is the combination of qualitative and quantitative methods. This mixing of methods has advantages and disadvantage. The primary point to remember is that the most appropriate design depends on the circumstances of the study. # Introduction to Probability Probability is a critical component of statistical analysis and serves as a way to determine the likelihood of an event occurring. This post will provide a brief introduction into some of the principles of probability. Probability There are several basic probability terms we need to cover • events • trial • mutually exclusive and exhaustive Events are possible outcomes. For example, if you flip a coin, the event can be heads or tails. A trial is a single opportunity for an event to occur. For example, if you flip a coin one time this means that there was one trial or one opportunity for the event of heads or tails to occur. To calculate the probability of an event you need to take the number of trials an event occurred divided by the total number of trials. The capital letter “P” followed by the number in parentheses is always how probability is expressed. Below is the actual equation for this Number of trial the event occurredTotal number of trials = P(event) To provide an example, if we flip a coin ten times and we recored five heads and five tails, if we want to know the probability of heads this is the answer below Five heads ⁄ Ten trials = P(heads) = 0.5 Another term to understand is mutually exclusive and exhaustive. This means that events cannot occur at the same time. For example, if we flip a coin, the result can only be heads or tails. We cannot flip a coin and have both heads and tails happen simultaneously. Joint Probability There are times were events are not mutually exclusive. For example, lets say we have the possible events 1. Musicians 2. Female 3. Female musicians There are many different events that came happen simultaneously • Someone is a musician and not female • Someone who is female and not a musician • Someone who is a female musician There are also other things we need to keep in mind • Everyone is not female • Everyone is not a musician • There are many people who are not female and are not musicians We can now work through a sample problem as shown below. 25% of the population are musicians and 60% of the population is female. What is the probability that someone is a female musician To solve this problem we need to find the joint probability which is the probability of two independent events happening at the same time. Independent events or events that do not influence each other. For example, being female has no influence on becoming a musician and vice versa. For our female musician example, we run the follow calculation. P(Being Musician) * P(Being Female) = 0.25 * 0.60 = 0.25 = 15% From the calculation, we can see that there is a 15% chance that someone will be female and a musician. Conclusion Probability is the foundation of statistical inference. We will see in a future post that not all events are independent. When they are not the use of conditional probability and Bayes theorem is appropriate. # Mixed Methods Mix Methods research involves the combination of qualitative and quantitative approaches in addressing a research problem. Generally, qualitative and quantitative methods have separate philosophical positions when it comes to how to uncover insights in addressing research questions. For many, mixed methods have their own philosophical position, which is pragmatism. Pragmatist believe that if it works it’s good. Therefore, if mixed methods leads to a solution it’s an appropriate method to use. This post will try to explain some of the mixed method designs. Before explaining it is important to understand that there are several common ways to approach mixed methods • Qualitative and Quantitative are equal (Convergent Parallel Design) • Quantitative is more important than qualitative (explanatory design) • Qualitative is more important than quantitative Convergent Parallel Design This design involves the simultaneous collecting of qualitative and quantitative data. The results are then compared to provide insights into the problem. The advantage of this design is the quantitative data provides for generalizability while the qualitative data provides information about the context of the study. However, the challenge is in trying to merge the two types of data. Qualitative and quantitative methods answer slightly different questions about a problem. As such it can be difficult to pain a picture of the results that are comprehensible. Explanatory Design This design puts emphasis on the quantitative data with qualitative data playing a secondary role. Normally, the results found in the quantitative data are followed up on in the qualitative part. For example, if you collect surveys about what students think about college and the results indicate negative opinions, you might conduct interview with students to understand way they are negative towards college. A likert survey will not explain why students are negative. Interviews will help to capture why students have a particular position. The advantage of this approach is the clear organization of the data. Quantitative data is more important. The drawback is deciding what about the quantitative data to explore when conducting the qualitative data collection. Exploratory Design This design is the opposite of explanatory. Now the qualitative data is more important than the quantitative. This design is used when you want to understand a phenomenon in order to measure it. It is common when developing an instrument to interview people in focus groups to understand the phenomenon. For example, if I want to understand what cellphone addiction is I might ask students to share what they think about this in interviews. From there, I could develop a survey instrument to measure cell phone addiction. The drawback to this approach is the time consumption. It takes a lot of work to conduct interviews, develop an instrument, and assess the instrument. Conclusions Mixed methods are not that new. However, they are still a somewhat unusual approach to research in many fields. Despite this, the approaches of mixed methods can be beneficial depending on the context. # Nearest Neighbor Classification in R Classification In this post, we will conduct a nearest neighbor classification using R. In a previous post, we discussed nearest neighbor classification. To summarize, nearest neighbor uses the traits of a known example to classify an unknown example. The classification is determine by the closets known example(s) to the unknown example. There are essentially four steps in order to complete a nearest neighbor classification 1. Find a dataset 2. Explore/prepare the Dataset 3. Train the model 4. Evaluate the model Find a data set For this example, we will use the college data set from the ISLR package. Our goal will be to predict which colleges are private or not private based on the feature “Private”. Below is the code for this. library(ISLR) data("College") Step 2 Exploring the Data We now need to explore and prep the data for analysis. Exploration helps us to find any problems in the data set. Below is the code, you can see the results in your own computer if you are following along. str(College) prop.table(table(College$Private))
summary(College)

The “str” function gave us an understanding of the different types of variables and some of there initial values. We have 18 variables in all. We want to predict “Private” which is a categorical feature Nearest neighbor predicts categorical features only. We will use all of the other numerical variables to predict “Private”. The prop.table give us the proportion of private and not private colleges. About 30% of the data is not private and about 70% is a private college

Lastly, the “summary” function gives us some descriptive stats. If you look closely at the descriptive stats, there is a problem. The variables use different scales. For example, the “Apps” feature goes from 81 t0 48094 while the “Grad.Rate” feature goes from 10 to 100. If we do the analysis with these different scales the “App” feature will be a much stronger influence on the prediction. Therefore, we need to rescale the features or normalize them. Below is the code to do this

 normal<-function(x){
return((x-min(x))/(max(x)))
}

We made a function called “normal” that normalizes or scales are variables so that all values are between 0-1. This makes all of the features equal in their influence. We now run code to normalize are data using the “normal” function we created. We will also look at the summary stats. In addition, we will leave out the prediction feature “Private” because we do not want to have that in the new data set because we want to predict this. In a real world example, you would not know this before the analysis anyway.

College_New<-as.data.frame(lapply(College[2:18],normal))
summary(College_New)

The name of the new dataset is “College_New” we used the “lapply” function to make R normalize all the features in the “College” dataset. Summary stats are good as all values are between 0 and 1. The last part of this step involves dividing our data into training and testing sets as seen in the code below and creating the labels. The labels are the actually information from the “Private” feature. Remember, we remove this from the “College_New” dataset but we need to put this information in their own features to allow us to check the accuracy of our results later.

College_train<-College_New[1:677, ]
College_Test<-College_New[678:777,]
#make labels
College_train_labels<-College[1:677, 1]
College_test_labels<-College[678:777,1]

Step 3 Model Training

Train the model We will now train the model. You will need to download the “class” package and load it. After that, you need to run the following code.

library(class)
College_test_pred<-knn(train=College_train, test=College_Test,
cl=College_train_labels, k=25)

Here is what we did

• We created a variable called “College_test_pred” to store are results
• We used the “knn” function to predict the examples. Inside the “knn” function we used “College_train” to teach the model.
• We then identify “College_test” as the dataset we want to test. For the ‘cl’ argument we used the “College_train_labels” dataset which contains which colleges are private in the “College_train” dataset. The “k” is the number of neighbors that knn uses to determine what class to label an unknown example. In this code, we are using the 25 nearest neighbors to label the unknown example The number of “k” to use can vary but a rule of thumb is to take the square root of the total number of examples in the training data. Our training data has 677 and the square root of this is about 25. What happens is that each of the 25 closest neighbors are calculated for an example. If 20 are not private and 5 are private the unknown example is labeled private because the majority of its neighbors are not private.

Step 4 Evaluating the Model

We will now evaluate the model using the following code

 library(gmodels)
CrossTable(x=College_test_labels, y=College_test_pred, prop.chisq = FALSE)

The results can be seen by clicking here.  The box in the top left  predicts which colleges are private correctly while the box in the bottom right classifies which colleges are not private correctly. For example, 28 colleges that are not private were classed as not private while 64 colleges that are not private were classed as not private. Adding these two numbers together gives us 92 (28+64) which is our accuracy rate. In the top right box we have are false positives. These are colleges which are private but  were predicted as private, there were 8 of these. Lastly, we have our false negatives in the bottom left box, which are colleges that  are private but label as not private, there were 0 of these.

Conclusion

The results are pretty good for this example. Nearest neighbor classification allows you to predict an unknown example based on the classification of the known neighbors. This is a simple way to identify examples that are clump among neighbors with the same characteristics.

# Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of the nearest example. This based on the assumption that if two examples are next to each other they must be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.

Characteristics

Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features leads to a two-dimensional feature space, three features leads to a three dimensional feature space, etc. In this feature space all the examples are placed based on their respective features.

The label of the unknown examples are determined by who the closet neighbor is or are. This calculation is based on Euclidean distance, which is the shortest distance possible. The number of neighbors that are used to calculate the distance varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use. The more neighbors used the more complicated the classification becomes.

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provide by the model to understand why there are certain relationships. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model. This deprives us of insights into the relationships in the data. Another concern is the headache of missing data.  This forces you to spend time cleaning the data more thoroughly.  One final issue is that the classification phase of a project is slow and cumbersome because of the messy nature of the data.

Conclusion

Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

# Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your way of approaching your problem, data collection and analysis that is acceptable to peers.

This level of freedom leads to people literally freezing and not completing a project. Now imagine have several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simply six step process to approaching data science. The process involves the following six steps

1. Acquire data
2. Explore the data
3. Process the data
4. Analyze the data
5. Communicate the results
6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to access data for further analysis. Not always, but often data scientist are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research in which you are often involved from the beginning to the end, in data science you jump to analyze a mess of data that others collected. This is challenging as it may not be clear what people what to know are what exactly the collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have, the level of data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats from structured to unstructured. It is critical to take a look at the data through using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This involves dealing with missing data, transforming features, addressing outliers, and other necessary processes for preparing analysis. The primary goal is to organize the data for analysis

This is a critical step as various machine learning models have different assumptions that must be met. Some models can handle missing data some cannot. Some models are affected by outliers some are not.

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis needs to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well-spoken explanations.

Steps 6 Apply the Results

This is the chance to actual use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications by people interested in applying for a home loan.

Conclusion

The steps in this process is just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

# Types of Machine Learning

Machine learning is a tool used in analytics for using data to make decision for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.

This post will provide information on basic types of machine learning which includes predictive models, supervised learning, descriptive models, and unsupervised learning.

Predictive Models and Supervised Learning

Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is mostly likely to buy a plane ticket or purchase a specific book.

Predictive models are not limited to the future. They can also be used to predict something that has already happen but we are not sure when. For example, data can be collect from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth .

Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it they need to learn and how to do it.

For example, if we want to predict who will be accept or rejected for a home loan we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or reject for the home loan. The supervisors in this example or the features (salary, gender, credit score) used to predict the target feature (home loan).

The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, type of car, etc. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive. At least in machine learning, someone can only be classified as male or female, current algorithms cannot place a person in both classes.

A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.

Descriptive Models and Unsupervised Learning

Descriptive models summarizes data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict there are no supervisors or specific features that are used to predict the target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to model as to what to do per say.

Descriptive models are very useful for discovering patterns. For example, one descriptive model analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they often would be beer for themselves and diapers for their small children. Stores used this information and they placed beer and diapers next to each in the stores. This led to an increase in profits as men could now find beer and diapers together. This kind of relationship can only be found through machine learning techniques.

Conclusion

The model you used depends on what you want to know. Prediction is for, as you can guess, predicting. With this model you are not as concern about relationships as you are about understanding what affects specifically the target feature. If you want to explore relationships then descriptive models can be of use. Machine learning models are tools that are appropriate for different situations.