Working with vectors in R

Advertisements

Leave a reply

Working with vectors in R

Advertisements

In regression, one of the assumptions is the additive assumption. This assumption states that the influence of a predictor variable on the dependent variable is independent of any other influence. However, in practice, it is common that this assumption does not hold.

In this post, we will look at how to address violations of the additive assumption through the use of interactions in a regression model.

An interaction effect is when you have two predictor variables whose effect on the dependent variable is not the same. As such, their effect must be considered simultaneously rather than separately. This is done through the use of an interaction term. An interaction term is the product of the two predictor variables.

Let’s begin by making a regular regression model with an interaction. To do this we will use the “Carseats” data from the “ISLR” package to predict “Sales”. Below is the code.

```
library(ISLR);library(ggplot2)
data(Carseats)
saleslm<-lm(Sales~.,Carseats)
summary(saleslm)
```

```
##
## Call:
## lm(formula = Sales ~ ., data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8692 -0.6908 0.0211 0.6636 3.4115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6606231 0.6034487 9.380 < 2e-16 ***
## CompPrice 0.0928153 0.0041477 22.378 < 2e-16 ***
## Income 0.0158028 0.0018451 8.565 2.58e-16 ***
## Advertising 0.1230951 0.0111237 11.066 < 2e-16 ***
## Population 0.0002079 0.0003705 0.561 0.575
## Price -0.0953579 0.0026711 -35.700 < 2e-16 ***
## ShelveLocGood 4.8501827 0.1531100 31.678 < 2e-16 ***
## ShelveLocMedium 1.9567148 0.1261056 15.516 < 2e-16 ***
## Age -0.0460452 0.0031817 -14.472 < 2e-16 ***
## Education -0.0211018 0.0197205 -1.070 0.285
## UrbanYes 0.1228864 0.1129761 1.088 0.277
## USYes -0.1840928 0.1498423 -1.229 0.220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698
## F-statistic: 243.4 on 11 and 388 DF, p-value: < 2.2e-16
```

The results are rather excellent for the social sciences. The model explains 87.3% of the variance in “Sales”. The current results that we have are known as main effects. These are effects that directly influence the dependent variable. Most regression models only include main effects.

We will now examine an interaction effect between two continuous variables. Let’s see if there is an interaction between “Population” and “Income”.

```
saleslm1<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+Population*Income, Carseats)
summary(saleslm1)
```

```
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + Population *
## Income, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8699 -0.7624 0.0139 0.6763 3.4344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.195e+00 6.436e-01 9.625 <2e-16 ***
## CompPrice 9.262e-02 4.126e-03 22.449 <2e-16 ***
## Income 7.973e-03 3.869e-03 2.061 0.0400 *
## Advertising 1.237e-01 1.107e-02 11.181 <2e-16 ***
## Population -1.811e-03 9.524e-04 -1.901 0.0580 .
## Price -9.511e-02 2.659e-03 -35.773 <2e-16 ***
## Age -4.566e-02 3.169e-03 -14.409 <2e-16 ***
## Education -2.157e-02 1.961e-02 -1.100 0.2722
## USYes -2.160e-01 1.497e-01 -1.443 0.1498
## UrbanYes 1.330e-01 1.124e-01 1.183 0.2375
## ShelveLocGood 4.859e+00 1.523e-01 31.901 <2e-16 ***
## ShelveLocMedium 1.964e+00 1.255e-01 15.654 <2e-16 ***
## Income:Population 2.879e-05 1.253e-05 2.298 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 387 degrees of freedom
## Multiple R-squared: 0.8751, Adjusted R-squared: 0.8712
## F-statistic: 226 on 12 and 387 DF, p-value: < 2.2e-16
```

The new contribution is at the bottom of the coefficient table and is the “Income:Population” coefficient. What this means is “the increase of Sales given a one unit increase in Income and Population simultaneously” In other words the “Income:Population” coefficient looks at their combined simultaneous effect on Sales rather than just their independent effect on Sales.

This makes practical sense as well. The larger the population the more available income and vice versa. However, for our current model, the improvement in the r-squared is relatively small. The actual effect is a small increase in sales. Below is a graph of income and population by sales. Notice how the lines cross. This is a visual of what an interaction looks like. The lines are not parallel by any means.

```
ggplot(data=Carseats, aes(x=Income, y=Sales, group=1)) +geom_smooth(method=lm,se=F)+
geom_smooth(aes(Population,Sales), method=lm, se=F,color="black")+xlab("Income and Population")+labs(
title="Income in Blue Population in Black")
```

We will now repeat this process but this time using a categorical variable and a continuous variable. We will look at the interaction between “US” location (categorical) and “Advertising” (continuous).

```
saleslm2<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+US*Advertising, Carseats)
summary(saleslm2)
```

```
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + US * Advertising,
## data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8531 -0.7140 0.0266 0.6735 3.3773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6995305 0.6023074 9.463 < 2e-16 ***
## CompPrice 0.0926214 0.0041384 22.381 < 2e-16 ***
## Income 0.0159111 0.0018414 8.641 < 2e-16 ***
## Advertising 0.2130932 0.0530297 4.018 7.04e-05 ***
## Population 0.0001540 0.0003708 0.415 0.6782
## Price -0.0954623 0.0026649 -35.823 < 2e-16 ***
## Age -0.0463674 0.0031789 -14.586 < 2e-16 ***
## Education -0.0233500 0.0197122 -1.185 0.2369
## USYes -0.1057320 0.1561265 -0.677 0.4987
## UrbanYes 0.1191653 0.1127047 1.057 0.2910
## ShelveLocGood 4.8726025 0.1532599 31.793 < 2e-16 ***
## ShelveLocMedium 1.9665296 0.1259070 15.619 < 2e-16 ***
## Advertising:USYes -0.0933384 0.0537807 -1.736 0.0834 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.016 on 387 degrees of freedom
## Multiple R-squared: 0.8744, Adjusted R-squared: 0.8705
## F-statistic: 224.5 on 12 and 387 DF, p-value: < 2.2e-16
```

Again, you can see that when the store is in the US you have to also consider the advertising budget as well. When these two variables are considered there is a slight decline in sales. What this means in practice is that advertising in the US is not as beneficial as advertising outside the US.

Below you can again see a visual of the interaction effect when the lines for US yes and no cross each other in the plot below.

```
ggplot(data=Carseats, aes(x=Advertising, y=Sales, group = US, colour = US)) +
geom_smooth(method=lm,se=F)+scale_x_continuous(limits = c(0, 25))+scale_y_continuous(limits = c(0, 25))
```

Lastly, we will look at an interaction effect for two categorical variables.

```
saleslm3<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
Urban+ShelveLoc+ShelveLoc*US, Carseats)
summary(saleslm3)
```

```
##
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population +
## Price + Age + Education + US + Urban + ShelveLoc + ShelveLoc *
## US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8271 -0.6839 0.0213 0.6407 3.4537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8120748 0.6089695 9.544 <2e-16 ***
## CompPrice 0.0929370 0.0041283 22.512 <2e-16 ***
## Income 0.0158793 0.0018378 8.640 <2e-16 ***
## Advertising 0.1223281 0.0111143 11.006 <2e-16 ***
## Population 0.0001899 0.0003721 0.510 0.6100
## Price -0.0952439 0.0026585 -35.826 <2e-16 ***
## Age -0.0459380 0.0031830 -14.433 <2e-16 ***
## Education -0.0267021 0.0197807 -1.350 0.1778
## USYes -0.3683074 0.2379400 -1.548 0.1225
## UrbanYes 0.1438775 0.1128171 1.275 0.2030
## ShelveLocGood 4.3491643 0.2734344 15.906 <2e-16 ***
## ShelveLocMedium 1.8967193 0.2084496 9.099 <2e-16 ***
## USYes:ShelveLocGood 0.7184116 0.3320759 2.163 0.0311 *
## USYes:ShelveLocMedium 0.0907743 0.2631490 0.345 0.7303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 386 degrees of freedom
## Multiple R-squared: 0.8753, Adjusted R-squared: 0.8711
## F-statistic: 208.4 on 13 and 386 DF, p-value: < 2.2e-16
```

In this case, we can see that when the store is in the US and the shelf location is good it has an effect on Sales when compared to a bad location. The plot below is a visual of this. However, it is harder to see this because the x-axis has only two categories

```
ggplot(data=Carseats, aes(x=US, y=Sales, group = ShelveLoc, colour = ShelveLoc)) +
geom_smooth(method=lm,se=F)
```

**Conclusion**

Interactions effects are a great way to fine-tune a model, especially for explanatory purposes. Often, the change in r-square is not strong enough for prediction but can be used for nuanced understanding of the relationships among the variables.

Vectors and functions in R

Understanding the use of vectors in Rstudio

Basics of Rstudio

In this post, we are going to look at Bayesian regression. In particular, we will compare the results of ordinary least squares regression with Bayesian regression.

**Bayesian Statistics**

Bayesian statistics involves the use of probabilities rather than frequencies when addressing uncertainty. This allows you to determine the distribution of the model parameters and not only the values. This is done through averaging over the model parameters through marginalizing the joint probability distribution.

**Linear Regression**

We will now develop our two models. The first model will be a normal regression and the second a Bayesian model. We will be looking at factors that affect the tax rate of homes in the “Hedonic” dataset in the “Ecdat” package. We will load our packages and partition our data. Below is some initial code

`library(ISLR);library(caret);library(arm);library(Ecdat);library(gridExtra)`

```
data("Hedonic")
inTrain<-createDataPartition(y=Hedonic$tax,p=0.7, list=FALSE)
trainingset <- Hedonic[inTrain, ]
testingset <- Hedonic[-inTrain, ]
str(Hedonic)
```

```
## 'data.frame': 506 obs. of 15 variables:
## $ mv : num 10.09 9.98 10.45 10.42 10.5 ...
## $ crim : num 0.00632 0.02731 0.0273 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 ...
## $ chas : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 28.9 22 22 21 21 ...
## $ rm : num 43.2 41.2 51.6 49 51.1 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 ...
## $ dis : num 1.41 1.6 1.6 1.8 1.8 ...
## $ rad : num 0 0.693 0.693 1.099 1.099 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 ...
## $ blacks : num 0.397 0.397 0.393 0.395 0.397 ...
## $ lstat : num -3 -2.39 -3.21 -3.53 -2.93 ...
## $ townid : int 1 2 2 3 3 3 4 4 4 4 ...
```

We will now create our regression model

```
ols.reg<-lm(tax~.,trainingset)
summary(ols.reg)
```

```
##
## Call:
## lm(formula = tax ~ ., data = trainingset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -180.898 -35.276 2.731 33.574 200.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 305.1928 192.3024 1.587 0.11343
## mv -41.8746 18.8490 -2.222 0.02697 *
## crim 0.3068 0.6068 0.506 0.61339
## zn 1.3278 0.2006 6.618 1.42e-10 ***
## indus 7.0685 0.8786 8.045 1.44e-14 ***
## chasyes -17.0506 15.1883 -1.123 0.26239
## nox 0.7005 0.4797 1.460 0.14518
## rm -0.1840 0.5875 -0.313 0.75431
## age 0.3054 0.2265 1.349 0.17831
## dis -7.4484 14.4654 -0.515 0.60695
## rad 98.9580 6.0964 16.232 < 2e-16 ***
## ptratio 6.8961 2.1657 3.184 0.00158 **
## blacks -29.6389 45.0043 -0.659 0.51061
## lstat -18.6637 12.4674 -1.497 0.13532
## townid 1.1142 0.1649 6.758 6.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.72 on 341 degrees of freedom
## Multiple R-squared: 0.8653, Adjusted R-squared: 0.8597
## F-statistic: 156.4 on 14 and 341 DF, p-value: < 2.2e-16
```

The model does a reasonable job. Next, we will do our prediction and compare the results with the test set using correlation, summary statistics, and the mean absolute error. In the code below, we use the “predict.lm” function and include the arguments “interval” for the prediction as well as “se.fit” for the standard error

`ols.regTest<-predict.lm(ols.reg,testingset,interval = 'prediction',se.fit = T)`

Below is the code for the correlation, summary stats, and mean absolute error. For MAE, we need to create a function.

`cor(testingset$tax,ols.regTest$fit[,1])`

`## [1] 0.9313795`

`summary(ols.regTest$fit[,1])`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144.7 288.3 347.6 399.4 518.4 684.1
```

`summary(trainingset$tax)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 188.0 279.0 330.0 410.4 666.0 711.0
```

```
MAE<-function(actual, predicted){
mean(abs(actual-predicted))
}
MAE(ols.regTest$fit[,1], testingset$tax)
```

`## [1] 41.07212`

The correlation is excellent. The summary stats are similar and the error is not unreasonable. Below is a plot of the actual and predicted values

We now need to combine some data into one dataframe. In particular, we need the following actual dependent variable results predicted dependent variable results The upper confidence value of the prediction THe lower confidence value of the prediction

The code is below

```
yout.ols <- as.data.frame(cbind(testingset$tax,ols.regTest$fit))
ols.upr <- yout.ols$upr
ols.lwr <- yout.ols$lwr
```

We can now plot this

```
p.ols <- ggplot(data = yout.ols, aes(x = testingset$tax, y = ols.regTest$fit[,1])) + geom_point() + ggtitle("Ordinary Regression") + labs(x = "Actual", y = "Predicted")
p.ols + geom_errorbar(ymin = ols.lwr, ymax = ols.upr)
```

You can see the strong linear relationship. However, the confidence intervals are rather wide. Let’s see how Bayes does.

**Bayes Regression**

Bayes regression uses the “bayesglm” function from the “arm” package. We need to set the family to “gaussian” and the link to “identity”. In addition, we have to set the “prior.df” (prior degrees of freedom) to infinity as this indicates we want a normal distribution

`bayes.reg<-bayesglm(tax~.,family=gaussian(link=identity),trainingset,prior.df = Inf)`

```
bayes.regTest<-predict.glm(bayes.reg,newdata = testingset,se.fit = T)
cor(testingset$tax,bayes.regTest$fit)
```

`## [1] 0.9313793`

`summary(bayes.regTest$fit)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144.7 288.3 347.5 399.4 518.4 684.1
```

`summary(trainingset$tax)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 188.0 279.0 330.0 410.4 666.0 711.0
```

`MAE(bayes.regTest$fit, testingset$tax)`

`## [1] 41.07352`

The numbers are essentially the same. This leads to the question of what is the benefit of Bayesian regression? The answer is in the confidence intervals. Next, we will calculate the confidence intervals for the Bayesian model.

```
yout.bayes <- as.data.frame(cbind(testingset$tax,bayes.regTest$fit))
names(yout.bayes) <- c("tax", "fit")
critval <- 1.96 #approx for 95% CI
bayes.upr <- bayes.regTest$fit + critval * bayes.regTest$se.fit
bayes.lwr <- bayes.regTest$fit - critval * bayes.regTest$se.fit
```

We now create our Bayesian regression plot.

`p.bayes <- ggplot(data = yout.bayes, aes(x = yout.bayes$tax, y = yout.bayes$fit)) + geom_point() + ggtitle("Bayesian Regression Prediction") + labs(x = "Actual", y = "Predicted")`

Lastly, we display both plots as a comparison.

```
ols.plot <- p.ols + geom_errorbar(ymin = ols.lwr, ymax = ols.upr)
bayes.plot <- p.bayes + geom_errorbar(ymin = bayes.lwr, ymax = bayes.upr)
grid.arrange(ols.plot,bayes.plot,ncol=2)
```

As you can see, the Bayesian approach gives much more compact confidence intervals. This is because the Bayesian approach a distribution of parameters is calculated from a posterior distribution. These values are then averaged to get the final prediction that appears on the plot. This reduces the variance and strengthens the confidence we can have in each individual example.

Machine learning is used for a variety of task today with a multitude of algorithms that can each do one or more of these tasks well. In this post, we will look at some of the most common tasks that machine learning algorithms perform. In particular, we will look at the following task.

- Regression
- Classification
- Forecasting
- Clustering
- Association rules
- Dimension reduction

Numbers 1-3 are examples of supervised learning, which is learning that involves a dependent variable. Numbers 4-6 are unsupervised which is learning that does not involve a clearly labeled dependent variable.

**Regression**

Regression involves understanding the relationship between a continuous dependent variable and categorical and continuous independent variables. Understanding this relationship allows for numeric prediction of the dependent continuous variable.

Example algorithms for regression include linear regression, numeric prediction random forest as well as support vector machines and artificial neural networks.

**Classification**

Classification involves the use of a categorical dependent variable with continuous and or categorical independent variables. The purpose is to classify examples into the groups in the dependent variable.

Examples of this are logisitic regression as well as all the algorithms mentioned in regression. Many algorithms can do both regression and classification.

**Forecasting**

Forecasting is similar to regression. However, the difference is that the data is a time series. The goal remains the same of predicting future outcomes based on current available data. As such, a slightly different approach is needed because of the type of data involved.

Common algorithms for forecasting is ARIMA even artificial neural networks.

**Clustering**

Clustering involves grouping together items that are similar in a dataset. This is done by detecting patterns in the data. The problem is that the number of clusters needed is usually no known in advanced which leads to a trial and error approach if there is no other theoretical support.

Common clustering algorithms include k-means and hierarchical clustering. Latent Dirichlet allocation is used often in text mining applications.

**Association Rules**

Associations rules find items that occur together in a dataset. A common application of association rules is market basket analysis.

Common algorithms include Apriori and frequent pattern matching algorithm.

**Dimension Reduction**

Dimension reduction involves combining several redundant features into one or more components that capture the majority of the variance. Reducing the number of features can increase the speed of the computation as well as reduce the risk of overfitting.

In machine learning, principal component analysis is often used for dimension reduction. However, factor analysis is sometimes used as well.

**Conclusion**

In machine learning, there is always an appropriate tool for the job. This post provided insight into the main task of machine learning as well as the algorithm for the situation.

Data preparation aka data munging is what most data scientist spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr”” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base r for our analysis. Below is some initial code.

`library(dplyr)`

```
data("attitude")
str(attitude)
```

```
## 'data.frame': 30 obs. of 7 variables:
## $ rating : num 43 63 71 61 81 43 58 71 72 67 ...
## $ complaints: num 51 64 70 63 78 55 67 75 82 61 ...
## $ privileges: num 30 51 68 45 56 49 42 50 72 45 ...
## $ learning : num 39 54 69 47 66 44 56 55 67 47 ...
## $ raises : num 61 63 76 54 71 54 66 70 71 62 ...
## $ critical : num 92 73 86 84 83 49 68 66 83 80 ...
## $ advance : num 45 47 48 35 47 34 35 41 31 41 ...
```

You can see we have seven variables and only 30 observations. Our first function that we will learn to use is the “select” function. This function allows you to select columns of data you want to use. In order to use this feature, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select”” function.

`names(attitude)[1:3]`

`## [1] "rating" "complaints" "privileges"`

```
smallset<-select(attitude,rating:privileges)
head(smallset)
```

```
## rating complaints privileges
## 1 43 51 30
## 2 63 64 51
## 3 71 70 68
## 4 61 63 45
## 5 81 78 56
## 6 43 55 49
```

The difference is probably obvious. Using the “select” function we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative in front of the names of the columns. Below is the code

`head(select(attitude,-(rating:privileges)))`

```
## learning raises critical advance
## 1 39 61 92 45
## 2 54 63 73 47
## 3 69 76 86 48
## 4 47 54 84 35
## 5 66 71 83 47
## 6 44 54 49 34
```

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of the “rating” to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right

```
attitude<-rename(attitude,rates=rating)
head(attitude)
```

```
## rates complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
```

The “select”” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

```
s_set<-head(select(attitude,ends_with("s")))
s_set
```

```
## rates complaints privileges raises
## 1 43 51 30 61
## 2 63 64 51 63
## 3 71 70 68 76
## 4 61 63 45 54
## 5 81 78 56 71
## 6 43 55 49 54
```

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a 75 or higher in the “raises” variable.

```
bigraise<-filter(attitude,raises>75)
bigraise
```

```
## rates complaints privileges learning raises critical advance
## 1 71 70 68 69 76 86 48
## 2 77 77 54 72 79 77 46
## 3 74 85 64 69 79 79 63
## 4 66 77 66 63 88 76 72
## 5 78 75 58 74 80 78 49
## 6 85 85 71 71 77 74 55
```

If you look closely all values in the “raise” column are greater than 75. Of course, you can have more than one criteria. IN the code below there are two.

`filter(attitude, raises>70 & learning<67)`

```
## rates complaints privileges learning raises critical advance
## 1 81 78 56 66 71 83 47
## 2 65 70 46 57 75 85 46
## 3 66 77 66 63 88 76 72
```

The “arrange” function allows you to sort the order of the rows. In the code below we first sort the data ascending by the “critical” variable. Then we sort it descendingly by adding the “desc” function.

```
ascCritical<-arrange(attitude, critical)
head(ascCritical)
```

```
## rates complaints privileges learning raises critical advance
## 1 43 55 49 44 54 49 34
## 2 81 90 50 72 60 54 36
## 3 40 37 42 58 50 57 49
## 4 69 62 57 42 55 63 25
## 5 50 40 33 34 43 64 33
## 6 71 75 50 55 70 66 41
```

```
descCritical<-arrange(attitude, desc(critical))
head(descCritical)
```

```
## rates complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 71 70 68 69 76 86 48
## 3 65 70 46 57 75 85 46
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 72 82 72 67 71 83 31
```

The “mutate” function is useful for engineering features. In the code below we will transform the “learning” variable by subtracting its mean from its self

```
attitude<-mutate(attitude,learningtrend=learning-mean(learning))
head(attitude)
```

```
## rates complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
## learningtrend
## 1 -17.366667
## 2 -2.366667
## 3 12.633333
## 4 -9.366667
## 5 9.633333
## 6 -12.366667
```

You can also create logical variables with the “mutate” function.In the code below, we create a logical variable that is true when the “critical” variable” is higher than 80 and false when “critical”” is less than 80. The new variable is called “highCritical”

```
attitude<-mutate(attitude,highCritical=critical>=80)
head(attitude)
```

```
## rates complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
## learningtrend highCritical
## 1 -17.366667 TRUE
## 2 -2.366667 FALSE
## 3 12.633333 TRUE
## 4 -9.366667 TRUE
## 5 9.633333 TRUE
## 6 -12.366667 FALSE
```

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in R. This function works in combination with the “summarize” function for our purposes here. We will group our data by the “highCritical” variable. This means our data will be viewed as either TRUE for “highCritical” or FALSE. The results of this function will be saved in an object called “hcgroups”

```
hcgroups<-group_by(attitude,highCritical)
head(hcgroups)
```

```
## # A tibble: 6 x 9
## # Groups: highCritical [2]
## rates complaints privileges learning raises critical advance
##
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
## # ... with 2 more variables: learningtrend , highCritical
```

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics by using the “summarize” function. We specifically want to know the mean of the “complaint” variable based on the variable “highCritical.” Below is the code

`summarize(hcgroups,complaintsAve=mean(complaints))`

```
## # A tibble: 2 x 2
## highCritical complaintsAve
##
## 1 FALSE 67.31579
## 2 TRUE 65.36364
```

Of course, you could have learned this through doing a t.test but this is another approach.

**Conclusion**

The “dplyr” package is one powerful tool for wrestling with data. There is nothing new in this package. Instead, the coding is simpler than what you can excute using base r.

In this post, we will look at analyzing tweets from Twitter using R. Before beginning, if you plan to replicate this on your own, you will need to set up a developer account with Twitter. Below are the steps

**Twitter Setup**

- Go to https://dev.twitter.com/apps
- Create a twitter account if you do not already have one
- Next, you want to click “create new app”
- After entering the requested information be sure to keep the following information for R; consumer key, consumer secret, request token URL, authorize URL, access token URL

The instruction here are primarily for users of Linux. If you are using a windows machine you need to download a cecert.pem file below is the code

download.file(url=‘http://curl.haxx.se/ca/cacert.pem’,destfile=‘/YOUR_LOCATION/cacert.pem’)

You need to save this file where it is appropriate. Below we will begin the analysis by loading the appropriate libraries.

**R Setup**

`library(twitteR);library(ROAuth);library(RCurl);library(tm);library(wordcloud)`

Next, we need to use all of the application information we generate when we created the developer account at twitter. We will save the information in objects to use in R. In the example code below “XXXX” is used where you should provide your own information. Sharing this kind of information would allow others to use my twitter developer account. Below is the code

```
my.key<-"XXXX" #consumer key
my.secret<-"XXXX" #consumer secret
my.accesstoken<-'XXXX' #access token
my.accesssecret<-'XXXX' ##access secret
```

Some of the information we just stored now needs to be passed to the “OAuthFactory” function of the “ROAuth” package. We will be passing the “my.key” and “my.secret”. We also need to add the request URL, access URL, and auth URL. Below is the code for all this.

```
cred<-OAuthFactory$new(consumerKey=my.key,consumerSecret=my.secret,requestURL='https://api.twitter/oauth/request_token',
accessURL='https://api.twitter/oauth/access_token',authURL='https://api.twitter/oauth/authorize')
```

If you are a windows user you need to code below for the cacert.pem. You need to use the “cred$handshake(cainfo=”LOCATION OF CACERT.PEM” to complete the setup process. make sure to save your authentication and then use the “registerTwitterOAuth(cred)” to finish this. For Linux users, the code is below.

`setup_twitter_oauth(my.key, my.secret, my.accesstoken, my.accesssecret)`

**Data Preparation**

We can now begin the analysis. We are going to search twitter for the term “Data Science.” We will look for 1,500 of the most recent tweets that contain this term. To do this we will use the “searchTwitter” function. The code is as follows

`ds_tweets<-searchTwitter("data science",n=1500)`

We know need to some things that are a little more complicated. First, we need to convert our “ds_tweets” object to a dataframe. This is just to save our search so we don’t have to research again. The code below performs this.

`ds_tweets.df<-do.call(rbind,lapply(ds_tweets,as.data.frame))`

Second, we need to find all the text in our “ds_tweets” object and convert this into a list. We will use the “sapply” function along with a “getText” Below is the code

`ds_tweets.list<-sapply(ds_tweets,function(x) x$getText())`

Third, we need to turn our “ds_tweets.list” into a corpus.

`ds_tweets.corpus<-Corpus(VectorSource(ds_tweets.list)) `

Now we need to do a lot of cleaning of the text. In particular, we need to *make all words lower case *remove punctuation *Get rid of funny characters (i.e. #,/, etc) *remove stopwords (words that lack meaning such as “the”)

To do this we need to use a combination of functions in the “tm” package as well as some personally made functions

```
ds_tweets.corpus<-tm_map(ds_tweets.corpus,removePunctuation)
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)#remove garbage terms
ds_tweets.corpus<-tm_map(ds_tweets.corpus,removeSpecialChars) #application of custom function
ds_tweets.corpus<-tm_map(ds_tweets.corpus,function(x) removeWords(x,stopwords())) #removes stop words
ds_tweets.corpus<-tm_map(ds_tweets.corpus,tolower)
```

**Data Analysis**

We can make a word cloud for fun now

```
wordcloud(ds_tweets.corpus)
```

We now need to convert our corpus to a matrix for further analysis. In addition, we need to remove sparse terms as this reduces the size of the matrix without losing much information. The value to set it to is at the discretion of the researcher. Below is the code

```
ds_tweets.tdm<-TermDocumentMatrix(ds_tweets.corpus)
ds_tweets.tdm<-removeSparseTerms(ds_tweets.tdm,sparse = .8)#remove sparse terms
```

We’ve looked at how to find the most frequent terms in another post. Below is the code for the 15 most common words

`findFreqTerms(ds_tweets.tdm,15)`

```
## [1] "datasto" "demonstrates" "download" "executed"
## [5] "hard" "key" "leaka" "locally"
## [9] "memory" "mitchellvii" "now" "portable"
## [13] "science" "similarly" "data"
```

Below are words that are highly correlated with the term “key”.

`findAssocs(ds_tweets.tdm,'key',.95)`

```
## $key
## demonstrates download executed leaka locally
## 0.99 0.99 0.99 0.99 0.99
## memory datasto hard mitchellvii portable
## 0.99 0.98 0.98 0.98 0.98
## similarly
## 0.98
```

For the final trick, we will make a hierarchical agglomerative cluster. This will clump words that are more similar next to each other. We first need to convert our current “ds_tweets.tdm” into a regular matrix. Then we need to scale it because the distances need to be standardized. Below is the code.

```
ds_tweets.mat<-as.matrix(ds_tweets.tdm)
ds_tweets.mat.scale<-scale(ds_tweets.mat)
```

Now, we need to calculate the distance statistically

`ds_tweets.dist<-dist(ds_tweets.mat.scale,method = 'euclidean')`

At last, we can make the clusters,

`ds_tweets.fit<-hclust(ds_tweets.dist,method = 'ward')`

`plot(ds_tweets.fit)`

Looking at the chart, it appears we have six main clusters we can highlight them using the code below

```
plot(ds_tweets.fit)
groups<-cutree(ds_tweets.fit,k=6)
rect.hclust(ds_tweets.fit,k=6)
```

**Conclusion**

This post provided an example of how to pull data from twitter for text analysis. There are many steps but also some useful insights can be gained from this sort of research.

In this post, we will learn how to conduct a diversity and lexical dispersion analysis in R. Diversity analysis is a measure of the breadth of an author’s vocabulary in a text. Are provides several calculations of this in their output

Lexical dispersion is used for knowing where or when certain words are used in a text. This is useful for identifying patterns if this is a goal of your data exploration.

We will conduct our two analysis by comparing two famous philosophical texts

- Analects
- The Prince

These books are available at the Gutenberg Project. You can go to the site type in the titles and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

library(qdap)

**Data Preparation**

Below are the steps we need to take to prepare the data

- Paste the text files into R
- Convert the text files to ASCII format
- Convert the ASCII format to data frames
- Split the sentences in the data frame
- Add a variable that indicates the book name
- Combine the two books into one dataframe

We now need to prepare the three text. First, we move them into R using the “paste” function.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ") prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")

We must convert the text files to ASCII format see that R is able to interpret them.

analects<-iconv(analects,"latin1","ASCII","") prince<-iconv(prince,"latin1","ASCII","")

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code

data frame

analects<-data.frame(texts=analects) prince<-data.frame(texts=prince)

With the dataframes completed. We can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

analects<-sentSplit(analects,'texts') prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be valuable for comparative purposes.

analects$book<-"analects" prince$book<-"prince"

Now we combine the two books into one dataframe. The data preparation is now complete.

twobooks<-rbind(analects,prince)

**Data Analysis**

We will begin with the diversity analysis

div<-diversity(twobooks$texts,twobooks$book) div book wc simpson shannon collision berger_parker brillouin 1 analects 30995 0.989 6.106 4.480 0.067 5.944 2 prince 52105 0.989 6.324 4.531 0.059 6.177

For most of the metrics, the diversity in the use of vocabulary is the same despite being different books from different eras in history. How these numbers are calculated is beyond the scope of this post.

Next, we will calculate the lexical dispersion of the two books. Will look at three common themes in history money, war, and marriage.

dispersion_plot(twobooks$texts,grouping.var=twobooks$book,c("money","war",'marriage'))

The tick marks show when each word appears. For example, money appears at the beginning of Analects only but is more spread out in tThe PRince. War is evenly dispersed in both books and marriage only appears in The Prince

**Conclusion**

This analysis showed additional tools that can be used to analyze text in R.

In this post, we will look at how to assess the readability and formality of a text using R. By readability, we mean the use of a formula that will provide us with the grade level at which the text is roughly written. This is highly useful information in the field of education and even medicine.

Formality provides insights into how the text relates to the reader. The more formal the writing the greater the distance between author and reader. Formal words are nouns, adjectives, prepositions, and articles while informal (contextual) words are pronouns, verbs, adverbs, and interjections.

The F-measure counts and calculates a score of formality based on the proportions of the formal and informal words.

We will conduct our two analysis by comparing two famous philosophical texts

- Analects
- The Prince

These books are available at the Gutenberg Project. You can go to the site type in the titles and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

`library(qdap)`

**Data Preparation**

Below are the steps we need to take to prepare the data

- Paste the text files into R
- Convert the text files to ASCII format
- Convert the ASCII format to data frames
- Split the sentences in the data frame
- Add a variable that indicates the book name
- Combine the two books into one dataframe

We now need to prepare the two text. The “paste” function will move the text into the R environment.

```
analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")
```

The text need to be converted to the ASCII format and the code below does this.

```
analects<-iconv(analects,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")
```

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code data frame

```
analects<-data.frame(texts=analects)
prince<-data.frame(texts=prince)
```

With the dataframes completed. We can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

`analects<-sentSplit(analects,'texts')`

`prince<-sentSplit(prince,'texts')`

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be useful for comparative purposes.

```
analects$book<-"analects"
prince$book<-"prince"
```

Lastly, we combine the two books into one dataframe. The data preparation is now complete.

`twobooks<-rbind(analects,prince)`

**Data Analysis**

We will begin with the readability. The “automated_readbility_index” function will calculate this for us.

```
ari<-automated_readability_index(twobooks$texts,twobooks$book)
ari
```

```
## book word.count sentence.count character.count Automated_Readability_Index
## 1 analects 30995 3425 132981 3.303
## 2 prince 52105 1542 236605 16.853
```

Analects is written on a third-grade level but The Prince is written at grade 16. This is equivalent to a Senior in college. As such, The Prince is a challenging book to read.

Next we will calcualte the formality of the two books. The “formality” function is used for this.

```
form<-formality(twobooks$texts,twobooks$book)
form
```

```
## book word.count formality
## 1 prince 52181 60.02
## 2 analects 31056 58.36
```

The books are mildly formal. The code below gives you the break down of the word use by percentage.

`form$form.prop.by`

```
## book word.count noun adj prep articles pronoun verb adverb
## 1 analects 31056 25.05 8.63 14.23 8.49 10.84 22.92 5.86
## 2 prince 52181 21.51 9.89 18.42 7.59 10.69 20.74 5.94
## interj other
## 1 0.05 3.93
## 2 0.00 5.24
```

The proportions are consistent when the two books are compared. Below is a visual of the table we just examined.

`plot(form)`

**Conclusion**

Readability and formality are additional text mining tools that can provide insights for Data Scientist. Both of these analysis tools can provide suggestions that may be needed in order to enhance communication or compare different authors and writing styles.

In this post, we will perform a sentiment analysis in R. Sentiment analysis involves employs the use of dictionaries to give each word in a sentence a score. A more positive word is given a higher positive number while a more negative word is given a more negative number. The score is then calculated based on the position of the word, the weight, as well as other more complex factors. This is then performed for the entire corpus to give it a score.

We will do a sentiment analysis in which we will compare three famous philosophical texts

- Analects
- The Prince
- Penesees

These books are available at the Gutenberg Project. You can go to the site type in the titles and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

`library(qdap)`

**Data Preparation**

Below are the steps we need to take to prepare the data

- Paste the text files into R
- Convert the text files to ASCII format
- Convert the ASCII format to data frames
- Split the sentences in the data frame
- Add a variable that indicates the book name
- Combine the three books into one dataframe

We now need to prepare the three text. First, we move them into R using the “paste” function.

```
analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
pensees<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Pascal.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")
```

We need to convert the text files to ASCII format see that R is able to read them.

```
analects<-iconv(analects,"latin1","ASCII","")
pensees<-iconv(pensees,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")
```

Now we make our dataframe for each book. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code data frame

```
analects<-data.frame(texts=analects)
pensees<-data.frame(texts=pensees)
prince<-data.frame(texts=prince)
```

With the dataframes completed. We can now split the variable “texts” in each dataframe by sentence. We will use the “sentSplit” function to do this.

`analects<-sentSplit(analects,'texts')`

`pensees<-sentSplit(pensees,'texts')`

`prince<-sentSplit(prince,'texts')`

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be valuable for comparative purposes.

```
analects$book<-"analects"
pensees$book<-"pensees"
prince$book<-"prince"
```

Now we combine all three books into one dataframe. The data preparation is now complete.

`threebooks<-rbind(analects,pensees,prince)`

**Data Analysis **

We are now ready to perform the actual sentiment analysis. We will use the “polarity” function for this. Inside the function, we need to use the text and the book variables. Below is the code. polarity analysis

`pol<-polarity(threebooks$texts,threebooks$book)`

We can see the results and a plot in the code below.

`pol`

```
## book total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 analects 3425 31383 0.076 0.254 0.299
## 2 pensees 7617 101043 0.008 0.278 0.028
## 3 prince 1542 52281 0.017 0.296 0.056
```

The table is mostly self-explanatory. We have the total number of sentences and words in the first two columns. Next is the average polarity and the standard deviation. Lastly, we have the standardized mean. The last column is commonly used for comparison purposes. As such, it appears that Analects is the most positive book by a large margin with Pensees and Prince be about the same and generally neutral.

`plot(pol)`

The top plot shows the polarity of each sentence over time or through the book. The bluer the more negative and the redder the more positive the sentence. The second plot shows the dispersion of the polarity.

There are many things to interpret from the second plot. For example, Pensees is more dispersed than the other two books in terms of polarity. The Prince is much less dispersed in comparison to the other books.

Another interesting task is to find the most negative and positive sentence. We need to take information from the “pol” dataframe and then use the “which.min” function to find the lowest scoring. The “which.min” function only gives the row. Therefore, we need to take this information and use it to find the actual sentence and the book. Below is the code.

```
pol.df<-pol$all #take polarity scores from pol.df
which.min(pol.df$polarity) #find the lowest scored sentence
```

`## [1] 6343`

`pol.df$text.var[6343] #find the actual sentence`

`## [1] "Apart from Him there is but vice, misery, darkness, death, despair."`

`pol.df$book[6343] #find the actual book name`

`## [1] "pensees"`

Pensees had the most negative sentence. You can see for yourself the clearly negative words which are vice, misery, darkness, death, and despair. We can repeat this for the most positive sentence

`which.max(pol.df$polarity)`

`## [1] 4839`

`pol.df$text.var[4839]`

`## [1] "You will be faithful, honest, humble, grateful, generous, a sincere friend, truthful."`

`pol.df$book[4839]`

`## [1] "pensees"`

Again Pensees has the most positive sentence with such words as faithful, honest, humble, grateful, generous, sincere, friend, truthful all being positive.

**Conclusion**

Sentiment analysis allows for the efficient analysis of a large body of text in a highly qualitative manner. There are weaknesses to this approach such as the dictionary used to classify the words can affect the results. In addition, Sentiment analysis only looks at individual sentences and not larger contextual circumstances such as a paragraph. As such, a sentiment analysis provides descriptive insights and not generalizations.

In this post, we will look at how to visualize multivariate clustered data. We will use the “Hitters” dataset from the “ISLR” package. We will use the features of the various baseball players as the dimensions for the clustering. Below is the initial code

```
library(ISLR);library(cluster)
data("Hitters")
str(Hitters)
```

```
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
```

**Data Preparation**

We need to remove all of the factor variables as the kmeans algorithm cannot support factor variables. In addition, we need to remove the “Salary” variable because it is missing data. Lastly, we need to scale the data because the scaling affects the results of the clustering. The code for all of this is below.

`hittersScaled<-scale(Hitters[,c(-14,-15,-19,-20)])`

**Data Analysis**

We will set the k for the kmeans to 3. This can be set to any number and it often requires domain knowledge to determine what is most appropriate. Below is the code

`kHitters<-kmeans(hittersScaled,3)`

We now look at some descriptive stats. First, we will see how many examples are in each cluster.

`table(kHitters$cluster)`

```
##
## 1 2 3
## 116 144 62
```

The groups are mostly balanced. Next, we will look at the mean of each feature by cluster. This will be done with the “aggregate” function. We will use the original data and make a list by the three clusters.

`round(aggregate(Hitters[,c(-14,-15,-19,-20)],FUN=mean,by=list(kHitters$cluster)),1)`

```
## Group.1 AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## 1 1 522.4 143.4 15.1 73.8 66.0 51.7 5.7 2179.1 597.2 51.3
## 2 2 256.6 64.5 5.5 30.9 28.6 24.3 5.6 1377.1 355.6 24.7
## 3 3 404.9 106.7 14.8 54.6 59.4 48.1 15.1 6480.7 1783.4 207.5
## CRuns CRBI CWalks PutOuts Assists Errors
## 1 299.2 256.1 199.7 380.2 181.8 11.7
## 2 170.1 143.6 122.2 209.0 62.4 5.8
## 3 908.5 901.8 694.0 303.7 70.3 6.4
```

Now we can see some difference. It seems group 3 are young (5.6 years of experience) starters based on the number of at-bats they get. Group 1 is young players who may not get to start due to the lower at-bats the receive. Group 2 is old (15.1 years) players who receive significant playing time and have but together impressive career statistics.

Now we will create our visual of the three clusters. For this, we use the “clusplot” function from the “cluster” package.

`clusplot(hittersScaled,kHitters$cluster,color = T,shade = T,labels = 4)`

In general, there is little overlap between the clusters. The overlap between groups 1 and 3 may be due to how they both have a similar amount of experience.

**Conclusion**

Visualizing the clusters can help with developing insights into the groups found during the analysis. This post provided one example of this.

In this post, we will explore multidimensional scaling (MDS) in R. The main benefit of MDS is that it allows you to plot multivariate data into two dimensions. This allows you to create visuals of complex models. In addition, the plotting of MDS allows you to see relationships among examples in a dataset based on how far or close they are to each other.

We will use the “College” dataset from the “ISLR” package to create an MDS of the colleges that are in the data set. Below is some initial code.

```
library(ISLR);library(ggplot2)
data("College")
str(College)
```

```
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
```

**Data Preparation**

After using the “str” function we know that we need to remove the variable “Private” because it is a factor and type of MDS we are doing can only accommodate numerical variables. After removing this variable we will then make a matrix using the “as.matrix” function. Once the matrix is ready we can use the “cmdscale” function to create the actual two-dimensional MDS. Another point to mention is that for the sake of simplicity, we are only going to use the first ten colleges in the dataset. The reason being that using all 722 will m ake it hard to understand the plots we will make. Below is the code.

```
collegedata<-as.matrix(College[,-1])
collegemds<-cmdscale(dist(collegedata[1:10,]))
```

**Data Analysis**

We can now make our initial plot. The xlim and ylim arguments had to be played with a little for the plot to display properly. In addition, the “text” function was used to provide additional information such as the names of the colleges.

```
plot(collegemds,xlim=c(-15000,10000),ylim=c(-15000,10000))
text(collegemds[,1],collegemds[,2],rownames(collegemds))
```

From the plot, you can see that even with only ten names it is messy. The colleges are mostly clumped together which makes it difficult to interpret. We can plot this with a four quadrant graph using “ggplot2”. First, we need to convert the matrix that we create to a dataframe.

`collegemdsdf<-as.data.frame(collegemds)`

We are now ready to use “ggplot” to create the four quadrant plot.

```
p<-ggplot(collegemdsdf, aes(x=V1, y=V2)) +
geom_point() +
lims(x=c(-10000,8000),y=c(-4000,5000)) +
theme_minimal() +
coord_fixed() +
geom_vline(xintercept = 5) + geom_hline(yintercept = 5)+geom_text(aes(label=rownames(collegemdsdf)))
p+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
```

We set the horizontal and vertical line at the x and y-intercept respectively. By doing this it is much easier to understand and interpret the graph. Agnes Scott College is way off to the left while Alaska Pacific University, Abilene Christian College, and even Alderson-Broaddus College are clump together. The rest of the colleges are straddling below the x-axis.

**Conclusion**

In this example, we took several variables and condense them to two dimensions. This is the primary benefit of MDS. It allows you to visualize was cannot be visualized normally. The visualizing allows you to see the structure of the data from which you can draw inferences.

Topic models is a tool that can group text by their main themes. It involves the use of probability based on word frequencies. The algorithm that does this is called the Latent Dirichlet Allocation algorithm.

IN this post, we will use some text mining tools to analyze religious/philosophical text the five texts we will look at are The King James Bible The Quran The Book Of Mormon The Gospel of Buddha Meditations, by Marcus Aurelius

The link for access to these five text files is as follows https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip

Once you unzip it you will need to rename each file appropriately.

The next few paragraphs are almost verbatim from the post text mining in R. This is because the data preparation is essentially the same. Small changes were made but original material is found in the analysis section of this post.

We will now begin the actual analysis. The package we need or “tm” and “topicmodels” Below is some initial code.

`library(tm);library(topicmodels)`

**Data Preparation**

We need to do three things for each text file

- Paste it
- convert it
- write a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different as the location of the file on your computer will be different. The “what” argument tells are what to take from the file and the “Collapse” argument deals with whitespace

```
bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")
```

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

```
bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")
```

The last step of the preparation is the creation of tables. What you are doing is you are taking the objects you have already created and are moving them to their own folder. The text files need to be alone in order to conduct the analysis. Below is the code.

```
write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")
```

**Corpus Development**

We are now ready to create the corpus. This is the object we use to clean the text together rather than individually as before. First, we need to make the corpus object, below is the code. Notice how it contains the directory where are tables are

`docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))`

There are many different ways to prepare the corpus. For our example, we will do the following…

lower case all letters-This avoids the same word be counted separately (ie sheep and Sheep)

- Remove numbers
- Remove punctuation-Simplifies the document
- Remove whitespace-Simplifies the document
- Remove stopwords-Words that have a function but not a meaning (ie to, the, this, etc)
- Remove custom words-Provides additional clarity

Below is the code for this

```
docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
"thee","even","yet","every","said","this","can","unto","upon",
"cant",'shall',"will","that","weve","dont","wont"))
```

We now need to create the matrix. The document matrix is what r will actually analyze. We will then remove sparse terms. Sparse terms are terms that do not occur are a certain percentage in the matrix. For our purposes, we will set the sparsity to .60. This means that a word must appear in 3 of the 5 books of our analysis. Below is the code. The ‘dim’ function will allow you to see how the number of terms is reduced drastically. This is done without losing a great deal of data will speeding up computational time.

```
dtm<-DocumentTermMatrix(docs)
dim(dtm)
```

`## [1] 5 24368`

```
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
```

`## [1] 5 5265`

Analysis

We will now create our topics or themes. If there is no a priori information on how many topics to make it os up to you to decide how many. We will create three topics. The “LDA” function is used and the argument “k” is set to three indicating we want three topics. Below is the code

```
set.seed(123)
lda3<-LDA(dtm,k=3)
```

We can see which topic each book was assigned to using the “topics” function. Below is the code.

`topics(lda3)`

```
## bible.txt buddha.txt meditations.txt mormon.txt
## 2 3 3 1
## quran.txt
## 3
```

According to the results. The book of Mormon and the Bible were so unique that they each had their own topic (1 and 3). The other three text (Buddha, Meditations, and the Book of Mormon) were all placed in topic 2. It’s surprising that the Bible and the Book of Mormon were in separate topics since they are both Christian text. It is also surprising the Book by Buddha, Meditations, and the Quran are all under the same topic as it seems that these texts have nothing in common.

We can also use the “terms” function to see what the most common words are for each topic. The first argument in the function is the model name followed by the number of words you want to see. We will look at 10 words per topic.

`terms(lda3, 10)`

```
## Topic 1 Topic 2 Topic 3
## [1,] "people" "lord" "god"
## [2,] "came" "god" "one"
## [3,] "god" "israel" "things"
## [4,] "behold" "man" "say"
## [5,] "pass" "son" "truth"
## [6,] "lord" "king" "man"
## [7,] "yea" "house" "lord"
## [8,] "land" "one" "life"
## [9,] "now" "come" "see"
## [10,] "things" "people" "good"
```

Interpreting these results takes qualitative skills and is subjective. They all seem to be talking about the same thing. Topic 3 (Bible) seems to focus on Israel and Lord while topic 1 (Mormon) is about God and people. Topic 2 (Buddha, Meditations, and Quran) speak of god as well but the emphasis has moved to truth and the word one.

**Conclusion**

This post provided insight into developing topic models using R. The results of a topic model analysis is highly subjective and will often require strong domain knowledge. Furthermore, the number of topics is highly flexible as well and in the example in this post we could have had different numbers of topics for comparative purposes.

Text mining is descriptive analysis tool that is applied to unstructured textual data. By unstructured, it is meant data that is not stored in relational databases. The majority of data on the Internet and the business world, in general, is of an unstructured nature. As such, the use of text mining tools has grown in importance over the past two decades.

In this post, we will use some text mining tools to analyze religious/philosophical text the five texts we will look at are

- The King James Bible
- The Quran
- The Book Of Mormon
- The Gospel of Buddha
- Meditations, by Marcus Aurelius

The link for access to these five text files is as follows

Once you unzip it you will need to rename each file appropriately.

The actual process of text mining is rather simple and does not involve a great deal of complex coding compared to other machine learning applications. Primarily you need to do the follow Prep the data by first scanning it into r, converting it to ASCII format, and creating the write table for each text Create a corpus that is then cleaned of unnecessary characters Conduct the actual descriptive analysis

We will now begin the actual analysis. The package we need or “tm” for text mining, “wordcloud”, and “RColorBrewer” for visuals. Below is some initial code.

`library(tm);library(wordcloud);library(RColorBrewer)`

**Data Preparation**

We need to do three things for each text file

- Paste
- convert it
- write a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different as the location of the file on your computer will be different. The “what” argument tells are what to take from the file and the “Collapse” argument deals with whitespace

```
bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")
```

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

```
bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")
```

The last step of the preparation is the creation of tables. Primarily you are taken the objects you have already created and moved them to their own folder. The text files need to be alone in order to conduct the analysis. Below is the code.

```
write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")
```

For fun, you can see a snippet of each object by simply typing its name into r as shown below.

`bible`

##[1] "x 1 The Project Gutenberg EBook of The King James Bible This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: The King James Bible Release Date: March 2, 2011 [EBook #10] [This King James Bible was orginally posted by Project Gutenberg in late 1989] Language: English *** START OF THIS PROJECT

**Corpus Creation**

We are now ready to create the corpus. This is the object we use to clean the text together rather than individually as before. First, we need to make the corpus object, below is the code. Notice how it contains the directory where are tables are

`docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))`

There are many different ways to prepare the corpus. For our example, we will do the following…

- lower case all letters-This avoids the same word be counted separately (ie sheep and Sheep)
- Remove numbers
- Remove punctuation-Simplifies the document Remove whitespace-Simplifies the document
- Remove stopwords-Words that have a function but not a meaning (ie to, the, this, etc)
- Remove custom words-Provides additional clarity

Below is the code for this

```
docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
#docs<-tm_map(docs,stemDocument)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
"thee","even","yet","every","said","this","can","unto","upon",
"cant",'shall',"will","that","weve","dont","wont"))
```

We now need to create the matrix. The document matrix is what r will actually analyze. We will then remove sparse terms. Sparse terms are terms that do not occur are a certain percentage in the matrix. For our purposes, we will set the sparsity to .60. This means that a word must appear in 3 of the 5 books of our analysis. Below is the code. The ‘dim’ function will allow you to see how the number of terms is reduced drastically. This is done without losing a great deal of data will speeding up computational time.

```
dtm<-DocumentTermMatrix(docs)
dim(dtm)
```

`## [1] 5 24368`

```
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
```

`## [1] 5 5265`

**Analysis**

We now can explore the text. First, we need to make a matrix that has the sum of the columns od the document term matrix. Then we need to change the order of the matrix to have the most frequent terms first. Below is the code for this.

```
freq<-colSums(as.matrix(dtm))
ord<-order(-freq)#changes the order to descending
```

We can now make a simple bar plot to see what the most common words are. Below is the code

`barplot(freq[head(ord)])`

As expected with religious text. The most common terms are religious terms. You can also determine what words appeared least often with the code below.

`freq[tail(ord)]`

```
## posting secured smiled sway swiftness worthless
## 3 3 3 3 3 3
```

Notice how each word appeared 3 times. This may mean that the 3 terms appear once in three of the five books. Remember we set the sparsity to .60 or 3/5.

Another analysis is to determine how many words appear a certain number of times. For example, how many words appear 200 times or 300. Below is the code.

`head(table(freq))`

```
## freq
## 3 4 5 6 7 8
## 117 230 172 192 191 187
```

Using the “head” function and the “table” function gives us the six most common values of word frequencies. Three words appear 117 times, four appear 230 times, etc. Remember the “head” gives the first few values regardless of their amount

The “findFreqTerms” function allows you to set a cutoff point of how frequent a word needs to be. For example, if we want to know how many words appeared 3000 times we would use the following code.

`findFreqTerms(dtm,3000)`

```
## [1] "behold" "came" "come" "god" "land" "lord" "man"
## [8] "now" "one" "people"
```

The “findAssocs” function finds the correlation between two words in the text. This provides insight into how frequently these words appear together. For our example, we will see which words are associated with war, which is a common subject in many religious texts. We will set the correlation high to keep the list short for the blog post. Below is the code

`findAssocs(dtm,"war",corlimit =.998) `

```
## $war
## arrows bands buildeth captive cords making
## 1 1 1 1 1 1
## perisheth prosperity tower wages yield
## 1 1 1 1 1
```

The interpretation of the results can take many forms. It makes sense for ‘arrows’ and ‘captives’ to be associated with ‘war’ but ‘yield’ seems confusing. We also do not know the sample size of the associations.

Our last technique is the development of a word cloud. This allows you to see word frequency based on where the word is located in the cloud as well as its size. For our example, we will set it so that a word must appear at least 1000 times in the corpus with more common words in the middle. Below is the code.

`wordcloud(names(freq),freq,min.freq=1000,scale=c(3,.5),colors=brewer.pal(6,"Dark2"),random.color = F,random.order = F)`

**Conclusion**

This post provided an introduction to text mining in R. There are many more complex features that are available for the more serious user of R than what is described here

In this post, we will look at recommendation engines using binary information. For a binary recommendation engine, it requires that the data rates the product as good/bad or some other system in which only two responses are possible. The “recommendarlab” package is needed for this analysis and we will use the ratings of movies from grouplens.org for this post.

url http://grouplens.org/datasets/movielens/latest/

If you follow along you want to download the “small dataset” and use the “ratings.csv” and the “movies.csv”. We will then merge these two datasets based on the variable “movieId” the url is below is the initial code

`library(recommenderlab)`

`ratings <- read.csv("~/Downloads/ml-latest-small/ratings.csv")#load ratings data movies <- read.csv("~/Downloads/ml-latest-small/movies.csv")#load movies data movieRatings<-merge(ratings, movies, by='movieId')#merge movies and ratings data`

We now need to convert are “movieRatings” data frame to a matrix that the “recommendarlab” can use. After doing this we need to indicate that we are doing a binary engine by setting the minimum rating to 2.5. What this means is that anything above 2.5 is in one category and anything below 2.5 is in a different category. We use the “binarize” function to do this. Below is the code

```
movieRatings<-as(movieRatings,"realRatingMatrix")
movie.bin<-binarize(movieRatings,minRating=2.5)
```

We need to use a subset of our data. We need each row to have a certain minimum number of ratings. For this analysis, we need at least ten ratings per row. Below is the code for this.

```
movie.bin<-movie.bin[rowCounts(movie.bin)>10]
movie.bin
```

`## 1817 x 671 rating matrix of class 'binaryRatingMatrix' with 68643 ratings.`

Next, we need to setup the evaluation scheme. We use the function and plug in the data, method of evaluation, number of folds, and the given number of ratings. The code is as follows.

```
set.seed(456)
e.bin<-evaluationScheme(movie.bin,method='cross-validation',k=5,given=10)
```

We now make a list that holds all the models we want to run. We will run four models “popular”, “random”, “ubcf”, and “ibcf”. We will then use the “evaluate” function to see how accurate are models are for 5,10,15, and 20 items.

```
algorithms.bin<-list(POPULAR=list(name="POPULAR",param=NULL),
RAND=list(name="RANDOM"),UBCF=list(name="UBCF"),IBCF=list(name="IBCF"))
results.bin<-evaluate(e.bin,algorithms.bin,n=c(5,10,15,20))
```

The “avg” function will help us to see how are models did. Below are the results

`avg(results.bin)`

```
## $POPULAR
## TP FP FN TN precision recall TPR
## 5 1.518356 3.481644 26.16877 629.8312 0.3036712 0.09293487 0.09293487
## 10 2.792329 7.207671 24.89479 626.1052 0.2792329 0.15074799 0.15074799
## 15 3.916164 11.083836 23.77096 622.2290 0.2610776 0.20512093 0.20512093
## 20 4.861370 15.138630 22.82575 618.1742 0.2430685 0.24831787 0.24831787
## FPR
## 5 0.005426716
## 10 0.011221837
## 15 0.017266489
## 20 0.023608749
##
## $RAND
## TP FP FN TN precision recall
## 5 0.2120548 4.787945 27.47507 628.5249 0.04241096 0.007530989
## 10 0.4104110 9.589589 27.27671 623.7233 0.04104110 0.015611349
## 15 0.6241096 14.375890 27.06301 618.9370 0.04160731 0.023631305
## 20 0.8460274 19.153973 26.84110 614.1589 0.04230137 0.033130430
## TPR FPR
## 5 0.007530989 0.007559594
## 10 0.015611349 0.015146399
## 15 0.023631305 0.022702057
## 20 0.033130430 0.030246522
##
## $UBCF
## TP FP FN TN precision recall TPR
## 5 2.175890 2.824110 25.51123 630.4888 0.4351781 0.1582319 0.1582319
## 10 3.740274 6.259726 23.94685 627.0532 0.3740274 0.2504990 0.2504990
## 15 5.054795 9.945205 22.63233 623.3677 0.3369863 0.3182356 0.3182356
## 20 6.172603 13.827397 21.51452 619.4855 0.3086301 0.3748969 0.3748969
## FPR
## 5 0.004387006
## 10 0.009740306
## 15 0.015492088
## 20 0.021557381
##
## $IBCF
## TP FP FN TN precision recall TPR
## 5 1.330411 3.669589 26.35671 629.6433 0.2660822 0.08190126 0.08190126
## 10 2.442192 7.557808 25.24493 625.7551 0.2442192 0.13786523 0.13786523
## 15 3.532603 11.467397 24.15452 621.8455 0.2355068 0.19010813 0.19010813
## 20 4.546301 15.453699 23.14082 617.8592 0.2273151 0.23494969 0.23494969
## FPR
## 5 0.005727386
## 10 0.011801682
## 15 0.017900255
## 20 0.024124329
```

The results are pretty bad for all models. The TPR (true positive rate) is always below .4. We can make a visual of the results by creating a ROC using the TPR/FPR as well as precision/recall.

`plot(results.bin,legend="topleft",annotate=T)`

`plot(results.bin,"prec",legend="topleft",annotate=T)`

The visual makes it clear that the UBCF model is the best.

**Conclusion**

This post provided an example of the development of an algorithm for binary recommendations.

In this post, we will look at how to make a recommendation engine. We will use data that makes recommendations about movies. We will use the “recommenderlab” package to build several different engines. The data comes from

“http://grouplens.org/datasets/movielens/latest/”

At this link, you need to download the “ml-latest.zip”. From there, we will use the “ratings” and “movies” files in this post. Ratings provide the ratings of the movies while movies provide the names of the movies. Before going further it is important to know that the “recommenderlab” has five different techniques for developing recommendation engines (IBCF, UBCF, POPULAR, RANDOM, & SVD). We will use all of them for comparative purposes Below is the code for getting started.

`library(recommenderlab)`

```
ratings <- read.csv("~/Downloads/ml-latest-small/ratings.csv")
movies <- read.csv("~/Downloads/ml-latest-small/movies.csv")
```

We now need to merge the two datasets so that they become one. This way the titles and ratings are in one place. We will then coerce our “movieRatings” dataframe into a “realRatingMatrix” in order to continue our analysis. Below is the code

```
movieRatings<-merge(ratings, movies, by='movieId') #merge two files
movieRatings<-as(movieRatings,"realRatingMatrix") #coerce to realRatingMatrix
```

We will now create two histograms of the ratings. The first is raw data and the second will be normalized data. The function “getRatings” is used in combination with the “hist” function to make the histogram. The normalized data includes the “normalize” function. Below is the code.

`hist(getRatings(movieRatings),breaks =10)`

`hist(getRatings(normalize(movieRatings)),breaks =10)`

We are now ready to create the evaluation scheme for our analysis. In this object we need to set the data name (movieRatings), the method we want to use (cross-validation), the amount of data we want to use for the training set (80%), how many ratings the algorithm is given during the test set (1) with the rest being used to compute the error. We also need to tell R what a good rating is (4 or higher) and the number of folds for the cross-validation (10). Below is the code for all of this.

```
set.seed(123)
eSetup<-evaluationScheme(movieRatings,method='cross-validation',train=.8,given=1,goodRating=4,k=10)
```

Below is the code for developing our models. To do this we need to use the “Recommender” function and the “getData” function to get the dataset. Remember we are using all six modeling techniques

```
ubcf<-Recommender(getData(eSetup,"train"),"UBCF")
ibcf<-Recommender(getData(eSetup,"train"),"IBCF")
svd<-Recommender(getData(eSetup,"train"),"svd")
popular<-Recommender(getData(eSetup,"train"),"POPULAR")
random<-Recommender(getData(eSetup,"train"),"RANDOM")
```

The models have been created. We can now make our predictions using the “predict” function in addition to the “getData” function. We also need to set the argument “type” to “ratings”. Below is the code.

```
ubcf_pred<-predict(ubcf,getData(eSetup,"known"),type="ratings")
ibcf_pred<-predict(ibcf,getData(eSetup,"known"),type="ratings")
svd_pred<-predict(svd,getData(eSetup,"known"),type="ratings")
pop_pred<-predict(popular,getData(eSetup,"known"),type="ratings")
rand_pred<-predict(random,getData(eSetup,"known"),type="ratings")
```

We can now look at the accuracy of the models. We will do this in two steps. First, we will look at the error rates. After completing this, we will do a more detailed analysis of the stronger models. Below is the code for the first step

```
ubcf_error<-calcPredictionAccuracy(ubcf_pred,getData(eSetup,"unknown")) #calculate error
ibcf_error<-calcPredictionAccuracy(ibcf_pred,getData(eSetup,"unknown"))
svd_error<-calcPredictionAccuracy(svd_pred,getData(eSetup,"unknown"))
pop_error<-calcPredictionAccuracy(pop_pred,getData(eSetup,"unknown"))
rand_error<-calcPredictionAccuracy(rand_pred,getData(eSetup,"unknown"))
error<-rbind(ubcf_error,ibcf_error,svd_error,pop_error,rand_error) #combine objects into one data frame
rownames(error)<-c("UBCF","IBCF","SVD","POP","RAND") #give names to rows
error
```

```
## RMSE MSE MAE
## UBCF 1.278074 1.633473 0.9680428
## IBCF 1.484129 2.202640 1.1049733
## SVD 1.277550 1.632135 0.9679505
## POP 1.224838 1.500228 0.9255929
## RAND 1.455207 2.117628 1.1354987
```

The results indicate that the “RAND” and “IBCF” models are clearly worst than the remaining three. We will now move to the second step and take a closer look at the “UBCF”, “SVD”, and “POP” models. We will do this by making a list and using the “evaluate” function to get other model evaluation metrics. We will make a list called “algorithms” and store the three strongest models. Then we will make a objective called “evlist” in this object we will use the “evaluate” function as well as called the evaluation scheme “esetup”, the list (“algorithms”) as well as the number of movies to assess (5,10,15,20)

```
algorithms<-list(POPULAR=list(name="POPULAR"),SVD=list(name="SVD"),UBCF=list(name="UBCF"))
evlist<-evaluate(eSetup,algorithms,n=c(5,10,15,20))
```

`avg(evlist)`

```
## $POPULAR
## TP FP FN TN precision recall TPR
## 5 0.3010965 3.033333 4.917105 661.7485 0.09028443 0.07670381 0.07670381
## 10 0.4539474 6.214912 4.764254 658.5669 0.06806016 0.11289681 0.11289681
## 15 0.5953947 9.407895 4.622807 655.3739 0.05950450 0.14080354 0.14080354
## 20 0.6839912 12.653728 4.534211 652.1281 0.05127635 0.16024740 0.16024740
## FPR
## 5 0.004566269
## 10 0.009363021
## 15 0.014177091
## 20 0.019075070
##
## $SVD
## TP FP FN TN precision recall TPR
## 5 0.1025219 3.231908 5.115680 661.5499 0.03077788 0.00968336 0.00968336
## 10 0.1808114 6.488048 5.037390 658.2938 0.02713505 0.01625454 0.01625454
## 15 0.2619518 9.741338 4.956250 655.0405 0.02620515 0.02716656 0.02716656
## 20 0.3313596 13.006360 4.886842 651.7754 0.02486232 0.03698768 0.03698768
## FPR
## 5 0.004871678
## 10 0.009782266
## 15 0.014689510
## 20 0.019615377
##
## $UBCF
## TP FP FN TN precision recall TPR
## 5 0.1210526 2.968860 5.097149 661.8129 0.03916652 0.01481106 0.01481106
## 10 0.2075658 5.972259 5.010636 658.8095 0.03357173 0.02352752 0.02352752
## 15 0.3028509 8.966886 4.915351 655.8149 0.03266321 0.03720717 0.03720717
## 20 0.3813596 11.978289 4.836842 652.8035 0.03085246 0.04784538 0.04784538
## FPR
## 5 0.004475151
## 10 0.009004466
## 15 0.013520481
## 20 0.018063361
```

Well, the numbers indicate that all the models are terrible. All metrics are scored rather poorly. True positives, false positives, false negatives, true negatives, precision, recall, true positive rate, and false positive rate are low for all models. Remember that these values are averages of the cross-validation. As such, for the “POPULAR” model when looking at the top five movies on average, the number of true positives was .3.

Even though the numbers are terrible the “POPULAR” model always performed the best. We can even view the ROC curve with the code below

`plot(evlist,legend="topleft",annotate=T)`

We can now determine individual recommendations. We first need to build a model using the POPULAR algorithm. Below is the code.

```
Rec1<-Recommender(movieRatings,method="POPULAR")
Rec1
```

```
## Recommender of type 'POPULAR' for 'realRatingMatrix'
## learned using 9066 users.
```

We will now pull the top five recommendations for the first two raters and make a list. The numbers are the movie ids and not the actual titles

```
recommend<-predict(Rec1,movieRatings[1:5],n=5)
as(recommend,"list")
```

```
## $`1`
## [1] "78" "95" "544" "102" "4"
##
## $`2`
## [1] "242" "232" "294" "577" "95"
##
## $`3`
## [1] "654" "242" "30" "232" "287"
##
## $`4`
## [1] "564" "654" "242" "30" "232"
##
## $`5`
## [1] "242" "30" "232" "287" "577"
```

Below we can see the specific score for a specific movie. The names of the movies come from the original “ratings” dataset.

```
rating<-predict(Rec1,movieRatings[1:5],type='ratings')
rating
```

`## 5 x 671 rating matrix of class 'realRatingMatrix' with 2873 ratings.`

```
movieresult<-as(rating,'matrix')[1:5,1:3]
colnames(movieresult)<-c("Toy Story","Jumanji","Grumpier Old Men")
movieresult
```

```
## Toy Story Jumanji Grumpier Old Men
## 1 2.859941 3.822666 3.724566
## 2 2.389340 3.352066 3.253965
## 3 2.148488 3.111213 3.013113
## 4 1.372087 2.334812 2.236711
## 5 2.255328 3.218054 3.119953
```

This is what the model thinks the person would rate the movie. It is the difference between this number and the actual one that the error is calculated. In addition, if someone did not rate a movie you would see an NA in that spot

**Conclusion**

This was a lot of work. However, with additional work, you can have your own recommendation system based on data that was collected.

Recommendations engines are used to make predictions about what future users would like based on prior users suggestions. Whenever you provide numerical feedback on a product or services this information can be used to provide recommendations in the future.

This post will look at various ways in which recommendation engines derive their conclusions.

**Ways of Recommending**

There are two common ways to develop a recommendation engine in a machine learning context. These two ways are collaborative filtering and content-based. Content-based recommendations rely solely on the data provided by the user. A user develops a profile through their activity and the engine recommends products or services. The only problem is if there is little data on user poor recommendations are made.

Collaborative filtering is crowd-based recommendations. What this means the data of many is used to recommend to one. This bypasses the concern with a lack of data that can happen with content-based recommendations.

There are four common ways to develop collaborative filters and they are as follows

- User-based collaborative filtering
- Item-baed collaborative filtering
- Singular value decomposition and Principal component analysis

**User-based Collaborative Filtering (UBCF)**

UBCF uses k-nearest neighbor or some similarity measurement such as Pearson Correlation to predict the missing rating for a user. Once the number of neighbors is determined the algorithm calculates the average of the neighbors to predict the information for the user. The predicted value can be used to determine if a user will like a particular product or service

The predicted value can be used to determine if a user will like a particular product or service. Low values are not recommended while high values may be. A major weakness of UBCF is calculating the similarities of users requires keeping all the data in memory which is a computational challenge.

**Item-based Collaborative Filtering (IBCF)**

IBCF uses the similarity between items to make recomeendations. This is calculated with the same measures as before (Knn, Pearson correlation, etc.). After finding the most similar items, The algorithm will take the average from the individual user of the other items to predict recommendation the user would make for the unknown item.

In order to assure accuracy, it is necessary to have a huge number of items that can have the similarities calculated. This leads to the same computational problems mentioned earlier.

**Singular Value Decomposition and Principal Component Analysis (SVD, PCA)**

When the dataset is too big for the first two options. SVD or PCA could be an appropriate choice. What each of these two methods does in a simple way is reduce the dimensionality by making latent variables. Doing this reduces the computational effort as well as reduce noise in the data.

With SVD, we can reduce the data to a handful of factors. The remaining factors can be used to reproduce the original values which can then be used to predict missing values.

For PCA, items are combined in components and like items that load on the same component can be used to make predictions for an unknown data point for a user.

**Conclusion**

Recommendation engines play a critical part in generating sales for many companies. This post provided an insight into how they are created. Understanding this can allow you to develop recommendation engines based on data.

One of the major problems with hierarchical and k-means clustering is that they cannot handle nominal data. The reality is that most data is mixed or a combination of both interval/ratio data and nominal/ordinal data.

One of many ways to deal with this problem is by using the Gower coefficient. This coefficient compares the pairwise cases in the data set and calculates a dissimilarity between. By dissimilar we mean the weighted mean of the variables in that row.

Once the dissimilarity calculations are completed using the gower coefficient (there are naturally other choices), you can then use regular kmeans clustering (there are also other choices) to find the traits of the various clusters. In this post, we will use the “MedExp” dataset from the “Ecdat” package. Our goal will be to cluster the mixed data into four clusters. Below is some initial code.

`library(cluster);library(Ecdat);library(compareGroups)`

```
data("MedExp")
str(MedExp)
```

```
## 'data.frame': 5574 obs. of 15 variables:
## $ med : num 62.1 0 27.8 290.6 0 ...
## $ lc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ idp : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ lpi : num 6.91 6.91 6.91 6.91 6.11 ...
## $ fmde : num 0 0 0 0 0 0 0 0 0 0 ...
## $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ ndisease: num 13.7 13.7 13.7 13.7 13.7 ...
## $ health : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
## $ linc : num 9.53 9.53 9.53 9.53 8.54 ...
## $ lfam : num 1.39 1.39 1.39 1.39 1.1 ...
## $ educdec : num 12 12 12 12 12 12 12 12 9 9 ...
## $ age : num 43.9 17.6 15.5 44.1 14.5 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
## $ child : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
## $ black : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
```

You can clearly see that our data is mixed with both numerical and factor variables. Therefore, the first thing we must do is calculate the gower coefficient for the dataset. This is done with the “daisy” function from the “cluster” package.

`disMat<-daisy(MedExp,metric = "gower")`

Now we can use the “kmeans” to make are clusters. This is possible because all the factor variables have been converted to a numerical value. We will set the number of clusters to 4. Below is the code.

```
set.seed(123)
mixedClusters<-kmeans(disMat, centers=4)
```

We can now look at a table of the clusters

`table(mixedClusters$cluster)`

```
##
## 1 2 3 4
## 1960 1342 1356 916
```

The groups seem reasonably balanced. We now need to add the results of the kmeans to the original dataset. Below is the code

`MedExp$cluster<-mixedClusters$cluster`

We now can built a descriptive table that will give us the proportions of each variable in each cluster. To do this we need to use the “compareGroups” function. We will then take the output of the “compareGroups” function and use it in the “createTable” function to get are actual descriptive stats.

```
group<-compareGroups(cluster~.,data=MedExp)
clustab<-createTable(group)
clustab
```

```
##
## --------Summary descriptives table by 'cluster'---------
##
## __________________________________________________________________________
## 1 2 3 4 p.overall
## N=1960 N=1342 N=1356 N=916
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## med 211 (1119) 68.2 (333) 269 (820) 83.8 (210) <0.001
## lc 4.07 (0.60) 4.05 (0.60) 0.04 (0.39) 0.03 (0.34) 0.000
## idp: <0.001
## no 1289 (65.8%) 922 (68.7%) 1123 (82.8%) 781 (85.3%)
## yes 671 (34.2%) 420 (31.3%) 233 (17.2%) 135 (14.7%)
## lpi 5.72 (1.94) 5.90 (1.73) 3.27 (2.91) 3.05 (2.96) <0.001
## fmde 6.82 (0.99) 6.93 (0.90) 0.00 (0.12) 0.00 (0.00) 0.000
## physlim: <0.001
## no 1609 (82.1%) 1163 (86.7%) 1096 (80.8%) 789 (86.1%)
## yes 351 (17.9%) 179 (13.3%) 260 (19.2%) 127 (13.9%)
## ndisease 11.5 (8.26) 10.2 (2.97) 12.2 (8.50) 10.6 (3.35) <0.001
## health: <0.001
## excellent 910 (46.4%) 880 (65.6%) 615 (45.4%) 612 (66.8%)
## good 828 (42.2%) 382 (28.5%) 563 (41.5%) 261 (28.5%)
## fair 183 (9.34%) 74 (5.51%) 137 (10.1%) 42 (4.59%)
## poor 39 (1.99%) 6 (0.45%) 41 (3.02%) 1 (0.11%)
## linc 8.68 (1.22) 8.61 (1.37) 8.75 (1.17) 8.78 (1.06) 0.005
## lfam 1.05 (0.57) 1.49 (0.34) 1.08 (0.58) 1.52 (0.35) <0.001
## educdec 12.1 (2.87) 11.8 (2.58) 12.0 (3.08) 11.8 (2.73) 0.005
## age 36.5 (12.0) 9.26 (5.01) 37.0 (12.5) 9.29 (5.11) 0.000
## sex: <0.001
## male 893 (45.6%) 686 (51.1%) 623 (45.9%) 482 (52.6%)
## female 1067 (54.4%) 656 (48.9%) 733 (54.1%) 434 (47.4%)
## child: 0.000
## no 1960 (100%) 0 (0.00%) 1356 (100%) 0 (0.00%)
## yes 0 (0.00%) 1342 (100%) 0 (0.00%) 916 (100%)
## black: <0.001
## yes 1623 (82.8%) 986 (73.5%) 1148 (84.7%) 730 (79.7%)
## no 337 (17.2%) 356 (26.5%) 208 (15.3%) 186 (20.3%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
```

The table speaks for itself. Results that utilize factor variables have proportions to them. For example, in cluster 1, 1289 people or 65.8% responded “no” that the have an individual deductible plan (idp). Numerical variables have the mean with the standard deviation in parentheses. For example, in cluster 1 the average family size was 1 with a standard deviation of 1.05 (lfam).

**Conclusion**

Mixed data can be partition into clusters with the help of the gower or another coefficient. In addition, kmeans is not the only way to cluster the data. There are other choices such as the partitioning around medoids. The example provided here simply serves as a basic introduction to this.

Hierarchical clustering is a form of unsupervised learning. What this means is that the data points lack any form of label and the purpose of the analysis is to generate labels for our data points. IN other words, we have no Y values in our data.

Hierarchical clustering is an agglomerative technique. This means that each data point starts as their own individual clusters and are merged over iterations. This is great for small datasets but is difficult to scale. In addition, you need to set the linkage which is used to place observations in different clusters. There are several choices (ward, complete, single, etc.) and the best choice depends on context.

In this post, we will make a hierarchical clustering analysis of the “MedExp” data from the “Ecdat” package. We are trying to identify distinct subgroups in the sample. The actual hierarchical cluster creates what is a called a dendrogram. Below is some initial code.

`library(cluster);library(compareGroups);library(NbClust);library(HDclassif);library(sparcl);library(Ecdat)`

```
data("MedExp")
str(MedExp)
```

```
## 'data.frame': 5574 obs. of 15 variables:
## $ med : num 62.1 0 27.8 290.6 0 ...
## $ lc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ idp : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ lpi : num 6.91 6.91 6.91 6.91 6.11 ...
## $ fmde : num 0 0 0 0 0 0 0 0 0 0 ...
## $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ ndisease: num 13.7 13.7 13.7 13.7 13.7 ...
## $ health : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
## $ linc : num 9.53 9.53 9.53 9.53 8.54 ...
## $ lfam : num 1.39 1.39 1.39 1.39 1.1 ...
## $ educdec : num 12 12 12 12 12 12 12 12 9 9 ...
## $ age : num 43.9 17.6 15.5 44.1 14.5 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
## $ child : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
## $ black : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
```

Currently, for the purposes of this post. The dataset is too big. IF we try to do the analysis with over 5500 observations it will take a long time. Therefore, we will only use the first 1000 observations. In addition, We need to remove factor variables as hierarchical clustering cannot analyze factor variables. Below is the code.

```
MedExp_small<-MedExp[1:1000,]
MedExp_small$sex<-NULL
MedExp_small$idp<-NULL
MedExp_small$child<-NULL
MedExp_small$black<-NULL
MedExp_small$physlim<-NULL
MedExp_small$health<-NULL
```

We now need to scale are data. This is important because different scales will cause different variables to have more or less influence on the results. Below is the code

`MedExp_small_df<-as.data.frame(scale(MedExp_small))`

We now need to determine how many clusters to create. There is no rule on this but we can use statistical analysis to help us. The “NbClust” package will conduct several different analysis to provide a suggested number of clusters to create. You have to set the distance, min/max number of clusters, the method, and the index. The graphs can be understood by looking for the bend or elbow in them. At this point is the best number of clusters.

`numComplete<-NbClust(MedExp_small_df,distance = 'euclidean',min.nc = 2,max.nc = 8,method = 'ward.D2',index = c('all'))`

```
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
```

```
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
```

`numComplete$Best.nc`

```
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 2.0000 2.0000 6.0000 8.0000 3.000 3.000000e+00
## Value_Index 2.9814 292.0974 56.9262 28.4817 1800.873 4.127267e+24
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 6.0 6.0000 3.0000 6.0000 2.000 3.0000
## Value_Index 166569.3 265.6967 5.3929 -0.0913 0.112 1.0987
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 2.0000 2.0000 2.0000 2.0000 6.0000 3.000
## Value_Index 0.2809 0.9567 16.1209 0.2712 0.2707 1435.833
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex
## Number_clusters 6.0000 1 3.000 3.0000 0 3.0000 0
## Value_Index 0.4102 NA 0.622 0.1779 0 1.9507 0
## SDbw
## Number_clusters 3.0000
## Value_Index 0.5195
```

Simple majority indicates that three clusters is most appropriate. However, four clusters are probably just as good. Every time you do the analysis you will get slightly different results unless you set the seed.

To make our actual clusters we need to calculate the distances between clusters using the “dist” function while also specifying the way to calculate it. We will calculate distance using the “Euclidean” method. Then we will take the distance’s information and make the actual clustering using the ‘hclust’ function. Below is the code.

```
distance<-dist(MedExp_small_df,method = 'euclidean')
hiclust<-hclust(distance,method = 'ward.D2')
```

We can now plot the results. We will plot “hiclust” and set hang to -1 so this will place the observations at the bottom of the plot. Next, we use the “cutree” function to identify 4 clusters and store this in the “comp” variable. Lastly, we use the “ColorDendrogram” function to highlight are actual clusters.

```
plot(hiclust,hang=-1, labels=F)
```

comp<-cutree(hiclust,4) ColorDendrogram(hiclust,y=comp,branchlength = 100)

We can also create some descriptive stats such as the number of observations per cluster.

`table(comp)`

```
## comp
## 1 2 3 4
## 439 203 357 1
```

We can also make a table that looks at the descriptive stats by cluster by using the “aggregate” function.

`aggregate(MedExp_small_df,list(comp),mean)`

```
## Group.1 med lc lpi fmde ndisease
## 1 1 0.01355537 -0.7644175 0.2721403 -0.7498859 0.048977122
## 2 2 -0.06470294 -0.5358340 -1.7100649 -0.6703288 -0.105004408
## 3 3 -0.06018129 1.2405612 0.6362697 1.3001820 -0.002099968
## 4 4 28.66860936 1.4732183 0.5252898 1.1117244 0.564626907
## linc lfam educdec age
## 1 0.12531718 -0.08861109 0.1149516 0.12754008
## 2 -0.44435225 0.22404456 -0.3767211 -0.22681535
## 3 0.09804031 -0.01182114 0.0700381 -0.02765987
## 4 0.18887531 -2.36063161 1.0070155 -0.07200553
```

Cluster 1 is the most educated (‘educdec’). Cluster 2 stands out as having higher medical cost (‘med’), chronic disease (‘ndisease’) and age. Cluster 3 had the lowest annual incentive payment (‘lpi’). Cluster 4 had the highest coinsurance rate (‘lc’). You can make boxplots of each of the stats above. Below is just an example of age by cluster.

```
MedExp_small_df$cluster<-comp
boxplot(age~cluster,MedExp_small_df)
```

**Conclusion**

Hierarchical clustering is one way in which to provide labels for data that does not have labels. The main challenge is determining how many clusters to create. However, this can be dealt with through using recommendations that come from various functions in R.

Deep learning is a complex machine learning concept in which new features are created new features from the variables that were inputted. These new features are used for classifying labeled data. This all done mostly with artificial neural networks that are multiple layers deep and can involve regularization.

If understanding is not important but you are in search of the most accurate classification possible deep learning is a useful tool. It is nearly impossible to explain to the typical stakeholder and is best for just getting the job done.

One of the most accessible packages for using deep learning is the “h2o” package.This package allows you to access the H2O website which will analyze your data and send it back to you. This allows a researcher to do analytics on a much larger scale than their own computer can handle. In this post, we will use deep learning to predict the gender of the head of household in the “VietnamH” dataset from the “Ecdat” package. Below is some initial code.

**Data Preparation**

`library(h2o);library(Ecdat);library(corrplot)`

`data("VietNamH") str(VietNamH)`

```
## 'data.frame': 5999 obs. of 11 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
## $ age : int 68 57 42 72 73 66 73 46 50 45 ...
## $ educyr : num 4 8 14 9 1 13 2 9 12 12 ...
## $ farm : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ hhsize : int 6 6 6 6 8 7 9 4 5 4 ...
## $ lntotal : num 10.1 10.3 10.9 10.3 10.5 ...
## $ lnmed : num 11.23 8.51 8.71 9.29 7.56 ...
## $ lnrlfood: num 8.64 9.35 10.23 9.26 9.59 ...
## $ lnexp12m: num 11.23 8.51 8.71 9.29 7.56 ...
## $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...
```

`corrplot(cor(na.omit(VietNamH[,c(-1,-4,-5,-11)])),method = 'number')`

We need to remove the “commune” variable “lnexp12m” and the “lntotal” variable. The “commune” variable should be removed because it doesn’t provide much information. The “lntotal” variable should be removed because it is the total expenditures that the family spends. This is represented by other variables such as food “lnrlfood” which “lntotal” highly correlates with. the “lnexp12m” should be removed because it has a perfect correlation with “lnmed”. Below is the code

```
VietNamH$commune<-NULL
VietNamH$lnexp12m<-NULL
VietNamH$lntotal<-NULL
```

*Save as CSV file*

We now need to save our modified dataset as a csv file that we can send to h2o. The code is as follows.

`write.csv(VietNamH, file="viet.csv",row.names = F)`

*Connect to H2O*

Now we can connect to H2o and start what is called an instance.

`localH2O<-h2o.init()`

```
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 50 minutes 18 seconds
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 27 days
## H2O cluster name: H2O_started_from_R_darrin_hsl318
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.4.0 (2017-04-21)
```

The output indicates that we are connected. The next step is where it really gets complicated. We need to upload our data to h2o as an h2o dataframe, which is different from a regular data frame. We also need to indicate the location of the csv file on your computer that needs to be converted. All of this is done in the code below.

`viet.hex<-h2o.uploadFile(path="/home/darrin/Documents/R working directory/blog/blog/viet.csv",destination_frame = "viet.hex")`

In the code above we create an object called “viet.hex”. This object uses the “h2o.uploadFile” function to send our csv to h2o. We can check if everything worked by using the “class” function and the “str” function on “viet.hex”.

`class(viet.hex)`

`## [1] "H2OFrame"`

`str(viet.hex)`

```
## Class 'H2OFrame'
## - attr(*, "op")= chr "Parse"
## - attr(*, "id")= chr "viet.hex"
## - attr(*, "eval")= logi FALSE
## - attr(*, "nrow")= int 5999
## - attr(*, "ncol")= int 8
## - attr(*, "types")=List of 8
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "real"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "real"
## ..$ : chr "real"
## - attr(*, "data")='data.frame': 10 obs. of 8 variables:
## ..$ sex : Factor w/ 2 levels "female","male": 1 1 2 1 1 1 1 2 2 2
## ..$ age : num 68 57 42 72 73 66 73 46 50 45
## ..$ educyr : num 4 8 14 9 1 13 2 9 12 12
## ..$ farm : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
## ..$ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2
## ..$ hhsize : num 6 6 6 6 8 7 9 4 5 4
## ..$ lnmed : num 11.23 8.51 8.71 9.29 7.56 ...
## ..$ lnrlfood: num 8.64 9.35 10.23 9.26 9.59 ...
```

The “summary” function also provides insight into the data.

`summary(viet.hex)`

```
## sex age educyr farm urban
## male :4375 Min. :16.00 Min. : 0.000 yes:3438 no :4269
## female:1624 1st Qu.:37.00 1st Qu.: 3.982 no :2561 yes:1730
## Median :46.00 Median : 6.996
## Mean :48.01 Mean : 7.094
## 3rd Qu.:58.00 3rd Qu.: 9.988
## Max. :95.00 Max. :22.000
## hhsize lnmed lnrlfood
## Min. : 1.000 Min. : 0.000 Min. : 6.356
## 1st Qu.: 4.000 1st Qu.: 4.166 1st Qu.: 8.372
## Median : 5.000 Median : 5.959 Median : 8.689
## Mean : 4.752 Mean : 5.266 Mean : 8.680
## 3rd Qu.: 6.000 3rd Qu.: 7.171 3rd Qu.: 9.001
## Max. :19.000 Max. :12.363 Max. :11.384
```

*Create Training and Testing Sets*

We now need to create our train and test sets. We need to use slightly different syntax to do this with h2o. The code below is how it is done to create a 70/30 split in the data.

```
rand<-h2o.runif(viet.hex,seed = 123)
train<-viet.hex[rand<=.7,]
train<-h2o.assign(train, key = "train")
test<-viet.hex[rand>.7,]
test<-h2o.assign(test, key = "test")
```

Here is what we did

- We created an object called “rand” that created random numbers for or “viet.hex” dataset.
- All values less than .7 were assigned to the “train” variable
- The train variable was given the key name “train” in order to use it in the h2o framework
- All values greater than .7 were assigned to test and test was given a key name

You can check the proportions of the train and test sets using the “h2o.table” function.

`h2o.table(train$sex)`

```
## sex Count
## 1 female 1146
## 2 male 3058
##
## [2 rows x 2 columns]
```

`h2o.table(test$sex)`

```
## sex Count
## 1 female 478
## 2 male 1317
##
## [2 rows x 2 columns]
```

**Model Development**

We can now create our model.

`vietdlmodel<-h2o.deeplearning(x=2:8,y=1,training_frame = train,validation_frame = test,seed=123,variable_importances = T)`

Here is what the code above means.

- We created an object called “vietdlmodel”
- We used the “h2o.deeplearning” function.
- x = 2:8 is all the independent variables in the dataframe and y=1 is the first variable “sex”
- We set the training and validation frame to “train” and “test” and set the seed.
- Finally, we indicated that we want to know the variable importance.

We can check the performance of the model with the code below.

`vietdlmodel`

```
## Model Details:
## training
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## female male Error Rate
## female 435 711 0.620419 =711/1146
## male 162 2896 0.052976 =162/3058
## Totals 597 3607 0.207659 =873/4204
## testing
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## female male Error Rate
## female 151 327 0.684100 =327/478
## male 60 1257 0.045558 =60/1317
## Totals 211 1584 0.215599 =387/1795
```

There is a lot of output here. For simplicity, we will focus on the confusion matrices for the training and testing sets.The error rate for the training set is 19.8% and for the testing set, it is 21.2%. Below we can see which variable were most useful

`vietdlmodel@model$variable_importances`

```
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 urban.no 1.000000 1.000000 0.189129
## 2 urban.yes 0.875128 0.875128 0.165512
## 3 farm.no 0.807208 0.807208 0.152666
## 4 farm.yes 0.719517 0.719517 0.136081
## 5 age 0.451581 0.451581 0.085407
## 6 hhsize 0.410472 0.410472 0.077632
## 7 lnrlfood 0.386189 0.386189 0.073039
## 8 educyr 0.380398 0.380398 0.071944
## 9 lnmed 0.256911 0.256911 0.048589
## 10 farm.missing(NA) 0.000000 0.000000 0.000000
## 11 urban.missing(NA) 0.000000 0.000000 0.000000
```

The numbers speak for themselves. “Urban” and “farm” are both the most important variables for predicting sex. Below is the code for obtaining the predicted results and placing them into a dataframe. This is useful if you need to send in final results to a data science competition such as those found at kaggle.

`vietdlPredict<-h2o.predict(vietdlmodel,newdata = test)`

`vietdlPredict`

```
## predict female male
## 1 male 0.06045560 0.9395444
## 2 male 0.10957121 0.8904288
## 3 male 0.27459108 0.7254089
## 4 male 0.14721353 0.8527865
## 5 male 0.05493486 0.9450651
## 6 male 0.10598351 0.8940165
##
## [1795 rows x 3 columns]
```

```
vietdlPred<-as.data.frame(vietdlPredict)
head(vietdlPred)
```

```
## predict female male
## 1 male 0.06045560 0.9395444
## 2 male 0.10957121 0.8904288
## 3 male 0.27459108 0.7254089
## 4 male 0.14721353 0.8527865
## 5 male 0.05493486 0.9450651
## 6 male 0.10598351 0.8940165
```

**Conclusion**

This was a complicated experience. However, we learned how to upload and download results from h2.

In this blog, we have already discussed and what gradient boosting is. However, for a brief recap, gradient boosting improves model performance by first developing an initial model called the base learner using whatever algorithm of your choice (linear, tree, etc.).

What follows next is that gradient boosting looks at the error in the first model and develops a second model using what is called the loss function. The loss function calculates the difference between the current accuracy and the desired prediction whether it’s accuracy for classification or error in regression. This process is repeated with the creation of additional models until a certain level of accuracy or reduction in error is attained.

This post what provide an example of the use of gradient boosting in random forest classification. Specifically, we will try to predict a person’s labor participation based on several independent variables.

`library(randomForest);library(gbm);library(caret);library(Ecdat)`

```
data("Participation")
str(Participation)
```

```
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

**Data Preparation**

We need to transform the ‘age’ variable by multiplying by ten so that the ages are realistic. In addition, we need to convert “lnnlinc” from the log of salary to regular salary. Below is the code to transform these two variables.

```
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
```

We can now create our train and test datasets

```
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
```

We now need to create our grid and control. The grid allows us to create several different models with various parameter settings. This is important in determining what is the most appropriate model which is always determined by comparing. We are using random forest so we need to set the number of trees we desire, the depth of the trees, the shrinkage which controls the influence of each tree, and the minimum number of observations in a node. The control will allow us to set the cross-validation. Below is the code for the creation of the grid and control.

```
grid<-expand.grid(.n.trees=seq(200,500,by=200),.interaction.depth=seq(1,3,by=2),.shrinkage=seq(.01,.09,by=.04),
.n.minobsinnode=seq(1,5,by=2)) #grid features
control<-trainControl(method="CV",number = 10) #control
```

**Parameter Selection**

Now we set our seed and run the gradient boosted model.

```
set.seed(123)
gbm.lfp.train<-train(lfp~.,data=train,method='gbm',trControl=control,tuneGrid=grid)
gbm.lfp.train
```

```
## Stochastic Gradient Boosting
##
## 636 samples
## 6 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 573, 573, 571, 572, 573, 572, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy
## 0.01 1 1 200 0.6666026
## 0.01 1 1 400 0.6823306
## 0.01 1 3 200 0.6588637
## 0.01 1 3 400 0.6854804
## 0.01 1 5 200 0.6792769
## 0.01 1 5 400 0.6823306
## 0.01 3 1 200 0.6730044
## 0.01 3 1 400 0.6572051
## 0.01 3 3 200 0.6793273
## 0.01 3 3 400 0.6697787
## 0.01 3 5 200 0.6682914
## 0.01 3 5 400 0.6650416
## 0.05 1 1 200 0.6759558
## 0.05 1 1 400 0.6508040
## 0.05 1 3 200 0.6681426
## 0.05 1 3 400 0.6602286
## 0.05 1 5 200 0.6680441
## 0.05 1 5 400 0.6570788
## 0.05 3 1 200 0.6493662
## 0.05 3 1 400 0.6603518
## 0.05 3 3 200 0.6540545
## 0.05 3 3 400 0.6366911
## 0.05 3 5 200 0.6712428
## 0.05 3 5 400 0.6445299
## 0.09 1 1 200 0.6461405
## 0.09 1 1 400 0.6634768
## 0.09 1 3 200 0.6571036
## 0.09 1 3 400 0.6320765
## 0.09 1 5 200 0.6554922
## 0.09 1 5 400 0.6540755
## 0.09 3 1 200 0.6523920
## 0.09 3 1 400 0.6430140
## 0.09 3 3 200 0.6430666
## 0.09 3 3 400 0.6447749
## 0.09 3 5 200 0.6540522
## 0.09 3 5 400 0.6524416
## Kappa
## 0.3210036
## 0.3611194
## 0.3032151
## 0.3667274
## 0.3472079
## 0.3603046
## 0.3414686
## 0.3104335
## 0.3542736
## 0.3355582
## 0.3314006
## 0.3258459
## 0.3473532
## 0.2961782
## 0.3310251
## 0.3158762
## 0.3308353
## 0.3080692
## 0.2940587
## 0.3170198
## 0.3044814
## 0.2692627
## 0.3378545
## 0.2844781
## 0.2859754
## 0.3214156
## 0.3079460
## 0.2585840
## 0.3062307
## 0.3044324
## 0.3003943
## 0.2805715
## 0.2827956
## 0.2861825
## 0.3024944
## 0.3002135
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 400,
## interaction.depth = 1, shrinkage = 0.01 and n.minobsinnode = 3.
```

Gradient boosting provides us with the recommended parameters for our training model as shown above as well as the accuracy and kappa of each model. We also need to recode the dependent variable as 0 and 1 for the ‘gbm’ function.

**Model Training**

```
train$lfp=ifelse(train$lfp=="no",0,1)
gbm.lfp<-gbm(lfp~., distribution = 'bernoulli',data=train,n.trees = 400,interaction.depth = 1,shrinkage=.01,n.minobsinnode = 3)
```

You can see a summary of the most important variables for prediction as well as a plot by using the “summary” function.

`summary(gbm.lfp)`

```
## var rel.inf
## lnnlinc lnnlinc 28.680447
## age age 27.451474
## foreign foreign 23.307932
## nyc nyc 18.375856
## educ educ 2.184291
## noc noc 0.000000
```

Salary (lnnlinc), age and foreigner status are the most important predictors followed by the number of younger children (nyc) and last education. The number of older children (noc) has no effect. We can now test our model on the test set.

**Model Testing**

`gbm.lfp.test<-predict(gbm.lfp,newdata = test,type = 'response', n.trees = 400)`

Our test model returns a set of probabilities. We need to convert this to a simple yes or no and this is done in the code below.

`gbm.class<-ifelse(gbm.lfp.test<0.5,'no','yes')`

We can now look at a table to see how accurate our model is as well as calculate the accuracy.

`table(gbm.class,test$lfp)`

```
##
## gbm.class no yes
## no 91 39
## yes 39 67
```

`(accuracy<-(91+67)/(91+67+39+39))`

`## [1] 0.6694915`

The model is not great. However, you now have an example of how to use gradient boosting to develop a random forest classification model

Gradient boosting is a machine learning tool for “boosting” or improving model performance. How this works is that you first develop an initial model called the base learner using whatever algorithm of your choice (linear, tree, etc.).

Gradient boosting looks at the error and develops a second model using what is called da loss function. The loss function is the difference between the current accuracy and the desired prediction whether it’s accuracy for classification or error in regression. This process of making additional models based only on the misclassified ones continues until the level of accuracy is reached.

Gradient boosting is also stochastic. This means that it randomly draws from the sample as it iterates over the data. This helps to improve accuracy and or reduce error.

In this post, we will use gradient boosting for regression trees. In particular, we will use the “Sacramento” dataset from the “caret” package. Our goal is to predict a house’s price based on the available variables. Below is some initial code

`library(caret);library(gbm);library(corrplot)`

```
data("Sacramento")
str(Sacramento)
```

```
## 'data.frame': 932 obs. of 9 variables:
## $ city : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
## $ zip : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
## $ beds : int 2 3 2 2 2 3 3 3 2 3 ...
## $ baths : num 1 1 1 1 1 1 2 1 2 2 ...
## $ sqft : int 836 1167 796 852 797 1122 1104 1177 941 1146 ...
## $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
## $ price : int 59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
## $ latitude : num 38.6 38.5 38.6 38.6 38.5 ...
## $ longitude: num -121 -121 -121 -121 -121 ...
```

**Data Preparation**

Already there are some actions that need to be made. We need to remove the variables “city” and “zip” because they both have a large number of factors. Next, we need to remove “latitude” and “longitude” because these values are hard to interpret in a housing price model. Let’s run the correlations before removing this information

`corrplot(cor(Sacramento[,c(-1,-2,-6)]),method = 'number')`

There also appears to be a high correlation between “sqft” and beds and bathrooms. As such, we will remove “sqft” from the model. Below is the code for the revised variables remaining for the model.

```
sacto.clean<-Sacramento
sacto.clean[,c(1,2,5)]<-NULL
sacto.clean[,c(5,6)]<-NULL
str(sacto.clean)
```

```
## 'data.frame': 932 obs. of 4 variables:
## $ beds : int 2 3 2 2 2 3 3 3 2 3 ...
## $ baths: num 1 1 1 1 1 1 2 1 2 2 ...
## $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
## $ price: int 59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
```

We will now develop our training and testing sets

```
set.seed(502)
ind=sample(2,nrow(sacto.clean),replace=T,prob=c(.7,.3))
train<-sacto.clean[ind==1,]
test<-sacto.clean[ind==2,]
```

We need to create a grid in order to develop the many different potential models available. We have to tune three different parameters for gradient boosting, These three parameters are number of trees, interaction depth, and shrinkage. Number of trees is how many trees gradient boosting g will make, interaction depth is the number of splits, shrinkage controls the contribution of each tree and stump to the final model. We also have to determine the type of cross-validation using the “trainControl”” function. Below is the code for the grid.

```
grid<-expand.grid(.n.trees=seq(100,500,by=200),.interaction.depth=seq(1,4,by=1),.shrinkage=c(.001,.01,.1),
.n.minobsinnode=10)
control<-trainControl(method = "CV")
```

**Model Training**

We now can train our model

gbm.train<-train(price~.,data=train,method='gbm',trControl=control,tuneGrid=grid) gbm.train

Stochastic Gradient Boosting 685 samples 4 predictors No pre-processing Resampling: Cross-Validated (25 fold) Summary of sample sizes: 659, 657, 658, 657, 657, 657, ... Resampling results across tuning parameters: shrinkage interaction.depth n.trees RMSE Rsquared 0.001 1 100 128372.32 0.4850879 0.001 1 300 120272.16 0.4965552 0.001 1 500 113986.08 0.5064680 0.001 2 100 127197.20 0.5463527 0.001 2 300 117228.42 0.5524074 0.001 2 500 109634.39 0.5566431 0.001 3 100 126633.35 0.5646994 0.001 3 300 115873.67 0.5707619 0.001 3 500 107850.02 0.5732942 0.001 4 100 126361.05 0.5740655 0.001 4 300 115269.63 0.5767396 0.001 4 500 107109.99 0.5799836 0.010 1 100 103554.11 0.5286663 0.010 1 300 90114.05 0.5728993 0.010 1 500 88327.15 0.5838981 0.010 2 100 97876.10 0.5675862 0.010 2 300 88260.16 0.5864650 0.010 2 500 86773.49 0.6007150 0.010 3 100 96138.06 0.5778062 0.010 3 300 87213.34 0.5975438 0.010 3 500 86309.87 0.6072987 0.010 4 100 95260.93 0.5861798 0.010 4 300 86962.20 0.6011429 0.010 4 500 86380.39 0.6082593 0.100 1 100 86808.91 0.6022690 0.100 1 300 86081.65 0.6100963 0.100 1 500 86197.52 0.6081493 0.100 2 100 86810.97 0.6036919 0.100 2 300 87251.66 0.6042293 0.100 2 500 88396.21 0.5945206 0.100 3 100 86649.14 0.6088309 0.100 3 300 88565.35 0.5942948 0.100 3 500 89971.44 0.5849622 0.100 4 100 86922.22 0.6037571 0.100 4 300 88629.92 0.5894188 0.100 4 500 91008.39 0.5718534 Tuning parameter 'n.minobsinnode' was held constant at a value of 10 RMSE was used to select the optimal model using the smallest value. The final values used for the model were n.trees = 300, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

The printout shows you the values for each potential model. At the bottom of the printout are the recommended parameters for our model. We take the values at the bottom to create our model for the test data.

```
gbm.price<-gbm(price~.,data=train,n.trees = 300,interaction.depth = 1,
shrinkage = .1,distribution = 'gaussian')
```

**Test Model**

Now we use the test data, below we predict as well as calculate the error and make a plot.

```
gbm.test<-predict(gbm.price,newdata = test,n.trees = 300)
gbm.resid<-gbm.test-test$price
mean(gbm.resid^2)
```

`## [1] 8721772767`

`plot(gbm.test,test$price)`

The actual value for the mean squared error is relative and means nothing by its self. The plot, however, looks good and indicates that our model may be doing well. The mean squared error is only useful when comparing one model to another it does not mean much by its self.

This post will cover the use of random forest for classification. Random forest involves the use of many decision trees in the development of a classification or regression tree. The results of each individual tree is added together and the mean is used in the final classification of an example. The use of an ensemble helps in dealing with the bias-variance tradeoff.

In the example of random forest classification, we will use the “Participation” dataset from the “ecdat” package. We want to classify people by their labor participation based on the other variables available in the dataset. Below is some initial code

`library(randomForest);library(Ecdat)`

`data("Participation") str(Participation)`

```
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

For the data preparation, we need to multiple age by ten as the current values imply small children. Furthermore, we need to change the “lnnlinc” variable from the log of salary to just the regular salary. After completing these two steps, we need to split our data into training and testing sets. Below is the code

```
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
#split data
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
```

We will now create our classification model using random forest.

```
set.seed(123)
rf.lfp<-randomForest(lfp~.,data = train)
rf.lfp
```

```
##
## Call:
## randomForest(formula = lfp ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 32.39%
## Confusion matrix:
## no yes class.error
## no 248 93 0.2727273
## yes 113 182 0.3830508
```

The output is mostly self-explanatory. It includes the number of trees, number of variables at each split, error rate, and the confusion matrix. In general, are error rate is poor and we are having a hard time distinguishing between those who work and do not work based on the variables in the dataset. However, this is based on having all 500 trees in the analysis. Having this many trees is probably not necessary but we need to confirm this.

We can also plot the error by tree using the “plot” function as shown below.

`plot(rf.lfp)`

It looks as though error lowest with around 400 trees. We can confirm this using the “which.min” function and call information from “err.rate” in our model.

`which.min(rf.lfp$err.rate[,1])`

`## [1] 242`

We need 395 trees in order to reduce the error rate to its most optimal level. We will now create a new model that contains 395 trees in it.

```
rf.lfp2<-randomForest(lfp~.,data = train,ntree=395)
rf.lfp2
```

```
##
## Call:
## randomForest(formula = lfp ~ ., data = train, ntree = 395)
## Type of random forest: classification
## Number of trees: 395
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.92%
## Confusion matrix:
## no yes class.error
## no 252 89 0.2609971
## yes 114 181 0.3864407
```

The results are mostly the same. There is a small decline in error but not much to get excited about. We will now run our model on the test set.

```
rf.lfptest<-predict(rf.lfp2,newdata=test,type = 'response')
table(rf.lfptest,test$lfp)
```

```
##
## rf.lfptest no yes
## no 93 48
## yes 37 58
```

`(92+63)/(92+63+43+38) #calculate accuracy`

`## [1] 0.6567797`

Still disappointing, there is one last chart we should examine and that is the importance of each variable plot. It shows which variables are most useful in the prediction process. Below is the code.

`varImpPlot(rf.lfp2)`

This plot clearly indicates that salary (“lnnlinc”), age, and education are the strongest features for classifying by labor activity. However, the overall model is probably not useful.

**Conclusion**

This post explained and demonstrated how to conduct a random forest analysis. This form of analysis is powerful in dealing with large datasets with nonlinear relationships among the variables.

Random forest involves the process of creating multiple decision trees and the combing of their results. How this is done is through r using 2/3 of the data set to develop decision tree. This is done dozens, hundreds, or more times. Every tree made is created with a slightly different sample. The results of all these trees are then averaged together. This process of sampling is called bootstrap aggregation or bagging for short.

While the random forest algorithm is developing different samples it also randomly selects which variables to be used in each tree that is developed. By randomizing the sample and the features used in the tree, random forest is able to reduce both bias and variance in a model. In addition, random forest is robust against outliers and collinearity. Lastly, keep in mind that random forest can be used for regression and classification trees

In our example, we will use the “Participation” dataset from the “Ecdat” package. We will create a random forest regression tree to predict income of people. Below is some initial code

`library(randomForest);library(rpart);library(Ecdat)`

`data("Participation") str(Participation)`

```
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

We now need to prepare the data. We need to transform the lnnlinc from a log of salary to the actual salary. In addition, we need to multiply “age” by ten as 3.4 & 4.5 do not make any sense. Below is the code

```
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
```

Now we create our training and testing sets.

```
set.seed(123)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
```

We are now ready to create our model. Below is the code

```
set.seed(123)
rf.pros<-randomForest(lnnlinc~.,data = train)
rf.pros
```

```
##
## Call:
## randomForest(formula = lnnlinc ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 529284177
## % Var explained: 13.74
```

As you can see from calling “rf.pros” the variance explained is low at around 14%. The output also tells us how many trees were created. You have to be careful with making too many trees as this leads to overfitting. We can determine how many trees are optimal by looking at a plot and then using the “which.min”. Below is a plot of the number of trees by the mean squared error.

`plot(rf.pros)`

As you can see, as there are more trees there us less error to a certain point. It looks as though about 50 trees is enough. To confirm this guess, we used the “which.min” function. Below is the code

`which.min(rf.pros$mse)`

`## [1] 45`

We need 45 trees to have the lowest error. We will now rerun the model and add an argument called “ntrees” to indicating the number of trees we want to generate.

```
set.seed(123)
rf.pros.45<-randomForest(lnnlinc~.,data = train,ntree=45)
rf.pros.45
```

```
##
## Call:
## randomForest(formula = lnnlinc ~ ., data = train, ntree = 45)
## Type of random forest: regression
## Number of trees: 45
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 520705601
## % Var explained: 15.13
```

This model is still not great. We explain a little bit more of the variance and the error decreased slightly. We can now see which of the features in our model are the most useful by using the “varImpPlot” function. Below is the code.

`varImpPlot(rf.pros.45)`

The higher the IncNodePurity the more important the variable. AS you can see, education is most important followed by age and then the number of older children. The raw scores for each variable can be examined using the “importance” function. Below is the code.

`importance(rf.pros.45)`

```
## IncNodePurity
## lfp 16678498398
## age 66716765357
## educ 72007615063
## nyc 9337131671
## noc 31951386811
## foreign 10205305287
```

We are now ready to test our model with the test set. We will then calculate the residuals and the mean squared error

```
rf.pros.test<-predict(rf.pros.45,newdata = test)
rf.resid<-rf.pros.test-test$lnnlinc
mean(rf.resid^2)
```

`## [1] 381850711`

Remember that the mean squared error calculated here is only useful in comparison to other models. Random forest provides a way in which to remove the weaknesses of one decision tree by averaging the results of many. This form of ensemble learning is one of the more powerful algorithms in machine learning.

Classification trees are similar to regression trees except that the determinant of success is not the residual sum of squares but rather the error rate. The strange thing about classification trees is that you can you can continue to gain information in splitting the tree without necessarily improving the misclassification rate. This is done through calculating a measure of error called the Gini coefficient

Gini coefficient is calculated using the values of the accuracy and error in an equation. For example, if we have a model that is 80% accurate with a 20% error rate the Gini coefficient is calculated as follows for a single node

```
n0gini<- 1 - (((8/10)^2) -((2/10)^2))
n0gini
```

`## [1] 0.4`

Now if we split this into two nodes notice the change in the Gini coefficient

```
n1gini<-1-(((5/6)^2)-((1/7)^2))
n2gini<-1-(((3/4)^2))-((1/3)^2)
newgini<-(.8*n1gini) + (.2*n2gini)
newgini
```

`## [1] 0.3260488`

The lower the Gini coefficient the better as it measures purity. IN the example, there is no improvement in the accuracy yet there is an improvement in the Gini coefficient. Therefore, classification is about purity and not the residual sum of squares.

In this post, we will make a classification tree to predict if someone is participating in the labor market. We will do this using the “Participation” dataset from the “Ecdat” package. Below is some initial code to get started.

`library(Ecdat);library(rpart);library(partykit)`

`data(Participation) str(Participation)`

```
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

The ‘age’ feature needs to be transformed. Since it is doubtful that the survey was conducted among 4 and 5-year-olds. We need to multiply this variable by ten. In addition, the “lnnlinc” feature is the log of income and we want the actual income so we will exponentiate this information. Below is the code for these two steps.

```
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
```

We will now create our training and testing datasets with the code below.

```
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
```

We can now create our classification tree and take a look at the output

```
tree.pros<-rpart(lfp~.,data = train)
tree.pros
```

```
## n= 636
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 636 295 no (0.5361635 0.4638365)
## 2) foreign=no 471 182 no (0.6135881 0.3864119)
## 4) nyc>=0.5 99 21 no (0.7878788 0.2121212) *
## 5) nyc< 0.5 372 161 no (0.5672043 0.4327957) ## 10) age>=49.5 110 25 no (0.7727273 0.2272727) *
## 11) age< 49.5 262 126 yes (0.4809160 0.5190840) ## 22) lnnlinc>=46230.43 131 50 no (0.6183206 0.3816794)
## 44) noc>=0.5 102 34 no (0.6666667 0.3333333) *
## 45) noc< 0.5 29 13 yes (0.4482759 0.5517241) ## 90) lnnlinc>=47910.86 22 10 no (0.5454545 0.4545455)
## 180) lnnlinc< 65210.78 12 3 no (0.7500000 0.2500000) * ## 181) lnnlinc>=65210.78 10 3 yes (0.3000000 0.7000000) *
## 91) lnnlinc< 47910.86 7 1 yes (0.1428571 0.8571429) *
## 23) lnnlinc< 46230.43 131 45 yes (0.3435115 0.6564885) * ## 3) foreign=yes 165 52 yes (0.3151515 0.6848485) ## 6) lnnlinc>=56365.39 16 5 no (0.6875000 0.3125000) *
## 7) lnnlinc< 56365.39 149 41 yes (0.2751678 0.7248322) *
```

In the text above, the first split is made on the feature “foreign” which is a yes or no possibility. 471 were not foreigners will 165 were foreigners. The accuracy here is not great at 61% for those not classified as foreigners and 31% for those classified as foreigners. For the 165 that are classified as foreigners, the next split is by their income, etc. This is hard to understand. Below is an actual diagram of the text above.

`plot(as.party(tree.pros))`

We now need to determining if pruning the tree is beneficial. We do this by looking at the cost complexity. Below is the code.

`tree.pros$cptable`

```
## CP nsplit rel error xerror xstd
## 1 0.20677966 0 1.0000000 1.0000000 0.04263219
## 2 0.04632768 1 0.7932203 0.7932203 0.04122592
## 3 0.02033898 4 0.6542373 0.6677966 0.03952891
## 4 0.01016949 5 0.6338983 0.6881356 0.03985120
## 5 0.01000000 8 0.6033898 0.6915254 0.03990308
```

The “rel error” indicates that our model is bad no matter how any splits. Even with 9 splits we have an error rate of 60%. Below is a plot of the table above

`plotcp(tree.pros)`

Based on the table, we will try to prune the tree to 5 splits. The plot above provides a visual as it has the lowest error. The table indicates that a tree of five splits (row number 4) has the lowest cross-validation error (xstd). Below is the code for pruning the tree followed by a plot of the modified tree.

```
cp<-min(tree.pros$cptable[4,])
pruned.tree.pros<-prune(tree.pros,cp=cp)
plot(as.party(pruned.tree.pros))
```

IF you compare the two trees we have developed. One of the main differences is that the pruned.tree is missing the “noc” (number of older children) variable. There are also fewer splits on the income variable (lnnlinc). We can no use the pruned tree with the test data set.

```
party.pros.test<-predict(pruned.tree.pros,newdata=test,type="class")
table(party.pros.test,test$lfp)
```

```
##
## party.pros.test no yes
## no 90 41
## yes 40 65
```

Now for the accuracy

`(90+65) / (90+41+40+65)`

`## [1] 0.6567797`

This is surprisingly high compared to the results for the training set but 65% is not great, However, this is fine for a demonstration.

**Conclusion**

Classification trees are one of many useful tools available for data analysis. When developing classification trees one of the key ideas to keep in mind is the aspect of prunning as this affects the complexity of the model.

In this post, we will look at support vector machines for numeric prediction. SVM is used for both classification and numeric prediction. The advantage of SVM for numeric prediction is that SVM will automatically create higher dimensions of the features and summarizes this in the output. In other words, unlike in regression where you have to decide for yourself how to modify your features, SVM does this automatically using different kernels.

Different kernels transform the features in different ways. And the cost function determines the penalty for an example being on the wrong side of the margin developed by the kernel. Remember that SVM draws lines and separators to divide the examples. Examples on the wrong side are penalized as determined by the researcher.

Just like with regression, generally, the model with the least amount of error may be the best model. As such, the purpose of this post is to use SVM to predict income in the “Mroz” dataset from the “Ecdat” package. We will use several different kernels that will transformation the features different ways and calculate the mean-squared error to determine the most appropriate model. Below is some initial code.

`library(caret);library(e1071);library(corrplot);library(Ecdat)`

```
data(Mroz)
str(Mroz)
```

```
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
```

We need to place the factor variables next to each other as it helps in having to remove them when we need to scale the data. We must scale the data because SVM is based on distance when making calculations. If there are different scales the larger scale will have more influence on the results. Below is the code

```
mroz.scale<-Mroz[,c(17,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18)]
mroz.scale<-as.data.frame(scale(mroz.scale[,c(-1,-2)])) #remove factor variables for scaling
mroz.scale$city<-Mroz$city # add factor variable back into the dataset
mroz.scale$work<-Mroz$work # add factor variable back into the dataset
#mroz.cor<-cor(mroz.scale[,-17:-18])
#corrplot(mroz.cor,method='number', col='black')
```

Below is the code for creating the train and test datasets.

```
set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]
```

**Linear Kernel**

Our first kernel is the linear kernel. Below is the code. We use the “tune.svm” function from the “e1071” package. We set the kernel to “linear” and we pick our own values for the cost function. The numbers for the cost function can be whatever you want. Also, keep in mind that r will produce six different models because we have six different values in the “cost” argument.

The process we are using to develop the models is as follows

- Set the seed
- Develop the initial model by setting the formula, dataset, kernel, cost function, and other needed information.
- Select the best model for the test set
- Predict with the best model
- Plot the predicted and actual results
- Calculate the mean squared error

The first time we will go through this process step-by-step. However, all future models will just have the code followed by an interpretation.

```
linear.tune<-tune.svm(income~.,data=train,kernel="linear",cost = c(.001,.01,.1,1,5,10))
summary(linear.tune)
```

```
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.3492453
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.6793025 0.2285748
## 2 1e-02 0.3769298 0.1800839
## 3 1e-01 0.3500734 0.1626964
## 4 1e+00 0.3494828 0.1618478
## 5 5e+00 0.3493379 0.1611353
## 6 1e+01 0.3492453 0.1609774
```

The best model had a cost = 10 with a performance of .35. We will select the best model and use this on our test data. Below is the code.

```
best.linear<-linear.tune$best.model
tune.test<-predict(best.linear,newdata=test)
```

Now we will create a plot so we can see how well our model predicts. In addition, we will calculate the mean squared error to have an actual number of our model’s performance

`plot(tune.test,test$income)`

```
tune.test.resid<-tune.test-test$income
mean(tune.test.resid^2)
```

`## [1] 0.215056`

The model looks good in the plot. However, we cannot tell if the error number is decent until it is compared to other models

**Polynomial Kernel**

The next kernel we will use is the polynomial one. The kernel requires two parameters the degree of the polynomial (3,4,5, etc) as well as the kernel coefficient. Below is the code

```
set.seed(123)
poly.tune<-tune.svm(income~.,data = train,kernal="polynomial",degree = c(3,4,5),coef0 = c(.1,.5,1,2,3,4))
best.poly<-poly.tune$best.model
poly.test<-predict(best.poly,newdata=test)
plot(poly.test,test$income)
```

```
poly.test.resid<-poly.test-test$income
mean(poly.test.resid^2)
```

`## [1] 0.2453022`

The polynomial has an insignificant additional amount of error.

**Radial Kernel**

Next, we will use the radial kernel. One thing that is new here is the need for a parameter in the code call gamma. Below is the code.

```
set.seed(123)
rbf.tune<-tune.svm(income~.,data=train,kernel="radial",gamma = c(.1,.5,1,2,3,4))
summary(rbf.tune)
```

```
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma
## 0.1
##
## - best performance: 0.5225952
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.1 0.5225952 0.4183170
## 2 0.5 0.9743062 0.5293211
## 3 1.0 1.0475714 0.5304482
## 4 2.0 1.0582550 0.5286129
## 5 3.0 1.0590367 0.5283465
## 6 4.0 1.0591208 0.5283059
```

```
best.rbf<-rbf.tune$best.model
rbf.test<-predict(best.rbf,newdata=test)
plot(rbf.test,test$income)
```

```
rbf.test.resid<-rbf.test-test$income
mean(rbf.test.resid^2)
```

`## [1] 0.3138517`

The radial kernel is worst than the linear and polynomial kernel. However, there is not much different in the performance of the models so far.

**Sigmoid Kernel**

Next, we will try the sigmoid kernel. Sigmoid kernel relies on a “gamma” parameter and a cost function. Below is the code

```
set.seed(123)
sigmoid.tune<-tune.svm(income~., data=train,kernel="sigmoid",gamma = c(.1,.5,1,2,3,4),coef0 = c(.1,.5,1,2,3,4))
summary(sigmoid.tune)
```

```
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma coef0
## 0.1 3
##
## - best performance: 0.8759507
##
## - Detailed performance results:
## gamma coef0 error dispersion
## 1 0.1 0.1 27.0808221 6.2866615
## 2 0.5 0.1 746.9235624 129.0224096
## 3 1.0 0.1 1090.9660708 198.2993895
## 4 2.0 0.1 1317.4497885 214.7997608
## 5 3.0 0.1 1339.8455047 180.3195491
## 6 4.0 0.1 1299.7469190 201.6901577
## 7 0.1 0.5 151.6070833 38.8450961
## 8 0.5 0.5 1221.2396575 335.4320445
## 9 1.0 0.5 1225.7731007 190.7718103
## 10 2.0 0.5 1290.1784238 216.9249899
## 11 3.0 0.5 1338.1069460 223.3126800
## 12 4.0 0.5 1261.8861304 300.0001079
## 13 0.1 1.0 162.6041229 45.3216740
## 14 0.5 1.0 2276.4330973 330.1739559
## 15 1.0 1.0 2036.4791854 335.8051736
## 16 2.0 1.0 1626.4347749 290.6445164
## 17 3.0 1.0 1333.0626614 244.4424896
## 18 4.0 1.0 1343.7617925 194.2220729
## 19 0.1 2.0 19.2061993 9.6767496
## 20 0.5 2.0 2504.9271757 583.8943008
## 21 1.0 2.0 3296.8519140 542.7903751
## 22 2.0 2.0 2376.8169815 398.1458855
## 23 3.0 2.0 1949.9232179 319.6548059
## 24 4.0 2.0 1758.7879267 313.2581011
## 25 0.1 3.0 0.8759507 0.3812578
## 26 0.5 3.0 1405.9712578 389.0822797
## 27 1.0 3.0 3559.4804854 843.1905348
## 28 2.0 3.0 3159.9549029 492.6072149
## 29 3.0 3.0 2428.1144437 412.2854724
## 30 4.0 3.0 1997.4596435 372.1962595
## 31 0.1 4.0 0.9543167 0.5170661
## 32 0.5 4.0 746.4566494 201.4341061
## 33 1.0 4.0 3277.4331302 527.6037421
## 34 2.0 4.0 3643.6413379 604.2778089
## 35 3.0 4.0 2998.5102806 471.7848740
## 36 4.0 4.0 2459.7133632 439.3389369
```

```
best.sigmoid<-sigmoid.tune$best.model
sigmoid.test<-predict(best.sigmoid,newdata=test)
plot(sigmoid.test,test$income)
```

```
sigmoid.test.resid<-sigmoid.test-test$income
mean(sigmoid.test.resid^2)
```

`## [1] 0.8004045`

The sigmoid performed much worst then the other models based on the metric of error. You can further see the problems with this model in the plot above.

**Conclusion**

The final results are as follows

- Linear kernel .21
- Polynomial kernel .24
- Radial kernel .31
- Sigmoid kernel .80

Which model to select depends on the goals of the study. However, it definitely looks as though you would be picking from among the first three models. The power of SVM is the ability to use different kernels to uncover different results without having to really modify the features yourself.

In this post, we will take a look at regression trees. Regression trees use a concept called recursive partitioning. Recursive partitioning involves splitting features in a way that reduces the error the most.

The splitting is also greedy which means that the algorithm will partition the data at one point without considering how it will affect future partitions. Ignoring how a current split affects the future splits can lead to unnecessary branches with high variance and low bias.

One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can be hurt when a particular example is assigned the mean of a node. This forced assignment is a loss of data such as turning continuous variables into categorical variables.

in this post, we will use the “participation” dataset from the “ecdat” package to predict income based on the other variables in the dataset. Below is some initial code.

`library(rpart);library(partykit);library(caret);library(Ecdat)`

`data("Participation") str(Participation)`

```
## 'data.frame': 872 obs. of 7 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ lnnlinc: num 10.8 10.5 11 11.1 11.1 ...
## $ age : num 3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
## $ educ : num 8 8 9 11 12 12 8 8 12 11 ...
## $ nyc : num 1 0 0 2 0 0 0 0 0 0 ...
## $ noc : num 1 1 0 0 2 1 0 2 0 2 ...
## $ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

There are several things we need to do to make the results easier to interpret. The “age” variable needs to be multiplied by ten as it currently shows such results as 4.5, 3, etc. Common sense indicates that a four-year-old and a three-year-old is not earning an income.

In addition, we need to convert or income variable (lnnlinc) from the log of income to regular income. This will also help to understand the results. Below is the code.

```
Participation$age<-10*Participation$age #normal age
Participation$lnnlinc<-exp(Participation$lnnlinc) #actual income not log
```

The next step is to create our training and testing data sets. Below is the code.

```
set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]
```

We can now develop our model. We will also use the ‘print’ command

`reg.tree<-rpart(lnnlinc~.,data = train)`

Below is a printout of the current tree

`reg.tree`

```
## n= 636
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 636 390503700000 48405.08
## 2) educ< 11.5 473 127460900000 43446.69
## 4) educ< 9.5 335 70269440000 40758.25 ## 8) foreign=yes 129 10617380000 36016.12 * ## 9) foreign=no 206 54934520000 43727.84 * ## 5) educ>=9.5 138 48892370000 49972.98 *
## 3) educ>=11.5 163 217668400000 62793.52
## 6) age< 34.5 79 34015680000 51323.86
## 12) age< 25.5 12 984764800 34332.97 * ## 13) age>=25.5 67 28946170000 54367.01 *
## 7) age>=34.5 84 163486000000 73580.46
## 14) lfp=yes 36 23888410000 58916.66 *
## 15) lfp=no 48 126050900000 84578.31
## 30) educ< 12.5 29 86940400000 74425.51
## 60) age< 38.5 8 763764600 57390.34 * ## 61) age>=38.5 21 82970650000 80915.10
## 122) age>=44 14 34091840000 68474.57 *
## 123) age< 44 7 42378600000 105796.20 * ## 31) educ>=12.5 19 31558550000 100074.70 *
```

I will not interpret all of this but here is a brief description use numbers 2,4, and 8. If the person has less than 11.5 years of education (473 qualify) If the person has less than 9.5 years of education (335 of the 473 qualify) If the person is a foreigner (129 of the 335 qualify) than their average salary is 36,016.12 dollars.

Perhaps now you can see how some data is lost. The average salary for people in this node is 36,016.12 dollars but probably nobody earns exactly this amount

If what I said does not make sense. Here is an actual plot of the current regression tree.

`plot(as.party(reg.tree))`

The little boxes at the bottom are boxplots of that node.

**Tree modification**

We now will make modifications to the tree. We will begin by examining the cptable. Below is the code

`reg.tree$cptable`

```
## CP nsplit rel error xerror xstd
## 1 0.11619458 0 1.0000000 1.0026623 0.1666662
## 2 0.05164297 1 0.8838054 0.9139383 0.1434768
## 3 0.03469034 2 0.8321625 0.9403669 0.1443843
## 4 0.02125215 3 0.7974721 0.9387060 0.1433101
## 5 0.01933892 4 0.7762200 0.9260030 0.1442329
## 6 0.01242779 5 0.7568810 0.9097011 0.1434606
## 7 0.01208066 7 0.7320255 0.9166627 0.1433779
## 8 0.01046022 8 0.7199448 0.9100704 0.1432901
## 9 0.01000000 9 0.7094846 0.9107869 0.1427025
```

The cptable shares a lot of information. First, cp stands for cost complexity and this is the column furthest to the left. This number decreases as the tree becomes more complex. “nsplit” indicates the number of splits in the tree. “rel error” as another term for the residual sum of squares or RSS error. The ‘xerror’ and ‘xstd’ are the cross-validated average error and standard deviation of the error respectively.

One thing we can see from the cptable is that 9 splits has the lowest error but 2 splits have the lowest cross-validated error. Below we will look at a printout of the current table.

We will now make a plot of the complexity parameter to determine at what point to prune the tree. Pruning helps in removing unnecessary splits that do not improve the model much. Below is the code. The information in the plot is a visual of the “cptable”

`plotcp(reg.tree)`

It appears that a tree of size 2 is the best but this is boring. The next lowest dip is a tree of size 8. Therefore, we will prune our tree to have a size of 8 or eight splits. First, we need to create an object that contains how many splits we want. Then we use the “prune” function to make the actually modified tree.

```
cp<-min(reg.tree$cptable[8,])
pruned.reg.tree<-prune(reg.tree,cp=cp)
```

We will now make are modified tree

`plot(as.party(pruned.reg.tree))`

The only difference is the loss of the age nod for greater or less than 25.5.

**Model Test**

We can now test our tree to see how well it performs.

```
reg.tree.test<-predict(pruned.reg.tree,newdata=test)
reg.tree.resid<-reg.tree.test-test$lnnlinc
mean(reg.tree.resid^2)
```

`## [1] 431928030`

The number we calculated is the mean squared error. This number must be compared to models that are developed differently in order to assess the current model. By its self it means nothing.

**Conclusion**

This post exposed you to regression trees. This type of tree can be used to m ake numeric predictions in nonlinear data. However, with the classification comes a loss of data as the uniqueness of each example is lost when placed in a node.

K-nearest neighbor is one of many nonlinear algorithms that can be used in machine learning. By non-linear I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that naturally does not meet the assumptions of linearity.

KNN is also known as a “lazy learner”. This means that there are known coefficients or parameter estimates. When doing regression we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is used K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the number of K neighbors to use that is most accurate based on the different criteria for assessing the models.

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

`library(class);library(kknn);library(caret);library(corrplot)`

`library(reshape2);library(ggplot2);library(pROC);library(Ecdat)`

`data(Mroz) str(Mroz)`

```
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
```

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

```
Mroz$work<-NULL
mroz.melt<-melt(Mroz,id.var='city')
Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
Mroz_plots
```

From the plots, it appears there are no differences in how the variable act whether someone is from the city or not. This may be a flag that classification may not work.

We now need to scale our data otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set because it’s a factor and then reinsert it into the data set. Below is the code.

```
mroz.scale<-as.data.frame(scale(Mroz[,-16]))
mroz.scale$city<-Mroz$city
```

We will now look at our box-plots a second time but this time with scaled data.

```
mroz.scale.melt<-melt(mroz.scale,id.var="city")
mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
mroz_plot2
```

This second plot is easier to read but there is still little indication of difference.

We can now move to checking the correlations among the variables. Below is the code

```
mroz.cor<-cor(mroz.scale[,-17])
corrplot(mroz.cor,method = 'number')
```

There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem.

We will now divide our dataset into the training and testing sets

```
set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]
```

Before creating a model we need to create a grid. We do not know the value of k yet so we have to run multiple models with different values of k in order to determine this for our model. As such we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation to get a better estimate of k as well using the “trainControl” function. The code is below.

```
grid<-expand.grid(.k=seq(2,20,by=1))
control<-trainControl(method="cv")
```

Now we make our model,

```
knn.train<-train(city~.,train,method="knn",trControl=control,tuneGrid=grid)
knn.train
```

```
## k-Nearest Neighbors
##
## 540 samples
## 16 predictors
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.6000095 0.1213920
## 3 0.6368757 0.1542968
## 4 0.6424325 0.1546494
## 5 0.6386252 0.1275248
## 6 0.6329998 0.1164253
## 7 0.6589619 0.1616377
## 8 0.6663344 0.1774391
## 9 0.6663681 0.1733197
## 10 0.6609510 0.1566064
## 11 0.6664018 0.1575868
## 12 0.6682199 0.1669053
## 13 0.6572111 0.1397222
## 14 0.6719586 0.1694953
## 15 0.6571425 0.1263937
## 16 0.6664367 0.1551023
## 17 0.6719573 0.1588789
## 18 0.6608811 0.1260452
## 19 0.6590979 0.1165734
## 20 0.6609510 0.1219624
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 14.
```

R recommends that k = 16. This is based on a combination of accuracy and the kappa statistic. The kappa statistic is a measurement of the accuracy of a model while taking into account chance. We don’t have a model in the sense that we do not use the ~ sign like we do with regression. Instead, we have a train and a test set a factor variable and a number for k. This will make more sense when you see the code. Finally, we will use this information on our test dataset. We will then look at the table and the accuracy of the model.

```
knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #-17 removes the dependent variable 'city
table(knn.test,test$city)
```

```
##
## knn.test no yes
## no 19 8
## yes 61 125
```

```
prob.agree<-(15+129)/213
prob.agree
```

`## [1] 0.6760563`

Accuracy is 67% which is consistent with what we found when determining the k. We can also calculate the kappa. This done by calculating the probability and then do some subtraction and division. We already know the accuracy as we stored it in the variable “prob.agree” we now need the probability that this is by chance. Lastly, we calculate the kappa.

```
prob.chance<-((15+4)/213)*((15+65)/213)
kap<-(prob.agree-prob.chance)/(1-prob.chance)
kap
```

`## [1] 0.664827`

A kappa of .66 is actually good.

The example we just did was with unweighted k neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighing methods. “Rectangular” is unweighted and is the one that we used. The other two are “triangular” and “epanechnikov”. How these calculate the weights is beyond the scope of this post. In the code below the argument “distance” can be set to 1 for Euclidean and 2 for absolute distance.

```
kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular",
"epanechnikov"))
plot(kknn.train)
```

`kknn.train`

```
##
## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14
```

If you look at the plot you can see which value of k is the best by looking at the point that is the lowest on the graph which is right before 15. Looking at the legend it indicates that the point is the “rectangular” estimate which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although it recommends a different value for k our misclassification was about the same.

**Conclusion**

In this post, we explored both weighted and unweighted KNN. This algorithm allows you to deal with data that does not meet the assumptions of regression by ignoring the need for parameters. However, because there are no numbers really attached to the results beyond accuracy it can be difficult to explain what is happening in the model to people. As such, perhaps the biggest drawback is communicating results when using KNN.

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression only has 1.

In this post, we will go through an example of the use of elastic net using the “VietnamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis

`library(Ecdat);library(corrplot);library(caret);library(glmnet)`

`data("VietNamI") str(VietNamI)`

```
## 'data.frame': 27765 obs. of 12 variables:
## $ pharvis : num 0 0 0 1 1 0 0 0 2 3 ...
## $ lnhhexp : num 2.73 2.74 2.27 2.39 3.11 ...
## $ age : num 3.76 2.94 2.56 3.64 3.3 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
## $ married : num 1 0 0 1 1 1 1 0 1 1 ...
## $ educ : num 2 0 4 3 3 9 2 5 2 0 ...
## $ illness : num 1 1 0 1 1 0 0 0 2 1 ...
## $ injury : num 0 0 0 0 0 0 0 0 0 0 ...
## $ illdays : num 7 4 0 3 10 0 0 0 4 7 ...
## $ actdays : num 0 0 0 0 0 0 0 0 0 0 ...
## $ insurance: num 0 0 1 1 0 1 1 1 0 0 ...
## $ commune : num 192 167 76 123 148 20 40 57 49 170 ...
## - attr(*, "na.action")=Class 'omit' Named int 27734
## .. ..- attr(*, "names")= chr "27734"
```

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. The code is below.

```
p.cor<-cor(VietNamI[,-4])
corrplot.mixed(p.cor)
```

No major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

```
VietNamI$commune<-NULL
VietNamI_reduced<-VietNamI[1:1000,]
ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3))
train<-VietNamI_reduced[ind==1,]
test<-VietNamI_reduced[ind==2,]
```

We need to create a grid that will allow us to investigate different models with different combinations of alpha and lambda. This is done using the “expand.grid” function. In combination with the “seq” function below is the code

`grid<-expand.grid(.alpha=seq(0,1,by=.5),.lambda=seq(0,0.2,by=.1))`

We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the “trainControl” function” from the “caret” package. In the code below “LOOCV” stands for “leave one out cross-validation”.

`control<-trainControl(method = "LOOCV")`

We are no ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters

```
enet.train<-train(illdays~.,train,method="glmnet",trControl=control,tuneGrid=grid)
enet.train
```

```
## glmnet
##
## 694 samples
## 10 predictors
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared
## 0.0 0.0 5.229759 0.2968354
## 0.0 0.1 5.229759 0.2968354
## 0.0 0.2 5.229759 0.2968354
## 0.5 0.0 5.243919 0.2954226
## 0.5 0.1 5.225067 0.2985989
## 0.5 0.2 5.200415 0.3038821
## 1.0 0.0 5.244020 0.2954519
## 1.0 0.1 5.203973 0.3033173
## 1.0 0.2 5.182120 0.3083819
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2.
```

The output list all the possible alpha and lambda values that we set in the “grid” variable. It even tells us which combination was the best. For our purposes, the alpha will be .5 and the lambda .2. The r-square is also included.

We will set our model and run it on the test set. We have to convert the “sex” variable to a dummy variable for the “glmnet” function. We next have to make matrices for the predictor variables and a for our outcome variable “illdays”

```
train$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable
test$sex<-model.matrix( ~ sex - 1, data=test )
predictor_variables<-as.matrix(train[,-9])
days_ill<-as.matrix(train$illdays)
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)
```

We can now look at specific coefficient by using the “coef” function.

```
enet.coef<-coef(enet,lambda=.2,alpha=.5,exact=T)
enet.coef
```

```
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.304263895
## pharvis 0.532353361
## lnhhexp -0.064754000
## age 0.760864404
## sex.sexfemale 0.029612290
## sex.sexmale -0.002617404
## married 0.318639271
## educ .
## illness 3.103047473
## injury .
## actdays 0.314851347
## insurance .
```

You can see for yourself that several variables were removed from the model. Medical expenses (lnhhexp), sex, education, injury, and insurance do not play a role in the number of days ill for an individual in Vietnam.

With our model developed. We now can test it using the predict function. However, we first need to convert our test dataframe into a matrix and remove the outcome variable from it

```
test.matrix<-as.matrix(test[,-9])
enet.y<-predict(enet, newx = test.matrix, type = "response", lambda=.2,alpha=.5)
```

Let’s plot our results

`plot(enet.y)`

This does not look good. Let’s check the mean squared error

```
enet.resid<-enet.y-test$illdays
mean(enet.resid^2)
```

`## [1] 20.18134`

We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it.

```
set.seed(317)
enet.cv<-cv.glmnet(predictor_variables,days_ill,alpha=.5)
plot(enet.cv)
```

You can see that as the number of features are reduce (see the numbers on the top of the plot) the MSE increases (y-axis). In addition, as the lambda increases, there is also an increase in the error but only when the number of variables is reduced as well.

The dotted vertical lines in the plot represent the minimum MSE for a set lambda (on the left) and the one standard error from the minimum (on the right). You can extract these two lambda values using the code below.

`enet.cv$lambda.min`

`## [1] `

0.3082347

`enet.cv$lambda.1se`

`## [1] `

2.874607

We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict.

`coef(enet.cv,s="lambda.1se")`

```
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.34116947
## pharvis 0.003710399
## lnhhexp .
## age .
## sex.sexfemale .
## sex.sexmale .
## married .
## educ .
## illness 1.817479480
## injury .
## actdays .
## insurance .
```

Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information.

```
enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',lambda="lambda.1se", alpha = .5)
enet.cv.resid<-enet.y.cv-test$illdays
mean(enet.cv.resid^2)
```

`## [1] 25.47966`

A small improvement. Our model is a mess but this post served as an example of how to conduct an analysis using elastic net regression.

In this post, we will conduct an analysis using the lasso regression. Remember lasso regression will actually eliminate variables by reducing them to zero through how the shrinkage penalty can be applied.

We will use the dataset “nlschools” from the “MASS” packages to conduct our analysis. We want to see if we can predict language test scores “lang” with the other available variables. Below is some initial code to begin the analysis

`library(MASS);library(corrplot);library(glmnet)`

`data("nlschools") str(nlschools)`

```
## 'data.frame': 2287 obs. of 6 variables:
## $ lang : int 46 45 33 46 20 30 30 57 36 36 ...
## $ IQ : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
## $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ GS : int 29 29 29 29 29 29 29 29 29 29 ...
## $ SES : int 23 10 15 23 10 10 23 10 13 15 ...
## $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
```

We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.

```
nlschools$class<-NULL
p.cor<-cor(nlschools[,-5])
corrplot.mixed(p.cor)
```

No problems with collinearity. We will now setup are training and testing sets.

```
ind<-sample(2,nrow(nlschools),replace=T,prob = c(0.7,0.3))
train<-nlschools[ind==1,]
test<-nlschools[ind==2,]
```

Remember that the ‘glmnet’ function does not like factor variables. So we need to convert our “COMB” variable to a dummy variable. In addition, “glmnet” function does not like data frames so we need to make two data frames. The first will include all the predictor variables and the second we include only the outcome variable. Below is the code

```
train$COMB<-model.matrix( ~ COMB - 1, data=train ) #convert to dummy variable
test$COMB<-model.matrix( ~ COMB - 1, data=test )
predictor_variables<-as.matrix(train[,2:4])
language_score<-as.matrix(train$lang)
```

We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression.

`lasso<-glmnet(predictor_variables,language_score,family="gaussian",alpha=1)`

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

- Df = number of variables including in the model (this is always the same number in a ridge model)
- %Dev = Percent of deviance explained. The higher the better
- Lambda = The lambda used to obtain the %Dev

When you use the “print” function for a lasso model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we will use the “print” function but, I only printed the first 5 and last 5 models in order to reduce the size of the printout. Fortunately, it only took 60 models to converge.

`print(lasso)`

```
##
## Call: glmnet(x = predictor_variables, y = language_score, family = "gaussian", alpha = 1)
##
## Df %Dev Lambda
## [1,] 0 0.00000 5.47100
## [2,] 1 0.06194 4.98500
## [3,] 1 0.11340 4.54200
## [4,] 1 0.15610 4.13900
## [5,] 1 0.19150 3.77100
............................
## [55,] 3 0.39890 0.03599
## [56,] 3 0.39900 0.03280
## [57,] 3 0.39900 0.02988
## [58,] 3 0.39900 0.02723
## [59,] 3 0.39900 0.02481
## [60,] 3 0.39900 0.02261
```

The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results we can set the lambda at 0.02 because this explains the highest amount of deviance at .39.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers next to the coefficient lines refers to the actual coefficient of a particular variable as it changes from using different lambda values. Each number corresponds to a variable going from left to right in a dataframe/matrix using the “View” function. For example, 1 in the plot refers to “IQ” 2 refers to “GS” etc.

`plot(lasso,xvar="lambda",label=T)`

As you can see, as lambda increase the coefficient decrease in value. This is how regularized regression works. However, unlike ridge regression which never reduces a coefficient to zero, lasso regression does reduce a coefficient to zero. For example, coefficient 3 (SES variable) and coefficient 2 (GS variable) are reduced to zero when lambda is near 1.

You can also look at the coefficient values at a specific lambda values. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to do see the results

```
lasso.coef<-coef(lasso,s=.02,exact = T)
lasso.coef
```

```
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 9.35736325
## IQ 2.34973922
## GS -0.02766978
## SES 0.16150542
```

Results indicate that for a 1 unit increase in IQ there is a 2.41 point increase in language score. When GS (class size) goes up 1 unit there is a .03 point decrease in language score. Finally, when SES (socioeconomic status) increase 1 unit language score improves .13 point.

The second plot shows us the deviance explained on the x-axis. On the y-axis is the coefficients of the predictor variables. Below is the code

`plot(lasso,xvar='dev',label=T)`

If you look carefully, you can see that the two plots are completely opposite to each other. increasing lambda cause a decrease in the coefficients. Furthermore, increasing the fraction of deviance explained leads to an increase in the coefficient. You may remember seeing this when we used the “print”” function. As lambda became smaller there was an increase in the deviance explained.

Now, we will assess our model using the test data. We need to convert the test dataset to a matrix. Then we will use the “predict”” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.

```
test.matrix<-as.matrix(test[,2:4])
lasso.y<-predict(lasso,newx = test.matrix,type = 'response',s=.02)
plot(lasso.y,test$lang)
```

The visual looks promising. The last thing we need to do is calculated the mean squared error. By its self this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code

```
lasso.resid<-lasso.y-test$lang
mean(lasso.resid^2)
```

`## [1] 46.74314`

Knowing this number, we can, if we wanted, develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better while keeping in mind the complexity of the model.

In this post, we will conduct an analysis using ridge regression. Ridge regression is a type of regularized regression. By applying a shrinkage penalty, we are able to reduce the coefficients of many variables almost to zero while still retaining them in the model. This allows us to develop models that have many more variables in them compared to models using the best subset or stepwise regression.

In the example used in this post, we will use the “SAheart” dataset from the “ElemStatLearn” package. We want to predict systolic blood pressure (sbp) using all of the other variables available as predictors. Below is some initial code that we need to begin.

```
library(ElemStatLearn);library(car);library(corrplot)
library(leaps);library(glmnet);library(caret)
```

```
data(SAheart)
````str(SAheart)`

```
## 'data.frame': 462 obs. of 10 variables:
## $ sbp : int 160 144 118 170 134 132 142 114 114 132 ...
## $ tobacco : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
## $ ldl : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
## $ adiposity: num 23.1 28.6 32.3 38 27.8 ...
## $ famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
## $ typea : int 49 55 52 51 60 62 59 62 49 69 ...
## $ obesity : num 25.3 28.9 29.1 32 26 ...
## $ alcohol : num 97.2 2.06 3.81 24.26 57.34 ...
## $ age : int 52 63 46 58 49 45 38 58 29 53 ...
## $ chd : int 1 1 0 1 1 0 0 1 0 1 ...
```

A look at the object using the “str” function indicates that one variable “famhist” is a factor variable. The “glmnet” function that does the ridge regression analysis cannot handle factors so we need to convert this to a dummy variable. However, there are two things we need to do before this. First, we need to check the correlations to make sure there are no major issues with multicollinearity Second, we need to create our training and testing data sets. Below is the code for the correlation plot.

```
p.cor<-cor(SAheart[,-5])
corrplot.mixed(p.cor)
```

First we created a variable called “p.cor” the -5 in brackets means we removed the 5th column from the “SAheart” data set which is the factor variable “Famhist”. The correlation plot indicates that there is one strong relationship between adiposity and obesity. However, one common cut-off for collinearity is 0.8 and this value is 0.72 which is not a problem.

We will now create are training and testing sets and convert “famhist” to a dummy variable.

```
ind<-sample(2,nrow(SAheart),replace=T,prob = c(0.7,0.3))
train<-SAheart[ind==1,]
test<-SAheart[ind==2,]
train$famhist<-model.matrix( ~ famhist - 1, data=train ) #convert to dummy variable
test$famhist<-model.matrix( ~ famhist - 1, data=test )
```

We are still not done preparing our data yet. “glmnet” cannot use data frames, instead, it can only use matrices. Therefore, we now need to convert our data frames to matrices. We have to create two matrices, one with all of the predictor variables and a second with the outcome variable of blood pressure. Below is the code

```
predictor_variables<-as.matrix(train[,2:10])
blood_pressure<-as.matrix(train$sbp)
```

We are now ready to create our model. We use the “glmnet” function and insert our two matrices. The family is set to Gaussian because “blood pressure” is a continuous variable. Alpha is set to 0 as this indicates ridge regression. Below is the code

`ridge<-glmnet(predictor_variables,blood_pressure,family = 'gaussian',alpha = 0)`

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

- Df = number of variables including in the model (this is always the same number in a ridge model)
- %Dev = Percent of deviance explained. The higher the better
- Lambda = The lambda used to attain the %Dev

When you use the “print” function for a ridge model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we have the “print” function. However, I have only printed the first 5 and last 5 models in order to save space.

`print(ridge)`

```
##
## Call: glmnet(x = predictor_variables, y = blood_pressure, family = "gaussian", alpha = 0)
##
## Df %Dev Lambda
## [1,] 10 7.622e-37 7716.0000
## [2,] 10 2.135e-03 7030.0000
## [3,] 10 2.341e-03 6406.0000
## [4,] 10 2.566e-03 5837.0000
## [5,] 10 2.812e-03 5318.0000
................................
## [95,] 10 1.690e-01 1.2290
## [96,] 10 1.691e-01 1.1190
## [97,] 10 1.692e-01 1.0200
## [98,] 10 1.693e-01 0.9293
## [99,] 10 1.693e-01 0.8468
## [100,] 10 1.694e-01 0.7716
```

The results from the “print” function are useful in setting the lambda for the “test” dataset. Based on the results we can set the lambda at 0.83 because this explains the highest amount of deviance at .20.

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers refer to the actual coefficient of a particular variable. Inside the plot, each number corresponds to a variable going from left to right in a data-frame/matrix using the “View” function. For example, 1 in the plot refers to “tobacco” 2 refers to “ldl” etc. Across the top of the plot is the number of variables used in the model. Remember this number never changes when doing ridge regression.

`plot(ridge,xvar="lambda",label=T)`

As you can see, as lambda increase the coefficient decrease in value. This is how ridge regression works yet no coefficient ever goes to absolute 0.

You can also look at the coefficient values at a specific lambda value. The values are unstandardized but they provide a useful insight when determining final model selection. In the code below the lambda is set to .83 and we use the “coef” function to do this

```
ridge.coef<-coef(ridge,s=.83,exact = T)
ridge.coef
```

```
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 105.69379942
## tobacco -0.25990747
## ldl -0.13075557
## adiposity 0.29515034
## famhist.famhistAbsent 0.42532887
## famhist.famhistPresent -0.40000846
## typea -0.01799031
## obesity 0.29899976
## alcohol 0.03648850
## age 0.43555450
## chd -0.26539180
```

The second plot shows us the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code

`plot(ridge,xvar='dev',label=T)`

The two plots are completely opposite to each other. Increasing lambda cause a decrease in the coefficients while increasing the fraction of deviance explained leads to an increase in the coefficient. You can also see this when we used the “print” function. As lambda became smaller there was an increase in the deviance explained.

We now can begin testing our model on the test data set. We need to convert the test dataset to a matrix and then we will use the predict function while setting our lambda to .83 (remember a lambda of .83 explained the most of the deviance). Lastly, we will plot the results. Below is the code.

```
test.matrix<-as.matrix(test[,2:10])
ridge.y<-predict(ridge,newx = test.matrix,type = 'response',s=.83)
plot(ridge.y,test$sbp)
```

The last thing we need to do is calculated the mean squared error. By it’s self this number is useless. However, it provides a benchmark for comparing the current model with any other models you may develop. Below is the code

```
ridge.resid<-ridge.y-test$sbp
mean(ridge.resid^2)
```

`## [1] 372.4431`

Knowing this number, we can develop other models using other methods of analysis to try to reduce it as much as possible.

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least 5 core task that you need to consider when preparing to analyze data. These five tasks are

- Developing your question(s)
- Data exploration
- Developing a statistical model
- Interpreting the results
- Sharing the results

**Developing Your Question(s)**

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

**Data Exploration**

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables.

Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist time and is not easy at all. Mastery of this in many ways means being a master of data science

**Develop a Statistical Model**

Your research questions and the data exploration process helps you to determine what kind of model to develop. The factors that can affect this is whether your data is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the funniest part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s)

**Interpreting the Results**

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.

With interpretation, you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

**Communication of Results**

Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following

- The audience or stakeholders
- The actual answers to the questions
- The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

**Conclusion**

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

In this post, we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Another commonly used option is logistic regression but there are differences between logistic regression and discriminant analysis. Both LDA and QDA are used in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physic, and chemistry classes a respondent took. Below is some initial code to start our analysis.

`library(MASS);library(Ecdat)`

data("Mathlevel")

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed use the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function using a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code

```
math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]
```

Now we will make our model and it is called “lda.math” and it will include all available variables in the “math.train” dataset. Next, we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.

```
lda.math<-lda(sex~.,math.train)
lda.math
```

```
## Call:
## lda(sex ~ ., data = math.train)
##
## Prior probabilities of groups:
## male female
## 0.5986079 0.4013921
##
## Group means:
## mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male -0.10767593 0.01141838 -0.05854724 0.2070778 0.05032544
## female -0.05571153 0.05360844 -0.08967303 0.2030860 -0.01072169
## mathlevel^6 sat languageyes majoreco majoross majorns
## male -0.2214849 632.9457 0.07751938 0.3914729 0.1472868 0.1782946
## female -0.2226767 613.6416 0.19653179 0.2601156 0.1907514 0.2485549
## majorhum mathcourse physiccourse chemistcourse
## male 0.05426357 1.441860 0.7441860 1.046512
## female 0.07514451 1.421965 0.6531792 1.040462
##
## Coefficients of linear discriminants:
## LD1
## mathlevel.L 1.38456344
## mathlevel.Q 0.24285832
## mathlevel.C -0.53326543
## mathlevel^4 0.11292817
## mathlevel^5 -1.24162715
## mathlevel^6 -0.06374548
## sat -0.01043648
## languageyes 1.50558721
## majoreco -0.54528930
## majoross 0.61129797
## majorns 0.41574298
## majorhum 0.33469586
## mathcourse -0.07973960
## physiccourse -0.53174168
## chemistcourse 0.16124610
```

`plot(lda.math,type='both')`

Calling “lda.math” gives us the details of our model. It starts be indicating the prior probabilities of someone being male or female. Next is the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between male and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between male and females. However, since this is for demonstrating purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with the actual classification. Below is the code.

```
math.lda.predict<-predict(lda.math)
math.train$lda<-math.lda.predict$class
table(math.train$lda,math.train$sex)
```

```
##
## male female
## male 219 100
## female 39 73
```

`mean(math.train$lda==math.train$sex)`

`## [1] 0.6774942`

As you can see, we have a lot of misclassification happening. A large amount of false negatives which is a lot of males being classified as female. The overall accuracy is only 59% which is not much better than chance.

We will now conduct the same analysis on the test data set. Below is the code.

```
lda.math.test<-predict(lda.math,math.test)
math.test$lda<-lda.math.test$class
table(math.test$lda,math.test$sex)
```

```
##
## male female
## male 92 43
## female 23 20
```

`mean(math.test$lda==math.test$sex)`

`## [1] 0.6292135`

As you can see the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better

QDA allows for each class in the dependent variable to have its own covariance rather than a shared covariance as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

```
math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train$qda<-math.qda.predict$class
table(math.train$qda,math.train$sex)
```

```
##
## male female
## male 215 84
## female 43 89
```

`mean(math.train$qda==math.train$sex)`

`## [1] 0.7053364`

You can see there is almost no difference. Below is the code for the test data.

```
math.qda.test<-predict(math.qda.fit,math.test)
math.test$qda<-math.qda.test$class
table(math.test$qda,math.test$sex)
```

```
##
## male female
## male 91 43
## female 24 20
```

`mean(math.test$qda==math.test$sex)`

`## [1] 0.6235955`

Still disappointing. However, in this post, we reviewed linear discriminant analysis as well as learned about the use of quadratic linear discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes shared covariance in the dependent variable categories will QDA allows for each category in the dependent variable to have its own variance.

In this post, we are going to continue our analysis of the logistic regression model from the post on logistic regression in R. We need to rerun all of the code from the last post to be ready to continue. As such the code form the last post is all below

```
library(MASS);library(bestglm);library(reshape2);library(corrplot);
library(ggplot2);library(ROCR)
```

```
data(survey)
survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)
pm<-melt(survey, id.var="Sex")
ggplot(pm,aes(Sex,value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)
```

`pc<-cor(survey[,2:5])`

corrplot.mixed(pc)

`set.seed(123) ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3)) train<-survey[ind==1,] test<-survey[ind==2,] fit<-glm(Sex~.,binomial,train) exp(coef(fit))`

```
train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
```

`mean(train$predict==train$Sex)`

```
test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
```

`mean(test$predict==test$Sex)`

**Model Validation**

We will now do a K-fold cross validation in order to further see how our model is doing. We cannot use the factor variable “Sex” with the K-fold code so we need to create a dummy variable. First, we create a variable called “y” that has 123 spaces, which is the same size as the “train” dataset. Second, we fill “y” with 1 in every example that is coded “male” in the “Sex” variable.

In addition, we also need to create a new dataset and remove some variables from our prior analysis otherwise we will confuse the functions that we are going to use. We will remove “predict”, “Sex”, and “probs”

```
train$y<-rep(0,123)
train$y[train$Sex=="Male"]=1
my.cv<-train[,-8]
my.cv$Sex<-NULL
my.cv$probs<-NULL
```

We now can do our K-fold analysis. The code is complicated so you can trust it and double check on your own.

`bestglm(Xy=my.cv,IC="CV",CVArgs = list(Method="HTF",K=10,REP=1),family = binomial)`

`## Morgan-Tatar search since family is non-gaussian.`

```
## CV(K = 10, REP = 1)
## BICq equivalent for q in (6.66133814775094e-16, 0.0328567092272112)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -45.2329733 7.80146036 -5.798014 6.710501e-09
## Height 0.2615027 0.04534919 5.766425 8.097067e-09
```

The results confirm what we alreaedy knew that only the “Height” variable is valuable in predicting Sex. We will now create our new model using only the recommendation of the kfold validation analysis. Then we check the new model against the train dataset and with the test dataset. The code below is a repeat of prior code but based on the cross-validation

```
reduce.fit<-glm(Sex~Height, family=binomial,train)
train$cv.probs<-predict(reduce.fit,type='response')
train$cv.predict<-rep('Female',123)
train$cv.predict[train$cv.probs>0.5]='Male'
table(train$cv.predict,train$Sex)
```

```
##
## Female Male
## Female 61 11
## Male 7 44
```

`mean(train$cv.predict==train$Sex)`

`## [1] 0.8536585`

```
test$cv.probs<-predict(reduce.fit,test,type = 'response')
test$cv.predict<-rep('Female',46)
test$cv.predict[test$cv.probs>0.5]='Male'
table(test$cv.predict,test$Sex)
```

```
##
## Female Male
## Female 16 7
## Male 1 22
```

`mean(test$cv.predict==test$Sex)`

`## [1] 0.826087`

The results are consistent for both the train and test dataset. We are now going to create the ROC curve. This will provide a visual and the AUC number to further help us to assess our model. However, a model is only good when it is compared to another model. Therefore, we will create a really bad model in order to compare it to the original model, and the cross validated model. We will first make a bad model and store the probabilities in the “test” dataset. The bad model will use “age” to predict “Sex” which doesn’t make any sense at all. Below is the code followed by the ROC curve of the bad model.

```
bad.fit<-glm(Sex~Age,family = binomial,test)
test$bad.probs<-predict(bad.fit,type='response')
pred.bad<-prediction(test$bad.probs,test$Sex)
perf.bad<-performance(pred.bad,'tpr','fpr')
plot(perf.bad,col=1)
```

The more of a diagonal the line is the worst it is. As we can see the bad model is really bad.

What we just did with the bad model we will now repeat for the full model and the cross-validated model. As before, we need to store the prediction in a way that the ROCR package can use them. We will create a variable called “pred.full” to begin the process of graphing the original full model from the last blog post. Then we will use the “prediction” function. Next, we will create the “perf.full” variable to store the performance of the model. Notice, the arguments ‘tpr’ and ‘fpr’ for true positive rate and false positive rate. Lastly, we plot the results

```
pred.full<-prediction(test$prob,test$Sex)
perf.full<-performance(pred.full,'tpr','fpr')
plot(perf.full, col=2)
```

We repeat this process for the cross-validated model

```
pred.cv<-prediction(test$cv.probs,test$Sex)
perf.cv<-performance(pred.cv,'tpr','fpr')
plot(perf.cv,col=3)
```

Now let’s put all the different models on one plot

```
plot(perf.bad,col=1)
plot(perf.full, col=2, add=T)
plot(perf.cv,col=3,add=T)
legend(.7,.4,c("BAD","FULL","CV"), 1:3)
```

Finally, we can calculate the AUC for each model

```
auc.bad<-performance(pred.bad,'auc')
auc.bad@y.values
```

```
## [[1]]
## [1] 0.4766734
```

```
auc.full<-performance(pred.full,"auc")
auc.full@y.values
```

```
## [[1]]
## [1] 0.959432
```

```
auc.cv<-performance(pred.cv,'auc')
auc.cv@y.values
```

```
## [[1]]
## [1] 0.9107505
```

The higher the AUC the better. As such, the full model with all variables is superior to the cross-validated or bad model. This is despite the fact that there are many high correlations in the full model as well. Another point to consider is that the cross-validated model is simpler so this may be a reason to pick it over the full model. As such, the statistics provide support for choosing a model but they do not trump the ability of the research to pick based on factors beyond just numbers.

In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical dependent variables. In our example, we want to predict Sex (male or female) when using several continuous variables from the “survey” dataset in the “MASS” package.

`library(MASS);library(bestglm);library(reshape2);library(corrplot)`

```
data(survey)
?MASS::survey #explains the variables in the study
```

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove in examples that are missing data so we use the “na.omit” function for this. Below is the code

```
survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)
```

We now need to check for collinearity using the “corrplot.mixed” function form the “corrplot” package.

```
pc<-cor(survey[,2:5])
corrplot.mixed(pc)
corrplot.mixed(pc)
```

We have an extreme correlation between “We.Hnd” and “NW.Hnd” this makes sense because people’s hands are normally the same size. Since this blog post is a demonstration of logistic regression we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for. First, we need to make a variable that we call “ind” that is randomly assigned 70% of the number of rows of survey 1 and 30% 2. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is 70% train and 30% test. Below is the code

```
set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]
```

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

```
fit<-glm(Sex~.,family=binomial,train)
summary(fit)
```

```
##
## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9875 -0.5466 -0.1395 0.3834 3.4443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.42175 8.74961 -5.306 1.12e-07 ***
## Wr.Hnd -0.43499 0.66357 -0.656 0.512
## NW.Hnd 1.05633 0.70034 1.508 0.131
## Pulse -0.02406 0.02356 -1.021 0.307
## Height 0.21062 0.05208 4.044 5.26e-05 ***
## Age 0.00894 0.05368 0.167 0.868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 169.14 on 122 degrees of freedom
## Residual deviance: 81.15 on 117 degrees of freedom
## AIC: 93.15
##
## Number of Fisher Scoring iterations: 6
```

`exp(coef(fit))`

```
## (Intercept) Wr.Hnd NW.Hnd Pulse Height
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00
## Age
## 1.008980e+00
```

The results indicate that only height is useful in predicting if someone is a male or female. The second piece of code shares the odds ratios. The odds ratio tell how a one unit increase in the independent variable leads to an increase in the odds of being male in our model. For example, for every one unit increase in height there is a 1.23 increase in the odds of a particular example being male.

We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

```
train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
```

```
##
## Female Male
## Female 61 7
## Male 7 48
```

`mean(train$predict==train$Sex)`

`## [1] 0.8861789`

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

```
test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
```

```
##
## Female Male
## Female 17 3
## Male 0 26
```

`mean(test$predict==test$Sex)`

`## [1] 0.9347826`

As you can see, we do even better on the test set with an accuracy of 93.4%. Our model is looking pretty good and height is an excellent predictor of sex which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess the quality of it.

In this post, we will take a look at best subset regression. Best subset regression fits a model for all possible feature or variable combinations and the decision for the most appropriate model is made by the analyst based on judgment or some statistical criteria.

Best subset regression is an alternative to both Forward and Backward stepwise regression. Forward stepwise selection adds one variable at a time based on the lowest residual sum of squares until no more variables continue to lower the residual sum of squares. Backward stepwise regression starts with all variables in the model and removes variables one at a time. The concern with stepwise methods is they can produce biased regression coefficients, conflicting models, and inaccurate confidence intervals.

Best subset regression bypasses these weaknesses of stepwise models by creating all models possible and then allowing you to assess which variables should be included in your final model. The one drawback to best subset is that a large number of variables means a large number of potential models, which can make it difficult to make a decision among several choices.

In this post, we will use the “Fair” dataset from the “Ecdat” package to predict marital satisfaction based on age, Sex, the presence of children, years married, religiosity, education, occupation, and the number of affairs in the past year. Below is some initial code.

`library(leaps);library(Ecdat);library(car);library(lmtest)`

data(Fair)

We begin our analysis by building the initial model with all variables in it. Below is the code

```
fit<-lm(rate~.,Fair)
summary(fit)
```

```
##
## Call:
## lm(formula = rate ~ ., data = Fair)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2049 -0.6661 0.2298 0.7705 2.2292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.522875 0.358793 9.819 < 2e-16 ***
## sexmale -0.062281 0.099952 -0.623 0.53346
## age -0.009683 0.007548 -1.283 0.20005
## ym -0.019978 0.013887 -1.439 0.15079
## childyes -0.206976 0.116227 -1.781 0.07546 .
## religious 0.042142 0.037705 1.118 0.26416
## education 0.068874 0.021153 3.256 0.00119 **
## occupation -0.015606 0.029602 -0.527 0.59825
## nbaffairs -0.078812 0.013286 -5.932 5.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 592 degrees of freedom
## Multiple R-squared: 0.1405, Adjusted R-squared: 0.1289
## F-statistic: 12.1 on 8 and 592 DF, p-value: 4.487e-16
```

The initial results are already interesting even though the r-square is low. When couples have children the have less marital satisfaction than couples without children when controlling for the other factors and this is the strongest regression weight. In addition, the more education a person has there is an increase in marital satisfaction. Lastly, as the number of affairs increases there is also a decrease in marital satisfaction. Keep in mind that the “rate” variable goes from 1 to 5 with one meaning a terrible marriage to five being a great one. The mean marital satisfaction was 3.52 when controlling for the other variables.

We will now create our subset models. Below is the code.

```
sub.fit<-regsubsets(rate~.,Fair)
best.summary<-summary(sub.fit)
```

In the code above we create the sub models using the “regsubsets” function from the “leaps” package and saved it in the variable called “sub.fit”. We then saved the summary of “sub.fit” in the variable “best.summary”. We will use the “best.summary” “sub.fit variables several times to determine which model to use.

There are many different ways to assess the model. We will use the following statistical methods that come with the results from the “regsubset” function.

- Mallow’ Cp
- Bayesian Information Criteria

We will make two charts for each of the criteria above. The plot to the left will explain how many features to include in the model. The plot to the right will tell you which variables to include. It is important to note that for both of these methods, the lower the score the better the model. Below is the code for Mallow’s Cp.

```
par(mfrow=c(1,2))
plot(best.summary$cp)
plot(sub.fit,scale = "Cp")
```

The plot on the left suggests that a four feature model is the most appropriate. However, this chart does not tell me which four features. The chart on the right is read in reverse order. The high numbers are at the bottom and the low numbers are at the top when looking at the y-axis. Knowing this, we can conclude that the most appropriate variables to include in the model are age, children presence, education, and number of affairs. Below are the results using the Bayesian Information Criterion

```
par(mfrow=c(1,2))
plot(best.summary$bic)
plot(sub.fit,scale = "bic")
```

These results indicate that a three feature model is appropriate. The variables or features are years married, education, and number of affairs. Presence of children was not considered beneficial. Since our original model and Mallow’s Cp indicated that presence of children was significant we will include it for now.

Below is the code for the model based on the subset regression.

```
fit2<-lm(rate~age+child+education+nbaffairs,Fair)
summary(fit2)
```

```
##
## Call:
## lm(formula = rate ~ age + child + education + nbaffairs, data = Fair)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2172 -0.7256 0.1675 0.7856 2.2713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.861154 0.307280 12.566 < 2e-16 ***
## age -0.017440 0.005057 -3.449 0.000603 ***
## childyes -0.261398 0.103155 -2.534 0.011531 *
## education 0.058637 0.017697 3.313 0.000978 ***
## nbaffairs -0.084973 0.012830 -6.623 7.87e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.029 on 596 degrees of freedom
## Multiple R-squared: 0.1352, Adjusted R-squared: 0.1294
## F-statistic: 23.29 on 4 and 596 DF, p-value: < 2.2e-16
```

The results look ok. The older a person is the less satisfied they are with their marriage. If children are present the marriage is less satisfying. The more educated the more satisfied they are. Lastly, the higher the number of affairs indicate less marital satisfaction. However, before we get excited we need to check for collinearity and homoscedasticity. Below is the code

`vif(fit2)`

```
## age child education nbaffairs
## 1.249430 1.228733 1.023722 1.014338
```

No issues with collinearity.For vif values above 5 or 10 indicate a problem. Let’s check for homoscedasticity

```
par(mfrow=c(2,2))
plot(fit2)
```

The normal qqplot and residuals vs leverage plot can be used for locating outliers. The residual vs fitted and the scale-location plot do not look good as there appears to be a pattern in the dispersion which indicates homoscedasticity. To confirm this we will use Breusch-Pagan test from the “lmtest” package. Below is the code

`bptest(fit2)`

```
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 16.238, df = 4, p-value = 0.002716
```

There you have it. Our model violates the assumption of homoscedasticity. However, this model was developed for demonstration purpose to provide an example of subset regression.

Collecting and preparing data for analysis is the primary job of a data scientist. This experience is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in r or excel. If you created it in excel just save it as a csv and load it into r. Below is the initial code

```
library(readr)
apple <- read_csv("~/Desktop/apple.csv")
```

```
## # A tibble: 10 × 2
## weight location
##
## 1 3.2 Europe
## 2 4.2kg europee
## 3 1.3 kg U.S.
## 4 7200 grams USA
## 5 42 United States
## 6 2.3 europee
## 7 2.1kg Europe
## 8 3.1kg USA
## 9 2700 grams U.S.
## 10 24 United States
```

This a small dataset with the columns of “weight” and “location”. Here are some of the problems

- Weights are in different units
- Weights are written in different ways
- Location is not consistent

In order to have any success with data wrangling, you need to state specifically what it is you want to do. Here are our goals for this project

- Convert the “Weight variable” to a numerical variable instead of character
- Remove the text and have only numbers in the “weight variable”
- Change weights in grams to kilograms
- Convert the “location” variable to a factor variable instead of character
- Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

```
corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
```

`## [1] 3.2 4.2 1.3 7200.0 42.0 2.3 2.1 3.1 2700.0 24.0`

Here is what we did.

- We created a variable called “corrected.weight”
- We use the function “as.numeric” this makes whatever results inside it to be a numerical variable
- Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another.
- Inside “gsub” we used the argument pattern and set it to “[[alpha:]]” and “” this told r to look for any lower or uppercase letters and replace with nothing or remove it. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights into grams to kilograms so that everything is the same unit. Below is the code

```
gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
```

`## [1] 3.2 4.2 1.3 7.2 42.0 2.3 2.1 3.1 2.7 24.0`

Here is what we did

- We created a variable called “gram.error”
- We used the grep function to search are the “weight” variable in the apple data frame for input that is a digit and is 4 digits in length this is what the “[[:digit:]]{4}” argument means. We do not change any values yet we just store them in “gram.error”
- Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable.
- We tell r to save into the “corrected.weight” variable any value that is changeable according to the criteria set in “gram.error” and to divide it by 1000. Dividing it by 1000 converts the value from grams to kilograms.

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to the United States. Below is the code.

```
europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
```

```
## [1] "europe" "europe" "US" "US" "US" "europe" "europe"
## [8] "US" "US" "US"
```

The code is a little complicated to explain but in short We used the “agrep” function to tell r to search the “location” to look for values similar to our term “europe”. The other arguments provide some exceptions that r should change because the exceptions are close to the term europe. This process is repeated for the term “us”. We then store are the location variable from the “apple” dataframe in a new variable called “corrected.location” We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next, we have to make some code to deal with “United States” and apply this using the “gsub” function.

We are almost done, now we combine are two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

```
cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple
```

```
## weight location
## 1 3.2 europe
## 2 4.2 europe
## 3 1.3 US
## 4 7.2 US
## 5 42.0 US
## 6 2.3 europe
## 7 2.1 europe
## 8 3.1 US
## 9 2.7 US
## 10 24.0 US
```

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

This post will demonstrate the use of principal component analysis (PCA). PCA is useful for several reasons. One it allows you place your examples into groups similar to linear discriminant analysis but you do not need to know beforehand what the groups are. Second, PCA is used for the purpose of dimension reduction. For example, if you have 50 variables PCA can allow you to reduce this while retaining a certain threshold of variance. If you are working with a large dataset this can greatly reduce the computational time and general complexity of your models.

Keep in mind that there really is not a dependent variable as this is unsupervised learning. What you are trying to see is how different examples can

be mapped in space based on whatever independent variables are used. For our example, we will use the “Carseats” dataset from the “ISLR”. Our goal is to understand the relationship among the variables when examining the shelve location of the car seat. Below is the initial code to begin the analysis

```
library(ggplot2)
library(ISLR)
data("Carseats")
```

We first need to rearrange the data and remove the variables we are not going to use in the analysis. Below is the code.

```
Carseats1<-Carseats
Carseats1<-Carseats1[,c(1,2,3,4,5,6,8,9,7,10,11)]
Carseats1$Urban<-NULL
Carseats1$US<-NULL
```

Here is what we did 1. We made a copy of the “Carseats” data called “Careseats1” 2. We rearranged the order of the variables so that the factor variables are at the end. This will make sense later 3.We removed the “Urban” and “US” variables from the table as they will not be a part of our analysis

We will now do the PCA. We need to scale and center our data otherwise the larger numbers will have a much stronger influence on the results than smaller numbers. Fortunately, the “prcomp” function has a “scale” and a “center” argument. We will also use only the first 7 columns for the analysis as “sheveLoc” is not useful for this analysis. If we hadn’t moved “shelveLoc” to the end of the dataframe it would cause some headache. Below is the code.

```
Carseats.pca<-prcomp(Carseats1[,1:7],scale. = T,center = T)
summary(Carseats.pca)
```

```
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3315 1.1907 1.0743 0.9893 0.9260 0.80506 0.41320
## Proportion of Variance 0.2533 0.2026 0.1649 0.1398 0.1225 0.09259 0.02439
## Cumulative Proportion 0.2533 0.4558 0.6207 0.7605 0.8830 0.97561 1.00000
```

The summary of “Carseats.pca” Tells us how much of the variance each component explains. Keep in mind that the number of components is equal to the number of variables. The “proportion of variance” tells us the contribution each component makes and the “cumulative proportion”.

If your goal is dimension reduction than the number of components to keep depends on the threshold you set. For example, if you need around 90% of the variance you would keep the first 5 components. If you need 95% or more of the variance you would keep the first six. To actually use the components you would take the “Carseats.pca$x” data and move it to your data frame.

Keep in mind that the actual components have no conceptual meaning but is a numerical representation of a combination of several variables that were reduced using PCA to fewer variables such as going from 7 variables to 5 variables.

This means that PCA is great for reducing variables for prediction purpose but is much harder for explanatory studies unless you can explain what the new components represent.

For our purposes, we will keep 5 components. This means that we have reduced our dimensions from 7 to 5 while still keeping almost 90% of the variance. Graphing our results is tricky because we have 5 dimensions but the human mind can only conceptualize 3 at the best and normally 2. As such we will plot the first two components and label them by shelf location using ggplot2. Below is the code

```
scores<-as.data.frame(Carseats.pca$x)
pcaplot<-ggplot(scores,(aes(PC1,PC2,color=Carseats1$ShelveLoc)))+geom_point()
pcaplot
```

From the plot, you can see there is little separation when using the first two components of the PCA analysis. This makes sense as we can only graph to components so we are missing a lot of the variance. However, for demonstration purposes the analysis is complete.

%d bloggers like this: