Validation Set for Regression in R

Estimating error and looking for ways to reduce it is a key component of machine learning. In this post, we will look at a simple way of addressing this problem through the use of the validation set method.

The validation set method is a standard approach in model development. To put it simply, you divide your dataset into a training and a hold-out set. The model is developed on the training set and then the hold-out set is used for prediction purposes. The error rate of the hold-out set is assumed to be reflective of the test error rate.

In the example below, we will use the “Carseats” dataset from the “ISLR” package. Our goal is to predict the competitors’ price for a carseat based on the other available variables. Below is some initial code

library(ISLR)
data("Carseats")
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

We need to divide our dataset into two part. One will be the training set and the other the hold-out set. Below is the code.

set.seed(7)
train<-sample(x=400,size=200)

Now, for those who are familiar with R you know that we haven’t actually made our training set. We are going to use the “train” object to index items from the “Carseat” dataset. What we did was set the seed so that the results can be replicated. Then we used the “sample” function using two arguments “x” and “size”. X represents the number of examples in the “Carseat” dataset. Size represents how big we want the sample to be. In other words, we want a sample size of 200 of the 400 examples to be in the training set. Therefore, R will randomly select 200 numbers from 400.

We will now fit our initial model

car.lm<-lm(CompPrice ~ Income+Sales+Advertising+Population+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)

The code above should not be new. However, one unique twist is the use of the “subset” argument. What this argument does is tell R to only use rows that are in the “train” index. Next, we calculate the mean squared error.

mean((Carseats$CompPrice-predict(car.lm,Carseats))[-train]^2)
## [1] 77.13932

Here is what the code above means

  1. We took the “CompPrice” results and subtracted them from the prediction made by the “car.lm” model we developed.
  2. Used the test set which here is identified as “-train” minus means everything that is not in the “train”” index
  3. the results were squared.

The results here are the baseline comparison. We will now make two more models each with a polynomial in one of the variables. First, we will square the “income” variable

car.lm2<-lm(CompPrice ~ Income+Sales+Advertising+Population+I(Income^2)+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)
mean((Carseats$CompPrice-predict(car.lm2,Carseats))[-train]^2)
## [1] 75.68999

You can see that there is a small decrease in the MSE. Also, notice the use of the “I” function which allows us to square “income”. Now, let’s try a cubic model

car.lm3<-lm(CompPrice ~ Income+Sales+Advertising+Population+I(Income^3)+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)
mean((Carseats$CompPrice-predict(car.lm3,Carseats))[-train]^2)
## [1] 75.84575

This time there was an increase when compared to the second model. As such, higher order polynomials will probably not improve the model.

Conclusion

This post provided a simple example of assessing several different models use the validation approach. However, in practice, this approach is not used as frequently as there are so many more ways to do this now. Yet, it is still good to be familiar with a standard approach such as this.

Leave a Reply