Leave one out cross validation. (LOOCV) is a variation of the validation approach in that instead of splitting the dataset in half, LOOCV uses one example as the validation set and all the rest as the training set. This helps to reduce bias and randomness in the results but unfortunately, can increase variance. Remember that the goal is always to reduce the error rate which is often calculated as the mean-squared error.
In this post, we will use the “Hedonic” dataset from the “Ecdat” package to assess several different models that predict the taxes of homes In order to do this, we will also need to use the “boot” package. Below is the code.
library(Ecdat);library(boot)
data(Hedonic) str(Hedonic)
## 'data.frame': 506 obs. of 15 variables:
## $ mv : num 10.09 9.98 10.45 10.42 10.5 ...
## $ crim : num 0.00632 0.02731 0.0273 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 ...
## $ chas : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 28.9 22 22 21 21 ...
## $ rm : num 43.2 41.2 51.6 49 51.1 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 ...
## $ dis : num 1.41 1.6 1.6 1.8 1.8 ...
## $ rad : num 0 0.693 0.693 1.099 1.099 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 ...
## $ blacks : num 0.397 0.397 0.393 0.395 0.397 ...
## $ lstat : num -3 -2.39 -3.21 -3.53 -2.93 ...
## $ townid : int 1 2 2 3 3 3 4 4 4 4 ...
First, we need to develop our basic least squares regression model. We will do this with the “glm” function. This is because the “cv.glm” function (more on this later) only works when models are developed with the “glm” function. Below is the code.
tax.glm<-glm(tax ~ mv+crim+zn+indus+chas+nox+rm+age+dis+rad+ptratio+blacks+lstat, data = Hedonic)
We now need to calculate the MSE. To do this we will use the “cv.glm” function. Below is the code.
cv.error<-cv.glm(Hedonic,tax.glm)
cv.error$delta
## [1] 4536.345 4536.075
cv.error$delta contains two numbers. The first is the MSE for the training set and the second is the error for the LOOCV. As you can see the numbers are almost identical.
We will now repeat this process but with the inclusion of different polynomial models. The code for this is a little more complicated and is below.
cv.error=rep(0,5)
for (i in 1:5){
tax.loocv<-glm(tax ~ mv+poly(crim,i)+zn+indus+chas+nox+rm+poly(age,i)+dis+rad+ptratio+blacks+lstat, data = Hedonic)
cv.error[i]=cv.glm(Hedonic,tax.loocv)$delta[1]
}
cv.error
## [1] 4536.345 4515.464 4710.878 7047.097 9814.748
Here is what happen.
- First, we created an empty object called “cv.error” with five empty spots, which we will use to store information later.
- Next, we created a for loop that repeats 5 times
- Inside the for loop, we create the same regression model except we added the “poly” function in front of “age”” and also “crim”. These are the variables we want to try polynomials 1-5 one to see if it reduces the error.
- The results of the polynomial models are stored in the “cv.error” object and we specifically request the results of “delta” Finally, we printed “cv.error” to the console.
From the results, you can see that the error decreases at a second order polynomial but then increases after that. This means that high order polynomials are not beneficial generally.
Conclusion
LOOCV is another option in assessing different models and determining which is most appropriate. As such, this is a tool that is used by many data scientist.
but how do we know this cross validation is leave-one-out?