Partial least squares (PLS) regression is a form of regression that develops components from the original variables in a supervised way. In other words, the dependent variable is used to help create the new components from the original variables. As a result, the linear combinations that PLS produces help to explain both the independent and dependent variables in the model.
In this post, we will predict “income” in the “Mroz” dataset using PLS. Below is some initial code.
library(pls);library(Ecdat)
data("Mroz")
str(Mroz)
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
First, we must prepare our data by dividing it into a training and a test set, using a roughly 50/50 split of the data.
set.seed(777)
train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)
In the code above we used the “set.seed” function to ensure reproducibility. Then we created the “train” object with the “sample” function, which builds a vector of TRUE and FALSE values based on the number of rows in “Mroz”. Lastly, we created the “test” object from everything that is not in the “train” object, which is what the exclamation point does.
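If you want to double-check the split, tabulating the logical vector is a quick optional sanity check (given the seed above, 392 rows land in the training set, which matches the X dimension shown in the model summary below).
table(train) #counts of rows assigned to the test (FALSE) and training (TRUE) sets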
Now we create our model using the “plsr” function from the “pls” package and examine the results with the “summary” function. We also scale the data, since the scale of the variables affects the development of the components, and we use cross-validation. Below is the code.
set.seed(777)
pls.fit<-plsr(income~.,data=Mroz,subset=train,scale=T,validation="CV")
summary(pls.fit)
## Data: X dimension: 392 17
## Y dimension: 392 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 11218 8121 6701 6127 5952 5886 5857
## adjCV 11218 8114 6683 6108 5941 5872 5842
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5853 5849 5854 5853 5853 5852 5852
## adjCV 5837 5833 5837 5836 5836 5835 5835
## 14 comps 15 comps 16 comps 17 comps
## CV 5852 5852 5852 5852
## adjCV 5835 5835 5835 5835
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 17.04 26.64 37.18 49.16 59.63 64.63 69.13
## income 49.26 66.63 72.75 74.16 74.87 75.25 75.44
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 72.82 76.06 78.59 81.79 85.52 89.55 92.14
## income 75.49 75.51 75.51 75.52 75.52 75.52 75.52
## 15 comps 16 comps 17 comps
## X 94.88 97.62 100.00
## income 75.52 75.52 75.52
The printout includes the cross-validated root mean squared error of prediction (RMSEP) for each number of components in the VALIDATION section, as well as the variance explained in the TRAINING section. There are 17 components because there are 17 independent variables. You can see that after three or four components there is little improvement in the variance explained in the dependent variable. We can plot these results with the “validationplot” function, setting the “val.type” argument to “MSEP”. Below is the code.
validationplot(pls.fit,val.type = "MSEP")
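If you prefer numbers to a plot, the “RMSEP” function from the “pls” package returns the same cross-validated error estimates that appear in the summary above; this is just an optional way to inspect them directly.
RMSEP(pls.fit) #cross-validated RMSEP for each number of components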
Next we make predictions with our model. We use the “predict” function on the “Mroz” dataset, keeping only the rows indexed by the “test” vector, and set the number of components to three based on the plot above. Below is the code.
set.seed(777)
pls.pred<-predict(pls.fit,Mroz[test,],ncomp=3)
After this, we calculate the mean squared error. This is done by subtracting the dependent variable of the test set from our predictions, squaring the differences, and taking the mean. Below is the code.
mean((pls.pred-Mroz$income[test])^2)
## [1] 63386682
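Since this value is on a squared scale, it can be easier to interpret after taking the square root, which puts the error back in the units of income (just under 8,000 dollars here).
sqrt(mean((pls.pred-Mroz$income[test])^2)) #test RMSE in the units of income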
As you know, this number is only useful when compared to something else. Therefore, we will fit a traditional least squares regression model on the same data and compare the results.
set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 59432814
The least squares model is slightly better than our partial least squares model, but if we look at the model we see several variables that are not significant. We will remove these and see what the results are.
summary(lm.fit)
##
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20131 -2923 -1065 1670 36246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.946e+04 3.224e+03 -6.036 3.81e-09 ***
## workno -4.823e+03 1.037e+03 -4.651 4.59e-06 ***
## hoursw 4.255e+00 5.517e-01 7.712 1.14e-13 ***
## child6 -6.313e+02 6.694e+02 -0.943 0.346258
## child618 4.847e+02 2.362e+02 2.052 0.040841 *
## agew 2.782e+02 8.124e+01 3.424 0.000686 ***
## educw 1.268e+02 1.889e+02 0.671 0.502513
## hearnw 6.401e+02 1.420e+02 4.507 8.79e-06 ***
## wagew 1.945e+02 1.818e+02 1.070 0.285187
## hoursh 6.030e+00 5.342e-01 11.288 < 2e-16 ***
## ageh -9.433e+01 7.720e+01 -1.222 0.222488
## educh 1.784e+02 1.369e+02 1.303 0.193437
## wageh 2.202e+03 8.714e+01 25.264 < 2e-16 ***
## educwm -4.394e+01 1.128e+02 -0.390 0.697024
## educwf 1.392e+02 1.053e+02 1.322 0.186873
## unemprate -1.657e+02 9.780e+01 -1.694 0.091055 .
## cityyes -3.475e+02 6.686e+02 -0.520 0.603496
## experience -1.229e+02 4.490e+01 -2.737 0.006488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5668 on 374 degrees of freedom
## Multiple R-squared: 0.7552, Adjusted R-squared: 0.744
## F-statistic: 67.85 on 17 and 374 DF, p-value: < 2.2e-16
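Rather than reading the table by eye, the variables with p-values above 0.05 can also be pulled out programmatically. This is just a quick sketch using the fitted model above; it identifies the same predictors that are dropped in the reduced model that follows.
#coefficients whose p-value exceeds 0.05 in the model above
rownames(summary(lm.fit)$coefficients)[summary(lm.fit)$coefficients[,"Pr(>|t|)"]>0.05]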
set.seed(777)
lm.fit<-lm(income~work+hoursw+child618+agew+hearnw+hoursh+wageh+experience,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 57839715
As you can see, the error decreased even more, which indicates that the least squares regression model is superior to the partial least squares model here. In addition, the partial least squares model is harder to interpret because of its use of components. As such, the least squares model is the favored one.
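For a side-by-side view, the three test mean squared errors reported above can be gathered in one small table (the values are simply copied from the runs in this post).
data.frame(model=c("PLS (3 components)","OLS (all predictors)","OLS (reduced)"),
           test.MSE=c(63386682,59432814,57839715))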