This post will explain and provide an example of principal component regression (PCR). In principal component regression, the model constructs components that are linear combinations of the independent variables, exactly as in principal component analysis, and then regresses the dependent variable on a subset of those components. Doing this often allows you to use fewer variables in your model and can improve predictions when the independent variables are highly correlated.
Since PCR is based on principal component analysis, the component-building step is unsupervised: the dependent variable has no influence on how the components are formed (unlike partial least squares, where it does). As such, there are times when the components that are developed are not very useful for explaining the dependent variable.
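Because the components come straight from PCA, you can reproduce the idea of PCR by hand: run PCA on the predictors alone, then regress the response on the leading scores. Below is a minimal self-contained sketch with simulated data; the variable names and the choice of three components are just for illustration.
# PCR by hand: PCA is run on the predictors alone (y plays no role),
# then y is regressed on the leading component scores
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)              # 100 cases, 5 predictors
y <- drop(x %*% c(2, 0, 1, 0, 0)) + rnorm(100)   # response depends on x1 and x3
pc <- prcomp(x, scale. = TRUE)                   # unsupervised step
hand.fit <- lm(y ~ pc$x[, 1:3])                  # supervised step on 3 scores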
Our example will use the “Mroz” dataset from the “Ecdat” package. Our goal will be to predict “income” based on the other variables in the dataset. Below is the initial code.
library(pls);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
Our first step is to divide our dataset into a training set and a test set. We will do a simple 50/50 split for this demonstration.
train <- sample(c(TRUE, FALSE), nrow(Mroz), replace = TRUE) # 50/50 train/test split
test <- (!train)
In the code above we use the “sample” function to create a “train” index with one entry per row of the “Mroz” dataset: each row is randomly marked TRUE (training) or FALSE (not training). The exclamation mark then negates this logical vector, so “test” selects exactly the rows that were not assigned to the training set.
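One caution: no seed is set before “sample”, so the split, and therefore every number that follows, will change from run to run. A sketch of a reproducible version of the same split:
set.seed(777)                                   # fix the split for reproducibility
train <- sample(c(TRUE, FALSE), nrow(Mroz), replace = TRUE)
test <- !train
table(train)                                    # roughly half TRUE, half FALSE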
We are now ready to develop our model. Below is the code.
set.seed(777)
pcr.fit <- pcr(income ~ ., data = Mroz, subset = train, scale = TRUE, validation = "CV")
To make our model we use the “pcr” function from the “pls” package. The “subset” argument tells R to use the “train” vector to pick the rows of “Mroz” used for fitting. The “scale” argument standardizes each predictor to unit variance; this is important for any component-based method, because variables measured on larger scales would otherwise have a disproportionate influence on the components. Lastly, the “validation” argument turns on ten-fold cross-validation, which will help us determine the number of components to use for prediction.
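To see why scaling matters here, compare the spread of two predictors that sit on very different scales:
# Without scaling, a high-variance predictor such as "hoursh" would dominate
# the components simply because its variance is larger
sd(Mroz$hoursh)   # husband's annual hours, measured in the thousands
sd(Mroz$child6)   # number of children under 6, a small count
Below are the results of the model using the “summary” function.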
summary(pcr.fit)
## Data: X dimension: 381 17
## Y dimension: 381 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 12102 11533 11017 9863 9884 9524 9563
## adjCV 12102 11534 11011 9855 9878 9502 9596
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 9149 9133 8811 8527 7265 7234 7120
## adjCV 9126 9123 8798 8877 7199 7172 7100
## 14 comps 15 comps 16 comps 17 comps
## CV 7118 7141 6972 6992
## adjCV 7100 7123 6951 6969
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 21.359 38.71 51.99 59.67 65.66 71.20 76.28
## income 9.927 19.50 35.41 35.63 41.28 41.28 46.75
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.70 84.39 87.32 90.15 92.65 95.02 96.95
## income 47.08 50.98 51.73 68.17 68.29 68.31 68.34
## 15 comps 16 comps 17 comps
## X 98.47 99.38 100.00
## income 68.48 70.29 70.39
There is a lot of information here. The VALIDATION: RMSEP section gives the cross-validated root mean squared error of prediction for each number of components. The TRAINING section is similar to the printout of any PCA: it shows the cumulative percentage of predictor variance captured by the components, along with the percentage of variance explained in the dependent variable “income.” In this model we can explain up to about 70% of the variance in “income” if we use all 17 components.
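These variance percentages can also be pulled out programmatically with the “explvar” helper from the “pls” package, which avoids reading them off the printout:
cumsum(explvar(pcr.fit))   # cumulative % of predictor variance per component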
We can graph the cross-validated MSE using the “validationplot” function with the argument “val.type” set to “MSEP”. The code is below.
validationplot(pcr.fit,val.type = "MSEP")
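If you prefer a data-driven rule to eyeballing the plot, “pls” also provides “selectNcomp”, which picks the smallest number of components whose cross-validated error is statistically indistinguishable from the minimum; a short sketch:
# "onesigma": smallest ncomp whose CV error is within one standard error
# of the overall minimum
selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)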
How many components to pick is ultimately a judgment call; here there is almost no improvement beyond 13, so we will use 13 components in our prediction model and calculate the mean squared error on the test set.
set.seed(777)
pcr.pred<-predict(pcr.fit,Mroz[test,],ncomp=13)
mean((pcr.pred-Mroz$income[test])^2)
## [1] 48958982
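Since the MSE is in squared dollars, taking the square root puts the error back on the income scale; with the number above this works out to roughly $7,000 of income.
sqrt(mean((pcr.pred - Mroz$income[test])^2))   # test RMSE, in dollars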
The MSE is what you would use to compare this model to other models you have developed. Below is the performance of a least squares regression model.
set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 47794472
If you compare the MSEs, the least squares model performs slightly better than the PCR one. However, there are a lot of non-significant predictors in the model, as shown below.
summary(lm.fit)
##
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27646 -3337 -1387 1860 48371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.215e+04 3.987e+03 -5.556 5.35e-08 ***
## workno -3.828e+03 1.316e+03 -2.909 0.00385 **
## hoursw 3.955e+00 7.085e-01 5.582 4.65e-08 ***
## child6 5.370e+02 8.241e+02 0.652 0.51512
## child618 4.250e+02 2.850e+02 1.491 0.13673
## agew 1.962e+02 9.849e+01 1.992 0.04709 *
## educw 1.097e+02 2.276e+02 0.482 0.63013
## hearnw 9.835e+02 2.303e+02 4.270 2.50e-05 ***
## wagew 2.292e+02 2.423e+02 0.946 0.34484
## hoursh 6.386e+00 6.144e-01 10.394 < 2e-16 ***
## ageh -1.284e+01 9.762e+01 -0.132 0.89542
## educh 1.460e+02 1.592e+02 0.917 0.35982
## wageh 2.083e+03 9.930e+01 20.978 < 2e-16 ***
## educwm 1.354e+02 1.335e+02 1.014 0.31115
## educwf 1.653e+02 1.257e+02 1.315 0.18920
## unemprate -1.213e+02 1.148e+02 -1.057 0.29140
## cityyes -2.064e+02 7.905e+02 -0.261 0.79421
## experience -1.165e+02 5.393e+01 -2.159 0.03147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6729 on 363 degrees of freedom
## Multiple R-squared: 0.7039, Adjusted R-squared: 0.69
## F-statistic: 50.76 on 17 and 363 DF, p-value: < 2.2e-16
If we remove the non-significant predictors and refit the least squares model, its test MSE barely changes and remains close to that of the PCR model.
set.seed(777)
lm.fit2<-lm(income~work+hoursw+hearnw+hoursh+wageh,data=Mroz,subset=train)
lm.pred2<-predict(lm.fit2,Mroz[test,])
mean((lm.pred2-Mroz$income[test])^2)
## [1] 47968996
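To make the comparison easy to read, the three test MSEs can be collected side by side, reusing the predictions computed above:
c(pcr = mean((pcr.pred - Mroz$income[test])^2),
  lm.full = mean((lm.pred - Mroz$income[test])^2),
  lm.reduced = mean((lm.pred2 - Mroz$income[test])^2))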
Conclusion
Since the least squares model is simpler, it is probably the superior model here. PCR is strongest when there are many variables involved and when there are issues with multicollinearity.
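As a quick check on that last point, variance inflation factors flag multicollinearity directly; a sketch assuming the “car” package is installed:
library(car)
vif(lm.fit)   # rule of thumb: values well above 5-10 indicate collinearity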