Principal Component Regression in R

This post will explain and provide an example of principal component regression (PCR). Principal component regression involves having the model construct components from the independent variables that are a linear combination of the independent variables. This is similar to principal component analysis but the components are designed in a way to best explain the dependent variable. Doing this often allows you to use fewer variables in your model and usually improves the fit of your model as well.

Since PCR is based on principal component analysis it is an unsupervised method, which means the dependent variable has no influence on the development of the components. As such, there are times when the components that are developed may not be beneficial for explaining the dependent variable.

Our example will use the “Mroz” dataset from the “Ecdat” package. Our goal will be to predict “income” based on the variables in the dataset. Below is the initial code

library(pls);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

Our first step is to divide our dataset into a train and test set. We will do a simple 50/50 split for this demonstration.

train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)

In the code above we use the “sample” function to create a “train” index based on the number of rows in the “Mroz” dataset. Basically, R is making a vector that randomly assigns different rows in the “Mroz” dataset to be marked as True or False. Next, we use the “train” vector and we assign everything or every number that is not in the “train” vector to the test vector by using the exclamation mark.

We are now ready to develop our model. Below is the code

set.seed(777)
pcr.fit<-pcr(income~.,data=Mroz,subset=train,scale=T,validation="CV")

To make our model we use the “pcr” function from the “pls” package. The “subset” argument tells r to use the “train” vector to select examples from the “Mroz” dataset. The “scale” argument makes sure everything is measured the same way. This is important when using a component analysis tool as variables with different scale have a different influence on the components. Lastly, the “validation” argument enables cross-validation. This will help us to determine the number of components to use for prediction. Below is the results of the model using the “summary” function.

summary(pcr.fit)
## Data:    X dimension: 381 17 
##  Y dimension: 381 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           12102    11533    11017     9863     9884     9524     9563
## adjCV        12102    11534    11011     9855     9878     9502     9596
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        9149     9133     8811      8527      7265      7234      7120
## adjCV     9126     9123     8798      8877      7199      7172      7100
##        14 comps  15 comps  16 comps  17 comps
## CV         7118      7141      6972      6992
## adjCV      7100      7123      6951      6969
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        21.359    38.71    51.99    59.67    65.66    71.20    76.28
## income    9.927    19.50    35.41    35.63    41.28    41.28    46.75
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         80.70    84.39     87.32     90.15     92.65     95.02     96.95
## income    47.08    50.98     51.73     68.17     68.29     68.31     68.34
##         15 comps  16 comps  17 comps
## X          98.47     99.38    100.00
## income     68.48     70.29     70.39

There is a lot of information here.The VALIDATION: RMSEP section gives you the root mean squared error of the model broken down by component. The TRAINING section is similar the printout of any PCA but it shows the amount of cumulative variance of the components, as well as the variance, explained for the dependent variable “income.” In this model, we are able to explain up to 70% of the variance if we use all 17 components.

We can graph the MSE using the “validationplot” function with the argument “val.type” set to “MSEP”. The code is below.

validationplot(pcr.fit,val.type = "MSEP")

1

How many components to pick is subjective, however, there is almost no improvement beyond 13 so we will use 13 components in our prediction model and we will calculate the means squared error.

set.seed(777)
pcr.pred<-predict(pcr.fit,Mroz[test,],ncomp=13)
mean((pcr.pred-Mroz$income[test])^2)
## [1] 48958982

MSE is what you would use to compare this model to other models that you developed. Below is the performance of a least squares regression model

set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 47794472

If you compare the MSE the least squares model performs slightly better than the PCR one. However, there are a lot of non-significant features in the model as shown below.

summary(lm.fit)
## 
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27646  -3337  -1387   1860  48371 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.215e+04  3.987e+03  -5.556 5.35e-08 ***
## workno      -3.828e+03  1.316e+03  -2.909  0.00385 ** 
## hoursw       3.955e+00  7.085e-01   5.582 4.65e-08 ***
## child6       5.370e+02  8.241e+02   0.652  0.51512    
## child618     4.250e+02  2.850e+02   1.491  0.13673    
## agew         1.962e+02  9.849e+01   1.992  0.04709 *  
## educw        1.097e+02  2.276e+02   0.482  0.63013    
## hearnw       9.835e+02  2.303e+02   4.270 2.50e-05 ***
## wagew        2.292e+02  2.423e+02   0.946  0.34484    
## hoursh       6.386e+00  6.144e-01  10.394  < 2e-16 ***
## ageh        -1.284e+01  9.762e+01  -0.132  0.89542    
## educh        1.460e+02  1.592e+02   0.917  0.35982    
## wageh        2.083e+03  9.930e+01  20.978  < 2e-16 ***
## educwm       1.354e+02  1.335e+02   1.014  0.31115    
## educwf       1.653e+02  1.257e+02   1.315  0.18920    
## unemprate   -1.213e+02  1.148e+02  -1.057  0.29140    
## cityyes     -2.064e+02  7.905e+02  -0.261  0.79421    
## experience  -1.165e+02  5.393e+01  -2.159  0.03147 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6729 on 363 degrees of freedom
## Multiple R-squared:  0.7039, Adjusted R-squared:   0.69 
## F-statistic: 50.76 on 17 and 363 DF,  p-value: < 2.2e-16

Removing these and the MSE is almost the same for the PCR and least square models

set.seed(777)
lm.fit2<-lm(income~work+hoursw+hearnw+hoursh+wageh,data=Mroz,subset=train)
lm.pred2<-predict(lm.fit2,Mroz[test,])
mean((lm.pred2-Mroz$income[test])^2)
## [1] 47968996

Conclusion

Since the least squares model is simpler it is probably the superior model. PCR is strongest when there are a lot of variables involve and if there are issues with multicollinearity.

Leave a Reply