In this post, we will conduct an analysis using lasso regression. Remember that lasso regression can eliminate variables outright, because its shrinkage penalty can reduce a coefficient all the way to zero.
We will use the “nlschools” dataset from the “MASS” package to conduct our analysis. We want to see if we can predict language test scores (“lang”) with the other available variables. Below is some initial code to begin the analysis.
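A minimal sketch of that setup, assuming the MASS, corrplot, and glmnet packages are installed (these are the packages whose functions appear later in this post):

```r
# Assumed setup: packages used in the rest of this post
library(MASS)      # provides the nlschools dataset
library(corrplot)  # corrplot.mixed() for the correlation plot
library(glmnet)    # glmnet() for lasso regression

data("nlschools")  # load the dataset into the workspace
str(nlschools)     # inspect variable names and types
```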
## 'data.frame': 2287 obs. of 6 variables:
##  $ lang : int 46 45 33 46 20 30 30 57 36 36 ...
##  $ IQ   : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GS   : int 29 29 29 29 29 29 29 29 29 29 ...
##  $ SES  : int 23 10 15 23 10 10 23 10 13 15 ...
##  $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.
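nlschools$class <- NULL          # drop the class identifier
p.cor <- cor(nlschools[, -5])    # correlations among the numeric columns (column 5, COMB, is a factor)
corrplot.mixed(p.cor)            # plot the correlation matrix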
There are no problems with collinearity. We will now set up our training and testing sets.
ind <- sample(2, nrow(nlschools), replace = T, prob = c(0.7, 0.3))
train <- nlschools[ind == 1, ]
test <- nlschools[ind == 2, ]
Remember that the “glmnet” function does not accept factor variables, so we need to convert our “COMB” variable to a dummy variable. In addition, “glmnet” does not accept data frames, so we need to make two matrices. The first will include the predictor variables and the second will include only the outcome variable. Below is the code.
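train$COMB <- model.matrix(~ COMB - 1, data = train)  # convert to dummy variable
test$COMB <- model.matrix(~ COMB - 1, data = test)
predictor_variables <- as.matrix(train[, 2:4])  # columns 2:4 are IQ, GS, SES; the COMB dummy in column 5 is not included
language_score <- as.matrix(train$lang)         # outcome as a one-column matrix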
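To see what the dummy conversion produces, the sketch below applies `model.matrix` to the full dataset. The second half is only an assumption about how one might carry COMB into the predictor matrix. It is not what the code above does, which keeps only IQ, GS, and SES.

```r
library(MASS)  # nlschools dataset

# One indicator column per level of the factor (no intercept because of "- 1")
comb_dummies <- model.matrix(~ COMB - 1, data = nlschools)
head(comb_dummies, 3)
colnames(comb_dummies)   # "COMB0" "COMB1"

# Hypothetical alternative: a numeric predictor matrix that includes COMB.
# model.matrix() expands the factor; dropping column 1 removes the intercept.
x_with_comb <- model.matrix(lang ~ IQ + GS + SES + COMB, data = nlschools)[, -1]
colnames(x_with_comb)    # "IQ" "GS" "SES" "COMB1"
```

Because every row belongs to exactly one level, the two dummy columns always sum to 1, which is why a regression with an intercept keeps only one of them.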
We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression.
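The `glmnet` call itself appears in the `Call:` line of the printed output later in this post, and the object name `lasso` is the one used by the later `coef` and `predict` calls. The sketch below wraps that call in enough setup to run on its own. The seed value is an assumption added only for reproducibility, so the numbers it produces will not match the printout in this post.

```r
library(MASS)    # nlschools dataset
library(glmnet)  # lasso fitting

set.seed(123)    # assumption: any fixed seed, used only for reproducibility
ind   <- sample(2, nrow(nlschools), replace = TRUE, prob = c(0.7, 0.3))
train <- nlschools[ind == 1, ]

# Predictors (IQ, GS, SES) and outcome as matrices
predictor_variables <- as.matrix(train[, c("IQ", "GS", "SES")])
language_score      <- as.matrix(train$lang)

# alpha = 1 selects the lasso penalty; family = "gaussian" suits a continuous outcome
lasso <- glmnet(x = predictor_variables, y = language_score,
                family = "gaussian", alpha = 1)
print(lasso)
```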
Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.
- Df = number of variables included in the model (this is always the same number in a ridge model)
- %Dev = Deviance explained, printed as a proportion (0.399 means 39.9%). The higher the better
- Lambda = The lambda used to obtain the %Dev
When you use the “print” function for a lasso model, it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving; 100 is the default stopping point. In the output below we use the “print” function, but only the first 5 and last 6 models are shown in order to reduce the size of the printout. Fortunately, it only took 60 models to converge.
##
## Call: glmnet(x = predictor_variables, y = language_score, family = "gaussian", alpha = 1)
##
##       Df    %Dev  Lambda
##  [1,]  0 0.00000 5.47100
##  [2,]  1 0.06194 4.98500
##  [3,]  1 0.11340 4.54200
##  [4,]  1 0.15610 4.13900
##  [5,]  1 0.19150 3.77100
............................
## [55,]  3 0.39890 0.03599
## [56,]  3 0.39900 0.03280
## [57,]  3 0.39900 0.02988
## [58,]  3 0.39900 0.02723
## [59,]  3 0.39900 0.02481
## [60,]  3 0.39900 0.02261
The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results, we can set the lambda at about 0.02 because this explains the highest amount of deviance, roughly .40.
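Reading lambda off the training printout is the approach taken in this post. A commonly used alternative, which is not part of the original analysis, is to let cross-validation choose lambda with `cv.glmnet`. The sketch below is only an illustration of that alternative. It fits on the full dataset so that it runs on its own, and the seed and fold count are assumptions.

```r
library(MASS)    # nlschools dataset
library(glmnet)  # cv.glmnet() for cross-validated lasso

x <- as.matrix(nlschools[, c("IQ", "GS", "SES")])
y <- nlschools$lang

set.seed(123)  # assumption: fixed seed so the fold assignment is reproducible
cv_fit <- cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 10)

cv_fit$lambda.min  # lambda with the lowest cross-validated error
cv_fit$lambda.1se  # largest lambda within one standard error of that minimum
```

`lambda.1se` gives a somewhat more heavily penalized model than `lambda.min`, which can be preferable when a simpler model is the goal.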
The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The number next to each coefficient line identifies the variable whose coefficient is being traced as lambda changes. Each number corresponds to a variable's position, going from left to right, in the dataframe/matrix as seen with the “View” function. For example, 1 in the plot refers to “IQ”, 2 refers to “GS”, etc.
As you can see, as lambda increases the coefficients decrease in value. This is how regularized regression works. However, unlike ridge regression, which never reduces a coefficient to zero, lasso regression does reduce coefficients to zero. For example, coefficient 3 (the SES variable) and coefficient 2 (the GS variable) are reduced to zero when lambda is near 1.
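The coefficient paths described above can be drawn with the plot method that glmnet provides. This is a minimal sketch that refits the model on the full dataset so the block runs on its own (the fit in this post uses the training split). Note that glmnet draws the x-axis on a log scale of lambda.

```r
library(MASS)    # nlschools dataset
library(glmnet)  # glmnet() and its plot method

x <- as.matrix(nlschools[, c("IQ", "GS", "SES")])
y <- nlschools$lang
fit <- glmnet(x, y, family = "gaussian", alpha = 1)

# Coefficient paths against lambda; label = TRUE numbers each line
# by the variable's column position in x (1 = IQ, 2 = GS, 3 = SES)
plot(fit, xvar = "lambda", label = TRUE)
```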
You can also look at the coefficient values at a specific lambda value. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to see the results.
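# exact = TRUE refits at s = .02; recent glmnet versions also need the original x and y for this
lasso.coef <- coef(lasso, s = .02, exact = TRUE, x = predictor_variables, y = language_score)
lasso.coef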
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  9.35736325
## IQ           2.34973922
## GS          -0.02766978
## SES          0.16150542
The results indicate that for a 1-unit increase in IQ there is a 2.35-point increase in language score. When GS (class size) goes up 1 unit there is a .03-point decrease in language score. Finally, when SES (socioeconomic status) increases 1 unit, language score improves by .16 point.
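To make the interpretation concrete, the sketch below plugs the printed coefficients into the regression equation by hand. The student values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the printout above (lambda = .02)
b <- c(intercept = 9.35736325, IQ = 2.34973922, GS = -0.02766978, SES = 0.16150542)

# Hypothetical student: values chosen only for illustration
student <- c(IQ = 10, GS = 25, SES = 20)

# Prediction = intercept + sum of (coefficient * value)
predicted <- b[["intercept"]] + sum(b[names(student)] * student)
predicted

# Effect of one extra IQ point, holding GS and SES fixed
student_plus <- student
student_plus["IQ"] <- student["IQ"] + 1
predicted_plus <- b[["intercept"]] + sum(b[names(student_plus)] * student_plus)
predicted_plus - predicted   # equals the IQ coefficient
```

The difference between the two predictions is exactly the IQ coefficient, which is what "a 1-unit increase in IQ" means in the paragraph above.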
The second plot shows us the deviance explained on the x-axis. On the y-axis are the coefficients of the predictor variables. Below is the code.
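A minimal sketch of such a plot, again refitting on the full dataset so the block is self-contained (an assumption, since the fit in this post uses the training split). The last lines list lambda next to the fraction of deviance explained for the first few models on the path.

```r
library(MASS)    # nlschools dataset
library(glmnet)  # glmnet() and its plot method

x <- as.matrix(nlschools[, c("IQ", "GS", "SES")])
y <- nlschools$lang
fit <- glmnet(x, y, family = "gaussian", alpha = 1)

# Coefficient paths against the fraction of deviance explained
plot(fit, xvar = "dev", label = TRUE)

# Lambda falls along the path while the deviance explained rises
path <- cbind(lambda = fit$lambda, dev = fit$dev.ratio)
head(path)
```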
If you look carefully, you can see that the two plots are opposites of each other. Increasing lambda causes a decrease in the coefficients, while increasing the fraction of deviance explained leads to an increase in the coefficients. You may remember seeing this when we used the “print” function: as lambda became smaller, there was an increase in the deviance explained.
Now, we will assess our model using the test data. We need to convert the test dataset to a matrix. Then we will use the “predict” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.
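test.matrix <- as.matrix(test[, 2:4])   # same predictor columns as the training matrix
lasso.y <- predict(lasso, newx = test.matrix, type = 'response', s = .02)  # predictions at lambda = .02
plot(lasso.y, test$lang)                # predicted vs. observed language scores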
The visual looks promising. The last thing we need to do is calculate the mean squared error. By itself this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code.
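A minimal sketch of the calculation, with the split, fit, and prediction repeated so the block runs on its own. The seed is an assumption, so the value it prints will differ from the one reported in this post. The name `lasso.resid` is also introduced here only for illustration.

```r
library(MASS)    # nlschools dataset
library(glmnet)  # glmnet() and predict()

set.seed(123)  # assumption: fixed seed so the split is reproducible
ind   <- sample(2, nrow(nlschools), replace = TRUE, prob = c(0.7, 0.3))
train <- nlschools[ind == 1, ]
test  <- nlschools[ind == 2, ]

vars  <- c("IQ", "GS", "SES")
lasso <- glmnet(as.matrix(train[, vars]), train$lang,
                family = "gaussian", alpha = 1)

# Predict on the held-out rows at lambda = .02
lasso.y <- predict(lasso, newx = as.matrix(test[, vars]),
                   type = "response", s = .02)

# Mean squared error: average squared gap between predicted and observed scores
lasso.resid <- lasso.y - test$lang
mean(lasso.resid^2)
```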
##  46.74314
Knowing this number, we could develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better, while keeping in mind the complexity of the model.