Lasso Regression with Python

Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.

Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “CaSchools” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.

  1. Data preparation
  2. Develop a baseline linear model
  3. Develop lasso regression model

The initial code is as follows

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
df=pd.DataFrame(data(‘Caschool’))

Data Preparation

The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent w=but it was decided to leave them out anyway. Below is the code.

X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]
y=df['testscr']

Baseline Model

We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.

regression=LinearRegression()
regression.fit(X,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
69.07380530137416

First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we predicted future values of our regression model and save the results to the object first_model. Lastly, we printed the results.

Below are the coefficient for the baseline regression model.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[52]:
{'teachers': 0.00010011947964873427,
'calwpct': -0.07813766458116565,
'mealpct': -0.3754719080127311,
'compstu': 11.914006268826652,
'expnstu': 0.001525630709965126,
'str': -0.19234209691788984,
'avginc': 0.6211690806021222,
'elpct': -0.19857026121348267}

The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.

Lasso Model

For our lasso model, we have to determine what value to set the l1 or alpha to prior to creating the model. This can be done with the grid function, This function allows you to assess several models with different l1 settings. Then python will tell which setting is the best. Below is the code.

lasso=Lasso(normalize=True)
search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
search.fit(X,y)

We start be instantiate lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we setup our grid, we include the estimator, and parameter grid, and scoring. The alpha is set using logspace. We want values between -5 and 2, and we want 8 evenly spaced settings for the alpha. The other arguments include cv which stands for cross-validation. n_jobs effects processing and refit updates the parameters. 

After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.

search.best_params_
Out[55]: {'alpha': 1e-05}
abs(search.best_score_)
Out[56]: 85.38831122904011

`The alpha is set almost to zero, which is the same as a regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.

lasso=Lasso(normalize=True,alpha=1e-05)
lasso.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))
print(second_model)
69.0738055527604

The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.

coef_dict_baseline = {}
for coef, feat in zip(lasso.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'teachers': 9.795933425676567e-05,
'calwpct': -0.07810938255735576,
'mealpct': -0.37548182158171706,
'compstu': 11.912164626067028,
'expnstu': 0.001525439984250718,
'str': -0.19225486069458508,
'avginc': 0.6211695477945162,
'elpct': -0.1985510490295491}

The coefficient values are also slightly different. The only difference is the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscrs. That is ironic to say the least.

Conclusion

Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python  is fairly simple. This yet another tool that can be used in statistical analysis.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.