# Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing the coefficients of variables, either by removing them entirely or by reducing their impact. Ridge regression shrinks the coefficients of problematic variables close to zero but never fully removes them.

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps:

1. Data preparation
2. Baseline model development
3. Ridge regression model

Below is the initial code

```python
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```

Data Preparation

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

```python
df = pd.DataFrame(data('VietNamI'))
df.loc[df.sex == 'male', 'sex'] = 0
df.loc[df.sex == 'female', 'sex'] = 1
df['sex'] = df['sex'].astype(int)
X = df[['pharvis', 'age', 'sex', 'married', 'educ', 'illness',
        'injury', 'illdays', 'actdays', 'insurance']]
y = df['lnhhexp']
```
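The same male/female to 0/1 conversion can also be sketched with `.map()`, shown here on a toy frame with hypothetical values standing in for the VietNamI `sex` column:

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the VietNamI data,
# showing the same male/female -> 0/1 dummy conversion via .map()
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})
toy['sex'] = toy['sex'].map({'male': 0, 'female': 1})
print(toy['sex'].tolist())  # → [0, 1, 1, 0]
```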

We can now create our baseline regression model.

Baseline Model

The metric we are using is the mean squared error. Below are the code and output for our baseline regression model, which has no regularization applied to it.

```python
regression = LinearRegression()
regression.fit(X, y)
first_model = mean_squared_error(y_true=y, y_pred=regression.predict(X))
print(first_model)
# 0.35528915032173053
```

This value of 0.355289 will be our benchmark for determining whether the regularized ridge regression model is superior or not.
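The mean squared error reported here is simply the average of the squared residuals. A short sketch on synthetic stand-in data (hypothetical, not the VietNamI dataset) confirms the identity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (the post's example uses the VietNamI dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

# mean_squared_error is just the mean of the squared residuals
mse = mean_squared_error(y, pred)
print(mse)
```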

Ridge Model

In order to create our ridge model we need to first determine the most appropriate value of alpha, the hyperparameter that controls the strength of the L2 penalty used in ridge regression. Determining the value of a hyperparameter requires the use of a grid search. In the code below, we first create our ridge model, indicating normalization in order to get better estimates. Next we set up the grid that we will use. Below is the code.

```python
# Note: the normalize= argument was removed in scikit-learn 1.2;
# newer versions standardize with StandardScaler in a Pipeline instead
ridge = Ridge(normalize=True)
search = GridSearchCV(estimator=ridge,
                      param_grid={'alpha': np.logspace(-5, 2, 8)},
                      scoring='neg_mean_squared_error',
                      n_jobs=1, refit=True, cv=10)
```

The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. `np.logspace(-5, 2, 8)` supplies the range of values we want to test: 8 values evenly spaced on a log scale from 10^-5 to 10^2. Our metric is the mean squared error. Setting refit to True means the estimator is refit on the full dataset with the best parameters once the search finishes, and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the search and then use the .best_params_ and .best_score_ attributes to determine the model's strength. Below is the code.
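To see exactly which alpha values the grid will try, `np.logspace` can be inspected on its own:

```python
import numpy as np

# 8 values evenly spaced on a log scale between 10**-5 and 10**2:
# 1e-05, 1e-04, 1e-03, 1e-02, 1e-01, 1, 10, 100
alphas = np.logspace(-5, 2, 8)
print(alphas)
```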

```python
search.fit(X, y)
search.best_params_
# {'alpha': 0.01}
abs(search.best_score_)
# 0.3801489007094425
```

The best_params_ attribute tells us what to set alpha to, which in this case is 0.01. The best_score_ attribute tells us the best cross-validated mean squared error found. In this case, the value of 0.38 is worse than what the baseline model produced. We can confirm this by fitting our model with the chosen alpha and finding the mean squared error. This is done below.

```python
ridge = Ridge(normalize=True, alpha=0.01)
ridge.fit(X, y)
second_model = mean_squared_error(y_true=y, y_pred=ridge.predict(X))
print(second_model)
# 0.35529321992606566
```

The 0.35 is lower than the 0.38 because the latter is a cross-validated score while the former is computed on the training data. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.
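To put the two models on a truly equal footing, both can be scored with cross-validation. The sketch below uses synthetic stand-in data (hypothetical); the same calls apply to the VietNamI X and y:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in data (hypothetical, not the VietNamI dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# Cross-validated MSE for both models makes the comparison fair:
# both numbers now estimate out-of-sample error
base_mse = -cross_val_score(LinearRegression(), X, y,
                            scoring='neg_mean_squared_error', cv=10).mean()
ridge_mse = -cross_val_score(Ridge(alpha=0.01), X, y,
                             scoring='neg_mean_squared_error', cv=10).mean()
print(base_mse, ridge_mse)
```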

```python
# Pair each coefficient with its predictor. Zipping against X.columns
# (rather than all of the dataset's columns, which include the target
# lnhhexp) keeps the labels aligned with the features the model was fit on.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_, X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
# {'pharvis': 0.013282050886950674, 'age': 0.06480086550467873,
#  'sex': 0.004012412278795848, 'married': -0.08739614349708981,
#  'educ': 0.075276463838362, 'illness': -0.06180921300600292,
#  'injury': 0.040870384578962596, 'illdays': -0.002763768716569026,
#  'actdays': -0.006717063310893158, 'insurance': 0.1468784364977112}

coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_, X.columns):
    coef_dict_ridge[feat] = coef
coef_dict_ridge
# {'pharvis': 0.012881937698185289, 'age': 0.06335455237380987,
#  'sex': 0.003896623321297935, 'married': -0.0846541637961565,
#  'educ': 0.07451889604357693, 'illness': -0.06098723778992694,
#  'injury': 0.039430607922053884, 'illdays': -0.002779341753010467,
#  'actdays': -0.006551280792122459, 'insurance': 0.14663287713359757}
```

The coefficient values are about the same. This means that the penalization made little difference with this dataset.
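The alpha of 0.01 chosen by the grid search applies almost no penalty, which is why the coefficients barely move. A sketch on synthetic stand-in data (hypothetical) shows that a larger alpha shrinks the coefficients toward zero without ever reaching it:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -1.5, 0.8, 0.2]) + rng.normal(scale=0.1, size=100)

# As alpha grows, ridge shrinks the coefficient vector toward zero,
# but no coefficient is ever set exactly to zero
small = Ridge(alpha=0.01).fit(X, y).coef_
large = Ridge(alpha=100.0).fit(X, y).coef_
print(np.linalg.norm(small), np.linalg.norm(large))
```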

Conclusion

Ridge regression allows you to penalize variables based on their usefulness in developing the model. With this form of regularized regression, the coefficients of the variables are never set to zero. Other forms of regularized regression allow for the total removal of variables. One example of this is lasso regression.
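By contrast, a sketch with scikit-learn's Lasso on synthetic stand-in data (hypothetical) shows the L1 penalty driving irrelevant coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (hypothetical): only the first two
# of the five features actually influence the target
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + X[:, 1] * -1.0 + rng.normal(scale=0.1, size=100)

# Lasso's L1 penalty can set weak coefficients exactly to zero,
# removing those variables entirely -- something ridge never does
lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)
```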