Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with 100’s of variables problems have begun to arise. For example, using stepwise or best subset method with regression could take hours if not days to converge in even some of the best computers.
To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts
- Definition of regularized regression
- Ridge regression
- Lasso regression
- Elastic net regression
Regularization involves the use of a shrinkage penalty in order to reduce the residual sum of squares (RSS). This is done by selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.
The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said
RSS + λ(normalized coefficients)
The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of fitting k-1 models when k is the number of variables available (for example, 50 variables would lead 49 models!), with regularization only one model is developed for each value of lambda you specify.
Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.
Finally, regularization can help to reduce or eliminate any multicollinearity in a model. As such, the benefits of using regularization make it clear that this should be considered when working with larger datasets.
Ridge regression involves the normalization of the squared weights or as shown in the equation below
RSS + λ(normalized coefficients^2)
This is also referred to as the L2-norm. As lambda increase in value, the coefficients in the model are shrunk towards 0 but never reach 0. This is how the error is shrunk. The higher the lambda the lower the value of the coefficients as they are reduced more and more thus reducing the RSS.
The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead, the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.
Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm which is the sum of the absolute value of the coefficients or as shown in the equation below
RSS + λ(Σ|normalized coefficients|)
This shrinkage penalty will reduce a coefficient to 0 which is another way of saying that variables will be removed from the model. One problem is that highly correlated variables that need to be in your model may be removed when Lasso shrinks coefficients. This is one reason why ridge regression is still used.
Elastic net is the best of ridge and Lasso without the weaknesses of either. It combines extracts variables like Lasso and Ridge does not while also group variables like Ridge does but Lasso does not.
This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression and if alpha is set to 1 it is the same as lasso regression. If you are able to appreciate it below is the formula used for elastic net regression
(RSS + l[(1 – alpha)(S|normalized coefficients|2)/2 + alpha(S|normalized coefficients|)])/N)
As such when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model.
Regularized regression was developed as an answer to the growth in the size and number of variables in a data set today. Ridge, lasso an elastic net all provide solutions to converging over large datasets and selecting features.