A key concept in machine learning, and in data science generally, is variable selection. A dataset can have hundreds of variables that could be included in a model. The benefit of variable selection is that it reduces the amount of useless information, or noise, in the model. Removing noise can improve the learning process and help stabilize the estimates.
In this post, we will look at two common ways to do this: the univariate approach and the greedy approach. The univariate approach selects the variables that are most related to the dependent variable according to a metric. The greedy approach removes a variable only if getting rid of it does not hurt the model’s performance.
We will now move to our first example, the univariate approach, using Python. We will use the VietNamH dataset from the pydataset library. Our goal is to predict how much a family spends on medical expenses. Below is the initial code.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression
Our data is called df and is loaded with df = data('VietNamH'). If you use the head function, you will see that we need to convert several variables to dummy variables. Below is the code for doing this.
df.loc[df.sex == 'female', 'sex'] = 0
df.loc[df.sex == 'male', 'sex'] = 1
df.loc[df.farm == 'no', 'farm'] = 0
df.loc[df.farm == 'yes', 'farm'] = 1
df.loc[df.urban == 'no', 'urban'] = 0
df.loc[df.urban == 'yes', 'urban'] = 1
We now need to set up our X and y datasets as shown below.
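The original post does not show this step, so here is a sketch. The feature list matches the F-scores reported later in the post; using lnmed (log medical spending) as the target column is an assumption. A tiny synthetic frame stands in for the real VietNamH data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the VietNamH data frame; in the post, df comes
# from pydataset. Using 'lnmed' as the target name is an assumption.
rng = np.random.default_rng(0)
cols = ['age', 'educyr', 'sex', 'hhsize', 'farm', 'urban', 'lnrlfood', 'lnmed']
df = pd.DataFrame(rng.normal(size=(10, 8)), columns=cols)

# Independent variables (the seven features scored later) and the target.
X = df[['age', 'educyr', 'sex', 'hhsize', 'farm', 'urban', 'lnrlfood']]
y = df['lnmed']
```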
We are now ready to actually use the univariate approach. This involves two different tools in Python. The SelectPercentile class allows you to keep only the variables that meet a certain percentile rank, such as the top 25%. The f_regression scoring function is designed for checking a variable’s performance in the context of regression. Below is the code to run the analysis.
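The fitting code itself is missing from the original, so here is a minimal runnable sketch. Synthetic data stands in for the VietNamH frame; in the post, X and y come from the steps above.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic stand-in data: 7 features, the first of which drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = 2 * X[:, 0] + rng.normal(size=200)

# Keep only the features scoring in the top 25% by F-test.
selector_f = SelectPercentile(f_regression, percentile=25)
selector_f.fit(X, y)
X_reduced = selector_f.transform(X)
```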
We can now see the results using a for loop. We want the scores from our selector_f object. To do this, we set up a for loop and use the zip function to iterate over the data. The output is placed in the print statement. Below is the code and output for this.
for n, s in zip(X, selector_f.scores_):
    print('F-score: %3.2f\t for feature %s' % (s, n))
F-score: 62.42 for feature age
F-score: 33.86 for feature educyr
F-score: 3.17 for feature sex
F-score: 106.35 for feature hhsize
F-score: 14.82 for feature farm
F-score: 5.95 for feature urban
F-score: 97.77 for feature lnrlfood
You can see the F-score for each of the independent variables. You can decide for yourself which to include.
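For example, using the scores printed above and a hypothetical cut-off of an F-score above 10 (the threshold is arbitrary, chosen only for illustration), five of the seven variables would be kept:

```python
# F-scores copied from the output above; the cut-off of 10 is arbitrary.
scores = {'age': 62.42, 'educyr': 33.86, 'sex': 3.17, 'hhsize': 106.35,
          'farm': 14.82, 'urban': 5.95, 'lnrlfood': 97.77}
keep = [name for name, score in scores.items() if score > 10]
print(keep)  # ['age', 'educyr', 'hhsize', 'farm', 'lnrlfood']
```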
The greedy approach only removes variables if they do not impact model performance. We are using the same dataset, so all we have to do is run the code. We need the RFECV class from the feature_selection module. We then create an RFECV instance, setting the estimator, cross-validation, and scoring metric. Finally, we run the analysis and print the results. The code is below with the output.
from sklearn.feature_selection import RFECV
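The fitting step itself is not shown in the original, so here is a runnable sketch with synthetic data in place of the VietNamH frame. LinearRegression (imported in the initial code) serves as the estimator; the specific cv and scoring settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV

# Synthetic stand-in data: 7 features that all contribute to y,
# so recursive elimination should keep all of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 7))
y = X.sum(axis=1) + 0.1 * rng.normal(size=150)

# Recursively drop features, keeping the count that cross-validates best.
selector = RFECV(estimator=LinearRegression(), cv=10,
                 scoring='neg_mean_squared_error')
selector.fit(X, y)
print(selector.n_features_)
```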
The number 7 represents how many independent variables to include in the model. Since we only had 7 variables in total, we should include all of them in the model.
With the help of the univariate and greedy approaches, it is possible to deal with a large number of variables efficiently when developing models. The example here involved only a handful of variables. However, bear in mind that the approaches shown here are highly scalable and remain useful with many more.