Hyperparameters are values you must set yourself when developing a model. This is often one of the last steps of model development; choosing an algorithm and determining which variables to include usually come before this step.
Algorithms cannot determine hyperparameters themselves, which is why you have to set them. The problem is that most people have no idea what the optimal choice for a hyperparameter is. To deal with this uncertainty, a range of values is often supplied, and Python is left to determine which combination of hyperparameters is most appropriate.
In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and who is not based on several independent variables. The steps of this process are as follows.
- Data preparation
- Baseline model (for comparison)
- Grid development
- Revised model
Below is some initial code that includes all the libraries and classes that we need.
import pandas as pd
import numpy as np
from pydataset import data
# widen pandas display limits so full output is visible
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
Data Preparation
The dataset PSID has several problems that we need to address.
- We need to remove all NAs
- The married variable will be converted to a dummy variable. It will simply be coded as married (1) or not married (0) rather than keeping all of the other possible categories.
- The educatn and kids variables contain codes of 98 and 99. These rows need to be removed because the codes do not make sense as real values.
Below is the code that deals with all of this.
df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)
- Line 1 loads the dataset and drops the NAs
- Lines 2-5 create our dummy variable for marriage. We create a new variable called marry to hold the results
- Lines 6-7 drop the rows in kids and educatn with values above 90.
Below we create our X and y datasets and then are ready to make our baseline model.
X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']
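As an optional sanity check (not part of the original walkthrough), we can confirm that the cleaning worked and see how balanced the two classes are.
print(X.shape, y.shape) # rows and columns after cleaning
print(y.value_counts()) # class balance: 1 = married, 0 = not married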
Baseline Model
The purpose of the baseline model is to give us a point of comparison, so we can see how much better or worse the model performs after hyperparameter tuning. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.
- number of neighbors
- weight of neighbors
- metric for measuring distance
- power parameter for the Minkowski metric
Below is the baseline model with the hyperparameters set. The second line computes the accuracy of the model using 10-fold cross-validation.
classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform', metric='minkowski',p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6188104238047426
Our model has an accuracy of about 62%. We will now set up our grid so we can see if tuning the hyperparameters improves the performance.
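If you want a sense of how stable that estimate is (an optional check, not shown in the original post), you can inspect the individual fold scores instead of just the mean.
scores=cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1)
print(scores) # accuracy for each of the 10 folds
print(scores.std()) # spread across folds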
Grid Development
The grid allows you to develop scores of models, each with the hyperparameters tuned slightly differently. In the code below, we create our grid object and then calculate how many models we will run.
grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96
You can see we made a simple dictionary with several candidate values for each hyperparameter:
- number of neighbors can be 1 to 12
- weight of neighbors can be uniform or distance
- metric can be manhattan or minkowski
- p can be 1 or 2
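If you prefer to enumerate the combinations explicitly, scikit-learn's ParameterGrid gives the same count as the np.prod calculation above. This is an optional check, not required for the tuning itself.
from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(grid))) # 96, matching np.prod above
print(list(ParameterGrid(grid))[0]) # one example combination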
We will develop 96 models altogether. Below is the code to begin tuning the hyperparameters.
search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
The estimator is the algorithm object we set earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs controls how many processors are committed to the process. refit=True means that once the best combination is found, the model is refit on the full dataset with those hyperparameters, and cv sets the number of cross-validation folds. The search.fit command runs the search.
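A side note on refit=True: because the best model is automatically refit on the full dataset, the fitted search object can itself be used as a classifier. For example (an optional illustration, not in the original post):
print(search.predict(X.head())) # predictions from the refit best model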
The code below provides the output for the results.
print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667
The best_params_ attribute tells us which hyperparameters are most appropriate, and best_score_ tells us the accuracy of the model with those parameters. Our model accuracy improves from about 62% to 65% by adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.
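Before refitting by hand, it can also be informative to look beyond the single best model. The full grid of results is stored in cv_results_, which is easy to inspect as a DataFrame (an optional step not shown in the original post).
results=pd.DataFrame(search.cv_results_)
print(results[['params','mean_test_score','rank_test_score']].sort_values('rank_test_score').head())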
Model Revision
Below is the code for the revised model.
classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6503909993913031
Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as a data science competition.
Conclusion
Tuning hyperparameters is one of the final steps in improving a model. With this tool, small, gradual improvements can be made to a model. It is important to keep this aspect of model development in mind in order to have the best final results.