Hyperparameters are values you must set yourself when developing a model. This is often one of the last steps of model development; choosing an algorithm and determining which variables to include usually come before this step.
Algorithms cannot determine hyperparameters themselves, which is why you have to set them. The problem is that most people have no idea what the optimal choice for a hyperparameter is. To deal with this uncertainty, a range of values is often supplied, and Python is left to determine which combination of hyperparameters is most appropriate.
In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and who is not based on several independent variables. The steps of this process are as follows.
- Data preparation
- Baseline model (for comparison)
- Grid development
- Revised model
Below is some initial code that includes all the libraries and classes that we need.
import pandas as pd
import numpy as np
from pydataset import data
# widen pandas display limits so full output is visible
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
Data Preparation
The dataset PSID has several problems that we need to address.
- We need to remove all NAs
- The married variable will be converted to a dummy variable. It will simply be coded as married (1) or not married (0) rather than keeping all of the other possible categories.
- The educatn and kids variables contain codes of 98 and 99. These rows need to be removed because the codes do not make sense as real values.
Below is the code that deals with all of this.
df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)
- Line 1 loads the dataset and drops the NAs
- Lines 2-5 create our dummy variable for marriage. We create a new variable called marry to hold the results
- Lines 6-7 drop the rows in kids and educatn with values above 90.
Below we create our X and y datasets and then are ready to make our baseline model.
X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']
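As an optional sanity check (not part of the original walkthrough), we can confirm that the cleaning worked and see how balanced the two classes are.
print(X.shape, y.shape) # rows and columns after cleaning
print(y.value_counts()) # class balance: 1 = married, 0 = not married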
Baseline Model
The purpose of the baseline model is to give us a point of comparison, so we can see how much better or worse the model performs after hyperparameter tuning. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.
- number of neighbors
- weight of neighbors
- metric for measuring distance
- power parameter for the Minkowski metric
Below is the baseline model with the hyperparameters set. The second line computes the accuracy of the model using 10-fold cross-validation.
classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform', metric='minkowski',p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6188104238047426
Our model has an accuracy of about 62%. We will now set up our grid so we can see if tuning the hyperparameters improves the performance.
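If you want a sense of how stable that estimate is (an optional check, not shown in the original post), you can inspect the individual fold scores instead of just the mean.
scores=cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1)
print(scores) # accuracy for each of the 10 folds
print(scores.std()) # spread across folds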
Grid Development
The grid allows you to develop scores of models, each with the hyperparameters tuned slightly differently. In the code below, we create our grid object and then calculate how many models we will run.
grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96
You can see we made a simple dictionary with several candidate values for each hyperparameter:
- number of neighbors can be 1 to 12
- weight of neighbors can be uniform or distance
- metric can be manhattan or minkowski
- p can be 1 or 2
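If you prefer to enumerate the combinations explicitly, scikit-learn's ParameterGrid gives the same count as the np.prod calculation above. This is an optional check, not required for the tuning itself.
from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(grid))) # 96, matching np.prod above
print(list(ParameterGrid(grid))[0]) # one example combination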
We will develop 96 models altogether. Below is the code to begin tuning the hyperparameters.
search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
The estimator is the algorithm object we set earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs controls how many processors are committed to the process. refit=True means that once the best combination is found, the model is refit on the full dataset with those hyperparameters, and cv sets the number of cross-validation folds. The search.fit command runs the search.
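A side note on refit=True: because the best model is automatically refit on the full dataset, the fitted search object can itself be used as a classifier. For example (an optional illustration, not in the original post):
print(search.predict(X.head())) # predictions from the refit best model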
The code below provides the output for the results.
print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667
The best_params_ attribute tells us which hyperparameters are most appropriate, and best_score_ tells us the accuracy of the model with those parameters. Our model accuracy improves from about 62% to 65% by adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.
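Before refitting by hand, it can also be informative to look beyond the single best model. The full grid of results is stored in cv_results_, which is easy to inspect as a DataFrame (an optional step not shown in the original post).
results=pd.DataFrame(search.cv_results_)
print(results[['params','mean_test_score','rank_test_score']].sort_values('rank_test_score').head())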
Model Revision
Below is the code for the revised model.
classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6503909993913031
Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as a data science competition.
Conclusion
Tuning hyperparameters is one of the final steps in improving a model. With this tool, small, gradual improvements can be made to a model. It is important to keep this aspect of model development in mind in order to have the best final results.