
AdaBoost Classification in Python

Boosting is a machine learning technique in which multiple models are developed sequentially. Each new model tries to successfully predict what the prior models were unable to. The results are then aggregated: averaging for regression and a majority vote for classification. For classification, boosting is commonly associated with decision trees, but it can be used with any machine learning algorithm in the supervised learning context.

Since several models are developed and their results aggregated, boosting is a form of ensemble learning. An ensemble is simply a collection of models combined for a single machine-learning purpose. With boosting, the assumption is that the combination of several weak models can produce one strong and accurate model.
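
To make the idea concrete, below is a minimal sketch of a single boosting round in the style of AdaBoost. The labels and predictions are made up for illustration: misclassified observations get more weight so that the next model concentrates on them.

import numpy as np

# Illustrative data: y_true holds -1/+1 labels, y_pred a weak learner's guesses
y_true=np.array([1,1,-1,-1,1])
y_pred=np.array([1,-1,-1,-1,-1])
w=np.full(len(y_true),1/len(y_true))   # start with uniform sample weights

err=np.sum(w[y_pred!=y_true])          # weighted error of the weak learner
alpha=0.5*np.log((1-err)/err)          # the learner's weight in the final vote
w=w*np.exp(-alpha*y_true*y_pred)       # upweight the misclassified samples
w=w/w.sum()                            # renormalize the weights
print(alpha,w)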

For our purposes, we will use AdaBoost classification to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library, and our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Decision tree baseline model
  3. Hyperparameter tuning of Adaboost model
  4. AdaBoost model development

Below is the initial code.

from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

Data preparation is minimal in this situation. We will load our data and at the same time drop any rows with missing values using the .dropna() function. In addition, we will place the independent variables in a dataframe called X and the dependent variable in a series called y. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
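
Before moving on, a quick optional sanity check can confirm the shape of the data and how the classes are distributed.

print(X.shape)           # rows and columns of the predictors
print(y.value_counts())  # class balance of the status variable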

Decision Tree Baseline Model

We will make a decision tree just for purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several decision trees that differ in their depth. The depth is how many levels of splits the tree is allowed in order to purify the classification; the greater the depth, the more likely the tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    # stop once the tree cannot actually use the requested depth
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

You can see that the most accurate decision tree had a depth of 1, with an accuracy of roughly 72%. After that, there was a general decline in accuracy.

We can now determine whether the AdaBoost model is better based on whether its accuracy rises above 72%. Before we develop the AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

Hyperparameter Tuning of the AdaBoost Model

In order to tune the hyperparameters, there are several things we need to do. First, we need to instantiate our AdaBoostClassifier with some basic settings. Then we need to create our search grid with the hyperparameters. There are two hyperparameters that we will tune: the number of estimators (n_estimators) and the learning rate (learning_rate).

The number of estimators controls how many trees are developed, while the learning rate controls how much each tree contributes to the overall result. We place several candidate values for each in the grid. Once we set the arguments for the AdaBoostClassifier and the search grid, we combine all this information into an object called search. This object uses the GridSearchCV class and includes additional arguments for the scoring metric, n_jobs, and the cross-validation setup. Below is the code for all of this.

ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

We can now run the hyperparameter search and see the results. The code is below.

search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802
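
If you want more than the single best combination, GridSearchCV also stores the cross-validated score for every cell of the grid in its cv_results_ attribute. Below is one optional way to inspect the full grid as a dataframe.

results=pd.DataFrame(search.cv_results_)
print(results[['param_learning_rate','param_n_estimators','mean_test_score']])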

We can see that with the learning rate set to 0.01 and the number of estimators set to 1000, we can expect an accuracy of about 74%. This is superior to our baseline model.

AdaBoost Model

We can now run our AdaBoostClassifier with the recommended hyperparameters. Below is the code.

ada=AdaBoostClassifier(n_estimators=1000,learning_rate=0.01)
score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589

We knew we would get around 74%, and that is what we got. It is an improvement of only a few percentage points over the baseline, but depending on the context that can be a substantial difference.
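
As a side note, a fitted AdaBoost model also exposes feature importances through its feature_importances_ attribute. The sketch below refits the model on the full dataset just to inspect them; treat the numbers as a rough indication only.

ada.fit(X,y)
for name,importance in zip(X.columns,ada.feature_importances_):
    print(name,importance)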

Conclusion

In this post, we looked at how to use boosting for classification, in particular the AdaBoost algorithm. Boosting combines many sequentially developed models to determine the most accurate classification, and doing this will often improve a model's predictions.

Variable Selection in Python

A key concept in machine learning, and data science in general, is variable selection. Sometimes a dataset can have hundreds of variables available for a model. The benefit of variable selection is that it reduces the amount of useless information, or noise, in the model. Removing noise can improve the learning process and help to stabilize the estimates.

In this post, we will look at two common approaches: the univariate approach and the greedy approach. The univariate approach selects the variables most related to the dependent variable based on a metric. The greedy approach only removes a variable if getting rid of it does not affect the model's performance.

We will now move to our first example, the univariate approach, using Python. We will use the VietNamH dataset from the pydataset library. Our goal is to predict how much a family spends on medical expenses. Below is the initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression
df=data('VietNamH').dropna()

Our data is called df. If you use the head function, you will see that we need to convert several variables to dummy variables. A quick check of this is shown first, followed by the code for the conversion.
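
print(df.head())  # quick check: sex, farm, and urban are stored as strings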

df.loc[df.sex== 'female', 'sex'] = 0
df.loc[df.sex== 'male','sex'] = 1
df.loc[df.farm== 'no', 'farm'] = 0
df.loc[df.farm== 'yes','farm'] = 1
df.loc[df.urban== 'no', 'urban'] = 0
df.loc[df.urban== 'yes','urban'] = 1
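
One caution worth noting: assigning numbers into these columns with .loc leaves them with pandas' object dtype. Casting them to integers, as sketched below, ensures that scikit-learn receives truly numeric data.

df[['sex','farm','urban']]=df[['sex','farm','urban']].astype(int)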

We now need to set up our X and y datasets, as shown below.

X=df[['age','educyr','sex','hhsize','farm','urban','lnrlfood']]
y=df['lnmed']

We are now ready to actually use the univariate approach. This involves two different tools in Python. The SelectPercentile class keeps only the variables whose scores fall in a given top percentile, such as the top 25%. The f_regression scoring function checks each variable's performance in the context of regression. Below is the code to run the analysis.

selector_f=SelectPercentile(f_regression,percentile=25)
selector_f.fit(X,y)

We can now see the results using a for loop. We want the scores from our selector_f object. To do this, we set up a for loop and use the zip function to iterate over the data, placing the output in a print statement. Below is the code and output.

for n,s in zip(X,selector_f.scores_):
    print('F-score: %3.2f\t for feature %s ' % (s,n))
F-score: 62.42 for feature age
F-score: 33.86 for feature educyr
F-score: 3.17 for feature sex
F-score: 106.35 for feature hhsize
F-score: 14.82 for feature farm
F-score: 5.95 for feature urban
F-score: 97.77 for feature lnrlfood

You can see the F-score for each of the independent variables. You can decide for yourself which to include.
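
If you want to go beyond inspecting scores, the selector can also reduce the dataset directly. The short example below uses transform to keep only the top-percentile features and get_support to show which columns survived.

X_reduced=selector_f.transform(X)
print(selector_f.get_support())  # boolean mask of the retained columns
print(X_reduced.shape)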

Greedy Approach

The greedy approach only removes variables if they do not impact model performance. We are using the same dataset, so all we have to do is run the code. We need the RFECV class from sklearn's feature_selection module along with a LinearRegression object to serve as the estimator. We then create the RFECV object by setting the estimator, the cross-validation, and the scoring metric. Finally, we run the analysis and print the results. The code is below with the output.

from sklearn.feature_selection import RFECV
regression=LinearRegression()  # the estimator whose features will be pruned
select=RFECV(estimator=regression,cv=10,scoring='neg_mean_squared_error')
select.fit(X,y)
print(select.n_features_)
7

The number 7 represents how many independent variables should be included in the model. Since we only had 7 variables in total, we should include all of them.
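
For completeness, you can also see which variables RFECV keeps and how it ranks them through its support_ and ranking_ attributes, as sketched below.

print(select.support_)   # True for every variable RFECV retains
print(select.ranking_)   # rank 1 means retained; higher ranks were dropped earlier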

Conclusion

With the help of the univariate and greedy approaches, it is possible to deal with a large number of variables efficiently when developing models. The examples here involved only a handful of variables, but bear in mind that these approaches are highly scalable and useful.