RANSAC regression is a distinct style of regression in which the algorithm separates inliers from outliers and fits the model only on the inliers. The video below provides an overview of how it can be used in Python.
Gradient Boosting Classification with Python VIDEO
AdaBoost Regression with Python VIDEO
AdaBoost Classification with Python VIDEO
Elastic Net Regression with Python VIDEO
Lasso Regression with Python VIDEO
Ridge Regression with Python VIDEO
Ridge regression belongs to a family of techniques called regularized regression. This family of regression uses mathematical penalties to shrink or remove coefficients from a regression model. In the case of ridge, the algorithm shrinks coefficients toward zero but never actually removes variables from the model. In this video, we will focus on using this algorithm in Python rather than on the mathematical details.
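For readers who want to experiment before watching, below is a minimal sketch of ridge regression in scikit-learn; the synthetic data from make_regression and the alpha value of 1.0 are illustrative assumptions, not the example from the video.
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.datasets import make_regression

# illustrative synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# ridge coefficients are shrunk toward zero, but none are removed entirely
print(ols.coef_)
print(ridge.coef_)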
Hyper-Parameter Tuning with Python VIDEO
Hyper-parameter tuning is one way of taking your model development to the next level. Small adjustments to these settings can yield noticeable gains in performance. In the video below, we will look at tuning the hyper-parameters of a KNN model. Naturally, this tuning process can be applied to almost any algorithm.
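As a rough sketch of what this looks like in scikit-learn (the iris data and the parameter values are illustrative assumptions, not the example from the video):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# try several combinations of neighbors and weighting schemes
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, search.best_score_)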
Cross-Validation with Python VIDEO
Intro to Matplotlib with Python VIDEO
Random Forest Regression with Python VIDEO
K Nearest Neighbor Classification with Python VIDEO
Naive Bayes with Python VIDEO
K-Nearest Neighbor Regression with Python VIDEO
K-nearest neighbor regression is a simple technique that predicts a value from the values of the closest training examples. In the video below, we will look at how to use this tool with Python.
Support Vector Machines Regression with Python VIDEO
In this video, we will look at a simple example of SVM regression. In this context, regression involves predicting a continuous dependent variable, much like the basic form of regression taught in an introductory statistics class.
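A minimal sketch of SVM regression in scikit-learn is below; the synthetic data and the hyperparameter values are illustrative assumptions, not the example from the video.
from sklearn.svm import SVR
from sklearn.datasets import make_regression

# illustrative synthetic data with a continuous outcome
X, y = make_regression(n_samples=200, n_features=3, noise=5, random_state=0)

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict(X[:5]))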
Support Vector Machines Classification with Python VIDEO
Learn how to use support vector machines for classification using Python. This is another great algorithm for data science purposes.
Linear Discriminant Analysis with Python VIDEO
Linear discriminant analysis is a classification technique that is widely used in data science. In this video, we will look at an example of how to use it in Python for practical purposes.
Factor Analysis with Python VIDEO
Natural Language Process and WordClouds with Python VIDEO
KMeans Clustering with Python VIDEO
K-means clustering is an unsupervised learning technique that places data into groups determined by the algorithm. In this video, we will go step by step through the process of using this tool.
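A minimal sketch of k-means in scikit-learn is below; the blob data and the choice of three clusters are illustrative assumptions, not the example from the video.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative blob data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)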
Random Forest in Python VIDEO
Decision Trees in Python VIDEO
Principal Component Analysis with Python VIDEO
Data Visualization with Altair VIDEO

Visualizations with Altair
We are going to take a look at Altair, which is a data visualization library for Python. What is unique about Altair compared to other packages covered on this blog is that it allows for interaction.
The interactions can take place inside Jupyter, or they can be exported and loaded onto websites, as we shall see. In the past, making interactive visualizations for websites was often taught using a JavaScript library such as d3.js. D3.js works but is cumbersome for the average non-coder. Altair solves this problem, as Python is often seen as easier to work with than JavaScript.
Installing Altair
If Altair is not already installed on your computer, you can install it with one of the following commands.
pip install altair vega_datasets
OR
conda install -c conda-forge altair vega_datasets
Which of the lines above you use depends on the type of Python installation you have.
Goal
We are going to make some simple visualizations with Altair using the “Duncan” dataset from the pydataset library. If you do not have pydataset installed on your computer, you can use the commands listed above to install it; simply replace “altair vega_datasets” with “pydataset.” Below is the initial code followed by the output.
import pandas as pd
from pydataset import data
df=data("Duncan")
df.head()
In the code above, we load pandas and import “data” from the “pydataset” library. Next, we load the “Duncan” dataset as the object “df”. Lastly, we use the .head() function to take a look at the dataset. You can see in the image above what variables are available.
Our first visualization is a simple bar graph. The code is below followed by the visualization.
import altair as alt
alt.Chart(df).mark_bar().encode(
x= "type",
y = "prestige"
)

In the code above we did the following,
- Line one loads the altair library.
- Line 2 uses several functions together to make the bar graph. .Chart(df) loads the data for the plot. .mark_bar() assigns the geometric shape for the plot, which in this case is bars. Lastly, the .encode() function contains the information for the variables that will be assigned to the x and y axes. In this case we are looking at job type and prestige.
The three dots in the upper right provide options for saving or editing the plot. We will learn more about saving plots later. In addition, Altair follows the grammar of graphics for creating plots. This has been discussed in another post, but a summary of the components is below.
- Data
- Aesthetics
- Scale
- Statistical transformation
- Geometric object
- Facets
- Coordinate system
We will not deal with all of these, but we have dealt with the following:
- Data as .Chart()
- Aesthetics and Geometric object as .mark_bar()
- Coordinate system as .encode()
In our second example, we will make a scatterplot. The code and output are below.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige"
)

The code is mostly the same. We simply use .mark_circle() to indicate the type of geometric object. For .encode(), we made sure to use two continuous variables.
In the next plot, we add a categorical variable to the scatterplot by manipulating the color.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type'
)

The only change is the addition of the “color” argument, which is set to the categorical variable “type.”
It is also possible to use bubble size to represent another variable. In the plot below, we add the income variable to the plot using bubble size.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
size="income"
)

The newest addition is the “size” argument, which maps income to the plot.
You can also place plots side by side by piping. The code below makes two plots and saves them as objects. You then display both by typing the names of the objects separated by the pipe symbol (|), which you can find above the enter key on your keyboard. Below are two different plots created through this piping process.
educationPlot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
)
incomePlot=alt.Chart(df).mark_circle().encode(
x= "income",
y = "prestige",
color='type',
)
educationPlot | incomePlot

With this code you can make multiple plots. Simply keep adding pipes to make more plots.
Interaction and Saving Plots
It is also possible to make plots interactive. In the code below, we add the tooltip argument. This allows us to add an additional variable, “income,” to the chart. When the mouse hovers over a data point, the income will display.
However, since we are in a browser right now, this will not work unless we save the chart as an HTML file. The last line of code saves the plot as an HTML file and renders it using SVG. We also remove the three dots in the upper right corner by adding 'actions':False. Below is the code and the plot once the HTML was loaded to this blog.
interact_plot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
tooltip=["income"]
)
interact_plot.save('interact_plot.html',embed_options={'renderer':'svg','actions':False})
I’ve made a lot of visuals in the past, and never has it been this simple.
Conclusion
Altair is another tool for visualizations. This may be the easiest way to make complex and interactive charts that I have seen. As such, it is a great option when you need to visualize data quickly.

Random Forest Classification with Python
Random forest is a machine learning algorithm that builds as many decision trees as you specify, with each tree potentially trained on a different subsample of the data and a different subset of the features. The trees then vote to determine the class of an example. This approach helps to deal with the high variance that is a problem when making only one decision tree.
In this post, we will learn how to develop a random forest model in Python. We will use the cancer dataset from the pydataset module to classify whether a person's status is censored or dead based on several independent variables. The steps we need to perform to complete this task are listed below.
- Data preparation
- Model development and evaluation
Data Preparation
Below are some initial modules we need to complete all of the tasks for this project.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
We will now load our dataset “Cancer” and drop any rows that contain NA using the .dropna() function.
df = data('cancer')
df=df.dropna()
Next, we need to separate our independent variables from our dependent variable. We will do this by making two datasets. The X dataset will contain all of our independent variables, and the y dataset will contain our dependent variable. You can check the documentation for the dataset using the code data("Cancer", show_doc=True).
Before we make the y dataset we need to change the numerical values in the status variable to text. Doing this will aid in the interpretation of the results. If you look at the documentation of the dataset you will see that a 1 in the status variable means censored while a 2 means dead. We will change the 1 to censored and the 2 to dead when we make the y dataset. This involves the use of the .replace() function. The code is below.
X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
We can now proceed to model development.
Model Development and Evaluation
We will first make our train and test datasets. We will use a 70/30 split. Next, we initialize the actual random forest classifier. There are many options that can be set. For our purposes, we will set the number of trees to make to 100. Setting the random_state option is similar to setting the seed for the purpose of reproducibility. Below is the code.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestClassifier(n_estimators=100,random_state=1)
We can now train our model with the .fit() function and make predictions with the .predict() function. The code is below.
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
We will now print two tables. The first will provide the raw results for the classification using the .crosstab() function. The classification_report function will provide the various metrics used for determining the value of a classification model.
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))
Our overall accuracy is about 75%. How good this is depends on context. The model is good at predicting when people are dead but has much more trouble predicting when people are censored.
Conclusion
This post provided an example of using random forest in python. Through the use of a forest of trees, it is possible to get much more accurate results when a comparison is made to a single decision tree. This is one of many reasons for the use of random forest in machine learning.

Data Exploration Case Study: Credit Default
Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.
This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.
- Load the libraries and dataset
- Deal with missing data
- Some descriptive stats
- Normality check
- Model development
This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here
Load Libraries and Data
Here are some packages we will need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics
You can load the data with the code below
df_train=pd.read_csv('/application_train.csv')
You can examine what variables are available with the code below. This is not displayed here because it is rather long
df_train.columns
df_train.head()
Missing Data
I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()

                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330
Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.
pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)
You can use the .head() function if you want to see how many variables are left.
Data Description & Visualization
For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.
round(df_train['AMT_CREDIT'].describe())
Out[8]:
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT'])

round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]:
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0

sns.distplot(df_train['AMT_INCOME_TOTAL'])
I think you are getting the point. You can also look at categorical variables using the groupby() function.
We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all of the categorical variables and converting them to dummy variables.
df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)

df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)

df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)

df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)

df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)

df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)

df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)

df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)

df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)
You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.
df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)
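As an aside, pandas can also encode several columns at once and drop the first level of each with drop_first=True, which avoids the dummy variable trap without the manual clean-up. The sketch below assumes it is applied to the raw dataframe before the manual conversions above, and the column list is only an illustrative subset.
# A sketch of a more compact alternative, applied to the raw dataframe
# before the manual conversions above; the column list is an illustrative subset.
cat_cols = ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR']
df_encoded = pd.get_dummies(df_train, columns=cat_cols, drop_first=True)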
Below are some boxplots with the target variable and other variables in the dataset.
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])
There is a clear outlier there. Below is another boxplot with a different variable
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])
It appears several people have more than 10 children. This is probably a typo.
Below is a correlation matrix using a heatmap technique
corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)
The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.
c = df_train.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12  FLAG_MOBIL          0.000005
FLAG_MOBIL        FLAG_DOCUMENT_12    0.000005
Unknown           FLAG_MOBIL          0.000005
FLAG_MOBIL        Unknown             0.000005
Cash loans        FLAG_DOCUMENT_14    0.000005
The list is too long to show here, but the following variables were removed for having a high correlation with other variables.
df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)
Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms
sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)
This is not normal
sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)
This is not normal either. We could do transformations, or we can make a non-linear model instead.
Model Development
Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make our X and y datasets.
X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']
The code below fits our model and makes the predictions.
clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)
Below is the confusion matrix followed by the accuracy
print(pd.crosstab(y_pred,df_train['TARGET']))

TARGET       0      1
row_0
0       280873  18493
1         1813   6332

metrics.accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047
Lastly, we can look at the precision, recall, and f1 score
print(metrics.classification_report(y_pred,df_train['TARGET']))

              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511
This model looks rather good in terms of accuracy, although keep in mind that it was evaluated on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.
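Because the accuracy above was measured on the same data used to fit the tree, it is likely optimistic. A quick sketch of a fairer check, assuming an arbitrary 70/30 split, might look like the following.
# A sketch of evaluating on a held-out test set instead of the training data;
# the 70/30 split and random_state are arbitrary illustrative choices.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf2 = tree.DecisionTreeClassifier(min_samples_split=20)
clf2 = clf2.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, clf2.predict(X_test)))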
Conclusion
Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this, but even the simple nature of this analysis yielded strong results.

RANSAC Regression in Python
RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.
The process that is used to determine inliers and outliers is described below; a simplified code sketch follows the list.
- The algorithm randomly selects a subset of samples to serve as the initial inliers for the model.
- The model is fit on this subset, and then all samples within a certain tolerance of the fitted model are relabeled as inliers.
- The model is refitted with the new inliers.
- The error of the refitted model versus the inliers is calculated.
- The algorithm terminates, or goes back to step 1, depending on whether a criterion for iterations or performance has been met.
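To make the loop above concrete, here is a simplified sketch of the idea, written purely for intuition. It assumes X and y are NumPy arrays, and it is not how scikit-learn's RANSACRegressor is implemented internally; we will use the built-in class for the actual analysis.
import numpy as np
from sklearn.linear_model import LinearRegression

def ransac_sketch(X, y, n_iterations=100, sample_size=10, threshold=2.0):
    # X and y are assumed to be NumPy arrays
    best_model, best_inlier_count = None, 0
    rng = np.random.default_rng(0)
    for _ in range(n_iterations):
        # 1. Randomly choose a subset of samples to start as inliers
        idx = rng.choice(len(X), size=sample_size, replace=False)
        model = LinearRegression().fit(X[idx], y[idx])
        # 2. Relabel as inliers every sample within the tolerance of this model
        residuals = np.abs(y - model.predict(X))
        inliers = residuals < threshold
        # 3. Refit on the new inliers and keep the best model seen so far
        if inliers.sum() > best_inlier_count:
            best_model = LinearRegression().fit(X[inliers], y[inliers])
            best_inlier_count = inliers.sum()
    return best_model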
In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.
- Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
- Model 2 will use multiple regression and includes several independent variables and tips as the dependent variable
The process we will use to complete this example is as follows
- Data preparation
- Simple Regression Model fit
- Simple regression visualization
- Multiple regression model fit
- Multiple regression visualization
Below are the packages we will need for this example
import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
Data Preparation
For the data preparation, we need to do the following
- Load the data
- Create X and y dataframes
- Convert several categorical variables to dummy variables
- Drop the original categorical variables from the X dataframe
Below is the code for these steps
df=data('tips')
X,y=df[['total_bill','sex','size','smoker','time']],df['tip']
male=pd.get_dummies(X['sex'])
X['male']=male['Male']
smoker=pd.get_dummies(X['smoker'])
X['smoker']=smoker['Yes']
dinner=pd.get_dummies(X['time'])
X['dinner']=dinner['Dinner']
X=X.drop(['sex','time'],1)
Most of this is self-explanatory: we first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we converted the sex, smoker, and dinner variables into dummy variables, and then we dropped the original categorical variables.
We can now move to fitting the first model that uses simple regression.
Simple Regression Model
For our model, we want to use total bill to predict tip amount. All this is done in the following steps.
- Instantiate an instance of RANSACRegressor. We pass in the LinearRegression estimator and set residual_threshold to 2, which means that any sample more than 2 units away from the fitted line is treated as an outlier.
- Next we fit the model
- We predict the values
- We calculate the r-square and the mean absolute error
Below is the code for all of this.
ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg1.fit(X[['total_bill']],y)
prediction1=ransacReg1.predict(X[['total_bill']])

r2_score(y,prediction1)
Out[150]: 0.4381748268686979

mean_absolute_error(y,prediction1)
Out[151]: 0.7552429811944833
The r-square is 44% while the MAE is 0.75. These values are mainly useful as a point of comparison and will be looked at again when we create the multiple regression model.
The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression line. It also identifies which samples are inliers and outliers. The code will not be explained in detail because of its complexity.
inlier=ransacReg1.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(3,51,2)
line_y=ransacReg1.predict(line_X[:,np.newaxis])
plt.scatter(X[['total_bill']][inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(X[['total_bill']][outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')
The plot is self-explanatory: a handful of samples were considered outliers. We will now move to creating our multiple regression model.
Multiple Regression Model Development
The steps for making the model are mostly the same. The real difference comes in making the plot, which we will discuss in a moment. Below is the code for developing the model.
ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg2.fit(X,y)
prediction2=ransacReg2.predict(X)

r2_score(y,prediction2)
Out[154]: 0.4298703800652126

mean_absolute_error(y,prediction2)
Out[155]: 0.7649733201032204
Things have actually gotten slightly worse in terms of r-square and MAE.
For the visualization, we cannot directly plot several variables at once. Therefore, we will compare the predicted values with the actual values. The stronger the correlation, the better our prediction is. Below is the code for the visualization.
inlier=ransacReg2.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(1,8,1)
line_y=(line_X[:,np.newaxis])
plt.scatter(prediction2[inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(prediction2[outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')
The plots are mostly the same, as you can see for yourself.
Conclusion
This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model; generally, we want to avoid losing data when developing models. In addition, the algorithm decides on its own which samples are outliers, which can be a problem because outlier removal is often a subjective judgment call. Despite these flaws, RANSAC regression is another tool that can be used in machine learning.

Combining Algorithms for Classification with Python
Many approaches in machine learning involve making many models and combining their strengths and weaknesses to produce more accurate classifications. Generally, when this is done it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used, it is the same algorithm but with variation in sampling and the use of features.
In addition to this common form of ensemble learning, there is also a way to combine different algorithms to make predictions. One way of doing this is by voting, or by the related technique of stacking, in which the predictions of several models are passed to a higher-level model that uses them to make a final prediction. In this post, we will look at how to combine several different algorithms with a voting ensemble using Python.
Assumptions
This blog usually tries to explain as much as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.
- Already familiar with python
- Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
- Familiar with cross-validation and hyperparameter tuning
We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.
The steps we will take in this post are as follows
- Data preparation
- Individual model development
- Ensemble model development
- Hyperparameter tuning of ensemble model
- Ensemble model testing
Below are all of the libraries we will be using in this post.
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
Data Preparation
We need to perform the following steps for the data preparation
- Load the data
- Select the independent variables to be used in the analysis
- Scale the independent variables
- Convert the dependent variable from text to numbers
- Split the data in train and test sets
Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post, but you can explore it on your own. The data was also scaled because many algorithms are sensitive to scale, so it is best practice to always scale the data. We will use the StandardScaler function for this. Lastly, the dependent variable currently consists of the values “yes” and “no”; these need to be converted to the numbers 1 and 0. We will use the LabelEncoder function for this. The code for all of this is below.
df=data('Mroz')
X,y=df[['hoursw','child6','child618','educw','hearnw','hoursh','educh','wageh','educwm','educwf','experience']],df['city']
sc=StandardScaler()
X_scale=sc.fit_transform(X)
X=pd.DataFrame(X_scale, index=X.index, columns=X.columns)
le=LabelEncoder()
y=le.fit_transform(y)
X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)
We can now proceed to individual model development.
Individual Model Development
Below are the steps for this part of the analysis
- Instantiate an instance of each algorithm
- Check accuracy of each model
- Check roc curve of each model
We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code
logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0)
treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0)
knnclf=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')
LDAclf=LDA()
We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.
clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf']

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [Logistic Regression]
accuracy: 0.72 (+/- 0.06) [Decision Tree]
accuracy: 0.66 (+/- 0.06) [KNN]
accuracy: 0.70 (+/- 0.05) [LDAclf]
roc auc: 0.71 (+/- 0.08) [Logistic Regression]
roc auc: 0.70 (+/- 0.07) [Decision Tree]
roc auc: 0.62 (+/- 0.10) [KNN]
roc auc: 0.70 (+/- 0.08) [LDAclf]
The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.
Ensemble Model Development
The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results
mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
labels=['LR','tree','knn','LDA','combine']  # short labels for printing, matching the output below

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [LR]
accuracy: 0.72 (+/- 0.06) [tree]
accuracy: 0.66 (+/- 0.06) [knn]
accuracy: 0.70 (+/- 0.05) [LDA]
accuracy: 0.70 (+/- 0.04) [combine]
roc auc: 0.71 (+/- 0.08) [LR]
roc auc: 0.70 (+/- 0.07) [tree]
roc auc: 0.62 (+/- 0.10) [knn]
roc auc: 0.70 (+/- 0.08) [LDA]
roc auc: 0.72 (+/- 0.09) [combine]
You can see that the combined model has similar performance to the individual models. This means that in this situation the ensemble learning did not make much of a difference. However, we have not tuned our hyperparameters yet. This will be done in the next step.
Hyperparameter Tuning of Ensemble Model
We are going to tune the decision tree, logistic regression, and KNN model. There are many different hyperparameters we can tune. For demonstration purposes we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters we will run the model and pull the best hyperparameters values based on the roc auc as the metric. Below is the code and the output.
params={'decisiontreeclassifier__max_depth':[2,3,5],
        'logisticregression__C':[0.001,0.1,1,10],
        'kneighborsclassifier__n_neighbors':[5,7,9,11]}
grid=GridSearchCV(estimator=mv_clf,param_grid=params,cv=10,scoring='roc_auc')
grid.fit(X_train,y_train)

grid.best_params_
Out[34]:
{'decisiontreeclassifier__max_depth': 3,
 'kneighborsclassifier__n_neighbors': 9,
 'logisticregression__C': 10}

grid.best_score_
Out[35]: 0.7196051482279385
The best values are as follows
- Decision tree max depth set to 3
- KNN number of neighbors set to 9
- logistic regression C set to 10
These values give us a roc auc of 0.72, which is still poor. We can now use these values when we test our final model.
Ensemble Model Testing
The following steps are performed in the analysis
- Created new instances of the algorithms with the adjusted hyperparameters
- Run the ensemble model
- Predict with the test data
- Check the results
Below is the first step
logclf=LogisticRegression(penalty='l2',C=10, random_state=0) treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0) knnclf=KNeighborsClassifier(n_neighbors=9,p=2,metric='minkowski') LDAclf=LDA()
Below is step two
mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1]) mv_clf.fit(X_train,y_train)
Below are steps 3 and 4.
y_pred=mv_clf.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))

0.6902654867256637

col_0   0    1
row_0
0      29   58
1      12  127

             precision    recall  f1-score   support

          0       0.71      0.33      0.45        87
          1       0.69      0.91      0.78       139

avg / total       0.69      0.69      0.66       226
The accuracy is about 69%. One thing that is noticeably low is the recall for people who do not live in the city. This is probably one reason why the overall roc auc score is so low. The f1-score, which is just a combination of precision and recall, is also low for those who do not live in the city. If we really wanted to improve performance, we would probably start by improving the recall of the no’s.
Conclusion
This post provided an example of how you can combine different algorithms to make predictions in Python. This is a powerful technique to use. Of course, it is offset by the complexity of the analysis, which makes it hard to explain exactly what the results mean if you were asked to do so.

Gradient Boosting Regression in Python
In this post, we will take a look at gradient boosting for regression. Gradient boosting makes sequential models, with each new model trying to explain the examples that previous models failed to explain. This approach often allows gradient boosting to outperform AdaBoost.
Regression trees are most commonly teamed with boosting. There are some additional hyperparameters that need to be set, which include the following:
- number of estimators
- learning rate
- subsample
- max depth
We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.
- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development
Below is some initial code
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
We can now move to creating our baseline model.
Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create several regression trees. The difference between the regression trees will be the max depth. The max depth has to do with how many levels of nodes Python can make as it tries to purify the classification. We will then decide which tree is best based on the mean squared error.
The first thing we need to do is set the arguments for the cross-validation. Cross-validating the results helps to check the accuracy of the results. The rest of the code requires the use of for loops and if statements that will not be explained in detail in this post. Below is the code with the output.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)

1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.
However, before we create our gradient boosting model, we need to tune the hyperparameters of the algorithm.
Hyperparameter Tuning
Hyperparameter tuning has to do with setting the value of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are often no better at knowing where to set these values than the computer. Therefore, the common approach is to have the algorithm try several combinations of values until it finds the ones that are best for the model. Having said this, there are several hyperparameters we need to tune, and they are as follows.
- number of estimators
- learning rate
- subsample
- max depth
The number of estimators is how many trees to create. The more trees, the more likely the model is to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use. Max depth was explained previously.
What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function such as estimator, grid, cv, etc. Below is the code.
GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,4],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
We can now run the code and determine the best combination of hyperparameters and how well the model did based on the mean squared error metric. Below is the code and the output.
search.fit(X,y)

search.best_params_
Out[13]:
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

search.best_score_
Out[14]: -160.51398257591643
The hyperparameter results speak for themselves. With this tuning we can see that the mean squared error is lower than with the baseline model. We can now move to the final step of taking these hyperparameter settings and see how they do on the dataset. The results should be almost the same.
Gradient Boosting Model Development
Below is the code and the output for the tuned gradient boosting model
GBR2=GradientBoostingRegressor(n_estimators=500,learning_rate=0.01,subsample=.5,max_depth=1,random_state=1)
score=np.mean(cross_val_score(GBR2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))

score
Out[18]: -160.77842893572068
These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.
Conclusion
In this post, we looked at how to use gradient boosting to improve a regression tree. By creating multiple models, gradient boosting will often have better performance than approaches that rely on only one model.

Gradient Boosting Classification in Python
Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses optimization to weight the estimators. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.
In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number, the purer the classification becomes. The downside to this is the risk of overfitting.
Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up to the whole number 1. If the value is set to less than 1, the approach is known as stochastic gradient boosting.
This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.
- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development
Below is some initial code.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
The data preparation is simple in this situation. All we need to do is load our dataset, drop missing values, and create our X dataset and y dataset. All this happens in the code below.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
We will now develop our baseline decision tree model.
Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.
The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.
To achieve this, we need to use a for loop to have Python make several decision trees. We also need to set the parameters for the cross-validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
It appears that when the max depth is limited to 1 we get the best accuracy, at almost 72%. This will be our baseline for comparison. We will now tune the hyperparameters for the gradient boosting algorithm.
Hyperparameter Tuning
There are several hyperparameters we need to tune. The ones we will tune are as follows
- number of estimators
- learning rate
- subsample
- max depth
First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross-validation values made earlier, and n_jobs all together in one place. Below is the code for this.
GBC=GradientBoostingClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)
You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters, which means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then a second run-through is done to find the learning rate. In our example, we are doing everything at once, which is why it takes longer. Below is the code with the output for the best parameters and best score.
search.fit(X,y)
search.best_params_
Out[11]:
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
search.best_score_
Out[12]: 0.7425149700598802
You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.
Gradient Boosting Model
Below is the code and results for the model with the predetermined hyperparameter values.
ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[17]: 0.742279411764706
You can see that the results are similar. This is just additional confirmation that the gradient boosting model does outperform the baseline decision tree model.
Conclusion
This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python
This post will share how to use the AdaBoost algorithm for regression in Python. Boosting makes multiple models in a sequential manner, and each newer model tries to successfully predict what older models struggled with. For regression, the average of the models is used for the predictions. It is most common to use boosting with decision trees, but this approach can be used with any supervised machine learning algorithm.
Boosting is associated with ensemble learning because several models are created and then averaged together. An assumption of boosting is that combining several weak models can make one really strong and accurate model.
For our purposes, we will be using AdaBoost regression to improve the performance of a regression tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of Adaboost regression model
- AdaBoost regression model development
Below is some initial code
from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
Data Preparation
There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
We will now proceed to creating the baseline regression decision tree model.
Baseline Regression Tree Model
The purpose of the baseline model is to compare it to the performance of our model that utilizes AdaBoost. To make this model, we need to initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop to create several trees that vary based on their depth. By depth, we mean how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.
Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.
Hyperparameter Tuning
For hyperparameter tuning, we need to start by initiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters, which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make, and the learning rate indicates how much each tree contributes to the overall results. There is one more parameter, random_state, but this is just for setting the seed and never changes.
After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function, you have to set the estimator, which is the AdaBoostRegressor, the parameter grid which we just made, the cross-validation which we made when we created the baseline model, and n_jobs, which allocates resources for the calculation. Below is the code.
ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.
search.fit(X,y)
search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
search.best_score_
Out[32]: -164.93176650920856
The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with refined hyperparameters.
AdaBoost Regression Model
Below is our model, but this time with the refined hyperparameters.
ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[36]: -174.52604137201791
You can see the score is not quite as good as what the grid search reported. Note that the learning rate here was set to 0.001 rather than the 0.01 found during tuning, which likely accounts for the difference, but the result is still within reason.
Conclusion
In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can often help to strengthen a weak model.

AdaBoost Classification in Python
Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successfully predict what prior models were unable to. The average is used for regression and the majority vote for classification. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.
Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.
For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Decision tree baseline model
- Hyperparameter tuning of Adaboost model
- AdaBoost model development
Below is some initial code
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
Data preparation is minimal in this situation. We will load our data and at the same time drop any NAs using the .dropna() function. In addition, we will place the independent variables in a dataframe called X and the dependent variable in a dataset called y. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
Decision Tree Baseline Model
We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
You can see that the most accurate decision tree had a depth of 1. After that, there was a general decline in accuracy.
We can now determine whether the AdaBoost model is better based on whether its accuracy is above 72%. Before we develop the AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.
Hyperparameter Tuning AdaBoost Model
In order to tune the hyperparameters there are several things that we need to do. First, we need to initiate our AdaBoostClassifier with some basic settings. Then we need to create our search grid with the hyperparameters. There are two hyperparameters that we will set: the number of estimators (n_estimators) and the learning rate.
The number of estimators has to do with how many trees are developed. The learning rate indicates how much each tree contributes to the overall results. We have to place several values for each of these in the grid. Once we set the arguments for the AdaBoostClassifier and the search grid, we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and cross-validation. Below is the code for all of this.
ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)
We can now run the model of hyperparameter tuning and see the results. The code is below.
search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802
We can see that if the learning rate is set to 0.01 and the number of estimators to 1000, we can expect an accuracy of 74%. This is superior to our baseline model.
AdaBoost Model
We can now run our AdaBoost classifier based on the recommended hyperparameters. Below is the code.
ada=AdaBoostClassifier(n_estimators=1000,learning_rate=0.01)
score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589
We knew we would get around 74% and that is what we got. It’s only a 3% improvement but depending on the context that can be a substantial difference.
Conclusion
In this post, we looked at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models, developed sequentially, to determine the most accurate classification. Doing this will often lead to an improvement in the predictions of a model.

Recommendation Engine with Python
Recommendation engines make future suggestions to a person based on their prior behavior. There are several ways to develop recommendation engines, but for our purposes we will be looking at the development of a user-based collaborative filter. This type of filter uses the ratings of other users to suggest future items to a given user.
Making a recommendation engine in Python actually does not take much code and is somewhat easy considering what it accomplishes. We will make a movie recommendation engine using data from movielens.
Below is the link for downloading the zip file.
Inside the zip file are several files we will use. We will use each in a few moments. Below is the initial code to get started
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
import numpy as np
We will now make 4 dataframes. Dataframes 1-3 will hold the user, rating, and movie title data. The last dataframe will be a merger of the first 3. The code is below.
user = pd.read_table('/home/darrin/Documents/python/new/ml-1m/users.dat', sep='::', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'],engine='python')
rating = pd.read_table('/home/darrin/Documents/python/new/ml-1m/ratings.dat', sep='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'],engine='python')
movie = pd.read_table('/home/darrin/Documents/python/new/ml-1m/movies.dat', sep='::', header=None, names=['movie_id', 'title', 'genres'],engine='python')
MovieAll = pd.merge(pd.merge(rating, user), movie)

We now need to create a matrix using the .pivot_table function. This matrix will contain the ratings from our "MovieAll" dataframe, with user_id as the rows and the movie titles as the columns. We will also store the column names in an object called "movie_index", which will help us keep track of which movie each column represents. The code is below.
rating_mtx_df = MovieAll.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)
movie_index = rating_mtx_df.columns
There are many variables in our matrix, which makes the computation long and expensive. To reduce this, we will reduce the dimensions to 20 components using the TruncatedSVD function. We also need to transpose the data because we want the Vh matrix and not the U matrix. All of this is handled in the code below.
recomm = TruncatedSVD(n_components=20, random_state=10)
R = recomm.fit_transform(rating_mtx_df.values.T)
We saved our modified dataset as "R". If we were to print it, each row would contain 20 component values that cannot be directly interpreted by us. Instead, we will move to the actual recommendation part of this post.
To get a recommendation, you have to tell Python the movie that you watched first. Python will then compare this movie with other movies that have similar rating patterns in the training dataset and provide recommendations based on which movies have the highest correlation to the movie that was watched.
We are going to tell Python that we watched “One Flew Over the Cuckoo’s Nest” and see what movies it recommends.
First, we need to pull the information for just “One Flew Over the Cuckoo’s Nest” and place this in a matrix. Then we need to calculate the correlations of all our movies using the modified dataset we named “R”. These two steps are completed below.
cuckoo_idx = list(movie_index).index("One Flew Over the Cuckoo's Nest (1975)")
correlation_matrix = np.corrcoef(R)
Now we can determine which movies have the highest correlation with our movie. However, to determine this, we must give Python a range of acceptable correlations. For our purposes, we will set this between 0.93 and 1.0. The code is below with the recommendations.
P = correlation_matrix[cuckoo_idx]
print (list(movie_index[(P > 0.93) & (P < 1.0)]))
['Graduate, The (1967)', 'Taxi Driver (1976)']
You can see that the engine recommended two movies: "The Graduate" and "Taxi Driver". We could increase the number of recommendations by lowering the correlation requirement if we desired.
Conclusion
Recommendation engines are a great tool for automatically generating sales suggestions for customers. Understanding the basics of how to build one is a practical application of machine learning.

Elastic Net Regression in Python
Elastic net regression combines the power of ridge and lasso regression into one algorithm. This means that elastic net can remove weak variables altogether, as with lasso, or reduce them to close to zero, as with ridge. All of these algorithms are examples of regularized regression.
This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.
- Data preparation
- Baseline model development
- Elastic net model development
To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.
from pydataset import data
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Data Preparation
We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our X and y datasets. Below is the code.
df=pd.DataFrame(data('Fair'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
df.loc[df.child== 'no', 'child'] = 0
df.loc[df.child== 'yes','child'] = 1
df['child'] = df['child'].astype(int)
X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]
y=df['rate']
We can now proceed to creating the baseline model
Baseline Model
This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.
regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
1.0498738644696668
This mean squared error score of 1.05 is our benchmark for determining whether the elastic net model will be better or worse. Below are the coefficients of this first model. We use a for loop to go through the coefficients and the zip function to combine them with the feature names.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'religious': 0.04235281110639178,
'age': -0.009059645428673819,
'sex': 0.08882013337087094,
'ym': -0.030458802565476516,
'education': 0.06810255742293699,
'occupation': -0.005979506852998164,
'nbaffairs': -0.07882571247653956}
We will now move to making the elastic net model.
Elastic Net Model
Elastic net, just like ridge and lasso regression, requires normalized data. This argument is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we created for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.
When the l1_ratio is set to 0, it is the same as ridge regression. When the l1_ratio is set to 1, it is lasso. Anything between 0 and 1 is a blend of the two. Therefore, in our grid, we need to set several values for this argument. Below is the code.
elastic=ElasticNet(normalize=True)
search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
We will now fit our model and display the best parameters and the best results we can get with that setup.
search.fit(X,y)
search.best_params_
Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}
abs(search.best_score_)
Out[74]: 1.0816514028705004
The best hyperparameters were an alpha of 0.001 and an l1_ratio of 0.8. With these settings we got an MSE of 1.08. This is above the MSE of 1.05 for the baseline model, which means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to the recommended values and run the model on the data.
elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)
elastic.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))
print(second_model)
1.0566430678343806
Now our values are about the same. Below are the coefficients
coef_dict_baseline = {}
for coef, feat in zip(elastic.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[76]:
{'religious': 0.01947541724957858,
'age': -0.008630896492807691,
'sex': 0.018116464568090795,
'ym': -0.024224831274512956,
'education': 0.04429085595448633,
'occupation': -0.0,
'nbaffairs': -0.06679513627963515}
The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this variable was not useful to the algorithm. Traditional regression cannot do this.
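If you want to pull out the removed variables programmatically, a small check like the one below, which builds on the fitted elastic object above, will do it.
# Variables whose coefficient elastic net shrank all the way to zero
removed=[feat for coef, feat in zip(elastic.coef_,X.columns) if coef==0]
print(removed)
Given the coefficients above, this should print ['occupation'].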
Conclusion
This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

Lasso Regression with Python
Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.
Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the "Caschool" dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.
- Data preparation
- Develop a baseline linear model
- Develop lasso regression model
The initial code is as follows
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
df=pd.DataFrame(data('Caschool'))
Data Preparation
The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent, but it was decided to leave them out anyway. Below is the code.
X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]
y=df['testscr']
Baseline Model
We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.
regression=LinearRegression()
regression.fit(X,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
69.07380530137416
First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we calculate the mean squared error of the model's predictions and save the result to the object first_model. Lastly, we print the results.
Below are the coefficient for the baseline regression model.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[52]:
{'teachers': 0.00010011947964873427,
'calwpct': -0.07813766458116565,
'mealpct': -0.3754719080127311,
'compstu': 11.914006268826652,
'expnstu': 0.001525630709965126,
'str': -0.19234209691788984,
'avginc': 0.6211690806021222,
'elpct': -0.19857026121348267}
The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.
Lasso Model
For our lasso model, we have to determine what value to set the l1 penalty, or alpha, to prior to creating the model. This can be done with a grid search, which allows you to assess several models with different alpha settings. Python will then tell us which setting is best. Below is the code.
lasso=Lasso(normalize=True)
search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
We start by instantiating lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we set up our grid, which includes the estimator, the parameter grid, and the scoring metric. The alpha is set using logspace. We want exponents from -5 to 2 (that is, values from 10^-5 to 10^2), and we want 8 evenly spaced settings for the alpha. The other arguments include cv, which stands for cross-validation, n_jobs, which affects processing, and refit, which refits the model on the best parameters.
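If it helps to see what that grid actually contains, the quick check below (an aside, not part of the original analysis) prints the eight alpha values that np.logspace(-5,2,8) produces.
# Eight candidate alphas, evenly spaced on a log scale from 10^-5 to 10^2
print(np.logspace(-5,2,8))
[1.e-05 1.e-04 1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02]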
After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.
search.best_params_
Out[55]: {'alpha': 1e-05}
abs(search.best_score_)
Out[56]: 85.38831122904011
The alpha is set almost to zero, which is essentially the same as an ordinary regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.
lasso=Lasso(normalize=True,alpha=1e-05)
lasso.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))
print(second_model)
69.0738055527604
The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.
coef_dict_baseline = {}
for coef, feat in zip(lasso.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'teachers': 9.795933425676567e-05,
'calwpct': -0.07810938255735576,
'mealpct': -0.37548182158171706,
'compstu': 11.912164626067028,
'expnstu': 0.001525439984250718,
'str': -0.19225486069458508,
'avginc': 0.6211695477945162,
'elpct': -0.1985510490295491}
The coefficient values are also slightly different. The notable difference is that the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscr. That is ironic to say the least.
Conclusion
Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python is fairly simple. This is yet another tool that can be used in statistical analysis.

Ridge Regression in Python
Ridge regression is one of several regularized linear models. Regularization is the process of penalizing the coefficients of variables, either by removing them or by reducing their impact. Ridge regression reduces the effect of problematic variables to close to zero but never fully removes them.
We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps.
- Data preparation
- Baseline model development
- Ridge regression model
Below is the initial code
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Data Preparation
The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.
df=pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y=df['lnhhexp']
We can now create our baseline regression model.
Baseline Model
The metric we are using is the mean squared error. Below is the code and output for our baseline regression model, which has no regularization applied to it.
regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
0.35528915032173053
This value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.
Ridge Model
In order to create our ridge model, we first need to determine the most appropriate value for the l2 regularization, which is controlled by the alpha hyperparameter in ridge regression. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first create our ridge model and indicate normalization in order to get better estimates. Next, we set up the grid that we will use. Below is the code.
ridge=Ridge(normalize=True)
search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. Logspace provides the range of values we want to test: exponents from -5 to 2, with 8 values evenly spread out within that range. Our metric is the mean squared error. Setting refit to true means the model is refit with the best parameters, and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_score_ attributes to determine the model's strength. Below is the code.
search.fit(X,y)
search.best_params_
{'alpha': 0.01}
abs(search.best_score_)
0.3801489007094425
The best_params_ tells us what to set alpha to, which in this case is 0.01. The best_score_ tells us what the best possible mean squared error is. In this case, the value of 0.38 is worse than what the baseline model produced. We can confirm this by fitting our model with the ridge settings and finding the mean squared error. This is done below.
ridge=Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)
0.35529321992606566
The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,data("VietNamI").columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[188]:
{'pharvis': 0.013282050886950674,
'lnhhexp': 0.06480086550467873,
'age': 0.004012412278795848,
'sex': -0.08739614349708981,
'married': 0.075276463838362,
'educ': -0.06180921300600292,
'illness': 0.040870384578962596,
'injury': -0.002763768716569026,
'illdays': -0.006717063310893158,
'actdays': 0.1468784364977112}
coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,data("VietNamI").columns):
    coef_dict_ridge[feat] = coef
coef_dict_ridge
Out[190]:
{'pharvis': 0.012881937698185289,
'lnhhexp': 0.06335455237380987,
'age': 0.003896623321297935,
'sex': -0.0846541637961565,
'married': 0.07451889604357693,
'educ': -0.06098723778992694,
'illness': 0.039430607922053884,
'injury': -0.002779341753010467,
'illdays': -0.006551280792122459,
'actdays': 0.14663287713359757}
The coefficient values are about the same. This means that the penalization made little difference with this dataset.
Conclusion
Ridge regression allows you to penalize variables based on their usefulness in developing the model. With this form of regularized regression, the coefficients of the variables are never set to zero. Other forms of regularized regression allow for the total removal of variables. One example of this is lasso regression.

Hyperparameter Tuning in Python
Hyperparameters are numerical quantities that you must set yourself when developing a model. This is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.
Algorithms cannot determine hyperparameters themselves, which is why you have to do it. The problem is that the typical person has no idea what an optimal choice for a hyperparameter is. To deal with this, a range of values is often supplied and it is left to Python to determine which combination of hyperparameters is most appropriate.
In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and who is not married based on several independent variables. The steps of this process are as follows.
- Data preparation
- Baseline model (for comparison)
- Grid development
- Revised model
Below is some initial code that includes all the libraries and classes that we need.
import pandas as pd
import numpy as np
from pydataset import data
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
Data Preparation
The dataset PSID has several problems that we need to address.
- We need to remove all NAs
- The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
- The educatn and kids variables have codes of 98 and 99. These need to be removed because they do not make sense.
Below is the code that deals with all of this
df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)
- Line 1 loads the dataset and drops the NAs
- Lines 2-5 create our dummy variable for marriage and store the result in a new variable called marry
- Lines 6-7 drop the values in kids and educatn that are above 90.
Below we create our X and y datasets and then are ready to make our baseline model.
X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']
Baseline Model
The purpose of the baseline model is to see how much better or worse the hyperparameter tuning works. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.
- number of neighbors
- weight of neighbors
- metric for measuring distance
- power parameter for minkowski
Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.
classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform',metric='minkowski',p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6188104238047426
Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance.
Grid Development
The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run
grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96
You can see we made a simple list that has several values for each hyperparameter
- Number of neighbors can range from 1 to 12
- weight of neighbors can be uniform or distance
- metric can be manhattan or minkowski
- p can be 1 or 2
We will develop 96 models altogether. Below is the code to begin tuning the hyperparameters.
search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
The estimator is the algorithm we are using, which we set earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process, refit refits the model on the best parameters, and cv is the number of cross-validation folds. The search.fit command runs the model.
The code below provides the output for the results.
print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667
The best_params_ attribute tells us what the most appropriate parameters are. The best_score_ tells us what the accuracy of the model is with the best parameters. Our model accuracy improves from about 62% to 65% by adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.
Model Revision
Below is the code for the revised model.
classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1)) #new res
Out[24]: 0.6503909993913031
Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as a data science competition.
Conclusion
Tuning hyperparameters is one of the final pieces of improving a model. With this tool, small gradual improvements can be made to a model. It is important to keep this aspect of model development in mind in order to get the best final results.

Variable Selection in Python
A key concept in machine learning and data science in general is variable selection. Sometimes, a dataset can have hundreds of variables to include in a model. The benefit of variable selection is that it reduces the amount of useless information aka noise in the model. By removing noise it can improve the learning process and help to stabilize the estimates.
In this post, we will look at two ways to do this. These two common approaches are the univariate approach and the greedy approach. The univariate approach selects the variables that are most related to the dependent variable based on a metric. The greedy approach only removes a variable if getting rid of it does not affect the model's performance.
We will now move to our first example, which is the univariate approach using Python. We will use the VietNamH dataset from the pydataset library. Our goal is to predict how much a family spends on medical expenses. Below is the initial code.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression
df=data('VietNamH').dropna()
Our data is called df. If you use the head function, you will see that we need to convert several variables to dummy variables. Below is the code for doing this.
df.loc[df.sex== 'female', 'sex'] = 0
df.loc[df.sex== 'male','sex'] = 1
df.loc[df.farm== 'no', 'farm'] = 0
df.loc[df.farm== 'yes','farm'] = 1
df.loc[df.urban== 'no', 'urban'] = 0
df.loc[df.urban== 'yes','urban'] = 1
We now need to set up our X and y datasets as shown below.
X=df[['age','educyr','sex','hhsize','farm','urban','lnrlfood']]
y=df['lnmed']
We are now ready to actually use the univariate approach. This involves the use of two different classes in Python. The SelectPercentile class allows you to only include the variables that meet a certain percentile rank, such as 25%. The f_regression class is designed for checking a variable's performance in the context of regression. Below is the code to run the analysis.
selector_f=SelectPercentile(f_regression,percentile=25)
selector_f.fit(X,y)
We can now see the results using a for loop. We want the scores from our selector_f object. To do this, we set up a for loop and use the zip function to iterate over the data. The output is placed in the print statement. Below is the code and output.
for n,s in zip(X,selector_f.scores_):
    print('F-score: %3.2f\t for feature %s ' % (s,n))
F-score: 62.42 for feature age
F-score: 33.86 for feature educyr
F-score: 3.17 for feature sex
F-score: 106.35 for feature hhsize
F-score: 14.82 for feature farm
F-score: 5.95 for feature urban
F-score: 97.77 for feature lnrlfood
You can see the f-score for all of the independent variables. You can decide for yourself which to include.
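If you would rather let the percentile cutoff do the selecting for you, the fitted selector_f object can report and apply its choice directly; below is a short sketch that builds on the code above.
# Names of the columns that clear the 25th-percentile cutoff
print(list(X.columns[selector_f.get_support()]))
# Reduced feature matrix containing only those columns
X_selected=selector_f.transform(X)
print(X_selected.shape)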
Greedy Approach
The greedy approach only removes variables if they do not impact model performance. We are using the same dataset, so all we have to do is run the code. We need the RFECV class from the feature_selection module. We then instantiate RFECV and set the estimator, cross-validation, and scoring metric. Finally, we run the analysis and print the results. The code is below with the output.
from sklearn.feature_selection import RFECV
regression=LinearRegression()
select=RFECV(estimator=regression,cv=10,scoring='neg_mean_squared_error')
select.fit(X,y)
print(select.n_features_)
7
The number 7 represents how many independent variables to include in the model. Since we only had 7 variables in total, we should include all of them in the model.
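If you want to see which variables the greedy approach kept (all of them in this case), the fitted select object exposes a boolean mask and a ranking; below is a brief sketch.
# Boolean mask of retained variables and their ranking (1 means kept)
print(list(X.columns[select.support_]))
print(select.ranking_)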
Conclusion
With the help of the univariate and greedy approaches, it is possible to deal with a large number of variables efficiently when developing models. The example here involved only a handful of variables. However, bear in mind that the approaches mentioned here are highly scalable and useful.

Scatter Plots in Python
Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.
We will be using the "Prestige" dataset from the pydataset module to look at scatterplot use. Below is some initial code.
from pydataset import data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df=data('Prestige')
We will begin by making a correlation matrix. This will help us to determine which pairs of variables have strong relationships with each other. This will be done with the .corr() function. Below is the code.
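The original call is not reproduced here, but a minimal sketch of it is simply the line below. Depending on your pandas version, you may need to pass numeric_only=True so the categorical type column is skipped.
# Correlation matrix of the numeric columns in the Prestige data
print(df.corr())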
You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.
The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.
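The original snippet is not shown here, but a minimal sketch of that basic plot, mirroring the regression-line version a little further down, would be:
# Scatterplot of education against income with no regression line drawn
facet = sns.lmplot(data=df, x='education', y='income', fit_reg=False)
plt.show()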
The code should be self-explanatory. The only thing that might be unfamiliar is the fit_reg argument. This is set to False so that the function does not draw a regression line. Below is the same visual, but this time with the regression line.
facet = sns.lmplot(data=df, x='education', y='income',fit_reg=True)
It is also possible to add a third variable to our plot. One of the more common ways is by including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this, we use the same .lmplot() function but include several additional arguments. These include the hue and the indication of a legend. Below is the code and output.
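The exact original call is not shown here; a reasonable sketch using the hue and legend arguments described above is:
# Same scatterplot, now colored by job type, with a legend
facet = sns.lmplot(data=df, x='education', y='income', hue='type', fit_reg=False, legend=True)
plt.show()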
You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.
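For the boxplot check, a quick sketch with seaborn (an addition here, not the original code) might look like this.
# Boxplots of education and income broken out by job type
sns.boxplot(data=df, x='type', y='education')
plt.show()
sns.boxplot(data=df, x='type', y='income')
plt.show()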
As you can see, we can conclude that job type influences both education and income in this example.
Conclusion
This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analysts should be familiar with, as they can be used to communicate information to people who must make decisions.