Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset is looking at how is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below


You can examine what variables are available with the code below. This is not displayed here because it is rather long


Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function if you want to see how  many variables are left.

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0



count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0


I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical  variables and converting them to dummy variable’s










You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly.  Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR,''Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.



There is a clear outlier there. Below is another boxplot with a different variable



It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique



The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005

The list is to long to show here but the following variables were removed for having a high correlation with other variables.


Below we check a few variables for homoscedasticity, linearity, and normality  using plots and histograms



This is not normal



This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make are X and y dataset.


The code below fits are model and makes the predictions


Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
0       280873  18493
1         1813   6332
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy of the training set. It actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.


Data exploration and analysis is the primary task of a data scientist.  This post was just an example of how this can be approached. Of course, there are many other creative ways to do this but the simplistic nature of this analysis yielded strong results


Classroom Conflict Resolution Strategies

Disagreements among students and even teachers is part of working at any institution. People have different perceptions and opinions of what they see and experience. With these differences often comes disagreements that can lead to serious problems.

This post will look at several broad categories in which conflicts can be resolved when dealing with conflicts in the classroom. The categories are as follows.

  1. Avoiding
  2. Accommodating
  3. Forcing
  4. Compromising
  5. Problem-solving


The avoidance strategy involves ignoring the problem. The tension of trying to work out the difficulty is not worth the effort. The hope is  that the problem will somehow go away with any form of intervention. Often the problem becomes worst.

Teachers sometimes use avoidance in dealing with conflict. One common classroom management strategy is avoidance in which a teacher deliberately ignores poor behavior of a student to extinguish it.  Since the student is not getting any attention  from their poor behavior  they will often stop the  behavior.


Accommodating is focused on making everyone involved in the conflict happy. The focus is on relationships and not productivity. Many who employ this strategy believe that confrontation is destructive. Actual applications of this approach involve using humor, or some other tension breaking technique during a conflict. Again, the problem is never actually solved but rather some form of “happiness band-aid” is applied.

In the classroom, accommodation happens when teachers use humor to smooth over tense situations and when they make adjustments to goals to ameliorate students complaints. Generally, the first step in accommodation leads to more and more accommodating until the teacher is backed into a corner.

Another use of the term accommodating is the mandate in education under the catchphrase “meeting student needs”. Teachers are expected to accommodate as much as possible within guidelines given to them by the school. This leads to extraordinarily large amount of work and effort on the part of the teacher.


Force involves simply making people do something through the power you have over them. It gets things done but can lead to long term relational problems. As people are forced the often lose motivation and new conflicts begin to arise.

Forcing is often a default strategy for teachers. After all, the teacher is t an authority over children. However, force is highly demotivating and should be avoided if possible. If students have no voice they quickly can become passive which is often in opposite of active learning in the classroom.


Compromise involves trying to develop a win win situation for both parties. However, the reality is that often compromising can be the most frustrating. To totally avoid conflict means no fighting. TO be force means to have no power. However, compromise means that a person almost got what they wanted but not exactly, which can be more annoying.

Depending on the age a teacher is working with, compromising can be difficult to achieve. Younger children often lack the skills to see alternative solutions and half-way points of agreement. Compromising can also be viewed as accommodating by older kids which can lead to perceptions of the teacher’s weakness when conflict arises. Therefore, compromise is an excellent strategy when used with care.


Problem-solving is similar to compromising except that both parties are satisfied with the decision and the problem is actually solved, at least temporarily. This takes a great deal of trust and communication between the parties involved.

For this to work in  the classroom, a teacher must de-emphasize their position of authority in order to work with the students. This is counterintuitive for most in teachers and even for many students. It is also necessary to developing strong listening and communication skills to allow both parties to provide ways of dealing with the conflict. As with compromise, problem-solving is better reserved for older students.


Teachers need to know what their options are when it comes to addressing conflict. This post provided several ideas or ways for maneuvering disagreements and setbacks in  the classroom.

Signs a Student is Lying

Deception is a common tool students use when trying to avoid discipline or some other uncomfortable situation with a teacher. However, there are some tips and indicators that you can be aware of to help you to determine if a student is lying to you. This post will share some ways to determine if a student may be lying. The tips are as follows

  • Determine what is normal
  • Examine how the play with their clothing
  • Watch personal space
  • Tone of voice
  • Movement

Determine What is Normal

People are all individuals and thus unique. Therefore, determining deception first requires determining what is normal for the student. This involves some observation and getting to know the student. These are natural parts of teaching.

However, if you are in an administrative position and may not know the student that well it will be much harder to determine what is normal for the student sot that it can be compared to their behavior if you believe they are lying. One solution for this challenge is to first engage in small talk with the student so you can establish what appears to be natural behavior for the student.

Clothing Signs

One common  sign that someone is lying is that they begin to play with their clothing. This can include tugging on clothes, closing buttons, pulling down on sleeves, and or rubbing a spot. This all depends on what is considered normal for the individual.

Personal Space

When people pull away when talking it is often a sign of dishonesty. This can be done through such actions as shifting one’s chair, or leaning back. Other individuals will fold their arms across their chest. All these behaviors are subconscious was of trying to protect one’s self.


The voice provides several clues of deception. Often the rate or speed of the speaking slows down. Deceptive answers are often much longer and detailed than honest ones. Liars often show hesitations and pauses that are out of the ordinary for them.

A change in pitch is perhaps the strongest sigh of lying. Students will often speak with a much higher pitch one lying. This is perhaps do to the nervousness they are experiencing.


Liars have a habit of covering their mouth when speaking. Gestures also become more mute and closer to the bottom when a student is lying. Another common cue is gestures with the palms up rather than down when speaking. Additional signs include nervous tapping with the feet.


People lie for many reasons. Due to this, it is important that a teacher is able to determine the honesty of a student when necessary. The tips in this post provide some basic ways of potentially identifying who is being truthful.

Barriers to Teachers Listening

Few of us want to admit it but all teachers have had problems at one time or another listening to their students. There are many reasons for this but in this post we will look at the following barriers to listening that teachers may face.

  1. Inability to focus
  2. Difference in speaking and listening speed
  3. Willingness
  4. Detours
  5. Noise
  6. Debate

Inability to Focus

Sometimes a teacher or even a student may not be able to focus on the discussion or conversation. This could be due to a lack of motivation or desire to pay attention. Listening can be taxing mental work. Therefore, the teacher must be engaged and have some desire to try to understand what is happening.

Differences in the Speed of Speaking and Listening

We speak much slower than we think. Some have put the estimate that we speak at 1/4 the speed at which we can think. What this means is that if you can think 100 words per minute you can speak at only 25 words per minute. With thinking being 4 times faster than speaking this leaves a lot of mental energy lying around unused which can lead to daydreaming.

This difference can lead to impatience and to anticipation of what the person is going to say. Neither of these are beneficial because they discourage listening.


There are times, rightfully so, that a teacher does not want to listen. This can be when a student is not cooperating or giving an unjustified excuse for their actions. The main point here is that a teacher needs to be aware of their unwillingness to listen. Is it justified or is it unjustified? This is the question to ask.


Detours happen when we respond to a specific point or comment by the student which changes the subject. This barrier is tricking because what is happening is that you are actually paying attention but allow the conversation to wander from the original purpose. Wandering conversation is natural and often happens when we are enjoying the talk.

Preventing this requires mental discipline to stay on topic and to not what you are listening for. This is not easy but is necessary at times.


Noise can be external or internal. External noise is factors beyond our control. For example, if there is a lot of noise in the classroom it may be hard to hear a student speak. A soft-spoken student in a loud place is frustrating to try and listen to even when there is a willingness to do so.

Internal noise has to do with what is happening inside your own mind If you are tired, sick, or feeling rush due to a lack of time, these can all affect your ability to listening to others.


Sometimes we listen until we want to jump in and try to defend a point are disagree with something. This is not so much as listening as it is hunting and waiting to pounce and the slightest misstep of logic from the person we are supposed to listen to.

It is critical to show restraint and focus on allowing the other side to be heard rather than interrupted by you.


We often view teachers as communicators. However, half the job of a communicator is to listen. At times, due to the position and the need to be the talker a teacher may neglect the need to be a listener. The barriers explained here should help teachers to be aware of why they may neglect to do this.

Principles of Management and the Classroom

Henri Fayol (1841-1925) had a major impact on managerial communication in his develop of 14 principles of management. In this post, we will look at these principles briefly and see how at least some of them can be applied in the classroom as educators.

Below is a list of the 14 principles of management by Fayol

  1. Division of work
  2. Authority
  3. Discipline
  4. Unity of command
  5. Unity of direction
  6. Subordination of individual interest
  7. Remuneration
  8. The degree of centralization
  9. Scalar chain
  10. Order
  11. Equity
  12. Stability of personnel
  13. Initiative
  14. Esprit de corps

Division of Work & Authority

Division of work has to do with breaking work into small parts with each worker having responsibility for one aspect of the work. In the classroom, this would apply to group projects in which collaboration is required to complete a task.

Authority is  the power to give orders and commands. The source of the authority cannot only be in the position. The leader must demonstrate expertise and competency in order to lead. For the classroom, it is a well-known tenet of education that the teacher must demonstrate expertise in their subject matter and knowledge of teaching.

Discipline & Unity of command

Discipline has to do with obedience. The workers should obey the leader. In the classroom this relates to concepts found in classroom management. The teacher must put in place mechanisms to ensure that the students follow directions.

Unity of command means that there should only be directions given from one leader to the workers. This is the default setting in some schools until about junior high or high school. At that point, students have several teachers at once. However, generally it is one teacher per classroom even if the students have several teachers.

Unity of Direction & Subordination i of Individual Interests

The employees activities must all be linked to the same objectives. This ensures everyone is going in the same directions. In the classroom, this relates to the idea of goals and objectives in teaching. The curriculum needs to be aligned with students all going in the same direction. A major difference here is that the activities may vary in terms of achieving the learning goals from student to student.

Subordination of individual interests in tells putting the organization ahead of personal goals. This is where there may be a break in managerial and educational practices. Currently, education  in many parts of the world are highly focused on the students interest at the expense of what may be most efficient and beneficial to the institution.

Remuneration & Degree of Centralization

Remuneration has to do with the compensation. This can be monetary or non-monetary. Monetary needs to be high enough to provide some motivation to work. Non-monetary can include recognition, honor or privileges. In education, non-monetary compensation is standard in the form of grades, compliments, privileges, recognition, etc. Whatever is done is usually contributes to intrinsic or extrinsic motivation.

Centralization has to do with who makes decisions. A highly centralized institution has top down decision-making while a decentralized institution has decisions coming from many directions. Generally, in  the classroom setting, decisions are made by the teacher. Students may be given autonomy over how to approach assignments or which assignments to do but the major decisions are made by the teacher even in highly decentralized classrooms due to the students inexperience and lack of maturity.

Scalar Chain & Order

Scalar chain has to do with recognizing the chain of command. The employee should contact the immediate supervisor when there is a problem. This prevents to many people going to the same person. In education, this is enforced by default as the only authority in a classroom is usually a teacher.

Order deals with having the resources to get the job done. In the classroom, there are many things the teacher can supply such as books, paper, pencils, etc. and even social needs such as attention and encouragement. However, sometimes there are physical needs that are neglected such as kids who miss breakfast and come to school hungry.

Equity & Stability of Personal

Equity means workers are treated fairly. This principle again relates to classroom management and even assessment. Students need to know that the process for discipline is fair even if it is dislike and that there is adequate preparation for assessments such as quizzes and tests.

Stability of personnel means keeping turnover to a minimum. In education, schools generally prefer to keep teacher long term if possible. Leaving during the middle of a school year whether a student or teacher is discouraged as it is disruptive.

Initiative & Esprit de Corps

Initiative means allowing workers to contribute new ideas and do things. This empowers workers and adds value to the company. In education, this also relates to classroom management in that students need to be able to share their opinion freely during discussions and also when they have concerns about what is happening in the classroom.

Esprit de corps focuses on morale. Workers need to feel good and appreciated. The classroom learning environment is a topic that is frequently studied in education. Students need to have their psychological needs meet through having a place to study that is safe and friendly.


These 14 principles are found in the business world, but they also have a strong influence in the world of education as well. Teachers can pull these principles any ideas that may be useful l in their classroom.

Hierarchical Regression in R

In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situation in which you want to see if adding additional variables to your model will significantly change the r2 when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.

We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables in three different models.

model 1 = intercept only
model 2 = Sales~Urban + US + ShelveLoc
model 3 = Sales~Urban + US + ShelveLoc + price + income
model 4 = Sales~Urban + US + ShelveLoc + price + income + Advertising

Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question

Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?

In this complex research question we essentially want to know if adding advertising cost will improve the model significantly in terms of the r square. The formal steps that we will following to complete this analysis is as follows.

  1. Build sequential (nested) regression models by adding variables at each step.
  2. Run ANOVAs in order to compute the R2
  3. Compute difference in sum of squares for each step
    1. Check F-statistics and p-values for the SS differences.
  4. Compare sum of squares between models from ANOVA results.
  5. Compute increase in R2 from sum of square difference
  6. Run regression to obtain the coefficients for each independent variable.

We will now begin our analysis. Below is some initial code


Model Development

We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.

model1 = lm(Sales~1,Carseats)
model2=lm(Sales~Urban + US + ShelveLoc,Carseats)
model3=lm(Sales~Urban + US + ShelveLoc + Price + Income,Carseats)
model4=lm(Sales~Urban + US + ShelveLoc + Price + Income + Advertising,Carseats)

We can now turn to the ANOVA analysis for model comparison #ANOVA Calculation We will use the anova() function to calculate the total sum of square for model 0. This will serve as a baseline for the other models for calculating r square

## Analysis of Variance Table
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1    399 3182.3                                   
## 2    395 2105.4  4   1076.89  89.165 < 2.2e-16 ***
## 3    393 1299.6  2    805.83 133.443 < 2.2e-16 ***
## 4    392 1183.6  1    115.96  38.406 1.456e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.

model 1 = sum of squares = 3182.3
model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)
model 3 = sum of squares = 1299.6 (add price and income)
model 4 = sum of squares = 1183.6 (add Advertising)

Each model is statistical significant which means adding each variable lead to some improvement.

By adding price and income to the model we were able to improve the model in a statistically significant way. The r squared increased by .25 below is how this was calculated.

2105.4-1299.6 #SS of Model 2 - Model 3
## [1] 805.8
805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by total sum of sqaure ie model 1
## [1] 0.2532131

When we add Advertising to the model the r square increases by .03. The calculation is below

1299.6-1183.6 #SS of Model 3 - Model 4
## [1] 116
116/ 3182.3 #SS difference of Model 3 and Model 4 divided by total sum of sqaure ie model 1
## [1] 0.03645162

Coefficients and R Square

We will now look at a summary of each model using the summary() function.

## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.713 -1.634 -0.019  1.738  5.823 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.8966     0.3398  14.411  < 2e-16 ***
## UrbanYes          0.0999     0.2543   0.393   0.6947    
## USYes             0.8506     0.2424   3.510   0.0005 ***
## ShelveLocGood     4.6400     0.3453  13.438  < 2e-16 ***
## ShelveLocMedium   1.8168     0.2834   6.410 4.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3317 
## F-statistic: 50.51 on 4 and 395 DF,  p-value: < 2.2e-16
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income, 
##     data = Carseats)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9096 -1.2405 -0.0384  1.2754  4.7041 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.280690   0.561822  18.299  < 2e-16 ***
## UrbanYes         0.219106   0.200627   1.092    0.275    
## USYes            0.928980   0.191956   4.840 1.87e-06 ***
## ShelveLocGood    4.911033   0.272685  18.010  < 2e-16 ***
## ShelveLocMedium  1.974874   0.223807   8.824  < 2e-16 ***
## Price           -0.057059   0.003868 -14.752  < 2e-16 ***
## Income           0.013753   0.003282   4.190 3.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.5854 
## F-statistic: 94.89 on 6 and 393 DF,  p-value: < 2.2e-16
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income + 
##     Advertising, data = Carseats)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2199 -1.1703  0.0225  1.0826  4.1124 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.299180   0.536862  19.184  < 2e-16 ***
## UrbanYes         0.198846   0.191739   1.037    0.300    
## USYes           -0.128868   0.250564  -0.514    0.607    
## ShelveLocGood    4.859041   0.260701  18.638  < 2e-16 ***
## ShelveLocMedium  1.906622   0.214144   8.903  < 2e-16 ***
## Price           -0.057163   0.003696 -15.467  < 2e-16 ***
## Income           0.013750   0.003136   4.384 1.50e-05 ***
## Advertising      0.111351   0.017968   6.197 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared:  0.6281, Adjusted R-squared:  0.6214 
## F-statistic: 94.56 on 7 and 392 DF,  p-value: < 2.2e-16

You can see for yourself the change in the r square. From model 2 to model 3 there is a 26 point increase in r square just as we calculated manually. From model 3 to model 4 there is a 3 point increase in r square. The purpose of the anova() analysis was determined if the significance of the change meet a statistical criterion, The lm() function reports a change but not the significance of it.


Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.

RANSAC Regression in Python

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

  1. The algorithm randomly selects a random amount of samples to be inliers in the model.
  2. All data is used to fit the model and samples that fall with a certain tolerance are relabeled as inliers.
  3. Model is refitted with the new inliers
  4. Error of the fitted model vs the inliers is calculated
  5. Terminate or go back to step 1 if a certain criterion of iterations or performance is not met.

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

  1. Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
  2. Model 2 will use multiple regression and  includes several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

  1. Data preparation
  2. Simple Regression Model fit
  3. Simple regression visualization
  4. Multiple regression model fit
  5. Multiple regression visualization

Below are the packages we will need for this example

import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preparation

For the data preparation, we need to do the following

  1. Load the data
  2. Create X and y dataframes
  3. Convert several categorical variables to dummy variables
  4. Drop the original categorical variables from the X dataframe

Below is the code for these steps


Most of this is self-explanatory, we first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we converted the sex, smoker, and dinner variables into dummy variables, and then we dropped the original categorical variables.

We can now move to fitting the first model that uses simple regression.

Simple Regression Model

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

  1. Instantiate an instance of the RANSACRegressor. We the call LinearRegression function, and we also set the residual_threshold to 2 indicate how far an example has to be away from  2 units away from the line.
  2. Next we fit the model
  3. We predict the values
  4. We calculate the r square  the mean absolute error

Below is the code for all of this.

ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)[['total_bill']],y)
Out[150]: 0.4381748268686979

Out[151]: 0.7552429811944833

The r-square is 44% while the MAE is 0.75.  These values are most comparative and will be looked at again when we create the multiple regression model.

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression. It also identifies which samples are inliers and outliers. Te coding will not be explained because of the complexity of it.

plt.xlabel('Total Bill')
plt.legend(loc='upper left')


Plot is self-explanatory as a handful of samples were considered outliers. We will now move to creating our multiple regression model.

Multiple Regression Model Development

The steps for making the model are mostly the same. The real difference takes place in make the plot which we will discuss in a moment. Below is the code for  developing the model.

ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0),y)
Out[154]: 0.4298703800652126

Out[155]: 0.7649733201032204

Things have actually gotten slightly worst in terms of r-square and MAE.

For the visualization, we cannot plot directly several variables t once. Therefore, we will compare the predicted values with the actual values. The better the correlated the better our prediction is. Below is the code for the visualization

plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')


The plots are mostly the same  as you cans see for yourself.


This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively this is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be use din machine learning.

Teaching English

Teaching English  or any other subject requires that the teacher be able to walk into the classroom and find ways to have an immediate impact. This is much easier said than done. In this post we look at several ways to increase the likelihood of being able to help students.

Address Needs

People’s reasons for learning a language such as English can vary tremendously. Knowing this, it is critical that you as a teacher know what the need in their learning. This allows you to adjust the methods and techniques that you used to help them learn.

For example, some students may study English for academic purposes while others are just looking to develop communications skills. Some students maybe trying to pass a proficiency examine in order to study at  university or in graduate school.

How you teach these different groups will be different. The academic students want academic English and language skills. Therefore, if you plan to play games in the classroom and other fun activities there may be some frustration because the students will not see how this helps them.

On the other hand, for students who just want to learn to converse in English, if you smother them with heavy readings and academic like work they will also become frustrated from how “rigorous” the course is. This is why you must know what the goals of the students are and make the needed changes as possible

Stay Focused

When dealing with students, it is tempting to answer and following ever question that they have. However, this can quickly lead to a lost of directions as the class goes here there and everywhere to answer every nuance question.

Even though the teacher needs to know what the students want help with the teacher is also the expert and needs to place limits over how far they will go in terms of addressing questions and needs. Everything cannot be accommodated no matter how hard one tries.

As the teacher, things that limit your ability to explore questions and concerns of students includes time, resources,  your own expertise, and the importance of the question/concern. Of course, we help students, but not to the detriment of the larger group.

Providing a sense of direction is critical as a teacher. The students have their needs but it is your goal to lead them to the answers. This requires a sense of knowing what you want and  being able to get there. There re a lot of experts out there who cannot lead a group of students to the knowledge they need as this requires communication skills and an ability to see the forest from the trees.


Teaching is a mysterious profession as so many things happen that cannot be seen or measured but clearly have an effect on the classroom. Despite the confusion it never hurts to determine where the students want to go and to find a way to get them there academically.

Improving Lecturing

Lecturing is a necessary evil at the university level. The university system was founded during a time when lecturing was the only way to share information. Originally, owning books was nearly impossible due to their price, there was no internet or computer, and  there were few options for reviewing material. For these reasons, lecturing was the go to approach for centuries.

With all the advantages in technology, the world has changed but lecturing has not. This has led to students becoming disengaged in the learning experience with the emphasis on lecture style teaching.

This post will look at times when lecturing is necessary as well as ways to improve the lecturing experience.

Times to Lecture

Despite the criticism given earlier, there are times when lecturing is an appropriate strategy. Below are some examples.

  • When there is a need to cover a large amount of content-If you need to get through a lot of material quickly and don’t have time for discussion.
  • Complex concepts/instructions-You probably do not want to use discovery learning to cover lab safety policies
  • New material-The first time through they may need to listen. When the topic is addressed later a different form of instruction should be employed

The point here is not to say that lecturing is bad but rather that it is overly relied upon by the typical college lecturer. Below are ways to improve lecturing when it is necessary.

Prepare Own Materials

With all the tools on the internet from videos to textbook supplied PowerPoint slides. It is tempting to just use these materials as they are and teach. However, preparing your own materials allows you to bring yourself and your personality into the teaching experience.

You can add anecdotes to illustrate various concepts, bring in additional resources, are leave information that you do not think is pertinent. Furthermore, by preparing your own material you know inside and out where you are going and when. This can also help to organize your thinking on a topic due to the highly structured nature of PowerPoint slides.

Even modifying others materials can provide some benefit. By owning your own material it allows you to focus less on what someone else said and more on what you want to say with your own materials that you are using.

Focus on the Presentation

If many teachers listen to themselves lecturing, they might be convinced that they are boring. When presenting a lecture a teacher should make sure to try to share the content extemporaneously. There should be a sense of energy and direction to the content. The students need to be convinced that you have something to say.

There is even a component of body language to this. A teacher needs to walk into a room like they “own the place” and speak accordingly. This means standing up straight, shoulders back with a strong voice that changes speed. These are all examples of having a commanding stage presence. Make it clear you are the leader through your behavior. Who wants to listen to someone who lacks self-confidence and mumbles?

Read the Audience

If all you do is have confidence  and run through your PowerPoint like nobody exists there will be little improvement for the students. A good speaker must read the audience and respond accordingly. If, despite all your efforts to prepare an interesting talk on a subject, the students are on their phones or even unconscience there is no point continuing but to do some sort of diversionary activity to get people refocus. Some examples of diversionary tactics include the following.

  • Have the students discuss something about the lecture for a moment
  • Have the students solve a problem of some sort related to the material
  • Have the students move. Instead of talking with someone next to them they have to find someone from a different part of the lecture room. A bit of movement is all it takes to regain conscientiousness.

The lecture should be dynamic which means that it changes in nature at times. Breaking up the content into 10 minutes periods followed by some sort of activity can really prevent fatigue in the listeners.


Lecturing is a classic skill that can still be used in the 21st century. However, given that times have  changed it is necessary to make some adjustments to how a  teacher approaches lecturing.

Teaching Large Classes

It is common for undergraduate courses, particularly introductory courses, to have a large number of students. Some introductory courses can have as many as 150-300 students in them. Combine this with the fact that it is common for the people with the least amount of teaching experience whether a graduate assistant or new non-tenured professor. This leads to a question of how to handle teaching so many students at one time.

In this post, we will look at some common challenges to teaching a large at the tertiary level. In particular, we will  look at the following.

  • Addressing student engagement
  • Grading assignments
  • Logistics

Student Engagement

Once a class reaches a certain size, it becomes difficult to engage students with discussion and one-on-one  interaction. This leaves a teacher with the most commonly used tool for university teaching, which is lecturing. However, most students find lecturing to be utterly boring and even some teachers find it boring.

Lecturing can be useful but it must be broken up into “chunks.” What this means is that perhaps you lecture for 8-10 minutes then have the students do something such as discuss a concept with their neighbor and then continue lecturing 8-10 minutes. The reason for 8-10 minutes is that is about how long a TV show runs until a commercial. This implies that 8-10 minutes is about how long someone can pay attention.

The during a break in the lecturing, students can teach a neighbor how/explain a concept to a neighbor, they can write a summary of what they just learned, or they can simply discuss what they learn. What happens during this time is up to the teacher but it should provide a way to continue to examine what is being learned without having to sit and only listen to the lecturer.

Grading Assignments

Grading assignments can be a nightmare in a large class. This is particularly true if the assessment has open-ended questions. The problem with open-ended questions is that they cannot be automated and mark by a computer.

If you must have open-ended questions that require humans to grade them here are some suggestions.

  • If the assessment is formative or a stepping stone in a project selective marking may be an option.  Selective marking involves only grade some papers through sampling and then inferring that other students made the same mistakes. You can then reteach the common mistakes to the whole class while saving a large amount of time.
  • Working with your teaching assistants you can have each assistant mark a section of an exam. This helps to spread the work around and prevent students from complaining about one TA who’s grading they dislike.
  • Peer review is another form of formative feedback that can work in large classes.

As mentioned earlier, for assessments that involve one answer, such as in lower level math classes, there are many automated options that are probably already available at your school such as scan tron sheets or online examinations.

Cheating can also be a problem for examines. However, thorough preparation and developing an assignment that is based on what is taught can greatly reduce cheating. Randomizing the exams and seating can help as well. For plagiarism there are many resources available online


Common logistical problem includes communication which can be through email or office hours. If a class has over 100 or even 200 students. The demands for personal help can quickly become overwhelming. This can be avoided by establishing clear lines of communication and how you will response.

Hopefully, there is some sort of way for you to communicate with all the students simultaneously such as through a forum or some other way. In this way, you can share the answer to a good question with everyone rather than individually several times.

Office hours can be adjusted by having them in groups rather than one-on-one. This allows the teacher to help several students at once rather than individually. Another idea is to have online office hours. Again you can meet several students at once but with the added convenience of not having to be in the same physical location.


Large classes are a lot of work and can be demanding for even experienced teachers. However, with some basic adjustments it is possible to shoulder this load with care.

Teaching Materials

Regardless of what level a teacher teaches at you are always looking or materials and activities to do in your class. It is always a challenge to have new ideas and activities to support and help students because the world changes. This leads to an constant need to remove old inefficient activities and bring in new fresh one. The primary purpose of  activities is to provide practical application of the skills taught and experience in school.

For the math teacher you can naturally make your own math problems. However, this can quickly become quietly. One solution to this is to employed other worksheets that provide growth opportunities for the students with stressing out the teacher.

There are many great websites for this. For example, provides many different types of worksheets to help students. They have some great simple math worksheets like the ones below

addition_outer space_answers

addition_outer space

There  are many more resources available at  as well as other sites. There is no purpose or benefit to reinventing the wheel. The incorporation of the assignments of others is a great way to expand the resources you have available without the stress of developing them yourself.

Review of “Usborne Mysteries & Marvels of Nature”

The book Mysteries & Marvels of Nature by Elizabeth Dalby (pp. 128) is a classic picture book focused on nature for children.

The Summary

This text provides explanations of various aspects of animals as found in nature. Some of the topics covered is how animals eat, move, defend themselves, communicate, and their life cycles. Each section has various of examples of the theme with a plethora of colorful photos.

The text that is include provides a brief description of the animal(s) and what they are doing in the picture. Leaping tigers, swimming fish, and even egg-laying snakes are all apart of this text.

The Good

The pictures are fascinating, and they really help in making the text come alive. There is a strong sense of color contrast  in the text and you can tell the authors spent a great deal planning the layout of the text. There are pictures of whales, cuttlefish, frogs, beetles, etc.

There is also the use of drawings to depict scenes that may be hard to get in nature. For example, the book explains how the darkling beetle escapes prey by spraying a liquid that stinks. Off, course there is an illustration of this in the text and not a photo. Another example shows a frog waking from hibernating underground.

This text would work great for almost any age group. Young kids would love looking the pictures while older kids can read the text. The book is also largest enough to accommodate a medium size class for a whole class reading.

The Bad

There is little to complain with in regard to this book. It is a paperback text. Therefore, it would not last long in most classrooms. However, the motivation for paperback my  have been to keep the price down. At $16.99, this is a fairly cheap book  for a classroom. Besides the quality of the material there is little to criticize about this book.

The Recommendation

This is a great text for any classroom. Students will spend hours fascinated by the pictures. Older students may enjoy the pictures as well while they need to focus on the text. For families this is an even better text because in most families there would be a reduction in the number of hands that are touching the text.

Teaching Smaller Class at University

The average teacher prefers small classes. However, there are times when the enrollment in a class that is usually big (however you define this) takes a dip in size and suddenly a class has become “small.” This can be harder to deal with than many people tend to believe. There are some aspects of the teaching and learning experience that need to be adjusted because the original approach is not user-friendly for the small class.

Another time when a person  often struggles with teaching small classes is if they never had the pleasure of experiencing a small class as a student. If your learning experience was a traditional large class lecture style and all of a sudden  you are teaching at a small liberal arts college there will need to be some adjustments too.

In this post, we will look at some pros and cons of teaching smaller classes at the tertiary level. In addition, we will look at some ways to address the challenges of teaching smaller classes for those who have not had this experience.


With a smaller class size there is an overall decrease in the amount of work that has to be done. This means few assignments to mark, less preparation of materials, etc. In addition, because the class is smaller it is not necessary to be as formal and structured with the class. In other words, there is no need to have routines in place because there is little potential chaos that can ensue if everyone does what they want.

The teaching can also be more personalized. You can adjust content and address individual question much easier than when dealing with a larger class. You can even get to know the students in a much more informal manner that is not possible in a large lecturer hall.

Probably the biggest advantage  for a new teacher is the ability to make changes and adjustments during a semester. A bad teacher in a large class leads to a large problem. However, a bad teacher in a small class is a small problem. If things are not working, it is easier to change things in a small group. The analogy that I like to make is that it is easier to do an u-turn on a bike instead of in a bus. For new teachers who do not quite know how to teach, a smaller class can help them to develop their skills for larger classes


There are some challenges with small classes especially for people with a large class experience. One thing you will notice when teaching a large class is a lost of energy. If you are used to lecturing to 80 students and suddenly are teaching 12 it can seem as if that learning spark is gone.

The lost of energy can contribute to a lost of discipline. The informal nature of small classes can lead to students having a sense seriousness about the course. In larger classes there is a sense of “sink or swim”.  This may not be the most positive mindset but it helps people to take the learning experience seriously. In smaller classes this can sometimes be lost.

Attendance is another problem. In a large class having several absences is not a big deal. However, if your class is small, several absences is almost like a plague that wipes out a village. You can still teach but nobody is there or the key people who participate in the discussion are not there or there is no one to listen to their comments. This can lead to pressure to cancel class which causes even more problems.


There are several things that a teacher can do to have success with smaller class sizes. One suggestion is to adjust your teaching style. Lecturing is great for large classes in which content delivery is key. However, in  smaller class a more interact, discussion-like approach can be taken. This helps to bring energy back to the classroom as well as engage the students.

Sometimes, if this is possible, changing from a large room to a smaller one can help to bring back the energy that is lost when a class is smaller. Many times the academic office will put class in a certain classroom regardless of size. This normally no longer a problem with all the advances in scheduling and registration software. However, if you are teaching 10 students in an auditorium perhaps it is possible to find a smaller more intimate location.

Another way to deal with smaller classes is through increasing participation. This is often not practical in a large class. However, interaction can be useful in increasing the engagement.


The size of the class is not as important as the ability of the teacher to adjust to it in order to help students to learn. Small classes need a slightly different approach  than traditional large classes at university. With a few minor adjustments, a teacher can still find ways to help students even if the class is not quite the size everyone was expecting

Combining Algorithms for Classification with Python

Many approaches in machine learning involve making many models that combine their strength and weaknesses to make more accuracy classification. Generally, when this is done it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used it is the same algorithm but with variances in sampling and the use of features.

In addition to this common form of ensemble learning there is also a way to combine different algorithms to make predictions. For one way of doing this is through a technique called stacking in which the predictions of several models are passed to a higher model that uses the individual model predictions to make a final prediction. In this post we will look at how to do this using Python.


This blog usually tries to explain as much  as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.

  • Already familiar with python
  • Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
  • Familiar with cross-validation and hyperparameter tuning

We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.

The steps we will take in this post are as follows

  1. Data preparation
  2. Individual model development
  3. Ensemble model development
  4. Hyperparameter tuning of ensemble model
  5. Ensemble model testing

Below is all of the libraries we will be using in this post

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

Data Preparation

We need to perform the following steps for the data preparation

  1. Load the data
  2. Select the independent variables to be used in the analysis
  3. Scale the independent variables
  4. Convert the dependent variable from text to numbers
  5. Split the data in train and test sets

Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post but you can explore this on your own. The data was also scaled because many algorithms are sensitive to this so it is best practice to always scale the data. We will use the StandardScaler function for this. Lastly, the dpeendent variable currently consist of values of “yes” and “no” these need to be convert to numbers 1 and 0. We will use the LabelEncoder function for this. The code for all of this is below.

X=pd.DataFrame(X_scale, index=X.index, columns=X.columns)
X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)

We can now proceed to individul model development

Individual Model Development

Below are the steps for this part of the analysis

  1. Instantiate an instance of each algorithm
  2. Check accuracy of each model
  3. Check roc curve of each model

We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code

logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0)

We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.

clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf']
for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [Logistic Regression]
accuracy: 0.72 (+/- 0.06) [Decision Tree]
accuracy: 0.66 (+/- 0.06) [KNN]
accuracy: 0.70 (+/- 0.05) [LDAclf]
roc auc: 0.71 (+/- 0.08) [Logistic Regression]
roc auc: 0.70 (+/- 0.07) [Decision Tree]
roc auc: 0.62 (+/- 0.10) [KNN]
roc auc: 0.70 (+/- 0.08) [LDAclf]

The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.

Ensemble Model Development

The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results

 mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [LR]
accuracy: 0.72 (+/- 0.06) [tree]
accuracy: 0.66 (+/- 0.06) [knn]
accuracy: 0.70 (+/- 0.05) [LDA]
accuracy: 0.70 (+/- 0.04) [combine]
roc auc: 0.71 (+/- 0.08) [LR]
roc auc: 0.70 (+/- 0.07) [tree]
roc auc: 0.62 (+/- 0.10) [knn]
roc auc: 0.70 (+/- 0.08) [LDA]
roc auc: 0.72 (+/- 0.09) [combine]

You can see that the combine model as similar performance to the individual models. This means in this situation that the ensemble learning did not make much of a difference. However, we have not tuned are hyperparameters yet. This will be done in the next step.

Hyperparameter Tuning of Ensemble Model

We are going to tune the decision tree, logistic regression, and KNN model. There are many different hyperparameters we can tune. For demonstration purposes we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters we will run the model and pull the best hyperparameters values based on the roc auc as the metric. Below is the code and the output.



{'decisiontreeclassifier__max_depth': 3,
 'kneighborsclassifier__n_neighbors': 9,
 'logisticregression__C': 10}

Out[35]: 0.7196051482279385

The best values are as follows

  • Decision tree max depth set to 3
  • KNN number of neighbors set to 9
  • logistic regression C set to 10

These values give us a roc auc of 0.72 which is still poor . We can now use these values when we test our final model.

Ensemble Model Testing

The following steps are performed in the analysis

  1. Created new instances of the algorithms with the adjusted hyperparameters
  2. Run the ensemble model
  3. Predict with the test data
  4. Check the results

Below is the first step

logclf=LogisticRegression(penalty='l2',C=10, random_state=0)

Below is step two

mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1]),y_train)

Below are steps 3 and four


col_0   0    1
0      29   58
1      12  127
             precision    recall  f1-score   support

          0       0.71      0.33      0.45        87
          1       0.69      0.91      0.78       139

avg / total       0.69      0.69      0.66       226

The accuracy is about 69%. One thing that is noticeable low is the  recall for people who do not live in the city. This probably one reason why the overall roc auc score is so low. The f1-score is also low for those who do not live in the city as well. The f1-score is just a combination of precision and recall. If we really want to improve performance we would probably start with improving the recall of the no’s.


This post provided an example of how you can  combine different algorithms to make predictions in Python. This is a powerful technique t to use. Off course, it is offset by the complexity of the analysis which makes it hard to explain exactly what the results mean if you were asked tot do so.

Review of The Beginner’s American History

In this post, we take a look at The Beginners American History. This book was written by D.H. Montgomery in the late 19th century and was updated by John Holzmann (pp. 269).


This is a classic text that covers the history of the United States from Christopher Columbus’ discovery of America until the Gold Rush of California. All of the expected content is there from Captain John Smith, George Washington, to even Eli Whitney. Other information shared includes the various wars in America from the battles with the British for independence to the wars with the Mexicans and Indians for control of the land in what is now the United States.

The Good

This would be a great personal reader for an older student. It is primarily text based and there are few illustrations. The writing is simple for the most part and is not overly weighed down with a lot of academic insights and communication. 

The illustrations that are included tend to be an ever-changing map that shows how America is being slowly taken over by the American colonist. This provides the reader of a perspective of time and the growth of the United States.

It is also beneficial for students to get an older perspective on history. The way Montgomery viewed American history in the 19th century is vastly different from how historians see it today.

The Bad

As previously mentioned, the book is text heavy. This makes it inappropriate for small children. In addition, there are few illustrations in the book. This can be a detriment to students who learn through their senses. This would also make the text hard to use in a whole-class situation.

It is a children’s book, however, the portrayal of content is in the most rudimentary manner. This may be due to the context in which the book was written as well as the purpose for this book. Either way, the book was rich on  content but lacked depth.

The Recommendation

For personal reading this is an excellent book. However, in an academic context, I believe there are superior options to the book discussed here. The age of the text provides a distinct perspective on history but lacks the content for deep learning today.

Data Science Pipeline

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

  1. What to do
  2. When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know how to do it.

In this post, we will look at a process that that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat, this process is focused primarily on supervised learning which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

  1. Data preparation (data mugging)
  2. Model training
  3. Model testing

Off course, it is much more complicated than this but this is the minimum. Within each of these steps there are several substeps, However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary but general not all of them happen ever analysis. Below is a list of steps at this level

  • Data mugging
  • Scaling
  • Normality
  • Dimension reduction/feature extraction/feature selection
  • Train, test, validation split

Data mugging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization reduces the variables to a range of 0 – 1. Standardization involves converting the examples in the variable to their respective z-score. Which one you use depends on the situation but normally it is expected to do this.

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved checking for normality is doable with a few tests and some visualizations. If normality is violated various transformations can be used to deal with this problem. Keep mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reduce the number of variables that will be included in the final analysis. This is done through factor analysis or principal component analysis. This reduction  in the number of variables is also an example of feature extraction. In some context, feature extraction is the in goal in itself. Some algorithms make their own features such as neural networks through the use of hidden layer(s)

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization such or in smaller datasets with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analyst split the data into three parts to avoid overfitting the data to the test set. There is now a training a set, a validation set, and a testing set. The  validation set allows you to check the model performance several times. Once you are satisfied you use the test set once at the end.

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

  1. Determine the metric(s) for success
  2. Creating a grid of several hyperparameter values
  3. Cross-validation
  4. Selection of the most appropriate hyperparameter values

The first thing you have to do and this is probably required is determined how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification or mean squared error for a regression model or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameters settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Using cross-validation always you to stabilize the results through averaging the results of the model over several folds of the data if you are using k-folds cross-validation. This also helps to improve the results of the hyperparameters as well.  There are several types of cross-validation but k-folds is probably best initially.

The information for the metric, hyperparameters, and cross-validation are usually put into  a grid that then runs the model. Whether you are using R or Python the printout will tell you which combination of hyperparameters is the best based on the metric you determined.

Validation test

When you know what your hyperparameters are you can now move your model to validation or straight to testing. If you are using a validation set you asses your models performance by using this new data. If the results are satisfying based on your metric you can move to testing. If not, you may move back and forth between training and the validation set making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.


This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture for a somewhat complex process

Gradient Boosting Regression in Python

In this  post, we will take a look at gradient boosting for regression. Gradient boosting simply makes sequential models that try to explain any examples that had not been explained by previously models. This approach makes gradient boosting superior to AdaBoost.

Regression trees are mostly commonly teamed with boosting. There are some additional hyperparameters that need to be set  which includes the following

  • number of estimators
  • learning rate
  • subsample
  • max depth

We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.


We can now move to creating our baseline model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create  several regression trees. The difference between the regression trees will be the max depth. The max depth has to with the number of nodes python can make to try to purify the classification.  We will then decide which tree is best based on the mean squared error.

The first thing we need to do is set the arguments for the cross-validation. Cross validating the results helps to check the accuracy of the results. The rest of the code  requires the use of for loops and if statements that cannot be reexplained in this post. Below is the code with the output.

for depth in range (1,10):

You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.

However, before we create our gradient boosting model. we need to tune the hyperparameters of the algorithm.

Hyperparameter Tuning

Hyperparameter tuning has to with setting the value of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better at knowing where to set these values than the computer. Therefore, the process that is commonly used is to have the algorithm use several combinations  of values until it finds the values that are best for the model/. Having said this, there are several hyperparameters we need to tune, and they are as follows.

  • number of estimators
  • learning rate
  • subsample
  • max depth

The number of estimators is show many trees to create. The more trees the more likely to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use. Max depth was explained previously.

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function such as estimator, grid, cv, etc. Below is the code.


We can now run the code and determine the best combination of hyperparameters and how well the model did base on the means squared error metric. Below is the code and the output.,y)
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

Out[14]: -160.51398257591643

The hyperparameter results speak for themselves. With this tuning we can see that the mean squared error is lower than with the baseline model. We can now move to the final step of taking these hyperparameter settings and see how they do on the dataset. The results should be almost the same.

Gradient Boosting Model Development

Below is the code and the output for the tuned gradient boosting model

Out[18]: -160.77842893572068

These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.


In this post, we looked at how to  use gradient boosting to improve a regression tree. By creating multiple models. Gradient boosting will almost certainly have a better performance than other type of algorithms that rely on only one model.

Review of The Landmark History of the American People Vol 1

The book The Landmark History of the American People Vol 1 by Daniel Boorstin (pp. 169) provides a rich explanation of the history of the United States from the dawn of colonial America until the end of the 19th century. Daniel Boorstin was a rather famous author and  a former Librarian of Congress. Holding such as position gives you the esteem in which this man was held.

The Summary

This book covers many interesting aspects of early American history. It begins with the development of the colonies. From there the text provides A detailed account of the eventual split from Great Britain. The next focus of the text is on the America heading west through the expansion that involved purchasing land, warfare, and unfortunate exploitation.

The latter part of the text focuses somewhat more on such ideas as life out in the western frontier. There is also a mention of the early effects of the industrial revolution with the development of the train and all the advantages and dangers that this brought.

The Good

This book provides a lot of interesting details about life in America. For example, on the frontier, Americans developed something called the balloon frame house. This type of building was faster and relatively safe when compared to the European model of building at this time. This kinds of little details are not common in most text for children

The text is also full illustrations that capture the time period in which the author was writing about. From pictures of puritans, to Indians, to even photos of various famous American historical sites. This text has a little of everything.

The Bad

Although the text is full of illustrations, it is still primarily text based. In addition, even though the text is full of interesting details this can also be a disadvantage of you or your student  needs the big picture about a particular time period. Yes, I did compliment the development of the balloon frame house. However, what is the benefit of knowing this small detail from American history?

Younger children would struggle with the writing and text heavy nature of the book. However, to be fair, perhaps the author was gearing this book towards older students. However, in the preface, the editor, mentions that this book was meant to be read by parents to 3rd or 4th graders. This seems like a tall task given the content.

The Recommendation

This book would be good for older kids. Perhaps middle school, who have the reading comprehension and perhaps the curiosity to handle such a text. However, for younger children I am convinced the text is too complicated for them to appreciate it. One way to address this is to focus on the visual aspects of the book and not worry too much of getting every detail of the challenging text.

Gradient Boosting Classification in Python

Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than adaboost. Some differences between the two algorithms is that gradient boosting uses optimization for weight the estimators. Like adaboost, gradient boosting can be used for most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number the purer the classification become. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up until the whole number 1. If the value is set to 1 it becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is simple in this situtation. All we need to do is load are dataset, dropping missing values, and create our X dataset and y dataset. All this happens in the code below.


We will now develop our baseline decision tree model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

for depth in range (1,10):
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

It appears that when the max depth is limited to 1 that we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm

Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are as follows

  • number of estimators
  • learning rate
  • subsample
  • max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values from mad earlier, and n_jobs all together in one place. Below is the code for this.


You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then as second run through is done to find the learning rate. In our example, we are doing everything at once which is why it takes longer. Below is the code with the out for best parameters and best score.,y)
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.

Gradient Boosting Model

Below is the code and results for the model with the predetermined hyperparameter values.

Out[17]: 0.742279411764706

You can see that the results are similar. This is just additional information that the gradient boosting model does outperform the baseline decision tree model.


This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python

This post will share how to use the adaBoost algorithm for regression in Python. What boosting does is that it makes multiple models in a sequential manner. Each newer model tries to successful predict what older models struggled with. For regression, the average of the models are used for the predictions.  It is often most common to use boosting with decision trees but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting, is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Regression decision tree baseline model
  3. Hyperparameter tuning of Adaboost regression model
  4. AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

Data Preparation

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.


We will now proceed to creating the baseline regression decision tree model.

Baseline Regression Tree Model

The purpose of the baseline model is for comparing it to the performance of our model that utilizes adaBoost. In order to make this model we need to Initiate a Kfold cross-validation. This will help in stabilizing the results. Next we will create a for loop so that we can create several trees that vary based on their depth. By depth, it is meant how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

for depth in range (1,10):
score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

Hyperparameter Tuning

For hyperparameter tuning we need to start by initiating our AdaBoostRegresor() class. Then we need to create our grid. The grid will address two hyperparameters which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning indicates how each tree contributes to the overall results. There is one more parameters which is random_state but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function you have to set the estimator which is adaBoostRegressor, the parameter grid which we just made, the cross validation which we made when we created the baseline model, and the n_jobs which allocates resources for the calculation. Below is the code.


Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.,y)
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with the refined hyperparameters.

AdaBoost Regression Model

Below is our model but this time with the refined hyperparameters.

Out[36]: -174.52604137201791

You can see the score is not as good but it is within reason.


In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help to strengthen a model in many ways at times.

Review of Children’s Encyclopedia of American History

The book Children’s Encyclopedia of American History by David King (pp. 320) provides a rich explanation of the the background and shaping of America.

The Summary
This text covers American history from the 11th century all the way until the beginning of the 21st century. Over this 1,000 years of American history the text goes from  explaining the discovery of the new world, to the turbulent times of colonial America, the wars of the 18th and 19th century, all the way to dealing with terrorism.

All of the classic famous names of American history such as George Washington, Benjamin Franklin, Andrew Jackson, Abraham Lincoln, Theodore Roosevelt, John F Kennedy, and even Barrack Obama.

The Good

The text offers a rich array of authentic photos and artifacts as images in the book. Almost no detail was left undone. Pictures of buildings, famous people, and even toys of different eras are provided. There are paintings of gold  miners, maps, Indians, athletes, etc. There is even commentary on the accuracy of some of the paintings. For example, one painting shows George Washington standing in a book. The author points out that this would be dangerous as the boat might tip over. In  addition, the artist of the painted the wrong boat and the US flag was in the painting but did not exist at the time in history that the painting was depicting.

There are also lots of maps throughout the book describing America at different times in history. There are also maps of other countries when other countries interact with the US. For example, there is a map of Korea when it was divided when the book discusses the Korean War.

The book also addresses major changes in technology, influential people in such fields as arts and entertainment. Consistent with an encyclopedia, this book has a little of everything.

The Bad

There is little to disparage about this book. It is highly visually appealing for young readers. Even adults would found the text interesting especially if they do not have a strong background in history. If a criticism had to be made it might be that the book is more focus on visuals and lacks substantive text. However, this is not much of a criticism as the book is geared towards children and focus more on pictures than text.

The Recommendation

For the history teacher, this is a great text to have to augment other studies in history. The book is fairly large and could possible be used to teach a medium size group of children. The picture in the text make history come alive and remove some of the abstract nature to learning about the past.

The was even built well as it is a hard cover text that should be able to survive years of stress from the joys of children. As such, this book is highly recommended

AdaBoost Classification in Python

Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successful predict what prior models were unable to do. The average for regression and majority vote for classification are used. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.

Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Decision tree baseline model
  3. Hyperparameter tuning of Adaboost model
  4. AdaBoost model development

Below is some initial code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

Data preparation is minimal in this situation. We will load are data and at the same time drop any NA using the .dropna() function. In addition, we will place the independent variables in dataframe called X and the dependent variable in a dataset called y. Below is the code.


Decision Tree Baseline Model

We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output

for depth in range (1,10):
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.

We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the  AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

Hyperparameter Tuning AdaBoost Model

In order to tune the hyperparameters there are several things that we need to do. First we need to initiate  our AdaBoostClassifier with some basic settings. Then We need to create our search grid with the hyperparameters. There are two hyperparameters that we will set and they are number of estimators (n_estimators) and the learning rate.

Number of estimators has to do with how many trees are developed. The learning rate indicates how each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and for cross-validation. Below is the code for all of this


We can now run the model of hyperparameter tuning and see the results. The code is below.,y)
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
Out[34]: 0.7425149700598802

We can see that if the learning rate is set to 0.01 and the number of estimators to 1000 We can expect an accuracy of 74%. This is superior to our baseline model.

AdaBoost Model

We can now rune our AdaBoost Classifier based on the recommended hyperparameters. Below is the code.

Out[36]: 0.7415441176470589

We knew we would get around 74% and that is what we got. It’s only a 3% improvement but depending on the context that can be a substantial difference.


In this post, we look at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models to determine the most accurate classification in a sequential manner. Doing this will often lead to an improvement in the prediction of a model.

Research Questions, Variables, and Statistics

Working with students over the years has led me to the conclusion that often students do not understand the connection between variables, quantitative research questions and the statistical tools

used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary skills in statistics to succeed.

This means that the students have a theoretical understanding of statistics but struggle in the application of it. In this post, we will look at some of the connections between research questions and statistics.


Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.

Another aspect that can make this confusing is that many variables can be measured more than one way. Sometimes the variable “salary” can be measured in a continuous manner or in a categorical manner. The superiority of one or the other depends on the goals of the research.

It is critical to support students to have a thorough understanding of variables in order to support their research.

Types of Research Questions

In general, there are two types of research questions. These two types are descriptive and relational questions. Descriptive questions involve the use of descriptive statistic such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (ie the average height is 172cm) rather than relying on qualitative descriptions of it (ie the people are tall).

Below are several example research questions that are descriptive in nature.

  1. What is the average height of the participants in the study?
  2. What proportion of the sample is passed the exam?
  3. What are the respondents perceptions towards the cafeteria?

These questions are not intellectually sophisticated but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items that are used to measure respondents perception of the cafeteria.

Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.

Relational questions look for the connection or link between variables. Within this type there are two sub-types. Comparison question involve comparing groups. The other sub-type is called relational or an association question.

Comparison questions involve comparing groups on a continuous variable. For example, comparing men and women by height. What you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is looking for a relationship just not in the way that many student understand. Common comparison questions include the following.male

  1. Is there a difference in height by gender among the participants?
  2. Is there a difference in reading scores by grade level?
  3. Is there a difference in job satisfaction in based on major?

Each of these questions can be answered using ANOVA or if we want to get technical and there are only two groups (ie gender) we can use t-test. This is a broad overview and does not include the complexities of one-sample test and or paired t-test.

Relational or association question involve continuous variables primarily. The goal is to see how variables move together. For example, you may look for the relationship between height and weight of students. Common questions include the following.

  1.  Is there a relationship between height and weight?
  2. Does height and show size explain weight?

Questions 1 can be answered by calculating the correlation. Question 2 requires the use of linear regression in order to answer the question.


The challenging as a teacher is showing the students the connection between statistics and research questions from the real world. It takes time for students to see how the question inspire the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what to do in research based on the statistical knowledge one has.

Recommendation Engine with Python

Recommendation engines make future suggestion to a person based on their prior behavior. There are several ways to develop recommendation engines but for purposes, we will be looking at the development of a user-based collaborative filter. This type of filter takes the ratings of others to suggest future items to another user based on the other user’s ratings.

Making a recommendation engine in Python actually does not take much code and is somewhat easy consider what can be done through coding. We will make a movie recommendation engine using data from movielens.


Below is the link for downloading the zip file 

Inside the zip file are several files we will use. We will use each in a few moments. Below is the initial code to get started

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
import numpy as np

We will now make 4 dataframes. Dataframes 1-3 will be the user, rating, and movie title data. The last dataframe will be a merger of the first 3. The code is below with a printout of the final result.

user = pd.read_table('/home/darrin/Documents/python/new/ml-1m/users.dat', sep='::', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'],engine='python')
rating = pd.read_table('/home/darrin/Documents/python/new/ml-1m/ratings.dat', sep='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'],engine='python')
movie = pd.read_table('/home/darrin/Documents/python/new/ml-1m/movies.dat', sep='::', header=None, names=['movie_id', 'title', 'genres'],engine='python')
MovieAll = pd.merge(pd.merge(rating, user), movie)

We now need to create a matrix using the .pivot_table function. This matrix will include ratings and user_id from our “MovieAll” dataframe. We will then move this information into a dataframe called “movie_index”. This index will help us keep track of what movie each column represents. The code is below.

rating_mtx_df = MovieAll.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)

There are many variables in our matrix. This makes the computational time long and expensive. To reduce this we will reduce the dimensions using the TruncatedSVD function. We will reduce the matrix to 20 components. We also need to transform the data because we want the Vh matrix and no tthe U matrix. All this is hand in the code below.

recomm = TruncatedSVD(n_components=20, random_state=10)
R = recomm.fit_transform(rating_mtx_df.values.T)

What we saved our modified dataset as “R”. If we were to print this it would show that each row has two columns with various numbers in it that cannot be interpreted by us.  Instead, we will move to the actual recommendation part of this post.

To get a recommendation you have to tell Python the movie that you watch first. Python will then compare this movie with other movies that have a similiar rating and genera in the training dataset and then provide recommendation based on which movies have the highest correlation to the movie that was watched.

We are going to tell Python that we watched “One Flew Over the Cuckoo’s Nest” and see what movies it recommends.

First, we need to pull the information for just “One Flew Over the Cuckoo’s Nest”  and place this in a matrix. Then we need to calculate the correlations of all our movies using the modified dataset we named “R”. These two steps are completed below.

cuckoo_idx = list(movie_index).index("One Flew Over the Cuckoo's Nest (1975)")
correlation_matrix = np.corrcoef(R)

Now we can determine which movies have the highest correlation with our movie. However, to determine this, we must gvive Python a range of acceptable correlations. For our purposes we will set this between 0.93 and 1.0. The code is below with the recommendations.

P = correlation_matrix[cuckoo_idx]
print (list(movie_index[(P > 0.93) & (P < 1.0)]))
['Graduate, The (1967)', 'Taxi Driver (1976)']

You can see that the engine recommended two movies which are “The Graduate” and “Taxi Driver”. We could increase the number of recommendations by lower the correlation requirement if we desired.


Recommendation engines are a great tool for generating sales automatically for customers. Understanding the basics of how to do this a practical application of machine learning


Elastic Net Regression in Python

Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that with elastic net the algorithm can remove weak variables altogether as with lasso or to reduce them to close to zero as with ridge. All of these algorithms are examples of regularized regression.

This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.

  1. Data preparation
  2. Baseline model development
  3. Elastic net model development

To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.

from pydataset import data
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Data Preparation

We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our  and y datasets. Below is the code.

df.loc[ 'male', 'sex'] = 0
df.loc[ 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
df.loc[df.child== 'no', 'child'] = 0
df.loc[df.child== 'yes','child'] = 1
df['child'] = df['child'].astype(int)

We can now proceed to creating the baseline model

Baseline Model

This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.


This mean standard error score of 1.05 is our benchmark for determining if the elastic net model will be better or worst. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to combine the two columns.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
{'religious': 0.04235281110639178,
'age': -0.009059645428673819,
'sex': 0.08882013337087094,
'ym': -0.030458802565476516,
'education': 0.06810255742293699,
'occupation': -0.005979506852998164,
'nbaffairs': -0.07882571247653956}

We will now move to making the elastic net model.

Elastic Net Model

Elastic net, just like ridge and lasso regression, requires normalize data. This argument  is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we create for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.

When the l1_ratio is set to 0 it is the same as ridge regression. When l1_ratio is set to 1 it is lasso. Elastic net is somewhere between 0 and 1 when setting the l1_ratio. Therefore, in our grid, we need to set several values of this argument. Below is the code.


We will now fit our model and display the best parameters and the best results we can get with that setup.,y)
Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}
Out[74]: 1.0816514028705004

The best hyperparameters was an alpha set to 0.001 and a l1_ratio of 0.8. With these settings we got an MSE of 1.08. This is above our baseline model of MSE 1.05  for the baseline model. Which means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to the recommended values and run on the data.


Now our values are about the same. Below are the coefficients

coef_dict_baseline = {}
for coef, feat in zip(elastic.coef_,X.columns):
coef_dict_baseline[feat] = coef
{'religious': 0.01947541724957858,
'age': -0.008630896492807691,
'sex': 0.018116464568090795,
'ym': -0.024224831274512956,
'education': 0.04429085595448633,
'occupation': -0.0,
'nbaffairs': -0.06679513627963515}

The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this values was no good to the algorithm. Traditional regression cannot do this.


This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

Lasso Regression with Python

Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.

Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “CaSchools” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.

  1. Data preparation
  2. Develop a baseline linear model
  3. Develop lasso regression model

The initial code is as follows

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

Data Preparation

The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent w=but it was decided to leave them out anyway. Below is the code.


Baseline Model

We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we predicted future values of our regression model and save the results to the object first_model. Lastly, we printed the results.

Below are the coefficient for the baseline regression model.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
{'teachers': 0.00010011947964873427,
'calwpct': -0.07813766458116565,
'mealpct': -0.3754719080127311,
'compstu': 11.914006268826652,
'expnstu': 0.001525630709965126,
'str': -0.19234209691788984,
'avginc': 0.6211690806021222,
'elpct': -0.19857026121348267}

The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.

Lasso Model

For our lasso model, we have to determine what value to set the l1 or alpha to prior to creating the model. This can be done with the grid function, This function allows you to assess several models with different l1 settings. Then python will tell which setting is the best. Below is the code.


We start be instantiate lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we setup our grid, we include the estimator, and parameter grid, and scoring. The alpha is set using logspace. We want values between -5 and 2, and we want 8 evenly spaced settings for the alpha. The other arguments include cv which stands for cross-validation. n_jobs effects processing and refit updates the parameters. 

After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.

Out[55]: {'alpha': 1e-05}
Out[56]: 85.38831122904011

`The alpha is set almost to zero, which is the same as a regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.


The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.

coef_dict_baseline = {}
for coef, feat in zip(lasso.coef_,X.columns):
coef_dict_baseline[feat] = coef
{'teachers': 9.795933425676567e-05,
'calwpct': -0.07810938255735576,
'mealpct': -0.37548182158171706,
'compstu': 11.912164626067028,
'expnstu': 0.001525439984250718,
'str': -0.19225486069458508,
'avginc': 0.6211695477945162,
'elpct': -0.1985510490295491}

The coefficient values are also slightly different. The only difference is the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscrs. That is ironic to say the least.


Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python  is fairly simple. This yet another tool that can be used in statistical analysis.

Differences in Thinking

Critical thinkers and problem solvers are two groups of people.  Sadly, these two groups are almost mutually exclusive. However, it is important that thinkers and solvers develop both skillsets to a certain level of competence.

The purpose of this post is to try and explain in detail critical thinking vs problem-solving in term of individual differences.

Thinking is a slow deliberate process that takes to do. In other words, a person must decide to think. Since there is a requirement of active effort, thinking is something that few people value and appreciate as they should.

Thinking involves processing information from the viewpoint of central processing. This means to examine the content of a message for its worth. Furthermore, when a person is developing their own arguments thinking involves developing support for one’s position. Often when people argue or disagree today they tend to get upset. This is an indication that their emotions are determining their position rather than their mind. They might use their mind on occasion to strengthen their argument but the foundation of their position is often emotional rather than based on strong thought.


Developing the mind usually involves reading. Reading exposes an individual to good and poor examples of thinking.  From these examples, an individual thinks about the strengths and merits of each. This process of thinking about other people’s thoughts helps a person to develop their own opinion. When an is formed it can be shared with others who are then able to judge for themselves the merit of the person’s opinion.


This process of thinking is not often required for academic studies. The focus has moved more towards problem-solving. Problem-solving is In an excellent form of thinking when the end goal is often binary in nature. This means that when a problem solves, either they solve the problem or they do not.


Critical thinking involves a certain fuzziness to it that problem-solving lacks  For example, whether a speech or paper is good or bad involves critical thinking because judging quality involves fuzziness to it. This sense of a shade of gray would make solving problems difficult at the least. T


However, if you are called to determine why a computer does not connect to the internet this is problem-solving. The goal is to get back on the internet. You have to think but the desired outcome is clear. Once the computer is back on the internet there is nothing to think about. In most cases, particular with non techie people, how you get back on the internet does not even matter. In other words, the “why does this work” is often something that problem solvers do not care about but this is exactly the type of thing critical thinking has to be able to explain when developing an argument.


Problem-solving involves action and not as much contemplation. The focus is on experience and not theory. It is not that problem-solvers never read and contemplate, rather, they learn primarily through doing. Examples include trial and error. 

Most companies want problem solvers and not necessarily critical thinkers. In other words, businesses want things done. They do not want people going around and questioning unless this helps to solve a problem.  Companies claim to want thinking but what they really want are people who think how to solve the company’s problems. Questioning the company is not one of the wiser things to do.

The fuzziness of critical thinking frustrates problem-solvers who want to solve problems and not simply talk. This is not a negative thing but rather a difference in personality. The problem is that problem solvers and critical thinkers do not see this as a matter of difference but a matter of ignorance on one hand and irrelevance on the other hand. Thinkers think and problem solvers do is a common description of both sides



Critical thinking and problem-solving are two skills that everyone needs. To focus on either to the exclusion of the other is detrimental. A combination of thought and action creates a balanced individual who is able to get things done while still have a depth of thought to support their actions.


Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing coefficients of variables either by removing them and or reduce their impact. Ridge regression reduces the effect of problematic variables close to zero but never fully removes them.

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps/

  1. Data preparation
  2. Baseline model development
  3. Ridge regression model

Below is the initial code

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_erro

Data Preparation

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

df.loc[ 'male', 'sex'] = 0
df.loc[ 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)

We can now create our baseline regression model.

Baseline Model

The metric we are using is the mean squared error. Below is the code and output for our baseline regression model. This is a model that has no regularization to it. Below is the code.


This  value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.

Ridge Model

In order to create our ridge model we need to first determine the most appropriate value for the l2 regularization. L2 is the name of the hyperparameter that is used in ridge regression. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first are ridge model and indicate normalization in order to get better estimates. Next we setup the grid that we will use. Below is the code.


The search object has several arguments within it. Alpha is hyperparameter we are trying to set. The log space is the range of values we want to test. We want the log of -5 to 2, but we only get 8 values from within that range evenly spread out. Are metric is the mean squared error. Refit set true means to adjust the parameters while modeling and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_scores_ function to determine the model;s strength. Below is the code.,y)
{'alpha': 0.01}

The best_params_ tells us what to set alpha too which in this case is 0.01. The best_score_ tells us what the best possible mean squared error is. In this case, the value of 0.38 is worse than what the baseline model was. We can confirm this by  fitting our model with the ridge information and finding the mean squared error. This is done below.


The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,data("VietNamI").columns):
coef_dict_baseline[feat] = coef
{'pharvis': 0.013282050886950674,
'lnhhexp': 0.06480086550467873,
'age': 0.004012412278795848,
'sex': -0.08739614349708981,
'married': 0.075276463838362,
'educ': -0.06180921300600292,
'illness': 0.040870384578962596,
'injury': -0.002763768716569026,
'illdays': -0.006717063310893158,
'actdays': 0.1468784364977112}

coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,data("VietNamI").columns):
coef_dict_ridge[feat] = coef
{'pharvis': 0.012881937698185289,
'lnhhexp': 0.06335455237380987,
'age': 0.003896623321297935,
'sex': -0.0846541637961565,
'married': 0.07451889604357693,
'educ': -0.06098723778992694,
'illness': 0.039430607922053884,
'injury': -0.002779341753010467,
'illdays': -0.006551280792122459,
'actdays': 0.14663287713359757}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.


Ridge regression allows you to penalize variables based on their useful in developing the model. With this form of regularized regression the coefficients of the variables is never set to zero. Other forms of regularization regression allows for the total removal of variables. One example of this is lasso regression.

Undergrad and Grad Students

In this post,  we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill sears initially and quality is often compromised.


At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require rather an intense specialization from the beginning with a minimum of general education while others take a more American style approach with a wide exposure to various fields.


Graduate students are usually older. This means that they require less institution sponsored social activities and may not socialize at all. In addition, some graduate students are married which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family they are also going to struggle due to the time commitment of their loved ones.

Assuming that an undergraduate student is a traditional one they will tend to be straight from high school, require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are often missing for academic success.

For the teacher, graduate students offer higher motivation and commitment generally when compared to undergrads. This is reasonable as people often feel compelled to complete a bachelors but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.


Despite the differences but types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Hyperparameter Tuning in Python

Hyperparameters are a numerical quantity you must set yourself when developing a model. This is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.

Algorithms cannot determine hyperparameters themselves which is why you have to do it. The problem is that the typical person has no idea what is an optimally choice for the hyperparameter. To deal with this confusion, often a range of values are inputted and then it is left to python to determine which combination of hyperparameters is most appropriate.

In this post, we will learn how to set hyperparameters by developing a grid in  Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and not married based on several independent variables. The steps of this process is as follows

  1.  Data preparation
  2. Baseline model (for comparison)
  3. Grid development
  4. Revised model

Below is some initial code that includes all the libraries and classes that we need.

import pandas as pd
import numpy as np
from pydataset import data
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Data Preparation

The dataset PSID has several problems that we need to address.

  • We need to remove all NAs
  • The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
  • The educatnn and kids variables have codes that are 98 and 99. These need to be removed because they do not  make sense.

Below is the code that deals with all of this

df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True
  1. Line 1 loads the dataset and drops the NAs
  2. Line 2-4 create our dummy variable for marriage. We create a new variable called marry to hold the results
  3. Lines 5-6 drop the values in  kids and educatn that are above 90.

Below we create our X and y datasets and then are ready to make our baseline model.


Baseline Model

The purpose of  baseline model is to see how much better or worst the hyperparameter tuning works. We are using K Nearest Neighbors  for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.

  1. number of neighbors
  2. weight of neighbors
  3. metric for measuring distance
  4. power parameter for minkowski

Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.

classifier=KNeighborsClassifier(n_neighbors=5,weights=’uniform’, metric=’minkowski’,p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring=’accuracy’,n_jobs=1)) 0.6188104238047426

Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance

Grid Development

The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run

grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}[len(grid[element]) for element in grid])

You can see we made a simple list that has several values for each hyperparameter

  1. Number if neighbors can be 1 to 13
  2. weight of neighbors can be uniform or distance
  3. metric can be manhatten or minkowski
  4. p can be 1 or 2

We will develop 96 models all together. Below is the code to begin tuning the hyperparameters.


The estimator is the  code for the type of algorithm we are using. We set this earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process. refit is for changing parameters and cv is for cross-validation folds.The command runs the model

The code below provides the output for the results.

{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}

The best_params_ function tells us what the most appropriate parameters are. The best_score_ tells us what the accuracy of the model is with the best parameters. Are model accuracy improves from 61% to 65% from adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyper parameters.

Model Revision

Below is the cod efor the erevised model

classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1)) #new res
Out[24]: 0.6503909993913031

Exactly as we thought. This is a small improvement but this can make a big difference in some situation such as in a data science competition.


Tuning hyperparameters is one of the final pieces to improving a model. With this tool, small gradually changes can be seen in a model. It is important to keep in mind this aspect of model development in order to have the best success final.