Author Archives: Dr. Darrin

Type 1 & 2 Tasks in TESOL

In the context of TESOL, language skills of speaking, writing, listening, and reading are divided into productive and receptive skills. Productive skills are speaking and writing and involve making language. Receptive skills are listening and reading and involve receiving language.

In this post, we will take a closer look at receptive skills, covering the theory behind them as well as the use of type 1 and type 2 tasks to develop these skills.

Top Down Bottom Up

Theories that are commonly associated with receptive skills include top down and bottom up processing. With top down processing the reader/listener is focused on the “big picture”. This means that they are focused on the general view or idea of the content they are exposed to. This requires having a large amount of knowledge and experience to draw upon in order to connect understanding. Prior knowledge helps the individual to know what to expect as they receive the information.

Bottom up processing is the opposite. In this approach, the reader/listener is focused on the details of the content. Another way to see this is that with bottom up processing the focus is on the trees while with top down the focus is on the forest. With bottom up processing students are focused on individual words or even individual word sounds, such as when they are decoding while reading.

Type 1 & 2 Tasks

Type 1 and type 2 tasks are derived from top down and bottom up processing. Type 1 tasks involve seeing the “big picture”. Examples of this type include summarizing, searching for the main idea, making inferences, etc.

Often type 1 tasks are trickier to assess because solutions are often open-ended and open to interpretation. This involves having to assess each response from each student individually, which may not always be practical. However, type 1 tasks really help to broaden and strengthen higher level thinking skills, which can lay a foundation for critical thinking.

Type 2 tasks involve looking at text and/or listening for much greater detail. Such activities as recall, grammatical correction, and single answer questions all fall under the umbrella of type 2 tasks.

Type 2 tasks are easier to mark as they frequently have only one possible answer. The problem with this is that teachers over-rely on them because of their convenience. Students are trained to obsess over details rather than broad comprehension or connecting knowledge to other areas of knowledge. Opportunities for developing dynamic literacy are lost in favor of a focus on critical literacy or even decoding.

A more reasonable approach is to use a combination of type 1 and type 2 tasks. Type 1 tasks can be used to stimulate thinking without necessarily marking the responses. Type 2 tasks can be employed to teach students to focus on details, and because they are easy to mark, they can be placed in the grade book for assessing progress.


This post explained various theories related to receptive skills in TESOL. It also looked at the two broad categories into which receptive skill tasks fall. For educators, it is important to find a balance between type 1 and type 2 tasks in the classroom.


Data Exploration with R: Housing Prices

In this data exploration post, we will analyze a dataset from Kaggle using R. Below are the packages we are going to use.


Let’s look at our data for a second.

train <- read_csv("~/Downloads/train.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.

Data Visualization

Let’s take a look at our target variable first.



Here is the frequency of certain values for the target variable



Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.

nums <- unlist(lapply(train, is.numeric))
train[ , nums]%>%
        select(-Id) %>%
        ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()


There are some strong associations in the data set. Below we see the top 10 correlations.

n1 <- 20 
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
        as_data_frame %>% 
        transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>% 
        distinct %>% 
        filter(Var1N != Var2N) %>% 
        arrange(desc(n)) %>%
        group_by(grp = as.integer(gl(n(), n1, n())))
## # A tibble: 703 x 4
## # Groups:   grp [36]
##    Var1N        Var2N            n   grp
##  1 GarageArea   GarageCars   0.853     1
##  2 1stFlrSF     TotalBsmtSF  0.829     1
##  3 GrLivArea    TotRmsAbvGrd 0.828     1
##  4 OverallQual  SalePrice    0.810     1
##  5 GrLivArea    SalePrice    0.731     1
##  6 GarageCars   SalePrice    0.691     1
##  7 YearBuilt    YearRemodAdd 0.684     1
##  8 BsmtFinSF1   BsmtFullBath 0.674     1
##  9 BedroomAbvGr TotRmsAbvGrd 0.668     1
## 10 FullBath     GrLivArea    0.658     1
## # ... with 693 more rows

There are about 4 correlations that are perhaps too strong.

Descriptive Statistics

Below are some basic descriptive statistics of our variables.

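# gather() and the joins are assumed steps, inferred from the columns in the output below
train_mean<-na.omit(train[ , nums]) %>% 
        select(-Id,-SalePrice) %>%
        summarise_all(funs(mean)) %>%
        gather(key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sd)) %>%
        gather(key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(median)) %>%
        gather(key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sum(.<0.001))) %>%
        gather(key='feature',value='zeros') %>%
        left_join(train_mean,by='feature') %>%
        left_join(train_median,by='feature') %>%
        left_join(train_sd,by='feature') %>%
        mutate(zeropercent=zeros/nrow(train)*100) %>%
        arrange(desc(zeros))
stat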
## # A tibble: 36 x 6
##    feature       zeros    mean median      sd zeropercent
##  1 PoolArea       1115  2.93        0  40.2          76.4
##  2 LowQualFinSF   1104  4.57        0  41.6          75.6
##  3 3SsnPorch      1103  3.35        0  29.8          75.5
##  4 MiscVal        1087 23.4         0 166.           74.5
##  5 BsmtHalfBath   1060  0.0553      0   0.233        72.6
##  6 ScreenPorch    1026 16.1         0  57.8          70.3
##  7 BsmtFinSF2      998 44.6         0 158.           68.4
##  8 EnclosedPorch   963 21.8         0  61.3          66.0
##  9 HalfBath        700  0.382       0   0.499        47.9
## 10 BsmtFullBath    668  0.414       0   0.512        45.8
## # ... with 26 more rows

We have a lot of information stored in the code above. We have the means, median and the sd in one place for all of the features. Below are visuals of this information. We add 1 to the mean and sd to preserve features that may have a mean of 0.

p1<-stat %>%
        ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")

p2<-stat %>%
        ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")

p3<-stat %>%
        ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")

p4<-stat %>%
        mutate(zeros=zeros/nrow(train)*100) %>%
        ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")

p5<-stat %>%
        ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)


Below we check for variables with zero variance. Such variables would cause problems if included in any model development.

stat %>%
        mutate(zeros = zeros/nrow(train)*100)%>%
        filter(mean == 0 | sd == 0 | zeros==100)

There are no zero-variance features in this dataset that need to be removed.


Let’s look at correlation with the SalePrice variable. The plot is a histogram of all the correlations with the target variable.

sp_cor<-train[, nums] %>% 
select(-Id,-SalePrice) %>%
cor(train$SalePrice,method="spearman") %>%
as.tibble() %>%
rename(cor=V1) # assumed step: V1 is the default name of the unnamed correlation column

sp_cor %>%
ggplot(aes(cor))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")


We have several high correlations, but we already knew this. Below we have some code that provides visuals of the correlations.

        ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations")
p2<-train%>% # assumed: plots are built from the full training data
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation")
p3<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation")
ggMarginal(p3,type = 'histogram')



The first plot shows us the top correlations. Plot 1 shows us the relationship between the strongest predictor and our target variable. Plot 2 shows us the trend line and the histograms for the strongest predictor with our target variable.

The code below is for the categorical variables. Our primary goal is to see the proportions inside each variable. If a categorical variable lacks variance in terms of frequencies in each category, it may need to be removed for model development purposes. Below is the code.

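char <- unlist(lapply(train, is.character))  

List <- list() # container for the proportion tables
    for (var in train[,char]){
        wow= print(prop.table(table(var)))
        List[[length(List)+1]] = wow
    }
names(List) <- names(train[,char]) # label each table with its variable name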

This list is not printed here in order to save space

## $MSZoning
## var
##     C (all)          FV          RH          RL          RM 
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068 
## $Street
## var
##        Grvl        Pave 
## 0.004109589 0.995890411 
## $Alley
## var
##      Grvl      Pave 
## 0.5494505 0.4505495 
## $LotShape
## var
##         IR1         IR2         IR3         Reg 
## 0.331506849 0.028082192 0.006849315 0.633561644 
## $LandContour
## var
##        Bnk        HLS        Low        Lvl 
## 0.04315068 0.03424658 0.02465753 0.89794521 
## $Utilities
## var
##       AllPub       NoSeWa 
## 0.9993150685 0.0006849315 
## $LotConfig
## var
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945 
## $LandSlope
## var
##        Gtl        Mod        Sev 
## 0.94657534 0.04452055 0.00890411 
## $Neighborhood
## var
##     Blmngtn     Blueste      BrDale     BrkSide     ClearCr     CollgCr 
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726 
##     Crawfor     Edwards     Gilbert      IDOTRR     MeadowV     Mitchel 
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644 
##       NAmes     NoRidge     NPkVill     NridgHt      NWAmes     OldTown 
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260 
##      Sawyer     SawyerW     Somerst     StoneBr       SWISU      Timber 
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397 
##     Veenker 
## 0.007534247 
## $Condition1
## var
##      Artery       Feedr        Norm        PosA        PosN        RRAe 
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247 
##        RRAn        RRNe        RRNn 
## 0.017808219 0.001369863 0.003424658 
## $Condition2
## var
##       Artery        Feedr         Norm         PosA         PosN 
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630 
##         RRAe         RRAn         RRNn 
## 0.0006849315 0.0006849315 0.0013698630 
## $BldgType
## var
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219 
## $HouseStyle
## var
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521 
##      SFoyer        SLvl 
## 0.025342466 0.044520548 
## $RoofStyle
## var
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863 
## $RoofMatl
## var
##      ClyTile      CompShg      Membran        Metal         Roll 
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315 
##      Tar&Grv      WdShake      WdShngl 
## 0.0075342466 0.0034246575 0.0041095890 
## $Exterior1st
## var
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock 
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315 
##      CemntBd      HdBoard      ImStucc      MetalSd      Plywood 
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027 
##        Stone       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192 
## $Exterior2nd
## var
##      AsbShng      AsphShn      Brk Cmn      BrkFace       CBlock 
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315 
##      CmentBd      HdBoard      ImStucc      MetalSd        Other 
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315 
##      Plywood        Stone       Stucco      VinylSd      Wd Sdng 
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068 
##      Wd Shng 
## 0.0260273973 
## $MasVnrType
## var
##     BrkCmn    BrkFace       None      Stone 
## 0.01033058 0.30647383 0.59504132 0.08815427 
## $ExterQual
## var
##          Ex          Fa          Gd          TA 
## 0.035616438 0.009589041 0.334246575 0.620547945 
## $ExterCond
## var
##           Ex           Fa           Gd           Po           TA 
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918 
## $Foundation
## var
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795 
## $BsmtQual
## var
##         Ex         Fa         Gd         TA 
## 0.08503162 0.02459592 0.43429375 0.45607871 
## $BsmtCond
## var
##          Fa          Gd          Po          TA 
## 0.031623331 0.045678145 0.001405481 0.921293043 
## $BsmtExposure
## var
##         Av         Gd         Mn         No 
## 0.15541491 0.09423347 0.08016878 0.67018284 
## $BsmtFinType1
## var
##        ALQ        BLQ        GLQ        LwQ        Rec        Unf 
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850 
## $BsmtFinType2
## var
##         ALQ         BLQ         GLQ         LwQ         Rec         Unf 
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010 
## $Heating
## var
##        Floor         GasA         GasW         Grav         OthW 
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630 
##         Wall 
## 0.0027397260 
## $HeatingQC
## var
##           Ex           Fa           Gd           Po           TA 
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849 
## $CentralAir
## var
##          N          Y 
## 0.06506849 0.93493151 
## $Electrical
## var
##       FuseA       FuseF       FuseP         Mix       SBrkr 
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880 
## $KitchenQual
## var
##         Ex         Fa         Gd         TA 
## 0.06849315 0.02671233 0.40136986 0.50342466 
## $Functional
## var
##         Maj1         Maj2         Min1         Min2          Mod 
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726 
##          Sev          Typ 
## 0.0006849315 0.9315068493 
## $FireplaceQu
## var
##         Ex         Fa         Gd         Po         TA 
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351 
## $GarageType
## var
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144 
## $GarageFinish
## var
##       Fin       RFn       Unf 
## 0.2552574 0.3060189 0.4387237 
## $GarageQual
## var
##          Ex          Fa          Gd          Po          TA 
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905 
## $GarageCond
## var
##          Ex          Fa          Gd          Po          TA 
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352 
## $PavedDrive
## var
##          N          P          Y 
## 0.06164384 0.02054795 0.91780822 
## $PoolQC
## var
##        Ex        Fa        Gd 
## 0.2857143 0.2857143 0.4285714 
## $Fence
## var
##      GdPrv       GdWo      MnPrv       MnWw 
## 0.20996441 0.19217082 0.55871886 0.03914591 
## $MiscFeature
## var
##       Gar2       Othr       Shed       TenC 
## 0.03703704 0.03703704 0.90740741 0.01851852 
## $SaleType
## var
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726 
##         New         Oth          WD 
## 0.083561644 0.002054795 0.867808219 
## $SaleCondition
## var
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438

You can judge for yourself which of these variables are appropriate or not.


This post provided an example of data exploration. Through this analysis we have a better understanding of the characteristics of the dataset. This information can be used for further analysis and/or model development.

Scatterplot in LibreOffice Calc

A scatterplot is used to observe the relationship between two continuous variables. This post will explain how to make a scatterplot and calculate correlation in LibreOffice Calc.


In order to make a scatterplot you need two columns of data. Below are the first few rows of the data in this example.

Var 1 Var 2
3.07 2.95
2.07 1.90
3.32 2.75
2.93 2.40
2.82 2.00
3.36 3.10
2.86 2.85

Given the nature of this dataset, there was no need to make any preparation.

To make the plot you need to select the two columns with data in them and click on insert -> chart and you will see the following.


Be sure to select the XY (Scatter) option and click next. You will then see the following


Be sure to select “data series in columns” and “first row as label.” Then click next and you will see the following.


There is nothing to modify in this window. If you wanted you could add more data to the plot as well as label data but neither of these options apply to us. Therefore, click next.


In this last window, you can see that we gave the chart a title and labeled the X and Y axes. We also removed the “display legend” feature by unchecking it. A legend is normally not needed when making a scatterplot. Once you add this information click “finish” and you will see the following.


There are many other ways to modify the scatterplot, but we will now look at how to add a trend line.

To add a trend line you need to click on the data inside the plot so that it turns green as shown below.


Next, click on insert -> trend line and you will see the following


For our purposes, we want to select the “linear” option. Generally, the line is hard to see if you immediately click “ok”. Instead, we will click on the “Line” tab and adjust as shown below.


All we did was simply change the color of the line to black and increase the width to 0.10. When this is done, click “ok” and you will see the following.


The  scatterplot is now complete. We will now look at how to calculate the correlation between the two variables.


The correlation is essentially a number that captures what you see in a scatterplot. To calculate the correlation, do the following.

  1. Select the two columns of data
  2. Click on data -> statistics -> correlation and you will see the following


  3. In the “Results to” section, choose a place on the spreadsheet to show the results. Click ok and you will see the following.

Correlations Column 1 Column 2
Column 1 1
Column 2 0.413450002676874 1

You have to rename the columns with the appropriate variables. Despite this problem the correlation has been calculated.
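
For readers who want to cross-check the number outside the spreadsheet, the calculation can be sketched in a few lines of Python. This is only an illustration, assuming Calc's tool reports the Pearson coefficient. It uses the seven sample rows listed earlier rather than the full dataset, so its result will differ from the value Calc reported, and the function name pearson_r is our own.

```python
from math import sqrt

# First seven rows of the example data (Var 1, Var 2) listed earlier in the post.
var1 = [3.07, 2.07, 3.32, 2.93, 2.82, 3.36, 2.86]
var2 = [2.95, 1.90, 2.75, 2.40, 2.00, 3.10, 2.85]


def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(var1, var2)
print(round(r, 3))
```

A value near 1 or -1 means the points fall close to a straight line, while a value near 0 means there is little linear relationship.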


This post provided an explanation of calculating correlations and creating scatterplots in LibreOffice Calc. Data visualization is a critical aspect of communicating effectively and such tools as Calc can be used to support this endeavor.

Graphs in LibreOffice Calc

The LibreOffice suite is a free, open-source office suite that is considered an alternative to the Microsoft Office suite. The primary benefit of LibreOffice is that it offers features similar to Microsoft Office without having to spend any money. In addition, LibreOffice is legitimately free and is not some sort of nefarious pirated version of Microsoft Office, which means that anyone can use LibreOffice without legal concerns on as many machines as they desire.

In this post, we will go over how to make plots and graphs in LibreOffice Calc. LibreOffice Calc is the equivalent to Microsoft Excel. We will learn how to make the following visualizations.

  • Bar graph
  • Histogram

Bar Graph

We are going to make a bar graph from a single column of data in LibreOffice Calc. To make a visualization you need to aggregate some data. For this post, I simply made some random data that uses a Likert scale of SD, D, N, A, SA. Below is a sample of the first five rows of the data.

Var 1

In order to aggregate the data you need to make bins and count the frequency of each category in the bin. Here is how you do this. First you make a variable called “bin” in a column and you place SD, D, N, A, and SA each in their own row in the column you named “bin” as shown below.


In the next column, you create a variable called “freq”. In each row of this column you need to use the COUNTIF function as shown below.

=COUNTIF(1st value in data: last value in data, criteria for counting)

Below is how this looks for my data.


What I told LibreOffice was that my data is in A2 to A177 and that it needs to count a row if it contains the same value as B2, which for me contains SD. You repeat this process four more times, adjusting the last argument in the function. When I finished, this is what I had.

bin freq
SD 35
D 47
N 56
A 32
SA 5
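
The COUNTIF step is simply a frequency count per category. As a rough cross-check outside Calc, the same aggregation can be sketched in Python. The short responses list is made-up illustration data, not the data used in this post.

```python
from collections import Counter

# Hypothetical Likert responses standing in for the raw data column.
responses = ["SD", "D", "N", "N", "A", "SD", "D", "N", "SA", "D"]

# The "bin" column: every category, in the order we want to plot it.
bins = ["SD", "D", "N", "A", "SA"]

counts = Counter(responses)
# The "freq" column: one count per bin, like one COUNTIF per row.
freq = [counts[b] for b in bins]

for b, f in zip(bins, freq):
    print(b, f)
```

Listing the bins explicitly keeps the categories in scale order and reports a zero for any category that never appears in the data.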

We can now proceed to making the visualization.

To make the bar graph you need to first highlight the data you want to use. For us the information we want to select is the “bin” and “freq” variables we just created. Keep in mind that you never use the raw data but rather the aggregated data. Then click insert -> chart and you will see the following


Simply click next, and you will see the following


Make sure that the last three options are selected or your chart will look funny. Data series in rows or in columns has to do with how the data is read in a long or short form. Labels in first row makes sure that Calc does not insert “bin” and “freq” in the graph. First columns as label helps in identifying what the values are in the plot.

Once you click next you will see the following.


This window normally does not need adjusting, and trying to do so can be confusing. It does allow you to adjust the range of the data and even add more data to your chart. For now, we will click on next and see the following.


In the last window above, you can add a title and label the axes if you want. You can see that I gave my graph a name. In addition, you can decide if you want to display a legend if you look to the right. For my graph, that was not adding any additional information so I unchecked this option. When you click finish you will see the following on the spreadsheet.



Histograms are for continuous data. Therefore, I converted my SD, D, N, A, SA to 1, 2, 3, 4, and 5. All the other steps are the same as above. The one difference is that you want to remove the spacing between bars. Below is how to do this.

Click on one of the bars in the bar graph until you see the green squares as shown  below.


After you do this, there should be a new toolbar at the top of the spreadsheet. You need to click on the green and blue cube as shown below.


In the next window, you need to change the spacing to zero percent. This will change the bar graph into a histogram. Below is what the settings should look like.


When you click ok you should see the final histogram shown below


For free software this is not too bad. There are a lot of options that were left unexplained, especially in regard to how you can manipulate the colors of everything and even make the plots 3D.


LibreOffice provides an alternative to paying for Microsoft products. The example above shows that Calc is capable of making visually appealing graphs just as Excel is.

Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below


You can examine what variables are available with the code below. This is not displayed here because it is rather long


Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723

Only the first few values are shown. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)
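
The missing-data table and the column drop above can be sketched end to end on a toy frame. This is a minimal illustration, not the code that produced the table in this post: df_toy and its values are made up, and the column names are only stand-ins.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; names and values are illustrative only.
df_toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "COMMONAREA_AVG": [np.nan, np.nan, 0.5, np.nan],
    "OWN_CAR_AGE": [np.nan, 3.0, 7.0, 1.0],
})

# Count and share of missing values per column, largest first.
total = df_toy.isnull().sum().sort_values(ascending=False)
percent = (df_toy.isnull().sum() / len(df_toy)).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing_data.head())

# Keep only the columns that have no missing values at all.
complete = df_toy.drop(columns=percent[percent > 0.0].index)
print(complete.shape)
```

Dropping every column with any missing value is a blunt rule. Raising the 0.0 threshold would keep columns that are only lightly affected.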

You can use the .head() method if you want to see how many variables are left.

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0



count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0


I think you are getting the point. You can also look at categorical variables using the groupby() function.
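
As a small sketch of the groupby() idea, the snippet below summarizes a numeric column within each level of a categorical one. The frame is made up, and the column names CODE_GENDER and AMT_CREDIT are assumptions used only for illustration.

```python
import pandas as pd

# Toy data; column names are assumed for illustration.
df_toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "F"],
    "AMT_CREDIT": [100.0, 200.0, 300.0, 500.0],
})

# How many rows fall in each category.
counts = df_toy.groupby("CODE_GENDER").size()

# Descriptive statistics of a numeric column within each category.
summary = df_toy.groupby("CODE_GENDER")["AMT_CREDIT"].describe()

print(counts)
print(summary)
```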

We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variables.
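
One common way to create dummy variables in pandas is get_dummies(). The sketch below is an assumption about how this could be done, not necessarily the approach used for the results in this post. The frame is made up and the column names are illustrative.

```python
import pandas as pd

# Toy frame; the column names are assumed for illustration.
df_toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CODE_GENDER": ["F", "M", "F"],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})

# The categorical columns to convert.
cat_cols = ["NAME_CONTRACT_TYPE", "CODE_GENDER"]

# drop_first=True removes one level per variable to serve as the reference category.
df_dummies = pd.get_dummies(df_toy, columns=cat_cols, drop_first=True)
print(df_dummies.columns.tolist())
```

With drop_first=True one category per variable is removed automatically, which is the same idea as dropping a reference category by hand.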










You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly.  Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.



There is a clear outlier there. Below is another boxplot with a different variable



It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique.



The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from weakest to strongest, so we can remove high correlations.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005
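
Building on the snippet above, a short sketch of how the strongest pairs could be isolated is shown below. The toy frame, the 0.9 cutoff, and the variable names pairs and high are all assumptions made for illustration.

```python
import pandas as pd

# Toy data: b is almost a multiple of a, c is only loosely related to both.
df_toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

c = df_toy.corr().abs()
s = c.unstack()

# Drop self-pairs such as (a, a) and keep each unordered pair only once.
pairs = s[[i < j for i, j in s.index]]
so = pairs.sort_values(ascending=False)

# 0.9 is an arbitrary cutoff chosen for illustration.
high = so[so > 0.9]
print(high)
```

Keeping each pair once avoids the mirrored duplicates that appear when the full matrix is unstacked.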

The list is too long to show here, but the following variables were removed for having a high correlation with other variables.


Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms.



This is not normal



This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make our X and y datasets.


The code below fits our model and makes the predictions.
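
The two steps described above, building X and y and then fitting the tree, can be sketched as follows. This is a minimal illustration on made-up data: the predictor names, the max_depth setting, and the toy values are assumptions, not the settings behind the results in this post.

```python
import pandas as pd
from sklearn import tree

# Toy data; the predictor names and values are assumed for illustration.
df_toy = pd.DataFrame({
    "AMT_CREDIT": [100, 200, 300, 400, 500, 600],
    "CNT_CHILDREN": [0, 1, 0, 2, 3, 0],
    "TARGET": [0, 0, 0, 1, 1, 0],
})

# Predictors and target.
X = df_toy[["AMT_CREDIT", "CNT_CHILDREN"]]
y = df_toy["TARGET"]

# Fit a shallow tree and predict on the same data it was trained on.
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf = clf.fit(X, y)
y_pred = clf.predict(X)

# Rows are predictions, columns are the actual target.
print(pd.crosstab(y_pred, y))
```

Predictions made on the training data tend to look better than they would on unseen data, so a held-out set gives a more honest picture.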


Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
0       280873  18493
1         1813   6332
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511
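
As a rough check, the headline figures can be recomputed by hand from the four counts in the crosstab above. This is plain arithmetic rather than a scikit-learn call, and the variable names are ours. Because the rows of the crosstab are the predictions, 0.78 is the share of predicted 1s that are truly 1 (conventionally the precision for class 1) and 0.26 is the share of true 1s that the model found (conventionally the recall). The support column of the report (299366 and 8145) equals the number of predictions per class, which suggests the predictions were passed in the true-label position, so the precision and recall columns may be swapped relative to this convention.

```python
# Counts copied from the crosstab: rows are predictions, columns are TARGET.
pred0_actual0, pred0_actual1 = 280873, 18493
pred1_actual0, pred1_actual1 = 1813, 6332

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1

# Share of all cases where prediction and target agree.
accuracy = (pred0_actual0 + pred1_actual1) / total

# Share of predicted 1s that are truly 1.
share_of_predicted = pred1_actual1 / (pred1_actual0 + pred1_actual1)

# Share of true 1s that the model predicted as 1.
share_of_actual = pred1_actual1 / (pred0_actual1 + pred1_actual1)

print(round(accuracy, 4), round(share_of_predicted, 2), round(share_of_actual, 2))
```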

This model looks rather good in terms of accuracy on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.


Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this, but the simplistic nature of this analysis yielded strong results.

Classroom Conflict Resolution Strategies

Disagreements among students and even teachers are part of working at any institution. People have different perceptions and opinions of what they see and experience. With these differences often come disagreements that can lead to serious problems.

This post will look at several broad categories of strategies for resolving conflicts in the classroom. The categories are as follows.

  1. Avoiding
  2. Accommodating
  3. Forcing
  4. Compromising
  5. Problem-solving


The avoidance strategy involves ignoring the problem. The tension of trying to work out the difficulty is not worth the effort. The hope is that the problem will somehow go away without any form of intervention. Often the problem becomes worse.

Teachers sometimes use avoidance in dealing with conflict. One common classroom management strategy is avoidance, in which a teacher deliberately ignores poor behavior of a student to extinguish it. Since the student is not getting any attention from their poor behavior they will often stop the behavior.


Accommodating

Accommodating is focused on making everyone involved in the conflict happy. The focus is on relationships and not productivity. Many who employ this strategy believe that confrontation is destructive. Actual applications of this approach involve using humor or some other tension-breaking technique during a conflict. Again, the problem is never actually solved; rather, some form of “happiness band-aid” is applied.

In the classroom, accommodation happens when teachers use humor to smooth over tense situations and when they make adjustments to goals to ameliorate students' complaints. Generally, the first step in accommodation leads to more and more accommodating until the teacher is backed into a corner.

Another use of the term accommodating is the mandate in education under the catchphrase “meeting student needs”. Teachers are expected to accommodate as much as possible within guidelines given to them by the school. This leads to an extraordinarily large amount of work and effort on the part of the teacher.


Forcing

Force involves simply making people do something through the power you have over them. It gets things done but can lead to long-term relational problems. As people are forced, they often lose motivation and new conflicts begin to arise.

Forcing is often a default strategy for teachers. After all, the teacher is an authority over children. However, force is highly demotivating and should be avoided if possible. If students have no voice, they can quickly become passive, which is the opposite of active learning in the classroom.


Compromising

Compromise involves trying to develop a win-win situation for both parties. However, the reality is that compromising can often be the most frustrating strategy. To totally avoid conflict means no fighting. To be forced means to have no power. However, compromise means that a person almost got what they wanted but not exactly, which can be more annoying.

Depending on the age a teacher is working with, compromising can be difficult to achieve. Younger children often lack the skills to see alternative solutions and half-way points of agreement. Compromising can also be viewed as accommodating by older kids which can lead to perceptions of the teacher’s weakness when conflict arises. Therefore, compromise is an excellent strategy when used with care.


Problem-solving

Problem-solving is similar to compromising except that both parties are satisfied with the decision and the problem is actually solved, at least temporarily. This takes a great deal of trust and communication between the parties involved.

For this to work in the classroom, a teacher must de-emphasize their position of authority in order to work with the students. This is counterintuitive for most teachers and even for many students. It is also necessary to develop strong listening and communication skills to allow both parties to provide ways of dealing with the conflict. As with compromise, problem-solving is better reserved for older students.


Teachers need to know what their options are when it comes to addressing conflict. This post provided several ideas or ways for maneuvering disagreements and setbacks in the classroom.

Signs a Student is Lying

Deception is a common tool students use when trying to avoid discipline or some other uncomfortable situation with a teacher. However, there are some tips and indicators that can help you to determine if a student is lying to you. This post will share some of them. The tips are as follows.

  • Determine what is normal
  • Examine how they play with their clothing
  • Watch personal space
  • Tone of voice
  • Movement

Determine What is Normal

People are all individuals and thus unique. Therefore, determining deception first requires determining what is normal for the student. This involves some observation and getting to know the student. These are natural parts of teaching.

However, if you are in an administrative position and do not know the student that well, it will be much harder to determine what is normal for the student so that it can be compared to their behavior when you believe they are lying. One solution for this challenge is to first engage in small talk with the student so you can establish what appears to be natural behavior for the student.

Clothing Signs

One common sign that someone is lying is that they begin to play with their clothing. This can include tugging on clothes, closing buttons, pulling down on sleeves, and/or rubbing a spot. This all depends on what is considered normal for the individual.

Personal Space

When people pull away when talking, it is often a sign of dishonesty. This can be done through such actions as shifting one’s chair or leaning back. Other individuals will fold their arms across their chest. All these behaviors are subconscious ways of trying to protect one’s self.


Tone of Voice

The voice provides several clues of deception. Often the rate or speed of speaking slows down. Deceptive answers are often much longer and more detailed than honest ones. Liars often show hesitations and pauses that are out of the ordinary for them.

A change in pitch is perhaps the strongest sign of lying. Students will often speak with a much higher pitch when lying. This is perhaps due to the nervousness they are experiencing.


Movement

Liars have a habit of covering their mouth when speaking. Gestures also become more muted and closer to the body when a student is lying. Another common cue is gesturing with the palms up rather than down when speaking. Additional signs include nervous tapping with the feet.


People lie for many reasons. Due to this, it is important that a teacher is able to determine the honesty of a student when necessary. The tips in this post provide some basic ways of potentially identifying who is being truthful.

Barriers to Teachers Listening

Few of us want to admit it but all teachers have had problems at one time or another listening to their students. There are many reasons for this but in this post we will look at the following barriers to listening that teachers may face.

  1. Inability to focus
  2. Difference in speaking and listening speed
  3. Willingness
  4. Detours
  5. Noise
  6. Debate

Inability to Focus

Sometimes a teacher or even a student may not be able to focus on the discussion or conversation. This could be due to a lack of motivation or desire to pay attention. Listening can be taxing mental work. Therefore, the teacher must be engaged and have some desire to try to understand what is happening.

Differences in the Speed of Speaking and Listening

We speak much slower than we think. Some estimate that we speak at 1/4 the speed at which we can think. What this means is that if you can think 100 words per minute you can speak at only 25 words per minute. With thinking being 4 times faster than speaking, this leaves a lot of mental energy lying around unused, which can lead to daydreaming.

This difference can lead to impatience and to anticipation of what the person is going to say. Neither of these is beneficial because they discourage listening.


Willingness

There are times, rightfully so, that a teacher does not want to listen. This can be when a student is not cooperating or giving an unjustified excuse for their actions. The main point here is that a teacher needs to be aware of their unwillingness to listen. Is it justified or is it unjustified? This is the question to ask.


Detours

Detours happen when we respond to a specific point or comment by the student in a way that changes the subject. This barrier is tricky because you are actually paying attention but allow the conversation to wander from the original purpose. Wandering conversation is natural and often happens when we are enjoying the talk.

Preventing this requires mental discipline to stay on topic and to note what you are listening for. This is not easy but is necessary at times.


Noise

Noise can be external or internal. External noise consists of factors beyond our control. For example, if there is a lot of noise in the classroom it may be hard to hear a student speak. A soft-spoken student in a loud place is frustrating to try and listen to even when there is a willingness to do so.

Internal noise has to do with what is happening inside your own mind. If you are tired, sick, or feeling rushed due to a lack of time, these can all affect your ability to listen to others.


Debate

Sometimes we listen only until we want to jump in and try to defend a point or disagree with something. This is not so much listening as it is hunting and waiting to pounce at the slightest misstep of logic from the person we are supposed to listen to.

It is critical to show restraint and focus on allowing the other side to be heard rather than interrupted by you.


We often view teachers as communicators. However, half the job of a communicator is to listen. At times, due to the position and the need to be the talker a teacher may neglect the need to be a listener. The barriers explained here should help teachers to be aware of why they may neglect to do this.

Principles of Management and the Classroom

Henri Fayol (1841-1925) had a major impact on managerial communication through his development of 14 principles of management. In this post, we will look at these principles briefly and see how at least some of them can be applied in the classroom by educators.

Below is a list of the 14 principles of management by Fayol.

  1. Division of work
  2. Authority
  3. Discipline
  4. Unity of command
  5. Unity of direction
  6. Subordination of individual interest
  7. Remuneration
  8. The degree of centralization
  9. Scalar chain
  10. Order
  11. Equity
  12. Stability of personnel
  13. Initiative
  14. Esprit de corps

Division of Work & Authority

Division of work has to do with breaking work into small parts with each worker having responsibility for one aspect of the work. In the classroom, this would apply to group projects in which collaboration is required to complete a task.

Authority is the power to give orders and commands. The source of the authority cannot only be in the position. The leader must demonstrate expertise and competency in order to lead. For the classroom, it is a well-known tenet of education that the teacher must demonstrate expertise in their subject matter and knowledge of teaching.

Discipline & Unity of command

Discipline has to do with obedience. The workers should obey the leader. In the classroom this relates to concepts found in classroom management. The teacher must put in place mechanisms to ensure that the students follow directions.

Unity of command means that there should only be directions given from one leader to the workers. This is the default setting in some schools until about junior high or high school. At that point, students have several teachers at once. However, generally it is one teacher per classroom even if the students have several teachers.

Unity of Direction & Subordination of Individual Interests

The employees' activities must all be linked to the same objectives. This ensures everyone is going in the same direction. In the classroom, this relates to the idea of goals and objectives in teaching. The curriculum needs to be aligned, with students all going in the same direction. A major difference here is that the activities used to achieve the learning goals may vary from student to student.

Subordination of individual interests entails putting the organization ahead of personal goals. This is where there may be a break between managerial and educational practices. Currently, education in many parts of the world is highly focused on the students' interests at the expense of what may be most efficient and beneficial to the institution.

Remuneration & Degree of Centralization

Remuneration has to do with compensation. This can be monetary or non-monetary. Monetary compensation needs to be high enough to provide some motivation to work. Non-monetary compensation can include recognition, honor, or privileges. In education, non-monetary compensation is standard in the form of grades, compliments, privileges, recognition, etc. Whatever is done usually contributes to intrinsic or extrinsic motivation.

Centralization has to do with who makes decisions. A highly centralized institution has top-down decision-making while a decentralized institution has decisions coming from many directions. Generally, in the classroom setting, decisions are made by the teacher. Students may be given autonomy over how to approach assignments or which assignments to do, but the major decisions are made by the teacher even in highly decentralized classrooms due to the students' inexperience and lack of maturity.

Scalar Chain & Order

Scalar chain has to do with recognizing the chain of command. The employee should contact the immediate supervisor when there is a problem. This prevents too many people going to the same person. In education, this is enforced by default as the only authority in a classroom is usually a teacher.

Order deals with having the resources to get the job done. In the classroom, there are many things the teacher can supply such as books, paper, pencils, etc. and even social needs such as attention and encouragement. However, sometimes there are physical needs that are neglected such as kids who miss breakfast and come to school hungry.

Equity & Stability of Personnel

Equity means workers are treated fairly. This principle again relates to classroom management and even assessment. Students need to know that the process for discipline is fair even if it is disliked and that there is adequate preparation for assessments such as quizzes and tests.

Stability of personnel means keeping turnover to a minimum. In education, schools generally prefer to keep teachers long term if possible. Leaving during the middle of a school year, whether as a student or a teacher, is discouraged as it is disruptive.

Initiative & Esprit de Corps

Initiative means allowing workers to contribute new ideas and do things. This empowers workers and adds value to the company. In education, this also relates to classroom management in that students need to be able to share their opinion freely during discussions and also when they have concerns about what is happening in the classroom.

Esprit de corps focuses on morale. Workers need to feel good and appreciated. The classroom learning environment is a topic that is frequently studied in education. Students need to have their psychological needs met through having a place to study that is safe and friendly.


These 14 principles are found in the business world, but they also have a strong influence in the world of education. Teachers can pull from these principles any ideas that may be useful in their classroom.

Hierarchical Regression in R

In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situations in which you want to see if adding additional variables to your model will significantly change the r square when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.

We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables in four different models.

model 1 = intercept only
model 2 = Sales~Urban + US + ShelveLoc
model 3 = Sales~Urban + US + ShelveLoc + Price + Income
model 4 = Sales~Urban + US + ShelveLoc + Price + Income + Advertising

Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question

Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?

In this complex research question we essentially want to know if adding advertising cost will improve the model significantly in terms of the r square. The formal steps that we will follow to complete this analysis are as follows.

  1. Build sequential (nested) regression models by adding variables at each step.
  2. Run ANOVAs in order to compute the R2
  3. Compute difference in sum of squares for each step
    1. Check F-statistics and p-values for the SS differences.
  4. Compare sum of squares between models from ANOVA results.
  5. Compute increase in R2 from sum of square difference
  6. Run regression to obtain the coefficients for each independent variable.

We will now begin our analysis. Below is some initial code


Model Development

We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.

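library(ISLR)  # provides the Carseats dataset
model1 = lm(Sales ~ 1, data = Carseats)  # intercept only, used for the total sum of squares
model2 = lm(Sales ~ Urban + US + ShelveLoc, data = Carseats)
model3 = lm(Sales ~ Urban + US + ShelveLoc + Price + Income, data = Carseats)
model4 = lm(Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising, data = Carseats)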

We can now turn to the ANOVA analysis for model comparison.

ANOVA Calculation

We will use the anova() function to compare the four models. The residual sum of squares of model 1, which has no predictors, is the total sum of squares. This will serve as a baseline for the other models when calculating r square.

anova(model1, model2, model3, model4)

## Analysis of Variance Table
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1    399 3182.3                                   
## 2    395 2105.4  4   1076.89  89.165 < 2.2e-16 ***
## 3    393 1299.6  2    805.83 133.443 < 2.2e-16 ***
## 4    392 1183.6  1    115.96  38.406 1.456e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.

model 1 = sum of squares = 3182.3
model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)
model 3 = sum of squares = 1299.6 (add price and income)
model 4 = sum of squares = 1183.6 (add Advertising)

Each step is statistically significant, which means that adding each set of variables led to some improvement.

By adding price and income to the model we were able to improve the model in a statistically significant way. The r square increased by .25. Below is how this was calculated.

2105.4-1299.6 #SS of Model 2 - Model 3
## [1] 805.8
805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by total sum of squares, i.e. model 1
## [1] 0.2532131

When we add Advertising to the model the r square increases by about .04. The calculation is below.

1299.6-1183.6 #SS of Model 3 - Model 4
## [1] 116
116/ 3182.3 #SS difference of Model 3 and Model 4 divided by total sum of squares, i.e. model 1
## [1] 0.03645162
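
The arithmetic above can also be scripted. The sketch below is written in Python purely for illustration. It takes the residual sums of squares and residual degrees of freedom copied from the ANOVA table and recomputes each increase in r square along with the F-statistics. It assumes, as the values in the table suggest, that each F-statistic divides the mean square difference by the residual mean square of the largest model (model 4). The variable names are made up for this example.

```python
# Residual sum of squares (RSS) and residual degrees of freedom,
# copied from the ANOVA table for models 1 to 4.
rss = [3182.3, 2105.4, 1299.6, 1183.6]
res_df = [399, 395, 393, 392]

total_ss = rss[0]                     # model 1 is intercept only, so its RSS is the total SS
full_model_ms = rss[-1] / res_df[-1]  # residual mean square of model 4 (assumed F denominator)

r2_increase = []
f_stats = []
for step in range(1, len(rss)):
    ss_diff = rss[step - 1] - rss[step]        # sum of squares explained by the added variables
    df_diff = res_df[step - 1] - res_df[step]  # number of coefficients added at this step
    r2_increase.append(ss_diff / total_ss)
    f_stats.append((ss_diff / df_diff) / full_model_ms)

for step, (gain, f_value) in enumerate(zip(r2_increase, f_stats), start=2):
    print(f"model {step}: r square increase = {gain:.4f}, F = {f_value:.2f}")
```

Because the sums of squares in the table are rounded, the F values computed this way should agree with the anova() output only to within rounding.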

Coefficients and R Square

We will now look at a summary of each model using the summary() function.

## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.713 -1.634 -0.019  1.738  5.823 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.8966     0.3398  14.411  < 2e-16 ***
## UrbanYes          0.0999     0.2543   0.393   0.6947    
## USYes             0.8506     0.2424   3.510   0.0005 ***
## ShelveLocGood     4.6400     0.3453  13.438  < 2e-16 ***
## ShelveLocMedium   1.8168     0.2834   6.410 4.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3317 
## F-statistic: 50.51 on 4 and 395 DF,  p-value: < 2.2e-16
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income, 
##     data = Carseats)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9096 -1.2405 -0.0384  1.2754  4.7041 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.280690   0.561822  18.299  < 2e-16 ***
## UrbanYes         0.219106   0.200627   1.092    0.275    
## USYes            0.928980   0.191956   4.840 1.87e-06 ***
## ShelveLocGood    4.911033   0.272685  18.010  < 2e-16 ***
## ShelveLocMedium  1.974874   0.223807   8.824  < 2e-16 ***
## Price           -0.057059   0.003868 -14.752  < 2e-16 ***
## Income           0.013753   0.003282   4.190 3.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.5854 
## F-statistic: 94.89 on 6 and 393 DF,  p-value: < 2.2e-16
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income + 
##     Advertising, data = Carseats)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2199 -1.1703  0.0225  1.0826  4.1124 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.299180   0.536862  19.184  < 2e-16 ***
## UrbanYes         0.198846   0.191739   1.037    0.300    
## USYes           -0.128868   0.250564  -0.514    0.607    
## ShelveLocGood    4.859041   0.260701  18.638  < 2e-16 ***
## ShelveLocMedium  1.906622   0.214144   8.903  < 2e-16 ***
## Price           -0.057163   0.003696 -15.467  < 2e-16 ***
## Income           0.013750   0.003136   4.384 1.50e-05 ***
## Advertising      0.111351   0.017968   6.197 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared:  0.6281, Adjusted R-squared:  0.6214 
## F-statistic: 94.56 on 7 and 392 DF,  p-value: < 2.2e-16

You can see for yourself the change in the r square. From model 2 to model 3 there is a 25 point increase in r square (0.3384 to 0.5916), just as we calculated manually. From model 3 to model 4 there is a roughly 4 point increase in r square (0.5916 to 0.6281). The purpose of the anova() analysis was to determine whether each change met a criterion of statistical significance. The lm() function reports a change but not the significance of it.
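
As a cross-check, the r square and adjusted r square in each summary can be recovered from the residual sums of squares in the ANOVA table. The Python sketch below is only an illustration: the dictionary layout and names are assumptions, and the sample size of 400 is inferred from the 399 residual degrees of freedom of the intercept-only model.

```python
# Residual sums of squares from the ANOVA table, with the number of
# slopes (coefficients excluding the intercept) shown in each summary.
n_obs = 400        # inferred: 399 residual degrees of freedom for the intercept-only model, plus 1
total_ss = 3182.3  # RSS of the intercept-only model
models = {
    "model 2": {"rss": 2105.4, "n_slopes": 4},
    "model 3": {"rss": 1299.6, "n_slopes": 6},
    "model 4": {"rss": 1183.6, "n_slopes": 7},
}

results = {}
for name, info in models.items():
    r2 = 1 - info["rss"] / total_ss
    # Adjusted r square penalizes every additional slope in the model
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - info["n_slopes"] - 1)
    results[name] = (r2, adj_r2)
    print(f"{name}: r square = {r2:.4f}, adjusted r square = {adj_r2:.4f}")
```

The differences between consecutive r square values are the same increases that were calculated by hand from the sums of squares.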


Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.

RANSAC Regression in Python

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

  1. The algorithm randomly selects a subset of samples to be treated as inliers in the model.
  2. A model is fit to that subset, all of the data is tested against it, and samples that fall within a certain tolerance are relabeled as inliers.
  3. Model is refitted with the new inliers
  4. Error of the fitted model vs the inliers is calculated
  5. Terminate or go back to step 1 if a certain criterion of iterations or performance is not met.
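
To make the loop concrete, here is a simplified sketch in plain Python. It is not scikit-learn's implementation: it fits a straight line only, always draws two points per trial, uses a fixed number of trials as the stopping rule, and keeps the candidate with the most inliers. The function names, the toy data, and the threshold are all assumptions made for illustration.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ransac_line(points, n_trials=100, threshold=2.0, seed=0):
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # Pick a random minimal subset (two points define a line)
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # a vertical line cannot be written as y = f(x)
        slope, intercept = fit_line(sample)
        # Samples within the tolerance of the candidate line become inliers
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        # Keep the candidate with the largest set of inliers
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the model using only the inliers
    slope, intercept = fit_line(best_inliers)
    return slope, intercept, best_inliers

# Toy data: 20 points on y = 2x + 1 plus two obvious outliers
points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
points += [(5.5, 60.0), (12.5, -40.0)]

slope, intercept, inliers = ransac_line(points)
print(slope, intercept, len(inliers))
```

On this toy data the two extreme points should be excluded, so the refitted line is driven by the 20 points that follow the underlying trend.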

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

  1. Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
  2. Model 2 will use multiple regression and will include several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

  1. Data preparation
  2. Simple Regression Model fit
  3. Simple regression visualization
  4. Multiple regression model fit
  5. Multiple regression visualization

Below are the packages we will need for this example

import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preparation

For the data preparation, we need to do the following

  1. Load the data
  2. Create X and y dataframes
  3. Convert several categorical variables to dummy variables
  4. Drop the original categorical variables from the X dataframe

Below is the code for these steps


Most of this is self-explanatory. We first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we convert the sex, smoker, and dinner variables into dummy variables, and then we drop the original categorical variables.
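
As a rough illustration of the dummy-coding step described above, the sketch below replaces a two-level categorical column with a 0/1 indicator and drops the original column. It uses plain Python and a few made-up rows rather than the actual tips data, and the helper name add_dummy is hypothetical. In pandas the same idea is usually handled with get_dummies.

```python
# Made-up rows standing in for a few records of a tips-style dataset
rows = [
    {"total_bill": 12.50, "sex": "Female", "smoker": "No"},
    {"total_bill": 20.00, "sex": "Male", "smoker": "No"},
    {"total_bill": 31.75, "sex": "Male", "smoker": "Yes"},
]

def add_dummy(rows, column, positive_level):
    """Replace a two-level categorical column with a 0/1 indicator column."""
    new_name = f"{column}_{positive_level}"
    encoded = []
    for row in rows:
        # Copy every field except the original categorical column
        new_row = {key: value for key, value in row.items() if key != column}
        new_row[new_name] = 1 if row[column] == positive_level else 0
        encoded.append(new_row)
    return encoded

encoded_rows = add_dummy(rows, "sex", "Male")
encoded_rows = add_dummy(encoded_rows, "smoker", "Yes")
print(encoded_rows[2])
```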

We can now move to fitting the first model that uses simple regression.

Simple Regression Model

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

  1. Instantiate the RANSACRegressor. We pass it a LinearRegression estimator and set residual_threshold to 2, which means an example more than 2 units away from the line is treated as an outlier.
  2. Next we fit the model
  3. We predict the values
  4. We calculate the r square and the mean absolute error

Below is the code for all of this.

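ransacReg1 = RANSACRegressor(LinearRegression(), residual_threshold=2, random_state=0)
ransacReg1.fit(X[['total_bill']], y)
prediction1 = ransacReg1.predict(X[['total_bill']])  # variable name assumed; the original line was lost
r2_score(y, prediction1)
Out[150]: 0.4381748268686979

mean_absolute_error(y, prediction1)
Out[151]: 0.7552429811944833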

The r-square is 44% while the MAE is 0.76. These values are mostly useful for comparison and will be looked at again when we create the multiple regression model.
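
For readers who want to see what these two metrics measure, here is a small plain-Python sketch of the standard formulas. The function names and the four actual and predicted values are made up for illustration. In the post itself the numbers come from scikit-learn's r2_score and mean_absolute_error.

```python
def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up tip amounts and predictions
actual_tips = [1.0, 2.0, 3.0, 4.0]
predicted_tips = [1.5, 2.0, 2.5, 4.0]

example_mae = mae(actual_tips, predicted_tips)
example_r2 = r_squared(actual_tips, predicted_tips)
print(example_mae, example_r2)
```

A lower MAE and a higher r square both indicate predictions that sit closer to the actual values.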

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression. It also identifies which samples are inliers and outliers. The code will not be explained because of its complexity.

plt.xlabel('Total Bill')
plt.legend(loc='upper left')


The plot is self-explanatory: a handful of samples were considered outliers. We will now move to creating our multiple regression model.

Multiple Regression Model Development

The steps for making the model are mostly the same. The real difference takes place in making the plot, which we will discuss in a moment. Below is the code for developing the model.

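ransacReg2 = RANSACRegressor(LinearRegression(), residual_threshold=2, random_state=0)
ransacReg2.fit(X, y)
prediction2 = ransacReg2.predict(X)  # variable name assumed; the original line was lost
r2_score(y, prediction2)
Out[154]: 0.4298703800652126

mean_absolute_error(y, prediction2)
Out[155]: 0.7649733201032204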

Things have actually gotten slightly worse in terms of r-square and MAE.

For the visualization, we cannot directly plot several variables at once. Therefore, we will compare the predicted values with the actual values. The more strongly they are correlated, the better our prediction is. Below is the code for the visualization.

plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')


The plots are mostly the same, as you can see for yourself.


This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively, which is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be used in machine learning.

Teaching English

Teaching English or any other subject requires that the teacher be able to walk into the classroom and find ways to have an immediate impact. This is much easier said than done. In this post we look at several ways to increase the likelihood of being able to help students.

Address Needs

People’s reasons for learning a language such as English can vary tremendously. Knowing this, it is critical that you as a teacher know what they need in their learning. This allows you to adjust the methods and techniques that you use to help them learn.

For example, some students may study English for academic purposes while others are just looking to develop communication skills. Some students may be trying to pass a proficiency exam in order to study at university or in graduate school.

How you teach these different groups will be different. The academic students want academic English and language skills. Therefore, if you plan to play games in the classroom and other fun activities there may be some frustration because the students will not see how this helps them.

On the other hand, for students who just want to learn to converse in English, if you smother them with heavy readings and academic-like work they will also become frustrated by how “rigorous” the course is. This is why you must know what the goals of the students are and make the needed changes where possible.

Stay Focused

When dealing with students, it is tempting to answer and follow every question that they have. However, this can quickly lead to a loss of direction as the class goes here, there, and everywhere to answer every nuanced question.

Even though the teacher needs to know what the students want help with, the teacher is also the expert and needs to place limits on how far they will go in terms of addressing questions and needs. Everything cannot be accommodated no matter how hard one tries.

As the teacher, things that limit your ability to explore questions and concerns of students include time, resources, your own expertise, and the importance of the question/concern. Of course, we help students, but not to the detriment of the larger group.

Providing a sense of direction is critical as a teacher. The students have their needs but it is your goal to lead them to the answers. This requires a sense of knowing what you want and being able to get there. There are a lot of experts out there who cannot lead a group of students to the knowledge they need, as this requires communication skills and an ability to see the forest for the trees.


Teaching is a mysterious profession as so many things happen that cannot be seen or measured but clearly have an effect on the classroom. Despite the confusion it never hurts to determine where the students want to go and to find a way to get them there academically.

Improving Lecturing

Lecturing is a necessary evil at the university level. The university system was founded during a time when lecturing was the only way to share information. Originally, owning books was nearly impossible due to their price, there was no internet or computer, and there were few options for reviewing material. For these reasons, lecturing was the go-to approach for centuries.

With all the advances in technology, the world has changed but lecturing has not. The continued emphasis on lecture-style teaching has led to students becoming disengaged from the learning experience.

This post will look at times when lecturing is necessary as well as ways to improve the lecturing experience.

Times to Lecture

Despite the criticism given earlier, there are times when lecturing is an appropriate strategy. Below are some examples.

  • When there is a need to cover a large amount of content-If you need to get through a lot of material quickly and don’t have time for discussion.
  • Complex concepts/instructions-You probably do not want to use discovery learning to cover lab safety policies
  • New material-The first time through they may need to listen. When the topic is addressed later a different form of instruction should be employed

The point here is not to say that lecturing is bad but rather that it is overly relied upon by the typical college lecturer. Below are ways to improve lecturing when it is necessary.

Prepare Own Materials

With all the tools on the internet, from videos to textbook-supplied PowerPoint slides, it is tempting to just use these materials as they are and teach. However, preparing your own materials allows you to bring yourself and your personality into the teaching experience.

You can add anecdotes to illustrate various concepts, bring in additional resources, or leave out information that you do not think is pertinent. Furthermore, by preparing your own material you know inside and out where you are going and when. This can also help to organize your thinking on a topic due to the highly structured nature of PowerPoint slides.

Even modifying others’ materials can provide some benefit. Owning your material allows you to focus less on what someone else said and more on what you want to say.

Focus on the Presentation

If many teachers listen to themselves lecturing, they might be convinced that they are boring. When presenting a lecture a teacher should make sure to try to share the content extemporaneously. There should be a sense of energy and direction to the content. The students need to be convinced that you have something to say.

There is even a component of body language to this. A teacher needs to walk into a room like they “own the place” and speak accordingly. This means standing up straight, shoulders back with a strong voice that changes speed. These are all examples of having a commanding stage presence. Make it clear you are the leader through your behavior. Who wants to listen to someone who lacks self-confidence and mumbles?

Read the Audience

If all you do is have confidence and run through your PowerPoint like nobody exists, there will be little improvement for the students. A good speaker must read the audience and respond accordingly. If, despite all your efforts to prepare an interesting talk on a subject, the students are on their phones or even unconscious, there is no point in continuing. Instead, do some sort of diversionary activity to get people to refocus. Some examples of diversionary tactics include the following.

  • Have the students discuss something about the lecture for a moment
  • Have the students solve a problem of some sort related to the material
  • Have the students move. Instead of talking with someone next to them they have to find someone from a different part of the lecture room. A bit of movement is all it takes to regain consciousness.

The lecture should be dynamic, which means that it changes in nature at times. Breaking up the content into 10-minute periods followed by some sort of activity can really prevent fatigue in the listeners.


Lecturing is a classic skill that can still be used in the 21st century. However, given that times have  changed it is necessary to make some adjustments to how a  teacher approaches lecturing.

Teaching Large Classes

It is common for undergraduate courses, particularly introductory courses, to have a large number of students. Some introductory courses can have as many as 150-300 students in them. Combine this with the fact that it is common for these courses to be taught by the people with the least amount of teaching experience, whether a graduate assistant or a new non-tenured professor. This leads to the question of how to handle teaching so many students at one time.

In this post, we will look at some common challenges to teaching a large class at the tertiary level. In particular, we will look at the following.

  • Addressing student engagement
  • Grading assignments
  • Logistics

Student Engagement

Once a class reaches a certain size, it becomes difficult to engage students with discussion and one-on-one  interaction. This leaves a teacher with the most commonly used tool for university teaching, which is lecturing. However, most students find lecturing to be utterly boring and even some teachers find it boring.

Lecturing can be useful but it must be broken up into “chunks.” What this means is that perhaps you lecture for 8-10 minutes then have the students do something such as discuss a concept with their neighbor and then continue lecturing 8-10 minutes. The reason for 8-10 minutes is that is about how long a TV show runs until a commercial. This implies that 8-10 minutes is about how long someone can pay attention.

During a break in the lecturing, students can explain a concept to a neighbor, write a summary of what they just learned, or simply discuss what they learned. What happens during this time is up to the teacher but it should provide a way to continue to examine what is being learned without having to sit and only listen to the lecturer.

Grading Assignments

Grading assignments can be a nightmare in a large class. This is particularly true if the assessment has open-ended questions. The problem with open-ended questions is that they cannot easily be automated and marked by a computer.

If you must have open-ended questions that require humans to grade them here are some suggestions.

  • If the assessment is formative or a stepping stone in a project, selective marking may be an option. Selective marking involves grading only a sample of the papers and then inferring that other students made the same mistakes. You can then reteach the common mistakes to the whole class while saving a large amount of time.
  • Working with your teaching assistants, you can have each assistant mark a section of an exam. This helps to spread the work around and prevents students from complaining about one TA whose grading they dislike.
  • Peer review is another form of formative feedback that can work in large classes.

As mentioned earlier, for assessments that involve one answer, such as in lower level math classes, there are many automated options that are probably already available at your school, such as Scantron sheets or online examinations.

Cheating can also be a problem for exams. However, thorough preparation and developing an assignment that is based on what is taught can greatly reduce cheating. Randomizing the exams and seating can help as well. For plagiarism there are many resources available online.


Logistics

A common logistical problem is communication, which can be through email or office hours. If a class has over 100 or even 200 students, the demands for personal help can quickly become overwhelming. This can be avoided by establishing clear lines of communication and stating how you will respond.

Hopefully, there is some sort of way for you to communicate with all the students simultaneously such as through a forum or some other way. In this way, you can share the answer to a good question with everyone rather than individually several times.

Office hours can be adjusted by having them in groups rather than one-on-one. This allows the teacher to help several students at once rather than individually. Another idea is to have online office hours. Again you can meet several students at once but with the added convenience of not having to be in the same physical location.


Large classes are a lot of work and can be demanding for even experienced teachers. However, with some basic adjustments it is possible to shoulder this load with care.

Teaching Materials

Regardless of what level a teacher teaches at, you are always looking for materials and activities to do in your class. It is always a challenge to have new ideas and activities to support and help students because the world changes. This leads to a constant need to remove old, inefficient activities and bring in fresh ones. The primary purpose of activities is to provide practical application of the skills taught and experienced in school.

For the math teacher you can naturally make your own math problems. However, this can quickly become tiring. One solution to this is to employ others’ worksheets that provide growth opportunities for the students without stressing out the teacher.

There are many great websites for this. For example, provides many different types of worksheets to help students. They have some great simple math worksheets like the ones below

addition_outer space_answers

addition_outer space

There  are many more resources available at  as well as other sites. There is no purpose or benefit to reinventing the wheel. The incorporation of the assignments of others is a great way to expand the resources you have available without the stress of developing them yourself.

Review of “Usborne Mysteries & Marvels of Nature”

The book Mysteries & Marvels of Nature by Elizabeth Dalby (pp. 128) is a classic picture book focused on nature for children.

The Summary

This text provides explanations of various aspects of animals as found in nature. Some of the topics covered are how animals eat, move, defend themselves, and communicate, as well as their life cycles. Each section has various examples of the theme with a plethora of colorful photos.

The text that is included provides a brief description of the animal(s) and what they are doing in the picture. Leaping tigers, swimming fish, and even egg-laying snakes are all a part of this text.

The Good

The pictures are fascinating, and they really help in making the text come alive. There is a strong sense of color contrast in the text and you can tell the authors spent a great deal of time planning the layout of the text. There are pictures of whales, cuttlefish, frogs, beetles, etc.

There is also the use of drawings to depict scenes that may be hard to capture in nature. For example, the book explains how the darkling beetle escapes predators by spraying a liquid that stinks. Of course, there is an illustration of this in the text and not a photo. Another example shows a frog waking from hibernating underground.

This text would work great for almost any age group. Young kids would love looking at the pictures while older kids can read the text. The book is also large enough to accommodate a medium-size class for a whole-class reading.

The Bad

There is little to complain about in regard to this book. It is a paperback text. Therefore, it would not last long in most classrooms. However, the motivation for paperback may have been to keep the price down. At $16.99, this is a fairly cheap book for a classroom. Besides the quality of the material there is little to criticize about this book.

The Recommendation

This is a great text for any classroom. Students will spend hours fascinated by the pictures. Older students may enjoy the pictures as well while they need to focus on the text. For families this is an even better text because in most families there would be a reduction in the number of hands that are touching the text.

Teaching Smaller Class at University

The average teacher prefers small classes. However, there are times when the enrollment in a class that is usually big (however you define this) takes a dip in size and suddenly a class has become “small.” This can be harder to deal with than many people tend to believe. There are some aspects of the teaching and learning experience that need to be adjusted because the original approach is not user-friendly for the small class.

Another time when a person  often struggles with teaching small classes is if they never had the pleasure of experiencing a small class as a student. If your learning experience was a traditional large class lecture style and all of a sudden  you are teaching at a small liberal arts college there will need to be some adjustments too.

In this post, we will look at some pros and cons of teaching smaller classes at the tertiary level. In addition, we will look at some ways to address the challenges of teaching smaller classes for those who have not had this experience.


With a smaller class size there is an overall decrease in the amount of work that has to be done. This means fewer assignments to mark, less preparation of materials, etc. In addition, because the class is smaller it is not necessary to be as formal and structured with the class. In other words, there is less need to have routines in place because there is little potential chaos that can ensue if everyone does what they want.

The teaching can also be more personalized. You can adjust content and address individual questions much more easily than when dealing with a larger class. You can even get to know the students in a much more informal manner that is not possible in a large lecture hall.

Probably the biggest advantage for a new teacher is the ability to make changes and adjustments during a semester. A bad teacher in a large class leads to a large problem. However, a bad teacher in a small class is a small problem. If things are not working, it is easier to change things in a small group. The analogy that I like to make is that it is easier to do a U-turn on a bike than in a bus. For new teachers who do not quite know how to teach, a smaller class can help them to develop their skills for larger classes.


There are some challenges with small classes, especially for people with large class experience. One thing you will notice when teaching a small class is a loss of energy. If you are used to lecturing to 80 students and suddenly are teaching 12 it can seem as if that learning spark is gone.

The loss of energy can contribute to a loss of discipline. The informal nature of small classes can lead to students lacking a sense of seriousness about the course. In larger classes there is a sense of “sink or swim”. This may not be the most positive mindset but it helps people to take the learning experience seriously. In smaller classes this can sometimes be lost.

Attendance is another problem. In a large class having several absences is not a big deal. However, if your class is small, several absences are almost like a plague that wipes out a village. You can still teach, but the key people who participate in the discussion may be missing, or there may be no one left to listen to their comments. This can lead to pressure to cancel class, which causes even more problems.


There are several things that a teacher can do to have success with smaller class sizes. One suggestion is to adjust your teaching style. Lecturing is great for large classes in which content delivery is key. However, in a smaller class a more interactive, discussion-like approach can be taken. This helps to bring energy back to the classroom as well as engage the students.

Sometimes, if this is possible, changing from a large room to a smaller one can help to bring back the energy that is lost when a class is smaller. Many times the academic office will put a class in a certain classroom regardless of size. This is normally no longer a problem with all the advances in scheduling and registration software. However, if you are teaching 10 students in an auditorium perhaps it is possible to find a smaller, more intimate location.

Another way to deal with smaller classes is through increasing participation. This is often not practical in a large class. However, interaction can be useful in increasing the engagement.


The size of the class is not as important as the ability of the teacher to adjust to it in order to help students to learn. Small classes need a slightly different approach than traditional large classes at university. With a few minor adjustments, a teacher can still find ways to help students even if the class is not quite the size everyone was expecting.

Combining Algorithms for Classification with Python

Many approaches in machine learning involve making many models whose strengths and weaknesses are combined to make more accurate classifications. Generally, when this is done it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used it is the same algorithm but with variations in sampling and the use of features.

In addition to this common form of ensemble learning, there is also a way to combine different algorithms to make predictions. One way of doing this is through a technique called stacking, in which the predictions of several models are passed to a higher model that uses the individual model predictions to make a final prediction. A simpler relative, and the one used by the code in this post, is weighted voting, in which the individual predictions are combined by a vote rather than by a trained higher model. In this post we will look at how to do this using Python.
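As a rough illustration of combining the predictions of different models, the sketch below implements a weighted vote in plain Python. It is a simplification of what a voting ensemble does, not the library's own code. The function name `weighted_vote` and the example predictions are made up; the weights match those passed to the ensemble later in this post.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the class label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties go to the smaller label here (an arbitrary choice for this sketch)
    return max(sorted(totals), key=lambda label: totals[label])

# Four hypothetical model predictions for one person: 1 = city, 0 = not city
weights = [1.5, 1, 1, 1]
print(weighted_vote([1, 0, 0, 1], weights))  # 1 wins with 2.5 against 2.0
print(weighted_vote([1, 0, 0, 0], weights))  # 0 wins with 3.0 against 1.5
```

Giving the first model a weight of 1.5 means it breaks a two-against-two split, but it cannot overrule the other three models when they all agree.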


This blog usually tries to explain as much  as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.

  • Already familiar with python
  • Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
  • Familiar with cross-validation and hyperparameter tuning

We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.

The steps we will take in this post are as follows

  1. Data preparation
  2. Individual model development
  3. Ensemble model development
  4. Hyperparameter tuning of ensemble model
  5. Ensemble model testing

Below are all of the libraries we will be using in this post

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

Data Preparation

We need to perform the following steps for the data preparation

  1. Load the data
  2. Select the independent variables to be used in the analysis
  3. Scale the independent variables
  4. Convert the dependent variable from text to numbers
  5. Split the data in train and test sets

Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post but you can explore this on your own. The data was also scaled because many algorithms are sensitive to scale, so it is good practice to scale the data. We will use StandardScaler for this. Lastly, the dependent variable currently consists of the values “yes” and “no”; these need to be converted to the numbers 1 and 0. We will use LabelEncoder for this. The code for all of this is below.

X=pd.DataFrame(X_scale, index=X.index, columns=X.columns)
X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)
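The loading, selection, scaling, and encoding steps do not appear in the listing above. The sketch below walks through all five steps on a small made-up DataFrame so that it runs without pydataset. The columns `educ` and `hours` and all of the values are assumptions for illustration, not the variables chosen in this post. With the real data, `df` would come from `data('Mroz')`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Step 1: toy stand-in for the real dataset (hypothetical columns and values)
df = pd.DataFrame({
    "educ":  [12, 16, 10, 14, 12, 18, 11, 13, 15, 12],
    "hours": [1600, 0, 2000, 400, 1200, 0, 1800, 900, 300, 1500],
    "city":  ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "yes"],
})

# Step 2: select the independent variables and the dependent variable
X = df[["educ", "hours"]]
y = df["city"]

# Step 3: scale to mean 0 and standard deviation 1, keeping the column names
X_scale = StandardScaler().fit_transform(X)
X = pd.DataFrame(X_scale, index=X.index, columns=X.columns)

# Step 4: convert "no"/"yes" to 0/1 (LabelEncoder sorts the classes alphabetically)
y = LabelEncoder().fit_transform(y)

# Step 5: hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=5)
```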

We can now proceed to individual model development

Individual Model Development

Below are the steps for this part of the analysis

  1. Instantiate an instance of each algorithm
  2. Check accuracy of each model
  3. Check roc curve of each model

We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code

logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0)
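Only the logistic regression line appears above. The other three models can be instantiated in the same way. The sketch below is one possibility; the hyperparameter values are placeholders chosen for illustration, not the initial values used in this post.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Placeholder hyperparameter values (assumptions, not the post's settings)
treeclf = DecisionTreeClassifier(max_depth=5, random_state=0)
knnclf = KNeighborsClassifier(n_neighbors=5)
LDAclf = LDA()
```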

We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.

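clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf']
# cv=10 is an assumption; the fold count used originally is not shown
for clf, label in zip([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))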

accuracy: 0.69 (+/- 0.04) [Logistic Regression]
accuracy: 0.72 (+/- 0.06) [Decision Tree]
accuracy: 0.66 (+/- 0.06) [KNN]
accuracy: 0.70 (+/- 0.05) [LDAclf]
roc auc: 0.71 (+/- 0.08) [Logistic Regression]
roc auc: 0.70 (+/- 0.07) [Decision Tree]
roc auc: 0.62 (+/- 0.10) [KNN]
roc auc: 0.70 (+/- 0.08) [LDAclf]

The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.

Ensemble Model Development

The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results

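mv_clf=EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
labels=['LR','tree','knn','LDA','combine']

# cv=10 is an assumption; the fold count used originally is not shown
for clf, label in zip([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))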

accuracy: 0.69 (+/- 0.04) [LR]
accuracy: 0.72 (+/- 0.06) [tree]
accuracy: 0.66 (+/- 0.06) [knn]
accuracy: 0.70 (+/- 0.05) [LDA]
accuracy: 0.70 (+/- 0.04) [combine]
roc auc: 0.71 (+/- 0.08) [LR]
roc auc: 0.70 (+/- 0.07) [tree]
roc auc: 0.62 (+/- 0.10) [knn]
roc auc: 0.70 (+/- 0.08) [LDA]
roc auc: 0.72 (+/- 0.09) [combine]

You can see that the combined model has similar performance to the individual models. This means in this situation that the ensemble learning did not make much of a difference. However, we have not tuned our hyperparameters yet. This will be done in the next step.

Hyperparameter Tuning of Ensemble Model

We are going to tune the decision tree, logistic regression, and KNN model. There are many different hyperparameters we can tune. For demonstration purposes we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters we will run the model and pull the best hyperparameters values based on the roc auc as the metric. Below is the code and the output.
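The tuning code itself does not appear in this post. The sketch below shows one way such a search could be set up. It swaps in scikit-learn's `VotingClassifier` for mlxtend's `EnsembleVoteClassifier` and uses synthetic data, so the numbers will not match the output shown here. The estimator names are chosen so that the parameter keys look like the ones in the output, and the candidate values in `params` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the Mroz data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Soft voting averages predicted probabilities, which roc_auc scoring can use
ensemble = VotingClassifier(
    estimators=[
        ("logisticregression", LogisticRegression(random_state=0)),
        ("decisiontreeclassifier", DecisionTreeClassifier(random_state=0)),
        ("kneighborsclassifier", KNeighborsClassifier()),
        ("lineardiscriminantanalysis", LinearDiscriminantAnalysis()),
    ],
    voting="soft",
    weights=[1.5, 1, 1, 1],
)

# One hyperparameter per tuned algorithm; the candidate values are assumptions
params = {
    "logisticregression__C": [0.001, 0.1, 10],
    "decisiontreeclassifier__max_depth": [2, 3, 5],
    "kneighborsclassifier__n_neighbors": [3, 5, 9],
}

grid = GridSearchCV(estimator=ensemble, param_grid=params, cv=5, scoring="roc_auc")
grid.fit(X_demo, y_demo)

print(grid.best_params_)  # best combination found
print(grid.best_score_)   # mean cross-validated roc auc for that combination
```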



{'decisiontreeclassifier__max_depth': 3,
 'kneighborsclassifier__n_neighbors': 9,
 'logisticregression__C': 10}

Out[35]: 0.7196051482279385

The best values are as follows

  • Decision tree max depth set to 3
  • KNN number of neighbors set to 9
  • logistic regression C set to 10

These values give us a roc auc of 0.72, which is still poor. We can now use these values when we test our final model.

Ensemble Model Testing

The following steps are performed in the analysis

  1. Create new instances of the algorithms with the adjusted hyperparameters
  2. Run the ensemble model
  3. Predict with the test data
  4. Check the results

Below is the first step

logclf=LogisticRegression(penalty='l2',C=10, random_state=0)

Below is step two

mv_clf=EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
mv_clf.fit(X_train,y_train)

Below are steps three and four
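The code for these two steps does not appear in this post. A minimal sketch of predicting on held-out data and reporting the results is shown here, using synthetic data and a single logistic regression as a stand-in for the fitted ensemble (both are assumptions), so its numbers differ from the output that follows.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a single stand-in model (assumptions)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=.3, random_state=5)
model = LogisticRegression().fit(X_tr, y_tr)

# Step 3: predict with the test data
y_pred = model.predict(X_te)

# Step 4: confusion table (rows = actual, columns = predicted) and report
print(pd.crosstab(y_te, y_pred))
print(classification_report(y_te, y_pred))
```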


col_0   0    1
0      29   58
1      12  127
             precision    recall  f1-score   support

          0       0.71      0.33      0.45        87
          1       0.69      0.91      0.78       139

avg / total       0.69      0.69      0.66       226

The accuracy is about 69%. One thing that is noticeably low is the recall for people who do not live in the city. This is probably one reason why the overall roc auc score is so low. The f1-score is also low for those who do not live in the city. The f1-score is the harmonic mean of precision and recall. If we really want to improve performance, we would probably start with improving the recall of the no’s.
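The figures in the report can be checked by hand from the four counts in the confusion table above. The short calculation below does this in plain Python; the variable names are just labels for this sketch.

```python
# Counts from the confusion table above (rows = actual, columns = predicted)
tn, fp = 29, 58    # actual 0: predicted 0, predicted 1
fn, tp = 12, 127   # actual 1: predicted 0, predicted 1

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 0 (does not live in the city)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Class 1 (lives in the city)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(round(accuracy, 2))  # 0.69
print(round(precision_0, 2), round(recall_0, 2), round(f1(precision_0, recall_0), 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1(precision_1, recall_1), 2))
```

Only 29 of the 87 people who do not live in the city were identified, which is where the recall of 0.33 comes from.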


This post provided an example of how you can combine different algorithms to make predictions in Python. This is a powerful technique to use. Of course, it is offset by the complexity of the analysis, which makes it hard to explain exactly what the results mean if you were asked to do so.

Review of The Beginner’s American History

In this post, we take a look at The Beginner’s American History. This book was written by D.H. Montgomery in the late 19th century and was updated by John Holzmann (pp. 269).


This is a classic text that covers the history of the United States from Christopher Columbus’ discovery of America until the Gold Rush of California. All of the expected content is there from Captain John Smith, George Washington, to even Eli Whitney. Other information shared includes the various wars in America from the battles with the British for independence to the wars with the Mexicans and Indians for control of the land in what is now the United States.

The Good

This would be a great personal reader for an older student. It is primarily text based and there are few illustrations. The writing is simple for the most part and is not overly weighed down with a lot of academic insights and communication. 

The illustrations that are included tend to be an ever-changing map that shows how America is being slowly taken over by the American colonists. This provides the reader with a perspective of time and the growth of the United States.

It is also beneficial for students to get an older perspective on history. The way Montgomery viewed American history in the 19th century is vastly different from how historians see it today.

The Bad

As previously mentioned, the book is text heavy. This makes it inappropriate for small children. In addition, there are few illustrations in the book. This can be a detriment to students who learn through their senses. This would also make the text hard to use in a whole-class situation.

It is a children’s book; however, the content is portrayed in the most rudimentary manner. This may be due to the context in which the book was written as well as the purpose for this book. Either way, the book is rich in content but lacks depth.

The Recommendation

For personal reading this is an excellent book. However, in an academic context, I believe there are superior options to the book discussed here. The age of the text provides a distinct perspective on history but lacks the content for deep learning today.

Data Science Pipeline

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

  1. What to do
  2. When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know how to do it.

In this post, we will look at a process that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat: this process is focused primarily on supervised learning, which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

  1. Data preparation (data munging)
  2. Model training
  3. Model testing

Of course, it is much more complicated than this but this is the minimum. Within each of these steps there are several substeps. However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary but generally not all of them happen in every analysis. Below is a list of steps at this level

  • Data munging
  • Scaling
  • Normality
  • Dimension reduction/feature extraction/feature selection
  • Train, test, validation split

Data munging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization reduces the variables to a range of 0 – 1. Standardization involves converting the examples in the variable to their respective z-score. Which one you use depends on the situation but normally it is expected to do this.
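To make the difference between the two kinds of scaling concrete, here is a small sketch in plain Python. The function names and the `hours` values are made up for illustration. The population standard deviation is used, which is also what scikit-learn's StandardScaler uses.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling: rescale values to the range 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Convert each value to its z-score (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

hours = [0, 10, 20, 40]  # made-up values
print(normalize(hours))  # [0.0, 0.25, 0.5, 1.0]
print(standardize(hours))
```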

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved, checking for normality is doable with a few tests and some visualizations. If normality is violated, various transformations can be used to deal with this problem. Keep in mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reducing the number of variables that will be included in the final analysis. This is done through factor analysis or principal component analysis. This reduction in the number of variables is also an example of feature extraction. In some contexts, feature extraction is the goal in itself. Some algorithms, such as neural networks, make their own features through the use of hidden layer(s).
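A minimal sketch of dimension reduction with principal component analysis is shown below, using scikit-learn on made-up data in which two of the five columns nearly duplicate two others. The 95% variance threshold is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 100 rows, 5 columns; the last two nearly copy the first two
base = rng.normal(size=(100, 3))
X_demo = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=100),
    base[:, 1] + 0.01 * rng.normal(size=100),
])

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep as many components as are needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_demo.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(2))
```

Because two columns carry almost no new information, fewer components than original columns are enough to reach the threshold.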

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization or, in smaller datasets, with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analysts split the data into three parts to avoid overfitting the data to the test set. There is now a training set, a validation set, and a testing set. The validation set allows you to check the model performance several times. Once you are satisfied you use the test set once at the end.
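One simple way to get three parts is to split twice. The sketch below does this with scikit-learn's `train_test_split` on made-up data; the 60/20/20 proportions are an assumption chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 made-up rows
y_demo = np.arange(100) % 2

# First carve off the test set (20%), then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```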

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

  1. Determine the metric(s) for success
  2. Creating a grid of several hyperparameter values
  3. Cross-validation
  4. Selection of the most appropriate hyperparameter values

The first thing you have to do, and this is probably required, is determine how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification or mean squared error for a regression model or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameter settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Cross-validation allows you to stabilize the results by averaging the performance of the model over several folds of the data, if you are using k-fold cross-validation. This also makes the choice of hyperparameters more reliable. There are several types of cross-validation, but k-fold is probably the best place to start.
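The averaging over folds can be sketched in plain Python. This is a simplified illustration that assumes contiguous folds without shuffling and uses a deliberately trivial "predict the training mean" model; the function names are made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_mse(ys, k=5):
    """Average, over k folds, the MSE of predicting the training-fold mean."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        mean_y = sum(ys[i] for i in train_idx) / len(train_idx)
        mse = sum((ys[i] - mean_y) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


ys = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 13.0, 15.0]
print(cross_validated_mse(ys, k=5))
```

Each observation is used for testing exactly once, and the reported score is the average over the folds rather than the result of a single lucky or unlucky split.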

The information for the metric, hyperparameters, and cross-validation is usually put into a grid search that then runs the model. Whether you are using R or Python, the printout will tell you which combination of hyperparameters is the best based on the metric you chose.

Validation test

When you know what your hyperparameters are, you can move your model to validation or straight to testing. If you are using a validation set, you assess your model's performance on this new data. If the results are satisfactory based on your metric, you can move to testing. If not, you may move back and forth between training and the validation set, making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.


This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture for a somewhat complex process.

Gradient Boosting Regression in Python

In this post, we will take a look at gradient boosting for regression. Gradient boosting builds models sequentially, with each new model trying to explain the examples that the previous models did not explain well. This approach often allows gradient boosting to outperform AdaBoost.
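The sequential idea can be illustrated with a small hand-rolled sketch. It is a simplified illustration for squared error only (where the quantity each new tree is fit to is simply the current residual), not the implementation inside scikit-learn. The function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=1):
    """Fit trees one after another, each on the residuals left by the ensemble so far."""
    base = float(np.mean(y))  # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction  # what is still unexplained
        stump = DecisionTreeRegressor(max_depth=max_depth, random_state=1)
        stump.fit(X, residuals)
        prediction = prediction + learning_rate * stump.predict(X)
        trees.append(stump)
    return base, trees


def predict_boosted(base, trees, X, learning_rate=0.1):
    """Add up the constant start and the shrunken contribution of every tree."""
    return base + learning_rate * sum(t.predict(X) for t in trees)


rng = np.random.default_rng(1)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - predict_boosted(base, trees, X)) ** 2)
print(mse_start, mse_end)
```

The training error after boosting is lower than the error of the constant starting prediction, because every tree removes part of what the earlier trees left unexplained.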

Regression trees are most commonly teamed with boosting. There are some additional hyperparameters that need to be set, which include the following.

  • number of estimators
  • learning rate
  • subsample
  • max depth

We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.
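A minimal sketch of this step, assuming pandas is available. The small frame below is a made-up stand-in so that the example runs on its own; in the post the frame would come from pydataset (for example `df = data('cancer')`), and the column names used here, including `wt.loss` for weight loss, are assumptions about that dataset.

```python
import pandas as pd

# Made-up stand-in rows; replace with the real dataset when following along.
df = pd.DataFrame({
    "time": [306, 455, 1010, 210, 883],
    "age": [74, 68, 56, 57, 60],
    "sex": [1, 1, 1, 1, 1],
    "meal.cal": [1175.0, 1225.0, None, 1150.0, None],
    "wt.loss": [None, 15.0, 15.0, 11.0, 0.0],
})

df = df.dropna()                  # remove rows with any missing value
X = df.drop(columns=["wt.loss"])  # independent variables
y = df["wt.loss"]                 # dependent variable: weight loss

print(X.shape, y.shape)
```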


We can now move to creating our baseline model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create several regression trees. The difference between the regression trees will be the max depth. The max depth limits how many levels of splits Python can make as it tries to make each node of the tree more homogeneous. We will then decide which tree is best based on the mean squared error.

The first thing we need to do is set the arguments for the cross-validation. Cross-validating helps to give a more stable estimate of the error. The rest of the code requires the use of for loops and if statements, which will not be explained in detail in this post. Below is the code with the output.

for depth in range (1,10):

You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.
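The baseline search described above can be sketched as follows. This version uses synthetic data from `make_regression` so that it runs on its own, which means the numbers will not match the ones reported in this post. The `KFold` settings are also assumptions.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, not the cancer dataset.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=1)

# Assumed cross-validation settings.
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

results = {}
for depth in range(1, 10):
    tree_regressor = tree.DecisionTreeRegressor(max_depth=depth, random_state=1)
    scores = cross_val_score(tree_regressor, X, y,
                             scoring="neg_mean_squared_error",
                             cv=crossvalidation, n_jobs=1)
    results[depth] = -np.mean(scores)  # flip the sign back to a plain MSE
    print(depth, results[depth])

best_depth = min(results, key=results.get)
print("best depth:", best_depth)
```

scikit-learn reports the negated mean squared error so that larger scores are always better, which is why the sign is flipped before comparing depths.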

However, before we create our gradient boosting model, we need to tune the hyperparameters of the algorithm.

Hyperparameter Tuning

Hyperparameter tuning has to do with setting the value of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better at knowing where to set these values than the computer. Therefore, the process that is commonly used is to have the algorithm try several combinations of values until it finds the values that are best for the model. Having said this, there are several hyperparameters we need to tune, and they are as follows.

  • number of estimators
  • learning rate
  • subsample
  • max depth

The number of estimators is how many trees to create. The more trees, the more likely the model is to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use for each tree. Max depth was explained previously.

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside the GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function, such as the estimator, the grid, and cv. Below is the code.
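A sketch of this setup, assuming synthetic data and a deliberately small grid so that it runs quickly. The candidate values, the `KFold` settings, and the variable names `GBR`, `search_grid`, and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data and assumed cross-validation settings.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)
crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)

GBR = GradientBoostingRegressor()

# Every combination of these values will be evaluated.
search_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.1],
    "max_depth": [1, 2],
    "subsample": [0.5, 1.0],
    "random_state": [1],  # a single value, used only to fix the seed
}

search = GridSearchCV(estimator=GBR, param_grid=search_grid,
                      scoring="neg_mean_squared_error",
                      cv=crossvalidation, n_jobs=1)
```

Calling `search.fit(X, y)` evaluates every combination, after which `search.best_params_` and `search.best_score_` hold the winning settings and their cross-validated score.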


We can now run the code and determine the best combination of hyperparameters and how well the model did based on the mean squared error metric. Below is the code and the output.
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

Out[14]: -160.51398257591643

The hyperparameter results speak for themselves. The score is negative because scikit-learn reports the negated mean squared error, so the actual mean squared error is about 160.5, which is lower than the baseline model's 176. We can now move to the final step of taking these hyperparameter settings and seeing how they do on the dataset. The results should be almost the same.

Gradient Boosting Model Development

Below is the code and the output for the tuned gradient boosting model

Out[18]: -160.77842893572068

These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.
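A sketch of this final step, using the hyperparameter values reported by the grid search above but on synthetic data, so the score will differ from the one in this post. The `KFold` settings and the name `GBR2` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data and assumed cross-validation settings.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=1)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

# Values copied from the best combination reported by the grid search.
GBR2 = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                 subsample=0.5, max_depth=1, random_state=1)

score = np.mean(cross_val_score(GBR2, X, y,
                                scoring="neg_mean_squared_error",
                                cv=crossvalidation, n_jobs=1))
print(score)
```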


In this post, we looked at how to use gradient boosting to improve a regression tree. By creating multiple models, gradient boosting will usually perform better than algorithms that rely on only one model.

Review of The Landmark History of the American People Vol 1

The book The Landmark History of the American People Vol 1 by Daniel Boorstin (pp. 169) provides a rich explanation of the history of the United States from the dawn of colonial America until the end of the 19th century. Daniel Boorstin was a rather famous author and a former Librarian of Congress. Holding such a position gives you an idea of the esteem in which this man was held.

The Summary

This book covers many interesting aspects of early American history. It begins with the development of the colonies. From there the text provides a detailed account of the eventual split from Great Britain. The next focus of the text is on America heading west through an expansion that involved purchasing land, warfare, and unfortunate exploitation.

The latter part of the text focuses somewhat more on such ideas as life out in the western frontier. There is also a mention of the early effects of the industrial revolution with the development of the train and all the advantages and dangers that this brought.

The Good

This book provides a lot of interesting details about life in America. For example, on the frontier, Americans developed something called the balloon frame house. This type of building was faster to construct and relatively safe when compared to the European model of building at this time. These kinds of little details are not common in most texts for children.

The text is also full of illustrations that capture the time period the author was writing about. From pictures of Puritans, to Indians, to photos of various famous American historical sites, this text has a little of everything.

The Bad

Although the text is full of illustrations, it is still primarily text based. In addition, even though the text is full of interesting details, this can also be a disadvantage if you or your student needs the big picture about a particular time period. Yes, I did compliment the detail about the development of the balloon frame house. However, what is the benefit of knowing this small detail from American history?

Younger children would struggle with the writing and the text-heavy nature of the book. To be fair, perhaps the author was gearing this book towards older students. However, in the preface, the editor mentions that this book was meant to be read by parents to 3rd or 4th graders. This seems like a tall task given the content.

The Recommendation

This book would be good for older kids, perhaps those in middle school, who have the reading comprehension and perhaps the curiosity to handle such a text. However, for younger children I am convinced the text is too complicated for them to appreciate. One way to address this is to focus on the visual aspects of the book and not worry too much about getting every detail of the challenging text.

Gradient Boosting Classification in Python

Gradient boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses optimization to weight the estimators. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth limits how many levels of splits a tree can have. The higher the number, the purer the classification becomes. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a small decimal value up to 1, which means the whole sample. If the value is set below 1, the method becomes stochastic gradient boosting.
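A small sketch of the subsample setting on synthetic data. The dataset, the number of estimators, and the variable names `half` and `full` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# subsample=0.5: each tree is fit on a random half of the rows.
half = GradientBoostingClassifier(n_estimators=50, subsample=0.5,
                                  random_state=1).fit(X, y)

# subsample=1.0: every tree sees all of the rows.
full = GradientBoostingClassifier(n_estimators=50, subsample=1.0,
                                  random_state=1).fit(X, y)

# With subsample below 1.0, the rows left out of each tree give an
# out-of-bag estimate of how much that tree improved the loss.
print(len(half.oob_improvement_))
print(half.score(X, y), full.score(X, y))
```

Training each tree on only part of the data adds randomness between trees, which can help reduce overfitting at the cost of each tree seeing fewer examples.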

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is simple in this situation. All we need to do is load our dataset, drop missing values, and create our X dataset and y dataset. All this happens in the code below.


We will now develop our baseline decision tree model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criterion for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The model with the best accuracy will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

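# crossvalidation is the KFold object set up beforehand
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth)  # one tree per depth
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
    print(depth,score)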
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

It appears that when the max depth is limited to 1, we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the hyperparameters for the gradient boosting algorithm.

Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are as follows

  • number of estimators
  • learning rate
  • subsample
  • max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross-validation values we made earlier, and n_jobs all together in one place. Below is the code for this.


You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then a second run-through is done to find the learning rate. In our example, we are doing everything at once, which is why it takes longer. Below is the code with the output for the best parameters and best score.
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.
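The two-pass tuning mentioned earlier (structure and ensemble size first, learning rate second) can be sketched as follows. This is an illustration on synthetic data with a small grid; the candidate values, the fixed learning rate in the first pass, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data and assumed cross-validation settings.
X, y = make_classification(n_samples=150, n_features=8, random_state=1)
crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)

# Pass 1: tune tree structure and ensemble size at a fixed learning rate.
first_grid = {"max_depth": [1, 3], "subsample": [0.5, 1.0],
              "n_estimators": [50, 100], "random_state": [1]}
first = GridSearchCV(GradientBoostingClassifier(learning_rate=0.1), first_grid,
                     scoring="accuracy", cv=crossvalidation, n_jobs=1).fit(X, y)

# Pass 2: freeze the winners from pass 1 and search the learning rate only.
second_grid = {"learning_rate": [0.01, 0.05, 0.1]}
tuned = GradientBoostingClassifier(**first.best_params_)
second = GridSearchCV(tuned, second_grid, scoring="accuracy",
                      cv=crossvalidation, n_jobs=1).fit(X, y)

best_params = {**first.best_params_, **second.best_params_}
print(best_params)
print(second.best_score_)
```

Here the two passes evaluate 8 + 3 = 11 combinations instead of the 24 a single combined grid would need. The trade-off is that combinations never tried together may be missed.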

Gradient Boosting Model

Below is the code and results for the model with the predetermined hyperparameter values.

Out[17]: 0.742279411764706

You can see that the results are similar. This is additional evidence that the gradient boosting model does outperform the baseline decision tree model.


This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python

This post will share how to use the AdaBoost algorithm for regression in Python. What boosting does is make multiple models in a sequential manner. Each newer model tries to successfully predict what older models struggled with. For regression, the predictions of the models are combined into a single prediction (scikit-learn's implementation uses a weighted median). It is most common to use boosting with decision trees, but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created and then combined. An assumption of boosting is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using AdaBoost regression to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Regression decision tree baseline model
  3. Hyperparameter tuning of Adaboost regression model
  4. AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

Data Preparation

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.


We will now proceed to creating the baseline regression decision tree model.

Baseline Regression Tree Model

The purpose of the baseline model is to compare it to the performance of our model that utilizes AdaBoost. In order to make this model, we need to initiate a KFold cross-validation. This will help in stabilizing the results. Next, we will create a for loop so that we can create several trees that vary based on their depth. By depth, we mean how many levels of splits the tree can make as it tries to make each node more homogeneous. More depth often leads to a higher likelihood of overfitting.

Finally, we will print the results for each tree. The criterion used for judgment is the mean squared error. Below is the code and results.

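# crossvalidation is the KFold object set up beforehand
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth)  # one tree per depth
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
    print(depth,score)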
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

Hyperparameter Tuning

For hyperparameter tuning, we need to start by initiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters, which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make, and the learning rate indicates how much each tree contributes to the overall results. There is one more parameter, random_state, but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function you have to set the estimator, which is AdaBoostRegressor, the parameter grid which we just made, the cross-validation which we made when we created the baseline model, and n_jobs, which allocates resources for the calculation. Below is the code.


Next, we can run the model with the desired grid in place. Below is the code for fitting the model as well as the best parameters and the score to expect when using the best parameters.
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean squared error of about 165, which is a little lower than the 176 of our single decision tree. We will see how this works when we run our model with the refined hyperparameters.

AdaBoost Regression Model

Below is our model but this time with the refined hyperparameters.

Out[36]: -174.52604137201791

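You can see the score, a mean squared error of about 174.5, is not as good as the grid search result, but it is within reason and still slightly better than the baseline tree's 176.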
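A comparison of this kind can be sketched as follows, using the tuned values reported above but on synthetic data, so the numbers will differ from the ones in this post. The `KFold` settings and the helper name `mean_cv_score` are assumptions.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data and assumed cross-validation settings.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)
crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)


def mean_cv_score(model):
    """Average negated MSE over the folds (closer to zero is better)."""
    return np.mean(cross_val_score(model, X, y,
                                   scoring="neg_mean_squared_error",
                                   cv=crossvalidation, n_jobs=1))


# Baseline: a single regression tree with the depth chosen earlier.
baseline = tree.DecisionTreeRegressor(max_depth=2, random_state=1)

# AdaBoost with the hyperparameter values reported by the grid search.
ada = AdaBoostRegressor(n_estimators=500, learning_rate=0.01, random_state=1)

baseline_score = mean_cv_score(baseline)
ada_score = mean_cv_score(ada)
print(baseline_score, ada_score)
```

Scoring both models with the same folds keeps the comparison fair, since any difference then comes from the models rather than from how the data was split.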


In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can, at times, improve on the performance of a single model, as it did modestly here.