Simple Interest VIDEO

Calculating simple interest

Advertisements

Review of The Beginner’s American History

In this post, we take a look at The Beginners American History. This book was written by D.H. Montgomery in the late 19th century and was updated by John Holzmann (pp. 269).

Summary

This is a classic text that covers the history of the United States from Christopher Columbus’ discovery of America until the Gold Rush of California. All of the expected content is there from Captain John Smith, George Washington, to even Eli Whitney. Other information shared includes the various wars in America from the battles with the British for independence to the wars with the Mexicans and Indians for control of the land in what is now the United States.

The Good

This would be a great personal reader for an older student. It is primarily text based and there are few illustrations. The writing is simple for the most part and is not overly weighed down with a lot of academic insights and communication. 

The illustrations that are included tend to be an ever-changing map that shows how America is being slowly taken over by the American colonist. This provides the reader of a perspective of time and the growth of the United States.

It is also beneficial for students to get an older perspective on history. The way Montgomery viewed American history in the 19th century is vastly different from how historians see it today.

The Bad

As previously mentioned, the book is text heavy. This makes it inappropriate for small children. In addition, there are few illustrations in the book. This can be a detriment to students who learn through their senses. This would also make the text hard to use in a whole-class situation.

It is a children’s book, however, the portrayal of content is in the most rudimentary manner. This may be due to the context in which the book was written as well as the purpose for this book. Either way, the book was rich on  content but lacked depth.

The Recommendation

For personal reading this is an excellent book. However, in an academic context, I believe there are superior options to the book discussed here. The age of the text provides a distinct perspective on history but lacks the content for deep learning today.

Data Science Pipeline

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

  1. What to do
  2. When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know how to do it.

In this post, we will look at a process that that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat, this process is focused primarily on supervised learning which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

  1. Data preparation (data mugging)
  2. Model training
  3. Model testing

Off course, it is much more complicated than this but this is the minimum. Within each of these steps there are several substeps, However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary but general not all of them happen ever analysis. Below is a list of steps at this level

  • Data mugging
  • Scaling
  • Normality
  • Dimension reduction/feature extraction/feature selection
  • Train, test, validation split

Data mugging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization reduces the variables to a range of 0 – 1. Standardization involves converting the examples in the variable to their respective z-score. Which one you use depends on the situation but normally it is expected to do this.

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved checking for normality is doable with a few tests and some visualizations. If normality is violated various transformations can be used to deal with this problem. Keep mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reduce the number of variables that will be included in the final analysis. This is done through factor analysis or principal component analysis. This reduction  in the number of variables is also an example of feature extraction. In some context, feature extraction is the in goal in itself. Some algorithms make their own features such as neural networks through the use of hidden layer(s)

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization such or in smaller datasets with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analyst split the data into three parts to avoid overfitting the data to the test set. There is now a training a set, a validation set, and a testing set. The  validation set allows you to check the model performance several times. Once you are satisfied you use the test set once at the end.

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

  1. Determine the metric(s) for success
  2. Creating a grid of several hyperparameter values
  3. Cross-validation
  4. Selection of the most appropriate hyperparameter values

The first thing you have to do and this is probably required is determined how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification or mean squared error for a regression model or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameters settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Using cross-validation always you to stabilize the results through averaging the results of the model over several folds of the data if you are using k-folds cross-validation. This also helps to improve the results of the hyperparameters as well.  There are several types of cross-validation but k-folds is probably best initially.

The information for the metric, hyperparameters, and cross-validation are usually put into  a grid that then runs the model. Whether you are using R or Python the printout will tell you which combination of hyperparameters is the best based on the metric you determined.

Validation test

When you know what your hyperparameters are you can now move your model to validation or straight to testing. If you are using a validation set you asses your models performance by using this new data. If the results are satisfying based on your metric you can move to testing. If not, you may move back and forth between training and the validation set making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.

Conclusion

This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture for a somewhat complex process

Gradient Boosting Regression in Python

In this  post, we will take a look at gradient boosting for regression. Gradient boosting simply makes sequential models that try to explain any examples that had not been explained by previously models. This approach makes gradient boosting superior to AdaBoost.

Regression trees are mostly commonly teamed with boosting. There are some additional hyperparameters that need to be set  which includes the following

  • number of estimators
  • learning rate
  • subsample
  • max depth

We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']

We can now move to creating our baseline model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create  several regression trees. The difference between the regression trees will be the max depth. The max depth has to with the number of nodes python can make to try to purify the classification.  We will then decide which tree is best based on the mean squared error.

The first thing we need to do is set the arguments for the cross-validation. Cross validating the results helps to check the accuracy of the results. The rest of the code  requires the use of for loops and if statements that cannot be reexplained in this post. Below is the code with the output.

for depth in range (1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth

You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.

However, before we create our gradient boosting model. we need to tune the hyperparameters of the algorithm.

Hyperparameter Tuning

Hyperparameter tuning has to with setting the value of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better at knowing where to set these values than the computer. Therefore, the process that is commonly used is to have the algorithm use several combinations  of values until it finds the values that are best for the model/. Having said this, there are several hyperparameters we need to tune, and they are as follows.

  • number of estimators
  • learning rate
  • subsample
  • max depth

The number of estimators is show many trees to create. The more trees the more likely to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use. Max depth was explained previously.

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function such as estimator, grid, cv, etc. Below is the code.

GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,4],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

We can now run the code and determine the best combination of hyperparameters and how well the model did base on the means squared error metric. Below is the code and the output.

search.fit(X,y)
search.best_params_
Out[13]: 
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

search.best_score_
Out[14]: -160.51398257591643

The hyperparameter results speak for themselves. With this tuning we can see that the mean squared error is lower than with the baseline model. We can now move to the final step of taking these hyperparameter settings and see how they do on the dataset. The results should be almost the same.

Gradient Boosting Model Development

Below is the code and the output for the tuned gradient boosting model

GBR2=GradientBoostingRegressor(n_estimators=500,learning_rate=0.01,subsample=.5,max_depth=1,random_state=1)
score=np.mean(cross_val_score(GBR2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[18]: -160.77842893572068

These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.

Conclusion

In this post, we looked at how to  use gradient boosting to improve a regression tree. By creating multiple models. Gradient boosting will almost certainly have a better performance than other type of algorithms that rely on only one model.

Review of The Landmark History of the American People Vol 1

The book The Landmark History of the American People Vol 1 by Daniel Boorstin (pp. 169) provides a rich explanation of the history of the United States from the dawn of colonial America until the end of the 19th century. Daniel Boorstin was a rather famous author and  a former Librarian of Congress. Holding such as position gives you the esteem in which this man was held.

The Summary

This book covers many interesting aspects of early American history. It begins with the development of the colonies. From there the text provides A detailed account of the eventual split from Great Britain. The next focus of the text is on the America heading west through the expansion that involved purchasing land, warfare, and unfortunate exploitation.

The latter part of the text focuses somewhat more on such ideas as life out in the western frontier. There is also a mention of the early effects of the industrial revolution with the development of the train and all the advantages and dangers that this brought.

The Good

This book provides a lot of interesting details about life in America. For example, on the frontier, Americans developed something called the balloon frame house. This type of building was faster and relatively safe when compared to the European model of building at this time. This kinds of little details are not common in most text for children

The text is also full illustrations that capture the time period in which the author was writing about. From pictures of puritans, to Indians, to even photos of various famous American historical sites. This text has a little of everything.

The Bad

Although the text is full of illustrations, it is still primarily text based. In addition, even though the text is full of interesting details this can also be a disadvantage of you or your student  needs the big picture about a particular time period. Yes, I did compliment the development of the balloon frame house. However, what is the benefit of knowing this small detail from American history?

Younger children would struggle with the writing and text heavy nature of the book. However, to be fair, perhaps the author was gearing this book towards older students. However, in the preface, the editor, mentions that this book was meant to be read by parents to 3rd or 4th graders. This seems like a tall task given the content.

The Recommendation

This book would be good for older kids. Perhaps middle school, who have the reading comprehension and perhaps the curiosity to handle such a text. However, for younger children I am convinced the text is too complicated for them to appreciate it. One way to address this is to focus on the visual aspects of the book and not worry too much of getting every detail of the challenging text.

Gradient Boosting Classification in Python

Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than adaboost. Some differences between the two algorithms is that gradient boosting uses optimization for weight the estimators. Like adaboost, gradient boosting can be used for most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number the purer the classification become. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up until the whole number 1. If the value is set to 1 it becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is simple in this situtation. All we need to do is load are dataset, dropping missing values, and create our X dataset and y dataset. All this happens in the code below.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']

We will now develop our baseline decision tree model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
if tree_classifier.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

It appears that when the max depth is limited to 1 that we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm

Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are as follows

  • number of estimators
  • learning rate
  • subsample
  • max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values from mad earlier, and n_jobs all together in one place. Below is the code for this.

GBC=GradientBoostingClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then as second run through is done to find the learning rate. In our example, we are doing everything at once which is why it takes longer. Below is the code with the out for best parameters and best score.

search.fit(X,y)
search.best_params_
Out[11]:
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
search.best_score_
Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.

Gradient Boosting Model

Below is the code and results for the model with the predetermined hyperparameter values.

ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[17]: 0.742279411764706

You can see that the results are similar. This is just additional information that the gradient boosting model does outperform the baseline decision tree model.

Conclusion

This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python

This post will share how to use the adaBoost algorithm for regression in Python. What boosting does is that it makes multiple models in a sequential manner. Each newer model tries to successful predict what older models struggled with. For regression, the average of the models are used for the predictions.  It is often most common to use boosting with decision trees but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting, is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Regression decision tree baseline model
  3. Hyperparameter tuning of Adaboost regression model
  4. AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

Data Preparation

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']

We will now proceed to creating the baseline regression decision tree model.

Baseline Regression Tree Model

The purpose of the baseline model is for comparing it to the performance of our model that utilizes adaBoost. In order to make this model we need to Initiate a Kfold cross-validation. This will help in stabilizing the results. Next we will create a for loop so that we can create several trees that vary based on their depth. By depth, it is meant how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

for depth in range (1,10):
tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
if tree_regressor.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

Hyperparameter Tuning

For hyperparameter tuning we need to start by initiating our AdaBoostRegresor() class. Then we need to create our grid. The grid will address two hyperparameters which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning indicates how each tree contributes to the overall results. There is one more parameters which is random_state but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function you have to set the estimator which is adaBoostRegressor, the parameter grid which we just made, the cross validation which we made when we created the baseline model, and the n_jobs which allocates resources for the calculation. Below is the code.

ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.

search.fit(X,y)
search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
search.best_score_
Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with the refined hyperparameters.

AdaBoost Regression Model

Below is our model but this time with the refined hyperparameters.

ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[36]: -174.52604137201791

You can see the score is not as good but it is within reason.

Conclusion

In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help to strengthen a model in many ways at times.

Review of Children’s Encyclopedia of American History

The book Children’s Encyclopedia of American History by David King (pp. 320) provides a rich explanation of the the background and shaping of America.

The Summary
This text covers American history from the 11th century all the way until the beginning of the 21st century. Over this 1,000 years of American history the text goes from  explaining the discovery of the new world, to the turbulent times of colonial America, the wars of the 18th and 19th century, all the way to dealing with terrorism.

All of the classic famous names of American history such as George Washington, Benjamin Franklin, Andrew Jackson, Abraham Lincoln, Theodore Roosevelt, John F Kennedy, and even Barrack Obama.

The Good

The text offers a rich array of authentic photos and artifacts as images in the book. Almost no detail was left undone. Pictures of buildings, famous people, and even toys of different eras are provided. There are paintings of gold  miners, maps, Indians, athletes, etc. There is even commentary on the accuracy of some of the paintings. For example, one painting shows George Washington standing in a book. The author points out that this would be dangerous as the boat might tip over. In  addition, the artist of the painted the wrong boat and the US flag was in the painting but did not exist at the time in history that the painting was depicting.

There are also lots of maps throughout the book describing America at different times in history. There are also maps of other countries when other countries interact with the US. For example, there is a map of Korea when it was divided when the book discusses the Korean War.

The book also addresses major changes in technology, influential people in such fields as arts and entertainment. Consistent with an encyclopedia, this book has a little of everything.

The Bad

There is little to disparage about this book. It is highly visually appealing for young readers. Even adults would found the text interesting especially if they do not have a strong background in history. If a criticism had to be made it might be that the book is more focus on visuals and lacks substantive text. However, this is not much of a criticism as the book is geared towards children and focus more on pictures than text.

The Recommendation

For the history teacher, this is a great text to have to augment other studies in history. The book is fairly large and could possible be used to teach a medium size group of children. The picture in the text make history come alive and remove some of the abstract nature to learning about the past.

The was even built well as it is a hard cover text that should be able to survive years of stress from the joys of children. As such, this book is highly recommended

AdaBoost Classification in Python

Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successful predict what prior models were unable to do. The average for regression and majority vote for classification are used. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.

Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Decision tree baseline model
  3. Hyperparameter tuning of Adaboost model
  4. AdaBoost model development

Below is some initial code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

Data preparation is minimal in this situation. We will load are data and at the same time drop any NA using the .dropna() function. In addition, we will place the independent variables in dataframe called X and the dependent variable in a dataset called y. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']

Decision Tree Baseline Model

We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
if tree_classifier.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.

We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the  AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

Hyperparameter Tuning AdaBoost Model

In order to tune the hyperparameters there are several things that we need to do. First we need to initiate  our AdaBoostClassifier with some basic settings. Then We need to create our search grid with the hyperparameters. There are two hyperparameters that we will set and they are number of estimators (n_estimators) and the learning rate.

Number of estimators has to do with how many trees are developed. The learning rate indicates how each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and for cross-validation. Below is the code for all of this

ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

We can now run the model of hyperparameter tuning and see the results. The code is below.

search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802

We can see that if the learning rate is set to 0.01 and the number of estimators to 1000 We can expect an accuracy of 74%. This is superior to our baseline model.

AdaBoost Model

We can now rune our AdaBoost Classifier based on the recommended hyperparameters. Below is the code.

score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589

We knew we would get around 74% and that is what we got. It’s only a 3% improvement but depending on the context that can be a substantial difference.

Conclusion

In this post, we look at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models to determine the most accurate classification in a sequential manner. Doing this will often lead to an improvement in the prediction of a model.

Research Questions, Variables, and Statistics

Working with students over the years has led me to the conclusion that often students do not understand the connection between variables, quantitative research questions and the statistical tools

used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary skills in statistics to succeed.


This means that the students have a theoretical understanding of statistics but struggle in the application of it. In this post, we will look at some of the connections between research questions and statistics.

Variables

Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.

Another aspect that can make this confusing is that many variables can be measured more than one way. Sometimes the variable “salary” can be measured in a continuous manner or in a categorical manner. The superiority of one or the other depends on the goals of the research.

It is critical to support students to have a thorough understanding of variables in order to support their research.

Types of Research Questions

In general, there are two types of research questions. These two types are descriptive and relational questions. Descriptive questions involve the use of descriptive statistic such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (ie the average height is 172cm) rather than relying on qualitative descriptions of it (ie the people are tall).

Below are several example research questions that are descriptive in nature.

  1. What is the average height of the participants in the study?
  2. What proportion of the sample is passed the exam?
  3. What are the respondents perceptions towards the cafeteria?

These questions are not intellectually sophisticated but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items that are used to measure respondents perception of the cafeteria.

Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.

Relational questions look for the connection or link between variables. Within this type there are two sub-types. Comparison question involve comparing groups. The other sub-type is called relational or an association question.

Comparison questions involve comparing groups on a continuous variable. For example, comparing men and women by height. What you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is looking for a relationship just not in the way that many student understand. Common comparison questions include the following.male

  1. Is there a difference in height by gender among the participants?
  2. Is there a difference in reading scores by grade level?
  3. Is there a difference in job satisfaction in based on major?

Each of these questions can be answered using ANOVA or if we want to get technical and there are only two groups (ie gender) we can use t-test. This is a broad overview and does not include the complexities of one-sample test and or paired t-test.

Relational or association question involve continuous variables primarily. The goal is to see how variables move together. For example, you may look for the relationship between height and weight of students. Common questions include the following.

  1.  Is there a relationship between height and weight?
  2. Does height and show size explain weight?

Questions 1 can be answered by calculating the correlation. Question 2 requires the use of linear regression in order to answer the question.

Conclusion

The challenging as a teacher is showing the students the connection between statistics and research questions from the real world. It takes time for students to see how the question inspire the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what to do in research based on the statistical knowledge one has.

Recommendation Engine with Python

Recommendation engines make future suggestion to a person based on their prior behavior. There are several ways to develop recommendation engines but for purposes, we will be looking at the development of a user-based collaborative filter. This type of filter takes the ratings of others to suggest future items to another user based on the other user’s ratings.

Making a recommendation engine in Python actually does not take much code and is somewhat easy consider what can be done through coding. We will make a movie recommendation engine using data from movielens.

 

Below is the link for downloading the zip file 

Inside the zip file are several files we will use. We will use each in a few moments. Below is the initial code to get started

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
import numpy as np

We will now make 4 dataframes. Dataframes 1-3 will be the user, rating, and movie title data. The last dataframe will be a merger of the first 3. The code is below with a printout of the final result.

user = pd.read_table('/home/darrin/Documents/python/new/ml-1m/users.dat', sep='::', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'],engine='python')
rating = pd.read_table('/home/darrin/Documents/python/new/ml-1m/ratings.dat', sep='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'],engine='python')
movie = pd.read_table('/home/darrin/Documents/python/new/ml-1m/movies.dat', sep='::', header=None, names=['movie_id', 'title', 'genres'],engine='python')
MovieAll = pd.merge(pd.merge(rating, user), movie)

We now need to create a matrix using the .pivot_table function. This matrix will include ratings and user_id from our “MovieAll” dataframe. We will then move this information into a dataframe called “movie_index”. This index will help us keep track of what movie each column represents. The code is below.

rating_mtx_df = MovieAll.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)

There are many variables in our matrix. This makes the computational time long and expensive. To reduce this we will reduce the dimensions using the TruncatedSVD function. We will reduce the matrix to 20 components. We also need to transform the data because we want the Vh matrix and no tthe U matrix. All this is hand in the code below.

recomm = TruncatedSVD(n_components=20, random_state=10)
R = recomm.fit_transform(rating_mtx_df.values.T)

What we saved our modified dataset as “R”. If we were to print this it would show that each row has two columns with various numbers in it that cannot be interpreted by us.  Instead, we will move to the actual recommendation part of this post.

To get a recommendation you have to tell Python the movie that you watch first. Python will then compare this movie with other movies that have a similiar rating and genera in the training dataset and then provide recommendation based on which movies have the highest correlation to the movie that was watched.

We are going to tell Python that we watched “One Flew Over the Cuckoo’s Nest” and see what movies it recommends.

First, we need to pull the information for just “One Flew Over the Cuckoo’s Nest”  and place this in a matrix. Then we need to calculate the correlations of all our movies using the modified dataset we named “R”. These two steps are completed below.

cuckoo_idx = list(movie_index).index("One Flew Over the Cuckoo's Nest (1975)")
correlation_matrix = np.corrcoef(R)

Now we can determine which movies have the highest correlation with our movie. However, to determine this, we must gvive Python a range of acceptable correlations. For our purposes we will set this between 0.93 and 1.0. The code is below with the recommendations.

P = correlation_matrix[cuckoo_idx]
print (list(movie_index[(P > 0.93) & (P < 1.0)]))
['Graduate, The (1967)', 'Taxi Driver (1976)']

You can see that the engine recommended two movies which are “The Graduate” and “Taxi Driver”. We could increase the number of recommendations by lower the correlation requirement if we desired.

Conclusion

Recommendation engines are a great tool for generating sales automatically for customers. Understanding the basics of how to do this a practical application of machine learning

 

Elastic Net Regression in Python

Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that with elastic net the algorithm can remove weak variables altogether as with lasso or to reduce them to close to zero as with ridge. All of these algorithms are examples of regularized regression.

This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.

  1. Data preparation
  2. Baseline model development
  3. Elastic net model development

To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.

from pydataset import data
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Data Preparation

We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our  and y datasets. Below is the code.

df=pd.DataFrame(data('Fair'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
df.loc[df.child== 'no', 'child'] = 0
df.loc[df.child== 'yes','child'] = 1
df['child'] = df['child'].astype(int)
X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]
y=df['rate']

We can now proceed to creating the baseline model

Baseline Model

This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
1.0498738644696668

This mean standard error score of 1.05 is our benchmark for determining if the elastic net model will be better or worst. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to combine the two columns.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'religious': 0.04235281110639178,
'age': -0.009059645428673819,
'sex': 0.08882013337087094,
'ym': -0.030458802565476516,
'education': 0.06810255742293699,
'occupation': -0.005979506852998164,
'nbaffairs': -0.07882571247653956}

We will now move to making the elastic net model.

Elastic Net Model

Elastic net, just like ridge and lasso regression, requires normalize data. This argument  is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we create for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.

When the l1_ratio is set to 0 it is the same as ridge regression. When l1_ratio is set to 1 it is lasso. Elastic net is somewhere between 0 and 1 when setting the l1_ratio. Therefore, in our grid, we need to set several values of this argument. Below is the code.

elastic=ElasticNet(normalize=True)
search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

We will now fit our model and display the best parameters and the best results we can get with that setup.

search.fit(X,y)
search.best_params_
Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}
abs(search.best_score_)
Out[74]: 1.0816514028705004

The best hyperparameters was an alpha set to 0.001 and a l1_ratio of 0.8. With these settings we got an MSE of 1.08. This is above our baseline model of MSE 1.05  for the baseline model. Which means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to the recommended values and run on the data.

elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)
elastic.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))
print(second_model)
1.0566430678343806

Now our values are about the same. Below are the coefficients

coef_dict_baseline = {}
for coef, feat in zip(elastic.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[76]:
{'religious': 0.01947541724957858,
'age': -0.008630896492807691,
'sex': 0.018116464568090795,
'ym': -0.024224831274512956,
'education': 0.04429085595448633,
'occupation': -0.0,
'nbaffairs': -0.06679513627963515}

The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this values was no good to the algorithm. Traditional regression cannot do this.

Conclusion

This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

Lasso Regression with Python

Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.

Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “CaSchools” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.

  1. Data preparation
  2. Develop a baseline linear model
  3. Develop lasso regression model

The initial code is as follows

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
df=pd.DataFrame(data(‘Caschool’))

Data Preparation

The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent w=but it was decided to leave them out anyway. Below is the code.

X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]
y=df['testscr']

Baseline Model

We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.

regression=LinearRegression()
regression.fit(X,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
69.07380530137416

First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we predicted future values of our regression model and save the results to the object first_model. Lastly, we printed the results.

Below are the coefficient for the baseline regression model.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[52]:
{'teachers': 0.00010011947964873427,
'calwpct': -0.07813766458116565,
'mealpct': -0.3754719080127311,
'compstu': 11.914006268826652,
'expnstu': 0.001525630709965126,
'str': -0.19234209691788984,
'avginc': 0.6211690806021222,
'elpct': -0.19857026121348267}

The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.

Lasso Model

For our lasso model, we have to determine what value to set the l1 or alpha to prior to creating the model. This can be done with the grid function, This function allows you to assess several models with different l1 settings. Then python will tell which setting is the best. Below is the code.

lasso=Lasso(normalize=True)
search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
search.fit(X,y)

We start be instantiate lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we setup our grid, we include the estimator, and parameter grid, and scoring. The alpha is set using logspace. We want values between -5 and 2, and we want 8 evenly spaced settings for the alpha. The other arguments include cv which stands for cross-validation. n_jobs effects processing and refit updates the parameters. 

After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.

search.best_params_
Out[55]: {'alpha': 1e-05}
abs(search.best_score_)
Out[56]: 85.38831122904011

`The alpha is set almost to zero, which is the same as a regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.

lasso=Lasso(normalize=True,alpha=1e-05)
lasso.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))
print(second_model)
69.0738055527604

The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.

coef_dict_baseline = {}
for coef, feat in zip(lasso.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'teachers': 9.795933425676567e-05,
'calwpct': -0.07810938255735576,
'mealpct': -0.37548182158171706,
'compstu': 11.912164626067028,
'expnstu': 0.001525439984250718,
'str': -0.19225486069458508,
'avginc': 0.6211695477945162,
'elpct': -0.1985510490295491}

The coefficient values are also slightly different. The only difference is the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscrs. That is ironic to say the least.

Conclusion

Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python  is fairly simple. This yet another tool that can be used in statistical analysis.

Differences in Thinking

Critical thinkers and problem solvers are two groups of people.  Sadly, these two groups are almost mutually exclusive. However, it is important that thinkers and solvers develop both skillsets to a certain level of competence.

The purpose of this post is to try and explain in detail critical thinking vs problem-solving in term of individual differences.

Thinking is a slow deliberate process that takes to do. In other words, a person must decide to think. Since there is a requirement of active effort, thinking is something that few people value and appreciate as they should.

Thinking involves processing information from the viewpoint of central processing. This means to examine the content of a message for its worth. Furthermore, when a person is developing their own arguments thinking involves developing support for one’s position. Often when people argue or disagree today they tend to get upset. This is an indication that their emotions are determining their position rather than their mind. They might use their mind on occasion to strengthen their argument but the foundation of their position is often emotional rather than based on strong thought.

 

Developing the mind usually involves reading. Reading exposes an individual to good and poor examples of thinking.  From these examples, an individual thinks about the strengths and merits of each. This process of thinking about other people’s thoughts helps a person to develop their own opinion. When an is formed it can be shared with others who are then able to judge for themselves the merit of the person’s opinion.

 

This process of thinking is not often required for academic studies. The focus has moved more towards problem-solving. Problem-solving is In an excellent form of thinking when the end goal is often binary in nature. This means that when a problem solves, either they solve the problem or they do not.

 

Critical thinking involves a certain fuzziness to it that problem-solving lacks  For example, whether a speech or paper is good or bad involves critical thinking because judging quality involves fuzziness to it. This sense of a shade of gray would make solving problems difficult at the least. T

 

However, if you are called to determine why a computer does not connect to the internet this is problem-solving. The goal is to get back on the internet. You have to think but the desired outcome is clear. Once the computer is back on the internet there is nothing to think about. In most cases, particular with non techie people, how you get back on the internet does not even matter. In other words, the “why does this work” is often something that problem solvers do not care about but this is exactly the type of thing critical thinking has to be able to explain when developing an argument.

 

Problem-solving involves action and not as much contemplation. The focus is on experience and not theory. It is not that problem-solvers never read and contemplate, rather, they learn primarily through doing. Examples include trial and error. 

Most companies want problem solvers and not necessarily critical thinkers. In other words, businesses want things done. They do not want people going around and questioning unless this helps to solve a problem.  Companies claim to want thinking but what they really want are people who think how to solve the company’s problems. Questioning the company is not one of the wiser things to do.

The fuzziness of critical thinking frustrates problem-solvers who want to solve problems and not simply talk. This is not a negative thing but rather a difference in personality. The problem is that problem solvers and critical thinkers do not see this as a matter of difference but a matter of ignorance on one hand and irrelevance on the other hand. Thinkers think and problem solvers do is a common description of both sides

 

Conclusion

Critical thinking and problem-solving are two skills that everyone needs. To focus on either to the exclusion of the other is detrimental. A combination of thought and action creates a balanced individual who is able to get things done while still have a depth of thought to support their actions.

 

Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing coefficients of variables either by removing them and or reduce their impact. Ridge regression reduces the effect of problematic variables close to zero but never fully removes them.

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps/

  1. Data preparation
  2. Baseline model development
  3. Ridge regression model

Below is the initial code

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_erro

Data Preparation

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

df=pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y=df['lnhhexp'

We can now create our baseline regression model.

Baseline Model

The metric we are using is the mean squared error. Below is the code and output for our baseline regression model. This is a model that has no regularization to it. Below is the code.

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
0.35528915032173053

This  value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.

Ridge Model

In order to create our ridge model we need to first determine the most appropriate value for the l2 regularization. L2 is the name of the hyperparameter that is used in ridge regression. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first are ridge model and indicate normalization in order to get better estimates. Next we setup the grid that we will use. Below is the code.

ridge=Ridge(normalize=True)
search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

The search object has several arguments within it. Alpha is hyperparameter we are trying to set. The log space is the range of values we want to test. We want the log of -5 to 2, but we only get 8 values from within that range evenly spread out. Are metric is the mean squared error. Refit set true means to adjust the parameters while modeling and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_scores_ function to determine the model;s strength. Below is the code.

search.fit(X,y)
search.best_params_
{'alpha': 0.01}
abs(search.best_score_)
0.3801489007094425

The best_params_ tells us what to set alpha too which in this case is 0.01. The best_score_ tells us what the best possible mean squared error is. In this case, the value of 0.38 is worse than what the baseline model was. We can confirm this by  fitting our model with the ridge information and finding the mean squared error. This is done below.

ridge=Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)
0.35529321992606566

The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,data("VietNamI").columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[188]:
{'pharvis': 0.013282050886950674,
'lnhhexp': 0.06480086550467873,
'age': 0.004012412278795848,
'sex': -0.08739614349708981,
'married': 0.075276463838362,
'educ': -0.06180921300600292,
'illness': 0.040870384578962596,
'injury': -0.002763768716569026,
'illdays': -0.006717063310893158,
'actdays': 0.1468784364977112}


coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,data("VietNamI").columns):
coef_dict_ridge[feat] = coef
coef_dict_ridge
Out[190]:
{'pharvis': 0.012881937698185289,
'lnhhexp': 0.06335455237380987,
'age': 0.003896623321297935,
'sex': -0.0846541637961565,
'married': 0.07451889604357693,
'educ': -0.06098723778992694,
'illness': 0.039430607922053884,
'injury': -0.002779341753010467,
'illdays': -0.006551280792122459,
'actdays': 0.14663287713359757}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.

Conclusion

Ridge regression allows you to penalize variables based on their useful in developing the model. With this form of regularized regression the coefficients of the variables is never set to zero. Other forms of regularization regression allows for the total removal of variables. One example of this is lasso regression.

Undergrad and Grad Students

In this post,  we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill sears initially and quality is often compromised.

Focus

At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require rather an intense specialization from the beginning with a minimum of general education while others take a more American style approach with a wide exposure to various fields.

Commitment

Graduate students are usually older. This means that they require less institution sponsored social activities and may not socialize at all. In addition, some graduate students are married which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family they are also going to struggle due to the time commitment of their loved ones.

Assuming that an undergraduate student is a traditional one they will tend to be straight from high school, require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are often missing for academic success.

For the teacher, graduate students offer higher motivation and commitment generally when compared to undergrads. This is reasonable as people often feel compelled to complete a bachelors but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.

Conclusion

Despite the differences but types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Hyperparameter Tuning in Python

Hyperparameters are a numerical quantity you must set yourself when developing a model. This is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.

Algorithms cannot determine hyperparameters themselves which is why you have to do it. The problem is that the typical person has no idea what is an optimally choice for the hyperparameter. To deal with this confusion, often a range of values are inputted and then it is left to python to determine which combination of hyperparameters is most appropriate.

In this post, we will learn how to set hyperparameters by developing a grid in  Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and not married based on several independent variables. The steps of this process is as follows

  1.  Data preparation
  2. Baseline model (for comparison)
  3. Grid development
  4. Revised model

Below is some initial code that includes all the libraries and classes that we need.

import pandas as pd
import numpy as np
from pydataset import data
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Data Preparation

The dataset PSID has several problems that we need to address.

  • We need to remove all NAs
  • The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
  • The educatnn and kids variables have codes that are 98 and 99. These need to be removed because they do not  make sense.

Below is the code that deals with all of this

df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True
  1. Line 1 loads the dataset and drops the NAs
  2. Line 2-4 create our dummy variable for marriage. We create a new variable called marry to hold the results
  3. Lines 5-6 drop the values in  kids and educatn that are above 90.

Below we create our X and y datasets and then are ready to make our baseline model.

X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']

Baseline Model

The purpose of  baseline model is to see how much better or worst the hyperparameter tuning works. We are using K Nearest Neighbors  for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.

  1. number of neighbors
  2. weight of neighbors
  3. metric for measuring distance
  4. power parameter for minkowski

Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.

classifier=KNeighborsClassifier(n_neighbors=5,weights=’uniform’, metric=’minkowski’,p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring=’accuracy’,n_jobs=1)) 0.6188104238047426

Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance

Grid Development

The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run

grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96

You can see we made a simple list that has several values for each hyperparameter

  1. Number if neighbors can be 1 to 13
  2. weight of neighbors can be uniform or distance
  3. metric can be manhatten or minkowski
  4. p can be 1 or 2

We will develop 96 models all together. Below is the code to begin tuning the hyperparameters.

search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)

The estimator is the  code for the type of algorithm we are using. We set this earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process. refit is for changing parameters and cv is for cross-validation folds.The search.fit command runs the model

The code below provides the output for the results.

print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667

The best_params_ function tells us what the most appropriate parameters are. The best_score_ tells us what the accuracy of the model is with the best parameters. Are model accuracy improves from 61% to 65% from adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyper parameters.

Model Revision

Below is the cod efor the erevised model

classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1)) #new res
Out[24]: 0.6503909993913031

Exactly as we thought. This is a small improvement but this can make a big difference in some situation such as in a data science competition.

Conclusion

Tuning hyperparameters is one of the final pieces to improving a model. With this tool, small gradually changes can be seen in a model. It is important to keep in mind this aspect of model development in order to have the best success final.

Variable Selection in Python

A key concept in machine learning and data science in general is variable selection. Sometimes, a dataset can have hundreds of variables to include in a model. The benefit of variable selection is that it reduces the amount of useless information aka noise in the model. By removing noise it can improve the learning process and help to stabilize the estimates.

In this post, we will look at two ways to do this.  These two common approaches are the univariate approach and the greedy approach. The univariate approach selects variables that are most related to the dependent variable based on a metric. The greedy approach will alone remove a variable if getting rid of it does not affect the model’s performance.

We will now move to our first example which is the univariate approach using Python. We will use the VietNamH dataset from the pydataset library. Are goal is to predict how much a family spends on medical expenses. Below is the initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression
df=data('VietNamH').dropna()

Are data is called df. If you use the head function, you will see that we need to convert several variables to dummy variables. Below is the code for doing this.

df.loc[df.sex== 'female', 'sex'] = 0
df.loc[df.sex== 'male','sex'] = 1
df.loc[df.farm== 'no', 'farm'] = 0
df.loc[df.farm== 'yes','farm'] = 1
df.loc[df.urban== 'no', 'urban'] = 0
df.loc[df.urban== 'yes','urban'] = 1

We now need to setup or X and y datasets as shown below

X=df[['age','educyr','sex','hhsize','farm','urban','lnrlfood']]
y=df['lnmed']

We are now ready to actual use the univariate approach. This involves the use of two different classes in Python. The SelectPercentile class allows you to only include the variables that meet a certain percentile rank such as 25%. The f_regression class is designed for checking a variable’s performance in the context of regression.  Below is the code to run the analysis.

selector_f=SelectPercentile(f_regression,percentile=25)
selector_f.fit(X,y)

We can now see the results using a for loop. We want the scores from our selector_f object. To do this we setup a for lop and use the zip function to iterate over the data. The output is placed in the print statement. Below is the code and output for this.

for n,s in zip(X,selector_f.scores_):
print('F-score: %3.2f\t for feature %s ' % (s,n))
F-score: 62.42 for feature age
F-score: 33.86 for feature educyr
F-score: 3.17 for feature sex
F-score: 106.35 for feature hhsize
F-score: 14.82 for feature farm
F-score: 5.95 for feature urban
F-score: 97.77 for feature lnrlfood

You can see the f-score for all of the independent variables. You can decide for yourself which to include.

Greedy Approach

The greedy approach only removes variables if they do not impact model performance. We are using the same dataset so all we have to do is run the code. We need the RFECV class from the model_selection library. We then use the function RFECV and set the estimator, cross-validation, and scoring metric. Finally, we run the analysis and print the results. The code is below with the output.

from sklearn.feature_selection import RFECV
select=RFECV(estimator=regression,cv=10,scoring='neg_mean_squared_error')
select.fit(X,y)
print(select.n_features_)
7

The number 7 represents how many independent variables to include in the model. Since we only had 7 total variables we should include all variables in the model.

Conclusion

With help with univariate and greedy approaches, it is possible to deal with a large number of variables efficiently one developing models. The example here involve only a handful of variables. However, bear in mind that the approaches mentioned here are highly scalable and useful.

Cross-Validation in Python

A common problem in machine learning is data quality. In other words, if the data is bad the model will be bad even if it is designed using best practices. Below is a short of some possible problems with data

  • Sample size is to small-Hurts all algorithms
  • Sample size too big-Hurts complex algorithms
  • Wrong data-Hurts all  algorithms
  • Too many variables-Hurts complex algorithms

Naturally, this list is not exhaustive. Whenever some of the above situations take place it can lead to a model that has bias or variance. Bias takes place when the model highly over and under estimates values. This is common in regression when the relationship among the variables is not linear. The linear line that is developed by the  model works sometimes but is often erroneous.

Variance is when the model is too sensitive to the characteristics of the training data. This means that the model develops a complex way to classify or performs regression that does not generalize to other datasets

One solution to addressing these problems is the use of cross-validation. Cross-validation involves dividing the training set into several folds. For example, you may divide the data into 10 folds. With 9 folds you train the data and with the 10rh fold you test it. You then calculate the average prediction or classification of the ten test folds. This method is commonly called k-folds cross-validation. This process helps to stabilize the results of the final model. We will now look at how to do this using Python.

Data Preparation

We will develop a regression model using the PSID dataset. Our goal will be to predict earnings based on the other variables in the dataset. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

We now need to load the dataset PSID. When this is done, there are several things we also need to.

  • We have to drop all NA’s in the dataset
  • We also need to convert the “married” variable to a dummy variable.

Below  is the code for completing these steps

df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married

The code above loads the data while dropping the NAs. We then use the .loc function to make everyone who is not married a 0 and everyone who is married a 1. This variable is then converted to an integer using the .astype function. Lastly, we make a new variable called ‘marry’ and store our data there.

There is one other problem we need to address. In the ‘kids’ and the ‘educatn’ variable are values of 98 and 99. In the original survey, these responses meant that the person did not want to say how man kids or how much education they had or that they did not know. We  will remove these individuals from the sample using the code below.

df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)

The code above tells Python to remove in values greater than 90. With this We can now make are dataset that includes the independent variables and the dataset that contains the dependent variable.

X=df[['age','educatn','hours','kids','marry']]
y=df['earnings']

Model Development

We are now going to make several models and use the mean squared error as our way of comparing them. The first model will use all of the data. The second model will use the training data. The  third model will use cross-validation. Below is the code for the first model that uses all of the data,

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
138544429.96275884

For the second model, we first need to make our train and test sets. Then we will run our model.  The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=5)
regression.fit(X_train,y_train)
second_model=(mean_squared_error(y_true=y_train,y_pred=regression.predict(X_train)))
print(second_model)
148286805.4129756

You can see that the number are somewhat different. This is to be expected when dealing with different sample sizes. With cross validation using the full dataset we get results similar to the first model we developed. This is done through an instance of the KFold function. For KFold we want 10 folds, we want to shuffle the data, and set the seed.

The other function we need is the cross_val_score function. In this function, we set the type of model, the data we will use, the metric for evaluation, and the characteristics of the type of cross-validation. Once this is done we print the mean and standard deviation of the fold results. Below is the code.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
scores=cross_val_score(regression,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1)
print(len(scores),np.mean(np.abs(scores)),np.std(scores))
10 138817648.05153447 35451961.12217143

These numbers are closer to what is expected from the dataset. Despite the fact that we didn’t use all the data at the same time. You can also run these results on the training set as well for additional comparison.

Conclusion

This post provides an example of cross-validation in Python. The use of cross-validation helps to stabilize the results that ma come from your model. With increase stability comes increased confidence in your models ability to generalize to other datasets.

Artificial Intelligence in the Classroom

In 1990, a little known film called “Class of 1999” came out. In this movie, three military grade robots are placed in an inner-city war zone school to with the job of teaching.

As with any movie, things quickly get out of hand and the robot teachers begin killing the naughty students and eventually manipulating the local gangs into fighting and killing each other. Eventually, in something that can only happen in a movie, three military grade AI robots similar to the terminator are destroyed by a delinquent teenager

There has been a lot of hype and excitement over the growth of data science, machine learning, and artificial intelligence. With this growth, these ideas have begun to expend into supporting education. This has even led to speculation among some that algorithms and artificial intelligence could replace teachers in the classroom.

There are several reasons why this is. My reasons are listed as follows

  • People Need People
  • Computers need people
  • Computers assist people

People Need People

When it comes to education, people need people. Originally, education was often passed through apprenticeship for trades and one-on-one tutoring for elites. There has allows been some form of mass education but it has always involved people helping each other.

There are certain social-emotional needs that people have that cannot be satisfied by even the most life-like machine. When humans communicate they always convey some form of emotion even in the most harden computer like individual. Although AI is making strides in attempting to read, emotions they are far from convincingly portraying emotions. Besides, students want someone who can laugh, joke, smile, and do all those little things that involve being human. Even such mundane things as tripping over one’s shoes, or forgetting someone’s name add a human element to the learning experience.

Furthermore, even if a computer is able to share emotions in a human-like manner what child would really feel satisfaction from pleasing an Amazon Alexa? People need people and AI teachers cannot provide this even if they can provide top-level content.

Another concern is that people are highly unpredictable. Again, this relates to the emotional aspects of human nature. Even humans who have the same emotional characteristics are surprised by the behavior of fellow humans. When an algorithm is coldly calculating what is an appropriate action this inability to deal with unpredictable actions can be a problem.

A classic example of this is classroom management. If a student is not paying attention, or not doing their work, or showing defiance in one way or the other how would a computer handle this. In the movie “Class of 1999” the answer for disruptive behavior was to kill. Few parents and administrators would approve of such an action coming from an artificial neural network.

People need people in the context of education for the socio-emotional aspect of education as well as for the tribulation of classroom management. Computers are not humans and therefore they cannot provide the motivation or inspiration that so many students need to be successful in school.

Computers Need People

A second reason AI teachers are unlikely is because computers need people. Computers breakdown,  there are bugs in code, updates have to be made etc. All this precludes a machine going completely independent. With everything that can go wrong there has to be people there to monitor the use and interaction of machines with people.

Even in the movie “Class of 1999” there was a research professor and administrator monitoring the school. This continued until they were killed by the AI teachers.

With all the amazing advances in AI and machine learning it is still people who tweak the algorithms and provide the data for the machine to learn. After this is done, the algorithm is still monitored to see how it performs. Computers cannot escape their reliance on humans to maintain their functionality which implies that they cannot be turned loss in a classroom alone.

Computers Help People

The way going forward is that perhaps AI and other aspects of machine learning and data science can support teachers to be better teachers. For example, in some versions of Moodle there is an algorithm that will monitor students participation and will predict if students are at risk of failing. There is also an algorithm that predicts if a teacher is teaching. This is an excellent use of machine learning in that it deals with a routine task and simple flags a problem rather than trying to intervene it’s self.

Another useful application more in line with AI is through tutoring. Providing feedback on progress and adjusting what the student does based on their performance. Again, in a supporting role, AI can be excellent. The problem is when AI becomes the leader.

Conclusion

The advances in technology are going to continue. However, with the amazing breakthroughs in this field people still need interaction with other people and the example of others in a social context. Computers will never be able to provide this. Computers also need the constant support of humans in order to function. The proper role for AI and data science in education may be as a supporter to a teacher rather than the one leading and making criticaltaff decisions about other people.

Computational Thinking

Computational thinking is the process of expressing a problem in a way that a computer can solve. In general, there are four various ways that computational thinking can be done. These four ways are decomposition, pattern recognition, abstraction, and algorithmic thinking.

Although computational thinking is dealt with in the realm of computer science. Everyone thinks computationally at one time or another especially in school. Awareness of these subconscious strategies can help people to know how they think at times as well as to be aware of the various ways in which thinking is possible.

Decomposition

Decomposition is the process of breaking a large problem down into smaller and smaller parts or problems. The benefit of this is that by addressing all of the created little problems you can solve the large problem.

In education decomposition can show up in many ways. For teachers, they often have to break goals done into objectives, and sometimes down into procedures in a daily lesson plan. Seeing the big picture of the content students need and breaking it down into pieces that students can comprehend is critically to education such as with chunking.

For the student, decomposition involves breaking down the parts of a project such as writing a paper. The student has to determine what to do and how it helps to achieve the completion of their project.

Pattern Recognition

Pattern recognition has to refer to how various aspects of a problem have things in common. For a teacher, this may involve the development of a thematic unit. Developing such a unit requires the teacher to see what various subjects or disciplines have in common as they try to create the thematic unit.

For the student, pattern recognition can support the development of  shortcuts. Examples include seeing similarities in assignments that need to be completed and completing similar assignments together.

Abstraction

Abstraction  is the ability to remove irrelevant information from a problem. This is perhaps the most challenging form of thinking to develop because people often fall into the trap that everything is important.

For a teacher, abstractions involves teaching only the critical information that is in the content and not stressing the small stuff. This is not easy especially when the  teacher has a passion for their subject. This often blinds them to trying to share only the most relevant information about their field with their students.

For students, abstraction involves being able to share the most critical information. Students are guilty of the same problems as teachers in that they share everything when writing or presenting. Determining what is important requires the development of an opinion to judge the relevance of something. This is a skill that is hard to find among graduates.

Algorithmic Thinking

Algorithmic thinking is being able to develop a step-by-step plan to do something. For teachers, this happens everyday through planning activities and leading a class. Planning may be the most common form of thinking for the teacher.

For students, algorithmic thinking is somewhat more challenging. It is common for younger people to rely heavily on intuition to accomplish tasks. This means that they did something but they do not know how they did it.

Another common mistake for young people is doing things through brute force. Rather than planning, they will just keep pounding away until something works. As such, it is common for students to do things the “hard way” as the saying goes.

Conclusion

Computational thinking is really how humans think in many situations in which emotions are not the primary mover. As such, what is really happening is not that computers are thinking as much as they are trying to model how humans think. In education, there are several situations. In which computational thinking can be employed for success.

Mentoring New Teachers

A career in teaching is an attractive option for many young adults. One of the major challenges in a career in teaching is the student teaching experience that is normally placed at the end of the degree program. This post will provide some suggestion for teacher mentors

Go Over Local Expectations

Every school has its own set of policies and expectations that all employees need to adhere too. Often, the student teacher is not aware of these and it is the mentoring teacher’s responsibility to provide some idea of what is expected. This includes such things as showing them around the campus, communicating expectations for how to dress, discipline procedures, and even how to deal with grades.

Knowing these little things can allow the new teacher to focus on teaching rather than the administrative aspects of the classroom.

Provide Feedback

Feedback is critical so that the new teacher knows what they are doing well and wrong. It is, of course, important to mention what the student teacher does well. However, growth happens by providing support to overcome weaknesses.

The temptation for many supervising teachers is simply to mention what the problems are and let the student figure out what to do. This approach may work for an experience or a highly independent teacher. However, for most new teachers they need specific support on what to do in order to improve their teaching and overcome a weakness.

Therefore, criticism without some sort of suggestion for how to overcome the problem is not beneficial. In addition, it is important to only address major problems that can cripple the educational experience of the students rather than every single weakness in the students teaching. We all have issues and problems with our teaching and for beginners, only the big problems should be corrected.

The student also should provide feedback on how they view their own teaching. Most teacher education programs require this in the form of a journal. However, the benefit of the journal is only in discussing it with others such as the mentor teacher.

Lead By Example

IN reality, in order for a student to be a successful teacher, they need to see what successful teaching is so they can imitate until perfection. What this means for you as a supervising teacher is that you need to lay the example for the student to imitate. Everyone has there own style but a good example goes a long way in molding the teaching approach of a student.

This also means that a mentor teacher needs to do a lot of verbalizing in terms of what they do. Often, as an experienced teacher, things become automatic in the classroom. You know what to do without much thought or discussion. The problem is that if there is a lack of explanation in terms of wqhat is happening the student teacher is not able to deermine why you are doing certain things. Therefore, a mentor teacher must explained explictiylywhat they are doing and why while they are provding the exmple of teaching.

Conclusion

Students who dream of teaching need support in order to have success. This involves bringing in people with more experience to support these young teachers as they develop their skillset. This means that even experienced teachers need some support in order to determine how to help new teachers

3 Steps to Successful Research

When students have to conduct a research project they often struggle with determining what to do. There are many decisions that have to be made that can impede a student’s chances of achieving success.  However, there are ways to overcome this problem.

This post will essentially reduce the decision-making process for conducting research down to three main questions that need to be addressed. These questions are.

  • What do you Want to Know?
  • How do You Get the Answer?
  • What Does Your Answer Mean?

Answering these three questions makes it much easier to develop a sense of direction and scope in order to complete a project.

What do you Want to Know?

Often, students want to complete a project but it is unclear to them what they are trying to figure out. In other words, the students do not know what it is that they want to know. Therefore, one of the first steps in research is to determine exactly it is you want to know.

Understanding what you want to know will allow you to develop a problem as well as research questions to facilitate your ability to understand exactly what it is that you are looking for. Research always begins with a problem and questions about the problem and this is simply another way of stating what it is that you want to know.

How do You Get the Answer?

Once it is clear what it is that you want to know it is critical that you develop a process for determining how you will obtain the answers. It is often difficult for students to develop a systematic way in which to answer questions. However, in a research paradigm, a scientific way of addressing questions is critical.

When you are determining how to get answers to what you want to know this is essential the development of your methodology section. This section includes such matters as the research design, sample, ethics, data analysis, etc. The purpose here is again to explain the way to get the answer(s).

What Does Your Answer Mean?

After you actually get the answer you have to explain what it means. Many students fall into the trap of doing something without understanding why or determining the relevance of the outcome. However, a research project requires some sort of interpretation or explanation of the results. Just getting the answer is not enough it is the meaning that holds the power.

Often, the answers to the research questions are found in the results section of a paper and the meaning is found in the discussion and conclusion section. In the discussion section, you explain the major findings with interpretation, sare recommendations, and provide a conclusion. This requires thought into the usefulness of what you wanted to know. In other words, you are explaining why someone else should care about your work. This is much harder to do than many realize.

Conclusion

Research is challenging but if you keep in mind these three keys it will help you to see the big picture of research and o focus on the goals of your study and not so much on the tiny details that encompasses the processes.

Undergrad and Grad Students

In this post,  we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill sears initially and quality is often compromised.

Focus

At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require rather an intense specialization from the beginning with a minimum of general education while others take a more American style approach with a wide exposure to various fields.

Commitment

Graduate students are usually older. This means that they require less institution sponsored social activities and may not socialize at all. In addition, some graduate students are married which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family they are also going to struggle due to the time commitment of their loved ones.

Assuming that an undergraduate student is a traditional one they will tend to be straight from high school, require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are often missing for academic success.

For the teacher, graduate students offer higher motivation and commitment generally when compared to undergrads. This is reasonable as people often feel compelled to complete a bachelors but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.

Conclusion

Despite the differences but types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Using Blended Learning in Your Classroom

Blended learning is becoming a reality in education. Many schools now require some sort of online presences not online of the school but also for individual classes that teachers teach. This has led to much more pressure for teachers to figure out some sort of way to get content online to support studnets. This post will take a look at the pros and cons of blended learning and provide tips on how to approach the use of blended learning.

Pros and Cons

Blended learning gives you flexibility. You are not tied to either traditional or elearning completely. This allows you to find the right balance for your teaching style and the students learning. Some teachers want more online presence in the form of activities and submission of assignments. Others just want a centralized place for communicating with their students and tracking academic progress. Whatever works for you can probably be accommodated when employing blended learning.

Communication and documentation is another benefit of blended learning. Announcements and messaging can be handled by the online platform and these forms of transactions are usually logged by the system and saved. This can be useful for referencing in the future if confusion arises.

The drawback to the flexibility is actually the flexibility. When employing blended learning a teacher has to be proficient in both e-learning and traditional teaching. In other words, you have to become a jack of all trades. Without strength in both methodologies, it will be difficult to determine what you want to do online and to determine how the online experiences augment or replace in-class learning opportunities.

Another problem is the confusion over what is done in class and what is done online. When learning takes place in two mediums it increases opportunities for misunderstanding and miscommunication. I have frequently had students confused over what was to be submitted online vs in class no matter how clear I was in the course outline and calendar.

Success in a Blended Learning Context

To have success using blended learning involves doing some of the following.

  • Focus on using the online platform less for learning and more for communication when you initially begin using blended learning. There is a lot for you to learn as the teacher and trying to move everything online will lead to confusion for you and the students.
  • Consider having students submit the final version of assignments online. Final versions of assignments usually require the least amount of feedback because they have already been vetted by you in person. This will allow you to focus on the grade rather than on providing more support.
  • Online activities should support learning and probably not replace it. Blended learning is often more effective if it helps students to understand content in class rather than replace it. This means the blended learning platform is a study tool to scaffold students learning outside of class. If an assignment can be completely done online without having to go to class then perhaps this is no longer blended learning since the in-class part is not needed for support.
  • PLanning is critical to using any web-based resource. Websites are designed in advance before they are set up. Even post a blog requires you to develop a draft or two. Therefore, online activities and expectations must be planned well in advance and not just thrown online at the whim of the teacher throughout the semester. Many teachers fall into the trap of just making stuff up as they go. This is a poor methodology in a traditional classroom and a disaster in a blended learning context.
  • When in doubt go traditional. If you are unsure how to achieve a specific learning goal online it is better to stick to a traditional approach until you can figure it out. In-class teaching is old but it still has a place in the 21st century especially when it is unclear how to do it online.

Conclusion

Blended learning can be a powerful tool for helping students are a major headache that annoys everyone. The secret to success lies with the teacher who understands what they want from the online aspect of the students learning as well as what they want in the classroom. When this is clear is it critical that the teacher determine how to meet these goals through the use of various learning experiences.

Elearning Academic Success

Studying online has become almost an expectation now. Even if you never earn a degree or take a class for credit online there are still many opportunities to train and develop skills over the internet. The role of the teacher is to try and find ways to engage and support their students as they begin their learning experience physically alone with support perhaps thousands of miles away.

In this post, we will look at ways to encourage the academic success of students while studying online. Two ways to support academic success in elearning involve providing feedback and encouraging engagement.

Provide Feedback

Feedback is critical in every aspect of teaching. However, in elearning, it is even more important. This is because the students have no face-to-face communication with you so they have no idea how they are doing beyond a letter grade. In addition, there is no body language to examined or other paralinguistic features that the student can infer meaning from.

Giving feedback requires timeliness. In other words, mark assignments quickly and indicate progress. In addition, if students do not meet expectations it is critical that you point them towards resources that will help them to inderstand. For example, students seem to neglect reading rubrics. When a student gets feedback from a rubric they can see where they were not succesfful.

In terms of a more formative feedback approach, there may be times where it is beneficial to live stream lecture. This allows the students to chime in whenever they do not understand an idea or point. Furthermore, the teacher can ask a question or two of the students and get feedback from them.

Engage Them

Engaging is almost synonymous with active. In other words, students should be doing something in order to learn. Unfortunately, listening is a passive activity which implies that lecturing is not the best way to inspire learning.

In the contest of elearning, one of the ways to inspire active learning is to have the students go out and do something in the real world and report what happens online in the form of a reflection. For example, students studying English will go out and teach English in the real world. They will then come and share their experience. The teacher is then able to provide insights and feedback to improve the students teaching. This provides a connection to the real world as well as a sense of relevance

In a more abstract subject, such as history, music theory, or engineering, students can become active through sharing these insights with laymen or explaining how they are already applying this information at their job or in the home. The goal of using provides the purpose for learning the content.

Conclusion

Feedback and engagement are critical to success in a situation in which the student is primarily learning alone which is found in the context of elearning.

Scatter Plots in Python

Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.

We will be using the “Prestige” dataset form the pydataset module to look at scatterplot use. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df=data('Prestige')

We will begin by making a correlation matrix. this will help us to determine which pairs of variables have strong relationships with each other. This will be done with the .corr() function. below is the code

1.png

You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.

The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.

1.png

The code should be self-explanatory. THe only thing that might be unknown is the fit_reg argument. This is set to False so that the function does not make a regression line. Below is the same visual but this time with the regression line.

facet = sns.lmplot(data=df, x='education', y='income',fit_reg=True)

1

It is also possible to add a third variable to our plot. One of the more common ways is through including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this we use the same .lmplot.() function but include several additional arguments. These include the hue and the indication of a legend. Below is the code and output.

1.png

You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.

1.png

As you can see, we can conclude that job type influences both education and income in this example.

Conclusion

This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analyst should be familiar with as it can be used to communicate information to people who must make decisions.

Data Visualization in Python

In this post, we will look at how to set up various features of a graph in Python. The fine tune tweaks that can be made when creating a data visualization can be enhanced the communication of results with an audience. This will all be done using the matplotlib module available for python. Our objectives are as follows

  • Make a graph with  two lines
  • Set the tick marks
  • Change the linewidth
  • Change the line color
  • Change the shape of the line
  • Add a label to each axes
  • Annotate the graph
  • Add a legend and title

We will use two variables from the “toothpaste” dataset from the pydataset module for this demonstration. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
DF = data('toothpaste')

Make Graph with Two Lines

To make a plot you use the .plot() function. Inside the parentheses you out the dataframe and variable you want. If you want more than one line or graph you use the .plot() function several times. Below is the code for making a graph with two line plots using variables from the toothpaste dataset.

plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

1

To get the graph above you must run both lines of code simultaneously. Otherwise, you will get two separate graphs.

Set Tick Marks

Setting the tick marks requires the use of the .axes() function. However, it is common to save this function in a variable called axes as a handle. This makes coding easier. Once this is done you can use the .set_xticks() function for the x-axes and .set_yticks() for the y axes. In our example below, we are setting the tick marks for the odd numbers only. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

1

Changing the Line Type

It is also possible to change the line type and width. There are several options for the line type. The important thing here is to put this information after the data you want to plot inside the code. Line width is changed with an argument that has the same name. Below is the code and visual

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'--',linewidth=3)
plt.plot(DF['sdA'],':',linewidth=3)

1

Changing the Line Color

It is also possible to change the line color. There are several options available. The important thing is that the argument for the line color goes inside the same parentheses as the line type. Below is the code. r means red and k means black.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'r--',linewidth=3)
plt.plot(DF['sdA'],'k:',linewidth=3)

1

Change the Point Type

Changing the point type requires more code inside the same quotation marks where the line color and line type are. Again there are several choices here. The code is below

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

1

Add X and Y Labels

Adding LAbels is simple. You just use the .xlabel() function or .ylabel() function. Inside the parentheses, you put the text you want in quotation marks. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

1

Adding Annotation, Legend, and Title

Annotation allows you to write text directly inside the plot wherever you want. This involves the use of the .annotate function. Inside this function, you must indicate the location of the text and the actual text you want added to the plot. For our example, we will add the word ‘python’ to the plot for fun.

The .legend() function allows you to give a description of the line types that you have included. Lastly, the .title() function allows you to add a title. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.annotate(xy=[3,4],text='Python')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)
plt.legend(['1st','2nd'])
plt.title("Plot Example")

1.png

Conclusion

Now you have a practical understanding of how you can communicate information visually with matplotlib in python. This is barely scratching the surface in terms of the potential that is available.

Critical Thinking and Problem Solving

There have been concerns for years that critical thinking and problem-solving skills are in decline not only among students but also the general public. On the surface, this appears to be true. However, throughout human history, the average person was not much of a deep thinker but rather a doer. How much time can you spend on thinking for the sake of thinking when you are dealing with famine, war, and disease? This internal focus vs external focus is one of the differences between critical thinking and problem-solving.

Critical Thinking

There is no agreed-upon definition of critical thinking. This makes sense as any agreement would indicate a lack of critical thinking. In general, critical thinking is about questioning and testing the claims and statements made through external evidence as well as internal thought. Critical thinking is the ability to know what you don’t know and seek answers through finding information. To test and assert claims means taking time to develop them which is a lonely process many times

Thinking for the sake of thinking is a somewhat introverted process. There are few people who want to sit and ponder in the fast-paced 21st century.  This is one reason why it appears that critical thinking is in decline. It’s not that people are incapable of thinking critical they would just rather not do it and seek a quick answer and or entertainment. Critical thinking is just too slow for many people.

Whenever I give my students any form of opened assignment that requires them to develop an argument I am usually shocked by the superficial nature of the work. Having thought about this I have come to the conclusion that the students lacked the mental energy to demonstrate the critical thought needed to write a paper or even to share their opinion about something a little deeper then facebook videos.

Problem Solving

Problem-solving is about getting stuff done. When an obstacle is in the way a problem solver finds a way around. Problem-solving is focused often on tangible things and objects in a practical way. Therefore, problem-solving is more of an extroverted experience. It is common and easy to solve problems with friends gregariously. However, thinking critically is somewhat more difficult to do in groups and cannot move as fast as we want we discussing.

Due to the potential of working in groups and the fast pace that it can take, problem-solving skills are in better shape than critical thinking skills. This is because when people work in groups several superficial ideas can be combined to overcome a problem. This groupthink if you will allow for success even though the individual members are probably not the brightest.

Problem-solving has been the focus of mankind for most of their existence. Please keep in mind that for most of human history people could not even read and write. Instead, they were farmers and soldiers concern with food and protecting their civilization from invasion. These problems led to amazing discoveries for the sake of providing life and not for the sake of thinking for the sake of thinking or questioning for the sake of objection.

Overlap

There is some overlap in critical thinking and problem-solving. Solutions to problems have to be critically evaluated. However, often a potential solution is voted good or bad by whether it works or not which requires observation and not in-depth thinking. The goal for problem-solving is always “does this solve the problem” or “does this solve the problem better”. These are important criteria but critical thinking involves much broader and deeper issues than just “does this work.” Critical thinking is on a quest for truth and satisfying curiosity. These are ideas that problem-solvers struggle to appreciate

The world is mostly focused on people who can solve problems and not necessarily on deep thinkers who can ponder complex ideas alone. As such, perhaps critical thinking was a fad that has ceased to be relevant as problem solvers do not see how critical thinking solves problems. Both forms of thought are needed and they do overlap yet most of the world simply wants to know what the answer is to their problem rather than to think deeply about why they have a problem in the first place.

Random Forest Regression in Python

Random forest is simply the making of dozens if not thousands of decision trees. The decision each tree makes about an example are then tallied for the purpose of voting with the classification that receives the most votes winning. For regression, the results of the trees are averaged in  order to give the most accurate results

In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below

df = data('cancer')
df=df.dropna()
X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['age']

Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestRegressor(n_estimators=100,random_state=1)

We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.

h.fit(x_train,y_train)
y_pred=h.predict(x_test)
mean_squared_error(y_test,y_pred)
71.75780196078432

The MSE of 71.75 is only useful for model comparison and has little meaning by its self. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.

model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False) 
ax=model_ranks.plot(kind='barh')

1

As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.

Conclusison

This post provided an example of the use of regression with random forest. Through the use of ensemble voting, you can improve the accuracy of your models. This is a distinct power that is not available with other machine learning algorithm.

Bagging Classification with Python

Bootstrap aggregation aka bagging is a technique used in machine learning that relies on resampling from the sample and running multiple models from the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using Decisions trees, bagging would have you run the model several times with several different subsamples to help deal with variance in statistics.

Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.

We will use the turnout dataset available in the pydataset module. Below is some initial code.

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

We will load our dataset. Then we will separate the independnet and dependent variables from each other and create our train and test sets. The code is below.

df=data("turnout")
X=df[['age','educate','income',]]
y=df['vote']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows

 h=BaggingClassifier(KNeighborsClassifier(n_neighbors=7),max_samples=0.7,max_features=0.7,n_estimators=1000)

Basically, what we told python was to use up to 70% of the samples, 70% of the features, and make 100 different KNN models that use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classificarion_reoirts function.

h.fit(X_train,y_train)
y_pred=h.predict(X_test)
print(classification_report(y_test,y_pred))

1

This looks oka below are the results when you do a traditional model without bagging

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

1

The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the reight context such as for large companies such as Google who deal with billions of people per day.

Conclusion

This post provides an example of the use of bagging in the context of classification. Bagging provides a why to improve your model through the use of resampling.

K Nearest Neighbor Classification with Python

K Nearest Neighbor uses the idea of proximity to predict class. What this means is that with KNN Python will look at K neighbors to determine what the unknown examples class should be. It is your job to determine the K or number of neighbors that should be used to determine the unlabeled examples class.

KNN is great for a small dataset. However, it normally does not scale well when the dataset gets larger and larger. As such, unless you have an exceptionally powerful computer KNN is probably not going to do well in a Big Data context.

In this post, we will go through an example of the use of KNN with the turnout dataset from the pydataset module. We want to predict whether someone voted or not based on the independent variables. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

We now need to load the data and separate the independent variables from the dependent variable by making two datasets.

df=data("turnout")
X=df[['age','educate','income']]
y=df['vote']

Next, we will make our train and test sets with a 70/30 split. The random.state is set to 0. This argument allows us to reproduce our model if we want. After this, we will run our model. We will set the K to 7 for our model  and run the model. This means that Python will look at the 7 closes examples to predict the value of an unknown example. below is the code

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)

We can now predict with our model and see the results with the classification reports function.

y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

1

The results are shown above. To determine the quality of the model relies more on domain knowledge. What we can say for now is that the model is better at classifying people who vote rather than people who do not vote.

Conclusion

This post shows you how to work with Python when using KNN. This algorithm is useful in using neighboring examples tot predict the class of an unknown example.