Author Archives: Dr. Darrin

Differences in Thinking

Critical thinkers and problem solvers are two groups of people.  Sadly, these two groups are almost mutually exclusive. However, it is important that thinkers and solvers develop both skillsets to a certain level of competence.

The purpose of this post is to explain in detail how critical thinking differs from problem-solving in terms of individual differences.

Thinking is a slow, deliberate process that takes effort to do. In other words, a person must decide to think. Since there is a requirement of active effort, thinking is something that few people value and appreciate as they should.

Thinking involves processing information from the viewpoint of central processing. This means to examine the content of a message for its worth. Furthermore, when a person is developing their own arguments thinking involves developing support for one’s position. Often when people argue or disagree today they tend to get upset. This is an indication that their emotions are determining their position rather than their mind. They might use their mind on occasion to strengthen their argument but the foundation of their position is often emotional rather than based on strong thought.

 

Developing the mind usually involves reading. Reading exposes an individual to good and poor examples of thinking. From these examples, an individual thinks about the strengths and merits of each. This process of thinking about other people’s thoughts helps a person to develop their own opinion. When an opinion is formed, it can be shared with others who are then able to judge for themselves the merit of the person’s opinion.

 

This process of thinking is not often required for academic studies. The focus has moved more towards problem-solving. Problem-solving is an excellent form of thinking when the end goal is binary in nature. This means that when a person works on a problem, they either solve it or they do not.

 

Critical thinking involves a certain fuzziness that problem-solving lacks. For example, whether a speech or paper is good or bad involves critical thinking because judging quality is inherently fuzzy. This sense of a shade of gray would make solving problems difficult at the least.

 

However, if you are called to determine why a computer does not connect to the internet, this is problem-solving. The goal is to get back on the internet. You have to think, but the desired outcome is clear. Once the computer is back on the internet, there is nothing to think about. In most cases, particularly with non-techie people, how you get back on the internet does not even matter. In other words, the “why does this work” is often something that problem solvers do not care about, but this is exactly the type of thing critical thinking has to be able to explain when developing an argument.

 

Problem-solving involves action and not as much contemplation. The focus is on experience and not theory. It is not that problem-solvers never read and contemplate, rather, they learn primarily through doing. Examples include trial and error. 

Most companies want problem solvers and not necessarily critical thinkers. In other words, businesses want things done. They do not want people going around and questioning unless this helps to solve a problem.  Companies claim to want thinking but what they really want are people who think how to solve the company’s problems. Questioning the company is not one of the wiser things to do.

The fuzziness of critical thinking frustrates problem-solvers who want to solve problems and not simply talk. This is not a negative thing but rather a difference in personality. The problem is that problem solvers and critical thinkers do not see this as a matter of difference but as a matter of ignorance on one hand and irrelevance on the other. “Thinkers think and problem solvers do” is a common description of both sides.

 

Conclusion

Critical thinking and problem-solving are two skills that everyone needs. To focus on either to the exclusion of the other is detrimental. A combination of thought and action creates a balanced individual who is able to get things done while still having a depth of thought to support their actions.

 


Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing the coefficients of variables, either by removing them or by reducing their impact. Ridge regression shrinks the coefficients of problematic variables close to zero but never fully removes them.
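To make the shrinking behavior concrete, here is a minimal sketch using a single predictor and no intercept. In that special case, the ridge slope has the closed form sum(x*y) / (sum(x*x) + alpha). The toy data and the ridge_slope helper are made up for illustration and are not part of the example in this post.

```python
# Toy data (made up for illustration): y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def ridge_slope(x, y, alpha):
    """Slope of a one-predictor, no-intercept ridge fit.

    Minimizing sum((y - b*x)**2) + alpha * b**2 over b gives
    b = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# A larger alpha means a stronger penalty and a smaller slope.
slopes = {alpha: ridge_slope(x, y, alpha) for alpha in [0, 1, 10, 100, 1000]}
for alpha, slope in slopes.items():
    print(f"alpha={alpha:>5}: slope={slope:.4f}")
```

With alpha set to 0 this is ordinary least squares. As alpha grows, the slope keeps shrinking toward zero but does not reach it for any finite alpha.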

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps.

  1. Data preparation
  2. Baseline model development
  3. Ridge regression model

Below is the initial code.

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Data Preparation

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

df=pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y=df['lnhhexp']

We can now create our baseline regression model.

Baseline Model

The metric we are using is the mean squared error. Below is the code and output for our baseline regression model, which has no regularization.

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
0.35528915032173053

This value of 0.355289 will be our benchmark for determining whether the regularized ridge regression model is superior or not.

Ridge Model

In order to create our ridge model, we first need to determine the most appropriate strength for the L2 regularization. In scikit-learn, the hyperparameter that controls this penalty in ridge regression is called alpha. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first create our ridge model and indicate normalization in order to get better estimates. Next, we set up the grid that we will use. Below is the code.

ridge=Ridge(normalize=True)
search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
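The alpha grid in the call above comes from np.logspace(-5,2,8), which returns 8 values spaced evenly on a log scale between 10^-5 and 10^2. The sketch below reproduces that spacing in plain Python so the candidate values are easy to see. The logspace helper is a stand-in written for illustration, not NumPy's implementation.

```python
def logspace(start, stop, num):
    """Return num powers of 10 with exponents evenly spaced from start to stop."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]


# Same arguments as np.logspace(-5, 2, 8) in the grid above.
alphas = logspace(-5, 2, 8)
print(alphas)
```

Each candidate is ten times larger than the one before it, which is a common way to scan a penalty strength when the right order of magnitude is unknown.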

The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. The np.logspace call defines the range of values we want to test: 8 values from 10^-5 to 10^2, evenly spaced on a log scale. Our metric is the mean squared error. Setting refit to True means that the best model is refit on the full dataset once the search is done, and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_score_ attributes to determine the model's strength. Below is the code.

search.fit(X,y)
search.best_params_
{'alpha': 0.01}
abs(search.best_score_)
0.3801489007094425

The best_params_ tells us what to set alpha to, which in this case is 0.01. The best_score_ tells us what the best cross-validated mean squared error is. In this case, the value of 0.38 is worse than what the baseline model was. We can confirm this by fitting our model with the ridge information and finding the mean squared error. This is done below.

ridge=Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)
0.35529321992606566

The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
{'pharvis': 0.013282050886950674,
'age': 0.06480086550467873,
'sex': 0.004012412278795848,
'married': -0.08739614349708981,
'educ': 0.075276463838362,
'illness': -0.06180921300600292,
'injury': 0.040870384578962596,
'illdays': -0.002763768716569026,
'actdays': -0.006717063310893158,
'insurance': 0.1468784364977112}


coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,X.columns):
    coef_dict_ridge[feat] = coef
coef_dict_ridge
{'pharvis': 0.012881937698185289,
'age': 0.06335455237380987,
'sex': 0.003896623321297935,
'married': -0.0846541637961565,
'educ': 0.07451889604357693,
'illness': -0.06098723778992694,
'injury': 0.039430607922053884,
'illdays': -0.002779341753010467,
'actdays': -0.006551280792122459,
'insurance': 0.14663287713359757}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.
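Recent scikit-learn releases no longer accept the normalize argument used above (it was deprecated in version 1.0 and removed in 1.2). A minimal sketch of one common alternative is to scale the features inside a pipeline. This sketch assumes synthetic stand-in data rather than VietNamI, and StandardScaler is not numerically identical to the old normalize=True behavior, so the best alpha and the scores would not match the output above exactly. The names X_demo, y_demo, pipe, param_grid, and search_demo are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the VietNamI dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=200)

# Scaling lives inside the pipeline, so every cross-validation fold
# is scaled using only its own training portion.
pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name,
# so the ridge penalty is addressed as "ridge__alpha".
param_grid = {"ridge__alpha": np.logspace(-5, 2, 8)}

search_demo = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=10,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(abs(search_demo.best_score_))
```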

Conclusion

Ridge regression allows you to penalize variables based on their usefulness in developing the model. With this form of regularized regression, the coefficients of the variables are never set to zero. Other forms of regularized regression allow for the total removal of variables. One example of this is lasso regression.

Undergrad and Grad Students

In this post, we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill seats initially and quality is often compromised.

Focus

At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require a rather intense specialization from the beginning with a minimum of general education, while others take a more American-style approach with a wide exposure to various fields.

Commitment

Graduate students are usually older. This means that they require less institution sponsored social activities and may not socialize at all. In addition, some graduate students are married which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family they are also going to struggle due to the time commitment of their loved ones.

Assuming that an undergraduate student is a traditional one they will tend to be straight from high school, require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are often missing for academic success.

For the teacher, graduate students generally offer higher motivation and commitment when compared to undergrads. This is reasonable, as people often feel compelled to complete a bachelor's degree but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.

Conclusion

Despite the differences, both types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Hyperparameter Tuning in Python

Hyperparameters are settings that you must choose yourself when developing a model. This is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.

Algorithms cannot determine hyperparameters themselves, which is why you have to do it. The problem is that the typical person has no idea what an optimal choice for a hyperparameter is. To deal with this confusion, a range of values is often supplied, and then it is left to Python to determine which combination of hyperparameters is most appropriate.

In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and not married based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Baseline model (for comparison)
  3. Grid development
  4. Revised model

Below is some initial code that includes all the libraries and classes that we need.

import pandas as pd
import numpy as np
from pydataset import data
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Data Preparation

The dataset PSID has several problems that we need to address.

  • We need to remove all NAs
  • The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
  • The educatn and kids variables have codes that are 98 and 99. These need to be removed because they do not make sense.

Below is the code that deals with all of this.

df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)

  1. Line 1 loads the dataset and drops the NAs.
  2. Lines 2-5 create our dummy variable for marriage. We create a new variable called marry to hold the results.
  3. Lines 6-7 drop the values in kids and educatn that are above 90.

Below we create our X and y datasets and then are ready to make our baseline model.

X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']

Baseline Model

The purpose of the baseline model is to see how much better or worse the model performs after hyperparameter tuning. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.

  1. number of neighbors
  2. weight of neighbors
  3. metric for measuring distance
  4. power parameter for minkowski
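The last two hyperparameters are linked. The Minkowski distance takes a power parameter p, and setting p to 1 gives the Manhattan distance while setting p to 2 gives the Euclidean distance. Here is a minimal sketch of that relationship in plain Python. The minkowski helper and the two sample points are made up for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance between two points of equal length."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


# Two made-up points on a plane.
point_a, point_b = (0, 0), (3, 4)

manhattan_dist = minkowski(point_a, point_b, 1)  # |3| + |4|
euclidean_dist = minkowski(point_a, point_b, 2)  # sqrt(3**2 + 4**2)

print(manhattan_dist)
print(euclidean_dist)
```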

Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.

classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform', metric='minkowski',p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6188104238047426

Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance.

Grid Development

The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run.

grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96

You can see we made a simple dictionary that has several values for each hyperparameter.

  1. Number of neighbors can be 1 to 12
  2. weight of neighbors can be uniform or distance
  3. metric can be manhattan or minkowski
  4. p can be 1 or 2
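To see where the count comes from, the grid can be expanded by hand. The sketch below uses itertools.product from the standard library to list every combination. The variable name combos is made up for illustration. Note that range(1, 13) yields the values 1 through 12.

```python
from itertools import product

# Same grid as in the post.
grid = {
    "n_neighbors": range(1, 13),
    "weights": ["uniform", "distance"],
    "metric": ["manhattan", "minkowski"],
    "p": [1, 2],
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combos))        # number of candidate settings
print(combos[0])          # the first candidate
print(len(combos) * 10)   # model fits needed under 10-fold cross-validation
```

Each candidate is evaluated once per fold, so a 10-fold search over this grid fits ten models per candidate before the best one is refit on all of the data.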

We will develop 96 models altogether. Below is the code to begin tuning the hyperparameters.

search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)

The estimator is the classifier object for the algorithm we are using, which we set up earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process. Setting refit to True retrains the best model on the full dataset once the search is done, and cv is the number of cross-validation folds. The search.fit command runs the search.

The code below provides the output for the results.

print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667

The best_params_ attribute tells us what the most appropriate parameters are. The best_score_ attribute tells us what the accuracy of the model is with the best parameters. Our model's accuracy improves from about 62% to 65% from adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.

Model Revision

Below is the code for the revised model.

classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6503909993913031

Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as in a data science competition.

Conclusion

Tuning hyperparameters is one of the final pieces to improving a model. With this tool, small, gradual changes can be seen in a model. It is important to keep in mind this aspect of model development in order to get the best final result.

Variable Selection in Python

A key concept in machine learning and data science in general is variable selection. Sometimes, a dataset can have hundreds of variables to include in a model. The benefit of variable selection is that it reduces the amount of useless information aka noise in the model. By removing noise it can improve the learning process and help to stabilize the estimates.

In this post, we will look at two ways to do this. These two common approaches are the univariate approach and the greedy approach. The univariate approach selects variables that are most related to the dependent variable based on a metric. The greedy approach will only remove a variable if getting rid of it does not affect the model’s performance.

We will now move to our first example, which is the univariate approach using Python. We will use the VietNamH dataset from the pydataset library. Our goal is to predict how much a family spends on medical expenses. Below is the initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression
df=data('VietNamH').dropna()

Our data is called df. If you use the head function, you will see that we need to convert several variables to dummy variables. Below is the code for doing this.

df.loc[df.sex== 'female', 'sex'] = 0
df.loc[df.sex== 'male','sex'] = 1
df.loc[df.farm== 'no', 'farm'] = 0
df.loc[df.farm== 'yes','farm'] = 1
df.loc[df.urban== 'no', 'urban'] = 0
df.loc[df.urban== 'yes','urban'] = 1

We now need to set up our X and y datasets as shown below.

X=df[['age','educyr','sex','hhsize','farm','urban','lnrlfood']]
y=df['lnmed']

We are now ready to actually use the univariate approach. This involves two different tools from scikit-learn. The SelectPercentile class keeps only the variables whose scores fall in a chosen top percentile, such as the top 25%. The f_regression function is designed for scoring a variable’s performance in the context of regression. Below is the code to run the analysis.

selector_f=SelectPercentile(f_regression,percentile=25)
selector_f.fit(X,y)

We can now see the results using a for loop. We want the scores from our selector_f object. To do this, we set up a for loop and use the zip function to iterate over the data. The output is placed in the print statement. Below is the code and output for this.

for n,s in zip(X,selector_f.scores_):
    print('F-score: %3.2f\t for feature %s ' % (s,n))
F-score: 62.42 for feature age
F-score: 33.86 for feature educyr
F-score: 3.17 for feature sex
F-score: 106.35 for feature hhsize
F-score: 14.82 for feature farm
F-score: 5.95 for feature urban
F-score: 97.77 for feature lnrlfood

You can see the f-score for all of the independent variables. You can decide for yourself which to include.
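One simple way to decide is to rank the variables by their F-score and keep the strongest few. The sketch below does this in plain Python using the scores printed above. The cutoff top_k is a hypothetical choice made for illustration; with seven candidates, a 25% cutoff corresponds to roughly the top two.

```python
# F-scores copied from the output above.
f_scores = {
    "age": 62.42,
    "educyr": 33.86,
    "sex": 3.17,
    "hhsize": 106.35,
    "farm": 14.82,
    "urban": 5.95,
    "lnrlfood": 97.77,
}

# Sort from the strongest to the weakest relationship with the target.
ranked = sorted(f_scores.items(), key=lambda item: item[1], reverse=True)

top_k = 2  # hypothetical cutoff, not something the analysis dictates
selected = [name for name, _ in ranked[:top_k]]

for name, score in ranked:
    print(f"{name:<10}{score:>8.2f}")
print(selected)
```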

Greedy Approach

The greedy approach only removes variables if they do not impact model performance. We are using the same dataset, so all we have to do is run the code. We need the RFECV class from the feature_selection module. We then create an RFECV object and set the estimator, cross-validation, and scoring metric. Finally, we run the analysis and print the results. The code is below with the output.

from sklearn.feature_selection import RFECV
regression=LinearRegression()  # the estimator used to judge each set of variables
select=RFECV(estimator=regression,cv=10,scoring='neg_mean_squared_error')
select.fit(X,y)
print(select.n_features_)
7

The number 7 represents how many independent variables to include in the model. Since we only had 7 total variables we should include all variables in the model.
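The greedy idea itself can be sketched in a few lines. The code below is a simplified backward elimination written for illustration, not RFECV's actual internals, which rank variables by coefficient size or importance and use cross-validation to pick how many to keep. The usefulness values, the toy_score function, and the backward_eliminate helper are all hypothetical.

```python
def backward_eliminate(features, score, tolerance=0.0):
    """Greedily drop one feature at a time while the score does not get worse."""
    kept = list(features)
    best = score(kept)
    dropped_one = True
    while dropped_one and len(kept) > 1:
        dropped_one = False
        for feature in list(kept):
            trial = [f for f in kept if f != feature]
            trial_score = score(trial)
            # Keep the smaller set only if performance holds up.
            if trial_score >= best - tolerance:
                kept, best = trial, trial_score
                dropped_one = True
                break
    return kept, best


# Hypothetical contribution of each variable to model quality.
usefulness = {"a": 1.0, "b": 0.0, "c": 0.5, "d": 0.0}


def toy_score(subset):
    """Stand-in for a cross-validated score (higher is better)."""
    return sum(usefulness[f] for f in subset)


kept, best = backward_eliminate(list(usefulness), toy_score)
print(kept, best)
```

In a real analysis, toy_score would be replaced by something like a cross-validated error for a model fit on the given subset of variables.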

Conclusion

With the help of the univariate and greedy approaches, it is possible to deal with a large number of variables efficiently when developing models. The examples here involve only a handful of variables. However, bear in mind that the approaches mentioned here are highly scalable and useful.

Cross-Validation in Python

A common problem in machine learning is data quality. In other words, if the data is bad, the model will be bad even if it is designed using best practices. Below is a short list of some possible problems with data.

  • Sample size is too small-Hurts all algorithms
  • Sample size too big-Hurts complex algorithms
  • Wrong data-Hurts all algorithms
  • Too many variables-Hurts complex algorithms

Naturally, this list is not exhaustive. Whenever some of the above situations take place, it can lead to a model that has bias or variance. Bias takes place when the model systematically over- or underestimates values. This is common in regression when the relationship among the variables is not linear. The straight line that is developed by the model works sometimes but is often erroneous.

Variance is when the model is too sensitive to the characteristics of the training data. This means that the model develops a complex way to classify or perform regression that does not generalize to other datasets.

One solution to addressing these problems is the use of cross-validation. Cross-validation involves dividing the training set into several folds. For example, you may divide the data into 10 folds. With 9 folds you train the model and with the 10th fold you test it. You repeat this so that each fold serves as the test fold once, and then calculate the average prediction or classification performance across the ten test folds. This method is commonly called k-fold cross-validation. This process helps to stabilize the results of the final model. We will now look at how to do this using Python.
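The splitting step described above can be sketched in plain Python. This is a simplified illustration of the idea, not scikit-learn's implementation, and the k_fold_indices helper is made up for this example.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=1):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 20 samples split into 10 folds: 18 for training and 2 for testing each time.
splits = list(k_fold_indices(n_samples=20, n_splits=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), sorted(test_idx))
```

A model would be trained on each train list and scored on the matching test list, and the ten scores would then be averaged.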

Data Preparation

We will develop a regression model using the PSID dataset. Our goal will be to predict earnings based on the other variables in the dataset. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

We now need to load the dataset PSID. When this is done, there are several things we also need to do.

  • We have to drop all NA’s in the dataset
  • We also need to convert the “married” variable to a dummy variable.

Below  is the code for completing these steps

df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married

The code above loads the data while dropping the NAs. We then use the .loc function to make everyone who is not married a 0 and everyone who is married a 1. This variable is then converted to an integer using the .astype function. Lastly, we make a new variable called ‘marry’ and store our data there.

There is one other problem we need to address. In the ‘kids’ and the ‘educatn’ variables, there are values of 98 and 99. In the original survey, these responses meant that the person did not want to say how many kids or how much education they had, or that they did not know. We will remove these individuals from the sample using the code below.

df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)

The code above tells Python to remove any values greater than 90. With this done, we can now make our dataset that includes the independent variables and the dataset that contains the dependent variable.

X=df[['age','educatn','hours','kids','marry']]
y=df['earnings']

Model Development

We are now going to make several models and use the mean squared error as our way of comparing them. The first model will use all of the data. The second model will use the training data. The third model will use cross-validation. Below is the code for the first model that uses all of the data.

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
138544429.96275884

For the second model, we first need to make our train and test sets. Then we will run our model.  The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=5)
regression.fit(X_train,y_train)
second_model=(mean_squared_error(y_true=y_train,y_pred=regression.predict(X_train)))
print(second_model)
148286805.4129756

You can see that the numbers are somewhat different. This is to be expected when dealing with different sample sizes. With cross-validation using the full dataset, we get results similar to the first model we developed. This is done through an instance of the KFold class. For KFold, we want 10 folds, we want to shuffle the data, and we want to set the seed.

The other function we need is the cross_val_score function. In this function, we set the type of model, the data we will use, the metric for evaluation, and the characteristics of the type of cross-validation. Once this is done we print the mean and standard deviation of the fold results. Below is the code.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
scores=cross_val_score(regression,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1)
print(len(scores),np.mean(np.abs(scores)),np.std(scores))
10 138817648.05153447 35451961.12217143

These numbers are closer to what is expected from the dataset, despite the fact that we didn’t use all the data at the same time. You can also run these results on the training set for additional comparison.
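The np.abs call in the code above is needed because the neg_mean_squared_error scorer reports the mean squared error with its sign flipped, so that a higher score is always better. The sketch below shows the convention in plain Python. The mse helper, the candidate predictions, and the model names are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions from two hypothetical models.
y_true = [3.0, 5.0, 7.0]
candidates = {
    "model_a": [2.5, 5.5, 7.0],
    "model_b": [1.0, 5.0, 9.0],
}

# Negating the error turns "smaller is better" into "larger is better".
neg_scores = {name: -mse(y_true, pred) for name, pred in candidates.items()}
best = max(neg_scores, key=neg_scores.get)

print(neg_scores)
print(best, abs(neg_scores[best]))
```

Taking the absolute value at the end simply recovers the ordinary mean squared error for reporting.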

Conclusion

This post provides an example of cross-validation in Python. The use of cross-validation helps to stabilize the results that may come from your model. With increased stability comes increased confidence in your model's ability to generalize to other datasets.

Artificial Intelligence in the Classroom

In 1990, a little-known film called “Class of 1999” came out. In this movie, three military-grade robots are placed in an inner-city war zone school with the job of teaching.

As with any movie, things quickly get out of hand, and the robot teachers begin killing the naughty students and eventually manipulating the local gangs into fighting and killing each other. Eventually, in something that can only happen in a movie, three military-grade AI robots similar to the Terminator are destroyed by a delinquent teenager.

There has been a lot of hype and excitement over the growth of data science, machine learning, and artificial intelligence. With this growth, these ideas have begun to expand into supporting education. This has even led to speculation among some that algorithms and artificial intelligence could replace teachers in the classroom.

There are several reasons why this is unlikely. My reasons are listed as follows.

  • People Need People
  • Computers need people
  • Computers assist people

People Need People

When it comes to education, people need people. Originally, education was often passed through apprenticeship for trades and one-on-one tutoring for elites. There has always been some form of mass education, but it has always involved people helping each other.

There are certain social-emotional needs that people have that cannot be satisfied by even the most life-like machine. When humans communicate, they always convey some form of emotion, even the most hardened, computer-like individual. Although AI is making strides in attempting to read emotions, it is far from convincingly portraying them. Besides, students want someone who can laugh, joke, smile, and do all those little things that involve being human. Even such mundane things as tripping over one’s shoes or forgetting someone’s name add a human element to the learning experience.

Furthermore, even if a computer is able to share emotions in a human-like manner what child would really feel satisfaction from pleasing an Amazon Alexa? People need people and AI teachers cannot provide this even if they can provide top-level content.

Another concern is that people are highly unpredictable. Again, this relates to the emotional aspects of human nature. Even humans who have the same emotional characteristics are surprised by the behavior of fellow humans. When an algorithm is coldly calculating what is an appropriate action this inability to deal with unpredictable actions can be a problem.

A classic example of this is classroom management. If a student is not paying attention, not doing their work, or showing defiance in one way or another, how would a computer handle this? In the movie “Class of 1999” the answer for disruptive behavior was to kill. Few parents and administrators would approve of such an action coming from an artificial neural network.

People need people in the context of education for the socio-emotional aspect of education as well as for the tribulation of classroom management. Computers are not humans and therefore they cannot provide the motivation or inspiration that so many students need to be successful in school.

Computers Need People

A second reason AI teachers are unlikely is that computers need people. Computers break down, there are bugs in code, updates have to be made, etc. All this precludes a machine from going completely independent. With everything that can go wrong, there have to be people there to monitor the use and interaction of machines with people.

Even in the movie “Class of 1999” there was a research professor and administrator monitoring the school. This continued until they were killed by the AI teachers.

With all the amazing advances in AI and machine learning, it is still people who tweak the algorithms and provide the data for the machine to learn. After this is done, the algorithm is still monitored to see how it performs. Computers cannot escape their reliance on humans to maintain their functionality, which implies that they cannot be turned loose in a classroom alone.

Computers Help People

The way going forward is that perhaps AI and other aspects of machine learning and data science can support teachers to be better teachers. For example, in some versions of Moodle there is an algorithm that will monitor students participation and will predict if students are at risk of failing. There is also an algorithm that predicts if a teacher is teaching. This is an excellent use of machine learning in that it deals with a routine task and simple flags a problem rather than trying to intervene it’s self.

Another useful application more in line with AI is tutoring: providing feedback on progress and adjusting what the student does based on their performance. Again, in a supporting role, AI can be excellent. The problem is when AI becomes the leader.

Conclusion

The advances in technology are going to continue. However, even with the amazing breakthroughs in this field, people still need interaction with other people and the example of others in a social context. Computers will never be able to provide this. Computers also need the constant support of humans in order to function. The proper role for AI and data science in education may be as a supporter to a teacher rather than the one leading and making critical decisions about other people.

Computational Thinking

Computational thinking is the process of expressing a problem in a way that a computer can solve. In general, there are four ways that computational thinking can be done: decomposition, pattern recognition, abstraction, and algorithmic thinking.

Although computational thinking is dealt with in the realm of computer science, everyone thinks computationally at one time or another, especially in school. Awareness of these subconscious strategies can help people to know how they think at times as well as to be aware of the various ways in which thinking is possible.

Decomposition

Decomposition is the process of breaking a large problem down into smaller and smaller parts or problems. The benefit of this is that by addressing all of the created little problems you can solve the large problem.

In education, decomposition can show up in many ways. Teachers often have to break goals down into objectives, and sometimes down into procedures in a daily lesson plan. Seeing the big picture of the content students need and breaking it down into pieces that students can comprehend, such as with chunking, is critical to education.

For the student, decomposition involves breaking down the parts of a project such as writing a paper. The student has to determine what to do and how it helps to achieve the completion of their project.
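As a small illustration of decomposition in code, the sketch below splits one larger task (summarizing an essay) into two smaller problems that are solved separately and then combined. The essay text and the function names are hypothetical.

```python
# Decomposition sketch: the large task "summarize an essay" is split into
# smaller functions. The essay text and function names are hypothetical.

def split_sentences(text):
    """Small problem 1: break the text into sentences."""
    return [s.strip() for s in text.split(".") if s.strip()]

def count_words(sentence):
    """Small problem 2: count the words in one sentence."""
    return len(sentence.split())

def summarize(text):
    """Large problem: solved by combining the small solutions."""
    sentences = split_sentences(text)
    counts = [count_words(s) for s in sentences]
    return {"sentences": len(sentences), "words": sum(counts)}

essay = "Teaching is hard. Planning helps. Students learn in small steps."
summary = summarize(essay)
print(summary)
```

Each small function can be checked on its own, which mirrors how a student can finish and review one part of a paper before moving to the next.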

Pattern Recognition

Pattern recognition refers to how various aspects of a problem have things in common. For a teacher, this may involve the development of a thematic unit. Developing such a unit requires the teacher to see what various subjects or disciplines have in common as they try to create the thematic unit.

For the student, pattern recognition can support the development of  shortcuts. Examples include seeing similarities in assignments that need to be completed and completing similar assignments together.

Abstraction

Abstraction is the ability to remove irrelevant information from a problem. This is perhaps the most challenging form of thinking to develop because people often fall into the trap of believing that everything is important.

For a teacher, abstraction involves teaching only the critical information that is in the content and not stressing the small stuff. This is not easy, especially when the teacher has a passion for their subject. This passion often keeps them from sharing only the most relevant information about their field with their students.

For students, abstraction involves being able to share the most critical information. Students are guilty of the same problems as teachers in that they share everything when writing or presenting. Determining what is important requires the development of an opinion to judge the relevance of something. This is a skill that is hard to find among graduates.

Algorithmic Thinking

Algorithmic thinking is being able to develop a step-by-step plan to do something. For teachers, this happens every day through planning activities and leading a class. Planning may be the most common form of thinking for the teacher.

For students, algorithmic thinking is somewhat more challenging. It is common for younger people to rely heavily on intuition to accomplish tasks. This means that they did something but they do not know how they did it.

Another common mistake for young people is doing things through brute force. Rather than planning, they will just keep pounding away until something works. As such, it is common for students to do things the “hard way” as the saying goes.
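The contrast between brute force and a step-by-step plan can be sketched with a number-guessing game. This is only an illustration; the game, the range of 1 to 100, and the function names are assumptions made for the example.

```python
def brute_force_guess(secret, low=1, high=100):
    """Try every number in order until one works; return the number of tries."""
    tries = 0
    for guess in range(low, high + 1):
        tries += 1
        if guess == secret:
            return tries
    raise ValueError("secret is outside the range")

def planned_guess(secret, low=1, high=100):
    """Follow a plan: always guess the middle, then discard half of the range."""
    tries = 0
    while low <= high:
        tries += 1
        middle = (low + high) // 2
        if middle == secret:
            return tries
        if middle < secret:
            low = middle + 1   # the secret must be in the upper half
        else:
            high = middle - 1  # the secret must be in the lower half
    raise ValueError("secret is outside the range")

brute_tries = brute_force_guess(87)
planned_tries = planned_guess(87)
print(brute_tries, planned_tries)
```

Both approaches reach the answer, but the planned version never needs more than seven guesses in this range, while the "hard way" can need up to one hundred.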

Conclusion

Computational thinking is really how humans think in many situations in which emotions are not the primary mover. As such, what is really happening is not that computers are thinking as much as they are trying to model how humans think. In education, there are several situations in which computational thinking can be employed for success.

Mentoring New Teachers

A career in teaching is an attractive option for many young adults. One of the major challenges in a career in teaching is the student teaching experience that is normally placed at the end of the degree program. This post will provide some suggestions for teacher mentors.

Go Over Local Expectations

Every school has its own set of policies and expectations that all employees need to adhere to. Often, the student teacher is not aware of these and it is the mentoring teacher’s responsibility to provide some idea of what is expected. This includes such things as showing them around the campus and communicating expectations for how to dress, discipline procedures, and even how to deal with grades.

Knowing these little things can allow the new teacher to focus on teaching rather than the administrative aspects of the classroom.

Provide Feedback

Feedback is critical so that the new teacher knows what they are doing well and what they are doing wrong. It is, of course, important to mention what the student teacher does well. However, growth happens by providing support to overcome weaknesses.

The temptation for many supervising teachers is simply to mention what the problems are and let the student figure out what to do. This approach may work for an experienced or highly independent teacher. However, most new teachers need specific support on what to do in order to improve their teaching and overcome a weakness.

Therefore, criticism without some sort of suggestion for how to overcome the problem is not beneficial. In addition, it is important to only address major problems that can cripple the educational experience of the students rather than every single weakness in the student’s teaching. We all have issues and problems with our teaching and for beginners, only the big problems should be corrected.

The student also should provide feedback on how they view their own teaching. Most teacher education programs require this in the form of a journal. However, the benefit of the journal is only in discussing it with others such as the mentor teacher.

Lead By Example

In reality, in order for a student to be a successful teacher, they need to see what successful teaching is so they can imitate it until perfection. What this means for you as a supervising teacher is that you need to set the example for the student to imitate. Everyone has their own style but a good example goes a long way in molding the teaching approach of a student.

This also means that a mentor teacher needs to do a lot of verbalizing in terms of what they do. Often, as an experienced teacher, things become automatic in the classroom. You know what to do without much thought or discussion. The problem is that if there is a lack of explanation of what is happening, the student teacher is not able to determine why you are doing certain things. Therefore, a mentor teacher must explain explicitly what they are doing and why while they are providing the example of teaching.

Conclusion

Students who dream of teaching need support in order to have success. This involves bringing in people with more experience to support these young teachers as they develop their skillset. This means that even experienced teachers need some support in order to determine how to help new teachers.

3 Steps to Successful Research

When students have to conduct a research project they often struggle with determining what to do. There are many decisions that have to be made that can impede a student’s chances of achieving success.  However, there are ways to overcome this problem.

This post will essentially reduce the decision-making process for conducting research down to three main questions that need to be addressed. These questions are:

  • What do you Want to Know?
  • How do You Get the Answer?
  • What Does Your Answer Mean?

Answering these three questions makes it much easier to develop a sense of direction and scope in order to complete a project.

What do you Want to Know?

Often, students want to complete a project but it is unclear to them what they are trying to figure out. In other words, the students do not know what it is that they want to know. Therefore, one of the first steps in research is to determine exactly what it is you want to know.

Understanding what you want to know will allow you to develop a problem as well as research questions to facilitate your ability to understand exactly what it is that you are looking for. Research always begins with a problem and questions about the problem and this is simply another way of stating what it is that you want to know.

How do You Get the Answer?

Once it is clear what it is that you want to know it is critical that you develop a process for determining how you will obtain the answers. It is often difficult for students to develop a systematic way in which to answer questions. However, in a research paradigm, a scientific way of addressing questions is critical.

When you are determining how to get answers to what you want to know, this is essentially the development of your methodology section. This section includes such matters as the research design, sample, ethics, data analysis, etc. The purpose here is again to explain the way to get the answer(s).

What Does Your Answer Mean?

After you actually get the answer you have to explain what it means. Many students fall into the trap of doing something without understanding why or determining the relevance of the outcome. However, a research project requires some sort of interpretation or explanation of the results. Just getting the answer is not enough; it is the meaning that holds the power.

Often, the answers to the research questions are found in the results section of a paper and the meaning is found in the discussion and conclusion section. In the discussion section, you explain the major findings with interpretation, share recommendations, and provide a conclusion. This requires thought into the usefulness of what you wanted to know. In other words, you are explaining why someone else should care about your work. This is much harder to do than many realize.

Conclusion

Research is challenging, but if you keep in mind these three keys it will help you to see the big picture of research and to focus on the goals of your study and not so much on the tiny details that make up the processes.

Undergrad and Grad Students

In this post,  we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill seats initially and quality is often compromised.

Focus

At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require a rather intense specialization from the beginning with a minimum of general education while others take a more American-style approach with a wide exposure to various fields.

Commitment

Graduate students are usually older. This means that they require fewer institution-sponsored social activities and may not socialize at all. In addition, some graduate students are married, which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family, they are also going to struggle due to the time commitments to their loved ones.

Assuming that an undergraduate student is a traditional one they will tend to be straight from high school, require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are often missing for academic success.

For the teacher, graduate students generally offer higher motivation and commitment when compared to undergrads. This is reasonable as people often feel compelled to complete a bachelor’s but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.

Conclusion

Despite the differences, both types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Using Blended Learning in Your Classroom

Blended learning is becoming a reality in education. Many schools now require some sort of online presence, not only for the school but also for the individual classes that teachers teach. This has led to much more pressure for teachers to figure out some way to get content online to support students. This post will take a look at the pros and cons of blended learning and provide tips on how to approach the use of blended learning.

Pros and Cons

Blended learning gives you flexibility. You are not tied completely to either traditional teaching or elearning. This allows you to find the right balance for your teaching style and the students’ learning. Some teachers want more online presence in the form of activities and submission of assignments. Others just want a centralized place for communicating with their students and tracking academic progress. Whatever works for you can probably be accommodated when employing blended learning.

Communication and documentation are another benefit of blended learning. Announcements and messaging can be handled by the online platform and these forms of transactions are usually logged by the system and saved. This can be useful for referencing in the future if confusion arises.

The drawback to the flexibility is actually the flexibility. When employing blended learning a teacher has to be proficient in both e-learning and traditional teaching. In other words, you have to become a jack of all trades. Without strength in both methodologies, it will be difficult to determine what you want to do online and to determine how the online experiences augment or replace in-class learning opportunities.

Another problem is the confusion over what is done in class and what is done online. When learning takes place in two mediums it increases opportunities for misunderstanding and miscommunication. I have frequently had students confused over what was to be submitted online vs in class no matter how clear I was in the course outline and calendar.

Success in a Blended Learning Context

To have success using blended learning involves doing some of the following.

  • Focus on using the online platform less for learning and more for communication when you initially begin using blended learning. There is a lot for you to learn as the teacher and trying to move everything online will lead to confusion for you and the students.
  • Consider having students submit the final version of assignments online. Final versions of assignments usually require the least amount of feedback because they have already been vetted by you in person. This will allow you to focus on the grade rather than on providing more support.
  • Online activities should support learning and probably not replace it. Blended learning is often more effective if it helps students to understand content in class rather than replace it. This means the blended learning platform is a study tool to scaffold students’ learning outside of class. If an assignment can be completely done online without having to go to class then perhaps this is no longer blended learning since the in-class part is not needed for support.
  • Planning is critical to using any web-based resource. Websites are designed in advance before they are set up. Even posting to a blog requires you to develop a draft or two. Therefore, online activities and expectations must be planned well in advance and not just thrown online at the whim of the teacher throughout the semester. Many teachers fall into the trap of just making stuff up as they go. This is a poor methodology in a traditional classroom and a disaster in a blended learning context.
  • When in doubt go traditional. If you are unsure how to achieve a specific learning goal online it is better to stick to a traditional approach until you can figure it out. In-class teaching is old but it still has a place in the 21st century especially when it is unclear how to do it online.

Conclusion

Blended learning can be a powerful tool for helping students or a major headache that annoys everyone. The secret to success lies with the teacher who understands what they want from the online aspect of the students’ learning as well as what they want in the classroom. When this is clear, it is critical that the teacher determine how to meet these goals through the use of various learning experiences.

Elearning Academic Success

Studying online has become almost an expectation now. Even if you never earn a degree or take a class for credit online there are still many opportunities to train and develop skills over the internet. The role of the teacher is to try and find ways to engage and support their students as they begin their learning experience physically alone with support perhaps thousands of miles away.

In this post, we will look at ways to encourage the academic success of students while studying online. Two ways to support academic success in elearning involve providing feedback and encouraging engagement.

Provide Feedback

Feedback is critical in every aspect of teaching. However, in elearning, it is even more important. This is because the students have no face-to-face communication with you so they have no idea how they are doing beyond a letter grade. In addition, there is no body language to examine or other paralinguistic features that the student can infer meaning from.

Giving feedback requires timeliness. In other words, mark assignments quickly and indicate progress. In addition, if students do not meet expectations it is critical that you point them towards resources that will help them to understand. For example, students seem to neglect reading rubrics. When a student gets feedback from a rubric they can see where they were not successful.

In terms of a more formative feedback approach, there may be times where it is beneficial to live stream a lecture. This allows the students to chime in whenever they do not understand an idea or point. Furthermore, the teacher can ask a question or two of the students and get feedback from them.

Engage Them

Engaging is almost synonymous with active. In other words, students should be doing something in order to learn. Unfortunately, listening is a passive activity which implies that lecturing is not the best way to inspire learning.

In the context of elearning, one of the ways to inspire active learning is to have the students go out and do something in the real world and report what happens online in the form of a reflection. For example, students studying English will go out and teach English in the real world. They will then come and share their experience. The teacher is then able to provide insights and feedback to improve the students’ teaching. This provides a connection to the real world as well as a sense of relevance.

In a more abstract subject, such as history, music theory, or engineering, students can become active through sharing their insights with laymen or explaining how they are already applying this information at their job or in the home. The goal of using the content provides the purpose for learning it.

Conclusion

Feedback and engagement are critical to success in a situation in which the student is primarily learning alone which is found in the context of elearning.

Scatter Plots in Python

Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.

We will be using the “Prestige” dataset from the pydataset module to look at scatterplot use. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df=data('Prestige')

We will begin by making a correlation matrix. This will help us to determine which pairs of variables have strong relationships with each other. This will be done with the .corr() function. Below is the code.

[Image: code and output for the correlation matrix]
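Since the original code for this step survives only as an image, the following is a hedged reconstruction and not necessarily what the post ran. To keep the sketch self-contained it uses a small invented stand-in for the Prestige data; with pydataset installed, df = data('Prestige') would take its place. Selecting the numeric columns first is an assumption added here because the type column is categorical.

```python
import pandas as pd

# Invented stand-in for the Prestige data so the sketch runs on its own.
# With pydataset installed, df = data('Prestige') would replace this frame.
df = pd.DataFrame({
    "education": [13.1, 12.3, 12.8, 11.4, 14.6, 8.5, 7.6, 9.9],
    "income": [12351, 25879, 9271, 8865, 8403, 4199, 3485, 5134],
    "prestige": [68.8, 69.1, 63.4, 56.8, 73.5, 35.9, 29.9, 42.2],
    "type": ["prof", "prof", "prof", "prof", "prof", "bc", "bc", "wc"],
})

# Keep only the numeric columns ('type' is categorical), then correlate.
corr_matrix = df.select_dtypes(include="number").corr()
print(corr_matrix.round(2))
```

Each cell of the result is the Pearson correlation between the row variable and the column variable, so the diagonal is always 1.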

You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.

The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.

[Image: code and output for the basic scatterplot of education and income]

The code should be self-explanatory. The only thing that might be unknown is the fit_reg argument. This is set to False so that the function does not make a regression line. Below is the same visual but this time with the regression line.

facet = sns.lmplot(data=df, x='education', y='income',fit_reg=True)

[Image: scatterplot of education and income with a regression line]

It is also possible to add a third variable to our plot. One of the more common ways is through including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this we use the same .lmplot() function but include several additional arguments. These include the hue and the indication of a legend. Below is the code and output.

[Image: code and output for the scatterplot of education and income colored by job type]
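The original code for this plot is only available as an image, so what follows is a sketch of how the hue and legend arguments described above could be passed, not a copy of the post's code. The data frame is an invented stand-in for the Prestige data, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Invented stand-in values; with pydataset, df = data('Prestige') would be used.
df = pd.DataFrame({
    "education": [13.1, 12.3, 14.6, 8.5, 7.6, 9.2, 11.4, 10.9, 11.6],
    "income": [12351, 25879, 8403, 4199, 3485, 5134, 4036, 5180, 6197],
    "type": ["prof", "prof", "prof", "bc", "bc", "bc", "wc", "wc", "wc"],
})

# hue colors the points by job type; legend=True asks for a key to the colors.
facet = sns.lmplot(data=df, x="education", y="income",
                   hue="type", fit_reg=False, legend=True)
```

Setting fit_reg=True here would draw a separate regression line for each job type, which can be another way to compare the groups.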

You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.

[Image: boxplots of education and income by job type]
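The boxplot code is also missing from the text. One way such boxplots could be drawn with seaborn is sketched below; this is an assumption about the approach, and the data are again an invented stand-in for the Prestige data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented stand-in values; with pydataset, df = data('Prestige') would be used.
df = pd.DataFrame({
    "education": [13.1, 12.3, 14.6, 8.5, 7.6, 9.2, 11.4, 10.9, 11.6],
    "income": [12351, 25879, 8403, 4199, 3485, 5134, 4036, 5180, 6197],
    "type": ["prof", "prof", "prof", "bc", "bc", "bc", "wc", "wc", "wc"],
})

# One boxplot per variable, placed side by side and grouped by job type.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.boxplot(data=df, x="type", y="education", ax=axes[0])
sns.boxplot(data=df, x="type", y="income", ax=axes[1])
fig.tight_layout()
```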

As you can see, we can conclude that job type influences both education and income in this example.

Conclusion

This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analysts should be familiar with, as they can be used to communicate information to people who must make decisions.

Data Visualization in Python

In this post, we will look at how to set up various features of a graph in Python. The fine-tuning that can be done when creating a data visualization can enhance the communication of results with an audience. This will all be done using the matplotlib module available for Python. Our objectives are as follows:

  • Make a graph with  two lines
  • Set the tick marks
  • Change the linewidth
  • Change the line color
  • Change the shape of the line
  • Add a label to each axis
  • Annotate the graph
  • Add a legend and title

We will use two variables from the “toothpaste” dataset from the pydataset module for this demonstration. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
DF = data('toothpaste')

Make Graph with Two Lines

To make a plot you use the .plot() function. Inside the parentheses you put the dataframe and variable you want. If you want more than one line or graph you use the .plot() function several times. Below is the code for making a graph with two line plots using variables from the toothpaste dataset.

plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

[Image: graph with two line plots]

To get the graph above you must run both lines of code together, for example in the same selection or cell. Otherwise, you will get two separate graphs.
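An alternative that makes it explicit which graph each line belongs to is matplotlib's object-oriented interface, where both lines are drawn on one named Axes object. This is a sketch of that alternative, not the approach used in the rest of this post. The two lists are invented stand-ins for DF['meanA'] and DF['sdA'], and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented stand-in values; in the post these would be DF['meanA'] and DF['sdA'].
mean_a = [5.5, 6.0, 7.2, 6.8, 7.9]
sd_a = [1.1, 0.9, 1.4, 1.2, 1.0]

# Create one figure with one Axes, then draw both lines on that same Axes.
fig, ax = plt.subplots()
ax.plot(mean_a)
ax.plot(sd_a)

n_lines = len(ax.lines)
print(n_lines)
```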

Set Tick Marks

One way to set the tick marks is with the .axes() function. It is common to save the result in a variable (ax in the code below) as a handle. This makes coding easier. Once this is done you can use the .set_xticks() function for the x-axis and .set_yticks() for the y-axis. In our example below, we are setting the tick marks for the odd numbers only. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

[Image: graph with tick marks at the odd numbers]

Changing the Line Type

It is also possible to change the line type and width. There are several options for the line type. The important thing here is to put this information after the data you want to plot inside the code. Line width is changed with an argument that has the same name. Below is the code and visual.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'--',linewidth=3)
plt.plot(DF['sdA'],':',linewidth=3)

[Image: graph with dashed and dotted lines of width 3]

Changing the Line Color

It is also possible to change the line color. There are several options available. The important thing is that the argument for the line color goes inside the same parentheses as the line type. Below is the code. r means red and k means black.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'r--',linewidth=3)
plt.plot(DF['sdA'],'k:',linewidth=3)

[Image: graph with a red dashed line and a black dotted line]

Change the Point Type

Changing the point type requires more code inside the same quotation marks where the line color and line type are. Again, there are several choices here. The code is below.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

[Image: graph with circle and diamond point markers]
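The short string 'or--' packs three settings together: o for the point type, r for the color, and -- for the line type. As a sketch of what that shorthand stands for, the same style can be spelled out with the keyword arguments marker, color, and linestyle. The values plotted are an invented stand-in for the toothpaste data, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

values = [2, 4, 3, 6, 5]  # invented stand-in for DF['meanA']

fig, ax = plt.subplots()

# Shorthand: marker, color, and line type in one format string.
(short,) = ax.plot(values, "or--", linewidth=3)

# The same settings written out one keyword at a time.
(spelled,) = ax.plot(values, marker="o", color="r", linestyle="--", linewidth=3)

# Compare the styles of the two lines (colors are compared as RGBA values).
same_style = (
    short.get_marker() == spelled.get_marker()
    and short.get_linestyle() == spelled.get_linestyle()
    and mcolors.to_rgba(short.get_color()) == mcolors.to_rgba(spelled.get_color())
)
print(same_style)
```

The keyword form is longer but can be easier to read when a plot has many lines with different styles.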

Add X and Y Labels

Adding labels is simple. You just use the .xlabel() function or .ylabel() function. Inside the parentheses, you put the text you want in quotation marks. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

[Image: graph with x and y labels]

Adding Annotation, Legend, and Title

Annotation allows you to write text directly inside the plot wherever you want. This involves the use of the .annotate() function. Inside this function, you must indicate the location of the text and the actual text you want added to the plot. For our example, we will add the word ‘Python’ to the plot for fun.

The .legend() function allows you to give a description of the line types that you have included. Lastly, the .title() function allows you to add a title. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.annotate(xy=[3,4],text='Python')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)
plt.legend(['1st','2nd'])
plt.title("Plot Example")

[Image: graph with annotation, legend, and title]

Conclusion

Now you have a practical understanding of how you can communicate information visually with matplotlib in Python. This is barely scratching the surface in terms of the potential that is available.

Critical Thinking and Problem Solving

There have been concerns for years that critical thinking and problem-solving skills are in decline not only among students but also the general public. On the surface, this appears to be true. However, throughout human history, the average person was not much of a deep thinker but rather a doer. How much time can you spend on thinking for the sake of thinking when you are dealing with famine, war, and disease? This internal focus vs external focus is one of the differences between critical thinking and problem-solving.

Critical Thinking

There is no agreed-upon definition of critical thinking. This makes sense as any agreement would indicate a lack of critical thinking. In general, critical thinking is about questioning and testing the claims and statements made through external evidence as well as internal thought. Critical thinking is the ability to know what you don’t know and seek answers through finding information. To test and assert claims means taking time to develop them, which is often a lonely process.

Thinking for the sake of thinking is a somewhat introverted process. There are few people who want to sit and ponder in the fast-paced 21st century. This is one reason why it appears that critical thinking is in decline. It’s not that people are incapable of thinking critically; they would just rather not do it and instead seek a quick answer or entertainment. Critical thinking is just too slow for many people.

Whenever I give my students any form of open-ended assignment that requires them to develop an argument I am usually shocked by the superficial nature of the work. Having thought about this I have come to the conclusion that the students lacked the mental energy to demonstrate the critical thought needed to write a paper or even to share their opinion about something a little deeper than Facebook videos.

Problem Solving

Problem-solving is about getting stuff done. When an obstacle is in the way a problem solver finds a way around. Problem-solving is often focused on tangible things and objects in a practical way. Therefore, problem-solving is more of an extroverted experience. It is common and easy to solve problems with friends gregariously. However, thinking critically is somewhat more difficult to do in groups and cannot move as fast as we want when discussing.

Due to the potential of working in groups and the fast pace that it can take, problem-solving skills are in better shape than critical thinking skills. This is because when people work in groups several superficial ideas can be combined to overcome a problem. This groupthink, if you will, allows for success even though the individual members are probably not the brightest.

Problem-solving has been the focus of mankind for most of its existence. Please keep in mind that for most of human history people could not even read and write. Instead, they were farmers and soldiers concerned with food and protecting their civilization from invasion. These problems led to amazing discoveries for the sake of providing life and not for the sake of thinking for the sake of thinking or questioning for the sake of objection.

Overlap

There is some overlap in critical thinking and problem-solving. Solutions to problems have to be critically evaluated. However, often a potential solution is judged good or bad by whether it works or not, which requires observation and not in-depth thinking. The goal for problem-solving is always “does this solve the problem” or “does this solve the problem better”. These are important criteria but critical thinking involves much broader and deeper issues than just “does this work.” Critical thinking is on a quest for truth and satisfying curiosity. These are ideas that problem-solvers struggle to appreciate.

The world is mostly focused on people who can solve problems and not necessarily on deep thinkers who can ponder complex ideas alone. As such, perhaps critical thinking was a fad that has ceased to be relevant as problem solvers do not see how critical thinking solves problems. Both forms of thought are needed and they do overlap yet most of the world simply wants to know what the answer is to their problem rather than to think deeply about why they have a problem in the first place.

Random Forest Regression in Python

Random forest is simply the making of dozens if not thousands of decision trees. The decisions each tree makes about an example are then tallied for the purpose of voting, with the classification that receives the most votes winning. For regression, the results of the trees are averaged in order to give a more accurate result.
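The averaging step for regression can be checked directly in scikit-learn. The sketch below is illustrative only: it uses synthetic data from make_regression and not the cancer dataset, and the number of trees is an arbitrary choice. It collects the prediction of every individual tree through the estimators_ attribute and compares their mean with what the forest itself predicts.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only so the sketch is self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X, y)

# Ask every tree in the forest for its own prediction on the first five rows.
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Average across the trees and compare with the forest's own prediction.
manual_average = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```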

In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below

df = data('cancer')
df=df.dropna()
X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['age']

Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestRegressor(n_estimators=100,random_state=1)
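If you are curious what a seeded 70/30 split does in principle, the sketch below shuffles the rows with a fixed seed and cuts the list in two. The function simple_train_test_split is a hypothetical stand-in written for this illustration and is not the way train_test_split is implemented internally.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=0):
    """Shuffle the rows with a fixed seed and use the first share as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_a, test_a = simple_train_test_split(rows, seed=0)
train_b, test_b = simple_train_test_split(rows, seed=0)

# The same seed gives the same split, which is what makes the work reproducible
print(train_a == train_b, len(train_a), len(test_a))
```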

We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.

h.fit(x_train,y_train)
y_pred=h.predict(x_test)
mean_squared_error(y_test,y_pred)
71.75780196078432
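As a side note, the mean squared error is easy to compute by hand, and taking its square root puts the error back into the units of the dependent variable. The sketch below uses made-up ages for the manual calculation; only the value 71.75780196078432 comes from the output above.

```python
import math

def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Made-up ages and predictions, for illustration only
actual = [60, 70, 55, 65]
predicted = [58, 75, 55, 61]
mse = mean_squared_error_manual(actual, predicted)

# The square root of the MSE reported above, expressed in years
rmse_from_post = math.sqrt(71.75780196078432)
print(mse, round(rmse_from_post, 2))
```

The root of the reported MSE is about 8.47, so the predictions are off by roughly eight and a half years in this root-mean-square sense.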

The MSE of 71.76 is mainly useful for model comparison and has little meaning by itself. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.

model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False) 
ax=model_ranks.plot(kind='barh')

[Figure: horizontal bar plot of variable importance]

As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.

Conclusion

This post provided an example of the use of regression with random forest. By averaging the predictions of many trees, you can often improve the accuracy of your models compared with a single decision tree.

Bagging Classification with Python

Bootstrap aggregation, also known as bagging, is a technique used in machine learning that relies on resampling from the sample and running multiple models on the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using decision trees, bagging would have you run the model several times with several different subsamples to help deal with variance.
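The resampling idea can be sketched in a few lines of plain Python. Here the "model" is deliberately trivial (the mean of each bootstrap sample), and the data and function names are made up for illustration.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_estimate(data, n_models=100, seed=0):
    """Fit a trivial model (the sample mean) on each bootstrap sample and average the results."""
    rng = random.Random(seed)
    estimates = [mean(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(estimates), estimates

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
bagged_mean, estimates = bagged_estimate(data)
print(round(bagged_mean, 2))
```

Each bootstrap sample gives a slightly different estimate, and averaging them smooths out that variation. Bagging a real model works the same way, with the mean replaced by a fitted classifier or regressor.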

Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.

We will use the turnout dataset available in the pydataset module. Below is some initial code.

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

We will load our dataset. Then we will separate the independent and dependent variables from each other and create our train and test sets. The code is below.

df=data("turnout")
X=df[['age','educate','income',]]
y=df['vote']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows

h=BaggingClassifier(KNeighborsClassifier(n_neighbors=7),max_samples=0.7,max_features=0.7,n_estimators=1000)

Basically, what we told Python was to use up to 70% of the samples, 70% of the features, and make 1,000 different KNN models that use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classification_report function.

h.fit(X_train,y_train)
y_pred=h.predict(X_test)
print(classification_report(y_test,y_pred))

[Figure: classification report for the bagged KNN model]

This looks okay. Below are the results when you fit a traditional model without bagging.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

[Figure: classification report for the KNN model without bagging]

The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the right context, such as for large companies like Google that deal with billions of people per day.

Conclusion

This post provides an example of the use of bagging in the context of classification. Bagging provides a way to improve your model through the use of resampling.

K Nearest Neighbor Classification with Python

K Nearest Neighbor uses the idea of proximity to predict class. What this means is that with KNN Python will look at K neighbors to determine what the unknown example's class should be. It is your job to determine the K, or number of neighbors, that should be used to determine the unlabeled example's class.
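The proximity idea can be sketched without any libraries. The function knn_predict and the small dataset below are made up for illustration; scikit-learn's KNeighborsClassifier does the same job far more efficiently.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the label of new_point by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (age, years of education) pairs; 1 = voted, 0 = did not vote
train_points = [(25, 10), (30, 12), (45, 16), (50, 14), (60, 18)]
train_labels = [0, 0, 1, 1, 1]

prediction = knn_predict(train_points, train_labels, (48, 15), k=3)
print(prediction)
```

The three closest points to (48, 15) all carry the label 1, so the unknown example is classified as 1.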

KNN is great for a small dataset. However, it normally does not scale well when the dataset gets larger and larger. As such, unless you have an exceptionally powerful computer KNN is probably not going to do well in a Big Data context.

In this post, we will go through an example of the use of KNN with the turnout dataset from the pydataset module. We want to predict whether someone voted or not based on the independent variables. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

We now need to load the data and separate the independent variables from the dependent variable by making two datasets.

df=data("turnout")
X=df[['age','educate','income']]
y=df['vote']

Next, we will make our train and test sets with a 70/30 split. The random_state is set to 0. This argument allows us to reproduce our model if we want. After this, we will set K to 7 and run the model. This means that Python will look at the 7 closest examples to predict the value of an unknown example. Below is the code.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)

We can now predict with our model and see the results with the classification_report function.

y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

[Figure: classification report for the KNN model]

The results are shown above. Judging the quality of the model relies mostly on domain knowledge. What we can say for now is that the model is better at classifying people who vote than people who do not vote.
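A classification report summarizes precision and recall for each class. As a rough sketch of where those numbers come from, the code below computes them by hand on made-up labels; the function name precision_recall is hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 1 = voted, 0 = did not vote
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 1]

voted = precision_recall(y_true, y_pred, positive=1)
not_voted = precision_recall(y_true, y_pred, positive=0)
print(voted, not_voted)
```

In this toy example the recall for voters is 0.8 while the recall for non-voters is only one third, which is the kind of imbalance between classes described above.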

Conclusion

This post shows you how to work with Python when using KNN. This algorithm is useful in using neighboring examples to predict the class of an unknown example.

Naive Bayes with Python

Naive Bayes is a probabilistic classifier that is often employed when you have multiple or more than two classes in which you want to place your data. This algorithm is particularly used when you are dealing with text classification with large datasets and many features.

If you are more familiar with statistics you know that Bayes developed a method of probability that is highly influential today. In short, his system takes conditional probability into account. In the case of naive Bayes, the classifier assumes that the presence of a certain feature in a class is not related to the presence of any other feature. This assumption is why Naive Bayes is naive.
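To see what the independence assumption buys you, here is a minimal sketch of a naive Bayes classifier for categorical features. It is not the Gaussian variant, it uses no smoothing, and the rows, labels, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each class occurs and how often each feature value occurs within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(row, class_counts, value_counts):
    """Score each class as P(class) times the product of P(feature value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Independence assumption: each feature contributes its own factor
            score *= value_counts[(label, i)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up rows: (age group, income group) with a hypothetical insurance label
rows = [("young", "low"), ("young", "high"), ("old", "high"), ("old", "high"), ("old", "low")]
labels = ["basic", "basic", "plus", "plus", "basic"]

class_counts, value_counts = train_naive_bayes(rows, labels)
prediction, scores = predict_naive_bayes(("old", "high"), class_counts, value_counts)
print(prediction)
```

Because the features are treated as unrelated, the score for a class is just the prior multiplied by one probability per feature, which keeps the calculation simple even with many features.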

For our purposes, we will use Naive Bayes to predict the type of insurance a person has in the DoctorAUS dataset in the pydataset module. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Next, we will load our dataset DoctorAUS. Then we will separate the independent variables that we will use from the dependent variable of insurance into two different datasets. If you want to know more about the dataset and the variables you can type data("DoctorAUS", show_doc=True).

df=data("DoctorAUS")
X=df[['age','income','sex','illness','actdays','hscore','doctorco','nondocco','hospadmi','hospdays','medecine','prescrib']]
y=df['insurance']

Now, we will create our train and test datasets. We will do a 70/30 split. We will also use Gaussian Naive Bayes as our algorithm. This algorithm assumes the data is normally distributed. There are other algorithms available for Naive Bayes as well.  We will also create our model with the .fit function.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=GaussianNB()
clf.fit(X_train,y_train)

Finally, we will predict with our model and run the classification report to determine the success of the model.

y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))

[Figure: classification report for the Gaussian Naive Bayes model]

You can see that our overall numbers are not that great. This means that the current algorithm is probably not the best choice for classification. Of course, there could be other problems as well that need to be explored.

Conclusion

This post was simply a demonstration of how to conduct an analysis with Naive Bayes using Python. The process is not all that complicated and is similar to other algorithms that are used.

K Nearest Neighbor Regression with Python

K Nearest Neighbor Regression (KNN) works in much the same way as KNN for classification. The difference lies in the characteristics of the dependent variable. With classification KNN the dependent variable is categorical. With regression KNN the dependent variable is continuous. Both involve the use of neighboring examples to predict the class or value of other examples.
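The only change from the classification case is how the neighbors are combined: their values are averaged instead of counted. Below is a minimal sketch with made-up data and a hypothetical function name.

```python
import math
from statistics import mean

def knn_regress(train_points, train_targets, new_point, k=3):
    """Predict a numeric value as the mean target of the k nearest training points."""
    ranked = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    return mean(target for _, target in ranked[:k])

# Made-up (age, income) pairs with years of education as the target
train_points = [(25, 2.0), (30, 3.5), (45, 6.0), (50, 5.0), (60, 8.0)]
train_targets = [10, 12, 16, 14, 18]

estimate = knn_regress(train_points, train_targets, (48, 5.5), k=3)
print(estimate)
```

The three nearest neighbors have targets of 14, 16, and 18, so the prediction is their mean.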

This post will provide an example of KNN regression using the turnout dataset from the pydataset module. Our purpose will be to predict the years of education of a voter through the use of other variables in the dataset. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

We now need to set up our data. We need to load our actual dataset. Then we need to separate the independent and dependent variables. Once this is done we need to create our train and test sets using the train_test_split function. Below is the code to accomplish each of these steps.

df=data("turnout")
X=df[['age','income','vote']]
y=df['educate']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

We are now ready to train our model. We need to call the function we will use and determine the size of K, which will be 11 in our case. Then we need to train our model and then predict with it. Lastly, we will print out the mean squared error. This value is useful for comparing models but does not have much value by itself. The MSE is calculated by comparing the actual test set with the predicted test data. The code is below

clf=KNeighborsRegressor(11)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(mean_squared_error(y_test,y_pred))
9.239

If we were to continue with model development we may look for ways to improve our MSE through different methods such as regular linear regression. However, for our purposes this is adequate.

Conclusion

This post provides an example of regression with KNN in Python. This tool is a practical and simple way to make numeric predictions that can be accurate at times.

Support Vector Machines Regression with Python

This post will provide an example of how to do regression with support vector machines (SVM). SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.

The steps that we will use are listed below

  1. Data preparation
  2. Model Development

We will use two different models in our analysis: LinearSVR and SVR. LinearSVR fits a linear model, while SVR uses the rbf kernel by default, so the difference between the two has to do with how the regression function is calculated.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is to predict family income (faminc), which is a continuous variable. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP'))
df=df.dropna()

As in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.

dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df = (df - df.min()) / (df.max() - df.min())
df.head()

[Figure: output of df.head() after dummy coding and scaling]

We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']

We can now move to model development

Model Development

We now need to create our train and test sets for our X and y datasets. We will do a 70/30 split of the data. Below is the code.

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Next, we will create our two models with the code below.

h1=svm.SVR()
h2=svm.LinearSVR()

We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better. We will use the .fit() function to train the model and the .predict() function to test the model.

[Image: code and output for the first model]

The mse was 0.27. This number means little on its own and is only beneficial for comparison reasons. Therefore, the second model will be judged as better or worse depending on whether its mse is lower or higher than 0.27. Below are the results of the second model.

[Image: code and output for the second model]

We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.

Conclusion

This post provided an example of how to use SVM for regression.

Support Vector Machines Classification with Python

Support vector machines (SVM) is an algorithm used to fit non-linear models. The details are complex but to put it simply  SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex especially if you are unaware of what a vector is as defined in algebra.

This post will provide an example of SVM using Python broken into the following steps.

  1. Data preparation
  2. Model Development

We will use two different kernels in our analysis: the linear kernel and the rbf kernel. The difference in terms of kernels has to do with how the boundaries between the different groups are made.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. We want to predict if someone is single or not. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()

[Figure: output of df.head() for the OFP dataset]

Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.

dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

For each variable, we did the following

  1. Created a dummy in the dummy dataset
  2. Combined the dummy variable with our df dataset
  3. Renamed the dummy variable based on yes or no
  4. Dropped the other dummy variable from the dataset. Python creates two dummies instead of one.
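The net effect of these four steps is one 0/1 indicator per text variable. As a plain-Python sketch of that outcome (the helper make_dummy and the sample values are made up, and this is not the pandas code used above):

```python
def make_dummy(values, keep):
    """Return 1 where the value equals `keep` and 0 otherwise."""
    return [1 if value == keep else 0 for value in values]

# Made-up values standing in for two columns of the dataset
employed = ["yes", "no", "no", "yes"]
maried = ["yes", "yes", "no", "no"]

job = make_dummy(employed, keep="yes")   # 1 means the person has a job
single = make_dummy(maried, keep="no")   # 1 means the person is not married
print(job, single)
```

Keeping only one of the two dummies loses no information, because the dropped column is always the opposite of the one that remains.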

If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.

df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()

[Figure: output of df.head() after removing the unneeded variables]

This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.

df = (df - df.min()) / (df.max() - df.min())
df.head()

[Figure: output of df.head() after scaling]
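The scaling formula above is min-max scaling, which maps every column onto the range 0 to 1. Here is a small sketch of the same arithmetic on a single made-up list. The guard for a constant column is an assumption of this sketch; the pandas one-liner would produce missing values in that case instead.

```python
def min_max_scale(values):
    """Rescale values to the 0-1 range using (x - min) / (max - min)."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption for this sketch: a constant column is mapped to zeros
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]

ages = [66, 70, 74, 86]
scaled = min_max_scale(ages)
print(scaled)
```

The smallest value becomes 0, the largest becomes 1, and everything else falls in between, so no single variable dominates the distance calculations that SVM relies on.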

We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']

We can now move to model development

Model Development

We need to make our test and train sets first. We will use a 70/30 split.

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.

h1=svm.LinearSVC(C=1)
h2=svm.SVC(kernel='rbf',degree=3,gamma=0.001,C=1.0)

The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.

[Image: code and results for the first model]

The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single (that is, married) while 1 means single. Below are the results for model 2.

[Image: code and results for the second model]

You can see the results are similar with the first model having a slight edge. The second model really struggles with predicting people who are actually single. You can see that the recall in particular is really poor.

Conclusion

This post showed how to use SVM in Python. How this algorithm works can be somewhat confusing. However, it can be powerful if used appropriately.

Linear Discriminant Analysis in Python

Linear discriminant analysis (LDA) is a classification algorithm commonly used in data science. In this post, we will learn how to use LDA with Python. The steps we will follow are listed below.

  1. Data preparation
  2. Model training and evaluation

Data Preparation

We will be using the bioChemists dataset which comes from the pydataset module. We want to predict whether someone is married or single based on academic output and prestige. Below is some initial code.

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Now we will load our data and take a quick look at it using the .head() function.

[Image: code loading the bioChemists dataset and the output of .head()]

There are two variables that contain text, so we need to convert these two into dummy variables for our analysis. The code is below with the output.

[Image: code creating the dummy variables, with output]

Here is what we did.

  1. We created the dummy variable by using the .get_dummies() function.
  2. We saved the output in an object called dummy
  3. We then combined the dummy and df dataset with the .concat() function
  4. We repeated this process for the second variable

The output shows that we have our original variables and the dummy variables. However, we do not need all of this information. Therefore, we will create a dataset that has the X variables we will use and a separate dataset that will have our y values. Below is the code.

X=df[['Men','kid5','phd','ment','art']]
y=df['Married']

The X dataset has our five independent variables and the y dataset has our dependent variable which is married or not. We can now split our data into a train and test set. The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

The data was split 70% for training and 30% for testing. We made a train and test set for the independent and dependent variables which meant we made 4 sets altogether. We can now proceed to model development and testing

Model Training and Testing

Below is the code to run our LDA model. We will use the .fit() function for this.

clf=LDA()
clf.fit(X_train,y_train)

We will now use this model to predict using the .predict function

y_pred=clf.predict(X_test)

Now for the results, we will use the classification_report function to get all of the metrics associated with a confusion matrix.

[Image: classification report for the LDA model]

The interpretation of this information is described in another place. For our purposes, we have an accuracy of 71% for our prediction.  Below is a visual of our model using the ROC curve.

[Image: code and plot of the ROC curve]

Here is what we did

  1. We had to calculate the roc_curve for the model this is explained in detail here
  2. Next, we plotted our own curve and compared to a baseline curve which is the dotted lines.

An area under the ROC curve of 0.67 is considered fair by many. Our classification model is not that great but there are worse models out there.
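One way to read the area under the ROC curve is as the share of positive/negative pairs in which the model gives the positive example the higher score. The sketch below computes it that way on made-up scores; the function name auc_by_pairs is hypothetical, and this is not how the plot above was produced.

```python
def auc_by_pairs(y_true, scores):
    """Share of (positive, negative) pairs in which the positive example has the higher score."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted scores
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
auc = auc_by_pairs(y_true, scores)
print(round(auc, 2))
```

A value of 0.5 means the model ranks pairs no better than chance and 1.0 means it ranks every pair correctly, which is why 0.67 is only fair.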

Conclusion

This post went through an example of developing and evaluating a linear discriminant model. To do this you need to prepare the data, train the model, and evaluate.

Factor Analysis in Python

Factor analysis (FA) is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The differences are highly technical but include the fact that FA does not have an orthogonal decomposition and that FA assumes there are latent variables influencing the observed variables in the model. For FA the goal is the explanation of the covariance among the observed variables present.

Our purpose here will be to use the bioChemists dataset from the pydataset module and perform a FA that creates two factors. This dataset has data on the people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt

We now need to prepare the dataset. The code is below

df = data('bioChemists')
df=df.iloc[1:250]
X=df[['art','kid5','phd','ment']]

In the code above, we did the following

  1. The first line creates our dataframe called "df", which is made up of the bioChemists dataset
  2. The second line reduces the df to about 250 rows (iloc[1:250] keeps the 249 rows at positions 1 through 249). This is done for the visual that we will make. To take the whole dataset and graph it would make a giant blob of color that would be hard to interpret.
  3. The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data("bioChemists",show_doc=True)

In the code below we need to set the number of factors we want and then run the model.

fact_2c=FactorAnalysis(n_components=2)
X_factor=fact_2c.fit_transform(X)

The first line tells Python how many factors we want. The second line takes this information along with our revised dataset X to create the actual factors that we want. We can now make our visualization.

To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.

thisdict = {
"Single": 1,
"Married": 2,}

Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.

plt.scatter(X_factor[:,0],X_factor[:,1],c=df.mar.map(thisdict),alpha=.8,edgecolors='none')

[Figure: scatter plot of the two factors colored by marital status]

You can perhaps tell why we created the dictionary now. By mapping the dictionary to the mar variable it automatically changed every single and married entry in the df dataset to a 1 or 2. The c argument needs a number in order to set a color and this is what the dictionary was able to supply it with.

You can see that two factors do not do a good job of separating the people by their marital status. Additional factors may be useful but after two factors it becomes impossible to visualize them.

Conclusion

This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

Analyzing Twitter Data in Python

In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barack Obama.

  • Conduct a sentiment analysis
  • Create a word cloud

This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.

  1. Obtain all of the necessary information from your twitter apps account
  2. Download the tweets & clean
  3. Perform the analysis

Before we begin, here is a list of modules we will need to load to complete our analysis

import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy

Obtain all Needed Information

From your twitter app account, you need the following information

  • App key
  • App key secret
  • Access token
  • Access token secret

All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.

TWITTER_APP_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXX"
TWITTER_APP_KEY_SECRET="XXXXXXXXXXXXXXXXXXX"
TWITTER_ACCESS_TOKEN="XXXXXXXXXXXXXXXXXXXXX"
TWITTER_ACCESS_TOKEN_SECRET="XXXXXXXXXXXXXX"
t=twython.Twython(app_key=TWITTER_APP_KEY,app_secret=TWITTER_APP_KEY_SECRET,oauth_token=TWITTER_ACCESS_TOKEN,oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)

In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.

Next, we need to create a function that will pull the tweets from Twitter. Below is the code.

def get_tweets(twython_object,query,n):
   count=0
   result_generator=twython_object.cursor(twython_object.search,q=query)

   result_set=[]
   for r in result_generator:
      result_set.append(r['text'])
      count+=1
      if count ==n: break

   return result_set

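In short, the function creates a cursor over the search results for the query, appends the text of each tweet to a list, and stops once n tweets have been collected. We can now download the tweets.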

Downloading Tweets & Clean

Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.

There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.

tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#obama',1500)]

The get_tweets function is also used in the code above along with our twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis.
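To see what the cleaning step does to a single tweet, here is a small sketch that applies the same regular expression to a made-up string. The helper name clean_tweet and the sample text are hypothetical.

```python
import re

# The same character class used in the post
PUNCTUATION = r'[-.#/?!.":;()\']'

def clean_tweet(text):
    """Lowercase the text and replace the listed punctuation characters with spaces."""
    return re.sub(PUNCTUATION, ' ', text.lower())

cleaned = clean_tweet('Great Speech! #Vote')
print(cleaned.split())
```

Each removed character becomes a space, so the cleaned text can contain runs of spaces. Note also that the comma is not in this character class, so commas are left in place.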

Analysis

To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.

positive_words=open('XXXXXXXXXXXX').read().split('\n')
negative_words=open('XXXXXXXXXXXX').read().split('\n')

We now will make a function to calculate the sentiment

def sentiment_score(text,pos_list,neg_list):
   positive_score=0
   negative_score=0

   for w in text.split(' '):
      if w in pos_list:positive_score+=1
      if w in neg_list:negative_score+=1
   return positive_score-negative_score
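Before running the function on real tweets, it can help to try it on a couple of made-up sentences. The sketch below repeats the function so that it runs on its own and uses tiny hypothetical word lists in place of the downloaded dictionaries.

```python
def sentiment_score(text, pos_list, neg_list):
    """Count positive words minus negative words in a space-separated text."""
    positive_score = 0
    negative_score = 0
    for w in text.split(' '):
        if w in pos_list:
            positive_score += 1
        if w in neg_list:
            negative_score += 1
    return positive_score - negative_score

# Tiny made-up word lists; the post loads much longer lists from files
positive_words = {"good", "great", "win"}
negative_words = {"bad", "sad", "lose"}

score_a = sentiment_score("great speech good crowd", positive_words, negative_words)
score_b = sentiment_score("bad day sad news great coffee", positive_words, negative_words)
print(score_a, score_b)
```

A score above zero means more positive than negative words were found, and a score below zero means the opposite. Storing the word lists as sets, as done here, makes the membership checks faster than with plain lists.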

Now we create an empty dictionary and run the analysis for Trump and then for Obama

tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['obama']]

Now we can make visuals of our results with the code below

trump=plt.hist(tweets_sentiment['trump'],5)
obama=plt.hist(tweets_sentiment['obama'],5)

Obama is on the left and Trump is on the right. It seems that Trump tweets are consistently more positive. Below are the means for both.

numpy.mean(tweets_sentiment['trump'])
Out[133]: 0.36363636363636365

numpy.mean(tweets_sentiment['obama'])
Out[134]: 0.2222222222222222

Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud

[Image: code and word cloud for the Trump tweets]

Here is the code for the Obama word cloud

[Image: code and word cloud for the Obama tweets]

A lot of speculating can be made from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets which changes the results.

Conclusion

This post provided an example of how to download and analyze tweets from twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

Word Clouds in Python

Word clouds are a type of data visualization in which various words from a dataset are displayed. Words that are larger in the word cloud are more common and words in the middle are also more common. In addition, some word clouds even use various colors to indicate importance.

This post will provide an example of how to make a word cloud using Python. We will be using the “Women’s E-Commerce Clothing Reviews” available on the Kaggle website. We are going to only use the text reviews to make our word cloud even though other data is in the dataset. To prepare our dataset for making the word cloud we need to do the following.

  1. Lowercase all words
  2. Remove punctuation
  3. Remove stopwords

After completing these steps we can make the word cloud. First, we need to load all of the necessary modules.

import pandas as pd
import re
from nltk.corpus import stopwords
import wordcloud
import matplotlib.pyplot as plt

We now need to load our dataset. We will store it as the object ‘df’.

df=pd.read_csv('YOUR LOCATION HERE')
df.head()

[Image: output of df.head()]

It’s hard to read but we will be working only with the “Review Text” column as this has the text data we need. Here is what our column looks like up close.

df['Review Text'].head()

Out[244]: 
0 Absolutely wonderful - silky and sexy and comf...
1 Love this dress! it's sooo pretty. i happene...
2 I had such high hopes for this dress and reall...
3 I love, love, love this jumpsuit. it's fun, fl...
4 This shirt is very flattering to all due to th...
Name: Review Text, dtype: object

We will now make all words lower case and remove punctuation with the code below.

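df["Review Text"]=df['Review Text'].str.lower()
# regex=True is needed in recent pandas versions, which otherwise treat the pattern as literal text
df["Review Text"]=df['Review Text'].str.replace(r'[-./?!,":;()\']',' ',regex=True)

The first line in the code above lowercases all words. The second line removes the punctuation. The second line is trickier as you have to tell Python exactly what punctuation you want to remove and what to replace it with. Everything we want to remove is listed inside the square brackets in the first set of single quotes. We replace each punctuation mark with a space, which is the second set of single quotation marks with a space in the middle. The r before the opening quote marks the pattern as a raw string, so Python passes the backslash to the regular expression engine unchanged. The regex=True argument tells pandas to treat the pattern as a regular expression, which newer versions of pandas no longer do by default.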
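
The same two steps can be tried on a single string with Python's built-in re module. This is only a sketch of the idea outside of pandas; the function name clean_text and the sample sentence are our own.

```python
import re

def clean_text(text):
    """Lowercase a string and replace the listed punctuation with spaces."""
    text = text.lower()
    # Same character class as in the post; the r prefix makes it a raw string
    return re.sub(r'[-./?!,":;()\']', ' ', text)

sample = "Love this dress! it's sooo pretty."
print(clean_text(sample))
```

Each punctuation mark becomes a space rather than disappearing, which is why "it's" turns into "it s" in the output further down.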

Here is what our data looks like after making these two changes.

df['Review Text'].head()

Out[249]: 
0 absolutely wonderful silky and sexy and comf...
1 love this dress it s sooo pretty i happene...
2 i had such high hopes for this dress and reall...
3 i love love love this jumpsuit it s fun fl...
4 this shirt is very flattering to all due to th...
Name: Review Text, dtype: object

All the words are in lowercase. In addition, you can see that the dash in line 0 is gone, as is all the punctuation in the other lines. We now need to remove stopwords. Stopwords are the functional words that glue the meaning together without carrying much meaning themselves. Examples include and, for, but, etc. We are trying to make a cloud of substantial words and not stopwords so these words need to be removed.

If you have never done this on your computer before you may need to import the nltk module and run nltk.download_gui(). Once this is done you need to download the stopwords package. Alternatively, running nltk.download('stopwords') downloads the package directly.

Below is the code for removing the stopwords. First, we need to load the stopwords. This is done below.

stopwords_list=stopwords.words('english')
stopwords_list=stopwords_list+['to']

We create an object called stopwords_list which has all the English stopwords. The second line just adds the word ‘to’ to the list. Next, we need to make an object that will look for the pattern of words we want to remove. Below is the code.

pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

This code basically tells Python what to look for. The .join function glues every word in stopwords_list together with the | symbol, which means “or” in a regular expression. The .format function places that long list of alternatives inside the parentheses, and the \b on each side marks a word boundary, so only whole words that appear in the list are matched. We will now take this object called ‘pat’ and use it on our ‘Review Text’ variable.
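
Here is a small sketch of the same pattern at work on one sentence. The short stopword list below is hypothetical and only stands in for NLTK's full English list, and remove_stopwords is a helper name of our own.

```python
import re

# Hypothetical short stopword list; the post uses NLTK's full English list
demo_stopwords = ['i', 'this', 'and', 'it', 's', 'to']

# Join the words with | ("or") and wrap them in word boundaries
pat = r'\b(?:{})\b'.format('|'.join(demo_stopwords))

def remove_stopwords(text):
    """Delete every whole-word match of the stopword pattern."""
    return re.sub(pat, '', text)

cleaned = remove_stopwords('i love this dress and it s fun to wear')
print(pat)
print(cleaned.split())
```

Because of the word boundaries, the "s" inside "dress" and the "i" inside "this" are left alone. Only the stand-alone words are removed.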

df['Split Text'] = df['Review Text'].str.replace(pat, '', regex=True)
df['Split Text'].head()

Out[258]: 
0      absolutely wonderful   silky  sexy  comfortable
1    love  dress     sooo pretty    happened  find ...
2       high hopes   dress  really wanted   work   ...
3     love  love  love  jumpsuit    fun  flirty   f...
4     shirt   flattering   due   adjustable front t...
Name: Split Text, dtype: object

You can see that we have created a new column called ‘Split Text’ and the result is text that has lost many stop words.

We are now ready to make our word cloud. Below is the code and the output.

wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(' '.join(map(str, df['Split Text'])))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud1)
plt.axis('off')

[Image: word cloud of the review text]

This code is complex. We used the WordCloud function and we had to use generate, map, and join as inner functions. All of these functions were needed to take the words from the dataframe and turn them into one simple string of text for the WordCloud function.
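
The map and join step can be tried on its own with a plain list. In this sketch the list rows is a hypothetical stand-in for the ‘Split Text’ column, with one missing review included to show why every value is converted to a string first.

```python
from collections import Counter

# Hypothetical column values; float('nan') stands in for a missing review
rows = ['love dress', float('nan'), 'love fabric']

# map(str, ...) turns every value into a string so join does not fail on the missing value
text = ' '.join(map(str, rows))
print(text)

# Word counts like these are what decide how large each word is drawn
counts = Counter(text.split())
print(counts.most_common(1))
```

Note that the missing value ends up in the text as the word nan, so dropping missing reviews before joining may be preferable.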

The rest of the code is common to matplotlib so it does not require much explanation. As you look at the word cloud, you can see that the most common words include top, look, dress, shirt, fabric, etc. This is reasonable given that these are women’s reviews of clothing.

Conclusion

This post provided an example of text analysis using word clouds in Python. The insights here are primarily descriptive in nature. This means that if the desire is prediction or classification, additional tools would need to be built upon what is shown here.

KMeans Clustering in Python

Kmeans clustering is a technique in which the examples in a dataset are divided through segmentation. The segmentation relies on a statistical analysis in which examples within a group are more similar to each other than to examples outside of the group.

The application of this is that it provides the analyst with various groups that have similar characteristics, which can be used to tailor services in various fields such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.

  1. Data preparation
  2. Determine the number of clusters
  3. Conduct analysis

Data Preparation

Our data for this example comes from the sat.act dataset available in the pydataset module. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

We will now load our dataset and drop any NAs that may be present.

[Image: code and output for loading the sat.act data and dropping NAs]

You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.

Determine the Number of Clusters

Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.

The elbow method measures the within-cluster error for each number of clusters. As the number of clusters increases this error decreases. However, at a certain point the return on adding clusters becomes minimal and this is known as the elbow. Below is the code to calculate this.

distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])

Here is what we did.

  1. We made an empty list called ‘distortions’ where we will save our results.
  2. In the second line, we told Python the range of clusters we want to consider. Because range(1,10) stops before 10, we consider anywhere from 1 to 9 clusters.
  3. In lines 3 and 4, we use a for loop to fit a kmeans model to the df object for each number of clusters.
  4. In line 5, we save the average distance from each example to its nearest cluster center in the distortions list.
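
The quantity saved in the last step can be worked out by hand on a tiny example. The sketch below uses only the standard library; the four points and the cluster centers are made up, and the function name distortion is our own.

```python
from math import dist  # Euclidean distance, Python 3.8+

def distortion(points, centers):
    """Average distance from each point to its nearest center."""
    nearest = [min(dist(p, c) for c in centers) for p in points]
    return sum(nearest) / len(points)

# Hypothetical 2-D points forming two obvious groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

one_center = [(5, 1)]             # k = 1: one center in the middle
two_centers = [(0, 1), (10, 1)]   # k = 2: one center per group

print(distortion(points, one_center))
print(distortion(points, two_centers))
```

Going from one center to two cuts the distortion sharply because each group now has its own center. Adding a third center to this data would help much less, and that flattening is what the elbow looks like.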

Below is a visual to determine the number of clusters.

plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')

[Image: elbow plot of distortion against k]

The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.

KMeans Analysis

The code for the kmeans analysis is as follows.

km=KMeans(3,init='k-means++',random_state=3425)
km.fit(df.values)

  1. We use the KMeans function and tell Python the number of clusters, the type of initialization, and we set the seed with the random_state argument. All this is saved in the object called km.
  2. The km object has the .fit function used on it with df.values.
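
To get a feel for what fitting does, here is a bare-bones sketch of the kmeans idea on one-dimensional data. It is not scikit-learn's implementation: there is no k-means++ initialization, the starting centers are handed in directly, and the data are made up.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain assign-and-update iterations on 1-D data with given starting centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center
        labels = [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # nothing moved, so the clusters are stable
        centers = new_centers
    return labels, centers

labels, centers = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[1, 12])
print(labels)
print(centers)
```

The labels returned here play the same role as the cluster numbers that the predict step produces for each example.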

Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.

[Image: code and output for the predict step]

You can see we created a new variable called predict. This variable contains the kmeans algorithm's prediction of which group each example belongs to. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variables in the dataset.

[Image: descriptive statistics for the three clusters]

It is clear that the clusters are mainly divided based on the performance on the various tests used. In the last piece of code, gender is used. 1 represents male and 2 represents female.

We will now make a visual of the clusters using two dimensions. First, we need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a string using our cluster map dictionary.

clust_map={0:'Weak',1:'Average',2:'Strong'}
df['perf']=df.predict.map(clust_map)

Next, we make a different dictionary to color the points in our graph.

d_color={'Weak':'y','Average':'r','Strong':'g'}

[Image: code and scatterplot of the clusters]

Here is what is happening in the code above.

  1. We set the ax object to a value.
  2. A for loop is used to go through every example in clust_map.values so that they are colored according to the color dictionary.
  3. Lastly, a plot is called which lines up the perf and clust values for color.
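
The two lookups behind the coloring can be sketched without any plotting library. The list predicted below is a hypothetical stand-in for the predict column; the two dictionaries are the ones from this post.

```python
clust_map = {0: 'Weak', 1: 'Average', 2: 'Strong'}
d_color = {'Weak': 'y', 'Average': 'r', 'Strong': 'g'}

# Hypothetical cluster labels, standing in for the predict column
predicted = [2, 0, 1, 2]

# Step 1: numeric label -> readable name (what .map(clust_map) does for each row)
perf = [clust_map[p] for p in predicted]

# Step 2: readable name -> matplotlib color code
colors = [d_color[name] for name in perf]

print(perf)
print(colors)
```

Keeping the names and the colors in separate dictionaries means the labels can be reworded without touching the colors.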

The groups are clearly separated when looking at them in two dimensions.

Conclusion

Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.

Random Forest in Python

This post will provide a demonstration of the use of the random forest algorithm in Python. Random forest is similar to decision trees except that instead of one tree a multitude of trees are grown to make predictions. The various trees all vote in terms of how to classify an example and the majority vote is normally the winner. Through making many trees the accuracy of the model normally improves.
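
The voting idea can be shown in a few lines. This is a sketch of the textbook majority vote with made-up predictions from five trees; scikit-learn's forest combines the trees' class probabilities rather than counting hard votes, so it is not the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one example
votes = ['dead', 'censored', 'dead', 'dead', 'censored']
print(majority_vote(votes))
```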

The steps are as follows for the use of random forest.

  1. Data preparation
  2. Model development & evaluation
  3. Model comparison
  4. Determine variable importance

Data Preparation

We will use the cancer dataset from the pydataset module. We want to predict if someone is censored or dead in the status variable. The other variables will be used as predictors. Below is some code that contains all of the modules we will use.

import pandas as pd
import sklearn.ensemble as sk
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

We will now load the cancer data in an object called ‘df’. Then we will remove all NAs using the .dropna() function. Below is the code.

df = data('cancer')
df=df.dropna()

We now need to make two datasets. One dataset, called X, will contain all of the predictor variables. Another dataset, called y, will contain the outcome variable. In the y dataset, we need to change the numerical values to a string. This will make interpretation easier as we will not need to look up what the numbers represent. Below is the code.

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']

Instead of 1 we now have the string “censored” and instead of 2 we now have the string “dead” in the status variable. The final step is to set up our train and test sets. We will do a 70/30 split. We will have a train set for the X and y dataset as well as a test set for the X and y datasets. This means we will have four datasets in all. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
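
To make the 70/30 idea concrete, here is a rough sketch of a shuffle-and-cut split using only the standard library. It is not how train_test_split is implemented, and the function name simple_split is our own.

```python
import random

def simple_split(rows, test_size=0.3, seed=0):
    """Shuffle the rows and cut off the last test_size share as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[:len(shuffled) - n_test], shuffled[len(shuffled) - n_test:]

train, test = simple_split(range(10))
print(len(train), len(test))
```

Fixing the seed plays the same role as random_state above: the same rows land in the train and test sets every time the code is run.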

We are now ready to move to model development.

Model Development and Evaluation

We now need to create our classifier and fit the data to it. This is done with the following code.

clf=sk.RandomForestClassifier(n_estimators=100)
clf=clf.fit(x_train,y_train)

The clf object holds our random forest algorithm. The number of estimators is set to 100. This is the number of trees that will be generated. In the second line of code, we use the .fit function with the training datasets for x and y.

We will now test our model and evaluate it. To do this we will use the .predict() function with the test dataset. Then we will make a confusion matrix followed by common metrics in classification. Below is the code and the output.

[Image: code and output for the confusion matrix and classification metrics]

You can see that our model is good at predicting who is dead but struggles with predicting who is censored. The metrics are reasonable for dead but terrible for censored.
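
For readers new to these metrics, the sketch below counts the cells of a confusion matrix by hand and derives precision and recall from them. The labels are made up and are not the results of the model above; the helper names are our own.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs, i.e. the cells of a confusion matrix."""
    return Counter(zip(actual, predicted))

def precision_recall(actual, predicted, label):
    """Precision and recall for one class label."""
    cells = confusion_counts(actual, predicted)
    tp = cells[(label, label)]
    fp = sum(n for (a, p), n in cells.items() if p == label and a != label)
    fn = sum(n for (a, p), n in cells.items() if a == label and p != label)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels, not the results from the post
actual = ['dead', 'dead', 'censored', 'dead', 'censored']
predicted = ['dead', 'dead', 'dead', 'dead', 'censored']

print(precision_recall(actual, predicted, 'dead'))
print(precision_recall(actual, predicted, 'censored'))
```

In this toy case every truly dead example is found, but only half of the censored ones are, which is the same kind of imbalance described above. The sketch assumes each class is both present and predicted at least once; otherwise the divisions would fail.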

We will now make a second model for the purpose of comparison.

Model Comparison

We will now make a different model for the purpose of comparison. In this model, we will use out-of-bag samples to determine accuracy, set the minimum split size at 5 examples, and require that each leaf has at least 2 examples. Below is the code and the output.

[Image: code and output for the second random forest model]

There was some improvement in classifying people who were censored as well as those who were dead.
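
The out-of-bag idea used for this second model can be illustrated with plain Python. Each tree is trained on rows drawn with replacement, so some rows are never drawn for that tree. The sketch below is only an illustration with a made-up dataset size and seed, not the library's code.

```python
import random

def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement, as is done for each tree."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

n_rows = 10
in_bag = bootstrap_indices(n_rows, seed=1)

# Rows never drawn are "out of bag" for this tree and can be used to test it
out_of_bag = sorted(set(range(n_rows)) - set(in_bag))

print(sorted(in_bag))
print(out_of_bag)
```

Since every tree has its own out-of-bag rows, the forest can estimate its accuracy on data each tree has not seen, without setting aside a separate test set.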

Variable Importance

We will now look at which variables were most important in classifying our examples. Below is the code.

model_ranks=pd.Series(clf.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')

We create an object called model_ranks and we indicate the following.

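  1. Take the importance of each feature from the fitted model
  2. Set the index to the columns in the training dataset of x
  3. Sort the features by importance in ascending order, which places the most important feature at the top of the horizontal bar plot
  4. Make a barplot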
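
The sorting step can be seen with a plain dictionary. The importance scores below are invented for illustration; the real ones come from clf.feature_importances_.

```python
# Hypothetical importance scores; real ones come from clf.feature_importances_
importances = {'age': 0.15, 'time': 0.40, 'meal.cal': 0.25, 'wt.loss': 0.20}

# Ascending sort, matching sort_values(ascending=True) in the post
ranked = sorted(importances.items(), key=lambda item: item[1])

for name, score in ranked:
    print(name, score)
```

A horizontal bar plot draws the first row at the bottom, so an ascending sort leaves the largest bar at the top.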

Below is the output.

[Image: horizontal bar plot of variable importance]

You can see that time is the strongest classifier. How long someone has had cancer is the strongest predictor of whether they are censored or dead. Next is the number of calories per meal followed by weight loss and age.

Conclusion

Here we learned how to use random forest in Python. This is another tool commonly used in the context of machine learning.

Decision Trees in Python

Decision trees are used in machine learning. They are easy to understand and are able to deal with data that is less than ideal. In addition, because of the pictorial nature of the results decision trees are easy for people to interpret. We are going to use the ‘cancer’ dataset to predict mortality based on several independent variables.

We will follow the steps below for our decision tree analysis.

  1. Data preparation
  2. Model development
  3. Model evaluation

Data Preparation

We need to load the following modules in order to complete this analysis.

import pandas as pd
import statsmodels.api as sm
import sklearn
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
import matplotlib.pyplot as plt
from io import StringIO  # sklearn.externals.six is no longer part of scikit-learn
from IPython.display import Image 
from sklearn.tree import export_graphviz
import pydotplus

The ‘cancer’ dataset comes from the ‘pydataset’ module. You can learn more about the dataset by typing the following.

data('cancer', show_doc=True)

This provides all you need to know about our dataset in terms of what each variable is measuring. We need to load our data as ‘df’. In addition, we need to remove rows with missing values and this is done below.

df = data('cancer')
len(df)
Out[58]: 228
df=df.dropna()
len(df)
Out[59]: 167

The initial number of rows in the data set was 228. After removing missing data it dropped to 167. We now need to set up our list of independent variables and a second list with the dependent variable. While doing this, we need to recode our dependent variable “status” so that the numerical values are replaced with a string. This will help us to interpret our decision tree later. Below is the code.

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']

Next,  we need to make our train and test sets using the train_test_split function.  We want a 70/30 split. The code is below.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to develop our model.

Model Development

The code for the model is below.

clf=tree.DecisionTreeClassifier(min_samples_split=10)
clf=clf.fit(x_train,y_train)

We first make an object called “clf” which calls the DecisionTreeClassifier. Inside the parentheses, we tell Python that a node must contain at least 10 examples before it can be split. The second line uses the .fit function on the training datasets and saves the fitted model back to “clf”.

We can also make a visual of our decision tree.

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, 
filled=True, rounded=True,feature_names=list(x_train.columns.values),
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

[Image: the fitted decision tree]

If we interpret the nodes furthest to the left we get the following:

  • If a person has had cancer less than 171 days and
  • If the person is less than 74.5 years old then
  • The person is dead
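
Reading a path through the tree is the same as reading a chain of if statements. The sketch below encodes only the left-most path described above, using the two thresholds quoted in the text. The function name is our own and the rest of the tree is not modeled.

```python
def leftmost_path(time, age):
    """Follow only the left-most branch described in the post.

    The thresholds (171 days, 74.5 years) are the ones quoted in the text;
    any other path through the tree is not modeled here.
    """
    if time < 171 and age < 74.5:
        return 'dead'
    return None  # the example falls into some other branch

print(leftmost_path(time=100, age=60))
print(leftmost_path(time=300, age=60))
```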

If you look closely, every node is classified as ‘dead’. This may indicate a problem with our model. The evaluation metrics are below.

Model Evaluation

We will use the .crosstab function and the metrics classification functions.

[Image: code and output for the crosstab and classification metrics]

You can see that the metrics are not that great in general. This may be why everything was classified as ‘dead’. Another reason is that few people were classified as ‘censored’ in the dataset.

Conclusion

Decision trees are another machine learning tool. Python allows you to develop trees rather quickly that can provide insights into how to take action.

Multiple Regression in Python

In this post, we will go through the process of setting up a regression model with a training and testing set using Python. We will use the insurance dataset from Kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

  1. Data preparation
  2. Model training
  3. Model testing

Data Preparation

Below is a list of the modules we will need in order to complete the analysis.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model,model_selection, feature_selection,preprocessing
import statsmodels.api as sm  # sm.OLS is provided by statsmodels.api
from statsmodels.tools.eval_measures import mse
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the  .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')
insure.head()

[Image: output of insure.head()]

We need to create some dummy variables for sex, smoker, and region. We will address that in a moment. Right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.

[Image: code and output for .describe() and .corr()]

The descriptives are left for your own interpretation. As for the correlations, they are generally weak, which suggests that the continuous variables do not overlap much with one another and that regression may be appropriate.

As mentioned earlier, we need to make dummy variables for sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

  1. Use the pd.get_dummies function from pandas to create the dummy
  2. Save the dummy variable in an object called ‘dummy’
  3. Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
  4. Repeat this three times

Below is the code for doing this.

# dtype=int keeps the dummies as 0/1 numbers rather than True/False
dummy=pd.get_dummies(insure['sex'],dtype=int)
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['smoker'],dtype=int)
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['region'],dtype=int)
insure=pd.concat([insure,dummy],axis=1)
insure.head()

[Image: output of insure.head() with the dummy variables added]

The .get_dummies function requires the name of the dataframe and in the brackets the name of the variable to convert. The .concat function requires the names of the two datasets to combine as well as the axis on which to perform it.
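
What a dummy variable looks like can be shown with a plain list. This is a hand-rolled sketch of the idea, not the pandas implementation; the function name one_hot and the three smoker values are made up.

```python
def one_hot(values):
    """Return one 0/1 list per category, keyed by category name."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical smoker column
smoker = ['yes', 'no', 'yes']
print(one_hot(smoker))
```

Each category gets its own column, and every row has a 1 in exactly one of them.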

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

y = insure.charges
insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

Model Training

Our train and test sets are made with the model_selection.train_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can now run our regression analysis. This requires the use of the .OLS function from the statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable (y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. The .fit() function is then called on this setup to estimate the model.
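
To see what the intercept is doing, here is a hand calculation of least squares for the simplest case of a single predictor. It is a simplification of what OLS does with many predictors, and the data are made up so that they fall exactly on the line y = 1 + 2x.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the term add_constant makes room for
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))
```

Without a constant column the fitted line would be forced through zero, which is why add_constant is used before fitting.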

To see the output you need to use the .summary() function as shown below.

answer.summary()

[Image: output of answer.summary()]

The assumption is that you know regression but are reading this post to learn Python. Therefore, we will not go into great detail about the results. The r-square is strong; however, region and gender are not statistically significant.

We will now move to model testing.

Model Testing

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

print(mse(ypred,y_test))
33678660.23480476
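
The mean squared error itself is a short calculation. Here is a sketch with made-up numbers; the function below is our own and is not the statsmodels mse function.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical charges, in the same units as the outcome
print(mean_squared_error([100, 200, 300], [110, 190, 330]))
```

Because the errors are squared, the result is in squared units. Taking the square root of the value printed above gives roughly 5,800, which is in the same units as charges and easier to interpret.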

For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predicted values and test set values. This is an alternative metric for assessing a model.

[Image: code, scatterplot, and correlation matrix of predicted and actual values]

You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involve the .concat() function. The correlation is high at 0.86 which indicates the model is good at accurately predicting the values. This is confirmed with the scatterplot which is almost a straight line.
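
The correlation between predicted and actual values can also be computed by hand. The sketch below uses only the standard library and made-up numbers; it is not the code from this post, and the function name pearson_r is our own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical predicted and actual values
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 1.9, 3.2, 3.9]
print(round(pearson_r(predicted, actual), 3))
```

A value close to 1 means the predictions rise and fall together with the actual values, which is what a nearly straight scatterplot shows visually.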

Conclusion

In this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and tested the model with an evaluation of it.