Tag Archives: data science

white caution cone on keyboard

OSEMN Framework for Data Analysis

Analyzing data can be extremely challenging. It is often common to not know where to begin. Perhaps you know some basic ways of analyzing data, but it is unclear what should be done first and what should follow.

This is where a data analysis framework can come in handy. Having a basic step-by-step process, you always follow can make it much easier to start and complete a project. One example of a data analysis framework is the OSEMN model. The OSEMN model is an acronym that defines each step of the data analysis process. The steps are as follows

  • Obtain
  • Scrub
  • Explore
  • Model
  • INterpret

We will now go through each of these steps.

Obtain

The first step of this model is obtaining data. Depending on the context, this can be done for you because the stakeholders have already provided data for analysis. In other situations, you have to find the data you need to answer whatever questions you are looking for insights into.

ads

Data can be found anywhere, so the obtained data must help achieve the goals. It is also necessary to have the skills or connections to get the data. For example, data may have to be scraped from the web, pulled from a database, or even collected through the development of surveys. Each of these examples requires specific skills needed for success.

Scrub

Once data is obtained, it must be scrubbed or cleaned. Completing these tasks requires several things. Duplicates need to be removed, missing data must be addressed, outlier considered, the shape of the data addressed, among other tasks. In addition, it is often useful to look at descriptive statistics and visualizations to identify potential problems. Lastly, you often need to clean categories within a variable if they are misspelled or involve other errors such as punctuation and converting numbers.

The concepts mentioned above are just some of the steps that need to be taken to clean data. Dirty will lead to bad insights. Therefore, this must be done well.

Explore

Exploring data and scrubbing data will often happen at the same time. With exploration, you are looking for insights into your data. One of the easiest ways to do this is to drill down as far as possible into your continuous variables by segmenting with the categorical variables.

For example, you might look at average scores by gender, then you look at average scores by gender and major, then you might look at average scores by gender, major, and class. Each time you find slightly different patterns that may be useful or not. Another approach would be to look at scatterplots that consider different combinations of categorical variables.

If the objectives are clear, it can help you focus your exploration on reducing the chance of presenting non-relevant information to your stakeholders. Suppose the stakeholders want to know the average scores of women. In that case, there is maybe no benefit to knowing the average score of male music majors.

Model

Modeling involves regression/classification in the case of supervised learning or segmentation in the case of unsupervised learning. Modeling in the context of supervised learning helps in predicting future values, while segmentation helps develop insights into groups within a dataset that have similar traits.

Once again, the objectives of the analysis shape what tool to use in this context. If you want to predict enrollment, then regression tools may be appropriate. If you want what car a person will buy, then classification may help. If, on the other hand, you want to know what are some of the traits of high-performing students, then unsupervised approaches may be the best option.

INterpret

Interpreting involves sharing what does all this stuff means. It is truly difficult to explain the intricacies of data analysis to a layman. Therefore, this involves not just analytical techniques but communication skills. Breaking down the complex analysis so that people can understand it is difficult. As such, ideas around storytelling have been developed to help data analysis connect the code with the audience.

Conclusion

The framework provided here is not the only way to approach data analysis. Furthermore, as you become more comfortable with analyzing data, you do not have to limit yourself to the steps or order in which they are performed. Frameworks are intended for getting people started in the creative process of whatever task they are trying to achieve.

Principal Component Analysis with Python VIDEO

Principal component analysis is a tool for reducing the number of variables in a dataset without losing too much information. This is a great way to summarize information or to simplify things for a more complex analysis. The video provides a simple example of how to do this.

ad

T-SNE Visualization and R

It is common in research to want to visualize data in order to search for patterns. When the number of features increases, this can often become even more important. Common tools for visualizing numerous features include principal component analysis and linear discriminant analysis. Not only do these tools work for visualization they can also be beneficial in dimension reduction.

However, the available tools for us are not limited to these two options. Another option for achieving either of these goals is t-Distributed Stochastic Embedding. This relative young algorithm (2008) is the focus of the post. We will explain what it is and provide an example using a simple dataset from the Ecdat package in R.

t-sne Defined

t-sne is a nonlinear dimension reduction visualization tool. Essentially what it does is identify observed clusters. However, it is not a clustering algorithm because it reduces the dimensions (normally to 2) for visualizing. This means that the input features are not longer present in their original form and this limits the ability to make inference. Therefore, t-sne is often used for exploratory purposes.

T-sne non-linear characteristic is what makes it often appear to be superior to PCA, which is linear. Without getting too technical t-sne takes simultaneously a global and local approach to mapping points while PCA can only use a global approach.

The downside to t-sne approach is that it requires a large amount of calculations. The calculations are often pairwise comparisons which can grow exponential in large datasets.

Initial Packages

We will use the “Rtsne” package for the analysis, and we will use the “Fair” dataset from the “Ecdat” package. The “Fair” dataset is data collected from people who had cheated on their spouse. We want to see if we can find patterns among the unfaithful people based on their occupation. Below is some initial code.

library(Rtsne)
library(Ecdat)

Dataset Preparation

To prepare the data, we first remove in rows with missing data using the “na.omit” function. This is saved in a new object called “train”. Next, we change or outcome variable into a factor variable. The categories range from 1 to 9

  1. Farm laborer, day laborer,
  2. Unskilled worker, service worker,
  3. Machine operator, semiskilled worker,
  4. Skilled manual worker, craftsman, police,
  5. Clerical/sales, small farm owner,
  6. Technician, semiprofessional, supervisor,
  7. Small business owner, farm owner, teacher,
  8. Mid-level manager or professional,
  9. Senior manager or professional.

Below is the code.

train<-na.omit(Fair)
train$occupation<-as.factor(train$occupation)

Visualization Preparation

Before we do the analysis we need to set the colors for the different categories. This is done with the code below.

colors<-rainbow(length(unique(train$occupation)))
names(colors)<-unique(train$occupation)

We can now do are analysis. We will use the “Rtsne” function. When you input the dataset you must exclude the dependent variable as well as any other factor variables. You also set the dimensions and the perplexity. Perplexity determines how many neighbors are used to determine the location of the datapoint after the calculations. Verbose just provides information during the calculation. This is useful if you want to know what progress is being made. max_iter is the number of iterations to take to complete the analysis and check_duplicates checks for duplicates which could be a problem in the analysis. Below is the code.

tsne<-Rtsne(train[,-c(1,4,7)],dims=2,perplexity=30,verbose=T,max_iter=1500,check_duplicates=F)
## Performing PCA
## Read the 601 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.05 seconds (sparsity = 0.190597)!
## Learning embedding...
## Iteration 1450: error is 0.280471 (50 iterations in 0.07 seconds)
## Iteration 1500: error is 0.279962 (50 iterations in 0.07 seconds)
## Fitting performed in 2.21 seconds.

Below is the code for making the visual.

plot(tsne$Y,t='n',main='tsne',xlim=c(-30,30),ylim=c(-30,30))
text(tsne$Y,labels=train$occupation,col = colors[train$occupation])
legend(25,5,legend=unique(train$occupation),col = colors,,pch=c(1))

1

You can see that there are clusters however, the clusters are all mixed with the different occupations. What this indicates is that the features we used to make the two dimensions do not discriminant between the different occupations.

Conclusion

T-SNE is an improved way to visualize data. This is not to say that there is no place for PCA anymore. Rather, this newer approach provides a different way of quickly visualizing complex data without the limitations of PCA.

Checkout ERT online courses

Quadratic Discriminant Analysis with Python

Quadratic discriminant analysis allows for the classifier to assess non -linear relationships. This of course something that linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a qda analysis using Python. The steps that will be conducted are as follows

  1. Data preparation
  2. Model training
  3. Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data “Wages1” it comes from the “pydataset” library. After loading the data, we will use the .head() method to look at it briefly.

1

We need to transform the variable ‘sex’, our dependent variable, into a dummy variable using numbers instead of text. We will use the .getdummies() method to make the dummy variables and then add them to the dataset using the .concat() method. The code for this is below.

In the code below we have the histogram for the continuous independent variables.  We are using the .distplot() method from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)
sns.set(font_scale=1.4)
sns.distplot(df['exper'],color='black',ax=axs[0])
sns.distplot(df['school'],color='black',ax=axs[1])
sns.distplot(df['wage'],color='black',ax=axs[2])

1

The variables look reasonable normal. Below is the proportions of the categorical dependent variable.

round(df.groupby('sex').count()/3294,2)
Out[247]: 
exper school wage female male
sex 
female 0.48 0.48 0.48 0.48 0.48
male 0.52 0.52 0.52 0.52 0.52

About half male and half female.

We will now make the correlational matrix

corrmat=df.corr(method='pearson')
f,ax=plt.subplots(figsize=(12,12))
sns.set(font_scale=1.2)
sns.heatmap(round(corrmat,2),
vmax=1.,square=True,
cmap="gist_gray",annot=True)

1

There appears to be no major problems with correlations. The last thing we will do is set up our train and test datasets.

X=df[['exper','school','wage']]
y=df['male']
X_train,X_test,y_train,y_test=train_test_split(X,y,
test_size=.2, random_state=50)

We can now move to model development

Model Development

To create our model we will instantiate an instance of the quadratic discriminant analysis function and use the .fit() method.

qda_model=QDA()
qda_model.fit(X_train,y_train)

There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means  Below are the  group means.

exper school wage
Female 7.73 11.84 5.14
Male 8.28 11.49 6.38

You can see from the table that mean generally have more experience, higher wages, but slightly less education.

We will now use the qda_model we create to predict the classifications for the training set. This information will be used to make a confusion matrix.

cm = confusion_matrix(y_train, y_pred)
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
cmap=ListedColormap(['gray']), linewidths=2.5)

1

The information in the upper-left corner are the number of people who were female and correctly classified as female. The lower-right corner is for the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as females. Below is the actually accuracy of our model

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

 y_pred=qda_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
from matplotlib.colors import ListedColormap
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
sns.heatmap(cm, cbar=False, square=True,annot=True,fmt='g',
cmap=ListedColormap(['gray']),linewidths=2.5)

1

The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

About the same, our model generalizes even though it performs somewhat poorly.

Conclusion

This post provided an explanation of how to do a quadratic discriminant analysis using python. This is just another potential tool that may be useful for the data scientist.

Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset is looking at how is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long

df_train.columns
df_train.head()

Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
 
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function if you want to see how  many variables are left.

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe())
Out[8]: 
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT']

1.png

round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]: 
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL']

1.png

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical  variables and converting them to dummy variable’s

df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)

df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)

df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)

df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)

df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)

df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)

df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)

df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)

df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly.  Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR,''Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])

1.png

There is a clear outlier there. Below is another boxplot with a different variable

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])

2

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)

1.png

The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005
FLAG_MOBIL FLAG_DOCUMENT_12 0.000005
Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005

The list is to long to show here but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for homoscedasticity, linearity, and normality  using plots and histograms

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)

12

This is not normal

sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)

12

This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make are X and y dataset.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']

The code below fits are model and makes the predictions

clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0                
0       280873  18493
1         1813   6332
accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy of the training set. It actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.

Conclusion

Data exploration and analysis is the primary task of a data scientist.  This post was just an example of how this can be approached. Of course, there are many other creative ways to do this but the simplistic nature of this analysis yielded strong results

RANSAC Regression in Python

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

  1. The algorithm randomly selects a random amount of samples to be inliers in the model.
  2. All data is used to fit the model and samples that fall with a certain tolerance are relabeled as inliers.
  3. Model is refitted with the new inliers
  4. Error of the fitted model vs the inliers is calculated
  5. Terminate or go back to step 1 if a certain criterion of iterations or performance is not met.

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

  1. Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
  2. Model 2 will use multiple regression and  includes several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

  1. Data preparation
  2. Simple Regression Model fit
  3. Simple regression visualization
  4. Multiple regression model fit
  5. Multiple regression visualization

Below are the packages we will need for this example

import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preparation

For the data preparation, we need to do the following

  1. Load the data
  2. Create X and y dataframes
  3. Convert several categorical variables to dummy variables
  4. Drop the original categorical variables from the X dataframe

Below is the code for these steps

df=data('tips')
X,y=df[['total_bill','sex','size','smoker','time']],df['tip']
male=pd.get_dummies(X['sex'])
X['male']=male['Male']
smoker=pd.get_dummies(X['smoker'])
X['smoker']=smoker['Yes']
dinner=pd.get_dummies(X['time'])
X['dinner']=dinner['Dinner']
X=X.drop(['sex','time'],1)

Most of this is self-explanatory, we first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we converted the sex, smoker, and dinner variables into dummy variables, and then we dropped the original categorical variables.

We can now move to fitting the first model that uses simple regression.

Simple Regression Model

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

  1. Instantiate an instance of the RANSACRegressor. We the call LinearRegression function, and we also set the residual_threshold to 2 indicate how far an example has to be away from  2 units away from the line.
  2. Next we fit the model
  3. We predict the values
  4. We calculate the r square  the mean absolute error

Below is the code for all of this.

ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg1.fit(X[['total_bill']],y)
prediction1=ransacReg1.predict(X[['total_bill']])
r2_score(y,prediction1)
Out[150]: 0.4381748268686979

mean_absolute_error(y,prediction1)
Out[151]: 0.7552429811944833

The r-square is 44% while the MAE is 0.75.  These values are most comparative and will be looked at again when we create the multiple regression model.

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression. It also identifies which samples are inliers and outliers. Te coding will not be explained because of the complexity of it.

inlier=ransacReg1.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(3,51,2)
line_y=ransacReg1.predict(line_X[:,np.newaxis])
plt.scatter(X[['total_bill']][inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(X[['total_bill']][outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')

1

Plot is self-explanatory as a handful of samples were considered outliers. We will now move to creating our multiple regression model.

Multiple Regression Model Development

The steps for making the model are mostly the same. The real difference takes place in make the plot which we will discuss in a moment. Below is the code for  developing the model.

ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg2.fit(X,y)
prediction2=ransacReg2.predict(X)
r2_score(y,prediction2)
Out[154]: 0.4298703800652126

mean_absolute_error(y,prediction2)
Out[155]: 0.7649733201032204

Things have actually gotten slightly worst in terms of r-square and MAE.

For the visualization, we cannot plot directly several variables t once. Therefore, we will compare the predicted values with the actual values. The better the correlated the better our prediction is. Below is the code for the visualization

inlier=ransacReg2.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(1,8,1)
line_y=(line_X[:,np.newaxis])
plt.scatter(prediction2[inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(prediction2[outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')

1

The plots are mostly the same  as you cans see for yourself.

Conclusion

This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively this is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be use din machine learning.

Combining Algorithms for Classification with Python

Many approaches in machine learning involve making many models that combine their strength and weaknesses to make more accuracy classification. Generally, when this is done it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used it is the same algorithm but with variances in sampling and the use of features.

In addition to this common form of ensemble learning there is also a way to combine different algorithms to make predictions. For one way of doing this is through a technique called stacking in which the predictions of several models are passed to a higher model that uses the individual model predictions to make a final prediction. In this post we will look at how to do this using Python.

Assumptions

This blog usually tries to explain as much  as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.

  • Already familiar with python
  • Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
  • Familiar with cross-validation and hyperparameter tuning

We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.

The steps we will take in this post are as follows

  1. Data preparation
  2. Individual model development
  3. Ensemble model development
  4. Hyperparameter tuning of ensemble model
  5. Ensemble model testing

Below is all of the libraries we will be using in this post

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

Data Preparation

We need to perform the following steps for the data preparation

  1. Load the data
  2. Select the independent variables to be used in the analysis
  3. Scale the independent variables
  4. Convert the dependent variable from text to numbers
  5. Split the data in train and test sets

Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post but you can explore this on your own. The data was also scaled because many algorithms are sensitive to this so it is best practice to always scale the data. We will use the StandardScaler function for this. Lastly, the dpeendent variable currently consist of values of “yes” and “no” these need to be convert to numbers 1 and 0. We will use the LabelEncoder function for this. The code for all of this is below.

df=data('Mroz')
X,y=df[['hoursw','child6','child618','educw','hearnw','hoursh','educh','wageh','educwm','educwf','experience']],df['city']
sc=StandardScaler()
X_scale=sc.fit_transform(X)
X=pd.DataFrame(X_scale, index=X.index, columns=X.columns)
le=LabelEncoder()
y=le.fit_transform(y)
X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)

We can now proceed to individul model development

Individual Model Development

Below are the steps for this part of the analysis

  1. Instantiate an instance of each algorithm
  2. Check accuracy of each model
  3. Check roc curve of each model

We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code

logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0)
treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0)
knnclf=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')
LDAclf=LDA()

We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.

clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf']
for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [Logistic Regression]
accuracy: 0.72 (+/- 0.06) [Decision Tree]
accuracy: 0.66 (+/- 0.06) [KNN]
accuracy: 0.70 (+/- 0.05) [LDAclf]
roc auc: 0.71 (+/- 0.08) [Logistic Regression]
roc auc: 0.70 (+/- 0.07) [Decision Tree]
roc auc: 0.62 (+/- 0.10) [KNN]
roc auc: 0.70 (+/- 0.08) [LDAclf]

The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.

Ensemble Model Development

The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results

 mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [LR]
accuracy: 0.72 (+/- 0.06) [tree]
accuracy: 0.66 (+/- 0.06) [knn]
accuracy: 0.70 (+/- 0.05) [LDA]
accuracy: 0.70 (+/- 0.04) [combine]
roc auc: 0.71 (+/- 0.08) [LR]
roc auc: 0.70 (+/- 0.07) [tree]
roc auc: 0.62 (+/- 0.10) [knn]
roc auc: 0.70 (+/- 0.08) [LDA]
roc auc: 0.72 (+/- 0.09) [combine]

You can see that the combine model as similar performance to the individual models. This means in this situation that the ensemble learning did not make much of a difference. However, we have not tuned are hyperparameters yet. This will be done in the next step.

Hyperparameter Tuning of Ensemble Model

We are going to tune the decision tree, logistic regression, and KNN model. There are many different hyperparameters we can tune. For demonstration purposes we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters we will run the model and pull the best hyperparameters values based on the roc auc as the metric. Below is the code and the output.

params={'decisiontreeclassifier__max_depth':[2,3,5],
        'logisticregression__C':[0.001,0.1,1,10],
        'kneighborsclassifier__n_neighbors':[5,7,9,11]}

grid=GridSearchCV(estimator=mv_clf,param_grid=params,cv=10,scoring='roc_auc')

grid.fit(X_train,y_train)

grid.best_params_
Out[34]: 
{'decisiontreeclassifier__max_depth': 3,
 'kneighborsclassifier__n_neighbors': 9,
 'logisticregression__C': 10}

grid.best_score_
Out[35]: 0.7196051482279385

The best values are as follows

  • Decision tree max depth set to 3
  • KNN number of neighbors set to 9
  • logistic regression C set to 10

These values give us a roc auc of 0.72 which is still poor . We can now use these values when we test our final model.

Ensemble Model Testing

The following steps are performed in the analysis

  1. Created new instances of the algorithms with the adjusted hyperparameters
  2. Run the ensemble model
  3. Predict with the test data
  4. Check the results

Below is the first step

logclf=LogisticRegression(penalty='l2',C=10, random_state=0)
treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0)
knnclf=KNeighborsClassifier(n_neighbors=9,p=2,metric='minkowski')
LDAclf=LDA()

Below is step two

mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
mv_clf.fit(X_train,y_train)

Below are steps 3 and four

y_pred=mv_clf.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))

0.6902654867256637
col_0   0    1
row_0         
0      29   58
1      12  127
             precision    recall  f1-score   support

          0       0.71      0.33      0.45        87
          1       0.69      0.91      0.78       139

avg / total       0.69      0.69      0.66       226

The accuracy is about 69%. One thing that is noticeable low is the  recall for people who do not live in the city. This probably one reason why the overall roc auc score is so low. The f1-score is also low for those who do not live in the city as well. The f1-score is just a combination of precision and recall. If we really want to improve performance we would probably start with improving the recall of the no’s.

Conclusion

This post provided an example of how you can  combine different algorithms to make predictions in Python. This is a powerful technique t to use. Off course, it is offset by the complexity of the analysis which makes it hard to explain exactly what the results mean if you were asked tot do so.

Data Science Pipeline

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

  1. What to do
  2. When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know how to do it.

In this post, we will look at a process that that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat, this process is focused primarily on supervised learning which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

  1. Data preparation (data mugging)
  2. Model training
  3. Model testing

Off course, it is much more complicated than this but this is the minimum. Within each of these steps there are several substeps, However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary but general not all of them happen ever analysis. Below is a list of steps at this level

  • Data mugging
  • Scaling
  • Normality
  • Dimension reduction/feature extraction/feature selection
  • Train, test, validation split

Data mugging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization reduces the variables to a range of 0 – 1. Standardization involves converting the examples in the variable to their respective z-score. Which one you use depends on the situation but normally it is expected to do this.

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved checking for normality is doable with a few tests and some visualizations. If normality is violated various transformations can be used to deal with this problem. Keep mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reduce the number of variables that will be included in the final analysis. This is done through factor analysis or principal component analysis. This reduction  in the number of variables is also an example of feature extraction. In some context, feature extraction is the in goal in itself. Some algorithms make their own features such as neural networks through the use of hidden layer(s)

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization such or in smaller datasets with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analyst split the data into three parts to avoid overfitting the data to the test set. There is now a training a set, a validation set, and a testing set. The  validation set allows you to check the model performance several times. Once you are satisfied you use the test set once at the end.

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

  1. Determine the metric(s) for success
  2. Creating a grid of several hyperparameter values
  3. Cross-validation
  4. Selection of the most appropriate hyperparameter values

The first thing you have to do and this is probably required is determined how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification or mean squared error for a regression model or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameters settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Using cross-validation always you to stabilize the results through averaging the results of the model over several folds of the data if you are using k-folds cross-validation. This also helps to improve the results of the hyperparameters as well.  There are several types of cross-validation but k-folds is probably best initially.

The information for the metric, hyperparameters, and cross-validation are usually put into  a grid that then runs the model. Whether you are using R or Python the printout will tell you which combination of hyperparameters is the best based on the metric you determined.

Validation test

When you know what your hyperparameters are you can now move your model to validation or straight to testing. If you are using a validation set you asses your models performance by using this new data. If the results are satisfying based on your metric you can move to testing. If not, you may move back and forth between training and the validation set making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.

Conclusion

This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture for a somewhat complex process

Gradient Boosting Regression in Python

In this  post, we will take a look at gradient boosting for regression. Gradient boosting simply makes sequential models that try to explain any examples that had not been explained by previously models. This approach makes gradient boosting superior to AdaBoost.

Regression trees are mostly commonly teamed with boosting. There are some additional hyperparameters that need to be set  which includes the following

  • number of estimators
  • learning rate
  • subsample
  • max depth

We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']

We can now move to creating our baseline model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create  several regression trees. The difference between the regression trees will be the max depth. The max depth has to with the number of nodes python can make to try to purify the classification.  We will then decide which tree is best based on the mean squared error.

The first thing we need to do is set the arguments for the cross-validation. Cross validating the results helps to check the accuracy of the results. The rest of the code  requires the use of for loops and if statements that cannot be reexplained in this post. Below is the code with the output.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
if tree_regressor.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
print(depth, score)
 1 -193.55304528235052
 2 -176.27520747356175
 3 -209.2846723461564
 4 -218.80238479654003
 5 -222.4393459885871
 6 -249.95330609042858
 7 -286.76842138165705
 8 -294.0290706405905
 9 -287.39016236497804

You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.

However, before we create our gradient boosting model. we need to tune the hyperparameters of the algorithm.

Hyperparameter Tuning

Hyperparameter tuning has to with setting the value of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better at knowing where to set these values than the computer. Therefore, the process that is commonly used is to have the algorithm use several combinations  of values until it finds the values that are best for the model/. Having said this, there are several hyperparameters we need to tune, and they are as follows.

  • number of estimators
  • learning rate
  • subsample
  • max depth

The number of estimators is show many trees to create. The more trees the more likely to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use. Max depth was explained previously.

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function such as estimator, grid, cv, etc. Below is the code.

GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,4],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

We can now run the code and determine the best combination of hyperparameters and how well the model did base on the means squared error metric. Below is the code and the output.

search.fit(X,y)
search.best_params_
Out[13]: 
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

search.best_score_
Out[14]: -160.51398257591643

The hyperparameter results speak for themselves. With this tuning we can see that the mean squared error is lower than with the baseline model. We can now move to the final step of taking these hyperparameter settings and see how they do on the dataset. The results should be almost the same.

Gradient Boosting Model Development

Below is the code and the output for the tuned gradient boosting model

GBR2=GradientBoostingRegressor(n_estimators=500,learning_rate=0.01,subsample=.5,max_depth=1,random_state=1)
score=np.mean(cross_val_score(GBR2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[18]: -160.77842893572068

These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.

Conclusion

In this post, we looked at how to  use gradient boosting to improve a regression tree. By creating multiple models. Gradient boosting will almost certainly have a better performance than other type of algorithms that rely on only one model.

Gradient Boosting Classification in Python

Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than adaboost. Some differences between the two algorithms is that gradient boosting uses optimization for weight the estimators. Like adaboost, gradient boosting can be used for most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number the purer the classification become. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up until the whole number 1. If the value is set to 1 it becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

  1. Data preparation
  2. Baseline decision tree model
  3. Hyperparameter tuning
  4. Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

The data preparation is simple in this situtation. All we need to do is load are dataset, dropping missing values, and create our X dataset and y dataset. All this happens in the code below.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']

We will now develop our baseline decision tree model.

Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
if tree_classifier.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

It appears that when the max depth is limited to 1 that we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm

Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are as follows

  • number of estimators
  • learning rate
  • subsample
  • max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values from mad earlier, and n_jobs all together in one place. Below is the code for this.

GBC=GradientBoostingClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then as second run through is done to find the learning rate. In our example, we are doing everything at once which is why it takes longer. Below is the code with the out for best parameters and best score.

search.fit(X,y)
search.best_params_
Out[11]:
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
search.best_score_
Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.

Gradient Boosting Model

Below is the code and results for the model with the predetermined hyperparameter values.

ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[17]: 0.742279411764706

You can see that the results are similar. This is just additional information that the gradient boosting model does outperform the baseline decision tree model.

Conclusion

This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python

This post will share how to use the adaBoost algorithm for regression in Python. What boosting does is that it makes multiple models in a sequential manner. Each newer model tries to successful predict what older models struggled with. For regression, the average of the models are used for the predictions.  It is often most common to use boosting with decision trees but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting, is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Regression decision tree baseline model
  3. Hyperparameter tuning of Adaboost regression model
  4. AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

Data Preparation

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']

We will now proceed to creating the baseline regression decision tree model.

Baseline Regression Tree Model

The purpose of the baseline model is to compare it to the performance of our model that utilizes adaBoost. To make this model we need to Initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop to create several trees that vary based on their depth. By depth, it is meant how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
if tree_regressor.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

Hyperparameter Tuning

For hyperparameter tuning we need to start by initiating our AdaBoostRegresor() class. Then we need to create our grid. The grid will address two hyperparameters which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning indicates how each tree contributes to the overall results. There is one more parameter which is random_state, but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function, you have to set the estimator, which is adaBoostRegressor, the parameter grid which we just made, the cross-validation which we made when we created the baseline model, and the n_jobs, which allocates resources for the calculation. Below is the code.

ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.

search.fit(X,y)
search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
search.best_score_
Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with refined hyperparameters.

AdaBoost Regression Model

Below is our model, but this time with the refined hyperparameters.

ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[36]: -174.52604137201791

You can see the score is not as good but it is within reason.

Conclusion

In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help to strengthen a model in many ways at times.

AdaBoost Classification in Python

Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successful predict what prior models were unable to do. The average for regression and majority vote for classification are used. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.

Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

  1. Data preparation
  2. Decision tree baseline model
  3. Hyperparameter tuning of Adaboost model
  4. AdaBoost model development

Below is some initial code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Data Preparation

Data preparation is minimal in this situation. We will load are data and at the same time drop any NA using the .dropna() function. In addition, we will place the independent variables in dataframe called X and the dependent variable in a dataset called y. Below is the code.

df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']

Decision Tree Baseline Model

We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range (1,10):
tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
if tree_classifier.fit(X,y).tree_.max_depth<depth:
break
score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941

You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.

We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the  AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

Hyperparameter Tuning AdaBoost Model

In order to tune the hyperparameters there are several things that we need to do. First we need to initiate  our AdaBoostClassifier with some basic settings. Then We need to create our search grid with the hyperparameters. There are two hyperparameters that we will set and they are number of estimators (n_estimators) and the learning rate.

Number of estimators has to do with how many trees are developed. The learning rate indicates how each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and for cross-validation. Below is the code for all of this

ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

We can now run the model of hyperparameter tuning and see the results. The code is below.

search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802

We can see that if the learning rate is set to 0.01 and the number of estimators to 1000 We can expect an accuracy of 74%. This is superior to our baseline model.

AdaBoost Model

We can now rune our AdaBoost Classifier based on the recommended hyperparameters. Below is the code.

ada=AdaBoostClassifier(n_estimators=1000,learning_rate=0.01)
score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589

We knew we would get around 74% and that is what we got. It’s only a 3% improvement but depending on the context that can be a substantial difference.

Conclusion

In this post, we look at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models to determine the most accurate classification in a sequential manner. Doing this will often lead to an improvement in the prediction of a model.

Elastic Net Regression in Python

Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that with elastic net the algorithm can remove weak variables altogether as with lasso or to reduce them to close to zero as with ridge. All of these algorithms are examples of regularized regression.

This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.

  1. Data preparation
  2. Baseline model development
  3. Elastic net model development

To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.

from pydataset import data
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Data Preparation

We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our  and y datasets. Below is the code.

df=pd.DataFrame(data('Fair'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
df.loc[df.child== 'no', 'child'] = 0
df.loc[df.child== 'yes','child'] = 1
df['child'] = df['child'].astype(int)
X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]
y=df['rate']

We can now proceed to creating the baseline model

Baseline Model

This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
1.0498738644696668

This mean standard error score of 1.05 is our benchmark for determining if the elastic net model will be better or worst. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to combine the two columns.

coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'religious': 0.04235281110639178,
'age': -0.009059645428673819,
'sex': 0.08882013337087094,
'ym': -0.030458802565476516,
'education': 0.06810255742293699,
'occupation': -0.005979506852998164,
'nbaffairs': -0.07882571247653956}

We will now move to making the elastic net model.

Elastic Net Model

Elastic net, just like ridge and lasso regression, requires normalize data. This argument  is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we create for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.

When the l1_ratio is set to 0 it is the same as ridge regression. When l1_ratio is set to 1 it is lasso. Elastic net is somewhere between 0 and 1 when setting the l1_ratio. Therefore, in our grid, we need to set several values of this argument. Below is the code.

elastic=ElasticNet(normalize=True)
search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

We will now fit our model and display the best parameters and the best results we can get with that setup.

search.fit(X,y)
search.best_params_
Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}
abs(search.best_score_)
Out[74]: 1.0816514028705004

The best hyperparameters was an alpha set to 0.001 and a l1_ratio of 0.8. With these settings we got an MSE of 1.08. This is above our baseline model of MSE 1.05  for the baseline model. Which means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to the recommended values and run on the data.

elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)
elastic.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))
print(second_model)
1.0566430678343806

Now our values are about the same. Below are the coefficients

coef_dict_baseline = {}
for coef, feat in zip(elastic.coef_,X.columns):
coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[76]:
{'religious': 0.01947541724957858,
'age': -0.008630896492807691,
'sex': 0.018116464568090795,
'ym': -0.024224831274512956,
'education': 0.04429085595448633,
'occupation': -0.0,
'nbaffairs': -0.06679513627963515}

The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this values was no good to the algorithm. Traditional regression cannot do this.

Conclusion

This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

Cross-Validation in Python

A common problem in machine learning is data quality. In other words, if the data is bad the model will be bad even if it is designed using best practices. Below is a short of some possible problems with data

  • Sample size is to small-Hurts all algorithms
  • Sample size too big-Hurts complex algorithms
  • Wrong data-Hurts all  algorithms
  • Too many variables-Hurts complex algorithms

Naturally, this list is not exhaustive. Whenever some of the above situations take place it can lead to a model that has bias or variance. Bias takes place when the model highly over and under estimates values. This is common in regression when the relationship among the variables is not linear. The linear line that is developed by the  model works sometimes but is often erroneous.

Variance is when the model is too sensitive to the characteristics of the training data. This means that the model develops a complex way to classify or performs regression that does not generalize to other datasets

One solution to addressing these problems is the use of cross-validation. Cross-validation involves dividing the training set into several folds. For example, you may divide the data into 10 folds. With 9 folds you train the data and with the 10rh fold you test it. You then calculate the average prediction or classification of the ten test folds. This method is commonly called k-folds cross-validation. This process helps to stabilize the results of the final model. We will now look at how to do this using Python.

Data Preparation

We will develop a regression model using the PSID dataset. Our goal will be to predict earnings based on the other variables in the dataset. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

We now need to load the dataset PSID. When this is done, there are several things we also need to.

  • We have to drop all NA’s in the dataset
  • We also need to convert the “married” variable to a dummy variable.

Below  is the code for completing these steps

df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married

The code above loads the data while dropping the NAs. We then use the .loc function to make everyone who is not married a 0 and everyone who is married a 1. This variable is then converted to an integer using the .astype function. Lastly, we make a new variable called ‘marry’ and store our data there.

There is one other problem we need to address. In the ‘kids’ and the ‘educatn’ variable are values of 98 and 99. In the original survey, these responses meant that the person did not want to say how man kids or how much education they had or that they did not know. We  will remove these individuals from the sample using the code below.

df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)

The code above tells Python to remove in values greater than 90. With this We can now make are dataset that includes the independent variables and the dataset that contains the dependent variable.

X=df[['age','educatn','hours','kids','marry']]
y=df['earnings']

Model Development

We are now going to make several models and use the mean squared error as our way of comparing them. The first model will use all of the data. The second model will use the training data. The  third model will use cross-validation. Below is the code for the first model that uses all of the data,

regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
138544429.96275884

For the second model, we first need to make our train and test sets. Then we will run our model.  The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=5)
regression.fit(X_train,y_train)
second_model=(mean_squared_error(y_true=y_train,y_pred=regression.predict(X_train)))
print(second_model)
148286805.4129756

You can see that the number are somewhat different. This is to be expected when dealing with different sample sizes. With cross validation using the full dataset we get results similar to the first model we developed. This is done through an instance of the KFold function. For KFold we want 10 folds, we want to shuffle the data, and set the seed.

The other function we need is the cross_val_score function. In this function, we set the type of model, the data we will use, the metric for evaluation, and the characteristics of the type of cross-validation. Once this is done we print the mean and standard deviation of the fold results. Below is the code.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
scores=cross_val_score(regression,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1)
print(len(scores),np.mean(np.abs(scores)),np.std(scores))
10 138817648.05153447 35451961.12217143

These numbers are closer to what is expected from the dataset. Despite the fact that we didn’t use all the data at the same time. You can also run these results on the training set as well for additional comparison.

Conclusion

This post provides an example of cross-validation in Python. The use of cross-validation helps to stabilize the results that ma come from your model. With increase stability comes increased confidence in your models ability to generalize to other datasets.

Artificial Intelligence in the Classroom

In 1990, a little known film called “Class of 1999” came out. In this movie, three military grade robots are placed in an inner-city war zone school to with the job of teaching.

As with any movie, things quickly get out of hand and the robot teachers begin killing the naughty students and eventually manipulating the local gangs into fighting and killing each other. Eventually, in something that can only happen in a movie, three military grade AI robots similar to the terminator are destroyed by a delinquent teenager

There has been a lot of hype and excitement over the growth of data science, machine learning, and artificial intelligence. With this growth, these ideas have begun to expend into supporting education. This has even led to speculation among some that algorithms and artificial intelligence could replace teachers in the classroom.

There are several reasons why this is. My reasons are listed as follows

  • People Need People
  • Computers need people
  • Computers assist people

People Need People

When it comes to education, people need people. Originally, education was often passed through apprenticeship for trades and one-on-one tutoring for elites. There has allows been some form of mass education but it has always involved people helping each other.

There are certain social-emotional needs that people have that cannot be satisfied by even the most life-like machine. When humans communicate they always convey some form of emotion even in the most harden computer like individual. Although AI is making strides in attempting to read, emotions they are far from convincingly portraying emotions. Besides, students want someone who can laugh, joke, smile, and do all those little things that involve being human. Even such mundane things as tripping over one’s shoes, or forgetting someone’s name add a human element to the learning experience.

Furthermore, even if a computer is able to share emotions in a human-like manner what child would really feel satisfaction from pleasing an Amazon Alexa? People need people and AI teachers cannot provide this even if they can provide top-level content.

Another concern is that people are highly unpredictable. Again, this relates to the emotional aspects of human nature. Even humans who have the same emotional characteristics are surprised by the behavior of fellow humans. When an algorithm is coldly calculating what is an appropriate action this inability to deal with unpredictable actions can be a problem.

A classic example of this is classroom management. If a student is not paying attention, or not doing their work, or showing defiance in one way or the other how would a computer handle this. In the movie “Class of 1999” the answer for disruptive behavior was to kill. Few parents and administrators would approve of such an action coming from an artificial neural network.

People need people in the context of education for the socio-emotional aspect of education as well as for the tribulation of classroom management. Computers are not humans and therefore they cannot provide the motivation or inspiration that so many students need to be successful in school.

Computers Need People

A second reason AI teachers are unlikely is because computers need people. Computers breakdown,  there are bugs in code, updates have to be made etc. All this precludes a machine going completely independent. With everything that can go wrong there has to be people there to monitor the use and interaction of machines with people.

Even in the movie “Class of 1999” there was a research professor and administrator monitoring the school. This continued until they were killed by the AI teachers.

With all the amazing advances in AI and machine learning it is still people who tweak the algorithms and provide the data for the machine to learn. After this is done, the algorithm is still monitored to see how it performs. Computers cannot escape their reliance on humans to maintain their functionality which implies that they cannot be turned loss in a classroom alone.

Computers Help People

The way going forward is that perhaps AI and other aspects of machine learning and data science can support teachers to be better teachers. For example, in some versions of Moodle there is an algorithm that will monitor students participation and will predict if students are at risk of failing. There is also an algorithm that predicts if a teacher is teaching. This is an excellent use of machine learning in that it deals with a routine task and simple flags a problem rather than trying to intervene it’s self.

Another useful application more in line with AI is through tutoring. Providing feedback on progress and adjusting what the student does based on their performance. Again, in a supporting role, AI can be excellent. The problem is when AI becomes the leader.

Conclusion

The advances in technology are going to continue. However, with the amazing breakthroughs in this field people still need interaction with other people and the example of others in a social context. Computers will never be able to provide this. Computers also need the constant support of humans in order to function. The proper role for AI and data science in education may be as a supporter to a teacher rather than the one leading and making criticaltaff decisions about other people.

Data Munging with Dplyr

Data preparation aka data munging is what most data scientist spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr”” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base r for our analysis. Below is some initial code.

library(dplyr)
data("attitude")
str(attitude)
## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

You can see we have seven variables and only 30 observations. Our first function that we will learn to use is the “select” function. This function allows you to select columns of data you want to use. In order to use this feature, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select”” function.

names(attitude)[1:3]
## [1] "rating"     "complaints" "privileges"
smallset<-select(attitude,rating:privileges)
head(smallset)
##   rating complaints privileges
## 1     43         51         30
## 2     63         64         51
## 3     71         70         68
## 4     61         63         45
## 5     81         78         56
## 6     43         55         49

The difference is probably obvious. Using the “select” function we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative in front of the names of the columns. Below is the code

head(select(attitude,-(rating:privileges)))
##   learning raises critical advance
## 1       39     61       92      45
## 2       54     63       73      47
## 3       69     76       86      48
## 4       47     54       84      35
## 5       66     71       83      47
## 6       44     54       49      34

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of the “rating” to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right

attitude<-rename(attitude,rates=rating)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34

The “select”” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

s_set<-head(select(attitude,ends_with("s")))
s_set
##   rates complaints privileges raises
## 1    43         51         30     61
## 2    63         64         51     63
## 3    71         70         68     76
## 4    61         63         45     54
## 5    81         78         56     71
## 6    43         55         49     54

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a 75 or higher in the “raises” variable.

bigraise<-filter(attitude,raises>75)
bigraise
##   rates complaints privileges learning raises critical advance
## 1    71         70         68       69     76       86      48
## 2    77         77         54       72     79       77      46
## 3    74         85         64       69     79       79      63
## 4    66         77         66       63     88       76      72
## 5    78         75         58       74     80       78      49
## 6    85         85         71       71     77       74      55

If you look closely all values in the “raise” column are greater than 75. Of course, you can have more than one criteria. IN the code below there are two.

filter(attitude, raises>70 & learning<67)
##   rates complaints privileges learning raises critical advance
## 1    81         78         56       66     71       83      47
## 2    65         70         46       57     75       85      46
## 3    66         77         66       63     88       76      72

The “arrange” function allows you to sort the order of the rows. In the code below we first sort the data ascending by the “critical” variable. Then we sort it descendingly by adding the “desc” function.

ascCritical<-arrange(attitude, critical)
head(ascCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         55         49       44     54       49      34
## 2    81         90         50       72     60       54      36
## 3    40         37         42       58     50       57      49
## 4    69         62         57       42     55       63      25
## 5    50         40         33       34     43       64      33
## 6    71         75         50       55     70       66      41
descCritical<-arrange(attitude, desc(critical))
head(descCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    71         70         68       69     76       86      48
## 3    65         70         46       57     75       85      46
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    72         82         72       67     71       83      31

The “mutate” function is useful for engineering features. In the code below we will transform the “learning” variable by subtracting its mean from its self

attitude<-mutate(attitude,learningtrend=learning-mean(learning))
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend
## 1    -17.366667
## 2     -2.366667
## 3     12.633333
## 4     -9.366667
## 5      9.633333
## 6    -12.366667

You can also create logical variables with the “mutate” function.In the code below, we create a logical variable that is true when the “critical” variable” is higher than 80 and false when “critical”” is less than 80. The new variable is called “highCritical”

attitude<-mutate(attitude,highCritical=critical>=80)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend highCritical
## 1    -17.366667         TRUE
## 2     -2.366667        FALSE
## 3     12.633333         TRUE
## 4     -9.366667         TRUE
## 5      9.633333         TRUE
## 6    -12.366667        FALSE

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in R. This function works in combination with the “summarize” function for our purposes here. We will group our data by the “highCritical” variable. This means our data will be viewed as either TRUE for “highCritical” or FALSE. The results of this function will be saved in an object called “hcgroups”

hcgroups<-group_by(attitude,highCritical)
head(hcgroups)
## # A tibble: 6 x 9
## # Groups:   highCritical [2]
##   rates complaints privileges learning raises critical advance
##                            
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
## # ... with 2 more variables: learningtrend , highCritical 

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics by using the “summarize” function. We specifically want to know the mean of the “complaint” variable based on the variable “highCritical.” Below is the code

summarize(hcgroups,complaintsAve=mean(complaints))
## # A tibble: 2 x 2
##   highCritical complaintsAve
##                   
## 1        FALSE      67.31579
## 2         TRUE      65.36364

Of course, you could have learned this through doing a t.test but this is another approach.

Conclusion

The “dplyr” package is one powerful tool for wrestling with data. There is nothing new in this package. Instead, the coding is simpler than what you can excute using base r.

Exploratory Data Analyst

In data science, exploratory data analyst serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analyst. The steps include the following.

  • Consult your questions
  • Check the structure of the dataset
  • Use visuals

Consult Your Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analyst to conduct as well.

During the data exploration stage, the purpose of a research question is not for analyst but rather to determine if your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America are and your data tells you the salary of office workers there is a problem,. Your question (average height) cannot be answered with the current data that you have (office workers salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question moves away from generalities and deals with a clearly defined phenomenon. For example, “what is the average height of males age 20-30 in the United states?” This question clearly identifies the what we want to know (average height) and among who (20-30, male Americans).

Not can you confirm if your questions are answerable you can also decide if you need to be more or less specific with your questions. Returning to our average height question. We may find that we can be more specific and check average height by state if we want. Or, we might learn that we can only determine the average height for a region. All this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns in the dataset, the sample size, as well as looking for missing data and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analyst and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the dataset. If the stakeholders tell you that there should be several million rows in the data set and you check and there are only several thousand you know there is a problem. This concept also applies to the number of features you expect as well.

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as to look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.What kind of tables and plots to make depends heavily on

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions tables of summary statistics with bar plots might be sufficient. For comparison questions, summary stats and boxplots may be enough. For relationship question, summary stat tables with a scatterplot may be enough. Please keep in mind that it is much more complicated than this.

Conclusion

Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.

Data Science Research Questions

Developing research questions is an absolute necessity in completing any research project. The questions you ask help to shape the type of analysis that you need to conduct.

The type of questions you ask in the context of analytics and data science are similar to those found in traditional quantitative research. Yet data science, like any other field, has its own distinct traits.

In this post, we will look at six different types of questions that are used frequently in the context of the field of data science. The six questions are…

  1. Descriptive
  2. Exploratory/Inferential
  3. Predictive
  4. Causal
  5. Mechanistic

Understanding the types of question that can be asked will help anyone involved in data science to determine what exactly it is that they want to know.

Descriptive

A descriptive question seeks to describe a characteristic of the dataset. For example, if I collect the GPA of 100 university student I may want to what the average GPA of the students is. Seeking the average is one example of a descriptive question.

With descriptive questions, there is no need for a hypothesis as you are not trying to infer, establish a relationship, or generalize to a broader context. You simply want to know a trait of the dataset.

Exploratory/Inferential

Exploratory questions seek to identify things that may be “interesting” in the dataset. Examples of things that may be interesting include trends, patterns, and or relationships among variables.

Exploratory questions generate hypotheses. This means that they lead to something that may be more formal questioned and tested. For example, if you have GPA and hours of sleep for university students. You may explore the potential that there is a relationship between these two variables.

 

Inferential questions are an extension of exploratory questions. What this means is that the exploratory question is formally tested by developing an inferential question. Often, the difference between an exploratory and inferential question is the following

  1. Exploratory questions are usually developed first
  2. Exploratory questions generate inferential questions
  3. Inferential questions are tested often on a different dataset from exploratory questions

In our example, if we find a relationship between GPA and sleep in our dataset. We may test this relationship in a different, perhaps larger dataset. If the relationship holds we can then generalize this to the population of the study.

Causal

Causal questions address if a change in one variable directly affects another. In analytics, A/B testing is one form of data collection that can be used to develop causal questions. For example, we may develop two version of a website and see which one generates more sales.

In this example, the type of website is the independent variable and sales is the dependent variable. By controlling the type of website people see we can see if this affects sales.

Mechanistic 

Mechanistic questions deal with how one variable affects another. This is different from causal questions that focus on if one variable affects another. Continuing with the website example, we may take a closer look at the two different websites and see what it was about them that made one more succesful in generating sales. It may be that one had more banners than another or fewer pictures. Perhaps there were different products offered on the home page.

All of these different features, of course, require data that helps to explain what is happening. This leads to an important point that the questions that can be asked are limited by the available data. You can’t answer a question that does not contain data that may answer it.

Conclusion

Answering questions is essential what research is about. In order to do this, you have to know what your questions are. This information will help you to decide on the analysis you wish to conduct. Familiarity with the types of research questions that are common in data science can help you to approach and complete analysis much faster than when this is unclear

Primary Tasks in Data Analysis

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people,  plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least 5 core task that you need to consider when preparing to analyze data. These five tasks are

  1. Developing  your question(s)
  2. Data exploration
  3. Developing a statistical model
  4. Interpreting the results
  5. Sharing the results

Developing Your Question(s)

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

Data Exploration

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables.

Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist time and is not easy at all.  Mastery of this in many ways means being a master of data science

Develop a Statistical Model

Your research questions and the data exploration process helps you to determine what kind of model to develop. The factors that can affect this is whether your data is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the funniest part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s)

Interpreting the Results

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.

With interpretation, you are trying to determine “what does this answer mean to the stakeholders?”  For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

Communication of Results

Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following

  • The audience or stakeholders
  • The actual answers to the questions
  • The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

Conclusion

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This experience is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in r or excel. If you created it in excel just save it as a csv and load it into r. Below is the initial code

library(readr)
apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
##                  
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This a small dataset with the columns of “weight” and “location”. Here are some of the problems

  • Weights are in different units
  • Weights are written in different ways
  • Location is not consistent

In order to have any success with data wrangling, you need to state specifically what it is you want to do. Here are our goals for this project

  • Convert the “Weight variable” to a numerical variable instead of character
  • Remove the text and have only numbers in the “weight variable”
  • Change weights in grams to kilograms
  • Convert the “location” variable to a factor variable instead of character
  • Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

  1. We created a variable called “corrected.weight”
  2. We use the function “as.numeric” this makes whatever results inside it to be a numerical variable
  3. Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another.
  4. Inside “gsub” we used the argument pattern and set it to “[[alpha:]]” and “” this told r to look for any lower or uppercase letters and replace with nothing or remove it. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights into grams to kilograms so that everything is the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

  1. We created a variable called “gram.error”
  2. We used the grep function to search are the “weight” variable in the apple data frame for input that is a digit and is 4 digits in length this is what the “[[:digit:]]{4}” argument means. We do not change any values yet we just store them in “gram.error”
  3. Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable.
  4. We tell r to save into the “corrected.weight” variable any value that is changeable according to the criteria set in “gram.error” and to divide it by 1000. Dividing it by 1000 converts the value from grams to kilograms.

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to the United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain but in short We used the “agrep” function to tell r to search the “location” to look for values similar to our term “europe”. The other arguments provide some exceptions that r should change because the exceptions are close to the term europe. This process is repeated for the term “us”. We then store are the location variable from the “apple” dataframe in a new variable called “corrected.location” We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next, we have to make some code to deal with “United States” and apply this using the “gsub” function.

We are almost done, now we combine are two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

Ensemble Learning for Machine Models

One way to improve a machine learning model is to not make just one model. Instead, you can make several models that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is known as ensemble learning. This post will provide insights into ensemble learning as they are used in developing machine models.

The Major Challenge

The biggest challenges in creating an ensemble of models are deciding what models to develop and how the various models are combined to make predictions. To deal with these challenges involves the use of training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive a subset of the data or limit how many features each model can use. However, if several different algorithms are used the allocation function may pass all the data to each model with making any changes.

After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function.

The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used which means that if 5 models where developed and 3 vote “yes” than the example is classified as a yes. Another option is to weight the models so that some have more influence than others in the final predictions.

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. With the combined strengths of many different models and or algorithms it is difficult to go wrong

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.

Conclusion

Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.

Classification Rules in Machine Learning

Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. The antecedent is the before and consequent is after. For example, I may have the following rule.

If the students studies 5 hours a week then they will pass the class with an A

This simple rule can be broken down into the following antecedent and consequent.

  • Antecedent–If the student studies 5 hours a week
  • Consequent-then they will pass the class with an A

The antecedent determines if the consequent takes place. For example, the student must study 5 hours a week to get an A. This is the rule in this particular context.

This post will further explain the characteristics and traits of classification rules.

Classification Rules and Decision Trees

Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step process to make a decision.

Classification rules are stand-alone rules that are abstracted from a process. To appreciate a classification rule you do not need to be familiar with the process that created it. While with decision trees you do need to be familiar with the process that generated the decision.

One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for large amounts of numeric variables. This is not a problem with decision trees.

The Algorithm

Classification rules use algorithms that employ a separate and conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subset by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics.

Common algorithms used in classification rules include the One Rule Algorithm and the RIPPER Algorithm. The One Rule Algorithm analyzes data and generates one all-encompassing rule. This algorithm works by finding the single rule that contains the less amount of error. Despite its simplicity, it is surprisingly accurate.

The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that in no longer helps to purify the various groups the rule is pruned or the part of the rule that is not beneficial is removed. This process of growing and pruning rules is continued until there is no further benefit.

RIPPER algorithm rules are more complex than One Rule Algorithm. This allows for the development of complex models. The drawback is that the rules can become too complex to make practical sense.

Conclusion

Classification rules are a useful way to develop clear principles as found in the data. The advantage of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your way of approaching your problem, data collection and analysis that is acceptable to peers.

This level of freedom leads to people literally freezing and not completing a project. Now imagine have several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simply six step process to approaching data science. The process involves the following six steps

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to access data for further analysis. Not always, but often data scientist are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research in which you are often involved from the beginning to the end, in data science you jump to analyze a mess of data that others collected. This is challenging as it may not be clear what people what to know are what exactly the collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have, the level of data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats from structured to unstructured. It is critical to take a look at the data through using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This involves dealing with missing data, transforming features, addressing outliers, and other necessary processes for preparing analysis. The primary goal is to organize the data for analysis

This is a critical step as various machine learning models have different assumptions that must be met. Some models can handle missing data some cannot. Some models are affected by outliers some are not.

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis needs to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well-spoken explanations.

Steps 6 Apply the Results

This is the chance to actual use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications by people interested in applying for a home loan.

Conclusion

The steps in this process is just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

Describing Categorical Data in R

This post will explain how to create tables, calculate proportions, find the mode, and make plots for categorical variables in R. Before providing examples below is the script needed to setup the data that we are using

cars <- mtcars[c(1,2,9,10)]
cars$am <- factor(cars$am, labels=c('auto', 'manual'))
cars$gear <- ordered(cars$gear)

Tables

Frequency tables are useful for summarizing data that has a limited number of values. It represents how often a particular value appears in a dataset. For example, in our cars dataset, we may want to know how many different kinds of transmission we have. To determine this, use the code below.

> transmission_table <- table(cars$am)
> transmission_table

  auto manual 
    19     13

Here is what we did.

  1. We created the variable ‘transmission_table’
  2. In this variable, we used the ‘table’ function which took information from the ‘am’ variable from the ‘cars’ dataset.
  3. Final we displayed the information by typing ‘transmission_table’ and pressing enter

Proportion

Proportions can also be calculated. A proportion will tell you what percentage of the data belongs to a particular category. Below are the proportions of automatic and manual transmissions in the ‘cats’ dataset.

> transmission_table/sum(transmission_table)

   auto  manual 
0.59375 0.40625

The table above indicates that about 59% of the sample consists of automatic transmissions while about 40% are manual transmissions

Mode

When dealing with categorical variables there is not mean or median. However, it is still possible to calculate the mode, which is the most common value found. Below is the code.

> mode_transmission <-transmission_table ==max(transmission_table)
> names(transmission_table) [mode_transmission]
[1] "auto"

Here is what we did.

  1. We created the variable ‘mode_transmission’ and use the ‘max’ function to calculate the max number of counts in the transmission_table.
  2. Next we calculated the names found in the ‘transmission_table’ but we subsetted the ‘modes_transmission variable
  3. The most common value was ‘auto’ or automatic tradition,

Plots

Plots are one of the funniest capabilities in R. For now, we will only show you how to plot the data that we have been using. What is seen here is only the simplest and basic use of plots in R and there is a much more to it than this. Below is the code for plotting the number of transmission by type in R.

> plot(cars$am)

If you did this correctly you should see the following.

Rplot09

All we did was have R create a visual of the number of auto and manual transmissions.Naturally, you can make plots with continuous variables as well.

Conclusion

This post provided some basic information on various task can be accomplished in R for assessing categorical data. These skills will help researchers to take a sea of data and find simple ways to summarize all of the information.