Category Archives: Big Data

Quadratic Discriminant Analysis with Python

Quadratic discriminant analysis (QDA) allows the classifier to capture non-linear relationships, which is something linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a QDA analysis using Python. The steps are as follows:

  1. Data preparation
  2. Model training
  3. Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.
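Before touching the real data, it may help to see where the "quadratic" comes from. The sketch below is a one-feature toy version written with the standard library only. It is not the scikit-learn implementation, and the data and function names (`fit_qda_1d`, `predict_qda_1d`) are invented for illustration. Each class gets its own mean and its own variance, and that is what bends the decision boundary.

```python
import math
from statistics import mean, pvariance

def fit_qda_1d(xs, ys):
    """Estimate a prior, mean, and variance per class (one feature only)."""
    params = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        params[label] = {
            "prior": len(values) / len(xs),
            "mean": mean(values),
            "var": pvariance(values),  # each class keeps its OWN variance
        }
    return params

def discriminant(x, p):
    """Log prior plus log Gaussian density, dropping the shared constant."""
    return (math.log(p["prior"])
            - 0.5 * math.log(p["var"])
            - (x - p["mean"]) ** 2 / (2 * p["var"]))

def predict_qda_1d(x, params):
    """Pick the class with the largest discriminant score."""
    return max(params, key=lambda label: discriminant(x, params[label]))

# Invented toy data: both classes are centred on 5,
# but "a" is tightly clustered while "b" is spread out.
xs = [4.0, 5.0, 6.0, 0.0, 2.0, 8.0, 10.0]
ys = ["a", "a", "a", "b", "b", "b", "b"]

params = fit_qda_1d(xs, ys)
preds = [predict_qda_1d(x, params) for x in (1.0, 5.0, 9.0)]
print(preds)  # ['b', 'a', 'b']
```

Because both classes share the same mean, a linear rule with one shared variance has nothing to separate them with. The class-specific variances give two boundaries, one on each side of 5.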

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data. “Wages1” comes from the “pydataset” library. After loading the data, we will use the .head() method to look at it briefly.


We need to transform the variable ‘sex’, our dependent variable, into a dummy variable using numbers instead of text. We will use the pd.get_dummies() function to make the dummy variables and then add them to the dataset using the pd.concat() function. The code for this is below.

In the code below we have the histograms for the continuous independent variables. We are using the .distplot() function from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)


The variables look reasonably normal. Below are the proportions of the categorical dependent variable.

exper school wage female male
female 0.48 0.48 0.48 0.48 0.48
male 0.52 0.52 0.52 0.52 0.52

About half male and half female.

We will now make the correlation matrix.



There appear to be no major problems with the correlations. The last thing we will do is set up our train and test datasets.

test_size=.2, random_state=50)

We can now move to model development

Model Development

To create our model we will create an instance of the QuadraticDiscriminantAnalysis class (imported above as QDA) and use the .fit() method.


There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means, which are shown below.

exper school wage
Female 7.73 11.84 5.14
Male 8.28 11.49 6.38

You can see from the table that men generally have more experience and higher wages, but slightly less education.

We will now use the qda_model we created to predict the classifications for the training set. This information will be used to make a confusion matrix.

cm = confusion_matrix(y_train, y_pred)
# plt.subplots returns a (figure, axes) pair
fig, ax = plt.subplots(figsize=(10,10))
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The number in the upper-left corner is the number of people who were female and correctly classified as female. The lower-right corner is for the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as females. Below is the actual accuracy of our model

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.
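To make the link between the confusion matrix and the accuracy score explicit, here is a small standard-library sketch. The labels and predictions are invented and are not the Wages1 results. The helper names are hypothetical stand-ins for what `confusion_matrix` and `accuracy_score` do for a two-class problem.

```python
def confusion_matrix_2x2(y_true, y_pred, labels=("female", "male")):
    """Rows are the true class, columns are the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0, 0], [0, 0]]
    for truth, guess in zip(y_true, y_pred):
        cm[index[truth]][index[guess]] += 1
    return cm

def accuracy(cm):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

# Invented example: ten people, six classified correctly.
y_true = ["female", "female", "female", "male", "male",
          "male", "male", "female", "male", "male"]
y_pred = ["female", "male", "female", "male", "female",
          "male", "male", "male", "male", "female"]

cm = confusion_matrix_2x2(y_true, y_pred)
acc = accuracy(cm)
print(cm)   # [[2, 2], [2, 4]]
print(acc)  # 0.6
```

The upper-left and lower-right cells are the correct classifications, so accuracy is simply their sum divided by the total number of examples.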

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

cm = confusion_matrix(y_test, y_pred)
# ListedColormap was already imported at the top of the post
fig, ax = plt.subplots(figsize=(10,10))
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

The accuracy is about the same, so our model generalizes even though it performs somewhat poorly.


This post provided an explanation of how to do a quadratic discriminant analysis using Python. This is just another potential tool that may be useful for the data scientist.

Data Munging with Dplyr

Data preparation, aka data munging, is what most data scientists spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base R for our analysis. Below is some initial code.

## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

You can see we have seven variables and only 30 observations. The first function that we will learn to use is the “select” function. This function allows you to select the columns of data you want to use. In order to use this feature, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select” function.

## [1] "rating"     "complaints" "privileges"
##   rating complaints privileges
## 1     43         51         30
## 2     63         64         51
## 3     71         70         68
## 4     61         63         45
## 5     81         78         56
## 6     43         55         49

The difference is probably obvious. Using the “select” function we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative in front of the names of the columns. Below is the code

##   learning raises critical advance
## 1       39     61       92      45
## 2       54     63       73      47
## 3       69     76       86      48
## 4       47     54       84      35
## 5       66     71       83      47
## 6       44     54       49      34

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of “rating” to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right.

##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34

The “select” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

##   rates complaints privileges raises
## 1    43         51         30     61
## 2    63         64         51     63
## 3    71         70         68     76
## 4    61         63         45     54
## 5    81         78         56     71
## 6    43         55         49     54

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a value higher than 75 in the “raises” variable.

##   rates complaints privileges learning raises critical advance
## 1    71         70         68       69     76       86      48
## 2    77         77         54       72     79       77      46
## 3    74         85         64       69     79       79      63
## 4    66         77         66       63     88       76      72
## 5    78         75         58       74     80       78      49
## 6    85         85         71       71     77       74      55

If you look closely, all values in the “raises” column are greater than 75. Of course, you can have more than one criterion. In the code below there are two.

filter(attitude, raises>70 & learning<67)
##   rates complaints privileges learning raises critical advance
## 1    81         78         56       66     71       83      47
## 2    65         70         46       57     75       85      46
## 3    66         77         66       63     88       76      72

The “arrange” function allows you to sort the order of the rows. In the code below we first sort the data in ascending order by the “critical” variable. Then we sort it in descending order by wrapping the variable in the “desc” function.

ascCritical<-arrange(attitude, critical)
##   rates complaints privileges learning raises critical advance
## 1    43         55         49       44     54       49      34
## 2    81         90         50       72     60       54      36
## 3    40         37         42       58     50       57      49
## 4    69         62         57       42     55       63      25
## 5    50         40         33       34     43       64      33
## 6    71         75         50       55     70       66      41
descCritical<-arrange(attitude, desc(critical))
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    71         70         68       69     76       86      48
## 3    65         70         46       57     75       85      46
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    72         82         72       67     71       83      31

The “mutate” function is useful for engineering features. In the code below we will transform the “learning” variable by subtracting its mean from itself. The result is stored in a new column called “learningtrend”.

##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend
## 1    -17.366667
## 2     -2.366667
## 3     12.633333
## 4     -9.366667
## 5      9.633333
## 6    -12.366667

You can also create logical variables with the “mutate” function. In the code below, we create a logical variable that is true when the “critical” variable is higher than 80 and false when “critical” is 80 or lower. The new variable is called “highCritical”.

##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend highCritical
## 1    -17.366667         TRUE
## 2     -2.366667        FALSE
## 3     12.633333         TRUE
## 4     -9.366667         TRUE
## 5      9.633333         TRUE
## 6    -12.366667        FALSE

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in R. This function works in combination with the “summarize” function for our purposes here. We will group our data by the “highCritical” variable. This means our data will be viewed as either TRUE for “highCritical” or FALSE. The results of this function will be saved in an object called “hcgroups”

## # A tibble: 6 x 9
## # Groups:   highCritical [2]
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
## # ... with 2 more variables: learningtrend <dbl>, highCritical <lgl>

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics by using the “summarize” function. We specifically want to know the mean of the “complaints” variable based on the variable “highCritical.” Below is the code

## # A tibble: 2 x 2
##   highCritical complaintsAve
## 1        FALSE      67.31579
## 2         TRUE      65.36364

Of course, you could have learned this through doing a t.test but this is another approach.


The “dplyr” package is one powerful tool for wrestling with data. There is nothing new in this package. Instead, the coding is simpler than what you can execute using base R.

Understanding Recommendation Engines

Recommendation engines are used to make predictions about what future users would like based on prior users' feedback. Whenever you provide numerical feedback on a product or service, this information can be used to provide recommendations in the future.

This post will look at various ways in which recommendation engines derive their conclusions.

Ways of Recommending

There are two common ways to develop a recommendation engine in a machine learning context. These two ways are collaborative filtering and content-based. Content-based recommendations rely solely on the data provided by the user. A user develops a profile through their activity and the engine recommends products or services. The only problem is that if there is little data on a user, poor recommendations are made.

Collaborative filtering is crowd-based recommendation. What this means is that the data of many is used to recommend to one. This bypasses the concern with a lack of data that can happen with content-based recommendations.

There are three common ways to develop collaborative filters and they are as follows

  • User-based collaborative filtering
  • Item-based collaborative filtering
  • Singular value decomposition and Principal component analysis

User-based Collaborative Filtering (UBCF)

UBCF uses k-nearest neighbor or some similarity measurement such as Pearson correlation to predict the missing rating for a user. Once the number of neighbors is determined, the algorithm averages the neighbors' ratings to predict the value for the user.

The predicted value can be used to determine if a user will like a particular product or service. Low values are not recommended while high values may be. A major weakness of UBCF is that calculating the similarities between users requires keeping all the data in memory, which is a computational challenge.
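The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ratings are invented, the function names are hypothetical, similarity is Pearson correlation over the items two users have both rated, and the prediction is a simple unweighted average over the k most similar users.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two users over the items both rated."""
    shared = [item for item in a if item in b]
    if len(shared) < 2:
        return 0.0
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def predict_ubcf(ratings, user, item, k=2):
    """Average the item's rating over the k users most similar to `user`."""
    candidates = [
        (pearson(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours = sorted(candidates, reverse=True)[:k]
    return mean(rating for _, rating in neighbours)

# Invented ratings on a 1-5 scale; "ann" has not rated item "D".
ratings = {
    "ann":  {"A": 5, "B": 3, "C": 4},
    "bob":  {"A": 5, "B": 3, "C": 4, "D": 5},
    "cara": {"A": 4, "B": 2, "C": 3, "D": 4},
    "dan":  {"A": 1, "B": 5, "C": 2, "D": 1},
}

predicted = predict_ubcf(ratings, "ann", "D", k=2)
print(predicted)  # 4.5
```

Here "bob" and "cara" rate items in the same pattern as "ann", so their ratings of "D" drive the prediction, while "dan", whose tastes run the other way, is ignored.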

Item-based Collaborative Filtering (IBCF)

IBCF uses the similarity between items to make recommendations. This is calculated with the same measures as before (k-NN, Pearson correlation, etc.). After finding the most similar items, the algorithm takes the average of the individual user's ratings of those items to predict the rating the user would give the unknown item.

In order to assure accuracy, it is necessary to have a huge number of items that can have the similarities calculated. This leads to the same computational problems mentioned earlier.
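A minimal item-based sketch looks like this. The ratings and function names are invented for illustration, and cosine similarity is used only to keep the code short (the measures named in the post, such as Pearson correlation, could be swapped in).

```python
def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return num / den if den else 0.0

def predict_ibcf(ratings, user, target, k=2):
    """Average the user's own ratings of the k items most similar to `target`."""
    # Users (other than the target user) who have rated the target item.
    others = [u for u in ratings if u != user and target in ratings[u]]
    scored = []
    for item, own_rating in ratings[user].items():
        raters = [u for u in others if item in ratings[u]]
        if not raters:
            continue
        item_vec = [ratings[u][item] for u in raters]
        target_vec = [ratings[u][target] for u in raters]
        scored.append((cosine(item_vec, target_vec), own_rating))
    top = sorted(scored, reverse=True)[:k]
    return sum(rating for _, rating in top) / len(top)

# Invented ratings on a 1-5 scale; "ann" has not rated item "D".
ratings = {
    "ann":  {"A": 5, "B": 3, "C": 4},
    "bob":  {"A": 5, "B": 3, "C": 4, "D": 5},
    "cara": {"A": 4, "B": 2, "C": 3, "D": 4},
    "dan":  {"A": 1, "B": 5, "C": 2, "D": 1},
}

predicted = predict_ibcf(ratings, "ann", "D", k=2)
print(predicted)  # 4.5
```

Items "A" and "C" are rated most like "D" by the other users, so the prediction is the average of what "ann" herself gave those two items.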

Singular Value Decomposition and Principal Component Analysis (SVD, PCA)

When the dataset is too big for the first two options, SVD or PCA could be an appropriate choice. Put simply, each of these two methods reduces the dimensionality by making latent variables. Doing this reduces the computational effort as well as the noise in the data.

With SVD, we can reduce the data to a handful of factors. The remaining factors can be used to reproduce the original values which can then be used to predict missing values.

For PCA, items are combined in components and like items that load on the same component can be used to make predictions for an unknown data point for a user.


Recommendation engines play a critical part in generating sales for many companies. This post provided an insight into how they are created. Understanding this can allow you to develop recommendation engines based on data.

Exploratory Data Analysis

In data science, exploratory data analysis serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analysis. The steps include the following.

  • Consult your questions
  • Check the structure of the dataset
  • Use visuals

Consult Your Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analysis to conduct.

During the data exploration stage, the purpose of a research question is not analysis but rather to determine if your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America is and your data tells you the salary of office workers, there is a problem. Your question (average height) cannot be answered with the data that you have (office workers' salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question moves away from generalities and deals with a clearly defined phenomenon. For example, “what is the average height of males age 20-30 in the United States?” This question clearly identifies what we want to know (average height) and among whom (20-30, male Americans).

Not only can you confirm if your questions are answerable, you can also decide if you need to be more or less specific with your questions. Returning to our average height question, we may find that we can be more specific and check average height by state if we want. Or, we might learn that we can only determine the average height for a region. All this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns are in the dataset, the sample size, as well as looking for missing data and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analysis, and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the dataset. If the stakeholders tell you that there should be several million rows in the data set and you check and there are only several thousand you know there is a problem. This concept also applies to the number of features you expect as well.

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as to look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions, tables of summary statistics with bar plots might be sufficient. For comparison questions, summary stats and boxplots may be enough. For relationship questions, summary stat tables with a scatterplot may be enough. Please keep in mind that it is much more complicated than this.


Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.

Data Science Research Questions

Developing research questions is an absolute necessity in completing any research project. The questions you ask help to shape the type of analysis that you need to conduct.

The type of questions you ask in the context of analytics and data science are similar to those found in traditional quantitative research. Yet data science, like any other field, has its own distinct traits.

In this post, we will look at six different types of questions that are used frequently in the context of the field of data science. The six questions are…

  1. Descriptive
  2. Exploratory/Inferential
  3. Predictive
  4. Causal
  5. Mechanistic

Understanding the types of question that can be asked will help anyone involved in data science to determine what exactly it is that they want to know.


A descriptive question seeks to describe a characteristic of the dataset. For example, if I collect the GPA of 100 university students, I may want to know what the average GPA of the students is. Seeking the average is one example of a descriptive question.

With descriptive questions, there is no need for a hypothesis as you are not trying to infer, establish a relationship, or generalize to a broader context. You simply want to know a trait of the dataset.


Exploratory questions seek to identify things that may be “interesting” in the dataset. Examples of things that may be interesting include trends, patterns, and relationships among variables.

Exploratory questions generate hypotheses. This means that they lead to something that may be more formally questioned and tested. For example, if you have GPA and hours of sleep for university students, you may explore the potential that there is a relationship between these two variables.


Inferential questions are an extension of exploratory questions. What this means is that the exploratory question is formally tested by developing an inferential question. Often, the difference between an exploratory and inferential question is the following

  1. Exploratory questions are usually developed first
  2. Exploratory questions generate inferential questions
  3. Inferential questions are tested often on a different dataset from exploratory questions

In our example, if we find a relationship between GPA and sleep in our dataset, we may test this relationship in a different, perhaps larger dataset. If the relationship holds, we can then generalize this to the population of the study.


Causal questions address whether a change in one variable directly affects another. In analytics, A/B testing is one form of data collection that can be used to answer causal questions. For example, we may develop two versions of a website and see which one generates more sales.

In this example, the type of website is the independent variable and sales is the dependent variable. By controlling the type of website people see we can see if this affects sales.


Mechanistic questions deal with how one variable affects another. This is different from causal questions, which focus on whether one variable affects another. Continuing with the website example, we may take a closer look at the two different websites and see what it was about them that made one more successful in generating sales. It may be that one had more banners than another or fewer pictures. Perhaps there were different products offered on the home page.

All of these different features, of course, require data that helps to explain what is happening. This leads to an important point: the questions that can be asked are limited by the available data. You can’t answer a question if you have no data that bears on it.


Answering questions is essentially what research is about. In order to do this, you have to know what your questions are. This information will help you to decide on the analysis you wish to conduct. Familiarity with the types of research questions that are common in data science can help you to approach and complete analysis much faster than when this is unclear.

Regularized Linear Regression

Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with hundreds of variables, problems have begun to arise. For example, using stepwise or best subset methods with regression could take hours if not days to converge on even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

  • Definition of regularized regression
  • Ridge regression
  • Lasso regression
  • Elastic net regression


Regularization adds a shrinkage penalty to the residual sum of squares (RSS) that the model minimizes, which discourages large coefficients. The strength of the penalty is set by a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said

RSS + λ(normalized coefficients)

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of fitting k-1 models when k is the number of variables available (for example, 50 variables would lead to 49 models!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollinearity in a model. As such, the benefits of using regularization make it clear that this should be considered when working with larger datasets.

Ridge Regression

Ridge regression penalizes the sum of the squared coefficients, as shown in the equation below

RSS + λ(Σ normalized coefficients^2)

This is also referred to as the L2-norm. As lambda increases in value, the coefficients in the model are shrunk towards 0 but never reach 0. This is how the variance of the model is reduced. The higher the lambda, the smaller the coefficients become, at the cost of a somewhat larger RSS on the training data.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead, the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.

Lasso Regression


Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm, which is the sum of the absolute values of the coefficients, as shown in the equation below

RSS + λ(Σ|normalized coefficients|)

This shrinkage penalty can reduce a coefficient all the way to 0, which is another way of saying that variables will be removed from the model. One problem is that highly correlated variables that need to be in your model may be removed when Lasso shrinks coefficients. This is one reason why ridge regression is still used.
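The difference between "shrinks toward 0" and "shrinks to exactly 0" is easiest to see in a special case. The sketch below assumes an orthonormal design matrix, which real data rarely satisfies, because in that case both penalties have simple closed-form solutions: ridge divides each least-squares coefficient by (1 + lambda), while lasso subtracts lambda/2 from its magnitude and clips at zero. The starting coefficients are invented.

```python
def ridge_shrink(b, lam):
    """Ridge solution for one coefficient under an orthonormal design."""
    return b / (1 + lam)

def lasso_shrink(b, lam):
    """Lasso solution (soft-thresholding) under an orthonormal design."""
    magnitude = abs(b) - lam / 2
    if magnitude <= 0:
        return 0.0  # the variable is dropped from the model
    return magnitude if b > 0 else -magnitude

# Invented least-squares coefficients: one large, two small.
ols = [3.0, -0.4, 0.1]
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print(ridge)  # [1.5, -0.2, 0.05]
print(lasso)  # [2.5, 0.0, 0.0]
```

Ridge keeps all three variables with smaller coefficients, whereas lasso keeps only the large one and removes the other two.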

Elastic Net

Elastic net aims to combine the strengths of ridge and Lasso. It selects variables, as Lasso does and ridge does not, while also keeping groups of correlated variables together, as ridge does and Lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression and if alpha is set to 1 it is the same as lasso regression. Below is the formula used for elastic net regression

(RSS + λ[(1 – alpha)(Σ|normalized coefficients|^2)/2 + alpha(Σ|normalized coefficients|)])/N

As such, when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model.
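To see how the two tuning parameters interact, the elastic net formula can be written as a small function. The numbers below are invented, and the function only evaluates the penalized objective for a given set of coefficients (it does not fit a model).

```python
def elastic_net_objective(rss, coefs, lam, alpha, n):
    """(RSS + lambda * [(1 - alpha) * sum(b^2) / 2 + alpha * sum(|b|)]) / N"""
    l2 = sum(b * b for b in coefs)   # ridge-style part
    l1 = sum(abs(b) for b in coefs)  # lasso-style part
    return (rss + lam * ((1 - alpha) * l2 / 2 + alpha * l1)) / n

# Invented values for illustration.
rss, coefs, lam, n = 10.0, [3.0, -4.0], 2.0, 5

ridge_like = elastic_net_objective(rss, coefs, lam, alpha=0.0, n=n)
lasso_like = elastic_net_objective(rss, coefs, lam, alpha=1.0, n=n)
blended = elastic_net_objective(rss, coefs, lam, alpha=0.5, n=n)

print(ridge_like, lasso_like, blended)
```

With alpha at 0 only the squared term contributes, with alpha at 1 only the absolute-value term contributes, and anything in between blends the two.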


Regularized regression was developed as an answer to the growth in the size of data sets and the number of variables in them. Ridge, lasso and elastic net all provide solutions to converging over large datasets and selecting features.

Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This process is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in R or Excel. If you created it in Excel, just save it as a csv and load it into R. Below is the initial code

apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This is a small dataset with the columns “weight” and “location”. Here are some of the problems

  • Weights are in different units
  • Weights are written in different ways
  • Location is not consistent

In order to have any success with data wrangling, you need to state specifically what it is you want to do. Here are our goals for this project

  • Convert the “weight” variable to a numerical variable instead of character
  • Remove the text and have only numbers in the “weight” variable
  • Change weights in grams to kilograms
  • Convert the “location” variable to a factor variable instead of character
  • Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

  1. We created a variable called “corrected.weight”
  2. We used the function “as.numeric”, which converts whatever is inside it into a numerical variable
  3. Inside “as.numeric” we used the “gsub” function, which allows us to substitute one value for another.
  4. Inside “gsub” we set the pattern argument to “[[:alpha:]]” and the replacement to “”. This told R to look for any lowercase or uppercase letters and replace them with nothing, in other words remove them. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights recorded in grams to kilograms so that everything is in the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

  1. We created a variable called “gram.error”
  2. We used the “grep” function to search the “weight” variable in the apple dataframe for input that contains four digits in a row. This is what the “[[:digit:]]{4}” argument means. We do not change any values yet; we just store their positions in “gram.error”
  3. Once this information is stored in “gram.error” we use it to subset the “corrected.weight” variable.
  4. We tell R to take every value of “corrected.weight” at the positions stored in “gram.error”, divide it by 1000, and save the result back into “corrected.weight”. Dividing by 1000 converts the value from grams to kilograms.
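Here is a small sketch of the same idea on invented data. The names “raw.weight”, “clean.weight” and “gram.rows” are hypothetical, and the rule that four digits in a row means grams is the assumption used in this post.

```r
# Hypothetical raw weights; two of them were entered in grams
raw.weight <- c("3.2kg", "7200 grams", "42 Kg", "2700g")
clean.weight <- as.numeric(gsub("[[:alpha:]]", "", raw.weight))

# Positions whose raw text contains four digits in a row are assumed to be grams
gram.rows <- grep(pattern = "[[:digit:]]{4}", raw.weight)

# Convert only those positions from grams to kilograms
clean.weight[gram.rows] <- clean.weight[gram.rows] / 1000
clean.weight
```

Note that “grep” returns positions, not values, which is why it can be used directly as a subset.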

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to the United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain, but in short we used the “agrep” function to tell R to search the “location” variable for values similar to our term “europe”. The other arguments set how far a value may differ from the term “europe” and still count as a match. This process is repeated for the term “us”. We then store the location variable from the “apple” dataframe in a new variable called “corrected.location”. We then apply the two objects we made, called “europe” and “america”, to the “corrected.location” variable. Lastly, we deal with “United States” by replacing it with “US” using the “gsub” function.
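To see approximate matching on its own, here is a simplified sketch. It uses an invented “location” vector and the default “max.distance” of “agrep” instead of the list used above, so it is an illustration and not the exact call from this post.

```r
# Hypothetical location values with inconsistent spelling and case
location <- c("Europe", "Eurpe", "EUROPE", "US", "Asia")

# Positions of values that approximately match "europe", ignoring case
europe.rows <- agrep(pattern = "europe", location, ignore.case = TRUE)

# Overwrite every approximate match with one consistent spelling
fixed.location <- location
fixed.location[europe.rows] <- "europe"
fixed.location
```

Because “agrep” tolerates a small number of edits, “Eurpe” is caught while “US” and “Asia” are left alone.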

We are almost done. Now we combine our two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below

<-data.frame(corrected.weight,corrected.location)
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

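If you use the “str” function on the new dataframe you will see that “location” was automatically converted to a factor (this happens in versions of R before 4.0, where “data.frame” converts strings to factors by default).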

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

Making Regression and Model Trees in R

In this post, we will look at an example of regression trees. Regression trees use a decision-tree-like approach to develop prediction models involving numerical data. In our example, we will be trying to predict how many kids a person has based on several independent variables in the “PSID” data set in the “Ecdat” package.

Let’s begin by loading the necessary packages and data set. The code is below

library(Ecdat);library(rpart);library(rpart.plot);library(RWeka)
data(PSID)
str(PSID)
## 'data.frame':    4856 obs. of  8 variables:
##  $ intnum  : int  4 4 4 4 5 6 6 7 7 7 ...
##  $ persnum : int  4 6 7 173 2 4 172 4 170 171 ...
##  $ age     : int  39 35 33 39 47 44 38 38 39 37 ...
##  $ educatn : int  12 12 12 10 9 12 16 9 12 11 ...
##  $ earnings: int  77250 12000 8000 15000 6500 6500 7000 5000 21000 0 ...
##  $ hours   : int  2940 2040 693 1904 1683 2024 1144 2080 2575 0 ...
##  $ kids    : int  2 2 1 2 5 2 3 4 3 5 ...
##  $ married : Factor w/ 7 levels "married","never married",..: 1 4 1 1 1 1 1 4 1 1 ...
summary(PSID)
##      intnum        persnum            age           educatn     
##  Min.   :   4   Min.   :  1.00   Min.   :30.00   Min.   : 0.00  
##  1st Qu.:1905   1st Qu.:  2.00   1st Qu.:34.00   1st Qu.:12.00  
##  Median :5464   Median :  4.00   Median :38.00   Median :12.00  
##  Mean   :4598   Mean   : 59.21   Mean   :38.46   Mean   :16.38  
##  3rd Qu.:6655   3rd Qu.:170.00   3rd Qu.:43.00   3rd Qu.:14.00  
##  Max.   :9306   Max.   :205.00   Max.   :50.00   Max.   :99.00  
##                                                  NA's   :1      
##     earnings          hours           kids                 married    
##  Min.   :     0   Min.   :   0   Min.   : 0.000   married      :3071  
##  1st Qu.:    85   1st Qu.:  32   1st Qu.: 1.000   never married: 681  
##  Median : 11000   Median :1517   Median : 2.000   widowed      :  90  
##  Mean   : 14245   Mean   :1235   Mean   : 4.481   divorced     : 645  
##  3rd Qu.: 22000   3rd Qu.:2000   3rd Qu.: 3.000   separated    : 317  
##  Max.   :240000   Max.   :5160   Max.   :99.000   NA/DF        :   9  
##                                                   no histories :  43

The variables “intnum” and “persnum” are for identification and are useless for our analysis. We will now explore our dataset with the following code.











##       married never married       widowed      divorced     separated 
##          3071           681            90           645           317 
##         NA/DF  no histories 
##             9            43

Almost all of the variables are non-normal. However, this is not a problem when using regression trees. There are some major problems with the “kids” and “educatn” variables. Each of these variables has values at 98 and 99. When the data for this survey was collected, 98 meant the respondent did not know the answer and 99 meant they did not want to say. Since both of these variables are numerical we have to do something with them so they do not ruin our analysis.

We are going to recode all values equal to or greater than 98 as 3 for the “kids” variable. The number 3 means they have 3 kids. This number was picked because it was the most common response for the other respondents. For the “educatn” variable all values equal to or greater than 98 are recoded as 12, which means that they completed 12th grade. Again this was the most frequent response. Below is the code.

PSID$kids[PSID$kids >= 98] <- 3
PSID$educatn[PSID$educatn >= 98] <- 12

Another peek at the histograms for these two variables and things look much better.





Make Model and Visualization

Now that everything is cleaned up we now need to make our training and testing data sets as seen in the code below.
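The split code itself does not appear here, so the following is only a sketch of one common way to do it. The toy data frame, the seed and the 75/25 proportion are all assumptions; with the real data the same steps would be applied to “PSID” to produce “PSID_train” and “PSID_test”.

```r
set.seed(502)  # arbitrary seed so the split can be reproduced

# Toy stand-in for the cleaned data set
toy <- data.frame(kids = rpois(20, 2),
                  age = sample(30:50, 20, replace = TRUE))

# Randomly pick 75% of the row numbers for training
train.rows <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train.rows, ]   # rows used to fit the model
toy_test <- toy[-train.rows, ]   # remaining rows held back for evaluation
```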


We will now make our model and also create a visual of it. Our goal is to predict the number of children a person has based on their age, education, earnings, hours worked, and marital status. Below is the code

#make model
PSID_Model<-rpart(kids~age+educatn+earnings+hours+married, PSID_train)
#make visualization
rpart.plot(PSID_Model, digits=3, fallen.leaves = TRUE,type = 3, extra=101)


The first split on the tree is by earnings. On the left, we have those who make more than 20k and on the right those who make less than 20k. On the left, the next split is by marital status: those who never married or whose status is not applicable have on average 0.74 kids, while those who are married, widowed, divorced, separated, or have no history have on average 1.72.

The right side of the tree is much more complicated and I will not explain all of it. After the split for those making less than 20k, the next split is again by marital status. Those who are married, widowed, divorced, separated, or have no history and who have less than 13.5 years of education have 2.46 kids on average.

Make Prediction Model and Conduct Evaluation

Our next task is to make the prediction model. We will do this with the following code

PSID_pred<-predict(PSID_Model, PSID_test)

We will now evaluate the model. We will do this three different ways. The first involves looking at the summary statistics of the prediction model and the testing data. The numbers should be about the same. After that, we will calculate the correlation between the prediction model and the testing data. Lastly, we will use a technique called the mean absolute error. Below is the code for the summary statistics and correlation.

summary(PSID_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.735   2.041   2.463   2.226   2.463   2.699
summary(PSID_test$kids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   2.494   3.000  10.000
cor(PSID_pred, PSID_test$kids)
## [1] 0.308116

Looking at the summary stats, our model has a hard time predicting extreme values: the maximum prediction is 2.7 while the maximum in the testing data is 10. However, how often do people have ten kids? As such, this is not a major concern.

A look at the correlation finds that it is pretty low (0.31). This means that the predictions and the actual values have little in common, so we need to make some changes. The mean absolute error measures the average absolute difference between the predicted and actual values. We need to make a function first before we analyze our model.

MAE<-function(actual, predicted){mean(abs(actual-predicted))}

We now assess the model with the code below

MAE(PSID_pred, PSID_test$kids)
## [1] 1.134968

The results indicate that on average the difference between our model’s prediction of the number of kids and the actual number of kids was 1.13 on a scale of 0 – 10. That’s a lot of error. However, we need to compare this number to how well the mean does to give us a benchmark. The code is below.

MAE(ave_kids, PSID_test$kids)
## [1] 1.178909
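To make the benchmark idea concrete, here is a sketch on invented numbers. The vectors “actual” and “predicted” are made up, and the baseline here uses the mean of the same values, which is an assumption since the definition of “ave_kids” is not shown above.

```r
# Mean absolute error: average size of the prediction mistakes
MAE <- function(actual, predicted) mean(abs(actual - predicted))

actual <- c(0, 2, 2, 3, 10)          # invented true values
predicted <- c(1, 2, 2.5, 2.5, 3)    # invented model predictions

model.mae <- MAE(actual, predicted)

# Baseline: predict the mean for every example
baseline <- rep(mean(actual), length(actual))
baseline.mae <- MAE(actual, baseline)

c(model = model.mae, baseline = baseline.mae)
```

A model is only worth keeping if its error is clearly below the baseline error.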

Model Tree

Our model, with a mean absolute error of 1.13, is slightly better than using the mean, which gives 1.18. We will try to improve our model by switching from a regression tree to a model tree, which uses a slightly different approach for prediction. In a model tree each leaf ends in a linear regression model rather than a single average. Below is the code.

PSIDM5<- M5P(kids~age+educatn+earnings+hours+married, PSID_train)
PSIDM5
## M5 pruned model tree:
## (using smoothed linear models)
## earnings <= 20754 : 
## |   earnings <= 2272 : 
## |   |   educatn <= 12.5 : LM1 (702/111.555%)
## |   |   educatn >  12.5 : LM2 (283/92%)
## |   earnings >  2272 : LM3 (1509/88.566%)
## earnings >  20754 : LM4 (1147/82.329%)
## LM num: 1
## kids = 
##  0.0385 * age 
##  + 0.0308 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.0187 * married=married,divorced,widowed,separated,no histories 
##  + 0.2986 * married=divorced,widowed,separated,no histories 
##  + 0.0082 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 0.7181
## LM num: 2
## kids = 
##  0.002 * age 
##  - 0.0028 * educatn 
##  + 0.0002 * earnings 
##  - 0 * hours 
##  + 0.7854 * married=married,divorced,widowed,separated,no histories 
##  - 0.3437 * married=divorced,widowed,separated,no histories 
##  + 0.0154 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 1.4075
## LM num: 3
## kids = 
##  0.0305 * age 
##  - 0.1362 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.9028 * married=married,divorced,widowed,separated,no histories 
##  + 0.2151 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 2.0218
## LM num: 4
## kids = 
##  0.0393 * age 
##  - 0.0658 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.8845 * married=married,divorced,widowed,separated,no histories 
##  + 0.3666 * married=widowed,separated,no histories 
##  + 0.0037 * married=separated,no histories 
##  + 0.4712
## Number of Rules : 4
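One way to read the top of this printout is as a routing function that decides which linear model an example is sent to. The sketch below copies the cut points from the printout; the function name “pick_lm” is hypothetical and is not part of RWeka.

```r
# Route an example to one of the four linear models using the printed splits
pick_lm <- function(earnings, educatn) {
  if (earnings > 20754) return("LM4")
  if (earnings > 2272) return("LM3")
  # earnings <= 2272: the last split is on years of education
  if (educatn > 12.5) "LM2" else "LM1"
}

pick_lm(earnings = 30000, educatn = 12)
pick_lm(earnings = 1000, educatn = 16)
```

Once the linear model is chosen, its coefficients from the printout are used to calculate the predicted number of kids.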

It would take too much time to explain everything. You can read part of this model as follows

  • earnings greater than 20754: use linear model 4
  • earnings less than 20754 and less than 2272, with less than 12.5 years of education: use linear model 1
  • earnings less than 20754 and less than 2272, with more than 12.5 years of education: use linear model 2
  • earnings less than 20754 and greater than 2272: use linear model 3

The printout then shows each of the linear models. Lastly, we will evaluate our model tree with the following code

PSIDM5_Pred<-predict(PSIDM5, PSID_test)
summary(PSIDM5_Pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3654  2.0490  2.3400  2.3370  2.6860  4.4220
cor(PSIDM5_Pred, PSID_test$kids)
## [1] 0.3486492
MAE(PSID_test$kids, PSIDM5_Pred)
## [1] 1.088617

This model is slightly better. For example, it is better at predicting extreme values, with a maximum prediction of 4.4 compared to 2.7 for the regression tree model. The correlation is 0.35, which is better than 0.31 for the regression tree model. Lastly, the mean absolute error shows a slight improvement to 1.09 compared to 1.13 in the regression tree model.


This post provided examples of the use of regression trees and model trees. Both of these models make prediction a key component of their analysis.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories, such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

The use of trees in making numeric predictions involves the use of regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with the use of trees.

Regression Trees and Model Trees

Regression trees have been around since the 1980s. They work by predicting the average value of the examples that reach a given leaf in the tree. Despite their name, there is no regression involved with regression trees. Regression trees are straightforward to interpret but at the expense of accuracy.

Model trees are similar to regression trees but fit a multiple regression model to the examples at each leaf in a tree. This leads to many different regression models being used throughout a tree, which makes model trees harder to interpret and understand than regression trees. However, they are often more accurate than regression trees.
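The difference between the two kinds of leaves can be sketched with a single hand-made split. Everything here is invented for illustration: the data, the split at x <= 5, and the variable names. A real tree would choose its own splits.

```r
# Toy data with a perfectly linear outcome
toy <- data.frame(x = 1:10, y = 2 * (1:10))
left <- toy$x <= 5   # one hand-picked split creating two leaves

# Regression tree leaf: predict the mean outcome of the examples in the leaf
reg_pred <- ifelse(left, mean(toy$y[left]), mean(toy$y[!left]))

# Model tree leaf: fit a separate linear regression inside each leaf
fit_left <- lm(y ~ x, data = toy[left, ])
fit_right <- lm(y ~ x, data = toy[!left, ])
model_pred <- numeric(nrow(toy))
model_pred[left] <- predict(fit_left, toy[left, ])
model_pred[!left] <- predict(fit_right, toy[!left, ])

# Compare the mean absolute error of the two approaches
mae_reg <- mean(abs(toy$y - reg_pred))
mae_model <- mean(abs(toy$y - model_pred))
c(regression_tree = mae_reg, model_tree = mae_model)
```

On data with a trend inside each leaf, the leaf averages miss the trend while the leaf regressions follow it.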

Both types of trees have the goal of making groups that are as homogeneous as possible. For classification trees, entropy is used to measure the homogeneity of groups. For numeric prediction trees, the standard deviation reduction (SDR) is used: the standard deviation of the outcome before a split minus the weighted standard deviation of the groups after it. The details of SDR are somewhat technical and will not be covered further here.
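For readers who want to see the idea in numbers, here is a minimal sketch of SDR for a two-way split. The function name “sdr” and the values are made up; actual tree packages compute this internally.

```r
# Standard deviation reduction for splitting 'parent' into 'left' and 'right'
sdr <- function(parent, left, right) {
  n <- length(parent)
  weighted.sd <- length(left) / n * sd(left) + length(right) / n * sd(right)
  sd(parent) - weighted.sd
}

y <- c(1, 2, 3, 11, 12, 13)

# A split that separates low values from high values
sdr_clean <- sdr(y, left = c(1, 2, 3), right = c(11, 12, 13))

# A split that mixes low and high values in both groups
sdr_mixed <- sdr(y, left = c(1, 2, 11), right = c(3, 12, 13))

c(clean = sdr_clean, mixed = sdr_mixed)
```

The split with the larger SDR produces the more homogeneous groups, so it is the one a tree would prefer.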

Strengths of Numeric Prediction Trees

Numeric prediction trees do not have the assumptions of linear regression. As such, they can be used to model non-normal and/or non-linear data. In addition, if a dataset has a large number of feature variables, a numeric prediction tree can select the most appropriate ones automatically. Lastly, numeric prediction trees do not need the model to be specified in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amount of data in the training set in order to develop a testable model. It is also hard to tell which variables are most important in shaping the outcome. Lastly, sometimes numeric prediction trees are hard to interpret. This naturally limits their usefulness among people who lack statistical training.


Numeric prediction trees combine the strength of decision trees with the ability to digest a large number of numerical variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, when possible, it is usually wise to try simpler methods first.

Making a Decision Tree in R

In this post, we are going to learn how to use the C5.0 algorithm to make a classification tree in order to make predictions about gender based on wage, education, and job experience using a data set in the “Ecdat” package in R. Below is some code to get started.

library(Ecdat); library(C50); library(gmodels)

We now will explore the data to get a sense of what is happening in it. Below is the code for this

data(Wages1)
str(Wages1)
## 'data.frame': 3294 obs. of 4 variables:
## $ exper : int 9 12 11 9 8 9 8 10 12 7 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ school: int 13 12 11 14 14 14 12 12 10 12 ...
## $ wage : num 6.32 5.48 3.64 4.59 2.42 ...

hist(Wages1$exper)
summary(Wages1$exper)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 8.000 8.043 9.000 18.000

hist(Wages1$wage)
summary(Wages1$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07656 3.62200 5.20600 5.75800 7.30500 39.81000

hist(Wages1$school)
summary(Wages1$school)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 11.00 12.00 11.63 12.00 16.00

table(Wages1$sex)
## female male
## 1569 1725

As you can see, we have four features (exper, sex, school, wage) in the “Wages1” data set. The histogram for “exper” indicates that it is normally distributed. The “wage” feature is highly right-skewed, with a mean above the median and a long tail up to 39.81, and almost bimodal. This is not a big deal as classification trees are robust against non-normality. The “school” feature is mostly normally distributed. Lastly, the “sex” feature is categorical but there is almost an equal number of men and women in the data. All of the outputs for the means are listed above.

Create Training and Testing Sets

We now need to create our training and testing data sets. In order to do this, we need to first randomly reorder our data set. For example, if the data is sorted by one of the features, to split it now would lead to extreme values all being lumped together in one data set.

To make things more confusing, we also need to set our seed. This allows us to be able to replicate our results. Below is the code for doing this.


What we did is explained as follows

  1. We set the seed using the ‘set.seed’ function (we randomly picked the number 12345)
  2. We created the variable ‘Wage_rand’ and assigned it the following
  3. We used the ‘runif’ function to create a list of 3294 random numbers between 0 and 1. We did this because there are a total of 3294 examples in the ‘Wages1’ dataset.
  4. After generating the 3294 numbers we used the “order” function to get the positions that would sort them from smallest to largest, which amounts to a random ordering of the numbers 1 to 3294.
  5. We then used this random ordering to rearrange the examples in the “Wages1” dataset

We will now create our training and testing set using the code below.
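The shuffle and split described above can be sketched as follows. This uses a ten-row toy data frame instead of “Wages1”, and the 7/3 cut is an assumption made only for the example; the printouts further down show that the real training set holds 2294 examples and the testing set 1000.

```r
set.seed(12345)  # same seed as mentioned in the steps above

# Toy stand-in for the Wages1 data, deliberately sorted by sex
toy <- data.frame(id = 1:10,
                  sex = rep(c("female", "male"), each = 5))

# order(runif(n)) turns n random draws into a random ordering of 1..n
toy_rand <- toy[order(runif(nrow(toy))), ]

# First 7 shuffled rows for training, the remaining 3 for testing
toy_train <- toy_rand[1:7, ]
toy_test <- toy_rand[8:10, ]
```

Because the rows are shuffled first, taking the top rows for training no longer lumps one sex into a single set.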


Make the Model
We can now begin training a model. Below is the code.

Wage_model<-C5.0(Wage_train[-2], Wage_train$sex)

The coding for making the model should be familiar by now. One thing that is new is the brackets with the -2 inside. This tells R to ignore the second column in the dataset. We are doing this because we want to predict sex. If it is a part of the independent variables we cannot predict it. We can now examine the results of our model by using the following code.

Wage_model
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
## Classification Tree
## Number of samples: 2294
## Number of predictors: 3
## Tree size: 9
## Non-standard options: attempt to group attributes
summary(Wage_model)
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
## -------------------------------
## Class specified by attribute `outcome'
## Read 2294 cases (4 attributes) from
## Decision tree:
## wage <= 3.985179:
## :...school > 11: female (345/109)
## :   school <= 11:
## :   :...exper <= 8: female (224/96)
## :       exper > 8: male (143/59)
## wage > 3.985179:
## :...wage > 9.478313: male (254/61)
##     wage <= 9.478313:
##     :...school > 12: female (320/132)
##         school <= 12:
##         :...school <= 10: male (246/70)
##             school > 10:
##             :...school <= 11: male (265/114)
##                 school > 11:
##                 :...exper <= 6: female (83/35)
##                     exper > 6: male (414/173)
## Evaluation on training data (2294 cases):
## Decision Tree
## ----------------
## Size Errors
## 9 849(37.0%) <<
## (a) (b) <-classified as
## ---- ----
## 600 477 (a): class female
## 372 845 (b): class male
## Attribute usage:
## 100.00% wage
## 88.93% school
## 37.66% exper
## Time: 0.0 secs

The “Wage_model” printout indicates a small decision tree with a size of 9. The “summary” function shows the actual decision tree. It’s somewhat complicated but I will explain the beginning part of the tree.

If the wage is less than or equal to 3.98 THEN

If years of school are greater than 11 the person is female ELSE

If years of school are less than or equal to 11 THEN

If the experience of the person is less than or equal to 8 the person is female ELSE

If the experience is greater than 8 the person is male, etc.

The next part of the output shows the amount of error. This model misclassified 37% of the examples, which is pretty high. Reading the table by row, 477 women were misclassified as men and 372 men were misclassified as women.
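The error figure can be checked by hand from the four counts in the table. The sketch below retypes those counts into a matrix, with rows as the actual class and columns as the class assigned by the tree; the object names are hypothetical.

```r
# Counts copied from the training printout: rows = actual, columns = classified as
conf <- matrix(c(600, 477,
                 372, 845),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("female", "male"),
                               classified = c("female", "male")))

# Everything off the diagonal is a mistake
errors <- sum(conf) - sum(diag(conf))
error.rate <- errors / sum(conf)

errors
round(100 * error.rate, 1)
```

The same calculation works for any of the cross tables later in this post.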

Predict with the Model

We will now see how well this model predicts gender in the testing set. Below is the code

Wage_pred<-predict(Wage_model, Wage_test)

CrossTable(Wage_test$sex, Wage_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual sex', 'predicted sex'))

The output will not display properly here. Please see C50 for a pdf of this post and go to page 7

Again, this code should be mostly familiar for the prediction model. For the table, we are comparing the test set sex with predicted sex. The overall model was correct for (269 + 346)/1000 examples, a 61.5% accuracy rate, which is pretty bad.

Improve the Model

There are two ways we are going to try and improve our model. The first is adaptive boosting and the second is error cost.

Adaptive boosting involves making several models that “vote” on how to classify an example. To do this you need to add the ‘trials’ parameter to the code. The ‘trials’ parameter sets the upper limit of the number of models R will build. Below is the code for this and the code for the results.

Wage_boost10<-C5.0(Wage_train[-2], Wage_train$sex, trials = 10)
#view boosted model
summary(Wage_boost10)
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex, trials = 10)
## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
## -------------------------------
## Class specified by attribute `outcome'
## Read 2294 cases (4 attributes) from
## ----- Trial 0: -----
## Decision tree:
## wage <= 3.985179:
## :...school > 11: female (345/109)
## :   school <= 11:
## :   :...exper <= 8: female (224/96)
## :       exper > 8: male (143/59)
## wage > 3.985179:
## :...wage > 9.478313: male (254/61)
##     wage <= 9.478313:
##     :...school > 12: female (320/132)
##         school <= 12:
##         :...school <= 10: male (246/70)
##             school > 10:
##             :...school <= 11: male (265/114)
##                 school > 11:
##                 :...exper <= 6: female (83/35)
##                     exper > 6: male (414/173)
## ----- Trial 1: -----
## Decision tree:
## wage > 6.848846: male (663.6/245)
## wage <= 6.848846:
## :...school <= 10: male (413.9/175)
##     school > 10: female (1216.5/537.6)
## ----- Trial 2: -----
## Decision tree:
## wage <= 3.234474: female (458.1/192.9)
## wage > 3.234474: male (1835.9/826.2)
## ----- Trial 3: -----
## Decision tree:
## wage > 9.478313: male (234.8/82.1)
## wage <= 9.478313:
## :...school <= 11: male (883.2/417.8)
##     school > 11: female (1175.9/545.1)
## ----- Trial 4: -----
## Decision tree:
## male (2294/1128.1)
## *** boosting reduced to 4 trials since last classifier is very inaccurate
## Evaluation on training data (2294 cases):
## Trial Decision Tree
## ----- ----------------
## Size Errors
## 0 9 849(37.0%)
## 1 3 917(40.0%)
## 2 2 958(41.8%)
## 3 3 949(41.4%)
## boost 864(37.7%) <<
## (a) (b) <-classified as
## ---- ----
## 507 570 (a): class female
## 294 923 (b): class male
## Attribute usage:
## 100.00% wage
## 88.93% school
## 37.66% exper
## Time: 0.0 secs

R only kept 4 models because the fifth classifier was very inaccurate, so boosting stopped there. You can see each model in the printout. The overall results are similar to our original model that was not boosted. We will now see how well our boosted model predicts with the code below.

Wage_boost_pred10<-predict(Wage_boost10, Wage_test)
 CrossTable(Wage_test$sex, Wage_boost_pred10, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex Boost', 'predicted Sex Boost'))

Our boosted model has an accuracy rate of (223 + 379)/1000 = 60.2%, which is slightly lower than the 61.5% of our unboosted model. As such, boosting the model was not useful (see page 11 of the pdf for the table printout).

Our next effort will be through the use of a cost matrix. A cost matrix allows you to impose a penalty on false positives and negatives at your discretion. This is useful if certain mistakes are too costly for the learner to make. In our example, we are going to make it 4 times more costly to misclassify a female as a male (false negative) and 1 times as costly to misclassify a male as a female (false positive). Below is the code

# error_cost is the cost matrix described above
Wage_cost<-C5.0(Wage_train[-21], Wage_train$sex, cost = error_cost)
 Wage_cost_pred<-predict(Wage_cost, Wage_test)
 CrossTable(Wage_test$sex, Wage_cost_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex EC', 'predicted Sex EC'))
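The line that builds the cost matrix is not visible above. As a sketch only, a matrix with the penalties described in the text could be built like this. The name “error_cost_sketch” is hypothetical, and the layout (rows for the predicted class, columns for the actual class) is an assumption that should be checked against the C50 documentation before use.

```r
classes <- c("female", "male")

# Start with no penalty anywhere (correct predictions cost nothing)
error_cost_sketch <- matrix(0, nrow = 2, ncol = 2,
                            dimnames = list(predicted = classes,
                                            actual = classes))

# Predicting male when the person is actually female costs 4
error_cost_sketch["male", "female"] <- 4

# Predicting female when the person is actually male costs 1
error_cost_sketch["female", "male"] <- 1

error_cost_sketch
```

Naming the rows and columns makes it much harder to put a penalty in the wrong cell.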

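With this small change our model appears to be 100% accurate (see page 12 of the pdf). Be skeptical of this result: “Wage_train” has only four columns, so the index -21 in the code above removes nothing, which leaves “sex” among the predictors and lets the tree read the answer directly.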


This post provided an example of decision trees. Such a model allows someone to predict a given outcome when given specific information.

Understanding Decision Trees

Decision trees are yet another method of machine learning that is used for classifying outcomes. Decision trees are very useful for, as you can guess, making decisions based on the characteristics of the data.

In this post, we will discuss the following

  • Physical traits of decision trees
  • How decision trees work
  • Pros and cons of decision trees

Physical Traits of a Decision Tree

Decision trees consist of what is called a tree structure. The tree structure consists of a root node, decision nodes, branches and leaf nodes.

A root node is the initial decision made in the tree. This depends on which feature the algorithm selects first.

Following the root node, the tree splits into various branches. Each branch leads to an additional decision node where the data is further subdivided. When you reach the bottom of a tree, the terminal nodes are also called leaf nodes.

How Decision Trees Work

Decision trees use a heuristic called recursive partitioning. What this does is it splits the overall dataset into smaller and smaller subsets until each subset is as close to pure (having the same characteristics) as possible. This process is also known as divide and conquer.

The mathematics for deciding how to split the data is based on a measure called entropy, which measures the impurity of a potential decision node. The lower the entropy score, the purer the decision node is. With two classes, entropy ranges from 0 (most pure) to 1 (most impure); with more classes the maximum is higher.
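A minimal sketch of the entropy calculation is shown below. The function name “entropy” and the label vectors are made up, and the function assumes the labels are a character vector so that every counted class actually occurs.

```r
# Entropy of a set of class labels, measured in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # proportion of each class
  -sum(p * log2(p))
}

entropy(c("pass", "pass", "pass", "pass"))   # one class only: completely pure
entropy(c("pass", "pass", "fail", "fail"))   # even two-class mix: most impure
entropy(c("buy", "sell", "trade"))           # three classes: maximum is log2(3)
```

A split is attractive when the subsets it creates have lower entropy than the set they came from.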

One of the most popular algorithms for developing decision trees is the C5.0 algorithm. This algorithm, in particular, uses entropy to assess potential decision nodes.

Pros and Cons

The pros of decision trees include their versatile nature. Decision trees can deal with all types of data as well as missing data. Furthermore, this approach learns automatically and only uses the most important features. Lastly, a deep understanding of mathematics is not necessary to use this method in comparison to more complex models.

Some problems with decision trees are that they can easily overfit the data. This means that the tree does not generalize well to other datasets. In addition, a large complex tree can be hard to interpret, which may be yet another indication of overfitting.


Decision trees provide another vehicle that researchers can use to empower decision making. This model is most useful when a decision that was made needs to be explained and defended, for example, when rejecting a person’s loan application. Complex models may provide stronger mathematical reasons but would be difficult to explain to an irate customer.

Therefore, when a complex calculation needs to be presented in an easy to follow format, decision trees are one possibility.

Conditional Probability & Bayes’ Theorem

In a prior post, we looked at some of the basics of probability. The prior forms of probability we looked at focused on independent events, which are events that are unrelated to each other.

In this post, we will look at conditional probability which involves calculating probabilities for events that are dependent on each other. We will understand conditional probability through the use of Bayes’ theorem.

Conditional Probability 

If all events were independent, it would be impossible to predict anything because there would be no relationships between features. However, there are many examples of one event affecting another. For example, thunder and lightning can be used as predictors of rain, and lack of study can be used as a predictor of test performance.

Thomas Bayes developed a theorem to understand conditional probability. A theorem is a statement that can be proven true through the use of math. Conditional probability is written as follows

P(A | B)

This notation simply means

The probability of event A given that event B occurs

Bayes’ theorem calculates this probability as P(A | B) = P(B | A) P(A) / P(B).

Calculating probabilities using Bayes’ theorem can be somewhat confusing when done by hand. There are a few terms, however, that you need to be exposed to.

  • prior probability is the probability of an event before any conditional event is taken into account, P(A)
  • likelihood is the probability of the conditional event given that the event of interest occurred, P(B | A)
  • posterior probability is the probability of an event given that another event occurred, P(A | B). The calculation of posterior probability is the application of Bayes’ theorem
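A small worked example may help tie the three terms together. All of the numbers below are invented: suppose 20% of emails are spam, a certain word appears in 50% of spam emails and in 5% of other emails.

```r
prior_spam <- 0.20            # prior: P(spam)
lik_word_given_spam <- 0.50   # likelihood: P(word | spam)
lik_word_given_ham <- 0.05    # P(word | not spam)

# Overall probability of seeing the word in any email
marginal_word <- lik_word_given_spam * prior_spam +
  lik_word_given_ham * (1 - prior_spam)

# Posterior: P(spam | word) = P(word | spam) * P(spam) / P(word)
posterior_spam <- lik_word_given_spam * prior_spam / marginal_word
round(posterior_spam, 3)
```

Seeing the word raises the probability of spam from the prior of 0.20 to a posterior of about 0.71.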

Naive Bayes Algorithm

Bayes’ theorem has been used to develop the Naive Bayes Algorithm. This algorithm is particularly useful in classifying text data, such as emails. This algorithm is fast, good with missing data, and powerful with large or small data sets. However, naive Bayes struggles with large amounts of numeric data, and it assumes that all features are equally important and independent of each other, which is rarely the case.


Probability is a core component of prediction. However, prediction cannot truly take place without events being dependent. Thanks to the work of Thomas Bayes, we have one approach to making predictions through the use of his theorem.

In a future post, we will use naive Bayes algorithm to make predictions about text.

Characteristics of Big Data

In a previous post, we talked about types of Big Data. However, another way to look at Big Data and define it is by looking at its characteristics. In other words, what identifies Big Data as data that is big.

This post will explain the 6 main characteristics of Big Data. These characteristics are often known as the V’s of Big Data. They are as follows

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Valence
  • Value


Volume has to do with the size of the data. For many people, it is hard to comprehend how volume is measured in computer science when it comes to memory. Most of the computers that the average person uses work in the range of gigabytes. For example, a DVD will hold about 5 gigabytes of data.

It is now becoming more and more common to find people with terabytes of storage. A terabyte is 1,000 gigabytes! This is enough memory to hold 200 DVDs worth of data. The next step up is petabytes, which is 1,000 terabytes or 200,000 DVDs.

Big data involves data that is large as in the examples above. Such massive amounts of data call for new ways of analysis.


Variety is another term for complexity. Big data can range from highly complex to barely complex. There was a previous post about structured and unstructured data that we won’t repeat here. The point is that these various levels of complexity make analysis highly difficult because of the tremendous amount of data munging, or cleaning of the data, that is often necessary.


Velocity is the speed at which big data is created, stored, and/or analyzed. Two approaches to processing data are batch and real-time. Batch processing involves collecting and cleaning the data in “batches” for processing. It is necessary to wait for all the “batches” to come in before making a decision. As such this is a slow process.

An alternative is real-time processing. This approach involves streaming the information into machines which process the data immediately.

The speed at which data needs to be processed is linked directly with the cost. As such, faster may not always be better or necessary.


Veracity is the quality of the data. If the data is no good the results are no good. The most reliable data tends to be collected by companies and other forms of enterprise. The next lower level is social media data. Finally, the lowest level of data is often data that is captured by sensors. The difference between the levels is often a lack of discrimination about what data gets collected.


Valence is a term that is used in chemistry and has to do with how an element has electrons available for bonding with other elements. This can lead to complex molecules due to elements being interconnected through sharing electrons.

In Big Data, valence is how interconnected the data is. As there are more and more connections among the data the complexity of the analysis increases.


Value is the ability to convert Big Data information into a monetary reward. For example, if you find a relationship between two products at a point of sale, you can recommend them to customers at a website or put the products next to each other in a store.

A lot of Big Data research is done with a motive of making money. However, there is also a lot of Big Data research happening that is not driven exclusively by a profit motive, such as the research being used to analyze the human genome. As such, the “value” characteristic is not always included when talking about the characteristics of Big Data.


Understanding the traits of Big Data allows an individual to identify Big Data when they see it. The traits here are the common ones of Big Data. However, this list is far from exhaustive and there is much more that could be said.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of the nearest example. This is based on the assumption that if two examples are next to each other they must be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weaknesses of this approach.


Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features lead to a two-dimensional feature space, three features lead to a three-dimensional feature space, etc. In this feature space, all the examples are placed based on their respective features.

The label of an unknown example is determined by its closest neighbor or neighbors. This calculation is commonly based on Euclidean distance, which is the straight-line distance between two points in the feature space. The number of neighbors used to determine the label varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use. The more neighbors used, the more complicated the classification becomes.
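The idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `euclidean` and `knn_predict` and the toy data are hypothetical, and when more than one neighbor is used the label is chosen by a simple majority vote, which is one common choice rather than the only one.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=1):
    """Label `query` by majority vote among its k nearest training examples.

    `train` is a list of (features, label) pairs.
    """
    # Sort every stored example by its distance to the query point
    # and keep the k closest ones.
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per example, two classes.
train = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((2.0, 1.5), "a"),
    ((6.0, 6.0), "b"),
    ((6.5, 7.0), "b"),
    ((7.0, 6.5), "b"),
]

print(knn_predict(train, (1.2, 1.4), k=1))  # one neighbor decides
print(knn_predict(train, (6.4, 6.2), k=3))  # three neighbors vote
```

Notice that nothing is "trained" here. The function simply stores the examples and measures distances when a new example needs a label.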

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provided by the model to understand why there are certain relationships. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model. This deprives us of insights into the relationships in the data. Another concern is the headache of missing data. This forces you to spend time cleaning the data more thoroughly. One final issue is that the classification phase of a project is slow and cumbersome, since no model is built ahead of time and every new example must be compared against the stored training examples.


Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your own way of approaching the problem, collecting the data, and analyzing it that is acceptable to your peers.

This level of freedom leads to people freezing and not completing a project. Now imagine having several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simple six-step process for approaching data science. The process involves the following six steps

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to access data for further analysis. Not always, but often, data scientists are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research, in which you are often involved from the beginning to the end, in data science you jump in to analyze a mess of data that others collected. This is challenging as it may not be clear what people want to know or what exactly they collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have and the level of measurement of the data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats, from structured to unstructured, it is critical to take a look at the data using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.
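As a small illustration of exploring data with summary statistics, here is a sketch that uses only Python's standard library. The records and the helper names `summarize_numeric` and `summarize_nominal` are made up for this example. In practice you would likely use a library such as pandas, but the idea is the same: count, check for missing values, and look at the spread of each variable according to its level of measurement.

```python
import statistics
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"wage": 12.5, "school": 12, "sex": "female"},
    {"wage": 8.0, "school": 10, "sex": "male"},
    {"wage": None, "school": 16, "sex": "female"},
    {"wage": 20.0, "school": 16, "sex": "male"},
]


def summarize_numeric(rows, column):
    """Summary statistics for an interval or ratio variable."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "n": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def summarize_nominal(rows, column):
    """Frequency counts for a nominal variable."""
    return Counter(r[column] for r in rows)


print(summarize_numeric(rows, "wage"))
print(summarize_nominal(rows, "sex"))
```

Even this small summary already tells us something useful for the next step: one wage value is missing and will need to be dealt with.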

Step 3 Process the Data

Processing data involves cleaning it. This involves dealing with missing data, transforming features, addressing outliers, and other necessary processes for preparing the data for analysis. The primary goal is to organize the data for analysis.

This is a critical step as various machine learning models have different assumptions that must be met. Some models can handle missing data; some cannot. Some models are affected by outliers; some are not.
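Below is a minimal sketch of two of the cleaning tasks mentioned above, handling missing data and addressing outliers. The choices made here are assumptions for the sake of illustration: missing values are filled in with the median, and outliers are clipped to limits that we pick by hand. The function names and the wage values are hypothetical, and other strategies may fit your data better.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Pull any value outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical hourly wages with one missing value and one extreme value.
wages = [8.0, 12.5, None, 20.0, 400.0]

filled = impute_median(wages)
cleaned = clip_outliers(filled, lower=0.0, upper=100.0)

print(filled)
print(cleaned)
```

The median is used rather than the mean because the extreme value of 400 would pull the mean upward.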

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears in statistical modeling is overfitting, which happens when a model only works on one set of data and lacks the ability to generalize.
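The sketch below illustrates validation with a simple hold-out split in plain Python. Everything in it is hypothetical: the data, the helper names `holdout_split` and `accuracy`, and the two toy "models". One model merely memorizes the training examples, which is an extreme case of overfitting, while the other applies a simple rule that generalizes.

```python
import random


def holdout_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and set aside a fraction of them for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, examples):
    """Share of examples for which the model predicts the right label."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


# Hypothetical data: the label is 1 whenever the feature is above 5.
data = [(x, int(x > 5)) for x in range(20)]
train, test = holdout_split(data)

# A "model" that memorizes the training set and guesses 0 for anything new.
memory = dict(train)


def memorizer(x):
    return memory.get(x, 0)


# A simple rule that captures the actual pattern in the data.
def threshold_rule(x):
    return int(x > 5)


print("memorizer train:", accuracy(memorizer, train))
print("memorizer test: ", accuracy(memorizer, test))
print("rule test:      ", accuracy(threshold_rule, test))
```

The memorizer is always perfect on the data it has already seen, so its training accuracy says nothing about how it will do on new examples. Only the score on the held-out set reveals whether a model generalizes.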

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis need to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well as spoken explanations.

Step 6 Apply the Results

This is the chance to actually use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications by people interested in applying for a home loan.


The steps in this process are just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

The Types of Data in Big Data

A well-known quote in the business world is “cash is king.” Nothing will destroy a business faster than a lack of liquidity to meet a financial emergency. What you’re worth may not matter as much as what you can spend.

However, there is now a challenge to this mantra. In the world of data science, there is the belief that data is king. This can potentially make sense as using data to foresee financial disaster can help people to have cash ready.

In this post, we are going to examine the different types of data in the world of data science. Generally, there are two types of data which are unstructured and structured data.

Unstructured Data

Unstructured data is data that is produced by people. Normally, this data is text heavy. Examples of unstructured data include tweets on Twitter, customer feedback on Amazon, blogs, emails, etc. This type of data is very challenging to work with because it is not necessarily in a format for analysis.

Despite the challenges, there are techniques available for using this information to make decisions. Often, the analysis of unstructured data is used to target products and make recommendations for purchases by companies.

Structured Data

Structured data is in many ways the complete opposite of unstructured data. Structured data has a clear format and a specific place for various pieces of data. An Excel document is one example of structured data. A receipt is another example. A receipt has a specific place for different pieces of information such as price, total, date, etc. Often, structured data is made by organizations and machines.

Naturally, analyzing structured data is often much easier than analyzing unstructured data. With a consistent format, there is less processing required before analysis.

Working With Data

When approaching a project, data often comes from several sources. Normally, the data has to be moved around and consolidated into one space for analysis. When working with unstructured and/or structured data that is coming from several different sources, there is a three-step process used to facilitate this. The process is called ETL, which stands for extract, transform, and load.

Extracting data means taking it from one place and planning to move it somewhere else. Transform means changing the data in some way or another. For example, this often means organizing it for the purposes of answering research questions. How this is done is context specific.

Load simply means placing all the transformed data into one place for analysis. This is a critical last step as it is helpful to have what you are analyzing in one convenient place. The details of this will be addressed in a future post.
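To give a feel for ETL before that future post, here is a minimal sketch using only Python's standard library. The two sources, their column layouts, and the `sales` table are all made up for illustration. The data is extracted from two differently formatted text sources, transformed into one common layout, and loaded into a single in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical sources: the same kind of data in two different layouts.
STORE_CSV = "date,item,price\n2017-01-05,pen,1.50\n2017-01-06,notebook,3.25\n"
WEB_CSV = "item;price_cents;date\npen;150;2017-01-07\n"


def extract(text, delimiter):
    """Extract: read raw rows out of a source."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform_store(rows):
    """Transform: store prices are already in dollars."""
    return [(r["date"], r["item"], float(r["price"])) for r in rows]


def transform_web(rows):
    """Transform: web prices are in cents, so convert them to dollars."""
    return [(r["date"], r["item"], int(r["price_cents"]) / 100) for r in rows]


def load(conn, records):
    """Load: place the transformed records into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (date TEXT, item TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform_store(extract(STORE_CSV, ",")))
load(conn, transform_web(extract(WEB_CSV, ";")))

# With everything in one place, analysis becomes a single query.
total = conn.execute("SELECT COUNT(*), SUM(price) FROM sales").fetchone()
print(total)
```

Once the data from both sources sits in one table with one format, questions about the combined data no longer depend on where each record came from.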


In what may be an interesting contradiction, as we collect more and more data, data is actually becoming more valuable. Normally, an increase in a resource lessens its value, but not with data. Organizations are collecting data at a record-breaking pace in order to anticipate the behavior of people. This predictive power derived from data can lead to significant profits, which leads to the conclusion that perhaps data is now the king.