Category Archives: Big Data

Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This process is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in R or Excel. If you create it in Excel, just save it as a .csv file and load it into R. Below is the initial code

library(readr)
apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This is a small dataset with the columns “weight” and “location”. Here are some of the problems

  • Weights are in different units
  • Weights are written in different ways
  • Location is not consistent

In order to have any success with data wrangling you need to state specifically what you want to do. Here are our goals for this project

  • Convert the “weight” variable from character to numeric
  • Remove the text so that only numbers remain in the “weight” variable
  • Change weights in grams to kilograms
  • Convert the “location” variable from character to factor
  • Use consistent spelling for Europe and the United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numeric variable and remove any non-numeric text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

  1. We created a variable called “corrected.weight”
  2. We used the function “as.numeric”, which converts whatever is inside it into a numerical variable
  3. Inside “as.numeric” we used the “gsub” function, which allows us to substitute one value for another
  4. Inside “gsub” we set the argument pattern to “[[:alpha:]]” and the replacement to “”. This tells R to look for any lowercase or uppercase letters and replace them with nothing, i.e. remove them. All of this is applied to the “weight” variable in the “apple” dataframe (a one-value illustration is shown below)
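
If it helps to see the same idea on a single value, here is a minimal illustration of the pattern applied to one entry from the table above.

# remove every letter from the string, then convert what is left to a number
as.numeric(gsub(pattern = "[[:alpha:]]", replacement = "", "4.2kg"))
## [1] 4.2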

We now need to convert the weights in grams to kilograms so that everything is in the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

  1. We created a variable called “gram.error”
  2. We used the “grep” function to search the “weight” variable in the apple data frame for any value that contains a four-digit number, which is what the pattern “[[:digit:]]{4}” means. We do not change any values yet; we just store the positions of the matches in “gram.error” (see the illustration below)
  3. Once this information is stored in “gram.error”, we use it to subset the “corrected.weight” variable
  4. We tell R to take the values of “corrected.weight” at the positions identified by “gram.error” and divide them by 1000. Dividing by 1000 converts the values from grams to kilograms
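
Because “grep” can be confusing at first, here is a quick check of what it actually returns for our data: the positions of the matching values, not the values themselves.

grep(pattern = "[[:digit:]]{4}", apple$weight)
## [1] 4 9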

We have completed the transformation of the “weight” variable and will now deal with the problems in the “location” variable of the “apple” dataframe. We will first deal with the values related to Europe and then with the values related to the United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain, but in short we used the “agrep” function to tell R to search the “location” variable for values similar to our term “europe”. The other arguments allow for small differences (insertions and deletions) so that near-misses such as “europee” are still matched. This process is repeated for the term “us”. We then store the “location” variable from the “apple” dataframe in a new variable called “corrected.location”. Next we apply the two index objects we made, “europe” and “america”, to the “corrected.location” variable. Finally, we deal with “United States” separately using the “gsub” function.
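
Like “grep”, the “agrep” function returns positions rather than values, so the “europe” object is just a set of row numbers identifying the Europe-like entries.

europe
## [1] 1 2 6 7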

We are almost done. Now we combine our two variables, “corrected.weight” and “corrected.location”, into a new data frame. The code is below

cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.
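
As a quick sanity check, the call below shows the structure. Note that the automatic conversion to a factor assumes the older default of stringsAsFactors = TRUE in data.frame(), which was the default when this post was written.

str(cleaned.apple)
# "weight" should now be numeric and "location" a factor with the levels "europe" and "US"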

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.


Making Regression and Model Trees in R

In this post, we will look at an example of regression trees. Regression trees use a decision tree-like approach to develop prediction models for numerical data. In our example, we will be trying to predict how many kids a person has based on several independent variables in the “PSID” data set from the “Ecdat” package.

Let’s begin by loading the necessary packages and data set. The code is below

library(Ecdat);library(rpart);library(rpart.plot)
library(RWeka)
data(PSID)
str(PSID)
## 'data.frame':    4856 obs. of  8 variables:
##  $ intnum  : int  4 4 4 4 5 6 6 7 7 7 ...
##  $ persnum : int  4 6 7 173 2 4 172 4 170 171 ...
##  $ age     : int  39 35 33 39 47 44 38 38 39 37 ...
##  $ educatn : int  12 12 12 10 9 12 16 9 12 11 ...
##  $ earnings: int  77250 12000 8000 15000 6500 6500 7000 5000 21000 0 ...
##  $ hours   : int  2940 2040 693 1904 1683 2024 1144 2080 2575 0 ...
##  $ kids    : int  2 2 1 2 5 2 3 4 3 5 ...
##  $ married : Factor w/ 7 levels "married","never married",..: 1 4 1 1 1 1 1 4 1 1 ...
summary(PSID)
##      intnum        persnum            age           educatn     
##  Min.   :   4   Min.   :  1.00   Min.   :30.00   Min.   : 0.00  
##  1st Qu.:1905   1st Qu.:  2.00   1st Qu.:34.00   1st Qu.:12.00  
##  Median :5464   Median :  4.00   Median :38.00   Median :12.00  
##  Mean   :4598   Mean   : 59.21   Mean   :38.46   Mean   :16.38  
##  3rd Qu.:6655   3rd Qu.:170.00   3rd Qu.:43.00   3rd Qu.:14.00  
##  Max.   :9306   Max.   :205.00   Max.   :50.00   Max.   :99.00  
##                                                  NA's   :1      
##     earnings          hours           kids                 married    
##  Min.   :     0   Min.   :   0   Min.   : 0.000   married      :3071  
##  1st Qu.:    85   1st Qu.:  32   1st Qu.: 1.000   never married: 681  
##  Median : 11000   Median :1517   Median : 2.000   widowed      :  90  
##  Mean   : 14245   Mean   :1235   Mean   : 4.481   divorced     : 645  
##  3rd Qu.: 22000   3rd Qu.:2000   3rd Qu.: 3.000   separated    : 317  
##  Max.   :240000   Max.   :5160   Max.   :99.000   NA/DF        :   9  
##                                                   no histories :  43

The variables “intnum” and “persnum” are for identification and are useless for our analysis. We will now explore our dataset with the following code.

hist(PSID$age)


hist(PSID$educatn)


hist(PSID$earnings)


hist(PSID$hours)


hist(PSID$kids)


table(PSID$married)
## 
##       married never married       widowed      divorced     separated 
##          3071           681            90           645           317 
##         NA/DF  no histories 
##             9            43

Almost all of the variables are non-normal. However, this is not a problem when using regression trees. There are some major problems with the “kids” and “educatn” variables. Each of these variables has values of 98 and 99. When the data for this survey was collected, 98 meant the respondent did not know the answer and 99 meant they did not want to say. Since both of these variables are numerical, we have to do something with them so they do not ruin our analysis.

We are going to recode all values equal to or greater than 98 as 3 for the “kids” variable, meaning the respondent has 3 kids. This number was picked because it was the most common response among the other respondents. For the “educatn” variable, all values equal to or greater than 98 are recoded as 12, which means that the respondent completed 12th grade. Again, this was the most frequent response. Below is the code.

PSID$kids[PSID$kids >= 98] <- 3
PSID$educatn[PSID$educatn >= 98] <- 12

Another peek at the histograms for these two variables and things look much better.

hist(PSID$kids)


hist(PSID$educatn)


Make Model and Visualization

Now that everything is cleaned up, we need to make our training and testing data sets, as seen in the code below.

PSID_train<-PSID[1:3642,]
PSID_test<-PSID[3643:4856,]

We will now make our model and also create a visual of it. Our goal is to predict the number of children a person has based on their age, education, earnings, hours worked, and marital status. Below is the code

#make model
PSID_Model<-rpart(kids~age+educatn+earnings+hours+married, PSID_train)
#make visualization
rpart.plot(PSID_Model, digits=3, fallen.leaves = TRUE,type = 3, extra=101)


The first split in the tree is by income. On the left we have those who make more than 20k and on the right those who make less than 20k. On the left, the next split is by marriage: those who were never married or not applicable have on average 0.74 kids. Those who are married, widowed, divorced, separated, or have no history have on average 1.72 kids.

The side for those making less than 20k is much more complicated, and I will not explain all of it. After the split for making less than 20k, the next split is by marriage. Those who are married, widowed, divorced, separated, or have no history, and who have less than 13.5 years of education, have 2.46 kids on average.

Make Prediction Model and Conduct Evaluation

Our next task is to use our model to make predictions on the testing data. We will do this with the following code

PSID_pred<-predict(PSID_Model, PSID_test)

We will now evaluate the model in three different ways. The first involves looking at the summary statistics of the predictions and the testing data; the numbers should be about the same. After that, we will calculate the correlation between the predictions and the testing data. Lastly, we will use a technique called the mean absolute error. Below is the code for the summary statistics and correlation.

summary(PSID_pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.735   2.041   2.463   2.226   2.463   2.699
summary(PSID_test$kids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   2.494   3.000  10.000
cor(PSID_pred, PSID_test$kids)
## [1] 0.308116

Looking at the summary stats, our model has a hard time predicting extreme values because the maximum values of the predictions and the actual data are very far apart. However, how often do people have ten kids? As such, this is not a major concern.

A look at the correlation finds that it is pretty low (0.30). This means that the predictions and the actual values have little in common, and that we need to make some changes. The mean absolute error is a measure of the average difference between the predicted and actual values in a model. We need to make a function first before we can assess our model.

MAE<-function(actual, predicted){
        # average of the absolute differences between actual and predicted values
        mean(abs(actual-predicted))
}

We now assess the model with the code below

MAE(PSID_pred, PSID_test$kids)
## [1] 1.134968

The results indicate that, on average, the difference between our model’s prediction of the number of kids and the actual number of kids was 1.13 on a scale of 0 – 10. That’s a lot of error. However, we need to compare this number to how well simply using the mean does, to give us a benchmark. The code is below.

ave_kids<-mean(PSID_train$kids)
MAE(ave_kids, PSID_test$kids)
## [1] 1.178909

Model Tree

Our model, with a score of 1.13, is slightly better than using the mean, which scores 1.17. We will try to improve our model by switching from a regression tree to a model tree, which uses a slightly different approach to prediction. In a model tree, each leaf of the tree ends in a linear regression model. Below is the code.

PSIDM5<- M5P(kids~age+educatn+earnings+hours+married, PSID_train)
PSIDM5
## M5 pruned model tree:
## (using smoothed linear models)
## 
## earnings <= 20754 : 
## |   earnings <= 2272 : 
## |   |   educatn <= 12.5 : LM1 (702/111.555%)
## |   |   educatn >  12.5 : LM2 (283/92%)
## |   earnings >  2272 : LM3 (1509/88.566%)
## earnings >  20754 : LM4 (1147/82.329%)
## 
## LM num: 1
## kids = 
##  0.0385 * age 
##  + 0.0308 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.0187 * married=married,divorced,widowed,separated,no histories 
##  + 0.2986 * married=divorced,widowed,separated,no histories 
##  + 0.0082 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 0.7181
## 
## LM num: 2
## kids = 
##  0.002 * age 
##  - 0.0028 * educatn 
##  + 0.0002 * earnings 
##  - 0 * hours 
##  + 0.7854 * married=married,divorced,widowed,separated,no histories 
##  - 0.3437 * married=divorced,widowed,separated,no histories 
##  + 0.0154 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 1.4075
## 
## LM num: 3
## kids = 
##  0.0305 * age 
##  - 0.1362 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.9028 * married=married,divorced,widowed,separated,no histories 
##  + 0.2151 * married=widowed,separated,no histories 
##  + 0.0017 * married=separated,no histories 
##  + 2.0218
## 
## LM num: 4
## kids = 
##  0.0393 * age 
##  - 0.0658 * educatn 
##  - 0 * earnings 
##  - 0 * hours 
##  + 0.8845 * married=married,divorced,widowed,separated,no histories 
##  + 0.3666 * married=widowed,separated,no histories 
##  + 0.0037 * married=separated,no histories 
##  + 0.4712
## 
## Number of Rules : 4

It would take too much time to explain everything, but you can read part of this model as follows:

  • earnings greater than 20754: use linear model 4
  • earnings less than 20754 and less than 2272, with 12.5 or fewer years of education: use linear model 1
  • earnings less than 20754 and less than 2272, with more than 12.5 years of education: use linear model 2
  • earnings less than 20754 and greater than 2272: use linear model 3

The printout then shows each of the linear models. Lastly, we will evaluate our model tree with the following code

PSIDM5_Pred<-predict(PSIDM5, PSID_test)
summary(PSIDM5_Pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3654  2.0490  2.3400  2.3370  2.6860  4.4220
cor(PSIDM5_Pred, PSID_test$kids)
## [1] 0.3486492
MAE(PSID_test$kids, PSIDM5_Pred)
## [1] 1.088617

This model is slightly better. For example, it is better at predicting extreme values, reaching 4.4 compared to 2.69 for the regression tree model. The correlation is 0.34, which is better than the 0.30 of the regression tree model. Lastly, the mean absolute error shows a slight improvement at 1.08, compared to 1.13 for the regression tree model.

Conclusion

This post provided examples of the use of regression trees and model trees. Both of these models make prediction a key component of their analysis.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories, such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

Using trees to make numeric predictions involves regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with trees.

Regression Trees and Model Trees

Regression trees have been around since the 1980s. They work by predicting the average value of the examples that reach a given leaf in the tree. Despite their name, there is no regression involved in regression trees. Regression trees are straightforward to interpret, but at the expense of accuracy.

Model trees are similar to regression trees but employ multiple regression on the examples at each leaf of the tree. This leads to many different regression models being used throughout a tree. This makes model trees harder to interpret and understand in comparison to regression trees. However, they are normally much more accurate than regression trees.

Both types of trees have the goal of making groups that are as homogeneous as possible. For classification decision trees, entropy is used to measure the homogeneity of groups. For numeric decision trees, the standard deviation reduction (SDR) is used. The details of SDR are somewhat complex and technical and will mostly be avoided for that reason, but the sketch below gives the basic idea.
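
For readers who want a rough sense of it, here is a minimal sketch of SDR under its common formulation: the standard deviation of the outcome at a node minus the weighted standard deviations of the groups a candidate split would create. The data and split point below are made up for illustration.

# standard deviation reduction for one candidate split
# original: outcome values at the node; left/right: outcome values in the two groups
sdr <- function(original, left, right) {
  sd(original) - (length(left) / length(original) * sd(left) +
                  length(right) / length(original) * sd(right))
}

outcome <- c(1, 1, 2, 2, 3, 8, 9, 9, 10, 10)               # made-up outcome values
sdr(outcome, outcome[outcome < 5], outcome[outcome >= 5])  # a larger value means a better split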

Strengths of Numeric Prediction Trees

Numeric prediction trees do not have the assumptions of linear regression. As such, they can be used to model non-normal and/or non-linear data. In addition, if a dataset has a large number of feature variables, a numeric prediction tree can automatically select the most appropriate ones. Lastly, numeric prediction trees do not need the model to be specified in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amount of data in the training set in order to develop a testable model. It is also hard to tell which variables are most important in shaping the outcome. Lastly, numeric prediction trees are sometimes hard to interpret. This naturally limits their usefulness among people who lack statistical training.

Conclusion

Numeric prediction trees combine the strengths of decision trees with the ability to digest a large number of numerical variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, it is usually wise to try simpler methods first when possible.

Making a Decision Tree in R

In this post, we are going to learn how to use the C5.0 algorithm to make a classification tree in order to make predictions about gender based on wage, education, and job experience using a data set in the “Ecdat” package in R. Below is some code to get started.

library(Ecdat); library(C50); library(gmodels)
 data(Wages1)

We will now explore the data to get a sense of what is happening in it. Below is the code for this

str(Wages1)
 ## 'data.frame': 3294 obs. of 4 variables:
 ## $ exper : int 9 12 11 9 8 9 8 10 12 7 ...
 ## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
 ## $ school: int 13 12 11 14 14 14 12 12 10 12 ...
 ## $ wage : num 6.32 5.48 3.64 4.59 2.42 ...
 hist(Wages1$exper)


summary(Wages1$exper)
 ## Min. 1st Qu. Median Mean 3rd Qu. Max.
 ## 1.000 7.000 8.000 8.043 9.000 18.000

hist(Wages1$wage)


summary(Wages1$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07656 3.62200 5.20600 5.75800 7.30500 39.81000

hist(Wages1$school)


summary(Wages1$school)
 ## Min. 1st Qu. Median Mean 3rd Qu. Max.
 ## 3.00 11.00 12.00 11.63 12.00 16.00

table(Wages1$sex)
## female male
## 1569 1725

As you can see, we have four features (exper, sex, school, wage) in the “Wages1” data set. The histogram for “exper” indicates that it is normally distributed. The “wage” feature is highly right skewed and almost bimodal. This is not a big deal as classification trees are robust against non-normality. The “school” feature is mostly normally distributed. Lastly, the “sex” feature is categorical, and there is an almost equal number of men and women in the data. All of the outputs for the means are listed above.

Create Training and Testing Sets

We now need to create our training and testing data sets. In order to do this we first need to randomly reorder our data set. For example, if the data were sorted by one of the features, splitting it as-is could lead to extreme values all being lumped together in one of the data sets.

We also need to set our seed, which allows us to replicate our results. Below is the code for doing this.

set.seed(12345)
 Wage_rand<-Wages1[order(runif(3294)),]

What we did is explained as follows

  1. We set the seed using the ‘set.seed’ function (we arbitrarily picked the number 12345)
  2. We created the variable ‘Wage_rand’ and assigned it the following
  3. We used the ‘runif’ function to generate 3294 random numbers, one for each of the 3294 examples in the ‘Wages1’ dataset
  4. We passed these random numbers to the ‘order’ function, which returns the positions that would put them in ascending order
  5. We used these positions to index the rows of ‘Wages1’, which effectively shuffles the examples into a random order

We will now create our training and testing sets using the code below.

Wage_train<-Wage_rand[1:2294,]
 Wage_test<-Wage_rand[2295:3294,]

Make the Model
We can now begin training a model. Below is the code.

Wage_model<-C5.0(Wage_train[-2], Wage_train$sex)

The coding for making the model should be familiar by now. One thing that is new is the brackets with the -2 inside. This tells R to ignore the second column in the dataset, which is “sex”. We are doing this because we want to predict sex; if it were part of the independent variables, we could not predict it. The short check below shows what the -2 does. After that, we examine the results of our model with the following code.
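
As a quick illustration of the column being dropped (the column order comes from the ‘str’ output above):

names(Wage_train)       # "exper" "sex" "school" "wage"
names(Wage_train[-2])   # the predictors only: "exper" "school" "wage"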

Wage_model
##
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
##
## Classification Tree
## Number of samples: 2294
## Number of predictors: 3
##
## Tree size: 9
##
## Non-standard options: attempt to group attributes
summary(Wage_model)
##
## Call:
## C5.0.default(x = Wage_train[-2], y = Wage_train$sex)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
## -------------------------------
##
## Class specified by attribute `outcome’
##
## Read 2294 cases (4 attributes) from undefined.data
##
## Decision tree:
##
## wage <= 3.985179:
## :...school > 11: female (345/109)
## :   school <= 11:
## :   :...exper <= 8: female (224/96)
## :       exper > 8: male (143/59)
## wage > 3.985179:
## :...wage > 9.478313: male (254/61)
##     wage <= 9.478313:
##     :...school > 12: female (320/132)
##         school <= 12:
##         :...school <= 10: male (246/70)
##             school > 10:
##             :...school <= 11: male (265/114)
##                 school > 11:
##                 :...exper <= 6: female (83/35)
##                     exper > 6: male (414/173)
##
##
## Evaluation on training data (2294 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 9 849(37.0%) <<
##
##
##    (a)   (b)
##   ----  ----
##    600   477    (a): class female
##    372   845    (b): class male
##
##
## Attribute usage:
##
## 100.00% wage
## 88.93% school
## 37.66% exper
##
##
## Time: 0.0 secs

The “Wage_model” indicates a small decision tree of only 9 decisions. The “summary” function shows the actual decision tree. It’s somewhat complicated but I will explain the beginning part of the tree.

If wages are less than or equal to 3.98 THEN

If schooling is greater than 11, the person is female ELSE

If schooling is less than or equal to 11 THEN

If the person’s experience is less than or equal to 8, the person is female ELSE

If the experience is greater than 8, the person is male, etc.

The next part of the output shows the amount of error. This model misclassified 37% of the examples, which is pretty high. Reading the confusion matrix, 477 women were misclassified as men and 372 men were misclassified as women.

Predict with the Model

We will now see how well this model predicts gender in the testing set. Below is the code

Wage_pred<-predict(Wage_model, Wage_test)

CrossTable(Wage_test$sex, Wage_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual sex', 'predicted sex'))

The output will not display properly here. Please see C50 for a pdf of this post and go to page 7

Again, this code should be mostly familiar by now. In the table we are comparing the actual sex in the test set with the predicted sex. The overall model was correct (269 + 346)/1000 times, for a 61.5% accuracy rate, which is pretty bad.

Improve the Model

There are two ways we are going to try to improve our model. The first is adaptive boosting and the second is an error cost matrix.

Adaptive boosting involves making several models that “vote” on how to classify an example. To do this you need to add the ‘trials’ parameter to the code. The ‘trials’ parameter sets the upper limit on the number of models R will iterate through if necessary. Below is the code and the results.

Wage_boost10<-C5.0(Wage_train[-2], Wage_train$sex, trials = 10)
 #view boosted model
 summary(Wage_boost10)
 ##
 ## Call:
 ## C5.0.default(x = Wage_train[-2], y = Wage_train$sex, trials = 10)
 ##
 ##
 ## C5.0 [Release 2.07 GPL Edition] Wed May 25 10:55:22 2016
 ## -------------------------------
 ##
 ## Class specified by attribute `outcome'
 ##
 ## Read 2294 cases (4 attributes) from undefined.data
 ##
 ## ----- Trial 0: -----
 ##
 ## Decision tree:
 ##
 ## wage <= 3.985179:
 ## :...school > 11: female (345/109)
 ## :   school <= 11:
 ## :   :...exper <= 8: female (224/96)
 ## :       exper > 8: male (143/59)
 ## wage > 3.985179:
 ## :...wage > 9.478313: male (254/61)
 ##     wage <= 9.478313:
 ##     :...school > 12: female (320/132)
 ##         school <= 12:
 ##         :...school <= 10: male (246/70)
 ##             school > 10:
 ##             :...school <= 11: male (265/114)
 ##                 school > 11:
 ##                 :...exper <= 6: female (83/35)
 ##                     exper > 6: male (414/173)
 ##
 ## ----- Trial 1: -----
 ##
 ## Decision tree:
 ##
 ## wage > 6.848846: male (663.6/245)
 ## wage <= 6.848846:
 ## :...school <= 10: male (413.9/175)
 ##     school > 10: female (1216.5/537.6)
 ##
 ## ----- Trial 2: -----
 ##
 ## Decision tree:
 ##
 ## wage <= 3.234474: female (458.1/192.9)
 ## wage > 3.234474: male (1835.9/826.2)
 ##
 ## ----- Trial 3: -----
 ##
 ## Decision tree:
 ##
 ## wage > 9.478313: male (234.8/82.1)
 ## wage <= 9.478313:
 ## :...school <= 11: male (883.2/417.8)
 ##     school > 11: female (1175.9/545.1)
 ##
 ## ----- Trial 4: -----
 ##
 ## Decision tree:
 ## male (2294/1128.1)
 ##
 ## *** boosting reduced to 4 trials since last classifier is very inaccurate
 ##
 ##
 ## Evaluation on training data (2294 cases):
 ##
 ## Trial Decision Tree
 ## ----- ----------------
 ## Size Errors
 ##
 ## 0 9 849(37.0%)
 ## 1 3 917(40.0%)
 ## 2 2 958(41.8%)
 ## 3 3 949(41.4%)
 ## boost 864(37.7%) <<
 ##
 ##
 ##    (a)   (b)
 ##   ----  ----
 ##    507   570    (a): class female
 ##    294   923    (b): class male
 ##
 ##
 ## Attribute usage:
 ##
 ## 100.00% wage
 ## 88.93% school
 ## 37.66% exper
 ##
 ##
 ## Time: 0.0 secs

R only created 4 models, as boosting stopped once the last classifier became very inaccurate. You can see each model in the printout. The overall results are similar to our original model that was not boosted. We will now see how well our boosted model predicts with the code below.

Wage_boost_pred10<-predict(Wage_boost10, Wage_test)
 CrossTable(Wage_test$sex, Wage_boost_pred10, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex Boost', 'predicted Sex Boost'))

Our boosted model has an accuracy rate of (223 + 379)/1000 = 60.2%, which is about 1% better than our unboosted model on the test set (59.1%). As such, boosting the model was not very useful (see page 11 of the pdf for the table printout).

Our next effort will be through the use of a cost matrix. A cost matrix allows you to impose a penalty on false positives and false negatives at your discretion. This is useful if certain mistakes are too costly for the learner to make. In our example, we are going to make it 4 times more costly to misclassify a female as a male (a false negative) than to misclassify a male as a female (a false positive). Below is the code

# the cost matrix reflects the 4-to-1 penalty described above; here we assume
# rows are the predicted classes and columns are the actual classes
error_cost<-matrix(c(0, 4, 1, 0), nrow = 2,
 dimnames = list(predicted = c('female','male'), actual = c('female','male')))
Wage_cost<-C5.0(Wage_train[-2], Wage_train$sex, costs = error_cost)
 Wage_cost_pred<-predict(Wage_cost, Wage_test)
 CrossTable(Wage_test$sex, Wage_cost_pred, prop.c = FALSE,
 prop.r = FALSE, dnn=c('actual Sex EC', 'predicted Sex EC'))

With this small change our model is 100% accurate (see page 12 of the pdf).

Conclusion

This post provided an example of decision trees. Such a model allows someone to predict a given outcome when given specific information.

Understanding Decision Trees

Decision trees are yet another method in machine learning that is used for classifying outcomes. Decision trees are very useful for, as you can guess, making decisions based on the characteristics of the data.

In this post we will discuss the following

  • Physical traits of decision trees
  • How decision trees work
  • Pros and cons of decision trees

Physical Traits of a Decision Tree

Decision trees consist of what is called a tree structure. The tree structure consists of a root node, decision nodes, branches, and leaf nodes.

A root node is the initial decision made in the tree. This depends on which feature the algorithm selects first.

Following the root node, the tree splits into various branches. Each branch leads to an additional decision node where the data is further subdivided. When you reach the bottom of the tree, the terminal nodes are also called leaf nodes.

How Decision Trees Work

Decision trees use a heuristic called recursive partitioning. This splits the overall data set into smaller and smaller subsets until each subset is as close to pure (having the same characteristics) as possible. This process is also known as divide and conquer.

The mathematics for deciding how to split the data is based on a measure called entropy, which measures the purity of a potential decision node. The lower the entropy score, the more pure the decision node is. For a two-class split, entropy ranges from 0 (most pure) to 1 (most impure).
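
Here is a minimal sketch of the entropy calculation, written as a small R function that takes the class proportions within a group.

# entropy of a group given the proportion of examples in each class
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # an even two-class split is the most impure: 1
entropy(c(1))          # a perfectly pure group: 0
entropy(c(0.9, 0.1))   # mostly one class: roughly 0.47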

One of the most popular algorithms for developing decision trees is the C5.0 algorithm. This algorithm in particular uses entropy to assess potential decision nodes.

Pros and Cons

The pros of decision trees include their versatile nature. Decision trees can deal with all types of data as well as missing data. Furthermore, this approach learns automatically and uses only the most important features. Lastly, a deep understanding of mathematics is not necessary to use this method in comparison to more complex models.

One problem with decision trees is that they can easily overfit the data, meaning that the tree does not generalize well to other datasets. In addition, a large, complex tree can be hard to interpret, which may be yet another indication of overfitting.

Conclusion

Decision trees provide another vehicle that researchers can use to empower decision making. This model is particularly useful when a decision that was made needs to be explained and defended, for example, when rejecting a person’s loan application. Complex models may provide stronger mathematical reasons but would be difficult to explain to an irate customer.

Therefore, when a complex calculation needs to be presented in an easy-to-follow format, decision trees are one possibility.

Conditional Probability & Bayes’ Theorem

In a prior post, we looked at some of the basics of probability. The forms of probability we looked at focused on independent events, which are events that are unrelated to each other.

In this post we will look at conditional probability which involves calculating probabilities for events that are dependent on each other. We will understand conditional probability through the use of Bayes’ theorem.

Conditional Probability 

If all events were independent, it would be impossible to predict anything because there would be no relationships between features. However, there are many examples of one event affecting another. For example, thunder and lightning can be used as predictors of rain, and lack of study can be used as a predictor of test performance.

Thomas Bayes developed a theorem for understanding conditional probability. A theorem is a statement that can be proven true through the use of math. The key piece of notation in Bayes’ theorem is written as follows

P(A | B)

This notation simply means

the probability of event A given that event B occurs.

Bayes’ theorem tells us how to calculate this from the reverse conditional probability and the probability of each event on its own:

P(A | B) = [P(B | A) × P(A)] / P(B)

Calculating probabilities using Bayes’ theorem can be somewhat confusing when done by hand. There are a few terms, however, that you need to be exposed to.

  • prior probability is the probability of an event before any conditional event is taken into account
  • likelihood is the probability of observing the conditional event given that the event of interest occurred
  • posterior probability is the probability of an event given that another event occurred. Calculating the posterior probability is the application of Bayes’ theorem (a small worked example follows below)
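
To make this concrete, here is a small worked example in R with made-up numbers: estimating the probability that an email is spam given that it contains the word “free”.

prior      <- 0.20   # P(spam): assumed share of all email that is spam
likelihood <- 0.60   # P(word | spam): assumed chance a spam email contains "free"
evidence   <- 0.15   # P(word): assumed chance that any email contains "free"

posterior  <- (likelihood * prior) / evidence   # Bayes' theorem: P(spam | word)
posterior
## [1] 0.8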

Naive Bayes Algorithm

Bayes’ theorem has been used to develop the naive Bayes algorithm. This algorithm is particularly useful in classifying text data, such as emails. The algorithm is fast, good with missing data, and works with large or small data sets. However, naive Bayes struggles with large amounts of numeric data, and it assumes that all features are equally important and independent, which is rarely the case.

Conclusion

Probability is a core component of prediction. However, prediction cannot truly take place unless events are dependent on each other. Thanks to the work of Thomas Bayes, we have one approach to making predictions through the use of his theorem.

In a future post, we will use the naive Bayes algorithm to make predictions about text.

 

Characteristics of Big Data

In a previous post we talked about the types of Big Data. However, another way to look at big data and define it is by examining its characteristics. In other words, what identifies Big Data as data that is big?

This post will explain the 6 main characteristics of Big Data. These characteristics are often known as the V’s of Big Data. They are as follows

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Valence
  • Value

Volume

Volume has to do with the size of the data. For many people it is hard to comprehend how volume is measured in computer science when it comes to memory. Most of the computers that the average person uses work in the range of gigabytes. For example, a DVD will hold about 5 gigabytes of data.

It is now becoming more and more common to find people with terabytes of storage. A terabyte is 1,000 gigabytes! This is enough memory to hold about 200 DVDs worth of data. The next step up is a petabyte, which is 1,000 terabytes, or about 200,000 DVDs.

Big data involves data that is large as in the examples above. Such massive amounts of data call for new ways of analysis.

Variety

Variety is another term for complexity. Big data can be highly complex or fairly simple. There was a previous post about structured and unstructured data that we won’t repeat here. The point is that these varying levels of complexity make analysis difficult because of the tremendous amount of data munging, or cleaning of the data, that is often necessary.

Velocity

Velocity is the speed at which big data is created, stored, and/or analyzed. Two approaches to processing data are batch and real-time. Batch processing involves collecting and cleaning the data in “batches” for processing. It is necessary to wait for all the “batches” to come in before making a decision. As such, this is a slow process.

An alternative is real-time processing. This approach involves streaming the information into machines which process the data immediately.

The speed at which data needs to be processed is linked directly to the cost. As such, faster may not always be better or necessary.

Veracity

Veracity refers to the quality of the data. If the data is no good, the results are no good. The most reliable data tends to be collected by companies and other forms of enterprise. The next lower level is social media data. Finally, the lowest level of data is often data that is captured by sensors. The difference between these levels is often the amount of screening, or discrimination, the data receives before it is used.

Valence

Valence is a term used in chemistry that has to do with the electrons an element has available for bonding with other elements. This can lead to complex molecules as elements become interconnected through sharing electrons.

In Big Data, valence is how interconnected the data is. As there are more and more connections among the data, the complexity of the analysis increases.

Value

Value is the ability to convert Big Data into a monetary reward. For example, if you find a relationship between two products at the point of sale, you can recommend them to customers on a website or put the products next to each other in a store.

A lot of Big Data research is done with a motive of making money. However, there is also a lot of Big Data research that is not driven by a profit motive, such as the research being used to analyze the human genome. As such, the “value” characteristic is not always included when talking about the characteristics of Big Data.

Conclusion

Understanding the traits of Big Data allows an individual to identify Big Data when they see it. The traits here are the common ones of Big Data. However, this list is far from exhaustive and there is much more that could be said.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of the nearest labeled example. This is based on the assumption that if two examples are next to each other, they are probably of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.

Characteristics

Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions: two features lead to a two-dimensional feature space, three features lead to a three-dimensional feature space, etc. All of the examples are placed in this feature space based on their respective features.

The label of an unknown example is determined by its closest neighbor or neighbors. This calculation is typically based on Euclidean distance, which is the straight-line distance between two points. The number of neighbors used varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use; the more neighbors used, the more complicated the classification becomes.
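
Here is a minimal sketch of the single-nearest-neighbor idea with made-up data: two features, four labeled examples, and one unlabeled example to classify.

labeled <- data.frame(x1 = c(1.2, 1.3, 3.0, 3.2),
                      x2 = c(0.5, 0.6, 2.1, 2.3),
                      class = c("a", "a", "b", "b"),
                      stringsAsFactors = FALSE)
unknown <- c(x1 = 2.9, x2 = 2.0)

# Euclidean (straight-line) distance from the unknown example to every labeled one
distances <- sqrt((labeled$x1 - unknown["x1"])^2 + (labeled$x2 - unknown["x2"])^2)

# with one neighbor, the label of the closest example becomes the prediction
labeled$class[which.min(distances)]
## [1] "b"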

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provided by the model to help understand why certain relationships exist. Nearest neighbor tells you where the relationships are but not why or how. This is partly because it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model, which deprives us of insights into the relationships in the data. Another concern is the headache of missing data, which forces you to spend time cleaning the data more thoroughly. One final issue is that the classification phase of a project is slow and cumbersome, because that is when all of the distance calculations take place.

Conclusion

Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge is knowing exactly what to do. You have to develop an approach to your problem, data collection, and analysis that is acceptable to your peers.

This level of freedom leads to people literally freezing and not completing a project. Now imagine having several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simple six-step process for approaching data science. The process involves the following six steps

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to get access to data for further analysis. Not always, but often, data scientists are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research, in which you are often involved from beginning to end, in data science you may jump into analyzing a mess of data that others collected. This is challenging, as it may not be clear what people want to know or what exactly they collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have and the level of measurement of the data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats, from structured to unstructured, it is critical to take a look at the data using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This includes dealing with missing data, transforming features, addressing outliers, and other steps necessary to prepare for analysis. The primary goal is to organize the data for analysis.

This is a critical step, as various machine learning models have different assumptions that must be met. Some models can handle missing data and some cannot. Some models are affected by outliers and some are not.

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis need to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well as spoken explanations.

Step 6 Apply the Results

This is the chance to actually use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications from people interested in applying for a home loan.

Conclusion

The steps in this process are just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

The Types of Data in Big Data

A well-known quote in the business world is “cash is king.” Nothing will destroy a business faster than a lack of liquidity to meet a financial emergency. What you are worth may not matter as much as what you can spend.

However, there is now a challenge to this mantra. In the world of data science there is the belief that data is king. This can potentially make sense, as using data to foresee financial disaster can help people to have cash ready.

In this post we are going to examine the different types of data in the world of data science. Generally, there are two types of data: unstructured and structured.

Unstructured Data

Unstructured data is data that is produced by people. Normally, this data is text heavy. Examples of unstructured data include tweets on Twitter, customer feedback on Amazon, blogs, emails, etc. This type of data is very challenging to work with because it is not necessarily in a format ready for analysis.

Despite the challenges, there are techniques available for using this information to make decisions. Often, the analysis of unstructured data is used to target products and make recommendations for purchases by companies.

Structured Data

Structured data is in many ways the complete opposite of unstructured data. Structured data has a clear format and a specific place for various pieces of data. An Excel document is one example of structured data. A receipt is another example; it has a specific place for different pieces of information such as price, total, date, etc. Often, structured data is made by organizations and machines.

Naturally, analyzing structured data is often much easier than unstructured data. With a consistent format there is less processing required before analysis.

Working With Data

When approaching a project, data often comes from several sources. Normally, the data has to be moved around and consolidated into one space for analysis. When working with unstructured and/or structured data coming from several different sources, a three-step process is used to facilitate this. The process is called ETL, which stands for extract, transform, and load.

Extracting data means taking it from one place in order to move it somewhere else. Transforming means changing the data in some way or another. For example, this often means organizing it for the purpose of answering research questions. How this is done is context specific.

Loading simply means placing all the transformed data into one place for analysis. This is a critical last step, as it is helpful to have everything you are analyzing in one convenient place. The details will be addressed in a future post, but a rough sketch is shown below.
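
As a rough illustration only, here is what a tiny ETL-style workflow might look like in R. The file names and columns are hypothetical.

library(readr)
library(dplyr)

# extract: pull the raw data from two separate (hypothetical) sources
store_sales  <- read_csv("store_sales.csv")
online_sales <- read_csv("online_sales.csv")

# transform: put both sources into a consistent format for analysis
store_sales  <- mutate(store_sales,  channel = "store")
online_sales <- mutate(online_sales, channel = "online")

# load: combine everything into one place and save it for analysis
all_sales <- bind_rows(store_sales, online_sales)
write_csv(all_sales, "all_sales.csv")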

Conclusion

In what may be an interesting contradiction, as we collect more and more data, data is actually becoming more valuable. Normally, an increase in a resource lessens its value, but not with data. Organizations are collecting data at a record-breaking pace in order to anticipate the behavior of people. The predictive power derived from data can lead to significant profits, which leads to the conclusion that perhaps data is now the king.