# Primary Tasks in Data Analysis

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people,  plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least 5 core task that you need to consider when preparing to analyze data. These five task are

2. Data exploration
3. Developing a statistical model
4. Interpreting the results
5. Sharing the results

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

Data Exploration

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables.

Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist time and is not easy at all.  Mastery of this in many ways means being a master of data science

Develop a Statistical Model

Your research questions  and the data exploration  process helps you to determine what kind of model to develop. The factors that can affect this is whether your data is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the funniest part of data analysis and is much easier then having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s)

Interpreting the Results

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.

With interpretation you are trying to determine “what does this answer mean to the stakeholders?”  For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

Communication of Results

Now  is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following

• The audience or stakeholders
• The actual answers to the questions
• The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be  different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

Conclusion

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

# Logistic Regression in R

In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical dependent variables. In our example, we want to predict Sex (male or female) when using several continuous variables from the “survey” dataset in the “MASS” package.

library(MASS);library(bestglm);library(reshape2);library(corrplot)
data(survey)
?MASS::survey #explains the variables in the study

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove in examples that are missing data so we use the “na.omit” function for this. Below is the code

survey$Clap<-NULL survey$W.Hnd<-NULL
survey$Fold<-NULL survey$Exer<-NULL
survey$Smoke<-NULL survey$M.I<-NULL
survey<-na.omit(survey)

We now need to check for collinearity using the “corrplot.mixed” function form the “corrplot” package.

pc<-cor(survey[,2:5])
corrplot.mixed(pc)
corrplot.mixed(pc)

We have extreme correlation between “We.Hnd” and “NW.Hnd” this makes sense because people’s hands are normally the same size. Since this blog post  is a demonstration of logistic regression we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for. First we need to make a variable that we call “ind” that is randomly assigns 70% of the number of rows of survey 1 and 30% 2. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is 70% train and 30% test. Below is the code

set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

fit<-glm(Sex~.,family=binomial,train)
summary(fit)
##
## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9875  -0.5466  -0.1395   0.3834   3.4443
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.42175    8.74961  -5.306 1.12e-07 ***
## Wr.Hnd       -0.43499    0.66357  -0.656    0.512
## NW.Hnd        1.05633    0.70034   1.508    0.131
## Pulse        -0.02406    0.02356  -1.021    0.307
## Height        0.21062    0.05208   4.044 5.26e-05 ***
## Age           0.00894    0.05368   0.167    0.868
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 169.14  on 122  degrees of freedom
## Residual deviance:  81.15  on 117  degrees of freedom
## AIC: 93.15
##
## Number of Fisher Scoring iterations: 6
exp(coef(fit))
##  (Intercept)       Wr.Hnd       NW.Hnd        Pulse       Height
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00
##          Age
## 1.008980e+00

The results indicate that only height is useful in predicting if someone is a male or female. The second piece of code shares the odds ratios. The odds ratio tell how a one unit increase in the independent variable leads to an increase in the odds of being male in our model. For example, for every one unit increase in height there is a 1.23 increase in the odds of a particular example being male.

We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

train$probs<-predict(fit, type = 'response') train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
##
##          Female Male
##   Female     61    7
##   Male        7   48
mean(train$predict==train$Sex)
## [1] 0.8861789

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

test$prob<-predict(fit,newdata = test, type = 'response') test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
##
##          Female Male
##   Female     17    3
##   Male        0   26
mean(test$predict==test$Sex)
## [1] 0.9347826

As you can see, we do even better on the test set with an accuracy of 93.4%. Our model is looking pretty good and height is an excellent predictor of sex which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess the quality of it.

# Probability,Odds, and Odds Ratio

In logistic regression, there are three terms that are used frequently but can be confusing if they are not thoroughly explained. These three terms are probability, odds, and odds ratio. In this post, we will look at these three terms and provide an explanation of them.

Probability

Probability is probably (no pun intended) the easiest of these three terms to understand. Probability is simply the likelihood that a certain even will happen.  To calculate the probability in the traditional sense you need to know the number of events and outcomes to find the probability.

Bayesian probability uses prior probabilities to develop a posterior probability based on new evidence. For example, at one point during Super Bowl LI the Atlanta Falcons had a 99.7% chance of winning. This was base don such factors as the number points they were ahead and the time remaining.  As these changed, so did the probability of them winning. yet the Patriots still found a way to win with less then a 1% chance

Bayesian probability was also used for predicting who would win the 2016 US presidential race. It is important to remember that probability is an expression of confidence and not a guarantee as we saw in both examples.

Odds

Odds are the expression of relative probabilities. Odds are calculated using the following equation

probability of the event ⁄ 1 – probability of the event

For example, at one point during Super Bowl LI the odds of the Atlanta Falcons winning were as follows

0.997 ⁄ 1 – 0.997 = 332

This can be interpreted as the odds being 332 to 1! This means that Atlanta was 332 times more likely to win the Super Bowl then loss the Super Bowl.

Odds are commonly used in gambling and this is probably (again no pun intended) where most of us have heard the term before. The odds is just an extension of probabilities and the are most commonly expressed as a fraction such as one in four, etc.

Odds Ratio

A ratio is the comparison of of two numbers and indicates how many times one number is contained or contains another number. For example, a ration of boys to girls is 5 to 1 it means that there are five boys for every one girl.

By  extension odds ratio is the comparison of two different odds. For example, if the odds of Team A making the playoffs is 45% and the odds of Team B making the playoffs is 35% the odds ratio is calculated as follows.

0.45 ⁄ 0.35 = 1.28

Team A is 1.28 more likely to make the playoffs then Team B.

The value of the odds and the odds ratio can sometimes be the same.  Below is the odds ratio of the Atlanta Falcons winning and the New Patriots winning Super Bowl LI

0.997⁄ 0.003 = 332

As such there is little difference between odds and odds ratio except that odds ratio is the ratio of two odds ratio. As you can tell, there is a lot of confusion about this for the average person. However, understanding these terms is critical to the application of logistic regression.

# Best Subset Regression in R

In this post, we will take a look at best subset regression. Best subset regression fits a model for all possible feature or variable combinations and the decision for the most appropriate model is made by the analyst based on judgment or some statistical criteria.

Best subset regression is an alternative to both Forward and Backward stepwise regression. Forward stepwise selection adds one variable at a time based on the lowest residual sum of squares until no more variables continues to lower the residual sum of squares. Backward stepwise regression starts with all variables in the model and removes variables one at a time. The concern with stepwise methods is they can produce biased regression coefficients, conflicting models, and inaccurate confidence intervals.

Best subset regression bypasses these weaknesses of stepwise models by creating all models possible and then allowing you to assess which variables should be including in your final model. The one drawback to best subset is that a large number of variables means a large number of potential models, which can make it difficult to make a decision among several choices.

In this post, we will use the “Fair” dataset from the “Ecdat” package to predict marital satisfaction based on age, Sex, the presence of children, years married, religiosity, education, occupation, and number of affairs in the past year. Below is some initial code.

library(leaps);library(Ecdat);library(car);library(lmtest)
data(Fair)

We begin our analysis by building the initial model with all variables in it. Below is the code

fit<-lm(rate~.,Fair)
summary(fit)
##
## Call:
## lm(formula = rate ~ ., data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2049 -0.6661  0.2298  0.7705  2.2292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.522875   0.358793   9.819  < 2e-16 ***
## sexmale     -0.062281   0.099952  -0.623  0.53346
## age         -0.009683   0.007548  -1.283  0.20005
## ym          -0.019978   0.013887  -1.439  0.15079
## childyes    -0.206976   0.116227  -1.781  0.07546 .
## religious    0.042142   0.037705   1.118  0.26416
## education    0.068874   0.021153   3.256  0.00119 **
## occupation  -0.015606   0.029602  -0.527  0.59825
## nbaffairs   -0.078812   0.013286  -5.932 5.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 592 degrees of freedom
## Multiple R-squared:  0.1405, Adjusted R-squared:  0.1289
## F-statistic:  12.1 on 8 and 592 DF,  p-value: 4.487e-16

The initial results are already interesting even though the r-square is low. When couples have children the have less martial satisfaction than couples without children when controlling for the other factors and this is the strongest regression weight. In addition, the more education a person has there is an increase in marital satisfaction. Lastly, as the number of affairs increases there is also a decrease in martial satisfaction. Keep in mind that the “rate” variable goes from 1 to 5 with one meaning a terrible marriage to five being a great one. The mean marital satisfaction was 3.52 when controlling for the other variables.

We will now create our subset models. Below is the code.

sub.fit<-regsubsets(rate~.,Fair)
best.summary<-summary(sub.fit)

In the code above we create the sub models using the “regsubsets” function from the “leaps” package and saved it in the variable called “sub.fit”. We then saved the summary of “sub.fit” in the variable “best.summary”. We will use the “best.summary” “sub.fit variables several times to determine which model to use.

There are many different ways to assess the model. We will use the following statistical methods that come with the results from the “regsubset” function.

• Mallow’ Cp
• Bayesian Information Criteria

We will make two charts for each of the criteria above. The plot to the left will explain how many features to include in the model. The plot to the right will tell you which variables to include. It is important to note that for both of these methods, the lower the score the better the model. Below is the code for Mallow’s Cp.

par(mfrow=c(1,2))
plot(best.summary$cp) plot(sub.fit,scale = "Cp") The plot on the left suggest that a four feature model is the most appropriate. However, this chart does not tell me which four features. The chart on the right is read in reverse order. The high numbers are at the bottom and the low numbers are at the top when looking at the y-axis. Knowing this, we can conclude that the most appropriate variables to include in the model are age, children presence, education, and number of affairs. Below are the results using the Bayesian Information Criterion par(mfrow=c(1,2)) plot(best.summary$bic)
plot(sub.fit,scale = "bic")

These results indicate that a three feature model is appropriate. The variables or features are years married, education, and number of affairs. Presence of children was not considered beneficial. Since our original model and Mallow’s Cp indicated that presence of children was significant we will include it for now.

Below is the code for the model based on the subset regression.

fit2<-lm(rate~age+child+education+nbaffairs,Fair)
summary(fit2)
##
## Call:
## lm(formula = rate ~ age + child + education + nbaffairs, data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2172 -0.7256  0.1675  0.7856  2.2713
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.861154   0.307280  12.566  < 2e-16 ***
## age         -0.017440   0.005057  -3.449 0.000603 ***
## childyes    -0.261398   0.103155  -2.534 0.011531 *
## education    0.058637   0.017697   3.313 0.000978 ***
## nbaffairs   -0.084973   0.012830  -6.623 7.87e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.029 on 596 degrees of freedom
## Multiple R-squared:  0.1352, Adjusted R-squared:  0.1294
## F-statistic: 23.29 on 4 and 596 DF,  p-value: < 2.2e-16

The results look ok. The older a person is the less satisfied they are with their marriage. If children are presence the marriage is less satisfying. The more educated the more satisfied they are. Lastly, the higher the number of affairs indicate less marital satisfaction. However, before we get excited we need to check for collinearity and homoscedasticity. Below is the code

vif(fit2)
##       age     child education nbaffairs
##  1.249430  1.228733  1.023722  1.014338

No issues with collinearity.For vif values above 5 or 10 indicate a problem. Let’s check for homoscedasticity

par(mfrow=c(2,2))
plot(fit2)

The normal qqplot and residuals vs leverage plot can be used for locating outliers. The residual vs fitted and the scale-location plot do not look good as there appears to be a pattern in the dispersion which indicates homoscedasticity. To confirm this we will use Breusch-Pagan test from the “lmtest” package. Below is the code

bptest(fit2)
##
##  studentized Breusch-Pagan test
##
## data:  fit2
## BP = 16.238, df = 4, p-value = 0.002716

There you have it. Our model violates the assumption of homoscedasticity. However, this model was developed for demonstration purpose to provide an example of subset regression.

# Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This experience is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in r or excel. If you created it in excel just save it as a csv and load it into r. Below is the initial code

library(readr)
apple <- read_csv("~/Desktop/apple.csv")
## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This a small dataset with the columns of “weight” and “location”. Here are some of the problems

• Weights are in different units
• Weights are written in different ways
• Location is not consistent

In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project

• Convert the “Weight variable” to a numerical variable instead of character
• Remove the text and have only numbers in the “weight variable”
• Change weights in grams to kilograms
• Convert the “location” variable to a factor variable instead of character
• Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight)) corrected.weight ## [1] 3.2 4.2 1.3 7200.0 42.0 2.3 2.1 3.1 2700.0 24.0 Here is what we did. 1. We created a variable called “corrected.weight” 2. We use the function “as.numeric” this makes whatever results inside it to be a numerical variable 3. Inside “as.numeric” we used the “gsub” function which allows us to substitute one value for another. 4. Inside “gsub” we used the argument pattern and set it to “[[alpha:]]” and “” this told r to look for any lower or uppercase letters and replace with nothing or remove it. This all pertains to the “weight” variable in the apple dataframe. We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

1. We created a variable called “gram.error”
2. We used the grep function to search are the “weight” variable in the apple data frame for input that is a digit and is 4 digits in length this is what the “[[:digit:]]{4}” argument means. We do not change any values yet we just store them in “gram.error”
3. Once this information is stored in “gram.error” we use it as a subset for the “corrected.weight” variable.
4. We tell r to save into the “corrected.weight” variable any value that is changeable according to the criteria set in “gram.error” and to divided it by 1000. Dividing it by 1000 converts the value from grams to kilograms.

We have completed the transformation of the “weight” and will move to dealing with the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the issues related to the values that relate to Europe and then we will deal with values related to United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2))) america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location corrected.location[europe]<-"europe" corrected.location[america]<-"US" corrected.location<-gsub(pattern = "United States","US",corrected.location) corrected.location ## [1] "europe" "europe" "US" "US" "US" "europe" "europe" ## [8] "US" "US" "US" The code is a little complicated to explain but in short We used the “agrep” function to tell r to search the “location” to look for values similar to our term “europe”. The other arguments provide some exceptions that r should change because the exceptions are close to the term europe. This process is repeated for the term “us”. We then store are the location variable from the “apple” dataframe in a new variable called “corrected.location” We then apply the two objects we made called “europe” and “america” to the “corrected.location” variable. Next we have to make some code to deal with “United States” and apply this using the “gsub” function. We are almost done, now we combine are two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below cleaned.apple<-data.frame(corrected.weight,corrected.location) names(cleaned.apple)<-c('weight','location') cleaned.apple ## weight location ## 1 3.2 europe ## 2 4.2 europe ## 3 1.3 US ## 4 7.2 US ## 5 42.0 US ## 6 2.3 europe ## 7 2.1 europe ## 8 3.1 US ## 9 2.7 US ## 10 24.0 US If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor. This looks much better especially if you compare it to the original dataframe that is printed at the top of this post. # Principal Component Analysis in R This post will demonstrate the use of principal component analysis (PCA). PCA is useful for several reasons. One it allows you place your examples into groups similar to linear discriminant analysis but you do not need to know beforehand what the groups are. Second, PCA is used for the purpose of dimension reduction. For example, if you have 50 variables PCA can allow you to reduce this while retaining a certain threshold of variance. If you are working with a large dataset this can greatly reduce the computational time and general complexity of your models. Keep in mind that there really is not a dependent variable as this is unsupervised learning. What you are trying to see is how different examples can be mapped in space based on whatever independent variables are used. For our example, we will use the “Carseats” dataset form the “ISLR”. Our goal is to understanding the relationship among the variables when examining the shelve location of the car seat. Below is the initial code to begin the analysis library(ggplot2) library(ISLR) data("Carseats") We first need to rearrange the data and remove the variables we are not going to use in the analysis. Below is the code. Carseats1<-Carseats Carseats1<-Carseats1[,c(1,2,3,4,5,6,8,9,7,10,11)] Carseats1$Urban<-NULL
Carseats1$US<-NULL Here is what we did 1. We made a copy of the “Carseats” data called “Careseats1” 2. We rearranged the order of the variables so that the factor variables are at the end. This will make sense later 3.We removed the “Urban” and “US” variables from the table as they will not be a part of our analysis We will now do the PCA. We need to scale and center our data otherwise the larger numbers will have a much stronger influence on the results than smaller numbers. Fortunately, the “prcomp” function has a “scale” and a “center” argument. We will also use only the first 7 columns for the analysis as “sheveLoc” is not useful for this analysis. If we hadn’t moved “shelveLoc” to the end of the dataframe it would cause some headache. Below is the code. Carseats.pca<-prcomp(Carseats1[,1:7],scale. = T,center = T) summary(Carseats.pca) ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## Standard deviation 1.3315 1.1907 1.0743 0.9893 0.9260 0.80506 0.41320 ## Proportion of Variance 0.2533 0.2026 0.1649 0.1398 0.1225 0.09259 0.02439 ## Cumulative Proportion 0.2533 0.4558 0.6207 0.7605 0.8830 0.97561 1.00000 The summary of “Carseats.pca” Tells us how much of the variance each component explains. Keep in mind that number of components is equal to the number of variables. The “proportion of variance” tells us the contribution each component makes and the “cumulative proportion”. If your goal is dimension reduction than the number of components to keep depends on the threshold you set. For example, if you need around 90% of the variance you would keep the first 5 components. If you need 95% or more of the variance you would keep the first six. To actually use the components you would take the “Carseats.pca$x” data and move it to your data frame.

Keep in mind that the actual components have no conceptual meaning but is a numerical representation of a combination of several variables that were reduce using PCA to fewer variables such as going form 7 variables to 5 variables.

This means that PCA is great for reducing variables for prediction purpose but is much harder for explanatory studies unless you can explain what the new components represent.

For our purposes, we will keep 5 components. This means that we have reduce our dimensions from 7 to 5 while still keeping almost 90% of the variance. Graphing our results is tricky because we have 5 dimensions but the human mind can only conceptualize 3 at the best and normally 2. As such we will plot the first two components and label them by shelf location using ggplot2. Below is the code

scores<-as.data.frame(Carseats.pca$x) pcaplot<-ggplot(scores,(aes(PC1,PC2,color=Carseats1$ShelveLoc)))+geom_point()
pcaplot

From the plot you can see there is little separation when using the first two components of the PCA analysis. This makes sense as we can only graph to components so we are missing a lot of the variance. However for demonstration purposes the analysis is complete.

# Developing a Data Analysis Plan

It is extremely common for beginners and perhaps even experience researchers to lose track of what they are trying to achieve or do when trying to complete a research project. The open nature of research allows for a multitude of equally acceptable ways to complete a project. This leads to  an inability to make decision and or stay on course when doing research.

One way to reduce and eliminate the roadblock to decision making and focus in research is to develop a plan. In this post we will look at one version of a data analysis plan.

Data Analysis Plan

A data analysis plan includes many features of a research project in it with a particular emphasis on mapping out how research questions will be answered and what is necessary to answer the question. Below is a sample template of the analysis plan.

The majority of this diagram should be familiar to someone who has ever done research. At the top, you state the problem, this is the overall focus of the paper. Next comes the purpose, the purpose is the over-arching goal of a research project.

After purpose comes the research questions. The research questions are questions about the problem that are answerable. People struggle with developing clear and answerable research questions. It is critical that research questions are written in a way that they can be answered and that the questions are clearly derived from the problem. Poor questions means poor or even no answers.

After the research questions it is important to know what variables are available for the entire study and specifically what variables can be used to answer each research question. Lastly, you must indicate what analysis or visual you will develop in order to answer your research questions about your problem. This requires you to know how you will answer your research questions

Example

Below is an example of a completed analysis plan for  simple undergraduate level research paper

In the example above, the  student want to understand the perceptions of university students about the cafeteria food quality and their satisfaction with the university. There were four research questions, a demographic descriptive question, a descriptive question about the two main variables, a comparison question, and lastly a relationship question.

The variables available for answering the questions are listed of to the left  side. Under that, the student indicates the variables needed to answer each question. For example, the demographic variables of sex, class level, and major are needed to answer the question about the demographic profile.

The last section is the analysis. For the demographic profile the student found the percentage of the population in each sub group of the demographic variables.

Conclusion

A data analysis plan provides an excellent way to determine what needs to be done to complete a study. It also helps a researcher to clearly understand what they are trying to do and provides a visuals for those who the research wants to communicate  with about the progress of a study.

# Linear Discriminant Analysis in R

In this post we will look at an example of linear discriminant analysis (LDA). LDA is used to develop a statistical model that classifies examples in a dataset. In the example in this post, we will use the “Star” dataset from the “Ecdat” package. What we will do is try to predict the type of class the students learned in (regular, small, regular with aide) using their math scores, reading scores, and the teaching experience of the teacher. Below is the initial code

library(Ecdat)
library(MASS)
data(Star)

We first need to examine the data by using the “str” function

str(Star)
## 'data.frame':    5748 obs. of  8 variables:
##  $tmathssk: int 473 536 463 559 489 454 423 500 439 528 ... ##$ treadssk: int  447 450 439 448 447 431 395 451 478 455 ...
##  $classk : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ... ##$ totexpk : int  7 21 0 16 5 8 17 3 11 10 ...
##  $sex : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ... ##$ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $race : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ... ##$ schidkn : int  63 20 19 69 79 5 16 56 11 66 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
##   .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...

We will use the following variables

• dependent variable = classk (class type)
• independent variable = tmathssk (Math score)
• independent variable = totexpk (Teaching experience)

We now need to examine the data visually by looking at histograms for our independent variables and a table for our dependent variable

hist(Star$tmathssk) hist(Star$treadssk)

hist(Star$totexpk) prop.table(table(Star$classk))
##
##           regular       small.class regular.with.aide
##         0.3479471         0.3014962         0.3505567

The data mostly looks good. The results of the “prop.table” function will help us when we develop are training and testing datasets. The only problem is with the “totexpk” variable. IT is not anywhere near to be normally distributed. TO deal with this we will use the square root for teaching experience. Below is the code

star.sqrt<-Star
star.sqrt$totexpk.sqrt<-sqrt(star.sqrt$totexpk)
hist(sqrt(star.sqrt$totexpk)) Much better. We now need to check the correlation among the variables as well and we will use the code below. cor.star<-data.frame(star.sqrt$tmathssk,star.sqrt$treadssk,star.sqrt$totexpk.sqrt)
cor(cor.star)
##                        star.sqrt.tmathssk star.sqrt.treadssk
## star.sqrt.tmathssk             1.00000000          0.7135489
## star.sqrt.totexpk.sqrt         0.08647957          0.1045353
##                        star.sqrt.totexpk.sqrt
## star.sqrt.tmathssk                 0.08647957
## star.sqrt.totexpk.sqrt             1.00000000

None of the correlations are too bad. We can now develop our model using linear discriminant analysis. First, we need to scale are scores because the test scores and the teaching experience are measured differently. Then, we need to divide our data into a train and test set as this will allow us to determine the accuracy of the model. Below is the code.

star.sqrt$tmathssk<-scale(star.sqrt$tmathssk)
star.sqrt$treadssk<-scale(star.sqrt$treadssk)
star.sqrt$totexpk.sqrt<-scale(star.sqrt$totexpk.sqrt)
train.star<-star.sqrt[1:4000,]
test.star<-star.sqrt[4001:5748,]

Now we develop our model. In the code before the “prior” argument indicates what we expect the probabilities to be. In our data the distribution of the the three class types is about the same which means that the apriori probability is 1/3 for each class type.

train.lda<-lda(classk~tmathssk+treadssk+totexpk.sqrt, data =
train.star,prior=c(1,1,1)/3)
train.lda
## Call:
## lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star,
##     prior = c(1, 1, 1)/3)
##
## Prior probabilities of groups:
##           regular       small.class regular.with.aide
##         0.3333333         0.3333333         0.3333333
##
## Group means:
## regular           -0.04237438 -0.05258944  -0.05082862
## small.class        0.13465218  0.11021666  -0.02100859
## regular.with.aide -0.05129083 -0.01665593   0.09068835
##
## Coefficients of linear discriminants:
##                      LD1         LD2
## tmathssk      0.89656393 -0.04972956
## totexpk.sqrt -0.49061950  0.80051026
##
## Proportion of trace:
##    LD1    LD2
## 0.7261 0.2739

The printout is mostly readable. At the top is the actual code used to develop the model followed by the probabilities of each group. The next section shares the means of the groups. The coefficients of linear discriminants are the values used to classify each example. The coefficients are similar to regression coefficients. The computer places each example in both equations and probabilities are calculated. Whichever class has the highest probability is the winner. In addition, the higher the coefficient the more weight it has. For example, “tmathssk” is the most influential on LD1 with a coefficient of 0.89.

The proportion of trace is similar to principal component analysis

Now we will take the trained model and see how it does with the test set. We create a new model called “predict.lda” and use are “train.lda” model and the test data called “test.star”

predict.lda<-predict(train.lda,newdata = test.star)

We can use the “table” function to see how well are model has done. We can do this because we actually know what class our data is beforehand because we divided the dataset. What we need to do is compare this to what our model predicted. Therefore, we compare the “classk” variable of our “test.star” dataset with the “class” predicted by the “predict.lda” model.

table(test.star$classk,predict.lda$class)
##
##                     regular small.class regular.with.aide
##   regular               155         182               249
##   small.class           145         198               174
##   regular.with.aide     172         204               269

The results are pretty bad. For example, in the first row called “regular” we have 155 examples that were classified as “regular” and predicted as “regular” by the model. In rhe next column, 182 examples that were classified as “regular” but predicted as “small.class”, etc. To find out how well are model did you add together the examples across the diagonal from left to right and divide by the total number of examples. Below is the code

(155+198+269)/1748
## [1] 0.3558352

Only 36% accurate, terrible but ok for a demonstration of linear discriminant analysis. Since we only have two-functions or two-dimensions we can plot our model.  Below I provide a visual of the first 50 examples classified by the predict.lda model.

plot(predict.lda$x[1:50]) text(predict.lda$x[1:50],as.character(predict.lda$class[1:50]),col=as.numeric(predict.lda$class[1:100]))
abline(h=0,col="blue")
abline(v=0,col="blue")

The first function, which is the vertical line, doesn’t seem to discriminant anything as it off to the side and not separating any of the data. However, the second function, which is the horizontal one, does a good of dividing the “regular.with.aide” from the “small.class”. Yet, there are problems with distinguishing the class “regular” from either of the other two groups.  In order improve our model we need additional independent variables to help to distinguish the groups in the dependent variable.

# Generalized Additive Models in R

In this post, we will learn how to create a generalized additive model (GAM). GAMs are non-parametric generalized linear models. This means that linear predictor of the model uses smooth functions on the predictor variables. As such, you do not need to specific the functional relationship between the response and continuous variables. This allows you to explore the data for potential relationships that can be more rigorously tested with other statistical models

In our example, we will use the “Auto” dataset from the “ISLR” package and use the variables “mpg”,“displacement”,“horsepower”,and “weight” to predict “acceleration”. We will also use the “mgcv” package. Below is some initial code to begin the analysis

library(mgcv)
library(ISLR)
data(Auto)

We will now make the model we want to understand the response of “accleration” to the explanatory variables of “mpg”,“displacement”,“horsepower”,and “weight”. After setting the model we will examine the summary. Below is the code

model1<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight),data=Auto)
summary(model1)
##
## Family: gaussian
##
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.54133    0.07205   215.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                   edf Ref.df      F  p-value
## s(mpg)          6.382  7.515  3.479  0.00101 **
## s(displacement) 1.000  1.000 36.055 4.35e-09 ***
## s(horsepower)   4.883  6.006 70.187  < 2e-16 ***
## s(weight)       3.785  4.800 41.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.733   Deviance explained = 74.4%
## GCV = 2.1276  Scale est. = 2.0351    n = 392

All of the explanatory variables are significant and the adjust r-squared is .73 which is excellent. edf stands for “effective degrees of freedom”. This modified version of the degree of freedoms is due to the smoothing process in the model. GCV stands for generalized cross validation and this number is useful when comparing models. The model with the lowest number is the better model.

We can also examine the model visually by using the “plot” function. This will allow us to examine if the curvature fitted by the smoothing process was useful or not for each variable. Below is the code.

plot(model1)

We can also look at a 3d graph that includes the linear predictor as well as the two strongest predictors. This is done with the “vis.gam” function. Below is the code

vis.gam(model1)

If multiple models are developed. You can compare the GCV values to determine which model is the best. In addition, another way to compare models is with the “AIC” function. In the code below, we will create an additional model that includes “year” compare the GCV scores and calculate the AIC. Below is the code.

model2<-gam(acceleration~s(mpg)+s(displacement)+s(horsepower)+s(weight)+s(year),data=Auto)
summary(model2)
##
## Family: gaussian
##
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) +
##     s(year)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.54133    0.07203   215.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value
## s(mpg)          5.578  6.726  2.749  0.0106 *
## s(displacement) 2.251  2.870 13.757 3.5e-08 ***
## s(horsepower)   4.936  6.054 66.476 < 2e-16 ***
## s(weight)       3.444  4.397 34.441 < 2e-16 ***
## s(year)         1.682  2.096  0.543  0.6064
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.733   Deviance explained = 74.5%
## GCV = 2.1368  Scale est. = 2.0338    n = 392
#model1 GCV
model1$gcv.ubre ## GCV.Cp ## 2.127589 #model2 GCV model2$gcv.ubre
##   GCV.Cp
## 2.136797

As you can see, the second model has a higher GCV score when compared to the first model. This indicates that the first model is a better choice. This makes sense because in the second model the variable “year” is not significant. To confirm this we will calculate the AIC scores using the AIC function.

AIC(model1,model2)
##              df      AIC
## model1 18.04952 1409.640
## model2 19.89068 1411.156

Again, you can see that model1 s better due to its fewer degrees of freedom and slightly lower AIC score.

Conclusion

Using GAMs is most common for exploring potential relationships in your data. This is stated because they are difficult to interpret and to try and summarize. Therefore, it is normally better to develop a generalized linear model over a GAM due to the difficulty in understanding what the data is trying to tell you when using GAMs.

# Generalized Models in R

Generalized linear models are another way to approach linear regression. The advantage of of GLM is that allows the error to follow many different distributions rather than only the normal distribution which is an assumption of traditional linear regression.

Often GLM is used for response or dependent variables that are binary or represent count data. THis post will provide a brief explanation of GLM as well as provide an example.

Key Information

There are three important components to a GLM and they are

• Error structure
• Linear predictor

The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling such as binomial, gaussian, poission, etc. Each distribution comes with certain assumptions that govern their use.

The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions and the best link function is the one that reduces the residual deviances the most.

In our example, we will try to predict if a house will have air conditioning based on the interactioon between number of bedrooms and bathrooms, number of stories, and the price of the house. To do this, we will use the “Housing” dataset from the “Ecdat” package. Below is some initial code to get started.

library(Ecdat)
data("Housing")

The dependent variable “airco” in the “Housing” dataset is binary. This calls for us to use a GLM. To do this we will use the “glm” function in R. Furthermore, in our example, we want to determine if there is an interaction between number of bedrooms and bathrooms. Interaction means that the two independent variables (bathrooms and bedrooms) influence on the dependent variable (aircon) is not additive, which means that the combined effect of the independnet variables is different than if you just added them together. Below is the code for the model followed by a summary of the results

model<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family=binomial) summary(model) ## ## Call: ## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7069  -0.7540  -0.5321   0.8073   2.4217
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.441e+00  1.391e+00  -4.632 3.63e-06
## Housing$bedrooms 8.041e-01 4.353e-01 1.847 0.0647 ## Housing$bathrms                   1.753e+00  1.040e+00   1.685   0.0919
## Housing$stories 3.209e-01 1.344e-01 2.388 0.0170 ## Housing$price                     4.268e-05  5.567e-06   7.667 1.76e-14
## Housing$bedrooms:Housing$bathrms -6.585e-01  3.031e-01  -2.173   0.0298
##
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.75  on 540  degrees of freedom
## AIC: 561.75
##
## Number of Fisher Scoring iterations: 4

To check how good are model is we need to check for overdispersion as well as compared this model to other potential models. Overdispersion is a measure to determine if there is too much variablity in the model. It is calcualted by dividing the residual deviance by the degrees of freedom. Below is the solution for this

549.75/540
## [1] 1.018056

Our answer is 1.01, which is pretty good because the cutoff point is 1, so we are really close.

Now we will make several models and we will compare the results of them

Model 2

#add recroom and garagepl
model2<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family=binomial) summary(model2) ## ## Call: ## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$garagepl,
##     family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6733  -0.7522  -0.5287   0.8035   2.4239
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.369e+00  1.401e+00  -4.545 5.51e-06
## Housing$bedrooms 7.830e-01 4.391e-01 1.783 0.0745 ## Housing$bathrms                   1.702e+00  1.047e+00   1.626   0.1039
## Housing$stories 3.286e-01 1.378e-01 2.384 0.0171 ## Housing$price                     4.204e-05  6.015e-06   6.989 2.77e-12
## Housing$recroomyes 1.229e-01 2.683e-01 0.458 0.6470 ## Housing$garagepl                  2.555e-03  1.308e-01   0.020   0.9844
## Housing$bedrooms:Housing$bathrms -6.430e-01  3.054e-01  -2.106   0.0352
##
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.54  on 538  degrees of freedom
## AIC: 565.54
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.54/538
## [1] 1.02145

Model 3

model3<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family=binomial)
summary(model3)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + ## Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + ## Housing$garagepl, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6629  -0.7436  -0.5295   0.8056   2.4477
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.424e+00  1.409e+00  -4.559 5.14e-06
## Housing$bedrooms 8.131e-01 4.462e-01 1.822 0.0684 ## Housing$bathrms                   1.764e+00  1.061e+00   1.662   0.0965
## Housing$stories 3.083e-01 1.481e-01 2.082 0.0374 ## Housing$price                     4.241e-05  6.106e-06   6.945 3.78e-12
## Housing$recroomyes 1.592e-01 2.860e-01 0.557 0.5778 ## Housing$fullbaseyes              -9.523e-02  2.545e-01  -0.374   0.7083
## Housing$garagepl -1.394e-03 1.313e-01 -0.011 0.9915 ## Housing$bedrooms:Housing$bathrms -6.611e-01 3.095e-01 -2.136 0.0327 ## ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 681.92 on 545 degrees of freedom ## Residual deviance: 549.40 on 537 degrees of freedom ## AIC: 567.4 ## ## Number of Fisher Scoring iterations: 4 #overdispersion calculation 549.4/537 ## [1] 1.023091 Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test. anova(model, model2, model3, test = "Chi") ## Analysis of Deviance Table ## ## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price
## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories +
##     Housing$price + Housing$recroom + Housing$garagepl ## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 540 549.75 ## 2 538 549.54 2 0.20917 0.9007 ## 3 537 549.40 1 0.14064 0.7076 The results of the anova indicate that the models are all essentially the same as there is no statistical difference. The only criteria on which to select a model is the measure of overdispersion. The first model has the lowest rate of overdispersion and so is the best when using this criteria. Therefore, determining if a hous has air conditioning depends on examining number of bedrooms and bathrooms simultenously as well as the number of stories and the price of the house. Conclusion The post explained how to use and interpret GLM in R. GLM can be used primarilyy for fitting data to disrtibutions that are not normal. # Proportion Test in R Proportions are are a fraction or “portion” of a total amount. For example, if there are ten men and ten women in a room the proportion of men in the room is 50% (5 / 10). There are times when doing an analysis that you want to evaluate proportions in our data rather than individual measurements of mean, correlation, standard deviation etc. In this post we will learn how to do a test of proportions using R. We will use the dataset “Default” which is found in the “ISLR” pacakage. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started. library(ISLR) data("Default") We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function. table(Default$student)
##
##   No  Yes
## 7056 2944

We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample. If we sum the results from the table below is the code.

sum(table(Default$student)) ## [1] 10000 There are 10000 people in the sample. To determine the proprtion of students we take the number 2944 / 10000 which equals 29.44 or 29.44%. Below is the code to calculate this table(Default$student) / sum(table(Default$student)) ## ## No Yes ## 0.7056 0.2944 The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so it is better to state specificallt what are hypotheses are. NULL = The value of 29.44% of the sample being students is the same as 50% found in the population ALTERNATIVE = The value of 29.44% of the sample being students is NOT the same as 50% found in the population. Below is the code to complete the z-test. prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE) ## ## 1-sample proportions test without continuity correction ## ## data: 2944 out of 10000, null probability 0.5 ## X-squared = 1690.9, df = 1, p-value < 2.2e-16 ## alternative hypothesis: true p is not equal to 0.5 ## 95 percent confidence interval: ## 0.2855473 0.3034106 ## sample estimates: ## p ## 0.2944 Here is what the code means. 1. prop.test is the function used 2. The first value of 2944 is the total number of students in the sample 3. n = is the sample size 4. p= 0.5 is the theoretical proportion 5. alternative =“two.sided” means we want a two-tail test 6. correct = FALSE means we do not want a correction applied to the z-test. This is useful for small sample sizes but not for our sample of 10000 The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proprtion of students in our sample is different from a theortical proprition of 50% in the population. Below is the same analysis using the binomial exact test. binom.test(2944, n = 10000, p = 0.5) ## ## Exact binomial test ## ## data: 2944 and 10000 ## number of successes = 2944, number of trials = 10000, p-value < ## 2.2e-16 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 95 percent confidence interval: ## 0.2854779 0.3034419 ## sample estimates: ## probability of success ## 0.2944 The results are the same. Whether to use the “prop.test”” or “binom.test” is a major argument among statisticians. The purpose here was to provide an example of the use of both # Theoretical Distribution and R This post will explore an example of testing if a dataset fits a specific theoretical distribution. This is a very important aspect of statistical modeling as it allows to understand the normality of the data and the appropriate steps needed to take to prepare for analysis. In our example, we will use the “Auto” dataset from the “ISLR” package. We will check if the horsepower of the cars in the dataset is normally distributed or not. Below is some initial code to begin the process. library(ISLR) library(nortest) library(fBasics) data("Auto") Determining if a dataset is normally distributed is simple in R. This is normally done visually through making a Quantile-Quantile plot (Q-Q plot). It involves using two functions the “qnorm” and the “qqline”. Below is the code for the Q-Q plot qqnorm(Auto$horsepower)

We now need to add the Q-Q line to see how are distribution lines up with the theoretical normal one. Below is the code. Note that we have to repeat the code above in order to get the completed plot.

qqnorm(Auto$horsepower) qqline(Auto$horsepower, distribution = qnorm, probs=c(.25,.75))

The “qqline” function needs the data you want to test as well as the distribution and probability. The distribution we wanted is normal and is indicated by the argument “qnorm”. The probs argument means probability. The default values are .25 and .75. The resulting graph indicates that the distribution of “horsepower”, in the “Auto” dataset is not normally distributed. That are particular problems with the lower and upper values.

We can confirm our suspicion by running a statistical test. The Anderson-Darling test from the “nortest” package will allow us to test whether our data is normally distributed or not. The code is below

ad.test(Auto$horsepower) ## Anderson-Darling normality test ## ## data: Auto$horsepower
## A = 12.675, p-value < 2.2e-16

From the results, we can conclude that the data is not normally distributed. This could mean that we may need to use non-parametric tools for statistical analysis.

We can further explore our distribution in terms of its skew and kurtosis. Skew measures how far to the left or right the data leans and kurtosis measures how peaked or flat the data is. This is done with the “fBasics” package and the functions “skewness” and “kurtosis”.

First we will deal with skewness. Below is the code for calculating skewness.

horsepowerSkew<-skewness(Auto$horsepower) horsepowerSkew ## [1] 1.079019 ## attr(,"method") ## [1] "moment" We now need to determine if this value of skewness is significantly different from zero. This is done with a simple t-test. We must calculate the t-value before calculating the probability. The standard error of the skew is defined as the square root of six divided by the total number of samples. The code is below stdErrorHorsepower<-horsepowerSkew/(sqrt(6/length(Auto$horsepower)))
stdErrorHorsepower
## [1] 8.721607
## attr(,"method")
## [1] "moment"

Now we take the standard error of Horsepower and plug this into the “pt” function (t probability) with the degrees of freedom (sample size – 1 = 391) we also put in the number 1 and subtract all of this information. Below is the code

1-pt(stdErrorHorsepower,391)
## [1] 0
## attr(,"method")
## [1] "moment"

The value zero means that we reject the null hypothesis that the skew is not significantly different form zero and conclude that the skew is different form zero. However, the value of the skew was only 1.1 which is not that non-normal.

We will now repeat this process for the kurtosis. The only difference is that instead of taking the square root divided by six we divided by 24 in the example below.

horsepowerKurt<-kurtosis(Auto$horsepower) horsepowerKurt ## [1] 0.6541069 ## attr(,"method") ## [1] "excess" stdErrorHorsepowerKurt<-horsepowerKurt/(sqrt(24/length(Auto$horsepower)))
stdErrorHorsepowerKurt
## [1] 2.643542
## attr(,"method")
## [1] "excess"
1-pt(stdErrorHorsepowerKurt,391)
## [1] 0.004267199
## attr(,"method")
## [1] "excess"

Again the pvalue is essentially zero, which means that the kurtosis is significantly different from zero. With a value of 2.64 this is not that bad. However, when both skew and kurtosis are non-normally it explains why our overall distributions was not normal either.

Conclusion

This post provided insights into assessing the normality of a dataset. Visually inspection can take place using  Q-Q plots. Statistical inspection can be done through hypothesis testing along with checking skew and kurtosis.

# Probability Distribution and Graphs in R

In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.

At a busing company the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this complete the following.

• Calculate the interval value to use using the 68-95-99.7 rule
• Calculate the density curve
• Graph the normal curve
• Evaluate the probability of a bus having less then 65 stops
• Evaluate the probability of a bus having more than 93 stops

Calculate the Interval Value

Our first step is to calculate the interval value. This is the range in which 99.7% of the values falls within. Doing this requires knowing the mean and the standard deviation and subtracting/adding the standard deviation as it is multiplied by three from the mean. Below is the code for this.

busStopMean<-81
busStopSD<-7.9
busStopMean+3*busStopSD
## [1] 104.7
busStopMean-3*busStopSD
## [1] 57.3

The values above mean that we can set are interval between 55 and 110 with 100 buses in the data. Below is the code to set the interval.

interval<-seq(55,110, length=100) #length here represents
100 fictitious buses

Density Curve

The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.

densityCurve<-dnorm(interval,mean=81,sd=7.9)

We will now plot the normal curve of our data using ggplot. Before we need to put our “interval” and “densityCurve” variables in a dataframe. We will call the dataframe “normal” and then we will create the plot. Below is the code.

library(ggplot2)
normal<-data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve))+geom_line()+ggtitle("Number of Stops for Buses")

Probability Calculation

We now want to determine what is the provability of a bus having less than 65 stops. To do this we use the “pnorm” function in R and include the value 65, along with the mean, standard deviation, and tell R we want the lower tail only. Below is the code for completing this.

pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
## [1] 0.02141744

As you can see, at 2% it would be unusually to. We can also plot this using ggplot. First, we need to set a different density curve using the “pnorm” function. Combine this with our “interval” variable in a dataframe and then use this information to make a plot in ggplot2. Below is the code.

CumulativeProb<-pnorm(interval, mean=81,sd=7.9,lower.tail = TRUE)
pnormal<-data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb))+geom_line()+ggtitle("Cumulative Density of Stops for Buses")

Second Probability Problem

We will now calculate the probability of a bus have 93 or more stops. To make it more interesting we will create a plot that shades the area under the curve for 93 or more stops. The code is a little to complex to explain so just enjoy the visual.

pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
## [1] 0.06438284
x<-interval
ytop<-dnorm(93,81,7.9)
MyDF<-data.frame(x=x,y=densityCurve)
p<-ggplot(MyDF,aes(x,y))+geom_line()+scale_x_continuous(limits = c(50, 110))
+ggtitle("Probabilty of 93 Stops or More is 6.4%")
shade <- rbind(c(93,0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "X"], 0))

p + geom_segment(aes(x=93,y=0,xend=93,yend=ytop)) +
geom_polygon(data = shade, aes(x, y))

Conclusion

A lot of work was done but all in a practical manner. Looking at realistic problem. We were able to calculate several different probabilities and graph them accordingly.

# A History of Structural Equation Modeling

Structural Equation Modeling (SEM) is complex form of multiple regression that is commonly used in social science research. In many ways, SEM is an amalgamation of factor analysis and path analysis as we shall see. The history of this data analysis approach can be traced all the way back to the beginning of the 20th century.

This post will provide a brief overview of SEM. Specifically, we will look at the role of factory and path analysis in the development of SEM.

The Beginning with Factor and Path Analysis

The foundation of SEM was laid with the development of Spearman’s work with intelligence in the early 20th century. Spearman was trying to trace the various dimensions of intelligence back to a single factor. In the 1930’s Thurstone developed multi-factor analysis as he saw intelligence not as a a single factor as Spearman but rather as several factors. Thurstone also bestowed the gift of factor rotation on the statistical community.

Around the same time (1920’s-1930’s), Wright was developing path analysis. Path analysis relies on manifest variables with the ability to model indirect relationships among variables. This is something that standard regression normally does not do.

In economics, a econometrics was using many of the same ideas as Wright. It was in  the early 1950’s that econometricians saw what Wright was doing in his discipline of biometrics.

SEM is Born

In the 1970’s, Joreskog combined the measurement powers of factor analysis with the regression modeling power of path analysis. The factor analysis capabilities of SEM allow it to assess the accuracy of the measurement of the model. The path analysis capabilities of SEM allow it to model direct and indirect relationships among latent variables.

From there, there was an explosion in ways to assess models as well as best practice suggestions. In addition, there are many different software available for conducting SEM analysis. Examples include the LISREL which was the first software available, AMOS which allows the use of a graphical interface.

One software worthy of mentioning is Lavaan. Lavaan is a r package that performs SEM. The primary benefit of Lavaan is that it is available for free. Other software can be exceedingly expensive but Lavaan provides the same features for a price that cannot be beat.

Conclusion

SEM is by far not new to the statistical community. With a history that is almost 100 years old, SEM has been in many ways with the statistical community since the birth of modern statistics.

# Ensemble Learning for Machine Models

One way to improve a machine learning model is to not make just one model. Instead, you can make several models  that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is know as ensemble learning. This post will provide insights into ensemble learning as they are used in developing machine models.

The Major Challenge

The biggest challenges in creating an ensemble of models is deciding what models to develop and how the various models are combined to make predictions. To deal with these challenges involves the use of training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, each model may receive a subset of the data or limit how many features each model can use. However, if several different algorithms are used the allocation function may pass all the data to each model with making any changes.

After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function.

The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used which means that if 5 models where develop and 3 vote “yes” than the example is classified as a yes. Another option is to weight the models so that some have more influence then others in the final predictions.

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. With the combine strengths of many different models and or algorithms it is difficult to go wrong

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.

Conclusion

Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.

# Developing a Customize Tuning Process in R

In this post, we will learn how to develop customize criteria for tuning a machine learning model using the “caret” package. There are two things that need to be done in order to complete assess a model using customized features. These two steps are…

• Determine the model evaluation criteria
• Create a grid of parameters to optimize

The model we are going to tune is the decision tree model made in a previous post with the C5.0 algorithm. Below is code for loading some prior information.

library(caret); library(Ecdat)
data(Wages1)

DETERMINE the MODEL EVALUATION CRITERIA

We are going to begin by using the “trainControl” function to indicate to R what re-sampling method we want to use, the number of folds in the sample, and the method for determining the best model. Remember, that there are many more options but these are the onese we will use. All this information must be saved into a variable using the “trainControl” function. Later, the information we place into the variable will be used when we rerun our model.

For our example, we are going to code the following information into a variable we will call “chck” for re sampling we will use k-fold cross-validation. The number of folds will be set to 10. The criteria for selecting the best model will be the through the use of the “oneSE” method. The “oneSE” method selects the simplest model within one standard error of the best performance. Below is the code for our variable “chck”

chck<-trainControl(method = "cv",number = 10, selectionFunction = "oneSE")

For now this information is stored to be used later

CREATE GRID OF PARAMETERS TO OPTIMIZE

We now need to create a grid of parameters. The grid is essential the characteristics of each model. For the C5.0 model we need to optimize the model, number of trials, and if winnowing was used. Therefore we will do the following.

• For model, we want decision trees only
• Trials will go from 1-35 by increments of 5
• For winnowing, we do not want any winnowing to take place.

In all we are developing 8 models. We know this based on the trial parameter which is set to 1, 5, 10, 15, 20, 25, 30, 35. To make the grid we use the “expand.grid” function. Below is the code.

modelGrid<-expand.grid(.model ="tree", .trials= c(1,5,10,15,20,25,30,35), .winnow="FALSE")

CREATE THE MODEL

We are now ready to generate our model. We will use the kappa statistic to evaluate each model’s performance

set.seed(1)
customModel<- train(sex ~., data=Wages1, method="C5.0", metric="Kappa", trControl=chck, tuneGrid=modelGrid)
customModel
## C5.0
##
## 3294 samples
##    3 predictors
##    2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2966, 2965, 2964, 2964, 2965, 2964, ...
## Resampling results across tuning parameters:
##
##   trials  Accuracy   Kappa      Accuracy SD  Kappa SD
##    1      0.5922991  0.1792161  0.03328514   0.06411924
##    5      0.6147547  0.2255819  0.03394219   0.06703475
##   10      0.6077693  0.2129932  0.03113617   0.06103682
##   15      0.6077693  0.2129932  0.03113617   0.06103682
##   20      0.6077693  0.2129932  0.03113617   0.06103682
##   25      0.6077693  0.2129932  0.03113617   0.06103682
##   30      0.6077693  0.2129932  0.03113617   0.06103682
##   35      0.6077693  0.2129932  0.03113617   0.06103682
##
## Tuning parameter 'model' was held constant at a value of tree
##
## Tuning parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 5, model = tree
##  and winnow = FALSE.

The actually output is similar to the model that “caret” can automatically create. The difference here is that the criteria was set by us rather than automatically. A close look reveals that all of the models perform poorly but that there is no change in performance after ten trials.

CONCLUSION

This post provided a brief explanation of developing a customize way of assessing a models performance. To complete this, you need configure your options as well as setup your grid in order to assess a model. Understanding the customization process for evaluating machine learning models is one of the strongest ways to develop supremely accurate models that retain generalizability.

# Improving the Performance of Machine Learning Model

For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do, how to specify the model, among other things is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance.

This post will serve as an introduction to understanding how to improving model performance. In particular, we will look at the following

• When it is necessary to improve performance
• Parameter tuning

When to Improve

It is not always necessary to try and improve the performance of a model. There are times when a model does well and you know this through the evaluating it. If the commonly used measures are adequate there is no cause for concern.

However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle/unclear relationships can make improvement necessary. Normally, real-world data has the problems so model improvement is usually necessary.

Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times and also brute trial-and-error effort as well. The point is that there is no singular agreed upon way to improve a model. It is better to focus on explaining how you did it if necessary.

Parameter Tuning

Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in r through the use of the “caret” package.

When trying to decide what to do when tuning parameters it is important to remember the following.

• What machine learning model and algorithm you are using for your data.
• Which parameters you can adjust.
• What criteria you are using to evaluate the model

Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning, those that classify, those that employ regression, and those that can do both. Understanding this helps you to make decision about what you are trying to do.

Next, you need to understand what exactly you or r are adjusting when analyzing the model. For example, for C5.0 decision trees “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved.

Lastly, it is important to know what criteria you are using to compare the various models. For classifying models you can look at the kappa and the various information derived from the confusion matrix. For regression based models you may look at the r-square, the RMSE (Root mean squared error), or the ROC curve.

Conclusion

As you can perhaps tell there is an incredible amount of choice and options in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making.

In a future post, we will look at an example of model improvement.

The receiver operating characteristic curve (ROC curve) is a tool used in statistical research to assess the trade-off of detecting true positives and true negatives. The origins of this tool goes all the way back to WWII when engineers were trying to distinguish between true and false alarms. Now this technique is used in machine learning

This post will explain the ROC curve and provide and example using R.

Below is a diagram of an ROC curve

On the X axis we have the false positive rate. As you move to the right the false positive rate increases which is bad. We want to be as close to zero as possible.

On the y axis we have the true positive rate. Unlike the x axis we want the true positive rate to be as close to 100 as possible. In general we want a low value on the x-axis and a high value on the y-axis.

In the diagram above, the diagonal line called “Test without diagnostic benefit” represents a model that cannot tell the difference between true and false positives. Therefore, it is not useful for our purpose.

The L-shaped curve call “Good diagnostic test” is an example of an excellent model. This is because  all the true positives are detected .

Lastly, the curved-line called “Medium diagonistic test” represents an actually model. This model is a balance between the perfect L-shaped model and the useless straight-line model. The curved-line model is able to moderately distinguish between false and true positives.

Area Under the ROC Curve

The area under an ROC curve is literally called the “Area Under the Curve” (AUC). This area is calculated with a standardized value ranging from 0 – 1. The closer to 1 the better the model

We will now look at an analysis of a model using the ROC curve and AUC. This is based on the results of a post using the KNN algorithm for nearest neighbor classification. Below is the code

predCollege <- ifelse(College_test_pred=="Yes", 1, 0)
realCollege <- ifelse(College_test_labels=="Yes", 1, 0)
pr <- prediction(predCollege, realCollege)
collegeResults <- performance(pr, "tpr", "fpr")
plot(collegeResults, main="ROC Curve for KNN Model", col="dark green", lwd=5)
abline(a=0,b=1, lwd=1, lty=2)
aucOfModel<-performance(pr, measure="auc")
unlist(aucOfModel@y.values)
1. The first to variables (predCollege & realCollege) is just for converting the values of the prediction of the model and the actual results to numeric variables
2. The “pr” variable is for storing the actual values to be used for the ROC curve. The “prediction” function comes from the “ROCR” package
3. With the information information of the “pr” variable we can now analyze the true and false positives, which are stored in the “collegeResults” variable. The “performance” function also comes from the “ROCR” package.
4. The next two lines of code are for plot the ROC curve. You can see the results below

6. The curve looks pretty good. To confirm this we use the last two lines of code to calculate the actually AUC. The actual AUC is 0.88 which is excellent. In other words, the model developed does an excellent job of discerning between true and false positives.

Conclusion

The ROC curve provides one of many ways in which to assess the appropriateness of a model. As such, it is yet another tool available for a person who is trying to test models.

# Using Confusion Matrices to Evaluate Performance

The data within a confusion matrix can be used to calculate several different statistics that can indicate the usefulness of a statistical model in machine learning. In this post, we will look at several commonly used measures, specifically…

• accuracy
• error
• sensitivity
• specificity
• precision
• recall
• f-measure

Accuracy

Accuracy is probably the easiest statistic to understand. Accuracy is the total number of items correctly classified divided by the total number of items below is the equation

accuracy =   TP + TN
TP + TN + FP  + FN

TP =  true positive, TN =  true negative, FP = false positive, FN = false negative

Accuracy can range in value from 0-1 with one representing 100% accuracy. Normally, you don’t want perfect accuracy as this is an indication of overfitting and your model will probably not do well with other data.

Error

Error is the opposite of accuracy and represent the percentage of examples that are incorrectly classified it’s equation is as follows.

error =   FP + FN
TP + TN + FP  + FN

The lower the error the better in general. However, if error is 0 it indicates overfitting. Keep in mind that error is the inverse of accuracy. As one increases the other decreases.

Sensitivity

Sensitivity is the proportion of true positives that were correctly classified.The formula is as follows

sensitivity =       TP
TP + FN

This may sound confusing but high sensitivity is useful for assessing a negative result. In other words, if I am testing people for a disease and my model has a high sensitivity. This means that the model is useful telling me a person does not have a disease.

Specificity

Specificity measures the proportion of negative examples that were correctly classified. The formula is below

specificity =       TN
TN + FP

Returning to the disease example, a high specificity is a good measure for determining if someone has a disease if they test positive for it. Remember that no test is foolproof and there are always false positives and negatives happening. The role of the researcher is to maximize the sensitivity or specificity based on the purpose of the model.

Precision

Precision is the proportion of examples that are really positive. The formula is as follows

precision =       TP
TP + FP

The more precise a model is the more trustworthy it is. In other words, high precision indicates that the results are relevant.

Recall

Recall is a measure of the completeness of the results of a model. It is calculated as follows

recall =       TP
TP + FN

This formula is the same as the formula for sensitivity. The difference is in the interpretation. High recall means that the results have a breadth to them such as in search engine results.

F-Measure

The f-measure uses recall and precision to develop another way to assess a model. The formula is below

sensitivity =      2 * TP
2 * TP + FP + FN

The f-measure can range from 0 – 1 and is useful for comparing several potential models using one convenient number.

Conclusion

This post provide a basic explanation of various statistics that can be used to determine the strength of a model. Through using a combination of statistics a researcher can develop insights into the strength of a model. The only mistake is relying exclusively on any single statistical measurement.

# Understanding Confusion Matrices

A confusion matrix is a table that is used to organize the predictions made during an analysis of data. Without making a joke confusion matrices can be confusing especially for those who are new to research.

In this post, we will look at how confusion matrices are setup as well as what the information in them means.
Actual Vs Predicted Class

The most common confusion matrix is a two class matrix. This matrix compares the actual class of an example with the predicted class of the model. Below is an example

Two Class Matrix
Predicted Class

A  B
Correctly classified as A Incorrectly classified as B
Incorrectly classified as A Correctly classified as B

Actual class is along the vertical side

Looking at the table there are four possible outcomes.

• Correctly classified as A-This means that the example was a part of the A category and the model predicted it as such
• Correctly classified as B-This means that the example was a part of the B category and the model predicted it as such
• Incorrectly classified as A-This means that the example was a part of the B category but the model predicted it to be a part of the A group
• Incorrectly classified as B-This means that the example was a part of the A category but the model predicted it to be a part of the B group

These four types of classifications have four different names which are true positive, true negative, false positive, and false negative. We will look at another example to understand these four terms.

Two Class Matrix
Predicted Lazy Students

Lazy  Not Lazy
1. Correctly classified as lazy 2. Incorrectly classified as not Lazy
3. Incorrectly classified as Lazy 4. Correctly classified as not lazy

Actual class is along the vertical side

In the example above, we want to predict which students are lazy. Group one, is the group in which students who are lazy are correctly classified as lazy. This is called true positive.

Group 2 are those who are lazy but are predicted as not being lazy. This is known as a false negative also known as a type II error in statistics. This is a problem because if the student is misclassified they may not get the support they need.

Group three is students who are not lazy but are classified as such. This is known as a false positive or type I error. In this example, being labeled lazy is a major headache for the students but not as dangerous perhaps as a false negative.

Lastly, group four are students who are not lazy and are correctly classified as such. This is known as a true negative.

Conclusion

The primary purpose of a confusion matrix is to display this information visually. In future post we will see that there is even more information found in a confusion matrix than what was cover briefly here.