Tag Archives: statistics

Statistical Models

In research, the term ‘model’ is employed frequently. Normally, a model is some sort of a description or explanation of a real world phenomenon. In data science, we employ the use of statistical models. Statistical models used numbers to help us to understand something that happens in the real world.

A statistical model used numbers to help us to understand something that happens in the real world.

In the real world, quantitative research relies on numeric observations of some phenomenon, behavior, and or perception. For example, let’s say we have the quiz results of 20 students as show below.

32 60 95 15 43 22 45 14 48 98 79 97 49 63 50 11 26 52 39 97

This is great information but what if we want to go beyond how these students did and try to understand how students in the population would do on the quizzes. Doing this requires the development of a model.

A model is simply trying to describe how the data is generated in terms of whatever we are mesuring while allowing for randomness. It helps in summarizing a large collection of numbers while also providing structure to it.

One commonly used model is the normal model. This model is the famous bell-curve model that most of us are familiar with. To calculate this model we need to calculate the mean and standard deviation to get a plot similar to the one below

 1

 Now, this model is not completely perfect. For example, a student cannot normally get a score above 100 or below 0 on a quiz. Despite this weakness, normal distribution gives is an indication of what the population looks like.

With this, we can also calculate the probability of getting a specific score on the quiz. For example, if we want to calculate the probability that a student would get a score of 70  or higher we can do a simple test and find that it is about 26%.

Other Options

The normal model is not the only model. There are many different models to match different types of data. There are the gamma, student t, binomial, chi-square, etc. To determine which model to use requires examining the distribution of your data and match it to an appropriate model.

Another option is to transform the data. This is normally done to make data conform to a normal distribution. Which transformation to employ depends on how the data looks when it is plotted.

Conclusion

Modeling helps to bring order to data that has been collected for analysis. By using a model such as the normal distribution, you can begin to make inferences about what the population is like. This allows you to take a handful of data to better understand the world.

Advertisements

Regularized Linear Regression

Traditional linear regression has been a tried and true  model for making predictions for decades. However, with the growth of Big Data and datasets with 100’s of variables  problems have begun to arise. For example, using stepwise or best subset method with regression could take hours if not days to converge in even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

  • Definition of regularized regression
  • Ridge regression
  • Lasso regression
  • Elastic net regression

Regularization

Regularization involves the use of a shrinkage penalty in order to reduce the residual sum of squares (RSS). This is done through selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said

RSS + λ(normalized coefficients)

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of fitting k-1 models when k is the number of variables available (for example, 50 variables would lead 49 models!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollenarity in a model. As such, the benefits of using regularization make it clear that this should be considering when working with larger data sets.

Ridge Regression

Ridge regression involves the normalization of the squared weights or as shown in the equation below

RSS + λ(normalized coefficients^2)

This is also refered to as the L2-norm. As lambda increase in value the coefficients in the model are shrunk towards 0 but never reach 0. This is how the error is shrunk. The higher the lambda the lower the value of the coefficients as they are reduce more and more thus reducing the RSS.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.

Lasso

Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm which is the sum of the absolute value of the coefficients or as shown in the equation below

RSS + λ(Σ|normalized coefficients|)

This shrinkage penalty will reduce a coefficient to 0 which is another way of saying that variables will be removed from the model. One problem is that highly correlated variables that need to be  in your model my be removed when Lasso shrinks coefficients. This is one reason why ridge regression is still used.

Elastic Net

Elastic net is the best of ridge and Lasso without the weaknesses of either. It combines extracts variables like Lasso and Ridge does not while also group variables like Ridge does but Lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression and if alpha is set to 1 it is the same as lasso regression. If you are able to appreciate it below is the formula used for elastic net regression

(RSS + l[(1 – alpha)(S|normalized coefficients|2)/2 + alpha(S|normalized coefficients|)])/N)

As such when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model.

Conclusion

Regularized regressio was developed as an answer to the growth in the size and number of variables in a data set today. Ridge, lasso an elastic net all provide solutions to converging over large datasets and selecting features.

Probability,Odds, and Odds Ratio

In logistic regression, there are three terms that are used frequently but can be confusing if they are not thoroughly explained. These three terms are probability, odds, and odds ratio. In this post, we will look at these three terms and provide an explanation of them.

Probability

Probability is probably (no pun intended) the easiest of these three terms to understand. Probability is simply the likelihood that a certain even will happen.  To calculate the probability in the traditional sense you need to know the number of events and outcomes to find the probability.

Bayesian probability uses prior probabilities to develop a posterior probability based on new evidence. For example, at one point during Super Bowl LI the Atlanta Falcons had a 99.7% chance of winning. This was base don such factors as the number points they were ahead and the time remaining.  As these changed, so did the probability of them winning. yet the Patriots still found a way to win with less then a 1% chance

Bayesian probability was also used for predicting who would win the 2016 US presidential race. It is important to remember that probability is an expression of confidence and not a guarantee as we saw in both examples.

Odds

Odds are the expression of relative probabilities. Odds are calculated using the following equation

probability of the event ⁄ 1 – probability of the event

For example, at one point during Super Bowl LI the odds of the Atlanta Falcons winning were as follows

0.997 ⁄ 1 – 0.997 = 332

This can be interpreted as the odds being 332 to 1! This means that Atlanta was 332 times more likely to win the Super Bowl then loss the Super Bowl.

Odds are commonly used in gambling and this is probably (again no pun intended) where most of us have heard the term before. The odds is just an extension of probabilities and the are most commonly expressed as a fraction such as one in four, etc.

Odds Ratio

A ratio is the comparison of of two numbers and indicates how many times one number is contained or contains another number. For example, a ration of boys to girls is 5 to 1 it means that there are five boys for every one girl.

By  extension odds ratio is the comparison of two different odds. For example, if the odds of Team A making the playoffs is 45% and the odds of Team B making the playoffs is 35% the odds ratio is calculated as follows.

0.45 ⁄ 0.35 = 1.28

Team A is 1.28 more likely to make the playoffs then Team B.

The value of the odds and the odds ratio can sometimes be the same.  Below is the odds ratio of the Atlanta Falcons winning and the New Patriots winning Super Bowl LI

0.997⁄ 0.003 = 332

As such there is little difference between odds and odds ratio except that odds ratio is the ratio of two odds ratio. As you can tell, there is a lot of confusion about this for the average person. However, understanding these terms is critical to the application of logistic regression.

Generalized Models in R

Generalized linear models are another way to approach linear regression. The advantage of of GLM is that allows the error to follow many different distributions rather than only the normal distribution which is an assumption of traditional linear regression.

Often GLM is used for response or dependent variables that are binary or represent count data. THis post will provide a brief explanation of GLM as well as provide an example.

Key Information

There are three important components to a GLM and they are

  • Error structure
  • Linear predictor
  • Link function

The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling such as binomial, gaussian, poission, etc. Each distribution comes with certain assumptions that govern their use.

The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions and the best link function is the one that reduces the residual deviances the most.

In our example, we will try to predict if a house will have air conditioning based on the interactioon between number of bedrooms and bathrooms, number of stories, and the price of the house. To do this, we will use the “Housing” dataset from the “Ecdat” package. Below is some initial code to get started.

library(Ecdat)
data("Housing")

The dependent variable “airco” in the “Housing” dataset is binary. This calls for us to use a GLM. To do this we will use the “glm” function in R. Furthermore, in our example, we want to determine if there is an interaction between number of bedrooms and bathrooms. Interaction means that the two independent variables (bathrooms and bedrooms) influence on the dependent variable (aircon) is not additive, which means that the combined effect of the independnet variables is different than if you just added them together. Below is the code for the model followed by a summary of the results

model<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family=binomial)
summary(model)
## 
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + 
##     Housing$stories + Housing$price, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7069  -0.7540  -0.5321   0.8073   2.4217  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.441e+00  1.391e+00  -4.632 3.63e-06
## Housing$bedrooms                  8.041e-01  4.353e-01   1.847   0.0647
## Housing$bathrms                   1.753e+00  1.040e+00   1.685   0.0919
## Housing$stories                   3.209e-01  1.344e-01   2.388   0.0170
## Housing$price                     4.268e-05  5.567e-06   7.667 1.76e-14
## Housing$bedrooms:Housing$bathrms -6.585e-01  3.031e-01  -2.173   0.0298
##                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.75  on 540  degrees of freedom
## AIC: 561.75
## 
## Number of Fisher Scoring iterations: 4

To check how good are model is we need to check for overdispersion as well as compared this model to other potential models. Overdispersion is a measure to determine if there is too much variablity in the model. It is calcualted by dividing the residual deviance by the degrees of freedom. Below is the solution for this

549.75/540
## [1] 1.018056

Our answer is 1.01, which is pretty good because the cutoff point is 1, so we are really close.

Now we will make several models and we will compare the results of them

Model 2

#add recroom and garagepl
model2<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family=binomial)
summary(model2)
## 
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + 
##     Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, 
##     family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6733  -0.7522  -0.5287   0.8035   2.4239  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.369e+00  1.401e+00  -4.545 5.51e-06
## Housing$bedrooms                  7.830e-01  4.391e-01   1.783   0.0745
## Housing$bathrms                   1.702e+00  1.047e+00   1.626   0.1039
## Housing$stories                   3.286e-01  1.378e-01   2.384   0.0171
## Housing$price                     4.204e-05  6.015e-06   6.989 2.77e-12
## Housing$recroomyes                1.229e-01  2.683e-01   0.458   0.6470
## Housing$garagepl                  2.555e-03  1.308e-01   0.020   0.9844
## Housing$bedrooms:Housing$bathrms -6.430e-01  3.054e-01  -2.106   0.0352
##                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.54  on 538  degrees of freedom
## AIC: 565.54
## 
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.54/538
## [1] 1.02145

Model 3

model3<-glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family=binomial)
summary(model3)
## 
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms + 
##     Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + 
##     Housing$garagepl, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6629  -0.7436  -0.5295   0.8056   2.4477  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.424e+00  1.409e+00  -4.559 5.14e-06
## Housing$bedrooms                  8.131e-01  4.462e-01   1.822   0.0684
## Housing$bathrms                   1.764e+00  1.061e+00   1.662   0.0965
## Housing$stories                   3.083e-01  1.481e-01   2.082   0.0374
## Housing$price                     4.241e-05  6.106e-06   6.945 3.78e-12
## Housing$recroomyes                1.592e-01  2.860e-01   0.557   0.5778
## Housing$fullbaseyes              -9.523e-02  2.545e-01  -0.374   0.7083
## Housing$garagepl                 -1.394e-03  1.313e-01  -0.011   0.9915
## Housing$bedrooms:Housing$bathrms -6.611e-01  3.095e-01  -2.136   0.0327
##                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.40  on 537  degrees of freedom
## AIC: 567.4
## 
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.4/537
## [1] 1.023091

Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test.

anova(model, model2, model3, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + 
##     Housing$price
## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + 
##     Housing$price + Housing$recroom + Housing$garagepl
## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + 
##     Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       540     549.75                     
## 2       538     549.54  2  0.20917   0.9007
## 3       537     549.40  1  0.14064   0.7076

The results of the anova indicate that the models are all essentially the same as there is no statistical difference. The only criteria on which to select a model is the measure of overdispersion. The first model has the lowest rate of overdispersion and so is the best when using this criteria. Therefore, determining if a hous has air conditioning depends on examining number of bedrooms and bathrooms simultenously as well as the number of stories and the price of the house.

Conclusion

The post explained how to use and interpret GLM in R. GLM can be used primarilyy for fitting data to disrtibutions that are not normal.

Proportion Test in R

Proportions are are a fraction or “portion” of a total amount. For example, if there are ten men and ten women in a room the proportion of men in the room is 50% (5 / 10). There are times when doing an analysis that you want to evaluate proportions in our data rather than individual measurements of mean, correlation, standard deviation etc.

In this post we will learn how to do a test of proportions using R. We will use the dataset “Default” which is found in the “ISLR” pacakage. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started.

library(ISLR)
data("Default")

We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function.

table(Default$student)
## 
##   No  Yes 
## 7056 2944

We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample. If we sum the results from the table below is the code.

sum(table(Default$student))
## [1] 10000

There are 10000 people in the sample. To determine the proprtion of students we take the number 2944 / 10000 which equals 29.44 or 29.44%. Below is the code to calculate this

table(Default$student) / sum(table(Default$student))
## 
##     No    Yes 
## 0.7056 0.2944

The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so it is better to state specificallt what are hypotheses are. NULL = The value of 29.44% of the sample being students is the same as 50% found in the population ALTERNATIVE = The value of 29.44% of the sample being students is NOT the same as 50% found in the population.

Below is the code to complete the z-test.

prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  2944 out of 10000, null probability 0.5
## X-squared = 1690.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2855473 0.3034106
## sample estimates:
##      p 
## 0.2944

Here is what the code means. 1. prop.test is the function used 2. The first value of 2944 is the total number of students in the sample 3. n = is the sample size 4. p= 0.5 is the theoretical proportion 5. alternative =“two.sided” means we want a two-tail test 6. correct = FALSE means we do not want a correction applied to the z-test. This is useful for small sample sizes but not for our sample of 10000

The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proprtion of students in our sample is different from a theortical proprition of 50% in the population.

Below is the same analysis using the binomial exact test.

binom.test(2944, n = 10000, p = 0.5)
## 
##  Exact binomial test
## 
## data:  2944 and 10000
## number of successes = 2944, number of trials = 10000, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2854779 0.3034419
## sample estimates:
## probability of success 
##                 0.2944

The results are the same. Whether to use the “prop.test”” or “binom.test” is a major argument among statisticians. The purpose here was to provide an example of the use of both

Probability Distribution and Graphs in R

In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.

At a busing company the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this complete the following.

  • Calculate the interval value to use using the 68-95-99.7 rule
  • Calculate the density curve
  • Graph the normal curve
  • Evaluate the probability of a bus having less then 65 stops
  • Evaluate the probability of a bus having more than 93 stops

Calculate the Interval Value

Our first step is to calculate the interval value. This is the range in which 99.7% of the values falls within. Doing this requires knowing the mean and the standard deviation and subtracting/adding the standard deviation as it is multiplied by three from the mean. Below is the code for this.

busStopMean<-81
busStopSD<-7.9
busStopMean+3*busStopSD
## [1] 104.7
busStopMean-3*busStopSD
## [1] 57.3

The values above mean that we can set are interval between 55 and 110 with 100 buses in the data. Below is the code to set the interval.

interval<-seq(55,110, length=100) #length here represents 
100 fictitious buses

Density Curve

The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.

densityCurve<-dnorm(interval,mean=81,sd=7.9)

We will now plot the normal curve of our data using ggplot. Before we need to put our “interval” and “densityCurve” variables in a dataframe. We will call the dataframe “normal” and then we will create the plot. Below is the code.

library(ggplot2)
normal<-data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve))+geom_line()+ggtitle("Number of Stops for Buses")

282deee2-ff95-488d-ad97-471b74fe4cb8

Probability Calculation

We now want to determine what is the provability of a bus having less than 65 stops. To do this we use the “pnorm” function in R and include the value 65, along with the mean, standard deviation, and tell R we want the lower tail only. Below is the code for completing this.

pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
## [1] 0.02141744

As you can see, at 2% it would be unusually to. We can also plot this using ggplot. First, we need to set a different density curve using the “pnorm” function. Combine this with our “interval” variable in a dataframe and then use this information to make a plot in ggplot2. Below is the code.

CumulativeProb<-pnorm(interval, mean=81,sd=7.9,lower.tail = TRUE)
pnormal<-data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb))+geom_line()+ggtitle("Cumulative Density of Stops for Buses")

9667dd01-f7d3-4025-8995-b6441a3735d0.png

Second Probability Problem

We will now calculate the probability of a bus have 93 or more stops. To make it more interesting we will create a plot that shades the area under the curve for 93 or more stops. The code is a little to complex to explain so just enjoy the visual.

pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
## [1] 0.06438284
x<-interval  
ytop<-dnorm(93,81,7.9)
MyDF<-data.frame(x=x,y=densityCurve)
p<-ggplot(MyDF,aes(x,y))+geom_line()+scale_x_continuous(limits = c(50, 110))
+ggtitle("Probabilty of 93 Stops or More is 6.4%")
shade <- rbind(c(93,0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "X"], 0))

p + geom_segment(aes(x=93,y=0,xend=93,yend=ytop)) +
        geom_polygon(data = shade, aes(x, y))

b42a7c19-1992-4df1-95cc-40ea097058de

Conclusion

A lot of work was done but all in a practical manner. Looking at realistic problem. We were able to calculate several different probabilities and graph them accordingly.

A History of Structural Equation Modeling

Structural Equation Modeling (SEM) is a complex form of multiple regression that is commonly used in social science research. In many ways, SEM is an amalgamation of factor analysis and path analysis as we shall see. The history of this data analysis approach can be traced all the way back to the beginning of the 20th century.

This post will provide a brief overview of SEM. Specifically, we will look at the role of factory and path analysis in the development of SEM.

The Beginning with Factor and Path Analysis 

The foundation of SEM was laid with the development of Spearman’s work with intelligence in the early 20th century. Spearman was trying to trace the various dimensions of intelligence back to a single factor. In the 1930’s Thurstone developed multi-factor analysis as he saw intelligence, not as a single factor as Spearman but rather as several factors. Thurstone also bestowed the gift of factor rotation on the statistical community.

Around the same time (1920’s-1930’s), Wright was developing path analysis. Path analysis relies on manifest variables with the ability to model indirect relationships among variables. This is something that standard regression normally does not do.

In economics, an econometrics was using many of the same ideas as Wright. It was in the early 1950’s that econometricians saw what Wright was doing in his discipline of biometrics.

SEM is Born

In the 1970’s, Joreskog combined the measurement powers of factor analysis with the regression modeling power of path analysis. The factor analysis capabilities of SEM allow it to assess the accuracy of the measurement of the model. The path analysis capabilities of SEM allow it to model direct and indirect relationships among latent variables.

From there, there was an explosion in ways to assess models as well as best practice suggestions. In addition, there are many different software available for conducting SEM analysis. Examples include the LISREL which was the first software available, AMOS which allows the use of a graphical interface.

One software worthy of mentioning is Lavaan. Lavaan is a r package that performs SEM. The primary benefit of Lavaan is that it is available for free. Other software can be exceedingly expensive but Lavaan provides the same features for a price that cannot be beaten.

Conclusion

SEM is by far not new to the statistical community. With a history that is almost 100 years old, SEM has been in many ways with the statistical community since the birth of modern statistics.

Using Confusion Matrices to Evaluate Performance

The data within a confusion matrix can be used to calculate several different statistics that can indicate the usefulness of a statistical model in machine learning. In this post, we will look at several commonly used measures, specifically…

  • accuracy
  • error
  • sensitivity
  • specificity
  • precision
  • recall
  • f-measure

Accuracy

Accuracy is probably the easiest statistic to understand. Accuracy is the total number of items correctly classified divided by the total number of items below is the equation

accuracy =   TP + TN
                          TP + TN + FP  + FN

TP =  true positive, TN =  true negative, FP = false positive, FN = false negative

Accuracy can range in value from 0-1 with one representing 100% accuracy. Normally, you don’t want perfect accuracy as this is an indication of overfitting and your model will probably not do well with other data.

Error

Error is the opposite of accuracy and represent the percentage of examples that are incorrectly classified it’s equation is as follows.

error =   FP + FN
                          TP + TN + FP  + FN

The lower the error the better in general. However, if error is 0 it indicates overfitting. Keep in mind that error is the inverse of accuracy. As one increases the other decreases.

Sensitivity 

Sensitivity is the proportion of true positives that were correctly classified.The formula is as follows

sensitivity =       TP
                       TP + FN

This may sound confusing but high sensitivity is useful for assessing a negative result. In other words, if I am testing people for a disease and my model has a high sensitivity. This means that the model is useful telling me a person does not have a disease.

Specificity

Specificity measures the proportion of negative examples that were correctly classified. The formula is below

specificity =       TN
                       TN + FP

Returning to the disease example, a high specificity is a good measure for determining if someone has a disease if they test positive for it. Remember that no test is foolproof and there are always false positives and negatives happening. The role of the researcher is to maximize the sensitivity or specificity based on the purpose of the model.

Precision

Precision is the proportion of examples that are really positive. The formula is as follows

precision =       TP
                       TP + FP

 The more precise a model is the more trustworthy it is. In other words, high precision indicates that the results are relevant.

Recall

Recall is a measure of the completeness of the results of a model. It is calculated as follows

recall =       TP
                       TP + FN

This formula is the same as the formula for sensitivity. The difference is in the interpretation. High recall means that the results have a breadth to them such as in search engine results.

F-Measure

The f-measure uses recall and precision to develop another way to assess a model. The formula is below

sensitivity =      2 * TP
                       2 * TP + FP + FN

The f-measure can range from 0 – 1 and is useful for comparing several potential models using one convenient number.

Conclusion

This post provide a basic explanation of various statistics that can be used to determine the strength of a model. Through using a combination of statistics a researcher can develop insights into the strength of a model. The only mistake is relying exclusively on any single statistical measurement.

Understanding Confusion Matrices

A confusion matrix is a table that is used to organize the predictions made during an analysis of data. Without making a joke confusion matrices can be confusing especially for those who are new to research.

In this post, we will look at how confusion matrices are setup as well as what the information in them means.
Actual Vs Predicted Class

The most common confusion matrix is a two class matrix. This matrix compares the actual class of an example with the predicted class of the model. Below is an example

Two Class Matrix
Predicted Class

A  B
Correctly classified as A Incorrectly classified as B
Incorrectly classified as A Correctly classified as B

 

 

 

 

 Actual class is along the vertical side

Looking at the table there are four possible outcomes.

  • Correctly classified as A-This means that the example was a part of the A category and the model predicted it as such
  • Correctly classified as B-This means that the example was a part of the B category and the model predicted it as such
  • Incorrectly classified as A-This means that the example was a part of the B category but the model predicted it to be a part of the A group
  • Incorrectly classified as B-This means that the example was a part of the A category but the model predicted it to be a part of the B group

These four types of classifications have four different names which are true positive, true negative, false positive, and false negative. We will look at another example to understand these four terms.

Two Class Matrix
Predicted Lazy Students

Lazy  Not Lazy
1. Correctly classified as lazy 2. Incorrectly classified as not Lazy
3. Incorrectly classified as Lazy 4. Correctly classified as not lazy

Actual class is along the vertical side

In the example above, we want to predict which students are lazy. Group one, is the group in which students who are lazy are correctly classified as lazy. This is called true positive.

Group 2 are those who are lazy but are predicted as not being lazy. This is known as a false negative also known as a type II error in statistics. This is a problem because if the student is misclassified they may not get the support they need.

Group three is students who are not lazy but are classified as such. This is known as a false positive or type I error. In this example, being labeled lazy is a major headache for the students but not as dangerous perhaps as a false negative.

Lastly, group four are students who are not lazy and are correctly classified as such. This is known as a true negative.

Conclusion

The primary purpose of a confusion matrix is to display this information visually. In future post we will see that there is even more information found in a confusion matrix than what was cover briefly here.

Developing an Artificial Neural Network in R

In this post, we are going make an artificial neural network (ANN) by analyzing some data about computers. Specifically, we are going to make an ANN to predict the price of computers.

We will be using the “ecdat” package and the data set “Computers” from this pacakge. In addition, we are going to use the “neuralnet” pacakge to conduct the ANN analysis. Below is the code for the packages and dataset we are using

library(Ecdat);library(neuralnet)
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
## 
## Loading required package: grid
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:Ecdat':
## 
##     SP500
#load data set
data("Computers")

Explore the Data

The first step is always data exploration. We will first look at nature of the data using the “str” function and then used the “summary” function. Below is the code.

str(Computers)
## 'data.frame':    6259 obs. of  10 variables:
##  $ price  : num  1499 1795 1595 1849 3295 ...
##  $ speed  : num  25 33 25 25 33 66 25 50 50 50 ...
##  $ hd     : num  80 85 170 170 340 340 170 85 210 210 ...
##  $ ram    : num  4 2 4 8 16 16 4 2 8 4 ...
##  $ screen : num  14 14 15 14 14 14 14 14 14 15 ...
##  $ cd     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ multi  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ads    : num  94 94 94 94 94 94 94 94 94 94 ...
##  $ trend  : num  1 1 1 1 1 1 1 1 1 1 ...
lapply(Computers, summary)
## $price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     949    1794    2144    2220    2595    5399 
## 
## $speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   50.00   52.01   66.00  100.00 
## 
## $hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    80.0   214.0   340.0   416.6   528.0  2100.0 
## 
## $ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   8.000   8.287   8.000  32.000 
## 
## $screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   14.00   14.00   14.61   15.00   17.00 
## 
## $cd
##   no  yes 
## 3351 2908 
## 
## $multi
##   no  yes 
## 5386  873 
## 
## $premium
##   no  yes 
##  612 5647 
## 
## $ads
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    39.0   162.5   246.0   221.3   275.0   339.0 
## 
## $trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   16.00   15.93   21.50   35.00

ANN is primarily for numerical data and not categorical or factors. As such, we will remove the factor variables cd,multi, and premium from further analysis. Below are is the code and histograms of the remaining variables.

hist(Computers$price);hist(Computers$speed);hist(Computers$hd);hist(Computers$ram)

dc97c105-c791-41ae-9e6a-e01061d82f10c50779eb-99c6-47e6-b44f-c037cd58ab2d.png34980edf-1c6f-4d5b-9214-c932860dff32.png725e049e-f58d-4d22-a5a6-70e9a8290e61.png

hist(Computers$screen);hist(Computers$ads);hist(Computers$trend)

200d47df-0ca3-4ee6-90a7-0ab6c8057278.png8a53ccf2-b6be-4243-90ec-4fcf6b6f4eb4.pngd3080f32-fefe-419a-af3c-8e2ade2fc11f.png

Clean the Data

Looking at the summary combined with the histograms indicates that we need to normalize are data as this is a requirement for ANN. We want all of the variables to have an equal influence initially. Below is the code for the function we wil use to normalize are variables.

normalize<-function(x) {
        return((x-min(x)) / (max(x)))
}

Explore the Data Again

We now need to make a new dataframe that has only the variables we are going to use for the analysis. Then we will use our “normalize” function to scale and center the variables appropriately. Lastly, we will re-explore the data to make sure it is ok using the “str” “summary” and “hist” functions. Below is the code.

#make dataframe without factor variables
Computers_no_factors<-data.frame(Computers$price,Computers$speed,Computers$hd,
                              Computers$ram,Computers$screen,Computers$ad,
                              Computers$trend)
#make a normalize dataframe of the data
Computers_norm<-as.data.frame(lapply(Computers_no_factors, normalize))
#reexamine the normalized data
lapply(Computers_norm, summary)
## $Computers.price
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1565  0.2213  0.2353  0.3049  0.8242 
## 
## $Computers.speed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0800  0.2500  0.2701  0.4100  0.7500 
## 
## $Computers.hd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.06381 0.12380 0.16030 0.21330 0.96190 
## 
## $Computers.ram
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0625  0.1875  0.1965  0.1875  0.9375 
## 
## $Computers.screen
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03581 0.05882 0.17650 
## 
## $Computers.ad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3643  0.6106  0.5378  0.6962  0.8850 
## 
## $Computers.trend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2571  0.4286  0.4265  0.5857  0.9714
hist(Computers_norm$Computers.price);hist(Computers_norm$Computers.speed);

ed7fe31f-5ac2-47a0-a178-bbf2565d1d5b.png8b4b0bac-0936-448a-8216-cc7b06768692.png

hist(Computers_norm$Computers.hd);hist(Computers_norm$Computers.ram);

38b5ea71-47f8-4912-8a8d-c7656544d4da.pngdbd9d052-ca58-4ba7-b774-46f5da6d4532

hist(Computers_norm$Computers.screen);hist(Computers_norm$Computers.ad);

bde52e4e-cacd-4798-9dc1-2b56c1ca9c1a.png27d73af5-56d6-4c0b-ab9d-368be59a5d78

hist(Computers_norm$Computers.trend)

cfe12fb9-9074-441c-9a6c-9b4136ad6a4e.png

Develop the Model

Everything looks good, so we will now split are data into a training and testing set and develop the ANN model that will predict computer price. Below is the code for this.

#split data into training and testing
Computer_train<-Computers_norm[1:4694,]
Computers_test<-Computers_norm[4695:6259,]
#model
computer_model<-neuralnet(Computers.price~Computers.speed+Computers.hd+
                                  Computers.ram+Computers.screen+
                                  Computers.ad+Computers.trend, Computer_train)

Our intial model is a simple feedforward networm with a single hidden node. You can visualize the model using the “plot” function as shown in the code below.

plot(computer_model)

Screenshot from 2016-07-11 14:35:35.png

We now need to evaluate our model’s performance. We will use the “compute” function to generate predictions. The predictions generated by the “compute” function will then be compared to the actual prices in the test data set using a pearson correlation. Since we are not classifying we can’t measure accuracy with a confusion matrix but rather with correlation. Below is the code followed by the results

evaluate_model<-compute(computer_model, Computers_test[2:7])
predicted_price<-evaluate_model$net.result
cor(predicted_price, Computers_test$Computers.price)
##              [,1]
## [1,] 0.8809571295

The correlation between the predict results and the actual results is 0.88 which is a strong relationship. THis indicates that our model does an excellent job in predicting the price of a computer based on ram, screen size, speed, hard drive size, advertising, and trends.

Develop Refined Model

Just for fun we are going to make a more complex model with three hidden nodes and see how the results change below is the code.

computer_model2<-neuralnet(Computers.price~Computers.speed+Computers.hd+
                                   Computers.ram+Computers.screen+ 
                                   Computers.ad+Computers.trend, 
                           Computer_train, hidden =3)
plot(computer_model2)
evaluate_model2<-compute(computer_model2, Computers_test[2:7])
predicted_price2<-evaluate_model2$net.result
cor(predicted_price2, Computers_test$Computers.price)

Screenshot from 2016-07-11 14:36:34.png

##              [,1]
## [1,] 0.8963994092

The correlation improves to 0.89. As such, the increased complexity did not yield much of an improvement in the overall correlation. Therefore, a single node model is more appropriate.

Conclusion

In this post we explored an application of artificial neural networks. This black box method is useful for making powerful predictions in highly complex data.

Conditional Probability & Bayes’ Theorem

In a prior post, we look at some of the basics of probability. The prior forms pf probability we looked at focused on independent events, which are events that are unrelated to each other.

In this post we will look at conditional probability which involves calculating probabilities for events that are dependent on each other. We will understand conditional probability through the use of Bayes’ theorem.

Conditional Probability 

If all events were independent in it would be impossible to predict anything because there would be no relationships between features. However, there are many examples of on event affecting another. For example, thunder and lighting can be used to predictors of rain and lack of study can be used as a predictor for test performance.

Thomas Bayes develop a theorem to understand conditional probability. A theorem is a statement that can be proven true through the use of math. Bayes’ theorem is written as follows

P(A | B)

This complex notation simply means

The probability of event A given event B occurs

Calculating probabilities using Bayes’ theorem can be somewhat confusing when done by hand. There are a few terms however that you need to be exposed too.

  • prior probability is the probability of an event without a conditional event
  • likelihood is the probability of a given event
  • posterior probability is the probability of an event given that another event occurred. the calculation or posterior probability is the application of Bayes’ theorem

Naives Bayes Algorithm

Bayes’ theorem has been used to develop the Naive Bayes Algorithm. This algorithm is particularly useful in classifying text data, such as emails. This algorithm is fast, good with missing data, and powerful with large or small data sets. However, naive bayes struggles with large amounts of numeric data and it has a problem with assuming that all features are of equal value, which is rarely the case.

Conclusion

Probability is a core component of prediction. However, prediction cannot truly take place with events being dependent. Thanks to the work of Thomas Bayes, we have one approach to making prediction through the use of his theorem.

In a future post, we will use naive Bayes algorithm to make predictions about text.

 

Introduction to Probability

Probability is a critical component of statistical analysis and serves as a way to determine the likelihood of an event occurring. This post will provide a brief introduction into some of the principles of probability.

Probability 

There are several basic probability terms we need to cover

  • events
  • trial
  • mutually exclusive and exhaustive

Events are possible outcomes. For example, if you flip a coin, the event can be heads or tails. A trial is a single opportunity for an event to occur. For example, if you flip a coin one time this means that there was one trial or one opportunity for the event of heads or tails to occur.

To calculate the probability of an event you need to take the number of trials an event occurred divided by the total number of trials. The capital letter “P” followed by the number in parentheses is always how probability is expressed. Below is the actual equation for this

Number of trial the event occurredTotal number of trials = P(event)

To provide an example, if we flip a coin ten times and we recored five heads and five tails, if we want to know the probability of heads this is the answer below

Five heads ⁄ Ten trials = P(heads) = 0.5

Another term to understand is mutually exclusive and exhaustive. This means that events cannot occur at the same time. For example, if we flip a coin, the result can only be heads or tails. We cannot flip a coin and have both heads and tails happen simultaneously.

Joint Probability 

There are times were events are not mutually exclusive. For example, lets say we have the possible events

  1. Musicians
  2. Female
  3.  Female musicians

There are many different events that came happen simultaneously

  • Someone is a musician and not female
  • Someone who is female and not a musician
  • Someone who is a female musician

There are also other things we need to keep in mind

  • Everyone is not female
  • Everyone is not a musician
  • There are many people who are not female and are not musicians

We can now work through a sample problem as shown below.

25% of the population are musicians and 60% of the population is female. What is the probability that someone is a female musician

To solve this problem we need to find the joint probability which is the probability of two independent events happening at the same time. Independent events or events that do not influence each other. For example, being female has no influence on becoming a musician and vice versa. For our female musician example, we run the follow calculation.

P(Being Musician) * P(Being Female) = 0.25 * 0.60 = 0.25 = 15%

 From the calculation, we can see that there is a 15% chance that someone will be female and a musician.

Conclusion

Probability is the foundation of statistical inference. We will see in a future post that not all events are independent. When they are not the use of conditional probability and Bayes theorem is appropriate.

Logistic Regression in R

Logistic regression is used when the dependent variable is categorical with two choices. For example, if we want to predict whether someone will default on their loan. The dependent variable is categorical with two choices yes they default and no they do not.

Interpreting the output of a logistic regression analysis can be tricky. Basically, you need to interpret the odds ratio. For example, if the results of a study say the odds of default are 40% higher when someone is unemployed it is an increase in the likelihood of something happening. This is different from the probability which is what we normally use. Odds can go from any value from negative infinity to positive infinity. Probability is constrained to be anywhere from 0-100%.

We will now take a look at a simple example of logistic regression in R. We want to calculate the odds of defaulting on a loan. The dependent variable is “default” which can be either yes or no. The independent variables are “student” which can be yes or no, “income” which how much the person made, and “balance” which is the amount remaining on their credit card.

Below is the coding for developing this model.

The first step is to load the “Default” dataseat. This dataseat is a part of the “ISLR” package. Below is the code to get started

library(ISLR)
data("Default")

It is always good to examine the data first before developing a model. We do this by using the ‘summary’ function as shown below.

summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

We now need to check our two continuous variables “balance” and “income” to see if they are normally distributed. Below is the code followed by the histograms.

hist(Default$income)

Rplot

hist(Default$balance)

Rplot.jpeg

The ‘income’ variable looks fine but there appear to be some problems with ‘balance’ to deal with this we will perform a square root transformation on the ‘balance’ variable and then examine it again by looking at a histogram. Below is the code.

Default$sqrt_balance<-(sqrt(Default$balance))
hist(Default$sqrt_balance)

Rplot

As you can see this is much better looking.

We are now ready to make our model and examine the results. Below is the code.

Credit_Model<-glm(default~student+sqrt_balance+income, family=binomial, Default)
summary(Credit_Model)
## 
## Call:
## glm(formula = default ~ student + sqrt_balance + income, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2656  -0.1367  -0.0418  -0.0085   3.9730  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.938e+01  8.116e-01 -23.883  < 2e-16 ***
## studentYes   -6.045e-01  2.336e-01  -2.587  0.00967 ** 
## sqrt_balance  4.438e-01  1.883e-02  23.567  < 2e-16 ***
## income        3.412e-06  8.147e-06   0.419  0.67538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1574.8  on 9996  degrees of freedom
## AIC: 1582.8
## 
## Number of Fisher Scoring iterations: 9

The results indicate that the variable ‘student’ and ‘sqrt_balance’ are significant. However, ‘income’ is not significant. What all this means in simple terms is that being a student and having a balance on your credit card influence the odds of going into default while your income makes no difference. Unlike, multiple regression coefficients, the logistic coefficients require a transformation in order to interpret them The statistical reason for this is somewhat complicated. As such, below is the code to interpret the logistic regression coefficients.

exp(coef(Credit_Model))
##  (Intercept)   studentYes sqrt_balance       income 
## 3.814998e-09 5.463400e-01 1.558568e+00 1.000003e+00

To explain this as simply as possible. You subtract 1 from each coefficient to determine the actually odds. For example, if a person is a student the odds of them defaulting are 445% higher than when somebody is not a student when controlling for balance and income. Furthermore, for every 1 unit increase in the square root of the balance the odds of default go up by 55% when controlling for being a student and income. Naturally, speaking in terms of a 1 unit increase in the square root of anything is confusing. However, we had to transform the variable in order to improve normality.

Conclusion

Logistic regression is one approach for predicting and modeling that involves a categorical dependent variable. Although the details are little confusing this approach is valuable at times when doing an analysis.

Assumption Check for Multiple Regression

The goal of the post is to attempt to explain the salary of a baseball based on several variables. We will see how to test various assumptions of multiple regression as well as deal with missing data. The first thing we need to do is load our data. Our data will come from the “ISLR” package and we will use the data set “Hitters”. There are 20 variables in the dataset as shown by the “str” function

#Load data 
library(ISLR)
data("Hitters")
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

We now need to assess the amount of missing data. This is important because missing data can cause major problems with different analysis. We are going to create a simple function that well explain to us the amount of missing data for each variable in the “Hitters” dataset. After using the function we need to use the “apply” function to display the results according to the amount of data missing by column and row.

Missing_Data <- function(x){sum(is.na(x))/length(x)*100}
apply(Hitters,2,Missing_Data)
##     AtBat      Hits     HmRun      Runs       RBI     Walks     Years 
##   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000 
##    CAtBat     CHits    CHmRun     CRuns      CRBI    CWalks    League 
##   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000 
##  Division   PutOuts   Assists    Errors    Salary NewLeague 
##   0.00000   0.00000   0.00000   0.00000  18.32298   0.00000
apply(Hitters,1,Missing_Data)

For column we can see that the missing data is all in the salary variable, which is missing 18% of its data. By row (not displayed here) you can see that a row might be missing anywhere from 0-5% of its data. The 5% is from the fact that there are 20 variables and there is only missing data in the salary variable. Therefore 1/20 = 5% missing data for a row. To deal with the missing data, we will us the ‘mice’ package. You can install it yourself and run the following code

 

library(mice)
md.pattern(Hitters)
##     AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 263     1    1     1    1   1     1     1      1     1      1     1    1
##  59     1    1     1    1   1     1     1      1     1      1     1    1
##         0    0     0    0   0     0     0      0     0      0     0    0
##     CWalks League Division PutOuts Assists Errors NewLeague Salary   
## 263      1      1        1       1       1      1         1      1  0
##  59      1      1        1       1       1      1         1      0  1
##          0      0        0       0       0      0         0     59 59
Hitters1 <- mice(Hitters,m=5,maxit=50,meth='pmm',seed=500)

 

summary(Hitters1)
## Multiply imputed data set
## Call:
## mice(data = Hitters, m = 5, method = "pmm", maxit = 50, seed = 500)

In the code above we did the following

  1. loaded the ‘mice’ package Run the ‘md.pattern’ function Made a new variable called ‘Hitters’ and ran the ‘mice’ function on it.
  2. This function made 5 datasets  (m = 5) and used predictive meaning matching to guess the missing data point for each row (method = ‘pmm’).
  3. The seed is set for the purpose of reproducing the results The md.pattern function indicates that

There are 263 complete cases and 59 incomplete ones (not displayed). All the missing data is in the ‘Salary’ variable. The ‘mice’ function shares various information of how the missing data was dealt with. The ‘mice’ function makes five guesses for each missing data point. You can view the guesses for each row by the name of the baseball player. We will then select the first dataset as are new dataset to continue the analysis using the ‘complete’ function from the ‘mice’ package.

#View Imputed data
Hitters1$imp$Salary

 

#Make Complete Dataset
completedData <- complete(Hitters1,1)

Now we need to deal with the normality of each variable which is the first assumption we will deal with. To save time, I will only explain how I dealt with the non-normal variables. The two variables that were non-normal were “salary” and “Years”. To fix these two variables I did a log transformation of the data. The new variables are called ‘log_Salary’ and “log_Years”. Below is the code for this with the before and after histograms

#Histogram of Salary
hist(completedData$Salary)

Rplot

#log transformation of Salary
completedData$log_Salary<-log(completedData$Salary)
#Histogram of transformed salary
hist(completedData$log_Salary)

Rplot

#Histogram of years
hist(completedData$Years)
Rplot
#Log transformation of Years completedData$log_Years<-log(completedData$Years) hist(completedData$log_Years)

Rplot

We can now do are regression analysis and produce the residual plot in order to deal with the assumpotion of homoscedestacity and lineraity. Below is the code

Salary_Model<-lm(log_Salary~Hits+HmRun+Walks+log_Years+League, data=completedData)
#Residual Plot checks Linearity 
plot(Salary_Model)

When using the ‘plot’ function you will get several plots. The first is the residual vs fitted which assesses linearity. The next is the qq plot which explains if are data is normally distributed. The scale location plot explains if there is equal variance. The residual vs leverage plot is used for finding outliers. All plots look good.

RplotRplotRplotRplot

summary(Salary_Model)
## 
## Call:
## lm(formula = log_Salary ~ Hits + HmRun + Walks + log_Years + 
##     League, data = completedData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1052 -0.3649  0.0171  0.3429  3.2139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.8790683  0.1098027  35.328  < 2e-16 ***
## Hits        0.0049427  0.0009928   4.979 1.05e-06 ***
## HmRun       0.0081890  0.0046938   1.745  0.08202 .  
## Walks       0.0063070  0.0020284   3.109  0.00205 ** 
## log_Years   0.6390014  0.0429482  14.878  < 2e-16 ***
## League2     0.1217445  0.0668753   1.820  0.06963 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5869 on 316 degrees of freedom
## Multiple R-squared:  0.5704, Adjusted R-squared:  0.5636 
## F-statistic: 83.91 on 5 and 316 DF,  p-value: < 2.2e-16

Furthermore, the model explains 57% of the variance in salary. All varibles (Hits, HmRun, Walks, Years, and League) are significant at 0.1. Are last step is to find the correlations among the variables. To do this, we need to make a correlational matrix. We need to remove variables that are not a part of our study to do this. We also need to load the “Hmisc” package and use the ‘rcorr’ function to produce the matrix along with the p values. Below is the code

#find correlation
completedData1<-completedData;completedData1$Chits<-NULL;completedData1$CAtBat<-NULL;completedData1$CHmRun<-NULL;completedData1$CRuns<-NULL;completedData1$CRBI<-NULL;completedData1$CWalks<-NULL;completedData1$League<-NULL;completedData1$Division<-NULL;completedData1$PutOuts<-NULL;completedData1$Assists<-NULL; completedData1$NewLeague<-NULL;completedData1$AtBat<-NULL;completedData1$Runs<-NULL;completedData1$RBI<-NULL;completedData1$Errors<-NULL; completedData1$CHits<-NULL;completedData1$Years<-NULL; completedData1$Salary<-NULL
library(Hmisc)

 

 rcorr(as.matrix(completedData1))
##            Hits HmRun Walks log_Salary log_Years
## Hits       1.00  0.56  0.64       0.47      0.13
## HmRun      0.56  1.00  0.48       0.36      0.14
## Walks      0.64  0.48  1.00       0.46      0.18
## log_Salary 0.47  0.36  0.46       1.00      0.63
## log_Years  0.13  0.14  0.18       0.63      1.00
## 
## n= 322 
## 
## 
## P
##            Hits   HmRun  Walks  log_Salary log_Years
## Hits              0.0000 0.0000 0.0000     0.0227   
## HmRun      0.0000        0.0000 0.0000     0.0153   
## Walks      0.0000 0.0000        0.0000     0.0009   
## log_Salary 0.0000 0.0000 0.0000            0.0000   
## log_Years  0.0227 0.0153 0.0009 0.0000

There are no high correlations among our variables so multicolinearity is not an issue

Conclusion

This post provided an example dealing with missing data, checking the assumptions of a regression model, and displaying plots. All this was done using R.

Wilcoxon Signed Rank Test in R

The Wilcoxon Signed Rank Test is the non-parametric equivalent of the t-test. If you have questions whether or not your data is normally distributed the Wilcoxon Signed Rank Test can still indicate to you if there is a difference between the means of your sample.

Th Wilcoxon Test compares the medians of two samples instead of their means. The differences between the median and each individual value for each sample is calculated. Values that come to zero are removed. Any remaining values are ranked from lowest to highest. Lastly, the ranks are summed. If the rank sum is different between the two samples it indicates  statistical difference between samples.

We will now do an example using r. We want to see if there is a difference in enrollment between private and public universities. Below is the code

We will begin by loading the ISLR package. Then we will load the ‘College’ data and take a look at the variables in the “College” dataset by using the ‘str’ function.

library(ISLR)
data=College
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

We will now look at the Enroll variable and see if it is normally distributed

hist(College$Enroll)

download.png

This variable is highly skewed to the right. This may mean that it is not normally distributed. Therefore, we may not be able to use a regular t-test to compare private and public universities and the Wilcoxon Test is more appropriate. We will now use the Wilcoxon Test. Below are the results

wilcox.test(College$Enroll~College$Private)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  College$Enroll by College$Private
## W = 104090, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The results indicate a difference we will now calculate the medians of the two groups using the ‘aggregate’ function. This function allows us to compare our two groups based on the median. Below is the code with the results.

aggregate(College$Enroll~College$Private, FUN=median)
##   College$Private College$Enroll
## 1              No       1337.5
## 2             Yes        328.0

As you can see, there is a large difference in enrollment in private and public colleges. We can then make the conclusion that there is a difference in the medians of private and public colleges with public colleges have a much higher enrollment.

Conclusion

The Wilcoxon Test is used for a non-parametric analysis of data. This test is useful whenever there are concerns with the normality of the data.

Kruskal-Willis Test in R

Sometimes when the data that needs to be analyzed is not normally distributed. This makes it difficult to make any inferences based on the results because one of the main assumptions of parametric statistical test such as ANOVA, t-test, etc is normal distribution of the data.

Fortunately, for every parametric test there is a non-parametric test. Non-parametric test are test that make no assumptions about the normality of the data. This means that the non-normal data can still be analyzed with a certain measure of confidence in terms of the results.

This post will look at non-parametric test that are used to test the difference in means. For three or more groups we used the Kruskal-Wallis Test. The Kruskal-Wallis Test is the non-parametric version of ANOVA.

 

Setup

We are going to use the “ISLR” package available on R to demonstrate the use of the Kruskal-Wallis test. After downloading this package you need to load the “Auto” data. Below is the code to do all of this.

install.packages('ISLR')
library(ISLR)
data=Auto

We now need to examine the structure of the data set. This is done with the “str” function below is code followed by the results

str(Auto)
'data.frame':	392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

So we have 9 variables. We first need to find if any of the continuous variables are non-normal because this indicates that the Kruskal-Willis test is needed. We will look at the ‘displacement’ variable and look at the histogram to see if it is normally distributed. Below is the code followed by the histogram

hist(Auto$displacement)
Rplot.jpeg

This does not look normally distributed. We now need a factor variable with 3 or more groups. We are going to use the ‘origin’ variable. This variable indicates were the care was made 1 = America, 2 = Europe, and 3 = Japan. However, this variable is currently a numeric variable. We need to change it into a factor variable. Below is the code for this

Auto$origin<-as.factor(Auto$origin)

The Test

We will now use the Kruskal-Wallis test. The question we have is “is there a difference in displacement based on the origin of the vehicle?” The code for the analysis is below followed by the results.

> kruskal.test(displacement ~ origin, data = Auto)

	Kruskal-Wallis rank sum test

data:  displacement by origin
Kruskal-Wallis chi-squared = 201.63, df = 2, p-value < 2.2e-16

Based on the results, we know there is a difference among the groups. However, just like ANOVA, we do not know were. We have to do a post-hoc test in order to determine where the difference in means is among the three groups.

To do this we need to install a new package and do a new analysis. We will download the “PCMR” package and run the code below

install.packages('PMCMR')
library(PMCMR)
data(Auto)
attach(Auto)
posthoc.kruskal.nemenyi.test(x=displacement, g=origin, dist='Tukey')

Here is what we did,

  1. Installed the PMCMR package and loaded it
  2. Loaded the “Auto” data and used the “attach” function to make it available
  3. Ran the function “posthoc.kruskal.nemenyi.test” and place the appropriate variables in their place and then indicated the type of posthoc test ‘Tukey’

Below are the results

Pairwise comparisons using Tukey and Kramer (Nemenyi) test	
                   with Tukey-Dist approximation for independent samples 

data:  displacement and origin 

  1       2   
2 3.4e-14 -   
3 < 2e-16 0.51

P value adjustment method: none 
Warning message:
In posthoc.kruskal.nemenyi.test.default(x = displacement, g = origin,  :
  Ties are present, p-values are not corrected.

The results are listed in a table. When a comparison was made between group 1 and 2 the results were significant (p < 0.0001). The same when group 1 and 3 are compared (p < 0.0001).  However, there was no difference between group 2 and 3 (p = 0.51).

Do not worry about the warning message this can be corrected if necessary

Perhaps you are wondering what the actually means for each group is. Below is the code with the results

> aggregate(Auto[, 3], list(Auto$origin), mean)
  Group.1        x
1       1 247.5122
2       2 109.6324
3       3 102.7089

Cares made in America have an average displacement of 247.51 while cars from Europe and Japan have a displacement of 109.63 and 102.70. Below is the code for the boxplot followed by the graph

boxplot(displacement~origin, data=Auto, ylab= 'Displacement', xlab='Origin')
title('Car Displacement')

Rplot01.jpeg

Conclusion

This post provided an example of the Kruskal-Willis test. This test is useful when the data is not normally distributed. The main problem with this test is that it is less powerful than an ANOVA test. However, this is a problem with most non-parametric test when compared to parametric test.

Decisions Trees with Titanic

In this post, we will make predictions about Titanic survivors using decision trees. The advantage of decisions trees is that they split the data into clearly defined groups. This process continues until the data is divided into extremely small subsets. This subsetting is used for making predictions.

We are assuming you have the data and have viewed the previous machine learning post on this blog. If not please click here.

Beginnings

You need to install the ‘rpart’ package from the CRAN repository as the package contains a decision tree function.You will also need to install the following packages

  • ratttle
  • rpart.plot
  • RColorNrewer

Each of these packages plays a role in developing decision trees

Building the Model

Once you have installed the packages. You need to develop the model below is the code. The model uses most of the variables in the data set for predicting survivors.

tree <- rpart(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,data=train, method=’class’)

The ‘rpart’ function is used for making the classification tree aka decision tree.

We now need to see the tree we do this with the code below

plot(tree)
text(tree)

Plot makes the tree and ‘text’ adds names download

You can probably tell that this is an ugly plot. To improve the appearance we will run the following code.

library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(my_tree_two)

Below is the revised plot

download

This looks much better

How to Read

Here is one way to read a decision tree.

  1. At the top node, keep in mind we are predicting Survival rate. There is a 0 or 1 in all of the ‘buckets’. This number represents how the bucket voted. If more than 50% perish than the bucket votes ‘0’ or no survivors
  2. Still looking at the top bucket, 62% of the passengers die while 38% survived before we even split the data.The number under the node tells what percentage of the sample is in this “bucket”. For the first bucket 100% of the sample is represented.
  3. The first split is based on sex. If the person is male you look to the left. For males, 81% of them died compared to 19% who survived and the bucket votes 0 for death. 65% of the sample is in this bucket
  4. For those who are not male (female) we look to the right and see that only 26% died compared to 74% who survived leading to this bucket voting 1 for survival. This bucket represents 35% of the entire sample.
  5. This process continues all the way down to the bottom

Conclusion

Decisions trees are useful for making ‘decisions’ about what is happening in data. For those who are looking for a simple prediction algorithm, decision trees is one place to begin

Data Science Application: Titanic Survivors

This post will provide a practical application of some of the basics of data science using data from the sinking of the Titanic. In this post, in particular, we will explore the dataset and see what we can uncover.

Loading the Dataset

The first thing that we need to do is load the actual datasets into R. In machine learning, there are always at least two datasets. One dataset is the training dataset and the second dataset is the testing dataset. The training is used for developing a model and the testing is used for checking the accuracy of the model on a different dataset. Downloading both datasets can be done through the use of the following code.

url_train <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
url_test <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
training <- read.csv(url_train)
testing <- read.csv(url_test)

What Happen?

  1. We created the variable “url_train” and put the web link in quotes. We then repeat this for the test data set
  2. Next, we create the variable “training” and use the function “read.csv” for our “url_train” variable. This tells R to read the csv file at the web address in ‘url_train’. We then repeat this process for the testing variable

Exploration

We will now do some basic data exploration. This will help us to see what is happening in the data. What to look for is endless. Therefore, we will look at a few basics things that we might need to know. Below are some questions with answers.

  1. What variables are in the dataset?

This can be found by using the code below

str(training)

The output reveals 12 variables

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 354 273 16 555 516 625 413 577 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Here is a better explanation of each

  • PassengerID-ID number of each passenger
  • Survived-Who survived as indicated by a 1 and who did not by a 0
  • Pclass-Class of passenger 1st class, 2nd class, 3rd class
  • Name-The full name of a passenger
  • Sex-Male or female
  • Age-How old a passenger was
  • SibSp-Number of siblings and spouse a passenger had on board
  • Parch-Number of parents and or children a passenger had
  • Ticket-Ticket number
  • Fare-How much a ticket cost a passenger
  • Cabin-Where they slept on the titanic
  • Embarked-What port they came from

2. How many people survived in the training data?

The answer for this is found by running the following code in the R console. Keep in mind that 0 means died and 1 means survived

> table(training$Survived)
  0   1 
549 342

Unfortunately, unless you are really good at math these numbers do not provide much context. It is better to examine this using percentages as can be found in the code below.

> prop.table(table(training$Survived))

        0         1 
0.6161616 0.3838384 

These results indicate that about 62% of the passengers died while 38% survived in the training data.

3. What percentage of men and women survived?

This information can help us to determine how to setup a model for predicting who will survive. The answer is below. This time we only look percentages.

> prop.table(table(training$Sex, training$Survived), 1)
        
                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081

You can see that being male was not good on the titanic. Men died in much higher proportions compared to women.

4. Who survived by class?

The code for this is below

> prop.table(table(train$Pclass, train$Survived), 1)
   
            0         1
  1 0.3703704 0.6296296
  2 0.5271739 0.4728261
  3 0.7576375 0.2423625

Here is a code for a plot of this information followed by the plot

plot(deathbyclass, main="Passenger Fate by Traveling Class", shade=FALSE, 
+      color=TRUE, xlab="Pclass", ylab="Survived")
Rplot

3rd class had the highest mortality rate. This makes sense as 3rd class was the cheapest tickets.

5. Does age make a difference in survival?

We want to see if age matters in survival. It would make sense that younger people would be more likely to survive. This might be due to parents give their kids a seat on the lifeboats and younger singles pushing their way past older people.

We cannot use a plot for this because we would have several dozen columns on the x-axis for each year of life. Instead, we will use a box plot based on survived or died to see. Below is the code followed by a visual.

boxplot(training$Age ~ training$Survived, 
        main="Passenger Fate by Age",
        xlab="Survived", ylab="Age")

Rplot01

As you can see, there is little difference in terms of who survives based on age. This means that age may not be a useful predictor of survival.

Conclusion

Here is what we know so far

  • Sex makes a major difference in survival
  • Class makes a major difference in survival
  • Age may not make a difference in survival

There is so much more we can explore in the data. However, this is enough for beginning to lay down criteria for developing a model.

Random Forest in R

Random Forest is a similar machine learning approach to decision trees. The main difference is that with random forest. At each node in the tree, the variable is bootstrapped. In addition, several different trees are made and the average of the trees are presented as the results. This means that there is no individual tree to analyze but rather a ‘forest’ of trees

The primary advantage of random forest is accuracy and prevent overfitting. In this post, we will look at an application of random forest in R. We will use the ‘College’ data from the ‘ISLR’ package to predict whether a college is public or private

Preparing the Data

First, we need to split our data into a training and testing set as well as load the various packages that we need. We have run this code several times when using machine learning. Below is the code to complete this.

library(ggplot2);library(ISLR)
library(caret)
data("College")
forTrain<-createDataPartition(y=College$Private, p=0.7, list=FALSE)
trainingset<-College[forTrain, ]
testingset<-College[-forTrain, ]

Develop the Model

Next, we need to setup the model we want to run using Random Forest. The coding is similar to that which is used for regression. Below is the code

Model1<-train(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, method='rf',prox=TRUE)

We are using 7 variables to predict whether a university is private or not. The method is ‘rf’ which stands for “Random Forest”. By now, I am assuming you can read code and understand what the model is trying to do. For a refresher on reading code for a model please click here.

Reading the Output

If you type “Model1” followed by pressing enter, you will receive the output for the random forest

Random Forest 

545 samples
 17 predictors
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 545, 545, 545, 545, 545, 545, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
  2     0.8957658  0.7272629  0.01458794   0.03874834
  4     0.8969672  0.7320475  0.01394062   0.04050297
  7     0.8937115  0.7248174  0.01536274   0.04135164

Accuracy was used to select the optimal model using 
 the largest value.
The final value used for the model was mtry = 4.

Most of this is self-explanatory. The main focus is on the mtry, accuracy, and Kappa.

The shows several different models that the computer generated. Each model reports the accuracy of the model as well as the Kappa. The accuracy states how well the model predicted accurately whether a university was public or private. The kappa shares the same information but it calculates how well a model predicted while taking into account chance or luck. As such, the Kappa should be lower than the accuracy.

At the bottom of the output, the computer tells which mtry was the best. For our example, the best mtry was number 4. If you look closely, you will see that mtry 4 had the highest accuracy and Kappa as well.

Confusion Matrix for the Training Data

Below is the confusion matrix for the training data using the model developed by the random forest. As you can see, the results are different from the random forest output. This is because this model is predicting without bootstrapping

> predNew<-predict(Model1, trainingset) > trainingset$predRight<-predNew==trainingset$Private > table(predNew, trainingset$Private)
       
predNew  No Yes
    No  149   0
    Yes   0 396

Results of the Testing Data

We will now use the testing data to check the accuracy of the model we developed on the training data. Below is the code followed by the output

pred <- predict(Model1, testingset)
testingset$predRight<-pred==testingset$Private
table(pred, testingset$Private)
pred   No Yes
  No   48  11
  Yes  15 158

For the most part, the model we developed to predict if a university is private or not is accurate.

How Important is a Variable

You can calculate how important an individual variable is in the model by using the following code

Model1RF<-randomForest(Private~Grad.Rate+Outstate+Room.Board+Books+PhD+S.F.Ratio+Expend, data=trainingset, importance=TRUE)
importance(Model1RF)

The output tells you how much the accuracy of the model is reduced if you remove the variable. As such, the higher the number the more valuable the variable is in improving accuracy.

Conclusion

This post exposed you to the basics of random forest. Random forest is a technique that develops a forest of decisions trees through resampling. The results of all these trees are then averaged to give you an idea of which variables are most useful in prediction.

Type I and Type II Error

Hypothesis testing in statistics involves deciding whether to reject or not reject a null hypothesis. There are problems that can occur when making decisions about a null hypothesis. A researcher can reject a null when they should not reject it, which is called a type I error. The other mistake is not rejecting a null when they should have, which is a type II error. Both of these mistakes represent can seriously damage the interpretation of data.

An Example

The classic example that explains type I and type II errors is a courtroom. In a trial, the defendant is considered innocent until proven guilty. The defendant can be compared to the null hypothesis being true. The prosecutor job is to present evidence that the defendant is guilty. This is the same as providing statistical evidence to reject the null hypothesis which indicates that the null is not true and needs to be rejected.

There are four possible outcomes of our trial and our statistical test…

  1. The defendant can be declared guilty when they are really guilty. That’s a correct decision.This is the same as rejecting the null hypothesis.
  2. The defendant could be judged not guilty when they really are innocent. That’s a correct and is the same as not rejecting the null hypothesis.
  3. The defendant is convicted when they are actually innocent, which is wrong. This is the same as rejecting the null hypothesis when you should not and is know as a type I error
  4. The defendant is guilty but declared innocent, which is also incorrect. This is the same as not rejecting the null hypothesis when you should have. This is known as a type II error.

Important Notes

The probability of committing a type I error is the same as the alpha or significance level of a statistical test. Common values associated with alpha are o.1, 0.05, and 0.01. This means that the likelihood of committing a type I error depends on the level of the significance that the researcher picks.

The probability of committing a type II error is known as beta. Calculating beta is complicated as you need specific values in your null and alternative hypothesis. It is not always possible to supply this. As such, researcher often do not focus on type II error avoidance as they do with type I.

Another concern is that decrease the risk of committing one type of error increases the risk of committing the other. This means that if you reduce the risk of type I error you increase the risk of committing a type II error.

Conclusion

The risk of error or incorrect judgment of a null hypothesis is a risk in statistical analysis. As such, researchers need to be aware of these problems as they study data.

Z-Scores

A z-score indicates how closely related one given score is to mean of the sample. Extremely high or low z-scores indicates that the given data point is unusually above or below the mean of the sample.

In order to understand z-scores you need to be familiar with distribution. In general, data is distributed in a bell shape curve. With the mean being the exact middle of the graph as shown in the picture below.

download

The Greek letter μ is the mean. In this post, we will go through an example that will try to demonstrate how to use and interpret the z-score. Notice that a z-score + 1 takes of 68% of the potential values a z-score + 2 takes of 95%, a z-score + 3 takes of 99%.

Imagine you know the average test score of students on a quiz. The average is 75%. with a standard deviation of 6.4%. Below is the equation for calculating the z-score.

Standard_Score_Calc

Let’s say that one student scored 52% on the quiz. We can calculate the likelihood for this data point by using the formula above.

(52 – 75) / 6.4 = -3.59

Our value is negative which indicates that the score is below the mean of the sample. Our score is very exceptionally low from the mean. This makes sense given that the mean is 75% and the standard deviation is 6.4%. To get a 52% on the quiz was really bad performance.

We can convert the z-score to a percentage to indicate the probability of getting such a value. To do this you would need to find a z-score conversion table on the internet. A quick glance at the table will show you that the probability of getting a score of 52 on the quiz is less than 1%.

Off course, this is based on the average score of 75% with a standard deviation of 6.4%. A different average and standard deviation would change the probability of getting a 52%.

Standardization 

Z-scores are also used to standardize a variable. If you look at our example, the original values were in percentages. By using the z-score formula we converted these numbers into a different value. Specifically, the values of a z-score represent standard deviations from the mean.

In our example, we calculated a z-score of -3.59. In other words, the person who scored 52% on the quiz had a score 3.59 standard deviations below the mean. When attempting to interpret data the z-score is a foundational piece of information that is used extensively in statistics.

Using Regression for Prediction in R

In the last post about R, we looked at plotting information to make predictions. We will now look at an example of making predictions using regression.

We will used the same data as last time with the help of the ‘caret’ package as well. The code below sets up the seed and the training and testing sets we need.

> library(caret); library(ISLR); library(ggplot2)
> data("College");set.seed(1)
> PracticeSet<-createDataPartition(y=College$Grad.Rate, 
+                                  p=0.5, list=FALSE)
> TrainingSet<-College[PracticeSet, ]; TestingSet<-
+         College[-PracticeSet, ]
> head(TrainingSet)

The code above should look familiar from previous post.

Make the Scatterplot

We will now create the scatterplot showing the relationship between “S.F. Ratio” and “Grad.Rate” with the code below and the scatterplot.

> plot(TrainingSet$S.F.Ratio, TrainingSet$Grad.Rate, pch=5, col="green", 
xlab="Student Faculty Ratio", ylab="Graduation Rate")

Rplot10

Here is what we did

  1. We used the ‘plot’ function to make this scatterplot. The x variable was ‘S.F.Ratio’ of the ‘TrainingSet’ the y variable was ‘Grad.Rate’.
  2. We picked the type of dot to use using the ‘pch’ argument and choosing ’19’
  3. Next we chose a color and labeled each axis

Fitting the Model

We will now develop the linear model. This model will help us to predict future models. Furthermore, we will compare the model of the Training Set with the Test Set. Below is the code for developing the model.

> TrainingModel<-lm(Grad.Rate~S.F.Ratio, data=TrainingSet)
> summary(TrainingModel)

How to interpret this information was presented in a previous post. However, to summarize, we can say that when the student to faculty ratio increases one the graduation rate decreases 1.29. In other words, an increase in the student to faculty ratio leads to decrease in the graduation rate.

Adding the Regression Line to the Plot

Below is the code for adding the regression line followed by the scatterplot

> plot(TrainingSet$S.F.Ratio, TrainingSet$Grad.Rate, pch=19, col="green", xlab="Student Faculty Ratio", ylab="Graduation Rate")
> lines(TrainingSet$S.F.Ratio, TrainingModel$fitted, lwd=3)

Rplot01

Predicting New Values

With are model complete we can now predict values. For our example, we will only predict one value. We want to know what the graduation rate would be if we have a student to faculty ratio of 33. Below is the code for this with the answer

> newdata<-data.frame(S.F.Ratio=33)
> predict(TrainingModel, newdata)
      1 
40.6811

Here is what we did

  1. We made a variable called ‘newdata’ and stored a data frame in it with a variable called ‘S.F.Ratio’ with a value of 33. This is x value
  2. Next, we used the ‘predict’ function from the ‘caret’ package to determine what the graduation rate would be if the student to faculty ratio is 33. To do this we told caret to use the ‘TrainingModel’ we developed using regression and to run this model with the information in the ‘newdata’ dataframe
  3. The answer was 40.68. This means that if the student to faculty ratio is 33 at a university then the graduation rate would be about 41%.

Testing the Model

We will now test the model we made with the training set with the testing set. First, we will make a visual of both models by using the “plot” function. Below is the code follow by the plots.

par(mfrow=c(1,2))
plot(TrainingSet$S.F.Ratio,
TrainingSet$Grad.Rate, pch=19, col=’green’,  xlab=”Student Faculty Ratio”, ylab=’Graduation Rate’)
lines(TrainingSet$S.F.Ratio,  predict(TrainingModel), lwd=3)
plot(TestingSet$S.F.Ratio,  TestingSet$Grad.Rate, pch=19, col=’purple’,
xlab=”Student Faculty Ratio”, ylab=’Graduation Rate’)
lines(TestingSet$S.F.Ratio,  predict(TrainingModel, newdata = TestingSet),lwd=3)

Rplot02.jpeg

In the code, all that is new is the “par” function which allows us to see to plots at the same time. We also used the ‘predict’ function to set the plots. As you can see, the two plots are somewhat differ based on a visual inspection. To determine how much so, we need to calculate the error. This is done through computing the root mean square error as shown below.

> sqrt(sum((TrainingModel$fitted-TrainingSet$Grad.Rate)^2))
[1] 328.9992
> sqrt(sum((predict(TrainingModel, newdata=TestingSet)-TestingSet$Grad.Rate)^2))
[1] 315.0409

The main take away from this complicated calculation is the number 328.9992 and 315.0409. These numbers tell you the amount of error in the training model and testing model. The lower the number the better the model. Since the error number in the testing set is lower than the training set we know that our model actually improves when using the testing set. This means that our model is beneficial in assessing graduation rates. If there were problems we may consider using other variables in the model.

Conclusion

This post shared ways to develop a regression model for the purpose of prediction and for model testing.

Simple Prediction

Prediction is one of the key concepts of machine learning. Machine learning is a field of study that is focused on the development of algorithms that can be used to make predictions.

Anyone who has shopped online at has experienced machine learning. When you make a purchase at an online store, the website will recommend additional purchases for you to make. Often these recommendations are based on whatever you have purchased or whatever you click on while at the site.

There are two common forms of machine learning, unsupervised and supervised learning. Unsupervised learning involves using data that is not cleaned and labeled and attempts are made to find patterns within the data. Since the data is not labeled, there is no indication into what is right or wrong

Supervised machine learning is using cleaned and properly labeled data. Since the data is labeled there is some form of indication whether the model that is developed is accurate or not. If the is incorrect then you need to make adjustments to it. In other words, the model learns based on its ability to accurately predict results. However, it is up to the human to make adjustments to the model in order to improve the accuracy of it.

In this post we will look at using R for supervised machine learning. The definition presented so far will make more sense with an example.

The Example

We are going to make a simple prediction about whether emails are spam or not using data from kern lab.

The first thing that you need to do is to install and load the “kernlab” package using the following code

install.packages("kernlab")
library(kernlab)

If you use the “View” function to examine the data you will see that there are several columns. Each column tells you the frequency of a word that kernlab found in a collection of emails. We are going to use the word/variable “money” to predict whether an email is spam or not. First we need to plot the density of the use of the word “money” when the email was not coded as spam. Below is the code for this.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")

This is an advance R post so I am assuming you can read the code. The plot should look like the following.

Rplot10

As you can see, money is not used to frequently in emails  that are not spam in this dataset. However, you really cannot say this unless you compare the times ‘money’ is labeled nonspam to the times that it is labeled spam. To learn this we need to add a second line that explains to us when the word ‘money’ is used and classified as spam. The code for this is below with the prior code included.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")

Your new plot should look like the following

Rplot10

If you look closely at the plot doing a visual inspection, where there is a separation between the blue line for non spam and the red line for spam is the cutoff point for whether an email is spam or not. In other words, everything inside the arc is labeled correctly while the information outside the arc is not.

The next code and graph show that this cutoff point is around 0.1. This means that any email that has on average more than 0.1 frequency of the word ‘money’ is spam. Below is the code and the graph with the cutoff point indicated with a black line.

plot(density(spam$money[spam$type=="nonspam"]),
 col='blue',main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
abline(v=0.1, col="black", lw= 3)

Rplot10.jpeg Now we need to calculate the accuracy of the use of the word ‘money’ to predict spam. For our current example we will simply use in “ifelse” function. If the frequency is greater than 0.1.

We then need to make a table to see the results. The code for the “ifelse” function and the table are below followed by the table.

predict<-ifelse(spam$money > 0.1, "spam","nonspam")
table(predict, spam$type)/length(spam$type)
predict       nonspam        spam
  nonspam 0.596392089 0.266898500
  spam    0.009563138 0.127146273

Based on the table that I am assuming you can read, our model accurately calculates that an email is spam about 71% (0.59 + 0.12) of the time based on the frequency of the word ‘money’ being greater than 0.1.

Off course, for this to be true machine learning we would repeat this process by trying to improve the accuracy of the prediction. However, this is an adequate introduction to this topic.

Survey Design

Survey design is used to describe the opinions, beliefs, behaviors, and or characteristics of a population based on the results of a sample. This design involves the use of surveys that include questions, statements, and or other ways of soliciting information from the sample. This design is used for descriptive purpose primarily but can be combined with other designs (correlational, experimental) at times as well. In this post we will look at the following.

  • Types of Survey Design
  • Characteristics of Survey Design

Types of Survey Design

There are two common forms of survey design which are cross-sectional and longitudinal.   A cross-sectional survey design is the collection of data at one specific point in time. Data is only collected once in a cross-sectional design.

A cross-sectional design can be used to measure opinions/beliefs, compare two or more groups, evaluate a program, and or measure the needs of a specific group. The main goal is to analyze the data from a sample at a given moment in time.

A longitudinal design is similar to a cross-sectional design with the difference being that longitudinal designs require collection over time.Longitudinal studies involve cohorts and panels in which data is collected over days, months, years, and even decades. Through doing this, a longitudinal study is able to expose trends over time in a sample.

Characteristics of Survey Design

There are certain traits that are associated with survey design. Questionnaires and interviews are a common component of survey design. The questionnaires can happen by mail, phone, internet, and in person. Interviews can happen by phone, in focus groups, or one-on-one.

The design of a survey instrument often includes personal, behavioral and attitudinal questions, and open/closed questions.

Another important characteristic of survey design is monitoring the response rate. The response rate is the percentage of participants in the study compared to the number of surveys that were distributed. The response rate varies depending on how the data was collected. Normally, personal interviews have the highest rate while email request have the lowest.

It is sometimes necessary to report the response rate when trying to publish. As such, you should at the very least be aware of what the rate is for a study you are conducting.

Conclusion

Surveys are used to collected data at one point in time or over time. The purpose of this approach is develop insights into the population in order to describe what is happening or to be used to make decisions and inform practice.

 

Intro to Making Plots in R

One of the strongest points of R in the opinion of many are the various features for creating graphs and other visualizations of data. In thia post, we begin to look at using the various visualization features of R. Specifically, we are going to do the following

  • Using data in R to display graph
  • Add text to a graph
  • Manipulate the appearance of data in a graph

Using Plots

The ‘plot’ function is one of the basic options for graphing data. We are going to go through an example using the ‘islands’ data that comes with the R software. The ‘islands’ sotware includes lots of data in particular, it contains data on the lass mass of different islands. We want to plot the land mass of the seven largest islands. Below is the code for doing this.

islandgraph<-head(sort(islands, decreasing=TRUE), 7)
plot(islandgraph, main = "Land Area", ylab = "Square Miles")
text(islandgraph, labels=names(islandgraph), adj=c(0.5,1))

Here is what we did

  1. We made the variable ‘islandgraph’
  2. In the variable ‘islandgraph’ We used the ‘head’ and the ‘sort’ function. The sort function told R to sort the ‘island’ data by decreasing value ( this is why we have the decreasing argument equaling TRUE). After sorting the data, the ‘head’ function tells R to only take the first 7 values of ‘island’ (see the 7 in the code) after they are sorted by decreasing order.
  3. Next, we use the plot function to plot are information in the ‘islandgraph’ variable. We also give the graph a title using the ‘main’ argument followed by the title. Following the title we label the y axis using the ‘ylab’ argument and putting in quotes “Square Miles”.
  4. The last step is to add text to the information inside the graph for clarity. Using the ‘text’ function, we tell R to add text to the ‘islandgraph’ variable using the names from the ‘islandgraph’ data which uses the code ‘label=names(islandgraph)’. Remember the ‘islandgraph’ data is the first 7 islands from the ‘islands’ dataset.
  5. After telling R to use the names from the islandgraph dataset we then tell it to place the label a little of center for readability reasons with the code ‘adj = c(0.5,1).

Below is what the graph should look like.

Rplotblog

Changing Point Color and Shape in a Graph

For visiual purposes, it may be beneficially to manipulate the color and appearance of several data points in a graph. To do this, we are going to use the ‘faithful’ dataset in R. The ‘faithful’ dataset indicates the length of eruption time and how long people had to wait for the erupution. The first thing we want to do is plot the data using the “plot” function.

Rplot

As you see the data, there are two clear clusters. One contains data from 1.5-3 and the second cluster contains data from 3.5-5. To help people to see this distinction we are going to change the color and shape of the data points in the 1.5-3 range. Below is the code for this.

eruption_time<-with(faithful, faithful[eruptions < 3, ])
plot(faithful)
points(eruption_time, col = "blue", pch = 24)

Here is what we did

  1. We created a variable named ‘eruption_time’
  2. In this variable, we use the ‘with’ function. This allows us to access columns in the dataframe without having to use the $ sign constantly. We are telling R to look at the ‘faithful’ dataframe and only take the information from faithful that has eruptions that are less than 3. All of this is indicated in the first line of code above.
  3. Next we plot ‘faithful’ again
  4. Last, we add the points from are ‘eruption_time’ variable and we tell R to color these points blue and to use a different point shape by using the ‘pch = 24’ argument
  5. The results are below

last.jpeg

Conclusion

In this post we learned the following

  • How to make a graph
  • How to add a title and label the y axis
  • How to change the color and shape of the data points in a graph

Two-Way Tables in R

A two-way table is used to explain two or more categorical variables at the same time. The difference between a two-way table and a frequency table is that a two-table tells you the number of subjects that share two or more variables in common while a frequency table tells you the number of subjects that share one variable.

For example, a frequency table would be gender. In such a table, you only know how many subjects are male or female. The only variable involved is gender. In a frequency table, you would learn some of the following

  • Total number of men
  • Total number of women
  • Total number of subjects in the study

In a two-way table, you might look at gender and marital status. In such a table you would be able to learn several things

  • Total number of men are married
  • Total number of men are single
  • Total number of women are married
  • Total number of women are single
  • Total number of men
  • Total number of women
  • Total number of married subjects
  • Total number of single subjects
  • Total number of subjects in the study

As such, there is a lot of information in a two-way table. In this post, we will look at the following

  • How to create a table
  • How to add margins to a table
  • How to calculate proportions in a table

Creating a Table

In the example, we are going to look at two categorical variables. One variable is gender and the other is marital status. For gender, the choices are “Male” and Female”. For marital status, the choicest are ‘Married” and “Single”. Below is the code for developing the table.

Marriage_Study<-matrix(c(34,20,19,42), ncol = 2)
colnames(Marriage_Study) <- c('Male', 'Female')
rownames(Marriage_Study) <- c('Married', 'Single')
Marriage_table <- as.table(Marriage_Study)
print(Marriage_table)

There has already been a discussion on creating matrices in R. Therefore, the details of this script will not be explained here.

If you type this correctly and run the script you should see the following

        Male Female
Married   34     19
Single    20     42

This table tells you about married and single people broken down by their gender. For example, 34 males are married.

Adding Margins and Calculating Proportions

A useful addition to a table is to add the margins. The margins tell you the total number of subjects in each row and column of a table. To do this in R use the ‘addmargins’ function as indicated below.

> addmargins(Marriage_table)
        Male Female Sum
Married   34     19  53
Single    20     42  62
Sum       54     61 115

We now know the total number of Married people, Singles, Males, and Females. In addition to the information we already knew.

One more useful piece of information is to calculate the proportions. This will allow us to know what percentage of each two-way possibility makes up the table. To do this we will use the “prop.table” function. Below is the script

> prop.table(Marriage_table)
             Male    Female
Married 0.2956522 0.1652174
Single  0.1739130 0.3652174

As you can see, we now know the proportions of each category in the table.

Conclusions

This post provided information on how to construct and manipulate data that is in a two-way table. Two-way tables are a useful way of describing categorical variables.

Plotting Correlations in R

A correlation indicates the strength of the relationship between two or more variables.  Plotting correlations allows you to see if there is a potential relationship between two variables. In this post, we will look at how to plot correlations with multiple variables.

In R, there is a built-in dataset called ‘iris’. This dataset includes information about different types of flowers. Specifically, the ‘iris’ dataset contains the following variables

  • Sepal.Length
  • Sepal.Width
  • Petal.Length
  • Petal.Width
  • Species

You can confirm this by inputting the following script

> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

We now want to examine the relationship that each of these variables has with each other. In other words, we want to see the relationship of

  • Sepal.Length and Sepal.Width
  • Sepal.Length and Petal.Length
  • Sepal.Length and Petal.Width
  • Sepal.Width and Petal.Length
  • Sepal.Width and Petal.Width
  • Petal.Length and Petal.Width

The ‘Species’ variable will not be a part of our analysis since it is a categorical variable and not a continuous one. The type of correlation we are analyzing is for continuous variables.

We are now going to plot all of these variables above at the same time by using the ‘plot’ function. We also need to tell R not to include the “Species” variable. This is done by adding a subset code to the script. Below is the code to complete this task.

> plot(iris[-5])

Here is what we did

  1. We use the ‘plot’ function and told R to use the “iris” dataset
  2. In brackets, we told R to remove ( – ) the 5th variable, which was species
  3. After pressing enter you should have seen the following

Rplot10

The variable names are placed diagonally from left to right. The x-axis of a plot is determined by variable name in that column. For example,

  • The variable of the x-axis of the first column is ‘Sepal.Length”
  • The variable of the x-axis of the second column is ‘Sepal.Width”
  • The variable of the x-axis of the third column is ‘Petal.Length”
  • The variable of the x-axis of the fourth column is ‘Petal.Width”

The y-axis is determined by the variable that is in the same row as the plot. For example,

  • The variable of the y-axis of the first column is ‘Sepal.Length”
  • The variable of the y-axis of the second column is ‘Sepal.Width”
  • The variable of the y-axis of the third column is ‘Petal.Length”
  • The variable of the y-axis of the fourth column is ‘Petal.Width”

AS you can see, this is the same information. We will now look at a few examples of plots

  • The plot in the first column second row plots “Sepal.Length” as the x-axis and “Sepal.Width” as the y-axis
  • The plot in the first column third row plots “Sepal.Length” as the x-axis and “Petal.Length” as the y-axis
  • The plot in the first column fourth row plots “Sepal.Length” as the x-axis and “Petal.Width” as the y-axis

Hopefully, you can see the pattern. The plots above the diagonal are mirrors of the ones below. If you are familiar with correlational matrices this should not be surprising.

After a visual inspection, you can calculate the actual statistical value of the correlations. To do so use the script below and you will see the table below after it.

> cor(iris[-5])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

As you can see, there are many strong relationships between the variables. For example “Petal.Width” and “Petal.Length” has a correlation of .96, which is almost perfect. This means that when “Petal.Width” grows by one unit “Petal.Length” grows by .96 units.

Conclusion

Plots help you to see the relationship between two variables. After visual inspection, it is beneficial to calculate the actual correlation.

Analyzing Quantitative Data: Descriptive Statisitics

For quantitative studies, once you have prepared your data it is now time to analyze it. How you analyze data is heavily influenced by your research questions. Most studies involve the use of descriptive and or inferential statistics to answer the research questions. This post will explain briefly discussed various forms of descriptive statistics.

Descriptive Statistics

Descriptive statistics describe trends or characteristics in the data. There are in general, three forms of descriptive statistics. One form deals specifically with trends and includes the mean, median, and mode. The second form deals with the spread of the scores and includes the variance, standard deviation, and range. The third form deals with comparing scores and includes z scores and percentile rank

Trend Descriptive Stats

Common examples of descriptive statistics that describe trends in the data are mean, median, mode. For example, if we gather the weight of 20 people. The mean weight of the people gives us an idea of about how much each person weighs. The mean is easier to use and remember than 20 individual data points.

The median is the value that is exactly in the middle of a range of several data points. For example, if we have several values arrange from less to greatest such as 1, 4, 7. The number 4 is the median as it is the value exactly in the middle. The mode is the most common number in a list of several values arranged from least to greatest. For example, if we have the values 1, 3, 4, 5, 5, 7. The number 5 is the mode since it appears twice while all the other numbers appear only once.

Spread Scores Descriptive Stats

Calculating spread scores is somewhat more complicated than trend stats. Variance is the average amount of deviation from the mean. It is an average of the amount of error in the data. If the mean of a data set is 5 and the variance is 1 this means that the average departure from the mean of 5 is 1 point.

One problem with variance is that its results are squared. This means that the values of the variance are measured differently than whatever the mean is. To deal with this problem, statisticians square root the results of variance to get the standard deviation. The standard deviation is the average amount that the values in a sample are different from the mean. This value is used in many different statistical analysis.

The range measures the dispersion of the data by subtracting the highest value from the lowest. For example, if the highest value in a data set is 5 and the lowest is 1 the range is 5 – 1 = 4.

Comparison Descriptive States

Comparison descriptive stats are much harder to explain and are often used to calculate more advanced statistics. Two types of comparison descriptive stats include z scores percentile rank.

Z scores tells us how far a data point is from the mean in terms of standard deviation. For example, a z score of 3.0 indicates that this particular data point is 3 standard deviations away from the mean. Z scores are useful in identify outliers and many other things.

The percentile rank is much easier to understand. Percentile rank tells you how many scores fall at or below the percentile. For example, some with a score at the 80th percentile outperformed 80% of the sample.

Conclusion

Descriptive stats are used at the beginning of an analysis. There are many other forms of descriptive stats such as skew, kurtosis, etc. Descriptive stats are useful for making sure your data meets various forms of normality in order to begin inferential statistical analysis. Always remember that your research questions determine what form of analysis to conduct.

Finding Text in a Vector in R

This post will cover how to search for text within a vector in R. There are times when you may be working with a lot of information and you want to find a specific piece of information. For example, let’s say you have a list of names that are not in alphabetical order and you want to know how many names start with the letter “E”. To solve this problem, you need to learn how to search text by searching for a pattern. Below is an example of how to do this followed by an explanation.

  • > Student.names
     [1] "Andy"    "Billy"   "Chris"   "Darrin"  "Ed"      "Frank"   "Gabe"    "Hank"   
     [9] "Ivan"    "James"   "Karl"    "Larry"   "Matt"    "Norman"  "Oscar"   "Paul"   
    [17] "Quinton" "Alex"    "Andre"   "Aron"    "Bob"     "Rick"    "Simon"   "Steve"  
    [25] "Thomas"  "Tim"     "Victor"  "Vince"   "William" "Warren"  "Wilson"  "Ted"    
    [33] "Dan"     "Eric"    "Ernest"  "Fred"    "Jim"     "Ethan"   "Lance"   "Mitch"  
    [41] "Pete"    "John"   
    > grep("E",Student.names)
    [1]  5 34 35 38
  1. You have to create the variable ‘Student.names’ and type all the names above as a vector
  2. Next, you use the ‘grep’ function to determine which of the names start with “E” in the variable ‘Student.names’
  3. R tells by position or index which names start with ‘E’

Now you know where the names that start with ‘E’ are but you don’t know the actual names. Below is how you extract the names from the variable.

>  Student.names[grep("E", Student.names)]
[1] "Ed"     "Eric"   "Ernest" "Ethan"

Here is what happened

  1. You told the computer that you want a subset of all the names that start with “E” from the variable ‘Student.names’
  2. You used the ‘grep function to do this.
  3. R returned the names that start with ‘E’

Substituting Text

You can also substitute text in a vector. For example, let’s say you want to replace the name ‘Ed’ in the ‘Student.names’ variable with the more formal name of ‘Edward’ here is how it is done. Just so you know, ‘Ed’ was the 5th name in the list but below it will be replaced with ‘Edward.

> gsub("Ed", "Edward", Student.names)
 [1] "Andy"    "Billy"   "Chris"   "Darrin"  "Edward"  "Frank"   "Gabe"    "Hank"   
 [9] "Ivan"    "James"   "Karl"    "Larry"   "Matt"    "Norman"  "Oscar"   "Paul"   
[17] "Quinton" "Alex"    "Andre"   "Aron"    "Bob"     "Rick"    "Simon"   "Steve"  
[25] "Thomas"  "Tim"     "Victor"  "Vince"   "William" "Warren"  "Wilson"  "Ted"    
[33] "Dan"     "Eric"    "Ernest"  "Fred"    "Jim"     "Ethan"   "Lance"   "Mitch"  
[41] "Pete"    "John"
  1. In this example, we used the ‘gsub’ function to replace the name ‘Ed’ with ‘Edward
  2. Using ‘gsub’ we tell R to find ‘Ed’ and replace it with ‘Edward in the variable ‘Student.names’
  3. R completes the code and prints the list as seen above

Hopefully, the information provided will give you ideas into using text in R

Introduction to Vectors Part III: Logical Vectors and More

Logical vectors are vectors that compare values. The response R gives is either TRUE which means that the comparision statement is correct or FALSE which means the comparision statement is incorrect.

Logical vectors use various operators that indicate ‘greater than’, ‘less than’, ‘equal to’, etc. As this is an abstract concept, it is better to work through several examples to understand.

You want to know how many times James scored more than 20 points in a game. to determine this you develop a simple equation that R will answer with a logical vector

  • > points.of.James <- c(12, 15, 30, 25, 23, 32)
    > points.of.James
    [1] 12 15 30 25 23 32
    > points.of.James > 20
    [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Here is what we did

  1. We inputted the values for ‘points.of.James
  2. We then entered the equation ‘points.of.James > 20’ which means which values in the variable ‘points.of.James’ are greater than 20′.
  3. R replied by stating that the first to values are not greater than 20, which is why they are FALSE and that the last 4 values are greater than 20, which is why they are TRUE.

Logical vectors can also be used to compare values in different vectors. This involves the use of the function ‘which(). The function which() is used for comparing different vectors. Below is an example

You want to know which games that James scored more points than Kevin. Below is the code for doing this

  • > points.of.James <- c(12, 15, 30, 25, 23, 32)
    > points.of.Kevin <- c(20, 19, 25, 30, 31, 22)
    > the.best <- points.of.Kevin < points.of.James 
    > which(the.best) 
    [1] 3 6

Here is what happen

  1. You set the values for the variables ‘points.of.James’ and ‘points.of.Kevin’
  2. You create a new variable called ‘the.best’. In this variable you set the equation that compares when Kevin scored less than James by comparing the values in the variable ‘points.of.Kevin’ with ‘points.of.James
  3. You then used the ‘which()’ function for it to tell which times that Kevin scored less than James.
  4. R responds by telling you that Kevin scored less than James the 3rd time and 6th time

You can also find not each time that Kevin scored less than James but instead find out how many times Kevin scored less than James by using the ‘sum()’ function instead of the ‘which()’ function.

> points.of.James <- c(12, 15, 30, 25, 23, 32)
> points.of.Kevin <- c(20, 19, 25, 30, 31, 22)
> the.best <- points.of.Kevin < points.of.James 
> sum(the.best)
[1] 2

R explains that Kevin scored less than James two times.

Naturally, there is much more that can be done with vectors than what was covered here. This is just a glimpse at what is possible.

Introduction to Vectors in R: Part II

In a previous post, we took our first look at vectors and their use in R. In this post, we will build on this knowledge by learning more ways to use and analyze vectors.

Functions for Analyzing Vectors

An important function for analyzing vectors is the str() function. This function allows you to look at the structure or characteristics of any object, including vectors. Objects are the pieces of data you create and or manipulate in R.

We are going to analyze the structure of the variable ‘points.of.James’, which contains a vector. Below is the code for this followed by an explanation.

  • > points.of.James > str(points.of.James)
     num [1:6] 12 15 30 25 23 32

Here is what we did

  1. We created the variable ‘points.of.James’
  2. We assigned the values 12, 15,30, 25, 23, 32 to the variable ‘points.of.James’ Using a vector
  3. We used the str() function to analyze the variable ‘points.of.James’
  4. The output told us the following
    1. num means that the vector is numeric
    2. 1:6 shares two pieces of information. The 1 tells us how many dimensions the example, which is one. The six tells us how many values or indices the vector has, which is six. In other words there are six numbers in the variable. Remember the word ‘indices’, which refers to the location of a value within a vector, as it will be important in the future.

Let’s say you are curious to know how many indices a vector has. To figure this out you use the length()  function is demonstrated below.

  • > points.of.James > length(points.of.James)
    [1] 6

You can probably see that the variable ‘points.of.James’ has six values within it.

Another useful function for vectors is the ability to combine them. In the example below, we will combine the variables ‘points.of.James’ and ‘points.of.Kevin’.

  • > points.of.James > points.of.Kevin 
    > all.points > all.points  
    [1] 12 15 30 25 23 32 20 19 25 30 31 22

Here is what happen

  1. We made the variables ‘point.of.James’ and ‘points.of.Kevin’ and inputted the values
  2. We created a new variable called ‘all.points’ and assigned the variables ‘points.of.James’ and ‘points.of.Kevin’ to it.
    1. NOTE: Assigning a variable to another variable means assigning the values of the first variable to the second one. In other words, the values of ‘points.of.James’ and the values of ‘points.of.Kevin’ are now stored in the variable ‘all.points’
  3. We then read the ‘all.points’ variable and it showed us all of the values within it. Which the same values found with ‘points.of.James’ and ‘points.of.Kevin’.

One last useful tool when using vectors is the ability to extract values from a vector. This can be done by using the brackets or [ ]. Extracting values means taking a handful of values from a vector and looking at them alone. Below is an example.

  • > all.points
     [1] 12 15 30 25 23 32 20 19 25 30 31 22
    > all.points[10]
    [1] 30

In the example above we were looking at our ‘all.points’ variable. I typed all.points[10] into R and this told R to do the following

  • Extract the tenth value from the ‘all.points’ variable

R then replies by telling me the tenth value of the ‘all.points’ variable is 30.

Off course, you can extract multiple values as seen below

  • > all.points
     [1] 12 15 30 25 23 32 20 19 25 30 31 22
    > all.points[c(2, 4, 6, 8)]
    [1] 15 25 32 19

In this example, we extracted the second, fourth, sixth, and 8th value from the variable ‘all.points’

This is still just an introduction into how vectors can be used in R.

Basics of Vectors Functions in R

Vectors were discussed in a previous post. In short, vectors are a collection of information. Functions serve the purpose of performing several operations one after another. Now, will combine these two concepts into what are known as vectorized functions. Vectorized functions are functions that perform several operations on a vector. In other words, you can put several numbers into a vector and then have a function do something to all the numbers at the same time.

Here is an example

You have been asked to keep score during basketball season. Below are the number of points per game for a player name James for his first six games. The information is inputted in R as follows

  • > points.of.James <- c(24, 8, 8, 12, 18, 9)
    > points.of.James
    [1] 24 8 8 12 18 9

What we have done so far is create the variable “points.of.James” and assigned the vector of 24, 8, 8, 12, 18, 9. The points represent the number of points he scored in the six games.

The first function we are going to use is the sum() function. This function will add up how many points James scored in the six games. Below is the code for it.

  • > sum(points.of.James)
    [1] 79

Here is what we did

  1. We told R we want the sum of the values in the variable “points.of.James”
  2. The values in the variable “points.of.James” are the vector 24, 8, 8, 12, 18, 9.
  3. R adds the values up and produces the answer 79

Another more complicated example has to do with how functions can combine vectors. For example, let’s say a list of people in the English department and you want to add after their name that they all have the position of lecturer. To do this you need to use the paste() function which will combine the information. Below is an example of how to do this using R.

  • >  faculty.names <- c("Darrin Thomas", "John Williams", "Sarah Smith")
    >  Rank <- "Lecturer"
    >  Rank <- ", Lecturer"
    >  paste(faculty.name, Rank, sep = "")
    [1] "Darrin Thomas, Lecturer" "John Williams, Lecturer" "Sarah Smith, Lecturer"

Here is what we did.

  1. We created a variable called “faculty.names” and we assigned a vector to it that contain the following names
    1. Darrin Thomas
    2. John Williams
    3. Sarah Smith
  2. When then created a variable called “Rank” and assigned the value “, Lecturer”. We put the comma in front of the word Lecturer so that when we combine the faculty.names and rank variables we have a comma after the name of the faculty person and before their rank, which is lecturer.
  3. Next, we combined all of the names of the faculty members with the rank of “Lecturer” using the paste() function.
    1. NOTE: In the paste function we included the argument sep = “” This prevents a space appearing after the name of the person when the comma is inserted. Below is an example of what we do not want
      1. Darrin Thomas , Faculty   (Remeber we do not want this space. That’s why we use the sep = “” argument because it removes the space in front of the comma)
  4. After setting up the paste function we get the “faculty.names” and “Rank” combined which shows the names of the faculty members with their corresponding rank.

Let’s pretend you are dealing with the same problem but now the faculty members have different rankings. Here is the code for how to do this.

  • > faculty.names <- c("Darrin Thomas", "John Williams", "Sarah Smith")  
    > faculty.ranks <- c(", Lecturer", ", Senior Lecturer", ", Principal Lecturer")
    > paste(faculty.names, faculty.ranks, sep="")
    [1] "Darrin Thomas, Lecturer" "John Williams, Senior Lecturer"  
    [3]"Sarah Smith, Principal Lecturer"

All that is new is that we created a variable called “faculty.ranks”, which included the new rankings. Then we used the paste function to combine the names with ranks. As you can see, each person can their corresponding rank. The first name got the first rank, the second name got the second rank, etc. If there is a difference between the number of names and ranks R will recycle values to complete the function

Sourcing Script in R

Sourcing a script has to do with having R running several commands either one after another with stopping. In order to do this R Studio, you need to type your code into Source Editor which is normally at the top left-hand side of the program. We are going to go through an example in which R will ask a question and you will answer it.

In the source editor, type the following information. At the end of each line, press enter to move to the next line. I will explain the meaning of the text after the example

h <- "Welcome to R"
yourname <- readline("What is your name?")
print(paste(h, yourname))

After typing this information, you should click on the source button near the top middle of the computer screen. If this is done correctly you should see the following in the console.

> source('~/.active-rstudio-document')
What is your name?

Type your name in response to the question and press enter. Then you should see the following.

> source('~/.active-rstudio-document')
What is your name?Darrin
[1] "Welcome to R Darrin"

So what exactly did we do?

  1. We made a variable named h and assigned the object “Welcome to R”
  2. Next, we made a variable called yourname and assigned the function readline to read the phrase “What is your name?”. The readline function literally reads text.
  3. We then told R to print a combination (using the paste function) of the h variable first (Welcome to R) and the yourname variable second (What is your name?)
  4. Next, we clicked the source script button
  5. In the console, we were asked the question “What is your name?”
  6. We responded with our name.
  7. R takes your name and combines it with the h variable to create the following phrase
  8. “Welcome to R Darrin”

Several steps of code were run at once through using the script editor. This same process could have been done in the console but using the script editor saves time. Many aspects of programming focus on saving time and improving efficiency. There are always several ways to do something but the goal is always to find the method that takes the least amount of effort and improves the readability of the code.

Making and Using Variables in R

Variables are used in R to store information for computational purposes. It seems that there is almost no limit to what can be stored in a variable. To make a variable, you need to know the following information.

  • The name you want to give the variable
  • The information you want to store in the variable

Here is an example,

You want to make a variable that will store the following test score: 80, 81, 82, 83, 84, 85. You want to call the variable test_scores. Here is what we know

  • The name of the variable is test_scores
  • It will contain the values of 80, 85, 90, 95, 100

Here is how this would look in R.

  • > test_score <- 80:85

There are a few things to explain

  1. the <- sign means “assigned to” in other words, the variable name on the left of the <- sign is being assigned to the values 80:85.
  2. The colon sign stands for a sequence. We wanted all whole numbers from 80 to 85 and the colon sign provides this information.
  3. Both the <- and : are known as operators in R. Operators symbols you place between numbers in order to make a calculation.

Know, type test_scores into the R console and press enter. You should see the following.

  • > test_scores
    [1] 80 81 82 83 84 85

R now shows you everything that is stored in the variable. We can also created other variables and perform calculations through the use of variables. For example, let’s say that you want to add 5 points of extra credit to the scores of the test. To do this let’s make a variable called extra_credit and add test_scores to extra_credit

  • > extra_credit <- 5
    > test_scores + extra_credit
    [1] 85 86 87 88 89 90

As you can see, R took the values of test_scores and add extra_credit to each value in test_scores.  This is much faster than entering each value separately to calculate it. We can also make a new variable for the new scores and we can call it revised_test_scores Let’s try

  • > revised_test_scores <- test_scores + extra_credit
    > revised_test_scores
    [1] 85 86 87 88 89 90

Variables can also be used for text. The only difference is that you must put quotes around the words. Otherwise, the computer will think the words are numbers, which does not make sense. Below is an example,

  • > h <- "Howdy"
    > h
    [1] "Howdy"

Lastly, variables can be used to store vectors. This is very useful in saving a lot of time in performing calculations. We will now  make the variable student_names and assigned a vector to it containing the names of students.

  • > student_names <- c("David", "Edward", "John")
    > student_names
    [1] "David" Edward" "John"

This is only the beginning of some of the amazing features of R.

Introduction to Vectors in R

A key component of R is the use of vectors. Vectors a single piece of information that contents a collection of information. This probably sounds extremely confusing so an example will be provided.

Think of an organization such as a school. We will call the school Asia International School. Asia International school consist of an administrator named Dr. T, teachers named Mr. Bob and Mrs . Smith, and students named Sam, David, and Mary. In this example, the school is consider a vector or a single piece of information (Asia International School). The administrators, teachers, and students are the collection of information within the large piece of information that is the school.

If we wanted to write this example using the R programming language it would look something like the following…

  • > Asia International School(Dr. T, Mr. Bob, Mrs. Smith, Sam, David, Mary)
    [1] Dr. T Mr. Bob Mrs. Smith Sam David Mary

To be fair this is not exactly how it is done this is strictly an imperfect illustration of a very abstract concept.

A Real Example

In order to make a vector in R you need to use c() function. A function is a piece of code that does something to the information that is within its parentheses. For example, let’s make a vector that contains the numbers 1, 2, 3, 4, 5.

  • > c(1,2,3,4,5)
    [1] 1 2 3 4 5

To get the second line you need you press enter

The c means combine. The information inside the parentheses is the information that is being combined into one vector. Going back to our school example, Dr. T, Mr. Bob, Mrs. Smith, Sam, David, and Mary were being combined into the school Asia International. In a vector, all information inside the parentheses is known as arguments.

Conclusion

This is just some of the most basic ideas about vectors. There is a great deal more to explore about the use of vectors which is considered one of the most powerful features of R. The challenge of learning R is with the abstract nature of programming. You have to think of things you want to do in terms of a code that the computer can understand. This is very confusing for most people.