
Linear VS Quadratic Discriminant Analysis in R

In this post, we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Logistic regression is another commonly used option, but the two approaches are not interchangeable. Both LDA and QDA work best in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier, logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physics, and chemistry classes a respondent took. Below is some initial code to start our analysis.


The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed using the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data with the “sample” function using a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code.

math.ind <- sample(2, nrow(math), replace = TRUE, prob = c(0.7, 0.3))
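The cleaning and splitting steps described above can be put together roughly as follows. This is a sketch rather than the original code: the seed value 123 is an assumption (the post does not show the value used), so the exact rows in each split will differ from the output shown later.

```r
library(Ecdat)  # provides the Mathlevel dataset

data(Mathlevel)
math <- na.omit(Mathlevel)           # drop rows with missing values

set.seed(123)                        # assumed seed; the original value is not shown
math.ind <- sample(2, nrow(math), replace = TRUE, prob = c(0.7, 0.3))
math.train <- math[math.ind == 1, ]  # roughly 70% of the rows
math.test  <- math[math.ind == 2, ]  # roughly 30% of the rows

dim(math.train)
dim(math.test)
```

Because each row is assigned to a group independently with probability 0.7 or 0.3, the split is only approximately 70/30.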

Now we will make our model, called “lda.math”, which will include all available variables in the “math.train” dataset. Next, we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.
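A minimal sketch of the model-fitting step is shown here. It repeats the data preparation so that it runs on its own; the seed and the `type = "both"` plotting option are assumptions, chosen so that the plot shows histograms with density curves for each group.

```r
library(MASS)   # provides lda()
library(Ecdat)  # provides the Mathlevel dataset

data(Mathlevel)
math <- na.omit(Mathlevel)
set.seed(123)   # assumed seed
math.ind <- sample(2, nrow(math), replace = TRUE, prob = c(0.7, 0.3))
math.train <- math[math.ind == 1, ]

# Fit LDA with sex as the outcome and every other column as a predictor
lda.math <- lda(sex ~ ., data = math.train)
lda.math

# With two groups there is one discriminant function, so the plot shows
# the distribution of LD1 scores for each group
plot(lda.math, type = "both")
```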

## Call:
## lda(sex ~ ., data = math.train)
## Prior probabilities of groups:
##      male    female 
## 0.5986079 0.4013921 
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610


Calling “lda.math” gives us the details of our model. It starts by indicating the prior probabilities of someone being male or female. Next are the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between males and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between males and females. However, since this is for demonstration purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with the actual classification. Below is the code.
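The prediction step might look like the sketch below, which also covers the test set used later. The object names `math.lda.predict`, `train.tab`, and `test.tab` are illustrative and the seed is assumed, so the counts will not match the printed tables exactly.

```r
library(MASS)
library(Ecdat)

data(Mathlevel)
math <- na.omit(Mathlevel)
set.seed(123)  # assumed seed
math.ind <- sample(2, nrow(math), replace = TRUE, prob = c(0.7, 0.3))
math.train <- math[math.ind == 1, ]
math.test  <- math[math.ind == 2, ]

lda.math <- lda(sex ~ ., data = math.train)

# Training data: rows are predicted classes, columns are actual classes
math.lda.predict <- predict(lda.math)
train.tab <- table(math.lda.predict$class, math.train$sex)
train.tab
train.acc <- sum(diag(train.tab)) / sum(train.tab)
train.acc

# Test data: pass the held-out rows through newdata
test.tab <- table(predict(lda.math, newdata = math.test)$class, math.test$sex)
test.tab
test.acc <- sum(diag(test.tab)) / sum(test.tab)
test.acc
```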

##          male female
##   male    219    100
##   female   39     73
## [1] 0.6774942

As you can see, we have a lot of misclassification happening. The largest error is the 100 females who are classified as male, while 39 males are classified as female. The overall accuracy is only 68%, which is not much better than the 60% we would get by predicting “male” for everyone.
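The accuracy figure can be checked directly from the printed training table. The sketch below assumes that rows are predictions and columns are actual values; this reading is consistent with the column totals (258 and 173), which reproduce the prior probabilities reported by the model.

```r
# Confusion matrix copied from the training output above
conf <- matrix(c(219, 100,
                  39,  73),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("male", "female"),
                               actual    = c("male", "female")))

# Accuracy: correct predictions (the diagonal) over all cases
accuracy <- sum(diag(conf)) / sum(conf)
accuracy   # 292 / 431, the 0.6774942 shown in the output

# Baseline: always predict the most common actual class
baseline <- max(colSums(conf)) / sum(conf)
baseline   # 258 / 431, the prior probability of "male"
```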

We will now conduct the same analysis on the test data set. Below is the code.

##          male female
##   male     92     43
##   female   23     20
## [1] 0.6292135

As you can see, the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females, as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better.

QDA allows for each class in the dependent variable to have its own covariance rather than a shared covariance as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.
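A sketch of the QDA version follows, mirroring the LDA steps. The seed and the object names are assumptions. Note that “qda” estimates a separate covariance matrix per group, so with a different random split it can stop with a rank deficiency error if a factor level is missing from one group in the training data.

```r
library(MASS)   # provides qda()
library(Ecdat)

data(Mathlevel)
math <- na.omit(Mathlevel)
set.seed(123)   # assumed seed
math.ind <- sample(2, nrow(math), replace = TRUE, prob = c(0.7, 0.3))
math.train <- math[math.ind == 1, ]
math.test  <- math[math.ind == 2, ]

# Same formula as the LDA model, but each class gets its own covariance matrix
qda.math <- qda(sex ~ ., data = math.train)

# Training accuracy (rows = predicted, columns = actual)
qda.train.tab <- table(predict(qda.math)$class, math.train$sex)
qda.train.tab
sum(diag(qda.train.tab)) / sum(qda.train.tab)

# Test accuracy
qda.test.tab <- table(predict(qda.math, newdata = math.test)$class, math.test$sex)
qda.test.tab
sum(diag(qda.test.tab)) / sum(qda.test.tab)
```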

##          male female
##   male    215     84
##   female   43     89
## [1] 0.7053364

You can see there is almost no difference: training accuracy rises only from about 68% to 71%. Below is the code for the test data.

##          male female
##   male     91     43
##   female   24     20
## [1] 0.6235955

Still disappointing. However, in this post, we reviewed linear discriminant analysis and learned about the use of quadratic discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes a shared covariance across the dependent variable categories, while QDA allows each category in the dependent variable to have its own covariance.


Linear Discriminant Analysis in R

In this post we will look at an example of linear discriminant analysis (LDA). LDA is used to develop a statistical model that classifies examples in a dataset. In the example in this post, we will use the “Star” dataset from the “Ecdat” package. What we will do is try to predict the type of class the students learned in (regular, small, regular with aide) using their math scores, reading scores, and the teaching experience of the teacher. Below is the initial code.


We first need to examine the data by using the “str” function

## 'data.frame':    5748 obs. of  8 variables:
##  $ tmathssk: int  473 536 463 559 489 454 423 500 439 528 ...
##  $ treadssk: int  447 450 439 448 447 431 395 451 478 455 ...
##  $ classk  : Factor w/ 3 levels "regular","small.class",..: 2 2 3 1 2 1 3 1 2 2 ...
##  $ totexpk : int  7 21 0 16 5 8 17 3 11 10 ...
##  $ sex     : Factor w/ 2 levels "girl","boy": 1 1 2 2 2 2 1 1 1 1 ...
##  $ freelunk: Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $ race    : Factor w/ 3 levels "white","black",..: 1 2 2 1 1 1 2 1 2 1 ...
##  $ schidkn : int  63 20 19 69 79 5 16 56 11 66 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5850] 1 4 6 7 8 9 10 15 16 17 ...
##   .. ..- attr(*, "names")= chr [1:5850] "1" "4" "6" "7" ...

We will use the following variables

  • dependent variable = classk (class type)
  • independent variable = tmathssk (Math score)
  • independent variable = treadssk (Reading score)
  • independent variable = totexpk (Teaching experience)

We now need to examine the data visually by looking at histograms for our independent variables and a table for our dependent variable
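The histograms and the table of class proportions could be produced along these lines. This is a sketch using base graphics; the original plotting code is not shown in the post.

```r
library(Ecdat)  # provides the Star dataset
data(Star)

# Histograms of the three independent variables
hist(Star$tmathssk, main = "Math score")
hist(Star$treadssk, main = "Reading score")
hist(Star$totexpk,  main = "Teaching experience")

# Share of students in each class type
class.share <- prop.table(table(Star$classk))
class.share
```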







##           regular       small.class regular.with.aide 
##         0.3479471         0.3014962         0.3505567

The data mostly looks good. The results of the “prop.table” function will help us when we develop our training and testing datasets. The only problem is with the “totexpk” variable. It is not anywhere near normally distributed. To deal with this we will use the square root of teaching experience. Below is the code.
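A minimal sketch of the transformation is given here. The names `star.sqrt` and `totexpk.sqrt` are taken from the output and model formula that appear later in the post.

```r
library(Ecdat)
data(Star)

# Copy the data and add the square root of teaching experience
star.sqrt <- Star
star.sqrt$totexpk.sqrt <- sqrt(star.sqrt$totexpk)

# Check the shape of the transformed variable
hist(star.sqrt$totexpk.sqrt, main = "Square root of teaching experience")
```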



Much better. We now need to check the correlation among the variables as well, using the code below.
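The column names in the output that follows suggest the correlation matrix was built from a small data frame of the three predictors. A sketch under that assumption:

```r
library(Ecdat)
data(Star)
star.sqrt <- Star
star.sqrt$totexpk.sqrt <- sqrt(star.sqrt$totexpk)

# Collect the three predictors and compute pairwise Pearson correlations
predictors <- data.frame(star.sqrt$tmathssk,
                         star.sqrt$treadssk,
                         star.sqrt$totexpk.sqrt)
cor.mat <- cor(predictors)
cor.mat
```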

##                        star.sqrt.tmathssk star.sqrt.treadssk
## star.sqrt.tmathssk             1.00000000          0.7135489
## star.sqrt.treadssk             0.71354889          1.0000000
## star.sqrt.totexpk.sqrt         0.08647957          0.1045353
##                        star.sqrt.totexpk.sqrt
## star.sqrt.tmathssk                 0.08647957
## star.sqrt.treadssk                 0.10453533
## star.sqrt.totexpk.sqrt             1.00000000

None of the correlations are too bad. The strongest, between the math and reading scores, is 0.71. We can now develop our model using linear discriminant analysis. First, we need to scale our scores because the test scores and the teaching experience are measured differently. Then, we need to divide our data into a train and test set as this will allow us to determine the accuracy of the model. Below is the code.
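Scaling and splitting might be done as sketched below. The seed and the use of “sample” for the split are assumptions; the names `train.star` and `test.star` match the ones used in the rest of the post.

```r
library(Ecdat)
data(Star)
star.sqrt <- Star
star.sqrt$totexpk.sqrt <- sqrt(star.sqrt$totexpk)

# Standardize each predictor to mean 0 and standard deviation 1
for (v in c("tmathssk", "treadssk", "totexpk.sqrt")) {
  star.sqrt[[v]] <- as.numeric(scale(star.sqrt[[v]]))
}

# Roughly 70/30 split into training and test sets
set.seed(123)  # assumed seed
train.ind <- sample(2, nrow(star.sqrt), replace = TRUE, prob = c(0.7, 0.3))
train.star <- star.sqrt[train.ind == 1, ]
test.star  <- star.sqrt[train.ind == 2, ]
```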


Now we develop our model. In the code below, the “prior” argument indicates what we expect the probabilities to be. In our data the distribution of the three class types is about the same, which means that the a priori probability is 1/3 for each class type.

train.lda <- lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star, prior = c(1, 1, 1)/3)
train.lda
## Call:
## lda(classk ~ tmathssk + treadssk + totexpk.sqrt, data = train.star, 
##     prior = c(1, 1, 1)/3)
## Prior probabilities of groups:
##           regular       small.class regular.with.aide 
##         0.3333333         0.3333333         0.3333333 
## Group means:
##                      tmathssk    treadssk totexpk.sqrt
## regular           -0.04237438 -0.05258944  -0.05082862
## small.class        0.13465218  0.11021666  -0.02100859
## regular.with.aide -0.05129083 -0.01665593   0.09068835
## Coefficients of linear discriminants:
##                      LD1         LD2
## tmathssk      0.89656393 -0.04972956
## treadssk      0.04337953  0.56721196
## totexpk.sqrt -0.49061950  0.80051026
## Proportion of trace:
##    LD1    LD2 
## 0.7261 0.2739

The printout is mostly readable. At the top is the actual code used to develop the model, followed by the prior probabilities of each group. The next section shares the means of the groups. The coefficients of linear discriminants are the values used to classify each example, and they are similar to regression coefficients. The computer places each example in both equations and probabilities are calculated. Whichever class has the highest probability is the winner. In addition, the larger the absolute value of a coefficient, the more weight it has. For example, “tmathssk” is the most influential on LD1 with a coefficient of 0.90.
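To make the idea of "placing each example in both equations" concrete, the sketch below scores one made-up student by hand using the group means and coefficients printed above. The student's values are hypothetical, and the centering step reflects my understanding of how “predict” for an “lda” object works (predictors are centered on the prior-weighted average of the group means before the coefficients are applied). With equal priors, the predicted class is the group whose centroid is closest in discriminant space.

```r
# Values copied from the printed model output
prior <- c(1, 1, 1) / 3
means <- rbind(regular           = c(-0.04237438, -0.05258944, -0.05082862),
               small.class       = c( 0.13465218,  0.11021666, -0.02100859),
               regular.with.aide = c(-0.05129083, -0.01665593,  0.09068835))
colnames(means) <- c("tmathssk", "treadssk", "totexpk.sqrt")
scaling <- cbind(LD1 = c( 0.89656393, 0.04337953, -0.49061950),
                 LD2 = c(-0.04972956, 0.56721196,  0.80051026))
rownames(scaling) <- colnames(means)

# Hypothetical student, already on the scaled (z-score) metric
x <- c(tmathssk = 1, treadssk = 0.5, totexpk.sqrt = -0.2)

# Center on the prior-weighted mean, then apply both discriminant functions
center <- colSums(prior * means)
scores <- (x - center) %*% scaling
scores

# Group centroids in discriminant space and squared distance to each
centroids <- sweep(means, 2, center) %*% scaling
dist2 <- rowSums((centroids - matrix(scores, nrow = 3, ncol = 2, byrow = TRUE))^2)
predicted <- names(which.min(dist2))
predicted
```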

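The proportion of trace is similar to the proportion of variance explained in principal component analysis. Here, LD1 accounts for about 73% of the separation between the groups and LD2 for the remaining 27%.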

Now we will take the trained model and see how it does with the test set. We create a new object called “predict.lda” using our “train.lda” model and the test data called “test.star”.

predict.lda<-predict(train.lda,newdata = test.star)

We can use the “table” function to see how well our model has done. We can do this because we actually know what class our data is beforehand because we divided the dataset. What we need to do is compare this to what our model predicted. Therefore, we compare the “classk” variable of our “test.star” dataset with the “class” predicted by the “predict.lda” model.

##                     regular small.class regular.with.aide
##   regular               155         182               249
##   small.class           145         198               174
##   regular.with.aide     172         204               269

The results are pretty bad. For example, in the first row called “regular” we have 155 examples that were classified as “regular” and predicted as “regular” by the model. In the next column, 182 examples were classified as “regular” but predicted as “small.class”, etc. To find out how well our model did, you add together the examples along the diagonal from top left to bottom right and divide by the total number of examples. Below is the code.
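The calculation can be reproduced from the printed table alone. In this sketch the matrix is typed in by hand, with rows as the actual class and columns as the predicted class, as described above.

```r
# Confusion matrix copied from the test-set output above
classes <- c("regular", "small.class", "regular.with.aide")
conf <- matrix(c(155, 182, 249,
                 145, 198, 174,
                 172, 204, 269),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy   # 622 / 1748
```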

## [1] 0.3558352

Only 36% accurate, which is barely better than the 33% expected from guessing among three equally likely classes. This is terrible but ok for a demonstration of linear discriminant analysis. Since we only have two functions, or two dimensions, we can plot our model. Below I provide a visual of the first 50 examples classified by the predict.lda model.



The first function, which is the vertical line, doesn’t seem to discriminate anything, as it is off to the side and not separating any of the data. However, the second function, which is the horizontal one, does a good job of dividing the “regular.with.aide” from the “small.class”. Yet, there are problems with distinguishing the class “regular” from either of the other two groups. In order to improve our model we need additional independent variables to help to distinguish the groups in the dependent variable.

Generalized Additive Models in R

In this post, we will learn how to create a generalized additive model (GAM). GAMs are a non-parametric extension of generalized linear models. This means that the linear predictor of the model uses smooth functions of the predictor variables. As such, you do not need to specify the functional relationship between the response and continuous variables. This allows you to explore the data for potential relationships that can be more rigorously tested with other statistical models.

In our example, we will use the “Auto” dataset from the “ISLR” package and use the variables “mpg”, “displacement”, “horsepower”, and “weight” to predict “acceleration”. We will also use the “mgcv” package. Below is some initial code to begin the analysis.


We will now make the model. We want to understand the response of “acceleration” to the explanatory variables “mpg”, “displacement”, “horsepower”, and “weight”. After setting the model we will examine the summary. Below is the code.
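A sketch of the model fit is shown here, using the formula that appears in the summary output. The name `model1` follows the labels used later in the post.

```r
library(mgcv)  # provides gam() and s()
library(ISLR)  # provides the Auto dataset

# One smooth term per explanatory variable; gaussian family by default
model1 <- gam(acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight),
              data = Auto)
summary(model1)
```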

## Family: gaussian 
## Link function: identity 
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight)
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.54133    0.07205   215.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Approximate significance of smooth terms:
##                   edf Ref.df      F  p-value    
## s(mpg)          6.382  7.515  3.479  0.00101 ** 
## s(displacement) 1.000  1.000 36.055 4.35e-09 ***
## s(horsepower)   4.883  6.006 70.187  < 2e-16 ***
## s(weight)       3.785  4.800 41.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R-sq.(adj) =  0.733   Deviance explained = 74.4%
## GCV = 2.1276  Scale est. = 2.0351    n = 392

All of the explanatory variables are significant and the adjusted R-squared is .73, which is excellent. “edf” stands for “effective degrees of freedom”. This modified version of the degrees of freedom is due to the smoothing process in the model. GCV stands for generalized cross validation and this number is useful when comparing models. The model with the lowest number is the better model.

We can also examine the model visually by using the “plot” function. This will allow us to examine if the curvature fitted by the smoothing process was useful or not for each variable. Below is the code.
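The smooth terms can be plotted as sketched below. The `pages = 1` argument is my choice, so that all four smooths appear on a single page.

```r
library(mgcv)
library(ISLR)

model1 <- gam(acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight),
              data = Auto)

# One panel per smooth term, each with its confidence band
plot(model1, pages = 1)
```

A smooth that is close to a straight line (an edf near 1, as for “displacement” in the summary) suggests the curvature added little for that variable.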



We can also look at a 3D graph that includes the linear predictor as well as the two strongest predictors. This is done with the “vis.gam” function. Below is the code.
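A sketch of the 3D plot is given here. Choosing “horsepower” and “weight” as the two strongest predictors is an assumption based on their F values in the summary, and the viewing angle `theta` is arbitrary.

```r
library(mgcv)
library(ISLR)

model1 <- gam(acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight),
              data = Auto)

# Perspective plot of the linear predictor over two of the explanatory variables;
# the remaining variables are held at typical values by vis.gam
vis.gam(model1, view = c("horsepower", "weight"), type = "link", theta = 30)
```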



If multiple models are developed, you can compare the GCV values to determine which model is the best. Another way to compare models is with the “AIC” function. In the code below, we will create an additional model that includes “year”, compare the GCV scores, and calculate the AIC. Below is the code.
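The comparison could be run as sketched here. The GCV score is read from the `gcv.ubre` element of each fitted model, which matches the `GCV.Cp` label in the output below; `aic.table` is an illustrative name.

```r
library(mgcv)
library(ISLR)

model1 <- gam(acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight),
              data = Auto)
# Second model adds a smooth of year
model2 <- gam(acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) +
                s(year), data = Auto)
summary(model2)

# GCV scores: lower is better
model1$gcv.ubre
model2$gcv.ubre

# AIC comparison: lower is better
aic.table <- AIC(model1, model2)
aic.table
```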

## Family: gaussian 
## Link function: identity 
## Formula:
## acceleration ~ s(mpg) + s(displacement) + s(horsepower) + s(weight) + 
##     s(year)
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.54133    0.07203   215.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value    
## s(mpg)          5.578  6.726  2.749  0.0106 *  
## s(displacement) 2.251  2.870 13.757 3.5e-08 ***
## s(horsepower)   4.936  6.054 66.476 < 2e-16 ***
## s(weight)       3.444  4.397 34.441 < 2e-16 ***
## s(year)         1.682  2.096  0.543  0.6064    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R-sq.(adj) =  0.733   Deviance explained = 74.5%
## GCV = 2.1368  Scale est. = 2.0338    n = 392
#model1 GCV
##   GCV.Cp 
## 2.127589
#model2 GCV
##   GCV.Cp 
## 2.136797

As you can see, the second model has a higher GCV score when compared to the first model. This indicates that the first model is a better choice. This makes sense because in the second model the variable “year” is not significant. To confirm this we will calculate the AIC scores using the AIC function.

##              df      AIC
## model1 18.04952 1409.640
## model2 19.89068 1411.156

Again, you can see that model1 is better due to its fewer degrees of freedom and slightly lower AIC score.


GAMs are most commonly used for exploring potential relationships in your data, because they are difficult to interpret and summarize. Therefore, it is normally better to develop a generalized linear model over a GAM due to the difficulty in understanding what the data is trying to tell you when using GAMs.

Theoretical Distribution and R

This post will explore an example of testing if a dataset fits a specific theoretical distribution. This is a very important aspect of statistical modeling as it allows you to understand the normality of the data and the appropriate steps needed to prepare for analysis.

In our example, we will use the “Auto” dataset from the “ISLR” package. We will check if the horsepower of the cars in the dataset is normally distributed or not. Below is some initial code to begin the process.


Determining if a dataset is normally distributed is simple in R. This is normally done visually by making a Quantile-Quantile plot (Q-Q plot). It involves using two functions, “qqnorm” and “qqline”. Below is the code for the Q-Q plot.
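A minimal sketch of the Q-Q plot, assuming the “Auto” data is loaded from the “ISLR” package:

```r
library(ISLR)  # provides the Auto dataset

# Sample quantiles of horsepower against theoretical normal quantiles
qqnorm(Auto$horsepower)
```

If the data were normally distributed, the points would fall roughly along a straight line.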



We now need to add the Q-Q line to see how our distribution lines up with the theoretical normal one. Below is the code. Note that we have to repeat the code above in order to get the completed plot.

qqnorm(Auto$horsepower)
qqline(Auto$horsepower, distribution = qnorm, probs = c(.25, .75))


The “qqline” function needs the data you want to test as well as the distribution and probabilities. The distribution we wanted is normal and is indicated by the argument “qnorm”. The “probs” argument gives the two quantiles through which the line is drawn. The default values are .25 and .75, the first and third quartiles. The resulting graph indicates that “horsepower” in the “Auto” dataset is not normally distributed. There are particular problems with the lower and upper values.

We can confirm our suspicion by running a statistical test. The Anderson-Darling test from the “nortest” package will allow us to test whether our data is normally distributed or not. The code is below
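The test could be run as sketched below, using the “ad.test” function from “nortest”. The name `ad.result` is illustrative.

```r
library(nortest)  # provides ad.test()
library(ISLR)

# Null hypothesis: horsepower comes from a normal distribution
ad.result <- ad.test(Auto$horsepower)
ad.result
```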

##  Anderson-Darling normality test
## data:  Auto$horsepower
## A = 12.675, p-value < 2.2e-16

From the results, we can conclude that the data is not normally distributed. This could mean that we may need to use non-parametric tools for statistical analysis.

We can further explore our distribution in terms of its skew and kurtosis. Skew measures how far to the left or right the data leans and kurtosis measures how peaked or flat the data is. This is done with the “fBasics” package and the functions “skewness” and “kurtosis”.

First we will deal with skewness. Below is the code for calculating skewness.

## [1] 1.079019
## attr(,"method")
## [1] "moment"

We now need to determine if this value of skewness is significantly different from zero. This is done with a simple t-test. We must calculate the t-value before calculating the probability. The standard error of the skew is approximated as the square root of six divided by the sample size, and the t-value is the skewness divided by this standard error. The code is below.

## [1] 8.721607
## attr(,"method")
## [1] "moment"

Now we take the t-value for horsepower and plug it into the “pt” function (t probability) along with the degrees of freedom (sample size minus 1 = 391). Because “pt” returns the lower-tail probability, we subtract the result from 1 to get the p-value. Below is the code.
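Both steps can be reproduced from the printed skewness and the sample size alone. The sketch below uses those two numbers rather than the data, so it needs no extra packages; the variable names are illustrative.

```r
skew <- 1.079019   # skewness of horsepower, from the output above
n    <- 392        # number of cars in the Auto dataset

se.skew <- sqrt(6 / n)      # approximate standard error of the skewness
t.skew  <- skew / se.skew   # t-value, about 8.72
t.skew

# Upper-tail probability with n - 1 degrees of freedom
p.skew <- 1 - pt(t.skew, df = n - 1)
p.skew
```

The p-value is so small that it prints as zero.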

## [1] 0
## attr(,"method")
## [1] "moment"

The p-value of zero means that we reject the null hypothesis that the skew is equal to zero and conclude that the skew is different from zero. However, the value of the skew was only 1.1, which is not severely non-normal.

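We will now repeat this process for the kurtosis. The only difference is that the standard error is the square root of 24 divided by the sample size, rather than 6 divided by the sample size. The three values below are the excess kurtosis, the t-value, and the p-value.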
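As with the skewness, the arithmetic can be checked from the printed excess kurtosis and the sample size. This is a sketch with illustrative variable names.

```r
kurt <- 0.6541069  # excess kurtosis of horsepower, from the output below
n    <- 392

se.kurt <- sqrt(24 / n)     # approximate standard error of the kurtosis
t.kurt  <- kurt / se.kurt   # t-value, about 2.64
t.kurt

# Upper-tail probability with n - 1 degrees of freedom
p.kurt <- 1 - pt(t.kurt, df = n - 1)
p.kurt
```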

## [1] 0.6541069
## attr(,"method")
## [1] "excess"
## [1] 2.643542
## attr(,"method")
## [1] "excess"
## [1] 0.004267199
## attr(,"method")
## [1] "excess"

Again the p-value is small (about 0.004), which means that the kurtosis is significantly different from zero. With an excess kurtosis of 0.65 this is not that bad. However, with both skew and kurtosis departing from their normal values, it is no surprise that our overall distribution was not normal either.


This post provided insights into assessing the normality of a dataset. Visual inspection can take place using Q-Q plots. Statistical inspection can be done through hypothesis testing along with checking skew and kurtosis.

The History and Characteristics of R

R is a programming language and software environment that is used for the development of graphic data products and the computation of many forms of mathematics. The history of R goes back about 20-30 years. This post will look at the history of R as well as the characteristics of this software.

The History

Ross Ihaka and Robert Gentleman are the developers of R. R is actually based on an older programming language known as S, which goes back to the 1970s. Ihaka and Gentleman developed their own programming language while working together in New Zealand. With the release of R in the early 1990s, several people joined the project to help to improve it. By 1995, the software had become “open-sourced”, which means that anyone can use and modify it for themselves without cost. In 2000, version 1.0 of R was released to the public.

Characteristics of R

In many people's opinion, the best feature of R is the price. Being free, R is by far one of the best software packages for statistical analysis if price is the most important criterion. SPSS and SAS are also great and more user-friendly; however, their price is completely outlandish for most individual researchers. R removes this problem completely.

R also has an active community around it that supports its development. For example, people are able to develop packages that provide assistance in running various tasks in the R software. Naturally, most packages are free as well. The focus on community has enabled R to be run on almost any operating system as well, such as Windows, OS X, or Linux.

R also allows people to make graphs and data products. The graphs are actually very well made. The drawback is understanding the coding necessary to develop these various products. This is discussed more below.

One major drawback that affects the typical computer user is learning the programming language of R. This can be challenging for those who are not techie or able to think abstractly in computer codes. I have never seen a satisfactory way to get around this problem but to crack open a book and practice, practice, practice. With time, the code will start to make sense but it is not a five-minute process for someone who has not studied programming.


Despite the challenges of learning computer programming, R is becoming the software of choice for many. The benefits far outweigh the problems for many individuals. Personally, I am looking forward to continuing to develop skills in understanding this dynamic software.

Population vs Sample

In statistics, one of the most fundamental concepts is the distinction between a population and a sample. A population is all the members of a group. For example, if my population is the United States, I would have to collect data from everyone in the country. This is, to say the least, very challenging.

To deal with this, most studies take a sample from the population. A sample is a portion of the population. Continuing our example, instead of collecting data from everyone in the US, I would collect data from several hundred or several thousand people, depending on the research question of my study.

There are several different sampling techniques that will be covered later. For now, the most important thing to remember is that your research questions and the circumstances of the study influence what steps you take. There is rarely one way to do this.