Linear VS Quadratic Discriminant Analysis in R

In this post we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Another commonly used option is logistic regression but there are differences between logistic regression and discriminant analysis. Both LDA and QDA are used in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physic, and chemistry classes a respondent took. Below is some initial code to start our analysis.

library(MASS);library(Ecdat)
data("Mathlevel")

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed use the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function using a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code

math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]

Now we will make our model and it is called “lda.math” and it will include all available variables in the “math.train” dataset. Next we will check the results by calling the modle. Finally, we will examine the plot to see how our model is doing. Below is the code.

lda.math<-lda(sex~.,math.train)
lda.math
## Call:
## lda(sex ~ ., data = math.train)
## 
## Prior probabilities of groups:
##      male    female 
## 0.5986079 0.4013921 
## 
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
## 
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610
plot(lda.math,type='both')

1.png

Calling “lda.math” gives us the details of our model. It starts be indicating the prior probabilities of someone being male or female. Next is the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between male and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between male and females. However, since this is for demonstrating purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with thee actual classification. Below is the code.

math.lda.predict<-predict(lda.math)
math.train$lda<-math.lda.predict$class
table(math.train$lda,math.train$sex)
##         
##          male female
##   male    219    100
##   female   39     73
mean(math.train$lda==math.train$sex)
## [1] 0.6774942

As you can see, we have a lot of misclassification happening. A large amount of false negatives which is a lot of males being classified as female. The overall accuracy us only 59% which is not much better than chance.

We will now conduct the same analysis on the test data set. Below is the code.

lda.math.test<-predict(lda.math,math.test)
math.test$lda<-lda.math.test$class
table(math.test$lda,math.test$sex)
##         
##          male female
##   male     92     43
##   female   23     20
mean(math.test$lda==math.test$sex)
## [1] 0.6292135

As you can see the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better

QDA allows for each class in the dependent variable to have it’s own covariance rather than a shared covariance as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train$qda<-math.qda.predict$class
table(math.train$qda,math.train$sex)
##         
##          male female
##   male    215     84
##   female   43     89
mean(math.train$qda==math.train$sex)
## [1] 0.7053364

You can see there is almost no difference. Below is the code for the test data.

math.qda.test<-predict(math.qda.fit,math.test)
math.test$qda<-math.qda.test$class
table(math.test$qda,math.test$sex)
##         
##          male female
##   male     91     43
##   female   24     20
mean(math.test$qda==math.test$sex)
## [1] 0.6235955

Still disappointing. However, in this post we reviewed linear discriminant analysis as well as learned about the use of quadratic linear discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes shared covariance in the dependent variable categories will QDA allows for each category in the dependent variable to have it’s own variance.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s