In this post we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Logistic regression is another commonly used option, but the two approaches differ. Both LDA and QDA work best in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier, logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, and the number of math, physics, and chemistry classes a respondent took. Below is some initial code to start our analysis.

`library(MASS);library(Ecdat)`

`data("Mathlevel")`

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that contains the “Mathlevel” dataset with the “NA”s removed using the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data with the “sample” function using an approximate 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code.

```
math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=TRUE,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]
```
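
Because “sample” assigns each row to a group independently, the resulting split is only approximately 70/30. A minimal sketch of how to check the realised proportions follows. The `toy` data frame and the names `toy.ind` and `split.share` are made up for illustration and are not part of the “Mathlevel” analysis.

```r
# Hypothetical stand-in data: 1000 rows, not the Mathlevel data
set.seed(123)
toy <- data.frame(id = 1:1000)

# Same splitting idea as above: label each row 1 (train) or 2 (test)
toy.ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))

# Share of rows in each group; close to, but rarely exactly, 0.7 and 0.3
split.share <- prop.table(table(toy.ind))
print(split.share)
```

The same check on the real data would be `prop.table(table(math.ind))`.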

Now we will make our model, called “lda.math”, which will include all available variables in the “math.train” dataset. Next we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.

```
lda.math<-lda(sex~.,math.train)
lda.math
```

```
## Call:
## lda(sex ~ ., data = math.train)
##
## Prior probabilities of groups:
##      male    female
## 0.5986079 0.4013921
##
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
##
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610
```

`plot(lda.math,type='both')`

Calling “lda.math” gives us the details of our model. It starts by indicating the prior probabilities of someone being male or female. Next are the means for each variable by sex. The last part is the coefficients of the linear discriminant. These coefficients are combined with a respondent's predictor values to produce a discriminant score, and that score determines how likely the respondent is to be classified as male or female. This is similar to a regression equation.
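
To make the link between the coefficients and the scores concrete, the sketch below rebuilds the LD1 scores by hand. It uses two species from R's built-in `iris` data rather than “Mathlevel” so that it runs with only “MASS” installed, and the names `two`, `fit`, `centre`, `X`, and `manual` are illustrative. It assumes that “predict” centres the predictors on the prior-weighted average of the group means before multiplying by the coefficients, which is how “MASS” appears to behave.

```r
library(MASS)

# Illustrative two-class data (an assumption, not the post's data set)
two <- droplevels(subset(iris, Species != "setosa"))
fit <- lda(Species ~ ., two)

# Centre the predictors on the prior-weighted average of the group means
centre <- colSums(fit$prior * fit$means)
X <- as.matrix(two[, colnames(fit$means)])

# Discriminant score = centred predictors times the LD1 coefficients
manual <- scale(X, center = centre, scale = FALSE) %*% fit$scaling

# Compare with the scores from predict(); the difference should be near 0
max(abs(manual - predict(fit)$x))
```

With factor predictors such as “major”, the same idea applies to the dummy-coded columns listed in the coefficient table rather than to the raw columns.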

The plot provides us with densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between males and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between males and females. However, since this is for demonstration purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the predictions of the model with the actual classification. Below is the code.

```
math.lda.predict<-predict(lda.math)
math.train$lda<-math.lda.predict$class
table(math.train$lda,math.train$sex)
```

```
##
##          male female
##   male    219    100
##   female   39     73
```

`mean(math.train$lda==math.train$sex)`

`## [1] 0.6774942`

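As you can see, we have a lot of misclassification happening. With predictions in the rows and actual values in the columns, the largest group of errors is the 100 females who were classified as male, compared with 39 males who were classified as female. The overall accuracy is only about 68%, which is a modest gain over the roughly 60% we would get by classifying every respondent as male.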
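
The accuracy can be rebuilt from the counts in the training table above, and comparing it with the share of the largest class gives a simple baseline. The sketch below types the four counts in by hand. The object names `conf`, `accuracy`, `baseline`, and `recall` are illustrative.

```r
# Training confusion matrix copied from the table above
# (rows = predicted sex, columns = actual sex)
conf <- matrix(c(219, 39, 100, 73), nrow = 2,
               dimnames = list(predicted = c("male", "female"),
                               actual = c("male", "female")))

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Baseline: accuracy from always predicting the most common actual class
baseline <- max(colSums(conf)) / sum(conf)

# Share of each actual class that was classified correctly
recall <- diag(conf) / colSums(conf)

round(c(accuracy = accuracy, baseline = baseline), 3)
round(recall, 3)
```

With these counts the accuracy is about 0.677 against a baseline of about 0.599, and the model recovers roughly 85% of the males but only about 42% of the females.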

We will now conduct the same analysis on the test data set. Below is the code.

```
lda.math.test<-predict(lda.math,math.test)
math.test$lda<-lda.math.test$class
table(math.test$lda,math.test$sex)
```

```
##
##          male female
##   male     92     43
##   female   23     20
```

`mean(math.test$lda==math.test$sex)`

`## [1] 0.6292135`

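As you can see, the results are similar. The test accuracy of about 63% is slightly below the roughly 65% we would get by classifying every test respondent as male (115 of 178). To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better.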

QDA allows each class in the dependent variable to have its own covariance matrix rather than a shared covariance matrix as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.
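
A small simulation can show when separate covariance matrices matter. The sketch below is not based on “Mathlevel”. It simulates two hypothetical groups, `a` and `b`, that share the same mean but differ in spread, a setting where a linear boundary has little to work with. All object names (`sim`, `lda.sim`, `qda.sim`, `lda.acc`, `qda.acc`) are made up for the illustration.

```r
library(MASS)

# Simulated data (an assumption for illustration): same centre, different spread
set.seed(1)
n <- 500
sim <- data.frame(
  x1 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  x2 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  group = factor(rep(c("a", "b"), each = n))
)

# Fit both models on the same simulated data
lda.sim <- lda(group ~ ., sim)
qda.sim <- qda(group ~ ., sim)

# Accuracy on the data the models were fitted to
lda.acc <- mean(predict(lda.sim)$class == sim$group)
qda.acc <- mean(predict(qda.sim)$class == sim$group)
c(lda = lda.acc, qda = qda.acc)
```

In a setting like this QDA should classify noticeably more cases correctly than LDA, because the only difference between the groups is in their spread.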

```
math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train$qda<-math.qda.predict$class
table(math.train$qda,math.train$sex)
```

```
##
##          male female
##   male    215     84
##   female   43     89
```

`mean(math.train$qda==math.train$sex)`

`## [1] 0.7053364`

You can see there is little difference: the training accuracy rises from about 68% with LDA to about 71% with QDA. Below is the code for the test data.

```
math.qda.test<-predict(math.qda.fit,math.test)
math.test$qda<-math.qda.test$class
table(math.test$qda,math.test$sex)
```

```
##
##          male female
##   male     91     43
##   female   24     20
```

`mean(math.test$qda==math.test$sex)`

`## [1] 0.6235955`

Still disappointing. However, in this post we reviewed linear discriminant analysis and learned about the use of quadratic discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes a shared covariance matrix across the categories of the dependent variable, while QDA allows each category to have its own covariance matrix.