The goal of this post is to explain the salary of a baseball player using several variables. We will see how to test various assumptions of multiple regression as well as how to deal with missing data. The first thing we need to do is load our data. The data come from the “ISLR” package, and we will use the “Hitters” dataset. There are 20 variables in the dataset, as shown by the “str” function.

```
#Load data
library(ISLR)
data("Hitters")
str(Hitters)
```

```
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
```

We now need to assess the amount of missing data. This is important because missing data can cause major problems with many analyses. We are going to create a simple function that reports the percentage of missing data for each variable in the “Hitters” dataset. After defining the function, we use the “apply” function to display the amount of missing data by column and by row.

```
Missing_Data <- function(x){sum(is.na(x))/length(x)*100}
apply(Hitters,2,Missing_Data)
```

```
##    AtBat     Hits    HmRun     Runs      RBI    Walks    Years
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
##   CAtBat    CHits   CHmRun    CRuns     CRBI   CWalks   League
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
## Division  PutOuts  Assists   Errors   Salary NewLeague
##  0.00000  0.00000  0.00000  0.00000 18.32298  0.00000
```

`apply(Hitters,1,Missing_Data)`
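Since the row-wise output is 322 numbers long, a quick way to summarize it is to tabulate the percentages. This is a small sketch, not part of the original post; it reuses the same `Missing_Data` logic defined above.

```r
library(ISLR)
data("Hitters")

# Tabulate rows by their percentage of missing values.
# With 20 variables and NAs only in Salary, a row is either
# 0% missing (complete) or 1/20 = 5% missing.
table(apply(Hitters, 1, function(x) sum(is.na(x)) / length(x) * 100))
# 263 rows are complete; 59 rows are missing one value (5%)
```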

By column, we can see that all of the missing data is in the “Salary” variable, which is missing about 18% of its values. By row (not displayed here), a row is missing anywhere from 0% to 5% of its data. The 5% comes from the fact that there are 20 variables and only the “Salary” variable has missing data; therefore 1/20 = 5% missing data for such a row. To deal with the missing data, we will use the ‘mice’ package. You can install it yourself and run the following code.

`library(mice)`

`md.pattern(Hitters)`

```
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 263 1 1 1 1 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0 0
## CWalks League Division PutOuts Assists Errors NewLeague Salary
## 263 1 1 1 1 1 1 1 1 0
## 59 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 59 59
```

`Hitters1 <- mice(Hitters,m=5,maxit=50,meth='pmm',seed=500)`

`summary(Hitters1)`

```
## Multiply imputed data set
## Call:
## mice(data = Hitters, m = 5, method = "pmm", maxit = 50, seed = 500)
```

In the code above we did the following:

- Loaded the ‘mice’ package.
- Ran the ‘md.pattern’ function.
- Made a new variable called ‘Hitters1’ and ran the ‘mice’ function on it. This function made 5 datasets (m = 5) and used predictive mean matching to estimate each missing data point (method = ‘pmm’). The seed is set for the purpose of reproducing the results.

The ‘md.pattern’ function indicates that there are 263 complete cases and 59 incomplete ones, and that all the missing data is in the ‘Salary’ variable. The ‘mice’ function shares various information about how the missing data was dealt with. It makes five guesses for each missing data point, and you can view the guesses for each row by the name of the baseball player. We will then select the first dataset as our new dataset and continue the analysis using the ‘complete’ function from the ‘mice’ package.

```
#View Imputed data
Hitters1$imp$Salary
```

```
#Make Complete Dataset
completedData <- complete(Hitters1,1)
```
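Before committing to the first imputed dataset, an optional sanity check (not in the original post) is to compare the distributions of observed and imputed salaries using the lattice-based `densityplot` method that ‘mice’ provides for its imputation objects.

```r
# Hitters1 is the mids object created by mice() above
library(mice)
library(lattice)  # densityplot() is a lattice generic

# Observed salaries are drawn in blue, the five imputed sets in red;
# broadly similar shapes suggest the pmm imputations are plausible
densityplot(Hitters1, ~Salary)
```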

Now we need to deal with the normality of each variable, which is the first assumption we will check. To save time, I will only explain how I dealt with the non-normal variables. The two variables that were non-normal were “Salary” and “Years”. To fix these two variables, I did a log transformation of the data. The new variables are called “log_Salary” and “log_Years”. Below is the code along with the before-and-after histograms.

```
#Histogram of Salary
hist(completedData$Salary)
```

```
#log transformation of Salary
completedData$log_Salary<-log(completedData$Salary)
#Histogram of transformed salary
hist(completedData$log_Salary)
```

```
#Histogram of Years
hist(completedData$Years)
```

```
#Log transformation of Years
completedData$log_Years<-log(completedData$Years)
#Histogram of transformed Years
hist(completedData$log_Years)
```
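To go beyond eyeballing histograms, a formal normality check such as the Shapiro-Wilk test can be run before and after the transformation. This is a sketch, not part of the original analysis; the log-transformed salary is typically much closer to normal than the raw values.

```r
# Shapiro-Wilk test on raw and log-transformed salaries;
# a W statistic closer to 1 after the transform indicates reduced skew
shapiro.test(completedData$Salary)
shapiro.test(completedData$log_Salary)
```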

We can now run our regression analysis and produce the residual plots in order to check the assumptions of homoscedasticity and linearity. Below is the code.

```
Salary_Model<-lm(log_Salary~Hits+HmRun+Walks+log_Years+League, data=completedData)
#Residual Plot checks Linearity
plot(Salary_Model)
```

When using the ‘plot’ function you will get several plots. The first is the residuals vs. fitted plot, which assesses linearity. The next is the Q-Q plot, which shows whether the residuals are normally distributed. The scale-location plot indicates whether there is equal variance (homoscedasticity). The residuals vs. leverage plot is used for finding outliers. All plots look good.
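By default, `plot()` shows the four diagnostics one at a time; a common convenience is to arrange them in a single panel:

```r
# Display all four diagnostic plots in one 2x2 grid
par(mfrow = c(2, 2))
plot(Salary_Model)
par(mfrow = c(1, 1))  # restore the default single-plot layout
```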

`summary(Salary_Model)`

```
##
## Call:
## lm(formula = log_Salary ~ Hits + HmRun + Walks + log_Years +
## League, data = completedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1052 -0.3649 0.0171 0.3429 3.2139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8790683 0.1098027 35.328 < 2e-16 ***
## Hits 0.0049427 0.0009928 4.979 1.05e-06 ***
## HmRun 0.0081890 0.0046938 1.745 0.08202 .
## Walks 0.0063070 0.0020284 3.109 0.00205 **
## log_Years 0.6390014 0.0429482 14.878 < 2e-16 ***
## League2 0.1217445 0.0668753 1.820 0.06963 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5869 on 316 degrees of freedom
## Multiple R-squared: 0.5704, Adjusted R-squared: 0.5636
## F-statistic: 83.91 on 5 and 316 DF, p-value: < 2.2e-16
```

Furthermore, the model explains 57% of the variance in salary. All variables (Hits, HmRun, Walks, log_Years, and League) are significant at the 0.1 level. Our last step is to examine the correlations among the variables. To do this, we need to make a correlation matrix, which requires removing the variables that are not part of our study. We also need to load the “Hmisc” package and use the ‘rcorr’ function to produce the matrix along with the p-values. Below is the code.

```
#find correlation
completedData1 <- completedData
completedData1$AtBat <- NULL
completedData1$Runs <- NULL
completedData1$RBI <- NULL
completedData1$CAtBat <- NULL
completedData1$CHits <- NULL
completedData1$CHmRun <- NULL
completedData1$CRuns <- NULL
completedData1$CRBI <- NULL
completedData1$CWalks <- NULL
completedData1$League <- NULL
completedData1$Division <- NULL
completedData1$PutOuts <- NULL
completedData1$Assists <- NULL
completedData1$Errors <- NULL
completedData1$NewLeague <- NULL
completedData1$Years <- NULL
completedData1$Salary <- NULL
library(Hmisc)
```

`rcorr(as.matrix(completedData1))`

```
## Hits HmRun Walks log_Salary log_Years
## Hits 1.00 0.56 0.64 0.47 0.13
## HmRun 0.56 1.00 0.48 0.36 0.14
## Walks 0.64 0.48 1.00 0.46 0.18
## log_Salary 0.47 0.36 0.46 1.00 0.63
## log_Years 0.13 0.14 0.18 0.63 1.00
##
## n= 322
##
##
## P
## Hits HmRun Walks log_Salary log_Years
## Hits 0.0000 0.0000 0.0000 0.0227
## HmRun 0.0000 0.0000 0.0000 0.0153
## Walks 0.0000 0.0000 0.0000 0.0009
## log_Salary 0.0000 0.0000 0.0000 0.0000
## log_Years 0.0227 0.0153 0.0009 0.0000
```

There are no high correlations among our predictors, so multicollinearity is not an issue.
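As a complementary check (not in the original post), variance inflation factors from the ‘car’ package quantify multicollinearity directly from the fitted model; values well below 5 support the conclusion drawn from the correlation matrix.

```r
# Variance inflation factors for each predictor in Salary_Model;
# VIFs near 1 indicate the predictors are not strongly collinear
library(car)
vif(Salary_Model)
```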

**Conclusion**

This post provided an example dealing with missing data, checking the assumptions of a regression model, and displaying plots. All this was done using R.
