# Hierarchical Regression in R

In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situation in which you want to see if adding additional variables to your model will significantly change the r2 when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.

We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables in three different models.

model 1 = intercept only
model 2 = Sales~Urban + US + ShelveLoc
model 3 = Sales~Urban + US + ShelveLoc + price + income
model 4 = Sales~Urban + US + ShelveLoc + price + income + Advertising

Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question

Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?

In this complex research question we essentially want to know if adding advertising cost will improve the model significantly in terms of the r square. The formal steps that we will following to complete this analysis is as follows.

1. Build sequential (nested) regression models by adding variables at each step.
2. Run ANOVAs in order to compute the R2
3. Compute difference in sum of squares for each step
1. Check F-statistics and p-values for the SS differences.
4. Compare sum of squares between models from ANOVA results.
5. Compute increase in R2 from sum of square difference
6. Run regression to obtain the coefficients for each independent variable.

We will now begin our analysis. Below is some initial code

``````library(ISLR)
data("Carseats")``````

# Model Development

We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.

``````model1 = lm(Sales~1,Carseats)
model2=lm(Sales~Urban + US + ShelveLoc,Carseats)
model3=lm(Sales~Urban + US + ShelveLoc + Price + Income,Carseats)
model4=lm(Sales~Urban + US + ShelveLoc + Price + Income + Advertising,Carseats)``````

We can now turn to the ANOVA analysis for model comparison #ANOVA Calculation We will use the anova() function to calculate the total sum of square for model 0. This will serve as a baseline for the other models for calculating r square

``anova(model1,model2,model3,model4)``
``````## Analysis of Variance Table
##
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1    399 3182.3
## 2    395 2105.4  4   1076.89  89.165 < 2.2e-16 ***
## 3    393 1299.6  2    805.83 133.443 < 2.2e-16 ***
## 4    392 1183.6  1    115.96  38.406 1.456e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1``````

For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.

model 1 = sum of squares = 3182.3
model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)
model 3 = sum of squares = 1299.6 (add price and income)
model 4 = sum of squares = 1183.6 (add Advertising)

Each model is statistical significant which means adding each variable lead to some improvement.

By adding price and income to the model we were able to improve the model in a statistically significant way. The r squared increased by .25 below is how this was calculated.

``2105.4-1299.6 #SS of Model 2 - Model 3``
``## [1] 805.8``
``805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by total sum of sqaure ie model 1``
``## [1] 0.2532131``

When we add Advertising to the model the r square increases by .03. The calculation is below

``1299.6-1183.6 #SS of Model 3 - Model 4``
``## [1] 116``
``116/ 3182.3 #SS difference of Model 3 and Model 4 divided by total sum of sqaure ie model 1``
``## [1] 0.03645162``

# Coefficients and R Square

We will now look at a summary of each model using the summary() function.

``summary(model2)``
``````##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.713 -1.634 -0.019  1.738  5.823
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       4.8966     0.3398  14.411  < 2e-16 ***
## UrbanYes          0.0999     0.2543   0.393   0.6947
## USYes             0.8506     0.2424   3.510   0.0005 ***
## ShelveLocGood     4.6400     0.3453  13.438  < 2e-16 ***
## ShelveLocMedium   1.8168     0.2834   6.410 4.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3317
## F-statistic: 50.51 on 4 and 395 DF,  p-value: < 2.2e-16``````
``summary(model3)``
``````##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income,
##     data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9096 -1.2405 -0.0384  1.2754  4.7041
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     10.280690   0.561822  18.299  < 2e-16 ***
## UrbanYes         0.219106   0.200627   1.092    0.275
## USYes            0.928980   0.191956   4.840 1.87e-06 ***
## ShelveLocGood    4.911033   0.272685  18.010  < 2e-16 ***
## ShelveLocMedium  1.974874   0.223807   8.824  < 2e-16 ***
## Price           -0.057059   0.003868 -14.752  < 2e-16 ***
## Income           0.013753   0.003282   4.190 3.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.5854
## F-statistic: 94.89 on 6 and 393 DF,  p-value: < 2.2e-16``````
``summary(model4)``
``````##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income +
##     Advertising, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2199 -1.1703  0.0225  1.0826  4.1124
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     10.299180   0.536862  19.184  < 2e-16 ***
## UrbanYes         0.198846   0.191739   1.037    0.300
## USYes           -0.128868   0.250564  -0.514    0.607
## ShelveLocGood    4.859041   0.260701  18.638  < 2e-16 ***
## ShelveLocMedium  1.906622   0.214144   8.903  < 2e-16 ***
## Price           -0.057163   0.003696 -15.467  < 2e-16 ***
## Income           0.013750   0.003136   4.384 1.50e-05 ***
## Advertising      0.111351   0.017968   6.197 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared:  0.6281, Adjusted R-squared:  0.6214
## F-statistic: 94.56 on 7 and 392 DF,  p-value: < 2.2e-16``````

You can see for yourself the change in the r square. From model 2 to model 3 there is a 26 point increase in r square just as we calculated manually. From model 3 to model 4 there is a 3 point increase in r square. The purpose of the anova() analysis was determined if the significance of the change meet a statistical criterion, The lm() function reports a change but not the significance of it.

# Conclusion

Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.

# Correlation

A correlation is a statistical method used to determine if a relationship exists between variables.  If there is a relationship between the variables it indicates a departure from independence. In other words, the higher the correlation the stronger the relationship and thus the more the variables have in common at least on the surface.

There are four common types of relationships between variables there are the following

1. positive-Both variables increase or decrease in value
2. Negative- One variable decreases in value while another increases.
3. Non-linear-Both variables move together for a time then one decreases while the other continues to increase
4. Zero-No relationship

The most common way to measure the correlation between variables is the Pearson product-moment correlation aka correlation coefficient aka r.  Correlations are usually measured on a standardized scale that ranges from -1 to +1. The value of the number, whether positive or negative, indicates the strength of the relationship.

The Person Product Moment Correlation test confirms if the r is statistically significant or if such a relationship would exist in the population and not just the sample. Below are the assumptions

• Subjects are randomly selected
• Both populations are normally distributed

Here is the process for finding the r.

1. Determine hypotheses
• H0: = 0 (There is no relationship between the variables in the population)
• H0: r ≠ 0 (There is a relationship between the variables in the population)
2. Decided what the level of significance will be
3. Calculate degrees of freedom to determine the t critical value (computer does this)
4. Calculate Pearson’s (computer does this)
5. Calculate t value (computer does this)
6. State conclusion.

Below is an example

A clerk wants to see if there is a correlation between the overall grade students get on an exam and the number of words they wrote for their essay. Below are the results

Student         Grade        Words on Essay
1                             79                           147
2                             76                           143
3                             78                           147
4                             84                           168
5                             90                           206
6                             83                           155
7                             93                           192
8                             94                           211
9                             97                           209
10                          85                           187
11                          88                           200
12                          82                           150

Step 1: State Hypotheses
H0: There is no relationship between grade and the number of words on the essay
H1: There is a relationship between grade and the number of words on the essay

Step 2: Level of significance
Set to 0.05

Step 3: Determine degrees of freedom and t critical value
t-critical = + 2.228 (This info is found in a chart in the back of most stat books)

Step 4: Compute r
r = 0.93                       (calculated by the computer)

Step 5: Decision rule. Calculate t-value for the r

t-value for r = 8.00  (Computer found this)

Since the computed t-value of 8.00 is greater than the t-critical value of 2.228 we reject the null hypothesis.

Step 6: Conclusion
Since the null hypothesis was rejected, we conclude that there is evidence that a strong relationship between the overall grade on the exam and the number of words written for the essay. To make this practical, the teacher could tell the students to write longer essays if they want a better score on the test.

IMPORTANT NOTE

When a null hypothesis is rejected there are several possible relationships between the variables.

• Direct cause and effect
• The relationship between X and Y may be due to the influence of a third variable not in the model
• This could be a chance relationship. For example, foot size and vocabulary. Older people have bigger feet and also a larger vocabulary. Thus it is a nonsense relationship