# Theoretical Distribution and R

This post will explore an example of testing whether a dataset fits a specific theoretical distribution. This is an important aspect of statistical modeling, as it allows us to assess the normality of the data and decide what steps are needed to prepare it for analysis.

In our example, we will use the “Auto” dataset from the “ISLR” package. We will check if the horsepower of the cars in the dataset is normally distributed or not. Below is some initial code to begin the process.

``````library(ISLR)
library(nortest)
library(fBasics)``````
``data("Auto")``

Determining if a dataset is normally distributed is simple in R. This is normally done visually by making a Quantile-Quantile plot (Q-Q plot). It involves two functions, “qqnorm” and “qqline”. Below is the code for the Q-Q plot.

``qqnorm(Auto$horsepower)`` We now need to add the Q-Q line to see how our distribution lines up with the theoretical normal one. Below is the code. Note that we have to repeat the code above in order to get the completed plot.

``````qqnorm(Auto$horsepower)
qqline(Auto$horsepower, distribution = qnorm, probs=c(.25,.75))`````` The “qqline” function needs the data you want to test as well as the distribution and the probabilities. The distribution we want is normal, which is indicated by the argument “qnorm”. The “probs” argument gives the quantiles through which the line passes; the default values are .25 and .75. The resulting graph indicates that the distribution of “horsepower” in the “Auto” dataset is not normally distributed. There are particular problems with the lower and upper values.

We can confirm our suspicion by running a statistical test. The Anderson-Darling test from the “nortest” package will allow us to test whether our data is normally distributed or not. The code is below.

``ad.test(Auto$horsepower)``
``````##  Anderson-Darling normality test
##
## data:  Auto$horsepower
## A = 12.675, p-value < 2.2e-16``````

From the results, we can conclude that the data is not normally distributed. This could mean that we may need to use non-parametric tools for statistical analysis.
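As an additional check (not part of the original analysis), base R's shapiro.test tests the same null hypothesis of normality; a small sketch:

```r
library(ISLR)

data("Auto")

# Shapiro-Wilk normality test from base R (the 'stats' package)
sw <- shapiro.test(Auto$horsepower)
sw$p.value  # a tiny p-value again points to non-normality
```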

We can further explore our distribution in terms of its skew and kurtosis. Skew measures how far to the left or right the data leans and kurtosis measures how peaked or flat the data is. This is done with the “fBasics” package and the functions “skewness” and “kurtosis”.

First we will deal with skewness. Below is the code for calculating skewness.

``````horsepowerSkew<-skewness(Auto$horsepower)
horsepowerSkew``````
``````##  1.079019
## attr(,"method")
##  "moment"``````

We now need to determine if this value of skewness is significantly different from zero. This is done with a simple t-test. We must calculate the t-value before calculating the probability. The standard error of the skew is the square root of six divided by the total number of samples, and the t-value is the skew divided by this standard error. The code is below.

``````stdErrorHorsepower<-horsepowerSkew/(sqrt(6/length(Auto$horsepower)))
stdErrorHorsepower``````
``````##  8.721607
## attr(,"method")
##  "moment"``````

Now we take this t-value for horsepower and plug it into the “pt” function (t probability) with the degrees of freedom (sample size – 1 = 391), and then subtract the result from 1. Below is the code.

``1-pt(stdErrorHorsepower,391)``
``````##  0
## attr(,"method")
##  "moment"``````

The value of zero means that we reject the null hypothesis that the skew is zero and conclude that the skew is significantly different from zero. However, the value of the skew was only about 1.1, which is not severely non-normal.
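The two steps above can be wrapped into a small helper function (skew_z_test is a hypothetical name; it assumes the same moment-based standard error, the square root of 6/n):

```r
library(ISLR)
library(fBasics)

data("Auto")

# Test whether the skew of x differs significantly from zero
skew_z_test <- function(x) {
  n <- length(x)
  z <- as.numeric(skewness(x)) / sqrt(6 / n)  # skew divided by its standard error
  p <- 1 - pt(z, df = n - 1)                  # one-tailed probability
  list(z = z, p.value = p)
}

skew_z_test(Auto$horsepower)  # z is about 8.72, p is essentially zero
```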

We will now repeat this process for the kurtosis. The only difference is that inside the square root we divide 24, rather than 6, by the sample size, as in the example below.

``````horsepowerKurt<-kurtosis(Auto$horsepower)
horsepowerKurt``````
``````##  0.6541069
## attr(,"method")
##  "excess"``````
``````stdErrorHorsepowerKurt<-horsepowerKurt/(sqrt(24/length(Auto$horsepower)))
stdErrorHorsepowerKurt``````
``````##  2.643542
## attr(,"method")
##  "excess"``````
``1-pt(stdErrorHorsepowerKurt,391)``
``````##  0.004267199
## attr(,"method")
##  "excess"``````

Again the p-value is essentially zero, which means that the kurtosis is significantly different from zero. With an excess kurtosis of only 0.65 (t-value of 2.64), this is not that bad. However, when both skew and kurtosis are significantly non-normal, it explains why our overall distribution was not normal either.

Conclusion

This post provided insights into assessing the normality of a dataset. Visual inspection can take place using Q-Q plots. Statistical inspection can be done through hypothesis testing along with checking skew and kurtosis.

# Wilcoxon Signed Rank Test in R

The Wilcoxon Signed Rank Test is a non-parametric equivalent of the t-test. If you have questions about whether or not your data is normally distributed, the Wilcoxon test can still indicate to you whether there is a difference between your samples.

The Wilcoxon Test compares the medians of two samples instead of their means. The difference between the median and each individual value in each sample is calculated. Differences that come to zero are removed. The remaining differences are ranked from lowest to highest. Lastly, the ranks are summed. If the rank sum is different between the two samples, it indicates a statistical difference between the samples.
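The ranking step can be illustrated with two tiny made-up samples (the numbers are hypothetical and purely for illustration):

```r
# Two small made-up samples
a <- c(3, 5, 8)
b <- c(6, 9, 12)

# Pool the values and rank them from lowest to highest
pooled <- c(a, b)
r <- rank(pooled)

# Sum the ranks belonging to each sample
sum_a <- sum(r[1:3])  # ranks of sample a: 1 + 2 + 4 = 7
sum_b <- sum(r[4:6])  # ranks of sample b: 3 + 5 + 6 = 14
```

The gap between the two rank sums is what the test evaluates.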

We will now do an example using R. We want to see if there is a difference in enrollment between private and public universities. Below is the code.

We will begin by loading the ISLR package. Then we will load the ‘College’ data and take a look at the variables in the “College” dataset by using the ‘str’ function.

``````library(ISLR)
data("College")
str(College)``````
``````## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...``````

We will now look at the Enroll variable and see if it is normally distributed.

``hist(College$Enroll)`` This variable is highly skewed to the right, which suggests that it is not normally distributed. Therefore, we may not be able to use a regular t-test to compare private and public universities, and the Wilcoxon Test is more appropriate. We will now use the Wilcoxon Test. Below are the results.

``wilcox.test(College$Enroll~College$Private)``
``````##
##  Wilcoxon rank sum test with continuity correction
##
## data:  College$Enroll by College$Private
## W = 104090, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0``````

The results indicate a significant difference. (Note that with two independent samples, R runs the Wilcoxon rank sum test, also known as the Mann-Whitney test, as shown in the output.) We will now calculate the medians of the two groups using the ‘aggregate’ function. This function allows us to compare our two groups based on the median. Below is the code with the results.

``aggregate(College$Enroll~College$Private, FUN=median)``
``````##   College$Private College$Enroll
## 1              No       1337.5
## 2             Yes        328.0
``````

As you can see, there is a large difference in enrollment between private and public colleges. We can then conclude that there is a difference in the medians of private and public colleges, with public colleges having a much higher enrollment.
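A boxplot offers a quick visual companion to these medians (a sketch; not part of the original analysis):

```r
library(ISLR)

data("College")

# Enrollment for public (No) versus private (Yes) schools
boxplot(Enroll ~ Private, data = College,
        xlab = "Private", ylab = "Enrollment")
```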

Conclusion

The Wilcoxon Test is used for a non-parametric analysis of data. This test is useful whenever there are concerns about the normality of the data.

# Kruskal-Wallis Test in R

Sometimes the data that needs to be analyzed is not normally distributed. This makes it difficult to make any inferences based on the results, because one of the main assumptions of parametric statistical tests such as ANOVA and the t-test is normal distribution of the data.

Fortunately, for nearly every parametric test there is a non-parametric counterpart. Non-parametric tests are tests that make no assumptions about the normality of the data. This means that non-normal data can still be analyzed with a certain measure of confidence in terms of the results.

This post will look at a non-parametric test that is used to test for differences among groups. For three or more groups we use the Kruskal-Wallis Test, which is the non-parametric version of ANOVA.

Setup

We are going to use the “ISLR” package available in R to demonstrate the use of the Kruskal-Wallis test. After installing this package, you need to load the “Auto” data. Below is the code to do all of this.

```install.packages('ISLR')
library(ISLR)
data("Auto")```

We now need to examine the structure of the data set. This is done with the “str” function. Below is the code followed by the results.

```str(Auto)
'data.frame':	392 obs. of  9 variables:
$ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
$ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
$ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
$ weight      : num  3504 3693 3436 3433 3449 ...
$ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year        : num  70 70 70 70 70 70 70 70 70 70 ...
$ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
$ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...```

So we have 9 variables. We first need to find out if any of the continuous variables are non-normal, because this indicates that the Kruskal-Wallis test is needed. We will look at the ‘displacement’ variable and examine its histogram to see if it is normally distributed. Below is the code followed by the histogram.

```hist(Auto$displacement)```

This does not look normally distributed. We now need a factor variable with 3 or more groups. We are going to use the ‘origin’ variable. This variable indicates where the car was made: 1 = America, 2 = Europe, and 3 = Japan. However, this variable is currently numeric. We need to change it into a factor variable. Below is the code for this.

`Auto$origin<-as.factor(Auto$origin)`

The Test

We will now use the Kruskal-Wallis test. The question we have is, “Is there a difference in displacement based on the origin of the vehicle?” The code for the analysis is below, followed by the results.

```> kruskal.test(displacement ~ origin, data = Auto)

Kruskal-Wallis rank sum test

data:  displacement by origin
Kruskal-Wallis chi-squared = 201.63, df = 2, p-value < 2.2e-16```

Based on the results, we know there is a difference among the groups. However, just as with ANOVA, we do not know where. We have to do a post-hoc test in order to determine where the difference lies among the three groups.
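Base R also ships a simple post-hoc option, pairwise.wilcox.test, which runs pairwise rank sum comparisons with a p-value adjustment. This is an alternative sketch, not the method used in this post:

```r
library(ISLR)

data("Auto")
Auto$origin <- as.factor(Auto$origin)

# Pairwise Wilcoxon rank sum tests with a Holm correction
ph <- pairwise.wilcox.test(Auto$displacement, Auto$origin,
                           p.adjust.method = "holm")
ph
```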

To do this we need to install a new package and run a new analysis. We will download the “PMCMR” package and run the code below.

```install.packages('PMCMR')
library(PMCMR)
data(Auto)
attach(Auto)
posthoc.kruskal.nemenyi.test(x=displacement, g=origin, dist='Tukey')```

Here is what we did,

1. Installed the PMCMR package and loaded it
2. Loaded the “Auto” data and used the “attach” function to make it available
3. Ran the function “posthoc.kruskal.nemenyi.test”, placed the appropriate variables in their places, and then indicated the type of post-hoc test, ‘Tukey’

Below are the results

```Pairwise comparisons using Tukey and Kramer (Nemenyi) test
with Tukey-Dist approximation for independent samples

data:  displacement and origin

1       2
2 3.4e-14 -
3 < 2e-16 0.51

Warning message:
In posthoc.kruskal.nemenyi.test.default(x = displacement, g = origin,  :
Ties are present, p-values are not corrected.```

The results are listed in a table. When a comparison was made between groups 1 and 2, the results were significant (p < 0.0001). The same is true when groups 1 and 3 are compared (p < 0.0001). However, there was no difference between groups 2 and 3 (p = 0.51).

Do not worry about the warning message; this can be corrected if necessary.

Perhaps you are wondering what the actual mean for each group is. Below is the code with the results.

```> aggregate(Auto[, 3], list(Auto$origin), mean)
Group.1        x
1       1 247.5122
2       2 109.6324
3       3 102.7089```

Cars made in America have an average displacement of 247.51, while cars from Europe and Japan have average displacements of 109.63 and 102.71 respectively. Below is the code for the boxplot followed by the graph.

```boxplot(displacement~origin, data=Auto, ylab= 'Displacement', xlab='Origin')
title('Car Displacement')```

Conclusion

This post provided an example of the Kruskal-Wallis test. This test is useful when the data is not normally distributed. The main problem with this test is that it is less powerful than ANOVA. However, this is a problem with most non-parametric tests when compared to parametric tests.

# Type I and Type II Error

Hypothesis testing in statistics involves deciding whether to reject or not reject a null hypothesis. There are problems that can occur when making decisions about a null hypothesis. A researcher can reject a null when they should not, which is called a type I error. The other mistake is not rejecting a null when they should have, which is a type II error. Both of these mistakes can seriously damage the interpretation of data.

An Example

The classic example that explains type I and type II errors is a courtroom. In a trial, the defendant is considered innocent until proven guilty. The presumption of innocence can be compared to the null hypothesis being true. The prosecutor's job is to present evidence that the defendant is guilty. This is the same as providing statistical evidence to reject the null hypothesis, which indicates that the null is not true and needs to be rejected.

There are four possible outcomes of our trial and our statistical test…

1. The defendant can be declared guilty when they are really guilty. That’s a correct decision. This is the same as rejecting the null hypothesis.
2. The defendant could be judged not guilty when they really are innocent. That’s correct and is the same as not rejecting the null hypothesis.
3. The defendant is convicted when they are actually innocent, which is wrong. This is the same as rejecting the null hypothesis when you should not and is known as a type I error.
4. The defendant is guilty but declared innocent, which is also incorrect. This is the same as not rejecting the null hypothesis when you should have. This is known as a type II error.

Important Notes

The probability of committing a type I error is the same as the alpha or significance level of a statistical test. Common values associated with alpha are 0.1, 0.05, and 0.01. This means that the likelihood of committing a type I error depends on the level of significance that the researcher picks.

The probability of committing a type II error is known as beta. Calculating beta is complicated, as you need specific values in your null and alternative hypotheses, and it is not always possible to supply these. As such, researchers often do not focus on type II error avoidance as much as they do on type I.

Another concern is that decreasing the risk of committing one type of error increases the risk of committing the other. This means that if you reduce the risk of a type I error, you increase the risk of committing a type II error.
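The link between alpha and the type I error rate can be made concrete with a small simulation (a sketch): when the null hypothesis is true, a test at the .05 level should reject about 5% of the time.

```r
set.seed(42)

# Run many t-tests where the null is TRUE (both groups drawn from the
# same distribution) and count how often p falls below .05
reps <- 2000
p_values <- replicate(reps, t.test(rnorm(30), rnorm(30))$p.value)
mean(p_values < 0.05)  # close to the chosen alpha of 0.05
```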

Conclusion

The risk of error or incorrect judgment of a null hypothesis is a risk in statistical analysis. As such, researchers need to be aware of these problems as they study data.

# Hypothesis Testing for Two Means: Large Independent Samples

Hypothesis testing for two large samples examines whether there is a difference between two means. We infer that there is a difference between the population means by seeing if there is a difference between the sample means. The assumptions for testing the difference between two means are below.

• Subjects are randomly selected and independently assigned to groups
• Population is normally distributed
• Sample size is greater than 30

The hypotheses can be stated as follows

• Null hypothesis: There is no difference between the population means of the two groups
• The technical way to say this is…  H0: μ1 = μ2
• Alternative hypothesis: There is a difference between the population means of the two groups. One is greater or smaller than the other
• The technical way to say this is… H1: μ1 ≠ μ2, μ1 > μ2, or μ1 < μ2

The process for conducting a z test for independent samples is provided below

1. State the null and alternative hypotheses.
2. Determine the level of significance (normally .1, .05, or .01)
3. Decide if it is a one-tail or two tail test.
4. Determine the critical value of z. This is found in a chart in the back of most stat books; common values include +1.64, +1.96, and +2.33.
5. Calculate the means and standard deviations of the two samples.
6. Calculate the test statistic for the two independent samples. Below is the formula.

z = (sample mean 1 – sample mean 2) /

√[(standard deviation of sample 1)² / sample size 1 +
(standard deviation of sample 2)² / sample size 2]

7. If the computed z is less than the critical z, then you do not reject your null hypothesis. This means there is no evidence of a difference between the means. If the computed z is greater than the critical z, then you reject the null hypothesis, which indicates that there is evidence of a difference.

Below is an example

A businessman is comparing the prices of buildings in two different provinces to see if there is a difference. Below are the results. Determine if the buildings in Bangkok cost more than the buildings in Saraburi.

|                    | Bangkok   | Saraburi  |
|--------------------|-----------|-----------|
| average price      | 2,140,000 | 1,970,000 |
| standard deviation | 226,000   | 243,000   |
| sample size        | 47        | 45        |

(The 226,000 and 243,000 figures are standard deviations, since they are squared in the formula.)

Now let us go through the steps

• Null hypothesis: There is no difference between the average price of buildings in Bangkok and Saraburi
• In stat language, it would be
• H0: μ1 = μ2
• Alternative hypothesis: The average price of buildings in Bangkok is higher than in Saraburi
• In stat language, it would be
• H1: μ1 > μ2
2. Determine the level of significance (normally .1, .05, or .01)
• We will select .05
3. Decide if it is a one-tail or two tail test.
• This is a one-tail test. We want to know if one mean is greater than another. Therefore, to reject the null we need a computed z that is positive and larger than our critical z.
4. Determine the critical value of z. This is found in a chart in the back of most stat books; common values include ±1.64, ±1.96, and ±2.58 when it is a two-tailed test.
• Our critical z is +1.64. Since this is a one-tail test, we only have one value; we do not split the probability and place half on one side and half on the other. If this were two-tailed, we would have -1.96 and +1.96, which allows for the difference to be either greater or less.
5. Calculate the means and standard deviations of the two samples.
• Already done in the table above
6. Calculate the test statistic for the two independent samples. Below is the formula.

z = (2,140,000 – 1,970,000) / √[(226,000² / 47) + (243,000² / 45)]

Our final answer for the computed z is 3.47.
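The arithmetic can be checked in R, treating the 226,000 and 243,000 figures as standard deviations since they are squared in the formula:

```r
# Figures from the example
mean_bkk <- 2140000; sd_bkk <- 226000; n_bkk <- 47
mean_srb <- 1970000; sd_srb <- 243000; n_srb <- 45

# z statistic for two large independent samples
z <- (mean_bkk - mean_srb) / sqrt(sd_bkk^2 / n_bkk + sd_srb^2 / n_srb)
round(z, 2)  # 3.47, which exceeds the critical value of +1.64
```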

Since 3.47 is greater than our critical z of +1.64, we reject the null hypothesis and state that there is evidence that building prices are higher in Bangkok than in Saraburi.

# What is Hypothesis Testing?

Hypothesis testing is a statistical approach used in making decisions about data. In hypothesis testing, there are two hypotheses posed by the researcher, and they are…

1. Null hypothesis: There is no difference between the sample population and the statistical population in relation to the mean or some other parameter that is being assessed.
2. Alternative hypothesis: There is a difference between the sample population and the statistical population in relation to the mean or some other parameter that is being assessed.

Generally, researchers hope to reject the null hypothesis, which indicates that the alternative hypothesis is supported. However, strictly speaking, a researcher never accepts any hypothesis. Instead, you reject or you do not reject the null hypothesis. This is because further testing will always be needed to confirm the results.

Whether to reject or not reject the null depends on the results of the analysis. A researcher needs to select a level of statistical significance, which is usually 1%, 5%, or 10%. The significance level changes the size of the rejection region at the tails of the distribution: the lower the significance level, the smaller the rejection region, which influences the interpretation of the results. To reject a null hypothesis, the results of the analysis must fall within the rejection region.
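The critical values that mark off the rejection region can be computed with qnorm; for example, for two-tailed tests at each common significance level:

```r
# Two-tailed critical z values for the 10%, 5%, and 1% levels
alphas <- c(0.10, 0.05, 0.01)
round(qnorm(1 - alphas / 2), 2)  # 1.64 1.96 2.58
```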

After determining the level of significance a researcher analyzes the data to determine the results. The results then need to be interpreted by stating them in simple English.  From this, the researcher can develop a conclusion about what the results mean.