Making and Using Variables in R

Variables are used in R to store information for computational purposes. It seems that there is almost no limit to what can be stored in a variable. To make a variable, you need to know the following information.

  • The name you want to give the variable
  • The information you want to store in the variable

Here is an example.

You want to make a variable that will store the following test scores: 80, 81, 82, 83, 84, 85. You want to call the variable test_scores. Here is what we know.

  • The name of the variable is test_scores
  • It will contain the values 80, 81, 82, 83, 84, 85

Here is how this would look in R.

  • > test_scores <- 80:85

There are a few things to explain.

  1. The <- sign means “assign.” In other words, the values 80:85 on the right of the <- sign are being assigned to the variable name on the left.
  2. The colon sign stands for a sequence. We wanted all whole numbers from 80 to 85, and the colon sign provides this.
  3. Both <- and : are known as operators in R. Operators are symbols you place between values in order to carry out an operation such as an assignment or a calculation.

Now, type test_scores into the R console and press enter. You should see the following.

  • > test_scores
    [1] 80 81 82 83 84 85

R now shows you everything that is stored in the variable. We can also create other variables and perform calculations with them. For example, let’s say that you want to add 5 points of extra credit to each test score. To do this, let’s make a variable called extra_credit and add it to test_scores.

  • > extra_credit <- 5
    > test_scores + extra_credit
    [1] 85 86 87 88 89 90

As you can see, R took the values of test_scores and added extra_credit to each one. This is much faster than entering each value separately. We can also store the new scores in a new variable, which we will call revised_test_scores. Let’s try.

  • > revised_test_scores <- test_scores + extra_credit
    > revised_test_scores
    [1] 85 86 87 88 89 90

Variables can also be used for text. The only difference is that you must put quotes around the words. Otherwise, R will treat the word as the name of a variable and report an error if no such variable exists. Below is an example.

  • > h <- "Howdy"
    > h
    [1] "Howdy"

Lastly, variables can be used to store vectors that hold several values at once. This saves a lot of time when performing calculations. We will now make the variable student_names and assign to it a vector containing the names of students.

  • > student_names <- c("David", "Edward", "John")
    > student_names
    [1] "David"  "Edward" "John"
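
As a small sketch that ties these pieces together, a vector of names can be attached to a vector of scores with names(), and the result can be summarized with mean(). The three scores below are assumed values made up for illustration and are not part of the earlier example.

```r
# Illustrative scores for the three students (assumed values)
student_names <- c("David", "Edward", "John")
student_scores <- c(80, 82, 85)

# Attach each name to its score
names(student_scores) <- student_names
student_scores

# Arithmetic is applied to every element of the vector at once
extra_credit <- 5
revised_scores <- student_scores + extra_credit
revised_scores

# Average of the revised scores
mean(revised_scores)
```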

This is only the beginning of some of the amazing features of R.

Introduction to Vectors in R

A key component of R is the use of vectors. A vector is a single piece of information that contains a collection of information. This probably sounds confusing, so an example will be provided.

Think of an organization such as a school. We will call the school Asia International School. Asia International School consists of an administrator named Dr. T, teachers named Mr. Bob and Mrs. Smith, and students named Sam, David, and Mary. In this example, the school is considered a vector, or a single piece of information (Asia International School). The administrator, teachers, and students are the collection of information within the larger piece of information that is the school.

If we wanted to write this example using the R programming language it would look something like the following…

  • > Asia International School(Dr. T, Mr. Bob, Mrs. Smith, Sam, David, Mary)
    [1] Dr. T Mr. Bob Mrs. Smith Sam David Mary

To be fair, this is not exactly how it is done. It is strictly an imperfect illustration of a very abstract concept.

A Real Example

In order to make a vector in R you need to use the c() function. A function is a piece of code that does something to the information that is within its parentheses. For example, let’s make a vector that contains the numbers 1, 2, 3, 4, 5.

  • > c(1,2,3,4,5)
    [1] 1 2 3 4 5

To get the second line, you need to press enter.

The c means combine. The information inside the parentheses is the information that is being combined into one vector. Going back to our school example, Dr. T, Mr. Bob, Mrs. Smith, Sam, David, and Mary were being combined into the school Asia International. In a function call, everything inside the parentheses is known as the arguments.
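
For comparison, here is one way the school example might be written as a real character vector. The variable name school is an assumption chosen for this sketch.

```r
# The school example as a character vector (each name must be quoted)
school <- c("Dr. T", "Mr. Bob", "Mrs. Smith", "Sam", "David", "Mary")

length(school)   # number of elements in the vector
school[1]        # the first element (R counts from 1)
school[4:6]      # elements 4 to 6, which are the three students
```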

Conclusion

These are just some of the most basic ideas about vectors. There is a great deal more to explore about vectors, which are considered one of the most powerful features of R. The challenge of learning R lies in the abstract nature of programming. You have to think of the things you want to do in terms of code that the computer can understand. This is very confusing for most people.

Using R

The R program can be used on its own to process many different mathematical functions. However, many people choose to use some sort of editing tool while using R. The editing tool provides a place for developing and saving code and functions.

There are many different editing tools available. For Windows, a popular choice is RGui. For Mac, R.app is a common choice. The choice that is quickly becoming the standard for R users is RStudio. RStudio works on Windows, Mac, and Linux. This provides a consistent interface for people regardless of the operating system they are using. Below are some additional benefits of RStudio.

  • Brackets are automatically closed when developing code
  • Different parts of the code are shown in different colors. This helps in reading the code.
  • Code completion, which saves time

Coding in RStudio

The first thing people often do when learning to code is create the message “Hello World.” To do this in RStudio requires the following.

  1. In RStudio, make sure the cursor is blinking in the console section (the console is normally in the lower left-hand part of the window)
  2. Type the following and press enter
    1. print("Hello World!")
  3. After pressing enter you should see the following
    1. [1] "Hello World!"

Congrats, you have just written and run your first line of R code.

Doing Some Math

R can be used for performing math calculations as well. Consider the following example

  1. Type the following into the console and press enter
    1. 1 + 3 + 5 + 7
  2. You should get the following output
    1. [1] 16

Have some fun playing with print() and calculating various math problems.

The History and Characteristics of R

R is a programming language and software environment that is used for developing graphics and data products and for many forms of statistical and mathematical computation. The history of R goes back to the early 1990s. This post will look at the history of R as well as the characteristics of this software.

The History

Ross Ihaka and Robert Gentleman are the developers of R. R is actually based on an older programming language known as S, which goes back to the 1970s. Ihaka and Gentleman developed their own programming language while working together in New Zealand. With the release of R in the early 1990s, several people joined the project to help improve it. By 1995, the software had become “open source,” which means that anyone can use and modify it for themselves without cost. In 2000, version 1.0 of R was released to the public.

Characteristics of R

In many people’s opinion, the best feature of R is the price. Being free, R is by far one of the best software packages for statistical analysis if price is the most important criterion. SPSS and SAS are also great and more user-friendly. However, their price is out of reach for most individual researchers. R removes this problem completely.

R also has an active community around it that supports its development. For example, people are able to develop packages that provide assistance in running various tasks in R. Naturally, most packages are free as well. The focus on community has also enabled R to run on almost any operating system, such as Windows, OS X, or Linux.

R also allows people to make graphs and data products. The graphs are actually very well made. The drawback is understanding the coding necessary to develop these various products. This is discussed more below.

One major drawback that affects the typical computer user is learning the programming language of R. This can be challenging for those who are not techie or not used to thinking abstractly in computer code. I have never seen a satisfactory way to get around this problem other than to crack open a book and practice, practice, practice. With time, the code will start to make sense, but it is not a five-minute process for someone who has not studied programming.

Conclusion

Despite the challenges of learning computer programming, R is becoming the software of choice for many. The benefits far outweigh the problems for many individuals. Personally, I am looking forward to continuing to develop skills in understanding this dynamic software.

Chi-Square Goodness-of-Fit Test

The chi-square test is a non-parametric test that is used in statistics to determine if an observed distribution or model conforms or is similar to an expected distribution or model. In simple terms, this test will tell you if the data you collected is similar to other data or to what you expected.

There are several types of chi-square tests, such as the Chi-square Test of Independence, which checks whether two categorical variables are related, and the Goodness-of-Fit Test, which compares the observed frequencies of a single categorical variable with the frequencies that were expected. This post is about the Goodness-of-Fit Test.

A unique caveat of the chi-square test is that, as researchers, we normally hope not to reject the null hypothesis. This is the opposite of traditional hypothesis testing, in which we often hope to reject the null hypothesis because this indicates that there is a statistical difference. With the chi-square test, we want our observed values to be similar to the values found in the expected model. What this means is that our model represents what is happening in the real world and is not only theoretical. If we reject the null, it means that the observed values are not similar to the expected values. In other words, we found something that does not conform to what is expected. If a model does not represent the world, it may not serve much purpose.

Here are the assumptions of the Goodness-of-Fit Test

  • Random selection of subjects
  • Mutually exclusive categories

Here are the steps

  1. Determine hypothesis
    • H0: There is no difference between the observed values/model and the expected values/model
    • H1: There is a difference between the observed values/model and the expected values/model
  2. Decide level of significance
  3. Determine degrees of freedom to find the chi-square critical value
  4. Compute the expected frequencies
  5. Compute chi-square
  6. Make decision to reject or not reject the null
  7. State conclusion

Here is an example

A principal wants to know if the number of students absent each day of the week is the same. Below are the results for one week.

Day          Absences
Monday       17
Tuesday      20
Wednesday    16
Thursday     14
Friday       13

Step 1: Determine Hypothesis

  • H0: The number of students absent is the same every day
  • H1: The number of students absent is not the same every day

Step 2: Decide level of significance

  • 0.05

Step 3: Determine chi-square critical value (computer does this for you)

  • With 5 - 1 = 4 degrees of freedom, the chi-square critical value = 9.49

Step 4: Compute expected frequencies

  • Computer does this. With 80 total absences spread equally over 5 days, the expected frequency is 16 per day

Step 5: Compute chi-square (computer does this for you)

  • Chi-square = 1.87

Step 6: Make decision

  • Since the computed chi-square of 1.87 is less than the critical chi-square value of 9.49, we do not reject the null hypothesis

Step 7: Conclusion

  • Since we do not reject the null hypothesis, we can say that there is a lack of evidence of a difference in the number of absences each day of the week. In other words, the data are consistent with the same number of students being absent each day.
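
The example above can be sketched in R with the built-in chisq.test() and qchisq() functions. This is only one possible way to run it. The variable names absences, result, and critical are assumptions chosen for this sketch.

```r
# Observed absences for each day of the week
absences <- c(Monday = 17, Tuesday = 20, Wednesday = 16,
              Thursday = 14, Friday = 13)

# With no further arguments, chisq.test() on a vector assumes
# equal expected frequencies for every category
result <- chisq.test(absences)
result$expected    # 16 absences expected each day
result$statistic   # computed chi-square, about 1.875
result$p.value

# Critical value for a 0.05 significance level with 5 - 1 = 4 degrees of freedom
critical <- qchisq(0.95, df = length(absences) - 1)
critical           # about 9.488

# TRUE here means the null hypothesis is not rejected
unname(result$statistic) < critical
```

When the expected frequencies are unequal, chisq.test() also accepts a p argument holding the expected proportions for each category.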

NOTE: There is also a way to do this test when the expected frequencies are unequal

Simple Linear Regression Analysis

Simple linear regression analysis is a technique that is used to model the dependency of one dependent variable upon one independent variable. This relationship between these two variables is explained by an equation.

When regression is employed, the data points are normally graphed on a scatterplot. Next, the computer draws what is called the “best-fitting” line. The line is the best fit because it minimizes the error between the actual values and the values predicted by the model. The official name of the model is the least squares model, in that it is the line with the smallest sum of squared errors. As such, it is the best linear model for predicting future values.

It is important to remember that one of the great challenges of statistics is explaining error, or residual. In general, any particular data point that is not the mean is said to have some error in it. For example, if the average is 5 and one of the data points is 3, then 5 - 3 = 2, or an error of 2. Statisticians often want to explain this error. What is causing this variation from the mean is a common question.

There are two ways that simple regression deals with error

  1. The error cannot be explained. This is known as unexplained variation.
  2. The error can be explained. This is known as explained variation.

When these two values are added together you get the total variation, which is also known as the “total sum of squares.” The unexplained part on its own is known as the “sum of squares for error.”

Another important term to be familiar with is the standard error of estimate. The standard error of estimate is a measurement of the standard deviation of the observed values of the dependent variable from its predicted values. Remember that there is always a slight difference between observed and predicted values, and the model wants to explain as much of this as possible.

In general, the smaller the standard error the better because this indicates that there is not much difference between observed data points and predicted data points. In other words, the model fits the data very well.

The explained variation is the basis for the coefficient of determination. The coefficient of determination is the proportion of the total variation that is explained by the regression line and the independent variable. Another name for this value is r². The coefficient of determination has a value between 0 and 1, or 0% to 100%.

The higher your r², the better your model is at explaining the dependent variable. However, there are a lot of qualifiers to this statement that go beyond this post.
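
To make these terms concrete, here is a minimal sketch in R that splits the total variation into its explained and unexplained parts. The five data points are assumed values made up for illustration, and the names sst, ssr, and sse are labels chosen for this sketch.

```r
# Small illustrative data set (assumed values, not from any real study)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Fit the least squares line
fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)             # total variation
ssr <- sum((fitted(fit) - mean(y))^2)   # explained variation
sse <- sum(residuals(fit)^2)            # unexplained variation

all.equal(sst, ssr + sse)   # the two parts add up to the total
ssr / sst                   # coefficient of determination (r squared)
summary(fit)$r.squared      # the same value as reported by lm()
summary(fit)$sigma          # standard error of estimate
```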

Here are the assumptions of simple regression

  • Linearity: the relationship between the variables is linear, so the mean of the errors is zero
  • Independence of error terms: the errors are independent of each other
  • Normality of error terms: the errors are normally distributed
  • Homoscedasticity: the variance of the errors is the same at every value of the independent variable

There are many ways to check all of this in SPSS which is beyond this post.

Below is an example of simple regression using data from a previous post

You want to know how strongly the exam grade is related to the number of words in the students’ essays. The data is below.

Student    Grade    Words on Essay
1          79       147
2          76       143
3          78       147
4          84       168
5          90       206
6          83       155
7          93       192
8          94       211
9          97       209
10         85       187
11         88       200
12         82       150

Step 1: Find the Slope (The computer does this for you)
slope = 3.74

Step 2: Find the mean of X (exam grade) and Y (words on the essay) (Computer does this for you)
X (Exam grade) = 85.75        Y (Words on Essay) = 176.25

Step 3: Compute the intercept of the simple linear regression (computer does this)
-145.27

Step 4: Create linear regression equation (you do this)
Y (words on essay) = 3.74*(exam grade) – 145.27
NOTE: you can use this equation to predict the number of words on the essay if you know the exam grade or to predict the exam grade if you know how many words they wrote in the essay. It is simple algebra.

Step 5: Calculate Coefficient of Determination r² (computer does this for you)
r² = 0.85
The regression model explains about 85% of the variation in the number of words on the essay. In other words, exam grades strongly predict how many words a student will write in their essay.
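
The steps above can be sketched in R with the built-in lm() function. The variable names grade, words, and fit are assumptions chosen for this sketch, and the grade of 80 used for the prediction is an assumed input. The values R reports may differ slightly in the last decimal place from the figures quoted above because of rounding.

```r
# Data from the table above
grade <- c(79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82)
words <- c(147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150)

# Regress words on the essay (Y) on exam grade (X)
fit <- lm(words ~ grade)

coef(fit)                # intercept and slope of the regression equation
summary(fit)$r.squared   # coefficient of determination

# Predicted number of words for a student with an exam grade of 80
predict(fit, newdata = data.frame(grade = 80))
```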

Spearman Rank Correlation

Spearman rank correlation, aka ρ, is used to measure the strength of the relationship between two variables. You may already be wondering what the difference is between Spearman rank correlation and Pearson product moment correlation. The difference is that Spearman rank correlation is a non-parametric test while Pearson product moment correlation is a parametric test.

A non-parametric test does not have to comply with the assumptions of a parametric test, such as the data being normally distributed. This allows a researcher to still make inferences from data that may not have normality. In addition, non-parametric tests are used for data that is at the ordinal or nominal level. In many ways, Spearman correlation and Pearson product moment correlation complement each other. One is used in non-parametric statistics and the other in parametric statistics, and each analyzes the relationship between variables.

If you get suspicious results from your Pearson product moment correlation analysis, or your data lacks normality, Spearman rank correlation may be useful if you still want to determine whether there is a relationship between the variables. Spearman correlation works by ranking the data within each variable. Next, the Pearson product moment correlation is calculated between the two sets of ranked values. Below are the assumptions of the Spearman correlation test.

  • Subjects are randomly selected
  • Observations are at the ordinal level at least

Below are the steps of Spearman correlation

  1. Set up the hypotheses
    1. H0: There is no correlation between the variables
    2. H1: There is a correlation between the variables
  2. Set the level of significance
  3. Calculate the degrees of freedom and find the t-critical value (computer does this for you)
  4. Calculate the value of Spearman correlation or ρ (computer does this for you)
  5. Calculate the t-value (computer does this for you) and make a statistical decision
  6. State conclusion

Here is an example

A clerk wants to see if there is a correlation between the overall grade students get on an exam and  the number of words they wrote for their essay. Below are the results

Student    Grade    Words on Essay
1          79       147
2          76       143
3          78       147
4          84       168
5          90       206
6          83       155
7          93       192
8          94       211
9          97       209
10         85       187
11         88       200
12         82       150

Note: The computer will rank the data of each variable with a rank of 1 being the highest value of a variable and a rank 12 being the lowest value of a variable. Remember that the computer does this for you.

Step 1: State hypotheses
H0: There is no relationship between grades and words on the essay
H1: There is a relationship between grades and words on the essay

Step 2: Determine level of significance
Level set to 0.05

Step 3: Determine critical t-value
t = ±2.228 (computer does this for you)

Step 4: Compute Spearman correlation
ρ = 0.97 (computer does this for you)
Note: This correlation is very strong. Remember the strongest relationship possible is ±1.

Step 5: Calculate t-value and make a decision
t = 12.62   ( the computer does this for you)
Since the computed t-value of 12.62 is greater than the t-critical value of 2.228 we reject the null hypothesis

Step 6: Conclusion
Since the null hypothesis is rejected, we can conclude that there is evidence of a strong relationship between exam grade and the number of words written on an essay. This means that a teacher could tell students they should write longer essays if they want a higher grade on exams, although a correlation by itself does not prove that longer essays cause higher grades.
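
Here is a minimal sketch of how the example above might be run in R with the built-in cor.test() function. The variable names are assumptions chosen for this sketch. Because two essays have the same word count, exact = FALSE is used so that R does not warn about ties. The t-value R produces may differ slightly from the figure quoted above because of rounding.

```r
# Data from the table above
grade <- c(79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82)
words <- c(147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150)

# Spearman rank correlation (R ranks the data for you)
result <- cor.test(grade, words, method = "spearman", exact = FALSE)
rho <- unname(result$estimate)
rho   # about 0.97

# t-value for rho and the two-tailed critical t at the 0.05 level,
# both using n - 2 degrees of freedom
n <- length(grade)
t_value <- rho * sqrt((n - 2) / (1 - rho^2))
t_critical <- qt(0.975, df = n - 2)   # about 2.228
c(t_value = t_value, t_critical = t_critical)
```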

Correlation

A correlation is a statistical method used to determine if a relationship exists between variables.  If there is a relationship between the variables it indicates a departure from independence. In other words, the higher the correlation the stronger the relationship and thus the more the variables have in common at least on the surface.

There are four common types of relationships between variables.

  1. Positive: both variables increase or decrease in value together
  2. Negative: one variable decreases in value while the other increases
  3. Non-linear: both variables move together for a time, then one decreases while the other continues to increase
  4. Zero: no relationship

The most common way to measure the correlation between variables is the Pearson product-moment correlation, aka the correlation coefficient, aka r. Correlations are measured on a standardized scale that ranges from -1 to +1. The sign of the number indicates the direction of the relationship, and its absolute value indicates the strength.

The Pearson Product Moment Correlation test confirms whether the r is statistically significant, that is, whether such a relationship is likely to exist in the population and not just in the sample. Below are the assumptions.

  • Subjects are randomly selected
  • Both populations are normally distributed

Here is the process for finding the r.

  1. Determine hypotheses
    • H0: r = 0 (There is no relationship between the variables in the population)
    • H1: r ≠ 0 (There is a relationship between the variables in the population)
  2. Decide what the level of significance will be
  3. Calculate degrees of freedom to determine the t critical value (computer does this)
  4. Calculate Pearson’s r (computer does this)
  5. Calculate t value (computer does this)
  6. State conclusion

Below is an example

A clerk wants to see if there is a correlation between the overall grade students get on an exam and the number of words they wrote for their essay. Below are the results

Student    Grade    Words on Essay
1          79       147
2          76       143
3          78       147
4          84       168
5          90       206
6          83       155
7          93       192
8          94       211
9          97       209
10         85       187
11         88       200
12         82       150

Step 1: State Hypotheses
H0: There is no relationship between grade and the number of words on the essay
H1: There is a relationship between grade and the number of words on the essay

Step 2: Level of significance
Set to 0.05

Step 3: Determine degrees of freedom and t critical value
t-critical = ±2.228 (This info is found in a chart in the back of most stat books)

Step 4: Compute r
r = 0.93 (calculated by the computer)

Step 5: Decision rule. Calculate t-value for the r

t-value for r = 8.00  (Computer found this)

Since the computed t-value of 8.00 is greater than the t-critical value of 2.228 we reject the null hypothesis.

Step 6: Conclusion
Since the null hypothesis was rejected, we conclude that there is evidence of a strong relationship between the overall grade on the exam and the number of words written for the essay. To make this practical, the teacher could tell the students to write longer essays if they want a better score on the test.
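
The example above can be sketched in R with the built-in cor.test() function, which uses the Pearson method by default. The variable names are assumptions chosen for this sketch. R works with the unrounded r, so the t-value it reports may come out a little lower than the figure quoted above.

```r
# Data from the table above
grade <- c(79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82)
words <- c(147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150)

# Pearson product-moment correlation test
result <- cor.test(grade, words)
result$estimate    # r, about 0.93
result$statistic   # t-value for r
result$parameter   # degrees of freedom (n - 2 = 10)
result$p.value

# Two-tailed critical t at the 0.05 level with 10 degrees of freedom
qt(0.975, df = 10)
```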

IMPORTANT NOTE

When a null hypothesis is rejected, there are several possible explanations for the relationship between the variables.

  • Direct cause and effect
  • The relationship between X and Y may be due to the influence of a third variable not in the model. For example, foot size and vocabulary. Older people have bigger feet and also a larger vocabulary. Thus, on its own, it is a nonsense relationship
  • This could be a chance relationship

Two-Way Analysis of Variance

Two-way analysis of variance is used when we want to know the following pieces of information.
• The means of the blocks or subpopulations
• The means of the treatment groups
• The means of the interaction of the subpopulation and treatment groups

Now you are probably confused, but remember that two-way analysis of variance is an extension of randomized block design. With randomized block design, there were two hypotheses, one for the treatment groups and one for the blocks or subpopulations. What we are adding in two-way analysis is an assessment of the interaction effect, which is the amount of variation due to the combination of subpopulation and treatment group. The assessment of the interaction effect gives us the third hypothesis. To put it in simple words, when both the subpopulation and the treatment are present together, they may have some sort of combined influence beyond what each has on its own. Therefore, two-way analysis of variance is randomized block design plus an interaction effect hypothesis.

Another important difference is the use of repeated measures. In a two-way analysis of variance, at least one of the groups received the treatment more than once. In a randomized block design, each group receives the treatment only one time. Your research questions determine if any group needs to experience the treatment more than once.

Below are the assumptions
• Sample randomly selected
• Populations have homogeneous standard deviations
• Population distributions are normal
• Population covariances are equal.

Here are the steps
1. Set up hypotheses (there will be three of them)
  a. Treatment means (AKA factor A)
    i. H0: There is no difference in the treatment means
    ii. H1: H0 is false
  b. Block means (AKA factor B)
    i. H0: There is no difference in the block means
    ii. H1: H0 is false
  c. Interaction between Factor A and B
    i. H0: There is no interacting effect between factor A & B
    ii. H1: There is an interacting effect between factor A & B
2. Determine your level of statistical significance
3. Determine F critical (there will be three now and the computer does this)
4. Calculate the F-test values (there will be three now and the computer does this)
5. Test hypotheses
6. State conclusion

Here is an example
A music teacher wants to study the effect of instrument type and service center on the repair time measured in minutes. Four instruments (sax, trumpet, clarinet, flute) were picked for the analysis. Each service center was assigned to perform the particular repair on two instruments in each category

                              Instrument
Service center    Sax    Trumpet    Clarinet    Flute
1                 60     50         58          60
                  70     56         62          64
2                 50     53         48          54
                  54     57         64          46
3                 62     54         46          51
                  64     66         52          49

Here are your research questions
• Is there a difference in the means of the repair time between service centers?
• Is there a difference in the means of the repair time between instrument type?
• Is there an interaction due to service center and type of instrument on the mean of the repair time?

Let us go through each of our steps

Step 1: State the hypotheses
• Treatment means (AKA factor A)
  a. H0: There is no difference in the means of the service centers
  b. H1: H0 is false
• Block means (AKA factor B)
  a. H0: There is no difference in means of the instrument types
  b. H1: H0 is false
• Interaction between Factor A and B
  a. H0: There is no interacting effect between service center and instrument type
  b. H1: There is an interacting effect between service center and instrument type

Step 2: Significance level
• Set at 0.1

Step 3: Determine F-Critical
For the service centers, F-critical is 2.81
For the instruments, F-critical is 2.61
For the interaction effect, F-critical is 2.33

Step 4: Calculate F-values
Service centers 3.2
Instrument type 1.4
Interaction 2.1

Step 5: Make decision
Since the F-value of 3.2 is greater than the F-critical of 2.81, we reject the null hypothesis for the service centers.

Since the F-value of 1.4 is less than the F-critical of 2.61, we do not reject the null hypothesis for the instrument types.

Since the F-value of 2.1 is less than the F-critical of 2.33, we do not reject the null hypothesis for the interaction effect of service center and instrument type.

Step 6: Conclusion
Since we reject the null hypothesis that there is no difference in the means of the repair time of the service centers, we conclude that there is evidence of a difference in the repair times between service centers. This means that at least one service center differs from the others in repair time. To find out which, do a post hoc test.

Since we do not reject the null hypothesis that there is no difference in the means of the repair time of the instrument types, we conclude that there is no evidence of a difference in the repair time between instrument types. In other words, it does not matter what type of instrument is being fixed as they will all take about the same amount of time.

Since we do not reject the null hypothesis that there is no interaction effect of service center and instrument type on the mean of the repair time, we conclude that there is no evidence of an interaction effect of service center and instrument type on repair time. In other words, if service center and instrument type are considered at the same time there is no difference in how fast the instruments are repaired.
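
As a sketch of how this example might be run in R, the built-in aov() function can fit both factors and their interaction, and qf() can supply the F-critical values. The data frame name repair and its column names are assumptions chosen for this sketch.

```r
# Repair times in minutes, entered instrument by instrument from the table.
# Within each instrument the order is center 1 (two repairs), center 2, center 3.
repair <- data.frame(
  center = factor(rep(rep(1:3, each = 2), times = 4)),
  instrument = factor(rep(c("Sax", "Trumpet", "Clarinet", "Flute"), each = 6)),
  minutes = c(60, 70, 50, 54, 62, 64,   # Sax
              50, 56, 53, 57, 54, 66,   # Trumpet
              58, 62, 48, 64, 46, 52,   # Clarinet
              60, 64, 54, 46, 51, 49)   # Flute
)

# center * instrument fits both main effects plus their interaction
fit <- aov(minutes ~ center * instrument, data = repair)
anova_table <- summary(fit)[[1]]
anova_table   # rows: center, instrument, interaction, residuals

# F-critical values at the 0.1 level for 2, 3, and 6 degrees of freedom,
# each against 24 - 12 = 12 error degrees of freedom
qf(0.90, df1 = c(2, 3, 6), df2 = 12)
```

For the post hoc step, one option built into R is TukeyHSD(fit, "center"), which compares the service centers pair by pair.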

Analysis of Variance: Randomized Block Design

Randomized block design is used when a researcher wants to compare treatment means. What is unique to this research design is that the experiment is divided into two or more mini-experiments.

The reason behind this is to reduce the variation within treatments so that it is easier to find differences between means. Another unique characteristic of randomized block design is that, since there is more than one experiment happening at the same time, there will be more than one set of hypotheses to consider. There will be a set of hypotheses for the treatment groups and also for the block groups. The block groups are the several subpopulations within the sample. Below are the assumptions.

  • Samples are randomly selected
  • Populations are homogeneous
  • Populations are normally distributed
  • Population covariances are equal
    •  Covariance is a measure of how much two variables deviate together from their expected values. If two variables deviate in similar ways, the covariance will be high, and vice versa. The standardized version of covariance is correlation.
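
To make the idea of covariance and correlation concrete, here is a minimal sketch in Python. The language choice, the function names, and the small data set (hours of study and quiz scores) are all assumptions made up for illustration. They are not part of the example that follows.

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of paired deviations (n - 1 in the denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation is covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and quiz score for five students.
hours = [1, 2, 3, 4, 5]
quiz_scores = [52, 55, 61, 64, 68]

cov = covariance(hours, quiz_scores)
corr = correlation(hours, quiz_scores)
print(cov, round(corr, 3))
```

Because the two made-up variables rise together, the covariance is positive and the correlation is close to 1.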

Looking at equations and doing this by hand is tough. It is better to use SPSS or Excel to calculate results. We are going to look at an example and see an application of randomized block design.

A professor wants to see if “time of day” affects his students’ scores on a quiz. He randomly divides his stat class into five groups and has them take the quiz at one of four times during the day. Below are the results.
Time Period/Treatment
Section   8-9   10-11   11-12   1-2
1         25    22      20      25
2         28    24      29      23
3         30    25      25      27
4         24    27      28      25
5         21    28      30      24

The treatment groups here are the time periods. They are along the top and are 8-9, 10-11, 11-12, and 1-2. The block groups are along the left-hand side and they are sections 1, 2, 3, 4, and 5. The block groups are the 5 different experimental groups of the larger population of the statistics class. What is happening here is that members from every group took the quiz at one of the four times. For example, members from section one took the quiz at 8-9, 10-11, 11-12, and 1-2. The same goes for group 2 and so forth. By having five different groups take the quiz at each of the time periods, it should hopefully improve the accuracy of the results. It is like sampling a population five times instead of one time.

In addition, by having four different time periods, we can hopefully see much more clearly if the time period makes a difference. We have four different time periods instead of two or three. Below are the steps for solving this problem.

Step 1: State hypotheses
For Time periods
Null hypothesis: There is no difference in the means between time periods
Alternative hypothesis: There is a difference in the means between time periods
For Blocks
Null hypothesis: There is no difference in the means among the sections of students
Alternative hypothesis: There is difference in the means among the sections of students

Step 2: Significance level
Our alpha is set to .05

Step 3: Critical value of F
This is done by the computer and it indicates that the F critical for the treatment (time periods) is 3.49 and the F critical for the blocks (section of students) is 3.26. There are two F criticals because there are two sets of hypotheses, one for the time periods and one for the students.

Step 4: Calculate
The computed F-value for treatment (time periods) is 0.25
The computed F-value for the blocks (section of students) is 0.89
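
The two F-values above come from software, but they can be reproduced from the table. Below is a minimal sketch in Python (the language and the function name `randomized_block_f` are assumptions for illustration, since the post itself suggests SPSS or Excel). It splits the total variation into treatment, block, and error parts and divides each mean square by the error mean square.

```python
# Quiz scores from the table: one row per section (block),
# one column per time period (treatment).
scores = [
    [25, 22, 20, 25],  # section 1
    [28, 24, 29, 23],  # section 2
    [30, 25, 25, 27],  # section 3
    [24, 27, 28, 25],  # section 4
    [21, 28, 30, 24],  # section 5
]

def randomized_block_f(table):
    """Return (F for treatments, F for blocks) for a blocks-by-treatments table."""
    n_blocks = len(table)
    n_treatments = len(table[0])
    n_total = n_blocks * n_treatments
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n_total

    # Total, treatment (column), and block (row) sums of squares.
    ss_total = sum(x ** 2 for row in table for x in row) - correction
    column_totals = [sum(row[j] for row in table) for j in range(n_treatments)]
    ss_treatment = sum(t ** 2 for t in column_totals) / n_blocks - correction
    ss_block = sum(sum(row) ** 2 for row in table) / n_treatments - correction
    ss_error = ss_total - ss_treatment - ss_block

    # Mean squares: each sum of squares divided by its degrees of freedom.
    ms_treatment = ss_treatment / (n_treatments - 1)
    ms_block = ss_block / (n_blocks - 1)
    ms_error = ss_error / ((n_treatments - 1) * (n_blocks - 1))
    return ms_treatment / ms_error, ms_block / ms_error

f_treatment, f_block = randomized_block_f(scores)
print(round(f_treatment, 2), round(f_block, 2))  # 0.25 0.89
```

The degrees of freedom used here are 3 for the time periods, 4 for the sections, and 12 for the error, which are the values needed to look up the two F criticals.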

Step 5: Decision
Since the F-value of the treatment (time periods) of 0.25 is less than the F critical of 3.49 at an alpha of .05, we do not reject the null hypothesis

Since the F-value of the blocks (section of students) of 0.89 is less than the F critical of 3.26 at an alpha of .05, we do not reject the null hypothesis

Step 6: Conclusion
Treatment (Time period)
Since we did not reject the null hypothesis, we can conclude that there is no evidence that time of day affects the quiz scores.

Blocks (Section of Student)
Since we did not reject the null hypothesis, we can conclude that there is no evidence that group affects the quiz scores.

From this, we see no evidence that time of day or the group a student belongs to matters. If the time of day had mattered, it might have been due to a host of factors such as early morning or late afternoon. For the groups, the difference could be identified by how they did on individual items. Maybe they struggled with finding the mean in question 3.

Remember in this example there was no difference. The ideas above are for determining why there was a difference if that had happened.

One-Way Analysis of Variance (ANOVA)

Analysis of variance is a statistical technique that is used to determine if there is a difference in the means of two or more sample populations. Z-tests and t-tests are used when comparing one sample population to a known value or two sample populations to each other. When more than two sample populations are involved, it is necessary to use analysis of variance. The simple rule is: three or more, use analysis of variance.

Analysis of variance is tedious to do by hand, even though it is possible. It takes a great deal of time and one error will ruin the answer. Therefore, we are not going to look at equations during this example. Instead, we will focus on the hypotheses and practical applications of analysis of variance. To calculate analysis of variance results you can use SPSS or Microsoft Excel.

There are several types of analysis of variance. We are going to first look at one-way analysis of variance.

Here are the assumptions for one-way analysis of variance

  • Samples are randomly selected
  • Samples are independently assigned
  • Samples are homogeneous
  • Sample is normally distributed

One-way analysis of variance is used when three or more groups receive the same treatment or intervention. The treatment is the independent variable while the mean of each group is the dependent variable. This is because as the researcher, you control the treatment but you do not control the resulting mean that is recorded. One-way analysis of variance is often used in performing experiments.

Let’s look at an example. You want to know if there is any difference in the average life of four different breeds of dogs. You take a random sample of five dogs from each of four different breeds. Below are the results.

Terrier   Retriever   Hound   Bulldog
12        11          12      12
13        10          11      15
14        13          15      10
11        15          15      12
15        14          16      11

In this example, the independent variable is the breed of dog. This is because you control this. You can select whatever dog breed you want. The dependent variable is the average length of the dogs’ lives. You have no control over how long they live. You are trying to see if dog breed influences how long the dog will live.

Here are the hypotheses

Null hypothesis: There is no difference in the average length of a dog’s life because of breed

Alternative hypothesis: There is a difference in the average length of a dog’s life because of breed

The significance level is 0.05 and our F critical is 3.24.

After running the results in the computer we get an F-value of 0.76. Since 0.76 is less than the F critical of 3.24, we do not reject our null hypothesis. This means that there is no evidence of a difference in the average life of the dog breeds in this study.
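
For readers who want to see where the F-value comes from, here is a minimal sketch in Python (the language and the function name `one_way_anova_f` are assumptions for illustration, since the post suggests SPSS or Excel). It compares the variation between the breed means with the variation within each breed.

```python
# Years of life for the five sampled dogs of each breed, from the table.
lifespans = {
    "Terrier": [12, 13, 14, 11, 15],
    "Retriever": [11, 10, 13, 15, 14],
    "Hound": [12, 11, 15, 15, 16],
    "Bulldog": [12, 15, 10, 12, 11],
}

def one_way_anova_f(groups):
    """Return the F-value: mean square between groups over mean square within groups."""
    values = [x for group in groups for x in group]
    grand_mean = sum(values) / len(values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_value = one_way_anova_f(list(lifespans.values()))
print(round(f_value, 2))  # 0.76
```

The degrees of freedom are 3 between breeds and 16 within breeds, which are the values used to look up the F critical.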

One-way analysis is used when we have one treatment and three or more groups that experience the treatment. This statistical tool is useful for research designs that call on the need for experiments.

Testing the Difference Between two Means: Paired Samples

The paired sample t-test is used to compare two sample populations that are correlated. This test is most commonly employed for “before and after” or pretest-posttest designs. Below are the assumptions of the paired sample t-test.

  • Only the matched pairs are used to perform the analysis
  • The data is normally distributed
  • The variances of the two samples are homogeneous
  • The observations are independent of each other

Below are the steps involved in conducting a paired sample t-test

  1. Set up the hypotheses
    • H0: The means of the paired samples are the same or (μ1 = μ2)
    • H1: The means of the paired samples are not equal or
      (μ1 ≠ μ2, μ1 > μ2, μ1 < μ2)
  2. Determine the level of statistical significance (.1, .05, or .01) and if it is two-tailed or one-tailed
    • Two-tailed means there are two choices. One mean can be greater or lesser than the other.
    • One-tailed means there is one choice. One of the means is greater or it is lesser, but not both.
  3. You also must take into account the degrees of freedom, which is sample size – 1. This information is useful when looking at the t-test chart to find the t critical value.
  4. Calculate the paired t-test. The formulas are below. Take note that there are three separate formulas labeled A, B, and C. Here D is the difference within each pair and n is the sample size (the number of pairs).

FORMULA A    t computed = mean difference / (standard deviation of the difference / square root of the sample size)

\[ t = \frac{\bar{D}}{s_D / \sqrt{n}} \]

FORMULA B    mean difference = sum of the differences / sample size

\[ \bar{D} = \frac{\sum D}{n} \]

FORMULA C    standard deviation of the difference

\[ s_D = \sqrt{\frac{\sum D^2 - \frac{(\sum D)^2}{n}}{n - 1}} \]

Sorry, there is no simple way to explain formula C.

  5. Make statistical decision

  6. State conclusion

Below is an example

A teacher develops an incentive plan for his students. Students who were quiet got additional stickers in their notebook. The teacher picked 10 students at random to see if the number of stickers they earned was more after the incentive program was adopted. Here are the results

Student   Before   After
1         20       35
2         30       41
3         25       38
4         31       42
5         19       18
6         18       16
7         23       34
8         32       19
9         24       24
10        19       33

Step 1: State the hypotheses
H0: μD ≤ 0 or the number of stickers after the incentive plan is not more than before
H1: μD > 0 or the number of stickers is greater after the incentive plan
(μD is the mean of the differences, after minus before)

Step 2: The level of statistical significance is .05. This is also a one-tailed test.

Step 3: Calculate the critical region
degrees of freedom = sample size – 1, which is used to look up the t critical
df = 10 – 1 = 9 and the t critical is 1.83 according to the table

Step 4: Compute t computed for paired samples

Student   Before   After   Difference   Difference²
1         20       35      15           225
2         30       41      11           121
3         25       38      13           169
4         31       42      11           121
5         19       18      -1           1
6         18       16      -2           4
7         23       34      11           121
8         32       19      -13          169
9         24       24      0            0
10        19       33      14           196
Sum of the differences: 59     Sum of the differences²: 1127

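Find the mean of the difference
59 / 10 = 5.9

Find the standard deviation of the differences (the square root is taken of the entire expression below). Note that formula C uses the square of the sum of the differences, (59)², not the square of the mean.

\[ s_D = \sqrt{\frac{1127 - \frac{(59)^2}{10}}{10 - 1}} = \sqrt{\frac{1127 - 348.1}{9}} \approx 9.30 \]

the standard deviation is 9.30

Find the t computed. Formula A uses the mean difference of 5.9 in the numerator.

\[ t = \frac{5.9}{9.30 / \sqrt{10}} \approx 2.01 \]

Step 5: Decision

Since the t computed of 2.01 is greater than the t critical of 1.83, we reject the null hypothesis

Step 6: Conclusion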

Since we reject the null we can conclude that there is evidence that the incentive program has increased the number of stickers the students earn.
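
The calculation in Step 4 can be checked with a short script. This is a minimal sketch in Python (the language choice and the variable names are assumptions, not part of the original example) that applies formulas A, B, and C to the sticker data. Note that formula C uses (ΣD)², the square of the sum of the differences, and that formula A uses the mean difference in the numerator.

```python
import math

# Stickers earned by the 10 students before and after the incentive plan.
before = [20, 30, 25, 31, 19, 18, 23, 32, 24, 19]
after = [35, 41, 38, 42, 18, 16, 34, 19, 24, 33]

# Difference for each student (after minus before).
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)

sum_d = sum(diffs)                   # sum of the differences
sum_d2 = sum(d * d for d in diffs)   # sum of the squared differences

# Formula B: mean difference.
mean_diff = sum_d / n
# Formula C: standard deviation of the differences.
sd_diff = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))
# Formula A: t computed.
t_computed = mean_diff / (sd_diff / math.sqrt(n))

t_critical = 1.83  # one-tailed, alpha = .05, df = 9, taken from the t table
print(sum_d, sum_d2, round(sd_diff, 2), round(t_computed, 2))
print("reject null" if t_computed > t_critical else "do not reject null")
```

With these data the t computed is about 2.01, which is greater than the t critical of 1.83, so the decision is to reject the null hypothesis.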

Hypothesis Testing for Two Means: Large Independent Samples

Hypothesis testing for two large samples examines again if there is a difference between the two means. We infer that there is a difference between the population means by seeing if there is a difference between the sample means. The assumptions for testing for the difference between two means are below.

  • Subjects are randomly selected and independently assigned to groups
  • Population is normally distributed
  • Sample size is greater than 30

The hypotheses can be stated as follows

  • Null hypothesis: There is no difference between the population means of the two groups
    • The technical way to say this is…  H0: μ1 = μ2
  • Alternative hypothesis: There is a difference between the population means of the two groups. One is greater or smaller than the other
    • The technical way to say this is… H1: μ1≠ μ2 or μ1> μ2 or         μ1< μ2

The process for conducting a z test for independent samples is provided below

  1. Develop your hypotheses
  2. Determine the level of significance (normally .1, .05, or .01)
  3. Decide if it is a one-tail or two tail test.
  4. Determine the critical value of z. This is found in a chart in the back of most stat books. Common values include 1.64, 1.96, and 2.33.
  5. Calculate the means and standard deviations of the two samples.
  6. Calculate the test for the two independent samples. Below is the formula

z = (sample mean 1 – sample mean 2) / √[(standard deviation of sample 1 squared / sample size 1) + (standard deviation of sample 2 squared / sample size 2)]

\[ z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

7. If the computed z is less than the critical z then you do not reject your null hypothesis. This means there is no difference between the means. If the computed z is greater than the critical z then you reject the null hypothesis and this indicates that there is evidence that there is a difference.

Below is an example

A businessman is comparing the price of buildings in two different provinces to see if there is a difference. Below are the results. Determine if the buildings in Bangkok cost more than the buildings in Saraburi.

                      Bangkok      Saraburi
average price         2,140,000    1,970,000
standard deviation    226,000      243,000
sample size           47           45

Now let us go through the steps

  1. Develop your hypotheses
    • Null hypothesis: There is no difference between the average price of buildings in Bangkok and Saraburi
      • In stat language, it would be
      • H0: μ1 = μ2
    • Alternative hypothesis: The  average price of buildings in Bangkok is higher than in  Saraburi
      • In stat language, it would be
      • H1: μ1 > μ2
  2. Determine the level of significance (normally .1, .05, or .01)
    • We will select .05
  3. Decide if it is a one-tail or two tail test.
    • This is a one-tail test. We want to know if one mean is greater than another. Therefore, to reject the null we need a z computed that is positive and larger than our z critical.
  4. Determine the critical value of z. This is found in a chart in the back of most stat books. Common values include 1.64, 1.96, and 2.33.
    • Our z critical is +1.64. Since this is a one-tail test we only have one value, so we do not split the probability and place half on one side and half on the other side. If this were two-tailed we would have -1.96 and +1.96, which indicates that the difference could be greater or less.
  5. Calculate the means and standard deviations of the two samples.
    • Already done in the table above
  6. Calculate the test for the two independent samples. Below is the formula.

\[ z = \frac{2{,}140{,}000 - 1{,}970{,}000}{\sqrt{\frac{(226{,}000)^2}{47} + \frac{(243{,}000)^2}{45}}} \approx 3.47 \]
Our final answer for our z computed is 3.47.

Since 3.47 is greater than our z critical of +1.64 we reject the null hypothesis and state that there is evidence that building prices are higher in Bangkok than in Saraburi.
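
The same calculation can be sketched in Python (the language and the function name `two_sample_z` are assumptions for illustration). The sketch treats 226,000 and 243,000 as the standard deviations of the two samples, which is how the worked calculation above uses them, since each is squared before being divided by its sample size.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z for the difference between two independent sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Bangkok is sample 1, Saraburi is sample 2.
z_computed = two_sample_z(2_140_000, 226_000, 47, 1_970_000, 243_000, 45)

# One-tailed critical value at alpha = .05 from the standard normal distribution.
alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha)

print(round(z_computed, 2), round(z_critical, 2))
print("reject null" if z_computed > z_critical else "do not reject null")
```

The critical value comes out at about 1.64 (more precisely 1.645), matching the value read from the chart.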

What is a One Sample z Test?

There are actually several different situations in which a researcher can use hypothesis testing. The first instance we will look at is the one sample z test. The one sample z test has the following assumptions that need to be met before employing it.

  • Sample size > 30
  • Subjects are randomly selected
  • Population is normally distributed
  • Cases within the sample are independent
  • One sample was taken

If your data collection meets the above assumptions one sample z test may be appropriate.

With the one sample z test, you are comparing your results to a known expected value. For example, if someone states that the average salary for teachers is $63,000.00, you can assess this by collecting data from teachers to compare it to this known value. You collect some data and you find that the average salary for 35 teachers was $65,700.00. The question you have is, who is right? Do teachers really make on average $63,000.00 like the report says, or do they make $65,700.00 as your data says? Before going further let us establish our hypotheses for this example.

  • Null hypothesis: the average salary for my sample of teachers will be the same as the reported average salary of $63,000.00
    1. The mathematical shorthand for this is H0: μ = 63,000.00
  • Alternative hypothesis: the average salary for my sample of teachers will be different (greater or lesser) than the reported average salary of $63,000.00
    1. The mathematical shorthand for this is H1: μ ≠ 63,000.00

Keep in mind that this is a two-tail level of significance. This is because our final value has the option of being either greater or lesser than $63,000.00. Two-tail means two options, greater or lesser than the expected value, while one-tail means only one option: either we expect greater or we expect lesser, but not both. This is why we will have two z critical values to think about in the near future.

We also need two more pieces of information before we put our numbers into the equation. The two items we need to know are the standard deviation of the sample and the level of statistical significance. For the sample data we collected, we will say the standard deviation is $5,250.00 and the level of statistical significance is α = 0.01. When we convert this alpha value to the z critical value we get 2.58 and -2.58 because we are using a two-tail or two option approach. Do not get distracted by the z critical value. It is the same as the alpha value but translated for the numbers set to the normal distribution. It is similar to switching from one language to another, same meaning but different language.

If our final value is greater than 2.58 or less than -2.58 we will reject the null hypothesis that average teacher salaries are $63,000.00. Now we can take a look at the equation

z computed = (sample mean – expected value) / (sample standard deviation / square root of the number of those in the sample)

In simple English

z computed = (65,700 – 63,000) / (5,250 / square root of 35)

z computed = 3.04

Our answer is 3.04, which is greater than +2.58. This indicates that we can reject the null hypothesis that the average teacher salary is $63,000.00, as our data indicate that there is evidence that teachers make more on average.
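
As a check on the arithmetic, here is a minimal sketch in Python (the language and the variable names are assumptions for illustration). It computes the z value for the teacher salary example and the two-tailed critical value for α = 0.01 directly from the standard normal distribution, which comes out at about 2.58.

```python
import math
from statistics import NormalDist

sample_mean = 65_700     # average salary of the sampled teachers
expected_mean = 63_000   # reported average salary being tested
sample_sd = 5_250        # standard deviation of the sample
n = 35                   # number of teachers in the sample
alpha = 0.01             # level of statistical significance

# z computed: distance between the sample mean and the expected value,
# measured in standard errors.
z_computed = (sample_mean - expected_mean) / (sample_sd / math.sqrt(n))

# Two-tailed test: alpha is split in half between the two tails.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_null = abs(z_computed) > z_critical
print(round(z_computed, 2), round(z_critical, 2), reject_null)
```

Since the z computed of about 3.04 falls beyond the critical value, the decision to reject the null hypothesis holds.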

We don’t want to get too excited here. We found evidence that teachers make more but further testing would be needed to validate these claims. As more data confirms our findings we can confidently state that teachers make more.

I would like to thank andydevil12 for the question and suggestion. If there are any other questions please send them to me as they help me to understand research and statistics much better as well.

What is Hypothesis Testing?

Hypothesis testing is a statistical approach used in making decisions about data.  In hypothesis testing, there are two hypotheses that are posed by the researcher and they are…

  1. Null hypothesis-There is no difference between the sample population and the statistical population in relation to the mean or some other parameter that is being assessed
  2. Alternative hypothesis-There is a difference between the sample population and the statistical population in relation to the mean or some other parameter that is being assessed

Generally, researchers hope to reject the null hypothesis, which provides support for the alternative hypothesis. However, strictly speaking, a researcher never accepts any hypothesis. Instead, you reject or you do not reject the null hypothesis. This is because further testing will always be needed to confirm the results.

How to know whether to reject or not reject the null depends on the results of the analysis. A researcher needs to select a level of statistical significance which is usually 1%, 5%, or 10%. The significance level changes the size of the rejection region at the tails of the normal distribution. The lower the significance level the smaller the rejection region which influences the interpretation of the results.  To reject a null hypothesis, the results of the analysis must fall within the rejection region.
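
To see how the significance level changes the rejection region, here is a minimal sketch in Python (the language and the variable names are assumptions for illustration). It converts the three common significance levels into critical z values on the standard normal distribution. A larger critical value means a smaller rejection region.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

critical_values = {}
for alpha in (0.10, 0.05, 0.01):
    # One-tailed: the whole rejection region sits in one tail.
    one_tailed = standard_normal.inv_cdf(1 - alpha)
    # Two-tailed: the rejection region is split evenly between both tails.
    two_tailed = standard_normal.inv_cdf(1 - alpha / 2)
    critical_values[alpha] = (one_tailed, two_tailed)
    print(f"alpha={alpha:.2f}  one-tailed z={one_tailed:.3f}  two-tailed z=±{two_tailed:.3f}")
```

Moving from 10% to 1% pushes the two-tailed cutoff from about 1.645 out to about 2.576, so a result has to be more extreme before the null hypothesis is rejected.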

After determining the level of significance a researcher analyzes the data to determine the results. The results then need to be interpreted by stating them in simple English.  From this, the researcher can develop a conclusion about what the results mean.

Sampling Part II

Random sampling has already been discussed. This post deals with non-random sampling which is a sample that is selected in a deliberate manner.  Below are a few of the more common forms of non-random sampling

  1. Convenience sampling is the selection of individuals who are available for the study.  Whoever is free and willing is a part of the study.
  2. Purposive sampling is the inclusion of participants based on criteria developed by the researcher. For example, a researcher wants to only include middle-aged male teachers in their study. Individuals who meet these criteria will be asked to be a part of the study.
  3. Quota sampling is used when a researcher collects data from a certain number of people from several sample units or sub-populations who meet certain criteria. For example, at a university, selecting students from several different departments such as English and Education to be a part of the study. Whoever is from either department may be a part of the study.
  4. Snowball sampling is a technique in which the researcher locates one member of the sample population and collects data from them. The participant then recommends other people the researcher can collect data from. An example would be a detective interviewing various people about a crime. One witness suggests someone else the detective should talk to. This form of sampling is common in qualitative research.

Sampling is normally influenced by circumstances. Random sampling is often preferred to non-random, but the context often dictates that a researcher do the best they can and choose a technique that is appropriate for the situation.

Sampling

There are a plethora of sampling approaches in research. Below is a partial list of the more common approaches.  Please remember that the population is the group you are studying. The sample is a smaller portion of the population. Often it is not practical to collect data from an entire population. Therefore, researchers collect data from a sample and make inferences about the population based on the sample.

The sampling approaches below are all forms of random sampling, which is a process in which any member of the population has an equal likelihood of being selected as part of the sample. Non-random sampling will be dealt with in a future post.

  1. Simple random sampling. The sample is derived via random numbers or lottery from the population
  2. Systematic sampling. The selection of every kth element in the population. For example, selecting every fifth student at a school
  3. Stratified sampling. Subdividing the population into subgroups and taking members at random from each subgroup. This helps to replicate the proportions of the population in the sample. For example, if a school is 75% men and 25% women these same proportions need to exist in the sample population
  4. Cluster sampling. Randomly selecting clusters from a population that is spread over a large geographical area. For example, subdividing a country into provinces and then randomly selecting some of the provinces to participate in the study

The sampling approach you use is determined by the purpose of the research, finances, and practicality.
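
The four approaches above can be sketched in a few lines of Python. This is only an illustration: the language, the made-up population of 1,000 students (75% men, 25% women, spread evenly over 10 provinces), and the sample sizes are all assumptions, not data from a real study.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 students, 75% men and 25% women,
# spread evenly across 10 provinces.
population = [
    {"id": i, "sex": "M" if i < 750 else "F", "province": i % 10}
    for i in range(1000)
]

# 1. Simple random sampling: every member has an equal chance of selection.
simple = rng.sample(population, 100)

# 2. Systematic sampling: every kth element after a random starting point.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# 3. Stratified sampling: draw 10% at random from each subgroup,
#    which keeps the 75/25 proportions of the population.
stratified = []
for sex in ("M", "F"):
    stratum = [person for person in population if person["sex"] == sex]
    stratified += rng.sample(stratum, len(stratum) // 10)

# 4. Cluster sampling: randomly pick 3 provinces and keep everyone in them.
chosen_provinces = rng.sample(range(10), 3)
cluster = [p for p in population if p["province"] in chosen_provinces]

men_in_stratified = sum(1 for p in stratified if p["sex"] == "M")
print(len(simple), len(systematic), len(stratified), men_in_stratified, len(cluster))
```

Notice that the stratified sample holds exactly 75 men and 25 women, while the simple random sample only matches those proportions approximately.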

Levels of Measurement Part II

There are four levels of measurement used in statistics. They are nominal, ordinal, interval, and ratio. This post focuses on the last two levels of measurement of interval and ratio.

Interval level of measurement is used to classify and differentiate between categories based on how different they are.  The difference is determined by amount and direction.  The difference can also be discrete (finite difference) or continuous (infinite amount of difference). An example of an interval level is temperature. Temperature indicates the difference in hot and cold, you can tell the direction whether it is increasing or decreasing, and it is continuous in that there are an infinite number of potential temperatures.

Ratio level of measurement is the same as interval with the only difference being that it has an absolute zero. One example is weight. It has all the characteristics of a continuous interval variable (there is direction, amount, and infinite amount of difference). The only difference is that nothing can have a negative weight. The temperature, on the other hand, can go negative (for the sake of illustration please ignore that temperature has an absolute zero).

Levels of Measurement

A variable can be measured several different ways. This variety in variable measurement is broken down into four levels. These levels are nominal, ordinal, interval, and ratio. In this blog, I will talk about nominal and ordinal and I will address interval and ratio in the next post.

Nominal data is data that is broken into separate and discrete categories. The categories are mutually exclusive which means that no data can be in more than one category. Nominal data is also exhaustive in that all the data must go into one of the categories.  This is one of the weakest forms of measurement because differences within the category cannot be accounted for because all data is forced to conform to a category. Examples of nominal measurement would be gender because everyone who responds must be placed in one category or the other and there is no way for someone to be half male half female when using nominal classifications.

Ordinal measurement is used for ranking data.  At this level, data is still nominal but the order matters. An example would be class standing which is freshman, sophomore, junior,  and senior. The data is nominal in that there are categories but the order matters as a senior is a higher level in comparison to a freshman. There is still no attempt to differentiate within categories which weakens this level of measurement.

What level of measurement to use is dependent on what your research questions are. Research is guided by the question you ask.

Classification of Variables

In addition to the types of variables, there are also several ways to classify variables. Two ways to classify variables are experimental and mathematical.

Experimental classification is used to classify variables by the function they serve in the experiment. In experimental research, we have independent and dependent variables.  Independent variables are variables that are controlled by the researcher and are believed to have an effect on the dependent variable. Dependent variables are affected by the independent variables.

For example, let’s say we want to see how sleep affects GPA. We would manipulate the amount of sleep a person gets, which is the independent variable to see how their GPA changes as GPA is the dependent variable influenced by sleep.

The second type of classification is mathematical. A continuous variable can assume an infinite number of values. An example would be weight or height.

A discrete variable consists of a finite number of values. Examples include gender and the number of computers. You can’t be half a gender you are a man or woman.

What type of variable to use depends again on the research questions of the study.

Population vs Sample

In statistics, one of the most fundamental concepts is the population and sample. A population is all the members of a group. For example, if my population is the United States, I would have to collect data from everyone in the country. This is, to say the least, very challenging.

To deal with this, most studies take a sample from the population. A sample is a portion of the population. Continuing our example, instead of collecting data from everyone in the US, I would collect data from several hundred or thousand people depending on the research question of my study.

There are several different techniques to sampling that will be covered later.  For now, the most important thing to remember is that your research questions and circumstances of the study influence what steps you take. There is rarely one way to do this.