
Experimental Designs: Between Groups

In experimental research, there are two common designs: between group and within group designs. The difference between the two is that a between group design involves two or more groups in an experiment, while a within group design involves only one group.

This post will focus on between group designs. We will look at the following forms of between group design…

  • True/quasi-experiment
  • Factorial Design

True/quasi Experiment

A true experiment is one in which the participants are randomly assigned to different groups. In a quasi-experiment, the researcher is not able to randomly assign participants to different groups.

Random assignment is important in reducing many threats to internal validity. However, there are times when a researcher does not have control over this, such as when they conduct an experiment at a school where classes have already been established. In general, a true experiment is considered methodologically superior to a quasi-experiment.

Whether the experiment is a true experiment or a quasi-experiment, there are always at least two groups that are compared in the study. One group is the control group, which does not receive the treatment. The other group is called the experimental group, which receives the treatment of the study. It is possible to have more than two groups and several treatments, but the minimum for between group designs is two groups.

Another characteristic that true and quasi-experiments have in common is the format that the experiment can take. There are two common formats

  • Pre- and post test
  • Post test only

A pre- and post test design involves measuring the groups of the study before the treatment and after the treatment. The desire normally is for the groups to be statistically the same before the treatment and statistically different after the treatment. The hope is that this difference is due to the treatment.

For example, let’s say you have some bushes and you want to see if the fertilizer you bought makes any difference in the growth of the bushes. You divide the bushes into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You measure the height of the bushes before the experiment to be sure they are the same. Then, you apply the fertilizer to the experimental group, and after a period of time, you measure the heights of both groups again. If the fertilized bushes grow taller than the control group, you can infer that it is because of the fertilizer.
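
As a rough illustration of how such a pre- and post-test comparison might be analyzed in R, here is a minimal sketch with simulated bush heights; all numbers and object names are hypothetical. The groups are compared with t-tests before and after the treatment.

# hypothetical sketch: simulated bush heights for a pre/post comparison
set.seed(1)
pre_control       <- rnorm(20, mean = 50, sd = 5)   # heights before treatment
pre_experimental  <- rnorm(20, mean = 50, sd = 5)
post_control      <- pre_control + rnorm(20, mean = 2, sd = 1)       # natural growth only
post_experimental <- pre_experimental + rnorm(20, mean = 6, sd = 1)  # growth plus fertilizer

t.test(pre_control, pre_experimental)    # groups should not differ before the treatment
t.test(post_control, post_experimental)  # groups should differ after the treatment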

Post-test only design is when the groups are measured only after the treatment. For example, let’s say you have some corn plants and you want to see if the fertilizer you bought makes any difference in the amount of corn produced. You divide the corn plants into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You apply the fertilizer to the experimental group, and after a period of time, you measure the amount of corn produced by both groups. If the fertilized corn produces more, you can infer that it is because of the fertilizer. You never measure the corn beforehand because the plants had not produced any corn yet.

Factorial Design

Factorial design involves the use of more than one treatment variable. Returning to the corn example, let’s say you want to see not only how fertilizer affects corn production but also how the amount of water the corn receives affects production as well.

In this example, you are trying to see if there is an interaction effect between fertilizer and water. When water and fertilizer are both increased, does production increase, does it stay the same, or does increasing one while decreasing the other have its own effect?
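
To give a sense of how a factorial design like this might be analyzed in R, below is a minimal sketch that fits a two-way ANOVA with an interaction term. The ‘corn’ data frame and its values are hypothetical.

# hypothetical sketch: a 2 x 2 factorial design analyzed with aov()
corn <- data.frame(
  production = c(20, 22, 25, 27, 30, 33, 24, 26),
  fertilizer = factor(c("low", "low", "high", "high", "low", "low", "high", "high")),
  water      = factor(c("low", "low", "low", "low", "high", "high", "high", "high"))
)
model <- aov(production ~ fertilizer * water, data = corn)
summary(model)   # the fertilizer:water row tests the interaction effect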

Conclusion

Between group designs such as true and quasi-experiments provide a way for researchers to establish cause and effect. Pre- and post-test formats, as well as factorial designs, are employed to establish relationships between variables.

Calculating Chi-Square in R

There are times when conducting research that you want to know if there is a difference in categorical data. For example, is there a difference in the number of men who have blue eyes versus brown eyes? Or is there a relationship between gender and hair color? In other words, is there a difference in the count of a particular characteristic, or is there a relationship between two or more categorical variables?

In statistics, the chi-square test is used to compare categorical data. In this post, we will look at how you can use the chi-square test in R.

For our example, we are going to use data that is already available in R called “HairEyeColor”. Below is the data

> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

As you can see, the data comes in the form of a three-dimensional table, showing hair and eye color for men and women in separate tables. The current data is unusable for us in terms of calculating differences. However, by using the ‘margin.table’ function we can make the data usable, as shown in the example below.

> HairEyeNew<- margin.table(HairEyeColor, margin = c(1,2))
> HairEyeNew
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

Here is what we did. We created the variable ‘HairEyeNew’ and stored the information from ‘HairEyeColor’ in a single table using the ‘margin.table’ function. Setting the margin argument to c(1,2) keeps the hair (dimension 1) and eye color (dimension 2) dimensions and sums over sex.

Now the data for both sexes is combined into one table.

We now want to see if there is a particular relationship between hair and eye color that is more common. To do this, we calculate the chi-square statistic as in the example below.

> chisq.test(HairEyeNew)

	Pearson's Chi-squared test

data:  HairEyeNew
X-squared = 138.29, df = 9, p-value < 2.2e-16

The test tells us that there is a relationship between hair and eye color; some combinations are more common than would be expected by chance. To determine which combinations of hair and eye color are most common, we will calculate the proportions for the table as seen below.

> HairEyeNew/sum(HairEyeNew)
       Eye
Hair          Brown        Blue       Hazel       Green
  Black 0.114864865 0.033783784 0.025337838 0.008445946
  Brown 0.201013514 0.141891892 0.091216216 0.048986486
  Red   0.043918919 0.028716216 0.023648649 0.023648649
  Blond 0.011824324 0.158783784 0.016891892 0.027027027

As you can see from the table, brown hair and brown eyes are the most common combination (0.20 or 20%), followed by blond hair and blue eyes (0.16 or about 16%).

Conclusion

The chi-square test serves to determine differences among categorical data. This tool is useful for assessing potential relationships among non-continuous variables.

Philosophical Foundations of Research: Epistemology

Epistemology is the study of the nature of knowledge. It deals with such questions as whether there is truth or absolute truth and whether there is one way or many ways to see something. In research, epistemology manifests itself in several views. The two extremes are positivism and interpretivism.

Positivism

Positivism asserts that all truth can be verified and proven scientifically and can be measured and/or observed. This position discounts religious revelation as a source of knowledge, as revelation cannot be verified scientifically. The positivist position is also derived from realism in that there is an external world out there that needs to be studied.

For researchers, positivism is the foundation of quantitative research. Quantitative researchers try to be objective in their research; they try to avoid coming into contact with whatever they are studying, as they do not want to disturb the environment. One of the primary goals is to make generalizations that are applicable in all instances.

Quantitative researchers normally have a desire to test a theory. In other words, they develop one example of what they believe is a truth about a phenomenon (a theory) and test the accuracy of this theory with statistical data. The data determines the accuracy of the theory and the changes that need to be made.

By the late 19th and early 20th centuries, people were looking for alternative ways to approach research. One new approach was interpretivism.

Interpretivism

Interpretivism is the complete opposite of positivism in many ways. Interpretivism asserts that there is no absolute truth but relative truth based on context. There is no single reality but multiple realities that need to be explored and understood.

For interpretivists, there is a fluidity in their methods of data collection and analysis. These two steps are often iterative within the same design. Furthermore, interpretivists see themselves not as outside the reality but as players within it. Thus, they often will share not only what the data says but also their own views and stance about it.

Qualitative researchers are interpretivists. They spend time in the field getting close to their participants through interviews and observations. They then interpret the meaning of these communications to explain a local, context-specific reality.

While quantitative researchers test theories, qualitative researchers build theories. Qualitative researchers gather data and interpret it by developing a theory that explains the local reality of the context. Since the sampling is normally small in qualitative studies, the theories often do not generalize broadly.

Conclusion

There is little purpose in debating which view is superior. Both positivism and interpretivism have their place in research. What matters more is to understand your position and preference and to be able to articulate it in a reasonable manner. Often it is not what a person does or believes that is important but why they believe or do what they do.

Internal Validity

In experimental research design, internal validity is the appropriateness of the inferences made about cause-and-effect relationships between the independent and dependent variables. If there are threats to internal validity, it may mean that the cause-and-effect relationship you are trying to establish is not real. In general, there are three categories of threats to internal validity, which are…

  • Participant threats
  • Treatment threats
  • Procedural threats

We will not discuss all three categories but will focus on participant threats.

Participant Threats

There are several forms of threats to internal validity that relate to participants. Below is a list

  • History
  • Maturation
  • Regression
  • Selection
  • Mortality

History

A history threat to internal validity is the problem of the passage of time from the beginning to the end of the experiment. During this elapsed time, the groups involved in the study may have different experiences. These different experiences are history threats. One way to deal with this threat is to be sure that the conditions of the experiment are the same for all groups.

Maturation

A maturation threat is the problem of how people change over time during an experiment. These changes make it hard to infer whether the results of a study are because of the treatment or because of maturation. One way to deal with this threat is to select participants who develop in similar ways and at similar speeds.

Regression

A regression threat arises when the researcher selects extreme cases to include in the sample. Eventually, these cases regress toward the mean, which impacts the results of the pretest or posttest. One option for overcoming this problem is to avoid outliers when selecting the sample.

Selection

Selection bias is the poor habit of picking people in a non-random way for an experiment. Examples of this include choosing mostly ‘smart’ people for an experiment, or working only with petite women for a study on diet and exercise. Random selection is the strongest way to deal with this threat.

Mortality

Mortality is the loss of participants in a study. It is common for participants to drop out of a study for many reasons. This leads to a decrease in the sample size, which weakens the statistical interpretation. Dealing with this requires using larger sample sizes as well as comparing the data of dropouts with those who completed the study.

Conclusion

Threats to internal validity can ruin a paper whose author has not carefully planned for how these threats work together to skew results. Researchers need to have an idea of what threats are out there as well as strategies that can alleviate them.

Comparing Samples in R

Comparing groups is a common goal in statistics. This is done to see if there is a difference between two groups. Understanding the difference can lead to insights based on statistical results. In this post, we will examine the following statistical tests for comparing samples.

  • t-test/Wilcoxon test
  • Paired t-test

T-test & Wilcoxon Test

The t-test indicates if there is a statistically significant difference between two groups. This is useful if you know what the difference between the two groups is. For example, if you are measuring the height of men and women and a t-test shows that men are taller, you can state that gender influences height, because the only difference between men and women in this example is their gender.

Below is an example of conducting a t-test in R. In the example, we are looking at if there is a difference in body temperature between beavers who are active versus beavers who are not.

> t.test(temp ~ activ, data = beaver2)

	Welch Two Sample t-test

data:  temp by activ
t = -18.548, df = 80.852, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.8927106 -0.7197342
sample estimates:
mean in group 0 mean in group 1 
       37.09684        37.90306

Here is what happened

  1. We use the ‘t.test’ function
  2. Within the ‘t.test’ function we indicate that we want to see if there is a difference in ‘temp’ when compared on the factor variable ‘activ’ in the data ‘beaver2’
  3. The output provides a lot of information. The t-stat is -18.548; any value beyond ± 1.96 indicates a statistically significant difference.
  4. Df stands for degrees of freedom and is used to determine the t-stat and p-value.
  5. The p-value is basically zero. Anything less than 0.05 is considered statistically significant.
  6. Next, we have the 95% confidence interval, which is the range of the difference of the means of the two groups in the population.
  7. Lastly, we have the means of each group. Group 0, the inactive group, had a mean of 37.09684. Group 1, the active group, has a mean of 37.90306.

The t-test assumes that the data is normally distributed. When normality is a problem, it is possible to use the Wilcoxon test instead. Below is the script for the Wilcoxon test using the same example.

> wilcox.test(temp ~ activ, data = beaver2)

	Wilcoxon rank sum test with continuity correction

data:  temp by activ
W = 15, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

A closer look at the output indicates mostly the same results. Instead of the t-stat, the W-stat is used, but the p-value is the same for both tests.

Paired T-Test

A paired t-test is used when you want to compare how the same group of people respond to different interventions. For example, you might use this for a before-and-after experiment. We will use the ‘sleep’ data in R to compare a group of people when they receive different types of sleep medication. The script is below.

> t.test(extra ~ group, data = sleep, paired = TRUE)

	Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences 
                  -1.58

Here is what happened

  1. We used the ‘t.test’ function and indicated we want to see if ‘extra’ (amount of sleep) is influenced by ‘group’ (two types of sleep medication).
  2. We added the new argument ‘paired = TRUE’; this tells R that this is a paired test.
  3. The output is the same kind of information as in the regular t.test. The only difference is at the bottom, where R only tells you the difference between the two groups and not the means of each. For this example, the people slept about an hour and a half longer on the second sleep medication when compared to the first.

Conclusion

Comparing samples in R is a simple process once you understand what you want to do. With this knowledge, the script and output are not too challenging, even for beginners.

Philosophical Foundations of Research: Ontology

Philosophy is a term that is commonly used but hard to define. To put it simply, philosophy explains an individual's or a group's worldview, in general or in a specific context. Such questions as the nature of knowledge, reality, and existence are questions that philosophy tries to answer. There are different schools of thought on these questions, and these are what we commonly call philosophies.

In this post, we will try to look at ontology, which is the study of the nature of reality. In particular, we will define it as well as explain its influence on research.

Ontology

Ontology is the study of the nature of being. It tries to understand the reality of existence. In this body of philosophy, there are two major camps: ontological realism and ontological idealism.

Ontological realism believes that reality is objective. In other words, there is one objective reality that is external to each individual person. We are in a reality and we do not create it.

Ontological idealism is the opposite extreme. This philosophy states that there are multiple realities and each depends on the person. My reality is different from your reality, and each of us builds our own reality.

Ontological realism is one of the philosophical foundations for quantitative research. Quantitative research is a search for an objective reality that accurately explains whatever is being studied.

For qualitative researchers, ontological idealism is one of their philosophical foundations. Qualitative researchers often support the idea of multiple realities. For them, since there is no objective reality, it is necessary to come into contact with people to explain their reality.

Something that has been alluded to but not stated specifically is the role of independence and dependence of individuals. Regardless of whether someone ascribes to ontological realism or idealism, there is the question of whether people are independent of reality or dependent on it. The level of independence and dependence contributes to other philosophies such as objectivism, constructivism, and pragmatism.

Objectivism, Constructivism and Pragmatism

Objectivism is the belief that there is a single reality that is independent of the individuals within it. Again, this is the common assumption of quantitative research. At the opposite end, we have constructivism, which states that there are multiple realities and that they are dependent on the individuals who make each respective reality.

Pragmatism supports the idea of a single reality with the caveat that it is true if it is useful and works. The application of the idea depends upon the individuals, which pushes pragmatism into the realm of dependence.

Conclusion

From this complex explanation of ontology and research comes the following implications

  • Quantitative and qualitative researchers differ in how they see reality. Quantitative researchers are searching for and attempting to explain a single reality while qualitative researchers are searching for and trying to explain multiple realities.
  • Quantitative and qualitative researchers also differ on the independence of reality. Quantitative researchers see reality as independent of people while qualitative researchers see reality as dependent on people
  • These factors of reality and its dependence shape the methodologies employed by quantitative and qualitative researchers.

Experimental Design: Treatment Conditions and Group Comparison

A key component of experimental design involves making decisions about the manipulation of the treatment conditions. In this post, we will look at the following traits of treatment conditions

  • Treatment Variables
  • Experimental treatment
  • Conditions
  • Interventions
  • Outcomes

Lastly, we will examine group comparison.

Among the most common independent variables in experimental design are treatment variables and measured variables. Treatment variables are manipulated by the researcher. For example, if you are looking at how sleep affects academic performance, you may manipulate the amount of sleep participants receive in order to determine the relationship between academic performance and sleep.

Measured variables are variables that are measured but not manipulated by the researcher. Examples include age, gender, height, weight, etc.

An experimental treatment is the intervention of the researcher to alter the conditions of an experiment. Because all other factors are kept constant and only the experimental treatment is manipulated, this allows for the potential establishment of a cause-effect relationship. In other words, the experimental treatment is a term for the use of a treatment variable.

Treatment variables usually have different conditions or levels in them. For example, if I am looking at sleep's effect on academic performance, I may manipulate the treatment variable by creating several categories of the amount of sleep, such as high, medium, and low amounts of sleep.
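
As a small illustration, such a treatment variable could be represented in R as a factor with explicit levels. The values below are hypothetical.

# hypothetical sketch: a treatment variable with three levels of sleep
sleep_condition <- factor(c("low", "medium", "high", "low", "high", "medium"),
                          levels = c("low", "medium", "high"))
table(sleep_condition)   # how many participants fall in each condition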

Intervention is a term that means the actual application of the treatment variables. In other words, I break my sample into several groups and cause one group to get plenty of sleep, a second group to lack a little bit of sleep, and the last group to get no sleep. Experimental treatment and intervention mean the same thing.

The outcome measure is the process of measuring the outcome variable. In our example, the outcome variable is academic performance.

Group Comparison

Experimental design often focuses on comparing groups. Groups can be compared between groups and within groups. Returning to the example of sleep and academic performance, a between group comparison would be to compare the different groups based on the amount of sleep they received. A within group comparison would be to compare the participants who received the same amount of sleep.

Often there are at least three groups in an experimental study: the control, comparison, and experimental groups. The control group receives no intervention or treatment variable. This group often serves as a baseline for comparing the other groups.

The comparison group is exposed to everything but the actual treatment of the study. They are highly similar to the experimental group except for the experience of the treatment. Lastly, the experimental group experiences the treatment of the study.

Conclusion

Experiments involve treatment conditions and groups. As such, researchers need to understand their options for treatment conditions as well as what types of groups they should include in a study.

Examining Distributions In R

Normal distribution is an important term in statistics. When we say normal distribution, we are speaking of the traditional bell curve concept. Normal distribution is important because it is often an assumption of inferential statistics that the distribution of data points is normal. Therefore, one of the first things you do when analyzing data is to check for normality.

In this post, we will look at the following ways to assess normality.

  • By graph
  • By plots
  • By test

Checking Normality by Graph

The easiest and crudest way to check for normality is visually through the use of histograms. You simply look at the histogram and determine how closely it resembles a bell.

To illustrate this we will use the ‘beaver2’ data that is already loaded into R. This dataset contains four variables, “day”, “time”, “temp”, and “activ”, with data about beavers. Day indicates what day it was, time indicates what time it was, temp is the temperature of the beavers, and activ is whether the beavers were active when their temperature was taken. We are going to examine the normality of the temperature of active and inactive beavers. Below is the code.

> library(lattice)
> histogram(~temp | factor(activ), data = beaver2)

Here is what we did

  1. We loaded the ‘lattice’ package. (If you do not have this package, please install it first.)
  2. We used the ‘histogram’ function and indicated the variable ‘temp’, then the vertical bar ( | ) followed by the factor variable ‘activ’ (0 = inactive, 1 = active), which tells lattice to draw a separate panel for each level of ‘activ’
  3. Next, we indicated the dataset we are using, ‘beaver2’
  4. After pressing ‘enter’ you should see the following

[Figure: histograms of temp for inactive (0) and active (1) beavers]

As you look at the histograms, you can say that they are somewhat normal. The peaks of the data are a little high in both. Group 1 is more normal than Group 0. The problem with visual inspection is lack of accuracy in interpretation. This is partially solved by using QQ plots.

Checking Normality by Plots

QQ plots are useful for comparing your data with a normally distributed theoretical dataset. The QQ plot includes a line representing a normal distribution along with the data points from your data for comparison. The more closely your data follows the line, the more normal it is. Below is the code for doing this with our beaver information.

> qqnorm(beaver2$temp[beaver2$activ==1], main = 'Active')
> qqline(beaver2$temp[beaver2$activ==1])

Here is what we did

  1. We used the ‘qqnorm’ function to make the plot
  2. Within the ‘qqnorm’ function we tell R to use ‘temp’ from the ‘beaver2’ dataset.
  3. From the ‘temp’ variable we subset the values that have a 1 in the ‘activ’ variable.
  4. We give the plot a title by adding ‘main = Active’
  5. Finally, we add the ‘qqline’ using most of the previous information.
  6. Below is how the plot should look

[Figure: normal QQ plot of temp for active beavers]

Going by sight again, the data still looks pretty good. However, one last test will provide a more formal check of whether the dataset is normal.

Checking Normality by Test

The Shapiro-Wilk normality test evaluates the hypothesis that the data is normally distributed. The lower the p-value, the stronger the evidence that the data is not normally distributed. Below is the code and results for the Shapiro test.

> shapiro.test(beaver2$temp[beaver2$activ==1])

	Shapiro-Wilk normality test

data:  beaver2$temp[beaver2$activ == 1]
W = 0.98326, p-value = 0.5583

Here is what happened

  1. We use the ‘shapiro.test’ function for ‘temp’ of only the beavers who are active (activ = 1)
  2. R tells us the p-value is about 0.56
  3. Since this is well above the 0.05 cutoff, we fail to reject the hypothesis of normality, so it is reasonable to treat the data as normally distributed.

Conclusion

It is necessary to always test the normality of data before data analysis. The tips presented here provide some framework for accomplishing this.

Characteristics of Experimental Design

In a previous post, we began a discussion on experimental design. In this post, we will begin a discussion on the characteristics of experimental design. In particular, we will look at the following

  • Random assignment
  • Control over extraneous variables

Random Assignment

After developing an appropriate sampling method, a researcher needs to randomly assign individuals to the different groups of the study. One of the main reasons for doing this is to remove the bias of individual differences in all groups of the study.

For example, if you are doing a study on intelligence, you want to make sure that all groups have similar levels of intelligence. This helps the groups to be equated, or to be the same. This prevents people from claiming that differences between the groups are because the groups were different to begin with and not because of the treatment.
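
As a small illustration, random assignment can be carried out in R with the ‘sample’ function. The participant labels below are hypothetical.

# hypothetical sketch: randomly assign 20 participants to two groups
set.seed(42)
participants <- paste0("P", 1:20)
assignment   <- sample(rep(c("control", "treatment"), each = 10))
data.frame(participants, assignment)   # each person gets a randomly chosen group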

Control Over Extraneous Variables

Random assignment directly leads to the concern of controlling extraneous variables. Extraneous variables are any factors that might influence the cause and effect relationship that you are trying to establish. These other factors confound or confuse the results of a study. There are several methods for dealing with this as shown below

  • Pretest-posttest
  • Homogeneous sampling
  • Covariate
  • Matching

Pretest-Posttest

A pretest-posttest design allows a researcher to compare the measurement of something before the treatment and after the treatment. The assumption is that any difference in the scores before and after is due to the treatment. Doing both tests takes into account the confounding of the different contexts of the setting and individual characteristics.

Homogeneous Sampling

This approach involves selecting people who are highly similar on the particular trait that is being measured. This removes the problem of individual differences when attempting to interpret the results. The more similar the subjects in the sample are, the more their traits are controlled for.

Covariate

Covariates are a statistical approach in which controls are placed on the dependent variable through statistical analysis. The influence of other variables is removed from the explained variance of the dependent variable. Covariates help to explain more about the relationship between the independent and dependent variables.

This is a difficult concept to understand. However, the point is that you use covariates to explain in greater detail the relationship between the independent and dependent variable by removing other variables that might explain the relationship.
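
One common way to do this is analysis of covariance (ANCOVA), which fits a linear model that includes the covariate alongside the treatment. Below is a minimal sketch with simulated data; the ‘study’ data frame and its variables are hypothetical.

# hypothetical sketch: ANCOVA removing the influence of a pretest covariate
set.seed(7)
study <- data.frame(
  treatment = factor(rep(c("control", "experimental"), each = 15)),
  pretest   = rnorm(30, mean = 50, sd = 10)
)
study$posttest <- study$pretest + ifelse(study$treatment == "experimental", 5, 0) + rnorm(30, sd = 3)

model <- lm(posttest ~ treatment + pretest, data = study)
summary(model)   # the treatment effect is now adjusted for pretest scores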

Matching

Matching is deliberately, rather than randomly, assigning subjects to various groups. For example, if you are looking at intelligence, you might match high achievers in both groups of the study. By placing high achievers in both groups you cancel out their difference.

Conclusion

Experimental design involves using random assignment to deal with individual differences in a sample. The goal of experimental design is to be sure that the sample groups are mostly the same in a study. This allows for concluding that what happened was due to the treatment.

Two-Way Tables in R

A two-way table is used to explain two or more categorical variables at the same time. The difference between a two-way table and a frequency table is that a two-way table tells you the number of subjects that share two or more variables in common, while a frequency table tells you the number of subjects that share one variable.

For example, a frequency table would be gender. In such a table, you only know how many subjects are male or female. The only variable involved is gender. In a frequency table, you would learn some of the following

  • Total number of men
  • Total number of women
  • Total number of subjects in the study

In a two-way table, you might look at gender and marital status. In such a table you would be able to learn several things

  • Total number of men who are married
  • Total number of men who are single
  • Total number of women who are married
  • Total number of women who are single
  • Total number of men
  • Total number of women
  • Total number of married subjects
  • Total number of single subjects
  • Total number of subjects in the study

As such, there is a lot of information in a two-way table. In this post, we will look at the following

  • How to create a table
  • How to add margins to a table
  • How to calculate proportions in a table

Creating a Table

In the example, we are going to look at two categorical variables. One variable is gender and the other is marital status. For gender, the choices are “Male” and “Female”. For marital status, the choices are “Married” and “Single”. Below is the code for developing the table.

Marriage_Study<-matrix(c(34,20,19,42), ncol = 2)
colnames(Marriage_Study) <- c('Male', 'Female')
rownames(Marriage_Study) <- c('Married', 'Single')
Marriage_table <- as.table(Marriage_Study)
print(Marriage_table)

There has already been a discussion on creating matrices in R. Therefore, the details of this script will not be explained here.

If you type this correctly and run the script you should see the following

        Male Female
Married   34     19
Single    20     42

This table tells you about married and single people broken down by their gender. For example, 34 males are married.

Adding Margins and Calculating Proportions

A useful addition to a table is to add the margins. The margins tell you the total number of subjects in each row and column of a table. To do this in R use the ‘addmargins’ function as indicated below.

> addmargins(Marriage_table)
        Male Female Sum
Married   34     19  53
Single    20     42  62
Sum       54     61 115

We now know the total number of married people, single people, males, and females, in addition to the information we already knew.

One more useful piece of information is to calculate the proportions. This will allow us to know what percentage of each two-way possibility makes up the table. To do this we will use the “prop.table” function. Below is the script

> prop.table(Marriage_table)
             Male    Female
Married 0.2956522 0.1652174
Single  0.1739130 0.3652174

As you can see, we now know the proportions of each category in the table.
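
If you want the proportions within each row or column rather than across the whole table, the ‘prop.table’ function also accepts a margin argument.

> prop.table(Marriage_table, margin = 1)   # proportions within each row (marital status)
> prop.table(Marriage_table, margin = 2)   # proportions within each column (gender)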

Conclusions

This post provided information on how to construct and manipulate data that is in a two-way table. Two-way tables are a useful way of describing categorical variables.

Experimental Design: A Background

Experimental design is now considered a standard methodology in research. However, this now classic design has not always been a standard approach. In this post, we will look at the following

  • The definition of experiment
  • The history of experiments
  • When to conduct experiments

Definition

The word experiment is derived from the word experience. When conducting an experiment, the researcher assigns people to have different experiences. He then determines if the experience he assigned people to had some effect on some sort of outcome. For example, if I want to know if the experience of sunlight affects the growth of plants I may develop two different experiences

  • Experience 1: Some plants receive sunlight
  • Experience 2: Some plants receive no sunlight

The outcome is the growth of the plants. By giving the plants different experiences of sunlight, I can determine if sunlight influences the growth of plants.

History of Experiments

Experiments have been around informally since the 10th century, with work done in the field of medicine. The use of experiments as known today began in the early 20th century in the field of psychology. By the 1920s, group comparison became an established characteristic of experiments. By the 1930s, random assignment was introduced. By the 1960s, various experimental designs were codified and documented. By the 1980s, there was literature coming out that addressed threats to validity.

Since the 1980s, experiments have become much more complicated with the development of more advanced statistical software programs. Despite all of the new complexity, simple experimental designs are normally easier to understand.

When to Conduct Experiments

Experiments are conducted to attempt to establish a cause and effect relationship between independent and dependent variables. You try to create a controlled environment in which you provide the experience or independent variable(s) and then measure how they affect the outcome or dependent variable.

Since the setting of the experiment is controlled, you can say without a doubt that only the experience influenced the outcome. Of course, in reality, it is difficult to control all the factors in a study. The real goal is to try and limit the effect that these other factors have on the outcomes of a study.

Conclusion

Despite their long history, experiments are relatively new in research. This design has grown and matured over the years to become a powerful method for determining cause and effect. Therefore, researchers should be aware of this approach for use in their studies.

Plotting Correlations in R

A correlation indicates the strength of the relationship between two or more variables.  Plotting correlations allows you to see if there is a potential relationship between two variables. In this post, we will look at how to plot correlations with multiple variables.

In R, there is a built-in dataset called ‘iris’. This dataset includes information about different types of flowers. Specifically, the ‘iris’ dataset contains the following variables

  • Sepal.Length
  • Sepal.Width
  • Petal.Length
  • Petal.Width
  • Species

You can confirm this by inputting the following script

> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

We now want to examine the relationship that each of these variables has with each other. In other words, we want to see the relationship of

  • Sepal.Length and Sepal.Width
  • Sepal.Length and Petal.Length
  • Sepal.Length and Petal.Width
  • Sepal.Width and Petal.Length
  • Sepal.Width and Petal.Width
  • Petal.Length and Petal.Width

The ‘Species’ variable will not be a part of our analysis since it is a categorical variable and not a continuous one. The type of correlation we are analyzing is for continuous variables.

We are now going to plot all of these variables above at the same time by using the ‘plot’ function. We also need to tell R not to include the “Species” variable. This is done by adding a subset code to the script. Below is the code to complete this task.

> plot(iris[-5])

Here is what we did

  1. We used the ‘plot’ function and told R to use the “iris” dataset
  2. In brackets, we told R to remove ( – ) the 5th variable, which was species
  3. After pressing enter you should have seen the following

[Figure: scatterplot matrix of the iris variables]

The variable names are placed diagonally from left to right. The x-axis of each plot is determined by the variable name in that column. For example,

  • The variable of the x-axis of the first column is ‘Sepal.Length’
  • The variable of the x-axis of the second column is ‘Sepal.Width’
  • The variable of the x-axis of the third column is ‘Petal.Length’
  • The variable of the x-axis of the fourth column is ‘Petal.Width’

The y-axis is determined by the variable that is in the same row as the plot. For example,

  • The variable of the y-axis of the first row is ‘Sepal.Length’
  • The variable of the y-axis of the second row is ‘Sepal.Width’
  • The variable of the y-axis of the third row is ‘Petal.Length’
  • The variable of the y-axis of the fourth row is ‘Petal.Width’

As you can see, the same four variables define both the columns and the rows. We will now look at a few examples of plots

  • The plot in the first column second row plots “Sepal.Length” as the x-axis and “Sepal.Width” as the y-axis
  • The plot in the first column third row plots “Sepal.Length” as the x-axis and “Petal.Length” as the y-axis
  • The plot in the first column fourth row plots “Sepal.Length” as the x-axis and “Petal.Width” as the y-axis

Hopefully, you can see the pattern. The plots above the diagonal are mirrors of the ones below. If you are familiar with correlational matrices this should not be surprising.

After a visual inspection, you can calculate the actual statistical values of the correlations. To do so, use the script below; the resulting table follows it.

> cor(iris[-5])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

As you can see, there are many strong relationships between the variables. For example, “Petal.Width” and “Petal.Length” have a correlation of .96, which is almost perfect. This means the two variables move together almost perfectly: as “Petal.Width” increases, “Petal.Length” increases in a highly predictable way.

Conclusion

Plots help you to see the relationship between two variables. After visual inspection, it is beneficial to calculate the actual correlation.

Analyzing Qualitative Data

Analyzing qualitative data is not an easy task. Instead of punching a script into a statistical program and receiving results, you become the computer that needs to analyze the data. For this reason alone, the analysis of qualitative data is difficult, as different people will have vastly different interpretations of the data.

This post will look at the following

  • The basic characteristics of qualitative data analysis
  • Exploring and coding data

Basic Characteristics

Qualitative data analysis has the following traits

  • Inductive form
  • Analyzing data while still collecting data
  • Interpretative

Qualitative analysis is inductive by nature. This indicates that a researcher goes from specific examples to the generation of broad concepts or themes. In many ways, the researcher is trying to organize and summarize what they found in their research coherently in nice neat categories and themes.

Qualitative analysis also involves analyzing while still collecting data. You begin to process the data while still accumulating data. This is an iterative process that involves moving back and forth between analysis and collection of data. This is a strong contrast to quantitative research, which is usually linear in nature.

Lastly, qualitative analysis is highly subjective. Everyone who views the data will have a different perspective on the results of a study. This means that people will all see different ideas and concepts that are important in qualitative research.

Exploring and Coding

Coding data in qualitative research can be done with text, images, and/or observations. In coding, a researcher determines which information to share through the development of segments, codes, categories, and themes. Below is the process for developing codes and categories

1. Read the text

Reading the text means getting familiar with it and knowing what is discussed.

2. Pick segments of quotes to include in the article

After reading the text, you begin to pick quotes from the interview that will be used for further inductive processing

3. Develop codes from segments

After picking many different segments, you need to organize them into codes. All segments in one code have something in common that unites them as a code.

4. Develop categories from codes

The next level of abstraction is developing categories from codes. The same process as in step 3 is performed here.

5. Develop themes from categories

This final step involves further summarizing the results of the categories development into themes. The process is the same as steps 3 and 4.

Please keep in mind that as you move from step 1 to 5 the number of concepts decreases. For example, you may start with 50 segments that are reduced to 10 codes then reduce to 5 categories and finally 3 themes.

Conclusion

There is no single agreed-upon way to analyze qualitative data. There are many different ways to approach this. In general, the best approach possible is one that is consistent in terms of its treatment of the data. The example provided here is just one approach to organizing data in qualitative research.

Basics of Histograms and Plots in R

R has many fascinating features for creating histograms and plots. In this post, we will only cover some of the most basic concepts of making histograms and plots in R. The code for the data we are using is available in a previous post.

Making a Histogram

We are going to make a histogram of the ‘mpg’ variable in our ‘cars’ dataset. Below is the code for doing this followed by the actual histogram.
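
A minimal sketch of that call, based on the explanation given below (the ‘hist’ function applied to the ‘mpg’ variable from the ‘cars’ data frame, with the ‘col’ argument set to gray), would be:

> hist(cars$mpg, col = 'gray')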

[Figure: histogram of the mpg variable]

Here is what we did

  1. We used the ‘hist’ function to create the histogram
  2. Within the hist function we told R to make a histogram of the ‘mpg’ variable found in the ‘cars’ dataset.
  3. An additional argument that we added was ‘col’. This argument is used to determine the color of the bars in the histogram. For our example, the color was set to gray.

Plotting Multiple Variables

Before we look at plotting multiple variables, you need to make an adjustment to the ‘cyl’ variable in our cars data frame. This variable needs to be changed from a numeric to a factor variable as shown below

cars$cyl<- as.factor(cars$cyl)

Boxplots are an excellent way of comparing groups visually. In this example, we will compare the ‘mpg’ or miles per gallon variable by the ‘cyl’ or number of cylinders in the engine variable in the ‘cars’ dataset. Below is the code and diagram followed by an explanation.

boxplot(mpg ~ cyl, data = cars)

[Figure: boxplot of mpg by number of cylinders]

Here is what happened.

  1. We use the ‘boxplot’ function
  2. Within this function we tell R to compare ‘mpg’ by ‘cyl’, using the tilde ( ~ ) to indicate that ‘mpg’ is plotted by the number of cylinders

The box of the boxplot tells you several things

  1. The bottom of the box tells you the 25th percentile
  2. The line in the middle of the box tells you the median
  3. The top of the box tells you the 75th percentile
  4. The bottom line tells you the minimum or lowest value excluding outliers
  5. The top line tells you the maximum or highest value excluding outliers

In the boxplot above, there are three types of cylinders: 4, 6, and 8. For 4 cylinders the 25th percentile is about 23 mpg, the 50th percentile is about 26 mpg, while the 75th percentile is about 31 mpg. The minimum value was about 22 and the maximum value was about 35 mpg. A close look at the different plots indicates that four-cylinder cars have the best mpg, followed by six and finally eight cylinders.

Conclusions

Histograms and boxplots serve the purpose of describing numerical data in a visual manner. Nothing helps to explain abstract concepts such as the mean and median like a picture.

Content Analysis In Qualitative Research

In qualitative research, content analysis enables you to study human behavior indirectly through how people choose to communicate. The type of data collected can vary tremendously in this form of research. However, common examples of data include images, documents, and media.

In this post, we will look at the following in relation to content analysis

  • The purpose of content analysis
  • Coding in content analysis
  • Analysis in content analysis
  • Pros and cons in content analysis

Purpose

The purpose of content analysis is to study the central phenomenon through the analysis of examples of the communication of people connected with the central phenomenon. This information is coded into categories and themes. Categories and themes are just different levels of summarizing the results of the analysis. Themes are a summary of categories and categories are a direct summary of the data that was analyzed.

Coding

Coding the data is the process of organizing the results in a comprehensible way. In terms of coding, there are two choices.

  • Establish categories before beginning the analysis
  • Allow the categories to emerge during analysis

Which is best depends on the research questions and context of the study.

There are two forms of content: manifest content and latent content. Manifest content is evidence that is directly seen, such as the words in an interview. Latent content refers to the underlying meaning of content, such as the interpretation of an interview.

The difference between these two forms of content is the objective or subjective nature of them. Many studies include both forms as this provides a fuller picture of the central phenomenon.

Analysis

There are several steps to consider when conducting a content analysis such as…

  • Define your terms-This helps readers to know what you are talking about
  • Explain what you are analyzing-This can be words, phrases, interviews, pictures, etc.
  • Explain your coding approach-Explained above
  • Present results

This list is far from complete but provides some basis for content analysis

Pros and Cons

Pros of content analysis include

  • Unobtrusive-Content analysis normally does not disturb the field or the people being studied
  • Replication-Since the documents are permanent, it is possible to replicate a study
  • Simplicity-Compared to other forms of research, content analysis is highly practical to complete

Cons include

  • Validity-It is hard to assess the validity of the analysis. The results of an analysis are the subjective opinion of an individual or individuals
  • Limited data-Content analysis is limited to recorded content. This leaves out other forms of information

Conclusion

Content analysis provides another way for the qualitative researcher to analyze the world. There are strengths and weaknesses to this approach, as there are for all forms of analysis. The point is to understand that there are times when content analysis is appropriate.

Describing Categorical Data in R

This post will explain how to create tables, calculate proportions, find the mode, and make plots for categorical variables in R. Before providing examples, below is the script needed to set up the data that we are using.

cars <- mtcars[c(1,2,9,10)]
cars$am <- factor(cars$am, labels=c('auto', 'manual'))
cars$gear <- ordered(cars$gear)

Tables

Frequency tables are useful for summarizing data that has a limited number of values. It represents how often a particular value appears in a dataset. For example, in our cars dataset, we may want to know how many different kinds of transmission we have. To determine this, use the code below.

> transmission_table <- table(cars$am)
> transmission_table

  auto manual 
    19     13

Here is what we did.

  1. We created the variable ‘transmission_table’
  2. In this variable, we used the ‘table’ function which took information from the ‘am’ variable from the ‘cars’ dataset.
  3. Finally, we displayed the information by typing ‘transmission_table’ and pressing enter

Proportion

Proportions can also be calculated. A proportion will tell you what percentage of the data belongs to a particular category. Below are the proportions of automatic and manual transmissions in the ‘cars’ dataset.

> transmission_table/sum(transmission_table)

   auto  manual 
0.59375 0.40625

The table above indicates that about 59% of the sample consists of automatic transmissions while about 41% are manual transmissions.

Mode

When dealing with categorical variables there is no mean or median. However, it is still possible to calculate the mode, which is the most common value found. Below is the code.

> mode_transmission <-transmission_table ==max(transmission_table)
> names(transmission_table) [mode_transmission]
[1] "auto"

Here is what we did.

  1. We created the variable ‘mode_transmission’ and used the ‘max’ function to flag which count in ‘transmission_table’ equals the maximum.
  2. Next, we retrieved the names found in ‘transmission_table’, subsetting them with the ‘mode_transmission’ variable.
  3. The most common value was ‘auto’, or automatic transmission.

Plots

Plots are one of the most enjoyable capabilities in R. For now, we will only show you how to plot the data that we have been using. What is seen here is only the simplest and most basic use of plots in R, and there is much more to it than this. Below is the code for plotting the number of transmissions by type in R.

> plot(cars$am)

If you did this correctly you should see the following.

[Figure: bar plot of the counts of auto and manual transmissions]

All we did was have R create a visual of the number of auto and manual transmissions. Naturally, you can make plots with continuous variables as well.
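
For example, assuming the full built-in ‘mtcars’ dataset, a scatterplot of two continuous variables can be made like this:

> plot(mtcars$wt, mtcars$mpg)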

Conclusion

This post provided some basic information on various tasks that can be accomplished in R for assessing categorical data. These skills will help researchers to take a sea of data and find simple ways to summarize all of the information.

Interviews in Qualitative Research

Interviews provide another way to collect data when conducting qualitative research. In this post, we will look at the following,

  • Characteristics of the interviewees
  • Types of interviews
  • Types of questions
  • Tips for conducting interviews

Characteristics of the Interviewees

Qualitative research involves two types of interviewees. If you are interviewing only one person this is a one-on-one interview. If you are interviewing a group this is often called a focus group.

One-on-one interviewing allows for in-depth data collection but takes a great deal of time. Focus groups, on the other hand, allow a researcher to gather more varied opinions while saving time. Care also must be taken to make sure everyone participates in a focus group.

Types of Interviews

There are three common types of interviews: structured, semi-structured, and informal. Structured interviews consist of a strict set of questions that are read in order, word for word, to interviewees. The goal is for the interviewee to answer all questions.

At the opposite extreme are informal interviews, which are conversations that can flow in any direction. There is no set script of questions, and the interviewee can go anywhere they want in the conversation.

The middle ground between structured and informal interviewing is the semi-structured interview. In this approach, the researcher has questions they want to ask, but they can vary the order, reword, ask follow-up questions, and/or omit questions. As such, there is a flexible format in semi-structured interviews.

Types of Questions

There are several types of questions that are used in qualitative research. The types are self-explanatory and are listed below with an example

  • Knowledge question-“How does email work?”
  • Experience question-“What was it like growing up in the 1990’s?”
  • Opinion question-“What is your view of the tax cuts?”
  • Feeling question-“How does the change in curfew make you feel?”
  • Sensory question-“What does the kitchen smell like?”

Keep in mind that open-ended questions are more common than closed-ended questions in qualitative research. This allows the interviewee to share their perspective rather than reply yes or no.

Tips for Conducting Interviews

Below are some tips for conducting interviews

  • Establish rapport-Establishing some form of relationship helps the interviewee to feel comfortable.
  • Location matters-Pick a quiet place to conduct the interview(s) if possible.
  • Record the interview-This is standard practice and is necessary in order to develop a transcript.
  • Take notes-Even with a recording, taking notes helps you to recall what happened during the interview.
  • Use wisdom with questions-Avoid leading questions which are unfair and make sure to ask one question at a time.
  • Show respect and courtesy during the interview-Be polite and considerate of the interviewee who has given you some of their time.

This is not meant to be an exhaustive list but rather to provide some basic guidelines.

Conclusion

Along with observations, interviews are one of the most common forms of data collection in qualitative research. When you are in the field doing interviews, it is important to consider what kind of interview you are doing, what questions you are going to ask, as well as the guidelines for conducting interviews presented in this post.

Observation in Qualitative Research

Observation is one of several forms of data collection in qualitative research. It involves watching and recording, through the use of notes, the behavior of people at the research site. In this post, we will cover the following

  • Different observational roles
  • The guidelines for observation
  • Problems with observation

Observational Roles

The role you play as an observer can vary between two extremes: nonparticipant observer and participant observer.

A nonparticipant observer does not participate in any of the activities of the people being studied. For example, if you are doing teaching observations, you sit in the classroom, only watch what happens, and never participate.

The other extreme is a participant observer. In this role, a researcher takes part in the activities of the group. For example, if you are serving as a teacher in a lower income community and are observing the students while you teach and interact with them, this is participant observation.

Between these two extremes of non-participation and participation are several other forms of observation. For example, a non-participant observer can be an observer-as-participant or a complete observer. Furthermore, a participant observer can be a participant-as-observer or a complete participant. The difference between these is whether or not the group being studied knows the identity of the researcher.

Guidelines for Observation

  • Decide your role-What type of observer are you?
  • Determine what you are observing-The observation must support what you are trying to learn about the central phenomenon 
  • Observe the subject multiple times-This provides a deeper understanding of the subjects
  • Take notes-An observer should have some way of taking notes. These notes are called fieldnotes and provide a summary of what was seen during the observation.

Problems with Observation

Common problems that are somewhat related when doing observations are the observer effect, observer bias, and observer expectations. The observer effect is when the people being observed change their behavior because of the presence of an outsider. For example, it is common for students to behave differently when the principal comes to observe the teacher. They modify their behavior because of the presence of the principal. In addition, if the students are aware of the principal’s purpose, they may act extra obedient for the sake of their teacher.

Observer bias is the potential that a researcher’s viewpoint may influence what they see. For example, if a principal is authoritarian, he may view a democratic classroom with a laid-back teacher as chaotic when the students may actually be learning a great deal.

Observer expectation is the observer assuming beforehand what they are going to see. For example, if a researcher is going to observe students in a lower-income school, he may expect to see low-performing, unruly students. This can become a self-fulfilling prophecy as the researcher sees what they expected to see.

Conclusion

Observation is one of the forms of data collection in qualitative research. Keeping in mind the types of observation, guidelines, and problems can help a researcher to succeed.

Describing Continuous Variables in R

In the last post on R, we began to look at how to pull information from a data set. In this post, we will continue this by examining how to describe continuous variables. One of the first steps in any data analysis is to look at the descriptive statistics to determine if there are any problems, errors or issues with normality. Below is the code for the data frame we created.

> cars <- mtcars[c(1,2,9,10)]
> cars$am <- factor(cars$am, labels=c('auto', 'manual'))
> cars$gear <- ordered(cars$gear)

Finding the Mean, Median, Standard Deviation, Range, and Quartiles

The mean is useful to know as it gives you an idea of the centrality of the data. Finding the mean is simple and involves the use of the “mean” function. Keep in mind that there are four variables in our data frame: mpg, cyl, am, and gear. Only ‘mpg’ will be treated as continuous here; the other variables (cyl, am, and gear) are categorical in nature, so a mean is not meaningful for them. Below is the script for finding the mean of ‘mpg’ with the answer.

> mean(cars$mpg)
[1] 20.09062

The median is the number found exactly in the middle of a dataset. Below is the median of ‘mpg.’

> median(cars$mpg)
[1] 19.2

The answer above indicates that the median of ‘mpg’ is 19.2. This means half the values are above this number while half the values are below it.

Standard deviation is the average amount that a data point differs from the mean. For example, if the standard deviation is 3 and the mean is 10, a typical data point falls between 7 and 13, that is, within 3 points of the mean. In other words, the standard deviation summarizes the average amount of deviation from the mean. Below is a calculation of the standard deviation of ‘mpg’ in the ‘cars’ data frame using the ‘sd’ function.

> sd(cars$mpg)
[1] 6.026948

What this tells us is that the average amount of deviation from the mean of 20.09 for ‘mpg’ in the ‘cars’ data frame is about 6.03. In simple terms, most of the data points fall roughly between 14.1 and 26.1 mpg.

The range gives the lowest and highest numbers in a data set and provides an idea of the scope of the data. This can be calculated in R using the ‘range’ function. Below is the range for ‘mpg’ in the ‘cars’ data frame.

> range(cars$mpg)
[1] 10.4 33.9

These results mean that the lowest ‘mpg’ in the cars data frame is 10.4 while the highest is 33.9. There are no values lower or higher than these.

Lastly, quartiles break the data into groups based on percentages. For example, the 25th percentile is the value below which 25% of the data fall and above which 75% of the data fall. If there are 100 data points in a dataset and the number 25 is the 25th percentile, this means that about 75 of the numbers are greater than 25 and about 25 of the numbers are below it. Below is an example using the ‘mpg’ variable from the ‘cars’ data frame with the ‘quantile’ function.

> quantile(cars$mpg)
    0%    25%    50%    75%   100% 
10.400 15.425 19.200 22.800 33.900

In the example above, 15.425 is the 25th percentile. This means that 75% of the values are above 15.425 while 25% are below it. In addition, you may have noticed that the 0% and 100% values are the same as the range and that the 50% value is the same as the median. In other words, calculating the quartiles also gives you the range and the median, which saves you time.
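
If you want several of these statistics at once, the ‘summary’ function reports the minimum, quartiles, median, and mean in a single call. The output for ‘mpg’ should look roughly like the following (the display rounds the values slightly).

> summary(cars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90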

Conclusion

These are the basics of describing continuous variables. The next post on R will look at describing categorical variables.

Qualitative Research Sampling Methods

Qualitative research employs what is generally called purposeful sampling, which is the intentional selection of individuals to better understand the central phenomenon. Under purposeful sampling, there are several ways of selecting individuals for a qualitative study. Below are some examples discussed in this post.

  • Maximal variation
  • Extreme case
  • Theory
  • Homogeneous
  • Opportunistic
  • Snowball

We will also look at suggestions for sample size.

Maximal Variation Sampling

Maximal variation involves selecting individuals that are different on a particular characteristic. For example, if you are doing a study on discrimination, you might select various ethnicities to share their experience with discrimination. By selecting several races you are ensuring a richer description of discrimination.

Extreme Case Sampling

Extreme case sampling involves looking at outliers or unusual situations. For example, studying a successful school in a low-income area would qualify, since high academic performance does not normally correlate with low-income areas.

Theory Sampling

Theory sampling involves selecting people based on their ability to help you understand a theory or process. For example, if you are trying to understand why students drop out of school, you may select dropout students and their teachers to understand the events that lead to dropping out. This technique is often associated with grounded theory.

Homogeneous Sampling

This approach involves selecting several members from the same subgroup. For example, if we are looking at discrimination at a university, we may select only African-American English Majors. Such an example is a clear sub-group of a larger community.

Opportunistic Sampling

Opportunistic sampling is, in many ways, sampling without a plan, or starting with one sampling method and then switching to another because of changes in circumstances. For example, you may begin with theory sampling as you study the process of dropping out of high school. While doing this, you encounter a student who is dropping out in order to pursue independent studies online. This provides you with the “opportunity” to study an extreme case as well.

Snowball Sampling

Sometimes it is not clear who to contact. In this case, snowball sampling may be appropriate. Snowball sampling is an approach commonly used by detectives in various television shows. You find one person to interview and this same person recommends someone else to talk to. You repeat this process several times until an understanding of the central phenomenon emerges.

Sample Size

Qualitative research involves much smaller sample sizes than quantitative research. This is for several reasons

  • You want to provide an in-depth look at one perspective rather than a shallow overview of many perspectives.
  • The more people involved the harder it is to conduct the analysis.
  • You want to share the complexities rather than the generalizations of a central phenomenon.

One common rule of thumb is to collect data until saturation is reached. Saturation is when the people in your data begin to say the same things. How long this takes varies, and it is by no means an absolute standard.

Conclusion

These are just some of the more common forms of sampling in qualitative research. Naturally, there are other methods and approaches to sampling. The point is that the questions of the study and the context shape which sampling method is appropriate.

Working with Data in R

In this post and future posts, we will work with actual data that is available in R to apply various skills. For now, we will work with the ‘mtcars’ data that is available in R. This dataset contains information about 32 different cars that were built in the 1970’s.

Often in research, the person who collects the data and the person who analyzes it (the statistician) are different. As a result, the statistician frequently receives messy data with missing values that cannot be analyzed in its current state and must first clean the data in order to analyze it.

Initial Decisions

One of the first things a statistician does after he has received some data is to see how many different variables there are and perhaps decide if any can be converted to factors. In order to achieve these two goals we need to answer the following questions

  1. How many variables are there? Many functions can answer this question
  2. How many unique values does each variable have? This will tell us if the variable is a candidate for becoming a factor.

Below is the code for doing this with the ‘mtcars’ dataset.

> sapply(mtcars, function(x) length(unique(x)))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
  25    3   27   22   22   29   30    2    2    3    6

Here is what we did

  1. We used the ‘sapply’ function because we want to apply our function on the whole dataframe at once.
  2. Inside the ‘sapply’ function we tell R that the dataset is ‘mtcars’
  3. Next is the function we want to use. We give R an anonymous function whose argument ‘x’ stands for each variable in turn.
  4. After indicating that the function is anonymous we tell R what is inside the anonymous function.
  5. The anonymous function contains the length function with the unique function within it.
  6. This means that we want the length (or number) of unique values in each variable.

We have the answers to our questions

  1. There are eleven variables in the ‘mtcars’ dataset as listed in the table above
  2. The unique values are also listed in the table above.

A common rule of thumb for determining whether to convert a variable to a factor is that the variable has fewer than 10 unique values. Based on this rule, the cyl, vs, am, gear, and carb variables could be converted to factors. Converting a continuous variable to a factor makes it categorical and opens up various analysis possibilities.
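
If you would like R to apply this rule of thumb for you, one possible sketch is to reuse the same anonymous function inside a logical test. The cutoff of 10 below simply mirrors the rule of thumb and is not a fixed requirement.

> names(mtcars)[sapply(mtcars, function(x) length(unique(x)) < 10)]
[1] "cyl"  "vs"   "am"   "gear" "carb"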

Preparing Data

Now that we have an idea of the characteristics of the dataset, we have to decide the following…

  1. What variables are we going to use
  2. What type of analysis are we going to do with the variables we use.

For question 1 we are going to use the 1st (mpg), the 2nd (cyl), the 9th (am), and the 10th (gear) variables for our analysis. Then we are going to make the ‘am’ variable a factor. Lastly, we are going to make the ‘gear’ variable an ordered factor. Below is the code

> cars <- mtcars[c(1,2,9,10)]
> cars$am <- factor(cars$am, labels=c('auto', 'manual'))
> cars$gear <- ordered(cars$gear)

Here is what we did

  1. We created the variable ‘cars’ and assigned to it a subset of variables from the ‘mtcars’ dataset. In particular, we took the 1st (mpg), the 2nd (cyl), the 9th (am), and the 10th (gear) variables from ‘mtcars’ and saved them in ‘cars’
  2. We converted the ‘am’ variable to a factor. Data points that were once 0 became ‘auto’ and data points that were once 1 became ‘manual’
  3. We made the ‘gear’ variable an ordered factor, which means that 5 is more than 4, 4 is more than 3, etc. (a quick check of these conversions follows below)
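
One way to quickly check that these conversions took effect is the ‘class’ function. Below is what we would expect to see for the two converted variables.

> class(cars$am)
[1] "factor"
> class(cars$gear)
[1] "ordered" "factor"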

Our next post on R will focus on analyzing the ‘cars’ variable that we created.

Qualitative Data Collection

Qualitative data collection normally involves collecting non-numerical information about a phenomenon. The data collected can be text, pictures, and at times even numerical observations. This post will examine differences between qualitative and quantitative data collection as well as the steps of qualitative data collection.

Qualitative vs. Quantitative Research

There are several major differences between qualitative and quantitative data collection.

  • Qualitative relies on a much smaller sample size using purposeful sampling while quantitative involves larger samples selected through random/non-random sampling.
  • Qualitative uses open-ended questions while quantitative uses closed questions when interviewing
  • Qualitative researchers almost always develop their own instruments while quantitative researchers use other people’s instruments.
  • Qualitative research involves personal contact with the respondents while quantitative research does not necessarily require this.
  • Qualitative research tries to help us understand a central phenomenon better while quantitative research seeks to test theories.

Steps of Qualitative Data Collection

Qualitative research is rarely linear in nature. This means that the steps mentioned below do not necessarily happen in the order presented. Qualitative research is often iterative and involves repeating steps and moving back and forth. In general, it is common to see the following steps happen in qualitative research.

  1. Determine participants-This involves deciding who you will collect data from by identifying a population and appropriate sampling strategy.
  2. Ethical concerns-This relates to obtaining the needed permissions to contact the sample population.
  3. Decide what to collect-You need to determine what information that the participants supply is needed to answer your research questions.
  4. Develop instrument-Design interview protocol
  5. Administer instrument-Conduct the data collection

Again, the process is rarely this simple and straightforward. Regardless of the order in which these steps take place, all of the steps need to happen at one time or another when conducting qualitative research.

Conclusion

Qualitative research provides an alternative view to understanding the world. By relying on text and images, qualitative research provides a richer description of reality than quantitative research does. In general, there are five steps to qualitative data collection. The order in which these steps are completed varies, but all should be completed when conducting qualitative research.

Apply Functions in R

The last R post focused on the use of the “for” loop. This option is powerful for repeating an action that cannot be calculated on a whole vector at once. However, there are some drawbacks to ‘for’ loops that are highly technical and hard to explain to beginners. The problems have to do with strange results in the workspace and environment.

To deal with the problems of ‘for’ loops, R offers the alternative approach of using functions from the apply family. The functions in the apply family provide the same results as a ‘for’ loop without the occasional problem. Three commonly used functions in the apply family are

  • apply
  • sapply
  • lapply

We discuss each with a realistic example.

Apply

The ‘apply’ function is useful for producing results for a matrix, array, or data frame. It does this by producing results from the rows and/or columns. The results of an ‘apply’ function are returned as a vector, matrix, or list. Below is an example of the use of an ‘apply’ function.

You make a matrix that contains how many points James, Kevin, and Williams have scored in the past three games. Below is the code for this.

> points.per.game<- matrix(c(25,23,32,20,18,24,12,15,16), ncol=3)
> colnames(points.per.game)<-c('James', 'Kevin', 'Williams')
> rownames(points.per.game)<-c('Game1', 'Game2', 'Game3')
> points.per.game
      James Kevin Williams
Game1    25    20       12
Game2    23    18       15
Game3    32    24       16

You want to know the most points James, Kevin, and Williams scored for any game. To do this, you use the ‘apply’ function as follows.

> apply(points.per.game, 2, max)
   James    Kevin Williams 
      32       24       16

Here is what we did

  1. We used the ‘apply’ function and in the parentheses we put the arguments “points.per.game”, as this is the name of the matrix, ‘2’, which tells R to examine the matrix by column, and lastly ‘max’, which tells R to find the maximum value in each column (an example of working by row follows this list).
  2. R prints out the results telling us the most points each player scored regardless of the game.
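
The second argument can also be set to ‘1’, which tells R to examine the matrix by row instead of by column. For instance, applying ‘sum’ by row gives the total points scored in each game.

> apply(points.per.game, 1, sum)
Game1 Game2 Game3 
   57    56    72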

Sapply

The ‘apply’ function works for multidimensional objects such as matrices, arrays, and data frames. ‘Sapply’ is used for vectors, data frames, and lists. ‘Sapply’ is more flexible in that it can be used for single-dimension (vector) and multidimensional (data frame) objects. The output from using the ‘sapply’ function is always a vector, matrix, or list. Below is an example.

Let’s say you make the following data frame

> GameInfo
  points      GameType
1    101          Home
2    115          Away
3     98 International
4     89          Away
5    104          Home

You now want to know what kind of variables you have in the ‘GameInfo’ data frame. You can calculate this one at a time or you can use the following script with the ‘sapply’ function

> sapply(GameInfo, class)
   points  GameType 
"numeric"  "factor"

Here is what we did

  1. We use the ‘sapply’ function and include two arguments. The first is the name of the data frame “GameInfo”; the second is the argument ‘class’, which tells R to determine what kind of variable each column of the “GameInfo” data frame is
  2. The answer is then printed. The variable ‘points’ is a numeric variable while the variable ‘GameType’ is a factor.

Lapply

The ‘lapply’ function works exactly like the ‘sapply’ function but always returns a list. See the example below

> lapply(GameInfo, class)
$points
[1] "numeric"

$GameType
[1] "character"

This is the same information as the ‘sapply’ example we ran earlier, but returned as a list.

Conclusion

The ‘apply’ family serves the purpose of running a loop without the concerns of the ‘for’ loop. This feature is useful for various objects.

Analyzing Quantitative Data: Inferential Statistics

In a prior post, we looked at analyzing quantitative data using descriptive statistics. In general, descriptive statistics describe your data in terms of the tendencies within the sample. However, with descriptive stats, you only learn about your sample; you are not able to compare groups or find relationships between variables. To deal with this problem, we use inferential statistics.

Types of Inferential Statistics

With inferential statistics, you look at a sample and make inferences about the population. There are many different types of analysis that involve inferential statistics. Below is a partial list. The ones with links have been covered in this blog before.

  • Pearson Correlation Coefficient–Used to determine the strength of the relationship between two continuous variables
  • Coefficient of Determination–The squared value of the Pearson Correlation Coefficient. Indicates the amount of variance in one variable explained by another
  • Spearman Rho–Used to determine the strength of the relationship between two ranked variables or other non-parametric data
  • t-test-Determines if there is a statistically significant difference between two means. The independent variable is categorical while the dependent variable is continuous.
  • Analysis of Variance-Same as a t-test but for three or more means.
  • Chi-Square-Determines whether two categorical variables are related (test of independence) or whether observed frequencies differ from expected frequencies (goodness-of-fit).

As you can see, there are many different types of inferential statistical tests. However, one thing all tests have in common is the testing of a hypothesis. Hypothesis testing has been discussed on this blog before. To summarize, a hypothesis test can tell you if there is a difference between the sample and the population or between the sample and a normal distribution.

One other value that can be calculated is the confidence interval. A confidence interval gives a range within which the result of a statistical test (either descriptive or inferential) is likely to be found. For example, if we find that the correlation between two variables is .30, the confidence interval may be between .25 and .40. This range tells us where the value of the correlation is likely to fall in the population.
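
To make this concrete, below is a small sketch using the built-in ‘cor.test’ function on the ‘mtcars’ data used in the R posts on this blog. A single call tests the hypothesis that the Pearson correlation between miles per gallon and car weight is zero and also reports a 95% confidence interval around the estimated correlation (roughly -0.87 for these two variables).

> cor.test(mtcars$mpg, mtcars$wt)

The printed output includes the estimated correlation, the p-value for the hypothesis test, and the lower and upper bounds of the confidence interval.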

Conclusion

This post serves as an overview of the various forms of inferential statistics available. Remember, that it is the research questions that determine the form of analysis to conduct. Inferential statistics are used for comparing groups and examining relationships between variables.

Using ‘for’ Loops in R

In our last post on R, we looked at using the ‘switch’ function. The problem with the ‘switch’ function is that you have to input every data point manually. For example, if you have five different values you want to calculate in a function you need to input each value manually until you have calculated all five.

There is a way around this and it involves the use of a ‘for’ loop. A ‘for’ loop is used to repeat an action for each element of a vector. This allows you to perform a repeated action with functions that cannot handle whole vectors.

We will return to our ‘CashDonate’ function that has been used during this discussion of programming. Below is the code from R as it was last modified for the previous post.

CashDonate <- function(points, GameType){
 JamesDonate <- points * ifelse(points > 100, 30, 40)
 OwnerRate<- switch(GameType, Home = 1.5, Away = 1.3, International = 2)
 totalDonation<-JamesDonate * OwnerRate
 round(totalDonation)
}

The code above allows you to calculate how much money James and the owner will give based on points scored and whether it is a home, away, or international game. Remember, the problem with this code is that each value must be entered manually into the “CashDonate” function. To fix this we will use a ‘for’ loop as seen in the example below.

CashDonate <- function(points, GameType){
 JamesDonate <- points * ifelse(points > 100, 30, 40)
  OwnerRate <- numeric(0)
 for(i in GameType){
 OwnerRate<- c(OwnerRate, switch(i, Home = 1.5, Away = 1.3, International = 2))}
 totalDonation<-JamesDonate * OwnerRate
 round(totalDonation)
}

There are some slight yet significant changes to the function as explained below

  1. The variable ‘OwnerRate’ is initially set to be an empty numeric vector using numeric(0). This is because different values will be added to this variable by the ‘for’ loop.
  2. Next, we have the ‘for’ loop. “i” represents each value found in the argument ‘GameType’; the choices are Home, Away, and International. For every value in the data frame, the ‘CashDonate’ function checks the ‘GameType’ to decide what to put into the ‘OwnerRate’ variable.
  3. Next, we have the ‘OwnerRate’ variable again. We use the ‘c’ function to combine ‘OwnerRate’ (which starts out empty) with the result of the “switch” function. The ‘switch’ function looks at ‘i’ (from the ‘for’ loop) and returns the same rate as before.
  4. The rest of the code is the same.

We will now try to use this modified ‘CashDonate’ function. But first we need a data frame with several data points to run it. Input the following data frame into R.

GameInfo<- data.frame( points = c(101, 115, 98, 89, 104), GameType = c("Home", "Away", "International", "Away", "Home"))

In order to use the ‘CashDonate’ function with the ‘for’ loop, you have to extract the values from the ‘GameInfo’ data frame using the $ sign. Below is the code with the answer from the ‘CashDonate’ function

CashDonate(GameInfo$points, GameInfo$GameType)
[1] 4545 4485 7840 4628 4680

You can check the values for yourself manually. The main advantage of the for loop is that we were able to calculate all five values at the same time rather than one at a time.

Analyzing Quantitative Data: Descriptive Statistics

For quantitative studies, once you have prepared your data, it is time to analyze it. How you analyze data is heavily influenced by your research questions. Most studies involve the use of descriptive and/or inferential statistics to answer the research questions. This post will briefly discuss various forms of descriptive statistics.

Descriptive Statistics

Descriptive statistics describe trends or characteristics in the data. In general, there are three forms of descriptive statistics. One form deals specifically with trends and includes the mean, median, and mode. The second form deals with the spread of the scores and includes the variance, standard deviation, and range. The third form deals with comparing scores and includes z scores and percentile rank.

Trend Descriptive Stats

Common examples of descriptive statistics that describe trends in the data are the mean, median, and mode. For example, if we gather the weights of 20 people, the mean weight gives us an idea of about how much each person weighs. The mean is easier to use and remember than 20 individual data points.

The median is the value that is exactly in the middle of a range of several data points. For example, if we have several values arranged from least to greatest, such as 1, 4, 7, the number 4 is the median as it is the value exactly in the middle. The mode is the most common number in a list of values. For example, if we have the values 1, 3, 4, 5, 5, 7, the number 5 is the mode since it appears twice while all the other numbers appear only once.

Spread Scores Descriptive Stats

Calculating spread statistics is somewhat more complicated than trend statistics. Variance is the average squared deviation from the mean; it summarizes the amount of error, or spread, in the data. If the mean of a data set is 5 and the variance is 1, this means that the average squared departure from the mean of 5 is 1 point.

One problem with variance is that it is measured in squared units, which differ from the units of the original data. To deal with this problem, statisticians take the square root of the variance to get the standard deviation. The standard deviation is the average amount that the values in a sample differ from the mean. This value is used in many different statistical analyses.

The range measures the dispersion of the data by subtracting the lowest value from the highest. For example, if the highest value in a data set is 5 and the lowest is 1, the range is 5 – 1 = 4.

Comparison Descriptive Stats

Comparison descriptive stats are much harder to explain and are often used to calculate more advanced statistics. Two types of comparison descriptive stats are z scores and percentile rank.

Z scores tell us how far a data point is from the mean in terms of standard deviations. For example, a z score of 3.0 indicates that a particular data point is 3 standard deviations above the mean. Z scores are useful in identifying outliers, among many other things.

The percentile rank is much easier to understand. Percentile rank tells you how many scores fall at or below a given percentile. For example, someone with a score at the 80th percentile outperformed 80% of the sample.
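
Since other posts in this series use R, below is a brief sketch of how several of these descriptive statistics can be calculated for a small, made-up set of weights. The last line converts each weight into a z score by subtracting the mean and dividing by the standard deviation.

> weights <- c(60, 65, 70, 75, 80)
> mean(weights)
[1] 70
> median(weights)
[1] 70
> sd(weights)
[1] 7.905694
> range(weights)
[1] 60 80
> round((weights - mean(weights)) / sd(weights), 2)
[1] -1.26 -0.63  0.00  0.63  1.26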

Conclusion

Descriptive stats are used at the beginning of an analysis. There are many other forms of descriptive stats, such as skew and kurtosis. Descriptive stats are useful for checking whether your data meet assumptions such as normality before beginning inferential statistical analysis. Always remember that your research questions determine what form of analysis to conduct.

Switch Function in R

The ‘switch’ function in R is useful when you have more than two choices based on a condition. If you remember ‘if’ and ‘ifelse’ are useful when there are two choices for a condition. However, using if/else statements for multiple choices is possible but somewhat confusing. Thus the ‘switch’ function is available for simplicity.

One problem with the ‘switch’ function is that you cannot use vectors as the input to the function. This means that you have to manually calculate the value for each data point yourself. There is a way around this, but that is the topic of a future post. For now, we will look at how to use the ‘switch’ function.

Switch Function

The last time we discussed R programming, we had set up the “CashDonate” function to calculate how much James would give in dollars per point and how much the owner would match based on whether it was a home or away game. Below is the code.

CashDonate <- function(points, HomeGame=TRUE){
 JamesDonate<- points * ifelse(points > 100, 30, 40)
 totalDonation<- JamesDonate * ifelse(HomeGame, 1.5, 1.3)
 round(totalDonation)
}

Now the team will have several international games outside the country. For these games, the owner will double whatever James donates. We now have three choices in our code.

  • Home game: multiply by 1.5
  • Away game: multiply by 1.3
  • International game: multiply by 2

It is possible to use the ‘if/else’ function but it is complicated to code, especially for beginners. Instead, we will use the ‘switch’ function which allows R to switch between multiple options within the argument. Below is the new code.

CashDonate <- function(points, GameType){
 JamesDonate<- points * ifelse(points > 100, 30, 40)
 OwnerRate<- switch(GameType, Home = 1.5, Away = 1.3, International = 2)
 totalDonation<-JamesDonate * OwnerRate
 round(totalDonation)
}

Here is what the code does

  1. The function ‘CashDonate’ has the arguments ‘points’ and ‘GameType’
  2. ‘JamesDonate’ is ‘points’ multiplied by 30 if more than 100 points are scored, or by 40 otherwise
  3. ‘OwnerRate’ is new as it uses the ‘switch’ function. If the ‘GameType’ is ‘Home’ the amount from ‘JamesDonate’ is multiplied by 1.5, ‘Away’ is multiplied by 1.3, and ‘International’ is multiplied by 2. The result is stored in ‘totalDonation’
  4. Lastly, the results in ‘totalDonation’ are rounded using the ’round’ function.

Since there are three choices for ‘GameType’ we use the switch function for this. You can input different values into the modified ‘CashDonate’ function. Below are several examples

> CashDonate(88, GameType="Away")
[1] 4576
> CashDonate(130, GameType="International")
[1] 7800
> CashDonate(102, GameType="Home")
[1] 4590

The next time we discuss R programming, I will explain how to overcome the problem of inputting each value into the function manually.

Quantitative Data Analysis Preparation

There are many different ways to approach data analysis preparation for quantitative studies. This post will provide some insight into how to do this. In particular, we will look at the following steps in quantitative data analysis preparation.

  • Scoring the data
  • Deciding on the types of scores to analyze
  • Inputting data
  • Cleaning the data

Scoring the Data

Scoring the data involves the researcher assigning a numerical value to each response on an instrument. This includes categorical and continuous variables. Below is an example.

Gender: Male(1)____________ Female(2)___________

I think school is boring

  1. Strongly Agree
  2. Agree
  3. Neutral
  4. Disagree
  5. Strongly Disagree

In the example above, the first item about gender has the value 1 for male and 2 for female. The second item asks about the person’s perception of school, from 1 being strongly agree all the way to 5, which indicates strongly disagree. Every response is given a numerical value, and it is the number that is entered into the computer for analysis.
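
If the scoring is later done in statistical software rather than by hand, one possible sketch in R is shown below. The response vectors are made up purely for illustration; the ‘factor’ function attaches the numerical codes in the order of the levels supplied.

> gender <- c("Male", "Female", "Female", "Male")
> as.numeric(factor(gender, levels = c("Male", "Female")))
[1] 1 2 2 1
> boring <- c("Agree", "Neutral", "Strongly Agree")
> as.numeric(factor(boring, levels = c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree")))
[1] 2 3 1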

Determining the Types of Scores to Analyze

Once data has been received, it is necessary to determine what types of scores to analyze. Single-item scores involve assessing how each individual person responded to a single item. An example would be voting, where each individual vote is added up to determine the results.

Another approach is summed scores. In this approach, the results of several items are added together. This is done because one item alone does not fully capture whatever is being measured. For example, there are many different instruments that measure depression. Several questions are asked and then the sum of the scores indicates the level of depression the individual is experiencing. No single question can accurately measure a person’s depression so a summed score approach is often much better.

Difference scores can involve single-item or summed scores. The difference is that difference scores measure change over time. For example, a teacher might measure a student’s reading comprehension before and after teaching the students basic skills. The difference is then calculated as below

  • Score 2 – Score 1 = Difference

Inputting Data

Inputting data often happens in Microsoft Excel since it is easy to load an Excel file into various statistical programs. In general, inputting data involves giving each item its own column, which holds the respondents’ responses. Each row belongs to one respondent. For example, row 2 would refer to respondent 2, and all the results for respondent 2 for all the items on the instrument would be in this row.

If you are summing scores or looking for differences, you need to create a column to hold the results of the summation or difference calculation. Often this is done in the statistical program rather than in Microsoft Excel.

Cleaning Data

Cleaning data involves searching for scores that are outside the range of the scale of an item(s) and dealing with missing data. Out-of-range scores can be found through a visual inspection or by running some descriptive statistics. For example, if you have a Likert scale of 1-5 and one item has a standard deviation of 7, it is an indication that something is wrong because the standard deviation cannot be larger than the range.

Missing data are items that do not have a response. Depending on the type of analysis, this can be a major problem. There are several ways to deal with missing data.

  • Listwise deletion is the removal of any respondent who missed even one item on an instrument
  • Mean imputation is the inputting of the mean of the variable wherever there is a missing response

There are other, more complicated approaches, but this provides some idea of what to do. A rough sketch of these two approaches in R follows.
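
The sketch below uses a made-up data frame with one missing score. The ‘na.omit’ function performs listwise deletion, while the last two lines replace the missing value with the mean of the observed scores.

> survey <- data.frame(id = 1:4, score = c(3, 5, NA, 4))
> na.omit(survey)
  id score
1  1     3
2  2     5
4  4     4
> survey$score[is.na(survey$score)] <- mean(survey$score, na.rm = TRUE)
> survey$score
[1] 3 5 4 4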

Conclusion

Preparing data involves planning what you will do. You need to consider how you will score the items, what type of scores you will analyze, how you will input the data, and how you will clean it. From here, a deeper analysis is possible.

Test Validity

Validity is often seen as a close companion of reliability. Validity is the assessment of the evidence that indicates that an instrument is measuring what it claims to measure. An instrument can be highly reliable (consistent in measuring something) yet lack validity. For example, an instrument may reliably measure motivation but not be valid for measuring income. The problem is that an instrument designed to measure motivation would not measure income appropriately.

In general, there are several ways to assess validity, which include the following.

  • Content validity
  • Response process validity
  • Criterion-related evidence of validity
  • Consequence testing validity
  • Face validity

Content Validity

Content validity is perhaps the easiest way to assess validity. In this approach, the instrument is given to several experts who assess the appropriateness or validity of the instrument. Based on their feedback, a determination of validity is made.

Response Process Validity

In this approach, the respondents to an instrument are interviewed to see if they considered the instrument to be valid. Another approach is to compare the responses of different respondents for the same items on the instrument. High validity is determined by the consistency of the responses among the respondents.

Criterion-Related Evidence of Validity

This form of validity involves measuring the same variable with two different instruments. The instruments can be administered over time (predictive validity) or simultaneously (concurrent validity). The results are then analyzed by finding the correlation between the two instruments. A stronger correlation implies stronger validity for both instruments.

Consequence Testing Validity

This form of validity looks at what happens in the environment after an instrument is administered. An example of this would be improved learning due to a test. Since the students are studying harder, it can be inferred that this is due to the test they just experienced.

Face Validity

Face validity is the perception that the students have that a test measures what it is supposed to measure. This form of validity cannot be tested empirically. However, it should not be ignored. Students may dislike assessment but they know if a test is testing what the teacher tried to teach them.

Conclusion 

Validity plays an important role in the development of instruments in quantitative research. Which form of validity to use to assess the instrument depends on the researcher and the context that he or she is facing.

Logical Flow in R: If/Else Statements Part II

In a previous post, we looked at If/Else statements in R. We developed a function that calculated the amount of money James and the owner would give based on how many points the team scored. Below is a copy of this function.

CashDonate <- function(points, Dollars_per_point=40, HomeGame=TRUE){
 game.points<- points * Dollars_per_point
 if(points > 100) {game.points <-points * 30}
 if(HomeGame) {Total.Donation <- game.points * 1.5
 } else {Total.Donation <- game.points * 1.3}
 round(Total.Donation)
}

There is one small problem with this function. Currently, you have to input each game one at a time; the function cannot correctly calculate the results of several games at once. For example, look at the results of the code below when we try to input more than one game at once.

> CashDonate(c(99,100,78))
[1] 5940 6000 4680
Warning message:
In if (points > 100) { :
  the condition has length > 1 and only the first element will be used

As you can see, we get a warning message. The ‘if’ statement only examines the first value in the vector (99), so the condition is checked once for the whole vector rather than for each game. Any vector containing a game of more than 100 points would therefore be calculated at the wrong rate.

In order to deal with this problem, R has the ‘ifelse’ function available. The ‘ifelse’ function is vectorized: it checks the condition for every value in a vector and returns the appropriate result for each one. We need to be able to choose the appropriate action based on the following information

  • points scored is less than or greater than 100
  • Home game or not a home game

Remember, R could do this if one value was put into the ‘CashDonate’ function. Now we need to be able to calculate what to do based on several values in each of the vectors above. Below is the modified code for doing this.

CashDonate <- function(points, HomeGame=TRUE){
 JamesDonate<- points * ifelse(points > 100, 30, 40)
 totalDonation<- JamesDonate * ifelse(HomeGame, 1.5, 1.3)
 round(totalDonation)
}

Here is what the modified function does

  1. It has the argument ‘points’ and the default argument of ‘HomeGame = TRUE’
  2. The first calculation is the donation based on points, but with the ‘ifelse’ function. If the number of points is greater than 100, the points are multiplied by 30; otherwise they are multiplied by 40. All this is put into the variable ‘JamesDonate’
  3. Next, the amount from ‘JamesDonate’ is multiplied by 1.5 if it was a home game or 1.3 if it was not a home game. All this is put into the variable ‘totalDonation’
  4. The results are rounded

To use CashDonate to its full potential you need to make a dataframe. Below is the code for the ‘games’ data frame we will use.

games<- data.frame(game.points=c(88,100,99,111,96), HomeGame=c(TRUE, FALSE, FALSE, TRUE, FALSE))

In the ‘games’ data frame we have two columns, one for game points and another that tells us if it was a home game or not. Now we will use the ‘games’ data frame with the new ‘CashDonate’ and calculate the results. We need to use the ‘with’ function to do this. This function will be explained at a later date. Below are the results.

> with(games, CashDonate(game.points, HomeGame=HomeGame))
[1] 5280 5200 5148 4995 4992

You can calculate this manually if you would like. Now, we can calculate more than one value in our ‘CashDonate’ function, which makes it much more useful than before. All thanks to the use of the ‘ifelse’ function in the code.

Assessing Reliability

In quantitative research, reliability measures an instrument’s stability and consistency. In simpler terms, reliability is how well an instrument is able to measure something repeatedly. Several factors can influence reliability, including unclear questions or statements, poor test administration procedures, and even the participants in the study.

In this post, we will look at different ways that a researcher can assess the reliability of an instrument. In particular, we will look at the following ways of measuring reliability…

  • Test-retest reliability
  • Alternative forms reliability
  • Kuder-Richardson Split Half Test
  • Coefficient Alpha

Test-Retest Reliability

Test-retest reliability assesses the reliability of an instrument by administering it to the same participants at two different times. The researcher then analyzes the data and looks for a correlation between the results of the two administrations of the instrument. In general, a correlation above about 0.6 is considered evidence of reasonable reliability.

One major drawback of this approach is that giving the same instrument to the same people a second time often influences the results of the second administration. It is important that a researcher is aware of this, as it indicates that test-retest reliability is not foolproof.
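
As a simple sketch of how this check might look in R, suppose the same instrument was given twice to five people (the scores below are made up). The ‘cor’ function reports the test-retest correlation, which in this made-up case is above the 0.6 guideline mentioned above.

> time1 <- c(7, 5, 9, 6, 8)
> time2 <- c(8, 5, 9, 5, 7)
> cor(time1, time2)
[1] 0.8838835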

Alternative Forms Reliability 

Alternative forms reliability involves the use of two different instruments that measure the same thing. The two instruments are given to the same sample, and the data from them are analyzed by calculating the correlation between them. Again, a correlation around 0.6 or higher is considered an indication of reliability.

The major problem with this is that it is difficult to find two instruments that really measure the same thing. Often scales may claim to measure the same concept but they may both have different operational definitions of the concept.

Kuder-Richardson Split Half Test

The Kuder-Richardson test assesses the reliability of instruments with categorical (dichotomous) items. In this approach, an instrument is split in half and the correlation is found between the two halves of the instrument. This approach looks at the internal consistency of the items of an instrument.

Coefficient Alpha

Another approach that looks at internal consistency is coefficient alpha. This approach involves administering an instrument and analyzing Cronbach’s alpha, which most statistical programs can calculate. Normally, scores above 0.7 indicate adequate reliability. Coefficient alpha is intended for continuous or scale items, such as Likert-type items.

Conclusion

Assessing reliability is important when conducting research. The approaches discussed here are among the most common. Which approach is best depends on the circumstances of the study that is being conducted.

Measuring Variables

When conducting quantitative research, one of the earliest things a researcher does is determine what their variables are. This involves developing an operational definition of each variable, which is a description of how you define the variable as well as how you intend to measure it.

After developing an operational definition of the variable(s) of a study, it is now necessary to measure the variable in a way that is consistent with the operational definition. In general, there are five forms of measurement and they are…

  • Performance measures
  • Attitudinal measures
  • Behavioral observation
  • Factual Information
  • Web-based data collection

All forms of measurement involve an instrument which is a tool for actually recording what is measured.

Performance Measures

Performance measures assess a person’s ability to do something. Examples of instruments of this type include an aptitude test, an intelligence test, or a rubric for assessing an essay. Often this form of measurement leads to “norms” that serve as a criterion for the progress of students.

Attitudinal Measures

Attitudinal measures assess people’s perceptions. They are commonly associated with Likert scales (strongly disagree to strongly agree). This form of measurement allows a researcher access to the attitudes of hundreds of people instead of the attitudes of a few, as would be the case in qualitative research.

Behavioral Observation

Behavioral observation is the observation of behaviors of interest to the researcher. The instrument involved is normally some sort of checklist. When the behavior is seen it is notated using tick marks.

Factual Information

Data that has already been collected and is available to the public is often called factual information.  The researcher takes this information and analyzes it to answer their questions.

Web-Based Data Collection

Surveys or interviews conducted over the internet are examples of web-based data collection. This is still relatively new. There are still people who question this approach as there are concerns over the representativeness of the sample.

Which Measure Should I Choose?

There are several guidelines to keep in mind when deciding how to measure variables.

  • What form of measurement are you able to complete? Your personal expertise, as well as the context of your study, affects what you are able to do. Naturally, you want to avoid doing publication-quality research with a measurement form you are unfamiliar with, or doing research in an uncooperative place.
  • What are your research questions? Research questions shape the entire study. A close look at research questions should reveal the most appropriate form of measurement.

The actual analysis of the data depends on the research questions. As such, almost any statistical technique can be applied for all of the forms of measurement. The only limitation is what the researcher wants to know.

Conclusion

Measuring variables is at the heart of quantitative research. The approach taken depends on the skills of the researcher as well as the research questions. Every form of measurement has its place when conducting research.

Developing Functions in R Part III: Using Functions as Arguments

Previously, we learned how to add nameless arguments to a function using ellipses ‘. . .’. In this post, we will learn how to use functions as arguments in other functions. An argument is the information passed to a function within its parentheses ( ). The reason for doing this is that it allows for many shortcuts in coding. Instead of having to retype the formula for something, you can pass a function as an argument and save a lot of time.

Below is the code that we have been working with for a while, before we add a function as an argument.

Percent_Divided <- function(x, divide = 2, ...) {
 ToPercent <- round(x/divide, ...)
 Output <- paste(ToPercent, "%", sep = "")
 return(Output)
 }

As a reminder, the ‘Percent_Divided’ function takes a number or variable ‘x’, divides it by two as a default, and adds a ‘%’ sign after the number(s). With the ‘. . .’ you can pass other arguments, such as ‘digits’, to specify how many digits you want after the decimal. Below is an example of the ‘Percent_Divided’ function in action for a variable called ‘B’ with the added argument ‘digits = 3’.

> B
[1] 23.35345 45.56456 32.12131
> Percent_Divided(B, digits = 3)
[1] "11.677%" "22.782%" "16.061%"

Functions as Arguments

We will now make a function that has a function for an argument. We will set a default function but remember that anything can be passed for the function argument. Below is the code for one way to do this.

Percent_Divided <- function(x, divide = 2,FUN_ARG = round, ...) {
 ToPercent <- FUN_ARG(x/divide, ...)
 Output <- paste(ToPercent, "%", sep = "")
 return(Output)
}

Here is an explanation

  1. Most of this script is the same. The main difference is that we added the argument ‘FUN_ARG’ to the first line of the script. This is the place where we can insert whatever function we want. The default function is ’round’. If we do not specify any function ’round’ will be used.
  2. In the second line of the code you again see ‘FUN_ARG’. This function is applied to ‘x’ divided by ‘divide’, along with whatever arguments are passed through the ‘. . .’
  3. The rest of the code has already been explained and has not been changed
  4. Important note. If we do not change the default of ‘FUN_ARG’, which is the ’round’ function, we will keep getting the same answers as always. The ‘FUN_ARG’ is only interesting if we do not use the default.

Below is an example of our modified function. The function we are going to pass through the ‘FUN_ARG’ argument is ‘signif’. ‘signif’ rounds the values in its first argument to the specified number of significant digits. We will also pass the argument ‘digits = 3’ through the ellipses ‘. . .’. The values of the variable B (see above) will be used for the function.

> Percent_Divided(B, FUN_ARG = signif, digits = 3)
[1] "11.7%" "22.8%" "16.1%"

Here is what happened

  1. The ‘Percent_Divided’ function was run
  2. The function ‘signif’ was passed through the ‘FUN_ARG’ argument and the argument ‘digits = 3’ was passed through the ellipses ‘. . .’ argument.
  3. All the values for B were transformed based on the script in the ‘Percent_Divided’ function (one more example with a different function follows this list)
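
As a further illustration, any function that can take the divided values as its first argument could be passed through ‘FUN_ARG’. For example, passing the base function ‘floor’ simply drops the decimals, so no ‘digits’ argument is needed.

> Percent_Divided(B, FUN_ARG = floor)
[1] "11%" "22%" "16%"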

Conclusion

Using functions as arguments is mostly about saving time when developing code. Even though this seems complicated, it is actually rudimentary programming.

Developing Functions in R Part II: Adding Arguments

In this post, we will continue the discussion on working with functions in R. Functions serve the purpose of programming R to execute several operations at once. Here, we will look at adding additional arguments to a function.

Arguments are the various entries within the parentheses of a function. For example, in our example below, the argument of the function is x.

MakePercent <- function(x) {
 ToPercent <- round(x, digits = 2)
 Output <- paste(ToPercent, "%", sep = "")
 return(Output)
}

In the example above there are many other arguments besides x. However, the only argument of the function we created is x. The other arguments in the other parentheses belong to the functions used inside the script, such as ‘round’ and ‘paste’. In this post, we are going to learn how to add additional arguments to our own function.

Let’s say that we want to convert a number to a percentage like a previous function we made but we now want to be able to divide the number by whatever we want. Here is how it could be done.

Percent_Divided <- function(x, divide) {
 ToPercent <- round(x/divide, digits = 2)
 Output <- paste(ToPercent, "%", sep = "")
 return(Output)
}

Here is what we did

  1. We created the object ‘Percent_Divided’ and assigned the function with the arguments ‘x’ and ‘divide’
  2. Next, inside the braces, we create the variable ‘ToPercent’ and use the ’round’ function to divide ‘x’ by whatever value ‘divide’ takes and round the result to two digits.
  3. The results of ‘ToPercent’ are then assigned to the variable ‘Output’, where a ‘%’ sign is appended to the value
  4. Lastly, the results of ‘Output’ are printed in the console.

Sounds simple. Below is the function in action dividing a number by 2 and then by 3

> source('~/.active-rstudio-document', echo=TRUE)

> Percent_Divided <- function(x, divide) {
+         ToPercent <- round(x/divide, digits = 2)
+         Output <- paste(ToPercent, "%", sep = "")
+    .... [TRUNCATED] 
> Percent_Divided(22.12234566, divide=2)
[1] "11.06%"
> Percent_Divided(22.12234566, divide=3)
[1] "7.37%"

Here is what happened

  1. You source the script from the source editor by typing ctrl + shift + enter
  2. Next, I used the function ‘Percent_Divided’ with the number 22.12234566 and I decided to divide the number by two
  3. R returns the answer 11.06%
  4. Next I repeat the process but I divide by 3 this time
  5. R returns the answer 7.37%

There is one problem. The argument ‘divide’ has no default value. This means that you have to tell R what the value of ‘divide’ is every single time. As an example, see below.

> Percent_Divided(22.12234566)
Error in Percent_Divided(22.12234566) : 
  argument "divide" is missing, with no default

Because I did not tell R what value ‘divide’ would be, R was not able to complete the process of the function. To solve this problem we will set the default value of ‘divide’ to 10 in the script as shown below.

Percent_Divided <- function(x, divide = 10) {
 ToPercent <- round(x/divide, digits = 2)
 Output <- paste(ToPercent, "%", sep = "")
 return(Output)
}

If you look closely you will see ‘divide = 10’. This is the default value for ‘divide’; if we do not set another number for ‘divide’, R will use 10. Below is an example using the default value of ‘divide’ and another example with ‘divide’ set to 5.

> Percent_Divided <- function(x, divide = 10) {
+         ToPercent <- round(x/divide, digits = 2)
+         Output <- paste(ToPercent, "%", sep = "") .... [TRUNCATED] 
> Percent_Divided(22.12234566)
[1] "2.21%"
> Percent_Divided(22.12234566, divide = 5)
[1] "4.42%"

First, we sourced the script using ctrl + shift + enter. In the first example, the number is automatically divided by 10 because this is the default. In the second example, we specified that we wanted to divide by five by adding the argument ‘divide = 5’. You can see the difference in the results.

In a future post, we will continue to examine the role of arguments in functions.

Reviewing the Literature: Part II

In the last post, we began a discussion of the steps involved in reviewing the literature and looked at the first two steps, which are identifying key terms and locating literature. In this post, we will look at the last three steps of developing a review of the literature, which are…

3. Evaluate and select literature to include in your review
4. Organize the literature
5. Write the literature review

Evaluating Literature

This step was alluded to when I wrote about using Google Scholar and Google Books in Part I. For articles, you want to assess their quality by determining who publishes the journal. Reputable publishers usually publish respectable journals. This is not to say that other sources of articles are totally useless. The point is that you want to attract as few questions as possible about the quality of the sources you use to develop a literature review.

One other important concept in evaluating literature is the relevancy of the sources. You want sources that focus on a similar topic, population, and/or problem. It is easy for a review of literature to lose focus, so this is a critical criterion to consider.

Organizing the Literature 

There are many options for organizing sources. You can make an outline and group the sources together by heading, or you can construct some sort of visual of the information. The place to start is to examine the abstracts of the articles that are going to be a part of your literature review. The abstract is a summary of the study and is a way to get an understanding of a study quickly.

If the abstract indicates that a study is beneficial you can look at the whole article to learn more. If the whole article is unavailable you can use the abstract as a potential source.

Writing a Review of Literature

Writing involves taking your outline or visual and converting it into paragraph format. There are at least three common ways to write a literature review: the thematic review, the study-by-study review, and the combo review.

The thematic review shares a theme in the research and cites several sources. There is very little detail; the citations support the claim made by the theme. Below is an example using APA formatting.

Smoking is bad for you (James, 2013; Smith, 2012; Thomas, 2009)

The details of the studies above are never shared but it is assumed that these studies all support the claim that smoking is bad for you.

Another type of literature review is the study-by-study review. In this approach, a detailed summary is provided of several studies under a larger theme. Consider the example below.

Thomas (2009) found in his study among middle class workers that smoking reduces lifespan by five years.

This example provides details about the dangers of smoking as found in one study.

A combo review is a mixture of the first two approaches. Sometimes you provide a thematic review; other times you provide the details of a study-by-study review. This is the most common approach, and it is the easiest to read because it provides an overview with occasional detail.

Conclusion

The ideas presented here are meant to provide support in writing a review of literature. There are many other ways to approach this, but the concepts presented here will provide some guidance.

Reviewing the Literature: Part I

The research process often begins with a literature review. A review of literature is a systematic summary of books, journal articles, and other sources pertaining to a particular topic. The purpose of a literature review is to demonstrate how your study adds to the existing literature and also to show why your study is needed.

In general, there are five common steps to reviewing the literature and they are…

  1. Identify key terms
  2. Locate literature
  3. Evaluate and select literature to include in your review
  4. Organize the literature
  5. Write the literature review

In this post, we will discuss the first two.

Identify Key Terms

The purpose of identifying key terms is that they give you words to “google” when you conduct a search. Below are some ways to develop key terms.

  • Creating some sort of title, even if it is temporary, and conducting a search based on the words in this title is one way to begin.
  • If you already have research questions, you can look for important words in these questions to conduct a search.
  • Find an article that studies something similar to your topic and look at the keywords it includes. Many articles have a list of keywords on the first page that can be used for other studies.

Locating Literature

Locating literature is not as difficult as it was years ago, thanks to the internet. Now, the search for high-quality sources does not even require leaving home. There is a rough hierarchy in terms of the quality and age of the material available, as follows. Each example below is rated on a scale of 1-5 for quality and newness; the higher the rating, the higher the quality or the newer the source.

  • Websites, newspapers, and blogs: Quality 1, Newness 5
  • Academic publications such as conference papers and theses: Quality 2, Newness 4
  • Peer-reviewed journal articles: Quality 3, Newness 3
  • Books: Quality 4, Newness 2
  • Summaries such as encyclopedias: Quality 5, Newness 1

In this hierarchy, the lower the quality rating, the newer the information generally is. Keep in mind that there are many exceptions to the example above. Self-published books would obviously have a much lower quality rating, while some online sources are of much higher quality because of who is providing the information.

Once you have some keywords it is time to begin the search. Google Books is an excellent place to begin. When you get to this website, you type in your key term and Google returns a list of books that contain the key term. You click on a book and it takes you to the page where the term appears. This is like holding the book in your hand at the library. You note whatever information you need and move on to another book.

For Google Scholar, you go to the site and type in your key term. Google Scholar gives you several pages of articles. Before choosing, there are a few guidelines to keep in mind.

  • Depending on your field, you will probably be expected to cite recent literature in your review, often from the last 5-10 years. To do this, you need to set a custom range for the articles you want to view. Focusing on the last 5-10 years also helps you stay focused and gets things done quicker. You only cite older material if it was groundbreaking.
  • Google Scholar gives you any article without concern for quality. To protect yourself from citing poor research, one strategy is to consider who the publisher was. Below are a few examples of high-quality publishers of academic journals. If an article was published by one of them, it is probably of decent quality.
    • Sage, JSTOR, Wiley, Elsevier

Conclusion

This provides some basic information on beginning the process. In a later post, we will go over the last few steps of conducting a literature review.

Understanding Lists in R

Lists are yet another way to store and retrieve information in R. The advantage of lists is their high degree of flexibility. You can store almost anything in any combination in a list, in ways that you could never store information in a vector, matrix, or data frame. This post will provide an introduction to ways to develop lists.

Making a List 

Making a list involves the use of the ‘list’ function. Our first list will involve converting a matrix to a list and adding the year as an additional piece of information.

> points.team
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22
> team.list <-list(points.team, '2014-15')
> team.list
[[1]]
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

[[2]]
[1] "2014-15"

Here is what we did

  1. We printed the matrix ‘points.team’. If you forgot how to create this, please CLICK HERE
  2. We then created the variable ‘team.list’ and assigned two elements to this list
    1. The matrix ‘points.team’
    2. The character string ‘2014-15’
  3. Next, we displayed ‘team.list’

The number in double brackets [[ ]] represents the element number. ‘points.team’ is the first element in the list. The number in single brackets [ ] represents a sub-element of an element. For example, ‘2014-15’ is a sub-element of the unnamed element 2 in the ‘team.list’ list.
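
To see the difference between the two kinds of brackets, below is a small sketch using the ‘team.list’ created above. Double brackets return the element itself, while single brackets return a one-element list.

> team.list[[2]]
[1] "2014-15"
> team.list[2]
[[1]]
[1] "2014-15"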

Making a Named List

You can name each element in a list. Below is an example using the same data.

> Named.List<- list(scores=points.team, season='2014-15')
> Named.List
$scores
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

$season
[1] "2014-15"

Here is what happened

  1. We created the variable ‘Named.List’ and assigned ‘scores’ the data from ‘points.team’ and ‘season’ the information ‘2014-15’
  2. We displayed the list
  3. If you look closely, instead of seeing double brackets [[ ]] you should see the names $scores and $season in the list.

All the tricks below apply to data frames as well.

Extracting Elements from a List

Let’s say you want to extract the ‘points.team’ information from ‘Named.List’. Here is how to do it.

> Named.List[['scores']]
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

All you do is type the name of the element in double brackets after the name of the list.
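
Because the elements of ‘Named.List’ are named, the ‘$’ operator offers a shortcut for the same extraction. A brief sketch:

> Named.List$scores
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22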

Changing Information in a List

Let’s say that you want to change the season from ‘2014-15’ to ‘2013-2014’. Here is one way to do this.

> Named.List[2] <- list('2013-2014')
> Named.List
$scores
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

$season
[1] "2013-2014"

All we did was type the name of the variable ‘Named.List’, use the brackets to indicate the second element, and assign ‘2013-2014’ using the ‘list’ function.
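
Since the element is named, the same change can also be made with the ‘$’ operator, which assigns the new value directly without wrapping it in the ‘list’ function. A minimal sketch:

> Named.List$season <- '2013-2014'
> Named.List$season
[1] "2013-2014"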

Adding Information to a List

You now want to add the names of the opponents that James and Kevin faced over the six games. To do this, examine the following.

> Opponents<- c("Warriors", "Kings","Hawks","Wizards","Bulls","Grizzlies")
> Named.List<- list(scores=points.team, season='2014-15', Opponents=Opponents)
> Named.List
$scores
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

$season
[1] "2014-15"

$Opponents
[1] "Warriors"  "Kings"     "Hawks"     "Wizards"   "Bulls"     "Grizzlies"

Here is what we did

  1. We created a variable called ‘Opponents’ and entered the names of six teams
  2. Next, we recreated ‘Named.List’ with the previous information along with the new argument Opponents = Opponents. This tells R to add the variable ‘Opponents’ to the list (a shortcut appears after this list).
  3. Finally, we displayed the modified list.
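
As mentioned in the list above, a shortcut is available. Rather than rebuilding the whole list, you can attach the new element to the existing list with the ‘$’ operator, as in the small sketch below.

> Named.List$Opponents <- Opponents
> Named.List$Opponents
[1] "Warriors"  "Kings"     "Hawks"     "Wizards"   "Bulls"     "Grizzlies"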

Conclusion

There is a lot of overlap in what you can do with lists, data frames, and matrices. Therefore, keep in mind that much of what can be done with these other objects can be done with lists as well. The benefit of lists is the ability to combine so much diverse information in one place.

Data Frames in R: Part II

In this post, we will explore how to create data frames as well as look at other aspects of using data frames in R. The first example below is a data frame that contains information about fictional faculty members. Our job will be to put this information into a data frame and to rename the columns. Below is the example, and it will be followed by an explanation.

> Faculty <- c('Darrin Thomas', 'Hank Smith', 'Sarah William')
> Salary <- c(60000, 50000, 53000)
> Hire_Date <- as.Date(c('2015-1-1', '2000-6-1', '2012-9-1'))
> Lecturers.data <- data.frame(Faculty, Salary, Hire_Date)
> str(Lecturers.data)
'data.frame':	3 obs. of  3 variables:
 $ Faculty  : Factor w/ 3 levels "Darrin Thomas",..: 1 2 3
 $ Salary   : num  60000 50000 53000
 $ Hire_Date: Date, format: "2015-01-01" "2000-06-01" "2012-09-01"

Here is what happened

  1. We started by making three different vectors and assigning a variable to each. The variables are ‘Faculty’, ‘Salary’, and ‘Hire_Date’.
  2. We then assigned all three variables to the data frame ‘Lecturers.data’. ‘Faculty’ is a factor vector, ‘Salary’ is a numeric vector, and ‘Hire_Date’ is a date vector. Again, the advantage of data frames is their ability to hold several different types of data.
  3. We then used the ‘str’ function to see the attributes of the ‘Lecturers.data’ data frame.

There is one small problem with the data frame above. ‘Faculty’ is a factor vector, but our original vector for ‘Faculty’ was a character vector. We want ‘Faculty’ to continue to be a character vector instead of becoming a factor. The example below shows one way to deal with this small problem.

> Lecturers.data <- data.frame(Faculty, Salary, Hire_Date, stringsAsFactors=FALSE)
> str(Lecturers.data)
'data.frame':	3 obs. of  3 variables:
 $ Faculty  : chr  "Darrin Thomas" "Hank Smith" "Sarah William"
 $ Salary   : num  60000 50000 53000
 $ Hire_Date: Date, format: "2015-01-01" "2000-06-01" "2012-09-01"

By adding the argument ‘stringsAsFactors=FALSE’, we force character vectors to remain characters rather than being converted to factors. If you look closely you will see that ‘$ Faculty’ is no longer a factor as in the previous example. Instead, it is now a ‘chr’ or character variable.
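
If the data frame has already been created with a factor column, another option is to convert just that column afterward with the ‘as.character’ function. A quick sketch, assuming the first version of ‘Lecturers.data’ in which ‘Faculty’ is a factor:

> Lecturers.data$Faculty <- as.character(Lecturers.data$Faculty)
> str(Lecturers.data$Faculty)
 chr [1:3] "Darrin Thomas" "Hank Smith" "Sarah William"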

It is also possible to rename column names in a data frame, just like in a matrix. For example, let’s say you made a mistake with the ‘Hire_Date’ variable. You did not mean the date the lecturers were hired but the date they resigned. Below is an example of how to fix this.

> Lecturers.data
        Faculty Salary  Hire_Date
1 Darrin Thomas  60000 2015-01-01
2    Hank Smith  50000 2000-06-01
3 Sarah William  53000 2012-09-01
> names(Lecturers.data) [3] <- 'Resign_Date'
> Lecturers.data
        Faculty Salary Resign_Date
1 Darrin Thomas  60000  2015-01-01
2    Hank Smith  50000  2000-06-01
3 Sarah William  53000  2012-09-01

Here is what happened

  1. We displayed the data frame ‘Lecturers.data’ as a reference point
  2. We noticed that we did not want a column named ‘Hire_Date’ but wanted to change the name to ‘Resign_Date’
  3. To change the name, we used the ‘names’ function on ‘Lecturers.data’. We specifically told R to rename the third column by using the subset brackets [3] and assigned the name ‘Resign_Date’ (a quick check of the result appears after this list)
  4. We then redisplay the ‘Lecturers.data’ data frame. If you compare this data frame with the first you can see that the third column has been renamed as desired.
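
As a quick check of the result, the renamed column can be pulled out with the ‘$’ operator. A brief sketch:

> Lecturers.data$Resign_Date
[1] "2015-01-01" "2000-06-01" "2012-09-01"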

This post provided some basic information on developing data frames. We learned how to combine vectors into a data frame, how to change a factor to a character vector, and how to rename a column. Skills such as these are beneficial to anyone who needs to use data frames.

Data Frames in R: Part I

So far we have looked at vectors, matrices, and arrays. One thing these three objects have in common is that they consist of one type of data. In other words, vectors, matrices and arrays contain either numerical or character information but not both at the same time.

Data frames are different. They allow you to have a mixture of information all contained within one place. You can have character data, such as names, and numerical data, such as salaries all in one place. In this post, we will first look at how to convert a matrix to a data frame.

Converting a Matrix to a Data Frame

One of the benefits of converting a matrix to a data frame is that the columns in a matrix become variables in a data frame. This is useful for data analysis at times. In the example below, we will convert the matrix ‘points.team’ into a data frame. In order to do this, we have to use two new functions: the ‘as.data.frame’ function, which converts the matrix into a data frame, and the ‘t’ function, which transposes the matrix so that the rows become the columns. Below is the code for completing this.

> points.team
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22
> team.points.df <- as.data.frame(t(points.team))
> team.points.df
    James Kevin
1st    12    20
2nd    15    19
3rd    30    25
4th    25    30
5th    23    31
6th    32    22

This is what we did

  1. We displayed the matrix ‘points.team’ as a reference. The code for creating this is available here.
  2. We then created the variable ‘team.points.df’ and used two functions to convert the matrix ‘points.team’ to a data frame
    1. ‘as.data.frame’ was used to make the actual data frame
    2. ‘t’ was used to move or transpose the names of the rows in the matrix (James and Kevin) so that they become the names of the columns in the data frame. This gives us two variables (James and Kevin) with six entries (1st-6th). If we had not done this, we would have made the example below
      > team.points.df
            1st 2nd 3rd 4th 5th 6th
      James  12  15  30  25  23  32
      Kevin  20  19  25  30  31  22

      In this example, we did not transpose ‘James’ and ‘Kevin’ to be columns. Instead, 1st-6th are the variables rather than ‘James’ and ‘Kevin’. For our purposes this does not make sense, but it may be appropriate in other situations.

  3. The last step involved displaying the new data frame ‘team.points.df’

Using the ‘str’ function allows you to learn some information about a data frame, as in the example below.

> str(team.points.df)
'data.frame':	6 obs. of  2 variables:
 $ James: num  12 15 30 25 23 32
 $ Kevin: num  20 19 25 30 31 22

Here is what we now know about our data frame

  • The variable is a data frame (data.frame)
  • It has six observations (6 obs.) which are the points scored by the players in six games
  • There are two variables ‘James’ and ‘Kevin’
  • The variables are numeric (num)

This is just the beginning of our examination of data frames in R. In a future post we will look at making original data frames.

Beyond Vectors: Introduction to Matrices and Arrays in R Part IV

In this post, we will look at how to rename the rows and columns in a matrix. We will also examine how to do the following

  • Name rows and columns in matrices
  • Make an array
  • Basic math in matrices & arrays

Naming Rows and Columns

Renaming rows and columns has a practical use. It allows people to provide meaning and/or context to the data they are interpreting. In the example below, we will rename the rows and the columns of the basketball players' matrix so that it contains their names as well as the game number.

> points.of.James
[1] 12 15 30 25 23 32
> points.of.Kevin
[1] 20 19 25 30 31 22
> points.team <- rbind(points.of.James, points.of.Kevin)
> points.team
                [,1] [,2] [,3] [,4] [,5] [,6]
points.of.James   12   15   30   25   23   32
points.of.Kevin   20   19   25   30   31   22
> rownames(points.team) <- c("James", "Kevin")
> points.team
      [,1] [,2] [,3] [,4] [,5] [,6]
James   12   15   30   25   23   32
Kevin   20   19   25   30   31   22
> colnames(points.team) <- c("1st", "2nd", "3rd", "4th", "5th", "6th")
> points.team
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22

Here is what’s going on

  1. We made the variables ‘points.of.James’ and ‘points.of.Kevin’ and displayed them
  2. We combined ‘points.of.James’ and ‘points.of.Kevin’ into a matrix using the ‘rbind’ function and assigned this to the variable ‘points.team’
  3. We then displayed ‘points.team’
  4. We then renamed the rows using the ‘rownames’ function. We replaced ‘points.of.James’ and ‘points.of.Kevin’ with ‘James’ and ‘Kevin’ in the rows. We then displayed this change in the matrix.
  5. Next, we renamed columns 1-6 as 1st, 2nd, 3rd, 4th, 5th, and 6th using the ‘colnames’ function (a one-step alternative is sketched after this list).
  6. Lastly, we displayed the finished table.
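
As a one-step alternative to the two renaming calls above, the ‘dimnames’ function assigns the row names and the column names at the same time using a list. Below is a minimal sketch that produces the same matrix.

> dimnames(points.team) <- list(c("James", "Kevin"), c("1st", "2nd", "3rd", "4th", "5th", "6th"))
> points.team
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22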

Making Arrays

A vector has one dimension (row), a matrix has two dimensions (row & column), and an array has three or more dimensions (length, width, & height, for example). We are not going to cover arrays extensively because it is hard to envision data beyond 3 dimensions. In addition, please remember that all the rules and tricks for vectors and matrices apply to arrays as well. Below is an example of how to make an array.

> array1 <-array(1:48, dim=c(6, 8, 4))
> array1
, , 1

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    7   13   19   25   31   37   43
[2,]    2    8   14   20   26   32   38   44
[3,]    3    9   15   21   27   33   39   45
[4,]    4   10   16   22   28   34   40   46
[5,]    5   11   17   23   29   35   41   47
[6,]    6   12   18   24   30   36   42   48

, , 2

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    7   13   19   25   31   37   43
[2,]    2    8   14   20   26   32   38   44
[3,]    3    9   15   21   27   33   39   45
[4,]    4   10   16   22   28   34   40   46
[5,]    5   11   17   23   29   35   41   47
[6,]    6   12   18   24   30   36   42   48

, , 3

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    7   13   19   25   31   37   43
[2,]    2    8   14   20   26   32   38   44
[3,]    3    9   15   21   27   33   39   45
[4,]    4   10   16   22   28   34   40   46
[5,]    5   11   17   23   29   35   41   47
[6,]    6   12   18   24   30   36   42   48

, , 4

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    7   13   19   25   31   37   43
[2,]    2    8   14   20   26   32   38   44
[3,]    3    9   15   21   27   33   39   45
[4,]    4   10   16   22   28   34   40   46
[5,]    5   11   17   23   29   35   41   47
[6,]    6   12   18   24   30   36   42   48

Here is what is happening

  1. We created ‘array1’ and assigned to it an array that contains the numbers 1 to 48 and has 6 rows, 8 columns, and 4 layers. Because 1:48 only fills one 6 x 8 layer, R recycles the numbers so that each of the four layers is identical.
  2. We then have R display the results

To locate a value at a specific position, you add a third number to the address: the first number is the row, the second is the column, and the third is the layer. For example, array1[1, 1, 1] is the value in the first row and first column of the first layer. Also, do not forget that extracting rows and columns re-shuffles the numbering of the remaining rows and columns.
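
Below is a brief sketch of this addressing using ‘array1’. The first value is in the first row and first column of the first layer, and the value 14 sits in the second row and third column of the fourth layer.

> array1[1, 1, 1]
[1] 1
> array1[2, 3, 4]
[1] 14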

Basic Math in Matrices and Arrays

You can change all the values in a matrix and array. For example, let us say we want to add four points to every game that Kevin and James played. We can do this by using the following code.

> points.team
      1st 2nd 3rd 4th 5th 6th
James  12  15  30  25  23  32
Kevin  20  19  25  30  31  22
> new.points.team <- points.team+4
> new.points.team
      1st 2nd 3rd 4th 5th 6th
James  16  19  34  29  27  36
Kevin  24  23  29  34  35  26

Here is what happened

  1. We redisplayed the variable ‘points.team’
  2. We then created a new variable called ‘new.points.team’ and told R to add 4 points to every value in the variable ‘points.team’
  3. We displayed the new table. A closer look will show that every value is 4 points higher in the new table than in the original ‘points.team’ matrix (another example follows this list).
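
As noted in the list above, other arithmetic operators work the same way. As a quick sketch, multiplying instead of adding doubles every value:

> points.team * 2
      1st 2nd 3rd 4th 5th 6th
James  24  30  60  50  46  64
Kevin  40  38  50  60  62  44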

Conclusions

This concludes our examination of matrices and arrays.