A demo on the use of the Zotero Reference software
Working with students over the years has led me to the conclusion that students often do not understand the connection between variables, quantitative research questions, and the statistical tools used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary statistical skills to succeed.
This means that students have a theoretical understanding of statistics but struggle to apply it. In this post, we will look at some of the connections between research questions and statistics.
Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.
Another aspect that can make this confusing is that many variables can be measured in more than one way. For example, the variable "salary" can be measured in a continuous manner or in a categorical manner. Which is better depends on the goals of the research.
It is critical to help students develop a thorough understanding of variables in order to support their research.
Types of Research Questions
In general, there are two types of research questions: descriptive and relational. Descriptive questions involve the use of descriptive statistics such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (i.e., the average height is 172 cm) rather than relying on qualitative descriptions of it (i.e., the people are tall).
Below are several example research questions that are descriptive in nature.

1. What is the average height of the students?
2. What proportion of the students passed the exam?
3. What is the students' perception of the cafeteria?
These questions are not intellectually sophisticated, but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items that are used to measure respondents' perception of the cafeteria.
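As a quick sketch of what answering these looks like in R (the data here is hypothetical):

scores <- c(55, 72, 80, 91, 68, 59)    # hypothetical exam scores
mean(scores)                           # Question 1: the average score
sum(scores >= 60)/length(scores)       # Question 2: proportion who passed, assuming a cutoff of 60
survey <- data.frame(item1 = c(4, 3, 5), item2 = c(2, 4, 4))   # hypothetical survey items
mean(rowMeans(survey))                 # Question 3: mean across the survey items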
Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.
Relational questions look for the connection or link between variables. Within this type there are two sub-types: comparison questions, which involve comparing groups, and relational or association questions.
Comparison questions involve comparing groups on a continuous variable. For example, comparing men and women by height: what you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is looking for a relationship, just not in the way that many students understand. Common comparison questions include the following.

1. Is there a difference in height between men and women?
2. Is there a difference in test scores among freshmen, sophomores, and juniors?
Each of these questions can be answered using ANOVA or, if we want to get technical and there are only two groups (e.g., gender), a t-test. This is a broad overview and does not include the complexities of one-sample or paired t-tests.
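Here is a minimal sketch in R with hypothetical data:

height <- c(170, 182, 175, 158, 162, 165)
gender <- factor(c("male", "male", "male", "female", "female", "female"))
t.test(height ~ gender)          # two groups: t-test
summary(aov(height ~ gender))    # the same comparison as an ANOVA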
Relational or association questions primarily involve continuous variables. The goal is to see how variables move together. For example, you may look for the relationship between the height and weight of students. Common questions include the following.

1. What is the relationship between the height and weight of students?
2. Does height predict weight?
Question 1 can be answered by calculating a correlation. Question 2 requires the use of linear regression.
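Again, a minimal sketch in R with hypothetical data:

height <- c(160, 170, 180, 175, 165)
weight <- c(55, 68, 80, 74, 60)
cor(height, weight)              # Question 1: correlation
summary(lm(weight ~ height))     # Question 2: linear regression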
The challenge as a teacher is showing students the connection between statistics and research questions from the real world. It takes time for students to see how the question informs the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what to do in research based on the statistical knowledge one has.
Some of the biggest challenges in helping students with research stem from their lack of preparation. The problem is not ignorance of statistics or research design, as that takes only a little support to address. The real problem is that students want to do research while having read hardly any research and while lacking knowledge of how research writing is communicated. This post will share some prerequisites to performing research.
Extensive reading means reading broadly about a topic and not focusing too much on specifics. Therefore, you read indiscriminately, perhaps limiting yourself only to your general discipline.
In order to communicate research, you must first be familiar with the vocabulary and norms of research. This can be learned to a great extent through reading academic empirical articles.
The analogy I like to use is how a baby learns. A baby spends large amounts of time being exposed to the words and actions of others. The baby has no real idea of what is going on at first. However, after continuous exposure, the child begins to understand the words and actions of those around them and even begins to mimic the behaviors.
In many ways, this is the purpose of reading a great deal before even attempting to do any research. Just like the baby, a writer needs to observe how others do things, continue this process even when they do not understand, and attempt to imitate the desired behaviors. You must understand the forms of communication as well as the cultural expectations of research writing, and this can only happen through direct observation.
At the end of this experience, you begin to notice a pattern in the structure of research writing. The style is highly rigid with little variation.
It is hard to say how much extensive reading a person needs. Generally, the more reading done in the past, the less reading needed to understand the structure of research writing. If you hate to read and did little reading in the past, you will need to read a lot more to understand research writing than someone with an extensive background in reading. In addition, if you are trying to write in a second language, you will need to read much more than someone writing in their native language.
If you are still desirous of a hard number of articles to read I would say aim for the following
Extensive reading is just reading. There is no notetaking or even highlighting; you are focusing on exposure only. Just like the observant baby, you are living in the moment trying to determine what the appropriate behavior is. If you don't understand, keep going anyway, as the purpose is quantity and not quality. Generally, when the structure of the writing begins to become redundant and you can tell what the author is doing without having to read too closely, you are ready to move on.
Intensive reading is reading for understanding. This involves slowing down with the goal of deeper understanding. Now you select something in particular that you want to know. Perhaps you want to become more familiar with the writing of one excellent author, or maybe there is one topic in particular that you are interested in. With intensive reading, you want to know everything that is happening in the text. To achieve this, you read fewer articles and focus much more on quality over quantity.
By the end of the extensive and intensive reading, you should be familiar with the following.
Once a student has read a lot of research, there is some hope that they can now attempt to write in this style. As the teacher, it is my responsibility to point out the structure of research writing, which involves such ideas as the five sections and the parts of each section.
Students grasp this but they often cannot build paragraphs. In order to write academic research, you must know the purpose of main ideas, supporting details, and writing patterns. If these terms are unknown to you it will be difficult to write research that is communicated clearly.
The main idea is almost always the first sentence of a paragraph, and writing patterns provide different ways to organize the supporting details. This involves understanding the purpose of each paragraph that is written, something many students cannot explain. This is looking at writing from a communicative or discourse perspective and not from a minute-detail or grammar one.
The only way to do this is to practice writing. I often will have students develop several different reviews of literature. During this experience, they learn how to share the ideas of others. The next step is developing a proposal in which the student shares their ideas and someone else’s. The final step is writing a formal research paper.
To write, you must first observe how others write. Then you need to imitate what you saw. Once you can do what others have done, you can ask questions about why things are done this way. Too often, people just want to write without even understanding what they are trying to do. This leads to paralysis at best (I don't know what to do) and disaster at worst (spending hours confidently writing garbage). The enemy of research is not methodology, as many people write a lot without knowledge of stats or research design because they collaborate. The real enemy of research is neglecting the preparation of reading and the practice of writing.
Writing the results of a research paper is difficult. As a researcher, you have to try and figure out if you answered the question. In addition, you have to figure out what information is important enough to share. As such it is easy to get stuck at this stage of the research experience. Below are some ideas to help with speeding up this process.
Consider the Order of the Answers
This may seem obvious, but probably the best advice I could give a student writing their results section is to be sure to answer their questions in the order they were presented in the introduction of the study. This helps with cohesion and coherence. The reader is anticipating answers to these questions and often subconsciously remembers the order the questions came in.
If a student answers the questions out of order, it can be jarring for the reader. When this happens, the reader starts to double-check what the questions were and begins to second-guess their understanding of the paper, which reflects poorly on the writer. An analogy would be introducing three of your friends to your parents: you might share each person's name in order, 1st, 2nd, 3rd, and then go back and share a little personal information about each friend in the same order. The same courtesy should apply when answering research questions in the results section. Whichever question was first is answered first, and so on.
Consider how to Represent the Answers
Another aspect to consider is the presentation of the answers. Should everything be in text? What about the use of visuals and tables? The answers depend on several factors.
Know when to Interpret
Sometimes I have had students try to explain the results while presenting them. I cannot say this is wrong; however, it can be confusing. The reason is that the student is trying to do two things at the same time: present the results and interpret them. This would be OK in a presentation, and even expected, but when someone is reading a paper it is difficult to keep two separate threads of thought going at the same time. Therefore, the meaning or interpretation of the results should be saved for the Discussion/Conclusion section.
Presenting the results is in many ways the high point of a research experience. It is not easy to take numerical results and capture the useful information clearly. As such, the advice given here is intended to help support this experience.
Students often struggle with shaping the methodology section of their paper. The problem is often that students do not see the connection between the different sections of a research paper. This inability to connect the dots leads to isolated thinking on the topic and an inability to move forward.
The methodology section of a research paper plays a critical role. In brief, the purpose of a methodology is to explain to your readers how you will answer your research questions. In the strictest sense, this is important for reproducing a study. Therefore, what is really important when writing a methodology is the research questions of the study. The research questions determine the following sections of a methodology.
What this means is that a student must know what they want to know in order to explain how they will find the answers. Below is a description of these sections along with one section that is not often influenced by the research questions.
Sample & Setting
In the sample section of the methodology, it is common for the student to explain the setting of the study, provide some demographics, and explain the sampling method. In this section of the methodology, the goal is to describe what the reader needs to know about the participants in order to understand the context from which the results were derived.
Research Design & Scales
The research design explains specifically how the data was collected. There are several standard ways to do this in the social sciences.
Within this section, some academic disciplines also explain the scales or the tools used to measure the variable(s) of the study. Again, it is impossible to develop this section if the research questions are unclear or unknown.
Data Analysis

The data analysis section provides an explanation of how the answers were calculated in a study. Success in this section requires knowledge of the various statistical tools that are available. However, understanding the research questions is key to articulating this section clearly.
A final section in many methodologies is ethics. The ethics section is a place where the student can explain how they protected participants' anonymity, obtained the necessary permissions, and addressed other moral aspects of the study. This section is required by most universities in order to gain permission to do research. However, it is often missing from journal articles.
The methodology is part of the larger picture of communicating one's research. It is important that a research paper is not seen as isolated parts but rather as a whole. The reason for this position is that a paper cannot make sense if any of these aspects is missing.
Writing a review of literature can be challenging for students. The purpose here is to try and synthesize a huge amount of information and to try and communicate it clearly to someone who has not read what you have read.
Often a student will collect as many articles as possible and try to throw them all together to make a review of the literature. This naturally leads to the paper sounding like a shopping list of various articles, neither interesting nor coherent.
Instead, when writing a review of literature a student should keep in mind the question
What do my readers need to know in order to understand my study?
This is a foundational principle when writing. Readers don't need to know everything, only what they need to know to appreciate the study they are reading. An extension of this is that different readers need to know different things. As such, there is always a contextual element to framing a review of the literature.
Consider the Format
When working with a student, I always recommend the following format to get their writing started.
For each major variable in your study do the following…
Definition

The first thing that needs to be done is to provide a definition of the construct. This is important because many constructs are defined in many different ways. This can lead to confusion if the reader is thinking of one definition and the writer is thinking of another.
Examples and Theories
Step 2 is more complex. After a definition is provided, the student can either provide an example of what the construct looks like in the real world and/or provide more information about theories related to the construct.
Sometimes examples are useful. For example, if writing a paper on addiction it would be useful to not only define it but also to provide examples of the symptoms of addiction. The examples help the reader to see what used to be an abstract definition in the real world.
Theories are important for providing a deeper explanation of a construct. However, theories tend to be highly abstract and do not always help a reader to understand the construct better. One benefit of theories is that they provide a historical background of where the construct came from and can be used to develop the significance of the study as the student tries to find some sort of gap to explore in their own paper.
Often it can be beneficial to include both examples and theories, as this demonstrates stronger expertise in the subject matter. In theses and dissertations, both are expected whenever possible. However, for articles, space limitations and knowledge of the audience affect the inclusion of both.
Relevant Studies

The relevant studies section is similar to breaking news on CNN: the studies should generally be recent. In the social sciences, we are often encouraged to look at literature from the last five years, perhaps ten years in some cases. Generally, readers want to know what has happened recently, as experienced experts are familiar with the older papers. This rule does not apply as strictly to theses and dissertations.
Once recent literature has been found, the student needs to organize it thematically. The reason for a thematic organization is that the theme serves as the main idea of the section and the studies themselves serve as the supporting details. This structure is surprisingly clear for many readers, as its predictable nature allows the reader to focus on content rather than on trying to figure out what the author is trying to say. Below is an example.
There are several challenges with using technology in class (ref, 2003; ref, 2010). For example, Doe (2009) found that technology can be unpredictable in the classroom. James (2010) found that lack of training can lead some teachers to resent having to use new technology.
The main idea here is "challenges with technology." The supporting details are Doe (2009) and James (2010). This concept of themes can become much more complex than this and can span several paragraphs or pages.
This process really cuts down on the confusion in students' writing. Stronger students can be free to do what they want. However, many students require structure and guidance when they first begin writing research papers.
Academic and applied research are perhaps the only two ways that research can be performed. In this post, we will look at the differences between these two perspectives on research.
Academic research falls into two categories. These two categories are research ON your field and research FOR your field.
Research ON your field is research that searches for best practice. It looks at how your academic area is practiced in the real world. A scholar will examine how well a theory is being applied or used in a real-world setting and make recommendations.
For example, in education, if a scholar does research in reading comprehension, they may want to determine some of the most appropriate strategies for teaching reading comprehension. The scholar will look at existing theories and determine which one(s) are most appropriate for supporting students.
Research ON your field is focused on existing theories that are tested with the goal of developing recommendations for improving practice.
Research FOR your field is slightly different. This perspective seeks to expand theoretical knowledge about your field. In other words, the scholar develops new theories rather than assessing the application of older ones.
An example of this in education would be developing a new theory of reading comprehension. By theory, it is meant explanation. Famous theories in education include Piaget's stages of development, Kohlberg's stages of moral development, and more. At their time, each of these theories pushed the boundaries of our understanding of something.
The main thing about academic research is that it leads to recommendations but not necessarily to answers that solve problems. Answering problems is something that is done with applied research.
Applied research is also known as research IN your field. This type of research is often performed by practitioners in the field.
There are several forms of research IN your field, and they are as follows: formative, monitoring, and summative.
Formative research is for identifying problems. For example, a teacher may notice that students are not performing well or doing their homework. Formative applied research is when the teacher puts on the detective hat and begins to search for the cause of this behavior.
The results of formative research lead to some sort of action plan to solve the problem. Monitoring applied research is then conducted during the implementation of the solution to see how things are going.
For example, if the teacher discovers that students are struggling with reading because of weak phonological awareness, they may implement a review program for this skill. Monitoring would involve assessing student reading performance during the program.
Summative applied research is conducted at the end of implementation to see if the objectives of the program were met. Returning to the reading example, if the teacher's objective was to improve reading comprehension scores by 10%, the summative research would assess how well the students can now read and whether there was a 10% improvement.
In education, applied research is also known as action research.
Research can serve many different purposes. Academics focus on recommendations rather than action, while practitioners want to solve problems and perhaps recommend less. The point is that understanding what type of research you are trying to conduct can help you in shaping the direction of your study.
Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.
In light of this challenge, there are at least five core tasks that you need to consider when preparing to analyze data. These five tasks are developing your question(s), exploring the data, developing a statistical model, interpreting the results, and communicating the results.
Developing Your Question(s)
You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).
There are several types of research questions. The point is you need to ask them in order to answer them.
Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.
In addition, exploration of the data allows you to determine if there are any problems with the data set, such as missing data or strange variables, and, if necessary, to develop a data dictionary so you know the characteristics of the variables.
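A few common first moves when exploring a data frame in R, using the built-in mtcars data as a stand-in:

df <- mtcars             # any data frame; mtcars is just a built-in example
str(df)                  # variable types and structure
summary(df)              # basic descriptive statistics for each variable
colSums(is.na(df))       # count of missing values per variable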
Data exploration also allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for the more formal analysis, when you develop your statistical models. This process takes up the majority of a data scientist's time and is not easy at all. Mastery of this in many ways means being a master of data science.
Develop a Statistical Model
Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that can affect this include whether the learning is supervised or unsupervised and whether you want to classify or predict numerical values.
This is probably the most fun part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s).
Interpreting the Results
Once a model is developed, it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of "black box" methods such as support vector machines and artificial neural networks. Models normally need to be explainable to non-technical stakeholders.
With interpretation, you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?
Communication of Results
Now is the time to actually share the answer(s) to your question(s). How this is done varies, but it can be written, verbal, or both. Whatever the mode of communication, it is necessary to consider the following.
You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.
The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.
Generalized linear models (GLM) are another way to approach linear regression. The advantage of GLM is that it allows the error to follow many different distributions rather than only the normal distribution, which is an assumption of traditional linear regression.

Often GLM is used for response or dependent variables that are binary or represent count data. This post will provide a brief explanation of GLM as well as provide an example.
There are three important components to a GLM: the error structure, the linear predictor, and the link function.
The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling, such as the binomial, Gaussian, Poisson, etc. Each distribution comes with certain assumptions that govern its use.
The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions, and the best link function is the one that reduces the residual deviance the most.
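For the binomial model used below, the link is the logit, so the model takes the form log(p / (1 - p)) = b0 + b1x1 + ... + bkxk, where p is the probability that the dependent variable equals one.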
In our example, we will try to predict if a house will have air conditioning based on the interaction between the number of bedrooms and bathrooms, the number of stories, and the price of the house. To do this, we will use the "Housing" dataset from the "Ecdat" package. Below is some initial code to get started.
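A minimal version of that setup, assuming the Ecdat package is installed:

library(Ecdat)
data(Housing)
str(Housing)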
The dependent variable "airco" in the "Housing" dataset is binary. This calls for us to use a GLM. To do this we will use the "glm" function in R. Furthermore, in our example, we want to determine if there is an interaction between the number of bedrooms and bathrooms. Interaction means that the influence of the two independent variables (bathrooms and bedrooms) on the dependent variable (airco) is not additive; the combined effect of the independent variables is different than if you just added them together. Below is the code for the model followed by a summary of the results.
model <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family = binomial)
summary(model)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7069  -0.7540  -0.5321   0.8073   2.4217
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.441e+00  1.391e+00  -4.632 3.63e-06
## Housing$bedrooms                  8.041e-01  4.353e-01   1.847   0.0647
## Housing$bathrms                   1.753e+00  1.040e+00   1.685   0.0919
## Housing$stories                   3.209e-01  1.344e-01   2.388   0.0170
## Housing$price                     4.268e-05  5.567e-06   7.667 1.76e-14
## Housing$bedrooms:Housing$bathrms -6.585e-01  3.031e-01  -2.173   0.0298
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.75  on 540  degrees of freedom
## AIC: 561.75
##
## Number of Fisher Scoring iterations: 4
To check how good our model is, we need to check for overdispersion and compare this model to other potential models. Overdispersion is a measure used to determine if there is too much variability in the model. It is calculated by dividing the residual deviance by the degrees of freedom. Below is the solution for this.
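Using the residual deviance (549.75) and degrees of freedom (540) from the summary above:

#overdispersion calculation
549.75/540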
##  1.018056
Our answer is about 1.02, which is pretty good; values close to 1 indicate little overdispersion.
Now we will make several models and compare their results.
#add recroom and garagepl
model2 <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family = binomial)
summary(model2)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$garagepl,
##     family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6733  -0.7522  -0.5287   0.8035   2.4239
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.369e+00  1.401e+00  -4.545 5.51e-06
## Housing$bedrooms                  7.830e-01  4.391e-01   1.783   0.0745
## Housing$bathrms                   1.702e+00  1.047e+00   1.626   0.1039
## Housing$stories                   3.286e-01  1.378e-01   2.384   0.0171
## Housing$price                     4.204e-05  6.015e-06   6.989 2.77e-12
## Housing$recroomyes                1.229e-01  2.683e-01   0.458   0.6470
## Housing$garagepl                  2.555e-03  1.308e-01   0.020   0.9844
## Housing$bedrooms:Housing$bathrms -6.430e-01  3.054e-01  -2.106   0.0352
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.54  on 538  degrees of freedom
## AIC: 565.54
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.54/538
##  1.02145
model3 <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family = binomial)
summary(model3)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$fullbase +
##     Housing$garagepl, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6629  -0.7436  -0.5295   0.8056   2.4477
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.424e+00  1.409e+00  -4.559 5.14e-06
## Housing$bedrooms                  8.131e-01  4.462e-01   1.822   0.0684
## Housing$bathrms                   1.764e+00  1.061e+00   1.662   0.0965
## Housing$stories                   3.083e-01  1.481e-01   2.082   0.0374
## Housing$price                     4.241e-05  6.106e-06   6.945 3.78e-12
## Housing$recroomyes                1.592e-01  2.860e-01   0.557   0.5778
## Housing$fullbaseyes              -9.523e-02  2.545e-01  -0.374   0.7083
## Housing$garagepl                 -1.394e-03  1.313e-01  -0.011   0.9915
## Housing$bedrooms:Housing$bathrms -6.611e-01  3.095e-01  -2.136   0.0327
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.40  on 537  degrees of freedom
## AIC: 567.4
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.4/537
##  1.023091
Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test.
anova(model, model2, model3, test = "Chi")
## Analysis of Deviance Table ## ## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price ## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$garagepl ## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 540 549.75 ## 2 538 549.54 2 0.20917 0.9007 ## 3 537 549.40 1 0.14064 0.7076
The results of the anova indicate that the models are all essentially the same, as there is no statistical difference between them. The only criterion on which to select a model is the measure of overdispersion. The first model has the lowest rate of overdispersion and so is the best by this criterion. Therefore, determining if a house has air conditioning depends on examining the number of bedrooms and bathrooms simultaneously, as well as the number of stories and the price of the house.
This post explained how to use and interpret GLM in R. GLM can be used primarily for fitting data to distributions that are not normal.
Proportions are a fraction or "portion" of a total amount. For example, if there are ten men and ten women in a room, the proportion of men in the room is 50% (10 / 20). There are times when doing an analysis that you want to evaluate proportions in the data rather than individual measurements such as the mean, correlation, standard deviation, etc.
In this post we will learn how to do a test of proportions using R. We will use the dataset "Default", which is found in the "ISLR" package. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started.
We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function.
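A minimal version of the setup and the call, assuming the ISLR package is installed:

library(ISLR)
data("Default")
table(Default$student)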
##
##   No  Yes
## 7056 2944
We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample in total, which we can do by summing the results from the table. Below is the code.
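The sum can be taken directly from the same table:

sum(table(Default$student))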
##  10000
There are 10000 people in the sample. To determine the proportion of students, we take 2944 / 10000, which equals 0.2944, or 29.44%. Below is the code to calculate this.
table(Default$student) / sum(table(Default$student))
##
##     No    Yes
## 0.7056 0.2944
The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is that 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so, it is better to state specifically what our hypotheses are.

NULL: The proportion of 29.44% of the sample being students is the same as the 50% found in the population.
ALTERNATIVE: The proportion of 29.44% of the sample being students is NOT the same as the 50% found in the population.
Below is the code to complete the z-test.
prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data:  2944 out of 10000, null probability 0.5
## X-squared = 1690.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2855473 0.3034106
## sample estimates:
##      p
## 0.2944
Here is what the code means.

1. prop.test is the function used
2. The first value of 2944 is the total number of students in the sample
3. n = 10000 is the sample size
4. p = 0.5 is the theoretical proportion
5. alternative = "two.sided" means we want a two-tailed test
6. correct = FALSE means we do not want a continuity correction applied to the z-test. This is useful for small sample sizes but not for our sample of 10000
The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proportion of students in our sample is different from a theoretical proportion of 50% in the population.
Below is the same analysis using the binomial exact test.
binom.test(2944, n = 10000, p = 0.5)
##
## Exact binomial test
##
## data:  2944 and 10000
## number of successes = 2944, number of trials = 10000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2854779 0.3034419
## sample estimates:
## probability of success
##                 0.2944
The results are the same. Whether to use "prop.test" or "binom.test" is a major argument among statisticians. The purpose here was to provide an example of the use of both.
In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.
At a bus company, the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this, complete the following.
Calculate the Interval Value
Our first step is to calculate the interval value. This is the range within which 99.7% of the values fall. Finding it requires knowing the mean and the standard deviation, and adding and subtracting three standard deviations from the mean. Below is the code for this.
busStopMean <- 81
busStopSD <- 7.9
busStopMean + 3*busStopSD
##  104.7
busStopMean - 3*busStopSD
##  57.3
The values above mean that we can set our interval between 55 and 110, with 100 fictitious buses in the data. Below is the code to set the interval.
interval <- seq(55, 110, length = 100)   # length here represents 100 fictitious buses
The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.
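Based on the variables defined above, the call would look like this:

densityCurve <- dnorm(interval, mean = busStopMean, sd = busStopSD)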
We will now plot the normal curve of our data using ggplot. First, we need to put our "interval" and "densityCurve" variables in a dataframe, which we will call "normal", and then we will create the plot. Below is the code.
library(ggplot2)
normal <- data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve)) + geom_line() + ggtitle("Number of Stops for Buses")
We now want to determine the probability of a bus having fewer than 65 stops. To do this we use the "pnorm" function in R and include the value 65, along with the mean and standard deviation, and tell R we want the lower tail only. Below is the code for completing this.
pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
##  0.02141744
As you can see, at about 2% it would be unusual for a bus to have fewer than 65 stops. We can also plot this using ggplot. First, we create a cumulative probability curve using the "pnorm" function, combine this with our "interval" variable in a dataframe, and then use this information to make a plot in ggplot2. Below is the code.
CumulativeProb <- pnorm(interval, mean = 81, sd = 7.9, lower.tail = TRUE)
pnormal <- data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb)) + geom_line() + ggtitle("Cumulative Density of Stops for Buses")
Second Probability Problem
We will now calculate the probability of a bus having 93 or more stops. To make it more interesting, we will create a plot that shades the area under the curve for 93 or more stops. The code is a little too complex to explain, so just enjoy the visual.
pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
##  0.06438284
x <- interval
ytop <- dnorm(93, 81, 7.9)
MyDF <- data.frame(x = x, y = densityCurve)
p <- ggplot(MyDF, aes(x, y)) + geom_line() + scale_x_continuous(limits = c(50, 110)) + ggtitle("Probability of 93 Stops or More is 6.4%")
shade <- rbind(c(93, 0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "x"], 0))   # close the shaded region at the right edge of the curve
p + geom_segment(aes(x = 93, y = 0, xend = 93, yend = ytop)) + geom_polygon(data = shade, aes(x, y))
A lot of work was done, but all in a practical manner, looking at a realistic problem. We were able to calculate several different probabilities and graph them accordingly.
In a previous post, we looked at mixed methods and some examples of this design. Mixed methods are focused on combining quantitative and qualitative methods to study a research problem. In this post, we will look at several additional mixed method designs. Specifically, we will look at the following designs: embedded, transformative, and multiphase.
Embedded Design

Embedded design is the simultaneous collection of quantitative and qualitative data with one form of data playing a supportive role to the other. The supportive data augments the conclusions of the main data collection.
The benefit of this design is that it allows one method to lead the analysis while the secondary method provides additional information. For example, quantitative measures are excellent at recording the results of an experiment. Qualitative measures would be useful in determining how participants perceived their experience in the experiment.
A downside to this approach is making sure the secondary method truly supports the overall research. Quantitative and qualitative methods naturally answer different research questions. Therefore, the research questions of a study must be worded in a way that allows for cooperation between qualitative and quantitative methods.
Transformative Design

The transformative design is more of a philosophy than a mixed method design. This design can employ any other mixed method design. The main difference is that transformative designs focus on helping a marginalized population with the goal of bringing about change.
For example, a researcher might do a study of Asian students facing discrimination in a predominately African American high school. The goal of the study would be to document the experiences of the Asian students in order to provide administrators with information on the extent of this problem.
Such a focus on the oppressed draws heavily from Critical Theory, which exposes how oppression takes place through education. The emphasis on change is derived from Dewey and progressivism.
Multiphase Design

Multiphase design is the use of several designs over several studies. This is a massive and supremely complex process. You would need to tie together several different mixed method studies under one general research problem. From this, you can see that this is not a commonly used design.
For example, you may decide to continue doing research into Asian student discrimination at African American high schools. The first study might employ an explanatory design. The second study might employ an exploratory design. The last study might be a transformative design.
After completing all this work, you would need to be able to articulate the experiences with discrimination of the Asian students. This is not an easy task by any means. As such, if and when this design is used, it often requires the teamwork of several researchers.
Mixed method designs require a different way of thinking when it comes to research. The uniqueness of this approach is the combination of qualitative and quantitative methods. This mixing of methods has advantages and disadvantages. The primary point to remember is that the most appropriate design depends on the circumstances of the study.
Mixed methods research involves the combination of qualitative and quantitative approaches to address a research problem. Generally, qualitative and quantitative methods have separate philosophical positions when it comes to how to uncover insights in addressing research questions.
For many, mixed methods have their own philosophical position, which is pragmatism. Pragmatists believe that if it works it’s good. Therefore, if mixed methods lead to a solution it’s an appropriate method to use.
This post will try to explain some of the mixed method designs. Before explaining, it is important to understand that there are several common ways to approach mixed methods: the convergent parallel, explanatory sequential, and exploratory sequential designs.
Convergent Parallel Design
This design involves the simultaneous collecting of qualitative and quantitative data. The results are then compared to provide insights into the problem. The advantage of this design is the quantitative data provides for generalizability while the qualitative data provides information about the context of the study.
However, the challenge is in trying to merge the two types of data. Qualitative and quantitative methods answer slightly different questions about a problem. As such it can be difficult to paint a picture of the results that are comprehensible.
Explanatory Sequential Design

This design puts the emphasis on the quantitative data, with qualitative data playing a secondary role. Normally, the results found in the quantitative data are followed up on in the qualitative part.
For example, if you collect surveys about what students think about college and the results indicate negative opinions, you might conduct an interview with students to understand why they are negative towards college. A Likert survey will not explain why students are negative. Interviews will help to capture why students have a particular position.
The advantage of this approach is the clear organization of the data. Quantitative data is more important. The drawback is deciding what about the quantitative data to explore when conducting the qualitative data collection.
Exploratory Sequential Design

This design is the opposite of the explanatory design. Now the qualitative data is more important than the quantitative. This design is used when you want to understand a phenomenon in order to measure it.
It is common when developing an instrument to interview people in focus groups to understand the phenomenon. For example, if I want to understand what cellphone addiction is I might ask students to share what they think about this in interviews. From there, I could develop a survey instrument to measure cell phone addiction.
The drawback to this approach is the time it consumes. It takes a lot of work to conduct interviews, develop an instrument, and assess the instrument.
Mixed methods are not that new. However, they are still a somewhat unusual approach to research in many fields. Despite this, the approaches of mixed methods can be beneficial depending on the context.
Machine learning is a tool used in analytics for using data to make a decision for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.
This post will provide information on basic types of machine learning which includes predictive models, supervised learning, descriptive models, and unsupervised learning.
Predictive Models and Supervised Learning
Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is most likely to buy a plane ticket or purchase a specific book.
Predictive models are not limited to the future. They can also be used to predict something that has already happened but we are not sure when. For example, data can be collected from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth.
Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it needs to learn and how to do it.
For example, if we want to predict who will be accepted or rejected for a home loan, we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or rejected for the home loan. The supervisors in this example are the features (salary, gender, credit score) used to predict the target feature (home loan).
The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, type of car, etc. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive; at least in machine learning, someone can only be classified as male or female, as current algorithms cannot place a person in both classes.
A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.
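As a minimal sketch of the two kinds of target features in R, using the built-in mtcars data (these variables are stand-ins, not part of the discussion above):

# classification target: transmission type (am) in the built-in mtcars data
class_model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
# numeric prediction target: miles per gallon
num_model <- lm(mpg ~ wt + hp, data = mtcars)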
Descriptive Models and Unsupervised Learning
Descriptive models summarize data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict, there are no supervisors or specific features that are used to predict a target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to the model as to what to do, per se.
Descriptive models are very useful for discovering patterns. For example, one descriptive model analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they would often buy beer for themselves and diapers for their small children. Stores used this information and placed beer and diapers next to each other in the stores. This led to an increase in profits, as men could now find beer and diapers together. This kind of relationship can only be found through machine learning techniques.
The model you use depends on what you want to know. Prediction is for, as you can guess, predicting. With this model, you are not as concerned with relationships as you are with understanding what specifically affects the target feature. If you want to explore relationships, then descriptive models can be of use. Machine learning models are tools that are appropriate for different situations.
Machine learning is about using data to take action. This post will explain common steps that are taken when using machine learning in the analysis of data. In general, there are six steps when applying machine learning.
We will go through each step briefly
Step One, Collecting Data
Data can come from almost anywhere. It can come from a database in a structured format like MySQL. It can also come unstructured, such as tweets collected from Twitter. However you get the data, you need to develop a way to clean and process it so that it is ready for analysis.
There are some distinct terms used in machine learning that those coming from traditional research may not be familiar with.
Step Two, Preparing Data
This is actually the most difficult step in machine learning analysis. It can take up to 80% of the time. With data coming from multiple sources and in multiple formats it is a real challenge to get everything where it needs to be for an analysis.
Missing data, duplicate records, and other issues need to be addressed as part of this process. Once these challenges are dealt with, it is time to explore the data.
Step Three, Explore the Data
Before analyzing the data, it is critical that the data is explored. This is often done visually in the form of plots and graphs, but also with summary statistics. You are looking for insights into the data and the characteristics of different features. You are also looking out for things that might be unusual, such as outliers. There are also times when a variable needs to be transformed because of issues with normality.
Step Four, Training a Model
After exploring the data, you should have an idea of what you want to know if you did not already know. Determining what you want to know helps you to decide which algorithm is most appropriate for developing a model.
To develop a model, we normally split the data into a training and testing set. This allows us to assess the model for its strengths and weaknesses.
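A common way to make such a split in R, sketched here with the built-in iris data as a hypothetical example:

set.seed(42)                                          # for reproducibility
train_index <- sample(nrow(iris), 0.7 * nrow(iris))   # a 70/30 split
train <- iris[train_index, ]
test <- iris[-train_index, ]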
Step Five, Assessing the Model
The strength of the model is determined. Every model has certain biases that limit its usefulness. How to assess a model depends on what type of model is developed and the purpose of the analysis. Whenever suitable, we want to try and improve the model.
Step Six, Improving the Model
Improvement can happen in many ways. You might decide to normalize the variables in a different way. Or you may choose to add or remove features from the model. Perhaps you may switch to a different model.
Success in data analysis involves having a clear path for achieving your goals. The steps presented here provide one way of tackling machine learning.
Logistic regression is used when the dependent variable is categorical with two choices. For example, if we want to predict whether someone will default on their loan, the dependent variable is categorical with two choices: yes they default or no they do not.
Interpreting the output of a logistic regression analysis can be tricky. Basically, you need to interpret the odds ratio. For example, if the results of a study say the odds of default are 40% higher when someone is unemployed, that is an increase in the likelihood of something happening. This is different from probability, which is what we normally use. Odds can take any value from zero to positive infinity, while probability is constrained to be anywhere from 0-100%.
We will now take a look at a simple example of logistic regression in R. We want to calculate the odds of defaulting on a loan. The dependent variable is "default", which can be either yes or no. The independent variables are "student", which can be yes or no, "income", which is how much the person made, and "balance", which is the amount remaining on their credit card.
Below is the coding for developing this model.
The first step is to load the “Default” dataset. This dataset is a part of the “ISLR” package. Below is the code to get started
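A minimal version of that code:

library(ISLR)
data("Default")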
It is always good to examine the data first before developing a model. We do this by using the ‘summary’ function as shown below.
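Based on the output that follows, the call would be:

summary(Default)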
##  default    student       balance           income
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
##                        Median : 823.6   Median :34553
##                        Mean   : 835.4   Mean   :33517
##                        3rd Qu.:1166.3   3rd Qu.:43808
##                        Max.   :2654.3   Max.   :73554
We now need to check our two continuous variables “balance” and “income” to see if they are normally distributed. Below is the code followed by the histograms.
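A minimal sketch of those histogram calls:

hist(Default$income)
hist(Default$balance)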
The 'income' variable looks fine, but there appear to be some problems with 'balance'. To deal with this, we will perform a square root transformation on the 'balance' variable and then examine it again with a histogram. Below is the code.
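Given the variable name used in the model below, the transformation would look something like this:

Default$sqrt_balance <- sqrt(Default$balance)
hist(Default$sqrt_balance)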
As you can see this is much better looking.
We are now ready to make our model and examine the results. Below is the code.
Credit_Model <- glm(default ~ student + sqrt_balance + income, family = binomial, data = Default)
summary(Credit_Model)
##
## Call:
## glm(formula = default ~ student + sqrt_balance + income, family = binomial,
##     data = Default)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2656  -0.1367  -0.0418  -0.0085   3.9730
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.938e+01  8.116e-01 -23.883  < 2e-16 ***
## studentYes   -6.045e-01  2.336e-01  -2.587  0.00967 **
## sqrt_balance  4.438e-01  1.883e-02  23.567  < 2e-16 ***
## income        3.412e-06  8.147e-06   0.419  0.67538
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1574.8  on 9996  degrees of freedom
## AIC: 1582.8
##
## Number of Fisher Scoring iterations: 9
The results indicate that the variables 'student' and 'sqrt_balance' are significant. However, 'income' is not significant. What all this means in simple terms is that being a student and having a balance on your credit card influence the odds of going into default, while your income makes no difference. Unlike multiple regression coefficients, the logistic coefficients require a transformation in order to interpret them. The statistical reason for this is somewhat complicated. As such, below is the code to interpret the logistic regression coefficients.
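The transformation is to exponentiate the coefficients, which turns them into odds ratios:

exp(coef(Credit_Model))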
##  (Intercept)   studentYes sqrt_balance       income
## 3.814998e-09 5.463400e-01 1.558568e+00 1.000003e+00
To explain this as simply as possible, you subtract 1 from each exponentiated coefficient to determine the percentage change in the odds. For example, if a person is a student, the odds of them defaulting are about 45% lower (0.546 - 1 = -0.45) than a non-student's when controlling for balance and income. Furthermore, for every 1 unit increase in the square root of the balance, the odds of default go up by about 56% when controlling for student status and income. Naturally, speaking in terms of a 1 unit increase in the square root of anything is confusing. However, we had to transform the variable in order to improve normality.
Logistic regression is one approach for predicting and modeling that involves a categorical dependent variable. Although the details are a little confusing, this approach is valuable when doing an analysis.
Sometimes the data that needs to be analyzed is not normally distributed. This makes it difficult to make inferences based on the results because one of the main assumptions of parametric statistical tests, such as ANOVA and the t-test, is a normal distribution of the data.
Fortunately, for nearly every parametric test there is a non-parametric counterpart. Non-parametric tests are tests that make no assumptions about the normality of the data. This means that non-normal data can still be analyzed with a certain measure of confidence in the results.
This post will look at a non-parametric test that is used to test the difference in means. For three or more groups, we use the Kruskal-Wallis Test. The Kruskal-Wallis Test is the non-parametric version of ANOVA.
We are going to use the “ISLR” package available in R to demonstrate the use of the Kruskal-Wallis test. After downloading this package, you need to load the “Auto” data. Below is the code to do all of this.
install.packages('ISLR')
library(ISLR)
data(Auto)
We now need to examine the structure of the dataset. This is done with the “str” function. Below is the code followed by the results.
str(Auto)
'data.frame':	392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
So we have 9 variables. We first need to find out if any of the continuous variables are non-normal, because this indicates that the Kruskal-Wallis test is needed. We will look at a histogram of the ‘displacement’ variable to see if it is normally distributed. Below is the code followed by the histogram.
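Something like the following will produce the histogram:

hist(Auto$displacement, main = "Displacement", xlab = "Displacement")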
This does not look normally distributed. We now need a factor variable with 3 or more groups. We are going to use the ‘origin’ variable. This variable indicates where the car was made: 1 = America, 2 = Europe, and 3 = Japan. However, this variable is currently a numeric variable. We need to change it into a factor variable. Below is the code for this.
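A one-line sketch of the conversion:

Auto$origin <- as.factor(Auto$origin)   # numeric to factor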
We will now use the Kruskal-Wallis test. The question we have is “is there a difference in displacement based on the origin of the vehicle?” The code for the analysis is below followed by the results.
> kruskal.test(displacement ~ origin, data = Auto)

	Kruskal-Wallis rank sum test

data:  displacement by origin
Kruskal-Wallis chi-squared = 201.63, df = 2, p-value < 2.2e-16
Based on the results, we know there is a difference among the groups. However, just like with ANOVA, we do not know where. We have to do a post-hoc test in order to determine where the difference in means is among the three groups.
To do this we need to install a new package and do a new analysis. We will download the “PMCMR” package and run the code below.
install.packages('PMCMR')
library(PMCMR)
data(Auto)
attach(Auto)
posthoc.kruskal.nemenyi.test(x=displacement, g=origin, dist='Tukey')
Here is what we did. We installed and loaded the ‘PMCMR’ package, attached the ‘Auto’ data, and ran the Nemenyi post-hoc test with ‘displacement’ as the response, ‘origin’ as the grouping variable, and the Tukey distribution for the comparisons.
Below are the results
	Pairwise comparisons using Tukey and Kramer (Nemenyi) test
	with Tukey-Dist approximation for independent samples

data:  displacement and origin

  1       2   
2 3.4e-14 -   
3 < 2e-16 0.51

P value adjustment method: none

Warning message:
In posthoc.kruskal.nemenyi.test.default(x = displacement, g = origin, :
  Ties are present, p-values are not corrected.
The results are listed in a table. When a comparison was made between groups 1 and 2, the results were significant (p < 0.0001). The same is true when groups 1 and 3 are compared (p < 0.0001). However, there was no difference between groups 2 and 3 (p = 0.51).
Do not worry about the warning message; the p-values can be corrected for ties if necessary.
Perhaps you are wondering what the actual mean for each group is. Below is the code with the results.
> aggregate(Auto[, 3], list(Auto$origin), mean)
  Group.1        x
1       1 247.5122
2       2 109.6324
3       3 102.7089
Cars made in America have an average displacement of 247.51, while cars from Europe and Japan have average displacements of 109.63 and 102.71, respectively. Below is the code for the boxplot followed by the graph.
boxplot(displacement ~ origin, data=Auto, ylab='Displacement', xlab='Origin')
title('Car Displacement')
This post provided an example of the Kruskal-Wallis test. This test is useful when the data is not normally distributed. The main problem with this test is that it is less powerful than an ANOVA. However, this is a problem with most non-parametric tests when compared to parametric tests.
Processes serve the purpose of providing people with clear step-by-step procedures to accomplish a task. In many ways, a process serves as a shortcut to solving a problem. Since data mining is a complex endeavor with an endless number of potential problems, several processes have been developed for completing a data mining project. In this post, we will look at the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The CRISP-DM is an iterative process that has the following six steps: (1) organizational understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment.
We will look at each step briefly.
1. Organizational Understanding

Step 1 involves assessing the current goals of the organization and the current context. This information is then used in deciding goals or research questions for data mining. Data mining needs to be done with a sense of purpose and not just to see what’s out there. Organizational understanding is similar to the introduction section of a research paper, in which you often include the problem, the questions, and even the intended audience of the research.
2. Data Understanding
Once a purpose and questions have been developed for data mining, it is necessary to determine what it will take to answer the questions. Specifically, the data scientist assesses the data requirements, description, collection, and assesses data quality. In many ways, data understanding is similar to the methodology of a standard research paper in which you assess how you will answer the research questions.
It is particularly common to go back and forth between steps one and two. Organizational understanding influences data understanding, which in turn influences organizational understanding.
3. Data Preparation
Data preparation involves cleaning the data. Another term for this is data munging. This is the main part of an analysis in data mining. Often the data comes in a very messy way, with information spread all over the place incoherently. This requires the researcher to deal with the problem carefully.
4. Modeling

A model provides a numerical explanation of something in the data. How this is done depends on the type of analysis that is used. As you develop various models, you are arriving at various answers to your research questions. It is common to move back and forth between steps 3 and 4, as the preparation affects the modeling, and the type of modeling you may want to develop may influence data preparation. The results of this step can also be seen as being similar to the results section of an empirical paper.
5. Evaluation

Evaluation is about comparing the results of the study with the original questions. In many ways, it is about determining the appropriateness of the answers to the research questions. This experience leads to ideas for additional research. As such, this step is similar to the discussion section of a research paper.
6. Deployment

The last step is when the results are actually used for decision-making or action. If the results indicate that a company should target people under 25, then, as an example, this is what they do.
The CRISP-DM process is a useful way to begin the data mining experience. The primary goal of data mining is providing evidence for making decisions and or taking action. This end goal has shaped the development of a clear process for taking action.
Dealing with large amounts of data has been a problem throughout most of human history. Ancient civilizations had to keep large amounts of clay tablets, papyrus, steles, parchments, scrolls etc. to keep track of all the details of an empire.
However, whenever it seemed as though there would be no way to hold any more information, a new technology would be developed to alleviate the problem. When people could not handle keeping records on stone, paper scrolls were invented. When scrolls were no longer practical, books were developed. When hand-copying books was too much, the printing press came along.
By the mid 20th century there were concerns that we would not be able to have libraries large enough to keep all of the books that were being developed. With this problem came the solution of the computer. One computer could hold the information of several dozen if not hundreds of libraries.
Now even a single computer can no longer cope with all of the information that is constantly being developed for just a single subject. This has led to computers working together in networks to share the task of storing information. With data spread across several computers, analyzing it becomes much more challenging. It is now necessary to mine for useful information the way people used to mine for gold in the 19th century.
Big data is data that is too large to fit within the memory of a single computer. Analyzing data that is spread across a network of databases takes skills different from traditional statistical analysis. This post will explain some of the characteristics of big data as well as data mining.
Big Data Traits
The three main traits of big data are volume, velocity, and variety. Volume describes the size of big data, which means data too big to be on only one computer. Velocity is about how fast the data can be processed. Lastly, variety refers to the different types of data. Common sources of big data include the following.
Data mining is the process of discovering a model in a big dataset. Through the development of an algorithm, we can find specific information that helps us to answer our questions. Generally, there are two ways to mine data and these are extraction and summarization.
Extraction is the process of pulling specific information from a big dataset. For example, if we want all the addresses of people who bought a specific book from Amazon the result would be an extraction from a big data set.
Summarization is reducing a dataset in order to describe it. We might do a cluster analysis in which similar data is grouped on a characteristic. For example, if we analyze all the books people ordered through Amazon last year, we might notice that one cluster of customers buys mostly religious books while another buys investment books.
Big data will only continue to get bigger. Currently, the response has been to simply use more computers and servers. As such, there is now a need for finding information across many computers and servers. This is the purpose of data mining, which is to find pertinent information that answers stakeholders’ questions.
Decision trees are useful for splitting data into smaller, distinct groups based on criteria you establish. This post will attempt to explain how to develop decision trees in R.
We are going to use the ‘Wage’ dataset found in the “ISLR” package, which contains the education, age, and wage variables used below. Once you load the package, you need to split the data into a training and testing set as shown in the code below. We want to divide the data based on education level, age, and wage.
library(ISLR)
library(ggplot2)
library(caret)
data("Wage")
inTrain <- createDataPartition(y=Wage$education, p=0.7, list=FALSE)
trainingset <- Wage[inTrain, ]
testingset <- Wage[-inTrain, ]
Visualize the Data
We will now make a plot of the data based on education as the groups and age and wage as the x and y variable. Below is the code followed by the plot. Please note that education is divided into 5 groups as indicated in the chart.
qplot(age, wage, colour=education, data=trainingset)

Create the Model
We are now going to develop the model for the decision tree. We will use age and wage to predict education as shown in the code below.
TreeModel <- train(education ~ age + wage, method='rpart', data=trainingset)
Create Visual of the Model
We now need to create a visual of the model. This involves installing the package called ‘rattle’. You can install ‘rattle’ separately yourself. After doing this, below is the code for the tree diagram.
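A sketch of that code, assuming ‘rattle’ installs cleanly; ‘fancyRpartPlot’ draws the rpart tree stored in the caret model’s ‘finalModel’ slot:

library(rattle)
fancyRpartPlot(TreeModel$finalModel)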
The chart shows the sequence of splits the model makes and the predicted education level in each leaf.
You can predict individual values in the dataset by using the ‘predict’ function with the test data as shown in the code below.
predict(TreeModel, newdata = testingset)
Prediction trees are a useful tool in data analysis for determining how well data can be divided into subsets. They also provide a visual of how to move through data sequentially based on characteristics in the data.
It is common in machine learning to look at the training set of your data visually. This helps you to decide what to do as you begin to build your model. In this post, we will make several different visual representations of data using datasets available in several R packages.
Once the ‘ISLR’, ‘ggplot2’, and ‘caret’ packages are installed in R, you want to look at a summary of the variables using the ‘summary’ function as shown below.
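Presumably the calls are:

library(ISLR)
data("College")
summary(College)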
You should get a printout of information about 18 different variables. Based on this printout, we want to explore the relationship between graduation rate “Grad.Rate” and student to faculty ratio “S.F.Ratio”. This is the objective of this post.
Next, we need to create a training and testing dataset. Below is the code to do this.
> library(ISLR); library(ggplot2); library(caret)
> data("College")
> PracticeSet <- createDataPartition(y=College$Enroll, p=0.7, list=FALSE)
> trainingSet <- College[PracticeSet, ]
> testSet <- College[-PracticeSet, ]
> dim(trainingSet); dim(testSet)
[1] 545  18
[1] 232  18
The explanation behind this code was covered in predicting with caret so we will not explain it again. You just need to know that the dataset you will use for the rest of this post is called “trainingSet”.
Developing a Plot
We now want to explore the relationship between graduation rates and the student-to-faculty ratio. We will be using the ‘ggplot2’ package to do this. Below is the code for this followed by the plot.
qplot(S.F.Ratio, Grad.Rate, data=trainingSet)
As you can see, there appears to be a negative relationship between student-faculty ratio and graduation rate. In other words, as the ratio of students to faculty increases, there is a decrease in the graduation rate.
Next, we will color the plots on the graph based on whether they are a public or private university to get a better understanding of the data. Below is the code for this followed by the plot.
> qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
It appears that private colleges usually have lower student-to-faculty ratios and also higher graduation rates than public colleges.
Add Regression Line
We will now plot the same data but will add a regression line. This will provide us with a visual of the slope. Below is the code followed by the plot.
> collegeplot <- qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
> collegeplot + geom_smooth(method = 'lm', formula = y ~ x)
Most of this code should be familiar to you. We saved the plot as the variable ‘collegeplot’. In the second line of code, we add specific coding for ‘ggplot2’ to add the regression line. ‘lm’ means linear model and formula is for creating the regression.
Cutting the Data
We will now divide the data based on the student-faculty ratio into three equally sized groups to look for additional trends. To do this you need the “Hmisc” package. Below is the code followed by the table.
> library(Hmisc)
> divide_College <- cut2(trainingSet$S.F.Ratio, g=3)
> table(divide_College)
divide_College
[ 2.9,12.3) [12.3,15.2) [15.2,39.8] 
        185         179         181 
Our data is now divided into three equal sizes.
Lastly, we will make a box plot with our three equal size groups based on student-faculty ratio. Below is the code followed by the box plot
> CollegeBP <- qplot(divide_College, Grad.Rate, data=trainingSet, fill=divide_College, geom=c("boxplot"))
> CollegeBP
As you can see, the negative relationship continues even when the student-faculty ratio is divided into three equally sized groups. However, our information about private and public colleges is missing. To fix this we need to make a table as shown in the code below.
> CollegeTable <- table(divide_College, trainingSet$Private)
> CollegeTable
divide_College  No Yes
   [ 2.9,12.3)  14 171
   [12.3,15.2)  27 152
   [15.2,39.8] 106  75
This table tells you how many public and private colleges there are in each of the three student-faculty ratio groups. We can also get proportions by using the following.
> prop.table(CollegeTable, 1)
divide_College         No        Yes
   [ 2.9,12.3) 0.07567568 0.92432432
   [12.3,15.2) 0.15083799 0.84916201
   [15.2,39.8] 0.58563536 0.41436464
In this post, we found that there is a negative relationship between student-faculty ratio and graduation rate. We also found that private colleges have a lower student-faculty ratio and a higher graduation rate than public colleges. In other words, the status of a university as public or private moderates the relationship between student-faculty ratio and graduation rate.
You can probably tell by now that R can be a lot of fun with some basic knowledge of coding.
Prediction is one of the key concepts of machine learning. Machine learning is a field of study that is focused on the development of algorithms that can be used to make predictions.
Anyone who has shopped online has experienced machine learning. When you make a purchase at an online store, the website will recommend additional purchases for you to make. Often these recommendations are based on whatever you have purchased or whatever you click on while at the site.
There are two common forms of machine learning: unsupervised and supervised learning. Unsupervised learning involves using data that is not cleaned and labeled, and attempts are made to find patterns within the data. Since the data is not labeled, there is no indication of what is right or wrong.
Supervised machine learning uses cleaned and properly labeled data. Since the data is labeled, there is some indication of whether the model that is developed is accurate or not. If the model is incorrect, then you need to make adjustments to it. In other words, the model learns based on its ability to accurately predict results. However, it is up to the human to make adjustments to the model in order to improve its accuracy.
In this post, we will look at using R for supervised machine learning. The definition presented so far will make more sense with an example.
We are going to make a simple prediction about whether emails are spam or not using data from the “kernlab” package.
The first thing that you need to do is to install and load the “kernlab” package using the following code.
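A minimal version of that code:

install.packages('kernlab')
library(kernlab)
data(spam)   # the spam email dataset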
If you use the “View” function to examine the data you will see that there are several columns. Each column tells you the frequency of a word that kernlab found in a collection of emails. We are going to use the word/variable “money” to predict whether an email is spam or not. First, we need to plot the density of the use of the word “money” when the email was not coded as spam. Below is the code for this.
plot(density(spam$money[spam$type=="nonspam"]), col='blue',main="", xlab="Frequency of 'money'")
This is an advanced R post, so I am assuming you can read the code. The plot should look like the following.
As you can see, ‘money’ is not used very frequently in emails that are not spam in this dataset. However, you really cannot say this unless you compare how often ‘money’ appears in nonspam emails to how often it appears in spam emails. To learn this we need to add a second line that shows when the word ‘money’ is used in emails classified as spam. The code for this is below, with the prior code included.
plot(density(spam$money[spam$type=="nonspam"]), col='blue', main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
Your new plot should look like the following
If you inspect the plot closely, the point where the blue line for nonspam and the red line for spam separate is the cutoff for whether an email is spam or not. In other words, everything inside the arc is labeled correctly, while the information outside the arc is not.
The next code and graph show that this cutoff point is around 0.1. This means that any email that has on average more than 0.1 frequency of the word ‘money’ is spam. Below is the code and the graph with the cutoff point indicated by a black line.
plot(density(spam$money[spam$type=="nonspam"]), col='blue', main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
abline(v=0.1, col="black", lwd=3)
Now we need to calculate the accuracy of using the word ‘money’ to predict spam. For our current example, we will simply use the “ifelse” function: if the frequency of ‘money’ is greater than 0.1, the email is classified as spam.
We then need to make a table to see the results. The code for the “ifelse” function and the table are below followed by the table.
predict <- ifelse(spam$money > 0.1, "spam", "nonspam")
table(predict, spam$type)/length(spam$type)
predict       nonspam        spam
  nonspam 0.596392089 0.266898500
  spam    0.009563138 0.127146273
Based on the table, our model accurately classifies emails about 72% (0.596 + 0.127) of the time based on whether the frequency of the word ‘money’ is greater than 0.1.
Of course, for this to be true machine learning we would repeat this process by trying to improve the accuracy of the prediction. However, this is an adequate introduction to this topic.
Survey design is used to describe the opinions, beliefs, behaviors, and/or characteristics of a population based on the results of a sample. This design involves the use of surveys that include questions, statements, and/or other ways of soliciting information from the sample. This design is used primarily for descriptive purposes but can at times be combined with other designs (correlational, experimental) as well. In this post, we will look at the following.
Types of Survey Design
There are two common forms of survey design which are cross-sectional and longitudinal. A cross-sectional survey design is the collection of data at one specific point in time. Data is only collected once in a cross-sectional design.
A cross-sectional design can be used to measure opinions/beliefs, compare two or more groups, evaluate a program, and or measure the needs of a specific group. The main goal is to analyze the data from a sample at a given moment in time.
A longitudinal design is similar to a cross-sectional design, with the difference being that longitudinal designs require collection over time. Longitudinal studies involve cohorts and panels in which data is collected over days, months, years, and even decades. Through doing this, a longitudinal study is able to expose trends over time in a sample.
Characteristics of Survey Design
There are certain traits that are associated with survey design. Questionnaires and interviews are a common component of survey design. The questionnaires can happen by mail, phone, internet, and in person. Interviews can happen by phone, in focus groups, or one-on-one.
The design of a survey instrument often includes personal, behavioral and attitudinal questions and open/closed questions.
Another important characteristic of survey design is monitoring the response rate. The response rate is the percentage of participants in the study compared to the number of surveys that were distributed. The response rate varies depending on how the data was collected. Normally, personal interviews have the highest rate while email requests have the lowest.
It is sometimes necessary to report the response rate when trying to publish. As such, you should at the very least be aware of what the rate is for a study you are conducting.
Surveys are used to collect data at one point in time or over time. The purpose of this approach is to develop insights into the population in order to describe what is happening or to be used to make decisions and inform practice.
One of the strongest points of R, in the opinion of many, is its various features for creating graphs and other visualizations of data. In this post, we begin to look at using the various visualization features of R. Specifically, we are going to do the following.
The ‘plot’ function is one of the basic options for graphing data. We are going to go through an example using the ‘islands’ dataset that comes with the R software. The ‘islands’ dataset contains data on the land mass of different islands. We want to plot the land mass of the seven largest islands. Below is the code for doing this.
islandgraph<-head(sort(islands, decreasing=TRUE), 7)
plot(islandgraph, main = "Land Area", ylab = "Square Miles")
text(islandgraph, labels=names(islandgraph), adj=c(0.5,1))
Here is what we did. We sorted the ‘islands’ data in decreasing order and used the ‘head’ function to keep the seven largest values. We then plotted these values with a title and a y-axis label, and used the ‘text’ function to label each point with the island’s name.
Below is what the graph should look like.
Changing Point Color and Shape in a Graph
For visual purposes, it may be beneficial to manipulate the color and appearance of several data points in a graph. To do this, we are going to use the ‘faithful’ dataset in R. The ‘faithful’ dataset indicates the length of eruption time and how long people had to wait for the eruption. The first thing we want to do is plot the data using the “plot” function.
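That first plot can be produced with:

plot(faithful)   # eruptions vs. waiting time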
As you see the data, there are two clear clusters. One contains data from 1.5-3 and the second cluster contains data from 3.5-5. To help people see this distinction, we are going to change the color and shape of the data points in the 1.5-3 range. Below is the code for this.
eruption_time<-with(faithful, faithful[eruptions < 3, ])
points(eruption_time, col = "blue", pch = 24)
Here is what we did. We created the variable ‘eruption_time’, which holds only the rows of ‘faithful’ where the eruption lasted less than 3 minutes. We then used the ‘points’ function to redraw those data points in blue with triangle markers (pch = 24).
In this post, we learned how to make a basic plot and how to change the color and shape of selected data points.
In this post, we will look at how to perform a simple regression using R. We will use a built-in dataset in R called ‘mtcars.’ There are several variables in this dataset but we will only use ‘wt’ which stands for the weight of the cars and ‘mpg’ which stands for miles per gallon.
Developing the Model
We want to know the association or relationship between the weight of a car and miles per gallon. We want to see how much of the variance of ‘mpg’ is explained by ‘wt’. Below is the code for this.
> Carmodel <- lm(mpg ~ wt, data = mtcars)
Here is what we did
Seeing the Results
Once you press enter, you will probably notice that nothing happens. The model ‘Carmodel’ was created, but the results have not been displayed. Below is some of the information you can extract from your model.
The ‘summary’ function is useful for pulling most of the critical information. Below is the code for the output.
> summary(Carmodel)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
We are not going to explain the details of the output (please see simple regression). The results indicate a model that explains 75% of the variance, or has an r-squared of 0.75. The model is statistically significant, and the equation of the model is y = 37.29 - 5.34x.
Plotting the Model
We can also plot the model using the code below
> coef_Carmodel <- coef(Carmodel)
> plot(mpg ~ wt, data = mtcars)
> abline(a = coef_Carmodel[1], b = coef_Carmodel[2])
Here is what we did. We saved the model’s coefficients in ‘coef_Carmodel’, plotted ‘mpg’ against ‘wt’, and then used ‘abline’ with the intercept and slope from the coefficients to draw the regression line.
From the visual, you can see that as weight increases there is a decrease in miles per gallon.
R is capable of much more complex models than the simple regression used here. However, understanding the coding for simple modeling can help in preparing you for much more complex forms of statistical modeling.
Within-group experimental design is the use of only one group in an experiment. This is in contrast to a between-group design, which involves two or more groups. Within-group design is useful when the number of participants is too low to split them into different groups.
There are two common forms of within-group experimental design: time series and repeated measures. Under time series there are interrupted time series and equivalent time series designs. Under repeated measures, there is only the repeated measures design. In this post, we will look at the following forms of within-group experimental design.
Interrupted Time Series Design
Interrupted time series design involves several pre-tests, followed by an intervention, and then several post-tests of one group. By measuring several times, many threats to internal validity are reduced, such as regression, maturation, and selection. The pre-test results are also used as covariates when analyzing the post-tests.
Equivalent Time Series Design
Equivalent time series design involves the use of a measurement followed by intervention followed by measurement etc. In many ways, this design is a repeated post-test only design. The primary goal is to plot the results of the post-test and determine if there is a pattern that develops over time.
For example, if you are tracking the influence of blog writing on vocabulary acquisition, the intervention is blog writing and the dependent variable is vocabulary acquisition. As the students write a blog, you measure them several times over a certain period. If a plot indicates an upward trend you could infer that blog writing made a difference in vocabulary acquisition.
Repeated Measures Design

Repeated measures is the use of several different treatments over time. Before each treatment, the group is measured. Each post-test is compared to the other post-tests to determine which treatment was the best.
For example, let’s say that you still want to assess vocabulary acquisition but want to see how blog writing and public speaking affect it. First, you measure vocabulary acquisition. Next, you employ the first intervention (blog writing) followed by a second assessment of vocabulary acquisition. Third, you use the public speaking intervention followed by a third assessment of vocabulary acquisition. You now have three sets of data to compare.
Within-group experimental designs are used when it is not possible to have several groups in an experiment. The benefits include needing fewer participants. However, one problem with this approach is the need to measure several times which can be labor intensive.
Analysis of variance (ANOVA) is used when you want to see if there is a difference in the means of three or more groups due to some form of treatment(s). In this post, we will look at conducting an ANOVA calculation using R.
We are going to use a dataset that is already built into R called ‘InsectSprays.’ This dataset contains information on different insecticides and their ability to kill insects. What we want to know is which insecticide was the best at killing insects.
In the dataset ‘InsectSprays’, there are two variables: ‘count’, which is the number of dead bugs, and ‘spray’, which is the spray that was used to kill the bugs. For the ‘spray’ variable there are six types labeled A-F. There are 72 total observations for the six types of spray, which comes to 12 observations per spray.
Building the Model
The code for calculating the ANOVA is below
> BugModel <- aov(count ~ spray, data=InsectSprays)
> BugModel
Call:
   aov(formula = count ~ spray, data = InsectSprays)

Terms:
                   spray Residuals
Sum of Squares  2668.833  1015.167
Deg. of Freedom        5        66

Residual standard error: 3.921902
Estimated effects may be unbalanced
Here is what we did. We used the ‘aov’ function to create the ANOVA model ‘BugModel’, with ‘count’ as the dependent variable and ‘spray’ as the independent variable.
Now we need to see if there are any significant results. To do this we will use the ‘summary’ function as shown in the script below
> summary(BugModel)
            Df Sum Sq Mean Sq F value Pr(>F)    
spray        5   2669   533.8    34.7 <2e-16 ***
Residuals   66   1015    15.4                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results indicate that there are significant results in the model as shown by the p-value being essentially zero (Pr(>F)). In other words, there is at least one mean that is different from the other means statistically.
We need to see what the means are overall for all sprays and for each spray individually. This is done with the following script
> model.tables(BugModel, type = 'means')
Tables of means
Grand mean

9.5 

 spray 
spray
     A      B      C      D      E      F 
14.500 15.333  2.083  4.917  3.500 16.667 
The ‘model.tables’ function tells us the means overall and for each spray. As you can see, it appears spray F is the most efficient at killing bugs with a mean of 16.667.
However, this table does not indicate statistical significance. For this we need to conduct a post-hoc Tukey test. This test will determine which means are significantly different from the others. Below is the script.
> BugSpraydiff <- TukeyHSD(BugModel)
> BugSpraydiff
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = count ~ spray, data = InsectSprays)

$spray
           diff        lwr       upr     p adj
B-A   0.8333333  -3.866075  5.532742 0.9951810
C-A -12.4166667 -17.116075 -7.717258 0.0000000
D-A  -9.5833333 -14.282742 -4.883925 0.0000014
E-A -11.0000000 -15.699409 -6.300591 0.0000000
F-A   2.1666667  -2.532742  6.866075 0.7542147
C-B -13.2500000 -17.949409 -8.550591 0.0000000
D-B -10.4166667 -15.116075 -5.717258 0.0000002
E-B -11.8333333 -16.532742 -7.133925 0.0000000
F-B   1.3333333  -3.366075  6.032742 0.9603075
D-C   2.8333333  -1.866075  7.532742 0.4920707
E-C   1.4166667  -3.282742  6.116075 0.9488669
F-C  14.5833333   9.883925 19.282742 0.0000000
E-D  -1.4166667  -6.116075  3.282742 0.9488669
F-D  11.7500000   7.050591 16.449409 0.0000000
F-E  13.1666667   8.467258 17.866075 0.0000000
There is a lot of information here. To make things easy, wherever the p adj is less than 0.05, there is a difference between those two means. For example, bug sprays F and E have a difference of 13.17 with a p adj of zero, so these two means are really different statistically. This chart also includes the lower and upper bounds of the confidence interval.
The results can also be plotted with the script below
> plot(BugSpraydiff, las=1)
Below is the plot
ANOVA is used to calculate if there is a difference of means among three or more groups. This analysis can be conducted in R using various scripts and codes.
Key components of qualitative research include hermeneutics and phenomenology. This post will examine these two terms and their role in qualitative research.
Hermeneutics is essentially a method of interpretation of a text. The word hermeneutics comes from Hermes, the Greek messenger god. As such, at least for the ancient Greeks, there was a connection between interpreting and serving as a messenger. Today, this term is most commonly associated with theology, such as biblical hermeneutics.
In relation to biblical hermeneutics, Augustine (354-430) developed a process of hermeneutics that was iterative. Through studying the Bible and the meaning of one’s own interpretations of the Bible, a person can understand divine truth. There was no need to look at the context, history, or anything else. Simply the Word and your interpretation of it.
In the 17th century, the Dutch philosopher Spinoza expanded on Augustine’s view of hermeneutics by stating that the text, its historical context, and even the author of a text, should be studied to understand the text. In other words, text plus context leads to truth.
By combining Augustine’s view of the role of the individual in hermeneutics with Spinoza’s contribution of context, we arrive at how interpretation happens in qualitative research.
In qualitative research, data interpretation (aka hermeneutics) involves the individual’s interpretation combined with the context that the data comes from. Both the personal interpretation and the context of the data influence each other.
The developments in hermeneutics led to the development of the philosophy called phenomenology. Phenomenology states that a phenomenon can only be understood subjectively (from a certain viewpoint) and intuitively (through thinking and finding hidden meaning).
In phenomenology, interpretation happens through describing events, analyzing an event, and by connecting a current experience to another one or by finding similarities among distinct experiences.
For a phenomenologist, there is a constant work of reducing several experiences into abstract constructs through an inductive approach. This is a form of theory building that is connected with several forms of qualitative research, such as grounded theory.
Hermeneutics has played an important role in qualitative research by influencing the development of phenomenology. The study of a phenomenon is for the purpose of seeing how context will influence interpretation.
In experimental research, there are two common designs. They are between and within group design. The difference between these two groups of designs is that between group involves two or more groups in an experiment while within group involves only one group.
This post will focus on between group designs. We will look at the following forms of between group design…
A true experiment is one in which the participants are randomly assigned to different groups. In a quasi-experiment, the researcher is not able to randomly assign participants to different groups.
Random assignment is important in reducing many threats to internal validity. However, there are times when a researcher does not have control over this, such as when they conduct an experiment at a school where classes have already been established. In general, a true experiment is always considered methodologically superior to a quasi-experiment.
Whether the experiment is a true experiment or a quasi-experiment, there are always two groups that are compared in the study. One group is the control group, which does not receive the treatment. The other group is called the experimental group, which receives the treatment of the study. It is possible to have more than two groups and several treatments, but the minimum for between-group designs is two groups.
Another characteristic that true and quasi-experiments have in common is the type of formats that the experiment can take. There are two common formats
A pre- and post-test design involves measuring the groups of the study before the treatment and after the treatment. The desire normally is for the groups to be the same before the treatment and to be statistically different after the treatment. The reason for them being different is, at least hopefully, the treatment.
For example, let’s say you have some bushes and you want to see if the fertilizer you bought makes any difference in the growth of the bushes. You divide the bushes into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You measure the height of the bushes before the experiment to be sure they are the same. Then, you apply the fertilizer to the experimental group, and after a period of time, you measure the heights of both groups again. If the fertilized bushes grow taller than the control group, you can infer that it is because of the fertilizer.
Post-test only design is when the groups are measured only after the treatment. For example, let’s say you have some corn plants and you want to see if the fertilizer you bought makes any difference in the amount of corn produced. You divide the corn plants into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You apply the fertilizer to the experimental group, and after a period of time, you measure the amount of corn produced by both groups. If the fertilized corn produces more, you can infer that it is because of the fertilizer. You never measure the corn beforehand because the plants had not produced any corn yet.
Factorial design involves the use of more than one treatment. Returning to the corn example, let’s say you want to see not only how fertilizer affects corn production but also how the amount of water the corn receives affects production as well.
In this example, you are trying to see if there is an interaction effect between fertilizer and water. When water and fertilizer are increased does production increase, is there no increase, or if one goes up and the other goes down does that have an effect?
Between-group designs such as true and quasi-experiments provide a way for researchers to establish cause and effect. Pre- and post-test formats are employed, as well as factorial designs, to establish relationships between variables.
There are times when conducting research that you want to know if there is a difference in categorical data. For example, is there a difference in the number of men who have blue eyes and the number who have brown eyes? Or is there a relationship between gender and hair color? In other words, is there a difference in the count of a particular characteristic, or is there a relationship between two or more categorical variables?
For our example, we are going to use data that is already available in R called “HairEyeColor”. Below is the data
> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
As you can see, the data comes in the form of a list and shows hair and eye color for men and women in separate tables. The current data is unusable for us in terms of calculating differences. However, by using the ‘margin.table’ function we can make the data usable, as shown in the example below.
> HairEyeNew <- margin.table(HairEyeColor, margin = c(1,2))
> HairEyeNew
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
Here is what we did. We created the variable ‘HairEyeNew’ and stored the information from ‘HairEyeColor’ in one table using the ‘margin.table’ function. Setting the margin to c(1,2) collapses the table over sex, keeping the hair and eye color dimensions.
Now all our data from the list is combined into one table.
We now want to see if there is a particular relationship between hair and eye color that is more common. To do this, we calculate the chi-square statistic as in the example below.
> chisq.test(HairEyeNew)

	Pearson's Chi-squared test

data:  HairEyeNew
X-squared = 138.29, df = 9, p-value < 2.2e-16
The test tells us that one or more of the relationships are more common than others within the table. To determine which relationship between hair and eye color is more common than the rest we will calculate the proportions for the table as seen below.
> HairEyeNew/sum(HairEyeNew)
       Eye
Hair          Brown        Blue       Hazel       Green
  Black 0.114864865 0.033783784 0.025337838 0.008445946
  Brown 0.201013514 0.141891892 0.091216216 0.048986486
  Red   0.043918919 0.028716216 0.023648649 0.023648649
  Blond 0.011824324 0.158783784 0.016891892 0.027027027
As you can see from the table, brown hair and brown eyes are the most common combination (0.20 or 20%), followed by blond hair and blue eyes (0.16 or 16%).
The chi-square test serves to determine differences among categorical data. This tool is useful for calculating the potential relationships among non-continuous variables.
Epistemology is the study of the nature of knowledge. It deals with such questions as whether there is truth or absolute truth and whether there is one way or many ways to see something. In research, epistemology manifests itself in several views. The two extremes are positivism and interpretivism.
Positivism asserts that all truth can be verified and proven scientifically and can be measured and/or observed. This position discounts religious revelation as a source of knowledge, as it cannot be verified scientifically. The positivist position is also derived from realism in that there is an external world out there that needs to be studied.
For researchers, positivism is the foundation of quantitative research. Quantitative researchers try to be objective in their research; they try to avoid coming into contact with whatever they are studying, as they do not want to disturb the environment. One of the primary goals is to make generalizations that are applicable in all instances.
Quantitative researchers normally have a desire to test a theory. In other words, they develop one example of what they believe is a truth about a phenomenon (a theory) and test the accuracy of this theory with statistical data. The data determines the accuracy of the theory and the changes that need to be made.
By the late 19th and early 20th centuries, people were looking for alternative ways to approach research. One new approach was interpretivism.
Interpretivism is the complete opposite of positivism in many ways. Interpretivism asserts that there is no absolute truth but relative truth based on context. There is no single reality but multiple realities that need to be explored and understood.
For interpretivists, there is a fluidity in their methods of data collection and analysis. These two steps are often iterative in the same design. Furthermore, interpretivists see themselves not as outside the reality but as players within it. Thus, they often will share not only what the data says but also their own view and stance about it.
Qualitative researchers are interpretivists. They spend time in the field getting close to their participants through interviews and observations. They then interpret the meaning of these communications to explain a local, context-specific reality.
While quantitative researchers test theories, qualitative researchers build theories. Qualitative researchers gather data and interpret it by developing a theory that explains the local reality of the context. Since the sampling is normally small in qualitative studies, the theories often do not apply broadly.
There is little purpose in debating which view is superior. Both positivism and interpretivism have their place in research. What matters more is to understand your position and preference and to be able to articulate it in a reasonable manner. It is often not what a person does or believes that is important but why they believe or do what they do.
In experimental research design, internal validity is the appropriateness of the inferences made about cause and effect relationships between the independent and dependent variables. If there are threats to internal validity, it may mean that the cause and effect relationship you are trying to establish is not real. In general, there are three categories of threats to internal validity, which are…
There are several forms of threats to internal validity that relate to participants. Below is a list
A history threat to internal validity is the problem of the passage of time from the beginning to the end of the experiment. During this elapsed time, the groups involved in the study may have different experiences. These different experiences are history threats. One way to deal with this threat is to be sure that the conditions of the experiment are the same for all groups.
Maturation threat is the problem of how people change over time during an experiment. These changes make it hard to infer if the results of a study are because of the treatment or because of maturation. One way to deal with this threat is to select participants who develop in similar ways and speed.
Regression threat is the action of the researcher selecting extreme cases to include in their sample. Eventually, these cases regress to the mean, which impacts the results of the pretest or posttest. One option for overcoming this problem is to avoid outliers when selecting the sample.
Selection bias is the poor habit of picking people in a non-random way for an experiment. Examples of this include choosing mostly ‘smart’ people for an experiment, or working with only petite women for a study on diet and exercise. Random selection is the strongest way to deal with this threat.
Mortality is the loss of participants in a study. It is common for participants in a study to drop out and quit for many reasons. This leads to a decrease in the sample size, which weakens the statistical interpretation. Dealing with this requires using larger sample sizes as well as comparing the data of dropouts with that of those who completed the study.
Threats to internal validity can ruin a paper that has not carefully planned for how these threats work together to skew results. Researchers need to have an idea of what threats are out there as well as strategies that can alleviate them.
Comparing groups is a common goal in statistics. This is done to see if there is a difference between two groups. Understanding the difference can lead to insights based on statistical results. In this post, we will examine the following statistical tests for comparing samples.
T-test & Wilcoxon Test
The t-test indicates if there is a significant statistical difference between two groups. This is useful when you want to know whether two groups differ on some continuous variable. For example, if you are measuring the height of men and women and a t-test shows that men are taller, you can state that gender influences height, because the only difference between the two groups in this example is their gender.
Below is an example of conducting a t-test in R. In the example, we are looking at whether there is a difference in body temperature between beavers that are active and beavers that are not.
> t.test(temp ~ activ, data = beaver2)

	Welch Two Sample t-test

data:  temp by activ
t = -18.548, df = 80.852, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.8927106 -0.7197342
sample estimates:
mean in group 0 mean in group 1 
       37.09684        37.90306 
Here is what happened. We used the ‘t.test’ function with ‘temp’ as the dependent variable and ‘activ’ as the grouping variable. The p-value is essentially zero, and active beavers (group 1) have a higher mean temperature (37.90) than inactive beavers (group 0, 37.10).
The t-test assumes that the data is normally distributed. When normality is a problem, it is possible to use the Wilcoxon test instead. Below is the script for the Wilcoxon test using the same example.
> wilcox.test(temp ~ activ, data = beaver2)

	Wilcoxon rank sum test with continuity correction

data:  temp by activ
W = 15, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
A closer look at the output indicates largely the same results. Instead of the t-statistic the W-statistic is used, but the p-value is the same for both tests.
A paired t-test is used when you want to compare how the same group of people respond to different interventions. For example, you might use this for a before and after experiment. We will use the ‘sleep’ data in R to compare a group of people when they receive different types of sleep medication. The script is below
> t.test(extra ~ group, data = sleep, paired = TRUE)

	Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences 
                  -1.58 
Here is what happened. We ran the same ‘t.test’ function but set ‘paired = TRUE’, since the same people received both medications. The difference is significant (p = 0.003), with a mean difference of -1.58 hours of extra sleep between the two conditions.
Comparing samples in R is a simple process once you understand what you want to do. With this knowledge, the scripts and output are not too challenging even for beginners.
Philosophy is a term that is commonly used but hard to define. To put it simply, philosophy explains an individual’s or a group’s worldview in general or in a specific context. Such questions as the nature of knowledge, reality, and existence are questions that philosophy tries to answer. There are different schools of thought on these questions, and these are what we commonly call philosophies.
In this post, we will try to look at ontology, which is the study of the nature of reality. In particular, we will define it as well as explain its influence on research.
Ontological realism believes that reality is objective. In other words, there is one objective reality that is external to each individual person. We are in a reality and we do not create it.
Ontological idealism is the opposite extreme. This philosophy states that there are multiple realities and each depends on the person. My reality is different from your reality, and each of us builds our own reality.
Ontological realism is one of the philosophical foundations for quantitative research. Quantitative research is a search for an objective reality that accurately explains whatever is being studied.
For qualitative researchers, ontological idealism is one of their philosophical foundations. Qualitative researchers often support the idea of multiple realities. For them, since there is no objective reality, it is necessary to come into contact with people to explain their reality.
Something that has been alluded to but not stated specifically is the role of independence and dependence of individuals. Regardless of whether someone ascribes to ontological realism or idealism, there is the question of whether people are independent of reality or dependent on it. The level of independence and dependence contributes to other philosophies such as objectivism, constructivism, and pragmatism.
Objectivism, Constructivism and Pragmatism
Objectivism is the belief that there is a single reality that is independent of the individuals within it. Again, this is the common assumption of quantitative research. At the opposite end we have constructivism, which states that there are multiple realities and they are dependent on the individuals who make each respective reality.
Pragmatism supports the idea of a single reality with the caveat that it is true if it is useful and works. The application of the idea depends upon the individuals, which pushes pragmatism into the realm of dependence.
From this complex explanation of ontology and research comes the following implications
A key component of experimental design involves making decisions about the manipulation of the treatment conditions. In this post, we will look at the following traits of treatment conditions
Lastly, we will examine group comparison.
Among the most common independent variables in experimental design are treatment and measured variables. Treatment variables are manipulated by the researcher. For example, if you are looking at how sleep affects academic performance, you may manipulate the amount of sleep participants receive in order to determine the relationship between academic performance and sleep.
Measured variables are variables that are measured but not manipulated by the researcher. Examples include age, gender, height, weight, etc.
An experimental treatment is the intervention of the researcher to alter the conditions of an experiment. By keeping all other factors constant and manipulating only the experimental treatment, it allows for the potential establishment of a cause-effect relationship. In other words, the experimental treatment is a term for the use of a treatment variable.
Treatment variables usually have different conditions or levels in them. For example, if I am looking at sleep’s effect on academic performance, I may manipulate the treatment variable by creating several levels of the amount of sleep, such as high, medium, and low amounts of sleep.
Intervention is a term that means the actual application of the treatment variables. In other words, I break my sample into several groups and cause one group to get plenty of sleep, the second group to lack a little bit of sleep, and the last group to get none. Experimental treatment and intervention mean the same thing.
The outcome measure is the process of measuring the outcome variable. In our example, the outcome variable is academic performance.
Experimental design often focuses on comparing groups. Groups can be compared between groups and within groups. Returning to the example of sleep and academic performance, a between group comparison would be to compare the different groups based on the amount of sleep they received. A within group comparison would be to compare the participants who received the same amount of sleep.
Often there are at least three groups in an experimental study: the control, comparison, and experimental groups. The control group receives no intervention or treatment variable. This group often serves as a baseline for comparing the other groups.
The comparison group is exposed to everything but the actual treatment of the study. It is highly similar to the experimental group except for the experience of the treatment. Lastly, the experimental group experiences the treatment of the study.
Experiments involve treatment conditions and groups. As such, researchers need to understand their options for treatment conditions as well as what types of groups they should include in a study.
Normal distribution is an important term in statistics. When we say normal distribution, we are speaking of the traditional bell curve concept. Normal distribution is important because it is often an assumption of inferential statistics that the distribution of data points is normal. Therefore, one of the first things you do when analyzing data is to check for normality.
In this post, we will look at the following ways to assess normality.
Checking Normality by Graph
The easiest and crudest way to check for normality is visually through the use of histograms. You simply look at the histogram and determine how closely it resembles a bell.
To illustrate this, we will use the 'beaver2' dataset that is already loaded into R. This dataset contains four variables, "day", "time", "temp", and "activ", for data about beavers. Day indicates what day it was, time indicates what time it was, temp is the temperature of the beavers, and activ is whether the beavers were active when their temperature was taken. We are going to examine the normality of the temperature of active and inactive beavers. Below is the code
> library(lattice)
> histogram(~temp | factor(activ), data = beaver2)
As you look at the histograms, you can say that they are somewhat normal, although the peaks of the data are a little high in both. Group 1 is more normal than Group 0. The problem with visual inspection is its lack of accuracy in interpretation. This is partially solved by using QQ plots.
Checking Normality by Plots
QQ plots are useful for comparing your data with a normally distributed theoretical dataset. The QQ plot includes the line of a normal distribution and the data points of your data for comparison. The more closely your data follow the line, the more normal they are. Below is the code for doing this with our beaver information.
> qqnorm(beaver2$temp[beaver2$activ == 1], main = 'Active')
> qqline(beaver2$temp[beaver2$activ == 1])
Here is what we did: the 'qqnorm' function created the QQ plot of the temperatures of the active beavers, and the 'qqline' function added the line of a normal distribution for comparison.
Going by sight again, the data still look pretty good. However, one last test will give a more formal assessment of whether the dataset is normal or not.
Checking Normality by Test
The Shapiro-Wilk normality test assesses the probability that the data come from a normal distribution. The lower the p-value, the less likely it is that the data are normally distributed. Below are the code and results for the Shapiro-Wilk test.
> shapiro.test(beaver2$temp[beaver2$activ==1])

        Shapiro-Wilk normality test

data:  beaver2$temp[beaver2$activ == 1]
W = 0.98326, p-value = 0.5583
Here is what happened: the p-value of 0.5583 is well above the common 0.05 cutoff, so there is no evidence against the normality of the temperatures of the active beavers.
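The same test can be run for the inactive beavers by changing the subset condition; below is a sketch of the call.

> shapiro.test(beaver2$temp[beaver2$activ == 0])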
It is necessary to always check the normality of data before data analysis. The tips presented here provide a framework for accomplishing this.
In a previous post, we began a discussion on experimental design. In this post, we will begin a discussion on the characteristics of experimental design. In particular, we will look at the following
After developing an appropriate sampling method, a researcher needs to randomly assign individuals to the different groups of the study. One of the main reasons for doing this is to remove the bias of individual differences in all groups of the study.
For example, if you are doing a study on intelligence, you want to make sure that all groups have the same characteristics of intelligence. This helps the groups to be equated, or to be the same. It prevents people from saying that the differences between groups arose because the groups themselves were different rather than because of the treatment.
Control Over Extraneous Variables
Random assignment leads directly to the concern of controlling extraneous variables. Extraneous variables are any factors that might influence the cause-and-effect relationship that you are trying to establish. These other factors confound, or confuse, the results of a study. There are several methods for dealing with this, as shown below
A pre-test/post-test design allows a researcher to compare the measurement of something before the treatment and after the treatment. The assumption is that any difference between the scores before and after is due to the treatment. Doing both tests takes into account the confounding of the different contexts of the setting and individual characteristics.
This approach involves selecting people who are highly similar on the particular trait that is being measured. It removes the problem of individual differences when interpreting the results. The more similar the subjects in the sample are, the better controlled their traits are.
The use of covariates is a statistical approach in which controls are placed on the dependent variable through statistical analysis. The influence of other variables is removed from the explained variance of the dependent variable. Covariates help to explain more about the relationship between the independent and dependent variables.
This is a difficult concept to understand. However, the point is that you use covariates to explain in greater detail the relationship between the independent and dependent variables by removing other variables that might explain the relationship.
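As a rough sketch of how this looks in practice, an analysis that includes a covariate might be set up in R as follows; the dataset 'my_study' and its columns 'posttest', 'treatment', and 'pretest' are hypothetical and only show the structure.

# Hypothetical model: posttest scores explained by the treatment group,
# with pretest scores included as a covariate to remove their influence
model <- lm(posttest ~ treatment + pretest, data = my_study)
summary(model)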
Matching is deliberately, rather than randomly, assigning subjects to various groups. For example, if you are looking at intelligence, you might place high achievers in both groups of the study. By placing high achievers in both groups, you cancel out their difference.
Experimental design involves the random assignment of individuals in order to control for individual differences in a sample. The goal of experimental design is to make sure that the sample groups are mostly the same in a study. This allows for concluding that what happened was due to the treatment.
A two-way table is used to describe two or more categorical variables at the same time. The difference between a two-way table and a frequency table is that a two-way table tells you the number of subjects that share two or more variables in common, while a frequency table tells you the number of subjects that share one variable.
For example, a frequency table might describe gender. In such a table, you only know how many subjects are male or female. The only variable involved is gender. From a frequency table, you would learn some of the following
In a two-way table, you might look at gender and marital status. In such a table you would be able to learn several things
As such, there is a lot of information in a two-way table. In this post, we will look at the following
Creating a Table
In the example, we are going to look at two categorical variables. One variable is gender and the other is marital status. For gender, the choices are "Male" and "Female". For marital status, the choices are "Married" and "Single". Below is the code for developing the table.
Marriage_Study <- matrix(c(34, 20, 19, 42), ncol = 2)
colnames(Marriage_Study) <- c('Male', 'Female')
rownames(Marriage_Study) <- c('Married', 'Single')
Marriage_table <- as.table(Marriage_Study)
print(Marriage_table)
There has already been a discussion on creating matrices in R. Therefore, the details of this script will not be explained here.
If you type this correctly and run the script you should see the following
        Male Female
Married   34     19
Single    20     42
This table tells you about married and single people broken down by their gender. For example, 34 males are married.
Adding Margins and Calculating Proportions
A useful addition to a table is to add the margins. The margins tell you the total number of subjects in each row and column of a table. To do this in R use the ‘addmargins’ function as indicated below.
> addmargins(Marriage_table)
        Male Female Sum
Married   34     19  53
Single    20     42  62
Sum       54     61 115
We now know the total number of married people, single people, males, and females, in addition to the information we already knew.
One more useful piece of information is to calculate the proportions. This will allow us to know what percentage of each two-way possibility makes up the table. To do this we will use the “prop.table” function. Below is the script
> prop.table(Marriage_table)
             Male    Female
Married 0.2956522 0.1652174
Single  0.1739130 0.3652174
As you can see, we now know the proportions of each category in the table.
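If you want proportions within each row or each column rather than of the whole table, the 'prop.table' function also accepts a 'margin' argument; a short sketch is below.

> prop.table(Marriage_table, margin = 1)  # proportions within each row (marital status)
> prop.table(Marriage_table, margin = 2)  # proportions within each column (gender)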
This post provided information on how to construct and manipulate data that is in a two-way table. Two-way tables are a useful way of describing categorical variables.
Experimental design is now considered a standard methodology in research. However, this now-classic design has not always been a standard approach. In this post, we will look at the following
The word experiment is derived from the word experience. When conducting an experiment, the researcher assigns people to have different experiences. He then determines if the experience he assigned people to had an effect on some outcome. For example, if I want to know if the experience of sunlight affects the growth of plants, I may develop two different experiences
The outcome is the growth of the plants. By giving the plants different experiences of sunlight, I can determine if sunlight influences the growth of plants.
History of Experiments
Experiments have been around informally since the 10th century, with work done in the field of medicine. The use of experiments as known today began in the early 20th century in the field of psychology. By the 1920s, group comparison had become an established characteristic of experiments. By the 1930s, random assignment was introduced. By the 1960s, various experimental designs had been codified and documented. By the 1980s, there was literature coming out that addressed threats to validity.
Since the 1980s, experiments have become much more complicated with the development of more advanced statistical software programs. Despite all of the new complexity, simple experimental designs are normally easier to understand.
When to Conduct Experiments
Experiments are conducted to attempt to establish a cause and effect relationship between independent and dependent variables. You try to create a controlled environment in which you provide the experience or independent variable(s) and then measure how they affect the outcome or dependent variable.
Since the setting of the experiment is controlled, you can argue that only the experience influenced the outcome. Of course, in reality, it is difficult to control all the factors in a study. The real goal is to try to limit the effect that these other factors have on the outcomes of a study.
Despite their long informal history, experiments are relatively new in research. This design has grown and matured over the years to become a powerful method for determining cause and effect. Therefore, researchers should be aware of this approach for use in their studies.
A correlation indicates the strength of the relationship between two or more variables. Plotting correlations allows you to see if there is a potential relationship between two variables. In this post, we will look at how to plot correlations with multiple variables.
In R, there is a built-in dataset called ‘iris’. This dataset includes information about different types of flowers. Specifically, the ‘iris’ dataset contains the following variables
You can confirm this by inputting the following script
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
We now want to examine the relationship that each of these variables has with each other. In other words, we want to see the relationship of
We are now going to plot all of these variables at the same time by using the 'plot' function. We also need to tell R not to include the "Species" variable. This is done by adding a subset code to the script. Below is the code to complete this task.
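> plot(iris[-5])  # the subset code '[-5]' drops the fifth column, "Species"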
Here is what we did: we plotted the 'iris' data with the 'plot' function while using the subset code '[-5]' to drop the fifth column, "Species". The result is a matrix of scatterplots for the four remaining continuous variables.
The variable names are placed diagonally from left to right. The x-axis of a plot is determined by variable name in that column. For example,
The y-axis is determined by the variable that is in the same row as the plot. For example,
As you can see, this is the same information. We will now look at a few examples of plots
Hopefully, you can see the pattern. The plots above the diagonal are mirrors of the ones below. If you are familiar with correlational matrices this should not be surprising.
After a visual inspection, you can calculate the actual statistical value of the correlations. To do so use the script below and you will see the table below after it.
> cor(iris[-5])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
As you can see, there are many strong relationships between the variables. For example, "Petal.Width" and "Petal.Length" have a correlation of .96, which is almost perfect. This means the two variables move together almost perfectly: flowers with wider petals almost always have longer petals as well. Keep in mind that a correlation measures the strength of a relationship, not how many units one variable changes when the other changes by one unit.
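If you want to test whether one of these correlations is statistically significant, the 'cor.test' function can be used; a minimal sketch for petal width and petal length is below.

> cor.test(iris$Petal.Width, iris$Petal.Length)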
Plots help you to see the relationship between two variables. After visual inspection, it is beneficial to calculate the actual correlation.