A demo on the use of the Zotero Reference software
Working with students over the years has led me to the conclusion that students often do not understand the connection between variables, quantitative research questions, and the statistical tools used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary statistical skills to succeed.
This means that students have a theoretical understanding of statistics but struggle to apply it. In this post, we will look at some of the connections between research questions and statistics.
Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.
Another aspect that can make this confusing is that many variables can be measured in more than one way. For example, the variable "salary" can be measured in a continuous manner or in a categorical manner. Which is better depends on the goals of the research.
It is critical to help students develop a thorough understanding of variables in order to support their research.
Types of Research Questions
In general, there are two types of research questions: descriptive and relational. Descriptive questions involve the use of descriptive statistics such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (i.e., the average height is 172 cm) rather than relying on qualitative descriptions of it (i.e., the people are tall).
Below are several example research questions that are descriptive in nature.

1. What is the average height of the students?
2. What proportion of the students passed the exam?
3. What is the students' perception of the cafeteria?
These questions are not intellectually sophisticated, but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items that are used to measure respondents' perception of the cafeteria.
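As a quick sketch of what answering these looks like in R (the data here is hypothetical):

scores <- c(55, 72, 80, 91, 68, 59)    # hypothetical exam scores
mean(scores)                           # Question 1: the average score
sum(scores >= 60)/length(scores)       # Question 2: proportion who passed, assuming a cutoff of 60
survey <- data.frame(item1 = c(4, 3, 5), item2 = c(2, 4, 4))   # hypothetical survey items
mean(rowMeans(survey))                 # Question 3: mean across the survey items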
Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.
Relational questions look for the connection or link between variables. Within this type there are two sub-types: comparison questions, which involve comparing groups, and relational or association questions.
Comparison questions involve comparing groups on a continuous variable. For example, comparing men and women by height: what you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is looking for a relationship, just not in the way that many students understand. Common comparison questions include the following.

1. Is there a difference in height between men and women?
2. Is there a difference in test scores among freshmen, sophomores, and juniors?
Each of these questions can be answered using ANOVA or, if we want to get technical and there are only two groups (e.g., gender), a t-test. This is a broad overview and does not include the complexities of one-sample or paired t-tests.
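Here is a minimal sketch in R with hypothetical data:

height <- c(170, 182, 175, 158, 162, 165)
gender <- factor(c("male", "male", "male", "female", "female", "female"))
t.test(height ~ gender)          # two groups: t-test
summary(aov(height ~ gender))    # the same comparison as an ANOVA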
Relational or association questions primarily involve continuous variables. The goal is to see how variables move together. For example, you may look for the relationship between the height and weight of students. Common questions include the following.

1. What is the relationship between the height and weight of students?
2. Does height predict weight?
Question 1 can be answered by calculating a correlation. Question 2 requires the use of linear regression.
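Again, a minimal sketch in R with hypothetical data:

height <- c(160, 170, 180, 175, 165)
weight <- c(55, 68, 80, 74, 60)
cor(height, weight)              # Question 1: correlation
summary(lm(weight ~ height))     # Question 2: linear regression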
The challenge as a teacher is showing students the connection between statistics and research questions from the real world. It takes time for students to see how the question informs the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what to do in research based on the statistical knowledge one has.
Some of the biggest challenges in helping students with research stem from their lack of preparation. The problem is not ignorance of statistics or research design, as that takes only a little support to address. The real problem is that students want to do research while having read hardly any research and while lacking knowledge of how research writing is communicated. This post will share some prerequisites to performing research.
Extensive reading means reading broadly about a topic and not focusing too much on specifics. Therefore, you read indiscriminately, perhaps limiting yourself only to your general discipline.
In order to communicate research, you must first be familiar with the vocabulary and norms of research. This can be learned to a great extent through reading academic empirical articles.
The analogy I like to use is how a baby learns. A baby spends large amounts of time being exposed to the words and actions of others. The baby has no real idea of what is going on at first. However, after continuous exposure, the child begins to understand the words and actions of those around them and even begins to mimic the behaviors.
In many ways, this is the purpose of reading a great deal before even attempting to do any research. Just like the baby, a writer needs to observe how others do things, continue this process even when they do not understand, and attempt to imitate the desired behaviors. You must understand the forms of communication as well as the cultural expectations of research writing, and this can only happen through direct observation.
At the end of this experience, you begin to notice a pattern in the structure of research writing. The style is highly rigid with little variation.
It is hard to say how much extensive reading a person needs. Generally, the more reading done in the past, the less reading needed to understand the structure of research writing. If you hate to read and did little reading in the past, you will need to read a lot more to understand research writing than someone with an extensive background in reading. In addition, if you are trying to write in a second language, you will need to read much more than someone writing in their native language.
If you are still desirous of a hard number of articles to read I would say aim for the following
Extensive reading is just reading. There is no notetaking or even highlighting; you are focusing on exposure only. Just like the observant baby, you are living in the moment trying to determine what the appropriate behavior is. If you don't understand, keep going anyway, as the purpose is quantity and not quality. Generally, when the structure of the writing begins to become redundant and you can tell what the author is doing without having to read too closely, you are ready to move on.
Intensive reading is reading for understanding. This involves slowing down with the goal of deeper understanding. Now you select something in particular that you want to know. Perhaps you want to become more familiar with the writing of one excellent author, or maybe there is one topic in particular that you are interested in. With intensive reading, you want to know everything that is happening in the text. To achieve this, you read fewer articles and focus much more on quality over quantity.
By the end of the extensive and intensive reading, you should be familiar with the following.
Once a student has read a lot of research, there is some hope that they can now attempt to write in this style. As the teacher, it is my responsibility to point out the structure of research writing, which involves such ideas as the five sections and the parts of each section.
Students grasp this but they often cannot build paragraphs. In order to write academic research, you must know the purpose of main ideas, supporting details, and writing patterns. If these terms are unknown to you it will be difficult to write research that is communicated clearly.
The main idea is almost always the first sentence of a paragraph, and writing patterns provide different ways to organize the supporting details. This involves understanding the purpose of each paragraph that is written, something many students cannot explain. This is looking at writing from a communicative or discourse perspective and not from a minute-detail or grammar one.
The only way to do this is to practice writing. I often will have students develop several different reviews of literature. During this experience, they learn how to share the ideas of others. The next step is developing a proposal in which the student shares their ideas and someone else’s. The final step is writing a formal research paper.
To write, you must first observe how others write. Then you need to imitate what you saw. Once you can do what others have done, you can ask questions about why things are done this way. Too often, people just want to write without even understanding what they are trying to do. This leads to paralysis at best (I don't know what to do) and disaster at worst (spending hours confidently writing garbage). The enemy of research is not methodology, as many people write a lot without knowledge of stats or research design because they collaborate. The real enemy of research is neglecting the preparation of reading and the practice of writing.
Writing the results of a research paper is difficult. As a researcher, you have to try and figure out if you answered the question. In addition, you have to figure out what information is important enough to share. As such it is easy to get stuck at this stage of the research experience. Below are some ideas to help with speeding up this process.
Consider the Order of the Answers
This may seem obvious, but probably the best advice I could give a student writing their results section is to be sure to answer their questions in the order they were presented in the introduction of the study. This helps with cohesion and coherence. The reader is anticipating answers to these questions and often subconsciously remembers the order the questions came in.
If a student answers the questions out of order, it can be jarring for the reader. When this happens, the reader starts to double-check what the questions were and begins to second-guess their understanding of the paper, which reflects poorly on the writer. An analogy would be introducing three of your friends to your parents: you might share each person's name in order, 1st, 2nd, 3rd, and then go back and share a little personal information about each friend in the same order. The same courtesy should apply when answering research questions in the results section. Whichever question was first is answered first, and so on.
Consider how to Represent the Answers
Another aspect to consider is the presentation of the answers. Should everything be in text? What about the use of visuals and tables? The answers depend on several factors.
Know when to Interpret
Sometimes I have had students try to explain the results while presenting them. I cannot say this is wrong; however, it can be confusing. The reason is that the student is trying to do two things at the same time: present the results and interpret them. This would be OK in a presentation, and even expected, but when someone is reading a paper it is difficult to keep two separate threads of thought going at the same time. Therefore, the meaning or interpretation of the results should be saved for the Discussion/Conclusion section.
Presenting the results is in many ways the high point of a research experience. It is not easy to take numerical results and capture the useful information clearly. As such, the advice given here is intended to help support this experience.
Students often struggle with shaping the methodology section of their paper. The problem is often that students do not see the connection between the different sections of a research paper. This inability to connect the dots leads to isolated thinking on the topic and an inability to move forward.
The methodology section of a research paper plays a critical role. In brief, the purpose of a methodology is to explain to your readers how you will answer your research questions. In the strictest sense, this is important for reproducing a study. Therefore, what is really important when writing a methodology is the research questions of the study. The research questions determine the following sections of a methodology.
What this means is that a student must know what they want to know in order to explain how they will find the answers. Below is a description of these sections along with one section that is not often influenced by the research questions.
Sample & Setting
In the sample section of the methodology, it is common for the student to explain the setting of the study, provide some demographics, and explain the sampling method. In this section of the methodology, the goal is to describe what the reader needs to know about the participants in order to understand the context from which the results were derived.
Research Design & Scales
The research design explains specifically how the data was collected. There are several standard ways to do this in the social sciences.
Within this section, some academic disciplines also explain the scales or the tools used to measure the variable(s) of the study. Again, it is impossible to develop this section if the research questions are unclear or unknown.
Data Analysis

The data analysis section provides an explanation of how the answers were calculated in a study. Success in this section requires knowledge of the various statistical tools that are available. However, understanding the research questions is key to articulating this section clearly.
A final section in many methodologies is ethics. The ethics section is a place where the student can explain how they protected participants' anonymity, obtained the necessary permissions, and addressed other moral aspects of the study. This section is required by most universities in order to gain permission to do research. However, it is often missing from journal articles.
The methodology is part of the larger picture of communicating one's research. It is important that a research paper is not seen as isolated parts but rather as a whole. The reason for this position is that a paper cannot make sense if any of these aspects is missing.
Writing a review of literature can be challenging for students. The purpose here is to try and synthesize a huge amount of information and to try and communicate it clearly to someone who has not read what you have read.
Often a student will collect as many articles as possible and try to throw them all together to make a review of the literature. This naturally leads to the paper sounding like a shopping list of various articles, neither interesting nor coherent.
Instead, when writing a review of literature a student should keep in mind the question
What do my readers need to know in order to understand my study?
This is a foundational principle when writing. Readers don't need to know everything, only what they need to know to appreciate the study they are reading. An extension of this is that different readers need to know different things. As such, there is always a contextual element to framing a review of the literature.
Consider the Format
When working with a student, I always recommend the following format to get their writing started.
For each major variable in your study do the following…
Definition

The first thing that needs to be done is to provide a definition of the construct. This is important because many constructs are defined in many different ways. This can lead to confusion if the reader is thinking of one definition and the writer is thinking of another.
Examples and Theories
Step 2 is more complex. After a definition is provided, the student can either provide an example of what the construct looks like in the real world and/or provide more information about theories related to the construct.
Sometimes examples are useful. For example, if writing a paper on addiction it would be useful to not only define it but also to provide examples of the symptoms of addiction. The examples help the reader to see what used to be an abstract definition in the real world.
Theories are important for providing a deeper explanation of a construct. However, theories tend to be highly abstract and do not always help a reader to understand the construct better. One benefit of theories is that they provide a historical background of where the construct came from and can be used to develop the significance of the study as the student tries to find some sort of gap to explore in their own paper.
Often it can be beneficial to include both examples and theories, as this demonstrates stronger expertise in the subject matter. In theses and dissertations, both are expected whenever possible. However, for articles, space limitations and knowledge of the audience affect the inclusion of both.
Relevant Studies

The relevant studies section is similar to breaking news on CNN: the studies should generally be recent. In the social sciences, we are often encouraged to look at literature from the last five years, perhaps ten years in some cases. Generally, readers want to know what has happened recently, as experienced experts are familiar with the older papers. This rule does not apply as strictly to theses and dissertations.
Once recent literature has been found, the student needs to organize it thematically. The reason for a thematic organization is that the theme serves as the main idea of the section and the studies themselves serve as the supporting details. This structure is surprisingly clear for many readers, as its predictable nature allows the reader to focus on content rather than on trying to figure out what the author is trying to say. Below is an example.
There are several challenges with using technology in class (ref, 2003; ref, 2010). For example, Doe (2009) found that technology can be unpredictable in the classroom. James (2010) found that lack of training can lead some teachers to resent having to use new technology.
The main idea here is "challenges with technology." The supporting details are Doe (2009) and James (2010). This concept of themes can become much more complex than this and can span several paragraphs or pages.
This process really cuts down on the confusion in students' writing. Stronger students can be free to do what they want. However, many students require structure and guidance when they first begin writing research papers.
Academic and applied research are perhaps the only two ways that research can be performed. In this post, we will look at the differences between these two perspectives on research.
Academic research falls into two categories. These two categories are research ON your field and research FOR your field.
Research ON your field is research that searches for best practice. It looks at how your academic area is practiced in the real world. A scholar will examine how well a theory is being applied or used in a real-world setting and make recommendations.
For example, in education, if a scholar does research in reading comprehension, they may want to determine some of the most appropriate strategies for teaching reading comprehension. The scholar will look at existing theories and determine which one(s) are most appropriate for supporting students.
Research ON your field is focused on existing theories that are tested with the goal of developing recommendations for improving practice.
Research FOR your field is slightly different. This perspective seeks to expand theoretical knowledge about your field. In other words, the scholar develops new theories rather than assessing the application of older ones.
An example of this in education would be developing a new theory of reading comprehension. By theory, it is meant explanation. Famous theories in education include Piaget's stages of development, Kohlberg's stages of moral development, and more. At their time, each of these theories pushed the boundaries of our understanding of something.
The main thing about academic research is that it leads to recommendations but not necessarily to answers that solve problems. Answering problems is something that is done with applied research.
Applied research is also known as research IN your field. This type of research is often performed by practitioners in the field.
There are several forms of research IN your field, and they are as follows: formative, monitoring, and summative.
Formative research is for identifying problems. For example, a teacher may notice that students are not performing well or doing their homework. Formative applied research is when the teacher puts on the detective hat and begins to search for the cause of this behavior.
The results of formative research lead to some sort of action plan to solve the problem. Monitoring applied research is then conducted during the implementation of the solution to see how things are going.
For example, if the teacher discovers that students are struggling with reading because of weak phonological awareness, they may implement a review program for this skill. Monitoring would involve assessing student reading performance during the program.
Summative applied research is conducted at the end of implementation to see if the objectives of the program were met. Returning to the reading example, if the teacher's objective was to improve reading comprehension scores by 10%, the summative research would assess how well the students can now read and whether there was a 10% improvement.
In education, applied research is also known as action research.
Research can serve many different purposes. Academics focus on recommendations rather than action, while practitioners want to solve problems and perhaps recommend less. The point is that understanding what type of research you are trying to conduct can help you in shaping the direction of your study.
Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.
In light of this challenge, there are at least five core tasks that you need to consider when preparing to analyze data. These five tasks are developing your question(s), exploring the data, developing a statistical model, interpreting the results, and communicating the results.
Developing Your Question(s)
You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).
There are several types of research questions. The point is you need to ask them in order to answer them.
Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.
In addition, exploration of the data allows you to determine if there are any problems with the data set, such as missing data or strange variables, and, if necessary, to develop a data dictionary so you know the characteristics of the variables.
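A few common first moves when exploring a data frame in R, using the built-in mtcars data as a stand-in:

df <- mtcars             # any data frame; mtcars is just a built-in example
str(df)                  # variable types and structure
summary(df)              # basic descriptive statistics for each variable
colSums(is.na(df))       # count of missing values per variable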
Data exploration also allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for the more formal analysis, when you develop your statistical models. This process takes up the majority of a data scientist's time and is not easy at all. Mastery of this in many ways means being a master of data science.
Develop a Statistical Model
Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that can affect this include whether the learning is supervised or unsupervised and whether you want to classify or predict numerical values.
This is probably the most fun part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s).
Interpreting the Results
Once a model is developed, it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of "black box" methods such as support vector machines and artificial neural networks. Models normally need to be explainable to non-technical stakeholders.
With interpretation, you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?
Communication of Results
Now is the time to actually share the answer(s) to your question(s). How this is done varies, but it can be written, verbal, or both. Whatever the mode of communication, it is necessary to consider the following.
You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.
The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.
Generalized linear models (GLM) are another way to approach linear regression. The advantage of GLM is that it allows the error to follow many different distributions rather than only the normal distribution, which is an assumption of traditional linear regression.

Often GLM is used for response or dependent variables that are binary or represent count data. This post will provide a brief explanation of GLM as well as provide an example.
There are three important components to a GLM: the error structure, the linear predictor, and the link function.
The error structure is the type of distribution you will use in generating the model. There are many different distributions in statistical modeling, such as the binomial, Gaussian, Poisson, etc. Each distribution comes with certain assumptions that govern its use.
The linear predictor is the sum of the effects of the independent variables. Lastly, the link function determines the relationship between the linear predictor and the mean of the dependent variable. There are many different link functions, and the best link function is the one that reduces the residual deviance the most.
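For the binomial model used below, the link is the logit, so the model takes the form log(p / (1 - p)) = b0 + b1x1 + ... + bkxk, where p is the probability that the dependent variable equals one.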
In our example, we will try to predict if a house will have air conditioning based on the interaction between the number of bedrooms and bathrooms, the number of stories, and the price of the house. To do this, we will use the "Housing" dataset from the "Ecdat" package. Below is some initial code to get started.
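A minimal version of that setup, assuming the Ecdat package is installed:

library(Ecdat)
data(Housing)
str(Housing)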
The dependent variable "airco" in the "Housing" dataset is binary. This calls for us to use a GLM. To do this we will use the "glm" function in R. Furthermore, in our example, we want to determine if there is an interaction between the number of bedrooms and bathrooms. Interaction means that the influence of the two independent variables (bathrooms and bedrooms) on the dependent variable (airco) is not additive; the combined effect of the independent variables is different than if you just added them together. Below is the code for the model followed by a summary of the results.
model <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price, family = binomial)
summary(model)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7069  -0.7540  -0.5321   0.8073   2.4217
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.441e+00  1.391e+00  -4.632 3.63e-06
## Housing$bedrooms                  8.041e-01  4.353e-01   1.847   0.0647
## Housing$bathrms                   1.753e+00  1.040e+00   1.685   0.0919
## Housing$stories                   3.209e-01  1.344e-01   2.388   0.0170
## Housing$price                     4.268e-05  5.567e-06   7.667 1.76e-14
## Housing$bedrooms:Housing$bathrms -6.585e-01  3.031e-01  -2.173   0.0298
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.75  on 540  degrees of freedom
## AIC: 561.75
##
## Number of Fisher Scoring iterations: 4
To check how good our model is, we need to check for overdispersion and compare this model to other potential models. Overdispersion is a measure used to determine if there is too much variability in the model. It is calculated by dividing the residual deviance by the degrees of freedom. Below is the solution for this.
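Using the residual deviance (549.75) and degrees of freedom (540) from the summary above:

#overdispersion calculation
549.75/540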
##  1.018056
Our answer is about 1.02, which is pretty good; values close to 1 indicate little overdispersion.
Now we will make several models and compare their results.
#add recroom and garagepl
model2 <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$garagepl, family = binomial)
summary(model2)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$garagepl,
##     family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6733  -0.7522  -0.5287   0.8035   2.4239
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.369e+00  1.401e+00  -4.545 5.51e-06
## Housing$bedrooms                  7.830e-01  4.391e-01   1.783   0.0745
## Housing$bathrms                   1.702e+00  1.047e+00   1.626   0.1039
## Housing$stories                   3.286e-01  1.378e-01   2.384   0.0171
## Housing$price                     4.204e-05  6.015e-06   6.989 2.77e-12
## Housing$recroomyes                1.229e-01  2.683e-01   0.458   0.6470
## Housing$garagepl                  2.555e-03  1.308e-01   0.020   0.9844
## Housing$bedrooms:Housing$bathrms -6.430e-01  3.054e-01  -2.106   0.0352
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.54  on 538  degrees of freedom
## AIC: 565.54
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.54/538
##  1.02145
model3 <- glm(Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl, family = binomial)
summary(model3)
##
## Call:
## glm(formula = Housing$airco ~ Housing$bedrooms * Housing$bathrms +
##     Housing$stories + Housing$price + Housing$recroom + Housing$fullbase +
##     Housing$garagepl, family = binomial)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6629  -0.7436  -0.5295   0.8056   2.4477
##
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -6.424e+00  1.409e+00  -4.559 5.14e-06
## Housing$bedrooms                  8.131e-01  4.462e-01   1.822   0.0684
## Housing$bathrms                   1.764e+00  1.061e+00   1.662   0.0965
## Housing$stories                   3.083e-01  1.481e-01   2.082   0.0374
## Housing$price                     4.241e-05  6.106e-06   6.945 3.78e-12
## Housing$recroomyes                1.592e-01  2.860e-01   0.557   0.5778
## Housing$fullbaseyes              -9.523e-02  2.545e-01  -0.374   0.7083
## Housing$garagepl                 -1.394e-03  1.313e-01  -0.011   0.9915
## Housing$bedrooms:Housing$bathrms -6.611e-01  3.095e-01  -2.136   0.0327
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 681.92  on 545  degrees of freedom
## Residual deviance: 549.40  on 537  degrees of freedom
## AIC: 567.4
##
## Number of Fisher Scoring iterations: 4
#overdispersion calculation
549.4/537
##  1.023091
Now we can assess the models by using the “anova” function with the “test” argument set to “Chi” for the chi-square test.
anova(model, model2, model3, test = "Chi")
## Analysis of Deviance Table ## ## Model 1: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price ## Model 2: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$garagepl ## Model 3: Housing$airco ~ Housing$bedrooms * Housing$bathrms + Housing$stories + ## Housing$price + Housing$recroom + Housing$fullbase + Housing$garagepl ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 540 549.75 ## 2 538 549.54 2 0.20917 0.9007 ## 3 537 549.40 1 0.14064 0.7076
The results of the anova indicate that the models are all essentially the same, as there is no statistical difference between them. The only criterion on which to select a model is the measure of overdispersion. The first model has the lowest rate of overdispersion and so is the best by this criterion. Therefore, determining if a house has air conditioning depends on examining the number of bedrooms and bathrooms simultaneously, as well as the number of stories and the price of the house.
This post explained how to use and interpret GLM in R. GLM can be used primarily for fitting data to distributions that are not normal.
Proportions are a fraction or "portion" of a total amount. For example, if there are ten men and ten women in a room, the proportion of men in the room is 50% (10 / 20). There are times when doing an analysis that you want to evaluate proportions in the data rather than individual measurements such as the mean, correlation, standard deviation, etc.
In this post we will learn how to do a test of proportions using R. We will use the dataset "Default", which is found in the "ISLR" package. We will compare the proportion of those who are students in the dataset to a theoretical value. We will calculate the results using the z-test and the binomial exact test. Below is some initial code to get started.
We first need to determine the actual number of students that are in the sample. This is calculated below using the “table” function.
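A minimal version of the setup and the call, assuming the ISLR package is installed:

library(ISLR)
data("Default")
table(Default$student)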
##
##   No  Yes
## 7056 2944
We have 2944 students in the sample and 7056 people who are not students. We now need to determine how many people are in the sample in total, which we can do by summing the results from the table. Below is the code.
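The sum can be taken directly from the same table:

sum(table(Default$student))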
##  10000
There are 10000 people in the sample. To determine the proportion of students, we take 2944 / 10000, which equals 0.2944, or 29.44%. Below is the code to calculate this.
table(Default$student) / sum(table(Default$student))
##
##     No    Yes
## 0.7056 0.2944
The proportion test is used to compare a particular value with a theoretical value. For our example, the particular value we have is that 29.44% of the people were students. We want to compare this value with a theoretical value of 50%. Before we do so, it is better to state specifically what our hypotheses are.

NULL: The proportion of 29.44% of the sample being students is the same as the 50% found in the population.
ALTERNATIVE: The proportion of 29.44% of the sample being students is NOT the same as the 50% found in the population.
Below is the code to complete the z-test.
prop.test(2944,n = 10000, p = 0.5, alternative = "two.sided", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data:  2944 out of 10000, null probability 0.5
## X-squared = 1690.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2855473 0.3034106
## sample estimates:
##      p
## 0.2944
Here is what the code means.

1. prop.test is the function used
2. The first value of 2944 is the total number of students in the sample
3. n = 10000 is the sample size
4. p = 0.5 is the theoretical proportion
5. alternative = "two.sided" means we want a two-tailed test
6. correct = FALSE means we do not want a continuity correction applied to the z-test. This is useful for small sample sizes but not for our sample of 10000
The p-value is essentially zero. This means that we reject the null hypothesis and conclude that the proportion of students in our sample is different from a theoretical proportion of 50% in the population.
Below is the same analysis using the binomial exact test.
binom.test(2944, n = 10000, p = 0.5)
##
## Exact binomial test
##
## data:  2944 and 10000
## number of successes = 2944, number of trials = 10000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2854779 0.3034419
## sample estimates:
## probability of success
##                 0.2944
The results are the same. Whether to use "prop.test" or "binom.test" is a major argument among statisticians. The purpose here was to provide an example of the use of both.
In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.
At a bus company, the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this, complete the following.
Calculate the Interval Value
Our first step is to calculate the interval value. This is the range within which 99.7% of the values fall. Finding it requires knowing the mean and the standard deviation, and adding and subtracting three standard deviations from the mean. Below is the code for this.
busStopMean <- 81
busStopSD <- 7.9
busStopMean + 3*busStopSD
##  104.7
busStopMean - 3*busStopSD
##  57.3
The values above mean that we can set our interval between 55 and 110, with 100 fictitious buses in the data. Below is the code to set the interval.
interval <- seq(55, 110, length = 100)   # length here represents 100 fictitious buses
The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.
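Based on the variables defined above, the call would look like this:

densityCurve <- dnorm(interval, mean = busStopMean, sd = busStopSD)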
We will now plot the normal curve of our data using ggplot. First, we need to put our "interval" and "densityCurve" variables in a dataframe, which we will call "normal", and then we will create the plot. Below is the code.
library(ggplot2)
normal <- data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve)) + geom_line() + ggtitle("Number of Stops for Buses")
We now want to determine the probability of a bus having fewer than 65 stops. To do this we use the "pnorm" function in R and include the value 65, along with the mean and standard deviation, and tell R we want the lower tail only. Below is the code for completing this.
pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
##  0.02141744
As you can see, at about 2% it would be unusual for a bus to have fewer than 65 stops. We can also plot this using ggplot. First, we create a cumulative probability curve using the "pnorm" function, combine this with our "interval" variable in a dataframe, and then use this information to make a plot in ggplot2. Below is the code.
CumulativeProb <- pnorm(interval, mean = 81, sd = 7.9, lower.tail = TRUE)
pnormal <- data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb)) + geom_line() + ggtitle("Cumulative Density of Stops for Buses")
Second Probability Problem
We will now calculate the probability of a bus having 93 or more stops. To make it more interesting, we will create a plot that shades the area under the curve for 93 or more stops. The code is a little too complex to explain, so just enjoy the visual.
pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
##  0.06438284
x <- interval
ytop <- dnorm(93, 81, 7.9)
MyDF <- data.frame(x = x, y = densityCurve)
p <- ggplot(MyDF, aes(x, y)) + geom_line() + scale_x_continuous(limits = c(50, 110)) + ggtitle("Probability of 93 Stops or More is 6.4%")
shade <- rbind(c(93, 0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "x"], 0))   # close the shaded region at the right edge of the curve
p + geom_segment(aes(x = 93, y = 0, xend = 93, yend = ytop)) + geom_polygon(data = shade, aes(x, y))
A lot of work was done, but all in a practical manner, looking at a realistic problem. We were able to calculate several different probabilities and graph them accordingly.
In a previous post, we looked at mixed methods and some examples of this design. Mixed methods are focused on combining quantitative and qualitative methods to study a research problem. In this post, we will look at several additional mixed method designs. Specifically, we will look at the following designs: embedded, transformative, and multiphase.
Embedded Design

Embedded design is the simultaneous collection of quantitative and qualitative data with one form of data playing a supportive role to the other. The supportive data augments the conclusions of the main data collection.
The benefit of this design is that it allows one method to lead the analysis while the secondary method provides additional information. For example, quantitative measures are excellent at recording the results of an experiment. Qualitative measures would be useful in determining how participants perceived their experience in the experiment.
A downside to this approach is making sure the secondary method truly supports the overall research. Quantitative and qualitative methods naturally answer different research questions. Therefore, the research questions of a study must be worded in a way that allows for cooperation between qualitative and quantitative methods.
Transformative Design

The transformative design is more of a philosophy than a mixed method design. This design can employ any other mixed method design. The main difference is that transformative designs focus on helping a marginalized population with the goal of bringing about change.
For example, a researcher might do a study of Asian students facing discrimination in a predominately African American high school. The goal of the study would be to document the experiences of the Asian students in order to provide administrators with information on the extent of this problem.
Such a focus on the oppressed draws heavily from Critical Theory, which exposes how oppression takes place through education. The emphasis on change is derived from Dewey and progressivism.
Multiphase Design

Multiphase design is the use of several designs over several studies. This is a massive and supremely complex process. You would need to tie together several different mixed method studies under one general research problem. From this, you can see that this is not a commonly used design.
For example, you may decide to continue doing research into Asian student discrimination at African American high schools. The first study might employ an explanatory design. The second study might employ an exploratory design. The last study might be a transformative design.
After completing all this work, you would need to be able to articulate the experiences with discrimination of the Asian students. This is not an easy task by any means. As such, if and when this design is used, it often requires the teamwork of several researchers.
Mixed method designs require a different way of thinking when it comes to research. The uniqueness of this approach is the combination of qualitative and quantitative methods. This mixing of methods has advantages and disadvantages. The primary point to remember is that the most appropriate design depends on the circumstances of the study.
Mixed methods research involves the combination of qualitative and quantitative approaches to address a research problem. Generally, qualitative and quantitative methods have separate philosophical positions when it comes to how to uncover insights in addressing research questions.
For many, mixed methods have their own philosophical position, which is pragmatism. Pragmatists believe that if it works it’s good. Therefore, if mixed methods lead to a solution it’s an appropriate method to use.
This post will try to explain some of the mixed method designs. Before explaining, it is important to understand that there are several common ways to approach mixed methods: the convergent parallel, explanatory sequential, and exploratory sequential designs.
Convergent Parallel Design
This design involves the simultaneous collecting of qualitative and quantitative data. The results are then compared to provide insights into the problem. The advantage of this design is the quantitative data provides for generalizability while the qualitative data provides information about the context of the study.
However, the challenge is in trying to merge the two types of data. Qualitative and quantitative methods answer slightly different questions about a problem. As such it can be difficult to paint a picture of the results that are comprehensible.
Explanatory Sequential Design

This design puts the emphasis on the quantitative data, with qualitative data playing a secondary role. Normally, the results found in the quantitative data are followed up on in the qualitative part.
For example, if you collect surveys about what students think about college and the results indicate negative opinions, you might conduct an interview with students to understand why they are negative towards college. A Likert survey will not explain why students are negative. Interviews will help to capture why students have a particular position.
The advantage of this approach is the clear organization of the data. Quantitative data is more important. The drawback is deciding what about the quantitative data to explore when conducting the qualitative data collection.
Exploratory Sequential Design

This design is the opposite of the explanatory design. Now the qualitative data is more important than the quantitative. This design is used when you want to understand a phenomenon in order to measure it.
It is common when developing an instrument to interview people in focus groups to understand the phenomenon. For example, if I want to understand what cellphone addiction is I might ask students to share what they think about this in interviews. From there, I could develop a survey instrument to measure cell phone addiction.
The drawback to this approach is the time it consumes. It takes a lot of work to conduct interviews, develop an instrument, and assess the instrument.
Mixed methods are not that new. However, they are still a somewhat unusual approach to research in many fields. Despite this, the approaches of mixed methods can be beneficial depending on the context.
Machine learning is a tool used in analytics for using data to make a decision for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.
This post will provide information on basic types of machine learning which includes predictive models, supervised learning, descriptive models, and unsupervised learning.
Predictive Models and Supervised Learning
Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is most likely to buy a plane ticket or purchase a specific book.
Predictive models are not limited to the future. They can also be used to predict something that has already happened but we are not sure when. For example, data can be collected from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth.
Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it needs to learn and how to do it.
For example, if we want to predict who will be accepted or rejected for a home loan, we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or rejected for the home loan. The supervisors in this example are the features (salary, gender, credit score) used to predict the target feature (home loan).
The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, type of car, etc. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive; at least in machine learning, someone can only be classified as male or female, as current algorithms cannot place a person in both classes.
A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.
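As a minimal sketch of the two kinds of target features in R, using the built-in mtcars data (these variables are stand-ins, not part of the discussion above):

# classification target: transmission type (am) in the built-in mtcars data
class_model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
# numeric prediction target: miles per gallon
num_model <- lm(mpg ~ wt + hp, data = mtcars)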
Descriptive Models and Unsupervised Learning
Descriptive models summarize data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict, there are no supervisors or specific features that are used to predict a target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to the model as to what to do, per se.
Descriptive models are very useful for discovering patterns. For example, one descriptive model analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they would often buy beer for themselves and diapers for their small children. Stores used this information and placed beer and diapers next to each other in the stores. This led to an increase in profits, as men could now find beer and diapers together. This kind of relationship can only be found through machine learning techniques.
The model you use depends on what you want to know. Prediction is for, as you can guess, predicting. With this model, you are not as concerned with relationships as you are with understanding what specifically affects the target feature. If you want to explore relationships, then descriptive models can be of use. Machine learning models are tools that are appropriate for different situations.
Machine learning is about using data to take action. This post will explain common steps that are taken when using machine learning in the analysis of data. In general, there are six steps when applying machine learning.
We will go through each step briefly
Step One, Collecting Data
Data can come from almost anywhere. It can come from a database in a structured format like MySQL. It can also come unstructured, such as tweets collected from Twitter. However you get the data, you need to develop a way to clean and process it so that it is ready for analysis.
There are some distinct terms used in machine learning that those coming from traditional research may not be familiar with.
Step Two, Preparing Data
This is actually the most difficult step in machine learning analysis. It can take up to 80% of the time. With data coming from multiple sources and in multiple formats it is a real challenge to get everything where it needs to be for an analysis.
Missing data, duplicate records, and other issues need to be addressed as part of this process. Once these challenges are dealt with, it is time to explore the data.
Step Three, Explore the Data
Before analyzing the data, it is critical that the data is explored. This is often done visually in the form of plots and graphs, but also with summary statistics. You are looking for insights into the data and the characteristics of different features. You are also looking out for things that might be unusual, such as outliers. There are also times when a variable needs to be transformed because of issues with normality.
Step Four, Training a Model
After exploring the data, you should have an idea of what you want to know if you did not already know. Determining what you want to know helps you to decide which algorithm is most appropriate for developing a model.
To develop a model, we normally split the data into a training and testing set. This allows us to assess the model for its strengths and weaknesses.
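A common way to make such a split in R, sketched here with the built-in iris data as a hypothetical example:

set.seed(42)                                          # for reproducibility
train_index <- sample(nrow(iris), 0.7 * nrow(iris))   # a 70/30 split
train <- iris[train_index, ]
test <- iris[-train_index, ]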
Step Five, Assessing the Model
The strength of the model is determined. Every model has certain biases that limit its usefulness. How to assess a model depends on what type of model is developed and the purpose of the analysis. Whenever suitable, we want to try and improve the model.
Step Six, Improving the Model
Improvement can happen in many ways. You might decide to normalize the variables in a different way. Or you may choose to add or remove features from the model. Perhaps you may switch to a different model.
Success in data analysis involves having a clear path for achieving your goals. The steps presented here provide one way of tackling machine learning.
Logistic regression is used when the dependent variable is categorical with two choices. For example, if we want to predict whether someone will default on their loan, the dependent variable is categorical with two choices: yes they default or no they do not.
Interpreting the output of a logistic regression analysis can be tricky. Basically, you need to interpret the odds ratio. For example, if the results of a study say the odds of default are 40% higher when someone is unemployed, that is an increase in the likelihood of something happening. This is different from probability, which is what we normally use. Odds can take any value from zero to positive infinity, while probability is constrained to be anywhere from 0-100%.
We will now take a look at a simple example of logistic regression in R. We want to calculate the odds of defaulting on a loan. The dependent variable is "default", which can be either yes or no. The independent variables are "student", which can be yes or no, "income", which is how much the person made, and "balance", which is the amount remaining on their credit card.
Below is the coding for developing this model.
The first step is to load the “Default” dataset. This dataset is a part of the “ISLR” package. Below is the code to get started
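A minimal version of that code:

library(ISLR)
data("Default")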
It is always good to examine the data first before developing a model. We do this by using the ‘summary’ function as shown below.
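Based on the output that follows, the call would be:

summary(Default)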
##  default    student       balance           income
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
##                        Median : 823.6   Median :34553
##                        Mean   : 835.4   Mean   :33517
##                        3rd Qu.:1166.3   3rd Qu.:43808
##                        Max.   :2654.3   Max.   :73554
We now need to check our two continuous variables “balance” and “income” to see if they are normally distributed. Below is the code followed by the histograms.
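A minimal sketch of those histogram calls:

hist(Default$income)
hist(Default$balance)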
The 'income' variable looks fine, but there appear to be some problems with 'balance'. To deal with this, we will perform a square root transformation on the 'balance' variable and then examine it again with a histogram. Below is the code.
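Given the variable name used in the model below, the transformation would look something like this:

Default$sqrt_balance <- sqrt(Default$balance)
hist(Default$sqrt_balance)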
As you can see this is much better looking.
We are now ready to make our model and examine the results. Below is the code.
Credit_Model <- glm(default ~ student + sqrt_balance + income, family = binomial, data = Default)
summary(Credit_Model)
##
## Call:
## glm(formula = default ~ student + sqrt_balance + income, family = binomial,
##     data = Default)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2656  -0.1367  -0.0418  -0.0085   3.9730
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.938e+01  8.116e-01 -23.883  < 2e-16 ***
## studentYes   -6.045e-01  2.336e-01  -2.587  0.00967 **
## sqrt_balance  4.438e-01  1.883e-02  23.567  < 2e-16 ***
## income        3.412e-06  8.147e-06   0.419  0.67538
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1574.8  on 9996  degrees of freedom
## AIC: 1582.8
##
## Number of Fisher Scoring iterations: 9
The results indicate that the variables 'student' and 'sqrt_balance' are significant. However, 'income' is not significant. What all this means in simple terms is that being a student and having a balance on your credit card influence the odds of going into default, while your income makes no difference. Unlike multiple regression coefficients, the logistic coefficients require a transformation in order to interpret them. The statistical reason for this is somewhat complicated. As such, below is the code to interpret the logistic regression coefficients.
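The transformation is to exponentiate the coefficients, which turns them into odds ratios:

exp(coef(Credit_Model))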
##  (Intercept)   studentYes sqrt_balance       income
## 3.814998e-09 5.463400e-01 1.558568e+00 1.000003e+00
To explain this as simply as possible, you subtract 1 from each exponentiated coefficient to determine the percentage change in the odds. For example, if a person is a student, the odds of them defaulting are about 45% lower (0.546 - 1 = -0.45) than a non-student's when controlling for balance and income. Furthermore, for every 1 unit increase in the square root of the balance, the odds of default go up by about 56% when controlling for student status and income. Naturally, speaking in terms of a 1 unit increase in the square root of anything is confusing. However, we had to transform the variable in order to improve normality.
Logistic regression is one approach for predicting and modeling that involves a categorical dependent variable. Although the details are a little confusing, this approach is valuable when doing an analysis.
Sometimes the data that needs to be analyzed is not normally distributed. This makes it difficult to make inferences based on the results because one of the main assumptions of parametric statistical tests, such as ANOVA and the t-test, is a normal distribution of the data.
Fortunately, for nearly every parametric test there is a non-parametric counterpart. Non-parametric tests are tests that make no assumptions about the normality of the data. This means that non-normal data can still be analyzed with a certain measure of confidence in the results.
This post will look at a non-parametric test that is used to test the difference in means. For three or more groups, we use the Kruskal-Wallis Test. The Kruskal-Wallis Test is the non-parametric version of ANOVA.
We are going to use the “ISLR” package available in R to demonstrate the use of the Kruskal-Wallis test. After downloading this package, you need to load the “Auto” data. Below is the code to do all of this.
install.packages('ISLR')
library(ISLR)
data(Auto)
We now need to examine the structure of the dataset. This is done with the “str” function. Below is the code followed by the results.
str(Auto)
'data.frame':	392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
So we have 9 variables. We first need to find out if any of the continuous variables are non-normal, because this indicates that the Kruskal-Wallis test is needed. We will look at a histogram of the ‘displacement’ variable to see if it is normally distributed. Below is the code followed by the histogram.
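Something like the following will produce the histogram:

hist(Auto$displacement, main = "Displacement", xlab = "Displacement")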
This does not look normally distributed. We now need a factor variable with 3 or more groups. We are going to use the ‘origin’ variable. This variable indicates where the car was made: 1 = America, 2 = Europe, and 3 = Japan. However, this variable is currently a numeric variable. We need to change it into a factor variable. Below is the code for this.
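A one-line sketch of the conversion:

Auto$origin <- as.factor(Auto$origin)   # numeric to factor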
We will now use the Kruskal-Wallis test. The question we have is “is there a difference in displacement based on the origin of the vehicle?” The code for the analysis is below followed by the results.
> kruskal.test(displacement ~ origin, data = Auto)

	Kruskal-Wallis rank sum test

data:  displacement by origin
Kruskal-Wallis chi-squared = 201.63, df = 2, p-value < 2.2e-16
Based on the results, we know there is a difference among the groups. However, just like with ANOVA, we do not know where. We have to do a post-hoc test in order to determine where the difference in means is among the three groups.
To do this we need to install a new package and do a new analysis. We will download the “PMCMR” package and run the code below.
install.packages('PMCMR')
library(PMCMR)
data(Auto)
attach(Auto)
posthoc.kruskal.nemenyi.test(x=displacement, g=origin, dist='Tukey')
Here is what we did. We installed and loaded the ‘PMCMR’ package, attached the ‘Auto’ data, and ran the Nemenyi post-hoc test with ‘displacement’ as the response, ‘origin’ as the grouping variable, and the Tukey distribution for the comparisons.
Below are the results
	Pairwise comparisons using Tukey and Kramer (Nemenyi) test
	with Tukey-Dist approximation for independent samples

data:  displacement and origin

  1       2   
2 3.4e-14 -   
3 < 2e-16 0.51

P value adjustment method: none

Warning message:
In posthoc.kruskal.nemenyi.test.default(x = displacement, g = origin, :
  Ties are present, p-values are not corrected.
The results are listed in a table. When a comparison was made between groups 1 and 2, the results were significant (p < 0.0001). The same is true when groups 1 and 3 are compared (p < 0.0001). However, there was no difference between groups 2 and 3 (p = 0.51).
Do not worry about the warning message; the p-values can be corrected for ties if necessary.
Perhaps you are wondering what the actual mean for each group is. Below is the code with the results.
> aggregate(Auto[, 3], list(Auto$origin), mean)
  Group.1        x
1       1 247.5122
2       2 109.6324
3       3 102.7089
Cars made in America have an average displacement of 247.51, while cars from Europe and Japan have average displacements of 109.63 and 102.71, respectively. Below is the code for the boxplot followed by the graph.
boxplot(displacement ~ origin, data=Auto, ylab='Displacement', xlab='Origin')
title('Car Displacement')
This post provided an example of the Kruskal-Wallis test. This test is useful when the data is not normally distributed. The main problem with this test is that it is less powerful than an ANOVA. However, this is a problem with most non-parametric tests when compared to parametric tests.
Processes serve the purpose of providing people with clear step-by-step procedures to accomplish a task. In many ways, a process serves as a shortcut to solving a problem. Since data mining is a complex endeavor with an endless number of potential problems, several processes have been developed for completing a data mining project. In this post, we will look at the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The CRISP-DM is an iterative process that has the following six steps: (1) organizational understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment.
We will look at each step briefly.
1. Organizational Understanding

Step 1 involves assessing the current goals of the organization and the current context. This information is then used in deciding goals or research questions for data mining. Data mining needs to be done with a sense of purpose and not just to see what’s out there. Organizational understanding is similar to the introduction section of a research paper, in which you often include the problem, the questions, and even the intended audience of the research.
2. Data Understanding
Once a purpose and questions have been developed for data mining, it is necessary to determine what it will take to answer the questions. Specifically, the data scientist assesses the data requirements, description, collection, and assesses data quality. In many ways, data understanding is similar to the methodology of a standard research paper in which you assess how you will answer the research questions.
It is particularly common to go back and forth between steps one and two. Organizational understanding influences data understanding, which in turn influences organizational understanding.
3. Data Preparation
Data preparation involves cleaning the data. Another term for this is data munging. This is the main part of an analysis in data mining. Often the data comes in a very messy way, with information spread all over the place incoherently. This requires the researcher to deal with the problem carefully.
4. Modeling

A model provides a numerical explanation of something in the data. How this is done depends on the type of analysis that is used. As you develop various models, you are arriving at various answers to your research questions. It is common to move back and forth between steps 3 and 4, as the preparation affects the modeling, and the type of modeling you may want to develop may influence data preparation. The results of this step can also be seen as being similar to the results section of an empirical paper.
5. Evaluation

Evaluation is about comparing the results of the study with the original questions. In many ways, it is about determining the appropriateness of the answers to the research questions. This experience leads to ideas for additional research. As such, this step is similar to the discussion section of a research paper.
6. Deployment

The last step is when the results are actually used for decision-making or action. If the results indicate that a company should target people under 25, then, as an example, this is what they do.
The CRISP-DM process is a useful way to begin the data mining experience. The primary goal of data mining is providing evidence for making decisions and or taking action. This end goal has shaped the development of a clear process for taking action.
Dealing with large amounts of data has been a problem throughout most of human history. Ancient civilizations had to keep large amounts of clay tablets, papyrus, steles, parchments, scrolls etc. to keep track of all the details of an empire.
However, whenever it seemed as though there would be no way to hold any more information, a new technology would be developed to alleviate the problem. When people could not handle keeping records on stone, paper scrolls were invented. When scrolls were no longer practical, books were developed. When hand-copying books was too much, the printing press came along.
By the mid 20th century there were concerns that we would not be able to have libraries large enough to keep all of the books that were being developed. With this problem came the solution of the computer. One computer could hold the information of several dozen if not hundreds of libraries.
Now even a single computer can no longer cope with all of the information that is constantly being developed for just a single subject. This has led to computers working together in networks to share the task of storing information. With data spread across several computers, analyzing it becomes much more challenging. It is now necessary to mine for useful information the way people used to mine for gold in the 19th century.
Big data is data that is too large to fit within the memory of a single computer. Analyzing data that is spread across a network of databases takes skills different from traditional statistical analysis. This post will explain some of the characteristics of big data as well as data mining.
Big Data Traits
The three main traits of big data are volume, velocity, and variety. Volume describes the size of big data, which means data too big to be on only one computer. Velocity is about how fast the data can be processed. Lastly, variety refers to the different types of data. Common sources of big data include the following.
Data mining is the process of discovering a model in a big dataset. Through the development of an algorithm, we can find specific information that helps us to answer our questions. Generally, there are two ways to mine data and these are extraction and summarization.
Extraction is the process of pulling specific information from a big dataset. For example, if we want all the addresses of people who bought a specific book from Amazon the result would be an extraction from a big data set.
Summarization is reducing a dataset in order to describe it. We might do a cluster analysis in which similar data is grouped on a characteristic. For example, if we analyze all the books people ordered through Amazon last year, we might notice that one cluster of customers buys mostly religious books while another buys investment books.
Big data will only continue to get bigger. Currently, the response has been to simply use more computers and servers. As such, there is now a need for finding information across many computers and servers. This is the purpose of data mining, which is to find pertinent information that answers stakeholders’ questions.
Decision trees are useful for splitting data into smaller, distinct groups based on criteria you establish. This post will attempt to explain how to develop decision trees in R.
We are going to use the ‘Wage’ dataset found in the “ISLR” package, which contains the education, age, and wage variables used below. Once you load the package, you need to split the data into a training and testing set as shown in the code below. We want to divide the data based on education level, age, and wage.
library(ISLR)
library(ggplot2)
library(caret)
data("Wage")
inTrain <- createDataPartition(y=Wage$education, p=0.7, list=FALSE)
trainingset <- Wage[inTrain, ]
testingset <- Wage[-inTrain, ]
Visualize the Data
We will now make a plot of the data based on education as the groups and age and wage as the x and y variable. Below is the code followed by the plot. Please note that education is divided into 5 groups as indicated in the chart.
qplot(age, wage, colour=education, data=trainingset)

Create the Model
We are now going to develop the model for the decision tree. We will use age and wage to predict education as shown in the code below.
TreeModel <- train(education ~ age + wage, method='rpart', data=trainingset)
Create Visual of the Model
We now need to create a visual of the model. This involves installing the package called ‘rattle’. You can install ‘rattle’ separately yourself. After doing this, below is the code for the tree diagram.
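A sketch of that code, assuming ‘rattle’ installs cleanly; ‘fancyRpartPlot’ draws the rpart tree stored in the caret model’s ‘finalModel’ slot:

library(rattle)
fancyRpartPlot(TreeModel$finalModel)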
The chart shows the sequence of splits the model makes and the predicted education level in each leaf.
You can predict individual values in the dataset by using the ‘predict’ function with the test data as shown in the code below.
predict(TreeModel, newdata = testingset)
Prediction trees are a useful tool in data analysis for determining how well data can be divided into subsets. They also provide a visual of how to move through data sequentially based on characteristics in the data.
It is common in machine learning to look at the training set of your data visually. This helps you to decide what to do as you begin to build your model. In this post, we will make several different visual representations of data using datasets available in several R packages.
Once the ‘ISLR’, ‘ggplot2’, and ‘caret’ packages are installed in R, you want to look at a summary of the variables using the ‘summary’ function as shown below.
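Presumably the calls are:

library(ISLR)
data("College")
summary(College)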
You should get a printout of information about 18 different variables. Based on this printout, we want to explore the relationship between graduation rate “Grad.Rate” and student to faculty ratio “S.F.Ratio”. This is the objective of this post.
Next, we need to create a training and testing dataset. Below is the code to do this.
> library(ISLR); library(ggplot2); library(caret)
> data("College")
> PracticeSet <- createDataPartition(y=College$Enroll, p=0.7, list=FALSE)
> trainingSet <- College[PracticeSet, ]
> testSet <- College[-PracticeSet, ]
> dim(trainingSet); dim(testSet)
[1] 545  18
[1] 232  18
The explanation behind this code was covered in predicting with caret so we will not explain it again. You just need to know that the dataset you will use for the rest of this post is called “trainingSet”.
Developing a Plot
We now want to explore the relationship between graduation rates and the student-to-faculty ratio. We will be using the ‘ggplot2’ package to do this. Below is the code for this followed by the plot.
qplot(S.F.Ratio, Grad.Rate, data=trainingSet)
As you can see, there appears to be a negative relationship between student-faculty ratio and graduation rate. In other words, as the ratio of students to faculty increases, there is a decrease in the graduation rate.
Next, we will color the plots on the graph based on whether they are a public or private university to get a better understanding of the data. Below is the code for this followed by the plot.
> qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
It appears that private colleges usually have lower student-to-faculty ratios and also higher graduation rates than public colleges.
Add Regression Line
We will now plot the same data but will add a regression line. This will provide us with a visual of the slope. Below is the code followed by the plot.
> collegeplot <- qplot(S.F.Ratio, Grad.Rate, colour = Private, data=trainingSet)
> collegeplot + geom_smooth(method = 'lm', formula = y ~ x)
Most of this code should be familiar to you. We saved the plot as the variable ‘collegeplot’. In the second line of code, we add specific coding for ‘ggplot2’ to add the regression line. ‘lm’ means linear model and formula is for creating the regression.
Cutting the Data
We will now divide the data based on the student-faculty ratio into three equally sized groups to look for additional trends. To do this you need the “Hmisc” package. Below is the code followed by the table.
> library(Hmisc)
> divide_College <- cut2(trainingSet$S.F.Ratio, g=3)
> table(divide_College)
divide_College
[ 2.9,12.3) [12.3,15.2) [15.2,39.8] 
        185         179         181 
Our data is now divided into three equal sizes.
Lastly, we will make a box plot with our three equal size groups based on student-faculty ratio. Below is the code followed by the box plot
> CollegeBP <- qplot(divide_College, Grad.Rate, data=trainingSet, fill=divide_College, geom=c("boxplot"))
> CollegeBP
As you can see, the negative relationship continues even when the student-faculty ratio is divided into three equally sized groups. However, our information about private and public colleges is missing. To fix this we need to make a table as shown in the code below.
> CollegeTable <- table(divide_College, trainingSet$Private)
> CollegeTable
divide_College  No Yes
   [ 2.9,12.3)  14 171
   [12.3,15.2)  27 152
   [15.2,39.8] 106  75
This table tells you how many public and private colleges there are in each of the three student-faculty ratio groups. We can also get proportions by using the following.
> prop.table(CollegeTable, 1)
divide_College         No        Yes
   [ 2.9,12.3) 0.07567568 0.92432432
   [12.3,15.2) 0.15083799 0.84916201
   [15.2,39.8] 0.58563536 0.41436464
In this post, we found that there is a negative relationship between student-faculty ratio and graduation rate. We also found that private colleges have a lower student-faculty ratio and a higher graduation rate than public colleges. In other words, the status of a university as public or private moderates the relationship between student-faculty ratio and graduation rate.
You can probably tell by now that R can be a lot of fun with some basic knowledge of coding.
Prediction is one of the key concepts of machine learning. Machine learning is a field of study that is focused on the development of algorithms that can be used to make predictions.
Anyone who has shopped online has experienced machine learning. When you make a purchase at an online store, the website will recommend additional purchases for you to make. Often these recommendations are based on whatever you have purchased or whatever you click on while at the site.
There are two common forms of machine learning: unsupervised and supervised learning. Unsupervised learning involves using data that is not cleaned and labeled, and attempts are made to find patterns within the data. Since the data is not labeled, there is no indication of what is right or wrong.
Supervised machine learning uses cleaned and properly labeled data. Since the data is labeled, there is some indication of whether the model that is developed is accurate or not. If the model is incorrect, then you need to make adjustments to it. In other words, the model learns based on its ability to accurately predict results. However, it is up to the human to make adjustments to the model in order to improve its accuracy.
In this post, we will look at using R for supervised machine learning. The definition presented so far will make more sense with an example.
We are going to make a simple prediction about whether emails are spam or not using data from the “kernlab” package.
The first thing that you need to do is to install and load the “kernlab” package using the following code.
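A minimal version of that code:

install.packages('kernlab')
library(kernlab)
data(spam)   # the spam email dataset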
If you use the “View” function to examine the data you will see that there are several columns. Each column tells you the frequency of a word that kernlab found in a collection of emails. We are going to use the word/variable “money” to predict whether an email is spam or not. First, we need to plot the density of the use of the word “money” when the email was not coded as spam. Below is the code for this.
plot(density(spam$money[spam$type=="nonspam"]), col='blue',main="", xlab="Frequency of 'money'")
This is an advanced R post, so I am assuming you can read the code. The plot should look like the following.
As you can see, ‘money’ is not used very frequently in emails that are not spam in this dataset. However, you really cannot say this unless you compare how often ‘money’ appears in nonspam emails to how often it appears in spam emails. To learn this we need to add a second line that shows when the word ‘money’ is used in emails classified as spam. The code for this is below, with the prior code included.
plot(density(spam$money[spam$type=="nonspam"]), col='blue', main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
Your new plot should look like the following
If you inspect the plot closely, the point where the blue line for nonspam and the red line for spam separate is the cutoff for whether an email is spam or not. In other words, everything inside the arc is labeled correctly, while the information outside the arc is not.
The next code and graph show that this cutoff point is around 0.1. This means that any email that has on average more than 0.1 frequency of the word ‘money’ is spam. Below is the code and the graph with the cutoff point indicated by a black line.
plot(density(spam$money[spam$type=="nonspam"]), col='blue', main="", xlab="Frequency of 'money'")
lines(density(spam$money[spam$type=="spam"]), col="red")
abline(v=0.1, col="black", lwd=3)
Now we need to calculate the accuracy of using the word ‘money’ to predict spam. For our current example, we will simply use the “ifelse” function: if the frequency of ‘money’ is greater than 0.1, the email is classified as spam.
We then need to make a table to see the results. The code for the “ifelse” function and the table are below followed by the table.
predict <- ifelse(spam$money > 0.1, "spam", "nonspam")
table(predict, spam$type)/length(spam$type)
predict       nonspam        spam
  nonspam 0.596392089 0.266898500
  spam    0.009563138 0.127146273
Based on the table, our model accurately classifies emails about 72% (0.596 + 0.127) of the time based on whether the frequency of the word ‘money’ is greater than 0.1.
Of course, for this to be true machine learning we would repeat this process by trying to improve the accuracy of the prediction. However, this is an adequate introduction to this topic.
Survey design is used to describe the opinions, beliefs, behaviors, and/or characteristics of a population based on the results of a sample. This design involves the use of surveys that include questions, statements, and/or other ways of soliciting information from the sample. This design is used primarily for descriptive purposes but can at times be combined with other designs (correlational, experimental) as well. In this post, we will look at the following.
Types of Survey Design
There are two common forms of survey design which are cross-sectional and longitudinal. A cross-sectional survey design is the collection of data at one specific point in time. Data is only collected once in a cross-sectional design.
A cross-sectional design can be used to measure opinions/beliefs, compare two or more groups, evaluate a program, and or measure the needs of a specific group. The main goal is to analyze the data from a sample at a given moment in time.
A longitudinal design is similar to a cross-sectional design, with the difference being that longitudinal designs require collection over time. Longitudinal studies involve cohorts and panels in which data is collected over days, months, years, and even decades. Through doing this, a longitudinal study is able to expose trends over time in a sample.
Characteristics of Survey Design
There are certain traits that are associated with survey design. Questionnaires and interviews are a common component of survey design. The questionnaires can happen by mail, phone, internet, and in person. Interviews can happen by phone, in focus groups, or one-on-one.
The design of a survey instrument often includes personal, behavioral and attitudinal questions and open/closed questions.
Another important characteristic of survey design is monitoring the response rate. The response rate is the percentage of participants in the study compared to the number of surveys that were distributed. The response rate varies depending on how the data was collected. Normally, personal interviews have the highest rate while email requests have the lowest.
It is sometimes necessary to report the response rate when trying to publish. As such, you should at the very least be aware of what the rate is for a study you are conducting.
Surveys are used to collect data at one point in time or over time. The purpose of this approach is to develop insights into the population in order to describe what is happening or to be used to make decisions and inform practice.
One of the strongest points of R, in the opinion of many, is its various features for creating graphs and other visualizations of data. In this post, we begin to look at using the various visualization features of R. Specifically, we are going to do the following.
The ‘plot’ function is one of the basic options for graphing data. We are going to go through an example using the ‘islands’ dataset that comes with the R software. The ‘islands’ dataset contains data on the land mass of different islands. We want to plot the land mass of the seven largest islands. Below is the code for doing this.
islandgraph<-head(sort(islands, decreasing=TRUE), 7)
plot(islandgraph, main = "Land Area", ylab = "Square Miles")
text(islandgraph, labels=names(islandgraph), adj=c(0.5,1))
Here is what we did. We sorted the ‘islands’ data in decreasing order and used the ‘head’ function to keep the seven largest values. We then plotted these values with a title and a y-axis label, and used the ‘text’ function to label each point with the island’s name.
Below is what the graph should look like.
Changing Point Color and Shape in a Graph
For visual purposes, it may be beneficial to manipulate the color and appearance of several data points in a graph. To do this, we are going to use the ‘faithful’ dataset in R. The ‘faithful’ dataset indicates the length of eruption time and how long people had to wait for the eruption. The first thing we want to do is plot the data using the “plot” function.
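That first plot can be produced with:

plot(faithful)   # eruptions vs. waiting time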
As you see the data, there are two clear clusters. One contains data from 1.5-3 and the second cluster contains data from 3.5-5. To help people see this distinction, we are going to change the color and shape of the data points in the 1.5-3 range. Below is the code for this.
eruption_time<-with(faithful, faithful[eruptions < 3, ])
points(eruption_time, col = "blue", pch = 24)
Here is what we did. We created the variable ‘eruption_time’, which holds only the rows of ‘faithful’ where the eruption lasted less than 3 minutes. We then used the ‘points’ function to redraw those data points in blue with triangle markers (pch = 24).
In this post, we learned how to make a basic plot and how to change the color and shape of selected data points.
In this post, we will look at how to perform a simple regression using R. We will use a built-in dataset in R called ‘mtcars.’ There are several variables in this dataset but we will only use ‘wt’ which stands for the weight of the cars and ‘mpg’ which stands for miles per gallon.
Developing the Model
We want to know the association or relationship between the weight of a car and miles per gallon. We want to see how much of the variance of ‘mpg’ is explained by ‘wt’. Below is the code for this.
> Carmodel <- lm(mpg ~ wt, data = mtcars)
Here is what we did
Seeing the Results
Once you press enter, you will probably notice that nothing happens. The model ‘Carmodel’ was created, but the results have not been displayed. Below is some of the information you can extract from your model.
The ‘summary’ function is useful for pulling most of the critical information. Below is the code for the output.
> summary(Carmodel)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
We are not going to explain the details of the output (please see simple regression). The results indicate a model that explains 75% of the variance, or has an r-squared of 0.75. The model is statistically significant, and the equation of the model is y = 37.29 - 5.34x.
Plotting the Model
We can also plot the model using the code below
> coef_Carmodel <- coef(Carmodel)
> plot(mpg ~ wt, data = mtcars)
> abline(a = coef_Carmodel[1], b = coef_Carmodel[2])
Here is what we did. We saved the model’s coefficients in ‘coef_Carmodel’, plotted ‘mpg’ against ‘wt’, and then used ‘abline’ with the intercept and slope from the coefficients to draw the regression line.
From the visual, you can see that as weight increases there is a decrease in miles per gallon.
R is capable of much more complex models than the simple regression used here. However, understanding the coding for simple modeling can help in preparing you for much more complex forms of statistical modeling.
Within-group experimental design is the use of only one group in an experiment. This is in contrast to a between-group design, which involves two or more groups. Within-group design is useful when the number of participants is too low to split them into different groups.
There are two common forms of within-group experimental design: time series and repeated measures. Under time series there are interrupted time series and equivalent time series designs. Under repeated measures, there is only the repeated measures design. In this post, we will look at the following forms of within-group experimental design.
Interrupted Time Series Design
Interrupted time series design involves several pre-tests, followed by an intervention, and then several post-tests of one group. By measuring several times, many threats to internal validity are reduced, such as regression, maturation, and selection. The pre-test results are also used as covariates when analyzing the post-tests.
Equivalent Time Series Design
Equivalent time series design involves the use of a measurement followed by intervention followed by measurement etc. In many ways, this design is a repeated post-test only design. The primary goal is to plot the results of the post-test and determine if there is a pattern that develops over time.
For example, if you are tracking the influence of blog writing on vocabulary acquisition, the intervention is blog writing and the dependent variable is vocabulary acquisition. As the students write a blog, you measure them several times over a certain period. If a plot indicates an upward trend you could infer that blog writing made a difference in vocabulary acquisition.
Repeated Measures Design

Repeated measures is the use of several different treatments over time. Before each treatment, the group is measured. Each post-test is compared to the other post-tests to determine which treatment was the best.
For example, let’s say that you still want to assess vocabulary acquisition but want to see how blog writing and public speaking affect it. First, you measure vocabulary acquisition. Next, you employ the first intervention (blog writing) followed by a second assessment of vocabulary acquisition. Third, you use the public speaking intervention followed by a third assessment of vocabulary acquisition. You now have three sets of data to compare.
Within-group experimental designs are used when it is not possible to have several groups in an experiment. The benefits include needing fewer participants. However, one problem with this approach is the need to measure several times which can be labor intensive.
Analysis of variance (ANOVA) is used when you want to see if there is a difference in the means of three or more groups due to some form of treatment(s). In this post, we will look at conducting an ANOVA calculation using R.
We are going to use a dataset that is already built into R called ‘InsectSprays.’ This dataset contains information on different insecticides and their ability to kill insects. What we want to know is which insecticide was the best at killing insects.
In the dataset ‘InsectSprays’, there are two variables: ‘count’, which is the number of dead bugs, and ‘spray’, which is the spray that was used to kill the bugs. For the ‘spray’ variable there are six types labeled A-F. There are 72 total observations for the six types of spray, which comes to 12 observations per spray.
Building the Model
The code for calculating the ANOVA is below
> BugModel <- aov(count ~ spray, data=InsectSprays)
> BugModel
Call:
   aov(formula = count ~ spray, data = InsectSprays)

Terms:
                   spray Residuals
Sum of Squares  2668.833  1015.167
Deg. of Freedom        5        66

Residual standard error: 3.921902
Estimated effects may be unbalanced
Here is what we did. We used the ‘aov’ function to create the ANOVA model ‘BugModel’, with ‘count’ as the dependent variable and ‘spray’ as the independent variable.
Now we need to see if there are any significant results. To do this we will use the ‘summary’ function as shown in the script below
> summary(BugModel)
            Df Sum Sq Mean Sq F value Pr(>F)    
spray        5   2669   533.8    34.7 <2e-16 ***
Residuals   66   1015    15.4                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results indicate that there are significant results in the model as shown by the p-value being essentially zero (Pr(>F)). In other words, there is at least one mean that is different from the other means statistically.
We need to see what the means are overall for all sprays and for each spray individually. This is done with the following script
> model.tables(BugModel, type = 'means')
Tables of means
Grand mean

9.5 

 spray 
spray
     A      B      C      D      E      F 
14.500 15.333  2.083  4.917  3.500 16.667 
The ‘model.tables’ function tells us the means overall and for each spray. As you can see, it appears spray F is the most efficient at killing bugs with a mean of 16.667.
However, this table does not indicate statistical significance. For this we need to conduct a post-hoc Tukey test. This test will determine which means are significantly different from the others. Below is the script.
> BugSpraydiff <- TukeyHSD(BugModel)
> BugSpraydiff
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = count ~ spray, data = InsectSprays)

$spray
           diff        lwr       upr     p adj
B-A   0.8333333  -3.866075  5.532742 0.9951810
C-A -12.4166667 -17.116075 -7.717258 0.0000000
D-A  -9.5833333 -14.282742 -4.883925 0.0000014
E-A -11.0000000 -15.699409 -6.300591 0.0000000
F-A   2.1666667  -2.532742  6.866075 0.7542147
C-B -13.2500000 -17.949409 -8.550591 0.0000000
D-B -10.4166667 -15.116075 -5.717258 0.0000002
E-B -11.8333333 -16.532742 -7.133925 0.0000000
F-B   1.3333333  -3.366075  6.032742 0.9603075
D-C   2.8333333  -1.866075  7.532742 0.4920707
E-C   1.4166667  -3.282742  6.116075 0.9488669
F-C  14.5833333   9.883925 19.282742 0.0000000
E-D  -1.4166667  -6.116075  3.282742 0.9488669
F-D  11.7500000   7.050591 16.449409 0.0000000
F-E  13.1666667   8.467258 17.866075 0.0000000
There is a lot of information here. To make things easy, wherever the p adj is less than 0.05, there is a difference between those two means. For example, bug sprays F and E have a difference of 13.17 with a p adj of zero, so these two means are really different statistically. This chart also includes the lower and upper bounds of the confidence interval.
The results can also be plotted with the script below
> plot(BugSpraydiff, las=1)
Below is the plot
ANOVA is used to calculate if there is a difference of means among three or more groups. This analysis can be conducted in R using various scripts and codes.
Key components of qualitative research include hermeneutics and phenomenology. This post will examine these two terms and their role in qualitative research.
Hermeneutics is essentially a method of interpretation of a text. The word hermeneutics comes from Hermes, the Greek messenger god. As such, at least for the ancient Greeks, there was a connection between interpreting and serving as a messenger. Today, this term is most commonly associated with theology, such as biblical hermeneutics.
In relation to biblical hermeneutics, Augustine (354-430) developed a process of hermeneutics that was iterative. Through studying the Bible and the meaning of one’s own interpretations of the Bible, a person can understand divine truth. There was no need to look at the context, history, or anything else. Simply the Word and your interpretation of it.
In the 17th century, the Dutch philosopher Spinoza expanded on Augustine’s view of hermeneutics by stating that the text, its historical context, and even the author of a text, should be studied to understand the text. In other words, text plus context leads to truth.
By combining Augustine’s view of the role of the individual in hermeneutics with Spinoza’s contribution of context, we arrive at how interpretation happens in qualitative research.
In qualitative research, data interpretation (aka hermeneutics) involves the individual’s interpretation combined with the context that the data comes from. Both the personal interpretation and the context of the data influence each other.
The developments in hermeneutics led to the development of the philosophy called phenomenology. Phenomenology states that a phenomenon can only be understood subjectively (from a certain viewpoint) and intuitively (through thinking and finding hidden meaning).
In phenomenology, interpretation happens through describing events, analyzing an event, and by connecting a current experience to another one or by finding similarities among distinct experiences.
For a phenomenologist, there is a constant work of reducing several experiences into abstract constructs through an inductive approach. This is a form of theory building that is connected with several forms of qualitative research, such as grounded theory.
Hermeneutics has played an important role in qualitative research by influencing the development of phenomenology. The study of a phenomenon is for the purpose of seeing how context will influence interpretation.
In experimental research, there are two common designs. They are between and within group design. The difference between these two groups of designs is that between group involves two or more groups in an experiment while within group involves only one group.
This post will focus on between group designs. We will look at the following forms of between group design…
A true experiment is one in which the participants are randomly assigned to different groups. In a quasi-experiment, the researcher is not able to randomly assign participants to different groups.
Random assignment is important in reducing many threats to internal validity. However, there are times when a researcher does not have control over this, such as when they conduct an experiment at a school where classes have already been established. In general, a true experiment is always considered methodologically superior to a quasi-experiment.
Whether the experiment is a true experiment or a quasi-experiment, there are always two groups that are compared in the study. One group is the control group, which does not receive the treatment. The other group is called the experimental group, which receives the treatment of the study. It is possible to have more than two groups and several treatments, but the minimum for between-group designs is two groups.
Another characteristic that true and quasi-experiments have in common is the type of formats that the experiment can take. There are two common formats
A pre- and post-test design involves measuring the groups of the study before the treatment and after the treatment. The desire normally is for the groups to be the same before the treatment and to be statistically different after the treatment. The reason for them being different is, at least hopefully, the treatment.
For example, let’s say you have some bushes and you want to see if the fertilizer you bought makes any difference in the growth of the bushes. You divide the bushes into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You measure the height of the bushes before the experiment to be sure they are the same. Then, you apply the fertilizer to the experimental group, and after a period of time, you measure the heights of both groups again. If the fertilized bushes grow taller than the control group, you can infer that it is because of the fertilizer.
Post-test only design is when the groups are measured only after the treatment. For example, let’s say you have some corn plants and you want to see if the fertilizer you bought makes any difference in the amount of corn produced. You divide the corn plants into two groups, one that receives the fertilizer (experimental group) and one that does not (control group). You apply the fertilizer to the experimental group, and after a period of time, you measure the amount of corn produced by both groups. If the fertilized corn produces more, you can infer that it is because of the fertilizer. You never measure the corn beforehand because the plants had not produced any corn yet.
Factorial design involves the use of more than one treatment. Returning to the corn example, let’s say you want to see not only how fertilizer affects corn production but also how the amount of water the corn receives affects production as well.
In this example, you are trying to see if there is an interaction effect between fertilizer and water. When water and fertilizer are increased does production increase, is there no increase, or if one goes up and the other goes down does that have an effect?
Between-group designs such as true and quasi-experiments provide a way for researchers to establish cause and effect. Pre- and post-test formats are employed, as well as factorial designs, to establish relationships between variables.
There are times when conducting research that you want to know if there is a difference in categorical data. For example, is there a difference in the number of men who have blue eyes and the number who have brown eyes? Or is there a relationship between gender and hair color? In other words, is there a difference in the count of a particular characteristic, or is there a relationship between two or more categorical variables?
For our example, we are going to use data that is already available in R called “HairEyeColor”. Below is the data
> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
As you can see, the data comes in the form of a list and shows hair and eye color for men and women in separate tables. The current data is unusable for us in terms of calculating differences. However, by using the ‘margin.table’ function we can make the data usable, as shown in the example below.
> HairEyeNew <- margin.table(HairEyeColor, margin = c(1,2))
> HairEyeNew
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
Here is what we did. We created the variable ‘HairEyeNew’ and stored the information from ‘HairEyeColor’ in one table using the ‘margin.table’ function. Setting the margin to c(1,2) collapses the table over sex, keeping the hair and eye color dimensions.
Now all our data from the list is combined into one table.
We now want to see if there is a particular relationship between hair and eye color that is more common. To do this, we calculate the chi-square statistic as in the example below.
> chisq.test(HairEyeNew)

	Pearson's Chi-squared test

data:  HairEyeNew
X-squared = 138.29, df = 9, p-value < 2.2e-16
The test tells us that one or more of the relationships are more common than others within the table. To determine which relationship between hair and eye color is more common than the rest we will calculate the proportions for the table as seen below.
> HairEyeNew/sum(HairEyeNew)
       Eye
Hair          Brown        Blue       Hazel       Green
  Black 0.114864865 0.033783784 0.025337838 0.008445946
  Brown 0.201013514 0.141891892 0.091216216 0.048986486
  Red   0.043918919 0.028716216 0.023648649 0.023648649
  Blond 0.011824324 0.158783784 0.016891892 0.027027027
As you can see from the table, brown hair and brown eyes are the most common combination (0.20 or 20%), followed by blond hair and blue eyes (0.16 or 16%).
The chi-square test serves to determine differences among categorical data. This tool is useful for calculating the potential relationships among non-continuous variables.
Epistemology is the study of the nature of knowledge. It deals with such questions as whether there is truth or absolute truth and whether there is one way or many ways to see something. In research, epistemology manifests itself in several views. The two extremes are positivism and interpretivism.
Positivism asserts that all truth can be verified and proven scientifically and can be measured and/or observed. This position discounts religious revelation as a source of knowledge, as it cannot be verified scientifically. The positivist position is also derived from realism in that there is an external world out there that needs to be studied.
For researchers, positivism is the foundation of quantitative research. Quantitative researchers try to be objective in their research; they try to avoid coming into contact with whatever they are studying, as they do not want to disturb the environment. One of the primary goals is to make generalizations that are applicable in all instances.
Quantitative researchers normally have a desire to test a theory. In other words, they develop one example of what they believe is a truth about a phenomenon (a theory) and test the accuracy of this theory with statistical data. The data determines the accuracy of the theory and the changes that need to be made.
By the late 19th and early 20th centuries, people were looking for alternative ways to approach research. One new approach was interpretivism.
Interpretivism is the complete opposite of positivism in many ways. Interpretivism asserts that there is no absolute truth but relative truth based on context. There is no single reality but multiple realities that need to be explored and understood.
For interpretivists, there is a fluidity in their methods of data collection and analysis. These two steps are often iterative in the same design. Furthermore, interpretivists see themselves not as outside the reality but as players within it. Thus, they often will share not only what the data says but also their own view and stance about it.
Qualitative researchers are interpretivists. They spend time in the field getting close to their participants through interviews and observations. They then interpret the meaning of these communications to explain a local, context-specific reality.
While quantitative researchers test theories, qualitative researchers build theories. Qualitative researchers gather data and interpret it by developing a theory that explains the local reality of the context. Since the sampling is normally small in qualitative studies, the theories often do not apply broadly.
There is little purpose in debating which view is superior. Both positivism and interpretivism have their place in research. What matters more is to understand your position and preference and to be able to articulate it in a reasonable manner. It is often not what a person does or believes that is important but why they believe or do what they do.
In experimental research design, internal validity is the appropriateness of the inferences made about cause and effect relationships between the independent and dependent variables. If there are threats to internal validity, it may mean that the cause and effect relationship you are trying to establish is not real. In general, there are three categories of threats to internal validity, which are…
There are several forms of threats to internal validity that relate to participants. Below is a list
A history threat to internal validity is the problem of the passage of time from the beginning to the end of the experiment. During this elapsed time, the groups involved in the study may have different experiences. These different experiences are history threats. One way to deal with this threat is to be sure that the conditions of the experiment are the same for all groups.
Maturation threat is the problem of how people change over time during an experiment. These changes make it hard to infer if the results of a study are because of the treatment or because of maturation. One way to deal with this threat is to select participants who develop in similar ways and speed.
Regression threat is the action of the researcher selecting extreme cases to include in their sample. Eventually, these cases regress to the mean, which impacts the results of the pretest or posttest. One option for overcoming this problem is to avoid outliers when selecting the sample.
Selection bias is the poor habit of picking people in a non-random way for an experiment. Examples of this include choosing mostly ‘smart’ people for an experiment, or working with only petite women for a study on diet and exercise. Random selection is the strongest way to deal with this threat.
Mortality is the loss of participants in a study. It is common for participants in a study to drop out and quit for many reasons. This leads to a decrease in the sample size, which weakens the statistical interpretation. Dealing with this requires using larger sample sizes as well as comparing the data of dropouts with that of those who completed the study.
Threats to internal validity can ruin a paper that has not carefully planned for how these threats work together to skew results. Researchers need to have an idea of what threats are out there as well as strategies that can alleviate them.
Comparing groups is a common goal in statistics. This is done to see if there is a difference between two groups. Understanding the difference can lead to insights based on statistical results. In this post, we will examine the following statistical tests for comparing samples.
T-test & Wilcoxon Test
The t-test indicates if there is a significant statistical difference between two groups. This is useful when you want to know whether two groups differ on some continuous variable. For example, if you are measuring the height of men and women and a t-test shows that men are taller, you can state that gender influences height, because the only difference between the two groups in this example is their gender.
Below is an example of conducting a t-test in R. In the example, we are looking at whether there is a difference in body temperature between beavers that are active and beavers that are not.
> t.test(temp ~ activ, data = beaver2)

	Welch Two Sample t-test

data:  temp by activ
t = -18.548, df = 80.852, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.8927106 -0.7197342
sample estimates:
mean in group 0 mean in group 1 
       37.09684        37.90306 
Here is what happened. We used the ‘t.test’ function with ‘temp’ as the dependent variable and ‘activ’ as the grouping variable. The p-value is essentially zero, and active beavers (group 1) have a higher mean temperature (37.90) than inactive beavers (group 0, 37.10).
The t-test assumes that the data is normally distributed. When normality is a problem, it is possible to use the Wilcoxon test instead. Below is the script for the Wilcoxon test using the same example.
> wilcox.test(temp ~ activ, data = beaver2)

	Wilcoxon rank sum test with continuity correction

data:  temp by activ
W = 15, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
A closer look at the output indicates largely the same results. Instead of the t-statistic the W-statistic is used, but the p-value is the same for both tests.
A paired t-test is used when you want to compare how the same group of people respond to different interventions. For example, you might use this for a before and after experiment. We will use the ‘sleep’ data in R to compare a group of people when they receive different types of sleep medication. The script is below
> t.test(extra ~ group, data = sleep, paired = TRUE)

	Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences 
                  -1.58 
Here is what happened. We ran the same ‘t.test’ function but set ‘paired = TRUE’, since the same people received both medications. The difference is significant (p = 0.003), with a mean difference of -1.58 hours of extra sleep between the two conditions.
Comparing samples in R is a simple process once you understand what you want to do. With this knowledge, the scripts and output are not too challenging even for beginners.
Philosophy is a term that is commonly used but hard to define. To put it simply, philosophy explains an individual’s or a group’s worldview in general or in a specific context. Such questions as the nature of knowledge, reality, and existence are questions that philosophy tries to answer. There are different schools of thought on these questions, and these are what we commonly call philosophies.
In this post, we will try to look at ontology, which is the study of the nature of reality. In particular, we will define it as well as explain its influence on research.
Ontological realism believes that reality is objective. In other words, there is one objective reality that is external to each individual person. We are in a reality and we do not create it.
Ontological idealism is the opposite extreme. This philosophy states that there are multiple realities and each depends on the person. My reality is different from your reality, and each of us builds our own reality.
Ontological realism is one of the philosophical foundations for quantitative research. Quantitative research is a search for an objective reality that accurately explains whatever is being studied.
For qualitative researchers, ontological idealism is one of their philosophical foundations. Qualitative researchers often support the idea of multiple realities. For them, since there is no objective reality, it is necessary to come into contact with people to explain their reality.
Something that has been alluded to but not stated specifically is the role of independence and dependence of individuals. Regardless of whether someone ascribes to ontological realism or idealism, there is the question of whether people are independent of reality or dependent on it. The level of independence and dependence contributes to other philosophies such as objectivism, constructivism, and pragmatism.
Objectivism, Constructivism and Pragmatism
Objectivism is the belief that there is a single reality that is independent of the individuals within it. Again, this is the common assumption of quantitative research. At the opposite end we have constructivism, which states that there are multiple realities and they are dependent on the individuals who make each respective reality.
Pragmatism supports the idea of a single reality with the caveat that it is true if it is useful and works. The application of the idea depends upon the individuals, which pushes pragmatism into the realm of dependence.
From this complex explanation of ontology and research comes the following implications
A key component of experimental design involves making decisions about the manipulation of the treatment conditions. In this post, we will look at the following traits of treatment conditions
Lastly, we will examine group comparison.
Among the most common independent variables in experimental design are treatment and measured variables. Treatment variables are manipulated by the researcher. For example, if you are looking at how sleep affects academic performance, you may manipulate the amount of sleep participants receive in order to determine the relationship between academic performance and sleep.
Measured variables are variables that are measured but not manipulated by the researcher. Examples include age, gender, height, weight, etc.
An experimental treatment is the intervention of the researcher to alter the conditions of an experiment. By keeping all other factors constant and manipulating only the experimental treatment, it allows for the potential establishment of a cause-effect relationship. In other words, the experimental treatment is a term for the use of a treatment variable.
Treatment variables usually have different conditions or levels in them. For example, if I am looking at sleep’s effect on academic performance, I may manipulate the treatment variable by creating several levels of the amount of sleep, such as high, medium, and low amounts of sleep.
Intervention is a term that means the actual application of the treatment variables. In other words, I break my sample into several groups and cause one group to get plenty of sleep, the second group to lack a little bit of sleep, and the last group to get none. Experimental treatment and intervention mean the same thing.
The outcome measure is the process of measuring the outcome variable. In our example, the outcome variable is academic performance.
Experimental design often focuses on comparing groups. Groups can be compared between groups and within groups. Returning to the example of sleep and academic performance, a between group comparison would be to compare the different groups based on the amount of sleep they received. A within group comparison would be to compare the participants who received the same amount of sleep.
Often there are at least three groups in an experimental study: the control, comparison, and experimental groups. The control group receives no intervention or treatment variable. This group often serves as a baseline for comparing the other groups.
The comparison group is exposed to everything but the actual treatment of the study. It is highly similar to the experimental group except for the experience of the treatment. Lastly, the experimental group experiences the treatment of the study.
Experiments involve treatment conditions and groups. As such, researchers need to understand their options for treatment conditions as well as what types of groups they should include in a study.
Normal distribution is an important term in statistics. When we say normal distribution, we are speaking of the traditional bell curve concept. Normal distribution is important because it is often an assumption of inferential statistics that the distribution of data points is normal. Therefore, one of the first things you do when analyzing data is to check for normality.
In this post, we will look at the following ways to assess normality.
Checking Normality by Graph
The easiest and crudest way to check for normality is visually through the use of histograms. You simply look at the histogram and determine how closely it resembles a bell.
To illustrate this, we will use the 'beaver2' dataset that is already loaded into R. This dataset contains four variables, "day", "time", "temp", and "activ", for data about beavers. Day indicates what day it was, time indicates what time it was, temp is the temperature of the beavers, and activ is whether the beavers were active when their temperature was taken. We are going to examine the normality of the temperature of active and inactive beavers. Below is the code
> library(lattice)
> histogram(~temp | factor(activ), data = beaver2)
As you look at the histograms, you can say that they are somewhat normal, although the peaks of the data are a little high in both. Group 1 is more normal than Group 0. The problem with visual inspection is its lack of accuracy in interpretation. This is partially solved by using QQ plots.
Checking Normality by Plots
QQ plots are useful for comparing your data with a normally distributed theoretical dataset. The QQ plot includes the line of a normal distribution and the data points of your data for comparison. The more closely your data follow the line, the more normal they are. Below is the code for doing this with our beaver information.
> qqnorm(beaver2$temp[beaver2$activ == 1], main = 'Active')
> qqline(beaver2$temp[beaver2$activ == 1])
Here is what we did: the 'qqnorm' function created the QQ plot of the temperatures of the active beavers, and the 'qqline' function added the line of a normal distribution for comparison.
Going by sight again, the data still look pretty good. However, one last test will give a more formal assessment of whether the dataset is normal or not.
Checking Normality by Test
The Shapiro-Wilk normality test assesses the probability that the data come from a normal distribution. The lower the p-value, the less likely it is that the data are normally distributed. Below are the code and results for the Shapiro-Wilk test.
> shapiro.test(beaver2$temp[beaver2$activ==1])

        Shapiro-Wilk normality test

data:  beaver2$temp[beaver2$activ == 1]
W = 0.98326, p-value = 0.5583
Here is what happened: the p-value of 0.5583 is well above the common 0.05 cutoff, so there is no evidence against the normality of the temperatures of the active beavers.
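The same test can be run for the inactive beavers by changing the subset condition; below is a sketch of the call.

> shapiro.test(beaver2$temp[beaver2$activ == 0])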
It is necessary to always check the normality of data before data analysis. The tips presented here provide a framework for accomplishing this.
In a previous post, we began a discussion on experimental design. In this post, we will begin a discussion on the characteristics of experimental design. In particular, we will look at the following
After developing an appropriate sampling method, a researcher needs to randomly assign individuals to the different groups of the study. One of the main reasons for doing this is to remove the bias of individual differences in all groups of the study.
For example, if you are doing a study on intelligence, you want to make sure that all groups have the same characteristics of intelligence. This helps the groups to be equated, or to be the same. It prevents people from saying that the differences between groups arose because the groups themselves were different rather than because of the treatment.
Control Over Extraneous Variables
Random assignment leads directly to the concern of controlling extraneous variables. Extraneous variables are any factors that might influence the cause-and-effect relationship that you are trying to establish. These other factors confound, or confuse, the results of a study. There are several methods for dealing with this, as shown below
A pre-test/post-test design allows a researcher to compare the measurement of something before the treatment and after the treatment. The assumption is that any difference between the scores before and after is due to the treatment. Doing both tests takes into account the confounding of the different contexts of the setting and individual characteristics.
This approach involves selecting people who are highly similar on the particular trait that is being measured. It removes the problem of individual differences when interpreting the results. The more similar the subjects in the sample are, the better controlled their traits are.
The use of covariates is a statistical approach in which controls are placed on the dependent variable through statistical analysis. The influence of other variables is removed from the explained variance of the dependent variable. Covariates help to explain more about the relationship between the independent and dependent variables.
This is a difficult concept to understand. However, the point is that you use covariates to explain in greater detail the relationship between the independent and dependent variables by removing other variables that might explain the relationship.
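As a rough sketch of how this looks in practice, an analysis that includes a covariate might be set up in R as follows; the dataset 'my_study' and its columns 'posttest', 'treatment', and 'pretest' are hypothetical and only show the structure.

# Hypothetical model: posttest scores explained by the treatment group,
# with pretest scores included as a covariate to remove their influence
model <- lm(posttest ~ treatment + pretest, data = my_study)
summary(model)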
Matching is deliberately, rather than randomly, assigning subjects to various groups. For example, if you are looking at intelligence, you might place high achievers in both groups of the study. By placing high achievers in both groups, you cancel out their difference.
Experimental design involves the random assignment of individuals in order to control for individual differences in a sample. The goal of experimental design is to make sure that the sample groups are mostly the same in a study. This allows for concluding that what happened was due to the treatment.
A two-way table is used to describe two or more categorical variables at the same time. The difference between a two-way table and a frequency table is that a two-way table tells you the number of subjects that share two or more variables in common, while a frequency table tells you the number of subjects that share one variable.
For example, a frequency table might describe gender. In such a table, you only know how many subjects are male or female. The only variable involved is gender. From a frequency table, you would learn some of the following
In a two-way table, you might look at gender and marital status. In such a table you would be able to learn several things
As such, there is a lot of information in a two-way table. In this post, we will look at the following
Creating a Table
In the example, we are going to look at two categorical variables. One variable is gender and the other is marital status. For gender, the choices are "Male" and "Female". For marital status, the choices are "Married" and "Single". Below is the code for developing the table.
Marriage_Study <- matrix(c(34, 20, 19, 42), ncol = 2)
colnames(Marriage_Study) <- c('Male', 'Female')
rownames(Marriage_Study) <- c('Married', 'Single')
Marriage_table <- as.table(Marriage_Study)
print(Marriage_table)
There has already been a discussion on creating matrices in R. Therefore, the details of this script will not be explained here.
If you type this correctly and run the script you should see the following
        Male Female
Married   34     19
Single    20     42
This table tells you about married and single people broken down by their gender. For example, 34 males are married.
Adding Margins and Calculating Proportions
A useful addition to a table is to add the margins. The margins tell you the total number of subjects in each row and column of a table. To do this in R use the ‘addmargins’ function as indicated below.
> addmargins(Marriage_table)
        Male Female Sum
Married   34     19  53
Single    20     42  62
Sum       54     61 115
We now know the total number of married people, single people, males, and females, in addition to the information we already knew.
One more useful piece of information is to calculate the proportions. This will allow us to know what percentage of each two-way possibility makes up the table. To do this we will use the “prop.table” function. Below is the script
> prop.table(Marriage_table)
             Male    Female
Married 0.2956522 0.1652174
Single  0.1739130 0.3652174
As you can see, we now know the proportions of each category in the table.
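If you want proportions within each row or each column rather than of the whole table, the 'prop.table' function also accepts a 'margin' argument; a short sketch is below.

> prop.table(Marriage_table, margin = 1)  # proportions within each row (marital status)
> prop.table(Marriage_table, margin = 2)  # proportions within each column (gender)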
This post provided information on how to construct and manipulate data that is in a two-way table. Two-way tables are a useful way of describing categorical variables.
Experimental design is now considered a standard methodology in research. However, this now-classic design has not always been a standard approach. In this post, we will look at the following
The word experiment is derived from the word experience. When conducting an experiment, the researcher assigns people to have different experiences. He then determines if the experience he assigned people to had an effect on some outcome. For example, if I want to know if the experience of sunlight affects the growth of plants, I may develop two different experiences
The outcome is the growth of the plants. By giving the plants different experiences of sunlight, I can determine if sunlight influences the growth of plants.
History of Experiments
Experiments have been around informally since the 10th century, with work done in the field of medicine. The use of experiments as known today began in the early 20th century in the field of psychology. By the 1920s, group comparison had become an established characteristic of experiments. By the 1930s, random assignment was introduced. By the 1960s, various experimental designs had been codified and documented. By the 1980s, there was literature coming out that addressed threats to validity.
Since the 1980s, experiments have become much more complicated with the development of more advanced statistical software programs. Despite all of the new complexity, simple experimental designs are normally easier to understand.
When to Conduct Experiments
Experiments are conducted to attempt to establish a cause and effect relationship between independent and dependent variables. You try to create a controlled environment in which you provide the experience or independent variable(s) and then measure how they affect the outcome or dependent variable.
Since the setting of the experiment is controlled, you can argue that only the experience influenced the outcome. Of course, in reality, it is difficult to control all the factors in a study. The real goal is to try to limit the effect that these other factors have on the outcomes of a study.
Despite their long informal history, experiments are relatively new in research. This design has grown and matured over the years to become a powerful method for determining cause and effect. Therefore, researchers should be aware of this approach for use in their studies.
A correlation indicates the strength of the relationship between two or more variables. Plotting correlations allows you to see if there is a potential relationship between two variables. In this post, we will look at how to plot correlations with multiple variables.
In R, there is a built-in dataset called ‘iris’. This dataset includes information about different types of flowers. Specifically, the ‘iris’ dataset contains the following variables
You can confirm this by inputting the following script
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
We now want to examine the relationship that each of these variables has with each other. In other words, we want to see the relationship of
We are now going to plot all of these variables at the same time by using the 'plot' function. We also need to tell R not to include the "Species" variable. This is done by adding a subset code to the script. Below is the code to complete this task.
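> plot(iris[-5])  # the subset code '[-5]' drops the fifth column, "Species"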
Here is what we did: we plotted the 'iris' data with the 'plot' function while using the subset code '[-5]' to drop the fifth column, "Species". The result is a matrix of scatterplots for the four remaining continuous variables.
The variable names are placed diagonally from left to right. The x-axis of a plot is determined by variable name in that column. For example,
The y-axis is determined by the variable that is in the same row as the plot. For example,
As you can see, this is the same information. We will now look at a few examples of plots
Hopefully, you can see the pattern. The plots above the diagonal are mirrors of the ones below. If you are familiar with correlational matrices this should not be surprising.
After a visual inspection, you can calculate the actual statistical value of the correlations. To do so use the script below and you will see the table below after it.
> cor(iris[-5])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
As you can see, there are many strong relationships between the variables. For example, "Petal.Width" and "Petal.Length" have a correlation of .96, which is almost perfect. This means the two variables move together almost perfectly: flowers with wider petals almost always have longer petals as well. Keep in mind that a correlation measures the strength of a relationship, not how many units one variable changes when the other changes by one unit.
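If you want to test whether one of these correlations is statistically significant, the 'cor.test' function can be used; a minimal sketch for petal width and petal length is below.

> cor.test(iris$Petal.Width, iris$Petal.Length)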
Plots help you to see the relationship between two variables. After visual inspection, it is beneficial to calculate the actual correlation.