Decision trees are common tool used in data science and machine learning. In the video below we will learn how to develop a simple decision tree using Python.

# Category Archives: Research

# Developing Conceptual and Operational Definitions for Research

Defining terms is one of the first things required when writing a research paper. However, it is also one of the hardest things to do as we often know what we want to study intuitively rather than literally. This post will provide guidance in the following

- Developing conceptual definitions
- Determining operational definitions
- Understanding the measurement model

Each of the ideas above is fundamental to developing coherent research papers.

**Concepts**

A concept is a mental construct or a tool used to understand the world around us. An example of a concept would be intelligence, humor, motivation, desire. These terms have meaning, but they cannot be seen or observed directly. You cannot pick up intelligence, buy humor, or weigh either of these. However, you can tell when someone is intelligent or has a sense of humor.

This is because constructs are observed indirectly through behaviors, which provide evidence of the construct. For example, someone demonstrates intelligence through their academic success, how they speak, etc. A person can demonstrate humor by making others laugh through what they say. Concepts represent things around us that we want to study as researchers.

**Defining Concepts**

To define a concept for the purpose of research requires the following three things

- A manner in which to measure the concept indirectly
- A unit of analysis
- Some variation among the unit of analysis

The criteria listed above is essentially a definition of a conceptual definition. Below is an example of a conceptual definition of academic dishonesty

Below is a breakdown of this definition

*Academic dishonesty is the extent to which individuals exhibit a disregard towards educational norms of scholarly integrity.*

- Measurement: exhibit a disregard towards educational norms of scholarly integrity.
- Unit of analysis: individual
- Variation: Extent to which

It becomes much easier to shape a research study with these three components.

**Conceptual Definition Template**

There is also a template readily available in books and the internet to generate a conceptual definition. Below is one example.

*The concept of ____________**_** is defined as the extent to which*

*_________________________** exhibit the characteristic(s) of **__**_____________.*

Here is a revised version of our conceptual defintion of academic dishonesty

*The concept of **academic dishonesty **is defined as the ewxtent to whcih **invidivudals** exhibit the characteristic of * *disregard towards educational norms of scholarly integrity**.*

The same three components are there. The wording is mostly the same, but having a template such as this can really save them time in formulating a study. It also helps make things clearer for them as they go forward with a project.

**Operational Definition**

Once a concept has been defined, it must next be operationalized. The operational definition indicates how a concept will be measured quantitatively. This means that a researcher must specify at least one metric. Below is an example using academic dishonesty again.

C*onceptual Definition: Academic dishonesty is the extent to which an individual exhibits a disregard towards educational norms of scholarly integrity.*

*Operational Definition: Survey Items*

- It is okay to cheat
- It is okay to turn in someone else’s work as my own

In the example above, academic dishonesty was operationalized using survey items. In other words, we will measure people’s opinions about academic dishonesty by having them respond to survey items.

Measurement error happens when there is a disconnect between the conceptual definition and the measurement method. It can be hard to detect this, so students need to be careful when developing a study.

**Measurement Models**

A concept is not measured directly, as has already been mentioned. This means that when it is time to analyze our data, our contract is a latent or unobserved variable. The items on the survey are observed because people gave us this information directly. This means that the survey items are observed variables.

The measurement model links the latent variables with the observed variables statistically. A strong measurement model indicates that the observed variables correlate with the underlying latent variable or construct.

For example, academic dishonesty has been the latent variable example of this entire post. The survey items “it’s okay to cheat” and “it’s okay to turn in someon else’s work as my own” are observed variables. Using statistical tools, we can check if these observed variables are associated with our concept of academic dishonesty.

**Conclusion**

Defining concepts is one of the more challenging aspects of conducting research. It requires a researcher to know what they are trying to study and how to measure it. For students, this is challenging because articulating ideas in this manner is often not done in everyday life.

# Types of Experiments

This post will provide some basic ideas for developing experiments. The process of doing valid experiments is rather challenging as one misstep can make your results invalid. Therefore, care is needed when attempting to set up an experiment

**Definition**

An experiment is a process in which changes are made to input variables to see how they affect the output variable(s). The inputs are called controllable variables, while the outputs are called response variables. Other variables that cannot be controlled are called uncontrollable variables.

When developing an experiment, the experimenter’s approach or plan for experimenting is called the strategy of experimentation. Extensive planning is necessary to conduct an experiment, while the actual data collection is often not that difficult.

**Best Guess Approach**

There are several different strategies for experimentation. The best-guess approach involves manipulating input variables based on prior results from the output variable. For example, if you are teaching a math class and notice that students score better when they work in groups in the morning compared to working in the afternoon. You may switch to group work in the morning and see if lectures may further increase performance.

This guesswork can be highly efficient if you are familiar with the domain in which you are doing the experiments. However, if the guess is wrong, you have to continue guessing, and this can go on for a long time.

**One-Factor-At-A-Time**

Another strategy of experimentation is the one-factor-at-a-time (OFAT) approach. You begin by having a baseline for each factor (variable) and then vary each variable to see how it affects the output. For example, you can switch whether students study in the morning or even and see how it affects performance. Then you might test whether group work and individual work affect scores.

The biggest weakness with this is that you can see interactions between variables. Interactions are an instance in which one factor does not produce the same results at a different level of another factor. Interactions can be hard to understand, but sometimes when two factors are mapped at the same time with the response variable, the lines cross to indicate that there is an interaction.

**Factorial Experiments**

Factorial experiments involve varying factors together. For example, a 2^2 factorial design means four combinations of experiments with two variables are varied, and one response variable with four possible combinations of experiments. Often these types of experiments are drawn as a square, as shown below.

Each point represents a different combination of the two factors. The calculation of this involves subtracting the means of the variable or factor on the x-axis. If we run each combination twice, we would calculate the difference, as shown below.

The more significant this difference, the more likely there is a strong effect based on the independent variables in the model.

When the number of combinations becomes large and complicated to manage, it may not be practical to run all possible combinations. In this situation, an experimenter will use a fractional factorial experiment in which only some of the combinations are used. For example, if 32 experiments are possible (2^5), maybe only 12 of them are conducted. The calculation is the same as above, just with more groups to compare.

**Conclusion**

Experiments are a practical way to determine the best combination of factors or variables for a given output variable(s). The majority of the time is spent planning and designing the experiment, with the actual data collection being straightforward.

# Different Views of Research

People have been doing research formally or informally since the beginning of time. We are always trying to figure out how to do this or why something is the way that it is. In this post, we will look at different ways to view and or conduct research. These perspectives are empirical, theoretical, and analytical.

**Empirical **

Perhaps the most common form or approach to doing research is the empirical approach. This approach involves observing reality and developing hypotheses and theories based on what was observed. This is an inductive approach to doing research because the researcher starts with their observations to make a theory. In other words, you start with examples and abstract them to theories.

An example of this is found in the work of Charles Darwin and evolution. Darwin collected a lot of examples and observations of birds during his travels. Based on what he saw he inferred that animals evolved over time. This was his conclusion based on his interpretation of the data. Later, other researchers tried to further bolster Darwin’s theory by finding mathematical support for his claims.

The order in which empirical research is conducted is as follows…

- Identify the phenomenon
- Collect data
- Abstraction/model development
- Hypothesis
- Test

You can see that hypotheses and theory are derived from data which is similar to qualitative research. However, steps 4 and 5 are were the equation developing and or statistical tools are used. As such the empirical view of research is valuable when there is a large amount of data available and can include many variables, which is again often common for quantitative methods.

To summarize this, empirical research is focused on what happened, which is one way in which scientific laws are derived.

**Theoretical**

The theoretical perspective is essentially the same process as empirical but moving in the opposite direction. For theorists, the will start with what they think about the phenomenon and how things should be. This approach starts with a general principle and then the researcher goes and looks for evidence that supports their general principle. Another way of stating this is that the theoretical approach is deductive in nature.

A classic example of this is Einstein’s theory of relativity. Apparently, he deduced this theory through logic and left it to others to determine if the theory was correct. To put it simply, he knew without knowing, if this makes sense. In this approach, the steps are as follows

- Theory
- Hypotheses
- model abstraction
- data collection
- Phenomenon

You collect data to confirm the hypotheses. Common statistical tools can include simulations or any other method that is suitable in situations in which there is little data available. The caveat is that the data must match the phenomenon to have meaning. For example, if I am trying to understand some sort of phenomenon about women I cannot collect data from as this does not match the phenomenon.

In general, theoretical research is focused on why something happens which is the goal of most theories, explaining why.

**Analytical **

Analytical research is probably the hardest to explain and understand. Essentially, analytical research is trying to understand how people develop their empirical or theoretical research. How did Darwin make this collection or how did Einstein develop his ideas.

In other words, analytical research is commonly used to judge the research of others. Examples of this can be people who spend a lot of time criticizing the works of others. An analytical approach is looking for the strengths and weaknesses of various research. Therefore, this approach is focused on how research is done and can use tools both from empirical and theoretical research.

**Conclusion**

The point here was to explain different views om conducting research. The goal was not to state that one is superior to the other. Rather, the goal was to show how different tools can be used in different ways

# Paraphrasing Tips for ESL Students

Paraphrasing is an absolute skill in a professional setting. By paraphrasing, it is meant to have the ability to take someone else’s words and rephrase them while giving credit for the original source. Whenever a student fails to do this it is called plagiarism which is a major problem in academia. In this post, we will look at several tips on how to paraphrase.

The ability to paraphrase academically takes almost near-native writing ability. This is because you have to be able to play with the language in a highly complex manner. To be able to do this after a few semesters of ESL is difficult for the typical student. Despite this, there are several ways to try to make paraphrase work. Below are just some ideas.

- Use synonyms
- Change the syntax
- Make several sentences
- Condense/summarize

One tip not mentioned is reading. Next, to actually writing, nothing will improve writing skills like reading. Being exposed to different texts helps you develop an intuitive understanding of the second language in a way that copying and pasting never will.

**Use Synonyms**

Using synonyms is a first step in paraphrasing an idea but this approach cannot be used by itself as that is considered to be plagiarism by many people. With synonyms, you replace some words with others. The easiest words to replace are adjectives and verbs, followed by nouns. Below is an example. The first sentence is the original one and the second is the paraphrase.

The man loves to play guitar

The man likes to play guitar

In the example above all we did was change the word “loves” to “like”. This is a superficial change that is often still considered plagiarism because of how easy it is to do. We can take this a step further by modifying the infinitive verb “to play.”

The man loves to play guitar

The man likes to play guitar

The man likes playing guitar

Again this is superficial but a step above the first example. In addition, most word processors will provide synonyms if you right-click on the word and off course there are online options as well. Remember that this is a beginning and is a tool you use in addition to more complex approaches.

**Change the Syntax**

Changing the syntax has to do with the word order of the sentence or sentences. Below is an example

The man loves to play guitar

Playing the guitar is something the man loves

In this example, we move the infinitive phrase “to play” to the front and change it to a present participle. There were other adjustments that needed to be made to maintain the flow of the sentence. This example is a more advanced form of paraphrasing and it may be enough to only do this to avoid plagiarism. However, you can combine synonyms and syntax as shown in the example below

The man loves to play guitar

Playing the guitar is something the man likes

**Make Several Sentences**

Another approach is to convert a sentence(s) into several more sentences. As shown below

The man loves to play guitar

This man has a hobby. He likes playing guitar.

You can see that there are two sentences now. The first sentence indicates the man has a hobby and the second explains what the hobby is and how much he likes it. In addition, in the second sentence, the verb “to play” was changed to the present participle of “playing.”

**Condense/Summarize**

Condensing or summarizing is not considered by everyone to be paraphrasing. The separation between paraphrasing and summarizing is fuzzy and it is more of a continuum than black and white. With this technique, you try to reduce the length of the statement you are paraphrasing as shown below.

The man loves to play guitar

He likes guitar

This was a difficult sentence to summarizes because it was already so short. However, we were able to shrink it from six to three words by removing what it was about the guitar he liked.

**Academic Examples **

We will now look at several academic examples to show the applications of these rules in a real context. The passage below is some academic text

There is also a push within Southeast Asia for college graduates to have

interpersonal skills. For example, Malaysia is calling for graduates to

have soft skills and that these need to be part of the curriculum of tertiary schools.

In addition, a lack of these skills has been found to limit graduates’ employability.

*Example 1: Paraphrase with synonyms and syntax changes*

There are several skills graduates need for employability in Southeast Asia. For example, people skills are needed. The ability to relate to others is being pushed for inclusion in higher education from parts of Southeast Asia (Thomas, 2018).

You can see how difficult this can be. We rearranged several concepts and changed several verbs to try and make this our own sentence. Below is an example of condensing.

*Example 2: Condensing*

There is demand in Southeast Asia for higher education to develop the interpersonal skills of their students as this is limiting the employability of graduates (Thomas, 2018).

With this example, we reduced the paragraph to one sentence.

**Culture and Plagiarism**

There are majors differences in terms of how plagiarism is viewed based on culture. In the West, plagiarism is universally condemned both in and out of academia as essentially stealing ideas from other people. However, in other places, the idea of plagiarism is much more nuanced or even okay.

In some cultures, one way to honor what someone has said or taught is to literally repeat it verbatim. The thought process goes something like this

- This person is a great teacher/elder
- What they said is insightful
- As a student or lower person, I cannot improve what they said
- Therefore, I should copy these perfects words into my own paper.

Of course, everyone does not think like this but I have experienced enough to know that it does happen.

Whether the West likes it or not plagiarism is a cultural position rather than an ethical one. To reduce plagiarism requires to show students how it is culturally unacceptable in an academic/professional setting to do this. The tips in this post will at least provide tools for how to support students to overcome this habit

# Understanding Variables

In research, there are many terms that have the same underlying meaning which can be confusing for researchers as they try to complete a project. The problem is that people have different backgrounds and learn different terms during their studies and when they try to work with others there is often confusion over what is what.

In this post, we will try to clarify as much as possible various terms that are used when referring to variables. We will look at the following during this discussion

- Definition of a variable
- Minimum knowledge of the characteristics of a variable in research
- Various synonyms of variable

**Definition **

The word variable has the root of “vary” and the suffix “able”. This literally means that a variable is something that is able to change. Examples include such concepts as height, weigh, salary, etc. All of these concepts change as you gather data from different people. Statistics is primarily about trying to explain and or understand the variability of variables.

However, to make things more confusing there are times in research when a variable dies not change or remains constant. This will be explained in greater detail in a moment.

**Minimum You Need to Know**

Two broad concepts that you need to understand regardless of the specific variable terms you encounter are the following

- Whether the variable(s) are independent or dependent
- Whether the variable(s) are categorical or continuous

When we speak of independent and dependent variables we are looking at the relationship(s) between variables. Dependent variables are explained by independent variables. Therefore, one dimension of variables is understanding how they relate to each other and the most basic way to see this is independent vs dependent.

The second dimension to consider when thinking about variables is how they are measured which is captured with the terms categorical or continuous. A categorical variable has a finite number of values that can be used. Examples in clue gender, hair color, or cellphone brand. A person can only be male or female, have blue or brown eyes, and can only have one brand of cellphone.

Continuous variables are variables that can take on an infinite number of values. Salary, temperature, etc are all continuous in nature. It is possible to limit a continuous variable to categorical variable by creating intervals in which to place values. This is commonly done when creating bins for histograms. In sum, here are the four possible general variable types below

- Independent categorical
- Independent continuous
- Dependent categorical
- Dependent continuous

Natural, most models have one dependent categorical or continuous variable, however you can have any combination of continuous and categorical variables as independents. Remember that all variables have the above characteristics despite whatever terms is used for them.

**Variable Synonyms**

Below is a list of various names that variables go by in different disciplines. This is by no means an exhaustive list.

*Experimental variable*

A variable whose values are independent of any changes in the values of other variables. In other words, an experimental variable is just another term for independent variable.

*Manipulated Variable*

A** **variable that is independent in an experiment but whose value/behavior the researcher is able to control or manipulate. This is also another term for an independent variable.

*Control Variable*

A variable whose value does not change. Controlling a variable helps to explain the relationship between the independent and dependent variable in an experiment by making sure the control variable has not influenced in the model

*Responding Variable*

The dependent variable in an experiment. It responds to the experimental variable.

*Intervening Variable*

This is a hypothetical variable. It is used to explain the causal links between variables. Since they are hypothetical, they are observed in an actual experiment. For example, if you are looking at a strong relationship between income and life expectancy and find a positive relationship. The intervening variable for this may be access to healthcare. People who make more money have more access to health care and this contributes to them often living longer.

*Mediating Variable*

This is the same thing as an intervening variable. The difference being often that the mediating variable is not always hypothetical in nature and is often measured it’s self.

*Confounding Variable*

A confounder is a variable that influences both the independent and dependent variable, causing a spurious or false association. Often a confounding variable is a causal idea and cannot be described in terms of correlations or associations with other variables. In other words, it is often the same thing as an intervening variable.

*Explanatory Variable*

This variable is the same as an independent variable. The difference being that an independent variable is not influenced by any other variables. However, when independence is not for sure, than the variable is called an explanatory variable.

*Predictor Variable*

A predictor variable is an independent variable. This term is commonly used for regression analysis.

*Outcome Variable*

An outcome variable is a dependent variable in the context of regression analysis.

*Observed Variable*

This is a variable that is measured directly. An example would be gender or height. There is no psychology construct to infer the meaning of such variables.

*Unobserved Variable*

Unobserved variables are constructs that cannot be measured directly. In such situations, observe variables are used to try to determine the characteristic of the unobserved variable. For example, it is hard to measure addiction directly. Instead, other things will be measure to infer addiction such as health, drug use, performance, etc. The measures of this observed variables will indicate the level of the unobserved variable of addiction

*Features*

A feature is an independent variable in the context of machine learning and data science.

*Target Variable*

A target variable is the dependent variable in the context f machine learning and data science.

To conclude this, below is a summary of the different variables discussed and whether they are independent, dependent, or neither.

Independent | Dependent | Neither |
---|---|---|

Experimental | Responding | Control |

Manipulated | Target | Explanatory |

Predictor | Outcome | Intervening |

Feature | Mediating | |

Observed | ||

Unobserved | ||

Confounding |

You can see how confusing this can be. Even though variables are mostly independent or dependent, there is a class of variables that do not fall into either category. However, for most purposes, the first to columns cover the majority of needs in simple research.

**Conclusion**

The confusion over variables is mainly due to an inconsistency in terms across variables. There is nothing right or wrong about the different terms. They all developed in different places to address the same common problem. However, for students or those new to research, this can be confusing and this post hopefully helps to clarify this.

# Z-Scores VIDEO

Calculating Z-Scores

# Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset is looking at how is likely to default on their credit. The following steps will be conducted in this analysis.

- Load the libraries and dataset
- Deal with missing data
- Some descriptive stats
- Normality check
- Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

**Load Libraries and Data**

Here are some packages we will need

import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from scipy.stats import norm from sklearn import tree from scipy import stats from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long

df_train.columns df_train.head()

**Missing Data**

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False) percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False) missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent']) missing_data.head() Total Percent COMMONAREA_MEDI 214865 0.698723 COMMONAREA_AVG 214865 0.698723 COMMONAREA_MODE 214865 0.698723 NONLIVINGAPARTMENTS_MODE 213514 0.694330 NONLIVINGAPARTMENTS_MEDI 213514 0.694330

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train) missing_features = pct_null[pct_null > 0.0].index df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function if you want to see how many variables are left.

**Data Description & Visualization**

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe()) Out[8]: count 307511.0 mean 599026.0 std 402491.0 min 45000.0 25% 270000.0 50% 513531.0 75% 808650.0 max 4050000.0 sns.distplot(df_train['AMT_CREDIT']

round(df_train['AMT_INCOME_TOTAL'].describe()) Out[10]: count 307511.0 mean 168798.0 std 237123.0 min 25650.0 25% 112500.0 50% 147150.0 75% 202500.0 max 117000000.0 sns.distplot(df_train['AMT_INCOME_TOTAL']

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables in terms of creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variable’s

df_train.groupby('NAME_CONTRACT_TYPE').count() dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1) df_train.groupby('CODE_GENDER').count() dummy=pd.get_dummies(df_train['CODE_GENDER']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['CODE_GENDER'],axis=1) df_train.groupby('FLAG_OWN_CAR').count() dummy=pd.get_dummies(df_train['FLAG_OWN_CAR']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1) df_train.groupby('FLAG_OWN_REALTY').count() dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1) df_train.groupby('NAME_INCOME_TYPE').count() dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1) df_train.groupby('NAME_EDUCATION_TYPE').count() dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1) df_train.groupby('NAME_FAMILY_STATUS').count() dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1) df_train.groupby('NAME_HOUSING_TYPE').count() dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1) df_train.groupby('ORGANIZATION_TYPE').count() dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE']) df_train=pd.concat([df_train,dummy],axis=1) df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR,''Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6)) fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])

There is a clear outlier there. Below is another boxplot with a different variable

f,ax=plt.subplots(figsize=(8,6)) fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr() f,ax=plt.subplots(figsize=(12,9)) sns.heatmap(corrmat,vmax=.8,square=True)

The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.

c = df_train.corr().abs() s = c.unstack() so = s.sort_values(kind="quicksort") print(so.head()) FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005 FLAG_MOBIL FLAG_DOCUMENT_12 0.000005 Unknown FLAG_MOBIL 0.000005 FLAG_MOBIL Unknown 0.000005 Cash loans FLAG_DOCUMENT_14 0.000005

The list is to long to show here but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm) fig=plt.figure() res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)

This is not normal

sns.distplot(df_train['AMT_CREDIT'],fit=norm) fig=plt.figure() res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)

This is not normal either. We could do transformations, or we can make a non-linear model instead.

**Model Development**

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make are X and y dataset.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']] y=df_train['TARGET']

The code below fits are model and makes the predictions

clf=tree.DecisionTreeClassifier(min_samples_split=20) clf=clf.fit(X,y) y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET'])) TARGET 0 1 row_0 0 280873 18493 1 1813 6332 accuracy_score(y_pred,df_train['TARGET']) Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET'])) precision recall f1-score support 0 0.99 0.94 0.97 299366 1 0.26 0.78 0.38 8145 micro avg 0.93 0.93 0.93 307511 macro avg 0.62 0.86 0.67 307511 weighted avg 0.97 0.93 0.95 307511

This model looks rather good in terms of accuracy of the training set. It actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.

**Conclusion**

Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this but the simplistic nature of this analysis yielded strong results

# Confidence Intervals for Proportions VIDEO

Calculating Confidence Intervals for Proportions

# Hierarchical Regression in R

In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situation in which you want to see if adding additional variables to your model will significantly change the r2 when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.

We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables in three different models.

model 1 = intercept only

model 2 = Sales~Urban + US + ShelveLoc

model 3 = Sales~Urban + US + ShelveLoc + price + income

model 4 = Sales~Urban + US + ShelveLoc + price + income + Advertising

Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question

Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?

In this complex research question we essentially want to know if adding advertising cost will improve the model significantly in terms of the r square. The formal steps that we will following to complete this analysis is as follows.

- Build sequential (nested) regression models by adding variables at each step.
- Run ANOVAs in order to compute the R2
- Compute difference in sum of squares for each step
- Check F-statistics and p-values for the SS differences.

- Compare sum of squares between models from ANOVA results.
- Compute increase in R2 from sum of square difference
- Run regression to obtain the coefficients for each independent variable.

We will now begin our analysis. Below is some initial code

```
library(ISLR)
data("Carseats")
```

# Model Development

We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.

```
model1 = lm(Sales~1,Carseats)
model2=lm(Sales~Urban + US + ShelveLoc,Carseats)
model3=lm(Sales~Urban + US + ShelveLoc + Price + Income,Carseats)
model4=lm(Sales~Urban + US + ShelveLoc + Price + Income + Advertising,Carseats)
```

We can now turn to the ANOVA analysis for model comparison #ANOVA Calculation We will use the anova() function to calculate the total sum of square for model 0. This will serve as a baseline for the other models for calculating r square

`anova(model1,model2,model3,model4)`

```
## Analysis of Variance Table
##
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 399 3182.3
## 2 395 2105.4 4 1076.89 89.165 < 2.2e-16 ***
## 3 393 1299.6 2 805.83 133.443 < 2.2e-16 ***
## 4 392 1183.6 1 115.96 38.406 1.456e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.

model 1 = sum of squares = 3182.3

model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)

model 3 = sum of squares = 1299.6 (add price and income)

model 4 = sum of squares = 1183.6 (add Advertising)

Each model is statistical significant which means adding each variable lead to some improvement.

By adding price and income to the model we were able to improve the model in a statistically significant way. The r squared increased by .25 below is how this was calculated.

`2105.4-1299.6 #SS of Model 2 - Model 3`

`## [1] 805.8`

`805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by total sum of sqaure ie model 1`

`## [1] 0.2532131`

When we add Advertising to the model the r square increases by .03. The calculation is below

`1299.6-1183.6 #SS of Model 3 - Model 4`

`## [1] 116`

`116/ 3182.3 #SS difference of Model 3 and Model 4 divided by total sum of sqaure ie model 1`

`## [1] 0.03645162`

# Coefficients and R Square

We will now look at a summary of each model using the summary() function.

`summary(model2)`

```
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.713 -1.634 -0.019 1.738 5.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8966 0.3398 14.411 < 2e-16 ***
## UrbanYes 0.0999 0.2543 0.393 0.6947
## USYes 0.8506 0.2424 3.510 0.0005 ***
## ShelveLocGood 4.6400 0.3453 13.438 < 2e-16 ***
## ShelveLocMedium 1.8168 0.2834 6.410 4.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared: 0.3384, Adjusted R-squared: 0.3317
## F-statistic: 50.51 on 4 and 395 DF, p-value: < 2.2e-16
```

`summary(model3)`

```
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income,
## data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9096 -1.2405 -0.0384 1.2754 4.7041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.280690 0.561822 18.299 < 2e-16 ***
## UrbanYes 0.219106 0.200627 1.092 0.275
## USYes 0.928980 0.191956 4.840 1.87e-06 ***
## ShelveLocGood 4.911033 0.272685 18.010 < 2e-16 ***
## ShelveLocMedium 1.974874 0.223807 8.824 < 2e-16 ***
## Price -0.057059 0.003868 -14.752 < 2e-16 ***
## Income 0.013753 0.003282 4.190 3.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared: 0.5916, Adjusted R-squared: 0.5854
## F-statistic: 94.89 on 6 and 393 DF, p-value: < 2.2e-16
```

`summary(model4)`

```
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income +
## Advertising, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2199 -1.1703 0.0225 1.0826 4.1124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.299180 0.536862 19.184 < 2e-16 ***
## UrbanYes 0.198846 0.191739 1.037 0.300
## USYes -0.128868 0.250564 -0.514 0.607
## ShelveLocGood 4.859041 0.260701 18.638 < 2e-16 ***
## ShelveLocMedium 1.906622 0.214144 8.903 < 2e-16 ***
## Price -0.057163 0.003696 -15.467 < 2e-16 ***
## Income 0.013750 0.003136 4.384 1.50e-05 ***
## Advertising 0.111351 0.017968 6.197 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared: 0.6281, Adjusted R-squared: 0.6214
## F-statistic: 94.56 on 7 and 392 DF, p-value: < 2.2e-16
```

You can see for yourself the change in the r square. From model 2 to model 3 there is a 26 point increase in r square just as we calculated manually. From model 3 to model 4 there is a 3 point increase in r square. The purpose of the anova() analysis was determined if the significance of the change meet a statistical criterion, The lm() function reports a change but not the significance of it.

# Conclusion

Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.

# Zotero Reference Software VIDEO

A demo on the use of the Zotero Reference software

# Confidence Intervals VIDEO

Calculating Confidence intervals

# Gradient Boosting Classification in Python

Gradient Boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than adaboost. Some differences between the two algorithms is that gradient boosting uses optimization for weight the estimators. Like adaboost, gradient boosting can be used for most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with the number of nodes in a tree. The higher the number the purer the classification become. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up until the whole number 1. If the value is set to 1 it becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import numpy as np

from pydataset import data

import pandas as pd

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import KFold

**Data Preparation**

The data preparation is simple in this situtation. All we need to do is load are dataset, dropping missing values, and create our X dataset and y dataset. All this happens in the code below.

df=data('cancer').dropna()

X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]

y=df['status']

We will now develop our baseline decision tree model.

**Baseline Model**

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)

for depth in range (1,10):

tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)

if tree_classifier.fit(X,y).tree_.max_depth<depth:

break

score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))

print(depth, score)

1 0.71875

2 0.6477941176470589

3 0.6768382352941177

4 0.6698529411764707

5 0.6584558823529412

6 0.6525735294117647

7 0.6283088235294118

8 0.6573529411764706

9 0.6577205882352941

It appears that when the max depth is limited to 1 that we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm

**Hyperparameter Tuning**

There are several hyperparameters we need to tune. The ones we will tune are as follows

- number of estimators
- learning rate
- subsample
- max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values from mad earlier, and n_jobs all together in one place. Below is the code for this.

GBC=GradientBoostingClassifier()

search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}

search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then as second run through is done to find the learning rate. In our example, we are doing everything at once which is why it takes longer. Below is the code with the out for best parameters and best score.

search.fit(X,y)

search.best_params_

Out[11]:

{'learning_rate': 0.01,

'max_depth': 5,

'n_estimators': 2000,

'random_state': 1,

'subsample': 0.75}

search.best_score_

Out[12]: 0.7425149700598802

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.

**Gradient Boosting Model**

Below is the code and results for the model with the predetermined hyperparameter values.

ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)

score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))

score

Out[17]: 0.742279411764706

You can see that the results are similar. This is just additional information that the gradient boosting model does outperform the baseline decision tree model.

**Conclusion**

This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

# AdaBoost Regression with Python

This post will share how to use the adaBoost algorithm for regression in Python. What boosting does is that it makes multiple models in a sequential manner. Each newer model tries to successful predict what older models struggled with. For regression, the average of the models are used for the predictions. It is often most common to use boosting with decision trees but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting, is that combining several weak models can make one really strong and accurate model.

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of Adaboost regression model
- AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import numpy as np

from pydataset import data

import pandas as pd

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split

from sklearn.model_selection import KFold

from sklearn.metrics import mean_squared_error

**Data Preparation**

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.

df=data('cancer').dropna()

X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]

y=df['wt.loss']

We will now proceed to creating the baseline regression decision tree model.

**Baseline Regression Tree Model**

The purpose of the baseline model is for comparing it to the performance of our model that utilizes adaBoost. In order to make this model we need to Initiate a Kfold cross-validation. This will help in stabilizing the results. Next we will create a for loop so that we can create several trees that vary based on their depth. By depth, it is meant how far the tree can go to purify the classification. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

for depth in range (1,10):

tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)

if tree_regressor.fit(X,y).tree_.max_depth<depth:

break

score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))

print(depth, score)

1 -193.55304528235052

2 -176.27520747356175

3 -209.2846723461564

4 -218.80238479654003

5 -222.4393459885871

6 -249.95330609042858

7 -286.76842138165705

8 -294.0290706405905

9 -287.39016236497804

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

**Hyperparameter Tuning**

For hyperparameter tuning we need to start by initiating our AdaBoostRegresor() class. Then we need to create our grid. The grid will address two hyperparameters which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning indicates how each tree contributes to the overall results. There is one more parameters which is random_state but this is just for setting the seed and never changes.

After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function you have to set the estimator which is adaBoostRegressor, the parameter grid which we just made, the cross validation which we made when we created the baseline model, and the n_jobs which allocates resources for the calculation. Below is the code.

ada=AdaBoostRegressor()

search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}

search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

Next, we can run the model with the desired grid in place. Below is the code for fitting the mode as well as the best parameters and the score to expect when using the best parameters.

search.fit(X,y)

search.best_params_

Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}

search.best_score_

Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean error score of 164, which is a little lower than our single decision tree of 176. We will see how this works when we run our model with the refined hyperparameters.

**AdaBoost Regression Model**

Below is our model but this time with the refined hyperparameters.

ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)

score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))

score

Out[36]: -174.52604137201791

You can see the score is not as good but it is within reason.

**Conclusion**

In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help to strengthen a model in many ways at times.

# Standard Deviation VIDEO

Calculating standard deviation

# Research Questions, Variables, and Statistics

Working with students over the years has led me to the conclusion that often students do not understand the connection between variables, quantitative research questions and the statistical tools

used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary skills in statistics to succeed.

This means that the students have a theoretical understanding of statistics but struggle in the application of it. In this post, we will look at some of the connections between research questions and statistics.

**Variables**

Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.

Another aspect that can make this confusing is that many variables can be measured more than one way. Sometimes the variable “salary” can be measured in a continuous manner or in a categorical manner. The superiority of one or the other depends on the goals of the research.

It is critical to support students to have a thorough understanding of variables in order to support their research.

**Types of Research Questions**

In general, there are two types of research questions. These two types are descriptive and relational questions. Descriptive questions involve the use of descriptive statistic such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (ie the average height is 172cm) rather than relying on qualitative descriptions of it (ie the people are tall).

Below are several example research questions that are descriptive in nature.

- What is the average height of the participants in the study?
- What proportion of the sample is passed the exam?
- What are the respondents perceptions towards the cafeteria?

These questions are not intellectually sophisticated but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items that are used to measure respondents perception of the cafeteria.

Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.

Relational questions look for the connection or link between variables. Within this type there are two sub-types. Comparison question involve comparing groups. The other sub-type is called relational or an association question.

Comparison questions involve comparing groups on a continuous variable. For example, comparing men and women by height. What you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is looking for a relationship just not in the way that many student understand. Common comparison questions include the following.male

- Is there a difference in height by gender among the participants?
- Is there a difference in reading scores by grade level?
- Is there a difference in job satisfaction in based on major?

Each of these questions can be answered using ANOVA or if we want to get technical and there are only two groups (ie gender) we can use t-test. This is a broad overview and does not include the complexities of one-sample test and or paired t-test.

Relational or association question involve continuous variables primarily. The goal is to see how variables move together. For example, you may look for the relationship between height and weight of students. Common questions include the following.

- Is there a relationship between height and weight?
- Does height and show size explain weight?

Questions 1 can be answered by calculating the correlation. Question 2 requires the use of linear regression in order to answer the question.

**Conclusion**

The challenging as a teacher is showing the students the connection between statistics and research questions from the real world. It takes time for students to see how the question inspire the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what to do in research based on the statistical knowledge one has.

# Elastic Net Regression in Python

Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that with elastic net the algorithm can remove weak variables altogether as with lasso or to reduce them to close to zero as with ridge. All of these algorithms are examples of regularized regression.

This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.

- Data preparation
- Baseline model development
- Elastic net model development

To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.

from pydataset import data

import numpy as np

import pandas as pd

pd.set_option('display.max_rows', 5000)

pd.set_option('display.max_columns', 5000)

pd.set_option('display.width', 10000)

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import ElasticNet

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

**Data Preparation**

We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our and y datasets. Below is the code.

df=pd.DataFrame(data('Fair'))

df.loc[df.sex== 'male', 'sex'] = 0

df.loc[df.sex== 'female','sex'] = 1

df['sex'] = df['sex'].astype(int)

df.loc[df.child== 'no', 'child'] = 0

df.loc[df.child== 'yes','child'] = 1

df['child'] = df['child'].astype(int)

X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]

y=df['rate']

We can now proceed to creating the baseline model** **

**Baseline Model**

This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.

regression=LinearRegression()

regression.fit(X,y)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

1.0498738644696668

This mean standard error score of 1.05 is our benchmark for determining if the elastic net model will be better or worst. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to combine the two columns.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,X.columns):

coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[63]:

{'religious': 0.04235281110639178,

'age': -0.009059645428673819,

'sex': 0.08882013337087094,

'ym': -0.030458802565476516,

'education': 0.06810255742293699,

'occupation': -0.005979506852998164,

'nbaffairs': -0.07882571247653956}

We will now move to making the elastic net model.

**Elastic Net Model**

Elastic net, just like ridge and lasso regression, requires normalize data. This argument is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we create for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.

When the l1_ratio is set to 0 it is the same as ridge regression. When l1_ratio is set to 1 it is lasso. Elastic net is somewhere between 0 and 1 when setting the l1_ratio. Therefore, in our grid, we need to set several values of this argument. Below is the code.

elastic=ElasticNet(normalize=True)

search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

We will now fit our model and display the best parameters and the best results we can get with that setup.

search.fit(X,y)

search.best_params_

Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}

abs(search.best_score_)

Out[74]: 1.0816514028705004

The best hyperparameters was an alpha set to 0.001 and a l1_ratio of 0.8. With these settings we got an MSE of 1.08. This is above our baseline model of MSE 1.05 for the baseline model. Which means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to the recommended values and run on the data.

elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)

elastic.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))

print(second_model)

1.0566430678343806

Now our values are about the same. Below are the coefficients

coef_dict_baseline = {}

for coef, feat in zip(elastic.coef_,X.columns):

coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[76]:

{'religious': 0.01947541724957858,

'age': -0.008630896492807691,

'sex': 0.018116464568090795,

'ym': -0.024224831274512956,

'education': 0.04429085595448633,

'occupation': -0.0,

'nbaffairs': -0.06679513627963515}

The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this values was no good to the algorithm. Traditional regression cannot do this.

**Conclusion**

This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

# Lasso Regression with Python

Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.

Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “CaSchools” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.

- Data preparation
- Develop a baseline linear model
- Develop lasso regression model

The initial code is as follows

from pydataset import data

import numpy as np

import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso

df=pd.DataFrame(data(‘Caschool’))

**Data Preparation**

The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent w=but it was decided to leave them out anyway. Below is the code.

X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]

y=df['testscr']

**Baseline Model**

We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.

regression=LinearRegression()

regression.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

69.07380530137416

First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we predicted future values of our regression model and save the results to the object first_model. Lastly, we printed the results.

Below are the coefficient for the baseline regression model.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,X.columns):

coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[52]:

{'teachers': 0.00010011947964873427,

'calwpct': -0.07813766458116565,

'mealpct': -0.3754719080127311,

'compstu': 11.914006268826652,

'expnstu': 0.001525630709965126,

'str': -0.19234209691788984,

'avginc': 0.6211690806021222,

'elpct': -0.19857026121348267}

The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.

**Lasso Model**

For our lasso model, we have to determine what value to set the l1 or alpha to prior to creating the model. This can be done with the grid function, This function allows you to assess several models with different l1 settings. Then python will tell which setting is the best. Below is the code.

lasso=Lasso(normalize=True)

search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

search.fit(X,y)

We start be instantiate lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we setup our grid, we include the estimator, and parameter grid, and scoring. The alpha is set using logspace. We want values between -5 and 2, and we want 8 evenly spaced settings for the alpha. The other arguments include cv which stands for cross-validation. n_jobs effects processing and refit updates the parameters.

After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.

search.best_params_

Out[55]: {'alpha': 1e-05}

abs(search.best_score_)

Out[56]: 85.38831122904011

`The alpha is set almost to zero, which is the same as a regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.

lasso=Lasso(normalize=True,alpha=1e-05)

lasso.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))

print(second_model)

69.0738055527604

The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.

coef_dict_baseline = {}

for coef, feat in zip(lasso.coef_,X.columns):

coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[63]:

{'teachers': 9.795933425676567e-05,

'calwpct': -0.07810938255735576,

'mealpct': -0.37548182158171706,

'compstu': 11.912164626067028,

'expnstu': 0.001525439984250718,

'str': -0.19225486069458508,

'avginc': 0.6211695477945162,

'elpct': -0.1985510490295491}

The coefficient values are also slightly different. The only difference is the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscrs. That is ironic to say the least.

**Conclusion**

Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python is fairly simple. This yet another tool that can be used in statistical analysis.

# Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing coefficients of variables either by removing them and or reduce their impact. Ridge regression reduces the effect of problematic variables close to zero but never fully removes them.

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps/

- Data preparation
- Baseline model development
- Ridge regression model

Below is the initial code

from pydataset import data

import numpy as np

import pandas as pd

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Ridge

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_erro

**Data Preparation**

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

df=pd.DataFrame(data('VietNamI'))

df.loc[df.sex== 'male', 'sex'] = 0

df.loc[df.sex== 'female','sex'] = 1

df['sex'] = df['sex'].astype(int)

X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]

y=df['lnhhexp'

We can now create our baseline regression model.

**Baseline Model**

The metric we are using is the mean squared error. Below is the code and output for our baseline regression model. This is a model that has no regularization to it. Below is the code.

regression=LinearRegression()

regression.fit(X,y)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

0.35528915032173053

This value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.

**Ridge Model**

In order to create our ridge model we need to first determine the most appropriate value for the l2 regularization. L2 is the name of the hyperparameter that is used in ridge regression. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first are ridge model and indicate normalization in order to get better estimates. Next we setup the grid that we will use. Below is the code.

ridge=Ridge(normalize=True)

search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

The search object has several arguments within it. Alpha is hyperparameter we are trying to set. The log space is the range of values we want to test. We want the log of -5 to 2, but we only get 8 values from within that range evenly spread out. Are metric is the mean squared error. Refit set true means to adjust the parameters while modeling and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_scores_ function to determine the model;s strength. Below is the code.

search.fit(X,y)

search.best_params_

{'alpha': 0.01}

abs(search.best_score_)

0.3801489007094425

The best_params_ tells us what to set alpha too which in this case is 0.01. The best_score_ tells us what the best possible mean squared error is. In this case, the value of 0.38 is worse than what the baseline model was. We can confirm this by fitting our model with the ridge information and finding the mean squared error. This is done below.

ridge=Ridge(normalize=True,alpha=0.01)

ridge.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))

print(second_model)

0.35529321992606566

The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,data("VietNamI").columns):

coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[188]:

{'pharvis': 0.013282050886950674,

'lnhhexp': 0.06480086550467873,

'age': 0.004012412278795848,

'sex': -0.08739614349708981,

'married': 0.075276463838362,

'educ': -0.06180921300600292,

'illness': 0.040870384578962596,

'injury': -0.002763768716569026,

'illdays': -0.006717063310893158,

'actdays': 0.1468784364977112}

coef_dict_ridge = {}

for coef, feat in zip(ridge.coef_,data("VietNamI").columns):

coef_dict_ridge[feat] = coef

coef_dict_ridge

Out[190]:

{'pharvis': 0.012881937698185289,

'lnhhexp': 0.06335455237380987,

'age': 0.003896623321297935,

'sex': -0.0846541637961565,

'married': 0.07451889604357693,

'educ': -0.06098723778992694,

'illness': 0.039430607922053884,

'injury': -0.002779341753010467,

'illdays': -0.006551280792122459,

'actdays': 0.14663287713359757}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.

**Conclusion**

Ridge regression allows you to penalize variables based on their useful in developing the model. With this form of regularized regression the coefficients of the variables is never set to zero. Other forms of regularization regression allows for the total removal of variables. One example of this is lasso regression.

# Computational Thinking

Computational thinking is the process of expressing a problem in a way that a computer can solve. In general, there are four various ways that computational thinking can be done. These four ways are decomposition, pattern recognition, abstraction, and algorithmic thinking.

Although computational thinking is dealt with in the realm of computer science. Everyone thinks computationally at one time or another especially in school. Awareness of these subconscious strategies can help people to know how they think at times as well as to be aware of the various ways in which thinking is possible.

**Decomposition**

Decomposition is the process of breaking a large problem down into smaller and smaller parts or problems. The benefit of this is that by addressing all of the created little problems you can solve the large problem.

In education decomposition can show up in many ways. For teachers, they often have to break goals done into objectives, and sometimes down into procedures in a daily lesson plan. Seeing the big picture of the content students need and breaking it down into pieces that students can comprehend is critically to education such as with chunking.

For the student, decomposition involves breaking down the parts of a project such as writing a paper. The student has to determine what to do and how it helps to achieve the completion of their project.

**Pattern Recognition**

Pattern recognition has to refer to how various aspects of a problem have things in common. For a teacher, this may involve the development of a thematic unit. Developing such a unit requires the teacher to see what various subjects or disciplines have in common as they try to create the thematic unit.

For the student, pattern recognition can support the development of shortcuts. Examples include seeing similarities in assignments that need to be completed and completing similar assignments together.

**Abstraction**

Abstraction is the ability to remove irrelevant information from a problem. This is perhaps the most challenging form of thinking to develop because people often fall into the trap that everything is important.

For a teacher, abstractions involves teaching only the critical information that is in the content and not stressing the small stuff. This is not easy especially when the teacher has a passion for their subject. This often blinds them to trying to share only the most relevant information about their field with their students.

For students, abstraction involves being able to share the most critical information. Students are guilty of the same problems as teachers in that they share everything when writing or presenting. Determining what is important requires the development of an opinion to judge the relevance of something. This is a skill that is hard to find among graduates.

**Algorithmic Thinking**

Algorithmic thinking is being able to develop a step-by-step plan to do something. For teachers, this happens everyday through planning activities and leading a class. Planning may be the most common form of thinking for the teacher.

For students, algorithmic thinking is somewhat more challenging. It is common for younger people to rely heavily on intuition to accomplish tasks. This means that they did something but they do not know how they did it.

Another common mistake for young people is doing things through brute force. Rather than planning, they will just keep pounding away until something works. As such, it is common for students to do things the “hard way” as the saying goes.

**Conclusion**

Computational thinking is really how humans think in many situations in which emotions are not the primary mover. As such, what is really happening is not that computers are thinking as much as they are trying to model how humans think. In education, there are several situations. In which computational thinking can be employed for success.

# 3 Steps to Successful Research

When students have to conduct a research project they often struggle with determining what to do. There are many decisions that have to be made that can impede a student’s chances of achieving success. However, there are ways to overcome this problem.

This post will essentially reduce the decision-making process for conducting research down to three main questions that need to be addressed. These questions are.

- What do you Want to Know?
- How do You Get the Answer?
- What Does Your Answer Mean?

Answering these three questions makes it much easier to develop a sense of direction and scope in order to complete a project.

**What do you Want to Know?**

Often, students want to complete a project but it is unclear to them what they are trying to figure out. In other words, the students do not know what it is that they want to know. Therefore, one of the first steps in research is to determine exactly it is you want to know.

Understanding what you want to know will allow you to develop a problem as well as research questions to facilitate your ability to understand exactly what it is that you are looking for. Research always begins with a problem and questions about the problem and this is simply another way of stating what it is that you want to know.

**How do You Get the Answer?**

Once it is clear what it is that you want to know it is critical that you develop a process for determining how you will obtain the answers. It is often difficult for students to develop a systematic way in which to answer questions. However, in a research paradigm, a scientific way of addressing questions is critical.

When you are determining how to get answers to what you want to know this is essential the development of your methodology section. This section includes such matters as the research design, sample, ethics, data analysis, etc. The purpose here is again to explain the way to get the answer(s).

**What Does Your Answer Mean?**

After you actually get the answer you have to explain what it means. Many students fall into the trap of doing something without understanding why or determining the relevance of the outcome. However, a research project requires some sort of interpretation or explanation of the results. Just getting the answer is not enough it is the meaning that holds the power.

Often, the answers to the research questions are found in the results section of a paper and the meaning is found in the discussion and conclusion section. In the discussion section, you explain the major findings with interpretation, sare recommendations, and provide a conclusion. This requires thought into the usefulness of what you wanted to know. In other words, you are explaining why someone else should care about your work. This is much harder to do than many realize.

**Conclusion**

Research is challenging but if you keep in mind these three keys it will help you to see the big picture of research and o focus on the goals of your study and not so much on the tiny details that encompasses the processes.

# Random Forest Regression in Python

Random forest is simply the making of dozens if not thousands of decision trees. The decision each tree makes about an example are then tallied for the purpose of voting with the classification that receives the most votes winning. For regression, the results of the trees are averaged in order to give the most accurate results

In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.

import pandas as pd import numpy as np from pydataset import data from sklearn.model_selection import train_test_split from sklearn.model_selection import cross_val_score from sklearn.ensemble import RandomForestRegressor from sklearn.metrics import mean_squared_error

We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below

df = data('cancer') df=df.dropna() X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']] y=df['age']

Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) h=RandomForestRegressor(n_estimators=100,random_state=1)

We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.

h.fit(x_train,y_train) y_pred=h.predict(x_test) mean_squared_error(y_test,y_pred) 71.75780196078432

The MSE of 71.75 is only useful for model comparison and has little meaning by its self. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.

model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False) ax=model_ranks.plot(kind='barh')

As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.

**Conclusison**

This post provided an example of the use of regression with random forest. Through the use of ensemble voting, you can improve the accuracy of your models. This is a distinct power that is not available with other machine learning algorithm.

# Support Vector Machines Regression with Python

This post will provide an example of how to do regression with support vector machines SVM. SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.

The steps that we will use are listed below

- Data preparation
- Model Development

We will use two different kernels in our analysis. The LinearSVR kernel and SVR kernel. The difference between these two kernels has to do with slight changes in the calculations of the boundaries between classes.

**Data Preparation
**

We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is that we want to predict family inc (famlinc), which is a continuous variable. Below is some initial code.

import numpy as np import pandas as pd from pydataset import data from sklearn import svm from sklearn import model_selection from statsmodels.tools.eval_measures import mse

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP')) df=df.dropna()

AS in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.

dummy=pd.get_dummies(df['black']) df=pd.concat([df,dummy],axis=1) df=df.rename(index=str, columns={"yes": "black_person"}) df=df.drop('no', axis=1) dummy=pd.get_dummies(df['sex']) df=pd.concat([df,dummy],axis=1) df=df.rename(index=str, columns={"male": "Male"}) df=df.drop('female', axis=1) dummy=pd.get_dummies(df['employed']) df=pd.concat([df,dummy],axis=1) df=df.rename(index=str, columns={"yes": "job"}) df=df.drop('no', axis=1) dummy=pd.get_dummies(df['maried']) df=pd.concat([df,dummy],axis=1) df=df.rename(index=str, columns={"no": "single"}) df=df.drop('yes', axis=1) dummy=pd.get_dummies(df['privins']) df=pd.concat([df,dummy],axis=1) df=df.rename(index=str, columns={"yes": "insured"}) df=df.drop('no', axis=1) df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1) df = (df - df.min()) / (df.max() - df.min()) df.head()

We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']] y=df['faminc']

We can now move to model development

**Model Development**

We now need to create our train and test sets for or X and y datasets. We will do a 70/30 split of the data. Below is the code

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Next, we will create our two models with the code below.

h1=svm.SVR() h2=svm.LinearSVR()

We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better. We will use the .fit() function to train the model and the .predict() function for test the model

The mse was 0.27. This number means nothing only and is only beneficial for comparison reasons. Therefore, the second model will be judged as better or worst only if the mse is lower than 0.27. Below are the results of the second model.

We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.

**Conclusion**

This post provided an example of how to use SVM for regression.

# Analyzing Twitter Data in Python

In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barrack Obama.

- Conduct a sentiment analysis
- Create a word cloud

This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.

- Obtain all of the necessary information from your twitter apps account
- Download the tweets & clean
- Perform the analysis

Before we begin, here is a list of modules we will need to load to complete our analysis

import wordcloud import matplotlib.pyplot as plt import twython import re import numpy

**Obtain all Needed Information**

From your twitter app account, you need the following information

- App key
- App key secret
- Access token
- Access token secret

All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.

TWITTER_APP_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXX TWITTER_APP_KEY_SECRET=XXXXXXXXXXXXXXXXXXX TWITTER_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXX TWITTER_ACCESS_TOKEN_SECRET=XXXXXXXXXXXXXX t=twython.Twython(app_key=TWITTER_APP_KEY,app_secret=TWITTER_APP_KEY_SECRET,oauth_token=TWITTER_ACCESS_TOKEN,oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)

In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.

Next, we need to create a function that will pull the tweets from Twitter. Below is the code,

def get_tweets(twython_object,query,n): count=0 result_generator=twython_object.cursor(twython_object.search,q=query) result_set=[] for r in result_generator: result_set.append(r['text']) count+=1 if count ==n: break return result_set

You will have to figure out the code yourself. We can now download the tweets.

**Downloading Tweets & Clean**

Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.

There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.

tweets={} tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#trump',1500)] tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#obama',1500)]

The get_tweets function is also used in the code above along with our twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis

**Analysis**

To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.

positive_words=open('XXXXXXXXXXXX').read().split('\n') negative_words=open('XXXXXXXXXXXX').read().split('\n')

We now will make a function to calculate the sentiment

def sentiment_score(text,pos_list,neg_list): positive_score=0 negative_score=0 for w in text.split(' '): if w in pos_list:positive_score+=1 if w in neg_list:negative_score+=1 return positive_score-negative_score

Now we create an empty dictionary and run the analysis for Trump and then for Obama

tweets_sentiment={} tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['trump']] tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['obama']]

Now we can make visuals of our results with the code below

trump=plt.hist(tweets_sentiment['trump'],5) obama=plt.hist(tweets_sentiment['obama'],5)

Obama is on the left and trump is on the right. It seems that trump tweets are consistently more positive. Below are the means for both.

numpy.mean(tweets_sentiment['trump']) Out[133]: 0.36363636363636365 numpy.mean(tweets_sentiment['obama']) Out[134]: 0.2222222222222222

Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud

Here is the code for the Obama word cloud

A lot of speculating can be made from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets which changes the results.

**Conclusion**

This post provided an example of how to download and analyze tweets from twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

# Multiple Regression in Python

In this post, we will go through the process of setting up and a regression model with a training and testing set using Python. We will use the insurance dataset from kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

- Data preparation
- Model training
- model testing

**Data Preparation**

Below is a list of the modules we will need in order to complete the analysis

import matplotlib.pyplot as plt import pandas as pd from sklearn import linear_model,model_selection, feature_selection,preprocessing import statsmodels.formula.api as sm from statsmodels.tools.eval_measures import mse from statsmodels.tools.tools import add_constant from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')

We need to create some dummy variables for sex, smoker, and region. We will address that in a moment, right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.

The descriptives are left for your own interpretation. As for the correlations, they are generally weak which is an indication that regression may be appropriate.

As mentioned earlier, we need to make dummy variables sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

- Use the pd.get_dummies function from pandas to create the dummy
- Save the dummy variable in an object called ‘dummy’
- Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
- Repeat this three times

Below is the code for doing this

dummy=pd.get_dummies(insure['sex']) insure=pd.concat([insure,dummy],axis=1) dummy=pd.get_dummies(insure['smoker']) insure=pd.concat([insure,dummy],axis=1) dummy=pd.get_dummies(insure['region']) insure=pd.concat([insure,dummy],axis=1) insure.head()

The .get_dummies function requires the name of the dataframe and in the brackets the name of the variable to convert. The .concat function requires the name of the two datasets to combine as well the axis on which to perform it.

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

y = insure.charges insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

**Model Training**

Are train and test sets are model with the model_selection.trainin_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can not run our regression analysis. This requires the use of the .OLS function from statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable(y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. All of this information is then used inside the .fit() function to fit a model.

To see the output you need to use the .summary() function as shown below.

answer.summary()

The assumption is that you know regression but our reading this post to learn python. Therefore, we will not go into great detail about the results. The r-square is strong, however, the region and gender are not statistically significant.

We will now move to model testing

**Model Testing**

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

print(mse(ypred,y_test)) 33678660.23480476

For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predict values and test set values. This is an alternative metric for assessing a model.

You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involves the .concat() function. The correlation is high at 0.86 which indicates the model is good at accurately predicting the values. THis is confirmed with the scatterplot which is almost a straight line.

**Conclusion**

IN this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and tested a model with an evaluation of it.

# Logistic Regression in Python

This post will provide an example of a logistic regression analysis in Python. Logistic regression is commonly used when the dependent variable is categorical. Our goal will be to predict the gender of an example based on the other variables in the model. Below are the steps we will take to achieve this.

- Data preparation
- Model development
- Model testing
- Model evaluation

**Data Preparation**

The dataset we will use is the ‘Survey of Labour and Income Dynamics’ (SLID) dataset available in the pydataset module in Python. This dataset contains basic data on labor and income along with some demographic information. The initial code that we need is below.

import pandas as pd import statsmodels.api as sm import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn import metrics from pydataset import data

The code above loads all the modules and other tools we will need in this example. We now can load our data. In addition to loading the data, we will also look at the count and the characteristics of the variables. Below is the code.

At the top of this code, we create the ‘df’ object which contains our data from the “SLID”. Next, we used the .count() function to determine if there was any missing data and to see what variables were available. It appears that we have five variables and a lot of missing data as each variable has different amounts of data. Lastly, we used the .head() function to see what each variable contained. It appears that wages, education, and age are continuous variables well sex and language are categorical. The categorical variables will need to be converted to dummy variables as well.

The next thing we need to do is drop all the rows that are missing data since it is hard to create a model when data is missing. Below is the code and the output for this process.

In the code above, we used the .dropna() function to remove missing data. Then we used the .count() function to see how many rows remained. You can see that all the variables have the same number of rows which is important for model analysis. We will now make our dummy variables for sex and language in the code below.

Here is what we did,

- We used the .get_dummies function from pandas first on the sex variable. All this was stored in a new object called “dummy”
- We then combined the dummy and df datasets using the .concat() function. The axis =1 argument is for combing by column.
- We repeat steps 1 and 2 for the language variable
- Lastly, we used the .head() function to see the results

With this, we are ready to move to model development.

**Model Development**

The first thing we need to do is put all of the independent variables in one dataframe and the dependent variable in its own dataframe. Below is the code for this

X=df[['wages','education','age',"French","Other"]] y=df['Male']

Notice that we did not use every variable that was available. For the language variables, we only used “French” and “Other”. This is because when you make dummy variables you only need k-1 dummies created. Since the language variable had three categories we only need two dummy variables. Therefore, we excluded “English” because when “French” and “Other” are coded 0 it means that “English” is the characteristic of the example.

In addition, we only took “male” as our dependent variable because if “male” is set to 0 it means that example is female. We now need to create our train and test dataset. The code is below.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We created four datasets

- train dataset with the independent variables
- train dataset with the dependent variable
- test dataset with the independent variables
- test dataset with the independent variable

The split is 70/30 with 70% being used for the training and 30% being used for testing. This is the purpose of the “test_size” argument. we used the train_test_split function to do this. We can now run our model and get the results. Below is the code.

Here is what we did

- We used the .Logit() function from statsmodel to create the logistic model. Notice we used only the training data.
- We then use the .fit() function to get the results and stored this in the result object.
- Lastly, we printed the results in the ‘result’ object using the .summary()

There are some problems with the results. The Pseudo R-square is infinity which is usually. Also, you may have some error output about hessian inversion in your output. For these reasons, we cannot trust the results but will continue for the sake of learning.

The coefficients are clear. Only wage, education, and age are significant. In order to determine the probability you have to take the coefficient from the model and use the .exp() function from numpy. Below are the results.

np.exp(.08) Out[107]: 1.0832870676749586 np.exp(-0.06) Out[108]: 0.9417645335842487 np.exp(-.01) Out[109]: 0.9900498337491681

For the first value, for every unit wages increaser the probability that they are male increase 8%. For every 1 unit increase in education there probability of the person being male decrease 6%. Lastly, for every one unit increase in age the probability of the person being male decrease by 1%. Notice that we subtract 1 from the outputs to find the actual probability.

We will now move to model testing

**Model Testing**

To do this we first test our model with the code below

y_pred=result.predict(X_test)

We made the result object earlier. Now we just use the .predict() function with the X_test data. Next, we need to flag examples that the model believes has a 60% chance or greater of being male. The code is below

y_pred_flag=y_pred>.6

This creates a boolean object with True and False as the output. Now we will make our confusion matrix as well as other metrics for classification.

The results speak for themselves. There are a lot of false positives if you look at the confusion matrix. In addition precision, recall, and f1 are all low. Hopefully, the coding should be clear the main point is to be sure to use the test set dependent dataset (y_test) with the flag data you made in the previous step.

We will not make the ROC curve. For a strong model, it should have a strong elbow shape while with a weak model it will be a diagonal straight line.

The first plot is of our data. The second plot is what a really bad model would look like. As you can see there is littte difference between the two. Again this is because of all the false positives we have in the model. The actual coding should be clear. fpr is the false positive rate, tpr is the true positive rate. The function is .roc_curve. Inside goes the predict vs actual test data.

**Conclusion**

This post provided a demonstration of the use of logistic regression in Python. It is necessary to follow the steps above but keep in mind that this was a demonstration and the results are dubious.

# Prerequistes to Conducting Research

Some of the biggest challenges in helping students with research is their lack of preparation. The problem is not an ignorance of statistics or research design as that takes only a little bit of support. The real problem is that students want to do research without hardly reading any research and lacking knowledge of how research writing is communicated. This post will share some prerequisites to performing research.

**Read Extensively**

Extensive reading means reading broadly about a topic and not focusing too much on specifics. Therefore, you read indiscriminately perhaps limited yourself only to your general discipline.

In order to communicate research, you must first be familiar with the vocabulary and norms of research. This can be learned to a great extent through reading academic empirical articles.

The ananoloy I like to use is how a baby learns. By spends large amounts of time being exposed to the words and actions of others. The baby has no real idea in terms of what is going on at first. However, after continuous exposure, the child begins to understand the words and actions fo those around them and even begins to mimic the behaviors.

In many ways, this is the purpose of reading a great deal before even attempting to do any research. Just as the baby, a writer needs to observe how others do things, continue this process even if they do not understand, and attempt to imitate the desired behaviors. You must understand the forms of communication as well as the cultural expectations of research writing and this can only happen through direct observation.

At the end of this experience, you begin to notice a pattern in terms of the structure of research writing. The style is highly ridge with litter variation.

It is hard to say how much extensive reading a person needs. Generally, the more reading that was done in the past the less reading needed to understand the structure of research writing. If you hate to read and did little reading in the past you will need to read a lot more to understand research writing then someone with an extensive background in reading. In addition, if you are trying to write in a second language you will need to read much more than someone writing in their native language.

If you are still desirous of a hard number of articles to read I would say aim for the following

- Native who loves to read-at least 25 articles
- Native who hates to read-at least 40 articles
- Non-native reader-60 articles or more

Extensive reading is just reading. There is no notetaking or even highlighting. You are focusing on exposure only. Just as the observant baby so you are living in the moment trying to determine what is the appropriate behavior. If you don’t understand you need to keep going anyway as the purpose is quantity and not quality. Generally, when the structure of the writing begins to become redundant ad you can tell what the author is doing without having to read too closely you are ready to move on.

**Read Intensively**

Intensive reading is reading more for understanding. This involves slows with the goal of deeper understanding. Now you select something, in particular, you want to know. Perhaps you want to become more familiar with the writing of one excellent author or maybe there is one topic in particular that you are interested in. With intensive writing, you want to know everything that is happening in the text. To achieve this you read fewer articles and focus much more on quality over quantity.

By the end of the extensive and intensive reading, you should be familiar with the following.

- The basic structure of research writing even if you don’t understand why it is the way it is.
- A more thorough understanding of something specific you read about during your intensive reading.
- Some sense of purpose in terms of what you need to do for your own writing.
- A richer vocabulary and content knowledge related to your field.

**Write Academicly**

Once a student has read a lot of research there is some hope that they can now attempt to write in this style. As the teacher, it is my responsibility to point out the structure of research writing which involves such as ideas as the 5 sections and the parts of each section.

Students grasp this but they often cannot build paragraphs. In order to write academic research, you must know the purpose of main ideas, supporting details, and writing patterns. If these terms are unknown to you it will be difficult to write research that is communicated clearly.

The main idea is almost always the first sentence of a paragraph and writing patterns provide different ways to organize the supporting details. This involves understanding the purpose of each paragraph that is written which is a task that many students could not explain. This is looking at writing from a communicative or discourse perspective and not at a minute detail or grammar one.

The only way to do this is to practice writing. I often will have students develop several different reviews of literature. During this experience, they learn how to share the ideas of others. The next step is developing a proposal in which the student shares their ideas and someone else’s. The final step is writing a formal research paper.

**Conclusion**

To write you must first observe how others write. Then you need to imitate what you saw. Once you can do it what others have done it will allow you to ask questions about why things are this way. Too often, people just want to write without even understanding what they are trying to do. This leads to paralysis at best (I don’t know what to do) to a disaster at worst (spending hours confidently writing garbage). The enemy to research is not methodology as many people write a lot without knowledge of stats or research design because they collaborate. The real enemy of research is neglecting the preparation of reading and the practicing of writing.

# Writing Discussion & Conclusions in Research

The Discussion & Conclusion section of a research article/thesis/dissertation is probably the trickiest part of a project to write. Unlike the other parts of a paper, the Discussion & Conclusions are hard to plan in advance as it depends on the results. In addition, since this is the end of a paper the writer is often excited and wants to finish it quickly, which can lead to superficial analysis.

This post will discuss common components of the Discussion & Conclusion section of a paper. Not all disciplines have all of these components nor do they use the same terms as the ones mentioned below.

**Discussion**

The discussion is often a summary of the findings of a paper. For a thesis/dissertation, you would provide the purpose of the study again but you probably would not need to share this in a short article. In addition, you also provide highlights of what you learn with interpretation. In the results section of a paper, you simply state the statistical results. In the discussion section, you can now explain what those results mean for the average person.

The ordering of the summary matters as well. Some recommend that you go from the most important finding to the least important. Personally, I prefer to share the findings by the order in which the research questions are presented. This maintains a cohesiveness across sections of a paper that a reader can appreciate. However, there is nothing superior to either approach. Just remember to connect the findings with the purpose of the study as this helps to connect the themes of the paper together.

What really makes this a discussion is to compare/contrast your results with the results of other studies and to explain why the results are similar and or different. You also can consider how your results extend the works of other writers. This takes a great deal of critical thinking and familiarity with the relevant literature.

**Recommendation/Implications**

The next component of this final section of the paper is either recommendations or implications but almost never both. Recommendations are practical ways to apply the results of this study through action. For example, if your study finds that sleeping 8 hours a night improves test scores then the recommendation would be that students should sleep 8 hours a night to improve their test scores. This is not an amazing insight but the recommendations must be grounded in the results and not just opinion.

Implications, on the other hand, explain why the results are important. Implications are often more theoretical in nature and lack the application of recommendations. Often implications are used when it is not possible to provide a strong recommendation.

The terms conclusion and implications are often used interchangeably in different disciplines and this is highly confusing. Therefore, keep in mind your own academic background when considering what these terms mean.

There is one type of recommendation that is almost always present in a study and that is recommendations for further study. This is self-explanatory but recommendations for further study are especially important if the results are preliminary in nature. A common way to recommend further studies is to deal with inconclusive results in the current study. In other words, if something weird happened in your current paper or if something surprised you this could be studied in the future. Another term for this is “suggestions for further research.”

**Limitations**

Limitations involve discussing some of the weaknesses of your paper. There is always some sort of weakness with a sampling method, statistical analysis, measurement, data collection etc. This section is an opportunity to confess these problems in a transparent matter that further researchers may want to control for.

**Conclusion**

Finally, the conclusion of the Discussion & Conclusion is where you try to summarize the results in a sentence or two and connect them with the purpose of the study. In other words, trying to shrink the study down to a one-liner. If this sounds repetitive it is and often the conclusion just repeats parts of the discussion.

**Blog Conclusion**

This post provides an overview of writing the final section of a research paper. The explanation here provides just one view on how to do this. Every discipline and every researcher has there own view on how to construct this section of a paper.

# Shaping the Results of a Research Paper

Writing the results of a research paper is difficult. As a researcher, you have to try and figure out if you answered the question. In addition, you have to figure out what information is important enough to share. As such it is easy to get stuck at this stage of the research experience. Below are some ideas to help with speeding up this process.

**Consider the Order of the Answers**

This may seem obvious but probably the best advice I could give a student when writing their results section is to be sure to answer their questions in the order they presented them in the introduction of their study. This helps with cohesion and coherency. The reader is anticipating answers to these questions and they often subconsciously remember the order the questions came in.

If a student answers the questions out of order it can be jarring for the reader. When this happens the reader starts to double check what the questions were and they begin to second-guess their understanding of the paper which reflects poorly on the writer. An analogy would be that if you introduce three of your friends to your parents you might share each person’s name and then you might go back and share a little bit of personal information about each friend. When we do this we often go in order 1st 2nd 3rd friend and then going back and talking about the 1st friend. The same courtesy should apply when answering research questions in the results section. Whoever was first is shared first etc.

**Consider how to Represent the Answers**

Another aspect to consider is the presentation of the answers. Should everything be in text? What about the use of visuals and tables? The answers depend on several factors

- If you have a small amount of information to share writing in paragraphs is practical. Defining small depends on how much space you have to write as well but generally anything more than five ideas should be placed in a table and referred too.
- Tables are for sharing large amounts of information. If an answer to a research question requires more than five different pieces of information a table may be best. You can extract really useful information from a table and place it directly in paragraphs while referring the reader to the table for even more information.
- Visuals such as graphs and plots are not used as frequently in research papers as I would have thought. This may be because they take up so much space in articles that usually have page limits. In addition, readers of an academic journal are pretty good at visually results mentally based on numbers that can be placed in a table. Therefore, visuals are most appropriate for presentations and writing situations in which there are fewer constraints on the length of the document such as a thesis or dissertation.

**Know when to Interpret**

Sometimes I have had students try to explain the results while presenting them. I cannot say this is wrong, however, it can be confusing. The reason it is so confusing is that the student is trying to do two things at the same time which are present the results and interpret them. This would be ok in a presentation and even expected but when someone is reading a paper it is difficult to keep two separate threads of thought going at the same time. Therefore, the meaning or interpretation of the results should be saved for the Discussion Conclusion section.

**Conclusion**

Presenting the results is in many ways the high point of a research experience. It is not easy to take numerical results and try to capture the useful information clearly. As such, the advice given here is intending to help support this experience

# Purpose of a Quantitative Methodology

Students often struggle with shaping their methodology section in their paper. The problem is often that students do not see the connection between the different sections of a research paper. This inability to connect the dots leads to isolated thinking on the topic and inability to move forward.

The methodology section of a research paper plays a critical role. In brief, the purpose of a methodology is to explain to your readers how you will answer your research questions. In the strictest sense, this is important for reproducing a study. Therefore, what is really important when writing a methodology is the research questions of the study. The research questions determine the following of a methodology.

- Sample
- Setting
- Research design
- Scales
- Data collection
- Data analysis

What this means is that a student must know what they want to know in order to explain how they will find the answers. Below is a description of these sections along with one section that is not often influenced by the research questions.

**Sample & Setting**

In the sample section of the methodology, it is common or the student to explain the setting of the study, provide some demographics, and explain the sampling method. In this section of the methodology, the goal is to describe what the reader needs to know about the participants in order to understand the context from which the results were derived.

**Research Design & Scales**

The research design explains specifically how the data was collected. There are several standard ways to do this in the social sciences such.

- Survey design
- experimental design
- correlational design

Within this section, some academic disciplines also explain the scales or the tool used to measure the variable(s) of the study. Again, it is impossible to develop this section of the research questions are unclear or unknown.

**Data Analysis**

The data analysis section provides an explanation of how the answers were calculated in a study. Success in this section requires a knowledge of the various statistical tools that are available. However, understanding the research questions is key to articulating this section clearly.

**Ethics**

A final section in many methodologies is ethics. The ethical section is a place where the student can explain how the protected participant’s anonymity, made sure to get the permission and other aspects of morals. This section is required by most universities in order to gain permission to do research. However, it is often missing from journals.

**Conclusion**

The methodology is part of the larger picture of communicating one’s research. It is important that a research paper is not seen as isolated parts but rather as a whole. The reason for this position is that a paper cannot make sense on its own if any of these aspects are missing.

# Tips for Writing a Quantitative Review of Literature

Writing a review of literature can be challenging for students. The purpose here is to try and synthesize a huge amount of information and to try and communicate it clearly to someone who has not read what you have read.

From my experience working with students, I have developed several tips that help them to make faster decisions and to develop their writing as well.

**Remember the Purpose**

Often a student will collect as many articles as possible and try to throw them all together to make a review of the literature. This naturally leads to problems of the paper sounded like a shopping list of various articles. Neither interesting nor coherent.

Instead, when writing a review of literature a student should keep in mind the question

What do my readers need to know in order to understand my study?

This is a foundational principle when writing. Readers don’t need to know everything only what they need to know to appreciate the study they are ready. An extension of this is that different readers need to know different things. As such, there is always a contextual element to framing a review of the literature.

**Consider the Format**

When working with a student, I always recommend the following format to get there writing started.

For each major variable in your study do the following…

- Define it
- Provide examples or explain theories about it
- Go through relevant studies thematically

*Definition*

There first thing that needs to be done is to provide a definition of the construct. This is important because many constructs are defined many different ways. This can lead to confusion if the reader is thinking one definition and the writer is thinking another.

*Examples and Theories*

Step 2 is more complex. After a definition is provided the student can either provide an example of what this looks like in the real world and or provide more information in regards to theories related to the construct.

Sometimes examples are useful. For example, if writing a paper on addiction it would be useful to not only define it but also to provide examples of the symptoms of addiction. The examples help the reader to see what used to be an abstract definition in the real world.

Theories are important for providing a deeper explanation of a construct. Theories tend to be highly abstract and often do not help a reader to understand the construct better. One benefit of theories is that they provide a historical background of where the construct came from and can be used to develop the significance of the study as the student tries to find some sort of gap to explore in their own paper.

Often it can be beneficial to include both examples and theories as this demonstrates stronger expertise in the subject matter. In theses and dissertations, both are expected whenever possible. However, for articles space limitations and knowing the audience affects the inclusion of both.

*Relevant Studies*

The relevant studies section is similar breaking news on CNN. The relevant studies should generally be newer. In the social sciences, we are often encouraged to look at literature from the last five years, perhaps ten years in some cases. Generally, readers want to know what has happened recently as experience experts are familiar with older papers. This rule does not apply as strictly to theses and dissertations.

Once recent literature has been found the student needs to organize it thematically. The reason for a thematic organization is that the theme serves as the main idea of the section and the studies themselves serve as the supporting details. This structure is surprisingly clear for many readers as the predictable nature allows the reader to focus on content rather than on trying to figure out what the author is tiring to say. Below is an example

There are several challenges with using technology in class(ref, 2003; ref 2010). For example, Doe (2009) found that technology can be unpredictable in the classroom. James (2010) found that like of training can lead some teachers to resent having to use new technology

The main idea here is “challenges with technology.” The supporting details are Doe (2009) and James (2010). This concept of themes is much more complex than this and can include several paragraphs and or pages.

**Conclusion**

This process really cuts down on the confusion of students writing. For stronger students, they can be free to do what they want. However, many students require structure and guidance when the first begin writing research papers

# Common Problems with Research for Students

I have worked with supporting undergrad and graduate students with research projects for several years. This post is what I consider to be the top reasons why students and even the occasional faculty member struggles to conduct research. The reasons are as follows

- They don’t read
- No clue what a problem is
- No questions
- No clue how to measure
- No clue how to analyze
- No clue how to report

**Lack of Reading**

The first obstacle to conducting research is that students frequently do not read enough to conceptualize how research is done. Reading not just anything bust specifically research allows a student to synthesize the vocabulary and format of research writing. You cannot do research unless you first read research. This axiom applies to all genres of writing.

A common complaint is the difficulty with understanding research articles. For whatever reason, the academic community has chosen to write research articles in an exceedingly dense and unclear manner. This is not going to change because one graduate student cannot understand what the experts are saying. Therefore, the only solution to understand research English is exposure to this form of communication.

**Determining the Problem**

If a student actually reads they often go to the extreme of trying to conduct Nobel Prize type research. In other words, their expectations are overinflated given what they know. What this means is that the problem they want to study is infeasible given the skillset they currently possess.

The opposite extreme is to find such a minute problem that nobody cares about it. Again, reading will help in avoiding this two pitfalls.

Another problem is not knowing exactly how to articulate a problem. A student will come to me with excellent examples of a problem but they never abstract or take a step away from the examples of the problem to develop a researchable problem. There can be no progress without a clearly defined research problem.

**Lack the Ability to Ask Questions about the Problem**

If a student actually has a problem they never think of questions that they want to answer about the problem. Another extreme is they ask questions they cannot answer. Without question, you can never better understand your problem. Bad questions or no questions means no answers.

Generally, there are three types of quantitative research questions while qualitative is more flexible. If a student does not know this they have no clue how to even begin to explore their problem.

**Issues with Measurement**

Let’s say a student does know what their questions are, the next mystery for many is measuring the variables if the study is quantitative. This is were applying statistical knowledge rather than simply taking quizzes and test comes to play. The typical student does not understand often how to operationalize their variables and determine what type of variables they will include in their study. If you don’t know how you will measure your variables you cannot answer any questions about your problem.

**Lost at the Analysis Stage**

The measurement affects the analysis. I cannot tell you how many times a student or even a colleague wanted me to analyze their data without telling me what the research questions were. How can you find answers without questions? The type of measurement affects the potential ways of analyzing data. How you summary categorical data is different from continuous data. Lacking this knowledge leads to inaction.

**No Plan for the Write-Up**

If a student makes it to this stage, firstly congratulations are in order, however, many students have no idea what to report or how. This is because students lose track of the purpose of their study which was to answer their research questions about the problem. Therefore, in the write-up, you present the answers systematically. First, you answer question 1, then 2, etc.

If necessary you include visuals of the answers. Again Visuals are determined by the type of variable as well as the type of question. A top reason for article rejection is an unclear write-up. Therefore, great care is needed in order for this process to be successful.

**Conclusion**

Whenever I deal with research students I often walk through these six concepts. Most students never make it past the second or third concept. Perhaps the results will differ for others.

Successful research writing requires the ability to see the big picture and connection the various section of a paper so that the present a cohesive whole. Too many students focus on the little details and forget the purpose of their study. Losing the main idea makes the details worthless.

If I left out any common problems with research please add them in the comments section.

# Local Regression in R

Local regression uses something similar to nearest neighbor classification to generate a regression line. In local regression, nearby observations are used to fit the line rather than all observations. It is necessary to indicate the percentage of the observations you want R to use for fitting the local line. The name for this hyperparameter is the span. The higher the span the smoother the line becomes.

Local regression is great one there are only a handful of independent variables in the model. When the total number of variables becomes too numerous the model will struggle. As such, we will only fit a bivariate model. This will allow us to process the model and to visualize it.

In this post, we will use the “Clothing” dataset from the “Ecdat” package and we will examine innovation (inv2) relationship with total sales (tsales). Below is some initial code.

`library(Ecdat)`

```
data(Clothing)
str(Clothing)
```

```
## 'data.frame': 400 obs. of 13 variables:
## $ tsales : int 750000 1926395 1250000 694227 750000 400000 1300000 495340 1200000 495340 ...
## $ sales : num 4412 4281 4167 2670 15000 ...
## $ margin : num 41 39 40 40 44 41 39 28 41 37 ...
## $ nown : num 1 2 1 1 2 ...
## $ nfull : num 1 2 2 1 1.96 ...
## $ npart : num 1 3 2.22 1.28 1.28 ...
## $ naux : num 1.54 1.54 1.41 1.37 1.37 ...
## $ hoursw : int 76 192 114 100 104 72 161 80 158 87 ...
## $ hourspw: num 16.8 22.5 17.2 21.5 15.7 ...
## $ inv1 : num 17167 17167 292857 22207 22207 ...
## $ inv2 : num 27177 27177 71571 15000 10000 ...
## $ ssize : int 170 450 300 260 50 90 400 100 450 75 ...
## $ start : num 41 39 40 40 44 41 39 28 41 37 ...
```

There is no data preparation in this example. The first thing we will do is fit two different models that have different values for the span hyperparameter. “fit” will have a span of .41 which means it will use 41% of the nearest examples. “fit2” will use .82. Below is the code.

```
fit<-loess(tsales~inv2,span = .41,data = Clothing)
fit2<-loess(tsales~inv2,span = .82,data = Clothing)
```

In the code above, we used the “loess” function to fit the model. The “span” argument was set to .41 and .82.

We now need to prepare for the visualization. We begin by using the “range” function to find the distance from the lowest to the highest value. Then use the “seq” function to create a grid. Below is the code.

```
inv2lims<-range(Clothing$inv2)
inv2.grid<-seq(from=inv2lims[1],to=inv2lims[2])
```

The information in the code above is for setting our x-axis in the plot. We are now ready to fit our model. We will fit the models and draw each regression line.

```
plot(Clothing$inv2,Clothing$tsales,xlim=inv2lims)
lines(inv2.grid,predict(fit,data.frame(inv2=inv2.grid)),col='blue',lwd=3)
lines(inv2.grid,predict(fit2,data.frame(inv2=inv2.grid)),col='red',lwd=3)
```

Not much difference in the two models. For our final task, we will predict with our “fit” model using all possible values of “inv2” and also fit the confidence interval lines.

```
pred<-predict(fit,newdata=inv2.grid,se=T)
plot(Clothing$inv2,Clothing$tsales)
lines(inv2.grid,pred$fit,col='red',lwd=3)
lines(inv2.grid,pred$fit+2*pred$se.fit,lty="dashed",lwd=2,col='blue')
lines(inv2.grid,pred$fit-2*pred$se.fit,lty="dashed",lwd=2,col='blue')
```

**Conclusion**

Local regression provides another way to model complex non-linear relationships in low dimensions. The example here provides just the basics of how this is done is much more complicated than described here.

# Smoothing Splines in R

This post will provide information on smoothing splines. Smoothing splines are used in regression when we want to reduce the residual sum of squares by adding more flexibility to the regression line without allowing too much overfitting.

In order to do this, we must tune the parameter called the smoothing spline. The smoothing spline is essentially a natural cubic spline with a knot at every unique value of x in the model. Having this many knots can lead to severe overfitting. This is corrected for by controlling the degrees of freedom through the parameter called lambda. You can manually set this value or select it through cross-validation.

We will now look at an example of the use of smoothing splines with the “Clothing” dataset from the “Ecdat” package. We want to predict “tsales” based on the use of innovation in the stores. Below is some initial code.

`library(Ecdat)`

```
data(Clothing)
str(Clothing)
```

```
## 'data.frame': 400 obs. of 13 variables:
## $ tsales : int 750000 1926395 1250000 694227 750000 400000 1300000 495340 1200000 495340 ...
## $ sales : num 4412 4281 4167 2670 15000 ...
## $ margin : num 41 39 40 40 44 41 39 28 41 37 ...
## $ nown : num 1 2 1 1 2 ...
## $ nfull : num 1 2 2 1 1.96 ...
## $ npart : num 1 3 2.22 1.28 1.28 ...
## $ naux : num 1.54 1.54 1.41 1.37 1.37 ...
## $ hoursw : int 76 192 114 100 104 72 161 80 158 87 ...
## $ hourspw: num 16.8 22.5 17.2 21.5 15.7 ...
## $ inv1 : num 17167 17167 292857 22207 22207 ...
## $ inv2 : num 27177 27177 71571 15000 10000 ...
## $ ssize : int 170 450 300 260 50 90 400 100 450 75 ...
## $ start : num 41 39 40 40 44 41 39 28 41 37 ...
```

We are going to create three models. Model one will have 70 degrees of freedom, model two will have 7, and model three will have the number of degrees of freedom are determined through cross-validation. Below is the code.

```
fit1<-smooth.spline(Clothing$inv2,Clothing$tsales,df=57)
fit2<-smooth.spline(Clothing$inv2,Clothing$tsales,df=7)
fit3<-smooth.spline(Clothing$inv2,Clothing$tsales,cv=T)
```

```
## Warning in smooth.spline(Clothing$inv2, Clothing$tsales, cv = T): cross-
## validation with non-unique 'x' values seems doubtful
```

`(data.frame(fit1$df,fit2$df,fit3$df))`

```
## fit1.df fit2.df fit3.df
## 1 57 7.000957 2.791762
```

In the code above we used the “smooth.spline” function which comes with base r.Notice that we did not use the same coding syntax as the “lm” function calls for. The code above also indicates the degrees of freedom for each model. You can see that for “fit3” the cross-validation determine that 2.79 was the most appropriate degrees of freedom. In addition, if you type in the following code.

`sapply(data.frame(fit1$x,fit2$x,fit3$x),length)`

```
## fit1.x fit2.x fit3.x
## 73 73 73
```

You will see that there are only 73 data points in each model. The “Clothing” dataset has 400 examples in it. The reason for this reduction is that the “smooth.spline” function only takes unique values from the original dataset. As such, though there are 400 examples in the dataset only 73 of them are unique.

Next, we plot our data and add regression lines

```
plot(Clothing$inv2,Clothing$tsales)
lines(fit1,col='red',lwd=3)
lines(fit2,col='green',lwd=3)
lines(fit3,col='blue',lwd=3)
legend('topright',lty=1,col=c('red','green','blue'),c("df = 57",'df=7','df=CV 2.8'))
```

You can see that as the degrees of freedom increase so does the flexibility in the line. The advantage of smoothing splines is to have a more flexible way to assess the characteristics of a dataset.

# Polynomial Spline Regression in R

Normally, when least squares regression is used, you fit one line to the model. However, sometimes you may want enough flexibility that you fit different lines over different regions of your independent variable. This process of fitting different lines over different regions of X is known as Regression Splines.

How this works is that there are different coefficient values based on the regions of X. As the researcher, you can set the cutoff points for each region. The cutoff point is called a “knot.” The more knots you use the more flexible the model becomes because there are fewer data points with each range allowing for more variability.

We will now go through an example of polynomial regression splines. Remeber that polynomial means that we will have a curved line as we are using higher order polynomials. Our goal will be to predict total sales based on the amount of innovation a store employs. We will use the “Ecdat” package and the “Clothing” dataset. In addition, we will need the “splines” package. The code is as follows.

`library(splines);library(Ecdat)`

`data(Clothing)`

We will now fit our model. We must indicate the number and placement of the knots. This is commonly down at the 25th 50th and 75th percentile. Below is the code

`fit<-lm(tsales~bs(inv2,knots = c(12000,60000,150000)),data = Clothing)`

In the code above we used the traditional “lm” function to set the model. However, we also used the “bs” function which allows us to create our spline regression model. The argument “knots” was set to have three different values. Lastly, the dataset was indicated.

Remember that the default spline model in R is a third-degree polynomial. This is because it is hard for the eye to detect the discontinuity at the knots.

We now need X values that we can use for prediction purposes. In the code below we first find the range of the “inv2” variable. We then create a grid that includes all the possible values of “inv2” in increments of 1. Lastly, we use the “predict” function to develop the prediction model. We set the “se” argument to true as we will need this information. The code is below.

```
inv2lims<-range(Clothing$inv2)
inv2.grid<-seq(from=inv2lims[1],to=inv2lims[2])
pred<-predict(fit,newdata=list(inv2=inv2.grid),se=T)
```

We are now ready to plot our model. The code below graphs the model and includes the regression line (red), confidence interval (green), as well as the location of each knot (blue)

```
plot(Clothing$inv2,Clothing$tsales,main="Regression Spline Plot")
lines(inv2.grid,pred$fit,col='red',lwd=3)
lines(inv2.grid,pred$fit+2*pred$se.fit,lty="dashed",lwd=2,col='green')
lines(inv2.grid,pred$fit-2*pred$se.fit,lty="dashed",lwd=2,col='green')
segments(12000,0,x1=12000,y1=5000000,col='blue' )
segments(60000,0,x1=60000,y1=5000000,col='blue' )
segments(150000,0,x1=150000,y1=5000000,col='blue' )
```

When this model was created it was essentially three models connected. Model on goes from the first blue line to the second. Model 2 goes form the second blue line to the third and model three was from the third blue line until the end. This kind of flexibility is valuable in understanding nonlinear relationship

# Logistic Polynomial Regression in R

Polynomial regression is used when you want to develop a regression model that is not linear. It is common to use this method when performing traditional least squares regression. However, it is also possible to use polynomial regression when the dependent variable is categorical. As such, in this post, we will go through an example of logistic polynomial regression.

Specifically, we will use the “Clothing” dataset from the “Ecdat” package. We will divide the “tsales” dependent variable into two categories to run the analysis. Below is the code to get started.

`library(Ecdat)`

`data(Clothing)`

There is little preparation for this example. Below is the code for the model

`fitglm<-glm(I(tsales>900000)~poly(inv2,4),data=Clothing,family = binomial)`

Here is what we did

1. We created an object called “fitglm” to save our results

2. We used the “glm” function to process the model

3. We used the “I” function. This told R to process the information inside the parentheses as is. As such, we did not have to make a new variable in which we split the “tsales” variable. Simply, if sales were greater than 900000 it was code 1 and 0 if less than this amount.

4. Next, we set the information for the independent variable. We used the “poly” function. Inside this function, we placed the “inv2” variable and the highest order polynomial we want to explore.

5. We set the data to “Clothing”

6. Lastly, we set the “family” argument to “binomial” which is needed for logistic regression

Below is the results

`summary(fitglm)`

```
##
## Call:
## glm(formula = I(tsales > 9e+05) ~ poly(inv2, 4), family = binomial,
## data = Clothing)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5025 -0.8778 -0.8458 1.4534 1.5681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.074 2.685 1.145 0.2523
## poly(inv2, 4)1 641.710 459.327 1.397 0.1624
## poly(inv2, 4)2 585.975 421.723 1.389 0.1647
## poly(inv2, 4)3 259.700 178.081 1.458 0.1448
## poly(inv2, 4)4 73.425 44.206 1.661 0.0967 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.57 on 399 degrees of freedom
## Residual deviance: 493.51 on 395 degrees of freedom
## AIC: 503.51
##
## Number of Fisher Scoring iterations: 13
```

It appears that only the 4th-degree polynomial is significant and barely at that. We will now find the range of our independent variable “inv2” and make a grid from this information. Doing this will allow us to run our model using the full range of possible values for our independent variable.

```
inv2lims<-range(Clothing$inv2)
inv2.grid<-seq(from=inv2lims[1],to=inv2lims[2])
```

The “inv2lims” object has two values. The lowest value in “inv2” and the highest value. These values serve as the highest and lowest values in our “inv2.grid” object. This means that we have values started at 350 and going to 400000 by 1 in a grid to be used as values for “inv2” in our prediction model. Below is our prediction model.

`predsglm<-predict(fitglm,newdata=list(inv2=inv2.grid),se=T,type="response")`

Next, we need to calculate the probabilities that a given value of “inv2” predicts a store has “tsales” greater than 900000. The equation is as follows.

`pfit<-exp(predsglm$fit)/(1+exp(predsglm$fit))`

Graphing this leads to interesting insights. Below is the code

`plot(pfit)`

You can see the curves in the line from the polynomial expression. As it appears. As inv2 increase the probability increase until the values fall between 125000 and 200000. This is interesting, to say the least.

We now need to plot the actual model. First, we need to calculate the confidence intervals. This is done with the code below.

```
se.bandsglm.logit<-cbind(predsglm$fit+2*predsglm$se.fit,predsglm$fit-2*predsglm$se.fit)
se.bandsglm<-exp(se.bandsglm.logit)/(1+exp(se.bandsglm.logit))
```

The ’se.bandsglm” object contains the log odds of each example and the “se.bandsglm” has the probabilities. Now we plot the results

```
plot(Clothing$inv2,I(Clothing$tsales>900000),xlim=inv2lims,type='n')
points(jitter(Clothing$inv2),I((Clothing$tsales>900000)),cex=2,pch='|',col='darkgrey')
lines(inv2.grid,pfit,lwd=4)
matlines(inv2.grid,se.bandsglm,col="green",lty=6,lwd=6)
```

In the code above we did the following.

1. We plotted our dependent and independent variables. However, we set the argument “type” to n which means nothing. This was done so we can add the information step-by-step.

2. We added the points. This was done using the “points” function. The “jitter” function just helps to spread the information out. The other arguments (cex, pch, col) our for aesthetics and our optional.

3. We add our logistic polynomial line based on our independent variable grid and the “pfit” object which has all of the predicted probabilities.

4. Last, we add the confidence intervals using the “matlines” function. Which includes the grid object as well as the “se.bandsglm” information.

You can see that these results are similar to when we only graphed the “pfit” information. However, we also add the confidence intervals. You can see the same dip around 125000-200000 were there is also a larger confidence interval. if you look at the plot you can see that there are fewer data points in this range which may be what is making the intervals wider.

**Conclusion**

Logistic polynomial regression allows the regression line to have more curves to it if it is necessary. This is useful for fitting data that is non-linear in nature.

# Polynomial Regression in R

Polynomial regression is one of the easiest ways to fit a non-linear line to a data set. This is done through the use of higher order polynomials such as cubic, quadratic, etc to one or more predictor variables in a model.

Generally, polynomial regression is used for one predictor and one outcome variable. When there are several predictor variables it is more common to use generalized additive modeling/ In this post, we will use the “Clothing” dataset from the “Ecdat” package to predict total sales with the use of polynomial regression. Below is some initial code.

`library(Ecdat)`

`data(Clothing) str(Clothing)`

```
## 'data.frame': 400 obs. of 13 variables:
## $ tsales : int 750000 1926395 1250000 694227 750000 400000 1300000 495340 1200000 495340 ...
## $ sales : num 4412 4281 4167 2670 15000 ...
## $ margin : num 41 39 40 40 44 41 39 28 41 37 ...
## $ nown : num 1 2 1 1 2 ...
## $ nfull : num 1 2 2 1 1.96 ...
## $ npart : num 1 3 2.22 1.28 1.28 ...
## $ naux : num 1.54 1.54 1.41 1.37 1.37 ...
## $ hoursw : int 76 192 114 100 104 72 161 80 158 87 ...
## $ hourspw: num 16.8 22.5 17.2 21.5 15.7 ...
## $ inv1 : num 17167 17167 292857 22207 22207 ...
## $ inv2 : num 27177 27177 71571 15000 10000 ...
## $ ssize : int 170 450 300 260 50 90 400 100 450 75 ...
## $ start : num 41 39 40 40 44 41 39 28 41 37 ...
```

We are going to use the “inv2” variable as our predictor. This variable measures the investment in automation by a particular store. We will now run our polynomial regression model.

```
fit<-lm(tsales~poly(inv2,5),data = Clothing)
summary(fit)
```

```
##
## Call:
## lm(formula = tsales ~ poly(inv2, 5), data = Clothing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -946668 -336447 -96763 184927 3599267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 833584 28489 29.259 < 2e-16 ***
## poly(inv2, 5)1 2391309 569789 4.197 3.35e-05 ***
## poly(inv2, 5)2 -665063 569789 -1.167 0.2438
## poly(inv2, 5)3 49793 569789 0.087 0.9304
## poly(inv2, 5)4 1279190 569789 2.245 0.0253 *
## poly(inv2, 5)5 -341189 569789 -0.599 0.5497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 569800 on 394 degrees of freedom
## Multiple R-squared: 0.05828, Adjusted R-squared: 0.04633
## F-statistic: 4.876 on 5 and 394 DF, p-value: 0.0002428
```

The code above should be mostly familiar. We use the “lm” function as normal for regression. However, we then used the “poly” function on the “inv2” variable. What this does is runs our model 5 times (5 is the number next to “inv2”). Each time a different polynomial is used from 1 (no polynomial) to 5 (5th order polynomial). The results indicate that the 4th-degree polynomial is significant.

We now will prepare a visual of the results but first, there are several things we need to prepare. First, we want to find what the range of our predictor variable “inv2” is and we will save this information in a grade. The code is below.

`inv2lims<-range(Clothing$inv2)`

Second, we need to create a grid that has all the possible values of “inv2” from the lowest to the highest broken up in intervals of one. Below is the code.

`inv2.grid<-seq(from=inv2lims[1],to=inv2lims[2])`

We now have a dataset with almost 400000 data points in the “inv2.grid” object through this approach. We will now use these values to predict “tsales.” We also want the standard errors so we se “se” to TRUE

`preds<-predict(fit,newdata=list(inv2=inv2.grid),se=TRUE)`

We now need to find the confidence interval for our regression line. This is done by making a dataframe that takes the predicted fit adds or subtracts 2 and multiples this number by the standard error as shown below.

`se.bands<-cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)`

With these steps completed, we are ready to create our civilization.

To make our visual, we use the “plot” function on the predictor and outcome. Doing this gives us a plot without a regression line. We then use the “lines” function to add the polynomial regression line, however, this line is based on the “inv2.grid” object (40,000 observations) and our predictions. Lastly, we use the “matlines” function to add the confidence intervals we found and stored in the “se.bands” object.

```
plot(Clothing$inv2,Clothing$tsales)
lines(inv2.grid,preds$fit,lwd=4,col='blue')
matlines(inv2.grid,se.bands,lwd = 4,col = "yellow",lty=4)
```

**Conclusion**

You can clearly see the curvature of the line. Which helped to improve model fit. Now any of you can tell that we are fitting this line to mostly outliers. This is one reason we the standard error gets wider and wider it is because there are fewer and fewer observations on which to base it. However, for demonstration purposes, this is a clear example of the power of polynomial regression.

# Partial Least Squares Regression in R

Partial least squares regression is a form of regression that involves the development of components of the original variables in a supervised way. What this means is that the dependent variable is used to help create the new components form the original variables. This means that when pls is used the linear combination of the new features helps to explain both the independent and dependent variables in the model.

In this post, we will use predict “income” in the “Mroz” dataset using pls. Below is some initial code.

`library(pls);library(Ecdat)`

```
data("Mroz")
str(Mroz)
```

```
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
```

First, we must prepare our data by dividing it into a training and test set. We will do this by doing a 50/50 split of the data.

```
set.seed(777)
train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)
```

In the code above we set the “set.seed function in order to assure reduplication. Then we created the “train” object and used the “sample” function to make a vector with ‘T’ and ‘F’ based on the number of rows in “Mroz”. Lastly, we created the “test”” object base don everything that is not in the “train” object as that is what the exclamation point is for.

Now we create our model using the “plsr” function from the “pls” package and we will examine the results using the “summary” function. We will also scale the data since this the scale affects the development of the components and use cross-validation. Below is the code.

```
set.seed(777)
pls.fit<-plsr(income~.,data=Mroz,subset=train,scale=T,validation="CV")
summary(pls.fit)
```

```
## Data: X dimension: 392 17
## Y dimension: 392 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 11218 8121 6701 6127 5952 5886 5857
## adjCV 11218 8114 6683 6108 5941 5872 5842
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5853 5849 5854 5853 5853 5852 5852
## adjCV 5837 5833 5837 5836 5836 5835 5835
## 14 comps 15 comps 16 comps 17 comps
## CV 5852 5852 5852 5852
## adjCV 5835 5835 5835 5835
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 17.04 26.64 37.18 49.16 59.63 64.63 69.13
## income 49.26 66.63 72.75 74.16 74.87 75.25 75.44
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 72.82 76.06 78.59 81.79 85.52 89.55 92.14
## income 75.49 75.51 75.51 75.52 75.52 75.52 75.52
## 15 comps 16 comps 17 comps
## X 94.88 97.62 100.00
## income 75.52 75.52 75.52
```

The printout includes the root mean squared error for each of the components in the VALIDATION section as well as the variance explained in the TRAINING section. There are 17 components because there are 17 independent variables. You can see that after component 3 or 4 there is little improvement in the variance explained in the dependent variable. Below is the code for the plot of these results. It requires the use of the “validationplot” function with the “val.type” argument set to “MSEP” Below is the code

`validationplot(pls.fit,val.type = "MSEP")`

We will do the predictions with our model. We use the “predict” function, use our “Mroz” dataset but only those index in the “test” vector and set the components to three based on our previous plot. Below is the code.

```
set.seed(777)
pls.pred<-predict(pls.fit,Mroz[test,],ncomp=3)
```

After this, we will calculate the mean squared error. This is done by subtracting the results of our predicted model from the dependent variable of the test set. We then square this information and calculate the mean. Below is the code

`mean((pls.pred-Mroz$income[test])^2)`

`## [1] 63386682`

As you know, this information is only useful when compared to something else. Therefore, we will run the data with a tradition least squares regression model and compare the results.

```
set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
```

`## [1] 59432814`

The least squares model is slightly better then our partial least squares model but if we look at the model we see several variables that are not significant. We will remove these see what the results are

`summary(lm.fit)`

```
##
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20131 -2923 -1065 1670 36246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.946e+04 3.224e+03 -6.036 3.81e-09 ***
## workno -4.823e+03 1.037e+03 -4.651 4.59e-06 ***
## hoursw 4.255e+00 5.517e-01 7.712 1.14e-13 ***
## child6 -6.313e+02 6.694e+02 -0.943 0.346258
## child618 4.847e+02 2.362e+02 2.052 0.040841 *
## agew 2.782e+02 8.124e+01 3.424 0.000686 ***
## educw 1.268e+02 1.889e+02 0.671 0.502513
## hearnw 6.401e+02 1.420e+02 4.507 8.79e-06 ***
## wagew 1.945e+02 1.818e+02 1.070 0.285187
## hoursh 6.030e+00 5.342e-01 11.288 < 2e-16 ***
## ageh -9.433e+01 7.720e+01 -1.222 0.222488
## educh 1.784e+02 1.369e+02 1.303 0.193437
## wageh 2.202e+03 8.714e+01 25.264 < 2e-16 ***
## educwm -4.394e+01 1.128e+02 -0.390 0.697024
## educwf 1.392e+02 1.053e+02 1.322 0.186873
## unemprate -1.657e+02 9.780e+01 -1.694 0.091055 .
## cityyes -3.475e+02 6.686e+02 -0.520 0.603496
## experience -1.229e+02 4.490e+01 -2.737 0.006488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5668 on 374 degrees of freedom
## Multiple R-squared: 0.7552, Adjusted R-squared: 0.744
## F-statistic: 67.85 on 17 and 374 DF, p-value: < 2.2e-16
```

```
set.seed(777)
lm.fit<-lm(income~work+hoursw+child618+agew+hearnw+hoursh+wageh+experience,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
```

`## [1] 57839715`

As you can see the error decreased even more which indicates that the least squares regression model is superior to the partial least squares model. In addition, the partial least squares model is much more difficult to explain because of the use of components. As such, the least squares model is the favored one.

# Principal Component Regression in R

This post will explain and provide an example of principal component regression (PCR). Principal component regression involves having the model construct components from the independent variables that are a linear combination of the independent variables. This is similar to principal component analysis but the components are designed in a way to best explain the dependent variable. Doing this often allows you to use fewer variables in your model and usually improves the fit of your model as well.

Since PCR is based on principal component analysis it is an unsupervised method, which means the dependent variable has no influence on the development of the components. As such, there are times when the components that are developed may not be beneficial for explaining the dependent variable.

Our example will use the “Mroz” dataset from the “Ecdat” package. Our goal will be to predict “income” based on the variables in the dataset. Below is the initial code

`library(pls);library(Ecdat)`

```
data(Mroz)
str(Mroz)
```

```
## 'data.frame': 753 obs. of 18 variables:
## $ work : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ hoursw : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ child6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ child618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ agew : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educw : int 12 12 12 12 14 12 16 12 12 12 ...
## $ hearnw : num 3.35 1.39 4.55 1.1 4.59 ...
## $ wagew : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hoursh : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ ageh : int 34 30 40 53 32 57 37 53 52 43 ...
## $ educh : int 12 9 12 10 12 11 12 8 4 12 ...
## $ wageh : num 4.03 8.44 3.58 3.54 10 ...
## $ income : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ educwm : int 12 7 12 7 12 14 14 3 7 7 ...
## $ educwf : int 7 7 7 7 14 7 7 3 7 7 ...
## $ unemprate : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
## $ experience: int 14 5 15 6 7 33 11 35 24 21 ...
```

Our first step is to divide our dataset into a train and test set. We will do a simple 50/50 split for this demonstration.

```
train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)
```

In the code above we use the “sample” function to create a “train” index based on the number of rows in the “Mroz” dataset. Basically, R is making a vector that randomly assigns different rows in the “Mroz” dataset to be marked as True or False. Next, we use the “train” vector and we assign everything or every number that is not in the “train” vector to the test vector by using the exclamation mark.

We are now ready to develop our model. Below is the code

```
set.seed(777)
pcr.fit<-pcr(income~.,data=Mroz,subset=train,scale=T,validation="CV")
```

To make our model we use the “pcr” function from the “pls” package. The “subset” argument tells r to use the “train” vector to select examples from the “Mroz” dataset. The “scale” argument makes sure everything is measured the same way. This is important when using a component analysis tool as variables with different scale have a different influence on the components. Lastly, the “validation” argument enables cross-validation. This will help us to determine the number of components to use for prediction. Below is the results of the model using the “summary” function.

`summary(pcr.fit)`

```
## Data: X dimension: 381 17
## Y dimension: 381 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 12102 11533 11017 9863 9884 9524 9563
## adjCV 12102 11534 11011 9855 9878 9502 9596
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 9149 9133 8811 8527 7265 7234 7120
## adjCV 9126 9123 8798 8877 7199 7172 7100
## 14 comps 15 comps 16 comps 17 comps
## CV 7118 7141 6972 6992
## adjCV 7100 7123 6951 6969
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 21.359 38.71 51.99 59.67 65.66 71.20 76.28
## income 9.927 19.50 35.41 35.63 41.28 41.28 46.75
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.70 84.39 87.32 90.15 92.65 95.02 96.95
## income 47.08 50.98 51.73 68.17 68.29 68.31 68.34
## 15 comps 16 comps 17 comps
## X 98.47 99.38 100.00
## income 68.48 70.29 70.39
```

There is a lot of information here.The VALIDATION: RMSEP section gives you the root mean squared error of the model broken down by component. The TRAINING section is similar the printout of any PCA but it shows the amount of cumulative variance of the components, as well as the variance, explained for the dependent variable “income.” In this model, we are able to explain up to 70% of the variance if we use all 17 components.

We can graph the MSE using the “validationplot” function with the argument “val.type” set to “MSEP”. The code is below.

`validationplot(pcr.fit,val.type = "MSEP")`

How many components to pick is subjective, however, there is almost no improvement beyond 13 so we will use 13 components in our prediction model and we will calculate the means squared error.

```
set.seed(777)
pcr.pred<-predict(pcr.fit,Mroz[test,],ncomp=13)
mean((pcr.pred-Mroz$income[test])^2)
```

`## [1] 48958982`

MSE is what you would use to compare this model to other models that you developed. Below is the performance of a least squares regression model

```
set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
```

`## [1] 47794472`

If you compare the MSE the least squares model performs slightly better than the PCR one. However, there are a lot of non-significant features in the model as shown below.

`summary(lm.fit)`

```
##
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27646 -3337 -1387 1860 48371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.215e+04 3.987e+03 -5.556 5.35e-08 ***
## workno -3.828e+03 1.316e+03 -2.909 0.00385 **
## hoursw 3.955e+00 7.085e-01 5.582 4.65e-08 ***
## child6 5.370e+02 8.241e+02 0.652 0.51512
## child618 4.250e+02 2.850e+02 1.491 0.13673
## agew 1.962e+02 9.849e+01 1.992 0.04709 *
## educw 1.097e+02 2.276e+02 0.482 0.63013
## hearnw 9.835e+02 2.303e+02 4.270 2.50e-05 ***
## wagew 2.292e+02 2.423e+02 0.946 0.34484
## hoursh 6.386e+00 6.144e-01 10.394 < 2e-16 ***
## ageh -1.284e+01 9.762e+01 -0.132 0.89542
## educh 1.460e+02 1.592e+02 0.917 0.35982
## wageh 2.083e+03 9.930e+01 20.978 < 2e-16 ***
## educwm 1.354e+02 1.335e+02 1.014 0.31115
## educwf 1.653e+02 1.257e+02 1.315 0.18920
## unemprate -1.213e+02 1.148e+02 -1.057 0.29140
## cityyes -2.064e+02 7.905e+02 -0.261 0.79421
## experience -1.165e+02 5.393e+01 -2.159 0.03147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6729 on 363 degrees of freedom
## Multiple R-squared: 0.7039, Adjusted R-squared: 0.69
## F-statistic: 50.76 on 17 and 363 DF, p-value: < 2.2e-16
```

Removing these and the MSE is almost the same for the PCR and least square models

```
set.seed(777)
lm.fit2<-lm(income~work+hoursw+hearnw+hoursh+wageh,data=Mroz,subset=train)
lm.pred2<-predict(lm.fit2,Mroz[test,])
mean((lm.pred2-Mroz$income[test])^2)
```

`## [1] 47968996`

**Conclusion**

Since the least squares model is simpler it is probably the superior model. PCR is strongest when there are a lot of variables involve and if there are issues with multicollinearity.

# Example of Best Subset Regression in R

This post will provide an example of best subset regression. This is a topic that has been covered before in this blog. However, in the current post, we will approach this using a slightly different coding and a different dataset. We will be using the “HI” dataset from the “Ecdat” package. Our goal will be to predict the number of hours a women works based on the other variables in the dataset. Below is some initial code.

`library(leaps);library(Ecdat)`

```
data(HI)
str(HI)
```

```
## 'data.frame': 22272 obs. of 13 variables:
## $ whrswk : int 0 50 40 40 0 40 40 25 45 30 ...
## $ hhi : Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
## $ whi : Factor w/ 2 levels "no","yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ hhi2 : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
## $ education : Ord.factor w/ 6 levels "<9years"<"9-11years"<..: 4 4 3 4 2 3 5 3 5 4 ...
## $ race : Factor w/ 3 levels "white","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hispanic : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ experience: num 13 24 43 17 44.5 32 14 1 4 7 ...
## $ kidslt6 : int 2 0 0 0 0 0 0 1 0 1 ...
## $ kids618 : int 1 1 0 1 0 0 0 0 0 0 ...
## $ husby : num 12 1.2 31.3 9 0 ...
## $ region : Factor w/ 4 levels "other","northcentral",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ wght : int 214986 210119 219955 210317 219955 208148 213615 181960 214874 214874 ...
```

To develop a model we use the “regsubset” function from the “leap” package. Most of the coding is the same as linear regression. The only difference is the “nvmax” argument which is set to 13. The default setting for “nvmax” is 8. This is good if you only have 8 variables. However, the results from the “str” function indicate that we have 13 functions. Therefore, we need to set the “nvmax” argument to 13 instead of the default value of 8 in order to be sure to include all variables. Below is the code

`regfit.full<-regsubsets(whrswk~.,HI, nvmax = 13)`

We can look at the results with the “summary” function. For space reasons, the code is shown but the results will not be shown here.

`summary(regfit.full)`

If you run the code above in your computer you will 13 columns that are named after the variables created. A star in a column means that that variable is included in the model. To the left is the numbers 1-13 which. One means one variable in the model two means two variables in the model etc.

Our next step is to determine which of these models is the best. First, we need to decide what our criteria for inclusion will be. Below is a list of available fit indices.

`names(summary(regfit.full))`

`## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"`

For our purposes, we will use “rsq” (r-square) and “bic” “Bayesian Information Criteria.” In the code below we are going to save the values for these two fit indices in their own objects.

```
rsq<-summary(regfit.full)$rsq
bic<-summary(regfit.full)$bic
```

Now let’s plot them

`plot(rsq,type='l',main="R-Square",xlab="Number of Variables")`

`plot(bic,type='l',main="BIC",xlab="Number of Variables")`

You can see that for r-square the values increase and for BIC the values decrease. We will now make both of these plots again but we will have r tell the optimal number of variables when considering each model index. For we use the “which” function to determine the max r-square and the minimum BIC

`which.max(rsq)`

`## [1] 13`

`which.min(bic)`

`## [1] 12`

The model with the best r-square is the one with 13 variables. This makes sense as r-square always improves as you add variables. Since this is a demonstration we will not correct for this. For BIC the lowest values was for 12 variables. We will now plot this information and highlight the best model in the plot using the “points” function, which allows you to emphasis one point in a graph

```
plot(rsq,type='l',main="R-Square with Best Model Highlighted",xlab="Number of Variables")
points(13,(rsq[13]),col="blue",cex=7,pch=20)
```

```
plot(bic,type='l',main="BIC with Best Model Highlighted",xlab="Number of Variables")
points(12,(bic[12]),col="blue",cex=7,pch=20)
```

Since BIC calls for only 12 variables it is simpler than the r-square recommendation of 13. Therefore, we will fit our final model using the BIC recommendation of 12. Below is the code.

`coef(regfit.full,12)`

```
## (Intercept) hhiyes whiyes
## 30.31321796 1.16940604 18.25380263
## education.L education^4 education^5
## 6.63847641 1.54324869 -0.77783663
## raceblack hispanicyes experience
## 3.06580207 -1.33731802 -0.41883100
## kidslt6 kids618 husby
## -6.02251640 -0.82955827 -0.02129349
## regionnorthcentral
## 0.94042820
```

So here is our final model. This is what we would use for our test set.

**Conclusion**

Best subset regression provides the researcher with insights into every possible model as well as clues as to which model is at least statistically superior. This knowledge can be used for developing models for data science applications.