Writing the final section of a research paper is one of the most exciting and, at times, trickiest parts of the research process. The video below provides insights into writing the discussion and conclusion section of a paper.
Writing the Results Section VIDEO
Data Analysis VIDEO
Research Hypotheses & Objectives VIDEO
Developing a Review of Literature VIDEO
Research Question Type VIDEO
Definition and Traits of Research Questions VIDEO
Writing a Significance Statement VIDEO
Writing a Purpose Statement VIDEO
Overview of Intro to a Research Paper VIDEO
Developing a Statement of the Problem VIDEO
Defining A Quantitative Research Problem VIDEO
Identifying a research problem is one of the hardest parts of writing a research paper. If this is done poorly, you may have to go back and redefine the problem, or you may discover that you cannot move forward with your research. For this reason, the video below explains what a research problem is, along with criteria to consider when selecting a potential research problem.
Research Process VIDEO
K-Nearest Neighbor Regression with Python VIDEO
K-nearest neighbor is a simple but effective technique for making predictions from data. In the video below, we will look at how to use this tool for regression with Python.
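For readers who want a sense of the code before watching, below is a minimal sketch of K-nearest neighbor regression with scikit-learn on simulated data; the data and the choice of k = 5 are illustrative assumptions, not taken from the video.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # hypothetical features
y = X[:, 0] * 3 + X[:, 1] + rng.normal(size=200)    # hypothetical continuous outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
knn = KNeighborsRegressor(n_neighbors=5)            # k, the number of neighbors, is a tuning parameter
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))                    # R-squared on the test set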
Linear Discriminant Analysis with Python VIDEO
Linear discriminant analysis is a tool used for classification, one of many employed in data science. In this video, we will look at an example of how to use this tool in Python for practical purposes.
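Below is a minimal sketch of linear discriminant analysis as a classifier in scikit-learn; the iris data and the five-fold cross-validation are used purely as a convenient illustration, not as the example from the video.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()
print(cross_val_score(lda, X, y, cv=5).mean())      # mean cross-validated accuracy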

OSEMN Framework for Data Analysis
Analyzing data can be extremely challenging, and it is common not to know where to begin. Perhaps you know some basic ways of analyzing data, but it is unclear what should be done first and what should follow.
This is where a data analysis framework can come in handy. Having a basic step-by-step process that you always follow can make it much easier to start and complete a project. One example of a data analysis framework is the OSEMN model, an acronym in which each letter stands for a step of the data analysis process. The steps are as follows
- Obtain
- Scrub
- Explore
- Model
- iNterpret
We will now go through each of these steps.
Obtain
The first step of this model is obtaining data. Depending on the context, this may already be done for you because the stakeholders have provided data for analysis. In other situations, you have to find the data you need to answer whatever questions you are seeking insights into.
Data can be found almost anywhere, so the data you obtain must actually help achieve your goals. It is also necessary to have the skills or connections to get the data. For example, data may have to be scraped from the web, pulled from a database, or even collected through the development of surveys. Each of these examples requires specific skills for success.
Scrub
Once data is obtained, it must be scrubbed, or cleaned. This requires several tasks: duplicates need to be removed, missing data must be addressed, outliers considered, and the shape of the data examined, among other things. In addition, it is often useful to look at descriptive statistics and visualizations to identify potential problems. Lastly, you often need to clean the categories within a variable if they are misspelled or contain other errors, such as stray punctuation or numbers stored as text.
The concepts mentioned above are just some of the steps that need to be taken to clean data. Dirty data will lead to bad insights, so this step must be done well.
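As a concrete illustration, below is a minimal pandas sketch of a few of these scrubbing tasks. The data frame and column names are hypothetical stand-ins, not part of any particular project.
import pandas as pd
import numpy as np
# Hypothetical messy data standing in for whatever was obtained
df = pd.DataFrame({
    'score': [88, 92, np.nan, 75, 75, 300],        # a missing value and a likely outlier
    'gender': ['Male', 'male', 'Female', 'female ', 'female ', 'Female'],
})
df = df.drop_duplicates()                          # remove duplicate rows
print(df.isnull().sum())                           # how much data is missing per column
df['score'] = df['score'].fillna(df['score'].median())  # one way to address missing values
df['gender'] = df['gender'].str.strip().str.lower()     # clean up messy category labels
print(df.describe())                               # descriptive stats help flag outliers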
Explore
Exploring data and scrubbing data will often happen at the same time. With exploration, you are looking for insights into your data. One of the easiest ways to do this is to drill down as far as possible into your continuous variables by segmenting them with your categorical variables.
For example, you might look at average scores by gender, then at average scores by gender and major, and then at average scores by gender, major, and class. Each time you will find slightly different patterns that may or may not be useful. Another approach would be to look at scatterplots for different combinations of categorical variables.
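To make this concrete, here is a minimal sketch of that drill-down using pandas groupby; the data and column names are hypothetical.
import pandas as pd
# Hypothetical student data to illustrate drilling down with categorical variables
df = pd.DataFrame({
    'score':  [85, 78, 92, 74, 88, 95, 70, 81],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'major':  ['Math', 'Math', 'Music', 'Music', 'Math', 'Music', 'Math', 'Music'],
    'class':  ['Freshman', 'Sophomore', 'Freshman', 'Sophomore', 'Sophomore', 'Freshman', 'Freshman', 'Sophomore'],
})
print(df.groupby('gender')['score'].mean())
print(df.groupby(['gender', 'major'])['score'].mean())
print(df.groupby(['gender', 'major', 'class'])['score'].mean())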
Clear objectives can help you focus your exploration and reduce the chance of presenting irrelevant information to your stakeholders. Suppose the stakeholders want to know the average scores of women. In that case, there may be no benefit to knowing the average score of male music majors.
Model
Modeling involves regression/classification in the case of supervised learning or segmentation in the case of unsupervised learning. Modeling in the context of supervised learning helps in predicting future values, while segmentation helps develop insights into groups within a dataset that have similar traits.
Once again, the objectives of the analysis shape which tool to use in this context. If you want to predict enrollment, then regression tools may be appropriate. If you want to know what car a person will buy, then classification may help. If, on the other hand, you want to know some of the traits of high-performing students, then unsupervised approaches may be the best option.
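As a rough illustration of these three situations, here is a minimal scikit-learn sketch on simulated data; the data, model choices, and settings are assumptions for demonstration only, not recommendations for any particular project.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                            # hypothetical features
y_number = X[:, 0] * 2 + rng.normal(size=100)            # a continuous outcome (e.g., enrollment)
y_label = (X[:, 1] > 0).astype(int)                      # a categorical outcome (e.g., car purchased)
reg = LinearRegression().fit(X, y_number)                # supervised learning: predict a number
clf = LogisticRegression().fit(X, y_label)               # supervised learning: predict a label
groups = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # unsupervised learning: find similar groups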
iNterpret
Interpreting involves sharing what all of this means. It is truly difficult to explain the intricacies of data analysis to a layperson. Therefore, this step involves not just analytical techniques but communication skills. Breaking down a complex analysis so that people can understand it is difficult. As such, ideas around storytelling have been developed to help data analysts connect their work with the audience.
Conclusion
The framework provided here is not the only way to approach data analysis. Furthermore, as you become more comfortable with analyzing data, you do not have to limit yourself to the steps or order in which they are performed. Frameworks are intended for getting people started in the creative process of whatever task they are trying to achieve.
Decision Trees in Python VIDEO

Developing Conceptual and Operational Definitions for Research
Defining terms is one of the first things required when writing a research paper. However, it is also one of the hardest things to do, as we often know what we want to study intuitively rather than literally. This post will provide guidance on the following
- Developing conceptual definitions
- Determining operational definitions
- Understanding the measurement model
Each of the ideas above is fundamental to developing coherent research papers.
Concepts
A concept is a mental construct or tool used to understand the world around us. Examples of concepts include intelligence, humor, motivation, and desire. These terms have meaning, but they cannot be seen or observed directly. You cannot pick up intelligence, buy humor, or weigh either of these. However, you can tell when someone is intelligent or has a sense of humor.
This is because constructs are observed indirectly through behaviors, which provide evidence of the construct. For example, someone demonstrates intelligence through their academic success, how they speak, etc. A person can demonstrate humor by making others laugh through what they say. Concepts represent things around us that we want to study as researchers.
Defining Concepts
To define a concept for the purpose of research requires the following three things
- A manner in which to measure the concept indirectly
- A unit of analysis
- Some variation among the unit of analysis
The criteria listed above are essentially the definition of a conceptual definition. Below is an example of a conceptual definition of academic dishonesty.
Academic dishonesty is the extent to which individuals exhibit a disregard towards educational norms of scholarly integrity.
Below is a breakdown of this definition
- Measurement: exhibit a disregard towards educational norms of scholarly integrity.
- Unit of analysis: individual
- Variation: Extent to which
It becomes much easier to shape a research study with these three components.
Conceptual Definition Template
There are also templates, readily available in books and on the internet, for generating a conceptual definition. Below is one example.
The concept of _____________ is defined as the extent to which
_________________________ exhibit the characteristic(s) of _______________.
Here is a revised version of our conceptual definition of academic dishonesty
The concept of academic dishonesty is defined as the extent to which individuals exhibit the characteristic of disregard towards educational norms of scholarly integrity.
The same three components are there, and the wording is mostly the same, but having a template such as this can really save time when formulating a study. It also helps make things clearer as you go forward with a project.
Operational Definition
Once a concept has been defined, it must next be operationalized. The operational definition indicates how a concept will be measured quantitatively. This means that a researcher must specify at least one metric. Below is an example using academic dishonesty again.
Conceptual Definition: Academic dishonesty is the extent to which an individual exhibits a disregard towards educational norms of scholarly integrity.
Operational Definition: Survey Items
- It is okay to cheat
- It is okay to turn in someone else’s work as my own
In the example above, academic dishonesty was operationalized using survey items. In other words, we will measure people’s opinions about academic dishonesty by having them respond to survey items.
Measurement error happens when there is a disconnect between the conceptual definition and the measurement method. It can be hard to detect this, so students need to be careful when developing a study.
Measurement Models
A concept is not measured directly, as has already been mentioned. This means that when it is time to analyze our data, our construct is a latent, or unobserved, variable. The items on the survey are observed because people gave us this information directly. This means that the survey items are observed variables.
The measurement model links the latent variables with the observed variables statistically. A strong measurement model indicates that the observed variables correlate with the underlying latent variable or construct.
For example, academic dishonesty has been the latent variable example throughout this post. The survey items “it’s okay to cheat” and “it’s okay to turn in someone else’s work as my own” are observed variables. Using statistical tools, we can check whether these observed variables are associated with our concept of academic dishonesty.
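As a simple illustration of that idea, the sketch below checks whether two hypothetical survey items move together using pandas; a fuller measurement model would typically use factor analysis or structural equation modeling. The responses are made-up numbers for demonstration only.
import pandas as pd
# Hypothetical Likert-style responses (1 = strongly disagree, 5 = strongly agree)
items = pd.DataFrame({
    'okay_to_cheat':         [1, 2, 2, 4, 5, 1, 3, 4, 2, 5],
    'okay_to_submit_others': [1, 1, 3, 4, 4, 2, 3, 5, 2, 4],
})
print(items.corr())  # a strong positive correlation is consistent with one underlying construct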
Conclusion
Defining concepts is one of the more challenging aspects of conducting research. It requires a researcher to know what they are trying to study and how to measure it. For students, this is challenging because articulating ideas in this manner is often not done in everyday life.

Types of Experiments
This post will provide some basic ideas for developing experiments. The process of doing valid experiments is rather challenging, as one misstep can make your results invalid. Therefore, care is needed when attempting to set up an experiment.
Definition
An experiment is a process in which changes are made to input variables to see how they affect the output variable(s). The inputs are called controllable variables, while the outputs are called response variables. Other variables that cannot be controlled are called uncontrollable variables.
When developing an experiment, the experimenter’s approach or plan for experimenting is called the strategy of experimentation. Extensive planning is necessary to conduct an experiment, while the actual data collection is often not that difficult.
Best Guess Approach
There are several different strategies for experimentation. The best-guess approach involves manipulating input variables based on prior results from the output variable. For example, suppose you are teaching a math class and notice that students score better when they work in groups in the morning compared to the afternoon. You might switch to group work in the morning and then see whether adding lectures further increases performance.
This guesswork can be highly efficient if you are familiar with the domain in which you are doing the experiments. However, if the guess is wrong, you have to continue guessing, and this can go on for a long time.
One-Factor-At-A-Time
Another strategy of experimentation is the one-factor-at-a-time (OFAT) approach. You begin by having a baseline for each factor (variable) and then vary each variable one at a time to see how it affects the output. For example, you can switch whether students study in the morning or evening and see how it affects performance. Then you might test whether group work or individual work affects scores.
The biggest weakness of this approach is that you cannot see interactions between variables. An interaction is an instance in which one factor does not produce the same results at different levels of another factor. Interactions can be hard to understand, but when two factors are plotted at the same time against the response variable, crossing lines indicate that there is an interaction.

Factorial Experiments
Factorial experiments involve varying factors together. For example, a 2^2 factorial design means two factors are each varied at two levels, giving four possible combinations of experimental runs for one response variable. These types of experiments are often drawn as a square, as shown below.

Each point represents a different combination of the two factors. The effect of the factor on the x-axis is calculated by subtracting the mean response at one of its levels from the mean response at the other. If we run each combination twice, we would calculate the difference as shown below.
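To make the arithmetic concrete, here is a minimal sketch of computing a main effect in a 2^2 design with two replicates per combination; the response values are made-up numbers for illustration only.
import numpy as np
# Each key is a (factor A level, factor B level) combination; each value holds two replicate responses
runs = {
    ('low', 'low'):   [20, 22],
    ('high', 'low'):  [30, 28],
    ('low', 'high'):  [25, 27],
    ('high', 'high'): [36, 34],
}
a_high = np.mean([y for (a, b), ys in runs.items() if a == 'high' for y in ys])
a_low = np.mean([y for (a, b), ys in runs.items() if a == 'low' for y in ys])
print(a_high - a_low)  # main effect of factor A: difference between the two means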

The larger this difference, the more likely there is a strong effect from the independent variables in the model.
When the number of combinations becomes large and complicated to manage, it may not be practical to run all possible combinations. In this situation, an experimenter will use a fractional factorial experiment, in which only some of the combinations are run. For example, if 32 runs are possible (2^5), maybe only 12 of them are conducted. The calculation is the same as above, just with more groups to compare.
Conclusion
Experiments are a practical way to determine the best combination of factors or variables for a given output variable(s). The majority of the time is spent planning and designing the experiment, with the actual data collection being straightforward.

Different Views of Research
People have been doing research, formally or informally, since the beginning of time. We are always trying to figure out how to do something or why something is the way that it is. In this post, we will look at different ways to view and/or conduct research. These perspectives are empirical, theoretical, and analytical.
Empirical
Perhaps the most common form or approach to doing research is the empirical approach. This approach involves observing reality and developing hypotheses and theories based on what was observed. This is an inductive approach to doing research because the researcher starts with their observations to make a theory. In other words, you start with examples and abstract them to theories.
An example of this is found in the work of Charles Darwin and evolution. Darwin collected a lot of examples and observations of birds during his travels. Based on what he saw he inferred that animals evolved over time. This was his conclusion based on his interpretation of the data. Later, other researchers tried to further bolster Darwin’s theory by finding mathematical support for his claims.
The order in which empirical research is conducted is as follows…
- Identify the phenomenon
- Collect data
- Abstraction/model development
- Hypothesis
- Test
You can see that hypotheses and theory are derived from the data, which is similar to qualitative research. However, steps 4 and 5 are where equation development and/or statistical tools are used. As such, the empirical view of research is valuable when a large amount of data is available and many variables can be included, which again is common for quantitative methods.
To summarize this, empirical research is focused on what happened, which is one way in which scientific laws are derived.
Theoretical
The theoretical perspective is essentially the same process as the empirical one but moving in the opposite direction. Theorists will start with what they think about the phenomenon and how things should be. This approach starts with a general principle, and then the researcher goes and looks for evidence that supports that general principle. Another way of stating this is that the theoretical approach is deductive in nature.
A classic example of this is Einstein’s theory of relativity. Apparently, he deduced this theory through logic and left it to others to determine if the theory was correct. To put it simply, he knew without knowing, if this makes sense. In this approach, the steps are as follows
- Theory
- Hypotheses
- Model abstraction
- Data collection
- Phenomenon
You collect data to confirm the hypotheses. Common statistical tools can include simulations or any other method suitable for situations in which little data is available. The caveat is that the data must match the phenomenon to have meaning. For example, if I am trying to understand some sort of phenomenon about women, I cannot collect data from men, as this does not match the phenomenon.
In general, theoretical research is focused on why something happens which is the goal of most theories, explaining why.
Analytical
Analytical research is probably the hardest to explain and understand. Essentially, analytical research tries to understand how people develop their empirical or theoretical research. How did Darwin make his collection, or how did Einstein develop his ideas?
In other words, analytical research is commonly used to judge the research of others. Examples of this can be people who spend a lot of time critiquing the works of others. An analytical approach looks for the strengths and weaknesses of various research. Therefore, this approach is focused on how research is done and can use tools from both empirical and theoretical research.
Conclusion
The point here was to explain different views on conducting research. The goal was not to state that one is superior to the others. Rather, the goal was to show how different tools can be used in different ways.

Paraphrasing Tips for ESL Students
Paraphrasing is an absolutely essential skill in a professional setting. By paraphrasing, we mean the ability to take someone else’s words and rephrase them while giving credit to the original source. Whenever a student fails to do this, it is called plagiarism, which is a major problem in academia. In this post, we will look at several tips on how to paraphrase.
The ability to paraphrase academically takes near-native writing ability, because you have to be able to play with the language in a highly complex manner. Being able to do this after a few semesters of ESL is difficult for the typical student. Despite this, there are several ways to try to make paraphrasing work. Below are just some ideas.
- Use synonyms
- Change the syntax
- Make several sentences
- Condense/summarize
One tip not mentioned above is reading. Next to actually writing, nothing will improve writing skills like reading. Being exposed to different texts helps you develop an intuitive understanding of the second language in a way that copying and pasting never will.
Use Synonyms
Using synonyms is a first step in paraphrasing an idea, but this approach cannot be used by itself, as many people consider that alone to be plagiarism. With synonyms, you replace some words with others. The easiest words to replace are adjectives and verbs, followed by nouns. Below is an example. The first sentence is the original, and the second is the paraphrase.
The man loves to play guitar
The man likes to play guitar
In the example above, all we did was change the word “loves” to “likes.” This is a superficial change that is often still considered plagiarism because of how easy it is to do. We can take this a step further by modifying the infinitive verb “to play.”
The man loves to play guitar
The man likes to play guitar
The man likes playing guitar
Again, this is superficial but a step above the first example. In addition, most word processors will provide synonyms if you right-click on a word, and of course there are online options as well. Remember that this is a beginning and is a tool you use in addition to more complex approaches.
Change the Syntax
Changing the syntax has to do with the word order of the sentence or sentences. Below is an example
The man loves to play guitar
Playing the guitar is something the man loves
In this example, we move the infinitive phrase “to play” to the front and change it to a present participle. There were other adjustments that needed to be made to maintain the flow of the sentence. This example is a more advanced form of paraphrasing and it may be enough to only do this to avoid plagiarism. However, you can combine synonyms and syntax as shown in the example below
The man loves to play guitar
Playing the guitar is something the man likes
Make Several Sentences
Another approach is to convert a sentence(s) into several more sentences. As shown below
The man loves to play guitar
This man has a hobby. He likes playing guitar.
You can see that there are two sentences now. The first sentence indicates the man has a hobby and the second explains what the hobby is and how much he likes it. In addition, in the second sentence, the verb “to play” was changed to the present participle of “playing.”
Condense/Summarize
Condensing or summarizing is not considered by everyone to be paraphrasing. The separation between paraphrasing and summarizing is fuzzy and it is more of a continuum than black and white. With this technique, you try to reduce the length of the statement you are paraphrasing as shown below.
The man loves to play guitar
He likes guitar
This was a difficult sentence to summarize because it was already so short. However, we were able to shrink it from six words to three by removing what it was about the guitar that he liked.
Academic Examples
We will now look at several academic examples to show the applications of these rules in a real context. The passage below is some academic text
There is also a push within Southeast Asia for college graduates to have interpersonal skills. For example, Malaysia is calling for graduates to have soft skills and that these need to be part of the curriculum of tertiary schools. In addition, a lack of these skills has been found to limit graduates’ employability.
Example 1: Paraphrase with synonyms and syntax changes
There are several skills graduates need for employability in Southeast Asia. For example, people skills are needed. The ability to relate to others is being pushed for inclusion in higher education from parts of Southeast Asia (Thomas, 2018).
You can see how difficult this can be. We rearranged several concepts and changed several verbs to try and make this our own sentence. Below is an example of condensing.
Example 2: Condensing
There is demand in Southeast Asia for higher education to develop the interpersonal skills of their students as this is limiting the employability of graduates (Thomas, 2018).
With this example, we reduced the paragraph to one sentence.
Culture and Plagiarism
There are major differences in how plagiarism is viewed based on culture. In the West, plagiarism is universally condemned both in and out of academia as essentially stealing ideas from other people. However, in other places, the idea of plagiarism is much more nuanced or even acceptable.
In some cultures, one way to honor what someone has said or taught is to literally repeat it verbatim. The thought process goes something like this
- This person is a great teacher/elder
- What they said is insightful
- As a student or lower person, I cannot improve what they said
- Therefore, I should copy these perfect words into my own paper.
Of course, not everyone thinks like this, but I have experienced it enough to know that it does happen.
Whether the West likes it or not, plagiarism is a cultural position rather than an ethical one. Reducing plagiarism requires showing students that it is culturally unacceptable to do this in an academic or professional setting. The tips in this post at least provide tools for supporting students as they overcome this habit.

Understanding Variables
In research, there are many terms that have the same underlying meaning, which can be confusing for researchers as they try to complete a project. The problem is that people have different backgrounds and learn different terms during their studies, and when they try to work with others, there is often confusion over what is what.
In this post, we will try to clarify as much as possible various terms that are used when referring to variables. We will look at the following during this discussion
- Definition of a variable
- Minimum knowledge of the characteristics of a variable in research
- Various synonyms of variable
Definition
The word variable has the root “vary” and the suffix “able”. This literally means that a variable is something that is able to change. Examples include such concepts as height, weight, salary, etc. All of these concepts change as you gather data from different people. Statistics is primarily about trying to explain and/or understand the variability of variables.
However, to make things more confusing, there are times in research when a variable does not change, or remains constant. This will be explained in greater detail in a moment.
Minimum You Need to Know
Two broad concepts that you need to understand regardless of the specific variable terms you encounter are the following
- Whether the variable(s) are independent or dependent
- Whether the variable(s) are categorical or continuous
When we speak of independent and dependent variables we are looking at the relationship(s) between variables. Dependent variables are explained by independent variables. Therefore, one dimension of variables is understanding how they relate to each other and the most basic way to see this is independent vs dependent.
The second dimension to consider when thinking about variables is how they are measured, which is captured with the terms categorical and continuous. A categorical variable has a finite number of values that it can take. Examples include gender, hair color, or cellphone brand. A person can only be male or female, have one hair color, and carry only one brand of cellphone.
Continuous variables are variables that can take on an infinite number of values. Salary, temperature, etc., are all continuous in nature. It is possible to reduce a continuous variable to a categorical variable by creating intervals in which to place values, which is commonly done when creating bins for histograms (see the short binning sketch after the list below). In sum, here are the four possible general variable types
- Independent categorical
- Independent continuous
- Dependent categorical
- Dependent continuous
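Here is the short binning sketch mentioned above: a minimal example of reducing a continuous variable to a categorical one with pandas. The salary values and bin labels are hypothetical.
import pandas as pd
salary = pd.Series([32000, 45000, 51000, 75000, 88000, 120000])
bins = pd.cut(salary, bins=3, labels=['low', 'medium', 'high'])
print(bins.value_counts())  # the continuous salaries are now three categories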
Naturally, most models have one dependent variable, either categorical or continuous; however, you can have any combination of continuous and categorical variables as independents. Remember that all variables have the above characteristics regardless of whatever term is used for them.
Variable Synonyms
Below is a list of various names that variables go by in different disciplines. This is by no means an exhaustive list.
Experimental variable
A variable whose values are independent of any changes in the values of other variables. In other words, an experimental variable is just another term for independent variable.
Manipulated Variable
A variable that is independent in an experiment but whose value/behavior the researcher is able to control or manipulate. This is also another term for an independent variable.
Control Variable
A variable whose value does not change. Controlling a variable helps to explain the relationship between the independent and dependent variable in an experiment by making sure the control variable has no influence on the model.
Responding Variable
The dependent variable in an experiment. It responds to the experimental variable.
Intervening Variable
This is a hypothetical variable used to explain the causal links between other variables. Since they are hypothetical, intervening variables are not observed in an actual experiment. For example, suppose you are looking at the relationship between income and life expectancy and find a positive association. The intervening variable here may be access to healthcare: people who make more money have more access to healthcare, and this contributes to them often living longer.
Mediating Variable
This is the same thing as an intervening variable. The difference is that the mediating variable is not always hypothetical in nature and is often measured itself.
Confounding Variable
A confounder is a variable that influences both the independent and dependent variable, causing a spurious or false association. Often a confounding variable is a causal idea and cannot be described in terms of correlations or associations with other variables. In other words, it is often the same thing as an intervening variable.
Explanatory Variable
This variable is the same as an independent variable. The difference is that an independent variable is not influenced by any other variables. When independence is not certain, the variable is instead called an explanatory variable.
Predictor Variable
A predictor variable is an independent variable. This term is commonly used for regression analysis.
Outcome Variable
An outcome variable is a dependent variable in the context of regression analysis.
Observed Variable
This is a variable that is measured directly. Examples would be gender or height. There is no psychological construct whose meaning must be inferred from such variables.
Unobserved Variable
Unobserved variables are constructs that cannot be measured directly. In such situations, observed variables are used to try to determine the characteristics of the unobserved variable. For example, it is hard to measure addiction directly. Instead, other things are measured to infer addiction, such as health, drug use, performance, etc. The measures of these observed variables indicate the level of the unobserved variable of addiction.
Features
A feature is an independent variable in the context of machine learning and data science.
Target Variable
A target variable is the dependent variable in the context of machine learning and data science.
To conclude this, below is a summary of the different variables discussed and whether they are independent, dependent, or neither.
Independent | Dependent | Neither |
---|---|---|
Experimental | Responding | Control |
Manipulated | Target | Explanatory |
Predictor | Outcome | Intervening |
Feature | | Mediating |
 | | Observed |
 | | Unobserved |
 | | Confounding |
You can see how confusing this can be. Even though variables are mostly independent or dependent, there is a class of variables that does not fall into either category. However, for most purposes, the first two columns cover the majority of needs in simple research.
Conclusion
The confusion over variables is mainly due to inconsistent terminology across disciplines. There is nothing right or wrong about the different terms. They all developed in different places to address the same common problem. However, for students or those new to research, this can be confusing, and this post hopefully helps to clarify things.
Z-Scores VIDEO
Calculating Z-Scores
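For readers who want the calculation in code before watching, here is a minimal sketch of z-scores computed by hand and with scipy; the sample values are hypothetical. A z-score is (value - mean) / standard deviation.
import numpy as np
from scipy import stats
x = np.array([4, 7, 6, 9, 12, 5])
z_manual = (x - x.mean()) / x.std()  # (value - mean) / standard deviation
z_scipy = stats.zscore(x)            # same result as the manual calculation
print(z_manual)
print(z_scipy)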

Data Exploration Case Study: Credit Default
Exploratory data analysis is the main task of a data scientist, with as much as 60% of their time devoted to it. As such, the majority of their time is spent on something that is rather boring compared to building models.
This post will provide a simple example of how to analyze a dataset from the website Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.
- Load the libraries and dataset
- Deal with missing data
- Some descriptive stats
- Normality check
- Model development
This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here
Load Libraries and Data
Here are some packages we will need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics
You can load the data with the code below
df_train=pd.read_csv('/application_train.csv')
You can examine what variables are available with the code below. This is not displayed here because it is rather long
df_train.columns
df_train.head()
Missing Data
I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330
Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.
pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)
You can check df_train.shape or call .head() if you want to see how many variables are left.
Data Description & Visualization
For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.
round(df_train['AMT_CREDIT'].describe())
Out[8]:
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0
sns.distplot(df_train['AMT_CREDIT'])
round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]:
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL'])
I think you are getting the point. You can also look at categorical variables using the groupby() function.
We also need to address the categorical variables by creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variables.
df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)
df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)
df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)
df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)
df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)
df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)
df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)
df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)
df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)
You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.
df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)
Below are some boxplots with the target variable and other variables in the dataset.
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])
There is a clear outlier there. Below is another boxplot with a different variable
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])
It appears several people have more than 10 children. This is probably a typo.
Below is a correlation matrix using a heatmap technique
corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)
The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.
c = df_train.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())
FLAG_DOCUMENT_12  FLAG_MOBIL          0.000005
FLAG_MOBIL        FLAG_DOCUMENT_12    0.000005
Unknown           FLAG_MOBIL          0.000005
FLAG_MOBIL        Unknown             0.000005
Cash loans        FLAG_DOCUMENT_14    0.000005
The list is too long to show here, but the following variables were removed for having a high correlation with other variables.
df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)
Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms
sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)
This is not normal
sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)
This is not normal either. We could do transformations, or we can make a non-linear model instead.
Model Development
Now comes the easy part. We will make a decision tree using only some of the variables to predict the target. In the code below, we make our X and y datasets.
X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']
The code below fits our model and makes the predictions
clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)
Below is the confusion matrix followed by the accuracy
print(pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0
0       280873  18493
1         1813   6332
metrics.accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047
Lastly, we can look at the precision, recall, and f1 score
print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support
           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145
   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511
This model looks rather good in terms of accuracy on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy.
Conclusion
Data exploration and analysis is the primary task of a data scientist. This post was just an example of how it can be approached. Of course, there are many other creative ways to do this, but even the simple nature of this analysis yielded strong results.
Confidence Intervals for Proportions VIDEO
Calculating Confidence Intervals for Proportions
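As a quick companion to the video, here is a minimal sketch of a 95% confidence interval for a proportion using the normal approximation, p-hat +/- z * sqrt(p-hat(1 - p-hat)/n); the counts are hypothetical.
import math
successes, n = 60, 200
p_hat = successes / n
z = 1.96                                 # critical value for 95% confidence
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - margin, p_hat + margin)    # lower and upper bounds of the interval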

Hierarchical Regression in R
In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situations in which you want to see if adding additional variables to your model significantly changes the r-squared when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.
We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables across four models (the first being an intercept-only baseline).
model 1 = intercept only
model 2 = Sales~Urban + US + ShelveLoc
model 3 = Sales~Urban + US + ShelveLoc + price + income
model 4 = Sales~Urban + US + ShelveLoc + price + income + Advertising
Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question
Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?
In this complex research question, we essentially want to know if adding advertising cost will improve the model significantly in terms of the r-squared. The formal steps that we will follow to complete this analysis are as follows.
- Build sequential (nested) regression models by adding variables at each step.
- Run ANOVAs in order to compute the R2
- Compute difference in sum of squares for each step
- Check F-statistics and p-values for the SS differences.
- Compare sum of squares between models from ANOVA results.
- Compute increase in R2 from sum of square difference
- Run regression to obtain the coefficients for each independent variable.
We will now begin our analysis. Below is some initial code
library(ISLR)
data("Carseats")
Model Development
We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.
model1 = lm(Sales~1,Carseats)
model2=lm(Sales~Urban + US + ShelveLoc,Carseats)
model3=lm(Sales~Urban + US + ShelveLoc + Price + Income,Carseats)
model4=lm(Sales~Urban + US + ShelveLoc + Price + Income + Advertising,Carseats)
We can now turn to the ANOVA analysis for model comparison.
ANOVA Calculation
We will use the anova() function to calculate the total sum of squares for model 1. This will serve as a baseline for the other models when calculating r-squared.
anova(model1,model2,model3,model4)
## Analysis of Variance Table
##
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 399 3182.3
## 2 395 2105.4 4 1076.89 89.165 < 2.2e-16 ***
## 3 393 1299.6 2 805.83 133.443 < 2.2e-16 ***
## 4 392 1183.6 1 115.96 38.406 1.456e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.
model 1 = sum of squares = 3182.3
model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)
model 3 = sum of squares = 1299.6 (add price and income)
model 4 = sum of squares = 1183.6 (add Advertising)
Each model is statistically significant, which means that adding each set of variables led to some improvement.
By adding price and income to the model, we were able to improve the model in a statistically significant way. The r-squared increased by about .25. Below is how this was calculated.
2105.4-1299.6 #SS of Model 2 - Model 3
## [1] 805.8
805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by the total sum of squares, i.e., model 1
## [1] 0.2532131
When we add Advertising to the model, the r-squared increases by .03. The calculation is below
1299.6-1183.6 #SS of Model 3 - Model 4
## [1] 116
116/ 3182.3 #SS difference of Model 3 and Model 4 divided by the total sum of squares, i.e., model 1
## [1] 0.03645162
Coefficients and R Square
We will now look at a summary of each model using the summary() function.
summary(model2)
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.713 -1.634 -0.019 1.738 5.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8966 0.3398 14.411 < 2e-16 ***
## UrbanYes 0.0999 0.2543 0.393 0.6947
## USYes 0.8506 0.2424 3.510 0.0005 ***
## ShelveLocGood 4.6400 0.3453 13.438 < 2e-16 ***
## ShelveLocMedium 1.8168 0.2834 6.410 4.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared: 0.3384, Adjusted R-squared: 0.3317
## F-statistic: 50.51 on 4 and 395 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income,
## data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9096 -1.2405 -0.0384 1.2754 4.7041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.280690 0.561822 18.299 < 2e-16 ***
## UrbanYes 0.219106 0.200627 1.092 0.275
## USYes 0.928980 0.191956 4.840 1.87e-06 ***
## ShelveLocGood 4.911033 0.272685 18.010 < 2e-16 ***
## ShelveLocMedium 1.974874 0.223807 8.824 < 2e-16 ***
## Price -0.057059 0.003868 -14.752 < 2e-16 ***
## Income 0.013753 0.003282 4.190 3.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared: 0.5916, Adjusted R-squared: 0.5854
## F-statistic: 94.89 on 6 and 393 DF, p-value: < 2.2e-16
summary(model4)
##
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income +
## Advertising, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2199 -1.1703 0.0225 1.0826 4.1124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.299180 0.536862 19.184 < 2e-16 ***
## UrbanYes 0.198846 0.191739 1.037 0.300
## USYes -0.128868 0.250564 -0.514 0.607
## ShelveLocGood 4.859041 0.260701 18.638 < 2e-16 ***
## ShelveLocMedium 1.906622 0.214144 8.903 < 2e-16 ***
## Price -0.057163 0.003696 -15.467 < 2e-16 ***
## Income 0.013750 0.003136 4.384 1.50e-05 ***
## Advertising 0.111351 0.017968 6.197 1.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared: 0.6281, Adjusted R-squared: 0.6214
## F-statistic: 94.56 on 7 and 392 DF, p-value: < 2.2e-16
You can see for yourself the change in the r-squared. From model 2 to model 3 there is about a 25-point increase in r-squared, just as we calculated manually. From model 3 to model 4 there is about a 3-point increase. The purpose of the anova() analysis was to determine whether the significance of the change met a statistical criterion; the lm() summary reports the change but not its significance.
Conclusion
Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.
Zotero Reference Software VIDEO
A demo on the use of the Zotero Reference software
Confidence Intervals VIDEO
Calculating Confidence intervals
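As a quick companion to the video, here is a minimal sketch of a 95% confidence interval for a mean using the t distribution, x-bar +/- t * (s / sqrt(n)); the sample values are hypothetical.
import numpy as np
from scipy import stats
x = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.7])
n = len(x)
mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(n)                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)             # two-tailed critical value for 95% confidence
print(mean - t_crit * sem, mean + t_crit * sem)   # lower and upper bounds of the interval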

Gradient Boosting Classification in Python
Gradient boosting is an alternative form of boosting to AdaBoost, and many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses gradient-based optimization to weight the estimators. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.
In addition, gradient boosting requires several additional hyperparameters, such as max depth and subsample. Max depth controls how deep each tree can grow and thus how many nodes it has. The higher the number, the purer the classification becomes; the downside to this is the risk of overfitting.
Subsample has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up to the whole number 1. If the value is set below 1, the procedure becomes stochastic gradient boosting.
This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.
- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development
Below is some initial code.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
The data preparation is simple in this situation. All we need to do is load our dataset, drop missing values, and create our X and y datasets. All this happens in the code below.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
We will now develop our baseline decision tree model.
Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.
The criterion for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The model with the best accuracy will be the baseline model.
To achieve this, we need to use a for loop to have Python make several decision trees. We also need to set the parameters for the cross-validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
It appears that when the max depth is limited to 1, we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the hyperparameters for the gradient boosting algorithm.
Hyperparameter Tuning
There are several hyperparameters we need to tune. The ones we will tune are as follows
- number of estimators
- learning rate
- subsample
- max depth
First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross-validation object we made earlier, and n_jobs all together in one place. Below is the code for this.
GBC=GradientBoostingClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)
You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters, which means it might take some time to run the calculations. It is common to find values for max depth, subsample, and the number of estimators first, and then do a second run-through to find the learning rate. In our example, we are doing everything at once, which is why it takes longer. Below is the code along with the output for the best parameters and best score.
search.fit(X,y)
search.best_params_
Out[11]:
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
search.best_score_
Out[12]: 0.7425149700598802
You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.
Gradient Boosting Model
Below is the code and results for the model with the predetermined hyperparameter values.
ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[17]: 0.742279411764706
You can see that the results are similar. This provides additional evidence that the gradient boosting model does outperform the baseline decision tree model.
Conclusion
This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics, gradient boosting is generally a better-performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python
This post will share how to use the AdaBoost algorithm for regression in Python. What boosting does is make multiple models in a sequential manner. Each newer model tries to successfully predict what older models struggled with. For regression, the average of the models is used for the predictions. It is most common to use boosting with decision trees, but this approach can be used with any supervised machine learning algorithm.
Boosting is associated with ensemble learning because several models are created and averaged together. An assumption of boosting is that combining several weak models can make one really strong and accurate model.
For our purposes, we will be using AdaBoost regression to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of Adaboost regression model
- AdaBoost regression model development
Below is some initial code
from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
Data Preparation
There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
We will now proceed to creating the baseline regression decision tree model.
Baseline Regression Tree Model
The purpose of the baseline model is to compare it to the performance of our model that utilizes AdaBoost. To make this model, we need to initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop to create several trees that vary based on their depth. By depth, we mean how far the tree can go in splitting the data. More depth often leads to a higher likelihood of overfitting.
Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
It looks like a tree with a depth of 2 had the lowest error. We can now move to tuning the hyperparameters for the AdaBoost algorithm.
Hyperparameter Tuning
For hyperparameter tuning we need to start by instantiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters: the number of estimators and the learning rate. The number of estimators tells Python how many models to make, and the learning rate indicates how much each model contributes to the overall result. There is one more parameter, random_state, but this is just for setting the seed and never changes.
After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function, you set the estimator, which is the AdaBoostRegressor, the parameter grid we just made, the cross-validation object we created for the baseline model, and n_jobs, which allocates resources for the calculation. Below is the code.
ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
Next, we can run the model with the desired grid in place. Below is the code for fitting the model, as well as the best parameters and the score to expect when using them.
search.fit(X,y)
search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
search.best_score_
Out[32]: -164.93176650920856
The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean squared error of about 165, which is a little lower than the 176 of our single decision tree. We will see how this works when we run our model with the refined hyperparameters.
AdaBoost Regression Model
Below is our model, but this time with the refined hyperparameters.
ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[36]: -174.52604137201791
You can see the score is not quite as good, but it is within reason. Note that the learning rate used above (0.001) is not the tuned value of 0.01, which likely explains the gap.
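Because of that difference, here is a sketch of rerunning the model with the exact settings chosen by the grid search; the object name ada3 is just for illustration, and the resulting score is not reported in the original post.
ada3=AdaBoostRegressor(n_estimators=500,learning_rate=0.01,random_state=1) # ada3 is an illustrative name; values are those chosen by the grid search
score=np.mean(cross_val_score(ada3,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score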
Conclusion
In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can often help to strengthen a model.
Standard Deviation VIDEO
Calculating standard deviation

Research Questions, Variables, and Statistics
Working with students over the years has led me to the conclusion that often students do not understand the connection between variables, quantitative research questions and the statistical tools
used to answer these questions. In other words, students will take statistics and pass the class. Then they will take research methods, collect data, and have no idea how to analyze the data even though they have the necessary skills in statistics to succeed.
This means that the students have a theoretical understanding of statistics but struggle in the application of it. In this post, we will look at some of the connections between research questions and statistics.
Variables
Variables are important because how they are measured affects the type of question you can ask and get answers to. Students often have no clue how they will measure a variable and therefore have no idea how they will answer any research questions they may have.
Another aspect that can make this confusing is that many variables can be measured in more than one way. For example, the variable “salary” can be measured in a continuous manner or in a categorical manner. Which is superior depends on the goals of the research.
It is critical to help students develop a thorough understanding of variables in order to support their research.
Types of Research Questions
In general, there are two types of research questions: descriptive and relational. Descriptive questions involve the use of descriptive statistics such as the mean, median, mode, skew, kurtosis, etc. The purpose is to describe the sample quantitatively with numbers (i.e., the average height is 172 cm) rather than relying on qualitative descriptions of it (i.e., the people are tall).
Below are several example research questions that are descriptive in nature.
- What is the average height of the participants in the study?
- What proportion of the sample passed the exam?
- What are the respondents' perceptions of the cafeteria?
These questions are not intellectually sophisticated, but they are all answerable with descriptive statistical tools. Question 1 can be answered by calculating the mean. Question 2 can be answered by determining how many participants passed the exam and dividing by the total sample size. Question 3 can be answered by calculating the mean of all the survey items used to measure respondents' perception of the cafeteria, as sketched below.
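To make this concrete, below is a minimal sketch using pandas. The file name and the column names (height, passed coded 0/1, and three cafeteria items) are hypothetical placeholders, not data from an actual study.
import pandas as pd
df=pd.read_csv('survey.csv') # hypothetical survey data file
df['height'].mean() # question 1: average height
df['passed'].mean() # question 2: proportion who passed, since passed is coded 0/1
df[['caf_q1','caf_q2','caf_q3']].mean(axis=1).mean() # question 3: mean of the hypothetical cafeteria perception items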
Understanding the link between research question and statistical tool is critical. However, many people seem to miss the connection between the type of question and the tools to use.
Relational questions look for the connection or link between variables. Within this type, there are two sub-types. Comparison questions involve comparing groups. The other sub-type is the relational or association question.
Comparison questions involve comparing groups on a continuous variable, for example, comparing men and women by height. What you want to know is whether there is a difference in the height of men and women. The comparison here is trying to determine if gender is related to height. Therefore, it is still looking for a relationship, just not in the way that many students expect. Common comparison questions include the following.
- Is there a difference in height by gender among the participants?
- Is there a difference in reading scores by grade level?
- Is there a difference in job satisfaction based on major?
Each of these questions can be answered using ANOVA, or, if we want to get technical and there are only two groups (i.e., gender), a t-test. This is a broad overview and does not include the complexities of one-sample or paired t-tests. A small sketch of these tools appears below.
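As a rough sketch, and assuming a hypothetical pandas dataframe df with columns gender, height, grade, and reading, these comparisons could be run with scipy as follows.
from scipy import stats
men=df.loc[df['gender']=='male','height'] # heights of the male participants (hypothetical columns)
women=df.loc[df['gender']=='female','height'] # heights of the female participants
stats.ttest_ind(men,women) # two groups: independent-samples t-test
groups=[g['reading'] for _, g in df.groupby('grade')] # reading scores split by grade level
stats.f_oneway(*groups) # more than two groups: one-way ANOVA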
Relational or association questions primarily involve continuous variables. The goal is to see how variables move together. For example, you may look for the relationship between the height and weight of students. Common questions include the following.
- Is there a relationship between height and weight?
- Do height and shoe size explain weight?
Question 1 can be answered by calculating the correlation. Question 2 requires the use of linear regression, as sketched below.
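Below is a brief sketch of both, again assuming a hypothetical dataframe df with columns height, shoe_size, and weight.
df['height'].corr(df['weight']) # question 1: Pearson correlation between height and weight
from sklearn.linear_model import LinearRegression
model=LinearRegression().fit(df[['height','shoe_size']],df['weight']) # question 2: regression with two predictors
model.score(df[['height','shoe_size']],df['weight']) # R-squared of the fitted model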
Conclusion
The challenge as a teacher is showing students the connection between statistics and research questions from the real world. It takes time for students to see how the question determines the type of statistical tool to use. Understanding this is critical because it helps to frame the possibilities of what can be done in research based on the statistical knowledge one has.

Elastic Net Regression in Python
Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that with elastic net the algorithm can remove weak variables altogether, as with lasso, or reduce them to close to zero, as with ridge. All of these algorithms are examples of regularized regression.
This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.
- Data preparation
- Baseline model development
- Elastic net model development
To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.
from pydataset import data
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Data Preparation
We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our X and y datasets. Below is the code.
df=pd.DataFrame(data('Fair'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
df.loc[df.child== 'no', 'child'] = 0
df.loc[df.child== 'yes','child'] = 1
df['child'] = df['child'].astype(int)
X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]
y=df['rate']
We can now proceed to creating the baseline model
Baseline Model
This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.
regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
1.0498738644696668
This mean squared error of 1.05 is our benchmark for determining if the elastic net model is better or worse. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to pair each feature with its coefficient.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'religious': 0.04235281110639178,
'age': -0.009059645428673819,
'sex': 0.08882013337087094,
'ym': -0.030458802565476516,
'education': 0.06810255742293699,
'occupation': -0.005979506852998164,
'nbaffairs': -0.07882571247653956}
We will now move to making the elastic net model.
Elastic Net Model
Elastic net, just like ridge and lasso regression, requires normalized data. This argument is set inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid as we created for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.
When the l1_ratio is set to 0, it is the same as ridge regression; when it is set to 1, it is lasso. Elastic net uses an l1_ratio somewhere between 0 and 1. Therefore, in our grid, we need to test several values of this argument. Below is the code.
elastic=ElasticNet(normalize=True)
search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
We will now fit our model and display the best parameters and the best results we can get with that setup.
search.fit(X,y)
search.best_params_
Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}
abs(search.best_score_)
Out[74]: 1.0816514028705004
The best hyperparameters were an alpha of 0.001 and an l1_ratio of 0.8. With these settings, we got an MSE of 1.08, which is above the 1.05 of our baseline model. This means that elastic net is doing worse than linear regression. For clarity, we will set our hyperparameters to roughly the recommended values and run the model on the data.
elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)
elastic.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))
print(second_model)
1.0566430678343806
Now our values are about the same (note that the l1_ratio above was set to 0.75 rather than the recommended 0.8). Below are the coefficients.
coef_dict_baseline = {}
for coef, feat in zip(elastic.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[76]:
{'religious': 0.01947541724957858,
'age': -0.008630896492807691,
'sex': 0.018116464568090795,
'ym': -0.024224831274512956,
'education': 0.04429085595448633,
'occupation': -0.0,
'nbaffairs': -0.06679513627963515}
The coefficients are mostly the same. Notice that occupation was completely removed from the model in the elastic net version. This means that this variable was not useful to the algorithm. Traditional regression cannot do this.
Conclusion
This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

Lasso Regression with Python
Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.
Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “Caschool” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are as follows.
- Data preparation
- Develop a baseline linear model
- Develop lasso regression model
The initial code is as follows
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
df=pd.DataFrame(data('Caschool'))
Data Preparation
The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent, but it was decided to leave them out anyway. Below is the code.
X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]
y=df['testscr']
Baseline Model
We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.
regression=LinearRegression()
regression.fit(X,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
69.07380530137416
First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we generate predictions from our regression model, calculate the mean squared error, and save the result to the object first_model. Lastly, we print the result.
Below are the coefficients for the baseline regression model.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[52]:
{'teachers': 0.00010011947964873427,
'calwpct': -0.07813766458116565,
'mealpct': -0.3754719080127311,
'compstu': 11.914006268826652,
'expnstu': 0.001525630709965126,
'str': -0.19234209691788984,
'avginc': 0.6211690806021222,
'elpct': -0.19857026121348267}
The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.
Lasso Model
For our lasso model, we have to determine what value to set the l1 penalty, or alpha, to prior to creating the model. This can be done with a grid search, which allows you to assess several models with different alpha settings. Python will then tell you which setting is the best. Below is the code.
lasso=Lasso(normalize=True)
search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
We start by instantiating lasso with normalization set to true; it is important to scale data when doing regularized regression. Next, we set up our grid, which includes the estimator, the parameter grid, and the scoring metric. The alpha values are set using logspace: we want exponents between -5 and 2, with 8 evenly spaced settings. The other arguments include cv, which sets the number of cross-validation folds, n_jobs, which affects processing, and refit, which refits the best model on the full data.
After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.
search.best_params_
Out[55]: {'alpha': 1e-05}
abs(search.best_score_)
Out[56]: 85.38831122904011
The alpha is set almost to zero, which makes the model nearly the same as an ordinary regression model. You can also see that the mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.
lasso=Lasso(normalize=True,alpha=1e-05)
lasso.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))
print(second_model)
69.0738055527604
The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.
coef_dict_baseline = {}
for coef, feat in zip(lasso.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[63]:
{'teachers': 9.795933425676567e-05,
'calwpct': -0.07810938255735576,
'mealpct': -0.37548182158171706,
'compstu': 11.912164626067028,
'expnstu': 0.001525439984250718,
'str': -0.19225486069458508,
'avginc': 0.6211695477945162,
'elpct': -0.1985510490295491}
The coefficient values are also slightly different. The most notable difference is that the teachers variable was essentially set to zero. This means that it is not a useful variable for predicting testscr, which is ironic to say the least.
Conclusion
Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python is fairly simple. This is yet another tool that can be used in statistical analysis.

Ridge Regression in Python
Ridge regression is one of several regularized linear models. Regularization is the process of penalizing the coefficients of variables, either by removing them or by reducing their impact. Ridge regression reduces the effect of problematic variables to close to zero but never fully removes them.
We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps.
- Data preparation
- Baseline model development
- Ridge regression model
Below is the initial code
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Data Preparation
The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.
df=pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)
X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y=df['lnhhexp']
We can now create our baseline regression model.
Baseline Model
The metric we are using is the mean squared error. This is a model that has no regularization to it. Below is the code and output for our baseline regression model.
regression=LinearRegression()
regression.fit(X,y)
first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)
0.35528915032173053
This value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.
Ridge Model
In order to create our ridge model, we first need to determine the most appropriate value for the l2 regularization. L2 is the type of penalty used in ridge regression, and its strength is set with the alpha hyperparameter. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first create our ridge model and indicate normalization in order to get better estimates. Next, we set up the grid that we will use. Below is the code.
ridge=Ridge(normalize=True)
search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. The logspace call defines the range of values we want to test: exponents from -5 to 2, with only 8 evenly spaced values from within that range. Our metric is the mean squared error. Setting refit to true means the best model is refit on the full data, and cv is the number of folds to use for the cross-validation. We can now use the .fit function to run the search and then use the .best_params_ and .best_score_ attributes to determine the model's strength. Below is the code.
search.fit(X,y)
search.best_params_
{'alpha': 0.01}
abs(search.best_score_)
0.3801489007094425
The best_params_ attribute tells us what to set alpha to, which in this case is 0.01. The best_score_ attribute tells us the best possible mean squared error. In this case, the value of 0.38 is worse than what the baseline model produced. We can confirm this by fitting our model with the ridge settings and finding the mean squared error. This is done below.
ridge=Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)
0.35529321992606566
The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.
coef_dict_baseline = {}
for coef, feat in zip(regression.coef_,X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline
Out[188]:
{'pharvis': 0.013282050886950674,
'age': 0.06480086550467873,
'sex': 0.004012412278795848,
'married': -0.08739614349708981,
'educ': 0.075276463838362,
'illness': -0.06180921300600292,
'injury': 0.040870384578962596,
'illdays': -0.002763768716569026,
'actdays': -0.006717063310893158,
'insurance': 0.1468784364977112}
coef_dict_ridge = {}
for coef, feat in zip(ridge.coef_,X.columns):
    coef_dict_ridge[feat] = coef
coef_dict_ridge
Out[190]:
{'pharvis': 0.012881937698185289,
'age': 0.06335455237380987,
'sex': 0.003896623321297935,
'married': -0.0846541637961565,
'educ': 0.07451889604357693,
'illness': -0.06098723778992694,
'injury': 0.039430607922053884,
'illdays': -0.002779341753010467,
'actdays': -0.006551280792122459,
'insurance': 0.14663287713359757}
The coefficient values are about the same. This means that the penalization made little difference with this dataset.
Conclusion
Ridge regression allows you to penalize variables based on their usefulness in developing the model. With this form of regularized regression, the coefficients of the variables are never set to zero. Other forms of regularized regression allow for the total removal of variables. One example of this is lasso regression.

Computational Thinking
Computational thinking is the process of expressing a problem in a way that a computer can solve. In general, there are four various ways that computational thinking can be done. These four ways are decomposition, pattern recognition, abstraction, and algorithmic thinking.
Although computational thinking belongs to the realm of computer science, everyone thinks computationally at one time or another, especially in school. Awareness of these often subconscious strategies can help people understand how they think as well as the various ways in which thinking is possible.
Decomposition
Decomposition is the process of breaking a large problem down into smaller and smaller parts or problems. The benefit of this is that by addressing all of the created little problems you can solve the large problem.
In education, decomposition can show up in many ways. Teachers often have to break goals down into objectives, and sometimes further into the procedures of a daily lesson plan. Seeing the big picture of the content students need and breaking it down into pieces that students can comprehend, such as with chunking, is critical to education.
For the student, decomposition involves breaking down the parts of a project such as writing a paper. The student has to determine what to do and how it helps to achieve the completion of their project.
Pattern Recognition
Pattern recognition refers to how various aspects of a problem have things in common. For a teacher, this may involve the development of a thematic unit. Developing such a unit requires the teacher to see what various subjects or disciplines have in common.
For the student, pattern recognition can support the development of shortcuts. Examples include seeing similarities in assignments that need to be completed and completing similar assignments together.
Abstraction
Abstraction is the ability to remove irrelevant information from a problem. This is perhaps the most challenging form of thinking to develop because people often fall into the trap that everything is important.
For a teacher, abstraction involves teaching only the critical information in the content and not stressing the small stuff. This is not easy, especially when the teacher has a passion for their subject; that passion can blind them to the need to share only the most relevant information about their field with their students.
For students, abstraction involves being able to share the most critical information. Students are guilty of the same problems as teachers in that they share everything when writing or presenting. Determining what is important requires the development of an opinion to judge the relevance of something. This is a skill that is hard to find among graduates.
Algorithmic Thinking
Algorithmic thinking is being able to develop a step-by-step plan to do something. For teachers, this happens everyday through planning activities and leading a class. Planning may be the most common form of thinking for the teacher.
For students, algorithmic thinking is somewhat more challenging. It is common for younger people to rely heavily on intuition to accomplish tasks. This means that they did something but they do not know how they did it.
Another common mistake for young people is doing things through brute force. Rather than planning, they will just keep pounding away until something works. As such, it is common for students to do things the “hard way” as the saying goes.
Conclusion
Computational thinking is really how humans think in many situations in which emotions are not the primary mover. As such, what is really happening is not that computers are thinking so much as they are modeling how humans think. In education, there are several situations in which computational thinking can be employed for success.

3 Steps to Successful Research
When students have to conduct a research project they often struggle with determining what to do. There are many decisions that have to be made that can impede a student’s chances of achieving success. However, there are ways to overcome this problem.
This post will essentially reduce the decision-making process for conducting research down to three main questions that need to be addressed. These questions are as follows.
- What do you Want to Know?
- How do You Get the Answer?
- What Does Your Answer Mean?
Answering these three questions makes it much easier to develop a sense of direction and scope in order to complete a project.
What do you Want to Know?
Often, students want to complete a project but it is unclear to them what they are trying to figure out. In other words, the students do not know what it is that they want to know. Therefore, one of the first steps in research is to determine exactly what it is you want to know.
Understanding what you want to know will allow you to develop a problem as well as research questions to facilitate your ability to understand exactly what it is that you are looking for. Research always begins with a problem and questions about the problem and this is simply another way of stating what it is that you want to know.
How do You Get the Answer?
Once it is clear what it is that you want to know it is critical that you develop a process for determining how you will obtain the answers. It is often difficult for students to develop a systematic way in which to answer questions. However, in a research paradigm, a scientific way of addressing questions is critical.
Determining how to get answers to what you want to know is essentially the development of your methodology section. This section includes such matters as the research design, sample, ethics, data analysis, etc. The purpose here is again to explain the way to get the answer(s).
What Does Your Answer Mean?
After you actually get the answer you have to explain what it means. Many students fall into the trap of doing something without understanding why or determining the relevance of the outcome. However, a research project requires some sort of interpretation or explanation of the results. Just getting the answer is not enough it is the meaning that holds the power.
Often, the answers to the research questions are found in the results section of a paper and the meaning is found in the discussion and conclusion section. In the discussion section, you explain the major findings with interpretation, share recommendations, and provide a conclusion. This requires thought about the usefulness of what you wanted to know. In other words, you are explaining why someone else should care about your work. This is much harder to do than many realize.
Conclusion
Research is challenging, but if you keep these three keys in mind, it will help you to see the big picture of research and to focus on the goals of your study rather than on the tiny details that encompass the process.

Random Forest Regression in Python
Random forest is simply the making of dozens, if not thousands, of decision trees. For classification, the decision each tree makes about an example is tallied as a vote, and the class that receives the most votes wins. For regression, the results of the trees are averaged in order to give the most accurate result.
In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below
df = data('cancer')
df=df.dropna()
X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['age']
Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestRegressor(n_estimators=100,random_state=1)
We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
mean_squared_error(y_test,y_pred)
71.75780196078432
The MSE of 71.75 is only useful for model comparison and has little meaning by itself. Another way to assess the model is by determining variable importance. This helps you to determine, in a descriptive way, the strongest variables for the regression model. The code is below, followed by the plot of the variables.
model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')
As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.
Conclusion
This post provided an example of the use of random forest for regression. Through the use of an ensemble of trees, you can improve the accuracy of your models. This is a distinct power that is not available with many other machine learning algorithms.

Support Vector Machines Regression with Python
This post will provide an example of how to do regression with support vector machines (SVM). SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.
The steps that we will use are listed below
- Data preparation
- Model Development
We will use two different estimators in our analysis: LinearSVR and SVR. The difference between these two has to do with slight changes in how the boundaries are calculated.
Data Preparation
We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is to predict family income (faminc), which is a continuous variable. Below is some initial code.
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse
We now need to load our dataset and remove any missing values.
df=pd.DataFrame(data('OFP'))
df=df.dropna()
As in the previous post, we need to change the text variables into dummy variables, and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and scales the data.
dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)
dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)
dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df = (df - df.min()) / (df.max() - df.min())
df.head()
We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']
We can now move to model development
Model Development
We now need to create our train and test sets for our X and y datasets. We will do a 70/30 split of the data. Below is the code.
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
Next, we will create our two models with the code below.
h1=svm.SVR()
h2=svm.LinearSVR()
We will now run our first model and assess the results. Our metric is the mean squared error; generally, the lower the number, the better. We will use the .fit() function to train the model and the .predict() function to test the model.
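The code for this step did not survive in this version of the post. Below is a minimal sketch of what it likely looked like, using the objects already defined above (h1, the train/test splits, and the mse function from statsmodels); the y_pred name and the exact original code are assumptions.
h1.fit(X_train,y_train) # train the SVR model on the training data
y_pred=h1.predict(X_test) # predict family income for the test set (illustrative variable name)
mse(y_test,y_pred) # mean squared error of the predictions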
The mse was 0.27. This number means little on its own and is only useful for comparison purposes. Therefore, the second model will be judged as better only if its mse is lower than 0.27. Below are the results of the second model.
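Again, the original code is missing here; a comparable sketch for the LinearSVR model, under the same assumptions, would be:
h2.fit(X_train,y_train) # train the LinearSVR model
y_pred2=h2.predict(X_test) # predict on the test set (illustrative variable name)
mse(y_test,y_pred2) # mean squared error for the second model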
We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.
Conclusion
This post provided an example of how to use SVM for regression.

Analyzing Twitter Data in Python
In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barack Obama.
- Conduct a sentiment analysis
- Create a word cloud
This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.
- Obtain all of the necessary information from your twitter apps account
- Download the tweets & clean
- Perform the analysis
Before we begin, here is a list of modules we will need to load to complete our analysis
import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy
Obtain all Needed Information
From your twitter app account, you need the following information
- App key
- App key secret
- Access token
- Access token secret
All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.
TWITTER_APP_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXX
TWITTER_APP_KEY_SECRET=XXXXXXXXXXXXXXXXXXX
TWITTER_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXX
TWITTER_ACCESS_TOKEN_SECRET=XXXXXXXXXXXXXX
t=twython.Twython(app_key=TWITTER_APP_KEY,app_secret=TWITTER_APP_KEY_SECRET,oauth_token=TWITTER_ACCESS_TOKEN,oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)
In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.
Next, we need to create a function that will pull the tweets from Twitter. Below is the code.
def get_tweets(twython_object,query,n):
    count=0
    result_generator=twython_object.cursor(twython_object.search,q=query)
    result_set=[]
    for r in result_generator:
        result_set.append(r['text'])
        count+=1
        if count==n:
            break
    return result_set
I will leave it to you to work through the details of this function yourself. We can now download the tweets.
Downloading & Cleaning Tweets
Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary, one for Trump and the other for Obama, because we are downloading tweets about these two people.
There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.
tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower()) for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower()) for tweet in get_tweets(t,'#obama',1500)]
The get_tweets function is used in the code above along with our Twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis.
Analysis
To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.
positive_words=open('XXXXXXXXXXXX').read().split('\n')
negative_words=open('XXXXXXXXXXXX').read().split('\n')
We now will make a function to calculate the sentiment
def sentiment_score(text,pos_list,neg_list):
    positive_score=0
    negative_score=0
    for w in text.split(' '):
        if w in pos_list:
            positive_score+=1
        if w in neg_list:
            negative_score+=1
    return positive_score-negative_score
Now we create an empty dictionary and run the analysis for Trump and then for Obama
tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words) for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words) for tweet in tweets['obama']]
Now we can make visuals of our results with the code below
trump=plt.hist(tweets_sentiment['trump'],5)
obama=plt.hist(tweets_sentiment['obama'],5)
Obama is on the left and Trump is on the right. It seems that Trump tweets are consistently more positive. Below are the means for both.
numpy.mean(tweets_sentiment['trump'])
Out[133]: 0.36363636363636365
numpy.mean(tweets_sentiment['obama'])
Out[134]: 0.2222222222222222
Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud.
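The original code was not preserved in this version of the post, so the following is a minimal reconstruction using the wordcloud and matplotlib modules imported earlier; the variable names and settings are assumptions.
trump_text=' '.join(tweets['trump']) # combine all Trump tweets into one string (illustrative name)
wc=wordcloud.WordCloud(background_color='white').generate(trump_text) # build the word cloud; settings are assumed
plt.imshow(wc,interpolation='bilinear') # display the cloud as an image
plt.axis('off')
plt.show()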
Here is the code for the Obama word cloud
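As above, this is a sketch rather than the original code; the same steps apply to the Obama tweets.
obama_text=' '.join(tweets['obama']) # combine all Obama tweets into one string (illustrative name)
wc=wordcloud.WordCloud(background_color='white').generate(obama_text) # build the word cloud; settings are assumed
plt.imshow(wc,interpolation='bilinear')
plt.axis('off')
plt.show()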
A lot of speculation can be made from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter; people are always posting tweets, which changes the results.
Conclusion
This post provided an example of how to download and analyze tweets from Twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis, as it is easy to become confused and not accomplish anything.