Presumption & Burden of Proof in Debating

In debating, it is important to understand the roles of presumption and burden of proof and how these concepts relate to the status quo. This post will explain these terms.

Status Quo

The status quo is the way things currently are or the way things are currently done. The affirmative in a debate is generally pushing for a change or departure from the status quo. This is in no way easy, as people often prefer to keep things the way they are and minimize change.

Presumption

Presumption is the tendency to favor one side of an argument over another. There are at least two forms of presumption: judicial presumption and policy presumption.

Judicial presumption always favors the status quo, or keeping things the way they currently are. Small changes can be made, but the existing structure is not going to be different. In debates that happen from the judicial perspective, it is the affirmative side that carries the burden of proof and must show that the benefit of change outweighs the status quo. A common idiom that summarizes this preference for the status quo is “If it ain’t broke, don’t fix it.”

The policy form of presumption is used when a change to the status quo is necessary. An example would be replacing an employee. The status quo of keeping the worker is impossible, and the debate is now focused on who should be the replacement. A debate from a policy perspective is about which of the new approaches is the best to adopt.

In addition, under policy presumption the concept shifts from the burden of proof to a burden of proof. This is because each side of the debate must support the course of action that it is advocating.

Burden of Refutation

The burden of refutation is the obligation to respond to the opposing arguments. In other words, debaters often need to explain why the other side’s arguments are weak or perhaps even wrong. Failure to do so weakens the refuting debater’s position.

This leads to the point that there are no ties in debating. If both sides are equally good, the status quo wins, which normally means the negative side wins. This is because the affirmative side did not meet the burden of proof necessary to warrant change.

Conclusion

The structure of debating requires that debaters have a basic understanding of the various concepts in this field. Therefore, understanding such terms as status quo, presumption, and burden of proof is beneficial, if not required, in order to participate in debating.


Intro to D3.js

D3.js is a JavaScript library that is useful for manipulating HTML documents. D3 stands for Data-Driven Documents (the .js indicates JavaScript), and the library was developed by Mike Bostock. One of the primary purposes of D3.js is data visualization. However, D3.js can do more than just provide data visualization, as it also allows for interaction binding, item selection, and dynamic styling of DOM (document object model) elements.

In order to use D3.js you should have a basic understanding of HTML. For data visualization you should also have some idea of basic statistics and data visualization concepts in order to know what it is you want to visualize. You also need to pick some sort of editor or IDE so that you can write your code. Lastly, you must know how to start a server on your local computer so you can see the results of your work; a simple way to do this is shown below.
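For example, if Python 3 is installed, one simple option is to run its built-in development server from the folder that contains your HTML file and then open http://localhost:8000 in a browser:

python -m http.server 8000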

In this post, we will make a document that uses D3.js to display the phrase “Hello World”.

Example

Below is a bare-minimum skeleton that an HTML document often has.

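A minimal sketch of such a skeleton is shown here; your exact markup may differ slightly.

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>My Page</title>
    </head>
    <body>
    </body>
</html>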

The first line indicates the document type. The next line opens the html element, which can be used to store information about the page. The head element holds information about the page, while the body element is where the content of the web page is mostly kept. Notice how the information is contained in the various elements: all code is contained within the html element, and the head and body elements are separate from one another.

First Script

We are now going to add our first few lines of d3.js code to our html document. Below is the code.

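A sketch of the full document follows; here the library is loaded from d3js.org, and the exact version linked is an assumption.

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Hello World</title>
    </head>
    <body>
        <script src="https://d3js.org/d3.v5.min.js" charset="utf-8"></script>
        <script>
            d3.select('body')
                .append('h1')
                .text('Hello World');
        </script>
    </body>
</html>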

The code within the body element is new. The first script element accesses the d3.js library. Notice how it is a link, which means that when we run the code the d3.js library is accessed from somewhere else on the internet. An alternative to this is to download d3.js yourself. If you do this, d3.js must be in the same folder as the html document that you are making.

To get d3.js from the internet you use the src attribute and place the web link in quotations. The charset attribute sets the character encoding; sometimes this information is important, but it depends on the page.

The second script element is where we actually do something with d3.js. Inside it, the command d3.select(‘body’) tells d3.js to select the first body element in the document. Next, .append(‘h1’) tells d3.js to add an h1 heading to the body element. Lastly, .text(‘Hello World’) tells d3.js to add the text ‘Hello World’ to that h1 heading. This process of adding one command after another to modify the same object is called chaining.

When everything is done and you open the page in your browser, you should see an h1 heading with the text “Hello World.”

This is not the most amazing thing to see given what d3.js can do but it serves as an introduction.

More Examples

You are not limited to only one line of code or only one script element. Below is a more advanced version.

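A sketch of the extended document follows; the paragraph text here is just filler for illustration.

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Hello World</title>
    </head>
    <body>
        <script src="https://d3js.org/d3.v5.min.js" charset="utf-8"></script>
        <script>
            d3.select('body')
                .append('h1')
                .text('Hello World');
        </script>
        <p>This is the first paragraph.</p>
        <p>This is the second paragraph.</p>
    </body>
    <script>
        d3.selectAll('p')
            .style('background-color', 'lightblue');
    </script>
</html>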

The new additions are two p elements with some text and one more script element. This time we use d3.selectAll(‘p’), which tells d3.js to apply the following commands to all p elements. We then use the .style command to set the background color of the paragraphs to light blue. When this is done, the two paragraphs should appear with a light blue background.


Still not amazing, but things were modified as we wanted. Notice also that the last script element is not inside the body element. This is not a problem because you never see script elements on a website; rather, you see the results of the code they contain.

Conclusion

This post introduced d3.js, which is a powerful tool for visualization. Although the examples here are fairly simple, you can be assured that there is more to this library than what has been explored so far.

Quadratic Discriminant Analysis with Python

Quadratic discriminant analysis (QDA) allows the classifier to capture non-linear relationships, which is something linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a QDA analysis using Python. The steps that will be conducted are as follows:

  1. Data preparation
  2. Model training
  3. Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data, “Wages1,” which comes from the “pydataset” library. After loading the data, we will use the .head() method to look at it briefly.
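A minimal sketch of this loading step would be:

df = data('Wages1')
df.head()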


We need to transform the variable ‘sex’, our dependent variable, into a dummy variable using numbers instead of text. We will use the .get_dummies() function to make the dummy variables and then add them to the dataset using the .concat() function. The code for this is below.
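A sketch of this step, consistent with the ‘female’ and ‘male’ columns used later, would be:

dummy = pd.get_dummies(df['sex'])
df = pd.concat([df, dummy], axis=1)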

In the code below we make histograms for the continuous independent variables. We are using the .distplot() method from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)
sns.set(font_scale=1.4)
sns.distplot(df['exper'],color='black',ax=axs[0])
sns.distplot(df['school'],color='black',ax=axs[1])
sns.distplot(df['wage'],color='black',ax=axs[2])


The variables look reasonably normal. Below are the proportions of the categorical dependent variable.

round(df.groupby('sex').count()/3294,2)
Out[247]: 
exper school wage female male
sex 
female 0.48 0.48 0.48 0.48 0.48
male 0.52 0.52 0.52 0.52 0.52

About half male and half female.

We will now make the correlation matrix

corrmat=df.corr(method='pearson')
f,ax=plt.subplots(figsize=(12,12))
sns.set(font_scale=1.2)
sns.heatmap(round(corrmat,2),
            vmax=1.,square=True,
            cmap="gist_gray",annot=True)


There appear to be no major problems with the correlations. The last thing we will do is set up our train and test datasets.

X=df[['exper','school','wage']]
y=df['male']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=50)

We can now move to model development

Model Development

To create our model we will make an instance of the quadratic discriminant analysis classifier and use the .fit() method.

qda_model=QDA()
qda_model.fit(X_train,y_train)

There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means, which are shown below.
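One way to produce this table from the raw data is the line below; the fitted model also stores the class means of the training data in qda_model.means_.

round(df.groupby('sex')[['exper', 'school', 'wage']].mean(), 2)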

exper school wage
Female 7.73 11.84 5.14
Male 8.28 11.49 6.38

You can see from the table that men generally have more experience and higher wages, but slightly less education.

We will now use the qda_model we created to predict the classifications for the training set. This information will be used to make a confusion matrix.

y_pred = qda_model.predict(X_train)
cm = confusion_matrix(y_train, y_pred)
ax = plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The upper-left corner contains the number of people who were female and correctly classified as female. The lower-right corner is for the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as female. Below is the actual accuracy of our model.

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

y_pred=qda_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ax = plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

About the same. Our model generalizes, even though it performs somewhat poorly.

Conclusion

This post provided an explanation of how to do a quadratic discriminant analysis using Python. This is just another potential tool that may be useful for the data scientist.

Background of Debates

Debating has a history as long as the history of man. There is evidence that debating dates back at least 4,000 years. From Egypt to China, and even in poetry such as Homer’s “Iliad,” one can find examples of debating. Academic debating is believed to have started about 2,500 years ago with the work of Protagoras.

We will look at the role of culture in debating as well as debate’s role in academics in the US along with some of the benefits of debating.

Debating and Culture

For whatever reason, debating is a key component of Western civilization, and in particular of democratic societies. Speculating on why could go on forever. However, one key reason for the emphasis on debating in the West is the epistemological view of truth.

In many Western cultures, there is an underlying belief that truth is relative. As such, when two sides debate a topic, it is through the combined contributions of both arguments that some idea of truth is revealed. In many ways, this is a form of the Hegelian dialectic in which thesis and antithesis produce a synthesis. The synthesis is the truth and can only be found through a struggle of opposing relative positions.

In other cultures, such as many Asian ones, what is true is much more stable and agreed upon as unchanging. This may be a partial reason why debating is not as strenuously practiced in non-Western contexts. Confucianism in particular focuses on stability, tradition, and rigid hierarchy, concepts that are often considered unthinkable in a Western culture.

Debating in the United States

In the United States, applied debating has been part of the country from almost the beginning, and academic debating has been present since at least the 18th century. It was at the beginning of the 20th century that academic debating began to be taken much more seriously. Intercollegiate debating during this time led to the development of several debate associations with various rules and ways to support the growth of debating.

Benefits of Debating

Debating has been found to develop argumentation skills and critical thinking and to enhance general academic performance. Having to gather information and synthesize it in a clear way seems to transfer when students study other academic subjects. In addition, even though debating is about sharing one side of an argument, it also improves listening skills, because you have to listen in order to point out weaknesses in the opposition’s position.

Debating also develops the ability to think quickly. If this ability is not developed, a student will struggle with refutation and rebuttals, which are key components of debating. Lastly, debating sharpens the judgment of participants. It is important to be able to judge the strengths and weaknesses of the various aspects of an argument in order to provide a strong case for or against an idea or action, and this requires sharp judgment.

Conclusion

With its rich history and clear benefits, debating will continue to be a part of the academic experience of many students. The skills that are developed are practical and useful for many occupations outside of an academic setting.

Types of Debates

Debating has a long history, with historical evidence of this practice dating back 4,000 years. Debating was used in ancient Egypt, China, and Greece. As such, people who participate in debates are contributing to a rich history.

In this post, we will take a look at several types of debates that are commonly used today. The types of debates we will cover are as follows.

  • Special
  • Judicial
  • Parliamentary
  • Non-formal
  • Academic

Special Debate

A special debate is special because it has distinct rules for a specific occasion. Examples include the Lincoln-Douglas debates of 1858. These debates were so influential that there is a debate format today called the Lincoln-Douglas format. This format often focuses on moral issues and has its own distinctive allotment of time for the debaters.

Special debates are also commonly used for presidential debates. Since there is no set format, the debaters may literally debate over the rules of the actual debate. For example, the Bush vs. Kerry debates of 2004 had some of the following rules agreed to by both parties prior to the debate.

  1. Height of the lectern
  2. Type of stools used
  3. Nature of the audience

As the example above shows, sometimes the rules have nothing to do with the actual debate but rather with the atmosphere and setting around it.

Judicial Debate

Judicial debates happen in courts and judicial-like settings. The goal is to prosecute or defend individuals accused of some sort of crime. For lawyers in training, or even general students, moot court and mock trial debates are used to hone debating skills.

Parliamentary Debate

The purpose of a parliamentary debate is to support or attack potential legislation. Despite its name, the parliamentary debate format is used in the United States at various levels of government. There is a particularly famous variation of this called the Asian parliamentary debate style.

Non-formal Debate

A non-formal debate lacks the rules of the other styles mentioned. In many ways, any form of disagreement that does not have a structure for how to present one’s argument can fall under the category of non-formal. For example, children arguing with parents could be considered non-formal, as could a classroom discussion on a controversial issue such as immigration.

This form of debate is probably the only one that everyone is familiar with and has participated in. However, it is probably the hardest to develop skills in due to the lack of structure.

Academic Debate

The academic debate is used to develop the educational skills of the participants. Often the format deployed is taken from applied debates. For example, many academic debates use the Lincoln-Douglas format. There are several major debate organizations that promote debate competitions between schools. The details of this will be expanded on in a future post.

Conclusion

This post provided an overview of different styles of debating that are commonly employed. Understanding this can be important because how you present and defend a point of view depends on the rules of engagement.

Persuasion vs Propaganda

Getting people to believe or do something has long been a challenge at both the individual and the international level. To address this concern, both individuals and nations have turned to persuasion and propaganda. This post will define persuasion and propaganda and then compare and contrast them.

Persuasion

Persuasion is communication that attempts to influence the behavior or beliefs of others. This can be done through appeals to reason, appeals to emotion, or a combination of both. Often persuasion is done on a small scale and is informal. An example would be a child trying to persuade their mother to let them go outside to play.

A more serious example would be a lawyer trying to persuade a judge. This involves one lawyer trying to move the opinion of one judge. The goal here is for the lawyer to show the strength of their position while discrediting the position of the opposition.

Even though it is not done on a large scale, persuasion requires critical thinking, deep thought, and a thorough knowledge of the problem and of the person(s) one is trying to persuade. Nothing can ruin persuasion like ignorance of the problem or of the people you want to persuade.

Propaganda

Propaganda is persuasion on a large scale. It involves a group or organization of persuaders who combine their efforts to reach a large audience. The term propaganda was supposedly coined in the 17th century by Pope Gregory XV, who in 1622 created the Sacred Congregation for the Propagation of the Faith. This group was responsible for spreading the Catholic religion in an evangelistic manner to win conversions, which implies that propaganda is the spreading of ideas so that people accept them.

Edward Bernays is often seen as the master of propaganda. It was he who raised the use of propaganda to an art form in the early to mid 20th century. Ever the master of language, and knowing the negative connotation of propaganda, Bernays publicly used an alternative term on many occasions: “public relations.” This is the term essentially all institutions use today, even though it has the same primary characteristic as propaganda, which is to influence public opinion about something.

As mentioned in the previous paragraph, the term propaganda is generally viewed negatively even though it is simply massive, organized persuasion. This may be because propaganda has often been used for nefarious purposes throughout history. For example, Hitler used propaganda to strengthen the Nazi party. However, all countries are guilty of developing propaganda for reasons that may not be completely altruistic in order to support their position in a competitive world.

Comparison

Persuasion and propaganda are in many ways opposite extremes of the same idea. What persuasion is on a small scale, propaganda is on a large scale. However, it is hard to tell how big persuasion has to become before it reaches the level of propaganda. One indication may be in the perception of the message: people who disagree with a position may call it propaganda, while people who agree with the message may call it persuasion.

Both persuasion and propaganda involve planning and serious thought. Propaganda may involve more planning, as it requires a large group of people to impact a much larger audience. Finally, when persuasion and propaganda fail, they may give way to something more sinister called coercion. This is when people are forced not necessarily to believe something but to do something.

Conclusion

Whether persuading or sharing propaganda, it is important to be aware of how these two terms are similar and different. Generally, the difference is a matter of scale: persuasion is a local, personal form of propaganda, while propaganda is a massive, impersonal form of persuasion.

Random Forest Classification with Python

Random forest is a machine learning algorithm that builds as many decision trees as you specify, with each tree potentially using a different subsample of the data and a different subset of the features. The trees then vote to determine the class of an example. This approach helps to deal with the high variance that is a problem when making only one decision tree.

In this post, we will learn how to develop a random forest model in Python. We will use the cancer dataset from the pydataset module to classify whether a person’s status is censored or dead based on several independent variables. The steps we need to perform to complete this task are listed below.

  1. Data preparation
  2. Model development and evaluation

Data Preparation

Below are some initial modules we need to complete all of the tasks for this project.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

We will now load our dataset “Cancer” and drop any rows that contain NA using the .dropna() function.

df = data('cancer')
df=df.dropna()

Next, we need to separate our independent variables from our dependent variable. We will do this by making two datasets. The X dataset will contain all of our independent variables, and the y dataset will contain our dependent variable. You can check the documentation for the dataset using the code data('cancer', show_doc=True).

Before we make the y dataset we need to change the numerical values in the status variable to text. Doing this will aid in the interpretation of the results. If you look at the documentation of the dataset you will see that a 1 in the status variable means censored while a 2 means dead. We will change the 1 to censored and the 2 to dead when we make the y dataset. This involves the use of the .replace() function. The code is below.

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']

We can now proceed to model development.

Model Development and Evaluation

We will first make our train and test datasets. We will use a 70/30 split. Next, we initialize the actual random forest classifier. There are many options that can be set. For our purposes, we will set the number of trees to make to 100. Setting the random_state option is similar to setting the seed for the purpose of reproducibility. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestClassifier(n_estimators=100,random_state=1)

We can now train our model with the .fit() function and make predictions with the .predict() function. The code is below.

h.fit(x_train,y_train)
y_pred=h.predict(x_test)

We will now print two tables. The first will provide the raw results of the classification using the .crosstab() function. The classification_report() function will provide the various metrics used for determining the value of a classification model.

print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))


Our overall accuracy is about 75%. How good this is depends on the context. We are quite good at predicting when people are dead but have much more trouble predicting when people are censored.
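The cross_val_score function imported earlier can give a rough sense of how stable this accuracy estimate is; below is a sketch with an arbitrary choice of five folds.

scores = cross_val_score(h, X, y, cv=5)
print(round(scores.mean(), 2), round(scores.std(), 2))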

Conclusion

This post provided an example of using random forest in Python. Through the use of a forest of trees, it is possible to get much more accurate results compared to a single decision tree. This is one of many reasons for the use of random forest in machine learning.

Critical Thinking and Debating

Debating is a commonly used activity for developing critical thinking skills. The question that this post wants to answer is how debating develops critical thinking. This will be achieved through discussing the following…

  • Defining debate
  • Debating in the past
  • Debating today

Defining Debate

A debate is a process of defending or attacking a proposition through the use of reasoning and judgement. The goal is to go through a process of argumentation in which good reasons are shared with an audience. Good reasons are persuasive reasons that have a psychological influence on an audience. Naturally, what constitutes a good reason varies from context to context. Therefore, a good debater always keeps in mind who their audience is.

One key element of debating is what it leaves out. Technically, debating is an intellectual experience and not an emotional one. This has been lost sight of over time as debaters and public speakers have learned that emotional fanaticism is much more influential in moving the masses than deliberate thinking.

Debating in the Past

Debating was a key tool among the ancient Greeks. Aristotle provides us with at least four purposes for debating. The first purpose is that debating allows people to see both sides of an argument. As such, debating dispels bias and allows for more carefully considered decision-making. One of the characteristics of critical thinking is the ability to see both sides of an argument, or to think empathically rather than only sympathetically.

A second purpose of debating is instructing the public. Debates allow experts to take complex ideas and reduce them to simple ones for general consumption. Of course, this has been taken to extremes through sound bites and memes in the 21st century, but learning how to communicate clearly is yet another goal of critical thinking.

A third purpose of debating is to prevent fraud and injustice. Aristotle assumed that there was truth and that truth was more powerful than injustice. These are ideas that have been lost with time, as we now live in a postmodern world. However, Aristotle believed that people needed to know how to argue for truth and how to communicate it to others. Today, experiential knowledge and emotions, rather than cold truth, are the primary determiners of what is right and wrong.

A final purpose of debating is self-defense. Debating is an intellectual way of protecting oneself, just as fighting is a physical way of protecting oneself. There is an idiom in English that states that “the pen is mightier than the sword.” Often physical fighting comes only after intellectual machinations by leaders who find ways to manipulate things. A skilled debater can move millions, whereas even a strong soldier can only do a limited amount of damage alone.

Debating Today

One aspect of debating not covered above is time. Debating is a way to develop critical thinking, but it is also a way of developing real-time critical thinking. In other words, not only do you have to prepare your arguments and ideas before a debate, you also have to respond and react during the debate. This requires thinking on your feet in front of an audience while still trying to be persuasive and articulate, which is not an easy task for most people.

Debating is often a lost art, as people have turned to arguing instead. Arguing often involves emotional exchanges rather than rational thought. Some have stated that when debating disappears, so does freedom of speech. In many ways, as topics and ideas become more emotionally charged, there is greater and greater restriction on what can be said so that no one is “offended.” Perhaps Aristotle was correct in his views on debating and injustice.

Type 1 & 2 Tasks in TESOL

In the context of TESOL, the language skills of speaking, writing, listening, and reading are divided into productive and receptive skills. Productive skills are speaking and writing and involve producing language. Receptive skills are listening and reading and involve receiving language.

In this post, we will take a closer look at receptive skills, including the theory behind them as well as the use of type 1 and type 2 tasks to develop these skills.

Top Down Bottom Up

Theories that are commonly associated with receptive skills include top down and bottom up processing. With top down processing, the reader/listener is focused on the “big picture.” This means that they are focused on a general view or idea of the content they are exposed to. This requires having a large amount of knowledge and experience to draw upon in order to build understanding. Prior knowledge helps the individual know what to expect as they receive the information.

Bottom up processing is the opposite. In this approach, the reader/listener is focused on the details of the content. Another way to see this is that with bottom up processing the focus is on the trees, while with top down processing the focus is on the forest. With bottom up processing, students are focused on individual words or even individual sounds, such as when they are decoding while reading.

Type 1 & 2 Tasks

Type 1 and type 2 tasks are derived from top down and bottom up processing. Type 1 tasks involve seeing the “big picture.” Examples of this type include summarizing, searching for the main idea, making inferences, etc.

Type 1 tasks are often trickier to assess because the solutions are open-ended and open to interpretation. This involves having to assess each student’s response individually, which may not always be practical. However, type 1 tasks really help to broaden and strengthen higher-level thinking skills, which can lay a foundation for critical thinking.

Type 2 tasks involve reading a text or listening for much greater detail. Such activities as recall, grammatical correction, and single-answer questions all fall under the umbrella of type 2 tasks.

Type 2 tasks are easier to mark as they frequently have only one possible answer. The problem with this is that teachers can over-rely on them because of their convenience. Students are then trained to obsess over details rather than broad comprehension or connecting knowledge to other areas of knowledge. Opportunities for developing dynamic literacy are lost in favor of a focus on critical literacy or even decoding.

A more reasonable approach is to use a combination of type 1 and type 2 tasks. Type 1 tasks can be used to stimulate thinking without necessarily marking the responses. Type 2 tasks can be employed to teach students to focus on details, and, because of the ease with which they can be marked, they can be placed in the grade book for assessing progress.

Conclusion

This post explained various theories related to receptive skills in TESOL. There was also a look at the two broad categories into which receptive skill tasks fall. For educators, it is important to find a balance between type 1 and type 2 tasks in the classroom.

Data Exploration with R: Housing Prices


In this data exploration post, we will analyze a housing-prices dataset from Kaggle using R. Below are the packages we are going to use.

library(ggplot2)
library(readr)
library(magrittr)
library(dplyr)
library(tidyverse)
library(data.table)
library(DT)
library(GGally)
library(gridExtra)
library(ggExtra)
library(fastDummies)
library(caret)
library(glmnet)

Let’s look at our data for a second.

train <- read_csv("~/Downloads/train.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.

Data Visualization

Let’s take a look at our target variable first.

p1<-train%>%
        ggplot(aes(SalePrice))+geom_histogram(bins=10,fill='red')+labs(x="Type")+ggtitle("Global")
p1


Here is the frequency of certain values for the target variable

p2<-train%>%
        mutate(tar=as.character(SalePrice))%>%
        group_by(tar)%>%
        count()%>%
        arrange(desc(n))%>%
        head(10)%>%
        ggplot(aes(reorder(tar,n,FUN=min),n))+geom_col(fill='blue')+coord_flip()+labs(x='Target',y='freq')+ggtitle('Frequency')
p2        


Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.

nums <- unlist(lapply(train, is.numeric))
train[ , nums]%>%
        select(-Id) %>%
        ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()


There are some strong associations in the data set. Below we see the top correlations.

n1 <- 20 
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
        as_data_frame %>% 
        transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>% 
        distinct %>% 
        filter(Var1N != Var2N) %>% 
        arrange(desc(n)) %>%
        group_by(grp = as.integer(gl(n(), n1, n())))
out
## # A tibble: 703 x 4
## # Groups:   grp [36]
##    Var1N        Var2N            n   grp
##                     
##  1 GarageArea   GarageCars   0.853     1
##  2 1stFlrSF     TotalBsmtSF  0.829     1
##  3 GrLivArea    TotRmsAbvGrd 0.828     1
##  4 OverallQual  SalePrice    0.810     1
##  5 GrLivArea    SalePrice    0.731     1
##  6 GarageCars   SalePrice    0.691     1
##  7 YearBuilt    YearRemodAdd 0.684     1
##  8 BsmtFinSF1   BsmtFullBath 0.674     1
##  9 BedroomAbvGr TotRmsAbvGrd 0.668     1
## 10 FullBath     GrLivArea    0.658     1
## # ... with 693 more rows

There are about 4 correlations that are perhaps too strong.

Descriptive Statistics

Below are some basic descriptive statistics of our variables.

train_mean<-na.omit(train[ , nums]) %>% 
        select(-Id,-SalePrice) %>%
        summarise_all(funs(mean)) %>%
        gather(everything(),key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sd)) %>%
        gather(everything(),key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(median)) %>%
        gather(everything(),key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sum(.<0.001))) %>%
        gather(everything(),key='feature',value='zeros')%>%
        left_join(train_mean,by='feature')%>%
        left_join(train_median,by='feature')%>%
        left_join(train_sd,by='feature')
stat$zeropercent<-(stat$zeros/(nrow(train))*100)
stat[order(stat$zeropercent,decreasing=T),] 
## # A tibble: 36 x 6
##    feature       zeros    mean median      sd zeropercent
##                            
##  1 PoolArea       1115  2.93        0  40.2          76.4
##  2 LowQualFinSF   1104  4.57        0  41.6          75.6
##  3 3SsnPorch      1103  3.35        0  29.8          75.5
##  4 MiscVal        1087 23.4         0 166.           74.5
##  5 BsmtHalfBath   1060  0.0553      0   0.233        72.6
##  6 ScreenPorch    1026 16.1         0  57.8          70.3
##  7 BsmtFinSF2      998 44.6         0 158.           68.4
##  8 EnclosedPorch   963 21.8         0  61.3          66.0
##  9 HalfBath        700  0.382       0   0.499        47.9
## 10 BsmtFullBath    668  0.414       0   0.512        45.8
## # ... with 26 more rows

We have a lot of information stored in the output above: the mean, median, and sd of every feature in one place. Below are visuals of this information. We add 1 to the mean and sd so that features with a value of 0 survive the log scale used in the plots.

p1<-stat %>%
        ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")

p2<-stat %>%
        ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")

p3<-stat %>%
        ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")

p4<-stat %>%
        mutate(zeros=zeros/nrow(train)*100) %>%
        ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")

p5<-stat %>%
        ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
grid.arrange(p1,p2,p3,p4,p5,layout_matrix=rbind(c(1,2,3),c(4,5)))
## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)


Below we check for variables with zero variance. Such variables would cause problems if included in any model development

stat%>%
        mutate(zeros = zeros/nrow(train)*100)%>%
        filter(mean == 0 | sd == 0 | zeros==100)%>%
        DT::datatable()

There are no zero-variance features in this dataset that would need to be removed.

Correlations

Let’s look at the correlations with the SalePrice variable. The plot below is a histogram of all the correlations with the target variable.

sp_cor<-train[, nums] %>% 
select(-Id,-SalePrice) %>%
cor(train$SalePrice,method="spearman") %>%
as.tibble() %>%
rename(cor_p=V1)

stat<-stat%>%
#filter(sd>0)
bind_cols(sp_cor)

stat%>%
ggplot(aes(cor_p))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")


We have several high correlations but we already knew this previously. Below we have some code that provides visuals of the correlations

top<-stat%>%
        arrange(desc(cor_p))%>%
        head(10)%>%
        .$feature
p1<-train%>%
        select(SalePrice,one_of(top))%>%
        ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations")
p2<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation")
p3<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation")
ggMarginal(p3,type = 'histogram')
p3
grid.arrange(p1,p2,layout_matrix=rbind(c(1,2)))


The first plot shows us the top correlations. The second plot shows the relationship between the strongest predictor and our target variable. The third plot shows the trend line and the marginal histograms for the strongest predictor and our target variable.

The code below is for the categorical variables. Our primary goal is to see the proportions inside each variable. If a categorical variable lacks variance in terms of the frequencies of each category, it may need to be removed for model development purposes. Below is the code.

ig_zero<-train[, nums]%>%
        na_if(0)%>%
        select(-Id,-SalePrice)%>%
        cor(train$SalePrice,use="pairwise",method="spearman")%>%
        as.tibble()%>%
        rename(cor_s0=V1)
stat<-stat%>%
        bind_cols(ig_zero)%>%
        mutate(non_zero=nrow(train)-zeros)

char <- unlist(lapply(train, is.character))  
me<-names(train[,char])

List=list()
    for (var in train[,char]){
        wow= print(prop.table(table(var)))
        List[[length(List)+1]] = wow
    }
names(List)<-me
List

The full list of proportions is shown below.

## $MSZoning
## var
##     C (all)          FV          RH          RL          RM 
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068 
## 
## $Street
## var
##        Grvl        Pave 
## 0.004109589 0.995890411 
## 
## $Alley
## var
##      Grvl      Pave 
## 0.5494505 0.4505495 
## 
## $LotShape
## var
##         IR1         IR2         IR3         Reg 
## 0.331506849 0.028082192 0.006849315 0.633561644 
## 
## $LandContour
## var
##        Bnk        HLS        Low        Lvl 
## 0.04315068 0.03424658 0.02465753 0.89794521 
## 
## $Utilities
## var
##       AllPub       NoSeWa 
## 0.9993150685 0.0006849315 
## 
## $LotConfig
## var
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945 
## 
## $LandSlope
## var
##        Gtl        Mod        Sev 
## 0.94657534 0.04452055 0.00890411 
## 
## $Neighborhood
## var
##     Blmngtn     Blueste      BrDale     BrkSide     ClearCr     CollgCr 
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726 
##     Crawfor     Edwards     Gilbert      IDOTRR     MeadowV     Mitchel 
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644 
##       NAmes     NoRidge     NPkVill     NridgHt      NWAmes     OldTown 
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260 
##      Sawyer     SawyerW     Somerst     StoneBr       SWISU      Timber 
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397 
##     Veenker 
## 0.007534247 
## 
## $Condition1
## var
##      Artery       Feedr        Norm        PosA        PosN        RRAe 
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247 
##        RRAn        RRNe        RRNn 
## 0.017808219 0.001369863 0.003424658 
## 
## $Condition2
## var
##       Artery        Feedr         Norm         PosA         PosN 
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630 
##         RRAe         RRAn         RRNn 
## 0.0006849315 0.0006849315 0.0013698630 
## 
## $BldgType
## var
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219 
## 
## $HouseStyle
## var
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521 
##      SFoyer        SLvl 
## 0.025342466 0.044520548 
## 
## $RoofStyle
## var
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863 
## 
## $RoofMatl
## var
##      ClyTile      CompShg      Membran        Metal         Roll 
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315 
##      Tar&Grv      WdShake      WdShngl 
## 0.0075342466 0.0034246575 0.0041095890 
## 
## $Exterior1st
## var
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock 
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315 
##      CemntBd      HdBoard      ImStucc      MetalSd      Plywood 
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027 
##        Stone       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192 
## 
## $Exterior2nd
## var
##      AsbShng      AsphShn      Brk Cmn      BrkFace       CBlock 
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315 
##      CmentBd      HdBoard      ImStucc      MetalSd        Other 
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315 
##      Plywood        Stone       Stucco      VinylSd      Wd Sdng 
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068 
##      Wd Shng 
## 0.0260273973 
## 
## $MasVnrType
## var
##     BrkCmn    BrkFace       None      Stone 
## 0.01033058 0.30647383 0.59504132 0.08815427 
## 
## $ExterQual
## var
##          Ex          Fa          Gd          TA 
## 0.035616438 0.009589041 0.334246575 0.620547945 
## 
## $ExterCond
## var
##           Ex           Fa           Gd           Po           TA 
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918 
## 
## $Foundation
## var
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795 
## 
## $BsmtQual
## var
##         Ex         Fa         Gd         TA 
## 0.08503162 0.02459592 0.43429375 0.45607871 
## 
## $BsmtCond
## var
##          Fa          Gd          Po          TA 
## 0.031623331 0.045678145 0.001405481 0.921293043 
## 
## $BsmtExposure
## var
##         Av         Gd         Mn         No 
## 0.15541491 0.09423347 0.08016878 0.67018284 
## 
## $BsmtFinType1
## var
##        ALQ        BLQ        GLQ        LwQ        Rec        Unf 
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850 
## 
## $BsmtFinType2
## var
##         ALQ         BLQ         GLQ         LwQ         Rec         Unf 
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010 
## 
## $Heating
## var
##        Floor         GasA         GasW         Grav         OthW 
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630 
##         Wall 
## 0.0027397260 
## 
## $HeatingQC
## var
##           Ex           Fa           Gd           Po           TA 
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849 
## 
## $CentralAir
## var
##          N          Y 
## 0.06506849 0.93493151 
## 
## $Electrical
## var
##       FuseA       FuseF       FuseP         Mix       SBrkr 
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880 
## 
## $KitchenQual
## var
##         Ex         Fa         Gd         TA 
## 0.06849315 0.02671233 0.40136986 0.50342466 
## 
## $Functional
## var
##         Maj1         Maj2         Min1         Min2          Mod 
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726 
##          Sev          Typ 
## 0.0006849315 0.9315068493 
## 
## $FireplaceQu
## var
##         Ex         Fa         Gd         Po         TA 
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351 
## 
## $GarageType
## var
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144 
## 
## $GarageFinish
## var
##       Fin       RFn       Unf 
## 0.2552574 0.3060189 0.4387237 
## 
## $GarageQual
## var
##          Ex          Fa          Gd          Po          TA 
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905 
## 
## $GarageCond
## var
##          Ex          Fa          Gd          Po          TA 
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352 
## 
## $PavedDrive
## var
##          N          P          Y 
## 0.06164384 0.02054795 0.91780822 
## 
## $PoolQC
## var
##        Ex        Fa        Gd 
## 0.2857143 0.2857143 0.4285714 
## 
## $Fence
## var
##      GdPrv       GdWo      MnPrv       MnWw 
## 0.20996441 0.19217082 0.55871886 0.03914591 
## 
## $MiscFeature
## var
##       Gar2       Othr       Shed       TenC 
## 0.03703704 0.03703704 0.90740741 0.01851852 
## 
## $SaleType
## var
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726 
##         New         Oth          WD 
## 0.083561644 0.002054795 0.867808219 
## 
## $SaleCondition
## var
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438

You can judge for yourself which of these variables are appropriate or not.
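The caret package loaded at the start can also help flag variables, categorical or numeric, that have almost no variation; below is a sketch using its nearZeroVar() function.

nzv <- caret::nearZeroVar(train, saveMetrics = TRUE)
# show only the variables flagged as near-zero variance
nzv[nzv$nzv == TRUE, ]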

Conclusion

This post provided an example of data exploration. Through this analysis we have a better understanding of the characteristics of the dataset. This information can be used for further analysis and/or model development.

Scatterplot in LibreOffice Calc

A scatterplot is used to observe the relationship between two continuous variables. This post will explain how to make a scatterplot and calculate correlation in LibreOffice Calc.

Scatterplot

In order to make a scatterplot you need two columns of data. Below are the first few rows of the data in this example.

Var 1 Var 2
3.07 2.95
2.07 1.90
3.32 2.75
2.93 2.40
2.82 2.00
3.36 3.10
2.86 2.85

Given the nature of this dataset, there was no need to make any preparation.

To make the plot you need to select the two columns with data in them and click on Insert -> Chart, and you will see the following.

[Screenshot: the chart wizard, chart type step]

Be sure to select the XY (Scatter) option and click next. You will then see the following

[Screenshot: the chart wizard, data range step]

Be sure to select “data series in columns” and “first row as label.” Then click next and you will see the following.

[Screenshot: the chart wizard, data series step]

There is nothing to modify in this window. If you wanted, you could add more data to the plot as well as label the data, but neither of these options applies to us. Therefore, click next.

[Screenshot: the chart wizard, titles and legend step]

In this last window, you can see that we gave the chart a title and labeled the X and Y axes. We also removed the “display legend” feature by unchecking it. A legend is normally not needed when making a scatterplot. Once you add this information, click “finish” and you will see the following.

[Screenshot: the completed scatterplot]

There are many other ways to modify the scatterplot, but we will now look at how to add a trend line.

To add a trend line you need to click on the data inside the plot so that it turns green as shown below.

[Screenshot: the data points selected inside the chart]

Next, click on insert -> trend line and you will see the following

[Screenshot: the trend line dialog]

For our purposes, we want to select the “linear” option. Generally, the line is hard to see if you immediately click “ok.” Instead, we will click on the “Line” tab and adjust the settings as shown below.

[Screenshot: the Line tab settings]

All we did was simply change the color of the line to black and increase the width to 0.10. When this is done, click “ok” and you will see the following.

[Screenshot: the scatterplot with the trend line added]

The scatterplot is now complete. We will now look at how to calculate the correlation between the two variables.

Correlation

The correlation is essentially a number that captures what you see in a scatterplot. To calculate the correlation, do the following.

  1. Select the two columns of data
  2. Click on data -> statistics -> correlation and you will see the following

[Screenshot: the correlation dialog]

3. In the “Results to” field, pick a place on the spreadsheet to show the results. Click ok and you will see the following.

Correlations Column 1 Column 2
Column 1 1
Column 2 0.413450002676874 1

The output labels the variables generically as Column 1 and Column 2, so you have to rename them with the appropriate variable names. Despite this minor issue, the correlation has been calculated.
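If you prefer a formula over the dialog, Calc’s CORREL function computes the same number. The ranges below are only placeholders, so adjust them to wherever your two variables actually sit.

=CORREL(A2:A100,B2:B100)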

Conclusion

This post provided an explanation of calculating correlations and creating scatterplots in LibreOffice Calc. Data visualization is a critical aspect of communicating effectively and such tools as Calc can be used to support this endeavor.

Graphs in LibreOffice Calc

The LibreOffice Suite is a free, open-source office suite that is considered an alternative to the Microsoft Office Suite. The primary benefit of LibreOffice is that it offers features similar to Microsoft Office without having to spend any money. In addition, LibreOffice is legitimately free and is not some sort of nefarious pirated version of Microsoft Office, which means that anyone can use it without legal concerns on as many machines as they desire.

In this post, we will go over how to make plots and graphs in LibreOffice Calc, which is the equivalent of Microsoft Excel. We will learn how to make the following visualizations.

  • Bar graph
  • Histogram

Bar Graph

We are going to make a bar graph from a single column of data in LibreOffice Calc. To make a visualization you need to aggregate some data. For this post, I simply made some random data that uses a Likert scale of SD, D, N, A, SA. Below is a sample of the first five rows of the data.

Var 1
N
SD
D
SD
D

In order to aggregate the data you need to make bins and count the frequency of each category. Here is how you do this. First, make a variable called “bin” in a column and place SD, D, N, A, and SA each in their own row of that column, as shown below.

bin
SD
D
N
A
SA

In the next column, create a variable called “freq.” In each row of this column you need to use the COUNTIF function, as shown below.

=COUNTIF(1st value in data: last value in data, criteria for counting)

Below is how this looks for my data.

=COUNTIF(A2:A177,B2)

What I told LibreOffice was that my data is in A2 to A177 and that it needs to count a row if it contains the same value as B2, which for me contains SD. You repeat this process four more times, adjusting the last argument of the function each time. When I finished, this is what I had.

bin freq
SD 35
D 47
N 56
A 32
SA 5
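For reference, the four remaining formulas follow the same pattern, with only the bin cell changing (this assumes the bins SD through SA sit in cells B2 through B6 next to the data).

=COUNTIF(A2:A177,B3)
=COUNTIF(A2:A177,B4)
=COUNTIF(A2:A177,B5)
=COUNTIF(A2:A177,B6)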

We can now proceed to making the visualization.

To make the bar graph you need to first highlight the data you want to use. For us the information we want to select is the “bin” and “freq” variables we just created. Keep in mind that you never use the raw data but rather the aggregated data. Then click insert -> chart and you will see the following

[Screenshot: the chart wizard, chart type step]

Simply click next, and you will see the following

[Screenshot: the chart wizard, data range step]

Make sure that the last three options are selected or your chart will look odd. “Data series in rows” or “in columns” has to do with whether the data is read in long or wide form. “First row as label” makes sure that Calc does not insert “bin” and “freq” into the graph as data. “First column as label” helps in identifying what the values in the plot are.

Once you click next you will see the following.

[Screenshot: the chart wizard, data series step]

This window normally does not need adjusting, and trying to do so can be confusing. It does allow you to adjust the range of the data and even add more data to your chart. For now, we will click next and see the following.

[Screenshot: the chart wizard, chart elements step]

In the last window above, you can add a title and label the axes if you want. You can see that I gave my graph a name. In addition, you can decide if you want to display a legend if you look to the right. For my graph, that was not adding any additional information so I unchecked this option. When you click finish you will see the following on the spreadsheet.

[Screenshot: the completed bar graph]

Histogram

Histograms are for continuous data. Therefore, I converted my SD, D, N, A, SA values to 1, 2, 3, 4, and 5. All the other steps are the same as above. The one difference is that you want to remove the spacing between the bars. Below is how to do this.

Click on one of the bars in the bar graph until you see the green squares, as shown below.

[Screenshot: the bars selected, showing green handles]

After you do this, there should be a new toolbar at the top of the spreadsheet. You need to click on the green and blue cube icon, as shown below.

[Screenshot: the chart toolbar]

In the next window, you need to change the spacing to zero percent. This will change the bar graph into a histogram. Below is what the settings should look like.

[Screenshot: the spacing setting changed to 0%]

When you click ok you should see the final histogram shown below

[Screenshot: the final histogram]

For free software this is not too bad. There are a lot of options that were left unexplained, especially in regard to how you can manipulate the colors of everything and even make the plots 3D.

Conclusion

LibreOffice provides an alternative to paying for Microsoft products. The examples above show that Calc is capable of making visually appealing graphs just as Excel is.

Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long

df_train.columns
df_train.head()

Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
 
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function or the .shape attribute if you want to see how many variables are left.
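
For example, a quick check might look like the following. This is only a sketch using the same dataframe name as above.

df_train.head()    # preview the first rows of the remaining columns
df_train.shape     # (number of rows, number of columns) after the drop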

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe())
Out[8]: 
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT'])

1.png

round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]: 
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL'])

1.png

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables by creating dummy variables. This is so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variables.

df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)

df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)

df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)

df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)

df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)

df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)

df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)

df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)

df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)
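
If you prefer less repetition, the same dummy-variable steps can be written as a loop. This is just a sketch and assumes the same column names used above.

cat_cols=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY',
          'NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS',
          'NAME_HOUSING_TYPE','ORGANIZATION_TYPE']
for col in cat_cols:
    dummy=pd.get_dummies(df_train[col])           # dummy columns for this variable
    df_train=pd.concat([df_train,dummy],axis=1)   # add them to the dataframe
    df_train=df_train.drop([col],axis=1)          # drop the original categorical column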

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove one category (a reference level) in order for the model to work properly. Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])

1.png

There is a clear outlier there. Below is another boxplot with a different variable

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])

2

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)

1.png

The heatmap is nice, but it is hard to really appreciate what is happening. The code below sorts the absolute correlations from weakest to strongest so that we can identify and remove highly correlated variables.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005
FLAG_MOBIL FLAG_DOCUMENT_12 0.000005
Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005
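
The strongest pairs sit at the end of this ascending sort. A quick way to inspect them, filtering out the self-correlations of exactly 1, is sketched below.

high=so[so<1.0]
print(high.tail(10))   # the ten largest absolute correlations between different variables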

The list is too long to show here, but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for normality using histograms and normal probability plots.

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)

12

This is not normal

sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)

12

This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make our X and y datasets.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']

The code below fits our model and makes the predictions

clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0                
0       280873  18493
1         1813   6332
metrics.accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047
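
Because far more applicants repay than default, accuracy by itself can be flattering. As a rough check, the baseline accuracy from always predicting the majority class (TARGET of 0) can be computed as below; for this data it is roughly 0.92.

print((df_train['TARGET']==0).mean())   # accuracy of always predicting no default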

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy on the training set, though the low precision for the default class in the report above shows that accuracy does not tell the whole story. It is still impressive that we could use so few variables from such a large dataset and achieve this level of accuracy.

Conclusion

Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this, but even this simple analysis yielded reasonable results.

Classroom Conflict Resolution Strategies

Disagreements among students and even teachers are part of working at any institution. People have different perceptions and opinions of what they see and experience. With these differences often come disagreements that can lead to serious problems.

This post will look at several broad categories in which conflicts can be resolved when dealing with conflicts in the classroom. The categories are as follows.

  1. Avoiding
  2. Accommodating
  3. Forcing
  4. Compromising
  5. Problem-solving

Avoiding

The avoidance strategy involves ignoring the problem. The tension of trying to work out the difficulty is not worth the effort. The hope is that the problem will somehow go away without any form of intervention. Often the problem becomes worse.

Teachers sometimes use avoidance in dealing with conflict. One common classroom management strategy is for a teacher to deliberately ignore the poor behavior of a student in order to extinguish it. Since the student is not getting any attention from their poor behavior, they will often stop the behavior.

Accommodating

Accommodating is focused on making everyone involved in the conflict happy. The focus is on relationships and not productivity. Many who employ this strategy believe that confrontation is destructive. Actual applications of this approach involve using humor or some other tension-breaking technique during a conflict. Again, the problem is never actually solved; rather, some form of “happiness band-aid” is applied.

In the classroom, accommodation happens when teachers use humor to smooth over tense situations and when they adjust goals to ameliorate students’ complaints. Generally, the first accommodation leads to more and more accommodating until the teacher is backed into a corner.

Another use of the term accommodating is the mandate in education under the catchphrase “meeting student needs”. Teachers are expected to accommodate as much as possible within the guidelines given to them by the school. This leads to an extraordinarily large amount of work and effort on the part of the teacher.

Forcing

Force involves simply making people do something through the power you have over them. It gets things done but can lead to long-term relational problems. As people are forced, they often lose motivation and new conflicts begin to arise.

Forcing is often a default strategy for teachers. After all, the teacher is the authority over the children. However, force is highly demotivating and should be avoided if possible. If students have no voice, they can quickly become passive, which is often the opposite of active learning in the classroom.

Compromising

Compromise involves trying to develop a win-win situation for both parties. However, the reality is that compromising can often be the most frustrating strategy. To totally avoid conflict means no fighting. To be forced means to have no power. But compromise means that a person almost got what they wanted, but not exactly, which can be even more annoying.

Depending on the age group a teacher is working with, compromising can be difficult to achieve. Younger children often lack the skills to see alternative solutions and halfway points of agreement. Older kids may also view compromising as accommodating, which can lead to perceptions of weakness on the teacher’s part when conflict arises. Therefore, compromise is a strategy best used with care.

Problem-Solving

Problem-solving is similar to compromising except that both parties are satisfied with the decision and the problem is actually solved, at least temporarily. This takes a great deal of trust and communication between the parties involved.

For this to work in the classroom, a teacher must de-emphasize their position of authority in order to work with the students. This is counterintuitive for most teachers and even for many students. It is also necessary to develop strong listening and communication skills so that both parties can offer ways of dealing with the conflict. As with compromise, problem-solving is better reserved for older students.

Conclusion

Teachers need to know what their options are when it comes to addressing conflict. This post provided several ideas for navigating disagreements and setbacks in the classroom.

Signs a Student is Lying

Deception is a common tool students use when trying to avoid discipline or some other uncomfortable situation with a teacher. However, there are some indicators you can watch for to help you determine if a student may be lying to you. The tips are as follows

  • Determine what is normal
  • Examine how they play with their clothing
  • Watch personal space
  • Tone of voice
  • Movement

Determine What is Normal

People are all individuals and thus unique. Therefore, determining deception first requires determining what is normal for the student. This involves some observation and getting to know the student. These are natural parts of teaching.

However, if you are in an administrative position and do not know the student that well, it will be much harder to determine what is normal for the student so that it can be compared to their behavior when you believe they are lying. One solution to this challenge is to first engage in small talk with the student so you can establish what appears to be natural behavior for them.

Clothing Signs

One common sign that someone is lying is that they begin to play with their clothing. This can include tugging on clothes, closing buttons, pulling down on sleeves, or rubbing a spot. What counts as a sign depends on what is considered normal for the individual.

Personal Space

When people pull away while talking, it is often a sign of dishonesty. This can be done through such actions as shifting one’s chair or leaning back. Other individuals will fold their arms across their chest. All of these behaviors are subconscious ways of trying to protect oneself.

Voice

The voice provides several clues of deception. Often the rate or speed of speaking slows down. Deceptive answers are often much longer and more detailed than honest ones. Liars often show hesitations and pauses that are out of the ordinary for them.

A change in pitch is perhaps the strongest sign of lying. Students will often speak with a much higher pitch when lying. This is perhaps due to the nervousness they are experiencing.

Movement

Liars have a habit of covering their mouth when speaking. Gestures also tend to become more muted and held lower when a student is lying. Another common cue is gesturing with the palms up rather than down when speaking. Additional signs include nervous tapping of the feet.

Conclusion

People lie for many reasons. Due to this, it is important that a teacher is able to determine the honesty of a student when necessary. The tips in this post provide some basic ways of potentially identifying who is being truthful.

Barriers to Teachers Listening

Few of us want to admit it, but all teachers have had problems at one time or another listening to their students. There are many reasons for this, but in this post we will look at the following barriers to listening that teachers may face.

  1. Inability to focus
  2. Difference in speaking and listening speed
  3. Willingness
  4. Detours
  5. Noise
  6. Debate

Inability to Focus

Sometimes a teacher or even a student may not be able to focus on the discussion or conversation. This could be due to a lack of motivation or desire to pay attention. Listening can be taxing mental work. Therefore, the teacher must be engaged and have some desire to try to understand what is happening.

Differences in the Speed of Speaking and Listening

We speak much slower than we think. Some have put the estimate that we speak at 1/4 the speed at which we can think. What this means is that if you can think 100 words per minute you can speak at only 25 words per minute. With thinking being 4 times faster than speaking this leaves a lot of mental energy lying around unused which can lead to daydreaming.

This difference can lead to impatience and to anticipation of what the person is going to say. Neither of these are beneficial because they discourage listening.

Willingness

There are times, rightfully so, that a teacher does not want to listen. This can be when a student is not cooperating or giving an unjustified excuse for their actions. The main point here is that a teacher needs to be aware of their unwillingness to listen. Is it justified or is it unjustified? This is the question to ask.

Detours

Detours happen when we respond to a specific point or comment by the student in a way that changes the subject. This barrier is tricky because you are actually paying attention but are allowing the conversation to wander from its original purpose. Wandering conversation is natural and often happens when we are enjoying the talk.

Preventing this requires the mental discipline to stay on topic and not lose sight of what you are listening for. This is not easy but is necessary at times.

Noise

Noise can be external or internal. External noise involves factors beyond our control. For example, if there is a lot of noise in the classroom, it may be hard to hear a student speak. A soft-spoken student in a loud place is frustrating to try to listen to even when there is a willingness to do so.

Internal noise has to do with what is happening inside your own mind. If you are tired, sick, or feeling rushed due to a lack of time, these can all affect your ability to listen to others.

Debate

Sometimes we listen only until we want to jump in to defend a point or disagree with something. This is not so much listening as it is hunting, waiting to pounce on the slightest misstep of logic from the person we are supposed to listen to.

It is critical to show restraint and focus on allowing the other side to be heard rather than interrupting them.

Conclusion

We often view teachers as communicators. However, half the job of a communicator is to listen. At times, due to their position and the need to be the talker, a teacher may neglect the need to be a listener. The barriers explained here should help teachers become aware of why this happens.

Principles of Management and the Classroom

Henri Fayol (1841-1925) had a major impact on managerial communication with his development of 14 principles of management. In this post, we will look at these principles briefly and see how at least some of them can be applied by educators in the classroom.

Below is a list of the 14 principles of management by Fayol

  1. Division of work
  2. Authority
  3. Discipline
  4. Unity of command
  5. Unity of direction
  6. Subordination of individual interest
  7. Remuneration
  8. The degree of centralization
  9. Scalar chain
  10. Order
  11. Equity
  12. Stability of personnel
  13. Initiative
  14. Esprit de corps

Division of Work & Authority

Division of work has to do with breaking work into small parts with each worker having responsibility for one aspect of the work. In the classroom, this would apply to group projects in which collaboration is required to complete a task.

Authority is the power to give orders and commands. The source of the authority cannot come from the position alone. The leader must demonstrate expertise and competency in order to lead. For the classroom, it is a well-known tenet of education that the teacher must demonstrate expertise in their subject matter and knowledge of teaching.

Discipline & Unity of command

Discipline has to do with obedience. The workers should obey the leader. In the classroom this relates to concepts found in classroom management. The teacher must put in place mechanisms to ensure that the students follow directions.

Unity of command means that there should only be directions given from one leader to the workers. This is the default setting in some schools until about junior high or high school. At that point, students have several teachers at once. However, generally it is one teacher per classroom even if the students have several teachers.

Unity of Direction & Subordination of Individual Interests

The employees’ activities must all be linked to the same objectives. This ensures everyone is going in the same direction. In the classroom, this relates to the idea of goals and objectives in teaching. The curriculum needs to be aligned, with students all going in the same direction. A major difference here is that the activities used to achieve the learning goals may vary from student to student.

Subordination of individual interests entails putting the organization ahead of personal goals. This is where there may be a break between managerial and educational practices. Currently, education in many parts of the world is highly focused on the students’ interests at the expense of what may be most efficient and beneficial to the institution.

Remuneration & Degree of Centralization

Remuneration has to do with compensation, which can be monetary or non-monetary. Monetary compensation needs to be high enough to provide some motivation to work. Non-monetary compensation can include recognition, honor, or privileges. In education, non-monetary compensation is standard in the form of grades, compliments, privileges, recognition, etc. Whatever is done usually contributes to intrinsic or extrinsic motivation.

Centralization has to do with who makes decisions. A highly centralized institution has top-down decision-making, while a decentralized institution has decisions coming from many directions. Generally, in the classroom setting, decisions are made by the teacher. Students may be given autonomy over how to approach assignments or which assignments to do, but the major decisions are made by the teacher even in highly decentralized classrooms due to the students’ inexperience and lack of maturity.

Scalar Chain & Order

Scalar chain has to do with recognizing the chain of command. The employee should contact the immediate supervisor when there is a problem. This prevents too many people from going to the same person. In education, this is enforced by default as the only authority in a classroom is usually the teacher.

Order deals with having the resources to get the job done. In the classroom, there are many things the teacher can supply such as books, paper, pencils, etc. and even social needs such as attention and encouragement. However, sometimes there are physical needs that are neglected such as kids who miss breakfast and come to school hungry.

Equity & Stability of Personnel

Equity means workers are treated fairly. This principle again relates to classroom management and even assessment. Students need to know that the process for discipline is fair, even if it is disliked, and that there is adequate preparation for assessments such as quizzes and tests.

Stability of personnel means keeping turnover to a minimum. In education, schools generally prefer to keep teachers long term if possible. Leaving during the middle of a school year, whether a student or a teacher, is discouraged as it is disruptive.

Initiative & Esprit de Corps

Initiative means allowing workers to contribute new ideas and do things. This empowers workers and adds value to the company. In education, this also relates to classroom management in that students need to be able to share their opinion freely during discussions and also when they have concerns about what is happening in the classroom.

Esprit de corps focuses on morale. Workers need to feel good and appreciated. The classroom learning environment is a topic that is frequently studied in education. Students need to have their psychological needs met by having a place to study that is safe and friendly.

Conclusion

These 14 principles come from the business world, but they also have a strong influence in the world of education. Teachers can pull from these principles any ideas that may be useful in their classroom.

Hierarchical Regression in R

In this post, we will learn how to conduct a hierarchical regression analysis in R. Hierarchical regression analysis is used in situations in which you want to see if adding additional variables to your model will significantly change the r-square when accounting for the other variables in the model. This approach is a model comparison approach and not necessarily a statistical one.

We are going to use the “Carseats” dataset from the ISLR package. Our goal will be to predict total sales using the following independent variables in four different models.

model 1 = intercept only
model 2 = Sales~Urban + US + ShelveLoc
model 3 = Sales~Urban + US + ShelveLoc + Price + Income
model 4 = Sales~Urban + US + ShelveLoc + Price + Income + Advertising

Often the primary goal with hierarchical regression is to show that the addition of a new variable builds or improves upon a previous model in a statistically significant way. For example, if a previous model was able to predict the total sales of an object using three variables you may want to see if a new additional variable you have in mind may improve model performance. Another way to see this is in the following research question

Is a model that explains the total sales of an object with Urban location, US location, shelf location, price, income and advertising cost as independent variables superior in terms of R2 compared to a model that explains total sales with Urban location, US location, shelf location, price and income as independent variables?

In this complex research question, we essentially want to know if adding advertising cost will improve the model significantly in terms of the r-square. The formal steps that we will follow to complete this analysis are as follows.

  1. Build sequential (nested) regression models by adding variables at each step.
  2. Run ANOVAs in order to compute the R2
  3. Compute difference in sum of squares for each step
    1. Check F-statistics and p-values for the SS differences.
  4. Compare sum of squares between models from ANOVA results.
  5. Compute increase in R2 from sum of square difference
  6. Run regression to obtain the coefficients for each independent variable.

We will now begin our analysis. Below is some initial code

library(ISLR)
data("Carseats")

Model Development

We now need to create our models. Model 1 will not have any variables in it and will be created for the purpose of obtaining the total sum of squares. Model 2 will include demographic variables. Model 3 will contain the initial model with the continuous independent variables. Lastly, model 4 will contain all the information of the previous models with the addition of the continuous independent variable of advertising cost. Below is the code.

model1 = lm(Sales~1,Carseats)
model2=lm(Sales~Urban + US + ShelveLoc,Carseats)
model3=lm(Sales~Urban + US + ShelveLoc + Price + Income,Carseats)
model4=lm(Sales~Urban + US + ShelveLoc + Price + Income + Advertising,Carseats)

We can now turn to the ANOVA analysis for model comparison.

ANOVA Calculation

We will use the anova() function to calculate the total sum of squares for model 1. This will serve as a baseline for the other models when calculating the r-square.

anova(model1,model2,model3,model4)
## Analysis of Variance Table
## 
## Model 1: Sales ~ 1
## Model 2: Sales ~ Urban + US + ShelveLoc
## Model 3: Sales ~ Urban + US + ShelveLoc + Price + Income
## Model 4: Sales ~ Urban + US + ShelveLoc + Price + Income + Advertising
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1    399 3182.3                                   
## 2    395 2105.4  4   1076.89  89.165 < 2.2e-16 ***
## 3    393 1299.6  2    805.83 133.443 < 2.2e-16 ***
## 4    392 1183.6  1    115.96  38.406 1.456e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For now, we are only focusing on the residual sum of squares. Here is a basic summary of what we know as we compare the models.

model 1 = sum of squares = 3182.3
model 2 = sum of squares = 2105.4 (with demographic variables of Urban, US, and ShelveLoc)
model 3 = sum of squares = 1299.6 (add price and income)
model 4 = sum of squares = 1183.6 (add Advertising)

Each model is statistically significant, which means that each set of added variables led to some improvement.

By adding price and income to the model, we were able to improve the model in a statistically significant way. The r-square increased by about .25; below is how this was calculated.

2105.4-1299.6 #SS of Model 2 - Model 3
## [1] 805.8
805.8/ 3182.3 #SS difference of Model 2 and Model 3 divided by total sum of squares, i.e. model 1
## [1] 0.2532131

When we add Advertising to the model, the r-square increases by roughly .04. The calculation is below.

1299.6-1183.6 #SS of Model 3 - Model 4
## [1] 116
116/ 3182.3 #SS difference of Model 3 and Model 4 divided by total sum of squares, i.e. model 1
## [1] 0.03645162

Coefficients and R Square

We will now look at a summary of each model using the summary() function.

summary(model2)
## 
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc, data = Carseats)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.713 -1.634 -0.019  1.738  5.823 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.8966     0.3398  14.411  < 2e-16 ***
## UrbanYes          0.0999     0.2543   0.393   0.6947    
## USYes             0.8506     0.2424   3.510   0.0005 ***
## ShelveLocGood     4.6400     0.3453  13.438  < 2e-16 ***
## ShelveLocMedium   1.8168     0.2834   6.410 4.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 395 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3317 
## F-statistic: 50.51 on 4 and 395 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income, 
##     data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9096 -1.2405 -0.0384  1.2754  4.7041 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.280690   0.561822  18.299  < 2e-16 ***
## UrbanYes         0.219106   0.200627   1.092    0.275    
## USYes            0.928980   0.191956   4.840 1.87e-06 ***
## ShelveLocGood    4.911033   0.272685  18.010  < 2e-16 ***
## ShelveLocMedium  1.974874   0.223807   8.824  < 2e-16 ***
## Price           -0.057059   0.003868 -14.752  < 2e-16 ***
## Income           0.013753   0.003282   4.190 3.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.818 on 393 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.5854 
## F-statistic: 94.89 on 6 and 393 DF,  p-value: < 2.2e-16
summary(model4)
## 
## Call:
## lm(formula = Sales ~ Urban + US + ShelveLoc + Price + Income + 
##     Advertising, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2199 -1.1703  0.0225  1.0826  4.1124 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.299180   0.536862  19.184  < 2e-16 ***
## UrbanYes         0.198846   0.191739   1.037    0.300    
## USYes           -0.128868   0.250564  -0.514    0.607    
## ShelveLocGood    4.859041   0.260701  18.638  < 2e-16 ***
## ShelveLocMedium  1.906622   0.214144   8.903  < 2e-16 ***
## Price           -0.057163   0.003696 -15.467  < 2e-16 ***
## Income           0.013750   0.003136   4.384 1.50e-05 ***
## Advertising      0.111351   0.017968   6.197 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.738 on 392 degrees of freedom
## Multiple R-squared:  0.6281, Adjusted R-squared:  0.6214 
## F-statistic: 94.56 on 7 and 392 DF,  p-value: < 2.2e-16

You can see for yourself the change in the r-square. From model 2 to model 3 there is about a 25-point increase in r-square, just as we calculated manually. From model 3 to model 4 there is roughly a 4-point increase. The purpose of the anova() analysis was to determine whether the change meets a statistical criterion; the lm() function reports the change but not its significance.

Conclusion

Hierarchical regression is just another potential tool for the statistical researcher. It provides you with a way to develop several models and compare the results based on any potential improvement in the r square.

RANSAC Regression in Python

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

  1. The algorithm randomly selects a subset of the samples and treats them as inliers.
  2. A model is fit to these samples, and every other sample that falls within a certain tolerance of the fitted model is relabeled as an inlier.
  3. The model is refitted with the new set of inliers.
  4. The error of the fitted model versus the inliers is calculated.
  5. Terminate, or go back to step 1 if the criterion for iterations or performance has not been met.

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

  1. Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
  2. Model 2 will use multiple regression and  includes several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

  1. Data preparation
  2. Simple Regression Model fit
  3. Simple regression visualization
  4. Multiple regression model fit
  5. Multiple regression visualization

Below are the packages we will need for this example

import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preparation

For the data preparation, we need to do the following

  1. Load the data
  2. Create X and y dataframes
  3. Convert several categorical variables to dummy variables
  4. Drop the original categorical variables from the X dataframe

Below is the code for these steps

df=data('tips')
X,y=df[['total_bill','sex','size','smoker','time']],df['tip']
male=pd.get_dummies(X['sex'])
X['male']=male['Male']
smoker=pd.get_dummies(X['smoker'])
X['smoker']=smoker['Yes']
dinner=pd.get_dummies(X['time'])
X['dinner']=dinner['Dinner']
X=X.drop(['sex','time'],axis=1)

Most of this is self-explanatory, we first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we converted the sex, smoker, and dinner variables into dummy variables, and then we dropped the original categorical variables.

We can now move to fitting the first model that uses simple regression.

Simple Regression Model

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

  1. Instantiate an instance of the RANSACRegressor. We pass in the LinearRegression estimator and set residual_threshold to 2, which means any sample more than 2 units away from the fitted line is treated as an outlier.
  2. Next we fit the model
  3. We predict the values
  4. We calculate the r-square and the mean absolute error

Below is the code for all of this.

ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg1.fit(X[['total_bill']],y)
prediction1=ransacReg1.predict(X[['total_bill']])
r2_score(y,prediction1)
Out[150]: 0.4381748268686979

mean_absolute_error(y,prediction1)
Out[151]: 0.7552429811944833

The r-square is about 44% while the MAE is 0.75. These values are mainly useful for comparison and will be looked at again when we create the multiple regression model.

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression line. It also identifies which samples are inliers and outliers. The code will not be explained line by line because of its complexity.

inlier=ransacReg1.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(3,51,2)
line_y=ransacReg1.predict(line_X[:,np.newaxis])
plt.scatter(X[['total_bill']][inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(X[['total_bill']][outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')

1

The plot is largely self-explanatory: a handful of samples were considered outliers. We will now move to creating our multiple regression model.

Multiple Regression Model Development

The steps for making the model are mostly the same. The real difference is in making the plot, which we will discuss in a moment. Below is the code for developing the model.

ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg2.fit(X,y)
prediction2=ransacReg2.predict(X)
r2_score(y,prediction2)
Out[154]: 0.4298703800652126

mean_absolute_error(y,prediction2)
Out[155]: 0.7649733201032204

Things have actually gotten slightly worse in terms of r-square and MAE.

For the visualization, we cannot directly plot several variables at once. Therefore, we will compare the predicted values with the actual values. The stronger the correlation, the better our predictions are. Below is the code for the visualization.

inlier=ransacReg2.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(1,8,1)
line_y=(line_X[:,np.newaxis])
plt.scatter(prediction2[inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(prediction2[outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')

1

The plots are mostly the same, as you can see for yourself.

Conclusion

This post provided an example of how to use the RANSAC regression algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model; generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers automatically, which can be a problem because outlier removal is often a subjective judgment. Despite these limitations, RANSAC regression is another tool that can be used in machine learning.

Teaching English

Teaching English, or any other subject, requires that the teacher be able to walk into the classroom and find ways to have an immediate impact. This is much easier said than done. In this post, we look at several ways to increase the likelihood of being able to help students.

Address Needs

People’s reasons for learning a language such as English can vary tremendously. Knowing this, it is critical that you as a teacher know what they need in their learning. This allows you to adjust the methods and techniques that you use to help them learn.

For example, some students may study English for academic purposes while others are just looking to develop communication skills. Some students may be trying to pass a proficiency exam in order to study at university or in graduate school.

How you teach these different groups will be different. The academic students want academic English and language skills. Therefore, if you plan to play games and other fun activities in the classroom, there may be some frustration because these students will not see how this helps them.

On the other hand, for students who just want to learn to converse in English, if you smother them with heavy readings and academic-style work they will also become frustrated by how “rigorous” the course is. This is why you must know what the goals of the students are and make the needed changes where possible.

Stay Focused

When dealing with students, it is tempting to answer and follow up on every question that they have. However, this can quickly lead to a loss of direction as the class goes here, there, and everywhere to answer every nuanced question.

Even though the teacher needs to know what the students want help with the teacher is also the expert and needs to place limits over how far they will go in terms of addressing questions and needs. Everything cannot be accommodated no matter how hard one tries.

As the teacher, the things that limit your ability to explore the questions and concerns of students include time, resources, your own expertise, and the importance of the question or concern. Of course, we help students, but not to the detriment of the larger group.

Providing a sense of direction is critical as a teacher. The students have their needs, but it is your goal to lead them to the answers. This requires knowing what you want and being able to get there. There are a lot of experts out there who cannot lead a group of students to the knowledge they need, as this requires communication skills and an ability to see the forest for the trees.

Conclusion

Teaching is a mysterious profession as so many things happen that cannot be seen or measured but clearly have an effect on the classroom. Despite the confusion it never hurts to determine where the students want to go and to find a way to get them there academically.