Intro to Animation with D3.js

This post provides an introduction to animation using D3.js. Animation simply changes the properties of a visual object over time. This can help the viewer of a web page understand key features of the data.

In this post we will do the following.

  • Create a simple animation
  • Animate multiple properties
  • Create chained transitions
  • Handle transitions

Create a Simple Animation

What is new for us in terms of d3.js code is the use of the .transition() and .duration() methods. The .transition() method provides instructions on how to change a visual attribute over time. The .duration() method sets how long the transition takes, in milliseconds.

In the code below, we create a simple black rectangle that turns white and so disappears from the screen. This is done by appending an svg that contains a black rectangle to the body element and then having that rectangle turn white.

[Image: code]

[Image: resulting animation]
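As a rough sketch of the steps just described, the animation might be written as follows. The sizes, the timing, and the names fadeSettings and drawFadingRect are assumptions made for illustration, not the original listing, and the rectangle only "disappears" if the page background is white.

```javascript
// Hypothetical sizes and timing; the original listing's values are not known.
const fadeSettings = { svgSize: 300, rectSize: 100, durationMs: 3000 };

// Appends an svg with a black rectangle to <body>, then fades the fill to white.
// `d3lib` is the D3 object (the global `d3` in a browser).
function drawFadingRect(d3lib, opts) {
  const svg = d3lib.select("body").append("svg")
      .attr("width", opts.svgSize)
      .attr("height", opts.svgSize);

  return svg.append("rect")
      .attr("x", 0)
      .attr("y", 0)
      .attr("width", opts.rectSize)
      .attr("height", opts.rectSize)
      .attr("fill", "black")
    .transition()                 // everything after this call is animated
      .duration(opts.durationMs)  // length of the animation in milliseconds
      .attr("fill", "white");     // end state of the animated attribute
}

// Only run when D3 and a page are actually available (e.g. in a browser).
if (typeof d3 !== "undefined" && typeof document !== "undefined") {
  drawFadingRect(d3, fadeSettings);
}
```

Everything before .transition() sets the starting state immediately, and everything after it describes the state the rectangle should reach by the end of the duration.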

This is interesting but far from amazing. We simply changed the color, that is, we animated one property. Next, we learn how to animate more than one property at a time.

Animating Multiple Properties at Once

You are not limited to animating only one property. In the code below we change the color and have the rectangle move as well. This is done by adding the x and y coordinates to the second .attr({}) method. The code and the animation are below.

[Images: code and resulting animation]
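One possible way to animate several attributes in the same transition is sketched below. The end coordinates and the names moveAndFade and animateAttributes are assumptions. The sketch calls .attr(name, value) once per attribute because the object form .attr({}) mentioned above is specific to older D3 releases (v3).

```javascript
// Hypothetical end state for the sketch; the original coordinates are not known.
const moveAndFade = { x: 200, y: 200, fill: "white", durationMs: 3000 };

// Animates several attributes of an existing selection in one transition.
// Every entry other than durationMs is passed through .attr(name, value).
function animateAttributes(selection, target) {
  let t = selection.transition().duration(target.durationMs);
  Object.keys(target)
    .filter(name => name !== "durationMs")
    .forEach(name => { t = t.attr(name, target[name]); });
  return t;
}

// Only run when D3 and a page are actually available (e.g. in a browser).
if (typeof d3 !== "undefined" && typeof document !== "undefined") {
  const rect = d3.select("body").append("svg")
      .attr("width", 300).attr("height", 300)
    .append("rect")
      .attr("width", 100).attr("height", 100)
      .attr("fill", "black");
  animateAttributes(rect, moveAndFade);
}
```

Because all three attributes belong to the same transition, the movement and the color change start and finish together.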

You can see how the rectangle moves from the top left to the bottom right while also changing color from black to white, thus disappearing. Next, we will look at chained transitions.

Chained Transitions

Chained transitions involve having one animation take place, followed by a delay and then another animation. In order to do this you need to use the .delay() method. This method tells the browser to wait a specified number of milliseconds before doing something else.

In our example, the rectangle travels diagonally down while disappearing, only to suddenly travel back up while changing back to black. Below is the code followed by the animation.

[Images: code and resulting animation]
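A minimal sketch of a chained transition is shown below, assuming D3 v4 or later, where the delay of a chained transition is counted from the end of the previous one. The step values and the names chainSteps and chainTransitions are hypothetical.

```javascript
// Hypothetical timings (milliseconds) and positions for the sketch.
const chainSteps = [
  { delayMs: 0,    durationMs: 2000, x: 200, y: 200, fill: "white" },
  { delayMs: 1000, durationMs: 2000, x: 0,   y: 0,   fill: "black" }
];

// Queues one transition per step. Calling .transition() on a transition
// schedules the next one to start after the previous one ends.
function chainTransitions(selection, steps) {
  return steps.reduce(
    (current, step) => current.transition()
        .delay(step.delayMs)
        .duration(step.durationMs)
        .attr("x", step.x)
        .attr("y", step.y)
        .attr("fill", step.fill),
    selection
  );
}
```

In a browser this could be used as chainTransitions(d3.select("rect"), chainSteps), where the selected rect already exists on the page.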

By now you are starting to see that the only limit to animation in d3.js is your imagination.

Handling Transitions

The beginning and end of a transition can be handled with the .each() method. This is useful when you want to control the style of the element at the beginning and/or end of a transition.

In the code below, you will see the rectangle go from red, to green, to orange, to black, and then to gray. At the same time the rectangle will move and change size. Notice carefully that the changes from red to green and from black to gray are controlled by .each() methods.

[Images: code and resulting animation]
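The sketch below shows one way to attach code to the start and end of a transition. It is a simplified version of the idea, not the original listing: the colors, positions, and the names onTransitionEvent and animateWithHooks are assumptions. The .each(type, listener) form described above belongs to D3 v3, while D3 v4 and later register the same listeners with .on(type, listener), so the helper uses whichever is available.

```javascript
// D3 v3 transitions register listeners with .each(type, fn), as described in
// this post; D3 v4 and later use .on(type, fn). Pick whichever exists.
function onTransitionEvent(transition, type, listener) {
  return typeof transition.on === "function"
    ? transition.on(type, listener)
    : transition.each(type, listener);
}

// `d3lib` is the D3 object; `rect` is a selection holding the rectangle.
function animateWithHooks(d3lib, rect) {
  const t = rect.transition().duration(3000);
  onTransitionEvent(t, "start", function () {
    d3lib.select(this).attr("fill", "green");  // runs when the transition begins
  });
  onTransitionEvent(t, "end", function () {
    d3lib.select(this).attr("fill", "gray");   // runs when the transition finishes
  });
  // The animated part: fade to black while moving and shrinking.
  return t.attr("fill", "black").attr("x", 150).attr("width", 50);
}
```

The listeners are ordinary functions rather than arrow functions so that this refers to the element being animated.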

Conclusion

Animation should not be used only for entertainment. When developing visualizations, an animation should provide additional understanding of the content that you are trying to present. This is important to remember so that d3.js does not suffer the same fate as PowerPoint, in that people focus more on the visual effects than on the content.


Adding labels to Graphs D3.js

In this post, we will look at how to add the following to a bar graph using d3.js.

  • Labels
  • Margins
  • Axes

Before we begin, you need the initial code that has a bar graph already created. This is shown below, followed by what it should look like before we make any changes.

[Image: code]

[Image: initial bar graph]

The first change is in lines 16-19. Here, we change the name of the variable and modify the type of element it creates.

[Image: code]

Our next change begins at line 27 and continues until line 38. Here we make two changes. First, we make a variable called barGroup, which selects all the group elements of the variable g. We also use the data, enter, append and attr methods. Starting in line 33 and continuing until line 38 we use the append method on our new variable barGroup to add rect elements as well as the color and size of each bar. Below is the code.

[Image: code]

The last step for adding text appears in lines 42-50. First, we make a variable called textTranslator to move our text. Then we append the text to the barGroup variable. The color, font type, and font size are all set in the code below, followed by a visual of what our graph looks like now.

[Images: code and updated graph]
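A simplified sketch of the grouping idea is shown below: each data point gets its own group element, and the bar and its label are both appended to that group so they share one position. The data, sizes, colors, and the names groupTranslate and drawLabelledBars are assumptions; the label is nudged with a plain y attribute instead of the textTranslator variable used in the post.

```javascript
// Hypothetical data and sizes for the sketch.
const labelData = [20, 40, 25];
const labelBarWidth = 30, labelBarPadding = 5, labelMaxValue = 50;

// Position of each bar group: bars sit side by side and start lower for
// smaller values so that all bar bottoms line up at y = labelMaxValue.
function groupTranslate(d, i) {
  const x = i * (labelBarWidth + labelBarPadding);
  const y = labelMaxValue - d;
  return "translate(" + x + "," + y + ")";
}

function drawLabelledBars(d3lib) {
  const g = d3lib.select("body").append("svg")
      .attr("width", 200).attr("height", 100)
    .append("g");

  // One <g> per data point; the rect and its text share the group's position.
  const barGroup = g.selectAll("g").data(labelData).enter().append("g")
      .attr("transform", groupTranslate);

  barGroup.append("rect")
      .attr("width", labelBarWidth)
      .attr("height", d => d)
      .attr("fill", "steelblue");

  barGroup.append("text")
      .text(d => d)
      .attr("y", 12)                        // nudge the label down into the bar
      .attr("fill", "white")
      .style("font-family", "sans-serif")
      .style("font-size", "10px");
}
```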

Margin

Margins serve to provide spacing in a graph. This is especially useful if you want to add axes. The changes in the code take place in lines 16-39 and include an extensive reworking of the code. In lines 16-20 we create several variables that are used for calculating the margins and the size and shape of the svg element. In lines 22-30 we set the attributes for the svg variable. In lines 32-34 we add a group element to hold the main parts of the graph. Lastly, in lines 36-40 we add a gray background for effect. Below is the code followed by our new graph.

[Images: code and updated graph]
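The margin setup described above can be sketched roughly as follows. All numbers and the names margin, innerSize, and drawFrame are assumptions chosen for illustration.

```javascript
// Hypothetical margin and size values for the sketch.
const margin = { top: 20, right: 20, bottom: 30, left: 40 };
const totalWidth = 400, totalHeight = 300;

// The drawable area is whatever is left after removing the margins.
function innerSize(total, m) {
  return {
    width: total.width - m.left - m.right,
    height: total.height - m.top - m.bottom
  };
}

function drawFrame(d3lib) {
  const inner = innerSize({ width: totalWidth, height: totalHeight }, margin);
  const svg = d3lib.select("body").append("svg")
      .attr("width", totalWidth)
      .attr("height", totalHeight);

  // Group shifted by the left and top margins; the graph is drawn inside it.
  const mainGroup = svg.append("g")
      .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  // Gray background covering only the inner drawing area.
  mainGroup.append("rect")
      .attr("width", inner.width)
      .attr("height", inner.height)
      .attr("fill", "lightgray");
  return mainGroup;
}
```

Keeping the margins in one object means the axis can later be placed in the left margin without changing the code that draws the bars.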

Axes

In order for this to work, we have to change the value for the variable maxValue to 150. This gives a little more space at the top of the graph. The code for the axis goes from line 74 to line 98.

  • In lines 74-77 we create variables to set up the axis so that it is on the left
  • In lines 78-85 we create two more variables that set the scale and the orientation of the axis
  • Lines 87-99 set the visual characteristics of the axis.

Below is the code followed by the updated graph.

[Images: code and updated graph]

You can see the scale off to the left as planned.
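A rough sketch of the scale and the left-oriented axis is given below. The pixel height and the function names are assumptions. Because the way scales and axes are created changed between D3 v3 and D3 v4, the sketch checks which form the loaded library offers.

```javascript
// maxValue is 150 as in the post; the pixel height is a hypothetical choice.
const axisMaxValue = 150;
const axisHeight = 150;

// Builds a linear scale that maps data values to pixel positions, flipped so
// that larger values sit higher on the screen (SVG y grows downward).
function makeYScale(d3lib) {
  const linear = d3lib.scaleLinear ? d3lib.scaleLinear() : d3lib.scale.linear();
  return linear.domain([0, axisMaxValue]).range([axisHeight, 0]);
}

// Builds a left-oriented axis for that scale.
function makeLeftAxis(d3lib, scale) {
  return d3lib.axisLeft
    ? d3lib.axisLeft(scale)                          // D3 v4 and later
    : d3lib.svg.axis().scale(scale).orient("left");  // D3 v3
}

// Appends a group to `mainGroup` and lets the axis draw itself into it.
function drawAxis(d3lib, mainGroup) {
  const axis = makeLeftAxis(d3lib, makeYScale(d3lib));
  return mainGroup.append("g").call(axis);
}
```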

Conclusion

Making bar graphs is a basic task for d3.js. Although the code can seem cumbersome to people who do not use JavaScript, the ability to design visuals like this often outweighs the challenges.

Defining Terms in Debates

Defining terms in debates is an important part of the process that can be tricky at times. In this post, we will look at three criteria to consider when dealing with terms in debates. Below are the three criteria.

  • When to define
  • What to define
  • How to define

When to Define

Definitions are almost always given at the beginning of the debate. This is because doing so helps to set limits on what is discussed. It also makes it clear what the issue and potential propositions are.

Some debates focus exclusively on defining terms. This is common with highly controversial topics such as abortion and non-traditional marriage, where the focus is often on such definitions as when life begins or what marriage is. Defining terms helps to remove the fuzziness of the controversy and to focus on the exchange of ideas.

What to Define

It is not always clear what needs to be defined when starting a debate. Consider the following proposition of value.

Resolved: That playing video games is detrimental to the development of children

Here are just a few things that may need to be defined.

  • Video games: Does this refer to online, mobile, or console games? What about violent vs non-violent? Do educational games also fall into this category as well?
  • Development: What kind of development? Is this referring to emotional, physical, social, or some other form of development?
  • Children: Is this referring only to small children (0-6), young children (7-12) or teenagers?

These are just some of the questions to consider when trying to determine what to define. Again, this is important because the affirmative may be arguing that video games are bad for small children but not for teenagers, while the negative may be preparing to debate the opposite.

How to Define

There are several ways to define a term. Below are just a few examples of how to do this.

Common Usage

Common usage is the everyday meaning of the term. For example,

We define children as individuals who are under the age of 18

This is clear and simple.

Example

Example definitions give an example of the term to illustrate it as shown below.

An example of a video game would be PlayerUnknown’s Battlegrounds

This provides a context for the type of video games the debate may focus on.

Operation

An operational definition is a working definition limited to the specific context. For example,

Video games for us is any game that is played on an electronic device

Few define video games like this, but this is an example.

Authority

A definition by authority is one in which the term is defined by an expert.

According to technopedia, a video game is…..

Authorities use their experience and knowledge to set what a term means, and this can be used by debaters.

Negation

Negation is defining a word by what it is not. For example,

When we speak of video games we are not talking about educational games such as Oregon Trail. Rather, we are speaking of violent games such as Grand Theft Auto

The contrast between the types of games here is what the debater is using to define their term.

Conclusion

Defining terms is part of debating. Debaters need to be trained to understand the importance of this so that they can enhance their communication and persuasion.

Making Bar Graphs with D3.js

This post will provide an example of how to make a basic bar graph using d3.js. Visualizing data is important, and developing bar graphs is one way to communicate information efficiently.

This post has the following steps

  1. Initial Template
  2. Enter the data
  3. Setup for the bar graphs
  4. Svg element
  5. Positioning
  6. Make the bar graph

Initial Template

Below is what the initial code should look like.

[Image: code]

Entering the Data

For the data, we will hard code it into the script using an array. This is not the best way of doing this, but it is the simplest for a first attempt. This code is placed inside the second script element. Below is a picture.

[Image: code]

The new code is in lines 10-11, saved as the variable data.

Setup for the Bars in the Graph

We are now going to create three variables. Each is explained below.

  • The barWidth variable indicates how wide the bars should be for the graph
  • The barPadding variable puts space between the bars in the graph. If this is set to 0, the bars touch, as in a histogram
  • The maxValue variable scales the height of the bars relative to the largest observation in the array. This variable uses the .max() method to find the largest value.

Below is the code for these three variables.

[Image: code]

The new information was added in lines 13-14.

SVG Element

We can now begin to work with the svg element. We are going to create another variable called mainGroup. We select the body element using the .select() method, append the svg using .append(), and set the width and height using .attr(). Lastly, we append a group element inside the svg so that all of our bars are inside the group that is inside the svg element.

The code is getting longer, so we will only show the new additions in the pictures with a reference to older code. Below is the new code in lines 16-19, directly under the maxValue variable.

[Image: code]

Positioning

Next, we need to make three functions.

  • The first function calculates the x location of each bar
  • The second function calculates the y location of each bar
  • The last function combines the work of the first two functions to place the bar at the proper x,y coordinate in the svg element.

Below is the code for the three functions. These are added in lines 21-25.

[Image: code]

The xloc function starts at the left edge of the mainGroup element and adds the barWidth plus the barPadding to place each next bar. The yloc function starts at the top and subtracts the given data point from maxValue to calculate the y position. Lastly, the translator combines the output of both the xloc and the yloc functions to position the bar using the translate transform.
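The three positioning functions can be sketched as plain JavaScript. The data values and settings below are hypothetical; only the function and variable names follow the post.

```javascript
// Hypothetical data and settings in the spirit of the post.
const data = [30, 80, 45, 60];
const barWidth = 20;
const barPadding = 5;
const maxValue = Math.max(...data);   // d3.max(data) gives the same result

// x position: each bar sits one bar width plus padding to the right of the last.
function xloc(d, i) {
  return i * (barWidth + barPadding);
}

// y position: SVG y grows downward, so taller bars must start closer to the top.
function yloc(d) {
  return maxValue - d;
}

// Combine both into the string expected by the svg "transform" attribute.
function translator(d, i) {
  return "translate(" + xloc(d, i) + "," + yloc(d) + ")";
}
```

D3 calls functions like these with the data point d and its index i, which is why translator can be passed directly to .attr("transform", translator).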

Making the Graph

We can now make our graph. We use our mainGroup variable with the .selectAll() method with “rect” as the argument. Next, we use .data(data) to bind the data, .enter() to get a placeholder for each data point that does not yet have an element, and .append(“rect”) to add the rectangles. Lastly, we use .attr() to set the color, transformation, and height of the bars. Below is the code in lines 27-36 followed by the actual bar graph.

[Images: code and bar graph]
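Putting the pieces together, a compact sketch of the whole graph might look like the following. The function name drawBars and the option names are assumptions, and the positioning is written inline so that the block stands on its own.

```javascript
// Draws one rect per value. `opts` holds hypothetical settings:
// barWidth, barPadding, and fill are illustrative names.
function drawBars(d3lib, values, opts) {
  const tallest = Math.max(...values);
  const step = opts.barWidth + opts.barPadding;

  const mainGroup = d3lib.select("body").append("svg")
      .attr("width", values.length * step)
      .attr("height", tallest)
    .append("g");

  return mainGroup.selectAll("rect")
      .data(values)          // bind one value to each (future) rect
      .enter()               // values that do not have a rect yet
      .append("rect")        // create a rect for each of them
      .attr("fill", opts.fill)
      .attr("width", opts.barWidth)
      .attr("height", d => d)
      .attr("transform", (d, i) =>
        "translate(" + (i * step) + "," + (tallest - d) + ")");
}
```

In a browser with D3 loaded, drawBars(d3, [30, 80, 45], { barWidth: 20, barPadding: 5, fill: "black" }) would draw three black bars.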

The graph is complete but you can see that there is a lot of work that needs to be done in order to improve it. However, that will be done in a future post.

Types of Debate Proposition

In debating, the proposition is the main issue or the central topic of the debate. In general, there are three types of propositions: propositions of

  • Fact
  • Value
  • Policy

Understanding the differences in these three types of propositions is important in developing a strategy for a debate.

Proposition of Fact

A debate that is defined as a proposition of fact is a debate that is focused on whether something is true or not. For example, a debate may address the following proposition of fact.

Resolved: human activity is contributing to global warming

The affirmative side would argue that humans are contributing to global warming while the negative side would argue that humans are not contributing to global warming. The main concern is the truthfulness of the proposition. There is no focus on the ethics of the proposition, as that belongs to a proposition of value.

Proposition of Value

A proposition of value looks at your beliefs about what is right or wrong and/or good or bad. This type of proposition is focused on ethics and/or aesthetics. An example of a proposition of value would be the following.

Resolved: That television is a waste of time

This type of proposition is trying to judge the acceptability of something and/or make an ethical claim.

Value propositions can also have other, more nuanced characteristics. Instead of affirming that something is good or bad, a proposition of value can make a case for one idea being better than another, such as…

Resolved: That exercise is a better use of time than watching television

Now the debate is focused not on good vs bad but rather on better vs worse. It is a slightly different way of looking at the argument. Another variation on the proposition of value is when the affirmative argues to reject a value, such as in the following.

Resolved: That encouraging the watching of television is harmful to young people

The wording is slightly different from previous examples but the primary goal of the affirmative is to argue why television watching should not be valued or at least valued less.

One final variation of the proposition of value is the quasi-policy proposition of value. A quasi-policy value proposition is used to express a value judgement about a policy. An example would be

Resolved: That mandatory vaccinations would be beneficial to school age children

Here the affirmative is not only judging vaccinations but simultaneously the potential policy of making vaccinations mandatory.

Proposition of Policy

Propositions of policy call for change. This type of proposition is pushing strongly against the status quo. Below is an example.

Resolved: That the cafeteria should adopt a vegetarian diet

The example above is calling for clear change. However, notice how there is no judgement on the current state of affairs. In other words, there is no judgement that the non-vegetarian diet is good or bad or that a vegetarian diet is good or bad. This is one reason why this is not a proposition of value.

In the case of a proposition of policy, the affirmative supports the change while the negative supports the status quo.

Conclusion

Debate propositions shape the entire direction and preparation for the debate itself. Therefore, it is important for debaters to understand what type of proposition they are dealing with. In addition, teachers who are creating debates need to know exactly what they want the students to do in a debate when they create propositions.

SVG and D3.js

Scalable Vector Graphics, or SVG for short, is an XML markup language that is the standard for creating vector-based graphics in web browsers. The true power of d3.js is unlocked through its use of svg elements. Employing vectors allows the images that are created to be scaled to various sizes.

One unique characteristic of svg is that the coordinate system starts in the upper left, with the x axis increasing as normal from left to right. However, the y axis starts from the top and increases going down, rather than increasing from the bottom up. Below is a visual of this.

[Image: svg coordinate system]

You can see that (0,0) is in the top left corner. As you go to the right the x axis increases, and as you go down the y axis increases. By changing the x and y values you are able to position your image where you want it. If you’re curious, the visual above was made using d3.js.

For the rest of the post, we will look at different svg elements that can be used in making visuals. We will look at the following.

  • Circles
  • Rectangles/squares
  • Lines & Text
  • Paths

Circles

Below is the code for making a circle followed by the output.

[Image: code]

[Image: output]

To make a shape such as the circles above, you first must specify the size of the svg element as shown in line 6. Then you make a circle element. Inside the circle element you must indicate the x and y position of the center (cx, cy) and also the radius (r). The default color is black; however, you can specify other colors as shown below.

[Images: code and output]

To change the color, simply add the style attribute and indicate the fill and the color in quotations.

Rectangle/Square

Below is the code for making rectangles/squares. The attributes are slightly different, but this should not be too challenging to figure out.

[Images: code and output]

The x and y attributes indicate the position, and the width and height attributes determine the size of the rectangle or square.

Lines & Text

Lines are another tool in d3.js. Below is the code for drawing a line.

[Image: code]

[Image: output]

The code should not be too hard to understand. You now need two separate coordinates. This is because the line needs to start in one place and draw until it reaches another. You can also control the color of the line and the thickness.

You can also add text using svg. In the code below we combine the line element with the text element.

[Images: code and output]

With the text element, after you set the position, font, font size, and color, you also have to add your text in between the tags of the element.

Path

The path element is slightly trickier but also more powerful when compared to the elements we have used previously. Below is the code and output from using the path element.

[Images: code and output]

The path element has a mini-language all to itself. “M” is where the drawing begins and is followed by the x,y coordinate. “L” draws a line. Essentially, it takes the current position and draws a line to the next position. “V” indicates a vertical line. Lastly, “Z” means to close the path.

Here is what the code above literally means.

  1. Start at 40, 40
  2. Make a line from 40, 40 to 250, 40
  3. Make another line from 250, 40 to 140, 40
  4. Make a vertical line from 140,40 to 4,40
  5. Close the path
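To make the mini-language more concrete, the sketch below builds the string for a path's d attribute from a list of commands. The coordinates are hypothetical (they draw a right triangle, not the shape from this post), and buildPath is a helper name made up for illustration.

```javascript
// Turns a list of [command, ...numbers] into the string used by the
// "d" attribute of an svg path element.
function buildPath(commands) {
  return commands.map(c => c[0] + c.slice(1).join(",")).join(" ");
}

// Hypothetical coordinates: a right triangle.
const triangle = buildPath([
  ["M", 40, 40],    // start drawing at (40, 40)
  ["L", 250, 40],   // straight line to (250, 40)
  ["V", 140],       // vertical line down to y = 140, x stays at 250
  ["Z"]             // close the path back to the starting point
]);
```

Here triangle holds the string "M40,40 L250,40 V140 Z", which could be used in D3 as svg.append("path").attr("d", triangle).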

Using path can be much more complicated than this. However, this is enough for an introduction.

Conclusion

This was just a teaser of what is possible with d3.js. The ability to make various graphics based on data is something that we have not even discussed yet. As such, there is much more to look forward to when using this visualization tool.

Phrasing Debate Propositions

Debating is a practical way for students to develop communication and critical thinking skills. However, it is often the job of the teacher to find debate topics and to form these into propositions. A proposition is a strong statement that identifies the central issue/problem of a controversial topic.

There is a clear process for this that should be followed in order to allow the students to focus on developing their arguments rather than on trying to figure out what they are to debate about.

This post will provide guidance for teachers on developing debate propositions. In general, debate propositions have the following characteristics

  • controversial
  • central idea
  • unemotional word use
  • statement of the affirmative’s wanted decision

We will look at each of these concepts in detail.

Controversial

Controversy is what debating is about. A proposition must be controversial. This is because strong statements force people to take a position. Without a push toward the edges, students are not required to dig deep and understand the topic. Below is an example of an uncontroversial proposition.

Resolved: Illegal immigration is sometimes a problem in the world

This is not controversial because it’s hard to agree or disagree strongly. The mildness of the statement makes it uninteresting to debate. Below is the same proposition but written in a more controversial manner.

Resolved: Illegal immigration is a major problem that destabilizes nations all over the world.

The revised proposition uses language that is less neutral yet not aggressive. People have to think carefully about where they stand.

Central Idea

A debate proposition should address only one idea. The safest way to do this is to avoid using the word “and” in a proposition. However, this is not a strict rule but rather a guideline. Below is an example of a proposition that does not have one idea.

Resolved: Illegal immigration and pollution are major problems that destabilize nations all over the world.

The problem with the proposition above is determining whether the debate is about illegal immigration or pollution. These topics are not connected, so the debaters must find ways to connect them. In other words, this is messy and unclear, and a proposition cannot have these characteristics.

Unemotional Words

Propositions should avoid emotional language. One of the foundational beliefs of debating is rational thought. Emotional terms lead to emotional thinking which is not the goal of debating. Generally, emotional terms are used more in advertising and propaganda than in debate. Below is an example of emotional language in a proposition.

Resolved: Illegal immigration is an abominable problem whose depraved, lawless existence destabilizes nations all over the world.

The terms here are clearly strong in how they sound. For supporters of illegal immigration, such words as abominable, depraved, and lawless are going to trigger a strong emotional response. However, what we really want is a logical, rational response and not just emotional attacks.

Statement of affirmative’s wanted decision

This last idea has to do with the fact that the proposition should be stated in the positive and not the negative. Below is the incorrect way to do this.

Resolved: Illegal immigration is not a major problem that destabilizes nations all over the world.

The proposition above is stated in the negative. This wording makes debating unnecessarily complicated. Below is another way to state this.

Resolved: Illegal immigration is beneficial for nations all over the world.

This slight rewording helps a great deal in developing clear arguments. However, negatively stated propositions can appear in a slightly different manner, as shown below.

Resolved: Illegal immigration should be decriminalized.

The problem with this statement is that it provides no replacement for illegal immigration. When debating, identified problems must be matched with identified solutions. Below is a revised version of the previous proposition.

Resolved: Illegal immigration should be decriminalized and replaced with a system of open borders that monitors the movement of people.

This proposition has a strong opinion with a proposed solution.

Conclusion

This is not an exhaustive list for forming debate propositions. Rather, the goal here was to provide some guidelines to help teachers who are trying to encourage debating among their students. Of course, the guidelines provided here are for older students. For younger kids, it would be necessary to modify the wording and not worry as much about the small details of making strong debate propositions.

Presumption & Burden of Proof in Debating

In debating, it is important to understand the role of presumption and burden of proof and how these terms affect the status quo. This post will attempt to explain these concepts.

Status Quo

The status quo is the way things currently are or the way things are done. The affirmative in a debate is generally pushing change or departure from the status quo. This is in no way easy as people often prefer to keep things the way they are and minimize change.

Presumption

Presumption is the tendency to favor one side of an argument over another. There are at least two forms of presumption. These two forms are judicial presumption and policy presumption.

Judicial presumption always favors the status quo, or keeping things the way they are currently. Small changes can be made, but the existing structure is not going to be different. In debates that happen from the judicial perspective, it is the affirmative side that has the burden of proof, that is, it must show that the benefit of change outweighs the status quo. A common idiom that summarizes this preference for the status quo is “If it ain’t broke don’t fix it.”

The policy form of presumption is used when a change to the status quo is necessary. An example would be replacing an employee. The status quo of keeping the worker is impossible, and the debate is now focused on who should be the replacement. A debate from a policy perspective is about which of the new approaches is the best to adopt.

In addition, the concept of burden of proof goes from the burden of proof to a burden of proof. This is because each side of the debate must support the argument that it is making.

Burden of Refutation

The burden of refutation is the obligation to respond to the opposing arguments. In other words, debaters often need to explain why the other side’s arguments are weak or perhaps even wrong. Failure to do so can make the refuting debater’s position weaker.

This leads to the point that there are no ties in debating. If both sides are equally good, the status quo wins, which is normally the negative side. This is because the affirmative side did not meet the burden of proof necessary to warrant change.

Conclusion

The structure of debating requires that debaters have a basic understanding of the various concepts in this field. Therefore, understanding such terms as status quo, presumption, and burden of proof is beneficial, if not required, in order to participate in debating.

Intro to D3.js

D3.js is a JavaScript library that is useful for manipulating HTML documents. D3.js stands for Data-Driven Documents and was developed by Mike Bostock. One of the primary purposes of D3.js is data visualization. However, D3.js can do more than just provide data visualization, as it also allows for interaction binding, item selection, and dynamic styling of DOM (document object model) elements.

In order to use D3.js you should have a basic understanding of HTML. For data visualization you should have some idea of basic statistics and data visualization concepts in order to know what it is you want to visualize. You also need to pick some sort of IDE so that you can manipulate your code. Lastly, you must know how to start a server on your local computer so you can see the results of your work.

In this post, we will make a document that uses D3.js to display the phrase “Hello World”.

Example

Below is a bare minimum skeleton that an HTML document often has.

[Image: code]

Line 1 indicates the document type. Line 2 indicates the beginning of the html element. Inside it, the head element can be used to store information about the page. Next, in line 6, is the body element. This is where the content of the web page is mostly kept. Notice how the information is contained in the various elements. All code is contained within the html element, and the head and body elements are separate from one another.

First Script

We are now going to add our first few lines of d3.js code to our html document. Below is the code.

[Image: code]

The code within the body element is new. In lines 7-8 we use a script element to access the d3.js library. Notice how it is a link. This means that when we run the code the d3.js library is accessed from some other place on the internet. An alternative to this is to download d3.js yourself. If you do this, the simplest setup is to keep d3.js in the same folder as the html document that you are making.

To get d3.js from the internet, you use the src attribute and place the web link in quotations. The charset attribute is the setting for the character encoding. Sometimes this information is important, but it depends.

The second script element is where we actually do something with d3.js. Inside this second script element, line 10 has the command d3.select(‘body’), which tells d3.js to select the first body element in the document. Line 11 has the command .append(‘h1’), which tells d3.js to add an h1 heading in the body element. Lastly, we have .text(‘Hello World’), which tells d3.js to add the text ‘Hello World’ to the h1 heading in the body element. This process of adding one command after another to modify the same object is called chaining.
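The chain described above can be sketched as a small function. The wrapper name sayHello and the guard at the bottom are additions for illustration; the three chained calls are the ones named in the text.

```javascript
// Adds an <h1> with the given text to the page body and returns the selection.
// `d3lib` is the D3 object (the global `d3` in a browser).
function sayHello(d3lib, message) {
  return d3lib.select("body")   // select the first body element
      .append("h1")             // add an h1 heading inside it
      .text(message);           // set the heading's text
}

// Only run when D3 and a page are actually available (e.g. in a browser).
if (typeof d3 !== "undefined" && typeof document !== "undefined") {
  sayHello(d3, "Hello World");
}
```

Each call returns a selection, which is what makes it possible to attach the next call directly to the previous one.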

When everything is done and you show your results you should see the following.

[Image: output]

This is not the most amazing thing to see given what d3.js can do but it serves as an introduction.

More Examples

You are not limited to only one line of code or only one script element. Below is a more advanced version.

[Image: code]

The new information is in lines 14-20. Lines 14-15 are just two p elements with some text. Lines 17-19 are another script element. This time we use d3.selectAll(‘p’), which tells d3.js to apply the following commands to all p elements. In line 19 we use the .style() command to set the background color to light blue. When this is done you should see the following.

[Image: output]
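A minimal sketch of the second script is given below. The function name highlightParagraphs is hypothetical, and "lightblue" is assumed to be the CSS color name behind the light blue described above.

```javascript
// Applies a background color to every <p> element on the page.
// `d3lib` is the D3 object (the global `d3` in a browser).
function highlightParagraphs(d3lib, color) {
  return d3lib.selectAll("p")               // all p elements, not just the first
      .style("background-color", color);    // set a CSS property on each of them
}

// Only run when D3 and a page are actually available (e.g. in a browser).
if (typeof d3 !== "undefined" && typeof document !== "undefined") {
  highlightParagraphs(d3, "lightblue");
}
```

The difference from d3.select() is that d3.selectAll() returns every matching element, so a single .style() call changes all of the paragraphs at once.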

Still not amazing, but things were modified as we wanted. Notice also that the second script element is not inside the body element. This is not necessary because you never see script elements in a website. Rather, you see the results of such code.

Conclusion

This post introduced d3.js, which is a powerful tool for visualization. Although the examples here are fairly simple, you can be assured that there is more to this library than what has been explored so far.

Quadratic Discriminant Analysis with Python

Quadratic discriminant analysis allows the classifier to assess non-linear relationships. This, of course, is something that linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a QDA analysis using Python. The steps that will be conducted are as follows.

  1. Data preparation
  2. Model training
  3. Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data, “Wages1”, which comes from the “pydataset” library. After loading the data, we will use the .head() method to look at it briefly.

[Image: code and output]

We need to transform the variable ‘sex’, our dependent variable, into a dummy variable using numbers instead of text. We will use the pandas get_dummies() function to make the dummy variables and then add them to the dataset using the pandas concat() function.

In the code below we make histograms for the continuous independent variables. We are using the distplot() function from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)
sns.set(font_scale=1.4)
sns.distplot(df['exper'],color='black',ax=axs[0])
sns.distplot(df['school'],color='black',ax=axs[1])
sns.distplot(df['wage'],color='black',ax=axs[2])


The variables look reasonably normal. Below are the proportions of the categorical dependent variable.

round(df.groupby('sex').count()/3294,2)
Out[247]: 
        exper  school  wage  female  male
sex                                      
female   0.48    0.48  0.48    0.48  0.48
male     0.52    0.52  0.52    0.52  0.52

About half male and half female.

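We will now make the correlation matrix.

# numeric_only=True (pandas 1.5 or later) skips the text column 'sex',
# which newer versions of pandas would otherwise reject
corrmat=df.corr(method='pearson',numeric_only=True)
f,ax=plt.subplots(figsize=(12,12))
sns.set(font_scale=1.2)
sns.heatmap(round(corrmat,2),
            vmax=1.,square=True,
            cmap="gist_gray",annot=True)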


There appear to be no major problems with correlations. The last thing we will do is set up our train and test datasets.

X=df[['exper','school','wage']]
y=df['male']
X_train,X_test,y_train,y_test=train_test_split(X,y,
                                               test_size=.2, random_state=50)

We can now move to model development.

Model Development

To create our model we will create an instance of the quadratic discriminant analysis class and use the .fit() method.

qda_model=QDA()
qda_model.fit(X_train,y_train)

There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means. Below are the group means.

        exper  school  wage
Female   7.73   11.84  5.14
Male     8.28   11.49  6.38

You can see from the table that men generally have more experience and higher wages, but slightly less education.
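The code that produced this table is not shown in the post. One way to pull group means from a fitted scikit-learn QDA model is its means_ attribute, sketched here on a tiny made-up dataset (the arrays `X_demo` and `y_demo` are assumptions for illustration, not the Wages1 data).

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

# Hypothetical data: two features, four examples per class
X_demo = np.array([[1, 2], [2, 1], [3, 4], [2, 3],
                   [6, 5], [7, 8], [8, 6], [7, 7]], dtype=float)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

demo_model = QDA().fit(X_demo, y_demo)

# means_ has one row per class (in the order of classes_) and one column per feature
for label, means in zip(demo_model.classes_, demo_model.means_):
    print(label, means.round(2))
```

With the real model, the rows would correspond to the two values of the ‘male’ dummy variable and the columns to exper, school, and wage.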

We will now use the qda_model we created to predict the classifications for the training set. This information will be used to make a confusion matrix.

# Predict on the training data so that y_pred exists for the confusion matrix
y_pred=qda_model.predict(X_train)
cm = confusion_matrix(y_train, y_pred)
# plt.subplots returns both the figure and the axes
fig, ax = plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The upper-left corner is the number of people who were female and correctly classified as female. The lower-right corner is the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as females. Below is the actual accuracy of our model.

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.
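Before moving on, it may help to see how the four cells of a confusion matrix and the accuracy score relate to each other. The sketch below counts them by hand for a few made-up labels (0 for female, 1 for male); the lists `actual` and `predicted` are hypothetical and are not the model's real output.

```python
# Hypothetical labels: 0 = female, 1 = male
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# cm[i][j] counts the examples whose true class is i and predicted class is j,
# the same layout the heatmap uses (rows are actual, columns are predicted)
cm = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    cm[a][p] += 1

# Accuracy is the share of examples on the diagonal (correct predictions)
correct = cm[0][0] + cm[1][1]
accuracy = correct / len(actual)

print(cm)
print(accuracy)
```

The two off-diagonal cells are the mistakes, so a model can only improve its accuracy by moving examples from those cells onto the diagonal.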

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

# Predict on the held-out test data
y_pred=qda_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True,annot=True,fmt='g',
                cmap=ListedColormap(['gray']),linewidths=2.5)


The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

The accuracy is about the same, so our model generalizes even though it performs somewhat poorly.

Conclusion

This post provided an explanation of how to do a quadratic discriminant analysis using Python. This is just another potential tool that may be useful for the data scientist.

Background of Debates

Debating has a history as long as the history of man. There is evidence that debating dates back at least 4,000 years. From Egypt to China, and even in poetry such as Homer’s “Iliad”, one can find examples of debating. Academic debating is believed to have started about 2,500 years ago with the work of Protagoras.

We will look at the role of culture in debating as well as debate’s role in academics in the US along with some of the benefits of debating.

Debating and Culture

For whatever reason, debating is a key component of Western civilization and in particular democratic civilizations. Speculating on why could go on forever. However, one key reason for the emphasis on debating in the West is the epistemological view of truth.

In many Western cultures, there is an underlying belief that truth is relative. As such, when two sides are debating a topic, it is through the combined contributions of both arguments that some idea of truth is revealed. In many ways, this is a form of the Hegelian dialectic in which thesis and antithesis make a synthesis. The synthesis is the truth and can only be found through a struggle of opposing relative positions.

In other cultures, such as Asian cultures, what is true is much more stable and agreed upon as unchanging. This may be a partial reason why debating is not as strenuously practiced in non-Western contexts. Confucianism in particular focuses on stability, tradition, and rigid hierarchy. These are concepts that are often considered unthinkable in a Western culture.

Debating in the United States

In the United States, applied debating has been part of the country from almost the beginning. Academic debating has been present since at least the 18th century. It was at the beginning of the 20th century that academic debating began to be taken much more seriously. Intercollegiate debating during this time led to the development of several debate associations that had various rules and ways to support the growth of debating.

Benefits of Debating

Debating has been found to develop argumentation skills and critical thinking and to enhance general academic performance. Having to gather information and synthesize it in a clear way seems to transfer when students study for other academic subjects. In addition, even though debating is about sharing one side of an argument, it also improves listening skills. This is because you have to listen in order to point out weaknesses in the opposition’s position.

Debating also develops the ability to think quickly. If the ability to think quickly is not developed, a student will struggle with refutation and rebuttals, which are key components of debating. Lastly, debating sharpens the judgment of participants. It is important to be able to judge the strengths and weaknesses of various aspects of an argument in order to provide a strong case for or against an idea or action.

Conclusion

With its rich history and clear benefits, debating will continue to be a part of the academic experience of many students. The skills that are developed are practical and useful for many occupations found outside of an academic setting.

Types of Debates

Debating has a long history, with historical evidence of this practice dating back 4,000 years. Debating was used in ancient Egypt, China, and Greece. As such, people who participate in debates are contributing to a rich history.

In this post, we will take a look at several types of debates that are commonly used today. The types of debates we will cover are as follows.

  • Special
  • Judicial
  • Parliamentary
  • Non-formal
  • Academic

Special Debate

A special debate is special because it has distinct rules for a specific occasion. Examples include the Lincoln-Douglas debates of 1858. These debates were so influential that there is a debate format today called the Lincoln-Douglas format. This format often focuses on moral issues and has a specific use of time for the debaters that is distinct.

Special debates are also commonly used for presidential debates. Since there is no set format, the debaters literally may debate over the rules of the actual debate. For example, the Bush vs Kerry debates of 2004 had some of the following rules agreed to by both parties prior to the debate.

  1. Height of the lectern
  2. Type of stools used
  3. Nature of the audience

As this example shows, the rules sometimes have nothing to do with the debate itself but rather with the atmosphere or setting around it.

Judicial Debate

Judicial debates happen in courts or judicial-like settings. The goal is to prosecute or defend individuals for some sort of crime. For lawyers in training or even general students, moot court debates and mock trial debates are used to hone debating skills.

Parliamentary Debate

The purpose of the parliamentary debate is to support or attack potential legislation. Despite its name, the parliamentary debate format is used in the United States at various levels of government. There is a particularly famous variation of this called the Asian parliamentary debate style.

Non-formal Debate

A non-formal debate lacks the rules of the other styles mentioned. In many ways, any form of disagreeing that does not have a structure for how to present one’s argument can fall under the category of non-formal. For example, children arguing with parents could be considered non-formal as well as classroom discussion on a controversial issue such as immigration.

This form of debate is probably the only one that everyone is familiar with and has participated in. However, it is probably the hardest to develop skills in due to the lack of structure.

Academic Debate

The academic debate is used to develop the educational skills of the participants. Often the format deployed is taken from applied debates. For example, many academic debates use the Lincoln-Douglas format. There are several major debate organizations that promote debate competitions between schools. The details of this will be expanded in a future post.

Conclusion

This post provided an overview of different styles of debating that are commonly employed. Understanding this can be important because how you present and defend a point of view depends on the rules of engagement.

Persuasion vs Propaganda

Getting people to believe or do something has been a major problem on both an individual and an international level. To address this concern, both individuals and nations have turned to persuasion and propaganda. This post will define both persuasion and propaganda and compare and contrast them.

Persuasion

Persuasion is communication that attempts to influence the behavior or beliefs of others. This can be done through appeals to reason, appeals to emotions, or a combination of both. Often persuasion is done on a small scale and is informal. An example would be a child trying to persuade their mother to let them go outside to play.

A more serious example would be a lawyer trying to persuade a judge. This involves one lawyer trying to move the opinion of one judge. The goal here is for the lawyer to show the strength of their position while discrediting the position of the opposition.

Even though it is not on a large scale, persuasion requires critical thinking, deep thought, and a thorough knowledge of the problem and the person(s) one is trying to persuade. Nothing can ruin persuasion like ignorance of the problem or of the people you want to persuade.

Propaganda

Propaganda is persuasion on a large scale. It involves a group or organization of persuaders who combine their efforts to reach a large audience. The term propaganda was supposedly created in the 17th century by Pope Gregory XV, who in 1622 created the Sacred Congregation for the Propagation of the Faith. This group was responsible for spreading the Catholic religion in an evangelistic manner for conversions to the religion, which implies that propaganda is the spreading of ideas so that people accept them.

Edward Bernays is often seen as the master of propaganda. It was he who brought the use of propaganda to an art form in the early to mid 20th century. Ever the master of language and knowing the negative connotation of propaganda, Bernays publicly used an alternative term on many occasions: “public relations.” This is the term essentially all institutions use today, even though it has the same primary characteristic as propaganda, which is to influence public opinion about something.

As mentioned in the previous paragraph, the term propaganda is generally viewed negatively even though it is simply massive organized persuasion. This may be because propaganda has often been used for nefarious purposes throughout history. For example, Hitler used propaganda to strengthen the Nazi party. However, all countries are guilty of developing propaganda for reasons that may not be completely altruistic in order to support their position in a competitive world.

Comparison

Persuasion and propaganda are in many ways opposite extremes of the same idea. What persuasion is on a small scale propaganda is on a large scale. However, it is hard to tell how big persuasion has to become before it reaches the level of propaganda. One indication may be in perception of the message. People who disagree with a position may call it propaganda, while people who agree with the message may call it persuasion.

Both persuasion and propaganda involve the use of planning and serious thought. Propaganda may involve more planning as it requires a large group of people to impact a much larger audience. Finally, when persuasion and propaganda fail, it may lead to something more sinister called coercion. This is when people are forced, not necessarily to believe something, but usually to do something.

Conclusion

Whether persuading or sharing propaganda, it is important to be aware of how these two terms are similar and different. Generally, the difference is a matter of scale. Persuasion is a local, personal form of propaganda while propaganda is a massive, impersonal form of persuasion.

Random Forest Classification with Python

Random forest is a type of machine learning algorithm that makes multiple decision trees, each of which may use different features and a different subsample of the data, until it has made as many trees as you specify. The trees then vote to determine the class of an example. This approach helps to deal with the high variance that is a problem with making only one decision tree.
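The voting idea can be sketched in a few lines. This is only an illustration of a simple majority vote with made-up tree predictions (`votes` and `majority_vote` are hypothetical names); scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single example
votes = ['dead', 'censored', 'dead', 'dead', 'censored']
print(majority_vote(votes))
```

Because each tree sees a slightly different view of the data, an unusual split in one tree is outvoted by the others, which is where the reduction in variance comes from.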

In this post, we will learn how to develop a random forest model in Python. We will use the cancer dataset from the pydataset module to classify whether a person’s status is censored or dead based on several independent variables. The steps we need to perform to complete this task are defined below.

  1. Data preparation
  2. Model development and evaluation

Data Preparation

Below are some initial modules we need to complete all of the tasks for this project.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

We will now load our dataset “cancer” and drop any rows that contain NA using the .dropna() method.

df = data('cancer')
df=df.dropna()

Next, we need to separate our independent variables from our dependent variable. We will do this by making two datasets. The X dataset will contain all of our independent variables and the y dataset will contain our dependent variable. You can check the documentation for the dataset using the code data('cancer', show_doc=True).

Before we make the y dataset we need to change the numerical values in the status variable to text. Doing this will aid in the interpretation of the results. If you look at the documentation of the dataset you will see that a 1 in the status variable means censored while a 2 means dead. We will change the 1 to censored and the 2 to dead when we make the y dataset. This involves the use of the .replace() method. The code is below.

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
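As a side note, the two .replace() calls can also be written as a single call with a dictionary. Below is a minimal sketch on a made-up status column (the `status` Series is hypothetical, not the cancer data).

```python
import pandas as pd

# Hypothetical status codes: 1 = censored, 2 = dead
status = pd.Series([1, 2, 2, 1])

# A dictionary maps every old value to its new label in one pass
labels = status.replace({1: 'censored', 2: 'dead'})
print(labels.tolist())
```

Keeping the mapping in one dictionary makes it easier to see at a glance which code stands for which label.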

We can now proceed to model development.

Model Development and Evaluation

We will first make our train and test datasets. We will use a 70/30 split. Next, we initialize the actual random forest classifier. There are many options that can be set. For our purposes, we will set the number of trees to make to 100. Setting the random_state option is similar to setting the seed for the purpose of reproducibility. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestClassifier(n_estimators=100,random_state=1)

We can now run our model with the .fit() method and test it with the .predict() method. The code is below.

h.fit(x_train,y_train)
y_pred=h.predict(x_test)

We will now print two tables. The first will provide the raw results for the classification using the .crosstab() function. The classification_report function will provide the various metrics used for determining the value of a classification model.

print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))


Our overall accuracy is about 75%. How good this is depends on context. We are really good at predicting people who are dead but have much more trouble predicting people who are censored.
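To see how a model can have decent overall accuracy while still struggling with one class, here is a small sketch that computes accuracy and per-class recall from a cross-tabulation. The counts are made up for illustration and are not the results from this post.

```python
# Hypothetical counts for illustration only, NOT the results from this post.
# Outer keys are the actual classes, inner keys are the predicted classes.
counts = {
    'censored': {'censored': 4, 'dead': 10},
    'dead': {'censored': 3, 'dead': 33},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[label][label] for label in counts)
accuracy = correct / total

# Recall: of the examples that truly belong to a class, the share predicted correctly
recall = {label: counts[label][label] / sum(counts[label].values())
          for label in counts}

print(round(accuracy, 2))
for label, value in recall.items():
    print(label, round(value, 2))
```

When one class is much larger than the other, the larger class dominates the accuracy figure, which is why the per-class metrics in the classification report are worth reading as well.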

Conclusion

This post provided an example of using random forest in Python. Through the use of a forest of trees, it is often possible to get more accurate results than with a single decision tree. This is one of many reasons for the use of random forest in machine learning.

Critical Thinking and Debating

Debating is a commonly used activity for developing critical thinking skills. The question that this post wants to answer is how debating develops critical thinking. This will be achieved through discussing the following…

  • Defining debate
  • Debating in the past
  • Debating today

Defining Debate

A debate is a process of defending or attacking a proposition through the use of reasoning and judgement. The goal is to go through a process of argumentation in which good reasons are shared with an audience. Good reasons are persuasive reasons that have a psychological influence on an audience. Naturally, what constitutes a good reason varies from context to context. Therefore, a good debater always keeps in mind who their audience is.

One key element of debating is what is missing. Technically, debating is an intellectual experience and not an emotional one. This has been lost sight of over time as debaters and public speakers have learned that emotional fanaticism is much more influential in moving the masses than deliberate thinking.

Debating in the Past

Debating was a key tool among the ancient Greeks. Aristotle provides us with at least four purposes for debating. The first purpose is that debating allows people to see both sides of an argument. As such, debating dispels bias and allows for more carefully defined decision-making. One of the characteristics of critical thinking is the ability to see both sides of an argument or to think empathically rather than only sympathetically.

A second purpose of debating is for instructing the public. Debates allow experts to take complex ideas and reduce them to simple ones for general consumption. Of course, this has been taken to extremes through sound bites and memes in the 21st century, but learning how to communicate clearly is yet another goal of critical thinking.

A third purpose of debating is to prevent fraud and injustice. Aristotle was assuming that there was truth and that truth was more powerful than injustice. These are ideas that have been lost with time as we now live in a postmodern world. However, Aristotle believed that people needed to know how to argue for truth and how to communicate it to others. Today, experiential knowledge and emotions are the primary determiners of what is right and wrong rather than cold truth.

A final purpose of debating is to defend oneself. Debating is an intellectual way of protecting someone just as fighting is a physical way of protecting someone. There is an idiom in English that states that “the pen is mightier than the sword.” Often physical fighting comes after several intellectual machinations by leaders who find ways to manipulate things. A skilled debater can move millions whereas a strong soldier can only do a limited amount of damage alone.

Debating Today

One aspect of debating that is not covered above is the aspect of time. Debating is a way to develop critical thinking, but it is also a way of developing real-time critical thinking. In other words, not only do you have to prepare your argument and ideas before a debate, you also have to respond and react during a debate. This requires thinking on your feet in front of an audience while still trying to be persuasive and articulate. This is not an easy task for most people.

Debating is often a lost art as people have turned to arguing instead. Arguing often involves emotional exchanges rather than rational thought. Some have stated that when debating disappears so does freedom of speech. In many ways, as topics and ideas become more emotionally charged, there is greater and greater restriction on what can be said so that no one is “offended”. Perhaps Aristotle was correct about his views on debating and injustice.

Type 1 & 2 Tasks in TESOL

In the context of TESOL, language skills of speaking, writing, listening, and reading are divided into productive and receptive skills. Productive skills are speaking and writing and involve making language. Receptive skills are listening and reading and involve receiving language.

In this post, we will take a closer look at receptive skills involving the theory behind them as well as the use of task 1 and task 2 activities to develop these skills.

Top Down Bottom Up

Theories that are commonly associated with receptive skills include top down and bottom up processing. With top down processing the reader/listener is focused on the “big picture”. This means that they are focused on a general view or idea of the content they are exposed to. This requires having a large amount of knowledge and experience to draw upon in order to connect understanding. Prior knowledge helps the individual to know what to expect as they receive the information.

Bottom up processing is the opposite. In this approach, the reader/listener is focused on the details of the content. Another way to see this is that with bottom up processing the focus is on the trees while with top down the focus is on the forest. With bottom up processing students are focused on individual words or even individual word sounds, such as when they are decoding when reading.

Type 1 & 2 Tasks

Type 1 and type 2 tasks are derived from top down and bottom up processing. Type 1 tasks involve seeing the “big picture”. Examples of this type include summarizing, searching for the main idea, making inferences, etc.

Often type 1 tasks are trickier to assess because solutions are often open-ended and open to interpretation. This involves having to individually assess each response each student makes, which may not always be practical. However, type 1 tasks really help to broaden and strengthen higher level thinking skills, which can lay a foundation for critical thinking.

Type 2 tasks involve looking at text and/or listening for much greater detail. Such activities as recall, grammatical correction, and single answer questions all fall under the umbrella of type 2 tasks.

Type 2 tasks are easier to mark as they frequently only have one possible answer. The problem with this is that teachers over rely on them because of their convenience. Students are trained to obsess over details rather than broad comprehension or connecting knowledge to other areas of knowledge. Opportunities for developing dynamic literacy are lost for a focus on critical literacy or even decoding.

A more reasonable approach is to use a combination of type 1 and 2 tasks. Type 1 can be used to stimulate thinking without necessarily marking the responses. Type 2 can be employed to teach students to focus on details, and due to the ease with which they can be marked, type 2 tasks can be placed in the grade book for assessing progress.

Conclusion

This post explained various theories related to receptive skills in TESOL. There was also a look at the two broad categories into which receptive skill tasks fall. For educators, it is important to find a balance between using type 1 and type 2 tasks in their classroom.

Data Exploration with R: Housing Prices


In this data exploration post, we will analyze a dataset from Kaggle using R. Below are the packages we are going to use.

library(ggplot2);library(readr);library(magrittr);library(dplyr);library(tidyverse);library(data.table);library(DT);library(GGally);library(gridExtra);library(ggExtra);library(fastDummies);library(caret);library(glmnet)

Let’s look at our data for a second.

train <- read_csv("~/Downloads/train.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.

Data Visualization

Let’s take a look at our target variable first.

p1<-train%>%
        ggplot(aes(SalePrice))+geom_histogram(bins=10,fill='red')+labs(x="Type")+ggtitle("Global")
p1


Here is the frequency of certain values for the target variable

p2<-train%>%
        mutate(tar=as.character(SalePrice))%>%
        group_by(tar)%>%
        count()%>%
        arrange(desc(n))%>%
        head(10)%>%
        ggplot(aes(reorder(tar,n,FUN=min),n))+geom_col(fill='blue')+coord_flip()+labs(x='Target',y='freq')+ggtitle('Frequency')
p2        


Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.

nums <- unlist(lapply(train, is.numeric))
train[ , nums]%>%
        select(-Id) %>%
        ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()


There are some strong associations in the data set. Below we see the top 10 correlations.

n1 <- 20 
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
        as_data_frame %>% 
        transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>% 
        distinct %>% 
        filter(Var1N != Var2N) %>% 
        arrange(desc(n)) %>%
        group_by(grp = as.integer(gl(n(), n1, n())))
out
## # A tibble: 703 x 4
## # Groups:   grp [36]
##    Var1N        Var2N            n   grp
##                     
##  1 GarageArea   GarageCars   0.853     1
##  2 1stFlrSF     TotalBsmtSF  0.829     1
##  3 GrLivArea    TotRmsAbvGrd 0.828     1
##  4 OverallQual  SalePrice    0.810     1
##  5 GrLivArea    SalePrice    0.731     1
##  6 GarageCars   SalePrice    0.691     1
##  7 YearBuilt    YearRemodAdd 0.684     1
##  8 BsmtFinSF1   BsmtFullBath 0.674     1
##  9 BedroomAbvGr TotRmsAbvGrd 0.668     1
## 10 FullBath     GrLivArea    0.658     1
## # ... with 693 more rows

There are about 4 correlations that are perhaps too strong.

Descriptive Statistics

Below are some basic descriptive statistics of our variables.

train_mean<-na.omit(train[ , nums]) %>% 
        select(-Id,-SalePrice) %>%
        summarise_all(funs(mean)) %>%
        gather(everything(),key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sd)) %>%
        gather(everything(),key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(median)) %>%
        gather(everything(),key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sum(.<0.001))) %>%
        gather(everything(),key='feature',value='zeros')%>%
        left_join(train_mean,by='feature')%>%
        left_join(train_median,by='feature')%>%
        left_join(train_sd,by='feature')
stat$zeropercent<-(stat$zeros/(nrow(train))*100)
stat[order(stat$zeropercent,decreasing=T),] 
## # A tibble: 36 x 6
##    feature       zeros    mean median      sd zeropercent
##                            
##  1 PoolArea       1115  2.93        0  40.2          76.4
##  2 LowQualFinSF   1104  4.57        0  41.6          75.6
##  3 3SsnPorch      1103  3.35        0  29.8          75.5
##  4 MiscVal        1087 23.4         0 166.           74.5
##  5 BsmtHalfBath   1060  0.0553      0   0.233        72.6
##  6 ScreenPorch    1026 16.1         0  57.8          70.3
##  7 BsmtFinSF2      998 44.6         0 158.           68.4
##  8 EnclosedPorch   963 21.8         0  61.3          66.0
##  9 HalfBath        700  0.382       0   0.499        47.9
## 10 BsmtFullBath    668  0.414       0   0.512        45.8
## # ... with 26 more rows

We have a lot of information stored in the table above. We have the mean, median, and sd in one place for all of the features. Below are visuals of this information. Because several of the plots use a log scale, we add 1 to the mean and sd to preserve features that may have a mean of 0.
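The reason for adding 1 can be checked with a little arithmetic. The sketch below is in Python rather than R and uses made-up feature means (`feature_means` is hypothetical): the logarithm of 0 is undefined, so a feature with a mean of 0 would have no position on a log axis, while the logarithm of 0 + 1 is simply 0.

```python
import math

# Hypothetical feature means; 'all_zero' stands for a feature whose mean is 0
feature_means = {'all_zero': 0.0, 'small': 9.0, 'large': 99.0}

for name, mean in feature_means.items():
    try:
        raw = math.log10(mean)
    except ValueError:
        raw = None  # log10(0) is undefined, so the feature would be lost on a log axis
    shifted = math.log10(mean + 1)
    print(name, raw, shifted)

# Values that would actually be plotted after the +1 shift
shifted_values = {name: math.log10(mean + 1) for name, mean in feature_means.items()}
```

The shift barely changes large values but keeps the zero-mean features on the plot.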

p1<-stat %>%
        ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")

p2<-stat %>%
        ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")

p3<-stat %>%
        ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")

p4<-stat %>%
        mutate(zeros=zeros/nrow(train)*100) %>%
        ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")

p5<-stat %>%
        ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
grid.arrange(p1,p2,p3,p4,p5,layout_matrix=rbind(c(1,2,3),c(4,5)))
## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)


Below we check for variables with zero variance. Such variables would cause problems if included in any model development.

stat%>%
        mutate(zeros = zeros/nrow(train)*100)%>%
        filter(mean == 0 | sd == 0 | zeros==100)%>%
        DT::datatable()

There are no zero-variance features in this dataset that may need to be removed.
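The idea behind the check is that a feature whose values never change has a standard deviation of 0 and so carries no information for a model. A minimal sketch of the same check in Python, using made-up columns (`features` is hypothetical, not the housing data):

```python
from statistics import pstdev

# Hypothetical columns; two of them never change
features = {
    'constant': [5, 5, 5, 5],
    'all_zero': [0, 0, 0, 0],
    'varied': [1, 3, 2, 8],
}

# A standard deviation of exactly 0 means every value in the column is identical
flagged = [name for name, values in features.items() if pstdev(values) == 0]
print(flagged)
```

Any feature that ends up in the flagged list would be a candidate for removal before modeling.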

Correlations

Let’s look at correlation with the SalePrice variable. The plot is a histogram of all the correlations with the target variable.

sp_cor<-train[, nums] %>% 
        select(-Id,-SalePrice) %>%
        cor(train$SalePrice,method="spearman") %>%
        as.tibble() %>%
        rename(cor_p=V1)

stat<-stat%>%
        #filter(sd>0)
        bind_cols(sp_cor)

# The correlations are stored in the cor_p column created above
stat%>%
        ggplot(aes(cor_p))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")


We have several high correlations, but we already knew this. Below we have some code that provides visuals of the correlations.

top<-stat%>%
        arrange(desc(cor_p))%>%
        head(10)%>%
        .$feature
p1<-train%>%
        select(SalePrice,one_of(top))%>%
        ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations")
p2<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation")
p3<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation")
ggMarginal(p3,type = 'histogram')
p3
grid.arrange(p1,p2,layout_matrix=rbind(c(1,2)))


The first plot show us the top correlations. Plot 1 show us the relationship between the strongest predictor and our target variable. Plot 2 shows us the trend-line and the histograms for the strongest predictor with our target variable.

The code below is for the categorical variables. Our primary goal is to see the proportions inside each variable. If a categorical variable lacks variance in terms of the frequencies in each category, it may need to be removed for model development purposes. Below is the code.

# Spearman correlations with SalePrice after treating zeros as missing
ig_zero<-train[, nums]%>%
        na_if(0)%>%
        select(-Id,-SalePrice)%>%
        cor(train$SalePrice,use="pairwise",method="spearman")%>%
        as_tibble()%>%
        rename(cor_s0=V1)
stat<-stat%>%
        bind_cols(ig_zero)%>%
        mutate(non_zero=nrow(train)-zeros)

# Names of the character (categorical) columns
char <- unlist(lapply(train, is.character))
me<-names(train[,char])

# Proportion of observations in each category, one table per variable
List=list()
for (var in train[,char]){
        wow= print(prop.table(table(var)))
        List[[length(List)+1]] = wow
}
names(List)<-me
List

The proportions for each categorical variable are shown below.

## $MSZoning
## var
##     C (all)          FV          RH          RL          RM 
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068 
## 
## $Street
## var
##        Grvl        Pave 
## 0.004109589 0.995890411 
## 
## $Alley
## var
##      Grvl      Pave 
## 0.5494505 0.4505495 
## 
## $LotShape
## var
##         IR1         IR2         IR3         Reg 
## 0.331506849 0.028082192 0.006849315 0.633561644 
## 
## $LandContour
## var
##        Bnk        HLS        Low        Lvl 
## 0.04315068 0.03424658 0.02465753 0.89794521 
## 
## $Utilities
## var
##       AllPub       NoSeWa 
## 0.9993150685 0.0006849315 
## 
## $LotConfig
## var
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945 
## 
## $LandSlope
## var
##        Gtl        Mod        Sev 
## 0.94657534 0.04452055 0.00890411 
## 
## $Neighborhood
## var
##     Blmngtn     Blueste      BrDale     BrkSide     ClearCr     CollgCr 
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726 
##     Crawfor     Edwards     Gilbert      IDOTRR     MeadowV     Mitchel 
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644 
##       NAmes     NoRidge     NPkVill     NridgHt      NWAmes     OldTown 
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260 
##      Sawyer     SawyerW     Somerst     StoneBr       SWISU      Timber 
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397 
##     Veenker 
## 0.007534247 
## 
## $Condition1
## var
##      Artery       Feedr        Norm        PosA        PosN        RRAe 
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247 
##        RRAn        RRNe        RRNn 
## 0.017808219 0.001369863 0.003424658 
## 
## $Condition2
## var
##       Artery        Feedr         Norm         PosA         PosN 
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630 
##         RRAe         RRAn         RRNn 
## 0.0006849315 0.0006849315 0.0013698630 
## 
## $BldgType
## var
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219 
## 
## $HouseStyle
## var
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521 
##      SFoyer        SLvl 
## 0.025342466 0.044520548 
## 
## $RoofStyle
## var
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863 
## 
## $RoofMatl
## var
##      ClyTile      CompShg      Membran        Metal         Roll 
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315 
##      Tar&Grv      WdShake      WdShngl 
## 0.0075342466 0.0034246575 0.0041095890 
## 
## $Exterior1st
## var
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock 
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315 
##      CemntBd      HdBoard      ImStucc      MetalSd      Plywood 
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027 
##        Stone       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192 
## 
## $Exterior2nd
## var
##      AsbShng      AsphShn      Brk Cmn      BrkFace       CBlock 
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315 
##      CmentBd      HdBoard      ImStucc      MetalSd        Other 
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315 
##      Plywood        Stone       Stucco      VinylSd      Wd Sdng 
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068 
##      Wd Shng 
## 0.0260273973 
## 
## $MasVnrType
## var
##     BrkCmn    BrkFace       None      Stone 
## 0.01033058 0.30647383 0.59504132 0.08815427 
## 
## $ExterQual
## var
##          Ex          Fa          Gd          TA 
## 0.035616438 0.009589041 0.334246575 0.620547945 
## 
## $ExterCond
## var
##           Ex           Fa           Gd           Po           TA 
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918 
## 
## $Foundation
## var
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795 
## 
## $BsmtQual
## var
##         Ex         Fa         Gd         TA 
## 0.08503162 0.02459592 0.43429375 0.45607871 
## 
## $BsmtCond
## var
##          Fa          Gd          Po          TA 
## 0.031623331 0.045678145 0.001405481 0.921293043 
## 
## $BsmtExposure
## var
##         Av         Gd         Mn         No 
## 0.15541491 0.09423347 0.08016878 0.67018284 
## 
## $BsmtFinType1
## var
##        ALQ        BLQ        GLQ        LwQ        Rec        Unf 
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850 
## 
## $BsmtFinType2
## var
##         ALQ         BLQ         GLQ         LwQ         Rec         Unf 
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010 
## 
## $Heating
## var
##        Floor         GasA         GasW         Grav         OthW 
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630 
##         Wall 
## 0.0027397260 
## 
## $HeatingQC
## var
##           Ex           Fa           Gd           Po           TA 
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849 
## 
## $CentralAir
## var
##          N          Y 
## 0.06506849 0.93493151 
## 
## $Electrical
## var
##       FuseA       FuseF       FuseP         Mix       SBrkr 
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880 
## 
## $KitchenQual
## var
##         Ex         Fa         Gd         TA 
## 0.06849315 0.02671233 0.40136986 0.50342466 
## 
## $Functional
## var
##         Maj1         Maj2         Min1         Min2          Mod 
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726 
##          Sev          Typ 
## 0.0006849315 0.9315068493 
## 
## $FireplaceQu
## var
##         Ex         Fa         Gd         Po         TA 
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351 
## 
## $GarageType
## var
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144 
## 
## $GarageFinish
## var
##       Fin       RFn       Unf 
## 0.2552574 0.3060189 0.4387237 
## 
## $GarageQual
## var
##          Ex          Fa          Gd          Po          TA 
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905 
## 
## $GarageCond
## var
##          Ex          Fa          Gd          Po          TA 
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352 
## 
## $PavedDrive
## var
##          N          P          Y 
## 0.06164384 0.02054795 0.91780822 
## 
## $PoolQC
## var
##        Ex        Fa        Gd 
## 0.2857143 0.2857143 0.4285714 
## 
## $Fence
## var
##      GdPrv       GdWo      MnPrv       MnWw 
## 0.20996441 0.19217082 0.55871886 0.03914591 
## 
## $MiscFeature
## var
##       Gar2       Othr       Shed       TenC 
## 0.03703704 0.03703704 0.90740741 0.01851852 
## 
## $SaleType
## var
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726 
##         New         Oth          WD 
## 0.083561644 0.002054795 0.867808219 
## 
## $SaleCondition
## var
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438

You can judge for yourself which of these variables are appropriate or not.
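If you would rather apply a rule than judge by eye, one option is to flag any variable in which a single category holds nearly all of the observations. Below is a minimal sketch in Python, used here only as a neutral illustration language. The function name dominant_category and the 0.95 cut-off are assumptions made for illustration, not a rule from this analysis. The proportions are copied from the output above and rounded to four decimals.

```python
def dominant_category(proportions, threshold=0.95):
    """Return (category, share) if one category reaches the threshold, else None."""
    category, share = max(proportions.items(), key=lambda item: item[1])
    return (category, share) if share >= threshold else None


# Proportions copied from the output above, rounded to four decimals.
tables = {
    "Street": {"Grvl": 0.0041, "Pave": 0.9959},
    "Utilities": {"AllPub": 0.9993, "NoSeWa": 0.0007},
    "LotShape": {"IR1": 0.3315, "IR2": 0.0281, "IR3": 0.0068, "Reg": 0.6336},
}

for name, props in tables.items():
    print(name, dominant_category(props))
```

With this cut-off, Street and Utilities are flagged because almost every house falls into one category, while LotShape is not.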

Conclusion

This post provided an example of data exploration. Through this analysis we have a better understanding of the characteristics of the dataset. This information can be used for further analysis and/or model development.

Scatterplot in LibreOffice Calc

A scatterplot is used to observe the relationship between two continuous variables. This post will explain how to make a scatterplot and calculate correlation in LibreOffice Calc.

Scatterplot

In order to make a scatterplot you need two columns of data. Below are the first few rows of the data in this example.

Var 1 Var 2
3.07 2.95
2.07 1.90
3.32 2.75
2.93 2.40
2.82 2.00
3.36 3.10
2.86 2.85

Given the nature of this dataset, there was no need to make any preparation.

To make the plot you need to select the two columns with data in them and click on insert -> chart and you will see the following.

[Screenshot: Chart Wizard, chart type selection]

Be sure to select the XY (Scatter) option and click next. You will then see the following.

[Screenshot: Chart Wizard, data range options]

Be sure to select “data series in columns” and “first row as label.” Then click next and you will see the following.

[Screenshot: Chart Wizard, data series]

There is nothing to modify in this window. If you wanted, you could add more data to the plot as well as label data, but neither of these options applies to us. Therefore, click next.

[Screenshot: Chart Wizard, chart elements with title and axis labels]

In this last window, you can see that we gave the chart a title and labeled the X and Y axes. We also removed the “display legend” feature by unchecking it. A legend is normally not needed when making a scatterplot. Once you add this information click “finish” and you will see the following.

[Figure: the finished scatterplot]

There are many other ways to modify the scatterplot, but we will now look at how to add a trend line.

To add a trend line you need to click on the data inside the plot so that it turns green as shown below.

[Screenshot: scatterplot with the data points selected]

Next, click on insert -> trend line and you will see the following.

[Screenshot: trend line dialog]

For our purposes, we want to select the “linear” option. Generally, the line is hard to see if you immediately click “ok”. Instead, we will click on the “Line” tab and adjust as shown below.

[Screenshot: Line tab of the trend line dialog]

All we did was simply change the color of the line to black and increase the width to 0.10. When this is done, click “ok” and you will see the following.

[Figure: scatterplot with a black linear trend line]

The scatterplot is now complete. We will now look at how to calculate the correlation between the two variables.
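For readers curious about what a linear trend line represents, here is a minimal Python sketch of an ordinary least-squares fit. This assumes the “linear” option fits a straight line by least squares. The function name linear_trend is hypothetical, and the seven rows are only the sample shown earlier, not the full dataset, so the fitted line will differ from the one in the chart.

```python
def linear_trend(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The first seven rows shown earlier in this post (a sample, not the full data).
var1 = [3.07, 2.07, 3.32, 2.93, 2.82, 3.36, 2.86]
var2 = [2.95, 1.90, 2.75, 2.40, 2.00, 3.10, 2.85]

slope, intercept = linear_trend(var1, var2)
print(round(slope, 3), round(intercept, 3))
```

The slope tells you how much Var 2 changes, on average, for each one-unit change in Var 1.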

Correlation

The correlation is essentially a number that captures what you see in a scatterplot. To calculate the correlation, do the following.

  1. Select the two columns of data
  2. Click on data -> statistics -> correlation and you will see the following

[Screenshot: Correlation dialog]

  3. In the “Results to” field, pick an empty place on the spreadsheet to show the results. Click ok and you will see the following.

Correlations Column 1 Column 2
Column 1 1
Column 2 0.413450002676874 1

The output uses generic column names, so you have to rename the columns with the appropriate variable names yourself. Apart from this, the correlation has been calculated: here it is about 0.41.
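To see where such a number comes from, here is a minimal Python sketch of the Pearson correlation coefficient, assuming that this is the statistic the correlation tool reports. The function name pearson is hypothetical. The data are only the seven sample rows shown earlier, so the result will not match the 0.41 above, which was calculated on the full dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


# The first seven rows shown earlier in this post (a sample, not the full data).
var1 = [3.07, 2.07, 3.32, 2.93, 2.82, 3.36, 2.86]
var2 = [2.95, 1.90, 2.75, 2.40, 2.00, 3.10, 2.85]

print(round(pearson(var1, var2), 3))
```

The value always falls between -1 and 1, and the closer it is to either end, the more tightly the points in the scatterplot follow a straight line.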

Conclusion

This post provided an explanation of calculating correlations and creating scatterplots in LibreOffice Calc. Data visualization is a critical aspect of communicating effectively and such tools as Calc can be used to support this endeavor.

Graphs in LibreOffice Calc

The LibreOffice Suite is a free open-source office suite that is considered an alternative to Microsoft Office Suite. The primary benefit of LibreOffice is that it offers similar features to Microsoft Office without having to spend any money. In addition, LibreOffice is legitimately free and is not some sort of nefarious pirated version of Microsoft Office, which means that anyone can use LibreOffice without legal concerns on as many machines as they desire.

In this post, we will go over how to make plots and graphs in LibreOffice Calc. LibreOffice Calc is the equivalent to Microsoft Excel. We will learn how to make the following visualizations.

  • Bar graph
  • Histogram

Bar Graph

We are going to make a bar graph from a single column of data in LibreOffice Calc. To make a visualization you need to aggregate some data. For this post, I simply made some random data that uses a Likert scale of SD, D, N, A, SA. Below is a sample of the first five rows of the data.

Var 1
N
SD
D
SD
D

In order to aggregate the data you need to make bins and count the frequency of each category in the bin. Here is how you do this. First you make a variable called “bin” in a column and you place SD, D, N, A, and SA each in their own row in the column you named “bin” as shown below.

bin
SD
D
N
A
SA

In the next column, you create a variable called “freq”. In each row of this column you need to use the COUNTIF function as shown below.

=COUNTIF(1st value in data: last value in data, criteria for counting)

Below is how this looks for my data.

=COUNTIF(A2:A177,B2)

What I told LibreOffice was that my data is in A2 to A177 and that it needs to count a row if it contains the same data as B2, which for me contains SD. You repeat this process four more times, adjusting the last argument in the function. When I finished, this is what I had.

bin freq
SD 35
D 47
N 56
A 32
SA 5
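The same counting can be sketched outside of a spreadsheet. Below is a minimal Python example that does what the five COUNTIF formulas do, using only the first five sample rows shown earlier, so the counts are much smaller than in the table above. The function name count_by_bin is hypothetical.

```python
from collections import Counter


def count_by_bin(responses, bins):
    """Count how often each bin label occurs, keeping the order of the bins."""
    counts = Counter(responses)
    return [(label, counts[label]) for label in bins]


bins = ["SD", "D", "N", "A", "SA"]
sample = ["N", "SD", "D", "SD", "D"]  # the first five rows shown earlier

for label, freq in count_by_bin(sample, bins):
    print(label, freq)
```

Listing the bins explicitly keeps categories with a count of zero in the result, just as a COUNTIF cell would show 0.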

We can now proceed to making the visualization.

To make the bar graph you need to first highlight the data you want to use. For us the information we want to select is the “bin” and “freq” variables we just created. Keep in mind that you never use the raw data but rather the aggregated data. Then click insert -> chart and you will see the following.

[Screenshot: Chart Wizard, chart type selection]

Simply click next, and you will see the following.

[Screenshot: Chart Wizard, data range options]

Make sure that the last three options are selected or your chart will look funny. Data series in rows or in columns has to do with whether the data is read in a long or short form. First row as label makes sure that Calc does not insert “bin” and “freq” in the graph. First column as label helps in identifying what the values are in the plot.

Once you click next you will see the following.

[Screenshot: Chart Wizard, data series]

This window normally does not need adjusting, and trying to do so can be confusing. It does allow you to adjust the range of the data and even add more data to your chart. For now, we will click on next and see the following.

[Screenshot: Chart Wizard, chart elements with title and legend option]

In the last window above, you can add a title and label the axes if you want. You can see that I gave my graph a name. In addition, you can decide if you want to display a legend if you look to the right. For my graph, that was not adding any additional information so I unchecked this option. When you click finish you will see the following on the spreadsheet.

[Figure: the finished bar graph]

Histogram

Histograms are for continuous data. Therefore, I converted my SD, D, N, A, SA to 1, 2, 3, 4, and 5. All the other steps are the same as above. The one difference is that you want to remove the spacing between bars. Below is how to do this.

Click on one of the bars in the bar graph until you see the green squares as shown below.

[Screenshot: bar graph with the bars selected]

After you do this, there should be a new toolbar at the top of the spreadsheet. You need to click on the green and blue cube as shown below.

[Screenshot: toolbar with the green and blue cube icon]

In the next window, you need to change the spacing to zero percent. This will change the bar graph into a histogram. Below is what the settings should look like.

[Screenshot: spacing set to zero percent]

When you click ok you should see the final histogram shown below.

[Figure: the final histogram]

For free software this is not too bad. There are a lot of options that were left unexplained, especially in regard to how you can manipulate the colors of everything and even make the plots 3D.

Conclusion

LibreOffice provides an alternative to paying for Microsoft products. The examples above show that Calc is capable of making visually appealing graphs just as Excel is.