Category Archives: Data Visualization

Intro to Matplotlib with Python VIDEO

Matplotlib is one of the most commonly used data visualization libraries in Python. In this video, we will go over some introductory commands. Knowing these basics will allow anybody to make simple adjustments to their visualizations.


Principal Component Analysis with Python VIDEO

Principal component analysis is a tool for reducing the number of variables in a dataset without losing too much information. This is a great way to summarize information or to simplify a dataset before a more complex analysis. The video provides a simple example of how to do this.


Data Visualization with Altair VIDEO

Python has a great library called Altair that makes it really easy to make various data visualizations. The primary strengths of this particular library are how easy it is to use and its support for interactive plots. The video below provides an introduction to using this innovative tool.



Visualizations with Altair

We are going to take a look at Altair, which is a data visualization library for Python. What is unique about Altair compared to other packages covered on this blog is that it allows for interaction.


The interactions can take place inside Jupyter, or they can be exported and loaded onto websites, as we shall see. In the past, building interactions for websites was often done with a JavaScript library such as d3.js. D3.js works but is cumbersome for the average non-coder. Altair solves this problem, as Python is often seen as easier to work with than JavaScript.

Installing Altair

If Altair is not already installed on your computer, you can install it with one of the following commands.

pip install altair vega_datasets

OR

conda install -c conda-forge altair vega_datasets

Which one of the lines above you use will depend on the type of Python installation you have.

Goal

We are going to make some simple visualizations with Altair using the “Duncan” dataset from the pydataset library. If you do not have pydataset installed on your computer, you can use the commands listed above to install it; simply replace “altair vega_datasets” with “pydataset”. Below is the initial code followed by the output.

import pandas as pd
from pydataset import data
df=data("Duncan")
df.head()

In the code above, we load pandas and import “data” from the “pydataset” library. Next, we load the “Duncan” dataset as the object “df”. Lastly, we use the .head() function to take a look at the dataset. You can see in the output what variables are available.

Our first visualization is a simple bar graph. The code is below followed by the visualization.

import altair as alt
alt.Chart(df).mark_bar().encode(
    x="type",
    y="prestige"
)

In the code above we did the following:

  1. Line 1 loads the altair library.
  2. Line 2 chains several functions together to make the bar graph. .Chart(df) loads the data for the plot. .mark_bar() assigns the geometric shape for the plot, which in this case is bars. Lastly, the .encode() function contains the variables that will be assigned to the x and y axes. In this case, we are looking at job type and prestige.

The three dots in the upper right provide options for saving or editing the plot. We will learn more about saving plots later. In addition, Altair follows the grammar of graphics for creating plots. This has been discussed in another post, but a summary of the components is below.

  • Data
  • Aesthetics
  • Scale
  • Statistical transformation
  • Geometric object
  • Facets
  • Coordinate system

We will not deal with all of these, but we have dealt with the following:

  • Data as .Chart()
  • Aesthetics and Geometric object as .mark_bar()
  • Coordinate system as .encode()

In our second example, we will make a scatterplot. The code and output are below.

alt.Chart(df).mark_circle().encode(
    x="education",
    y="prestige"
)

The code is mostly the same. We simply use .mark_circle() to indicate the type of geometric object. For .encode(), we made sure to use two continuous variables.

In the next plot, we add a categorical variable to the scatterplot by manipulating the color.

alt.Chart(df).mark_circle().encode(
    x= "education",
    y = "prestige",
    color='type'
)

The only change is the addition of the “color” argument, which is set to the categorical variable “type.”

It is also possible to use bubble size to encode another variable. In the plot below, we add the income variable to the plot through the size of the bubbles.

alt.Chart(df).mark_circle().encode(
    x= "education",
    y = "prestige",
    color='type',
    size="income"
)

The newest addition is the “size” argument, which maps income to the size of each point.

You can also place plots side by side by piping. The code below makes two plots and saves them as objects. Then you print both by typing the names of the objects separated by the pipe symbol (|). Below you will find two different plots created through this piping process.

educationPlot=alt.Chart(df).mark_circle().encode(
    x= "education",
    y = "prestige",
    color='type', 
)
incomePlot=alt.Chart(df).mark_circle().encode(
    x= "income",
    y = "prestige",
    color='type',
)
educationPlot | incomePlot

With this code you can make multiple plots. Simply keep adding pipes to make more plots.

Interaction and Saving Plots

It is also possible to make plots interactive. In the code below, we add the tooltip argument. This allows us to add an additional variable, “income,” to the chart. When the mouse hovers over a data point, the income will display.

However, since we are in a browser right now, this will not work unless we save the chart as an HTML file. The last line of code saves the plot as an HTML file and renders it using svg. We also remove the three dots in the upper right corner by adding 'actions':False. Below is the code and the plot once the HTML was loaded to this blog.

interact_plot=alt.Chart(df).mark_circle().encode(
    x= "education",
    y = "prestige",
    color='type',
    tooltip=["income"]
)
interact_plot.save('interact_plot.html',embed_options={'renderer':'svg','actions':False})

I’ve made a lot of visuals in the past, and never has it been this simple.

Conclusion

Altair is another tool for visualizations. It may be the easiest way to make complex and interactive charts that I have seen. As such, it is a great option whenever data needs to be visualized.

T-SNE Visualization and R

It is common in research to want to visualize data in order to search for patterns. When the number of features increases, this becomes even more important. Common tools for visualizing numerous features include principal component analysis and linear discriminant analysis. Not only do these tools work for visualization, they can also be beneficial in dimension reduction.

However, the available tools are not limited to these two options. Another option for achieving either of these goals is t-Distributed Stochastic Neighbor Embedding (t-SNE). This relatively young algorithm (2008) is the focus of this post. We will explain what it is and provide an example using a simple dataset from the Ecdat package in R.

t-sne Defined

t-SNE is a nonlinear dimension reduction and visualization tool. Essentially, what it does is identify observed clusters. However, it is not a clustering algorithm, because it reduces the dimensions (normally to 2) for visualization. This means that the input features are no longer present in their original form, which limits the ability to make inferences. Therefore, t-SNE is often used for exploratory purposes.

t-SNE's non-linear character is what often makes it appear superior to PCA, which is linear. Without getting too technical, t-SNE takes a global and a local approach to mapping points simultaneously, while PCA can only use a global approach.

The downside of the t-SNE approach is that it requires a large number of calculations. The calculations are often pairwise comparisons, which grow rapidly as datasets get larger.

Initial Packages

We will use the “Rtsne” package for the analysis, and we will use the “Fair” dataset from the “Ecdat” package. The “Fair” dataset contains survey data from a study of infidelity. We want to see if we can find patterns among the respondents based on their occupation. Below is some initial code.

library(Rtsne)
library(Ecdat)

Dataset Preparation

To prepare the data, we first remove the rows with missing data using the “na.omit” function. This is saved in a new object called “train”. Next, we change our outcome variable into a factor variable. The categories range from 1 to 9:

  1. Farm laborer, day laborer,
  2. Unskilled worker, service worker,
  3. Machine operator, semiskilled worker,
  4. Skilled manual worker, craftsman, police,
  5. Clerical/sales, small farm owner,
  6. Technician, semiprofessional, supervisor,
  7. Small business owner, farm owner, teacher,
  8. Mid-level manager or professional,
  9. Senior manager or professional.

Below is the code.

train<-na.omit(Fair)
train$occupation<-as.factor(train$occupation)

Visualization Preparation

Before we do the analysis we need to set the colors for the different categories. This is done with the code below.

colors<-rainbow(length(unique(train$occupation)))
names(colors)<-unique(train$occupation)

We can now do our analysis using the “Rtsne” function. When you input the dataset, you must exclude the dependent variable as well as any other factor variables. You also set the dimensions and the perplexity. Perplexity determines how many neighbors are used to determine the location of a datapoint after the calculations. Verbose just provides information during the calculation, which is useful if you want to know what progress is being made. max_iter is the number of iterations to take to complete the analysis, and check_duplicates checks for duplicates, which could be a problem in the analysis. Below is the code.

tsne<-Rtsne(train[,-c(1,4,7)],dims=2,perplexity=30,verbose=T,max_iter=1500,check_duplicates=F)
## Performing PCA
## Read the 601 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.05 seconds (sparsity = 0.190597)!
## Learning embedding...
## Iteration 1450: error is 0.280471 (50 iterations in 0.07 seconds)
## Iteration 1500: error is 0.279962 (50 iterations in 0.07 seconds)
## Fitting performed in 2.21 seconds.

Below is the code for making the visual.

plot(tsne$Y,t='n',main='tsne',xlim=c(-30,30),ylim=c(-30,30))
text(tsne$Y,labels=train$occupation,col = colors[train$occupation])
legend(25,5,legend=unique(train$occupation),col=colors,pch=1)


You can see that there are clusters; however, the different occupations are mixed within them. What this indicates is that the features we used to make the two dimensions do not discriminate between the occupations.

Conclusion

t-SNE is another way to visualize data. This is not to say that there is no place for PCA anymore. Rather, this newer approach provides a different way of quickly visualizing complex data without the limitations of PCA.


Force-Directed Graph with D3.js

Network visualizations involve displaying interconnected nodes commonly associated with social networks. D3.js has powerful capabilities to create these visualizations. In this post, we will learn how to make a simple force-directed graph.

A force-directed graph uses an algorithm that spaces the nodes in the graph away from each other based on values you set. There are several different ways to determine how the force influences the distance of the nodes from each other, which will be explored somewhat in this post.


The Data

To make the visualization, it is necessary to have data. We will use a simple JSON file that has nodes and edges. Below is the code.

{
  "nodes": [
    { "name": "Tom" },
    { "name": "Sue" },
    { "name": "Jina" },
    { "name": "Soli" },
    { "name": "Lala" }
  ],
  "edges": [
    { "source": 0, "target": 1 },
    { "source": 0, "target": 4 },
    { "source": 0, "target": 3 },
    { "source": 0, "target": 4 },
    { "source": 0, "target": 2 },
    { "source": 1, "target": 2 }
  ]
}

The nodes in this situation will represent the circles that we will make. In this case, the nodes have names. However, we will not print the names in the visualization for the sake of simplicity. The edges represent the lines that will connect the circles/nodes. The source number is the origin of the line, and the target number is where the line ends. For example, "source": 0 represents Tom, and "target": 1 means draw a line from Tom to Sue.

Setup

To begin the visualization, we have to create the svg element inside our HTML document. Lines 6-17 do this as shown below.


Next, we need to create the layout of the graph. The .nodes() and .links() functions affect the location of the nodes and links. The .size() affects the gravitational center and initial position of the visualization. There is also some code that is commented out that will be discussed later. Below are lines 18-25 of our code.


Now we can write the code that will render or draw our objects. We need to append the edges and nodes, indicate the color for both, and set the radius of the circles for the nodes. All of this is captured in lines 26-44.


The final step is to handle the ticks. To put it simply, the tick handler recalculates the position of the nodes. Below is the code for this.

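Putting the steps above together, a minimal sketch of the program might look like the following (this assumes D3 v3, whose d3.layout.force API matches the method names used in this post; the file name "force.json" is hypothetical):

<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
var w = 500, h = 500;

// create the svg element (lines 6-17)
var svg = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h);

d3.json("force.json", function(error, dataset) {

  // the layout of the graph (lines 18-25)
  var force = d3.layout.force()
      .nodes(dataset.nodes)
      .links(dataset.edges)
      .size([w, h])
      //.linkDistance(100)   // commented out for now; discussed below
      //.charge(-100)
      .start();

  // render the edges and nodes (lines 26-44)
  var edges = svg.selectAll("line")
      .data(dataset.edges)
      .enter().append("line")
      .style("stroke", "#ccc");

  var nodes = svg.selectAll("circle")
      .data(dataset.nodes)
      .enter().append("circle")
      .attr("r", 10)
      .style("fill", "steelblue")
      .call(force.drag);

  // the ticks: recalculate the position of nodes and edges at each step
  force.on("tick", function() {
    edges.attr("x1", function(d) { return d.source.x; })
         .attr("y1", function(d) { return d.source.y; })
         .attr("x2", function(d) { return d.target.x; })
         .attr("y2", function(d) { return d.target.y; });
    nodes.attr("cx", function(d) { return d.x; })
         .attr("cy", function(d) { return d.y; });
  });
});
</script>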

We can finally see our visual, as shown below.

You can clearly see that the nodes are on top of each other. This is because we need to adjust the use of the force in the force-directed graph. There are many ways to adjust this, but we will look at two functions. These are .linkDistance() and .charge().

The .linkDistance() function indicates how far nodes are from each other at the end of the simulation. To add this to our code, you need to remove the comments on line 22 as shown below.

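In the sketch above, that means uncommenting this line:

.linkDistance(100)   // linked nodes settle about 100 pixels apart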

Below is an update of what our visualization looks like.

Things are better, but the nodes are still on top of each other. The real difference is that the edges are longer. To fix this, we need to use the .charge() function. The .charge() function indicates how much nodes are attracted to or repel each other. To use this function, you need to remove the comments on line 23 as shown below.

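In the sketch above, that is this line:

.charge(-100)   // a negative charge pushes the nodes away from each other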

The negative charge will cause the nodes to push away from each other. Below is what this looks like.

You can see that as the nodes were moved around, they stayed far from each other. This is because of the negative charge. Of course, there are other ways to modify this visualization, but this is enough for now.

Conclusion

Force-directed graphs are a powerful tool for conveying what could be a large amount of information. This post provided some simple ways that this visualization can be developed and utilized for practical purposes.

Pie Charts with D3.js

Pie charts are one of many visualizations that you can create using D3.js. We are going to learn how to do the following in this post.

  • Make a circle
  • Make a donut
  • Make a pie wedge
  • Make a segment
  • Make a pie chart

Most of these examples require just a minor adjustment in a standard piece of code. This will make more sense in a minute.

Make a Circle

Making a circle involves using the .arc() method. In this method there are four parameters that you manipulate to get different shapes. They are explained below

  • .innerRadius() This parameter makes a hole in your circle to give the appearance of a donut
  • .outerRadius() Determines the size of your circle
  • .startAngle() is used in combination with .endAngle() to make a pie wedge

Therefore, to make several different shapes, we manipulate these different parameters. In the code below, we create the svg element first (lines 7-10), then our circle (lines 11-15). Lastly, we append the path to the svg element (lines 16-21). Below is the code and picture.

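A minimal sketch of that code, assuming D3 v3's d3.svg.arc generator (which matches the method names above):

<script>
// the svg element (lines 7-10)
var svg = d3.select("body").append("svg")
    .attr("width", 400)
    .attr("height", 400);

// the arc; a full circle runs from 0 to 2 * PI radians (lines 11-15)
var arc = d3.svg.arc()
    .innerRadius(0)        // 0 gives a solid circle
    .outerRadius(170)
    .startAngle(0)
    .endAngle(Math.PI * 2);

// append the path and move it to the center (lines 16-21)
svg.append("path")
    .attr("d", arc)
    .attr("fill", "steelblue")
    .attr("transform", "translate(200, 200)");
</script>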

The rest of the examples primarily deal with manipulating the existing code.

Donut

To make the donut, you need to change the value inside the .innerRadius() parameter. The larger the value, the bigger the hole in the middle of the circle becomes. In order to generate the donut, you need to change the value found in line 12 of the code to 100.


Pie Wedge

To make a wedge you need to replace lines 14-15 of the code with the following

.startAngle(0*Math.PI * 2/360)

.endAngle(90*Math.PI * 2/360);

This is telling d3.js to start the angle at 0 degrees and stop at 90 degrees. This is another way of saying we make a wedge that is 1/4 of the circle. Doing this will create the following.


Segment

In order to make the segment, you keep the code the same as above for the pie wedge but make a change to the .innerRadius() parameter, as shown below.

.innerRadius(100)

.outerRadius(170)

.startAngle(0*Math.PI * 2/360)

.endAngle(90*Math.PI * 2/360);


Pie Chart

A pie chart is just a more complex version of what we have already done. You still need to set up your svg element. This is done in lines 7-14. Notice that we also had to add a g element and a transform attribute.

Line 16 contains the data for making the pie chart. This is hard coded, but we can also use other forms of data. Line 17 uses the .pie() method with the data to set the stage for our pie chart.

Lines 19-27 are for generating the arc. This code is mostly the same except for the functions that are used for the .startAngle() and .endAngle() methods. Line 29 sets the color, and lines 30-42 draw the paths for creating the image. The code with the // will be explained in a moment. Below is the code and the pie chart.

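A hedged sketch of the full pie chart, again assuming D3 v3 (d3.layout.pie and d3.scale.category10); the data values are hypothetical:

<script>
var w = 400, h = 400;

// svg plus a g element translated to the center (lines 7-14)
var g = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h)
  .append("g")
    .attr("transform", "translate(" + w / 2 + "," + h / 2 + ")");

var data = [10, 20, 30, 40];       // hard-coded data (line 16, hypothetical values)
var pie = d3.layout.pie();         // computes start and end angles (line 17)

// the arc generator reads the angles computed by the pie layout (lines 19-27)
var arc = d3.svg.arc()
    .innerRadius(0)                // change to 50 for the donut variation
    .outerRadius(170);

var color = d3.scale.category10(); // the color scale (line 29)

// one path per wedge (lines 30-42)
g.selectAll("path")
    .data(pie(data))
    .enter().append("path")
    .attr("d", arc)
    // remove the // on the two lines below (lines 36-37) to separate the wedges
    //.style("stroke", "white")
    //.style("stroke-width", "6px")
    .style("fill", function(d, i) { return color(i); });
</script>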

Pie Chart Variation

Below is the same pie chart but with some different features.

  • It now has a donut (line 20 change .innerRadius(0) to .innerRadius(50))
  • There are now separations between the different segments to give the appearance that it is pulling apart. Remove the // in lines 36-37 to activate this.

Below is the pie chart


You can perhaps now see that the possibilities are endless with this.

Conclusion

Pie charts and their related visualizations are another option for communicating insights from data.

Dendrogram with D3.js

A dendrogram is a type of hierarchical visualization commonly used in data science with hierarchical clustering. In d3.js, the dendrogram looks slightly different in that the root is in the center and the nodes branch out from there. This idea continues for every parent-child relationship in the data.

In this post, we will make a simple dendrogram with d3.js using the simple JSON file shown below.

{
  "name": "President",
  "children": [
    { "name": "VP of Academics",
      "children": [
        { "name": "Dean of Psych" },
        { "name": "Dean of Ed" }
      ] },
    { "name": "VP of Finance" },
    { "name": "VP of Students" }
  ]
}

You will need to save the code above as a JSON file if you want to make this example. In the code, this JSON example is called “university.json”.

Making the Dendrogram

We will begin by setting up the basic svg element, as found in lines 1-19 below.

In lines 20-26, the radius of the nodes is set and added to the svg element. The clusters are set with the creation of the cluster variable. Below is the code.


In lines 27-37, we set the position of the nodes and links as well as diagonals for the path.


Lines 49-80 have three main blocks of code.

The first block creates a variable called nodeGroups. This code provides a place to hold the nodes and the text that we are going to create.

The second block of code adds circles to the nodeGroups with additional settings for the color and radius size. Lastly, the third block of code adds text to the nodeGroups.

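Putting the pieces together, a minimal sketch (assuming D3 v3's d3.layout.cluster and d3.svg.diagonal.radial, which match the radial layout described above):

<script>
var radius = 300;

// the svg element, centered for a radial layout (lines 1-19)
var svg = d3.select("body").append("svg")
    .attr("width", radius * 2)
    .attr("height", radius * 2)
  .append("g")
    .attr("transform", "translate(" + radius + "," + radius + ")");

// the cluster layout: x becomes an angle, y a distance from the center (lines 20-26)
var cluster = d3.layout.cluster()
    .size([360, radius - 120]);

// radial diagonals for the paths between parent and child (lines 27-37)
var diagonal = d3.svg.diagonal.radial()
    .projection(function(d) { return [d.y, d.x / 180 * Math.PI]; });

d3.json("university.json", function(error, root) {
  var nodes = cluster.nodes(root);
  var links = cluster.links(nodes);

  svg.selectAll("path")
      .data(links)
      .enter().append("path")
      .attr("d", diagonal)
      .style("fill", "none")
      .style("stroke", "#ccc");

  // nodeGroups: one g per node, holding its circle and its text (lines 49-80)
  var nodeGroups = svg.selectAll("g.node")
      .data(nodes)
      .enter().append("g")
      .attr("transform", function(d) {
        return "rotate(" + (d.x - 90) + ")translate(" + d.y + ")";
      });

  nodeGroups.append("circle")
      .attr("r", 6)
      .style("fill", "steelblue");

  nodeGroups.append("text")
      .attr("dx", 8)
      .attr("dy", 3)
      .text(function(d) { return d.name; });
});
</script>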

When all the steps have been taken, you will see the finished dendrogram when you run the code.

You can see how things worked. The center is the president, with the VP of Academics off to the right. The president has three people under him (all VPs), while the VP of Academics has two people under him (deans). The colors for the paths and the circles are all set in the last block of code that was discussed.

Conclusion

Dendrograms allow you to communicate insights about data in a distinctly creative way. This example was simplistic, but it serves the purpose of exposing you to another way of using d3.js.

Path Generators in D3.js

D3.js has the ability to draw rather complex shapes using techniques called path generators. This will allow you to do many things that would otherwise be difficult when working with visualizations.

In this post, we will look at three different examples of path generation involving d3.js. The three examples are as follows

  • Making a triangle
  • Duplicating the triangle
  • Making an area chart

Making a Triangle

To make a triangle, we start by appending an svg to the body element of our document (lines 7-12). Then we create an array that has three coordinates, because that is how many coordinates a triangle needs (lines 13-17). After this, we create a variable called generate and use the .svg.line() method to draw the actual lines that we want (lines 18-20).

After this, we append the data and other characteristics to the svg variable. This allows us to set the fill color and the line color (lines 21-28). Below is the code followed by the output.

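A minimal sketch of the triangle code (assuming D3 v3's d3.svg.line; the coordinates are hypothetical):

<script>
// append the svg (lines 7-12)
var svg = d3.select("body").append("svg")
    .attr("width", 400)
    .attr("height", 400);

// three [x, y] coordinates, one per corner (lines 13-17)
var points = [[200, 50], [350, 300], [50, 300]];

// the line generator turns the array into path instructions (lines 18-20)
var generate = d3.svg.line();

// append the path and set the fill and line color (lines 21-28)
svg.append("path")
    .attr("d", generate(points) + "Z")   // "Z" closes the shape
    .attr("fill", "steelblue")
    .attr("stroke", "black");
</script>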

A simple triangle was created with only a few lines of code.

Create a Triangle and Duplicate

This example is an extension of the previous one. Here, we create the same triangle, but then we duplicate it using the translate method. This is a powerful technique if you need to make the same shape more than once.

The new code is found in lines 29-36

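The duplicate reuses the same path data from the sketch above and shifts the copy with translate (the offset values here are hypothetical):

// duplicate the triangle and move it (lines 29-36)
svg.append("path")
    .attr("d", generate(points) + "Z")
    .attr("fill", "lightgreen")
    .attr("stroke", "black")
    .attr("transform", "translate(150, 60)");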

You can see that the new triangle was moved to a different position based on the arguments we gave it.

Area Chart

Area charts are often used to make line plots but with the area under the line being filled with a certain color. To achieve this we do the following.

  • Add our svg element to the body element (lines 7-10)
  • Create some random data using the .range() method and several functions (lines 12-18)
    • We generate 1000 random numbers (line 12) between 0 and 100 (line 14)
    • We set the Y values as random values of X in increments of 10 (lines 16-17)
  • We create a variable called generate to create the path using the .area() method (lines 20-23)
    • y0 has to do with the baseline of the area; the other two attributes are the X and Y values
  • We then append all this to the svg variable we created (lines 25-31)

Below is the code followed by the visual.

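A sketch of the area chart under the same D3 v3 assumption (d3.svg.area); the random-data recipe is a hypothetical stand-in for the original:

<script>
var w = 500, h = 300;

// the svg element (lines 7-10)
var svg = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h);

// 1000 points with random y values (lines 12-18)
var data = d3.range(1000).map(function(i) {
  return { x: i / 2, y: Math.random() * 100 };
});

// the area generator; y0 is the baseline of the area (lines 20-23)
var generate = d3.svg.area()
    .x(function(d) { return d.x; })
    .y0(h)
    .y1(function(d) { return h - d.y; });

// append the path (lines 25-31)
svg.append("path")
    .attr("d", generate(data))
    .attr("fill", "steelblue");
</script>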

Much more can be done with this but I think this makes the point for now.

Conclusion

This post was just an introduction to making paths with D3.js.

Drag, Pan, & Zoom Elements with D3.js

Mouse events can be combined to create some rather complex interactions using d3.js. Examples of these complex actions include dragging, panning, and zooming. These events are handled with tools called behaviors.

In this post, we will look at these three behaviors in two examples.

  • Dragging
  • Panning and zooming

Dragging

Dragging allows the user to move an element around on the screen. What we are going to do is make three circles of different colors that we can move around as we desire within the element. We start by setting the width and height of the svg element as well as the radius of the circles we will make (line 7). Next, we create our svg by appending it to the body element. We also set a black line around the element so that the user knows where the borders are (lines 8-14).

The next part involves setting the colors for the circles and then creating the circles and setting all of their attributes (lines 21-30). Setting the drag behavior comes later; we use the .drag() and .on() methods to create this behavior, and the .call() method connects it to our circles variable.

The last part is the onDrag function. This function retrieves the position of the moving element and transforms the element within the svg element (lines 36-46). This involves using an if statement as well as setting attributes. If this sounds confusing, below is the code followed by a visual of what the code does.

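A minimal sketch of the dragging example (assuming D3 v3's d3.behavior.drag; the clamping with Math.max/Math.min plays the role of the if statement described above):

<script>
var w = 400, h = 300, r = 25;   // sizes (line 7)

// the svg with a visible border (lines 8-14)
var svg = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h)
    .style("border", "1px solid black");

// three colored circles (lines 21-30)
var circles = svg.selectAll("circle")
    .data(["red", "green", "blue"])
    .enter().append("circle")
    .attr("r", r)
    .attr("cx", function(d, i) { return (i + 1) * 100; })
    .attr("cy", h / 2)
    .style("fill", function(d) { return d; });

// the drag behavior, connected to the circles with .call()
circles.call(d3.behavior.drag().on("drag", onDrag));

// keep each circle inside the element while it follows the mouse (lines 36-46)
function onDrag() {
  var x = Math.max(r, Math.min(w - r, d3.event.x));
  var y = Math.max(r, Math.min(h - r, d3.event.y));
  d3.select(this).attr("cx", x).attr("cy", y);
}
</script>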

If you look carefully, you will notice I can never move the circles beyond the border. This is because the border represents the edge of the element. This is important because you can limit how far an element can travel by determining the size of the element's space.

Panning and Zooming

Panning allows you to move all the visuals around at once inside an element. Zooming allows you to expand or contract what you see. Most of this code is an extension of what we did in the previous example. The new additions are explained below.

  1. A variable called zoomAction sets the zoom behavior by determining the scale of the zoom and setting the .on() method (Lines 9-11)
  2. We add the .call() method to the svg variable as well as the .append(‘g’) so that this behavior can be used (Lines 20-21).
  3. The dragAction variable is created to allow us to pan or move the entire element around. This same variable is placed inside a .call() method for the circles variable that was created earlier (Lines 40-46).
  4. Lines 48-60 update the position of the element with two functions. The onDrag function deals with panning, and the onZoom function deals with zooming.

Below is the code and a visual of what it does.
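A condensed sketch of that code, again assuming D3 v3 (d3.behavior.zoom and d3.behavior.drag):

<script>
var w = 400, h = 300;

// zoomAction: the zoom behavior; scaleExtent limits the zoom (lines 9-11)
var zoomAction = d3.behavior.zoom()
    .scaleExtent([0.5, 4])
    .on("zoom", onZoom);

// attach the behavior to the svg and append a g to transform (lines 20-21)
var g = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h)
    .style("border", "1px solid black")
    .call(zoomAction)
  .append("g");

var circles = g.selectAll("circle")
    .data(["red", "green", "blue"])
    .enter().append("circle")
    .attr("r", 25)
    .attr("cx", function(d, i) { return (i + 1) * 100; })
    .attr("cy", h / 2)
    .style("fill", function(d) { return d; });

// dragAction: lets individual circles be moved (lines 40-46)
var dragAction = d3.behavior.drag()
    .on("dragstart", function() { d3.event.sourceEvent.stopPropagation(); })
    .on("drag", onDrag);
circles.call(dragAction);

// update positions for zooming/panning and for dragging (lines 48-60)
function onZoom() {
  g.attr("transform",
         "translate(" + d3.event.translate + ")scale(" + d3.event.scale + ")");
}
function onDrag() {
  d3.select(this).attr("cx", d3.event.x).attr("cy", d3.event.y);
}
</script>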

You can clearly see that we can move the circles individually or as a group. In addition, you can also see how we can zoom in and out. Unlike the first example, this example allows you to leave the border. This is probably due to the zoom capability.

Conclusion

The behaviors shared here provide additional tools that you can use as you design visuals using D3.js. There are other more practical ways to use these tools as we shall see.

Intro to Interactivity with D3.js

D3.js provides many ways in which the user can interact with visual data. Interaction with a visual can help the user to better understand the nature and characteristics of the data, which can lead to insights. In this post, we will look at three basic examples of interactivity involving mouse events.

Mouse events are actions taken by the browser in response to some action by the mouse. The handler for mouse events is primarily the .on() method. The three examples of mouse events in this post are listed below.

  • Tracking the mouse’s position
  • Highlighting an element based on mouse position
  • Communicating to the user when they have clicked on an element

Tracking the Mouse's Position

The code for tracking the mouse's position is rather simple. What is new is that we need to create a variable that appends a text element to the svg element. When we do this, we need to indicate the position and size of the text as well.

Next, we need to use the .on() method on the svg variable we created. Inside this method is the type of behavior to monitor, which in this case is the movement of the mouse. We then create a simple way for the browser to display the x, y coordinates. Below is the code followed by the actual visual.

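A minimal sketch of the tracking code (assuming D3 v3; d3.mouse(this) returns the pointer's x, y position within the svg):

<script>
var svg = d3.select("body").append("svg")
    .attr("width", 400)
    .attr("height", 300);

// the text element that displays the coordinates
var label = svg.append("text")
    .attr("x", 20)
    .attr("y", 30)
    .attr("font-size", "20px");

// watch the mouse and report its position
svg.on("mousemove", function() {
  var pos = d3.mouse(this);
  label.text("x: " + pos[0] + ", y: " + pos[1]);
});
</script>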

You can see that as the mouse moves the x,y coordinates move as well. The browser is watching the movement of the mouse and communicating this through the changes in the coordinates in the clip above.

Highlighting an Element Based on Mouse Position

This example allows an element to change color when the mouse comes in contact with it. To do this we need to create some data that will contain the radius of four circles with their x,y position (line 13).

Next we use the .selectAll() method to select all circles, load the data, enter the data, append the circles, set the color of the circles to green, and create a function that sets the position of the circles (lines 15-26).

Lastly, we will use the .on() function twice. Once will be for when the mouse touches the circle and the second time for when the mouse leaves the circle. When the mouse touches a circle the circle will turn black. When the mouse leaves a circle the circle will return to the original color of green (lines 27-32). Below is the code followed by the visual.

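A sketch of the highlighting example (D3 v3 assumed; for simplicity the data here holds only the radii, with positions computed from the index):

<script>
var svg = d3.select("body").append("svg")
    .attr("width", 450)
    .attr("height", 300);

var data = [30, 25, 40, 20];   // radii of the four circles (hypothetical values)

// draw the circles in green (lines 15-26)
var circles = svg.selectAll("circle")
    .data(data)
    .enter().append("circle")
    .attr("r", function(d) { return d; })
    .attr("cx", function(d, i) { return 80 + i * 100; })
    .attr("cy", 150)
    .style("fill", "green");

// black on contact, back to green when the mouse leaves (lines 27-32)
circles
    .on("mouseover", function() { d3.select(this).style("fill", "black"); })
    .on("mouseout", function() { d3.select(this).style("fill", "green"); });
</script>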

Indicating when a User Clicks on an Element

This example is an extension of the previous one. All the code is the same except you add the following at the bottom of the code right before the close of the script element.

.on('click', function (d, i) {
    alert(d + ' ' + i);
});

This .on() method has an alert inside the function. When this is used, it will tell the user when they have clicked on an element and will also tell the user the radius of the circle as well as what position in the array the data comes from. Below is the visual of this code.

Conclusion

You can perhaps see the fun that is possible with interaction when using D3.js. There is much more that can be done in ways that are much more practical than what was shown here.

Tweening with D3.js

Tweening is a tool that allows you to tell D3.js how to calculate attributes during transitions without tracking keyframes. The problem with keyframe tracking is that it can cause performance issues when there is a lot of animation.

We are going to look at three examples of the use of tweening in this post. The examples are as follows.

  • Counting numbers animation
  • Changing font size animation
  • Spinning shape animation

Counting Numbers Animation

This simple animation involves using the .tween() method to count from 0 to 25. The other information in the code determines the position of the element, the font-size, and the length of the animation.

In order to use the .tween() method, you must make a function. You first give the tween a name, followed by providing the arguments to be used. Inside the function, we indicate what it should do using the .interpolateRound() method, which tells d3.js to count from 0 to 25. Below is the code followed by the animation.

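A sketch of the counting animation (D3 v3 assumed; transition.tween takes a factory function that returns the function run at each step):

<script>
var svg = d3.select("body").append("svg")
    .attr("width", 300)
    .attr("height", 200);

var label = svg.append("text")
    .attr("x", 100)
    .attr("y", 100)
    .attr("font-size", "40px")
    .text("0");

label.transition()
    .duration(5000)
    .tween("text", function() {
      var interp = d3.interpolateRound(0, 25);   // count from 0 to 25
      var node = d3.select(this);
      return function(t) { node.text(interp(t)); };
    });
</script>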

You can see that the speed of the numbers is not constant. This is because we did not control for this.

Changing Font-Size Animation

The next example is more of the same. This time we simply make the size of some text change. To do this, you use the .text() method in your svg element. In addition, you now use the .styleTween() method. Inside this method, we use the .interpolate() method and set arguments for the font size at the beginning and the end of the animation. Below is the code and the animation.
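A sketch of that code (D3 v3 assumed; the text and the font sizes are hypothetical):

<script>
var svg = d3.select("body").append("svg")
    .attr("width", 400)
    .attr("height", 200);

svg.append("text")
    .attr("x", 50)
    .attr("y", 100)
    .text("Hello")
    .transition()
    .duration(5000)
    .styleTween("font-size", function() {
      // interpolate the style from its starting value to its ending value
      return d3.interpolate("10px", "60px");
    });
</script>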

Spinning Shape Animation

The last example is somewhat more complicated. It involves creating a shape that spins in place. To achieve this, we do the following.

  1. Set the width and height of the element
  2. Set the svg element to the width and height
  3. Append a group element to the svg element
  4. Transform and translate the g element in order to move it
  5. Append a path to the g element
  6. Set the shape to a diamond using the .symbol(), .type(), and .size() methods
  7. Set the color of the shape using .style()
  8. Set the .each() method to call the cycle function
  9. Create the cycle function
  10. Set the .transition() and .duration() methods
  11. Use the .attrTween() method with the .interpolateString() method to set the rotation of the spinning
  12. Finish with the .each() method

Below is the code followed by the animation.

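A sketch of the spinning diamond following the steps above (D3 v3 assumed):

<script>
var w = 300, h = 300;

// steps 1-2: the svg element
var svg = d3.select("body").append("svg")
    .attr("width", w)
    .attr("height", h);

// steps 3-4: a g element moved to the center
var g = svg.append("g")
    .attr("transform", "translate(" + w / 2 + "," + h / 2 + ")");

// steps 5-7: a diamond-shaped path
var shape = g.append("path")
    .attr("d", d3.svg.symbol().type("diamond").size(5000))
    .style("fill", "steelblue");

// steps 8-12: rotate the shape, then call cycle again when the transition ends
shape.each(cycle);

function cycle() {
  d3.select(this)
    .transition()
    .duration(3000)
    .attrTween("transform", function() {
      return d3.interpolateString("rotate(0)", "rotate(360)");
    })
    .each("end", cycle);
}
</script>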

This animation never stops because we are using a cycle.

Conclusion

Animations can be a lot of fun when using d3.js. The examples here may not be the most practical, but they provide you with an opportunity to look at the code and decide how you will want to use d3.js in the future.

Intro to Animation with D3.js

This post will provide an introduction to animation using D3.js. Animation simply changes the properties of a visual object over time. This can be useful for helping the viewer of the web page to understand key features of the data.

For now we will do the following in terms of animation.

  • Create a simple animation
  • Animate multiple properties
  • Create chained transitions
  • Handling Transitions

Create a Simple Animation

What is new for us in terms of d3.js code for animation is the use of the .transition() and .duration() methods. The transition method provides instructions on how to change a visual attribute over time. Duration is simply how long the transition takes.

In the code below, we are going to create a simple black rectangle that will turn white and disappear from the screen. This is done by appending an svg that contains a black rectangle to the body element and then having that rectangle turn white and disappear.


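A minimal sketch of that code (D3 v3 assumed; sizes and timings are hypothetical):

<script>
var svg = d3.select("body").append("svg")
    .attr("width", 400)
    .attr("height", 300);

// a black rectangle that fades to white over three seconds
svg.append("rect")
    .attr("x", 50)
    .attr("y", 50)
    .attr("width", 150)
    .attr("height", 100)
    .attr("fill", "black")
    .transition()
    .duration(3000)
    .attr("fill", "white");
</script>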

This is interesting but far from amazing. We simply changed the color, animating one property. Next, we learn how to animate more than one property at a time.

Animating Multiple Properties at Once

You are not limited to animating only one property. In the code below, we change the color and have the rectangle move as well. This is done by passing the x, y coordinates to the second .attr({}) method. The code and the animation are below.

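A sketch of the multi-property version; D3 v3 accepts an object of attributes in .attr({}), which matches the description above (coordinates are hypothetical):

// move and recolor at the same time
svg.append("rect")
    .attr({ x: 0, y: 0, width: 150, height: 100, fill: "black" })
    .transition()
    .duration(3000)
    .attr({ x: 250, y: 200, fill: "white" });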

You can see how the rectangle moves from the top left to the bottom right while also changing color from black to white, thus disappearing. Next, we will look at chained transitions.

Chained Transitions

Chained transitions involve having one animation take place, followed by a delay, and then another animation taking place. In order to do this, you need to use the .delay() method. This method tells the browser to wait a specified number of milliseconds before doing something else.

In our example, we are going to have our rectangle travel diagonally down while also disappearing, only to suddenly travel up while changing back to black. Below is the code followed by the animation.

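A sketch of the chained version; in D3 v3, calling .transition() on a transition queues a second animation, and .delay() holds it back (values hypothetical):

svg.append("rect")
    .attr({ x: 0, y: 0, width: 150, height: 100, fill: "black" })
    .transition()
    .duration(3000)
    .attr({ x: 250, y: 200, fill: "white" })   // travel down and disappear
    .transition()
    .delay(2000)                               // wait before the second animation
    .duration(3000)
    .attr({ y: 0, fill: "black" });            // travel back up and reappear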

By now you are starting to see that the only limit to animation in d3.js is your imagination.

Handling Transitions

The beginning and end of a transition can be handled by the .each() method. This is useful when you want to control the style of the element at the beginning and/or end of a transition.

In the code below, you will see the rectangle go from red, to green, to orange, to black, and then to gray. At the same time, the rectangle will move and change sizes. Notice carefully that the changes from red to green and from black to gray are controlled by .each() methods.

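A compressed sketch of the .each() handlers (D3 v3 assumed; the exact styling is hypothetical, but the start and end handlers follow the pattern described above):

svg.append("rect")
    .attr({ x: 0, y: 0, width: 100, height: 100, fill: "red" })
    .transition()
    .duration(3000)
    .each("start", function() { d3.select(this).attr("fill", "green"); })
    .attr({ x: 250, y: 150, width: 200, height: 50, fill: "orange" })
    .each("end", function() { d3.select(this).attr("fill", "gray"); });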

Conclusion

Animation is not only to be used for entertainment. When developing visualizations, an animation should provide additional understanding of the content that you are trying to present. This is important to remember so that d3.js does not suffer the same fate as PowerPoint, in that people focus more on the visual effects than on the content.

Adding Labels to Graphs with D3.js

In this post, we will look at how to add the following to a bar graph using d3.js.

  • Labels
  • Margins
  • Axes

Before we begin, you need the initial code that has a bar graph already created. This is shown below, followed by what it should look like before we make any changes.



The first change is in lines 16-19. Here, we change the name of the variable and modify the type of element it creates.


Our next change begins at line 27 and continues until line 38. Here we make two changes. First, we make a variable called barGroup, which selects all the group elements of the variable g. We also use the data, enter, append, and attr methods. Starting in line 33 and continuing until line 38, we use the append method on our new variable barGroup to add rect elements as well as the color and size of each bar. Below is the code.


The last step for adding text appears in lines 42-50. First, we make a variable called textTranslator to move our text. Then we append the text to the barGroup variable. The color, font type, and font size are all set in the code below, followed by a visual of what our graph looks like now.

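A hedged sketch of these changes, building on the variables from the earlier bar graph post (data, barWidth, translator, and mainGroup are assumed to exist as defined there):

// one group per data point, holding a rect and its label (lines 27-38)
var barGroup = mainGroup.selectAll("g")
    .data(data)
    .enter().append("g")
    .attr("transform", translator);

barGroup.append("rect")
    .attr("width", barWidth)
    .attr("height", function(d) { return d; })
    .attr("fill", "steelblue");

// textTranslator moves each label just above its bar (lines 42-50)
var textTranslator = "translate(" + barWidth / 2 + ", -5)";
barGroup.append("text")
    .text(function(d) { return d; })
    .attr("transform", textTranslator)
    .attr("fill", "black")
    .attr("font-family", "sans-serif")
    .attr("font-size", "12px");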

Margin

Margins serve to provide spacing in a graph. This is especially useful if you want to add axes. The changes in the code take place in lines 16-39 and include an extensive reworking of the code. In lines 16-20, we create several variables that are used for calculating the margins and the size and shape of the svg element. In lines 22-30, we set the attributes for the svg variable. In lines 32-34, we add a group element to hold the main parts of the graph. Lastly, in lines 36-40, we add a gray background for effect. Below is the code followed by our new graph.

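A sketch of the margin setup (this is the common d3 margin convention; the exact numbers are hypothetical):

// margins and the resulting inner plotting size (lines 16-20)
var margin = { top: 20, right: 20, bottom: 30, left: 50 },
    totalWidth = 500,
    totalHeight = 300,
    width = totalWidth - margin.left - margin.right,
    height = totalHeight - margin.top - margin.bottom;

// the svg element (lines 22-30)
var svg = d3.select("body").append("svg")
    .attr("width", totalWidth)
    .attr("height", totalHeight);

// a group shifted in by the margins to hold the graph (lines 32-34)
var mainGroup = svg.append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// a gray background for effect (lines 36-40)
mainGroup.append("rect")
    .attr("width", width)
    .attr("height", height)
    .attr("fill", "#eee");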

Axes

In order for this to work, we have to change the value of the variable maxValue to 150. This gives a little more space at the top of the graph. The code for the axis goes from line 74 to line 98.

  • In lines 74-77, we create variables to set up the axis so that it is on the left
  • In lines 78-85, we create two more variables that set the scale and the orientation of the axis
  • Lines 87-99 set the visual characteristics of the axis.

Below is the code followed by the updated graph

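A sketch of the axis code (assuming D3 v3's d3.svg.axis):

// a scale for the left-hand axis; svg y runs downward, so the range is flipped
var yScale = d3.scale.linear()
    .domain([0, maxValue])   // maxValue raised to 150
    .range([height, 0]);

var yAxis = d3.svg.axis()
    .scale(yScale)
    .orient("left");

mainGroup.append("g")
    .attr("class", "axis")
    .call(yAxis);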

You can see the scale off to the left as planned.

Conclusion

Making bar graphs is a basic task for d3.js, although the code can seem cumbersome to people who do not use JavaScript. The ability to design visuals like this often outweighs the challenges.

Making Bar Graphs with D3.js

This post will provide an example of how to make a basic bar graph using d3.js. Visualizing data is important, and developing bar graphs is one way to communicate information efficiently.

This post has the following steps

  1. Initial Template
  2. Enter the data
  3. Setup for the bar graphs
  4. SVG element
  5. Positioning
  6. Make the bar graph

Initial Template

Below is what the initial code should look like.

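A minimal template might look like this (the CDN link and D3 version are assumptions; the method names in this post match D3 v3):

<!DOCTYPE html>
<html>
  <head>
    <script src="https://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script>
      // the visualization code goes here
    </script>
  </body>
</html>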

Entering the Data

For the data, we will hard code it into the script using an array. This is not the most optimal way of doing this, but it is the simplest for a first-time experience. This code is placed inside the second script element. Below is the code.

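The data might be hard coded like this (the values are hypothetical):

var data = [5, 10, 13, 19, 21, 25, 22, 18, 15, 13];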

The new code is in lines 10-11, saved as the variable data.

Setup for the Bars in the Graph

We are now going to create three variables. Each is explained below

  • The barWidth variable will indicate how wide the bars should be for the graph
  • The barPadding variable will put space between the bars in the graph. If this were set to 0, it would make a histogram
  • The variable maxValue scales the height of the bars relative to the largest observation in the array. This variable uses the method .max() to find the largest value.

Below is the code for these three variables

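A sketch of the three variables (the sizes are hypothetical; .max() is the method mentioned above):

var barWidth = 40;              // how wide each bar is
var barPadding = 5;             // space between bars; 0 would give a histogram
var maxValue = d3.max(data);    // the largest observation in the array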

The new information was added in lines 13-14

SVG Element

We can now begin to work with the svg element. We are going to create another variable called mainGroup. This will assign the svg element inside the body element using the .select() method. We will append the svg using .append and will set the width and height using .attr. Lastly, we will append a group element inside the svg so that all of our bars are inside the group that is inside the svg element.

The code is getting longer, so we will only show the new additions in the pictures, with a reference to the older code. Below is the new code in lines 16-19, directly under the maxValue variable.

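A sketch of this step (dimensions hypothetical):

var mainGroup = d3.select("body")   // select the body element
    .append("svg")                  // append the svg
    .attr("width", 500)
    .attr("height", 300)
    .append("g");                   // the group that will hold the bars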

Positioning

Next, we need to make three functions.

  • The first function will calculate the x location of the bar graph
  • The second function will calculate the y location of the bar graph
  • The last function will combine the work of the first two functions to place the bar in the proper x,y coordinate in the svg element.

Below is the code for the three functions. These are added in lines 21-25.
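A hedged reconstruction of those functions, consistent with the description that follows:

// where each bar goes (lines 21-25)
function xloc(d, i) { return i * (barWidth + barPadding); }
function yloc(d) { return maxValue - d; }
function translator(d, i) {
  return "translate(" + xloc(d, i) + "," + yloc(d) + ")";
}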

The xloc function starts at the bottom left of the mainGroup element and adds the barWidth plus the barPadding to position each new bar. The yloc function starts at the top left and subtracts the given data point from maxValue to calculate the y position. Lastly, the translator combines the output of both the xloc and yloc functions to position each bar using the translate method.

Making the Graph

We can now make our graph. We will use our mainGroup variable with the .selectAll() method with the rect argument inside. Next, we use .data(data) to add the data, .enter() to update the element, and .append("rect") to add the rectangles. Lastly, we use .attr() to set the color, transformation, and height of the bars. Below is the code in lines 27-36, followed by the actual bar graph.

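A hedged reconstruction of that code:

// draw the bars (lines 27-36)
mainGroup.selectAll("rect")
    .data(data)
    .enter()
    .append("rect")
    .attr("fill", "steelblue")
    .attr("transform", translator)
    .attr("width", barWidth)
    .attr("height", function(d) { return d; });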

The graph is complete but you can see that there is a lot of work that needs to be done in order to improve it. However, that will be done in a future post.

SVG and D3.js

Scalable Vector Graphics, or SVG for short, is an XML markup language that is the standard for creating vector-based graphics in web browsers. The true power of d3.js is unlocked through its use of svg elements. Employing vectors allows the images that are created to be rendered at various sizes based on scale.

One unique characteristic of svg is that the coordinate system starts in the upper left, with the x axis increasing as normal from left to right. However, the y axis starts from the top and increases going down, rather than increasing from the bottom up. Below is a visual of this.


You can see that (0,0) is in the top left corner. As you go to the right, the x-axis increases, and as you go down, the y axis increases. By changing the x, y values you are able to position your image where you want it. If you're curious, the visual above was made using d3.js.

For the rest of the post, we will look at different svg elements that can be used in making visuals. We will look at the following.

  • Circles
  • Rectangles/squares
  • Lines & Text
  • Paths

Circles

Below is the code for making a circle, followed by the output.



To make a shape such as the circles above, you first must specify the size of the svg element, as shown in line 6. Then you make a circle element. Inside the circle element, you must indicate the x and y position (cx, cy) and also the radius (r). The default color is black; however, you can specify other colors as shown below.


To change the color simply add the style argument and indicate the fill and color in quotations.
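For example (hypothetical sizes and colors):

<svg width="400" height="200">
  <circle cx="100" cy="100" r="50"/>
  <circle cx="250" cy="100" r="50" style="fill: blue;"/>
</svg>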

Rectangle/Square

Below is the code for making rectangles/squares. The arguments are slightly different but this should not be too challenging to figure out.


The x, y arguments indicate the position, and the width and height arguments determine the size of the rectangle/square.
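For example (hypothetical values):

<svg width="400" height="200">
  <rect x="50" y="50" width="200" height="100" style="fill: green;"/>
</svg>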

Lines & Text

Lines are another tool in d3.js. Below is the code for drawing a line.



The code should not be too hard to understand. You now need two pairs of coordinates. This is because the line needs to start in one place and be drawn until it reaches another. You can also control the color of the line and its thickness.
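A line element might look like this (hypothetical coordinates):

<svg width="400" height="200">
  <line x1="20" y1="20" x2="300" y2="150"
        style="stroke: black; stroke-width: 3;"/>
</svg>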

You can also add text using svg. In the code below we combine the line element with the text element.


With the text element, after you set the position, font, font size, and color, you also have to add your text in between the tags of the element.
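For example (the wording of the text is hypothetical):

<svg width="400" height="200">
  <line x1="20" y1="100" x2="300" y2="100" style="stroke: black;"/>
  <text x="20" y="80" font-family="sans-serif" font-size="20" fill="red">
    My first line
  </text>
</svg>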

Path

The path element is slightly trickier but also more powerful when compared to the elements we have used previously. Below is the code and output from using the path element.


The path element has a mini-language all to itself. "M" is where the drawing begins and is followed by an x,y coordinate. "L" draws a line; essentially, it takes the current position and draws a line to the next position. "V" indicates a vertical line. Lastly, "Z" means close the path.

Here is what the path in the example below literally means:

  1. Start at 40, 40
  2. Make a line from 40, 40 to 250, 40
  3. Make another line from 250, 40 to 140, 40
  4. Make a vertical line from 140,40 to 140,4
  5. Close the path
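Reconstructed from the steps above, the path element would look something like this (a hypothetical example consistent with the description):

<svg width="300" height="300">
  <path d="M 40,40 L 250,40 L 140,40 V 4 Z"
        style="fill: steelblue; stroke: black;"/>
</svg>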

Using path can be much more complicated than this. However, this is enough for an introduction.

Conclusion

This was just a teaser of what is possible with d3.js. The ability to make various graphics based on data is something that we have not even discussed yet. As such, there is much more to look forward to when using this visualization tool.

Data Exploration with R: Housing Prices


In this data exploration post, we will analyze a housing dataset from Kaggle using R. Below are the packages we are going to use.

library(ggplot2);library(readr);library(magrittr);library(dplyr);library(tidyverse);library(data.table);library(DT);library(GGally);library(gridExtra);library(ggExtra);library(fastDummies);library(caret);library(glmnet)

Let’s look at our data for a second.

train <- read_csv("~/Downloads/train.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.

Data Visualization

Let's take a look at our target variable first.

p1<-train%>%
        ggplot(aes(SalePrice))+geom_histogram(bins=10,fill='red')+labs(x="Type")+ggtitle("Global")
p1


Here is the frequency of certain values for the target variable

p2<-train%>%
        mutate(tar=as.character(SalePrice))%>%
        group_by(tar)%>%
        count()%>%
        arrange(desc(n))%>%
        head(10)%>%
        ggplot(aes(reorder(tar,n,FUN=min),n))+geom_col(fill='blue')+coord_flip()+labs(x='Target',y='freq')+ggtitle('Frequency')
p2        


Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.

nums <- unlist(lapply(train, is.numeric))
train[ , nums]%>%
        select(-Id) %>%
        ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()


There are some strong associations in the dataset. Below we see the top correlations.

n1 <- 20 
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
        as_data_frame %>% 
        transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>% 
        distinct %>% 
        filter(Var1N != Var2N) %>% 
        arrange(desc(n)) %>%
        group_by(grp = as.integer(gl(n(), n1, n())))
out
## # A tibble: 703 x 4
## # Groups:   grp [36]
##    Var1N        Var2N            n   grp
##                     
##  1 GarageArea   GarageCars   0.853     1
##  2 1stFlrSF     TotalBsmtSF  0.829     1
##  3 GrLivArea    TotRmsAbvGrd 0.828     1
##  4 OverallQual  SalePrice    0.810     1
##  5 GrLivArea    SalePrice    0.731     1
##  6 GarageCars   SalePrice    0.691     1
##  7 YearBuilt    YearRemodAdd 0.684     1
##  8 BsmtFinSF1   BsmtFullBath 0.674     1
##  9 BedroomAbvGr TotRmsAbvGrd 0.668     1
## 10 FullBath     GrLivArea    0.658     1
## # ... with 693 more rows

There are about 4 correlations that are perhaps too strong.

Descriptive Statistics

Below are some basic descriptive statistics of our variables.

train_mean<-na.omit(train[ , nums]) %>% 
        select(-Id,-SalePrice) %>%
        summarise_all(funs(mean)) %>%
        gather(everything(),key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sd)) %>%
        gather(everything(),key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(median)) %>%
        gather(everything(),key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sum(.<0.001))) %>%
        gather(everything(),key='feature',value='zeros')%>%
        left_join(train_mean,by='feature')%>%
        left_join(train_median,by='feature')%>%
        left_join(train_sd,by='feature')
stat$zeropercent<-(stat$zeros/(nrow(train))*100)
stat[order(stat$zeropercent,decreasing=T),] 
## # A tibble: 36 x 6
##    feature       zeros    mean median      sd zeropercent
##                            
##  1 PoolArea       1115  2.93        0  40.2          76.4
##  2 LowQualFinSF   1104  4.57        0  41.6          75.6
##  3 3SsnPorch      1103  3.35        0  29.8          75.5
##  4 MiscVal        1087 23.4         0 166.           74.5
##  5 BsmtHalfBath   1060  0.0553      0   0.233        72.6
##  6 ScreenPorch    1026 16.1         0  57.8          70.3
##  7 BsmtFinSF2      998 44.6         0 158.           68.4
##  8 EnclosedPorch   963 21.8         0  61.3          66.0
##  9 HalfBath        700  0.382       0   0.499        47.9
## 10 BsmtFullBath    668  0.414       0   0.512        45.8
## # ... with 26 more rows

We have a lot of information stored in the code above. We have the mean, median, and sd of all the features in one place. Below are visuals of this information. We add 1 to the mean and sd so that features with a mean of 0 are preserved on the log scale.

p1<-stat %>%
        ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")

p2<-stat %>%
        ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")

p3<-stat %>%
        ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")

p4<-stat %>%
        mutate(zeros=zeros/nrow(train)*100) %>%
        ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")

p5<-stat %>%
        ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
grid.arrange(p1,p2,p3,p4,p5,layout_matrix=rbind(c(1,2,3),c(4,5)))
## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)

2.png

Below we check for variables with zero variance. Such variables would cause problems if included in any model development.

stat%>%
        mutate(zeros = zeros/nrow(train)*100)%>%
        filter(mean == 0 | sd == 0 | zeros==100)%>%
        DT::datatable()

There are no zero-variance features in this dataset that would need to be removed.

Correlations

Let’s look at correlation with the SalePrice variable. The plot is a histogram of all the correlations with the target variable.

sp_cor<-train[, nums] %>% 
select(-Id,-SalePrice) %>%
cor(train$SalePrice,method="spearman") %>%
as.tibble() %>%
rename(cor_p=V1)

stat<-stat%>%
#filter(sd>0)
bind_cols(sp_cor)

stat%>%
ggplot(aes(cor_p))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")


We have several high correlations, but we already knew this. Below is some code that provides visuals of the correlations.

top<-stat%>%
        arrange(desc(cor_p))%>%
        head(10)%>%
        .$feature
p1<-train%>%
        select(SalePrice,one_of(top))%>%
        ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations")
p2<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation")
p3<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation")
ggMarginal(p3,type = 'histogram')
p3
grid.arrange(p1,p2,layout_matrix=rbind(c(1,2)))



The first plot shows us the top correlations. The second plot shows the relationship between the strongest predictor and our target variable. The third plot adds the trend line and the histograms for the strongest predictor with our target variable.

The code below is for the categorical variables. Our primary goal is to see the proportions inside each variable. If a categorical variable lacks variance in terms of the frequencies of each category, it may need to be removed for model-development purposes. Below is the code.

ig_zero<-train[, nums]%>%
        na_if(0)%>%
        select(-Id,-SalePrice)%>%
        cor(train$SalePrice,use="pairwise",method="spearman")%>%
        as.tibble()%>%
        rename(cor_s0=V1)
stat<-stat%>%
        bind_cols(ig_zero)%>%
        mutate(non_zero=nrow(train)-zeros)

char <- unlist(lapply(train, is.character))  
me<-names(train[,char])

List=list()
    for (var in train[,char]){
        wow= print(prop.table(table(var)))
        List[[length(List)+1]] = wow
    }
names(List)<-me
List

Part of the list is printed below; the full output is truncated to save space.

## $MSZoning
## var
##     C (all)          FV          RH          RL          RM 
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068 
## 
## $Street
## var
##        Grvl        Pave 
## 0.004109589 0.995890411 
## 
## $Alley
## var
##      Grvl      Pave 
## 0.5494505 0.4505495 
## 
## $LotShape
## var
##         IR1         IR2         IR3         Reg 
## 0.331506849 0.028082192 0.006849315 0.633561644 
## 
## $LandContour
## var
##        Bnk        HLS        Low        Lvl 
## 0.04315068 0.03424658 0.02465753 0.89794521 
## 
## $Utilities
## var
##       AllPub       NoSeWa 
## 0.9993150685 0.0006849315 
## 
## $LotConfig
## var
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945 
## 
## $LandSlope
## var
##        Gtl        Mod        Sev 
## 0.94657534 0.04452055 0.00890411 
## 
## $Neighborhood
## var
##     Blmngtn     Blueste      BrDale     BrkSide     ClearCr     CollgCr 
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726 
##     Crawfor     Edwards     Gilbert      IDOTRR     MeadowV     Mitchel 
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644 
##       NAmes     NoRidge     NPkVill     NridgHt      NWAmes     OldTown 
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260 
##      Sawyer     SawyerW     Somerst     StoneBr       SWISU      Timber 
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397 
##     Veenker 
## 0.007534247 
## 
## $Condition1
## var
##      Artery       Feedr        Norm        PosA        PosN        RRAe 
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247 
##        RRAn        RRNe        RRNn 
## 0.017808219 0.001369863 0.003424658 
## 
## $Condition2
## var
##       Artery        Feedr         Norm         PosA         PosN 
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630 
##         RRAe         RRAn         RRNn 
## 0.0006849315 0.0006849315 0.0013698630 
## 
## $BldgType
## var
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219 
## 
## $HouseStyle
## var
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521 
##      SFoyer        SLvl 
## 0.025342466 0.044520548 
## 
## $RoofStyle
## var
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863 
## 
## $RoofMatl
## var
##      ClyTile      CompShg      Membran        Metal         Roll 
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315 
##      Tar&Grv      WdShake      WdShngl 
## 0.0075342466 0.0034246575 0.0041095890 
## 
## $Exterior1st
## var
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock 
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315 
##      CemntBd      HdBoard      ImStucc      MetalSd      Plywood 
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027 
##        Stone       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192 
## 
## $Exterior2nd
## var
##      AsbShng      AsphShn      Brk Cmn      BrkFace       CBlock 
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315 
##      CmentBd      HdBoard      ImStucc      MetalSd        Other 
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315 
##      Plywood        Stone       Stucco      VinylSd      Wd Sdng 
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068 
##      Wd Shng 
## 0.0260273973 
## 
## $MasVnrType
## var
##     BrkCmn    BrkFace       None      Stone 
## 0.01033058 0.30647383 0.59504132 0.08815427 
## 
## $ExterQual
## var
##          Ex          Fa          Gd          TA 
## 0.035616438 0.009589041 0.334246575 0.620547945 
## 
## $ExterCond
## var
##           Ex           Fa           Gd           Po           TA 
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918 
## 
## $Foundation
## var
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795 
## 
## $BsmtQual
## var
##         Ex         Fa         Gd         TA 
## 0.08503162 0.02459592 0.43429375 0.45607871 
## 
## $BsmtCond
## var
##          Fa          Gd          Po          TA 
## 0.031623331 0.045678145 0.001405481 0.921293043 
## 
## $BsmtExposure
## var
##         Av         Gd         Mn         No 
## 0.15541491 0.09423347 0.08016878 0.67018284 
## 
## $BsmtFinType1
## var
##        ALQ        BLQ        GLQ        LwQ        Rec        Unf 
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850 
## 
## $BsmtFinType2
## var
##         ALQ         BLQ         GLQ         LwQ         Rec         Unf 
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010 
## 
## $Heating
## var
##        Floor         GasA         GasW         Grav         OthW 
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630 
##         Wall 
## 0.0027397260 
## 
## $HeatingQC
## var
##           Ex           Fa           Gd           Po           TA 
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849 
## 
## $CentralAir
## var
##          N          Y 
## 0.06506849 0.93493151 
## 
## $Electrical
## var
##       FuseA       FuseF       FuseP         Mix       SBrkr 
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880 
## 
## $KitchenQual
## var
##         Ex         Fa         Gd         TA 
## 0.06849315 0.02671233 0.40136986 0.50342466 
## 
## $Functional
## var
##         Maj1         Maj2         Min1         Min2          Mod 
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726 
##          Sev          Typ 
## 0.0006849315 0.9315068493 
## 
## $FireplaceQu
## var
##         Ex         Fa         Gd         Po         TA 
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351 
## 
## $GarageType
## var
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144 
## 
## $GarageFinish
## var
##       Fin       RFn       Unf 
## 0.2552574 0.3060189 0.4387237 
## 
## $GarageQual
## var
##          Ex          Fa          Gd          Po          TA 
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905 
## 
## $GarageCond
## var
##          Ex          Fa          Gd          Po          TA 
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352 
## 
## $PavedDrive
## var
##          N          P          Y 
## 0.06164384 0.02054795 0.91780822 
## 
## $PoolQC
## var
##        Ex        Fa        Gd 
## 0.2857143 0.2857143 0.4285714 
## 
## $Fence
## var
##      GdPrv       GdWo      MnPrv       MnWw 
## 0.20996441 0.19217082 0.55871886 0.03914591 
## 
## $MiscFeature
## var
##       Gar2       Othr       Shed       TenC 
## 0.03703704 0.03703704 0.90740741 0.01851852 
## 
## $SaleType
## var
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726 
##         New         Oth          WD 
## 0.083561644 0.002054795 0.867808219 
## 
## $SaleCondition
## var
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438

You can judge for yourself which of these variables are appropriate or not.

Conclusion

This post provided an example of data exploration. Through this analysis we have a better understanding of the characteristics of the dataset. This information can be used for further analysis and/or model development.

Scatterplot in LibreOffice Calc

A scatterplot is used to observe the relationship between two continuous variables. This post will explain how to make a scatterplot and calculate correlation in LibreOffice Calc.

Scatterplot

In order to make a scatterplot you need two columns of data. Below are the first few rows of the data used in this example.

Var 1 Var 2
3.07 2.95
2.07 1.90
3.32 2.75
2.93 2.40
2.82 2.00
3.36 3.10
2.86 2.85

Given the nature of this dataset, there was no need to make any preparation.

To make the plot, select the two columns containing the data and click on insert -> chart. You will see the following.

1

Be sure to select the XY (Scatter) option and click next. You will then see the following

1

Be sure to select “data series in columns” and “first row as label.” Then click next and you will see the following.

1

There is nothing to modify in this window. If you wanted, you could add more data to the plot as well as label the data, but neither of these options applies to us. Therefore, click next.

1

In this last window, you can see that we gave the chart a title and labeled the X and Y axes. We also removed the legend by unchecking the “display legend” feature, since a legend is normally not needed for a scatterplot. Once you have added this information, click “finish” and you will see the following.

1

There are many other ways to modify the scatterplot, but we will now look at how to add a trend line.

To add a trend line, click on the data inside the plot so that it turns green, as shown below.

1

Next, click on insert -> trend line and you will see the following

1.png

For our purposes, we want to select the “linear” option. The line is generally hard to see if you immediately click “ok”. Instead, click on the “Line” tab and adjust the settings as shown below.

1

All we did was change the color of the line to black and increase the width to 0.10. When this is done, click “ok” and you will see the following.

1

The scatterplot is now complete. We will now look at how to calculate the correlation between the two variables.

Correlation

The correlation is essentially a number that captures what you see in a scatterplot. To calculate the correlation, do the following.

  1. Select the two columns of data
  2. Click on data -> statistics -> correlation and you will see the following

1.png

3. In the “Results to” field, pick a place on the spreadsheet to display the results. Click ok and you will see the following.

Correlations Column 1 Column 2
Column 1 1
Column 2 0.413450002676874 1

You will have to rename the columns with the appropriate variable names. Despite this inconvenience, the correlation has been calculated.
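Alternatively, if you prefer a formula to the menu dialog, Calc’s CORREL function returns the same number directly. As a sketch, assuming the data sits in A2:B8 with the labels in row 1 (adjust the ranges to your own data):

=CORREL(A2:A8,B2:B8)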

Conclusion

This post provided an explanation of calculating correlations and creating scatterplots in LibreOffice Calc. Data visualization is a critical aspect of communicating effectively and such tools as Calc can be used to support this endeavor.

Graphs in LibreOffice Calc

The LibreOffice Suite is a free, open-source office suite that is considered an alternative to the Microsoft Office Suite. The primary benefit of LibreOffice is that it offers features similar to Microsoft Office without having to spend any money. In addition, LibreOffice is legitimately free and is not some sort of nefarious pirated version of Microsoft Office, which means that anyone can use it without legal concerns on as many machines as they desire.

In this post, we will go over how to make plots and graphs in LibreOffice Calc. LibreOffice Calc is the equivalent to Microsoft Excel. We will learn how to make the following visualizations.

  • Bar graph
  • Histogram

Bar Graph

We are going to make a bar graph from a single column of data in LibreOffice Calc. To make a visualization you need to aggregate some data. For this post, I simply made some random data that uses a Likert scale of SD, D, N, A, SA. Below is a sample of the first five rows of the data.

Var 1
N
SD
D
SD
D

In order to aggregate the data you need to make bins and count the frequency of each category in each bin. Here is how to do this. First, make a variable called “bin” in a column and place SD, D, N, A, and SA each in their own row of that column, as shown below.

bin
SD
D
N
A
SA

In the next column, create a variable called “freq”. In each row of this column you need to use the COUNTIF function as shown below

=COUNTIF(1st value in data: last value in data, criteria for counting)

Below is how this looks for my data.

=COUNTIF(A2:A177,B2)

What I told LibreOffice was that my data is in A2 to A177 and to count a row if it contains the same value as B2, which for me contains SD. You repeat this process four more times, adjusting the last argument of the function each time. When I finished, this is what I had.

bin freq
SD 35
D 47
N 56
A 32
SA 5

We can now proceed to making the visualization.

To make the bar graph you need to first highlight the data you want to use. For us the information we want to select is the “bin” and “freq” variables we just created. Keep in mind that you never use the raw data but rather the aggregated data. Then click insert -> chart and you will see the following

1.png

Simply click next, and you will see the following

1.png

Make sure that the last three options are selected or your chart will look funny. “Data series in rows or in columns” has to do with whether the data is read in long or short form. “Labels in first row” makes sure that Calc does not insert “bin” and “freq” into the graph. “First column as label” helps in identifying the values in the plot.

Once you click next you will see the following.

1.png

This window normally does not need adjusting, and trying to do so can be confusing. It does allow you to adjust the range of the data and even add more data to your chart. For now, we will click on next and see the following.

1

In the last window above, you can add a title and label the axes if you want. You can see that I gave my graph a name. In addition, if you look to the right, you can decide whether to display a legend. For my graph, the legend was not adding any additional information, so I unchecked this option. When you click finish you will see the following on the spreadsheet.

1

Histogram

Histograms are for continuous data. Therefore, I converted my SD, D, N, A, SA values to 1, 2, 3, 4, and 5. All the other steps are the same as above. The one difference is that you want to remove the spacing between the bars. Below is how to do this.

Click on one of the bars in the bar graph until you see the green squares as shown below.

1.png

After you do this, there should be a new toolbar at the top of the spreadsheet. You need to click on the green and blue cube as shown below

1

In the next window, you need to change the spacing to zero percent. This will change the bar graph into a histogram. Below is what the settings should look like.

1

When you click ok you should see the final histogram shown below

1

For free software this is not too bad. There are a lot of options that were left unexplained, especially in regard to how you can manipulate the colors of everything and even make the plots 3D.

Conclusion

LibreOffice provides an alternative to paying for Microsoft products. The examples above show that Calc is capable of making visually appealing graphs just as Excel is.

Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a data scientist, with as much as 60% of their time devoted to it. As such, the majority of a data scientist’s time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the Kaggle website. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long

df_train.columns
df_train.head()

Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
 
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330

Only the first five rows are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .head() function if you want to see how many variables are left.
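Alternatively, a quick sketch for counting what remains:

print(df_train.shape) # (rows, columns) left after dropping columns with missing data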

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe())
Out[8]: 
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT'])

1.png

round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]: 
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL'])

1.png

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address the categorical variables by creating dummy variables so that we can develop a model later. Below is the code for converting all of the categorical variables to dummy variables.

df_train.groupby('NAME_CONTRACT_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_CONTRACT_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_CONTRACT_TYPE'],axis=1)

df_train.groupby('CODE_GENDER').count()
dummy=pd.get_dummies(df_train['CODE_GENDER'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['CODE_GENDER'],axis=1)

df_train.groupby('FLAG_OWN_CAR').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_CAR'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_CAR'],axis=1)

df_train.groupby('FLAG_OWN_REALTY').count()
dummy=pd.get_dummies(df_train['FLAG_OWN_REALTY'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['FLAG_OWN_REALTY'],axis=1)

df_train.groupby('NAME_INCOME_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_INCOME_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_INCOME_TYPE'],axis=1)

df_train.groupby('NAME_EDUCATION_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_EDUCATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_EDUCATION_TYPE'],axis=1)

df_train.groupby('NAME_FAMILY_STATUS').count()
dummy=pd.get_dummies(df_train['NAME_FAMILY_STATUS'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_FAMILY_STATUS'],axis=1)

df_train.groupby('NAME_HOUSING_TYPE').count()
dummy=pd.get_dummies(df_train['NAME_HOUSING_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['NAME_HOUSING_TYPE'],axis=1)

df_train.groupby('ORGANIZATION_TYPE').count()
dummy=pd.get_dummies(df_train['ORGANIZATION_TYPE'])
df_train=pd.concat([df_train,dummy],axis=1)
df_train=df_train.drop(['ORGANIZATION_TYPE'],axis=1)

You have to be careful with this because you now have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)
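As a side note, had we not already encoded the columns one at a time, the repetitive steps above could have been condensed. Below is a minimal sketch, assuming the same column names as above, that uses pandas’ get_dummies with drop_first=True to encode every categorical column and automatically drop one category per variable in a single call.

# condensed alternative to the manual dummy coding above
cat_cols=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY',
'NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE','ORGANIZATION_TYPE']
# drop_first=True removes the first category of each variable for us
df_train=pd.get_dummies(df_train,columns=cat_cols,drop_first=True)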

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])

1.png

There is a clear outlier there. Below is another boxplot with a different variable

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])

2

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)

1.png

The heatmap is nice, but it is hard to really appreciate what is happening. The code below sorts the correlations from weakest to strongest so that we can identify and remove highly correlated variables.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005
FLAG_MOBIL FLAG_DOCUMENT_12 0.000005
Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005

The full list is too long to show here, but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)

12

This is not normal

sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)

12

This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some of the variables to predict the target. In the code below we make our X and y datasets.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']

The code below fits our model and makes the predictions

clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0                
0       280873  18493
1         1813   6332
metrics.accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy. Keep in mind, though, that accuracy measured on the same data the model was trained on overstates how the model would perform on unseen data.

Conclusion

Data exploration and analysis is the primary task of a data scientist. This post was just one example of how this can be approached. Of course, there are many other creative ways to do this, but even the simplistic nature of this analysis yielded strong results.

Scatter Plots in Python

Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.

We will be using the “Prestige” dataset from the pydataset module to look at scatterplot use. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df=data('Prestige')

We will begin by making a correlation matrix. This will help us determine which pairs of variables have strong relationships with each other. This is done with the .corr() function. Below is the code.
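The code itself appears in the screenshot below rather than as text; a minimal sketch of it, assuming we first restrict the Prestige data to its numeric columns, would be:

# correlation matrix of the numeric columns of Prestige
print(df.select_dtypes('number').corr().round(2))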

1.png

You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.

The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.
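The screenshot below contains the code; a minimal sketch of it, consistent with the regression-line version shown afterward, would be:

# scatterplot of education vs. income with no regression line
facet=sns.lmplot(data=df,x='education',y='income',fit_reg=False)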

1.png

The code should be self-explanatory. The only thing that might be unfamiliar is the fit_reg argument. This is set to False so that the function does not draw a regression line. Below is the same visual but this time with the regression line.

facet = sns.lmplot(data=df, x='education', y='income',fit_reg=True)

1

It is also possible to add a third variable to our plot. One of the more common ways is by including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this we use the same .lmplot() function but include several additional arguments: the hue and the indication of a legend. Below is the code and output.
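Again the code lives in the screenshot; a plausible sketch, assuming the categorical variable is the “type” column of Prestige, would be:

# color the points by job type; legend=True labels each hue
facet=sns.lmplot(data=df,x='education',y='income',hue='type',fit_reg=False,legend=True)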

1.png

You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.
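The boxplot code is also shown only as an image; a hedged sketch that produces side-by-side boxplots for the two variables might look like this:

# boxplots of education and income, each grouped by job type
fig,axes=plt.subplots(1,2,figsize=(10,4))
sns.boxplot(data=df,x='type',y='education',ax=axes[0])
sns.boxplot(data=df,x='type',y='income',ax=axes[1])
plt.show()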

1.png

As you can see, we can conclude that job type influences both education and income in this example.

Conclusion

This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analysts should be familiar with, as they can be used to communicate information to people who must make decisions.

Data Visualization in Python

In this post, we will look at how to set up various features of a graph in Python. The fine-grained tweaks that can be made when creating a data visualization can enhance the communication of results with an audience. This will all be done using the matplotlib module available for Python. Our objectives are as follows

  • Make a graph with  two lines
  • Set the tick marks
  • Change the linewidth
  • Change the line color
  • Change the shape of the line
  • Add a label to each axis
  • Annotate the graph
  • Add a legend and title

We will use two variables from the “toothpaste” dataset from the pydataset module for this demonstration. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
DF = data('toothpaste')

Make Graph with Two Lines

To make a plot you use the .plot() function. Inside the parentheses you put the dataframe and variable you want. If you want more than one line or graph, you use the .plot() function several times. Below is the code for making a graph with two line plots using variables from the toothpaste dataset.

plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

1

To get the graph above you must run both lines of code simultaneously. Otherwise, you will get two separate graphs.

Set Tick Marks

Setting the tick marks requires the use of the .axes() function. However, it is common to save this function in a variable called ax as a handle. This makes coding easier. Once this is done you can use the .set_xticks() function for the x-axis and .set_yticks() for the y-axis. In our example below, we are setting the tick marks to the odd numbers only. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'])
plt.plot(DF['sdA'])

1

Changing the Line Type

It is also possible to change the line type and width. There are several options for the line type. The important thing is to put this information after the data you want to plot inside the .plot() call. Line width is changed with the linewidth argument. Below is the code and visual

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'--',linewidth=3)
plt.plot(DF['sdA'],':',linewidth=3)

1

Changing the Line Color

It is also possible to change the line color. There are several options available. The important thing is that the argument for the line color goes inside the same quotation marks as the line type. Below is the code. In the format strings, r means red and k means black.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'r--',linewidth=3)
plt.plot(DF['sdA'],'k:',linewidth=3)

1

Change the Point Type

Changing the point type requires adding another character inside the same quotation marks where the line color and line type are. Again, there are several choices here. The code is below

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

1

Add X and Y Labels

Adding labels is simple. You just use the .xlabel() function or .ylabel() function. Inside the parentheses, you put the text you want in quotation marks. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)

1

Adding Annotation, Legend, and Title

Annotation allows you to write text directly inside the plot wherever you want. This involves the use of the .annotate() function. Inside this function, you must indicate the location of the text and the actual text you want added to the plot. For our example, we will add the word “Python” to the plot for fun.

The .legend() function allows you to give a description of the line types that you have included. Lastly, the .title() function allows you to add a title. Below is the code.

ax=plt.axes()
ax.set_xticks([1,3,5,7,9])
ax.set_yticks([1,3,5,7,9])
plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.annotate(xy=[3,4],text='Python')
plt.plot(DF['meanA'],'or--',linewidth=3)
plt.plot(DF['sdA'],'Dk:',linewidth=3)
plt.legend(['1st','2nd'])
plt.title("Plot Example")

1.png

Conclusion

Now you have a practical understanding of how to communicate information visually with matplotlib in Python. This barely scratches the surface of the potential that is available.

Factor Analysis in Python

Factor analysis (FA) is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The differences are highly technical but include the fact that FA does not use an orthogonal decomposition and that FA assumes there are latent variables influencing the observed variables in the model. For FA, the goal is to explain the covariance among the observed variables.

Our purpose here will be to use the bioChemists dataset from the pydataset module and perform an FA that creates two factors. This dataset has data on people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt

We now need to prepare the dataset. The code is below

df = data('bioChemists')
df=df.iloc[1:250]
X=df[['art','kid5','phd','ment']]

In the code above, we did the following

  1. The first line creates our dataframe called “df” from the bioChemists dataset
  2. The second line trims df to roughly its first 250 rows. This is done for the visual that we will make; graphing the whole dataset would produce a giant blob of color that would be hard to interpret.
  3. The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data(“bioChemists”,show_doc=True)

In the code below we need to set the number of factors we want and then run the model.

fact_2c=FactorAnalysis(n_components=2)
X_factor=fact_2c.fit_transform(X)

The first line tells Python how many factors we want. The second line takes this information, along with our revised dataset X, to create the actual factors. We can now make our visualization.

To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.

thisdict = {
"Single": 1,
"Married": 2,}

Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.

plt.scatter(X_factor[:,0],X_factor[:,1],c=df.mar.map(thisdict),alpha=.8,edgecolors='none')

1

You can perhaps see now why we created the dictionary. By mapping the dictionary onto the mar variable, it automatically changed every single and married entry in the df dataset to a 1 or a 2. The c argument needs a number in order to set a color, and this is what the dictionary supplied.

You can see that two factors do not do a good job of separating the people by marital status. Additional factors may be useful, but beyond two factors it becomes impossible to visualize them in a single scatterplot.

Conclusion

This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

Word Clouds in Python

Word clouds are a type of data visualization in which the words from a dataset are displayed with their size scaled by how often they occur. Words that are larger in the word cloud are more common, and words toward the middle also tend to be more common. In addition, some word clouds even use various colors to indicate importance.

This post will provide an example of how to make a word cloud using Python. We will be using the “Women’s E-Commerce Clothing Reviews” dataset available on the Kaggle website. We are going to use only the text reviews to make our word cloud even though other data is in the dataset. To prepare our dataset for making the word cloud we need to do the following.

  1. Lowercase all words
  2. Remove punctuation
  3. Remove stopwords

After completing these steps we can make the word cloud. First, we need to load all of the necessary modules.

import pandas as pd
import re
from nltk.corpus import stopwords
import wordcloud
import matplotlib.pyplot as plt

We now need to load our dataset; we will store it as the object ‘df’

df=pd.read_csv('YOUR LOCATION HERE')
df.head()

1.png

It’s hard to read but we will be working only with the “Review Text” column as this has the text data we need. Here is what our column looks like up close.

df['Review Text'].head()

Out[244]: 
0 Absolutely wonderful - silky and sexy and comf...
1 Love this dress! it's sooo pretty. i happene...
2 I had such high hopes for this dress and reall...
3 I love, love, love this jumpsuit. it's fun, fl...
4 This shirt is very flattering to all due to th...
Name: Review Text, dtype: object

We will now make all words lower case and remove punctuation with the code below.

df["Review Text"]=df['Review Text'].str.lower()
df["Review Text"]=df['Review Text'].str.replace(r'[-./?!,":;()\']',' ')

The first line in the code above lowercases all words. The second line removes the punctuation and is trickier, as you have to tell Python exactly what punctuation you want to remove and what to replace it with. Everything we want to remove is in the first set of single quotes, and we replace it with the space in the second set of single quotes. The r at the beginning marks the pattern as a raw string, so Python passes the backslashes through to the regular expression untouched.

Here is what our data looks like after making these two changes

df['Review Text'].head()

Out[249]: 
0 absolutely wonderful silky and sexy and comf...
1 love this dress it s sooo pretty i happene...
2 i had such high hopes for this dress and reall...
3 i love love love this jumpsuit it s fun fl...
4 this shirt is very flattering to all due to th...
Name: Review Text, dtype: object

All the words are in lowercase. In addition, you can see that the dash in line 0 is gone, as is the punctuation in the other lines. We now need to remove stopwords. Stopwords are the function words that glue a sentence together without carrying much meaning themselves. Examples include and, for, but, etc. We are trying to make a cloud of substantive words, not stopwords, so these words need to be removed.

If you have never done this on your computer before, you may need to import the nltk module and run nltk.download_gui(). Once this is done you need to download the stopwords package.
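As a sketch, that one-time setup looks like this:

import nltk
nltk.download('stopwords') # downloads the stopwords corpus once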

Below is the code for removing the stopwords. First, we need to load the stopwords; this is done below.

stopwords_list=stopwords.words('english')
stopwords_list=stopwords_list+['to']

We create an object called stopwords_list which has all the English stopwords. The second line just adds the word ‘to’ to the list. Next, we need to make an object that describes the pattern of words we want to remove. Below is the code

pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

This code tells Python what to look for. Using regular expressions, Python will match any whole word (the \b markers are word boundaries) that equals one of the alternatives produced by joining stopwords_list with the | operator. We will now take this object called ‘pat’ and use it on our ‘Review Text’ variable.

df['Split Text'] = df['Review Text'].str.replace(pat, '')
df['Split Text'].head()

Out[258]: 
0      absolutely wonderful   silky  sexy  comfortable
1    love  dress     sooo pretty    happened  find ...
2       high hopes   dress  really wanted   work   ...
3     love  love  love  jumpsuit    fun  flirty   f...
4     shirt   flattering   due   adjustable front t...
Name: Split Text, dtype: object

You can see that we have created a new column called ‘Split Text’ and the result is text that has lost many of its stopwords.

We are now ready to make our word cloud. Below is the code and the output.

wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(' '.join(map(str, df['Split Text'])))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud1)
plt.axis('off')

1.png

This code is dense. We used the WordCloud function and had to use generate, map, and join as inner functions. All of these functions were needed to take the words from the dataframe and turn them into a single string for the WordCloud function.
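An equivalent, more readable sketch of the same step builds the string first and then generates the cloud:

# step by step: concatenate the cleaned reviews, then build the cloud from the string
text=' '.join(df['Split Text'].astype(str))
wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(text)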

The rest of the code is standard matplotlib, so it does not require much explanation. As you look at the word cloud, you can see that the most common words include top, look, dress, shirt, fabric, etc. This is reasonable given that these are women’s reviews of clothing.

Conclusion

This post provided an example of text analysis using word clouds in Python. The insights here are primarily descriptive in nature. This means that if the desire is prediction or classification, additional tools would need to be built on top of what is shown here.

Principal Component Analysis in Python

Principal component analysis is a form of dimension reduction commonly used in statistics. By dimension reduction, it is meant to reduce the number of variables without losing too much overall information. This has the practical application of speeding up computational times if you want to run other forms of analysis such as regression but with fewer variables.

Another application of principal component analysis is for data visualization. Sometimes, you may want to reduce many variables to two in order to see subgroups in the data.

Keep in mind that in either situation PCA works better when there are high correlations among the variables. The explanation is complex but has to do with the rotation of the data which helps to separate the overlapping variance.

Prepare the Data

We will be using the pneumon dataset from the pydataset module. We want to try and explain the variance with fewer variables than in the dataset. Below is some initial code.

import pandas as pd
from sklearn.decomposition import PCA
from pydataset import data
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

Next, we will set up our dataframe. We will only take the first 200 examples from the dataset. If we take all of them (over 3,000 examples), the visualization will be a giant blob of dots that cannot be interpreted. We will also drop any missing values. Below is the code

df = data('pneumon')
df=df.dropna()
df=df.iloc[0:199,]

When doing a PCA, it is important to scale the data because PCA is sensitive to this. The result of the scaling process is an array, which loses the column names. We therefore convert the array back to a dataframe and rename the columns so the variables stay identifiable. All this is done in the code below.

scaler = StandardScaler() #instance
df_scaled = scaler.fit_transform(df) #scaled the data
df_scaled= pd.DataFrame(df_scaled) #made the dataframe
df_scaled=df_scaled.rename(index=str, columns={0: "chldage", 1: "hospital",2:"mthage",3:"urban",4:"alcohol",5:"smoke",6:"region",7:"poverty",8:"bweight",9:"race",10:"education",11:"nsibs",12:"wmonth",13:"sfmonth",14:"agepn"}) # renamed columns

Analysis

We are now ready to do our analysis. We first use the PCA function to indicate how many components we want. For our first example, we will have two components. Next, you use the .fit_transform function to fit the model. Below is the code.

pca_2c=PCA(n_components=2)
X_pca_2c=pca_2c.fit_transform(df_scaled)

Now we can see the variance explained by component and the sum

pca_2c.explained_variance_ratio_
Out[199]: array([0.18201588, 0.12022734])

pca_2c.explained_variance_ratio_.sum()
Out[200]: 0.30224321247148167

In the first line of output, we can see that the first component explained 18% of the variance and the second explained 12%, for a total of about 30%. Below is a visual of our two-component model; the color represents the race of the respondent. The three different colors represent three different races.
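The plotting code is not shown in the original; a minimal sketch, assuming the scaled “race” column can drive the colors, would be:

# scatter the two components, colored by race
plt.scatter(X_pca_2c[:,0],X_pca_2c[:,1],c=df_scaled['race'],alpha=.8,edgecolors='none')
plt.show()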

1.png

Our two components do a reasonable job of separating the data. Below is the code for making four components. We cannot graph four components since our graph can only handle two, but you will see that as we increase the number of components we also increase the variance explained.

pca_4c=PCA(n_components=4)

X_pca_4c=pca_4c.fit_transform(df_scaled)

pca_4c.explained_variance_ratio_
Out[209]: array([0.18201588, 0.12022734, 0.09290502, 0.08945079])

pca_4c.explained_variance_ratio_.sum()
Out[210]: 0.4845990164486457

With four components we now have almost 50% of the variance explained.

Conclusion

PCA is for summarizing and reducing the number of variables used in an analysis or for the purposes of data visualization. Once this process is complete, you can use the results for further analysis if you desire.

Visualizing Clustered Data in R

In this post, we will look at how to visualize multivariate clustered data. We will use the “Hitters” dataset from the “ISLR” package. We will use the features of the various baseball players as the dimensions for the clustering. Below is the initial code

library(ISLR);library(cluster)
data("Hitters")
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

Data Preparation

We need to remove all of the factor variables, as the kmeans algorithm cannot support factor variables. In addition, we need to remove the “Salary” variable because it has missing data. Lastly, we need to scale the data because scaling affects the results of the clustering. The code for all of this is below.

hittersScaled<-scale(Hitters[,c(-14,-15,-19,-20)])

Data Analysis

We will set the k for the kmeans to 3. This can be set to any number and it often requires domain knowledge to determine what is most appropriate. Below is the code

kHitters<-kmeans(hittersScaled,3)

We now look at some descriptive stats. First, we will see how many examples are in each cluster.

table(kHitters$cluster)
## 
##   1   2   3 
## 116 144  62

The groups are mostly balanced. Next, we will look at the mean of each feature by cluster. This will be done with the “aggregate” function, using the original data and grouping by the three clusters.

round(aggregate(Hitters[,c(-14,-15,-19,-20)],FUN=mean,by=list(kHitters$cluster)),1)
##   Group.1 AtBat  Hits HmRun Runs  RBI Walks Years CAtBat  CHits CHmRun
## 1       1 522.4 143.4  15.1 73.8 66.0  51.7   5.7 2179.1  597.2   51.3
## 2       2 256.6  64.5   5.5 30.9 28.6  24.3   5.6 1377.1  355.6   24.7
## 3       3 404.9 106.7  14.8 54.6 59.4  48.1  15.1 6480.7 1783.4  207.5
##   CRuns  CRBI CWalks PutOuts Assists Errors
## 1 299.2 256.1  199.7   380.2   181.8   11.7
## 2 170.1 143.6  122.2   209.0    62.4    5.8
## 3 908.5 901.8  694.0   303.7    70.3    6.4

Now we can see some differences. It seems group 1 is young (5.7 years of experience) starters, based on the number of at-bats they get. Group 2 is young players who may not get to start, given the lower number of at-bats they receive. Group 3 is older (15.1 years) players who receive significant playing time and have put together impressive career statistics.

Now we will create our visual of the three clusters. For this, we use the “clusplot” function from the “cluster” package.

clusplot(hittersScaled,kHitters$cluster,color = T,shade = T,labels = 4)

1.png

In general, there is little overlap between the clusters. The overlap between groups 1 and 2 may be due to both groups having a similar amount of experience.

Conclusion

Visualizing the clusters can help with developing insights into the groups found during the analysis. This post provided one example of this.

Multidimensional Scale in R

In this post, we will explore multidimensional scaling (MDS) in R. The main benefit of MDS is that it allows you to plot multivariate data in two dimensions. This allows you to create visuals of complex models. In addition, an MDS plot allows you to see relationships among examples in a dataset based on how close or far they are from each other.

We will use the “College” dataset from the “ISLR” package to create an MDS of the colleges that are in the data set. Below is some initial code.

library(ISLR);library(ggplot2)
data("College")
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

Data Preparation

After using the “str” function, we know that we need to remove the variable “Private” because it is a factor, and the type of MDS we are doing can only accommodate numerical variables. After removing this variable, we make a matrix using the “as.matrix” function. Once the matrix is ready, we can use the “cmdscale” function to create the actual two-dimensional MDS. Another point to mention is that, for the sake of simplicity, we are only going to use the first ten colleges in the dataset. The reason is that using all 777 would make it hard to understand the plots we will make. Below is the code.

collegedata<-as.matrix(College[,-1])
collegemds<-cmdscale(dist(collegedata[1:10,]))

Data Analysis

We can now make our initial plot. The xlim and ylim arguments had to be played with a little for the plot to display properly. In addition, the “text” function was used to provide additional information such as the names of the colleges.

plot(collegemds,xlim=c(-15000,10000),ylim=c(-15000,10000))
text(collegemds[,1],collegemds[,2],rownames(collegemds))

1.png

From the plot, you can see that even with only ten names it is messy. The colleges are mostly clumped together, which makes it difficult to interpret. We can plot this as a four-quadrant graph using “ggplot2”. First, we need to convert the matrix we created to a dataframe.

collegemdsdf<-as.data.frame(collegemds)

We are now ready to use “ggplot” to create the four quadrant plot.

p<-ggplot(collegemdsdf, aes(x=V1, y=V2)) +
        geom_point() +
        lims(x=c(-10000,8000),y=c(-4000,5000)) +
        theme_minimal() +
        coord_fixed() +  
        geom_vline(xintercept = 5) + geom_hline(yintercept = 5)+geom_text(aes(label=rownames(collegemdsdf)))
p+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

1

We set the horizontal and vertical lines at the x- and y-intercepts respectively. By doing this it is much easier to understand and interpret the graph. Agnes Scott College is way off to the left, while Alaska Pacific University, Abilene Christian College, and even Alderson-Broaddus College are clumped together. The rest of the colleges straddle the area below the x-axis.

Conclusion

In this example, we took several variables and condensed them to two dimensions. This is the primary benefit of MDS. It allows you to visualize what cannot be visualized normally. The visualization allows you to see the structure of the data, from which you can draw inferences.

Probability Distribution and Graphs in R

In this post, we will use probability distributions and ggplot2 in R to solve a hypothetical example. This provides a practical example of the use of R in everyday life through the integration of several statistical and coding skills. Below is the scenario.

At a bus company, the average number of stops for a bus is 81 with a standard deviation of 7.9. The data is normally distributed. Knowing this, complete the following.

  • Calculate the interval value to use using the 68-95-99.7 rule
  • Calculate the density curve
  • Graph the normal curve
  • Evaluate the probability of a bus having less than 65 stops
  • Evaluate the probability of a bus having more than 93 stops

Calculate the Interval Value

Our first step is to calculate the interval value. This is the range within which 99.7% of the values fall. Doing this requires knowing the mean and the standard deviation and adding/subtracting three standard deviations to/from the mean. Below is the code for this.

busStopMean<-81
busStopSD<-7.9
busStopMean+3*busStopSD
## [1] 104.7
busStopMean-3*busStopSD
## [1] 57.3

The values above mean that we can set our interval between 55 and 110, with 100 fictitious buses in the data. Below is the code to set the interval.

interval<-seq(55,110, length=100) # length here represents 100 fictitious buses

Density Curve

The next step is to calculate the density curve. This is done with our knowledge of the interval, mean, and standard deviation. We also need to use the “dnorm” function. Below is the code for this.

densityCurve<-dnorm(interval,mean=81,sd=7.9)

We will now plot the normal curve of our data using ggplot. First we need to put our “interval” and “densityCurve” variables in a dataframe. We will call the dataframe “normal” and then we will create the plot. Below is the code.

library(ggplot2)
normal<-data.frame(interval, densityCurve)
ggplot(normal, aes(interval, densityCurve))+geom_line()+ggtitle("Number of Stops for Buses")

282deee2-ff95-488d-ad97-471b74fe4cb8

Probability Calculation

We now want to determine the probability of a bus having fewer than 65 stops. To do this we use the “pnorm” function in R and include the value 65 along with the mean and standard deviation, and tell R we want the lower tail only. Below is the code for completing this.

pnorm(65,mean = 81,sd=7.9,lower.tail = TRUE)
## [1] 0.02141744

As you can see, at about 2% it would be unusual for a bus to have fewer than 65 stops. We can also plot this using ggplot. First, we compute a cumulative distribution using the “pnorm” function, combine it with our “interval” variable in a dataframe, and then use this information to make a plot in ggplot2. Below is the code.

CumulativeProb<-pnorm(interval, mean=81,sd=7.9,lower.tail = TRUE)
pnormal<-data.frame(interval, CumulativeProb)
ggplot(pnormal, aes(interval, CumulativeProb))+geom_line()+ggtitle("Cumulative Density of Stops for Buses")

9667dd01-f7d3-4025-8995-b6441a3735d0.png

Second Probability Problem

We will now calculate the probability of a bus having 93 or more stops. To make it more interesting, we will create a plot that shades the area under the curve for 93 or more stops. The code is a little too complex to explain, so just enjoy the visual.

pnorm(93,mean=81,sd=7.9,lower.tail = FALSE)
## [1] 0.06438284
x<-interval  
ytop<-dnorm(93,81,7.9)
MyDF<-data.frame(x=x,y=densityCurve)
p<-ggplot(MyDF,aes(x,y))+geom_line()+scale_x_continuous(limits = c(50, 110))+ggtitle("Probability of 93 Stops or More is 6.4%")
shade <- rbind(c(93,0), subset(MyDF, x > 93), c(MyDF[nrow(MyDF), "x"], 0))

p + geom_segment(aes(x=93,y=0,xend=93,yend=ytop)) +
        geom_polygon(data = shade, aes(x, y))

b42a7c19-1992-4df1-95cc-40ea097058de

Conclusion

A lot of work was done, but all in a practical manner, looking at a realistic problem. We were able to calculate several different probabilities and graph them accordingly.

Using Maps in ggplot2

It seems as though there are no limits to what can be done with ggplot2. Another example of this is the use of maps to present data. If you are trying to share information that depends on location, then this is an important feature to understand.

This post will provide some basic explanation for understanding how to use maps with ggplot2.

The Maps Package

One of several packages available for using maps with ggplot2 is the “maps” package. This package contains a limited number of maps along with several databases that contain information that can be used to create data-filled maps.

The “maps” package cooperates with ggplot2 through the use of the “borders” function and by mapping latitude and longitude in the “aes” function. After you have installed the “maps” package you can run the example code below.

library(ggplot2);library(maps)
ggplot(us.cities,aes(long,lat))+geom_point()+borders("state")

4a57f730-d006-49c6-b960-60c03ba6ce7b.png

In the code above we told R to use the data from “us.cities”, which comes with the “maps” package. We then told R to graph the latitude and longitude by placing a point for each city. Lastly, the “borders” function was used to place this information on the state map of the US.

There are several points way off of the map. These represent data points for cities in Alaska and Hawaii.

Below is an example that is limited to one state in America. To do this we first must subset the data to only include one state.

tx_cities<-subset(us.cities,country.etc=="TX")
ggplot(tx_cities,aes(long,lat))+geom_point()+borders(database = "state",regions = "texas")

7ad1cf89-6f81-479d-b61f-3935590ee2c6

The map shows all the cities in the state of Texas that are pulled from the “us.cities” dataset.

We can also play with the colors of the maps just like any other ggplot2 output. Below is an example.

data("world.cities")
Thai_cities<-subset(world.cities, country.etc=="Thailand")
ggplot(Thai_cities,aes(long,lat))+borders("world","Thailand", fill="light blue",col="dark blue")+geom_point(aes(size=pop),col="dark red")

820ad721-b4c1-427f-b044-5473a887ae2b.png

In the example above, we took all of the cities in Thailand and saved them into the variable “Thai_cities”. We then made a plot of Thailand, playing with the color and fill features. Lastly, we plotted the population by location and indicated that the size of each data point should depend on the population. In this example, all the data points came out the same size, which means that all the cities of Thailand in the dataset are about the same size.

We can also add text to maps. In the example below, we will use a subset of the data from Thailand and add the names of cities to the map.

Big_Thai_cities<-subset(Thai_cities, pop>100000)
ggplot(Big_Thai_cities,aes(long,lat))+borders("world","Thailand", fill="light blue",col="dark blue")+geom_point(aes(size=pop),col="dark red")+geom_text(aes(long,lat,label=name),hjust=-.2,size=3)

f3fed191-5454-47dd-a754-cae713beebf2.png

In this plot there is a messy part in the middle where Bangkok sits along with several other large cities. However, you can see the flexibility of the plot by adding the “geom_text” function, which has been discussed previously. In the “geom_text” function we added some aesthetics as well as the “name” of each city.

Conclusion

In this post, we looked at some of the basic ways of using maps with ggplot2. There are many more ways and features that can be explored in future posts.

Axis and Title Modifications in ggplot2

This post will explain how to customize the axes and title of a plot that utilizes ggplot2. We will use the “Computers” dataset from the “Ecdat” package, looking specifically at the difference in the price of computers based on the inclusion of a CD-ROM. Below is some code needed to prepare for the examples, along with a printout of our initial boxplot.

library(ggplot2);library(grid);library("Ecdat")
data("Computers")
theBoxplot<-ggplot(Computers,aes(cd, price, fill=cd))+geom_boxplot()
theBoxplot

82c9eab1-07c5-4358-8299-804b4a8df22f.png

In the example below, we change the color of the axis text to purple and bold it. This involves the use of the “axis.text” argument in the “theme” function.

theBoxplot + theme(axis.text=element_text(color="purple",face="bold"))

13bb4565-1ed9-4766-9e57-67bbe08387a0.png

In the example below, the y-axis label “price” is rotated to sit horizontally, in line with the text. This is accomplished using the “axis.title.y” argument along with additional code.

theBoxplot+theme(axis.title.y=element_text(size=rel(1.5),angle = 0))

[Plot: boxplot with a horizontal, enlarged y-axis title]

Below is an example that includes a title along with a change to its default size and color.

theBoxplot+labs(title="The Example")+theme(plot.title=element_text(size=rel(1.5),color="orange"))

[Plot: boxplot titled “The Example” in orange]

You can also remove axis labels. In the example below, we remove the x-axis text along with its tick marks and title.

theBoxplot+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.title.x=element_blank())

[Plot: boxplot with x-axis text, ticks, and title removed]

It is also possible to modify the plot background as well. In the example below, we change the background color to blue, the major grid lines to green, and the minor grid lines to yellow.

This is not an attractive plot, but it does provide an example of the various options available in ggplot2.

theBoxplot + theme(panel.background = element_rect(fill = "blue"),
        panel.grid.major = element_line(color = "green", size = 3),
        panel.grid.minor = element_line(color = "yellow", linetype = "solid", size = 2))

[Plot: boxplot with a blue background and green/yellow grid lines]

All of the tricks we have discussed so far can also apply when faceting data. Below we make a scatterplot using the same background as before but comparing trend and price.

theScatter <- ggplot(Computers, aes(trend, price, color = cd)) + geom_point()
theScatter1 <- theScatter + facet_grid(.~cd) +
        theme(panel.background = element_rect(fill = "blue"),
              panel.grid.major = element_line(color = "green", size = 3),
              panel.grid.minor = element_line(color = "yellow", linetype = "solid", size = 2))
theScatter1

[Plot: faceted scatterplot of trend vs. price]

Right now the plots are too close to each other. We can account for this by modifying the panel margins; a note on the newer name for this argument follows the plot below.

theScatter1 +theme(panel.margin=unit(2,"cm"))

[Plot: faceted scatterplot with wider panel margins]
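
As a side note, “panel.margin” has been renamed in newer versions of ggplot2. A sketch of the equivalent call with the newer spelling is below.

# In newer versions of ggplot2, panel.margin is called panel.spacing
theScatter1 + theme(panel.spacing = unit(2, "cm"))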

Conclusion

These examples provide further evidence of the endless variety that is available when using ggplot2. Whatever your purpose, it is highly probable that ggplot2 has some sort of data visualization answer.

Axis and Labels in ggplot2

In this post, we will look at how to manipulate the labels and positioning of the data when using ggplot2. We will use the “Wage” data from the “ISLR” package. Below is initial code needed to begin.

library(ggplot2);library(ISLR)
data("Wage")

Manipulating Labels

Our first example involves adding labels for the x and y-axis as well as a title. To do this, we will create a histogram of the wage variable and save it as a variable in R. Saving the histogram as a variable saves time, as we do not have to recreate all of the code but only add the additional information. After creating the histogram and saving it, we will add the code for creating the labels. Below is the code.

myHistogram<-ggplot(Wage, aes(wage, fill=..count..))+geom_histogram()
myHistogram+labs(title="This is My Histogram", x="Salary as a Wage", y="Number")
[Plot: histogram with title and axis labels]

By using the “labs” function you can add a title and information for the x and y-axis. If your title is really long, you can use “\n” in the title to break the information into separate lines, as shown below.

myHistogram+labs(title="This is the Longest Title for a Histogram \n that I have ever Seen in My Entire Life", x="Salary as a Wage", y="Number")
[Plot: histogram with a two-line title]

Discrete Axis Scale

We will now turn our attention to working with discrete scales. Discrete scales deal with categorical data such as box plots and bar charts. First, we will store a boxplot of the wages subsetted by the level of education in a variable and we will display it.

myBoxplot<-ggplot(Wage, aes(education, wage,fill=education))+geom_boxplot()
myBoxplot

[Plot: boxplot of wage by education]

Now, by using the “scale_x_discrete” function along with the “limits” argument, we are able to change the order of the groups, as shown below.

myBoxplot+scale_x_discrete(limits=c("5. Advanced Degree","2. HS Grad","1. < HS Grad","4. College Grad","3. Some College"))

[Plot: boxplot with reordered education groups]
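
A related tweak, shown here as a sketch rather than something from the original example, is the “labels” argument, which renames the groups without changing their order. The shortened names below are arbitrary.

# Rename the education groups with shorter labels (order is unchanged)
myBoxplot + scale_x_discrete(labels = c("< HS", "HS Grad", "Some College",
                                        "College Grad", "Advanced"))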

Continuous Scale

The most common modification to a continuous scale is to modify the range. In the code below, we change the default range of “myBoxplot” to something that is larger.

myBoxplot+scale_y_continuous(limits=c(0,400))

[Plot: boxplot with the y-axis range set to 0–400]
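
One caution is worth adding: limits set on a scale drop any data outside the range before the statistics are computed, which can change the boxplot itself. To zoom in without discarding data, “coord_cartesian” is the usual alternative, sketched below.

# Zoom the view without removing out-of-range points from the calculations
myBoxplot + coord_cartesian(ylim = c(0, 400))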

Conclusion

This post provided some basic insights into modifying plots using ggplot2.

Pie Charts and More Using ggplot2

This post will explain several types of visuals that can be developed using ggplot2. In particular, we are going to make three specific types of charts. They are…

  • Pie chart
  • Bullseye chart
  • Coxcomb diagram

To complete this task, we will use the “Wage” dataset from the “ISLR” package. We will use the “education” variable, which is a factor with five levels. Below is the initial code to get started.

library(ggplot2);library(ISLR)
data("Wage")

Pie Chart

In order to make a pie chart, we first need to make a bar chart and then add several pieces of code to change it into a pie chart. Below is the code for making a regular bar plot.

ggplot(Wage, aes(education, fill=education))+geom_bar()

[Plot: bar chart of education]

We will now modify two parts of the code. First, we do not want separate bars; we want a single stacked bar, because a pie chart is really one bar wrapped into a circle. Therefore, for the x value in the “aes” function, we use the argument “factor(1)”, which tells R to treat the data as one group and thus draw one bar. We also add “width=1” inside the “geom_bar” function, which helps with spacing. Below is the code for this.

ggplot(Wage, aes(factor(1), fill=education))+geom_bar(width=1)

[Plot: single stacked bar of education]

To make the pie chart, we add the “coord_polar” function to the code, which adjusts the mapping. We include the argument theta="y", which tells R that the size of each slice depends on the number of people in that factor. Below is the code for the pie chart.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=1)+coord_polar(theta="y")

[Plot: pie chart of education]

By changing the “width” argument you can place a hole in the middle of the chart, turning the pie into a donut, as shown below.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=.5)+coord_polar(theta="y")

[Plot: donut chart of education]

Bullseye Chart

A bullseye chart is a pie chart that displays the information as concentric rings instead of wedges. The coding is mostly the same except that you remove the “theta” argument from the “coord_polar” function. The thicker a ring, the more respondents within it. Below is the code.

ggplot(Wage, aes(factor(1), fill=education))+
geom_bar(width=1)+coord_polar()

[Plot: bullseye chart of education]

Coxcomb Diagram

The Coxcomb diagram is similar to the pie chart, but the data is not normalized to fit the entire area of the circle. To make this plot, we modify the code by replacing “factor(1)” with the name of the variable and keeping the “coord_polar” function without the “theta” argument. Below is the code.

ggplot(Wage, aes(education, fill=education))+
geom_bar(width=1)+coord_polar()

[Plot: Coxcomb diagram of education]

Conclusion

These are just some of the many forms of visualizations available using ggplot2. Which to use depends on many factors from personal preference to the needs of the audience.

Adding Text and Lines to Plots in R

There are times when a researcher may want to add annotated information to a plot. Examples of annotation include text and different kinds of lines that clarify information. In this post we will learn how to add lines and text to a plot. For the lines, we mean lines that are added manually and not through some sort of statistical transformation such as regression or smoothing.

In order to do this we will use the “Caschool” dataset from the “Ecdat” package and will make several histograms that display test scores. Below is the initial coding information that is needed.

library(ggplot2);library(Ecdat)
data("Caschool")

There are three lines that can be added manually using ggplot2. They are listed below; since “geom_abline” is not used in the examples that follow, a short sketch of it comes right after the list.

  • geom_vline = vertical line
  • geom_hline = horizontal line
  • geom_abline = slope/intercept line
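
Below is a minimal sketch of “geom_abline”; the slope and intercept are arbitrary values chosen only for illustration.

# Draw an arbitrary reference line over a scatterplot of the Caschool data
ggplot(Caschool, aes(str, testscr)) + geom_point() +
        geom_abline(slope = -2, intercept = 700, color = "red")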

In the code below, we are going to make a histogram of the test scores in the “Caschool” dataset. We are also going to add a vertical yellow line set at the median. Below is the code.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")

[Plot: histogram with a yellow median line]

By adding aesthetic information to the “geom_vline” function we add the line depicting the median. We will now use the same code but add a horizontal line. Below is the code.

ggplot(Caschool,aes(testscr))+geom_histogram()+
geom_vline(aes(xintercept=median(testscr)),color="yellow")+
geom_hline(aes(yintercept=15), color="blue")

[Plot: histogram with vertical median line and horizontal blue line]

The horizontal line we added was at the arbitrary point of 15 on the y axis. We could have set it anywhere we wanted by specifying a value for the y-intercept.

In the next histogram we are going to add text to the graph. Text provides further explanation about what is happening in the plot. We are going to use the same code as before but provide additional information about the yellow median line. We are going to explain that the yellow line marks the median and we will provide the value of the median.

ggplot(Caschool,aes(testscr))+geom_histogram()+
        geom_vline(aes(xintercept=median(testscr)),color="yellow")+
        geom_hline(aes(yintercept=15), color="blue")+
        geom_text(aes(x=median(Caschool$testscr),
           y=30),label="Median",hjust=1, size=9)+
        geom_text(aes(x=median(Caschool$testscr),
           y=30,label=round(median(testscr),digits=0)),hjust=-0.5, size=9)
[Plot: histogram with median line and text labels]

Most of the code above is review, but we did add the “geom_text” function. Here is what’s happening: inside the function we add aesthetic information indicating that the label “Median” should be placed at the median of the test scores for the x value and at the arbitrary y value of 30. We also offset the placement using the “hjust” argument.

For the second label, we calculate the actual median and round it to zero decimal places. This result is also offset slightly. Lastly, for both labels we set the text size to 9 to make them easier to read.

Our next example involves annotating. Using ggplot2 we can highlight a specific area of the histogram. In the example below we highlight the middle two quartiles.

ggplot(Caschool,aes(testscr))+geom_histogram()+geom_vline(aes(xintercept=median(testscr)),color="yellow")+
        geom_hline(aes(yintercept=15), color="blue")+
        geom_text(aes(x=median(Caschool$testscr),y=30),
           label="Median",hjust=1, size=9)+
        geom_text(aes(x=median(Caschool$testscr),y=30,
           label=round(median(testscr),digits=0)),hjust=-0.5, size=9)+
        annotate("rect",xmin=quantile(Caschool$testscr, probs = 0.25),
                 xmax = quantile(Caschool$testscr, probs=0.75),ymin=0, 
                 ymax=45, alpha=.2, fill="red")
[Plot: histogram with the middle quartiles shaded in red]

The information inside the “annotate” function includes the “rect” argument, which indicates that we are adding a rectangle. Next, we set the xmin value to the 25th percentile and the xmax to the 75th percentile. We also indicate the values for the y-axis, add some transparency with the “alpha” argument, and set the color of the annotated area to red.
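
The “annotate” function is not limited to rectangles; the same approach can place a one-off text label without mapping aesthetics. Below is a sketch with arbitrary coordinates and wording.

# annotate() can also place a single text label at fixed coordinates
ggplot(Caschool, aes(testscr)) + geom_histogram() +
        annotate("text", x = 620, y = 40, label = "An arbitrary note", size = 4)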

Our final example involves the use of facets. We are going to split the data by school district type and show how you can add lines and labels to one facet while leaving the other alone. The second plot will include a line based on the median while the first plot will not.

ggplot(Caschool, aes(testscr, fill = grspan)) + geom_histogram() +
        geom_vline(data = subset(Caschool, grspan == "KK-08"),
                   aes(xintercept = median(testscr)), color = "yellow") +
        geom_text(data = subset(Caschool, grspan == "KK-08"),
                  aes(x = median(Caschool$testscr), y = 35),
                  label = round(median(Caschool$testscr), digits = 0),
                  hjust = -0.2, size = 9) +
        geom_text(data = subset(Caschool, grspan == "KK-08"),
                  aes(x = median(Caschool$testscr), y = 35),
                  label = "Median", hjust = 1, size = 9) +
        facet_grid(.~grspan)
[Plot: faceted histograms with a median line and labels on one facet only]

Conclusion

Adding lines and text and understanding how to annotate provide additional tools for those who need to communicate data in a visual way.

Histograms and Colors with ggplot2

In this post, we will look at how ggplot2 is able to create variables for the purpose of providing aesthetic information for a histogram. Specifically, we will look at how ggplot2 calculates the bin sizes and then assigns colors to each bin depending on the count or density of that particular bin.

To do this we will use the dataset called “Star” from the “Ecdat” package. From the dataset, we will look at the total math score and make several different histograms. Below is the initial code you need to begin.

library(ggplot2);library(Ecdat)
data(Star)

We will now create our initial histogram. What is new in the code below is the “..count..” for the “fill” argument. This tells R to fill the bins based on their count, or the number of data points that fall in each bin. By doing this, we get a gradation of colors, with darker colors indicating more data points and lighter colors indicating fewer. The code is as follows.

ggplot(Star, aes(tmathssk, fill=..count..))+geom_histogram()
[Plot: histogram filled by count]

As you can see, we have a nice histogram that uses color to indicate how common data in a specific bin is. We can also make a histogram that has a line indicating the kernel density estimate of the data. This is similar in spirit to adding a LOESS line to a plot. The code is below.

ggplot(Star, aes(tmathssk)) + geom_histogram(aes(y=..density.., fill=..density..))+geom_density()
[Plot: histogram with a density line]

The code is mostly the same, but we moved the “fill” argument inside the “geom_histogram” function and added a second “aes” function. We also included a y argument inside the second “aes” function. Instead of using “..count..” we used “..density..”, as this is needed to create the line. Lastly, we added the “geom_density” function.
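
As an aside, “geom_histogram” uses 30 bins by default and prints a message suggesting you pick a better value. Below is a sketch of setting the bin width explicitly; the value of 25 is an arbitrary choice.

# Control the bin width directly instead of relying on the default 30 bins
ggplot(Star, aes(tmathssk, fill = ..count..)) + geom_histogram(binwidth = 25)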

The chart below uses the “alpha” argument to add transparency to the histogram. This allows us to communicate additional information. In the histogram below we can see information about gender and how common a particular gender-and-bin combination is in the data.

ggplot(Star, aes(tmathssk, col=sex, fill=sex, alpha=..count..))+geom_histogram()
[Plot: overlapping histograms by sex with count-based transparency]

Conclusion

What we have learned in this post is some of the basic features of ggplot2 for creating various histograms. Through the use of colors, a researcher is able to display useful information in an interesting way.

Linear Regression Lines and Facets in ggplot2

In this post, we will look at how to add a regression line to a plot using the “ggplot2” package. This is mostly a review of what we learned in the post on adding a LOESS line to a plot. The main difference is that a regression line is a straight line that represents the relationship between the x and y variable while a LOESS line is used mostly to identify trends in the data.

One new wrinkle we will add to this discussion is the use of faceting when developing plots. Faceting is the creation of multiple plots simultaneously, with each plot showing a different subset of the data.

The data we will use is the “Housing” dataset from the “Ecdat” package. We will examine how lot size affects housing price while also considering whether the house has central air conditioning. Below is the initial code needed to prepare for the analysis.

library(ggplot2);library(Ecdat)
data("Housing")

The first plot we will make is the basic plot of lotsize and price with the data being distinguished by having central air or not, without a regression line. The code is as follows

ggplot(data=Housing, aes(x=lotsize, y=price, col=airco))+geom_point()

[Plot: scatterplot of lotsize vs. price colored by airco]

We will now add the regression line to the plot. We will make a new plot with an additional piece of code.

ggplot(data=Housing, aes(x=lotsize, y=price, col=airco))+geom_point()+stat_smooth(method='lm')

[Plot: scatterplot with separate regression lines for each airco group]

As you can see we get two lines for the two conditions of the data. If we want to see the overall regression line we use the code that is shown below.

ggplot() + geom_point(data=Housing, aes(x=lotsize, y=price, col=airco)) +
        stat_smooth(data=Housing, aes(x=lotsize, y=price), method='lm')

[Plot: scatterplot with one overall regression line]

We will now experiment with a technique called faceting. Faceting allows you to split the data by various subgroups and display the results in side-by-side plots. For example, below is the code for splitting the data by central air when examining the relationship between lot size and price.

ggplot(data=Housing, aes(lotsize, price, col=airco))+geom_point()+stat_smooth(method='lm')+facet_grid(.~airco)

[Plot: faceted scatterplots with regression lines]

By adding the “facet_grid” function we can subset the data by the categorical variable “airco”.
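
As a sketch of an alternative not used in the original example, “facet_wrap” is a common substitute for “facet_grid”; it wraps the panels across rows, which is handy when a variable has many levels.

# facet_wrap wraps panels into rows instead of forcing a single strip
ggplot(data=Housing, aes(lotsize, price, col=airco)) + geom_point() +
        stat_smooth(method="lm") + facet_wrap(~airco)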

In the code below we have three plots. The first two show the relationship between lotsize and price based on central air and the last plot shows the overall relationship.

ggplot(data=Housing, aes(lotsize, price, col=airco))+geom_point()+stat_smooth(method="lm")+facet_grid(.~airco, margins = TRUE)

[Plot: faceted scatterplots plus an overall margin panel]

By adding the argument “margins” and setting it to true we are able to add the third plot that shows the overall results.

So far, all of our faceted plots have had the same statistical transformation: a linear regression. However, we can mix the types of transformations that happen when faceting the results. This is shown below.

ggplot(data=Housing, aes(lotsize, price, col=airco)) + geom_point() +
        stat_smooth(data=subset(Housing, airco=="yes")) +
        stat_smooth(data=subset(Housing, airco=="no"), method="lm") +
        facet_grid(.~airco)

[Plot: facets with a LOESS line for houses with central air and a regression line for houses without]