Category Archives: python

Fraud Detection with Python: Sampling-VIDEO

Fraud detection is a critical tool used in a variety of industries. The video shares basic tips for examining the data and how to deal with data imbalances.

SMOTE & Logistic Regression with Python


In this post, we use logistic regression together with the SMOTE sampling technique to improve our model’s ability to detect fraud. SMOTE creates synthetic cases of fraud in order to balance the number of fraud and non-fraud cases in the dataset. We will begin by loading our libraries.

Libraries

The libraries we are using are below. As we use these libraries, they will be explained.

from imblearn.pipeline import Pipeline 
from imblearn.over_sampling import SMOTE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Data Preparation

We load our data using the .read_csv() method from pandas. The object data_loc was created to store the location of the data on the computer. The data used in this example is not available. After loading the data, we use .shape to see how many columns and rows of data we have. The code and output are below

df = pd.read_csv(data_loc)
df.shape
(5050, 31)

You can see that we have 5050 rows and 31 columns. Next, we need to separate the X values from the y value. Because .iloc uses zero-based positions and excludes the end of a slice, 1:30 selects the second through thirtieth columns (positions 1 to 29) as the X values, and position 30, the last column, is the y value. The code below completes all of this for us.

X = df.iloc[:, 1:30]    
X = np.array(X).astype('float')    
y = df.iloc[:, 30]    
y=np.array(y).astype('float')
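
Because .iloc works with zero-based positions and leaves out the end of a slice, it can help to check a selection on a small stand-in frame first. The sketch below uses a hypothetical five-column DataFrame (not the fraud data) to show which columns a slice such as 1:4 picks up.

```python
import pandas as pd

# Hypothetical stand-in frame: an id column, three features, and a label
toy = pd.DataFrame({
    "id": [1, 2],
    "f1": [0.1, 0.2],
    "f2": [1.5, 2.5],
    "f3": [3.0, 4.0],
    "label": [0, 1],
})

# Positions 1, 2 and 3 are selected; position 4 (the slice end) is excluded
features = toy.iloc[:, 1:4]
# A single integer selects exactly one column, here the last one
target = toy.iloc[:, 4]

print(list(features.columns))  # ['f1', 'f2', 'f3']
print(target.name)             # label
```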

In the code below, we are creating our train and test sets. We are going to split our X and y objects so that 70% of the data is for training and 30% of the data is for testing purposes. The function train_test_split() is used for this, with the argument test_size being set to 0.3 for 30% test data and the random_state being set to 0, which is the seed number.

# Split X and y into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
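
With so few fraud cases, a purely random split can leave the test set with very few positives. One option, which the code above does not use, is the stratify argument of train_test_split(), which keeps the class proportions the same in both sets. The sketch below uses made-up data (1,000 rows, 10 positives) purely for illustration; the names X_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 990 negatives and 10 positives
X_demo = np.arange(1000, dtype=float).reshape(-1, 1)
y_demo = np.array([0.0] * 990 + [1.0] * 10)

# stratify=y_demo keeps the share of positives equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(int(y_tr.sum()), int(y_te.sum()))  # 7 3
```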

Pipeline Development

A pipeline is used to chain several actions together sequentially and is similar to piping in R. To do this, we are using Pipeline from the imblearn library. The imblearn library is used to address class imbalance in datasets, which our data has. We will build the pipeline by first creating an instance of SMOTE and an instance of logistic regression. We do this because these are the two objects we will pipe one after the other.

Next, we will actually create our pipe. We create an object called “pipeline” using Pipeline. Inside it are two tuples, each pairing a name with an object. The first is for SMOTE and uses the resampling object we create at the beginning of this cell, and the second holds the logistic regression model object. Also, notice how both tuples are wrapped inside square brackets, which makes them a list. The code for all of this is below.

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE()
model = LogisticRegression(max_iter=1000)

# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression',model)])

What we did in this code was tell Python to use SMOTE to create synthetic cases of instances of fraud. Once the resampling is completed, the resampled data will be used to train the model.
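
To make the idea of a synthetic case concrete, the sketch below interpolates between one minority point and one of its neighbors, which is the core step of SMOTE as it is commonly described. It is a simplified illustration with made-up points and a hypothetical helper name (make_synthetic), not the imblearn implementation, which also handles the nearest-neighbor search and decides how many points to generate.

```python
import numpy as np

def make_synthetic(point, neighbor, rng):
    """Return a new point on the line segment between point and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return point + gap * (neighbor - point)

rng = np.random.default_rng(0)

# Two made-up minority-class observations with two features each
fraud_a = np.array([1.0, 10.0])
fraud_b = np.array([3.0, 14.0])

synthetic = make_synthetic(fraud_a, fraud_b, rng)
print(synthetic)  # lies between fraud_a and fraud_b on every feature
```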

Model Development and Performance Metrics

We will now train our model with the SMOTE data using logistic regression and make the predictions. We use the .fit() method with the pipeline object and then use the .predict() method with the test data. The code is below

# Fit the pipeline on the training set and obtain predictions for the test data
pipeline.fit(X_train, y_train) 
predicted = pipeline.predict(X_test)

Now we run our performance metrics to see how our model did. We will use the classification_report() and confusion_matrix() functions. The classification_report() function reports the precision, recall, and f1-score for each class. The confusion_matrix() function returns a cross-tabulation of the actual classes (rows) against the predicted classes (columns). Notice that for both metrics we compare the y test values to the predicted values.

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

Classification report:
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      1505
         1.0       0.82      0.90      0.86        10

    accuracy                           1.00      1515
   macro avg       0.91      0.95      0.93      1515
weighted avg       1.00      1.00      1.00      1515

Confusion matrix:
 [[1503    2]
 [   1    9]]
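
The fraud-class figures in the report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes. The variable names below are only for illustration.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
conf_mat_values = [[1503, 2],
                   [1, 9]]

true_neg, false_pos = conf_mat_values[0]
false_neg, true_pos = conf_mat_values[1]

# Precision: of the cases flagged as fraud, how many really were fraud
precision = true_pos / (true_pos + false_pos)
# Recall: of the actual fraud cases, how many the model caught
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.82 0.9 0.86
```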

Conclusion

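In this example, the model trained with SMOTE caught 9 of the 10 fraud cases in the test set while raising only 2 false alarms. With the help of SMOTE, it is possible to improve the performance of your algorithm when detecting fraud, although comparing against a model trained without resampling is the way to confirm the gain on your own data. As such, SMOTE is a powerful tool that can be useful in the appropriate context.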

Annotating Visualizations with Python VIDEO


Annotations add text and other objects to a visualization to provide information. The video below explains how to add annotations to a visualization when using Python.

Highlighting Data Points with Python VIDEO


The video below provides two methods that can be used to highlight data points using Python. Which method to use depends on the context, but an analyst should be familiar with both.

Highlighting Data Points with Python


In this post, we are going to look at how to highlight data points in a scatterplot. We will look at two methods for doing this: hard coding the highlighted point and selecting it programmatically.

Libraries and Data Preparations

First, we will load our libraries and prepare our data. Below are the libraries we will use

import seaborn as sns
import matplotlib.pyplot as plt
from pydataset import data

The first two lines are libraries for our data visualization. The last line of code will be used for pulling the data that we will use. Below we prepare our data by loading it into an object called “df” and we take a quick peek at it as well.

df=data('Prestige')
df.head()

The data set we are using is called “Prestige” and we load it using the data() function. It describes various occupations, with columns for education, income, women, prestige, census, and type. Next, we will look at how to highlight a specific data point in a scatterplot.

Hard Coding

Hard coding is when you manually pick a specific data point to highlight. Below is the code and output for this

df_prof = df[df.type  ==  'prof']

# Build a list of colors: orangered for the highest-income point, lightgray otherwise
prof_colors = ['orangered' if (education == 12.26) and (income == 25879) else 'lightgray'
                  for education, income in zip(df_prof.education, df_prof.income)]

sns.regplot(x = 'education',
            y = 'income',
            data = df_prof,
            fit_reg = False, 
            # Send scatterplot argument to color points 
            scatter_kws = {'facecolors': prof_colors, 'alpha': 0.7})
plt.show()

We did the following to make the plot above.

We subset the data so that it only contains job types of “prof” and save this as an object called df_prof. We did this to reduce the number of data points and make it easier to see what we are doing.
Next, we make an object called prof_colors, which colors one dot orange-red if it meets the criteria for education and income and light gray otherwise. This logic is captured in an if-else expression inside a list comprehension, and the for clause tells Python which values to apply the if-else expression to. Since this is hard to picture, below is a visual of the prof_colors object.

Notice that the second row is labeled “orangered”. This is because this row matches the criteria for the values of education and income. We will use this object to set the colors of our dots.

The next block of code makes the visualization. You set your x and y values to education and income. The fit_reg argument is set to False because we do not want a regression line. The scatter_kws argument is used to set the color of the dots, and alpha sets their level of transparency.

Setting the highlighted point manually is fine if your data is static. However, if your data is dynamic, you want to highlight the points programmatically so that the highlighted point changes as the data does.

Programmatically

The code is mostly the same as above, with a few minor changes. Below is the code followed by the output and lastly the explanation.

# .copy() avoids pandas' SettingWithCopyWarning when we add a column below
df_prof = df[df.type == 'prof'].copy()

# Find the highest income
max_income = df_prof.income.max()

# Make a column that denotes which occupation has the highest income
df_prof['point_type'] = ['Highest Income' if income == max_income else 'Others' for income in df_prof.income]

# Encode the hue of the points with the newly generated column
sns.scatterplot(x = 'education',
                y = 'income',
                hue = 'point_type',
                data = df_prof)
plt.show()

Here is what we did

We start by subsetting the data as before for type of “prof”.
We then create an object called max_income and find the highest income in the df_prof object using the max() method.
This time we create a new column in our data called “point_type” which is created using an if else statement and a for loop. If income matches our highest income, it will be labeled highest income, and the rest will be labeled others for all data in the income column.
Lastly, we create our scatterplot. We set the x and y values, and we set the hue to match the “point_type” which is the new column we just created.

With this second method, our highlighted data point will change as necessary if it changes in the data.
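
An alternative sketch of the same idea uses numpy.where to build the label column in one vectorized step instead of a list comprehension. The small DataFrame below is made up for illustration and merely mimics the column names of the Prestige data; the name jobs is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the education and income columns
jobs = pd.DataFrame(
    {"education": [12.3, 14.5, 15.9], "income": [25879, 12351, 8403]},
    index=["job_a", "job_b", "job_c"],
)

# Label the row(s) holding the maximum income; everything else is 'Others'
jobs["point_type"] = np.where(
    jobs["income"] == jobs["income"].max(), "Highest Income", "Others"
)

print(jobs["point_type"].tolist())  # ['Highest Income', 'Others', 'Others']
```

The resulting column can be passed to the hue argument of sns.scatterplot() in the same way as before.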

Conclusion

Highlighting data points is something that is needed at times when creating data visualizations. The examples above provide two different ways to deal with this. Which method is best depends on the context.

Privacy of Continuous Data with Python VIDEO


There are several different ways to modify continuous data to protect individuals’ privacy. The video below provides several practical ways to do this using Python.

Generating Fake Data for Privacy with Python VIDEO


Generating fake data is one way to protect an individual’s privacy. The video below provides examples of how to do this using Python.

Python for Data Privacy VIDEO


Data privacy and the protection of people’s identities is important. The video below provides some basic ways to ensure the privacy of individuals when working with data.

Privacy of Continuous Data with Python


There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.

Libraries

We will begin by loading the necessary libraries. Below is the code.

from pydataset import data
import pandas as pd

The library setup is simple. We are importing the data() function from pydataset, which will allow us to load the data we will use in this post. We are also importing pandas to make a frequency table later on. Next, we will address the data preparation.

Data Preparation

The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.

df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.

View of Original Histogram

Below is the code and output for the histogram of the “age” variable. The reason for making this visual is to provide a “before” picture of the data before changes are made.

df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.

Change to Categorical

Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.

df['age'] = df['age'].apply(lambda x:">=40"if x>=40 else"<40" )
df.head()

We are overwriting the “age” variable in the code using an anonymous function. On the “age” variable we use the .apply() method and replace values of 40 and above with “>=40” and values below 40 with “<40”. The data is now broken down into two groups, those aged 40 and above and those below 40. Below is a frequency table of the transformed “age” variable.

df['age'].value_counts()

age
>=40    3984
<40     3441
Name: count, dtype: int64

The .value_counts() method comes from the pandas library. There are two groups now. The table above is a major transformation from the original histogram. Below is the code and output of a bar graph of this transformation.

import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You do not have to limit yourself to only two groups to divide your data. How many groups depends on the context and the purpose of the use of this technique.
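
As a sketch of using more than two groups, pandas offers pd.cut(), which assigns each value to a labeled interval. The ages, bin edges, and labels below are made up for illustration; suitable cut points depend on the data and the privacy goal.

```python
import pandas as pd

# Made-up ages standing in for the "age" column
ages = pd.Series([18, 25, 39, 40, 64, 65, 80])

# right=False makes each bin include its left edge: [0, 25), [25, 40), ...
age_groups = pd.cut(
    ages,
    bins=[0, 25, 40, 65, 120],
    labels=["<25", "25-39", "40-64", ">=65"],
    right=False,
)

print(age_groups.tolist())
# ['<25', '25-39', '25-39', '40-64', '40-64', '>=65', '>=65']
```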

Top Coding

Top coding is a trick used to bring extremely high values down to a specific value. Again, the purpose of modifying these values in our context is to protect people’s privacy. Below is the code and output for this approach.

df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.

We load the “SLID” dataset again so that we can modify it again from its original state.
We then use the .loc method to change all values in “age” above 75 to 75.
Lastly, we create our histogram for comparison purposes to the original data

If you look to the far right, you can see a spike in the number of data points at age 75 compared to our original histogram. This is a result of our manipulation of the data. Through doing this, we can keep all of our data for other forms of analysis while also protecting the privacy of the handful of people who are over the age of 75.

Bottom Coding

Bottom coding is the same as top coding except now you raise values below a threshold to a minimum value. Below is the code and output for this.

df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before with the only difference being the less than “<” symbol and the threshold being set to 20. As you compare this histogram to the original you can see a huge spike in the number of values at 20.
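
Top and bottom coding can also be applied in a single step with the pandas .clip() method, which caps values at both ends. The sketch below uses made-up ages and the same thresholds as above (20 and 75).

```python
import pandas as pd

# Made-up ages standing in for the "age" column
ages = pd.Series([16, 19, 20, 45, 75, 82, 95])

# Values below 20 become 20 and values above 75 become 75
coded = ages.clip(lower=20, upper=75)

print(coded.tolist())  # [20, 20, 20, 45, 75, 75, 75]
```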

Conclusion

Data protection is an important aspect of an analyst’s role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python.

Python for Data Privacy


Data privacy is a major topic among analysts who want to protect people’s information. There are often ethical expectations that personally identifying information is protected. Whenever data is shared, you want to be sure that individual people cannot be identified within a dataset, because identification can lead to unforeseen consequences. This post will examine simple ways a data analyst can protect personal information.

Libraries & Data Preparation

There are few libraries and minimal data preparation for this example. The code and output are below.

from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.

We will now move to the first way to protect privacy when working with data.

Drop Columns

Sometimes protecting people’s identity can be as easy as dropping a column. Often, the column(s) that contain the names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people. Therefore we will drop this column. Below is the code and the output for this.

# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")

# Explore obtained dataset
suppressed_language.head()

To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.

Drop Rows

It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column, it may be possible to identify them. In the code and output below, we drop all rows where education is greater than or equal to 14.

# Drop rows with education of 14 or higher
education = df.drop(df[df.education >= 14].index)

# See  DataFrame
education.head()

In the code, we used the drop() method again, this time passing it the rows where education is greater than or equal to 14. We use the .index attribute of that subset to tell drop() which row labels to remove. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or higher.
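
Keeping the rows that pass a condition usually gives the same result as dropping the rows that fail it, and some readers find the filter easier to read. The sketch below compares both approaches on a made-up frame with no missing values. The two differ when the column has missing values: a missing value fails both comparisons, so drop() keeps those rows while the filter removes them.

```python
import pandas as pd

# Made-up education values standing in for the SLID column (no missing values)
people = pd.DataFrame({"education": [10.0, 15.0, 13.2, 14.0, 8.5]})

# Approach used above: drop the index labels of rows with education >= 14
dropped = people.drop(people[people.education >= 14].index)
# Filter alternative: keep the rows with education below 14
kept = people[people.education < 14]

print(dropped.equals(kept))        # True
print(kept["education"].tolist())  # [10.0, 13.2, 8.5]
```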

Data Masking

Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.

# Uniformly mask the education column 
df['education'] = '****'

# See resulting DataFrame
df.head()

The code involves subsetting the education variable and setting it equal to the asterisks. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it.

Replace Part of String

Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.

#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )

#See Results
df.head()

The code involves rewriting the data in the “sex” column.

We do this by using the apply() method on this column. Inside the apply() method we use an anonymous function, which is written with the keyword “lambda”.
After lambda, we name the argument “text”, since we are modifying text.
After the colon, we tell Python to keep the first character of the string with “text[0]” and then add four asterisks (****) after it.
Lastly, we use the find() method to locate the string “le” in “text” and keep everything from that position to the end of the string.

The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
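
The same kind of masking can be written as a small named function, which is easier to test than a lambda. The helper below (mask_middle, a hypothetical name) keeps the first character and the last two characters rather than searching for “le”, so it is a variation on the code above, not a copy of it.

```python
def mask_middle(text, keep_end=2, mask="****"):
    """Keep the first character and the last keep_end characters."""
    if len(text) <= keep_end + 1:
        # Too short to mask without revealing everything; mask it whole
        return mask
    return text[0] + mask + text[-keep_end:]

print(mask_middle("Male"))    # M****le
print(mask_middle("Female"))  # F****le
print(mask_middle("No"))      # ****
```

A function like this can be passed directly to .apply(), for example df['sex'].apply(mask_middle).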

Conclusion

Protecting data is critical when using data. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.

Bokeh-Manipulating Glyph Color in Python VIDEO


The video below explains how to modify the color of glyphs when using Bokeh. Manipulating the color is another way to convey information about your data to the end user.

Bokeh: Modifying Glyphs VIDEO


Modifying the glyphs in a Bokeh data visualization is another useful tool. The video below explains how to do this

Generating Fake Data for Privacy with Python


The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.

Libraries & Data Preparation

The initial library needed is only “pydataset” which will allow us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the data using the .head() method. Below is the code and the output.

from pydataset import data
df=data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.

We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will have a default value that we will manipulate. Below is the code and output.

df['name']="Dan"
df['credit_card']=1234567890
df['credit_code']=123
df['credit_company']='comp'
df.head()

All of this new data will serve as data that needs protection. The original data isn’t needed; it just serves as a dataset that we are grafting the privacy data onto. Making a dataframe from scratch takes a few more steps and is beyond the scope of this post, so we took a shortcut by adding to preexisting data. We will now see how to generate fake numbers and names.

Fake Numbers

The “faker” library has a class called “Faker” that can generate fake data for almost any circumstance. We will demonstrate this by generating phony credit card numbers. Below is the code and output.

# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()

'6561857744400343'

To generate the false credit card number we imported the Faker class from the faker library. Then we created an instance of Faker called “fake_data”. Lastly, we used the .credit_card_number() method on the “fake_data” object.

We will now generate fake values for “credit_card”, “credit_code”, and “credit_company”.

# Mask card number with new generated data using a lambda function
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())

# See the resulting pseudonymized data
df.head()

If you compare this output to the original, you can see that the values have changed. We set the seed using Faker.seed(0) so that we always get the same results. The next three lines of code use an anonymous function, which lets us apply the same step to every row. First, we select the column we want to overwrite. Second, we use the .apply() method on that same column. Inside the .apply() method we use lambda followed by the argument x. After the colon, we indicate what we want done using the appropriate method from the faker library; x itself is ignored, so each value is simply replaced with a newly generated one. Lastly, we display the results using the .head() method. Next, we will address the names of people.

Fake Names

There are at least three different methods for generating fake names: one that generates male or female names, one that generates only male names, and one that generates only female names. Below is a brief example of each.

Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())

Norma Fisher
Jorge Sullivan
Elizabeth Woods

The code above is self-explanatory. We used the print() function so that each of the three calls displays its output. We will use the .name() method in the code below to generate fake names for our “name” column.

Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as what we did with the credit card information. As such, we will not reexplain it here.

Conclusion

The ability to generate fake data as shown in this post allows an incredible amount of flexibility in protecting people’s identity. However, nothing that is needed for developing insights should be lost. For example, replacing real credit card numbers with random ones could be catastrophic if the original numbers carry information that matters in a given context. Therefore, any tool that is going to be used must be used with wisdom and caution.

Bokeh-Manipulating Glyph Color


In this post, we will examine how to manipulate the color of the glyphs in a Bokeh data visualization. We are doing this not necessarily for aesthetic reasons but to convey additional information. Below are the initial libraries that we need. We will load additional libraries as required.

from pydataset import data
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.io import output_file, show

pydataset will be used to load the data we need. The next two lines import the tools for creating the figure (with its axes) and the object for storing our data. The last line includes functions for saving and displaying our visualization.

Data Preparation

For this example, data preparation is simple. We will load the dataset “Duncan” using the data() function in an object called “df”. This dataset includes data about various occupations as measured on several variables. We will then display the data using the .head() method. Below is the code and output.

df=data('Duncan')
df.head()

Color Glyphs

In the example below, we will color the glyphs based on one of the variables. We will graph education vs income and color code the glyphs based on income. Below is the code followed by the output and finally the explanation.

from bokeh.transform import linear_cmap
from bokeh.palettes import RdBu8

source = ColumnDataSource(data=df)

# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))

# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")

# Add circle glyphs
fig.circle(x="education", y="income", source=source, color=mapper,size=16)

output_file(filename="education_vs_income.html")
show(fig)

Here is what we did

We had to load additional libraries. linear_cmap() will be used to create the actual coloring of the glyphs. RdBu8 is the color palette for the glyphs.
We then create the source of our data using the ColumnDataSource() function.
We create our mapper using the linear_cmap() function. The arguments inside the function are the variable we are using (income), the palette, and the low and high values for the variable.
Next, we create our figure. We label our x and y axis based on the variables we are using and set the title.
We use the .circle() method to create our glyphs. Notice how we set the color argument to our “mapper” object.
The last two lines of code are for creating our output and showing it.

You could set the glyph color to a third variable, which would allow you to express a third variable in a two-dimensional space. For example, we could have used the “prestige” variable for the coloring of the glyphs rather than income, as income was already represented on the y-axis.
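
The idea behind a linear color mapper can be sketched without Bokeh: the range between low and high is split into equal intervals, one per palette color, and each value takes the color of the interval it falls in. The function below (pick_color) is a hypothetical illustration of that idea, not Bokeh's own code, and the three-color palette is made up.

```python
def pick_color(value, palette, low, high):
    """Map value to a palette entry by its position between low and high."""
    if value <= low:
        return palette[0]
    if value >= high:
        return palette[-1]
    fraction = (value - low) / (high - low)
    index = int(fraction * len(palette))
    return palette[min(index, len(palette) - 1)]

demo_palette = ["blue", "white", "red"]

print(pick_color(10, demo_palette, low=0, high=90))  # blue
print(pick_color(45, demo_palette, low=0, high=90))  # white
print(pick_color(80, demo_palette, low=0, high=90))  # red
```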

Adding a Color Bar

Adding a color bar will help to explain to a reader of our visualization what the color of the glyphs means. The code below is mostly the same and is followed by the output and lastly the explanation.

from bokeh.models import ColorBar
source = ColumnDataSource(data=df)

# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))

# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
fig.circle(x="education", y="income", source=source, color=mapper,size=16)

# Create the color_bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)

# Update layout with color_bar on the right
fig.add_layout(color_bar, "right")
output_file(filename="education_vs_income_color_mapped.html")
show(fig)

Here is what happened.

We loaded a new function called ColorBar().
We create our source data (same as the previous example)
We create our mapper (same as the previous example)
We create our figure and glyphs (same as the previous example)
Next, we create our color bar using the ColorBar() function. Inside this function, we set the color_mapper argument to a transformed version of the mapper object we already created. We also can set the width of the color bar using the width argument. Everything we have done in this step is saved in an object called “color_bar”
We then use the .add_layout() method on our “fig” object and place the object “color_bar” inside it along with the string “right”, which tells Bokeh to place the color bar on the right-hand side of the scatterplot.

There is one more example of glyph manipulation below.

Color by Category

In this last example, we will map an additional categorical variable onto our plot using color. Below is the code, output, and explanation.

# Import modules
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10_5
source = ColumnDataSource(data=df)

# Create positions
TOOLTIPS=[('Education', '@education'), ('Prestige', '@prestige')]
positions = ["wc","bc","prof"]
fig = figure(x_axis_label="Education", y_axis_label="Prestige", title="Education vs Prestige", tooltips=TOOLTIPS)

# Add circle glyphs
fig.circle(x="education", y="prestige", source=source, legend_field="type", 
           size=16,fill_color=factor_cmap("type", palette=Category10_5, factors=positions))

output_file(filename="Education_vs_prestige_by_type.html")
show(fig)

We need to load some additional libraries. factor_cmap() will be used to color the glyphs based on the categorical variable “type”. Category10_5 is the color palette.
We create an object called “source” for our data using the ColumnDataSource() function.
We create an object called “TOOLTIPS”. This is an object that will be used to display the individual data points of a glyph when the mouse hovers over it in the visualization
We create an object called “positions” which is a list of all of the job types we want to match to different colors in our plot.
We create an object called “fig” which uses the figure() function to create the x and y axes and the title of the plot. Inside this function, we also set the tooltips argument equal to the “TOOLTIPS” object we created previously
Next, we use the .circle() method on our “fig” object. Most of the arguments are self-explanatory, but notice how the argument “fill_color” is set to the function factor_cmap(). Inside this function we indicate the variable “type” as the variable to use, set the palette, and set the factors to the “positions” object that we made earlier.
The last two lines save the output and display it.
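
Conceptually, a categorical color map pairs each factor with one palette entry, in order. The dictionary below sketches that pairing using the first three hex codes of the Category10 palette; it illustrates the idea and is not how Bokeh stores the mapping internally. The names first_colors and color_for_type are hypothetical.

```python
positions = ["wc", "bc", "prof"]
# First three colors of the Category10 palette
first_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

# Pair each job type with a color, in the order the factors are listed
color_for_type = dict(zip(positions, first_colors))

print(color_for_type["prof"])  # #2ca02c
```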

Conclusion

Bokeh allows you to do many cool things when creating a visualization for data purposes. This post was focused on how to manipulate the glyphs in a scatterplot. However, there is so much more that can be done beyond what was shared here.

Bokeh Display Multiple Plots VIDEO


The video below shows you how to make multiple plots with Bokeh. This is a valuable tool when you are trying to develop multiple visualizations for comparison purposes.

Bokeh Tools and Tooltips VIDEO


In the video below, we will take a look at Bokeh tools and tooltips in Python. Tools and tooltips are great options for modifying your data visualization’s interactivity.

Bokeh Display Customization in Python


In this post, we will examine how to modify the default display of a plot in Bokeh, a library for interactive data visualizations in Python. Below are the initial libraries that we need.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show

The first line of code is where our data comes from. We are using the data() function from pydataset for loading our data. The next two lines are for making the plot’s figure (x and y axes) and for the output file.

Data Preparation

There is no data preparation beyond loading the dataset using the data() function. We pick the dataset “Duncan” and load it into an object called “df.” The code is below, followed by a brief view of the actual data using the .head() method.

df=data('Duncan')
df.head()

This dataset includes various occupations measured in four ways: job type, income, education, and prestige.

Default Graph’s Appearance

Before we modify the appearance of the plot, it is important to know what the default appearance of the plot is for comparison purposes. Below is the code for a simple plot followed by the actual output and then lastly an explanation.

# Create a new figure
fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add circle glyphs
fig.circle(x=df["education"], y=df["income"])

# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

The first line of code sets up the fig or figure. We use the figure() function to label the axes which are education and income. The second line of code creates the actual data points in the figure using the .circle() method. The last two lines create the output and display it.

So the figure above is the default appearance of a graph. Below we will look at several modifications.

Modification 1

In the code below, we are making the following changes to the plot.

Identify data points by job type using color
Change the background color to black

Below is the code, followed by the output and the explanation.

# Import curdoc
from bokeh.io import curdoc

prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]

# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add prof circle glyphs
fig.circle(x=prof["education"], y=prof["income"], color="yellow", legend_label="prof",size=10)

# Add bc circle glyphs
fig.circle(x=bc["education"], y=bc["income"], color="red", legend_label="bc",size=10)

output_file(filename="prof_vs_bc.html")
show(fig)

Here is what happened.

We import the curdoc function, which allows us to modify the appearance of the plot.
Next, we do some data preparation, separating the rows of type “prof” and the rows of type “bc” into separate objects.
We change the theme of the plot to contrast using curdoc().theme.
We create the figure as done previously.
We use the .circle() method twice: once to place the “prof” data points on the plot and a second time to place the “bc” data points on the plot. We also make the data points larger by setting the size and use a different color for each job type.
The last two lines of code are for creating the output and displaying it.

You can see the difference between this second plot and the first one. This also shows the flexibility that is inherent in the use of Bokeh. Below we add one more variation to the display.

Modified Graph’s Appearance

The plot below is mostly the same except for the following

We add a third job type “wc”
We modify the shapes of the data points

Below is the code followed by the graph and the explanation

# Subset the data by job type
wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]

# Create figure
fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add circle glyphs for white-collar jobs
fig.circle(x=wc["education"], y=wc["income"], legend_label="wc", color="purple",size=10)

# Add square glyphs for professional jobs
fig.square(x=prof["education"], y=prof["income"], legend_label="prof", color="red",size=10)

# Add triangle glyphs for blue-collar jobs
fig.triangle(x=bc["education"], y=bc["income"], legend_label="bc", color="green",size=10)

output_file(filename="education_vs_income_by_type.html")
show(fig)

The code is almost all the same. The main difference is there are now three job types and each type has a different shape for their data points. The shapes are determined by using either .circle(), .triangle(), or .square() methods.
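As a minimal sketch of the same idea, the three near-identical calls can be collapsed into a loop. This version assumes Bokeh's general .scatter() method with its marker argument, a small made-up table in place of the Duncan data, and a hypothetical “styles” lookup that is not part of the original post.

```python
import pandas as pd
from bokeh.plotting import figure

# Toy stand-in for the Duncan data (values are made up for illustration)
toy = pd.DataFrame({
    "type": ["prof", "prof", "wc", "wc", "bc", "bc"],
    "education": [86, 76, 55, 44, 25, 30],
    "income": [62, 72, 42, 36, 21, 15],
})

# Hypothetical lookup table: job type -> (marker name, color)
styles = {
    "wc": ("circle", "purple"),
    "prof": ("square", "red"),
    "bc": ("triangle", "green"),
}

fig = figure(x_axis_label="Education", y_axis_label="Income")

# One scatter() call per job type; the marker argument picks the shape
for job_type, (marker, color) in styles.items():
    subset = toy[toy["type"] == job_type]
    fig.scatter(x=subset["education"], y=subset["income"],
                marker=marker, color=color, size=10, legend_label=job_type)
```

Calling output_file() and show(fig) afterwards, as in the post, would render the plot. Adding a fourth job type then only requires one more entry in the lookup table.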

Conclusion

There are many more ways to modify the appearance of a visualization in Bokeh. The goal here was to provide some basic examples that may lead to additional exploration.

Bokeh Tools and Tooltips

In this post, we will look at how to manipulate the different tools and tooltips that you can use to interact with data visualizations that are made using Bokeh in Python. Tools are the icons that are displayed by default to the right of a visual when looking at a Bokeh output. Tooltips provide interactive data based on the position of the mouse.

We will now go through the process of changing these tools and tooltips for various reasons and purposes.

Load Libraries

First, we need to load the libraries we need to make our tools. Below is the code followed by an explanation.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show

We start by loading “data” from “pydataset”. This library contains the actual data we are going to use. The other imports all come from Bokeh. The “figure” function creates the plot for our visualization, “output_file” makes our HTML document, and “show” displays our visualization.

Data Preparation

Data preparation is straightforward. All we have to do is load our data into an object. We will use the “Duncan” dataset, which contains data on various jobs’ income, education, and prestige. Below is the code followed by a snippet of the actual data.

df=data('Duncan')
df.head()

Default Settings for Tools

We will now look at a basic plot with the basic tools. Below is the code.

# Create a new figure
fig = figure(x_axis_label="income", y_axis_label="prestige")

# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])

# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

There is nothing new here. We create the figure for our axes first. Then we add the points in the next line of code. Lastly, we write some code to create an output. The default toolbar has 7 items. Below they are explained from top to bottom.

At the top is the logo that takes you to the Bokeh website
Pan tool
Box zoom
Wheel zoom
Save figure
Reset figure
Takes you to Bokeh documentation

We will now show how to customize the available tools.

Custom Settings for Tools

In order to make a set of custom tools, we need to make some small modifications to the previous code as shown below.

# Create a list of tools
tools = ["lasso_select", "wheel_zoom", "reset","save"]

# Create a new figure and set tools
fig = figure(x_axis_label="income", y_axis_label="prestige",tools=tools)

# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])

# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

What is new in the code is the object called “tools”. This object contains a list of the tools we want to be available in our plot. The names of the tools are available in the Bokeh documentation. We then pass this object “tools” to the argument called “tools” in the line of code where we create the “fig” object. If you compare the second plot to the first plot, you can see we have fewer tools in the second one, as determined by our code.
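As a small sketch of an alternative, Bokeh also accepts the tool names as a single comma-separated string, and the resulting toolbar can be inspected from Python. The variable name “tool_names” is our own and is only there for illustration.

```python
from bokeh.plotting import figure

# The same four tools, written as one comma-separated string
fig = figure(x_axis_label="income", y_axis_label="prestige",
             tools="lasso_select,wheel_zoom,reset,save")

# Inspect which tool objects ended up on the toolbar
tool_names = [type(tool).__name__ for tool in fig.toolbar.tools]
print(tool_names)
```

Printing the class names is a quick way to confirm that only the requested tools are attached before the plot is rendered.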

Hover Tooltip

The hover tooltip allows you to place your mouse over the plot and have information displayed about what your mouse is resting upon. Being able to do this can be useful for gaining insights about your data. Below is the code and the output followed by an explanation.

# Import ColumnDataSource
from bokeh.models import ColumnDataSource

# Create source
source = ColumnDataSource(data=df)

# Create TOOLTIPS and add to figure
TOOLTIPS = [("Education", "@education"), ("Position", "@type"), ("Income", "@income")]
fig = figure(x_axis_label="education", y_axis_label="income", tooltips=TOOLTIPS)

# Add circle glyphs
fig.circle(x="education", y="income", source=source)
output_file(filename="first_tooltips.html")
show(fig)

Here is what happened.

We imported a new class called ColumnDataSource. This class allows us to create a data structure that is unique to Bokeh. It is not required here but will appear in the future.
We then store our dataset using the new class and call the result “source”.
Next, we create a list called “TOOLTIPS”. This list contains tuples, which are in parentheses. The first string in each tuple is the name that appears in the hover. The second string accesses the value in the dataset. For example, if a point has an education value of 72, the first line of its hover shows “Education” and the number 72. The string “Education” is just the first string in the tuple, and the value 72 is the value of education from the dataset for that particular data point.
The rest of the code is a review of what has been done previously. The only difference is that we use the argument “tooltips” instead of “tools”.
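The same hover behavior can also be set up with an explicit tool object. The sketch below assumes Bokeh's HoverTool model and the figure's add_tools() method, and it uses three made-up rows rather than the Duncan data.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up rows standing in for the Duncan data
source = ColumnDataSource(data={
    "education": [86, 55, 25],
    "income": [62, 42, 21],
    "type": ["prof", "wc", "bc"],
})

# Start with no tools, then attach a hover tool explicitly
fig = figure(x_axis_label="education", y_axis_label="income", tools="")
hover = HoverTool(tooltips=[("Education", "@education"),
                            ("Position", "@type"),
                            ("Income", "@income")])
fig.add_tools(hover)

# "@education" in the tooltip refers to the "education" column of source
fig.scatter(x="education", y="income", source=source, size=10)
```

Building the tool by hand is more verbose than the tooltips argument, but it makes it clear that the hover is just another tool on the toolbar.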

Conclusion

With tooltips and tools you can make some rather professional looking visualization with a minimum amount of code. That is what makes the Bokeh library so powerful.

Bar Graphs Using Bokeh and Python VIDEO

The video below provides an introduction to making bar graphs using the Bokeh library and Python.

Make a Bar Graph with Bokeh in Python

Bokeh is a data visualization library available in Python that stands out for its interactivity. In this post, we will look at how to make a basic bar graph using Bokeh.

To begin we need to load certain libraries as shown below.

from pydataset import data
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_file, show

In the code above, we load the “pydataset” library to gain access to the data we will use. Next, we load “pandas” which will help us with some data preparation. The last two libraries are related to “bokeh.” The “figure” function will be used to set the actual plot, the “output_file” function will allow us to save our plot as an HTML file and the “show” function will allow us to display our plot.

Data Preparation

We need to do two things to be ready to create our bar graph. First, we need to load the data. Second, we need to calculate group means for the bar graph. Below is the code for the first step followed by the output.

df=data('Duncan')
df.head()

In the code above we use the “data” function to load the “Duncan” dataset into an object called “df”. Next, we display the output of this. The “Duncan” dataset contains data on different jobs, the type of job, income, education, and prestige. We want to graph prestige and job type as a bar graph which will require us to calculate the mean of prestige by type. The code for this is below.

# Calculate group means of prestige
positions = df.groupby('type', as_index=False)['prestige'].mean()
positions

In the code above, we use the “groupby” method on the “df” object. Inside the method, we indicate we want to group by “type”. The “as_index” argument is set to False so that the “type” column is not set as the index (in other words, as the row labels). Next, we subset the data using square brackets to only include the “prestige” column. Lastly, we indicate that we want to calculate the “mean”. The result is that there are three job types and we have the mean prestige for each job type. The job types and means from the table above are what we will use for making our visualization.
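To make the effect of “as_index” concrete, here is a small sketch on a made-up table (the numbers are not from the Duncan dataset). It computes the group means both ways so the two result shapes can be compared.

```python
import pandas as pd

# Tiny made-up table with the same column names as the post
toy = pd.DataFrame({
    "type": ["bc", "bc", "prof", "prof", "wc"],
    "prestige": [10, 20, 80, 90, 40],
})

# Default behavior: the group labels become the index of a Series
means_default = toy.groupby("type")["prestige"].mean()

# as_index=False: "type" stays a regular column in a DataFrame
means_flat = toy.groupby("type", as_index=False)["prestige"].mean()

print(means_default)
print(means_flat)
```

The flat version is convenient for plotting because both “type” and “prestige” can be selected by column name.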

Bar Graph

We are now ready to make our bar graph. Below is the code followed by the output.

# Instantiate figure
fig = figure(x_axis_label="positions", y_axis_label="Prestige", x_range=positions["type"]) 

# Add bars
fig.vbar(x=positions["type"], top=positions["prestige"],width=0.9)

# Produce the html file and display the plot
output_file(filename="Prestige.html")
show(fig)

Here are the steps.

We began by creating the “fig” object. We labeled our x and y axes and also indicated the range of the x values, which means determining the categories of our data. For our purposes, these were the unique job types in the “type” column.
Next, we use the “vbar” function to make our bar graph. The x values were set to the “type” column from the “positions” object. The y or “top” values were set to the means of “prestige” from the “positions” object. The “width” argument was set to 0.9 to ensure there was a little whitespace between the bars.
The “output_file” creates a saved plot and the “show” function displays the bar graph.
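Because “x_range” sets the categories, it also sets their order on the axis. The sketch below sorts a made-up “positions” table by prestige before building the figure; the numbers are illustrative and the “ordered” name is our own.

```python
import pandas as pd
from bokeh.plotting import figure

# Made-up group means standing in for the "positions" table
positions = pd.DataFrame({"type": ["bc", "prof", "wc"],
                          "prestige": [22.8, 80.4, 36.7]})

# Sort so the tallest bar comes first; x_range follows this order
ordered = positions.sort_values("prestige", ascending=False)

fig = figure(x_axis_label="positions", y_axis_label="Prestige",
             x_range=ordered["type"].tolist())

# Draw one bar per job type, using the sorted rows
fig.vbar(x=ordered["type"], top=ordered["prestige"], width=0.9)
```

As in the post, output_file() and show(fig) would then save and display the sorted bar graph.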

Conclusion

Bokeh has lots of cool tools available for the data analyst. This post focused on bar graphs and covered only the basics. There is much more possible with this library.

Bokeh-Scatter Plot Basics in Python

Bokeh is another data visualization library available in Python. One of Bokeh’s unique features is that it allows for interaction. In this post, we will learn how to make a basic scatterplot in Bokeh while also exploring some of the basic interactions that are provided by default.

Data Preparation

We are going to make a scatterplot using the “Duncan” data set that is available in the “pydataset” library. Below is the initial code.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show

The code above loads the needed libraries. We loaded “pydataset” because this is where our data will come from. All of the other imports are related to “bokeh.” The “figure” function allows us to set up our axes for the scatterplot. The “output_file” function allows us to create the file of our plot. Lastly, “show” allows us to display our visualization. In the code below, we will load our dataset, give it a name, and print the first few rows.

df=data('Duncan')
df.head()

In the code above, we store the “Duncan” dataset in an object called “df” using the data() function. We then display a snippet of the data using the .head() method. The “Duncan” data shares information on jobs as defined by several variables. We will now proceed to the plot.

Making the Scatterplot

We will now make our scatterplot. We have to do this in three steps.

Make the axis
Add the data to the plot
Create the output file and show the results

Below is the code with the output

# Create a new figure
fig = figure(x_axis_label="education", y_axis_label="income")  # labels the axes

# Add circle glyphs
fig.circle(x=df["education"], y=df["income"])  # adds the dots

# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

At the top of the code, we create our axis information using the “figure” function. Here we are plotting education vs income and storing all of this in an object called “fig”. Next, we insert the data into our plot using the “circle” function. To insert the data we also have to subset the “df” dataframe for the variables that we want. Note that the data added to a plot are called “glyphs” in Bokeh. Lastly, we create an output file using a function with the same name and show the results.

To the right of your plot, there are also some interaction buttons as shown below

Here is what they do from top to bottom.

Takes you to bokeh.org
Pan the image
Box zoom
Wheel zoom
Download image
Resets image
Takes you to the Bokeh documentation

There are other interactions possible but these are the default ones when you make a plot.

Conclusion

Bokeh is one of many tools used in Python for data visualization. It is a powerful tool that can be used in certain contexts. The interactive tools can also enhance the user experience.

Importing Files with Python VIDEO

The video below provides several different examples and ways that data can be imported into Python.

T-test & ANOVA with Python VIDEO

The video below shows how to conduct a t-test and an ANOVA analysis using the pingouin library from Python.

T-Test with Pingouin

In this post, we will look at how to use the Pingouin package to calculate both t-test and ANOVA results. This post is not a post on statistics. Rather, we are focused on how to do t-test and ANOVA using Python. Therefore, the explanation of the statistics is not a part of this post.

We will be using the Duncan dataset from the pydataset package. In the code below, we are loading the needed libraries and we are also printing a portion of the Duncan dataset.

import pandas as pd
import pingouin
from pydataset import data

df=data("Duncan")
df.head()

The Duncan dataset is simple. It has stats on various jobs which include the type of job, income, education, and prestige. We want to compare job type with income. What we want to do is compare professional jobs (prof) with white-collar jobs (wc) and see if there is a difference. After doing this, we will compare all three job types (bc, wc, prof) using ANOVA.

T-Test

In the code below, we need to subset our data so that the professionals and white-collar workers are separate.

df_prof=df[df['type']=='prof']
df_wc=df[df['type']=='wc']

Now that this is complete, the code below is what is used for conducting the t-test. We are comparing professional income with white-collar income. The t-test is two-sided, which means we are looking for a difference in either direction. Below are the code and the results.

pingouin.ttest(x=df_prof['income'],y=df_wc['income'],alternative="two-sided")

According to the p-value, there is no statistically significant difference between the incomes of professionals and white-collar workers. We will now move to ANOVA.
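For readers curious about what sits behind the test statistic, here is a rough sketch of Welch's version of the t statistic using only the standard library. It assumes the Welch form (which, as far as we know, Pingouin uses when the group sizes differ), the “welch_t” helper is our own, and the two small groups are made up.

```python
import math
import statistics

def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Made-up incomes for two small groups
group_a = [60, 70, 80, 90]
group_b = [50, 55, 65]

t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 3))
```

The p-value is then read from a t distribution with these degrees of freedom, which is the part a package such as Pingouin handles for us.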

ANOVA

A t-test only allows the user to compare two groups. ANOVA allows the user to compare multiple groups. We have three types of workers and not just two. Using ANOVA, we can compare all three at once. In addition, unlike the t-test, there is no data preparation needed in this example.

The code below is relatively simple. We are using the anova() function from Pingouin. The data argument is for the data, the dv argument indicates the dependent variable, and the between argument indicates the independent variable. Below is the code and output.

pingouin.anova(data=df,dv="income",between="type")

The value we are focused on is p-unc, the uncorrected p-value. The results are significant. In other words, at least one of the group means differs from the others. We don’t know which one, so we need to do pairwise comparisons. Below are two different pairwise comparisons, one without an adjustment and one with an adjustment.
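The F statistic behind a one-way ANOVA can be sketched in a few lines of plain Python. The “one_way_f” helper and the three tiny groups below are made up for illustration; this is the textbook calculation, not Pingouin's own code.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of each value around its own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F means the group means are spread out relative to the variation inside each group, which is what leads to a small p-value.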

Pairwise Comparison No Adjustment

The first pairwise comparison is without an adjustment. The code below is mostly the same as for ANOVA. The main difference is that we are using the pairwise_tests() function, and there is an additional argument called padjust, which is set to 'none'. Below is the code and output.

pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='none')

Focusing on the p-values (p-unc) again, we can see that there is a significant difference between blue-collar workers and professionals and another between blue-collar workers and white-collar workers. However, there is no significant difference between professional and white-collar workers. Keep in mind that we already saw this result for professionals and white-collar workers in the t-test.

Pairwise Comparison with Adjustment

In the code below, we have the same code but with a Bonferroni p-value adjustment. Adjustments become more important as the number of comparisons grows. The details of this are beyond the scope of this post. However, it is important to make this adjustment because otherwise you could get false positives, which could skew your results and interpretation. Below is the code and output.

pingouin.pairwise_tests(data=df,dv="income",between="type",padjust='bonf')

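You may have noticed that the p-unc values are the same as before. That column always reports the unadjusted p-values; as far as we can tell, Pingouin places the Bonferroni-adjusted values in a separate p-corr column. With only three groups there are just three comparisons, so the correction is modest for the data we are using.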
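The Bonferroni adjustment itself is simple enough to sketch by hand: each p-value is multiplied by the number of comparisons and capped at 1. The “bonferroni” helper and the p-values below are made up for illustration and are not the results from the Duncan data.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Made-up unadjusted p-values for three pairwise comparisons
raw = [0.001, 0.02, 0.40]
adjusted = bonferroni(raw)

for before, after in zip(raw, adjusted):
    print(before, "->", round(after, 3))
```

In this toy example, the second comparison is below 0.05 before the adjustment and above it afterward, which is the kind of false positive the correction is meant to guard against.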

Conclusion

The main purpose here was to show what the Pingouin package can do when it comes to t-tests and ANOVA. We could have calculated means for each group and other statistics. However, that was not the focus. Now you know some of the tools that are available in the Pingouin library.

Import Simple Files into Python

In this post, we will be using Python to import files. Importing a text file into Python is rather easy. We will look at several different examples and file types in this post.

Importing a Text File

Importing a text file is often done in Python. To do this see the code below.

file=open('Corr.txt',mode='r')
text=file.read()
file.close()
print(text)

$r
              ACmean    CLMean   SFIMean EnrichMean
ACmean     1.0000000 0.4386146 0.2463862  0.5758464
CLMean     0.4386146 1.0000000 0.2874991  0.5730721
SFIMean    0.2463862 0.2874991 1.0000000  0.2076200
EnrichMean 0.5758464 0.5730721 0.2076200  1.0000000

$n
           ACmean CLMean SFIMean EnrichMean
ACmean        172    172     172        172
CLMean        172    172     172        172
SFIMean       172    172     172        172
EnrichMean    172    172     172        172

$P
                 ACmean       CLMean      SFIMean   EnrichMean
ACmean               NA 1.763762e-09 0.0011214521 0.000000e+00
CLMean     1.763762e-09           NA 0.0001312634 2.220446e-16
SFIMean    1.121452e-03 1.312634e-04           NA 6.277935e-03
EnrichMean 0.000000e+00 2.220446e-16 0.0062779348           NA

attr(,"class")
[1] "rcorr"

In order to load the text file we used the open() function to open the file in the working directory. Next, we indicated the mode as ‘r’ which means ‘read’. Everything that was just mentioned was saved into an object called ‘file’. Then we use the read() function on the object called ‘file’ and save all this in a new object called ‘text’. The next step involves using the close() function in order to complete the process. The last step involves printing the content of the text object using print().

Below is a shorter way to complete this process.

with open('Corr.txt','r') as file:
    print(file.read())

$r
              ACmean    CLMean   SFIMean EnrichMean
ACmean     1.0000000 0.4386146 0.2463862  0.5758464
CLMean     0.4386146 1.0000000 0.2874991  0.5730721
SFIMean    0.2463862 0.2874991 1.0000000  0.2076200
EnrichMean 0.5758464 0.5730721 0.2076200  1.0000000

$n
           ACmean CLMean SFIMean EnrichMean
ACmean        172    172     172        172
CLMean        172    172     172        172
SFIMean       172    172     172        172
EnrichMean    172    172     172        172

$P
                 ACmean       CLMean      SFIMean   EnrichMean
ACmean               NA 1.763762e-09 0.0011214521 0.000000e+00
CLMean     1.763762e-09           NA 0.0001312634 2.220446e-16
SFIMean    1.121452e-03 1.312634e-04           NA 6.277935e-03
EnrichMean 0.000000e+00 2.220446e-16 0.0062779348           NA

attr(,"class")
[1] "rcorr"

Using the ‘with’ approach is shorter and simpler, and the file is closed automatically when the block ends. The content of the open() function is the same, but we name the result “file” by writing this after the open() function rather than before. Lastly, we print the file and use the read() function together.
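When a file is large, reading it line by line can be more practical than calling read() once. The sketch below writes a small throwaway file first so that it runs on its own; the file contents and the variable names are made up for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

# Iterate over the file object to read one line at a time
with open(path, "r") as file:
    lines = [line.rstrip("\n") for line in file]

# Clean up the temporary file
os.remove(path)
print(lines)
```

Each item in “lines” is one line of the file with the trailing newline removed.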

Import with Numpy

Naturally, there is more than one way to import data. The example below involves the use of the NumPy library. This approach is used in particular for dealing with numerical data that might be saved as a text file. Below is an example of how to do this.

import numpy as np
text=np.loadtxt('sample.txt', delimiter=',')
print(text)

[[1. 2. 3. 4.]
 [5. 6. 7. 8.]
 [9. 0. 1. 2.]]

We begin by importing the numpy library as np. Next, we create an object called ‘text’ and use the loadtxt() function to load a text file called ‘sample.txt’. The argument ‘delimiter’ is used to tell numpy how the numbers are separated in the file. Lastly, we print the ‘text’ object, which is an array.
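Text files often have a header row, which loadtxt() cannot convert to numbers. As a small sketch, the example below uses an in-memory string in place of a file (an assumption made so the code runs on its own) and skips the header with the skiprows argument.

```python
import io
import numpy as np

# In-memory stand-in for a text file with a header row
raw = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n9,0,1,2\n")

# skiprows=1 ignores the header; dtype=int keeps whole numbers
arr = np.loadtxt(raw, delimiter=",", skiprows=1, dtype=int)

print(arr.shape)
print(arr)
```

With a real file, the first argument would simply be the file name, as in the post.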

Import with Pandas

Pandas is another way to import data. For our example, we will look at how to import csv files. Below is the code for how to complete this.

import pandas as pd
data=pd.read_csv('sample.csv')
data.head()

In line 1 of the code we load the pandas library as ‘pd’. In line 2, we use the read_csv() function to load our data into the ‘data’ object. Lastly, we use head() to take a peek at the first few lines of data.
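Not every “csv” file is separated by commas. The sketch below assumes a semicolon-separated file, represented here by an in-memory string so the code runs on its own, and passes the separator to read_csv() with the sep argument. The column names and values are made up.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated file
csv_text = "name;score\nann;90\nbob;85\n"

# sep tells pandas which character separates the columns
scores = pd.read_csv(io.StringIO(csv_text), sep=";")

print(scores.head())
```

With a file on disk, the first argument would be the path to the file instead of the StringIO object.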

Conclusion

With this information, you now possess some basic knowledge on how to get data into Python for the purpose of being able to manipulate it. In a future post, we will look at other ways and means of importing data into Python.

RANSAC Regression with Python VIDEO

RANSAC regression is a distinctive style of regression. The algorithm separates the data into inliers and outliers and fits the model to the inliers. The video below provides an overview of how it can be used in Python.

Gradient Boosting Classification with Python VIDEO

In this video, we will look at gradient boosting classification with Python. Gradient boosting is similar to AdaBoost in that it is an ensemble technique and is often associated with decision trees. The main difference is the focus on the gradient or slope in the calculations.

AdaBoost Regression with Python VIDEO

AdaBoost regression uses ensemble learning to improve the performance of numeric prediction models. The video below explains how to use AdaBoost with Python.

AdaBoost Classification with Python VIDEO

AdaBoost classification is a type of ensemble learning. What this means is that the algorithm makes multiple models that work together to make predictions. Such techniques are powerful in improving the strength of models. The video below explains how to use this algorithm within Python.

Elastic Net Regression with Python VIDEO

Elastic net regression combines the penalties of ridge and lasso regression, which lets it draw on the strengths of both. As such, this is a great algorithm for regularized regression. The video below explains how to use this algorithm with Python.

Lasso Regression with Python VIDEO

Lasso regression is another algorithm that uses regularization to handle variables. Essentially, this algorithm will reduce coefficients to zero based on whether they contribute meaningfully to the results. The video below will explain how to use Lasso regression in Python.

Ridge Regression with Python VIDEO

Ridge regression belongs to a family of regression called regularized regression. This family of regression uses various mathematical techniques to reduce or remove coefficients from a regression model. In the case of ridge, this algorithm will shrink coefficients close to zero but never actually remove variables from a model. In this video, we will focus on using this algorithm in Python rather than on the mathematical details.

Hyper-Parameter Tuning with Python VIDEO

Hyper-parameter tuning is one way of taking your model development to the next level. This tool provides several ways to make small adjustments that can reap huge benefits. In the video below, we will look at tuning the hyper-parameters of a KNN model. Naturally, this tuning process can be used for any algorithm.

Cross-Validation with Python VIDEO

Cross-validation is a valuable tool for assessing a model’s ability to generalize. In the video below, we will look at how to use cross-validation with Python.

Intro to Matplotlib with Python VIDEO

Matplotlib is a data visualization module used often in Python. In this video, we will go over some introductory commands. Doing so will allow anyone to make simple changes to their visualizations.

Random Forest Regression with Python VIDEO

In the video below we will take a look at how to perform a random forest regression analysis with Python. Random forest is one of many tools that can be used in the field of data science to gain insights to help people.

K Nearest Neighbor Classification with Python VIDEO

K nearest neighbor classification is another tool used in machine learning to predict what class an observation belongs to. In this video, we will learn how to implement this algorithm using Python.

Naive Bayes with Python VIDEO

Naive Bayes is an algorithm that is commonly used with text classification. However, it can also be used for separating observations into multiple categories. In this video, we will look at a simple example of the use of Naive Bayes in Python.

K-Nearest Neighbor Regression with Python VIDEO

K-Nearest neighbor is a great technique for dealing with data. In the video below, we will look at how to use this tool with Python.

Support Vector Machines Regression with Python VIDEO

In this video, we will look at a simple example of SVM regression. In this context, regression involves predicting a continuous dependent variable. This is similar to the basic form of regression that is taught in an introduction to stats class.
