Generating fake data is one way to protect an individual’s privacy. The video below provides examples of how to do this using Python.
Data Privacy with Python: Unique Combos & Generalizations
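As a rough sketch of the unique-combination idea covered in the video (the dataset and the choice of quasi-identifier columns are assumptions for illustration, borrowing the SLID data used in later posts):
from pydataset import data
df = data('SLID')
# Count how many rows share each combination of quasi-identifiers
combo_counts = df.groupby(['age', 'sex', 'language']).size()
# Combinations that appear only once could single out an individual
print(combo_counts[combo_counts == 1].head())
A combination shared by only one row is a re-identification risk, which is what generalizing values, as shown in the posts below, is meant to reduce.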
Program Implementation
Program implementation examines how a program is put into practice. The purpose of any program is to bring change to its stakeholders. Therefore, how the program is implemented plays a critical role in whether it is successful.
Components of Implementation
Joseph Durlak describes eight components of program implementation, as shown below.
- Fidelity
- Dosage
- Adaptation
- Quality
- Participant engagement
- Program differentiation
- Monitoring of control conditions
- Program reach
Most of these components are self-explanatory. Fidelity is the level of faithfulness the implementers of a program have to its procedures and/or protocols. Many programs have an experimental nature in which participants are compared either to themselves as a "before" group or to a control group that does not experience the program. To confirm that the program is the reason for any difference, it must be verifiable that the program's procedures were adhered to.
The same idea applies to dosage, which is the amount of the program that is experienced. This value must be consistent to establish any differences between groups. Dosage can be measured by the amount, length of time, number of occurrences, etc., that the program requires.
Adaptations are the modifications that are made to the program for various reasons. Sometimes the original procedures of the program are not practical during implementation. For example, a program may expect participants to receive counseling twice a week for 30 minutes each time, for a total of one hour per week. During implementation, it may be found that participants cannot come twice a week. Therefore, instead of meeting twice a week, the program is adapted to meet once a week for one hour. It is critical to keep track of adaptations, as they can cause a program to lose its focus and original purpose.
Participant engagement is how involved and cooperative the participants in the program are. Low engagement is often a sign that a program is failing. If this happens, it may be necessary to make adaptations to the program.
Program differentiation is the awareness of how the current program is different from other programs. Knowing what makes a program different is critical in showing how it is superior to other interventions that have been tried. Understanding these differences also helps determine what does and does not work in terms of helping participants.
Monitoring of control conditions focuses on the control variables that need to be monitored when a program uses an experimental group and a control group. Lastly, program reach is a measure of how much of the target population is involved with the program.
It is critical to be aware of these components of implementation, as they help evaluators determine the level of success a program has had. It is also important to make sure that the individuals who actually implement the program are trained and supported throughout the entire implementation process. If the implementers do not know what to do or feel abandoned, then implementation will suffer.
Factors of Implementation
Components of implementation are aspects within the program itself. Factors of implementation are variables outside of the program that influence it. According to Joseph Durlak, there are several factors to be aware of when it comes to implementation, including the following:
- Community level
- Traits of implementers
- Program traits
- Organizational factors
- Processes
- Staffing
- Professional development
The community-level factor relates to traits of the community surrounding the program and can include policies, politics, and the level of funding for a program. A negative political environment, for example, can seriously hamper cooperation.
The implementers' traits can include their skill level, confidence, sense of the program's relevancy, and more. We have already discussed implementers; if they lack skill, even the best programs will fail.
Program traits include how well the program fits the school and/or how adaptable the program is. Sometimes a great program is a poor cultural fit and/or too rigid for the local context. Consider the dosage example used earlier: twice-a-week counseling may not be appropriate for every context.
Organizational factors include the climate, openness, integration, etc., of the local organization that is supporting the program. A closed-off organization will probably not support any program no matter the benefits.
Processes include decision-making, communication, planning, etc. Programs require local stakeholders to make decisions about cooperation and other factors related to planning and implementation. If there is a bottleneck or resistance to developing processes, the program may never get off the ground.
Staffing is about leadership and how they support the program. Enthusiastic leaders may provide adequate support for a program while indifferent leaders may cause a program to fail. One reason for this is the control over resources and morale that leaders possess.
Professional development has already been alluded to; it is the support and training that the implementers of a program need. It is critically important that the individuals who bring a program to life through implementation receive the support and training they need to ensure success. If the implementers are confused about what to do, the program has little hope of success.
Conclusion
Program implementation is often overlooked. People are so excited to begin a new program to help people that they often forget to assess its implementation. This oversight can lead to good programs being labeled as failures, which leads to finger-pointing. Focusing on implementation can help alleviate this common occurrence.
Python for Data Privacy VIDEO
Privacy of Continuous Data with Python
There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.
Libraries
We will begin by loading the necessary libraries. Below is the code.
from pydataset import data
import pandas as pd
The library setup is simple. We are importing the data() function from pydataset, which will allow us to load the data we will use in this post. We are also importing pandas to make a frequency table later on. Below, we will address the data preparation.
Data Preparation
The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.
df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.
View of Original Histogram
Below are the code and output for the histogram of the "age" variable. The reason for making this visual is to provide a "before" picture of the data before changes are made.
df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.
Change to Categorical
Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.
df['age'] = df['age'].apply(lambda x: ">=40" if x >= 40 else "<40")
df.head()

We are overwriting the "age" variable in the code using an anonymous function. On the "age" variable we use the .apply() method, replacing values of 40 and above with ">=40" and values below 40 with "<40". The data is now broken down into two groups: those 40 and older and those under 40. Below is a frequency table of the transformed "age" variable.
df['age'].value_counts()
age
>=40 3984
<40 3441
Name: count, dtype: int64
The .value_counts() method comes from the pandas library. There are now two groups. The table above is a major transformation from the original histogram. Below are the code and output for a bar graph of this transformation.
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You do not have to limit yourself to only two groups when dividing your data. The number of groups depends on the context and the purpose for which this technique is used, as sketched below.
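For instance, here is a minimal sketch of generalizing age into several bands with the pandas cut() function; the bin edges and labels are assumptions chosen purely for illustration:
# Reload the data and generalize age into four bands instead of two
df = data('SLID')
df['age'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 120], labels=["<30", "30-44", "45-59", ">=60"])
df['age'].value_counts()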
Top Coding
Top coding is a technique used to bring extremely high values down to a fixed ceiling. Again, the purpose of modifying these values in our context is to protect people's privacy. Below are the code and output for this approach.
df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.
- We load the "SLID" dataset again so that we can modify it from its original state.
- We then use the .loc method to change all values in "age" above 75 to 75.
- Lastly, we create our histogram for comparison with the original data.
If you look to the far right, you can see a spike in the number of data points at age 75 compared to the original histogram. This is a result of our manipulation of the data. By doing this, we can keep all of our data for other forms of analysis while also protecting the privacy of the handful of people who are over the age of 75.
Bottom Coding
Bottom coding is the same as top coding except now you raise values below a threshold to a minimum value. Below is the code and output for this.
df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before, the only differences being the less-than "<" symbol and the threshold of 20. As you compare this histogram to the original, you can see a huge spike in the number of values at 20.
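As a side note, the thresholds do not have to be fixed numbers. One variation (an assumption on our part, not part of the original example) is to derive them from the data itself, such as the 1st and 99th percentiles, which applies bottom and top coding in one pass:
# Reload the data and derive thresholds from percentiles
df = data('SLID')
low = df['age'].quantile(0.01)
high = df['age'].quantile(0.99)
# Bottom and top code using the percentile thresholds
df.loc[df['age'] < low, 'age'] = low
df.loc[df['age'] > high, 'age'] = high
df['age'].hist(bins=15)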
Conclusion
Data protection is an important aspect of the analyst's role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python.
Thoughts on The State and Revolution by Lenin
The State and Revolution was written by Lenin in 1917. This text provides Lenin’s thoughts on the role of communism in the context of leading the proletarian revolution and the shape of the government afterward. The text is rather repetitive and rambling. Therefore, instead of providing a summary, which would be rather difficult, it was decided to briefly describe some of the text’s main points. These main points are…
- The purpose of the state
- The purpose of the revolution
- The stages after the revolution
None of the ideas above are in one specific place within the text. Instead, they are scattered throughout and shared repeatedly, making the text difficult to understand.
Purpose of the State
Lenin states that the state exists solely because of class antagonism. The government, in other words, referees the battle between the bourgeoisie and the proletariat. This makes sense, as you cannot have property or capital unless there is someone to protect said property. Without a government, no one could securely hold anything, whether communist or capitalist. The capitalists need the government to protect their capital, while the proletariat seeks justice from the same government.
Lenin also shares that the ruling class uses the state to oppress the poor. Again, it is hard to refute this, as corporate America has teamed up with the government before. However, Lenin leaves out how governments have responded to the cries of the poor in the past. For example, during Lenin's life, the Russian Czar attempted reforms before his downfall. Even before the French Revolution, the King of France tried to compromise. As such, even in monarchies, complete tone-deafness is difficult to maintain.
Purpose of Revolution
Lenin then shares that the purpose of revolution is to overthrow the bourgeoisie so the proletarians can take power. Lenin believes that overthrowing the ruling class will solve most, if not all, of society's problems.
The problem with this belief is that revolution leads to a new set of oppressors in most cases. The leadership changes but the wicked hearts of man remain the same. Lenin seems to think that the system is the problem (a sentiment shared today). In reality, it is the people who are the problem. All governments have issues and problems, but they also have one thing in common: people who form, lead, and destroy them.
Stages After the Revolution
Lenin also divides the stages after the revolution into three main parts. The first stage is the proletarian dictatorship. This dictatorship involves the proletarians using the apparatus of the conquered state to crush all of the remaining bourgeoisie. In other words, the tools of the enemy are used to destroy the enemy. This stage of the revolution has happened in many countries, such as Cambodia, Cuba, North Korea, Russia, and Vietnam. The landholders and capitalists are rounded up and killed, and the people seize their property. There is often a huge loss of life, as the revolutionaries tend to kill indiscriminately in their zeal for change.
The second stage is socialism, which involves the government having control over the means of production. Notice how the government is still being used, but the focus has shifted from slaughter to control of the people. In addition, contrary to popular belief, traditional communism does not want to control all property, just the property used for producing wealth. At this point, everyone gets only what they need instead of what they want, destroying all motivation and ambition to work hard. This is also the stage at which all communist governments stop. The government takes control and never gives up that control. This proves the point that communism swaps one corrupt leadership for another. The main difference between communism and capitalism is who has control: the individual or a monolithic government.
The final stage of the revolution is the withering away of the state. Once everyone is thoroughly communist and social classes are destroyed, there is no need for the state. No communist government has achieved this, as the revolution's leaders enjoy being in charge. The common counter to this observation is that nobody has successfully completed a communist revolution; therefore, people must try harder to achieve it. It must also be mentioned that there is no vision of the utopia itself, as Lenin shares that neither he nor Marx knows what it would be like. As such, the revolution must continue forever.
Conclusion
This was not a summary of Lenin's views in his book The State and Revolution. The goal was only to share some of the main points. This is probably one of Lenin's best-known books and required reading for hardcore leftists. Even though no one has achieved true communism, many are highly motivated to make this theory a reality.
Python for Data Privacy
Data privacy is a major topic among analysts who want to protect people's information. There are often ethical expectations that personally identifying information is protected. Whenever data is shared, you want to be sure that individual people cannot be identified within a dataset, as identification can lead to unforeseen consequences. This post will examine simple ways a data analyst can protect personal information.
Libraries & Data Preparation
There are few libraries and minimal data preparation for this example. The code and output are below.
from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.
We will now move to the first way to protect privacy when working with data.
Drop Columns
Sometimes protecting people's identity can be as easy as dropping a column. Often, the column(s) containing names, addresses, or phone numbers can be dropped. In the example below, we are going to pretend that the "language" column can be used to identify people. Therefore, we will drop this column. Below are the code and the output for this.
# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")
# Explore obtained dataset
suppressed_language.head()

To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.
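As a side note, the drop() method also accepts a list, so several identifying columns could be suppressed in one call. A minimal sketch, with the column choices assumed purely for illustration:
# Suppress several columns at once (illustrative column choices)
suppressed_columns = df.drop(['language', 'sex'], axis="columns")
suppressed_columns.head()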
Drop Rows
It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column, it may be possible to identify them. In the code and output below, we drop all rows where education is 14 or higher.
# Drop rows with education higher than 14
education = df.drop(df[df.education >= 14].index)
# See DataFrame
education.head()

In the code, we used the drop() method again but subsetted the data to select rows with education values of 14 or higher. We pass the index of this subset so that drop() removes rows rather than columns. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or higher.
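The same result can also be reached with boolean indexing instead of drop(); a minimal sketch:
# Keep only rows that do NOT have education >= 14
# (negating the condition keeps rows with missing education, matching drop()'s behavior)
education = df[~(df.education >= 14)]
education.head()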
Data Masking
Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.
# Uniformly mask the education column
df['education'] = '****'
# See resulting DataFrame
df.head()

The code selects the education column and sets every value equal to the asterisks. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it.
Replace Part of String
Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.
#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):])
#See Results
df.head()

The code involves rewriting the data in the “sex” column.
- We do this by using the apply() method on this column. Inside the apply() method, we use an anonymous function, which is indicated by the keyword "lambda".
- After lambda, we name the argument "text" for practical reasons, since we are modifying text.
- After the colon, we tell Python to keep the first character of the string, "text[0]", and then append four asterisks "****" after it.
- Lastly, we keep the rest of the string from wherever "le" appears, locating it with the find() method.
The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
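As a side note, the same masking can be written without apply() using pandas' vectorized string methods. A minimal sketch, assuming the "Male"/"Female" values found in SLID, where everything from "le" onward is simply the letters "le":
# Reload and mask the sex column without an explicit loop
df = data('SLID')
df['sex'] = df['sex'].str[0] + '****' + 'le'
df.head()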
Conclusion
Protecting personal data is critical when working with data. The ideas presented here are just some of the many ways that a data analyst can protect people's personal information.
Bokeh-Manipulating Glyph Color in Python VIDEO
Bokeh: Modifying Glyphs VIDEO
Generating Fake Data for Privacy with Python
The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.
Libraries & Data Preparation
The initial library needed is only “pydataset” which will allow us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the data using the .head() method. Below is the code and the output.
from pydataset import data
df=data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.
We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will have a default value that we will manipulate. Below is the code and output.
df['name']="Dan"
df['credit_card']=1234567890
df['credit_code']=123
df['credit_company']='comp'
df.head()

All of this new data will serve as the data that needs protection. The original data isn't needed; it just serves as a dataset onto which we graft the privacy-sensitive columns. Making a dataframe from scratch is a little complicated in Python and beyond the scope of this video, so we took a shortcut by adding to preexisting data. We will now see how to generate fake numbers and names.
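For reference, a small dataframe could also be built from scratch with the pandas DataFrame() constructor; a minimal sketch with placeholder values:
import pandas as pd
# Build a DataFrame from scratch instead of modifying SLID
df_scratch = pd.DataFrame({'name': ['Dan'] * 3, 'credit_card': [1234567890] * 3, 'credit_code': [123] * 3, 'credit_company': ['comp'] * 3})
df_scratch.head()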
Fake Numbers
The "faker" library has a class called "Faker" that can generate fake data for almost any circumstance. We will demonstrate this by generating phony credit card numbers. Below are the code and output.
# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()
'6561857744400343'
To generate the false credit card number, we loaded the faker library and imported the Faker class. Then we created an instance of Faker called "fake_data". Lastly, we used the .credit_card_number() method on the "fake_data" object.
We will now generate fake values for "credit_card", "credit_code", and "credit_company".
# Mask card number with new generated data using a lambda function
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())
# See the resulting pseudonymized data
df.head()

If you compare this output to the original, you can see that the values have changed. We set the seed using Faker.seed(0) so we always get the same results. The next three lines of code use an anonymous function, which allows us to loop through our dataset. First, we subset the column we want to overwrite. Second, we use the .apply() method on the same column. Inside the .apply() method, we use lambda followed by the argument x. After the x, we indicate what we want done to the column using the appropriate method from the faker library. Lastly, we display the results using the .head() method. We will now address the names of people.
Fake Names
There are at least three different methods for generating fake names: one that generates male or female names, one that generates only male names, and one that generates only female names. Below is a brief example of each.
Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())
Norma Fisher
Jorge Sullivan
Elizabeth Woods
The code above is self-explanatory. We used the print() function in order to print several different outputs. In the code below, we will use the .name() method to generate fake names for our "name" column.
Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as what we did with the credit card information. As such, we will not reexplain it here.
Conclusion
The ability to generate fake data as shown in this post allows an incredible amount of flexibility in protecting people's identities. However, nothing that is needed for developing insights should be lost. For example, replacing credit card numbers with random ones could be catastrophic if that information provides insights in a given context. Therefore, any tool that is going to be used must be used with wisdom and caution.
Postpositivist Paradigm and Program Evaluation
Within the context of program evaluation, different schools of thought or paradigms affect how evaluators do evaluation. In this post, we will look specifically at the postpositivist paradigm.
Postpositivism
The postpositivist paradigm grew out of the positivist paradigm. Both paradigms believe in using the scientific method to uncover laws of human behavior. There is also a focus on experiments, whether true or quasi-experimental, along with the use of surveys and/or observation. However, postpositivism will also take a mixed-methods approach (combining quantitative with qualitative) when it makes sense.
The main differences between positivism and postpositivism are the level of certainty and their contrasting positions on metaphysics. Positivists focus on the absolute certainty of results, while postpositivists are more focused on the probability of certainty. In addition, positivists believe in one objective reality that is independent of the observer, while postpositivists tend to have a more nuanced view of reality.
The typical academic research article follows the positivist/postpositivist paradigm. Such an article will contain a problem, purpose, hypotheses, methods, results, and conclusion. This structure is not unique to postpositivism, but it is important to note how ubiquitous this format is. The example above is primarily for quantitative research, but qualitative and mixed methods follow this format more loosely.
Within evaluation, postpositivism has influenced theory-based evaluation and program theory. Theory-based evaluation is focused on theories or ideas about what makes a great program, which are realized in the traits and tools used in the program.
Program theory is a closely related idea focused on the elements needed for achieving results and on showing how these elements relate to each other. The natural outgrowth of this is the logic model, which identifies what is needed for the program (inputs), what will be done with these resources (outputs), and what the impact of using these resources is among stakeholders (outcomes). The logic model is the bedrock of program evaluation in many contexts, such as within the government.
The reason for the success of the logic model is how incredibly structured and clear it is. Anybody can understand the results, even if they may not be useful. In addition, the logic model was developed earlier than other approaches to program evaluation, and it may be popular because it is one of the first approaches most students learn in graduate school.
The emphasis on theory within postpositivism can often come at the expense of what is taking place in the actual world. While the use of theory is critical for grounding a study scientifically, it can be alienating to the stakeholders who are tasked with using the results of a postpositivist program evaluation. As such, other schools of thought have sought to address this.
Conclusion
Postpositivism is one of many ways to view program evaluation. The steps are highly clear and sequential, and generally, everybody knows what to do. However, the appearance of clarity does not imply that it exists. Other paradigms have challenged the usefulness of the results of a program evaluation inspired by postpositivism.
Program Evaluation Paradigms
Program evaluation plays a critical role in assessing program performance. However, as with most disciplines of knowledge, there are different views or paradigms for how to assess a program.
The word paradigm, in this context, means a collection of assumptions or beliefs that shape an individual’s worldview. For example, creationists have assumptions about how life came to be that are different from those of people who believe in evolution. Just as paradigms influence science, they also play a role in how evaluators view the structure and purpose of program evaluation.
In this post, we will briefly go over four schools of thought or paradigms of program evaluation, along with a description of each and how they approach program evaluation. These four paradigms are
- Postpositivist
- Pragmatic
- Constructivist
- Transformative
Postpositivist
The postpositivist paradigm grew out of the positivist paradigm. Both paradigms are focused on the use of the scientific method to investigate a phenomenon. They also both support the idea of a single reality that is observable. However, postpositivists believe in a level of probability that accounts for human behavior. This assumption may explain the paradigm's heavy reliance on statistics, which is grounded in probability.
Postpositivism is heavily focused on methods that involve quantitative data. Therefore, any program evaluator who is eager to gather numerical data is probably highly supportive of postpositivism.
Pragmatic
A pragmatic paradigm is one in which there is a strong emphasis on the actual use of the results. A pragmatist wants to collect data that they are sure will be used to make a difference in the program. In terms of data and methods, anything goes as long as it leads to implementation.
Since pragmatism is so flexible it is supportive of mixed methods which can include quantitative or qualitative data. While a postpositivist might be happy once the report is completed, a pragmatist is only happy if their research is used by stakeholders.
Constructivist
The constructivist paradigm is focused on how people create knowledge. Therefore, constructivists are focused on the values of people because values shape ideas and the construction of knowledge. As such, constructivists want to use methods that focus on the interaction of people.
With the focus on people, constructivists want to create a story using narrative approaches that are often associated with qualitative methods. It is possible but unusual to use quantitative methods with constructivists, because such an approach does not help to identify what makes a person tick in the same way an interview would.
Transformative
The transformative paradigm is focused on social justice. Therefore, adherents to this paradigm want to bring about social change. This approach constantly investigates injustice and oppression. The world and the system need to be radically changed for the benefit of those who are oppressed.
People who support the transformative paradigm are focused on the viewpoints of others and the development of more rights for minority groups. When the transformative paradigm guides a program evaluation, the evaluators will look for inequity, inequality, and injustice. Generally, with this approach, the outcome is already determined: there is some sort of oppression and injustice happening, and the purpose of the evaluation is to determine where it is so that it can be stamped out.
Conclusion
The paradigm that someone adheres to has a powerful influence on how they approach program evaluation. The point is not to say that one approach is better than another. Instead, the point is that being aware of the various positions can help people to better understand those around them.
Bokeh-Manipulating Glyph Color
In this post, we will examine how to manipulate the color of the glyphs in a Bokeh data visualization. We are doing this not necessarily for aesthetic reasons but to convey additional information. Below are the initial libraries that we need. We will load additional libraries as required.
from pydataset import data
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.io import output_file, show
pydataset will be used to load the data we need. The next two lines import the figure() function for creating our axes and the ColumnDataSource object for storing our data. The last line includes functions for saving and displaying our visualization.
Data Preparation
For this example, data preparation is simple. We will load the dataset "Duncan" using the data() function into an object called "df". This dataset includes data about various occupations as measured on several variables. We will then display the data using the .head() method. Below are the code and output.
df=data('Duncan')
df.head()

Color Glyphs
In the example below, we will color the glyphs based on one of the variables. We will graph education vs income and color code the glyphs based on income. Below is the code followed by the output and finally the explanation.
from bokeh.transform import linear_cmap
from bokeh.palettes import RdBu8
source = ColumnDataSource(data=df)
# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))
# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
# Add circle glyphs
fig.circle(x="education", y="income", source=source, color=mapper,size=16)
output_file(filename="education_vs_income.html")
show(fig)

Here is what we did
- We had to load additional libraries. linear_cmap() will be used to create the actual coloring of the glyphs. RdBu8 is the color palette for the glyphs.
- We then create the source of our data using the ColumnDataSource() function.
- We create our mapper using the linear_cmap() function. The arguments inside the function are the variable we are using (income), the palette, and the low and high values for the variable.
- Next, we create our figure. We label our x and y axis based on the variables we are using and set the title.
- We use the .circle() method to create our glyphs. Notice how we set the color argument to our “mapper” object.
- The last two lines of code are for creating our output and showing it.
You could set the glyph color to a third variable, which would allow you to express a third variable in a two-dimensional space. For example, we could have used the “prestige” variable for the coloring of the glyphs rather than income, as income was already represented on the y-axis.
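A minimal sketch of that variation, reusing the objects created above and changing only the mapper (the output file name is our own choice):
# Map glyph color to "prestige" instead of income
mapper = linear_cmap(field_name="prestige", palette=RdBu8, low=min(df["prestige"]), high=max(df["prestige"]))
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
fig.circle(x="education", y="income", source=source, color=mapper, size=16)
output_file(filename="education_vs_income_by_prestige.html")
show(fig)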
Adding a Color Bar
Adding a color bar will help to explain to a reader of our visualization what the color of the glyphs means. The code below is mostly the same and is followed by the output and lastly the explanation.
from bokeh.models import ColorBar
source = ColumnDataSource(data=df)
# Create mapper
mapper = linear_cmap(field_name="income", palette=RdBu8, low=min(df["income"]), high=max(df["income"]))
# Create the figure
fig = figure(x_axis_label="Education", y_axis_label="Income", title="Education vs. Income")
fig.circle(x="education", y="income", source=source, color=mapper,size=16)
# Create the color_bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
# Update layout with color_bar on the right
fig.add_layout(color_bar, "right")
output_file(filename="Education_vs_prestige_color_mapped.html")
show(fig)

Here is what happened.
- We loaded a new function called ColorBar().
- We create our source data (same as the previous example)
- We create our mapper (same as the previous example)
- We create our figure and glyphs (same as the previous example)
- Next, we create our color bar using the ColorBar() function. Inside this function, we set the color_mapper argument to the 'transform' element of the mapper object we already created. We can also set the width of the color bar using the width argument. Everything we have done in this step is saved in an object called "color_bar".
- We then use the .add_layout() method on our "fig" object and place the "color_bar" object inside it along with the string "right", which tells Bokeh to place the color bar on the right-hand side of the scatterplot.
There is one more example of glyph manipulation below.
Color by Category
In this last example, we will map an additional categorical variable onto our plot using color. Below is the code, output, and explanation.
# Import modules
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10_5
source = ColumnDataSource(data=df)
TOOLTIPS=[('Education', '@education'), ('Prestige', '@prestige')]
# Create positions
positions = ["wc","bc","prof"]
fig = figure(x_axis_label="Education", y_axis_label="Prestige", title="Education vs Prestige", tooltips=TOOLTIPS)
# Add circle glyphs
fig.circle(x="education", y="prestige", source=source, legend_field="type",
size=16,fill_color=factor_cmap("type", palette=Category10_5, factors=positions))
output_file(filename="Education_vs_prestige_by_type.html")
show(fig)

- We need to load some additional libraries. factor_cmap() will be used to color the glyphs based on the categorical variable "type". Category10_5 is the color palette.
- We create an object called "source" for our data using the ColumnDataSource() function.
- We create an object called "TOOLTIPS". This object will be used to display the individual data points of a glyph when the mouse hovers over it in the visualization.
- We create an object called "positions", which is a list of all of the job types we want to match to different colors in our plot.
- We create an object called "fig", which uses the figure() function to create the x and y axes and the title of the plot. Inside this function, we also set the tooltips argument equal to the "TOOLTIPS" object we created previously.
- Next, we use the .circle() method on our "fig" object. Most of the arguments are self-explanatory, but notice how the "fill_color" argument is set to the factor_cmap() function. Inside this function, we indicate the variable "type" as the variable to use, set the palette, and set the factors to the "positions" object that we made earlier.
- The last two lines save the output and display it.
Conclusion
Bokeh allows you to do many cool things when creating a visualization for data purposes. This post was focused on how to manipulate the glyphs in a scatterplot. However, there is so much more that can be done beyond what was shared here.
Bokeh-Modifying Glyphs
In this post, we will examine additional ways to modify a Bokeh data visualization. The data points in a Bokeh visualization are referred to as glyphs. Glyphs can take various shapes, colors, etc., and in this post we will learn how to modify glyphs specifically for scatter plots.
Libraries
We will need several libraries to work with glyphs in a scatterplot. They are listed below.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
The first library is for pulling the data we need. The next two libraries are basic libraries from Bokeh for generating our scatterplot.
Data Preparation
Data preparation is simple in this example. All we have to do is use the data() function to load the “Duncan” dataset into an object we call “df.” After this, we print out the first few rows of this dataset using the head() method.
df=data('Duncan')
df.head()

The Duncan dataset contains various occupations that are scored on type, income, education, and prestige. We will be focused primarily on education and prestige in this post.
Default Scatterplot
The code below provides a default scatterplot without any modifications. The purpose of providing this is to serve as a comparison to what we will do when we modify the glyphs in the subsequent examples. Below is the code, output, and explanation.
#Fig Setup
fig = figure(x_axis_label="Prestige", y_axis_label="Education", title="Prestige vs Education")
# Add glyphs
fig.circle(x="prestige", y="education", source=df)
output_file(filename="pres_vs_inc.html")
show(fig)

The first line of code sets up the x and y axes using the figure() function, storing the result in an object we created called "fig". The second line adds the actual data points using the .circle() method on the "fig" object. The last two lines allow us to display our plot: the output_file() function allows us to save our work, and the show() function displays the plot.
Modified Glyphs
Our first modification will involve changes to the appearance of the glyphs in the scatterplot. The first line is the same but the second line includes changes to the size, color, and transparency of the glyphs. Below is the code, output, and explanation.
fig = figure(x_axis_label="Prestige", y_axis_label="Education", title="Prestige vs Education")
# Add glyphs
fig.circle(x="prestige", y="education", source=df, size=16, fill_color="yellow", fill_alpha=0.2)
output_file(filename="pres_vs_inc.html")
show(fig)

In the second line of code, we changed the size to 16 using the size argument and the fill of the dots to yellow using the fill_color argument. We also adjusted the transparency using the fill_alpha argument.
Multiple Glyphs
The example below includes the modification of different glyphs based on a categorical variable. A tooltip is also included to show how various tools can be combined when employing Bokeh. The code, output, and explanation are below.
#Subset Data
source_wc = df[df["type"]=="wc"]
source_prof = df[df["type"]=="prof"]
TOOLTIPS=[('Education', '@education'), ('Prestige', '@prestige')]
fig = figure(x_axis_label="Education", y_axis_label="Prestige", tooltips = TOOLTIPS)
wc_glyphs = fig.circle(x="education", y="prestige", source=source_wc, legend_label="WC", fill_color="blue",fill_alpha=0.2)
prof_glyphs = fig.circle(x="education", y="prestige", source=source_prof, legend_label="Prof", fill_color="green", fill_alpha=0.6)
output_file(filename="update.html")
show(fig)

Here is what we did
- We subsetted the data into two different objects (source_wc & source_prof) based on the type variable.
- We created our tooltip which allows us to see the education and prestige of individual data points on the graph.
- We make the figure, with the x and y axes, for our plot.
- We use the .circle() method twice: once for job type "wc" and once for job type "prof". We set the colors, fills, and transparency of each subset of data. Notice that we created a legend as well. Remember that we named the objects in this step wc_glyphs and prof_glyphs.
- The last two lines of code create a file for the visualization and display it.
Updating Glyphs
We will now learn how to update our code rather than recreating it from scratch. We will update the size of the glyphs and change their color from the prior example.
# Update glyph size
wc_glyphs.glyph.size = 20
prof_glyphs.glyph.size = 10
# Update glyph fill_color
wc_glyphs.glyph.fill_color = "red"
prof_glyphs.glyph.fill_color = "yellow"
output_file(filename="update.html")
show(fig)

We begin by using the objects we created in the last example, wc_glyphs and prof_glyphs. These two objects contain all the information for creating the figure and data points. To update the glyph size, we use the .glyph.size attribute and set it to a different value for each of the two objects. We repeat this process for the fill_color.
You can compare the last two visualizations and see the differences for yourself.
Conclusion
Being able to customize glyphs is another powerful feature of Bokeh. With these tools, you can modify your visualizations with ease to communicate with your audience.
Notes on Nationalism
George Orwell wrote an essay entitled “Notes on Nationalism” around the time of WW II. In this brief essay, Orwell defines nationalism along with a description of the traits of nationalists. In this post, a summary of his essay will be provided along with modern examples of some of his key points.
Defining Nationalism
For Orwell, nationalism is an individual's identification with a single nation or unit. A nation is a country, but a unit is much harder to define. A unit could be a religion, such as Islam, or an ideology, like communism. Simply put, a unit can be anything that is not a nation.
Orwell then goes on to describe two types of nationalism which are positive and negative nationalism. A positive nationalist wants to boost the prestige of his country or unit. An example of a positive nationalist would be a patriotic American who believes in “God bless America.”
The examples of positive nationalism Orwell includes in his essay are Zionism, which supports the idea of a Jewish state and is not ashamed to do so, and Celtic nationalism, which supported the Celtic ethnicities in the United Kingdom. What both of these examples have in common is a focus on supporting a unit of people to achieve goals and objectives.
A negative nationalist is a person who wants to denigrate or lower the prestige of a country or unit. An example of this would be Americans who are ashamed or embarrassed by the past atrocities of the US and want the US to offer reparations and apologies and to show penitence. These people are also nationalists but have a sense of shame over their country's behavior that is baffling to a positive nationalist.
The examples of negative nationalism that Orwell shares in his essay include Anti-Semitism, Anglophobia, and Trotskyism. Anti-Semitism is racism against people who are Jewish and does not require much additional explanation. Anglophobia is a negative attitude towards the UK. What makes Anglophobia pertinent is that a similar attitude has permeated the US in recent years. Trotskyism was a branch of mainly Russian communists who did not support Stalin's leadership of the Soviet Union.
What all of these negative nationalists have in common is hatred of and/or resistance to another country or unit. This leads to the conclusion that whether someone is a positive or negative nationalist depends on who is asking the question. For example, someone who supports Black Lives Matter might see themselves as a patriot continuing the American tradition of fighting for equality. However, another person might see BLM in a negative light due to the instability that BLM brings into certain areas. In the end, whether someone is a positive or negative nationalist is based more on marketing than on the actual behavior and beliefs of the individuals involved.
There is one more group of nationalists that does not fall neatly into the two categories already mentioned: transferred nationalists. A transferred nationalist is a person who holds a position contrary to the context in which they live. An example that Orwell uses is a communist who lives in a capitalist country, which is a minority position. Another example he shares is political Catholicism, which was the promotion of Catholic social teachings through government support. Political Catholicism is a form of transferred nationalism because the use of the state to support religion in this manner was, in Orwell's view, an unusual position.
As mentioned before, whether someone is a positive, negative, or transferred nationalist is a matter of perspective. The main point here is to understand how nationalism can manifest in different ways and different contexts.
Main Characteristics of Nationalism
In addition to categorizing the types of nationalism, Orwell also provides three main characteristics of nationalists: obsession, instability, and indifference to reality. Obsession is being highly focused on the group or unit that the nationalist is supporting. For example, Zionists are highly focused on Israel and matters related to this country. Black Lives Matter supporters are highly focused on systemic racism and matters related to the Black community.
Instability relates primarily to transferred nationalists, and it is loyalty to something outside of the system one is in. The previous example was a communist in a capitalist country. A more recent example is natural-born Americans supporting immigration regardless of the context. Perhaps the reason Orwell labels this instability is that a minority position can often push for change that destabilizes the status quo.
The final trait of nationalists is indifference to reality. Reality is not defined in a traditional manner here but is more focused on morality. Nationalists see the world from their viewpoint to the exclusion of all contradictory evidence. What is good or bad is based not on the behavior but on who did it. If the US attacks another country, it is a fight for freedom. However, if anybody attacks the US, it is considered terrorism. For a pro-US nationalist, no information can be given to criticize US aggression or to condone attacks on the US, because it is not evidence or morals that matter but the group or unit that the nationalist is supporting.
We can extend this to every other example if we want. For transferred nationalists, immigration is acceptable no matter how much crime, unemployment, or strain on social services results. The opposite is true for US positive nationalists: immigration is a problem no matter how many hard-working, tax-paying immigrants come. The same applies to Black Lives Matter and racism. No matter what the government does, systemic racism is still a threat to Blacks. On the other hand, US nationalists are convinced that nothing can be done to appease people who think they are victims of racism.
Conclusion
Orwell's views on nationalism provide an interesting take from the WW II era. The point here was not to criticize his view but rather to explain his position with a few recent examples. Nationalism is a part of the worldview of most individuals in one way or another. What is truly important is to be aware of one's own position on nationalism.
Extracting & matching in R VIDEO
Understanding the Preface of a Textbook VIDEO
Bokeh Display Customization VIDEO
Essay on Liberation-Subverting Forces & Solidarity
This post will examine chapters three and four of Herbert Marcuse’s “Essay on Liberation.” This highly influential essay, written in the 1960s, lays out many of the left’s goals and desires regarding the reshaping of society.
Subverting Forces
Chapter 3 is mostly a rehash of complaints and solutions that Marcuse has already addressed in his essay. It begins with a litany of complaints, including the terrible jobs people have to work, the exploitation of minorities, increased violence, and the waste of resources. All of these complaints are blamed on capitalism. It should be noted that every system, including the communist system Marcuse supports, has some sort of flaws and even oppression within it.
Marcuse also mentions how technology can be used to end capitalism rather than support it. The challenge is that the technocrats are using technology to continue the existing system of oppression. For Marcuse, this is not only terrible, but the current system must be abolished, as reform is not even an option. This sentiment is shared by many leftists today who call for the destruction of the current system in order to set up a completely new one.
Marcuse also calls on universities to radicalize students by developing and/or awakening their true consciousness. A true consciousness is a mind that has awakened to its true socialist nature. It appears the universities have heeded Marcuse's call, as many of them are considered bastions of left-wing thinking. Again, the problem isn't that Marcuse believes these things but that he wants everyone else to believe them and thinks it's acceptable to use the educational system for this purpose. If we are really free, we should be able to accept or reject the worldview that Marcuse so vehemently supports.
Marcuse repeats his desire to radicalize the ghetto (black) population as well. Again, the reason for radicalizing students and minorities is to replace the proletariat workers who are enjoying their middle-class lifestyle. Marcuse never mentions how the ghetto populations were to be radicalized, but it would probably involve former university students who have achieved their true consciousness educating and working among the ghetto populations and pointing out the oppression these people face. Paulo Freire may be one example of this, as he worked exclusively among poor and minority populations as a language teacher in Brazil, pointing out oppression.
One shocking comment Marcuse makes about the black population of his time is that they are expendable. Now, expendable does not mean that blacks should be eliminated or that they have no value. Rather, Marcuse used the term “expendable” to mean that the majority of blacks are not contributing significantly to the current economic system. For Marcuse, this is an advantage because these oppressed individuals are potential recruits for the revolution.
Correlation is not causation, but a surprising number of radical black groups arose in the 1960s and 1970s. Examples include the Black Panthers and the Black Liberation Army. There are also a host of other left-leaning groups, such as the Symbionese Liberation Army, the Weather Underground, and Students for a Democratic Society. Examples like these help explain why Marcuse is often called the "father of the new left."
Solidarity
The final chapter of Marcuse's essay shares how the revolution was successful in both Cuba and Vietnam. With successes as recent as these (Marcuse was writing in the 1960s), Marcuse implies that similar success can be experienced in the US. At the time, it was unclear what to expect from the communist revolutions in Cuba and Vietnam. However, history shows us that these revolutions were not blessings to the citizens of either country.
Marcuse then goes on to ponder what life after the revolution will look like. He essentially implies that it is unclear what life will truly be like after the communist revolution. This is a common criticism of communism in that the proponents want a different world but have no idea what to do if they take power. Given the track record of communist governments, it is better that communists pursue power rather than obtain it.
Conclusion
Marcuse had a strong vision for what he wanted to see happen in America. His desire was for the fall of capitalism and the rise of a socialist/communist utopia. In his essay, he lays out this dream. Unfortunately, the outcome of communist revolutions is generally negative, leading to huge loss of life as people's freedoms are curtailed for the sake of the collective.
Bokeh Display Multiple Plots VIDEO
Essay on Liberation-The New Sensibility
This post will look at the second chapter of Herbert Marcuse’s essay “Essay on Liberation.” The general gist of this influential essay is to bemoan capitalism and champion the benefits and superiority of socialism. The focus of this chapter in particular is mostly on the benefits and implementation of socialism.
The New Sensibility
A key word in this chapter is "sensibility." From what I can determine, the word "sensibility" in the title relates to worldview, or perhaps world order. Therefore, in this chapter, Marcuse attempts to explain the new worldview and values of individuals who have been liberated from capitalism.
Within this chapter, Marcuse talks about a world in which injustice and misery have been abolished and a controlled economy is in place. By controlling the economy, people are freed from the evils of capitalism. The evils of capitalism appear to be hard work and consumerism, as these are the concepts Marcuse criticizes and complains about.
Marcuse also tries to explain what a liberated consciousness is. A liberated consciousness belongs to someone who has been awakened to the evils of capitalism and understands the natural state of man, which is as a socialist being. The way Marcuse describes this is similar to Plato's cave allegory: someone realizes that the way they see the world is a shadow of actual reality, with the chains representing capitalism. I cannot confirm this, but Marcuse's concept of the liberated consciousness may have inspired Freire's critical consciousness, which sounds similar and is focused on recognizing the oppression found in the pedagogical process.
Marcuse goes on to share how praxis is key. An appropriate definition of praxis here would be social action, which generally involves protesting and other forms of destabilizing the existing society. In other words, it is not enough to be awakened; one must push for the manifestation of this awakening in the real world. Freire also speaks of social action and unrest in his work. Socialism is not content to exist alongside other worldviews; it wants to overtake the world and bring about a utopia that has never existed in recorded human history.
Another aspect of this chapter is Marcuse's exploration of how art shapes reality. Art can be used to influence and shape reality through its ability to express what is ideal. By warping reality through art, society can supposedly be changed for the better as well. Marcuse only briefly touches on this idea in this essay, but he explores it in greater detail in his other works.
Conclusion
Marcuse lays out his claims for the need for socialism and how people would act if they were awakened to their true nature. The main failure of Marcuse's argument is its theoretical nature. The reality of socialism and communism is a system that lacks the benefits and resources it claims to provide.
Essay on Liberation-Biological Foundation
Herbert Marcuse wrote a famous essay in the 1960s entitled "Essay on Liberation." The writing is somewhat difficult and convoluted, which means interpretation can be challenging. However, the main thesis of Marcuse's essay appears to be that the productivity of capitalism is inhibiting the rise of the socialist revolution. He addresses this thesis by explaining how a man can take care of himself without being dependent on the capitalist system and by asserting that there can be no freedom from labor in the current capitalist system.
In this post, we will attempt to provide a succinct summary of this essay. In particular, we will focus only on chapter one, entitled "Biological Foundation of Socialism."
Biological Foundation for Socialism
The first part of Marcuse's essay addresses the biological foundation for socialism. From what I can assess, the term "biological" refers to the innate need or basis for socialism. In other words, Marcuse builds a case for socialism as the natural state of man in the first part of his essay.
Marcuse lays out two problems with capitalism: the increase in production and the exploitation of products. For Marcuse, capitalist societies overproduce but at the same time do not provide enough for the people trapped in this oppressive system. For people to be free, they must break their dependence on this market system with its focus on consumption. However, Marcuse later goes on to prescribe a controlled market as the alternative, which has its own problems of efficiency, as demonstrated by communist states such as the Soviet Union.
Marcuse also shares that capitalism is transformative. By transformative, Marcuse is probably referring to how capitalism changes the nature, character, and/or values of the individual. The accusation of the transformative nature of capitalism may also be why Marxists in general speak of transformation. However, when Marxists speak of transformation, they believe it relates to awakening man to his true socialist nature rather than the capitalist lie. For Marcuse, the change in an individual brought about by capitalism causes exploitation, as the individual buys into an oppressive system. Anyone familiar with the term "rat race" may have sympathy with Marcuse's views.
Marcuse desires to free man from this exploitative system. This gives the impression that people should not have to do anything they don't want to do. The problem is that many communist and socialist countries still have exploitative systems that force people to do things after the revolution. In other words, there is no system in which man is truly free. Everyone has to spend time doing things they do not want to do. The only difference is who the master is and what the benefits of serving him are.
Marcuse then goes on to explain why the Marxist revolution has not taken place. He claims that poverty does not bring revolution as Marx argued it would. With the success of capitalism, the proletariat was beginning to move into the middle class. The problem with the economic success of the middle class is that they hate the idea of revolution. This disdain for revolution stems from the middle class's investment in the current system. In other words, capitalism blunts the desire for true freedom because it bribes individuals with economic gain.
Marcuse’s solution to the middle class’s stabilization was to focus on the radicalization of the super poor and blacks. In later parts of his essay, he adds students to this potential pool of revolutionaries. By shifting the focus away from the traditional proletariat, who are essentially sell-outs, to other oppressed groups, the revolution can continue.
The impact of this statement is felt today. Now, we have a plethora of groups who are crying out about the oppression of capitalism and other norms of society such as sexuality, health, race, etc. The idea of radicalizing various ethnic, sexual, and other minorities for the sake of revolution may have started with the ideas of Marcuse in the 1960s.
Conclusion
Marcuse lays out several key terms of his essay in this first chapter. Establishing this foundation is key as we will see how the rest of the essay is a variation of the ideas presented here.
Bokeh Tools and Tooltips VIDEO
Creating Multiple Plots Using Bokeh in Python
In this post, we will look at how to make multiple plots at once using Bokeh in Python. This technique can be a powerful tool when you need to create visualizations rapidly for whatever purpose you may have.
Needed Initial Libraries & Data Preparation
Below are the initial libraries we need to begin this example and the data preparation.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
import pandas as pd
df=data("Duncan")
The first line loads the data() function from pydataset, which provides the data we will use. Next, we load the figure function from Bokeh, which will allow us to create our plots. After this, we load the output_file() and show() functions, which will allow us to display our plots. Lastly, we create our df object, which holds the Duncan dataset containing job type, prestige, income, and education as variables.
Multiple Scatterplots
Below is an example of displaying multiple scatterplots. The code and visualization are below followed by an explanation.
# SCATTER PLOT
from bokeh.layouts import column
wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]
fig_one = figure(x_axis_label="Education",y_axis_label="Prestige")
fig_two = figure(x_axis_label="Education",y_axis_label="Prestige")
fig_one.circle(x="education", y="prestige",source=wc,color="blue", legend_label="wc")
fig_two.circle(x="education", y="prestige",source=prof,color="red", legend_label="prof")
output_file(filename="column_plots.html")
show(column(fig_one, fig_two))

You can see the plots are stacked into a single column. The actual setup for this is simple.
- We loaded the column() function which allows us to display visualizations in columns
- We subsetted the data so that wc workers are in one object and prof are in the other object
- We created two figures (fig_one, fig_two), one for each of the two subsets. The figures are identical, and both contain education and prestige as the variables
- We then added the data to both figures distinguishing the plots by having different colored dots
- We created a name for the output
- Inside the show() function we used the column() function to display the visualization in columns
Most of this code has been reviewed previously; the only new element is the use of the column() function within the show() function.
Multiple Bar Plots
In this example, we use bar plots and rows instead of columns. The code is followed by the visualization and the explanation.
# BAR PLOT
from bokeh.layouts import row
income=pd.DataFrame(df.groupby('type')['income'].mean())
prestige=pd.DataFrame(df.groupby('type')['prestige'].mean())
types = ["prof", "wc", "bc"]
income_type = figure(x_axis_label="type", y_axis_label="income",
                     x_range=types)
prestige_type = figure(x_axis_label="type", y_axis_label="prestige",
                       x_range=types)
# Add bar glyphs
income_type.vbar(x="type", top="income", source=income)
prestige_type.vbar(x="type", top="prestige", source=prestige)
# Generate HTML file and display the subplots
output_file(filename="my_first_column.html")
show(row(income_type, prestige_type))

Here is what we did:
- We loaded the row() function, which allows us to arrange plots in rows, as you can see
- We calculated the group means for each job type
- We created a list called types which included the three job types in our dataset
- Next, we made our two figures. One for income and the other for prestige
- After this, we added the data to the plots
- Lastly, we created the output and showed the visualizations this time using rows
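The row() and column() functions can also be nested to build more complex layouts. Below is a minimal sketch of this idea, assuming a fresh session in which the four figures above (fig_one, fig_two, income_type, prestige_type) have been created but not yet shown, since a figure can only belong to a single layout and document.
from bokeh.layouts import column, row
# Scatterplots side by side on top, bar plots side by side underneath
nested = column(row(fig_one, fig_two),
                row(income_type, prestige_type))
output_file(filename="nested_layout.html")
show(nested)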
Gridplots
The code below takes a different approach using grid plots. This allows you to set columns and rows for your multiple plots. Below is the code, output, and explanation of this.
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.models import NumeralTickFormatter
plots = []
# Complete for loop to create plots
for job_type in ["bc", "wc"]:
    # Subset the data for the current job type, leaving df itself intact
    subset = df.loc[df["type"] == job_type]
    source = ColumnDataSource(data=subset)
    fig = figure(x_axis_label="education", y_axis_label="income")
    fig.circle(x="education", y="income", source=source, legend_label=job_type)
    fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
    plots.append(fig)
# Display plot
output_file(filename="gridplot.html")
show(gridplot(plots, ncols=2))

- We began by loading the gridplot(), ColumnDataSource(), and NumeralTickFormatter() functions. gridplot() makes the grid, ColumnDataSource() creates a native Bokeh data structure, and NumeralTickFormatter() allows us to format the numbers on the axes.
- We created an empty list called plots that we will use in our for-loop
- We used a for loop to generate the plots, subsetting the data for one job type on each pass. The plots graph education vs income, and NumeralTickFormatter() allowed us to display dollar signs on the y-axis
- We then displayed the plot
Conclusion
This post provided an example of how to make multiple plots with Bokeh. These tools can be utilized in many different ways, depending on your data purposes.
Bokeh Display Customization in Python
In this post, we will examine how to modify the default display of a plot in Bokeh, a library for interactive data visualizations in Python. Below are the initial libraries that we need.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
The first line of code is where our data comes from. We are using the data() function from pydataset for loading our data. The next two lines are for making the plot’s figure (x and y axes) and for the output file.
Data Preparation
There is no data preparation beyond loading the dataset using the data() function. We pick the dataset “Duncan” and load it into an object called “df.” The code is below, followed by a brief view of the actual data using the .head() method.
df=data('Duncan')
df.head()

This dataset includes various occupations measured in four ways: job type, income, education, and prestige.
Default Graph’s Appearance
Before we modify the appearance of the plot, it is important to know what the default appearance of the plot is for comparison purposes. Below is the code for a simple plot followed by the actual output and then lastly an explanation.
# Create a new figure
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add circle glyphs
fig.circle(x=df["education"], y=df["income"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

The first line of code sets up the fig or figure. We use the figure() function to label the axes which are education and income. The second line of code creates the actual data points in the figure using the .circle() method. The last two lines create the output and display it.
So the figure above is the default appearance of a graph. Below we will look at several modifications.
Modification 1
In the code below, we are making the following changes to the plot.
- Identifying data points by job type using color
- Changing the background color to black
Below is the code followed by the output and the explanation
# Import curdoc
from bokeh.io import curdoc
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]
# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add prof circle glyphs
fig.circle(x=prof["education"], y=prof["income"], color="yellow", legend_label="prof",size=10)
# Add bc circle glyphs
fig.circle(x=bc["education"], y=bc["income"], color="red", legend_label="bc",size=10)
output_file(filename="prof_vs_bc.html")
show(fig)

Here is what happened:
- We load curdoc, which allows us to modify the appearance of the document
- Next, we do some data preparation, separating the rows whose type is "prof" and those whose type is "bc" into separate objects.
- We change the theme of the plot to contrast using curdoc().theme
- We then create the figure as done previously
- We use the .circle() method twice: once to set the "prof" data points on the plot and a second time to place the "bc" data points on the plot. We also make the data points larger by setting the size and use different colors for each job type.
- The last two lines of code are for creating the output and displaying it.
You can see the difference between this second plot and the first one. This also shows the flexibility that is inherent in the use of Bokeh. Below we add one more variation to the display.
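One caveat worth noting: a theme set through curdoc() is global, so it stays in effect for any plots created afterward. Below is a small sketch of swapping in Bokeh's other built-in theme names; as far as I can tell, "caliber" is close to the default appearance.
from bokeh.io import curdoc
# Any built-in theme name can be assigned here
curdoc().theme = "dark_minimal"   # or "light_minimal", "night_sky", "caliber"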
Modified Graph’s Appearance
The plot below is mostly the same except for the following
- We add a third job type “wc”
- We modify the shapes of the data points
Below is the code followed by the graph and the explanation
# Create figure
wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]
fig = figure(x_axis_label="Education", y_axis_label="Income")
# Add circle glyphs for wc
fig.circle(x=wc["education"], y=wc["income"], legend_label="wc", color="purple",size=10)
# Add square glyphs for prof
fig.square(x=prof["education"], y=prof["income"], legend_label="prof", color="red",size=10)
# Add triangle glyphs for bc
fig.triangle(x=bc["education"], y=bc["income"], legend_label="bc", color="green",size=10)
output_file(filename="education_vs_income_by_type.html")
show(fig)

The code is almost all the same. The main difference is that there are now three job types, and each type has a different shape for its data points. The shapes are determined by using the .circle(), .square(), or .triangle() methods.
Conclusion
There are many more ways to modify the appearance of a visualization in Bokeh. The goal here was to provide some basic examples that may lead to additional exploration.
Bokeh-Scatter Plot basics in Python VIDEO
Bokeh Tools and Tooltips
In this post, we will look at how to manipulate the different tools and tooltips that you can use to interact with data visualizations made using Bokeh in Python. Tools are the icons that are displayed by default to the right of a visualization in a Bokeh output. Tooltips provide interactive data based on the position of the mouse.
We will now go through the process of changing these tools and tooltips for various reasons and purposes.
Load Libraries
First, we need to load the libraries we need to make our tools. Below is the code followed by an explanation.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
We start by loading "data" from "pydataset". This library contains the actual data we are going to use. The other imports are all related to Bokeh: the "figure" function will create the details of our visualization, the "output_file" function will make our HTML document, and the "show" function will display our visualization.
Data Preparation
Data preparation is straightforward. All we have to do is load our data into an object. We will use the “Duncan” dataset, which contains data on various jobs’ income, education, and prestige. Below is the code followed by a snippet of the actual data.
df=data('Duncan')
df.head()

Default Settings for Tools
We will now look at a basic plot with the basic tools. Below is the code.
# Create a new figure
fig = figure(x_axis_label="income", y_axis_label="prestige")
# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

There is nothing new here. We create the figure for our axes first. Then we add the points in the next line of code. Lastly, we write some code to create an output. The default toolbar has seven options, explained below from top to bottom.
- At the top is the logo that takes you to the Bokeh website
- Pan tool
- Box zoom
- Wheel zoom
- Save figure
- Reset figure
- Takes you to Bokeh documentation
We will now show how to customize the available tools.
Custom Settings for Tools
In order to make a set of custom tools, we need to make some small modifications to the previous code as shown below.
# Create a list of tools
tools = ["lasso_select", "wheel_zoom", "reset","save"]
# Create figure and set tools
# Create a new figure
fig = figure(x_axis_label="income", y_axis_label="prestige",tools=tools)
# Add circle glyphs
fig.circle(x=df["income"], y=df["prestige"])
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

What is new in the code is the object called "tools". This object contains a list of the tools we want to be available in our plot. The names of the available tools are listed in the Bokeh documentation. We then pass this object to the argument called "tools" in the line of code where we create the "fig" object. If you compare the second plot to the first plot, you can see we have fewer tools in the second one, as determined by our code.
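Tools can also be added to an existing figure rather than being listed up front. Below is a minimal sketch, assuming the fig object from the example above; HoverTool comes from bokeh.models, and "@x" and "@y" are the default column names Bokeh assigns when raw sequences are passed to .circle().
from bokeh.models import HoverTool
# Attach a hover tool to the figure created above
fig.add_tools(HoverTool(tooltips=[("Income", "@x"), ("Prestige", "@y")]))
show(fig)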
Hover Tooltip
The hover tooltip allows you to place your mouse over the plot and have information displayed about what your mouse is resting upon. Being able to do this can be useful for gaining insights about your data. Below is the code and the output followed by an explanation.
# Import ColumnDataSource
from bokeh.models import ColumnDataSource
# Create source
source = ColumnDataSource(data=df)
# Create TOOLTIPS and add to figure
TOOLTIPS = [("Education", "@education"), ("Position", "@type"), ("Income", "@income")]
fig = figure(x_axis_label="education", y_axis_label="income", tooltips=TOOLTIPS)
# Add circle glyphs
fig.circle(x="education", y="income", source=source)
output_file(filename="first_tooltips.html")
show(fig)

Here is what happened.
- We loaded a new function called ColumnDataSource. This function allows us to create a data structure that is unique to Bokeh. It is not strictly required here, but it will appear again in future examples.
- We then wrap our dataset with this new function and call the result "source"
- Next, we create a list called "TOOLTIPS". This list contains tuples, which are in parentheses. The first string in each tuple is the name that appears in the hover. The second string accesses the value in the dataset. For example, if you look at the hover in the plot above, the first line says "Education" and the number 72. The string "Education" is just the first string in the tuple, and the value 72 is the value of education from the dataset for that particular data point
- The rest of the code is a review of what has been done previously. The only difference is that we use the argument "tooltips" instead of "tools"
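Tooltip fields can also carry a number format. Below is a hedged sketch, assuming the source object from above; the {$0,0} suffix applies Bokeh's default numeral-style formatting to the income value, and the output filename is hypothetical.
# Format the income field as a dollar amount in the hover
TOOLTIPS = [("Education", "@education"), ("Income", "@income{$0,0}")]
fig = figure(x_axis_label="education", y_axis_label="income", tooltips=TOOLTIPS)
fig.circle(x="education", y="income", source=source)
output_file(filename="formatted_tooltips.html")
show(fig)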
Conclusion
With tooltips and tools, you can make some rather professional-looking visualizations with a minimal amount of code. That is what makes the Bokeh library so powerful.
Bar Graphs Using Bokeh and Python VIDEO
Make a Bar Graph with Bokeh in Python
Bokeh is a data visualization library available in Python with the unique ability of interaction. In this post, we will look at how to make a basic bar graph using Bokeh.
To begin we need to load certain libraries as shown below.
from pydataset import data
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_file, show
In the code above, we load the “pydataset” library to gain access to the data we will use. Next, we load “pandas” which will help us with some data preparation. The last two libraries are related to “bokeh.” The “figure” function will be used to set the actual plot, the “output_file” function will allow us to save our plot as an HTML file and the “show” function will allow us to display our plot.
Data Preparation
We need to do two things to be ready to create our bar graph. First, we need to load the data. Second, we need to calculate group means for the bar graph. Below is the code for the first step followed by the output.
df=data('Duncan')
df.head()

In the code above, we use the "data" function to load the "Duncan" dataset into an object called "df". Next, we display a snippet of it. The "Duncan" dataset contains data on different jobs: the type of job, income, education, and prestige. We want to graph prestige by job type as a bar graph, which requires us to calculate the mean of prestige by type. The code for this is below.
# Calculate group means of prestige
positions = df.groupby('type', as_index=False)['prestige'].mean()
positions

In the code above, we use the "groupby" function on the "df" object. Inside the function, we indicate we want to group by "type". The "as_index" argument is set to False so that the "type" column is not set as the index (i.e., as the row labels). Next, we subset the data using square brackets to include only the "prestige" column. Lastly, we indicate that we want to calculate the "mean". The result is that there are three job types, and we have the mean prestige for each. The job types and means from the table above are what we will use for making our visualization.
Bar Graph
We are now ready to make our bar graph. Below is the code followed by the output.
# Instantiate figure
fig = figure(x_axis_label="positions", y_axis_label="Prestige", x_range=positions["type"])
# Add bars
fig.vbar(x=positions["type"], top=positions["prestige"],width=0.9)
# Produce the html file and display the plot
output_file(filename="Prestige.html")
show(fig)

Here are the steps.
- We began by creating the "fig" object. We labeled our x and y axes and also indicated the range of the x values, which means determining the categories of our data. For our purposes, these were the unique job types in the "type" column.
- Next, we use the “vbar” function to make our bar graph. The x values were set to the “type” column from the “positions” object. The y or “top” values were set to the means of “prestige” from the “positions” object. The “width” argument was set to 0.9 to ensure there was a little whitespace between the bars.
- The “output_file” creates a saved plot and the “show” function displays the bar graph.
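As a small variation on the steps above, the bars can be ordered by value rather than by the order the categories appear. Below is a hedged sketch that assumes the objects from this example are still in memory; the output filename is hypothetical.
# Sort the group means so the bars descend from highest to lowest prestige
ordered = positions.sort_values("prestige", ascending=False)
fig = figure(x_axis_label="positions", y_axis_label="Prestige",
             x_range=list(ordered["type"]))
fig.vbar(x=ordered["type"], top=ordered["prestige"], width=0.9)
output_file(filename="Prestige_sorted.html")
show(fig)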
Conclusion
Bokeh has lots of useful tools available for the data analyst. This post focused on bar graphs, but only the most basic information has been shared here. There is much more possible with this library.
Bokeh-Scatter Plot Basics in Python
Bokeh is another data visualization library available in Python. One of Bokeh’s unique features is that it allows for interaction. In this post, we will learn how to make a basic scatterplot in Bokeh while also exploring some of the basic interactions that are provided by default.
Data Preparation
We are going to make a scatterplot using the “Duncan” data set that is available in the “pydataset” library. Below is the initial code.
from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
The code above is just the needed libraries. We loaded pydataset because this is where our data will come from. All of the other imports are related to Bokeh. The figure() function allows us to set up our axes for the scatterplot. The output_file() function allows us to create the file for our plot. Lastly, show() allows us to display our visualization. In the code below, we will load our dataset, give it a name, and print the first few rows.
df=data('Duncan')
df.head()

In the code above, we store the "Duncan" dataset in an object called "df" using the data() function. We then display a snippet of the data using the .head() method. The "Duncan" data shares information on jobs as defined by several variables. We will now proceed to the scatterplot.
Making the Scatterplot
We will now make our scatterplot. We have to do this in three steps.
- Make the axes
- Add the data to the plot
- Create the output file and show the results
Below is the code with the output
# Create a new figure
fig = figure(x_axis_label="education", y_axis_label="income") # label the axes
# Add circle glyphs
fig.circle(x=df["education"], y=df["income"]) # adds the dots
# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

At the top of the code, we create our axis information using the “figure” function. Here we are plotting education vs income and storing all of this in an object called “fig”. Next, we insert the data into our plot using the “circle” function. To insert the data we also have to subset the “df” dataframe for the variables that we want. Note that the data added to a plot are called “glyphs” in Bokeh. Lastly, we create an output file using a function with the same name and show the results.
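The glyph can also be styled at creation time. Below is a hedged variation on the same plot; size, alpha, and fill_color are standard optional arguments of the circle glyph, and the output filename is hypothetical.
# The same scatterplot with larger, semi-transparent navy dots
fig = figure(x_axis_label="education", y_axis_label="income")
fig.circle(x=df["education"], y=df["income"], size=8, alpha=0.5, fill_color="navy")
output_file(filename="styled_plot.html")
show(fig)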
To the right of your plot, there are also some interaction buttons.
Here is what they do from top to bottom.
- Takes you to bokeh.org
- Pan the image
- Box zoom
- Wheel zoom
- Download image
- Resets image
- Takes you to the Bokeh documentation
There are other interactions possible but these are the default ones when you make a plot.
Conclusion
Bokeh is one of many tools used in Python for data visualization. It is a powerful option, particularly when interactivity would benefit the user, and its interactive tools can greatly enhance the user experience.
Mistakes in Evaluation Writing
Writing program evaluation reports is always a tricky task to accomplish. As a writer, you have to be concerned about the style of writing and the audience of the report, among other challenges. In addition, there are several common mistakes made when writing, as shown below.
- Small sample
- No comparison group
- Instrument use
- Sharing too little or too much
- Hasty generalization
Small Sample Sizes
The sample size is highly important, particularly in quantitative reports. If a sample is small, it will be difficult to make strong conclusions, and the findings will be considered questionable. Naturally, there is disagreement over what counts as an adequate sample size. However, an adequate size can be estimated mathematically through a power analysis, as sketched below. The general rule of thumb for statistical tests is a sample size of at least 30 observations.
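Below is a minimal sketch of such a power analysis. It assumes the statsmodels library (any power-analysis tool would work) and hypothetical inputs: a two-group comparison, a medium effect size, and conventional alpha and power levels.
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
n_per_group = TTestIndPower().solve_power(effect_size=0.5,  # medium effect
                                          alpha=0.05,       # significance level
                                          power=0.8)        # desired power
print(round(n_per_group))  # roughly 64 participants per group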
Even if the sample size starts adequate there is still the challenge of attrition. As time progresses, people will drop out of programs and this can make the data collected on them useless.
If the sample size drops below an acceptable level all is not lost. It is important to communicate the limitations of the report and not oversell the results due to the small sample size. If you know in advance that the sample size will be small, it may be more appropriate to focus more on a qualitative study rather than a quantitative one.
Lack of Comparison Group
A problem that is often associated with sample size is the lack of a comparison group. Quantitative research is about comparing different values to see if they are the same or different. If a program is implemented, there is no way to assess the quality of it unless it is compared to individuals who did not participate in the program. Without a comparison group, there is no way to interpret the program quality.
You can’t say a program is “good” or “bad” in a vacuum. Such a statement as this must be made in comparison to a situation that is similar or the same as the context of the program with the effect of the program. In other words, quality is generally a relative concept rather than an absolute one.
There is an argument that it is unethical to deny some individuals participation in a program for the sake of a comparison group. However, it can also be said that it is unethical to state that a program is good or bad without having a comparison group.
Instrument Use
There are two common mistakes with instruments.
- Lack of information on the instruments
- Mixing and matching survey items from different instruments
Sometimes people will use instruments without explaining anything about the instrument. In general, the writer of a report should provide enough information about an instrument that a reader knows the instrument is psychometrically appropriate. This can include sharing how many items are in the instrument, the reliability score, the purpose of the instrument (what it measures), and how the instrument was used in the current study. Providing this information helps to give context to the study and allows for its reproducibility.
A common problem, especially among people without a strong background in research, is mixing and matching items from various instruments. Sometimes people think that they can take two items from one instrument along with three items from another instrument and make a new instrument.
The problem with this mix-and-match approach is that instruments are tested and developed as a block of items. To add or subtract from this block would mean that the instrument is no longer measuring what it used to measure. This new instrument would have to be retested to make sure that it is reliable and valid. Therefore, whenever employing an instrument it must be unaltered to ensure that it is capturing the data that it was set out to collect.
Sharing too Little or too Much
When writing, the evaluator must find a balance between sharing too little and too much information. This is more of an art than a science but it is something that a writer needs to know.
Too little information would be to make statements and provide no supporting data for the statement. For example, “The scores were low here”. Such a statement needs actual numbers to support it.
Another mistake would be to share too much information. Using the same example, it would be too much to follow "the scores were low here" with every individual score of each participant. Quantitative research is focused on the aggregation of data and not individual scores.
How much information to share is also influenced by the nature of the report. Quantitative reports will have fewer words and more numbers that share broad conclusions. A qualitative report will be much more focused on individual stories and will not have the same broad conclusions.
Generalizing
The results of a study are limited to the context. To make broad sweeping statements from a limited context is to overgeneralize. For example, if a study is conducted using reading software among 35 fifth graders in rural Texas the results of this study only apply to a similar context. You cannot say that since the program was successful here it will be successful in a different context.
However, to be fair it is possible it just has not been proven yet. This is one reason why further study is always encouraged in academic writing. As the program is proven in different contexts, then there is evidence to make a strong general conclusion about the strength of the program.
Conclusion
There are other ways mistakes can be made in the writing process. The focus here was on common errors and mental miscalculations that obscure the hard work of evaluators. When writing, it is important to make sure that the conclusions drawn are accurate and supported by a rigorous methodology.
Treatment Fidelity
Whenever a program is implemented, there are always ways for things to go wrong. Treatment fidelity is a term used to describe the degree to which a program is implemented as intended in the grant proposal. Below is a list of common ways that treatment fidelity can become a problem.
- Adherence to implementation
- Implementation competence
- Variations in treatment
- Program drift
We will look at each of these below
Adherence to Implementation
Implementation adherence is whether the provider of the program follows the intended procedures. For example, suppose we have a reading lab program to boost students' reading comprehension. The procedures may be as follows.
- Fifth-grade students are to use the reading lab on Monday, Wednesday, and Friday for 30 minutes each. (Dosage)
- The students must be engaged actively in using the reading software
If the provider wanders from these procedures, it can quickly become an implementation issue. This is common. A teacher may take their kids on a field trip, there could be holidays, the teacher might do one hour one day and skip another day, etc. In other words, providers agree to a program but essentially do what they want when they deem it necessary. Every time these modifications happen, they impact the quality of the results, as factors are introduced into the study that were not originally planned for.
Implementation Competence
Implementation competence is defined as the provider’s ability to follow directions. If the procedures are too complicated the provider may not be able to follow them for the benefit of the students in the program.
An example would be a provider who is not comfortable with using computers and the reading software; they may not be able to help students who are having technical issues. If too many students are unable to use the computers because the provider or teacher cannot help them, this could lead to implementation competence concerns.
Difference in Treatment
Difference in treatment means that the treatment the participants in the program receive should not be the same as what participants who are not in the program receive. The treatments must be different so that comparisons can be made.
Sometimes when a new program is implemented, providers will want all students to experience it. In our reading lab example, the procedures might call for allowing only half of the fifth-graders who are below grade level in reading comprehension to participate. However, a teacher might decide to have all students participate in the reading lab because of the obvious benefits. If this happens, there is no way to compare the results of those who participate and those who do not.
Such well-meaning actions may benefit the students but damage the scientific process. It is always critical that there are differences in treatment so that it can be determined if the treatment makes a difference.
Program Drift
Program drift is the gradual weakening of the implementation of a program. People naturally lose discipline over time and this can apply to obeying the procedures of a program. For example, a provider might vigilantly follow the procedures of the reading lab in the beginning but may slowly allow more or less time for the students.
Program drift is hard to notice. One way to prevent it is to constantly re-train providers so that they are reminded about how to implement the program. Retraining is beneficial when providers want to implement the program correctly.
Conclusion
Treatment fidelity is critical to determine the quality and influence of a program. Evaluators need to be familiar with these common threats to fidelity so that they can provide the needed support to help providers.
UNION and INTERSECT in SQL VIDEO
John Holt and Youthism
In this post, we will look at John Holt and his views on education in the US.
Bio
John Holt was an early proponent of homeschooling in the US. What makes him unique is that Holt was a left-wing or progressive voice for homeschooling. Homeschooling has often been associated with conservatives and Christianity, but this was not the case with Holt. By most accounts, Holt was a devout atheist.
Holt viewed the traditional education experience of children as oppressive. The reason for this oppression was that students did not have control of their learning experience. For Holt, children should be able to choose what they study. Holt was also a major critic of factory-style education in the US, as he believed it stripped young people of their individuality.
Holt's views were not limited to education. He also supported other left-leaning causes involving feminism, environmentalism, and a guaranteed income for all. His motivation behind a guaranteed income was to liberate women and children from being dependent on men, that is, on the husband and father of the family. Holt is also considered the father of the Children's Rights Movement. In many ways, Holt had issues with traditional views of family.
Youthism
The Children's Rights Movement has many names, such as Youth Rights or Youthism. The main premise of this belief system is that adults discriminate against and oppress young people and children. This belief system is similar to other Communist/Critical Theory-inspired belief systems such as Critical Race Theory, Feminism, etc.
What all of these -isms or theories of oppression have in common is a power struggle between two groups. In Communism or Marxism, the bourgeoisie control the means of production and oppress the proletariat. The proletariat needs to rise up, rebel, overthrow the bourgeoisie, and seize the means of production.
In Critical Theory, the oppressors maintain the country’s current cultural structure (often portrayed as White, male, and Christian) and the various social institutions (school, church, etc.). People who are not producers of the current culture are oppressed and should rise up and overthrow those who control the production of culture.
Critical Race Theory states that the oppressors are White Americans and the oppressed are people of color. Whites control access to various things through their production of privilege or culture. People of color need to rise up, abolish the privilege of Whites, and destroy the ability of Whites to reproduce the current societal structure or have any form of privilege.
Feminism states that the oppressors are men and the oppressed are women. Men oppress women through the use of cultural and traditional beliefs and reproduce these beliefs through various social institutions. Women need to rise up, rebel, and stop the reproduction of traditional beliefs in society so that women can have emancipation from male leadership.
Queer studies state that the oppressors are people who are straight and the oppressed are people with alternative sexual identities and preferences. Heterosexuals control the means of reproducing heterosexuality through culture, families, and schools. Queer individuals need to rise up, overthrow heteronormativity, and liberate society from those false beliefs.
In Youthism the struggle is between children and adults. Adults oppress children and want to maintain their power and authority over them. Children, in turn, should rebel and seize their autonomy and rights from the adults. By leaving schools, children can seize some of the power and take control of their education. Below is a table that briefly summarizes what has been shared.
| Philosophy | Oppressor | Oppressed | Means of Production | Goal |
| --- | --- | --- | --- | --- |
| Communism | Bourgeoisie | Proletariat | Financial/factories | Revolution |
| Critical Race Theory | Majority race | Minority race | Culture, schools, family, religion | Revolution |
| Feminism | Men | Women | Culture, schools, family, religion | Revolution |
| Queer Studies | Heterosexuals | Alternative sexualities | Culture, schools, family, religion | Revolution |
| Youthism | Adults | Children | Culture, schools, family | Revolution |
The end game is the same: to overthrow the existing society from one angle or another. The reason for these various theories and belief systems is the same as why there are different flavors of ice cream, which is to attract the highest number of people possible. All of these various oppressed groups can agree on the need for change and can work together for this. In addition, these various movements create a multi-front assault on the existing society, which is much more difficult to defend against than a single enemy. Multiple groups of oppressed people also create the impression that something is seriously wrong with society when so many people are dissatisfied with it.
Holt's Beliefs
Returning to the focus on Holt, he also had some unusual beliefs about children's freedom. For example, he believed that a child should be able to drive whenever they possess the ability rather than at 16. He criticized how adults speak to children by calling them "cute" and patting them on the head. Holt also had issues with how adults are sometimes dismissive of the feelings and problems of children, which to him was a form of oppression.
Perhaps one of Holt's most shocking beliefs was in the sexual freedom of children. Essentially, he believed that children should make their own decisions about sexuality. It may be that Holt's views on this were inspired by Kinsey, whose research focused on providing evidence that this was a viable position for children.
Conclusion
John Holt was a trailblazing liberal in the world of homeschooling. He radically supported a conservative idea in his unique way. His influence on homeschooling is significant, whether or not people agree with him on a personal level.
Subqueries in SQL VIDEO
Windows Functions in SQL VIDEO
Cost-Effectiveness Analysis
The purpose of a cost-effectiveness analysis is to determine the relationship between the benefits and expenses of a program. Naturally, there are many different ways to do this, but there are some common steps for approaching it, as shown below.
- Define the program and outcome indicators
- Determine what you want to know
- Compute cost
- Determine the scope of program outcome data
- Compute outcome data
- Compute cost-effectiveness ratio
Define the Program and Outcome Indicators
Defining the program means knowing all the components and features of the program. For example, a reading lab program might have the following components.
- Online reading in a computer lab that develops reading comprehension and pronunciation skills
- Participation 30 minutes a day twice a week
- The program lasts one semester
- Participants are 30 fifth graders who are reading 2 levels below grade level
The example above is highly simplified but serves our purpose. Once the program is defined, it is necessary to determine the outcome indicators. Outcomes are measurable changes in behavior. For our reading lab example, below is the outcome we want to measure.
- Number of fifth-grade students who are reading at or above grade level at the end of the semester of reading lab participation.
With the information above we can move to step 2.
Developing Questions
Once you know what the program is about and the outcome you want to measure, it is time to shape questions for the study. This might seem like a wasted step because obviously we want to gather data about the outcome indicator. However, there might be more than one thing we want to know about the outcome. For example, we might want to know if there are differences by gender, race, or socioeconomic factors. Since we can nuance and complicate the study, it is important to state explicitly what we want to know. Below are the questions for our reading lab example.
- How many fifth-grade students reach grade level for reading comprehension through the use of the reading lab twice a week for 30 minutes?
- How many fifth-grade students are unable to reach grade level for reading comprehension through the use of the reading lab twice a week for 30 minutes?
- What is the cost per fifth-grade student for the use of the reading lab over the semester?
The next step is to determine the cost of the program.
Determine Cost
It is now time to find out how much money was spent. This is a straightforward process that includes calculating the expenses for personnel, facilities, equipment, and other items. For our reading lab example, the costs are simple to compute and are shown below.
- Personnel: The total cost is zero dollars because the teachers are already paid by the school and no additional staff was necessary
- Facilities: Again, the total cost was zero because an existing computer lab was used.
- Equipment: The expense for the license for the reading software is $30,000 for the length of the program
- Other expenses: Zero dollars for other expenses
For our example, only the $30,000 license fee is spent on this program.
Determine Scope of Data Collection
The amount of data to collect depends on the questions to answer and the maturity of the program. If the program has been around for several years, you have to decide if you want to collect data from all years or a subset. In our example, this is a new program, so we will take all data from the fifth graders who participated in the reading lab program.
Compute Outcome
Once the program has run its course, it is time to determine outcomes. For our example, after the reading lab program was completed, each student took a reading comprehension test to assess what grade level they were at. For our purposes, students at or above grade level are considered successful. Below are the results.
| Successful | Unsuccessful |
| --- | --- |
| 20 | 10 |
The information in the table above has already answered our first two questions for this study. We can now use this information to determine the cost-effectiveness ratio to answer the last question.
Cost-Effectiveness Ratio
The cost-effectiveness ratio can be calculated by dividing the cost of the program by the outcome. For our example, this means dividing $30,000 by 20 (the number of successes). The table below contains several important calculations.
| Reading Lab Program | |
| --- | --- |
| Program cost | $30,000 |
| Success rate | 20 / 30 * 100 ≈ 67% |
| Number of students at grade level | 20 |
| Cost per successful student | $30,000 / 20 = $1,500 |
The table above provides all the information we need to assess this program within the scope that we defined. Right now it is hard to tell if this program is good or not because there is no standard or other program to compare it to. In real evaluations, however, external standards or another program for comparison are often expected.
Sometimes an additional step is taken: a sensitivity analysis. A sensitivity analysis is especially important when there are a lot of estimations in the model. When it is necessary to estimate, it is important to adjust these values high and low to see how they affect outcomes, as sketched below. For our simple example, with no estimated values, this is not strictly applicable.
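Below is a minimal Python sketch of the arithmetic in the table above, with a purely illustrative sensitivity sweep added; the alternative license costs are hypothetical values, since our example has no real estimates to vary.
# Figures from the reading lab example
program_cost = 30_000
successes = 20
participants = 30

print(f"Success rate: {successes / participants:.0%}")                  # ~67%
print(f"Cost per successful student: ${program_cost / successes:,.0f}") # $1,500

# Illustrative sensitivity analysis: vary the estimated license cost
for cost in (25_000, 30_000, 35_000):
    print(f"${cost:,} license -> ${cost / successes:,.0f} per successful student")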
Conclusion
Cost-effectiveness analysis is an important tool for determining the value of a program. The goals of a program are normally to help people while keeping in mind cost and effectiveness. The analysis presented here allows an evaluator to assess programs so that services can be rendered efficiently.
Window Functions in SQL
SQL window functions perform calculations across rows and place the result in a new column for each row. They are similar to aggregate functions but differ in that a normal aggregate query, such as one using GROUP BY, collapses rows and provides a single summary number as the output. With a window function, the results are instead placed in a new column on every row.
Window functions allow users to perform many analytical tasks, from ranking and partitioning data to calculating moving averages, cumulative sums, and differences between consecutive rows. Again, what is important here is that window functions can be used to calculate values across rows.
Basic Example
In this first example, we will figure out how many customers we have from Texas and California and put this in a new column. Below is the code followed by the output.
SELECT first_name, last_name, state,
COUNT(*) OVER () as total_customers
FROM customers
WHERE state IN ('CA','TX')
ORDER BY customer_id;

In the SELECT statement, we pull first_name, last_name, and state, and we then have our window function. In this window function, we are counting the number of customers in the customers table who are from CA or TX. The OVER() clause is used to define the window. Since it is blank, it tells SQL that the entire table is the window. This may not make sense right now, but it is critical information for future examples. Lastly, we order the data by customer_id.
The output indicates that 9,903 customers are from CA or TX. We can confirm this by running a separate line of code.
SELECT COUNT(*)
FROM customers
WHERE state IN ('CA','TX')

The output from the window function is repeated in every row of the new column called total_customers. Repeating this information isn't particularly useful, but it shows us what the window function does. For example, in row 1, the function sees that this person is from TX and then outputs that the total number of customers from TX or CA is 9,903. This is a basic example of what window functions can do. Of course, much more insightful values can be calculated.
Intermediate Example
We are now going to run mostly the same code with one slight difference. We want to know not just how many total customers there are, but how many are from TX and how many are from CA. To do this, we will use PARTITION BY, which serves the same role for window functions that GROUP BY serves for aggregate queries.
SELECT customer_id, first_name, last_name, state,
COUNT(*) OVER (PARTITION BY state) as total_customers
FROM customers
WHERE state IN ('CA','TX')
ORDER BY customer_id;

In the output, we have all of the same rows from the previous SELECT clause, but the total_customers column is now different. When a person is from TX, this column shows how many people are from TX, and when a person is from CA, it shows how many people are from CA. If you add up the numbers of people from TX and CA, you get the following
4,865 + 5,038 = 9,903
This is the same amount as the total number of customers in our previous example. The PARTITION BY clause breaks the number of customers into two groups, those from TX and those from CA, and assigns the appropriate value based on where the customer in that row is from.
Advanced Examples
The next example involves the SUM aggregation function. We are going to sum customer_id in a new column. This will be a running total. In other words, SQL will keep adding each customer_id until it gets through all of the data. Below is the code followed by the output and explanation.
SELECT customer_id, title, first_name, last_name, state,
SUM(customer_id) OVER (ORDER BY customer_id) as total_customers
FROM customers
WHERE state IN ('CA','TX')
ORDER BY customer_id;

Here is what is happening. In the total_customers column, a running total of customer_id is being created. For example, row 1 has the value 10 because that is the first customer_id of someone from TX or CA. Row 2 has a customer_id of 13, and 10 + 13 = 23, which is the value in row 2 of total_customers. This continues for the rest of the table.
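The ORDER BY inside OVER() is what creates the running frame, and the same machinery supports the moving averages mentioned at the start of this post. Below is a self-contained sketch using Python's sqlite3 module (SQLite 3.25+ is needed for window functions); the tiny sales table is a hypothetical stand-in for the customers data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)])

# Three-day moving average: each row averages itself and the two rows before it
rows = con.execute("""
    SELECT day, amount,
           AVG(amount) OVER (ORDER BY day
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM sales
    ORDER BY day""").fetchall()
print(rows)  # e.g. (3, 30.0, 20.0) -- the average of days 1 through 3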
Here is another example, this time with the RANK() function. The RANK() function allows you to create a new column that ranks the data based on a criterion. For our example, we will rank the data by customer_id, with lower id numbers receiving higher rankings. To make this even more interesting, we will partition the data by state so that the lowest value customer_id will be number 1 for TX and the lowest value customer_id will also be number 1 for CA. Below is the code.
SELECT customer_id, title, first_name, last_name, state,
RANK() OVER (PARTITION BY state ORDER BY customer_id) as total_customers
FROM customers
WHERE state IN ('CA','TX')
ORDER BY customer_id;

Row number 1 is rank 1 because it is the lowest value customer_id of all TX. Row 2 is also ranked 1 because it is the lowest value customer_id of all CA.
Conclusion
The possibilities are almost endless with window functions. These tools allow you to get into the data and find insightful answers to complex questions. The examples here are only scratching the surface of what can be achieved.
UNION and INTERSECT in SQL
UNION and INTERSECT are two useful statements in SQL. In this post, we will define each statement, provide examples, and compare and contrast UNION and INTERSECT with the JOIN command.
UNION
The UNION statement is used to append rows together from different select statements. Below is the code followed by the output and explanation.
(
SELECT street_address, city, state, postal_code
FROM customers
WHERE street_address IS NOT NULL
)
UNION
(
SELECT street_address, city, state, postal_code
FROM dealerships
WHERE street_address IS NOT NULL
)

Notice how the select statements are both in their own set of parentheses with the UNION statement between them. In addition, you can see that both select statements use the same columns. Remember, we are trying to combine data from different places that share the same columns. If the select statements do not line up on compatible columns, the query will either fail or produce output that is hard to interpret.
Essentially, in the example above, we took the same columns from two different tables and created one table. UNION removes duplicates; if you want to keep duplicates, you must use UNION ALL, as sketched below.
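Below is a self-contained sketch of the duplicate-removal behavior using Python's sqlite3 module; the two tiny tables are hypothetical stand-ins for the customers and dealerships tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (city TEXT)")
con.execute("CREATE TABLE b (city TEXT)")
con.executemany("INSERT INTO a VALUES (?)", [("Austin",), ("Dallas",)])
con.executemany("INSERT INTO b VALUES (?)", [("Austin",), ("Houston",)])

# UNION removes the duplicate Austin row (row order may vary)
print(con.execute("SELECT city FROM a UNION SELECT city FROM b").fetchall())
# UNION ALL keeps the duplicate
print(con.execute("SELECT city FROM a UNION ALL SELECT city FROM b").fetchall())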
In contrast, joins are used to append columns together based on criteria. If we were to join two or more tables, the joined table would increase in its number of columns and possibly its rows. A UNION keeps a fixed set of columns while growing in the number of rows present in the output.
INTERSECT
INTERSECT finds common rows between select statements. It is highly similar to JOIN with the main difference being INTERSECT removes duplicates while JOIN will not. Below is an example.
SELECT state, postal_code FROM customers
INTERSECT
SELECT state, postal_code FROM dealerships;

The code is simple. You write your select statements and place INTERSECT between them. The results show us what data these two select statements share. When state and postal_code are the criteria for these two tables, only these three rows are in common. If we did a JOIN, we would get every instance of these three state and postal code combinations rather than just the unique ones.
Conclusion
UNION and INTERSECT have their place in the data analyst's toolbelt. UNION is for appending rows together. INTERSECT is for finding commonalities between different select statements. Both UNION and INTERSECT remove duplicates, which is not done when using a JOIN.
Subqueries in SQL
Subqueries allow you to use the tables produced by SELECT queries. One advantage of this is that, instead of referencing only existing tables in your database, you can pull from a table you create on the fly within your SQL query. This is best explained with an example.
WHERE Clause
I have two tables I want to pull data from called salespeople and dealerships. The titles of these two tables explain what they contain. One column these two tables have in common is dealership_id, which identifies the dealership in the dealerships table and where the salesperson works in the salespeople table.
Now the question we want to answer is "Which of my salespeople work in Texas?" This can only be determined by using the two tables together: the salespeople table does not have the state the person is in, while the dealerships table has the state but not the salesperson's information.
This problem can be solved with a subquery. Below is the code followed by the output and an explanation.
SELECT *
FROM salespeople
WHERE dealership_id IN (
SELECT dealership_id FROM dealerships
WHERE dealerships.state = 'TX'
)

The first two lines are standard SQL coding. In line three, we have our subquery in the WHERE clause. What is happening is that we are filtering the data to include only dealership_id values that, inside the parentheses, match dealerships in the state of Texas. Note that subqueries are always placed inside parentheses.
SELECT Clause
Using subqueries in the WHERE clause is most common, but we can use them in the SELECT clause as well. In our first example, we learned where the employees work, but let's say we want to know the city and not just the state. Below is a way to pull the city data for one dealership along with the salespeople data.
SELECT *,
(SELECT city
FROM dealerships d
WHERE d.dealership_id = 19) as City
FROM salespeople
WHERE salespeople.dealership_id =19;

In this example, we pull all columns from salespeople, pull only the city from the dealerships table, and filter for dealership_id 19.
FROM Clause
We can also place subqueries in a FROM clause. In the example below, I want the first and last name of the salesperson followed by the city and state they work in. The name info is on the salespeople table, and the city and state are on the dealerships table. Below is the code and the results.
SELECT *
FROM
(SELECT s.first_name, s.last_name,d.city,d.state
FROM salespeople s, dealerships d
WHERE s.dealership_id= d.dealership_id);

The code should be self-explanatory. Inside the parentheses, we are creating the table we want to pull data from. The subquery is essentially a join of the two tables based on the matching dealership_id. This brings up an important point: subqueries and joins (inner joins in particular) can serve the same purpose, as sketched below. Joins are generally better for large amounts of data, while subqueries are better for smaller amounts of data.
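Below is a self-contained sketch of this subquery/join equivalence using Python's sqlite3 module; the tiny tables and names are hypothetical stand-ins for the salespeople and dealerships tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dealerships (dealership_id INTEGER, state TEXT)")
con.execute("CREATE TABLE salespeople (name TEXT, dealership_id INTEGER)")
con.executemany("INSERT INTO dealerships VALUES (?, ?)", [(1, "TX"), (2, "CA")])
con.executemany("INSERT INTO salespeople VALUES (?, ?)",
                [("Ann", 1), ("Bob", 2), ("Cal", 1)])

subquery = """SELECT name FROM salespeople
              WHERE dealership_id IN (SELECT dealership_id FROM dealerships
                                      WHERE state = 'TX')"""
join = """SELECT s.name FROM salespeople s
          JOIN dealerships d ON s.dealership_id = d.dealership_id
          WHERE d.state = 'TX'"""
print(con.execute(subquery).fetchall())  # [('Ann',), ('Cal',)]
print(con.execute(join).fetchall())      # the inner join returns the same rows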
Conclusion
Subqueries are another tool that can be used to deal with data questions. They can be used in the WHERE, SELECT, or FROM clauses. When to use them is a matter of preference and efficiency.