Python Fraud Detection: Traditional Approach

Fraud detection today leverages complex algorithms and machine learning approaches. However, this was not always the case. In the past, fraud detection used simple yet highly efficient methods. In this post, we will look at a traditional method of fraud detection that involves setting threshold values for variables to flag a case as fraud or not.

Load Libraries

We will begin by loading our libraries and data. The data for this demonstration is not available on the web. Below, we load pandas, numpy, and matplotlib.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# data_loc holds the path to the local CSV file (not shown)
df = pd.read_csv(data_loc)

We will now look at the means of the individual variables.

Examine Means

To determine the cutoff values for our thresholds, we need to examine the mean of each variable for cases marked as fraud and for cases that are not. Below is the code and output for the means of each variable by class. After that, we will look at boxplots of the variables we will use.

df.groupby('Class').mean() #provides a general threshold for fraud
Out[2]: 
        Unnamed: 0        V1        V2        V3        V4        V5  \
Class                                                                  
0      143084.8702  0.035030  0.011553  0.037444 -0.045760 -0.013825   
1      121384.7000 -4.985211  3.321539 -7.293909  4.827952 -3.326587   

             V6        V7        V8        V9       V10       V11       V12  \
Class                                                                         
0     -0.030885  0.014315 -0.022432 -0.002227  0.001667 -0.004511  0.017434   
1     -1.591882 -5.776541  1.395058 -2.537728 -5.917934  4.020563 -7.032865   

            V13       V14       V15       V16       V17       V18       V19  \
Class                                                                         
0      0.004204  0.006542 -0.026640  0.001190  0.004481 -0.010892 -0.016554   
1     -0.104179 -7.100399 -0.120265 -4.658854 -7.589219 -2.650436  0.894255   

            V20       V21       V22       V23       V24       V25       V26  \
Class                                                                         
0     -0.002896 -0.010583 -0.010206 -0.003305 -0.000918 -0.002613 -0.004651   
1      0.194580  0.703182  0.069065 -0.088374 -0.029425 -0.073336 -0.023377   

            V27       V28      Amount  
Class                                  
0     -0.009584  0.002414   85.843714  
1      0.380072  0.009304  113.469000

Now, there are many different ways to explore the data to determine which variables to select and what to set the threshold values to. We can look at histograms, descriptive statistics, rely on domain knowledge, etc. For the sake of simplicity, we are selecting variables V1 and V3 for additional analysis. You can have more than two variables if you desire. Below are boxplots of V1 and V3.

#data to plot
V1=df[df['Class'] == 1]['V1']
V3=df[df['Class'] == 1]['V3']
plot_data=[V1,V3]
# Create a basic box plot
plt.boxplot(plot_data,tick_labels=["V1","V3"] )
plt.show()

Here is an explanation of the code.

We create two objects called V1 and V3. Both of these objects subset the data for Class when it equals 1 (which indicates fraud). The V1 object pulls the values of V1 when Class equals 1. The V3 does the same for the V3 variable. In other words, we now have all values of V1 and V3 when fraud is indicated. Next, we store our values in another object called plot_data. We then create our boxplot and label the x-axis.

The box plot for V1 indicates a median value of around -3, while the box plot for V3 indicates a median value of around -5. We will use these values as our thresholds.
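
Reading a median off a box plot is approximate, so it can help to compute it directly. The sketch below is only an illustration: the post's dataset is not public, so it builds a hypothetical stand-in frame called toy with made-up values, and the name fraud_medians is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: fraud cases are shifted toward negative values
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": np.concatenate([rng.normal(0, 1, 500), rng.normal(-5, 2, 20)]),
    "V3": np.concatenate([rng.normal(0, 1, 500), rng.normal(-7, 2, 20)]),
    "Class": [0] * 500 + [1] * 20,
})

# Median of each candidate variable among the fraud cases only
fraud_medians = toy.loc[toy["Class"] == 1, ["V1", "V3"]].median()
print(fraud_medians)
```

With the real data, the same two lines applied to df would give the exact medians that the box plots show visually.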

Confusion Matrix with Thresholds

We will now set our thresholds and create the confusion matrix. Below is the code and output.

df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3, df['V3']<-5), 1, 0)

print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))
Flagged Fraud     0   1
Actual Fraud           
0              4984  16
1                28  22

Here is an explanation of the code.

1. We create a new column called “flag_as_fraud”. This column uses a 1 when V1 < -3 and V3 < -5. All other instances are flagged as 0.
2. Next, we create our crosstabs comparing Class with flag_as_fraud. Here are the results.

4984 True negatives = It was flagged as not being fraud, and was not actual fraud
22 True positives = It was flagged as fraud, and it was actual fraud
28 False negatives = It was not flagged as fraud, but it was actual fraud
16 False positives = It was flagged as fraud, but it was not actual fraud

Now, whether these results are good or bad depends on the situation. Both false negatives and false positives are costly, and reducing one usually increases the other. If the context were credit card fraud, false negatives may be worse, as the criminal may get away with fraud. Another way to assess the values is through a classification report.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print('Classification report:\n', classification_report(df['Class'],df['flag_as_fraud']))
conf_mat = confusion_matrix(y_true=df['Class'], y_pred=df['flag_as_fraud'])
print('Confusion matrix:\n', conf_mat)

Classification report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00      5000
           1       0.58      0.44      0.50        50

    accuracy                           0.99      5050
   macro avg       0.79      0.72      0.75      5050
weighted avg       0.99      0.99      0.99      5050

Confusion matrix:
 [[4984   16]
 [  28   22]]

The output above provides numbers for us to assess. Precision is the share of cases flagged as fraud that were actual fraud (true positives divided by all flagged positives). Recall is the share of actual fraud cases that were flagged (true positives divided by all actual positives). The F1-score is the harmonic mean of precision and recall. Our model struggles with both precision and recall for the fraud class, so we need to modify our rules.
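
To make the fraud-class row of the classification report less of a black box, the figures can be recomputed by hand from the four counts in the confusion matrix. This is a small sketch using plain arithmetic; the variable names are my own.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 4984, 16, 28, 22

precision = tp / (tp + fp)   # share of flagged cases that were actual fraud
recall = tp / (tp + fn)      # share of actual fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.58 recall=0.44 f1=0.50 accuracy=0.99
```

These match the values for class 1 and the overall accuracy in the report.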

Multiple Rules

It is possible to have more than one rule. Below is an example.

df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3, df['V3']<-5),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V7']<-6),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V9']<0),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V10']<-4.5),1, 0)

In this code, we set the initial rule as done previously. Then, for the second rule, we use the result of the previous rule as the first comparison and place the new variable as the second comparison. We then repeat this as many times as necessary. Below is a verbal explanation of the code above.

Create “flag_as_fraud” where V1 < -3 and V3 < -5. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V7 < -6. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V9 < 0. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V10 < -4.5.
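
Because each step only keeps rows that already passed the earlier steps, the chain is equivalent to requiring every condition at once. The sketch below checks this on a hypothetical four-row frame named toy (the values are made up; the column names and cutoffs mirror the rules above).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the first row satisfies every rule
toy = pd.DataFrame({
    "V1":  [-4.0, -4.0, -1.0, -6.0],
    "V3":  [-6.0, -6.0, -8.0, -9.0],
    "V7":  [-7.0, -2.0, -7.0, -8.0],
    "V9":  [-1.0, -1.0, -1.0,  0.5],
    "V10": [-5.0, -5.0, -5.0, -5.0],
})

# Chained version, one rule at a time
chained = np.where(np.logical_and(toy["V1"] < -3, toy["V3"] < -5), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V7"] < -6), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V9"] < 0), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V10"] < -4.5), 1, 0)

# Single combined mask: every condition must hold at once
mask = (
    (toy["V1"] < -3) & (toy["V3"] < -5) & (toy["V7"] < -6)
    & (toy["V9"] < 0) & (toy["V10"] < -4.5)
)
combined = mask.astype(int).to_numpy()

print(chained, combined)
```

The chained form makes it easy to inspect the flag after each rule, while the combined mask is shorter once the rules are settled.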

Below is the classification report and confusion matrix.

print('Classification report:\n', classification_report(df['Class'],df['flag_as_fraud']))
conf_mat = confusion_matrix(y_true=df['Class'], y_pred=df['flag_as_fraud'])
print('Confusion matrix:\n', conf_mat)

Classification report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00      5000
           1       0.94      0.34      0.50        50

    accuracy                           0.99      5050
   macro avg       0.97      0.67      0.75      5050
weighted avg       0.99      0.99      0.99      5050

Confusion matrix:
 [[4999    1]
 [  33   17]]

Our precision is improved, which means we did excellent work reducing the number of false positives. However, our false negatives have increased, and our recall has decreased.

Conclusion

The traditional approach is excellent in many circumstances. This approach is easy to understand, which can relieve the anxiety of leaders who need to know what is going on in case there is a problem. Complex algorithms may yield better results, but it is not always clear how they work and what they are doing. With the traditional approach, this is not a problem. However, if high accuracy is needed, sometimes the traditional approach falls short. Which approach to use depends on the context and the needs of the stakeholders.

Scatterplot in Power BI

The video below explains how to create a simple scatterplot using Power BI.

Fraud Detection with Python: Sampling

In this post, we will explore how to approach resampling when implementing fraud detection with Python. When examining fraud detection, a significant imbalance often exists between negative and positive fraud cases. The problem with this is that a model can be highly accurate simply by predicting that no case is fraudulent, while catching no fraud at all. Therefore, we must consider how to address the low number of positive instances when conducting a fraud analysis.
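
The accuracy problem is easy to see with a few lines of arithmetic. The sketch below assumes the class balance of the dataset used in this post (5000 legitimate cases, 50 fraud cases) and a hypothetical "model" that never flags anything.

```python
import numpy as np

# Class balance used in this post: 5000 legitimate cases and 50 fraud cases
y_true = np.array([0] * 5000 + [1] * 50)

# A "model" that never flags fraud
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
fraud_caught = int(((y_true == 1) & (y_pred == 1)).sum())

print(f"accuracy={accuracy:.4f}, fraud cases caught={fraud_caught}")
# accuracy=0.9901, fraud cases caught=0
```

An accuracy of about 99% with zero fraud cases caught is why accuracy alone is a poor yardstick for imbalanced data.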

We are going to first look at the characteristics of the data as is, then we will use Python to balance our data and compare the original data with the modified data.

Data Preparation of Original Data

Below are the libraries we will use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

Pandas and numpy are for creating our dataset. Matplotlib is for data visualization, and the SMOTE class will be used to rebalance our data later on.

The data we will use is not available on the web. Therefore, the code below only shows the variable data_loc, which holds the file path on my computer.

df = pd.read_csv(data_loc)

We will now look at the data using the .info() method.

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5050 entries, 0 to 5049
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  5050 non-null   int64  
 1   V1          5050 non-null   float64
 2   V2          5050 non-null   float64
 3   V3          5050 non-null   float64
 4   V4          5050 non-null   float64
 5   V5          5050 non-null   float64
 6   V6          5050 non-null   float64
 7   V7          5050 non-null   float64
 8   V8          5050 non-null   float64
 9   V9          5050 non-null   float64
 10  V10         5050 non-null   float64
 11  V11         5050 non-null   float64
 12  V12         5050 non-null   float64
 13  V13         5050 non-null   float64
 14  V14         5050 non-null   float64
 15  V15         5050 non-null   float64
 16  V16         5050 non-null   float64
 17  V17         5050 non-null   float64
 18  V18         5050 non-null   float64
 19  V19         5050 non-null   float64
 20  V20         5050 non-null   float64
 21  V21         5050 non-null   float64
 22  V22         5050 non-null   float64
 23  V23         5050 non-null   float64
 24  V24         5050 non-null   float64
 25  V25         5050 non-null   float64
 26  V26         5050 non-null   float64
 27  V27         5050 non-null   float64
 28  V28         5050 non-null   float64
 29  Amount      5050 non-null   float64
 30  Class       5050 non-null   int64  
dtypes: float64(29), int64(2)
memory usage: 1.2 MB
None

For our purposes, there are 30 variables available from V1 to Class. We will now look at a breakdown of the Class variable, which tells us if there is fraud or not.

fraud_breakdown = df['Class'].value_counts()
print(fraud_breakdown)
Class
0    5000
1      50
Name: count, dtype: int64

By subsetting for the Class variable and using the .value_counts() method, we can see that there are 5050 rows of data, with 5000 not being fraud and 50 being fraud. This indicates that less than one percent of the cases are instances of fraud, and this is confirmed with the code below.

print(fraud_breakdown / len(df))
Class
0    0.990099
1    0.009901
Name: count, dtype: float64

Next, we will create a visualization of our data.

Data Visualization

In order to create the data visualization, we need to separate the X values, which are all the variables we are using that do not tell us if the case is fraud or not, from the y value, which is the Class variable. We also need to convert them to a numpy array. Below is the code to do this.

X = df.iloc[:, 1:30]    
X = np.array(X).astype('float')    
y = df.iloc[:, 30]    
y=np.array(y).astype('float')
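
Positional slicing with .iloc works, but the positions shift if a column is ever added or removed. As a hedged alternative, the same split can be written with column names. The frame toy below is a hypothetical miniature with the same layout as the post's data, and X_toy and y_toy are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same layout as the post's data
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "V1": [0.1, -4.2, 0.3],
    "V2": [0.0, 3.1, -0.2],
    "Amount": [10.0, 250.0, 32.5],
    "Class": [0, 1, 0],
})

# Drop the index-like column and the label by name instead of by position
X_toy = toy.drop(columns=["Unnamed: 0", "Class"]).to_numpy(dtype=float)
y_toy = toy["Class"].to_numpy(dtype=float)

print(X_toy.shape, y_toy.shape)  # (3, 3) (3,)
```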

Below is the code for the data scatterplot. The plot will be based on the first two variables of the dataset and colored by the Class variable.

plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

Here is a breakdown of the code.

1. We use plt.scatter. The first argument selects the rows of X where y = 0 and takes the first column. The second argument selects the same rows of X where y = 0 and takes the second column.

2. Next, we set the label, alpha, and linewidth.

3. We repeat this process, but this time we take the values when y = 1 instead of y= 0. We also set the color to red instead of the default blue.

4. Finally, we plot both scatter plots on the same plot with a legend indicating what the color means.

You can see the huge imbalance with just this visual. We will now look at how to correct this imbalance.

SMOTE

Resampling can be performed in several ways. Undersampling involves reducing the number of non-fraud cases to match the number of fraud cases. In other words, for our dataset of 5050 rows with 50 fraud cases, we would reduce this to perhaps 100 rows of data with 50 fraud cases. One problem with this is that you throw out a lot of data.

Another approach is oversampling, which involves duplicating your fraud cases until they make up half of your data. For example, since our dataset contains 5050 cases with 50 cases of fraud, we would duplicate our fraud cases until we had 5000 fraud cases for a total dataset size of 10,000. The problem here is that so much duplication can lead a model to overfit to the few fraud cases it has seen.
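
Both simple resampling strategies described so far can be sketched with numpy alone. This is only an illustration under stated assumptions: the features are random made-up numbers, the class balance copies the post's data, and the fraud rows are drawn at random with replacement rather than copied in a fixed order.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data with the same balance as the post's dataset
y = np.array([0] * 5000 + [1] * 50)
X = rng.normal(size=(len(y), 2))  # two made-up features

fraud_idx = np.flatnonzero(y == 1)
legit_idx = np.flatnonzero(y == 0)

# Undersampling: keep a random subset of legitimate rows, one per fraud row
keep_legit = rng.choice(legit_idx, size=len(fraud_idx), replace=False)
under_idx = np.concatenate([keep_legit, fraud_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Oversampling: draw fraud rows with replacement until the classes match
extra_fraud = rng.choice(fraud_idx, size=len(legit_idx), replace=True)
over_idx = np.concatenate([legit_idx, extra_fraud])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [50 50]
print(np.bincount(y_over))   # [5000 5000]
```

The undersampled set has only 100 rows, and the oversampled set contains each fraud row about 100 times on average, which shows the cost of each approach.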

SMOTE, or Synthetic Minority Over-sampling Technique, is a variation on oversampling. Instead of duplicating fraud cases, it creates new, synthetic ones by interpolating between an existing fraud case and its nearest fraud neighbors. This works if your fraud cases are similar to each other.
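
To give a feel for the nearest-neighbor idea, here is a hand-rolled sketch of how one synthetic case can be built. It is not the imblearn implementation; the four fraud points and the helper name synthetic_point are hypothetical, chosen only to show the interpolation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points (two made-up features)
fraud = np.array([[-4.0, 3.0], [-5.0, 3.5], [-4.5, 2.5], [-6.0, 4.0]])

def synthetic_point(points, i, k, rng):
    """Interpolate between points[i] and one of its k nearest neighbors."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbors)
    gap = rng.random()                       # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

# One synthetic case per original fraud case
new_cases = np.array(
    [synthetic_point(fraud, i, k=2, rng=rng) for i in range(len(fraud))]
)
print(new_cases)
```

Each new case lies on the line segment between two real fraud cases, which is why the approach works best when fraud cases resemble each other.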

Below, we will generate a dataset using SMOTE.

# Define the resampling method
method = SMOTE()

# Create the resampled feature set
X_resampled, y_resampled = method.fit_resample(X, y)

print(pd.Series(y).value_counts())
print(pd.Series(y_resampled).value_counts())
print(X.shape[0])
print(X_resampled.shape[0])

0.0    5000
1.0      50
Name: count, dtype: int64
0.0    5000
1.0    5000
Name: count, dtype: int64
5050
10000

The code involves creating an instance of SMOTE(). We then create our resampled X and y values using .fit_resample. Next, we print our results. The first output shows the original y values, with 5000 non-fraud cases and 50 cases of fraud. The next output shows the resampled y values, with 5000 cases in each class. Now the data is balanced. The last two outputs compare the original number of rows in X with the new number of rows after resampling.

Below is the code for the visualization. It is the same as the previous visual, just with the resampled data.

# Plot the resampled data
plt.scatter(X_resampled[y_resampled == 0, 0], 
            X_resampled[y_resampled == 0, 1], 
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X_resampled[y_resampled == 1, 0], 
            X_resampled[y_resampled == 1, 1], 
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

You can see the difference compared to the first plot. This data is much more balanced, which will help in the detection of fraud cases. How you address imbalances depends on the situation, so let’s not assume SMOTE is the best approach every single time.

Conclusion

Fraud detection is critical in different industries to prevent crime and abuse. Python can be used to support this process. Naturally, fraud is unusual when compared to legitimate transactions. This necessitates the use of various techniques to balance the data and ensure the accuracy of the model.

Modifying Data Tables in R VIDEO

In the video below, we look at how to modify data tables in R.

Theocratic Anabaptists and Utopia

In this post, we will look at another religious group that had Communist leanings before Communism was fully articulated. This group is familiar to many and is called the Anabaptists.

There are various sects within the Anabaptist movement. The voluntary sect is the one with which many are familiar today, and this sect includes the Amish and the Mennonites. The voluntary sect, as its name implies, supports people in making their own choices about religion.

The sect of this post is the Theocratic sect. The Theocratic sect believed in seizing power from the state and forcing people to become Anabaptists. Since they had a utopian focus, the Theocratics were also focused on compelling the world to uphold specific views of what heaven on earth would look like. The view they had in mind was of a communal society, where everything was shared.

Theocratics

The Theocratic sect was originally led by a man named Thomas Muntzer. Muntzer was a disciple of Martin Luther but would later become a convert to Taborism. The Taborites believed in destroying the non-elect and taking their property for the religio-state. Muntzer was convinced he was a prophet and called on the German princes to kill the godless. When he was ignored, Muntzer tried to lead this slaughter himself, and he was executed.

Eventually, the leadership of the Theocratic sect of the Anabaptists fell to Jan Mathys. As the leader, Mathys took the Theocratics to Munster, Germany. While trapped inside this city, Mathys was convinced that the rest of the world was doomed. Therefore, all property of non-Anabaptists inside the town of Munster was seized, and the non-believers were killed or expelled from the city.

Once the means of production and wealth were taken, Mathys began to implement the policies of his religious utopia. Money was outlawed, which forced everyone to be dependent on the government. Food was taken from homes and rationed (shared), with people being forced to eat in communal halls. It was also illegal to lock or even close your doors since everyone was family.

Siege of Munster

While all this was going on, Munster was under siege by the German princes, who did not appreciate Protestant sectarians seizing a town. During this siege, Mathys was killed, and a man named Bockelson took over. Bockelson added to the oppression by implementing polygamy. The women initially rebelled against this, but after many were executed, they quickly got used to the idea.

However, the ladies did not give up without a fight. Since the women could not stop themselves from being married off to someone, they could at least give their new husbands hell. When the men began to suffer from the relentless behavior of their wives, divorce was allowed. Marriage was essentially abolished, but people being people still found ways to enjoy the rapturous experience of marriage outside the boundaries of marriage. Therefore, in a matter of a few months, the Theocratics had gone from puritans to fornicators, with the family destroyed.

Bockelson would eventually proclaim himself king. However, his reign was short-lived: the German princes broke into the city, and Bockelson was killed.

Conclusion

The Theocratics were inspired to bring about the Millennium on earth through the use of force and the implementation of various ideas that would later be linked to Communism. They seized all property and money and controlled the means of production. The Theocratics were also against the family, removing marriage and allowing promiscuity. In addition, as things began to fall apart, a tyrant arose to control the people. This suggests that paradise is never truly reached. Still, people will often use the promise of a utopia to seize power for themselves, even if it was not their original intention.

Bar Graphs in Power BI

Data visualization plays an important part in explaining an analysis. In this post, we will examine how to make bar graphs using Power BI.

Modifying Data Tables in R

In this post, we will look at how you can modify data tables in R. Specifically, we will look at how to add columns, fix errors, and calculate values. Below is the initial code to prepare the data we will use.

library(data.table)
mtcars<-data.table(mtcars)

In the code above, we load the data.table package. We then convert the “mtcars” dataset to a data.table in the second line. Below is a look at all the columns and the first few rows of data.

head(mtcars)
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Adding a New Column

Adding a new column involves a simple process. In the code below, we call “mtcars” and inside the brackets, we place a comma first. This tells R that we want all rows of data.

After the comma, we create a name for our new column called “distance_travel.” After this, we place the := notation to indicate that we are calculating a value. After the :=, we write mpg * 4. This means multiply the mpg column by 4.

# Add a new column, travel distance
mtcars[, distance_travel := mpg*4]

Below is the output for the code above. To save space, we will not print every column. Instead, we will subset what we want as shown below.

mtcars[1:3,c(1,10:12)]
     mpg  gear  carb distance_travel
   <num> <num> <num>           <num>
1:  21.0     4     4            84.0
2:  21.0     4     4            84.0
3:  22.8     4     1            91.2

You can see the calculated value in the fourth column. This value is mpg multiplied by 4.

Fixing Errors

We can also fix errors for this example. Let’s assume that any value less than 21.5 in the mpg column is a mistake. We can replace those values with NA with the code below.

#fix errors change mpg less than 21.5 to NA
mtcars[mpg <21.5, mpg := NA]

In this code, you can see that we use brackets again and indicate that we are looking for all mpg values that are less than 21.5. These values will be rewritten with NA. Below is the output for select columns and rows only.

mtcars[1:3,1:3]
     mpg   cyl  disp
   <num> <num> <num>
1:    NA     6   160
2:    NA     6   160
3:  22.8     4   108

The NAs are where there used to be mpg values less than 21.5.

Adding Columns for Groups

Another value we may want to calculate is a count of rows by group. In the code below, we count the number of cars that have an automatic or manual transmission. To do this, we use brackets again. Inside the brackets, we place a comma. After the comma, we name our new variable “total_am” followed by the := notation. After the := notation, we place .N. The .N notation tells R to count all rows. In this case, we are counting all rows by the variable “am,” which records the transmission type (0 = automatic, 1 = manual). Below is the code and output.

# Add a new column with the number of cars in each transmission group
mtcars[, total_am := .N, by = am]

mtcars[1:5,c(1,9,13)]
     mpg    am total_am
   <num> <num>    <int>
1:    NA     1       13
2:    NA     1       13
3:  22.8     1       13
4:    NA     0       19
5:    NA     0       19

You can see that there are 13 cars with manual transmissions (am = 1) and 19 cars with automatic transmissions (am = 0). Each row carries the count for its own group: 13 if the car is a manual and 19 if it is an automatic. This calculation is similar to what a window function does in SQL.
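
For readers who follow the Python posts on this blog, a rough pandas analogue of this grouped calculation may be a useful comparison. It is only a sketch: the frame cars holds the first five rows of mtcars typed in by hand, so the counts differ from the full 32-row dataset.

```python
import pandas as pd

# First five rows of mtcars, typed in by hand for illustration
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
    "am":  [1, 1, 1, 0, 0],
})

# Count rows per am group and broadcast the count back to every row
cars["total_am"] = cars.groupby("am")["am"].transform("size")

# A group mean can be broadcast the same way
cars["mean_mpg"] = cars.groupby("cyl")["mpg"].transform("mean")

print(cars)
```

As with := and by in data.table, transform keeps one value per original row instead of collapsing each group to a single row.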

Calculate Values of Groups

It is also possible to calculate other values. In the code below, we calculate the average mpg by the number of cylinders a car has. The syntax for this code should be looking familiar by now. Notice how after the := notation, it is possible to use a function. In our case, we are using the mean() function.

# Calculate the mean mpg by cyl 
mtcars[, mean_mpg:=mean(mpg,na.rm=TRUE), 
            by = cyl]

mtcars[1:3,c(1,2,13)]
     mpg   cyl mean_mpg
   <num> <num>    <num>
1:  21.0     6 19.74286
2:  21.0     6 19.74286
3:  22.8     4 26.66364

The results are similar to the previous example, except this time we calculate the mean of mpg by cyl.

Using LHS := RHS Form

LHS := RHS Form is another way to indicate to R what you want to do. On the left-hand side, you create the names of your columns. After the := sign you place the functions you are using. Notice how the functions are wrapped inside parentheses with a period on the outside. Also note the comma at the beginning of the square brackets and in front of the “by” argument.

# Add columns using the LHS := RHS form
mtcars[, c("mean_mpg", "median_mpg") := .(mean(mpg), median(mpg)), 
        by = cyl]

In the code above, we calculate the mean and the median mpg by the number of cylinders. Below is the output.

mtcars[1:3,c(1,2,13,14)]
     mpg   cyl mean_mpg median_mpg
   <num> <num>    <num>      <num>
1:  21.0     6 19.74286       19.7
2:  21.0     6 19.74286       19.7
3:  22.8     4 26.66364       26.0

You can clearly see the two new columns that show the mean and median for mpg.

Functional Form

Functional form is another way to get the same results. In the code below, we are still using the square brackets. Inside the square brackets, we first have a comma. Next, we have our := symbol, but this time the := symbol is wrapped in grave accents (`). The grave accent key is to the left of the number 1 on a standard US keyboard and is shared with the tilde sign (~). After the := symbol, you create the column names, each followed by the function you are using, separated by commas. After all of this, you indicate the grouping using the “by” argument.

# Add columns using the functional form
mtcars[, `:=`(mean_mpg_func = mean(mpg), 
               median_mpg_func = median(mpg)), 
        by = cyl]

mtcars[1:3,c(1,2,15,16)]
     mpg   cyl mean_mpg_func median_mpg_func
   <num> <num>         <num>           <num>
1:  21.0     6      19.74286            19.7
2:  21.0     6      19.74286            19.7
3:  22.8     4      26.66364            26.0

The results speak for themselves.

Functional Form with Complex Grouping

So far, we have been grouping with only one column in the “by” argument. However, it is possible to have more than one column in the “by” argument while also having another filter in place. In the code below, we filter for mpg greater than 21 while also grouping by the number of cylinders, and whether the car has an automatic transmission or not.

# Add the ave_mpg column for cars with mpg > 21, grouped by cyl and am
mtcars[mpg>21,ave_mpg :=mean(mpg),
        by = .(cyl,am)]

mtcars[1:3,c(1,2,9,17)]
     mpg   cyl    am ave_mpg
   <num> <num> <num>   <num>
1:  21.0     6     1      NA
2:  21.0     6     1      NA
3:  22.8     4     1  28.075

The NA appears because the first two rows have an mpg of exactly 21.0, which does not pass the mpg > 21 filter, so no value is assigned to them. In fact, no 6-cylinder car with am = 1 has an mpg greater than 21.

Conclusion

Data.table is just another way to manipulate data inside R. Generally, it is considered faster when dealing with large datasets. The purpose here was only to explore the potential of this package if it is needed.

Proto-Communist Religious Groups

The ideas of apocalyptic thinking and Communism are often associated with each other since Communism is viewed as the utopian future of the world. Joachim of Fiore was a medieval monk who made significant contributions to apocalyptic thinking associated with Christianity and the Book of Revelation. Fiore was not directly linked with Communist thought, but his ideas were mixed with proto-Communism by religious leaders who came after him.

Fiore proposed the idea of the Three Ages. For Fiore, history was divided into three parts: The Age of the Father, the Age of the Son, and the Age of the Holy Spirit. The Age of the Father is associated with the Old Testament and was a time of focusing on obedience, running from creation until the birth of Christ. The Age of the Son is linked with the New Testament, which was the time from the birth of Christ until the 13th century. Lastly, the Age of the Holy Spirit began in the 13th century and is a time of universal love that would transcend the letter of the law. During this final age, man’s material body would disappear.

Splinter Groups

Fiore would inspire many apocalyptic Christian groups. The Age of the Holy Spirit in particular provided ideas for several groups that would expand on Fiore’s ideas. For example, the Amalricians, an early 13th-century group, believed each of the three stages of Fiore was an incarnation. Incarnation is the belief that Christ came in the flesh. The Amalricians believed they were the incarnation of the Holy Spirit. This idea implies the Amalricians were claiming to be gods and showed signs of pantheism.

Brethren of the Free Spirit

The Brethren of the Free Spirit, a group that began in the late 13th century, believed that the “Elect” would not die and would be gods on earth. Since there is no death, it implies there is no law, which makes the Brethren supporters of antinomianism (against the law). The Brethren were also supporters of taking the property of the non-elect. Seizing property is a key component of Communism.

Taborites

The Taborites emerged during the 15th century, originating from the Hussites. Their beliefs were based on the Brethren of the Free Spirit, but they believed not just in taking the property of the non-elect but in violently destroying them. This is similar to various communist purges that have taken place throughout history. The Taborites did not believe in private property, believing that all things should be held in common. Strangely enough, the idea of no personal property included sexual relationships with women, which meant people were free to sleep with whoever they pleased. Marx was a married man, but he was also critical of marriage and the family, viewing these institutions as tools that supported bourgeois society.

Adamites

The Brethren of the Free Spirit also inspired the Adamites. They not only believed they were living gods but also believed they were superior to Christ. Their thought process was that since Christ died while they lived, this made them superior.

As with the Taborites, the Adamites shared all goods in common while having conflicting views on chastity. There was no marriage, and people in theory could sleep with whoever they wanted. In practice, sex was restricted because everyone had to get permission from the leader to sleep with each other.

Another unusual practice of this group was walking around naked. The Adamites believed that walking around naked, as Adam and Eve did, was important to reflect the perfect love of the original couple. However, walking around naked did not discourage the belief in destroying the non-Elect. The Adamites were eventually destroyed due in part to their heretical beliefs.

Conclusion

The motivation behind each of these groups was the belief that stripping people of autonomy and sacrificing individual desires for the group would lead to a heaven on earth. Of course, autonomy and personal desire are what fuel progress. Therefore, by removing them, you bring a form of peace without the necessary motivation to maintain the utopia. Individualism is a two-edged sword that brings the benefits of ambition with the downside of selfishness and oppression.

Furthermore, one thing Christians and Communists have in common is a desire for a better world here. The difference is in whether or not freedom will be a part of this new world.

Reabsorption Theology

Reabsorption Theology is not a religious term. Rather, this term was developed by Leszek Kołakowski to explain the problem that Communism and other ideologies attempt to address regarding human behavior and the apparent separation between humans and God.

Definition

Reabsorption theology posits that the end of humanity will culminate in its reabsorption into the essence or nature of divinity. In other words, man must return to God. Another key term is alienation. Alienation, as defined in Communism, is separation not only from one’s work and fellow man but also from God. Therefore, the ultimate purpose of Communism is to end alienation, uniting man with man and humanity with God. As such, Communism is much bigger than just the emancipation of the proletariat.

Kołakowski states that God created the world and separated it from himself because he was lonely. He continues by stating that there are three stages to existence. The first is pre-creation, when God is alone. The second is post-creation, when there is a separation between God and the rest of reality. Finally, there is a reunion when man is reabsorbed into God.

The Problem

The problem with creation is that once it was separated from God, it became evil. The logic behind this thought is that if God is good, being separate from him is bad or evil. As such, as soon as man was separated from God, man was evil. The idea of an inherently evil creation is in stark contrast to Christianity, which states explicitly that creation was good. With this assumption of an evil creation, Communism is seen as the solution to this corruption of man, and it will be the process used to reunite man with God.

The idea that man must reunite with God is not unique to Communism and is found in such religions as Hinduism and faintly in Buddhism. The first man emerged from Brahma in Hinduism. The motivation of Brahma to create humans was due to loneliness, but there are various interpretations of this.

Buddhism skips explaining creation and focuses on the endless cycle of life, birth, and death. The purpose of this process is almost a form of purification. As an individual lives over and over, hopefully they eventually awaken (reach enlightenment) and achieve Nirvana, which is challenging to define but involves the extinction of desire and perhaps removal from this plane of existence.

Christianity does not suggest that man will be reabsorbed, but it does state that the relationship between man and God will be reestablished like a husband and wife reconciling after a severe conflict. In Christianity, the problem is not that man is separate from God but that the relationship between man and God has been strained by sin. By reestablishing this broken relationship, man is united with God in the way that a family is united: it is one family, but its individual members have personal autonomy.

The Solution

The solution of unification, as determined by Communism, is for all of mankind and not for the individual. What this means is that people on the individual level do not have a choice in this process. True “freedom” can only happen through the submergence of self into the state and the removal of diversity. Everyone must conform, or nobody gets the reward. Therefore, the elimination of non-conformers is a necessary sacrifice for the greater good. This has led to the murder of millions in various iterations of Communism.

Individuality is the origin and source of greed and strife because people are thinking of themselves over the group. Communism will always have issues with individualism, as individualistic people are materialistic in their eyes. In other words, individuality leads to greed and strife as these behaviors contribute to alienation and separation from God. By destroying individuality, the fruit of this behavior is also destroyed, and reabsorption can transpire.

The idea that man and God need to reunite suggests that man and God are essentially equal and that neither is perfect without the other. This idea is not generally associated with mainline Christian thought, which views God as self-sufficient with a desire to save fallen man if they are willing to accept his help.

Conclusion

Reabsorption theology is an interesting idea that attempts to explain the motivations of Communism. However, it is an outsider’s perspective on the motivations of people who hold a particular ideology. Kołakowski was anti-communist, and it would be hard for his opinion to be unbiased. Despite this, his ideas concerning reabsorption provide an interesting insight into understanding Communism.

Column Computation with Data Table in R VIDEO

In the video below, we will look at how to perform various column-wise computations with data tables in R.

Data Table Basics in R VIDEO

Data tables provide an efficient way to work with and manipulate data. In the video below, examples are provided of the strengths and benefits of using data tables.

Traits of Communism

In this post, we will look at some common characteristics of Communism. Naturally, this is not an exhaustive list; however, it does provide a basic introduction to these commonly held traits.

Restrictions on Property

One of the most common tenets of Communism is restrictions on property. Commonly, this has been interpreted as no private property. Several attempts at Communism have removed all private property rights, such as in the Soviet Union. Marx did dislike private property, but he truly hated individual ownership of the means of production. Anything that could produce wealth should be owned by the people, in Marx’s opinion.

Therefore, and much to many people’s surprise, Marx may not have had issues with people owning homes, computers, phones, or cars, but he would challenge a person’s right to own farmland, factories, or businesses. Consumption was fine as long as production was controlled centrally.

Loss of Individualism

Many interpretations of Communism involve the sacrifice of the individual for the collective. Individualism is often seen as a threat because, to have a communist society, everyone must go along with it. In other words, for true Communism to arise, everyone must support it so that the state withers away. Particularly for Communists who subscribe to and yearn for a utopia, this heaven on earth cannot transpire until dissent is removed.

This desire for a man-made, secular heaven explains in part the tremendous amount of persecution and death that is associated with Communism. Unlike capitalism, which may abuse power to make more money, Communism will abuse power to bring about a new earth in which there is no more strife. In other words, the sacrifice of the few to save the many.

Examples of the destruction of countless lives in the pursuit of Communism can be found in the millions who died in the Soviet Union, China, and Cambodia. Dissenters and even apparent dissenters were systematically destroyed or “reeducated.” All this was done in the name of the people to bring about a better world.

Upheaval of Social Order

Communism brings about a total upheaval of the social order. Marx makes it clear that the working class, or proletariat, needs to rise and overthrow the bourgeoisie. Later, Communist thinkers such as Marcuse included minorities (whether sexual, racial, gender, etc.) as part of the revolution. Eventually, everyone is included as oppressed, thanks to the splintering of people into oppressed groups that encompass anyone who is not part of the normalized society.

The destruction of the current oppressors creates a vacuum that the rising Communist leaders fill. Essentially, Communism throws out one corrupt government to bring in another. The new leaders claim to be for the people, but eventually become accustomed to doing whatever it takes to maintain their power. An example would be what has happened in Cuba, China, and North Korea over the past 80 years. Each of these governments used Communism to take power and has used it to maintain power.

Religious Undertones

Even though Marx despised religion, Communism is often treated as a religion. Some adherents of Communism truly believe that implementing this belief system will lead to peace and prosperity on Earth, much as Christians believe in heaven.

Even with all the evidence that Communism does not work, believers fight to preserve the idea of Communism. A common counterargument is that Communism has never been implemented properly or that the famous leaders of Communism misunderstood it.

For example, the focus of Communism was primarily economic, with an emphasis on the means of production. However, as the middle class rose and became content, many communist thought leaders moved from attacking the means of production to critiquing the cultural reproduction of society. This is why there is so much criticism of Judeo-Christian-Heterosexual-White norms in the West today. Pulling down these norms today is the equivalent of seizing the factories of the bourgeoisie in the 19th century.

Communism seeks to displace other religious systems to generate a religion in which man is God rather than the gods of various religions. Marx viewed religion as a tool that kept people asleep and ignorant of their condition. Many have interpreted this as a need to destroy religion so that the masses are awakened or “woke.” Evangelism is performed with protesting in the streets or the barrel of a gun rather than with the persuasion of the Bible.

Conclusion

Communism is a complex ideology that has had a major influence on the world. For better or for worse, people believe that the ideas of Communism will make the world a better place. As such, there have been attempts to realize the ideas of the philosophy with mixed results. Regardless of the implementation, the traits described here are generally present when Communists take power.

Data Manipulation with Data.Table in R

In this post, we will go over more examples of how to manipulate data with data.table in R. We will begin by loading the needed packages and preparing our data.

Packages and Data Preparation

In the code below, we load our library data.table. Next, we prepare our data set mtcars and convert it into a data.table of the same name. Note that mtcars is preloaded within R.

library(data.table)
mtcars<-data.table(mtcars)

Below is a preview of the mtcars dataset.

> head(mtcars)
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Selecting Columns

Below is an example of how to select columns. You can do this by using brackets and placing the columns you want inside the c function. Remember to place a comma in front of the c function, as this indicates to take all rows of data, while the information after the comma indicates which columns to take.

> # Select mpg and cyl using a character vector
> mtcars_select <- mtcars[,c("mpg","cyl")]
> head(mtcars_select)
   mpg cyl
1 21.0   6
2 21.0   6
3 22.8   4
4 21.4   6
5 18.7   8
6 18.1   6

By indicating which columns we wanted, we were able to pull only what we wanted. If you want to leave out columns, you just need to place a minus sign in front of the c function as shown below.

> # Deselect mpg and cyl columns
> mtcars_drop <- mtcars[,-c("mpg","cyl")]
> head(mtcars_drop)
    disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:   160   110  3.90 2.620 16.46     0     1     4     4
2:   160   110  3.90 2.875 17.02     0     1     4     4
3:   108    93  3.85 2.320 18.61     1     1     4     1
4:   258   110  3.08 3.215 19.44     1     0     3     1
5:   360   175  3.15 3.440 17.02     0     0     3     2
6:   225   105  2.76 3.460 20.22     1     0     3     1

In the example above, the columns left out are mpg and cyl, as we indicated. Next, we will look at performing calculations.

Performing Calculations

It is also possible to perform specific calculations. In the example below, we calculate the median mpg of all cars in the dataset.

> # Calculate median mpg using the j argument
> median_mpg <- mtcars[,median(mpg)]
> median_mpg
[1] 19.2

As you can see, to perform a calculation, you must place the function inside the brackets and after the comma. The column you want to perform the calculation on goes inside the function, as usual.

It is also possible to give names to your output. In the example below, we give the output of our calculation the name “mean_mpg”. Notice also the period right in front of the parentheses; in data.table, .() is shorthand for list() and is needed when performing this type of calculation.

> # Calculate the average mpg as mean_mpg 
> mean_mpg <- mtcars[,.(mean_mpg=mean(mpg))]
> mean_mpg
   mean_mpg
       <num>
1:  20.09062

In our example above, we can see that the average mpg of all the cars in our dataset is 20.09.

Multiple Calculations

By employing the same dot notation, it is possible to perform multiple calculations at once. In the example below, we find the minimum and maximum values of mpg for all cars.

> # Get the min and max mpg values
> min_max_mpg <- mtcars[, .(min(mpg),max(mpg))]
> min_max_mpg
      V1    V2
   <num> <num>
1:  10.4  33.9

There is nothing unique here except for the inclusion of a second function. Notice how each function is separated by a comma.

Just as before, you can also name each output from your results. Below is the mean weight and the max hp from the dataset.

> # Calculate the average wt and the max hp
> other_stats <- mtcars[, .(mean_wt=mean(wt),max_hp=max(hp))]
> other_stats
   mean_wt max_hp
     <num>  <num>
1: 3.21725    335

Filtering and Calculations

So far, we have not made any adjustments to the input before the comma when performing calculations. In the example below, we are filtering for cars with 6 cylinders and hp that is less than 120. Once this is filtered, we then want to calculate the minimum and maximum mpg.

> #filter for two or more variables then statistics
> mpg_stats <- mtcars[cyl==6 & hp<120, .(min_dur=min(mpg), 
+                             max_dur=max(mpg))]
> mpg_stats
   min_dur max_dur
     <num>   <num>
1:    18.1    21.4

The output speaks for itself. Normally, when subsetting data, the information before the comma indicates the rows. However, when performing calculations, the information before the comma can be used to filter the data as appropriate.

In the example below, we make a histogram based on the same filtering criteria.

mtcars[cyl==6 & hp<120, 
                    hist(mpg)]

As you can see, the uses of data.table are almost endless.

Conclusion

The data.table library provides you with several beneficial tools for conveniently slicing data. Data analysts can use these tools as needed to provide insights for their audience.

Column Computation with Data Table in R

The data table data structure is a great way to manipulate your data to address various questions you may have. In this post, we will learn about filtering, dealing with text, and more complex numerical calculations.

Packages and Data Preparation

We will begin by loading our package data.table and converting our datasets, mtcars and iris, into data tables. Both mtcars and iris come preloaded with R. Below is the code.

library(data.table)
mtcars<-data.table(mtcars)
iris<-data.table(iris)

Next, we will quickly examine both datasets using the head() function to understand what each one is about.

We now move to filtering.

Filtering for Not

Our first exercise is the use of NOT logic in filtering. With NOT logic, you filter for everything that does not match the value in your code. For example, in the code below, we are telling R to display all cars whose am value is not 0. In mtcars, am is coded 0 for automatic and 1 for manual, so this returns the cars with a manual transmission. The operator for NOT is !=, which means “does not equal”. Below is the code and example.

> # Filter all rows where am is not 0
> not_0_am <- mtcars[am !=0]
> not_0_am
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 5:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 6:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 7:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 8:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 9:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
10:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
11:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
12:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
13:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Of course, you can have more than one argument within your code, as we will see in the next example.

Multiple Commands for Not

It is also possible to combine multiple conditions. In the example below, we are filtering for cars that have a manual transmission (am==1) but do not have 6 cylinders (cyl != 6). The output matches the criteria that were set.

> # Filter all rows where am is 1 AND cyl is not 6
> am_cyl <- mtcars[am==1 & cyl != 6]
> am_cyl
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 2:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 3:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 4:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 5:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 6:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 7:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
 8:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
 9:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
10:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Searching Text

It is also possible to search for text and numbers together. In the code below, we are searching the iris dataset for the species “setosa” and for petal lengths that are less than 1.3.

> #with text
> iris[Species=="setosa" & Petal.Length<1.3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          4.3         3.0          1.1         0.1  setosa
2:          5.8         4.0          1.2         0.2  setosa
3:          4.6         3.6          1.0         0.2  setosa
4:          5.0         3.2          1.2         0.2  setosa

We can also search for text when unsure what we are looking for. In the example below, we use the %like% operator to search the Species column for text containing the letter v. Since the results are rather long, we use the head() function to see the first few rows.

> # Filter all rows where Species contains "v"
> any_v <- iris[Species %like% "v"]
> head(any_v)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          7.0         3.2          4.7         1.4 versicolor
2:          6.4         3.2          4.5         1.5 versicolor
3:          6.9         3.1          4.9         1.5 versicolor
4:          5.5         2.3          4.0         1.3 versicolor
5:          6.5         2.8          4.6         1.5 versicolor
6:          5.7         2.8          4.5         1.3 versicolor

Another way to search text is by looking for words that end with something. In the example below, we are looking for words in the Species column that end with “color.” We indicate this to R by using the %like% operator again and the word “color” with a dollar sign at the end of it. The dollar sign tells R to look for this text at the end of a word in the Species column.

> # Filter all rows where Species ends with "color"
> end_flowers <- iris[Species %like% "color$"]
> head(end_flowers)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          7.0         3.2          4.7         1.4 versicolor
2:          6.4         3.2          4.5         1.5 versicolor
3:          6.9         3.1          4.9         1.5 versicolor
4:          5.5         2.3          4.0         1.3 versicolor
5:          6.5         2.8          4.6         1.5 versicolor
6:          5.7         2.8          4.5         1.3 versicolor

Multiple Numerical Arguments

Multiple numerical arguments are also possible. In the example shown below, we are looking for all cars in the mtcars dataset that are 4 or 6 cylinders. We achieve this by listing the variable we are searching “cyl” followed by the %in% argument, and lastly we use the c() function and include our values inside it. Below is the code and output.

> # Filter all rows where cyl is 4 or 6
> filter_cyl <- mtcars[cyl %in% c(4, 6)]
> filter_cyl
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 6:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 7:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
 8:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
 9:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4
10:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
11:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
12:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
13:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
14:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
15:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
16:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
17:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
18:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

In this last example, we learn to find data that falls within a range rather than matching specific values. In the code below, we are looking for cars that have an mpg between 20 and 22. The new piece in this example is the %between% operator, which tells R to search for a range of values. Below is the code, followed by the output.

> # Filter all rows where mpg is between [20, 22]
> mpg_20_22 <- mtcars[mpg %between% c(20,22)]
> mpg_20_22
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1
5:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Conclusion

Data tables provide a different way of pulling insights from data. The value of this approach becomes clearer when dealing with large datasets in which speed becomes important.

Comparing Data with Python VIDEO

The comparison of data can be useful to determine if it is necessary to use additional statistical tests to confirm a significant difference. In the video below, we look at several simple ways to compare data using Python.

Annotating Visualizations in Python

Annotating data allows you to communicate vital information in a visualization for an audience. In the example below, we will look at how to annotate a visualization while using Python.

Libraries and Data Preparation

We will begin by loading the needed libraries and preparing the data. In the code below, lines 1 and 3 load our visualization libraries. Line 2 loads the function we will need to load our data.

import seaborn as sns
from pydataset import data
import matplotlib.pyplot as plt

In the code below, we use the data() function to load the Prestige data from pydataset into an object called df. Then, we display the head of this data using the .head() method.

df=data('Prestige')
df.head()

Our dataset contains various jobs measured on five dimensions. In our code below, we will focus on using the education, income, and prestige variables.

Making a Comment

Now we will add a comment to our visualization. Specifically, we will point out the highest income value. Below is the code, followed by the visualization.

# Draw basic scatter plot of education data and income 
sns.scatterplot(x = 'education', y = 'income', data = df)

# Label highest income value with text annotation
plt.text(6, 25000,
         'The max income is over 25000', 
         # Set the font to large
         fontdict = {'ha': 'left', 'size': 'large'})
plt.show()

The first step was to make a basic scatterplot. We use the scatterplot() function from seaborn and plot education and income from our df dataset. Next, we set up our text using the text() function from matplotlib. For the text, we set an x and y value and then indicate what the text should say. Below that, we left-align the text and set the font size to large.
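The x and y values passed to the text call above are typed in by hand. As a rough sketch, the position could instead be derived from the data so the label follows the point it describes. The example below is only an illustration: it uses plain matplotlib with a small set of made-up education and income values (hypothetical, not the Prestige data), and it selects the Agg backend, an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values for illustration only (not taken from the Prestige data)
education = [8.5, 10.2, 12.3, 14.1, 15.6]
income = [4200, 6100, 25000, 9800, 12500]

# Find the position of the highest income in the list
max_idx = max(range(len(income)), key=income.__getitem__)
max_x, max_y = education[max_idx], income[max_idx]

fig, ax = plt.subplots()
ax.scatter(education, income)

# Place the label just to the right of the highest point
label = ax.text(max_x + 0.2, max_y,
                f"The max income is {max_y}",
                ha="left", va="center", size="large")
fig.canvas.draw()  # render the figure in memory
```

Because the coordinates come from the data, the label stays next to the highest point even if the values change.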

Arrow Annotation

Using an arrow is another way to bring attention to data in a visualization. In the code below, we will use an arrow that will point to the same data point that we used in the previous example. Below is the code, followed by the visualization.

# Query and filter to General Managers
women_census = df.query("(women  ==  4.02) & (census  ==  1130)")
prestige_type = df.query("(prestige  ==  69.1) & (type  ==  'prof')")

sns.scatterplot(x = 'education', y = 'income',
                data = df)

# Point arrow to General Managers 
plt.annotate('General Managers',
             # Use .iloc[0] to pass plain numbers rather than one-row Series
             xy = (women_census.education.iloc[0], prestige_type.income.iloc[0]),
             xytext = (6.5, 15000), 
             # Shrink the arrow to avoid occlusion
             arrowprops = {'facecolor':'gray', 'width': 3, 'shrink': 0.03},
             backgroundcolor = 'white')
plt.show()

Here is what we did:

1. We create two queries to locate the data point we want the arrow to point to. All the values in the .query() method for both women_census and prestige_type come from the general managers row, so these two objects are used to locate general managers in the dataset.

2. We make the same scatterplot as shown before.

3. The .annotate() method is used. We start by writing in quotes what we want to appear in the scatterplot. Next, we set the x and y coordinates of the data point we want to highlight using the women_census and prestige_type queries we did previously. From there, we have to set the text location. After this, we set the arrow properties: the color, the width, and how much to shrink the arrow. Lastly, the background color is set.
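The two queries above select the same row, so a single lookup is enough to get both coordinates. The sketch below illustrates that idea under stated assumptions: it uses plain matplotlib, a small dictionary of hypothetical jobs (not the Prestige data), and the Agg backend so it runs without a display. The names jobs, target, and arrow_note are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

# Hypothetical jobs for illustration: name -> (education, income)
jobs = {
    "clerks": (10.0, 5000),
    "teachers": (14.0, 8000),
    "managers": (12.0, 25000),
}

# One lookup gives both coordinates of the point to highlight
target = "managers"
target_x, target_y = jobs[target]

fig, ax = plt.subplots()
ax.scatter([edu for edu, _ in jobs.values()],
           [inc for _, inc in jobs.values()])

# xy is the point the arrow points to; xytext is where the label sits
arrow_note = ax.annotate(
    target,
    xy=(target_x, target_y),
    xytext=(10.5, 15000),
    arrowprops={"facecolor": "gray", "width": 3, "shrink": 0.03},
    backgroundcolor="white",
)
fig.canvas.draw()  # render the figure in memory
```

Passing plain numbers to xy keeps the call independent of how the row was found.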

Annotation with Color & Text

Color annotation provides a contrast based on color. Below is the code and the visualization when this approach is used.

# Make a list where prof is orangered; else lightgray
prof = ['orangered' if job_type == 'prof' else 'lightgray' for job_type in df['type']]

# Map facecolors to the list prof and set alpha to 0.3
sns.regplot(x = 'education',
            y = 'income',
            data = df,
            fit_reg = False,
            scatter_kws = {'facecolors':prof, 'alpha': 0.3})

# Add annotation to plot
plt.text(11, 23000, 'General Managers')
plt.show()

This approach is simpler than the last one. We begin by separating the data based on type. Professionals are colored orange-red, and the rest are colored light gray. Next, we create our scatterplot, this time using the regplot() function. Education and income are the x and y axes, the regression line is removed, and the color of the dots is set using the scatter_kws argument. The prof list provides the colors, and the alpha is set to make the points transparent. The last step uses the text() function to set the x and y coordinates for the text.
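The coloring rule is an ordinary list comprehension, so it can be checked on its own before it is handed to a plot. Below is a minimal sketch of that idea. It assumes plain matplotlib rather than seaborn, passes the colors through the c argument of scatter, and uses hypothetical job types and values (not the Prestige data). The Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values for illustration only
job_types = ["prof", "bc", "wc", "prof"]
education = [14.0, 8.0, 11.0, 15.0]
income = [12000, 4000, 6000, 14000]

# One color per point: orangered for prof, lightgray for everything else
colors = ["orangered" if t == "prof" else "lightgray" for t in job_types]

fig, ax = plt.subplots()
points = ax.scatter(education, income, c=colors, alpha=0.3)
fig.canvas.draw()  # render the figure in memory
```

The list must have one entry per plotted point, which is why it is built from the same column that describes each row.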

Conclusion

Annotation is one of many ways to bring attention to crucial insights in a visualization. The examples above show some of the ways this tool can be used to highlight key information when using Python.

Data Tables Basics in R

Data frames are the default way that data is often stored in R. However, another option for storing data in R is using data tables. As we will see, data tables allow you to accomplish much more than data frames. For now, we will focus on some basic features of data tables and data frames before moving to actions that are easier to perform with data tables.

Loading Packages and Data Preparation

We will start by loading the package data.table and preparing our data. The data.table package is loaded using the library() function. We will use the mtcars and iris datasets for the various examples. Both of these datasets are available by default within R. Since our focus is on data tables, we will convert both the mtcars and iris datasets into data tables and store them in objects with the same name. Below is the code.

library(data.table)
mtcars<-data.table(mtcars)
iris<-data.table(iris)

Next, we will use the head() function to examine the mtcars and iris datasets quickly.

The mtcars dataset has data about cars, while the iris dataset has data about various features of flowers.

Subsetting Basics

The first five examples can be performed on either data frames or data tables. We will begin by subsetting a single row from a data table as shown below.

#filtering with positive integers
row_2 <- mtcars[2,]
row_2
  mpg cyl disp  hp drat    wt  qsec vs am gear carb
2  21   6  160 110  3.9 2.875 17.02  0  1    4    4

In the code above, we filter for the second row in the mtcars data table. This is done using brackets followed by a number for the row we want. The comma after the number 2 in the brackets would allow us to select a column. Since there is nothing after the comma, R selects all columns. This is why we have all the data from row number 2.

In the example below, we will select multiple rows at once using the c() function and a colon.

#multiple rows
> rows_1_5 <- mtcars[c(1:5),]
> rows_1_5
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

The main difference in the code above is the use of the c() function, also known as the concatenate function. Inside this function, we tell R we want rows 1 through 5, and since nothing follows the comma, all columns are returned. However, it is not necessary to pull consecutive rows, as we can also pull whatever rows we want specifically.

#filtering non consecutive rows
rows_1_3_5 <- mtcars[c(1,3,5),]
rows_1_3_5
   mpg cyl disp  hp drat   wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
3 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
5 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

In the example above, inside the c() function, we indicate that we want the 1st, 3rd, and 5th rows along with all of the columns. In the next two examples, we will learn how to exclude rows rather than include them.

only_last_two <- mtcars[-c(1:30),]
only_last_two
    mpg cyl disp  hp drat   wt qsec vs am gear carb
31 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
32 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

You can use a minus sign in front of your subset to remove the rows listed inside the brackets. For example, in the code above, we place a minus sign in front of rows 1 to 30 to tell R to remove rows 1 to 30. This is why only rows 31 and 32 appear in the output. Just as in the other examples, the numbers do not have to be consecutive, as shown below.

exclude_some <- mtcars[-c(1:10,12:32),]
exclude_some
    mpg cyl  disp  hp drat   wt qsec vs am gear carb
11 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4

In the above example, we exclude rows 1 to 10 and rows 12 to 32, leaving only row 11.

Using data.table

We will now do three examples that require the use of data.table. The first example below removes the first 30 rows and the last row (row 32), which means only row 31 is displayed.

not_first_last <- mtcars[-c(1:30,.N)]
not_first_last
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:    15     8   301   335  3.54  3.57  14.6     0     1     5     8

If you look closely, the output is different. There is a 1 next to the data, which gives the impression that this is row 1 from the dataset. However, this is not row 1 of the dataset but rather the first row of the subsetted data. In addition, you can see the <num> above all columns, which means this data is numerical. Lastly, in the code, you see .N, a special data.table symbol that holds the number of rows (32 here). Placing it inside the negated c() removes the last row along with rows 1 to 30.

In the next example, we are going to subset the data so that only cars with a manual transmission appear (am==1). In the mtcars dataset, am is coded 0 for automatic and 1 for manual.

am_1 <- mtcars[am == 1]
am_1
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
 5:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
 6:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
 7:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
 8:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
 9:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
10:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
11:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
12:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
13:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2

Within the brackets, you simply indicate what values you want for the column that is being used for the filtering. Naturally, you can create more complex queries as shown below.

am_1_mpg_25 <- mtcars[am==1 & mpg > 25]
am_1_mpg_25
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
3:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
4:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
5:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
6:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2

In the last example, we filtered the data for cars with manual transmissions (am==1) and with mpg above 25.

Conclusion

Data tables are highly flexible and allow a user to do things in a way that is much more efficient, depending on the situation. This is yet another excellent tool that can be deployed by an R enthusiast.

Comparing Data Using Python

Comparing groups within a dataset is another aspect of analysis. Here, we will use some tools from Python.

Libraries & Data Prep

First, we need to load our libraries and prepare our data. Below is the code for the libraries we need.

import seaborn as sns
from pydataset import data
import matplotlib.pyplot as plt

The first and last lines load the libraries we need for data visualization. The second line loads the data() function from pydataset, where our data will come from. Below is the code for loading our data.

df=data('Prestige')
df.head()

Our data is the Prestige dataset from the data() function, loaded as the object df. The .head() method displays the first few lines of our dataset. This dataset contains various jobs measured in terms of education, income, women, prestige, census, and type. We are now ready to create our first comparison.

Histogram Comparison

The histogram comparison allows us to compare the shape of different distributions when they overlap each other. The code below uses kdeplot(), which draws a smoothed density curve (a smoothed version of a histogram) for each group. We will compare the income distribution by job type.

# Filter dataset for prof
sns.kdeplot(df[df.type == 'prof'].income, 
            # Shade under kde and add a helpful label
            fill = True,
            label = 'prof')

# Filter dataset for non prof
sns.kdeplot(df[df.type != 'prof'].income, 
            # Again, shade under kde and add a helpful label
            fill = True,
            label = 'non-prof')
# Draw the legend so the labels set above are visible
plt.legend()
plt.show()

We create the first plot (blue color) by submitting the data for “prof” and income. We repeat this process and subset the data for individuals who are not “prof.” The plot shows that there is a broader distribution of income for people whose job type is professional. We could confirm this with a t-test or ANOVA, but the visual can often help to determine if statistical testing is appropriate.
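As a rough numeric companion to the plot, the spread of each group can also be summarized directly. The sketch below is only an illustration: it assumes a small hypothetical data frame (the incomes are made up and are not the Prestige values) that reuses the type and income column names.

```python
import pandas as pd

# Hypothetical data: made-up incomes, not the real Prestige values
toy = pd.DataFrame({
    'type': ['prof', 'prof', 'prof', 'bc', 'bc', 'wc'],
    'income': [8000, 15000, 25000, 4000, 5000, 6000],
})

# Flag professional jobs, mirroring the prof / non-prof split used in the plots
toy['group'] = toy['type'].apply(lambda t: 'prof' if t == 'prof' else 'non-prof')

# Standard deviation and range of income within each group
spread = toy.groupby('group')['income'].agg(['std', 'min', 'max'])
print(spread)
```

A larger standard deviation and a wider min-to-max range for one group is consistent with the broader curve seen in a density plot, though it does not replace a formal test.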

Rug Plot

A rug plot serves a similar purpose to a histogram. The main difference is that a rug plot includes ticks along the x-axis that help to show where data points are located. This knowledge can be used to remove outliers when necessary. Below is the code, output, and explanation of how to create a rug plot.

sns.kdeplot(df[df.type == 'prof'].income, 
             label = 'prof',
             # Use green for the prof group
             color = 'green')

sns.kdeplot(df[df.type != 'prof'].income,
             label = 'Other types',
             # Use red for the other job types
             color = 'red')
# Turn on rugplot
sns.rugplot(df[df.type == 'prof'].income, 
             label = 'prof',
             color = 'green')

sns.rugplot(df[df.type != 'prof'].income, 
             label = 'Other types',
             color = 'red')
plt.show()

The first two blocks of code are the same as in the last example and draw the density curves. The main difference is that no color fills the area under the curves. The new code is the two blocks that use the rugplot() function. The arguments passed to rugplot() are almost the same as for the density curves, so there is little to discuss.

Swarm Plot

The swarm plot provides a different way to visualize a distribution. Like the histogram, it can be useful for comparison. In the code below, we use a swarm plot to look at the distribution of education by job type.

# Plot beeswarm 
sns.swarmplot(y = "type",
              x = 'education', 
              data = df, 
              # Decrease the size of the points to avoid crowding 
              size = 3)

# Give a descriptive title
plt.title('Education and Type')
plt.show()

The code is straightforward. You use the swarmplot() method and put in your values for the x and y axes. The plot shows you that prof job types have a higher level of education compared to the other job types.

Conclusion

People have their preferences for how they want to view data. The point here was not to state that one method was better than another. Rather, the goal was to raise awareness of the available options and to show how they can be created using Python.

Highlighting Data Points with Python

In this post, we are going to look at how to highlight data points in a scatterplot. We will specifically look at two different methods for doing this: hard coding the highlighted point and selecting it programmatically.

Libraries and Data Preparations

First, we will load our libraries and prepare our data. Below are the libraries we will use.

import seaborn as sns
import matplotlib.pyplot as plt
from pydataset import data

The first two lines are libraries for our data visualization. The last line of code will be used for pulling the data that we will use. Below we prepare our data by loading it into an object called “df” and we take a quick peek at it as well.

df=data('Prestige')
df.head()

The data set we are using is called “Prestige” and we load it using the data() function. This data contains various jobs, education, income, women, prestige, census, and type information. Next, we will look at how to highlight a specific data point in a scatterplot.

Hard Coding

Hard coding is when you manually pick a specific data point to highlight. Below is the code and output for this.

df_prof = df[df.type  ==  'prof']

# Build a list of colors: orangered for the highest income, lightgray otherwise
prof_colors = ['orangered' if (education == 12.26) and (income == 25879) else 'lightgray'
                  for education, income in zip(df_prof.education, df_prof.income)]

sns.regplot(x = 'education',
            y = 'income',
            data = df_prof,
            fit_reg = False, 
            # Send scatterplot argument to color points 
            scatter_kws = {'facecolors': prof_colors, 'alpha': 0.7})
plt.show()

We did the following to make the plot above.

1. We subset the data so that it only contains job types of “prof” and save this as an object called df_prof. The reason we did this was to reduce the number of data points and make it easier to see what we were doing.
2. Next, we make an object called prof_colors, which will color one dot orange if it meets the criteria for the values of education and income. This logic is captured in an if else statement. The for statement is used to tell Python where to apply the if else statement. Since this is hard to understand, below is a visual of the prof_colors object.

Notice the second row and how it is labeled “orangered”. This is because this row matches the criteria for the values of education and income. We will use this object to set the colors of our dots.

3. The next block of code is for making the visualization. Most of this is self-explanatory. You set your x and y values for education and income. The fit_reg argument was set to False because we do not want a regression line. The scatter_kws argument is used to set the color of the dots, and the alpha sets the level of transparency of the dots.
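To make the if else logic inside the list concrete, here is a minimal sketch that runs the same pattern on three rows. The lists education_values and income_values are hypothetical stand-ins for the df_prof columns, with values chosen only for illustration.

```python
# Hypothetical stand-ins for df_prof.education and df_prof.income
education_values = [13.11, 12.26, 12.77]
income_values = [12351, 25879, 9271]

# One color per row: orangered only where both conditions match
colors = [
    'orangered' if (education == 12.26) and (income == 25879) else 'lightgray'
    for education, income in zip(education_values, income_values)
]
print(colors)  # ['lightgray', 'orangered', 'lightgray']
```

The result is a plain Python list with one color per data point, which is the kind of object that scatter_kws receives under the 'facecolors' key.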

Setting the highlighted point manually is good if your data is static. However, if your data is dynamic, you want to highlight the points programmatically so that the highlighted point changes as the data does.

Programmatically

The code is mostly the same as above, with a few minor changes. Below is the code, followed by the output and, lastly, the explanation.

# Copy the subset so that adding a column does not modify a view of df
df_prof = df[df.type  ==  'prof'].copy()

# Find the highest income
max_income = df_prof.income.max()

# Make a column that denotes which occupation has the highest income
df_prof['point_type'] = ['Highest Income' if income  ==  max_income else 'Others' for income in df_prof.income]

# Encode the hue of the points with the new point_type column
sns.scatterplot(x = 'education',
                y = 'income',
                hue = 'point_type',
                data = df_prof)
plt.show()

Here is what we did.

1. We start by subsetting the data as before for type of “prof”.
2. We then create an object called max_income and find the highest income in the df_prof object using the max() method.
3. This time we create a new column in our data called “point_type”, which is created using an if else statement and a for loop. If income matches our highest income, the row is labeled “Highest Income”, and all other rows in the income column are labeled “Others”.
4. Lastly, we create our scatterplot. We set the x and y values, and we set the hue to match “point_type”, which is the new column we just created.

With this second method, our highlighted data point will change as necessary if it changes in the data.
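The sketch below illustrates that behavior without any plotting. It assumes a small hypothetical data frame (toy_prof, with made-up values) and a hypothetical helper called label_highest() that applies the same labeling logic, so the label can be recomputed after the data changes.

```python
import pandas as pd

# Hypothetical stand-in for df_prof; the values are illustrative only
toy_prof = pd.DataFrame({
    'education': [13.1, 12.3, 15.9],
    'income': [12000, 26000, 17000],
})

def label_highest(frame):
    # Return a copy with a point_type column marking the top income
    out = frame.copy()
    top = out['income'].max()
    out['point_type'] = ['Highest Income' if inc == top else 'Others'
                         for inc in out['income']]
    return out

labeled = label_highest(toy_prof)

# Simulate the data changing: a new row now holds the highest income
updated = pd.concat(
    [toy_prof, pd.DataFrame({'education': [14.5], 'income': [30000]})],
    ignore_index=True,
)
relabeled = label_highest(updated)
print(relabeled)
```

After the new row is added, the “Highest Income” label moves to it without any change to the code, which is the advantage over hard-coded values.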

Conclusion

Highlighting data points is something that is needed at times when creating data visualizations. The examples above provide two different ways to deal with this. Which method is best depends on the context.

Needs Assessment within Program Evaluation

Program design often begins with a needs assessment. The needs assessment helps to determine what the program requires in order to address its target problems. In this post, we will look at how needs assessments are often developed within the context of program evaluation and the various levels of needs that one may encounter.

Process of Needs Assessment

In program evaluation, needs assessment has three phases: preassessment, assessment, and postassessment. Preassessment determines the problem’s current status and the assets available to address it. Common questions that a preassessment may address include how the issue might be resolved, who is affected by the problem and/or the lack of resources, and what has been done in the past to address the situation. The sources of data for this include historical data and interviews.

The assessment phase involves collecting new information on the organization’s needs and assets. Whereas the preassessment looks at the past, the assessment looks at the present situation. The assessment also addresses the same questions as the preassessment. Since there is so much overlap between the preassessment and the assessment, it is common to skip this step and move directly to the postassessment.

The postassessment phase involves using the information that was gathered from the first two phases to develop appropriate interventions. For example, if a needs assessment finds a lack of resources for improving reading, an appropriate intervention may be the development of a reading lab. Naturally, the creation of a reading lab would require funding, such as from a grant.

Levels of Need

Another aspect of a needs assessment is determining the level of need. In this context, need refers to who is receiving and giving services. A primary-level need is used to identify service recipients. For example, the students who use a reading lab would be at the primary level. Primary-level individuals need the program’s services.

The secondary needs level involves the individuals who provide the services of a program. An example of individuals at the secondary need level would be teachers who are supporting the reading lab. Secondary-level individuals may need training, support, and/or the actual materials to make the program come to life.

The tertiary needs level is the actual support that secondary-level individuals use to make the program happen. As already mentioned, this can include training, materials, and/or support. An example would be training teachers to use the reading lab and making the software readily available for teachers.

Conclusion

A needs assessment is often necessary when developing programs, especially large ones. This crucial step provides clarity about what needs to be developed. With these tools, program administrators can be sure that they are taking a scientific approach to supporting program participants.

Generating Fake Data for Privacy with Python VIDEO

Generating fake data is one way to protect an individual’s privacy. The video below provides examples of how to do this using Python.

Data Privacy with Python: Unique Combos & Generalizations

In the video below, we will look at how to maintain privacy in data through removing unique combinations and generalizing data.

Program Implementation

Program implementation examines how a program is put into practice. The focus of any program is to bring change to whoever the stakeholders of the program are. Therefore, how the program is put into practice or implemented plays a critical role in whether the program is successful.

Components of Implementation

Joseph Durlak describes eight components of program implementation as shown below.

Fidelity
Dosage
Adaptation
Quality
Participant engagement
Program differentiation
Monitoring of controlled conditions
Program reach

Most of these components are self-explanatory. Fidelity is the level of faithfulness that implementers of the program have to the procedures and/or protocols of the program. Many programs have an experimental nature in which the participants of the program are compared either to themselves as a “before” group or to a control group that does not experience the program. To confirm that the program is the reason for any difference, it must be verifiable that the procedures of the program were adhered to.

The same idea applies to dosage, which is the amount of the program that is experienced. This value must be consistent to establish any differences between groups. Dosage can be measured in terms of the amount, length of time, number of occurrences, etc. the program requires.

Adaptations are the modifications that are made to the program for various reasons. Sometimes the original procedures of the program are not practical during implementation. For example, a program may expect participants to receive counseling twice a week for 30 minutes each time for a total of an hour. During implementation, it may be found that the participants were not able to come twice a week. Therefore, instead of meeting twice a week, the program is adapted to meet once a week for one hour. It is critical to keep track of adaptations, as they can cause a program to lose its focus and original purpose.

Participant engagement is how involved and cooperative the participants in the program are. Low engagement is often a sign that a program is failing. If this does happen, it may be necessary to make adaptations to the program.

Program differentiation is the awareness of how the current program is different from other programs. Knowing what makes a program different is critical in showing how it is superior to other interventions that have been tried. Understanding these differences also helps in determining what works and what does not work in terms of helping participants.

Monitoring of controlled conditions is focused on the controlled variables that need to be monitored when using experimental and control groups with programs. Lastly, program reach is a measure of how much of the target population is involved with the program.

It is critical to be aware of these components of implementation, as they help evaluators determine the level of success a program has had. It is also important to make sure that the individuals who are actually implementing the program are trained and supported throughout the entire implementation process. If the implementers do not know what to do or feel abandoned, then implementation will also suffer.

Factors of Implementation

Components of implementation are aspects that are within the program. Factors of implementation are variables outside of the program that influence it. According to Joseph Durlak, there are also several factors to be aware of when it comes to implementation. Some of the factors include the following.

Community level
Traits of implementors
Program traits
Organizational factors
Processes
Staffing
Professional development

The community-level factor relates to traits of the community surrounding the program and can include the policies, politics, and the level of funding for a program. A negative political environment, for example, can seriously hamper cooperation.

The implementers’ traits can include their skill level, confidence, sense of relevancy, and more. We have already discussed implementers, but if they lack the necessary skills, even the best programs will fail.

Program traits include how well the program fits with the school and/or the adaptability of the program. Sometimes a great program is a poor cultural fit and/or is too rigid for the local context. An example is the counseling case used earlier for adaptations: twice-a-week counseling may not be appropriate for the context.

Organizational factors include the climate, openness, integration, etc., of the local organization that is supporting the program. A closed-off organization will probably not support any program no matter the benefits.

Processes include decision-making, communication, planning, etc. Programs require local stakeholders to make decisions about cooperation and other factors related to planning and implementation. If there is a bottleneck or resistance to developing processes, the program may never get off the ground.

Staffing is about leadership and how they support the program. Enthusiastic leaders may provide adequate support for a program while indifferent leaders may cause a program to fail. One reason for this is the control over resources and morale that leaders possess.

Professional development has already been alluded to; it is the amount of support and training that implementers of a program need. It is of critical importance that the individuals who bring a program to life through implementation receive the support and training they need in order to ensure success. If the implementers are confused over what to do, the program has little hope for success.

Conclusion

Program implementation is often overlooked. People are so excited to begin a new program to help people that they often forget to assess the implementation of it. Doing this can lead to good programs being labeled as failures, leading to finger-pointing. Focusing on the implementation can help to alleviate this common occurrence.

Python for Data Privacy VIDEO

Data privacy and the protection of people’s identities are important. The video below provides some basic ways to ensure the privacy of individuals when working with data.

Privacy of Continuous Data with Python

There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.

Libraries

We will begin by loading the necessary libraries. Below is the code.

from pydataset import data
import pandas as pd

The library setup is simple. We are importing the data() function from pydataset. This will allow us to load the data we will use in this post. We are also importing pandas to make a frequency table later on. Below we will address the data preparation.

Data Preparation

The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.

df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.

View of Original Histogram

Below is the code and output for the histogram of the “age” variable. The reason for making this visual is to provide a “before” picture of the data before changes are made.

df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.

Change to Categorical

Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.

df['age'] = df['age'].apply(lambda x: ">=40" if x >= 40 else "<40")
df.head()

We are overwriting the “age” variable in the code using an anonymous function. On the “age” variable we use the .apply() method and replace values of 40 or above with “>=40” and values below 40 with “<40”. The data is now broken down into two groups, those aged 40 or above and those below 40. Below is a frequency table of the transformed “age” variable.

df['age'].value_counts()

age
>=40    3984
<40     3441
Name: count, dtype: int64

The .value_counts() method comes from the pandas library. There are two groups now. The table above is a major transformation from the original histogram. Below is the code and output of a bar graph of this transformation.

import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You do not have to limit yourself to only two groups when dividing your data. The number of groups depends on the context and the purpose for which this technique is used.
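As one possible way to create more than two groups, the sketch below uses pandas’ cut() function rather than the lambda shown earlier. The ages, bin edges, and labels are hypothetical choices made only for illustration.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels below are illustrative choices
ages = pd.Series([18, 25, 39, 40, 58, 71])

# right=False makes each bin include its left edge: [0, 30), [30, 50), [50, 120)
age_groups = pd.cut(
    ages,
    bins=[0, 30, 50, 120],
    labels=['<30', '30-49', '>=50'],
    right=False,
)
print(age_groups.value_counts(sort=False))
```

Wider bins give more privacy but less analytical detail, so the edges are worth choosing with the intended analysis in mind.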

Top Coding

Top coding is a trick used to bring extremely high values down to a specific value. Again, the purpose of modifying these values in our context is to protect people’s privacy. Below is the code and output for this approach.

df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.

1. We load the “SLID” dataset again so that we can modify it again from its original state.
2. We then use the .loc indexer to change all values in “age” above 75 to 75.
3. Lastly, we create our histogram for comparison purposes to the original data.

If you look to the far right, you can see a spike in the number of data points at age 75 compared to our original histogram. This is a result of our manipulation of the data. By doing this, we can keep all of our data for other forms of analysis while also protecting the privacy of the handful of people who are over the age of 75.
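Before capping values, it can help to check how many records sit above the cutoff, since top coding is aimed at a small group. The sketch below is a minimal illustration on a hypothetical series of ages (not the SLID data), reusing the cutoff of 75.

```python
import pandas as pd

# Hypothetical ages; 75 is the same cutoff used above
ages = pd.Series([34, 52, 75, 81, 90, 47])
cutoff = 75

# How many records sit above the cutoff and would be capped
n_above = int((ages > cutoff).sum())

# Top code a copy so the original values remain available for comparison
top_coded = ages.copy()
top_coded.loc[top_coded > cutoff] = cutoff

print(n_above)             # 2
print(top_coded.tolist())  # [34, 52, 75, 75, 75, 47]
```

If the count turns out to be large, a higher cutoff may preserve more of the distribution while still protecting the extreme values.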

Bottom Coding

Bottom coding is the same as top coding except now you raise values below a threshold to a minimum value. Below is the code and output for this.

df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before with the only difference being the less than “<” symbol and the threshold being set to 20. As you compare this histogram to the original you can see a huge spike in the number of values at 20.
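Top and bottom coding can also be applied in one step. The sketch below uses the pandas Series.clip() method as an alternative to the .loc assignments; it is not the approach used in this post, and the ages are hypothetical.

```python
import pandas as pd

# Hypothetical ages with values below 20 and above 75
ages = pd.Series([16, 19, 20, 45, 75, 88])

# clip() raises values below `lower` and caps values above `upper`
coded = ages.clip(lower=20, upper=75)

print(coded.tolist())  # [20, 20, 20, 45, 75, 75]
```

Because clip() returns a new Series, the original values are left untouched unless the result is assigned back to the column.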

Conclusion

Data protection is an important aspect of the analysis role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python.

Thoughts on The State and Revolution by Lenin

The State and Revolution was written by Lenin in 1917. This text provides Lenin’s thoughts on the role of communism in the context of leading the proletarian revolution and the shape of the government afterward. The text is rather repetitive and rambling. Therefore, instead of providing a summary, which would be rather difficult, it was decided to briefly describe some of the text’s main points. These main points are…

The purpose of the state
The purpose of the revolution
The stages after the revolution

None of the ideas above are in one specific place within the text. Instead, they are scattered throughout and shared repeatedly, making the text difficult to understand.

Purpose of the State

Lenin states that the state exists solely because of class antagonism. In other words, the government referees the battle between the bourgeoisie and the proletariat. This makes sense, as you cannot have property or capital unless there is someone to protect said property. A society without government would not have anything, whether communist or capitalist. The capitalists need the government to protect their capital, while the proletariat seeks justice from the same government.

Lenin also shares that the ruling class uses the state to oppress the poor. Again, it is hard to refute this, as corporate America has teamed up with the government before. However, Lenin has left out how governments have responded to the cries of the poor in the past. For example, during Lenin’s life, the Russian Czar attempted reforms before his downfall. Even before the French Revolution, the King of France tried to compromise. As such, even in monarchies, tone deafness is difficult to maintain fully.

Purpose of Revolution

Lenin then shares that the purpose of revolution is to overthrow the bourgeoisie so the proletarians can take power. Lenin believes that overthrowing the ruling class will solve most if not all of society’s problems.

The problem with this belief is that revolution leads to a new set of oppressors in most cases. The leadership changes but the wicked hearts of man remain the same. Lenin seems to think that the system is the problem (a sentiment shared today). In reality, it is the people who are the problem. All governments have issues and problems, but they also have one thing in common: people who form, lead, and destroy them.

Stages After the Revolution

Lenin also divides the stages after the revolution into three main parts. The first stage is the proletarian dictatorship. This dictatorship involves the proletarians using the apparatus of the conquered state to crush all of the remaining bourgeoisie. In other words, the tools of the enemy are used to destroy the enemy. This stage of the revolution has happened in many countries such as Cambodia, Cuba, North Korea, Russia, and Vietnam. The landholders and capitalists are rounded up and killed, and the people seize their property. There is often a huge loss of life as the revolutionaries tend to kill indiscriminately in their zeal for change.

The second stage is socialism, which involves the government having control over the means of production. Notice how the government is still being used, but instead of slaughtering, the focus has shifted to control of the people. In addition, contrary to popular belief, traditional communism does not want to control all property, just property used for producing wealth. At this point, everyone only gets what they need instead of what they want, destroying all motivation and ambition to work hard. This is also the stage at which all communist governments stop. The government takes control and never gives up that control. This proves the point that communism swaps one corrupt leadership for another. The main difference between communism and capitalism is who has control, the individual or a monolithic government.

The final stage of the revolution is the withering of the state. Once everyone is thoroughly communist and social classes are destroyed there is no need for the state. No communist government has achieved this as the revolution’s leaders enjoy being in charge. The common counter to this observation is that nobody has successfully completed a communist revolution. Therefore, people must try harder to achieve this. It also must be mentioned that there is no view of utopia as Lenin shares that neither he nor Marx knows what that is like. As such, the revolution must continue forever.

Conclusion

This was not a summary of Lenin’s views in his book The State and Revolution. The goal was only to share some of the main points. This is probably one of Lenin’s best-known books and required reading for hardcore leftists. Even though no one has achieved true communism many are highly motivated to make this theory a reality.

Python for Data Privacy

Data privacy is a major topic among analysts who want to protect people’s information. There are often ethical expectations that personal identifying information is protected. Whenever data is shared, you want to be sure that individual people cannot be identified within a dataset, which can lead to unforeseen consequences. This post will examine simple ways a data analyst can protect personal information.

Libraries & Data Preparation

There are few libraries and minimal data preparation for this example. The code and output are below.

from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.

We will now move to the first way to protect privacy when working with data.

Drop Columns

Sometimes protecting people’s identity can be as easy as dropping a column. Often, the column(s) that contain the names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people. Therefore we will drop this column. Below is the code and the output for this.

# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")

# Explore obtained dataset
suppressed_language.head()

To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.
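
As a minimal, self-contained sketch of the same idea, the code below builds a small made-up dataframe and drops one column. The name "people" and all of its values are assumptions for illustration only; they are not the SLID data. The sketch also shows that drop() returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the SLID data; the values are invented
people = pd.DataFrame({
    "wages": [10.5, 11.0, 17.8],
    "education": [15.0, 13.2, 14.0],
    "language": ["English", "Other", "French"],
})

# Attribute suppression: drop() returns a copy without the column
suppressed = people.drop("language", axis="columns")

print(list(suppressed.columns))      # ['wages', 'education']
print("language" in people.columns)  # True, the original is unchanged
```

Because the original dataframe still holds the sensitive column, only the suppressed copy should be shared.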

Drop Rows

It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column, it may be possible to identify them. In the code and output below, we drop all rows where education is greater than or equal to 14.

# Drop rows with education of 14 or higher
education = df.drop(df[df.education >= 14].index)

# See  DataFrame
education.head()

In the code, we used the drop() method again but subsetted the data to select the rows with education values greater than or equal to 14. We then used the .index attribute to pass the labels of those rows to drop(), which removes them. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or above.
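
The row-dropping step can be sketched on a small invented dataframe. The "people" data and its index labels are assumptions for illustration. The sketch also shows an alternative that filters with a negated boolean mask, which gives the same result even when the column contains missing values.

```python
import pandas as pd

# Hypothetical data; one value is missing on purpose
people = pd.DataFrame(
    {"education": [15.0, 13.2, 16.0, 14.0, float("nan")]},
    index=[1, 2, 3, 4, 5],
)

# Boolean mask of the rows to suppress (education of 14 or more)
too_high = people.education >= 14

# Option 1: pass the labels of the matching rows to drop()
dropped = people.drop(people[too_high].index)

# Option 2: keep every row that does NOT match the mask
filtered = people[~too_high]

print(list(dropped.index))   # [2, 5]
print(list(filtered.index))  # [2, 5]
```

Note that filtering with "education < 14" instead of the negated mask would also remove the row with the missing value, since comparisons with missing values evaluate to False.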

Data Masking

Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.

# Uniformly mask the education column 
df['education'] = '****'

# See resulting DataFrame
df.head()

The code involves subsetting the education variable and setting it equal to the asterisks. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it.

Replace Part of String

Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.

#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )

#See Results
df.head()

The code involves rewriting the data in the “sex” column.

We do this by using the apply() method on this column. Inside the apply() method we use an anonymous function, which is written with the keyword “lambda”.
After lambda, we name the argument “text” for practical reasons since we are modifying text.
After the colon, we tell Python to keep the first character of the string with “text[0]”. Next, we insert four asterisks **** after the first letter in the string.
Lastly, we use the find() method to locate the string “le” in “text” and keep everything from that position to the end of the string.

The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
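
The same masking logic can be tried on a few values without loading the dataset. In this sketch the series "sex" and the helper name "mask_middle" are assumptions for illustration; the expression inside the helper is the one used in the lambda above.

```python
import pandas as pd

# Hypothetical values that mimic the "sex" column
sex = pd.Series(["Male", "Female", "Male"])

def mask_middle(text):
    # Keep the first character, add asterisks, keep from "le" onward
    return text[0] + "****" + text[text.find("le"):]

masked = sex.apply(mask_middle)
print(masked.tolist())  # ['M****le', 'F****le', 'M****le']
```

One caveat: find() returns -1 when “le” is not in the string, so the slice would keep only the last character. For example, mask_middle("Other") gives 'O****r'.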

Conclusion

Protecting personal information is critical when working with data. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.

Bokeh-Manipulating Glyph Color in Python VIDEO

The video below explains how to modify the color of glyphs when using Bokeh. Manipulating the color is another way to convey information about your data to the end user.

Generating Fake Data for Privacy with Python

The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.

Libraries & Data Preparation

The initial library needed is only “pydataset” which will allow us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the data using the .head() method. Below is the code and the output.

from pydataset import data
df=data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.

We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will have a default value that we will manipulate. Below is the code and output.

df['name']="Dan"
df['credit_card']=1234567890
df['credit_code']=123
df['credit_company']='comp'
df.head()

All of this new data will serve as data that needs protection. The original data isn’t needed; it just serves as a dataset that we are grafting the privacy data onto. Building a dataframe from scratch is beyond the scope of this post, so we took a shortcut by adding to preexisting data. We will now see how to generate fake numbers and names.

Fake Numbers

The “faker” library has a class called “Faker” that can generate fake data for almost any circumstance. We will demonstrate this by generating phony credit card numbers. Below is the code and output.

# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()

'6561857744400343'

To generate the false credit card number we imported the Faker class from the faker library. Then we created an instance of the Faker class called “fake_data”. Lastly, we called the .credit_card_number() method on the “fake_data” object.

We will now generate fake values for “credit_card”, “credit_code”, and “credit_company”.

# Mask card number with new generated data using a lambda function
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())

# See the resulting pseudonymized data
df.head()

If you compare this output to the original you can see that the values have changed. We set the seed using Faker.seed(0) so we always get the same results. The next three lines of code use an anonymous function which allows us to loop through our dataset. First, we select the column we want to overwrite. Second, we use the .apply() method on the same column. Inside the .apply() method we use lambda followed by the argument x. After the colon, we indicate what we want done to the column using the appropriate method from the faker library. Note that x itself is not used; each value is simply replaced with a newly generated one. Lastly, we display the results using the .head() method. Next, we will address the names of people.

Fake Names

There are at least three different methods for generating fake names: a method that generates male or female names, a method that generates only male names, and a method that generates only female names. Below is a brief example of each.

Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())

Norma Fisher
Jorge Sullivan
Elizabeth Woods

The code above is self-explanatory. We used the print function in order to display the output of each of the three lines. We will use the .name() method in the code below to generate fake names for our “name” column.

Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as what we did with the credit card information. As such, we will not reexplain it here.

Conclusion

The ability to generate fake data as shown in this post allows an incredible amount of flexibility in protecting people’s identity. However, nothing that is needed for developing insights should be lost. For example, generating random credit card numbers could be catastrophic if this information provides insights in a given context. Therefore, any tool that is going to be used must be used with wisdom and caution.

Postpositivist Paradigm and Program Evaluation

Within the context of program evaluation, different schools of thought or paradigms affect how evaluators do evaluation. In this post, we will look specifically at the postpositivist paradigm.

Postpositivism

The postpositivist paradigm grew out of the positivist paradigm. Both paradigms believe in using the scientific method to uncover laws of human behavior. There is also a focus on experiments, whether true or quasi-experimental, with the use of surveys and/or observation. However, postpositivism will also take a mixed methods (combining quantitative with qualitative) approach when it makes sense.

The main differences between positivism and postpositivism are the level of certainty and their contrasting positions on metaphysics. Positivists focus on absolute certainty of results while postpositivists are more focused on probability than on certainty. In addition, positivists believe in one objective reality that is independent of the distant observer while postpositivists tend to have a more nuanced view of reality.

The typical academic research article follows the positivist/postpositivist paradigm. Such an article will contain a problem, purpose, hypotheses, methods, results, and conclusion. This structure is not unique to postpositivism, but it is important to note how ubiquitous this format is. The example above is primarily for quantitative research, but qualitative and mixed methods follow this format more loosely.

Within evaluation, postpositivism has influenced theory-based evaluation and program theory. Theory-based evaluation is focused on theories or ideas about what makes a great program, which are realized in the traits and tools used in the program.

Program theory is a closely related idea focused on the elements needed for achieving results and showing how these elements relate to each other. The natural outgrowth of this is the logic model which identifies what is needed (inputs) for the program, what will be done with these resources (output), and what is the impact of the use of these resources among stakeholders (outcomes). The logic model is the bedrock of program evaluation in many contexts such as within the government.

The reason for the success of the logic model is how incredibly structured and clear it is. Anybody can understand the results even if they may not be useful. In addition, the logic model was developed earlier than other approaches to program evaluation and it may be popular because it’s one of the first approaches most students learn in graduate school.

The emphasis on theory with postpositivism can often be at the expense of what is taking place in the actual world. While the use of theory is critical for grounding a study scientifically this can be alienating to the stakeholders who are tasked with using the results of a postpositivist program evaluation. As such, other schools of thought have looked to address this.

Conclusion

Postpositivism is one of many ways to view program evaluation. The steps are highly clear and sequential, and generally, everybody knows what to do. However, the appearance of clarity does not imply that it exists. Other paradigms have challenged the usefulness of the results of a program evaluation inspired by postpositivism.

Program Evaluation Paradigms

Program evaluation plays a critical role in assessing program performance. However, as with most disciplines of knowledge, there are different views or paradigms for how to assess a program.

The word paradigm, in this context, means a collection of assumptions or beliefs that shape an individual’s worldview. For example, creationists have assumptions about how life came to be that are different from those of people who believe in evolution. Just as paradigms influence science, they also play a role in how evaluators view the structure and purpose of program evaluation.

In this post, we will briefly go over four schools of thought or paradigms of program evaluation, along with a description of each and how they approach program evaluation. These four paradigms are:

Postpositivist
Pragmatic
Constructivist
Transformative

Postpositivist

The postpositivist paradigm grew out of the positivist paradigm. Both paradigms are focused on the use of the scientific method to investigate a phenomenon. They also both support the idea of a single reality that is observable. However, postpositivists believe in a level of probability that accounts for human behavior. This assumption may have given rise to statistics, which focuses heavily on probability.

Postpositivism is heavily focused on methods that involve quantitative data. Therefore, any program evaluator who is eager to gather numerical data is probably highly supportive of postpositivism.

Pragmatic

A pragmatic paradigm is one in which there is a strong emphasis on the actual use of the results. A pragmatist wants to collect data that they are sure will be used to make a difference in the program. In terms of data and methods, anything goes as long as it leads to implementation.

Since pragmatism is so flexible, it is supportive of mixed methods, which can include both quantitative and qualitative data. While a postpositivist might be happy once the report is completed, a pragmatist is only happy if their research is used by stakeholders.

Constructivist

The constructivist paradigm is focused on how people create knowledge. Therefore, constructivists are focused on the values of people because values shape ideas and the construction of knowledge. As such, constructivists want to use methods that focus on the interaction of people.

With the focus on people, constructivists want to create a story using narrative approaches that are often associated with qualitative methods. It is possible but unusual for constructivists to use quantitative methods because such an approach does not help to identify what makes a person tick in the same way as an interview would.

Transformative

The transformative paradigm is focused on social justice. Therefore, adherents to this paradigm want to bring about social change. This approach constantly investigates injustice and oppression. The world and the system need to be radically changed for the benefit of those who are oppressed.

People who support the transformative paradigm are focused on the viewpoints of others and the development of more rights for minority groups. When the transformative paradigm is the view of a program evaluation the evaluators will look for inequity, inequality, and injustice. Generally, with this approach, the outcome is already determined in that there is some sort of oppression and injustice that is happening, and the purpose of the evaluation is to determine where it is so that it can be stamped out.

Conclusion

The paradigm that someone adheres to has a powerful influence on how they would approach program evaluation. The point is not to say that one approach is better than the other. Instead, the point is that being aware of the various positions can help people to better understand those around them.

Bokeh-Modifying Glyphs

In this post, we will examine additional ways to modify a Bokeh data visualization. The data points in a Bokeh visualization are referred to as glyphs. Glyphs can take various shapes, colors, etc, and we will learn how to modify glyphs specifically for scatter plots in this post.

Libraries

We will need several libraries to work with glyphs in a scatterplot. They are listed below.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show

The first library is for pulling the data we need. The next two libraries are basic libraries from Bokeh for generating our scatterplot.

Data Preparation

Data preparation is simple in this example. All we have to do is use the data() function to load the “Duncan” dataset into an object we call “df.” After this, we print out the first few rows of this dataset using the head() method.

df=data('Duncan')
df.head()

The Duncan dataset contains various occupations that are scored on type, income, education, and prestige. We will be focused primarily on education and prestige in this post.

Default Scatterplot

The code below provides a default scatterplot without any modifications. The purpose of providing this is to serve as a comparison to what we will do when we modify the glyphs in the subsequent examples. Below is the code, output, and explanation.

#Fig Setup
fig = figure(x_axis_label="Prestige", y_axis_label="Education", title="Prestige vs Education")

# Add glyphs
fig.circle(x="prestige", y="education", source=df)

output_file(filename="pres_vs_inc.html")
show(fig)

The first line of code sets up the x and y axes using the figure() function and stores the result in an object we called “fig”. The second line adds the actual data points using the .circle() method on the “fig” object. The last two lines allow us to display our plot. The output_file() function sets the file our work is saved to and the show() function displays the plot.

Modified Glyphs

Our first modification will involve changes to the appearance of the glyphs in the scatterplot. The first line is the same but the second line includes changes to the size, color, and transparency of the glyphs. Below is the code, output, and explanation.

fig = figure(x_axis_label="Prestige", y_axis_label="Education", title="Prestige vs Education")

# Add glyphs
fig.circle(x="prestige", y="education", source=df, size=16, fill_color="yellow", fill_alpha=0.2)

output_file(filename="pres_vs_inc.html")
show(fig)

In the second line of code, we changed the size to 16 using the size argument and the fill of the dots to yellow using the fill_color argument. We also adjusted the transparency using the fill_alpha argument.

Multiple Glyphs

The example below includes the modification of different glyphs based on a categorical variable. A tooltip is also included to show how various tools can be combined when employing Bokeh. The code, output, and explanation are below.

#Subset Data
source_wc = df[df["type"]=="wc"]
source_prof = df[df["type"]=="prof"]

TOOLTIPS=[('Education', '@education'), ('Prestige', '@prestige')]
fig = figure(x_axis_label="Education", y_axis_label="Prestige", tooltips = TOOLTIPS)
wc_glyphs = fig.circle(x="education", y="prestige", source=source_wc, legend_label="WC", fill_color="blue",fill_alpha=0.2)
prof_glyphs = fig.circle(x="education", y="prestige", source=source_prof, legend_label="Prof", fill_color="green", fill_alpha=0.6)

output_file(filename="update.html")
show(fig)

Here is what we did:

We subsetted the data into two different objects (source_wc & source_prof) based on the type variable.
We created our tooltip, which allows us to see the education and prestige of individual data points on the graph.
We created the figure, including the x and y axis labels, for our plot.
We used the .circle() method twice, once for job type “wc” and once for job type “prof”. We set the fill colors and transparency of each subset of data. Notice that we created a legend as well. Remember that we named the objects in this step wc_glyphs and prof_glyphs.
The last two lines of code create a file for the visualization and display it.

Updating Glyphs

We will now learn how to update our code rather than recreating it from scratch. We will update the size of the glyphs and change their color from the prior example.

# Update glyph size
wc_glyphs.glyph.size = 20
prof_glyphs.glyph.size = 10

# Update glyph fill_color
wc_glyphs.glyph.fill_color = "red"
prof_glyphs.glyph.fill_color = "yellow"

output_file(filename="update.html")
show(fig)

We begin by using the objects we created in the last example, wc_glyphs and prof_glyphs. These two objects were returned by the .circle() method and hold the settings for the data points. To update the glyph size we set the .glyph.size attribute to a different value for each of the two objects. We repeat this process for the fill_color.

You can compare the last two visualizations and see the differences for yourself.
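
The update pattern can be sketched on its own. This sketch rests on a few assumptions: the “jobs” dataframe is invented and stands in for the Duncan data, and it uses the .scatter() method (which draws circle markers by default) rather than .circle(), because recent Bokeh releases appear to deprecate passing size to .circle(). The calls to output_file() and show() are left out so that the sketch runs without opening a browser.

```python
import pandas as pd
from bokeh.plotting import figure

# Hypothetical stand-in for the Duncan data; values are invented
jobs = pd.DataFrame({
    "education": [86, 76, 92, 34],
    "prestige": [82, 83, 90, 41],
})

fig = figure(x_axis_label="Education", y_axis_label="Prestige")

# scatter() returns a glyph renderer, just like circle() does
renderer = fig.scatter(x="education", y="prestige", source=jobs,
                       size=10, fill_color="blue", fill_alpha=0.2)

# Update the existing renderer instead of rebuilding the figure
renderer.glyph.size = 20
renderer.glyph.fill_color = "red"
```

Keeping a reference to the object returned by the glyph method is what makes this kind of later update possible.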

Conclusion

Being able to customize glyphs is another powerful feature of Bokeh. With these tools, you can modify your visualizations with ease to communicate with your audience.

Notes on Nationalism

George Orwell wrote an essay entitled “Notes on Nationalism” around the time of WW II. In this brief essay, Orwell defines nationalism along with a description of the traits of nationalists. In this post, a summary of his essay will be provided along with modern examples of some of his key points.

Defining Nationalism

For Orwell, nationalism is an individual’s identification with a single nation or unit. A nation is a country, but a unit is much harder to define. A unit could be a religion, such as Islam, or an ideology like communism. Simply put, a unit can be anything that is not a nation.

Orwell then goes on to describe two types of nationalism which are positive and negative nationalism. A positive nationalist wants to boost the prestige of his country or unit. An example of a positive nationalist would be a patriotic American who believes in “God bless America.”

The examples Orwell includes in his essay of positive nationalism include Zionism, which supports the idea of a Jewish state and is not ashamed to do so. Orwell also shared the example of Celtic Nationalism which believed in the support of the Celtic ethnicities in the United Kingdom. What both of these examples have in common is a focus on supporting a unit of people to achieve goals and objectives.

A negative nationalist is a person who wants to denigrate or lower the prestige of a country or unit. An example of this would be Americans who are ashamed or embarrassed by the past atrocities of the US and want the US to offer reparations, apologies, and to show penitence. These people are also nationalist but have a sense of shame over their country’s behavior that is baffling to a positive nationalist.

The examples of negative nationalism that Orwell shares in his essay include Anti-Semitism, Anglophobia, and Trotskyism. Anti-Semitism is racism against people who are Jewish and does not require much additional explanation. Anglophobia is a negative attitude towards the UK. What makes Anglophobia pertinent is that a similar attitude has permeated the US in recent years. Trotskyism was a branch of mainly Russian communists who did not support Stalin’s leadership of the Soviet Union.

What all of these negative nationalists have in common is hatred of and/or resistance to another country or unit. This leads to the conclusion that whether someone is a positive or negative nationalist depends on who is asking the question. For example, someone who supports Black Lives Matter might see themselves as a patriot continuing the fight for equality which is a tradition in America. However, another person might see BLM in a negative light due to the instability that BLM brings into certain areas. In the end, whether someone is a positive or negative nationalist is based more on marketing than on the actual behavior and beliefs of the individuals involved.

There is one more group of nationalists that does not neatly fall into the two categories already mentioned, and this group is called transferred nationalists. A transferred nationalist is a person who holds a contrasting position to the context in which they live. An example that Orwell uses is a communist who lives in a capitalist country, which is a minority position. Another example he shared was political Catholicism, which was the promotion of Catholic social teachings through government support. Political Catholicism is a form of transferred nationalism because the use of the state to support religion in this manner is supposedly an unusual position in Orwell’s view.

As mentioned before, whether someone is a positive, negative, or transferred nationalist is a matter of perspective. The main point here is to understand how nationalism can manifest in different ways and different contexts.

Main Characteristics of Nationalism

In addition to categorizing the types of nationalism, Orwell also provides three main characteristics of nationalists which are obsession, instability, and indifference to reality. Obsession is being highly focused on the group/unit that the nationalist is supporting. For example, Zionists are highly focused on Israel and matters related to this country. Black Lives Matter supporters are highly focused on systemic racism and matters related to the Black community.

Instability relates primarily to transferred nationalists and it is loyalty outside of the system one is in. The previous example was a communist in a capitalist country. A more recent example is natural-born Americans supporting immigration regardless of the context. Perhaps the reason that Orwell labels this instability is that a minority position can often push for change that destabilizes the status quo.

The final trait of nationalists is indifference to reality. Reality is not defined in a traditional manner here but is more focused on morality. Nationalists see the world from their viewpoint to the exclusion of all contradictory evidence. What is good or bad is not based on behavior but rather on who did it. If the US goes and attacks another country it is a fight for freedom. However, if anybody attacks the US it is considered terrorism. For a pro-US nationalist, no information can be given to criticize US aggression or condone attacks on the US because it is not evidence or morals that matters but the group/unit that the nationalist is supporting.

We can extend this to every other example if we want. Immigration is okay for transferred nationalists no matter how much crime, unemployment, or drain on social services happens. The opposite is true for US positive nationalists: immigration is a problem no matter how many hard-working, tax-paying immigrants come. The same applies to Black Lives Matter and racism. No matter what the government does, systemic racism is still a threat to Blacks. On the other hand, US nationalists are convinced that nothing can be done to appease people who think they are victims of racism.

Conclusion

Orwell’s views on nationalism provide an interesting take from the WW II era. The point was not to criticize his view but rather to explain his position with a few recent examples. Nationalism is a part of the worldview of most individuals in one way or the other. What is truly important is just to be aware of one’s position concerning one’s thoughts relating to nationalism.

Bokeh Display Customization VIDEO

The video below shows you how to make modifications to the display of interactive graphs using Bokeh

Essay on Liberation-Subverting Forces & Solidarity

This post will examine chapters three and four of Herbert Marcuse’s “Essay on Liberation.” This highly influential essay, written in the 1960s, lays out many of the left’s goals and desires regarding the reshaping of society.

Subverting Forces

Chapter 3 is mostly a rehash of complaints and solutions that Marcuse has already addressed in his essay. It begins with a litany of complaints, including the terrible jobs people have to work, the exploitation of minorities, increased violence, and the waste of resources. All of these complaints are blamed on capitalism. It needs to be noted that every system has some sort of flaws and even oppression within them which includes the communist system that Marcuse supports.

Marcuse also mentions how technology can be used to end capitalism rather than support it. The challenge is that the technocrats are using technology to continue the existing system of oppression. Not only is this terrible but the current system must be abolished as reformation is not even an option for Marcuse. This is a sentiment that is shared by many leftists today regarding the destruction of the current system in order to set up a completely new one.

Marcuse also calls on universities to radicalize students by developing and/or awakening their true consciousness. A true consciousness is a mind that has awakened to its true socialist nature. It appears the universities have heeded Marcuse’s call as many of them are considered bastions of liberal left-wing thinking. Again, the problem isn’t that Marcuse believes these things but that he wants everyone else to believe them and thinks it’s okay to use the educational system for this. If we are really free we should be able to accept or reject this worldview that Marcuse so vehemently supports.

Marcuse repeats his desire to radicalize the ghetto (black) population as well. Again, the reason for radicalizing students and minorities is to replace the proletariat workers who are enjoying their middle-class lifestyle. Marcuse never mentions how the ghetto populations were to be radicalized, but it would probably involve the use of former university students who have achieved their true consciousness and are educating and working among the ghetto populations and pointing out the oppression these people are facing. Paulo Freire may be one example of this as he worked exclusively among the poor and minority populations as a language teacher in Brazil pointing out oppression.

One shocking comment Marcuse makes about the black population of his time is that they are expendable. Now, expendable does not mean that blacks should be eliminated or that they have no value. Rather, Marcuse used the term “expendable” to mean that the majority of blacks are not contributing significantly to the current economic system. For Marcuse, this is an advantage because these oppressed individuals are potential recruits for the revolution.

Correlation is not causation, but there was a surprising number of radical black groups that arose in the 1960s and 1970s. Examples include the Black Panthers and the Black Liberation Army. There are also a host of other left-leaning groups such as the Symbionese Liberation Army, Weather Underground, and Students for a Democratic Society. These examples help explain why Marcuse is often called the “father of the new left.”

Solidarity

The final chapter of Marcuse’s essay shares how the revolution was successful in both Cuba and Vietnam. With such recent success as this (Marcuse was writing in the 1960s), Marcuse is implying that such success can be experienced in the US. At the time it was unclear what to expect from the communist revolutions in Cuba and Vietnam. However, history shows us that these revolutions were not blessings to the citizens of either of these countries.

Marcuse then goes on to ponder what life after the revolution will look like. He essentially implies that it is unclear what life will truly be like after the communist revolution. This is a common criticism of communism in that the proponents want a different world but have no idea what to do if they take power. Given the track record of communist governments, it is better that communists pursue power rather than obtain it.

Conclusion

Marcuse had a strong vision for what he wanted to see happen in America. His desire was for the fall of capitalism and the rise of a socialist/communist utopia. In his essay, he lays out this dream of his. Unfortunately, the general outcome of communist revolutions is often negative and leads to huge loss of life as people’s freedoms are curtailed for the sake of the collective.

Essay on Liberation-The New Sensibility

This post will look at the second chapter of Herbert Marcuse’s essay “Essay on Liberation.” The general gist of this influential essay is to bemoan capitalism and champion the benefits and superiority of socialism. The focus of this chapter in particular is mostly on the benefits and implementation of socialism.

The New Sensibility

A key word in this chapter is the word “sensibility.” From what I can determine it seems that the word “sensibility” in the title relates to worldview or perhaps world order. Therefore, in this chapter, Marcuse is attempting to explain the new worldview or values of individuals who have been liberated from capitalism.

Within this chapter, Marcuse talks about a world in which injustice and misery have been abolished and there is a controlled economy in place. By controlling the economy, people are free from the evils of capitalism. The evils of capitalism appear to be hard work and consumerism, as these are concepts Marcuse seems to criticize and complain about.

Marcuse also tries to explain what a liberated consciousness is. A person with a liberated consciousness has been awakened to the evils of capitalism and understands the natural state of man, which is that of a socialist being. The way Marcuse describes this is similar to Plato’s Allegory of the Cave, in which someone realizes that the way they see the world is a shadow of the actual reality, with the chains representing capitalism. I cannot confirm this, but Marcuse’s concept of the liberated consciousness may have inspired Freire’s critical consciousness, which sounds similar and is focused on realizing the oppression that is found in the pedagogical process.

Marcuse goes on to share how praxis is key. An appropriate definition of praxis would be social action, which generally involves protesting and other forms of destabilizing the existing society. In other words, it is not enough to be awakened, as one must push for the manifestation of this awakening in the real world. Freire also speaks of social action and unrest in his work. Socialism is not content to exist alongside other worldviews; it wants to overtake the world and bring about a utopia that has never existed in recorded human history.

Another aspect of this chapter was Marcuse’s exploration of how art shapes reality. Art can be used to influence and shape reality through its ability to express what is ideal. By reshaping reality through art, society can be changed for the better as well. Marcuse only briefly touches on this idea in this essay, but he explores it in greater detail in his other works.

Conclusion

Marcuse lays out his claims for the need for socialism and how people would act if they were awakened to their true nature. The main failure of Marcuse’s argument is its theoretical nature. In practice, socialism and communism have produced systems that lack the benefits and resources they claim to provide.

Essay on Liberation-Biological Foundation

Herbert Marcuse wrote a famous essay in the 1960s entitled “Essay on Liberation.” The writing is somewhat difficult and convoluted, which means interpretation can be challenging. However, the main thesis of Marcuse’s essay appears to be that the productivity of capitalism is inhibiting the rise of the socialist revolution. He develops this thesis by discussing how a man can take care of himself without being dependent on the capitalist system and by asserting there can be no freedom from labor in the current capitalist system.

In this post, we will attempt to summarize this essay succinctly. In particular, we will focus only on chapter one of this essay, entitled “Biological Foundation of Socialism.”

Biological Foundation for Socialism

The first part of Marcuse’s essay addresses the biological foundation for socialism. From what I can assess the term “biological” means the innate need or basis for socialism. In other words, Marcuse builds a case for socialism as a natural state of man in the first part of his essay.

Marcuse lays out two problems with capitalism, which are the increase in production and the exploitation of products. For Marcuse, capitalist societies overproduce but at the same time do not provide enough for the people trapped in this oppressive system. For people to be free they must break their dependence on this market system with its focus on consumption. However, Marcuse later goes on to prescribe a controlled market as the alternative which has its problems of efficiency as demonstrated by other communist states such as the Soviet Union.

Marcuse also shares that capitalism is transformative. By transformative, Marcuse is probably referring to how capitalism changes the nature, character, and/or values of the individual. The accusation of the transformative nature of capitalism may also be why Marxists in general speak of transformation. However, when Marxists speak of transformation, they believe it relates to awakening man to his true socialist nature rather than the capitalist lie. For Marcuse, the change of an individual brought about by capitalism causes exploitation as the individual buys into an oppressive system. Anyone familiar with the term “rat race” may have sympathy with Marcuse’s views.

Marcuse desires to free man from this exploitative system. This gives the impression that people should not have to do anything they don’t want to do. The problem is that many communist and socialist countries still have exploitive systems that force people to do things after the revolution. In other words, there is no system in which man is truly free. Everyone has to spend time doing things they do not want. The only difference is who is your master and what are the benefits of serving him.

Marcuse then goes on to explain why the Marxist revolution has not taken place. He claims that poverty doesn’t bring revolution, as Marx argued. With the success of capitalism, the proletariat was beginning to move into the middle class. The problem with the economic success of the middle class is that they hate the idea of revolution. This disdain for revolution is because of the middle class’s investment in the current system. In other words, capitalism blunts the desire for true freedom because it bribes individuals with economic gain.

Marcuse’s solution to the middle class’s stabilization was to focus on the radicalization of the super poor and blacks. In later parts of his essay, he adds students to this potential pool of revolutionaries. By shifting the focus away from the traditional proletariat, who are essentially sell-outs, to other oppressed groups, the revolution can continue.

The impact of this statement is felt today. Now, we have a plethora of groups who are crying out about the oppression of capitalism and other norms of society such as sexuality, health, race, etc. The idea of radicalizing various ethnic, sexual, and other minorities for the sake of revolution may have started with the ideas of Marcuse in the 1960s.

Conclusion

Marcuse lays out several key terms of his essay in this first chapter. Establishing this foundation is key as we will see how the rest of the essay is a variation of the ideas presented here.

Creating Multiple Plots Using Bokeh in Python

In this post, we will look at how to make multiple plots at once using Bokeh in Python. This technique can be a powerful tool when you need to create several visualizations rapidly.

Needed Initial Libraries & Data Preparation

Below are the initial libraries we need to begin this example and the data preparation.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show
import pandas as pd

df=data("Duncan")

The first line is the code for the data we will use. It loads the data() function from pydataset. Next, we load the figure function from bokeh which will allow us to create our plots. After this we load the output_file() and show() functions which will allow us to display our plots. Lastly, we created our object df which holds our data from the Duncan dataset which has job types, prestige, income, and education as variables.

Multiple Scatterplots

Below is an example of displaying multiple scatterplots. The code and visualization are below followed by an explanation.

# SCATTER PLOT
from bokeh.layouts import column

wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]

fig_one = figure(x_axis_label="Education",y_axis_label="Prestige")
fig_two = figure(x_axis_label="Education",y_axis_label="Prestige")
fig_one.circle(x="education", y="prestige",source=wc,color="blue", legend_label="wc")
fig_two.circle(x="education", y="prestige",source=prof,color="red", legend_label="prof")

output_file(filename="column_plots.html")
show(column(fig_one, fig_two))

You can see the plots are stacked into a single column. The actual setup for this is simple.

We loaded the column() function which allows us to display visualizations in columns
We subsetted the data so that wc workers are in one object and prof are in the other object
We created two figures (fig_one, fig_two), one for each of the two subsets. The figures are identical and both will contain education and prestige as the variables
We then added the data to both figures distinguishing the plots by having different colored dots
We created a name for the output
Inside the show() function we used the column() function to display the visualization in columns

Most of this code is review; the only new thing is the use of the column() function within the show() function.
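
The stacking step can also be tried without the Duncan dataset. The sketch below is only an illustration: the numbers are made-up toy values, make_scatter() is a hypothetical helper, and scatter() is used here in place of circle().

```python
from bokeh.layouts import column
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical toy values standing in for the wc and prof subsets
wc = ColumnDataSource(data={"education": [30, 45, 60], "prestige": [20, 35, 50]})
prof = ColumnDataSource(data={"education": [80, 90, 95], "prestige": [70, 85, 90]})


def make_scatter(source, color, label):
    """Build one education-vs-prestige scatterplot (hypothetical helper)."""
    fig = figure(x_axis_label="Education", y_axis_label="Prestige")
    fig.scatter(x="education", y="prestige", source=source,
                color=color, legend_label=label)
    return fig


# column() stacks its arguments vertically, in the order given
layout = column(make_scatter(wc, "blue", "wc"), make_scatter(prof, "red", "prof"))

# show(layout)  # uncomment to render the stacked plots in a browser
```

The layout object keeps the figures in layout.children, so the order passed to column() is the order in which the plots appear from top to bottom.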

Multiple Bar Plots

In this example, we use bar plots and rows instead of columns. The code is followed by the visualization and the explanation.

# BAR PLOT
from bokeh.layouts import row

income=pd.DataFrame(df.groupby('type')['income'].mean())
prestige=pd.DataFrame(df.groupby('type')['prestige'].mean())

types = ["prof", "wc", "bc"]
income_type = figure(x_axis_label="type", y_axis_label="income",
                     x_range=types)
prestige_type = figure(x_axis_label="type", y_axis_label="prestige",
                       x_range=types)

# Add bar glyphs
income_type.vbar(x="type", top="income", source=income)
prestige_type.vbar(x="type", top="prestige", source=prestige)

# Generate HTML file and display the subplots
output_file(filename="my_first_column.html")
show(row(income_type, prestige_type))

Here is what we did

We loaded the row() function, which allows us to arrange the plots side by side in a row
We calculated the group means for each job type
We created a list called types which included the three job types in our dataset
Next, we made our two figures. One for income and the other for prestige
After this, we added the data to the plots
Lastly, we created the output and showed the visualizations this time using rows
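
One detail worth seeing in isolation is how the group means end up with a "type" column that vbar() can reference. The sketch below uses made-up toy numbers (not the Duncan data) and calls reset_index() so the column is explicit; the bar width of 0.8 is an arbitrary choice.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical toy data, not the Duncan dataset
df = pd.DataFrame({
    "type": ["prof", "prof", "wc", "wc", "bc", "bc"],
    "income": [60, 80, 40, 50, 20, 30],
})

# groupby() puts the job type in the index; reset_index() turns it
# back into an ordinary column named "type"
income = df.groupby("type")["income"].mean().reset_index()
source = ColumnDataSource(income)

# A list of strings as x_range makes the x-axis categorical
types = ["prof", "wc", "bc"]
fig = figure(x_axis_label="type", y_axis_label="income", x_range=types)
fig.vbar(x="type", top="income", width=0.8, source=source)

# show(fig)  # uncomment to render the bar plot in a browser
```

The order of the bars follows the types list given to x_range, not the alphabetical order of the grouped data.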

Gridplots

The code below takes a different approach using grid plots. This allows you to set columns and rows for your multiple plots. Below is the code, output, and explanation of this.

from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.models import NumeralTickFormatter

plots = []
#df['type'] = df.type.astype('category')
# Loop over the job types, creating one plot per type
for job_type in ["bc","wc"]:
    # Filter into a new object so df stays intact for the next iteration
    subset = df.loc[df["type"] == job_type]
    source = ColumnDataSource(data=subset)
    fig = figure(x_axis_label="education", y_axis_label="income")
    fig.circle(x="education", y="income", source=source, legend_label=job_type)
    fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
    plots.append(fig)

# Display plot
output_file(filename="gridplot.html")
show(gridplot(plots, ncols=2))

We began by loading the gridplot() function along with the ColumnDataSource and NumeralTickFormatter classes. gridplot() makes the grid, ColumnDataSource creates a data type native to bokeh, and NumeralTickFormatter allows us to format the numbers on the axes.
We created an empty list called plots that we will use in our for-loop
We used a for loop to generate the plots. The plots graph education vs income. The NumeralTickFormatter allowed us to display dollar signs on the y-axis
We then displayed the plot
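
When generating plots in a loop, it helps to filter into a new name on each pass so the full dataset is still available for the next group. The sketch below illustrates this with made-up toy values rather than the Duncan data; the sources list exists only so the filtered data can be inspected afterwards.

```python
import pandas as pd
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.plotting import figure

# Hypothetical toy data, not the Duncan dataset
df = pd.DataFrame({
    "type": ["bc", "bc", "wc", "wc", "prof"],
    "education": [20, 30, 50, 60, 90],
    "income": [15, 25, 40, 55, 80],
})

plots = []
sources = []  # kept only so the filtered data can be inspected later
for job_type in ["bc", "wc"]:
    # Filter into a new variable; df itself is never overwritten
    subset = df.loc[df["type"] == job_type]
    source = ColumnDataSource(subset)
    fig = figure(x_axis_label="education", y_axis_label="income")
    fig.scatter(x="education", y="income", source=source, legend_label=job_type)
    fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
    plots.append(fig)
    sources.append(source)

# A flat list plus ncols lets gridplot() wrap the plots into rows
grid = gridplot(plots, ncols=2)

# show(grid)  # uncomment to render the grid in a browser
```

With ncols=2 and two plots, the result is a single row; adding a third job type to the loop would start a second row.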

Conclusion

This post provided examples of how to make multiple plots with bokeh. The column(), row(), and gridplot() layouts can be combined in many different ways to suit your data purposes.

Bokeh Display Customization in Python

In this post, we will examine how to modify the default display of a plot in Bokeh, a library for interactive data visualizations in Python. Below are the initial libraries that we need.

from pydataset import data
from bokeh.plotting import figure
from bokeh.io import output_file, show

The first line of code is where our data comes from. We are using the data() function from pydataset for loading our data. The next two lines are for making the plot’s figure (x and y axes) and for the output file.

Data Preparation

There is no data preparation beyond loading the dataset using the data() function. We pick the dataset “Duncan” and load it into an object called “df.” The code is below, followed by a brief view of the actual data using the .head() method.

df=data('Duncan')
df.head()

This dataset includes various occupations measured in four ways: job type, income, education, and prestige.

Default Graph’s Appearance

Before we modify the appearance of the plot, it is important to know what the default appearance of the plot is for comparison purposes. Below is the code for a simple plot followed by the actual output and then lastly an explanation.

# Create a new figure
fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add circle glyphs
fig.circle(x=df["education"], y=df["income"])

# Call function to produce html file and display plot
output_file(filename="my_first_plot.html")
show(fig)

The first line of code sets up the fig or figure. We use the figure() function to label the axes which are education and income. The second line of code creates the actual data points in the figure using the .circle() method. The last two lines create the output and display it.

So the figure above is the default appearance of a graph. Below we will look at several modifications.

Modification 1

In the code below, we are making the following changes to the plot.

Identifying data points by job type using color
Change the background color to black

Below is the code followed by the output and the explanation

# Import curdoc
from bokeh.io import curdoc

prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]

# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add prof circle glyphs
fig.circle(x=prof["education"], y=prof["income"], color="yellow", legend_label="prof",size=10)

# Add bc circle glyphs
fig.circle(x=bc["education"], y=bc["income"], color="red", legend_label="bc",size=10)

output_file(filename="prof_vs_bc.html")
show(fig)

Here is what happened,

We import curdoc() from bokeh.io, which allows us to modify the appearance of the plot
Next, we do some data preparation. Separating the data for types that are “prof” and those that are “bc” into separate objects.
We change the theme of the plot to contrast using curdoc().theme
We also created the figure as done previously
We use the .circle() method twice. Once to set the “prof” data points on the plot and a second time to place the “bc” data points on the plot. We also make the data points larger by setting the size and using different colors for each job type.
The last two lines of code are for creating the output and displaying it.
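
"contrast" is only one of the themes that ship with Bokeh. As a small sketch, the names of the bundled themes can be listed from bokeh.themes before picking one; the exact set of names may vary between Bokeh versions.

```python
from bokeh.io import curdoc
from bokeh.themes import built_in_themes

# Names of the themes bundled with the installed Bokeh version
available = sorted(built_in_themes)
print(available)

# Apply one of them by name; plots shown afterwards pick up this theme
curdoc().theme = "contrast"
```

Because the theme is set on the current document, it stays in effect for later plots until it is changed again.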

You can see the difference between this second plot and the first one. This also shows the flexibility that is inherent in the use of Bokeh. Below we add one more variation to the display.

Modified Graph’s Appearance

The plot below is mostly the same except for the following

We add a third job type “wc”
We modify the shapes of the data points

Below is the code followed by the graph and the explanation

# Create figure
wc = df.loc[df["type"] == "wc"]
prof = df.loc[df["type"] == "prof"]
bc = df.loc[df["type"] == "bc"]

fig = figure(x_axis_label="Education", y_axis_label="Income")

# Add circle glyphs for wc
fig.circle(x=wc["education"], y=wc["income"], legend_label="wc", color="purple",size=10)

# Add square glyphs for prof
fig.square(x=prof["education"], y=prof["income"], legend_label="prof", color="red",size=10)

# Add triangle glyphs for bc
fig.triangle(x=bc["education"], y=bc["income"], legend_label="bc", color="green",size=10)

output_file(filename="education_vs_income_by_type.html")
show(fig)

The code is almost all the same. The main difference is there are now three job types and each type has a different shape for their data points. The shapes are determined by using either .circle(), .triangle(), or .square() methods.
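
An alternative to calling a separate method per shape is the general scatter() method with its marker argument, which may be preferable on recent Bokeh releases where the shape-specific methods are reported as deprecated. The sketch below is an assumption-laden illustration: the values are made-up toy numbers and the styles dictionary is a hypothetical convenience, not part of Bokeh.

```python
from bokeh.plotting import figure

# Hypothetical toy values for each job type, not the Duncan dataset
groups = {
    "wc": {"education": [50, 60], "income": [40, 55]},
    "prof": {"education": [85, 95], "income": [70, 80]},
    "bc": {"education": [20, 30], "income": [15, 25]},
}

# Hypothetical lookup of (marker, color) per job type
styles = {
    "wc": ("circle", "purple"),
    "prof": ("square", "red"),
    "bc": ("triangle", "green"),
}

fig = figure(x_axis_label="Education", y_axis_label="Income")
for job_type, (marker, color) in styles.items():
    values = groups[job_type]
    # One scatter() call per job type; only marker and color differ
    fig.scatter(x=values["education"], y=values["income"], marker=marker,
                color=color, size=10, legend_label=job_type)

# show(fig)  # uncomment to render the plot in a browser
```

Keeping the shapes and colors in one dictionary means adding a fourth job type is a one-line change.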

Conclusion

There are many more ways to modify the appearance of visualization in bokeh. The goal here was to provide some basic examples that may lead to additional exploration.

Bokeh-Scatter Plot basics in Python VIDEO

In the video below we will look at making scatterplots using Bokeh. Bokeh is a Python library that makes interactive visualizations.
