Monthly Archives: November 2025

Python for Data Privacy VIDEO

Data privacy and the protection of people’s identities are important. The video below provides some basic ways to ensure the privacy of individuals when working with data.

Privacy of Continuous Data with Python

There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.

Libraries

We will begin by loading the necessary libraries. Below is the code.

from pydataset import data
import pandas as pd

The library setup is simple. We import the data() function from pydataset, which lets us load the dataset used in this post, and pandas, which we will use to make a frequency table later on. Data preparation is covered next.

Data Preparation

The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.

df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.

View of Original Histogram

Below is the code for a histogram of the “age” variable. This visual serves as a “before” picture of the data prior to any changes.

df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.

Change to Categorical

Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.

df['age'] = df['age'].apply(lambda x: ">=40" if x >= 40 else "<40")
df.head()

The code overwrites the “age” variable using an anonymous function. The .apply() method runs the function on each value, replacing ages of 40 or more with “>=40” and ages below 40 with “<40”. The data is now split into two groups: those aged 40 or older and those younger than 40. Below is a frequency table of the transformed “age” variable.

df['age'].value_counts()
age
>=40    3984
<40     3441
Name: count, dtype: int64

The .value_counts() method comes from the pandas library. There are now only two groups, a major change from the original histogram. Below is the code for a bar graph of the transformed variable.

import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You are not limited to two groups; the right number depends on the context and on why you are applying the technique.
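As a rough sketch of what more than two groups could look like, the pandas cut() function can bin a numeric column into several labeled ranges. The ages, bin edges, and labels below are made up for illustration and are not taken from the SLID data.

```python
import pandas as pd

# Hypothetical ages standing in for the "age" column of SLID
ages = pd.Series([18, 25, 39, 40, 57, 64, 65, 82], name="age")

# Bin edges and labels are illustrative choices, not recommendations
bins = [0, 25, 40, 65, float("inf")]
labels = ["<25", "25-39", "40-64", ">=65"]

# right=False makes each interval include its left edge: [0, 25), [25, 40), ...
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# Frequency table in bin order rather than sorted by count
print(age_groups.value_counts(sort=False))
```

Wider bins hide more detail about each person, while narrower bins keep more analytical value, so the edges are a judgment call for each dataset.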

Top Coding

Top coding is a trick used to bring extremely high values down to a specific value. Again, the purpose of modifying these values in our context is to protect people’s privacy. Below is the code and output for this approach.

df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.

  1. We load the “SLID” dataset again so that we can modify it again from its original state.
  2. We then use the .loc indexer to change all values in “age” above 75 to 75.
  3. Lastly, we create our histogram for comparison with the original data.

If you look to the far right, you can see a spike in the number of data points at age 75 compared to the original histogram. This is a result of our manipulation of the data. In this way, we keep every record for other forms of analysis while protecting the privacy of the handful of people who are over the age of 75.
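Before overwriting values, it can help to check how many records a cap would touch. The sketch below repeats the same .loc assignment on a small made-up DataFrame (the ages are hypothetical, not from SLID) and counts the affected rows first.

```python
import pandas as pd

# Hypothetical ages standing in for the "age" column of SLID
people = pd.DataFrame({"age": [22, 35, 51, 68, 75, 79, 88]})

cap = 75  # illustrative threshold, matching the one used above

# Count the affected rows before overwriting them
n_capped = int((people["age"] > cap).sum())

# Same .loc assignment as in the post: values above the cap become the cap
people.loc[people["age"] > cap, "age"] = cap

print(f"{n_capped} of {len(people)} records were top coded")
print(people["age"].tolist())
```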

Bottom Coding

Bottom coding is the same as top coding except now you raise values below a threshold to a minimum value. Below is the code and output for this.

df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before, with the only differences being the less-than symbol “<” and the threshold of 20. As you compare this histogram to the original, you can see a huge spike in the number of values at 20.
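Top and bottom coding can also be applied in one step. The sketch below uses the pandas clip() method as an alternative to the .loc assignments shown in this post. The ages are made up, and the thresholds of 20 and 75 simply mirror the ones used above.

```python
import pandas as pd

# Hypothetical ages standing in for the "age" column of SLID
ages = pd.Series([16, 19, 20, 44, 75, 81], name="age")

# clip() raises values below `lower` and lowers values above `upper`
coded = ages.clip(lower=20, upper=75)

print(coded.tolist())
```

Unlike the .loc assignment, clip() returns a new Series and leaves the original untouched, which makes it easy to compare the two.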

Conclusion

Data protection is an important aspect of the analyst’s role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python.

Thoughts on The State and Revolution by Lenin

The State and Revolution was written by Lenin in 1917. This text provides Lenin’s thoughts on the role of communism in the context of leading the proletarian revolution and the shape of the government afterward. The text is rather repetitive and rambling. Therefore, instead of providing a summary, which would be rather difficult, it was decided to briefly describe some of the text’s main points. These main points are…

  • The purpose of the state
  • The purpose of the revolution
  • The stages after the revolution

None of the ideas above are in one specific place within the text. Instead, they are scattered throughout and shared repeatedly, making the text difficult to understand.

Purpose of the State

Lenin states that the state exists solely because of class antagonism. In other words, the government referees the battle between the bourgeoisie and the proletariat. This makes sense as you cannot have property or capital unless there is someone to protect said property. A society without government would not have anything whether communist or capitalist. The capitalists need the government to protect their capital while the proletariat seeks justice from the same government.

Lenin also shares that the ruling class uses the state to oppress the poor. Again, it is hard to refute this as corporate America has teamed up with the government before. However, Lenin leaves out how governments have responded to the cries of the poor in the past. For example, during Lenin’s life, the Russian Czar attempted reforms before his downfall. Even before the French Revolution the King of France tried to compromise. As such, even in monarchies tone deafness is difficult to maintain fully.

Purpose of Revolution

Lenin then shares that the purpose of revolution is to overthrow the bourgeoisie so the proletarians can take power. Lenin believes that overthrowing the ruling class will solve most if not all of society’s problems.

The problem with this belief is that revolution leads to a new set of oppressors in most cases. The leadership changes but the wicked hearts of man remain the same. Lenin seems to think that the system is the problem (a sentiment shared today). In reality, it is the people who are the problem. All governments have issues and problems, but they also have one thing in common: people who form, lead, and destroy them.

Stages After the Revolution

Lenin also divides the stages after the revolution into three main parts. The first stage is the proletarian dictatorship. This dictatorship involves the proletarians using the apparatus of the conquered state to crush all of the remaining bourgeoisie. In other words, the tools of the enemy are used to destroy the enemy. This stage of the revolution has happened in many countries such as Cambodia, Cuba, North Korea, Russia, and Vietnam. The landholders and capitalists are rounded up and killed and the people seize their property. There is often a huge loss of life as the revolutionaries tend to kill indiscriminately in their zeal for change.

The second stage is socialism, which involves the government having control over the means of production. Notice how the government is still being used but instead of for slaughtering, the focus has shifted to control of the people. In addition, contrary to popular belief, traditional communism doesn’t want to control all property, just property for producing wealth. At this point, everyone only gets what they need instead of what they want, destroying all motivation and ambition to work hard. This is also the stage at which all communist governments stop. The government takes control and never gives up that control. This proves the point that communism swaps one corrupt leadership for another. The main difference between communism and capitalism is who has control, the individual or a monolithic government.

The final stage of the revolution is the withering of the state. Once everyone is thoroughly communist and social classes are destroyed there is no need for the state. No communist government has achieved this as the revolution’s leaders enjoy being in charge. The common counter to this observation is that nobody has successfully completed a communist revolution. Therefore, people must try harder to achieve this. It also must be mentioned that there is no view of utopia as Lenin shares that neither he nor Marx knows what that is like. As such, the revolution must continue forever.

Conclusion

This was not a summary of Lenin’s views in his book The State and Revolution. The goal was only to share some of the main points. This is probably one of Lenin’s best-known books and required reading for hardcore leftists. Even though no one has achieved true communism many are highly motivated to make this theory a reality.


Python for Data Privacy

Data privacy is a major topic among analysts who want to protect people’s information. There are often ethical expectations that personal identifying information is protected. Whenever data is shared, you want to be sure that individual people cannot be identified within a dataset, which can lead to unforeseen consequences. This post will examine simple ways a data analyst can protect personal information.

Libraries & Data Preparation

There are few libraries and minimal data preparation for this example. The code and output are below.

from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.

We will now move to the first way to protect privacy when working with data.

Drop Columns

Sometimes protecting people’s identity can be as easy as dropping a column. Often, the column(s) that contain the names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people. Therefore we will drop this column. Below is the code and the output for this.

# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")

# Explore obtained dataset
suppressed_language.head()

To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.
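When several columns identify people directly, they can be dropped together. The sketch below uses a small made-up DataFrame; the “name” and “phone” columns and their values are hypothetical and are not part of SLID.

```python
import pandas as pd

# Hypothetical records; "name" and "phone" are made-up direct identifiers
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cho"],
        "phone": ["555-0101", "555-0102", "555-0103"],
        "wages": [10.5, 11.0, 17.8],
        "age": [40, 19, 46],
    }
)

# List every identifying column, then drop them in one call
identifiers = ["name", "phone"]
suppressed = people.drop(columns=identifiers)

print(suppressed.columns.tolist())
```

Passing columns=... is equivalent to giving the labels with axis="columns". In both cases drop() returns a new DataFrame and leaves the original unchanged.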

Drop Rows

It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column it may be possible to identify them. In the code and output below, we drop all rows where education is greater than or equal to 14.

# Drop rows with education of 14 or higher
education = df.drop(df[df.education >= 14].index)

# See the resulting DataFrame
education.head()

In the code, we use the drop() method again, but this time we first subset the data to the rows with education values greater than or equal to 14. We then pass the .index of that subset to drop(), which tells it which row labels to remove. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or above.
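The same result can be reached with a boolean mask instead of drop(). The sketch below compares the two on a small made-up DataFrame (the values are hypothetical) and also shows one difference to watch for when a column contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical records, including one missing education value
people = pd.DataFrame(
    {
        "education": [15.0, 13.2, 16.0, 10.0, np.nan, 14.0],
        "age": [40, 19, 49, 46, 71, 50],
    },
    index=[1, 2, 3, 4, 5, 6],
)

# Approach from the post: drop the index labels of the matching rows
dropped = people.drop(people[people["education"] >= 14].index)

# Equivalent boolean mask: keep rows that do NOT match the condition
masked = people[~(people["education"] >= 14)]

# A plain "< 14" filter is not quite the same: comparisons with NaN are
# False, so the row with missing education is removed as well
strict = people[people["education"] < 14]

print(dropped.index.tolist())
print(strict.index.tolist())
```

Which behavior is preferable depends on whether rows with missing values should be kept or removed.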

Data Masking

Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.

# Uniformly mask the education column 
df['education'] = '****'

# See resulting DataFrame
df.head()

The code selects the education column and assigns the asterisk string to every row. This approach is similar to dropping the column. However, there may be a reason to keep the column even if it no longer holds useful information.
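The assignment above overwrites the column in place. If the original values are still needed elsewhere, one option is to build a masked copy with the assign() method, as in this sketch on a small made-up DataFrame.

```python
import pandas as pd

# Hypothetical records standing in for SLID
records = pd.DataFrame({"education": [15.0, 13.2, 16.0], "age": [40, 19, 49]})

# assign() returns a new DataFrame; the scalar string fills every row
masked = records.assign(education="****")

print(masked)
print(records["education"].tolist())  # original values are untouched
```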

Replace Part of String

Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.

#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )

#See Results
df.head()

The code involves rewriting the data in the “sex” column.

  1. We do this by using the apply() method on this column. Inside the apply() method we use an anonymous function, which is defined with the keyword “lambda”.
  2. After lambda, we name the argument “text” for practical reasons since we are modifying text.
  3. After the colon, we tell Python to keep the first character of the string with “text[0]”. Next, we append four asterisks **** after that first letter.
  4. Lastly, we use the find() method to locate the string “le” in “text” and keep everything from that position to the end of the string.

The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
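The lambda above relies on every value containing “le”. A slightly more general sketch is a small named helper that keeps a fixed number of characters at each end. The function name mask_middle and its parameters are hypothetical and are introduced here only for illustration.

```python
import pandas as pd


def mask_middle(text: str, keep_start: int = 1, keep_end: int = 2,
                fill: str = "****") -> str:
    """Keep the first and last few characters and hide the rest."""
    # Strings too short to hide anything are masked entirely
    if len(text) <= keep_start + keep_end:
        return fill
    return text[:keep_start] + fill + text[-keep_end:]


# Hypothetical values standing in for the "sex" column of SLID
sex = pd.Series(["Male", "Female", "Female"], name="sex")

# apply() calls the helper once for every row
masked_sex = sex.apply(mask_middle)

print(masked_sex.tolist())
```

For these values the output matches the lambda version. Note that the first letter still distinguishes the two categories, so this shows the mechanics of partial masking more than a strong protection.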

Conclusion

Protecting personal information is critical when working with data. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.