De-Identification of Data

De-identification, as the name suggests, means removing identifying information from data. The purpose is to protect the people the data came from. These people can be customers, employees, or any other group about whom data has been collected. De-identification can also be performed for compliance reasons or as a security measure.

People who are responsible for privacy or data governance in their organization need to be familiar with ways to de-identify data. Therefore, in this post, we will look at two commonly used techniques for removing identification from data. These two methods are:

  • Pseudonymization
  • Anonymization

Pseudonymization

A pseudonym is a false name. Therefore, in the context of data, pseudonymization is the process of giving false names to data that could help identify somebody. It is similar to having a secret identity in the superhero world. For example, Peter Parker and Spider-Man are the same person, but most people do not know this because of the false name.

Practical ways to pseudonymize data include replacing text values, such as names, with numbers; removing information such as date of birth; and keeping only part of a value in a column, such as the last four digits of a person’s social security number.

One advantage, or perhaps disadvantage, of pseudonymization is that the data can be returned to its original state. This is possible because whoever altered the data applied the same rules to every change they made. The downside is that if someone else can determine how the data was altered, they can recover the original data, which could then be used to identify someone.
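The ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `dob`, `ssn`, `purchase`), the sample records, and the `pseudonymize` function are all made up for this example, and the social security numbers are deliberately invalid placeholders. The sketch replaces each name with a number, drops the date of birth, and keeps only the last four digits of the social security number.

```python
# Hypothetical sample records; the field names and values are invented for illustration.
records = [
    {"name": "Peter Parker", "dob": "2001-08-10", "ssn": "000-00-1234", "purchase": "camera"},
    {"name": "Mary Jane", "dob": "2001-06-15", "ssn": "000-00-5678", "purchase": "tickets"},
    {"name": "Peter Parker", "dob": "2001-08-10", "ssn": "000-00-1234", "purchase": "film"},
]


def pseudonymize(rows):
    """Return (pseudonymized rows, lookup table mapping person_id -> name)."""
    name_to_id = {}
    output = []
    for row in rows:
        name = row["name"]
        # The same rule is applied every time: a given name always maps to the same number.
        if name not in name_to_id:
            name_to_id[name] = len(name_to_id) + 1
        output.append({
            "person_id": name_to_id[name],  # text replaced with a number
            "ssn_last4": row["ssn"][-4:],   # keep only the last four digits
            "purchase": row["purchase"],    # non-identifying data is kept; dob is dropped
        })
    # Whoever holds this table can turn the numbers back into names.
    lookup = {person_id: name for name, person_id in name_to_id.items()}
    return output, lookup


pseudonymized, lookup = pseudonymize(records)
for row in pseudonymized:
    print(row)
```

In this sketch, the `lookup` table is what makes the name substitution reversible, which is why it would need to be protected at least as carefully as the original data.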

Anonymization

Anonymous means no name. Therefore, anonymization is the process of removing all personally identifying information from a dataset. Once this is done, the process is not reversible, and thus there is no way to determine the identity of the people in the dataset.

An example of anonymization would be to completely remove the names of the people in a dataset, along with other information such as dates of birth and phone numbers. Anonymization provides heightened protection, but at the cost that even the people who anonymized the data can no longer tell who the original people are. Whether this is good or bad depends on the context in which the data will be used.
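As a rough sketch of the example above, the Python snippet below deletes the identifying fields outright and keeps no lookup table. The field names, the sample records, and the `anonymize` function are assumptions made up for this illustration. Which fields count as identifying depends on the dataset, and in practice the remaining fields may also need review if they could identify someone when combined.

```python
# Hypothetical sample records; the field names and values are invented for illustration.
records = [
    {"name": "Peter Parker", "dob": "2001-08-10", "phone": "555-0100", "purchase": "camera"},
    {"name": "Mary Jane", "dob": "2001-06-15", "phone": "555-0101", "purchase": "tickets"},
]

# Fields treated as identifying in this sketch (an assumption, not a standard list).
IDENTIFYING_FIELDS = {"name", "dob", "phone"}


def anonymize(rows, identifying_fields=IDENTIFYING_FIELDS):
    """Return copies of the rows with every identifying field removed."""
    # No mapping back to the original values is kept, so this step cannot be undone.
    return [
        {key: value for key, value in row.items() if key not in identifying_fields}
        for row in rows
    ]


anonymized = anonymize(records)
for row in anonymized:
    print(row)
```

Unlike the pseudonymization case, nothing is stored that links a row back to a person, so the names cannot be restored from the output.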

There are industry-specific ways of achieving either pseudonymization or anonymization. Examples include the fields of health care and education. However, at the macro level, all industries are using some combination of pseudonymization and anonymization.

Conclusion

Data privacy is a major concern in the world today. That concern also needs to be balanced with the need to analyze data for insights. For this reason, many have turned to various ways of de-identifying data in order to balance the competing demands of privacy and analysis.