person in white long sleeve shirt holding credit card

Data Profile

One aspect of the data governance experience is data profiling. In this post we will look at what a data profile is, an example of a simple data profile, and the development of rules that are related to the data profile.

Definition

Data profiling is the process of running descriptive statistics on a dataset to develop insights about the data and field dependencies. Some questions there are commonly asked when performing a data profile includes.

  • How many observations are in the data set?
  •  What are the min and max values of a column(s)?
  •  How many observations have a particular column populated with a value (missing vs non-missing data)?
  •  When one column is populated what other columns are populated?

Data profiling helps you to confirm what you know and do not know about your data. This knowledge will help you to determine issues with your data quality and to develop rules to assess data quality.

Student Records Table

StudentIDStudentFirstNameStudentLastNameStudentBirthDateStudentClassLevel
1001MariaSmith04/04/2000Senior
1002Chang09/12/2004Junior
1003FranciscoBrownJunior
1004MatthewPeter01/01/2005Freshman
1005Martin02/05/2002Sophmore

The first column from the left is the student id. Looking at this column we can see that there are five records with data. That this column is numeric with 4 characters. The minimum value is 1001 and the max value is 1005.

The next two columns are first name and last name. Both of these columns are string text with a min character length of 5 and a max length of 7 for first name and 5 for last name. For both columns, 80% of the records are populated with a value. In addition, 60% of the records have a first name and a last name.

ad

The fourth column is the birthdate. This column has populated records 80% of the time and all rows follow a MM/DD/YYYY format. The minimum value is 04/04/2000 and the max value is 01/01/2005. 40% of the rows have a first name, last name, and birthdate.

Lastly, 100% of the class-level column is populated with values. 20% of the values are senior, 40% are junior, 20% are sophomore, and 20% are freshman.

Developing Data Quality Rules

From the insights derived from the data profile, we can now develop some rules to ensure quality. With any analysis or insight the actual rules will vary from place to place based on needs and context but below are some examples for demonstration purposes.

  • All StudentID values must be 4 numeric characters
  •  The Student ID values must be populated
  •  All StudentFirstName values must be 1-10 characters in length
  •  All StudentLastName values must be 1-10 characters in length
  •  All StudentBirhdate values must be in MM/DD/YYYY format
  •  All StudentClassLevel values must be Freshman, Sophomore,, Junior, or Senior

Conclusion

A data profile can be much more in-depth than the example presented here. However, if you have hundreds of tables and dozens of databases this can be quite a labor-intensive experience. There is software available to help with this but a discussion of that will have to wait for the future.

Leave a Reply