One aspect of the data governance experience is data profiling. In this post we will look at what a data profile is, an example of a simple data profile, and the development of rules that are related to the data profile.
Definition
Data profiling is the process of running descriptive statistics on a dataset to develop insights about the data and field dependencies. Some questions there are commonly asked when performing a data profile includes.
- How many observations are in the data set?
- What are the min and max values of a column(s)?
- How many observations have a particular column populated with a value (missing vs non-missing data)?
- When one column is populated what other columns are populated?
Data profiling helps you to confirm what you know and do not know about your data. This knowledge will help you to determine issues with your data quality and to develop rules to assess data quality.
Student Records Table
StudentID | StudentFirstName | StudentLastName | StudentBirthDate | StudentClassLevel |
---|---|---|---|---|
1001 | Maria | Smith | 04/04/2000 | Senior |
1002 | Chang | 09/12/2004 | Junior | |
1003 | Francisco | Brown | Junior | |
1004 | Matthew | Peter | 01/01/2005 | Freshman |
1005 | Martin | 02/05/2002 | Sophmore |
The first column from the left is the student id. Looking at this column we can see that there are five records with data. That this column is numeric with 4 characters. The minimum value is 1001 and the max value is 1005.
The next two columns are first name and last name. Both of these columns are string text with a min character length of 5 and a max length of 7 for first name and 5 for last name. For both columns, 80% of the records are populated with a value. In addition, 60% of the records have a first name and a last name.
The fourth column is the birthdate. This column has populated records 80% of the time and all rows follow a MM/DD/YYYY format. The minimum value is 04/04/2000 and the max value is 01/01/2005. 40% of the rows have a first name, last name, and birthdate.
Lastly, 100% of the class-level column is populated with values. 20% of the values are senior, 40% are junior, 20% are sophomore, and 20% are freshman.
Developing Data Quality Rules
From the insights derived from the data profile, we can now develop some rules to ensure quality. With any analysis or insight the actual rules will vary from place to place based on needs and context but below are some examples for demonstration purposes.
- All StudentID values must be 4 numeric characters
- The Student ID values must be populated
- All StudentFirstName values must be 1-10 characters in length
- All StudentLastName values must be 1-10 characters in length
- All StudentBirhdate values must be in MM/DD/YYYY format
- All StudentClassLevel values must be Freshman, Sophomore,, Junior, or Senior
Conclusion
A data profile can be much more in-depth than the example presented here. However, if you have hundreds of tables and dozens of databases this can be quite a labor-intensive experience. There is software available to help with this but a discussion of that will have to wait for the future.