Data quality rules protect data from errors. In this post, we will look at the different types of data quality rules and at some of the tools used to implement them.
Detective
Detective rules monitor data after it has already moved through a pipeline and is being used by the organization. They are generally used when the issue being detected is not causing a major problem, when it cannot be solved quickly, and when only a limited number of records are affected.
Of course, all of the criteria listed above are relative. In other words, it is up to the organization to determine what thresholds are needed for a data quality rule to be considered a detective rule.
An example of a detective data quality rule would be one that flags a student information table that is missing a student’s uniform size. Such information is useful but probably not important enough to stop the data from moving on to others for use.
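The uniform size example can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names, and the function name `detect_missing_uniform_size` are assumptions made for the sketch, not part of any particular tool. The point is that a detective rule reports the problem while the data keeps moving.

```python
# Hypothetical student records; the field names are assumed for illustration.
students = [
    {"student_id": "S001", "name": "Ana", "uniform_size": "M"},
    {"student_id": "S002", "name": "Ben", "uniform_size": None},
    {"student_id": "S003", "name": "Chen", "uniform_size": "L"},
]


def detect_missing_uniform_size(records):
    """Detective rule: list the IDs of records with no uniform size.

    The rule only reports. It does not remove or block any record.
    """
    return [r["student_id"] for r in records if not r.get("uniform_size")]


flagged = detect_missing_uniform_size(students)

# Every record is still passed on; the flagged IDs are kept for follow-up.
passed_on = students
print(f"{len(flagged)} record(s) missing uniform size: {flagged}")
```

In practice the flagged IDs would probably go to a report or a ticket queue rather than to the console.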
Preventive
Preventive data quality rules stop data in the pipeline when issues are found. Preventive rules are used when the data is too important to allow errors, when the problem is easy to fix, or when the issue is affecting a large number of records. Again, all of these criteria are relative to the organization.
An example of a violation of a preventive data quality rule would be a student records table that is missing student ID numbers. Generally, such information is needed to identify students and to make joins between tables. Therefore, such a problem would need to be fixed immediately.
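A preventive rule for the student ID example might look like the sketch below. The exception name `DataQualityError`, the function `enforce_student_id`, and the record layout are all hypothetical. What matters is that the load stops as soon as the rule is violated, instead of only reporting the problem.

```python
class DataQualityError(Exception):
    """Raised when a preventive rule is violated (hypothetical name)."""


def enforce_student_id(records):
    """Preventive rule: refuse to pass data on if any student ID is missing."""
    missing_rows = [i for i, r in enumerate(records) if not r.get("student_id")]
    if missing_rows:
        raise DataQualityError(
            f"{len(missing_rows)} record(s) missing student_id at rows {missing_rows}"
        )
    return records


# Hypothetical incoming batch; the second record has no student ID.
incoming = [
    {"student_id": "S001", "name": "Ana"},
    {"student_id": None, "name": "Ben"},
]

try:
    loaded = enforce_student_id(incoming)
except DataQualityError as err:
    loaded = []  # nothing moves downstream until the problem is fixed
    print(f"Load stopped: {err}")
```

Raising an exception is one simple way to halt a pipeline step. An orchestration tool may offer its own mechanism for failing a task.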
Thresholds & Anomaly Detection
There are several tools for implementing detective and preventive data quality rules. Among the choices are the setting of thresholds and the use of anomaly detection.
Thresholds are actions that are triggered after a certain number of errors have occurred. It is entirely up to the organization to determine how to set up its thresholds. Common levels include no action, warning, alert, and prevention. Each level needs a minimum number of errors that must occur before the information is passed on to the user or IT.
To make things more complicated, you can tie threshold levels to detective and preventive rules. For example, if a dataset has 5% missing data, the system might only flag it at the warning level. However, if the missing data jumps to 10%, it might now be a violation of a preventive rule because the violation has reached the prevention level.
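The link between threshold levels and rule types can be sketched as a small lookup. The 5% warning and 10% prevention cutoffs come from the example above. The 8% alert cutoff and the function names are assumptions added only to complete the illustration, and each organization would choose its own values.

```python
# Cutoffs are fractions of missing values, checked from strictest to mildest.
# 5% (warning) and 10% (prevention) follow the example in the text.
# The 8% alert cutoff is an assumed value for illustration only.
THRESHOLDS = [
    (0.10, "prevention"),
    (0.08, "alert"),
    (0.05, "warning"),
]


def missing_rate(values):
    """Return the fraction of values that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def threshold_level(rate):
    """Map a missing-data rate to the highest threshold level it reaches."""
    for cutoff, level in THRESHOLDS:
        if rate >= cutoff:
            return level
    return "no action"


def blocks_pipeline(level):
    """Only the prevention level acts as a preventive rule; the rest are detective."""
    return level == "prevention"


# Hypothetical column of 100 values with 5 missing.
column = ["M"] * 95 + [None] * 5
level = threshold_level(missing_rate(column))
print(level, blocks_pipeline(level))
```

With this layout, moving a rule from detective to preventive is only a matter of changing a cutoff, which keeps the decision in the organization's hands.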
Anomaly detection can be used to find outliers, so that unusual records can be flagged for review. For example, suppose a university has an active student who was born in 1920. Such a birthdate is highly unusual, and the rule should flag it as an outlier. After reviewing the record, IT can decide whether it needs to be edited. Again, anomaly detection can be used to detect or prevent data errors and can have thresholds set for it as well.
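The post does not name a specific anomaly detection method, so the sketch below picks one simple option: a modified z-score based on the median and the median absolute deviation (MAD). The records, the field names, the function name `flag_birth_year_outliers`, and the cutoff of 3.5 are assumptions for illustration.

```python
from statistics import median


def flag_birth_year_outliers(records, cutoff=3.5):
    """Flag records whose birth year is far from the median birth year.

    Uses a modified z-score: 0.6745 * |x - median| / MAD.
    The cutoff of 3.5 is an assumed default, not a fixed standard.
    """
    years = [r["birth_year"] for r in records]
    mid = median(years)
    mad = median([abs(y - mid) for y in years])
    if mad == 0:
        return []  # no spread in the data, so this method cannot score it
    return [
        r for r in records
        if 0.6745 * abs(r["birth_year"] - mid) / mad > cutoff
    ]


# Hypothetical active students; the last one was born in 1920.
active_students = [
    {"student_id": "S001", "birth_year": 1998},
    {"student_id": "S002", "birth_year": 1999},
    {"student_id": "S003", "birth_year": 2000},
    {"student_id": "S004", "birth_year": 2001},
    {"student_id": "S005", "birth_year": 2001},
    {"student_id": "S006", "birth_year": 2002},
    {"student_id": "S007", "birth_year": 2003},
    {"student_id": "S008", "birth_year": 1920},
]

outliers = flag_birth_year_outliers(active_students)
print([r["student_id"] for r in outliers])
```

The flagged record is only a candidate for review. Whether it is then edited, or whether the load is blocked, depends on whether the rule is set up as detective or preventive.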
Conclusion
Data quality rules can be developed to monitor the state of data within a system. Once the rules are developed, it is important to determine whether they are detective or preventive. The main reason is that the type of rule affects the urgency with which the problem needs to be addressed.