# Data Quality

Bad data leads to bad decisions. However, the question is how can you know if your data is bad. One answer to this question is the use of data quality metrics. In this post, we will look at a definition of data quality as well as metrics of data quality

Definition

Data quality is a measure of the degree that data is appropriate for its intended purpose. In other words, it is the context in which the data is used that determines if it is of high quality. For example, knowing email addresses may be appropriate in one instance but inappropriate in another instance.

When data is determined to be of high quality it helps to encourage trust in the data. Developing this trust is critical for decision-makers to have confidence in the actions they choose to take based on the data that they have. Therefore data quality is of critical importance for an organization and below are several measures of data quality.

Measuring Data Quality

Completeness is a measure of the degree to which expected columns (variables) and rows (observations) are present. There are times when data can be incomplete due to missing data and or missing variables. There can also be data that is partially completed which means that data is present in some columns but not others. There are various tools for finding this type of missing data in whatever language you are using.

Validity is a measure of how appropriate the data is in comparison to what the data is supposed to represent. For example, if there is a column in a dataset that measures the class level of high school students using Freshman, Sophmore, Junior, and Senior. Data would e invalid if it use the numerical values for the grade levels such as 9, 10, 11, and 12. This is only invalid because of the context and the assumptions that are brought to the data quality test.

Uniqueness is a measure of duplicate values. Normally, duplicate values happen along rows in structured data which indicates that the same observation appears twice or more. However, it is possible to have duplicate columns or variables in a dataset. Having duplicate variables can cause confusion and erroneous conclusions in statistical models such as regression.

Consistency is a measure of whether data is the same across all instances. For example, there are times when a dataset is refreshed overnight or whenever. The expectation is that the data should be mostly the same except for the new values. A consistency check would assess this. There are also times when thresholds are put in place such that the data can be a little different based on the parameters that are set.

Timeliness is the availability of the data. For example, if data is supposed to be ready by midnight any data that comes after this time fails the timeliness criteria. Data has to be ready when it is supposed to be. This is critical for real-time applications in which people or applications are waiting for data.

Accuracy is the correctness of the data. The main challenge of this is that there is an assumption that the ground truth is known to make the comparison. If a ground truth is available the data is compared to the truth to determine the accuracy.

Conclusion

The metrics shared here are for helping the analyst to determine the quality of their data. For each of these metrics, there are practical ways to assess them using a variety of tools. With this knowledge, you can be sure of the quality of your data.