De-Identification of Data

Removing identifying information from data is a largely self-explanatory idea. The purpose of de-identification is to protect the people the data came from. These people can be customers, employees, or other groups about whom data has been collected. De-identification can also be performed for compliance reasons or as a security measure.

People who are responsible for privacy or data governance in their organization need to be familiar with ways to de-identify data. Therefore, in this post, we will look at two commonly used techniques for removing identification from data. These two methods are:

Pseudonymization
Anonymization

Pseudonymization

A pseudonym is a false name. Therefore, in the context of data, pseudonymization is the process of giving false names to data that can help identify somebody. It is similar to having a secret identity in the superhero world. For example, Peter Parker and Spider-Man are the same person but most people do not know this because of the use of a false name.

Practical ways to achieve pseudonymization include replacing text such as names with numbers, removing information such as date of birth, and keeping only part of the data in a column, such as the last four digits of a person’s social security number.

One advantage, or perhaps disadvantage, of pseudonymization is that the data can be returned to its original state. This is because whoever altered the data used the same rules for every change they made. The downside is that if someone else can determine how the data was altered, they could recover the original data and use it to identify someone.
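
To make the idea concrete, here is a minimal sketch in Python. It assumes records are simple dictionaries with hypothetical "name" and "ssn" fields, and the pseudonym format (P0001, P0002, ...) is invented for illustration. The lookup table is what makes the name change reversible, which is also why it would need to be stored separately and protected.

```python
from itertools import count


def pseudonymize(records, key_field="name"):
    """Replace each distinct value of key_field with a numeric pseudonym.

    Returns the altered records plus the lookup table needed to reverse
    the change. Field names here are assumptions for the example.
    """
    lookup = {}      # pseudonym -> original value
    assigned = {}    # original value -> pseudonym
    next_id = count(1)
    altered = []
    for record in records:
        original = record[key_field]
        if original not in assigned:
            pseudonym = f"P{next(next_id):04d}"
            assigned[original] = pseudonym
            lookup[pseudonym] = original
        new_record = dict(record)
        new_record[key_field] = assigned[original]
        # Keep only the last four digits of the social security number.
        new_record["ssn"] = "***-**-" + record["ssn"][-4:]
        altered.append(new_record)
    return altered, lookup


def reidentify(records, lookup, key_field="name"):
    """Restore the original names using the protected lookup table."""
    return [{**record, key_field: lookup[record[key_field]]} for record in records]


# Made-up sample data.
people = [
    {"name": "Peter Parker", "ssn": "123-45-6789"},
    {"name": "Jane Doe", "ssn": "987-65-4321"},
    {"name": "Peter Parker", "ssn": "123-45-6789"},
]
masked, table = pseudonymize(people)
restored = reidentify(masked, table)
```

Because the same rule is applied every time, the same person always receives the same pseudonym, so the data can still be grouped and analyzed without showing the real name.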

Anonymization

Anonymous means no name. Therefore, anonymization is the process of removing all personal identifying information in a dataset. When this is done the process is not reversible and thus there is no way to determine the identity of the people in the dataset.

An example of anonymization would be to completely remove the names of people in a dataset along with other information such as dates of birth and phone numbers. Anonymization provides heightened protection, but at the cost that even the people who anonymized the data have no idea who the original people are. Whether this is good or bad depends on the context in which the data will be used.
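
A minimal sketch of this kind of removal in Python is shown here. The column names and sample records are assumptions made up for the example. No lookup table is kept, so there is nothing that would allow the change to be reversed. Real datasets may need more than this, since the remaining fields can sometimes still point to a person when combined.

```python
# Assumed names of the identifying columns; adjust to the actual dataset.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "phone"}


def anonymize(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with the identifying fields removed."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up sample data.
customers = [
    {"name": "Jane Doe", "date_of_birth": "1990-04-12", "phone": "555-0100", "purchase_total": 42.50},
    {"name": "John Roe", "date_of_birth": "1985-11-03", "phone": "555-0101", "purchase_total": 17.25},
]
anonymous = anonymize(customers)
```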

There are industry-specific ways of achieving either pseudonymization or anonymization. Examples include the fields of health care and education. However, at the macro level, all industries are using some combination of pseudonymization and anonymization.

Conclusion

Data privacy is a major concern in the world today. The concern with privacy also needs to be balanced with the need to analyze data for insights. For this reason, many have turned to various ways to de-identify data to balance the competing concerns of privacy and analysis.

Security Models

Protecting data is a major concern of organizations today. With so many people sharing so much about themselves online, organizations must be careful and aware of ways to secure the data that they have. In this post, we will look at two different security models that are commonly deployed today. These two models are the CIA Triad and the DIE model. Either of these models can be used when developing a data governance plan for an organization.

CIA Triad

There are several different models used by organizations to examine data privacy. One example is the CIA triad. The CIA triad provides 3 concepts that must be kept in mind when attempting to protect the privacy of users.

“C” stands for confidentiality. In other words, organizations must be sure that the data they have cannot be accessed by unauthorized people. The “I” stands for integrity. Integrity involves ensuring that data is not altered or changed without authorization. If the data is manipulated without user knowledge, any insights derived from the data would be considered questionable.

The last letter in the CIA triad is “A.” The letter “A” stands for availability. Availability means that the data system is operational and can access the data. In other words, the security system cannot be so complex that nobody can get the data that is being protected.
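
As a small illustration of the integrity idea, one common technique is to record a cryptographic fingerprint of the data and compare it later. The sketch below uses SHA-256 from Python's standard library. The sample data is made up, and this is only one possible way to detect changes, not something the CIA triad itself prescribes.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, expected: str) -> bool:
    """Check the data against a fingerprint recorded earlier."""
    return fingerprint(data) == expected


# Made-up sample data: the second version has one amount changed.
original = b"id,amount\n1,100\n2,250\n"
tampered = b"id,amount\n1,100\n2,999\n"

recorded = fingerprint(original)
```

If the recorded fingerprint no longer matches, the data was changed at some point after the fingerprint was taken.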

DIE Model

Another security model that is commonly used is the DIE model. DIE stands for distributed, immutable, and ephemeral. Distributed means that data should not be limited to one source in case of failures. An example is having multiple copies of data in multiple sources.

The “I” in DIE stands for immutable. Immutable in this context means that the infrastructure being used is replaceable without data loss whenever there is a problem. Again, this relates to the idea of having multiple sources of the same data. Lastly, the “E” in DIE stands for ephemeral. Ephemeral means that it does not take a long time to get back up and running in the event of a data failure or breach.

Compare and Contrast

There are some similarities and differences between the CIA triad and DIE. Both are focused on data being available. For the CIA this is the “A” and for DIE this is the “D.” In addition, both models are focused on protecting data in terms of preventing changes and this is covered in the letter “I” in both models.

However, there are also some differences. The DIE model is considered much more scalable than the CIA triad. As such, smaller organizations may lean towards the CIA triad while larger organizations may lean towards the DIE model. Furthermore, DIE is focused on hardware and infrastructure while CIA is more data-focused.

Conclusion

Every model has its strengths and weaknesses. The best model depends on the needs of the organization. In either case, the CIA Triad or the DIE model can guide an organization that is looking for a roadmap for securing its data.

Data Classification

Data classification is a critical part of many companies’ strategies for protecting data. In this post, we will look at data classification in terms of its purpose, types, and steps for the implementation of this process.

Common Purposes

The main reason for data classification is to ensure confidentiality. Many data systems have personally identifiable information such as credit cards, social security numbers, and more. Such information needs to be protected and the only way to know it needs to be protected is through classifying it as something that must be shielded.

Availability is another reason for data classification. Classifying data helps a data governance team to know who should have access to what kinds of data. For example, the manager may have full access to all data while the assistant may only have access to data that is not considered confidential. Classification helps in determining access to data.

Data integrity is yet another reason. Classification helps ensure that the data represents what it claims to represent. If data is classified as sensitive but does not contain any sensitive information, it indicates a problem.

Data Types

There are also several different ways data can be classified. Data can be public which is generally not protected as it is accessible to all for the most part. Data can also be personal which is data that can be used to identify individuals and is usually strictly protected. Data can also be classified as sensitive which means data that requires access authorization.

Lastly, there is confidential information which is data that may have legal restrictions associated with it. The examples above are common forms of data classification. Individual organizations may use all or some of these classifications. In addition, there is nothing to stop an organization from creating its own distinct categories.

Steps

The process of classifying data is rather simple. First, you need to gather all the information that is needed to classify data. Part of this process is supported by having a data catalog that provides information on the location, owners, and content of the data asset.

Once it is clear what data is going to be classified, step two involves the development of a framework. This framework provides the structure for determining how to classify the data. The team involved in this process must develop the criteria for determining which category to place data in. When the categories are developed, the data will be tagged. Once this is done, the process can be automated using software.
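
As a rough sketch of what automated tagging might look like, the Python below assigns one of the four categories described earlier based on column names. The rules and patterns are hypothetical and would come from the criteria the team develops. Real classification software typically inspects the data itself and not only the column names.

```python
import re

# Hypothetical rules: (category, pattern matched against a column name).
# Earlier rules win, so the most restrictive categories are listed first.
RULES = [
    ("confidential", re.compile(r"ssn|social_security|credit_card")),
    ("personal", re.compile(r"name|email|phone|birth")),
    ("sensitive", re.compile(r"salary|grade")),
]


def classify_column(column_name, rules=RULES, default="public"):
    """Return the first category whose pattern matches the column name."""
    lowered = column_name.lower()
    for category, pattern in rules:
        if pattern.search(lowered):
            return category
    return default


def tag_columns(column_names):
    """Tag every column with a classification category."""
    return {name: classify_column(name) for name in column_names}


tags = tag_columns(["Customer_Name", "credit_card_number", "salary", "store_city"])
```

Anything that matches no rule falls back to public here, which is a design choice for the example. A more cautious framework might default to a restricted category until a person has reviewed the column.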

Step three involves making sure the rules developed in step two are consistent with the standards that have been developed in the data governance policy. In other words, the classification must not violate the data governance policy because of compliance issues. There must be administrative consistency between data classification rules and the data governance policy.

Step four involves the application of the rules developed in step two. Once this is completed, the data classification is finished for the time being.

Conclusion

Data classification is another tool that can be used to support an organization. This tool in particular is useful in protecting data based on its characteristics. Therefore, when it’s time to protect data, data classification can help you to determine what data to protect.

Data Governance Policy

A data governance policy is a set of guidelines that allows an organization to manage its data consistently and properly. What is contained within this policy will vary from one organization to another but some of the topics addressed include data quality, access, usage, integration, and security.

The topics listed above are included in a data governance policy because they relate to the topic of managing data. If a data governance team ignores data quality, access, security, etc., it could have negative ramifications for the organization.

Components

The topics of a data governance policy are described above. However, we will now look at the structure of a data governance policy. Generally, the following components are used in a data governance policy.

Statement of purpose-The goal of the document
Scope and goals-What are and what are not covered in the document along with a breakdown of core beliefs about what data governance should do.
Roles & responsibilities-Who is in charge of what
Principles and rules-These are a further breakdown of goals into observable behaviors that are called rules. Goals and principles are highly similar and it might be too confusing to have both. Therefore, consider choosing one or the other.
Definitions of terms-It is important to define keywords that will be frequently used in the document. The level of detail depends on the audience.

These are some of the main components of a data governance policy. However, what to include in such a document depends on the local context and challenges of an organization. Some or all of these pieces may be needed or other components not mentioned may be appropriate.

Process

The steps for making a data governance policy are as follows:

Make an inventory of the data that will be covered under the data governance policy. Most of the time, not all data in an organization is under the policy.
Build a team that includes a leader along with other stakeholders of the data.
Define scope and goals
Assign roles and responsibilities
Develop standards
Define metrics. Metrics help you to determine if you are achieving your goals.
Make a draft of your document and revise it as needed.

As you can see, the process is similar to the components. The order of developing the components matters as it is better to build broader policies before focusing on behavioral objectives.

Conclusion

A major step in the development of a data governance program is the development of a data governance policy. The policy allows a data governance team to lay out what they are trying to do and how they will do it. Such a document is critical to helping the team to stay on the same page and to consistently seek the same objectives.

Master Data Management

Data continues to become more and more important. With this growth, there has been a corresponding need for standardizing and managing it. In this post, we will look at master data and how one can go about managing it.

Definition

Master data is a uniform set of data that is used throughout an organization. By uniform, it means that this data is exactly the same wherever it appears in any data set. This is highly important because it is natural for data to change a little as people use it or if it is merged and edited in various stages of the workflow. Master data is so important and fundamental that it must remain unchanged for the sake of consistency when different departments within an organization need to integrate data.

Therefore, master data management is the process of protecting master data from the changes that can happen from people and systems interacting with data. Unfortunately, preserving data is not an easy task and at times this can be complex and difficult.

Master Data Management Forms

There are different forms or ways to develop master data. The analytical approach feeds whatever master data an organization has into a data warehouse where it can be referred to as needed. The operational approach involves master data in the core business or organizational systems. Essentially, the difference between these two approaches is at what level of granularity they are implemented. Analytical is across an organization while operational is within a sub-unit of the organization.

Whichever method is used, there are several ways that the approach is implemented. A registry process involves creating a unified master data source without making any changes to local systems. This means there are two different systems, which means that people need to be aware of when to refer to the registry.

Consolidation is another way and involves updating the registry master data whenever the local system is updated. Lastly, the transaction method is the opposite and involves the local system being updated whenever the registry is.

Steps

The steps to selecting and standardizing master data are explained below. Step one involves selecting what is considered master data. This will vary from organization to organization and will involve some disagreement and negotiation. The same applies to step two, which is agreeing on data standards and the master data approach. Examples of things that involve data standards can include capitalization of text, number of decimals, number of digits, maximum text length, abbreviations, etc. All these must be worked out together. Should states be abbreviated or spelled out fully? Should phone numbers have dashes in them? These are just some of the challenges to address.

Step three involves deploying software to find and standardize the master data. This can be done manually, and this happens in smaller organizations, but for larger organizations software is the only practical way to do it. The next step is the cleansing of the data, which can include dealing with duplicates. Once all of this is completed, it is appropriate to use the master data.
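
The standardizing and cleansing steps can be sketched in a few lines of Python. The standards chosen here (two-letter state abbreviations, phone numbers with dashes, title-case names) are assumptions for the sake of the example, as are the field names and sample records. The state table is deliberately partial.

```python
import re

# Partial, illustrative mapping; a real table would cover every state.
STATE_ABBREVIATIONS = {"california": "CA", "texas": "TX", "new york": "NY"}


def standardize_state(value):
    """Abbreviate spelled-out states; uppercase anything else."""
    cleaned = value.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())


def standardize_phone(value):
    """Format a 10-digit phone number with dashes."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {value!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def standardize(record):
    """Apply the agreed data standards to one record."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "state": standardize_state(record["state"]),
        "phone": standardize_phone(record["phone"]),
    }


def cleanse(records):
    """Standardize every record, then drop exact duplicates."""
    seen = set()
    unique = []
    for record in map(standardize, records):
        key = (record["name"], record["state"], record["phone"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up sample data: the first two rows describe the same person.
raw = [
    {"name": "jane  doe", "state": "Texas", "phone": "(512) 555-0100"},
    {"name": "Jane Doe", "state": "tx", "phone": "512.555.0100"},
    {"name": "John Roe", "state": "New York", "phone": "212-555-0101"},
]
master = cleanse(raw)
```

Notice that the duplicates only become visible after standardizing, which is one reason the standards need to be agreed on before the cleansing happens.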

The Team

Most projects require a team effort and master data management is no exception. Often you will want a manager who oversees the project. Another person who may be involved is a master data specialist who maintains the system. Data stewards are generally involved as they are the ones most familiar with whatever data they are responsible for. In addition, you may need leadership sponsors and stakeholders involved as well particularly when picking master data and assigning data standards.

Conclusion

Master data is a critical component of many organizations which means it must be managed and controlled as well. Some practical ways to address this have been shared here. However, the best way to approach this will vary from one organization to another.

Data Protection Impact Assessment

The data protection impact assessment (DPIA) is a tool associated with the GDPR that is used to determine the level of protection data needs within an organization. Protection is determined by finding potential risks that might negatively affect data within the organization. In this post, we will look at the benefits of conducting a DPIA, assessing when to conduct an assessment, and a brief look at the process for completing a DPIA.

Benefits

As mentioned earlier, conducting a DPIA allows an organization to document risk. Documenting risk allows for strategies to be developed to reduce the said risk. Other benefits include allowing an organization to assess the cost or level of a particular risk. Lastly, a DPIA can provide unique insights into specific data protection needs and risks.

In general, the DPIA provides the initial data needed to develop a roadmap for supporting data protection within an institution. As such, this is a critical first step in a complex process.

When to do DPIA

Considering the importance of conducting a DPIA, a natural question to consider is when such an assessment should be performed. There are several situations that warrant a DPIA. One example is whenever an organization is moving to some form of automated processing, such as a program that identifies at-risk students. Since this system is automated, it is important to make sure the data is protected.

Another situation that may warrant a DPIA is one in which individuals are judged or evaluated. An example is collecting what users watch on YouTube to make recommendations. Lastly, instances of data integration may require a DPIA to make sure there is no loss of protection from combining data.

Process

There are several steps to actually completing a DPIA. Step one often involves describing the data flow. Data flow refers to how data moves throughout the organization in terms of its collection, storage, and sources. Step two involves determining the scope of the data. Scope refers to what types of data will be assessed, the amount of data to be assessed, and how long the data will be stored.

Step three involves defining the benefits of data processing. Data processing is the cleansing of data so that it can be used for analysis. How this is done varies widely and depends on the situation. Step four looks at how processing affects the consumer. Explaining this is difficult but, for example, complex data processing could slow down the user experience.

Steps 5 and 6 involve talking to stakeholders about this new project and checking for compliance. Stakeholders will explain any concerns that they may have while compliance involves legal matters such as regulations and laws.

Steps 7 and 8 are where various risks are identified and solutions are proposed. For example, if it is discovered that some of the data is revealing people’s identities it might be appropriate to make the data anonymous. Once all of the problems and solutions are developed, step 9 is the official approval of the DPIA.

Conclusion

Completing a data protection impact assessment is a practical way to take the first steps in data privacy in an organization. With the insights developed, an organization can inspire confidence in its stakeholders that the data within the organization is not only accurate but safe as well.

Privacy by Design

Privacy by Design is an idea found within the General Data Protection Regulation, which affects the data privacy practices of organizations. In this post, we will define this term and explain several principles of privacy by design.

Definition

Privacy by design is a concept in which data protection happens through the appropriate development of technology. Essentially, data protection should not be limited to one place or one feature; instead, data protection should be layered throughout the system of an organization.

There are several ways to begin this initiative. A common method is to have a privacy policy that is up-to-date and readable. Another way to begin this process is to establish someone as the data protection officer. Lastly, it is also common to conduct some sort of assessment of data protection to determine areas of improvement before using an individual’s personal data.

Principles

There are seven principles of privacy by design. Below is a list with explanations.

Proactive rather than reactive-There should be an effort to prevent privacy loss rather than trying to fix a situation in which people’s personal information is inappropriately accessed.
Privacy by default-Maintaining the privacy of data should be the first thing an organization thinks about and can include restricting use and access or deleting data that is no longer needed.
Embedding of privacy-Embedding involves such tools as encryption, authentication, and the testing of vulnerabilities. In other words, privacy is used as a foundational aspect of developing a website or application.
Full functionality-This idea is a reminder that data privacy should not make it difficult to use a website or application. Protect data but avoid sacrificing the user experience.
End-to-end security-This is similar to principle number two and is essentially a reminder that privacy protection must be comprehensive from the time the data is received until the data is destroyed.
Visibility and transparency-People should know what is being done with the data an organization has of them.
Respect for user privacy-People should still have authority over their data after it is collected. What this means is that they can grant or rescind consent to the use of their data at any time.
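
To illustrate the last principle, here is a minimal sketch in Python of a registry where people can grant or rescind consent at any time. The class, method names, and purposes are all hypothetical. It also touches on privacy by default, since the absence of recorded consent is treated as no permission.

```python
from datetime import datetime, timezone


class ConsentRegistry:
    """Tracks whether each person currently consents to a given use of their data."""

    def __init__(self):
        # Append-only history, so there is a record of every change.
        self._events = []

    def _record(self, person_id, purpose, granted):
        self._events.append({
            "person_id": person_id,
            "purpose": purpose,
            "granted": granted,
            "at": datetime.now(timezone.utc),
        })

    def grant(self, person_id, purpose):
        self._record(person_id, purpose, True)

    def rescind(self, person_id, purpose):
        self._record(person_id, purpose, False)

    def is_allowed(self, person_id, purpose):
        # The most recent decision wins; no decision means no permission.
        for event in reversed(self._events):
            if event["person_id"] == person_id and event["purpose"] == purpose:
                return event["granted"]
        return False


registry = ConsentRegistry()
registry.grant("user-1", "marketing_email")
allowed_after_grant = registry.is_allowed("user-1", "marketing_email")
registry.rescind("user-1", "marketing_email")
allowed_after_rescind = registry.is_allowed("user-1", "marketing_email")
never_asked = registry.is_allowed("user-2", "marketing_email")
```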

Implementation Perspective

There are several perspectives from which the implementation of privacy by design must be considered: the systems, processes, and risk management perspectives.

The system perspective involves documenting the organization’s commitment to data protection, appointing a data protection officer or leader, providing training for employees, checking security measures, developing a record-keeping system, and conducting a self-assessment. All of these steps are used to develop an initial system for data privacy.

For processes, it is necessary to determine roles within privacy such as people in IT, legal, etc. who support privacy with their technical expertise. It is also important to document the data processing process and privacy risks. Privacy controls for users and the implementation of security measures from the systems perspective are critical as well.

Risk management is another key perspective that needs to be addressed for data privacy. Risk management involves the legal purpose of processing data. It also includes tracking who has access to data, controls for accessing data, what to do in the event of a breach, and minimization, anonymization, and pseudonymization of data. Lastly, measures for data accuracy are developed here.

Data Privacy

A field closely related to data governance is data privacy. In this post, we will look at what data privacy is as well as principles that need to be kept in mind when trying to keep people’s data private.

Data Privacy

Privacy is a term that is difficult to define. For our purposes, data privacy is the amount of control a person has over personal information in terms of how this information is collected, managed, and stored. This definition gives the impression that people have little data privacy because we are so often compelled to share our information online.

Websites often require some surrendering of personally identifiable information (PII) such as name, address, phone number, etc., while in the medical field there is demand for protected health information (PHI). Sharing information about yourself can be frustrating for many but is the cost of doing business online. Naturally, once these various online companies have your data, they must be sure to protect it.

Data security is not about collecting or managing data. Rather, data security is focused on the protection of data from unauthorized access. Securing data is critical to protect individuals and organizations from harm because of security breaches. For example, there can be serious financial repercussions if someone’s credit card number is stolen online.

Fair Information Practice Principles

With all the concerns regarding data privacy, it was natural that frameworks would be developed to help organizations with data privacy. One such framework is the Fair Information Practice Principles (FIPPs) developed by the Organisation for Economic Co-operation and Development (OECD) back in 1980. Below are the eight principles in this framework.

Limits on data collection-Every organization needs to determine the smallest amount of data it can collect while still maintaining success.
Data quality-Data that is collected needs to be accurate and pertinent to the purposes of the organization.
Purpose determination-There must be a clear compelling reason to collect data.
Limits of use-Personal data must only be used for its intended purpose.
Security-Data must be protected.
Transparency-People should know that their data is being collected.
Individual participation-People whose data has been collected have the right to access their data and to have it corrected or erased.
Accountability-Whoever collects this data is responsible for adhering to the principles listed above.

The principles shared above have been adopted by many organizations to provide a foundation on which they can develop their own data privacy policies and philosophy.

Conclusion

Data privacy is a major concern in the world today. Organizations whether online or offline continue to demand more information about their customers. As such, this implies that there must be safeguards in place to ensure the protection of this information.

Defense & Offense with Data

Within the field of data governance, there are different ways of approaching data and the definition of truth. In this post, we will look at different approaches to data and also how truth can be defined with a data governance framework.

Defense

A defense approach to data is focused on controlling data. This can involve security and stringent governance of data through a highly centralized setting. In addition, the defensive data approach is concerned with minimizing risk and ensuring compliance with standards and expectations. Preventing theft and tracking the flow of data through an organization is also important.

When analytics are used they are used to detect fraud and unusual activity. How defensive an organization is depends on the field or industry. For example, banking and health care are highly defensive due to the type of data they gather.

Offense

An offensive approach to data is focused on developing insights with data. The goal is not to protect but to develop insights for decision-making. An offensive approach to data is characterized by flexibility and being focused on the customer. This style of approaching data generally emphasizes a decentralized style of data governance.

Organizations that find themselves in highly competitive environments often are forced to become more offensive as they search for insights to maximize profits. How offensive or defensive an organization needs to be varies. However, in general, most if not all organizations start defensive and slowly become more offensive in nature.

Truth

Whether the approach to data is offensive or defensive, it is important to determine what the truth is when it comes to data in an organization. Every organization needs a single source of truth (SSOT) for critical data. The SSOT is language used within data that is the same across an organization. For example, sometimes the same name can be entered in multiple different ways in an organization’s data. Take the company AT&T as an example. It could be entered in some of the following ways:

ATT
att
Att
AT and T
AT&T

Each of the examples above can be considered different and can lead to chaos when it is time to analyze data for insights. This is because redundant names can lead to redundant costs. For example, if AT&T was a vendor for our fictitious company there might be several different contracts with AT&T with several different divisions who all spell AT&T differently. To prevent this the SSOT will define the one way to code AT&T into the system and determine what it represents.
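
One simple way to enforce this in software is a lookup from every known variant to the one agreed spelling. The Python sketch below assumes a hypothetical table of variants maintained by the data governance team. Names the table does not know are returned as None so that a person can review them.

```python
import re

# Hypothetical SSOT table: canonical vendor name -> known variants.
CANONICAL_VENDORS = {
    "AT&T": {"att", "at&t", "at and t"},
}


def _key(name):
    """Lowercase the name and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


# Reverse the table so each variant points at its canonical name.
VARIANT_TO_CANONICAL = {
    _key(variant): canonical
    for canonical, variants in CANONICAL_VENDORS.items()
    for variant in variants
}


def to_ssot(name):
    """Return the canonical name, or None for names the SSOT does not know."""
    return VARIANT_TO_CANONICAL.get(_key(name))


entries = ["ATT", "att", "Att", "AT and T", "AT&T"]
resolved = {to_ssot(entry) for entry in entries}
```

All five spellings from the list above resolve to the single entry AT&T, so contracts recorded under any of them can be grouped together.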

However, keeping the offensive approach to data in mind, there are times when the SSOT can be modified for the purpose of analysis. Doing this leads to what is called multiple versions of truth (MVOT). An example of MVOT is a department that classifies our example of AT&T in a different way from the SSOT. Accounting might see AT&T as a vendor while marketing might see AT&T as their internet provider, etc. Since everyone knows what the SSOT is, they are aware when they make an MVOT for their distinct purpose.

Conclusion

Each organization needs to decide for itself what approach to data it wants to take. There is no right or wrong way to approach data; it really depends on the situation. In addition, every organization needs to determine for itself how it will define truth, and there is no single way to do this either. What organizations need to do is address these two topics in a way that is satisfying for them.

Data Governance Methodology

Data governance is becoming more and more common in today’s world. In this post, we will look at one commonly used process of implementing data governance. The steps are explained below.

Scope & Initiation

The first step in setting up a data governance system is to determine the scope of data governance. By scope, it means how deep and wide the program will be. In other words, you have to determine what will be governed and how thoroughly it will be governed.

It may surprise some that not all data is governed by data governance. For each organization, it will be different but generally, all organizations have data that is excluded from data governance. For example, some organizations will include emails under data governance while others will not. It depends on the situation and there is no single rule.

In addition, it is important to determine how thorough the governance will be. An example of this would be the tolerance for data quality issues. There are times where some data errors are permissible as long as they do not exceed a certain threshold, but this also depends on the context.
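
A tolerance like this can be expressed as a simple error-rate check. The Python sketch below is only an illustration. The validity rule for email addresses is deliberately crude, and the thresholds used in the test values are arbitrary numbers that an organization would set for itself.

```python
def error_rate(values, is_valid):
    """Return the share of values that fail the validity check."""
    values = list(values)
    if not values:
        return 0.0
    invalid = sum(1 for value in values if not is_valid(value))
    return invalid / len(values)


def within_tolerance(values, is_valid, threshold):
    """True when the error rate does not exceed the agreed threshold."""
    return error_rate(values, is_valid) <= threshold


def looks_like_email(value):
    """A crude validity rule, used only for this example."""
    return (
        isinstance(value, str)
        and value.count("@") == 1
        and "." in value.split("@")[1]
    )


# Made-up sample data with one bad entry out of four.
emails = ["a@example.com", "b@example.com", "not-an-email", "c@example.com"]
rate = error_rate(emails, looks_like_email)
```

With one bad value out of four, the same data would pass a 30 percent tolerance and fail a 10 percent one, which is why the threshold needs to be settled as part of the scope.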

Assess

At the assessment stage, the purpose is to determine an organization’s ability to govern data and be governed by policies. Generally, there are three ways of assessing this and they are measuring the capacity to change, the culture of data use, and the ability to collaborate.

The capacity to change is self-explanatory and is a measure of an organization’s ability to accept new policies such as data governance policies. The data use culture is looking at how an organization uses data at that moment. Lastly, collaboration looks at how well people within the organization can work together. Collaboration is critical because data governance generally affects the entire organization and people from multiple departments must work together.

Vision

The vision is where terms are defined and steps going forward are set. For example, the organization needs to define what data governance is for them. In addition, requirements for doing data governance are also developed.

Vision setting is a theoretical experience and this is often boring for the more practical action-oriented individuals. However, setting the vision sets the tone for the rest of the project. Therefore, this must be planned and developed.

Align & Business Value

The align and business value step is for determining the financial value of incorporating data governance into an organization and also refining how things will be measured. For profit-seeking organizations, business value is critical. Most projects need to make or at least save money in this setting. For non-profit organizations, the motivation might be to increase efficiency or the ability to better serve stakeholders.

It’s not enough to talk about savings. Evidence must be provided for determining actual savings. This is where metrics come into play. There must be ways to measure the value of a data governance project. Again, how to do this will vary from place to place but it needs to be addressed.

Functional Design

Functional design is focused on the actual process of doing data governance. What will be done must be determined, and the roles that support this process must be established. Principles are often developed at this step; principles are similar to goals in that they state what is expected from implementing data governance. Following principles, the next things developed are standards, which are similar to objectives in education in that they describe some sort of measurable action.

Best practices often encourage data governance to be embedded within existing roles and responsibilities. In other words, setting up another department within an organization and calling it data governance is generally not considered the best way to make this happen.

Governing Framework Design

Once the plan has been developed it is time to find the people who will implement it. The governing framework involves assigning processes to people and setting up the various roles associated with data governance. Generally, many aspects of data governance are already being done at an organization but in a disjointed, unaware way. Therefore, the main benefit here is not so much to give out more work but rather to make it clear who is already doing what and make sure they are aware of it.

Road Map

The road map step involves data governance going live. This is the point where data governance is integrated into the existing organization. Other things that are done at this step are designing metrics and reporting requirements. In other words, how good or bad does performance have to be on a standard and how will this be reported?

Change management is also addressed here and involves dealing with resistance and making sure that the scope and or goals of the project do not change. There are times when a project will wander from its original purpose which can be frustrating for people.

Rollout and Sustain

Roll out and sustain involves executing the plan and checking its effectiveness. Essentially, this step involves monitoring the data governance implementation and making corrections as necessary.

Conclusion

Data governance is a critical part of most organizations today. However, it can be tricky to figure out how to make this a part of an organization. The information above provides an example of how this could be done.

Types of Data Quality Rules

Data quality rules are for protecting data from errors. In this post, we will learn about different data quality rules. In addition, we will look at tools used in connection with data quality rules.

Detective

Detective rules monitor data after it has already moved through a pipeline and is being used by the organization. Detective rules are generally used when the issues being detected are not causing a major problem, when the issue cannot be solved quickly, and when a limited number of records are affected.

Of course, all of the criteria listed above are relative. In other words, it is up to the organization to determine what thresholds are needed for a data quality rule to be considered a detective rule.

An example of a detective data quality rule may be a student information table that is missing a student’s uniform size. Such information is useful but probably not worthy enough to stop the data from moving to others for use.

Preventative

Preventative data quality rules stop data in the pipeline when issues are found. Preventative rules are used when the data is too important to allow errors, when the problem is easy to fix, and/or when the issue is affecting a large number of records. Again, all of these criteria are relative to the organization.

An example of a violation of a data quality prevention rule would be a student records table missing student ID numbers. Generally, such information is needed to identify students and make joins between tables. Therefore, such a problem would need to be fixed immediately.
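The difference between the two rule types can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names, the `PreventiveRuleViolation` exception, and the `run_rules` function are hypothetical and do not come from any particular tool. A missing uniform size is treated as a detective issue (noted, data keeps moving) and a missing student ID as a preventative issue (the pipeline stops).

```python
class PreventiveRuleViolation(Exception):
    """Raised when a preventative rule fails and the data must not move on."""


def run_rules(records):
    """Apply one detective and one preventative rule to a list of dict records.

    Returns the list of detective warnings; raises on a preventative failure.
    """
    warnings = []
    for i, rec in enumerate(records):
        # Preventative rule: a missing student ID stops the pipeline.
        if not rec.get("student_id"):
            raise PreventiveRuleViolation(f"record {i}: missing student_id")
        # Detective rule: a missing uniform size is noted but does not block.
        if not rec.get("uniform_size"):
            warnings.append(f"record {i}: missing uniform_size")
    return warnings


students = [
    {"student_id": "1001", "uniform_size": "M"},
    {"student_id": "1002", "uniform_size": None},
]
print(run_rules(students))  # ['record 1: missing uniform_size']
```

Whether a failed rule raises an error or only logs a warning is the design choice that separates the two types; which fields fall on which side is up to the organization.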

Thresholds & Anomaly detection

There are several tools for implementing detection and prevention data quality rules. Among the choices are the setting of thresholds and the use of anomaly detection.

Thresholds are actions that are triggered after a certain number of errors have occurred. It is up to the organization to determine how to set up its thresholds. Common levels include no action, warning, alert, and prevention. Each level must have a minimum number of errors that must occur before the information is passed on to the user or IT.

To make things more complicated you can tie threshold levels to detective and preventive rules. For example, if a dataset has 5% missing data it might only flag it as a warning threshold. However, if the missing data jumps to 10% it might now be a violation of a preventative rule as the violation has reached the prevention level.
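A minimal sketch of tying threshold levels to a missing-data rate might look like the following. The 5% and 10% cut-offs come from the example above; the 7.5% cut-off for the alert level, along with the function names, is an assumption made only to fill in the middle level.

```python
# Threshold levels ordered from most to least severe.
# 5% and 10% follow the example in the text; 7.5% is an assumed value.
THRESHOLDS = [
    (0.10, "prevention"),
    (0.075, "alert"),
    (0.05, "warning"),
]


def missing_rate(values):
    """Fraction of values that are None or empty strings."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def threshold_level(rate):
    """Return the most severe level whose cut-off the rate has reached."""
    for cutoff, level in THRESHOLDS:
        if rate >= cutoff:
            return level
    return "no action"


column = ["a"] * 94 + [None] * 6  # 6% of the values are missing
print(threshold_level(missing_rate(column)))  # warning
```

A rule that returns "prevention" would then stop the data in the pipeline, while the lower levels only notify someone.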

Anomaly detection can be used to find outliers. Unusual records can be flagged for review. For example, suppose a university has an active student who was born in 1920. Such a birthdate is highly unusual and the rule should flag it as an outlier. After reviewing the record, IT can decide whether it needs to be edited. Again, anomaly detection can be used to detect or prevent data errors and can have thresholds attached to it as well.
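The post does not name a specific anomaly detection method, so the sketch below swaps in one common choice, the interquartile range (IQR) fence, as an assumption. The birth years are made-up sample data and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical birth years of active students.
birth_years = [2000, 2004, 2003, 2005, 2002, 1920, 2001, 2004]
print(iqr_outliers(birth_years))  # [1920]
```

The flagged values would go to a person for review rather than being changed automatically.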

Conclusion

Data quality rules can be developed to monitor the state of data within a system. Once the rules are developed it is important to determine if they are detective or preventative. The main reason for this is that the type of rule affects the urgency with which the problem needs to be addressed.

Data Profile

One aspect of the data governance experience is data profiling. In this post we will look at what a data profile is, an example of a simple data profile, and the development of rules that are related to the data profile.

Definition

Data profiling is the process of running descriptive statistics on a dataset to develop insights about the data and field dependencies. Some questions that are commonly asked when performing a data profile include:

How many observations are in the data set?
What are the min and max values of a column(s)?
How many observations have a particular column populated with a value (missing vs non-missing data)?
When one column is populated what other columns are populated?

Data profiling helps you to confirm what you know and do not know about your data. This knowledge will help you to determine issues with your data quality and to develop rules to assess data quality.

Student Records Table

StudentID	StudentFirstName	StudentLastName	StudentBirthDate	StudentClassLevel
1001	Maria	Smith	04/04/2000	Senior
1002		Chang	09/12/2004	Junior
1003	Francisco	Brown		Junior
1004	Matthew	Peter	01/01/2005	Freshman
1005	Martin		02/05/2002	Sophomore

The first column from the left is the student ID. Looking at this column we can see that there are five records with data and that the column is numeric with 4 characters. The minimum value is 1001 and the max value is 1005.

The next two columns are first name and last name. Both of these columns are string text with a min character length of 5; the max length is 9 for first name (Francisco) and 5 for last name. For both columns, 80% of the records are populated with a value. In addition, 60% of the records have both a first name and a last name.

The fourth column is the birthdate. This column has populated records 80% of the time and all rows follow a MM/DD/YYYY format. The minimum value is 04/04/2000 and the max value is 01/01/2005. 40% of the rows have a first name, last name, and birthdate.

Lastly, 100% of the class-level column is populated with values. 20% of the values are senior, 40% are junior, 20% are sophomore, and 20% are freshman.
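The profile described above can be reproduced with a short script. This is a sketch using only the Python standard library; the helper names `populated_pct` and `length_range` are hypothetical, and the rows are typed in by hand from the Student Records Table, with empty strings standing in for missing values.

```python
from collections import Counter
from datetime import datetime

COLUMNS = ["StudentID", "StudentFirstName", "StudentLastName",
           "StudentBirthDate", "StudentClassLevel"]
RAW = [
    ("1001", "Maria", "Smith", "04/04/2000", "Senior"),
    ("1002", "", "Chang", "09/12/2004", "Junior"),
    ("1003", "Francisco", "Brown", "", "Junior"),
    ("1004", "Matthew", "Peter", "01/01/2005", "Freshman"),
    ("1005", "Martin", "", "02/05/2002", "Sophomore"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]


def populated_pct(rows, *cols):
    """Percent of rows in which every listed column has a value."""
    hits = sum(1 for r in rows if all(r[c] for c in cols))
    return 100 * hits / len(rows)


def length_range(rows, col):
    """Min and max character length among populated values."""
    lengths = [len(r[col]) for r in rows if r[col]]
    return min(lengths), max(lengths)


ids = [int(r["StudentID"]) for r in rows]
dates = [datetime.strptime(r["StudentBirthDate"], "%m/%d/%Y")
         for r in rows if r["StudentBirthDate"]]
levels = Counter(r["StudentClassLevel"] for r in rows)

print("records:", len(rows), "min id:", min(ids), "max id:", max(ids))
print("first name lengths:", length_range(rows, "StudentFirstName"))
print("first and last name:",
      populated_pct(rows, "StudentFirstName", "StudentLastName"))
print("birthdate populated:", populated_pct(rows, "StudentBirthDate"))
print("class levels:", {k: 100 * v / len(rows) for k, v in levels.items()})
```

Passing several column names to `populated_pct` answers the field dependency question of which columns are populated together.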

Developing Data Quality Rules

From the insights derived from the data profile, we can now develop some rules to ensure quality. As with any analysis or insight, the actual rules will vary from place to place based on needs and context, but below are some examples for demonstration purposes.

All StudentID values must be 4 numeric characters
All StudentID values must be populated
All StudentFirstName values must be 1-10 characters in length
All StudentLastName values must be 1-10 characters in length
All StudentBirthDate values must be in MM/DD/YYYY format
All StudentClassLevel values must be Freshman, Sophomore, Junior, or Senior
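The rules listed above can be expressed as small checks, one per column. The sketch below is one possible way to do this; the `RULES` mapping and the `violations` and `is_mmddyyyy` names are hypothetical. Requiring exactly four digits for StudentID also covers the rule that the ID must be populated.

```python
import re
from datetime import datetime

CLASS_LEVELS = {"Freshman", "Sophomore", "Junior", "Senior"}


def is_mmddyyyy(text):
    """True when text is a real calendar date written as MM/DD/YYYY."""
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True


# One check per column, following the example rules in the text.
RULES = {
    "StudentID": lambda v: re.fullmatch(r"\d{4}", v) is not None,
    "StudentFirstName": lambda v: 1 <= len(v) <= 10,
    "StudentLastName": lambda v: 1 <= len(v) <= 10,
    "StudentBirthDate": is_mmddyyyy,
    "StudentClassLevel": lambda v: v in CLASS_LEVELS,
}


def violations(record):
    """Return the names of the columns that break a rule."""
    return [col for col, rule in RULES.items()
            if not rule(record.get(col) or "")]


record = {"StudentID": "1005", "StudentFirstName": "Martin",
          "StudentLastName": "", "StudentBirthDate": "02/05/2002",
          "StudentClassLevel": "Sophmore"}
print(violations(record))  # ['StudentLastName', 'StudentClassLevel']
```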

Conclusion

A data profile can be much more in-depth than the example presented here. However, if you have hundreds of tables and dozens of databases this can be quite a labor-intensive experience. There is software available to help with this but a discussion of that will have to wait for the future.

Data Quality

Bad data leads to bad decisions. However, the question is how you can know if your data is bad. One answer to this question is the use of data quality metrics. In this post, we will look at a definition of data quality as well as metrics of data quality.

Definition

Data quality is a measure of the degree that data is appropriate for its intended purpose. In other words, it is the context in which the data is used that determines if it is of high quality. For example, knowing email addresses may be appropriate in one instance but inappropriate in another instance.

When data is determined to be of high quality it helps to encourage trust in the data. Developing this trust is critical for decision-makers to have confidence in the actions they choose to take based on the data that they have. Therefore data quality is of critical importance for an organization and below are several measures of data quality.

Measuring Data Quality

Completeness is a measure of the degree to which expected columns (variables) and rows (observations) are present. There are times when data can be incomplete due to missing data and or missing variables. There can also be data that is partially completed which means that data is present in some columns but not others. There are various tools for finding this type of missing data in whatever language you are using.

Validity is a measure of how appropriate the data is in comparison to what the data is supposed to represent. For example, suppose a column in a dataset records the class level of high school students using Freshman, Sophomore, Junior, and Senior. The data would be invalid if it used numerical values for the grade levels such as 9, 10, 11, and 12. This is only invalid because of the context and the assumptions that are brought to the data quality test.

Uniqueness is a measure of duplicate values. Normally, duplicate values happen along rows in structured data which indicates that the same observation appears twice or more. However, it is possible to have duplicate columns or variables in a dataset. Having duplicate variables can cause confusion and erroneous conclusions in statistical models such as regression.

Consistency is a measure of whether data is the same across all instances. For example, there are times when a dataset is refreshed overnight or on some other schedule. The expectation is that the data should be mostly the same except for the new values. A consistency check would assess this. There are also times when thresholds are put in place such that the data can be a little different based on the parameters that are set.

Timeliness is the availability of the data. For example, if data is supposed to be ready by midnight any data that comes after this time fails the timeliness criteria. Data has to be ready when it is supposed to be. This is critical for real-time applications in which people or applications are waiting for data.
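The consistency and timeliness checks described above could be sketched as follows. The snapshots, the `tolerance` parameter, and the function names are all assumptions for illustration; a snapshot here is simply a dictionary from a record key to its value.

```python
from datetime import datetime


def changed_share(before, after):
    """Share of keys present in both snapshots whose value changed."""
    shared = before.keys() & after.keys()
    if not shared:
        return 0.0
    return sum(1 for k in shared if before[k] != after[k]) / len(shared)


def is_consistent(before, after, tolerance=0.0):
    """Pass when no more than `tolerance` of the existing records changed."""
    return changed_share(before, after) <= tolerance


def is_timely(arrived_at, deadline):
    """Pass when the data arrived at or before the deadline."""
    return arrived_at <= deadline


yesterday = {1001: "Senior", 1002: "Junior"}
today = {1001: "Senior", 1002: "Junior", 1003: "Freshman"}  # one new record
print(is_consistent(yesterday, today))  # True

deadline = datetime(2024, 1, 2, 0, 0)  # data due by midnight
print(is_timely(datetime(2024, 1, 2, 0, 5), deadline))  # False
```

New records do not count against consistency in this sketch; only changes to records that already existed do.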

Accuracy is the correctness of the data. The main challenge of this is that there is an assumption that the ground truth is known to make the comparison. If a ground truth is available the data is compared to the truth to determine the accuracy.
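Several of the metrics above reduce to simple counts. The sketch below covers completeness, validity, and uniqueness on a tiny made-up dataset; the function names and sample rows are hypothetical, and accuracy is left out because it needs a ground truth to compare against.

```python
def completeness(rows, columns):
    """Share of expected cells that hold a value."""
    cells = [r.get(c) for r in rows for c in columns]
    return sum(1 for v in cells if v not in (None, "")) / len(cells)


def validity(rows, column, allowed):
    """Share of populated values in a column that are in the allowed set."""
    values = [r[column] for r in rows if r.get(column)]
    return sum(1 for v in values if v in allowed) / len(values)


def duplicate_rows(rows):
    """Rows that repeat an earlier row exactly (a uniqueness check)."""
    seen, dupes = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(r)
        seen.add(key)
    return dupes


rows = [
    {"id": 1, "level": "Freshman"},
    {"id": 2, "level": "10"},        # invalid: numeric grade level
    {"id": 3, "level": None},        # incomplete: level is missing
    {"id": 1, "level": "Freshman"},  # duplicate of the first row
]
levels = {"Freshman", "Sophomore", "Junior", "Senior"}

print(completeness(rows, ["id", "level"]))  # 0.875
print(validity(rows, "level", levels))
print(duplicate_rows(rows))  # [{'id': 1, 'level': 'Freshman'}]
```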

Conclusion

The metrics shared here are for helping the analyst to determine the quality of their data. For each of these metrics, there are practical ways to assess them using a variety of tools. With this knowledge, you can be sure of the quality of your data.

Data Governance Solutions

Data governance is good at indicating various problems an organization may have with data. However, finding problems doesn’t help as much as finding solutions does. This post will look at several different data governance solutions that deal with different problems.

Business Glossary

The business glossary contains standard descriptions and definitions. It also can contain business terms or discipline-specific terminology. One of the main benefits of developing a business glossary is creating a common vocabulary within the organization.

Many if not all businesses and fields of study have several different terms that mean the same thing. In addition, people can be careless with terminology, to the confusion of outsiders. Lastly, sometimes a local organization will have its own unique terminology. No matter the case, the business glossary helps everyone within an organization to communicate with one another.

An example of a term in a business glossary might be how a school defines a student ID number. The glossary explains what the student ID number is and provides uses of the ID number within the school.

Data Dictionary

The data dictionary provides technical information. Some of the information in the data dictionary can include the location of data, relationships between tables, values, and usage of data. One benefit of the data dictionary is that it promotes consistency and transparency concerning data.

Returning to our student ID number example, a data dictionary would share where the student ID number is stored and the characteristics of this column such as the ID number being 7 digits. For a categorical variable, the data dictionary may explain what values are contained within the variable such as male and female for gender.

Data Catalog

A data catalog is a tool for metadata management. It provides an organized inventory of data within the organization. Benefits of a data catalog include improving efficiency and transparency, quick locating of data, collaboration, and data sharing.

An example of a data catalog would be a document that contains the metadata about several different data warehouses or sources within an organization. If a data analyst is trying to figure out where data on student ID numbers are stored they may start with the data catalog to determine where this data is. The data dictionary will explain the characteristics of the student ID column. Sometimes the data dictionary and catalog can be one document if tracking the data in an organization is not too complicated. The point is that the distinction between these solutions is not obvious and is really up to the organization.
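One way to picture how the catalog and the dictionary relate is as nested records: a catalog entry says where a dataset lives, and each dataset carries dictionary entries for its columns. The sketch below is only an illustration; the class names, the `find_column` helper, and the location string are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    """One data dictionary entry: technical facts about a column."""
    name: str
    data_type: str
    description: str
    allowed_values: tuple = ()


@dataclass
class CatalogEntry:
    """One data catalog entry: where a dataset lives and what it holds."""
    dataset: str
    location: str
    columns: list = field(default_factory=list)


catalog = [
    CatalogEntry(
        dataset="student_records",
        location="warehouse_a.registrar",  # hypothetical location
        columns=[
            ColumnEntry("student_id", "CHAR(7)", "7-digit student ID number"),
            ColumnEntry("gender", "VARCHAR", "Student gender",
                        ("male", "female")),
        ],
    ),
]


def find_column(catalog, column_name):
    """Return (location, dictionary entry) pairs for every matching column."""
    return [(e.location, c) for e in catalog for c in e.columns
            if c.name == column_name]


for location, col in find_column(catalog, "student_id"):
    print(location, col.data_type, col.description)
```

Keeping both in one structure mirrors the case where the dictionary and catalog are a single document.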

Automated Data Lineage

Data lineage describes how data moves within an organization from production to transformation and finally to loading. Tracking this process is really complicated and time-consuming and many organizations have turned to software to complete this.

The primary benefit of tracking data lineage is increasing the trust and accuracy of the data. If there are any problems in the pipeline, data lineage can help to determine where the errors are creeping into the pipeline.

Data Protection, Privacy, Quality

Data protection is about securing the data so that it is not tampered with in an unauthorized manner. An example of data protection would be implementing access capabilities such as user roles and passwords.

Data privacy is related to protection and involves making sure that information is restricted to authorized personnel. Thus, this also requires the use of logins and passwords. In addition, classifying the privacy level of data can also help in protecting it. For example, salaries are generally highly confidential while employee work phone numbers are probably not.

Data quality involves checking the accuracy and consistency of the data. Tools for completing this task can include creating KPIs and metrics to measure data quality, developing policies and standards that define what good data quality is for the organization, and developing reports that share the current quality of data.

Conclusion

The purpose of data governance is to support an organization in maintaining data that is an asset to the organization. In order for data to be an asset it must be maintained so that the insights and decisions that are made from the data are as accurate and clear as possible. The tools described in this post provide some of the ways in which data can be protected within an organization.

Data Governance Strategy

A strategy is a plan of action. Within data governance, it makes sense to ultimately develop a strategy or plan to ensure data governance takes place. In this post, we will look at the components of a data governance strategy. Below are the common components of a data governance strategy.

Approach
Vision statement
Mission statement
Value proposition
Guiding principles
Roles & Responsibilities

There is probably no particular order in which these components are completed. However, they tend to follow an inverted pyramid in terms of the scope of what they deal with. In other words, the approach is perhaps the broadest component and affects everything below it followed by the vision statement, etc. Where to begin probably depends on how your mind works. A detail-oriented person may start at the bottom while a big-picture thinker would start at the top.

Defined Approach

The approach defines how the organization will go about data governance. There are two extremes for this and they are defensive and offensive. A defensive approach is focused on risk mitigation while an offensive approach is focused more on achieving organizational goals.

Neither approach is superior to the other and the situation an organization is in will shape which is appropriate. For example, an organization that is struggling with data breaches may choose a more defensive approach while an organization that is thriving and not facing such problems may take a more offensive approach.

Vision Statement

A vision statement is a brief snapshot of where the organization wants to be. Another way to see this is that a vision statement is the purpose of the organization. The vision statement needs to be inspiring and easily understood. It also helps to align the policies and standards that are developed.

An example of a vision statement for data governance is found below.

Transforming how data is leveraged to make informed decisions to support youth served by this organization

The vision is to transform data for decision-making. This is an ongoing process that will continue indefinitely.

Mission Statement

The mission statement explains how an organization will strive toward its vision. Like a vision statement, the mission statement provides guidance in developing policies and standards. The mission statement should be a call to action and include some of the goals the organization has about data. Below is an example.

Enabling stakeholders to make data-driven decisions by providing accurate, timely data and insights

In the example above, it is clear that accuracy, timeliness, and insights are the goals for achieving the vision statement. In addition, the audience is identified which is the stakeholders within the organization.

Value Proposition

The value proposition provides a justification for adopting a data governance strategy and explains its significance. Another way to look at this is as an emphasis on persuasion. Some of the ideas included in the value proposition are the benefits of implementation. Often the value proposition is written in the form of cause-and-effect statements. Below is an example.

By implementing this data governance program we will see the following benefits:

Improved data quality for actionable insights, increased trust in data for making decisions, and clarity of roles and responsibilities of analysts

In the example above three clear benefits are shared. Succinctly this provides people with the potential outcomes of adopting this strategy. Naturally, it would be beneficial to develop ways to measure these ideas which means that only benefits that can be measured should be a part of the value proposition.

Guiding Principles

Guiding principles define how data should be used and managed. Common principles include transparency, accountability, integrity, and collaboration. These principles are just more concrete information for shaping policies and standards. Below is an example of a guiding principle.

All data will have people assigned to play critical roles in it

The guiding principle above is focused on accountability. Making sure all data has people who are assigned to perform various responsibilities concerning it is important to define and explain.

Roles & Responsibilities

Roles and responsibilities are about explaining the function of the data governance team and the role each person will play. For example, a small organization might have people who adopt more than one role such as being data stewards and custodians while larger organizations might separate these roles.

In addition, it is also important to determine the operating model and whether it will be centralized or decentralized. Determining the operating model again depends on the context and preferences of the organization.

It is also critical to determine how compliance with the policies and standards will be measured. It is not enough to state the policies. Eventually, there needs to be evidence of progress and of potential changes that need to be made to the strategy. For example, perhaps a data audit is done monthly or quarterly to assess data quality.

Conclusion

Having a data governance strategy is a crucial step in improving data governance within an organization. Once a plan is in place it is simply a matter of implementation to see if it works.

Data Governance Assessment

Before data governance can begin at an organization it is critical to assess where the organization is currently in terms of data governance. This calls for a data governance assessment. The assessment helps an organization to figure out where to begin by identifying challenges and prioritizing what needs to be addressed. In particular, it is common for there to be five steps in this process as shown below.

Identify data sources and stakeholders
Interview stakeholders
Determine current capabilities
Document the current state and target state
Analyze gaps and prioritize

We will look at each of these steps below.

Identify Data Sources and Stakeholders

Step one involves determining what data is used within the organization and the users or stakeholders of this data. Essentially, you are trying to determine…

What data is out there?
Who uses it?
Who produces it?
Who protects it?
Who is responsible for it?

Answering these questions also provides insights into what roles in relation to data governance are already being fulfilled at least implicitly and which roles need to be added to the organization. At most organizations at least some of these questions have answers and there are people responsible for many roles. The purpose here is not only to get this information but also to make people aware of the roles they are fulfilling from a data governance perspective.

Interview Stakeholders

Step two involves interviewing stakeholders. Once it is clear who is associated with data in the organization it is time to reach out to these people. You want to develop questions to ask stakeholders in order to inform you about what issues to address in relation to data governance.

An easy way to do this is to develop questions that address the pillars of data governance. The pillars are…

Ownership & accountability
Data quality
Data protection and privacy
Data management
Data use

Below are some sample questions based on the pillars above.

How do you know your data is of high quality?
What needs to be done to improve data quality?
How is data protected from misuse and loss?
How is metadata handled?
What concerns do you have related to data?
What policies are there now related to data?
What roles are there in relation to data?
How is data used here?

It may be necessary to address all or some of these pillars when conducting the assessment. The benefit of these pillars is that they provide a starting point from which you can shape your own interview questions. In terms of the interview, it is up to each organization to determine what is best for data collection. Maybe a survey works, or perhaps semi-structured interviews or focus groups. The actual research part of this process is beyond the scope of this post.

Determine Current Capabilities

Step three involves determining the current capabilities of the organization in terms of data governance. Often this can be done by looking at the stakeholder interviews and comparing what they said to a rating scale. For example, the DCAM rating scale has six levels of data governance competence as shown below.

Non-initiated: No governance happening
Conceptual: Aware of data governance and planning
Developmental: Engaged in developing a plan
Defined: Plan approved
Achieved: Plan implemented and enforced
Enhanced: Plan a part of the culture and updated regularly

Determining the current capabilities is a subjective process. However, it needs to be done in order to determine the next steps in bringing data governance along in an organization.

Document Current State and Target State

Step four involves determining the current state and determining what the target state is. Again, this will be based on what was learned in the stakeholder interviews. What you will do is report what the stakeholders said in the interviews based on the pillars of data governance. It is not necessary to use the pillars but it does provide a convenient way to organize the data without having to develop your own way of classifying the results.

Once the current state is defined it is now time to determine what the organization should be striving for in the future and this is called the target state. The target state is the direction the organization is heading within a given timeframe. It is up to the data governance team to determine this and how it is done will vary. The main point is to make sure not to try and address too many issues at once and save some for the next cycle.

Analyze and Prioritize

The final step is to analyze and prioritize. This step involves performing a gap analysis to determine solutions that will solve the issues found in the previous step. In addition, it is also important to prioritize which gaps to address first.
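A gap analysis can be sketched by scoring each pillar's current and target state on the six-level scale from step three and sorting by the size of the gap. The ratings below are made up, and ranking the largest gap first is only one possible way to prioritize; an organization might weigh risk or cost instead.

```python
LEVELS = ["Non-initiated", "Conceptual", "Developmental",
          "Defined", "Achieved", "Enhanced"]


def gap(current, target):
    """Number of levels between the current and target state."""
    return LEVELS.index(target) - LEVELS.index(current)


# Hypothetical assessment results: (current state, target state) per pillar.
assessment = {
    "Ownership & accountability": ("Conceptual", "Defined"),
    "Data quality": ("Non-initiated", "Defined"),
    "Data protection and privacy": ("Defined", "Achieved"),
}

# Assumed prioritization: address the largest gap first.
priorities = sorted(assessment, key=lambda p: gap(*assessment[p]),
                    reverse=True)
for pillar in priorities:
    print(pillar, gap(*assessment[pillar]))
```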

Another part of this step is sharing recommendations and soliciting feedback. Provide insights into which direction the organization can go to improve its data governance and allow stakeholders to provide feedback in terms of their agreement with the report. Once all this is done the report is completed and documented until the next time this process needs to take place.

Conclusion

The steps presented here are not prescriptive. They are shared as a starting point for an organization’s journey in improving data governance. With experience, each organization will find its own way to support its stakeholders in the management of data.

Data Governance Office

The data governance office or team is the leader in dealing with data within an organization. This team is made up of several members such as:

Chief Data Officer
Data Governance Lead
Data Governance Consultant
Data Quality Analyst

We will look at each of these below. It also needs to be mentioned that a person might be assigned several of these roles, which is particularly true in a smaller organization. In addition, it is possible that several people might fulfill one of these roles in a much larger organization.

Chief Data Officer

The chief data officer is responsible for shaping the overall data strategy at an organization. The chief data officer also promotes a data-driven culture and pushes for change within the organization. A person in this position also needs to understand the data needs of the organization in order to further the vision of the institution or company.

The role of the chief data officer encompasses all of the other roles that will be discussed. The chief data officer is essentially the leader of the data team and provides help with governance consulting, quality, and analytics. However, the primary role of this position is to see the big picture for big data and to guide the organization in this regard, which implies that technical skills are beneficial but leadership and change promotion are more critical. In sum, this is a challenging position that requires a large amount of experience.

Data Governance Lead

The data governance lead's primary responsibilities involve defining policies and data governance frameworks. While the chief data officer is more of an evangelist or promoter of data governance, the data governance lead is focused on the actual implementation of change and guiding the organization in this process.

Essentially, the data governance lead is in charge of the day-to-day operation of the data governance team. While the chief data officer may be the dreamer the data governance lead is a steady hand behind the push for change.

Data Governance Consultant

The data governance consultant is the subject matter expert in data governance. Their role is to know all the details of data governance in the general field, and it is even better if they know how to make data governance happen in a particular discipline. An example would be a data governance consultant who knows how to make data governance happen within the context of a university in particular.

The data governance consultant supports the data governance lead with implementation. In addition, the consultant is a go-between for the larger organization and IT. Serving as a go-between implies that the consultant is able to communicate effectively with both parties: on a technical level with IT and in layman's terms with the larger organization. The synergy between IT and the larger organization can be challenging because of communication issues due to vastly different backgrounds, and it is the consultant's responsibility to bridge this gap.

Data Quality Analyst

The data quality analyst’s job is as the name implies to ensure quality data. One way of determining data quality is to develop rules for data entry. For example, a rule for data quality is that marital status can only be single, married, divorced, or widowed. This rule restricts any other option that people may want. When this rule is supported it is an example of high quality within this context.

A data quality analyst also performs troubleshooting or root cause investigations. If something funny is going on in the data such as duplicates, it is the data quality analyst’s job to determine what is causing the problems and to find a solution. Lastly, a data quality analyst is also responsible for statistical work. This can include statistical work that is associated with the work of a data analyst and or statistical work that monitors the use of data and the quality of data within the organization.

Conclusion

The data governance team plays a critical role in supporting the organization with reliable and clean data that can be trusted to make actionable insights. Even though this is a tremendous challenge it is an important function in an organization.

Data Governance Framework Types and Principles

When it is time to develop data governance policies the first thing to consider is how the team views data governance. In this post, we will look at various data governance frameworks and principles to keep in mind when employing a data governance framework.

Top-Down

The top-down framework involves a small group of data providers. These data providers serve as gatekeepers for data that is used in the institution. Whatever data is used is controlled centrally in this framework.

One obvious benefit of this approach is that with a small group of people in charge, decision-making should be fast and relatively efficient. In addition, if something does go wrong, it should be easy to trace the source of the problem. However, a top-down approach only works in situations that have small amounts of data or few end users. When the amount of data becomes too large, the small team will struggle to support users, which indicates that this approach is hard to scale. Lastly, people may resent having to abide by rules that are handed down from above.

Bottom-Up

The bottom-up approach to data governance is the opposite of the top-down approach. Where top-down involves a handful of decision-makers, bottom-up focuses on a democratic style of data leadership. Bottom-up is scalable because everyone is involved in the process, while top-down does not scale well. Generally, when the bottom-up approach is used, controls and restrictions on data are put in place after the raw data is shared rather than before.

Like all approaches to data governance, there are concerns with the bottom-up approach. For example, it becomes harder to control the data when people are allowed to use raw data that has not been prepared for use. In addition, because of the democratic nature of the bottom-up approach, there is also an increased risk of security concerns because of the increased freedom people have.

Collaborative

The collaborative approach is a mix of top-down and bottom-up ideas on data governance. This approach is flexible and balanced while placing an emphasis on collaboration. The collaboration can be among stakeholders or between the gatekeepers and the users of data.

One main concern with this approach is that it can become messy and difficult to execute if principles and goals are not clearly defined. Therefore, it is important to spend a large amount of time on planning when choosing this approach.

Principles

Regardless of which framework you pick when beginning data governance, there are several terms you need to be familiar with to help you be successful. For example, integrity involves maintaining open lines of communication and sharing problems so that an atmosphere of trust is maintained or developed.

It is also important to determine ownership for the purpose of governance and decision-making. Determining ownership also helps to find gaps in accountability and responsibility for data.

Leaders in data governance must also be aware of change and risk management. Change management consists of the tools and processes for communicating new strategies and policies related to data governance. Change management helps to ensure a smooth transition from one state of equilibrium to another. Risk management consists of the tools related to auditing and developing interventions for non-compliance.

A final concept to be aware of is strategic alignment. The goals and purpose of data governance must align with the goals of the organization that data governance is supporting. For example, a school will have a strict stance on protecting student privacy. Therefore, data governance needs to reflect this and support strict privacy policies.

Conclusion

Frameworks provide a foundation on which your team can shape their policies for data governance. Each framework has its strengths and weaknesses but the point is to be aware of the basic ways that you can at least begin the process of forming policies and strategies for governing data at an organization.

Data Governance Framework

In this post, we will look at defining a data governance framework. We will also look at the key components that are a part of a data governance framework.

Defined

A data governance framework is the how or the plan for governing the data within an organization. The term data governance determines what needs to be governed or controlled while the data governance framework is the actual plan for controlling the data.

Common Components

There are several common components of a data governance plan and they include the following.

Strategy
Policies
Processes
Coordination
Monitoring/communication
Data literacy/culture

Strategy involves determining how data can be used to solve problems. This may seem obvious, but only certain data can be used to solve certain problems. For example, customers’ addresses in California might not be appropriate for determining revenue generated in Texas. When data is looked at strategically, it helps to ensure that those who use the data view it as an asset.

Policies help to guide such things as decision-making and expectations concerning data. In addition, policies also help with determining responsibilities and tasks related to data management. One example of policy in action is the development of standards, which are rules for best practices that help to meet a policy. A policy may be something like protecting privacy. A standard to meet this policy would be to ensure that data is encrypted and password protected.

Process and technology involve steps for monitoring the quality of data. Other topics related to process can include dealing with metadata and data management. The proper process mainly helps with efficiency in the organization.

Coordination involves the processes of working together. Coordination can involve defining the roles and responsibilities for a complex process that requires collaboration with data. In other words, coordination is developed when multiple parties are involved with a complex task.

Progress monitoring involves the development of KPIs to make sure that the performance expectations are measured and adhered to. Progress monitoring can also involve issues related to privacy, quality, and compliance. An example of progress monitoring may be requiring everyone to change their password every 90 days. At the end of the 90 days, the system will automatically make the user create a new password.
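
The 90-day password example above can be sketched as a simple check along with a KPI that reports the share of users who are in compliance. This is only an illustration under stated assumptions. The names MAX_PASSWORD_AGE, password_expired, and compliance_rate are hypothetical, the dates are made up, and a password is treated as expired once it reaches 90 days of age.

```python
from datetime import date, timedelta

# Assumption taken from the example: passwords must change every 90 days.
MAX_PASSWORD_AGE = timedelta(days=90)


def password_expired(last_changed, today):
    """Return True once the password is 90 or more days old."""
    return today - last_changed >= MAX_PASSWORD_AGE


def compliance_rate(last_changed_by_user, today):
    """Return the share of users whose password has not expired."""
    if not last_changed_by_user:
        return 1.0
    compliant = sum(
        1
        for changed in last_changed_by_user.values()
        if not password_expired(changed, today)
    )
    return compliant / len(last_changed_by_user)


# Made-up sample data for illustration only.
password_sample = {
    "ana": date(2024, 6, 1),
    "ben": date(2024, 1, 15),
    "cy": date(2024, 5, 1),
}
print(compliance_rate(password_sample, date(2024, 6, 30)))
```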

Lastly, data literacy and culture involve training the people within the organization who use or consume data so that they develop the skills of analyzing and/or communicating data. Naturally, this is an ongoing process and how it works depends on who is involved.

Conclusion

A framework is a plan for achieving a particular goal or vision. As organizations work with data, they must be diligent in making sure that the data that is used is trustworthy and protected. A data governance framework is one way in which these goals can be attained.

Influences and Approaches of Data Governance

Data governance has been around for a while. As a result, various trends and challenges have influenced this field. In this post, we will look at several laws that have had an impact on data governance along with various concepts that have been developed to address common concerns.

Laws

Several laws have played a critical role in influencing data governance both in the USA and internationally. For example, the Sarbanes-Oxley (SOX) Act was enacted in 2002. The SOX Act was created in reaction to various accounting scandals at large corporations at the time. Among the requirements of this law are setting standards for financial and corporate reporting and the need for executives to verify or attest that the financial information is correct. Naturally, this requires data governance to make sure that the data is appropriate so that these requirements can be met.

There are also several laws related to privacy in particular. Focusing again on the USA, there is the Health Insurance Portability and Accountability Act (HIPAA), which requires institutions in the medical field to protect patient data. Leaders in data must develop data governance policies that protect medical information.

In the state of California, there is the California Consumer Privacy Act (CCPA), which gives California residents more control over how their personal data is handled by companies. The CCPA is focused much more on the collection and selling of personal data, as this has become a lucrative industry in the data world.

At the international level, there is the General Data Protection Regulation (GDPR). The GDPR is a privacy law that applies to anybody who lives in the EU. What this implies is that a company in another part of the world that has customers in the EU must abide by this law as well. As such, this is one example of a local law related to data governance that can have a global impact.

Various Concepts that Support Data Governance

Data governance was around much earlier than the laws described above. Over time, several different concepts and strategies were developed to address transparency and privacy, as explained below.

Data classification and retention deals with the level of confidentiality of the data and policies for data destruction. For example, a social security number is a form of data that is highly confidential, while the types of shoes a store sells would probably not be considered private. In addition, some data is not meant to be kept forever. For example, consumers may request that their information, such as credit card numbers, be removed from a website. In such a situation, there must be a way for this data to be removed permanently from the system.
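
The idea of classification and retention above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the classification levels, the one-year retention period, the field names, and the function names classify and is_due_for_deletion are all hypothetical. An actual policy would come from the organization and the laws that apply to it.

```python
from datetime import date, timedelta

# Hypothetical classification of fields by level of confidentiality.
CLASSIFICATION = {
    "social_security_number": "confidential",
    "credit_card_number": "confidential",
    "shoe_style": "public",
}

# Hypothetical retention periods. None means there is no time limit.
RETENTION = {"confidential": timedelta(days=365), "public": None}


def classify(field_name):
    """Return the level for a field, defaulting to the strictest level."""
    return CLASSIFICATION.get(field_name, "confidential")


def is_due_for_deletion(field_name, collected_on, today, removal_requested=False):
    """Return True when data should be removed permanently from the system."""
    if removal_requested:
        # A consumer asked for the data to be removed.
        return True
    period = RETENTION[classify(field_name)]
    return period is not None and today - collected_on > period


print(classify("shoe_style"))
print(is_due_for_deletion("credit_card_number", date(2022, 1, 1), date(2023, 6, 1)))
```

Defaulting unknown fields to the strictest level is a design choice in this sketch so that a field that has not yet been classified is not treated as public by accident.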

Data management is focused on consistency and transparency. There must be a master copy of data to serve as a backup and for checking the accuracy of other copies. In addition, there must be some form of data reference management to identify and map datasets through some general identification such as zip code or state.

Lastly, metadata management deals with data that describes the data. By providing this information, it is possible to search and catalog data.
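
To make the idea of searchable metadata concrete, the sketch below stores a short description for each dataset and searches it by keyword. The DatasetMetadata class, the search_catalog function, and the catalog entries are hypothetical and are not taken from any particular catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Data that describes a dataset rather than the dataset itself."""
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)


def search_catalog(catalog, term):
    """Return the names of datasets whose metadata mentions the term."""
    term = term.lower()
    return [
        entry.name
        for entry in catalog
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


# Made-up catalog entries for illustration only.
example_catalog = [
    DatasetMetadata("student_records", "Enrollment and grades by term",
                    "registrar", ["students", "privacy"]),
    DatasetMetadata("store_inventory", "Shoe styles and stock levels",
                    "retail", ["products"]),
]
print(search_catalog(example_catalog, "privacy"))
```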

Conclusion

Data governance will continue to be influenced by the laws and context of the world. With new challenges will come new ways to satisfy the concerns of both lawmakers and the general public.

Data Governance

Data governance involves several concepts that describe the characteristics and setting in which the data is found. For people in leadership positions involving data, it is critical to have some understanding of the following concepts related to data governance. These concepts are as follows.

Ownership
Quality
Protection
Use/Availability
Management

Each of these concepts plays a role in shaping the role of data within an organization.

Ownership

Data ownership is not always as obvious as it seems. One company may be using the data of a different company. It is important to identify who the data belongs to so that the user of the data is aware of any rules and restrictions that the owner has placed on its use.

Addressing details related to ownership helps to determine accountability as well. Identifying ownership can also identify who is responsible for the data because the owners will hopefully have an idea of who should be using the data. If not this is something that needs to be clarified as well.

Quality

Data quality is another self-explanatory term. Data quality is a way of determining how good the data is based on some criteria. One commonly used set of criteria for data quality is the data’s completeness, consistency, timeliness, accuracy, and integrity.

Completeness is determining if everything that the data is supposed to capture is represented in the data set. For example, if income is one variable that needs to be in a dataset it is important to check that it is there.
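
A completeness check like the income example above can be sketched as the share of records that have a value for each required variable. The function name completeness_report and the sample survey data are assumptions for illustration, as is the choice to treat missing keys, None, and empty strings as incomplete.

```python
def completeness_report(records, required_fields):
    """Return the share of records that have a value for each required field."""
    total = len(records)
    report = {}
    for field_name in required_fields:
        # Assumption: a missing key, None, or an empty string counts as missing.
        filled = sum(
            1 for record in records if record.get(field_name) not in (None, "")
        )
        report[field_name] = filled / total if total else 0.0
    return report


# Made-up sample data for illustration only.
survey_sample = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 3},
    {"id": 4, "income": 61000},
]
print(completeness_report(survey_sample, ["id", "income"]))
```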

Consistency means that the data you are looking at is similar to other data in the same context. For example, student record data is probably similar regardless of the institution. Therefore, someone with experience with student record data can tell you if the data you are looking at is consistent with other data in a similar context.

Timeliness has to do with the recency of the data. Some data is real-time while other data is historical. Therefore, the timeliness of the data will depend on the context of the project. A chatbot needs recent data while a study of incomes from ten years ago does not need data from yesterday.

Accuracy and integrity are two more measures of quality. Accuracy is how well the data represents the population. For example, a population of male college students should have data about male college students. Integrity has to do with the truthfulness of the data. For example, if the data was manipulated, this needs to be explained.

Protection

Data protection has to do with all of the basic security concerns IT departments have to deal with today. Some examples include encryption and password protection. In addition, there may be a need to be aware of privacy concerns such as financial records or data collected from children.

There should also be awareness of disaster recovery. For example, there might be a real disaster that wipes out data or it can be an accidental deletion by someone. In either case, there should be backup copies of the data. Lastly, protection also involves controlling who has access to the data.

Use/Availability

Despite the concerns of protection, data still needs to be available to the appropriate parties and this relates to data availability. Whoever is supposed to have the data should be able to access it as needed.

The data must also be usable. The level of usability will depend on the user. For example, a data analyst should be able to handle messy data but a consumer of dashboards needs the data to be clean and ready prior to use.

Management

Data management is the implementation of the policies that are developed for the previously mentioned concepts. The data leadership team needs to develop processes and policies for ownership, quality, protection, and availability of data.

Once the policies are developed, they have to actually be employed within the institution, which can be difficult as people generally want to avoid accountability and/or responsibility, especially when things go wrong. In addition, change is often disliked as people gravitate towards the current norms.

Conclusion

Data governance is a critical part of institutions today given the importance of data. IT departments need to develop policies and plans for data in order to maintain trust in whatever conclusions are drawn from it.

Student Records Table
