Author Archives: Dr. Darrin


Data Privacy Implementation Strategies

Data privacy is a topic that many organizations are addressing. In this post, we will go through several steps that must be taken to implement a data privacy program.

Leadership Sponsor

As with any major initiative, data privacy is going to need the support of leadership. In particular, there will be a need for an advocate on the leadership team who will support the vision of improving data privacy. Who this person is will naturally vary from organization to organization.


The sponsor is not only an advocate but also serves as a medium of communication between the data privacy team and leadership. The sponsor serves as the eyes and ears for the privacy team, helping them to avoid pitfalls and deal with concerns that are not shared directly from the leadership team to the privacy team.

Put Someone in Charge

Implementing any program or strategy requires that someone take the lead. Therefore, when it is time to develop a privacy approach, someone needs to be in charge. The selection of the leader will naturally vary from one organization to another. The point is that the leadership sponsor needs someone they can talk to directly about the challenges and concerns raised at the leadership level.

Depending on the size of the project there might be more than one person identified as a leader. However, it is generally wiser to start small and scale as appropriate.

Examine the Data

Before any action can take place, it is important to take an inventory of available data. Another name for this is the compiling of a data catalog. A privacy leader must know what data needs to be held private. Without this information, it is hard to ensure that data is properly protected.

Knowing the data works in combination with the policies and procedures that need to be made. For example, if the data includes personal information this will influence how privacy is maintained versus data that does not contain such information.
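As a rough sketch of this idea, a data inventory might tag which fields require privacy protections so that policies can treat personal data differently. The field names and the PII list below are illustrative assumptions, not from any standard:

```python
# Minimal data-inventory sketch: tag fields that need privacy handling.
# The field list and PII categories are illustrative assumptions.

PII_FIELDS = {"name", "email", "phone", "address", "ssn"}

def classify_fields(dataset_fields):
    """Split a dataset's fields into PII and non-PII groups."""
    pii = sorted(f for f in dataset_fields if f in PII_FIELDS)
    non_pii = sorted(f for f in dataset_fields if f not in PII_FIELDS)
    return {"pii": pii, "non_pii": non_pii}

inventory = classify_fields({"name", "email", "purchase_total", "visit_date"})
print(inventory)  # fields needing privacy controls vs. the rest
```

A real catalog would hold much richer metadata, but even a simple split like this tells the privacy team where the sensitive data lives.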

Compliance Expectations

Knowledge of the data is used in connection with compliance expectations. For a corporation, the compliance standard might be GDPR. For other organizations, compliance might be determined by local laws or organizational standards.

Generally, a privacy team must provide evidence that they are implementing and obeying compliance standards. Therefore, a team might have to document and archive how they comply with regulations in the event of a data breach and/or audit.

Assess Risk

Assessing risk helps to inform the privacy team in terms of what sort of policies and/or procedures to implement. Fortunately, it is not necessary to develop this risk assessment in a vacuum. There are risk assessment frameworks such as ISO 31000 or ISO 27005. Either of these frameworks, or others, can help you to determine the level of danger your data is potentially facing.
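A very common way to express risk in such assessments is likelihood times impact. The sketch below is only loosely in the spirit of frameworks like ISO 31000; the 1-5 scales and rating thresholds are assumptions for illustration:

```python
# Simple likelihood x impact risk scoring; scales and thresholds assumed.

def risk_score(likelihood, impact):
    """Both inputs on a 1-5 scale; returns (score, rough rating)."""
    score = likelihood * impact
    if score >= 15:
        rating = "high"
    elif score >= 8:
        rating = "medium"
    else:
        rating = "low"
    return score, rating

# e.g., a likely breach of highly sensitive data
print(risk_score(4, 5))  # (20, 'high')
```

Scoring like this gives the team a consistent way to rank which data risks deserve controls first.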

Create Policies and Procedures

Policies are broad guidelines based on the context in which it is being developed for. Most websites have some sort of privacy policy that explains how and what data is collected along with its purpose. Privacy policies can include an idea of the roles and responsibilities of the data privacy team as well.

Procedures are the steps that need to be taken to fulfill the policies that were created. In other words, data procedures provide step-by-step guidance for carrying out policies. For example, if the policy speaks about the importance of only certain people having access to data, a procedure for this might be how to set up a password or how to seek permission to access a particular database. Essentially, policies inspire procedures.


Controls are inspired by risk assessment. In this step, you are implementing ways to mitigate risk to data. For example, it might have been uncovered that sensitive data is too easy to access. The control for this example may be to move the data to a more secure location or to ensure that the data is password protected.
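One concrete form such a control can take is a role-based access check: data is only reachable by roles explicitly granted access. The roles and dataset names below are made up for illustration:

```python
# Hypothetical role-based access control for the "data is too easy to
# access" risk; roles and dataset names are illustrative assumptions.

ACCESS_POLICY = {
    "salary_data": {"hr_manager", "payroll"},
    "marketing_stats": {"marketing", "analyst"},
}

def can_access(role, dataset):
    """Return True only if the role is explicitly granted access."""
    return role in ACCESS_POLICY.get(dataset, set())

print(can_access("analyst", "salary_data"))  # False
print(can_access("payroll", "salary_data"))  # True
```

Denying by default (an unknown dataset grants access to no one) keeps the control aligned with the policy rather than leaving gaps.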

The main point here is that all of these measures must be integrated and working together. The data catalog and knowledge of compliance inspire the policies and procedures, which in turn help with the development of controls.

Training & Monitoring

Now that almost everything is in place it is time to train people on the new privacy rules. The training will be context specific but is critical for getting buy-in to the new system. Without the cooperation of the masses, there is no hope for the success of the program.

After training, the program is assessed through monitoring. Monitoring assesses how well the program is running. It deals with such challenges as whether people are obeying the new procedures that have been implemented. Monitoring also helps by providing feedback in terms of where there might be growth opportunities. No system is perfect, and monitoring provides critical information to strengthen the program.


Data privacy can be improved in any organization. The ideas presented here provide information on how to start a data privacy program. Naturally, all of these steps may not work for each organization but many valuable ideas have been shared to support the protection of privacy.


Privacy by Design

Privacy by Design is an idea found within the General Data Protection Regulation, which affects the data privacy practices of organizations. In this post, we will define this term and explain several principles of privacy by design.


Privacy by design is a concept in which data protection happens through the appropriate development of technology. Essentially, data protection should not be limited to one place or one feature; instead, data protection should be layered throughout the systems of an organization.


There are several ways to begin this initiative. A common method is to have a privacy policy that is up-to-date and readable. Another way to begin this process is to establish someone as the data protection officer. Lastly, it is also common to conduct some sort of assessment of data protection to determine areas of improvement before using an individual’s personal data.


There are seven principles of privacy by design. Below is a list with explanations.

  1. Proactive rather than reactive-There should be an effort to prevent privacy loss rather than trying to fix a situation in which people’s personal information is inappropriately accessed.
  2. Privacy by default-Maintaining the privacy of data should be the first thing an organization thinks about and can include restricting use/access and/or deleting data that is no longer needed.
  3. Embedding of privacy-Embedding involves such tools as encryption, authentication, and the testing of vulnerabilities. In other words, privacy is a foundational aspect of developing a website or application.
  4. Full functionality-This idea is a reminder that data privacy should not make it difficult to use a website or application. Protect data but avoid sacrificing the user experience.
  5. End-to-end security-This is similar to principle number two and is essentially a reminder that privacy protection must be comprehensive from the time the data is received until the data is destroyed.
  6. Visibility and transparency-People should know what is being done with the data an organization holds about them.
  7. Respect for user privacy-People should still have authority over their data after it is collected. What this means is that they can grant or rescind consent to their data at any time.
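Principle two, privacy by default, can be sketched in code: new accounts start with the most restrictive settings, and anything more permissive requires an explicit opt-in. The setting names below are made up for illustration:

```python
# Privacy-by-default sketch: every new account begins fully locked down.
# The setting names are hypothetical, not from any real product.

def new_account_settings():
    """Return defaults where nothing is shared until the user opts in."""
    return {
        "profile_public": False,
        "share_with_partners": False,
        "analytics_tracking": False,
    }

settings = new_account_settings()
# A setting only becomes permissive after explicit user consent:
settings["profile_public"] = True
```

The design choice is that forgetting to configure anything leaves the user protected, which is the whole point of the principle.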

Implementation Perspective

There are several perspectives from which the implementation of privacy by design must be considered: the systems, processes, and risk management perspectives.

The system perspective involves documenting the organization’s commitment to data protection, appointing a data protection officer or leader, providing training for employees, checking security measures, developing a record-keeping system, and conducting a self-assessment. All of these steps are used to develop an initial system for data privacy.

For processes, it is necessary to determine roles within privacy such as people in IT, legal, etc. who support privacy with their technical expertise. It is also important to document the data processing process and privacy risks. Privacy controls for users and the implementation of security measures from the systems perspective are critical as well.

Risk management is another key perspective that needs to be addressed for data privacy. Risk management involves the legal purpose of processing data. It also includes tracking who has access to data, controls for accessing data, what to do in the event of a breach, and minimization, anonymization, and pseudonymization of data. Lastly, measures for data accuracy are developed here.
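Pseudonymization, mentioned above, can be sketched with Python's standard library: a keyed hash replaces an identifier with a stable token that cannot be reversed by anyone holding only the data. The key below is a placeholder; in practice it would be stored separately from the data:

```python
# Keyed-hash pseudonymization sketch using the stdlib (hmac/hashlib).
# The secret key is an illustrative placeholder, kept apart from the data.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(value):
    """Replace an identifier with a stable, non-reversible pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"user": pseudonymize("alice@example.com"), "purchase": 42.50}
print(record["user"])  # same input always yields the same pseudonym
```

Because the same input always maps to the same pseudonym, analysts can still join records per person without ever seeing the real identifier.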


Data Catalogs and Data Silos

Data catalogs and data silos are two ideas that are commonly associated with data governance. In this post, we will look at these two terms by defining them and sharing how to implement or prevent them.

Data Catalogs

Data catalogs are a rather recent phenomenon. They were first developed in the 2010s, though their exact origins are unclear. A data catalog is a reference application that contains metadata on the various datasets within an organization. Usually, this document is in a searchable format so that people can find datasets they may need within an organization.


The data catalog essentially tracks available data within an organization. The main reason for tracking data is to prevent data from being lost or hidden. Within a data governance framework, data is considered an asset. Therefore, just as an organization prevents the loss of inventory because of its monetary value, the data catalog prevents the loss of data’s decision-making value within an organization.


There are also several tips for developing and using data catalogs. For example, a data catalog should track the roles of various people concerning individual datasets. Roles can include who is the owner of the data, the steward, the custodian, etc. Tracking roles helps in assigning responsibility for data.

Another tip is to develop data dictionaries alongside the data catalog. Data dictionaries contain metadata not for all data but just for one dataset. An analogy would be maps. Some maps cover the whole world, like a data catalog, while other maps only cover a city or county, like a data dictionary. The data dictionary is useful when an analyst needs more information while preparing to use data.

It is also important to make the data catalog user-friendly. Making a data catalog user-friendly for stakeholders involves the support of IT with a strong concern for the user experience. Nobody will use a data catalog if its user interface is poor; without a good design, the only fallback is lots and lots of training.
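The catalog ideas above can be sketched as a small searchable structure: each entry holds metadata about one dataset, including the role assignments mentioned earlier. The entries, owners, and tags are invented for illustration:

```python
# Minimal searchable data catalog: one metadata entry per dataset.
# Dataset names, roles, and tags are illustrative assumptions.

CATALOG = [
    {"name": "customer_orders", "owner": "sales", "steward": "j.doe",
     "tags": ["orders", "revenue", "pii"]},
    {"name": "web_traffic", "owner": "marketing", "steward": "a.smith",
     "tags": ["clicks", "sessions"]},
]

def search_catalog(term):
    """Return entries whose name or tags mention the search term."""
    term = term.lower()
    return [e for e in CATALOG
            if term in e["name"].lower() or term in e["tags"]]

print([e["name"] for e in search_catalog("pii")])  # ['customer_orders']
```

Even this toy version shows the two jobs a catalog does: making datasets findable and recording who is responsible for each one.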

Data Silos

Data catalogs help to prevent what are called data silos. Data silos are sources of data that are controlled in an isolated place within an organization. When silos develop, they can lead to analyses that are incomplete because of incomplete data. In addition, silos can lead to a breakdown in collaboration, which can cause duplication of effort and reduced productivity. Lastly, people within an organization may also struggle to find the data needed for analysis.

Data silos are often developed in organizations that have a decentralized IT strategy. A decentralized approach frequently leads to every department doing what they want in terms of data storage and technology utilization which is chaotic. Other motivations for data silos can include a lack of common goals when it comes to data management. No goals means everyone does what they want.

Breaking Silos

Two main ways of breaking data silos are the development of data governance and data integration. One step in data governance is developing a data catalog, as mentioned earlier. Once a data catalog is developed, the team can start to create data governance policies and standards to establish expectations regarding data use and storage.

A second strategy, related to the first, is data integration. Data integration is the process of combining data from different tables into one. Upon completing this, more analysis can take place. Combined data is also harder to silo, because it must remain available for use across the organization.
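A minimal sketch of data integration is an inner join: two departmental tables are combined on a shared key so that analysis can draw on both. The table contents and key name are illustrative assumptions:

```python
# Data-integration sketch: inner-join two departmental tables on a
# shared key. The records below are invented for illustration.

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ben"}]
orders = [{"id": 1, "total": 30.0}, {"id": 2, "total": 55.0}]

def integrate(left, right, key="id"):
    """Combine two lists of records wherever the key matches in both."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

merged = integrate(customers, orders)
print(merged)  # each customer record now carries their order total
```

In practice this is what tools like SQL joins or ETL pipelines do at scale; the governance point is that integrated data stops living in one department's silo.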


Data catalogs and silos are a part of the daily life of the information professional. Therefore, in the context of data governance, it is important to be familiar with these two terms so that support can be provided.


Data Privacy

A field closely related to data governance is data privacy. In this post, we will look at what data privacy is as well as principles that need to be kept in mind when trying to keep people’s data private.

Data Privacy

Privacy is a term that is difficult to define. For our purposes, data privacy is the amount of control a person has over personal information in terms of how this information is collected, managed, and stored. This definition gives the impression that people have little data privacy because we are so often compelled to share our information online.


Websites often require some surrendering of personally identifiable information (PII) such as name, address, phone number, etc., while in the medical field, there is demand for personal health information (PHI). Sharing information about yourself can be frustrating for many but is the cost of doing business online. Naturally, once these various online companies have your data, they must be sure to protect it.

Data security is not about collecting or managing data. Rather, data security is focused on the protection of data from unauthorized access. Securing data is critical to protect individuals and organizations from harm because of security breaches. For example, there can be serious financial repercussions if someone’s credit card number is stolen online.

Fair Information Practice Principles

With all the concerns regarding data privacy, it was natural that frameworks would be developed to help organizations with data privacy. One such framework is the Fair Information Practice Principles (FIPPs), based on guidelines developed by the Organisation for Economic Co-operation and Development (OECD) back in the early 1980s. Below are the eight principles in this framework.

  1. Limits on data collection-Every organization needs to determine the smallest amount of data it can collect while still maintaining success.
  2. Data quality-Data that is collected needs to be accurate and pertinent to the purposes of the organization.
  3. Purpose determination-There must be a clear, compelling reason to collect data.
  4. Limits of use-Personal data must only be used for its intended purpose.
  5. Security-Data must be protected.
  6. Transparency-People should know that their data is being collected.
  7. Individual participation-People whose data has been collected have the right to access their data and have it corrected and/or erased.
  8. Accountability-Whoever collects the data is responsible for adhering to the principles listed above.
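The first principle, limits on data collection, can be sketched directly: accept only the fields the stated purpose requires and silently drop everything else. The required-field list is a hypothetical example:

```python
# Data-minimization sketch for FIPPs principle 1: store only what the
# stated purpose requires. The field list is a hypothetical example.

REQUIRED_FIELDS = {"email", "shipping_address"}  # minimum for fulfilment

def minimize(submission):
    """Keep only the fields needed for the stated purpose."""
    return {k: v for k, v in submission.items() if k in REQUIRED_FIELDS}

form = {"email": "a@b.com", "shipping_address": "1 Main St",
        "birthday": "1990-01-01", "income": "50k"}
print(minimize(form))  # birthday and income are never stored
```

Data that is never collected can never be breached, which is why minimization sits first in the list.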

The principles shared above have been adopted by many organizations to provide a foundation on which they can develop their own data privacy policies and philosophy.


Data privacy is a major concern in the world today. Organizations whether online or offline continue to demand more information about their customers. As such, this implies that there must be safeguards in place to ensure the protection of this information.


Defense & Offense with Data

Within the field of data governance, there are different ways of approaching data and the definition of truth. In this post, we will look at different approaches to data and also how truth can be defined within a data governance framework.


A defense approach to data is focused on controlling data. This can involve security and stringent governance of data through a highly centralized setting. In addition, the defensive data approach is concerned with minimizing risk and ensuring compliance with standards and expectations. Preventing theft and tracking the flow of data through an organization is also important.


When analytics are used they are used to detect fraud and unusual activity. How defensive an organization is depends on the field or industry. For example, banking and health care are highly defensive due to the type of data they gather.


An offensive approach to data is focused on developing insights with data. The goal is not to protect but to develop insights for decision-making. An offensive approach to data is characterized by flexibility and a focus on the customer. This style of approaching data generally emphasizes a decentralized style of data governance.

Organizations that find themselves in highly competitive environments are often forced to become more offensive as they search for insights to maximize profits. How offensive or defensive an organization needs to be does vary. However, in general, most if not all organizations start defensive and slowly become more offensive in nature.


Whether the approach to data is offensive or defensive, it is important to determine what the truth is when it comes to data in an organization. Every organization needs a single source of truth (SSOT) for critical data. The SSOT is language used within data that is the same across an organization. For example, sometimes the same name can be entered in multiple different ways in an organization’s data. Take the company AT&T as an example; it could be entered in some of the following ways:

AT&T
ATT
A.T.&T.
AT and T
Each of the examples above can be considered different and can lead to chaos when it is time to analyze data for insights. This is because redundant names can lead to redundant costs. For example, if AT&T was a vendor for our fictitious company there might be several different contracts with AT&T with several different divisions who all spell AT&T differently. To prevent this the SSOT will define the one way to code AT&T into the system and determine what it represents.
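The SSOT idea can be sketched as a canonical-name lookup applied before data enters the system; the variant spellings and mapping below are illustrative assumptions:

```python
# SSOT sketch: map known spelling variants of a vendor name to the one
# agreed-upon form. The variant list is illustrative, not exhaustive.

CANONICAL = {
    "at&t": "AT&T",
    "att": "AT&T",
    "a.t.&t.": "AT&T",
    "at and t": "AT&T",
}

def canonicalize(name):
    """Return the single agreed-upon form of a vendor name."""
    return CANONICAL.get(name.strip().lower(), name.strip())

print(canonicalize("AT and T"))  # 'AT&T'
print(canonicalize("ATT"))      # 'AT&T'
```

Unknown names pass through unchanged so they can be flagged for review rather than silently mangled.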

However, keeping the offensive approach to data in mind, there are times when, for the purposes of analysis, the SSOT can be modified. Doing this leads to what are called multiple versions of the truth (MVOT). An example of MVOT is a department that classifies our example of AT&T in a different way from the SSOT. Accounting might see AT&T as a vendor while marketing might see AT&T as their internet provider, etc. Since everyone knows what the SSOT is, they are aware when they create an MVOT for their distinct purpose.


Each organization needs to decide for itself what approach to data it wants to take. There is no right or wrong way to approach data; it really depends on the situation. In addition, every organization needs to determine for itself how it will define truth, and there is no single way to do this either. What organizations need to do is address these two topics in a way that is satisfying for them.


Juvenile Justice Programs, Practices, & Policies

In this post, we will look at juvenile justice programs, practices, and policies. Each of these terms plays a different role within the juvenile justice system.


A program in juvenile justice is a designed package that has clear procedures for delivery, has manuals and provides technical assistance. In addition, the outcome is commonly related to some sort of change such as recidivism. Two commonly implemented programs are multisystemic therapy and functional family therapy.


Another key aspect of a program within juvenile justice is its design. Programs are generally designed for specific use through the use of a logic model. A logic model is a visual depiction that shows the relationship among the resources, activities, outputs, and outcomes of a program. In other forms of social science research, the logic model is called a conceptual framework; however, these two concepts are not exactly the same. Logic models depict relationships, while conceptual frameworks propose a theory that attempts to explain why certain outcomes take place.


Practices in juvenile justice are not as clearly defined as programs are. In general, practices in juvenile justice are essentially programs that are more flexible in their application and use. For example, the design may not be as rigorous and the instructions may not be as detailed.

Due to their more flexible nature practices are often more general in nature and can thus be applied in different situations. To make things more confusing some programs are considered practices if they are more flexible than highly controlled programs. One common practice is the Treatment in Secure Corrections for Serious Juvenile Offenders.


Policies are regulations that apply to the general population. Policies generally lack empirical evidence for their usefulness in supporting youth. However, policies do provide guidance and structure which shows that they serve a different role than what is found with programs and practices.

Evaluating Programs and Practices

There are times when programs and practices are evaluated for their usefulness. Below are some commonly used ways to evaluate programs and practices.

One way to evaluate a program is by the quality of the evidence or data. For example, randomized controlled experiments are considered the gold standard. Therefore, other methods of collecting data, such as quasi-experiments and surveys, will affect the perceived quality of a program.

A second criterion is the quality and extensiveness of the research on the program. What this means is the quality and quantity of research that has assessed the value of a program. If a program has multiple high-quality studies that attest to its worth, it raises the value of the program.

A third criterion is the expected impact of a program. By expected impact, what is meant is the effect size. The effect size is a number extracted during the data analysis stage of a study that quantifies the impact of a program. Programs with stronger effect sizes are seen as better.
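One widely used effect-size measure is Cohen's d, the difference between two group means divided by a pooled standard deviation. The sketch below uses made-up treatment and control scores purely to show the calculation:

```python
# Cohen's d from two groups' scores; the sample data is invented.
# Uses the pooled standard deviation of the two groups.

from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized difference between two group means."""
    na, nb = len(group_a), len(group_b)
    pooled_sd = (((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

treatment = [12, 14, 15, 13, 16]
control = [10, 11, 12, 10, 13]
print(round(cohens_d(treatment, control), 2))  # 1.93, a large effect
```

By a common rule of thumb, values around 0.2 are small, 0.5 medium, and 0.8 or above large, which is why reviewers favor programs with stronger effect sizes.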

Finally, another criterion for program/practice quality is the adoption rate. In other words, how many other people are using the program/practice? Tracking adoption is harder, but there are program registries that have vetted programs and recommend them. Examples of program registries include CrimeSolutions and Blueprints. Both of these registries have graded programs and provide links to studies about the programs.


Programs, practices, and policies all play a critical role in helping youth in the juvenile justice system. People who work within this system need to be aware of the meaning of these terms as well as how to judge good from bad programs. 


Data Governance Methodology

Data governance is becoming more and more common in today’s world. In this post, we will look at one commonly used process of implementing data governance. The steps are explained below.

Scope & Initiation

The first step in setting up a data governance system is to determine the scope of data governance. By scope, it is meant how deep and wide the program will be. In other words, you have to determine what will be governed and how thoroughly it will be governed.

It may surprise some that not all data is governed by data governance. For each organization, it will be different but generally, all organizations have data that is excluded from data governance. For example, some organizations will include emails under data governance while others will not. It depends on the situation and there is no single rule.


In addition, it is important to determine how thorough the governance will be. An example of this would be the tolerance for data quality issues. There are times where some data errors are permissible as long as they do not exceed a certain threshold, but this also depends on the context.
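A tolerance threshold like this is easy to express in code: a dataset is only flagged when its error rate exceeds an agreed limit. The 2% figure below is an assumption chosen for illustration:

```python
# Data-quality tolerance sketch: flag a dataset only when its error
# rate exceeds an agreed threshold. The 2% limit is an assumption.

ERROR_TOLERANCE = 0.02  # up to 2% bad records are acceptable

def within_tolerance(error_count, total_records):
    """Return True if the error rate is at or under the threshold."""
    return (error_count / total_records) <= ERROR_TOLERANCE

print(within_tolerance(15, 1000))  # True: 1.5% is under the threshold
print(within_tolerance(30, 1000))  # False: 3% exceeds it
```

Writing the threshold down as a number, rather than leaving "acceptable quality" vague, is what makes the governance scope enforceable.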


Assessment

At the assessment stage, the purpose is to determine an organization’s ability to govern data and be governed by policies. Generally, there are three ways of assessing this: measuring the capacity to change, the culture of data use, and the ability to collaborate.

The capacity to change is self-explanatory and is a measure of an organization’s ability to accept new policies such as data governance policies. The data use culture is looking at how an organization uses data at that moment. Lastly, collaboration looks at how well people within the organization can work together. Collaboration is critical because data governance generally affects the entire organization and people from multiple departments must work together.


Vision

The vision is where terms are defined and the steps going forward are set. For example, the organization needs to define what data governance is for them. In addition, requirements for doing data governance are also developed.

Vision setting is a theoretical experience and this is often boring for the more practical action-oriented individuals. However, setting the vision sets the tone for the rest of the project. Therefore, this must be planned and developed.

Align & Business Value

The align and business value step is for determining the financial value of incorporating data governance into an organization and also refining how things will be measured. For profit-seeking organizations, business value is critical. Most projects need to make, or at least save, money in this setting. For non-profit organizations, the motivation might be to increase efficiency or the ability to better serve stakeholders.

It’s not enough to talk about savings. Evidence must be provided for determining actual savings. This is where metrics come into play. There must be ways to measure the value of a data governance project. Again, how to do this will vary from place to place but it needs to be addressed.

Functional Design

Functional design is focused on the actual process of doing data governance. What will be done must be determined, as well as the roles that support this process. Principles are often developed at this step; principles are similar to goals in terms of what is expected from implementing data governance. Following principles, the next things developed are standards, which are similar to objectives in education in that they involve some sort of measurable action.

Best practices often encourage data governance to be embedded within existing roles and responsibilities. In other words, setting up another department within an organization and calling it data governance is generally not considered the best way to make this happen.

Governing Framework Design

Once the plan has been developed, it is time to find the people who will implement it. The governing framework involves assigning processes to people and setting up the various roles associated with data governance. Generally, a lot of the aspects of data governance are already being done at an organization, but in a disjointed, unaware way. Therefore, the main benefit here is not so much to give out more work but rather to make it clear who is already doing what and to make sure they are aware of it.

Road Map

The road map step involves data governance going live. This is the point where data governance is integrated into the existing organization. Other things that are done at this step are designing metrics and reporting requirements. In other words, how good or bad does performance have to be on a standard and how will this be reported?

Change management is also addressed here and involves dealing with resistance and making sure that the scope and or goals of the project do not change. There are times when a project will wander from its original purpose which can be frustrating for people.

Rollout and Sustain

Roll out and sustain involves executing the plan and checking its effectiveness. Essentially, this step involves monitoring the data governance implementation and making corrections as necessary.


Data governance is a critical part of most organizations today. However, it can be tricky to figure out how to make this a part of an organization. The information above provides an example of how this could be done.


Types of Justice in the Classroom

Justice can look many different ways. In this post, we will look at three different forms of justice procedural, substantive, and negotiated. In particular, we will look at how these different forms of justice work within the classroom.

Procedural Justice

Procedural justice means that the disciplinary power of the teacher is only used within the constraints of the policies and rules of the school. For example, most schools do not allow corporal punishment. What this means is that a teacher who makes the decision to spank a student has violated what is considered to be an acceptable process for discipline within that school.


Procedural justice also has to do with maintaining fairness. In other words, rules cannot be randomly enforced based on a teacher’s mood. When teachers are not consistent in the application and enforcement of rules it gives the appearance of unfairness and injustice to the students. When this happens it can trigger even more undesirable behavior from students.

However, everyone has their moments of inconsistency, including teachers. Therefore, when a teacher makes a mistake in procedural justice, it is wise to acknowledge the mistake and make efforts to correct the misstep. Doing this will help students maintain faith in a system that tries to correct its own mistakes.

Substantive Justice

Substantive justice concerns the unequal impact that enforcing rules has on different groups. A common example of substantive justice in the classroom is the disproportionate amount of trouble males and minorities get into within the classroom.

Dealing with race and gender are both highly controversial topics. Therefore, teachers must be careful to be aware of these two demographic traits of their students. The perception of differences in justice due to substantive differences in demographic traits could lead to serious accusations and headaches.

Negotiated Justice

Negotiated justice is the process of how justice is discovered and carried out. A practical example would be a court trial. During the trial, the truth is sought so that justice can be delivered. In the classroom, there are many different ways in which teachers uncover what to do when it is time to administer justice.

For example, in some classes, a teacher will have both parties sit down and discuss what happened. In other classes, the students may be sent to the office to work out their disagreement. If the teacher witnessed what happened, there may be no questioning at all.

The ultimate point here is that a teacher needs to be aware of how they go about determining guilt and innocence in their classroom. At times, the emotions of teachers will overwhelm them and they may make just or unjust decisions without knowing how they made their decision. Naturally, we want to avoid unjust decisions but no matter what decision was made it is important to be aware of how the decision was developed.


Teachers must be careful with how they deal with justice in their classrooms. There is always a danger of being accused of oppression when you have power and authority over others. Awareness is at least one way that this problem can be avoided.


Views of Punishment

Punishment is a part of juvenile justice. However, as with most ideas and concepts, there is disagreement over the role and function of punishment. In this post, we will look at common positions in relation to punishment.


The reductivist position on punishment views punishment as a means to prevent future crimes. This approach is based on a utilitarian position of causing the most happiness for the most people. By focusing on future crimes it is believed that preventing these crimes will bring the most harmony and happiness to people rather than looking at what has already happened.


Several strategies are used to support a reductivist approach. One example is deterrence, which is the use of punishment to prevent crime by instilling fear. Capital punishment is a classic case: the thought is that hanging or public execution will motivate others to be good. Other forms of deterrence used today include boot camps, which are meant to whip delinquent youths into shape, and, in some countries, corporal punishment such as caning to maintain order.

Another manifestation of reductivism is reform-rehabilitation. Reform historically meant hard labor, such as working in a chain gang, along with religious instruction. Rehabilitation involves treatment for some vice that may have led to incarceration, such as substance abuse or sexual offending. The assumption is that there is something wrong with the prisoner that can be fixed through treatment. Again, the motivation behind reform and rehabilitation is to change the person for the benefit of society.

A final form of reductivism is incapacitation. Incapacitation is simply a strategy of keeping offenders locked up to protect the public. One way this was done was through the three strikes law used in parts of the United States. Once a person committed a third felony the sentencing could be 25 years to life.


Retributivism

The retributivist position looks to punish people for crimes already committed with no regard for the future. In other words, retributivists focus on the past while reductivists focus on the future. Punishment should restore equilibrium and focus on what is right to do rather than what is good to do (the utilitarian position). The reason for this distinction is that right and wrong are more immovable than what makes people happy.

The main strategy for retribution is just deserts. Just deserts is the practice of punishing people for the crimes they have committed and doing no more. As such, people who hold a retributivist perspective do not support three-strikes laws, deterrence, or similar methods.


The point is not to state that one of these positions is superior to the other. Rather, the goal is to explain these two different positions and inform the reader about them. There are times and circumstances in which one position would be better than the other.


Terms Related to Data Storage

There are several different terms used when referring to data within an organization that can become confusing for people who are not experts in this field. In this post, we will look at various terms that are often misused in the field of data management.


Database

Databases are for structured data, which is data organized in rows and columns. Among the many benefits of using a database over an Excel spreadsheet is that databases can hold almost limitless amounts of data. In addition, databases can have multiple users querying and inputting data at the same time, which is not possible with a spreadsheet.
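As a minimal sketch of what structured data in a database looks like, the snippet below uses Python's built-in sqlite3 module; the table, columns, and values are invented for illustration.

```python
import sqlite3

# Create an in-memory database with one structured table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1001, "Alice"), (1002, "Bob"), (1003, "Carol")],
)

# Unlike a spreadsheet, the data can be queried declaratively
rows = conn.execute(
    "SELECT name FROM students WHERE student_id > 1001 ORDER BY student_id"
).fetchall()
print(rows)  # [('Bob',), ('Carol',)]
```

A real deployment would of course use a server-based database so that many users can read and write concurrently, but the row-and-column structure is the same.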

Data Warehouse

A data warehouse is a computer system designed to store and analyze large amounts of data for an organization. The data for a data warehouse can come from various areas within the organization. Since the data comes from many different places it also helps to integrate data for the purpose of analysis which is valuable for decision-making and insights.


Data warehouses take pressure off databases by providing another location for data. However, because of their size, often over 100 GB, data warehouses are hard to change once they are up and running. Therefore, great care is needed when developing and using this tool.

Data Marts

Data marts are similar to data warehouses with the main difference being the scope. Like data warehouses, data marts are also databases. However, data marts are focused on one subject or department whereas data warehouses gather data from all over an organization. For example, a school might have a data warehouse for all student data while it has a data mart that only holds student classes and grades.

Since they have a focus on a given subject, data marts are generally smaller than data warehouses at less than 100 GB. The rationale of a data mart is that analytic teams can focus when trying to develop insights rather than searching through a larger data warehouse.

Data Lake

Data lakes are also similar to data warehouses. Just like a data warehouse data lakes contain data from all over the organization from many sources. Data lakes are also generally larger than 100 GB. One of the main differences is that data lakes contain structured and unstructured data. Unstructured data is data that does not fit into rows and columns. Examples can include video data, social media, and images.

Another purpose for a data lake is to have a place for keeping data that may not have a specific purpose yet. Another way to think of this is to consider a data lake a historical repository of data. Due to their multipurpose nature, data lakes often have a less complex structure in comparison to data warehouses.


All of the various data products discussed here work together to give an organization access to its data. It is important to understand these different terms because it is common for people to use them interchangeably to the confusion of everyone involved. With consistent terminology, everyone can be on the same page when it comes to delivering value through using data.


Types of Data Quality Rules

Data quality rules are for protecting data from errors. In this post, we will learn about different data quality rules. In addition, we will look at tools used in connection with data quality rules.


Detective Rules

Detective rules monitor data after it has already moved through a pipeline and is being used by the organization. Detective rules are generally used when the issues being detected are not causing a major problem, when the issue cannot be solved quickly, and when a limited number of records are affected.

Of course, all of the criteria listed above are relative. In other words, it is up to the organization to determine what thresholds are needed for a data quality rule to be considered a detective rule.


An example of a detective data quality rule may be flagging a student information table that is missing a student's uniform size. Such information is useful but probably not important enough to stop the data from moving to others for use.


Preventive Rules

Preventive data quality rules stop data in the pipeline when issues are found. Preventive rules are used when the data is too important to allow errors, when the problem is easy to fix, and/or when the issue is affecting a large number of records. Again, all of these criteria are relative to the organization.

An example of a violation of a data quality prevention rule would be a student records table missing student ID numbers. Generally, such information is needed to identify students and make joins between tables. Therefore, such a problem would need to be fixed immediately.

Thresholds & Anomaly Detection

There are several tools for implementing detection and prevention data quality rules. Among the choices are the setting of thresholds and the use of anomaly detection.

Thresholds are actions that are triggered after a certain number of errors occur. It is totally up to the organization to determine how to set up its thresholds. Common levels include no action, warning, alert, and prevention. Each level has a minimum number of errors that must occur before the information is passed on to the user or IT.

To make things more complicated, you can tie threshold levels to detective and preventive rules. For example, if a dataset has 5% missing data, it might only be flagged at the warning threshold. However, if the missing data jumps to 10%, it might now violate a preventive rule because the prevention level has been reached.
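The tiered thresholds described above can be sketched as a simple lookup function; the cutoff values below are purely illustrative, since each organization sets its own.

```python
def threshold_level(missing_pct, warning_at=0.05, prevention_at=0.10):
    """Map a missing-data rate to a threshold level.

    The cutoffs are illustrative defaults, not standard values.
    """
    if missing_pct >= prevention_at:
        return "prevention"   # preventive rule: stop the pipeline
    if missing_pct >= warning_at:
        return "warning"      # detective rule: flag but let data flow
    return "no action"

print(threshold_level(0.02))  # no action
print(threshold_level(0.05))  # warning
print(threshold_level(0.12))  # prevention
```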

Anomaly detection can be used to find outliers, and unusual records can be flagged for review. For example, suppose a university has an active student who was born in 1920. Such a birthdate is highly unusual, and the rule should flag it as an outlier. After review, IT can decide whether it is necessary to edit the record. Anomaly detection can be used to detect or prevent data errors, and thresholds can be set for it as well.
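A very simple version of the birthdate check above can be written as a rule that flags implausible ages; the age cutoffs are assumptions chosen for illustration, not standard values.

```python
from datetime import date

def flag_unusual_birthdates(birthdates, max_age=100, min_age=10):
    """Flag birthdates implying an implausible age for an active student.

    The min_age/max_age cutoffs are illustrative assumptions.
    """
    today = date.today()
    flagged = []
    for bd in birthdates:
        age = today.year - bd.year
        if age > max_age or age < min_age:
            flagged.append(bd)
    return flagged

records = [date(1920, 1, 1), date(2001, 6, 15), date(2003, 3, 9)]
print(flag_unusual_birthdates(records))  # flags only the 1920 birthdate
```

Production anomaly detection would typically use statistical methods rather than fixed cutoffs, but the idea of flagging records for human review is the same.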


Data quality rules can be developed to monitor the state of data within a system. Once the rules are developed, it is important to determine whether they are detective or preventive. The main reason for this is that the type of rule affects the urgency with which a problem needs to be addressed.


Data Profile

One aspect of the data governance experience is data profiling. In this post we will look at what a data profile is, an example of a simple data profile, and the development of rules that are related to the data profile.


Data profiling is the process of running descriptive statistics on a dataset to develop insights about the data and field dependencies. Some questions that are commonly asked when performing a data profile include:

  • How many observations are in the dataset?
  • What are the min and max values of a column?
  • How many observations have a particular column populated with a value (missing vs. non-missing data)?
  • When one column is populated, what other columns are populated?

Data profiling helps you to confirm what you know and do not know about your data. This knowledge will help you to determine issues with your data quality and to develop rules to assess data quality.
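The profiling questions above can be answered with a short script; the tiny student table and the helper function below are invented for illustration.

```python
# A tiny, made-up student table; None marks missing values
rows = [
    {"student_id": 1001, "first_name": "Alice", "birthdate": "04/04/2000"},
    {"student_id": 1002, "first_name": "Bob",   "birthdate": None},
    {"student_id": 1003, "first_name": None,    "birthdate": "01/01/2005"},
]

def profile(rows, column):
    """Basic profile of one column: row count, populated %, min, max."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "rows": len(rows),
        "populated_pct": round(100 * len(values) / len(rows)),
        "min": min(values),
        "max": max(values),
    }

print(profile(rows, "student_id"))
# {'rows': 3, 'populated_pct': 100, 'min': 1001, 'max': 1003}
```

At scale, a profiling library would be used instead, but the questions it answers are the same ones listed above.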

Student Records Table


The first column from the left is the student ID. Looking at this column, we can see that there are five records with data. The column is numeric with 4 characters. The minimum value is 1001 and the maximum value is 1005.

The next two columns are first name and last name. Both of these columns are string text with a min character length of 5 and a max length of 7 for first name and 5 for last name. For both columns, 80% of the records are populated with a value. In addition, 60% of the records have a first name and a last name.


The fourth column is the birthdate. This column has populated records 80% of the time and all rows follow a MM/DD/YYYY format. The minimum value is 04/04/2000 and the max value is 01/01/2005. 40% of the rows have a first name, last name, and birthdate.

Lastly, 100% of the class-level column is populated with values. 20% of the values are senior, 40% are junior, 20% are sophomore, and 20% are freshman.

Developing Data Quality Rules

From the insights derived from the data profile, we can now develop some rules to ensure quality. With any analysis or insight the actual rules will vary from place to place based on needs and context but below are some examples for demonstration purposes.

  • All StudentID values must be 4 numeric characters
  • All StudentID values must be populated
  • All StudentFirstName values must be 1-10 characters in length
  • All StudentLastName values must be 1-10 characters in length
  • All StudentBirthdate values must be in MM/DD/YYYY format
  • All StudentClassLevel values must be Freshman, Sophomore, Junior, or Senior
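Rules like these can be sketched as a validation function; the field names follow the example above, and the logic is a simplified illustration rather than a production implementation.

```python
import re

def check_record(record):
    """Return a list of rule violations for one student record."""
    errors = []
    if not re.fullmatch(r"\d{4}", str(record.get("StudentID", ""))):
        errors.append("StudentID must be 4 numeric characters and populated")
    for field in ("StudentFirstName", "StudentLastName"):
        value = record.get(field) or ""
        if not 1 <= len(value) <= 10:
            errors.append(f"{field} must be 1-10 characters in length")
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", record.get("StudentBirthdate") or ""):
        errors.append("StudentBirthdate must be in MM/DD/YYYY format")
    if record.get("StudentClassLevel") not in {"Freshman", "Sophomore", "Junior", "Senior"}:
        errors.append("StudentClassLevel must be a valid class level")
    return errors

good = {"StudentID": 1001, "StudentFirstName": "Alice", "StudentLastName": "Smith",
        "StudentBirthdate": "04/04/2000", "StudentClassLevel": "Senior"}
print(check_record(good))  # []
bad = dict(good, StudentID=12, StudentClassLevel="Grad")
print(check_record(bad))   # two violations
```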


A data profile can be much more in-depth than the example presented here. However, if you have hundreds of tables and dozens of databases this can be quite a labor-intensive experience. There is software available to help with this but a discussion of that will have to wait for the future.


Data Quality

Bad data leads to bad decisions. However, the question is how you can know if your data is bad. One answer to this question is the use of data quality metrics. In this post, we will look at a definition of data quality as well as metrics of data quality.


Data quality is a measure of the degree that data is appropriate for its intended purpose. In other words, it is the context in which the data is used that determines if it is of high quality. For example, knowing email addresses may be appropriate in one instance but inappropriate in another instance.


When data is determined to be of high quality it helps to encourage trust in the data. Developing this trust is critical for decision-makers to have confidence in the actions they choose to take based on the data that they have. Therefore data quality is of critical importance for an organization and below are several measures of data quality.

Measuring Data Quality

Completeness is a measure of the degree to which expected columns (variables) and rows (observations) are present. There are times when data can be incomplete due to missing values and/or missing variables. There can also be data that is partially complete, which means that data is present in some columns but not others. There are various tools for finding this type of missing data in whatever language you are using.

Validity is a measure of how appropriate the data is in comparison to what the data is supposed to represent. For example, suppose a column in a dataset measures the class level of high school students using Freshman, Sophomore, Junior, and Senior. Data would be invalid if it used the numerical values for the grade levels, such as 9, 10, 11, and 12. This is only invalid because of the context and the assumptions that are brought to the data quality test.

Uniqueness is a measure of duplicate values. Normally, duplicate values happen along rows in structured data which indicates that the same observation appears twice or more. However, it is possible to have duplicate columns or variables in a dataset. Having duplicate variables can cause confusion and erroneous conclusions in statistical models such as regression.

Consistency is a measure of whether data is the same across all instances. For example, there are times when a dataset is refreshed overnight or on some other schedule. The expectation is that the data should be mostly the same except for the new values, and a consistency check would assess this. There are also times when thresholds are put in place so that the data can differ a little based on the parameters that are set.

Timeliness is the availability of the data. For example, if data is supposed to be ready by midnight any data that comes after this time fails the timeliness criteria. Data has to be ready when it is supposed to be. This is critical for real-time applications in which people or applications are waiting for data.

Accuracy is the correctness of the data. The main challenge of this is that there is an assumption that the ground truth is known to make the comparison. If a ground truth is available the data is compared to the truth to determine the accuracy.
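Several of these metrics are straightforward to compute. Below is a minimal sketch of completeness and uniqueness checks; the sample records are invented for illustration.

```python
def completeness(rows, columns):
    """Share of expected cells that are populated (non-None)."""
    total = len(rows) * len(columns)
    filled = sum(1 for r in rows for c in columns if r.get(c) is not None)
    return filled / total

def uniqueness(rows, key):
    """Share of distinct key values among all rows; 1.0 means no duplicates."""
    values = [r[key] for r in rows]
    return len(set(values)) / len(values)

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},            # one missing cell
    {"id": 2, "email": "c@example.com"}, # duplicate id
]
print(completeness(rows, ["id", "email"]))  # 5 of 6 cells populated
print(uniqueness(rows, "id"))               # 2 distinct ids out of 3 rows
```

Validity, consistency, timeliness, and accuracy checks follow the same pattern of comparing the data against an expectation, though each requires its own reference point.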


The metrics shared here are for helping the analyst to determine the quality of their data. For each of these metrics, there are practical ways to assess them using a variety of tools. With this knowledge, you can be sure of the quality of your data.

Reciprocal Teaching VIDEO

One goal of many teachers is to help their students become independent and self-directed learners. One tool for achieving autonomous learners is the use of reciprocal teaching. The video below explains the steps involved in utilizing reciprocal teaching in the classroom.


Data Governance Solutions

Data governance is good at indicating various problems an organization may have with data. However, finding problems doesn’t help as much as finding solutions does. This post will look at several different data governance solutions that deal with different problems.

Business Glossary

The business glossary contains standard descriptions and definitions. It also can contain business terms or discipline-specific terminology. One of the main benefits of developing a business glossary is creating a common vocabulary within the organization.

Many if not all businesses and fields of study have several different terms that mean the same thing. In addition, people can be careless with terminology, to the confusion of outsiders. Lastly, sometimes a local organization will have its own unique terminology. No matter the case, the business glossary helps everyone within an organization to communicate with one another.


An example of a term in a business glossary might be how a school defines a student ID number. The glossary explains what the student ID number is and provides uses of the ID number within the school.

Data Dictionary

The data dictionary provides technical information. Some of the information in the data dictionary can include the location of data, relationships between tables, values, and usage of data. One benefit of the data dictionary is that it promotes consistency and transparency concerning data.

Returning to our student ID number example, a data dictionary would share where the student ID number is stored and the characteristics of this column such as the ID number being 7 digits. For a categorical variable, the data dictionary may explain what values are contained within the variable such as male and female for gender.
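To make this concrete, a data dictionary entry for the student ID column might look something like the structure below; the field names and values are hypothetical, not a standard schema.

```python
# A hypothetical data dictionary entry for the student ID column;
# every field name and value here is illustrative
student_id_entry = {
    "column": "StudentID",
    "table": "student_records",
    "location": "registrar data warehouse",
    "data_type": "numeric",
    "length": 7,
    "description": "Unique identifier assigned to each student at enrollment",
    "allowed_values": "7-digit integers",
    "related_tables": ["enrollment", "grades"],  # joined on StudentID
}

def describe(entry):
    """Render a one-line summary of a dictionary entry."""
    return f"{entry['table']}.{entry['column']}: {entry['description']}"

print(describe(student_id_entry))
```

In practice such entries live in a dedicated metadata tool rather than code, but the content captured is the same.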

Data Catalog

A data catalog is a tool for metadata management. It provides an organized inventory of data within the organization. Benefits of a data catalog include improving efficiency and transparency, quick locating of data, collaboration, and data sharing.

An example of a data catalog would be a document that contains the metadata about several different data warehouses or sources within an organization. If a data analyst is trying to figure out where data on student ID numbers are stored they may start with the data catalog to determine where this data is. The data dictionary will explain the characteristics of the student ID column. Sometimes the data dictionary and catalog can be one document if tracking the data in an organization is not too complicated. The point is that the distinction between these solutions is not obvious and is really up to the organization.

Automated Data Lineage

Data lineage describes how data moves within an organization from production to transformation and finally to loading. Tracking this process is really complicated and time-consuming and many organizations have turned to software to complete this.

The primary benefit of tracking data lineage is increasing the trust and accuracy of the data. If there are any problems in the pipeline, data lineage can help to determine where the errors are creeping into the pipeline.

Data Protection, Privacy, Quality

Data protection is about securing the data so that it is not tampered with in an unauthorized manner. An example of data protection would be implementing access capabilities such as user roles and passwords.

Data privacy is related to protection and involves making sure that information is restricted to authorized personnel. Thus, this also requires the use of logins and passwords. In addition, classifying the privacy level of data can also help in protecting it. For example, salaries are generally highly confidential while employee work phone numbers are probably not.

Data quality involves checking the health of the accuracy and consistency of the data. Tools for completing this task can include creating KPIs and metrics to measure data quality, developing policies and standards that define what good data quality is as determined by the organization, and developing reports that share the current quality of data.


The purpose of data governance is to support an organization in maintaining data that is an asset to the organization. In order for data to be an asset it must be maintained so that the insights and decisions that are made from the data are as accurate and clear as possible. The tools described in this post provide some of the ways in which data can be protected within an organization.


Data Governance Strategy

A strategy is a plan of action. Within data governance, it makes sense to ultimately develop a strategy or plan to ensure data governance takes place. In this post, we will look at the components of a data governance strategy. Below are the common components of a data governance strategy.

  • Approach
  • Vision statement
  • Mission statement
  • Value proposition
  • Guiding principles
  • Roles & Responsibilities
There is probably no particular order in which these components are completed. However, they tend to follow an inverted pyramid in terms of the scope of what they deal with. In other words, the approach is perhaps the broadest component and affects everything below it followed by the vision statement, etc. Where to begin probably depends on how your mind works. A detail-oriented person may start at the bottom while a big-picture thinker would start at the top.

Defined Approach

The approach defines how the organization will go about data governance. There are two extremes for this and they are defensive and offensive. A defensive approach is focused on risk mitigation while an offensive approach is focused more on achieving organizational goals.


Neither approach is superior to the other, and the situation an organization is in will shape which is appropriate. For example, an organization that is struggling with data breaches may choose a more defensive approach, while an organization that is thriving and free of such problems may take a more offensive approach.

Vision Statement

A vision statement is a brief snapshot of where the organization wants to be. Another way to see this is that a vision statement is the purpose of the organization. The vision statement needs to be inspiring and easily understood. It also helps to align the policies and standards that are developed.

An example of a vision statement for data governance is found below.

Transforming how data is leveraged to make informed decisions to support youth served by this organization

The vision is to transform data for decision-making. This is an ongoing process that will continue indefinitely.

Mission Statement

The mission statement explains how an organization will strive toward its vision. Like a vision statement, the mission statement provides guidance in developing policies and standards. The mission statement should be a call to action and include some of the goals the organization has about data. Below is an example

Enabling stakeholders to make data-driven decisions by providing accurate, timely data and insights

In the example above, it is clear that accuracy, timeliness, and insights are the goals for achieving the vision statement. In addition, the audience is identified which is the stakeholders within the organization.

Value Proposition

The value proposition provides a justification or the significance of adopting a data governance strategy. Another way to look at this is an emphasis on persuasion. Some of the ideas included in the value proposition are the benefits of implementation. Often the value proposition is written in the form of cause and effect statement(s). Below is an example

By implementing this data governance program we will see the following benefits: 

Improved data quality for actionable insights, increased trust in data for making decisions, and clarity of roles and responsibilities of analysts

In the example above three clear benefits are shared. Succinctly this provides people with the potential outcomes of adopting this strategy. Naturally, it would be beneficial to develop ways to measure these ideas which means that only benefits that can be measured should be a part of the value proposition.

Guiding Principles

Guiding principles define how data should be used and managed. Common principles include transparency, accountability, integrity, and collaboration. These principles are just more concrete information for shaping policies and standards. Below is an example of a guiding principle.

All data will have people assigned to play critical roles in it

The guiding principle above is focused on accountability. Making sure all data has people who are assigned to perform various responsibilities concerning it is important to define and explain.

Roles & Responsibilities

Roles and responsibilities are about explaining the function of the data governance team and the role each person will play. For example, a small organization might have people who adopt more than one role such as being data stewards and custodians while larger organizations might separate these roles.

In addition, it is also important to determine the operating model and whether it will be centralized or decentralized. Determining the operating model again depends on the context and preferences of the organization.

It is also critical to determine how compliance with the policies and standards will be measured. It is not enough to state the strategy; eventually, there needs to be evidence of progress and of potential changes that need to be made to the strategy. For example, perhaps a data audit is done monthly or quarterly to assess data quality.


Having a data governance strategy is a crucial step in improving data governance within an organization. Once a plan is in place it is simply a matter of implementation to see if it works.


Data Governance Assessment

Before data governance can begin at an organization it is critical to assess where the organization is currently in terms of data governance. This necessitates the need for a data governance assessment. The assessment helps an organization to figure out where to begin by identifying challenges and prioritizing what needs to be addressed. In particular, it is common for there to be five steps in this process as shown below.

  1. Identify data sources and stakeholders
  2. Interview stakeholders
  3. Determine current capabilities
  4. Document the current state and target state
  5. Analyze gaps and prioritize

We will look at each of these steps below.

Identify Data Sources and Stakeholders

Step one involves determining what data is used within the organization and the users or stakeholders of this data. Essentially, you are trying to determine…

  • What data is out there?
  • Who uses it?
  • Who produces it?
  • Who protects it?
  • Who is responsible for it?

Answering these questions also provides insights into what roles in relation to data governance are already being fulfilled at least implicitly and which roles need to be added to the organization. At most organizations at least some of these questions have answers and there are people responsible for many roles. The purpose here is not only to get this information but also to make people aware of the roles they are fulfilling from a data governance perspective.


Interview Stakeholders

Step two involves interviewing stakeholders. Once it is clear who is associated with data in the organization it is time to reach out to these people. You want to develop questions to ask stakeholders in order to inform you about what issues to address in relation to data governance.

An easy way to do this is to develop questions that address the pillars of data governance. The pillars are…

  • Ownership & accountability
  • Data quality
  • Data protection and privacy
  • Data management
  • Data use

Below are some sample questions based on the pillars above.

  • How do you know your data is of high quality?
  • What needs to be done to improve data quality?
  • How is data protected from misuse and loss?
  • How is metadata handled?
  • What concerns do you have related to data?
  • What policies are there now related to data?
  • What roles are there in relation to data?
  • How is data used here?

It may be necessary to address all or some of these pillars when conducting the assessment. The benefit of these pillars is that they provide a starting point from which you can shape your own interview questions. In terms of the interviews, it is up to each organization to determine what is best for data collection. Maybe a survey works, or perhaps semi-structured interviews or focus groups. The actual research part of this process is beyond the scope of this post.

Determine Current Capabilities

Step three involves determining the current capabilities of the organization in terms of data governance. Often this can be done by looking at the stakeholder interviews and comparing what they said to a rating scale. For example, the DCAM rating scale has six levels of data governance competence as shown below.

  1. Non-initiated - No governance happening
  2. Conceptual - Aware of data governance and planning
  3. Developmental - Engaged in developing a plan
  4. Defined - Plan approved
  5. Achieved - Plan implemented and enforced
  6. Enhanced - Plan a part of the culture and updated regularly

Determining the current capabilities is a subjective process. However, it needs to be done in order to determine the next steps in bringing data governance along in an organization.

Document Current State and Target State

Step four involves determining the current state and determining what the target state is. Again, this will be based on what was learned in the stakeholder interviews. What you will do is report what the stakeholders said in the interviews based on the pillars of data governance. It is not necessary to use the pillars but it does provide a convenient way to organize the data without having to develop your own way of classifying the results.

Once the current state is defined it is now time to determine what the organization should be striving for in the future and this is called the target state. The target state is the direction the organization is heading within a given timeframe. It is up to the data governance team to determine this and how it is done will vary. The main point is to make sure not to try and address too many issues at once and save some for the next cycle.

Analyze and Prioritize

The final step is to analyze and prioritize. This step involves performing a gap analysis to determine solutions that will solve the issues found in the previous step. In addition, it is also important to prioritize which gaps to address first.

Another part of this step is sharing recommendations and soliciting feedback. Provide insights into which direction the organization can go to improve its data governance and allow stakeholders to provide feedback in terms of their agreement with the report. Once all this is done the report is completed and documented until the next time this process needs to take place.


The steps presented here are not prescriptive. They are shared as a starting point for an organization’s journey in improving data governance. With experience, each organization will find its own way to support its stakeholders in the management of data.


Total Data Quality

Total data quality, as its name implies, is a framework for improving the state of data that is used for research and reporting purposes. The dimensions that are used to assess the quality of data are measurement and representation.


Measurement is focused on the values gathered on the variable(s) of interest. When assessing measurement, researchers are concerned with the following.

  • Construct-The construct is the definition of the variable of interest. For example, income can be defined as a person’s gross yearly salary in dollars. However, salary can also be defined per month or as the net after taxes, which shows how this construct can be defined differently. The construct validity must also be determined to ensure that it is measuring what it claims to measure.
  • Field-This is the place where data is collected and how it is collected. For example, our income variable can be collected from students or working adults. Where the data comes from affects the quality of the data concerning the research problem and questions. If the research questions are focused on student income, then collecting income data from students ensures quality. In addition, how the data is encoded matters. All student incomes need to be in the same currency in order to make sense for comparison.
  • Data Values-This refers to the tools and procedures for preparing the data for analysis to ensure high-quality values within the data. The challenges addressed include dealing with missing data, data entry errors, duplications, assumptions for various analytical approaches, and issues between variables such as high correlations.
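As a minimal, hypothetical sketch of the data values step, the checks described above (missing entries, duplicate records, inconsistent currency encoding) might look like this in Python; the records and field names are made up for illustration:

```python
# Hypothetical student income records; field names are illustrative.
records = [
    {"id": 1, "income": 1200, "currency": "USD"},
    {"id": 2, "income": None, "currency": "USD"},   # missing value
    {"id": 3, "income": 900,  "currency": "EUR"},   # inconsistent currency
    {"id": 1, "income": 1200, "currency": "USD"},   # duplicate entry
]

def clean(records, expected_currency="USD"):
    """Drop duplicates, missing incomes, and records in the wrong currency."""
    seen = set()
    cleaned = []
    for r in records:
        if r["id"] in seen:                       # duplicate entry
            continue
        if r["income"] is None:                   # missing value
            continue
        if r["currency"] != expected_currency:    # enforce one currency
            continue
        seen.add(r["id"])
        cleaned.append(r)
    return cleaned

print(len(clean(records)))  # only record 1 survives the checks
```

In practice a researcher might impute missing values rather than drop them; the point is only that each rule from the list above becomes an explicit, checkable step.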


Representation looks at determining if the data collected comes from the population of interest. Several concerns need to be addressed when dealing with representation.

  • Target population-The target population is the pool of potential participants in the study. The limitation here is determining access to the target population. For example, studies involving children can be difficult because of ethical concerns over data collection with children. These ethical concerns limit access at times.
  • Data sources-Data sources are avenues for obtaining data. A source can be a location such as a school or a group of people such as students, among other definitions. Once access is established, it is necessary to determine specifically where the data will come from.
  • Missing data-Missing data isn’t just looking at what data is incomplete in a dataset. It is also about looking at who was left out of the data collection process. For example, if the target population is women, then women should be represented in the data. In addition, missing data can also look at who is represented in the data but should not be. For example, if women are the target population, then there should not be any men in the dataset.
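A hedged sketch of the representation idea: given a stated target population, flag records that should not be in the dataset. The sample and field names below are invented for illustration:

```python
# Illustrative check that a dataset matches its target population.
target_population = {"gender": "female"}

sample = [
    {"name": "Ann",  "gender": "female"},
    {"name": "Bob",  "gender": "male"},    # represented but should not be
    {"name": "Cara", "gender": "female"},
]

def representation_report(sample, target):
    """Split records into those matching the target population and those not."""
    in_scope = [p for p in sample if all(p[k] == v for k, v in target.items())]
    out_of_scope = [p for p in sample if p not in in_scope]
    return in_scope, out_of_scope

kept, flagged = representation_report(sample, target_population)
print([p["name"] for p in flagged])  # Bob is in the data but out of scope
```

The mirror-image check, who is missing that should be present, would compare the kept records against a roster of the target population.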

Measurement and representation meet at the data analysis stage of a research project. If the measurement and representation are bad, it is already apparent that the data analysis will not yield useful insights. However, if the measurement and representation are perfect but the analysis is poor, then you are still left without useful insights.


Measurement and representation are key components of data quality. Researchers need to be aware of these ideas to ensure that they are providing useful results to whatever stakeholders are involved in a study.


Data Types

There are many different ways that data can be organized and classified. In this post, we will look at data as it is classified by purpose. Essentially, data can be gathered for non-research or research purposes. Data collected for non-research purposes is called gathered data and data collected for research purposes is called designed data.

Gathered Data

Gathered data is data that is obtained from sources that were not developed with the intention of conducting research specifically. Examples of gathered data would be data found in social media such as Twitter or YouTube and data that is scraped from a website. In each of those examples, data was collected but not necessarily for an empirical theory testing purpose.

Gathered data is also collected in many ways beyond websites. Other modes of data collection could be sensors such as traffic light cameras, transactions such as those at a store, and wearables such as those used during exercise.


Just because the data was not collected for research purposes does not mean that it cannot be used for this purpose. Gathered data is frequently used to support research, as it can be analyzed and insights developed from it. The challenge is that gathered data may not directly address a researcher’s questions, which necessitates using the data as a proxy for a construct or rephrasing the research questions to align with what the gathered data can answer. Gathered data is also referred to as big data or organic data.

Designed data

Designed data is data that was developed and collected for a specific research purpose. Often this data is collected from people or establishments to answer scientifically designed research questions. A common way of collecting this form of data is a survey, and surveys can be conducted in person, online, or over the phone. These forms of data collection are in contrast to gathered data, which is collected passively and without human interaction. This leads to an important distinction: gathered data is usually strictly quantitative because of its impersonal nature, while designed data can be quantitative and/or qualitative because it is possible to have a human element in the collection process.

When a researcher wants designed data they will go through the process of conducting research which often includes developing a problem, purpose, research questions, and methodology. All of these steps are commonly involved in conducting research in general. The data that is collected for design purposes is then used to address the research questions of the study.

The purpose of this process is to ensure that the data collected will answer the specific questions the researcher has in mind. In other words, designed data is designed to answer specific research questions, while gathered data can only hopefully answer some of them.


Understanding what data was collected for is beneficial for researchers because it helps them to be aware of the strengths and weaknesses the data may have based on its purpose. Neither gathered nor designed data is superior to the other. Rather, the difference is in what was the inspiration for collecting the data.


Data Governance Office

The data governance office or team leads the handling of data within an organization. This team is composed of several members, such as:

  • Chief Data Officer
  • Data Governance Lead
  • Data Governance Consultant
  • Data Quality Analyst

We will look at each of these below. It also needs to be mentioned that one person might be assigned several of these roles, which is particularly true in a smaller organization. In addition, several people might fulfill one of these roles in a much larger organization.

Chief Data Officer

The chief data officer is responsible for shaping the overall data strategy at an organization. The chief data officer also promotes a data-driven culture and pushes for change within the organization. A person in this position also needs to understand the data needs of the organization in order to further the vision of the institution or company.


The role of the chief data officer encompasses all of the other roles that will be discussed. The chief data officer is essentially the leader of the data team and provides help with governance consulting, quality, and analytics. However, the primary role of this position is to see the big picture for big data and to guide the organization in this regard, which implies that technical skills are beneficial but leadership and change promotion are more critical. In sum, this is a challenging position that requires a large amount of experience.

Data Governance Lead

The data governance lead’s primary responsibilities involve defining policies and data governance frameworks. While the chief data officer is more of an evangelist or promoter of data governance, the data governance lead is focused on the actual implementation of change and on guiding the organization through this process.

Essentially, the data governance lead is in charge of the day-to-day operation of the data governance team. While the chief data officer may be the dreamer the data governance lead is a steady hand behind the push for change.

Data Governance Consultant

The data governance consultant is the subject matter expert in data governance. Their role is to know all the details of data governance in the general field, and it is even better if they know how to make data governance happen in a particular discipline, for example, within the context of a university.

The data governance consultant supports the data governance lead with implementation. In addition, the consultant is a go-between for the larger organization and IT. Serving as a go-between implies that the consultant is able to communicate effectively with both parties, on a technical level with IT and in layman’s terms with the larger organization. The synergy between IT and the larger organization can be challenging because of communication issues due to vastly different backgrounds, and it is the consultant’s responsibility to bridge this gap.

Data Quality Analyst

The data quality analyst’s job, as the name implies, is to ensure quality data. One way of ensuring data quality is to develop rules for data entry. For example, a rule for data quality is that marital status can only be single, married, divorced, or widowed. This rule restricts any other option that people may want to enter. When this rule is enforced, it is an example of high quality within this context.
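A data-entry rule like the marital status example can be sketched as a small validation function; this is a hypothetical illustration, not any particular tool’s API:

```python
# A simple data-entry rule of the kind described above; values are illustrative.
ALLOWED_MARITAL_STATUS = {"single", "married", "divorced", "widowed"}

def validate_marital_status(value):
    """Return True when the entry satisfies the quality rule."""
    return value.strip().lower() in ALLOWED_MARITAL_STATUS

print(validate_marital_status("Married"))      # True
print(validate_marital_status("complicated"))  # False: rejected by the rule
```

In a real system such rules are often enforced at the database level (for example, with a check constraint) so that bad values never enter the data in the first place.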

A data quality analyst also performs troubleshooting or root cause investigations. If something odd is going on in the data, such as duplicates, it is the data quality analyst’s job to determine what is causing the problem and to find a solution. Lastly, a data quality analyst is also responsible for statistical work. This can include statistical work associated with the work of a data analyst and/or statistical work that monitors the use and quality of data within the organization.


The data governance team plays a critical role in supporting the organization with reliable and clean data that can be trusted to make actionable insights. Even though this is a tremendous challenge it is an important function in an organization.


Roles in Data Governance

Working with data is a team event. Different people are involved in different stages of the data process. The roles described below are roles commonly involved in data governance. The general order below is the common order in which these individuals will work with data. However, life is not always linear and different people may jump in at different times. In addition, one person might have more than one role when working with data in the governance process.

Data Owners

Data owners are responsible for the infrastructure such as the database in which data is stored for consumption and use. Data owners are also in charge of the allocation of resources related to the data. Data owners also play a critical role in developing standard operating procedures and compliance with these standards.

Data Producers

Once the database, or whatever tool is used for the data, is in place, the next role involved is the data producer. Data producers are responsible for creating data. The creation of data can happen through such processes as data entry or data collection. Data producers may also support quality control and general problem-solving of issues related to data. To make it simple, the producer uses the system that the owner developed for the data.


Data Engineers

Data engineers are responsible for pipeline development which is moving data from one place to the other for various purposes. Data engineers deal with storage optimization and distribution. Data engineers also support the automation of various tasks. Essentially, engineers move around the data that producers create.

Data Custodians

Data custodians are the keepers and protectors of data. They focus on using the storage created by the data owner and the delivery of data like the data engineer. The difference is that the data custodian sends data to the people after them in this process such as stewards and analysts.

Data custodians also make sure to secure and back up the data. Lastly, data custodians are often responsible for network management.

Data Stewards

Data stewards work on defining and organizing data. These tasks might involve working with metadata in particular. Data stewards also serve as gatekeepers to the data, which involves keeping track of who is using and accessing the data. Lastly, data stewards help consumers (analysts and scientists) find the data that they may need to complete a project.

Data Analysts

Data analysts as the name implies analyze the data. Their job can involve statistical modeling of data to make a historical analysis of what happened in the past. Data analysts are also responsible for cleaning data for analysis. In addition, data analysts are primarily responsible for data visualization and storytelling development of data. Dashboards and reports are also frequently developed by the data analyst.

Data Scientists

The role of a data scientist is highly similar to that of a data analyst. The main difference is that data scientists use data to predict the future, while data analysts use data to explain the past. In addition, data scientists serve as research designers to acquire additional data for the goals of a project. Lastly, data scientists do advanced statistical work involving, at times, machine learning, artificial intelligence, and data mining.


The roles mentioned above all play a critical role in supporting data within an organization. When everybody plays their part well organizations can have much more confidence in the decisions they make based on the data that they have.


Data Governance Framework Types and Principles

When it is time to develop data governance policies the first thing to consider is how the team views data governance. In this post, we will look at various data governance frameworks and principles to keep in mind when employing a data governance framework.


The top-down framework involves a small group of data providers. These data providers serve as gatekeepers for data that is used in the institution. Whatever data is used is controlled centrally in this framework.


One obvious benefit of this approach is that with a small group of people in charge, decision-making should be fast and relatively efficient. In addition, if something does go wrong, it should be easy to trace the source of the problem. However, a top-down approach only works in situations with small amounts of data or few end users. When the amount of data becomes too large, the small team will struggle to support users, which indicates that this approach is hard to scale. Lastly, people may resent having to abide by rules handed down from above.


The bottom-up approach to data governance is the mirror opposite of the top-down approach. Where top-down involves a handful of decision-makers, bottom-up focuses on a democratic style of data leadership. Bottom-up is scalable because everyone is involved in the process, while top-down does not scale well. Generally, when the bottom-up approach is used, controls and restrictions on data are put in place after the raw data is shared rather than before.

Like all approaches to data governance, there are concerns with the bottom-up approach. For example, it becomes harder to control the data when people are allowed to use raw data that has not been prepared for use. In addition, because of the democratic nature of the bottom-up approach, there is also an increased risk of security concerns because of the increased freedom people have.


The collaborative approach is a mix of top-down and bottom-up ideas on data governance. This approach is flexible and balanced while placing an emphasis on collaboration. The collaboration can be among stakeholders or between the gatekeepers and the users of data.

One main concern with this approach is that it can become messy and difficult to execute if principles and goals are not clearly defined. Therefore, it is important to spend a large amount of time in planning when choosing this approach.


Regardless of which framework you pick when beginning data governance, there are several terms you need to be familiar with to help you be successful. For example, integrity involves maintaining open lines of communication and the sharing of problems so that an atmosphere of trust is maintained or developed.

It is also important to determine ownership for the purpose of governance and decision-making. Determining ownership also helps to find gaps in accountability and responsibility for data.

Leaders in data governance must also be aware of change and risk management. Change management consists of the tools and processes for communicating new strategies and policies related to data governance; it helps ensure a smooth transition from one state of equilibrium to another. Risk management consists of tools related to auditing and developing interventions for non-compliance.

A final concept to be aware of is strategic alignment. The goals and purpose of data governance must align with the goals of the organization that data governance is supporting. For example, a school will have a strict stance on protecting student privacy. Therefore, data governance needs to reflect this and support strict privacy policies.


Frameworks provide a foundation on which your team can shape their policies for data governance. Each framework has its strengths and weaknesses but the point is to be aware of the basic ways that you can at least begin the process of forming policies and strategies for governing data at an organization.


Data Governance Framework

In this post, we will look at defining a data governance framework. We will also look at the key components that are part of a data governance framework.


A data governance framework is the how or the plan for governing the data within an organization. The term data governance determines what needs to be governed or controlled while the data governance framework is the actual plan for controlling the data.

Common Components

There are several common components of a data governance plan and they include the following.

  • Strategy
  • Policies
  • Processes
  • Coordination
  • Monitoring/communication
  • Data literacy/culture

Strategy involves determining how data can be used to solve problems. This may seem obvious, but only certain data can be used to solve certain problems. For example, customers’ addresses in California might not be appropriate for determining revenue generated in Texas. When data is looked at strategically, it helps ensure that it is viewed as an asset by those who use it.


Policies help to guide such things as decision-making and expectations concerning data. In addition, policies also help with determining responsibilities and tasks related to data management. One example of policy in action is the development of standards which are rules for best practices in order to meet a policy. A policy may be something like protecting privacy. A standard to meet this policy would be to ensure that data is encrypted and password protected.
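The policy-to-standard relationship above can be made concrete with a sketch. Assuming the standard is "passwords are never stored in plain text," one common way to meet it is salted hashing; the function names here are illustrative, but `hashlib.pbkdf2_hmac` is a real standard-library call:

```python
import hashlib
import os

# Sketch of a standard ("no plain-text passwords") supporting a privacy
# policy; the iteration count and field names are illustrative choices.
def hash_password(password, salt=None):
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    """Re-derive the hash and compare against the stored digest."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == digest

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))  # True
print(verify_password("wrong", salt, digest))   # False
```

The design point is that the policy stays abstract (protect privacy) while the standard is something an auditor can actually test against the system.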

Process and technology involve steps for monitoring the quality of data. Other topics related to process can include dealing with metadata and data management. The proper process mainly helps with efficiency in the organization.

Coordination involves the processes of working together. Coordination can involve defining the roles and responsibilities for a complex process that requires collaboration with data. In other words, coordination is developed when multiple parties are involved with a complex task.

Progress monitoring involves the development of KPIs to make sure that the performance expectations are measured and adhered to. Progress monitoring can also involve issues related to privacy, quality, and compliance. An example of progress monitoring may be requiring everyone to change their password every 90 days. At the end of the 90 days, the system will automatically make the user create a new password.
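The 90-day password example can be expressed as a small monitoring check; this is a hypothetical sketch of such a rule, not any specific system’s behavior:

```python
from datetime import date, timedelta

# Illustrative monitoring rule: flag accounts whose password is older than 90 days.
MAX_PASSWORD_AGE = timedelta(days=90)

def password_expired(last_changed, today=None):
    """Return True when the password is past the 90-day policy window."""
    today = today or date.today()
    return today - last_changed > MAX_PASSWORD_AGE

print(password_expired(date(2024, 1, 1), today=date(2024, 6, 1)))  # True
print(password_expired(date(2024, 5, 1), today=date(2024, 6, 1)))  # False
```

A real system would run a check like this on a schedule and force a reset for any flagged account; the same pattern works for other KPIs such as data-freshness thresholds.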

Lastly, data literacy and culture involve training and developing the skills of analyzing and communicating data for those within the organization who use or consume data. Naturally, this is an ongoing process, and how it works depends on who is involved.


A framework is a plan for achieving a particular goal or vision. As organizations work with data, they must be diligent in making sure that the data that is used is trustworthy and protected. A data governance framework is one way in which these goals can be attained.


Data Governance Benefits

Data governance is a critical part of many organizations today. In this post, we will look at some of the commonly found benefits of incorporating data governance into an organization.

Improved Data Quality

In theory, when data governance is implemented within an organization there should be a corresponding improvement in data quality. What is meant by improved data quality is better accuracy, consistency, and integrity. In addition, data quality can also include the completeness of the data and ensuring that the data is timely.


When data quality is high, it allows end users to have greater trust in the analysis and conclusions that can be made from the data. Improved trust can also lead to an increase in confidence when sharing and/or defending the decision-making process.

Risk Reduction

Data governance can also reduce risk. There are often laws that organizations have to follow concerning data governance. Common laws often include laws about privacy. When data governance is implemented and carefully enforced it can help in complying with laws and thus lower the risk of breaking laws and or facing legal consequences.

The typical organization probably does not want to deal with legal matters. As such, it is in most if not all organizations’ benefit to comply with laws through data governance. The process of abiding by laws also provides a good example to stakeholders and creates a culture of transparency.

Improved Decision-Making

Decisions are only as good as the information they are based upon. If data is bad, the organization risks making bad decisions. There is an idiom common in the data world: “garbage in, garbage out.” Therefore, it is critical that the data accurately represents what it is supposed to represent.

As mentioned earlier, good data leads to good decisions and increased confidence. It also helps improve understanding of the context the data came from.

Improved Processes

Data governance can also improve various processes. For example, roles relating to data have to be clearly defined. In addition, various tasks that need to be completed must also be stipulated and clarified. Whenever steps like these are taken it can improve the speed at which things are done.

In addition, improving processes can also reduce errors. Since people know what their role is and what they need to do it is easier to spot and prevent mistakes as the data moves to the various parties that are using it.

Customer service

Data governance is also beneficial for customer service or dealing with stakeholders. When requests are made by customers or stakeholders, accurate data is critical for addressing their questions. In addition, there are situations in which customers or stakeholders can access the data themselves. For example, most customers can at least access their own personal information on a shopping website such as Amazon.

If data is not properly cared for, users cannot access it or have their questions answered. This is frustrating no matter what field or industry one works in. Therefore, data governance is important in enhancing the experience of customers and the people who work in the institution.

Profit Up

A natural outcome of the various points mentioned above is increased profit or decreased expenses depending on the revenue model. When efficiency goes up and or customer satisfaction goes up there is often an increase in revenue.

What can be inferred from this is that data governance is not just a set of ideas to avoid headaches but a tool that can be employed to enhance profitability in many situations.


Data governance is beneficial in many more ways than mentioned here. For our purposes, data governance can allow an organization to focus on making cost-efficient, sound decisions by ensuring the quality and accuracy of the data involved in the process of making conclusions.


Influences and Approaches of Data Governance

Data governance has been around for a while. As a result, various trends and challenges have influenced this field. In this post, we will look at several laws that have had an impact on data governance, along with various concepts that have been developed to address common concerns.


Several laws have played a critical role in influencing data governance both in the USA and internationally. For example, the Sarbanes-Oxley (SOX) Act was enacted in 2002. The SOX Act was created in reaction to various accounting scandals at large corporations at the time. Among the requirements of this law are standards for financial and corporate reporting and the need for executives to verify, or attest, that financial information is correct. Naturally, meeting these requirements requires data governance to make sure that the data is appropriate.


There are also several laws related to privacy in particular. Focusing again on the USA, there is the Health Insurance Portability and Accountability Act (HIPAA), which requires institutions in the medical field to protect patient data. Leaders in data must develop data governance policies that protect medical information.

In the state of California, there is the California Consumer Privacy Act (CCPA), which gives California residents more control over how their personal data is handled by companies. The CCPA is focused much more on the collection and selling of personal data, as this has become a lucrative industry in the data world.

At the international level, there is the General Data Protection Regulation (GDPR). The GDPR is a privacy law that applies to anybody who lives in the EU. What this implies is that a company in another part of the world that has customers in the EU must abide by this law as well. As such, this is one example of a local law related to data governance that can have a global impact.

Various Concepts that Support Data Governance

Data governance was around much earlier than the laws described above. However, several different concepts and strategies were developed to address transparency and privacy as explained below.

Data classification and retention deal with the level of confidentiality of the data and policies for data destruction. For example, social security numbers are a form of data that is highly confidential, while the types of shoes a store sells would probably not be considered private. In addition, some data is not meant to be kept forever. For example, consumers may request that their information, such as credit card numbers, be removed from a website. In such a situation there must be a way for this data to be removed permanently from the system.
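Honoring a removal request like the one above can be sketched as follows; the data store and field names are invented for illustration, and a real system would also have to purge backups and downstream copies:

```python
# Sketch of honoring a deletion request; the store and fields are made up.
customers = {
    "u1": {"name": "A. Lee", "card_number": "4111-1111-1111-1111"},
    "u2": {"name": "B. Kim", "card_number": "4000-0000-0000-0002"},
}

def delete_personal_data(store, user_id):
    """Permanently remove a user's record on request; True if it existed."""
    return store.pop(user_id, None) is not None

print(delete_personal_data(customers, "u1"))  # True: record removed
print("u1" in customers)                      # False: no longer in the store
```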

Data management is focused on consistency and transparency. There must be a master copy of data to serve as a backup and for checking the accuracy of other copies. In addition, there must be some form of reference data management to identify and map datasets through a general identifier such as zip code or state.

Lastly, metadata management deals with data that describes the data. By providing this information, it is possible to search and catalog data.
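A toy example of how metadata enables search and cataloging; the dataset names and descriptions are hypothetical:

```python
# A toy metadata catalog: descriptions of datasets, searchable by keyword.
catalog = [
    {"name": "student_records", "description": "enrollment and grades by term"},
    {"name": "payroll",         "description": "staff salary and tax data"},
]

def search(catalog, keyword):
    """Return names of datasets whose metadata mentions the keyword."""
    keyword = keyword.lower()
    return [d["name"] for d in catalog
            if keyword in d["name"].lower() or keyword in d["description"].lower()]

print(search(catalog, "salary"))  # ['payroll']
```

Real catalogs track far more (owners, classifications, lineage), but the principle is the same: without the descriptive layer, the data itself is hard to find.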


Data governance will continue to be influenced by the laws and context of the world. With new challenges will be new ways to satisfy the concerns of both lawmakers and the general public.


Data Governance

Data governance involves several concepts that describe the characteristics and setting in which the data is found. For people in leadership positions involving data, it is critical to have some understanding of the following concepts related to data governance. These concepts are

  • Ownership
  • Quality
  • Protection
  • Use/Availability
  • Management

Each of these concepts plays a role in shaping the role of data within an organization.


Data ownership is not always as obvious as it seems. One company may be using the data of a different company. It is important to identify who the data belongs to so that any rules and restrictions that the owner has about the use of the data are something that the user of the data is aware of.


Addressing details related to ownership helps to determine accountability as well. Identifying ownership can also identify who is responsible for the data because the owners will hopefully have an idea of who should be using the data. If not this is something that needs to be clarified as well.


Data quality is another self-explanatory term. It is a way of determining how good the data is based on some criteria. Commonly used criteria for data quality are completeness, consistency, timeliness, accuracy, and integrity.

Completeness is determining if everything that the data is supposed to capture is represented in the data set. For example, if income is one variable that needs to be in a dataset it is important to check that it is there.

Consistency means that the data you are looking at is similar to other data in the same context. For example, student record data is probably similar regardless of the institution. Therefore, someone with experience with student record data can tell you whether the data you are looking at is consistent with other data in a similar context.

Timeliness has to do with the recency of the data. Some data is real-time while other data is historical. Therefore, the timeliness of the data will depend on the context of the project. A chatbot needs recent data while a study of incomes from ten years ago does not need data from yesterday.

Accuracy and integrity are two more measures of quality. Accuracy is how well the data represents the population. For example, a population of male college students should have data about male college students. Integrity has to do with the truthfulness of the data. For example, if the data was manipulated, this needs to be explained.
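Of these criteria, completeness is the easiest to quantify. A minimal sketch, with an invented dataset, might report the share of records that actually contain a required field:

```python
# Hedged sketch of a completeness check; the dataset is illustrative.
dataset = [
    {"student": "A", "income": 1200},
    {"student": "B", "income": None},   # value recorded as missing
    {"student": "C"},                   # field absent entirely
]

def completeness(dataset, field):
    """Share of records with a non-missing value for `field`."""
    present = sum(1 for row in dataset if row.get(field) is not None)
    return present / len(dataset)

print(completeness(dataset, "income"))  # 1 of 3 records is complete
```

Similar small functions can be written for the other criteria, for example comparing value distributions against a reference dataset to check consistency.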


Data protection has to do with all of the basic security concerns IT departments have to deal with today. Some examples include encryption and password protection. In addition, there may be a need to be aware of privacy concerns such as financial records or data collected from children.

There should also be awareness of disaster recovery. For example, there might be a real disaster that wipes out data or it can be an accidental deletion by someone. In either case, there should be backup copies of the data. Lastly, protection also involves controlling who has access to the data.


Despite the concerns of protection, data still needs to be available to the appropriate parties and this relates to data availability. Whoever is supposed to have the data should be able to access it as needed.

The data must also be usable. The level of usability will depend on the user. For example, a data analyst should be able to handle messy data but a consumer of dashboards needs the data to be clean and ready prior to use.


Data management is the implementation of the policies that are developed in the previous ideas mentioned. The data leadership team needs to develop processes and policies for ownership, quality, protection, and availability of data.

Once the policies are developed, they have to actually be employed within the institution, which can be difficult, as people generally want to avoid accountability and/or responsibility, especially when things go wrong. In addition, change is often disliked, as people gravitate toward current norms.


Data governance is a critical part of institutions today given the importance of data. IT departments need to develop policies and plans for data in order to maintain trust in whatever conclusions are drawn from it.

Linking Plots in Plotly with R Video

Linking plots involves allowing the action you take in one plot to affect another. Doing this can allow the user to uncover various patterns that may not be apparent otherwise. Using plotly, it is possible to link plots and this is shown in the video below.