Data catalogs and data silos are two ideas commonly associated with data governance. In this post, we will define both terms and look at how to implement the first and prevent the second.
Data Catalogs
Data catalogs are a rather recent phenomenon. They were first developed in the 2010s, though their exact origins are unclear. A data catalog is a reference application that contains metadata on the various datasets within an organization. Usually, the catalog is searchable so that people can find the datasets they need.
The data catalog essentially tracks the data available within an organization. The main reason for tracking data is to prevent it from being lost or kept hidden. Within a data governance framework, data is considered an asset. Therefore, just as an organization guards against the loss of inventory because of its monetary value, the data catalog guards against the loss of data that could inform decisions and, in turn, generate value.
Tips
There are several tips for developing and using data catalogs. For example, a data catalog should track the roles of the various people connected to each dataset. Roles can include the owner of the data, the steward, the custodian, etc. Tracking roles helps in assigning responsibility for data.
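To make the idea of a searchable catalog with tracked roles concrete, here is a minimal sketch in Python. It is only an illustration: the `CatalogEntry` class, its field names, the `search` function, and the sample datasets are all hypothetical, and a real catalog tool would store far more metadata.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one dataset (field names are illustrative assumptions)."""
    name: str
    description: str
    owner: str       # who is accountable for the data
    steward: str     # who looks after quality and definitions
    custodian: str   # who manages storage and access
    tags: list[str] = field(default_factory=list)


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name, description, or tags mention the term."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


# Hypothetical sample catalog with two datasets.
catalog = [
    CatalogEntry("sales_2023", "Monthly sales totals by region",
                 owner="Finance", steward="A. Analyst", custodian="IT",
                 tags=["sales", "revenue"]),
    CatalogEntry("hr_headcount", "Employee headcount by department",
                 owner="HR", steward="B. Analyst", custodian="IT",
                 tags=["staff"]),
]

for entry in search(catalog, "sales"):
    print(entry.name, "owner:", entry.owner)
```

Because every entry carries an owner, steward, and custodian, anyone who finds a dataset through the search also finds out who is responsible for it.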
Another tip is to develop data dictionaries alongside the data catalog. A data dictionary contains metadata for just one dataset rather than for all data. An analogy would be maps. Some maps cover the whole world, like a data catalog, while other maps cover only a city or county, like a data dictionary. The data dictionary is useful when an analyst needs more information while preparing to use data.
It is also important to make the data catalog user-friendly. Making a data catalog user-friendly for stakeholders requires the support of IT and a strong concern for the user experience. Nobody will use a data catalog if its user interface is useless. Otherwise, the only remedy would be lots and lots of training.
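As a rough illustration of the difference in scope, a data dictionary for a single dataset might look something like the sketch below. The dataset, its column names, and the `describe` helper are all made up for the example.

```python
# Hypothetical data dictionary for one dataset (a monthly sales table).
# Each key is a column name; each value is metadata about that column.
data_dictionary = {
    "region": {"type": "str", "description": "Sales region name"},
    "month": {"type": "str", "description": "Month in YYYY-MM format"},
    "total": {"type": "float", "description": "Total sales in US dollars"},
}


def describe(column: str) -> str:
    """Return a one-line description of a column, or a notice if undocumented."""
    meta = data_dictionary.get(column)
    if meta is None:
        return f"{column}: not documented"
    return f"{column} ({meta['type']}): {meta['description']}"


print(describe("total"))
```

Where the catalog tells an analyst that a dataset exists, a dictionary like this tells them what each column in it means.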
Data Silos
Data catalogs help to prevent what are called data silos. Data silos are sources of data that are controlled in an isolated place within an organization. When silos develop, they can lead to incomplete analyses because the data is incomplete. In addition, silos can lead to a breakdown in collaboration, which can cause duplication of effort and reduced productivity. Lastly, people within an organization may struggle to find the data they need for analysis.
Data silos often develop in organizations that have a decentralized IT strategy. A decentralized approach frequently leads to every department doing what it wants in terms of data storage and technology use, which is chaotic. Another cause of data silos is a lack of common goals for data management. No goals means everyone does what they want.
Breaking Silos
Two main ways of breaking data silos are the development of data governance and data integration. One step in data governance is developing a data catalog, as mentioned earlier. Once a data catalog is developed, the team can start to create data governance policies and standards that establish expectations regarding data use and storage.
A second strategy that is related to the first is data integration. Data integration is the process of combining data from different tables into one. Once this is done, more analysis can take place. Combined data is also harder to isolate, because it must remain available for use.
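A small sketch can show what combining tables looks like in practice. The example below joins two hypothetical tables, one from a sales team and one from a support team, on a shared `customer_id` column. The `integrate` function and the sample rows are assumptions made for illustration. The sketch keeps only rows that appear in both tables and assumes each key appears at most once in the second table.

```python
def integrate(left, right, key):
    """Combine rows from two tables that share the same value for `key`.

    Keeps only rows present in both tables and assumes `key` is unique
    within `right`.
    """
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Hypothetical tables held by two different departments.
sales = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 90.0},
]
support = [
    {"customer_id": 1, "tickets": 3},
    {"customer_id": 3, "tickets": 1},
]

combined = integrate(sales, support, "customer_id")
print(combined)
```

Only customer 1 appears in both tables, so the combined table has one row holding both the sales total and the ticket count, something neither department could see on its own.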
Conclusion
Data catalogs and silos are part of the daily life of the information professional. Therefore, in the context of data governance, it is important to be familiar with both terms so that you can support the people who work with data.