Several terms used to describe data within an organization are easy to confuse, especially for people who are not experts in the field. In this post, we will look at terms that are often misused in data management.
Database
Databases are for structured data, which is data organized into rows and columns. Among the many benefits of using a database over an Excel spreadsheet is capacity: an Excel worksheet is capped at 1,048,576 rows, while a database can hold far more data than that. In addition, databases let multiple users query and input data at the same time, which a spreadsheet is not designed to handle well.
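To make "rows and columns" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The students table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3


def build_student_db():
    """Create an in-memory database holding one table of rows and columns."""
    conn = sqlite3.connect(":memory:")
    # Each column has a name and a type; each inserted tuple becomes one row.
    conn.execute(
        "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade_level INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students (id, name, grade_level) VALUES (?, ?, ?)",
        [(1, "Ana", 9), (2, "Ben", 10), (3, "Chi", 10)],
    )
    conn.commit()
    return conn


conn = build_student_db()
# Query the structured data: names of all students in grade level 10.
tenth_graders = conn.execute(
    "SELECT name FROM students WHERE grade_level = ? ORDER BY name", (10,)
).fetchall()
print(tenth_graders)
```

An in-memory SQLite database is used here only to keep the example self-contained. A shared organizational database would normally run on a server so that many users can connect at once.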
Data Warehouse
A data warehouse is a computer system designed to store and analyze large amounts of data for an organization. The data for a data warehouse can come from various areas within the organization. Since the data comes from many different places, a data warehouse also helps to integrate it for analysis, which is valuable for decision-making and insights.
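The idea of integrating data from different areas can be sketched as follows. This is a toy illustration, not a real warehouse design: the two source extracts (an enrollment system and a billing system), the table names, and the values are all assumptions made for the example.

```python
import sqlite3

# Hypothetical extracts from two separate operational systems.
enrollment_rows = [(1, "Ana", "Biology"), (2, "Ben", "History")]
billing_rows = [(1, 1200.0), (2, 950.0)]


def load_warehouse(enrollment, billing):
    """Load both extracts into a single database so they can be analyzed together."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE enrollment (student_id INTEGER, name TEXT, major TEXT)")
    wh.execute("CREATE TABLE billing (student_id INTEGER, balance REAL)")
    wh.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", enrollment)
    wh.executemany("INSERT INTO billing VALUES (?, ?)", billing)
    wh.commit()
    return wh


warehouse = load_warehouse(enrollment_rows, billing_rows)
# One query now spans data that originally lived in two different systems.
integrated = warehouse.execute(
    """
    SELECT e.name, e.major, b.balance
    FROM enrollment AS e
    JOIN billing AS b ON b.student_id = e.student_id
    ORDER BY e.student_id
    """
).fetchall()
print(integrated)
```

The join on student_id is the integration step: once both sources sit in one place and share a key, questions that cut across departments become a single query.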
Data warehouses take pressure off databases by providing another location for data. However, because of their size, often over 100 GB, data warehouses are hard to change once they are up and running. Therefore, great care is needed when developing and using this tool.
Data Marts
Data marts are similar to data warehouses, with the main difference being scope. Like data warehouses, data marts are also databases. However, data marts are focused on one subject or department, whereas data warehouses gather data from across the organization. For example, a school might have a data warehouse for all student data and a data mart that holds only student classes and grades.
Since they focus on a given subject, data marts are generally smaller than data warehouses, at less than 100 GB. The rationale for a data mart is that analytics teams can concentrate on the data relevant to their questions rather than searching through a larger data warehouse.
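The school example can be sketched in a few lines. Here the data mart is modeled as a view over a warehouse table, which is a simplification chosen for brevity; in practice a data mart is often a separate database. The student_records table, the grades_mart view, and the sample rows are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
# The warehouse table holds all student data, including fields unrelated to grades.
warehouse.execute(
    "CREATE TABLE student_records "
    "(student_id INTEGER, name TEXT, address TEXT, course TEXT, grade TEXT)"
)
warehouse.executemany(
    "INSERT INTO student_records VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "12 Oak St", "Biology", "A"),
        (2, "Ben", "9 Elm Ave", "History", "B"),
    ],
)
# The mart exposes only the subject of interest: classes and grades.
warehouse.execute(
    "CREATE VIEW grades_mart AS SELECT student_id, course, grade FROM student_records"
)

cursor = warehouse.execute("SELECT * FROM grades_mart ORDER BY student_id")
mart_columns = [column[0] for column in cursor.description]
mart_rows = cursor.fetchall()
print(mart_columns)
print(mart_rows)
```

Analysts working against grades_mart see three columns instead of five, which is the narrowing of scope that distinguishes a data mart from the warehouse it draws on.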
Data Lake
Data lakes are also similar to data warehouses. Just like a data warehouse, a data lake contains data from many sources across the organization. Data lakes are also generally larger than 100 GB. One of the main differences is that data lakes contain both structured and unstructured data. Unstructured data is data that does not fit into rows and columns. Examples include video, social media posts, and images.
Another purpose of a data lake is to provide a place for keeping data that may not have a specific purpose yet. Another way to think of this is to consider a data lake a historical repository of data. Due to their multipurpose nature, data lakes are often less complex in structure than data warehouses.
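A rough way to picture a data lake is as a store that accepts files in their raw form, whatever their shape. The sketch below uses a temporary local directory as a stand-in for lake storage. The ingest helper, the source names, and the file contents are hypothetical, and the image bytes are a placeholder rather than a valid image.

```python
import tempfile
from pathlib import Path


def ingest(lake_root, source, filename, payload):
    """Store raw bytes under <lake_root>/<source>/ without imposing a schema."""
    target_dir = Path(lake_root) / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as lake_root:
    # Structured data: a CSV file with rows and columns.
    ingest(lake_root, "registrar", "grades.csv", b"student_id,grade\n1,A\n2,B\n")
    # Unstructured data: placeholder image bytes and a free-text social media post.
    ingest(lake_root, "marketing", "campus_photo.jpg", b"\xff\xd8\xff\xe0 placeholder")
    ingest(lake_root, "social", "post.txt", "Great open day!".encode("utf-8"))

    # List everything the lake now holds, relative to its root.
    stored = sorted(
        path.relative_to(lake_root).as_posix()
        for path in Path(lake_root).rglob("*")
        if path.is_file()
    )

print(stored)
```

Nothing in ingest inspects or reshapes the payload, which reflects the point that a data lake can hold data before anyone has decided how it will be used.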
Conclusion
All of the data products discussed here work together to give an organization access to its data. It is important to understand these terms because people commonly use them interchangeably, to the confusion of everyone involved. With consistent terminology, everyone can be on the same page when it comes to delivering value through data.