The rise of the data lake has empowered more users to gain insights by generating analytics from raw data. One driver of this paradigm shift was to speed innovation by giving the business key information so it can react more quickly. This increased agility was genuine. However, it came with a price. The same criticism that beleaguers the agile methodology also besets the data lake: documentation!
The data lake encourages users to build understanding (insights) from primitive data sets, but it does a poor job of encouraging reuse. As a result, previously created insights may be poorly cataloged, inaccessible, or untrustworthy, which exacerbates the problem by allowing competing 'alternative facts' from shadow systems to pop up.
“… with great power comes great responsibility” – Uncle Ben to Peter Parker 2002
The data lake paradox does not have to be a foregone conclusion if we apply some controls and context. Adding the right mix of data governance and borrowing a few ideas from data science gives the enterprise the comprehension of the data warehouse, the agility of the data lake, and the promise of the data mesh.
The remainder of this article outlines four steps your company can take today to begin its journey toward data governance.
1. Define Your Data Strategy
Defining a data strategy that meets the immediate needs as well as those ten years out is a significant task that should not be taken lightly. Let’s take this time to review a few key concepts and perhaps introduce some new ones. This understanding will help set the context for the remainder of the article and maybe help simplify some topics if you are new to this subject.
OLTP (Online Transaction Processing): The data that runs a business. It is the most detailed view of a company's raw data. This is operational, high-volume data; think invoices.
OLAP (Online Analytical Processing): Aggregated OLTP data used to define metrics, e.g., 'monthly revenue'.
Data Warehouse: A centralized store for data from multiple systems. It can contain OLTP data but is typically composed of OLAP data. Most warehouses have a hierarchical star schema that ties all structured data together. Because of this rigid, systematic structure, it is difficult to add new sources.
Data Lake: A centralized repository for raw (OLTP) and unstructured data. This allows analysts to create OLAP analytics, visualizations, reports, or AI/ML insights from the raw data.
Data Lakehouse: A combination of data lake and data warehouse where user-curated unstructured data and structured data can be exposed and controlled together.
Monolithic: A large system that tries to do or store everything. The opposite is distributed or federated.
Data Mesh: A new, decentralized paradigm for data lakes that addresses many of their common downfalls and promotes the data-as-a-product principle. All the recommendations in this article comply with the data mesh principles. [Dehghani]
Insights: Combining data sets and performing analysis to gain a higher-order understanding of the business. This may take the form of an aggregation like 'monthly revenue' or an AI model like a 'churn propensity model'.
In simplest terms, your data strategy should (a minimal sketch follows this list):
- Identify where your OLTP data is sourced from
- Answer where the data will go so you can run analytics (OLAP)
- Decide how the data will be aggregated
- Determine the frequency of data extraction
- Guide you on how to expose analytics to your users
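To make this concrete, a data strategy can start life as a small, version-controlled configuration. The sketch below is a minimal, hypothetical example in Python; every system name, path, and schedule is a placeholder rather than a recommendation.

```python
# A minimal, hypothetical sketch of a data strategy captured as configuration.
# Every name here (systems, paths, schedules, tools) is a placeholder.
DATA_STRATEGY = {
    "sources": [  # where the OLTP data is sourced from
        {"system": "erp", "objects": ["invoice", "customer"], "extract_frequency": "hourly"},
        {"system": "crm", "objects": ["opportunity"], "extract_frequency": "daily"},
    ],
    "destination": {  # where the data will go so you can run analytics (OLAP)
        "platform": "lakehouse",
        "raw_zone": "s3://company-lake/raw",
        "curated_zone": "s3://company-lake/curated",
    },
    "aggregations": [  # how the data will be aggregated
        {"insight": "monthly_revenue", "grain": "month", "source_objects": ["invoice"]},
    ],
    "exposure": {  # how analytics are exposed to users
        "bi_tool": "dashboards",
        "api": True,
    },
}
```

Keeping even this small amount of structure under version control makes the strategy reviewable and gives the later steps (domains, catalog, epochs) something concrete to attach to.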
Monolithic architectures for data strategies get a bad rap and are often considered antiquated. But there are a few circumstances where they might be just right. Small companies, startups, and platform companies might benefit from a single system that does everything, as long as they realize they may eventually outgrow it.
2. Segment Your Data Into Domains
Data domain ownership is the first principle of the data mesh for a good reason. It is difficult to execute because it challenges our traditional hierarchical org structure. This does not mean that we have to reorganize our reporting structure around data sets, but it does mean that we need to create virtual federated teams empowered to make decisions that are closest to the information they manage.
These teams should be direct producers and consumers of data, and they serve as delegates to represent their hierarchical organization. The members of these teams should speak on behalf of all processes that touch data and the systems that run the processes.
The typical approach to implementing technology in the enterprise requires us to organize in a similar way, but the focus is on the process, not the data. However, there is a subtle yet important difference that we must not overlook: our data domains envelop the entire lifecycle of a subset of data from cradle to grave, which includes many processes that may not be consecutive. One of the primary goals of breaking data into domains is to provide better control and more agility.
By reducing the scope of responsibility, we increase understanding.
Members should know the schema "by heart". Having a more intimate knowledge of the data and the relationships inside the domain reduces the impact of change. Regardless of whether the domain's data is raw or derived insights, the use of federated teams promotes "data as a product" and "data as a service" thinking.
One technique to define your data domains is to mimic your systems' read and write profiles. The business users who "write" the data are the owners (producers), and those who "read" it are the consumers.
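As a rough illustration of that technique, the sketch below assumes you can export an audit or query log that records which team performed a read or write against which table; the log rows and team names are made up.

```python
from collections import defaultdict

# Hypothetical audit log rows: (team, table, operation).
audit_log = [
    ("billing", "invoice", "write"),
    ("billing", "invoice", "read"),
    ("marketing", "invoice", "read"),
    ("sales", "opportunity", "write"),
    ("marketing", "opportunity", "read"),
]

profiles = defaultdict(lambda: {"writers": set(), "readers": set()})
for team, table, op in audit_log:
    key = "writers" if op == "write" else "readers"
    profiles[table][key].add(team)

for table, p in profiles.items():
    # Teams that write a table are candidate owners (producers);
    # teams that only read it are its consumers.
    consumers = p["readers"] - p["writers"]
    print(f"{table}: owners={sorted(p['writers'])}, consumers={sorted(consumers)}")
```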
3. Define Your "Next Gen" Data Catalog
The next step of your data governance journey is defining your data catalog. As mentioned in the data lake paradox, the chief criticism of the data lake is that it does not encourage analysts to document their efforts and catalog them for reuse. This ambiguity leaves their co-workers starting from scratch, duplicating efforts, often with different and conflicting results.
We need a way to keep track of what we "have" and what we "make": "have" being the source objects/fields, and "make" being the insights created. Objects and fields are pretty straightforward since you can export the DDL from the database. The insights are a little more complicated because we derive them from SQL across multiple tables or compute them with code pipelines.
Since we are using SQL/code, we also need to track the version used to generate a given result. Insights are often layered, so we must also record the lineage (data provenance or pedigree) of the elements that comprise the results.
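For the "have" side, most databases let you harvest objects and fields automatically. The sketch below uses SQLAlchemy's schema inspector, assuming a SQLAlchemy-compatible database; the connection string and schema name are placeholders.

```python
# A minimal sketch of harvesting the "have" side of the catalog (objects/fields)
# using SQLAlchemy's schema inspector. The DSN and schema name are placeholders.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:password@host/dbname")  # placeholder DSN
inspector = inspect(engine)

have_catalog = []
for table in inspector.get_table_names(schema="public"):
    for column in inspector.get_columns(table, schema="public"):
        have_catalog.append({
            "object": table,
            "field": column["name"],
            "type": str(column["type"]),
            "nullable": column["nullable"],
        })
```

The "make" side, the SQL, code versions, and lineage behind each insight, needs more than a DDL export, which is where the next ideas come in.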
Tracking all this new dynamic metadata might seem like an overwhelming addition to a data catalog, and that is because it is. However, we can learn some lessons from the data science world since they have been addressing these concepts for years.
Data scientists expect complicated pipelines that involve code for extraction, aggregation, and modeling. Since they are defining new "variables" for the enterprise, they must be able to articulate the discrete steps and state of their work product. In addition, they expect change, so they build in version control as a first-class citizen. These new derived variables are called features and are maintained in a system called a feature store. Feature stores also provide direct access to the insights.
This makes the feature store a next-generation data catalog with the added benefit of exposing the data or model as a service. The next-generation data catalog should behave like microservices for data. You may not be using an AI or machine learning model in your business, but I bet you at least have some analytics that contain derived metrics.
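To show the behavior rather than any particular product, here is a toy, hypothetical registry in Python that captures the essentials: a versioned definition, a human-readable description, recorded lineage, and the ability to serve the insight on request. It is a sketch of the idea, not a feature store implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class FeatureDefinition:
    name: str
    version: int
    description: str                 # human-readable meaning of the insight
    lineage: List[str]               # upstream objects/insights it is derived from
    compute: Callable[..., object]   # the code that produces it

class FeatureRegistry:
    """A toy 'next-gen catalog': versioned, documented, and servable."""
    def __init__(self) -> None:
        self._features: Dict[Tuple[str, int], FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        self._features[(feature.name, feature.version)] = feature

    def serve(self, name: str, version: int, **inputs) -> object:
        # Exposing the insight on request ("microservices for data").
        return self._features[(name, version)].compute(**inputs)

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="monthly_revenue",
    version=2,
    description="Sum of invoice totals per calendar month, net of refunds.",
    lineage=["erp.invoice", "erp.refund"],
    compute=lambda invoices, refunds: sum(invoices) - sum(refunds),
))
print(registry.serve("monthly_revenue", 2, invoices=[100, 250], refunds=[30]))
```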
For starters, you should define these metrics in human terms. What do they mean? Where did they come from, and what are they to be used for? Most likely, this can all be captured in the description fields of your analytics platform. However, that may not be good enough.
You might consider pulling the metadata out and publishing it somewhere more accessible, especially if you have many users or multiple analytics tools. Ideally, you will also include the lineage, which means auto-discovering your source systems. By this point, you are well on your way to developing your next-generation data catalog. You may choose to purchase a commercial product or roll your own. In either case, these first steps will be the same.
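Even before you pick tooling, the published metadata can be as simple as one record per metric. The entry below is purely illustrative; the metric, owner, and commit reference are placeholders.

```python
import json

# An illustrative catalog entry for one derived metric, ready to publish
# somewhere more accessible than the BI tool (a wiki, a repo, an API).
metric_entry = {
    "name": "churn_propensity",
    "description": "Likelihood that a customer cancels within 90 days.",
    "owner_domain": "customer_success",
    "intended_use": "Prioritizing retention outreach; not for billing decisions.",
    "lineage": {
        "sources": ["crm.account", "erp.invoice"],
        "derived_from": ["monthly_revenue", "support_ticket_count"],
    },
    "code_version": "git:abc1234",   # placeholder commit reference
}

print(json.dumps(metric_entry, indent=2))
```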
4. Define Your Data Epochs
We define a data epoch as a point in time where the behavior of data changes. For example, if your company acquires your competition on Jan 1, 2022, the insights before that date and after that date will most likely look different. Normalizing the data before and after that date is a topic for another day; what is essential at this point is you identify those points in time and understand how they affected your insights.
It is crucial to look at the data itself as a time series to find all the epochs, since some might not be obvious. For example, you may add a new status field value, which will impact any insights that use that status field. Plotting counts of field missingness or categorical values over time is an excellent technique for identifying these epochs. Once these anomalies are identified, they should also be stored in the data catalog with an explanation.
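A minimal sketch of that approach in pandas, assuming a dataset with a date column and a categorical status field (the file and column names are placeholders):

```python
import pandas as pd

# A sketch of epoch hunting: plot field missingness and category counts over time.
# 'invoices.parquet', 'invoice_date', and 'status' are placeholders for your schema.
df = pd.read_parquet("invoices.parquet")
df["month"] = pd.to_datetime(df["invoice_date"]).dt.to_period("M")

# Fraction of missing values per month; a sudden jump or drop hints at an epoch.
missingness = df.groupby("month")["status"].apply(lambda s: s.isna().mean())
missingness.plot(title="status missingness by month")

# Counts of each categorical value per month; a brand-new value is a likely epoch.
status_counts = df.groupby(["month", "status"]).size().unstack(fill_value=0)
status_counts.plot(title="status value counts by month")
```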
Using a profiler like pandas will also give you statistical information about your fields. Finally, you may consider bisecting data sets by time and comparing the statistics using the describe() method. If the results are grossly different, you may have an epoch in that subset.
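A small sketch of that bisection check, again with placeholder file, column, and cutoff date:

```python
import pandas as pd

# Split on a suspected epoch date and compare summary statistics.
df = pd.read_parquet("invoices.parquet")
df["invoice_date"] = pd.to_datetime(df["invoice_date"])

cutoff = pd.Timestamp("2022-01-01")          # e.g., the acquisition date
before = df[df["invoice_date"] < cutoff]
after = df[df["invoice_date"] >= cutoff]

# Grossly different counts, means, or quantiles suggest an epoch at the cutoff.
print(before["amount"].describe())
print(after["amount"].describe())
```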
Conclusion
Data governance is a continuous process of improvement and understanding. It starts with a basic inventory of what you have and what you have defined, and it evolves into a methodology for getting the right information to the right people with the confidence that the data is secure and accurate.
Business leaders rely on accurate data to make strategic decisions. As companies across all industries continue to adopt a data-driven approach, data has become a key differentiator. Learn more about our services and how Atrium can help with your data strategy!