Data Mesh and the Rise of Data Lakehouse

Contributor: Rick Arnett

Data mesh is a new paradigm for managing enterprise data. Introduced in 2019 by Zhamak Dehghani, it addresses some of the biggest challenges facing data architects and data engineers head-on. It also borrows ideas from elsewhere in IT: Agile, microservices, and domain-driven design. Before we talk about data mesh, let’s take a few minutes to recap these current challenges and baseline a few concepts.

Challenges of Current Data Architecture 

Challenges of the Data Warehouse  

The data warehouse brought great order to the enterprise by defining the information we need to run the business and creating a home for it, a proper home that had a place for everything all neatly tied together. The data warehouse was a significant breakthrough in technology. It was the “wheel” of data architecture.

However, it also introduced a few potholes:

(1) It takes a long time to introduce new information to the data warehouse because it must be modeled properly and related to existing data.

(2) It is very difficult to change since the schema is monolithic.

(3) The resources needed to maintain a data warehouse are centralized and highly specific.

(4) The original data is most often transformed to fit into the data warehouse schema and the original format is lost or summarized.

Challenges of the Data Lake

The data lake sprang directly out of the data warehouse’s shortfalls. Since data lakes are composed of raw dumps of unstructured data, analysts now had access to the raw data and could aggregate insights themselves. The mentality was “just dump it in there, and someone will figure it out if they need it.”

Speed and access to raw data addressed many of the issues of the data warehouse. With no master schema, there was no need to model the data up front, which saved time and reduced the need for highly specific skills. Data lakes also tended to democratize the raw data in the spirit of self-service.

The speed to integrate new information and the self-service nature are great benefits. However, the data lake has a few leaks of its own:

(1) Users are often overwhelmed by the volume and number of sources in the data lake. Analysts have several choices of sources and aggregation techniques to answer the same business question, and those choices will often give varying answers.

(2) It is difficult to dole out the proper access to the many and growing sources in the data lake.

The Rise of the Data Warehouse 2.0, aka Lakehouse

In order to address the challenges of the data lake and keep the benefits of the data warehouse, the data warehouse 2.0, aka lakehouse, emerged. In simple terms, a lakehouse aims to take the best elements of the data lake (fast access to structured or unstructured data) and combine them with the best elements of a data warehouse (consistent structure, schema, data access controls, and transformation capabilities). This is a data warehouse that includes a landing zone for raw data. Instead of running complicated ETL (Extract/Transform/Load) jobs to format the data into the monolithic schema, the lakehouse keeps an as-is copy of the data and either processes it via pipelines into the structured data warehouse (movement) or uses a logical (virtualization) layer to provide access and structure.

Modern data lakes and data warehouses are evolving along different paths, but both lead to the same place and look almost identical, with the following features: a landing zone for raw unstructured extracts, a toolset to transform unstructured data into structured data using either movement or virtualization, a data catalog that defines raw and derived data sets and provides governance, a service layer to expose ‘blessed’ data products, and a self-service interface to both consume higher-order data sets and create them.

Now that we have a converged lake / warehouse with all the features to support self-service analytics, is this enough to unlock the data agility that the enterprise has sought for so long? The answer is simply no.

We might have all the building blocks in place, but we haven’t addressed the people, processes, or culture. Culture, the latest upgrade to the People Process Technology (PPT) Framework, is key to transforming into a data driven company.

The enterprise must foster a culture that treats its internal data products with the same importance and controls that it treats the products and services it sells to its customers. Business outcomes are a function of the data products that define and measure their success. Data products are a function of data quality which is a function of the culture that emphasizes good controls.

“Plans are worthless, but planning is everything” – Dwight D Eisenhower

Data Mesh, Just in Time!

Data mesh is a paradigm shift for data management that Zhamak Dehghani developed in 2019.

In Dehghani’s seminal article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” and its 2020 follow-up “Data Mesh Principles and Logical Architecture,” she introduces a distributed architecture with four primary principles:

  1. Domain-oriented data decomposition and ownership.
  2. Data as a product.
  3. Self-service data platform.
  4. Federated computational governance.

Despite the obvious suggestion to move from a centralized architecture to a distributed one, I think the principles Dehghani lays out are even more significant as a cultural paradigm shift than an architectural one.

Domain-Oriented Data Decomposition and Ownership 

This principle holds that those who are closest to the data should own it. Data owners should not be data engineering specialists but rather the producers and consumers who work with that data every day. I prefer to think of data domains as data neighborhoods where everyone is close enough that metadata definitions are common knowledge among the members of that domain.

Dividing data into domains is a good idea whether or not you are implementing data mesh. Breaking governance, process, people, and responsibility into smaller, bite-size chunks makes a daunting task easier to swallow.

The border of these domains is entirely up to the company to choose, but it should be pretty logical and intuitive. Spotify is often used to introduce this concept, with example data domains of artist, podcasts, user, and media. To me, that example is a little abstract.

For a Software as a Service client, we recently used Company, Person, Deal, Offering, and Usage as the five data subdomains in the “Quote to Cash” super domain. The way we treat data inside the domain can be a little looser: a half-inch socket one day and a crescent wrench the next will still do the same job. But as we export data, we must be very strict and treat that data with the same vim and vigor with which we treat the SKUs we sell to our customers. It must be controlled and versioned. Regardless of whether it is raw data, aggregate data, or derived insights, we must have strict control of this “data product”.

Data As a Product

Data-as-a-product thinking only works in a culture that values the importance of internal data’s relationship to business outcomes. Once we achieve this level of appreciation and understanding we can assign the appropriate level of resources to steward the data, and set and monitor its controls.

As the domains create data products, they expose them to other parts of the business (other domains) tagged with a version. Internally they also associate that data product version with the version of code (pipeline) that was used to create it.

Three things are associated here. For example:

  1. Version 2.1 of the data product “Top 10 customers” is defined as: “The 10 customers with the highest sum of paid invoices from purchases made from January 1 to December 31, not including distributors or any government contracts.”
  2. The code that generates this data product is http://github.com/acme/commit/833c21 (note: commit 833c21 represents the version of that code that matches version 2.1 of the data product).
  3. And the infrastructure to consume this data product is here http://acmeco.internal/data-products/top-10-customers/v2.1

Once the relationship of the meta description, version of code that creates the data product, and infrastructure link are established, they are registered in the data catalog (topic for another day) and they are almost ready for consumption by other domains.
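The three-way association above can be sketched as a small record. This is a minimal illustration, not a real catalog API: the `DataProductVersion` class, the `register` function, and the in-memory `catalog` dictionary are hypothetical names; only the product name, version, commit link, and endpoint come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProductVersion:
    """One registered version of a data product (hypothetical catalog record)."""
    name: str
    version: str
    definition: str  # the human-readable meta description
    code_ref: str    # the commit of the pipeline that builds this version
    endpoint: str    # where consumers read this version


# A tiny in-memory stand-in for a data catalog, keyed by (name, version).
catalog = {}


def register(product):
    """Add a version to the catalog, refusing to overwrite a published one."""
    key = (product.name, product.version)
    if key in catalog:
        raise ValueError(f"{product.name} {product.version} is already registered")
    catalog[key] = product


top10_v21 = DataProductVersion(
    name="Top 10 customers",
    version="2.1",
    definition=(
        "The 10 customers with the highest sum of paid invoices from purchases "
        "made from January 1 to December 31, not including distributors or any "
        "government contracts."
    ),
    code_ref="http://github.com/acme/commit/833c21",
    endpoint="http://acmeco.internal/data-products/top-10-customers/v2.1",
)
register(top10_v21)
```

Making the record immutable and refusing to re-register an existing version reflects the idea that a published version is a fixed contract: a change in definition means a new version, not an edit in place.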

Before we discuss how to expose this data product to other domains, we have two more concepts we need to introduce which are tightly coupled to the data product: Data Producers and Data Consumers. Note producers and consumers may be systems or humans or both. They also are enumerated and attached to the data product in the data catalog. This is a useful reference to identify dependencies when a change needs to be made or there is a problem.

A data product is often combined with additional data products in other domains so understanding the dependencies and lineage is critical. Knowing who the human consumers are is also very important when we need to update a data product version.
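One way to picture this dependency tracking is as a graph walk. The sketch below assumes the catalog can tell us, for each data product, who consumes it directly; the product and team names other than “Top 10 customers” are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: each data product maps to its direct consumers,
# which may be other data products (systems) or teams (humans).
consumers = {
    "Paid invoices": ["Top 10 customers"],
    "Top 10 customers": ["Account health score", "sales-ops team"],
    "Account health score": ["Renewal forecast"],
    "Renewal forecast": ["finance team"],
}


def downstream(product, consumers):
    """Return every direct and indirect consumer of `product`, breadth-first."""
    seen = []
    queue = deque(consumers.get(product, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(consumers.get(node, []))
    return seen


impacted = downstream("Top 10 customers", consumers)
print(impacted)
```

Here a change to “Top 10 customers” reaches the finance team only indirectly, through two intermediate data products, which is exactly the kind of dependency that is easy to miss without recorded lineage.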

For example:

Acme Co determines that version 2.1 of the data product “Top 10 customers” is slightly flawed: because Acme Co has a net 30 agreement on its invoices, a purchase made late in December may not be paid until January. So Acme Co decides to create version 3.0 of the data product “Top 10 customers,” which now includes the language “purchased from January 1 and invoices paid prior to January 31 of the following year.” This will accommodate the net 30 arrangement and is slightly more accurate. The producing domain now has a list of all the consumers of this data product, and it can decide with those consumers whether to keep serving the old version 2.1 in addition to 3.0 or discontinue 2.1 altogether.
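To make the difference between the two versions concrete, here is a rough sketch of both definitions side by side. It rests on an assumption the definition text does not spell out: that version 2.1 only counted invoices paid by December 31. The invoice records, field names, and customer names are made up.

```python
from collections import defaultdict
from datetime import date


def top_customers(invoices, year, paid_before, n=10):
    """Rank customers by paid invoice total for purchases made in `year`.

    Only invoices paid strictly before `paid_before` count. Invoices flagged
    `excluded` (distributors, government contracts) are skipped.
    """
    totals = defaultdict(int)
    for inv in invoices:
        if inv["excluded"] or inv["purchased"].year != year:
            continue
        if inv["paid"] is None or inv["paid"] >= paid_before:
            continue
        totals[inv["customer"]] += inv["amount"]
    # Highest total first; ties broken alphabetically so the output is stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [customer for customer, _ in ranked[:n]]


def top_customers_v2_1(invoices, year, n=10):
    # ASSUMPTION: v2.1 only counted invoices paid by December 31.
    return top_customers(invoices, year, date(year + 1, 1, 1), n)


def top_customers_v3_0(invoices, year, n=10):
    # v3.0: invoices paid prior to January 31 of the following year.
    return top_customers(invoices, year, date(year + 1, 1, 31), n)


# Made-up invoices; amounts are whole dollars.
invoices = [
    {"customer": "Initech", "amount": 500, "purchased": date(2023, 3, 1),
     "paid": date(2023, 3, 20), "excluded": False},
    {"customer": "Globex", "amount": 900, "purchased": date(2023, 12, 20),
     "paid": date(2024, 1, 15), "excluded": False},
    {"customer": "Umbrella", "amount": 5000, "purchased": date(2023, 6, 1),
     "paid": date(2023, 6, 15), "excluded": True},
]

print(top_customers_v2_1(invoices, 2023, n=1))  # ['Initech']
print(top_customers_v3_0(invoices, 2023, n=1))  # ['Globex']
```

The same source data gives a different top customer under each version, which is why consumers need to know which version they are reading and why the producing domain should not silently swap one for the other.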

Self-Service Data Platform

In the previous example, we illustrated the situation where a data product may change over time and what that potential impact might be on data product consumers. This impact is reduced by exposing the data product to the consuming domain on a self-service platform. This is a core tenet of the data mesh, and the concept was borrowed from the data fabric architecture introduced by Noel Yuhanna in the early 2000s.

A self-service data platform can expose versioned data products as services like an API gateway or as part of a cloud data hub platform like Snowflake or even through built-in data catalogs like Tableau.

The important aspect is that the platform lets consumers see the available data products, their versions, and their metadata definitions, and that it can control access.
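Those two capabilities, open discovery and controlled access, can be sketched in a few lines. This is an illustrative in-memory model only, not how an API gateway, Snowflake, or Tableau implements it; the `SelfServiceCatalog` class, its method names, the domain names, and the v3.0 URL are all hypothetical.

```python
class SelfServiceCatalog:
    """In-memory stand-in for a self-service data platform (hypothetical)."""

    def __init__(self):
        # (name, version) -> {"definition": str, "endpoint": str, "allowed": set}
        self._products = {}

    def publish(self, name, version, definition, endpoint, allowed_domains):
        self._products[(name, version)] = {
            "definition": definition,
            "endpoint": endpoint,
            "allowed": set(allowed_domains),
        }

    def browse(self):
        """List every product version and its definition; discovery is open to all."""
        return [
            (name, version, self._products[(name, version)]["definition"])
            for name, version in sorted(self._products)
        ]

    def endpoint_for(self, name, version, domain):
        """Return the endpoint only if the requesting domain has been granted access."""
        meta = self._products[(name, version)]
        if domain not in meta["allowed"]:
            raise PermissionError(f"{domain} may not read {name} v{version}")
        return meta["endpoint"]


platform = SelfServiceCatalog()
platform.publish(
    "Top 10 customers", "2.1", "Paid by year end",
    "http://acmeco.internal/data-products/top-10-customers/v2.1",
    allowed_domains=["sales", "finance"],
)
platform.publish(
    "Top 10 customers", "3.0", "Paid prior to January 31",
    "http://acmeco.internal/data-products/top-10-customers/v3.0",  # assumed URL
    allowed_domains=["finance"],
)

for name, version, definition in platform.browse():
    print(name, version, definition)
```

Keeping both versions published side by side lets consuming domains move from 2.1 to 3.0 on their own schedule, while the access check stays in one place.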

Federated Computational Governance

Decentralization is at the core of data mesh, and federated computational governance allows the data domains to do what they want inside their own boundaries as long as they can expose their data products for self-service. This resurrects the age-old centralized versus distributed architecture debate, and the bottom line is that there is no right answer. There are no best practices, only trade-offs. Most likely the data domains mimic the hierarchy of the organization, and this principle says that work inside the domain can be done better and faster if we don’t impose a monolithic platform and management layer on everything. I think of this as “standards” at an enterprise level and “governance” at a domain level. At Atrium we refer to this as data agility.
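The split between enterprise standards and domain-level freedom can be expressed as a check that runs only at the export boundary. The sketch below is one possible reading, with an invented list of required fields and invented domain records; it is not a prescribed rule set.

```python
# Enterprise-level standard (assumed for illustration): the metadata every
# exported data product must carry, regardless of which domain built it.
REQUIRED_FIELDS = ("name", "version", "owner", "definition", "endpoint")


def missing_fields(product):
    """Return the required fields that are absent or empty, in a fixed order."""
    return [field for field in REQUIRED_FIELDS if not product.get(field)]


def can_publish(product):
    """A product may be exposed for self-service only if it meets the standard."""
    return not missing_fields(product)


# Two domains using different tools internally; the standard ignores "tooling".
sales_product = {
    "name": "Top 10 customers",
    "version": "3.0",
    "owner": "sales domain",
    "definition": "Paid prior to January 31",
    "endpoint": "http://acmeco.internal/data-products/top-10-customers/v3.0",
    "tooling": "SQL views",
}
usage_product = {
    "name": "Daily active seats",
    "version": "1.0",
    "owner": "",  # not yet assigned, so the product fails the standard
    "definition": "Seats with at least one login per day",
    "tooling": "notebook pipeline",
}

print(can_publish(sales_product))     # True
print(missing_fields(usage_product))  # ['owner', 'endpoint']
```

The check says nothing about how each domain builds its product, only about what must be true when the product leaves the domain.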

Data Mesh Architecture

Data mesh favors decentralized over centralized. However, I would argue that Dehghani was more focused on the management and administration side of a platform than the platform itself. The spirit is that each domain is responsible for its own schemas, not that each domain is responsible for acquiring its own hardware. The spirit is that each team is empowered to choose its own tools, not that everyone must use ETL tool X or analytics platform Y. Cloud-based platforms like Snowflake are ideally suited for that level of granular governance.

Data Mesh Process Flow

If we look at a data mesh process flow for a single data product it might look like the following:

  1. Input: source system or other domain data product
  2. Processing: computational, SQL aggregation via movement or virtualization
  3. Catalog: producers, lineage, code, infrastructure, security, and other metadata
  4. Service: expose the new data product as a service
  5. Measure: collect the KPI values (quality)  of the data product and feedback into the data catalog
  6. Monitor: identify the consumers of the DP and hydrate the data catalog with values
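The six steps above can be traced end to end in a toy example. Everything here is a simplification under stated assumptions: the catalog is a plain dictionary, the quality KPI is just the share of usable input rows, and the function names, field names, and source rows are invented.

```python
catalog = {}  # hypothetical data catalog: product name -> entry


def build_data_product(name, version, source_rows, lineage):
    # 1. Input: rows from a source system or another domain's data product.
    # 2. Processing: aggregate the amount per customer, skipping unusable rows.
    totals = {}
    usable = 0
    for row in source_rows:
        if row.get("customer") is None or row.get("amount") is None:
            continue
        usable += 1
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]

    # 3. Catalog: record version, lineage, and the data itself.
    entry = {
        "version": version,
        "lineage": list(lineage),
        "data": totals,
        "quality": None,
        "consumers": set(),
    }
    catalog[name] = entry

    # 5. Measure: feed a quality KPI (share of usable rows) back into the catalog.
    entry["quality"] = usable / len(source_rows) if source_rows else 0.0
    return entry


def serve(name, consumer):
    # 4. Service: expose the data product to a consumer.
    entry = catalog[name]
    # 6. Monitor: record who consumed it so the catalog knows its consumers.
    entry["consumers"].add(consumer)
    return entry["data"]


rows = [
    {"customer": "Initech", "amount": 100},
    {"customer": "Initech", "amount": 50},
    {"customer": None, "amount": 10},  # unusable: no customer
    {"customer": "Globex", "amount": 70},
]
build_data_product("Revenue by customer", "1.0", rows, lineage=["billing system"])
print(serve("Revenue by customer", "finance team"))
print(catalog["Revenue by customer"]["quality"])
```

Steps 5 and 6 both write back into the catalog, which is what turns it from a static inventory into a record of how healthy each data product is and who depends on it.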

Data Mesh Challenges

Data mesh has been gaining a lot of traction in recent years. Like microservices, it just “feels right” to most architects. I think that this stems from the fact that data domains and product thinking are just good ideas even if you are not deploying a data mesh. Secondly, I think that allowing teams to make their own decisions and moving that governance from global to local solves age-old battles between IT and the rest of the business. But a few challenges come to mind:

  1. Giving more autonomy to the data domains means that they may need to hire more data engineers, security, data scientists, and other technical folks outside of global IT. This is happening already, but it is worth mentioning.
  2. The specific “data mesh platforms and tooling” are still point solutions. However, we see a lot of progress on this front.
  3. Enterprises need to embrace a data-first culture and adopt data product thinking.
  4. Data catalog products tend to be expensive and use interrogation of SQL logs to reverse engineer data products. Registering data products still seems ad-hoc.

Data Mesh vs Data Fabric

The differences between data mesh and data fabric are subtle, and both approaches attempt to solve the same pain points. The primary difference is that data mesh really acts like microservices for data, and each data product is exposed as an API. It does not really prescribe that the APIs cohabitate on the same platform. On the other hand, a data fabric includes a virtualization layer through a single platform. This difference is so minimal that I feel it is not worth further discussion. What is important are the similarities, namely that both approaches support heterogeneous architectures and tools.

Conclusion

Although purpose-built data mesh platforms may not be ready for prime time yet, they appear to be on the horizon. Point solutions and tools are available and are in the spirit of a decentralized architecture. Of the four main principles of data mesh, I suspect that federated governance will be the most challenging for many organizations to adopt, while other organizations will find it a liberating path to data agility.

Federated governance is not a replacement for global standards; it is the execution of them. That being said, I think the other three principles (data domains, data as a product, and self-service data infrastructure) are no-brainers for any company trying to move the needle with data. They each provide a paradigm to break a complex and large initiative into smaller, simpler ones. There are a few things in technology that have staying power: Agile, domain-driven design, and microservices, to name a few. At Atrium, we believe that data mesh is the next item on that list.

If you need help with data mesh, or data architecture and strategy in general, we can help.