Carlos Grande

My Data Mesh Thesis

I wanted to write a post with my thesis about the Data Mesh paradigm coined by Zhamak Dehghani. To be honest, I’m still wondering what a Data Mesh is.

Looking at different articles and videos, and talking to other people wondering the same thing, you realize that interpretations differ considerably. The Data Mesh paradigm rests on some abstract ideas, and hence it invites different readings. Still, if I had to say what a Data Mesh is in a short definition, I would pick the following:

A data mesh decentralizes the ownership, transformation, and serving of data. It is proposed as an alternative to centralized architectures, whose growth is limited by their dependencies and complexity.

Data Mesh: Centralized VS decentralized architecture

The goal of this article is to bring some clarity to the concepts associated with a Data Mesh, told end to end with a thesis structure. This is simply my public, personal attempt to make sense of all the information I have found on this matter.

CONTENTS:

1. Context
   1.1 First Generation: Data Warehouse Architecture
   1.2 Second Generation: Data Lake Architecture
   1.3 Third Generation: Multimodal Cloud Architecture
2. Principles
   2.1 Domain-oriented ownership
   2.2 Data as a Product
   2.3 Self-serve Data Platform
   2.4 Federated Computational Governance
3. Data Mesh Architecture
   3.1 Data Product
   3.2 Roles
   3.3 Self-Serve Platform
   3.4 Data Governance
   3.5 Blueprint
4. Data Mesh Implementation
   4.1 Investment Factors
   4.2 Building a Data Mesh
5. Case Studies
   5.1 Zalando
   5.2 Intuit
   5.3 Saxo Bank
   5.4 JP Morgan Chase
   5.5 Kolibri Games
   5.6 Netflix
   5.7 Adevinta
   5.8 HelloFresh
6. Conclusion
7. Data Mesh Vocabulary
8. Discovery Roadmap
   8.1 Zhamak Dehghani
   8.2 Introductory Content
   8.3 Deep Dive
Acknowledgements
References and links

1. Context

1.1 First Generation: Data Warehouse Architecture

Data warehousing architecture today is influenced by early concepts such as facts and dimensions formulated in the 1960s. The architecture intends to flow data from operational systems to business intelligence systems that traditionally have served the management with operations and planning of an organization. While data warehousing solutions have greatly evolved, many of the original characteristics and assumptions of their architectural model remain the same. (Dehghani, 2022)

Data Warehouse Architecture

1.2 Second Generation: Data Lake Architecture

Data lake architecture was introduced in 2010 in response to the challenges data warehousing architecture faced in satisfying new uses of data: access for data science and machine learning model training workflows, and support for massively parallelized access to data.

Unlike data warehousing, data lake assumes no or very little transformation and modeling of the data upfront; it attempts to retain the data close to its original form. Once the data becomes available in the lake, the architecture gets extended with elaborate transformation pipelines to model the higher value data and store it in lakeshore marts.

This evolution of data architecture aims to address the ineffectiveness and friction introduced by the extensive upfront modeling that data warehousing demands. The upfront transformation is a blocker and leads to slower iterations of model training. Additionally, it alters the nature of the operational system’s data and mutates it in a way that models trained with transformed data fail to perform against real production queries. (Dehghani, 2022)

Data Lake Architecture

1.3 Third Generation: Multimodal Cloud Architecture

The third and current generation data architectures are more or less similar to the previous generations, with a few modern twists:

  • Streaming for real-time data availability with architectures such as Kappa
  • Attempting to unify batch and stream processing for data transformation with frameworks such as Apache Beam (see the sketch below)
  • Fully embracing cloud-based managed services, with modern cloud-native implementations that isolate compute and storage
  • Convergence of warehouse and lake, either extending the data warehouse to include embedded ML training, e.g. Google BigQuery ML, or building data warehouse integrity, transactionality, and querying systems into data lake solutions, e.g. Databricks Lakehouse
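
Since Apache Beam comes up as the unifying framework here, a minimal sketch may help; this is my illustration under assumed names (the input file and transform labels are invented), not something from the original article.

```python
# A minimal Apache Beam sketch (Python SDK): the same transform chain can run
# over a bounded (batch) source or an unbounded (streaming) one. The file name
# is a placeholder.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.csv")        # bounded source
        | "ExtractKey" >> beam.Map(lambda line: line.split(",")[0])
        | "Count" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )

# Swapping ReadFromText for a streaming source (e.g. beam.io.ReadFromPubSub,
# with streaming=True in the pipeline options) leaves the rest of the chain
# unchanged -- that is the batch/stream unification the bullet refers to.
```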

The third generation data platform addresses some of the gaps of the previous generations, such as real-time data analytics, and reduces the cost of managing big data infrastructure. However, it suffers from many of the underlying characteristics that led to the limitations of the previous generations. (Dehghani, 2022)

Multimodal Cloud Architecture

2. Principles

Data Mesh Principles

2.1 Domain-oriented ownership

This principle aims to decentralize the ownership of analytical data, handing it to the business domains closest to the data — those that are either its source or its main consumers. Decompose the data artefacts (data, code, metadata, policies) logically, based on the business domain they represent, and manage their life cycle independently. (Dehghani, 2022)

2.2 Data as a Product

This simply means applying widely used product thinking to data and, in doing so, making data a first-class citizen: supported like any product, with an owner and a development team behind it.

Existing or new business domains become accountable for sharing their data as a product served to data users – data analysts and data scientists. Data as a product introduces a new unit of logical architecture called the data product quantum, controlling and encapsulating all the structural components — data, code, policy and infrastructure dependencies — needed to share data as a product autonomously. (Dehghani, 2022)
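
To make the data product quantum more concrete, here is a hypothetical sketch of the structural components it encapsulates; the class and field names are my own invention, not a standard API.

```python
# A hypothetical model of a "data product quantum": one deployable unit that
# bundles the data, code, policies, and infrastructure dependencies needed to
# serve data as a product autonomously. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DataProductQuantum:
    name: str                                             # e.g. "podcasts.released-episodes"
    owner_domain: str                                     # accountable business domain
    code: list[str] = field(default_factory=list)         # pipeline and API repositories
    output_ports: list[str] = field(default_factory=list) # addressable data endpoints
    policies: dict[str, str] = field(default_factory=dict)       # e.g. {"pii": "masked"}
    infrastructure: dict[str, str] = field(default_factory=dict) # e.g. {"storage": "s3://..."}
```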

2.3 Self-serve Data Platform

The principle of creating a self-serve infrastructure is to provide tools and user-friendly interfaces so that generalist developers can develop analytical data products where, previously, the sheer range of operational platforms made this incredibly difficult.

A new generation of self-serve data platform to empower domain-oriented teams to manage the end-to-end life cycle of their data products, to manage a reliable mesh of interconnected data products and share the mesh’s emergent knowledge graph and lineage, and to streamline the experience of data consumers to discover, access, and use the data products. (Dehghani, 2022)

2.4 Federated Computational Governance

This is an inevitable consequence of the first principle. Wherever you deploy decentralised services—microservices, for example—it’s essential to introduce overarching rules and regulations to govern their operation. As Dehghani puts it, it’s crucial to "maintain an equilibrium between centralisation and decentralisation".

A data governance operational model that is based on a federated decision making and accountability structure, with a team made up of domains, data platform, and subject matter experts — legal, compliance, security, etc. It creates an incentive and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh. The governance model heavily relies on codifying and automated execution of policies at a fine-grained level, for each and every data product. (Dehghani, 2022)
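
As a toy illustration of what "codifying and automated execution of policies" could look like, here is a sketch assuming each data product publishes a small metadata record; the policy names and the record shape are my assumptions, not from the source.

```python
# Global policies are codified once and executed automatically against every
# data product; domains stay autonomous for everything else. The policy names
# and the metadata record shape are assumptions for illustration.
GLOBAL_POLICIES = {
    "has_owner": lambda dp: bool(dp.get("owner")),
    "schema_published": lambda dp: "schema" in dp,
    "pii_classified": lambda dp: dp.get("pii_level") in {"none", "masked", "restricted"},
}

def check_compliance(data_product: dict) -> list[str]:
    """Return the names of the global policies this product violates."""
    return [name for name, rule in GLOBAL_POLICIES.items() if not rule(data_product)]

# A product missing a PII classification fails exactly one global check.
print(check_compliance({"owner": "podcasts-team", "schema": {"episode_id": "int"}}))
# -> ['pii_classified']
```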

3. Data Mesh Architecture

3.1 Data Product

A data product consists of the code (including pipelines), the data itself (including metadata), and the infrastructure required to run the pipelines. The goal is to have the application code and the data pipeline code under the same domain, owned by the same team. We are shifting responsibility to the people who actually understand the domain and create the data, instead of “data” owners in the data plane who usually struggle to understand the data and create friction between teams. This means that the people who change the application, and with it the data, are in charge of owning that change, using schema versioning and documentation to broadcast the data’s evolution to the different stakeholders (see the sketch below). This ensures that data schema changes can be implemented easily by the data creators, instead of data analysts adapting to changes after the fact.
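
As a sketch of what schema versioning for broadcasting data evolution might look like (not any specific schema registry's API, just the idea, with invented schema contents):

```python
# Producers publish a new schema version instead of mutating the old one; a
# simple compatibility rule protects downstream consumers.
SCHEMAS = {
    ("orders", 1): {"order_id": "int", "amount": "float"},
    ("orders", 2): {"order_id": "int", "amount": "float", "currency": "str"},
}

def is_backward_compatible(old: dict, new: dict) -> bool:
    # A new version may add fields, but must keep every existing field's type.
    return all(new.get(name) == dtype for name, dtype in old.items())

assert is_backward_compatible(SCHEMAS[("orders", 1)], SCHEMAS[("orders", 2)])
```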

Data Mesh Quantum: Data Products

3.2 Roles

The idea behind the Domain Ownership principle is to use Domain-Driven Design in the data plane alongside the operational plane, closing the gap between the two. The goal is to split teams around business domains, with each team fully cross-functional, not only at the operational level (DevOps) but also at the analytical level. Each team should have a data owner, data engineers, and QA engineers that validate not only microservices but also data quality.

Following Zhamak Dehghani’s book, we find two specific key roles related to a data product domain:

Data product developer roles: A group of roles responsible for developing, serving, and maintaining the domain’s data products for as long as those products exist and serve their consumers. Data product developers work alongside regular developers in the domain. Each domain team may serve one or multiple data products.

Key domain-specific objects (see the test sketch after this list):

  • Transformation code, the domain-specific logic that generates and maintains the data.
  • Tests to verify and maintain the domain’s data integrity.
  • Tests to continuously monitor that the data product meets its quality guarantees.
  • Generation of data product metadata, such as its schema, documentation, etc.
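
A hedged sketch of the kind of tests a domain team might ship alongside its data product; the column names and thresholds are invented for illustration.

```python
# Illustrative data product tests: integrity, freshness, and sanity checks
# that back the product's quality guarantees.
import pandas as pd

def test_no_null_keys(df: pd.DataFrame) -> None:
    assert df["episode_id"].notna().all(), "primary key must never be null"

def test_freshness(df: pd.DataFrame, max_lag_hours: int = 24) -> None:
    # Assumes 'published_at' is a timezone-aware UTC timestamp column.
    lag = pd.Timestamp.now(tz="UTC") - df["published_at"].max()
    assert lag <= pd.Timedelta(hours=max_lag_hours), "data product is stale"

def test_row_count_within_bounds(df: pd.DataFrame) -> None:
    assert 0 < len(df) < 10_000_000, "row count outside the expected range"
```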

Data product owner: A role accountable for the success of the domain’s data products in delivering value, satisfying and growing the data consumers, and defining the lifecycle of the data products.

Domain data product owners must have a deep understanding of who the data users are, how they use the data, and which native methods they are comfortable with for consuming it. The conversation between users of the data and product owners is a necessary piece for establishing the interfaces of data products.

Data Mesh Architect: A role responsible for the infrastructure team, holding the big picture of the Data Mesh, ensuring the data is self-served, and acting as the link between the infrastructure layer and the Federated Governance team.

Following the Data Mesh layer architecture, we would have three team types:

  • Data infrastructure teams (located on the infrastructure plane): Provide the underlying infrastructure required to build, run, and monitor data products.
  • Data product teams (located on the data product developer plane): Support the common data product developer journey. They are made up of the data product developer roles and a data product owner.
  • Federated computational governance team (located on the mesh supervision plane): Maintains an equilibrium between centralization and decentralization; which decisions need to be localized to each domain and which should be made globally for all domains.

In a data mesh, each data product team is in charge of the data related to its domain. The team must gather the data and move it to the right storage so it can be easily consumed by data users. Data engineers are part of the team, and they may use stream engines to move the data and perform ETL, or run data pipelines in a batch or micro-batch fashion. The key is that a data pipeline is simply an internal implementation detail of the data domain, handled within the domain instead of by separate data engineering teams.

In the next visualization, I tried to reflect the teams and roles involved in a Data Mesh. These roles are not exclusive; they may be conditioned by the company, the organizational structure, and the platform, differing considerably, but it's important to highlight the different layers and the teams to which they belong.

Data Mesh Teams & Roles

3.3 Self-Serve Platform

Data Mesh’s third principle, self-serve data infrastructure as a platform, exists for a reason. It is not that we have any shortage of data and analytics platforms and technologies, but we need to change them so that they can scale out the sharing, access, and use of analytical data in a decentralized manner, for a new population of generalist technologists. This is the key differentiator of data platforms that enable a Data Mesh implementation.

Mesh Service Infrastructure

3.4 Data Governance

A data mesh implementation requires a governance model that embraces decentralization, interoperability through global standardization, and automated execution of decisions in the platform. The idea is to create a team that maintains an equilibrium between centralization and decentralization; that is, which decisions need to be localized to each domain and which should be made globally for all domains.

Data mesh’s federated governance embraces change and multiple bounded contexts. A supportive organizational structure, incentive model, and architecture are necessary for the federated governance model to function: to arrive at global decisions and standards for interoperability, while respecting the autonomy of local domains and implementing global policies effectively.

The idea is to localize decisions as close to the source as possible while keeping interoperability and integration standards at the global level, so the mesh components can be easily integrated. In a data mesh, tools can enforce global policies, such as GDPR compliance or access management, as well as local policies, where each domain sets its own rules for its data products, such as access control or data retention.

| Pre Data Mesh Governance | Data Mesh Governance |
| --- | --- |
| Centralized team | Federated team |
| Responsible for data quality | Responsible for defining how to model what constitutes quality |
| Responsible for data security | Responsible for defining aspects of data security, i.e. data sensitivity levels for the platform to build in and monitor automatically |
| Responsible for complying with regulation | Responsible for defining the regulation requirements for the platform to build in and monitor automatically |
| Centralized custodianship of data | Federated custodianship of data by domains |
| Responsible for global canonical data modeling | Responsible for modeling polysemes, data elements that cross the boundaries of multiple domains |
| Team is independent from domains | Team is made of domain representatives |
| Aiming for a well-defined, static structure of data | Aiming to enable effective mesh operation, embracing a continuously changing and dynamic topology of the mesh |
| Centralized technology used by monolithic lake/warehouse | Self-serve platform technologies used by each domain |
| Measures success based on the number or volume of governed data (tables) | Measures success based on the network effect: the connections representing the consumption of data on the mesh |
| Manual processes with human intervention | Automated processes implemented by the platform |
| Prevent error | Detect error and recover through the platform's automated processing |

3.5 Blueprint

There are many applied Data Mesh architectures, some more decentralized than others, built with different tools and services. Even though there isn't a single canonical Data Mesh architecture, we can define its main components and how they relate.

Deep Store: A repository store that makes data addressable, with URLs, access control, versioning, encryption, metadata, and observability, so you can easily monitor and govern the data stored in a data lake.

New, modern engines have been created to unify real-time and batch data and perform OLAP queries with very low latency. As an example, Apache Druid can ingest and store massive amounts of data in a cost-efficient way, minimizing the need for data lakes.

Data Warehouse / Data Virtualization: A fast data layer in a relational database used for data analysis, particularly of historical data.

The main advantage of Data Virtualization is speed-to-market, where we can build a solution in a fraction of the time it takes to build a data warehouse. This is because you don’t need to design and build the data warehouse and the ETL to copy the data into it, and also don’t need to spend as much time testing.

Stream engine platform

A stream engine platform such as Kafka or Pulsar helps migrate to microservices and unify batch and streaming. This is the first step towards closing the gap between OLTP and OLAP workloads; both can use the streaming platform, either to develop event-driven microservices or to move data.

These platforms allow you to duplicate the data into different formats or databases in a reliable way, so you can serve the same data in different shapes to match the needs of downstream consumers (a minimal sketch follows).
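
A minimal sketch with kafka-python, assuming a local broker; the topic name, payload, and consumer group are placeholders of my own.

```python
# One stream, many shapes: a producer publishes domain events once, and each
# downstream consumer group reads the same stream independently to materialize
# it into whatever form suits it.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders.events", {"order_id": 42, "amount": 9.99})
producer.flush()

# A consumer group for the analytics sink; another group (e.g. a cache
# refresher) could read the very same topic without interfering.
consumer = KafkaConsumer(
    "orders.events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-sink",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
```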

Metadata and Data Catalogs

A collection of metadata, combined with data management and search tools, that helps analysts and other data users find the data they need, serves as an inventory of available data, and provides information to evaluate the data's fitness for intended uses.

Query Engines

These tools focus on querying different data sources and formats in a unified way. The idea is to query your data lake with SQL as if it were a relational database, albeit with some limitations. Some of these tools can also query NoSQL databases and much more. Query engines are the slowest option but provide the maximum flexibility.

Some query engines can integrate with data catalogs and join data from different data sources (see the sketch below).
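
To get a feel for this layer, a single-node example: DuckDB can run SQL directly over files in a lake-style store (distributed engines such as Trino or Presto play the same role at scale). The path and column names are my own illustration.

```python
# SQL over raw lake files, no warehouse load required.
import duckdb

top_countries = duckdb.sql("""
    SELECT country, count(*) AS listeners
    FROM 'lake/podcasts/listeners/*.parquet'
    GROUP BY country
    ORDER BY listeners DESC
    LIMIT 10
""").df()  # materialize the result as a pandas DataFrame
```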

Data Mesh Architecture

4. Data Mesh Implementation

Currently, we have many kinds of Data Meshes, some highly centralized and others more decentralized. Migrating to a decentralized architecture is difficult and time-consuming; it takes a long time to reach the level of maturity needed to run it at scale. Although there are technical challenges, the main difficulty is changing the organization's mindset.

4.1 Data Mesh Investment Factors

Data Mesh is difficult to implement because of its decentralized nature but, in the end, it is required to solve the scalability issues that companies are currently facing. Centralized architectures work better for small companies or companies that are not data-driven.

Only implement a data mesh if you have difficulties scaling, team friction, data quality issues, bottlenecks, or governance and security problems. You must also have a Big Data problem, with huge amounts of structured and unstructured data.

In my opinion, there are many factors you should analyze before deciding whether your company needs the Data Mesh paradigm. That said, after reading different articles, there are three main factors to keep in mind:

  • The number of data sources your company has feeding the analytical data platform.
  • The size of your data team: how many data analysts, data engineers, and product managers.
  • The number of data domains your company has: how many functional teams (marketing, sales, operations, etc.) rely on your data sources to drive decision-making, and how many products your company has.

Data Mesh Investment Factors

4.2 Building a Data Mesh

In their blogs, Javier Ramos and Sven Balnojan have done an excellent job explaining the different steps required to build a data mesh. I really recommend checking their articles to get more details.

  • Data Mesh Applied by Sven Balnojan
  • Building a Data Mesh: A beginners guide by Javier Ramos

I have tried to summarize the different steps to decentralize your architecture and start building a Data Mesh.

Step 1: Addressable data

Address your data by standardizing path names, use a REST approach to name data products as resources, and add SLAs to the endpoints, monitoring them to make sure the data is always available (see the sketch at the end of this step).

Reroute your query engines and BI tools to use the new data products, which are independent and addressable.

The data infrastructure team will be in charge of this step, still using a centralized approach.
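
A tiny sketch of what such standardized, REST-style addressing could look like; the URL scheme is an assumption of mine, not a standard.

```python
# Addressable data: one naming convention turns domain, product, and version
# into a stable, REST-like resource path.
def data_product_url(domain: str, product: str, version: int) -> str:
    return f"https://data.example.com/{domain}/{product}/v{version}"

assert (
    data_product_url("podcasts", "released-episodes", 2)
    == "https://data.example.com/podcasts/released-episodes/v2"
)
```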

Step 2: Discoverability (Metadata and Data Catalog)

Create a space to find the new data sources, with the following capabilities:

  • Search, discover and “add to the cart” for data within your enterprise.
  • Request access and grant access to data products in a way that is usable to data owners and consumers without the involvement of a central team.

In this step, also work on the data product features, adding tests for data quality, lineage, monitoring, etc. A toy catalog sketch follows.
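
As a toy illustration of the catalog capabilities above (a real deployment would use a catalog product such as DataHub or Amundsen; this only shows the shape, and all names are invented):

```python
# A toy in-memory data catalog: register, search, and owner-granted access
# without a central team in the loop.
CATALOG: dict[str, dict] = {}

def register(name: str, owner: str, tags: list[str], url: str) -> None:
    CATALOG[name] = {"owner": owner, "tags": tags, "url": url, "grants": set()}

def search(term: str) -> list[str]:
    return [n for n, meta in CATALOG.items() if term in n or term in meta["tags"]]

def grant_access(name: str, user: str) -> None:
    CATALOG[name]["grants"].add(user)  # approved by the data owner directly

register(
    "podcasts.released-episodes", "podcasts-team", ["audio", "content"],
    "https://data.example.com/podcasts/released-episodes/v2",
)
assert search("audio") == ["podcasts.released-episodes"]
```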

Step 3: Decentralize and implement DDD

Now we can start adding nodes to our data mesh.

  • Migrate ownership to the domain team creating the data, moving towards a decentralized architecture. Each team must own its data assets, ETL pipelines, quality, testing, etc.
  • Introduce federated governance for data standardization, security, and interoperability, by introducing DataOps practices and improving observability and self-service capabilities. This way you can unify your OLTP and OLAP processes and tooling.

Once you have created your first “data microservice”, repeat the above process breaking the legacy data monolith into more decentralized services.

5. Case Studies

5.1 Zalando

An excellent presentation by Max Schultze and Arif Wider about Zalando's analytics cloud journey. The presentation begins with their legacy analytics and how they managed to evolve it across the ingestion, storage, and serving layers.

Data Mesh in Practice: How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake by Max Schultze (Zalando)

5.2 Intuit

A great post from Tristan Baker about the strategy followed at Intuit. They have migrated from an on-premise architecture of centrally-managed analytics data sets and data infrastructure tools to a fully cloud-native set of data and tools. Tristan takes you through a full articulation of their vision, the inherent challenges, and the strategy for building better data-driven systems at Intuit.

Intuit’s Data Mesh Strategy

5.3 Saxo Bank

An outstanding post by Sheetal Pratik about the Saxo journey. It shows how they implemented the Data Mesh paradigm, focusing on the governance framework and the Data Mesh architecture, with very clear diagrams to explain their points.

Enabling Data Discovery in a Data Mesh: The Saxo Journey

5.4 JP Morgan Chase

An AWS blog co-authored with Anu Jain, Graham Person, and Paul Conroy from JP Morgan Chase.

They provide a blueprint for instantiating data lakes that implement the mesh architecture in a standardized way, using a defined set of cloud services to enable data sharing across the enterprise while giving data owners the control and visibility they need to manage their data effectively.

How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform

5.5 Kolibri Games

A presentation by António Fitas, Barr Moses, and Scott O'Leary. They talk about the evolution of the teams on the data/engineering side, the pain points of their setup at that point, how to evaluate whether data mesh is right for a company, and how to measure the return on investment of data mesh, especially the increase in agility and in the number of decisions you deem "data-driven".

Kolibri Games' Data Mesh Journey and Measuring Data Mesh ROI

5.6 Netflix

A presentation by Justin Cunningham covering, in depth, the Keystone platform's unique approach to declarative configuration and schema evolution, as well as Netflix's approach to unifying batch and streaming data processing.

Netflix Data Mesh: Composable Data Processing - Justin Cunningham

5.7 Adevinta

An excellent post by Xavier Gumara Rigol about how Adevinta evolved from a centralised approach to data for analytics to a data mesh by setting some working agreements.

Building a data mesh to support an ecosystem of data products at Adevinta

5.8 HelloFresh

In this blog post, Clemence W. Chee describes their journey of implementing the Data Mesh principles, showing the different phases they have gone through.

HelloFresh Journey to the Data Mesh

6. Conclusion

Data Mesh isn't a static platform, nor an architecture. Data Mesh is a product that is continuously evolving, and it may have different interpretations, but the core principles always remain the same: domain-driven design, decentralization, data ownership, automation, observability, and federated governance.

The most important aspect, as Zhamak Dehghani mentions, is to stop thinking about data as an asset, as something we want to keep and collect. Our way of imagining data should shift from an asset to a product. The moment we think about data as a product, we start delighting its consumers, shifting the perspective from a producer collecting data to a producer serving data.

Starting to build a Data Mesh can be overwhelming. First, we need to understand that this is an evolutionary path that starts from your own company vision and introduces the mesh principles slowly. We may start by selecting two or three source-aligned use cases, locating the domains by working backwards from the use case to the sources, and empowering those teams to start serving those data products. Then, think about the platform capabilities we need and put the platform team in place to start building this first-generation Data Mesh. Finally, iterate with new use cases, moving towards the Data Mesh vision.

7. Data Mesh Vocabulary

Data Mesh is a complex paradigm with many abstract terms. In the next table, I tried to extract the main vocabulary around this topic.

| Word | Definition |
| --- | --- |
| Data Mesh | A set of read-only products, designed for sharing data on the outside for non-real-time consumption/analytics. They are shared in an interoperable way so you can combine data from multiple domains, each owned by its domain team. |
| Data Domain | A business domain where experts analyze data and build reports themselves, with minimal IT support. A data domain should create and publish its data as a product for the rest of the business to consume as well. |
| Data Product | A collection of datasets concerning a certain topic which has arisen to fulfill a certain purpose, yet which can support multiple purposes or be used as a building block for multiple other data products. |
| Data Product Owner | A role accountable for the success of the domain’s data products in delivering value, satisfying and growing the data consumers, and defining the lifecycle of the data products. |
| Data Mesh Architect | A role responsible for the infrastructure team, with the big picture of the Data Mesh, ensuring the data is self-served and acting as the link between the infrastructure layer and the Federated Governance team. |
| Source-aligned domain data | Analytical data reflecting the business facts generated by the operational systems. This is also called a native data product. |
| Aggregate domain data | Analytical data that is an aggregate of multiple upstream domains. |
| Consumer-aligned domain data | Analytical data transformed to fit the needs of one or multiple specific use cases and consuming applications. This is also called fit-for-purpose domain data. |
| Data on the inside | The encapsulated private data contained within the service itself. As a sweeping statement, this is the data that has always been considered "normal"—at least in your database class in college. The classic data contained in a SQL database and manipulated by a typical application is inside data. |
| Data on the outside | The information that flows between independent services, including messages, files, and events. It's not your classic SQL data. |
| Architectural quantum | The smallest unit of architecture that can be independently deployed with high functional cohesion, and which includes all the structural elements required for its function. |
| Input ports | Input data ports to the data product. |
| Output ports | Output data ports from the data product. |

8. Discovery Roadmap

To elaborate this thesis, I have been collecting links and resources about the Data Mesh paradigm. I wanted to share the discovery path I followed to understand the Data Mesh's implications.

8.1 Zhamak Dehghani

How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

Data Mesh Principles and Logical Architecture

Introduction to Data Mesh: A Paradigm Shift in Analytical Data Management by Zhamak Dehghani (Part I)

How to Build a Foundation for Data Mesh: A Principled Approach by Zhamak Dehghani (Part II)

8.2 Introductory Content

Data Mesh Score calculator

Decentralizing Data: From Data Monolith to Data Mesh with Zhamak Dehghani, Creator of Data Mesh

Data Mesh: The Four Principles of the Distributed Architecture by Eugene Berko

How to achieve Data Mesh teams and culture?

8.3 Deep Dive

Anatomy of Data Products in Data Mesh

Building a Data Mesh: A beginners guide

Data Mesh Applied: Moving step-by-step from mono data lake to decentralized 21st-century data mesh.

There’s More Than One Kind of Data Mesh: Three Types of Data Meshes

Building a successful Data Mesh – More than just a technology initiative

How the **ck (heck) do you build a Data Mesh?

Data Mesh architecture patterns

Data Mesh: Topologies and domain granularity

Acknowledgements

I am so grateful to the Data Mesh Learning Community, founded and run by Scott Hirleman. I appreciate the help from the community, the meetups, and the resources published and organized there. It has been a great starting point for researching this thesis.

I also want to acknowledge all the authors of my sources for sharing their knowledge.

References and links

  • Dehghani, Z. Data Mesh: Delivering Data-Driven Value at Scale (1st ed.). O’Reilly.
  • Data Mesh Learning Community
  • What is a Data Mesh and How Not to Mesh it Up
  • More Resources like this here


What is a Data Mesh: Principles and Architecture

Defining Data Mesh

Rather than dwell on the definitions (Gartner counts at least three) of data mesh, I’ll go with my lay version:

data mesh [dey-tuh-mesh]

A decentralized architecture capability to solve the data swamp problem, reduce data analytics cost, and speed actionable insights to better enable data-informed business decisions. 

There, I said it without the buzzwords like “data democratization” or “paradigm shift.” Mea culpa for throwing in “actionable insight.” Let’s decompose what data mesh means for the real working class, our data engineers, architects, and scientists.

Why Use Data Mesh

Data swamps are losing their role as a centralized data platform

Data Swamp

In the mid-1990s, data warehousing was bursting onto the data management scene. Fueled by the hype of the fabled “beer and diapers” story, businesses were pouring tens of millions of dollars to build huge data monoliths to handle the consumption, storage, transformation, and output of data in one central system to answer business questions that required complex data analytics such as “who are the high-value customers most likely to buy X?” 

The thesis of data warehousing worked like a charm at the time. However, as the appetite for data analytics increased, so did the need for more data to be ingested. The complexity and pace of data pipelines soared (as did the nickname “data wranglers”). I began to see the cracks forming in the data warehouse theory and delved into its growing failure to get value from analytical data in my master’s research.

As social media and the iPhone became the norm, many turned to a second generation of data analytics architecture called data lakes. While traditional data warehouses used an Extract-Transform-Load (ETL) process to ingest data, data lakes instead rely on an Extract-Load-Transform (ELT) process that puts data into cheap BLOB storage. This eliminated the big shortcomings of data warehouses but spurred the “Let’s just collect everything” theology of data swamps. 

Further learning: Data Lake Acceleration

Is Data Mesh Right For Your Organization?

The data swamp problem goes beyond unmanaged/inaccessible data lakes

Fast forward to today. Only 32% of companies are realizing tangible and measurable value from data (“trapped value”), according to a study by Accenture. The roaring demand for “discovery or iterative style analytics” (where consumers don’t really know the questions or data they need) is raising access to data to a whole new level with new/expanding data sources (or “wide data”) across multi- and hybrid cloud environments, thrusting massive friction onto traditional data lakes and warehouses.

Nowhere is this pain more visible than among data and analytics teams where:

  • Data scientists consider themselves 40% a vacuum, 40% a janitor, and 20% a fortune-teller (Towards Data Science)
  • 78% of data engineers wish their job came with a therapist to help with work-related stress (survey commissioned by data.world and DataKitchen)
  • 65% of large, data-intensive firms have a CDO or CAO, but the average tenure is just 2.5 years (Harvard Business Review)

Data Mesh Principles

A data mesh aims to create an architectural foundation for getting value from analytical data and historical facts at scale – scale being applied to the constant change of data landscape, proliferation of data and analytics demand, diversity of transformation and processing that use cases require, and speed of response to change. 

To achieve this objective, most experts agree that the thesis of data mesh is based on four precepts:

  • Decentralize the ownership of analytical data to business domains closest to the source of the data or its main consumers. This removes the need for authoritarian bottlenecks of data teams, warehouses, and lake architecture, scaling out data access, consumers, and use cases. 
  • Make access and use of data products easy and self-service . This removes the friction of data sharing, from source to consumption, streamlining the experience of data users to discover, access, and use data products for their use cases. 
  • Federate the governance of data based on an appropriate operating model that balances decision-making and accountability. This precept builds on domain-oriented ownership and data as a product.
  • Manage data as a product and a development methodology. Consider how data teams can create value in their organizations; think features like discoverability, trustworthiness, reusability, and value. This facilitates the sharing of data with users, regardless of silos.

Data mesh: Common data product modeling, integration, and cataloging

Data mesh inverts the traditional data warehouse/lake ideology by transforming data gatekeepers into data liberators to give every “data citizen” (not just the data scientist, engineer, or analyst) easy access and the ability to work with data comfortably, regardless of their technical know-how, to reduce the cost of data analytics and speed time to insight.

However, the rise of data mesh does not mean the fall of data lakes; rather, they are complementary architectures. A data lake is a good solution for storing vast amounts of data in a centralized location. And data mesh is the best solution for fast data retrieval, integration, and analytics. In a nutshell, think of a data mesh as connective tissue to data lakes and/or other sources. 

Introduced by Thoughtworks, data mesh is “a shift in modern distributed architecture that applies platform thinking to create self-serve data infrastructure, treating data as the product.” It is a data and analytics platform where data can remain within different databases, rather than being consolidated into a single data lake.

Like a data fabric architecture, a data mesh architecture comprises four layers. To provide useful information to data and analytics professionals, I’ll break the data mesh rule by also talking about specific technologies and representative vendors.

  • Storage: This is where much of your data lives once it’s ingested and organized from OLTP databases, data lakes, data warehouses, graph databases, and various files. The composite of analytic systems equates to your central data platform. Representative vendors are Snowflake, Databricks, and Google.
  • Analytics: This layer is responsible for delivering the processed data to end-users, including business intelligence analysts, data scientists, and business stakeholders who consume the data with the help of reports and analytic models. Representative vendors are Tableau, PowerBI, and SAS.
  • Governance: This is where processes, roles, policies, standards, and metrics ensure the effective and efficient use of your data. Data catalogs define who can take what action, upon what data, in what situations, and using what methods. Representative vendors are Alation, Collibra, and Promethium.
  • Semantic: This is the virtual “connective tissue” of the data mesh. Knowledge graphs (not to be confused with graph databases) connect any data into a canonical data model, harmonize it into real-world business meaning and relationships, and enable self-service data exploration and discovery.

VentureBeat says a data mesh architecture “connects various data sources (including data lakes) into a coherent infrastructure, where all data is accessible if you have the right authority to access it.” This doesn’t mean there is one big, hairy data warehouse or lake (see data swamp problem) — the laws of physics and the demand for analytics mean that large, disparate data sets can’t just be joined together over huge distances with decent performance. Not to mention the costs of moving, transforming, and maintaining the data (and ETL | ELT).

The Enterprise Knowledge Graph and how it enables a data mesh

Enter the semantic layer. A semantic layer represents a network of real-world entities— i.e., objects, events, situations, or concepts — and illustrates the relationships needed to answer complex cross-domain questions; these can be shared and re-used based on fine-grained data access policies and business rules. It comprises three layers (plus a knowledge catalog that interfaces with governance tools), sketched in code after this list:

  • Business meaning: This is a business representation of data. It enables users to quickly discover and access data using standard search terms — like customer, recent purchase, and prospect. Data can be shared and reused through a common, standards-based vocabulary.
  • Data storytelling (inferencing): Creates new relationships by interpreting your source data against your data model. By expressing all explicit and inferred relationships and connections between your data sources, you create a richer, more accurate view of your data and cut down on data preparation.
  • Virtualization: Provides an alternative to costly, slow ETL integration and permanent transformation of source data. Virtual graphs allow you to leave data where it is and bring it together at query time to reflect the latest changes, which scales analytics use cases and users at minimal cost and reduces data latency.
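
A tiny rdflib sketch of the idea, with an invented vocabulary: facts that could come from several virtualized sources are linked to business terms, and one SPARQL query answers a cross-domain question over the combined graph.

```python
# Semantic layer in miniature: triples link data to business meaning, and a
# SPARQL query spans what could be several virtualized sources. The vocabulary
# URIs and facts are invented for illustration.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.com/vocab/")
g = Graph()
g.add((EX.alice, RDF.type, EX.Customer))
g.add((EX.alice, EX.purchased, EX.sku123))
g.add((EX.sku123, EX.category, Literal("podcast-subscription")))

# "Which customers bought anything in this category?"
query = """
SELECT ?customer WHERE {
    ?customer a ex:Customer ;
              ex:purchased ?item .
    ?item ex:category "podcast-subscription" .
}
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.customer)  # -> http://example.com/vocab/alice
```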

The big payoff of a semantic layer is a better way to enable self-service, federated queries, letting you build and deploy analytics data products quickly and efficiently.

Further learning: How Knowledge Graphs Work

Data Mesh Benefits

Generally, organizations handling and analyzing a large number of data sources should seriously consider evolving to a data mesh architecture. Other signals include a high number of data domains and functional teams that demand data products (especially advanced analytics such as predictive and simulation modeling), frequent data pipeline bottlenecks, and a priority on data governance.

Stardog commissioned Forrester Consulting to interview four decision-makers with experience implementing Stardog. For this commissioned study, Forrester aggregated the interviewees’ experiences and combined the results into a single composite organization. The key findings were that the composite organization using the Stardog Enterprise Knowledge Graph platform realized the following over three years:

  • 320 percent return on investment
  • $4.7 million total data scientist productivity improvement
  • $2.6 million in infrastructure savings from avoided copying and moving data
  • $2.4 million in incremental profit from enhanced quantity, quality, and speed of insights 


Data Mesh Principles and Logical Architecture

Our aspiration to augment and improve every aspect of business and life with data, demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of volume of data and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure.

03 December 2020

Zhamak is the director of emerging technologies at Thoughtworks North America with focus on distributed systems architecture and a deep passion for decentralized solutions. She is a member of Thoughtworks Technology Advisory Board and contributes to the creation of Thoughtworks Technology Radar.

Contents:

  • Core principles and logical architecture of data mesh
  • Logical architecture: domain-oriented data and compute
  • Logical architecture: data product the architectural quantum
  • Logical architecture: a multi-plane data platform
  • Logical architecture: computational policies embedded in the mesh
  • Principles summary and the high level logical architecture

For more on Data Mesh, Zhamak went on to write a full book that covers more details on strategy, implementation, and organizational design.

The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and organizational challenges in order to become data-driven, use data to compete, or use data at scale to drive value. It offered an alternative perspective which since has captured many organizations’ attention, and given hope for a different future. While the original writeup describes the approach, it leaves many details of the design and implementation to one’s imagination. I have no intention of being too prescriptive in this article, and kill the imagination and creativity around data mesh implementation. However I think it’s only responsible to clarify the architectural aspects of data mesh as a stepping stone to move the paradigm forward.

This article is written with the intention of a follow up. It summarizes the data mesh approach by enumerating its underpinning principles, and the high level logical architecture that the principles drive. Establishing the high level logical model is a necessary foundation before I dive into detailed architecture of data mesh core components in future articles. Hence, if you are in search of a prescription around exact tools and recipes for data mesh, this article may disappoint you. If you are seeking a simple and technology-agnostic model that establishes a common language, come along.

The great divide of data

What do we really mean by data? The answer depends on whom you ask. Today’s landscape is divided into operational data and analytical data. Operational data sits in databases behind business capabilities served with microservices, has a transactional nature, keeps the current state and serves the needs of the applications running the business. Analytical data is a temporal and aggregated view of the facts of the business over time, often modeled to provide retrospective or future-perspective insights; it trains the ML models or feeds the analytical reports.

The current state of technology, architecture and organization design is reflective of the divergence of these two data planes - two levels of existence, integrated yet separate. This divergence has led to a fragile architecture. Continuously failing ETL (Extract, Transform, Load) jobs and the ever-growing complexity of a labyrinth of data pipelines are a familiar sight to many who attempt to connect these two planes, flowing data from the operational data plane to the analytical plane, and back to the operational plane.

Figure 1: The great divide of data

Analytical data plane itself has diverged into two main architectures and technology stacks: data lake and data warehouse; with data lake supporting data science access patterns, and data warehouse supporting analytical and business intelligence reporting access patterns. For this conversation, I put aside the dance between the two technology stacks: data warehouse attempting to onboard data science workflows and data lake attempting to serve data analysts and business intelligence. The original writeup on data mesh explores the challenges of the existing analytical data plane architecture.

Figure 2: Further divide of analytical data - warehouse

Figure 3: Further divide of analytical data - lake

Data mesh recognizes and respects the differences between these two planes: the nature and topology of the data, the differing use cases, individual personas of data consumers, and ultimately their diverse access patterns. However it attempts to connect these two planes under a different structure - an inverted model and topology based on domains and not technology stack - with a focus on the analytical data plane. Differences in today's available technology to manage the two archetypes of data should not lead to separation of organization, teams and people who work on them. In my opinion, the operational and transactional data technology and topology is relatively mature, and driven largely by the microservices architecture; data is hidden on the inside of each microservice, controlled and accessed through the microservice’s APIs. Yes there is room for innovation to truly achieve multi-cloud-native operational database solutions, but from the architectural perspective it meets the needs of the business. However it’s the management and access to the analytical data that remains a point of friction at scale. This is where data mesh focuses.

I do believe that at some point in the future our technologies will evolve to bring these two planes even closer together, but for now, I suggest we keep their concerns separate.

The data mesh objective is to create a foundation for getting value from analytical data and historical facts at scale - scale being applied to constant change of data landscape, proliferation of both sources of data and consumers, diversity of transformation and processing that use cases require, and speed of response to change. To achieve this objective, I suggest that there are four underpinning principles that any data mesh implementation embodies to achieve the promise of scale, while delivering the quality and integrity guarantees needed to make data usable: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.

While I expect the practices, technologies and implementations of these principles vary and mature over time, these principles remain unchanged.

I have intended for the four principles to be collectively necessary and sufficient; to enable scale with resiliency while addressing concerns around siloing of incompatible data or increased cost of operation. Let's dive into each principle and then design the conceptual architecture that supports it.

Domain Ownership

Data mesh, at core, is founded in decentralization and distribution of responsibility to people who are closest to the data in order to support continuous change and scalability. The question is, how do we decompose and decentralize the components of the data ecosystem and their ownership. The components here are made of analytical data, its metadata, and the computation necessary to serve it.

Data mesh follows the seams of organizational units as the axis of decomposition. Our organizations today are decomposed based on their business domains. Such decomposition localizes the impact of continuous change and evolution - for the most part - to the domain’s bounded context. Hence, making the business domain’s bounded context a good candidate for distribution of data ownership.

In this article, I will continue to use the same use case as the original writeup, ‘a digital media company’. One can imagine that the media company divides its operation, hence the systems and teams that support the operation, based on domains such as ‘podcasts’, teams and systems that manage podcast publication and their hosts; ‘artists’, teams and systems that manage onboarding and paying artists, and so on. Data mesh argues that the ownership and serving of the analytical data should respect these domains. For example, the teams who manage ‘podcasts’, while providing APIs for releasing podcasts, should also be responsible for providing historical data that represents ‘released podcasts’ over time with other facts such as ‘listenership’ over time. For a deeper dive into this principle see Domain-oriented data decomposition and ownership.

To promote such decomposition, we need to model an architecture that arranges the analytical data by domains. In this architecture, the domain’s interface to the rest of the organization not only includes the operational capabilities but also access to the analytical data that the domain serves. For example, ‘podcasts’ domain provides operational APIs to ‘create a new podcast episode’ but also an analytical data endpoint for retrieving ‘all podcast episodes data over the last <n> months’. This implies that the architecture must remove any friction or coupling to let domains serve their analytical data and release the code that computes the data, independently of other domains. To scale, the architecture must support autonomy of the domain teams with regard to the release and deployment of their operational or analytical data systems.

The following example demonstrates the principle of domain oriented data ownership. The diagrams are only logical representations and exemplary. They aren't intended to be complete.

Each domain can expose one or many operational APIs, as well as one or many analytical data endpoints

Figure 4: Notation: domain, its analytical data and operational capabilities

Naturally, each domain can have dependencies to other domains' operational and analytical data endpoints. In the following example, 'podcasts' domain consumes analytical data of 'users updates' from the 'users' domain, so that it can provide a picture of the demographic of podcast listeners through its 'Podcast listeners demographic' dataset.

Figure 5: Example: domain oriented ownership of analytical data in addition to operational capabilities

Note: In the example, I have used an imperative language for accessing the operational data or capabilities, such as 'Pay artists'. This is simply to emphasize the difference between the intention of accessing operational data vs. analytical data. I do recognize that in practice operational APIs are implemented through a more declarative interface such as accessing a RESTful resource or a GraphQL query.
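
To make the dual interface concrete, here is a hedged FastAPI sketch of the ‘podcasts’ domain exposing both an operational capability and an analytical data endpoint; the paths, payloads, and in-memory store are my own inventions, not from the writeup.

```python
# One domain, two kinds of interface: an operational endpoint to create a
# podcast episode, and an analytical endpoint serving historical episode data.
from fastapi import FastAPI

app = FastAPI(title="podcasts domain")

EPISODES: list[dict] = []  # stand-in for the domain's operational store

@app.post("/episodes")            # operational capability: release an episode
def create_episode(episode: dict) -> dict:
    EPISODES.append(episode)
    return {"status": "created"}

@app.get("/analytics/episodes")   # analytical endpoint: episodes over time
def episodes_over_time(months: int = 12) -> list[dict]:
    # A real data product would serve its published analytical dataset here,
    # not read the operational store directly; the time-window filtering by
    # 'months' is left out for brevity.
    return EPISODES
```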

Data as a product

One of the challenges of existing analytical data architectures is the high friction and cost of discovering, understanding, trusting, and ultimately using quality data. If not addressed, this problem only exacerbates with data mesh, as the number of places and teams who provide data - domains - increases. This would be the consequence of our first principle of decentralization. Data as a product principle is designed to address the data quality and age-old data silos problem; or as Gartner calls it dark data - “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes”. Analytical data provided by the domains must be treated as a product, and the consumers of that data should be treated as customers - happy and delighted customers.

The original article enumerates a list of capabilities, including discoverability, security, explorability, understandability, trustworthiness, etc., that a data mesh implementation should support for a domain data to be considered a product. It also details the roles, such as domain data product owner, that organizations must introduce, responsible for the objective measures that ensure data is delivered as a product. These measures include data quality, decreased lead time of data consumption, and in general data user satisfaction through net promoter score. Domain data product owners must have a deep understanding of who the data users are, how they use the data, and which native methods they are comfortable with for consuming it. Such intimate knowledge of data users results in the design of data product interfaces that meet their needs. In reality, for the majority of data products on the mesh, there are a few conventional personas with their unique tooling and expectations: data analysts and data scientists. All data products can develop standardized interfaces to support them. The conversation between users of the data and product owners is a necessary piece for establishing the interfaces of data products.

Each domain will include data product developer roles, responsible for building, maintaining and serving the domain's data products. Data product developers will be working alongside other developers in the domain. Each domain team may serve one or multiple data products. It’s also possible to form new teams to serve data products that don’t naturally fit into an existing operational domain.

Note: this is an inverted model of responsibility compared to past paradigms. The accountability of data quality shifts upstream as close to the source of the data as possible.

Architecturally, to support data as a product that domains can autonomously serve or consume, data mesh introduces the concept of data product as its architectural quantum. Architectural quantum, as defined by Evolutionary Architecture, is the smallest unit of architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function.

Data product is the node on the mesh that encapsulates three structural components required for its function, providing access to the domain's analytical data as a product.

  • Code: it includes (a) code for data pipelines responsible for consuming, transforming and serving upstream data - data received from the domain’s operational system or an upstream data product; (b) code for APIs that provide access to data, semantic and syntax schema, observability metrics and other metadata; (c) code for enforcing traits such as access control policies, compliance, provenance, etc.
  • Data and Metadata: well, that’s what we are all here for; the underlying analytical and historical data in a polyglot form. Depending on the nature of the domain data and its consumption models, data can be served as events, batch files, relational tables, graphs, etc., while maintaining the same semantic. For data to be usable, there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc.: metadata that is intrinsic to the data, e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior, e.g. access control policies.
  • Infrastructure: The infrastructure component enables building, deploying and running the data product's code, as well as storage and access to big data and metadata.

Figure 6: Data product components as one architectural quantum
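To make this composition more tangible, here is a minimal sketch of a data product bundled as one deployable unit. It is only an illustration under assumed conventions; every name, field and address in it is invented, not part of the original model.

```python
from dataclasses import dataclass

# Hypothetical sketch: the three structural components of a data product,
# bundled and deployed together as one architectural quantum. All names,
# fields and addresses below are invented for illustration.

@dataclass
class CodeComponent:
    pipeline_entrypoint: str      # consumes, transforms and serves upstream data
    api_endpoints: list[str]      # serve data, schema, metrics, metadata
    policy_modules: list[str]     # access control, compliance, provenance

@dataclass
class DataAndMetadata:
    output_ports: dict[str, str]  # polyglot access to the same semantic data
    semantic_schema: str          # declared meaning and syntax of the data
    quality_metrics: dict[str, float]

@dataclass
class Infrastructure:
    storage: str                  # where the data and metadata live
    compute: str                  # where the pipeline and API code run

@dataclass
class DataProduct:
    """One node on the mesh: code, data/metadata and infrastructure,
    versioned and deployed as a single unit."""
    domain: str
    name: str
    code: CodeComponent
    data: DataAndMetadata
    infra: Infrastructure

podcast_audienceship = DataProduct(
    domain="podcasts",
    name="podcast-audienceship",
    code=CodeComponent(
        pipeline_entrypoint="pipelines/audienceship.py",
        api_endpoints=["/data", "/schema", "/metrics"],
        policy_modules=["access_control", "provenance"],
    ),
    data=DataAndMetadata(
        output_ports={"events": "kafka://podcasts.audienceship",
                      "batch": "s3://mesh/podcasts/audienceship/"},
        semantic_schema="audienceship.schema.json",
        quality_metrics={"completeness": 0.99, "timeliness_hours": 1.0},
    ),
    infra=Infrastructure(storage="object-store", compute="spark-cluster"),
)
```

The point of the sketch is simply that the unit of versioning and deployment is the whole bundle - pipeline code, serving interfaces, data, metadata and infrastructure declarations - rather than a pipeline or a dataset in isolation.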

The following example builds on the previous section, demonstrating the data product as the architectural quantum. The diagram only includes sample content and is not intended to be complete or include all design and implementation details. While this is still a logical representation, it is getting closer to the physical implementation.

Figure 7: Notation: domain, its (analytical) data product and operational system

Figure 8: Data products serving the domain-oriented analytical data

Note: the data mesh model differs from past paradigms, where pipelines (code) are managed as components independent from the data they produce, and where infrastructure, like an instance of a warehouse or a lake storage account, is often shared among many datasets. A data product is a composition of all components - code, data and infrastructure - at the granularity of a domain's bounded context.

Self-serve data platform

As you can imagine, to build, deploy, execute, monitor, and access a humble hexagon - a data product - there is a fair bit of infrastructure that needs to be provisioned and run; the skills needed to provision this infrastructure are specialized and would be difficult to replicate in each domain. Most importantly, the only way that teams can autonomously own their data products is to have access to a high-level abstraction of infrastructure that removes complexity and friction of provisioning and managing the lifecycle of data products. This calls for a new principle, Self-serve data infrastructure as a platform to enable domain autonomy .

The data platform can be considered an extension of the delivery platform that already exists to run and monitor services. However, the underlying technology stack to operate data products today looks very different from the delivery platform for services. This is simply due to the divergence of big data technology stacks from operational platforms. For example, domain teams might be deploying their services as Docker containers, and the delivery platform uses Kubernetes for their orchestration; however, the neighboring data product might be running its pipeline code as Spark jobs on a Databricks cluster. That requires provisioning and connecting two very different sets of infrastructure that, prior to data mesh, did not require this level of interoperability and interconnectivity. My personal hope is that we start seeing a convergence of operational and data infrastructure where it makes sense - for example, running Spark on the same orchestration system, e.g. Kubernetes.

In reality, to make analytical data product development accessible to generalist developers - the existing profile of developers that domains have - the self-serve platform needs to provide a new category of tools and interfaces in addition to simplifying provisioning. A self-serve data platform must provide tooling that supports a domain data product developer's workflow of creating, maintaining and running data products with less specialized knowledge than existing technologies assume; self-serve infrastructure must include capabilities to lower the current cost and specialization needed to build data products. The original writeup includes a list of capabilities that a self-serve data platform provides, including access to scalable polyglot data storage, data product schemas, data pipeline declaration and orchestration, data product lineage, compute and data locality, etc.
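As a rough sketch of what such a high-level interface might look like, the snippet below models a few of those capabilities as one platform API. The method names and signatures are invented for illustration; they are not drawn from the original writeup or from any real platform.

```python
from typing import Protocol

# A minimal sketch of a self-serve data platform interface, assuming
# invented method names; real platforms expose far richer contracts.
class SelfServeDataPlatform(Protocol):
    def provision_storage(self, product: str, formats: list[str]) -> str:
        """Allocate scalable polyglot storage (e.g. event streams and batch files)."""
        ...

    def register_schema(self, product: str, schema: str) -> None:
        """Publish the product's semantic and syntax schema for consumers."""
        ...

    def declare_pipeline(self, product: str, entrypoint: str, schedule: str) -> None:
        """Declare pipeline code; the platform owns orchestration and execution."""
        ...

    def lineage(self, product: str) -> list[str]:
        """Report the upstream data products this product consumes."""
        ...
```

The intent is that a generalist developer works against operations at roughly this level, while the specialized work of clusters, storage accounts and access systems stays behind the interface.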

The self-serve platform capabilities fall into multiple categories, or planes, as they are called in the model. Note: a plane is representative of a level of existence - integrated yet separate. Similar to physical and consciousness planes, or control and data planes in networking. A plane is not a layer, nor does it imply a strong hierarchical access model.

Figure 9: Notation: A platform plane that provides a number of related capabilities through self-serve interfaces

A self-serve platform can have multiple planes, each serving a different profile of users. The following example lists three different data platform planes (a toy code sketch follows the list):

  • Data infrastructure provisioning plane : supports the provisioning of the underlying infrastructure required to run the components of a data product and the mesh of products. This includes provisioning of distributed file storage, storage accounts, access control management systems, the orchestration to run data products' internal code, provisioning of a distributed query engine on a graph of data products, etc. I would expect that either other data platform planes or only advanced data product developers use this interface directly. This is a fairly low-level data infrastructure lifecycle management plane.
  • Data product developer experience plane : this is the main interface that a typical data product developer uses. This interface abstracts many of the complexities involved in supporting the workflow of a data product developer. It provides a higher level of abstraction than the 'provisioning plane'. It uses simple declarative interfaces to manage the lifecycle of a data product. It automatically implements the cross-cutting concerns that are defined as a set of standards and global conventions, applied to all data products and their interfaces.
  • Data mesh supervision plane : there is a set of capabilities that are best provided at the mesh level - a graph of connected data products - globally. While the implementation of each of these interfaces might rely on individual data products' capabilities, it's more convenient to provide these capabilities at the level of the mesh. For example, the ability to discover data products for a particular use case is best provided by searching or browsing the mesh of data products; and correlating multiple data products to create a higher-order insight is best provided through execution of a data semantic query that can operate across multiple data products on the mesh.
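The toy sketch below shows one way the planes could relate: a developer hands a simple declaration to the developer experience plane, which resolves it into low-level provisioning requests, while the supervision plane serves mesh-wide discovery. Every class, field and resource name here is invented for illustration.

```python
# Toy illustration of the three planes cooperating; every name is invented.

class ProvisioningPlane:
    """Low-level data infrastructure lifecycle management."""
    def provision(self, resource: str) -> str:
        return f"provisioned:{resource}"

class DeveloperExperiencePlane:
    """Declarative interface used by a typical data product developer."""
    def __init__(self, provisioning: ProvisioningPlane):
        self.provisioning = provisioning

    def deploy_data_product(self, declaration: dict) -> list[str]:
        # The developer declares intent; this plane expands it into
        # concrete infrastructure requests and applies global conventions.
        resources = ["storage", "orchestration", "access-control"]
        return [self.provisioning.provision(f"{declaration['name']}/{r}")
                for r in resources]

class MeshSupervisionPlane:
    """Mesh-level capabilities such as discovering data products."""
    def __init__(self):
        self.catalog: dict[str, dict] = {}

    def register(self, declaration: dict) -> None:
        self.catalog[declaration["name"]] = declaration

    def discover(self, keyword: str) -> list[str]:
        return [name for name, decl in self.catalog.items()
                if keyword in decl.get("description", "")]

dx = DeveloperExperiencePlane(ProvisioningPlane())
mesh = MeshSupervisionPlane()
declaration = {"name": "podcast-audienceship",
               "description": "who listens to which podcasts"}
dx.deploy_data_product(declaration)
mesh.register(declaration)
print(mesh.discover("podcasts"))  # ['podcast-audienceship']
```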

The following model is only exemplary and is not intended to be complete. While a hierarchy of planes is desirable, no strict layering is implied below.

Figure 10: Multiple planes of self-serve data platform *DP stands for a data product

Federated computational governance

As you can see, data mesh follows a distributed system architecture: a collection of independent data products with independent lifecycles, built and deployed by likely independent teams. However, for the majority of use cases, to get value in the form of higher-order datasets, insights or machine intelligence, these independent data products need to interoperate; we must be able to correlate them, create unions, find intersections, or perform other graph or set operations on them at scale. For any of these operations to be possible, a data mesh implementation requires a governance model that embraces decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology and, most importantly, automated execution of decisions by the platform. I call this a federated computational governance: a decision-making model led by the federation of domain data product owners and data platform product owners, with autonomy and domain-local decision-making power, while creating and adhering to a set of global rules - rules applied to all data products and their interfaces - to ensure a healthy and interoperable ecosystem. The group has a difficult job: maintaining an equilibrium between centralization and decentralization; which decisions need to be localized to each domain and which decisions should be made globally for all domains. Ultimately, global decisions have one purpose: creating interoperability and a compounding network effect through discovery and composition of data products.
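The 'computational' part deserves a concrete illustration: global rules are not policy documents but code that the platform executes against every data product. Below is a minimal sketch under that assumption; the policy names and product fields are invented, not taken from any real implementation.

```python
# Hypothetical global rules, agreed by the federated governance group
# and executed automatically by the platform for every data product.
GLOBAL_POLICIES = {
    "has_owner": lambda product: bool(product.get("owner")),
    "schema_published": lambda product: bool(product.get("schema")),
    "pii_fields_classified": lambda product: "pii" in product.get("classifications", {}),
}

def violations(product: dict) -> list[str]:
    """Return the names of the global policies a data product violates."""
    return [name for name, check in GLOBAL_POLICIES.items() if not check(product)]

product = {
    "name": "podcast-audienceship",
    "owner": "podcasts-team",
    "schema": "audienceship.schema.json",
    "classifications": {"pii": ["listener_id"]},
}
assert violations(product) == []  # a non-empty result would block deployment
```

Domain-local concerns, such as how 'audienceship' is modeled, never appear in this global list; only the cross-cutting rules do.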

The priorities of the governance in data mesh are different from traditional governance of analytical data management systems. While they both ultimately set out to get value from data, traditional data governance attempts to achieve that through centralization of decision making, and establishing global canonical representation of data with minimal support for change. Data mesh's federated computational governance, in contrast, embraces change and multiple interpretive contexts.

Placing a system in a straitjacket of constancy can cause fragility to evolve. -- C.S. Holling, ecologist

A supportive organizational structure, incentive model and architecture are necessary for the federated governance model to function: to arrive at global decisions and standards for interoperability, while respecting the autonomy of local domains, and to implement global policies effectively.

Figure 11: Notation: federated computational governance model

As mentioned earlier, striking a balance between what shall be standardized globally, implemented and enforced by the platform for all domains and their data products, and what shall be left to the domains to decide, is an art. For instance, the domain data model is a concern that should be localized to the domain most intimately familiar with it. For example, how the semantic and syntax of the 'podcast audienceship' data model is defined must be left to the 'podcast domain' team. In contrast, the decision around how to identify a 'podcast listener' is a global concern. A podcast listener is a member of the population of 'users' - its upstream bounded context - who can cross the boundary of domains and be found in other domains such as 'users play streams'. The unified identification allows correlating information about 'users' who are both 'podcast listeners' and 'stream listeners'.
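A toy example, with made-up records, shows why the globally standardized identifier matters: because both domains identify users the same way, their otherwise independent data products can be correlated without either domain giving up its local model.

```python
# Made-up records from two independent data products. The only globally
# standardized element is the user identifier; everything else is local.
podcast_listeners = [
    {"user_id": "u1", "podcast": "tech-weekly"},
    {"user_id": "u2", "podcast": "data-talk"},
]
play_streams = [
    {"user_id": "u1", "minutes_streamed": 320},
    {"user_id": "u3", "minutes_streamed": 45},
]

# Correlate the two products on the shared identification scheme.
listener_ids = {row["user_id"] for row in podcast_listeners}
listeners_who_stream = [row for row in play_streams
                        if row["user_id"] in listener_ids]
print(listeners_who_stream)  # [{'user_id': 'u1', 'minutes_streamed': 320}]
```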

The following is an example of the elements involved in the data mesh governance model. It is not comprehensive, and is only demonstrative of concerns relevant at the global level.

Figure 12: Example of elements of a federated computational governance: teams, incentives, automated implementation, and globally standardized aspects of data mesh

Many practices of pre-data-mesh governance, as a centralized function, are no longer applicable to the data mesh paradigm. For example, the past emphasis on certification of golden datasets - datasets that have gone through a centralized process of quality control and certification and been marked as trustworthy - as a central function of governance is no longer relevant. This stemmed from the fact that in previous data management paradigms, data - in whatever quality and format - was extracted from operational domains' databases and centrally stored in a warehouse or a lake, which then required a centralized team to apply cleansing, harmonization and encryption processes to it, often under the custodianship of a centralized governance group. Data mesh completely decentralizes this concern. A domain dataset only becomes a data product after it locally, within the domain, goes through the process of quality assurance according to the expected data product quality metrics and the global standardization rules. The domain data product owners are best placed to decide how to measure their domain's data quality, knowing the details of the domain operations producing the data in the first place. Despite such localized decision making and autonomy, they need to comply with the modeling of quality and specification of SLOs based on a global standard, defined by the global federated governance team, and automated by the platform.
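That division of labor can be sketched in a few lines: the federated governance group fixes the shape of a quality specification and the platform validates it, while each domain supplies its own measurements. The field names below are invented for the sake of the example.

```python
# Globally standardized SLO shape, defined by the federated governance
# group; the field names here are invented for illustration.
REQUIRED_SLO_FIELDS = {"completeness", "timeliness_hours", "accuracy"}

def slo_conforms(slo: dict) -> bool:
    """Platform-automated check: does the domain's SLO declaration
    carry every globally required field?"""
    return REQUIRED_SLO_FIELDS <= slo.keys()

# The podcast domain decides locally how to measure its own quality...
podcast_slo = {"completeness": 0.99, "timeliness_hours": 1, "accuracy": 0.97}
# ...but must express the result in the globally agreed shape.
assert slo_conforms(podcast_slo)
```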

The following table shows the contrast between centralized (data lake, data warehouse) model of data governance, and data mesh.

| Pre-data-mesh governance aspect | Data mesh governance aspect |
| --- | --- |
| Centralized team | Federated team |
| Responsible for data quality | Responsible for defining how to model what constitutes quality |
| Responsible for data security | Responsible for defining aspects of data security, i.e. data sensitivity levels, for the platform to build in and monitor automatically |
| Responsible for complying with regulation | Responsible for defining the regulation requirements for the platform to build in and monitor automatically |
| Centralized custodianship of data | Federated custodianship of data by domains |
| Responsible for global canonical data modeling | Responsible for modeling the data elements that cross the boundaries of multiple domains |
| Team is independent from domains | Team is made of domain representatives |
| Aiming for a well-defined static structure of data | Aiming to enable effective mesh operation, embracing a continuously changing and dynamic topology of the mesh |
| Centralized technology used by monolithic lake/warehouse | Self-serve platform technologies used by each domain |
| Measures success based on the number or volume of governed data (tables) | Measures success based on the network effect: the connections representing the consumption of data on the mesh |
| Manual processes with human intervention | Automated processes implemented by the platform |
| Prevent error | Detect error and recover through the platform's automated processing |

Let's bring it all together. We discussed four principles underpinning data mesh:

  • Domain-oriented decentralized data ownership and architecture , so that the ecosystem creating and consuming data can scale out as the number of data sources, number of use cases, and diversity of access models to the data increases; simply increase the autonomous nodes on the mesh.
  • Data as a product , so that data users can easily discover, understand and securely use high-quality data with a delightful experience; data that is distributed across many domains.
  • Self-serve data infrastructure as a platform , so that the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products.
  • Federated computational governance , so that data users can get value from the aggregation and correlation of independent data products - the mesh behaves as an ecosystem following global interoperability standards; standards that are baked computationally into the platform.

These principles drive a logical architectural model that, while bringing analytical data and operational data closer together under the same domain, respects their underpinning technical differences. Such differences include where the analytical data might be hosted, different compute technologies for processing operational vs. analytical services, different ways of querying and accessing the data, etc.

Figure 13: Logical architecture of data mesh approach

I hope that by this point we have established a common language and a logical mental model that we can collectively take forward to detail the blueprint of the components of the mesh, such as the data product, the platform, and the required standardizations.

Acknowledgments

I am grateful to Martin Fowler for helping me refine the narrative and structure of this article, and for hosting it.

Special thanks to many ThoughtWorkers who have been helping create and distill the ideas in this article through client implementations and workshops.

Also thanks to the following early reviewers who provided invaluable feedback: Chris Ford, David Colls and Pramod Sadalage.

03 December 2020: Published


Data mesh: Real examples and lessons learned

Data is changing. Are you keeping up?

Data and how we use it are constantly evolving in today's fast-paced world. And as we continue to rely on data accessibility to drive growth, it will only become more complicated to manage. Centralized data platforms have long served as the foundation of modern business intelligence and analytics and, in most cases, continue to deliver meaningful business value. But, like all foundations, over time, cracks begin to show. Data solutions are now bursting at the seams as the number and diversity of data sources and use cases become too complicated to manage with a traditional, centralized approach. Moreover, this rapidly increasing demand for business intelligence and analytics is inadvertently creating insight bottlenecks, preventing the delivery of deep, valuable insights. And truth be told, it's a big ask to address the above — it takes a cultural shift in ways of working that truly sets organizations on a path of readiness to innovate and leverage their data at a much faster pace.

So, what are some consequences of not addressing data issues? The accidental creation of data latency generates delays and a lack of access to the correct information, leading to the use of rogue data repositories and shadow BI solutions. Regulatory requirements surrounding data are becoming increasingly complex, and all who work with data must comply. Dependence on tribal knowledge generates stagnation in innovation and ideas. The list can go on and on. So, what does it take to not only avoid the problems previously mentioned but to thrive and grow in an ever-changing data landscape? How can organizations move forward when the path ahead can appear unclear and confusing? We feel the paramount solution to the change and potential problems in today's data landscape is Data Mesh. Here are some brief examples of the Data mesh work we've conducted with our clients Gilead and Saxo Bank.

Success based on real-world use cases

Thoughtworks has been working with Gilead, an American biopharmaceutical company, for over a year, developing the case and planning the implementation for Data mesh. Gilead has a robust experimentation culture, established people practices and innovative technology thought leadership, but like many enterprises, it faced numerous challenges adopting the Data mesh approach to deliver data-driven value at scale. Thoughtworks is actively assisting Gilead in its approach to Data mesh, building an Enterprise Data and AI Platform and leveraging prior experiences to establish new guiding principles. When reviewing opportunities, Gilead saw value in a new organizational and operational model backed by data, and the Data mesh approach also gave it the opportunity to engage in a cloud transformation initiative. While its previous experience and the realized opportunity for Data mesh allow Gilead to create guiding principles moving forward, such as managing data as a product and adopting a cloud-first architecture, this is only the tip of the iceberg in its journey.

Thoughtworks has also been working with Saxo Bank, a European online investment bank, to democratize data while empowering clients with information and agility to act with confidence. Because of the bank’s complex ecosystem, the data found within Saxo Bank's platform must be transparent, trustworthy and co-sharable, with the Saxo app being white-listed in every environment. So, Saxo Bank and Thoughtworks partnered to bring Data mesh to their organization. Thoughtworks created a data workbench for Saxo Bank to make their data assets searchable and discoverable. Much like how one would search for a product on Amazon, one types in the name of the data asset they're seeking in a search bar, and results backed by a business catalog of consistent business definitions appear. The data workbench also has product descriptions for each data asset along with the data asset's number of uses and user feedback, so one knows that the data is trustworthy. At a high level, since deploying their Data mesh initiative, Saxo Bank has seen a reduced cost of customer acquisition, more efficient costs of operation and increased defense due to the reduced chances of compliance and regulatory quagmires. 


Some lessons learned along the way

Our Data Mesh work with Gilead, Saxo Bank and other organizations has taught us much about what it takes to succeed. While not exhaustive, here's a brief overview of some of the lessons we've learned along the way in empowering our clients with Data mesh:

  • Mindset, organizational and operational models are the most significant barriers to adopting Data Mesh.
  • Educating stakeholders and domain teams about Data Mesh is critical to success.
  • Developing a product mindset needs to start with discovery from the consumer perspective.
  • Creating foundational data products that can be reused and repurposed for multiple use cases helps solidify Data mesh.
  • Data products need to be compliant with global and local policies.
  • Choosing the right implementation partner to adopt Data Mesh within your organization is crucial.

Getting started with Data mesh

Data mesh is a powerfully transformative analytical data architecture and operating model. Businesses in all industries stand to gain from a correct Data mesh implementation. But adopting Data Mesh requires more than just technology change — it takes time, organizational commitment and the right partner to guide you through the process. As a Data Mesh innovator, Thoughtworks is committed to delivering the business outcomes your strategy requires. We also aim to positively impact your organization, working with you to transform your digital capabilities, delivery practices and the mindset of your talent. Finally, as an organization committed to learning, we continually invest in research, harvest learnings and develop thought leadership to share with our clients. As a result, we're constantly helping our clients achieve their goals through Data mesh strategies and other transformative technologies. We're more than willing to get started with your organization today.


Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.



Data Mesh: the newest paradigm shift for a distributed architecture in the data world and its application

Simona Genovese

Advisor: Silvia Anna Chiusano. Politecnico di Torino, Master's degree program in Ingegneria Informatica (Computer Engineering), 2021.

Academic year: 2021/22. Publication type: electronic, 76 pages. Partner company: Agile Lab S.r.l.



A data mesh is a decentralized data architecture that organizes data by a specific business domain—for example, marketing, sales, customer service and more—to provide more ownership to the producers of a given data set.

The producers’ understanding of the domain data positions them to set data governance policies focused on documentation, quality, and access. This, in turn, enables self-service use across an organization. While this federated approach eliminates many operational bottlenecks associated with centralized, monolithic systems, it doesn't necessarily mean that you can't use traditional storage systems, like data lakes or data warehouses. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories.

It's worth noting that data mesh promotes the adoption of cloud native and cloud platform technologies to scale and achieve the goals of data management. The concept is commonly compared to microservices to help audiences understand its place in this landscape. As this distributed architecture is particularly helpful in scaling data needs across an organization, a data mesh may not be for all types of businesses; that is, smaller businesses may not reap the benefits of a data mesh, as their enterprise data may not be as complex as a larger organization's.

Zhamak Dehghani, a director of technology at the IT consultancy ThoughtWorks, is credited with promoting the concept of data mesh as a solution to the inherent challenges of centralized, monolithic data structures, such as data accessibility and organization. Its adoption was further spurred by the COVID-19 pandemic in an effort to drive cultural change and reduce organizational complexity around data.


A data mesh involves a cultural shift in the way that companies think about their data. Instead of data acting as a by-product of a process, it becomes the product, where data producers act as data product owners. Historically, a centralized infrastructure team would maintain data ownership across domains, but the product-thinking focus under a data mesh model shifts this ownership to the producers, as they are the subject matter experts. Their understanding of the primary data consumers and how they leverage the domain's operational and analytical data allows them to design APIs with those consumers' best interests in mind. While this domain-driven design also makes data producers responsible for documenting semantic definitions, cataloguing metadata and setting policies for permissions and usage, there is still a centralized data governance team to enforce these standards and procedures around the data. Additionally, while domain teams become responsible for their ETL data pipelines under a data mesh architecture, this doesn't eliminate the need for a centralized data engineering team; its responsibility, however, becomes more focused on determining the best data infrastructure solutions for the data products being stored.

Similar to how a microservices architecture couples lightweight services together to provide functionality to a business- or consumer-facing application, a data mesh uses functional domains as a way to set parameters around the data, enabling it to be treated as a product that can be accessed by users across the organization. In this way, a data mesh allows for more flexible data integration and interoperable functionality, where data from multiple domains can be immediately consumed by users for business analytics, data science experimentation and more.

As previously stated, a data mesh is a distributed data architecture, where data is organized by its domain to make it more accessible to users across an organization. A data lake is a low-cost storage environment, which typically houses petabytes of structured, semi-structured and unstructured data for business analytics, machine learning and other broad applications. A data mesh is an architectural approach to data, which a data lake can be a part of. However, a central data lake is more typically used as a dumping ground for data, as it is frequently used to ingest data that does not yet have a defined purpose. As a result, it can fall victim to becoming a data swamp - a data lake that lacks the appropriate data quality and data governance practices to provide insightful learnings.

A data fabric is an architectural concept focused on the automation of data integration, data engineering, and governance in a data value chain between data providers and data consumers. A data fabric is based on the notion of "active metadata", which uses knowledge graphs, semantics, and AI/ML technology to discover patterns in various types of metadata (for example, system logs, social data, etc.) and apply this insight to automate and orchestrate the data value chain (for example, enabling a data consumer to find a data product and then have that data product provisioned to them automatically). A data fabric is complementary to a data mesh, as opposed to mutually exclusive. In fact, the data fabric makes the data mesh better, because it can automate key parts of the data mesh, such as creating data products faster, enforcing global governance, and making it easier to orchestrate the combination of multiple data products.

Data democratization: Data mesh architectures facilitate self-service applications from multiple data sources, broadening access to data beyond the more technical resources, such as data scientists, data engineers, and developers. By making data more discoverable and accessible via this domain-driven design, a data mesh reduces data silos and operational bottlenecks, enabling faster decision-making and freeing up technical users to prioritize tasks that better utilize their skillsets.

Cost efficiencies: This distributed architecture moves away from batch data processing and instead promotes the adoption of cloud data platforms and streaming pipelines to collect data in real time. Cloud storage provides an additional cost advantage by allowing data teams to spin up large clusters as needed, paying only for the storage specified. This means that if you need additional compute power to run a job in a few hours vs. a few days, you can easily do this on a cloud data platform by purchasing additional compute nodes. It also improves visibility into storage costs, enabling better budget and resource allocation for engineering teams.

Less technical debt: A centralized data infrastructure incurs more technical debt due to the complexity of, and collaboration required for, maintaining the system. As data accumulates within a repository, it also begins to slow down the overall system. By distributing the data pipeline by domain ownership, data teams can better meet the demands of their data consumers and reduce technical strain on the storage system. They can also provide more access to data by offering APIs to interface with, reducing the overall volume of individual requests.

Interoperability: Under a data mesh model, data owners agree upfront on how to standardize domain-agnostic data fields, which facilitates interoperability. This way, when a domain team is structuring its datasets, it applies the relevant rules to enable data linkage across domains quickly and easily; a toy sketch follows this list. Commonly standardized fields include field type, metadata, schema flags, and more. Consistency across domains enables data consumers to interface with APIs more easily and to develop applications that serve their business needs more appropriately.

Security and compliance: Data mesh architectures promote stronger governance practices, as they help enforce data standards for domain-agnostic data and access controls for sensitive data. This helps organizations follow government regulations, like HIPAA, and the structure of this data ecosystem supports compliance by enabling data audits. Log and trace data in a data mesh architecture embed observability into the system, allowing auditors to understand which users are accessing specific data and the frequency of that access.
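As a toy sketch of the standardization described in the interoperability point above (every field name here is invented, not from the text), domains might agree on a common envelope that wraps their otherwise domain-specific records:

```python
# Invented example of a domain-agnostic envelope agreed on by all domains;
# the payload fields beneath it remain domain-specific.
REQUIRED_ENVELOPE = {"record_id", "domain", "schema_version", "updated_at"}

def conforms(record: dict) -> bool:
    """Check that a record carries every globally standardized field."""
    return REQUIRED_ENVELOPE <= record.keys()

marketing_record = {
    "record_id": "m-42",
    "domain": "marketing",
    "schema_version": "1.2",
    "updated_at": "2022-01-15",
    "campaign": "spring-launch",   # domain-specific payload
}
assert conforms(marketing_record)  # linkable across domains via the envelope
```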

While distributed data mesh architectures are still gaining adoption, they're helping teams attain their goals of scalability for common big data use cases. These include:

  • Business intelligence dashboards: As new initiatives arise, teams commonly require customized data views to understand the performance of these projects. Data mesh architectures can support this need for flexibility and customization by making data more available to data consumers. 
  • Automated virtual assistants: Businesses commonly use chatbots to support call centers and customer service teams. As frequently asked questions can touch on various datasets, a distributed data architecture can make more data assets available to these virtual agent systems.
  • Customer experience: Customer data allows businesses to better understand their users, allowing them to provide more personalized experiences. This has been observed in a variety of industries from marketing to healthcare.
  • Machine learning projects: By standardizing domain-agnostic data, data scientists can more easily stitch together data from various sources, reducing the time spent on data processing. The time saved can increase the number of models that move into a production environment, helping teams achieve their automation goals.


IBM supports the implementation of a data mesh with the IBM Data Fabric on Cloud Pak for Data. The IBM Data Fabric is a unified solution that contains all the capabilities needed to create data products and enable the governed and orchestrated access and use of these data products. The IBM Data Fabric enables the implementation of a data mesh on any platform (e.g., on premises data lakes, cloud data warehouses, etc.), allowing true enterprise-level self-service and re-use of data products regardless of where the data is.


Demystifying data mesh

A data mesh has emerged as a possible solution to the challenges of data access plaguing many large organizations. This approach takes data out of stovepipes and puts it directly in the hands of business users, but in a controlled manner that maintains strong governance.

About the authors

This article is a collaborative effort by Joe Caserta , Jean-Baptiste Dubois, Matthias Roggendorf , Marcus Roth , and Nikhil Srinidhi, representing views from McKinsey Digital.

Done well, a data mesh can speed time to market for data-driven applications and give rise to more powerful and scalable data products. These benefits have strategic implications. But it’s essential to approach the buildout in the right way. Otherwise, well-intentioned programs can collapse under their own weight. A leading life sciences company, for example, was prepared, from a technological standpoint, for the hard work a data mesh would require. But what it was unprepared for—and found far more challenging—was harmonizing data-management practices and building agreement among different business groups on which data products and use cases to centralize. Failing to anticipate these issues forced the project to pause midstream, creating confusion and prompting business users to revert to older and less efficient ways of managing data.

By understanding what domain-based data management is and hewing to a few core precepts, companies can avoid the learning pitfalls others have faced and begin reaping the rewards of a data mesh more quickly.

What exactly is a data mesh?

The term “data mesh” was coined by Zhamak Dehghani in 2019, when she was a principal at Thoughtworks. It caught on as a way of capturing the idea of distributed data access. But interpretations of what that means in practice abound. Is it a new technology, does it make existing data repositories obsolete, or is it a theoretical construct?

McKinsey defines a data mesh as a data-management paradigm that organizes data in domains, treats it as a product, enables self-service access, and supports these activities with federated governance (Exhibit 1). Here is why each of these elements is important.

Domain-based data management allows data to sit anywhere. Business teams own the data and are responsible for its quality, accessibility, and security. Domains are collections of data organized around a particular business purpose, such as marketing, procurement, or a particular customer segment or region. They contain raw data as well as self-contained elements known as data products. These data products bundle data to support different business applications, and they are designed with the internal wiring needed to plug directly into relevant apps or systems. A self-serve data infrastructure underlies the data mesh and acts as a central platform, providing a common place for business users to find and access data, regardless of where it is hosted.

Governance is managed in a federated "hub-and-spoke" way. Under this approach, a small central team sets controls, and a supporting data infrastructure enforces them. Standards defined in code enable data product teams within the business to comply with metadata documentation, data classification, and data quality monitoring.
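One way to read "standards defined in code" is that the central hub publishes small, executable rules that every data product team runs automatically. The sketch below assumes an invented classification scheme and access rule; it is illustrative only, not a description of any specific implementation.

```python
# Invented hub-defined standard: data classifications map to minimum
# clearance levels, and the shared infrastructure enforces the rule
# for every data product in the mesh.
CLEARANCE = {"public": 0, "internal": 1, "confidential": 2}

def may_access(user_clearance: str, data_classification: str) -> bool:
    """Hub-defined access rule, enforced automatically by the platform."""
    return CLEARANCE[user_clearance] >= CLEARANCE[data_classification]

assert may_access("confidential", "internal")
assert not may_access("public", "confidential")
```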

Together, these elements create a self-organizing mesh in which different groups around the business can come together, define their data requirements, agree on how new data is to be shared, and align on the best ways to employ that data.


Executed well, a data mesh can deliver powerful advantages.

Most product and solution breakthroughs occur within the business—and few such breakthroughs can occur today without data. Data meshes allow business users to get their hands on critical information more quickly, delivering the following benefits:

  • Speeding time to market for data-analytics applications: Data products can react more responsively to data demand and provide business users with scalable access to high-quality data through the direct exchange between data producers and data consumers.
  • Unlocking self-service data access for business users: Domain-based structures reduce dependency on centrally located teams, putting insights within more immediate reach of business users and enabling them to get “skin in the game.” In addition, a high degree of self-service boosts adoption, allowing nontechnical users to feel comfortable engaging with data and using data products to answer business questions and prepare fact-based decisions.
  • Enhancing data IQ: Greater engagement with data builds learning, enabling business users to design increasingly sophisticated applications over time. By shaping the data and assets they use, business users ensure that what’s created is fit for purpose, driving greater return on investment. For example, a large industrial company established self-service dashboards that enabled staff to discover existing data products and build individual reports. Together with a communications campaign, the effort activated 300 new data users.

Prior to implementing a data mesh, a large mining organization had hundreds of siloed operational databases scattered around the world, and developing analytics use cases took months. After shifting to a data mesh, the company cut time spent on data-engineering activities dramatically and developed use cases seven times faster than before while also increasing data stability and reusability.

A data mesh involves the entire business

Obtaining the full benefits of a data mesh requires careful choreography. While domain-based architectures have attracted growing interest, the technological discussion often predominates, overshadowing other critical elements.

Business users, for instance, may recognize that their current data-management systems are problematic but feel it’s better to stick with what is known than undergo the disruption of assuming direct ownership for data domains and products.

Even those eager to get started may not realize how organizational structures need to adapt to enable a steady flow of data products and use cases. For example, it’s not uncommon for organizations setting up a data mesh to discover that needed documentation is missing, taxonomies are incomplete, or new processes need to be created before data can be used. These issues can delay completion unless businesses make provision for them in their resourcing. For nontechnical professionals particularly, the learning curve can be steep and momentum for domain-based data ownership can sputter unless properly supported.

The following practices can help companies mitigate these learning-curve issues and increase the odds of a successful data mesh implementation.


Put the business in the lead.

Stewardship of the data mesh implementation must come from the business, supported by executive sponsors and backed by a formal change-management team. Data mesh evangelists within the change team can help business departments analyze their data landscape and define the most valuable data products to share with the organization. Some organizations have found it helpful to position the data mesh as part of a strategic initiative such as a digital transformation. That can help set the context and the case for change. There also needs to be a committed data product owner within the business who is willing to take on the challenge of “selling” data internally to other business users and application teams. In addition, there should be a central data-infrastructure team that can implement “data governance as code” in tools that are not yet fully mature.

Let ROI guide data provisioning

Organizations sometimes get stuck trying to determine whether a centralized or decentralized approach to data management is best, but the answer is that both methods can be effective (Exhibit 2). Companies with a modern IT landscape and well-established local data repositories might get more value from exposing data through virtualized links (while still registering it in a central data marketplace or catalog). By contrast, those that are in the middle of an enterprise resource planning (ERP) transformation or other large IT change might find it better to first move toward a central data platform and create a single logic on core data products.

There can occasionally be an argument where fully centralized approaches deliver superior ROI—for example, if the majority of data use cases and data products are used globally. Fully decentralized approaches are rare at present, since they require a level of data-management orchestration that large enterprises may feel is currently out of their reach.

In practice, most organizations begin with a mix of centralized and localized data products that reflect their particular business, technology, capabilities, and go-to-market requirements. How hard to lean on centralized versus decentralized structures is often a matter of degree.

Finance, operations, and marketing, for instance, often require niche sets of data and analytics, so a company might choose to localize these functions’ data management. Cross-spanning data assets required by multiple functions can be managed by a centralized group and shared with the relevant functions accordingly.

Start with a few high-value data domains and applications

The data mesh does not need to be constructed in one fell swoop. Many companies attain positive results by taking serial steps. A biotech company began by providing data from an operational data warehouse through a data mesh to feed into operational reporting of its production performance (monitoring production variables). The data product team worked closely with business users to understand their needs, improve data quality and velocity, and standardize data into a harmonized format. Business users were able to explore and develop new applications more quickly at the proof-of-concept stage and then scale them to full production.

Centralized standards for data quality, data architecture , and data sovereignty must also be established and adopted by all data product owners. Some companies that already have centralized standards in place can adjust them to reflect the needs of a decentralized data organization. Others start by defining standards for a data domain, test them for practical applicability, and improve them as needed. Then, they roll the standards out in waves to the rest of the organization along with training and capability building to ensure the governance is consistently applied across the organization.

Identify capability gaps and fill them

Executive and nontechnical business users will all need a basic level of data literacy for data mesh success. Coaching, hackathons, online programs, and analytics academies can all work well. Business teams responsible for managing domains will need more extensive training, which should be ongoing so that users can continually grow their skill sets. Otherwise, companies can end up with a narrow set of data capabilities, enough to get started but not sufficient to create the momentum needed to sustain growth or scale.

Keep the conversation going

In most cases, building a data mesh is a continuum. Leaders make a point of regularly communicating with the organization, in large-scale town halls and intimate team meetings, on what the company is trying to achieve and what the road map looks like in terms of timing and capability building. They use internal communications to share success stories, acknowledge the individuals involved in the effort, and remain open about the inevitable challenges. Regular dialog helps to sustain long-term change efforts, keeping the transition alive in people’s minds and reinforcing its steadily accruing benefits.

A data mesh can help close the insights gap and grease the wheels of innovation, allowing companies to better predict the direction of change and proactively respond to it. But bringing a data mesh from concept to reality requires managing it as a business transformation, not a technological one.

Joe Caserta is a partner in McKinsey’s New York office; Jean-Baptiste Dubois is a senior expert in the Paris office; Matthias Roggendorf is a partner in the Berlin office, where Nikhil Srinidhi is an associate partner; and Marcus Roth is a partner in the Chicago office.

The authors wish to thank Marie Grünbein for her contributions to this article.



Data Mesh: A Systematic Gray Literature Review

Abstract: Data mesh is an emerging domain-driven decentralized data architecture that aims to minimize or avoid operational bottlenecks associated with centralized, monolithic data architectures in enterprises. The topic has piqued practitioners' interest, and there is considerable gray literature on it. At the same time, we observe a lack of academic attempts at defining and building upon the concept. Hence, in this article, we aim to start from the foundations and characterize the data mesh architecture regarding its design principles, architectural components, capabilities, and organizational roles. We systematically collected, analyzed, and synthesized 114 industrial gray literature articles. The review provides insights into practitioners' perspectives on the four key principles of data mesh: data as a product, domain ownership of data, self-serve data platform, and federated computational governance. Moreover, due to the comparability of data mesh and SOA (service-oriented architecture), we mapped the findings from the gray literature into the reference architectures from the SOA academic literature to create reference architectures for describing three key dimensions of data mesh: organization of capabilities and roles, development, and runtime. Finally, we discuss open research issues in data mesh, partially based on the findings from the gray literature.
Subjects: Software Engineering (cs.SE); Databases (cs.DB)


Data Mesh: Systematic Gray Literature Study, Reference Architecture, and Cloud-based Instantiation

abel177/Data-Mesh-Master-Thesis




  19. Demystifying data mesh

    McKinsey defines a data mesh as a data-management paradigm that organizes data in domains, treats it as a product, enables self-service access, and supports these activities with federated governance (Exhibit 1). Here is why each of these elements is important. 1. Domain-based data management allows data to sit anywhere.

  20. What is a Data Mesh?

    A data mesh is an architectural framework that solves advanced data security challenges through distributed, decentralized ownership. Organizations have multiple data sources from different lines of business that must be integrated for analytics. A data mesh architecture effectively unites the disparate data sources and links them together ...

  21. Dissertations / Theses: 'Data Mesh'

    The triangle mesh data structures proposed in this thesis support the standard set of mesh connectivity operators introduced by the previously proposed Corner Table at an amortized constant time complexity. They can be constructed in linear time and space from the Corner Table or any equivalent representation. If geometry is stored as 16-bit ...

  22. [2304.01062] Data Mesh: a Systematic Gray Literature Review

    Data Mesh: a Systematic Gray Literature Review. Data mesh is an emerging domain-driven decentralized data architecture that aims to minimize or avoid operational bottlenecks associated with centralized, monolithic data architectures in enterprises. The topic has picked the practitioners' interest, and there is considerable gray literature on it.

  23. abel177/Data-Mesh-Master-Thesis

    Data Mesh: Systematic Gray Litearture Study, Reference Architecture, and Cloud-based Instantiation - GitHub - abel177/Data-Mesh-Master-Thesis: Data Mesh: Systematic Gray Litearture Study, Reference Architecture, and Cloud-based Instantiation

  24. No. 116: Bridging Privacy Rights and National Security: The Legal

    This thesis explores the complex balance between national security and privacy rights within the context of EU-US data transfers by assessing the previous and current legal frameworks, shaped by the Schrems I and II rulings and the European Commission's Adequacy Decision of 10th July 2023.