by the operational systems. This is also called a native data product.
While working on this thesis, I have been collecting links and resources about the Data Mesh paradigm. I want to share the discovery path I followed to understand the implications of Data Mesh.
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Data Mesh Principles and Logical Architecture
Introduction to Data Mesh: A Paradigm Shift in Analytical Data Management by Zhamak Dehghani (Part I)
How to Build a Foundation for Data Mesh: A Principled Approach by Zhamak Dehghani (Part II)
Data Mesh Score calculator
Decentralizing Data: From Data Monolith to Data Mesh with Zhamak Dehghani, Creator of Data Mesh
Data Mesh: The Four Principles of the Distributed Architecture by Eugene Berko
How to achieve Data Mesh teams and culture?
Anatomy of Data Products in Data Mesh
Data Mesh Applied: Moving step-by-step from mono data lake to decentralized 21st-century data mesh.
There’s More Than One Kind of Data Mesh: Three Types of Data Meshes
Building a successful Data Mesh – More than just a technology initiative
How the **ck (heck) do you build a Data Mesh?
Data Mesh architecture patterns
Data Mesh: Topologies and domain granularity
I am so grateful to the Data Mesh Learning Community, founded and run by Scott Hirleman. I appreciate the help from the community, the meetups, and the resources published and organized there. It was a great starting point for the research in this thesis.
I also want to acknowledge all the authors of my sources for sharing their knowledge.
Student thesis | Master's Thesis |
---|---|
Date of Award | 1 Sept 2023 |
Original language | English |
Supervisor | Karel Lemmen (Supervisor) & Pien Walraven (Examiner) |
File | application/pdf, 1.54 MB |
Type | Thesis |
Rather than dwell on the definitions (Gartner counts at least three) of data mesh, I’ll go with my lay version:
data mesh [dey-tuh- mesh]
A decentralized architecture capability to solve the data swamp problem, reduce data analytics cost, and speed actionable insights to better enable data-informed business decisions.
There, I said it without the buzzwords like “data democratization” or “paradigm shift.” Mea culpa for throwing in “actionable insight.” Let’s decompose what data mesh means for the real working class, our data engineers, architects, and scientists.
Data swamps are losing their role as a centralized data platform
In the mid-1990s, data warehousing was bursting onto the data management scene. Fueled by the hype of the fabled “beer and diapers” story, businesses were pouring tens of millions of dollars into building huge data monoliths to handle the consumption, storage, transformation, and output of data in one central system to answer business questions that required complex data analytics such as “who are the high-value customers most likely to buy X?”
The thesis of data warehousing worked like a charm at the time. However, as the appetite for data analytics increased, so did the need for more data to be ingested. The complexity and pace of data pipelines soared (as did the nickname “data wranglers”). I began to see the cracks forming in the data warehouse theory, and in my master’s research I delved into its growing failure to get value from analytical data.
As social media and the iPhone became the norm, many turned to a second generation of data analytics architecture called data lakes. While traditional data warehouses used an Extract-Transform-Load (ETL) process to ingest data, data lakes relied on an Extract-Load-Transform (ELT) process that puts data into cheap BLOB storage. This eliminated the big shortcomings of data warehouses but spurred the “Let’s just collect everything” ideology of data swamps.
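A minimal sketch can make the ordering difference concrete. The code below is illustrative only, not a model of any specific warehouse or lake product: ETL transforms before loading, while ELT loads raw data first and transforms later, per use case.

```python
# Illustrative only: the ordering difference between ETL and ELT,
# not a model of any specific warehouse or lake product.

raw = [{"name": " Alice ", "spend": "120"}, {"name": "Bob", "spend": "80"}]

def transform(rows):
    """Clean and type the raw records."""
    return [{"name": r["name"].strip(), "spend": int(r["spend"])} for r in rows]

# ETL: transform first, then load the cleaned data into the warehouse.
warehouse = transform(raw)

# ELT: load the raw data as-is into cheap storage; transform later, on demand.
lake = list(raw)
lake_view = transform(lake)
```

Either path yields the same cleaned view; the difference is where the untyped raw data lives in the meantime, which is what makes “collect everything” lakes both cheap and swamp-prone.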
Further learning: Data Lake Acceleration
Fast forward to today. Only 32% of companies are realizing tangible and measurable value from data (“trapped value”), according to a study by Accenture. The roaring demand for “discovery or iterative style analytics” (where consumers don’t really know the questions or data they need) is raising access to data to a whole new level with new/expanding data sources (or “wide data”) across multi- and hybrid cloud environments, thrusting massive friction onto traditional data lakes and warehouses.
Nowhere is this pain more visible than among data and analytics teams.
A data mesh aims to create an architectural foundation for getting value from analytical data and historical facts at scale – scale being applied to the constant change of data landscape, proliferation of data and analytics demand, diversity of transformation and processing that use cases require, and speed of response to change.
To achieve this objective, most experts agree that the thesis of data mesh is based on four precepts:
Data mesh inverts the traditional data warehouse/lake ideology by transforming data gatekeepers into data liberators. Every “data citizen” (not just the data scientist, engineer, or analyst) gets easy access and the ability to work with data comfortably, regardless of their technical know-how, reducing the cost of data analytics and speeding time to insight.
However, the rise of data mesh does not mean the fall of data lakes; rather, they are complementary architectures. A data lake is a good solution for storing vast amounts of data in a centralized location. And data mesh is the best solution for fast data retrieval, integration, and analytics. In a nutshell, think of a data mesh as connective tissue to data lakes and/or other sources.
Introduced by Thoughtworks, data mesh is “a shift in modern distributed architecture that applies platform thinking to create self-serve data infrastructure, treating data as the product.” It is a data and analytics platform where data can remain within different databases, rather than being consolidated into a single data lake.
Like a data fabric architecture, a data mesh architecture comprises four layers. To provide useful information to data and analytics professionals, I’ll break the data mesh rule by also talking about specific technologies and representative vendors.
VentureBeat says a data mesh architecture “connects various data sources (including data lakes) into a coherent infrastructure, where all data is accessible if you have the right authority to access it.” This doesn’t mean there is one big, hairy data warehouse or lake (see data swamp problem) — the laws of physics and the demand for analytics mean that large, disparate data sets can’t just be joined together over huge distances with decent performance. Not to mention the costs of moving, transforming, and maintaining the data (and ETL | ELT).
Enter the semantic layer. A semantic layer represents a network of real-world entities (objects, events, situations, or concepts) and illustrates the relationships needed to answer complex cross-domain questions, shared and re-used based on fine-grained data access policies and business rules. It comprises three layers (and a knowledge catalog that interfaces with governance tools):
The big payoff of a semantic layer is a better way to enable self-service, federated queries, letting you build and deploy analytics data products quickly and efficiently.
Further learning: How Knowledge Graphs Work
Generally, organizations handling and analyzing a large number of data sources should seriously consider evolving to a data mesh architecture. Other signals include a high number of data domains and functional teams that demand data products (especially advanced analytics such as predictive and simulation modeling), frequent data pipeline bottlenecks, and a priority on data governance.
Stardog commissioned Forrester Consulting to interview four decision-makers with experience implementing Stardog. For this commissioned study, Forrester aggregated the interviewees’ experiences and combined the results into a single composite organization. The key findings were that the composite organization using the Stardog Enterprise Knowledge Graph platform realized the following over three years:
Our aspiration to augment and improve every aspect of business and life with data demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of volume of data and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure.
03 December 2020
Zhamak is the director of emerging technologies at Thoughtworks North America with focus on distributed systems architecture and a deep passion for decentralized solutions. She is a member of Thoughtworks Technology Advisory Board and contributes to the creation of Thoughtworks Technology Radar.
Contents:
- Logical architecture: domain-oriented data and compute
- Logical architecture: data product as the architectural quantum
- Logical architecture: a multi-plane data platform
- Logical architecture: computational policies embedded in the mesh
- Principles summary and the high-level logical architecture
For more on Data Mesh, Zhamak went on to write a full book that covers more details on strategy, implementation, and organizational design.
The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and organizational challenges in becoming data-driven, using data to compete, or using data at scale to drive value. It offered an alternative perspective which has since captured many organizations’ attention, and given hope for a different future. While the original writeup describes the approach, it leaves many details of the design and implementation to one’s imagination. I have no intention of being too prescriptive in this article and killing the imagination and creativity around data mesh implementation. However, I think it’s only responsible to clarify the architectural aspects of data mesh as a stepping stone to move the paradigm forward.
This article is written with the intention of a follow-up. It summarizes the data mesh approach by enumerating its underpinning principles and the high-level logical architecture that the principles drive. Establishing the high-level logical model is a necessary foundation before I dive into the detailed architecture of data mesh core components in future articles. Hence, if you are in search of a prescription around exact tools and recipes for data mesh, this article may disappoint you. If you are seeking a simple and technology-agnostic model that establishes a common language, come along.
What do we really mean by data? The answer depends on whom you ask. Today’s landscape is divided into operational data and analytical data. Operational data sits in databases behind business capabilities served with microservices, has a transactional nature, keeps the current state and serves the needs of the applications running the business. Analytical data is a temporal and aggregated view of the facts of the business over time, often modeled to provide retrospective or future-perspective insights; it trains the ML models or feeds the analytical reports.
The current state of technology, architecture and organization design is reflective of the divergence of these two data planes - two levels of existence, integrated yet separate. This divergence has led to a fragile architecture. Continuously failing ETL (Extract, Transform, Load) jobs and the ever-growing complexity of a labyrinth of data pipelines are a familiar sight to many who attempt to connect these two planes, flowing data from the operational data plane to the analytical plane, and back to the operational plane.
Figure 1: The great divide of data
The analytical data plane itself has diverged into two main architectures and technology stacks: data lake and data warehouse, with the data lake supporting data science access patterns, and the data warehouse supporting analytical and business intelligence reporting access patterns. For this conversation, I put aside the dance between the two technology stacks: the data warehouse attempting to onboard data science workflows and the data lake attempting to serve data analysts and business intelligence. The original writeup on data mesh explores the challenges of the existing analytical data plane architecture.
Figure 2: Further divide of analytical data - warehouse
Figure 3: Further divide of analytical data - lake
Data mesh recognizes and respects the differences between these two planes: the nature and topology of the data, the differing use cases, the individual personas of data consumers, and ultimately their diverse access patterns. However, it attempts to connect these two planes under a different structure - an inverted model and topology based on domains and not technology stack - with a focus on the analytical data plane. Differences in today's available technology to manage the two archetypes of data should not lead to separation of the organization, teams and people who work on them. In my opinion, the operational and transactional data technology and topology is relatively mature, and driven largely by the microservices architecture; data is hidden on the inside of each microservice, controlled and accessed through the microservice’s APIs. Yes, there is room for innovation to truly achieve multi-cloud-native operational database solutions, but from the architectural perspective it meets the needs of the business. However, it’s the management of and access to the analytical data that remains a point of friction at scale. This is where data mesh focuses.
I do believe that at some point in the future our technologies will evolve to bring these two planes even closer together, but for now, I suggest we keep their concerns separate.
The data mesh objective is to create a foundation for getting value from analytical data and historical facts at scale - scale being applied to the constant change of the data landscape, the proliferation of both sources of data and consumers, the diversity of transformation and processing that use cases require, and the speed of response to change. To achieve this objective, I suggest that there are four underpinning principles that any data mesh implementation embodies to achieve the promise of scale, while delivering the quality and integrity guarantees needed to make data usable: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.
While I expect the practices, technologies and implementations of these principles vary and mature over time, these principles remain unchanged.
I have intended for the four principles to be collectively necessary and sufficient; to enable scale with resiliency while addressing concerns around the siloing of incompatible data or increased cost of operation. Let's dive into each principle and then design the conceptual architecture that supports it.
Data mesh, at its core, is founded in the decentralization and distribution of responsibility to the people who are closest to the data, in order to support continuous change and scalability. The question is: how do we decompose and decentralize the components of the data ecosystem and their ownership? The components here are made of analytical data, its metadata, and the computation necessary to serve it.
Data mesh follows the seams of organizational units as the axis of decomposition. Our organizations today are decomposed based on their business domains. Such decomposition localizes the impact of continuous change and evolution - for the most part - to the domain’s bounded context. This makes the business domain’s bounded context a good candidate for the distribution of data ownership.
In this article, I will continue to use the same use case as the original writeup, ‘a digital media company’. One can imagine that the media company divides its operation, hence the systems and teams that support the operation, based on domains such as ‘podcasts’, teams and systems that manage podcast publication and their hosts; ‘artists’, teams and systems that manage onboarding and paying artists, and so on. Data mesh argues that the ownership and serving of the analytical data should respect these domains. For example, the teams who manage ‘podcasts’, while providing APIs for releasing podcasts, should also be responsible for providing historical data that represents ‘released podcasts’ over time with other facts such as ‘listenership’ over time. For a deeper dive into this principle see Domain-oriented data decomposition and ownership.
To promote such decomposition, we need to model an architecture that arranges the analytical data by domains. In this architecture, the domain’s interface to the rest of the organization not only includes the operational capabilities but also access to the analytical data that the domain serves. For example, ‘podcasts’ domain provides operational APIs to ‘create a new podcast episode’ but also an analytical data endpoint for retrieving ‘all podcast episodes data over the last <n> months’. This implies that the architecture must remove any friction or coupling to let domains serve their analytical data and release the code that computes the data, independently of other domains. To scale, the architecture must support autonomy of the domain teams with regard to the release and deployment of their operational or analytical data systems.
The following example demonstrates the principle of domain oriented data ownership. The diagrams are only logical representations and exemplary. They aren't intended to be complete.
Each domain can expose one or many operational APIs, as well as one or many analytical data endpoints
Figure 4: Notation: domain, its analytical data and operational capabilities
Naturally, each domain can have dependencies on other domains' operational and analytical data endpoints. In the following example, the 'podcasts' domain consumes analytical data of 'users updates' from the 'users' domain, so that it can provide a picture of the demographics of podcast listeners through its 'Podcast listeners demographic' dataset.
Figure 5: Example: domain oriented ownership of analytical data in addition to operational capabilities
Note: In the example, I have used an imperative language for accessing the operational data or capabilities, such as 'Pay artists'. This is simply to emphasize the difference between the intention of accessing operational data vs. analytical data. I do recognize that in practice operational APIs are implemented through a more declarative interface such as accessing a RESTful resource or a GraphQL query.
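To make the shape of such a domain interface concrete, here is a minimal sketch in Python. All names (`PodcastsDomain`, `create_episode`, `episodes_released_since`) are hypothetical illustrations of an operational API sitting next to an analytical endpoint on the same domain, not part of any prescribed data mesh design.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical sketch: a 'podcasts' domain exposing both an operational
# capability and an analytical data endpoint. All names are illustrative.

@dataclass
class PodcastEpisode:
    episode_id: str
    title: str
    released_on: date

class PodcastsDomain:
    def __init__(self):
        self._episodes: list[PodcastEpisode] = []

    # Operational API: serves the running application
    # ("create a new podcast episode").
    def create_episode(self, episode_id: str, title: str,
                       released_on: date) -> PodcastEpisode:
        episode = PodcastEpisode(episode_id, title, released_on)
        self._episodes.append(episode)
        return episode

    # Analytical endpoint: serves historical facts
    # ("all podcast episodes over the last <n> months").
    def episodes_released_since(self, months: int,
                                today: date) -> list[PodcastEpisode]:
        cutoff = today - timedelta(days=30 * months)  # approximate months
        return [e for e in self._episodes if e.released_on >= cutoff]
```

The point is that both interfaces are owned and released by the same domain team, so serving the analytical view needs no coupling to a central pipeline team.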
One of the challenges of existing analytical data architectures is the high friction and cost of discovering, understanding, trusting, and ultimately using quality data. If not addressed, this problem only exacerbates with data mesh, as the number of places and teams who provide data - domains - increases; this is a consequence of our first principle, decentralization. The data as a product principle is designed to address the data quality and age-old data silos problem; or, as Gartner calls it, dark data - “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes”. Analytical data provided by the domains must be treated as a product, and the consumers of that data should be treated as customers - happy and delighted customers.
The original article enumerates a list of capabilities, including discoverability, security, explorability, understandability, trustworthiness, etc., that a data mesh implementation should support for a domain's data to be considered a product. It also details roles such as domain data product owner that organizations must introduce, responsible for the objective measures that ensure data is delivered as a product. These measures include data quality, decreased lead time of data consumption, and, in general, data user satisfaction through net promoter score. The domain data product owner must have a deep understanding of who the data users are, how they use the data, and which methods of consuming the data they are most comfortable with. Such intimate knowledge of data users results in the design of data product interfaces that meet their needs. In reality, for the majority of data products on the mesh, there are a few conventional personas with their unique tooling and expectations: data analysts and data scientists. All data products can develop standardized interfaces to support them. The conversation between users of the data and product owners is a necessary piece for establishing the interfaces of data products.
Each domain will include data product developer roles, responsible for building, maintaining and serving the domain's data products. Data product developers will be working alongside other developers in the domain. Each domain team may serve one or multiple data products. It’s also possible to form new teams to serve data products that don’t naturally fit into an existing operational domain.
Note: this is an inverted model of responsibility compared to past paradigms. The accountability of data quality shifts upstream as close to the source of the data as possible.
Architecturally, to support data as a product that domains can autonomously serve or consume, data mesh introduces the concept of the data product as its architectural quantum. An architectural quantum, as defined by Evolutionary Architecture, is the smallest unit of architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function.
The data product is the node on the mesh that encapsulates the three structural components required for its function, providing access to the domain's analytical data as a product.
Figure 6: Data product components as one architectural quantum
The following example builds on the previous section, demonstrating the data product as the architectural quantum. The diagram only includes sample content and is not intended to be complete or include all design and implementation details. While this is still a logical representation it is getting closer to the physical implementation.
Figure 7: Notation: domain, its (analytical) data product and operational system
Figure 8: Data products serving the domain-oriented analytical data
Note: The data mesh model differs from past paradigms where pipelines (code) are managed as components independent of the data they produce, and infrastructure, like an instance of a warehouse or a lake storage account, is often shared among many datasets. A data product is a composition of all components - code, data and infrastructure - at the granularity of a domain's bounded context.
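The composition described in the note can be sketched as a simple structure. The field names below are hypothetical, chosen only to illustrate that a data product bundles code, data, and infrastructure as one independently deployable unit rather than sharing pipelines and storage across datasets.

```python
from dataclasses import dataclass, field

# Illustrative sketch (field names hypothetical): a data product as one
# architectural quantum composing the code, the data it serves, and the
# infrastructure it runs on.

@dataclass
class DataProduct:
    domain: str
    name: str
    code: list[str] = field(default_factory=list)            # pipeline + API code artifacts
    data: list[str] = field(default_factory=list)            # datasets and their metadata
    infrastructure: list[str] = field(default_factory=list)  # storage, compute, endpoints

    def is_deployable(self) -> bool:
        # The quantum is independently deployable only when all three
        # structural components are present.
        return bool(self.code and self.data and self.infrastructure)
```

Nothing outside the quantum needs to be provisioned for the product to function, which is what gives the owning domain its autonomy.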
As you can imagine, to build, deploy, execute, monitor, and access a humble hexagon - a data product - there is a fair bit of infrastructure that needs to be provisioned and run; the skills needed to provision this infrastructure are specialized and would be difficult to replicate in each domain. Most importantly, the only way that teams can autonomously own their data products is to have access to a high-level abstraction of infrastructure that removes the complexity and friction of provisioning and managing the lifecycle of data products. This calls for a new principle, Self-serve data infrastructure as a platform to enable domain autonomy.
The data platform can be considered an extension of the delivery platform that already exists to run and monitor the services. However, the underlying technology stack to operate data products today looks very different from the delivery platform for services. This is simply due to the divergence of big data technology stacks from operational platforms. For example, domain teams might be deploying their services as Docker containers while the delivery platform uses Kubernetes for their orchestration; however, the neighboring data product might be running its pipeline code as Spark jobs on a Databricks cluster. That requires provisioning and connecting two very different sets of infrastructure which, prior to data mesh, did not require this level of interoperability and interconnectivity. My personal hope is that we start seeing a convergence of operational and data infrastructure where it makes sense - for example, running Spark on the same orchestration system, e.g. Kubernetes.
In reality, to make analytical data product development accessible to generalist developers - the existing profile of developers that domains have - the self-serve platform needs to provide a new category of tools and interfaces in addition to simplified provisioning. A self-serve data platform must create tooling that supports a domain data product developer’s workflow of creating, maintaining and running data products with less specialized knowledge than existing technologies assume; self-serve infrastructure must include capabilities to lower the current cost and specialization needed to build data products. The original writeup includes a list of capabilities that a self-serve data platform provides, including access to scalable polyglot data storage, data product schemas, data pipeline declaration and orchestration, data product lineage, compute and data locality, etc.
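One way to picture the self-serve experience is a declarative specification that a platform could provision from. The sketch below is an assumption for illustration only - the field names do not correspond to any real platform's API.

```python
# Hypothetical declarative data product specification; all field names are
# illustrative, not a real platform's API.

spec = {
    "domain": "podcasts",
    "name": "released-podcasts",
    "storage": {"type": "blob", "format": "parquet"},
    "pipeline": {"schedule": "daily", "entrypoint": "transform.py"},
    "output_ports": [{"protocol": "sql", "schema": "released_podcasts_v1"}],
}

# Fields the (hypothetical) platform requires before provisioning anything.
REQUIRED_FIELDS = {"domain", "name", "storage", "pipeline", "output_ports"}

def validate_spec(spec: dict) -> bool:
    """Platform-side completeness check before provisioning infrastructure."""
    return REQUIRED_FIELDS.issubset(spec.keys())
```

The generalist developer declares what the product needs; the platform owns the specialized how of provisioning storage, scheduling the pipeline, and wiring the output ports.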
The self-serve platform capabilities fall into multiple categories, or planes as the model calls them. Note: A plane is representative of a level of existence - integrated yet separate, similar to physical and consciousness planes, or control and data planes in networking. A plane is neither a layer nor does it imply a strong hierarchical access model.
Figure 9: Notation: A platform plane that provides a number of related capabilities through self-serve interfaces
A self-serve platform can have multiple planes that each serve a different profile of users. The following example lists three different data platform planes:
The following model is only exemplary and is not intended to be complete. While a hierarchy of planes is desirable, there is no strict layering implied below.
Figure 10: Multiple planes of self-serve data platform *DP stands for a data product
As you can see, data mesh follows a distributed system architecture: a collection of independent data products, with independent lifecycles, built and deployed by likely independent teams. However, for the majority of use cases, to get value in the form of higher-order datasets, insights or machine intelligence, these independent data products need to interoperate; we must be able to correlate them, create unions, find intersections, or perform other graph or set operations on them at scale. For any of these operations to be possible, a data mesh implementation requires a governance model that embraces decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology, and most importantly automated execution of decisions by the platform. I call this federated computational governance: a decision-making model led by the federation of domain data product owners and data platform product owners, with autonomy and domain-local decision-making power, while creating and adhering to a set of global rules - rules applied to all data products and their interfaces - to ensure a healthy and interoperable ecosystem. The group has a difficult job: maintaining an equilibrium between centralization and decentralization; which decisions need to be localized to each domain and which decisions should be made globally for all domains. Ultimately, global decisions have one purpose: creating interoperability and a compounding network effect through the discovery and composition of data products.
The priorities of the governance in data mesh are different from traditional governance of analytical data management systems. While they both ultimately set out to get value from data, traditional data governance attempts to achieve that through centralization of decision making, and establishing global canonical representation of data with minimal support for change. Data mesh's federated computational governance, in contrast, embraces change and multiple interpretive contexts.
Placing a system in a straitjacket of constancy can cause fragility to evolve. -- C.S. Holling, ecologist
A supportive organizational structure, incentive model and architecture are necessary for the federated governance model to function: to arrive at global decisions and standards for interoperability, while respecting the autonomy of local domains, and to implement global policies effectively.
Figure 11: Notation: federated computational governance model
As mentioned earlier, striking a balance between what shall be standardized globally, implemented and enforced by the platform for all domains and their data products, and what shall be left to the domains to decide, is an art. For instance, the domain data model is a concern that should be localized to the domain most intimately familiar with it. For example, how the semantics and syntax of the 'podcast audienceship' data model are defined must be left to the 'podcast domain' team. In contrast, the decision around how to identify a 'podcast listener' is a global concern. A podcast listener is a member of the population of 'users' - its upstream bounded context - who can cross the boundary of domains and be found in other domains such as 'users play streams'. The unified identification allows correlating information about 'users' who are both 'podcast listeners' and 'stream listeners'.
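A tiny sketch shows why the globally standardized identifier matters: with a shared `user_id` scheme (an assumption here, purely illustrative), datasets owned by independent domains can be correlated without any central store.

```python
# Illustrative only: a globally standardized 'user_id' (a governance decision)
# lets two independently owned datasets be joined. All data is made up.

podcast_listeners = [
    {"user_id": "u1", "minutes_listened": 120},
    {"user_id": "u2", "minutes_listened": 45},
]
stream_listeners = [
    {"user_id": "u2", "streams_played": 30},
    {"user_id": "u3", "streams_played": 12},
]

def correlate(podcasts, streams):
    """Join two domains' datasets on the shared global user identifier."""
    streams_by_user = {row["user_id"]: row for row in streams}
    return [
        {**p, **streams_by_user[p["user_id"]]}
        for p in podcasts
        if p["user_id"] in streams_by_user
    ]
```

Each domain keeps its internal model local; only the identity scheme is agreed globally, which is exactly the balance the governance group has to strike.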
The following is an example of elements involved in the data mesh governance model. It’s not a comprehensive example and only demonstrative of concerns relevant at the global level.
Figure 12: Example of elements of a federated computational governance: teams, incentives, automated implementation, and globally standardized aspects of data mesh
Many practices of pre-data-mesh governance, as a centralized function, are no longer applicable to the data mesh paradigm. For example, the past emphasis on certification of golden datasets - the datasets that have gone through a centralized process of quality control and certification and are marked as trustworthy - as a central function of governance is no longer relevant. This stemmed from the fact that in previous data management paradigms, data - in whatever quality and format - gets extracted from the operational domain’s databases and centrally stored in a warehouse or a lake, which then requires a centralized team to apply cleansing, harmonization and encryption processes to it; often under the custodianship of a centralized governance group. Data mesh completely decentralizes this concern. A domain dataset only becomes a data product after it locally, within the domain, goes through the process of quality assurance according to the expected data product quality metrics and the global standardization rules. The domain data product owners are best placed to decide how to measure their domain’s data quality, knowing the details of the domain operations producing the data in the first place. Despite such localized decision making and autonomy, they need to comply with the modeling of quality and the specification of SLOs based on a global standard, defined by the global federated governance team, and automated by the platform.
The following table shows the contrast between centralized (data lake, data warehouse) model of data governance, and data mesh.
Pre-data-mesh governance aspect | Data mesh governance aspect |
---|---|
Centralized team | Federated team |
Responsible for data quality | Responsible for defining how to model what constitutes quality |
Responsible for data security | Responsible for defining aspects of data security, e.g. data sensitivity levels, for the platform to build in and monitor automatically |
Responsible for complying with regulation | Responsible for defining the regulation requirements for the platform to build in and monitor automatically |
Centralized custodianship of data | Federated custodianship of data by domains |
Responsible for global canonical data modeling | Responsible for modeling data elements that cross the boundaries of multiple domains |
Team is independent from domains | Team is made of domains representatives |
Aiming for a well-defined static structure of data | Aiming to enable effective mesh operation, embracing a continuously changing and dynamic topology of the mesh |
Centralized technology used by monolithic lake/warehouse | Self-serve platform technologies used by each domain |
Measure success based on number or volume of governed data (tables) | Measure success based on the network effect - the connections representing the consumption of data on the mesh |
Manual process with human intervention | Automated processes implemented by the platform |
Prevent error | Detect error and recover through platform’s automated processing |
Let’s bring it all together: we discussed four principles underpinning data mesh:
Domain-oriented decentralized data ownership and architecture | the ecosystem creating and consuming data can scale out as the number of sources of data, number of use cases, and diversity of access models to the data increases; simply increase the autonomous nodes on the mesh. |
Data as a product | data users can easily discover, understand and securely use high quality data with a delightful experience; data that is distributed across many domains. |
Self-serve data infrastructure as a platform | the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products. |
Federated computational governance | data users can get value from aggregation and correlation of independent data products - the mesh is behaving as an ecosystem following global interoperability standards; standards that are baked computationally into the platform. |
These principles drive a logical architectural model that, while bringing analytical data and operational data closer together under the same domain, respects their underlying technical differences. Such differences include where the analytical data might be hosted, different compute technologies for processing operational vs. analytical services, different ways of querying and accessing the data, and so on.
Figure 13: Logical architecture of data mesh approach
I hope by this point we have established a common language and a logical mental model that we can collectively take forward to detail the blueprint of the components of the mesh, such as the data product, the platform, and the required standardizations.
I am grateful to Martin Fowler for helping me refine the narrative and structure of this article, and for hosting it.
Special thanks to many ThoughtWorkers who have been helping create and distill the ideas in this article through client implementations and workshops.
Also thanks to the following early reviewers who provided invaluable feedback: Chris Ford, David Colls and Pramod Sadalage.
03 December 2020: Published
Data is changing. Are you keeping up?
Data and how we use it are constantly evolving in today's fast-paced world. And as we continue to rely on data accessibility to drive growth, it will only become more complicated to manage. Centralized data platforms have long served as the foundation of modern business intelligence and analytics and, in most cases, continue to deliver meaningful business value. But, like all foundations, over time, cracks begin to show. Data solutions are now bursting at the seams as the number and diversity of data sources and use cases become too complicated to manage with a traditional, centralized approach. Moreover, this rapidly increasing demand for business intelligence and analytics is inadvertently creating insight bottlenecks, preventing the delivery of deep, valuable insights. And truth be told, it's a big ask to address the above - it takes a cultural shift in ways of working that truly sets organizations on a path of readiness to innovate and leverage their data at a much faster pace.
So, what are the consequences of not addressing these data issues? The accidental creation of data latency generates delays and a lack of access to the correct information, leading to the use of rogue data repositories and shadow BI solutions. Regulatory requirements surrounding data are becoming increasingly complex, and all who work with data must comply. Dependence on tribal knowledge generates stagnation in innovation and ideas. The list can go on and on. So, what does it take to not only avoid the problems previously mentioned but to thrive and grow in an ever-changing data landscape? How can organizations move forward when the path ahead can appear unclear and confusing? We feel the paramount solution to the change and potential problems in today's data landscape is Data Mesh. Here are some brief examples of the Data Mesh work we've conducted with our clients Gilead and Saxo Bank.
Thoughtworks has been working with Gilead, an American biopharmaceutical company, for over a year in developing the case and planning the implementation for Data mesh. Gilead has a robust experimentation culture, established people practices and innovative technology thought leadership, but like many enterprises, it faced numerous challenges adopting the Data mesh approach to deliver data-driven value at scale. Thoughtworks is actively assisting Gilead in their approach to Data mesh in building an Enterprise Data and AI Platform leveraging their prior experiences to establish new guiding principles. When reviewing opportunities, Gilead saw value in a new organizational and operational model backed by data, but by using a Data mesh approach, it also allows them the opportunity to engage in a cloud transformation initiative. While their previous experience and realized opportunity for Data mesh allow Gilead to create guiding principles moving forward, such as managing their data as a product and adopting cloud-first architecture, it's only the tip of the iceberg in their journey.
Thoughtworks has also been working with Saxo Bank, a European online investment bank, to democratize data while empowering clients with the information and agility to act with confidence. Because of the bank’s complex ecosystem, the data found within Saxo Bank's platform must be transparent, trustworthy and co-sharable, with the Saxo app being white-listed in every environment. So, Saxo Bank and Thoughtworks partnered to bring Data Mesh to the organization. Thoughtworks created a data workbench for Saxo Bank to make their data assets searchable and discoverable. Much like how one would search for a product on Amazon, one types the name of the data asset they're seeking into a search bar, and results backed by a business catalog of consistent business definitions appear. The data workbench also has product descriptions for each data asset, along with the data asset's number of uses and user feedback, so one knows that the data is trustworthy. At a high level, since deploying its Data Mesh initiative, Saxo Bank has seen a reduced cost of customer acquisition, more efficient operating costs and a stronger defense against compliance and regulatory quagmires.
Our Data Mesh work with Gilead, Saxo Bank and other organizations has taught us much about what it takes to succeed. While not exhaustive, here's a brief overview of some of the lessons we've learned along the way in empowering our clients with Data Mesh:
Mindset, organizational and operational models are the most significant barriers to adopting Data Mesh.
Educating stakeholders and domain teams about Data Mesh is critical to success.
Developing a product mindset needs to start with discovery from the consumer perspective.
Creating foundational data products that can be reused and repurposed for multiple use cases helps solidify Data mesh.
Data products need to be compliant with global and local policies.
Choosing the right implementation partner to adopt Data Mesh within your organization is crucial.
Data Mesh is a powerfully transformative analytical data architecture and operating model. Businesses in all industries stand to gain from a correct Data Mesh implementation. But adopting Data Mesh requires more than just technology change - it takes time, organizational commitment and the right partner to guide you through the process. As a Data Mesh innovator, Thoughtworks is committed to delivering the business outcomes your strategy requires. We also aim to positively impact your organization, working with you to transform your digital capabilities, delivery practices and the mindset of your talent. Finally, as an organization committed to learning, we continually invest in research, harvest learnings and develop thought leadership to share with our clients. As a result, we're constantly helping our clients achieve their goals through Data Mesh strategies and other transformative technologies. We're more than willing to get started with your organization today.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
Simona Genovese
Data Mesh: the newest paradigm shift for a distributed architecture in the data world and its application.
Supervisor: Silvia Anna Chiusano. Politecnico di Torino, Master's degree in Computer Engineering (Ingegneria Informatica), 2021
Academic year | 2021/22 |
---|---|
Publication type | Electronic |
Number of pages | 76 |
Partner company | Agile Lab S.r.l. |
A data mesh is a decentralized data architecture that organizes data by a specific business domain—for example, marketing, sales, customer service and more—to provide more ownership to the producers of a given data set.
The producers’ understanding of the domain data positions them to set data governance policies focused on documentation, quality, and access. This, in turn, enables self-service use across an organization. While this federated approach eliminates many operational bottlenecks associated with centralized, monolithic systems, it doesn't necessarily mean that you can't use traditional storage systems, like data lakes or data warehouses. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories.
It's worth noting that data mesh promotes the adoption of cloud-native and cloud platform technologies to scale and achieve the goals of data management. The concept is commonly compared to microservices to help audiences understand its use within this landscape. As this distributed architecture is particularly helpful in scaling data needs across an organization, a data mesh may not be for all types of businesses; that is, smaller businesses may not reap its benefits, as their enterprise data may not be as complex as that of a larger organization.
Zhamak Dehghani, a director of technology for IT consultancy firm ThoughtWorks, is credited for promoting the concept of data mesh as a solution to the inherent challenges of centralized, monolithic data structures, such as data accessibility and organization. Its adoption was further spurred by the COVID-19 pandemic in an effort to drive cultural change and reduce organizational complexity around data.
A data mesh involves a cultural shift in the way that companies think about their data. Instead of data acting as a by-product of a process, it becomes the product, where data producers act as data product owners. Historically, a centralized infrastructure team would maintain data ownership across domains, but the product thinking focus under a data mesh model shifts this ownership to the producers as they are the subject matter experts. Their understanding of the primary data consumers and how they leverage the domain’s operational and analytical data allows them to design APIs with their best interests in mind. While this domain-driven design also makes data producers responsible for documenting semantic definitions, cataloguing metadata and setting policies for permissions and usage, there is still a centralized data governance team to enforce these standards and procedures around the data. Additionally, while domain teams become responsible for their ETL data pipelines under a data mesh architecture, it doesn't eliminate the need for a centralized data engineering team. However, their responsibility becomes more focused on determining the best data infrastructure solutions for the data products being stored.
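The product-thinking shift described above - producers owning their domain's data and exposing it through a documented interface rather than dumping it into a central store - can be sketched minimally in Python. The class, method names, and schema format below are invented for illustration and are not any specific platform's API.

```python
# Illustrative sketch: a domain team exposing its dataset as a "product" behind
# a small interface, with the producing team documenting and enforcing the schema.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner: str
    schema: dict                               # documented by the producing domain team
    _rows: list = field(default_factory=list)  # internal storage, hidden from consumers

    def publish(self, row: dict) -> None:
        # The producer validates each row against its own documented schema.
        unknown = set(row) - set(self.schema)
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        self._rows.append(row)

    def read(self) -> list[dict]:
        # Consumers access the data only through the product's interface.
        return list(self._rows)

orders = DataProduct("orders", owner="sales-domain",
                     schema={"order_id": "str", "amount": "float"})
orders.publish({"order_id": "o-1", "amount": 42.0})
print(orders.read())
```

The design point is that ownership and validation live with the producing team, while consumers see a stable, documented interface.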
Similar to how a microservices architecture couples lightweight services together to provide functionality to a business- or consumer-facing application, a data mesh uses functional domains as a way to set parameters around the data, enabling it to be treated as a product that can be accessed by users across the organization. In this way, a data mesh allows for more flexible data integration and interoperable functionality, where data from multiple domains can be immediately consumed by users for business analytics, data science experimentation and more.
As previously stated, a data mesh is a distributed data architecture, where data is organized by its domain to make it more accessible to users across an organization. A data lake is a low-cost storage environment, which typically houses petabytes of structured, semi-structured and unstructured data for business analytics, machine learning and other broad applications. A data mesh is an architectural approach to data, which a data lake can be a part of. However, a central data lake is more typically used as a dumping ground for data, as it frequently ingests data that does not yet have a defined purpose. As a result, it can fall victim to becoming a data swamp - i.e., a data lake that lacks the appropriate data quality and data governance practices to provide insightful learnings.
A data fabric is an architectural concept focused on the automation of data integration, data engineering, and governance in a data value chain between data providers and data consumers. A data fabric is based on the notion of “active metadata”, which uses knowledge graph, semantics, and AI/ML technology to discover patterns in various types of metadata (for example, system logs, social data, etc.) and apply this insight to automate and orchestrate the data value chain (for example, enabling a data consumer to find a data product and then have that data product provisioned to them automatically). A data fabric is complementary to a data mesh, as opposed to mutually exclusive. In fact, a data fabric makes a data mesh better, because it can automate key parts of the data mesh, such as creating data products faster, enforcing global governance, and making it easier to orchestrate the combination of multiple data products.
Data democratization: Data mesh architectures facilitate self-service applications from multiple data sources, broadening access to data beyond more technical resources such as data scientists, data engineers, and developers. By making data more discoverable and accessible via this domain-driven design, a data mesh reduces data silos and operational bottlenecks, enabling faster decision-making and freeing up technical users to prioritize tasks that better utilize their skill sets.
Cost efficiencies: This distributed architecture moves away from batch data processing and instead promotes the adoption of cloud data platforms and streaming pipelines to collect data in real time. Cloud storage provides an additional cost advantage by allowing data teams to spin up large clusters as needed, paying only for the storage they use. This means that if you need additional compute power to run a job in a few hours rather than a few days, you can easily do so on a cloud data platform by purchasing additional compute nodes. It also improves visibility into storage costs, enabling better budget and resource allocation for engineering teams.
Less technical debt: A centralized data infrastructure causes more technical debt due to the complexity and required collaboration to maintain the system. As data accumulates within a repository, it also begins to slow down the overall system. By distributing the data pipeline by domain ownership, data teams can better meet the demands of their data consumers and reduce technical strains on the storage system. They can also provide more accessibility to data by providing APIs for them to interface with, reducing the overall volume of individual requests.
Interoperability: Under a data mesh model, data owners agree on how to standardize domain-agnostic data fields upfront, which facilitates interoperability. This way, when a domain team is structuring their respective datasets, they are applying the relevant rules to enable data linkage across domains quickly and easily. Some fields commonly standardized are field type, metadata, schema flags, and more. Consistency across domains enables data consumers to interface with APIs more easily and develop applications to serve their business needs more appropriately.
Security and compliance: Data mesh architectures promote stronger governance practices, as they help enforce data standards for domain-agnostic data and access controls for sensitive data. This helps organizations follow government regulations, like HIPAA restrictions, and the structure of this data ecosystem supports compliance through the enablement of data audits. Log and trace data in a data mesh architecture embeds observability into the system, allowing auditors to understand which users are accessing specific data and the frequency of that access.
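The interoperability benefit above hinges on data owners agreeing upfront on globally standardized, domain-agnostic fields. A minimal sketch, assuming a made-up shared key field named `user_id` (echoing the podcast-listener example earlier in this compilation), of how two domains' data products can be correlated once they agree on that identifier:

```python
# Illustrative sketch: a globally standardized identifier field ("user_id" is an
# assumed convention here) lets data products from two domains be joined directly.
podcast_listens = [
    {"user_id": "u1", "podcast": "Tech Daily"},
    {"user_id": "u2", "podcast": "History Hour"},
]
play_streams = [
    {"user_id": "u1", "track": "Song A"},
    {"user_id": "u3", "track": "Song B"},
]

def correlate(left: list[dict], right: list[dict], key: str = "user_id") -> list[dict]:
    """Join two domain datasets on the globally standardized key field."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

# Users who appear in both domains: 'podcast listeners' who are also 'stream listeners'.
print(correlate(podcast_listens, play_streams))
# → [{'user_id': 'u1', 'podcast': 'Tech Daily', 'track': 'Song A'}]
```

Without that shared field standard, each domain would need bespoke translation logic for every cross-domain join.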
While distributed data mesh architectures are still gaining adoption, they're helping teams attain their goals of scalability for common big data use cases.
IBM supports the implementation of a data mesh with the IBM Data Fabric on Cloud Pak for Data. The IBM Data Fabric is a unified solution that contains all the capabilities needed to create data products and enable the governed and orchestrated access and use of these data products. The IBM Data Fabric enables the implementation of a data mesh on any platform (e.g., on premises data lakes, cloud data warehouses, etc.), allowing true enterprise-level self-service and re-use of data products regardless of where the data is.
Scale AI workloads for all your data, anywhere, with IBM watsonx.data, a fit-for-purpose data store built on an open data lakehouse architecture.
A data mesh has emerged as a possible solution to the challenges of data access plaguing many large organizations. This approach takes data out of stovepipes and puts it directly in the hands of business users, but in a controlled manner that maintains strong governance.
This article is a collaborative effort by Joe Caserta, Jean-Baptiste Dubois, Matthias Roggendorf, Marcus Roth, and Nikhil Srinidhi, representing views from McKinsey Digital.
Done well, a data mesh can speed time to market for data-driven applications and give rise to more powerful and scalable data products. These benefits have strategic implications. But it’s essential to approach the buildout in the right way. Otherwise, well-intentioned programs can collapse under their own weight. A leading life sciences company, for example, was prepared, from a technological standpoint, for the hard work a data mesh would require. But what it was unprepared for—and found far more challenging—was harmonizing data-management practices and building agreement among different business groups on which data products and use cases to centralize. Failing to anticipate these issues forced the project to pause midstream, creating confusion and prompting business users to revert to older and less efficient ways of managing data.
By understanding what domain-based data management is and hewing to a few core precepts, companies can avoid the learning pitfalls others have faced and begin reaping the rewards of a data mesh more quickly.
The term “data mesh” was coined by Zhamak Dehghani in 2019, when she was a principal at Thoughtworks. It caught on as a way of capturing the idea of distributed data access. But interpretations of what that means in practice abound. Is it a new technology, does it make existing data repositories obsolete, or is it a theoretical construct?
McKinsey defines a data mesh as a data-management paradigm that organizes data in domains, treats it as a product, enables self-service access, and supports these activities with federated governance (Exhibit 1). Here is why each of these elements is important.
Domain-based data management allows data to sit anywhere. Business teams own the data and are responsible for its quality, accessibility, and security. Domains are collections of data organized around a particular business purpose, such as marketing, procurement, or a particular customer segment or region. They contain raw data as well as self-contained elements known as data products. These data products bundle data to support different business applications, and they are designed with the internal wiring needed to plug directly into relevant apps or systems. A self-serve data infrastructure underlies the data mesh and acts as a central platform, providing a common place for business users to find and access data, regardless of where it is hosted.
Governance is managed in a federated “hub-and-spoke” way. Under this approach, a small central team sets controls, and a supporting data infrastructure enforces them. Standards defined in code enable data product teams within the business to comply with metadata documentation, data classification, and data quality monitoring.
Together, these elements create a self-organizing mesh in which different groups around the business can come together, define their data requirements, agree on how new data is to be shared, and align on the best ways to employ that data.
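As a rough illustration of the "standards defined in code" idea, the central hub might publish a small rule set that the platform enforces automatically whenever a domain team registers a data product. The required metadata fields and classification levels below are invented for the example, not taken from any real governance framework:

```python
# Illustrative sketch of "data governance as code": a small central rule set,
# enforced automatically against every data product's metadata at registration.
REQUIRED_METADATA = {"owner", "description", "classification"}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential"}

def register_data_product(metadata: dict) -> bool:
    """Reject registration unless the central standards are met."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"missing metadata: {sorted(missing)}")
    if metadata["classification"] not in ALLOWED_CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {metadata['classification']}")
    return True

# A domain team's product is accepted only if it documents itself correctly.
product = {"owner": "marketing-team",
           "description": "Campaign responses by region",
           "classification": "internal"}
print(register_data_product(product))
```

The hub defines the rules once; the enforcement happens in the platform, not in manual review meetings.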
Executed well, a data mesh can deliver powerful advantages.
Most product and solution breakthroughs occur within the business—and few such breakthroughs can occur today without data. Data meshes allow business users to get their hands on critical information more quickly, delivering the following benefits:
Prior to implementing a data mesh, a large mining organization had hundreds of siloed operational databases scattered around the world, and developing analytics use cases took months. After shifting to a data mesh, the company cut time spent on data-engineering activities dramatically and developed use cases seven times faster than before while also increasing data stability and reusability.
Obtaining the full benefits of a data mesh requires careful choreography. While domain-based architectures have attracted growing interest, the technological discussion often predominates, overshadowing other critical elements.
Business users, for instance, may recognize that their current data-management systems are problematic but feel it’s better to stick with what is known than undergo the disruption of assuming direct ownership for data domains and products.
Even those eager to get started may not realize how organizational structures need to adapt to enable a steady flow of data products and use cases. For example, it’s not uncommon for organizations setting up a data mesh to discover that needed documentation is missing, taxonomies are incomplete, or new processes need to be created before data can be used. These issues can delay completion unless businesses make provision for them in their resourcing. For nontechnical professionals particularly, the learning curve can be steep and momentum for domain-based data ownership can sputter unless properly supported.
The following practices can help companies mitigate these learning-curve issues and increase the odds of a successful data mesh implementation.
Put the business in the lead.
Stewardship of the data mesh implementation must come from the business, supported by executive sponsors and backed by a formal change-management team. Data mesh evangelists within the change team can help business departments analyze their data landscape and define the most valuable data products to share with the organization. Some organizations have found it helpful to position the data mesh as part of a strategic initiative such as a digital transformation. That can help set the context and the case for change. There also needs to be a committed data product owner within the business who is willing to take on the challenge of “selling” data internally to other business users and application teams. In addition, there should be a central data-infrastructure team that can implement “data governance as code” in tools that are not yet fully mature.
Organizations sometimes get stuck trying to determine whether a centralized or decentralized approach to data management is best, but the answer is that both methods can be effective (Exhibit 2). Companies with a modern IT landscape and well-established local data repositories might get more value from exposing data through virtualized links (while still registering it in a central data marketplace or catalog). By contrast, those that are in the middle of an enterprise resource planning (ERP) transformation or other large IT change might find it better to first move toward a central data platform and create a single logic on core data products.
There are occasionally cases where fully centralized approaches deliver superior ROI - for example, if the majority of data use cases and data products are used globally. Fully decentralized approaches are rare at present, since they require a level of data-management orchestration that large enterprises may feel is currently out of their reach.
In practice, most organizations begin with a mix of centralized and localized data products that reflect their particular business, technology, capabilities, and go-to-market requirements. How hard to lean on centralized versus decentralized structures is often a matter of degree.
Finance, operations, and marketing, for instance, often require niche sets of data and analytics, so a company might choose to localize these functions’ data management. Cross-spanning data assets required by multiple functions can be managed by a centralized group and shared with the relevant functions accordingly.
The data mesh does not need to be constructed in one fell swoop. Many companies attain positive results by taking serial steps. A biotech company began by providing data from an operational data warehouse through a data mesh to feed into operational reporting of its production performance (monitoring production variables). The data product team worked closely with business users to understand their needs, improve data quality and velocity, and standardize data into a harmonized format. Business users were able to explore and develop new applications more quickly at the proof-of-concept stage and then scale them to full production.
Centralized standards for data quality, data architecture , and data sovereignty must also be established and adopted by all data product owners. Some companies that already have centralized standards in place can adjust them to reflect the needs of a decentralized data organization. Others start by defining standards for a data domain, test them for practical applicability, and improve them as needed. Then, they roll the standards out in waves to the rest of the organization along with training and capability building to ensure the governance is consistently applied across the organization.
Executive and nontechnical business users will all need a basic level of data literacy for data mesh success. Coaching, hackathons, online programs, and analytics academies can all work well. Business teams responsible for managing domains will need more extensive training, which should be ongoing so that users can continually grow their skill sets. Otherwise, companies can end up with a narrow set of data capabilities, enough to get started but not sufficient to create the momentum needed to sustain growth or scale.
In most cases, building a data mesh is a continuum. Leaders make a point of regularly communicating with the organization, in large-scale town halls and intimate team meetings, on what the company is trying to achieve and what the road map looks like in terms of timing and capability building. They use internal communications to share success stories, acknowledge the individuals involved in the effort, and remain open about the inevitable challenges. Regular dialog helps to sustain long-term change efforts, keeping the transition alive in people’s minds and reinforcing its steadily accruing benefits.
A data mesh can help close the insights gap and grease the wheels of innovation, allowing companies to better predict the direction of change and proactively respond to it. But bringing a data mesh from concept to reality requires managing it as a business transformation, not a technological one.
Joe Caserta is a partner in McKinsey’s New York office; Jean-Baptiste Dubois is a senior expert in the Paris office; Matthias Roggendorf is a partner in the Berlin office, where Nikhil Srinidhi is an associate partner; and Marcus Roth is a partner in the Chicago office.
The authors wish to thank Marie Grünbein for her contributions to this article.
Title: Data Mesh: A Systematic Gray Literature Review
Abstract: Data mesh is an emerging domain-driven decentralized data architecture that aims to minimize or avoid operational bottlenecks associated with centralized, monolithic data architectures in enterprises. The topic has piqued practitioners' interest, and there is considerable gray literature on it. At the same time, we observe a lack of academic attempts at defining and building upon the concept. Hence, in this article, we aim to start from the foundations and characterize the data mesh architecture regarding its design principles, architectural components, capabilities, and organizational roles. We systematically collected, analyzed, and synthesized 114 industrial gray literature articles. The review provides insights into practitioners' perspectives on the four key principles of data mesh: data as a product, domain ownership of data, self-serve data platform, and federated computational governance. Moreover, due to the comparability of data mesh and SOA (service-oriented architecture), we mapped the findings from the gray literature into the reference architectures from the SOA academic literature to create the reference architectures for describing three key dimensions of data mesh: organization of capabilities and roles, development, and runtime. Finally, we discuss open research issues in data mesh, partially based on the findings from the gray literature.
Subjects: Software Engineering (cs.SE); Databases (cs.DB)
Data Mesh: Systematic Gray Literature Study, Reference Architecture, and Cloud-based Instantiation
This thesis explores the complex balance between national security and privacy rights within the context of EU-US data transfers by assessing the previous and current legal frameworks, shaped by the Schrems I and II rulings and the European Commission’s Adequacy Decision of 10th July 2023. These judicial decisions, and the consequent bilateral efforts to build a safe data transfer system following the invalidation of previous frameworks, highlight the tension between the EU’s comprehensive General Data Protection Regulation and the US’s fragmented data protection approach. As technology expands in all facets of life, so do national security concerns, leading states to enhance surveillance, which may often result in the compromising of individual privacy. US surveillance practices, in particular, have raised significant concerns, due to the wide powers afforded to US authorities to access data of non-US citizens. In light of these concerns, the CJEU invalidated the Safe Harbor and Privacy Shield Frameworks, due to a non-equivalent and thus inadequate standard of data protection in the US for EU citizens’ data, citing in particular the extended scope of US surveillance practices. The new EU-US Data Protection Framework, which was adopted with the July 2023 Commission decision, aims to address these issues, but has already faced stark criticism for its limitations and potential inadequacies. These include the DPF’s reliance on self-certification, an insufficient oversight by US authorities, the limitations of Executive Order 14086 and a refusal to actuate a reform of Section 702 of FISA, among other concerns. Internal EU challenges, such as divergent interpretations by EU Member States of national security and legitimate surveillance practices, complicate efforts to harmonize standards and maintain credibility in dictating data protection norms outside the EU. 
Balancing national security and privacy rights is an ongoing challenge, influenced by political, legal, and geopolitical factors. This thesis calls for innovative legal and policy responses to balance these imperatives and protect individual rights in an interconnected world.
This thesis studies the concept called data mesh, which was first introduced by Zhamak Dehghani in 2019. Data mesh is still fairly new, yet a very hot topic among data professionals in the data engineering field, and there is a reason for this. According to a survey from NewVantage Partners, data mesh was identified as the fifth most ...
... becoming a bottleneck, and thus the data mesh paradigm emerged. This thesis explores the Data Mesh concept, a new way of structuring enterprise data architecture by decentralizing data capabilities and positioning them at the domain level, with an overarching governance framework supported by a self-serve data platform.
Fig. 1. Conceptual overview of a data mesh based on the four key principles: 1) domain-oriented decentralized data ownership, 2) data as a product, 3) self-serve data platform, and 4) federated computational governance. The figure shows different levels of granularity (high on the left and low on the right).
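The four principles can be made concrete with a small sketch. Everything below (the `DataProduct` class, the policy lambdas, the `SelfServePlatform` catalog) is illustrative and hypothetical, not taken from any specific data mesh implementation; it only shows how the principles relate.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the four data mesh principles (all names hypothetical):
# 1) domain ownership, 2) data as a product, 3) self-serve platform,
# 4) federated computational governance.

@dataclass
class DataProduct:
    name: str
    owning_domain: str           # principle 1: a domain team owns the product
    schema: dict                 # principle 2: a documented, discoverable contract
    tags: set = field(default_factory=set)

GLOBAL_POLICIES = [
    # principle 4: federated computational governance — global rules applied
    # automatically to every domain's products at registration time.
    lambda p: "pii" not in p.tags or "encrypted" in p.tags,
]

class SelfServePlatform:
    """Principle 3: domain teams register products without a central team."""
    def __init__(self):
        self.catalog = {}

    def register(self, product: DataProduct) -> bool:
        if all(policy(product) for policy in GLOBAL_POLICIES):
            self.catalog[(product.owning_domain, product.name)] = product
            return True
        return False

platform = SelfServePlatform()
ok = platform.register(DataProduct("orders", "sales", {"order_id": "int"}))
rejected = platform.register(
    DataProduct("customers", "marketing", {"email": "str"}, tags={"pii"})
)
print(ok, rejected)  # True False: the unencrypted PII product is rejected
```

The point of the sketch is the division of labor: domains own and publish, the platform automates registration, and governance is a shared set of computable rules rather than a central gatekeeper.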
A data mesh decentralizes the ownership, transformation, and serving of data. It is proposed as an alternative to centralized architectures, whose growth is limited by their dependencies and complexity. Data Mesh: centralized vs. decentralized architecture.
A data mesh is emerging as a novel decentralized approach for managing data at scale by applying domain-oriented, self-serve design and product thinking [13]. Zhamak Dehghani first defined the term data mesh in 2019 [44]. Figure 1 shows the search index for "Data Mesh" on Google Trends for the past five years. A clear increasing trend line can be ...
The Data Mesh allows for the provision of complex management, access, and support components through the connectivity layer it implements - data from different locations will now be connected in the Mesh [3]. Recently, Zhamak Dehghani began taking the first steps in consolidating what might be the core principles and logical architecture of a ...
The case study uncovers tangible benefits of how data mesh can contribute to data governance. The conclusion unveils that the four principles of data mesh (domain ownership, product thinking, self-serve data platforms, and federated computational governance) can provide a robust foundation for data governance, if not a method to operationalize ...
... before implementing a data mesh. This paper is based on a bachelor thesis. Keywords: data mesh, data architectures, data integration, expert interviews. 1 Introduction. In a world where the importance of data and its analysis continuously grows, choosing a fitting data architecture is crucial to the success of a company's data strategy. While many ...
A data lake is a good solution for storing vast amounts of data in a centralized location. And data mesh is the best solution for fast data retrieval, integration, and analytics. In a nutshell, think of a data mesh as connective tissue between data lakes and/or other sources. Introduced by Thoughtworks, data mesh is "a shift in modern distributed ...
The great divide of data. Core principles and logical architecture of data mesh. Domain ownership. Logical architecture: domain-oriented data and compute. Data as a product. Logical architecture: data product as the architectural quantum. Self-serve data platform. Logical architecture: a multi-plane data platform.
In a data mesh, data ownership and responsibilities are distributed among domain-specific teams or data product teams, granting them autonomy in managing their data within their respective domains. This decentralized approach aims to address the limitations associated with centralized data models, such as scalability challenges, data silos, and ...
Data mesh is a powerfully transformative analytical data architecture and operating model. Businesses in all industries stand to gain with correct Data mesh implementation. But adopting Data Mesh requires more than just technology change — it takes some time, organizational commitment and the right partner to guide you through the process.
The goal of this project is to expand on the notion of the Data Mesh and to propose a first real implementation of it. In particular, the case study focuses on the backend implementation of the various provisioning services of the data product and its resources, using the Scala programming language and the cloud environments Amazon Web Service (AWS) and Cloudera Data Platform (CDP).
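The thesis's actual Scala/AWS/CDP implementation is not reproduced here; as a rough illustration of what a data product provisioning service does, here is a minimal Python sketch. All names in it (`ProvisionRequest`, `provision`, the resource kinds) are hypothetical stand-ins, not the thesis's real API.

```python
# Hypothetical sketch of a data product provisioning service: validate a
# request, then hand back identifiers for the resources that were created.
from dataclasses import dataclass

@dataclass
class ProvisionRequest:
    domain: str
    product: str
    resources: list  # e.g. ["storage", "output_port"]

SUPPORTED = {"storage", "output_port", "compute"}

def provision(req: ProvisionRequest) -> dict:
    """Validate the request and return an identifier per created resource."""
    unknown = [r for r in req.resources if r not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported resources: {unknown}")
    # A real service would call the cloud provider's APIs here (create a
    # bucket, register an endpoint, ...); this sketch just mints names.
    return {r: f"{req.domain}/{req.product}/{r}" for r in req.resources}

result = provision(ProvisionRequest("sales", "orders", ["storage", "output_port"]))
```

The design point this illustrates is that provisioning is a platform capability invoked by domain teams on demand, rather than a ticket handled by a central infrastructure team.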
Underlying the data mesh architecture is a layer of universal interoperability, reflecting domain-agnostic standards, as well as observability and governance. Image courtesy of Monte Carlo. Zhamak defines the data mesh as "a socio-technical shift — a new approach in how we collect, manage, and share analytical data."
With the increasing importance of data and artificial intelligence, organizations strive to become more data-driven. However, current data architectures are not necessarily designed to keep up with the scale and scope of data and analytics use cases. In fact, existing architectures often fail to deliver the promised value associated with them. Data mesh is a socio-technical, decentralized ...
A data mesh is a decentralized data architecture that organizes data by a specific business domain—for example, marketing, sales, customer service and more—to provide more ownership to the producers of a given data set. The producers' understanding of the domain data positions them to set data governance policies focused on documentation ...
McKinsey defines a data mesh as a data-management paradigm that organizes data in domains, treats it as a product, enables self-service access, and supports these activities with federated governance (Exhibit 1). Here is why each of these elements is important. 1. Domain-based data management allows data to sit anywhere.
A data mesh is an architectural framework that solves advanced data security challenges through distributed, decentralized ownership. Organizations have multiple data sources from different lines of business that must be integrated for analytics. A data mesh architecture effectively unites the disparate data sources and links them together ...
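How disparate domain sources get "linked together" can be sketched in a few lines. The two output-port functions below are hypothetical stand-ins for the standardized interfaces each domain would expose; the consumer joins them without any central team in the loop.

```python
# Illustrative only: two domain-owned "output ports" modeled as plain
# functions, and a cross-domain consumer joining their data.
def sales_orders():          # hypothetical output port of the sales domain
    return [{"order_id": 1, "customer_id": 10, "total": 99.0}]

def marketing_customers():   # hypothetical output port of the marketing domain
    return [{"customer_id": 10, "segment": "premium"}]

def orders_by_segment():
    """Join orders to customer segments across domain boundaries."""
    segments = {c["customer_id"]: c["segment"] for c in marketing_customers()}
    return [
        {**order, "segment": segments.get(order["customer_id"], "unknown")}
        for order in sales_orders()
    ]

report = orders_by_segment()
```

In a real mesh the output ports would be discoverable, versioned interfaces (tables, APIs, streams) with contracts, but the consumption pattern is the same: compose products, don't copy raw data into a central store.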
The triangle mesh data structures proposed in this thesis support the standard set of mesh connectivity operators introduced by the previously proposed Corner Table at an amortized constant time complexity. They can be constructed in linear time and space from the Corner Table or any equivalent representation. If geometry is stored as 16-bit ...
Data Mesh: Systematic Gray Literature Study, Reference Architecture, and Cloud-based Instantiation (GitHub: abel177/Data-Mesh-Master-Thesis).