
Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation

MichalFracek

  • April 7, 2019 at 12:21 am

Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim “garbage in – garbage out,” coined decades ago, continues to apply today. Recent examples of data verification shortcomings abound, including JP Morgan/Chase’s 2013 fiasco and this lovely list of Excel snafus. Brilliant people make data collection and entry errors all of the time, and that isn’t just our opinion (although we have plenty of personal experience with it); Kaggle did a survey of data scientists and found that “dirty data” is the number one barrier for data scientists.

Before we create a machine learning model, before we create a Shiny R dashboard, we evaluate the dataset for a project. Data validation is a complicated multi-step process, and maybe it’s not as sexy as talking about the latest ML models, but as data science consultants at Appsilon we live and breathe data governance and offer solutions. And it is not only about data format: data can be corrupted at different levels of abstraction. We can distinguish three levels:

  • Data structure and format
  • Qualitative & business logic rules
  • Expert logic rules

Level One: structure and format

For every project, we must verify:

  • Is the data structure consistent? A given dataset should have the same structure all of the time, because the ML model or app expects the same format. Names of columns/fields, number of columns/fields, and field data types (integers? strings?) must remain consistent.
  • Are we working with multiple datasets, or a single merged one?
  • Do we have duplicate entries? Do they make sense in this context or should they be removed?
  • Do we have correct, consistent data types (e.g. integers, floating point numbers, strings) in all entries?
  • Do we have a consistent format for floating point numbers? Are we using a comma or a period as the decimal separator?
  • What is the format of other data types, such as e-mail addresses, dates, zip codes, and country codes, and is it consistent?

It sounds obvious, but there are always problems, and these points must be checked every time. The right questions must be asked.
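A minimal sketch of what such a structure check could look like in R (the data frame name and columns below are only an illustration, not a client rule set):

```r
library(assertr)
library(dplyr)

check_structure <- function(df) {
  expected_cols <- c("order_id", "product_id", "price", "order_date")
  stopifnot(
    identical(names(df), expected_cols),     # same columns, in the same order
    is.numeric(df$price),                    # consistent field types
    inherits(df$order_date, "Date")
  )
  df %>% verify(!duplicated(order_id))       # no duplicate entries sneak through
}

sales <- check_structure(sales)              # stops with an error if any rule fails
```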

Level Two: qualitative and business logic rules

We must check the following every time (a short sketch follows the list):

  • Is the price parameter (if applicable) always non-negative? (We stopped several of our retail customers from recommending the wrong discounts thanks to this rule; they saved significant sums and avoided serious problems. More on that later.)
  • Do we have any unrealistic values?  For data related to humans, is age always a realistic number?
  • For data related to machines, does the status parameter always take a correct value from a defined set, e.g. only “FINISHED” or “RUNNING” for a machine status?
  • Can we have “Not Applicable” (NA), null, or empty values? What do they mean?
  • Do we have several values that mean the same thing? For example, users might enter their residence in different ways — “NEW YORK”, “Nowy Jork”, “NY, NY” or just “NY” for a city parameter. Should we standardize them?
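Here is a rough sketch of how rules like these can be expressed with R and the assertr package; the table and column names are hypothetical:

```r
library(assertr)
library(dplyr)

# Each line encodes one business rule; assertr stops with an error when a rule is violated.
prices   %>% assert(within_bounds(0, Inf), price)            # prices are never negative
people   %>% assert(within_bounds(0, 120), age)              # ages stay realistic
machines %>% assert(in_set("FINISHED", "RUNNING"), status)   # status comes from a defined set
orders   %>% assert(not_na, customer_id)                     # mandatory fields are never empty
```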

Level three: expert rules

Expert rules govern something different from format and values. They check whether the story behind the data makes sense. This requires business knowledge about the data, and it is the data scientist’s responsibility to be curious, to explore, and to challenge the client with the right questions in order to avoid logical problems with the data. The right questions must be asked.

Expert Rules Case Studies 

I’ll illustrate with a couple of true stories.

Story #1: Is this machine teleporting itself?

We were tasked to analyze the history of a company’s machines. The question was: how much time did each machine work at a given location? The database contained entries recording, for each machine ID, the date and the site where it was working.

We see that the format and values are correct. But why did machine #1234 change its location every day? Is that possible? We should ask the client exactly that question. In this case, we found that it was not physically possible for the machine to switch sites so often. After some investigation, we found that the software installed on the machine had a duplicated ID number: in fact there were two machines on different sites reporting the same ID. Once we learned what was physically possible, we set data validation rules for it and ensured that this issue would not happen again.
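Once the business constraint is known, an expert rule like this one can still be automated. A rough dplyr sketch (the `machine_history` table and its columns are hypothetical) that flags machines reporting a different site on consecutive days:

```r
library(dplyr)

suspicious_moves <- machine_history %>%
  arrange(machine_id, date) %>%
  group_by(machine_id) %>%
  mutate(previous_site   = lag(site),
         days_since_last = as.numeric(date - lag(date))) %>%
  ungroup() %>%
  filter(!is.na(previous_site),
         site != previous_site,
         days_since_last <= 1)   # site changed within a day: physically impossible for this client

suspicious_moves                 # anything left goes back to the client as a question
```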

Expert rules can be developed only through close cooperation between data scientists and the business. This is not a part that can be automated by “data cleaning tools,” which are great for hobbyists but not suitable for anything remotely serious.

Story #2: Negative sign could have changed all prices in the store

One of our retail clients was pretty far along in their project journey when we began to work with them. They already had a data scientist on staff and had already developed their own price optimization models. Our role was to take the output from those models and display recommendations in an R Shiny dashboard to be used by their salespeople. We had some assumptions about the format of the data that the application would receive from their models, so we wrote validation rules around what we thought the application should expect when it reads the data.

We reasoned that the price should be

  • non-negative
  • an integer
  • not an empty value or a string
  • within a reasonable range for the given product

As this model was being developed over the course of several weeks, we suddenly observed that prices were coming back too high. The check ran automatically: we did not spot this in production, we caught the problem before the data even landed in the application. After we saw the result in the report, we asked their team why it happened. It turned out that they had a new developer who assumed that discounts could be expressed as a negative number, because why not? He didn’t realize that some applications actually depended on that output and would subtract the value instead of adding it. Thanks to the automatic data validation, we prevented those errors from being loaded into production. We worked with their data scientists to improve the model. It was a very quick fix of course, a no-brainer. But the end result was that they saved real money.
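The four price rules above translate almost line for line into assertr. A sketch with made-up table names and per-product bounds (not the client’s actual rules):

```r
library(assertr)
library(dplyr)
library(tibble)

price_bounds <- tribble(              # illustrative per-product sanity ranges
  ~product_id, ~min_price, ~max_price,
  "A001",              10,        200,
  "B002",               5,         80
)

recommendations %>%
  assert(not_na, price) %>%                            # not an empty value
  assert(within_bounds(0, Inf), price) %>%             # non-negative
  verify(all(price == round(price))) %>%               # an integer
  inner_join(price_bounds, by = "product_id") %>%
  verify(price >= min_price & price <= max_price)      # within a reasonable range per product
```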

Data Validation Report for all stakeholders

Here is a sample data validation report that our workflow produces for all stakeholders in the project:

Data Verification Report

The intent is that the data verification report is readable by all stakeholders, not just data scientists and software engineers. After years of working on data science projects, we observed that multiple people within an organization know the realistic parameters for data values, such as price points. There is usually more than one expert in a community, and people are knowledgeable about different things. New data is often added at a constant rate, and parameters can change. So why not allow multiple people to add and edit rules when verifying data? With our Data Verification workflow, anyone from the team of stakeholders can add or edit a data verification rule.

Our Data Verification workflow works with the assertr package (for the R enthusiasts out there). The workflow runs validation rules automatically, after every update to the data. This is exactly the same process as writing unit tests for software. Like unit testing, our data verification workflow allows you to identify problems and catch them early; and of course, fixing problems at an earlier stage is much more cost effective.

Finally, what do validation rules look like at the code level? We can’t show you code created for clients, so here is an example using data from the City of Warsaw public transportation system (requested from a public API). Let’s say that we want a real-time check on the location and status of all the vehicles in the transit system fleet.

In this example, we want to ensure that the Warsaw buses and trams are operating within the borders of the city, so we check the latitude and longitude.  If a vehicle is outside the city limits, then we certainly want to know about it! We want real-time updates, so we write a rule that “Data is not older than 5 minutes.”  In a real project, we would probably write hundreds of such rules in partnership with the client. Again, we typically run this workflow BEFORE we build a model or a software solution for the client, but as you can see from the examples above, there is even tremendous value in running the Data Validation Workflow late in the production process!  And one of our clients did remark that they saved more money with the Data Validation Workflow than with some of the machine learning models that were previously built for them.
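A simplified sketch of what those two rules might look like with assertr; the `vehicles` data frame, its column names, and the rough bounding box for Warsaw are illustrative, not the production rule set:

```r
library(assertr)
library(dplyr)

vehicles %>%
  assert(within_bounds(52.0, 52.4), latitude) %>%                   # inside Warsaw, north-south
  assert(within_bounds(20.8, 21.3), longitude) %>%                  # inside Warsaw, east-west
  verify(difftime(Sys.time(), last_update, units = "mins") <= 5)    # data is not older than 5 minutes
```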

Sharing our data validation workflow with the community

Data quality must be verified in every project to produce the best results. There are a number of potential errors that seem obvious and simplistic but, in our experience, tend to occur often.

After working on numerous projects with Fortune 500 companies, we came up with a solution to the three-level problem described above. Since multiple people within an organization know the realistic parameters for datasets, such as price points, why not allow multiple people to add and edit rules when verifying data? We recently shared our workflow at a hackathon sponsored by the Ministry of Digitization here in Poland. We took third place in the competition, but more importantly, it reflects one of the core values of our company: sharing our best practices with the data science community.


Pawel and Krystian accept an award at the Ministry of Digital Affairs Hackathon

I hope that you can put these take-aways in your toolbox:

  • Validate your data early and often, covering all assumptions.
  • Engage a data science professional early in the process  
  • Leverage the expertise of your workforce in data governance strategy
  • Data quality issues are extremely common

In the midst of designing new products, manufacturing, marketing, sales planning and execution, and the thousands of other activities that go into operating a successful business, companies sometimes forget about data dependencies and how small errors can have a significant impact on profit margins.  

We unleash your expertise about your organization or business by asking the right questions, then we teach the workflow to check for it constantly. We take your expertise and leverage it repeatedly.


You can find me on Twitter at @pawel_appsilon.

Originally posted on Data Science Blog.


Maintaining Data Quality from Multiple Sources Case Study


There is a wealth of data within the healthcare industry that can be used to drive innovation, direct care, change the way systems function, and create solutions to improve patient outcomes. But with all this information coming in from multiple unique sources that all have their own ways of doing things, ensuring data quality is more important than ever.

The COVID-19 pandemic highlighted the breakthroughs in data sharing and interoperability made over the past few years. However, that does not mean that there aren’t challenges when it comes to data quality.

“As we have seen, many organizations have created so many amazing solutions around data,” said Mujeeb Basit, MD, associate chief medical informatics officer and associate director, Clinical Informatics Center, University of Texas Southwestern Medical Center. “COVID really highlighted the innovations and what you can do with sophisticated data architectures and how that flow of data really helps us understand what's happening in our communities. Data has become even more important.”

Dr. Basit shared some of his organization’s experiences in creating strategies to improve data quality while making the process as seamless as possible for all stakeholders.

The medical center had four groups working together on solution co-development, including quality, clinical operations, information resources and analytics.

“It is the synergy of working together and aligning our goals that really helps us develop singular data pipelines as well as workflows and outcomes that we're all vested in,” Dr. Basit said.

Finding Errors

One of the problems the organization previously faced was that errors would slowly accumulate in their systems because of complicated processes or frequent updates. When an error was found, Dr. Basit noted, it was usually fixed as a single entity, and sometimes a backlog was fixed as well.

“But what happens is, over time, this error rate redevelops. How do we take this knowledge gained in this reported error event and then make that a sustainable solution long term? And this becomes exceedingly hard because that relationship may be across multiple systems,” Dr. Basit said.

He shared an example of how this had happened while adding procedures into their system that become charges, which then get translated into claim files.

“But if that charge isn't appropriately flagged, we actually don't get that,” Dr. Basit said. “This is missing a rate and missing a charge, and therefore, we will not get revenue associated with it. So, we need to make sure that this flag is appropriately set and this code is appropriately captured.”

His team created a workaround for this data quality issue: they would use a user story in their development environment and fix the error, but this was just a band-aid solution to the problem.

“As additional analysts are hired, they may not know this requirement, and errors can reoccur. So how do you solve this globally and sustain that solution over time? And for us, the outcome is significantly lost work, lost reimbursement, as well as denials, and this is just unnecessary work that is creating a downstream problem for us,” Dr. Basit said.

Their solution? Apply analysis at regular intervals to keep error rates low. 

“This is not sustainable by applying people to it, but it is by applying technology to it. We approach it as an early detection problem. No repeat failures, automate it so we don't have to apply additional resources for it, and therefore, it scales very, very well, as well as reduced time to resolution, and it is a trackable solution for us,” Dr. Basit said.

To accomplish this, they utilized a framework for integrated tests (FIT) and built a SQL server solution that intermittently runs to look for new errors. When one is found, a message is sent to an analyst to determine a solution.
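The article does not include the actual FIT queries, but the idea of an intermittent scan that alerts an analyst can be sketched roughly as follows; the connection, table, and column names here are invented for illustration and are not UT Southwestern’s implementation:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "billing_dw")      # illustrative DSN

# Look for recent procedures whose charge flag was never set
unflagged <- dbGetQuery(con, "
  SELECT procedure_id, performed_on
  FROM procedures
  WHERE charge_flag IS NULL
    AND performed_on >= DATEADD(day, -1, GETDATE())
")

if (nrow(unflagged) > 0) {
  # In a real setup this would open a ticket or e-mail an analyst
  message(nrow(unflagged), " procedures are missing a charge flag")
}

dbDisconnect(con)
```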

“We have two types of automated testing. You have reactive where someone identifies the problem and puts in the error for a solution, and we have preventative,” Dr. Basit said.

The outcome of this solution means they are saving time and money—something the leadership within the University of Texas Southwestern Medical Center has taken notice of. They are now requesting FIT tests to ensure errors do not reoccur.

“This has now become a part of their vocabulary as we have a culture of data-driven approaches and quality,” Dr. Basit said.

Applying the Data Efficiently

Another challenge they faced was streamlining different types of information coming in through places like the patient portal and EHR while maintaining data quality.

“You can't guarantee 100% consistency in a real-time capture system. They would require a lot of guardrails in order to do that, and the clinicians will probably get enormously frustrated,” Dr. Basit said. “So we go for reasonable accuracy of the data. And then we leverage our existing technologies to drive this.”

He used an example from his organization about a rheumatology assessment to determine the day-to-day life of someone with the condition. They use a patient questionnaire to create a scoring system, and providers also conduct an assessment.

“Those two data elements get linked together during the visit so that we can then get greater insight on it. From that, we're able to use alerting mechanisms to drive greater responsiveness to the patient,” Dr. Basit said.

Implementing this data quality technology at scale was a challenge, but Dr. Basit and his colleagues utilized the Agile methodology to help.

“We didn't have sufficient staff to complete our backlog. What would happen is somebody would propose a problem, and by the time we finally got to solve it, they'd not be interested anymore, or that faculty member has left, or that problem is no longer an issue, and we have failed our population,” Dr. Basit said. “So for us, success is really how quickly can we get that solution implemented, and how many people will actually use it, and how many patients will it actually benefit. And this is a pretty large goal.”

 The Agile methodology focused on:

  • Consistency
  • Minimizing documentation
  • Incremental work products that can be used as a single entity

They began backlog sprint planning, doing two-week sprints at a time.

“We want to be able to demonstrate that we're able to drive value and correct those problems that we talked about earlier in a very rapid framework. The key to that is really this user story, the lightweight requirement gathering to improve our workflow,” Dr. Basit said.  “So you really want to focus as a somebody, and put yourself in the role of the user who's having this problem.”

An example of this would be a rheumatologist wanting to know if their patient is not on a disease-modifying anti-rheumatic drug (DMARD) so that their patient can receive optimal therapy for their rheumatoid arthritis.

“This is really great for us, and what we do is we take this user story and we digest it. And especially the key part here is everything that comes out for the ‘so that,’ and that really tells us what our success measures are for this project. This should only take an hour or two, but it tells so much information about what we want to do,” Dr. Basit said.

Acceptance criteria they look for include:

  • Independent
  • Estimatable

“And we try to really stick to this, and that has driven us to success in terms of leveraging our data quality and improving our overall workflow as much as possible,” Dr. Basit said.

With the rheumatology project, they were able to reveal that increased compliance to DMARD showed an increase in low acuity disease and a decrease in high acuity.

“That's what we really want to go for. These are small changes but could be quite significant to those people's lives who it impacted,” Dr. Basit said.

In the end, the systems he and his team have created are high-value solutions that clinicians and executives at their medical center use often.

“And over time we have built a culture where data comes first. People always ask, ‘What does the data say?’ Instead of sitting and wasting time on speculating on that solution,” Dr. Basit said.

The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.



Data Analytics Case Study Guide 2024

by Sam McKay, CFA | Data Analytics


Data analytics case studies reveal how businesses harness data for informed decisions and growth.

For aspiring data professionals, mastering the case study process will enhance your skills and increase your career prospects.


So, how do you approach a case study?

Use these steps to process a data analytics case study:

Understand the Problem: Grasp the core problem or question addressed in the case study.

Collect Relevant Data: Gather data from diverse sources, ensuring accuracy and completeness.

Apply Analytical Techniques: Use appropriate methods aligned with the problem statement.

Visualize Insights: Utilize visual aids to showcase patterns and key findings.

Derive Actionable Insights: Focus on deriving meaningful actions from the analysis.

This article will give you detailed steps to navigate a case study effectively and understand how it works in real-world situations.

By the end of the article, you will be better equipped to approach a data analytics case study, strengthening your analytical prowess and practical application skills.

Let’s dive in!


What is a Data Analytics Case Study?

A data analytics case study is a real or hypothetical scenario where analytics techniques are applied to solve a specific problem or explore a particular question.

It’s a practical approach that uses data analytics methods, assisting in deciphering data for meaningful insights. This structured method helps individuals or organizations make sense of data effectively.

Additionally, it’s a way to learn by doing, where there’s no single right or wrong answer in how you analyze the data.

So, what are the components of a case study?

Key Components of a Data Analytics Case Study


A data analytics case study comprises essential elements that structure the analytical journey:

Problem Context: A case study begins with a defined problem or question. It provides the context for the data analysis , setting the stage for exploration and investigation.

Data Collection and Sources: It involves gathering relevant data from various sources , ensuring data accuracy, completeness, and relevance to the problem at hand.

Analysis Techniques: Case studies employ different analytical methods, such as statistical analysis, machine learning algorithms, or visualization tools, to derive meaningful conclusions from the collected data.

Insights and Recommendations: The ultimate goal is to extract actionable insights from the analyzed data, offering recommendations or solutions that address the initial problem or question.

Now that you have a better understanding of what a data analytics case study is, let’s talk about why we need and use them.

Why Case Studies are Integral to Data Analytics


Case studies serve as invaluable tools in the realm of data analytics, offering multifaceted benefits that bolster an analyst’s proficiency and impact:

Real-Life Insights and Skill Enhancement: Examining case studies provides practical, real-life examples that expand knowledge and refine skills. These examples offer insights into diverse scenarios, aiding in a data analyst’s growth and expertise development.

Validation and Refinement of Analyses: Case studies demonstrate the effectiveness of data-driven decisions across industries, providing validation for analytical approaches. They showcase how organizations benefit from data analytics, which also helps in refining your own methodologies.

Showcasing Data Impact on Business Outcomes: These studies show how data analytics directly affects business results, like increasing revenue, reducing costs, or delivering other measurable advantages. Understanding these impacts helps articulate the value of data analytics to stakeholders and decision-makers.

Learning from Successes and Failures: By exploring a case study, analysts glean insights from others’ successes and failures, acquiring new strategies and best practices. This learning experience facilitates professional growth and the adoption of innovative approaches within their own data analytics work.

Including case studies in a data analyst’s toolkit helps gain more knowledge, improve skills, and understand how data analytics affects different industries.

Using these real-life examples boosts confidence and success, guiding analysts to make better and more impactful decisions in their organizations.

But not all case studies are the same.

Let’s talk about the different types.

Types of Data Analytics Case Studies


Data analytics encompasses various approaches tailored to different analytical goals:

Exploratory Case Study: These involve delving into new datasets to uncover hidden patterns and relationships, often without a predefined hypothesis. They aim to gain insights and generate hypotheses for further investigation.

Predictive Case Study: These utilize historical data to forecast future trends, behaviors, or outcomes. By applying predictive models, they help anticipate potential scenarios or developments.

Diagnostic Case Study: This type focuses on understanding the root causes or reasons behind specific events or trends observed in the data. It digs deep into the data to provide explanations for occurrences.

Prescriptive Case Study: This case study goes beyond analytics; it provides actionable recommendations or strategies derived from the analyzed data. They guide decision-making processes by suggesting optimal courses of action based on insights gained.

Each type has a specific role in using data to find important insights, helping in decision-making, and solving problems in various situations.

Regardless of the type of case study you encounter, here are some steps to help you process them.

Roadmap to Handling a Data Analysis Case Study


Embarking on a data analytics case study requires a systematic approach, step-by-step, to derive valuable insights effectively.

Here are the steps to help you through the process:

Step 1: Understanding the Case Study Context: Immerse yourself in the intricacies of the case study. Delve into the industry context, understanding its nuances, challenges, and opportunities.


Identify the central problem or question the study aims to address. Clarify the objectives and expected outcomes, ensuring a clear understanding before diving into data analytics.

Step 2: Data Collection and Validation: Gather data from diverse sources relevant to the case study. Prioritize accuracy, completeness, and reliability during data collection. Conduct thorough validation processes to rectify inconsistencies, ensuring high-quality and trustworthy data for subsequent analysis.


Step 3: Problem Definition and Scope: Define the problem statement precisely. Articulate the objectives and limitations that shape the scope of your analysis. Identify influential variables and constraints, providing a focused framework to guide your exploration.

Step 4: Exploratory Data Analysis (EDA): Leverage exploratory techniques to gain initial insights. Visualize data distributions, patterns, and correlations, fostering a deeper understanding of the dataset. These explorations serve as a foundation for more nuanced analysis.

Step 5: Data Preprocessing and Transformation: Cleanse and preprocess the data to eliminate noise, handle missing values, and ensure consistency. Transform data formats or scales as required, preparing the dataset for further analysis.
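As a small illustration of this step, here is a hedged R sketch of typical preprocessing on a hypothetical `raw_sales` table (deduplication, type fixes, simple imputation, and rescaling); a real project would tailor each choice to the data at hand:

```r
library(dplyr)

clean_sales <- raw_sales %>%
  distinct() %>%                                              # drop exact duplicate rows
  mutate(
    order_date = as.Date(order_date, format = "%Y-%m-%d"),    # consistent date format
    revenue    = as.numeric(revenue),
    revenue    = if_else(is.na(revenue),
                         median(revenue, na.rm = TRUE),       # simple median imputation
                         revenue),
    revenue_z  = as.numeric(scale(revenue))                   # rescale for modeling
  )
```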


Step 6: Data Modeling and Method Selection: Select analytical models aligning with the case study’s problem, employing statistical techniques, machine learning algorithms, or tailored predictive models.

In this phase, it’s important to develop data modeling skills. This helps create visuals of complex systems using organized data, which helps solve business problems more effectively.

Understand key data modeling concepts, utilize essential tools like SQL for database interaction, and practice building models from real-world scenarios.

Furthermore, strengthen data cleaning skills for accurate datasets, and stay updated with industry trends to ensure relevance.


Step 7: Model Evaluation and Refinement: Evaluate the performance of applied models rigorously. Iterate and refine models to enhance accuracy and reliability, ensuring alignment with the objectives and expected outcomes.

Step 8: Deriving Insights and Recommendations: Extract actionable insights from the analyzed data. Develop well-structured recommendations or solutions based on the insights uncovered, addressing the core problem or question effectively.

Step 9: Communicating Results Effectively: Present findings, insights, and recommendations clearly and concisely. Utilize visualizations and storytelling techniques to convey complex information compellingly, ensuring comprehension by stakeholders.


Step 10: Reflection and Iteration: Reflect on the entire analysis process and outcomes. Identify potential improvements and lessons learned. Embrace an iterative approach, refining methodologies for continuous enhancement and future analyses.

This step-by-step roadmap provides a structured framework for thorough and effective handling of a data analytics case study.

Now, after handling data analytics comes a crucial step; presenting the case study.

Presenting Your Data Analytics Case Study


Presenting a data analytics case study is a vital part of the process. When presenting your case study, clarity and organization are paramount.

To achieve this, follow these key steps:

Structuring Your Case Study: Start by outlining relevant and accurate main points. Ensure these points align with the problem addressed and the methodologies used in your analysis.

Crafting a Narrative with Data: Start with a brief overview of the issue, then explain your method and steps, covering data collection, cleaning, stats, and advanced modeling.

Visual Representation for Clarity: Utilize various visual aids—tables, graphs, and charts—to illustrate patterns, trends, and insights. Ensure these visuals are easy to comprehend and seamlessly support your narrative.


Highlighting Key Information: Use bullet points to emphasize essential information, maintaining clarity and allowing the audience to grasp key takeaways effortlessly. Bold key terms or phrases to draw attention and reinforce important points.

Addressing Audience Queries: Anticipate and be ready to answer audience questions regarding methods, assumptions, and results. Demonstrating a profound understanding of your analysis instills confidence in your work.

Integrity and Confidence in Delivery: Maintain a neutral tone and avoid exaggerated claims about findings. Present your case study with integrity, clarity, and confidence to ensure the audience appreciates and comprehends the significance of your work.


By organizing your presentation well, telling a clear story through your analysis, and using visuals wisely, you can effectively share your data analytics case study.

This method helps people understand better, stay engaged, and draw valuable conclusions from your work.

We hope by now, you are feeling very confident processing a case study. But with any process, there are challenges you may encounter.


Key Challenges in Data Analytics Case Studies


A data analytics case study can present various hurdles that necessitate strategic approaches for successful navigation:

Challenge 1: Data Quality and Consistency

Challenge: Inconsistent or poor-quality data can impede analysis, leading to erroneous insights and flawed conclusions.

Solution: Implement rigorous data validation processes, ensuring accuracy, completeness, and reliability. Employ data cleansing techniques to rectify inconsistencies and enhance overall data quality.

Challenge 2: Complexity and Scale of Data

Challenge: Managing vast volumes of data with diverse formats and complexities poses analytical challenges.

Solution: Utilize scalable data processing frameworks and tools capable of handling diverse data types. Implement efficient data storage and retrieval systems to manage large-scale datasets effectively.

Challenge 3: Interpretation and Contextual Understanding

Challenge: Interpreting data without contextual understanding or domain expertise can lead to misinterpretations.

Solution: Collaborate with domain experts to contextualize data and derive relevant insights. Invest in understanding the nuances of the industry or domain under analysis to ensure accurate interpretations.


Challenge 4: Privacy and Ethical Concerns

Challenge: Balancing data access for analysis while respecting privacy and ethical boundaries poses a challenge.

Solution: Implement robust data governance frameworks that prioritize data privacy and ethical considerations. Ensure compliance with regulatory standards and ethical guidelines throughout the analysis process.

Challenge 5: Resource Limitations and Time Constraints

Challenge: Limited resources and time constraints hinder comprehensive analysis and exhaustive data exploration.

Solution: Prioritize key objectives and allocate resources efficiently. Employ agile methodologies to iteratively analyze and derive insights, focusing on the most impactful aspects within the given timeframe.

Recognizing these challenges is key; it helps data analysts adopt proactive strategies to mitigate obstacles. This enhances the effectiveness and reliability of insights derived from a data analytics case study.

Now, let’s talk about the best software tools you should use when working with case studies.

Top 5 Software Tools for Case Studies


In the realm of case studies within data analytics, leveraging the right software tools is essential.

Here are some top-notch options:

Tableau: Renowned for its data visualization prowess, Tableau transforms raw data into interactive, visually compelling representations, ideal for presenting insights within a case study.

Python and R Libraries: These flexible programming languages provide many tools for handling data, doing statistics, and working with machine learning, meeting various needs in case studies.

Microsoft Excel: A staple tool for data analytics, Excel provides a user-friendly interface for basic analytics, making it useful for initial data exploration in a case study.

SQL Databases: Structured Query Language (SQL) databases assist in managing and querying large datasets, essential for organizing case study data effectively.

Statistical Software (e.g., SPSS, SAS): Specialized statistical software enables in-depth statistical analysis, aiding in deriving precise insights from case study data.

Choosing the best mix of these tools, tailored to each case study’s needs, greatly boosts analytical abilities and results in data analytics.

Final Thoughts

Case studies in data analytics are helpful guides. They give real-world insights, improve skills, and show how data-driven decisions work.

Using case studies helps analysts learn, be creative, and make essential decisions confidently in their data work.


Frequently Asked Questions

What are the key steps to analyzing a data analytics case study?

When analyzing a case study, you should follow these steps:

Clarify the problem: Ensure you thoroughly understand the problem statement and the scope of the analysis.

Make assumptions: Define your assumptions to establish a feasible framework for analyzing the case.

Gather context: Acquire relevant information and context to support your analysis.

Analyze the data: Perform calculations, create visualizations, and conduct statistical analysis on the data.

Provide insights: Draw conclusions and develop actionable insights based on your analysis.

How can you effectively interpret results during a data scientist case study job interview?

During your next data science interview, interpret case study results succinctly and clearly. Utilize visual aids and numerical data to bolster your explanations, ensuring comprehension.

Frame the results in an audience-friendly manner, emphasizing relevance. Concentrate on deriving insights and actionable steps from the outcomes.

How do you showcase your data analyst skills in a project?

To demonstrate your skills effectively, consider these essential steps. Begin by selecting a problem that allows you to exhibit your capacity to handle real-world challenges through analysis.

Methodically document each phase, encompassing data cleaning, visualization, statistical analysis, and the interpretation of findings.

Utilize descriptive analysis techniques and effectively communicate your insights using clear visual aids and straightforward language. Ensure your project code is well-structured, with detailed comments and documentation, showcasing your proficiency in handling data in an organized manner.

Lastly, emphasize your expertise in SQL queries, programming languages, and various analytics tools throughout the project. These steps collectively highlight your competence and proficiency as a skilled data analyst, demonstrating your capabilities within the project.

Can you provide an example of a successful data analytics project using key metrics?

A prime illustration is utilizing analytics in healthcare to forecast hospital readmissions. Analysts leverage electronic health records, patient demographics, and clinical data to identify high-risk individuals.

Implementing preventive measures based on these key metrics helps curtail readmission rates, enhancing patient outcomes and cutting healthcare expenses.

This demonstrates how data analytics, driven by metrics, effectively tackles real-world challenges, yielding impactful solutions.

Why would a company invest in data analytics?

Companies invest in data analytics to gain valuable insights, enabling informed decision-making and strategic planning. This investment helps optimize operations, understand customer behavior, and stay competitive in their industry.

Ultimately, leveraging data analytics empowers companies to make smarter, data-driven choices, leading to enhanced efficiency, innovation, and growth.



What Is Data Quality? Dimensions, Standards, & Examples

In this guide, we will discuss what data quality is, its dimensions, standards, and real-life examples, and see how you can benefit from it.


A strong data quality foundation is not a luxury; it is a necessity – one that empowers businesses, governments, and individuals to make informed choices that propel them forward. Can you imagine a world where decisions are made on a shaky foundation? Where information is riddled with inaccuracies, incomplete fragments, and discrepancies? 

This is where data quality emerges as the champion of order, clarity, and excellence. However, without a clear understanding of what constitutes high-quality data, you will struggle to identify and address the underlying issues that compromise data integrity.  

To help you navigate the complexities of data management and drive better outcomes, we will look into the different dimensions of data quality, commonly used data standards in the industry, and real-life case studies. By the end of this 8-minute guide, you'll know everything there is to know about data quality and how it can help your business.

What Is Data Quality?


Data quality refers to the overall accuracy, completeness, consistency, reliability, and relevance of data in a given context. It is a measure of how well data meets the requirements and expectations for its intended use.

So why do we care about data quality? 

Here are a few reasons:

  • Optimizing Business Processes: Reliable data helps pinpoint operational inefficiencies.
  • Better Decision-Making: Quality data underpins effective decision-making and gives a significant competitive advantage.
  • Boosting Customer Satisfaction: It provides a deeper understanding of target customers and helps tailor your products and services to better meet their needs and preferences.

Enhancing Data Quality: Exploring 8 Key Dimensions For Reliable & Valuable Data


The success of any organization hinges on its ability to harness reliable, accurate, and valuable data. To unlock the full potential of data, it is important to understand the key dimensions that define its quality. Let's discuss this in more detail:

Accuracy

Accuracy is the degree to which your data mirrors real-world situations and aligns with trustworthy references. The fine line between facts and fallacy is crucial in every data analysis. If your data is accurate, the real-world entities it represents can function as anticipated.

For instance, correct employee phone numbers ensure seamless communication. On the flip side, inaccurate data like incorrect date of joining could deprive them of certain privileges. 

The crux of ensuring data accuracy lies in  verification, using credible sources, and direct testing. Industries with stringent regulations, like healthcare and finance, particularly depend on high data accuracy for reliable operations and outcomes.

Different modes can be adopted to verify and compare the data, which are:

  • Reference Source Comparison:  This method checks the accuracy by comparing actual data with a reference or standard values from a reliable source.
  • Physical Verification:  This involves matching data with physical objects or observations, like comparing the items listed on your grocery bill to what's actually in your cart.

Completeness

Data completeness is a measure of how much essential data you have. It's all about  assessing whether all necessary values in your data are present and accounted for.  

Think of it this way: in customer data, completeness is whether you have enough details for effective engagement. An example is a customer address – even if an optional landmark attribute is missing, the data can still be considered complete. 

Similarly, for products or services, completeness indicates important features that help customers make informed decisions. If a product description lacks delivery estimates, it's not complete. Completeness gauges if the data provides enough insight to make valuable conclusions and decisions.

Consistency

Data consistency checks if the same data stored at different places or used in different instances align perfectly.  It's basically the synchronization between multiple data records. While it might be a bit tricky to gauge, it's a vital sign of high-quality data. 

Think about 2 systems using patients’ phone numbers; even though the formatting differs,  if the core information remains the same, you have consistent data. But if the fundamental data itself varies, say a patient's date of birth differs across records, you'll need another source to verify the inconsistent data.
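Consistency checks like the phone-number example can often be automated by normalizing formats before comparing. A small hypothetical sketch in R (system and column names are invented for illustration):

```r
library(dplyr)
library(stringr)

normalize_phone <- function(x) str_replace_all(x, "[^0-9]", "")   # keep digits only

# Records for the same patients held in two different systems
inconsistent <- inner_join(system_a, system_b, by = "patient_id",
                           suffix = c("_a", "_b")) %>%
  filter(normalize_phone(phone_a) != normalize_phone(phone_b))
# Rows that remain differ in the underlying number, not just the formatting
```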

Uniqueness

The uniqueness of data is the absence of duplication or overlaps within or across data sets. It's the assurance that each instance recorded in a data set is unique. A high uniqueness score builds trusted data and reliable analysis.

It's a highly important component of data quality that helps in both offensive and defensive strategies for customer engagement. While it may seem like a tedious task, maintaining uniqueness becomes achievable by  actively identifying overlaps and promptly cleaning up duplicated records.


Timeliness

The timeliness of data ensures it is readily available and up-to-date when needed. It is a user expectation; if your data isn't prepared exactly when required, it falls short of meeting the timeliness dimension.

Timeliness is about reducing latency and ensuring that the correct data reaches the right people at the right time. A general rule of thumb here is that the fresher the data, the more likely it is to be accurate.

Validity

Validity is the extent to which data aligns with predefined business rules and falls within acceptable formats and ranges. It's about whether the data is correct and acceptable for its intended use. For instance, a ZIP code is valid if it contains the right number of characters for a specific region.

The implementation of business rules helps evaluate data validity. While invalid data can impede data completeness, setting rules to manage or eliminate this data enhances completeness.
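For example, a ZIP code rule like the one above can be expressed as a simple vectorized check. This sketch uses R and the assertr package; the table and column names are hypothetical:

```r
library(assertr)
library(dplyr)

us_zip <- function(x) grepl("^[0-9]{5}(-[0-9]{4})?$", x)   # 5-digit or ZIP+4 format

addresses %>%
  assert(us_zip, zip_code)   # fails loudly on values that violate the business rule
```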

Currency

In the context of data quality, currency is how up-to-date your data is. Data should accurately reflect the real-world scenario it represents. For instance, if you once had the right information about an IT asset but it was subsequently modified or relocated, the data is no longer current and needs an update.

These updates can be manual or automatic and occur as needed or at scheduled intervals , all based on your organization's requirements.

Integrity

Integrity refers to the preservation of attribute relationships as data navigates through various systems. To ensure accuracy, these connections must stay consistent, forming an unbroken trail of traceable information. If these relationships suffer damage during the data journey, it could result in incomplete or invalid data.

Building upon the foundation of the data quality dimensions, let's discuss data quality standards and see how you can use them for data quality assessments and improve your organization’s data quality.

Understanding 6 Data Quality Standards

Here are 6 prominent data quality standards and frameworks. 

ISO 8000

When it comes to data quality, ISO 8000 is considered the international benchmark. This series of standards is developed by the International Organization for Standardization (ISO) and provides a comprehensive framework for enhancing data quality across various dimensions.

Here's a quick rundown:

  • Applicability : The best thing about ISO 8000 is its universal applicability. It is designed for all organizations, regardless of size or type, and is relevant at every point in the data supply chain.
  • Benefits of Implementation : Adopting ISO 8000 brings significant benefits. It provides a solid foundation for digital transformation, promotes trust through evidence-based  data processing , and enhances data portability and interoperability.
  • Data Quality Dimensions : ISO 8000 emphasizes that data quality isn't an abstract concept. It's about how well data characteristics conform to specific requirements. This means data can be high quality for one purpose but not for another, depending on the requirements.
  • Data Governance & Management : The ISO 8000 series addresses critical aspects of  data governance , data quality management, and maturity assessment. It guides organizations in creating and applying data requirements, monitoring and measuring data quality, and making improvements.

Total Data Quality Management (TDQM)

Total Data Quality Management (TDQM) is a strategic approach to ensuring high-quality data. It's about managing data quality end-to-end, from creation to consumption. Here's what you need to know:

  • Continuous Improvement : TDQM adopts a  continuous improvement mindset. It's not a one-time effort but an ongoing commitment to enhancing data quality.
  • Root Cause Analysis: A key feature of TDQM is its focus on understanding the root causes of data quality issues. It's not just about fixing data quality problems; it's about preventing them from recurring.
  • Organizational Impact : Implementing TDQM can have far-reaching benefits. It boosts operational efficiency, enhances decision-making, improves customer satisfaction, and ultimately drives business success.
  • Comprehensive Coverage : TDQM covers all aspects of data processing, including creation, collection, storage, maintenance, transfer, utilization, and presentation. It's about ensuring quality at every step of the data lifecycle.


Six Sigma

Six Sigma is a well-established methodology for process improvement. It is a powerful tool for enhancing data quality and uses the DMAIC (Define, Measure, Analyze, Improve, Control) model to systematically address data quality issues. The DMAIC phases are:

  • Define : Start by identifying the focus area for the data quality improvement process. Understand your data consumers and their needs, and set clear data quality expectations.
  • Measure : Assess the current state of data quality and existing data management practices. This step helps you understand where you stand and what needs to change.
  • Analyze : Dive deeper into the high-level solutions identified in the Measure phase. Design or redesign processes and applications to address data quality challenges.
  • Improve : Develop and implement data quality improvement initiatives and solutions. This involves building new systems, implementing new processes, or even a one-time data cleanup.
  • Control : Finally, ensure the improvements are sustainable. Monitor data quality continuously using relevant metrics to keep it at the desired level.

Data Management Body of Knowledge (DAMA DMBOK)

The Data Management Body of Knowledge (DAMA DMBOK) framework is an extensive guide that provides a blueprint for effective master data management (MDM). Here’s how it can help your organization elevate its data quality.

  • Emphasis on Data Governance: The framework highlights data governance in ensuring data quality and promotes consistent data handling across the organization.
  • Structured Approach to Data Management: The framework offers a systematic way to manage data, aligning data management initiatives with business strategies to enhance data quality.
  • The Role of Data Quality in Decision-Making: DAMA DMBOK underscores the importance of high-quality data in making informed business decisions and achieving organizational goals.
  • Adaptability of the DAMA DMBOK Framework: DAMA DMBOK is designed to be flexible so you can tailor your principles to your specific needs and implement the most relevant data quality measures.

Federal Information Processing Standards (FIPS)

Federal Information Processing Standards (FIPS) are a set of standards developed by the U.S. federal government for use in computer systems. Here's how they contribute to data quality:

  • Compliance: For U.S. federal agencies and contractors, compliance with FIPS is mandatory. This ensures a minimum level of data quality and security.
  • Global Impact: While FIPS are U.S. standards, their influence extends globally. Many organizations worldwide adopt these standards to enhance their data security and quality.
  • Data Security: FIPS primarily focuses on computer security and interoperability. They define specific requirements for encryption and hashing algorithms, ensuring the integrity and security of data.
  • Standardization: FIPS promotes consistency across different systems and processes. They provide a common language and set of procedures that can significantly enhance data quality.

IMF Data Quality Assessment Framework (DQAF)

The International Monetary Fund (IMF) has developed a robust tool known as the Data Quality Assessment Framework (DQAF). This framework enhances the quality of statistical systems, processes, and products.

It's built on the United Nations Fundamental Principles of Official Statistics and is widely used for assessing best practices, including internationally accepted methodologies. The DQAF is organized around prerequisites and 5 dimensions of data quality:

  • Assurances of Integrity: This ensures objectivity in the collection, processing, and dissemination of statistics.
  • Methodological Soundness:  The statistics follow internationally accepted standards, guidelines, or good practices.
  • Accuracy and Reliability: The source data and statistical techniques are sound and the statistical outputs accurately portray reality.
  • Serviceability: The statistics are consistent, have adequate periodicity and timeliness, and follow a predictable revisions policy.
  • Accessibility: Data and metadata are easily available and assistance to users is adequate.

Now that we are familiar with the data quality standards, let’s look at some real-world examples that will help you implement the best data quality practices within your organization.

3 Real-Life Examples of Data Quality Practices

Let's explore some examples and case studies that highlight the significance of good data quality and its impact in different sectors. 

IKEA Australia

IKEA Australia's loyalty program, IKEA Family, aimed to personalize communication with its members to build loyalty and engagement.

To understand and target its customers better, the company recognized the need for data enrichment, particularly regarding postal addresses. Manually entered addresses caused poor data quality with errors, incomplete data, and formatting issues, leading to a match rate of only 83%.

To address the challenges with data quality, IKEA ensured accurate data entry and improved data enrichment for their loyalty program. The solution  streamlined the sign-up process and reduced errors and keystrokes.

The following are the outcomes:

  • The enriched datasets led to a 7% increase in annual spending by members.
  • The improvement in data quality enabled more targeted communications with customers.
  • IKEA Australia's implementation of the solution resulted in a significant  12% increase in their data quality match rate,  rising from 83% to 95%.
  • The system of validated address data  minimized the risk of incorrect information entering IKEA Australia's Customer Relationship Management (CRM) system.

Hope Media Group (HMG)

Hope Media Group was dealing with a significant data quality issue because of the continuous influx of contacts from various sources, leading to numerous duplicate records. The transition to a centralized platform from multiple disparate solutions further highlighted the need for advanced data quality tools.

HMG implemented a data management strategy that involved scanning for thousands of duplicate records and creating a 'best record' for a single donor view in their CRM. They automated the process using rules to select the 'best record' and populate it with the best data from other duplicates. This saved review time and created a process for automatically reviewing duplicates. Ambiguous records were sent for further analysis and processing.

HMG has successfully identified, cleansed, and merged over 10,000 duplicate records so far, with the process still ongoing. As they expand their CRM to include more data capture sources, the need for their data management strategy is increasing. This approach allowed them to clean their legacy datasets.

Northern Michigan University

Northern Michigan University faced a data quality issue when they introduced a self-service technology for students to manage administrative tasks, including address changes. This caused incorrect address data to be entered into the school's database.

The university implemented a real-time address verification system within the self-service technology. This system verified address information entered over the web and prompted users to provide missing address elements when an incomplete address was detected.

The real-time address verification system has given the university confidence in the usability of the address data students entered. The system validates addresses against official postal files in real time before a student submits them.

If an address is incomplete, the student is prompted to augment it, reducing the resources spent on manually researching undeliverable addresses and ensuring accurate and complete address data for all students.
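
The core idea is simple enough to sketch. The Python fragment below is a hypothetical illustration of the completeness check behind such a form, not the university's actual system, which validates against official postal files; the field names, rules, and example submission are assumptions.

```python
# Hypothetical sketch of the completeness check behind a self-service address form.
# A real system would also validate against official postal reference files.

REQUIRED_ELEMENTS = ("street", "city", "state", "postal_code")

def missing_address_elements(address: dict) -> list:
    """Return the required elements that are empty or absent."""
    return [f for f in REQUIRED_ELEMENTS if not str(address.get(f, "")).strip()]

submission = {"street": "123 Main St", "city": "Marquette", "state": "MI"}

missing = missing_address_elements(submission)
if missing:
    # Prompt the student to augment the address before accepting it.
    print("Please provide: " + ", ".join(missing))
else:
    print("Address accepted.")
```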

  • Estuary Flow: Your Go-To Solution For Superior Data Quality

Estuary Flow is our no-code DataOps platform that offers real-time data ingestion, transformation, and replication functionality. It can greatly enhance the data quality in your organization. Let's take a look at how Flow plays a vital role in improving data quality.

Capturing Clean Data

Estuary Flow starts by capturing data from various sources like databases and applications. By using Change Data Capture (CDC), Estuary makes sure that it picks up only new and changed data, so you don't have to worry about your data getting outdated or duplicated. Built-in schema validation ensures that only clean data makes it through the pipeline.
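
As a rough illustration of what schema-gated ingestion looks like in general (this is not Estuary's internal mechanism or API), the sketch below validates incoming records against a JSON schema and keeps invalid ones out of the pipeline. It assumes the third-party jsonschema package; the schema and records are hypothetical.

```python
# Generic illustration of schema-gated ingestion; schema and records are made up.

from jsonschema import ValidationError, validate

ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}

def ingest(records):
    """Yield only records that pass schema validation; report the rest."""
    for record in records:
        try:
            validate(instance=record, schema=ORDER_SCHEMA)
            yield record
        except ValidationError as err:
            print(f"rejected {record!r}: {err.message}")

clean = list(ingest([{"order_id": 1, "amount": 9.99}, {"order_id": "two", "amount": -1}]))
print(clean)  # only the valid record survives
```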

Smart Transformations

Once the data is in the system, Estuary Flow gives you the tools to shape it the way you want. You can use streaming SQL and JavaScript to transform your data as it streams in.

Thorough Testing

A big part of maintaining data quality is checking for errors. Estuary offers built-in testing that acts like a security guard for your data, helping ensure that your data pipelines are error-free and reliable. If something doesn't look right, Estuary helps you spot it before it becomes a problem, so corrupt data never lands in your destination system.

Keeping Everything Up-to-Date

Estuary Flow makes sure that your data is not just clean but stays clean. It does this by creating low-latency views of your data, which act like live snapshots that are always up to date. This means you can trust your data to be consistent across different systems.

Bringing It All Together

Finally, Estuary helps you combine data from different sources to get a full picture. This ensures that you have all the information you need, in high quality, to make better decisions and provide better services.

Achieving high-quality data is an ongoing journey that requires continuous effort. However, when you acknowledge the importance of data quality and invest in improving it, you set yourself on a transformative path toward remarkable success. 

By prioritizing data quality, you can make better decisions, drive innovation, and thrive in today's dynamic landscape. It's a journey worth taking as the rewards of trustworthy data are substantial and pave the way for significant positive changes within organizations.

Estuary Flow greatly boosts data quality. It offers real-time data transformations and automated schema management. Its flexible controls, versatile compatibility, and secure data sharing make it an ideal choice for top-notch data management. With Estuary, you're not just handling data, you're enhancing its quality and preparing it for real-time analytics.

If you are looking to meet data quality standards in your organization, Estuary Flow is an excellent place to start. Sign up for Flow to start for free and explore its many benefits, or contact our team to discuss your specific needs.

Building a case for data quality: What is it and why is it important

  • Written by Ehsan Elahi
  • May 9, 2022

According to an IDC study, 30-50% of organizations encounter a gap between their data expectations and reality. A deeper look at this statistic shows that:

  • 45% of organizations see a gap in data lineage and content,
  • 43% of organizations see a gap in data completeness and consistency,
  • 41% of organizations see a gap in data timeliness,
  • 31% of organizations see a gap in data discovery, and
  • 30% of organizations see a gap in data accountability and trust.

These data dimensions are commonly termed data quality metrics: measures that help us gauge the fitness of data for its intended purpose, which is also known as data quality.

What is data quality?

The degree to which data satisfies the requirements of its intended purpose.

If an organization cannot use its data for the purpose it is stored and managed for, that data is said to be of poor quality. This definition implies that data quality is subjective: it means something different for every organization, depending on how they intend to use the data. For example, in some cases data accuracy is more important than data completeness, while in other cases the opposite may be true.

Another interesting way of describing data quality is:

The absence of intolerable defects in a dataset.

In other words, data cannot be completely free of defects, and that is fine. It just has to be free of defects that are intolerable for the purpose it is used for across the organization. Usually, data quality is monitored to check that datasets contain the needed information (in terms of attributes and entities) and that the information is as accurate (or defect-free) as possible.
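
One way to make "intolerable" operational is to write the tolerances down and check against them. The snippet below is a minimal, hypothetical sketch using pandas; the column names, checks, and thresholds are placeholders, not a prescribed standard.

```python
# Hypothetical sketch: flag only the defects that exceed the tolerance for their use case.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

tolerances = {                      # maximum acceptable defect rate per check
    "missing_email": 0.05,
    "duplicate_customer_id": 0.00,
}

defect_rates = {
    "missing_email": df["email"].isna().mean(),
    "duplicate_customer_id": df["customer_id"].duplicated().mean(),
}

intolerable = {check: rate for check, rate in defect_rates.items() if rate > tolerances[check]}
print(intolerable)  # {'missing_email': 0.25, 'duplicate_customer_id': 0.25}
```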

How to build a case for data quality?

Having delivered data solutions to Fortune 500 clients for over a decade, we usually find data professionals spending more than 50 hours a week on their job responsibilities. The added hours are a result of duplicate work, unsuccessful results, and lack of data knowledge. On further analysis, we often find data quality to be the main culprit behind most of these data issues. The absence of a centralized data quality engine that consistently validates and fixes data quality problems is costing experienced data professionals more time and effort than necessary.

When something silently eats away at your team productivity and produces unreliable results, it becomes crucial to bring it to the attention of necessary stakeholders so that corrective measures can be taken in time. These measures should also be integrated as part of the business process so that they can be exercised as a habit and not just a one-time act.

In this blog, we will cover three important points:

  • The quickest and easiest way to prove the importance of data quality.
  • A bunch of helpful resources that discuss different aspects of data quality.
  • How data quality benefits the six main pillars of an organization.

Let’s get started.

1. Design data flaw – business risk matrix

To prove the importance of data quality, you need to highlight how data quality problems increase business risks and impact business efficiency. This requires some research and discussion amongst data leaders and professionals, and then they can share the results and outcomes with necessary stakeholders.

We oftentimes encounter minor and major issues in our datasets, but we rarely evaluate them deeply enough to see the kind of business impact they can have. In a recent blog, I talked about designing the data flaw – business risk matrix: a template that helps you relate data flaws to business impact and resulting costs. In a nutshell, this template helps you relate different types of misinformation present in your dataset to business risks.

For example, a misspelled customer name or incorrect contact information can lead to duplicate records in a dataset for the same customer. This, in turn, increases the number of inbound calls, decreases customer satisfaction, as well as impacts audit demand. These mishaps take a toll on a business in terms of increased staff time, reduced orders due to customer dissatisfaction, and increased cash flow volatility, etc.

But if you can get this information on paper where something as small as a misspelled customer name is attributed to something as big as losing customers, it can prove to be the first step in building a case about the importance of data quality.
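
To make this tangible, a data flaw – business risk matrix can start as nothing more than a small structured table. The rows below are a hypothetical Python representation modeled on the example above; the specific flaws, risks, and cost drivers are illustrative placeholders.

```python
# Hypothetical starting point for a data flaw / business risk matrix:
# each row links one concrete flaw to the risk it creates and a cost driver.

flaw_risk_matrix = [
    {
        "data_flaw": "Misspelled customer name",
        "business_risk": "Duplicate records for the same customer",
        "business_impact": "More inbound calls, lower customer satisfaction, audit demand",
        "cost_driver": "Increased staff time, reduced orders",
    },
    {
        "data_flaw": "Incorrect contact information",
        "business_risk": "Undeliverable communications and duplicate records",
        "business_impact": "Decreased customer satisfaction",
        "cost_driver": "Reduced orders, cash flow volatility",
    },
]

for row in flaw_risk_matrix:
    print(f"{row['data_flaw']} -> {row['business_risk']} -> {row['cost_driver']}")
```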

2. Utilize helpful data quality resources

We have a bunch of content on our data quality hub that discusses data quality from different angles and perspectives. You will probably find something there that fulfils your requirements – something that helps you to convince your team or managers about the importance and role of data quality for any data-driven initiative.

A list of such resources is given below:

  • The impact of poor data quality: Risks, challenges, and solutions
  • Data quality measurement: When should you worry?
  • Building a data quality team: Roles and responsibilities to consider
  • 8 best practices to ensure data quality at enterprise-level
  • Data quality dimensions – 10 metrics you should be measuring
  • 5 data quality processes to know before designing a DQM framework
  • Designing a framework for data quality management
  • The definitive buyer’s guide to data quality tools
  • The definitive guide to data matching

3. Present the benefits of data quality across main pillars

In this section, we will see how end-to-end data quality testing and fixing can benefit you across the six main pillars of an organization (business, finance, customer, competition, team, and technology).

a. Business

A business uses its data as a fuel across all departments and functions. Not being able to trust the authenticity and accuracy of your data can be one of the biggest disasters in any data initiative. Although all business areas benefit from good data quality, the core ones include:

i. Decision making

Instead of relying on intuitions and guesses, organizations use business intelligence and analytics results to make concrete decisions. Whether these decisions are being made at an individual or a corporate level, data is utilized throughout the company to find patterns in past information so that accurate inferences can be made for the future. Lack of quality data can definitely skew the results of your analysis, leading this approach to do more harm than good.

Read more at Improving analytics and business intelligence with clean data .

ii. Operations

Various departments such as sales, marketing, and product depend on data for effective operation of business processes. Whether you are putting product information on your website, using prospect lists in marketing campaigns, or using sales data to calculate yearly revenue, data is part of every small and big operation. Hence, good quality data can boost operational efficiency of your business, while ensuring results accuracy and reducing gaps for potential errors.

Read more at Key components that should be part of operational efficiency goals .

iii. Compliance

Data compliance standards (such as GDPR, HIPAA, and CCPA) are compelling businesses to revisit and revise their data management strategies. Under these data compliance standards, companies are obliged to protect the personal data of their customers and ensure that data owners (the customers themselves) have the right to access, change, or erase their data.

Apart from these rights granted to data owners, the standards also hold companies responsible for following the principles of transparency, purpose limitation, data minimization, accuracy, storage limitation, security, and accountability. Timely implementation of such principles becomes way easier with clean and reliable data quality. Hence, quality data can help you conform to integral compliance standards.

Read more at The importance of data cleansing and matching for data compliance .

b. Finances

A company's finances include an abundance of customer, employee, and vendor information, as well as the history of all transactions with these entities. Bank records, invoices, credit cards, bank sheets, and customer information are confidential data that leave no room for error. For this reason, consistent, accurate, and available data help ensure that:

  • Timely payments are made whenever due,
  • Cases of underpay and overpay are avoided,
  • Transactions to incorrect recipients are avoided,
  • The chances of fraud stemming from duplicate entity records are reduced, and so on.

Read more at The impact of data matching on the world of finance .

c. Customer

In this era, customers seek personalization. The only way to convince them to buy from you and not a competitor is to offer them an experience that is special to them. Make them feel they are seen, heard, and understood. To achieve this, businesses use a ton of customer-generated data to understand their behavior and preferences. If this data has serious defects, you will obviously end up inferring wrong details about your customers or potential buyers. This can lead to reduced customer satisfaction and brand loyalty.

On the other hand, having quality data increases the probability of discovering relevant buyers or leads – people who are interested in doing business with you – while allowing poor-quality data into your datasets adds noise and can make you lose sight of potential leads in the market.

Read more at Your complete guide to obtaining a 360 customer view.

d. Competition

Good data quality can help you to identify potential opportunities in the market for cross-selling and upselling. Similarly, accurate market data and understanding can help you effectively strategize your brand and product according to market needs.

If your competition leverages quality data to infer trends about market growth and consumer behavior, they will leave you behind and convert potential customers more quickly. On the other hand, if wrong or incorrect data is used for such analysis, your business can be misled into making inaccurate decisions – costing you a lot of time, money, and resources.

Read more at How you can leverage your data as a competitive advantage?

e. Team

Managing data and its quality is the core responsibility of the data team, but almost everyone reaps the benefits of clean and accurate data. With good quality data, your team doesn't have to spend time correcting data quality issues every time before they can use it in their routine tasks. Since people do not waste time on rework due to errors and gaps in datasets, the team's productivity and efficiency improve, and they can focus their efforts on the task at hand.

Read more at Building a data quality team: Roles and responsibilities to consider .

f. Technology

Data quality can be a deal-breaker while digitizing any aspect of your organization through technology. It is quite easy to digitize a process when the data involved is structured, organized, and meaningful. On the other hand, bad or poor data quality can be the biggest roadblock in process automation and technology adoption in most companies.

Whether you are employing a new CRM, business intelligence, or automating marketing campaigns, you won’t get the expected results if the data contains errors and is not standardized. To get the most out of your web applications or designed databases, the content of the data must conform to acceptable data quality standards.

Read more at The definitive buyer’s guide to data quality tools .

And there you have it: a whole lot of information that can help you build a case for data quality in front of stakeholders or line managers. This piece presents the benefits of data quality a bit differently. Instead of highlighting six or ten areas that can be improved with quality data, I wanted to draw attention to a more crucial point: data quality impacts the main pillars of your business across many different dimensions.

Business leaders need to realize that having and using data is not even half the game. The ability to trust and rely on that data to produce consistent and accurate results is the main concern now. For this reason, companies often adopt stand-alone data quality tools to clean and standardize their datasets so that the data can be trusted and used whenever and wherever needed.

To Improve Data Quality, Start at the Source

  • Thomas C. Redman

You’ll thank yourself later.

You can’t do anything important in your company without high-quality data. But most organizations focus their data-quality efforts on cleaning up errors, rather than finding and fixing the root cause of the errors in the first place. To become a more data-driven organization, managers and teams must adopt a new mentality — one that focuses on creating data correctly the first time to ensure quality throughout the process.

Part of this process requires identifying two new roles in data quality: the data customer and the data creator. The customer is the person using the data, and the creator is the person who creates, or first inputs, the needed data. People must recognize themselves as customers, clarify their needs, and communicate those needs to creators. People must also recognize themselves as creators, and make improvements to their processes, so they provide data in accordance with their customers’ needs. Once customers and creators have an open dialog, they can work together to make improvements, stopping bad data at its source.

You can't do anything important in your company without high-quality data, and most people suspect, deep down, that their data is not up to snuff. They do their best to clean up their data, install software to find errors automatically, and seek confirmation from external sources — efforts I call "the hidden data factory." It is time-consuming, expensive work, and most of the time, it doesn't go well.

  • Thomas C. Redman, "the Data Doc," is President of Data Quality Solutions. He helps companies and people chart their courses to data-driven futures, with special emphasis on quality, analytics, and organizational capabilities. His latest book, People and Data: Uniting to Transform Your Organization (Kogan Page), was published in Summer 2023.

Review Article: Improving Data Quality in Clinical Research Informatics Tools

  • Information Science Department, University of Arkansas at Little Rock, Little Rock, AR, United States

Maintaining data quality is a fundamental requirement for any successful, long-term data management effort. Providing high-quality, reliable, and statistically sound data is a primary goal of clinical research informatics. In addition, effective data governance and management are essential to ensuring accurate data counts, reports, and validation. As a crucial step of the clinical research process, it is important to establish and maintain organization-wide standards for data quality management to ensure consistency across all systems designed primarily for cohort identification, which allow users to perform an enterprise-wide search on a clinical research data repository to determine the existence of a set of patients meeting certain inclusion or exclusion criteria. Some of these clinical research tools are referred to as de-identified data tools. Assessing and improving the quality of data used by clinical research informatics tools are both important and difficult tasks. For the increasing number of users who rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment to preserve the value of the data. In clinical research informatics, better data quality translates into better research results and better patient care. However, achieving high data quality standards is a major task because of the variety of ways errors might be introduced into a system and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first is inconsistency among data resources, such as format, syntax, and semantic inconsistencies. The second is poor ETL and data mapping processes. In this paper, we describe a real-life case study on assessing and improving data quality at a healthcare organization. The paper compares the results obtained from two de-identified data systems, i2b2 and Epic Slicerdicer, discusses the data quality dimensions specific to the clinical research informatics context, and examines the possible data quality issues between the de-identified systems. The paper proposes steps and rules for maintaining data quality across different systems to help data managers, information systems teams, and informaticists at any healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.

Introduction

Data is the building block in all research, as results are only as good as the data upon which the conclusions were formed. However, researchers may receive minimal training on how to use the de-identified data systems and methods for achieving, assessing, or controlling the quality of research data ( Nahm, 2012 ; Zozus et al., 2019 ).

De-identified data systems are defined as systems/tools that allow users to drag and drop search terms from a hierarchical ontology into a Venn diagram-like interface. Investigators can perform an initial analysis on the de-identified cohort. Furthermore, de-identified data systems have no features to indicate the data quality or assist in identifying the data quality; these systems only provide counts.

Informatics is the science of how to use data, information, and knowledge to improve human health and the delivery of healthcare services ( American Medical Informatics Association, 2022 ).

Clinical Informatics is the application of informatics and information technology to deliver healthcare services. For example, patient portals, electronic medical records (EMRs), telehealth, healthcare apps, and a variety of data reporting tools ( American Medical Informatics Association, 2022 ).

The case presented in this paper focuses on the quality of data obtained from two de-identified systems (Epic Slicerdicer and i2b2). The purpose of this paper is to discuss the quality of the data (counts) generated from the two systems, understand the potential causes of the data quality issues, and propose steps to improve the quality and increase trust in the generated counts by comparing the accuracy, consistency, validity, and understandability of the outcomes from the two systems.

The proposed steps for maintaining data quality among different systems aim to help data managers, information systems teams, and informaticists at a healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes. The proposed quality improvement steps are generic and contribute essential building blocks for automating data curation and data governance to tackle various data quality problems.

The remainder of this paper is organized as follows. The next sections introduce the importance of data quality to clinical research informatics, the case study goals, and the study methodology and materials. The findings, the discussion, and the proposed steps to ensure data quality follow in the Discussion section. Conclusions and the work's contribution are presented in the Conclusion section.

Importance of Data Quality to Clinical Research Informatics

Data quality refers to the degree data meets the expectations of data consumers and their intended use of the data ( Pipino et al., 2002 ; Halimeh, 2011 ; AbuHalimeh and Tudoreanu, 2014 ). In clinical informatics, this depends on the study conducted ( Nahm, 2012 ; Zozus et al., 2019 ).

The meaning of data quality lies in how the data is perceived and used by its consumer. Identifying data quality involves two stages: first, highlighting which characteristics (Dimensions) are important ( Figure 1 ) and second, determining how these dimensions affect the population in question ( Halimeh, 2011 ; AbuHalimeh and Tudoreanu, 2014 ).

Figure 1. De-identified data quality dimensions (DDQD).

This paper focuses on a subset of data quality dimensions, which we term de-identified data quality dimensions (DDQD). We consider these dimensions essential for maintaining data quality in de-identified systems, because the absence of any of them affects the overall quality of the data in those systems. These dimensions are described in Table 1 below.

Table 1. De-identified data quality dimensions definitions (DDQD).

Quality data and management deliver performance and efficiency gains and the ability to extract new understanding. Poor clinical informatics data quality can cause problems throughout an organization, affecting the quality of research outcomes, healthcare services, and decision-making.

Quality is not a simple scalar measure but can be defined on multiple dimensions, with each dimension yielding different meanings to different information consumers and processes (Halimeh, 2011; AbuHalimeh and Tudoreanu, 2014). Each dimension can be measured and assessed differently. Data quality assessment implies providing a value for each dimension, indicating how much of the dimension or quality feature is achieved, to enable adequate understanding and management. Data quality and the discipline of informatics are inextricably interconnected. Data quality depends on how data are collected, processed, and presented; this is what makes data quality very important and sometimes complicated, because data collection and processing vary from one study to another. Clinical informatics data can include different data formats and types and can come from different resources.

Case Study Goals

The primary goal was to compare, identify, and understand discrepancies in patient counts in i2b2 compared to Epic Slicerdicer (Galaxy, 2021). The secondary goal was to create a data dictionary that clinical researchers would easily understand. For example, if they wanted a count of patients with asthma, they would know (1) what diagnoses were used to identify patients, (2) where these diagnoses were captured, and (3) that this count matched existing clinical knowledge.

The case described below comes from a healthcare organization that wanted the ability to ingest other sources of research-specific data, such as genomic information, which its existing products could not handle. After deliberation, i2b2 (The i2b2 tranSMART Foundation, 2021) was chosen as the data model for its clinical data warehouse. Prior to going live with users, however, it was essential to validate that the data in the Clinical Data Warehouse (CDW) was accurate.

Methodology

Participants

The clinical validation process involved a clinical informatician, data analyst, and ETL developer.

Many healthcare organizations use at least one of the three Epic databases (Chronicles, Clarity, and Caboodle). The data source used to feed the i2b2 and Slicerdicer tools was the Caboodle database.

The tools used to perform the study were i2b2 and Epic Slicerdicer.

I2b2: Informatics for Integrating Biology and the Bedside (i2b2) is an open-source clinical data warehousing and analytics research platform; i2b2 enables sharing, integration, standardization, and analysis of heterogeneous data from healthcare and research ( The i2b2 tranSMART Foundation, 2021 ).

Epic Slicerdicer: a self-service reporting tool that gives physicians ready access to clinical data, customizable by patient population, for data exploration. Slicerdicer allows the user to choose and search a specific patient population to answer questions about diagnoses, demographics, and procedures performed (Galaxy, 2021).

Method Description

The study was designed to compare, identify, and understand discrepancies in patient counts in i2b2 compared to Epic Slicerdicer (Galaxy, 2021). We achieved this goal by choosing tasks suited to the nature of the tools.

The first step was running the same query to look at patient demographics (race, ethnicity, gender). This identified different aggregations of race and ethnicity in i2b2 compared with Slicerdicer, which was more granular, as shown in Table 2. For example, Cuban and Puerto Rican values in Slicerdicer were included in the Other Hispanic or Latino category in i2b2.

Table 2. Patient demographic counts.

The second step was running the same query to explore diagnoses, using J45* as the ICD-10 code for asthma and E10* as the diagnosis code for Type 1 diabetes, as shown in Table 3.

Table 3. Patient counts based on diagnosis codes.

The percentage difference between the two tools' counts was used to estimate the quality of the counts; the threshold for accepted quality in this study was a difference below 2%. The percentage difference is calculated when you want to know, in percentage terms, how far apart two numbers are.

With V1 = the i2b2 count and V2 = the Slicerdicer count, the counts are plugged into the standard percentage difference formula:

% difference = |V1 − V2| / ((V1 + V2) / 2) × 100

A paired t-test is used to investigate the difference between two counts from i2b2 and Epic Slicerdicer for the same query.
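
For concreteness, here is a small Python sketch of both checks, using scipy for the paired t-test. The first pair of counts echoes the asthma example discussed below; the remaining values are hypothetical illustrations, not the study's actual data.

```python
# Illustrative sketch: percentage difference against a 2% threshold, plus a paired t-test.
# Requires scipy. Counts are illustrative (the first pair echoes the asthma example).

from scipy import stats

i2b2_counts = [20429, 14500, 1579]
slicerdicer_counts = [22265, 23958, 1600]

def pct_difference(v1: float, v2: float) -> float:
    """Percentage difference: |V1 - V2| relative to the mean of the two values."""
    return abs(v1 - v2) / ((v1 + v2) / 2) * 100

for v1, v2 in zip(i2b2_counts, slicerdicer_counts):
    diff = pct_difference(v1, v2)
    print(f"{v1} vs {v2}: {diff:.2f}% {'FAIL' if diff > 2 else 'OK'}")

result = stats.ttest_rel(i2b2_counts, slicerdicer_counts)
print(f"paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```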

All the results obtained from comparing the counts between Slicerdicer and i2b2 are listed in Tables 2 and 3 below.

However, when diagnoses were explored, larger discrepancies were noted. There are two diagnosis fields in i2b2, one for billing diagnosis and one for diagnosis. Using J45* as the ICD-10 code for asthma resulted in 22,265 patients when using the billing diagnosis code in Slicerdicer but only 20,429 in i2b2. The discrepancy using diagnosis was even larger. Patient count results for the Type 1 diabetes diagnosis code (E10*) using both diagnosis and billing are also shown in Table 3.

The best approach to understanding the reasons for this discrepancy was to look at the diagnosis options in Slicerdicer and build a hypothesis about where the discrepancy might come from. The next step was examining the SQL code for the Caboodle-to-i2b2 ETL process.

The following hypotheses were considered:

H0: There is no discrepancy in the data elements used to pull the data.

H1: There is a discrepancy in the data elements used to pull the data.

A paired sample t-test was applied to the counts obtained from i2b2 and Slicerdicer using different data points. The p-value was equal to 0 [P(x ≤ -Infinity) = 0] in all cases, which means the chance of a type I error (rejecting a correct H0) is small: 0 (0%). The smaller the p-value, the more it supports H1. For example, the paired t-test indicated a significant medium difference between i2b2 (M = 14,500, SD = 0) and Epic Slicerdicer (M = 23,958, SD = 0), t(0) = Infinity, p < 0.001, and a significant medium difference between i2b2 (M = 155,434, SD = 0) and Epic Slicerdicer (M = 1,579, SD = 0), t(0) = Infinity, p < 0.001.

Since the p-value < α, H0 is rejected and the i2b2 population average is considered not equal to the Epic Slicerdicer population average. In other words, the difference between the averages of i2b2 and Epic Slicerdicer is large enough to be statistically significant.

The paired t -test results supported the alternative hypothesis and revealed that there is a discrepancy in the data elements used to pull the data.

The percentage difference results, used to estimate the quality of the counts coming from the two tools, mostly exceeded the accepted quality threshold for this study (a difference below 2%), as shown in Tables 2 and 3. The percentage difference results provided strong evidence of a crucial quality issue in the counts obtained.

Examining the SQL code for the Caboodle-to-i2b2 ETL process showed that the code only looked at billing and encounter diagnoses, and everything that was not a billing diagnosis was labeled as a diagnosis. Slicerdicer, and indeed Caboodle, include other diagnosis sources such as medical history, hospital problem, and problem list. This was documented in the data dictionary so that researchers would understand which sources i2b2 was using and know that, if they wanted data beyond those, they would have to request it from Caboodle.

The discrepancies led to major information quality issues with data inconsistency and data accuracy, both of which affect the believability and the validity of the data, which are themselves major data quality measures. The discrepancies noted above are likely due to several factors. First, Slicerdicer counts patients for every race selected, whereas i2b2 only takes the first race field. This is because two data models were used to map the race and ethnicity variables in i2b2: the 1997 OMB race categories and the 2003 OMB variables, which contain a more granular set of race and ethnicity categories. The mapping was then done to "bundle" the other races into a more general set of categories. This could explain the reduction of concepts, because the map may be incomplete.

Second, the purpose of the Extract-Transform-Load (ETL) process is to load the warehouse with integrated and cleansed data. Data quality focuses on the contents of the individual records to ensure the data loaded into the target destination is accurate, reliable, and consistent, so the ETL code should be evaluated to ensure the extracted data generally match what researchers want. In our case, that means understanding which diagnoses most researchers are interested in; they may want encounter diagnoses rather than data that includes problem list and medical history. Third, data quality issues can also be caused by format differences or conversion errors (Azeroual et al., 2019; Souibgui et al., 2019).

Lastly, data loss can occur in the ETL process, one of its core challenges given the nature of the source systems. Data losses arise from disparities among the source operational systems, which are diverse and disparate because of growing data volumes, changing data formats, and the modification and derivation of data elements.

In general, data integration with heterogeneous systems is not an easy task. This is mainly due to the fact that many data exchange channels must be developed in order to allow an exchange of data between the systems ( Berkhoff et al., 2012 ) and to solve problems related to the provision of interoperability between systems on the level of data ( Macura, 2014 ).

Steps to Ensure Informatics Quality

To improve the quality of the data generated from the de-identified systems (mainly counts) and to solve data quality issues related to interoperability between the tools at the data level, we propose the following steps:

1. Make data “fit for use.”

To make data fit for use, data governance bodies must clearly define major data concepts/variables included in the de-identified systems and standardize their collection and monitoring processes; this can increase clinical data reliability and reduce the inconsistency of data quality among systems involved ( Halimeh, 2011 ; AbuHalimeh and Tudoreanu, 2014 ).

2. Define data elements (data dictionary).

This is fundamental: the lack of clear definitions of source data and controlled data collection procedures often raises concerns about the quality of data provided in such environments and, consequently, about the evidence level of related findings (Spengler et al., 2020). Developing a data dictionary is essential to ensuring data quality, especially in de-identified systems where all data elements are aggregated in a specific way and there are not enough details about each concept. A data dictionary serves as a guidebook that defines the major data concepts (a minimal sketch of such an entry follows this list). To build one, organizations must determine what metadata is helpful to researchers when they use the de-identified data systems. In addition, identifying more targeted data concepts and process workflows can help reduce the time and effort researchers spend working with large amounts of data and ultimately improve overall data quality.

3. Apply good ETL practices, such as data cleansing mechanisms, to get the data to a state that works well with data from other sources.

4. Choose a smart ETL architecture that allows you to update components of your ETL process when data and systems change, preventing data loss and ensuring data integrity and consistency.

5. Apply data lineage techniques. Lineage helps in understanding where data originated, when it was loaded, and how it was transformed, and it is essential for the integrity of the downstream data and the processes that move it into any of the de-identified systems.

6. Establish a process for cleansing and tracing suspicious data and unusual rows when they are revealed.

7. Users need to revise their queries and refine results as they combine data variables.

8. Having a clinical informaticist on board can also be beneficial to the process. They can ensure that your data reflects what is seen in clinical practice or help explain questionable data with their knowledge of clinical workflows and how that data is collected, especially if your analyst has no clinical background.
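
As referenced in step 2, a data dictionary entry does not need to be elaborate to be useful. The sketch below is a hypothetical Python representation of one entry, based on the asthma example in this case; the field names are illustrative, not a required format.

```python
# Hypothetical sketch of a single data dictionary entry (see step 2).

asthma_entry = {
    "concept": "Asthma",
    "code_system": "ICD-10",
    "codes": "J45*",
    "source_system": "Caboodle",
    "diagnosis_sources_included": ["billing diagnosis", "encounter diagnosis"],
    "diagnosis_sources_excluded": ["medical history", "hospital problem", "problem list"],
    "note": "For excluded sources, request data directly from Caboodle.",
}

for field, value in asthma_entry.items():
    print(f"{field}: {value}")
```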

The success of any de-identified data tool depends largely on the quality of the data used and on the mapping process, which is intertwined with the extraction and transformation components. The ETL process is a crucial determinant of the quality of the data generated by an information system.

This study showed that discrepancies in the data elements used in the data pull process led to major information quality issues with data inconsistency and data accuracy, both of which affect the believability and the validity of the data, which are themselves major data quality measures.

Our contribution in this paper is a set of steps that together form guidelines for a method, or for automated procedures and tools, to manage data quality and data governance in a multifaceted, diverse information environment such as a healthcare organization, and to enhance data quality among de-identified data tools.

Future plans are to study more clinical informatics tools, such as TriNetX, and other sets of medical data to assess the quality of the counts obtained from these tools.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

AbuHalimeh, A., and Tudoreanu, M. E. (2014). “Subjective information quality in data integration: evaluation and principles,” in Information Quality and Governance for Business Intelligence (Pennsylvania: IGI Global), 44–65.

American Medical Informatics Association (2022). Available online at: https://amia.org/about-amia/why-informatics/informatics-research-and-practice (accessed April, 2021).

Azeroual, O., Saake, G., and Abuosba, M. (2019). “ETL best practices for data quality checks in RIS databases,” in Informatics, Vol. 6 (Basel: Multidisciplinary Digital Publishing Institute), 10.

Berkhoff, K., Ebeling, B., and Lübbe, S. (2012). “Integrating research information into a software for higher education administration—benefits for data quality and accessibility,” in 11th International Conference on Current Research Information Systems , Prague.

Galaxy (2021). Epic User Web . Available online at: https://galaxy.epic.com/#Search/searchWord=slicerdicer (accessed April, 2021).

Halimeh, A. A. (2011). Integrating Information Quality in Visual Analytics. University of Arkansas at Little Rock, Little Rock.

Macura, M. (2014). Integration of data from heterogeneous sources using ETL technology. Comput. Sci. 15:109–132. doi: 10.7494/csci.2014.15.2.109

Nahm, M. (2012). “Data quality in clinical research,” in Clinical Research Informatics (London: Springer), 175–201.

Pipino, L. L., Lee, Y. W., and Wang, R. Y. (2002). Data quality assessment. Commun. ACM 45, 211–218. doi: 10.1145/505248.506010

Souibgui, M., Atigui, F., Zammali, S., Cherfi, S., and Yahia, S. B. (2019). Data quality in ETL process: a preliminary study. Proc. Comput. Sci. 159, 676–687. doi: 10.1016/j.procs.2019.09.223

Spengler, H., Gatz, I., Kohlmayer, F., Kuhn, K. A., and Prasser, F. (2020). “Improving data quality in medical research: a monitoring architecture for clinical and translational data warehouses,” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS) (Rochester, MN: IEEE), 415–420.

The i2b2 tranSMART Foundation (2021). Available online at: https://www.i2b2.org/about/ (accessed April, 2021).

Zozus, M. N., Kahn, M. G., and Weiskopf, N. G. (2019). “Data quality in clinical research,” in Clinical Research Informatics (Cham: Springer), 213–248.

Keywords: clinical research data, data quality, research informatics, informatics, management of clinical data

Citation: AbuHalimeh A (2022) Improving Data Quality in Clinical Research Informatics Tools. Front. Big Data 5:871897. doi: 10.3389/fdata.2022.871897

Received: 08 February 2022; Accepted: 29 March 2022; Published: 29 April 2022.

Copyright © 2022 AbuHalimeh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ahmed AbuHalimeh, aaabuhalime@ualr.edu


Updated Feb 20 2024

Modern Data Quality Management: A Proven 6 Step Guide

Will Robins

Will Robins is a member of the founding team at Monte Carlo.

Data quality management is the process of setting quality benchmarks, actively improving data based on those benchmarks, and continually maintaining those data quality levels.

Historically, it was a simple affair. Data was small, slow, and confined within the walls of on-premise databases. Managing it was akin to keeping a small garden.

Now, as we’ve catapulted into the cloud era, the game has changed. The quaint garden has become a sprawling jungle, wild and untamed. Data sources proliferate like rabbits and data models and tables are as numerous as stars in the sky.

Data leaders need data quality management programs designed for the modern data stack. While the six dimensions of data quality (accuracy, completeness, integrity, validity, timeliness, uniqueness) still apply, there needs to be a much larger focus on the overall reliability of data systems, sources, and pipelines.

Most of all, there needs to be a repeatable process beyond static data cleansing, testing and profiling. These legacy approaches just can’t scale across dozens of data sources, hundreds of data models, and thousands of tables.

We've worked with hundreds of data teams that have successfully leveraged this six-step process to achieve the data quality levels required by their business.

Data Quality Management Steps

  • Step 1: Baseline current data quality levels
  • Step 2: Rally and align the organization
  • Step 3: Implement broad data quality monitoring
  • Step 4: Optimize incident resolution
  • Step 5: Create custom data quality monitors
  • Step 6: Incident prevention

One of the best places to start with your data quality management strategy is an inventory of your current (and ideally near future) data use cases. Categorize them by:

  • Analytical: Data is used primarily for decision making or evaluating the effectiveness of different business tactics via a BI dashboard.
  • Operational: Data used directly in support of business operations in near-real time. This is typically streaming or micro-batched data. Some use cases here could be accommodating customers as part of a support/service motion or an ecommerce machine learning algorithm that recommends "other products you might like."
  • Customer facing: Data that is surfaced within and adds value to the product offering, or data that IS the product. This could be a reporting suite within a digital advertising platform, for example.

Why is this important? As previously mentioned, data quality is contextual. There will be some scenarios, such as financial reporting, where accuracy is paramount. In other use cases, such as some machine learning applications, freshness will be key and "directionally accurate" will suffice.

The next step is to assess the overall performance of your systems and team. At this stage you have just begun your journey so it’s unlikely you have detailed insights into your overall data health or operations. There are some quantitative and qualitative proxies you can use however.

  • Quantitative: You can measure the number of data consumer complaints, overall data adoption, and levels of data trust (NPS survey). You can also ask the team to estimate the amount of time they spend on data quality management related tasks like maintaining data tests and resolving incidents.
  • Qualitative: Is there a desire or an opportunity for more advanced data use cases? Do leaders feel like they've unlocked the full value of the organization's data? Is the culture data driven? Was there a recent data quality disaster that led to very senior escalation?

Categorizing your data use cases and baselining current performance will also help you assess the gap between your current and desired future state across your infrastructure, team, processes, and performance. It also helps answer broader tactical questions that impact data quality, such as:

  • Do I need to migrate any on-premises databases to the cloud?
  • Where do we need to have data streaming vs. batch? ETL vs. ELT?
  • How do I prioritize and build the layers of my modern data platform across ingestion, storage, transformation/orchestration, visualization, data quality, and data governance/access management?
  • What level of data pipeline monitoring coverage do we need?
  • Should there be a central data team, decentralized data mesh, or a hybrid with a data center of excellence?
  • Do I need specialized roles and/or teams to manage data governance, such as data stewards, or data quality, such as data reliability engineers?
  • Are we efficient at identifying the root cause of data incidents?
  • Do we understand the relative importance of each asset and how they are related?
  • What data SLAs should we have in place?
  • How do we onboard data sets?
  • What level of documentation is appropriate?
  • How do we enable discovery and prioritize self-service access to data?

“Given that we are in the financial sector, we see quite disparate use-cases for both analytical and operational reporting which require high levels of accuracy,” says Checkout.com Senior Data Engineer Martynas Matimaitis. “That forced our hands to [scale data quality management] quite early on in our journey, and that became a crucial part of our day-to-day business.”

Once you have a baseline and an informed opinion, you are ready to start building support for your data quality initiative. You will want to start by understanding what pain is felt by different stakeholders. 

If there is no pain, you need to take a moment to understand why. It could be that the scale of your data operations or the overall importance of your data isn't mature enough to warrant an investment in improving data quality. However, assuming you have more than 50 tables and a few members on your data team, that is unlikely to be the case.

What's more likely is that your organization has quite a bit of unrealized risk. The data quality is low and a costly data incident is just around the corner… but it hasn't struck yet. Your data consumers will generally trust the data until you give them a reason not to. At that point, trust is much harder to regain than it was to lose.

The overall risk of poor data quality can be difficult to assess. The consequences of bad data can range from slightly under optimized decision making to reporting incorrect data to Wall Street.

One approach is to pool this risk by estimating your data downtime and attributing an inefficiency cost to it. Or you could take established industry baselines: our study shows bad data can impact, on average, 26% of a company's revenue.

That risk assessment and cost of business stakeholders dealing with bad data will be informative if a bit fuzzy. It should also be paired with the cost to the data team of dealing with bad data. This can be done by totaling up the amount of time spent on data quality related tasks, wincing, and then multiplying that time by the average data engineering salary. 
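
A back-of-the-envelope version of that math might look like the sketch below. Every figure is a hypothetical placeholder to swap for your own numbers; only the 26% revenue baseline comes from the study cited above.

```python
# Back-of-the-envelope cost estimate. All figures are hypothetical placeholders.

engineers = 6
hours_per_week_on_quality = 15        # time spent maintaining tests, triaging, firefighting
hourly_rate = 95                      # fully loaded data engineering cost per hour
working_weeks_per_year = 48

team_cost = engineers * hours_per_week_on_quality * hourly_rate * working_weeks_per_year

annual_revenue = 50_000_000
revenue_at_risk = annual_revenue * 0.26   # industry baseline: bad data impacts ~26% of revenue

print(f"Data team cost of bad data:  ${team_cost:,.0f}")       # $410,400
print(f"Revenue at risk (26% basis): ${revenue_at_risk:,.0f}")  # $13,000,000
```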

Pro-Tip: Data testing is often one of the data team's biggest inefficiencies. It is time-consuming to define, maintain, and scale every expectation and assumption across every dataset. Worse, because data can break in a near-infinite number of ways (unknown unknowns), this level of coverage is often woefully inadequate.

Congratulations! You now have a business case for your data quality management initiative and the changes you need to make across your people, technology, and processes.

At this point, the following stages will assume you have obtained a mandate and made a decision to either build or acquire a data quality or data observability solution to assist in your efforts. Now, it’s time to implement and scale.

The third data quality management stage is to make sure you have basic machine learning monitors (freshness, volume, schema) in place across your data environment. For many organizations (excluding the largest enterprises), you will want to roll this out across every data product, domain, and department rather than pilot and scale.  

This will accelerate your time to value and help you establish critical touch points with different teams if you haven’t done so already. 

Another reason for a wide roll out is that, even in the most decentralized organizations, data is interdependent. If you install fire suppression systems in the living room while you have a fire in the kitchen, it doesn’t do you much good.

Also, wide-scale data monitoring and/or data observability will give you a complete picture of your data environment and its overall health. Having that 30,000-foot view is helpful as you enter the next stage of data quality management.

“With…broad coverage and automated lineage…our team can identify, understand downstream impacts, prioritize, and resolve data issues at a much faster rate,” said Ashley VanName, general manager of data engineering, JetBlue.

At this stage, we want to start optimizing our incident triage and resolution response. This involves setting up clear lines of ownership. There should be team owners for data quality as well as overall data asset owners at the data product and even data pipeline level.

Breaking your environment into domains, if you haven’t already, can help create additional accountability and transparency for the overall data health levels maintained by different groups.

Having clear ownership also enables fine-tuning of your alert settings, making sure alerts are sent to the responsible team’s communication channels at the right level of escalation.

Alerting considerations for a data quality management initiative.

“We started building these relationships where I know who’s the team driving the data set,” said Lior Solomon, VP of Data at Drata. “I can set up these Slack channels where the alerts go and make sure the stakeholders are also on that channel and the publishers are on that channel and we have a whole kumbaya to understand if a problem should be investigated.”

Next you will want to focus on layering in more sophisticated, custom monitors. These can be either manually defined (for example, if data needs to be fresh by 8:00 am every weekday for a meticulous executive) or machine learning based. In the latter case, you indicate which tables or segments of the data are important to examine, and the ML alerts trigger when the data starts to look awry.
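A manually defined rule like the 8:00 am example might be sketched as follows; the function, table, and notification names here are hypothetical:

# Hypothetical manually defined monitor: the table must be refreshed by 08:00 on weekdays
fresh_by_8am <- function(last_updated, now = Sys.time()) {
  is_weekday <- !(format(now, "%u") %in% c("6", "7"))              # Mon-Fri only
  deadline   <- as.POSIXct(format(now, "%Y-%m-%d 08:00:00"))
  if (!is_weekday || now < deadline) return(TRUE)                  # rule not yet in effect today
  as.Date(last_updated) == as.Date(now) && last_updated <= deadline
}

# if (!fresh_by_8am(max(revenue_summary$updated_at))) notify("#data-alerts")  # notify() is hypothetical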

We recommend layering on custom monitors on your organization’s most critical data assets. These can typically be identified as those that have many downstream consumers or important dependencies.

Custom monitors and SLAs can also be built around different data reliability tiers to help set expectations. You can certify the most reliable datasets “gold” or label an ad-hoc data pull for a limited use case as “bronze” to indicate it is not supported as robustly.

How Monte Carlo helps with data certification

The most sophisticated organizations manage a large portion of their custom data quality monitors through code (monitors as code) as part of the CI/CD process. 

The Checkout.com data team reduced its reliance on manual monitors and tests by adding monitors as code functionality into every deployment pipeline. This enabled them to deploy monitors within their dbt repository, which helped harmonize and scale the data platform. 

“Monitoring logic is now part of the same repository and is stacked in the same place as a data pipeline, and it becomes an integral part of every single deployment,” says Martynas. In addition, that centralized monitoring logic enables the clear and easy display of all monitors and issues, which expedites time to resolution.

At this point, we have driven significant value to the business and noticeably improved data quality management at our organization. The previous data quality management stages have helped dramatically reduce our time-to-detection and time-to-resolution, but there is a third variable in the data downtime formula: number of data incidents.

In other words, we want to try to prevent data incidents before they happen.
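One way to see why that third variable matters: data downtime is roughly the number of incidents multiplied by the average time to detect plus the average time to resolve them. A quick, purely illustrative calculation:

# data downtime ~ incidents * (time to detection + time to resolution); numbers are illustrative
incidents <- 20    # data incidents per quarter
ttd_hours <- 4     # average time to detection
ttr_hours <- 8     # average time to resolution
incidents * (ttd_hours + ttr_hours)   # 240 hours of data downtime per quarter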

That can be done by focusing on data health insights like unused tables or deteriorating queries. Analyzing and reporting the data reliability levels or SLA adherence across domains can also help data leaders determine where to allocate their data quality management program resources. 

“Monte Carlo’s lineage highlights upstream and downstream dependencies in our data ecosystem, including Salesforce, to give us a better understanding of our data health,” said Yoav Kamin, business analysis group leader at Moon Active. “Instead of being reactive and fixing the dashboard after it breaks, Monte Carlo provides the visibility that we need to be proactive.”

Final thoughts

We covered a lot of ground in this article – some might call it a data reliability marathon. Some of our key data quality management takeaways include:

  • Make sure you are monitoring both the data pipeline and the data flowing through it.
  • You can build a business case for data monitoring by understanding the amount of time your team spends fixing pipelines and the impact it has on the business.
  • You can build or buy data monitoring–the choice is yours–but if you decide to buy a solution be sure to evaluate its end-to-end visibility, monitoring scope, and incident resolution capabilities. 
  • Operationalize data monitoring by starting with broad coverage and mature your alerting, ownership, preventive maintenance, and programmatic operations over time.

Perhaps the most important point is that data pipelines will break and data will “go bad” unless you keep them healthy.

Whatever your next data quality management step entails, it’s important to take it sooner rather than later. You’ll thank us later.  

Considering modernizing or scaling your data quality management program? Schedule a time to talk to us about how data observability can help.

Our promise: we will show you the product.


Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation

Level Two: qualitative and business logic rules

We must check the following every time:

  • Is the price parameter (if applicable) always non-negative? (We stopped several of our retail customers from recommending the wrong discounts thanks to this rule. They saved significant sums and prevented serious problems thanks to this step… more on that later.)
  • Do we have any unrealistic values? For data related to humans, is age always a realistic number?
  • Parameters. For data related to machines, does the status parameter always have a correct value from a defined set? E.g. only “FINISHED” or “RUNNING” for a machine status?
  • Can we have “Not Applicable” (NA), null, or empty values? What do they mean?
  • Do we have several values that mean the same thing? For example, users might enter their residence in different ways: “NEW YORK”, “Nowy Jork”, “NY, NY” or just “NY” for a city parameter. Should we standardize them?
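To make these business-logic rules concrete, here is a minimal sketch using the assertr package; the example data frame and its columns are hypothetical, and a real project would carry many more rules:

library(assertr)
library(magrittr)  # provides the %>% pipe

# Hypothetical example data; a real project would validate the client's dataset
transactions <- data.frame(
  price        = c(19.99, 5.00, 12.50),
  customer_age = c(34, 51, 27),
  status       = c("FINISHED", "RUNNING", "FINISHED")
)

transactions %>%
  assert(within_bounds(0, Inf), price) %>%           # prices must be non-negative
  assert(within_bounds(0, 120), customer_age) %>%    # ages must be realistic
  assert(in_set("FINISHED", "RUNNING"), status) %>%  # status comes from a defined set
  assert(not_na, price, customer_age, status)        # no missing values allowed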
Level three: expert rules

Expert rules govern something different than format and values. They check whether the story behind the data makes sense. This requires business knowledge about the data, and it is the data scientist’s responsibility to be curious, to explore, and to challenge the client with the right questions in order to avoid logical problems with the data. The right questions must be asked.

Expert Rules Case Studies

I’ll illustrate with a couple of true stories.

Story #1: Is this machine teleporting itself?

We were tasked to analyze the history of a company’s machines. The question was how much time each machine worked at a given location. We have the following entries in our database:

| Date       | Machine ID | Hours of work | Site   |
|------------|------------|---------------|--------|
| 2018-10-26 | 1234       | 1             | Warsaw |
| 2018-10-27 | 1234       | 2             | Cracow |
| 2018-10-28 | 1234       | 2             | Warsaw |
| 2018-10-29 | 1234       | 3             | Cracow |
| 2018-10-30 | 1234       | 1             | Warsaw |

The format and values are correct. But why did machine #1234 change its location every day? Is that even possible? We should ask the client such a question. In this case, we found that it was not physically possible for the machine to switch sites so often. After some investigation, we found that the software installed on the machine had a duplicated ID number; in fact, there were two machines on different sites with the same ID number. Once we learned what was possible, we set data validation rules for it and ensured that this issue would not happen again. Expert rules can be developed only through close cooperation between data scientists and the business. This is not an easy part that can be automated by “data cleaning tools,” which are great for hobbyists but are not suitable for anything remotely serious.

Story #2: A negative sign could have changed all prices in the store

One of our retail clients was pretty far along in their project journey when we began to work with them. They already had a data scientist on staff and had already developed their own price optimization models. Our role was to take the output from those models and display recommendations in an R Shiny dashboard to be used by their salespeople. We had some assumptions about the format of the data that the application would consume from their models.
So we wrote our validation rules based on what we thought the application should expect when it reads the data. We reasoned that the price should be:

  • non-negative
  • an integer
  • not an empty value or a string
  • within a reasonable range for the given product

As the model was being developed over the course of several weeks, we suddenly observed that prices were coming back too high. This was caught by the automated validation: we did not spot the problem in production, we spotted it before the data even landed in the application. After we saw the result in the report, we asked their team why it happened. It turned out that they had a new developer who assumed that discounts could be expressed as negative numbers, because why not? He didn’t realize that some applications depended on that output and assumed the value would be subtracted rather than added. Thanks to the automatic data validation, we prevented loading these errors into production. We worked with their data scientists to improve the model. It was a very quick fix, of course, a no-brainer. But the end result was that they saved real money.

Data Validation Report for all stakeholders

Here is a sample data validation report that our workflow produces for all stakeholders in the project:

[Figure: sample data validation report]

The intent is that the data verification report be readable by all stakeholders, not just data scientists and software engineers. After years of working on data science projects, we observed that multiple people within an organization know the realistic parameters for data values, such as price points. There is usually more than one expert in a community, and people are knowledgeable about different things. New data is often added at a constant rate, and parameters can change. So why not allow multiple people to add and edit rules when verifying data? With our Data Verification workflow, anyone from the team of stakeholders can add or edit a data verification rule.

Ensure clean, well-formatted data for your data pipelines using Appsilon’s data.validator package (https://appsilon.com/data-validation-with-data-validator-an-open-source-package-from-appsilon/).

Our Data Verification workflow works with the assertr package (https://github.com/ropensci/assertr) for the R enthusiasts out there. Our workflow runs validation rules automatically, after every update of the data. This is exactly the same process as writing unit tests for software. Like unit testing, our data verification workflow allows you to identify problems and catch them early; and of course, fixing problems at an earlier stage is much more cost-effective.

Finally, what do validation rules look like at the code level? We can’t show you code created for clients, so here is an example using data from the City of Warsaw public transportation system (requested from a public API). Let’s say that we want a real-time check on the location and status of all the vehicles in the transit system fleet.

library(assertr)
library(magrittr)  # provides the %>% pipe
source("utils.R")  # utils define helper functions such as check_data_last_5min() or check_lat_in_warsaw()

api_url <- "https://api.um.warszawa.pl/api/action/wsstore_get/?id=c7238cfe-8b1f-4c38-bb4a-de386db7e776"
vehicle_positions <- jsonlite::fromJSON(api_url)[[1]]

vehicle_positions %>%
  verify(title = "Each entry has 8 parameters", ncol(.) == 8, mark_data_corrupted_on_failure = TRUE) %>%
  assert(title = "Time is in format 'yyyy-mm-dd HH:MM:SS'", check_datetime_format, Time) %>%
  assert(title = "Data is not older than 5 minutes", check_data_last_5min, Time) %>%
  assert(title = "Latitude coordinate is a correct value in Warsaw", check_lat_in_warsaw, Lat) %>%
  assert(title = "Longitude coordinate is a correct value in Warsaw", check_lon_in_warsaw, Lon) %>%
  validator$add_validations("vehicles")

In this example, we want to ensure that the Warsaw buses and trams are operating within the borders of the city, so we check the latitude and longitude. If a vehicle is outside the city limits, then we certainly want to know about it! We want real-time updates, so we write a rule that “Data is not older than 5 minutes.” In a real project, we would probably write hundreds of such rules in partnership with the client. Again, we typically run this workflow BEFORE we build a model or a software solution for the client, but as you can see from the examples above, there is tremendous value in running the Data Validation Workflow even late in the production process!
And one of our clients remarked that they saved more money with the Data Validation Workflow than with some of the machine learning models that had previously been built for them.

Sharing our data validation workflow with the community

Data quality must be verified in every project to produce the best results. There are a number of potential errors that seem obvious and simplistic but, in our experience, tend to occur often. After working on numerous projects with Fortune 500 companies, we came up with a solution to the 3-level problem cluster described above. Since multiple people within an organization know the realistic parameters for datasets, such as price points, why not allow multiple people to add and edit rules when verifying data? We recently shared our workflow at a hackathon (https://hackathon.gov.pl) sponsored by the Ministry of Digitization here in Poland. We took third place in the competition, but more importantly, it reflects one of the core values of our company: to share our best practices with the data science community.


<a href="https://twitter.com/pawel_appsilon" target="_blank" rel="noopener noreferrer">Pawel</a> and <a href="https://twitter.com/krystian8207" target="_blank" rel="noopener noreferrer">Krystian</a> accept an award at the Ministry of Digital Affairs <a href="https://hackathon.gov.pl/" target="_blank" rel="noopener noreferrer">Hackathon</a> I hope that you can put these takeaways in your toolbox: <ul><li style="font-weight: 400;">Validate your data early and often, covering all assumptions. </li><li style="font-weight: 400;">Engage a data science professional early in the process  </li><li style="font-weight: 400;">Leverage the expertise of your workforce in data governance strategy</li><li style="font-weight: 400;">Data quality issues are extremely common </li></ul> In the midst of designing new products, manufacturing, marketing, sales planning and execution, and the thousands of other activities that go into operating a successful business, companies sometimes forget about data dependencies and how small errors can have a significant impact on profit margins.   We unleash your expertise about your organization or business by asking the right questions, then we teach the workflow to check for it constantly.  We take your expertise and we leverage it repeatedly.


Big Data Quality Case Study Preliminary Findings


Four case studies related to data quality in the context of the management and use of Big Data are being performed and reported separately; these will also be compiled into a summary overview report. The report herein documents one of those four case studies.

The purpose of this document is to present information about the various data quality issues related to the design, implementation, and operation of a specific data initiative, the U.S. Army's Medical Command (MEDCOM) Medical Operational Data System (MODS) project. While MODS is not currently a Big Data initiative, potential future Big Data requirements under consideration (in the areas of geospatial data, document and records data, and textual data) could easily move MODS into the realm of Big Data. Each of these areas has its own data quality issues that must be considered. By better understanding the data quality issues in these Big Data growth areas, we hope to explore specific differences in the nature and type of Big Data quality problems from what is typically experienced in traditionally sized data sets. This understanding should facilitate the acquisition of the MODS data warehouse through improvements in the requirements and downstream design efforts. It should also enable the crafting of better strategies and tools for profiling, measurement, assessment, and action processing of Big Data quality problems.


The Benefits of Data Quality – A Review of Use Cases & Trends


Looking for an overview of use cases and current trends for data quality? You’ve come to the right place! In this post, we review the benefits of data quality and how it can help your business.

Data quality saves you money

A big reason to pay attention to data quality is that it can save you money. First and foremost, it can help you maximize the return on your big data investments. And there are additional cost-related benefits, discussed in the areas below, to help you save even more.

It builds trust

Business leaders rely on big data analytics to make informed decisions. Ensuring quality data can help organizations trust the data.

And further, customers can trust businesses who are confident in their data. If your data is inaccurate, inconsistent or otherwise of low quality, you risk misunderstanding your customers and doing things that undermine their trust in you.

There appears to be an abundance of data but a scarcity of trust, along with a growing need for data literacy. It’s important to understand what your data means to your organization. Defining your data’s value wedge may be key to developing confidence in your enterprise data.

Data quality’s link to data governance

Data quality is essential for data governance because ensuring data quality is the only way to be certain that your data governance policies are consistently followed and enforced.

During her Enterprise Data World presentation, Laura Sebastian-Coleman, the Data Quality Center of Excellence Lead for Cigna, noted specifically that data quality depends on fitness for purpose, representational effectiveness and data knowledge. And, without this knowledge, which depends on the data context, our data lakes or even our data warehouses are doomed to become “data graveyards.”

Data governance and data quality are intrinsically linked, and as the strategic importance of data grows in an organization, the intersection of these practices grows in importance, too.

4 Ways to Measure Data Quality

Assessing data quality on an ongoing basis is necessary to know how well the organization is doing at maximizing data quality. There are a variety of data and metrics that organizations can use to measure data quality. We review a few of them in this ebook.

Data quality and your customers

Engaging your customers is vital to driving your business. Data quality can help you improve your customer records by verifying and enriching the information you already have. And beyond contact info, you can manage customer interactions by storing additional customer preferences, such as the time of day they visit your site and which content topics and types they are most interested in.

The more customer information you have, the better you can understand your customers and achieve “Customer 360,” a full view of your customer. But be aware that more data means more complexity, creating a data integration paradox.

Its role in cyber security

You may be aware of all the ways you can leverage big data to detect fraud, but perhaps you’re wondering how data quality can help fight security breaches.

Think about it. If the machine data that your intrusion-detection tools collect about your software environments is filled with incomplete or inaccurate information, then you cannot expect your security tools to effectively detect dangerous threats.

Keep in mind, too, that when it comes to fraud detection, real-time results are key. By extension, your data quality tools covering fraud analysis data will also need to work in real time.

Additional data quality trends

Of course, we’re always thinking about what’s next for data quality. One additional area of interest that’s gaining momentum is machine learning. While machine learning may seem like a “silver bullet” because of the technologies it enables today, it’s important to understand that without high-quality data to operate on, it is far less magical.

Download our eBook to learn how you can measure data quality and track the effectiveness of your data quality improvement efforts.



Case Study: Using Data Quality and Data Management to Improve Patient Care

Mismatched patient data is the third leading cause of preventable death in the United States, according to healthIT.gov, and a 2016 survey by the Ponemon Institute revealed that 86 percent of all healthcare practitioners know of an error caused by incorrect patient data. Patient misidentification is also responsible for 35 percent of denied insurance claims, […]


Melanie Mecca, Director of Data Management Products & Services for CMMI Institute, calls this situation “A classic Master Data and Data Quality problem.” A multitude of different vendors is one of the causes, she said, but “there’s really no standard at all for this data.”

The Health and Human Services Office of the National Coordinator (HHS-ONC) wants to make it safer for patients needing health care by improving those numbers.

“They’re trying to lower the number of duplicates and overlays in the patient identification data – the demographic data – so that they can have fewer instances of record confusion and ensure that records can be matched with patients as close as possible to a hundred percent,” she said.

In the article Improving Patient Data Quality, Part 1: Introduction to the PDDQ Framework, Mecca remarked that “duplicate patient records are a symptom of a deeper and more pervasive issue – the lack of industry-wide adoption of fundamental Data Management practices.” Sources for this case study also include a presentation by Mecca and Jim Halcomb, Strategy Consultant at CMMI, as well as the Patient Demographic Data Quality (PDDQ) Framework, v.7.

The Challenge

Government sources (and CMMI) estimate that 8-12 percent of the average hospital’s records are duplicates, and that as many as 10 percent of incoming patients are misidentified. Sharing of patient data from disparate providers increases the likelihood of duplicates during health information exchanges due to defects in Master Patient Indexes.

Preventable medical errors may include:

  • Misdiagnosis and incorrect treatment procedures
  • Incorrect or wrong dose of medication
  • Incorrect blood type
  • Allergies not known
  • Repeated diagnostic tests

In addition, inaccurate and duplicate records can increase the risk of lawsuits, and can cause claims to be rejected. The cost to correct a duplicate patient record is estimated at $1000.
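To put those figures together in one purely illustrative calculation (the index size below is a hypothetical assumption, not a statistic from the case study):

# Illustrative only: remediation cost implied by the figures above
mpi_records    <- 100000   # hypothetical master patient index size
duplicate_rate <- 0.10     # within the 8-12 percent range cited above
cost_per_fix   <- 1000     # estimated cost to correct one duplicate record
mpi_records * duplicate_rate * cost_per_fix   # roughly $10 million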

Previous attempts have been made to address these issues using algorithms that search for data fragments, but due to a lack of standardized practices:

“Algorithms alone have failed to provide a sustainable solution. Patient record-matching algorithms are necessary, but they are reactive, and do not address the root cause, which is the lack of industry-wide standards for capturing, storing, maintaining, and transferring patient data,” she said.
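For context on what such record-matching algorithms do, and why they remain reactive, here is a minimal, hypothetical sketch of a demographic matching rule; real master patient index matching relies on far more sophisticated probabilistic methods:

# Minimal illustration of demographic record matching (hypothetical logic only).
# Two records are flagged as a likely duplicate if last name, date of birth,
# and ZIP code agree and the first names are very close.
likely_duplicate <- function(a, b) {
  name_dist <- adist(tolower(a$first_name), tolower(b$first_name))[1, 1]
  a$last_name == b$last_name &&
    a$dob == b$dob &&
    a$zip == b$zip &&
    name_dist <= 1   # tolerate one typo, e.g. "Jon" vs "John"
}

rec1 <- list(first_name = "Jon",  last_name = "Smith", dob = "1980-02-14", zip = "97201")
rec2 <- list(first_name = "John", last_name = "Smith", dob = "1980-02-14", zip = "97201")
likely_duplicate(rec1, rec2)   # TRUE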


Finding a Solution

According to the HHS-ONC website , the Office of the National Coordinator for Health Information Technology (ONC) is located within the U.S. Department of Health and Human Services (HHS). The ONC:

“Serves as a resource to the entire health system, promoting nationwide health information exchange to improve health care. ONC is the principal federal entity charged with coordination of nationwide efforts to implement and use the most advanced health information technology and the electronic exchange of health information.”

In line with this mission, HHS-ONC decided to craft a solution to the patient data problem with built-in participation and support across all areas of health care. To this end, ONC assembled a community of practice that included 25 organizations ranging from health Data Management associations such as AHIMA (American Health Information Management Association), to government offices like OSHA (Occupational Safety and Health Administration), and large health care organizations such as Kaiser Permanente, as well as other health care providers.

This community was charged with finding a set of standards and practices that could be used to evaluate existing patient Data Management processes, and a comprehensive tool for bringing organizations into compliance with those standards. “What they were looking for was a Data Management Framework that was complementary to what they were trying to accomplish,” said Mecca.

They chose CMMI’s Data Management Maturity (DMM) model as the best approximation of what they were looking to accomplish. “The DMM’s fact-based approach, enterprise focus, and built-in path for capability growth aligned exactly with the healthcare industry’s need for a comprehensive standard,” she said.

Developing a Tool

CMMI, as a sub-contractor with health information technology company Audacious Inquiry, then used the Data Management Maturity model to determine which practices were essential “specifically for patient demographic Data Quality,” Mecca said. Out of that process came the Patient Demographic Data Quality Framework (PDDQ). The PDDQ offered the HHS-ONC a health care-focused, “sustainable solution for building proactive, defined processes that lead to improved and sustained Data Quality.”

The Patient Demographic Data Quality (PDDQ) Framework: The Solution

The PDDQ Framework:

“Allows organizations to evaluate themselves against key questions designed to foster collaborative discussion and consensus among all involved stakeholders. Its content reflects the typical path that most organizations follow when building proactive, defined processes to influence positive behavioral changes in the management of patient demographic data.”

The PDDQ is composed of 76 questions, organized into five categories with three to five process areas in each, representing the broad topics that need interrogation by the health care organization to understand current practices and determine what activities need to be established, enhanced, and followed.

The questions are supported by contextual information specific to health care providers.

“Data Governance is highly accented, as is the Business Glossary – the business terms used in registration, and terms that providers, and claims, and billing have to agree on, like patient status,” Mecca said.

Examples include illustrative scenarios, such as how a patient name should be entered, or what to enter if the patient has three middle names. The questions and supporting context are intended to serve as an “encouraging and helpful mechanism for discovery.” While the framework encourages good Data Management,

“It does not prescribe how an organization should achieve these capabilities. It should be used by organizations both to assess their current state of capabilities, and as input to a customized roadmap for data management implementation.”

One of the features of the PDDQ is its flexibility to address Data Quality in a variety of environments. It is designed for any organization creating, managing or aggregating patient data, such as hospitals, health systems, Health Information Exchange (HIE) vendors, Master Data Management (MDM) solution vendors, and Electronic Health Record (EHR) vendors. “An organization can implement any combination of categories or process areas, and obtain baseline conclusions about their capabilities,” she said.

The organization can focus on a single process area, a set of process areas, a category, a set of categories, or any combination up to and including the entire PDDQ Framework. This allows flexible application to meet specific organizational needs and to accommodate resource and time constraints.

Using the PDDQ, organizations can quickly assess their current state of Data Management practices, discover gaps, and formulate actionable plans and initiatives to improve management of the organization’s data assets across functional, departmental, and geographic boundaries.

“The PDDQ Framework is designed to serve as both a proven yardstick against which progress can be measured as well as an accelerator for an organization-wide approach to improving Data Quality. Its key questions stimulate knowledge sharing, surface issues, and provide an outline of what the organization should be doing next to more effectively manage this critical data.”

The PDDQ assessment can deliver actionable results within three weeks, leading directly to the implementation phase. “For HHS-ONC, Kaiser did pilots (in Oregon) where they went on site and did data profiling and cleansing of the patient records. During this effort, they used several of the process areas that we wrote for the PDDQ Framework, and they applied them to the organizations.”

A month later when Kaiser checked with the pilot sites, all had made improvements in the way they were managing that data, she said. “And it showed because the matching algorithms had lower incidence of duplicates.”

According to the presentation by Mecca and Halcomb, use of the PDDQ leads to decreased operational risk through improvements to the quality of patient demographic data. Specifically, patient safety is protected and quality in the delivery of patient care improves due to:

  • Increased operational efficiency, requiring less manual effort to fix data issues, fewer duplicate test orders for patients, and adoption of standard data representations.
  • Improved interoperability and data integration through adopting data standards and data management practices that are followed by staff across the patient lifecycle.
  • Improved staff productivity by expending fewer hours on detecting and remediating data defects to perform their tasks.
  • Increased staff awareness for contributing to and following processes that improve patient identity integrity.

Patient data is a common thread throughout the health care system, she said. Capturing or modifying patient data differently magnifies the potential for duplication. The PDDQ helps uncover unexamined processes around patient data and means that health care organizations don’t have to guess about how they’re managing patient data. It clearly identifies gaps, creates awareness about individual responsibility for quality of patient data, engenders cooperation and participation, and sets a baseline for monitoring progress.

“Once gaps and strengths have been identified, organizations can quickly establish timelines for new capabilities and objectives,” she said.

Steps Moving Forward

According to Mecca, managing data is “first and foremost a people problem, not a system problem.” No one individual knows everything about the patient data. Adoption of consistent data standards industry-wide would increase interoperability and minimize duplicates. The PDDQ provides organizational guidance and “an embedded path for successive improvements along with a concentrated education for everyone dealing with patient demographic data,” she said.

“Health care Data Management consultants can employ the PDDQ for their client organizations as a powerful tool to quickly identify gaps, leverage accomplishments, focus priorities, and develop an improvement roadmap with the confidence that all factors have been examined and that consensus has been reached.”

Access the PDDQ

  • The PDDQ and evaluation scoring tool are available at the following location: https://www.healthit.gov/playbook/pddq-framework/
  • A condensed version of the PDDQ, the Ambulatory Guide, contains a core set of questions aimed at very small health care practices, to help them get started in improving Data Quality.  It is available at: https://www.healthit.gov/playbook/ambulatory-guide/

Photo Credit: Micolas/Shutterstock.com



Overview of Data Quality: Examining the Dimensions, Antecedents, and Impacts of Data Quality

Jingran Wang

2 Sogang Business School, Sogang University, 35 Baekbeom-Ro, Mapo-Gu, Seoul, South Korea

3 H-E-B School of Business & Administration, University of the Incarnate Word, 4301 Broadway, San Antonio, TX 78209 USA

1 School of Accounting, Shanghai Lixin University of Accounting and Finance, 2800 Wenxiang Rd, Songjiang District, Shanghai, 201620 China

Zhenxing Lin

Stavros Sindakis

4 School of Social Sciences, Hellenic Open University, 18 Aristotelous Street, Patras, 26335 Greece

Sakshi Aggarwal

5 Institute of Strategy, Entrepreneurship and Education for Growth, Athens, Greece

Associated Data

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Competition in the business world is fierce, and poor decisions can bring disaster to firms, especially in the big data era. Decision quality is determined by data quality, which refers to the degree of data usability. Data is the most valuable resource in the twenty-first century. The open data (OD) movement offers publicly accessible data for the growth of a knowledge-based society. As a result, the idea of OD is a valuable information technology (IT) instrument for promoting personal, societal, and economic growth. Users must control the level of OD in their practices in order to advance these processes globally. Without considering data conformity with norms, standards, and other criteria, what use is it to use data in science or practice only for the sake of using it? This article provides an overview of the dimensions, subdimensions, and metrics utilized in research publications on OD evaluation. To better understand data quality, we review the literature on data quality studies in information systems. We identify the data quality dimensions, antecedents, and their impacts. In this study, the notion of “Data Analytics Competency” is developed and validated as a five-dimensional formative measure (i.e., data quality, the bigness of data, analytical skills, domain knowledge, and tool sophistication) and its effect on corporate decision-making performance is experimentally examined (i.e., decision quality and decision efficiency). By doing so, we provide several research suggestions, which information system (IS) researchers can leverage when investigating future research in data quality.

Introduction

Competition in the business world is fierce, and poor decisions can bring disaster to firms. For example, Nokia’s loss of leadership in the telecommunications industry resulted from its overestimated brand strength and its continued insistence that its superior hardware design would win over users long after the iPhone’s release (Surowiecki, 2013). Making the right decisions leads to better performance (Goll & Rasheed, 2005; Zouari & Abdelhedi, 2021).

Knowledge is a foundational value in our society. Data must be free and open since they are a fundamental requirement for knowledge discovery. In terms of science and application, the open data idea is still in its infancy. The development of open government lies at the heart of this political and economic movement. The President’s Memorandum on Transparency and Open Government, which launched the US open data project in 2009, was followed by the UK government’s open data program, which was established in 2011. While public sectors host the bulk of open data activities, open data extends beyond “open government” to include areas such as science, economics, and culture. Open data is also becoming more significant in research and has the ability to enhance public institutional governance (Kanwar & Sanjeeva, 2022 ; Khan et al., 2021 ; Šlibar et al., 2021 ). Thus, open data may be viewed from various angles, providing a range of direct and indirect advantages. For example, the economic perspective makes the case that open data-based innovation promotes economic expansion. The political and strategic viewpoints heavily emphasize political concerns like security and privacy. The social angle focuses on the advantages of data usage for society. The social perspective also looks at how all citizens might see the advantages of open data (Danish et al., 2019 ; Šlibar et al., 2021 ).

As noted above, numerous studies have found that open data initiatives strive to promote societal values and benefits. The following highlights a few instances of social, political, and economic benefits. Political and social gains include greater openness, increased citizen engagement and empowerment, public trust in government, new government services for citizens, creative social services, improved policy-making procedures, and advances in modeling knowledge. Additionally, there are a number of economic advantages, such as increased economic growth, increased competitiveness, increased innovation, development of new goods and services, and the emergence of new industries that add to the economy (Cho et al., 2021; Ebabu Engidaw, 2021; Šlibar et al., 2021).

In the big data era, data-driven forecasting, real-time analytics, and performance management tools are aspects of next-generation decision support systems (Hosack et al., 2012 ). A high-quality decision based on data analytics can help companies gain a sustained competitive advantage (Davenport & Harris, 2007 ; Russo et al., 2015 ). Data-driven decision-making is a newer form of decision-making. Data-driven decision-making refers to the practice of basing decisions on the analysis of data rather than purely on intuition or experience (Abdouli & Omri, 2021 ; Provost & Fawcett, 2013 ). In data-driven decision-making, data is at the core of decision-making and influences the quality of the decision. The success of data-driven decision-making depends on data quality, which refers to the degree of usable data (Pipino et al., 2002 ; Price et al., 2013 ). This research seeks to investigate (1) what kinds of data can be viewed as high-quality, (2) what factors influence data quality, and (3) how data quality influences decision-making.

The scope of the paper revolves around three methodologies used to examine the dimensions of data quality and to synthesize those dimensions. The findings in the sections below show that data quality has many characteristics, with accuracy, completeness, consistency, timeliness, and relevance considered the most significant ones. Additionally, the paper identifies two important factors that affect data quality, time constraints and data user experience, which are frequently discussed in the literature. By doing this, we clearly illustrate the problems with data quality, point out the gaps in the literature, and raise three key concerns about big data quality.

Moreover, the study’s main contributions are beneficial for upcoming academicians and researchers, as the literature review emphasizes the benefits of utilizing data analytics tools for firm decision-making performance. More research is needed that quantitatively demonstrates the influence of the successful use of data analytics (data analytics competency) on firm decision-making, a gap this study addresses. This area of research is essential because improving firms’ decision-making performance is the overarching goal of data analytics, and understanding the factors affecting it is a novel contribution to the field.

The literature review was built by screening articles, of which 29 related to data quality were ultimately considered. By examining the fundamental aspects of data quality and its impact on decision-making and end users, we take a first step towards gaining a more in-depth understanding of the factors that influence data quality. Previous research has paid insufficient attention to these areas, and we aim to highlight and address that gap.

In addition to this, a thorough review of previous work in the same field helped us identify the research gap. This organized review was divided into several steps: identifying keywords, analyzing citations, calibrating the search strings, and classifying articles by abstract. Based on all the database searches, we found that these 29 articles best discuss data quality, its dimensions, its constructs, and its impact on decision-making, and are more relevant than the other articles included in our study. These articles determine the factors that influence data quality, and the framework we provide helps illustrate a complete description of those factors.

The paper is organized as follows. The literature review is divided into three sections. In the first section, we review the literature and briefly identify the dimensions of data quality. In the second section, we summarize the antecedents of data quality. In the third section, we summarize the impacts of data quality. We then discuss future opportunities for dimensions of big data quality that have been neglected in the data quality literature. Finally, we propose several research directions for future studies in data quality.

Literature Review

Data is so essential to modern life that some have referred to it as the “new oil.” A current illustration of the significance of the data is the management of the COVID-19 epidemic. The early detection of the virus, the prediction of its spread, and the evaluation of the effects of lockdowns were all made possible by data gathered from location-based social network posts and mobility records from telecommunications networks, which allowed for data-driven public health decision-making (Dakkak et al., 2021 ; Shumetie & Watabaji, 2019 ).

As a result, words like “data-driven organization” or “data-driven services” are beginning to appear with the prefix “data-driven” more frequently. Additionally, according to Google Books Ngram, the word “data-driven” has become increasingly popular during the previous 10 years. Data-driven creation, which has been defined as the organization’s capacity to acquire, analyze, and use data to produce new goods and services, generate efficiency, and achieve a competitive advantage, is a trend that also applies to the development of software (Dakkak et al., 2021 ; Maradana et al., 2017 ; Prifti & Alimehmeti, 2017 ). More and more software organizations are implementing data-driven development techniques to take advantage of the potential that data offers. Facebook, Google, and Microsoft, among other cloud and web-based software companies, have been tracking user behavior, identifying their preferences, and running experiments utilizing data from the production environment. The adoption of data-driven techniques is happening more slowly in software-intensive embedded systems, where businesses are still modernizing to include capabilities for continuous data collection from in-service systems. The majority of embedded systems organizations use ad hoc techniques to meet the demands of the individual, the team, or the customer rather than having a systematic and ongoing method for collecting data from in-service products (Carayannis et al., 2012 ; Cho et al., 2021 ; Dakkak et al., 2021 ; Šlibar et al., 2021 ). Therefore, Dakkak et al. ( 2021 ) discussed the two areas we use to identify the data gathering challenges, which are:

Other clients permit automatic data collection but set restrictions on the types of data that may be gathered, when they can be gathered, how they will be used, and how they will be moved. This is frequently the case with clients who have contracts for services like customer support, optimization, or operations, where data could be used for these reasons and must only be available to those carrying out these tasks. The data itself is now used to evolve these services to become data-driven, even while consumers with service-specific data collection agreements prohibit the data from being used for continuous software enhancements (Cho et al., 2021 ; Dakkak et al., 2021 ).

  • Impact on the performance of the product : While some data can be gathered from in-service products without having any adverse effects on their operations, such as network performance evaluation, other data must be instrumented before collecting due to the adverse effects their collection creates on the product’s performance as they require internal resources during the generation and collection times, such as processor and memory power.
  • Data dependability: Given the variety of data kinds, it may be misleading to consider one data type in isolation from the quality standpoint. While a single piece of data can be evaluated against certain quality indicators, such as integrity, developing a comprehensive picture of data quality requires connecting several data sources (Dakkak et al., 2021; Khan et al., 2021; Šlibar et al., 2021).

Data Quality Dimensions

Data quality is the core of big data analytics-based decision support. It is not a unidimensional concept but a multidimensional one (Ballou & Pazer, 1985; Pipino et al., 2002). The identified dimensions include accessibility, amount of data, believability, completeness, concise representation, consistent representation, ease of manipulation, free of error, interpretability, objectivity, relevancy, reputation, security, timeliness, understandability, and value-added (Abdouli & Omri, 2021; Pipino et al., 2002). Furthermore, Cho et al. (2021) highlighted that data quality dimensions are constructs used when evaluating data: criteria or features of data quality that are thought to be crucial for a particular user’s task. For instance, completeness (e.g., are measured values present?), conformance (e.g., do data values comply with prescribed requirements and layouts?), and plausibility (e.g., are data values credible?) could all be used to evaluate the quality of data. Since data quality has multiple dimensions, this section reviews how studies examine data quality dimensions and which dimensions are the most popular. In the data quality literature, three approaches are commonly used to study data quality dimensions (Wang & Strong, 1996).
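These three questions can be made concrete with a small amount of code. The following Python sketch is illustrative only and is not taken from any of the reviewed studies: the field names, expected types, and the plausible age range are hypothetical assumptions chosen for the example.

```python
# Illustrative sketch only: score each record on three data quality dimensions
# (completeness, conformance, plausibility). Field names, expected types, and the
# plausible age range are hypothetical assumptions, not values from the literature.

EXPECTED_TYPES = {"customer_id": str, "age": int, "country": str}  # conformance rules
PLAUSIBLE_AGE = (0, 120)                                           # plausibility rule

records = [
    {"customer_id": "c-001", "age": 34, "country": "PL"},
    {"customer_id": "c-002", "age": None, "country": "DE"},   # missing value
    {"customer_id": "c-003", "age": 214, "country": 49},      # implausible age, wrong type
]

def score(record):
    """Return (completeness, conformance, plausibility) scores in [0, 1]."""
    present = {f: v for f, v in record.items() if v is not None}
    completeness = len(present) / len(EXPECTED_TYPES)
    conformance = sum(isinstance(v, EXPECTED_TYPES[f]) for f, v in present.items()) / max(len(present), 1)
    age = record.get("age")
    # A missing or out-of-range age counts as implausible in this simple example.
    plausibility = 1.0 if isinstance(age, int) and PLAUSIBLE_AGE[0] <= age <= PLAUSIBLE_AGE[1] else 0.0
    return completeness, conformance, plausibility

for r in records:
    print(r["customer_id"], [round(x, 2) for x in score(r)])
```

Per-record scores of this kind can then be aggregated into the dataset-level summaries that the dimension frameworks reviewed in this section describe.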

The first approach is an intuitive approach based on the researchers’ past experience or intuitive understanding of what dimensions are essential (Wang & Strong, 1996 ). The intuitive approach has been used in early studies of data quality (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; DeLone & McLean, 1992 ; Ives et al., 1983 ; Laudon, 1986 ; Morey, 1982 ). For example, Bailey and Pearson ( 1983 ) viewed accuracy, timeliness, precision, reliability, currency, and completeness as important dimensions of the data quality of the output information. Ives et al. ( 1983 ) viewed relevancy, volume, accuracy, precision, currency, timeliness, and completeness as important dimensions of data quality for the output information. Ballou and Pazer ( 1985 ) also viewed accuracy, completeness, consistency, and timeliness as data quality dimensions. Laudon ( 1986 ) used accuracy, completeness, and unambiguousness as essential attributes of data quality included in the information. DeLone and McLean ( 1992 ) used accuracy, timeliness, consistency, completeness, relevance, and reliability as data quality dimensions. Studies also argue that inconsistency is important to data quality (Ballou & Tayi, 1999 ; Bouchoucha & Benammou, 2020 ). Many studies use an intuitive approach to define data quality dimensions because each study can choose the dimensions relevant to the specific purpose of the study. In other words, the intuitive approach allows scholars to choose specific dimensions based on their research context or purpose.

A second approach is a theoretical approach that studies data quality from the perspective of the data manufacturing process. Wang et al. (1995) viewed information systems as data manufacturing systems that work on raw material inputs to produce output material or tangible products. An information system likewise acts on raw data input (such as a file, record, single number, report, or spreadsheet) to generate output data or data products (e.g., a corrected mailing list or a sorted file), and this output can in turn serve as raw data for other data manufacturing systems. The phrase “data manufacturing” urges academics and industry professionals to look for cross-disciplinary analogies that can help transfer knowledge from product assurance to the field of data quality. The phrase “data product” is used to underline that the data output has a value that is passed on to consumers, whether inside or outside the business (Feki & Mnif, 2016; Wang et al., 1995).

From the data manufacturing standpoint, the quality of data products is decided by consumers. In other words, the actual use of data determines the notion of data quality (Wand & Wang, 1996 ). Thus, Wand and Wang ( 1996 ) posit that the analysis of data quality dimensions should be based on four assumptions: (1) information systems can represent real-world systems; (2) information systems design is based on the interpretation of real-world systems; (3) users can infer a view of the real-world systems from the representation created by information systems; (4) only issues related to the internal view are part of the model (Wand & Wang, 1996 ). Based on the representation, interpretation, inference, and internal view assumptions, they proposed intrinsic data quality dimensions, including complete, unambiguous, meaningful, and correct data (Wand & Wang, 1996 ). The theoretical approach provided a more detailed and complete set of data quality dimensions, which are natural and inherent to the data product.

A third approach is empirical, which focuses on analyzing data quality from the user’s viewpoint. A tenet of the empirical approach is the belief that the quality of the data product is decided by its consumers (Wang & Strong, 1996). One of the representative studies was done by Wang and Strong (1996), who defined the dimensions and evaluation of data quality by collecting information from data consumers. In their framework, intrinsic DQ denotes that data has quality in its own right; accuracy is one of the dimensions in this category. Contextual DQ draws attention to the necessity of considering data quality as a component of the task at hand; that is, data must be pertinent, timely, complete, and of an acceptable volume to provide value. Representational DQ and accessibility DQ highlight the relevance of systems: to be effective, a system must present data in a form that is comprehensible, easy to grasp, and consistently expressed (Ghasemaghaei et al., 2018; Ouechtati, 2022; Wang & Strong, 1996). The study argues that a preliminary conceptual framework for data quality should include four aspects: accessible, interpretable, relevant, and accurate. The authors further refined their model into four dimensions: (1) intrinsic data quality, meaning that data should be not only accurate and objective but also believable and reputable; (2) contextual data quality, meaning that data quality must be considered within the context of the task; (3) representational data quality, meaning that data quality should include both the format and the meaning of data; and (4) accessible data quality, which is also a significant dimension of data quality from the consumer’s viewpoint.

Fisher and Kingma (2001) used these dimensions of data quality to analyze the causes of two disasters in US history: the explosion of the space shuttle Challenger and the mistaken firing by the USS Vincennes. Accuracy, timeliness, consistency, completeness, relevancy, and fitness for use were used as data quality dimensions (Barkhordari et al., 2019; Fisher & Kingma, 2001). In their study, accuracy means a lack of error between recorded and real-world values. Timeliness means the recorded value is up to date. Completeness concerns whether all relevant data is recorded. Consistency means data values do not change across records. Relevance means data should relate to the issue at hand, and fitness for use means data should serve the user’s purpose (Fisher & Kingma, 2001; Strong, 1997; Tayi & Ballou, 1998). Data quality should therefore depend on purpose (Shankaranarayanan & Cai, 2006). This view of data quality is also used in credit risk management. Parssian et al. (2004) viewed information as a product and presented a method to assess data quality for information products. They focused mainly on accuracy and completeness because they considered these two factors the most important for decision-making (Parssian et al., 2004; Reforgiato Recupero et al., 2016); although they viewed information as a product, they still evaluated these factors from the user’s viewpoint. Studies have also evaluated a model for cost-effective data quality management in customer relationship management (CRM) (Even et al., 2010). Moges et al. (2013) argued that completeness, interpretability, reputability, traceability, understandability, appropriate amount, alignment, and concise representation are important dimensions of data quality in credit risk management (Danish et al., 2019; Moges et al., 2013). These studies treat data quality dimensions as involving the voice of data consumers. Examining data quality dimensions from the user’s point of view is one of the most important characteristics of empirical approaches (Even et al., 2010; Fisher & Kingma, 2001; Moges et al., 2013; Parssian et al., 2004; Shankaranarayanan & Cai, 2006; Strong, 1997; Tayi & Ballou, 1998).

The intuitive approach is the easiest way to examine data quality dimensions, and the theoretical approach is supported by theory. However, both approaches overlook the user, the most important judge of data quality: data consumers decide whether data is of high or poor quality. At the same time, it is difficult to prove through fundamental principles that the results gained from empirical approaches are complete and precise (Wang et al., 1995; Prifti & Alimehmeti, 2017). Based on the studies reviewed, we summarize data quality dimensions and comparative studies (Table 1). The results indicate that completeness, accuracy, timeliness, consistency, and relevance are the top five dimensions of data quality mentioned in studies.

Summary of data quality dimensions

Factors that Influence Data Quality (Antecedents)

Several studies try to determine the factors that influence data quality. Wang et al. (1995) proposed a framework that included seven elements that influence data quality: management responsibility, operation and assurance costs, research and development, production, distribution, personnel management, and legal issues (Wang et al., 1995). This framework provides a complete description of the factors influencing data quality, but it is challenging to implement because of its complexity. Ballou and Pazer (1995) studied the accuracy-timeliness tradeoff and argued that accuracy improves with time and will increase data quality (Ballou & Pazer, 1995; Šlibar et al., 2021). Experience also influences data quality by affecting the use of incomplete data (Ahituv et al., 1998): if decision-makers are familiar with the data, they may be able to use intuition to compensate for problems (Chengalur-Smith et al., 1999). Studies have also indicated that information overload affects data quality by reducing data accuracy (Berghel, 1997; Cho et al., 2021). Later, scholars pointed out that information overload, experience level, and time constraints affect data quality by influencing the way decision-makers use information (e.g., Fisher & Kingma, 2001). The top ten antecedents of data quality have been identified through a literature review of the antecedents of data quality. Table 2 presents the summary of antecedents of data quality.

Summary of antecedents of data quality

Impact of Data Quality

The impact of data quality on decision-making and the impact of data quality on end users are the two main themes. Studies of the impact of data quality on decision-making frequently use data quality information (DQI), a general evaluation of data quality and data sets (Chengalur-Smith et al., 1999; Fisher et al., 2003). After considering the decision environment, Chengalur-Smith et al. (1999) argued that DQI influences decision-making differently across tasks, decision strategies, and the context in which the DQI is presented. Later, Fisher et al. (2003) examined the influence of experience and time on the use of DQI in the decision-making process and developed a detailed model to explain the factors linking DQI and the decision outcome. Through this research, Fisher et al. (2003) argued that (1) experts use DQI more frequently than novices do; (2) managerial experience positively influences the use of DQI, but domain experience does not; (3) DQI is useful for managers with little domain-specific experience, and training experts in the use of DQI is worthwhile; and (4) the availability of DQI has more influence on decision-makers who feel time pressure than on those who do not (Cho et al., 2021; Fisher et al., 2003). According to Price and Shanks (2011), metadata depicting data quality (DQ) can be viewed as DQ tags. They found that DQ tags can not only increase decision time but also change decision choices: DQ tags are associated with increased cognitive processing in the early decision-making phases, which delays the generation of decision alternatives (Price & Shanks, 2011). Another stream of work on the impact of data quality on decision-making focused on implementing data quality management to support decision-making. The data quality management framework was mainly built on the information product view (Ballou et al., 1998; Wang et al., 1998), and total data quality management (TDQM) and the information product map (IPMAP) were developed on this basis. Studies of data quality management have focused more on context. For example, Shankaranarayanan and Cai (2006) constructed a data quality standard framework for B2B e-commerce, proposing a three-layer solution based on IPMAP and the IP view that includes the DQ 9000 quality management standard, standardized data quality specification metadata, and third-party DQ certification issuers (Sarkheyli & Sarkheyli, 2019; Shankaranarayanan & Cai, 2006). The representative research on the impact of data quality on end users was conducted by Foshay et al. (2007). They argued that end-user metadata affects user attitudes toward data in databases and that end-user metadata elements strongly influence user attitudes toward data in the warehouse. These metadata elements have an impact similar to that of the “other factors” examined: data quality, business intelligence tool utility, and user views of training quality. Together with these other characteristics, metadata factors appear to have a considerable impact on attitudes. This finding is important: it implies that metadata plays a significant role in determining whether a user will have a favorable opinion of a data warehouse (Dranev et al., 2020; Foshay et al., 2007) (Table 3).

Summary of data quality impacts

The study’s “other factors” do not seem to have much of a direct impact on the utilization of data warehouses; like the metadata factors, they influence use indirectly. Of all these other factors, perceived data quality had the most significant impact on users’ attitudes toward data. User perceptions of the data available from the warehouse therefore strongly shape how useful and easy to use the data warehouse is thought to be, and perceived usefulness and perceived ease of use in turn influence the amount of data warehouse use to a moderately substantial extent. It would thus seem that variables beyond perceived usefulness and usability also play a role in deciding how widely data warehouses are used. Also, the degree to which end-user metadata quality and use influence user attitudes depends critically on the user’s experience in accessing a data warehouse (Foshay et al., 2007; Zhuo et al., 2021).

Through the literature review of these 29 articles related to data quality in the IS field, we found that data quality has multiple dimensions, and that completeness (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Côrte-Real et al., 2020 ; DeLone & McLean, 1992 ; Even et al., 2010 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Laudon, 1986 ; Moges et al., 2013 ; Parssian et al., 2004 ; Shankaranarayanan & Cai, 2006 ; Šlibar et al., 2021 ; Wand & Wang, 1996 ; Wang & Strong, 1996 ; Zouari & Abdelhedi, 2021 ), accuracy (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ghasemaghaei et al., 2018 ; Ives et al., 1983 ; Juddoo & George, 2018 ; Laudon, 1986 ; Morey, 1982 ; Parssian et al., 2004 ; Safarov, 2019 ; Wand & Wang, 1996 ; Wang & Strong, 1996 ), timeliness (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Cho et al., 2021 ; Côrte-Real et al., 2020 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Šlibar et al., 2021 ; Wang & Strong, 1996 ), consistency (Ballou & Pazer, 1985 ; Ballou & Tayi, 1999 ; Cho et al., 2021 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ghasemaghaei et al., 2018 ; Wang & Strong, 1996 ), and relevance (Bailey & Pearson, 1983 ; Côrte-Real et al., 2020 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Klein et al., 2018 ; Šlibar et al., 2021 ; Wang & Strong, 1996 ) are the top five data quality dimensions mentioned in studies.

However, existing studies focus on multiple dimensions of traditional data quality and do not address the new dimensions of big data quality. In traditional data, timeliness is important; however, one new attribute of big data is its real-time delivery, so we do not know whether timeliness will still play an important role among the dimensions of big data. Volume is also a new attribute of big data. Several papers address the volume of data (Ives et al., 1983; Moges et al., 2013; Šlibar et al., 2021; Wang & Strong, 1996). One reason traditional data quality highlights the role of volume is that it is hard to get enough data; in the era of big data, however, enormous amounts of data are available, and volume is no longer a major constraint. Therefore, we do not know whether volume will remain an important attribute of big data quality. Value-added is also one of the traditional data quality dimensions (Dakkak et al., 2021; Sarkheyli & Sarkheyli, 2019; Wang & Strong, 1996). Big data’s value, however, is still uncertain, and whether value will become an important attribute of big data quality needs further study. Recent studies indicate that volume, variety, velocity, value, and veracity (the 5 Vs) are five common characteristics of big data (Cho et al., 2021; Firmani et al., 2019; Gordon, 2013; Hook et al., 2018). Nevertheless, few studies have investigated the impacts of the 5 V dimensions on big data quality.

Also, existing studies identified several factors that influence data quality, such as time pressure (Ballou & Pazer, 1995; Cho et al., 2021; Côrte-Real et al., 2020; Fisher & Kingma, 2001; Mock, 1971), data user experience (Ahituv et al., 1998; Chengalur-Smith et al., 1999; Cho et al., 2021; Fisher & Kingma, 2001), and information overload (Berghel, 1997; Fisher & Kingma, 2001; Hook et al., 2018). There are, however, few studies explaining what new factors influence big data quality. For example, existing studies discuss time pressure (Ballou & Pazer, 1995; Fisher & Kingma, 2001; Mock, 1971; Zhuo et al., 2021) more than information overload (Berghel, 1997; Fisher & Kingma, 2001; Klein et al., 2018; Sarkheyli & Sarkheyli, 2019). But big data volumes are enormous and data arrives in real time, which may make information overload a more serious problem than time pressure. The variety dimension of big data means that data structures vary widely, which can cause problems when unstructured data is converted into structured data. Human error or system error may also constitute new factors influencing big data quality. Most research related to data quality considers how data quality affects decision-making, but few studies have discussed the still largely unknown impact of big data quality. Recent studies indicate that future decisions will be based on data analytics and that our world is data-driven (Davenport & Harris, 2007; Juddoo & George, 2018; Loveman, 2003).

Based on the literature review and the research gaps identified, we propose several future research directions related to data quality within the big data context. First, future studies on data quality dimensions should focus more on the 5 V dimensions of big data quality to identify new attributes of big data quality, and should examine possible changes in the other quality dimensions, such as accuracy and timeliness. Second, future research should identify the new impacts of big data quality on decision-making by answering how big data quality influences decision-making and by finding other issues related to big data quality (Davenport & Patil, 2012; Safarov, 2019). Third, future research should investigate the various factors influencing big data quality. Finally, future research should also investigate how to leverage a firm’s capabilities to improve big data quality.

There is some evidence that adopting data analytics tools can help businesses make better decisions, yet studies show that many businesses that invested in data analytics were unable to fully utilize these capabilities. Although the academic and practitioner literature emphasizes the benefit of data analytics tools for firm decision-making effectiveness, a study that quantitatively demonstrates the influence of the successful use of data analytics (data analytics competency) on firm decision-making is lacking. We therefore set out to investigate how this impact operates. Understanding the elements affecting it is a novel addition to the data analytics literature, because improving firms’ decision-making performance is the ultimate purpose of data analysis in the realm of data analytics. In this research, we filled this knowledge gap by using Huber’s (1990) theory of the effects of advanced IT on decision-making and Bharadwaj’s (2000) framework of key IT-based resources to describe and justify data analytics competency for enterprises as a multidimensional formative index, and to create and validate a framework predicting the role of data analytics competency in firm decision-making performance (i.e., decision quality and efficiency). These two contributions represent aspects not yet discussed in the IS literature.

Furthermore, in this work, various techniques were used to identify the data quality dimensions of user-generated wearable device data. A literature analysis and a survey were conducted to understand the data quality issues investigators face and their perspectives on data quality dimensions. Domain specialists then selected the appropriate dimensions based on this information (Cho et al., 2021; Ghasemaghaei et al., 2018).

Completeness

In this analysis, the contextual data quality characteristics of breadth completeness and density completeness were considered crucial for conducting research. It is critical to evaluate the breadth and completeness of data sets, especially those gathered in a bring-your-own-device research environment. Researchers can also define completeness as the number of valid days required within a specific data collection period, or the frequency with which the data must be present, for an individual’s data to be included in the analysis. Further research is required to establish how completeness should be defined in research studies, because recently launched devices can measure numerous data types and gather data continuously for years (Côrte-Real et al., 2020).
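As a concrete illustration of the valid-day notion of completeness described above, the sketch below checks whether a participant has enough valid days in a collection window. The wear-time and valid-day thresholds are hypothetical choices that a study protocol would have to define; they are not taken from the cited work.

```python
# Hedged sketch of a "valid days" completeness rule. The thresholds (>= 600 wear
# minutes for a day to count as valid, >= 10 valid days in a 14-day window) are
# hypothetical protocol choices, not standards from the literature.
from datetime import date

def is_valid_day(wear_minutes: int, min_wear_minutes: int = 600) -> bool:
    """A day counts as valid only if the device was worn long enough."""
    return wear_minutes >= min_wear_minutes

def completeness_ok(daily_wear: dict, window_days: int = 14, min_valid_days: int = 10) -> bool:
    """Include a participant only if enough valid days fall inside the collection window."""
    valid_days = sum(is_valid_day(minutes) for minutes in daily_wear.values())
    return len(daily_wear) <= window_days and valid_days >= min_valid_days

# Example: a 14-day window where only every other day reaches the wear-time threshold.
wear = {date(2024, 1, day): (700 if day % 2 else 120) for day in range(1, 15)}
print(completeness_ok(wear))   # False: only 7 of 14 days count as valid
```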

Conformance

While value, relational, and computational conformance are all seen as crucial aspects of wearable device data, they present difficulties for data administration and quality evaluation: value and relational conformance can only be evaluated against the data dictionary and relational model specific to the model, brand, and version of the device, and only when this information is publicly available.
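The sketch below illustrates what value and relational conformance checks against such a data dictionary might look like. The dictionary, field names, and the paired-field rule are invented placeholders, since the real data dictionary and relational model are specific to the device model, brand, and version and, as noted above, may not be publicly available.

```python
# Hedged sketch of value and relational conformance checks. The data dictionary and
# the relational rule below are invented placeholders standing in for the
# device-specific documentation discussed above.

DATA_DICTIONARY = {
    "heart_rate": {"type": int, "unit": "bpm", "allowed_range": (20, 250)},
    "activity_type": {"type": str, "allowed_values": {"walk", "run", "sleep", "other"}},
}
RELATIONAL_RULE = ("heart_rate", "activity_type")  # these fields must be recorded together

def value_conformance(record):
    """Flag fields whose values violate the dictionary's type or domain constraints."""
    issues = []
    for field, spec in DATA_DICTIONARY.items():
        if field not in record:
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            issues.append(f"{field}: wrong type {type(value).__name__}")
        elif "allowed_range" in spec and not spec["allowed_range"][0] <= value <= spec["allowed_range"][1]:
            issues.append(f"{field}: {value} outside {spec['allowed_range']}")
        elif "allowed_values" in spec and value not in spec["allowed_values"]:
            issues.append(f"{field}: unexpected value {value!r}")
    return issues

def relational_conformance(record):
    """Flag records where only one half of a linked field pair is present."""
    first, second = RELATIONAL_RULE
    if (first in record) != (second in record):
        return [f"fields {first} and {second} must appear together"]
    return []

sample = {"heart_rate": 300, "activity_type": "swim"}
print(value_conformance(sample) + relational_conformance(sample))
# ['heart_rate: 300 outside (20, 250)', "activity_type: unexpected value 'swim'"]
```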

Plausibility

Plausibility fits with researchers’ demands for precise data values. For example, data might be judged implausible when step counts are higher than expected but the associated heart rate values are lower than usual. Before beginning an investigation, researchers frequently create their own ad hoc standards for judging plausibility. However, creating a collection of prospective data quality criteria requires extensive domain expertise and expert time. Therefore, developing a knowledge base of data quality guidelines for user-generated wearable device data would not only help future researchers save time but also eliminate the need for ad hoc data quality rules (Cho et al., 2021; Dakkak et al., 2021).
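A rule of the kind described here, high step counts paired with an unusually low heart rate, can be expressed as a simple check, as in the sketch below. The cutoff values are hypothetical; in practice they would come from domain experts or from the kind of curated knowledge base of data quality rules suggested above.

```python
# Sketch of a cross-signal plausibility rule: very high step counts combined with a
# very low average heart rate are flagged as implausible. The cutoffs (15,000 steps,
# 55 bpm) are hypothetical and would normally come from domain experts or a shared
# knowledge base of data quality rules rather than ad hoc choices.

def implausible_steps_vs_heart_rate(daily_steps, avg_heart_rate,
                                    step_cutoff=15_000, hr_cutoff=55.0):
    """Return True when the two signals contradict each other."""
    return daily_steps > step_cutoff and avg_heart_rate < hr_cutoff

observations = [
    {"day": "2024-01-05", "steps": 18_400, "avg_hr": 48.0},   # flagged
    {"day": "2024-01-06", "steps": 9_200, "avg_hr": 71.0},    # plausible
]
flagged = [o["day"] for o in observations
           if implausible_steps_vs_heart_rate(o["steps"], o["avg_hr"])]
print(flagged)   # ['2024-01-05']
```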

Theoretical Implications

In this paper, we reviewed the literature related to data quality. We identified the three approaches used to study the dimensions of data quality and summarized those dimensions by referencing several works of research in this area. The results indicate that data quality has multiple dimensions, with accuracy, completeness, consistency, timeliness, and relevance viewed as the most important. Through the literature review, we also identified two factors that influence data quality and are frequently mentioned throughout the literature: time pressure and data user experience. We further identified the impact of data quality through existing studies of data quality impacts and found that many studies examine the impact of data quality on decision-making. By doing so, we depicted a clear picture of issues related to data quality, identified the gaps in existing research, and proposed several important questions related to big data quality.

As noted above, although the academic and practitioner literature emphasizes the benefit of data analytics tools for firm decision-making performance, quantitative evidence of how data analytics competency influences firm decision-making is lacking (Bridge, 2022). We therefore set out to investigate how this impact operates. Understanding the elements affecting it is a novel addition to the data analytics literature, because improving firms’ decision-making performance is the overarching goal of data analysis in the realm of data analytics.

Surprisingly, later investigations found that although the size of the data dramatically improves the quality of firm decision-making, it has no discernible effect on decision efficiency. This indicates that while large amounts of data are a valuable resource for increasing the quality of company decisions, they do not increase the speed at which decisions can be made; the difficulties in gathering, managing, and evaluating massive amounts of data may be to blame. Decision quality and decision efficiency were strongly influenced by all other first-order constructs, including data quality, analytical ability, subject expertise, and tool sophistication.

Our review of the literature does have limitations. Although we summarized data quality dimensions, antecedents, and impacts, we may have overlooked others because of the limited number of papers reviewed. To build a comprehensive understanding of data quality, we suggest that further research review a larger body of papers related to data quality to reveal additional dimensions, antecedents, and impacts.

Managerial Implications

The findings of this study have significant ramifications for managers who use data analytics to their benefit. Organizations that make sizable investments in these technologies do so primarily to enhance decision-making performance. Consequently, managers must pay close attention to strengthening the dimensions of data analytics competency, because these dimensions explain a large portion of the variance in decision-making performance (Ghasemaghaei et al., 2018). Without such competency, the use of analytical tools may fail to enhance organizational decision-making performance.

Companies could, for instance, invest in training to enhance employees’ analytical skills and thereby improve firm decision-making; when employees are equipped with the skills their jobs demand, the quality of their work increases. Furthermore, managers must ensure that staff members who use data analytics to make important choices have the domain expertise needed to use the tools correctly and interpret the findings. When purchasing data analytics tools, managers can apply careful selection procedures to ensure the chosen tools are powerful enough to support all the data required for current and upcoming analytical tasks. Finally, managers who want to increase their firm’s data analytics proficiency must invest in data quality to speed up information processing and increase the efficiency of business decisions.

Ideas for Future Research

It is essential to recognize the limits of this study, as with all studies. First, factors other than data analytics proficiency can have an impact on how well a company makes decisions. Future research is also necessary to better understand how other factors (such as organizational structure and business procedures) affect the effectiveness of company decision-making. Second, open data research is a new area of study, and the current assessment of open data within existing research has space for improvement, according to the initial literature review. This improvement can target a variety of open data paradigm elements, including commonly acknowledged dataset attributes, publishing specifications for open datasets, adherence to specific policies, necessary open data infrastructure functionalities, assessment processes of datasets, openness, accountability, involvement or collaboration, and evaluation of economic, social, political, and human value in open data initiatives. Because open data is, by definition, free, accessible to the general public, nonexclusive (unrestricted by copyrights, patents, etc.), open-licensed, usability-structured, and so forth, its use may be advantageous to a variety of stakeholders. These advantages can include the creation of new jobs, economic expansion, the introduction of new goods and services, the improvement of already existing ones, a rise in citizen participation, and assistance in decision-making. Consequently, the open data paradigm illustrates how IT may support social, economic, and personal growth.

Finally, the specific decisions that were made were not explicitly covered by this study. Future studies are necessary to examine the impact of data analytics competency and each of its dimensions on decision-making consequences in particular contexts, as the relative importance of data analytics competency and its many dimensions may change depending on the type of decision being made (e.g., recruitment processes, marketing promotions).

The paper was supported by the H-E-B School of Business & Administration, the University of the Incarnate Word, and the Social Science Foundation of the Ministry of Education of China (16YJA630025).

Data Availability

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Jingran Wang, Email: jrwang1@sina.com.

Yi Liu, Email: yiliu5@uiwtx.edu.

Peigong Li, Email: pgli@xmu.edu.cn.

Zhenxing Lin, Email: linzx@lixin.edu.cn.

Stavros Sindakis, Email: [email protected] .

Sakshi Aggarwal, Email: [email protected] .

  • Abdouli M, Omri A. Exploring the nexus among FDI inflows, environmental quality, human capital, and economic growth in the Mediterranean region. Journal of the Knowledge Economy. 2021;12(2):788–810. doi: 10.1007/s13132-020-00641-5.
  • Ahituv N, Igbaria M, Sella AV. The effects of time pressure and completeness of information on decision making. Journal of Management Information Systems. 1998;15(2):153–172. doi: 10.1080/07421222.1998.11518212.
  • Bailey JE, Pearson SW. Development of a tool for measuring and analyzing computer user satisfaction. Management Science. 1983;29(5):530–545. doi: 10.1287/mnsc.29.5.530.
  • Ballou DP, Pazer HL. Modeling data and process quality in multi-input, multi-output information systems. Management Science. 1985;31(2):150–162. doi: 10.1287/mnsc.31.2.150.
  • Ballou DP, Pazer HL. Designing information systems to optimize the accuracy-timeliness tradeoff. Information Systems Research. 1995;6(1):51–72. doi: 10.1287/isre.6.1.51.
  • Ballou DP, Tayi GK. Enhancing data quality in data warehouse environments. Communications of the ACM. 1999;42(1):73–78. doi: 10.1145/291469.291471.
  • Ballou D, Wang R, Pazer H, Tayi GK. Modeling information manufacturing systems to determine information product quality. Management Science. 1998;44(4):462–484. doi: 10.1287/mnsc.44.4.462.
  • Barkhordari S, Fattahi M, Azimi NA. The impact of knowledge-based economy on growth performance: Evidence from MENA countries. Journal of the Knowledge Economy. 2019;10(3):1168–1182. doi: 10.1007/s13132-018-0522-4.
  • Berghel H. Cyberspace 2000: Dealing with information overload. Communications of the ACM. 1997;40(2):19–24. doi: 10.1145/253671.253680.
  • Bharadwaj, A. S. (2000). A resource-based perspective on information technology capability and firm performance: An empirical investigation. MIS Quarterly, 169–196.
  • Bouchoucha N, Benammou S. Does institutional quality matter foreign direct investment? Evidence from African countries. Journal of the Knowledge Economy. 2020;11(1):390–404. doi: 10.1007/s13132-018-0552-y.
  • Bridge, J. (2022). A quantitative study of the relationship of data quality dimensions and user satisfaction with cyber threat intelligence (Doctoral dissertation, Capella University).
  • Carayannis EG, Barth TD, Campbell DF. The Quintuple Helix innovation model: Global warming as a challenge and driver for innovation. Journal of Innovation and Entrepreneurship. 2012;1(1):1–12. doi: 10.1186/2192-5372-1-1.
  • Chengalur-Smith IN, Ballou DP, Pazer HL. The impact of data quality information on decision making: An exploratory analysis. IEEE Transactions on Knowledge and Data Engineering. 1999;11(6):853–864. doi: 10.1109/69.824597.
  • Cho S, Weng C, Kahn MG, Natarajan K. Identifying data quality dimensions for person-generated wearable device data: Multi-method study. JMIR mHealth and uHealth. 2021;9(12):e31618. doi: 10.2196/31618.
  • Côrte-Real N, Ruivo P, Oliveira T. Leveraging internet of things and big data analytics initiatives in European and American firms: Is data quality a way to extract business value? Information & Management. 2020;57(1):103141. doi: 10.1016/j.im.2019.01.003.
  • Dakkak, A., Zhang, H., Mattos, D. I., Bosch, J., & Olsson, H. H. (2021, December). Towards continuous data collection from in-service products: Exploring the relation between data dimensions and collection challenges. In 2021 28th Asia-Pacific Software Engineering Conference (APSEC) (pp. 243–252). IEEE.
  • Danish RQ, Asghar J, Ahmad Z, Ali HF. Factors affecting “entrepreneurial culture”: The mediating role of creativity. Journal of Innovation and Entrepreneurship. 2019;8(1):1–12. doi: 10.1186/s13731-019-0108-9.
  • Davenport TH, Harris JG. Competing on analytics: The new science of winning. Harvard Business Press; 2007.
  • Davenport TH, Patil DJ. Data scientist. Harvard Business Review. 2012;90(5):70–76.
  • DeLone WH, McLean ER. Information systems success: The quest for the dependent variable. Information Systems Research. 1992;3(1):60–95. doi: 10.1287/isre.3.1.60.
  • Dranev Y, Izosimova A, Meissner D. Organizational ambidexterity and performance: Assessment approaches and empirical evidence. Journal of the Knowledge Economy. 2020;11(2):676–691. doi: 10.1007/s13132-018-0560-y.
  • Ebabu Engidaw A. The effect of external factors on industry performance: The case of Lalibela City micro and small enterprises, Ethiopia. Journal of Innovation and Entrepreneurship. 2021;10(1):1–14.
  • Even A, Shankaranarayanan G, Berger PD. Evaluating a model for cost-effective data quality management in a real-world CRM setting. Decision Support Systems. 2010;50(1):152–163. doi: 10.1016/j.dss.2010.07.011.
  • Feki C, Mnif S. Entrepreneurship, technological innovation, and economic growth: Empirical analysis of panel data. Journal of the Knowledge Economy. 2016;7(4):984–999. doi: 10.1007/s13132-016-0413-5.
  • Firmani D, Tanca L, Torlone R. Ethical dimensions for data quality. Journal of Data and Information Quality (JDIQ). 2019;12(1):1–5.
  • Fisher CW, Kingma BR. Criticality of data quality as exemplified in two disasters. Information & Management. 2001;39(2):109–116. doi: 10.1016/S0378-7206(01)00083-0.
  • Fisher CW, Chengalur-Smith I, Ballou DP. The impact of experience and time on the use of data quality information in decision making. Information Systems Research. 2003;14(2):170–188. doi: 10.1287/isre.14.2.170.16017.
  • Foshay N, Mukherjee A, Taylor A. Does data warehouse end-user metadata add value? Communications of the ACM. 2007;50(11):70–77. doi: 10.1145/1297797.1297800.
  • Ghasemaghaei M, Ebrahimi S, Hassanein K. Data analytics competency for improving firm decision making performance. The Journal of Strategic Information Systems. 2018;27(1):101–113. doi: 10.1016/j.jsis.2017.10.001.
  • Goll I, Rasheed AA. The relationships between top management demographic characteristics, rational decision making, environmental munificence, and firm performance. Organization Studies. 2005;26(7):999–1023. doi: 10.1177/0170840605053538.
  • Gordon K. What is big data? Itnow. 2013;55(3):12–13. doi: 10.1093/itnow/bwt037.
  • Hook DW, Porter SJ, Herzog C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics. 2018;3:23. doi: 10.3389/frma.2018.00023.
  • Hosack B, Hall D, Paradice D, Courtney JF. A look toward the future: Decision support systems research is alive and well. Journal of the Association for Information Systems. 2012;13(5):3. doi: 10.17705/1jais.00297.
  • Huber, G. P. (1990). A theory of the effects of advanced information technologies on organizational design, intelligence, and decision making. Academy of Management Review, 15(1), 47–71. https://www.jstor.org/stable/258105
  • Ives B, Olson MH, Baroudi JJ. The measurement of user information satisfaction. Communications of the ACM. 1983;26(10):785–793. doi: 10.1145/358413.358430.
  • Juddoo, S., & George, C. (2018). Discovering most important data quality dimensions using latent semantic analysis. In 2018 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD) (pp. 1–6). IEEE.
  • Kanwar A, Sanjeeva M. Student satisfaction survey: A key for quality improvement in the higher education institution. Journal of Innovation and Entrepreneurship. 2022;11(1):1–10. doi: 10.1186/s13731-022-00196-6.
  • Khan RU, Salamzadeh Y, Shah SZA, Hussain M. Factors affecting women entrepreneurs’ success: A study of small-and medium-sized enterprises in emerging market of Pakistan. Journal of Innovation and Entrepreneurship. 2021;10(1):1–21. doi: 10.1186/s13731-021-00145-9.
  • Klein RH, Klein DB, Luciano EM. Open Government Data: Concepts, approaches and dimensions over time. Revista Economia & Gestão. 2018;18(49):4–24. doi: 10.5752/P.1984-6606.2018v18n49p4-24.
  • Laudon KC. Data quality and due process in large interorganizational record systems. Communications of the ACM. 1986;29(1):4–11. doi: 10.1145/5465.5466.
  • Loveman G. Diamonds in the data mine. Harvard Business Review. 2003;81(5):109–113.
  • Maradana RP, Pradhan RP, Dash S, Gaurav K, Jayakumar M, Chatterjee D. Does innovation promote economic growth? Evidence from European countries. Journal of Innovation and Entrepreneurship. 2017;6(1):1–23. doi: 10.1186/s13731-016-0061-9.
  • Mock TJ. Concepts of information value and accounting. The Accounting Review. 1971;46(4):765–778.
  • Moges HT, Dejaeger K, Lemahieu W, Baesens B. A multidimensional analysis of data quality for credit risk management: New insights and challenges. Information & Management. 2013;50(1):43–58. doi: 10.1016/j.im.2012.10.001.
  • Morey RC. Estimating and improving the quality of information in a MIS. Communications of the ACM. 1982;25(5):337–342. doi: 10.1145/358506.358520.
  • Ouechtati, I. (2022). Financial inclusion, institutional quality, and inequality: An empirical analysis. Journal of the Knowledge Economy, 1–25. doi: 10.1007/s13132-022-00909-y.
  • Parssian A, Sarkar S, Jacob VS. Assessing data quality for information products: Impact of selection, projection, and Cartesian product. Management Science. 2004;50(7):967–982. doi: 10.1287/mnsc.1040.0237.
  • Pipino LL, Lee YW, Wang RY. Data quality assessment. Communications of the ACM. 2002;45(4):211–218. doi: 10.1145/505248.506010.
  • Price R, Shanks G. The impact of data quality tags on decision-making outcomes and process. Journal of the Association for Information Systems. 2011;12(4):1. doi: 10.17705/1jais.00264.
  • Price DP, Stoica M, Boncella RJ. The relationship between innovation, knowledge, and performance in family and non-family firms: An analysis of SMEs. Journal of Innovation and Entrepreneurship. 2013;2(1):1–20. doi: 10.1186/2192-5372-2-14.
  • Prifti R, Alimehmeti G. Market orientation, innovation, and firm performance—An analysis of Albanian firms. Journal of Innovation and Entrepreneurship. 2017;6(1):1–19. doi: 10.1186/s13731-017-0069-9.
  • Provost F, Fawcett T. Data science and its relationship to big data and data-driven decision making. Big Data. 2013;1(1):51–59. doi: 10.1089/big.2013.1508.
  • Reforgiato Recupero, D., Castronovo, M., Consoli, S., Costanzo, T., Gangemi, A., Grasso, L., ... & Spampinato, E. (2016). An innovative, open, interoperable citizen engagement cloud platform for smart government and users’ interaction. Journal of the Knowledge Economy, 7(2), 388–412.
  • Russo G, Marsigalia B, Evangelista F, Palmaccio M, Maggioni M. Exploring regulations and scope of the Internet of Things in contemporary companies: A first literature analysis. Journal of Innovation and Entrepreneurship. 2015;4(1):1–13. doi: 10.1186/s13731-015-0025-5.
  • Safarov I. Institutional dimensions of open government data implementation: Evidence from the Netherlands, Sweden, and the UK. Public Performance & Management Review. 2019;42(2):305–328. doi: 10.1080/15309576.2018.1438296.
  • Sarkheyli, A., & Sarkheyli, E. (2019). Smart megaprojects in smart cities, dimensions, and challenges. In Smart Cities Cybersecurity and Privacy (pp. 269–277). Elsevier.
  • Shankaranarayanan G, Cai Y. Supporting data quality management in decision-making. Decision Support Systems. 2006;42(1):302–317. doi: 10.1016/j.dss.2004.12.006.
  • Shumetie A, Watabaji MD. Effect of corruption and political instability on enterprises’ innovativeness in Ethiopia: Pooled data based. Journal of Innovation and Entrepreneurship. 2019;8(1):1–19. doi: 10.1186/s13731-019-0107-x.
  • Šlibar B, Oreški D, Begičević Ređep N. Importance of the open data assessment: An insight into the (meta) data quality dimensions. SAGE Open. 2021;11(2):21582440211023178. doi: 10.1177/21582440211023178.
  • Strong DM. IT process designs for improving information quality and reducing exception handling: A simulation experiment. Information & Management. 1997;31(5):251–263. doi: 10.1016/S0378-7206(96)01089-0.
  • Surowiecki, J. (2013). “Where Nokia Went Wrong.” Retrieved February 20, 2020, from http://www.newyorker.com/business/currency/where-Nokia-went-wrong
  • Tayi GK, Ballou DP. Examining data quality. Communications of the ACM. 1998;41(2):54–57. doi: 10.1145/269012.269021.
  • Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Communications of the ACM. 1996;39(11):86–95. doi: 10.1145/240455.240479.
  • Wang RY, Strong DM. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems. 1996;12(4):5–33. doi: 10.1080/07421222.1996.11518099.
  • Wang RY, Lee YW, Pipino LL, Strong DM. Manage your information as a product. MIT Sloan Management Review. 1998;39(4):95.
  • Wang RY, Storey VC, Firth CP. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering. 1995;7(4):623–640. doi: 10.1109/69.404034.
  • Zhuo Z, Muhammad B, Khan S. Underlying the relationship between governance and economic growth in developed countries. Journal of the Knowledge Economy. 2021;12(3):1314–1330. doi: 10.1007/s13132-020-00658-w.
  • Zouari G, Abdelhedi M. Customer satisfaction in the digital era: Evidence from Islamic banking. Journal of Innovation and Entrepreneurship. 2021;10(1):1–18. doi: 10.1186/s13731-021-00151-x.


Case Study – Methods, Examples and Guide


Case Study Research

A case study is a research method that involves an in-depth examination and analysis of a particular phenomenon or case, such as an individual, organization, community, event, or situation.

It is a qualitative research approach that aims to provide a detailed and comprehensive understanding of the case being studied. Case studies typically involve multiple sources of data, including interviews, observations, documents, and artifacts, which are analyzed using various techniques, such as content analysis, thematic analysis, and grounded theory. The findings of a case study are often used to develop theories, inform policy or practice, or generate new research questions.

Types of Case Study

Types and Methods of Case Study are as follows:

Single-Case Study

A single-case study is an in-depth analysis of a single case. This type of case study is useful when the researcher wants to understand a specific phenomenon in detail.

For example, a researcher might conduct a single-case study on a particular individual to understand their experiences with a particular health condition, or on a specific organization to explore its management practices. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of a single-case study are often used to generate new research questions, develop theories, or inform policy or practice.

Multiple-Case Study

A multiple-case study involves the analysis of several cases that are similar in nature. This type of case study is useful when the researcher wants to identify similarities and differences between the cases.

For example, a researcher might conduct a multiple-case study on several companies to explore the factors that contribute to their success or failure. The researcher collects data from each case, compares and contrasts the findings, and uses various techniques to analyze the data, such as comparative analysis or pattern-matching. The findings of a multiple-case study can be used to develop theories, inform policy or practice, or generate new research questions.

Exploratory Case Study

An exploratory case study is used to explore a new or understudied phenomenon. This type of case study is useful when the researcher wants to generate hypotheses or theories about the phenomenon.

For example, a researcher might conduct an exploratory case study on a new technology to understand its potential impact on society. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as grounded theory or content analysis. The findings of an exploratory case study can be used to generate new research questions, develop theories, or inform policy or practice.

Descriptive Case Study

A descriptive case study is used to describe a particular phenomenon in detail. This type of case study is useful when the researcher wants to provide a comprehensive account of the phenomenon.

For example, a researcher might conduct a descriptive case study on a particular community to understand its social and economic characteristics. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of a descriptive case study can be used to inform policy or practice or generate new research questions.

Instrumental Case Study

An instrumental case study is used to understand a particular phenomenon that is instrumental in achieving a particular goal. This type of case study is useful when the researcher wants to understand the role of the phenomenon in achieving the goal.

For example, a researcher might conduct an instrumental case study on a particular policy to understand its impact on achieving a particular goal, such as reducing poverty. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of an instrumental case study can be used to inform policy or practice or generate new research questions.

Case Study Data Collection Methods

Here are some common data collection methods for case studies:

Interviews

Interviews involve asking questions to individuals who have knowledge or experience relevant to the case study. Interviews can be structured (where the same questions are asked to all participants) or unstructured (where the interviewer follows up on the responses with further questions). Interviews can be conducted in person, over the phone, or through video conferencing.

Observations

Observations involve watching and recording the behavior and activities of individuals or groups relevant to the case study. Observations can be participant (where the researcher actively participates in the activities) or non-participant (where the researcher observes from a distance). Observations can be recorded using notes, audio or video recordings, or photographs.

Documents

Documents can be used as a source of information for case studies. Documents can include reports, memos, emails, letters, and other written materials related to the case study. Documents can be collected from the case study participants or from public sources.

Surveys

Surveys involve asking a set of questions to a sample of individuals relevant to the case study. Surveys can be administered in person, over the phone, through mail or email, or online. Surveys can be used to gather information on attitudes, opinions, or behaviors related to the case study.

Artifacts

Artifacts are physical objects relevant to the case study. Artifacts can include tools, equipment, products, or other objects that provide insights into the case study phenomenon.

How to conduct Case Study Research

Conducting case study research involves several steps that must be followed to ensure the quality and rigor of the study. Here are the steps to conduct case study research:

  • Define the research questions: The first step in conducting a case study research is to define the research questions. The research questions should be specific, measurable, and relevant to the case study phenomenon under investigation.
  • Select the case: The next step is to select the case or cases to be studied. The case should be relevant to the research questions and should provide rich and diverse data that can be used to answer the research questions.
  • Collect data: Data can be collected using various methods, such as interviews, observations, documents, surveys, and artifacts. The data collection method should be selected based on the research questions and the nature of the case study phenomenon.
  • Analyze the data: The data collected from the case study should be analyzed using various techniques, such as content analysis, thematic analysis, or grounded theory. The analysis should be guided by the research questions and should aim to provide insights and conclusions relevant to the research questions.
  • Draw conclusions: The conclusions drawn from the case study should be based on the data analysis and should be relevant to the research questions. The conclusions should be supported by evidence and should be clearly stated.
  • Validate the findings: The findings of the case study should be validated by reviewing the data and the analysis with participants or other experts in the field. This helps to ensure the validity and reliability of the findings.
  • Write the report: The final step is to write the report of the case study research. The report should provide a clear description of the case study phenomenon, the research questions, the data collection methods, the data analysis, the findings, and the conclusions. The report should be written in a clear and concise manner and should follow the guidelines for academic writing.

Examples of Case Study

Here are some examples of case study research:

  • The Hawthorne Studies : Conducted between 1924 and 1932, the Hawthorne Studies were a series of case studies conducted by Elton Mayo and his colleagues to examine the impact of work environment on employee productivity. The studies were conducted at the Hawthorne Works plant of the Western Electric Company in Chicago and included interviews, observations, and experiments.
  • The Stanford Prison Experiment: Conducted in 1971, the Stanford Prison Experiment was a case study conducted by Philip Zimbardo to examine the psychological effects of power and authority. The study involved simulating a prison environment and assigning participants to the role of guards or prisoners. The study was controversial due to the ethical issues it raised.
  • The Challenger Disaster: The Challenger Disaster was a case study conducted to examine the causes of the Space Shuttle Challenger explosion in 1986. The study included interviews, observations, and analysis of data to identify the technical, organizational, and cultural factors that contributed to the disaster.
  • The Enron Scandal: The Enron Scandal was a case study conducted to examine the causes of the Enron Corporation’s bankruptcy in 2001. The study included interviews, analysis of financial data, and review of documents to identify the accounting practices, corporate culture, and ethical issues that led to the company’s downfall.
  • The Fukushima Nuclear Disaster : The Fukushima Nuclear Disaster was a case study conducted to examine the causes of the nuclear accident that occurred at the Fukushima Daiichi Nuclear Power Plant in Japan in 2011. The study included interviews, analysis of data, and review of documents to identify the technical, organizational, and cultural factors that contributed to the disaster.

Application of Case Study

Case studies have a wide range of applications across various fields and industries. Here are some examples:

Business and Management

Case studies are widely used in business and management to examine real-life situations and develop problem-solving skills. Case studies can help students and professionals to develop a deep understanding of business concepts, theories, and best practices.

Healthcare

Case studies are used in healthcare to examine patient care, treatment options, and outcomes. Case studies can help healthcare professionals to develop critical thinking skills, diagnose complex medical conditions, and develop effective treatment plans.

Education

Case studies are used in education to examine teaching and learning practices. Case studies can help educators to develop effective teaching strategies, evaluate student progress, and identify areas for improvement.

Social Sciences

Case studies are widely used in social sciences to examine human behavior, social phenomena, and cultural practices. Case studies can help researchers to develop theories, test hypotheses, and gain insights into complex social issues.

Law and Ethics

Case studies are used in law and ethics to examine legal and ethical dilemmas. Case studies can help lawyers, policymakers, and ethical professionals to develop critical thinking skills, analyze complex cases, and make informed decisions.

Purpose of Case Study

The purpose of a case study is to provide a detailed analysis of a specific phenomenon, issue, or problem in its real-life context. A case study is a qualitative research method that involves the in-depth exploration and analysis of a particular case, which can be an individual, group, organization, event, or community.

The primary purpose of a case study is to generate a comprehensive and nuanced understanding of the case, including its history, context, and dynamics. Case studies can help researchers to identify and examine the underlying factors, processes, and mechanisms that contribute to the case and its outcomes. This can help to develop a more accurate and detailed understanding of the case, which can inform future research, practice, or policy.

Case studies can also serve other purposes, including:

  • Illustrating a theory or concept: Case studies can be used to illustrate and explain theoretical concepts and frameworks, providing concrete examples of how they can be applied in real-life situations.
  • Developing hypotheses: Case studies can help to generate hypotheses about the causal relationships between different factors and outcomes, which can be tested through further research.
  • Providing insight into complex issues: Case studies can provide insights into complex and multifaceted issues, which may be difficult to understand through other research methods.
  • Informing practice or policy: Case studies can be used to inform practice or policy by identifying best practices, lessons learned, or areas for improvement.

Advantages of Case Study Research

There are several advantages of case study research, including:

  • In-depth exploration: Case study research allows for a detailed exploration and analysis of a specific phenomenon, issue, or problem in its real-life context. This can provide a comprehensive understanding of the case and its dynamics, which may not be possible through other research methods.
  • Rich data: Case study research can generate rich and detailed data, including qualitative data such as interviews, observations, and documents. This can provide a nuanced understanding of the case and its complexity.
  • Holistic perspective: Case study research allows for a holistic perspective of the case, taking into account the various factors, processes, and mechanisms that contribute to the case and its outcomes. This can help to develop a more accurate and comprehensive understanding of the case.
  • Theory development: Case study research can help to develop and refine theories and concepts by providing empirical evidence and concrete examples of how they can be applied in real-life situations.
  • Practical application: Case study research can inform practice or policy by identifying best practices, lessons learned, or areas for improvement.
  • Contextualization: Case study research takes into account the specific context in which the case is situated, which can help to understand how the case is influenced by the social, cultural, and historical factors of its environment.

Limitations of Case Study Research

There are several limitations of case study research, including:

  • Limited generalizability : Case studies are typically focused on a single case or a small number of cases, which limits the generalizability of the findings. The unique characteristics of the case may not be applicable to other contexts or populations, which may limit the external validity of the research.
  • Biased sampling: Case studies may rely on purposive or convenience sampling, which can introduce bias into the sample selection process. This may limit the representativeness of the sample and the generalizability of the findings.
  • Subjectivity: Case studies rely on the interpretation of the researcher, which can introduce subjectivity into the analysis. The researcher’s own biases, assumptions, and perspectives may influence the findings, which may limit the objectivity of the research.
  • Limited control: Case studies are typically conducted in naturalistic settings, which limits the control that the researcher has over the environment and the variables being studied. This may limit the ability to establish causal relationships between variables.
  • Time-consuming: Case studies can be time-consuming to conduct, as they typically involve a detailed exploration and analysis of a specific case. This may limit the feasibility of conducting multiple case studies or conducting case studies in a timely manner.
  • Resource-intensive: Case studies may require significant resources, including time, funding, and expertise. This may limit the ability of researchers to conduct case studies in resource-constrained settings.

About the author

Muhammad Hassan

Researcher, Academic Writer, Web developer

Surface water quality index forecasting using multivariate complementing approach reinforced with locally weighted linear regression model

  • Research Article
  • Published: 23 April 2024

  • Tao Hai 1 , 2 ,
  • Iman Ahmadianfar 3 ,
  • Bijay Halder 4 , 5 ,
  • Salim Heddam 6 ,
  • Ahmed M. Al-Areeq 12 , 7 ,
  • Vahdettin Demir 8 ,
  • Huseyin Cagan Kilinc 9 ,
  • Sani I. Abba 7 ,
  • Mou Leong Tan 10 ,
  • Raad Z. Homod 11 &
  • Zaher Mundher Yaseen   ORCID: orcid.org/0000-0003-3647-7137 12  

River water quality management and monitoring are essential responsibilities for communities near rivers. Government decision-makers should monitor important quality factors such as temperature, dissolved oxygen (DO), pH, and biochemical oxygen demand (BOD). Among water quality parameters, the 5-day BOD is an important index whose measurement demands a significant amount of time and effort, which is a source of concern in both academic and commercial settings. Traditional experimental and statistical methods cannot provide sufficient accuracy and are too slow for this detection problem. This study used a unique hybrid model called MVMD-LWLR, which introduced an innovative method for forecasting BOD in the Klang River, Malaysia. The hybrid model combines a locally weighted linear regression (LWLR) model with a wavelet-based kernel function, along with multivariate variational mode decomposition (MVMD) for the decomposition of input variables. In addition, categorical boosting (CatBoost) feature selection was used to discover and extract significant input variables. This combination of MVMD-LWLR and CatBoost is the first use of such a complete model for predicting BOD levels in the given river environment. In addition, an optimization process was used to improve the performance of the model; this process utilized the gradient-based optimization (GBO) approach to fine-tune the parameters and improve the overall accuracy of BOD prediction. To assess the robustness of the proposed method, it was compared to other popular models such as kernel ridge (KRidge) regression, LASSO, elastic net, and Gaussian process regression (GPR). Several metrics, comprising root-mean-square error (RMSE), R (correlation coefficient), U95% (uncertainty coefficient at the 95% level), and NSE (Nash–Sutcliffe efficiency), as well as visual interpretation, were used to evaluate the predictive efficacy of the hybrid models. Extensive testing revealed that, in forecasting the BOD parameter, the MVMD-LWLR model outperformed its competitors. Consequently, for BOD forecasting, the suggested MVMD-LWLR model optimized with the GBO algorithm yields encouraging and reliable results, with increased forecasting accuracy and minimal error.
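For readers who want to see the core regression component in code, the following is a minimal, illustrative sketch of locally weighted linear regression with a Gaussian kernel in Python. It is not the authors' MVMD-LWLR implementation (the wavelet-based kernel, MVMD decomposition, CatBoost feature selection, and GBO tuning are all omitted), and the function name `lwlr_predict`, the `bandwidth` parameter, and the toy data are hypothetical.

```python
import numpy as np

def lwlr_predict(x_query, X, y, bandwidth=1.0, ridge=1e-6):
    """Locally weighted linear regression prediction at a single query point.

    X: (n, d) training inputs, y: (n,) training targets.
    A Gaussian kernel weights training samples by their distance to x_query.
    """
    n, d = X.shape
    # Design matrix with an intercept column
    Xb = np.hstack([np.ones((n, 1)), X])
    xq = np.hstack([1.0, x_query])

    # Gaussian kernel weights (diagonal weight matrix)
    dists = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-dists / (2.0 * bandwidth ** 2))
    W = np.diag(w)

    # Weighted least squares with a small ridge term for numerical stability
    A = Xb.T @ W @ Xb + ridge * np.eye(d + 1)
    theta = np.linalg.solve(A, Xb.T @ W @ y)
    return xq @ theta

# Toy usage: forecast a noisy nonlinear signal (a stand-in for BOD records)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
print(lwlr_predict(np.array([5.0]), X, y, bandwidth=0.5))
```

The bandwidth controls how local the fit is; a wavelet-based kernel, as used in the paper, would replace the Gaussian weighting above.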

Data Availability

Data will be supplied upon request from the corresponding author.

Abbreviations

  • adaptive neuro fuzzy inference system
  • artificial neural network
  • alternate direction method of multipliers
  • biochemical oxygen demand
  • categorical boosting
  • chemical oxygen demand
  • cost function
  • dissolved oxygen
  • deep learning–based echo state network
  • deep neural network
  • deep matrix factorization
  • differential evolution
  • decision tree
  • deep random vector functional line
  • deep autoregressive
  • direction movement vector
  • extreme learning machine
  • fitness function
  • gradient-based optimization
  • Gaussian process regression
  • genetic programming
  • gene expression programming
  • gradient boosting regression tree
  • gradient search rule
  • gradient function
  • intrinsic time-scale decomposition
  • kernel ridge regression
  • ensemble Kalman filter-ANN
  • locally weighted linear regression
  • linear regression
  • Lagrangian function
  • local escaping operator
  • multivariate variational mode decomposition
  • machine learning
  • multivariate mode decomposition machine learning
  • multivariable modulation oscillations
  • mean absolute percentage error
  • multivariate regression
  • numerical weather prediction
  • New York City
  • Nash–Sutcliffe efficiency
  • ammonium nitrogen
  • nephelometric turbidity unit
  • particle swarm optimization
  • total precipitation
  • root-mean-square error
  • random forest
  • residual error
  • correlation coefficient
  • standard deviation
  • sodium adsorption ratio
  • singular value decomposition
  • standard deviation error
  • suspended solid
  • total organic carbon
  • time-varying filter-based empirical mode decomposition
  • total dissolved solids
  • variational mode decomposition
  • wind velocity
  • water quality
  • water temperature
  • wavelet transform
  • water quality index
  • wavelet-based LSSVM linked with improved simulated annealing
  • wastewater treatment plants
  • extreme gradient boosting

Ahmadianfar I, Bozorg-Haddad O, Chu X (2020a) Gradient-based optimizer: a new metaheuristic optimization algorithm. Inf Sci (Ny) 540:131–159

Ahmadianfar I, Heidari AA, Gandomi AH et al (2021a) RUN beyond the metaphor: an efficient optimization algorithm based on Runge Kutta method. Expert Syst Appl 181:115079

Ahmadianfar I, Heidari AA, Noshadian S et al (2022a) INFO: an efficient optimization algorithm based on weighted mean of vectors. Expert Syst Appl 195:116516. https://doi.org/10.1016/j.eswa.2022.116516

Ahmadianfar I, Jamei M, Chu X (2020b) A novel hybrid wavelet-locally weighted linear regression (W-LWLR) model for electrical conductivity (EC) prediction in surface water. J Contam Hydrol. https://doi.org/10.1016/j.jconhyd.2020.103641

Ahmadianfar I, Khajeh Z, Asghari-Pari S-A, Chu X (2019) Developing optimal policies for reservoir systems using a multi-strategy optimization algorithm. Appl Soft Comput 80:888–903. https://doi.org/10.1016/j.asoc.2019.04.004

Ahmadianfar I, Noshadian S, Elagib NA, Salarijazi M (2021b) Robust diversity-based sine-cosine algorithm for optimizing hydropower multi-reservoir systems. Water Resour Manag 35:3513–3538. https://doi.org/10.1007/s11269-021-02903-6

Ahmadianfar I, Shirvani-Hosseini S, He J et al (2022b) An improved adaptive neuro fuzzy inference system model using conjoined metaheuristic algorithms for electrical conductivity prediction. Sci Rep 12:1–34

Ahmadianfar I, Shirvani-Hosseini S, Samadi-Koucheksaraee A, Yaseen ZM (2022c) Surface water sodium (Na+) concentration prediction using hybrid weighted exponential regression model with gradient-based optimization. Environ Sci Pollut Res 1–26

Asadollah SBHS, Sharafati A, Motta D, Yaseen ZM (2021) River water quality index prediction and uncertainty analysis: a comparative study of machine learning models. J Environ Chem Eng 9:104599. https://doi.org/10.1016/j.jece.2020.104599

Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning for control. Lazy learning, 75–113.

Ay M, Kisi O (2011) Modeling of dissolved oxygen concentration using different neural network techniques in Foundation Creek, El Paso County, Colorado. J Environ Eng 138:654–662

Barzegar R, Asghari Moghaddam A, Adamowski J, Ozga-Zielinski B (2018) Multi-step water quality forecasting using a boosting ensemble multi-wavelet extreme learning machine model. Stoch Environ Res Risk Assess 32:799–813. https://doi.org/10.1007/s00477-017-1394-z

Botchkarev A (2018) Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. arXiv preprint arXiv:1809.03006.

Bozorg-Haddad O, Soleimani S, Loáiciga HA (2017) Modeling water-quality parameters using genetic algorithm–least squares support vector regression and genetic programming. J Environ Eng 143:04017021. https://doi.org/10.1061/(asce)ee.1943-7870.0001217

Dehghani R, Torabi Poudeh H, Izadi Z (2021) Dissolved oxygen concentration predictions for running waters with using hybrid machine learning techniques. Model Earth Syst Environ 1–15.

Dogan E, Sengorur B, Koklu R (2009) Modeling biological oxygen demand of the Melen River in Turkey using an artificial neural network technique. J Environ Manage 90:1229–1235. https://doi.org/10.1016/j.jenvman.2008.06.004

Hadi SJ, Tombul M (2018) Forecasting daily streamflow for basins with different physical characteristics through data-driven methods. Water Resour Manag 32:3405–3422. https://doi.org/10.1007/s11269-018-1998-1

He W, Zhang K, Kong Y et al (2023) Reduction pathways identification of agricultural water pollution in Hubei Province, China. Ecol Indic 153:110464

Hestenes MR (1969) Multiplier and gradient methods. J Optim Theory Appl 4:303–320. https://doi.org/10.1007/BF00927673

Ho JY, Afan HA, El-Shafie AH et al (2019) Towards a time and cost effective approach to water quality index class prediction. J Hydrol 575:148–165. https://doi.org/10.1016/j.jhydrol.2019.05.016

Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67

Jamei M, Ahmadianfar I, Karbasi M et al (2021) The assessment of emerging data-intelligence technologies for modeling Mg+ 2 and SO4− 2 surface water quality. J Environ Manage 300:113774

Jamei M, Ali M, Karbasi M et al (2022) Designing a multi-stage expert system for daily ocean wave energy forecasting: a multivariate data decomposition-based approach. Appl Energy 326:119925

Jamei M, Ali M, Karbasi M et al (2024) Monthly sodium adsorption ratio forecasting in rivers using a dual interpretable glass-box complementary intelligent system: hybridization of ensemble TVF-EMD-VMD, Boruta-SHAP, and eXplainable GPR. Expert Syst Appl 237:121512

Khaleefa O, Kamel AH (2021) On the evaluation of water quality index: case study of Euphrates River, Iraq. Knowledge-Based Eng Sci 2:35–43

Khozani ZS, Khosravi K, Pham BT et al (2019) Determination of compound channel apparent shear stress: application of novel data mining models. J Hydroinformatics 21(5):798–811. https://doi.org/10.2166/hydro.2019.037

Kim S, Alizamir M, Zounemat-Kermani M et al (2020) Assessing the biochemical oxygen demand using neural networks and ensemble tree approaches in South Korea. J Environ Manage 270:110834

Ma J, Ding Y, Cheng JCP et al (2020) Soft detection of 5-day BOD with sparse matrix in city harbor water using deep learning techniques. Water Res 170:115350. https://doi.org/10.1016/j.watres.2019.115350

Mohamed I, Othman F, Ibrahim AIN et al (2015) Assessment of water quality parameters using multivariate analysis for Klang River basin, Malaysia. Environ Monit Assess 187:1–12. https://doi.org/10.1007/s10661-014-4182-y

Mostafa F, Bozorg HO, Samaneh S-A, Loáiciga HA (2015) Assimilative capacity and flow dilution for water quality protection in rivers. J Hazardous, Toxic, Radioact Waste 19:4014027. https://doi.org/10.1061/(ASCE)HZ.2153-5515.0000234

Nagamuthu P (2023) Climate change impacts on surface water resources of the northern region of Sri Lanka. Knowledge-Based Eng Sci 4:25–50

Najah Ahmed A, Binti Othman F, Abdulmohsin Afan H et al (2019) Machine learning methods for better water quality prediction. J Hydrol 578:124084. https://doi.org/10.1016/j.jhydrol.2019.124084

Orouji H, Bozorg Haddad O, Fallah-Mehdipour E, Mariño MA (2013) Modeling of water quality parameters using data-driven models. J Environ Eng. https://doi.org/10.1061/(ASCE)EE.1943-7870.0000706

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, 31.

Qambar AS, Al KMM (2022) Prediction of municipal wastewater biochemical oxygen demand using machine learning techniques: a sustainable approach. Process Saf Environ Prot 168:833–845. https://doi.org/10.1016/j.psep.2022.10.033

Rana B (2023) Real-time flood inundation monitoring in Capital of India using Google Earth Engine and Sentinel database. Knowledge-Based Eng Sci 4:1–16

Ravansalar M, Rajaee T, Zounemat-Kermani M (2016) A wavelet-linear genetic programming model for sodium (Na+) concentration forecasting in rivers. J Hydrol 537:398–407. https://doi.org/10.1016/j.jhydrol.2016.03.062

Rezaie-Balf M, Attar NF, Mohammadzadeh A et al (2020) Physicochemical parameters data assimilation for efficient improvement of water quality index prediction: comparative assessment of a noise suppression hybridization approach. J Clean Prod 271:122576

Saunders C, Gammerman A (1998) Ridge regression learning algorithm in dual variables. In: 15th International Conference on Machine Learning (ICML ’98) (01/01/98)

Singh RB, Patra KC, Pradhan B, Samantra A (2024) HDTO-DeepAR: a novel hybrid approach to forecast surface water quality indicators. J Environ Manage 352:120091

Song C, Yao L, Hua C, Ni Q (2021) A water quality prediction model based on variational mode decomposition and the least squares support vector machine optimized by the sparrow search algorithm (VMD-SSA-LSSVM) of the Yangtze River, China. Environ Monit Assess 193:1–17

Taylor KE (2001) Summarizing multiple aspects of model performance in a single diagram. J Geophys Res Atmos 106:7183–7192. https://doi.org/10.1029/2000JD900719

Tiyasha T, Bhagat SK, Fituma F et al (2021a) Dual water choices: the assessment of the influential factors on water sources choices using unsupervised machine learning market basket analysis. IEEE Access 9:150532–150544

Tiyasha T, Tung TM, Bhagat SK et al (2021b) Functionalization of remote sensing and on-site data for simulating surface water dissolved oxygen: development of hybrid tree-based artificial intelligence models. Mar Pollut Bull 170:112639

Tiyasha TTM, Yaseen ZM (2020) A survey on river water quality modelling using artificial intelligence models: 2000–2020. J Hydrol 585:124670. https://doi.org/10.1016/j.jhydrol.2020.124670

Uddin MG, Nash S, Olbert AI (2021) A review of water quality index models and their use for assessing surface water quality. Ecol Indic 122:107218

ur Rehman N, Aftab H (2019) Multivariate variational mode decomposition. IEEE Trans Signal Process 67:6039–6052

Vapnik VN (2000) The nature of statistical learning theory, second edn. Springer, New York, New York, NY

Wang H, Shangguan L, Wu J, Guan R (2013) Multiple linear regression modeling for compositional data. Neurocomputing 122:490–500. https://doi.org/10.1016/j.neucom.2013.05.025

Yaseen ZM (2023) A new benchmark on machine learning methodologies for hydrological processes modelling: a comprehensive review for limitations and future research directions. Knowledge-Based Eng Sci 4:65–103

Ypma TJ (1995) Historical development of the Newton–Raphson method. SIAM Rev 37:531–551. https://doi.org/10.1137/1037125

Yuan L, Li R, He W et al (2022) Coordination of the industrial-ecological economy in the Yangtze River Economic Belt, China. Front Environ Sci 10:451

Zaman Zad Ghavidel S, Montaseri M (2014) Application of different data-driven methods for the prediction of total dissolved solids in the Zarinehroud basin. Stoch Environ Res Risk Assess 28:2101–2118. https://doi.org/10.1007/s00477-014-0899-y

Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67:301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

Zou R, Lung W-S, Wu J (2007) An adaptive neural network embedded genetic algorithm approach for inverse water quality modeling. Water Resour Res 43:1–13. https://doi.org/10.1029/2006WR005158

Acknowledgements

The authors acknowledge the data source provider, the Department of Environment (DoE), Malaysia. In addition, Zaher Mundher Yaseen thanks the Civil and Environmental Engineering Department, King Fahd University of Petroleum & Minerals, Saudi Arabia, for its support.

The research received no funding.

Author information

Authors and Affiliations

School of Information and Artificial Intelligence, Nanchang Institute of Science and Technology, Nanchang, China

Artificial Intelligence Research Center (AIRC), Ajman University, P.O. Box: 346, Ajman, United Arab Emirates

Department of Civil Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran

Iman Ahmadianfar

Department of Earth Sciences and Environment, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia

Bijay Halder

New Era and Development in Civil Engineering Research Group, Scientific Research Center, Al-Ayen University, Thi-Qar, 64001, Iraq

Faculty of Science, Agronomy Department, University 20 Août 1955 Skikda, Route El Hadaik, 26, Skikda, BP, Algeria

Salim Heddam

Interdisciplinary Research Center for Membranes and Water Security, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran, Saudi Arabia

Ahmed M. Al-Areeq & Sani I. Abba

Department of Civil Engineering, KTO Karatay University, 42020, Konya, Turkey

Vahdettin Demir

Department of Civil Engineering, Istanbul Aydın University, Istanbul, Turkey

Huseyin Cagan Kilinc

GeoInformatic Unit, Geography Section, School of Humanities, Universiti Sains Malaysia, 11800 Minden, Penang, Malaysia

Mou Leong Tan

Department of Oil and Gas Engineering, Basrah University for Oil and Gas, Basra, Iraq

Raad Z. Homod

Civil and Environmental Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran, 31261, Saudi Arabia

Ahmed M. Al-Areeq & Zaher Mundher Yaseen

Contributions

Tao Hai: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation; Project leader. Iman Ahmadianfar: Data curation; Formal analysis; Methodology; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation; Software. Bijay Halder: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Salim Heddam: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Ahmed M. Al-Areeq: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Vahdettin Demir: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Huseyin Cagan Kilinc: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Sani I. Abba: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Mou Leong Tan: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Raad Z. Homod: Conceptualization; Investigation; Visualization; Writing—original draft, Writing—review and editing draft preparation. Zaher Mundher Yaseen: Conceptualization; Supervision; Investigation; Visualization; Writing—original draft, Writing—review & editing draft preparation; Project leader.

Corresponding author

Correspondence to Zaher Mundher Yaseen .

Ethics declarations

Ethics approval

The research was conducted in the ethical manner advised by the targeted journal.

Consent to publish

The authors consent to the publication of this research.

Conflict of interest

The authors declare no competing interests.

Additional information

Responsible Editor: Xianliang Yi

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Explanation of kernel ridge and elastic net methods

1. Kernel ridge method

Ridge Regression (RR), a well-known regression approach, was first suggested by Hoerl and Kennard (1970). The regression problem can be defined as follows: given independent variables \(x_{ki}\) and the dependent variable \(z_k\), the goal is to minimize the squared error, which is defined as:

where θ expresses the regression coefficients, k indexes the target dataset, and \(\hat{z}_k\) indicates the predicted value.

Ridge regression modifies the least squares method by adding an \(l_2\) regularization term on the coefficients in order to reduce the variance:

where η is a positive shrinkage coefficient that controls the penalization. Saunders et al. came up with the idea of employing kernel functions (Kr) to create a kernelized version of ridge regression (Saunders and Gammerman 1998). The kernel trick then allows the algorithm to operate implicitly in the solution space without performing any computations inside that space explicitly. Following the ridge method (Saunders and Gammerman 1998), which draws in part on Vapnik (2000), the Kernel Ridge Regression (KRR) method produces the following equation:

The kernel function, which can be thought of as a similarity measure between feature vectors, is denoted Kr(x, x_k), and the weights are denoted α_k. These weights are found by minimizing the cost function:

As a result, Eq. (27) is very similar to the RR formulation in Eq. (10), except that the kernel trick replaces all of the dot products with Kr. Solving exactly for the coefficients α gives

where I denotes the identity matrix and Z = (Z_1, …, Z_k)^T. For an exact solution of the method, the generic KRR equation in Eq. (26) can be used.
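The numbered equations referred to in this appendix are not reproduced above. For reference, the standard ridge and kernel ridge expressions consistent with the surrounding definitions can be written as follows; this is a reconstruction in conventional notation, not necessarily the paper's exact Eqs. (10), (26), and (27), and K denotes the Gram matrix with entries Kr(x_k, x_j).

```latex
% Standard RR / KRR forms consistent with the text above (reconstructed)
\begin{align}
  E(\theta) &= \sum_{k} \bigl(z_k - \hat{z}_k\bigr)^2,
    \qquad \hat{z}_k = \sum_{i} \theta_i x_{ki} \\
  \theta^{\mathrm{RR}} &= \arg\min_{\theta} \sum_{k}
    \Bigl(z_k - \sum_{i} \theta_i x_{ki}\Bigr)^2 + \eta \sum_{i} \theta_i^2 \\
  \hat{z}(x) &= \sum_{k} \alpha_k \, Kr(x, x_k) \\
  J(\alpha) &= \sum_{k} \Bigl(z_k - \sum_{j} \alpha_j \, Kr(x_k, x_j)\Bigr)^2
    + \eta \, \alpha^{\top} K \alpha \\
  \alpha &= (K + \eta I)^{-1} Z
\end{align}
```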

2. Elastic net method

With roots in RR and Lasso regression (LR), Zou and Hastie (2005) presented the elastic net, which is fundamentally a linear regression method. Given N predictive variables, the number of samples is denoted by K.

Here, θ indicates the regression coefficients, θ_0 expresses a constant (intercept) coefficient, and δ² expresses the variance of the target value around the actual value. The RR model is defined as

Because RR does not perform variable selection, the LR improves on it in this respect:

Given that RR is inclined to produce biased regression estimates and the LR is too simplistic, the elastic net appears as a solution to overcome the drawbacks of the two approaches. The equation for calculating the regression coefficient is

The elastic net's penalty function (PF), \(\eta \sum_{j=1}^N \left(\alpha \left|{\theta}_j\right|+\left(1-\alpha \right){\theta}_j^2\right)\), is a convex linear combination of the RR penalty function, \({\eta}_1\sum_{j=1}^N {\theta}_j^2\), and the LR penalty function, \({\eta}_2\sum_{j=1}^N \left|{\theta}_j\right|\). The elastic net reduces to RR when α = 0 and corresponds to the LR when α = 1. As a result, the elastic net is a powerful hybrid of RR and LR, bringing together their respective strengths.
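For reference, the linear model and the ridge, Lasso, and elastic net objectives described above can be written in standard form as follows; again this is a reconstruction in conventional notation rather than the paper's own numbered equations.

```latex
% Standard linear-model / RR / LR / elastic net objectives (reconstructed)
\begin{align}
  z_k &= \theta_0 + \sum_{j=1}^{N}\theta_j x_{kj} + \varepsilon_k,
    \qquad \varepsilon_k \sim \mathcal{N}(0,\delta^2) \\
  \theta^{\mathrm{RR}} &= \arg\min_{\theta}\sum_{k=1}^{K}
    \Bigl(z_k-\theta_0-\sum_{j=1}^{N}\theta_j x_{kj}\Bigr)^2
    + \eta_1\sum_{j=1}^{N}\theta_j^2 \\
  \theta^{\mathrm{LR}} &= \arg\min_{\theta}\sum_{k=1}^{K}
    \Bigl(z_k-\theta_0-\sum_{j=1}^{N}\theta_j x_{kj}\Bigr)^2
    + \eta_2\sum_{j=1}^{N}\left|\theta_j\right| \\
  \theta^{\mathrm{EN}} &= \arg\min_{\theta}\sum_{k=1}^{K}
    \Bigl(z_k-\theta_0-\sum_{j=1}^{N}\theta_j x_{kj}\Bigr)^2
    + \eta\sum_{j=1}^{N}\Bigl(\alpha\left|\theta_j\right|+(1-\alpha)\theta_j^2\Bigr)
\end{align}
```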

Performance metrics

Specifically, this study considers six criteria to measure the accuracy of the ML models in forecasting the BOD parameter: mean absolute percentage error (MAPE), uncertainty coefficient at the 95% confidence level (U95%), correlation coefficient (R), Willmott's agreement index (IA) (Khozani et al. 2019), root-mean-square error (RMSE) (Khozani et al. 2019), and Nash–Sutcliffe efficiency (NSE) (Botchkarev 2018); the mathematical expressions are

where BOD_{FO,i} and BOD_{MO,i} indicate the forecasted and measured values of BOD, \(\overline{BOD_{FO}}\) and \(\overline{BOD_{MO}}\) indicate the average forecasted and measured BOD values, and SD expresses the standard deviation. An ideal ML model attains NSE = 1, R = 1, RMSE = 0, U95 = 0, and MAPE = 0.
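The metric formulas themselves are not shown above; the definitions below are the standard forms consistent with this description (a reconstruction, with N the number of samples). The U95 expression is one commonly used form based on the standard deviation and RMSE and may differ in detail from the paper's.

```latex
% Standard definitions of the listed performance metrics (reconstructed)
\begin{align}
  \mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{FO,i}-\mathrm{BOD}_{MO,i}\bigr)^2} \\
  \mathrm{MAPE} &= \frac{100}{N}\sum_{i=1}^{N}
      \left|\frac{\mathrm{BOD}_{MO,i}-\mathrm{BOD}_{FO,i}}{\mathrm{BOD}_{MO,i}}\right| \\
  R &= \frac{\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{FO,i}-\overline{\mathrm{BOD}_{FO}}\bigr)
             \bigl(\mathrm{BOD}_{MO,i}-\overline{\mathrm{BOD}_{MO}}\bigr)}
           {\sqrt{\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{FO,i}-\overline{\mathrm{BOD}_{FO}}\bigr)^2
                  \sum_{i=1}^{N}\bigl(\mathrm{BOD}_{MO,i}-\overline{\mathrm{BOD}_{MO}}\bigr)^2}} \\
  \mathrm{NSE} &= 1-\frac{\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{MO,i}-\mathrm{BOD}_{FO,i}\bigr)^2}
                         {\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{MO,i}-\overline{\mathrm{BOD}_{MO}}\bigr)^2} \\
  I_A &= 1-\frac{\sum_{i=1}^{N}\bigl(\mathrm{BOD}_{FO,i}-\mathrm{BOD}_{MO,i}\bigr)^2}
                {\sum_{i=1}^{N}\bigl(\left|\mathrm{BOD}_{FO,i}-\overline{\mathrm{BOD}_{MO}}\right|
                 +\left|\mathrm{BOD}_{MO,i}-\overline{\mathrm{BOD}_{MO}}\right|\bigr)^2} \\
  U_{95} &= 1.96\,\sqrt{\mathrm{SD}^2+\mathrm{RMSE}^2}
\end{align}
```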

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Hai, T., Ahmadianfar, I., Halder, B. et al. Surface water quality index forecasting using multivariate complementing approach reinforced with locally weighted linear regression model. Environ Sci Pollut Res (2024). https://doi.org/10.1007/s11356-024-33027-0

Received : 08 November 2023

Accepted : 17 March 2024

Published : 23 April 2024

DOI : https://doi.org/10.1007/s11356-024-33027-0

Keywords

  • Surface water quality
  • Multivariate variational mode decomposition
  • Tropical region
  • Industrial cities
  • Reinforced learning

  • Open access
  • Published: 23 April 2024

Prediction and optimization method for welding quality of components in ship construction

  • Jinfeng Liu 1 ,
  • Yifa Cheng 1 ,
  • Xuwen Jing 1 ,
  • Xiaojun Liu 2 &
  • Yu Chen 1  

Scientific Reports, volume 14, Article number: 9353 (2024)

  • Mechanical engineering

The welding process, one of the crucial industrial technologies in ship construction, accounts for approximately 70% of the workload and approximately 40% of the total cost. Existing welding quality prediction methods rely on hypothetical premises and subjective factors and cannot meet the dynamic control requirements of intelligent welding for processing quality. To address the low efficiency of quality prediction and the poor timeliness and unpredictability of quality control in the ship assembly-welding process, a data- and model-driven welding quality prediction method is proposed. Firstly, the influence factors of welding quality are analyzed and the correlation mechanism between process parameters and quality is determined. According to the analysis results, a stable and reliable data collection architecture is established. The elements of welding process monitoring are also determined based on a feature dimensionality reduction method. To improve the accuracy of welding quality prediction, the prediction model is constructed by fusing adaptive simulated annealing, particle swarm optimization, and back propagation neural network algorithms. Finally, the effectiveness of the prediction method is verified through 74 sets of plate welding experiments; the prediction accuracy exceeds 90%.

Introduction

The shipbuilding industry is a comprehensive national high-end equipment manufacturing industry that supports the shipping industry, marine development, and national defense construction, and it plays a critical role in guaranteeing national defense strength and economic development 1,2. With the continuous development of a new generation of information technology based on big data, the internet of things (IoT), 5G, cloud computing, artificial intelligence, and digital twins, intelligent construction is becoming the dominant advanced mode in the shipbuilding industry. At the same time, welding quality control is regarded as a significant part of shipbuilding, and the related innovation research under intelligent welding urgently needs to be carried out. As welding processing gradually becomes more flexible and complicated, the welding quality of each workstation ultimately determines the majority of the product quality through propagation, accumulation, and interaction.

The welding process is one of the vital industrial technologies in ship segment construction 3,4. However, during the welding of ship components, local uneven heating and local uncoordinated plastic strain of the metal materials are likely to lead to large residual stresses 5,6. This causes a reduction in the static load capacity and fatigue strength of the ship components, which in turn affects the load capacity, dimensional accuracy, and assembly accuracy of the structure. However, in most shipbuilding enterprises, quality management usually involves issuing quality plans, post-sequence production inspections, and quality statistical reports, which are static forms of quality control. The existing welding quality prediction methods have hypothetical premises and subjective factors, which cannot meet the dynamic control requirements of intelligent welding for processing quality. These methods often encounter problems such as inefficient quality inspection, untimely quality feedback, and untimely quality control 7. Moreover, the post-welding correction process delays the ship construction cycle and increases the production cost.

The inadequacy of traditional welding quality control technology determines its functional and technical limitations in practical applications 8,9. Firstly, the current welding process design relies on production experience and empirical calculation formulas 10, which makes it difficult to ensure the design requirement of minimizing residual stress in the forming of structural parts. Secondly, there is an absence of effective data pre-processing methods to address complex production conditions and massive amounts of welding measurement data. Currently, welding quality prediction methods for ship components are inadequate: for example, it is difficult to balance prediction accuracy and computational efficiency, or to combine actual measured welding data to drive data analysis services.

This work aims to provide a solution to the inefficiency of welding quality control during ship construction, which delays the production cycle and increases production costs. The proposed method has the following advantages.

A data-acquisition framework for the welding process parameters of ship unit-assembly welding is constructed, and a stable and reliable data acquisition method is proposed.

Based on a feature selection method, the features influencing welding quality are quantitatively analyzed, leading to the construction of an optimal subset of process features for welding quality prediction.

By fusing adaptive simulated annealing (SA), particle swarm optimization (PSO), and a back propagation neural network (BPNN), a welding quality prediction model is established for welding quality control and decision making.
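To make the idea of weight optimization by a population-based search concrete, the sketch below uses plain PSO to tune the weights of a tiny two-layer network on toy data. It is an assumption-based illustration only, not the APB model of this paper: the adaptive simulated annealing component and the real welding features are omitted, and all names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 5 process features -> 1 quality indicator (stand-in for measured welding data)
X = rng.uniform(-1, 1, size=(80, 5))
y = np.sin(X.sum(axis=1, keepdims=True))

N_IN, N_HID, N_OUT = 5, 8, 1
DIM = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT  # total number of weights and biases

def unpack(vec):
    """Split a flat parameter vector into weight matrices and bias vectors."""
    i = 0
    W1 = vec[i:i + N_IN * N_HID].reshape(N_IN, N_HID); i += N_IN * N_HID
    b1 = vec[i:i + N_HID]; i += N_HID
    W2 = vec[i:i + N_HID * N_OUT].reshape(N_HID, N_OUT); i += N_HID * N_OUT
    b2 = vec[i:i + N_OUT]
    return W1, b1, W2, b2

def mse_loss(vec):
    """Forward pass of a 5-8-1 network with a tanh hidden layer; returns MSE on the toy data."""
    W1, b1, W2, b2 = unpack(vec)
    hidden = np.tanh(X @ W1 + b1)
    pred = hidden @ W2 + b2
    return float(np.mean((pred - y) ** 2))

# Plain PSO over the flattened network parameters
n_particles, iters = 30, 200
w_inertia, c1, c2 = 0.7, 1.5, 1.5
pos = rng.uniform(-1, 1, size=(n_particles, DIM))
vel = np.zeros((n_particles, DIM))
pbest = pos.copy()
pbest_val = np.array([mse_loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((n_particles, DIM)), rng.random((n_particles, DIM))
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([mse_loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best training MSE:", pbest_val.min())
```

In an APB-style scheme the globally best particle found this way would typically seed the BPNN, which is then fine-tuned by gradient-based back propagation.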

The remainder of this paper is organized as follows. “ Related works ” section presents the related research on the influence factor and prediction methods of welding quality. The data acquisition and processing framework is explained in “ Acquisition and pre-processing of welding process data ” section. In “ Construction the welding quality prediction model ” section, fusing an Adaptive SA, the PSO and the BPNN (APB), a welding quality prediction model is established. To verify the proposed method, the case study of ship unit-assembly welding is illustrated in “ Case study ” section. The conclusion and future work are shown in “ Conclusion and future works ” section.

Related works

Method for selecting welding quality features

From the huge amount of data generated at the production site, knowledge can be mined through suitable processing methods to assist production 11, 12. Feature selection is an important and widely used technique in this field. The purpose of feature selection is to select a small subset of features from the original dataset based on certain evaluation criteria, which usually yields better performance, such as higher classification accuracy, lower computational cost and better model interpretability. As a practical method, feature selection has been widely used in many fields 13, 14, 15.

Depending on how the evaluation is performed, feature selection methods can be distinguished as filter models, wrapper models, or hybrid models. Filter models evaluate and select a subset of features based on the general characteristics of the data without involving any learning model. On the other hand, the wrapper model uses a learning algorithm set in advance and uses its performance as an evaluation criterion. Compared to filter models, wrapper models are more accurate but computationally more expensive. Hybrid models use different evaluation criteria at different stages and combine the advantages of the first two methods.

Two versions of an ant colony optimization-based feature selection algorithm were proposed by Warren et al. 16, which can effectively improve weld defect detection accuracy and weld defect type classification accuracy. An enhanced feature selection method combining the Relief-F algorithm with a convolutional neural network (CNN) was proposed by Jiang et al. 17 to improve the recognition accuracy of welding defect identification in the manufacturing process of large equipment. A hybrid Fisher-based filter and wrapper-based feature selection algorithm was proposed by Zhang et al. 18, which reduces the 41 feature parameters for weld defect monitoring during tungsten arc welding of aluminum alloy to 19; the computational effort is reduced and the modeling accuracy is improved. Abdel et al. 19 proposed a combination of two-phase mutation and the grey wolf optimization algorithm to solve the wrapper-based feature selection problem, which was able to balance accuracy and efficiency in handling the classification task while maintaining and improving classification accuracy. Le et al. 20 introduced a stochastic privacy-preserving machine learning algorithm in which the Relief-F algorithm is used for feature selection and random forest is utilized for privacy-preserving classification; the algorithm avoids overfitting and obtains higher classification accuracy.

In general, a huge amount of measured welding data is generated during the welding of actual ship components because of the complex production conditions. Problems such as high computational cost, falling into local optima, and premature convergence may exist in feature selection. To determine the essential factors influencing the welding process, it is necessary to use a suitable feature selection method that facilitates reasonable parsimony in obtaining the best set of input features. This maximizes accuracy and computational efficiency while reducing the computational complexity of the prediction model.

Welding quality prediction method

As new-generation information technology becomes popular in ship construction, process data from the manufacturing site can be collected. These data contain a non-linear mapping relationship between welding process parameters and quality, so welding data monitoring, welding quality prediction, and optimization decisions can be effectively implemented. Therefore, welding quality prediction based on machine learning algorithms has received wide attention from academia and industry.

Pal et al. predicted welding quality by processing the current and voltage signals in the welding process 21, 22. Taking the process parameters and statistical parameters of the arc signal as input variables, BPNN and radial basis function network models were adopted to realize the prediction of welding quality. A fatigue strength prediction method for ultra-high-strength steel butt-welded joints was proposed by Nykanen 23. A reinforcement-penetration collaborative prediction network model based on deep residual learning was designed by Lu et al. 24 to predict reinforcement and penetration depth quantitatively. A nugget quality prediction method for resistance spot welding of aluminum alloy based on structure-borne acoustic emission signals was proposed by Luo et al. 25.

Along with the maturity of related theories such as machine learning and neural networks, the task of welding quality prediction is increasingly implemented by scholars using related techniques.

Artificial neural networks (ANN): In automatic gas metal arc welding processes, response surface methodology and ANN models were adopted by Shim et al. 26 to predict the best welding parameters for a given weld bead geometry. Lei et al. 27 used a genetic algorithm to optimize the initialization weights and biases of a neural network and proposed a multi-information fusion neural network to predict the geometric characteristics of the weld by combining the welding parameters and the morphological characteristics of the molten pool. Chaki et al. 28 proposed an integrated prediction model of an ANN and a non-dominated sorting genetic algorithm, which was used to predict and optimize the quality characteristics during pulsed Nd:YAG laser cutting of aluminum alloys. An improved regression network was adopted by Wang et al. 29 to predict the future molten pool image. CNNs were used by Hartl et al. 30 to analyze process data in friction stir welding and predict the resulting quality of the weld surface. To predict the penetration of fillet welds, a penetration quality prediction method for asymmetrical fillet root welding based on an optimized BPNN was proposed by Chang et al. 31. A CNN-based back bead prediction model was proposed by Jin et al. 32, in which image data of the current welding change are acquired and the CNN model is used to predict the weld shape in gas metal arc welding. Hu et al. 33 established an ANN optimized by a pigeon-inspired algorithm to optimize the welding process parameters of ultrasonic-static shoulder-assisted stir friction welding (U-SSFSW), which led to a significant improvement in the tensile strength of the joints. Cruz et al. 34 presented a procedure for yielding a near-optimal ensemble of CNNs through an efficient search strategy based on an evolutionary algorithm, which is able to weigh the predictive accuracy of forecasting models against calculated costs under actual production conditions on the shop floor.

Support vector machines (SVM): SVMs using the radial kernel, boosting, and random forest techniques were adopted by Pereda et al. 35 to achieve direct quality prediction in the resistance spot welding process. To improve the ability to predict welding quality during high-power disk laser welding, an SVM model was adopted by Wang et al. 36 to predict welding quality from the metal vapor plume. By collecting the real-time torque signal of the friction stir welding process, Das et al. 37 used an SVM regression model to predict the ultimate tensile strength of the welded joint. A model of laser welding quality prediction based on different input parameters was established by Petkovic 38. Yu et al. 39 proposed a real-time prediction method for welding penetration mode and depth based on two-dimensional visual characteristics of the weld pool.

Other prediction models: A neuro-fuzzy model for the prediction and classification of defects in the fused zone was built by Casalino et al. 40, in which, using laser beam welding process parameters as input variables, neural networks and C-means fuzzy clustering algorithms are used to classify and predict the welding defects of Ti6Al4V alloy. Rout et al. 41 proposed a hybrid method based on fuzzy regression and particle swarm optimization to achieve and optimize the prediction of weld quality in terms of both mechanical properties and weld geometry. Kim et al. 42 proposed a semantic resistance spot welding weldability prediction framework, which constructs a shareable weldability knowledge database based on regression rules; a decision tree algorithm and regression tree are used to extract decision rules, and the nugget width of the case was successfully predicted. AbuShanab et al. 43 proposed a stochastic vector functional link prediction model optimized by the Hunger Games search algorithm to link joint properties with welding variables, introducing a new prediction model for stir friction welding of dissimilar polymer materials.

Scholars have proposed various feasible schemes for welding quality prediction. However, defects remain: the existing weld quality prediction algorithms lack generalization performance and involve a large number of assumptions and subjective factors, so these methods cannot meet the dynamic control requirements of intelligent welding for processing quality. In addition, most prediction models can only predict before or after the work and cannot adapt to dynamic changes in the on-site welding environment. Therefore, the key to improving welding quality is to provide accurate and timely prediction results.

Acquisition and pre-processing of welding process data

The welding quality prediction framework is proposed and shown in Fig. 1 (a clearer version is shown in Supplementary Figure S1). Firstly, the critical controllable quality indicators in ship unit-assembly welding are determined, and the influencing factors are analyzed. Secondly, based on the IoT system, a data acquisition system for real-time monitoring and prediction of the welding quality of ship unit-assembly welding is established, achieving the collection and transmission of welding data. Thirdly, a feature selection method is created to optimally select the key features of the welding quality data. Then, fusing adaptive simulated annealing, particle swarm optimization, and the back propagation neural network, a welding quality prediction model is established for welding quality control and decision making. Finally, welding experiments on ship unit-assembly welding are used as an example to verify the critical technologies in this paper.

Figure 1. The framework of the welding quality prediction method.

Analyze the factors affecting welding quality

The reasons that lead to welding quality problems of ship components involve six significant factors: human factors, welding equipment, materials, welding process, measurement system, and production environment. Residual stresses caused by instability during the welding of ship components are inextricably linked to the welding method and process parameters used. However, the essential factors are mainly determined by the thermal welding process and the restraint conditions of the weldment during the welding process. The influencing factors of welding quality in the thermal welding process are mainly reflected in the welding heat source type and its power density \(W\), the effective thermal efficiency \(P\) and linear energy \(Q\) of the welding process, the heat energy's transfer method (such as heat conduction, convection, etc.) and the welding temperature field. The determinants of the welding temperature field include the nature of the heat source and welding parameters (such as welding current, arc voltage, gas flow, inductance, welding speed, heat input, etc.). When the arc voltage and welding current increase, the heat energy input to the weld and the melting amount of the welding wire will increase directly, which will affect the width and penetration of the weld. When the welding speed is too low, it will cause the heat concentration and the width of the molten pool to increase, resulting in defects such as burn-through and dimensional deformation. The restraint conditions refer to the restraint type and restraint degree of the welded structure. The restraint degree is mainly determined by factors such as the structure of the welded sheet, the position of the weld, the welding direction and sequence, the shrinkage of other parts during the cooling process, and the tightness of the clamped part.

Take the CO2 gas-shielded welding process of a ship component as an example. Welding parameters determine the energy input to the weld and to a large extent affect the formation of the weld. Important process parameters that determine the welding quality of thin-plate ship structures include arc voltage, welding current, welding speed, inductance, and gas flow. For example, when the welding current is too large, the weld width, penetration, and reinforcement that determine the dimensions of the weld will increase, and welding defects are likely to occur during the welding process. At the same time, the angular deformation and bending deflection deformation of the welded sheet also increase. Instability and disturbance of the melt pool and arc can be caused when the gas flow rate is too high, resulting in turbulence and spatter in the melt pool.
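As a point of reference for the quantities named above, the line energy (heat input) of an arc welding pass is conventionally estimated from arc voltage, welding current, and travel speed. The relation below is the standard textbook form, not a formula quoted from this paper, and the arc efficiency symbol η is an assumption.

```latex
% Conventional line-energy (heat-input) relation, assuming arc efficiency \eta
Q = \frac{\eta \, U \, I}{v}
\qquad \text{(J/mm, with } U \text{ in V, } I \text{ in A, } v \text{ in mm/s)}
```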

Obtain the welding process data

The collection and transmission of process parameters during the welding process is an important basis for supporting the real-time prediction of welding quality. Therefore, a welding data acquisition and processing framework for ship components based on the IoT system is proposed, which is mainly divided into three levels: data perception, data transmission and preprocessing, and application services, as shown in Fig. 2.

Figure 2. A welding process data acquisition and processing framework.

During the execution of the welding process, the data sensing layer is mainly responsible for collecting various multi-source heterogeneous data in real-time and accurately, and providing a stable original data source for the data integration and analysis phase. The sensing data types mainly include welding process parameters, operating status information of welding equipment, and welding quality indicators. The collection method can be used through interface and protocol conversion or by connecting to an external intelligent sensing device. For example, for some non-digital smart devices, data collection can be operated by analog signals of electrical circuits. Then, data such as current, voltage, and welding speed are collected from the welding equipment by installing sensors such as current, voltage, and speed. Finally, a data acquisition board, such as PCL-818L, is used for analog-to-digital conversion, summary fusion, and data transmission. For most digital intelligent devices, data collection can use various communication interfaces or serial ports, PLC networks or communication interfaces, and other methods to collect and summarize the execution parameters and operating status of the equipment. Then, through the corresponding communication protocol, such as OPC-UA, MQTT, etc., the data read and write operations among the application, the server, and the PLC are realized.
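As an illustration of how a sampled welding-parameter record could be pushed from the data perception layer to the transmission layer over MQTT, the sketch below uses the paho-mqtt client. The broker address, topic, field names, and the `read_sensors` placeholder are hypothetical, and the actual system may use OPC-UA or other protocols as described above.

```python
import json
import time

import paho.mqtt.client as mqtt  # pip install paho-mqtt

BROKER_HOST = "192.168.1.10"          # hypothetical edge-gateway broker
TOPIC = "workshop/station01/welding"  # hypothetical topic

# paho-mqtt >= 2.0 style; on 1.x use mqtt.Client() without the version argument
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect(BROKER_HOST, 1883, keepalive=60)
client.loop_start()  # background network loop so QoS 1 publishes complete

def read_sensors():
    """Placeholder for reading the acquisition board / PLC registers."""
    return {"current_A": 210.5, "voltage_V": 23.8,
            "speed_mm_s": 6.2, "gas_flow_L_min": 18.0}

for _ in range(10):  # publish one JSON sample per second (bounded for the demo)
    record = read_sensors()
    record["timestamp"] = time.time()
    client.publish(TOPIC, json.dumps(record), qos=1)
    time.sleep(1.0)

client.loop_stop()
client.disconnect()
```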

The data transmission layer is responsible for transmitting the multi-source heterogeneous welding data collected on site and for interconnecting the underlying devices, the application services and the various databases. As new generations of communication technology mature, many options are available for industrial-level communication, such as 5G, Zigbee, industrial Wi-Fi and industrial Ethernet. According to actual needs, these can be combined so that their strengths complement one another and the requirements for transmission capacity, anti-interference capability, communication speed and stability are met. Considering the application scenario and functional requirements of this study, a combination of wired and wireless communication is chosen, which allows efficient deployment of the communication network, fast and stable transmission of real-time welding data, and portable networking.

The diversity of equipment in shipbuilding workshops and the heterogeneity of application systems give the data multi-source, heterogeneous characteristics. Data integration therefore shields these differences in data type and structure so that unified storage, management and analysis become possible. Its key technologies are data storage and management together with data preprocessing, where storage management is the foundation for maximizing data value and for preprocessing. Standard options include SQL databases such as MySQL and Oracle, and NoSQL databases such as Redis and HBase; in practice they can be mixed according to actual needs and application scenarios so that their advantages complement one another.
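A minimal storage sketch is shown below, using Python's built-in sqlite3 as a stand-in for the MySQL/NoSQL deployment described above; the table layout and column names are illustrative assumptions, not the schema used in the paper.

```python
# Storage sketch: one flat table of welding samples, with the residual stress
# column filled in later from the blind-hole measurement.
import sqlite3

conn = sqlite3.connect("welding.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS weld_samples (
        station        TEXT,
        ts             REAL,
        current_a      REAL,
        voltage_v      REAL,
        speed_cm_min   REAL,
        gas_flow_l_min REAL,
        inductance     REAL,
        wire_elong_mm  REAL,
        residual_mpa   REAL   -- measured afterwards by the blind-hole test
    )""")
conn.execute(
    "INSERT INTO weld_samples VALUES (?,?,?,?,?,?,?,?,?)",
    ("weld-robot-01", 1714000000.0, 215.0, 24.3, 40.0, 18.0, 3.0, 12.0, None))
conn.commit()
```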

Data feature selection

Data feature selection is a prerequisite for high-quality data analysis and mining. It ensures the quality and uniform format of the sensed data set and helps avoid feature clutter and the curse of dimensionality during analysis. Welding data collected on site will inevitably be incomplete, non-standard and voluminous, so data filtering, data recovery and data conversion are required to improve data quality and unify data formats.

The Relief-F algorithm, proposed by I. Kononenko 44, extends the Relief algorithm. It is a feature-weighting algorithm: each feature is assigned a weight based on the correlation between that feature and the class label, and features whose weights fall below a given threshold are removed, thereby optimizing the feature set. For multi-class problems, suppose the single-label training set \(D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}\) can be divided into \(|c|\) classes. For an example \(X_i\) belonging to class \(K_j\) (\(K_j\in\{1,2,\dots,|c|\}\)), Relief-F finds the nearest neighbours of \(X_i\) both within its own class and within every other class. Denote the near-hit examples of \(X_i\) by \(X_{i,l,nh}\) and the near-miss examples by \(X_{i,l,nm}\) (\(l=1,2,\dots,|c|;\ l\ne K_j\)). An iterative formula then updates the feature weight \(W(A)\) of each attribute feature A. Given the input data set \(D\), the number of sampling rounds \(m\), the feature-weight threshold \(\delta\) and the number of nearest neighbours \(k\), the calculation proceeds as follows:

Initialize the feature weight \(W(A)\) of each attribute to 0, and initialize the feature weight set \(T\) of the sample data set \(D\) as an empty set.

Start the iterative calculation by randomly selecting an example \(X_i\) from the sample data set \(D\).

From the samples of the same class as \(X_i\), find the \(k\) near-hit examples \(X_{i,l,nh}\), denoted \(H_i(c)\) (\(i=1,2,\dots,k\), \(c=class(X_i)\)). From the samples of each class different from that of \(X_i\), find the \(k\) near-miss examples \(X_{i,l,nm}\), denoted \(M_i(\widehat{c})\) (\(\widehat{c}\ne class(X_i)\)).

Update the feature weight \(W(A)\) and the set \(T\) with the following iterative formula.
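The update formula itself did not survive extraction. The standard Relief-F update, written with \(X_i\) as the sampled example, \(H_j(c)\) its j-th near-hit and \(M_j(\widehat{c})\) its j-th near-miss in class \(\widehat{c}\) (\(j=1,\dots,k\); the published equation may differ in typesetting), is:

$$W(A)\leftarrow W(A)-\sum_{j=1}^{k}\frac{diff\big(A,X_i,H_j(c)\big)}{mk}+\sum_{\widehat{c}\ne class(X_i)}\left[\frac{P(\widehat{c})}{1-P\big(class(X_i)\big)}\sum_{j=1}^{k}\frac{diff\big(A,X_i,M_j(\widehat{c})\big)}{mk}\right]$$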

where \(diff(A,X_1,X_2)\) denotes the distance between samples \(X_1\) and \(X_2\) on feature \(A\), \(class(X_i)\) denotes the class label of sample \(X_i\), and \(P(c)\) denotes the prior probability of class c.

According to the weight calculated for each attribute, the feature set of the initial input data is filtered. Specifically, a threshold \(\tau\) must be specified, whose value should satisfy the Chebyshev-type condition \(0<\tau \ll 1/\sqrt{\alpha m}\), where \(\alpha\) is the probability of accepting an irrelevant feature and \(m\) is the number of welding data samples.
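To make the procedure concrete, the following is a minimal NumPy sketch of Relief-F as described above. The min-max scaling, the Manhattan distance used for neighbour search, and the function and variable names are implementation assumptions, not details taken from the paper.

```python
# Minimal Relief-F sketch: m sampling rounds, k nearest hits/misses,
# prior-weighted misses, features scaled to [0, 1] so diff() is comparable.
import numpy as np

def relief_f(X, y, m=80, k=5, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        xi, ci = Xs[i], y[i]
        # k nearest hits (same class, excluding the sampled example itself)
        same = np.where(y == ci)[0]
        same = same[same != i]
        hits = same[np.argsort(np.abs(Xs[same] - xi).sum(axis=1))[:k]]
        if hits.size:
            W -= np.abs(Xs[hits] - xi).mean(axis=0) / m
        # prior-weighted k nearest misses from every other class
        for c in classes:
            if c == ci:
                continue
            other = np.where(y == c)[0]
            miss = other[np.argsort(np.abs(Xs[other] - xi).sum(axis=1))[:k]]
            W += (prior[c] / (1 - prior[ci])) * np.abs(Xs[miss] - xi).mean(axis=0) / m
    return W

# Usage: keep only features whose weight exceeds the threshold tau.
# W = relief_f(X, y, m=80, k=5); selected = np.where(W > tau)[0]
```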

Construction of the welding quality prediction model

The welding quality prediction model based on APB

The BPNN is the most successful learning algorithm for training multi-layer feedforward neural networks. Its iterative computation can be expressed mathematically as follows 45. Assume a sample data set \(D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}\), \(x_i\in R^m\), \(y_i\in R^z\), where each input sample vector has m feature attributes and the output is a z-dimensional real-valued vector. A classical error back-propagation network structure consists of m input nodes, q hidden-layer nodes and z output nodes; the three-layer feedforward structure is taken as the example. Let \(\gamma_h\) be the threshold of the h-th hidden-layer node and \(\theta_j\) the threshold of the j-th output-layer node. The connection weight between the i-th input node and the h-th hidden node is denoted \(v_{ih}\), and the connection weight between the h-th hidden node and the j-th output node is denoted \(\omega_{hj}\). Let k denote the number of training iterations of the network model.

The input value \(O_h\) of each hidden-layer node is computed from the input vector, the connection weights \(v_{ih}\) and the threshold \(\gamma_h\); passing \(O_h\) through the activation function \(L(x)\) then gives the output value \(S_h\) of each hidden-layer node.
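The equation did not survive extraction. Under the common convention that the threshold is subtracted from the weighted input sum (the published formula may place it inside the activation instead), one consistent form is:

$$O_h=\sum_{i=1}^{m}v_{ih}x_i-\gamma_h,\qquad S_h=L\left(O_h\right),\qquad h=1,2,\dots,q$$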

Then, using the hidden-layer outputs, the connection weights \(\omega_{hj}\) and the threshold \(\theta_j\), the input value \(\beta_j\) of each output-layer node is computed; applying the activation function \(p(x)\) to \(\beta_j\) gives the output response \(T_j\) of each output-layer node.
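The corresponding equation was also lost in extraction; a form consistent with the definitions above is:

$$\beta_j=\sum_{h=1}^{q}\omega_{hj}S_h-\theta_j,\qquad T_j=p\left(\beta_j\right),\qquad j=1,2,\dots,z$$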

For a training sample \((x_i,y_i)\), let the output vector of the error back-propagation neural network be \(T_j\). The mean square error \(E_i\) between the actual output and the expected output \(y_i\) for the training sample \((x_i,y_i)\) can then be calculated.
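The equation itself is missing; a standard form, where \(y_j^{(i)}\) denotes the j-th component of \(y_i\) and the factor 1/2 is a common convention that simplifies the derivative, is:

$$E_i=\frac{1}{2}\sum_{j=1}^{z}\left(T_j-y_j^{(i)}\right)^2$$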

The BP neural network is an iterative learning algorithm. Based on the gradient descent strategy, in each iteration the update formula for any parameter \(\delta\) is as follows.
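The update rule did not survive extraction; the standard gradient-descent form, with \(\eta\) the learning rate introduced in the next paragraph, is:

$$\delta\leftarrow\delta+\Delta\delta,\qquad \Delta\delta=-\eta\,\frac{\partial E_i}{\partial \delta}$$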

Given the learning rate \(\eta\), the derivation is illustrated for the increment \(\Delta v_{ih}\) of the connection weight between the input and hidden layers. Note that \(v_{ih}\) first affects the input and output of the h-th hidden-layer node, which in turn affect the input and output of the j-th output-layer node, and finally \(E_i\). That is:
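The chain-rule expression that belongs here was lost; written out with the symbols defined above, it reads:

$$\frac{\partial E_i}{\partial v_{ih}}=\sum_{j=1}^{z}\frac{\partial E_i}{\partial T_j}\cdot\frac{\partial T_j}{\partial \beta_j}\cdot\frac{\partial \beta_j}{\partial S_h}\cdot\frac{\partial S_h}{\partial O_h}\cdot\frac{\partial O_h}{\partial v_{ih}}$$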

It is assumed that the typical Sigmoid function is used for the nodes of both the hidden and output layers, so the activation satisfies the following characteristic relationship.
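The relationship referred to is the well-known derivative property of the Sigmoid function:

$$f(x)=\frac{1}{1+e^{-x}},\qquad f'(x)=f(x)\big(1-f(x)\big)$$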

Substituting into Eq. (9), the update equation for \(\Delta v_{ih}\) can be solved. Similarly, the update formulas for \(\Delta\omega_{hj}\), \(\Delta\theta_j\) and \(\Delta\gamma_h\) can be obtained. That is:
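The resulting formulas were lost in extraction. The textbook form of these updates, with auxiliary gradient terms \(g_j\) and \(e_h\) introduced here only for compactness (they are not symbols taken from the paper), is:

$$g_j=T_j\left(1-T_j\right)\left(y_j^{(i)}-T_j\right),\qquad e_h=S_h\left(1-S_h\right)\sum_{j=1}^{z}\omega_{hj}\,g_j$$

$$\Delta\omega_{hj}=\eta\,g_j S_h,\qquad \Delta\theta_j=-\eta\,g_j,\qquad \Delta v_{ih}=\eta\,e_h x_i,\qquad \Delta\gamma_h=-\eta\,e_h$$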

The BPNN model can realize arbitrarily complex mappings of multidimensional, nonlinear functions, but it easily falls into local optima. Particle swarm optimization is a global random search algorithm based on swarm intelligence; it has good global search performance and generality for finding global optima under multiple objective functions and constraints, and it can improve the convergence accuracy and prediction performance of the BPNN. Therefore, a welding quality prediction algorithm (APB) is created by fusing adaptive simulated annealing, particle swarm optimization and the back-propagation neural network. The algorithm flow is shown in Fig. 3.

Figure 3. The APB algorithm flow.

During the iterative optimization, each particle updates its position by tracking its own individual extremum and the global extremum of the population. The movement of a particle is composed of three parts, which reflect its tendency to maintain its previous velocity, to approach its best historical position, and to cooperate and share information with the group. The update formulas for the particle velocity and position are as follows.
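The equations themselves were lost in extraction; the standard PSO update, written in the notation defined in the next paragraph, is:

$$v_i(k+1)=w\,v_i(k)+c_1 r_1\big(P_{best,i}(k)-x_i(k)\big)+c_2 r_2\big(G_{best}(k)-x_i(k)\big)$$

$$x_i(k+1)=x_i(k)+v_i(k+1)$$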

where the critical parameters are as follows: \(w\) is the inertia weight coefficient; \(c_1\) and \(c_2\) are the self-cognitive and social cognitive factors, respectively; \(v_i(k)\) and \(x_i(k)\) are the velocity and position of particle \(i\) at the k-th iteration; \(r_1\) and \(r_2\) are uniform random numbers in \([0,1]\); and \(P_{best,i}(k)\) and \(G_{best}(k)\) are the individual optimal solution of particle \(i\) and the global optimal solution of the population at the k-th iteration.

\(w\), \(c_1\) and \(c_2\) are the essential parameters controlling the iterations of particle swarm optimization (PSO). \(w\) governs the inertia of the particle flight and the strength of the algorithm's search ability, while \(c_1\) and \(c_2\) directly bias the particle's motion toward the individual or the group optimum. Therefore, to make PSO adaptive, this study dynamically adjusts \(w\), \(c_1\) and \(c_2\) during the iterative calculation to control the local and global search strategies and the collaborative sharing ability of the algorithm. A nonlinear control strategy based on a negative double-tangent curve is adopted to vary \(w\), and the values of \(c_1\) and \(c_2\) vary with the iteration count \(k\). The update formulas of the related parameters are as follows.
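The exact adaptive-update expressions (Eqs. (18)-(20) of the published article) did not survive extraction and cannot be recovered here. Purely as an illustration of the kind of schedule described, i.e. a tanh-shaped decrease of \(w\) between \(w_{max}\) and \(w_{min}\) and linear variation of \(c_1\) and \(c_2\) with \(k\), one commonly used form is:

$$w(k)=w_{min}+\frac{w_{max}-w_{min}}{2}\left[1-\tanh\left(\frac{8k}{k_{max}}-4\right)\right]$$

$$c_1(k)=c_{1max}-\left(c_{1max}-c_{1min}\right)\frac{k}{k_{max}},\qquad c_2(k)=c_{2min}+\left(c_{2max}-c_{2min}\right)\frac{k}{k_{max}}$$

These are stated only as an assumption; the paper's own formulas may differ.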

where \({w}_{max}\) and \({w}_{min}\) are the maximum and minimum values of the inertia weight coefficient. \(k\) is the current number of iterations. \({k}_{max}\) is the maximum number of iterations. \({c}_{1max}\) , \({c}_{1min}\) are the maximum and minimum values of the self-cognitive factor. \({c}_{2max}\) , \({c}_{2min}\) are the maximum and minimum values of the social cognitive factor.

In addition, to improve the search dispersion of the PSO algorithm and avoid convergence to local minima, SA is applied within the cyclic solution process of PSO. The SA algorithm is an adaptive, iterative, heuristic probabilistic search algorithm with strong robustness, global convergence, computational parallelism and adaptability; it is suitable for nonlinear problems and for different types of design-variable optimization. The specific process of the algorithm is as follows:

(1) Select the welding quality influencing factors with strong correlation as the input feature set and the corresponding welding quality data as the output attribute set, and establish the training and verification data sets of the algorithm;

(2) Preliminarily construct the BPNN prediction model for welding quality prediction;

(3) Set the fitness function to the mean-square-error function used to evaluate predictive performance; the flying particles are the weight and threshold matrices of the neural network nodes. Initialize the particle population size N and the maximum number of evolutions M, set the search-space dimension and the velocity range of the swarm, and randomly initialize the positions and velocities of all particles in the population;

(4) Calculate the fitness values of all initial particles in the population, compare them to obtain the individual optimal positions \(P_{best}\) and the population optimal position \(G_{best}\), and set the initial temperature \(T(0)\) of the simulated annealing algorithm according to formula (22);

(5) Update the positions and velocities of the particles by adjusting \(w\), \(c_1\) and \(c_2\) adaptively according to formulas (18), (19) and (20), perform one iteration of the optimization, and update the global optimum of the population;

(6) Set \(T=T(0)\), take the initial solution \(S_1\) as the global optimal solution, and determine the number of iterations at each temperature T, denoted as the chain length L of the Metropolis algorithm;

(7) Add a stochastic perturbation to the solution \(S_1\) of the current iteration to generate a new solution \(S_2\);

(8) Calculate the increment \(df=f(S_2)-f(S_1)\) of the new solution \(S_2\), where \(f(x)\) is the fitness function;

(9) If \(df<0\), accept \(S_2\) as the current solution of this iteration, i.e. \(S_1=S_2\). If \(df>0\), calculate the acceptance probability \(\exp(-df/T)\) and generate a uniformly distributed random number rand in the interval (0,1); if \(\exp(-df/T)>rand\), accept \(S_2\) as the new solution, otherwise retain the current solution \(S_1\);

(10) If the prediction error of the current solution \(S_1\) reaches the accuracy requirement, or the number of iterations reaches the maximum M, the algorithm terminates and the current global optimal solution is output. Otherwise, decay the current temperature T according to the set attenuation function and return to step (5) for the next cycle;

(11) Output the current optimal particle, i.e. the optimal thresholds and weight vectors, fit the validation sample set and calculate the prediction error; if the conditions are not met, return to step (5). A compact Python sketch of this SA-PSO-BP loop is given below.
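The sketch below condenses the loop described in these steps. The network size, initialization ranges, parameter schedules and target scaling are assumptions made only for illustration; they are not the values or exact formulas used in the paper.

```python
# Sketch of the APB idea: PSO over the flattened BPNN parameter vector, with
# an SA-style Metropolis step that can accept a worse swarm attractor so the
# search can escape local optima. Targets y are assumed scaled to (0, 1).
import numpy as np

def bpnn_mse(theta, X, y, n_in, n_hidden):
    """Fitness: MSE of a 1-hidden-layer sigmoid network encoded by theta."""
    s = 0
    V = theta[s:s + n_in * n_hidden].reshape(n_in, n_hidden); s += n_in * n_hidden
    gamma = theta[s:s + n_hidden]; s += n_hidden        # hidden thresholds
    w_out = theta[s:s + n_hidden]; s += n_hidden        # hidden -> output weights
    theta_out = theta[s]                                # output threshold
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = sig(X @ V - gamma)                              # hidden outputs S_h
    out = sig(H @ w_out - theta_out)                    # network output
    return float(np.mean((out - y) ** 2))

def apb_optimize(X, y, n_hidden=6, n_particles=30, k_max=300,
                 w_max=0.9, w_min=0.4, c_max=2.5, c_min=1.25,
                 T0=1e4, mu=0.9, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    dim = n_in * n_hidden + 2 * n_hidden + 1
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    fit = np.array([bpnn_mse(p, X, y, n_in, n_hidden) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    g, g_fit = pos[fit.argmin()].copy(), fit.min()      # Metropolis-tracked attractor
    best, best_fit = g.copy(), g_fit                    # best ever (returned)
    T = T0
    for k in range(k_max):
        # assumed schedules: linearly varying inertia and cognitive factors
        w = w_max - (w_max - w_min) * k / k_max
        c1 = c_max - (c_max - c_min) * k / k_max
        c2 = c_min + (c_max - c_min) * k / k_max
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = pos + vel
        fit = np.array([bpnn_mse(p, X, y, n_in, n_hidden) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        # SA-style Metropolis step on this iteration's best candidate
        cand, cand_fit = pos[fit.argmin()], fit.min()
        df = cand_fit - g_fit
        if df < 0 or rng.random() < np.exp(-df / T):
            g, g_fit = cand.copy(), cand_fit
        if cand_fit < best_fit:
            best, best_fit = cand.copy(), cand_fit
        T *= mu                                         # cooling: T(k+1) = mu * T(k)
    return best, best_fit
```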

After each iteration, the algorithm decays the temperature from its initial value \(T(0)\). The algorithm can then not only accept a better solution but also accept a worse one with a certain probability \(P_T\), which improves the ability of PSO to escape local optima during the iterative optimization. The update formulas of the related parameters are as follows.
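The formulas themselves were lost in extraction; a standard Metropolis acceptance probability and geometric cooling schedule, written with the symbols defined in the following paragraph, are:

$$P_{T(k)}(i)=\begin{cases}1, & f\left(X_{i+1}^{T(k)}\right)\le f\left(X_{i}^{T(k)}\right)\\[4pt]\exp\left(-\dfrac{f\left(X_{i+1}^{T(k)}\right)-f\left(X_{i}^{T(k)}\right)}{T(k)}\right), & \text{otherwise}\end{cases}\qquad T(k+1)=\mu\,T(k)$$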

where \(X_{i+1}^{T(k)}\) represents the candidate solution at the current temperature \(T(k)\); \(P_{T(k)}(i)\) is the probability of accepting the new solution \(X_{i+1}^{T(k)}\) in place of the previous solution \(X_i^{T(k)}\); \(T(k)\) is the temperature of the k-th annealing step; and \(\mu\) is the cooling coefficient.

To evaluate the prediction accuracy of the improved algorithm model, the coefficient of determination (R 2 ), the mean absolute percentage error (MAPE) and the root mean square error (RMSE) are selected as error indicators in this study. The reference formulas are given below.
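The formulas did not survive extraction; their standard definitions, in the notation of the following paragraph, are:

$$R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\overline{y}\right)^2}$$

$$MAPE=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y^{(i)}-h\left(X^{(i)}\right)}{y^{(i)}}\right|,\qquad RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(h\left(X^{(i)}\right)-y^{(i)}\right)^2}$$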

where n is the sample size of the data set; \(y_i\) is the actual observation of the i-th sample instance and \(\widehat{y}_i\) its fitted prediction; \(\overline{y}\) is the average observation over the n sample instances; \(y^{(i)}\) is the actual value of the i-th instance; and \(h(X^{(i)})\) is its predicted value.

R 2 indicates how well the regression model captures the covariation between the independent variables and the dependent variable, while MAPE and RMSE reflect the degree of deviation between the predicted and actual values.

Process parameter optimization

The genetic algorithm was first proposed by John Holland 46 based on the evolution of biological populations in nature. It obtains an optimal solution by simulating natural evolution, can handle complex nonlinear combinatorial optimization problems and has good global search ability, so it is widely used for optimization in many engineering fields. Based on the welding quality prediction model built above, the genetic algorithm is introduced to optimize the welding process parameters and obtain the optimal combination of process parameters.

The specific idea is as follows: the process parameters (welding current, arc voltage, welding speed, wire elongation, welding gas flow and inductance) are real-number encoded as genes composing a chromosome, so each chromosome represents one combination of welding process parameters. An initial population is generated; selection, crossover and mutation then produce new populations, which are evaluated with a fitness function based on the prediction model established above, and iterating the genetic algorithm finally yields the optimal combination of process parameters (a sketch of this loop is given below). The specific algorithm flow is shown in Fig. 4.
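The following is a minimal real-coded GA sketch of this loop. The parameter bounds, the tournament selection, arithmetic crossover and Gaussian mutation operators, and the predict_residual_stress callable are placeholders assumed for illustration, not the paper's implementation.

```python
# Real-coded GA over the six process parameters; fitness is the negated
# residual stress predicted by a trained quality model passed in by the caller.
import numpy as np

rng = np.random.default_rng(1)
BOUNDS = np.array([[120, 260],   # welding current (A)    -- assumed range
                   [18, 30],     # arc voltage (V)
                   [20, 60],     # welding speed (cm/min)
                   [8, 20],      # wire elongation (mm)
                   [10, 25],     # gas flow (L/min)
                   [-10, 10]])   # inductance setting

def fitness(pop, predict_residual_stress):
    # Lower predicted residual stress -> higher fitness.
    return -np.array([predict_residual_stress(ind) for ind in pop])

def ga_optimize(predict_residual_stress, pop_size=50, gens=100,
                p_cross=0.7, p_mut=0.01):
    lo, hi = BOUNDS[:, 0], BOUNDS[:, 1]
    pop = rng.uniform(lo, hi, (pop_size, len(BOUNDS)))
    for _ in range(gens):
        fit = fitness(pop, predict_residual_stress)
        # tournament selection between random pairs
        idx = rng.integers(pop_size, size=(pop_size, 2))
        winners = np.where(fit[idx[:, 0]] > fit[idx[:, 1]], idx[:, 0], idx[:, 1])
        parents = pop[winners]
        # arithmetic crossover on consecutive pairs
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                a = rng.random()
                children[i]     = a * parents[i] + (1 - a) * parents[i + 1]
                children[i + 1] = a * parents[i + 1] + (1 - a) * parents[i]
        # Gaussian mutation, clipped back into the parameter bounds
        mut = rng.random(children.shape) < p_mut
        children = np.clip(children + mut * rng.normal(0, 0.05 * (hi - lo),
                                                       children.shape), lo, hi)
        pop = children
    fit = fitness(pop, predict_residual_stress)
    return pop[fit.argmax()]

# Usage: best_params = ga_optimize(trained_model_predict_function)
```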

Figure 4. The optimization process of welding process parameters.

To demonstrate the feasibility of the method proposed in this paper, welding experiments on ship unit components were conducted in cooperation with a large shipyard, verifying that the proposed method accurately predicts the welding quality of ship unit-assembly welding.

Based on the industrial IoT framework for acquiring and processing ship component welding data, the welding data collection method is validated, as shown in Fig. 5 (a clearer version is given in Supplementary Figure S2). The ship plate used in the investigation is Q235B general-strength hull structural steel with dimensions 300 mm × 150 mm × 5 mm, and the welding process is centre surfacing welding of the ship component. The digital welding machine is Panasonic's all-digital metal inert gas welding machine, model YD-350GS5 of the GP5 series, which has a built-in IoT module and an analog communication interface. The automatic welding robot is a Panasonic TAWERS welding robot, which achieves very low-spatter welding of ship components. To collect the key welding process parameters, intelligent sensors and precision measuring instruments were fitted for the experiment: a wire sensor for CO2 welding monitors the welding wire elongation during welding, a TH2810 inductance meter measures the inductance, and a mass flow controller on the shielding gas measures the welding gas flow (a more complete description is given in Supplementary Table S1).

Figure 5. The welding data acquisition system of the ship component structure.

The equipment used for welding data transmission includes communication interface equipment and a serial port network module. For the digital welding machine and welding robot, the PLC provides analog input modules that receive standard voltage or current signals converted by transmitters; after calculation and analysis in the PLC, the data can be displayed on the human-machine interface (HMI) at the welding site through the communication interface device and protocol. The intelligent sensors configured in the experiment have their own communication interfaces, such as RS232 and RS485, so the serial port network module establishes the connection and performs the data protocol conversion for each sensor. Wi-Fi transmission and Ethernet are set up through a radiofrequency (RF) antenna and a WAN/LAN conversion component to support welding data read and write operations between the welding site and the upper computer. In this case, a MySQL database is used to store, manage and share the welding data.

Residual stresses are measured by the blind-hole method on the finished welded steel plate; the residual stress value reflects the quality of the weld. In the blind-hole method, a strain gauge is applied to the surface of the workpiece to be measured, and a hole is then drilled into the workpiece, causing stress relaxation around the hole and a new stress field distribution. The released strain is collected by the strain gauge, and the original residual stress and strain of the workpiece can be deduced from the principles of elasticity.

The measurement equipment consists of a stepper-adjustable drilling machine, a three-phase strain gauge, a TST3822E static strain test analyzer and software. The diameter of the blind hole is 2 mm and its depth is 3 mm; the measured stress is the average of the stress distribution over the depth of the blind hole. According to its direction of action, welding residual stress is divided into longitudinal residual stress parallel to the weld axis and transverse residual stress perpendicular to the weld. In this experiment, a three-phase right-angle strain gauge rosette is used, i.e. the gauges are laid out at 0°, 45° and 90° to measure the longitudinal strain, principal strain and transverse strain, respectively. Since the distribution of longitudinal residual stress is more regular than that of transverse residual stress, only the strains in the 0° direction are considered in this experiment. The strain variation along the weld direction is measured, and the analysis software then yields the longitudinal residual stress. Figure 6 shows the operation site and the strain-gauge positions for the blind-hole experiment. The residual stresses of each plate are collected through the TST3822E static strain test analyzer and computer software.

Figure 6. Blind hole method to collect residual stress.

Preprocessing the welding process data

According to the correlation between the collected welding data and weld formation quality, MATLAB software and the Relief-F algorithm assign an influence weight to each data feature; features whose weights are below the threshold, i.e. data types irrelevant or only weakly related to weld formation quality, are excluded. The collected data include welding current, arc voltage, welding speed, welding gun angle, steel plate thickness, welding gas flow, welding wire diameter, inductance value and welding wire elongation. The Relief-F algorithm requires the number of neighbours and the number of sampling rounds to be set: for the experimental sample data collected here, the number of neighbours is \(k=3,5,7,8,9,10,15,18,25\) and the number of sampling rounds \(m\) is 80. The calculation results are shown in Fig. 7; the average over the groups is used as the final weight of each data feature, as listed in Table 1.

Figure 7. The final weight of each data feature.

Among the collected welding data features, arc voltage and welding current have the largest influence weights on weld formation quality, 0.254 and 0.232, respectively. The main reason is that increasing the arc voltage and welding current directly increases the heat input to the weld seam and the amount of melted welding wire, thereby increasing the width, penetration and reinforcement of the weld. The next most influential feature is welding speed, with an influence weight of 0.173: when the welding speed is too high, the weld cools too quickly and less metal is deposited, which degrades weld formation quality; when it is too low, heat concentrates and the molten pool widens, causing burn-through and other welding defects. In CO2 gas-shielded welding, the gas flow rate is also a key parameter affecting weld formation quality, with a calculated influence weight of 0.171: too large a gas flow destabilizes and disturbs the molten pool and arc, causing turbulence and spatter, while too small a flow directly weakens the protective effect of the gas and degrades weld formation. The inductance value affects weld penetration, with a weight of 0.16, and the welding wire elongation directly affects the protective effect of the gas, with a weight of 0.144. The welding gun angle, steel plate thickness and welding wire diameter also have some influence on weld formation quality, with weights of 0.13, 0.08 and 0.05, respectively.

In this verification case, the data sample size for CO2 gas-shielded welding of the ship component structure is 350 and \(\alpha\) is 0.145, giving a threshold range for the influence weight on weld forming quality of \(0<\tau \le 0.14\) (the worked calculation is shown below). Combined with the calculated feature weights, the influencing factors whose weights exceed the threshold are taken as the main process parameters in this experiment: arc voltage, welding current, welding speed, welding gas flow, inductance value and welding wire extension length. These are used as the key input variables for constructing the welding quality prediction model.
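As a quick check of the stated bound, substituting \(\alpha=0.145\) and \(m=350\) into the condition given in the feature selection section yields:

$$\tau\ll\frac{1}{\sqrt{\alpha m}}=\frac{1}{\sqrt{0.145\times 350}}\approx\frac{1}{7.12}\approx 0.14$$

which is consistent with the reported threshold range \(0<\tau\le 0.14\).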

Predict the welding quality

Using MATLAB as the verification platform, this case applies the APB algorithm model to predict the welding quality of the ship component. 300 sets of welding data are selected to train the algorithm model (complete data in Supplementary Table S2) and 74 sets are selected for verification; the verification data set is shown in Table 2 (complete data in Supplementary Table S3). The weld forming coefficient is taken as the target variable, and six variables are selected as the key welding quality influencing factors according to the feature selection results in the "Preprocessing the welding process data" section: welding current, arc voltage, welding speed, welding wire elongation, inductance value and welding gas flow.

After many experiments with the above welding data, the upper limit \(w_{max}\) is set to 0.9 and the lower limit \(w_{min}\) to 0.4, and the maximum number of PSO iterations \(k_{max}\) is set to 1000. The self-cognitive and social cognitive factors \(c_{1max}\), \(c_{1min}\), \(c_{2max}\) and \(c_{2min}\) are set to 2.5, 1.25, 2.5 and 1.25, respectively; with this combination the APB algorithm's global search capability and convergence speed are balanced and better results are achieved. The Metropolis criterion of the SA algorithm is introduced into the iterative calculation of PSO; in the case verification, the initial temperature \(T(0)=10^4\) is attenuated with the cooling coefficient \(\mu=0.9\). The 24 sets of welding data in Table 2 are substituted into the trained APB model to predict and verify the weld forming coefficient. The actual output of each verification sample is compared with the expected value and the relative error is calculated, as shown in Table 3 (complete data in Supplementary Table S4). In this case, the maximum and minimum relative prediction errors of the SAPSO-BPNN (APB) model on the validation data set are 8.764% and 5.364%, respectively. In general, the error of the proposed prediction algorithm is small and satisfies the accuracy requirements for predicting the welding quality of ship components.

In addition, to further illustrate the improvements and advantages of the proposed APB algorithm, the BPNN, the BPNN optimized by particle swarm optimization (PSO-BP) and the APB algorithm are applied to the same welding data set to predict the residual welding stress. Some of the comparison results are shown in Fig. 8.

Figure 8. The predictive outputs and comparison result of different algorithms.

The calculation results of the evaluation indicators R 2 , MAPE and RMSE are shown in Table 4. In comparison, the prediction accuracy of the PSO-BP algorithm on the welding data samples is higher than that of the BPNN, and the accuracy of the APB algorithm is in turn significantly higher than that of PSO-BP.

Optimize the welding process parameters

Several workpieces with high welding residual stress were found in the experiment. Their quality did not meet requirements, resulting in scrap and unnecessary economic loss for the enterprise. To reduce the loss and improve efficiency, the unqualified combinations of welding process parameters are optimized using the global optimization capability of the genetic algorithm.

The relevant genetic algorithm parameters are selected as follows: the maximum number of generations, population size, crossover probability and mutation probability are 100, 50, 0.7 and 0.01, respectively. The proposed prediction model is used as the objective function, and the smaller the residual stress value, the higher the fitness. The experiment is carried out again to optimize the defective parameter combinations in real time, and the residual stress of the optimized product is re-measured; the results are shown in Table 5. The experimental results show that the optimized combination of process parameters yields products with lower residual stress, improving quality, reducing economic losses and providing a reference for real-time improvement of the welding process in enterprises.

Conclusion and future works

To meet the requirements of real-time monitoring and accurate prediction of ship unit-assembly welding quality, an IoT-based welding data acquisition framework is first established to obtain stable and reliable data. The welding process monitoring elements are determined with feature dimensionality reduction: the Relief-F algorithm selects the crucial features from the historical data set. Secondly, the correlation between process parameters and welding quality is established, and the prediction model for ship unit-assembly welding is constructed by fusing adaptive simulated annealing, particle swarm optimization and the back-propagation neural network. The genetic algorithm is then selected to optimize the welding parameters. Finally, experimental welding of a ship component is used as an example to verify the effectiveness of the proposed key techniques.

The experimental results show that the proposed APB prediction model predicts the welding characteristics more effectively than the traditional methods, with a prediction accuracy above 91.236%: the coefficient of determination (R 2 ) increases from 0.659 to 0.952, the mean absolute percentage error (MAPE) falls from 39.83% to 1.77%, and the root mean square error (RMSE) falls from 0.4933 to 0.0709. This demonstrates that the technique can be used for online monitoring and accurate prediction of the welding quality of ship components. It realizes real-time collection and efficient transmission of welding big data, including welding process parameters, information on the operating status of welding equipment, and welding quality indicators. In addition, with the support of new-generation information technologies such as the IoT and big data, the dynamic quality data of the welding process can be tracked in real time and fully exploited for online monitoring and accurate quality prediction. As automated welding equipment is applied and developed, more welding quality data and influence factors will become available; with the continuous updating and mining of welding data, an even more accurate welding quality prediction model will need to be established.

The proposed method supports dynamic control of the processing quality of ship unit-assembly welding. However, its implementation is limited by the diversity and complexity of the ship section assembly-welding process, so further effort and innovation are needed to address these limitations. It is necessary to improve the perception and management of real-time data in the IoT system, so as to promote deep fusion of the physical and virtual workshops and to establish more reliable virtual models and multi-dimensional welding simulations. Meanwhile, with the support of a more complete real-time database and welding quality mapping mechanism, the ship welding quality analysis capability can be continuously enhanced and the processing quality prediction method further improved and innovated.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

Stanic, V., Hadjina, M., Fafandjel, N. & Matulja, T. Toward shipbuilding 4.0—An Industry 4.0 changing the face of the shipbuilding industry. Brodogradnja 69, 111–128. https://doi.org/10.21278/brod69307 (2018).


Ang, J., Goh, C., Saldivar, A. & Li, Y. Energy-efficient through-life smart design, manufacturing and operation of ships in an Industry 4.0 environment. Energies 10 , 610. https://doi.org/10.3390/en10050610 (2017).

Remes, H. & Fricke, W. Influencing factors on fatigue strength of welded thin plates based on structural stress assessment. Weld. World 6 , 915–923. https://doi.org/10.1007/s40194-014-0170-7 (2014).


Remes, H. et al. Factors affecting the fatigue strength of thin-plates in large structures. Int. J. Fatigue 101 , 397–407. https://doi.org/10.1016/j.ijfatigue.2016.11.019 (2017).

Li, L., Liu, D., Ren, S., Zhou, H. & Zhou, J. Prediction of welding deformation and residual stress of a thin plate by improved support vector regression. Scanning 2021 , 1–10. https://doi.org/10.1155/2021/8892128 (2021).

Fricke, W. et al. Fatigue strength of laser-welded thin-plate ship structures based on nominal and structural hot-spot stress approach. Ships Offshore Struct. 10 , 39–44. https://doi.org/10.1080/17445302.2013.850208 (2015).

Li, L., Liu, D., Liu, J., Zhou, H. & Zhou, J. Quality prediction and control of assembly and welding process for ship group product based on digital twin. Scanning 2020 , 1–13. https://doi.org/10.1155/2020/3758730 (2020).

Franciosa, P., Sokolov, M., Sinha, S., Sun, T. & Ceglarek, D. Deep learning enhanced digital twin for closed-loop in-process quality improvement. CIRP Ann. 69 , 369–372. https://doi.org/10.1016/j.cirp.2020.04.110 (2020).

Febriani, R. A., Park, H.-S. & Lee, C.-M. An approach for designing a platform of smart welding station system. Int. J. Adv. Manuf. Technol. 106 , 3437–3450. https://doi.org/10.1007/s00170-019-04808-6 (2020).

Liu, J. et al. Digital twin-enabled machining process modeling. Adv. Eng. Inf. 54 , 101737. https://doi.org/10.1016/j.aei.2022.101737 (2022).

Liu, J. et al. A digital twin-driven approach towards traceability and dynamic control for processing quality. Adv. Eng. Inf. 50 , 101395. https://doi.org/10.1016/j.aei.2021.101395 (2021).

Chen, J., Wang, T., Gao, X. & Wei, L. Real-time monitoring of high-power disk laser welding based on support vector machine. Comput. Ind. 94 , 75–81. https://doi.org/10.1016/j.compind.2017.10.003 (2018).

Rauber, T. W., De Assis Boldt, F. & Varejao, F. M. Heterogeneous feature models and feature selection applied to bearing fault diagnosis. IEEE Trans. Ind. Electron. 62 , 637–646. https://doi.org/10.1109/TIE.2014.2327589 (2015).

Bahmanyar, A. R. & Karami, A. Power system voltage stability monitoring using artificial neural networks with a reduced set of inputs. Int. J. Electr. Power Energy Syst. 58 , 246–256. https://doi.org/10.1016/j.ijepes.2014.01.019 (2014).

Rostami, M., Berahmand, K., Nasiri, E. & Forouzandeh, S. Review of swarm intelligence-based feature selection methods. Eng. Appl. Artif. Intell. 100 , 104210. https://doi.org/10.1016/j.engappai.2021.104210 (2021).

Liao, T. W. Improving the accuracy of computer-aided radiographic weld inspection by feature selection. NDT E Int. 42 , 229–239. https://doi.org/10.1016/j.ndteint.2008.11.002 (2009).

Jiang, H. et al. Convolution neural network model with improved pooling strategy and feature selection for weld defect recognition. Weld. World 65 , 731–744. https://doi.org/10.1007/s40194-020-01027-6 (2021).

Zhang, Z. et al. Multisensor-based real-time quality monitoring by means of feature extraction, selection and modeling for Al alloy in arc welding. Mech. Syst. Signal Process. 60–61 , 151–165. https://doi.org/10.1016/j.ymssp.2014.12.021 (2015).

Abdel-Basset, M., El-Shahat, D., El-henawy, I., de Albuquerque, V. H. C. & Mirjalili, S. A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection. Expert Syst. Appl. 139 , 112824. https://doi.org/10.1016/j.eswa.2019.112824 (2020).

Le, T. T. et al. Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests. Bioinformatics 33 , 2906–2913. https://doi.org/10.1093/bioinformatics/btx298 (2017).


Pal, S., Pal, S. K. & Samantaray, A. K. Neurowavelet packet analysis based on current signature for weld joint strength prediction in pulsed metal inert gas welding process. Sci. Technol. Weld. Join. 13 , 638–645. https://doi.org/10.1179/174329308X299986 (2008).

Pal, S., Pal, S. K. & Samantaray, A. K. Prediction of the quality of pulsed metal inert gas welding using statistical parameters of arc signals in artificial neural network. Int. J. Comput. Integr. Manuf. 23 , 453–465. https://doi.org/10.1080/09511921003667698 (2010).

Nykänen, T., Björk, T. & Laitinen, R. Fatigue strength prediction of ultra high strength steel butt-welded joints. Fatigue Fract. Eng. Mat. Struct. 36 , 469–482. https://doi.org/10.1111/ffe.12015 (2013).

Lu, J., Shi, Y., Bai, L., Zhao, Z. & Han, J. Collaborative and quantitative prediction for reinforcement and penetration depth of weld bead based on molten pool image and deep residual network. IEEE Access 8 , 126138–126148. https://doi.org/10.1109/ACCESS.2020.3007815 (2020).

Luo, Y., Li, J. L. & Wu, W. Nugget quality prediction of resistance spot welding on aluminium alloy based on structureborne acoustic emission signals. Sci. Technol. Weld. Join. 18 , 301–306. https://doi.org/10.1179/1362171812Y.0000000102 (2013).

Shim, J.-Y., Zhang, J.-W., Yoon, H.-Y., Kang, B.-Y. & Kim, I.-S. Prediction model for bead reinforcement area in automatic gas metal arc welding. Adv. Mech. Eng. 10 , 168781401878149. https://doi.org/10.1177/1687814018781492 (2018).

Lei, Z., Shen, J., Wang, Q. & Chen, Y. Real-time weld geometry prediction based on multi-information using neural network optimized by PCA and GA during thin-plate laser welding. J. Manuf. Process. 43 , 207–217. https://doi.org/10.1016/j.jmapro.2019.05.013 (2019).

Chaki, S., Bathe, R. N., Ghosal, S. & Padmanabham, G. Multi-objective optimisation of pulsed Nd:YAG laser cutting process using integrated ANN–NSGAII model. J. Intell. Manuf. 29 , 175–190. https://doi.org/10.1007/s10845-015-1100-2 (2018).

Wang, Y. et al. Weld reinforcement analysis based on long-term prediction of molten pool image in additive manufacturing. IEEE Access 8 , 69908–69918. https://doi.org/10.1109/ACCESS.2020.2986130 (2020).

Hartl, R., Praehofer, B. & Zaeh, M. Prediction of the surface quality of friction stir welds by the analysis of process data using artificial neural networks. Proc. Inst. Mech. Eng. Part L J. Mater. Des. Appl. 234 , 732–751. https://doi.org/10.1177/1464420719899685 (2020).

Chang, Y., Yue, J., Guo, R., Liu, W. & Li, L. Penetration quality prediction of asymmetrical fillet root welding based on optimized BP neural network. J. Manuf. Process. 50 , 247–254. https://doi.org/10.1016/j.jmapro.2019.12.022 (2020).

Jin, C., Shin, S., Yu, J. & Rhee, S. Prediction model for back-bead monitoring during gas metal arc welding using supervised deep learning. IEEE Access 8 , 224044–224058. https://doi.org/10.1109/ACCESS.2020.3041274 (2020).

Hu, W. et al. Improving the mechanical property of dissimilar Al/Mg hybrid friction stir welding joint by PIO-ANN. J. Mater. Sci. Technol. 53 , 41–52. https://doi.org/10.1016/j.jmst.2020.01.069 (2020).

Cruz, Y. J. et al. Ensemble of convolutional neural networks based on an evolutionary algorithm applied to an industrial welding process. Comput. Ind. 133 , 103530. https://doi.org/10.1016/j.compind.2021.103530 (2021).

Pereda, M., Santos, J. I., Martín, Ó. & Galán, J. M. Direct quality prediction in resistance spot welding process: Sensitivity, specificity and predictive accuracy comparative analysis. Sci. Technol. Weld. Join. 20 , 679–685. https://doi.org/10.1179/1362171815Y.0000000052 (2015).

Wang, T., Chen, J., Gao, X. & Li, W. Quality monitoring for laser welding based on high-speed photography and support vector machine. Appl. Sci. 7 , 299. https://doi.org/10.3390/app7030299 (2017).

Das, B., Pal, S. & Bag, S. Torque based defect detection and weld quality modelling in friction stir welding process. J. Manuf. Process. 27 , 8–17. https://doi.org/10.1016/j.jmapro.2017.03.012 (2017).

Petković, D. Prediction of laser welding quality by computational intelligence approaches. Optik 140 , 597–600. https://doi.org/10.1016/j.ijleo.2017.04.088 (2017).

Yu, R., Han, J., Zhao, Z. & Bai, L. Real-time prediction of welding penetration mode and depth based on visual characteristics of weld pool in GMAW process. IEEE Access 8 , 81564–81573. https://doi.org/10.1109/ACCESS.2020.2990902 (2020).

Casalino, G., Campanelli, S. L. & Memola Capece Minutolo, F. Neuro-fuzzy model for the prediction and classification of the fused zone levels of imperfections in Ti6Al4V alloy butt weld. Adv. Mater. Sci. Eng. 2013 , 1–7. https://doi.org/10.1155/2013/952690 (2013).

Rout, A., Bbvl, D., Biswal, B. B. & Mahanta, G. B. A fuzzy-regression-PSO based hybrid method for selecting welding conditions in robotic gas metal arc welding. Assem. Autom. 40 , 601–612. https://doi.org/10.1108/AA-12-2019-0223 (2020).

Kim, K.-Y. & Ahmed, F. Semantic weldability prediction with RSW quality dataset and knowledge construction. Adv. Eng. Inf. 38 , 41–53. https://doi.org/10.1016/j.aei.2018.05.006 (2018).

AbuShanab, W. S., AbdElaziz, M., Ghandourah, E. I., Moustafa, E. B. & Elsheikh, A. H. A new fine-tuned random vector functional link model using Hunger games search optimizer for modeling friction stir welding process of polymeric materials. J. Mater. Res. Technol. 14 , 1482–1493. https://doi.org/10.1016/j.jmrt.2021.07.031 (2021).

Kennedy, J. Particle Swarm Optimization. In Encyclopedia of Machine Learning (eds. Sammut, C. & Webb, G. I.) 760–766. https://doi.org/10.1007/978-0-387-30164-8_630 (2011).

Sun, C. et al. Prediction method of concentricity and perpendicularity of aero engine multistage rotors based on PSO-BP neural network. IEEE Access 7 , 132271–132278. https://doi.org/10.1109/ACCESS.2019.2941118 (2019).

Holland, J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence (The MIT Press, New York, 1992). https://doi.org/10.7551/mitpress/1090.001.0001 .



Acknowledgements

The work is supported by the National Natural Science Foundation of China under Grants 52075229 and 52371324, in part by the Provincial Natural Science Foundation of China under Grant KYCX20_3121 and the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant SJCX22_1923, and sponsored by the Jiangsu Qinglan Project.

Author information

Authors and Affiliations

Jiangsu University of Science and Technology, Zhenjiang, 212100, Jiangsu, China

Jinfeng Liu, Yifa Cheng, Xuwen Jing & Yu Chen

Southeast University, Nanjing, 211189, China

Xiaojun Liu


Contributions

All authors contributed to the study conception and design. The first draft of the manuscript was written by J.L., manuscript review and editing were performed by Y.C., X.J., X.L. and Y.C. All authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Jinfeng Liu or Xuwen Jing.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Liu, J., Cheng, Y., Jing, X. et al. Prediction and optimization method for welding quality of components in ship construction. Sci Rep 14 , 9353 (2024). https://doi.org/10.1038/s41598-024-59490-w


Received: 10 January 2024

Accepted: 11 April 2024

Published: 23 April 2024

DOI: https://doi.org/10.1038/s41598-024-59490-w


Keywords: Quality prediction, Components welding, Welding quality


case study data quality

IMAGES

  1. Digiratina Technology Solutions

    case study data quality

  2. Case Study Research Method in Psychology

    case study data quality

  3. Decision Inc. optimises the customer service of a global beverage conglomerate through its data

    case study data quality

  4. Maintaining Data Quality from Multiple Sources Case Study

    case study data quality

  5. Case Study Data Quality.docx

    case study data quality

  6. Case Studies • Gavroshe USA, Inc

    case study data quality

VIDEO

  1. (Mastering JMP) Visualizing and Exploring Data

  2. Quality Enhancement during Data Collection

  3. Qualitative Research Designs

  4. Difference between Data Analytics and Data Science . #shorts #short

  5. Big Data Introduction 11 Case Study Data Warehouse

  6. iA: Taking 5 logins to 1 with Master Data Management, removing complexity and increasing trust

COMMENTS

  1. Data Quality Case Studies: How We Saved Clients Real Money Thanks to

    Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim "garbage in - garbage out" coined decades ago, continues to apply today. Recent examples of data verification shortcomings abound, including JP Morgan/Chase's 2013 fiasco and this lovely list of Excel snafus ...

  2. Data Quality Roadmap. Part II: Case Studies

    Airbnb's case study. This is a case study for Airbnb, compiled based on public information and made by authors of the roadmap. The roadmap is based on their description of data quality (part 1 ...

  3. Maintaining Data Quality from Multiple Sources Case Study

    Maintaining Data Quality from Multiple Sources Case Study. There is a wealth of data within the healthcare industry that can be used to drive innovation, direct care, change the way systems function, and create solutions to improve patient outcomes. But with all this information coming in from multiple unique sources that all have their own ...

  4. PDF A Framework for Data Quality: Case Studies October 2023

    The second case study describes the data quality assessments of a new method for collecting Consumer Price Index gasoline price data from retailers or data aggregators instead of a sample survey at BLS. While crowdsourcing data directly from retailers leads to efficiencies in both collection efforts and costs,

  5. Data Governance and Data Quality Use Cases

    While Data Quality Management at an enterprise happens both at the front (incoming data pipelines) and back ends (databases, servers), the whole process is defined, structured, and implemented through a well-designed framework. ... One of the SAS users group conducted a Case Study on National Bank of Canada, where the SAS system was used to ...

  6. Data Analytics Case Study Guide 2024

    A data analytics case study comprises essential elements that structure the analytical journey: Problem Context: A case study begins with a defined problem or question. It provides the context for the data analysis, setting the stage for exploration and investigation.. Data Collection and Sources: It involves gathering relevant data from various sources, ensuring data accuracy, completeness ...

  7. What Is Data Quality? Dimensions, Standards, & Examples

    The DQAF is organized around prerequisites and 5 dimensions of data quality: Assurances of Integrity: This ensures objectivity in the collection, processing, and dissemination of statistics. Methodological Soundness: The statistics follow internationally accepted standards, guidelines, or good practices.

  8. How to Create a Business Case for Data Quality Improvement

    Step No. 3. Profile the current state of data quality and its business implications. Once the scope of the business case has been agreed on, initial data profiling can begin. Carry out data profiling early and often. Establish a benchmark at the initial level of data quality, prior to its improvement, to help you objectively demonstrate the ...

  9. Building a case for data quality: What is it and why is it important

    According to an IDC study, 30-50% of organizations encounter a gap between their data expectations and reality.A deeper look at this statistic shows that: 45% of organizations see a gap in data lineage and content,; 43% of organizations see a gap in data completeness and consistency,; 41% of organizations see a gap in data timeliness,; 31% of organizations see a gap in data discovery, and

  10. To Improve Data Quality, Start at the Source

    To become a more data-driven organization, managers and teams must adopt a new mentality — one that focuses on creating data correctly the first time to ensure quality throughout the process.

  11. Improving Data Quality in Clinical Research Informatics Tools

    Maintaining data quality is a fundamental requirement for any successful and long-term data management. Providing high-quality, reliable, and statistically sound data is a primary goal for clinical research informatics. ... In this paper, we describe a real-life case study on assessing and improving the data quality at one of healthcare ...

  12. Case Study Method: A Step-by-Step Guide for Business Researchers

    The quality of a case study does not only depend on the empirical material collection and analysis but also on its reporting ... The authors interpreted the raw data for case studies with the help of a four-step interpretation process (PESI). Raw empirical material, in the form of texts from interviews, field notes of meetings, and observation ...

  13. Modern Data Quality Management: A Proven 6 Step Guide

    Data Quality Management Steps. Step 1: Baseline Current Data Quality Levels. Step 2: Rally And Align The Organization. Step 3: Implement Broad Data Quality Monitoring. Step 4: Optimize Incident Resolution. Step 5: Create Custom Data Quality Monitors. Step 6: Incident Prevention.

  14. Data Quality Case Studies: How We Saved Clients Real Money ...

    Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation. Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim "garbage in - garbage out" coined decades ago, continues to apply today.

  15. Big Data Quality Case Study Preliminary Findings

    Big Data Quality Case Study Preliminary Findings. Oct 1, 2013. By David Becker , Patricia King , William McMullen , Lisa Lalis , David Bloom , Dr. Ali Obaidi , Donna Fickett. A set of four case studies related to data quality in the context of the management and use of Big Data are being performed and reported separately; these will also be ...

  16. Data Quality in Healthcare: 3 Real-Life Stories

    Data quality is crucial, though there are few industries in which it's a life-or-death issue. The healthcare field is a notable one - a missing value, an additional value, or the wrong value could all lead to serious injury or even a fatality. Healthcare organizations must take steps to improve data quality to better protect patients.

  17. Data Quality Benefits

    Data quality and your customers. Engaging your customers is vital to driving your business. Data quality can help you improve your customer records by verifying and enriching the information you already have. And beyond contact info, you can manage customer interaction by storing additional customer preferences such as time of day they visit your site and which content topics and type they are ...

  18. Case Studies

    Using Exploratory Data Analysis to Improve the Fresh Foods Ordering Process in Retail Stores. This case study presents a real-world example of how the thought processes of data scientists can contribute to quality practice. See how explorative data analysis and basic statistics helped a grocery chain reduce inefficiencies in its retail ...

  19. Automated detection of poor-quality data: case studies in healthcare

    In healthcare, clinical data can be of inherently poor quality due to subjectivity and clinical uncertainty. An example of this is pneumonia detection from chest X-ray images. The labeling of a ...
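
    The snippet only hints at how poor-quality labels are detected, so the example below shows one common, generic approach: score each recorded label with out-of-fold predicted probabilities and flag the labels the model finds very unlikely. The data is synthetic and the method is an assumption, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic two-class data with a few deliberately flipped labels (a stand-in for noisy clinical labels).
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flipped = rng.choice(len(y), size=15, replace=False)
y[flipped] = 1 - y[flipped]

# Out-of-fold predicted probabilities, so each label is judged by a model that never saw it.
proba = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")
label_confidence = proba[np.arange(len(y)), y]

# Flag records whose recorded label the model finds very unlikely.
suspect = np.where(label_confidence < 0.2)[0]
print(f"{len(suspect)} suspect labels; {np.isin(suspect, flipped).sum()} of them were actually flipped")
```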

  20. Case Study: Using Data Quality and Data Management to ...

    Patient misidentification is also responsible for 35 percent of denied insurance claims, costing hospitals up to $1.2 million annually. Melanie Mecca, Director of Data Management Products & Services for the CMMI Institute, calls this situation "a classic Master Data and Data Quality problem." A multitude of different vendors is one of the ...
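
    The master-data side of patient misidentification usually comes down to matching near-duplicate registrations. The standard-library sketch below is a minimal, assumed illustration of that matching step, not the approach described in the case study.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical registration records; the same person entered twice with small variations.
patients = [
    {"id": 1, "name": "John A. Smith",  "dob": "1970-02-01"},
    {"id": 2, "name": "Jon Smith",      "dob": "1970-02-01"},
    {"id": 3, "name": "Maria Gonzalez", "dob": "1985-07-19"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Candidate duplicates: same date of birth and names that are close enough.
for left, right in combinations(patients, 2):
    if left["dob"] == right["dob"] and similarity(left["name"], right["name"]) > 0.7:
        print(f"possible same patient: record {left['id']} vs record {right['id']}")
```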

  21. Data Quality Analysis and Improvement: A Case Study of a Bus ...

    Due to the rapid development of the mobile Internet and the Internet of Things, the volume of generated data keeps growing. The topic of data quality has gained increasing attention recently. Numerous studies have explored various data quality (DQ) problems across several fields, with corresponding effective data-cleaning strategies being researched. This paper begins with a comprehensive and ...
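
    As a generic illustration of the kind of cleaning such sensor data needs, the sketch below applies duplicate removal and a plausibility range to made-up bus GPS pings; the rules are assumptions, not the paper's strategies.

```python
import pandas as pd

# Made-up bus GPS pings: one repeated transmission and two physically implausible speeds.
pings = pd.DataFrame({
    "bus_id":    [7, 7, 7, 7, 7],
    "ts":        pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 08:00:00",
                                 "2024-01-01 08:00:10", "2024-01-01 08:00:20",
                                 "2024-01-01 08:00:30"]),
    "speed_kmh": [35.0, 35.0, -4.0, 310.0, 38.0],
})

cleaned = (
    pings.drop_duplicates(subset=["bus_id", "ts"])              # drop repeated transmissions
         .loc[lambda d: d["speed_kmh"].between(0, 120)]         # keep plausible speeds only
         .sort_values(["bus_id", "ts"])
         .reset_index(drop=True)
)
print(cleaned)
```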

  22. Overview of Data Quality: Examining the Dimensions, Antecedents, and

    The data quality management framework was mainly built on the information product view (Ballou et al., 1998; Wang et al., 1998). Total data quality management (TDQM) and information product map (IPMAP) were developed based on the information product view. Studies of data quality management have focused more on context.

  23. Case Study

    Conducting case study research involves several steps that need to be followed to ensure the quality and rigor of the study. Here are the steps to conduct case study research: ... Rich data: Case study research can generate rich and detailed data, including qualitative data such as interviews, observations, and documents. This can provide a ...

  24. Case Study Methodology of Qualitative Research: Key Attributes and

    A case study is one of the most commonly used methodologies of social research. This article looks into the various dimensions of a case study research strategy, the different epistemological strands that determine the particular case study type and approach adopted in the field, the factors that can enhance the effectiveness of case study research, and the debate ...

  25. Data-driven evolution of water quality models: An in-depth

    Data-driven evolution of water quality models: An in-depth investigation of innovative outlier detection approaches - A case study of Irish Water Quality Index (IEWQI) model. April 25, 2024. Water Research, Volume 255, 15 May 2024, 121499.
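
    For contrast with the paper's innovative detectors, a baseline outlier rule is easy to state: flag readings outside 1.5 IQR of the quartiles. The sketch below applies it to made-up water quality index values; it is not one of the paper's methods.

```python
import pandas as pd

# Made-up daily water quality index readings with two implausible values.
wqi = pd.Series([72, 74, 71, 73, 75, 12, 74, 72, 148, 73], name="wqi")

# Classic IQR rule: anything far outside the interquartile range is flagged.
q1, q3 = wqi.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = wqi[(wqi < q1 - 1.5 * iqr) | (wqi > q3 + 1.5 * iqr)]
print(outliers)   # flags the implausible readings 12 and 148
```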

  26. Atmosphere

    A case study demonstrates air quality forecasting in sub-tropical urban cities. Since MTS decomposition reduces complexity and makes the features easier to explore, both the speed and the accuracy of deep learning models are improved. ... So that the novelty is not limited to a certain kind of data, in this study, the ...
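
    As a univariate stand-in for the MTS decomposition described here, the sketch below decomposes a synthetic hourly pollutant-like series into trend, seasonal, and residual components with statsmodels. It only illustrates why decomposition simplifies the forecasting task; it is not the paper's method.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic hourly pollutant-like series with a daily cycle plus noise (a stand-in for sensor data).
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
rng = np.random.default_rng(1)
values = 40 + 10 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 2, len(idx))
series = pd.Series(values, index=idx)

# Decompose into trend / seasonal / residual; forecasters are then fit to the simpler components.
parts = seasonal_decompose(series, model="additive", period=24)
print(parts.seasonal.head(24))      # the recovered daily cycle
print(parts.resid.dropna().std())   # what is left for a forecasting model to explain
```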

  27. Surface water quality index forecasting using multivariate ...

    In addition, the "Case study and data preparation" section provides descriptions of all developed models. The implementation of the created models is described in the "Applied methodology" section. The outcomes of the models developed to anticipate BOD are examined in the "Implementing multivariate mode decomposition machine learning models" section.

  28. Prediction and optimization method for welding quality of ...

    Using MATLAB as the verification platform, this case applies the APB algorithm model to predict the welding quality of the ship component; 300 sets of welding data are selected to train the algorithm ...

  29. CE-CERT Research Seminar : Dr. Naomi Zimmerman

    This will draw upon case studies from across Pittsburgh, PA, Vancouver, BC, and Uttar Pradesh, India. Lastly, I will briefly touch on strategies for knowledge dissemination of maps or models built with air quality data, since these networks are used by disparate groups who may benefit from knowledge sharing that goes beyond traditional academic ...