AMIA Annu Symp Proc. 2018.
Defining and Developing a Generic Framework for Monitoring Data Quality in Clinical Research

Miss Lauren Houston

1 School of Medicine, Faculty of Science, Medicine and Health, University of Wollongong, Wollongong NSW 2522, Australia

2 Illawarra Health and Medical Research Institute, University of Wollongong, Wollongong NSW 2522, Australia

A/Prof Ping Yu

3 Centre for IT-enabled Transformation, School of Computing and Information Technology, Faculty of Engineering and Information Sciences, University of Wollongong, Wollongong NSW 2522, Australia

Dr Allison Martin

Dr Yasmine Probst

Evidence for the need for high data quality in clinical research is well established. The rigor of clinical research conclusions relies heavily on good quality data, which in turn relies on good documentation practices. Little attention has been given to clear guidelines and definitions for monitoring data quality. To address this, a “fit-for-use” data quality monitoring framework (DQMF) for clinical research was developed based on a holistic design-oriented approach. An integrated literature review and feasibility study underpinned the framework development. An ontology of key terms, concepts, methods, and standards was recorded using a consensus approach and a mind-mapping technique. The DQMF is presented as a nested concentric network illustrating concept relationships and hierarchy. Face validation was conducted, and common terminology and definitions are listed. The consolidated DQMF can be adapted according to study context and data availability, aiding in the development of a long-term strategy with increased efficacy for clinical data quality monitoring.


Regardless of study design or clinical area, high quality data collection and standardized data processing and representation are paramount to ensure reliable research findings 1 - 3 . Evidence has linked poor data quality to incorrect conclusions and recommendations 4 - 7 . Preventing data error is key during the development, design, and collection of clinical data 8 . The National Institutes of Health (NIH) 9 broadly defines a clinical trial as “a research study in which one or more human subjects are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes” [para. 3]. Given this broad definition, several challenges arise in ensuring high data quality due to differing clinical objectives and data requirements. Many strategies and interventions aimed at reducing error in clinical trials have been developed, including standard operating procedures (SOPs), personnel training, data monitoring or auditing, and careful design of case report forms (CRFs). However, current international and national guidelines lack consistency, creating uncertainty for clinical researchers. Therefore, in order to optimize data quality, standard procedures must be implemented.

The need for high data quality in clinical research has been well established. The term ‘data quality’, sometimes referred to as information quality 10 , is a multidimensional and hierarchical concept 11 . The World Health Organization (WHO) 12 defines data quality as “ … the ability to achieve desirable objectives using legitimate means. Quality data represents what was intended or defined by their official source, are objective, unbiased and comply with known standards” [pg.10]. An alternate definition emerged from Kerr et al. 13 , who suggested that good data quality is data “fit for use” [pg.5] for the objectives of data collection. High data quality is crucial to a research organization’s success, while poor data quality (often referred to as ‘dirty data’) can significantly impact the productivity of a business or institution 14 . It is essential to minimize errors, as a poorly designed study with inferior data points and results cannot be redeemed. In the context of clinical research, data that is not “fit for use” may lead to biased results, conclusions, and recommendations, and may compromise participant health. To date, a precise definition of data quality is lacking. This, in turn, creates misunderstandings that may weaken the validity and reliability of data quality assessment and monitoring methods 4 .

Data management needs to be consistent, effective and efficient within each study 8 , 15 . Regardless of the method used to collect, handle and store data within clinical trials, a rigorous management system is essential. For academic clinical trials, developing and maintaining a data management system is a challenge 16 . This is largely due to special requirements for individual trials, for example, the need to implement specific frameworks, and the expense of developing and running the software, which also requires a sophisticated information technology infrastructure 17 . Academic clinical trials are less likely to implement common clinical data management systems such as those utilized within the pharmaceutical industry (e.g. Oracle Clinical); instead they often implement specialized, smaller in-house systems 18 . A recent survey found that considerable heterogeneity in data management exists and that limited open access or freely available standard documents are available 18 . Furthermore, over 50% of the surveyed clinical research centers stated that although they had a data management system in place, the system did not comply with guidelines and legal requirements (GCP, ECRIN, FDA, GAMP, and ISO) for both the internal system and independent validation by an external auditor. Similarly, survey results from the Association of Academic Health Centers (AAHC) highlighted that the greatest barrier to clinical trial operations was a lack of resources, systems, and procedures within the organisation 19 . It is clear that clinical trials, especially within the academic research community, need a standardized, open-source data monitoring framework to improve data quality and promote best practice.

With rapid developments in technology, clinical research now relies heavily on the evaluation of automatically and electronically communicated data for critical decision-making, through which data quality has become increasingly important. Emerging literature on technological improvements demonstrates new opportunities and concerns about the reuse of clinical research data. The American Medical Informatics Association has compiled recommendations and stressed the urgency and complexity of issues that surround the secondary use of clinical data 20 . Data quality has also been identified as the primary obstacle to integrated data repositories 21 . In an effort to overcome this issue and to continuously improve data quality, standard monitoring methods are required before, during and after primary data collection, and at a larger scale for the reuse of clinical data in research. To optimize quality, clinical studies should implement and publish their approach to monitoring data quality in order to increase efficacy, reduce costs and follow procedures designed to minimize inaccurate and incomplete data.

The aim of this research was to develop a “fit-for-use” data quality monitoring framework (DQMF) for clinical research. This framework will aid clinical trials in obtaining and maintaining high data quality by providing guidance on critical areas that relate to trial operations throughout the clinical research process.

Framework development

When determining data quality criteria, different approaches can be applied, including empirical, practitioner-based, theoretical, literature-based, pragmatic and design-oriented approaches 22 , 23 . For the purpose of this research, a holistic design-oriented approach was applied to design and develop the DQMF. A design-oriented approach provided guidance to the researchers to create the framework (artefact) and to further understand the apparent reality of the framework’s different stakeholders (clinical research trials) 24 . This approach also helped the researchers to recognize data failures by developing the framework against a real-world state 25 . Design science is considered a problem-solving paradigm and seeks to extend the boundaries of human and organizational capabilities by creating new and innovative artefacts 26 . The purpose of such artefacts is to improve the efficiency and effectiveness of the organization’s characteristics, its work systems and its employees’ capabilities. Design science argues that human knowledge and understanding of the problem and the solution are acquired in the ‘building’ and the ‘application’ of the artefact. Therefore, this research follows the conceptual framework and seven guidelines proposed by Hevner et al. 26 for understanding, executing and evaluating design-oriented research. This creative design process utilised a build-and-evaluate loop to iterate the evolving design of the generated artefact. This preliminary methodological research focuses largely on the ‘build’ process of the resulting artefact, keeping in mind that further evaluation and testing need to be conducted. Overall, the proposed framework aims to help those designing, implementing and working in clinical research to understand the complex inter- and intra-relationships between the concepts that need to be planned, both methodically and structurally, in order to improve the data quality of clinical research.

The initial design of the DQMF was guided by an integrated literature review and feasibility study 27 of data quality concepts, from both an information sciences and a clinical grounding. Outcomes from the feasibility study determined that clinical trials implement ad hoc methods pragmatically to ensure data quality; thus, there is a necessity for further research into ‘standard practice’. An ontology of key terms, concepts, methods, and standards was extracted and recorded from the literature review and the feasibility study survey questions. A consensus approach and mind-mapping techniques were used to present associations in a non-linear diagram/network 28 . The dependent variable, ‘data quality monitoring’, was placed at the centre of the network to compose the mind map, where associations were added and ‘branched’. This process was undertaken by the researchers (L.H., P.Y., A.M. and Y.P.) to foster a natural thinking process, allowing for the addition of new concepts, relationships and annotations 29 . Furthermore, branches and nodes were grouped together under comparable topic areas via researcher agreement to construct a hierarchical tree-like figure.

Once the researchers (L.H., P.Y., A.M. and Y.P.) came to a consensus, the DQMF was presented to a convenience sample (n=8) of working health professionals (dieticians, nutritionists, educators, public health practitioners) for face validity testing. Participants had clinical research experience (1 – 15+ years) in university academic, private institute and hospital settings. The primary researcher (L.H.) moderated the interactive one-hour workshop, which aimed to gain feedback on the design and usability of the proposed DQMF within different clinical research settings. Each participant was provided an individual copy of the DQMF and encouraged to make note of any questions or issues. The workshop communicated the process by which the artefact was created and defined as the mechanism for finding an effective and efficient solution. Once the background information was presented, participants were asked to refine and make relevant changes to the DQMF based on their own knowledge and expertise. The primary researcher (L.H.) then opened the discussion to the group to explore the reasons why amendments were suggested to fit each participant’s clinical focus. The workshop identified that standardized terminologies, definitions and dialogue are crucial to the success of the DQMF. Based on participants’ responses, amendments to the DQMF were discussed and agreed upon, and a supporting list of key terms and definitions was devised. Each stage of the systematic framework development was aligned with the international guidelines (GCP, ECRIN, FDA, GAMP, and ISO) to ensure the taxonomy and terminology used complied with global standard procedures and policies.

Data Quality Monitoring Framework (DQMF)

Refinement and evaluation of the key concepts has led to the development of the DQMF. This framework contains the key components of data quality, data quality monitoring, and data quality management, presented in a nested concentric network to illustrate their relationships and hierarchy ( Figure 1 ). Each layer of the framework contains specific and highly important procedures and concepts. The importance of training and education is highlighted by its expansion across all layers of the DQMF. It was determined in the stakeholder workshop that dialogue, definitions, and terminology should be implemented consistently across clinical research. To clarify terminology related to the DQMF and ensure effective communication, we have included Table 1 , which lists key terms and their definitions specific to the DQMF.

Figure 1. The data quality monitoring framework.

Table 1. DQMF terms, abbreviations, and definitions

Data Governance

Four key independent variables were identified (inner layer) and adapted from Nahm 30 , who illustrates that the way data definition, collection, processing and representation are each handled impacts data use, which in turn impacts data and information quality; conversely, data and information quality also impact use. The data evolution life cycle (DELC) was considered for inclusion within the inner tier throughout the development of the DQMF, as it reflects a sequence of stages known as data collection, organization, presentation, and application 31 . The researchers chose not to integrate this cycle as they believe the stages of collection, organization and presentation relate closely to the terminology and stages of Nahm’s 30 framework of collection, processing and representation, respectively. Additionally, the DQMF’s main focus is ‘data quality monitoring’: Nahm’s framework was designed to highlight the factors that affect data quality, while the DELC represents data evolution. The addition of ‘application’ was discussed; however, the researchers believe that the framework illustrates how data is utilized and applied in clinical research by linking ‘data’ to ‘information’.

Data quality monitoring was separated into two main concepts: quality assurance and quality control (middle tiers). The terms quality assurance and quality control are often used inaccurately or interchangeably 8 . Quality assurance is the process to “prevent” data errors, which includes methods such as audits and other techniques to ensure data integrity. Auditing is a recognized method that has been used to assess and develop the quality of information 32 , 33 . Quality assurance audits within clinical and healthcare settings are extensively employed and are the major strategy to ensure high-quality data 30 , 34 - 36 . Quality assurance activities include examining the design of case report forms, analyzing the data collection techniques, and regular training of data entry personnel and data management 37 , 38 . On the other hand, quality control is the process to “alleviate” or “remove” the impact of errors that have occurred during data collection and/or analysis 15 . This refers to the operational techniques used to fulfil requirements for quality. Recognised methods of quality control include periodic monitoring (daily, weekly, and monthly) through pragmatic data range and consistency checks, query management, and double data entry to minimize errors 8 . Therefore, quality control is the continuous quality assurance activity undertaken to verify clinical trial-related processes against the agreed standards.
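The range, consistency and double-entry checks described above can be illustrated with a brief sketch. The field names, plausible ranges and consistency rules below are invented for illustration only; in practice they would come from the study protocol and data management plan, not from the DQMF itself.

```python
# Hypothetical quality-control checks for a clinical trial record.
# Field names and valid ranges are illustrative assumptions.

def range_check(record, field, low, high):
    """Flag a value that is missing or outside its plausible range."""
    value = record.get(field)
    if value is None:
        return f"{field}: missing value"
    if not (low <= value <= high):
        return f"{field}: {value} outside range [{low}, {high}]"
    return None

def consistency_check(record):
    """Flag logically inconsistent field combinations."""
    if record.get("diastolic_bp") and record.get("systolic_bp"):
        if record["diastolic_bp"] >= record["systolic_bp"]:
            return "diastolic_bp must be lower than systolic_bp"
    return None

def double_entry_compare(entry_a, entry_b):
    """Compare two independent entries of the same record; return mismatched fields."""
    return [f for f in entry_a if entry_a[f] != entry_b.get(f)]

record = {"age": 47, "systolic_bp": 128, "diastolic_bp": 84}
queries = [q for q in (
    range_check(record, "age", 18, 90),
    range_check(record, "systolic_bp", 70, 250),
    consistency_check(record),
) if q]
print(queries)  # an empty list means no queries were raised for this record
```

Failed checks would typically be raised as queries to site staff through the trial's query management process rather than corrected silently.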

Data quality management and data governance (the outer tiers) include developing and implementing national and internal standards and regulations across the full data life cycle, including planning before, and execution of, protocols and policies for data capture and analysis. Currently, there is no open access Good Clinical Practice (GCP) standard on data monitoring that is broadly recognized, detailed and applicable to all clinical research. Therefore, the design of the DQMF has drawn upon an illustration by Ohmann et al. 16 , who highlight the central importance of the International Conference on Harmonisation (ICH)-GCP guidelines within clinical research at an international level and within United States regulations and European Union directives. The illustration links regulations and guidelines that are relevant to GCP-compliant computer systems and data management practices, connecting important references from one document to another. The researchers took guidance from Ohmann’s work and extracted key concepts from each of the regulatory requirements and documents. In light of this, the researchers agreed that a simpler and broader approach was required to guide clinical researchers by providing overarching concepts of infrastructure, protocol and regulations/standards. This approach will allow clinical researchers to make an informed decision regarding the most suitable management strategy for their individual trial while incorporating and highlighting legal requirements. Further, it acknowledges the broad range of clinical research trials and the fact that healthcare requires person-to-person interactions for collaboration and integration between strategy, process (automated/non-automated) and the supporting information systems. This is pertinent because the feasibility study that guided this research found heterogeneity in data management practices, with only 50% of respondents reporting a data management plan in place 27 . By providing a consolidated framework to optimize the efficacy of data management, the researchers aim to provide clear guidance on clinical research data quality monitoring that is both time and cost effective.

A systematic approach to data quality monitoring is essential to ensure high data quality, providing confidence in data reuse and technological improvements for clinical research. The DQMF provides an easy-to-use guide for monitoring the data quality of individual clinical research trials with reference to key international documents. Within the pharmaceutical/private industries 40 and the information sciences literature, data quality audit tools and procedures appear to be well developed, with many frameworks acknowledging the multiple dimensions of data quality 50 - 55 . However, only a small body of clinical research has described the use of data quality frameworks 22 , 40 , 56 - 58 , and even fewer studies have identified appropriate methods to quantify the quality of data 30 . Although many data quality dimensions and attributes have been determined within the clinical and health literature, the majority provide no usable definitions. Public sharing of such knowledge is crucial in developing a standardized approach that can be implemented across the clinical and broader research community to improve the rigor of clinical research.

Many organizations collect and analyze data for their own benefit to meet SOPs and ensure quality assurance and control. In terms of quality assurance and quality control, the SOP is one of the most generic, reusable and important documents within clinical research 15 . However, within academic research settings, on-site audits that include quality assurance are less often published. This may be due to unclear audit methods, lack of time and funding, audits being perceived as unnecessary for unregulated studies, and publishing SOPs not being seen as a ‘value added’ activity 59 - 61 . There is general agreement among leading clinical trial management groups that reliable monitoring guidelines would need to be determined on a risk-adapted basis for each trial 36 . It is recognized that different strategies need to be tailored for different types of clinical trials to determine adequate and appropriate monitoring 41 , 62 . However, published methodology papers are warranted to promote routine auditing and monitoring within both academic and commercial research settings, and research grants seldom include funding for such programs 63 . The DQMF has gathered current published information on the conduct of clinical trial data management, albeit limited. The application of the framework is a vital implementation strategy for the overall improvement of the quality of clinical trial data and the follow-on effect on results, conclusions and recommendations 60 . Even with the best intentions, identifying all possible data discrepancies before they occur may not be achievable; therefore, a standardized framework, such as the one in this study, will provide useful guidance for the pragmatic implementation of continuous quality improvement.

Interest in standardization within the clinical research community has grown in recent years; therefore, the DQMF considers data quality monitoring from a broad perspective. This generic framework brings together key concepts from the scientific literature, government documents, and policies to illustrate links between concepts and their effects on each other within and among layers. This differs from previously published frameworks, which have focused on specific concepts in isolation without considering the inter- and intra-relationships; this singular approach has caused confusion within the clinical research space. A survey conducted by the Clinical Trials Transformation Initiative (CTTI) found heterogeneity in data quality monitoring intensity, focus and methodology within and between academic/government, industry and clinical research organisations 64 . The utility of the DQMF is that it provides a single consolidated framework, which allows adaptations according to study context and data availability. This research will benefit the development of a long-term strategy to fill the knowledge gap and reduce confusion around data quality monitoring in clinical research trials, particularly as no standardized definitions currently apply across all clinical contexts. With the increasing movement from paper-based forms to a digital and adaptive learning environment, it is necessary to improve the methods and approach to the collection, storage and sharing of clinical research data. Electronic solutions are relatively new in clinical research and require major changes to existing procedures and professional training. Additionally, challenges arise in incorporating electronic data standards (CDISC and HL7) and the role they play in ensuring efficient and economic data sharing within clinical research 65 .
The proposed DQMF provides guidance to clinical researchers on areas related to trial operations and ensuring high data quality throughout the entire research process. By utilizing this generic framework, obstacles related to primary data quality and to the reuse of clinical research data are anticipated to be minimized. As the DQMF continues to evolve through the design-oriented approach, our knowledge and understanding of the challenges that arise from adapting to an electronic world will be addressed. This is vital to ensure the generic framework has future applicability.

The holistic design-oriented approach provided guidance for developing the DQMF by aiding the researchers to understand the clinical stakeholders. The framework aims to improve clinical research trial practices, which currently consist of complex, isolated and independent tools, procedures and frameworks. By providing an easily integrated knowledge development tool for clinical research practice, the DQMF will support clinical research as a value-added function, providing oversight and guidance on the complex area of data quality monitoring. Additionally, providing clear definitions of concepts is key to its success. It should be highlighted that the proposed DQMF was developed from the published literature and draws on the personal experiences of the research team and the participants in the face validation workshop; this may be considered a source of bias. A major limitation of the proposed DQMF is that it is yet to be applied in practice and implemented in a clinical research trial. The researchers stress that application of the DQMF is required to test the framework within a broad spectrum of clinical research studies to identify facilitators and barriers, thereby ensuring best practice for data quality. According to the design-science research guidelines, further evaluation, contribution, rigor, and communication are needed to develop a convincing argument for the utility of this framework and its purpose for real world use. It is suggested that empirical research be conducted through the use of a reactive Delphi study 66 to validate the framework and allow experts to reach a consensus of opinion on the illustration, terms and what constitutes data quality monitoring in clinical research. Industry-wide definitions and methods are essential to enable strategic management and the evaluation of quality information. Without standardization, principal investigators of clinical research are left with inefficient data quality management.

A data quality monitoring framework (DQMF) has been developed for clinical research trials. The utility of this single consolidated framework is to reduce confusion around data quality monitoring whilst allowing for adaptations according to study context and data availability. The framework will guide new trials or identify procedures in existing trials to improve data quality monitoring. The framework demonstrates how data quality monitoring develops over the life cycle of a clinical study and how knowledge management may guide new approaches to research. The DQMF must now be validated by applying the framework and terminology to various clinical research trials for real world use. This will be crucial to refine and evaluate the generic detailed framework. Overall, the DQMF will aid in the development of a long-term strategy to increase efficacy for clinical research data quality monitoring.

List of Abbreviations

AAHC: Association of Academic Health Centers

CDISC: Clinical Data Interchange Standards Consortium

CRF: Case report form

CTTI: Clinical Trials Transformation Initiative

DELC: Data evolution life cycle

DQMF: Data quality monitoring framework

ECRIN: European Clinical Research Infrastructure Network

EU: European Union

FDA: Food and Drug Administration

GAMP: Good Automated Manufacturing Practice

GCP: Good Clinical Practice

HL7: Health Level Seven

ICH: International Conference on Harmonisation

ISO: International Organization for Standardization

MOP: Manual of operations

NIH: National Institutes of Health

SOP: Standard Operating Procedure

US: United States

WHO: World Health Organization

  • Open access
  • Published: 29 May 2021

Big data quality framework: a holistic approach to continuous quality management

  • Ikbal Taleb,
  • Mohamed Adel Serhani (ORCID: orcid.org/0000-0001-7001-3710),
  • Chafik Bouhaddioui &
  • Rachida Dssouli

Journal of Big Data, volume 8, Article number: 76 (2021)


Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data is all about data: how it is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may result in unpredictable consequences; in this case, confidence and worthiness in the data and its source are lost. In the Big Data context, data characteristics such as volume, multi-heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining quality through the Big Data lifecycle requires quality profiling and verification before any processing decision. A BDQ Management Framework for enhancing pre-processing activities while strengthening data control is proposed. The framework uses a new concept called the Big Data Quality Profile, which captures quality outline, requirements, attributes, dimensions, scores, and rules. Using the Big Data profiling and sampling components of the framework, a faster and more efficient data quality estimation is initiated before and after an intermediate pre-processing phase. The exploratory profiling component of the framework plays an initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions. It generates quality rules by applying various pre-processing activities and their related functions. These rules feed into the Data Quality Profile and result in quality scores for the selected quality attributes. The framework implementation and dataflow management across the various quality management processes are discussed; the paper concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.


Big Data is universal [ 1 ]; it consists of large volumes of data with unconventional types, which may be structured, unstructured, or in continuous motion. Whether it is used by industry and governments or by research institutions, a new way to handle Big Data, from technology to research approaches in its management, is highly required to support data-driven decisions. The expectations of Big Data analytics vary from trend finding to pattern discovery in different application domains such as healthcare, business, and scientific exploration. The aim is to extract significant insights and decisions. Extracting this precious information from large datasets is not an easy task; dedicated planning and appropriate selection of tools and techniques are required to optimize the exploration of Big Data.

Owning a huge amount of data does not often lead to valuable insights and decisions, since Big Data does not necessarily mean big insights. In fact, it can complicate the processes involved in fulfilling such expectations, and considerable resources may be required, in addition to adapting existing analytics algorithms to cope with Big Data requirements. Generally, data is not ready to be processed as it is. It should go through many stages, including cleansing and pre-processing, before undergoing any refining, evaluation, and preparation treatment for the next stages along its lifecycle.

Data Quality (DQ) is a very important aspect of Big Data for assessing the aforementioned pre-processing data transformations. This is because Big Data is mostly obtained from the web, social networks, and the IoT, where it may be found in structured or unstructured form, with no schema and possibly no quality properties. Exploring data profiling, and more specifically DQ profiling, is essential before data preparation and pre-processing for both structured and unstructured data. A DQ assessment should also be conducted for all data-related content, including attributes and features. An analysis of the assessment results can then provide the necessary elements to enhance, control, monitor, and enforce DQ along the Big Data lifecycle; for example, maintaining high data quality (conforming to its requirements) in the processing phase.
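The idea of attribute-level DQ profiling can be sketched as follows. The metric choices (completeness, validity) and the validity predicates below are illustrative assumptions for a toy dataset, not the paper's implementation, which operates over a full Big Data Quality Profile:

```python
# Minimal sketch of attribute-level data quality profiling:
# per-attribute completeness and validity scores in [0, 1],
# assuming a list-of-dicts dataset and invented validity predicates.

def profile(records, validators):
    """Return {field: {completeness, validity}} quality scores."""
    scores = {}
    n = len(records)
    for field, is_valid in validators.items():
        values = [r.get(field) for r in records]
        present = [v for v in values if v is not None]
        completeness = len(present) / n if n else 0.0
        validity = (sum(1 for v in present if is_valid(v)) / len(present)
                    if present else 0.0)
        scores[field] = {"completeness": completeness, "validity": validity}
    return scores

records = [
    {"age": 34, "country": "AE"},
    {"age": -5, "country": "FR"},
    {"age": None, "country": "XX"},
]
validators = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
}
print(profile(records, validators))
```

In a Big Data setting these scores would typically be estimated on samples rather than full datasets, precisely to avoid the cost of exhaustive assessment that the paper highlights.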

Data quality has been an active and attractive research area for several years [ 2 , 3 ]. In the context of Big Data, quality assessment processes are hard to implement, since they are time- and cost-consuming, especially for pre-processing activities. These issues have intensified because the available quality assessment techniques were initially developed for well-structured data and are not fully appropriate for Big Data. Consequently, new data quality processes must be carefully developed to assess the data's origin, domain, format, and type, and an appropriate DQ management scheme is critical when dealing with Big Data. Furthermore, existing Big Data architectures do not incorporate quality assessment practices throughout the Big Data lifecycle apart from pre-processing, and the few new initiatives remain limited to specific applications [ 4 , 5 , 6 ]. However, the evaluation and estimation of Big Data quality should be handled in all phases of the Big Data lifecycle, from data inception to analytics, to support data-driven decisions.

The work presented in this paper concerns Big Data quality management throughout the Big Data lifecycle. The objective of such a management perspective is to provide users or data scientists with a framework capable of managing DQ from data inception to analytics and visualization, thereby supporting decisions. The definition of acceptable Big Data quality depends largely on the type of application and on Big Data requirements. The need to evaluate Big Data quality before engaging in any Big Data project is pressing, because the high cost of processing useless data at an early stage of its lifecycle can thereby be avoided. Further challenges for the quality evaluation process arise when dealing with unstructured, schema-less data collected from multiple sources. A Big Data Quality Management Framework can provide quality management mechanisms to handle and ensure data quality throughout the Big Data lifecycle by:

Improving the processes of the Big Data lifecycle to be quality-driven, so that quality assessment is built in at every stage of the Big Data architecture.

Providing quality assessment and enhancement mechanisms to support cross-process data quality enforcement.

Introducing the concept of Big Data Quality Profile (DQP) to manage and trace the whole data pre-processing procedures from data source selection to final pre-processed data and beyond (processing and analytics).

Supporting profiling of data quality and quality rules discovery based on quantitative quality assessments.

Supporting deep quality assessment using qualitative quality evaluations on data samples obtained using data reduction techniques.

Supporting data-driven decision making based on the latest data assessments and analytics results.

The remainder of this paper is organized as follows. Sect. " Overview and background " provides background on Big Data and data quality, introduces the problem statement, and states the research objectives. The research literature related to Big Data quality assessment approaches is presented in Sect. " Related research studies ". The components of the proposed framework and their main functionalities are described in Sect. " Big data quality management framework ". Implementation and dataflow management are detailed in Sect. " Implementations: Dataflow and quality processes development ", and Sect. " Conclusion " concludes the paper and points to our ongoing research developments.

Overview and background

An exponential increase in global inter-network activities and data storage has triggered the Big Data era. Application domains such as Facebook, Amazon, Twitter, YouTube, Internet of Things sensors, and mobile smartphones are the main players and data generators: the amount of data generated daily is around 2.5 quintillion bytes (2.5 exabytes, 1 EB = 10^18 bytes).

According to IBM, Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insights and decision-making. It is used to describe a massive volume of both structured and unstructured data; therefore, Big Data processing using traditional database and software tools is a difficult task. Big Data also refers to the technologies and storage facilities required by an organization to handle and manage large amounts of data.

Originally, in [ 7 ], the McKinsey Global Institute identified three Big Data characteristics, commonly known as the "3Vs": Volume, Variety, and Velocity [ 1 , 7 , 8 , 9 , 10 , 11 ]. These characteristics have since been extended to as many as 10 Vs, including Volume, Velocity, Variety, Veracity, Value, Vitality, Viscosity, Visualization, and Vulnerability [ 12 , 13 , 14 ].

In [ 10 , 15 , 16 ], the authors define important Big Data system architectures. The data in Big Data comes from (1) heterogeneous data sources (e-Gov: census data; social networking: Facebook; Web: Google page-rank data), (2) data in different formats (video, text), and (3) data of various forms (unstructured: raw text with no schema; semi-structured: metadata, graph structure as text). Moreover, data travels through different stages composing the Big Data lifecycle. Many aspects of Big Data architectures were compiled from the literature; our enhanced design contributions are illustrated in Fig.  1 and described as follows:

Data generation: this is the phase of data creation. Many data sources can generate this data such as electrophysiology signals, sensors used to gather climate information, surveillance devices, posts to social media sites, videos and still images, transaction records, stock market indices, GPS location, etc.

Data acquisition: it consists of data collection, data transmission, and data pre-processing [ 1 , 10 ]. Due to the exponential growth and availability of heterogeneous data production sources, an unprecedented amount of structured, semi-structured, and unstructured data is available. Therefore, the Big Data Pre-Processing consists of typical data pre-processing activities: integration, enhancements and enrichment, transformation, reduction, discretization, and cleansing .

Data storage: it consists of the data center infrastructure, where the data is stored and distributed among several clusters and data centers, spread geographically around the world. The software storage is supported by the Hadoop ecosystem to ensure a certain degree of fault tolerance storage reliability and efficiency through replication. The data storage stage is responsible for all input and output data that circulates within the lifecycle.

Data analysis: (Processing, Analytics, and Visualization); it involves the application of data mining and machine learning algorithms to process the data and extract useful insights for better decision making. Data scientists are the most valuable users of this phase since they have the expertise to apply what is needed, on what must be analyzed.

figure 1

Big data lifecycle value chain

Data quality, quality dimensions, and metrics

The majority of studies in the area of DQ originate from the database [ 2 , 3 ] and management research communities. According to [ 17 ], DQ is not an easy concept to define; its definition depends on awareness of the data domain. There is a consensus that data quality always depends on the quality of the data source [ 18 ]. The literature also highlights that numerous quality issues remain hidden inside the data and its values.

In the following, the definitions of data quality, data quality dimensions, and quality metrics and their measurements are given:

Data quality: it has many meanings, related to the context, domain, area, and field in which the data is used [ 19 , 20 ]. Academia interprets DQ differently from industry. In [ 21 ], data quality is defined as “The capability of data to satisfy stated and implied needs when used under specified conditions”; DQ is also commonly defined as “fitness for use”. Similarly, [ 20 ] defines data quality as the property, corresponding to quality management, of being appropriate for use or meeting user needs.

Data quality dimensions: DQDs are used to measure, quantify, and manage DQ [ 20 , 22 , 23 ]. Each quality dimension has a specific metric that measures its performance. There are many DQDs, which can be organized into four categories according to [ 24 , 25 ]: intrinsic, contextual, accessibility, and representational [ 14 , 15 , 22 , 24 , 26 , 27 ]. Two important categories (intrinsic and contextual) are illustrated in Fig.  2 , and examples of intrinsic quality dimensions are given in Table 1 .

Metrics and measurements: once the data is generated, its quality should be measured, meaning that a data-driven strategy is adopted to act on the data. It is therefore mandatory to measure and quantify each DQD. Structured or semi-structured data is available as a set of attributes represented in columns or rows, with their values recorded accordingly. In [ 28 ], a quality metric is defined as a quantitative or categorical representation of one or more attributes. Any data quality metric should define whether the values of an attribute respect a targeted quality dimension. The author of [ 29 ] noted that data quality measurement metrics tend to evaluate either binary results (correct or incorrect) or a value between 0 and 100 (with 100% representing the highest quality). This applies to quality dimensions such as accuracy, completeness, consistency, and currency. Examples of DQD metrics are illustrated in Table 2 .
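To make such metrics concrete, the sketch below scores two intrinsic dimensions, completeness and a range-based notion of accuracy, on a 0–100 scale per attribute. It is a minimal illustration only: records are plain Python dicts, and the validity predicate and sample rows are our own assumptions, not part of any cited metric definition.

```python
def completeness(records, attr):
    """Percentage of non-missing values for an attribute (0-100)."""
    values = [r.get(attr) for r in records]
    present = sum(1 for v in values if v is not None)
    return 100.0 * present / len(values)

def accuracy(records, attr, valid):
    """Percentage of non-missing values satisfying a validity predicate."""
    values = [r[attr] for r in records if r.get(attr) is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if valid(v)) / len(values)

rows = [
    {"age": 34, "country": "US"},
    {"age": None, "country": "FR"},   # missing value lowers completeness
    {"age": 212, "country": "US"},    # implausible value lowers accuracy
    {"age": 28, "country": None},
]
print(completeness(rows, "age"))                            # 75.0
print(round(accuracy(rows, "age", lambda a: 0 <= a <= 120), 1))  # 66.7
```

Each attribute can be scored this way against its own predicate, yielding the per-attribute DQD scores discussed in the text.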

figure 2

Data quality dimensions

DQDs must be relevant to the data quality problems that have been identified. A metric thus measures whether attributes comply with the defined DQDs. These measurements are performed for each attribute, given its type and the ranges of values collected from the data profiling process, and they produce DQD scores for the designed metrics of all attributes [ 30 ]. To estimate specific quality dimensions of other data types, such as images, videos, and audio, dedicated metrics need to be defined [ 5 ].

Big data characteristics and data quality

The main Big Data characteristics, commonly named the V's, were initially Volume, Velocity, Variety, and Veracity. Since Big Data's inception, 10 V's have been defined, and new V's will probably be adopted [ 12 ]. Veracity, for example, expresses the trustworthiness of data, mostly known as data quality, while accuracy is often related to precision, reliability, and veracity [ 31 ]. Our tentative mapping among these characteristics, data, and data quality is shown in Table 3 . It is based on the studies in [ 5 , 32 , 33 ], in which the authors attempted to link the V's to the data quality dimensions. In another study [ 34 ], the authors addressed the mapping of the DQD accuracy to the Big Data characteristic Volume and showed that data size has an impact on DQ.

Big data lifecycle: where quality matters?

According to [ 21 , 35 ], data quality issues may appear in each phase of the Big Data value chain. Addressing data quality may follow different strategies, as each phase has its own features: improving the quality of existing data, and/or refining, reassessing, and redesigning the whole processes that generate and collect data, with the aim of improving their quality.

Big Data quality issues have been addressed by many studies in the literature [ 36 , 37 , 38 ]. These studies generally elaborate on the issues and propose generic frameworks without comprehensive approaches and techniques to manage quality across the Big Data lifecycle. Among these, generic frameworks are presented in [ 5 , 39 , 40 ].

Fig.  3 illustrates where data quality can and must be addressed in the Big Data value chain, in phases/stages (1) to (7).

In the data generation phase, there is a need to define how and what data is generated.

In the data transmission phase, the data distribution scheme relies on the underlying networks. Unreliable networks may affect data transfer. Its quality is expressed by data loss and transmission errors.

Data collection refers to where, when, and how the data is collected and handled. Well-defined structured constraint verification on data must be established.

The pre-processing phase is one of the main focus points of the proposed work. It follows a data-driven strategy, and an evaluation process provides the necessary means to ensure the quality of the data for the next phases. Evaluating the DQ before (pre) and after (post) pre-processing on data samples is necessary to strengthen the DQP.

In the Big Data storage phase, some aspects of data quality, such as storage failure, are handled by replicating data on multiple storages. The latter is also valid for data transmission when a network fails to transmit data.

In the Data Processing and Analytics phases, the quality is influenced by both the applied process and data quality itself. Among the various data mining and machine learning algorithms and techniques suitable for Big Data, those that converge rapidly and consume fewer cloud resources will be highly adopted. The relation between DQ and the processing methods is substantial. A certain DQ requirement on these methods or algorithms might be imposed to ensure efficient performance.

Finally, for an ongoing iterative value chain, the visualization phase is essentially a representation of the data in an accessible form, such as a dashboard, which helps decision-makers form a clear picture of the data and its valuable insights. In this work, Big Data is ultimately transformed into useful small data, which is easy to visualize and interpret.
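One way to make these per-phase quality checkpoints concrete is to wrap each lifecycle stage so that a quality score is recorded on its input and output, producing an audit trail across the value chain. The sketch below is illustrative only: the stage name, the cleansing stand-in, and the single completeness score are our simplifications, not components of the framework itself.

```python
def completeness(records):
    """Overall percentage of non-missing cells across all records."""
    cells = [v for r in records for v in r.values()]
    return 100.0 * sum(1 for v in cells if v is not None) / len(cells)

def run_stage(name, stage_fn, records, log):
    """Run one lifecycle stage, logging quality before and after it."""
    before = completeness(records)
    out = stage_fn(records)
    log.append({"stage": name, "before": before, "after": completeness(out)})
    return out

def cleanse(records):
    # Illustrative pre-processing activity: drop rows with any missing value.
    return [r for r in records if all(v is not None for v in r.values())]

log = []
data = [{"a": 1, "b": None}, {"a": 2, "b": 3}]
data = run_stage("pre-processing", cleanse, data, log)
# log[0] -> {'stage': 'pre-processing', 'before': 75.0, 'after': 100.0}
```

The same wrapper could instrument collection, storage, and processing stages, yielding the cross-phase quality feedback the text calls for.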

figure 3

Where quality matters in big data lifecycle?

Data quality issues

Data quality issues generally appear when the quality requirements are not met on the data values [ 41 ]. These issues are due to several factors or processes having occurred at different levels:

Data source level: unreliability, trust, data copying, inconsistency, multi-sources, and data domain.

Generation level: human data entry, sensors’ readings, social media, unstructured data, and missing values.

Process level (acquisition: collection, transmission).

In [ 21 , 35 , 42 ], many causes of poor data quality were enumerated, and a list of elements that affect the quality and the DQDs was produced. This list is illustrated in Table 4 .

Related research studies

Research directions on Big Data differ between industry and academia. Industry scientists mainly focus on the technical implementations, infrastructures, and solutions for Big Data management, whereas researchers from academia tackle theoretical issues of Big Data. Academia’s efforts mainly include the development of new algorithms for data analytics, data replication, data distribution, and optimization of data handling. In this section, the literature review is classified into 3 categories, which are described in the following sub-sections.

Data quality assessment approaches

Existing studies have approached data quality from different perspectives. In the majority of papers, the authors agree that data quality is related to the phases or processes of its lifecycle [ 8 ]; in particular, data quality is highly related to the data generation phases and/or the data's origin. The methodologies adopted to assess data quality are based on traditional data strategies and should be adapted to Big Data. Moreover, the application domain and the type of information (content-based, context-based, or rating-based) affect the way the quality evaluation metrics are designed and applied. In content-based quality metrics, the information itself is used as a quality indicator, whereas in context-based metrics, metadata are used as quality indicators.

There are two main strategies to improve data quality according to [ 20 , 23 ]: data-driven and process-driven. The first handles data quality in the pre-processing phase by applying pre-processing activities (PPAs) such as cleansing, filtering, and normalization. These PPAs are important and occur before the data processing stage, preferably as early as possible. The process-driven quality strategy, in contrast, is applied at each stage of the Big Data value chain.
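The data-driven strategy can be pictured as an ordered chain of PPAs executed before processing. The three functions below (cleansing, filtering, normalization) are deliberately simplistic stand-ins for real activities, operating on a hypothetical "reading" attribute; only the composition pattern is the point.

```python
def cleansing(records):
    """Repair: replace missing numeric readings by 0 (simplistic)."""
    return [{k: (0 if v is None else v) for k, v in r.items()} for r in records]

def filtering(records):
    """Select: keep only plausible readings."""
    return [r for r in records if 0 <= r["reading"] <= 1000]

def normalization(records):
    """Scale: map readings into [0, 1]."""
    top = max(r["reading"] for r in records) or 1
    return [{**r, "reading": r["reading"] / top} for r in records]

# Order matters: repair first, then select, then scale.
PIPELINE = [cleansing, filtering, normalization]

def preprocess(records):
    for ppa in PIPELINE:
        records = ppa(records)
    return records

out = preprocess([{"reading": None}, {"reading": 500}, {"reading": 5000}])
# -> [{'reading': 0.0}, {'reading': 1.0}]
```

A process-driven strategy would instead attach such checks to every stage of the value chain rather than concentrating them in one pre-processing pipeline.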

Data quality assessment was discussed early in the literature [ 10 ]. It is divided into two main categories: subjective and objective. An approach combining these two categories to provide organizations with usable data quality metrics was also proposed; however, it was not designed to deal with Big Data.

In summary, Big Data quality should be addressed early in the lifecycle, at the pre-processing stage. The aforementioned Big Data quality challenges have not been investigated in the literature from all perspectives, and many open issues remain, especially at the pre-processing stage.

Rule-based quality methodologies

Since the data quality concept is context-driven, it may differ from one application domain to another. Defining quality rules involves establishing a set of constraints on data generation, entry, and creation. Poor data can always exist, and rules are created or discovered to correct or eliminate it. Rules themselves, however, are only one part of the data quality assessment approach. Establishing a consistent process for creating, discovering, and applying quality rules should consider the following:

Characterize the quality of data being good or bad from its profile and quality requirements.

Select the data quality dimensions that apply to the data quality assessment context.

Generate quality rules based on data quality requirements, quantitative, and qualitative assessments.

Check, filter, optimize, validate, run, and test rules on data samples for efficient rules’ management.

Generate a statistical quality profile with quality rules. These rules represent an overview of successful valid rules with the expected quality levels.
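The selection and rule-generation steps above can be sketched as a comparison of evaluated DQD scores against required levels, emitting one candidate rule per unmet requirement. Everything here is an assumption for illustration: the `(attribute, dimension)` keying, the placeholder `repair_or_drop` action, and the sample scores are ours, not prescribed by the framework.

```python
def discover_rules(scores, requirements):
    """Generate one candidate quality rule per unmet requirement.

    scores / requirements: {(attribute, dimension): percentage}
    """
    rules = []
    for key, required in requirements.items():
        if scores.get(key, 0.0) < required:
            attr, dim = key
            rules.append({"attribute": attr, "dimension": dim,
                          "required": required,
                          "action": "repair_or_drop"})  # placeholder action
    return rules

scores = {("age", "completeness"): 75.0, ("age", "accuracy"): 90.0}
requirements = {("age", "completeness"): 95.0, ("age", "accuracy"): 80.0}
rules = discover_rules(scores, requirements)
# One rule is generated: completeness of 'age' (75%) is below the required 95%.
```

Generated rules would then be checked, validated, and tested on samples before entering the statistical quality profile, as the list above prescribes.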

Hereafter, data quality rules are discovered from the data quality evaluation. These rules are then used in Big Data pre-processing activities to improve the quality of the data. The discovery process raises many challenges and should consider several factors, including data attributes, data quality dimensions, data quality rule discovery, and their relationship with pre-processing activities.

In (Lee et al., 2003), the authors concluded that data quality problems depend on data, time, and context. Quality rules are applied to the data to solve and/or avoid quality problems; accordingly, quality rules must be continuously assessed, updated, and optimized.

Most studies on the discovery of data quality rules come from the database community. These studies are often based on conditional functional dependencies (CFDs) to detect inconsistencies in data. CFDs are used to formulate data quality rules, which are generally expressed manually and discovered automatically using several CFD approaches [ 3 , 43 ].
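For intuition, a conditional functional dependency can be checked directly: the illustrative rule below states that whenever country = "UK", the zip code determines the city. Tuples satisfying the condition that agree on the left-hand side but differ on the right-hand side are flagged as inconsistent. The attribute names and sample rows are invented for the example.

```python
from collections import defaultdict

def cfd_violations(records, condition, lhs, rhs):
    """Groups of values violating the CFD: condition -> (lhs determines rhs)."""
    groups = defaultdict(set)
    for r in records:
        if all(r.get(k) == v for k, v in condition.items()):
            groups[r[lhs]].add(r[rhs])
    # An lhs value mapping to more than one rhs value violates the dependency.
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "Glasgow"},    # inconsistent
    {"country": "NL", "zip": "EH1", "city": "Eindhoven"},  # condition not met
]
bad = cfd_violations(rows, {"country": "UK"}, "zip", "city")
# -> {'EH1': {'Edinburgh', 'Glasgow'}}
```

Real CFD discovery approaches, as cited above, infer such rules automatically rather than requiring them to be stated by hand.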

Data quality assessment in Big Data has been addressed in several studies. In [ 32 ], a Data Quality-in-Use model was proposed to assess the quality of Big Data. Business rules for data quality are used to decide on which data these rules must meet the pre-defined constraints or requirements. In [ 44 ], a new quality assessment approach was introduced and involved both the data provider and the data consumer. The assessment was mainly based on data consistency rules provided as metadata.

The majority of research studies on data quality and the discovery of data quality rules are based on CFDs and databases. In Big Data quality, the size, variety, and veracity of data are key characteristics that must be considered, and they should be handled before the pre-processing phase to reduce the quality assessment time and resources. Regarding quality rules, it is fundamental to use these rules to eliminate poor data and enforce quality on existing data, while following a data-driven quality context.

Big data pre-processing frameworks

Pre-processing the data before performing any analytics is paramount. However, several challenges emerge at this crucial phase of the Big Data value chain [ 10 ]. Data quality is one of these challenges and must be given high consideration in the Big Data context.

As pointed out in [ 45 ], data quality problems arise when dealing with multiple data sources, which significantly increases the requirements for data cleansing. Additionally, the large size of datasets, arriving at an uncontrolled speed, generates an overhead on the cleansing processes. In [ 46 , 47 , 48 ], NADEEF, an extensible data cleaning system, was proposed, and an extension of NADEEF for cleaning streaming Big Data was presented in [ 49 ]. The system addresses data quality through the data cleansing activity, using data quality rules and functional dependency rules [ 14 ].

Numerous other studies on Big Data management frameworks exist. In these studies, the authors surveyed and proposed Big Data management models dealing with storage, pre-processing, and processing [ 50 , 51 , 52 ]. An up-to-date review of the techniques and methods for each process involved in the management processes is also included.

The importance of quality evaluation in Big Data management has generally not been addressed. In some studies, Big Data characteristics are the only recommendations for quality, and no mechanisms have been proposed to map or handle quality issues that might be a consequence of these Big Data V's. A Big Data management framework that includes data quality management must be developed to cope with end-to-end quality management across the Big Data lifecycle.

Finally, it is worth mentioning that research initiatives and solutions on Big Data quality are still in their preliminary phase; there is much to do on the development and standardization of Big Data quality. Big Data quality is a multidisciplinary, complex, and multi-variant domain, where new evaluation techniques, processing and analytics algorithms, storage and processing technologies, and platforms will play a key role in the development and maturity of this active research area. We anticipate that researchers from academia will contribute to the development of new Big Data quality approaches, algorithms, and optimization techniques, which will advance beyond the traditional approaches used in databases and data warehouses. Additionally, industries will lead development initiatives of new platforms, solutions, and technologies optimized to support end-to-end quality management within the Big Data lifecycle.

Big data quality management framework

The purpose of the proposed Big Data Quality Management Framework (BDQMF) is to address quality at all stages of the Big Data lifecycle. This is achieved by managing data quality before and after the pre-processing stage, providing feedback at each stage, and looping back to the previous phase whenever possible. We also believe that data quality must be handled at data inception; however, this is not considered in this work.

To overcome the limitations of existing Big Data architectures in managing data quality, a Big Data quality pre-processing approach is proposed: a quality framework [ 53 ]. In our framework, the quality evaluation process aims to extract the actual quality status of Big Data and proposes efficient actions to avoid, eliminate, or enhance poor data, thus improving its quality. The framework features the creation and management of a DQP and its repository. The proposed scheme deals with data quality evaluation before and after the pre-processing phase; these practices are essential to ensure a certain quality level for the next phases while keeping the cost of the evaluation optimal.

In this work, a quantitative approach is used. It consists of an end-to-end data quality management system that deals with DQ through tasks executed ahead of pre-processing to evaluate Big Data quality. It starts with data sampling, data and DQ profiling, and the gathering of user DQ requirements, then proceeds to DQD evaluation and the discovery of quality rules from quality scores and requirements. Each data quality rule is realized by one-to-many pre-processing functions (PPFs) under a specific pre-processing activity (PPA); a PPA, such as cleansing, aims at increasing data quality. Pre-processing is applied to Big Data samples, and the quality is re-evaluated to update and certify that the quality profile is complete; the profile is then applied to the whole Big Dataset, not only to data samples. Before pre-processing, the DQP is tuned and revisited by quality experts for endorsement, based on an equivalent data quality report that states the quality scores of the data, not the rules.
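The sample-evaluate-preprocess-re-evaluate cycle described above can be outlined as follows. This is a heavily simplified sketch under stated assumptions: a fixed sampling size, a single completeness dimension standing in for the full DQD set, and a flat dictionary standing in for the much richer Data Quality Profile.

```python
import random

def completeness(records):
    """Overall percentage of non-missing cells (0-100)."""
    cells = [v for r in records for v in r.values()]
    return 100.0 * sum(1 for v in cells if v is not None) / len(cells)

def certify(dataset, repair, target, sample_size=100, seed=7):
    """Evaluate DQ on a sample before/after pre-processing; record a DQP."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    dqp = {"pre_score": completeness(sample)}   # quality before pre-processing
    repaired = repair(sample)                   # apply PPFs to the sample
    dqp["post_score"] = completeness(repaired)  # quality after pre-processing
    dqp["certified"] = dqp["post_score"] >= target
    return dqp

# Illustrative dataset: every fourth record has a missing value.
data = [{"x": i if i % 4 else None} for i in range(1000)]
dqp = certify(data, lambda rs: [r for r in rs if r["x"] is not None], target=95.0)
# dqp["certified"] is True: every remaining sample record is complete.
```

Only after such a certification on samples would the endorsed profile be applied to the whole dataset, as the text specifies.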

Framework description

The BDQM framework is illustrated in Fig.  4 , where all the components cooperate, relying on the Data Quality Profile. The DQP is initially created as a data profile and is progressively extended from the data collection phase to the analytics phase to capture important quality-related information; for example, it contains quality requirements, targeted data quality dimensions, quality scores, and quality rules.

figure 4

Big data sources

Data lifecycle stages are part of the BDQMF. Feedback generated in all stages is analyzed and used to correct the data, improve its quality, and detect any DQ-management-related failures. The key components of the proposed BDQMF include:

Big Data Quality Project (Data Sources, Data Model, User/App Quality Requirements, Data domain),

Data Quality Profile and its Repository,

Data Preparation (Sampling and Profiling),

Exploratory Quality Profiling,

Quality Parameters and Mapping,

Quantitative Quality Evaluation,

Quality Control,

Quality Rules Discovery,

Quality Rules Validation,

Quality Rules Optimization,

Big Data Pre-Processing,

Data Processing,

Data Visualization, and

Quality Monitoring.

A detailed description of each of these components is provided hereafter.

Framework key components

In the following sub-sections, each component is described, along with its input(s) and output(s), its main functions, and its roles and interactions with the other framework components. At each Big Data stage, the Data Quality Profile is created, updated, and adapted until it achieves the quality requirements set by the users or applications at the beginning of the Big Data Quality Project.

Big data quality project module

The Big Data Quality Project module contains all the elements that define the data sources and the quality requirements set by either Big Data users or Big Data applications, representing the quality foundations of the Big Data project. As illustrated in Fig. 5 , any Big Data Quality Project should specify a set of quality requirements as targeted quality goals.

This module represents the starting point of the BDQMF, where specifications of the data model, the data sources, and the targeted quality goals for DQDs and data attributes are defined. These requirements are represented as data quality scores/ratios, which express the acceptance level of the evaluated data quality dimensions. For example, 80% data accuracy, 60% data completeness, and 85% data consistency may be judged by quality experts as accepted levels (or tolerance ratios). These levels can be relaxed using a range of values, depending on the context, the application domain, and the requirements of the targeted processing algorithms.

Let us denote by BDQP(DS , DS’ , Req) a Big Data Quality Project Request that initiates many automatic processes:

A data sampling and profiling process.

An exploratory quality profiling process, which is included in many quality assessment procedures.

A pre-processing phase, which is considered if the required quality scores are not met.

The BDQP contains the input dataset DS , the output dataset DS’ , and the quality requirements Req . The quality requirements are presented as a tuple of sets Req  = ( D , L , A ), where:

D represents a set of data quality dimensions DQD’s (e.g., accuracy, consistency): \({D}=\left\{{{\varvec{d}}}_{0},\dots ,{{\varvec{d}}}_{{\varvec{i}}},\dots ,{{\varvec{d}}}_{{\varvec{m}}}\right\},\)

L is a set of DQD acceptance (tolerance) level ratios (%) set by the user or the application related to the quality project and associated with each DQD, respectively: \({L}=\left\{{{\varvec{l}}}_{0},\dots ,{{\varvec{l}}}_{{\varvec{i}}},\dots ,{{\varvec{l}}}_{{\varvec{m}}}\right\},\)

A is the set of targeted data attributes. If it is not specified, the DQD’s are assessed for the dataset, which includes all possible attributes, since some dimensions need more detailed requirements to be assessed. Therefore, it depends on the DQD and the attribute type: \({A}=\left\{{{\varvec{a}}}_{0},\dots ,{{\varvec{a}}}_{{\varvec{i}}},\dots ,{{\varvec{a}}}_{{\varvec{m}}}\right\}\)

The data quality requirements might later be updated with additional aspects, once the profiling component provides well-detailed information about the data ( DQP Level 0 ). This update is performed within the quality mapping component, which interfaces with expert users to refine, reconfirm, and restructure their data quality parameters over the data attributes.

Data sources: there are multiple Big Data sources. Most of them are generated from new media (e.g., social media) on the internet; other data sources derive from new technologies such as the cloud, sensors, and the IoT.

Data users, data applications, and quality requirements: this module identifies and specifies the input sources of the quality requirement parameters for the data sources. These include users' quality requirements (e.g., domain experts, researchers, analysts, and data scientists) or application quality requirements (applications may vary from simple data processing to machine learning or AI-based applications). For users, a dashboard-like interface captures data requirements and other quality information. This interface can be enriched with information from the data sources, such as attributes and their types, if available, which can efficiently guide users through the inputs and ensure the right data is used. This phase can be initiated after sample profiling or exploratory quality profiling; otherwise, a general quality request is entered in the form of targeted data quality dimensions and their expected quality scores after the pre-processing phase. All the quality requirement parameters and settings are recorded in the Data Quality Profile ( DQP 0 ). DQP Level 0 is created when the quality project is set up.

The quality requirements are specifically set as quality score ratios, goals, or targets to be achieved by the BDQMF. They are expressed as targeted DQDs in the Big Data Quality Project.

Let us denote by Req a set of quality requirements, presented as Req = \(\left\{{{\varvec{r}}}_{0},\dots ,{{\varvec{r}}}_{{\varvec{i}}},\dots ,{{\varvec{r}}}_{{\varvec{m}}}\right\}\) and constructed with the tuple ( D , L, A ). Each element of the list is a quality requirement characterized by \({{\varvec{r}}}_{{\varvec{i}}}=\left({{\varvec{d}}}_{{\varvec{i}}},{{\varvec{l}}}_{{\varvec{i}}},{{\varvec{a}}}_{{\varvec{i}}}\right)\) : \({{\varvec{r}}}_{{\varvec{i}}}\) targets the DQD \({{\varvec{d}}}_{{\varvec{i}}}\) with a minimum accepted ratio level \({{\varvec{l}}}_{{\varvec{i}}}\) for all attributes or for a sub-list of selected attributes \({{\varvec{a}}}_{{\varvec{i}}}.\)

The initial DQP originating from this module is a DQP Level 0, containing the following tuple, as illustrated in Fig.  6 : BDQP (DS, DS’, Req) with Req  =  ( D , L, A )
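As a concrete sketch, the Req list and the initial DQP Level 0 tuple could be represented as follows; all class and field names are purely illustrative and not prescribed by the framework:

```python
from dataclasses import dataclass, field

@dataclass
class QualityRequirement:
    """One element r_i = (d_i, l_i, a_i): a DQD, a minimum accepted
    ratio level, and the attributes it applies to (empty = all)."""
    dqd: str                                        # d_i, e.g. "completeness"
    min_level: float                                # l_i, e.g. 0.67 for 67%
    attributes: list = field(default_factory=list)  # a_i

@dataclass
class DQP0:
    """DQP Level 0: created when the Big Data Quality Project is set up."""
    data_source: str                                   # DS
    target_source: str                                 # DS'
    requirements: list = field(default_factory=list)   # Req = {r_0 .. r_m}

# Hypothetical project setup
req = [QualityRequirement("completeness", 0.67),
       QualityRequirement("accuracy", 0.90, ["age", "zip_code"])]
dqp0 = DQP0("patients.csv", "patients_clean.csv", req)
```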

Data models and data domains

Data models: If the data is structured, a schema is provided to add more detailed quality settings for all attributes. If no such attributes or types exist, the data is considered unstructured, and its quality evaluation will rely on a set of general Quality Indicators (QI). In our framework, these QI are provided especially for cases where a direct identification of DQD’s is not available for an easy quality assessment.

Data domains: Each data domain has a unique set of default quality requirements. Some are very sensitive to accuracy and completeness; others prioritize data currency and timeliness. This module adds value for users or applications when it comes to quality requirements elicitation.

Figure 6: BDQP and quality requirements settings

Figure 7: Exploratory quality profiling modules

Data quality profile creation: Once the Big Data Quality Project (BDQP) is initiated, the DQP level 0 (DQP0) is created and consists of the following elements, as illustrated in Fig. 7 :

Data sources information, which may include datasets, location, URL, origin, type, and size.

Information about data that can be created or extracted from metadata if available, such as database schema, data attributes names and types, data profile, or basic data profile.

Data domains such as business, health, commerce, or transportation.

Data users, which may include the names and positions of each member of the project, security credentials, and data access levels.

Data application platforms, software, programming languages, or applications that are used to process the data. These may include R, Python, Java, Julia, Orange, Rapid Miner, SPSS, Spark, and Hadoop.

Data quality requirements: for each dataset, the expected quality ratios and the tolerance levels within which data is accepted; otherwise, the data is discarded or repaired. These can also be set as a range of quality tolerance levels. For example, if the DQD completeness is required to be equal to or higher than 67%, then the accepted ratio of missing values is equal to or less than 33% (100% − 67%).
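The completeness arithmetic above (a 67% minimum implies at most 33% missing values) can be expressed as a short check; the function names are illustrative:

```python
def completeness_ratio(values):
    """Share of non-missing entries in a list of attribute values."""
    non_missing = sum(1 for v in values if v is not None)
    return non_missing / len(values)

def accepts(values, min_completeness=0.67):
    """True if the missing-value ratio stays within the tolerated
    (1 - min_completeness) budget, i.e. <= 33% at the 67% threshold."""
    return completeness_ratio(values) >= min_completeness

col = [1, 2, 3, None, 5, 6, None, 8, 9, None]  # 3 of 10 missing
# completeness_ratio(col) == 0.7, so the column is accepted at 67%
```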

Data quality profile (DQP) and repository (DQPREPO)

We describe hereafter the content of DQP and the DQP repository and the DQP levels captured through the lifecycle of framework processes.

  • Data quality profile

The data quality profile is generated once a Big Data Quality Project is created. It contains, for example, information about the data sources, domain, attributes, or features. This information may be retrieved from metadata, data provenance, a schema, or the dataset itself. If it is not available, data preparation (sampling and profiling) is needed to collect and extract this information, which supports the upcoming processes; at this point, the Data Profile (DP) is created.

Exploratory quality profiling then generates a proposed list of quality rules. The DP is updated with these rules and converted into a DQP. This helps the user obtain an overview of some DQDs and make a better attribute selection based on this first quality approximation, with a ready-to-use list of rules for pre-processing.

The user/application quality requirements (quality tolerance levels, DQDs, and targeted attributes) are set and added to the DQP. Most likely, the previously proposed quality rules are then updated and tuned; alternatively, a complete redefinition of the quality requirement parameters is performed.

The mapping and selection phase will update the DQP with a DQES, which contains the set of attributes to be evaluated for a set of DQDs, using a set of metrics from the DQP repository.

The Quantitative Quality Evaluation component assesses the DQ and updates the DQES with DQD Scores.

The DQES scores then pass through quality control. If they are validated, the DQP is executed in the pre-processing stage and confirmed in the repository.

If the scores do not satisfy the quality requirements, quality rules discovery, validation, and optimization steps add to or update the DQP configuration until a valid DQD score that satisfies the quality requirements is obtained.

Continuous quality monitoring is performed; an eventual DQ failure triggers a DQP update.

The DQP Repository: The DQPREPO contains detailed data quality profiles per data source and dataset. In the following, an information list managed by the repository is presented:

Data Quality User/App requirements.

Data Profiles, Metadata, and Data Provenance.

Data Quality Profiles (e.g. Data Quality Evaluation Schemes, and Data Quality Rules).

Data Quality Dimensions and related Metrics (metrics formulas and aggregate functions).

Data Domains (DQD’s, BD Characteristics).

DQD’s vs BD Characteristics.

Pre-processing Activities (e.g. Cleansing, and Normalizing) and functions (to replace missing values).

DQD’s vs DQ Issues vs PPF: Pre-processing Functions.

DQD’s priority processing in Quality Rules.

At every stage, module, task, or process, the DQP repository is incrementally updated with quality-related information. This includes, for example, quality requirements, DQES, DQD scores, data quality rules, Pre-Processing activities, activity functions, DQD metrics, and Data Profiles. Moreover, the DQP’s are organized per Data Domain and datatype to allow reuse. Adaptation is performed in the case of additional Big Datasets.

In Table 5 , an example of DQP Repository managed information along with its preprocessing activities (PPA) and their related functions (PPAF), is presented.

DQP lifecycle (Levels) : The DQP goes through the complete process flow of the proposed BDQMF. It starts with the specification of the Big Data Quality Project and ends with quality monitoring as an ongoing process that closes the quality enforcement loop and triggers other processes, which handle DQP adaptation, upgrade, or reuse. In Table 6 , the various DQP levels and their interaction within the BDQM Framework components are described. Each component involves process operations applied to the DQP.

Data preparation: sampling and profiling

Data preparation generates representative Big Data samples that serve as an entry for profiling, quality evaluation, and quality rules validation.

Sampling: Several sampling strategies can be applied to Big Data, as surveyed in [ 54 , 55 ]. These works evaluated the effect of sampling methods on Big Data and concluded that sampling large datasets reduces the run-time and computational footprint of link prediction algorithms while maintaining adequate prediction performance. In statistics, the bootstrap technique evaluates the sampling distribution of an estimator by resampling the original data with replacement. In the Big Data context, bootstrap sampling has been studied in several works [ 56 , 57 ]. In the proposed data quality evaluation scheme, the Bag of Little Bootstraps (BLB) [ 58 ] is used, which combines the results of bootstrapping multiple small subsets of a Big Data dataset. The BLB algorithm draws small samples from the original Big Dataset without replacement; from each of these samples, another set of samples is then created by re-sampling with replacement.
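The BLB scheme described above (small subsets drawn without replacement, each re-sampled with replacement) can be sketched as follows; the subset sizes and counts are illustrative parameters, not values from the paper:

```python
import random

def blb_samples(data, n_subsets, subset_size, n_boot, seed=0):
    """Bag of Little Bootstraps: draw small subsets WITHOUT replacement,
    then resample each subset WITH replacement up to the full data size."""
    rng = random.Random(seed)
    for _ in range(n_subsets):
        # small subset drawn without replacement
        subset = rng.sample(data, min(subset_size, len(data)))
        for _ in range(n_boot):
            # bootstrap resample (with replacement) of the original size
            yield [subset[rng.randrange(len(subset))]
                   for _ in range(len(data))]

data = list(range(1000))
samples = list(blb_samples(data, n_subsets=2, subset_size=50, n_boot=3))
# 2 subsets x 3 bootstrap resamples = 6 samples, each of size 1000
```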

Profiling: The data profiling module performs data quality screening based on statistics and information summaries [ 59 , 60 , 61 ]. Since profiling is meant to discover data characteristics from data sources, it is considered a data assessment process that provides a first summary of the data quality, reported in the data profile. Such information includes, for example, the data format description, the attributes with their types and values, basic quality dimension evaluations, data constraints (if any), and data ranges (maximum and minimum, a set of specific values, or subsets).

More precisely, the information about the data is of two types: technical and functional. It can be extracted from the data itself without any additional representation, from metadata or a descriptive header file, or by parsing the data using analysis tools. This task may become very costly with Big Data. Therefore, to avoid the costs incurred by the data size, the same BLB-based sampling process is used: the data is reduced to a representative population sample, and the profiling results are combined. A data profile in the proposed framework is represented as a data quality profile of the first level ( DQP1 ), which is generated after the profiling phase. Moreover, data profiling provides useful information that leads to significant data quality rules, usually called data constraints. These rules are mostly equivalent to a structured-data schema, represented as technical and functional rules.
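A first-pass profile of the kind described here (attribute types, value ranges, missing counts) might be computed as in this sketch; the profile structure and names are hypothetical:

```python
def profile(records):
    """First-pass data profile: per attribute, collect the observed types,
    value range, and basic completeness counts (illustrative structure)."""
    attrs = {}
    for row in records:
        for name, value in row.items():
            p = attrs.setdefault(name, {"types": set(), "min": None,
                                        "max": None, "missing": 0, "count": 0})
            p["count"] += 1
            if value is None:
                p["missing"] += 1
                continue
            p["types"].add(type(value).__name__)
            if p["min"] is None or value < p["min"]:
                p["min"] = value
            if p["max"] is None or value > p["max"]:
                p["max"] = value
    return attrs

rows = [{"age": 34, "city": "Oslo"},
        {"age": None, "city": "Lyon"},
        {"age": 51, "city": "Kyiv"}]
dp = profile(rows)
# dp["age"]: min 34, max 51, 1 missing out of 3 observations
```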

According to [ 61 ], many activities and techniques are used to profile data, ranging from online, incremental, and structural to continuous profiling. Profiling tasks aim at discovering information about the data schema. Some data sources already come with data profiles, sometimes with minimal information. In the following, further techniques that can enrich and add value to a data profile are introduced:

Data provenance inquiry : it tracks the data origin and provides information about data transformations, data copying, and its related data quality through the data lifecycle [ 62 , 63 , 64 ].

Metadata : it provides descriptive and structural information about the data. Many data types, such as images, videos, and documents, use metadata to provide deep information about their contents. Metadata can be represented in many formats, including XML, or it can be extracted directly from the data itself without any additional representation.

Data parsing (supervised/manual/automatic) : data parsing is required since not all data comes with provenance or metadata describing it. The hardest way to gather extra information about the data is to parse it. Automatic parsing can be applied initially, then tuned and supervised manually by a data expert. This task may become very costly where Big Data is concerned, especially for unstructured data. Consequently, a data profile is generated to represent only the parts of the data that make sense; therefore, multiple data profiles for multiple data partitions must be taken into consideration.

Data profile : it is generated early in the Big Data Quality Project as DQP Level 0 (the data profile in its early form) and upgraded into a data quality profile within the data preparation component as DQP Level 1. It is then updated and extended through all the components of the Big Data Quality Management Framework until it reaches its final level. The final-level DQP is the profile applied to the data in the pre-processing phase, with its quality rules and related activities, to output pre-processed data that conforms to the quality requirements.

Exploratory quality profiling

Since a data-driven approach is followed, using a quantitative evaluation of quality dimensions from the data itself, two evaluation steps are adopted: Quantitative Quality Evaluation based on user requirements, and Exploratory Quality Profiling.

The exploratory quality profiling component is responsible for the automatic exploration of data quality dimensions without user intervention. The Quality Rules Proposals module, which produces a list of actions to elevate data quality, is based on some elementary DQDs that fit all varieties and data types.

A list of quality rule propositions, based on the quality evaluation of the most commonly considered DQDs (e.g., completeness, accuracy, and uniqueness), is produced. This preliminary assessment is performed on the data itself using predefined scenarios meant to increase data quality for some basic DQDs. In Fig. 7 , the steps involved in exploratory quality profiling for quality rules proposal generation are depicted. DQP1 is extended to DQP2 after adding the Data Quality Rules Proposal ( DQRP ), which is generated by the “quality rules proposals” process.

This module is part of the DQ profiling process: it varies the DQD tolerance levels from minimum to maximum scores and applies a systematic list of predefined quality rules. These predefined rules are sets of actions applied to the data when the measured DQD scores are not within the tolerance level defined by the minimum and maximum scores. The actions vary from deleting only attributes, to discarding only observations, to a combination of both. After these actions, re-evaluation of the new DQD scores and analysis of the results lead to a quality rules proposal (DQRP) with known DQD target scores. In Table 7 , some examples of these predefined rule scenarios for the DQD completeness ( dqd  =  Comp ), with an execution priority for each set of grouped actions, are described. The DQD levels are set to vary from a 5% to a 95% tolerance score with a granularity step of 5; they can be set differently according to the DQD choice and its sensitivity to the data model and domain. The selection of the best-proposed data quality rules is based on the KNN algorithm using Euclidean distance (Deng et al. 2016; [ 65 ]). It finds the closest quality rule parameters that achieve (by default) high completeness with low data reduction. The process can be refined by specifying other quality parameters.
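The predefined-scenario exploration above (vary the completeness tolerance, apply drop actions, re-evaluate, then pick the proposal nearest to the quality target by Euclidean distance) can be sketched as follows; the two-dimensional target and threshold sweep are simplifications of the paper's KNN-based selection, and all names are illustrative:

```python
def rule_proposals(records, attrs, levels=range(5, 100, 5)):
    """For each tolerance level (5%..95%, step 5), propose dropping the
    attributes whose missing ratio exceeds it, and report the resulting
    completeness and data-reduction ratio."""
    n = len(records)
    missing = {a: sum(1 for r in records if r.get(a) is None) / n
               for a in attrs}
    proposals = []
    for level in levels:
        tol = level / 100
        dropped = [a for a in attrs if missing[a] > tol]
        kept = [a for a in attrs if a not in dropped]
        cells = [r.get(a) for r in records for a in kept]
        comp = (sum(1 for v in cells if v is not None) / len(cells)
                if cells else 1.0)
        proposals.append({"tolerance": tol, "drop": dropped,
                          "completeness": comp,
                          "reduction": len(dropped) / len(attrs)})
    return proposals

def best_rule(proposals):
    """Nearest proposal (Euclidean distance) to the target of full
    completeness with zero data reduction."""
    return min(proposals, key=lambda p: ((p["completeness"] - 1.0) ** 2
                                         + p["reduction"] ** 2) ** 0.5)

rows = [{"a": i, "b": i if i % 2 == 0 else None} for i in range(10)]
best = best_rule(rule_proposals(rows, ["a", "b"]))
# keeping both attributes (completeness 0.75, no reduction) beats
# dropping "b" (completeness 1.0 but 50% attribute reduction)
```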

As noted above, the resulting quality rule proposals rest on the quality evaluation of the most commonly considered DQD’s (e.g., completeness, accuracy, and uniqueness), assessed from the data itself using predefined scenarios meant to increase data quality for some basic DQD’s. In Fig.  8 , the modules involved in exploratory quality profiling for quality rules proposal generation are illustrated.

Figure 8: Quality rules proposals with exploratory quality profiling

Quality mapping and selection

The quality mapping and selection module of the BDQM framework is responsible for mapping data features or attributes to DQD’s so as to target the pre-required quality evaluation scores. It generates a Data Quality Evaluation Scheme ( DQES ) and adds it to (updates) the DQP. The DQES contains the DQD’s of the appropriate attributes to be evaluated using adequate metric formulas. As part of the DQP, the DQES contains, for each of the selected data attributes, the following list, which is essential for the quantitative quality evaluation:

The attributes: all or a selected list,

The data quality dimensions (DQD’s) to be evaluated for each selected attribute,

Each DQD has a metric that returns the quality score, and

The quality requirement scores for each DQD needed in the score’s validation.

These requirements are general and target many global quality levels. The mapping component acts as a refinement of the global settings into precise quality goals. Therefore, a mapping must be performed between the data quality dimensions and the targeted data features/attributes before proceeding with the quality assessment. Each DQD is measured for each attribute and sample. The mapping generates a DQES , which contains Quality Evaluation Requests ( QER ) Q x . Each QER Q x targets a data quality dimension (DQD) for one attribute, all attributes, or a set of selected attributes, where x indexes the requests.

Quality mapping: Many approaches are available to accomplish an efficient mapping process, including automatic, interactive, manual, and quality-rules-proposal-based techniques:

Automatic : it aligns and compares the data attributes (from the DQP) with the data quality requirements (either per attribute type or per name). A set of DQDs is associated with each attribute for quality evaluation, resulting in a set of associations to be executed and evaluated in the quality assessment component.

Interactive : it relies on experts’ involvement to refine, amend, or confirm the previous automated associations.

Manual : it uses a dashboard similar to, but more advanced than, the one used to capture the quality requirements, with more detail at the attribute level.

Quality rules proposals : the proposal list collected from the DQP2 is used to understand the impact of a DQD level and the data reduction ratio. These quality insights help decide which DQD is best when compared to the quality requirements.

Quality selection (of DQDs, metrics, and attributes): It consists of selecting an appropriate quality metric to evaluate a data quality dimension for an attribute of a Big Data sample set; the metric returns a count of the correct values that comply with its formula. Each metric is computed over the attribute values against the DQD constraints. For example, accuracy can be defined as a count of correct attribute values within a certain range [v 1 , v 2 ]. Similarly, it can be defined to satisfy constraints related to the type of data, such as zip codes, emails, social security numbers, dates, or addresses.
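Per-value quality metrics of the kind described (a range check for accuracy, a pattern check for typed data such as zip codes) can be written as small functions returning 1 for a correct value and 0 otherwise; these definitions are illustrative:

```python
import re

def accuracy_in_range(v1, v2):
    """Metric: 1 if the value lies within [v1, v2], else 0."""
    return lambda value: 1 if value is not None and v1 <= value <= v2 else 0

def accuracy_matches(pattern):
    """Metric: 1 if the value matches a pattern, e.g. a 5-digit zip code."""
    rx = re.compile(pattern)
    return lambda value: 1 if value is not None and rx.fullmatch(str(value)) else 0

age_ok = accuracy_in_range(0, 120)
zip_ok = accuracy_matches(r"\d{5}")
scores = [age_ok(34), age_ok(150), zip_ok("90210"), zip_ok("9021A")]
# scores == [1, 0, 1, 0]
```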

Let us define the tuple DQES (S, D, A, M) . Most of this information is provided by the BDQP(DS , DS’ , Req) parameters, with Req  =  ( D , L, A ). The profiling information is used to select the appropriate quality metric \({{\varvec{m}}}_{{\varvec{l}}}\) to evaluate the data quality dimension \({{\varvec{q}}}_{{\varvec{l}}}\) for an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) with a weight \({{\varvec{w}}}_{{\varvec{j}}}\) . In addition to the previous settings, let us denote by S : S ( DS , N , n, R ) \(\to\) \({{\varvec{S}}}_{{\varvec{i}}}\) a sampling strategy.

Let us denote by M , a set of quality metrics \({\varvec{M}}=\left\{{{\varvec{m}}}_{1},..,{{\varvec{m}}}_{{\varvec{l}}},..,{{\varvec{m}}}_{{\varvec{d}}}\right\}\) where \({{\varvec{m}}}_{{\varvec{l}}}\) is a quality metric that measures and evaluates a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for each value of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) in the sample \({{\varvec{s}}}_{{\varvec{i}}}\) and returns 1, if correct, and 0, if not. Each \({{\varvec{m}}}_{{\varvec{l}}}\) metric is computed by checking whether the value of the attribute reflects the \({{\varvec{q}}}_{{\varvec{l}}}\) constraint. For example, the accuracy of an attribute may be defined as the value lying between 0 and 100; otherwise, the value is incorrect. If the same DQD \({{\varvec{q}}}_{{\varvec{l}}}\) is evaluated for a set of attributes, and if the weights are all equal, a simple mean is computed. The metric \({{\varvec{m}}}_{{\varvec{l}}}\) is evaluated to measure whether each attribute value is correct. This is performed for each instance (cell or row) of the sample \({{\varvec{s}}}_{{\varvec{i}}}\) .

Let us denote by \({{{\varvec{M}}}_{{\varvec{l}}}}^{\left(i\right)}, i=1,\dots ,{\varvec{N}}\) , the total for a metric \({{\varvec{m}}}_{{\varvec{l}}}\) , i.e., the count of observations that satisfy this metric for a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) over the N samples drawn from the dataset DS .

The proportion of observations under the adequacy rule in a sample \({{\varvec{s}}}_{{\varvec{i}}}\) of size n is given by: \({{{\varvec{M}}}_{{\varvec{l}}}}^{\left(i\right)}/{\varvec{n}}\)

The total proportion of observations under the adequacy rule for all samples is given by: \({{\varvec{M}}}_{{\varvec{l}}}=\frac{1}{{\varvec{N}}}\sum_{i=1}^{{\varvec{N}}}{{{\varvec{M}}}_{{\varvec{l}}}}^{\left(i\right)}/{\varvec{n}}\)

where \({{\varvec{M}}}_{{\varvec{l}}}\) characterizes the \({{\varvec{q}}}_{{\varvec{l}}}\) mean score for the whole dataset.

Let \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) represent a request for a quality evaluation, which results in the mean quality score of a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for a measurable attribute \({{\varvec{a}}}_{{\varvec{k}}}\) , calculated by M l . Big Data samples are evaluated for a DQD \({{\varvec{q}}}_{{\varvec{l}}}\) in each sample \({{\varvec{s}}}_{{\varvec{i}}}\) for an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) with a metric \({{\varvec{m}}}_{{\varvec{l}}}\) , providing a \({{\varvec{q}}}_{{\varvec{l}}}{{\varvec{s}}}_{{\varvec{i}}}\) score for each sample (described below in Quantitative Quality Evaluation ); the sample mean of \({{\varvec{q}}}_{{\varvec{l}}}\) is then the final score for \({{\varvec{a}}}_{{\varvec{k}}}\) .

Let us denote a process, which sorts and combines the requests of a quality evaluation (QER) by DQD or by an attribute, resulting in a re-arrangement of the \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) tuple into two types, depending on the evaluation selection group parameter:

Per DQD identified as \({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) where AList(a z ) represents the attributes \({{\varvec{a}}}_{{\varvec{z}}}\) ( z:1…R ) to be evaluated for the DQD \({{\varvec{q}}}_{{\varvec{l}}}\) .

Per attribute, identified as Q x (a k , DList( \({{\varvec{q}}}_{{\varvec{l}}}\) , m l )) , where DList( \({{\varvec{q}}}_{{\varvec{l}}}\) , m l ) represents the data quality dimensions \({{\varvec{q}}}_{{\varvec{l}}}\) ( l:1… d ) to be evaluated for the attribute \({{\varvec{a}}}_{{\varvec{k}}}\) .

In some cases, the type of combination is automatically selected for a certain DQD, such as consistency, when all the attributes are constrained towards specific conditions. The combination is either based on attributes or DQD’s, and the DQES will be constructed as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…) or.

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ,…,…)
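The two DQES arrangements above (requests combined per DQD or per attribute) can be produced by a simple grouping step; a sketch with hypothetical request tuples:

```python
from collections import defaultdict

def group_requests(requests, by="dqd"):
    """Re-arrange (attribute, dqd, metric) evaluation requests either
    per DQD -> (AList(a_z), q_l, m_l), or
    per attribute -> (a_k, DList(q_l, m_l))."""
    grouped = defaultdict(list)
    for attr, dqd, metric in requests:
        if by == "dqd":
            grouped[(dqd, metric)].append(attr)
        else:
            grouped[attr].append((dqd, metric))
    if by == "dqd":
        return [(attrs, dqd, metric)
                for (dqd, metric), attrs in grouped.items()]
    return list(grouped.items())

reqs = [("age", "completeness", "m_comp"),
        ("zip", "completeness", "m_comp"),
        ("age", "accuracy", "m_acc")]
per_dqd = group_requests(reqs, by="dqd")
per_attr = group_requests(reqs, by="attribute")
```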

The completion of the quality mapping process updates the DQP Level 2 with a DQES set as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…) , where x ranges from 1 to a defined number of evaluation requests. Each Q x element is a quality evaluation request of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) for a quality dimension \({{\varvec{q}}}_{{\varvec{l}}}\) , with a DQD metric m l .

The output of this phase generates a DQES score, which contains the mean score of each DQ dimension for one or many attributes. The mapping and selection data flow, initiated using the Big Data quality project (BDQP) settings, is illustrated in Fig.  9 . This is accomplished either using the same BDQP Req or by defining more detailed and refined quality parameters and a sampling strategy. Two types of DQES can be yielded:

Data Quality Dimension-wise evaluation of a list of attributes or

Attribute-wise evaluation of many DQD’s.

As described before, the quality mapping and selection component generates a DQES evaluation scheme for the dataset, identifying which DQD and attribute tuples to evaluate using a specific quality metric. A more detailed and refined set of parameters can also be set, as described in previous sections. In the following, the steps that construct the DQES in the mapping component are depicted:

The QMS function extracts the Req parameters from BDQP as (D, L, A) .

A quality evaluation request \(\left({a}_{k},{q}_{l},{m}_{l}\right)\) , is generated from the (D, A) tuple.

A list is constructed with these quality evaluation requests.

A list sorting is performed either by DQD or by Attributes producing two types of lists:

A combination of requests per DQD generates quality requests for a set of attributes \(\left(AList\left({a}_{z}\right),{q}_{l},{m}_{l}\right)\) .

A combination of requests per attribute generates quality requests for a set of DQD’s \(\left({a}_{k},DList({q}_{l},{m}_{l})\right)\) .

A DQES is returned based on the evaluation selection group parameter (per DQD, per attribute).

Figure 9: DQES parameters settings

Quantitative quality evaluation

The authors in [ 66 ] addressed how to evaluate a set of DQDs over a set of attributes. According to this study, the evaluation of Big Data quality is applied and iterated over many samples. The aggregation and combination of DQD scores are performed after each iteration. The evaluation scores are added to the DQES, which in turn updates the DQP. We proposed an algorithm that computes the quality scores for a dataset based on a given quality mapping and quality metrics.

This algorithm is based on quality metrics evaluation using scores: the scores are collected and validated against the quality requirements, and quality rules are generated from them [ 66 , 67 ]. There are rules related to each pre-processing activity, such as data cleaning rules, which eliminate data, and data enrichment rules, which replace or add data. Other activities, such as data reduction, reduce the data size by decreasing the number of features or attributes that have certain characteristics, such as low variance or high correlation with other features.

In this phase, all the information collected from the previous components (profiling, mapping, DQES) is included in the data quality profile Level 3. The important elements are the set of samples and the data quality evaluation scheme, which is executed on each sample to evaluate its attributes for the specified DQDs.

DQP Level 3 provides all the information needed about the settings represented by the DQES to proceed with the quality evaluation. The DQES contains the following:

The selected DQDs and their related metrics.

The selected attributes with the DQD to be evaluated.

The DQD selection, which is based on the Big Data quality requirements expressed early when initiating a Big Data Quality Project.

Attributes selection is set in the quality selection mapping component (3).

The quantitative quality evaluation methodology is described as follows:

The selected DQD quality metrics measure and evaluate the DQD for each attribute observation in each sample of the sample set. For each attribute observation, the metric returns 1, if correct, or 0, if incorrect.

Each metric is computed over all sample observations, checking whether the attribute values reflect the constraints. For example, the accuracy metric of an attribute may define that values between 20 and 70 are valid and all others invalid. The count of correct values out of the total sample observations is the DQD ratio, represented as a percentage (%). This is performed for all selected attributes and their selected DQDs.

The sample mean from all samples for each evaluated DQD represents a Data Quality Score (DQS) estimation \(\left(\overline{DQS }\right)\) of a data quality dimension of the data source.

DQP Level 4 : an update to the DQP level 3 includes a data quality evaluation scheme (DQES) with the quality scores per DQD and per attribute ( DQES  +  Scores ).

In summary, the quantitative quality evaluation proceeds through sampling, DQD and DQD-metric selection, mapping with data attributes, quality measurement, and the sample mean of the DQD ratios.
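The evaluation loop summarized above (a per-sample DQD ratio, then the sample mean as the DQS estimate) can be sketched as follows; the function names are illustrative:

```python
def dqd_score(sample, attr, metric):
    """Per-sample DQD ratio: share of rows whose attribute value
    satisfies the metric (the metric returns 1 or 0 per value)."""
    vals = [metric(row.get(attr)) for row in sample]
    return sum(vals) / len(vals)

def estimate_dqs(samples, attr, metric):
    """Sample mean of the per-sample DQD ratios: the DQS estimate."""
    scores = [dqd_score(s, attr, metric) for s in samples]
    return sum(scores) / len(scores)

valid_age = lambda v: 1 if v is not None and 20 <= v <= 70 else 0
s1 = [{"age": 25}, {"age": 80}]   # ratio 0.5
s2 = [{"age": 30}, {"age": 40}]   # ratio 1.0
dqs = estimate_dqs([s1, s2], "age", valid_age)
# dqs == 0.75, i.e. an estimated 75% accuracy for "age"
```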

Let us denote by \({{\varvec{Q}}}_{{\varvec{x}}}\) Score (quality score) the evaluation result of each quality evaluation request \({{\varvec{Q}}}_{{\varvec{x}}}\) in the DQES . Depending on the evaluation type, two kinds of DQES, and hence two kinds of result scores, can be identified: organized per DQD over all attributes, or per attribute over all DQD’s:

\({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\to\) \({{\varvec{Q}}}_{{\varvec{x}}}\) ScoreList \(\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) or.

\({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) \(\to\) Q x ScoreList \(\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right)\right)\)

where \({\varvec{z}}=1,\dots ,{\varvec{r}},\boldsymbol{ }{\varvec{r}}\) is the number of selected attributes, and \({\varvec{l}}=1,\dots ,{\varvec{d}},\) \({\varvec{d}}\) is the number of selected DQD’s.

The quality evaluation generates quality scores \({{\varvec{Q}}}_{{\varvec{x}}}\) Score . A quality scoring model is used to assess these results; it is provided in the form of quality requirements used to interpret the resulting scores, which are expressed as quality acceptance level percentages. These quality requirements might be a set of values, an interval in which values are accepted or rejected, or a single score ratio percentage. The analysis of these scores against the quality requirements leads to the discovery and generation of quality rules for attributes violating the requirements.

The quantitative quality evaluation process follows the steps described below for the evaluation of a list of DQD’s over several attributes ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ):

N samples (of size n ) are generated from the dataset DS using a BLB-based bootstrap sampling approach.

For each sample \({{\varvec{s}}}_{{\varvec{i}}}\) generated in step 1, and for each selected attribute \({{\varvec{a}}}_{{\varvec{z}}}\) ( \({\varvec{z}}=1,\dots ,{\varvec{r}}\) ) in the DQES, evaluate all the DQD’s in the DList using their related metrics to obtain Q x ScoreList \(\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right),{{\varvec{s}}}_{{\varvec{i}}}\right)\) for each sample \({{\varvec{s}}}_{{\varvec{i}}}\) .

From all the sample scores, evaluate the sample mean over the N samples for each attribute \({{\varvec{a}}}_{{\varvec{z}}}\) of the \({{\varvec{q}}}_{{\varvec{l}}}\) evaluation scores, denoted \({\overline{{\varvec{q}}} }_{{\varvec{z}}{\varvec{l}}}.\)

For the dataset DS , evaluate the quality score mean \({\overline{{\varvec{q}}} }_{{\varvec{l}}}\) for each DQD over all attributes \({{\varvec{a}}}_{{\varvec{z}}}\) as the (equally weighted) mean of the attribute scores: \({\overline{{\varvec{q}}} }_{{\varvec{l}}}=\frac{1}{{\varvec{r}}}\sum_{z=1}^{{\varvec{r}}}{\overline{{\varvec{q}}} }_{{\varvec{z}}{\varvec{l}}}\)

The illustration in Fig.  10 shows that \({{\varvec{q}}}_{{\varvec{z}}{\varvec{l}}}{{\varvec{s}}}_{{\varvec{i}}}{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\) is the evaluation of DQD \({{\varvec{q}}}_{{\varvec{l}}}\) for the sample \({{\varvec{s}}}_{{\varvec{i}}}\) and an attribute \({{\varvec{a}}}_{{\varvec{z}}}\) with a metric m l , and that \({\overline{{\varvec{q}}} }_{{\varvec{z}}{\varvec{l}}}\) represents the quality score sample mean for the attribute \({{\varvec{a}}}_{{\varvec{z}}}\) .

Figure 10: Big data sampling and quantitative quality evaluation

Quality control

Quality control is initiated when the quality evaluation results are available and reported in the DQES of the DQP Level 4. During quality control, all quality scores are checked against the quality requirements of the Big Data project. If any anomalies or non-conformances are detected, the quality control component forwards a DQP Level 5 to the data quality rules discovery component.

At this point, several cases arise. An iteration process is performed until the required quality levels are satisfied, or the experts decide to stop the quality evaluation process and re-evaluate their requirements. Each phase embeds a form of quality control within its quality process, even when not explicitly specified.

The quality control acts in the following cases:

Case 1: This case applies when the quality is estimated and no rules are yet included in the DQP Level 4 (the DQP is considered a report, since the data quality is still being inspected, and only reports are generated with no actions yet performed).

If the quality scores are accepted, no quality actions need to be applied to the data. The DQP Level 4 remains unchanged and acts as a full data quality report, updated with a positive validation of the data against each quality requirement. It might still include some simple pre-processing such as attribute selection and filtering. According to the data analytics requirements and the expected results planned in the Big Data project, more specific pre-processing actions may be performed, but they are not quality-related in this case.

If the quality scores are not accepted, the DQES scores in the DQP Level 4 are analyzed, and the DQP is updated with a quality error report covering the failing DQD scores and their data attributes. A DQP Level 5 is created and analyzed by the quality rules discovery component to determine the pre-processing activities to be executed on the data.

Case 2: In the presence of a DQP Level 6 that contains a quality evaluation request of the pre-processed samples with discovered quality rules, the following situations may occur:

When the quality control checks that the DQP Level 6 rules are valid and satisfy the quality requirements, the DQP Level 6 is updated to DQP Level 7 and confirmed as the final data quality profile, which will be applied to the data in the pre-processing phase. A DQP Level 7 is significant in that it contains validated quality rules.

When the quality control is not, or only partially, satisfied, the DQP Level 6 is sent back for adaptation to the quality selection and mapping component with the valid and invalid quality rules, quality scores, and error reports. These reports flag, with an unacceptable score interval, the quality rules that did not satisfy the quality requirements. The quality selection and mapping component provides automatic or manual analysis and assessment of the unsatisfied quality rules with respect to their targeted DQDs, attributes, and quality requirements. An adaptation of the quality requirements is needed to re-validate these rules. Finally, the expert users have the final word on whether to continue or stop the process and proceed to the pre-processing phase with the valid rules. As part of the framework's reuse specification, the invalid rules are kept within the DQP for future re-evaluation.

Case 3: The control component always proceeds based on the quality scores and quality requirements for both input and pre-processed data. Continuous control and monitoring are responsible for initiating DQP updates and adaptation if the quality requirements are relaxed.
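The control decisions described in the cases above can be summarized as a small decision function; this is a sketch under the assumption that scores and requirements are dictionaries of DQD ratios (the level names follow the text, the function and parameter names are illustrative):

```python
def quality_control(scores, requirements, has_rules=False):
    """Sketch of the control logic: compare DQD scores in the DQES against the
    required thresholds and decide the next DQP transition."""
    failed = {d: s for d, s in scores.items() if s < requirements.get(d, 0.0)}
    if not failed:
        # accepted scores: the profile acts as a quality report (Case 1), or the
        # Level 6 rules are confirmed final as Level 7 (Case 2)
        return ("DQP Level 7" if has_rules else "DQP Level 4 (report)", failed)
    # non-conformance: forward to rules discovery (Level 5), or send the
    # Level 6 profile back for adaptation
    return ("adapt Level 6" if has_rules else "DQP Level 5", failed)

# completeness fails its requirement, accuracy passes
status, failed = quality_control({"completeness": 0.5, "accuracy": 0.9},
                                 {"completeness": 0.8, "accuracy": 0.8})
```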

Quality rules, discovery, validation, optimization, and execution

In [67], it was reported that if the DQD scores do not conform to the quality requirements, the failed scores are used to discover data quality rules. When executed on the data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. Each activity has a set of functions targeting different types of data in order to increase its DQD ratio and the overall quality of the data source or dataset(s).

When quality rules (QR) are applied to a sample set S, a pre-processed sample set S' is generated. A quality evaluation process is invoked on S', generating DQD scores for S'. A score comparison between S and S' is then conducted to retain only qualified, valid rules with a high success rate on the data. An optimization scheme is then applied to the list of valid quality rules before their application to production data. The predefined optimization schemes range from (1) rules priority to (2) rules redundancy, (3) rules removal, (4) rules grouping per attribute, (5) per DQD, or (6) per duplicate rules.

Quality rules discovery: The discovery is based on the DQP Level 5 from the quality control component. An analysis of the quality scores is initiated, and an error report is extracted. If the DQD scores do not conform to the quality requirements, the failed scores are used to discover data quality rules. When executed on the data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. The discovery component comprises several modules: analysis of the DQES DQD scores against the requirements, combination of attribute pre-processing activities for each targeted DQD, and rules generation.

For example, an attribute with a 50% missing-data score is not accepted against a required score of 20% or less. This initiates the generation of a quality rule, which consists of a data cleansing activity for the observations that do not satisfy the quality requirement. The data cleansing or data enrichment activity is selected from the Big Data quality profile repository. The quality rule targets all the related attributes marked for pre-processing to reduce the missing-data ratio from 50% to 20% for the completeness DQD. Moreover, in the case of completeness, cleansing is not the only option for missing values: several alternative pre-processing activities are available, such as missing-value replacement, which offers functions for several replacement methods like the mean, mode, and median.
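The completeness example above can be sketched as a rule-discovery step; this is illustrative only, assuming numeric attributes and the standard-library statistics helpers (the rule structure is our invention, not the framework's schema):

```python
from statistics import mean, median, mode

def discover_completeness_rule(values, required_ratio, method="mean"):
    """Sketch of rule discovery for the completeness DQD: if the missing-data
    ratio exceeds the requirement (e.g. 50% observed vs 20% required), emit a
    cleansing rule that imputes missing values with the mean, mode, or median."""
    missing_ratio = sum(v is None for v in values) / len(values)
    if missing_ratio <= required_ratio:
        return None                                   # requirement met, no rule
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[method](present)
    return {"activity": "missing_value_replacement", "method": method, "fill": fill}

def apply_rule(values, rule):
    """Execute the imputation rule on a column."""
    return [rule["fill"] if v is None else v for v in values]

col = [10, None, 30, None, 20, None]                  # 50% missing
rule = discover_completeness_rule(col, required_ratio=0.2)
cleaned = apply_rule(col, rule)
```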

The pre-processing activities are provided by the repository to achieve the required data quality. Several options are available for selecting pre-processing activities:

Automatic, by discovering and suggesting a set of activities or DQ rules.

Predefined, by selecting ready-to-use quality rule proposals from the exploratory quality profiling component, or predefined pre-processing activity functions from the repository, indexed by DQDs.

Manual, giving the expert the ability to query the exploratory quality profiling results for the best rules achieving the required quality, using KNN-based filtering.

Quality rules validation: The quality rules generated by the discovery component are set in the DQP Level 6. The rules validation process starts when the DQR list is applied to the sample set S, resulting in a pre-processed sample set S' generated by the related pre-processing activities. A quality evaluation process is then invoked on S', generating DQD scores for S'. A score comparison between S and S' is conducted to retain only qualified, valid rules with a high success rate on the data. After analyzing these scores, two sets of rules are identified: successful and failed rules.
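The S versus S' comparison described above can be sketched as follows; this is a minimal illustration with an invented completeness metric and rule shape, not the framework's actual API:

```python
def validate_rules(sample, rules, metric, apply_rule):
    """Sketch of rules validation: apply each candidate rule to the sample S,
    re-evaluate the DQD metric on the pre-processed sample S', and keep only
    the rules that improve the score."""
    base_score = metric(sample)
    valid, failed = [], []
    for rule in rules:
        improved_score = metric(apply_rule(sample, rule))
        (valid if improved_score > base_score else failed).append(rule)
    return valid, failed

# illustrative metric and rule application: completeness plus value imputation
completeness = lambda vals: sum(v is not None for v in vals) / len(vals)
apply_fill = lambda vals, r: [r["fill"] if v is None else v for v in vals]

fill_zero = {"fill": 0}      # fixes the gap, should validate
no_op = {"fill": None}       # changes nothing, should fail validation
valid, failed = validate_rules([1, None, 3], [fill_zero, no_op],
                               completeness, apply_fill)
```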

Quality rules optimization: After the set of discovered valid quality rules is selected, an optimization process is activated to reorganize and filter the rules. This is due to the nature of the evaluation parameters set in the mapping component and the refinement of the quality requirements. Together with the rules validation process, these choices produce a list of individual quality rules that, if applied as generated, might have the following consequences:

Redundant rules.

Ineffective rules due to the order of execution.

Multiple rules targeting the same DQD with the same requirements.

Multiple rules targeting the same attributes for the same DQD and requirements.

Rules that drop attributes or rows must be applied first or given a higher priority, to avoid applying rules to data items that are meant to be dropped (Table 8).

The quality rules optimization component applies an optimization scheme to the list of valid quality rules before their application to production data in the pre-processing phase. The predefined optimization schemes vary according to the following:

Rules execution priority per attribute or DQD, per pre-processing activity, or per pre-processing function.

Rules redundancy removal per attribute or DQD.

Rules grouping or combination per activity, per attribute, per DQD, or per duplicates.

For invalid rules, the component performs several actions, including rules removal or rules adaptation from previously generated proposals in the exploratory quality profiling component for the same targeted tuple (attributes, DQDs).
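The optimization schemes listed above (redundancy removal and drop-first priority) can be sketched as a single pass over the rule list; the rule fields are illustrative assumptions:

```python
def optimize_rules(rules):
    """Sketch of rule optimization: remove duplicate rules, then order the
    remainder so that attribute/row-dropping rules execute first, avoiding
    work on data items that are meant to be dropped."""
    seen, unique = set(), []
    for r in rules:                                   # redundancy removal
        key = (r["activity"], r["attribute"], r["dqd"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # drop-rules get the highest priority; remaining rules grouped by attribute
    return sorted(unique, key=lambda r: (r["activity"] != "drop", r["attribute"]))

rules = [
    {"activity": "impute", "attribute": "age", "dqd": "completeness"},
    {"activity": "drop",   "attribute": "tmp", "dqd": "completeness"},
    {"activity": "impute", "attribute": "age", "dqd": "completeness"},  # duplicate
]
plan = optimize_rules(rules)
```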

Quality rules execution: The quality rules execution consists of pre-processing the data using the DQP, which embeds the data quality rules that enhance the quality to reach the agreed requirements. As part of the monitoring module, a sample set from the pre-processed data is used to re-assess the quality and detect eventual failures.

Quality monitoring

Quality monitoring is a continuous quality control process that relies on the DQP. The purpose of monitoring is to validate the DQP across all the Big Data lifecycle processes. The QP repository is updated during and after the complete lifecycle, as well as after user feedback on the data, quality requirements, and mapping.

As illustrated in Fig. 11, the monitoring process takes a scheduled snapshot of the pre-processed Big Data all along the BDQMF for the BDQ project. This data snapshot is a set of samples whose quality is evaluated in BDQMF component (4). Quality control is then conducted on the quality scores, and the DQP is updated. The quality report may highlight a quality failure and the evolution of its ratio across multiple sampling snapshots of the data.

Figure 11: Quality monitoring component

The monitoring process strengthens and enforces quality across the Big Data value chain using the BDQM framework while reusing the data quality profile information. For each quality monitoring iteration on the datasets from the data source, quality reports are added to the data quality profile, updating it to a DQP Level 10.
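A single monitoring iteration, as described above, might look like the following sketch (the snapshot, metric, and report structures are illustrative assumptions, not the framework's implementation):

```python
def monitor(snapshot, metric, requirement, history):
    """Sketch of one monitoring iteration: evaluate a scheduled sample snapshot
    of the pre-processed data, append a quality report to the profile history,
    and flag a failure when the score drifts below the requirement."""
    score = metric(snapshot)
    report = {"score": score, "ok": score >= requirement}
    history.append(report)          # the DQP accumulates reports over iterations
    return report

# illustrative completeness metric over two scheduled snapshots
completeness = lambda vals: sum(v is not None for v in vals) / len(vals)
history = []
monitor([1, 2, 3, None], completeness, 0.9, history)   # failing snapshot
monitor([1, 2, 3, 4], completeness, 0.9, history)      # passing snapshot
```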

Data processing, analytics, and visualization

This process involves the application of algorithms or methodologies that extract insights from the ready-to-use data with enhanced quality. The value of the processed data is then projected visually, as dashboards and graphically enhanced charts, for decision-makers to make economically sound decisions. Big Data visualization approaches are of high importance for the final exploitation of the data.

Implementations: Dataflow and quality processes development

In this section, we give an overview of the dataflow across the various processes of the framework, highlight the implemented quality management processes along with the application interfaces developed to support the main processes, and finally describe the ongoing process implementations and evaluations.

Framework dataflow

In Fig.  12 , we illustrate the whole process flow of the framework, from the inception of the quality project in its specification and requirements to the quality monitoring phase. As an ongoing process, monitoring is a part of the quality enforcement loop and may trigger other processes that handle several quality profile operations like DQP adaptation, upgrade, or reuse.

Figure 12: Big data quality management framework data flow

In Table 9, we enumerate and detail the multiple processes and their interactions within the BDQM framework components, including their inputs and outputs after executing the related activities on the quality profile (DQP), as detailed in the previous section.

Quality management processes’ implementation

In this section, we describe the implementation of our framework's key components and processes and their contributions to the quality management of Big Data across its lifecycle.

Core processes implementation

As noted above, the core framework processes have been implemented and evaluated. In the following, we describe how these components were implemented and evaluated.

Quality profiling: one of the central components of our framework is the data quality profile (DQP). Initially, the DQP implements a simple data profile of a Big Data set as an XML file (a DQP sample is illustrated in Fig. 13).

Figure 13: Example of data quality profile

After traversing the processes of several framework components, it is updated into a full data quality profile. The data quality evaluation process is one of the activities that updates the DQP with quality scores, which are later used to discover data quality rules. These rules, when applied to the original data, ensure an output data set of higher quality. The DQP is finally executed by the pre-processing component. By the end of the lifecycle, the DQP contains all the relevant information: data quality rules that target a set of data sources with multiple datasets, data attributes, data quality dimensions such as accuracy, and pre-processing activities like data cleansing, data integration, and data normalization. In short, the DQP holds all the information about the data, its quality, the user quality requirements, DQDs, quality levels, attributes, the data quality evaluation scheme (DQES), quality scores, and the data quality rules. The DQP is stored in the DQP repository, which performs many tasks related to the DQP.
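A minimal, hypothetical DQP serialized as XML can illustrate how such a profile is parsed and checked against its requirements; the element and attribute names below are our assumptions for illustration, not the paper's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical DQP mirroring the pieces the text lists: data source, DQES
# scores, quality requirements, and quality rules.
dqp_xml = """
<dqp level="4">
  <source name="sales.csv"/>
  <dqes>
    <score attribute="age" dqd="completeness" metric="non_missing_ratio">0.75</score>
  </dqes>
  <requirements>
    <requirement dqd="completeness" min="0.80"/>
  </requirements>
  <rules>
    <rule activity="missing_value_replacement" attribute="age" method="mean"/>
  </rules>
</dqp>
"""

root = ET.fromstring(dqp_xml)
score = float(root.find("./dqes/score").text)
required = float(root.find("./requirements/requirement").get("min"))
needs_rules = score < required        # triggers the Level 5 rules-discovery path
```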

Quality requirement dashboard: developed as a web-based application, shown in Fig. 14 below, to capture user requirements and other quality information, such as the specification of data quality dimension requirements. This application can be extended with extra information about data sources, such as attributes and their types. The user is guided through the interface to specify the right attribute values and is also given the option to upload an XML file containing the relationships between attributes. The recorded requirements are finally saved to a DQP Level 0, which is used in the next stage of the quality management process.

Figure 14: Quality requirements dashboard

Data preparation and sampling: The framework's operation starts when the quality project's minimal specifications are set. It initiates and provides a data quality summary, named the data quality profile (DQP), by running an exploratory quality profiling assessment on data samples (using the BLB sampling algorithm). The DQP is designed to be the core component of the framework, and every update and result regarding quality is recorded in it. The DQP is stored in a quality repository and registered in the Big Data's provenance to keep track of data changes due to quality enhancements.

Data quality mapping and rule discovery components: data quality mapping simplifies and strengthens quality control across the whole data quality assessment process. The implemented mapping links and categorizes all the elements required by the quality project, from Big Data quality characteristics, pre-processing activities, and their related technique functions, to data quality rules, dimensions, and their metrics. The implemented discovery of data quality rules from evaluation results reveals the actions and transformations that, when applied to the data set, will achieve the targeted quality level. These rules are the main ingredients of the pre-processing activities. The role of a DQ rule is to address the sources of poor quality by defining a list of actions related to each quality score. The DQ rules are the result of systematic, planned data quality assessment analysis.

Quality profile repository (QPREPO): Finally, our framework implements the QPREPO to manage the data quality profiles for different data types and domains and to adapt or optimize existing profiles. This repository manages the data quality dimensions with their related metrics, as well as the pre-processing activities and their activity functions. A QPREPO entry is created for each Big Data quality project, with the related DQP containing information about each dataset, data source, data domain, and data user. This information is essential for DQP reuse, adaptation, and enhancement for the same or different data sources.

Implemented approaches for quality assessment

The framework uses various approaches for quality assessment: (1) exploratory quality profiling and (2) a quantitative quality assessment approach using DQD metrics, with a planned new component for (3) a qualitative quality assessment.

Exploratory quality profiling implements an automatic quality evaluation performed systematically on all data attributes for basic DQDs. The resulting scores are used to generate quality rules for each quality tolerance ratio variation. These rules are then applied to other data samples, and the quality is re-assessed. An analysis of the results provides an interactive quality-based rules search using several ranking algorithms (maximization, minimization, applying weights).
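The ranking step just described (maximization, minimization, weighting) can be sketched as a weighted scoring of candidate rules; the data structures are illustrative assumptions:

```python
def rank_rules(candidates, weights, minimize=()):
    """Sketch of the interactive rules search: rank candidate quality rules by
    a weighted sum of the DQD scores they achieve, optionally minimizing some
    dimensions (e.g. the data loss a rule causes)."""
    def value(rule):
        total = 0.0
        for dqd, w in weights.items():
            s = rule["scores"][dqd]
            total += w * ((1 - s) if dqd in minimize else s)
        return total
    return sorted(candidates, key=value, reverse=True)

# two hypothetical rules: imputation preserves data, row-dropping loses 30%
candidates = [
    {"name": "impute_mean", "scores": {"completeness": 0.95, "data_loss": 0.0}},
    {"name": "drop_rows",   "scores": {"completeness": 1.0,  "data_loss": 0.3}},
]
best = rank_rules(candidates, {"completeness": 1.0, "data_loss": 1.0},
                  minimize={"data_loss"})[0]
```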

The quantitative quality assessment implements a fast data quality evaluation strategy supported by the sampling and profiling processes for Big Data. The evaluation is conducted by measuring the data quality dimensions (DQDs) of attributes using specific metrics to calculate a quality score.

The qualitative quality assessment approach implements a deeper quality assessment to discover hidden quality aspects and their impact on the outputs of the Big Data lifecycle. These quality aspects must be quantified into scores and mapped to the related attributes and DQDs. This quantification is achieved by applying several feature selection strategies and algorithms to data samples. The resulting qualitative insights are combined with those obtained earlier from the quantitative quality evaluation in the quality management process.

Framework development, deployment, and evaluation

The development, deployment, and evaluation of our BDQMF framework follow a systematic, modular approach in which the various components of the framework are developed and tested independently and then integrated with the other components to compose the integrated solution. Most of the components are implemented in R and Python using the SparkR and PySpark libraries, respectively. The supporting files, such as the DQP, DQES, and configuration files, are written in XML and JSON formats. Big Data quality project requests and constraints, including the data sources and the quality expectations, are implemented within the solution, where more than one module might be involved. The BDQMF components are deployed following the Apache Hadoop and Spark ecosystem architecture.

The implemented BDQMF modules and developed APIs are described in the following:

Quality settings mapper (QSM): it implements an interface for the automatic selection and mapping of DQDs and dataset attributes from the initial DQP.

Quality settings parser (QSP): responsible for parsing and loading parameters into the execution environment, from DQP settings to data files. It is also used to extract quality rules and scores from the DQES in the DQP.

Data loader (DL): implements the filtering, selection, and loading of all the types of data files required by the BDQMF, including datasets from data sources, into the Spark environment (e.g., DataFrames, tables); the loaded data is either used directly by various processes or persisted in the database for further reuse. For data selection, the loader uses SQL to retrieve only the attributes set in the DQP settings.

Data samples generator (DSG): it generates data samples from multiple data sources.

Quality inspector and profiler (QIP): it is responsible for all qualitative and quantitative quality evaluations of data samples across all the BDQMF lifecycle phases. The inspector assesses all the default and required DQDs, and all quality evaluations are recorded in the DQES within the DQP file.

Pre-processing activities and functions execution engine (PPAF-E): all the repository pre-processing activities, along with their related functions, are implemented as APIs in Python and R. When requested, this library loads the necessary methods and executes them within the pre-processing activities for rules validation and rules execution in phase 9.

Quality rules manager (QRM): it is one of the most important modules of the framework. It implements and delivers the following features:

Analyzes quality results.

Discovers and generates quality rule proposals.

Validates quality rules against the requirements settings.

Refines and optimizes quality rules.

Performs ACID operations on quality rules in the DQP files and the repository.

Quality monitor (QM): it is responsible for monitoring, triggering, and reporting any quality change across the Big Data lifecycle, to ensure the efficiency of the quality improvements delivered by the discovered data quality rules.

BDQMF-Repo: the repository where all the quality-related files, settings, requirements, and results are stored. The repo uses HBase or MongoDB to meet the requirements of Big Data ecosystem environments and the scalability needed for intensive data updates.
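The data loader's SQL-based attribute selection described above can be illustrated with SQLite standing in for the Spark SQL environment; the table and column names are hypothetical:

```python
import sqlite3

def load_selected_attributes(conn, table, dqp_attributes):
    """Sketch of the loader's selection step: build the SQL projection from the
    attributes named in the DQP settings and retrieve only those columns.
    (Identifiers come from a trusted profile here; production code runs on Spark.)"""
    cols = ", ".join(dqp_attributes)
    return conn.execute(f"SELECT {cols} FROM {table}").fetchall()

# hypothetical source table; only "id" and "age" are set in the DQP settings
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, age INTEGER, note TEXT)")
conn.execute("INSERT INTO sales VALUES (1, 30, 'x'), (2, NULL, 'y')")
rows = load_selected_attributes(conn, "sales", ["id", "age"])
```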

Conclusion

Big Data quality has attracted the attention of researchers, as it is considered the key differentiator leading to high-quality insights and data-driven decisions. In this paper, a Big Data quality management framework addressing end-to-end quality in the Big Data lifecycle was proposed. The framework is based on a data quality profile, which is augmented with valuable information as it travels across the different stages of the framework, starting from the Big Data project parameters, quality requirements, quality profiling, and quality rule proposals. The exploratory quality profiling feature, which extracts quality information from the data, helps build a robust DQP with quality rule proposals and is a step toward the configuration of the data quality evaluation scheme. Moreover, the extracted quality rule proposals are of high benefit to the quality dimension mapping and attribute selection component. This supports users with quality data indicators characterized by their profile.

The framework dataflow shows that the quality of any Big Data set is evaluated through the exploratory quality profiling component and the quality rules extraction and validation, leading to an improvement of its quality. It is of great importance to ensure the right selection of a combination of targeted DQD levels, observations (rows), and attributes (columns) for efficient quality results, while not sacrificing vital data by considering only one DQD. The resulting quality profile, based on the quality assessment results, confirms that the contained quality information significantly improves the quality of Big Data.

In future work, we plan to extend the quantitative quality profiling with qualitative evaluation. We also plan to extend the framework to cope with unstructured Big Data quality assessment.

Availability of data and materials

The data used in this work is available from the first author and can be provided upon request. The data includes sampling data, pre-processed data, etc.

Chen M, Mao S, Liu Y. Big data: A survey. Mobile Netw Appl. 2014;19:171–209. https://doi.org/10.1007/s11036-013-0489-0 .


Chiang F, Miller RJ. Discovering data quality rules. Proceed VLDB Endowment. 2008;1:1166–77.

Yeh, P.Z., Puri, C.A., 2010. An Efficient and Robust Approach for Discovering Data Quality Rules, in: 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI). Presented at the 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 248–255. https://doi.org/10.1109/ICTAI.2010.43

Ciancarini, P., Poggi, F., Russo, D., 2016. Big Data Quality: A Roadmap for Open Data, in: 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService). Presented at the 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), pp. 210–215. https://doi.org/10.1109/BigDataService.2016.37

Firmani D, Mecella M, Scannapieco M, Batini C. On the meaningfulness of “big data quality” (Invited Paper). Data Sci Eng. 2016;1:6–20. https://doi.org/10.1007/s41019-015-0004-7 .

Rivas, B., Merino, J., Serrano, M., Caballero, I., Piattini, M., 2015. I8K|DQ-BigData: I8K Architecture Extension for Data Quality in Big Data, in: Advances in Conceptual Modeling, Lecture Notes in Computer Science. Presented at the International Conference on Conceptual Modeling, Springer, Cham, pp. 164–172. https://doi.org/10.1007/978-3-319-25747-1_17

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H., 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute 1–137.

Chen CP, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inf Sci. 2014;275:314–47.

Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Ullah Khan S. The rise of “big data” on cloud computing: Review and open research issues. Inf Syst. 2015;47:98–115. https://doi.org/10.1016/j.is.2014.07.006 .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87. https://doi.org/10.1109/ACCESS.2014.2332453 .

Wielki J. The Opportunities and Challenges Connected with Implementation of the Big Data Concept. In: Mach-Król M, Olszak CM, Pełech-Pilichowski T, editors. Advances in ICT for Business. Springer International Publishing: Industry and Public Sector, Studies in Computational Intelligence; 2015. p. 171–89.


Ali-ud-din Khan, M., Uddin, M.F., Gupta, N., 2014. Seven V’s of Big Data understanding Big Data to extract value, in: American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of The. Presented at the American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of the, pp. 1–5. https://doi.org/10.1109/ASEEZone1.2014.6820689

Kepner, J., Gadepally, V., Michaleas, P., Schear, N., Varia, M., Yerukhimovich, A., Cunningham, R.K., 2014. Computing on masked data: a high performance method for improving big data veracity, in: 2014 IEEE High Performance Extreme Computing Conference (HPEC). Presented at the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. https://doi.org/10.1109/HPEC.2014.7040946

Saha, B., Srivastava, D., 2014. Data quality: The other face of Big Data, in: 2014 IEEE 30th International Conference on Data Engineering (ICDE). Presented at the 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 1294–1297. https://doi.org/10.1109/ICDE.2014.6816764

Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manage. 2015;35:137–44.

Pääkkönen P, Pakkala D. Reference architecture and classification of technologies, products and services for big data systems. Big Data Research. 2015;2:166–86. https://doi.org/10.1016/j.bdr.2015.01.001 .

Oliveira, P., Rodrigues, F., Henriques, P.R., 2005. A Formal Definition of Data Quality Problems., in: IQ.

Maier, M., Serebrenik, A., Vanderfeesten, I.T.P., 2013. Towards a Big Data Reference Architecture. University of Eindhoven.

Caballero, I., Piattini, M., 2003. CALDEA: a data quality model based on maturity levels, in: Third International Conference on Quality Software, 2003. Proceedings. Presented at the Third International Conference on Quality Software, 2003. Proceedings, pp. 380–387. https://doi.org/10.1109/QSIC.2003.1319125

Sidi, F., Shariat Panahy, P.H., Affendey, L.S., Jabar, M.A., Ibrahim, H., Mustapha, A., 2012. Data quality: A survey of data quality dimensions, in: 2012 International Conference on Information Retrieval Knowledge Management (CAMP). Presented at the 2012 International Conference on Information Retrieval Knowledge Management (CAMP), pp. 300–304. https://doi.org/10.1109/InfRKM.2012.6204995

Chen, M., Song, M., Han, J., Haihong, E., 2012. Survey on data quality, in: 2012 World Congress on Information and Communication Technologies (WICT). Presented at the 2012 World Congress on Information and Communication Technologies (WICT), pp. 1009–1013. https://doi.org/10.1109/WICT.2012.6409222

Batini C, Cappiello C, Francalanci C, Maurino A. Methodologies for data quality assessment and improvement. ACM Comput Surv. 2009;41:1–52. https://doi.org/10.1145/1541880.1541883 .

Glowalla, P., Balazy, P., Basten, D., Sunyaev, A., 2014. Process-Driven Data Quality Management–An Application of the Combined Conceptual Life Cycle Model, in: 2014 47th Hawaii International Conference on System Sciences (HICSS). Presented at the 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 4700–4709. https://doi.org/10.1109/HICSS.2014.575

Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Commun ACM. 1996;39:86–95. https://doi.org/10.1145/240455.240479 .

Wang, R.Y., Strong, D.M., 1996. Beyond accuracy: What data quality means to data consumers. Journal of management information systems 5–33.




Acknowledgements

Not applicable.

Funding

This work was supported by grant #12R005 from ZCHS at UAE University.

Author information

Authors and Affiliations

College of Technological Innovation, Zayed University, P.O. Box 144534, Abu Dhabi, United Arab Emirates

Ikbal Taleb

College of Information Technology, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Mohamed Adel Serhani

Department of Statistics, College of Business and Economics, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Chafik Bouhaddioui

Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, H4B 1R6, Canada

Rachida Dssouli



Contributions

IT conceived the main conceptual ideas of the Big data quality framework and the proof outline. He designed the framework and its main modules, and he also worked on the implementation and validation of some of the framework's components. MAS supervised the study, was in charge of direction and planning, and contributed to several sections, including the abstract, introduction, framework, implementation, and conclusion. CB contributed to data preparation, sampling, and profiling, and reviewed and validated all formulations and statistical modeling included in this work. RD contributed to the review and discussion of the core contributions and their validation. All authors read and approved the final manuscript.

Authors’ information

Dr. Ikbal Taleb is currently an Assistant Professor, College of Technological Information, Zayed University, Abu Dhabi, U.A.E. He got his Ph.D. in information and systems engineering from Concordia University in 2019, and MSc. in Software Engineering from the University of Montreal, Canada in 2006. His research interests include data and Big data quality, quality profiling, quality assessment, cloud computing, web services, and mobile web services.

Prof. M. Adel Serhani is currently a Professor, and Assistant Dean for Research and Graduate Studies College of Information Technology, U.A.E University, Al Ain, U.A.E. He is also an Adjunct faculty in CIISE, Concordia University, Canada. He holds a Ph.D. in Computer Engineering from Concordia University in 2006, and MSc. in Software Engineering from University of Montreal, Canada in 2002. His research interests include: Cloud for data intensive e-health applications, and services; SLA enforcement in Cloud Data centers, and Big data value chain, Cloud federation and monitoring, Non-invasive Smart health monitoring; management of communities of Web services; and Web services applications and security. He has a large experience earned throughout his involvement and management of different R&D projects. He served on several organizing and Technical Program Committees and he was the program Co-Chair of International Conference in Web Services (ICWS’2020), Co-chair of the IEEE conference on Innovations in Information Technology (IIT´13), Chair of IEEE Workshop on Web service (IWCMC´13), Chair of IEEE workshop on Web, Mobile, and Cloud Services (IWCMC´12), and Co-chair of International Workshop on Wireless Sensor Networks and their Applications (NDT´12). He has published around 130 refereed publications including conferences, journals, a book, and book chapters.

Dr. Chafik Bouhaddioui is an Associate Professor of Statistics in the College of Business and Economics at UAE University. He got his Ph.D. from University of Montreal in Canada. He worked as lecturer at Concordia University for 4 years. He has a rich experience in applied statistics in finance in private and public sectors. He worked as assistant researcher in Finance Ministry in Canada. He worked as Senior Analyst in National Bank of Canada and developed statistical methods used in stock market forecasting. He joined in 2004 a team of researchers in finance group at CIRANO in Canada to develop statistical tools and modules in finance and risk analysis. He published several papers in well-known journals in multivariate time series analysis and their applications in economics and finance. His area of research is diversified and includes modeling and prediction in multivariate time series, causality and independence tests, biostatistics, and Big Data.

Prof. Rachida Dssouli is a full professor and Director of Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University. Dr. Dssouli received a Master (1978), Diplome d'études Approfondies (1979), Doctorat de 3eme Cycle in Networking (1981) from Université Paul Sabatier, Toulouse, France. She earned her PhD degree in Computer Science (1987) from Université de Montréal, Canada. Her research interests are in Communication Software Engineering a sub discipline of Software Engineering. Her contributions are in Testing based on Formal Methods, Requirements Engineering, Systems Engineering, Telecommunication Service Engineering and Quality of Service. She published more than 200 papers in journals and referred conferences in her area of research. She supervised/ co-supervised more than 50 graduate students among them 20 PhD students. Dr. Dssouli is the founding Director of Concordia Institute for Information and Systems Engineering (CIISE) June 2002. The Institute hosts now more than 550 graduate students and 20 faculty members, 4 master programs, and a PhD program.

Corresponding author

Correspondence to Mohamed Adel Serhani.

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Taleb, I., Serhani, M.A., Bouhaddioui, C. et al. Big data quality framework: a holistic approach to continuous quality management. J Big Data 8, 76 (2021). https://doi.org/10.1186/s40537-021-00468-0


Received: 06 February 2021

Accepted: 15 May 2021

Published: 29 May 2021

DOI: https://doi.org/10.1186/s40537-021-00468-0


Keywords

  • Big data quality
  • Quality assessment
  • Quality metrics and scores
  • Pre-processing

  • Data Descriptor
  • Open access
  • Published: 03 May 2024

A dataset for measuring the impact of research data and their curation

  • Libby Hemphill (ORCID: orcid.org/0000-0002-3793-7281) 1,2,
  • Andrea Thomer 3,
  • Sara Lafia 1,
  • Lizhou Fan 2,
  • David Bleckley (ORCID: orcid.org/0000-0001-7715-4348) 1 &
  • Elizabeth Moss 1

Scientific Data volume 11, Article number: 442 (2024)


  • Research data
  • Social sciences

Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.


Background & Summary

Recent policy changes in funding agencies and academic journals have increased data sharing among researchers and between researchers and the public. Data sharing advances science and provides the transparency necessary for evaluating, replicating, and verifying results. However, many data-sharing policies do not explain what constitutes an appropriate dataset for archiving or how to determine the value of datasets to secondary users 1,2,3. Questions about how to allocate data-sharing resources efficiently and responsibly have gone unanswered 4,5,6. For instance, data-sharing policies recognize that not all data should be curated and preserved, but they do not articulate metrics or guidelines for determining what data are most worthy of investment.

Despite the potential for innovation and advancement that data sharing holds, the best strategies to prioritize datasets for preparation and archiving are often unclear. Some datasets are likely to have more downstream potential than others, and data curation policies and workflows should prioritize high-value data instead of being one-size-fits-all. Though prior research in library and information science has shown that the “analytic potential” of a dataset is key to its reuse value 7, work is needed to implement conceptual data reuse frameworks 8,9,10,11,12,13,14. In addition, publishers and data archives need guidance to develop metrics and evaluation strategies to assess the impact of datasets.

Several existing resources have been compiled to study the relationship between the reuse of scholarly products, such as datasets (Table 1); however, none of these resources include explicit information on how curation processes are applied to data to increase their value, maximize their accessibility, and ensure their long-term preservation. The CCex (Curation Costs Exchange) provides models of curation services along with cost-related datasets shared by contributors but does not make explicit connections between them or include reuse information 15. Analyses on platforms such as DataCite 16 have focused on metadata completeness and record usage but have not included related curation-level information. Analyses of GenBank 17 and FigShare 18,19 citation networks do not include curation information. Related studies of GitHub repository reuse 20 and Softcite software citation 21 reveal significant factors that impact the reuse of secondary research products but do not focus on research data. RD-Switchboard 22 and DSKG 23 are scholarly knowledge graphs linking research data to articles, patents, and grants, but largely omit social science research data and do not include curation-level factors. To our knowledge, other studies of curation work in organizations similar to ICPSR, such as GESIS 24, Dataverse 25, and DANS 26, have not made their underlying data available for analysis.

This paper describes a dataset 27 compiled for the MICA project (Measuring the Impact of Curation Actions) led by investigators at ICPSR, a large social science data archive at the University of Michigan. The dataset was originally developed to study the impacts of data curation and archiving on data reuse. The MICA dataset has supported several previous publications investigating the intensity of data curation actions 28, the relationship between data curation actions and data reuse 29, and the structures of research communities in a data citation network 30. Collectively, these studies help explain the return on various types of curatorial investments. The dataset that we introduce in this paper, which we refer to as the MICA dataset, has the potential to address research questions in the areas of science (e.g., knowledge production), library and information science (e.g., scholarly communication), and data archiving (e.g., reproducible workflows).

We constructed the MICA dataset 27 using records available at ICPSR, a large social science data archive at the University of Michigan. Dataset creation involved: collecting and enriching metadata for articles indexed in the ICPSR Bibliography of Data-related Literature against the Dimensions AI bibliometric database; gathering usage statistics for studies from ICPSR's administrative database; processing data curation work logs from ICPSR's project tracking platform, Jira; and linking data in social science studies and series to citing analysis papers (Fig. 1).

Figure 1: Steps to prepare the MICA dataset for analysis. External sources are red, primary internal sources are blue, and internal linked sources are green.

Enrich paper metadata

The ICPSR Bibliography of Data-related Literature is a growing database of literature in which data from ICPSR studies have been used. Its creation was funded by the National Science Foundation (Award 9977984), and for the past 20 years it has been supported by ICPSR membership and multiple US federally-funded and foundation-funded topical archives at ICPSR. The Bibliography was originally launched in the year 2000 to aid in data discovery by providing a searchable database linking publications to the study data used in them. The Bibliography collects the universe of output based on the data shared in each study, which is made available through each ICPSR study's webpage. The Bibliography contains both peer-reviewed and grey literature, which provides evidence for measuring the impact of research data. For an item to be included in the ICPSR Bibliography, it must contain an analysis of data archived by ICPSR or contain a discussion or critique of the data collection process, study design, or methodology 31. The Bibliography is manually curated by a team of librarians and information specialists at ICPSR who enter and validate entries. Some publications are supplied to the Bibliography by data depositors, and some citations are submitted by authors who abide by ICPSR's terms of use, which require them to submit citations to works in which they analyzed data retrieved from ICPSR. Most of the Bibliography is populated by Bibliography team members, who create custom queries for ICPSR studies performed across numerous sources, including Google Scholar, ProQuest, SSRN, and others. Each record in the Bibliography is one publication that has used one or more ICPSR studies. The version we used was captured on 2021-11-16 and included 94,755 publications.

To expand the coverage of the ICPSR Bibliography, we searched exhaustively for all ICPSR study names, unique numbers assigned to ICPSR studies, and DOIs 32 using a full-text index available through the Dimensions AI database 33 . We accessed Dimensions through a license agreement with the University of Michigan. ICPSR Bibliography librarians and information specialists manually reviewed and validated new entries that matched one or more search criteria. We then used Dimensions to gather enriched metadata and full-text links for items in the Bibliography with DOIs. We matched 43% of the items in the Bibliography to enriched Dimensions metadata including abstracts, field of research codes, concepts, and authors’ institutional information; we also obtained links to full text for 16% of Bibliography items. Based on licensing agreements, we included Dimensions identifiers and links to full text so that users with valid publisher and database access can construct an enriched publication dataset.
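The DOI-based portion of this matching can be sketched in pandas. This is only an illustration of the technique, not ICPSR's actual pipeline: the column names (`DOI`, `abstract`, `for_code`), the toy records, and the flat schemas are assumptions, not the real Bibliography or Dimensions exports.

```python
import pandas as pd

# Toy stand-ins for the two sources; all column names are assumptions.
biblio = pd.DataFrame({
    "PAPER_ID": [1, 2, 3],
    "DOI": ["10.1000/ABC", "10.1000/def", None],  # one record lacks a DOI
})
dimensions = pd.DataFrame({
    "doi": ["10.1000/abc", "10.1000/def"],
    "abstract": ["Abstract A", "Abstract B"],
    "for_code": ["44", "46"],
})

# DOIs are case-insensitive, so normalize both sides before joining.
biblio["doi_norm"] = biblio["DOI"].str.lower()
dimensions["doi_norm"] = dimensions["doi"].str.lower()

# A left join keeps Bibliography records that found no match; enrichment is
# partial in the real dataset (43% of items matched enriched metadata).
enriched = biblio.merge(dimensions.drop(columns="doi"), on="doi_norm", how="left")
```

Normalizing case before the join matters because DOI comparison is case-insensitive while a naive string merge is not.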

Gather study usage data

ICPSR maintains a relational administrative database, DBInfo, that organizes study-level metadata and information on data reuse across separate tables. Studies at ICPSR consist of one or more files collected at a single time or for a single purpose; studies in which the same variables are observed over time are grouped into series. Each study at ICPSR is assigned a DOI, and its metadata are stored in DBInfo. Study metadata follows the Data Documentation Initiative (DDI) Codebook 2.5 standard. DDI elements included in our dataset are title, ICPSR study identification number, DOI, authoring entities, description (abstract), funding agencies, subject terms assigned to the study during curation, and geographic coverage. We also created variables based on DDI elements: total variable count, the presence of survey question text in the metadata, the number of author entities, and whether an author entity was an institution. We gathered metadata for ICPSR’s 10,605 unrestricted public-use studies available as of 2021-11-16 ( https://www.icpsr.umich.edu/web/pages/membership/or/metadata/oai.html ).

To link study usage data with study-level metadata records, we joined study metadata from DBInfo on study usage information, which included total study downloads (data and documentation), individual data file downloads, and cumulative citations from the ICPSR Bibliography. We also gathered descriptive metadata for each study and its variables, which allowed us to summarize and append recoded fields onto the study-level metadata such as curation level, number and type of principal investigators, total variable count, and binary variables indicating whether the study data were made available for online analysis, whether survey question text was made searchable online, and whether the study variables were indexed for search. These characteristics describe aspects of the discoverability of the data to compare with other characteristics of the study. We used the study and series numbers included in the ICPSR Bibliography as unique identifiers to link papers to metadata and analyze the community structure of dataset co-citations in the ICPSR Bibliography 32.
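A hedged sketch of that join in pandas follows; the frames, column names, and counts are invented for illustration (the real records live in ICPSR's DBInfo), and a left join keeps studies that have no recorded downloads or citations.

```python
import pandas as pd

# Toy stand-ins for the administrative tables; values are invented.
metadata = pd.DataFrame({
    "STUDY": [1001, 1002, 1003],
    "TITLE": ["Study A", "Study B", "Study C"],
    "TOTAL_VARS": [120, 45, 300],
})
usage = pd.DataFrame({
    "STUDY": [1001, 1002],
    "TOTAL_DOWNLOADS": [850, 42],
    "CITATIONS": [12, 0],
})

# Left join keeps studies with no recorded use; treat missing counts as zero.
studies = metadata.merge(usage, on="STUDY", how="left")
studies[["TOTAL_DOWNLOADS", "CITATIONS"]] = (
    studies[["TOTAL_DOWNLOADS", "CITATIONS"]].fillna(0).astype(int)
)
```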

Process curation work logs

Researchers deposit data at ICPSR for curation and long-term preservation. Between 2016 and 2020, more than 3,000 research studies were deposited with ICPSR. Since 2017, ICPSR has organized curation work into a central unit that provides multiple levels of curation differing in the intensity and complexity of the data enhancement they provide. While the levels of curation are standardized by effort (level one = least effort, level three = most effort), the specific curatorial actions undertaken for each dataset vary. The specific curation actions are captured in Jira, a work-tracking program that data curators at ICPSR use to collaborate and communicate their progress through tickets. We obtained access to a corpus of 669 completed Jira tickets corresponding to the curation of 566 unique studies between February 2017 and December 2019 28.

To process the tickets, we focused only on their work log portions, which contained free text descriptions of work that data curators had performed on a deposited study, along with the curators’ identifiers, and timestamps. To protect the confidentiality of the data curators and the processing steps they performed, we collaborated with ICPSR’s curation unit to propose a classification scheme, which we used to train a Naive Bayes classifier and label curation actions in each work log sentence. The eight curation action labels we proposed 28 were: (1) initial review and planning, (2) data transformation, (3) metadata, (4) documentation, (5) quality checks, (6) communication, (7) other, and (8) non-curation work. We note that these categories of curation work are very specific to the curatorial processes and types of data stored at ICPSR, and may not match the curation activities at other repositories. After applying the classifier to the work log sentences, we obtained summary-level curation actions for a subset of all ICPSR studies (5%), along with the total number of hours spent on data curation for each study, and the proportion of time associated with each action during curation.
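As an illustration of the sentence-labeling technique, a minimal bag-of-words Naive Bayes classifier in scikit-learn is shown below. The training sentences and the three labels used here are invented stand-ins; ICPSR's classifier was trained on real Jira work-log sentences across all eight categories.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented work-log sentences labeled with three of the paper's eight
# curation action categories; the real training data are ICPSR Jira logs.
sentences = [
    "Recoded missing values and dropped duplicate cases",
    "Converted SPSS file to Stata format",
    "Added subject terms and geographic coverage to the record",
    "Updated study title and DOI in the metadata",
    "Emailed the PI about undocumented variables",
    "Scheduled a call with the depositor",
]
labels = [
    "data transformation", "data transformation",
    "metadata", "metadata",
    "communication", "communication",
]

# Bag-of-words features feeding a multinomial Naive Bayes model.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(sentences, labels)

label = clf.predict(["Standardized variable names in the metadata record"])[0]
```

With real training volume, per-sentence predictions can then be aggregated to study-level summaries of curation actions and hours, as described above.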

Data Records

The MICA dataset 27 connects records for each of ICPSR’s archived research studies to the research publications that use them and related curation activities available for a subset of studies (Fig.  2 ). Each of the three tables published in the dataset is available as a study archived at ICPSR. The data tables are distributed as statistical files available for use in SAS, SPSS, Stata, and R as well as delimited and ASCII text files. The dataset is organized around studies and papers as primary entities. The studies table lists ICPSR studies, their metadata attributes, and usage information; the papers table was constructed using the ICPSR Bibliography and Dimensions database; and the curation logs table summarizes the data curation steps performed on a subset of ICPSR studies.

Studies (“ICPSR_STUDIES”): 10,605 social science research datasets available through ICPSR up to 2021-11-16 with variables for ICPSR study number, digital object identifier, study name, series number, series title, authoring entities, full-text description, release date, funding agency, geographic coverage, subject terms, topical archive, curation level, single principal investigator (PI), institutional PI, the total number of PIs, total variables in data files, question text availability, study variable indexing, level of restriction, total unique users downloading study data files and codebooks, total unique users downloading data only, and total unique papers citing data through November 2021. Studies map to the papers and curation logs table through ICPSR study numbers as “STUDY”. However, not every study in this table will have records in the papers and curation logs tables.

Papers (“ICPSR_PAPERS”): 94,755 publications collected from 2000-08-11 to 2021-11-16 in the ICPSR Bibliography and enriched with metadata from the Dimensions database with variables for paper number, identifier, title, authors, publication venue, item type, publication date, input date, ICPSR series numbers used in the paper, ICPSR study numbers used in the paper, the Dimension identifier, and the Dimensions link to the publication’s full text. Papers map to the studies table through ICPSR study numbers in the “STUDY_NUMS” field. Each record represents a single publication, and because a researcher can use multiple datasets when creating a publication, each record may list multiple studies or series.

Curation logs (“ICPSR_CURATION_LOGS”): 649 curation logs for 563 ICPSR studies (although most studies in the subset had one curation log, some studies were associated with multiple logs, with a maximum of 10) curated between February 2017 and December 2019 with variables for study number, action labels assigned to work description sentences using a classifier trained on ICPSR curation logs, hours of work associated with a single log entry, and total hours of work logged for the curation ticket. Curation logs map to the study and paper tables through ICPSR study numbers as “STUDY”. Each record represents a single logged action, and future users may wish to aggregate actions to the study level before joining tables.
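Because each record represents a single logged action, the study-level aggregation suggested above might look like the following pandas sketch. The sample rows and the short column names ("ACTION", "HOURS") are hypothetical; only the "STUDY" key follows the dataset's documented naming.

```python
import pandas as pd

# Hypothetical curation-log rows: one record per logged action.
logs = pd.DataFrame({
    "STUDY": [101, 101, 101, 202],
    "ACTION": ["metadata", "quality checks", "metadata", "data transformation"],
    "HOURS": [1.5, 0.5, 2.0, 3.0],
})

# Aggregate to one row per study before joining with the studies table:
# hours per action type, plus total curation hours for the study.
study_level = (
    logs.pivot_table(index="STUDY", columns="ACTION", values="HOURS",
                     aggfunc="sum", fill_value=0)
        .assign(TOTAL_HOURS=lambda df: df.sum(axis=1))
        .reset_index()
)
print(study_level)
```

The resulting table has one row per study and can be merged onto the studies table on "STUDY" without duplicating study rows.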

Figure 2: Entity-relation diagram.

Technical Validation

We report on the reliability of the dataset’s metadata in the following subsections. To support future reuse of the dataset, curation services provided through ICPSR improved data quality by checking for missing values, adding variable labels, and creating a codebook.

All 10,605 studies available through ICPSR have a DOI and a full-text description summarizing what the study is about, the purpose of the study, the main topics covered, and the questions the PIs attempted to answer when they conducted the study. Personal names (i.e., principal investigators) and organizational names (i.e., funding agencies) are standardized against an authority list maintained by ICPSR; geographic names and subject terms are also standardized and hierarchically indexed in the ICPSR Thesaurus 34 . Many of ICPSR’s studies (63%) are in a series and are distributed through the ICPSR General Archive (56%), a non-topical archive that accepts any social or behavioral science data. While study data have been available through ICPSR since 1962, the earliest digital release date recorded for a study was 1984-03-18, when ICPSR’s database was first employed, and the most recent date is 2021-10-28 when the dataset was collected.

Curation level information was recorded starting in 2017 and is available for 1,125 studies (11%); approximately 80% of studies with assigned curation levels received curation services, equally distributed between Levels 1 (least intensive), 2 (moderately intensive), and 3 (most intensive) (Fig. 3). Detailed descriptions of ICPSR’s curation levels are available online 35 . Additional metadata are available for a subset of 421 studies (4%), including whether the study has a single PI, whether it has an institutional PI, the total number of PIs involved, the total number of variables recorded, whether the study is available for online analysis, whether it has searchable question text, whether its variables are indexed for search, whether it contains one or more restricted files, and whether the study is completely restricted. We provided additional metadata for this subset of ICPSR studies because they were released within the past five years and detailed curation and usage information were available for them. Usage statistics, including total downloads and data file downloads, are available for this subset of studies as well; citation statistics are available for 8,030 studies (76%). Most ICPSR studies have fewer than 500 users, as indicated by total downloads or citations (Fig. 4).

Figure 3: ICPSR study curation levels.

Figure 4: ICPSR study usage.

A subset of 43,102 publications (45%) available in the ICPSR Bibliography had a DOI. Author metadata were entered as free text, meaning that variations may exist and require additional normalization and pre-processing prior to analysis. While author information is standardized for each publication, individual names may appear in different orders, surname-first or given-name-first (e.g., “Earls, Felton J.” and “Stephen W. Raudenbush”). Most of the items in the ICPSR Bibliography as of 2021-11-16 were journal articles (59%), reports (14%), conference presentations (9%), or theses (8%) (Fig. 5). The number of publications collected in the Bibliography has increased each decade since the inception of ICPSR in 1962 (Fig. 6). Most ICPSR studies (76%) have one or more citations in a publication.

Figure 5: ICPSR Bibliography citation types.

Figure 6: ICPSR citations by decade.

Usage Notes

The dataset consists of three tables that can be joined using the “STUDY” key as shown in Fig.  2 . The “ICPSR_PAPERS” table contains one row per paper with one or more cited studies in the “STUDY_NUMS” column. We manipulated and analyzed the tables as CSV files with the Pandas library 36 in Python and the Tidyverse packages 37 in R.
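The joins described above can be sketched in pandas. Assuming the "STUDY_NUMS" field is a comma-delimited string of study numbers (the exact delimiter is an assumption here), the miniature tables and the columns other than "STUDY" and "STUDY_NUMS" are hypothetical:

```python
import pandas as pd

# Hypothetical miniature versions of two of the three tables.
studies = pd.DataFrame({"STUDY": [101, 202],
                        "STUDY_TITLE": ["Study A", "Study B"]})
papers = pd.DataFrame({"PAPER_ID": [1, 2],
                       "STUDY_NUMS": ["101", "101,202"]})

# One row per (paper, study) pair: split the delimited STUDY_NUMS field
# (assumed comma-separated) and explode it into separate rows.
paper_study = (
    papers.assign(STUDY=papers["STUDY_NUMS"].str.split(","))
          .explode("STUDY")
          .astype({"STUDY": int})
)

# Join papers to studies on the shared "STUDY" key.
merged = paper_study.merge(studies, on="STUDY", how="left")
print(merged[["PAPER_ID", "STUDY", "STUDY_TITLE"]])
```

The same "STUDY" key joins the curation logs table, ideally after aggregating its per-action rows to one row per study.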

The present MICA dataset can be used independently to study the relationship between curation decisions and data reuse. Evidence of reuse for specific studies is available in several forms: usage information, including downloads and citation counts; and citation contexts within papers that cite data. Analysis may also be performed on the citation network formed between datasets and papers that use them. Finally, curation actions can be associated with properties of studies and usage histories.

This dataset has several limitations of which users should be aware. First, Jira tickets can only be used to represent the intensiveness of curation for activities undertaken since 2017, when ICPSR started using both Curation Levels and Jira. Studies published before 2017 were all curated, but documentation of the extent of that curation was not standardized and therefore could not be included in these analyses. Second, the measure of publications relies upon the authors’ clarity of data citation and the ICPSR Bibliography staff’s ability to discover citations with varying formality and clarity. Thus, there is always a chance that some secondary-data-citing publications have been left out of the Bibliography. Finally, there may be some cases in which a paper in the ICPSR Bibliography did not actually obtain data from ICPSR. For example, PIs have often written about or even distributed their data prior to their archival at ICPSR. Those publications would not have cited ICPSR, but they are still collected in the Bibliography as being directly related to the data that were eventually deposited at ICPSR.

In summary, the MICA dataset contains relationships between two main types of entities – papers and studies – which can be mined. The tables in the MICA dataset have supported network analysis (community structure and clique detection) 30 ; natural language processing (NER for dataset reference detection) 32 ; visualizing citation networks (to search for datasets) 38 ; and regression analysis (on curation decisions and data downloads) 29 . The data are currently being used to develop research metrics and recommendation systems for research data. Given that DOIs are provided for ICPSR studies and articles in the ICPSR Bibliography, the MICA dataset can also be used with other bibliometric databases, including DataCite, Crossref, OpenAlex, and related indexes. Subscription-based services, such as Dimensions AI, are also compatible with the MICA dataset. In some cases, these services provide abstracts or full text for papers from which data citation contexts can be extracted for semantic content analysis.

Code availability

The code 27 used to produce the MICA project dataset is available on GitHub at https://github.com/ICPSR/mica-data-descriptor and through Zenodo with the identifier https://doi.org/10.5281/zenodo.8432666 . Data manipulation and pre-processing were performed in Python. Data curation for distribution was performed in SPSS.

He, L. & Han, Z. Do usage counts of scientific data make sense? An investigation of the Dryad repository. Library Hi Tech 35 , 332–342 (2017).


Brickley, D., Burgess, M. & Noy, N. Google dataset search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference - WWW ‘19 , 1365–1375 (ACM Press, San Francisco, CA, USA, 2019).

Buneman, P., Dosso, D., Lissandrini, M. & Silvello, G. Data citation and the citation graph. Quantitative Science Studies 2 , 1399–1422 (2022).

Chao, T. C. Disciplinary reach: Investigating the impact of dataset reuse in the earth sciences. Proceedings of the American Society for Information Science and Technology 48 , 1–8 (2011).


Parr, C. et al . A discussion of value metrics for data repositories in earth and environmental sciences. Data Science Journal 18 , 58 (2019).

Eschenfelder, K. R., Shankar, K. & Downey, G. The financial maintenance of social science data archives: Four case studies of long–term infrastructure work. J. Assoc. Inf. Sci. Technol. 73 , 1723–1740 (2022).

Palmer, C. L., Weber, N. M. & Cragin, M. H. The analytic potential of scientific data: Understanding re-use value. Proceedings of the American Society for Information Science and Technology 48 , 1–10 (2011).

Zimmerman, A. S. New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Sci. Technol. Human Values 33 , 631–652 (2008).

Cragin, M. H., Palmer, C. L., Carlson, J. R. & Witt, M. Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368 , 4023–4038 (2010).


Fear, K. M. Measuring and Anticipating the Impact of Data Reuse . Ph.D. thesis, University of Michigan (2013).

Borgman, C. L., Van de Sompel, H., Scharnhorst, A., van den Berg, H. & Treloar, A. Who uses the digital data archive? An exploratory study of DANS. Proceedings of the Association for Information Science and Technology 52 , 1–4 (2015).

Pasquetto, I. V., Borgman, C. L. & Wofford, M. F. Uses and reuses of scientific data: The data creators’ advantage. Harvard Data Science Review 1 (2019).

Gregory, K., Groth, P., Scharnhorst, A. & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review (2020).

York, J. Seeking equilibrium in data reuse: A study of knowledge satisficing . Ph.D. thesis, University of Michigan (2022).

Kilbride, W. & Norris, S. Collaborating to clarify the cost of curation. New Review of Information Networking 19 , 44–48 (2014).

Robinson-Garcia, N., Mongeon, P., Jeng, W. & Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11 , 841–854 (2017).

Qin, J., Hemsley, J. & Bratt, S. E. The structural shift and collaboration capacity in GenBank networks: A longitudinal study. Quantitative Science Studies 3 , 174–193 (2022).


Acuna, D. E., Yi, Z., Liang, L. & Zhuang, H. Predicting the usage of scientific datasets based on article, author, institution, and journal bibliometrics. In Smits, M. (ed.) Information for a Better World: Shaping the Global Future. iConference 2022 ., 42–52 (Springer International Publishing, Cham, 2022).

Zeng, T., Wu, L., Bratt, S. & Acuna, D. E. Assigning credit to scientific datasets using article citation networks. Journal of Informetrics 14 , 101013 (2020).

Koesten, L., Vougiouklis, P., Simperl, E. & Groth, P. Dataset reuse: Toward translating principles to practice. Patterns 1 , 100136 (2020).

Du, C., Cohoon, J., Lopez, P. & Howison, J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J. Assoc. Inf. Sci. Technol. 72 , 870–884 (2021).

Aryani, A. et al . A research graph dataset for connecting research data repositories using RD-Switchboard. Sci Data 5 , 180099 (2018).

Färber, M. & Lamprecht, D. The data set knowledge graph: Creating a linked open data source for data sets. Quantitative Science Studies 2 , 1324–1355 (2021).

Perry, A. & Netscher, S. Measuring the time spent on data curation. Journal of Documentation 78 , 282–304 (2022).

Trisovic, A. et al . Advancing computational reproducibility in the Dataverse data repository platform. In Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems , P-RECS ‘20, 15–20, https://doi.org/10.1145/3391800.3398173 (Association for Computing Machinery, New York, NY, USA, 2020).

Borgman, C. L., Scharnhorst, A. & Golshan, M. S. Digital data archives as knowledge infrastructures: Mediating data sharing and reuse. Journal of the Association for Information Science and Technology 70 , 888–904, https://doi.org/10.1002/asi.24172 (2019).

Lafia, S. et al . MICA Data Descriptor. Zenodo https://doi.org/10.5281/zenodo.8432666 (2023).

Lafia, S., Thomer, A., Bleckley, D., Akmon, D. & Hemphill, L. Leveraging machine learning to detect data curation activities. In 2021 IEEE 17th International Conference on eScience (eScience) , 149–158, https://doi.org/10.1109/eScience51609.2021.00025 (2021).

Hemphill, L., Pienta, A., Lafia, S., Akmon, D. & Bleckley, D. How do properties of data, their curation, and their funding relate to reuse? J. Assoc. Inf. Sci. Technol. 73 , 1432–44, https://doi.org/10.1002/asi.24646 (2021).

Lafia, S., Fan, L., Thomer, A. & Hemphill, L. Subdivisions and crossroads: Identifying hidden community structures in a data archive’s citation network. Quantitative Science Studies 3 , 694–714, https://doi.org/10.1162/qss_a_00209 (2022).

ICPSR. ICPSR Bibliography of Data-related Literature: Collection Criteria. https://www.icpsr.umich.edu/web/pages/ICPSR/citations/collection-criteria.html (2023).

Lafia, S., Fan, L. & Hemphill, L. A natural language processing pipeline for detecting informal data references in academic literature. Proc. Assoc. Inf. Sci. Technol. 59 , 169–178, https://doi.org/10.1002/pra2.614 (2022).

Hook, D. W., Porter, S. J. & Herzog, C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics 3 , 23, https://doi.org/10.3389/frma.2018.00023 (2018).

ICPSR. ICPSR Thesaurus. https://www.icpsr.umich.edu/web/ICPSR/thesaurus (2002).

ICPSR. ICPSR Curation Levels. https://www.icpsr.umich.edu/files/datamanagement/icpsr-curation-levels.pdf (2020).

McKinney, W. Data Structures for Statistical Computing in Python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference , 56–61 (2010).

Wickham, H. et al . Welcome to the Tidyverse. Journal of Open Source Software 4 , 1686 (2019).

Fan, L., Lafia, S., Li, L., Yang, F. & Hemphill, L. DataChat: Prototyping a conversational agent for dataset search and visualization. Proc. Assoc. Inf. Sci. Technol. 60 , 586–591 (2023).



Acknowledgements

We thank the ICPSR Bibliography staff, the ICPSR Data Curation Unit, and the ICPSR Data Stewardship Committee for their support of this research. This material is based upon work supported by the National Science Foundation under grant 1930645. This project was made possible in part by the Institute of Museum and Library Services LG-37-19-0134-19.

Author information

Authors and Affiliations

Inter-university Consortium for Political and Social Research, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill, Sara Lafia, David Bleckley & Elizabeth Moss

School of Information, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill & Lizhou Fan

School of Information, University of Arizona, Tucson, AZ, 85721, USA

Andrea Thomer



Contributions

L.H. and A.T. conceptualized the study design, D.B., E.M., and S.L. prepared the data, S.L., L.F., and L.H. analyzed the data, and D.B. validated the data. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Libby Hemphill .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Hemphill, L., Thomer, A., Lafia, S. et al. A dataset for measuring the impact of research data and their curation. Sci Data 11 , 442 (2024). https://doi.org/10.1038/s41597-024-03303-2

Download citation

Received : 16 November 2023

Accepted : 24 April 2024

Published : 03 May 2024

DOI : https://doi.org/10.1038/s41597-024-03303-2



Differences in quality of anticoagulation care delivery according to ethnoracial group in the United States: A scoping review

  • Open access
  • Published: 11 May 2024


  • Sara R. Vazquez   ORCID: orcid.org/0000-0002-9267-8980 1 ,
  • Naomi Y. Yates 2 ,
  • Craig J. Beavers 3 , 4 ,
  • Darren M. Triller 3 &
  • Mary M. McFarland 5  


Anticoagulation therapy is standard for conditions like atrial fibrillation, venous thromboembolism, and valvular heart disease, yet it is unclear if there are ethnoracial disparities in its quality and delivery in the United States. For this scoping review, electronic databases were searched for publications between January 1, 2011 and March 30, 2022. Eligible studies included all study designs, any setting within the United States, patients prescribed anticoagulation for any indication, and outcomes reported for ≥ 2 distinct ethnoracial groups. The following four research questions were explored: Do ethnoracial differences exist in (1) access to guideline-based anticoagulation therapy, (2) quality of anticoagulation therapy management, (3) clinical outcomes related to anticoagulation care, and (4) humanistic/educational outcomes related to anticoagulation therapy? A total of 5374 studies were screened, 570 studies received full-text review, and 96 studies were analyzed. The largest mapped focus was patients’ access to guideline-based anticoagulation therapy (88/96 articles, 91.7%). Seventy-eight articles made statistical outcomes comparisons among ethnoracial groups. Across all four research questions, 79 articles demonstrated favorable outcomes for White patients compared to non-White patients, 38 articles showed no difference between White and non-White groups, and 8 favored non-White groups (the total exceeds the 78 articles with statistical outcomes because many articles reported multiple outcomes). Disparities disadvantaging non-White patients were most pronounced in access to guideline-based anticoagulation therapy (43/66 articles analyzed) and quality of anticoagulation management (19/21 articles analyzed).
Although treatment guidelines do not differentiate anticoagulant therapy by ethnoracial group, this scoping review found consistently favorable outcomes for White patients over non-White patients in the domains of access to anticoagulation therapy for guideline-based indications and quality of anticoagulation therapy management. No differences among groups were noted in clinical outcomes, and very few studies assessed humanistic or educational outcomes.

Graphical Abstract

Scoping Review: Differences in quality of United States anticoagulation care delivery by ethnoracial group. AF = atrial fibrillation; AMS = anticoagulation management service; DOACs = direct oral anticoagulants; INR = international normalized ratio; PSM = patient self-management; PST = patient self-testing



Introduction

It is well-established that in the United States (US) ethnoracial disparities exist in various aspects of health care. Specifically, persons identifying with an ethnoracial minority group may have more challenging access to health care, worse clinical outcomes, and higher dissatisfaction with care compared to White persons [ 1 , 2 , 3 , 4 , 5 ]. There are differences by ethnoracial group in the prevalence of the three most common indications for which anticoagulants are prescribed, stroke prevention in atrial fibrillation (AF), treatment of venous thromboembolism (VTE), and valvular heart disease [ 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 ]. Specifically, VTE is most prevalent in Black patients compared to White and Asian patients, whereas AF is most prevalent in White patients compared to Black, Asian, and Hispanic patients [ 9 , 10 , 15 ]. Calcific heart valve disease has the most relevance to the US population, and epidemiologic data has shown that aortic stenosis is more prevalent in White patients compared to Black, Asian, and Hispanic patients [ 17 ]. Despite these epidemiologic differences, there is no evidence to suggest there should be any difference in treatment strategies across ethnoracial patient groups.

While studies have demonstrated genotypic differences that may result in different warfarin dose requirements [ 18 ], and early studies may indicate genotypic differences in direct oral anticoagulant (DOAC) response [ 19 ], no US-based labeling or guidelines recommend a difference in prescription or delivery of anticoagulation care based on race or ethnicity. However, it is unclear if there are in fact differences in the type and quality of anticoagulation therapy, which is standard of care for each of these conditions [ 20 , 21 , 22 , 23 , 24 ]. Anticoagulants remain in the top three classes of drugs causing adverse drug events (primarily bleeding) in the United States, according to the 2014 National Action Plan for Adverse Drug Event Prevention. One of the goals of the National Action Plan was to identify patient populations at higher risk for these adverse drug events to inform the development of targeted harm reduction strategies [ 25 ]. If ethnoracial minority patients are receiving sub-optimal anticoagulation therapy in certain measurable areas of anticoagulation quality, it is vital to highlight the areas of disparity so that these can be explored and care optimized. Anticoagulation providers often have high-frequency contact with their patients and can be a reliable connection between disproportionately affected patients and a system in need of change. Systematic reviews of ethnoracial disparities in AF and VTE have been conducted. The AF review assessed AF prevalence among racial groups as well as differences in symptoms and management, including stroke prevention with warfarin or DOACs [ 9 ]. The VTE review specifically assessed VTE prevalence and racial differences in COVID-19 and did report the use of any prophylactic anticoagulation, but this was not part of the analysis [ 26 ]. No review of racial disparities in the quality of anticoagulation therapy was found in searches conducted prior to protocol development.

In this study we aimed to identify any potential ethnoracial disparities in anticoagulation care quality in the US. The decision to limit the study to a US population was based on our observation that the US has a unique history of interactions between racial and ethnic groups that may not necessarily be reflected by studies conducted in other countries. Additionally, health care delivery systems vary widely across the world, and we wanted to include the data most relevant to the potential racial disparities existing in the US health care system. The term “race” was used to identify a group of people with shared physical characteristics believed to be of common ancestry, whereas the term “ethnicity” refers to a group of people with shared cultural traditions [ 27 ]. We recognize these terms may be far more complex. In order to encompass both the physical and cultural aspects of a patient’s identity, we have chosen to use the term “ethnoracial” for this study [ 27 ]. Highlighting existing differences will serve as a stimulus for institutions and clinicians to assess current services, implement quality improvement measures, and inform future research efforts to deliver optimal anticoagulation care for all patients. The scoping review protocol was registered with the Open Science Framework on December 22, 2021, https://doi.org/10.17605/OSF.IO/9SE7H [ 28 ].

Methods

We conducted this scoping review with guidance from the 2020 version of the JBI Manual for Evidence Synthesis, organized according to Arksey’s five stages: 1) identifying the research question, 2) identifying relevant studies, 3) study selection, 4) charting the data, and 5) collating, summarizing and reporting the results [ 29 , 30 ]. For transparency and reproducibility, we followed the PRISMA-ScR and PRISMA-S reporting guidelines in reporting our results [ 31 ]. We used Covidence (Veritas Health Innovation), an online systematic review platform, to screen and select studies. Citation management and duplicate detection and removal were accomplished with EndNote, version 19 (Clarivate Analytics). Data were charted from our selected studies using REDCap, an electronic data capture tool hosted at the University of Utah [ 32 ].

Literature searching

An information specialist developed and translated search strategies for the online databases using a combination of keywords and controlled subject headings unique to each database, along with team feedback. Peer review of the strategies was conducted by library colleagues using the PRESS guidelines [ 33 ]. Electronic databases searched included Medline (Ovid) 2011–2022, Embase (embase.com) 2011–2022, CINAHL Complete (Ebscohost) 2011–2022, Sociological Abstracts (ProQuest) 2011–2022, International Pharmaceutical Abstracts (Ovid) 2011–2022, Scopus (scopus.org) 2011–2022, and Web of Science Core Collection (Clarivate Analytics) 2011–2022. Limits included a date range from January 1, 2011 to March 30–April 19, 2022, as not all database results were exported on the same day. See Supplemental File 1 for detailed search strategies. A search of grey literature was not conducted due to time and resource constraints.

Study selection

For inclusion, each study required two votes by independent reviewers for screening of titles and abstracts followed by full-text review. A third reviewer provided the deciding vote. Data extraction was performed by two independent reviewers, and consensus on any discrepancies was reached via discussion between the reviewers. The data form was piloted by two team members using sentinel articles prior to data extraction.

Eligible studies included all types of study designs in any setting with a population of patients of any age or gender located within the US who were prescribed anticoagulant therapy for any indication, published between January 1, 2011 – March 30, 2022 in order to capture contemporary and clinically relevant practices.

We defined the following research questions for this scoping review as described in Table  1 .

Studies must have reported any of these anticoagulation care delivery outcomes for at least 2 distinct racial or ethnic groups. We excluded genotyping studies and non-English language articles at full text review, as we had no funding for translation services. In checking references of included studies, no additional studies met inclusion criteria. In accordance with scoping review methodology, no quality assessment of included studies was conducted as our goal was to rapidly map the literature. As this is a scoping review of the literature, no aggregate or pooled analysis was performed; however, for ease of interpretation, when assessing for the directionality of the outcomes in the various studies, we categorized studies into Favoring White Group, Favoring Non-White Group, and No Differences Among Ethnoracial Groups. If studies had mixed outcomes of favoring one group for one outcome and no difference for another, then the study was categorized with the favoring group.

Results

A PRISMA flow diagram in Fig.  1 depicts search results, exclusions, and inclusions. The search strategies retrieved 6900 results with 1526 duplicates removed. Following title and abstract screening of 5374 references, 570 articles received full-text review. The most common reason for the exclusion of 474 studies was that outcomes were not reported for two distinct ethnoracial groups (171 studies). Ninety-six studies underwent data extraction.

Figure 1: PRISMA flow diagram.

Study characteristics-overall

Fifty of the 96 studies were published between 2011 and 2018 (an average of 6.25 articles per year that compared outcomes between two ethnoracial groups) and 43 of 96 studies were published in the years 2019–2021 (average 14.3 articles per year; 2022 is excluded here because only 4 months of data was captured) (Fig. 2). Most studies analyzed an outpatient population (65.6%) for an indication of stroke prevention in AF (67.7%) in patients taking warfarin (71.9%) or DOACs (49.0%). Study population size was heterogeneous, ranging from 24 patients to over 1.3 million patients (median 5,238 patients) in the 69 studies that reported population size by racial group. When stratified by size, 60.9% of the articles in the scoping review (42 articles) represented < 10,000 patients (Table 2).

Figure 2: Number of articles by publication year. *2022 excluded from this figure since the search period did not capture the entire year.

Study characteristics-by ethnoracial group

There were 50 studies (52.1%) where race or ethnicity was either mentioned in the title or objective of the article, with 24 of these published over the 7-year period 2011–2018 and 26 published over the 3-year period 2019 to first quarter 2022. The method for reporting race or ethnicity was unclear or unspecified in most studies (77.1%) and 16 articles (16.7%) utilized self-reporting of race or ethnicity. Most studies analyzed White or Caucasian racial groups (94.8%), followed by Black or African-American (80.2%), and many studies grouped all other racial groups into an “Other” category (41.7%) (Fig.  3 ).

Fig. 3 Number of Articles by Ethnoracial Groups. *For study inclusion, a study had to compare outcomes for at least two distinct ethnoracial groups

White patients accounted for a median 77% of study populations, Black patients 9.5%, Hispanic/Latino patients 6.2%, “Other” racial groups 5.3%, and Asian patients 2.5%.

Study outcomes: overall

Of the 4 research questions, most studies included in this review analyzed patients’ access to guideline-based anticoagulation therapy (88/96 articles, 91.7%), clinical outcomes (42/96 articles, 43.8%), or quality of anticoagulation management (24/96 articles, 25.0%), while very few addressed humanistic or educational outcomes (5/96 articles, 5.2%) (Fig. 4). Many studies addressed multiple outcomes within a single study.

Fig. 4 Number of Articles Mapped by Research Question

Seventy-eight of the 96 included studies provided statistical comparisons between ethnoracial groups, and these data are presented below.

Outcomes for research question 1: Do ethnoracial differences exist in access to guideline-based anticoagulation therapy?

Anticoagulation for a guideline-based indication

This question focused on whether patients with a guideline-based indication for anticoagulation (specifically AF, VTE prophylaxis based on risk stratification, and acute VTE) actually received an anticoagulant. The majority of the AF studies (25/34 studies) demonstrated that White patients received anticoagulation at significantly higher rates compared to non-White patients [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], while the six VTE studies largely demonstrated no difference among ethnoracial groups [61, 62, 63, 64, 65, 66].

DOACs as first-line therapy for AF or VTE

Eighteen individual studies statistically assessed the outcome of DOACs as first-line therapy (compared to warfarin) for AF (15 studies), VTE treatment (2 studies), or both indications (1 study). Twelve of the 15 AF studies showed that a significantly higher proportion of White patients received DOACs as first-line therapy compared to non-White patients [36, 40, 41, 42, 43, 44, 45, 46, 54, 55, 67, 68]; of those 12, 9 specifically compared White patients to Black patients. Both VTE treatment studies and the study that assessed both AF and VTE indications showed significantly higher DOAC prescribing rates for White patients compared to Black patients [69, 70, 71].

Anticoagulant therapy adherence/persistence

The eight studies that addressed anticoagulation therapy adherence/persistence showed variability in outcome directionality by ethnoracial group: five showed no difference [41, 72, 73, 74, 75], two showed better treatment adherence/persistence for White patients compared to Black patients [76] or non-White patients [77], and one showed better treatment adherence/persistence for White patients compared to Hispanic patients, but no difference between White and Black patients [78].

Figure  5 summarizes the outcome directionality for Research Question 1 regarding access to guideline-based anticoagulation therapy. Overall, the areas of disparity identified included anticoagulation for atrial fibrillation and preferential use of DOAC therapy for AF and VTE treatment.

Fig. 5 Outcome Directionality for the 4 Research Questions and their Subcategories. AC = anticoagulant; AMS = anticoagulation management service; INR = international normalized ratio; PST = patient self-testing; PSM = patient self-management

Research question 2: Do ethnoracial differences exist in the quality of anticoagulation therapy management?

A total of 21 studies assessed the quality of anticoagulation therapy management: warfarin time in therapeutic range (TTR)/INR (International Normalized Ratio) control (12 studies), appropriate anticoagulant dosing (3 studies), enrollment in an anticoagulation management service (5 studies), and patient self-testing/patient self-management (PST/PSM; 1 study).

In statistical comparisons of INR control in warfarin patients, all 12 studies (7 assessed mean or median TTR; 5 assessed other measures of INR control such as days spent above/below range or gaps in INR monitoring) showed that White patients had more favorable INR control compared to non-White patients (most comparisons included Black patients) [41, 75, 79, 80, 81, 82, 83, 84, 85, 86, 87]. Enrollment in an anticoagulation management service was statistically compared among ethnoracial groups in 5 studies, and this opportunity favored White patients compared to other racial groups in four of the five [41, 82, 86, 88]. Two of the three studies that statistically analyzed appropriate anticoagulant dosing showed a higher rate of appropriate DOAC dosing in White patients compared to non-White patients [41, 89], and the third showed no difference among ethnoracial groups for enoxaparin dosing in the emergency department [90]. The one study assessing access to PST/PSM showed that more White patients used PST compared to Black or Hispanic patients [91] (Fig. 5).

Research question 3: Do ethnoracial differences exist in the clinical outcomes related to anticoagulation care?

Articles assessing clinical outcomes among ethnoracial groups primarily assessed bleeding (15 articles) or thrombosis (9 articles) outcomes, while 8 articles assessed anticoagulation-related hospitalization or mortality. One article addressed a net clinical outcome comprising major bleeding, stroke or systemic embolism, and death from any cause; it was included in the bleeding outcomes category so that it was not double-counted in the other two outcome categories. Additional details about the 24 unique studies that statistically assessed clinical outcomes, including the study design, population size, ethnoracial groups studied, anticoagulants used, and statistical outcomes measured, can be found in Supplementary Tables 1 and 2.

Sixteen studies statistically assessed bleeding outcomes of varying definitions (major bleeding 13 studies, clinically relevant non-major bleeding 3 studies, any bleeding 3 studies, bleeding otherwise defined 3 studies). Six studies demonstrated no difference in bleeding outcomes by ethnoracial group [55, 92, 93, 94, 95, 96], nine reported that White patients had lower rates of bleeding compared to Black or Asian patients [53, 80, 83, 85, 97, 98, 99, 100, 101], and in the remaining study, Asian patients had a more favorable net clinical outcome compared to non-Asian patients [102].

Nine studies statistically assessed thrombosis outcomes among ethnoracial groups, including stroke/systemic embolism (5 studies), recurrent VTE (3 studies), or any thrombosis (1 study). The stroke outcomes by racial group were heterogeneous, with three studies showing better outcomes for White patients compared to Black patients [103, 104, 105] and two studies showing no difference in outcomes when White patients were compared to non-White patients [55, 95]. In three of the four VTE studies there were no differences in outcomes by ethnoracial group [61, 93, 96], and in one study White patients had more favorable outcomes compared to Black patients [106].

Nine studies assessed anticoagulation-related hospitalizations or mortality by ethnoracial group. Outcomes were mixed: four studies showed no difference in hospitalizations or mortality among ethnoracial groups [89, 95, 96, 107], three studies showed White patients had a lower rate of hospitalizations [85, 105] or mortality [104, 105], and another study showed a lower rate of mortality or hospice after intracranial hemorrhage in Black and Other race patients [108] (Fig. 5).

Research question 4: Do ethnoracial differences exist in the humanistic/educational outcomes related to anticoagulation therapy?

The five studies reporting this category of outcomes were heterogeneous. Of the two studies assessing anticoagulation knowledge, one showed no difference by ethnoracial group [109], and the other favored the non-White group in appropriately estimating bleeding risk [110]. One study assessed an atrial fibrillation quality of life score at 2-year follow-up after AF diagnosis and found that the outcomes favored White patients [79]. Another study assessed satisfaction with VTE care and found no difference among ethnoracial groups [111]. The fifth study found no difference in the percentage of racial groups having a cost conversation when initiating DOAC therapy (78% Whites, 72.2% non-Whites) [112] (Fig. 5).

Overall outcome directionality for all four research questions is shown in Fig. 6. A total of 79 articles demonstrated favorable outcomes for White patients compared to non-White patients, 38 articles showed no difference between White and non-White groups, and 8 articles had outcomes favoring non-White groups (the total exceeds the 78 articles with statistical outcomes because many articles reported multiple outcomes). The greatest areas of disparity between White and non-White groups were access to guideline-based anticoagulation therapy and quality of anticoagulation therapy management. Clinical outcomes relating to anticoagulation care showed the least difference among ethnoracial groups. Relatively few studies assessed potential ethnoracial disparities in humanistic and educational outcomes.

Fig. 6 Outcome Directionality for All 4 Research Questions

This scoping review assessing ethnoracial differences in the quality of anticoagulation care and its delivery to patients in the United States encompassed eleven full years of literature and resulted in the inclusion of 96 studies, 78 of which contained statistical outcome comparisons among ethnoracial groups. The most common reason for study exclusion was that outcomes were not reported for at least two distinct ethnoracial groups. We observed that beginning in 2019, and following the racial unrest of 2020, the rate of publication of articles addressing ethnoracial disparities in anticoagulation care more than doubled. Over the entire study period, half of the studies had race or ethnicity as the focus or objective of the paper, but this was largely driven by articles published after 2019.

Only 16% of included articles documented self-reporting of racial identity, with most of the remainder using an unspecified method for documenting racial identity. It is likely that many studies utilized demographic information extracted from an electronic medical record (EMR), but it is often unclear whether that information is truly self-reported race. A second finding of this scoping review was that many studies analyzed two or three ethnoracial groups and then categorized all others into a heterogeneous “Other” category; for example, studies would frequently categorize patients as White, Black, and “Other.” It is unclear whether those in a racial category labeled “Other” had an unknown or missing racial identity in the EMR or intentionally chose not to disclose it. It is also likely that study investigators decided to classify ethnoracial groups with lower population sizes into a miscellaneous category. Few studies (15%) specifically assessed patients identifying as Native American/Alaska Native, Native Hawaiian/Pacific Islander, or multiracial. While Hispanic/Latino is an ethnicity, most studies categorized it as a separate “race” category. Of the 37 studies that analyzed “Asian” patient populations, none defined “Asian” more specifically. The US Census Bureau defines “Asian” race as a person having origins in the Far East, Southeast Asia, or the Indian subcontinent [113]. This broad definition encompasses many different ethnicities, which could represent variability in health outcomes if better defined and more frequently analyzed.
These findings present opportunities for EMR systems to improve transparency in how race, ethnicity, and language preference are captured, and for those designing research studies to be thoughtful and intentional about analyzing the ethnoracial identities of the study population, perhaps in alignment with the minimum five racial categories utilized by the US Census Bureau, the National Institutes of Health, and the Office of Management and Budget (White, Black, American Indian/Alaska Native, Asian, Native Hawaiian/Pacific Islander, with permission for a “some other race” category and the option to select multiple races) [113]. Since 2017, ClinicalTrials.gov has required the reporting of race/ethnicity if collected; compliance with this requirement is good, but reporting in the resulting publications is less consistent [114].

We examined the proportion of ethnoracial groups represented for each of the disease states in the studies included in this scoping review, relative to disease state prevalence, and found a discrepancy. For AF, prevalence in White patients was 11.3%, in Black patients 6.6%, and in Hispanic patients 7.8% [15]. However, the representation in the AF studies in this review was 74% White, 13% Black, and 8% Hispanic. Assessing VTE incidence by race is more difficult, as studies have shown regional and temporal variation, with Black patients typically having a higher incidence compared to other ethnoracial groups [16]. In this review, however, of the studies assessing VTE treatment or prophylaxis, only 16% of the patient population identified as Black, whereas 70% identified as White. Only 3 studies assessed a valvular heart disease population, making ethnoracial group representation difficult to assess.

The majority of studies captured in this review analyzed patients in the outpatient setting, for the anticoagulation indication of stroke prevention in AF, taking either warfarin or a DOAC. Few studies involved the acute care setting or injectable anticoagulants, representing an area for future study of potential ethnoracial disparities.

Overall, the majority of studies in this scoping review addressed ethnoracial disparities in patients’ access to guideline-based anticoagulation therapy, clinical outcomes related to anticoagulation care, and quality of anticoagulation management. An identified research gap is the need for further study of educational outcomes, such as anticoagulation and disease state knowledge and shared decision-making willingness and capability, and of humanistic outcomes, such as quality of life and satisfaction with anticoagulation therapy.

In analyzing the first research question regarding ethnoracial differences in access to guideline-based anticoagulation therapy, the majority of studies addressed the use of any anticoagulation for stroke prevention in AF in patients above a threshold risk score and the preferential use of DOACs as first-line therapy instead of warfarin for AF. In both categories, patients in a non-White ethnoracial group (particularly Black patients) received recommended therapy less often than patients identified as White. It is unclear why this is the case; the causes could lie at the patient, provider, and/or system level. It is possible that some studies adjusted for covariates more successfully than others. Sites or settings with systematic processes such as order sets or clinical decision support systems in place for standard prescribing may be more successful in equitably prescribing indicated therapies. In one large study of AF patients in the Veterans Affairs population, even after adjusting for numerous variables that included clinical, demographic, socioeconomic, prescriber, and geographic site factors, DOAC prescribing remained lower in Asian and Black patients compared with White patients. The authors of that study postulate that non-White populations may be less receptive to novel therapies due to historical mistrust of the health care system, or may have reduced access to education about the latest treatments, giving the example of direct-to-consumer advertising [42]. It has also previously been demonstrated that prescribing of oral anticoagulation, and particularly DOACs, is lower in non-White patients [41]. Factors such as these are difficult to capture as standard covariates, which is why further study is needed. We examined the publication dates for both access categories to see whether a lack of contemporary data was skewing the outcomes.
However, for both anticoagulation for a guideline-based indication and DOACs as first-line therapy, the majority of articles came from the period 2019–2021 (24 of 40 articles and 15 of 18 articles, respectively), well after guideline updates preferentially recommended DOACs [34, 35]. In addition, there were relatively few studies addressing guideline-based therapy for VTE treatment and prophylaxis, making assessment of disparities difficult. Regarding access, it is well established that race and ethnicity often determine a patient’s socioeconomic status and that low socioeconomic status and its correlates (e.g., reduced education, income, and healthcare access) are associated with poorer health outcomes [115]. However, at each level of income or education, Black patients experience worse health outcomes than White patients [116]. Thus, low socioeconomic status does not fully explain poorer health outcomes for non-White individuals.

After examining access to appropriate and preferred anticoagulation therapy, the second research question of this scoping review examined potential ethnoracial disparities in the quality of anticoagulation therapy management. INR control measures such as time in therapeutic INR range are a surrogate measure of both thrombotic and bleeding outcomes and are frequently used to assess the quality of warfarin therapy. The studies identified in this review showed a clear disparity between White and non-White patient groups (especially Black patients); however, all twelve studies comparing TTR among ethnoracial groups were published prior to 2019. This could be due to the decline in warfarin prescribing relative to increases in DOAC prescribing [117, 118, 119], but there remain patient populations that require or choose warfarin, so this marker of anticoagulation control remains relevant and requires continued reassessment. There were relatively few studies assessing other markers of anticoagulation management quality, such as anticoagulation management service enrollment, appropriate DOAC dosing, and access to quality improvement strategies like PST or PSM. Few studies assessed educational outcomes, yet these may have relevance to the anticoagulation care quality question above. For patients who remain on warfarin, dietary vitamin K consistency is an example of a key educational point that links directly to INR control. It is unclear whether there are disparities in this type of education among ethnoracial groups that may have more far-reaching effects.

Of note, clinical outcomes related to anticoagulant therapy seemed to have the fewest areas of disparity, although the number of articles was small. This is a promising sign that if patients have access to high-quality anticoagulation therapy, optimal clinical outcomes can be achieved for all ethnoracial groups.

There are some limitations of this scoping review that warrant consideration. First, we chose fairly broad inclusion criteria (all anticoagulants, all study types) because a review of this type had never been performed before; this resulted in a relatively large number of included articles for a scoping review. Second, there is likely a high degree of heterogeneity among patient populations and outcome definitions. However, as this is a scoping review whose goal is to present an overview of the literature rather than to report composite outcomes, a risk of bias assessment was not performed. Third is our decision to group patients into White and non-White groups for assessment of outcome directionality; in doing so, we may have missed subtle differences in outcomes among the various non-White ethnoracial groups. Fourth, in our main search we included all studies that reported outcomes, but due to scope, we only reported outcome directionality for studies that statistically compared outcomes between ethnoracial groups. Finally, because of the large number of studies that required review and analysis, this was a lengthy undertaking, and we are certain that additional studies have been published since the closure of our search period.

In line with the 2014 National Action Plan for Adverse Drug Event Prevention’s goal of identifying patient populations at higher risk of adverse drug events, this scoping review highlights several areas where the quality of anticoagulation care can be optimized for all patients. Future research opportunities regarding ethnoracial differences in the quality of anticoagulation care are summarized in Table 3. While the scoping review focused exclusively on the evaluation of peer-reviewed manuscripts, the heterogeneity of terminology and methodologies identified in the published papers may have implications for national health policy relating to the quality and safety of care (e.g., the Medicare Quality Payment Program) [120]. To accurately and reliably quantify important disparities in anticoagulation-related care and support effective improvement initiatives, attention and effort will need to be invested across the full continuum of quality measure development [121], measure endorsement [122], measure selection, and status assignment within value-based payment programs (e.g., required/optional, measure weighting) [123]. The findings of this scoping review may be of utility to such efforts, and the development and implementation of suitable quality measures will likely be of value to future research in this important therapeutic area.


Treatment guidelines do not recommend differentiating anticoagulant therapy by ethnoracial group, yet this scoping review of the literature demonstrates consistent directionality in favor of White patients over non-White patients in the domains of access to anticoagulation therapy for guideline-based indications, prescription of preferred anticoagulation therapies, and quality of anticoagulation therapy management. These data should stimulate assessment of current services and implementation of quality improvement measures, and should inform future research to make the quality of anticoagulation care more equitable.

Data Availability

Data are available on request from the corresponding author.

Wheeler SM, Bryant AS (2017) Racial and Ethnic Disparities in Health and Health Care. Obstet Gynecol Clin North Am 44(1):1–11


Manuel JI (2018) Racial/Ethnic and Gender Disparities in Health Care Use and Access. Health Serv Res 53(3):1407–1429

Nadeem MF, Kaiser LR (2022) Disparities in Health Care Delivery Systems. Thorac Surg Clin 32(1):13–21

Wallace J et al (2021) Changes in Racial and Ethnic Disparities in Access to Care and Health Among US Adults at Age 65 Years. JAMA Intern Med 181(9):1207–1215


Churchwell K et al (2020) Call to Action: Structural Racism as a Fundamental Driver of Health Disparities: A Presidential Advisory From the American Heart Association. Circulation 142(24):e454–e468

Gomez SE et al (2023) Racial, ethnic, and sex disparities in atrial fibrillation management: rate and rhythm control. J Interv Card Electrophysiol 66(5):1279–1290

Tamirisa KP et al (2021) Racial and Ethnic Differences in the Management of Atrial Fibrillation. CJC Open 3(12 Suppl):S137–S148

Eberly LA et al (2021) Racial/Ethnic and Socioeconomic Disparities in Management of Incident Paroxysmal Atrial Fibrillation. JAMA Netw Open 4(2):e210247

Ugowe FE, Jackson LR, Thomas KL (2018) Racial and ethnic differences in the prevalence, management, and outcomes in patients with atrial fibrillation: A systematic review. Heart Rhythm 15(9):1337–1345

Xu Y, Siegal DM, Anand SS (2021) Ethnoracial variations in venous thrombosis: Implications for management, and a call to action. J Thromb Haemost 19(1):30–40

Erinne I et al (2021) Racial disparities in the treatment of aortic stenosis: Has transcatheter aortic valve replacement bridged the gap? Catheter Cardiovasc Interv 98(1):148–156

Ali A et al (2022) Racial and Ethnic Disparities in the Use of Transcatheter Aortic Valve Replacement in the State of Connecticut. Cardiovasc Revasc Med 37:7–12

Pienta MJ et al (2023) Racial disparities in mitral valve surgery: A statewide analysis. J Thorac Cardiovasc Surg 165(5):1815-1823.e8

Alkhouli M et al (2019) Racial Disparities in the Utilization and Outcomes of Structural Heart Disease Interventions in the United States. J Am Heart Assoc 8(15):e012125

Heckbert SR et al (2020) Differences by Race/Ethnicity in the Prevalence of Clinically Detected and Monitor-Detected Atrial Fibrillation: MESA. Circ Arrhythm Electrophysiol 13(1):e007698

Zakai NA et al (2014) Racial and regional differences in venous thromboembolism in the United States in 3 cohorts. Circulation 129(14):1502–1509

Lamprea-Montealegre JA et al (2021) Valvular Heart Disease in Relation to Race and Ethnicity: JACC Focus Seminar 4/9. J Am Coll Cardiol 78(24):2493–2504

Tamargo J et al (2022) Racial and ethnic differences in pharmacotherapy to prevent coronary artery disease and thrombotic events. Eur Heart J Cardiovasc Pharmacother 8(7):738–751

Kanuri SH, Kreutz RP (2019) Pharmacogenomics of Novel Direct Oral Anticoagulants: Newly Identified Genes and Genetic Variants. J Pers Med 9(1):7

January CT, Wann LS, Calkins H, Chen LY, Cigarroa JE, Cleveland JC Jr, Ellinor PT, Ezekowitz MD, Field ME, Furie KL, Heidenreich PA, Murray KT, Shea JB, Tracy CM, Yancy CW (2019) 2019 AHA/ACC/HRS Focused Update of the 2014 AHA/ACC/HRS Guideline for the Management of Patients With Atrial Fibrillation. Circulation 140(2):e125–e151. https://doi.org/10.1161/CIR.0000000000000665

Ortel TL et al (2020) American Society of Hematology 2020 guidelines for management of venous thromboembolism: treatment of deep vein thrombosis and pulmonary embolism. Blood Adv 4(19):4693–4738


Stevens SM et al (2021) Antithrombotic Therapy for VTE Disease: Second Update of the CHEST Guideline and Expert Panel Report. Chest 160(6):e545–e608


Schünemann HJ et al (2018) American Society of Hematology 2018 guidelines for management of venous thromboembolism: prophylaxis for hospitalized and nonhospitalized medical patients. Blood Adv 2(22):3198–3225

Otto CM et al (2021) 2020 ACC/AHA Guideline for the Management of Patients With Valvular Heart Disease: A Report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation 143(5):e72–e227


National Action Plan for Adverse Drug Event Prevention (2014) Office of Disease Prevention and Health Promotion. US Department of Health and Human Services, Washington, DC


Bhakta S et al (2022) A systematic review and meta-analysis of racial disparities in deep vein thrombosis and pulmonary embolism events in patients hospitalized with coronavirus disease 2019. J Vasc Surg Venous Lymphat Disord 10(4):939-944.e3

Corbie-Smith G et al (2008) Conceptualizing race in research. J Natl Med Assoc 100(10):1235–1243

Vazquez SR, Yates NYY, McFarland MM (2022) Differences in anticoagulant care delivery according to ethnoracial group in the United States: a scoping review protocol. Open Science Framework osf.io/eydsa. https://doi.org/10.17605/OSF.IO/9SE7H

Peters MDJ, Godfrey C, McInerney P, Munn Z, Tricco AC, Khalil H (2020) Chapter 11: Scoping reviews. In: Aromataris E, Munn Z (eds) Joanna Briggs Institute reviewer's manual. Joanna Briggs Institute

Arksey H, O’Malley L (2005) Scoping studies: towards a methodological framework. Int J Soc Res Methodol 8(1):19–32


Rethlefsen ML et al (2021) PRISMA-S: an extension to the PRISMA Statement for Reporting Literature Searches in Systematic Reviews. Syst Rev 10(1):39

Harris PA et al (2009) Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 42(2):377–381

McGowan J et al (2016) PRESS Peer Review of Electronic Search Strategies: 2015 Guideline Statement. J Clin Epidemiol 75:40–46

Kearon C et al (2016) Antithrombotic Therapy for VTE Disease: CHEST Guideline and Expert Panel Report. Chest 149(2):315–352

Lip GYH et al (2018) Antithrombotic Therapy for Atrial Fibrillation: CHEST Guideline and Expert Panel Report. Chest 154(5):1121–1201

Al Taii H, Al-Kindi S (2019) Racial disparities in prescriptions of oral anticoagulants among patients with atrial fibrillation. In: 24th International Atrial Fibrillation Symposium, Boston, MA, USA

Bhave PD et al (2015) Race- and sex-related differences in care for patients newly diagnosed with atrial fibrillation. Heart Rhythm 12(7):1406–1412

Chapman SA et al (2017) Adherence to treatment guidelines: the association between stroke risk stratified comparing CHADS. BMC Health Serv Res 17(1):127

Chen Q et al (2021) Prevalence and the factors associated with oral anticoagulant use among nursing home residents. J Clin Pharm Ther 46(6):1714–1728

Deitelzweig S et al (2020) Utilization of anticoagulants and predictors of treatment among hospitalized patients with atrial fibrillation in the USA. J Med Econ 23(12):1389–1400

Essien UR et al (2018) Association of Race/Ethnicity With Oral Anticoagulant Use in Patients With Atrial Fibrillation: Findings From the Outcomes Registry for Better Informed Treatment of Atrial Fibrillation II. JAMA Cardiol 3(12):1174–1182

Essien UR et al (2021) Disparities in Anticoagulant Therapy Initiation for Incident Atrial Fibrillation by Race/Ethnicity Among Patients in the Veterans Health Administration System. JAMA Netw Open 4(7):e2114234

Essien UR, Kim N, Hausmann LR, Mor MK, Good C, Gellad WF, Fine MJ (2020) Abstract: racial and ethnic disparities in anticoagulant choice for atrial fibrillation in the veterans health administration: results from the REACH-AF study. J Gen Intern Med 35(Suppl 1):S250–S251

Essien UR et al (2022) Association of Race and Ethnicity and Anticoagulation in Patients With Atrial Fibrillation Dually Enrolled in Veterans Health Administration and Medicare: Effects of Medicare Part D on Prescribing Disparities. Circ Cardiovasc Qual Outcomes 15(2):e008389

Essien UR et al (2020) Race/Ethnicity and Sex-Related Differences in Direct Oral Anticoagulant Initiation in Newly Diagnosed Atrial Fibrillation: A Retrospective Study of Medicare Data. J Natl Med Assoc 112(1):103–108

Fohtung RB, Novak E, Rich MW (2017) Effect of New Oral Anticoagulants on Prescribing Practices for Atrial Fibrillation in Older Adults. J Am Geriatr Soc 65(11):2405–2412

Lewis WR et al (2011) Improvement in use of anticoagulation therapy in patients with ischemic stroke: results from Get With The Guidelines-Stroke. Am Heart J 162(4):692-699.e2

Martin Diaz C et al (2021) Anticoagulation After Ischemic Stroke or Transient Ischemic Attack (TIA) in the Time of Direct Oral Anticoagulation (DOAC) and Thrombectomy. Cureus 13(8):e17392


Mentias A et al (2021) Racial and Sex Disparities in Anticoagulation After Electrical Cardioversion for Atrial Fibrillation and Flutter. J Am Heart Assoc 10(17):e021674

Obi CA et al (2022) Examination of anticoagulation prescription among elderly patients with atrial fibrillation after in-hospital fall. J Thromb Thrombolysis 53(3):683–689

Piccini JP et al (2019) Adherence to Guideline-Directed Stroke Prevention Therapy for Atrial Fibrillation Is Achievable. Circulation 139(12):1497–1506

Raji MA et al (2013) National utilization patterns of warfarin use in older patients with atrial fibrillation: a population-based study of Medicare Part D beneficiaries. Ann Pharmacother 47(1):35–42

Schwartz SM et al (2019) Discriminative Ability of CHA. Am J Cardiol 123(12):1949–1954

Shahid I et al (2021) Meta-Analysis of Racial Disparity in Utilization of Oral Anticoagulation for Stroke Prevention in Atrial Fibrillation. Am J Cardiol 153:147–149

Tedla YG et al (2020) Racial Disparity in the Prescription of Anticoagulants and Risk of Stroke and Bleeding in Atrial Fibrillation Patients. J Stroke Cerebrovasc Dis 29(5):104718

Thomas KL et al (2013) Racial differences in the prevalence and outcomes of atrial fibrillation among patients hospitalized with heart failure. J Am Heart Assoc 2(5):e000200

Waddy SP et al (2020) Racial/Ethnic Disparities in Atrial Fibrillation Treatment and Outcomes among Dialysis Patients in the United States. J Am Soc Nephrol 31(3):637–649

Wetmore JB et al (2019) Relation of Race, Apparent Disability, and Stroke Risk With Warfarin Prescribing for Atrial Fibrillation in Patients Receiving Maintenance Hemodialysis. Am J Cardiol 123(4):598–604

Winkelmayer WC et al (2012) Prevalence of atrial fibrillation and warfarin use in older patients receiving hemodialysis. J Nephrol 25(3):341–353

Zahuranec DB et al (2017) Stroke Quality Measures in Mexican Americans and Non-Hispanic Whites. J Health Dispar Res Pract 10(1):111–123

Addo-Tabiri NO et al (2020) Black Patients Experience Highest Rates of Cancer-associated Venous Thromboembolism. Am J Clin Oncol 43(2):94–100

Freeman AH et al (2016) Venous thromboembolism following minimally invasive surgery among women with endometrial cancer. Gynecol Oncol 142(2):267–272

Friedman AM et al (2013) Underuse of postcesarean thromboembolism prophylaxis. Obstet Gynecol 122(6):1197–1204

Lau BD et al (2015) Eliminating Health Care Disparities With Mandatory Clinical Decision Support: The Venous Thromboembolism (VTE) Example. Med Care 53(1):18–24

Owodunni OP et al (2020) Using electronic health record system triggers to target delivery of a patient-centered intervention to improve venous thromboembolism prevention for hospitalized patients: Is there a differential effect by race? PLoS ONE 15(1):e0227339

Shah S et al (2021) Implementation of an Anticoagulation Practice Guideline for COVID-19 via a Clinical Decision Support System in a Large Academic Health System and Its Evaluation: Observational Study. JMIR Med Inform 9(11):e30743

Steinberg BA et al (2013) Early adoption of dabigatran and its dosing in US patients with atrial fibrillation: results from the outcomes registry for better informed treatment of atrial fibrillation. J Am Heart Assoc 2(6):e000535

Vaughan Sarrazin MS et al (2014) Bleeding rates in Veterans Affairs patients with atrial fibrillation who switch from warfarin to dabigatran. Am J Med 127(12):1179–1185

Nathan AS et al (2019) Racial, Ethnic, and Socioeconomic Inequities in the Prescription of Direct Oral Anticoagulants in Patients With Venous Thromboembolism in the United States. Circ Cardiovasc Qual Outcomes 12(4):e005600

Singh BP, Sitlinger A, Saber I, Thames E, Beckman M, Reyes N, Schulteis RD, Ortel T (2016) Direct oral anticoagulant usage in the treatment of venous thromboembolism across racial groups in Durham County, NC. J Gen Int Med 31:S191–S192

Schaefer JK et al (2017) Sociodemographic factors in patients continuing warfarin vs those transitioning to direct oral anticoagulants. Blood Adv 1(26):2536–2540

Lank RJ et al (2019) Ethnic Differences in 90-Day Poststroke Medication Adherence. Stroke 50(6):1519–1524

O'Brien EC, Holmes ND, Thomas L, Fonarow GC, Kowey PR, Ansell JE, Mahaffey KW, Gersh BJ, Peterson ED, Piccini JP, Hylek EM (2018) J Am Heart Assoc 7(12):e006391

Singh RR et al (2016) Adherence to Anticoagulant Therapy in Pediatric Patients Hospitalized With Pulmonary Embolism or Deep Vein Thrombosis: A Retrospective Cohort Study. Clin Appl Thromb Hemost 22(3):260–264

Yong C, Xu XY, Than C, Ullal A, Schmitt S, Azarbal F, Heidenreich P, Turakhia M (2013) Abstract 14134: racial disparities in warfarin time in INR therapeutic range in patients with atrial fibrillation: Findings from the TREAT-AF Study. Circul 128(22).  https://www.ahajournals.org/doi/10.1161/circ.128.suppl_22.A14134

Chen N, Brooks MM, Hernandez I (2020) Latent Classes of Adherence to Oral Anticoagulation Therapy Among Patients With a New Diagnosis of Atrial Fibrillation. JAMA Netw Open 3(2):e1921357

Pham Nguyen TP et al (2003) Does hospitalization for thromboembolism improve oral anticoagulant adherence in patients with atrial fibrillation? J Am Pharm Assoc 60(6):986-992.e2

Patel AA et al (2013) Persistence of warfarin therapy for residents in long-term care who have atrial fibrillation. Clin Ther 35(11):1794–1804

Golwala H et al (2016) Racial/ethnic differences in atrial fibrillation symptoms, treatment patterns, and outcomes: Insights from Outcomes Registry for Better Informed Treatment for Atrial Fibrillation Registry. Am Heart J 174:29–36

Limdi NA et al (2017) Quality of anticoagulation control and hemorrhage risk among African American and European American warfarin users. Pharmacogenet Genomics 27(10):347–355

Lip GY et al (2016) Determinants of Time in Therapeutic Range in Patients Receiving Oral Anticoagulants (A Substudy of IMPACT). Am J Cardiol 118(11):1680–1684

Rao SR et al (2015) Explaining racial disparities in anticoagulation control: results from a study of patients at the Veterans Administration. Am J Med Qual 30(3):214–222

Thigpen JL, Yah Q, Beasley M, Limdi NA (2012) Abstract: racial differences in anticoagulation control and risk of hemorrhage among warfarin users. Pharmacoth 32(10):E189

Yong C et al (2016) Racial Differences in Quality of Anticoagulation Therapy for Atrial Fibrillation (from the TREAT-AF Study). Am J Cardiol 117(1):61–68

Moffett BS, Kim S, Bomgaars LR (2013) Readmissions for warfarin-related bleeding in pediatric patients after hospital discharge. Pediatr Blood Cancer 60(9):1503–1506

Rose AJ et al (2013) Gaps in monitoring during oral anticoagulation: insights into care transitions, monitoring barriers, and medication nonadherence. Chest 143(3):751–757

Mahle WT et al (2011) Management of warfarin in children with heart disease. Pediatr Cardiol 32(8):1115–1119

Meade M et al (2011) Impact of health disparities on staff workload in pharmacist-managed anticoagulation clinics. Am J Health Syst Pharm 68(15):1430–1435

Aguilar F et al (2021) Off-label direct oral anticoagulants dosing in atrial fibrillation and venous thromboembolism is associated with higher mortality. Expert Rev Cardiovasc Ther 19(12):1119–1126

Jellinek-Cohen SP, Li M, Husk G (2018) Enoxaparin dosing errors in the emergency department. World J Emerg Med 9(3):195–202

Triller DM et al (2015) Trends in Warfarin Monitoring Practices Among New York Medicare Beneficiaries, 2006–2011. J Community Health 40(5):845–854

Akhtar T et al (2020) Factors associated with bleeding events in patients on rivaroxaban for non-valvular atrial fibrillation: A real-world experience. Int J Cardiol 320:78–82

Cires-Drouet RS et al (2022) Prevalence and clinical outcomes of hospitalized patients with upper extremity deep vein thrombosis. J Vasc Surg Venous Lymphat Disord 10(1):102–110

Doucette K et al (2020) Efficacy and Safety of Direct-Acting Oral Anticoagulants (DOACs) in the Overweight and Obese. Adv Hematol 2020:3890706

Gu K et al (2021) Racial disparities among Asian Americans with atrial fibrillation: An analysis from the NCDR® PINNACLE Registry. Int J Cardiol 329:209–216

Yamashita Y et al (2018) Asian patients versus non-Asian patients in the efficacy and safety of direct oral anticoagulants relative to vitamin K antagonist for venous thromboembolism: A systemic review and meta-analysis. Thromb Res 166:37–42

Di Nisio M et al (2016) Risk of major bleeding in patients with venous thromboembolism treated with rivaroxaban or with heparin and vitamin K antagonists. Thromb Haemost 115(2):424–432

Hernandez I et al (2015) Risk of bleeding with dabigatran in atrial fibrillation. JAMA Intern Med 175(1):18–24

Majeed A et al (2016) Bleeding events with dabigatran or warfarin in patients with venous thromboembolism. Thromb Haemost 115(2):291–298

Hankey GJ et al (2014) Intracranial hemorrhage among patients with atrial fibrillation anticoagulated with warfarin or rivaroxaban: the rivaroxaban once daily, oral, direct factor Xa inhibition compared with vitamin K antagonism for prevention of stroke and embolism trial in atrial fibrillation. Stroke 45(5):1304–1312

Kobayashi L et al (2017) Novel oral anticoagulants and trauma: The results of a prospective American Association for the Surgery of Trauma Multi-Institutional Trial. J Trauma Acute Care Surg 82(5):827–835

Gencer B et al (2022) Edoxaban versus Warfarin in high-risk patients with atrial fibrillation: A comprehensive analysis of high-risk subgroups. Am Heart J 247:24–32

Chen N et al (2021) Joint Latent Class Analysis of Oral Anticoagulation Use and Risk of Stroke or Systemic Thromboembolism in Patients with Atrial Fibrillation. Am J Cardiovasc Drugs 21(5):573–580

Kabra R et al (2015) Effect of race on outcomes (stroke and death) in patients >65 years with atrial fibrillation. Am J Cardiol 116(2):230–235

Kim MH, Xu L, Puckrein G (2018) Patient Diversity and Population Health-Related Cardiovascular Outcomes Associated with Warfarin Use in Atrial Fibrillation: An Analysis Using Administrative Claims Data. Adv Ther 35(11):2069–2080

Abu-Zeinah G, Oromendia C, DeSancho MT (2019) Thrombotic risk factors in patients with antiphospholipid syndrome: a single center experience. J Thromb Thrombolysis 48(2):233–239

Banala SR et al (2017) Discharge or admit? Emergency department management of incidental pulmonary embolism in patients with cancer: a retrospective study. Int J Emerg Med 10(1):19

LaDuke ZJ et al (2019) Association of mortality among trauma patients taking preinjury direct oral anticoagulants versus vitamin K antagonists. Surgery 166(4):564–571

Moreland CJ et al (2013) Anticoagulation education: do patients understand potential medication-related emergencies? Jt Comm J Qual Patient Saf 39(1):22–31

Bamgbade BA et al (2021) Differences in Perceived and Predicted Bleeding Risk in Older Adults With Atrial Fibrillation: The SAGE-AF Study. J Am Heart Assoc 10(17):e019979

Webb D et al (2019) Patient Satisfaction With Venous Thromboembolism Treatment. Clin Appl Thromb Hemost 25:1076029619864663

Kamath CC et al (2021) Cost Conversations About Anticoagulation Between Patients With Atrial Fibrillation and Their Clinicians: A Secondary Analysis of a Randomized Clinical Trial. JAMA Netw Open 4(7):e2116009

U.S. Census Bureau . [cited 2023 10/20/23]; Available from: https://www.census.gov/quickfacts/fact/note/US/RHI625222#:~:text=Asian.,Japan%2C%20Korea%2C%20or%20Vietnam .

Fain KM et al (2021) Race and ethnicity reporting for clinical trials in ClinicalTrials.gov and publications. Contemp Clin Trials 101:106237

American Psychological Association Ethnic and racial minorities & socioeconomic status. 3/22/24]; Available from: https://www.apa.org/pi/ses/resources/publications/minorities#:~:text=The%20relationship%20between%20SES%2C%20race,SES%2C%20race%2C%20and%20ethnicity . Accessed 22 Mar 2024

Braveman PA et al (2010) Socioeconomic disparities in health in the United States: what the patterns tell us. Am J Public Health 100 Suppl 1(Suppl 1):186–96

Wheelock KM et al (2021) Clinician Trends in Prescribing Direct Oral Anticoagulants for US Medicare Beneficiaries. JAMA Netw Open 4(12):e2137288

Barnes GD et al (2015) National Trends in Ambulatory Oral Anticoagulant Use. Am J Med 128(12):1300–5.e2

Iyer GS et al (2023) Trends in the Use of Oral Anticoagulants for Adults With Venous Thromboembolism in the US, 2010–2020. JAMA Netw Open 6(3):e234059

MACRA Medicare Access and CHIP Reauthorization Act of 2015 (MACRA). 03/19/24]; Available from: https://www.cms.gov/medicare/quality/value-based-programs/chip-reauthorization-act . Accessed 19 Mar 2024

Blueprint Measure Lifecycle Overview. 03/19/2024]; Available from: https://mmshub.cms.gov/blueprint-measure-lifecycle-overview . Accessed 19 Mar 2024

Endorsement and Maintenance (E&M) Guidebook. 3/19/24]; Available from: https://p4qm.org/sites/default/files/2023-12/Del-3-6-Endorsement-and-Maintenance-Guidebook-Final_0_0.pdf . Accessed 19 Mar 2024

Measure Implementation. 3/19/24]; Available from: https://mmshub.cms.gov/measure-lifecycle/measure-implementation/selection . Accessed 19 Mar 2024

Download references


Acknowledgements

The authors wish to acknowledge the following individuals for their work in screening articles for this scoping review: April Allen, PharmD, CACP; Allison Burnett, PharmD, PhC, CACP; Stacy Ellsworth, RN, MSN, CCRC; Danielle Jenkins, MBA, RN, BSN, CRNI; Amanda Katz, MBA; Lea Kistenmacher; Julia Mulheman, PharmD; Surhabi Palkimas, PharmD, MBA; Terri Schnurr, RN, CCRC; Deborah Siegal, MD, MSc, FRCPC; Kimberly Terry, PharmD, BCPS, BCCCP; and Terri Wiggins, MS.

The authors wish to acknowledge the support of the Anticoagulation Forum in the development of this manuscript. The Anticoagulation Forum is a non-profit organization dedicated to improving the quality of care for patients taking antithrombotic medications.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

University of Utah Health Thrombosis Service, 6056 Fashion Square Drive, Suite 1200, Murray, UT, 84107, USA

Sara R. Vazquez

Kaiser Permanente Clinical Pharmacy Services, 200 Crescent Center Pkwy, Tucker, GA, 30084, USA

Naomi Y. Yates

Anticoagulation Forum, Inc, 17 Lincoln Street, Suite 2B, Newton, MA, 02461, USA

Craig J. Beavers & Darren M. Triller

University of Kentucky College of Pharmacy, 789 S Limestone, Lexington, KY, 40508, USA

Craig J. Beavers

University of Utah Spencer S. Eccles Health Sciences Library, 10 N 1900 E, Salt Lake City, UT, 84112, USA

Mary M. McFarland



Contributions

All authors contributed to the study conception and design. Material preparation was performed by Sara Vazquez, Naomi Yates, and Mary McFarland. Data collection and analysis were performed by Sara Vazquez, Naomi Yates, Craig Beavers, and Darren Triller. The first draft of the manuscript was written by Sara Vazquez and all authors edited subsequent drafts. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sara R. Vazquez.

Ethics declarations

Competing interests

Dr. Vazquez discloses that she is a member of the Anticoagulation Forum Advisory Council and an editorial consultant for UptoDate, Inc.

Dr. Yates has no conflicts of interest to disclose.

Dr. Beavers has no conflicts of interest to disclose.

Dr. Triller has no conflicts of interest to disclose.

Ms. McFarland has no conflicts of interest to disclose.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 22 KB)

Supplementary file2 (DOCX 14 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Vazquez, S.R., Yates, N.Y., Beavers, C.J. et al. Differences in quality of anticoagulation care delivery according to ethnoracial group in the United States: A scoping review. J Thromb Thrombolysis (2024). https://doi.org/10.1007/s11239-024-02991-2

Accepted : 27 April 2024

Published : 11 May 2024

DOI : https://doi.org/10.1007/s11239-024-02991-2


Keywords

  • Ethnoracial
  • Disparities
  • Anticoagulation

