Indian J Pharmacol, vol. 44(2), Mar–Apr 2012

Data management in clinical research: An overview

Binny Krishnankutty, Shantala Bellary, Naveen B. R. Kumar, Latha S. Moodahadu

Global Medical Affairs, Dr. Reddy's Laboratories Ltd., Ameerpet, Hyderabad, India

Clinical Data Management (CDM) is a critical phase in clinical research that leads to the generation of high-quality, reliable, and statistically sound data from clinical trials, and it helps to reduce substantially the time from drug development to marketing. CDM team members are actively involved in all stages of a clinical trial, from inception to completion, and should have the process knowledge needed to maintain the quality standards of CDM processes. Various procedures in CDM, including Case Report Form (CRF) designing, CRF annotation, database designing, data entry, data validation, discrepancy management, medical coding, data extraction, and database locking, are assessed for quality at regular intervals during a trial. In the present scenario, there is an increased demand to improve CDM standards to meet regulatory requirements and to stay ahead of the competition through faster commercialization of products. With the implementation of regulatory-compliant data management tools, the CDM team can meet these demands. Additionally, it is becoming mandatory for companies to submit data electronically. CDM professionals should meet appropriate expectations, set standards for data quality, and adapt to rapidly changing technology. This article highlights the processes involved in CDM and provides the reader an overview of the tools and standards adopted, as well as the roles and responsibilities in CDM.

Introduction

A clinical trial is intended to answer a research question by generating data to prove or disprove a hypothesis, and the quality of the data generated plays an important role in the outcome of the study. Research students often ask, “What is Clinical Data Management (CDM) and what is its significance?” CDM is a relevant and important part of a clinical trial: knowingly or unknowingly, all researchers carry out CDM activities during their research work, even without recognizing the technical phases involved. This article highlights the processes involved in CDM and gives the reader an overview of how data are managed in clinical trials.

CDM is the process of collection, cleaning, and management of subject data in compliance with regulatory standards. The primary objective of CDM processes is to provide high-quality data by keeping the number of errors and missing values as low as possible while gathering the maximum data for analysis.[1] To meet this objective, best practices are adopted to ensure that data are complete, reliable, and processed correctly. This has been facilitated by the use of software applications that maintain an audit trail and provide easy identification and resolution of data discrepancies. Sophisticated innovations[2] have enabled CDM to handle large trials and ensure data quality even in complex trials.

How do we define ‘high-quality’ data? High-quality data should be absolutely accurate and suitable for statistical analysis: it should meet the protocol-specified parameters and comply with the protocol requirements. This implies that, in case of a deviation from the protocol specifications, we may consider excluding the patient from the final database; it should be borne in mind, however, that in some situations regulatory authorities may be interested in looking at such data. Similarly, missing data are a matter of concern for clinical researchers, and high-quality data should have minimal or no missing values. Most importantly, high-quality data should possess only an ‘acceptable level of variation’ that does not affect the conclusions of the study on statistical analysis. The data should also meet the applicable regulatory requirements specified for data quality.

Tools for CDM

Many software tools are available for data management; these are called Clinical Data Management Systems (CDMS). In multicentric trials, a CDMS has become essential to handle the huge amount of data. Most of the CDMS used in pharmaceutical companies are commercial, but a few open source tools are available as well. Commonly used commercial CDM tools are ORACLE CLINICAL, CLINTRIAL, MACRO, RAVE, and eClinical Suite. In terms of functionality, these tools are more or less similar, with no significant advantage of one system over another; they are expensive and need sophisticated Information Technology infrastructure to function. Additionally, some multinational pharmaceutical giants use custom-made CDMS tools to suit their operational needs and procedures. Among the open source tools, the most prominent are OpenClinica, openCDMS, TrialDB, and PhOSCo. These packages can be downloaded free of cost from their respective websites and are as good as their commercial counterparts in terms of functionality.

In regulatory submission studies, maintaining an audit trail of data management activities is of paramount importance. These CDM tools ensure the audit trail and help in the management of discrepancies. According to the roles and responsibilities (explained later), multiple user IDs can be created with access limited to data entry, medical coding, database designing, or quality check. This ensures that each user can access only the functionalities allotted to that user ID and cannot make any other change in the database. For roles where changes to the data are permitted, the software records the change made, the user ID that made the change, and the date and time of the change, for audit purposes (the audit trail). During a regulatory audit, the auditors can verify the discrepancy management process and the changes made, and can confirm that no unauthorized or false changes were made.

Regulations, Guidelines, and Standards in CDM

Akin to other areas in clinical research, CDM has guidelines and standards that must be followed. Since the pharmaceutical industry relies on electronically captured data for the evaluation of medicines, there is a need to follow good practices in CDM and maintain standards in electronic data capture. These electronic records have to comply with the Code of Federal Regulations (CFR), 21 CFR Part 11. This regulation is applicable to records in electronic format that are created, modified, maintained, archived, retrieved, or transmitted. It demands the use of validated systems to ensure accuracy, reliability, and consistency of data, with secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records.[3] Adequate procedures and controls should be put in place to ensure the integrity, authenticity, and confidentiality of data. If data are to be submitted to regulatory authorities, they should be entered and processed in 21 CFR Part 11-compliant systems. Most available CDM systems comply with this regulation, and pharmaceutical companies as well as contract research organizations ensure this compliance.
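To make the audit-trail requirement concrete, the R sketch below shows the kind of fields a 21 CFR Part 11-style audit record captures for a single data change. This is a minimal illustration; the field names and values are invented and not taken from any particular CDMS.

```r
# Minimal sketch of an audit-trail entry for one data change.
# All field names and values are hypothetical.
audit_entry <- data.frame(
  record_id  = 1024L,                                   # database row affected
  field      = "weight_kg",                             # variable changed
  old_value  = "72.50",
  new_value  = "75.20",
  changed_by = "user_dm_02",                            # user ID making the change
  changed_at = as.POSIXct("2012-03-14 10:42:00", tz = "UTC"),
  reason     = "DCF #118: correction confirmed by site"
)
```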

Society for Clinical Data Management (SCDM) publishes the Good Clinical Data Management Practices (GCDMP) guidelines, a document providing the standards of good practice within CDM. GCDMP was initially published in September 2000 and has undergone several revisions thereafter. The July 2009 version is the currently followed GCDMP document. GCDMP provides guidance on the accepted practices in CDM that are consistent with regulatory practices. Addressed in 20 chapters, it covers the CDM process by highlighting the minimum standards and best practices.

Clinical Data Interchange Standards Consortium (CDISC), a multidisciplinary non-profit organization, has developed standards to support the acquisition, exchange, submission, and archival of clinical research data and metadata. Metadata is data about the data entered: it includes information about the individual who made the entry or a change in the clinical data, the date and time of the entry/change, and the details of the changes made. Among the standards, two important ones are the Study Data Tabulation Model Implementation Guide for Human Clinical Trials (SDTMIG) and the Clinical Data Acquisition Standards Harmonization (CDASH) standards, available free of cost from the CDISC website ( www.cdisc.org ). The SDTMIG standard[4] describes the details of the model and standard terminologies for the data and serves as a guide to the organization. CDASH v 1.1[5] defines the basic standards for the collection of data in a clinical trial and enlists the basic data information needed from a clinical, regulatory, and scientific perspective.

The CDM Process

The CDM process, like a clinical trial, begins with the end in mind. This means that the whole process is designed keeping the deliverable in view. As a clinical trial is designed to answer the research question, the CDM process is designed to deliver an error-free, valid, and statistically sound database. To meet this objective, the CDM process starts early, even before the finalization of the study protocol.

Review and finalization of study documents

The protocol is reviewed from a database designing perspective, for clarity and consistency. During this review, the CDM personnel identify the data items to be collected and the frequency of collection with respect to the visit schedule. A Case Report Form (CRF) is designed by the CDM team, as this is the first step in translating the protocol-specific activities into data being generated. The data fields should be clearly defined and consistent throughout, and the type of data to be entered should be evident from the CRF. For example, if weight has to be captured to two decimal places, the data entry field should have two data boxes placed after the decimal, as shown in Figure 1. Similarly, the units in which measurements are to be made should be mentioned next to the data field. The CRF should be concise, self-explanatory, and user-friendly. Along with the CRF, filling instructions (called CRF Completion Guidelines) should be provided to study investigators for error-free data acquisition. CRF annotation is then done, wherein each variable is named according to the SDTMIG or the conventions followed internally. Annotations are coded terms used in CDM tools to indicate the variables in the study; an example of an annotated CRF is provided in Figure 1. In questions with discrete value options (such as the variable gender, with the values male and female as responses), all possible options are coded appropriately.
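As a minimal sketch of such coding, the R snippet below represents a hypothetical discrete value group (DVG) for the gender variable; the numeric codes, labels, and variable names are illustrative and not drawn from any specific CDMS.

```r
# Hypothetical discrete value group (DVG) for a gender field.
gender_dvg <- data.frame(
  code  = c(1L, 2L),
  label = c("Male", "Female")
)

# A CRF response is stored as the code and decoded for review or reporting:
response_code <- 2L
gender_dvg$label[match(response_code, gender_dvg$code)]  # returns "Female"
```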

Figure 1. Annotated sample of a Case Report Form (CRF). Annotations are entered in coloured text to differentiate them from the CRF questions. DCM = Data collection module, DVG = Discrete value group, YNNA [S1] = Yes, No, Not applicable [subset 1], C = Character, N = Numerical, DT = Date format. For example, BRTHDTC [DT] indicates date of birth in the date format.

Based on these, a Data Management Plan (DMP) is developed. The DMP document is a road map for handling the data under foreseeable circumstances and describes the CDM activities to be followed in the trial. A list of CDM activities is provided in Table 1. The DMP describes the database design, data entry and data tracking guidelines, quality control measures, SAE reconciliation guidelines, discrepancy management, data transfer/extraction, and database locking guidelines. Along with the DMP, a Data Validation Plan (DVP), containing all edit checks to be performed and the calculations for derived variables, is also prepared. The edit check programs in the DVP help in cleaning the data by identifying discrepancies.

Table 1. List of clinical data management activities

Database designing

Databases are the clinical software applications built to facilitate CDM tasks across multiple studies.[6] Generally, these tools have built-in compliance with regulatory requirements and are easy to use. “System validation” is conducted to ensure data security, during which system specifications,[7] user requirements, and regulatory compliance are evaluated before implementation. Study details like objectives, intervals, visits, investigators, sites, and patients are defined in the database, and CRF layouts are designed for data entry. These entry screens are tested with dummy data before being moved to real data capture.

Data collection

Data collection is done using a CRF that may exist in paper or electronic form. The traditional method is to employ paper CRFs to collect the data responses, which are transcribed into the database by means of in-house data entry. These paper CRFs are filled in by the investigator according to the completion guidelines. In e-CRF-based CDM, the investigator or a designee logs into the CDM system and enters the data directly at the site. In the e-CRF method, the chance of error is lower and discrepancies are resolved faster. Since pharmaceutical companies try to reduce the time taken for drug development by enhancing the speed of the processes involved, many are opting for e-CRFs (also called remote data entry).

CRF tracking

The entries made in the CRF are monitored by the Clinical Research Associate (CRA) for completeness, and completed CRFs are retrieved and handed over to the CDM team, which tracks the retrieved CRFs and maintains their record. CRFs are manually tracked for missing pages and illegible data to ensure that data are not lost. In case of missing or illegible data, a clarification is obtained from the investigator and the issue is resolved.

Data entry

Data entry takes place according to the guidelines prepared along with the DMP. This is applicable only in the case of paper CRFs retrieved from the sites. Usually, double data entry is performed, wherein the data are entered by two operators separately.[8] The second-pass entry (entry made by the second person) helps in verification and reconciliation by identifying transcription errors and discrepancies caused by illegible data. Double data entry thus yields a cleaner database compared with single data entry; earlier studies have shown that it ensures better consistency with the paper CRF, as denoted by a lower error rate.[9]
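A minimal sketch of second-pass verification in R, assuming the two passes are stored as separate tables; the data and column names are illustrative.

```r
# Flag fields where the two independent entries disagree, so they can be
# checked against the paper CRF. Data and column names are illustrative.
first_pass  <- data.frame(subj = c(101, 102, 103), weight = c(72.5, 80.1, 65.0))
second_pass <- data.frame(subj = c(101, 102, 103), weight = c(72.5, 80.7, 65.0))

merged <- merge(first_pass, second_pass, by = "subj", suffixes = c("_1", "_2"))
merged[merged$weight_1 != merged$weight_2, ]  # subject 102 needs reconciliation
```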

Data validation

Data validation is the process of testing the validity of data in accordance with the protocol specifications. Edit check programs are written to identify discrepancies in the entered data; these programs are embedded in the database to ensure data validity and are written according to the logic conditions specified in the DVP. The edit check programs are initially tested with dummy data containing discrepancies. A discrepancy is defined as a data point that fails to pass a validation check; it may be due to inconsistent data, missing data, range check failures, or deviations from the protocol. In e-CRF-based studies, the data validation process is run frequently to identify discrepancies, which the investigators resolve after logging into the system. Ongoing quality control of data processing is undertaken at regular intervals during the course of CDM. For example, if the inclusion criteria specify that the age of the patient should be between 18 and 65 years (both inclusive), an edit program is written for two conditions, viz. age <18 and >65; if either condition is TRUE for any patient, a discrepancy is generated. These discrepancies are highlighted in the system, and Data Clarification Forms (DCFs), documents containing queries pertaining to the identified discrepancies, can be generated.
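The age edit check described above can be sketched in R as follows; the data and column names are illustrative, and a production edit check would live inside the CDMS rather than in standalone code.

```r
# Sketch of the age edit check (inclusion range 18-65, both inclusive).
subjects <- data.frame(subj = c(201, 202, 203), age = c(34, 17, 70))

# A discrepancy is raised when either condition is TRUE:
flagged <- subjects[subjects$age < 18 | subjects$age > 65, ]
flagged  # subjects 202 and 203 would each generate a DCF query
```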

Discrepancy management

This is also called query resolution. Discrepancy management includes reviewing discrepancies, investigating the reason, and resolving them with documentary proof or declaring them as irresolvable. Discrepancy management helps in cleaning the data and gathers enough evidence for the deviations observed in data. Almost all CDMS have a discrepancy database where all discrepancies will be recorded and stored with audit trail.

Based on the types identified, discrepancies are either flagged to the investigator for clarification or closed in-house by Self-Evident Corrections (SEC) without sending a DCF to the site; the most common SECs are obvious spelling errors. For discrepancies that require clarification from the investigator, DCFs are sent to the site. The CDM tools help in the creation and printing of DCFs. Investigators write the resolution or explain the circumstances that led to the discrepancy in the data, and the resolution provided is then updated in the database. In case of e-CRFs, investigators can access the discrepancies flagged to them and provide the resolutions online. Figure 2 illustrates the flow of discrepancy management.

Figure 2. Discrepancy management (DCF = Data clarification form, CRA = Clinical Research Associate, SDV = Source document verification, SEC = Self-evident correction)

The CDM team reviews all discrepancies at regular intervals to ensure that they have been resolved. Resolved data discrepancies are recorded as ‘closed’, meaning that those validation failures are no longer considered active and future data validation attempts on the same data will not create a discrepancy for the same data point. Closure of discrepancies is not always possible, however: in some cases, the investigator cannot provide a resolution, and such discrepancies are marked as ‘irresolvable’ and updated in the discrepancy database.

Discrepancy management is the most critical activity in the CDM process. Because it is vital to cleaning the data, the utmost attention must be paid while handling discrepancies.

Medical coding

Medical coding helps in identifying and properly classifying the medical terminologies associated with the clinical trial. For classification of events, medical dictionaries available online are used. Technically, this activity needs knowledge of medical terminology, understanding of disease entities and the drugs used, and a basic knowledge of the pathological processes involved. Functionally, it also requires knowledge of the structure of electronic medical dictionaries and the hierarchy of classifications available in them. Adverse events occurring during the study, prior and concomitantly administered medications, and pre- or co-existing illnesses are coded using the available medical dictionaries. Commonly, the Medical Dictionary for Regulatory Activities (MedDRA) is used for coding adverse events as well as other illnesses, and the World Health Organization–Drug Dictionary Enhanced (WHO-DDE) is used for coding medications. These dictionaries contain the respective classifications of adverse events and drugs in proper classes. Other dictionaries are also available for use in data management (e.g., WHO-ART, a dictionary that deals with adverse reaction terminology). Some pharmaceutical companies utilize customized dictionaries to suit their needs and meet their standard operating procedures.

Medical coding classifies reported medical terms on the CRF to standard dictionary terms in order to achieve data consistency and avoid unnecessary duplication. For example, investigators may use different terms for the same adverse event, but it is important to code all of them to a single standard code and maintain uniformity in the process. The right coding and classification of adverse events and medications is crucial, as incorrect coding may mask safety issues or highlight the wrong safety concerns related to the drug.
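As a rough illustration of this mapping, the R sketch below codes verbatim adverse-event terms to a single standard term. The dictionary rows are invented for illustration and are not actual MedDRA content.

```r
# Map reported (verbatim) terms to one standard coded term.
dictionary <- data.frame(
  verbatim = c("headache", "head ache", "mild headache"),
  coded    = "Headache"   # one standard term for all reported variants
)

reported <- c("head ache", "mild headache")
dictionary$coded[match(tolower(reported), dictionary$verbatim)]
# returns "Headache" "Headache"
```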

Database locking

After a proper quality check and assurance, the final data validation is run. If there are no discrepancies, the SAS datasets are finalized in consultation with the statistician. All data management activities should have been completed prior to database lock; to ensure this, a pre-lock checklist is used and the completion of all activities is confirmed, because the database cannot be changed in any manner after locking. Once approval for locking is obtained from all stakeholders, the database is locked and clean data are extracted for statistical analysis. Generally, no modification of the database is possible afterwards; in case of a critical issue or other important operational reasons, however, privileged users can modify the data even after the database is locked. This requires proper documentation, and an audit trail has to be maintained with sufficient justification for updating the locked database. Data extraction is done from the final database after locking, followed by archival of the database.

Roles and Responsibilities in CDM

In a CDM team, different roles and responsibilities are attributed to the team members. The minimum educational requirement for a CDM team member is a degree in life sciences and knowledge of computer applications. Ideally, medical coders should be medical graduates; in the industry, however, paramedical graduates are also recruited as medical coders. Some key roles are essential to all CDM teams, and the list of roles given below can be considered the minimum requirement for a CDM team:

  • Data Manager
  • Database Programmer/Designer
  • Medical Coder
  • Clinical Data Coordinator
  • Quality Control Associate
  • Data Entry Associate

The data manager is responsible for supervising the entire CDM process: he or she prepares the DMP and approves the CDM procedures and all internal documents related to CDM activities. Controlling and allocating database access to team members is also the responsibility of the data manager. The database programmer/designer performs the CRF annotation, creates the study database, and programs the edit checks for data validation; he or she is also responsible for designing the data entry screens in the database and validating the edit checks with dummy data. The medical coder codes the adverse events, medical history, co-illnesses, and concomitant medications administered during the study. The clinical data coordinator designs the CRF, prepares the CRF filling instructions, and is responsible for developing the DVP and discrepancy management; all other CDM-related documents, checklists, and guideline documents are also prepared by the clinical data coordinator. The quality control associate checks the accuracy of data entry and conducts data audits[10] (sometimes a separate quality assurance person conducts the audit on the entered data); additionally, the quality control associate verifies the documentation pertaining to the procedures being followed. The data entry personnel track the receipt of CRF pages and perform data entry into the database.

Conclusion

CDM has evolved in response to the ever-increasing demand from pharmaceutical companies to fast-track the drug development process and from regulatory authorities to put quality systems in place to ensure the generation of high-quality data for accurate drug evaluation. To meet these expectations, there has been a gradual shift from paper-based to electronic systems of data management. Developments on the technological front have positively impacted the CDM process and systems, leading to encouraging results in the speed and quality of the data being generated. At the same time, CDM professionals should ensure the standards for improving data quality.[11] CDM, being a speciality in itself, should be evaluated by means of the systems and processes implemented and the standards followed. The biggest challenge from the regulatory perspective is the standardization of the data management process across organizations, and the development of regulations defining the procedures to be followed and the data standards. From the industry perspective, the biggest hurdle is the planning and implementation of data management systems in a changing operational environment, where the rapid pace of technology development outdates existing infrastructure. In spite of this, CDM is evolving to become a standards-based clinical research entity by striking a balance between the expectations of and constraints in the existing systems, driven by technological developments and business demands.

Source of Support: Nil.

Conflict of Interest: None declared.

Data Collection and Management in Clinical Research

Mario Guralnik, PhD (Synergy Research Inc., Irvine, CA, USA)

Well-designed trials and data management methods are essential to the integrity of the findings from clinical trials, and the completeness, accuracy, and timeliness of data collection are key indicators of the quality of conduct of the study. The research data provide the information to be analyzed in addressing the study objectives, and addressing the primary objectives is the critical driver of the study. Since the data management plan closely follows the structure and sequence of the protocol, the data management group and protocol development team must work closely together. Accurate, thorough, detailed, and complete collection of data is critical, especially at baseline as this is the last time observations can be recorded before the effects of the trial interventions come into play. The shift from paper-based to electronic systems promotes efficient and uniform collection of data and can build quality control into the data collection process.




Citation: Guralnik, M. (2012). Data Collection and Management in Clinical Research. In: Supino, P., Borer, J. (eds) Principles of Research Methodology. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3360-6_7

© 2012 Springer Science+Business Media, LLC


  • Technical advance
  • Open access
  • Published: 24 June 2019

The Generalized Data Model for clinical research

Mark D. Danese (ORCID: orcid.org/0000-0002-7068-9603), Marc Halperin, Jennifer Duryea & Ryan Duryea

BMC Medical Informatics and Decision Making, volume 19, Article number: 117 (2019)


Background

Most healthcare data sources store information within their own unique schemas, making reliable and reproducible research challenging. Consequently, researchers have adopted various data models to improve the efficiency of research. Transforming and loading data into these models is a labor-intensive process that can alter the semantics of the original data. Therefore, we created a data model with a hierarchical structure that simplifies the transformation process and minimizes data alteration.

Methods

There were two design goals in constructing the tables and table relationships for the Generalized Data Model (GDM). The first was to focus on clinical codes in their original vocabularies to retain the original semantic representation of the data. The second was to retain hierarchical information present in the original data while retaining provenance. The model was tested by transforming synthetic Medicare data; Surveillance, Epidemiology, and End Results data linked to Medicare claims; and electronic health records from the Clinical Practice Research Datalink. We also tested a subsequent transformation from the GDM into the Sentinel data model.

Results

The resulting data model contains 19 tables, with the Clinical Codes, Contexts, and Collections tables serving as the core of the model, and containing most of the clinical, provenance, and hierarchical information. In addition, a Mapping table allows users to apply an arbitrarily complex set of relationships among vocabulary elements to facilitate automated analyses.

Conclusions

The GDM offers researchers a simpler process for transforming data, clear data provenance, and a path for users to transform their data into other data models. The GDM is designed to retain hierarchical relationships among data elements as well as the original semantic representation of the data, ensuring consistency in protocol implementation as part of a complete data pipeline for researchers.


Background

Healthcare data contains useful information for clinical researchers across a wide range of disciplines, including pharmacovigilance, epidemiology, and health services research. However, most data sources throughout the world store information within their own unique schemas, making it difficult to develop software tools that ensure reliable and reproducible research. One solution to this problem is to create data models that standardize the storage of both the data and the relationships among data elements [ 1 ].

In healthcare, several commonly used data models include those supported by the following organizations: Informatics for Integrating Biology and the Bedside (i2b2) [ 2 , 3 , 4 ], Observational Health Data Sciences and Informatics (OHDSI, managing the OMOP [Observational Outcomes Medical Partnership] data model) [ 5 , 6 , 7 ], Sentinel [ 8 , 9 , 10 ], and PCORnet (Patient Centered Outcomes Research Network) [ 11 , 12 ], among others. The first, and biggest, challenge with any data model is the process of migrating the raw (source) data into the data model, referred to as the “extract, transform, and load” (ETL) process. The ETL process is particularly burdensome when one has to support multiple, large data sources, and to update them regularly [ 13 ].

Some aspects of transforming raw data into a particular data model are straightforward, including reorganizing variables and standardizing their names. However, the most challenging aspect is standardizing the relationships among data elements without changing their meaning. Since different healthcare data sources encode relationships in different ways, the ETL process can lose information or create inaccurate information. The best example is the process of creating a visit, a construct which, in most data models, is used to link information (e.g., diagnoses and procedures) on a per-patient basis.

Visits are challenging because administrative claims allow facilities and practitioners to invoice separately for their portions of the same medical encounter, and allow practitioners to bill for multiple interactions on a single invoice [ 14 ]. Within the practitioner bills, individual procedures are linked to diagnosis codes, procedure modifiers, and costs. Consequently, a visit should link both the facility and the practitioner information without changing the existing practitioner-specified relationships between procedures, modifiers, diagnoses, and costs. Even electronic medical records can be challenging when each interaction with a different provider (e.g., nurse, physician, pharmacist, etc.) is recorded separately, requiring decisions to be made about defining a visit.

To minimize the need to encode specific relationships that may not exist in the source data, we created a data model with a hierarchical structure that minimizes changes to the meaning of the original data. This data model can serve both as a stand-alone data model for clinical researchers using observational data, as well as a storage model for later conversion into other data models.

Methods

In designing the Generalized Data Model (GDM), the primary use case was to allow clinical researchers using commonly available observational datasets to conduct research efficiently within a common framework. In particular, the GDM was designed to allow researchers to reuse an extensive, published body of existing algorithms for identifying clinical research constructs, including visits, that are expressed in the native vocabularies of the raw data. These algorithms require code sets, and may also require temporal logic (e.g., before, after, during), sequencing information (e.g., first, last), and provenance information (e.g., inpatient, outpatient). The GDM specifically considered both oncology research, which has its own specific vocabularies, and health services research. However, the model was designed so that these specific focus areas would not limit its design or use.

Design goals

We initiated development of the GDM to make ETL specification and implementation easier for users who work with data models. There were two primary goals in defining the standard tables and table relationships for the GDM, described below.

Focus on clinical codes in their original vocabularies

For clinical research, transparency and reproducibility are critically important. Therefore, the model is focused on the original (source) vocabularies to prevent the loss of the original semantic expression of the underlying clinical information. We also wanted all clinical codes (e.g., International Classification of Diseases [ICD], Current Procedural Terminology, National Drug Codes, etc.) to be easy to load into the data model and easy to query, because they represent the majority of electronic clinical information. Hence, the key organizing structure of the GDM is the placement of all clinical codes in a central “fact” table. This is not unlike the i2b2 data model that uses a fact table to store all “observations” from a source data set; however, the GDM was not designed as a star schema despite the similar idea of locating the most important data at the center of the data model.
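As an illustration of this organizing structure, the R/data.table sketch below shows a few fact-table rows holding codes in their source vocabularies. The column set is simplified from the Clinical Codes table described later (for example, the GDM stores numeric concept ids rather than raw code strings), and the example rows are invented.

```r
library(data.table)

# Simplified sketch of a central "fact" table with codes in their
# original vocabularies. Rows and column names are illustrative.
clinical_codes <- data.table(
  id         = 1:3,
  patient_id = c(1L, 1L, 2L),
  code       = c("410.11", "99213", "00006-0749-54"),  # ICD-9-CM, CPT, NDC
  vocabulary = c("ICD9CM", "CPT4", "NDC"),
  start_date = as.Date(c("2008-02-01", "2008-02-01", "2008-03-15")),
  end_date   = as.Date(c("2008-02-01", "2008-02-01", "2008-04-14")),
  context_id = c(10L, 11L, 12L)  # provenance/grouping via the Contexts table
)

# All diagnosis-vocabulary records for patient 1:
clinical_codes[patient_id == 1L & vocabulary == "ICD9CM"]
```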

We also considered interoperability as part of the design, but it was of secondary importance. Interoperability, like the construction of visits, requires establishing new connections (“mappings”) between the source vocabularies and a standard vocabulary such that a single query can operate across all data sources regardless of the source vocabulary. For international studies using different vocabularies, this might be a useful tool. However, given that not every code is yet mapped to a standard (e.g., OMOP has little in the way of procedure code mappings), and given the maintenance required to support and update mappings, we designed the GDM to incorporate reliable cross-vocabulary mappings where they exist.

Retain hierarchical information with provenance

The second goal was to capture important hierarchical relationships among data elements within a relational data structure. Based on the review of numerous data sources, including Medicare; Surveillance, Epidemiology and End Results (SEER) Medicare; Optum; Truven; JMDC (Japanese claims); and Clinical Practice Research Datalink (CPRD), we decided on a two-level hierarchy for grouping clinical codes, with the lower-level table called Contexts and the higher-level table called Collections. This was based on common data structures where many related codes are recorded on a single record in the source data (Contexts table), and where these records are often grouped together (Collections table) based on clinical reporting or billing considerations. See Results for table definitions, and Fig. 1 for a visual depiction of the hierarchical structure of the Contexts and Collections tables.

Figure 1. Relationships Among the Collections, Contexts, and Clinical Codes Tables. Note: EHR = electronic health record. HCPCS = Healthcare Common Procedure Coding System. NDC = National Drug Code. ICD = International Classification of Diseases. The figure does not contain specific data, but is intended to show the conceptual relationships among data elements across tables.

Our review of data sources suggested that the data model needed to support relatively few relationship types. The primary relationship represents data that is reported together or collected at the same time. One example of this includes a “line”, which occurs in claims data when one or more diagnosis codes, a procedure code, and a cost are all reported together. Another example includes laboratory values assessed at the same time (e.g., systolic and diastolic blood pressure) which could be considered to be co-reported. Also, a set of prescription refills could represent a linked set of records. Even records that contain pre-coordinated expressions (i.e., a linked set of codes used to provide clinical information akin to an English sentence) could also be stored in order by associating the codes with a single Context record.

We also included the provenance for each clinical code as part of Contexts, recording not only the type of relationship among elements within a Context as discussed above, but also the source file from which the data was abstracted. To minimize the loss of information when converting from the GDM to a data model that uses visits for organizing and consolidating most data relationships, the GDM does not require explicit visits (see Results ). This is important because visits are not consistently defined among other data models, particularly for administrative claims data (see Discussion ).

Other considerations

There are several other considerations made in building this data model, some of which were borrowed or adapted from other data models. For example, in addition to the cost table, we borrowed the OMOP idea to store all codes as “concept ids” (unique numeric identifiers for each code in each vocabulary to avoid conflicts between different vocabularies that use the same code). We also expanded upon the idea of OMOP “type_concept_ids” to track provenance within our data model. Finally, we allow flexibility in storing enrollment information in the Information Periods table using a “type_concept_id” so that the data can be used for different purposes (e.g., if a protocol does not require drug data, then enrollment in a drug plan should not be required). We also wanted to facilitate a straightforward, subsequent ETL process to other data models, including OMOP, Sentinel, and PCORnet.

We adapted the Payer Reimbursements table from the OMOP version 5.2 Cost table because it was the only data model with a cost table, and because we contributed substantially to its design. However, unlike the single OMOP cost table, we created two tables to accommodate both reimbursement-specific information, which has a well-defined structure, and all other kinds of economic information, which requires a very flexible structure. (The OMOP version 5.31 Cost table was redesigned to be more flexible, coincidentally resembling the GDM Costs table.)

We tested the data model on three very different types of commonly available data used by clinical researchers: administrative claims data, EHR data, and cancer registry data. Claims in the United States are generally submitted electronically by the provider to the insurer using the American National Standards Institute (ANSI) 837P and 837I file specifications, which correspond to the CMS-1500 and UB04 paper forms [ 15 ]. Remittance information is sent from the insurer to the provider using the 835P and 835I specifications. However, actual claims data used for research is provided in a much simpler format. Based on experience developing and supporting software for submitting claims to insurers as well as creating ETL specifications for multiple commercial claims and EHR datasets using the OMOP data model, we determined that Medicare data is the most stringent test for transforming claims data because it contains the most information from the 837 and 835 files. For EHR data, we used the Clinical Practice Research Datalink (CPRD) data, because it is widely used for clinical research [ 16 ]. Finally, as part of our focus on oncology research, we included Surveillance, Epidemiology, and End Results (SEER) data [ 17 ] because SEER provides some of the most detailed cancer registry data available globally to clinical researchers which is challenging to incorporate into data models.

More specifically, we implemented a complete ETL process for the Medicare Synthetic Public Use Files (SynPUF). The SynPUF data are created from a 2.1-million-patient sample of Medicare beneficiaries from 2008 who were followed for three years, created to facilitate software development using Medicare data [ 18 , 19 ]. We also implemented an ETL for SEER data linked to Medicare claims data [ 20 ] for 20,000 patients with small cell lung cancer, as part of an ongoing research project to describe patterns of care in that population. Finally, we developed a complete ETL for 140,000 CPRD patients for an ongoing research project evaluating outcomes associated with adherence to lipid-lowering medications. We also tested the feasibility of an ETL process to move SynPUF data from the GDM to the Sentinel data model (version 6.0) to ensure that the model did not contain any structural irregularities that would make it difficult to move data into other data model structures.

Finally, we conducted a test of information loss in the context of applying quality control to a study of mesothelioma patients. Two analysts independently implemented the same analysis of SEER Medicare data from a written specification document: one using the source data and a combination of SAS and R code, and the other using the GDM version of the data and proprietary software. The analysis required the use of several SEER-specific fields, including the tumor sequence (first primary), histology, reporting type (microscopic confirmation), reporting source (not at death or autopsy), and tumor location data.

ETL software

Our ETL process focused on the extraction of the source data and the transformation to the GDM data model, and saved tables as .csv files (i.e., it focused primarily on the E and T parts of the ETL). The ETL processes were built using R (version 3.4.4) and the data.table package (version 1.11.6) [ 21 ]. R was selected because it is an open-source, cross-platform software package; because of its flexibility for composing ETL functions; and because of the availability of the data.table package as an in-memory database written in C for speed. The package itself is modular, and allows users to compose arbitrary ETL functions. Although the approach is different, the process is conceptually related to the dynamic ETL described by Ong, et al. [ 22 ]
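A minimal sketch of this kind of composable extract-and-transform step, using R and data.table as described above; the function and file names are illustrative and do not reproduce the authors' actual package API.

```r
library(data.table)

# Read a source file, apply a table-specific transform, and save the
# resulting GDM table as .csv (the load step is handled separately).
transform_table <- function(source_file, out_file, transform) {
  dt  <- fread(source_file)   # extract
  out <- transform(dt)        # transform to a GDM table layout
  fwrite(out, out_file)       # write the .csv output
  invisible(out)
}

# Hypothetical composition for one table:
# transform_table("synpuf_beneficiary.csv", "patients.csv", make_patients)
```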

Results

The resulting data model contains 19 tables (see hierarchical view in Fig. 2). Details of the tables are provided in Additional file 1, and the most up-to-date version is available on a GitHub repository [ 23 ]. This repository will also contain links to any publicly available ETL specifications that we develop.

Figure 2. Hierarchical View of the Generalized Data Model. Note: Table names and key relationships among tables are depicted; see Additional file 1 for more detail on tables. Tables in green serve as lookup tables across the database. There is a single Addresses table for unique addresses with relationships to Patients, Practitioners, and Facilities, and a single Practitioners table with relationships to Patients and Contexts Practitioners. The Contexts Practitioners table allows multiple practitioners to be associated with a Context record.

Clinical data

The Clinical Codes, Contexts, and Collections tables make up the core of the GDM (as shown in Fig. 1 ). All clinical codes are stored in the Clinical Codes table. Each row of the Clinical Codes table contains a single code from the source data. In addition, each row also contains a patient id, the associated start and end dates for the record, a provenance concept id, and a sequence number. The sequence number allows codes to retain their order from the source data, as necessary. The most obvious example from billing data is diagnosis codes that are stored in numbered fields (e.g., diagnosis 1, diagnosis 2, etc.). But any set of ordered records could be stored this way, including groups of codes in a pre-coordinated expression. Grouping together ordered records in the Clinical Codes table is accomplished by associating them with the same id from the Contexts table. The provenance id allows for the specification of the type of record (e.g., admitting diagnosis, problem list diagnosis, etc.).

The Contexts table allows for grouping clinical codes and storing information about their origin. The record type concept id identifies the type of group that is stored. Examples might include lines from claims data where diagnoses, procedures, and other information are grouped; prescription and refill records that might be in electronic medical record or pharmacy data; or measurements of some kind from electronic health record or laboratory data (e.g., systolic and diastolic blood pressure, or a laboratory panel). In addition, the table stores the file name from the source data, the Centers for Medicare and Medicaid Services place of service values [ 27 ] (used for physician records, since facility records do not have a place of service in claims data), and foreign keys to the care site and facility tables. The Contexts table also contains a patient id and both start and stop dates, which could be different from the start and stop dates of the individual records from other tables to which the Contexts record is linked (e.g., a hospitalization may have different start and stop dates than the individual records within it, as might occur with an in-hospital procedure performed on a single day of a multi-day hospitalization).

The Collections table represents a higher level of hierarchy for records in the Contexts table. That is, records in the Collections table represent groups of records from the Contexts table. This kind of grouping occurs when multiple billable units (“lines” or “details”) are combined into invoices (“claims”). It also occurs when prescriptions, laboratory measures, diagnoses and/or procedures are all recorded at a single office visit. In short, a Collection is typically a “claim” or a “visit” depending on whether the source data is administrative billing or electronic health record data. By using a hierarchical structure, the model avoids the requirement to construct “visits” from claims data which often leads to inaccuracy, loss of information, and complicated ETL processing. In the simplest possible case, it is possible to have a single record in the Clinical Codes table which is associated with a single Context record, which is associated with a single Collection record, as shown in Fig. 1 for a drug record. The critical part of the ETL process, moving data into the Clinical Codes, Contexts, and Collections tables, is described in Fig.  3 for the SynPUF data.
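The toy example below sketches this three-level hierarchy in R/data.table for a single claim: one Collection (the claim), a claim-level and a line-level Context, and Clinical Codes rows attached to their Contexts. All ids, codes, and type labels are invented for illustration.

```r
library(data.table)

collections <- data.table(id = 100L, patient_id = 1L)   # the claim

contexts <- data.table(
  id            = c(10L, 11L),
  collection_id = 100L,
  type          = c("claim", "line")                     # provenance level
)

clinical_codes <- data.table(
  patient_id = 1L,
  context_id = c(10L, 11L, 11L),
  code       = c("V58.11", "96413", "162.9"),  # header dx; line procedure + dx
  sequence   = c(1L, 1L, 2L)
)

# Retrieve every code on the claim by walking the hierarchy:
clinical_codes[contexts, on = .(context_id = id)][collection_id == 100L]
```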

Figure 3. Visualization of the ETL Process for SynPUF Data. Note: Clinical codes are derived from a single row in the source data set (SynPUF record). Colored arrows indicate how each group of codes is used to create records. Each code from the original record gets its own row in the Clinical Codes table. Codes that are grouped together (e.g., line diagnosis 1 and procedure 1, in yellow) share the same context. In the Contexts table, a type concept id ending in “64” indicates a claim-level context, and an id ending in “65” indicates a line-level context. The three contexts (groups of codes) share the same collection id.

The Details tables capture domain-specific information related to hospitalizations, drugs, and measurements. The Admissions Details table stores admissions and emergency department information that does not fit in the Clinical Codes, Contexts, or Collections tables; it is designed to hold one admission per row, and each record in the Collections table for an inpatient admission links to it. The Drug Exposure Details and Measurement Details tables contain information about medications and measurements (e.g., laboratory values), and the Clinical Codes table contains foreign keys to these tables. We should also note that these two tables could be combined with the Clinical Codes table to make one larger table and improve query times on some database platforms. While this might require some minor modifications to queries, it would not change the underlying logic of the data model.

Patient data

The Patients table includes information about birth date, sex, race, ethnicity, address (via the Addresses table) and primary care provider (via Practitioners table). The Patient Details table allows a more flexible structure for timeless information like family history or simple genetic information. The Information Periods table captures periods of time during which the information in each table is relevant. This can include multiple records for each patient, including records for different enrollment types (e.g., Medicare Part A, Medicare Part B, or Medicare Managed Care) or this can be something as simple as a single date range of “up-to-standard” data as provided by the Clinical Practice Research Datalink. This table includes one row per patient for each unique combination of information type and date range.

The Deaths table captures mortality information at the patient level, including date and cause(s) of death. This is typically populated from beneficiary or similar administrative data associated with the medical record. However, it is useful to check discharge status in the Admissions Details table as part of ETL process to ensure completeness. There are also diagnosis codes that indicate death. Deaths that are indicated by diagnosis codes should be in the Clinical Codes table and not be moved to the Deaths table. If needed, these codes can be identified using an appropriate algorithm (e.g., a set of ICD-9 codes, possibly with associated provenance specifications) to identify death as part of the identification of outcomes in an analysis.
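As a hedged sketch of the kind of algorithm described above, the code below scans Clinical Codes rows for death-indicating diagnosis codes, with an optional provenance restriction. The code set, field names, and context-type filter are illustrative placeholders, not a validated outcome definition.

```python
# Sketch: identify death outcomes from diagnosis codes retained in the
# Clinical Codes table. Code set and field names are placeholders only.
DEATH_ICD9 = {"798.1", "798.2", "798.9"}            # hypothetical code list

def death_events(clinical_codes, contexts_by_id, allowed_context_types=None):
    """Yield (patient_id, date) for rows whose diagnosis code signals death."""
    for row in clinical_codes:
        if row["code"] not in DEATH_ICD9:
            continue
        ctx = contexts_by_id[row["context_id"]]
        # Optional provenance constraint, e.g., inpatient claims only.
        if allowed_context_types and ctx["type"] not in allowed_context_types:
            continue
        yield row["patient_id"], row["start_date"]
```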

There are two tables that store cost, charge, or payment data of some kind. The Payer Reimbursements table stores information from administrative claims data, with separate columns for each commonly used reimbursement element. All other financial information is stored in the Costs table, which is designed to support arbitrary cost types and uses a value_type_concept_id to indicate the specific type. Costs may be present at a Context (line-item) or Collection (invoice) level, which led us to align costs with the Contexts table. By evaluating the type of the context record, users can determine whether a cost is an aggregated construct or not. In administrative claims data, this means that each “line” (diagnosis and procedure) can have a cost record. For records that have costs only at the claim/header level (e.g., inpatient hospitalizations), only Contexts that refer to “claims” (i.e., a record_type_concept_id for “claim”) will have costs. For data with costs at both the line and claim/header level, costs can be distinguished by the Context type. In our experience, the sum of the line costs does not always equal the total cost, so depending on the research question, the researcher will need to determine whether claim, line, or both should be used. It is possible that each Clinical Codes record sharing a single Contexts record could have a different cost; therefore, the two cost-related tables include a column to indicate the specific Clinical Codes record to which the cost belongs. This might occur, for example, if multiple laboratory tests have different costs but share a common provenance (i.e., Contexts record).
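A minimal sketch of how line-level and claim-level costs can be separated by inspecting the type of the Contexts record each cost row points to; the field names are simplified assumptions rather than the exact GDM column names.

```python
from collections import defaultdict

def costs_by_level(cost_rows, contexts_by_id):
    """Total claim-level vs line-level costs per collection.

    Field names are simplified; the level of a cost is inferred from the
    type of the Contexts record it references.
    """
    totals = defaultdict(lambda: {"claim": 0.0, "line": 0.0})
    for cost in cost_rows:
        ctx = contexts_by_id[cost["context_id"]]
        level = "claim" if ctx["record_type"] == "claim" else "line"
        totals[ctx["collection_id"]][level] += cost["amount"]
    return totals

# Because line costs do not always sum to the claim total, the researcher
# chooses claim, line, or both depending on the research question.
```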

Facility and practitioner data

The Facilities table contains unique records for each facility where a patient is seen. The facility_type_concept_id should be used to describe the whole facility (e.g., Academic Medical Center or Community Medical Center). Specific departments in the facility should be entered in the Contexts table using the care_site_type_concept_id field. The Addresses table captures address information for practitioners and facilities, as well as patients.

The Contexts Practitioners table links one or more practitioners with a record in the Contexts table. Each record represents an encounter between a patient and a practitioner in a specific context. The role_type_concept_id field captures the role, if any, that the practitioner played in the context (e.g., attending physician).

Vocabulary data

The Concepts table provides a unique numeric identifier (“concept_id”) for each source code in each vocabulary used in the data (see Table  1 ). Since queries against the GDM are intended to use the source codes, the Vocabulary table functions as a lookup table; therefore, the Concepts table does not have to be consistent across databases. However, there may be efficiencies in using a consistent set of identifiers for all entries from commonly used vocabularies. The specific vocabularies used in the data are provided in the Vocabularies table. The idea of having both Concepts and Vocabularies tables was adapted from the OMOP data models. As mentioned in Methods, the Mappings table allows for the expression of consistent concepts across databases.

The Mappings table is designed to express relationships among data elements. It can also be used to facilitate translation into other data models (see Table 2). In a few very simple cases, like sex and race/ethnicity, we recommend concept mappings to a core set of values to make it easier for users of protocol implementation software to filter patients by age, gender, and race/ethnicity using a simpler representation of the underlying information. The Mappings table also permits an arbitrarily complex set of relationships, along the lines of the approach taken with the OMOP model and the use of standard concepts for all data elements. By using a Mappings table, we reduce the need to re-map and re-load the entire dataset when new mappings become available. Regardless of how the Mappings table is used, the GDM still retains the original codes from the raw dataset.
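For instance, a few mapping rows are enough to express a source-specific sex coding as core concepts without altering the stored data; all concept ids below are invented for illustration.

```python
# Sketch of Mappings rows: source-specific sex codes -> core concepts.
# All concept ids are invented; the original codes remain in the data.
mappings = [
    {"concept_1_id": 9001, "relationship": "maps_to", "concept_2_id": 7001},  # source "1" -> Male
    {"concept_1_id": 9002, "relationship": "maps_to", "concept_2_id": 7002},  # source "2" -> Female
]

core_sex = {m["concept_1_id"]: m["concept_2_id"]
            for m in mappings if m["relationship"] == "maps_to"}

def core_sex_concept(patient):
    # Fall back to the source concept when no mapping exists.
    return core_sex.get(patient["sex_concept_id"], patient["sex_concept_id"])

print(core_sex_concept({"sex_concept_id": 9001}))   # -> 7001 (core "Male")
```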

ETL results

We loaded SynPUF data and SEER Medicare data into the GDM. After downloading the data to a local server, the process of migrating the SynPUF data, covering 2.1 million patients, to the GDM took approximately 8 h on a Windows server with 4 cores, 128 GB of RAM, and conventional hard drives (running two files at a time in parallel). Most of the time was spent loading files into RAM and writing files to disk, since the process of ETL with the GDM is primarily about relocating data.

SEER Medicare data for SCLC included approximately 20,000 patients and took less than 1 h. Selected SEER data were included in the ETL process, ignoring recoded versions of existing variables and variables used for consistency of interpretation over time. The ETL process focused on 31 key variables, including histology, location, behavior, grade, stage, surgery, radiation, urban/rural status, and poverty indicators. Each SEER variable was included as a new vocabulary in the Concepts table (see Table 1).

CPRD data included approximately 140,000 patients and took approximately 2 h. For the Test file, which contains laboratory values and related measurements, we used Read codes in the Clinical Codes table; however, one could also add the “entity types” (numeric values for laboratory values and other clinical measurements and assessments) to the Clinical Codes table, with both the Read code and the entity type associated with the same Contexts record and the same Measurement Details record. We used the entity types for all records in the CPRD Additional Clinical Details table. In all cases, the Mappings table allows for alternative relationships to be added to the data.

Information loss

After reconciling differences in interpretation and resolving coding errors, we identified the identical cohort of patients when using the source data as when using the same data in the GDM.

ETL from the GDM to sentinel

We conducted an exploratory transformation from the GDM to Sentinel to ensure that it was feasible. The process of moving the data was conducted as follows. The transformations from the GDM Patients, Deaths, and Information Periods tables to Sentinel’s Demographic, Death, and Enrollment tables required renaming variables and mapping a source data vocabulary to a Sentinel vocabulary (e.g., SynPUF sex coding to Sentinel sex coding). The Sentinel Diagnosis, Procedure, and Dispensing tables were populated by splitting the GDM Clinical Codes table by clinical_code_source_vocabulary (e.g., ICD-9 codes were moved to the Sentinel Diagnosis table).

Populating the Sentinel Encounter table required records to be rolled up into a visit. To do this, the Contexts table was transformed into a “pre-Encounter” table with an encounter identifier set to the Contexts table identifier, with a similar process used for the Sentinel Procedure and Diagnosis tables. The “pre-Encounter” table was created with all of the specified columns and correctly mapped data, but had not yet grouped the records into visits. We applied logic based primarily on provenance information in the Contexts table to roll up records into visits, and we created a new identifier in the Encounter table. Finally, the Diagnosis and Procedure tables were updated with the new Encounter table identifiers.
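A rough sketch of that roll-up step, assuming a simplified grouping key of patient, start date, provider, and encounter type (echoing Sentinel's visit definition); the real logic also leans on provenance information in the Contexts table.

```python
from itertools import groupby

def roll_up_encounters(pre_encounter_rows):
    """Group 'pre-Encounter' rows (one per Contexts record) into visits.

    Simplified sketch: a visit is one unique combination of patient,
    start date, provider, and encounter type. The new encounter_id is
    written back so Diagnosis/Procedure rows can be updated to match.
    """
    key = lambda r: (r["patient_id"], r["start_date"],
                     r["provider_id"], r["encounter_type"])
    rows = sorted(pre_encounter_rows, key=key)
    encounters = []
    for new_id, (k, group) in enumerate(groupby(rows, key), start=1):
        for r in group:
            r["encounter_id"] = new_id
        encounters.append({"encounter_id": new_id, "key": k})
    return encounters
```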

The remaining processing from the GDM to Sentinel involved vocabulary transformation since Sentinel has specific ways of representing concepts like sex which, in the GDM, are based on the source (e.g., male = 1 and female = 2) using a unique concept id in the Vocabulary table. We created records in the Mappings table from the SynPUF concepts to the Sentinel concepts (Table 2 ) to accomplish all needed mappings. Our ETL process then used those mappings to insert the correctly transformed variables from the GDM into the Sentinel tables during the ETL.

The GDM is designed to allow clinical researchers to identify the clinical, resource utilization, and cost constructs needed for a wide range of epidemiological and health services research areas without altering the data’s original semantics by creating visits or domains, or performing substantial vocabulary mapping. This provides flexibility for researchers to study not only clinical encounters like outpatient visits, hospitalizations, emergency room visits, and episodes of care, but also more basic constructs like conditions or medication use. Its main goal is to simplify the location of the most important information for creating analysis data sets, which has the benefit of making ETL easier. It does this by using a hierarchical structure instead of visits. It tracks the provenance of the original data elements to enhance the reproducibility of studies. It includes a table to store relationships among data elements for standardized analyses. And it allows for a subsequent ETL process to other data models to provide researchers access to the analytical tools and frameworks associated with those models.

Because other data models (e.g., OMOP, Sentinel, PCORnet, and i2b2) use visits to connect patient-related information within the data model, our emphasis on avoiding visits deserves comment. Visits are seldom required for clinical research, unless the enumeration of explicit visits is the research topic itself. However, for most research projects, protocols require retrieval of the dates of specific, clinically relevant codes, perhaps with provenance or temporal constraints. Satisfying these criteria does not require knowledge of a visit, per se. It is a research project in and of itself to define visits, and their definitions are specific to the health services research question being investigated [ 14 ]. For example, a study of “emergency department” visits would need to consider at least four options to define a visit [ 24 ]. Data models that pre-define visits do not allow such flexibility.

The challenges with visits can best be seen by inspecting the guidelines for creating visits from each data model. In the Sentinel version 6 data model [ 10 ], a visit is defined as a unique combination of patient, start date, provider and visit type. Visit types are defined as Ambulatory, Emergency Department, Inpatient Hospital, Non-acute Institutional, and Other. Furthermore, “Multiple visits to the same provider on the same day should be considered one visit and should include all diagnoses and procedures that were recorded during those visits. Visits to different providers on the same day, such as a physician appointment that leads to a hospitalization, should be considered multiple encounters.”

PCORnet version 4.1 is similar to Sentinel [ 12 ]. However, PCORnet allows more visit types compared to PCORnet version 3, OMOP, and Sentinel. It includes Emergency Department Admit to Inpatient Stay, Observation Stay, and Institutional Professional Consult.

In the OMOP version 5.31 data model, a visit is defined for each “visit to a healthcare facility.” According to the specifications [ 6 ], in any single day, there can be more than one visit. One visit may involve multiple providers, in which case the ETL must either specify how a single provider is selected or leave it null. One visit may involve multiple care sites, in which case the ETL must either specify how a single site is selected or leave it null. Visits must be given one of the following visit types: Inpatient Visit, Outpatient Visit, Emergency Room Visit, Long Term Care Visit and Combined ER and Inpatient Visit. OMOP added an optional Visit Detail table in version 5.3, recognizing the two-level hierarchy common in US claims data [ 6 ].

For i2b2, the specifications state a visit “… can involve a patient directly, such as a visit to a doctor’s office, or it can involve the patient indirectly, as in when several tests are run on a tube of the patient’s blood. More than one observation can be made during a visit. All visits must have a start date / time associated with them, but they may or may not have an end date. The visit record also contains specifics about the location of the session, such as the hospital or clinic the session occurred and whether the patient was an inpatient or an outpatient at the time of the visit.” There are no specified visit types, and the data model allows for an “unlimited number of optional columns but their data types and coding systems are specific to the local implementation” [ 4 ].

Clearly, each data model has different perspectives on the definition of a visit. Such ambiguity can lead to differences in how tables are created in the ETL process. As a result, inconsistencies within or across data models can lead to differences in results, as has already been demonstrated [ 25 , 26 ]. Laboratory records could be visits as with i2b2, or could be associated with visits as with other data models. Similarly, prescription, refill, and pharmacy dispensing records could be considered visits, or associated with visits. And other information, like family history, might not require a visit at all. In short, the most important structural component of other data models cannot be accurately and consistently defined, which affects the consistency of analyses across the data models, and makes translation among data models problematic. This also undermines provenance since each data model might answer the question of “where did this record come from” using different visit types. However, we note that these are semantic considerations and not technical limitations for record retrieval. For example, the i2b2 query platform recently has been extended to permit querying of OMOP and PCORnet data [ 28 ].

One important consideration in using data models is their stability. It can be labor-intensive to keep data updated, and if both the data and the data model are changing, maintenance may be prohibitively time-consuming [ 13 ]. One of our intentions is that the GDM should remain stable over time; therefore, we incorporated separate Vocabulary and Mappings tables, which can be updated without re-running the ETL from the beginning. Hence, the GDM may be a useful, harmonized approach for data providers compared to their various proprietary solutions. This contrasts with the OMOP data model, which requires re-running the ETL when the vocabulary and domain mappings are updated.
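In practice, this makes a vocabulary refresh an append-only change to the Mappings table; the patient-level tables keep their original source codes untouched. A minimal sketch, with simplified field names:

```python
def refresh_mappings(mappings_table, new_rows):
    """Apply updated mappings without re-running the ETL.

    Only the Mappings table changes; Clinical Codes, Contexts, and
    Collections are never rewritten. Field names are simplified.
    """
    seen = {(m["concept_1_id"], m["relationship"], m["concept_2_id"])
            for m in mappings_table}
    for row in new_rows:
        key = (row["concept_1_id"], row["relationship"], row["concept_2_id"])
        if key not in seen:
            mappings_table.append(row)
            seen.add(key)
    return mappings_table
```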

The value of domains is that they allow data users to identify the necessary clinical information to extract for analysis and they facilitate interoperability. However, moving raw healthcare data into domains requires either mapping the entire vocabulary into a single domain, or mapping each individual code into a single domain. Placing codes in domain-specific tables can be particularly challenging when vocabularies cross domains (e.g., Read) or when individual codes are ambiguous (e.g., family history information). The GDM does not require domains or vocabulary mappings to be fully functional. The GDM only requires that users assign a unique number (concept id) to all unique source codes in a given dataset to ensure consistency in the data type for the codes. The vocabulary table is simply a look-up table for the codes and concept ids. Because of this, all codes in all vocabularies (e.g., ICD-9, HCPCS [ 29 ], etc.) in the source data will be retained unless there is an explicit decision to exclude a code. However, if needed, the GDM could support domains as an additional field in the Vocabulary table.
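The required step, assigning a unique concept id to every unique source code, is simple enough to sketch directly; the sequential-id scheme below is one possible convention, not a prescribed one.

```python
def build_concepts(source_codes):
    """Assign a unique concept_id to every (vocabulary, code) pair.

    This look-up table is the only vocabulary work the GDM requires;
    no up-front domain assignment or cross-vocabulary mapping is needed.
    """
    concepts = {}
    for vocabulary, code in source_codes:
        concepts.setdefault((vocabulary, code), len(concepts) + 1)
    return concepts

concept_ids = build_concepts([("ICD-9", "410.1"), ("HCPCS", "J9310"),
                              ("ICD-9", "410.1")])  # duplicates reuse their id
print(concept_ids)   # {('ICD-9', '410.1'): 1, ('HCPCS', 'J9310'): 2}
```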

It is important to clarify the role of analyses in the ecosystem of data models. Neither the GDM nor any other data model is designed to support direct analyses of any sophistication on the entire database (excluding summary analyses to characterize the entire dataset). The role of the data model is to ease the extraction and organization of analysis data sets to address specific clinical research questions. The required analysis dataset structure depends on the specific analyses (e.g., prevalence, incidence, time to event, repeated measures, etc.) and is typically performed using R (OHDSI) or SAS (Sentinel). By starting with the GDM, researchers can develop tools to extract data directly, or implement the necessary transformations to migrate their data to other data models and make use of the tools for extraction and analysis offered by those models. While this requires another ETL process, or a database view to be created on the GDM, it facilitates access to existing analytical tools. Hence, the GDM can be used as a standardized waypoint in a data pipeline because the necessary information for other data models can be contained within the GDM as we found in our test of a GDM to Sentinel conversion.

We should also note that our approach to incorporating relationships into the data (i.e., our Mappings table) is not unique. Others have designed approaches that rely on semantic mappings to organize and extract data [ 30 ]. There are even methods to eliminate the need for both database reorganization and semantic mapping [ 31 ]. While these approaches may be more flexible and avoid cumbersome ETL and/or mapping processes, it is unclear how they fare with respect to the sensitivity and specificity of their exposure and outcome definitions, making it challenging to understand or assess bias in their results [ 32 , 33 ].

Information loss and data quality assessment are challenging subjects. We designed the GDM to minimize information loss in the sense that any codes in the source data can be incorporated by creating entries in the Concepts, Vocabularies, and Clinical Codes tables. We also retained database-specific provenance information by indicating the source file from which each data element is derived, as well as the type of information that was derived. While we tested information loss in the context of a cohort study and found no problems, this is not a guarantee that all necessary information is, or can be, retained. A more robust assessment of data quality will be the subject of future research. However, our use of the SEER data is illustrative because detailed oncology data does not fit naturally into any of the other data models mentioned. Cancer registry data relies heavily on very specific vocabularies for location, histology, grade, staging, behavior, reporting source, microscopic confirmation, and many other factors. Many of these do not fit easily into the existing domain-based tables. The OMOP data model has a further complication in that the International Classification of Diseases for Oncology version 3 (ICD-O-3), which covers location, histology, grade, and behavior, is not a standard vocabulary. Therefore, while the OMOP data model stores the concatenated source codes, work remains to be done to map all combinations to the proper standard vocabulary based on SNOMED. (This work is ongoing at the time of this writing.)

There are other limitations to the GDM. While we have tested it against data that is typically used by health services researchers and epidemiologists, there are likely to be specific data sets that will require modifications or improvements. The GDM does not yet include tables for patient reported outcomes, genomic data, or free text notes which are becoming more widely available for researchers. If other data models add support for these or other fields, this might require changes to the GDM to retain compatibility. For example, more detailed location information may need to be added for those with access to additional data (which is often limited due to privacy issues). While we have considered data from Japan and the United Kingdom, there are many data sources to which we did not have access that might require changes in the data model. Finally, while we have developed tools to extract analysis data sets from the GDM based on a protocol, they are not yet available publicly. (However, the ConceptQL language on which the tools are based is open-source [ 34 ]).

The GDM is designed to retain the relationships among data elements to the extent possible, facilitating ETL and protocol implementation as part of a complete data pipeline for clinical researchers using commonly available observational data. Furthermore, by avoiding the requirements to create visits and to use domains, it offers researchers a simpler process of standardizing the location of data in a defined structure and may make it easier for users to transform their data into other data models.

Availability of data and materials

The data model is publicly available. The raw data is not available due to privacy reasons, except for the Medicare Synthetic Public Use data. See Ethics approval and consent to participate for details on SEER Medicare and CPRD data acquisition, and References for a specific hyperlink to the Synthetic Public Use data.

Abbreviations

ANSI: American National Standards Institute
CMS: Centers for Medicare and Medicaid Services
CPRD: Clinical Practice Research Datalink
EHR: Electronic Health Records
ETL: Extract, Transform, and Load
GDM: Generalized Data Model
i2b2: Informatics for Integrating Biology and the Bedside
ICD: International Classification of Diseases
NAACCR: North American Association of Central Cancer Registries
OHDSI: Observational Health Data Sciences and Informatics
OMOP: Observational Medical Outcomes Partnership
PCORnet: Patient-Centered Outcomes Research Network
SEER: Surveillance, Epidemiology, and End Results
SynPUF: Synthetic Public Use Files

References

1. Kahn MG, Batson D, Schilling LM. Data model considerations for clinical effectiveness researchers. Med Care. 2012;50:S60–7.
2. Klann JG, Abend A, Raghavan VA, Mandl KD, Murphy SN. Data interchange using i2b2. J Am Med Inform Assoc. 2016;23:909–15.
3. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–30.
4. i2b2 Common Data Model. https://i2b2.org/software/files/PDF/current/CRC_Design.pdf. Accessed 20 Apr 2017.
5. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19:54–60.
6. OHDSI. OMOP Common Data Model. http://www.ohdsi.org/web/wiki/doku.php?id=documentation:overview. Accessed 20 Apr 2017.
7. Voss EA, Makadia R, Matcho A, Ma Q, Knoll C, Schuemie M, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc. 2015;22:553–64.
8. Psaty BM, Breckenridge AM. Mini-Sentinel and regulatory science: big data rendered fit and functional. N Engl J Med. 2014;370:2165.
9. Curtis LH, Weiner MG, Boudreau DM, Cooper WO, Daniel GW, Nair VP, et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiol Drug Saf. 2012;21(Suppl 1):23–31.
10. Sentinel Common Data Model. https://www.sentinelinitiative.org/sentinel/data/distributed-database-common-data-model. Accessed 20 Apr 2017.
11. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21:578–82.
12. PCORnet Common Data Model v4.1. https://pcornet.org/data-driven-common-model/. Accessed 28 Sept 2018.
13. Bourke A, Bate A, Sauer BC, Brown JS, Hall GC. Evidence generation from healthcare databases: recommendations for managing change. Pharmacoepidemiol Drug Saf. 2016;25:749–54.
14. Tyree PT, Lind BK, Lafferty WE. Challenges of using medical insurance claims data for utilization analysis. Am J Med Qual. 2006;21:269–75.
15. Centers for Medicare and Medicaid Services. Medicare fee-for-service companion guides. https://www.cms.gov/Medicare/Billing/ElectronicBillingEDITrans/CompanionGuides.html. Accessed 24 Oct 2017.
16. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, et al. Data resource profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol. 2015;44:827–36.
17. Park HS, Lloyd S, Decker RH, Wilson LD, Yu JB. Overview of the Surveillance, Epidemiology, and End Results database: evolution, data variables, and quality assurance. Curr Probl Cancer. 36:183–90.
18. Danese MD, Voss EA, Duryea J, Gleeson M, Duryea R, Matcho A, et al. Feasibility of converting the Medicare Synthetic Public Use data into a standardized data model for clinical research informatics. In: AMIA 2015 Annual Symposium. San Francisco; 2015.
19. Centers for Medicare and Medicaid Services. Synthetic public use file. https://www.cms.gov/research-statistics-data-and-systems/downloadable-public-use-files/synpufs/. Accessed 20 Apr 2017.
20. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Med Care. 2002;40(8 Suppl):IV-3–18.
21. Comprehensive R Archive Network (CRAN). R.
22. Ong TC, Kahn MG, Kwan BM, Yamashita T, Brandt E, Hosokawa P, et al. Dynamic-ETL: a hybrid approach for health data extraction, transformation and loading. BMC Med Inform Decis Mak. 2017;17:134.
23. Outcomes Insights Inc. Generalized Data Model. https://github.com/outcomesinsights/generalized_data_model. Accessed 20 Apr 2017.
24. Venkatesh AK, Mei H, Kocher KE, Granovsky M, Obermeyer Z, Spatz ES, et al. Identification of emergency department visits in Medicare administrative claims: approaches and implications. Acad Emerg Med. 2017;24:422–31.
25. Xu Y, Zhou X, Suehs BT, Hartzema AG, Kahn MG, Moride Y, et al. A comparative assessment of Observational Medical Outcomes Partnership and Mini-Sentinel common data models and analytics: implications for active drug safety surveillance. Drug Saf. 2015;38:749–65.
26. Zhou X, Murugesan S, Bhullar H, Liu Q, Cai B, Wentworth C, et al. An evaluation of the THIN database in the OMOP common data model for active drug safety surveillance. Drug Saf. 2013;36:119–34.
27. Centers for Medicare and Medicaid Services. Place of service code set. https://www.cms.gov/Medicare/Coding/place-of-service-codes/Place_of_Service_Code_Set.html. Accessed 20 Sep 2018.
28. Klann JG, Phillips LC, Herrick C, Joss MAH, Wagholikar KB, Murphy SN. Web services for data warehouses: OMOP and PCORnet on i2b2. J Am Med Inform Assoc. 2018;25(10):1331–8.
29. Centers for Medicare and Medicaid Services. HCPCS.
30. Bradshaw RL, Matney S, Livne OE, Bray BE, Mitchell JA, Narus SP. Architecture of a federated query engine for heterogeneous resources. AMIA Annu Symp Proc. 2009;2009:70–4.
31. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Liu PJ, et al. Scalable and accurate deep learning for electronic health records. npj Digit Med. 2018:1–10.
32. Lash TL, Fox MP, Cooney D, Lu Y, Forshee RA. Quantitative bias analysis in regulatory settings. Am J Public Health. 2016;106:1227–30.
33. Duan R, Cao M, Wu Y, Huang J, Denny JC, Xu H, et al. An empirical study for impacts of measurement errors on EHR based association studies. AMIA Annu Symp Proc. 2016;2016:1764–73.
34. Outcomes Insights Inc. ConceptQL. https://github.com/outcomesinsights/conceptql. Accessed 30 Sep 2018.
35. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):D267–70.

Acknowledgements

We gratefully acknowledge the influence of the open-source OMOP model specifications on our thinking in creating our data model. In addition, we acknowledge the influence of Sentinel, PCORnet, and i2b2 on our approach, although most of our data model was designed prior to reviewing these models in detail. We also thank Chris Adamson for helpful discussions about organizing the data model in different ways. At the time of writing, all references to the concepts table refer to the OMOP version 5.20 vocabulary table maintained by OHDSI. However, there is no reason that a user could not create their own system of codes with unique identifiers across vocabularies, or use the codes from the National Library of Medicine Metathesaurus [ 35 ].

Funding

This research was self-funded.

Author information

Authors and affiliations

Outcomes Insights, Inc., 2801 Townsgate Road, Suite 330, Westlake Village, CA, 91361, USA

Mark D. Danese, Marc Halperin, Jennifer Duryea & Ryan Duryea


Contributions

MD, MH, RD, and JD contributed to the design of the data model. MH wrote the software code for data transformations. All authors have read and approved the manuscript.

Authors’ information

MD is an epidemiologist who has worked with a wide variety of clinical data sources across therapeutic areas. JD and RD have extensive experience designing software for providers to submit medical bills to insurers. JD has constructed and/or substantially revised OMOP ETL specifications for many commercially available data sources. MD, JD, and RD are collaborators in the Observational Data Health Sciences and Informatics organization.

Corresponding author

Correspondence to Mark D. Danese.

Ethics declarations

Ethics approval and consent to participate

A study protocol, an institutional review board exemption determination (Quorum IRB exemption determination #31309), and a data use agreement were required to access SEER-Medicare data. CPRD receives ethics approval to supply patient data for all protocols. Because of this, and the fact that all data are de-identified, no IRB approval or exemption is required. Our study protocol to access the CPRD data was reviewed by the CPRD Independent Scientific Advisory Committee. Provision of the CPRD data required a data use agreement. All transformations of the raw data were completed as part of the process of creating analysis data sets for approved study protocols. No clinical data was analyzed or reported for this study.

Consent for publication

Not applicable.

Competing interests

Outcomes Insights provides consulting services and license software for implementing research protocols using observational data.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

The Generalized Data Model Table Specifications. (DOCX 75 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article.

Danese, M.D., Halperin, M., Duryea, J. et al. The Generalized Data Model for clinical research. BMC Med Inform Decis Mak 19, 117 (2019). https://doi.org/10.1186/s12911-019-0837-5

Download citation

Received : 12 November 2017

Accepted : 10 June 2019

Published : 24 June 2019

DOI : https://doi.org/10.1186/s12911-019-0837-5


Keywords

  • Claims data
  • Electronic health records



Clinical Trial Data Management and Professionals

By Andy Marker | January 16, 2020 (updated September 16, 2021)


This guide provides professionals with everything they need to understand clinical data management, offering expert advice, templates, graphics, and a sample clinical data management plan. 

Included on this page, you'll find information on how to become a clinical trial data manager, a clinical data management plan template, a clinical data validation plan template, and much more.

What Is Clinical Trial Management?

Clinical trial management refers to the structured, organized regulatory approach that managers take in clinical trial projects to produce timely and efficient project outcomes. It includes developing and maintaining specified or general software systems, processes, procedures, training, and protocols.  

What Is a Clinical Trial Management System (CTMS)?

A clinical trial management system (CTMS) is a type of project management software specific to clinical research and clinical data management. It allows for centralized planning, reporting, and tracking of all aspects of clinical trials, with the end goal of ensuring that the trials are efficient, compliant, and successful, whether across one or several institutions. 

Companies use CTMS for their clinical data management to ensure they build trust with regulatory agencies. Trust is earned as the companies collect, integrate, and validate their clinical trial data with integrity over time. A comprehensive system helps them do so. 

In a 2017 paper, “Artificial intelligence based clinical data management systems: A review,” Gazali discusses CTMS and what makes it worthwhile for investigators: namely, that it helps to authenticate data. Accurate study results and a trail of data collection, as collected through a quality CTMS, lend credence to research study data. Clinical trial data management systems enable researchers to adhere to quality standards and provide assurance that they are appropriately collecting, cleaning, and managing the data.

A clinical data management system also offers remote data monitoring. The sponsor, or principal investigator, may want to monitor the trial from a distance, especially if the organization has many sites. Since the FDA mandates monitoring in clinical trials, and monitoring is generally a large cost for studies, remote monitoring offers a lower-priced option through which sponsors can identify issues and outliers and mitigate them quickly.

Many data management systems are also incorporating artificial intelligence (AI). AI-based clinical data management systems support process automation, data insights analysis, and critical decision making. All of this can happen as your staff inputs the research data. According to a review of clinical data management systems, researchers note that automating all dimensions of clinical data management in trials can take them from mere electronic data capture to something that helps with findings in clinical trials.

The most helpful strategies for implementing clinical data management systems balance risk reduction and lead time. All trial managers want to have their software deployed rapidly. However, it is best to set up the databases thoroughly before the trial. When staff must make software changes during the trial, it can be costly and can affect the validity of the trial data.

Other strategies that help organizations implement a new system include making sure that, prior to deployment, the intended users give input. These users include entities such as the contract research organization (CRO), the sponsor, staff at the investigator site, and any onsite technical support. Staff should respond well to the graphical user interface (GUI). Additionally, depending on software support, the staff can gradually expand the modules to include more functionality, perform module-based programming, and duplicate the hardware. These actions give the staff the most functionality and the software the best chance at success.

How to Compare Clinical Data Management Systems

When deciding which clinical data management system to use, compare the program’s available features and those that your clinical sites need. Additionally, you can compare clinical data management systems by reviewing the installation platforms, pricing, technical support, and number of allowed users. 

For programs that collect data on paper and send it to data entry staff, the data entry portal should be simple and allow for double entry and regular oversight. 

In general, here are the main features to compare in a clinical data management system: 

  • 21 CFR Part 11 Compliance: Electronic systems must provide assurance of authentic records. 
  • Document Management: All documents should be in a centralized location, linked to their respective records. 
  • Electronic Data Capture (EDC): Direct clinical trial data collection, as opposed to paper forms. 
  • Enrollment Management: Research studies can use this data (from interactive web or voice response systems) to enroll, randomize, and manage patients. 
  • HIPAA compliance: Ensure compliance with the Health Insurance Portability and Accountability Act to protect patients’ information.
  • Installation: Identify whether you want a cloud-based or on-premises solution and if you need mobile deployment (iOS or Android).
  • Investigator and Site Profiling: Use this function to rapidly identify the feasibility of possible investigators and sites.
  • Monitoring: The system should offer a calendar, scheduling capabilities, adverse and severe adverse event tracking, trip reporting, site communication, and triggers. 
  • Number of Users: How many users can the software handle? Is there a minimum number of required users? Does the software provide levels of accessibility and price based on the number of users?
  • Patient Database: Separate from recruitment and enrollment, the patient database is a record of previous contacts that you can potentially draw from for future trials. 
  • Payment Portal: Pay out stipends, contracts, and other finances related to the research project.
  • Pricing: Check whether the software company offers free trials, free or premium options, monthly subscriptions, annual subscriptions, or a one-time license. 
  • Recruiting Management: This function helps streamline recruitment by targeting potential trial patients with web recruitment and prescreening. 
  • Scheduling: Use this feature to keep track of visits and events.
  • Study Planning and Workflows: This function enables you to track all required study pieces from the beginning and optimize each piece with workflows. 
  • Support: Check if the software company offers 24-hour issue support and training on the software.

What Is Clinical Data Management?

Clinical data management (CDM) is the part of clinical trial management that deals specifically with information that comes out of the trials. In order to yield ethical, repeatable results, researchers must document their patients’ medical status — including everything relative to that status — and the trial’s interventions.

Clinical data management evolved from drug companies’ need for an honest path from their research to their findings; in short, their data had to be reproducible. CDM helps evolve a standards-based approach, and many regulators are continually imposing their requirements on it. For instance, paper is no longer favored as a collection method; most clinical trials prefer software systems that improve the timeliness and quality of data. 

In one model for data management, the cycle begins when the clinical trial is in the planning stages and goes through the final analysis and lockdown of the data. The stages for data management are as follows:

  • Plan: The data manager prepares the database, forms, and overall plan.
  • Collect: Staff gathers data in the course of the trial.
  • Assure: The data manager determines if the data plan and tools meet the requirements. 
  • Identify: Staff and the data manager identify any issues or risks.
  • Preserve: The data manager preserves the data already collected and mitigates risks.
  • Integrate: The data manager oversees different datasets and information mapped together for consistency.
  • Analyze: The statisticians analyze the mapped data trends and outcomes. 
  • Lock: The data manager locks the database for integrity.

Model for Data Management in Clinical Trials


When it comes to data, clinical research has several areas of responsibility. Sponsors can split these functions among several staff or, in smaller studies, assign them to the main data manager. These functions include the following:

  • Clinical Systems: Any software or technology used.
  • Data Management: Data acquisition, coding, and standardization.
  • Data Review and Analytics: Quality management, auditing, and statistical analysis of the collected data.
  • Data Standards: Checking against regulatory requirements.
  • Innovation: Using tools and theory that coordinate with the developing field. For more innovative templates to use in clinical trials, see “Clinical Trial Templates to Start Your Clinical Research.”

Clinical Research Data Areas of Responsibility


Clinical data management is one of the most critical functions in overall clinical trial management. Staff collects data from many different sources in a clinical trial — some will necessarily be from paper forms filled out by the patient, their representative, or a staff member on their behalf. However, instead of paper, some clinics may use devices such as tablets or iPads to fill out this direct-entry data electronically. 

Clinical data management also includes top-line data, such as the demographic data summary, the primary endpoint data, and the safety data. Together, this constitutes the executive summary for clinical trials. Companies often issue this data as a part of press releases. Additional clinical trial data management activities include the following:

  • Case report form (CRF) design, annotation, and tracking
  • Central lab data
  • Data abstraction and extraction
  • Data archiving
  • Data collection
  • Data entry and validation
  • Data extraction
  • Data queries and analysis
  • Data storage and privacy
  • Data transmission
  • Database design, build, and testing
  • Database locking
  • Discrepancy management
  • Medical data coding and processing
  • Patient recorded data
  • Severe adverse event (SAE) reconciliation
  • Study metrics and tracking
  • Quality control and assurance
  • User acceptance testing
  • Validation checklist

Since there are many different types of data coming from many different sources, some data managers have become experts in hybrid data management — the synchronization required to not only make disparate data relate to each other, but also to adequately manage each type of data. For example, one study could generate data on paper from both the trial site and from a contract research organization, electronic data from the site, and clinical data measurements from a laboratory.

The Roles and Responsibilities in Clinical Data Management

Clinical data management software assigns database access limitations based on the assigned roles and responsibilities of the users. This access control ensures there is an audit trail and that users can only access their respective required functionalities, without the ability to make other changes.

All staff members, whether a manager, programmer, administrator, medical coder, data coordinator, quality control staff, or data entry person, have differing levels of access to the software system, as delineated in the protocol. The principal investigator can use the CDMS to restrict these access levels.

What Is Clinical Trial Data Management (CDM)?

Clinical trial data management (CDM) is the process of a program or study collecting, cleaning, and managing subject and study data in a way that complies with internal protocols and regulatory requirements. It is simultaneously the initial phase in a clinical trial, a field of study, and an aspirational model. 

With properly collected data in clinical trials, the study can progress and result in reliable, high-quality, statistically appropriate conclusions. Proper data collection also decreases the time from drug development to marketing. Further, proper data collection involves a multidisciplinary team, such as the research nurses, clinical data managers, investigators, support personnel, biostatisticians, and database programmers. Finally, CDM enables high-quality, understandable research, which can be capitalized on in its field and across many disciplines, according to the National Institutes of Health (NIH).

In clinical trials, data managers perform setup during the trial development phase. Data comes from the primary sources, such as site medical records, laboratory results, and patient diaries. If the project uses paper-based CRFs, staff members must transcribe them, then enter this source data into a clinical trial database. They enter paper-based forms twice, known as double data entry, and compare them, per best practice. This process significantly decreases the error rate from data entry mistakes. Electronic CRFs (eCRFs) enable staff to enter source data directly into the database. 
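A hedged sketch of the double data entry comparison: two independent transcriptions of the same paper CRF are compared field by field, and any mismatch is flagged for discrepancy review. Field names and values are illustrative.

```python
def compare_double_entry(first_entry, second_entry):
    """Return the fields where two independent entries of one CRF disagree."""
    discrepancies = []
    for field in sorted(first_entry.keys() | second_entry.keys()):
        v1, v2 = first_entry.get(field), second_entry.get(field)
        if v1 != v2:
            discrepancies.append({"field": field, "entry_1": v1, "entry_2": v2})
    return discrepancies

entry_1 = {"subject_id": "001", "systolic_bp": "120", "visit_date": "2020-01-16"}
entry_2 = {"subject_id": "001", "systolic_bp": "210", "visit_date": "2020-01-16"}
print(compare_double_entry(entry_1, entry_2))  # flags the transposed blood pressure
```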

As with any project, the financial and human resources in clinical trials are finite. Coming up with and sticking to a solid data management plan is crucial — it should include structure for the research personnel, resources, and storage. A clinical trial is a huge investment of time, people, and money. It warrants expert-level management from its inception.

Clinical Data Management Plans

Clinical data management plans (DMPs) outline all the data management work needed in a clinical research project. This includes the timeline, any milestones, and all deliverables, as well as strategies for how the data manager will deal with disparate data sets. 

Regulators do not require DMPs, but they expect and audit them in clinical research. Thus, DMPs should be comprehensive, and all stakeholders should agree on them. They should also be living documents that staff regularly update as the study evolves and the various study pieces develop.

For example, during one study, the study manager might change the company used for laboratory work. This affects the DMP in two ways: First, staff needs to develop the data sharing agreement with the new company, and second, they need to integrate the data from both laboratories into one dataset at the end of the trial. The DMP should describe both.

When creating DMPs, you should also bear in mind any industry data standards so that the research can also be valuable outside of the discrete study. The Clinical Data Acquisition Standards Harmonization (CDASH) recommends 16 standards for data collection fields to ensure consistency in data across different studies.

The final piece of standardization in DMPs is the use of a template, which provides staff with a solid place to start developing a DMP specific to their study. Sponsors may have a standard template they use across their projects to help reduce the complexity inherent in clinical trials.

Data Management Plan Template for Clinical Trials

Sample Data Management Plan Template

This data management plan template provides the required contents of a standard clinical trial data management plan, with space and instructions to input elements such as the data validation process, the verification of database setup and implementation processes, and the data archival process. 

Download Data Management Plan Template - Word

Sample Data Management Plan for Clinical Trials

This sample data management plan shows a fictitious prospective, multicenter, single-arm study and its data management process needs. Over the study's two years, the data manager should regularly update this plan to reflect the study's evolving needs and document each change and update. Examples of sections include the databases used, how data will be entered and cleaned, and how staff will integrate different data sets collected in the study.

Download Sample Data Management Plan - Word

Clinical Trial Data Validation Plan

Data validation involves resolving database queries and inconsistencies by checking the data for accuracy, quality, and completeness. A data validation plan in clinical trials has all the variable calculations and checks that data managers use to identify any discrepancies in the dataset.
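The checks in a validation plan are typically explicit, rule-by-rule edit checks: range checks, completeness checks, and cross-field consistency checks. A minimal sketch with invented rules and field names:

```python
def validate_record(record):
    """Run illustrative edit checks; a real plan enumerates rules per variable."""
    issues = []
    # Completeness and range check on a single field.
    if record.get("age") is None:
        issues.append("age missing")
    elif not 0 <= record["age"] <= 120:
        issues.append("age out of plausible range")
    # Cross-field consistency check (ISO dates compare correctly as strings).
    visit, consent = record.get("visit_date"), record.get("consent_date")
    if visit and consent and visit < consent:
        issues.append("visit date precedes consent date")
    return issues

print(validate_record({"age": 130, "visit_date": "2020-01-01",
                       "consent_date": "2020-02-01"}))
```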

When the data is final, the database administrator locks it to ensure no further changes are made, as they could interrupt the integrity of the data. During reporting and analysis, experts may copy the data and reformat it into tables, lists, and graphs. Once the analysts complete their work, they report the results. When they have significant findings, they may create additional tables, lists, and graphs to present as part of the results. They then integrate these results into higher-level findings documentation. Examples of this type of documentation include investigator’s brochures or clinical case study reports (CSRs). Finally, the data manager archives the database. 

The above steps are important because they preserve the integrity of the data in the database. However, managers do not need to perform them in a strict order. Some studies may need more frequent data validation due to the high volume of data they produce, while other studies may produce intermediate analysis and reporting as part of their predetermined requirements. Finally, due to the complexity of some studies, the data manager or analyst may need to query, which means running a data request in a database and determining cursory results so that they may adjust the protocol.

Use this template to develop your own data validation plan. This Word template includes space and instructions for you to develop a data validation plan that you can include in your data management plan or use as a stand-alone document. Examples of sections include selecting and classifying the computer systems, validation protocol, and validation reporting.

Clinical Trial Data Validation Plan Template

Download Data Validation Plan - Word

Data Management Workflow

A data management workflow is the process clinical researchers use to handle their data, from data collection design to electronic archival and presentation of findings. This includes getting through the entry process, any batch validation, discrepancy management, coding, reconciliations, and quality control plans.

This workflow starts when researchers generate a CRF, whether manually or electronically, and continues through the final lock on the database. The data manager should perform quality checks and data cleaning throughout the workflow. The workflow steps for a data manager are as follows:

  • CRF Design: This initial design step forms the basis of initial data collection.
  • Database Design: The database should include space for all data collected in the study.
  • Data Mapping: This step integrates data from different forms or formats so researchers can consistently report it. 
  • SAE Reconciliation: Data managers should regularly review and correct severe adverse events and potential events. 
  • Database Locking: Once a study is complete, the database manager should lock the database so that no one can change the data.

Data Management Workflow

Clinical Trial Data Audits

A clinical trial data audit is a review of the information collected in order to ensure the quality, accuracy, and appropriateness for the stated research requirements, per the study protocol. Regulatory authorities, sponsors, and internal study staff can conduct two varieties of audit: overall and database-specific. 

Regulators use database audits to ensure that no one has tampered with the data. In general, there must be an audit trail to know which user made changes to what and when in the database. For example, the auditors will look at record creation, modification, and deletion, noting the usernames, dates, and times. FDA 21 CFR Part 11 includes this as a part of fraud detection, and requires that there is a complete history of the recordkeeping system and clinical trial data transparency.
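A minimal sketch of the kind of append-only audit-trail entry such a review expects, recording who changed what, when, and the before and after values; the structure is illustrative, not a prescribed 21 CFR Part 11 format.

```python
from datetime import datetime, timezone

def audit_entry(username, table, record_id, field, old_value, new_value, action):
    """Build one append-only audit-trail row for a data change."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "username": username,
        "action": action,            # "create", "modify", or "delete"
        "table": table,
        "record_id": record_id,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
    }

audit_log = []
audit_log.append(
    audit_entry("jsmith", "vitals", 42, "systolic_bp", "210", "120", "modify"))
```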

The data manager develops templates for auditing the study during the study development phase and performs internal audits as part of the study's quality management.

This free clinical trial data management audit checklist template will help you develop your own checklist. This Excel template lets you show the status of your audit in an easy color-coded display, the category and tasks to review, and what criteria you require. It brings all your audit requirements and results together. 

CDM Clinical Data Management Audit Checklist

Download Clinical Data Management Audit Checklist - Excel

Quality Management in Clinical Trials

Data quality management (DQM) refers to the practices that ensure clinical information is of high value. In a clinical trial, DQM starts when staff first acquires the information and continues until the findings are distributed. DQM is critical in providing accurate outcomes. 

The factors that influence the quality of clinical data include how well the study investigators develop and implement each of the following data pieces: 

  • Case Report Forms (CRF): Design the CRF in parallel with the protocol so that the data collected by staff is complete, accurate, and consistent. 
  • Data Conventions: Data conventions include dates, times, and acronyms. Data managers should set these conventions during study development, especially if there are multiple study locations and investigators. 
  • Guidelines for Monitoring: The overall data quality is contingent on the quality of the monitoring guidelines established. 
  • Missing Data: Missing data are those values not available that could change the analysis findings. During study development, investigators and analysts should determine how they will handle missing data.
  • Verification of Source Data: Staff must verify that the source data is complete and accurate during data validation.

Regulations, Guidelines, and Standards in Clinical Data Management

Different regulations, guidelines, and standards govern the clinical data management industry. The Clinical Data Interchange Standards Consortium (CDISC) is a global organization that holds clinical studies accountable to clinical trial data standards, international regulations, institutional and sponsor standard operating procedures (SOPs), and state laws.

There are standard operating procedures and best practices in clinical trial data management that are widespread. CDISC maintains two key standards: the Study Data Tabulation Model Implementation Guide for Human Clinical Trials (SDTMIG), mandated by the U.S. Food and Drug Administration (FDA), and the Clinical Data Acquisition Standards Harmonization (CDASH). Also, in the industry, the Society for Clinical Data Management (SCDM) releases the Good Clinical Data Management Practices (GCDMP) guidelines and administers the International Association for Continuing Education and Training (IACET) credential for certified clinical data managers. The National Accreditation Board for Hospitals & Healthcare Providers (NABH) provides additional guidance, such as pharmaceutical study auditing checklists. Finally, Good Clinical Practice (GCP) guidelines discuss ethical and quality standards in clinical research.

A trial conducted under the appropriate standards ensures that staff followed the protocol and treated patients according to it. Ultimately, this demonstrates the integrity and reproducibility of the study and supports its acceptance in the industry.

Case Report Forms in Data Management

In data management, CRFs are the main tool researchers use to collect information from participants. Researchers design CRFs based on the study protocol; in them, they document all patient information required by the protocol for the duration of the study.

CRFs should comply with all regulatory requirements and enable efficient analysis, decreasing the need for data mapping during any data exchange. When longer than one page, the CRF is known as a CRF book, and each visit adds to the book. The main parts of a CRF are the header, the efficacy-related modules, and the safety-related modules (a small structural sketch follows this list):

  • CRF Header: This portion includes the patient identification information and study information, such as the study number, site number, and subject identification number.
  • Efficacy-Related Modules: This portion includes the baseline measurements, any diagnostics, the main efficacy endpoints of the trial, and any additional efficacy testing. 
  • Safety-Related Modules: This portion contains the patient’s demographic information, any adverse events, medical history, physical examination findings, medications, confirmation of eligibility, and any reasons for withdrawal from the study.
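To make the structure concrete, here is a minimal sketch of these three sections as Python dataclasses. The class and field names are illustrative assumptions chosen for this example, not a regulatory or CDISC-defined layout.

```python
# An illustrative model of the three CRF sections; field names are
# assumptions for the example, not a standard.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CRFHeader:
    study_number: str
    site_number: str
    subject_id: str

@dataclass
class EfficacyModule:
    baseline_measurements: Dict[str, float]
    primary_endpoint_result: float

@dataclass
class SafetyModule:
    demographics: Dict[str, str]
    adverse_events: List[str]
    concomitant_medications: List[str]

@dataclass
class CaseReportForm:
    header: CRFHeader
    efficacy: EfficacyModule
    safety: SafetyModule
```

Grouping fields this way mirrors how CRF modules are reviewed: header fields identify the record, while efficacy and safety modules are validated separately.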

What Is the Role of a Clinical Data Manager?

In a clinical trial, the data manager is the person who ensures that the research staff collects, manages, and prepares the resulting information accurately, comprehensively, and securely. This is a key role in clinical research, as the person is involved in the study setup, conduct, closeout, and some analysis and reporting.


Melissa Peda , Clinical Data Manager at Fred Hutch Cancer Research Center , says, “Being a clinical data manager, you have to be very detail-oriented. We write up very specific instructions for staff. For example, the specifications to a program’s database include one document that could easily have 1,000 rows in Excel, and it needs to be perfect for queries to fire in real time. Code mistakes can put your project behind, so they must do their review with a close eye. You must also be logical and think through the project setup. A good clinical data manager must be detailed, so the programmers and other staff can do their thing.” 

Krishnankutty et al. developed an overview of best practices for data management in clinical research. In their article, published in the Indian Journal of Pharmacology, they note that the need for strong clinical data management arose from the pharmaceutical industry's push to fast-track drug development with high-quality data, regardless of the type of data. Clinical data managers can expect to work with many different types of clinical data; the most common types include the following:

  • Administrative data
  • Case report forms (CRFs)
  • Electronic health records
  • Laboratory data
  • Patient diaries
  • Patient or disease registries
  • Safety data

Clinical data managers often must oversee the analysis of the data as well. Data analysis in clinical trial data management is exacting: it requires a solid dataset and an analyst who can explain the findings. Regulatory agencies, along with other companies and professionals, check the findings and analysis, so both must be accurate and understandable.

Education and Credentials of a Clinical Data Manager

Professionals in clinical data management receive training in clinical trial data management and often hold the Certified Clinical Data Manager (CCDM) credential. Studies tend to have better outcomes when they are executed by a competent CDM team with validated skill sets and continued professional development.

To become certified, the applicant must have the appropriate education and experience, including the following:

  • A bachelor’s degree and two or more years of full-time data management experience.
  • An associate’s degree and three or more years of full-time data management experience.
  • Four years of full-time data management experience.
  • Part-time data management experience that adds up to the requirements above. 


Raleigh Edelstein , a clinical data manager and EDC programmer, discusses the credentialing in this field. “Anyone can excel in this profession,” she says. “A CRA — a clinical research associate, also called a clinical monitor or a trial monitor — may need this credential more, as their profession is more competitive, and their experience is more necessary in trials. But if the credential makes you more confident, then I say go for it. Your experience and confidence matter.”

There are several degrees with an emphasis on clinical research that can also teach the necessary technical skills. In addition to many online options, these include the following, or a combination of the following:

  • Associate of Science in biology, mathematics, or pharmacy.
  • Bachelor of Science in one of the sciences.
  • Post-Master's certificate in clinical data management, or a certificate related to medical device and drug development.
  • Master of Science in clinical research, biotechnology, or bioinformatics.
  • Doctor of Nursing Practice.
  • Doctor of Philosophy in any clinical research area.

These degree programs include concepts that help data managers understand what clinical studies need. They especially focus on survey design and data collection, but also include the following:

  • Biostatistics
  • Clinical research management and safety surveillance
  • Compliance, ethics, and law
  • New product business and strategic planning
  • New product research and development

These degree programs offer coursework that improves the relevant clinical research skills. Many of the courses are introductory to clinical research, trials, and pharmacology, and others include the following:

  • Business processes
  • Clinical outsourcing
  • Clinical research biostatistics
  • Clinical trial design
  • Compliance and monitoring
  • Data collection strategies
  • Data management
  • Electronic data capture
  • Epidemiology
  • Ethics in research
  • Federal regulatory issues 
  • Health policy and economics
  • Human research protection
  • Medical devices and product regulation
  • Patient recruitment and informed consent
  • Pharmaceutical law
  • Review boards
  • Worldwide regulations for submission

Clinical data managers can get involved with several professional organizations worldwide, including the following:

  • The Association for Clinical Data Management (ACDM): This global professional organization supports the industry by providing additional resources and promoting best practices. 
  • The Association Française de Data Management Biomédicale (DMB): This French data management organization shares information and practices among professionals. 
  • International Network of Clinical Data Management Associations (INCDMA):  Based in the United Kingdom, this professional network exchanges information and discusses relevant professional issues. 
  • The Society for Clinical Data Management (SCDM): This global organization awards the CCDM credential to professionals, provides additional education, and facilitates conferences in clinical data management.

FAQs about Clinical Trial Managers

The field of clinical data management is quickly expanding in many forms to support the need for new research. Below are some frequently asked questions.

How do I become a clinical trial manager?  

To become a clinical trial manager, you must obtain the appropriate education, experience, and credentialing, as detailed above. 

What is better: a Master’s in Health Administration or a Master’s in Health Sciences?  

To work as a clinical data manager, either degree program is appropriate. Your choice depends on your interest.

What can you do with a degree in biotechnology or bioenterprise?

Biotechnology develops the technology that aids biological research, while bioenterprise markets and sells the products of biotechnology.

What is a clinical application analyst? 

A clinical application analyst is a professional who helps clinics evaluate software systems and vendors. 

What is a clinical data analyst?  

A clinical data analyst is a professional who analyzes data from clinical trials, and develops and maintains databases.

Contract Research Organizations for Data Management Services

Contract research organizations (CROs) are companies that provide outsourced research services to the pharmaceutical, biotechnology, and research and development industries. Studies can hire them, typically to keep costs low, to perform everything from overall project management and data management to specialized technical jobs.

Studies can hire CROs that specialize as clinical trial data management companies so they don’t have to maintain all the necessary skills in-house. According to Melissa Peda, “A consultant may have the expertise that someone already working in the organization may not have, so they make sense to bring in.” Further, a contractor from outside the business can bring impartiality to the project.

According to Raleigh Edelstein, “A third-party person in charge of data management may be necessary because you don’t have to worry about the lack of company loyalty that the data may need.” 

CROs can offer skills such as the following:

  • Annotation and review
  • Coding and validation
  • Database export, transfer, and locking
  • Database integration
  • Database setup and validation
  • Double data entry and third-party review of discrepancies
  • Form design
  • Planning, such as project management and data management plans 
  • Quality assessments and auditing
  • Software implementation and training
  • SAE reconciliation

Related Topics in Clinical Data Management

The following are related topics to clinical data management:

  • Application Analyst: This position deals with the software side of clinical trials. Examples of their work include choosing software, designing databases, and writing queries. 
  • Clinical Data Analyst: A professional who examines and verifies that clinical study data is appropriate and means what it is supposed to mean. 
  • Clinical Research Academic Programs: Entry-level professional positions in clinical trials often require a minimum of a bachelor’s degree. 
  • Clinical Research Associate: This clinical trial staff member designs and performs clinical studies. 
  • Laboratory Informatics: The field of data and computational systems specialized for laboratory work. 
  • Laboratory Information Management System (LIMS): LIMS enables collection and analysis of data from laboratory work. LIMS is specialized to work in different environments, such as manufacturing and pharmaceuticals. 
  • Scientific Management: This management theory studies workflows, applying science to process engineering and management.


Data Management in Clinical Research

Texila International Journal

Clinical data management helps to produce a drastic reduction in time from drug development to marketing. Team members of CDM are actively involved in all stages of a clinical trial, right from inception to completion. They should have adequate process knowledge to maintain the quality standards of CDM processes; case report form (CRF) designing, CRF annotation, database designing, data entry, data validation, discrepancy management, medical coding, data extraction, and database locking are assessed for quality at regular intervals during the trial. Presently there is an increased demand to improve CDM standards to meet the regulatory requirements and stay ahead of the competition by means of faster commercialization of the product. With the implementation of regulatory compliant data management tools, the CDM team can meet these demands. Additionally, it is becoming mandatory for companies to submit data electronically. It is advocated that CDM professionals should meet appropriate expectations, set standards for data quality, and have a drive to adapt to the rapidly changing technology. See Krishnankutty B, Bellary S, Kumar NBR, Moodahadu LS. Data management in clinical research: An overview. Indian Journal of Pharmacology 2012;44(2):168-172. doi: 10.4103/0253-7613.93842.

Related Papers


Jagadeeswara Rao Gaddale

In the clinical trial process, precise and concise data collection at the source is imperative and requires statistical analysis to be performed to derive the primary and secondary endpoints. The quality of raw data collection has a direct impact on the statistical outputs generated as per the statistical analysis plan. Hence, the data collection tools used for data transcription must be clear, understandable, and precise, which helps the investigator to provide the accurate subject data. Clinical Data Acquisition Standards Harmonization (CDASH) provides guidance to develop the case report form (CRF) for domains that are commonly used for the majority of the clinical trials across the therapeutic areas. This white paper describes the importance of CDASH standards, its advantages and its impact on the efforts and the cost in designing the CRF.

https://ijshr.com/IJSHR_Vol.4_Issue.4_Oct2019/IJSHR_Abstract.0013.html

International Journal of Science and Healthcare Research (IJSHR)

Clinical Data Management (CDM) is a critical phase in clinical research which results in the collection of reliable, high-quality, and statistically sound data. It consists of three phases: start-up, conduct, and close-out. The start-up phase consists of activities like CRF creation and design, database design and testing, edit-check preparation, and User Acceptance Testing (UAT), along with the preparation of documents such as the Data Management Plan, CRF Completion Guidelines, Data Entry Guidelines, and Data Validation Plan. The conduct phase is the longest and most critical phase, where data capture, data cleaning, data reconciliation, medical coding, and data validation take place, with regular evaluation of data known as interim analysis, along with documentation such as the Query Management Form, Revision Request Form, and Post-Production Changes. The close-out phase is the success phase for data managers, where all the clean data are frozen and locked. After the confirmation of locking, all the data will be in read-only mode. Finally, the database is locked and all the documents are archived. Keywords: Clinical Data Management, Clinical Trials, User Acceptance Testing, CRF, Query Management, Clinical Research.


Ferran Torres , Christian Ohmann

Background: The use of Clinical Data Management Systems (CDMS) has become essential in clinical trials to handle the increasing amount of data that must be collected and analyzed. With a CDMS, trial data are captured at investigator sites with "electronic Case Report Forms". Although more and more of these electronic data management systems are used in academic research centres, an overview of CDMS products and of available data management and quality management resources for academic clinical trials in Europe is missing. Methods: The ECRIN (European Clinical Research Infrastructure Network) data management working group conducted a two-part standardized survey on data management, software tools, and quality management for clinical trials. The questionnaires were answered by nearly 80 centres/units (with overall response rates of 47% and 43%) from 12 European countries and EORTC. Results: The survey shows that about 90% of centres have a CDMS in routine use. Of these CDMS, nearly 50% are commercial systems; open source solutions do not play a major role. In general, solutions used for clinical data management are very heterogeneous: 20 different commercial CDMS products (and 7 open source solutions), in addition to 17/18 proprietary systems, are in use. The most widely employed CDMS products are MACRO™ and Capture System™, followed by solutions that are used in at least 3 centres: eResearch Network™, CleanWeb™, GCP Base™ and SAS™. Although quality management systems for data management are in place in most centres/units, there are some deficits in the area of system validation. Conclusions: Because the considerable heterogeneity of data management software solutions may be a hindrance to cooperation based on trial data exchange, standards like CDISC (Clinical Data Interchange Standards Consortium) should be implemented more widely. In a heterogeneous environment the use of data standards can simplify data exchange, increase the quality of data, and prepare centres for new developments (e.g., the use of EHRs for clinical research). Because data management and the use of electronic data capture systems in clinical trials are shaped by regulations and guidelines, ethical concerns are discussed. In this context quality management becomes an important part of compliant data management. To address these issues ECRIN will establish certified data centres to support electronic data management and associated compliance needs of clinical trial centres in Europe.


Yasmine Probst, Applied Clinical Informatics

Clinical trials are an important research method for improving medical knowledge and patient care. Multiple international and national guidelines stipulate the need for data quality and assurance. Many strategies and interventions are developed to reduce error in trials, including standard operating procedures, personnel training, data monitoring, and design of case report forms. However, guidelines are nonspecific in the nature and extent of necessary methods. This article gathers information about current data quality tools and procedures used within Australian clinical trial sites, with the aim to develop standard data quality monitoring procedures to ensure data integrity. Relevant information about data quality management methods and procedures, error levels, data monitoring, staff training, and development were collected. Staff members from 142 clinical trials listed on the National Health and Medical Research Council (NHMRC) clinical trials Web site were invited to complet...


Sponsored Article

Advancing Clinical Research Through Effective Data Delivery

Novel data collection and delivery strategies help usher the clinical research industry into its next era.


The clinical research landscape is rapidly transforming. Instead of viewing patients as subjects, sponsors now use the patients’ input to help reduce the burden they face during trials. This patient-centric approach is necessary to ensure that the clinical trial staff recruit and retain enough participants and it has led the industry to modify all stages of the clinical trial life cycle, from design to analysis. “What we are seeing is a lot more openness to innovations, digitization, remote visits for the patient, and telemedicine, for example,” said Rose Kidd, the president of Global Operations Delivery at ICON, who oversees a variety of areas including site and patient solutions, study start up, clinical data science, biostatistics, medical writing, and pharmacovigilance. “It is becoming a lot more decentralized in terms of how we collect clinical data, which is really constructive for the industry, and also hugely positive for patients.” 

The Increasing Complexity of Clinical Trials

Accurate data is central to the success of a clinical trial. “Research results are only as reliable as the data on which they are based,” Kidd remarked. “If your data is of high quality, the conclusions of that data are trustworthy.” Sponsors are now collecting more data than ever through their trials.[1] This allows them to observe trends and make well-informed decisions about a drug’s or device’s development. 

However, these changes in data volume complicate how clinicians design and run their clinical trials. They must capture enough data to fully assess the drug or device without severely disrupting a patient’s lifestyle. Additionally, the investigational sites must ensure that they have enough staff to collect the data in the clinic or through home visits and keep up with their country’s clinical trial regulations. They also must develop efficient data collection and delivery strategies to ensure a trial’s success. While poorly collected data can introduce noise, properly collected data allows clinical trial leads to quickly consolidate and analyze this information.[2] And they often require support with this process. 

Innovative Solutions to Improve Data Collection and Delivery 

Fortunately, sponsors can find that support with ICON, the healthcare intelligence and clinical research organization. “We essentially advance clinical research [by] providing outsourced services to the pharmaceutical industry, to the medical device industry, and also to government and public health organizations,” Kidd explained. With expertise in numerous therapeutic areas, such as oncology, cell and gene therapies, cardiovascular, biosimilars, vaccines, and rare diseases to mention just a few, ICON helps the pharmaceutical industry efficiently bring devices and drugs to the patients that need them, while ensuring patient safety and meeting local regulations. 

One of the areas that Kidd’s team is specifically focused on is providing solutions to advance the collection, delivery, and analysis of clinical data.

The platform that ICON provides to support sponsors in this regard not only stores data directly entered into the system by clinicians during their site or home visits, but also serves as an electronic diary for patients to remotely record their symptoms as they happen. This makes it easier for patients to participate in clinical trials while maintaining their jobs and familial responsibilities. Moreover, this solution provides clinical trial staff with insights into their data as they emerge, such as adverse event profiles and the geographical spread of these events. However, this requires that the data is input into the system in the same manner at every participating site. 

To address this problem, ICON’s solutions also include a site-facing web portal that helps to reduce the training burden by standardizing data capture and allowing site teams to learn key information about a drug or device. The portal also offers a visit-by-visit guide to ensure that clinicians are asking the necessary questions for a particular visit and helps them remember how to record the data correctly. “It is training at their fingertips when they need it most,” Kidd said. Solutions like these help sponsors obtain the high-quality clinical data that they need to progress from the trial to the market.

Clinical research is evolving and data strategies that support sites and patients alike must similarly evolve. With the right expertise, experience, and technology solutions, ICON is supporting better decision-making by sponsors.

  1. Crowley E, et al. Using systematic data categorisation to quantify the types of data collected in clinical trials: The DataCat project. Trials. 2020;21(1):535.
  2. McGuckin T, et al. Understanding challenges of using routinely collected health data to address clinical care gaps: A case study in Alberta, Canada. BMJ Open Qual. 2022;11(1):e001491.



ORIGINAL RESEARCH article

This article is part of the research topic Reproducible Analysis in Neuroscience.

Harmonizing data on correlates of sleep in children within and across neurodevelopmental disorders: lessons learned from an Ontario Brain Institute cross-program collaboration (provisionally accepted)

  • 1 Department of Psychiatry and Behavioural Neurosciences, Faculty of Health Sciences, McMaster University, Canada
  • 2 Offord Centre for Child Studies, Faculty of Health Sciences, McMaster University, Canada
  • 3 School of Rehabilitation Science, Faculty of Health Sciences, McMaster University, Canada
  • 4 CanChild, Faculty of Health Sciences, McMaster University, Canada
  • 5 Indoc Research, Canada
  • 6 Ontario Brain Institute, Canada
  • 7 The Centre for Addiction and Mental Health, Canada
  • 8 Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, Canada
  • 9 Holland Bloorview Kids Rehabilitation Hospital, Canada
  • 10 Department of Pediatrics, Faculty of Health Sciences, McMaster University, Canada
  • 11 Center of Excellence for Rehabilitation Medicine, University Medical Center Utrecht, Netherlands
  • 12 Brain Center, University Medical Center Utrecht, Netherlands


There is an increasing desire to study neurodevelopmental disorders (NDDs) together to understand commonalities, develop generic health promotion strategies, and improve clinical treatment. Common data elements (CDEs) collected across studies involving children with NDDs afford an opportunity to answer clinically meaningful questions. We undertook a retrospective, secondary analysis of data pertaining to sleep in children with different NDDs collected through various research studies. The objective of this paper is to share lessons learned for data management, collation, and harmonization from a sleep study in children within and across NDDs from large, collaborative research networks in the Ontario Brain Institute (OBI).

Three collaborative research networks contributed demographic data and data pertaining to sleep, internalizing symptoms, health-related quality of life, and severity of disorder for children with six different NDDs: autism spectrum disorder; attention deficit/hyperactivity disorder; obsessive compulsive disorder; intellectual disability; cerebral palsy; and epilepsy. Procedures for data harmonization, derivation, and merging were shared, and examples pertaining to severity of disorder and sleep disturbances were described in detail.

Important lessons emerged from the data harmonization procedures: prioritizing the collection of CDEs to ensure data completeness; ensuring unprocessed data are uploaded for harmonization in order to facilitate timely analytic procedures; maintaining variable naming that is consistent with data dictionaries at the time of project validation; and holding regular meetings with the research networks to discuss and overcome challenges with data harmonization. Buy-in from all research networks at study inception, and oversight from a centralized infrastructure (OBI), highlighted the importance of collaboration in collecting CDEs and facilitating data harmonization to improve outcomes for children with NDDs.

Keywords: Data, harmonization, curation, Neurodevelopmental Disorders, Child, Sleep

Received: 13 Feb 2024; Accepted: 19 Apr 2024.

Copyright: © 2024 McPhee, Vaccarino, Naska, Nylen, Santisteban, Chepesiuk, Andrade, Georgiades, Behan, Iaboni, Wan, Aimola, Cheema and Gorter. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Dr. Patrick G. McPhee, Department of Psychiatry and Behavioural Neurosciences, Faculty of Health Sciences, McMaster University, Hamilton, L8S4L8, Ontario, Canada



Data management in clinical research: An overview

Krishnankutty B, Bellary S, Kumar NBR, Moodahadu LS. Indian Journal of Pharmacology, 01 Mar 2012, 44(2):168-172. https://doi.org/10.4103/0253-7613.93842 PMID: 22529469 PMCID: PMC3326906


How do we define ‘high-quality’ data? High-quality data should be absolutely accurate and suitable for statistical analysis, meeting the protocol-specified parameters and complying with the protocol requirements. This implies that in case of a deviation from the protocol specifications, we may consider excluding the patient from the final database. It should be borne in mind that in some situations regulatory authorities may be interested in looking at such data. Similarly, missing data is a matter of concern for clinical researchers; high-quality data should have minimal or no missing values. Most importantly, high-quality data should possess only an ‘acceptable level of variation’ that would not affect the conclusions of the study on statistical analysis. The data should also meet the applicable regulatory requirements specified for data quality.

Tools for CDM

Many software tools are available for data management, and these are called Clinical Data Management Systems (CDMS). In multicentric trials, a CDMS has become essential to handle the huge amount of data. Most of the CDMS used in pharmaceutical companies are commercial, but a few open source tools are available as well. Commonly used CDM tools are ORACLE CLINICAL, CLINTRIAL, MACRO, RAVE, and eClinical Suite. In terms of functionality, these software tools are more or less similar, and there is no significant advantage of one system over another. These software tools are expensive and need sophisticated Information Technology infrastructure to function. Additionally, some multinational pharmaceutical giants use custom-made CDMS tools to suit their operational needs and procedures. Among the open source tools, the most prominent ones are OpenClinica, openCDMS, TrialDB, and PhOSCo. These CDM software packages are available free of cost and are as good as their commercial counterparts in terms of functionality; they can be downloaded from their respective websites.

In regulatory submission studies, maintaining an audit trail of data management activities is of paramount importance. These CDM tools ensure the audit trail and help in the management of discrepancies. According to the roles and responsibilities (explained later), multiple user IDs can be created with access limitation to data entry, medical coding, database designing, or quality check. This ensures that each user can access only the respective functionalities allotted to that user ID and cannot make any other change in the database. For responsibilities where changes are permitted to be made in the data, the software will record the change made, the user ID that made the change and the time and date of change, for audit purposes (audit trail). During a regulatory audit, the auditors can verify the discrepancy management process; the changes made and can confirm that no unauthorized or false changes were made.
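As a rough sketch of the access limitation just described, the following Python fragment models role-based permissions. The role and action names are assumptions made for this illustration, not the configuration of any particular CDMS.

```python
# A minimal sketch of role-limited database access; role and action
# names are illustrative. Production CDMS enforce this internally.
ROLE_PERMISSIONS = {
    "data_entry": {"enter_data"},
    "medical_coder": {"code_terms"},
    "database_designer": {"design_database"},
    "quality_control": {"run_quality_checks"},
}

def can_perform(role: str, action: str) -> bool:
    """Return True only if the action is allotted to the user's role."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert can_perform("medical_coder", "code_terms")
assert not can_perform("data_entry", "design_database")
```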

Regulations, Guidelines, and Standards in CDM

Akin to other areas in clinical research, CDM has guidelines and standards that must be followed. Since the pharmaceutical industry relies on the electronically captured data for the evaluation of medicines, there is a need to follow good practices in CDM and maintain standards in electronic data capture. These electronic records have to comply with a Code of Federal Regulations (CFR), 21 CFR Part 11. This regulation is applicable to records in electronic format that are created, modified, maintained, archived, retrieved, or transmitted. This demands the use of validated systems to ensure accuracy, reliability, and consistency of data with the use of secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records.[ 3 ] Adequate procedures and controls should be put in place to ensure the integrity, authenticity, and confidentiality of data. If data have to be submitted to regulatory authorities, it should be entered and processed in 21 CFR part 11-compliant systems. Most of the CDM systems available are like this and pharmaceutical companies as well as contract research organizations ensure this compliance.

Society for Clinical Data Management (SCDM) publishes the Good Clinical Data Management Practices (GCDMP) guidelines, a document providing the standards of good practice within CDM. GCDMP was initially published in September 2000 and has undergone several revisions thereafter. The July 2009 version is the currently followed GCDMP document. GCDMP provides guidance on the accepted practices in CDM that are consistent with regulatory practices. Addressed in 20 chapters, it covers the CDM process by highlighting the minimum standards and best practices.

Clinical Data Interchange Standards Consortium (CDISC), a multidisciplinary non-profit organization, has developed standards to support the acquisition, exchange, submission, and archival of clinical research data and metadata. Metadata is ‘data about the data’ entered: it includes information about the individual who made the entry or a change in the clinical data, the date and time of the entry or change, and details of the changes that were made. Among the standards, two important ones are the Study Data Tabulation Model Implementation Guide for Human Clinical Trials (SDTMIG) and the Clinical Data Acquisition Standards Harmonization (CDASH) standards, available free of cost from the CDISC website ( www.cdisc.org ). The SDTMIG standard[4] describes the details of the model and standard terminologies for the data and serves as a guide to the organization. CDASH v 1.1[5] defines the basic standards for the collection of data in a clinical trial and enlists the basic data information needed from a clinical, regulatory, and scientific perspective.
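To illustrate how annotation against a standard might look, here is a hedged sketch mapping collected CRF fields to SDTM-style variable names. The mapping is illustrative only (BRTHDTC follows the convention shown in Figure 1 below) and is not a substitute for the SDTMIG itself.

```python
# A sketch of mapping CRF fields to SDTM-style variable names;
# the mapping shown is illustrative, not the official SDTMIG spec.
CRF_TO_SDTM = {
    "date_of_birth": "BRTHDTC",    # demographics domain, date format
    "sex": "SEX",                  # demographics domain, coded value
    "adverse_event_term": "AETERM" # adverse events domain, verbatim term
}

def annotate(crf_field: str) -> str:
    """Look up the standard variable name used to annotate the CRF."""
    return CRF_TO_SDTM.get(crf_field, "UNMAPPED")

print(annotate("date_of_birth"))  # BRTHDTC
```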

The CDM Process

The CDM process, like a clinical trial, begins with the end in mind. This means that the whole process is designed keeping the deliverable in view. As a clinical trial is designed to answer the research question, the CDM process is designed to deliver an error-free, valid, and statistically sound database. To meet this objective, the CDM process starts early, even before the finalization of the study protocol.

Review and finalization of study documents

The protocol is reviewed from a database designing perspective, for clarity and consistency. During this review, the CDM personnel identify the data items to be collected and the frequency of collection with respect to the visit schedule. A Case Report Form (CRF) is designed by the CDM team, as this is the first step in translating the protocol-specific activities into data being generated. The data fields should be clearly defined and consistent throughout. The type of data to be entered should be evident from the CRF. For example, if weight has to be captured to two decimal places, the data entry field should have two data boxes placed after the decimal, as shown in Figure 1. Similarly, the units in which measurements have to be made should be mentioned next to the data field. The CRF should be concise, self-explanatory, and user-friendly. Along with the CRF, the filling instructions (called CRF Completion Guidelines) should also be provided to study investigators for error-free data acquisition. CRF annotation is done wherein each variable is named according to the SDTMIG or the conventions followed internally. Annotations are coded terms used in CDM tools to indicate the variables in the study. An example of an annotated CRF is provided in Figure 1. In questions with discrete value options (like the variable gender having the values male and female as responses), all possible options are coded appropriately.

Figure 1. Annotated sample of a Case Report Form (CRF). Annotations are entered in coloured text to differentiate them from the CRF questions. DCM = Data collection module, DVG = Discrete value group, YNNA [S1] = Yes/No/Not applicable [subset 1], C = Character, N = Numerical, DT = Date format. For example, BRTHDTC [DT] indicates date of birth in the date format.

Based on these, a Data Management Plan (DMP) is developed. DMP document is a road map to handle the data under foreseeable circumstances and describes the CDM activities to be followed in the trial. A list of CDM activities is provided in Table 1 . The DMP describes the database design, data entry and data tracking guidelines, quality control measures, SAE reconciliation guidelines, discrepancy management, data transfer/extraction, and database locking guidelines. Along with the DMP, a Data Validation Plan (DVP) containing all edit-checks to be performed and the calculations for derived variables are also prepared. The edit check programs in the DVP help in cleaning up the data by identifying the discrepancies.

Table 1. List of clinical data management activities

Database designing

Databases are clinical software applications built to facilitate CDM tasks across multiple studies.[6] Generally, these tools have built-in compliance with regulatory requirements and are easy to use. “System validation” is conducted to ensure data security, during which system specifications,[7] user requirements, and regulatory compliance are evaluated before implementation. Study details like objectives, intervals, visits, investigators, sites, and patients are defined in the database, and CRF layouts are designed for data entry. These entry screens are tested with dummy data before moving to real data capture.

Data collection

Data collection is done using the CRF, which may exist in the form of a paper or an electronic version. The traditional method employs paper CRFs to collect the data responses, which are transferred to the database by means of in-house data entry. These paper CRFs are filled in by the investigator according to the completion guidelines. In e-CRF-based CDM, the investigator or a designee logs into the CDM system and enters the data directly at the site. In the e-CRF method, the chance of error is lower, and the resolution of discrepancies happens faster. Since pharmaceutical companies try to reduce the time taken for drug development by enhancing the speed of the processes involved, many are opting for e-CRF options (also called remote data entry).

CRF tracking

The entries made in the CRF are monitored by the Clinical Research Associate (CRA) for completeness, and the completed CRFs are retrieved and handed over to the CDM team. The CDM team tracks the retrieved CRFs and maintains their record. CRFs are checked manually for missing pages and illegible data to ensure that data are not lost. In case of missing or illegible data, a clarification is obtained from the investigator and the issue is resolved.

Data entry

Data entry takes place according to the guidelines prepared along with the DMP. This is applicable only in the case of paper CRFs retrieved from the sites. Usually, double data entry is performed, wherein the data is entered by two operators separately.[8] The second-pass entry (entry made by the second person) helps in verification and reconciliation by identifying transcription errors and discrepancies caused by illegible data. Moreover, double data entry helps in getting a cleaner database compared to single data entry. Earlier studies have shown that double data entry ensures better consistency with the paper CRF, as denoted by a lower error rate.[9]
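A minimal sketch of the second-pass comparison is shown below. The records and field names are invented for the example; real systems run such checks inside the CDMS rather than in standalone scripts.

```python
# A sketch of second-pass verification in double data entry: two
# independently entered records are compared field by field and
# mismatches are flagged for reconciliation.
from typing import Dict, List, Tuple

def compare_passes(first: Dict[str, str],
                   second: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (field, first_value, second_value) for every mismatch."""
    mismatches = []
    for field in sorted(set(first) | set(second)):
        v1, v2 = first.get(field, ""), second.get(field, "")
        if v1 != v2:
            mismatches.append((field, v1, v2))
    return mismatches

pass1 = {"subject_id": "001", "weight_kg": "72.50"}
pass2 = {"subject_id": "001", "weight_kg": "75.20"}  # transcription slip
print(compare_passes(pass1, pass2))  # [('weight_kg', '72.50', '75.20')]
```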

Data validation

Data validation is the process of testing the validity of data in accordance with the protocol specifications. Edit check programs, embedded in the database, are written to identify discrepancies in the entered data and thereby ensure data validity. These programs are written according to the logic conditions mentioned in the DVP and are initially tested with dummy data containing discrepancies. A discrepancy is defined as a data point that fails to pass a validation check. Discrepancies may be due to inconsistent data, missing data, range-check failures, or deviations from the protocol. In e-CRF-based studies, the data validation process is run frequently to identify discrepancies, which investigators resolve after logging into the system. Ongoing quality control of data processing is undertaken at regular intervals during the course of CDM. For example, if the inclusion criteria specify that the age of the patient should be between 18 and 65 years (both inclusive), an edit program is written for two conditions: age <18 and age >65. If either condition becomes TRUE for a patient, a discrepancy is generated. These discrepancies are highlighted in the system, and Data Clarification Forms (DCFs), documents containing queries pertaining to the identified discrepancies, can be generated.
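The age example above can be expressed as a small edit-check program. The following Python sketch is illustrative: the record layout and query text are assumptions, and production edit checks are written and validated inside the CDMS.

```python
# A sketch of the age edit check described above: the protocol allows
# 18-65 inclusive, so a discrepancy is raised when age < 18 or age > 65.
from typing import Dict, List

def age_edit_check(patients: List[Dict]) -> List[Dict]:
    """Return one discrepancy record per out-of-range age."""
    discrepancies = []
    for p in patients:
        if p["age"] < 18 or p["age"] > 65:
            discrepancies.append({
                "subject_id": p["subject_id"],
                "variable": "AGE",
                "value": p["age"],
                "query": "Age outside protocol range 18-65; please verify.",
            })
    return discrepancies

patients = [{"subject_id": "001", "age": 34}, {"subject_id": "002", "age": 71}]
print(age_edit_check(patients))  # flags subject 002 only
```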

Discrepancy management

This is also called query resolution. Discrepancy management includes reviewing discrepancies, investigating the reason, and resolving them with documentary proof or declaring them as irresolvable. Discrepancy management helps in cleaning the data and gathers enough evidence for the deviations observed in data. Almost all CDMS have a discrepancy database where all discrepancies will be recorded and stored with audit trail.

Based on the types identified, discrepancies are either flagged to the investigator for clarification or closed in-house by Self-Evident Corrections (SEC) without sending DCF to the site. The most common SECs are obvious spelling errors. For discrepancies that require clarifications from the investigator, DCFs will be sent to the site. The CDM tools help in the creation and printing of DCFs. Investigators will write the resolution or explain the circumstances that led to the discrepancy in data. When a resolution is provided by the investigator, the same will be updated in the database. In case of e-CRFs, the investigator can access the discrepancies flagged to him and will be able to provide the resolutions online. Figure 2 illustrates the flow of discrepancy management.

clinical data management research paper

Discrepancy management (DCF = Data clarification form, CRA = Clinical Research Associate, SDV = Source document verification, SEC = Self-evident correction)

The CDM team reviews all discrepancies at regular intervals to ensure that they have been resolved. The resolved data discrepancies are recorded as ‘closed’. This means that those validation failures are no longer considered active, and future data validation attempts on the same data will not create a discrepancy for the same data point. But closure of discrepancies is not always possible: in some cases the investigator cannot provide a resolution, and such discrepancies are marked ‘irresolvable’ and updated in the discrepancy database.

Discrepancy management is the most critical activity in the CDM process. Being the vital activity in cleaning up the data, utmost attention must be paid while handling the discrepancies.
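The discrepancy lifecycle described above, where an open discrepancy ends up either closed or irresolvable, can be pictured as a small state machine. The sketch below is an illustration of that lifecycle; the transition rules are an assumption for the example rather than a prescribed workflow.

```python
# A sketch of the discrepancy lifecycle; state names follow the text
# (open, closed, irresolvable), transition rules are illustrative.
from enum import Enum

class DiscrepancyState(Enum):
    OPEN = "open"
    CLOSED = "closed"              # resolution received and applied
    IRRESOLVABLE = "irresolvable"  # investigator cannot resolve

ALLOWED = {
    DiscrepancyState.OPEN: {DiscrepancyState.CLOSED,
                            DiscrepancyState.IRRESOLVABLE},
    DiscrepancyState.CLOSED: set(),        # closed items stay closed
    DiscrepancyState.IRRESOLVABLE: set(),
}

def transition(current: DiscrepancyState,
               target: DiscrepancyState) -> DiscrepancyState:
    """Move a discrepancy to a new state, rejecting invalid moves."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target

state = transition(DiscrepancyState.OPEN, DiscrepancyState.CLOSED)
```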

Medical coding

Medical coding helps in identifying and properly classifying the medical terminologies associated with the clinical trial. For classification of events, medical dictionaries available online are used. Technically, this activity needs knowledge of medical terminology, an understanding of disease entities and the drugs used, and a basic knowledge of the pathological processes involved. Functionally, it also requires knowledge of the structure of electronic medical dictionaries and the hierarchy of classifications available in them. Adverse events occurring during the study, prior and concomitantly administered medications, and pre- or co-existing illnesses are coded using the available medical dictionaries. Commonly, the Medical Dictionary for Regulatory Activities (MedDRA) is used for the coding of adverse events as well as other illnesses, and the World Health Organization–Drug Dictionary Enhanced (WHO-DDE) is used for coding medications. These dictionaries contain the respective classifications of adverse events and drugs in proper classes. Other dictionaries are also available for use in data management (e.g., WHO-ART, a dictionary that deals with adverse reaction terminology). Some pharmaceutical companies utilize customized dictionaries to suit their needs and meet their standard operating procedures.

Medical coding helps in classifying reported medical terms on the CRF to standard dictionary terms in order to achieve data consistency and avoid unnecessary duplication. For example, the investigators may use different terms for the same adverse event, but it is important to code all of them to a single standard code and maintain uniformity in the process. The right coding and classification of adverse events and medication is crucial as an incorrect coding may lead to masking of safety issues or highlight the wrong safety concerns related to the drug.
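As a toy illustration of this mapping, the following sketch codes verbatim adverse-event terms to a single standard term. The mini-dictionary is invented for the example; real coding uses licensed dictionaries such as MedDRA and WHO-DDE, applied by a trained medical coder.

```python
# A toy coding dictionary: several verbatim terms map to one standard
# term. The entries are invented; real coding uses licensed dictionaries.
MINI_DICTIONARY = {
    "headache": "Headache",
    "head ache": "Headache",
    "pain in head": "Headache",
    "nausea": "Nausea",
}

def code_term(verbatim: str) -> str:
    """Map a reported term to its standard dictionary term, or flag it
    for manual review by a medical coder."""
    return MINI_DICTIONARY.get(verbatim.strip().lower(), "MANUAL REVIEW")

print(code_term("Head Ache"))     # Headache
print(code_term("dizzy spells"))  # MANUAL REVIEW
```

Routing unmatched terms to manual review, rather than guessing, reflects the point above: an incorrect code can mask a safety issue or flag a spurious one.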

Database locking

After a proper quality check and assurance, the final data validation is run. If there are no discrepancies, the SAS datasets are finalized in consultation with the statistician. All data management activities should have been completed prior to database lock. To ensure this, a pre-lock checklist is used and completion of all activities is confirmed. This is done as the database cannot be changed in any manner after locking. Once the approval for locking is obtained from all stakeholders, the database is locked and clean data is extracted for statistical analysis. Generally, no modification in the database is possible. But in case of a critical issue or for other important operational reasons, privileged users can modify the data even after the database is locked. This, however, requires proper documentation and an audit trail has to be maintained with sufficient justification for updating the locked database. Data extraction is done from the final database after locking. This is followed by its archival.
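A pre-lock checklist can be pictured as a simple gate, as in the hedged sketch below. The checklist items are paraphrased from the activities described in this article; an actual pre-lock checklist is study-specific.

```python
# A sketch of a pre-lock checklist gate: the database may be locked
# only when every listed CDM activity is confirmed complete.
PRE_LOCK_CHECKLIST = [
    "all data entered",
    "all discrepancies closed or marked irresolvable",
    "medical coding complete",
    "SAE reconciliation complete",
    "final data validation run",
    "stakeholder approvals obtained",
]

def can_lock(completed: set) -> bool:
    """Return True only if no checklist item is outstanding."""
    outstanding = [item for item in PRE_LOCK_CHECKLIST
                   if item not in completed]
    if outstanding:
        print("Cannot lock; outstanding items:", outstanding)
        return False
    return True

can_lock({"all data entered", "medical coding complete"})  # False
```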

Roles and Responsibilities in CDM

In a CDM team, different roles and responsibilities are attributed to the team members. The minimum educational requirement for a team member in CDM is a graduate degree in life sciences plus knowledge of computer applications. Ideally, medical coders should be medical graduates; however, in the industry, paramedical graduates are also recruited as medical coders. Some key roles are essential to all CDM teams. The list of roles given below can be considered the minimum requirement for a CDM team:

  • Data Manager
  • Database Programmer/Designer
  • Medical Coder
  • Clinical Data Coordinator
  • Quality Control Associate
  • Data Entry Associate

The data manager is responsible for supervising the entire CDM process. The data manager prepares the DMP and approves the CDM procedures and all internal documents related to CDM activities. Controlling and allocating database access to team members is also the responsibility of the data manager. The database programmer/designer performs the CRF annotation, creates the study database, and programs the edit checks for data validation. He/she is also responsible for designing the data entry screens in the database and validating the edit checks with dummy data. The medical coder codes adverse events, medical history, co-illnesses, and concomitant medication administered during the study. The clinical data coordinator designs the CRF, prepares the CRF filling instructions, and is responsible for developing the DVP and discrepancy management. All other CDM-related documents, checklists, and guideline documents are prepared by the clinical data coordinator. The quality control associate checks the accuracy of data entry and conducts data audits.[10] Sometimes, a separate quality assurance person conducts the audit on the data entered. Additionally, the quality control associate verifies the documentation pertaining to the procedures being followed. The data entry personnel track the receipt of CRF pages and perform the data entry into the database.

CDM has evolved in response to the ever-increasing demand from pharmaceutical companies to fast-track the drug development process and from the regulatory authorities to put the quality systems in place to ensure generation of high-quality data for accurate drug evaluation. To meet the expectations, there is a gradual shift from the paper-based to the electronic systems of data management. Developments on the technological front have positively impacted the CDM process and systems, thereby leading to encouraging results on speed and quality of data being generated. At the same time, CDM professionals should ensure the standards for improving data quality.[ 11 ] CDM, being a speciality in itself, should be evaluated by means of the systems and processes being implemented and the standards being followed. The biggest challenge from the regulatory perspective would be the standardization of data management process across organizations, and development of regulations to define the procedures to be followed and the data standards. From the industry perspective, the biggest hurdle would be the planning and implementation of data management systems in a changing operational environment where the rapid pace of technology development outdates the existing infrastructure. In spite of these, CDM is evolving to become a standard-based clinical research entity, by striking a balance between the expectations from and constraints in the existing systems, driven by technological developments and business demands.

Source of Support: Nil.

Conflict of Interest: None declared.
