
Systematic Reviews: Step 6: Assess Quality of Included Studies

Created by health science librarians.


About Step 6: Assess Quality of Included Studies

In Step 6, you will evaluate the articles included in your review for quality and bias. To do so, you will:

  • Use quality assessment tools to grade each article.
  • Create a summary of the quality of literature included in your review.

This page has links to quality assessment tools you can use to evaluate different study types. Librarians can help you find widely used tools to evaluate the articles in your review.

Reporting your review with PRISMA

If you reach the quality assessment step and choose to exclude articles for any reason, update the number of included and excluded studies in your PRISMA flow diagram.

Managing your review with Covidence

Covidence includes the Cochrane Risk of Bias 2.0 quality assessment template, but you can also create your own custom quality assessment template.

How a librarian can help with Step 6

  • What the quality assessment or risk of bias stage of the review entails
  • How to choose an appropriate quality assessment tool
  • Best practices for reporting quality assessment results in your review

After the screening process is complete, the systematic review team must assess each article for quality and bias. There are various types of bias, some of which are outlined in the table below from the Cochrane Handbook.

The most important thing to remember when choosing a quality assessment tool is to pick one that was created and validated to assess the study design(s) of your included articles.

For example, if one item in the inclusion criteria of your systematic review is to only include randomized controlled trials (RCTs), then you need to pick a quality assessment tool specifically designed for RCTs (for example, the Cochrane Risk of Bias tool).

Once you have gathered your included studies, you will need to appraise the evidence for its relevance, reliability, validity, and applicability​.

Ask questions like:

Relevance:

  • Is the research method/study design appropriate for answering the research question?
  • Are specific inclusion/exclusion criteria used?

Reliability:

  • Is the effect size practically relevant? How precise is the estimate of the effect? Were confidence intervals given?

Validity:

  • Were there enough subjects in the study to establish that the findings did not occur by chance?
  • Were subjects randomly allocated? Were the groups comparable? If not, could this have introduced bias?
  • Are the measurements/tools validated by other studies?
  • Could there be confounding factors?

Applicability:

  • Can the results be applied to my organization and my patient?

What are Quality Assessment tools?

Quality Assessment tools are questionnaires created to help you assess the quality of a variety of study designs. Depending on the types of studies you are analyzing, the questionnaire will be tailored to ask specific questions about the methodology of the study. There are appraisal tools for most kinds of study designs. You should choose a Quality Assessment tool that matches the types of studies you expect to see in your results. If you have multiple study designs, you may wish to use several tools from one organization, such as CASP or LEGEND, since each offers assessment tools for a wide range of designs.

Click on a study design below to see some examples of quality assessment tools for that type of study.

Randomized Controlled Trials (RCTs)

  • Cochrane Risk of Bias (ROB) 2.0 Tool Templates are tailored to randomized parallel-group trials, cluster-randomized parallel-group trials (including stepped-wedge designs), and randomized cross-over trials and other matched designs.
  • CASP- Randomized Controlled Trial Appraisal Tool A checklist for RCTs created by the Critical Appraisal Skills Programme (CASP)
  • The Jadad Scale A scale that assesses the quality of published clinical trials based on methods relevant to random assignment, double blinding, and the flow of patients
  • CEBM-RCT A critical appraisal tool for RCTs from the Centre for Evidence Based Medicine (CEBM)
  • Checklist for Randomized Controlled Trials (JBI) A critical appraisal checklist from the Joanna Briggs Institute (JBI)
  • Scottish Intercollegiate Guidelines Network (SIGN) Checklists for quality assessment
  • LEGEND Evidence Evaluation Tools A series of critical appraisal tools from the Cincinnati Children's Hospital. Contains tools for a wide variety of study designs, including prospective, retrospective, qualitative, and quantitative designs.

Cohort Studies

  • CASP- Cohort Studies A checklist created by the Critical Appraisal Skills Programme (CASP) to assess key criteria relevant to cohort studies
  • Checklist for Cohort Studies (JBI) A checklist for cohort studies from the Joanna Briggs Institute
  • The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses A validated tool for assessing case-control and cohort studies
  • STROBE Checklist A checklist for quality assessment of case-control, cohort, and cross-sectional studies

Case-Control Studies

  • CASP- Case Control Study A checklist created by the Critical Appraisal Skills Programme (CASP) to assess key criteria relevant to case-control studies
  • Tool to Assess Risk of Bias in Case Control Studies by the CLARITY Group at McMaster University A quality assessment tool for case-control studies from the CLARITY Group at McMaster University
  • JBI Checklist for Case-Control Studies A checklist created by the Joanna Briggs Institute

Cross-Sectional Studies

  • STROBE Checklist A checklist for quality assessment of case-control, cohort, and cross-sectional studies

Diagnostic Studies

  • CASP- Diagnostic Studies A checklist for diagnostic studies created by the Critical Appraisal Skills Programme (CASP)
  • QUADAS-2 A quality assessment tool developed by a team at the Bristol Medical School: Population Health Sciences at the University of Bristol
  • Critical Appraisal Checklist for Diagnostic Test Accuracy Studies (JBI) A checklist for quality assessment of diagnostic studies developed by the Joanna Briggs Institute

Economic Studies

  • Consensus Health Economic Criteria (CHEC) List A list of 19 yes-or-no questions, one per category, for assessing economic evaluations
  • CASP- Economic Evaluation A checklist for quality assessment of economic studies by the Critical Appraisal Skills Programme

Mixed Methods

  • McGill Mixed Methods Appraisal Tool (MMAT) 2018 User Guide See the full site for additional information, including FAQs, references and resources, earlier versions, and more

Qualitative Studies

  • CASP- Qualitative Studies 10 questions to help assess qualitative research from the Critical Appraisal Skills Programme

Systematic Reviews and Meta-Analyses

  • JBI Critical Appraisal Checklist for Systematic Reviews and Research Syntheses An 11-item checklist for evaluating systematic reviews
  • AMSTAR Checklist A 16-question measurement tool to assess systematic reviews
  • AHRQ Methods Guide for Effectiveness and Comparative Effectiveness Reviews A guide to selecting eligibility criteria, searching the literature, extracting data, assessing quality, and completing other steps in the creation of a systematic review
  • CASP - Systematic Review A checklist for quality assessment of systematic reviews from the Critical Appraisal Skills Programme

Clinical Practice Guidelines

  • National Guideline Clearinghouse Extent of Adherence to Trustworthy Standards (NEATS) Instrument A 15-item instrument using a scale of 1-5 to evaluate a guideline's adherence to the Institute of Medicine's standards for trustworthy guidelines
  • AGREE-II Appraisal of Guidelines for Research and Evaluation The Appraisal of Guidelines for Research and Evaluation (AGREE) Instrument evaluates the process of practice guideline development and the quality of reporting

Other Study Designs

  • NTACT Quality Checklists Quality indicator checklists for correlational studies, group experimental studies, single case research studies, and qualitative studies developed by the National Technical Assistance Center on Transition (NTACT). (Users must make an account.)

Below, you will find a sample of four popular quality assessment tools and some basic information about each. For more quality assessment tools, see the sections above, organized by study design.

Covidence uses Cochrane Risk of Bias (which is designed for rating RCTs and cannot be used for other study types) as the default tool for quality assessment of included studies. You can opt to manually customize the quality assessment template and use a different tool better suited to your review. More information about quality assessment using Covidence, including how to customize the quality assessment template, can be found below. If you decide to customize the quality assessment template, you cannot switch back to using the Cochrane Risk of Bias template.

More Information

  • Quality Assessment on the Covidence Guide
  • Covidence FAQs on Quality Assessment Commonly asked questions about quality assessment using Covidence
  • Covidence YouTube Channel A collection of Covidence-created videos
  • Last Updated: May 16, 2024 3:24 PM
  • URL: https://guides.lib.unc.edu/systematic-reviews

Brian M. Belcher, Katherine E. Rasmussen, Matthew R. Kemshaw, Deborah A. Zornes, Defining and assessing research quality in a transdisciplinary context, Research Evaluation , Volume 25, Issue 1, January 2016, Pages 1–17, https://doi.org/10.1093/reseval/rvv025


Research increasingly seeks both to generate knowledge and to contribute to real-world solutions, with strong emphasis on context and social engagement. As boundaries between disciplines are crossed, and as research engages more with stakeholders in complex systems, traditional academic definitions and criteria of research quality are no longer sufficient—there is a need for a parallel evolution of principles and criteria to define and evaluate research quality in a transdisciplinary research (TDR) context. We conducted a systematic review to help answer the question: What are appropriate principles and criteria for defining and assessing TDR quality? Articles were selected and reviewed seeking: arguments for or against expanding definitions of research quality, purposes for research quality evaluation, proposed principles of research quality, proposed criteria for research quality assessment, proposed indicators and measures of research quality, and proposed processes for evaluating TDR. We used the information from the review and our own experience in two research organizations that employ TDR approaches to develop a prototype TDR quality assessment framework, organized as an evaluation rubric. We provide an overview of the relevant literature and summarize the main aspects of TDR quality identified there. Four main principles emerge: relevance, including social significance and applicability; credibility, including criteria of integration and reflexivity, added to traditional criteria of scientific rigor; legitimacy, including criteria of inclusion and fair representation of stakeholder interests; and effectiveness, with criteria that assess actual or potential contributions to problem solving and social change.

Contemporary research in the social and environmental realms places strong emphasis on achieving ‘impact’. Research programs and projects aim to generate new knowledge but also to promote and facilitate the use of that knowledge to enable change, solve problems, and support innovation ( Clark and Dickson 2003 ). Reductionist and purely disciplinary approaches are being augmented or replaced with holistic approaches that recognize the complex nature of problems and that actively engage within complex systems to contribute to change ‘on the ground’ ( Gibbons et al. 1994 ; Nowotny, Scott and Gibbons 2001 , Nowotny, Scott and Gibbons 2003 ; Klein 2006 ; Hemlin and Rasmussen 2006 ; Chataway, Smith and Wield 2007 ; Erno-Kjolhede and Hansson 2011 ). Emerging fields such as sustainability science have developed out of a need to address complex and urgent real-world problems ( Komiyama and Takeuchi 2006 ). These approaches are inherently applied and transdisciplinary, with explicit goals to contribute to real-world solutions and strong emphasis on context and social engagement ( Kates 2000 ).

While there is an ongoing conceptual and theoretical debate about the nature of the relationship between science and society (e.g. Hessels 2008 ), we take a more practical starting point based on the authors’ experience in two research organizations. The first author has been involved with the Center for International Forestry Research (CIFOR) for almost 20 years. CIFOR, as part of the Consultative Group on International Agricultural Research (CGIAR), began a major transformation in 2010 that shifted the emphasis from a primary focus on delivering high-quality science to a focus on ‘…producing, assembling and delivering, in collaboration with research and development partners, research outputs that are international public goods which will contribute to the solution of significant development problems that have been identified and prioritized with the collaboration of developing countries.’ ( CGIAR 2011 ). It was always intended that CGIAR research would be relevant to priority development and conservation issues, with emphasis on high-quality scientific outputs. The new approach puts much stronger emphasis on welfare and environmental results; research centers, programs, and individual scientists now assume shared responsibility for achieving development outcomes. This requires new ways of working, with more and different kinds of partnerships and more deliberate and strategic engagement in social systems.

Royal Roads University (RRU), the home institute of all four authors, is a relatively new (created in 1995) public university in Canada. It is deliberately interdisciplinary by design, with just two faculties (Faculty of Social and Applied Science; Faculty of Management) and strong emphasis on problem-oriented research. Faculty and student research is typically ‘applied’ in the Organization for Economic Co-operation and Development (2012) sense of ‘original investigation undertaken in order to acquire new knowledge … directed primarily towards a specific practical aim or objective’.

An increasing amount of the research done within both of these organizations can be classified as transdisciplinary research (TDR). TDR crosses disciplinary and institutional boundaries, is context specific, and problem oriented ( Klein 2006 ; Carew and Wickson 2010 ). It combines and blends methodologies from different theoretical paradigms, includes a diversity of both academic and lay actors, and is conducted with a range of research goals, organizational forms, and outputs ( Klein 2006 ; Boix-Mansilla 2006a ; Erno-Kjolhede and Hansson 2011 ). The problem-oriented nature of TDR and the importance placed on societal relevance and engagement are broadly accepted as defining characteristics of TDR ( Carew and Wickson 2010 ).

The experience developing and using TDR approaches at CIFOR and RRU highlights the need for a parallel evolution of principles and criteria for evaluating research quality in a TDR context. Scientists appreciate and often welcome the need and the opportunity to expand the reach of their research, to contribute more effectively to change processes. At the same time, they feel the pressure of added expectations and are looking for guidance.

In any activity, we need principles, guidelines, criteria, or benchmarks that can be used to design the activity, assess its potential, and evaluate its progress and accomplishments. Effective research quality criteria are necessary to guide the funding, management, ongoing development, and advancement of research methods, projects, and programs. The lack of quality criteria to guide and assess research design and performance is seen as hindering the development of transdisciplinary approaches ( Bergmann et al. 2005 ; Feller 2006 ; Chataway, Smith and Wield 2007 ; Ozga 2008 ; Carew and Wickson 2010 ; Jahn and Keil 2015 ). Appropriate quality evaluation is essential to ensure that research receives support and funding, and to guide and train researchers and managers to realize high-quality research ( Boix-Mansilla 2006a ; Klein 2008 ; Aagaard-Hansen and Svedin 2009 ; Carew and Wickson 2010 ).

Traditional disciplinary research is built on well-established methodological and epistemological principles and practices. Within disciplinary research, quality has been defined narrowly, with the primary criteria being scientific excellence and scientific relevance ( Feller 2006 ; Chataway, Smith and Wield 2007 ; Erno-Kjolhede and Hansson 2011 ). Disciplines have well-established (often implicit) criteria and processes for the evaluation of quality in research design ( Erno-Kjolhede and Hansson 2011 ). TDR that is highly context specific, problem oriented, and includes nonacademic societal actors in the research process is challenging to evaluate ( Wickson, Carew and Russell 2006 ; Aagaard-Hansen and Svedin 2009 ; Andrén 2010 ; Carew and Wickson 2010 ; Huutoniemi 2010 ). There is no one definition or understanding of what constitutes quality, nor a set guide for how to do TDR ( Lincoln 1995 ; Morrow 2005 ; Oberg 2008 ; Andrén 2010 ; Huutoniemi 2010 ). When epistemologies and methods from more than one discipline are used, disciplinary criteria may be insufficient and criteria from more than one discipline may be contradictory; cultural conflicts can arise as a range of actors use different terminology for the same concepts or the same terminology for different concepts ( Chataway, Smith and Wield 2007 ; Oberg 2008 ).

Current research evaluation approaches as applied to individual researchers, programs, and research units are still based primarily on measures of academic outputs (publications and the prestige of the publishing journal), citations, and peer assessment ( Boix-Mansilla 2006a ; Feller 2006 ; Erno-Kjolhede and Hansson 2011 ). While these indicators of research quality remain relevant, additional criteria are needed to address the innovative approaches and the diversity of actors, outputs, outcomes, and long-term social impacts of TDR. It can be difficult to find appropriate outlets for TDR publications simply because the research does not meet the expectations of traditional discipline-oriented journals. Moreover, a wider range of inputs and of outputs means that TDR may result in fewer academic outputs. This has negative implications for transdisciplinary researchers, whose performance appraisals and long-term career progression are largely governed by traditional publication and citation-based metrics of evaluation. Research managers, peer reviewers, academic committees, and granting agencies all struggle with how to evaluate and how to compare TDR projects ( ex ante or ex post ) in the absence of appropriate criteria to address epistemological and methodological variability. The extent of engagement of stakeholders 1 in the research process will vary by project, from information sharing through to active collaboration ( Brandt et al. 2013) , but at any level, the involvement of stakeholders adds complexity to the conceptualization of quality. We need to know what ‘good research’ is in a transdisciplinary context.

As Tijssen ( 2003 : 93) put it: ‘Clearly, in view of its strategic and policy relevance, developing and producing generally acceptable measures of “research excellence” is one of the chief evaluation challenges of the years to come’. Clear criteria are needed for research quality evaluation to foster excellence while supporting innovation: ‘A principal barrier to a broader uptake of TD research is a lack of clarity on what good quality TD research looks like’ ( Carew and Wickson 2010 : 1154). In the absence of alternatives, many evaluators, including funding bodies, rely on conventional, discipline-specific measures of quality which do not address important aspects of TDR.

There is an emerging literature that reviews, synthesizes, or empirically evaluates knowledge and best practice in research evaluation in a TDR context and that proposes criteria and evaluation approaches ( Defila and Di Giulio 1999 ; Bergmann et al. 2005 ; Wickson, Carew and Russell 2006 ; Klein 2008 ; Carew and Wickson 2010 ; ERIC 2010; de Jong et al. 2011 ; Spaapen and Van Drooge 2011 ). Much of it comes from a few fields, including health care, education, and evaluation; little comes from the natural resource management and sustainability science realms, despite these areas needing guidance. National-scale reviews have begun to recognize the need for broader research evaluation criteria but have had difficulty dealing with it and have made little progress in addressing it ( Donovan 2008 ; KNAW 2009 ; REF 2011 ; ARC 2012 ; TEC 2012 ). A summary of the national reviews that we reviewed in the development of this research is provided in Supplementary Appendix 1 . While there are some published evaluation schemes for TDR and interdisciplinary research (IDR), there is ‘substantial variation in the balance different authors achieve between comprehensiveness and over-prescription’ ( Wickson and Carew 2014 : 256) and still a need to develop standardized quality criteria that are ‘uniquely flexible to provide valid, reliable means to evaluate and compare projects, while not stifling the evolution and responsiveness of the approach’ ( Wickson and Carew 2014 : 256).

There is a need and an opportunity to synthesize current ideas about how to define and assess quality in TDR. To address this, we conducted a systematic review of the literature that discusses the definitions of research quality as well as the suggested principles and criteria for assessing TDR quality. The aim is to identify appropriate principles and criteria for defining and measuring research quality in a transdisciplinary context and to organize those principles and criteria as an evaluation framework.

The review question was: What are appropriate principles, criteria, and indicators for defining and assessing research quality in TDR?

This article presents the method used for the systematic review and our synthesis, followed by key findings. Theoretical concepts about why new principles and criteria are needed for TDR, along with associated discussions about evaluation process are presented. A framework, derived from our synthesis of the literature, of principles and criteria for TDR quality evaluation is presented along with guidance on its application. Finally, recommendations for next steps in this research and needs for future research are discussed.

2.1 Systematic review

Systematic review is a rigorous, transparent, and replicable methodology that has become widely used to inform evidence-based policy, management, and decision making ( Pullin and Stewart 2006 ; CEE 2010). Systematic reviews follow a detailed protocol with explicit inclusion and exclusion criteria to ensure a repeatable and comprehensive review of the target literature. Review protocols are shared and often published as peer reviewed articles before undertaking the review to invite critique and suggestions. Systematic reviews are most commonly used to synthesize knowledge on an empirical question by collating data and analyses from a series of comparable studies, though methods used in systematic reviews are continually evolving and are increasingly being developed to explore a wider diversity of questions ( Chandler 2014 ). The current study question is theoretical and methodological, not empirical. Nevertheless, with a diverse and diffuse literature on the quality of TDR, a systematic review approach provides a method for a thorough and rigorous review. The protocol is published and available at http://www.cifor.org/online-library/browse/view-publication/publication/4382.html . A schematic diagram of the systematic review process is presented in Fig. 1 .

Search process.


2.2 Search terms

Search terms were designed to identify publications that discuss the evaluation or assessment of quality or excellence 2 of research 3 that is done in a TDR context. Search terms are listed online in Supplementary Appendices 2 and 3 . The search strategy favored sensitivity over specificity to ensure that we captured the relevant information.

2.3 Databases searched

ISI Web of Knowledge (WoK) and Scopus were searched between 26 June 2013 and 6 August 2013. The combined searches yielded 15,613 unique citations. Additional searches to update the first searches were carried out in June 2014 and March 2015, for a total of 19,402 titles scanned. Google Scholar (GS) was searched separately by two reviewers during each search period. The first reviewer’s search was done on 2 September 2013 (Search 1) and 3 September 2013 (Search 2), yielding 739 and 745 titles, respectively. The second reviewer’s search was done on 19 November 2013 (Search 1) and 25 November 2013 (Search 2), yielding 769 and 774 titles, respectively. A third search done on 17 March 2015 by one reviewer yielded 98 new titles. Reviewers found high redundancy between the WoK/Scopus searches and the GS searches.

2.4 Targeted journal searches

Highly relevant journals, including Research Evaluation, Evaluation and Program Planning, Scientometrics, Research Policy, Futures, American Journal of Evaluation, Evaluation Review, and Evaluation, were comprehensively searched using broader, more inclusive search strings that would have been unmanageable for the main database search.

2.5 Supplementary searches

References in included articles were reviewed to identify additional relevant literature. td-net’s ‘Tour d’Horizon of Literature’ lists important inter- and transdisciplinary publications, collected through an invitation to experts in the field to submit publications ( td-net 2014 ). Six additional articles were identified through these supplementary searches.

2.6 Limitations of coverage

The review was limited to English-language published articles and material available through internet searches. There was no systematic way to search the gray (unpublished) literature, but relevant material identified through supplementary searches was included.

2.7 Inclusion of articles

This study sought articles that review, critique, discuss, and/or propose principles, criteria, indicators, and/or measures for the evaluation of quality relevant to TDR. As noted, this yielded a large number of titles. We then selected only those articles with an explicit focus on the meaning of IDR and/or TDR quality and how to achieve, measure or evaluate it. Inclusion and exclusion criteria were developed through an iterative process of trial article screening and discussion within the research team. Through this process, inter-reviewer agreement was tested and strengthened. Inclusion criteria are listed in Tables 1 and 2 .

Inclusion criteria for title and abstract screening

Inclusion criteria for abstract and full article screening

Article screening was done in parallel by two reviewers in three rounds: (1) title, (2) abstract, and (3) full article. In cases of uncertainty, papers were carried forward to the next round. Final decisions on inclusion of contested papers were made by consensus among the four team members.

2.8 Critical appraisal

In typical systematic reviews, individual articles are appraised to ensure that they are adequate for answering the research question and to assess the methods of each study for susceptibility to bias that could influence the outcome of the review (Petticrew and Roberts 2006). Most papers included in this review are theoretical and methodological papers, not empirical studies. Most do not have explicit methods that can be appraised with existing quality assessment frameworks. Our critical appraisal considered four criteria adapted from Spencer et al. (2003): (1) relevance to the review question, (2) clarity and logic of how information in the paper was generated, (3) significance of the contribution (are new ideas offered?), and (4) generalizability (is the context specified; do the ideas apply in other contexts?). Disagreements were discussed to reach consensus.

2.9 Data extraction and management

The review sought information on: arguments for or against expanding definitions of research quality, purposes for research quality evaluation, principles of research quality, criteria for research quality assessment, indicators and measures of research quality, and processes for evaluating TDR. Four reviewers independently extracted data from selected articles using the parameters listed in Supplementary Appendix 4 .

2.10 Data synthesis and TDR framework design

Our aim was to synthesize ideas, definitions, and recommendations for TDR quality criteria into a comprehensive and generalizable framework for the evaluation of quality in TDR. Key ideas were extracted from each article and summarized in an Excel database. We classified these ideas into themes and ultimately into overarching principles and associated criteria of TDR quality organized as a rubric ( Wickson and Carew 2014 ). Definitions of each principle and criterion were developed and rubric statements formulated based on the literature and our experience. These criteria (adjusted appropriately to be applied ex ante or ex post ) are intended to be used to assess a TDR project. The reviewer should consider whether the project fully satisfies, partially satisfies, or fails to satisfy each criterion. More information on application is provided in Section 4.3 below.

We tested the framework on a set of completed RRU graduate theses that used transdisciplinary approaches, with an explicit problem orientation and intent to contribute to social or environmental change. Three rounds of testing were done, with revisions after each round to refine and improve the framework.

3.1 Overview of the selected articles

Thirty-eight papers satisfied the inclusion criteria. A wide range of terms are used in the selected papers, including: cross-disciplinary; interdisciplinary; transdisciplinary; methodological pluralism; mode 2; triple helix; and supradisciplinary. Eight included papers specifically focused on sustainability science or TDR in natural resource management, or identified sustainability research as a growing TDR field that needs new forms of evaluation ( Cash et al. 2002 ; Bergmann et al. 2005 ; Chataway, Smith and Wield 2007 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Andrén 2010 ; Carew and Wickson 2010 ; Lang et al. 2012 ; Gaziulusoy and Boyle 2013 ). Carew and Wickson (2010) build on the experience in the TDR realm to propose criteria and indicators of quality for ‘responsible research and innovation’.

The selected articles are written from three main perspectives. One set is primarily interested in advancing TDR approaches. These papers recognize the need for new quality measures to encourage and promote high-quality research and to overcome perceived biases against TDR approaches in research funding and publishing. A second set of papers is written from an evaluation perspective, with a focus on improving evaluation of TDR. The third set is written from the perspective of qualitative research characterized by methodological pluralism, with many characteristics and issues relevant to TDR approaches.

The majority of the articles focus at the project scale, some at the organization level, and some do not specify. Some articles explicitly focus on ex ante evaluation (e.g. proposal evaluation), others on ex post evaluation, and many are not explicit about the project stage they are concerned with. The methods used in the reviewed articles include authors’ reflection and opinion, literature review, expert consultation, document analysis, and case study. Summaries of report characteristics are available online ( Supplementary Appendices 5–8 ). Eight articles provide comprehensive evaluation frameworks and quality criteria specifically for TDR and research-in-context. The rest of the articles discuss aspects of quality related to TDR and recommend quality definitions, criteria, and/or evaluation processes.

3.2 The need for quality criteria and evaluation methods for TDR

Many of the selected articles highlight the lack of widely agreed principles and criteria of TDR quality. They note that, in the absence of TDR quality frameworks, disciplinary criteria are used ( Morrow 2005 ; Boix-Mansilla 2006a , b ; Feller 2006 ; Klein 2006 , 2008 ; Wickson, Carew and Russell 2006 ; Scott 2007 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Oberg 2008 ; Erno-Kjolhede and Hansson 2011 ), and evaluations are often carried out by reviewers who lack cross-disciplinary experience and do not have a shared understanding of quality ( Aagaard-Hansen and Svedin 2009 ). Quality is discussed by many as a relative concept, developed within disciplines, and therefore defined and understood differently in each field ( Morrow 2005 ; Klein 2006 ; Oberg 2008 ; Mitchell and Willets 2009 ; Huutoniemi 2010 ; Hellstrom 2011 ). Jahn and Keil (2015) point out the difficulty of creating a common set of quality criteria for TDR in the absence of a standard agreed-upon definition of TDR. Many of the selected papers argue the need to move beyond narrowly defined ideas of ‘scientific excellence’ to incorporate a broader assessment of quality which includes societal relevance ( Hemlin and Rasmussen 2006 ; Chataway, Smith and Wield 2007 ; Ozga 2007 ; Spaapen, Dijstelbloem and Wamelink 2007 ). This shift includes greater focus on research organization, research process, and continuous learning, rather than primarily on research outputs ( Hemlin and Rasmussen 2006 ; de Jong et al. 2011 ; Wickson and Carew 2014 ; Jahn and Keil 2015 ). This responds to and reflects societal expectations that research should be accountable and have demonstrated utility ( Cloete 1997 ; Defila and Di Giulio 1999 ; Wickson, Carew and Russell 2006 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Stige 2009 ).

A central aim of TDR is to achieve socially relevant outcomes, and TDR quality criteria should demonstrate accountability to society ( Cloete 1997 ; Hemlin and Rasmussen 2006 ; Chataway, Smith and Wield 2007 ; Ozga 2007 ; Spaapen, Dijstelbloem and Wamelink 2007 ; de Jong et al. 2011 ). Integration and mutual learning are core elements of TDR; it is not enough to transcend boundaries and incorporate societal knowledge but, as Carew and Wickson ( 2010 : 1147) summarize: ‘…the TD researcher needs to put effort into integrating these potentially disparate knowledges with a view to creating useable knowledge. That is, knowledge that can be applied in a given problem context and has some prospect of producing desired change in that context’. The inclusion of societal actors in the research process, the unique and often dispersed organization of research teams, and the deliberate integration of different traditions of knowledge production all fall outside of conventional assessment criteria ( Feller 2006 ).

Not only does the range of criteria need to be updated, expanded, and agreed upon, with assumptions made explicit ( Boix-Mansilla 2006a ; Klein 2006 ; Scott 2007 ), but, given the specific problem orientation of TDR, reviewers beyond disciplinary academic peers also need to be included in the assessment of quality ( Cloete 1997 ; Scott 2007 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Klein 2008 ). Several authors discuss the lack of reviewers with strong cross-disciplinary experience ( Aagaard-Hansen and Svedin 2009 ) and the lack of common criteria, philosophical foundations, and language for use by peer reviewers ( Klein 2008 ; Aagaard-Hansen and Svedin 2009 ). Peer review of TDR could be improved with explicit TDR quality criteria and with appropriate processes in place to ensure clear dialog among reviewers.

Finally, there is the need for increased emphasis on evaluation as part of the research process ( Bergmann et al. 2005 ; Hemlin and Rasmussen 2006 ; Meyrick 2006 ; Chataway, Smith and Wield 2007 ; Stige, Malterud and Midtgarden 2009 ; Hellstrom 2011 ; Lang et al. 2012 ; Wickson and Carew 2014 ). This is particularly true in large, complex, problem-oriented research projects. Ongoing monitoring of the research organization and process contributes to learning and adaptive management while research is underway and so helps improve quality. As stated by Wickson and Carew ( 2014 : 262): ‘We believe that in any process of interpreting, rearranging and/or applying these criteria, open negotiation on their meaning and application would only positively foster transformative learning, which is a valued outcome of good TD processes’.

3.3 TDR quality criteria and assessment approaches

Many of the papers provide quality criteria and/or describe constituent parts of quality. Aagaard-Hansen and Svedin (2009) define three key aspects of quality: societal relevance, impact, and integration. Meyrick (2006) states that quality research is transparent and systematic. Boaz and Ashby (2003) describe quality in four dimensions: methodological quality, quality of reporting, appropriateness of methods, and relevance to policy and practice. Although each article deconstructs quality in different ways and with different foci and perspectives, the papers reviewed show significant overlap and recurring themes. There is a broadly shared perspective that TDR quality is a multidimensional concept shaped by the specific context within which research is done ( Spaapen, Dijstelbloem and Wamelink 2007 ; Klein 2008 ), making a universal definition of TDR quality difficult or impossible ( Huutoniemi 2010 ).

Huutoniemi (2010) identifies three main approaches to conceptualizing quality in IDR and TDR: (1) using existing disciplinary standards adapted as necessary for IDR; (2) building on the quality standards of disciplines while fundamentally incorporating ways to deal with epistemological integration, problem focus, context, stakeholders, and process; and (3) radical departure from any disciplinary orientation in favor of external, emergent, context-dependent quality criteria that are defined and enacted collaboratively by a community of users.

The first approach is prominent in current research funding and evaluation protocols. Conservative approaches of this kind are criticized for privileging disciplinary research and for failing to provide guidance and quality control for transdisciplinary projects. The third approach would ‘undermine the prevailing status of disciplinary standards in the pursuit of a non-disciplinary, integrated knowledge system’ ( Huutoniemi 2010 : 313). No predetermined quality criteria are offered, only contextually embedded criteria that need to be developed within a specific research project. To some extent, this is the approach taken by Spaapen, Dijstelbloem and Wamelink (2007) and de Jong et al. (2011) . Such a sui generis approach cannot be used to compare across projects. Most of the reviewed papers take the second approach, and recommend TDR quality criteria that build on a disciplinary base.

Eight articles present comprehensive frameworks for quality evaluation, each with a unique approach, perspective, and goal. Two of these build comprehensive lists of criteria with associated questions to be chosen based on the needs of the particular research project ( Defila and Di Giulio 1999 ; Bergmann et al. 2005 ). Carew and Wickson (2010) develop a reflective heuristic tool with questions to guide researchers through ongoing self-evaluation; they also list criteria that can be used for external evaluation and for comparison between projects. Spaapen, Dijstelbloem and Wamelink (2007) design an approach to evaluate a research project against its own goals, which is not meant to compare between projects. Wickson and Carew (2014) develop a comprehensive rubric for the evaluation of Research and Innovation that builds on their extensive previous work in TDR. Finally, Lang et al. (2012) , Mitchell and Willets (2009) , and Jahn and Keil (2015) develop criteria checklists that can be applied across transdisciplinary projects.

Bergmann et al. (2005) and Carew and Wickson (2010) organize their frameworks into managerial elements of the research project, concerning problem context, participation, management, and outcomes. Lang et al. (2012) and Defila and Di Giulio (1999) focus on the chronological stages in the research process and identify criteria at each stage. Mitchell and Willets (2009) , with a focus on doctoral studies, adapt standard dissertation evaluation criteria to accommodate broader, pluralistic, and more complex studies. Spaapen, Dijstelbloem and Wamelink (2007) focus on evaluating ‘research-in-context’. Wickson and Carew (2014) create a rubric based on criteria that span the research process, its stages, and all actors involved. Jahn and Keil (2015) organize their quality criteria into three categories: quality of the research problems, quality of the research process, and quality of the research results.

The remaining papers highlight key themes that must be considered in TDR evaluation. Dominant themes include: engagement with the problem context; collaboration and inclusion of stakeholders; a heightened need for explicit communication and reflection; integration of epistemologies; recognition of diverse outputs; a focus on having an impact; and reflexivity and adaptation throughout the process. The focus on societal problems in context and the increased engagement of stakeholders in the research process introduce higher levels of complexity that cannot be accommodated by disciplinary standards ( Defila and Di Giulio 1999 ; Bergmann et al. 2005 ; Wickson, Carew and Russell 2006 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Klein 2008 ).

Finally, authors discuss process ( Defila and Di Giulio 1999 ; Bergmann et al. 2005 ; Boix-Mansilla 2006b ; Spaapen, Dijstelbloem and Wamelink 2007 ) and utilitarian values ( Hemlin 2006 ; Erno-Kjolhede and Hansson 2011 ; Bornmann 2013 ) as essential aspects of quality in TDR. Common themes include: (1) the importance of formative and process-oriented evaluation ( Bergmann et al. 2005 ; Hemlin 2006 ; Stige 2009 ); (2) emphasis on the evaluation process itself (not just criteria or outcomes) and reflexive dialog for learning ( Bergmann et al. 2005 ; Boix-Mansilla 2006b ; Klein 2008 ; Oberg 2008 ; Stige, Malterud and Midtgarden 2009 ; Aagaard-Hansen and Svedin 2009 ; Carew and Wickson 2010 ; Huutoniemi 2010 ); (3) the need for peers who are experienced and knowledgeable about TDR for fair peer review ( Boix-Mansilla 2006a , b ; Klein 2006 ; Hemlin 2006 ; Scott 2007 ; Aagaard-Hansen and Svedin 2009 ); (4) the inclusion of stakeholders in the evaluation process ( Bergmann et al. 2005 ; Scott 2007 ; Andrén 2010 ); and (5) the importance of evaluations that are built in-context ( Defila and Di Giulio 1999 ; Feller 2006 ; Spaapen, Dijstelbloem and Wamelink 2007 ; de Jong et al. 2011 ).

While each reviewed approach offers helpful insights, none adequately fulfills the need for a broad and adaptable framework for assessing TDR quality. Wickson and Carew ( 2014 : 257) highlight the need for quality criteria that achieve balance between ‘comprehensiveness and over-prescription’: ‘any emerging quality criteria need to be concrete enough to provide real guidance but flexible enough to adapt to the specificities of varying contexts’. Based on our experience, such a framework should be:

Comprehensive: It should accommodate the main aspects of TDR, as identified in the review.

Time/phase adaptable: It should be applicable across the project cycle.

Scalable: It should be useful for projects of different scales.

Versatile: It should be useful to researchers and collaborators as a guide to research design and management, and to internal and external reviewers and assessors.

Comparable: It should allow comparison of quality between and across projects/programs.

Reflexive: It should encourage and facilitate self-reflection and adaptation based on ongoing learning.

In this section, we synthesize the key principles and criteria of quality in TDR that were identified in the reviewed literature. Principles are the essential elements of high-quality TDR. Criteria are the conditions that need to be met in order to achieve a principle. We conclude by providing a framework for the evaluation of quality in TDR ( Table 3 ) and guidance for its application.

Transdisciplinary research quality assessment framework

a Research problems are the particular topic, area of concern, question to be addressed, challenge, opportunity, or focus of the research activity. Research problems are related to the societal problem but take on a specific focus, or framing, within a societal problem.

b Problem context refers to the social and environmental setting(s) that gives rise to the research problem, including aspects of: location; culture; scale in time and space; social, political, economic, and ecological/environmental conditions; resources and societal capacity available; uncertainty, complexity, and novelty associated with the societal problem; and the extent of agency that is held by stakeholders ( Carew and Wickson 2010 ).

c Words such as ‘appropriate’, ‘suitable’, and ‘adequate’ are used deliberately to allow for quality criteria to be flexible and specific enough to the needs of individual research projects ( Oberg 2008 ).

d Research process refers to the series of decisions made and actions taken throughout the entire duration of the research project and encompassing all aspects of the research project.

e Reflexivity refers to an iterative process of formative, critical reflection on the important interactions and relationships between a research project’s process, context, and product(s).

f In an ex ante evaluation, ‘evidence of’ would be replaced with ‘potential for’.

There is a strong trend in the reviewed articles to recognize the need for appropriate measures of scientific quality (usually adapted from disciplinary antecedents), but also to consider broader sets of criteria regarding the societal significance and applicability of research, and the need for engagement and representation of stakeholder values and knowledge. Cash et al. (2002) nicely conceptualize three key aspects of effective sustainability research as: salience (or relevance), credibility, and legitimacy. These are presented as necessary attributes for research to successfully produce transferable, useful information that can cross boundaries between disciplines, across scales, and between science and society. Many of the papers also refer to the principle that high-quality TDR should be effective in terms of contributing to the solution of problems. These four principles are discussed in the following sections.

4.1.1 Relevance

Relevance is the importance, significance, and usefulness of the research project's objectives, process, and findings to the problem context and to society. This includes the appropriateness of the timing of the research, the questions being asked, the outputs, and the scale of the research in relation to the societal problem being addressed. Good-quality TDR addresses important social/environmental problems and produces knowledge that is useful for decision making and problem solving ( Cash et al. 2002 ; Klein 2006 ). As Erno-Kjolhede and Hansson ( 2011 : 140) explain, quality ‘is first and foremost about creating results that are applicable and relevant for the users of the research’. Researchers must demonstrate an in-depth knowledge of and ongoing engagement with the problem context in which their research takes place ( Wickson, Carew and Russell 2006 ; Stige, Malterud and Midtgarden 2009 ; Mitchell and Willets 2009 ). From the early steps of problem formulation and research design through to the appropriate and effective communication of research findings, the applicability and relevance of the research to the societal problem must be explicitly stated and incorporated.

4.1.2 Credibility

Credibility refers to whether or not the research findings are robust and the knowledge produced is scientifically trustworthy. This includes clear demonstration that the data are adequate, with well-presented methods and logical interpretations of findings. High-quality research is authoritative, transparent, defensible, believable, and rigorous. This is the traditional purview of science, and traditional disciplinary criteria can be applied in TDR evaluation to an extent. Additional and modified criteria are needed to address the integration of epistemologies and methodologies and the development of novel methods through collaboration, the broad preparation and competencies required to carry out the research, and the need for reflection and adaptation when operating in complex systems. Having researchers actively engaged in the problem context and including extra-scientific actors as part of the research process helps to achieve relevance and legitimacy of the research; it also adds complexity and heightened requirements of transparency, reflection, and reflexivity to ensure objective, credible research is carried out.

Active reflexivity is a criterion of credibility of TDR that may seem to contradict more rigid disciplinary methodological traditions ( Carew and Wickson 2010 ). Practitioners of TDR recognize that credible work in these problem-oriented fields requires active reflexivity, epitomized by ongoing learning, flexibility, and adaptation to ensure the research approach and objectives remain relevant and fit-to-purpose ( Lincoln 1995 ; Bergmann et al. 2005 ; Wickson, Carew and Russell 2006 ; Mitchell and Willets 2009 ; Andrén 2010 ; Carew and Wickson 2010 ; Wickson and Carew 2014 ). Changes made during the research process must be justified and reported transparently and explicitly to maintain credibility.

The need for critical reflection on potential bias and limitations becomes more important to maintain credibility of research-in-context ( Lincoln 1995 ; Bergmann et al. 2005 ; Mitchell and Willets 2009 ; Stige, Malterud and Midtgarden 2009 ). Transdisciplinary researchers must ensure they maintain a high level of objectivity and transparency while actively engaging in the problem context. This point demonstrates the fine balance between different aspects of quality, in this case relevance and credibility, and the need to be aware of tensions and to seek complementarities ( Cash et al. 2002 ).

4.1.3 Legitimacy

Legitimacy refers to whether the research process is perceived as fair and ethical by end-users. In other words, is it acceptable and trustworthy in the eyes of those who will use it? This requires the appropriate inclusion and consideration of diverse values and interests, and the ethical and fair representation of all involved. Legitimacy may be achieved in part through the genuine inclusion of stakeholders in the research process. Whereas credibility refers to technical aspects of sound research, legitimacy deals with sociopolitical aspects of the knowledge production process and products of research. Do stakeholders trust the researchers and the research process, including funding sources and other sources of potential bias? Do they feel represented? Legitimate TDR ‘considers appropriate values, concerns, and perspectives of different actors’ ( Cash et al. 2002 : 2) and incorporates these perspectives into the research process through collaboration and mutual learning ( Bergmann et al. 2005 ; Chataway, Smith and Wield 2007 ; Andrén 2010 ; Huutoniemi 2010 ). A fair and ethical process is important to uphold standards of quality in all research. However, there are additional considerations that are unique to TDR.

Because TDR happens in-context and often in collaboration with societal actors, the disclosure of researcher perspective and a transparent statement of all partnerships, financing, and collaboration is vital to ensure an unbiased research process ( Lincoln 1995 ; Defila and Di Giulio 1999 ; Boaz and Ashby 2003 ; Barker and Pistrang 2005 ; Bergmann et al. 2005 ). The disclosure of perspective has both internal and external aspects, on one hand ensuring the researchers themselves explicitly reflect on and account for their own position, potential sources of bias, and limitations throughout the process, and on the other hand making the process transparent to those external to the research group who can then judge the legitimacy based on their perspective of fairness ( Cash et al. 2002 ).

TDR includes the engagement of societal actors along a continuum of participation from consultation to co-creation of knowledge ( Brandt et al. 2013 ). Regardless of the depth of participation, all processes that engage societal actors must ensure that inclusion/engagement is genuine, roles are explicit, and processes for effective and fair collaboration are present ( Bergmann et al. 2005 ; Wickson, Carew and Russell 2006 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Hellstrom 2012 ). Important considerations include: the accurate representation of those involved; explicit and agreed-upon roles and contributions of actors; and adequate planning and procedures to ensure all values, perspectives, and contexts are adequately and appropriately incorporated. Mitchell and Willets (2009) consider cultural competence as a key criterion that can support researchers in navigating diverse epistemological perspectives. This is similar to what Morrow (2005) terms ‘social validity’, a criterion that asks researchers to be responsive to and critically aware of the diversity of perspectives and cultures influenced by their research. Several authors highlight that in order to develop this critical awareness of the diversity of cultural paradigms that operate within a problem situation, researchers should practice responsive, critical, and/or communal reflection ( Bergmann et al. 2005 ; Wickson, Carew and Russell 2006 ; Mitchell and Willets 2009 ; Carew and Wickson 2010 ). Reflection and adaptation are important quality criteria that cut across multiple principles and facilitate learning throughout the process, which is a key foundation of TD inquiry.

4.1.4 Effectiveness

We define effective research as research that contributes to positive change in the social, economic, and/or environmental problem context. Transdisciplinary inquiry is rooted in the objective of solving real-world problems ( Klein 2008 ; Carew and Wickson 2010 ) and must have the potential to ( ex ante ) or actually ( ex post ) make a difference if it is to be considered of high quality ( Erno-Kjolhede and Hansson 2011 ). Potential research effectiveness can be indicated and assessed at the proposal stage and during the research process through: a clear and stated intention to address and contribute to a societal problem, the establishment of the research process and objectives in relation to the problem context, and continuous reflection on the usefulness of the research findings and products to the problem ( Bergmann et al. 2005 ; Lahtinen et al. 2005 ; de Jong et al. 2011 ).

Assessing research effectiveness ex post remains a major challenge, especially in complex transdisciplinary approaches. Conventional and widely used measures of ‘scientific impact’ count outputs such as journal articles and other publications and citations of those outputs (e.g. H index; i10 index). While these are useful indicators of scholarly influence, they are insufficient and inappropriate measures of research effectiveness where research aims to contribute to social learning and change. We need to also (or alternatively) focus on other kinds of research and scholarship outputs and outcomes and the social, economic, and environmental impacts that may result.

For many authors, contributing to learning and building of societal capacity are central goals of TDR ( Defila and Di Giulio 1999 ; Spaapen, Dijstelbloem and Wamelink 2007 ; Carew and Wickson 2010 ; Erno-Kjolhede and Hansson 2011 ; Hellstrom 2011 ), and so are considered part of TDR effectiveness. Learning can be characterized as changes in knowledge, attitudes, or skills and can be assessed directly, or through observed behavioral changes and network and relationship development. Some evaluation methodologies (e.g. Outcome Mapping ( Earl, Carden and Smutylo 2001 )) specifically measure these kinds of changes. Other evaluation methodologies consider the role of research within complex systems and assess effectiveness in terms of contributions to changes in policy and practice and resulting social, economic, and environmental benefits ( ODI 2004 , 2012 ; White and Phillips 2012 ; Mayne et al. 2013 ).

4.2 TDR quality criteria

TDR quality criteria and their definitions (explicit or implicit) were extracted from each article and summarized in an Excel database. These criteria were classified into themes corresponding to the four principles identified above, sorted and refined to develop sets of criteria that are comprehensive, mutually exclusive, and representative of the ideas presented in the reviewed articles. Within each principle, the criteria are organized roughly in the sequence of a typical project cycle (e.g. with research design following problem identification and preceding implementation). Definitions of each criterion were developed to reflect the concepts found in the literature, tested and refined iteratively to improve clarity. Rubric statements were formulated based on the literature and our own experience.

The complete set of principles, criteria, and definitions is presented as the TDR Quality Assessment Framework ( Table 3 ).

4.3 Guidance on the application of the framework

4.3.1 Timing

Most criteria can be applied at each stage of the research process, ex ante, mid-term, and ex post, using appropriate interpretations at each stage. Ex ante (i.e. proposal) assessment should focus on a project’s explicitly stated intentions and approaches to address the criteria. Mid-term indicators will focus on the research process and whether or not it is being implemented in a way that will satisfy the criteria. Ex post assessment should consider whether the research has been done appropriately for the purpose and whether the desired results have been achieved.

4.3.2 New meanings for familiar terms

Many of the terms used in the framework are extensions of disciplinary criteria and share the same or similar names and perhaps similar but nuanced meaning. The principles and criteria used here extend beyond disciplinary antecedents and include new concepts and understandings that encapsulate the unique characteristics and needs of TDR and allow for evaluation and definition of quality in TDR. This is especially true in the criteria related to credibility. These criteria are analogous to traditional disciplinary criteria, but with much stronger emphasis on grounding in both the scientific and the social/environmental contexts. We urge readers to pay close attention to the definitions provided in Table 3 as well as the detailed descriptions of the principles in Section 4.1.

4.3.3 Using the framework

The TDR quality framework ( Table 3 ) is designed to be used to assess TDR research according to a project’s purpose; i.e. the criteria must be interpreted with respect to the context and goals of an individual research activity. The framework ( Table 3 ) lists the main criteria synthesized from the literature and our experience, organized within the principles of relevance, credibility, legitimacy, and effectiveness. The table presents the criteria within each principle, ordered to approximate a typical process of identifying a research problem and designing and implementing research. We recognize that the actual process in any given project will be iterative and will not necessarily follow this sequence, but this provides a logical flow. A concise definition is provided in the second column to explain each criterion. We then provide a rubric statement in the third column, phrased to be applied when the research has been completed. In most cases, the same statement can be used at the proposal stage with a simple tense change or other minor grammatical revision, except for the criteria relating to effectiveness. As discussed above, assessing effectiveness in terms of outcomes and/or impact requires evaluation research. At the proposal stage, it is only possible to assess potential effectiveness.

Many rubrics offer a set of statements for each criterion that represent progressively higher levels of achievement; the evaluator is asked to select the best match. In practice, this often results in vague and relative statements of merit that are difficult to apply. We have opted to present a single rubric statement in absolute terms for each criterion. The assessor can then rank how well a project satisfies each criterion using a simple three-point Likert scale. If a project fully satisfies a criterion—that is, if there is evidence that the criterion has been addressed in a way that is coherent, explicit, sufficient, and convincing—it should be ranked as a 2 for that criterion. A score of 2 means that the evaluator is persuaded that the project addressed that criterion in an intentional, appropriate, explicit, and thorough way. A score of 1 would be given when there is some evidence that the criterion was considered, but it is lacking completion, intention, and/or is not addressed satisfactorily. For example, a score of 1 would be given when a criterion is explicitly discussed but poorly addressed, or when there is some indication that the criterion has been considered and partially addressed but it has not been treated explicitly, thoroughly, or adequately. A score of 0 indicates that there is no evidence that the criterion was addressed or that it was addressed in a way that was misguided or inappropriate.
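The single-statement, three-point scoring scheme described above is simple enough to sketch in code. The following Python fragment is an illustrative sketch only: the principle and criterion names are hypothetical placeholders, not the actual entries of Table 3, and the aggregation by principle is our own convenience, not part of the framework.

```python
# Sketch of the 0-1-2 rubric scoring described above.
# 2 = fully satisfies, 1 = partially satisfies, 0 = no/misguided evidence.
SCALE = {
    0: "no evidence, or addressed in a misguided way",
    1: "partially satisfied",
    2: "fully satisfied",
}

def summarize(scores):
    """Aggregate criterion scores by principle.

    `scores` maps principle -> {criterion: score in {0, 1, 2}}.
    Returns principle -> (total, maximum possible).
    """
    summary = {}
    for principle, criteria in scores.items():
        for s in criteria.values():
            if s not in SCALE:
                raise ValueError(f"score must be 0, 1, or 2, got {s}")
        summary[principle] = (sum(criteria.values()), 2 * len(criteria))
    return summary

# Hypothetical example assessment (criterion names are placeholders).
example = {
    "relevance": {"engagement with problem context": 2, "timeliness": 1},
    "credibility": {"methods transparency": 2, "active reflexivity": 0},
}
print(summarize(example))
# {'relevance': (3, 4), 'credibility': (2, 4)}
```

Keeping the raw per-criterion scores alongside any such totals preserves the diagnostic value of the rubric; the framework itself is intended for criterion-level judgment in context, not for reduction to a single number.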

It is critical that the evaluation be done in context, keeping in mind the purpose, objectives, and resources of the project, as well as other contextual information, such as the intended purpose of grant funding or relevant partnerships. Each project will be unique in its complexities; what is sufficient or adequate in one criterion for one research project may be insufficient or inappropriate for another. Words such as ‘appropriate’, ‘suitable’, and ‘adequate’ are used deliberately to encourage application of criteria to suit the needs of individual research projects ( Oberg 2008 ). Evaluators must consider the objectives of the research project and the problem context within which it is carried out as the benchmark for evaluation. For example, we tested the framework with RRU master’s theses. These are typically small projects with limited scope, carried out by a single researcher. Expectations for ‘effective communication’ or ‘competencies’ or ‘effective collaboration’ are much different in these kinds of projects than in a multi-year, multi-partner CIFOR project. All criteria should be evaluated through the lens of the stated research objectives, research goals, and context.

The systematic review identified relevant articles from a diverse literature that have a strong central focus. Collectively, they highlight the complexity of contemporary social and environmental problems and emphasize that addressing such issues requires combinations of new knowledge and innovation, action, and engagement. Traditional disciplinary research has often failed to provide solutions because it cannot adequately cope with complexity. New forms of research are proliferating, crossing disciplinary and academic boundaries, integrating methodologies, and engaging a broader range of research participants, as a way to make research more relevant and effective. Theoretically, such approaches appear to offer great potential to contribute to transformative change. However, because these approaches are new and because they are multidimensional, complex, and often unique, it has been difficult to know what works, how, and why. In the absence of the kinds of methodological and quality standards that guide disciplinary research, there are no generally agreed criteria for evaluating such research.

Criteria are needed to guide and help ensure that TDR is of high quality, to inform the teaching and learning of new researchers, and to encourage and support the further development of transdisciplinary approaches. The lack of a standard, broadly applicable framework for evaluating quality in TDR is perceived to cause an implicit or explicit devaluation of high-quality TDR, or may prevent quality TDR from being done at all. There is a demonstrated need for an operationalized understanding of quality that addresses the characteristics, contributions, and challenges of TDR. The reviewed articles approach the topic from different perspectives and fields of study, using different terminology for similar concepts, or the same terminology for different concepts, and with unique ways of organizing and categorizing the dimensions and quality criteria. We have synthesized and organized these concepts as key TDR principles and criteria in a TDR Quality Framework, presented as an evaluation rubric. We have tested the framework on a set of master's theses and found it to be broadly applicable, usable, and useful both for analyzing individual projects and for comparing projects within the set. We anticipate that further testing with a wider range of projects will help refine and improve the definitions and rubric statements. We found that the three-point Likert scale (0–2) offered sufficient variability for our purposes, and that rating was less subjective than with relative rubric statements. It may be possible to increase rating precision with more points on the scale, for greater sensitivity in comparisons, for example in a review of proposals for a particular grant application.

Many of the articles we reviewed emphasize the importance of the evaluation process itself. The formative, developmental role of evaluation in TDR is seen as essential to the goals of mutual learning as well as to ensure that research remains responsive and adaptive to the problem context. In order to adequately evaluate quality in TDR, the process, including who carries out the evaluations, when, and in what manner, must be revised to be suitable to the unique characteristics and objectives of TDR. We offer this review and synthesis, along with a proposed TDR quality evaluation framework, as a contribution to an important conversation. We hope that it will be useful to researchers and research managers to help guide research design, implementation and reporting, and to the community of research organizations, funders, and society at large. As underscored in the literature review, there is a need for an adapted research evaluation process that will help advance problem-oriented research in complex systems, ultimately to improve research effectiveness.

This work was supported by funding from the Canada Research Chairs program. Funding support from the Canadian Social Sciences and Humanities Research Council (SSHRC) and technical support from the Evidence Based Forestry Initiative of the Centre for International Forestry Research (CIFOR), funded by UK DfID are also gratefully acknowledged.

Supplementary data are available online.

The authors thank Barbara Livoreil and Stephen Dovers for valuable comments and suggestions on the protocol and Gillian Petrokofsky for her review of the protocol and a draft version of the manuscript. Two anonymous reviewers and the editor provided insightful critique and suggestions in two rounds that have helped to substantially improve the article.

Conflict of interest statement . None declared.

1. ‘Stakeholders’ refers to individuals and groups of societal actors who have an interest in the issue or problem that the research seeks to address.

2. The terms ‘quality’ and ‘excellence’ are often used in the literature with similar meaning. Technically, ‘excellence’ is a relative concept, referring to the superiority of a thing compared to other things of its kind. Quality is an attribute or a set of attributes of a thing. We are interested in what these attributes are or should be in high-quality research. Therefore, the term ‘quality’ is used in this discussion.

3. The terms ‘science’ and ‘research’ are not always clearly distinguished in the literature. We take the position that ‘science’ is a more restrictive term that is properly applied to systematic investigations using the scientific method. ‘Research’ is a broader term for systematic investigations using a range of methods, including but not restricted to the scientific method. We use the term ‘research’ in this broad sense.

Aagaard-Hansen J. Svedin U. ( 2009 ) ‘Quality Issues in Cross-disciplinary Research: Towards a Two-pronged Approach to Evaluation’ , Social Epistemology , 23 / 2 : 165 – 76 . DOI: 10.1080/02691720902992323

Andrén S. ( 2010 ) ‘A Transdisciplinary, Participatory and Action-Oriented Research Approach: Sounds Nice but What do you Mean?’ [unpublished working paper] Human Ecology Division: Lund University, 1–21. < https://lup.lub.lu.se/search/publication/1744256 >

Australian Research Council (ARC) ( 2012 ) ERA 2012 Evaluation Handbook: Excellence in Research for Australia . Australia : ARC . < http://www.arc.gov.au/pdf/era12/ERA%202012%20Evaluation%20Handbook_final%20for%20web_protected.pdf >

Balsiger P. W. ( 2004 ) ‘Supradisciplinary Research Practices: History, Objectives and Rationale’ , Futures , 36 / 4 : 407 – 21 .

Bantilan M. C. et al. ( 2004 ) ‘Dealing with Diversity in Scientific Outputs: Implications for International Research Evaluation’ , Research Evaluation , 13 / 2 : 87 – 93 .

Barker C. Pistrang N. ( 2005 ) ‘Quality Criteria under Methodological Pluralism: Implications for Conducting and Evaluating Research’ , American Journal of Community Psychology , 35 / 3-4 : 201 – 12 .

Bergmann M. et al. ( 2005 ) Quality Criteria of Transdisciplinary Research: A Guide for the Formative Evaluation of Research Projects . Central report of Evalunet – Evaluation Network for Transdisciplinary Research. Frankfurt am Main, Germany: Institute for Social-Ecological Research. < http://www.isoe.de/ftp/evalunet_guide.pdf >

Boaz A. Ashby D. ( 2003 ) Fit for Purpose? Assessing Research Quality for Evidence Based Policy and Practice .

Boix-Mansilla V. ( 2006a ) ‘Symptoms of Quality: Assessing Expert Interdisciplinary Work at the Frontier: An Empirical Exploration’ , Research Evaluation , 15 / 1 : 17 – 29 .

Boix-Mansilla V. ( 2006b ) ‘Conference Report: Quality Assessment in Interdisciplinary Research and Education’ , Research Evaluation , 15 / 1 : 69 – 74 .

Bornmann L. ( 2013 ) ‘What is Societal Impact of Research and How can it be Assessed? A Literature Survey’ , Journal of the American Society for Information Science and Technology , 64 / 2 : 217 – 33 .

Brandt P. et al. ( 2013 ) ‘A Review of Transdisciplinary Research in Sustainability Science’ , Ecological Economics , 92 : 1 – 15 .

Cash D. Clark W.C. Alcock F. Dickson N. M. Eckley N. Jäger J . ( 2002 ) Salience, Credibility, Legitimacy and Boundaries: Linking Research, Assessment and Decision Making (November 2002). KSG Working Papers Series RWP02-046. Available at SSRN: http://ssrn.com/abstract=372280 .

Carew A. L. Wickson F. ( 2010 ) ‘The TD Wheel: A Heuristic to Shape, Support and Evaluate Transdisciplinary Research’ , Futures , 42 / 10 : 1146 – 55 .

Collaboration for Environmental Evidence (CEE) . ( 2013 ) Guidelines for Systematic Review and Evidence Synthesis in Environmental Management . Version 4.2. Environmental Evidence < www.environmentalevidence.org/Documents/Guidelines/Guidelines4.2.pdf >

Chandler J. ( 2014 ) Methods Research and Review Development Framework: Policy, Structure, and Process . < http://methods.cochrane.org/projects-developments/research >

Chataway J. Smith J. Wield D. ( 2007 ) ‘Shaping Scientific Excellence in Agricultural Research’ , International Journal of Biotechnology 9 / 2 : 172 – 87 .

Clark W. C. Dickson N. ( 2003 ) ‘Sustainability Science: The Emerging Research Program’ , PNAS 100 / 14 : 8059 – 61 .

Consultative Group on International Agricultural Research (CGIAR) ( 2011 ) A Strategy and Results Framework for the CGIAR . < http://library.cgiar.org/bitstream/handle/10947/2608/Strategy_and_Results_Framework.pdf?sequence=4 >

Cloete N. ( 1997 ) ‘Quality: Conceptions, Contestations and Comments’, African Regional Consultation Preparatory to the World Conference on Higher Education , Dakar, Senegal, 1-4 April 1997 .

Defila R. DiGiulio A. ( 1999 ) ‘Evaluating Transdisciplinary Research,’ Panorama: Swiss National Science Foundation Newsletter , 1 : 4 – 27 . < www.ikaoe.unibe.ch/forschung/ip/Specialissue.Pano.1.99.pdf >

Donovan C. ( 2008 ) ‘The Australian Research Quality Framework: A Live Experiment in Capturing the Social, Economic, Environmental, and Cultural Returns of Publicly Funded Research. Reforming the Evaluation of Research’ , New Directions for Evaluation , 118 : 47 – 60 .

Earl S. Carden F. Smutylo T. ( 2001 ) Outcome Mapping. Building Learning and Reflection into Development Programs . Ottawa, ON : International Development Research Center .

Ernø-Kjølhede E. Hansson F. ( 2011 ) ‘Measuring Research Performance during a Changing Relationship between Science and Society’ , Research Evaluation , 20 / 2 : 130 – 42 .

Feller I. ( 2006 ) ‘Assessing Quality: Multiple Actors, Multiple Settings, Multiple Criteria: Issues in Assessing Interdisciplinary Research’ , Research Evaluation 15 / 1 : 5 – 15 .

Gaziulusoy A. İ. Boyle C. ( 2013 ) ‘Proposing a Heuristic Reflective Tool for Reviewing Literature in Transdisciplinary Research for Sustainability’ , Journal of Cleaner Production , 48 : 139 – 47 .

Gibbons M. et al. ( 1994 ) The New Production of Knowledge: The Dynamics of Science and Research in Contemporary Societies . London : Sage Publications .

Hellstrom T. ( 2011 ) ‘Homing in on Excellence: Dimensions of Appraisal in Center of Excellence Program Evaluations’ , Evaluation , 17 / 2 : 117 – 31 .

Hellstrom T. ( 2012 ) ‘Epistemic Capacity in Research Environments: A Framework for Process Evaluation’ , Prometheus , 30 / 4 : 395 – 409 .

Hemlin S. Rasmussen S. B . ( 2006 ) ‘The Shift in Academic Quality Control’ , Science, Technology & Human Values , 31 / 2 : 173 – 98 .

Hessels L. K. Van Lente H. ( 2008 ) ‘Re-thinking New Knowledge Production: A Literature Review and a Research Agenda’ , Research Policy , 37 / 4 , 740 – 60 .

Huutoniemi K. ( 2010 ) ‘Evaluating Interdisciplinary Research’ , in Frodeman R. Klein J. T. Mitcham C. (eds) The Oxford Handbook of Interdisciplinarity , pp. 309 – 20 . Oxford : Oxford University Press .

de Jong S. P. L. et al. ( 2011 ) ‘Evaluation of Research in Context: An Approach and Two Cases’ , Research Evaluation , 20 / 1 : 61 – 72 .

Jahn T. Keil F. ( 2015 ) ‘An Actor-Specific Guideline for Quality Assurance in Transdisciplinary Research’ , Futures , 65 : 195 – 208 .

Kates R. ( 2000 ) ‘Sustainability Science’ , World Academies Conference Transition to Sustainability in the 21st Century 5/18/00 , Tokyo, Japan .

Klein J. T . ( 2006 ) ‘Afterword: The Emergent Literature on Interdisciplinary and Transdisciplinary Research Evaluation’ , Research Evaluation , 15 / 1 : 75 – 80 .

Klein J. T . ( 2008 ) ‘Evaluation of Interdisciplinary and Transdisciplinary Research: A Literature Review’ , American Journal of Preventive Medicine , 35 / 2 Supplement : S116 – 23 . DOI: 10.1016/j.amepre.2008.05.010

Royal Netherlands Academy of Arts and Sciences, Association of Universities in the Netherlands, Netherlands Organization for Scientific Research (KNAW) . ( 2009 ) Standard Evaluation Protocol 2009-2015: Protocol for Research Assessment in the Netherlands . Netherlands : KNAW . < www.knaw.nl/sep >

Komiyama H. Takeuchi K. ( 2006 ) ‘Sustainability Science: Building a New Discipline’ , Sustainability Science , 1 : 1 – 6 .

Lahtinen E. et al. ( 2005 ) ‘The Development of Quality Criteria for Research: A Finnish Approach’ , Health Promotion International , 20 / 3 : 306 – 15 .

Lang D. J. et al. ( 2012 ) ‘Transdisciplinary Research in Sustainability Science: Practice, Principles, and Challenges’ , Sustainability Science , 7 / S1 : 25 – 43 .

Lincoln Y. S . ( 1995 ) ‘Emerging Criteria for Quality in Qualitative and Interpretive Research’ , Qualitative Inquiry , 1 / 3 : 275 – 89 .

Mayne J. Stern E. ( 2013 ) Impact Evaluation of Natural Resource Management Research Programs: A Broader View . Australian Centre for International Agricultural Research, Canberra .

Meyrick J . ( 2006 ) ‘What is Good Qualitative Research? A First Step Towards a Comprehensive Approach to Judging Rigour/Quality’ , Journal of Health Psychology , 11 / 5 : 799 – 808 .

Mitchell C. A. Willetts J. R. ( 2009 ) ‘Quality Criteria for Inter- and Trans-Disciplinary Doctoral Research Outcomes’ , Prepared for ALTC Fellowship: Zen and the Art of Transdisciplinary Postgraduate Studies . Sydney : Institute for Sustainable Futures, University of Technology .

Morrow S. L . ( 2005 ) ‘Quality and Trustworthiness in Qualitative Research in Counseling Psychology’ , Journal of Counseling Psychology , 52 / 2 : 250 – 60 .

Nowotny H. Scott P. Gibbons M. ( 2001 ) Re-Thinking Science . Cambridge : Polity .

Nowotny H. Scott P. Gibbons M. ( 2003 ) ‘‘Mode 2’ Revisited: The New Production of Knowledge’ , Minerva , 41 : 179 – 94 .

Öberg G . ( 2008 ) ‘Facilitating Interdisciplinary Work: Using Quality Assessment to Create Common Ground’ , Higher Education , 57 / 4 : 405 – 15 .

Ozga J . ( 2007 ) ‘Co-production of Quality in the Applied Education Research Scheme’ , Research Papers in Education , 22 / 2 : 169 – 81 .

Ozga J . ( 2008 ) ‘Governing Knowledge: Research Steering and Research Quality’ , European Educational Research Journal , 7 / 3 : 261 – 72 .

OECD ( 2012 ) Frascati Manual 6th ed. < http://www.oecd.org/innovation/inno/frascatimanualproposedstandardpracticeforsurveysonresearchandexperimentaldevelopment6thedition >

Overseas Development Institute (ODI) ( 2004 ) ‘Bridging Research and Policy in International Development: An Analytical and Practical Framework’, ODI Briefing Paper. < http://www.odi.org/sites/odi.org.uk/files/odi-assets/publications-opinion-files/198.pdf >

Overseas Development Institute (ODI) . ( 2012 ) RAPID Outcome Assessment Guide . < http://www.odi.org/sites/odi.org.uk/files/odi-assets/publications-opinion-files/7815.pdf >

Pullin A. S. Stewart G. B. ( 2006 ) ‘Guidelines for Systematic Review in Conservation and Environmental Management’ , Conservation Biology , 20 / 6 : 1647 – 56 .

Research Excellence Framework (REF) . ( 2011 ) Research Excellence Framework 2014: Assessment Framework and Guidance on Submissions. Reference REF 02.2011. UK: REF. < http://www.ref.ac.uk/pubs/2011-02/ >

Scott A . ( 2007 ) ‘Peer Review and the Relevance of Science’ , Futures , 39 / 7 : 827 – 45 .

Spaapen J. Dijstelbloem H. Wamelink F. ( 2007 ) Evaluating Research in Context: A Method for Comprehensive Assessment . Netherlands: Consultative Committee of Sector Councils for Research and Development. < http://www.qs.univie.ac.at/fileadmin/user_upload/qualitaetssicherung/PDF/Weitere_Aktivit%C3%A4ten/Eric.pdf >

Spaapen J. Van Drooge L. ( 2011 ) ‘Introducing “Productive Interactions” in Social Impact Assessment’ , Research Evaluation , 20 : 211 – 18 .

Stige B. Malterud K. Midtgarden T. ( 2009 ) ‘Toward an Agenda for Evaluation of Qualitative Research’ , Qualitative Health Research , 19 / 10 : 1504 – 16 .

td-net ( 2014 ) td-net. < www.transdisciplinarity.ch/e/Bibliography/new.php >

Tertiary Education Commission (TEC) . ( 2012 ) Performance-based Research Fund: Quality Evaluation Guidelines 2012. New Zealand: TEC. < http://www.tec.govt.nz/Documents/Publications/PBRF-Quality-Evaluation-Guidelines-2012.pdf >

Tijssen R. J. W. ( 2003 ) ‘Quality Assurance: Scoreboards of Research Excellence’ , Research Evaluation , 12 : 91 – 103 .

White H. Phillips D. ( 2012 ) ‘Addressing Attribution of Cause and Effect in Small n Impact Evaluations: Towards an Integrated Framework’. Working Paper 15. New Delhi: International Initiative for Impact Evaluation .

Wickson F. Carew A. ( 2014 ) ‘Quality Criteria and Indicators for Responsible Research and Innovation: Learning from Transdisciplinarity’ , Journal of Responsible Innovation , 1 / 3 : 254 – 73 .

Wickson F. Carew A. Russell A. W. ( 2006 ) ‘Transdisciplinary Research: Characteristics, Quandaries and Quality’ , Futures , 38 / 9 : 1046 – 59 .

  • Online ISSN 1471-5449
  • Print ISSN 0958-2029
  • Copyright © 2024 Oxford University Press

Neurol Res Pract

How to use and assess qualitative research methods

Loraine Busetto

1 Department of Neurology, Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany

Wolfgang Wick

2 Clinical Cooperation Unit Neuro-Oncology, German Cancer Research Center, Heidelberg, Germany

Christoph Gumbinger

Associated data

Not applicable.

This paper aims to provide an overview of the use and assessment of qualitative research methods in the health sciences. Qualitative research can be defined as the study of the nature of phenomena and is especially appropriate for answering questions of why something is (not) observed, assessing complex multi-component interventions, and focussing on intervention improvement. The most common methods of data collection are document study, (non-) participant observations, semi-structured interviews and focus groups. For data analysis, field-notes and audio-recordings are transcribed into protocols and transcripts, and coded using qualitative data management software. Criteria such as checklists, reflexivity, sampling strategies, piloting, co-coding, member-checking and stakeholder involvement can be used to enhance and assess the quality of the research conducted. Using qualitative in addition to quantitative designs will equip us with better tools to address a greater range of research problems, and to fill in blind spots in current neurological research and practice.

The aim of this paper is to provide an overview of qualitative research methods, including hands-on information on how they can be used, reported and assessed. This article is intended for beginning qualitative researchers in the health sciences as well as experienced quantitative researchers who wish to broaden their understanding of qualitative research.

What is qualitative research?

Qualitative research is defined as “the study of the nature of phenomena”, including “their quality, different manifestations, the context in which they appear or the perspectives from which they can be perceived” , but excluding “their range, frequency and place in an objectively determined chain of cause and effect” [ 1 ]. This formal definition can be complemented with a more pragmatic rule of thumb: qualitative research generally includes data in form of words rather than numbers [ 2 ].

Why conduct qualitative research?

Because some research questions cannot be answered using (only) quantitative methods. For example, one Australian study addressed the issue of why patients from Aboriginal communities often present late or not at all to specialist services offered by tertiary care hospitals. Using qualitative interviews with patients and staff, it found one of the most significant access barriers to be transportation problems, including some towns and communities simply not having a bus service to the hospital [ 3 ]. A quantitative study could have measured the number of patients over time or even looked at possible explanatory factors – but only those previously known or suspected to be of relevance. To discover reasons for observed patterns, especially the invisible or surprising ones, qualitative designs are needed.

While qualitative research is common in other fields, it is still relatively underrepresented in health services research. The latter field is more traditionally rooted in the evidence-based-medicine paradigm, as seen in “research that involves testing the effectiveness of various strategies to achieve changes in clinical practice, preferably applying randomised controlled trial study designs (...)” [ 4 ]. This focus on quantitative research and specifically randomised controlled trials (RCTs) is visible in the idea of a hierarchy of research evidence, which assumes that some research designs are objectively better than others, and that choosing a “lesser” design is only acceptable when the better ones are not practically or ethically feasible [ 5 , 6 ]. Others, however, argue that an objective hierarchy does not exist, and that, instead, the research design and methods should be chosen to fit the specific research question at hand – “questions before methods” [ 2 , 7 – 9 ]. This means that even when an RCT is possible, some research problems require a different design that is better suited to addressing them. Arguing in JAMA, Berwick uses the example of rapid response teams in hospitals, which he describes as “a complex, multicomponent intervention – essentially a process of social change” susceptible to a range of different context factors, including leadership or organisation history. According to him, “[in] such complex terrain, the RCT is an impoverished way to learn. Critics who use it as a truth standard in this context are incorrect” [ 8 ]. Instead of limiting oneself to RCTs, Berwick recommends embracing a wider range of methods, including qualitative ones, which for “these specific applications, (...) are not compromises in learning how to improve; they are superior” [ 8 ].

Research problems that can be approached particularly well using qualitative methods include assessing complex multi-component interventions or systems (of change), addressing questions beyond “what works”, towards “what works for whom when, how and why”, and focussing on intervention improvement rather than accreditation [ 7 , 9 – 12 ]. Using qualitative methods can also help shed light on the “softer” side of medical treatment. For example, while quantitative trials can measure the costs and benefits of neuro-oncological treatment in terms of survival rates or adverse effects, qualitative research can help provide a better understanding of patient or caregiver stress, visibility of illness or out-of-pocket expenses.

How to conduct qualitative research?

Given that qualitative research is characterised by flexibility, openness and responsivity to context, the steps of data collection and analysis are not as separate and consecutive as they tend to be in quantitative research [ 13 , 14 ]. As Fossey puts it: “sampling, data collection, analysis and interpretation are related to each other in a cyclical (iterative) manner, rather than following one after another in a stepwise approach” [ 15 ]. The researcher can make educated decisions with regard to the choice of method, how they are implemented, and to which and how many units they are applied [ 13 ]. As shown in Fig. 1 , this can involve several back-and-forth steps between data collection and analysis where new insights and experiences can lead to adaptation and expansion of the original plan. Some insights may also necessitate a revision of the research question and/or the research design as a whole. The process ends when saturation is achieved, i.e. when no relevant new information can be found (see also below: sampling and saturation). For reasons of transparency, it is essential for all decisions as well as the underlying reasoning to be well-documented.

[Fig. 1: Iterative research process]
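The back-and-forth between collection and analysis, ending at saturation, can be sketched as a simple loop. This is only a toy illustration: `reach_saturation`, the stand-in `extract_themes` callback, and the pre-coded themes are hypothetical, not an actual qualitative-analysis tool.

```python
# Toy sketch of iterating data collection and analysis until saturation,
# i.e. until a new transcript contributes no new themes.

def reach_saturation(transcripts, extract_themes):
    """Analyse transcripts in order; stop once a transcript adds no new themes."""
    themes, analysed = set(), []
    for transcript in transcripts:
        new = extract_themes(transcript) - themes
        analysed.append(transcript)
        if not new and themes:   # saturation reached: nothing new was learned
            break
        themes |= new
    return themes, analysed

# Hypothetical themes already coded for each interview:
coded = {
    "interview 1": {"transport barriers", "cost"},
    "interview 2": {"cost", "trust in providers"},
    "interview 3": {"trust in providers"},   # adds nothing new -> stop
    "interview 4": {"language"},             # never analysed
}
themes, analysed = reach_saturation(coded, lambda t: coded[t])
print(len(analysed), sorted(themes))
# 3 ['cost', 'transport barriers', 'trust in providers']
```

In real work the stopping decision is a documented judgment call by the researcher, not an automatic check, but the loop mirrors the iterative logic of Fig. 1.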

While it is not always explicitly addressed, qualitative methods reflect a different underlying research paradigm than quantitative research (e.g. constructivism or interpretivism as opposed to positivism). The choice of methods can be based on the respective underlying substantive theory or theoretical framework used by the researcher [ 2 ].

Data collection

The methods of qualitative data collection most commonly used in health research are document study, observations, semi-structured interviews and focus groups [ 1 , 14 , 16 , 17 ].

Document study

Document study (also called document analysis) refers to the review by the researcher of written materials [ 14 ]. These can include personal and non-personal documents such as archives, annual reports, guidelines, policy documents, diaries or letters.

Observations

Observations are particularly useful to gain insights into a certain setting and actual behaviour – as opposed to reported behaviour or opinions [ 13 ]. Qualitative observations can be either participant or non-participant in nature. In participant observations, the observer is part of the observed setting, for example a nurse working in an intensive care unit [ 18 ]. In non-participant observations, the observer is “on the outside looking in”, i.e. present in but not part of the situation, trying not to influence the setting by their presence. Observations can be planned (e.g. for 3 h during the day or night shift) or ad hoc (e.g. as soon as a stroke patient arrives at the emergency room). During the observation, the observer takes notes on everything or certain pre-determined parts of what is happening around them, for example focusing on physician-patient interactions or communication between different professional groups. Written notes can be taken during or after the observations, depending on feasibility (which is usually lower during participant observations) and acceptability (e.g. when the observer is perceived to be judging the observed). Afterwards, these field notes are transcribed into observation protocols. If more than one observer was involved, field notes are taken independently, but notes can be consolidated into one protocol after discussions. Advantages of conducting observations include minimising the distance between the researcher and the researched, the potential discovery of topics that the researcher did not realise were relevant and gaining deeper insights into the real-world dimensions of the research problem at hand [ 18 ].

Semi-structured interviews

Hijmans & Kuyper describe qualitative interviews as “an exchange with an informal character, a conversation with a goal” [ 19 ]. Interviews are used to gain insights into a person’s subjective experiences, opinions and motivations – as opposed to facts or behaviours [ 13 ]. Interviews can be distinguished by the degree to which they are structured (i.e. a questionnaire), open (e.g. free conversation or autobiographical interviews) or semi-structured [ 2 , 13 ]. Semi-structured interviews are characterised by open-ended questions and the use of an interview guide (or topic guide/list) in which the broad areas of interest, sometimes including sub-questions, are defined [ 19 ]. The pre-defined topics in the interview guide can be derived from the literature, previous research or a preliminary method of data collection, e.g. document study or observations. The topic list is usually adapted and improved at the start of the data collection process as the interviewer learns more about the field [ 20 ]. Across interviews the focus on the different (blocks of) questions may differ and some questions may be skipped altogether (e.g. if the interviewee is not able or willing to answer the questions or for concerns about the total length of the interview) [ 20 ]. Qualitative interviews are usually not conducted in written format as this impedes the interactive component of the method [ 20 ]. In comparison to written surveys, qualitative interviews have the advantage of being interactive and allowing for unexpected topics to emerge and to be taken up by the researcher. This can also help overcome a provider- or researcher-centred bias often found in written surveys, which, by nature, can only measure what is already known or expected to be of relevance to the researcher. Interviews can be audio- or video-taped; but sometimes it is only feasible or acceptable for the interviewer to take written notes [ 14 , 16 , 20 ].

Focus groups

Focus groups are group interviews to explore participants’ expertise and experiences, including explorations of how and why people behave in certain ways [ 1 ]. Focus groups usually consist of 6–8 people and are led by an experienced moderator following a topic guide or “script” [ 21 ]. They can involve an observer who takes note of the non-verbal aspects of the situation, possibly using an observation guide [ 21 ]. Depending on researchers’ and participants’ preferences, the discussions can be audio- or video-taped and transcribed afterwards [ 21 ]. Focus groups are useful for bringing together homogeneous (to a lesser extent heterogeneous) groups of participants with relevant expertise and experience on a given topic on which they can share detailed information [ 21 ]. Focus groups are a relatively easy, fast and inexpensive method to gain access to information on interactions in a given group, i.e. “the sharing and comparing” among participants [ 21 ]. Disadvantages include less control over the process and a lesser extent to which each individual may participate. Moreover, focus group moderators need experience, as do those tasked with the analysis of the resulting data. Focus groups can be less appropriate for discussing sensitive topics that participants might be reluctant to disclose in a group setting [ 13 ]. Moreover, attention must be paid to the emergence of “groupthink” as well as possible power dynamics within the group, e.g. when patients are awed or intimidated by health professionals.

Choosing the “right” method

As explained above, the school of thought underlying qualitative research assumes no objective hierarchy of evidence and methods. This means that each choice of single or combined methods has to be based on the research question to be answered and on a critical assessment of whether, or to what extent, the chosen method can accomplish this – i.e. the “fit” between question and method [14]. These decisions should be documented as they are made and critically discussed when reporting methods and results.

Let us assume that our research aim is to examine the (clinical) processes around acute endovascular treatment (EVT), from the patient’s arrival at the emergency room to recanalization, with the aim to identify possible causes for delay and/or other causes for sub-optimal treatment outcome. As a first step, we could conduct a document study of the relevant standard operating procedures (SOPs) for this phase of care – are they up-to-date and in line with current guidelines? Do they contain any mistakes, irregularities or uncertainties that could cause delays or other problems? Regardless of the answers to these questions, the results have to be interpreted based on what they are: a written outline of what care processes in this hospital should look like. If we want to know what they actually look like in practice, we can conduct observations of the processes described in the SOPs. These results can (and should) be analysed in themselves, but also in comparison to the results of the document analysis, especially as regards relevant discrepancies. Do the SOPs outline specific tests for which no equipment can be observed or tasks to be performed by specialized nurses who are not present during the observation? It might also be possible that the written SOP is outdated, but the actual care provided is in line with current best practice. In order to find out why these discrepancies exist, it can be useful to conduct interviews. Are the physicians simply not aware of the SOPs (because their existence is limited to the hospital’s intranet) or do they actively disagree with them or does the infrastructure make it impossible to provide the care as described? Another rationale for adding interviews is that some situations (or all of their possible variations for different patient groups or the day, night or weekend shift) cannot practically or ethically be observed. 
In this case, it is possible to ask those involved to report on their actions – bearing in mind that this is not the same as actual observation. A senior physician’s or hospital manager’s description of certain situations might differ from a nurse’s or junior physician’s, perhaps because they intentionally misrepresent facts, or perhaps because different aspects of the process are visible or important to them. In some cases, it can also be relevant to consider to whom the interviewee is disclosing this information – someone they trust, someone they are otherwise not connected to, or someone they suspect or know to be in a potentially “dangerous” power relationship to them. Lastly, a focus group could be conducted with representatives of the relevant professional groups to explore how and why exactly they provide care around EVT. The discussion might reveal discrepancies (between SOPs and actual care, or between different physicians) and motivations – to the researchers as well as to the focus group members, who may not have been aware of them themselves. For the focus group to deliver relevant information, attention has to be paid to its composition and conduct, for example to make sure that all participants feel safe to disclose sensitive or potentially problematic information, and that the discussion is not dominated by (senior) physicians alone. The resulting combination of data collection methods is shown in Fig. 2.

Fig. 2: Possible combination of data collection methods

Attributions for icons: “Book” by Serhii Smirnov, “Interview” by Adrien Coquet, FR, “Magnifying Glass” by anggun, ID, “Business communication” by Vectors Market; all from the Noun Project

The combination of multiple data sources as described in this example can be referred to as “triangulation”, in which multiple measurements are carried out from different angles to achieve a more comprehensive understanding of the phenomenon under study [22, 23].

Data analysis

To analyse the data collected through observations, interviews and focus groups, these need to be transcribed into protocols and transcripts (see Fig. 3). Interviews and focus groups can be transcribed verbatim, with or without annotations for behaviour (e.g. laughing, crying, pausing) and with or without phonetic transcription of dialects and filler words, depending on what is expected or known to be relevant for the analysis. In the next step, the protocols and transcripts are coded, that is, marked (or tagged, labelled) with one or more short descriptors of the content of a sentence or paragraph [2, 15, 23]. Jansen describes coding as “connecting the raw data with ‘theoretical’ terms” [20]. In a more practical sense, coding makes raw data sortable. This makes it possible to extract and examine all segments describing, say, a tele-neurology consultation from multiple data sources (e.g. SOPs, emergency room observations, staff and patient interviews). In a process of synthesis and abstraction, the codes are then grouped, summarised and/or categorised [15, 20]. The end product of the coding or analysis process is a descriptive theory of the behavioural pattern under investigation [20]. The coding process is usually performed using qualitative data management software, the most common being NVivo, MAXQDA and ATLAS.ti. It should be noted that these are data management tools which support, rather than replace, the analysis performed by the researcher(s) [14].
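As a loose illustration of how coding makes raw data sortable, the following sketch tags text segments with codes and then groups them by code. All data, source names and codes here are invented for illustration; in practice this tagging is done interactively by researchers in a tool such as NVivo or MAXQDA.

```python
from collections import defaultdict

# Hypothetical coded segments: (source, text, assigned codes).
segments = [
    ("interview_01", "The tele-neurology consult took 20 minutes to set up.",
     ["tele-neurology", "delay"]),
    ("observation_ER", "Nurse waited for phone confirmation by the stroke physician.",
     ["delay", "communication"]),
    ("SOP_EVT", "Tele-neurology consultation is initiated on arrival.",
     ["tele-neurology"]),
]

# Coding makes raw data sortable: collect all segments per code,
# across data sources, ready for synthesis and abstraction.
by_code = defaultdict(list)
for source, text, codes in segments:
    for code in codes:
        by_code[code].append((source, text))

# Extract every segment describing a tele-neurology consultation.
for source, text in by_code["tele-neurology"]:
    print(f"{source}: {text}")
```

The grouping step mirrors what the software does behind the scenes: one click on a code retrieves all tagged segments, regardless of which data source they came from.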

Fig. 3: From data collection to data analysis

Attributions for icons: see Fig. 2; also “Speech to text” by Trevor Dsouza, “Field Notes” by Mike O’Brien, US, “Voice Record” by ProSymbols, US, “Inspection” by Made, AU, and “Cloud” by Graphic Tigers; all from the Noun Project

How to report qualitative research?

Protocols of qualitative research can be published separately and in advance of the study results. However, the aim is not the same as in RCT protocols, i.e. to pre-define and set in stone the research questions and primary or secondary endpoints. Rather, it is a way to describe the research methods in detail, which might not be possible in the results paper given journals’ word limits. Qualitative research papers are usually longer than their quantitative counterparts to allow for deep understanding and so-called “thick description”. In the methods section, the focus is on transparency of the methods used, including why, how and by whom they were implemented in the specific study setting, so as to enable a discussion of whether and how this may have influenced data collection, analysis and interpretation. The results section usually starts with a paragraph outlining the main findings, followed by more detailed descriptions of, for example, the commonalities, discrepancies or exceptions per category [ 20 ]. Here it is important to support main findings by relevant quotations, which may add information, context, emphasis or real-life examples [ 20 , 23 ]. It is subject to debate in the field whether it is relevant to state the exact number or percentage of respondents supporting a certain statement (e.g. “Five interviewees expressed negative feelings towards XYZ”) [ 21 ].

How to combine qualitative with quantitative research?

Qualitative methods can be combined with other methods in multi- or mixed methods designs, which “[employ] two or more different methods […] within the same study or research program rather than confining the research to one single method” [24]. Reasons for combining methods can be diverse, including triangulation for corroboration of findings, complementarity for illustration and clarification of results, expansion to extend the breadth and range of the study, explanation of (unexpected) results generated with one method with the help of another, or offsetting the weakness of one method with the strength of another [1, 17, 24–26]. The resulting designs can be classified according to when, why and how the different quantitative and/or qualitative data strands are combined. The three most common types of mixed methods design are the convergent parallel design, the explanatory sequential design and the exploratory sequential design. The designs, with examples, are shown in Fig. 4.

Fig. 4: Three common mixed methods designs

In the convergent parallel design, a qualitative study is conducted in parallel to, and independently of, a quantitative study, and the results of both are compared and combined at the interpretation stage. Using the above example of EVT provision, this could entail setting up a quantitative EVT registry to measure process times and patient outcomes in parallel to conducting the qualitative research outlined above, and then comparing results. Amongst other things, this would make it possible to assess whether interview respondents’ subjective impressions of patients receiving good care match modified Rankin Scores at follow-up, or whether observed delays in care provision are exceptions or the rule when compared to door-to-needle times as documented in the registry. In the explanatory sequential design, a quantitative study is carried out first, followed by a qualitative study that helps to explain the quantitative results. This would be an appropriate design if the registry alone had revealed relevant delays in door-to-needle times and the qualitative study were used to understand where and why they occurred, and how they could be reduced. In the exploratory sequential design, the qualitative study is carried out first and its results help to inform and build the quantitative study in the next step [26]. If the qualitative study around EVT provision had shown a high level of dissatisfaction among the staff members involved, a quantitative questionnaire investigating staff satisfaction could be set up in the next step, informed by the qualitative findings on the topics about which dissatisfaction had been expressed. Amongst other things, the questionnaire design would make it possible to widen the reach of the research to more respondents from different (types of) hospitals, regions, countries or settings, and to conduct sub-group analyses for different professional groups.

How to assess qualitative research?

A variety of assessment criteria and lists have been developed for qualitative research, ranging in their focus and comprehensiveness [ 14 , 17 , 27 ]. However, none of these has been elevated to the “gold standard” in the field. In the following, we therefore focus on a set of commonly used assessment criteria that, from a practical standpoint, a researcher can look for when assessing a qualitative research report or paper.

Assessors should check the authors’ use of and adherence to the relevant reporting checklists (e.g. Standards for Reporting Qualitative Research (SRQR)) to make sure all items that are relevant for this type of research are addressed [ 23 , 28 ]. Discussions of quantitative measures in addition to or instead of these qualitative measures can be a sign of lower quality of the research (paper). Providing and adhering to a checklist for qualitative research contributes to an important quality criterion for qualitative research, namely transparency [ 15 , 17 , 23 ].

Reflexivity

While methodological transparency and complete reporting is relevant for all types of research, some additional criteria must be taken into account for qualitative research. This includes what is called reflexivity, i.e. sensitivity to the relationship between the researcher and the researched, including how contact was established and maintained, or the background and experience of the researcher(s) involved in data collection and analysis. Depending on the research question and population to be researched this can be limited to professional experience, but it may also include gender, age or ethnicity [ 17 , 27 ]. These details are relevant because in qualitative research, as opposed to quantitative research, the researcher as a person cannot be isolated from the research process [ 23 ]. It may influence the conversation when an interviewed patient speaks to an interviewer who is a physician, or when an interviewee is asked to discuss a gynaecological procedure with a male interviewer, and therefore the reader must be made aware of these details [ 19 ].

Sampling and saturation

The aim of qualitative sampling is for all variants of the objects of observation that are deemed relevant for the study to be present in the sample, “to see the issue and its meanings from as many angles as possible” [1, 16, 19, 20, 27], and to ensure “information-richness” [15]. An iterative sampling approach is advised, in which data collection (e.g. five interviews) is followed by data analysis, followed by more data collection to find variants that are lacking in the current sample. This process continues until no new (relevant) information can be found and further sampling becomes redundant – which is called saturation [1, 15]. In other words: qualitative data collection finds its end point not a priori, but when the research team determines that saturation has been reached [29, 30].
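The stopping logic of this iterative cycle can be sketched as a simple loop. The batches of codes below are invented for illustration: each round of collection and analysis adds its codes to the running set, and sampling stops once a round yields nothing new.

```python
# Each batch holds the codes identified in one round of data collection
# and analysis (e.g. five interviews). All batches are hypothetical.
batches = [
    {"delay", "communication", "staffing"},
    {"delay", "equipment"},
    {"communication", "equipment"},  # contributes nothing new
    {"delay"},
]

seen = set()
rounds = 0
for batch in batches:
    rounds += 1
    new_codes = batch - seen  # variants not yet present in the sample
    seen |= batch
    if not new_codes:
        break  # further sampling is redundant: saturation reached

print(f"saturation after {rounds} rounds, codes: {sorted(seen)}")
```

In a real study the “batches” are not known in advance; the loop runs until the research team judges that a round of analysis has added nothing relevant.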

This is also the reason why most qualitative studies use deliberate instead of random sampling strategies. This is generally referred to as “purposive sampling”, in which researchers pre-define which types of participants or cases they need to include so as to cover all variations that are expected to be of relevance, based on the literature, previous experience or theory (i.e. theoretical sampling) [14, 20]. Other types of purposive sampling include (but are not limited to) maximum variation sampling, critical case sampling, and extreme or deviant case sampling [2]. In the above EVT example, a purposive sample could include all relevant professional groups and/or all relevant stakeholders (patients, relatives) and/or all relevant times of observation (day, night and weekend shifts).

Assessors of qualitative research should check whether the considerations underlying the sampling strategy were sound and whether or how researchers tried to adapt and improve their strategies in stepwise or cyclical approaches between data collection and analysis to achieve saturation [ 14 ].

Good qualitative research is iterative in nature, i.e. it goes back and forth between data collection and analysis, revising and improving the approach where necessary. One example of this are pilot interviews, where different aspects of the interview (especially the interview guide, but also, for example, the site of the interview or whether the interview can be audio-recorded) are tested with a small number of respondents, evaluated and revised [ 19 ]. In doing so, the interviewer learns which wording or types of questions work best, or which is the best length of an interview with patients who have trouble concentrating for an extended time. Of course, the same reasoning applies to observations or focus groups which can also be piloted.

Ideally, coding should be performed by at least two researchers, especially at the beginning of the coding process, when a common approach must be defined, including the establishment of a useful coding list (or tree) and of a common meaning for individual codes [23]. Either an initial subset of the transcripts or all of them can be coded independently by the coders and then compared and consolidated through regular discussions in the research team. This ensures that codes are applied consistently to the research data.

Member checking

Member checking, also called respondent validation , refers to the practice of checking back with study respondents to see if the research is in line with their views [ 14 , 27 ]. This can happen after data collection or analysis or when first results are available [ 23 ]. For example, interviewees can be provided with (summaries of) their transcripts and asked whether they believe this to be a complete representation of their views or whether they would like to clarify or elaborate on their responses [ 17 ]. Respondents’ feedback on these issues then becomes part of the data collection and analysis [ 27 ].

Stakeholder involvement

In those niches where qualitative approaches have been able to evolve and grow, a new trend has seen the inclusion of patients and their representatives not only as study participants (i.e. “members”, see above) but as consultants to and active participants in the broader research process [ 31 – 33 ]. The underlying assumption is that patients and other stakeholders hold unique perspectives and experiences that add value beyond their own single story, making the research more relevant and beneficial to researchers, study participants and (future) patients alike [ 34 , 35 ]. Using the example of patients on or nearing dialysis, a recent scoping review found that 80% of clinical research did not address the top 10 research priorities identified by patients and caregivers [ 32 , 36 ]. In this sense, the involvement of the relevant stakeholders, especially patients and relatives, is increasingly being seen as a quality indicator in and of itself.

How not to assess qualitative research

The above overview does not include certain items that are routine in assessments of quantitative research. What follows is a non-exhaustive, non-representative, experience-based list of the quantitative criteria often applied to the assessment of qualitative research, as well as an explanation of the limited usefulness of these endeavours.

Protocol adherence

Given the openness and flexibility of qualitative research, it should not be assessed by how well it adheres to pre-determined and fixed strategies – in other words: its rigidity. Instead, the assessor should look for signs of adaptation and refinement based on lessons learned from earlier steps in the research process.

Sample size

For the reasons explained above, qualitative research does not require specific sample sizes, nor does it require that the sample size be determined a priori [ 1 , 14 , 27 , 37 – 39 ]. Sample size can only be a useful quality indicator when related to the research purpose, the chosen methodology and the composition of the sample, i.e. who was included and why.

Randomisation

While some authors argue that randomisation can be used in qualitative research, this is not commonly the case, as neither its feasibility nor its necessity or usefulness has been convincingly established for qualitative research [13, 27]. Relevant disadvantages include the negative impact of an overly large sample size as well as the possibility (or probability) of selecting “quiet, uncooperative or inarticulate individuals” [17]. Qualitative studies do not use control groups, either.

Interrater reliability, variability and other “objectivity checks”

The concept of “interrater reliability” is sometimes used in qualitative research to assess the extent to which the codings of two co-coders overlap. However, it is not clear what this measure tells us about the quality of the analysis [23]. Such scores can therefore be included in qualitative research reports, preferably with some additional information on what the score means for the analysis, but they are not a requirement. Relatedly, separating the people who recruit the study participants from those who collect and analyse the data is not relevant for the quality or “objectivity” of qualitative research. Experience even shows that it can be better to have the same person or team perform all of these tasks [20]. First, when researchers introduce themselves during recruitment, this can enhance trust when the interview takes place days or weeks later with the same researcher. Second, when the audio-recording is transcribed for analysis, the researcher who conducted the interviews will usually remember the interviewee and the specific interview situation during data analysis. This can provide additional contextual information for the interpretation of data, e.g. on whether something might have been meant as a joke [18].
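Where such a score is reported, it is often Cohen’s kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The sketch below shows the calculation for the simple case of one code per segment; the two codings are invented for illustration.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning one code per segment."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of segments coded identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: sum over codes of the product of each coder's
    # marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of ten transcript segments by two co-coders.
a = ["delay", "delay", "staffing", "delay", "equipment",
     "delay", "staffing", "delay", "delay", "equipment"]
b = ["delay", "delay", "staffing", "staffing", "equipment",
     "delay", "staffing", "delay", "delay", "delay"]
print(round(cohens_kappa(a, b), 2))
```

Here the coders agree on 8 of 10 segments (p_o = 0.8) against a chance agreement of 0.44, giving kappa ≈ 0.64 – a reminder that the score summarises overlap, not whether either coding is analytically sound.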

Not being quantitative research

Being qualitative rather than quantitative research should not be used as an assessment criterion if it is applied irrespective of the research problem at hand. Similarly, qualitative research should not be required to be combined with quantitative research per se – unless mixed methods research is judged as inherently better than single-method research. In that case, the same criterion should be applied to quantitative studies without a qualitative component.

The main take-away points of this paper are summarised in Table 1. We aimed to show that, if conducted well, qualitative research can answer specific research questions that cannot be adequately answered using (only) quantitative designs. Seeing qualitative and quantitative methods as equal will help us become more aware and critical of the “fit” between the research problem and our chosen methods: I can conduct an RCT to determine the reasons for transportation delays of acute stroke patients – but should I? It also provides us with a greater range of tools to tackle a greater range of research problems more appropriately and successfully, filling in the blind spots on one half of the methodological spectrum to better address the whole complexity of neurological research and practice.

Table 1: Take-away points

Acknowledgements

Authors’ contributions

LB drafted the manuscript; WW and CG revised the manuscript; all authors approved the final versions.

Funding

No external funding.

Availability of data and materials

Competing interests

The authors declare no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Child Care and Early Education Research Connections

Assessing research quality.

This page presents information and tools to help evaluate the quality of a research study, as well as information on the ethics of research.

The quality of social science and policy research can vary considerably. It is important that consumers of research keep this in mind when reading the findings from a research study or when considering whether or not to use data from a research study for secondary analysis.


Key Questions to Ask

This section outlines key questions to ask in assessing the quality of research.

Research Assessment Tools

This section provides resources related to quantitative and qualitative assessment tools.

Ethics of Research

This section provides an overview of three basic ethical principles.


Systematic Review Toolbox

Quality Assessment


Critical Appraisal Questions

  • Is the study question relevant?
  • Does the study add anything new?
  • What type of research question is being asked?
  • Was the study design appropriate for the research question?
  • Did the study methods address the most important potential sources of bias?
  • Was the study performed according to the original protocol?
  • Does the study test a stated hypothesis?
  • Were the statistical analyses performed correctly?
  • Do the data justify the conclusions?
  • Are there any conflicts of interest?

The University of Sydney Library, Systematic Reviews: Assessment Tools and Critical Appraisal

Taylor, P., Hussain, J. A., & Gadoud, A. (2013). How to appraise a systematic review. British Journal of Hospital Medicine, 74(6), 331-334. doi: 10.12968/hmed.2013.74.6.331

Young, J. M., & Solomon, M. J. (2009). How to critically appraise an article. Nature Clinical Practice Gastroenterology and Hepatology, 6(2), 82-91. doi: 10.1038/ncpgasthep1331

Assessing the quality of the evidence contained within a systematic review is as important as analyzing the data within it. Results from a poorly conducted study can be skewed by biases in the research methodology and should be interpreted with caution. Such studies should either be flagged as such in the systematic review or excluded outright. Selecting an appropriate tool to help analyze the strength of evidence and the biases embedded within each paper is also essential. If using a systematic review manuscript development tool (e.g., RevMan), a checklist may be built into the software. Other software (e.g., Rayyan) may help with screening search results and discarding irrelevant studies. The following tools/checklists may help with study assessment and critical appraisal.

  • Assessing the Methodological Quality of Systematic Reviews (AMSTAR 2) is widely used to critically appraise systematic reviews.
  • Centre for Evidence-Based Medicine (CEBM) maintains a collection of critical appraisal tools for studies of all types, with examples of usage.
  • The Cochrane risk-of-bias tool (RoB 2) is the recommended tool for assessing quality and risk of bias in randomized clinical trials in Cochrane systematic reviews.
  • Critical Appraisal Skills Programme (CASP) has 25 years of experience and expertise in critical appraisal and offers appraisal checklists for a wide range of study types.
  • Joanna Briggs Institute (JBI) provides robust checklists for the appraisal and assessment of most types of studies.
  • National Academies of Sciences, Engineering, and Medicine, Health and Medicine Division provides standards for assessing bias in the primary studies that make up systematic reviews of therapeutic or medical interventions.
  • Newcastle-Ottawa Scale (NOS) is used for non-randomized studies, particularly those of cohort and case-control design.
  • Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool assesses diagnostic accuracy studies across four domains: patient selection, index test, reference standard, and flow and timing.
  • Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) is a reporting checklist often applied to cohort, case-control and cross-sectional studies.
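Whichever tool is used, the judgments it produces are typically recorded per study and per domain and then summarised across the review. The sketch below is a loose illustration only: the study names are invented, the domain labels paraphrase rather than quote the RoB 2 wording, and the “worst domain wins” rule is a simplification of the tool’s actual overall-judgment algorithm.

```python
# Invented per-domain judgments for three hypothetical included studies.
judgments = {
    "Smith 2019": {"randomisation": "low", "deviations": "some concerns",
                   "missing data": "low", "measurement": "low",
                   "selective reporting": "low"},
    "Lee 2020": {"randomisation": "high", "deviations": "low",
                 "missing data": "low", "measurement": "some concerns",
                 "selective reporting": "low"},
    "Rao 2021": {"randomisation": "low", "deviations": "low",
                 "missing data": "low", "measurement": "low",
                 "selective reporting": "low"},
}

def overall(domains):
    """Simplified overall judgment: the worst domain level wins.
    (The real RoB 2 algorithm is more nuanced than this.)"""
    levels = set(domains.values())
    if "high" in levels:
        return "high"
    if "some concerns" in levels:
        return "some concerns"
    return "low"

for study, domains in judgments.items():
    print(f"{study}: {overall(domains)}")
```

Keeping the per-domain judgments, rather than only the overall rating, makes it easy to produce the domain-level summary tables and traffic-light plots commonly reported in reviews.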

Requesting Research Consultation

The Health Sciences Library provides consultation services for University of Hawaiʻi-affiliated students, staff, and faculty. The John A. Burns School of Medicine Health Sciences Library does not have the staffing to conduct reviews for, or assist, researchers unaffiliated with the University of Hawaiʻi; please use the publicly available guides and support pages that address research databases and tools.

Before Requesting Assistance

Before requesting systematic review assistance from the librarians, please review the relevant guides and the various pages of the Systematic Review Toolbox; most inquiries we receive are already answered there. Support for research software issues is limited to help with basic installation and setup. Please contact the software developer directly if further assistance is needed.

  • Last Updated: Sep 20, 2023 9:14 AM
  • URL: https://hslib.jabsom.hawaii.edu/systematicreview

Health Sciences Library, John A. Burns School of Medicine, University of Hawai‘i at Mānoa, 651 Ilalo Street, MEB 101, Honolulu, HI 96813 - Phone: 808-692-0810, Fax: 808-692-1244


How to read a paper: Assessing the methodological quality of published papers

Trisha Greenhalgh (p.greenhalgh{at}ucl.ac.uk), senior lecturer, Unit for Evidence-Based Practice and Policy, Department of Primary Care and Population Sciences, University College London Medical School/Royal Free Hospital School of Medicine, Whittington Hospital, London N19 5NF

Introduction

Before changing your practice in the light of a published research paper, you should decide whether the methods used were valid. This article considers five essential questions that should form the basis of your decision.

Question 1: Was the study original?

Only a tiny proportion of medical research breaks entirely new ground, and an equally tiny proportion repeats exactly the steps of previous workers. The vast majority of research studies will tell us, at best, that a particular hypothesis is slightly more or less likely to be correct than it was before we added our piece to the wider jigsaw. Hence, it may be perfectly valid to do a study which is, on the face of it, “unoriginal.” Indeed, the whole science of meta-analysis depends on the literature containing more than one study that has addressed a question in much the same way.

The practical question to ask, then, about a new piece of research is not “Has anyone ever done a similar study?” but “Does this new research add to the literature in any way?” For example:

Is this study bigger, continued for longer, or otherwise more substantial than the previous one(s)?

Is the methodology of this study any more rigorous (in particular, does it address any specific methodological criticisms of previous studies)?

Will the numerical results of this study add significantly to a meta-analysis of previous studies?

Is the population that was studied different in any way (has the study looked at different ages, sex, or ethnic groups than previous studies)?

Is the clinical issue addressed of sufficient importance, and is there sufficient doubt in the minds of the public or key decision makers, to make new evidence “politically” desirable even when it is not strictly scientifically necessary?

Question 2: Whom is the study about?

Before assuming that the results of a paper are applicable to your own practice, ask yourself the following questions:

How were the subjects recruited? If you wanted to do a questionnaire survey of the views of users of the hospital casualty department, you could recruit respondents by advertising in the local newspaper. However, this method would be a good example of recruitment bias since the sample you obtain would be skewed in favour of users who were highly motivated and liked to read newspapers. You would, of course, be better to issue a questionnaire to every user (or to a 1 in 10 sample of users) who turned up on a particular day.

Who was included in the study? Many trials in Britain and North America routinely exclude patients with coexisting illness, those who do not speak English, those taking certain other medication, and those who are illiterate. This approach may be scientifically “clean,” but since clinical trial results will be used to guide practice in relation to wider patient groups it is not necessarily logical. 1 The results of pharmacokinetic studies of new drugs in 23 year old healthy male volunteers will clearly not be applicable to the average elderly woman.

Who was excluded from the study? For example, a randomised controlled trial may be restricted to patients with moderate or severe forms of a disease such as heart failure—a policy which could lead to false conclusions about the treatment of mild heart failure. This has important practical implications when clinical trials performed on hospital outpatients are used to dictate “best practice” in primary care, where the spectrum of disease is generally milder.

Were the subjects studied in “real life” circumstances? For example, were they admitted to hospital purely for observation? Did they receive lengthy and detailed explanations of the potential benefits of the intervention? Were they given the telephone number of a key research worker? Did the company that funded the research provide new equipment which would not be available to the ordinary clinician? These factors would not necessarily invalidate the study itself, but they may cast doubt on the applicability of its findings to your own practice.

Question 3: Was the design of the study sensible?

Although the terminology of research trial design can be forbidding, much of what is grandly termed “critical appraisal” is plain common sense. I usually start with two fundamental questions:

What specific intervention or other manoeuvre was being considered, and what was it being compared with? It is tempting to take published statements at face value, but remember that authors frequently misrepresent (usually subconsciously rather than deliberately) what they actually did, and they overestimate its originality and potential importance. The examples in the box use hypothetical statements, but they are all based on similar mistakes seen in print.

What outcome was measured, and how? If you had an incurable disease for which a pharmaceutical company claimed to have produced a new wonder drug, you would measure the efficacy of the drug in terms of whether it made you live longer (and, perhaps, whether life was worth living given your condition and any side effects of the medication). You would not be too interested in the levels of some obscure enzyme in your blood which the manufacturer assured you were a reliable indicator of your chances of survival. The use of such surrogate endpoints is discussed in a later article in this series. 2

Examples of problematic descriptions in the methods section of a paper


The measurement of symptomatic effects (such as pain), functional effects (mobility), psychological effects (anxiety), or social effects (inconvenience) of an intervention is fraught with even more problems. You should always look for evidence in the paper that the outcome measure has been objectively validated—that is, that someone has confirmed that the scale of anxiety, pain, and so on used in this study measures what it purports to measure, and that changes in this outcome measure adequately reflect changes in the status of the patient. Remember that what is important in the eyes of the doctor may not be valued so highly by the patient, and vice versa. 3

Question 4: Was systematic bias avoided or minimised?

Systematic bias is defined as anything that erroneously influences the conclusions about groups and distorts comparisons. 4 Whether the design of a study is a randomised controlled trial, a non-randomised comparative trial, a cohort study, or a case-control study, the aim should be for the groups being compared to be as similar as possible except for the particular difference being examined. They should, as far as possible, receive the same explanations, have the same contacts with health professionals, and be assessed the same number of times by using the same outcome measures. Different study designs call for different steps to reduce systematic bias:

Randomised controlled trials

In a randomised controlled trial, systematic bias is (in theory) avoided by selecting a sample of participants from a particular population and allocating them randomly to the different groups. Figure 2 summarises sources of bias to check for.

Sources of bias to check for in a randomised controlled trial

Non-randomised controlled clinical trials

I recently chaired a seminar in which a multidisciplinary group of students from the medical, nursing, pharmacy, and allied professions were presenting the results of several in house research studies. All but one of the studies presented were of comparative, but non-randomised, design—that is, one group of patients (say, hospital outpatients with asthma) had received one intervention (say, an educational leaflet) while another group (say, patients attending GP surgeries with asthma) had received another intervention (say, group educational sessions). I was surprised how many of the presenters believed that their study was, or was equivalent to, a randomised controlled trial. In other words, these commendably enthusiastic and committed young researchers were blind to the most obvious bias of all: they were comparing two groups which had inherent, self selected differences even before the intervention was applied (as well as having all the additional potential sources of bias of randomised controlled trials).

As a general rule, if the paper you are looking at is a non-randomised controlled clinical trial, you must use your common sense to decide if the baseline differences between the intervention and control groups are likely to have been so great as to invalidate any differences ascribed to the effects of the intervention. This is, in fact, almost always the case. 5 6

Cohort studies

The selection of a comparable control group is one of the most difficult decisions facing the authors of an observational (cohort or case-control) study. Few, if any, cohort studies, for example, succeed in identifying two groups of subjects who are equal in age, sex mix, socioeconomic status, presence of coexisting illness, and so on, with the single difference being their exposure to the agent being studied. In practice, much of the “controlling” in cohort studies occurs at the analysis stage, where complex statistical adjustment is made for baseline differences in key variables. Unless this is done adequately, statistical tests of probability and confidence intervals will be dangerously misleading. 7

This problem is illustrated by the various cohort studies on the risks and benefits of alcohol, which have consistently found a “J shaped” relation between alcohol intake and mortality. The best outcome (in terms of premature death) lies with the cohort who are moderate drinkers. 8 The question of whether “teetotallers” (a group that includes people who have been ordered to give up alcohol on health grounds, health faddists, religious fundamentalists, and liars, as well as those who are in all other respects comparable with the group of moderate drinkers) have a genuinely increased risk of heart disease, or whether the J shape can be explained by confounding factors, has occupied epidemiologists for years. 8

Case-control studies

In case-control studies (in which the experiences of individuals with and without a particular disease are analysed retrospectively to identify putative causative events), the process that is most open to bias is not the assessment of outcome, but the diagnosis of “caseness” and the decision as to when the individual became a case.

A good example of this occurred a few years ago when a legal action was brought against the manufacturers of the whooping cough (pertussis) vaccine, which was alleged to have caused neurological damage in a number of infants. 9 In the court hearing, the judge ruled that misclassification of three brain damaged infants as “cases” rather than controls led to the overestimation of the harm attributable to whooping cough vaccine by a factor of three. 9

Question 5: Was assessment “blind”?

Even the most rigorous attempt to achieve a comparable control group will be wasted effort if the people who assess outcome (for example, those who judge whether someone is still clinically in heart failure, or who say whether an x ray is “improved” from last time) know which group the patient they are assessing was allocated to. If, for example, I knew that a patient had been randomised to an active drug to lower blood pressure rather than to a placebo, I might be more likely to recheck a reading which was surprisingly high. This is an example of performance bias, which, along with other pitfalls for the unblinded assessor, is listed in figure 2 .

Question 6: Were preliminary statistical questions dealt with?

Three important numbers can often be found in the methods section of a paper: the size of the sample; the duration of follow up; and the completeness of follow up.

Sample size

In the words of statistician Douglas Altman, a trial should be big enough to have a high chance of detecting, as statistically significant, a worthwhile effect if it exists, and thus to be reasonably sure that no benefit exists if it is not found in the trial. 10 To calculate sample size, the clinician must decide two things.

The first is what level of difference between the two groups would constitute a clinically significant effect. Note that this may not be the same as a statistically significant effect. You could administer a new drug which lowered blood pressure by around 10 mm Hg, and the effect would be a significant lowering of the chances of developing stroke (odds of less than 1 in 20 that the reduced incidence occurred by chance). 11 However, in some patients, this may correspond to a clinical reduction in risk of only 1 in 850 patient years 12 —a difference which many patients would classify as not worth the effort of taking the tablets. Secondly, the clinician must decide the mean and the standard deviation of the principal outcome variable.

Using a statistical nomogram, 10 the authors can then, before the trial begins, work out how large a sample they will need in order to have a moderate, high, or very high chance of detecting a true difference between the groups—the power of the study. It is common for studies to stipulate a power of between 80% and 90%. Underpowered studies are ubiquitous, usually because the authors found it harder than they anticipated to recruit their subjects. Such studies typically lead to a type II or β error—the erroneous conclusion that an intervention has no effect. (In contrast, the rarer type I or α error is the conclusion that a difference is significant when in fact it is due to sampling error.)
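The arithmetic that such a nomogram encodes can be sketched for the common case of comparing two proportions. This is the standard normal-approximation sample-size formula; the event rates, alpha, and power below are illustrative assumptions, not figures from the text.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect a difference
    between two proportions (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)

# Illustrative: detect a fall in event rate from 10% to 5%
# with 80% power at a two-sided alpha of 0.05.
n_per_group = sample_size_two_proportions(0.10, 0.05)
```

Raising the requested power from 80% to 90% noticeably increases the required sample, which is why underpowered trials are so common when recruitment falls short.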

Duration of follow up

Even if the sample size was adequate, a study must continue long enough for the effect of the intervention to be reflected in the outcome variable. A study looking at the effect of a new painkiller on the degree of postoperative pain may only need a follow up period of 48 hours. On the other hand, in a study of the effect of nutritional supplementation in the preschool years on final adult height, follow up should be measured in decades.

Completeness of follow up

Subjects who withdraw from (“drop out of”) research studies are less likely to have taken their tablets as directed, more likely to have missed their interim checkups, and more likely to have experienced side effects when taking medication, than those who do not withdraw. 13 The reasons why patients withdraw from clinical trials include the following:

Incorrect entry of patient into trial (that is, researcher discovers during the trial that the patient should not have been randomised in the first place because he or she did not fulfil the entry criteria);

Suspected adverse reaction to the trial drug. Note that the “adverse reaction” rate in the intervention group should always be compared with that in patients given placebo. Inert tablets bring people out in a rash surprisingly frequently;

Loss of patient motivation;

Withdrawal by clinician for clinical reasons (such as concurrent illness or pregnancy);

Loss to follow up (patient moves away, etc).


Simply ignoring everyone who has withdrawn from a clinical trial will bias the results, usually in favour of the intervention. It is, therefore, standard practice to analyse the results of comparative studies on an intention to treat basis. 14 This means that all data on patients originally allocated to the intervention arm of the study—including those who withdrew before the trial finished, those who did not take their tablets, and even those who subsequently received the control intervention for whatever reason—should be analysed along with data on the patients who followed the protocol throughout. Conversely, withdrawals from the placebo arm of the study should be analysed with those who faithfully took their placebo.
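A minimal sketch of the difference between intention to treat and a completers-only analysis, using invented participant records:

```python
# Hypothetical trial records: the arm assigned at randomisation,
# whether the subject completed the protocol, and the outcome event.
participants = [
    {"assigned": "drug",    "completed": True,  "event": False},
    {"assigned": "drug",    "completed": True,  "event": False},
    {"assigned": "drug",    "completed": False, "event": True},   # withdrew
    {"assigned": "drug",    "completed": False, "event": True},   # crossed over
    {"assigned": "placebo", "completed": True,  "event": True},
    {"assigned": "placebo", "completed": True,  "event": False},
    {"assigned": "placebo", "completed": True,  "event": True},
    {"assigned": "placebo", "completed": False, "event": False},
]

def event_rate(rows):
    return sum(r["event"] for r in rows) / len(rows)

# Intention to treat: analyse everyone by the arm they were allocated to.
itt_drug = event_rate([r for r in participants if r["assigned"] == "drug"])

# Completers only: drops the withdrawals, and in this toy data
# flatters the intervention by discarding both of its bad outcomes.
pp_drug = event_rate([r for r in participants
                      if r["assigned"] == "drug" and r["completed"]])
```

In this invented example the intention-to-treat event rate for the drug arm is 50%, while the completers-only rate is 0%, illustrating how ignoring withdrawals usually biases results in favour of the intervention.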

In a few situations, intention to treat analysis is not used. The most common exception is the efficacy analysis, which seeks to explain the effects of the intervention itself and is therefore based on the treatment actually received. But even if the subjects in an efficacy analysis are part of a randomised controlled trial, for the purposes of the analysis they effectively constitute a cohort study.

Summary points

The first essential question to ask about the methods section of a published paper is: was the study original?

The second is: whom is the study about?

Thirdly, was the design of the study sensible?

Fourthly, was systematic bias avoided or minimised?

Finally, was the study large enough, and continued for long enough, to make the results credible?

The articles in this series are excerpts from How to read a paper: the basics of evidence based medicine . The book includes chapters on searching the literature and implementing evidence based findings. It can be ordered from the BMJ Bookshop: tel 0171 383 6185/6245; fax 0171 383 6662. Price £13.95 UK members, £14.95 non-members.



Study Quality Assessment Tools

In 2013, NHLBI developed a set of tailored quality assessment tools to assist reviewers in focusing on concepts that are key to a study’s internal validity. The tools were specific to certain study designs and tested for potential flaws in study methods or implementation. Experts used the tools during the systematic evidence review process to update existing clinical guidelines, such as those on cholesterol, blood pressure, and obesity. Their findings are outlined in the following reports:

  • Assessing Cardiovascular Risk: Systematic Evidence Review from the Risk Assessment Work Group
  • Management of Blood Cholesterol in Adults: Systematic Evidence Review from the Cholesterol Expert Panel
  • Management of Blood Pressure in Adults: Systematic Evidence Review from the Blood Pressure Expert Panel
  • Managing Overweight and Obesity in Adults: Systematic Evidence Review from the Obesity Expert Panel

While these tools have not been independently published and would not be considered standardized, they may be useful to the research community. These reports describe how experts used the tools for the project. Researchers may want to use the tools for their own projects; however, they would need to determine their own parameters for making judgements. Details about the design and application of the tools are included in Appendix A of the reports.

Quality Assessment of Controlled Intervention Studies


Guidance for Assessing the Quality of Controlled Intervention Studies

The guidance document below is organized by question number from the tool for quality assessment of controlled intervention studies.

Question 1. Described as randomized

Was the study described as randomized? A study does not satisfy quality criteria as randomized simply because the authors call it randomized; however, an explicit description of randomization is a first step in determining whether a study was in fact randomized.

Questions 2 and 3. Treatment allocation–two interrelated pieces

Adequate randomization: Randomization is adequate if it occurred according to the play of chance (e.g., a computer-generated sequence in more recent studies, or a random number table in older studies).

Inadequate randomization: Randomization is inadequate if there is a preset plan (e.g., alternation, where every other subject is assigned to the treatment arm, or allocation by time or day of hospital admission or clinic visit, ZIP Code, phone number, etc.). In fact, this is not randomization at all; it is another method of assignment to groups. If assignment is not by the play of chance, then the answer to this question is no. There may be some tricky scenarios that need to be read carefully and considered for the role of chance in assignment. For example, randomization may occur at the site level, where all individuals at a particular site are assigned to receive treatment or no treatment. This scenario is used for group-randomized trials, which can be truly randomized, but often are "quasi-experimental" studies with comparison groups rather than true control groups. (Few, if any, group-randomized trials are anticipated for this evidence review.)
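As an illustration of allocation "according to the play of chance," here is a sketch of permuted-block randomization, a common way to computer-generate a sequence while keeping the arms balanced. The block size and arm labels are arbitrary choices for the example:

```python
import random

def permuted_block_sequence(n_blocks, block=("A", "A", "B", "B"), seed=None):
    """Generate an allocation sequence in permuted blocks: the count of
    each arm per block is fixed, but the order within each block is
    shuffled at random."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)  # the play of chance happens here
        sequence.extend(b)
    return sequence

# Five blocks of four -> 20 assignments, 10 per arm.
allocations = permuted_block_sequence(5, seed=42)
```

For concealment, such a sequence would be generated centrally and revealed one assignment at a time, so that no one can guess the next allocation.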

Allocation concealment: This means that one does not know in advance, or cannot guess accurately, to what group the next person eligible for randomization will be assigned. Methods include sequentially numbered opaque sealed envelopes, numbered or coded containers, central randomization by a coordinating center, computer-generated randomization that is not revealed ahead of time, etc.

Questions 4 and 5. Blinding

Blinding means that one does not know to which group–intervention or control–the participant is assigned. It is also sometimes called "masking." The reviewer assessed whether each of the following was blinded to knowledge of treatment assignment: (1) the person assessing the primary outcome(s) for the study (e.g., taking the measurements such as blood pressure, examining health records for events such as myocardial infarction, reviewing and interpreting test results such as x ray or cardiac catheterization findings); (2) the person receiving the intervention (e.g., the patient or other study participant); and (3) the person providing the intervention (e.g., the physician, nurse, pharmacist, dietitian, or behavioral interventionist).

Generally placebo-controlled medication studies are blinded to patient, provider, and outcome assessors; behavioral, lifestyle, and surgical studies are examples of studies that are frequently blinded only to the outcome assessors because blinding of the persons providing and receiving the interventions is difficult in these situations. Sometimes the individual providing the intervention is the same person performing the outcome assessment. This was noted when it occurred.

Question 6. Similarity of groups at baseline

This question relates to whether the intervention and control groups have similar baseline characteristics on average, especially characteristics that may affect the intervention or outcomes. The point of randomized trials is to create groups that are as similar as possible except for the intervention(s) being studied, so that the effects of the interventions can be compared between groups. When reviewers abstracted baseline characteristics, they noted when there was a significant difference between groups. Baseline characteristics for intervention groups are usually presented in a table in the article (often Table 1).

Groups can differ at baseline without raising red flags if: (1) the differences would not be expected to have any bearing on the interventions and outcomes; or (2) the differences are not statistically significant. When concerned about baseline difference in groups, reviewers recorded them in the comments section and considered them in their overall determination of the study quality.
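One way to quantify baseline similarity is the standardized mean difference between arms (the gap in means divided by the pooled standard deviation). This is a general statistical convention rather than part of the NHLBI tool, and the ages below are invented:

```python
from statistics import mean, stdev

def standardized_difference(group1, group2):
    """Standardized mean difference: difference in means divided by
    the pooled standard deviation of the two groups."""
    pooled_sd = ((stdev(group1) ** 2 + stdev(group2) ** 2) / 2) ** 0.5
    return (mean(group1) - mean(group2)) / pooled_sd

# Hypothetical baseline ages in two trial arms.
ages_intervention = [54, 61, 58, 65, 59, 62]
ages_control = [55, 60, 57, 64, 58, 63]

d = standardized_difference(ages_intervention, ages_control)
```

A standardized difference near zero suggests the arms are well matched on that characteristic; values above roughly 0.1 are often flagged as noteworthy imbalance worth recording in the comments.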

Questions 7 and 8. Dropout

"Dropouts" in a clinical trial are individuals for whom there are no end point measurements, often because they dropped out of the study and were lost to followup.

Generally, an acceptable overall dropout rate is considered 20 percent or less of participants who were randomized or allocated into each group. An acceptable differential dropout rate is an absolute difference between groups of no more than 15 percentage points (the dropout rate of one group minus that of the other). However, these are general benchmarks. Lower overall dropout rates are expected in shorter studies, whereas higher overall dropout rates may be acceptable for studies of longer duration. For example, a 6-month study of weight loss interventions should be expected to have nearly 100 percent followup (almost no dropouts; nearly everybody gets their weight measured regardless of whether or not they actually received the intervention), whereas a 10-year study testing the effects of intensive blood pressure lowering on heart attacks may be acceptable with a 20-25 percent dropout rate, especially if the dropout rate between groups was similar. The panels for the NHLBI systematic reviews may set different dropout caps.

Differential dropout rates, by contrast, are not flexible; the cap is 15 percentage points. If the differential dropout rate between arms is 15 percentage points or greater, there is a serious potential for bias. This constitutes a fatal flaw, resulting in a poor quality rating for the study.
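The dropout thresholds above reduce to simple arithmetic. A sketch, with hypothetical counts:

```python
def dropout_check(randomized_a, dropped_a, randomized_b, dropped_b,
                  overall_cap=0.20, differential_cap=0.15):
    """Apply the rule-of-thumb caps: overall dropout of at most 20% in
    each arm, and an absolute between-arm difference of less than
    15 percentage points."""
    rate_a = dropped_a / randomized_a
    rate_b = dropped_b / randomized_b
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "overall_ok": rate_a <= overall_cap and rate_b <= overall_cap,
        "differential_ok": abs(rate_a - rate_b) < differential_cap,
    }

# Hypothetical: 18/100 dropped out in arm A, 9/100 in arm B.
result = dropout_check(100, 18, 100, 9)
```

Here both checks pass (18% and 9% overall, a 9-point difference), whereas 25% versus 5% dropout would fail the differential cap and count as a fatal flaw.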

Question 9. Adherence

Did participants in each treatment group adhere to the protocols for assigned interventions? For example, if Group 1 was assigned to 10 mg/day of Drug A, did most of them take 10 mg/day of Drug A? Another example is a study evaluating the difference between a 30-pound weight loss and a 10-pound weight loss on specific clinical outcomes (e.g., heart attacks), but the 30-pound weight loss group did not achieve its intended weight loss target (e.g., the group only lost 14 pounds on average). A third example is whether a large percentage of participants assigned to one group "crossed over" and got the intervention provided to the other group. A final example is when one group that was assigned to receive a particular drug at a particular dose had a large percentage of participants who did not end up taking the drug or the dose as designed in the protocol.

Question 10. Avoid other interventions

Changes that occur in the study outcomes being assessed should be attributable to the interventions being compared in the study. If study participants receive interventions that are not part of the study protocol and could affect the outcomes being assessed, and they receive these interventions differentially, then there is cause for concern because these interventions could bias results. The following scenario is another example of how bias can occur. In a study comparing two different dietary interventions on serum cholesterol, one group had a significantly higher percentage of participants taking statin drugs than the other group. In this situation, it would be impossible to know if a difference in outcome was due to the dietary intervention or the drugs.

Question 11. Outcome measures assessment

What tools or methods were used to measure the outcomes in the study? Were the tools and methods accurate and reliable–for example, have they been validated, or are they objective? This is important as it indicates the confidence you can have in the reported outcomes. Perhaps even more important is ascertaining that outcomes were assessed in the same manner within and between groups. One example of differing methods is self-report of dietary salt intake versus urine testing for sodium content (a more reliable and valid assessment method). Another example is using BP measurements taken by practitioners who use their usual methods versus using BP measurements done by individuals trained in a standard approach. Such an approach may include using the same instrument each time and taking an individual's BP multiple times. In each of these cases, the answer to this assessment question would be "no" for the former scenario and "yes" for the latter. In addition, a study in which an intervention group was seen more frequently than the control group, enabling more opportunities to report clinical events, would not be considered reliable and valid.

Question 12. Power calculation

Generally, a study's methods section will address the sample size needed to detect differences in primary outcomes. The current standard is at least 80 percent power to detect a clinically relevant difference in an outcome using a two-sided alpha of 0.05. Often, however, older studies will not report on power.

Question 13. Prespecified outcomes

Investigators should prespecify outcomes reported in a study for hypothesis testing–which is the reason for conducting an RCT. Without prespecified outcomes, the study may be reporting ad hoc analyses, simply looking for differences supporting desired findings. Investigators also should prespecify subgroups being examined. Most RCTs conduct numerous post hoc analyses as a way of exploring findings and generating additional hypotheses. The intent of this question is to give more weight to reports that are not simply exploratory in nature.

Question 14. Intention-to-treat analysis

Intention-to-treat (ITT) means everybody who was randomized is analyzed according to the original group to which they are assigned. This is an extremely important concept because conducting an ITT analysis preserves the whole reason for doing a randomized trial; that is, to compare groups that differ only in the intervention being tested. When the ITT philosophy is not followed, groups being compared may no longer be the same. In this situation, the study would likely be rated poor. However, if an investigator used another type of analysis that could be viewed as valid, this would be explained in the "other" box on the quality assessment form. Some researchers use a completers analysis (an analysis of only the participants who completed the intervention and the study), which introduces significant potential for bias. Characteristics of participants who do not complete the study are unlikely to be the same as those who do. The likely impact of participants withdrawing from a study treatment must be considered carefully. ITT analysis provides a more conservative (potentially less biased) estimate of effectiveness.

General Guidance for Determining the Overall Quality Rating of Controlled Intervention Studies

The questions on the assessment tool were designed to help reviewers focus on the key concepts for evaluating a study's internal validity. They are not intended to create a list that is simply tallied up to arrive at a summary judgment of quality.

Internal validity is the extent to which the results (effects) reported in a study can truly be attributed to the intervention being evaluated and not to flaws in the design or conduct of the study–in other words, the ability for the study to make causal conclusions about the effects of the intervention being tested. Such flaws can increase the risk of bias. Critical appraisal involves considering the risk of potential for allocation bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a rating of poor quality. Low risk of bias translates to a rating of good quality.

Fatal flaws: If a study has a "fatal flaw," then risk of bias is significant, and the study is of poor quality. Examples of fatal flaws in RCTs include high dropout rates, high differential dropout rates, no ITT analysis or other unsuitable statistical analysis (e.g., completers-only analysis).

Generally, when evaluating a study, one will not see a "fatal flaw;" however, one will find some risk of bias. During training, reviewers were instructed to look for the potential for bias in studies by focusing on the concepts underlying the questions in the tool. For any box checked "no," reviewers were told to ask: "What is the potential risk of bias that may be introduced by this flaw?" That is, does this factor cause one to doubt the results that were reported in the study?

NHLBI staff provided reviewers with background reading on critical appraisal, while emphasizing that the best approach to use is to think about the questions in the tool in determining the potential for bias in a study. The staff also emphasized that each study has specific nuances; therefore, reviewers should familiarize themselves with the key concepts.

Quality Assessment of Systematic Reviews and Meta-Analyses

Guidance for Quality Assessment Tool for Systematic Reviews and Meta-Analyses

A systematic review is a study that attempts to answer a question by synthesizing the results of primary studies while using strategies to limit bias and random error. These strategies include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles included in the review. Research designs and study characteristics are appraised, data are synthesized, and results are interpreted using a predefined systematic approach that adheres to evidence-based methodological principles.

Systematic reviews can be qualitative or quantitative. A qualitative systematic review summarizes the results of the primary studies but does not combine the results statistically. A quantitative systematic review, or meta-analysis, is a type of systematic review that employs statistical techniques to combine the results of the different studies into a single pooled estimate of effect, often given as an odds ratio. The guidance document below is organized by question number from the tool for quality assessment of systematic reviews and meta-analyses.
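The pooling step of a quantitative review can be sketched as fixed-effect inverse-variance weighting of log odds ratios, a standard textbook approach; the 2x2 counts below are invented for illustration:

```python
from math import exp, log

def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance pooling of 2x2 tables.
    Each study is a tuple (events_treat, n_treat, events_ctrl, n_ctrl)."""
    num = den = 0.0
    for a, n1, c, n2 in studies:
        b, d = n1 - a, n2 - c                  # non-events in each arm
        log_or = log((a * d) / (b * c))        # log odds ratio
        var = 1 / a + 1 / b + 1 / c + 1 / d    # Woolf variance of log OR
        weight = 1 / var                       # precise studies weigh more
        num += weight * log_or
        den += weight
    return exp(num / den)

# Three invented trials of the same comparison.
or_pooled = pooled_odds_ratio([(10, 100, 20, 100),
                               (8, 80, 15, 80),
                               (30, 300, 45, 300)])
```

Each study's log odds ratio is weighted by the inverse of its variance, so larger, more precise studies dominate the single pooled estimate of effect.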

Question 1. Focused question

The review should be based on a question that is clearly stated and well-formulated. An example would be a question that uses the PICO (population, intervention, comparator, outcome) format, with all components clearly described.

Question 2. Eligibility criteria

The eligibility criteria used to determine whether studies were included or excluded should be clearly specified and predefined. It should be clear to the reader why studies were included or excluded.

Question 3. Literature search

The search strategy should employ a comprehensive, systematic approach in order to capture all of the evidence possible that pertains to the question of interest. At a minimum, a comprehensive review has the following attributes:

  • Electronic searches were conducted using multiple scientific literature databases, such as MEDLINE, EMBASE, Cochrane Central Register of Controlled Trials, PsychLit, and others as appropriate for the subject matter.
  • Manual searches of references found in articles and textbooks should supplement the electronic searches.

Additional search strategies that may be used to improve the yield include the following:

  • Studies published in other countries
  • Studies published in languages other than English
  • Identification by experts in the field of studies and articles that may have been missed
  • Search of grey literature, including technical reports and other papers from government agencies or scientific groups or committees; presentations and posters from scientific meetings, conference proceedings, unpublished manuscripts; and others. Searching the grey literature is important (whenever feasible) because sometimes only positive studies with significant findings are published in the peer-reviewed literature, which can bias the results of a review.

The literature search strategy should be described clearly enough that it could be reproduced by others with similar results.

Question 4. Dual review for determining which studies to include and exclude

Titles, abstracts, and full-text articles (when indicated) should be reviewed by two independent reviewers to determine which studies to include in or exclude from the review. Disagreements should be resolved through discussion and consensus or by a third party, and the review process, including the method for settling disagreements, should be clearly stated.

Question 5. Quality appraisal for internal validity

Each included study should be appraised for internal validity (study quality assessment) using a standardized approach for rating the quality of the individual studies. Ideally, at least two independent reviewers should appraise each study for internal validity. However, there is no single commonly accepted, standardized tool for rating the quality of studies, so reviewers looked for an assessment of the quality of each study and a clear description of the process used.

Question 6. List and describe included studies

The review should list all included studies and describe their key characteristics, presented in either narrative or table format.

Question 7. Publication bias

Publication bias is a term used when studies with positive results have a higher likelihood of being published, being published rapidly, being published in higher impact journals, being published in English, being published more than once, or being cited by others [425, 426]. Publication bias can be linked to favorable or unfavorable treatment of research findings due to investigators, editors, industry, commercial interests, or peer reviewers. To minimize the potential for publication bias, researchers can conduct a comprehensive literature search that includes the strategies discussed in Question 3.

A funnel plot–a scatter plot of component studies in a meta-analysis–is a commonly used graphical method for detecting publication bias. If there is no significant publication bias, the graph looks like a symmetrical inverted funnel.
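To make the funnel plot idea concrete, the sketch below (with invented numbers) builds the plot coordinates, log odds ratio against standard error, and performs a crude symmetry check; a real analysis would plot the points and apply a formal test such as Egger's regression.

```python
import math

# Hypothetical per-study odds ratios and standard errors of log(OR).
# Smaller studies have larger SEs and sit lower/wider in the funnel.
studies = [(0.70, 0.10), (0.75, 0.15), (0.60, 0.25), (0.50, 0.35), (0.45, 0.45)]

# Funnel-plot coordinates: effect size (log OR) on the x-axis, SE on the y-axis.
points = [(math.log(or_), se) for or_, se in studies]

# Crude asymmetry check: do the smaller (high-SE) studies report systematically
# larger effects than the bigger (low-SE) ones? A large gap hints at bias.
big = [x for x, se in points if se <= 0.2]
small = [x for x, se in points if se > 0.2]
gap = abs(sum(small) / len(small) - sum(big) / len(big))
print(f"mean log-OR gap between small and large studies: {gap:.2f}")
```

In this invented data set the small studies report noticeably stronger effects than the large ones, the kind of asymmetry a funnel plot is designed to reveal.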

The likelihood of publication bias should be assessed and clearly described.

Question 8. Heterogeneity

Heterogeneity is used to describe important differences in studies included in a meta-analysis that may make it inappropriate to combine the studies [427]. Heterogeneity can be clinical (e.g., important differences between study participants, baseline disease severity, and interventions); methodological (e.g., important differences in the design and conduct of the study); or statistical (e.g., important differences in the quantitative results or reported effects).

Researchers usually assess clinical or methodological heterogeneity qualitatively by determining whether it makes sense to combine studies. For example:

  • Should a study evaluating the effects of an intervention on CVD risk that involves elderly male smokers with hypertension be combined with a study that involves healthy adults ages 18 to 40? (Clinical Heterogeneity)
  • Should a study that uses a randomized controlled trial (RCT) design be combined with a study that uses a case-control study design? (Methodological Heterogeneity)

Statistical heterogeneity describes the degree of variation in the effect estimates from a set of studies and is assessed quantitatively. The two most common methods used to assess statistical heterogeneity are Cochran's Q test (a chi-square test) and the I² statistic.
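The Q and I² calculations are simple enough to sketch directly. Below is an illustrative Python computation on made-up log odds ratios; I² is the percentage of variability in effect estimates attributable to heterogeneity rather than chance.

```python
# Minimal sketch of Cochran's Q and the I² statistic for a set of study
# effect estimates (illustrative log odds ratios with their variances).
effects = [-0.10, -0.60, 0.05, -0.45]    # log(OR) per study (made up)
variances = [0.02, 0.03, 0.02, 0.04]     # variance of each log(OR)

weights = [1.0 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Q: weighted squared deviations of each study from the pooled estimate.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

# I²: share of variability due to heterogeneity rather than chance;
# df = number of studies - 1; I² is floored at 0 when Q < df.
df = len(effects) - 1
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
print(f"Q = {q:.2f}, I² = {i2:.1f}%")
```

With these invented estimates, I² lands above 70 percent, a level that would ordinarily prompt the investigators to explore the causes of the heterogeneity rather than simply report a pooled result.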

Reviewers examined studies to determine if an assessment for heterogeneity was conducted and clearly described. If the studies are found to be heterogeneous, the investigators should explore and explain the causes of the heterogeneity, and determine what influence, if any, the study differences had on overall study results.

Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies

Guidance for Assessing the Quality of Observational Cohort and Cross-Sectional Studies

The guidance document below is organized by question number from the tool for quality assessment of observational cohort and cross-sectional studies.

Question 1. Research question

Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper of any type. Higher quality scientific research explicitly defines a research question.

Questions 2 and 3. Study population

Did the authors describe the group of people from which the study participants were selected or recruited, using demographics, location, and time period? If you were to conduct this study again, would you know who to recruit, from where, and from what time period? Is the cohort population free of the outcomes of interest at the time they were recruited?

An example would be men over 40 years old with type 2 diabetes who began seeking medical care at Phoenix Good Samaritan Hospital between January 1, 1990 and December 31, 1994. In this example, the population is clearly described as: (1) who (men over 40 years old with type 2 diabetes); (2) where (Phoenix Good Samaritan Hospital); and (3) when (between January 1, 1990 and December 31, 1994). Another example is women ages 34 to 59 years of age in 1980 who were in the nursing profession and had no known coronary disease, stroke, cancer, hypercholesterolemia, or diabetes, and were recruited from the 11 most populous States, with contact information obtained from State nursing boards.

In cohort studies, it is crucial that the population at baseline is free of the outcome of interest. For example, the nurses' population above would be an appropriate group in which to study incident coronary disease. This information is usually found either in descriptions of population recruitment, definitions of variables, or inclusion/exclusion criteria.

You may need to look at prior papers on methods in order to make the assessment for this question. Those papers are usually in the reference list.

If fewer than 50% of eligible persons participated in the study, then there is concern that the study population does not adequately represent the target population. This increases the risk of bias.

Question 4. Groups recruited from the same population and uniform eligibility criteria

Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same underlying criteria used for all of the subjects involved? This issue is related to the description of the study population, above, and you may find the information for both of these questions in the same section of the paper.

Most cohort studies begin with the selection of the cohort; participants in this cohort are then measured or evaluated to determine their exposure status. However, some cohort studies may recruit or select exposed participants in a different time or place than unexposed participants, especially retrospective cohort studies, in which data are obtained from the past (retrospectively) but the analysis still examines exposures occurring prior to outcomes. For example, one research question could be whether diabetic men with clinical depression are at higher risk for cardiovascular disease than those without clinical depression. So, diabetic men with depression might be selected from a mental health clinic, while diabetic men without depression might be selected from an internal medicine or endocrinology clinic. This study recruits groups from different clinic populations, so this example would get a "no."

However, the women nurses described in the question above were selected based on the same inclusion/exclusion criteria, so that example would get a "yes."

Question 5. Sample size justification

Did the authors present their reasons for selecting or recruiting the number of people included or analyzed? Do they note or discuss the statistical power of the study? This question is about whether or not the study had enough participants to detect an association if one truly existed.

A paragraph in the methods section of the article may explain the sample size needed to detect a hypothesized difference in outcomes. You may also find a discussion of power in the discussion section (such as the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a 2-sided alpha of 0.05). Sometimes estimates of variance and/or estimates of effect size are given, instead of sample size calculations. In any of these cases, the answer would be "yes."

However, observational cohort studies often do not report anything about power or sample sizes because the analyses are exploratory in nature. In this case, the answer would be "no." This is not a "fatal flaw." It just may indicate that attention was not paid to whether the study was sufficiently sized to answer a prespecified question–i.e., it may have been an exploratory, hypothesis-generating study.
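To illustrate the kind of power statement quoted above, here is a hedged sketch using the normal approximation for comparing two proportions; the baseline event rate and group size below are invented for illustration.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions,
    with n participants per group (simple normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = abs(p2 - p1) / se
    return NormalDist().cdf(z - z_alpha)

# E.g., detecting a 20% relative increase over a 10% baseline event rate,
# with 4,000 participants per group and a 2-sided alpha of 0.05:
print(f"{power_two_proportions(0.10, 0.12, n=4000):.2f}")  # → 0.82
```

Even modest relative increases in an uncommon outcome demand large cohorts, which is one reason exploratory observational studies often omit power calculations altogether.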

Question 6. Exposure assessed prior to outcome measurement

This question is important because, in order to determine whether an exposure causes an outcome, the exposure must come before the outcome.

For some prospective cohort studies, the investigator enrolls the cohort and then determines the exposure status of various members of the cohort (large epidemiological studies like Framingham used this approach). However, for other cohort studies, the cohort is selected based on its exposure status, as in the example above of depressed diabetic men (the exposure being depression). Other examples include a cohort identified by its exposure to fluoridated drinking water and then compared to a cohort living in an area without fluoridated water, or a cohort of military personnel exposed to combat in the Gulf War compared to a cohort of military personnel not deployed in a combat zone.

With either of these types of cohort studies, the cohort is followed forward in time (i.e., prospectively) to assess the outcomes that occurred in the exposed members compared to nonexposed members of the cohort. Therefore, you begin the study in the present by looking at groups that were exposed (or not) to some biological or behavioral factor, intervention, etc., and then you follow them forward in time to examine outcomes. If a cohort study is conducted properly, the answer to this question should be "yes," since the exposure status of members of the cohort was determined at the beginning of the study before the outcomes occurred.

For retrospective cohort studies, the same principle applies. The difference is that, rather than identifying a cohort in the present and following them forward in time, the investigators go back in time (i.e., retrospectively) and select a cohort based on their exposure status in the past and then follow them forward to assess the outcomes that occurred in the exposed and nonexposed cohort members. Because in retrospective cohort studies the exposure and outcomes may have already occurred (it depends on how long they follow the cohort), it is important to make sure that the exposure preceded the outcome.

Sometimes cross-sectional studies are conducted (or cross-sectional analyses of cohort-study data), where the exposures and outcomes are measured during the same timeframe. As a result, cross-sectional analyses provide weaker evidence than regular cohort studies regarding a potential causal relationship between exposures and outcomes. For cross-sectional analyses, the answer to Question 6 should be "no."

Question 7. Sufficient timeframe to see an effect

Did the study allow enough time for a sufficient number of outcomes to occur or be observed, or enough time for an exposure to have a biological effect on an outcome? In the example given above, if clinical depression has a biological effect on increasing risk for CVD, such an effect may take years. As another example, if higher dietary sodium increases BP, a short timeframe may be sufficient to assess its association with BP, but a longer timeframe would be needed to examine its association with heart attacks.

The issue of timeframe is important to enable meaningful analysis of the relationships between exposures and outcomes to be conducted. This often requires at least several years, especially when looking at health outcomes, but it depends on the research question and outcomes being examined.

Cross-sectional analyses allow no time to see an effect, since the exposures and outcomes are assessed at the same time, so those would get a "no" response.

Question 8. Different levels of the exposure of interest

If the exposure can be defined as a range (examples: drug dosage, amount of physical activity, amount of sodium consumed), were multiple categories of that exposure assessed? (for example, for drugs: not on the medication, on a low dose, medium dose, high dose; for dietary sodium, higher than average U.S. consumption, lower than recommended consumption, between the two). Sometimes discrete categories of exposure are not used, but instead exposures are measured as continuous variables (for example, mg/day of dietary sodium or BP values).

In any case, studying different levels of exposure (where possible) enables investigators to assess trends or dose-response relationships between exposures and outcomes–e.g., the higher the exposure, the greater the rate of the health outcome. The presence of trends or dose-response relationships lends credibility to the hypothesis of causality between exposure and outcome.
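A minimal sketch of checking for a dose-response pattern across ordered exposure categories (the counts are invented); a formal analysis would use a trend test such as Cochran-Armitage.

```python
# Hypothetical counts of a health outcome across increasing exposure levels.
exposure_groups = {
    "low":    {"events": 12, "n": 400},
    "medium": {"events": 30, "n": 500},
    "high":   {"events": 45, "n": 450},
}

rates = {g: d["events"] / d["n"] for g, d in exposure_groups.items()}
ordered = [rates["low"], rates["medium"], rates["high"]]

# A monotonic increase across ordered exposure levels is consistent with a
# dose-response relationship and lends credibility to a causal hypothesis.
monotonic = all(a < b for a, b in zip(ordered, ordered[1:]))
print(f"rates: {[round(r, 3) for r in ordered]}, monotonic increase: {monotonic}")
```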

For some exposures, however, this question may not be applicable (e.g., the exposure may be a dichotomous variable like living in a rural setting versus an urban setting, or vaccinated/not vaccinated with a one-time vaccine). If there are only two possible exposures (yes/no), then this question should be given an "NA," and it should not count negatively towards the quality rating.

Question 9. Exposure measures and assessment

Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable (for example, have they been validated, or are they objective)? This issue is important because it influences confidence in the reported exposures. When exposures are measured with less accuracy or validity, it is harder to see an association between exposure and outcome even if one exists. Equally important is whether the exposures were assessed in the same manner within groups and between groups; if not, bias may result.

For example, retrospective self-report of dietary salt intake is not as valid and reliable as prospectively using a standardized dietary log plus testing participants' urine for sodium content. Another example is measurement of BP, where there may be quite a difference between usual care, where clinicians measure BP however it is done in their practice setting (which can vary considerably), and use of trained BP assessors using standardized equipment (e.g., the same BP device which has been tested and calibrated) and a standardized protocol (e.g., patient is seated for 5 minutes with feet flat on the floor, BP is taken twice in each arm, and all four measurements are averaged). In each of these cases, the former would get a "no" and the latter a "yes."

Here is a final example that illustrates the point about why it is important to assess exposures consistently across all groups: If people with higher BP (exposed cohort) are seen by their providers more frequently than those without elevated BP (nonexposed group), it also increases the chances of detecting and documenting changes in health outcomes, including CVD-related events. Therefore, it may lead to the conclusion that higher BP leads to more CVD events. This may be true, but it could also be due to the fact that the subjects with higher BP were seen more often; thus, more CVD-related events were detected and documented simply because they had more encounters with the health care system. Thus, it could bias the results and lead to an erroneous conclusion.

Question 10. Repeated exposure assessment

Was the exposure for each person measured more than once during the course of the study period? Multiple measurements with the same result increase our confidence that the exposure status was correctly classified. Also, multiple measurements enable investigators to look at changes in exposure over time, for example, people who ate high dietary sodium throughout the followup period, compared to those who started out high then reduced their intake, compared to those who ate low sodium throughout. Once again, this may not be applicable in all cases. In many older studies, exposure was measured only at baseline. However, multiple exposure measurements do result in a stronger study design.

Question 11. Outcome measures

Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable–for example, have they been validated or are they objective? This issue is important because it influences confidence in the validity of study results. Also important is whether the outcomes were assessed in the same manner within groups and between groups.

An example of an outcome measure that is objective, accurate, and reliable is death–the outcome measured with more accuracy than any other. But even with a measure as objective as death, there can be differences in the accuracy and reliability of how death was assessed by the investigators. Did they base it on an autopsy report, death certificate, death registry, or report from a family member? Another example is a study of whether dietary fat intake is related to blood cholesterol level (cholesterol level being the outcome), and the cholesterol level is measured from fasting blood samples that are all sent to the same laboratory. These examples would get a "yes." An example of a "no" would be self-report by subjects that they had a heart attack, or self-report of how much they weigh (if body weight is the outcome of interest).

Similar to the example in Question 9, results may be biased if one group (e.g., people with high BP) is seen more frequently than another group (people with normal BP) because more frequent encounters with the health care system increases the chances of outcomes being detected and documented.

Question 12. Blinding of outcome assessors

Blinding means that outcome assessors did not know whether the participant was exposed or unexposed. It is also sometimes called "masking." The objective is to look for evidence in the article that the person(s) assessing the outcome(s) for the study (for example, examining medical records to determine the outcomes that occurred in the exposed and comparison groups) is masked to the exposure status of the participant. Sometimes the person measuring the exposure is the same person conducting the outcome assessment. In this case, the outcome assessor would most likely not be blinded to exposure status because they also took measurements of exposures. If so, make a note of that in the comments section.

As you assess this criterion, think about whether it is likely that the person(s) doing the outcome assessment would know (or be able to figure out) the exposure status of the study participants. If the answer is no, then blinding is adequate. An example of adequate blinding of the outcome assessors is to create a separate committee, whose members were not involved in the care of the patient and had no information about the study participants' exposure status. The committee would then be provided with copies of participants' medical records, which had been stripped of any potential exposure information or personally identifiable information. The committee would then review the records for prespecified outcomes according to the study protocol. If blinding was not possible, which is sometimes the case, mark "NA" and explain the potential for bias.

Question 13. Followup rate

Higher overall followup rates are always better than lower followup rates, even though higher rates are expected in shorter studies, whereas lower overall followup rates are often seen in studies of longer duration. Usually, an acceptable overall followup rate is considered 80 percent or more of participants whose exposures were measured at baseline. However, this is just a general guideline. For example, a 6-month cohort study examining the relationship between dietary sodium intake and BP level may have over 90 percent followup, but a 20-year cohort study examining effects of sodium intake on stroke may have only a 65 percent followup rate.
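Applying the ~80 percent guideline is simple arithmetic; here is a small sketch with invented counts.

```python
# Sketch: applying the ~80% followup guideline from the tool.
baseline_n = 1250          # participants with baseline exposure measurements
completed_followup = 1050  # participants with outcome data at end of study

rate = completed_followup / baseline_n
print(f"followup rate: {rate:.0%} ({'meets' if rate >= 0.80 else 'below'} the ~80% guideline)")
```

Remember that the 80 percent figure is only a rule of thumb; as the text notes, an acceptable rate depends heavily on study duration.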

Question 14. Statistical analyses

Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Logistic regression or other regression methods are often used to account for the influence of variables not of interest.

This is a key issue in cohort studies, because statistical analyses need to control for potential confounders, in contrast to an RCT, where the randomization process controls for potential confounders. All key factors that may be associated both with the exposure of interest and the outcome–that are not of interest to the research question–should be controlled for in the analyses.

For example, in a study of the relationship between cardiorespiratory fitness and CVD events (heart attacks and strokes), the study should control for age, BP, blood cholesterol, and body weight, because all of these factors are associated both with low fitness and with CVD events. Well-done cohort studies control for multiple potential confounders.
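One classic alternative to regression for controlling a single categorical confounder is stratification with the Mantel-Haenszel pooled odds ratio. The sketch below uses invented counts stratified by age group.

```python
# Sketch: adjusting for one categorical confounder (here, age group) by
# stratifying and pooling with the Mantel-Haenszel odds ratio. Counts are
# made up. Each stratum is a 2x2 table:
# (exposed cases a, exposed non-cases b, unexposed cases c, unexposed non-cases d)
strata = [
    (10, 90, 5, 95),    # younger participants
    (40, 60, 25, 75),   # older participants
]

def mantel_haenszel_or(strata):
    """Pooled OR = sum(a*d/n) / sum(b*c/n) across confounder strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

print(round(mantel_haenszel_or(strata), 2))  # → 2.03
```

Stratification only handles one confounder at a time with few categories, which is why well-done cohort studies typically rely on multivariable regression instead.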

Some general guidance for determining the overall quality rating of observational cohort and cross-sectional studies

The questions on the form are designed to help you focus on the key concepts for evaluating the internal validity of a study. They are not intended to create a list that you simply tally up to arrive at a summary judgment of quality.

Internal validity for cohort studies is the extent to which the results reported in the study can truly be attributed to the exposure being evaluated and not to flaws in the design or conduct of the study–in other words, the ability of the study to draw associative conclusions about the effects of the exposures being studied on outcomes. Any such flaws can increase the risk of bias.

Critical appraisal involves considering the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a rating of poor quality; low risk of bias translates to a rating of good quality. (Thus, the greater the risk of bias, the lower the quality rating of the study.)

In addition, the more attention in the study design to issues that can help determine whether there is a causal relationship between the exposure and outcome, the higher quality the study. These include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, sufficient timeframe to see an effect, and appropriate control for confounding–all concepts reflected in the tool.

Generally, when you evaluate a study, you will not see a "fatal flaw," but you will find some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, you should ask yourself about the potential for bias in the study you are critically appraising. For any box where you check "no" you should ask, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, does this factor cause you to doubt the results that are reported in the study or doubt the ability of the study to accurately assess an association between exposure and outcome?

The best approach is to think about the questions in the tool and how each one tells you something about the potential for bias in a study. The more you familiarize yourself with the key concepts, the more comfortable you will be with critical appraisal. Examples of studies rated good, fair, and poor are useful, but each study must be assessed on its own based on the details that are reported and consideration of the concepts for minimizing bias.

Quality Assessment of Case-Control Studies

Guidance for Assessing the Quality of Case-Control Studies

The guidance document below is organized by question number from the tool for quality assessment of case-control studies.

Question 1. Research question

Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper of any type. High quality scientific research explicitly defines a research question.

Question 2. Study population

Did the authors describe the group of individuals from which the cases and controls were selected or recruited, using demographics, location, and time period? If the investigators conducted this study again, would they know exactly who to recruit, from where, and from what time period?

Investigators identify case-control study populations by location, time period, and inclusion criteria for cases (individuals with the disease, condition, or problem) and controls (individuals without the disease, condition, or problem). For example, the population for a study of lung cancer and chemical exposure would be all incident cases of lung cancer diagnosed in patients ages 35 to 79, from January 1, 2003 to December 31, 2008, living in Texas during that entire time period, as well as controls without lung cancer recruited from the same population during the same time period. The population is clearly described as: (1) who (men and women ages 35 to 79 with (cases) and without (controls) incident lung cancer); (2) where (living in Texas); and (3) when (between January 1, 2003 and December 31, 2008).

Other studies may use disease registries or data from cohort studies to identify cases. In these cases, the populations are individuals who live in the area covered by the disease registry or included in a cohort study (i.e., nested case-control or case-cohort). For example, a study of the relationship between vitamin D intake and myocardial infarction might use patients identified via the GRACE registry, a database of heart attack patients.

NHLBI staff encouraged reviewers to examine prior papers on methods (listed in the reference list) to make this assessment, if necessary.

Question 3. Target population and case representation

In order for a study to truly address the research question, the target population–the population from which the study population is drawn and to which study results are believed to apply–should be carefully defined. Some authors may compare characteristics of the study cases to characteristics of cases in the target population, either in text or in a table. When study cases are shown to be representative of cases in the appropriate target population, it increases the likelihood that the study was well-designed per the research question.

However, because these statistics are frequently difficult or impossible to measure, publications should not be penalized if case representation is not shown. For most papers, the response to Question 3 will be "NR." The two subquestions are combined because the answer to the second subquestion, case representation, determines the response to this item; however, it cannot be determined without considering the response to the first subquestion. For example, if the answer to the first subquestion is "yes" and the second is "CD," then the response for item 3 is "CD."

Question 4. Sample size justification

Did the authors discuss their reasons for selecting or recruiting the number of individuals included? Did they discuss the statistical power of the study and provide a sample size calculation to ensure that the study is adequately powered to detect an association (if one exists)? This question does not refer to a description of the manner in which different groups were included or excluded using the inclusion/exclusion criteria (e.g., "Final study size was 1,378 participants after exclusion of 461 patients with missing data" is not considered a sample size justification for the purposes of this question).

An article's methods section usually contains information on the sample size, the size needed to detect differences in exposures, and statistical power.

Question 5. Groups recruited from the same population

To determine whether cases and controls were recruited from the same population, one can ask hypothetically, "If a control was to develop the outcome of interest (the condition that was used to select cases), would that person have been eligible to become a case?" Case-control studies begin with the selection of the cases (those with the outcome of interest, e.g., lung cancer) and controls (those in whom the outcome is absent). Cases and controls are then evaluated and categorized by their exposure status. For the lung cancer example, cases and controls were recruited from hospitals in a given region. One may reasonably assume that controls in the catchment area for the hospitals, or those already in the hospitals for a different reason, would attend those hospitals if they became a case; therefore, the controls are drawn from the same population as the cases. If the controls were recruited or selected from a different region (e.g., a State other than Texas) or time period (e.g., 1991-2000), then the cases and controls were recruited from different populations, and the answer to this question would be "no."

The following example further explores selection of controls. In a study, eligible cases were men and women, ages 18 to 39, who were diagnosed with atherosclerosis at hospitals in Perth, Australia, between July 1, 2000 and December 31, 2007. Appropriate controls for these cases might be sampled using voter registration information for men and women ages 18 to 39, living in Perth (population-based controls); they also could be sampled from patients without atherosclerosis at the same hospitals (hospital-based controls). As long as the controls are individuals who would have been eligible to be included in the study as cases (if they had been diagnosed with atherosclerosis), then the controls were selected appropriately from the same source population as cases.

In a prospective case-control study, investigators may enroll individuals as cases at the time they are found to have the outcome of interest; the number of cases usually increases as time progresses. At this same time, they may recruit or select controls from the population without the outcome of interest. One way to identify or recruit cases is through a surveillance system. In turn, investigators can select controls from the population covered by that system. This is an example of population-based controls. Investigators also may identify and select cases from a cohort study population and identify controls from outcome-free individuals in the same cohort study. This is known as a nested case-control study.

Question 6. Inclusion and exclusion criteria prespecified and applied uniformly

Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same underlying criteria used for all of the groups involved? To answer this question, reviewers determined if the investigators developed I/E criteria prior to recruitment or selection of the study population and if they used the same underlying criteria for all groups. The investigators should have used the same selection criteria, except for study participants who had the disease or condition, which would be different for cases and controls by definition. Therefore, the investigators use the same age (or age range), gender, race, and other characteristics to select cases and controls. Information on this topic is usually found in a paper's section on the description of the study population.

Question 7. Case and control definitions

For this question, reviewers looked for descriptions of the validity of case and control definitions and processes or tools used to identify study participants as such. Was a specific description of "case" and "control" provided? Is there a discussion of the validity of the case and control definitions and the processes or tools used to identify study participants as such? They determined if the tools or methods were accurate, reliable, and objective. For example, cases might be identified as "adult patients admitted to a VA hospital from January 1, 2000 to December 31, 2009, with an ICD-9 discharge diagnosis code of acute myocardial infarction and at least one of the two confirmatory findings in their medical records: at least 2 mm of ST elevation changes in two or more ECG leads and an elevated troponin level." Investigators might also use ICD-9 or CPT codes to identify patients. All cases should be identified using the same methods. Unless the distinction between cases and controls is accurate and reliable, investigators cannot use study results to draw valid conclusions.

Question 8. Random selection of study participants

If a case-control study did not use 100 percent of eligible cases and/or controls (e.g., not all disease-free participants were included as controls), did the authors indicate that random sampling was used to select controls? When it is possible to identify the source population fairly explicitly (e.g., in a nested case-control study, or in a registry-based study), then random sampling of controls is preferred. When investigators used consecutive sampling, which is frequently done for cases in prospective studies, then study participants are not considered randomly selected. In this case, the reviewers would answer "no" to Question 8. However, this would not be considered a fatal flaw.

If investigators included all eligible cases and controls as study participants, then reviewers marked "NA" in the tool. If 100 percent of cases were included (e.g., NA for cases) but only 50 percent of eligible controls, then the response would be "yes" if the controls were randomly selected, and "no" if they were not. If this cannot be determined, the appropriate response is "CD."

Question 9. Concurrent controls

A concurrent control is a control selected at the time another person became a case, usually on the same day. This means that one or more controls are recruited or selected from the population without the outcome of interest at the time a case is diagnosed. Investigators can use this method in both prospective and retrospective case-control studies. For example, in a retrospective study of adenocarcinoma of the colon using data from hospital records, if hospital records indicate that Person A was diagnosed with adenocarcinoma of the colon on June 22, 2002, then investigators would select one or more controls from the population of patients without adenocarcinoma of the colon on that same day. The investigators could have also conducted this study using patient records from a cohort study, in which case it would be a nested case-control study.

Investigators can use concurrent controls in the presence or absence of matching and vice versa. A study that uses matching does not necessarily mean that concurrent controls were used.
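To make the selection logic concrete, here is a minimal, hypothetical sketch of risk-set (concurrent) control sampling; the function name, data layout, and dates are invented for illustration and are not part of the NHLBI tool.

```python
import random

def concurrent_controls(cases, cohort, k=1, seed=1):
    """For each case, randomly sample k controls who were still
    outcome-free on the case's diagnosis date (risk-set sampling).

    cases:  list of (person_id, diagnosis_date) tuples
    cohort: dict mapping person_id -> diagnosis_date (None = never a case)
    """
    rng = random.Random(seed)
    matched = {}
    for pid, date in cases:
        # eligible controls: everyone not yet diagnosed on this case's date
        eligible = [q for q, d in cohort.items()
                    if q != pid and (d is None or d > date)]
        matched[pid] = rng.sample(eligible, k)
    return matched

# hypothetical cohort: dates as day numbers, None = outcome-free throughout
cohort = {"A": 10, "B": None, "C": 25, "D": None, "E": 40}
controls = concurrent_controls([("A", 10), ("C", 25)], cohort)
```

Note that person C can serve as a control for case A even though C later becomes a case: a concurrent control only needs to be outcome-free at the time the case is diagnosed.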

Question 10. Exposure assessed prior to outcome measurement

Investigators first determine case or control status (based on presence or absence of outcome of interest), and then assess exposure history of the case or control; therefore, reviewers ascertained that the exposure preceded the outcome. For example, if the investigators used tissue samples to determine exposure, did they collect them from patients prior to their diagnosis? If hospital records were used, did investigators verify that the date a patient was exposed (e.g., received medication for atherosclerosis) occurred prior to the date they became a case (e.g., was diagnosed with type 2 diabetes)? For an association between an exposure and an outcome to be considered causal, the exposure must have occurred prior to the outcome.

Question 11. Exposure measures and assessment

Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable–for example, have they been validated or are they objective? This is important, as it influences confidence in the reported exposures. Equally important is whether the exposures were assessed in the same manner within groups and between groups. This question pertains to bias resulting from exposure misclassification (i.e., exposure ascertainment).

For example, a retrospective self-report of dietary salt intake is not as valid and reliable as prospectively using a standardized dietary log plus testing participants' urine for sodium content because participants' retrospective recall of dietary salt intake may be inaccurate and result in misclassification of exposure status. Similarly, BP results from practices that use an established protocol for measuring BP would be considered more valid and reliable than results from practices that did not use standard protocols. A protocol may include using trained BP assessors, standardized equipment (e.g., the same BP device which has been tested and calibrated), and a standardized procedure (e.g., patient is seated for 5 minutes with feet flat on the floor, BP is taken twice in each arm, and all four measurements are averaged).

Question 12. Blinding of exposure assessors

Blinding or masking means that the exposure assessors did not know whether participants were cases or controls. To answer this question, reviewers examined articles for evidence that the person(s) assessing exposure was masked to the case or control status of the research participants. An exposure assessor, for example, may examine medical records to determine participants' exposure histories. Sometimes the person determining case or control status is the same person conducting the exposure assessment. In this case, the exposure assessor would most likely not be blinded. A reviewer would note such a finding in the comments section of the assessment tool.

One way to ensure good blinding of exposure assessment is to have a separate committee, whose members have no information about the study participants' status as cases or controls, review research participants' records. To help answer the question above, reviewers determined whether it was likely that the exposure assessor knew a study participant's case or control status. If it was unlikely, then the reviewers marked "yes" to Question 12. Exposure assessors who used medical records should not have been directly involved in the study participants' care, since they probably would have known about their patients' conditions. If the medical records contained information on the patient's condition that identified him/her as a case (which is likely), that information would have had to be removed before the exposure assessors reviewed the records.

If blinding was not possible, which sometimes happens, the reviewers marked "NA" in the assessment tool and explained the potential for bias.

Question 13. Statistical analysis

Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Investigators often use logistic regression or other regression methods to account for the influence of variables not of interest.

This is a key issue in case-control studies; statistical analyses need to control for potential confounders, in contrast to RCTs in which the randomization process controls for potential confounders. In the analysis, investigators need to control for all key factors that may be associated with both the exposure of interest and the outcome and are not of interest to the research question.

A study of the relationship between smoking and CVD events illustrates this point. Such a study needs to control for age, gender, and body weight; all are associated with smoking and CVD events. Well-done case-control studies control for multiple potential confounders.

Matching is a technique used to improve study efficiency and control for known confounders. For example, in the study of smoking and CVD events, an investigator might identify cases that have had a heart attack or stroke and then select controls of similar age, gender, and body weight to the cases. For case-control studies, it is important that if matching was performed during the selection or recruitment process, the variables used as matching criteria (e.g., age, gender, race) should be controlled for in the analysis.
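As a concrete, hypothetical illustration of controlling for a confounder, the stdlib sketch below computes a Mantel-Haenszel odds ratio across age strata, one simple alternative to the regression adjustment described above; all counts are invented.

```python
def mantel_haenszel_or(strata):
    """Confounder-adjusted odds ratio pooled across strata.
    Each stratum is a 2x2 table (a, b, c, d):
    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# invented smoking/CVD counts stratified by age (younger, older)
strata = [(10, 90, 5, 95), (40, 60, 20, 80)]
adjusted = mantel_haenszel_or(strata)

# crude OR from the collapsed table, ignoring age
a, b, c, d = (sum(x) for x in zip(*strata))
crude = a * d / (b * c)
print(round(crude, 2), round(adjusted, 2))  # → 2.33 2.52
```

The age-adjusted odds ratio differs from the crude one because age is associated with both the exposure and the outcome in these invented data, which is exactly the situation the text describes.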

General Guidance for Determining the Overall Quality Rating of Case-Control Studies

NHLBI designed the questions in the assessment tool to help reviewers focus on the key concepts for evaluating a study's internal validity, not to use as a list from which to add up items to judge a study's quality.

Internal validity for case-control studies is the extent to which the associations between disease and exposure reported in the study can truly be attributed to the exposure being evaluated rather than to flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the exposures on outcomes? Any such flaws can increase the risk of bias.

In critically appraising a study, the following factors need to be considered: the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a poor quality rating; low risk of bias translates to a good quality rating. Again, the greater the risk of bias, the lower the quality rating of the study.

In addition, the more attention in the study design to issues that can help determine whether there is a causal relationship between the outcome and the exposure, the higher the quality of the study. These include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, sufficient timeframe to see an effect, and appropriate control for confounding–all concepts reflected in the tool.

If a study has a "fatal flaw," then risk of bias is significant; therefore, the study is deemed to be of poor quality. An example of a fatal flaw in case-control studies is a lack of a consistent standard process used to identify cases and controls.

Generally, when reviewers evaluated a study, they did not see a "fatal flaw," but instead found some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers examined the potential for bias in the study. For any box checked "no," reviewers asked, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, did this factor lead to doubt about the results reported in the study or the ability of the study to accurately assess an association between exposure and outcome?

By examining questions in the assessment tool, reviewers were best able to assess the potential for bias in a study. Specific rules were not useful, as each study had specific nuances. In addition, being familiar with the key concepts helped reviewers assess the studies. Examples of studies rated good, fair, and poor were useful, yet each study had to be assessed on its own.

Quality Assessment Tool for Before-After (Pre-Post) Studies With No Control Group - Study Quality Assessment Tools

Guidance for Assessing the Quality of Before-After (Pre-Post) Studies With No Control Group

Question 1. Study question

Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper of any type.

Question 2. Eligibility criteria and study population

Did the authors describe the eligibility criteria applied to the individuals from whom the study participants were selected or recruited? In other words, if the investigators were to conduct this study again, would they know whom to recruit, from where, and from what time period?

Here is a sample description of a study population: men over age 40 with type 2 diabetes, who began seeking medical care at Phoenix Good Samaritan Hospital, between January 1, 2005 and December 31, 2007. The population is clearly described as: (1) who (men over age 40 with type 2 diabetes); (2) where (Phoenix Good Samaritan Hospital); and (3) when (between January 1, 2005 and December 31, 2007). Another sample description is women who were in the nursing profession, who were ages 34 to 59 in 1995, had no known CHD, stroke, cancer, hypercholesterolemia, or diabetes, and were recruited from the 11 most populous States, with contact information obtained from State nursing boards.

To assess this question, reviewers examined prior papers on study methods (listed in reference list) when necessary.

Question 3. Study participants representative of clinical populations of interest

The participants in the study should be generally representative of the population in which the intervention will be broadly applied. Studies on small demographic subgroups may raise concerns about how the intervention will affect broader populations of interest. For example, interventions that focus on very young or very old individuals may affect middle-aged adults differently. Similarly, researchers may not be able to extrapolate study results from patients with severe chronic diseases to healthy populations.

Question 4. All eligible participants enrolled

To further explore this question, reviewers may need to ask: Did the investigators develop the I/E criteria prior to recruiting or selecting study participants? Were the same underlying I/E criteria used for all research participants? Were all subjects who met the I/E criteria enrolled in the study?

Question 5. Sample size

Did the authors present their reasons for selecting or recruiting the number of individuals included or analyzed? Did they note or discuss the statistical power of the study? This question addresses whether there was a sufficient sample size to detect an association, if one did exist.

An article's methods section may provide information on the sample size needed to detect a hypothesized difference in outcomes and a discussion on statistical power (such as, the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a 2-sided alpha of 0.05). Sometimes estimates of variance and/or estimates of effect size are given, instead of sample size calculations. In any case, if the reviewers determined that the power was sufficient to detect the effects of interest, then they would answer "yes" to Question 5.
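As a rough illustration of the kind of calculation behind such power statements, here is a stdlib sketch of the standard normal-approximation sample-size formula for comparing two proportions; the example numbers are invented, and in practice reviewers rely on the study's own reported power analysis rather than recomputing it.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.85):
    """Approximate sample size per group to detect a difference
    between two proportions with a two-sided test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    pbar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * pbar * (1 - pbar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# e.g., 85 percent power to detect a rise from 10% to 12%
# (a 20 percent relative increase) at a 2-sided alpha of 0.05
print(n_per_group(0.10, 0.12))
```

Small relative differences in event rates require surprisingly large samples, which is why reviewers check whether the enrolled sample was large enough to detect the effect of interest.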

Question 6. Intervention clearly described

Another pertinent question regarding interventions is: Was the intervention clearly defined in detail in the study? Did the authors indicate that the intervention was consistently applied to the subjects? Did the research participants have a high level of adherence to the requirements of the intervention? For example, if the investigators assigned a group to 10 mg/day of Drug A, did most participants in this group take the specific dosage of Drug A? Or did a large percentage of participants end up not taking the specific dose of Drug A indicated in the study protocol?

Reviewers ascertained that changes in study outcomes could be attributed to study interventions. If participants received interventions that were not part of the study protocol and could affect the outcomes being assessed, the results could be biased.

Question 7. Outcome measures clearly described, valid, and reliable

Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable–for example, have they been validated or are they objective? This question is important because the answer influences confidence in the validity of study results.

An example of an outcome measure that is objective, accurate, and reliable is death–the outcome measured with more accuracy than any other. But even with a measure as objective as death, differences can exist in the accuracy and reliability of how investigators assessed death. For example, did they base it on an autopsy report, death certificate, death registry, or report from a family member? Another example of a valid study is one whose objective is to determine if dietary fat intake affects blood cholesterol level (cholesterol level being the outcome) and in which the cholesterol level is measured from fasting blood samples that are all sent to the same laboratory. These examples would get a "yes."

An example of a "no" would be self-report by subjects that they had a heart attack, or self-report of how much they weigh (if body weight is the outcome of interest).

Question 8. Blinding of outcome assessors

Blinding or masking means that the outcome assessors did not know whether the participants received the intervention or were exposed to the factor under study. To answer the question above, the reviewers examined articles for evidence that the person(s) assessing the outcome(s) was masked to the participants' intervention or exposure status. An outcome assessor, for example, may examine medical records to determine the outcomes that occurred in the exposed and comparison groups. Sometimes the person applying the intervention or measuring the exposure is the same person conducting the outcome assessment. In this case, the outcome assessor would not likely be blinded to the intervention or exposure status. A reviewer would note such a finding in the comments section of the assessment tool.

In assessing this criterion, the reviewers determined whether it was likely that the person(s) conducting the outcome assessment knew the exposure status of the study participants. If not, then blinding was adequate. An example of adequate blinding of the outcome assessors is to create a separate committee whose members were not involved in the care of the patient and had no information about the study participants' exposure status. Using a study protocol, committee members would review copies of participants' medical records, which would be stripped of any potential exposure information or personally identifiable information, for prespecified outcomes.

Question 9. Followup rate

Higher overall followup rates are always preferable to lower followup rates, although higher rates are expected in shorter studies, and lower overall followup rates are often seen in longer studies. Usually an acceptable overall followup rate is considered 80 percent or more of participants whose interventions or exposures were measured at baseline. However, this is a general guideline.

In accounting for those lost to followup in the analysis, investigators may have imputed values of the outcome for those lost to followup or used other methods. For example, they may carry forward the baseline value or the last observed value of the outcome measure and use these as imputed values for the final outcome measure for research participants lost to followup.

Question 10. Statistical analysis

Were formal statistical tests used to assess the significance of the changes in the outcome measures between the before and after time periods? The reported study results should present values for statistical tests, such as p values, to document the statistical significance (or lack thereof) for the changes in the outcome measures found in the study.
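As a toy illustration of one formal test for pre-post change, here is a stdlib sketch of an exact two-sided sign test; the blood-pressure readings are invented, and published before-after studies would typically report paired t-tests or similar analyses.

```python
from math import comb

def sign_test_p(before, after):
    """Exact two-sided sign test for paired before-after data
    (ties are dropped, per the usual convention)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    # two-sided exact binomial p-value under H0: P(increase) = 0.5
    p = sum(comb(n, i) for i in range(tail + 1)) / 2 ** (n - 1)
    return min(p, 1.0)

# invented systolic BP readings before and after an intervention:
# 9 of 10 participants decreased, 1 increased
before = [150, 142, 138, 160, 155, 147, 151, 149, 158, 144]
after  = [141, 135, 132, 151, 149, 150, 144, 140, 150, 137]
print(round(sign_test_p(before, after), 4))  # → 0.0215
```

A p value below 0.05 here documents a statistically significant change, which is exactly the kind of reported result this question asks reviewers to look for.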

Question 11. Multiple outcome measures

Were the outcome measures for each person measured more than once during the course of the before and after study periods? Multiple measurements with the same result increase confidence that the outcomes were accurately measured.

Question 12. Group-level interventions and individual-level outcome efforts

Group-level interventions are usually not relevant for clinical interventions such as bariatric surgery, in which the interventions are applied at the individual patient level. In those cases, the questions were coded as "NA" in the assessment tool.

General Guidance for Determining the Overall Quality Rating of Before-After Studies

The questions in the quality assessment tool were designed to help reviewers focus on the key concepts for evaluating the internal validity of a study. They are not intended to create a list from which to add up items to judge a study's quality.

Internal validity is the extent to which the outcome results reported in the study can truly be attributed to the intervention or exposure being evaluated, and not to biases, measurement errors, or other confounding factors that may result from flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the interventions or exposures on outcomes?

Critical appraisal of a study involves considering the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed throughout the questions above. High risk of bias translates to a rating of poor quality; low risk of bias translates to a rating of good quality. Again, the greater the risk of bias, the lower the quality rating of the study.

In addition, the more attention in the study design to issues that can help determine if there is a causal relationship between the exposure and outcome, the higher the quality of the study. These issues include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, and sufficient timeframe to see an effect.

Generally, when reviewers evaluate a study, they will not see a "fatal flaw," but instead will find some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers should ask themselves about the potential for bias in the study they are critically appraising. For any box checked "no" reviewers should ask, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, does this factor lead to doubt about the results reported in the study or doubt about the ability of the study to accurately assess an association between the intervention or exposure and the outcome?

The best approach is to think about the questions in the assessment tool and how each one reveals something about the potential for bias in a study. Specific rules are not useful, as each study has specific nuances. In addition, being familiar with the key concepts will help reviewers be more comfortable with critical appraisal. Examples of studies rated good, fair, and poor are useful, but each study must be assessed on its own.

Quality Assessment Tool for Case Series Studies - Study Quality Assessment Tools

Background: Development and Use - Study Quality Assessment Tools

Learn more about the development and use of Study Quality Assessment Tools.

Last updated: July 2021

Evaluating the quality of scientific research papers in entrepreneurship

  • Published: 15 October 2021
  • Volume 56, pages 3013–3027 (2022)


  • Yoganandan G. (ORCID: orcid.org/0000-0002-3000-9183)
  • Vasan M. (ORCID: orcid.org/0000-0003-4600-4683)


The study aims to evaluate the quality of research papers published in the domain of entrepreneurship in India, covering 100 research papers. A standardized measurement tool developed by earlier researchers was used to evaluate research quality, and the data compiled with the tool were analyzed in SPSS. Statistical methods including descriptive statistics, Friedman's test, factor analysis, two-sample t-tests, and ANOVA were applied. The findings indicate that the quality of research papers published in the field of entrepreneurship is not up to quality standards. The quality of multiple-author papers is better than that of single-author papers, and papers by foreign authors are comparatively better than those by Indian authors. Further, papers published by combinations of foreign and Indian authors are of substantially good quality, and papers in foreign journals are of higher quality than those in Indian journals. The standard of papers using a qualitative approach was comparatively better than that of papers using a quantitative approach. The authors developed a Conceptual Model of Process and Product of Research (YOVA model), which shows that the whole research process yields six levels of research products. The study recommends that researchers pursue international collaborations to improve publication quality, and that funding agencies, higher learning institutions, and research institutions focus on enhancing research infrastructure. The study examined research articles that novice researchers in India retrieved through Google searches using entrepreneurship-related keywords; such an unfocused search approach is itself a major impediment to quality research.


Abdel Moneim Aly: Quality of medical journals with special reference to the Eastern Mediterranean health journal. Saudi Med. J. 25 (1), 18–20 (2004)

Google Scholar  

Adams, J.: The fourth age of research. Nature 497 , 557–560 (2013)

Article   Google Scholar  

Akkerman, S., Admiraal, W., Brekelmans, M., Oost, H.: Auditing quality of research in social sciences. Qual. Quant. 42 (2), 257–274 (2006)

Aphinyanaphongs, Y., Tsamardinos, I., Statnikov, A., Hardin, D., Aliferis, C.F.: Text categorization models for high-quality article retrieval in internal medicine. J. Am. Med. Inform. Assoc. 12 (2), 207–217 (2005)

Bonaccorsi, A., Cicero, T., Ferrara, A., Malgarini. M., (2015). Journal Ratings as Predictors of Articles Quality in Arts, Humanities and Social Sciences: An Analysis based on the Italian Research Evaluation Exercise. F1000Research , 4, 196.

Bornmann, L., Schier, H., Marx, W., Daniel, H.-D.: Does the h index for assessing single publications really work? a case study on papers published in chemistry”. Scientometrics 89 , 835–843 (2011)

Bornmann, L., Schier, H., Marx, W., Daniel, H.-D.: What factors determine citation counts of publications in chemistry besides their quality? J. Informet. 6 , 11–18 (2012)

Bornmanna, L., Daniel, H.-D.: The Citation speed index: a useful bibliometric indicator to add to the h index. J. Informet. 4 , 444–446 (2010)

Chawla, R., Gupta, M., Anand, N.: Ranking of Indian journals with popular international journals: a comparative study. Int. J. Sci. Technol. Res. 9 (3), 72–81 (2020)

Cho, M.K., Bero, L.A.: Instruments for assessing the quality of drug studies published in the medical literature. J. Am. Med. Assoc. 272 , 101–104 (1994)

Djulbegovic, B., Lacevic, M., Cantor, A.: The uncertainty principle and industry-sponsored research. Lancet 356 , 635–638 (2000)

Durieux, V., Gevenois, P.A.: Bibliometric indicators: quality measurements of scientific publication. Radiology 255 (2), 342–351 (2010)

Ghazavi, R., Taheri, B., Ashrafi-rizi, H.: Article quality indicator: proposing a new indicator for measuring article quality in scopus and web of science. J. Sci. Res. 8 (1), 9–17 (2019)

Gupta, P., Kaur, G., Sharma, B., Shah, D., Choudhury, P.: What is submitted and what gets accepted in Indian Pediatrics: analysis of submissions, review process, decision making, and criteria for rejection. Indian Pediatr. 43 (6), 479–489 (2006)

Gupta, R., Tiwari, R., Mueen Ammed, K.K.: Dengue research in India: a scientometric analysis of publications - 2003–12. Int. J. Med. Public Health 4 (1), 1–9 (2014)

Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E.: Multivariate Data Analysis, 7th edn. Pearson Education, U.K (2010)

Hair, H.H., Jr, Black, W.C., Babin, B.J., Anderson, R.E., & Tatham, R.L., (2009). Multivariate Data Analysis . London, U.K: Pearson Education.

Hudson, J.: Trends in multi-authored papers in economics. J. Econom. Perspect. 10 (3), 153–158 (1996)

Inayatullah, S., Fitzgerald, J.: Gene Discourses: politics, culture, law, and futures. Technol. Forecast. Soc. Chang. 52 (2–3), 161–183 (1996)

James, J.B., Vincent, C.B.: Perception of journal quality. Account. Rev. 49 (2), 360–362 (1974)

Lawani, S.M.: Some bibliometric correlates of quality in scientific research. Scientometrics 9 , 13–25 (1986)

Li, E.Y., Liao, C.H., Yen, H.R.: Co-authorship networks and research impact: a social capital perspective. Res. Policy 42 (9), 1515–1530 (2013)

Low, W.Y., Ng, K.H., Kabir, M.A., et al.: Trend and impact of international collaboration in clinical medicine papers published in Malaysia. Scientometrics 98 , 1521–1533 (2014)

Mårtensson, P., Fors, U., Wallin, S.-B., Zander, U., Nilsson, G.H.: Evaluating research: a multidisciplinary approach to assessing research practice and quality. Res. Policy 45 , 593–603 (2015)

Mattsson, P., Laget, P., Nilsson, A., & Sundberg, C.J. (2008). Intra-EU Vs. Extra-EU Scientific Co-publication Patterns in EU. Scientometrics , 75(3), 555–574.

Mays, N., Pope, C.: Quality in qualitative health research. In: Mays, N., Pope, C. (eds.) Qualitative Research in Health Care, 2nd edn. BMJ Books, London (2000)

Cuellar, M.J., Truex, D.P., Takeda, H.: Can we trust journal ranking to assess article quality? In: Proceedings of the Twenty-Second Americas Conference on Information Systems, San Diego, pp. 1–11 (2016)

Nanjundaiah, Dinesh, K.S.: Ranking and comparison of journals published in India with special reference to humanities and social sciences. Int. J. Inf. Dissem. Technol. 6 (4), 251–257 (2016)

National Center for the Dissemination of Disability Research (NCDDR): A technical brief: what are the standards for quality research? Southwest Educational Development Laboratory (2002)

Popay, J., Rogers, A., Williams, G.: Rationale and standards for the systematic review of qualitative literature in health services research. Qual. Health Res. 8, 341–351 (1998)

Ram, S.: A quantitative assessment of “chikungunya” research publications, 2004–2013. Trop. J. Med. Res. 19 , 52–60 (2016)

RAND Corporation: Standards for High-Quality Research and Analysis. Technical report (2010)

Salimi, N.: Quality assessment of scientific outputs using the BWM. Scientometrics 112 , 195–213 (2017)

Seglen, P.O.: Why the impact factor of journals should not be used for evaluating research. BMJ 314 (7079), 497 (1997)

Sendhilkumar, S., Elakkiya, E., Mahalakshmi, G.S.: Citation semantic based approaches to identify article quality. Comp Sci Inform Technol 1 , 411–420 (2013)

Mahmood, S.T.: Factors affecting the quality of research in education: student's perceptions. J. Educ. Pract. 2 (11–12), 34–39 (2011)

Shewfelt, R.L.: What is quality? Postharvest Biol. Technol. 15 , 197–200 (1999)

Stokols, D., Harvey, R., Gress, J., Fuqua, J., Phillips, K.: In vivo studies of transdisciplinary scientific collaboration: lessons learned and implications for active living research. Am. J. Prev. Med. 28 (2), 202–213 (2005)

Szklo, M.: Quality of scientific articles. Rev. Saúde Pública 40 (N Esp), 30–35 (2006)

Timmer, A., Sutherland, L.R., Hilsden, R.J.: Development and evaluation of a quality score for abstracts. BMC Med Res Methodol 3 , 2 (2003)

Victor, B.G., Hodge, D.R., Perron, B.E., Vaughn, M.G., Salas-Wright, C.P.: The rise of co-authorship in social work scholarship: a longitudinal study of collaboration and article quality, 1989–2013. Br. J. Soc. Work. 47 (8), 1–16 (2016)

Walter, J., Lechner, C., Kellermanns, F.W.: Knowledge transfer between and within alliance partners: private versus collective benefits of social capital. J. Bus. Res. 60 (7), 698–710 (2007)

Wasko, M.M., Faraj, S.: Why should I share? examining social capital and knowledge contribution in electronic networks of practice. MIS Q. 29 (1), 35–57 (2005)

Welch, C., Piekkari, R.: How should we (not) judge the ‘quality’ of qualitative research? a re-assessment of current evaluative criteria in international business. J. World Bus. 52 (5), 714–725 (2017)

Yokuş, G., Akdaği, H.: Identifying quality criteria of a scientific research adopted by academic community: a case study. Int. J. Eurasia Soc. Sci. 10 (36), 516–527 (2019)

Zuber-Skerritt, O., Fletcher, M.: The quality of an action research thesis in the social sciences. Qual. Assur. Educ. 15 (4), 413–436 (2007)

Funding

No funding was received.

Author information

Authors and Affiliations

Department of Management Studies, Periyar University, Salem, Tamilnadu, 636 011, India

Yoganandan G.

Department of Commerce, National College (Autonomous), Tiruchirappalli, Tamilnadu, 620 001, India

Vasan M.

Corresponding author

Correspondence to Vasan M.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Yoganandan, G., Vasan, M. Evaluating the quality of scientific research papers in entrepreneurship. Qual Quant 56 , 3013–3027 (2022). https://doi.org/10.1007/s11135-021-01254-z

Accepted : 30 September 2021

Published : 15 October 2021

Issue Date : October 2022

DOI : https://doi.org/10.1007/s11135-021-01254-z

Keywords

  • Entrepreneurship
  • Quality criteria
  • Research collaboration
  • Research quality

Electrical Engineering and Systems Science > Image and Video Processing

Title: RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content

Abstract: With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
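The evaluation protocol described here (per-video quality representations combined via linear regression, validated with five-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the features and subjective scores below are synthetic stand-ins for the learned RMT representations and the VDPVE ground truth, and Spearman rank correlation (SROCC) stands in for the paper's full set of correlation metrics.

```python
# Sketch of the five-fold evaluation loop: fit a linear regressor from
# quality representations to subjective scores on the training folds,
# then measure rank correlation (SROCC) on the held-out fold.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_videos, feat_dim = 200, 128

# Synthetic stand-ins: random "representations" and scores that depend
# linearly on them plus noise (NOT real VDPVE data).
features = rng.normal(size=(n_videos, feat_dim))
true_w = rng.normal(size=feat_dim)
mos = features @ true_w + rng.normal(scale=0.1, size=n_videos)

srocc_per_fold = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
    reg = LinearRegression().fit(features[train_idx], mos[train_idx])
    pred = reg.predict(features[test_idx])
    srocc_per_fold.append(spearmanr(pred, mos[test_idx])[0])

print(f"mean SROCC over 5 folds: {np.mean(srocc_per_fold):.3f}")
```

With real learned representations the regressor and splits would follow the database's content groupings, but the fold-wise fit/predict/correlate structure is the same.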

Decision support tools of sustainability assessment for urban stormwater management – A review of their roles in governance and management

  • Sun, Zhengdong
  • Deak Sjöman, Johanna
  • Blecken, Godecke-Tobias
  • Randrup, Thomas B.

Urban areas face growing sustainability challenges arising from stormwater issues, necessitating the evolution of stormwater management concepts and practices. This transformation not only entails the adoption of a multifunctional, holistic, and sustainable approach but also involves the integration of water quality and quantity considerations with governance and management aspects. A means to do so is via decision support tools. However, whilst existing studies use these tools by employing sustainability assessment principles or as indicators to plan blue-green infrastructures and strategies, uncertainties remain regarding how decision support tools encompass governance and management dimensions. The aim of this review is to provide much-needed clarity on this aspect. To do so, a systematic review of decision support tools used in sustainability assessment within the stormwater management context is conducted, focusing on their ability to include governance and management. The findings encompass governance aspects, such as the actors, discourses, rules, and resources considered, and explore how these relate to long-term management. The results reveal the recognized potential of decision support tools in facilitating governance and management for sustainable stormwater management; however, future research and effort need to be allocated to: (i) exploring practical challenges in integrating all sustainability assessment pillars with consistent criteria into decision support tools, to determine the optimal use of all criteria in fostering open and informed stormwater governance and management; (ii) understanding how to engage diverse stormwater actors with future decision support tools, to secure ownership and relevance; and (iii) using retrospective (ex-post) sustainability assessments to provide more tangible knowledge and to support long-term management.

  • Decision support tools
  • Sustainability assessment
  • Stormwater management
  • Stormwater control measures
  • Governance and management
  • Policy arrangement model
