
Criterion validity (concurrent and predictive validity)

There are many occasions when you might choose to use a well-established measurement procedure (e.g., a 42-item survey on depression) as the basis to create a new measurement procedure (e.g., a 19-item survey on depression) to measure the construct you are interested in (e.g., depression, sleep quality, employee commitment, etc.). This well-established measurement procedure acts as the criterion against which the criterion validity of the new measurement procedure is assessed. Like other forms of validity, criterion validity is not something that your measurement procedure has (or doesn't have). You will have to build a case for the criterion validity of your measurement procedure; ultimately, it is something that will be developed over time as more studies validate your measurement procedure. To assess criterion validity in your dissertation, you can choose between establishing the concurrent validity or predictive validity of your measurement procedure. These are two different types of criterion validity, each of which has a specific purpose. In this article, we first explain what criterion validity is and when it should be used, before discussing concurrent validity and predictive validity, providing examples of both.

What is criterion validity?

  • What is concurrent validity?
  • What is predictive validity?

Criterion validity reflects the use of a criterion (a well-established measurement procedure) to create a new measurement procedure to measure the construct you are interested in. The criterion and the new measurement procedure must be theoretically related. The measurement procedures could draw on a range of research methods (e.g., surveys, structured observation, or structured interviews), provided that they yield quantitative data.

There are a number of reasons why we would be interested in using a criterion to create a new measurement procedure: (a) to create a shorter version of a well-established measurement procedure; (b) to account for a new context, location, and/or culture where well-established measurement procedures need to be modified or completely altered; and (c) to help test the theoretical relatedness and construct validity of a well-established measurement procedure. Each of these is discussed in turn:

To create a shorter version of a well-established measurement procedure

You want to create a shorter version of an existing measurement procedure, which is unlikely to be achieved through simply removing one or two measures within the measurement procedure (e.g., one or two questions in a survey), possibly because this would affect the content validity of the measurement procedure [see the article: Content validity]. Therefore, you have to create new measures for the new measurement procedure. However, to ensure that you have built a valid new measurement procedure, you need to compare it against one that is already well-established; that is, one that already has demonstrated construct validity and reliability [see the articles: Construct validity and Reliability in research]. This well-established measurement procedure is the criterion against which you are comparing the new measurement procedure (i.e., why we call it criterion validity).

Indeed, sometimes a well-established measurement procedure (e.g., a survey), which has strong construct validity and reliability, is either too long or longer than would be preferable. A measurement procedure can be too long because it consists of too many measures (e.g., a 100-question survey measuring depression). Whilst the measurement procedure may be content valid (i.e., consist of measures that are appropriate/relevant and representative of the construct being measured), it is of limited practical use if response rates are particularly low because participants are simply unwilling to take the time to complete such a long measurement procedure. We also stated that a measurement procedure may be longer than would be preferable, which mirrors the argument above; that is, it is easier to get respondents to complete a measurement procedure when it is shorter. The one difference is that an existing measurement procedure may not be too long (e.g., having only 40 questions in a survey), but would encourage much greater response rates if shorter (e.g., having just 18 questions). This may be a time consideration, but it is also an issue when you are combining multiple measurement procedures, each of which has a large number of measures (e.g., combining two surveys, each with around 40 questions).

To account for a new context, location and/or culture where well-established measurement procedures may need to be modified or completely altered

You are conducting a study in a new context, location, and/or culture, where well-established measurement procedures no longer reflect that new context, location, and/or culture. As a result, you take a well-established measurement procedure, which acts as your criterion, and create a new measurement procedure that is more appropriate for the new context, location, and/or culture. The new measurement procedure may only need to be modified, or it may need to be completely altered. Either way, it must be based on a criterion (i.e., a well-established measurement procedure).

For example, you may want to translate a well-established, construct-valid measurement procedure from one language (e.g., English) into another (e.g., Chinese or French). Since the English and French languages have some base commonalities, the content of the measurement procedure (i.e., the measures within it) may only have to be modified. However, such content may have to be completely altered when a translation into Chinese is made because of the fundamental differences between the two languages (i.e., Chinese and English). Nonetheless, the new (translated) measurement procedure should have criterion validity; that is, it must reflect the well-established measurement procedure upon which it was based.

In research, it is common to want to take measurement procedures that have been well-established in one context, location, and/or culture, and apply them to another context, location, and/or culture. Criterion validity is a good test of whether such newly applied measurement procedures reflect the criterion upon which they are based. When they do not, this suggests that new measurement procedures need to be created that are more appropriate for the new context, location, and/or culture of interest.

To help test the theoretical relatedness and construct validity of a well-established measurement procedure

It could also be argued that testing for criterion validity is an additional way of testing the construct validity of an existing, well-established measurement procedure. After all, if the new measurement procedure, which uses different measures (i.e., has different content) but measures the same construct, is strongly related to the well-established measurement procedure, this gives us more confidence in the construct validity of the existing measurement procedure.

Criterion validity is demonstrated when there is a strong relationship between the scores from the two measurement procedures, which is typically examined using a correlation. For example, participants who score high on the new measurement procedure would also score high on the well-established one, and the same would hold for medium and low scores.
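As a rough illustration of this kind of check (a sketch only, using made-up scores and assuming Python with NumPy and SciPy is available), the correlation between the two sets of scores can be computed directly:

```python
# Sketch: correlating scores from a hypothetical new 19-item depression survey
# with scores from a hypothetical well-established 42-item survey (the criterion),
# completed by the same ten people. All values are invented for illustration.
import numpy as np
from scipy import stats

new_scores         = np.array([12, 18, 7, 25, 30, 15, 22, 9, 27, 14])    # new, shorter survey
established_scores = np.array([28, 41, 17, 55, 66, 35, 49, 21, 60, 33])  # criterion survey

r, p = stats.pearsonr(new_scores, established_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A strong positive correlation is evidence for the criterion validity
# of the new measurement procedure.
```

The same calculation underpins both concurrent validity (criterion scores collected at the same time) and predictive validity (criterion scores collected later); only the timing of the criterion changes.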

In practice, rather than assessing criterion validity per se, you choose between establishing the concurrent validity or the predictive validity of your measurement procedure. There are two things to think about when choosing between concurrent and predictive validity:

The purpose of the study and measurement procedure

You need to consider the purpose of the study and measurement procedure; that is, whether you are trying (a) to use an existing, well-established measurement procedure in order to create a new measurement procedure (i.e., concurrent validity), or (b) to examine whether a measurement procedure can be used to make predictions (i.e., predictive validity).

Study constraints

Testing for concurrent validity is likely to be simpler, more cost-effective, and less time intensive than predictive validity. This sometimes encourages researchers to first test for the concurrent validity of a new measurement procedure, before later testing it for predictive validity when more resources and time are available.


The 4 Types of Validity | Types, Definitions & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

In quantitative research , you have to consider the reliability and validity of your methods and measurements.

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

  • Construct validity : Does the test measure the concept that it’s intended to measure?
  • Content validity : Is the test fully representative of what it aims to measure?
  • Face validity : Does the content of the test appear to be suitable to its aims?
  • Criterion validity : Do the results accurately measure the concrete outcome they are designed to measure?

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity , which deal with the experimental design and the generalisability of results.


Construct validity

Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organisations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.


Content validity

Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey, or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened.

Face validity

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a ‘gold standard’ measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.


5.2 Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (interrater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r . Figure 5.3 “Test-Retest Correlation Between Two Sets of Scores of Several College Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart” shows the correlation between two sets of scores of several college students on the Rosenberg Self-Esteem Scale, given two times a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

Figure 5.3 Test-Retest Correlation Between Two Sets of Scores of Several College Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart
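A minimal sketch of this procedure in Python (NumPy and Matplotlib assumed; the scores are invented for illustration and are not the data behind Figure 5.3):

```python
# Sketch of a test-retest reliability check: the same ten people complete the
# Rosenberg Self-Esteem Scale twice, one week apart (hypothetical scores).
import numpy as np
import matplotlib.pyplot as plt

time1 = np.array([22, 25, 18, 30, 27, 15, 24, 29, 20, 26])
time2 = np.array([23, 24, 19, 29, 28, 16, 25, 28, 21, 27])

r = np.corrcoef(time1, time2)[0, 1]   # Pearson's r between the two administrations
print(f"test-retest r = {r:.2f}")     # +.80 or greater is generally taken as good

plt.scatter(time1, time2)             # the scatterplot described in the text
plt.xlabel("Score at time 1")
plt.ylabel("Score at time 2 (one week later)")
plt.show()
```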

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation . This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.4 “Split-Half Correlation Between Several College Students’ Scores on the Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale” shows the split-half correlation between several college students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

Figure 5.4 Split-Half Correlation Between Several College Students’ Scores on the Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale
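A minimal sketch of the split-half calculation in Python (NumPy assumed; the item responses are simulated from a single underlying trait purely for illustration):

```python
# Sketch of a split-half internal-consistency check. Rows are respondents,
# columns are the 10 items of a hypothetical self-esteem scale; responses are
# simulated so that all items reflect one underlying trait.
import numpy as np

rng = np.random.default_rng(42)
trait = rng.normal(0.0, 1.0, size=(100, 1))                   # each respondent's underlying level
items = np.clip(np.round(2.5 + trait + rng.normal(0.0, 0.7, size=(100, 10))), 1, 4)

odd_total  = items[:, 0::2].sum(axis=1)                       # items 1, 3, 5, 7, 9
even_total = items[:, 1::2].sum(axis=1)                       # items 2, 4, 6, 8, 10

split_half_r = np.corrcoef(odd_total, even_total)[0, 1]
print(f"split-half r = {split_half_r:.2f}")                   # +.80 or greater is generally considered good
```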

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
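A minimal sketch of the standard variance-based formula for α in Python (NumPy assumed; the Likert responses are made up and deliberately consistent, so α comes out high):

```python
# Sketch of Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (respondents, k) holding item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

items = np.array([
    [4, 4, 3, 4, 3],
    [2, 1, 2, 2, 1],
    [3, 3, 3, 2, 3],
    [4, 3, 4, 4, 4],
    [1, 2, 1, 1, 2],
])  # 5 respondents x 5 items, hypothetical

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # +.80 or greater usually indicates good internal consistency
```

Most statistics packages report α directly; the sketch is only meant to show the ingredients of the statistic (item variances versus total-score variance).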

Interrater Reliability

Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring college students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. If they were not, then those ratings could not be an accurate representation of participants’ social skills. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
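A minimal sketch of a κ calculation for categorical ratings (hypothetical ratings; scikit-learn's cohen_kappa_score is assumed to be available):

```python
# Sketch of inter-rater agreement for categorical judgments using Cohen's kappa.
# Two observers classify the same eight students' social skills (made-up labels).
from sklearn.metrics import cohen_kappa_score

rater_a = ["high", "low", "medium", "high",   "low", "medium", "high", "low"]
rater_b = ["high", "low", "medium", "medium", "low", "medium", "high", "low"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement; 0 = agreement expected by chance
```

Unlike raw percent agreement, κ discounts the agreement two raters would be expected to reach by chance.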

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.

Textbook presentations of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider four basic kinds: face validity, content validity, criterion validity, and discriminant validity.

Face Validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory (MMPI) measures many personality characteristics and disorders by having people decide whether each of over 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. Another example is the Implicit Association Test, which measures prejudice in a way that is nonintuitive to most people (see Note 5.31 “How Prejudiced Are You?” ).

How Prejudiced Are You?

The Implicit Association Test (IAT) is used to measure people’s attitudes toward various social groups. The IAT is a behavioral measure designed to reveal negative attitudes that people might not admit to on a self-report measure. It focuses on how quickly people are able to categorize words and images representing two contrasting groups (e.g., gay and straight) along with other positive and negative stimuli (e.g., the words “wonderful” or “nasty”). The IAT has been used in dozens of published research studies, and there is strong evidence for both its reliability and its validity (Nosek, Greenwald, & Banaji, 2006). You can learn more about the IAT—and take several of them for yourself—at the following website: https://implicit.harvard.edu/implicit .

Content Validity

Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria ) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. So the use of converging operations is one way to examine criterion validity.

Assessing criterion validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982). In a series of studies, they showed that college faculty scored higher than assembly-line workers, that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009).

Discriminant Validity

Discriminant validity is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability, criterion validity, and discriminant validity?
  • Practice: Take an Implicit Association Test and then list as many ways to assess its criterion validity as you can think of.

Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42 , 116–131.

Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2006). The Implicit Association Test at age 7: A methodological and conceptual review. In J. A. Bargh (Ed.), Social psychology and the unconscious: The automaticity of higher mental processes (pp. 265–292). London, England: Psychology Press.

Petty, R. E, Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behavior (pp. 318–329). New York, NY: Guilford Press.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Validity of Research and Measurements

Chris Nickson

  • Nov 3, 2020

In general terms, validity is “the quality of being true or correct”; it refers to the strength of results and how accurately they reflect the real world. Thus ‘validity’ can have quite different meanings depending on the context!

  • Reliability is distinct from validity, in that it refers to the consistency or repeatability of results
  • Two broad types of study validity are internal validity and external validity
  • Validity applies to an outcome or measurement, not the instrument used to obtain it, and is based on ‘validity evidence’

INTERNAL VALIDITY

  • The extent to which the design and conduct of the trial eliminate the possibility of bias, such that observed effects can be attributed to the independent variable
  • refers to the accuracy of a trial
  • a study that lacks internal validity should not be applied to any clinical setting
  • internal validity is strengthened by:
    • power calculation
    • details of study context and intervention
    • avoidance of loss to follow-up
    • standardised treatment conditions
    • control groups
    • objectivity from blinding and data handling
  • Clinical research can be internally valid despite poor external validity

EXTERNAL VALIDITY

  • The extent to which the results of a trial provide a correct basis for generalizations to other circumstances
  • Also called “generalisability” or “applicability”
  • Studies can only be applied to clinical settings that are the same as, or similar to, those used in the study
  • population validity – how well the study sample can be extrapolated to the population as a whole (based on randomised sampling)
  • ecological validity – the extent to which the study environment influences results (can the study be replicated in other contexts?)
  • internal/construct validity – verified relationships between dependent and independent variables
  • Research findings cannot have external validity without being internally valid

FACTORS THAT AFFECT EXTERNAL VALIDITY OF CLINICAL RESEARCH (Rothwell, 2006)

Setting of the trial

  • healthcare system
  • recruitment from primary, secondary or tertiary care
  • selection of participating centers
  • selection of participating clinicians

Selection of patients

  • methods of pre-randomisation diagnosis and investigation
  • eligibility criteria
  • exclusion criteria
  • placebo run-in period
  • treatment run-in period
  • “enrichment” strategies
  • ratio of randomised patients to eligible non-randomised patients in participating centers
  • proportion of patients who decline randomisation

Characteristics of randomised patients

  • baseline clinical characteristics
  • racial group
  • uniformity of underlying pathology
  • stage in the natural history of disease
  • severity of disease
  • comorbidity
  • absolute risk of a poor outcome in the control group

Differences between trial protocol and routine practice

  • trial intervention
  • timing of treatment
  • appropriateness/ relevance of control intervention
  • adequacy of nontrial treatment – both intended and actual
  • prohibition of certain non-trial treatments
  • Therapeutic or diagnostic advances since trial was performed

Outcome measures and follow up

  • clinical relevance of surrogate outcomes
  • clinical relevance, validity, and reproducibility of complex scales
  • effect of intervention on most relevant components of composite outcomes
  • identification of who measured outcome
  • use of patient outcomes
  • frequency of follow up
  • adequacy of length of follow-up

Adverse effects of treatment

  • completeness of reporting of relevant adverse effects
  • rate of discontinuation of treatment
  • selection of trial centers on the basis of skill or experience
  • exclusion of patients at risk of complications
  • exclusion of patients who experienced adverse events during a run in period
  • intensity of trial safety procedures

MEASUREMENT VALIDITY (Downing & Yudkowsky, 2009)

Validity refers to the evidence presented to support or to refute the meaning or interpretation assigned to assessment data or results. It relates to whether a test, tool, instrument or device actually measures what it intends to measure.

Traditionally, validity was viewed as a trinitarian concept based on construct, criterion, and content validity:

  • Construct validity – the degree to which the test measures what it is meant to be measuring
  • e.g. the ideal depression score would include different variants of depression and be able to distinguish depression from stress and anxiety
  • Criterion validity – comprising concurrent and predictive validity:
  • Concurrent validity – compares measurements with an outcome at the same time (e.g. a concurrent “gold standard” test result)
  • Predictive validity – compares measurements with an outcome measured at a later time (e.g. do high exam marks predict subsequent incomes?)
  • Content validity – the degree to which the content of an instrument is an adequate reflection of all the components of the construct
  • e.g. a schizophrenia score would need to include both positive and negative symptoms

According to current validity theory in psychometrics, validity is a unitary concept and thus construct validity is the only form of validity. In health professions education, for instance, validity evidence for assessments is drawn from several sources (content, response process, internal structure, relationships to other variables, and consequences), including:

  • relationship between test content and the construct of interest
  • theory; hypothesis about content
  • independent assessment of match between content sampled and domain of interest
  • solid, scientific, quantitative evidence
  • analysis of individual responses to stimuli
  • debriefing of examinees
  • process studies aimed at understanding what is measured and the soundness of intended score interpretations
  • quality assurance and quality control of assessment data
  • data internal to assessments such as: reliability or reproducibility of scores; inter-item correlations; statistical characteristics of items; statistical analysis of item option function; factor studies of dimensionality; Differential Item Functioning (DIF) studies
  • a. Convergent and discriminant evidence: relationships between similar and different measures
  • b. Test-criterion evidence: relationships between test and criterion measure(s)
  • c. Validity generalization: can the validity evidence be generalized? Evidence that the validity studies may generalize to other settings.
  • intended and unintended consequences of test use
  • differential consequences of test use
  • impact of assessment on students, instructors, schools, society
  • impact of assessments on curriculum; cost/benefit analysis with respect to tradeoff between instructional time and assessment time.
  • Note that strictly speaking we cannot comment on the validity of a test, tool, instrument, or device, only on the measurement that is obtained. This is because the same test used in a different context (different operator, different subjects, different circumstances, at a different time) may not be valid. In other words, validity evidence applies to the data generated by an instrument, not the instrument itself.
  • Validity can be equated with accuracy, and reliability with precision
  • Face validity is a term commonly used as an indicator of validity – it is essentially worthless! It means at ‘face value’, in other words, the degree to which the measure subjectively looks like what it is intended to measure.
  • The higher the stakes of measurement (e.g. test result), the higher the need for validity evidence.
  • You can never have too much validity evidence, but the minimum required varies with purpose (e.g. a high-stakes fellowship exam versus one of many progress tests)

References and Links

Journal articles and Textbooks

  • Downing SM, Yudkowsky R. (2009) Assessment in health professions education, Routledge, New York.
  • Rothwell PM. Factors that can affect the external validity of randomised controlled trials. PLoS Clin Trials. 2006 May;1(1):e9. [ pubmed ] [ article ]
  • Shankar-Hari M, Bertolini G, Brunkhorst FM, et al. Judging quality of current septic shock definitions and criteria. Critical care. 19(1):445. 2015. [ pubmed ] [ article ]


Validity and reliability in quantitative studies

  • Roberta Heale 1 ,
  • Alison Twycross 2
  • 1 School of Nursing, Laurentian University , Sudbury, Ontario , Canada
  • 2 Faculty of Health and Social Care , London South Bank University , London , UK
  • Correspondence to : Dr Roberta Heale, School of Nursing, Laurentian University, Ramsey Lake Road, Sudbury, Ontario, Canada P3E2C6; rheale{at}laurentian.ca

https://doi.org/10.1136/eb-2015-102129


Evidence-based practice includes, in part, implementation of the findings of well-conducted quality research studies. So being able to critique quantitative research is an important skill for nurses. Consideration must be given not only to the results of the study but also the rigour of the research. Rigour refers to the extent to which the researchers worked to enhance the quality of the studies. In quantitative research, this is achieved through measurement of the validity and reliability. 1


Types of validity

The first category is content validity . This category looks at whether the instrument adequately covers all the content that it should with respect to the variable. In other words, does the instrument cover the entire domain related to the variable, or construct it was designed to measure? In an undergraduate nursing course with instruction about public health, an examination with content validity would cover all the content in the course with greater emphasis on the topics that had received greater coverage or more depth. A subset of content validity is face validity , where experts are asked their opinion about whether an instrument measures the concept intended.

Construct validity refers to whether you can draw inferences about test scores related to the concept being studied. For example, if a person has a high score on a survey that measures anxiety, does this person truly have a high degree of anxiety? In another example, a test of knowledge of medications that requires dosage calculations may instead be testing maths knowledge.

There are three types of evidence that can be used to demonstrate a research instrument has construct validity:

Homogeneity—meaning that the instrument measures one construct.

Convergence—this occurs when the instrument measures concepts similar to that of other instruments. Although if there are no similar instruments available this will not be possible to do.

Theory evidence—this is evident when behaviour is similar to theoretical propositions of the construct measured in the instrument. For example, when an instrument measures anxiety, one would expect to see that participants who score high on the instrument for anxiety also demonstrate symptoms of anxiety in their day-to-day lives. 2

The final measure of validity is criterion validity . A criterion is any other instrument that measures the same variable. Correlations can be conducted to determine the extent to which the different instruments measure the same variable. Criterion validity is measured in three ways:

Convergent validity—shows that an instrument is highly correlated with instruments measuring similar variables.

Divergent validity—shows that an instrument is poorly correlated to instruments that measure different variables. In this case, for example, there should be a low correlation between an instrument that measures motivation and one that measures self-efficacy.

Predictive validity—means that the instrument should have high correlations with future criteria. 2 For example, a high self-efficacy score related to performing a task should predict the likelihood of a participant completing the task.
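A minimal sketch of the convergent and divergent correlation checks described above (made-up scores in Python with NumPy; the instruments and values are hypothetical):

```python
# Sketch: a new motivation instrument should correlate strongly with another
# motivation instrument (convergent validity) and only weakly with a
# self-efficacy instrument (divergent validity). All scores are invented.
import numpy as np

new_motivation   = np.array([10, 14, 9, 18, 20, 12, 16, 11])
other_motivation = np.array([11, 15, 8, 19, 21, 13, 15, 10])   # similar construct
self_efficacy    = np.array([18, 22, 15, 20, 17, 25, 19, 21])  # different construct

convergent_r = np.corrcoef(new_motivation, other_motivation)[0, 1]
divergent_r  = np.corrcoef(new_motivation, self_efficacy)[0, 1]
print(f"convergent r = {convergent_r:.2f}")  # expected to be high
print(f"divergent r  = {divergent_r:.2f}")   # expected to be low
```

A predictive validity check looks the same, except that the second variable is a criterion measured at a later point in time.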

Reliability

Reliability relates to the consistency of a measure. A participant completing an instrument meant to measure motivation should have approximately the same responses each time the test is completed. Although it is not possible to give an exact calculation of reliability, an estimate of reliability can be achieved through different measures. The three attributes of reliability (homogeneity, stability and equivalence) are outlined below, along with how each is tested.

Attributes of reliability

Homogeneity (internal consistency) is assessed using item-to-total correlation, split-half reliability, Kuder-Richardson coefficient and Cronbach's α. In split-half reliability, the results of a test, or instrument, are divided in half. Correlations are calculated comparing both halves. Strong correlations indicate high reliability, while weak correlations indicate the instrument may not be reliable. The Kuder-Richardson test is a more complicated version of the split-half test. In this process the average of all possible split half combinations is determined and a correlation between 0–1 is generated. This test is more accurate than the split-half test, but can only be completed on questions with two answers (eg, yes or no, 0 or 1). 3

Cronbach's α is the most commonly used test to determine the internal consistency of an instrument. In this test, the average of all correlations in every combination of split-halves is determined. Instruments with questions that have more than two responses can be used in this test. The Cronbach's α result is a number between 0 and 1. An acceptable reliability score is one that is 0.7 and higher. 1 , 3

Stability is tested using test–retest and parallel or alternate-form reliability testing. Test–retest reliability is assessed when an instrument is given to the same participants more than once under similar circumstances. A statistical comparison is made between participants' test scores for each of the times they have completed it. This provides an indication of the reliability of the instrument. Parallel-form reliability (or alternate-form reliability) is similar to test–retest reliability except that a different form of the original instrument is given to participants in subsequent tests. The domain, or concepts being tested, are the same in both versions of the instrument, but the wording of items is different. 2 For an instrument to demonstrate stability there should be a high correlation between the scores each time a participant completes the test. Generally speaking, a correlation coefficient of less than 0.3 signifies a weak correlation, 0.3–0.5 is moderate and greater than 0.5 is strong. 4

Equivalence is assessed through inter-rater reliability. This test includes a process for qualitatively determining the level of agreement between two or more observers. A good example of the process used in assessing inter-rater reliability is the scores of judges for a skating competition. The level of consistency across all judges in the scores given to skating participants is the measure of inter-rater reliability. An example in research is when researchers are asked to give a score for the relevancy of each item on an instrument. Consistency in their scores relates to the level of inter-rater reliability of the instrument.

Determining how rigorously the issues of reliability and validity have been addressed in a study is an essential component in the critique of research, as well as influencing the decision about whether to implement the study findings in nursing practice. In quantitative studies, rigour is determined through an evaluation of the validity and reliability of the tools or instruments utilised in the study. A good quality research study will provide evidence of how all these factors have been addressed. This will help you to assess the validity and reliability of the research and help you decide whether or not you should apply the findings in your area of clinical practice.

  • Lobiondo-Wood G
  • Shuttleworth M
  • Laerd Statistics. Determining the correlation coefficient. 2013. https://statistics.laerd.com/premium/pc/pearson-correlation-in-spss-8.php


Competing interests None declared.


Content Validity in Research: Definition & Examples

Charlotte Nickerson


  • Content validity is a type of measurement validity that demonstrates how well a measure covers the construct it is meant to represent.
  • It is important for researchers to establish content validity in order to ensure that their study is measuring what it intends to measure.
  • There are several ways to establish content validity, including expert opinion, focus groups , and surveys.


What Is Content Validity?

Content validity is the degree to which the elements of an assessment instrument are relevant to, and representative of, the targeted construct for a particular assessment purpose.

This encompasses aspects such as the appropriateness of the items, tasks, or questions to the specific domain being measured and whether the assessment instrument covers a broad enough range of content to enable conclusions to be drawn about the targeted construct (Rossiter, 2008).

One example of an assessment with high content validity is the Iowa Test of Basic Skills (ITBS). The ITBS is a standardized test that has been used since 1935 to assess the academic achievement of students in grades 3-8.

The test covers a wide range of academic skills, including reading, math, language arts, and social studies. The items on the test are carefully developed and reviewed by a panel of experts to ensure that they are fair and representative of the skills being tested.

As a result, the ITBS has high content validity and is widely used by schools and districts to measure student achievement.

By contrast, most driving tests have low content validity. The questions on the test are often not representative of the skills needed to drive safely. For example, many driving permit tests do not include questions about how to parallel park or how to change lanes.

Meanwhile, driving license tests often do not test drivers in non-ideal conditions, such as rain or snow. As a result, these tests do not provide an accurate measure of a person’s ability to drive safely.

The higher the content validity of an assessment, the more accurately it can measure what it is intended to measure — the target construct (Rossiter, 2008).

Why is content validity important in research?

Content validity is important in research as it provides confidence that an instrument is measuring what it is supposed to be measuring.

This is particularly relevant when developing new measures or adapting existing ones for use with different populations.

It also has implications for the interpretation of results, as findings can only be accurately applied to groups for which the content validity of the measure has been established.

Step-by-step guide: How to measure content validity?

Haynes et al. (1995) emphasized the importance of content validity and gave an overview of ways to assess it.

One of the first ways of measuring content validity was the Delphi method, developed at the RAND Corporation in the 1950s as a way of systematically generating and refining expert forecasts.

The method involves a group of experts who make predictions about the future and then reach a consensus about those predictions. Today, the Delphi method is most commonly used in medicine.

In a content validity study using the Delphi method, a panel of experts is asked to rate the items on an assessment instrument on a scale. The expert panel also has the opportunity to add comments about the items.

After all ratings have been collected, the average item rating is calculated. In the second round, the experts receive summarized results of the first round and are able to make further comments and revise their first-round answers.

This back-and-forth continues until some homogeneity criterion — similarity between the results of researchers — is achieved (Koller et al., 2017).

Lawshe (1975) and Lynn (1986) created numerical methods to assess content validity. Both of these methods require the development of a content validity index (CVI). A content validity index is a statistical measure of the degree to which an assessment instrument covers the content domain of interest.

There are two steps in calculating a content validity index:

  • Determining the number of items that should be included in the assessment instrument;
  • Determining the percentage of items that actually are included in the assessment instrument.

The first step, determining the number of items that should be included in an assessment instrument, can be done using one of two approaches: item sampling or expert consensus.

Item sampling involves selecting a sample of items from a larger set of items that cover the content domain. The number of items in the sample is then used to estimate the total number of items needed to cover the content domain.

This approach has the advantage of being quick and easy, but it can be biased if the sample of items is not representative of the larger set (Koller et al., 2017).

The second approach, expert consensus, involves asking a group of experts how many items should be included in an assessment instrument to adequately cover the content domain. This approach has the advantage of being more objective, but it can be time-consuming and expensive.

Experts are able to assign these items to dimensions of the construct that they intend to measure and assign relevance values to decide whether an item is a strong measure of the construct.
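As an illustration of the numerical methods mentioned above, the sketch below computes Lawshe's content validity ratio and Lynn's item-level content validity index in Python. The panel size, the ratings, and the 4-point relevance scale are hypothetical, and the formulas are the commonly cited ones rather than a reproduction of either author's full procedure.

def lawshe_cvr(n_essential, n_experts):
    # Lawshe (1975): CVR = (n_e - N/2) / (N/2), where n_e experts rated the item "essential".
    half = n_experts / 2
    return (n_essential - half) / half

def item_cvi(ratings, relevant_threshold=3):
    # Lynn (1986): proportion of experts rating the item 3 or 4 on a 4-point relevance scale.
    return sum(r >= relevant_threshold for r in ratings) / len(ratings)

# Hypothetical panel of six experts rating two items for relevance (1-4 scale).
item_ratings = {"item_1": [4, 4, 3, 4, 2, 4], "item_2": [2, 3, 2, 1, 3, 2]}

i_cvis = {item: item_cvi(r) for item, r in item_ratings.items()}
scale_cvi = sum(i_cvis.values()) / len(i_cvis)  # scale-level CVI as the average of item CVIs
print(i_cvis, round(scale_cvi, 2))
print(lawshe_cvr(n_essential=5, n_experts=6))  # CVR for an item five of six experts called essential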

Although various attempts to quantify the process of measuring content validity exist, there is no systematic procedure that can serve as a general guideline for the evaluation of content validity (Newman et al., 2013).

When is content validity used?

Educational assessment

In the context of educational assessment, validity is the extent to which an assessment instrument accurately measures what it is intended to measure. Validity concerns anyone who is making inferences and decisions about a learner based on data.

This can have deep implications for students’ education and future. For instance, a test that poorly measures students’ abilities can lead to placement in a future course that is unsuitable for the student and, ultimately, to the student’s failure (Obilor, 2022).

There are a number of factors that specifically affect the validity of assessments given to students, such as (Obilor, 2018):

  • Unclear Direction: If directions do not clearly indicate to the respondent how to respond to the tool’s items, the validity of the tool is reduced.
  • Vocabulary: If the vocabulary of the respondents is poor and they do not understand the items, the validity of the instrument is affected.
  • Poorly Constructed Test Items: If items are constructed in such a way that they have different meanings for different respondents, validity is affected.
  • Difficulty Level of Items: In an achievement test, too easy or too difficult test items would not discriminate among students, thereby lowering the validity of the test.
  • Influence of Extraneous Factors: Extraneous factors like the style of expression, legibility, mechanics of grammar (spelling, punctuation), handwriting, and length of the tool, amongst others, influence the validity of a tool.
  • Inappropriate Time Limit: In a speed test, if too generous a time limit is given, the results will be invalidated as a measure of speed. In a power test, an inappropriate time limit will lower the validity of the test.

Interviews

There are a few reasons why interviews may lack content validity. First, interviewers may ask different questions or place different emphases on certain topics across different candidates. This can make it difficult to compare candidates on a level playing field.

Second, interviewers may have their own personal biases that come into play when making judgments about candidates.

Finally, the interview format itself may be flawed. For example, many companies ask potential programmers to complete brain teasers — such as calculating the number of plumbers in Chicago or coding tasks that rely heavily on theoretical knowledge of data structures — even if this knowledge would be used rarely or never on the job.

Questionnaires

Questionnaires rely on the respondents’ ability to accurately recall information and report it honestly. Additionally, the way in which questions are worded can influence responses.

To increase content validity when designing a questionnaire, careful consideration must be given to the types of questions that will be asked.

Open-ended questions are typically less biased than closed-ended questions, but they can be more difficult to analyze.

It is also important to avoid leading or loaded questions that might influence respondents’ answers in a particular direction. The wording of questions should be clear and concise to avoid confusion (Koller et al., 2017).

Is content validity internal or external?

Most experts agree that content validity is primarily an internal issue. This means that the concepts and items included in a test should be based on a thorough analysis of the specific content area being measured.

The items should also be representative of the range of difficulty levels within that content area. External factors, such as the opinions of experts or the general public, can influence content validity, but they are not necessarily the primary determinant.

In some cases, such as when developing a test for licensure or certification, external stakeholders may have a strong say in what is included in the test (Koller et al., 2017).

How can content validity be improved?

There are a few ways to increase content validity. One is to create items that are more representative of the targeted construct. Another is to increase the number of items on the assessment so that it covers a greater range of content.

Finally, experts can review the items on the assessment to ensure that they are fair and representative of the skills being tested (Koller et al., 2017).

How do you test the content validity of a questionnaire?

There are a few ways to test the content validity of a questionnaire. One way is to ask experts in the field to review the questions and provide feedback on whether or not they believe the questions are relevant and cover all important topics.

Another way is to administer the questionnaire to a small group of people and then analyze the results to see if there are any patterns or themes emerging from the responses.

Finally, it is also possible to use statistical methods to test for content validity, although this approach is more complex and usually requires access to specialized software (Koller et al., 2017).

How can you tell if an instrument is content-valid?

There are a few ways to tell if an instrument is content-valid. The first is to examine two closely related forms of validity: face validity and construct validity.

Face validity is a measure of whether or not the items on the test appear to measure what they claim to measure. This is highly subjective but convenient to assess.

Another way is to look at the construct validity, which is whether or not the items on the test measure what they are supposed to measure. Finally, you can also look at the criterion-related validity, which is whether or not the items on the test predict future performance.

What is the difference between content and criterion validity?

Content validity is a measure of how well a test covers the content it is supposed to cover.

Criterion validity, meanwhile, is an index of how well a test correlates with an established standard of comparison or a criterion.

For example, if a measure of criminal behavior has criterion validity, then it should be possible to use it to predict whether an individual will be arrested in the future for a criminal violation, is currently breaking the law, or has a previous criminal record (American Psychological Association, n.d.).
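As a rough illustration of what "correlates with an established standard" means in practice, the Python sketch below computes a correlation coefficient between scores on a new measure and scores on a criterion measure; the scores are made up for the example.

import numpy as np

# Hypothetical scores: eight respondents measured with the new instrument and with the criterion.
new_measure = np.array([12, 18, 9, 22, 15, 20, 11, 17])
criterion = np.array([14, 20, 10, 25, 13, 21, 12, 19])

# Pearson correlation between the two score sets, often reported as the validity coefficient.
r = np.corrcoef(new_measure, criterion)[0, 1]
print(f"criterion validity coefficient r = {r:.2f}")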

Are content validity and construct validity the same?

Content validity is not the same as construct validity.

Content validity is a method of assessing the degree to which a measure covers the range of content that it purports to measure.

In contrast, construct validity is a method of assessing the degree to which a measure reflects the underlying construct that it purports to measure.

It is important to note that content validity and construct validity are not mutually exclusive; a measure can be valid with respect to one and invalid with respect to the other.

However, content validity is a necessary but not sufficient condition for construct validity. That is, a measure cannot be construct valid if it does not first have content validity (Koller et al., 2017).

For example, an academic achievement test in math may have content validity if it contains questions from all areas of math a student is expected to have learned before the test, but it may lack construct validity if its scores do not relate in the expected ways to measures of similar and different constructs.

How many experts are needed for content validity?

There is no definitive answer to this question as it depends on a number of factors, including the nature of the instrument being validated and the purpose of the validation exercise.

However, in general, a minimum of three experts should be used in order to ensure that the content validity of an instrument is adequately established (Koller et al., 2017).

American Psychological Association. (n.d.). Content validity. American Psychological Association Dictionary.

Haynes, S. N., Richard, D., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7(3), 238.

Koller, I., Levenson, M. R., & Glück, J. (2017). What do you think you are measuring? A mixed-methods procedure for assessing the content validity of test items and theory-based scaling. Frontiers in Psychology, 8, 126.

Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563-575.

Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research.

Newman, I., Lim, J., & Pineda, F. (2013). Content validity using a mixed methods approach: Its application and development through the use of a table of specifications methodology. Journal of Mixed Methods Research, 7(3), 243-260.

Obilor, E. I. (2018). Fundamentals of research methods and statistics in education and social sciences. Port Harcourt: SABCOS Printers & Publishers.

Obilor, E. I., & Miwari, G. U. (2022). Content validity in educational assessment.

Rossiter, J. R. (2008). Content validity of measures of abstract constructs in management and organizational research. British Journal of Management, 19(4), 380-388.


J Bras Pneumol, 44(3), May-Jun 2018

Internal and external validity: can you apply research study results to your patients?

Cecilia Maria Patino

1. Methods in Epidemiologic, Clinical, and Operations Research-MECOR-program, American Thoracic Society/Asociación Latinoamericana del Tórax, Montevideo, Uruguay.

2. Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

Juliana Carvalho Ferreira

3. Divisão de Pneumologia, Instituto do Coração, Hospital das Clínicas, Faculdade de Medicina, Universidade de São Paulo, São Paulo (SP) Brasil.

CLINICAL SCENARIO

In a multicenter study in France, investigators conducted a randomized controlled trial to test the effect of prone vs. supine positioning ventilation on mortality among patients with early, severe ARDS. They showed that prolonged prone-positioning ventilation decreased 28-day mortality [hazard ratio (HR) = 0.39; 95% CI: 0.25-0.63]. 1

STUDY VALIDITY

The validity of a research study refers to how well the results among the study participants represent true findings among similar individuals outside the study. This concept of validity applies to all types of clinical studies, including those about prevalence, associations, interventions, and diagnosis. The validity of a research study includes two domains: internal and external validity.

Internal validity is defined as the extent to which the observed results represent the truth in the population we are studying and, thus, are not due to methodological errors. In our example, if the authors can support that the study has internal validity, they can conclude that prone positioning reduces mortality among patients with severe ARDS. The internal validity of a study can be threatened by many factors, including errors in measurement or in the selection of participants in the study, and researchers should think about and avoid these errors.

Once the internal validity of the study is established, the researcher can proceed to make a judgment regarding its external validity by asking whether the study results apply to similar patients in a different setting or not (Figure 1). In the example, we would want to evaluate if the results of the clinical trial apply to ARDS patients in other ICUs. If the patients have early, severe ARDS, probably yes, but the study results may not apply to patients with mild ARDS. External validity refers to the extent to which the results of a study are generalizable to patients in our daily practice, especially for the population that the sample is thought to represent.


Lack of internal validity implies that the results of the study deviate from the truth, and, therefore, we cannot draw any conclusions; hence, if the results of a trial are not internally valid, external validity is irrelevant. 2 Lack of external validity implies that the results of the trial may not apply to patients who differ from the study population and, consequently, could lead to low adoption of the treatment tested in the trial by other clinicians.

INCREASING VALIDITY OF RESEARCH STUDIES

To increase internal validity, investigators should ensure careful study planning and adequate quality control and implementation strategies, including adequate recruitment strategies, data collection, data analysis, and sample size. External validity can be increased by using broad inclusion criteria that result in a study population that more closely resembles real-life patients and, in the case of clinical trials, by choosing interventions that are feasible to apply. 2


A scoping review to identify and organize literature trends of bias research within medical student and resident education

Brianne E. Lewis & Akshata R. Naik

BMC Medical Education, volume 23, Article number: 919 (2023)


Physician bias refers to the unconscious negative perceptions that physicians have of patients or their conditions. Medical schools and residency programs often incorporate training to reduce biases among their trainees. In order to assess trends and organize available literature, we conducted a scoping review with a goal to categorize different biases that are studied within medical student (MS), resident (Res) and mixed populations (MS and Res). We also characterized these studies based on their research goal as either documenting evidence of bias (EOB), bias intervention (BI) or both. These findings will provide data which can be used to identify gaps and inform future work across these criteria.

Online databases (PubMed, PsycINFO, Web of Science) were searched for articles published between 1980 and 2021. All references were imported into Covidence for independent screening against inclusion criteria. Conflicts were resolved by deliberation. Studies were sorted by goal (‘evidence of bias’ and/or ‘bias intervention’), by population (MS, Res, or mixed), and into descriptive categories of bias.

Of the initial 806 unique papers identified, a total of 139 articles fit the inclusion criteria for data extraction. The included studies were sorted into 11 categories of bias and showed that bias against race/ethnicity, specific diseases/conditions, and weight were the most researched topics. Of the studies included, there was a higher ratio of EOB:BI studies at the MS level, while a lower ratio of EOB:BI was found at the Res level.

Conclusions

This study will be of interest to institutions, program directors and medical educators who wish to specifically address a category of bias and identify where there is a dearth of research. This study also underscores the need to introduce bias interventions at the MS level.


Physician bias ultimately impacts patient care by eroding the physician–patient relationship [1, 2, 3, 4]. To overcome this issue, certain states require physicians to report a varying number of hours of implicit bias training as part of their recurring licensing requirement [5, 6]. Research efforts on the influence of implicit bias on clinical decision-making gained traction after the “Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care” report was published in 2003 [7]. This report sparked a conversation about the impact of bias against women, people of color, and other marginalized groups within healthcare. Bias from a healthcare provider has been shown to affect provider-patient communication and may also influence treatment decisions [8, 9]. Accordingly, opportunities are created within the medical education curriculum to evaluate biases at an earlier stage of physician training and to provide instruction to intervene on them [10, 11, 12]. We aimed to identify trends and organize the literature on bias training provided during medical school and residency programs, since the meaning of ‘bias’ is broad and encompasses several types of attitudes and predispositions [13].

Several reviews, narrative or systematic in nature, have been published in the field of bias research in medicine and healthcare [14, 15, 16]. Many of these reviews have a broad focus on implicit bias, and they often fail to define the specific patient attributes, such as age, weight, disease, or condition, against which physicians hold their biases. However, two recently published reviews categorized implicit biases into various descriptive characteristics, albeit with research goals different from those of this study [17, 18]. The study by Fitzgerald et al. reviewed literature focused on bias among physicians and nurses to highlight its role in healthcare disparities [17], while the study by Gonzalez et al. focused on bias curricular interventions across professions related to the social determinants of health, such as education, law, medicine, and social work [18]. Our research goal was to identify the various bias characteristics that are studied within medical student and/or resident populations and categorize them. Further, we were interested in whether biases were merely identified or whether they were also intervened on. To address these deficits in the field and provide clarity, we utilized a scoping review approach to categorize the literature based on a) the bias addressed and b) the study goal within medical students (MS), residents (Res), and a mixed population (MS and Res).

To date, no literature review has organized bias research by the specific categories of bias held solely by medical trainees (medical students and/or residents) and quantified intervention studies. We did not perform a quality assessment or outcome evaluation of the bias intervention strategies, as this was not the goal of this work and is consistent with standard scoping review methodology [19, 20]. By generating a comprehensive list of bias categories researched among the medical trainee population, we highlight areas of opportunity for future implicit bias research, specifically within the undergraduate and graduate medical education curriculum. We anticipate that the results from this scoping review will be useful for educators, administrators, and stakeholders seeking to implement active programs or workshops that target specific biases in pre-clinical medical education and prepare physicians-in-training for patient encounters. Additionally, behavioral scientists who seek to support clinicians and develop debiasing theories [21] and models may also find our results informative.

We conducted an exhaustive and focused scoping review, following the methodological framework for scoping reviews previously described in the literature [20, 22]. This study aligned with the four goals of a scoping review [20]. To ensure our review’s validity, we followed the first five of the six steps outlined by Arksey and O’Malley: 1) identifying the research question, 2) identifying relevant studies, 3) selecting the studies, 4) charting the data, and 5) collating, summarizing, and reporting the results [22]. We did not follow the optional sixth step of undertaking consultation with key stakeholders, as it was not needed to address our research question [23]. Furthermore, we used Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia), which aided in managing steps 2–5 presented above.

Research question, search strategy and inclusion criteria

The purpose of this study was to identify trends in bias research at the medical school and residency level. Prior to conducting our literature search, we developed our research question, detailed the inclusion criteria, and generated the search syntax with assistance from a medical librarian. The search syntax was adjusted to the requirements of each database. We searched PubMed, Web of Science, and PsycINFO using the MeSH terms shown below.

(Bias* [ti] OR prejudice*[ti] OR racism[ti] OR homophobia[ti] OR mistreatment[ti] OR sexism[ti] OR ageism[ti]) AND (prejudice [mh] OR "Bias"[Mesh:NoExp]) AND (Education, Medical [mh] OR Schools, Medical [mh] OR students, medical [mh] OR Internship and Residency [mh] OR “undergraduate medical education” OR “graduate medical education” OR “medical resident” OR “medical residents” OR “medical residency” OR “medical residencies” OR “medical schools” OR “medical school” OR “medical students” OR “medical student”) AND (curriculum [mh] OR program evaluation [mh] OR program development [mh] OR language* OR teaching OR material* OR instruction* OR train* OR program* OR curricul* OR workshop*)
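For readers who want to rerun or adapt a query like this, the following Python sketch sends an abbreviated version of the term to the public NCBI E-utilities esearch endpoint. It illustrates the general approach rather than the authors' actual retrieval workflow, and the shortened term is an assumption made for brevity.

import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Abbreviated stand-in for the full search syntax shown above.
term = '(bias*[ti] OR prejudice*[ti]) AND (students, medical[mh] OR Internship and Residency[mh])'

response = requests.get(
    ESEARCH_URL,
    params={"db": "pubmed", "term": term, "retmax": 100, "retmode": "json"},
    timeout=30,
)
response.raise_for_status()
result = response.json()["esearchresult"]
print("Total hits:", result["count"])
print("First PMIDs:", result["idlist"][:10])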

Our inclusion criteria incorporated studies that were either original research articles or review articles that synthesized new data. We excluded publications that were not peer-reviewed or supported with data, such as narrative reviews, opinion pieces, editorials, perspectives, and commentaries. We included studies from outside the U.S., since the purpose of this work was to generate a comprehensive list of biases: physicians, regardless of their country of origin, can hold biases against specific patient attributes [17], and physicians may practice in a different country than the one in which they trained [24]. Manuscripts were included if they were published in English and full texts were available. Since the goal of this scoping review was to assess trends, we accepted studies published from 1980 to 2021.

Our inclusion criteria also considered the goal and the population of the study. We defined the study goal as either documenting evidence of bias or delivering a bias intervention. Evidence of bias (EOB) had to originate from the medical trainee and concern a patient attribute. Bias intervention (BI) studies involved strategies to counter biases, such as activities, workshops, seminars, or curricular innovations. The population studied had to include medical students (MS), residents (Res), or both. We defined the study population as ‘mixed’ when it consisted of both MS and Res. Studies conducted on other healthcare professionals were included if MS or Res were also studied. Our search criteria excluded studies that documented bias against medical professionals (students, residents, and clinicians), whether by patients, medical schools, healthcare administrators, or others, and focused on studies where the biases were held by the medical trainees themselves (MS and Res).

Data extraction and analysis

Following the initial database search, references were downloaded and bulk uploaded into Covidence, and duplicates were removed. After the initial screening of titles and abstracts, full texts were reviewed. The authors independently completed title and abstract screening and full-text reviews. Any conflicts at the stage of abstract screening were moved to full-text screening. Conflicts during full-text screening were resolved by deliberation and by referring to the inclusion and exclusion criteria detailed in the research protocol. The level of agreement between the two authors for full-text reviews, as measured by inter-rater reliability, was 0.72 (Cohen’s kappa).
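The inter-rater agreement statistic reported above can be reproduced for any two screeners' decisions with scikit-learn; the include/exclude calls below are invented for illustration, so the resulting kappa is not the 0.72 reported by the authors.

from sklearn.metrics import cohen_kappa_score

# Hypothetical full-text screening decisions from two independent reviewers.
reviewer_a = ["include", "exclude", "include", "exclude", "include", "include"]
reviewer_b = ["include", "exclude", "exclude", "exclude", "include", "include"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa = {kappa:.2f}")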

A data extraction template was created in Covidence to extract data from the included full texts. The template included the following variables: country in which the study was conducted, year of publication, goal of the study (EOB, BI, or both), population of the study (MS, Res, or mixed), and the type of bias studied. Final data were exported to Microsoft Excel for quantification. For charting our data and categorizing the included studies, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines [25]. Results from this scoping review are meant to provide a visual synthesis of existing bias research and identify gaps in knowledge.

Study selection

Our search strategy yielded a total of 892 abstracts, which were imported into Covidence for screening. A total of 86 duplicate references were removed. Then, 806 unique titles and abstracts were screened for relevance independently by the authors, and 519 studies were excluded at this stage. Any conflicts among the reviewers at this stage were resolved by discussion and by referring to the inclusion and exclusion criteria. A full-text review of the remaining 287 papers was then completed by the authors against the inclusion criteria for eligibility. Full-text review was also conducted independently by the authors, and any conflicts were resolved upon discussion. Finally, we included 139 studies, which were used for data extraction (Fig. 1).

Figure 1. PRISMA diagram of the study selection process used in our scoping review to identify the bias categories reported within the medical education literature. The study took place from 2021–2022. Abbreviation: PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Publication trends in bias research

First, we charted the studies to demonstrate the timeline of research focused on bias within the study populations of interest (MS, Res, or mixed). Our analysis revealed an increase in publications over time (Fig. 2). Of the 139 included studies, few were published prior to 2001, with a total of only eight papers published from 1985 to 2000. A substantial increase in publications occurred after 2004, with 2019 being the peak year in which the most bias-related studies were published (Fig. 2).

Figure 2. Studies matching the inclusion criteria, mapped by year of publication. Search criteria included studies addressing bias from 1980–2021 within medical student (MS), resident (Res), or mixed (MS + Res) populations. *The 2022 publication appeared online ahead of print.

Overview of included studies

We present a descriptive analysis of the 139 included studies in Table 1, based on the following parameters: study location, goal of the study, population of the study, and the category of bias studied. All of the above parameters except the category of bias had a denominator of 139 studies. Several studies addressed more than one bias characteristic; therefore, we documented 163 biases, sorted into 11 categories, across the 139 papers. The bias categories that we generated and their respective occurrences are listed in Table 1. Of the 139 included studies, most originated in the United States (n = 89/139, 64%) and Europe (n = 20/139, 20%).

Sorting of included research by bias category

We grouped the 139 included studies according to the patient attribute or descriptive characteristic against which the bias was studied (Table 1). By sorting the studies into different bias categories, we aimed not only to quantify the amount of research addressing a particular topic of bias, but also to reveal the biases that are understudied.

Through our analysis, we generated 11 descriptive categories against which bias was studied: age, physical disability, education level, biological sex, disease or condition, LGBTQ+, non-specified, race/ethnicity, rural/urban, socio-economic status, and weight (Table 1). The “age” and “weight” categories included papers that studied bias against older populations and higher-weight individuals, respectively. The categories “education level” and “socio-economic status” included papers that studied bias against individuals with a low education level and individuals of low socioeconomic status, respectively. The “biological sex” category included papers that studied bias against individuals perceived as women/females. Papers that studied bias against gender identity or sexual orientation were included in their own category, “LGBTQ+”. The “disease or condition” category was broad and included research on bias against any patient with a specific disease, condition, or lifestyle; studies in this category researched bias against physical illnesses, mental illnesses, or sexually transmitted infections, as well as bias against a treatment such as transplant or pain management. It was not meaningful to report these as individual categories, so they were reported as a whole with a common underlying theme. Rural/urban bias referred to bias held against a person based on their place of residence. Studies grouped in the “non-specified bias” category explored bias without specifying any descriptive characteristic in their methods; these studies did not address any particular bias characteristic but consisted of a study population of interest (MS, Res, or mixed). Based on our analysis, the top five most studied bias categories within the included populations in the medical education literature were: racial or ethnic bias (n = 39/163, 24%), disease or condition bias (n = 29/163, 18%), weight bias (n = 22/163, 13%), LGBTQ+ bias (n = 21/163, 13%), and age bias (n = 16/163, 10%), which are presented in Table 1.

Sorting of included research by population

In order to understand the distribution of bias research by the population examined, we sorted the included studies into one of the following groups: medical students (MS), residents (Res), or mixed (Table 1). The following distribution was observed: medical students only (n = 105/139, 76%), residents only (n = 19/139, 14%), or mixed, consisting of both medical students and residents (n = 15/139, 11%). In combination, these results demonstrate that medical educators have focused bias research efforts primarily on medical student populations.

Sorting of included research by goal

A critical component of this scoping review was to quantify the research goal of the included studies within each of the bias categories. We defined the research goal as either documenting evidence of bias (EOB) or evaluating a bias intervention (BI) (see Fig. 1 for inclusion criteria). Some of the included studies focused on both documenting evidence and intervening on biases, and those studies were grouped separately. The analysis revealed that 69/139 (50%) of the included studies focused exclusively on documenting evidence of bias. Fewer studies (n = 51/139, 37%) focused solely on bias interventions such as programs, seminars, or curricular innovations. A small minority of the included studies were more comprehensive in that they documented EOB followed by an intervention strategy (n = 19/139, 14%). These results demonstrate that most bias research is dedicated to documenting evidence of bias among these groups rather than evaluating a bias intervention strategy.
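The kind of tally reported in this and the following subsection can be produced with a few lines of pandas; the rows below are made up, and the column names are assumptions chosen for the example rather than the authors' actual extraction spreadsheet.

import pandas as pd

# Hypothetical extraction records: one row per study, with its population and research goal.
studies = pd.DataFrame({
    "population": ["MS", "MS", "Res", "Mixed", "MS", "Res"],
    "goal": ["EOB", "BI", "BI", "EOB", "both", "EOB"],
})

# Share of studies per goal overall, and per goal within each population.
overall = studies["goal"].value_counts(normalize=True).round(2)
by_population = studies.groupby("population")["goal"].value_counts(normalize=True).round(2)

print(overall)
print(by_population)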

Research goal distribution

Our next objective was to calculate the distribution of studies with respect to study goal (EOB, BI, or both) within the 163 biases studied across the 139 papers, as calculated in Table 1. In general, the goal of the studies favors documenting evidence of bias, with the exception of race/ethnic bias, which is more focused on bias intervention (Fig. 3). Across all bias categories, fewer studies aimed at both documenting evidence and then providing an intervention.

Figure 3. Sorting of total biases (n = 163) within medical student, resident, or mixed populations by bias category. Dark grey indicates studies with a dual goal (documenting evidence of bias and intervening on bias); medium grey bars indicate studies focused on documenting evidence of bias; light grey bars indicate studies focused on bias intervention within these populations. Numbers inside the bars indicate the total number of biases for the respective study goal. *Non-specified bias includes studies that focused on implicit bias but did not mention the type of bias investigated.

Furthermore, we also calculated the ratio of EOB, BI, and both (EOB + BI) within each population of interest (MS, n = 122; Res, n = 26; mixed, n = 15) for the 163 biases observed in the included studies. Over half (n = 64/122, 52%) of the total bias occurrences in MS were focused on documenting EOB (Fig. 4). By contrast, a shift was observed within resident populations, where most biases addressed were aimed at intervention (n = 12/26, 41%) rather than EOB (n = 4/26, 14%) (Fig. 4). Studies that included both MS and Res (mixed) were primarily focused on documenting EOB (n = 9/15, 60%), with 33% (n = 5/15) aimed at bias intervention and 7% (n = 1/15) doing both (Fig. 4). Although far fewer studies were documented in the Res population, it is important to highlight that most of these studies focused on bias intervention, in contrast to the MS population, where the majority of studies focused on evidence of bias.

Figure 4. Ratio of study goals for the total biases (n = 163) mapped within each study population (MS, Res, and mixed). Studies documenting evidence of bias (EOB) are depicted in dotted grey, bias intervention (BI) in medium grey, and a dual focus (EOB + BI) in dark grey. N = 122 for medical student studies; N = 26 for residents; N = 15 for mixed.

Addressing biases at an earlier stage of the medical career is critical for future physicians engaging with diverse patients, since it is established that bias negatively influences provider-patient interactions [171] and clinical decision-making [172], and reduces favorable treatment outcomes [2]. We set out with the intention to explore how bias is addressed within the medical curriculum. Our research question was: how has the trend in bias research changed over time; more specifically, a) what is the timeline of papers published, b) what bias characteristics have been studied in the physician-trainee population, and c) how are these biases addressed? With the introduction of ‘standards of diversity’ by the Liaison Committee on Medical Education, along with the Association of American Medical Colleges (AAMC) and the American Medical Association (AMA) [173, 174], we certainly expected and observed a sustained uptick in research pertaining to bias. As shown here, research addressing bias in the target population (MS and Res) is on the rise; however, only 139 papers fit our inclusion criteria. Of these studies, nearly 90% have been published since 2005, after the “Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care” report was published in 2003 [7]. However, given the well-documented effects of physician-held bias, we anticipated a significantly greater number of studies focused on bias at the medical student or resident level.

A key component of this study was that we generated descriptive categories of biases. Sorting the biases into descriptive categories helps to identify a more targeted approach for intervening on a specific bias, rather than broadly intervening on bias as a whole. In fact, our analysis found a number of publications (labeled “non-specified bias” in Table 1) which studied implicit bias without specifying the patient attribute or characteristic that the bias was against. In total, we generated 11 descriptive categories of bias from our scoping review, which are shown in Table 1 and Fig. 3. Furthermore, our bias descriptors grouped similar kinds of biases within a single category. For example, the category “disease or condition” included papers that studied bias against any type of disease (mental illness, HIV stigma, diabetes), condition (pain management), or lifestyle. We neither performed a qualitative assessment of the studies nor tested the efficacy of the bias intervention studies, and we consider this a future direction of this work.

Evidence suggests that medical educators and healthcare professionals are struggling to find the appropriate approach to intervene on biases [175, 176, 177]. So far, bias reduction, bias reflection, and bias management approaches have been proposed [26, 27, 178]. Previous implicit bias intervention strategies have been shown to be ineffective when participants’ biased attitudes were assessed after a lag [179]. Understanding the descriptive categories of bias and the existing research efforts, as we present here, is only a fraction of the challenge. The theory of “cognitive bias” [180] and related branches of research [13, 181, 182, 183, 184] have been studied in the field of psychology for over three decades. It is only recently that cognitive bias theory has been applied to medical education and medicine, to explain its negative influence on clinical decision-making, and so far mainly with respect to racial minorities [1, 2, 15, 16, 17, 185]. In order to elicit meaningful changes with respect to targeted bias intervention, it is necessary to understand the psychological underpinnings (attitudes) leading to a certain descriptive category of bias (behaviors). The questions which medical educators need to ask are: a) can these descriptive biases be mapped onto the type(s) of cognitive error that elicit them, and vice versa; b) are we working towards an attitude change that can elicit a sustained positive behavior change among healthcare professionals; and, most importantly, c) are we creating a culture where participants voluntarily enroll themselves in bias interventions as opposed to being mandated to participate? Cognitive psychologists and behavioral scientists are well positioned to help us find answers to these questions, as they understand human behavior. Therefore, an interdisciplinary approach, a marriage between cognitive psychology and medical education, is key in targeting biases held by medical students, residents, and ultimately future physicians. This review may also be of interest to behavioral psychologists keen on providing targeted intervention strategies to clinicians depending on the characteristic (age, weight, sex, or race) against which the bias is held. Further, instead of an individualized approach, we need to strive for systemic changes and evidence-based strategies to intervene on biases.

The next element of change is directing intervention strategies at the right stage of clinical education. Our study demonstrated that most of the research collected at the medical student level was focused on documenting evidence of bias. Although the overall number of studies at the resident level was smaller than at the medical student level, the ratio of research in favor of bias intervention was higher at the resident level (see Fig. 3). However, it could be helpful to focus on bias intervention earlier in learning, rather than at a later stage [186]. Additionally, educational resources such as textbooks, preparatory materials, and educators themselves are potential sources of propagating biases and therefore need constant evaluation against best practices [187, 188].

This study has limitations. First, the list of descriptive bias categories that we generated was not grounded in any particular theory, so assigning a category was subjective. Additionally, some studies were categorized as “non-specified” bias because the studies themselves did not mention the specific type of bias they were addressing. Moreover, we had to exclude numerous publications solely because they were not evidence-based, being perspectives, commentaries, or opinion pieces. Finally, there were fewer studies overall focused on the resident population, so the calculated ratio of MS:Res studies did not compare similar sample sizes.

Future directions of our study include working with behavioral scientists to categorize these bias characteristics (Table 1 ) into cognitive error types [ 189 ]. Additionally, we aim to assess the effectiveness of the intervention strategies and categorize the approach of the intervention strategies.

The primary goal of our review was to organize, compare, and quantify the literature pertaining to bias within medical school curricula and residency programs. We neither performed a qualitative assessment of the studies nor tested the efficacy of the studies sorted into “bias intervention”, which is typical of scoping reviews [22]. In summary, our research identified 11 descriptive categories of bias studied within medical student and resident populations, with “race and ethnicity”, “disease or condition”, “weight”, “LGBTQ+”, and “age” being the top five most studied. Additionally, we found a greater number of studies conducted in medical students (105/139) than in residents (19/139); however, most of the studies in the resident population focused on bias intervention. The results from our review highlight the following gaps: a) bias categories where more research is needed, b) biases that are studied in medical school versus in residency programs, and c) study focus in terms of demonstrating the presence of bias versus working towards bias intervention.

This review provides a visual analysis of the known categories of bias addressed within the medical school curriculum and in residency programs in addition to providing a comparison of studies with respect to the study goal within medical education literature. The results from our review should be of interest to community organizations, institutions, program directors and medical educators interested in knowing and understanding the types of bias existing within healthcare populations. It might be of special interest to researchers who wish to explore other types of biases that have been understudied within medical school and resident populations, thus filling the gaps existing in bias research.

Despite the number of studies designed to provide bias intervention for MS and Res populations, and an overall cultural shift to be aware of one’s own biases, biases held by both medical students and residents still persist. Further, psychologists have recently demonstrated the ineffectiveness of some bias intervention efforts [ 179 , 190 ]. Therefore, it is perhaps unrealistic to expect these biases to be eliminated altogether. However, effective intervention strategies grounded in cognitive psychology should be implemented earlier on in medical training. Our focus should be on providing evidence-based approaches and safe spaces for an attitude and culture change, so as to induce actionable behavioral changes.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Abbreviations

  • MS: Medical student
  • EOB: Evidence of bias
  • BI: Bias intervention

Hagiwara N, Mezuk B, Elston Lafata J, Vrana SR, Fetters MD. Study protocol for investigating physician communication behaviours that link physician implicit racial bias and patient outcomes in Black patients with type 2 diabetes using an exploratory sequential mixed methods design. BMJ Open. 2018;8(10):e022623.

Haider AH, Schneider EB, Sriram N, Dossick DS, Scott VK, Swoboda SM, Losonczy L, Haut ER, Efron DT, Pronovost PJ, et al. Unconscious race and social class bias among acute care surgical clinicians and clinical treatment decisions. JAMA Surg. 2015;150(5):457–64.

Penner LA, Dovidio JF, Gonzalez R, Albrecht TL, Chapman R, Foster T, Harper FW, Hagiwara N, Hamel LM, Shields AF, et al. The effects of oncologist implicit racial bias in racially discordant oncology interactions. J Clin Oncol. 2016;34(24):2874–80.

Phelan SM, Burgess DJ, Yeazel MW, Hellerstedt WL, Griffin JM, van Ryn M. Impact of weight bias and stigma on quality of care and outcomes for patients with obesity. Obes Rev. 2015;16(4):319–26.

Garrett SB, Jones L, Montague A, Fa-Yusuf H, Harris-Taylor J, Powell B, Chan E, Zamarripa S, Hooper S, Chambers Butcher BD. Challenges and opportunities for clinician implicit bias training: insights from perinatal care stakeholders. Health Equity. 2023;7(1):506–19.

Shah HS, Bohlen J. Implicit bias. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2023. Copyright © 2023, StatPearls Publishing LLC.

Institute of Medicine (US) Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. In: Smedley BD, Stith AY, Nelson AR, editors. Washington (DC): National Academies Press (US); 2003. PMID: 25032386.

Dehon E, Weiss N, Jones J, Faulconer W, Hinton E, Sterling S. A systematic review of the impact of physician implicit racial bias on clinical decision making. Acad Emerg Med. 2017;24(8):895–904.

Oliver MN, Wells KM, Joy-Gaba JA, Hawkins CB, Nosek BA. Do physicians’ implicit views of African Americans affect clinical decision making? J Am Board Fam Med. 2014;27(2):177–88.

Rincon-Subtirelu M. Education as a tool to modify anti-obesity bias among pediatric residents. Int J Med Educ. 2017;8:77–8.

Gustafsson Sendén M, Renström EA. Gender bias in assessment of future work ability among pain patients - an experimental vignette study of medical students’ assessment. Scand J Pain. 2019;19(2):407–14.

Hardeman RR, Burgess D, Phelan S, Yeazel M, Nelson D, van Ryn M. Medical student socio-demographic characteristics and attitudes toward patient centered care: do race, socioeconomic status and gender matter? A report from the medical student CHANGES study. Patient Educ Couns. 2015;98(3):350–5.

Greenwald AG, Banaji MR. Implicit social cognition: attitudes, self-esteem, and stereotypes. Psychol Rev. 1995;102(1):4–27.

Kruse JA, Collins JL, Vugrin M. Educational strategies used to improve the knowledge, skills, and attitudes of health care students and providers regarding implicit bias: an integrative review of the literature. Int J Nurs Stud Adv. 2022;4:100073.

Zestcott CA, Blair IV, Stone J. Examining the presence, consequences, and reduction of implicit bias in health care: a narrative review. Group Process Intergroup Relat. 2016;19(4):528–42.

Hall WJ, Chapman MV, Lee KM, Merino YM, Thomas TW, Payne BK, Eng E, Day SH, Coyne-Beasley T. Implicit racial/ethnic bias among health care professionals and its influence on health care outcomes: a systematic review. Am J Public Health. 2015;105(12):E60–76.

FitzGerald C, Hurst S. Implicit bias in healthcare professionals: a systematic review. BMC Med Ethics. 2017;18(1):19.

Gonzalez CM, Onumah CM, Walker SA, Karp E, Schwartz R, Lypson ML. Implicit bias instruction across disciplines related to the social determinants of health: a scoping review. Adv Health Sci Educ. 2023;28(2):541–87.

Pham MT, Rajić A, Greig JD, Sargeant JM, Papadopoulos A, McEwen SA. A scoping review of scoping reviews: advancing the approach and enhancing the consistency. Res Synth Methods. 2014;5(4):371–85.

Levac D, Colquhoun H, O’Brien KK. Scoping studies: advancing the methodology. Implement Sci. 2010;5:69.

Pat C, Geeta S, Sílvia M. Cognitive debiasing 1: origins of bias and theory of debiasing. BMJ Qual Saf. 2013;22(Suppl 2):ii58.

Arksey H, O’Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol. 2005;8(1):19–32.

Thomas A, Lubarsky S, Durning SJ, Young ME. Knowledge syntheses in medical education: demystifying scoping reviews. Acad Med. 2017;92(2):161–6.

Hagopian A, Thompson MJ, Fordyce M, Johnson KE, Hart LG. The migration of physicians from sub-Saharan Africa to the United States of America: measures of the African brain drain. Hum Resour Health. 2004;2(1):17.

Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, Moher D, Peters MDJ, Horsley T, Weeks L, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169(7):467–73.

Teal CR, Shada RE, Gill AC, Thompson BM, Frugé E, Villarreal GB, Haidet P. When best intentions aren’t enough: Helping medical students develop strategies for managing bias about patients. J Gen Intern Med. 2010;25(Suppl 2):S115–8.

Gonzalez CM, Walker SA, Rodriguez N, Noah YS, Marantz PR. Implicit bias recognition and management in interpersonal encounters and the learning environment: a skills-based curriculum for medical students. MedEdPORTAL. 2021;17:11168.

Hoffman KM, Trawalter S, Axt JR, Oliver MN. Racial bias in pain assessment and treatment recommendations, and false beliefs about biological differences between blacks and whites. Proc Natl Acad Sci U S A. 2016;113(16):4296–301.

Mayfield JJ, Ball EM, Tillery KA, Crandall C, Dexter J, Winer JM, Bosshardt ZM, Welch JH, Dolan E, Fancovic ER, et al. Beyond men, women, or both: a comprehensive, LGBTQ-inclusive, implicit-bias-aware, standardized-patient-based sexual history taking curriculum. MedEdPORTAL. 2017;13:10634.

Morris M, Cooper RL, Ramesh A, Tabatabai M, Arcury TA, Shinn M, Im W, Juarez P, Matthews-Juarez P. Training to reduce LGBTQ-related bias among medical, nursing, and dental students and providers: a systematic review. BMC Med Educ. 2019;19(1):325.

Perdomo J, Tolliver D, Hsu H, He Y, Nash KA, Donatelli S, Mateo C, Akagbosu C, Alizadeh F, Power-Hays A, et al. Health equity rounds: an interdisciplinary case conference to address implicit bias and structural racism for faculty and trainees. MedEdPORTAL. 2019;15:10858.

Sherman MD, Ricco J, Nelson SC, Nezhad SJ, Prasad S. Implicit bias training in a residency program: aiming for enduring effects. Fam Med. 2019;51(8):677–81.

van Ryn M, Hardeman R, Phelan SM, Burgess DJ, Dovidio JF, Herrin J, Burke SE, Nelson DB, Perry S, Yeazel M, et al. Medical school experiences associated with change in implicit racial bias among 3547 students: a medical student CHANGES study report. J Gen Intern Med. 2015;30(12):1748–56.

Chary AN, Molina MF, Dadabhoy FZ, Manchanda EC. Addressing racism in medicine through a resident-led health equity retreat. West J Emerg Med. 2020;22(1):41–4.

DallaPiazza M, Padilla-Register M, Dwarakanath M, Obamedo E, Hill J, Soto-Greene ML. Exploring racism and health: an intensive interactive session for medical students. MedEdPORTAL. 2018;14:10783.

Dennis SN, Gold RS, Wen FK. Learner reactions to activities exploring racism as a social determinant of health. Fam Med. 2019;51(1):41–7.

Gonzalez CM, Walker SA, Rodriguez N, Karp E, Marantz PR. It can be done! a skills-based elective in implicit bias recognition and management for preclinical medical students. Acad Med. 2020;95(12S Addressing Harmful Bias and Eliminating Discrimination in Health Professions Learning Environments):S150–5.

Motzkus C, Wells RJ, Wang X, Chimienti S, Plummer D, Sabin J, Allison J, Cashman S. Pre-clinical medical student reflections on implicit bias: Implications for learning and teaching. PLoS ONE. 2019;14(11):e0225058.

Phelan SM, Burke SE, Cunningham BA, Perry SP, Hardeman RR, Dovidio JF, Herrin J, Dyrbye LN, White RO, Yeazel MW, et al. The effects of racism in medical education on students’ decisions to practice in underserved or minority communities. Acad Med. 2019;94(8):1178–89.


Acknowledgements

The authors would like to thank Dr. Misa Mi, Professor and Medical Librarian at the Oakland University William Beaumont School of Medicine (OUWB), for her assistance with selection of databases and construction of literature search strategies for the scoping review. The authors also wish to thank Dr. Changiz Mohiyeddini, Professor in Behavioral Medicine and Psychopathology at OUWB, for his expertise and constructive feedback on our manuscript.

Author information

Authors and Affiliations

Department of Foundational Sciences, Central Michigan University College of Medicine, Mt. Pleasant, MI, 48859, USA

Brianne E. Lewis

Department of Foundational Medical Studies, Oakland University William Beaumont School of Medicine, 586 Pioneer Dr, Rochester, MI, 48309, USA

Akshata R. Naik


Contributions

A.R.N and B.E.L were equally involved in study conception, design, collecting data and analyzing the data. B.E.L and A.R.N both contributed towards writing the manuscript. A.R.N and B.E.L are both senior authors on this paper. All authors reviewed the manuscript.

Corresponding author

Correspondence to Akshata R. Naik.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Lewis, B.E., Naik, A.R. A scoping review to identify and organize literature trends of bias research within medical student and resident education. BMC Med Educ 23, 919 (2023). https://doi.org/10.1186/s12909-023-04829-6


Received: 14 March 2023

Accepted: 01 November 2023

Published: 05 December 2023

DOI: https://doi.org/10.1186/s12909-023-04829-6


  • Preclinical curriculum
  • Evidence of bias


Internal Validity in Research | Definition, Threats & Examples

Published on May 1, 2020 by Pritha Bhandari. Revised on June 22, 2023.

Internal validity is the extent to which you can be confident that a cause-and-effect relationship established in a study cannot be explained by other factors.

Table of contents

  • Why internal validity matters
  • How to check whether your study has internal validity
  • Trade-off between internal and external validity
  • Threats to internal validity and how to counter them
  • Other interesting articles
  • Frequently asked questions about internal validity

Why internal validity matters

Internal validity makes the conclusions of a causal relationship credible and trustworthy. Without high internal validity, an experiment cannot demonstrate a causal link between two variables.

Suppose you are testing whether drinking coffee improves memory, and you assign participants to the treatment or control group according to the laboratory session times they sign up for. Once they arrive at the laboratory, the treatment group participants are given a cup of coffee to drink, while control group participants are given water. You also give both groups memory tests. After analyzing the results, you find that the treatment group performed better than the control group on the memory test.

For your conclusion to be valid, you need to be able to rule out other explanations (including control, extraneous, and confounding variables) for the results.


How to check whether your study has internal validity

There are three necessary conditions for internal validity. All three conditions must occur to experimentally establish causality between an independent variable A (your treatment variable) and dependent variable B (your response variable).

  • Your treatment and response variables change together.
  • Your treatment precedes changes in your response variables.
  • No confounding or extraneous factors can explain the results of your study.

In the research example above, only two out of the three conditions have been met.

  • Drinking coffee and memory performance increased together.
  • Drinking coffee happened before the memory test.
  • The time of day of the sessions is an extraneous factor that can equally explain the results of the study.

Because you assigned participants to groups based on the schedule, the groups were different at the start of the study. Any differences in memory performance may be due to a difference in the time of day. Therefore, you cannot say for certain whether the time of day or drinking a cup of coffee improved memory performance.

That means your study has low internal validity, and you cannot deduce a causal relationship between drinking coffee and memory performance.

Trade-off between internal and external validity

External validity is the extent to which you can generalize the findings of a study to other measures, settings or groups. In other words, can you apply the findings of your study to a broader context?

There is an inherent trade-off between internal and external validity; the more you control extraneous factors in your study, the less you can generalize your findings to a broader context.

Threats to internal validity and how to counter them

Threats to internal validity are important to recognize and counter in a research design for a robust study. Different threats can apply to single-group and multi-group studies.

Single-group studies

How to counter threats in single-group studies

Altering the experimental design can counter several threats to internal validity in single-group studies.

  • Adding a comparable control group counters threats to single-group studies. If comparable control and treatment groups each face the same threats, the outcomes of the study won’t be affected by them.
  • A large sample size counters testing, because results would be more sensitive to any variability in the outcomes and less likely to suffer from sampling bias.
  • Using filler tasks or questionnaires to hide the purpose of the study also counters testing threats and demand characteristics.

Multi-group studies

How to counter threats in multi-group studies

Altering the experimental design can counter several threats to internal validity in multi-group studies.

  • Random assignment of participants to groups counters selection bias and regression to the mean by making groups comparable at the start of the study.
  • Blinding participants to the aim of the study counters the effects of social interaction.

Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Frequently asked questions about internal validity

Internal validity is the degree of confidence that the causal relationship you are testing is not influenced by other factors or variables.

External validity is the extent to which your results can be generalized to other contexts.

The validity of your experiment depends on your experimental design.

There are eight threats to internal validity: history, maturation, instrumentation, testing, selection bias, regression to the mean, social interaction and attrition.

Attrition bias is a threat to internal validity. In experiments, differential rates of attrition between treatment and control groups can skew results.

This bias can affect the relationship between your independent and dependent variables. It can make variables appear to be correlated when they are not, or vice versa.

Cite this Scribbr article


Bhandari, P. (2023, June 22). Internal Validity in Research | Definition, Threats & Examples. Scribbr. Retrieved April 5, 2024, from https://www.scribbr.com/methodology/internal-validity/



Screenshots of the smartphone cognitive tasks developed by Datacubed Health and included in the ALLFTD Mobile App. Details about the task design and instructions are included in the eMethods in Supplement 1. A, Flanker (Ducks in a Pond) is a task of cognitive control requiring participants to select the direction of the center duck. B, Go/no-go (Go Sushi Go!) requires participants to quickly tap on pieces of sushi (go) but not to tap when they see a fish skeleton (no-go). C, Card sort (Card Shuffle) is a task of cognitive flexibility requiring participants to learn rules that change during the task. D, The adaptive associative memory task (Humi’s Bistro) requires participants to learn the food orders of several restaurant tables. E, Stroop (Color Clash) is a cognitive inhibition paradigm requiring participants to inhibit their tendency to read words and instead respond based on the color of the word. F, The 2-back task (Animal Parade) requires participants to determine whether animals on a parade float match the animals they saw 2 stimuli previously. G, Participants are asked to complete 3 testing sessions over 2 weeks. Shown in dark blue, they have 3 days to complete each testing session with a washout day between sessions on which no tests are available. Session 2 always begins on day 5 and session 3 on day 9. Screenshots are provided with permission from Datacubed Health.

Forest plots present internal consistency and test-retest reliability results in the discovery and validation cohorts, as well as an estimate in a combined sample of discovery and validation participants. ICC indicates intraclass correlation coefficient.

A and B, Correlation matrices display associations of in-clinic criterion standard measures and ALLFTD mobile App (mApp) test scores in discovery and validation cohorts. Below the horizontal dashed lines, the associations among app tests and between app tests and demographic characteristics, convergent clinical measures, divergent cognitive tests, and neuroimaging regions of interest can be viewed. Most app tests show strong correlations with each other and with age, convergent clinical measures, and brain volume. The measures show weaker correlations with divergent measures of visuospatial (Benson Figure Copy) and language (Multilingual Naming Test [MINT]) abilities. The strength of convergent correlations between app measures and outcomes is similar to the correlations between criterion standard neuropsychological scores and these outcomes, which can be viewed by looking across the rows above the horizontal black line. C and D, In the discovery and validation cohorts, receiver operating characteristic curves were calculated to determine how well a composite of app tests, the Uniform Data Set, version 3.0, Executive Functioning Composite (UDS3-EF), and the Montreal Cognitive Assessment (MoCA) discriminate individuals without symptoms (Clinical Dementia Rating Scale plus National Alzheimer’s Coordinating Center FTLD module sum of boxes [CDR plus NACC-FTLD-SB] score = 0) from individuals with the mildest symptoms of FTLD (CDR plus NACC-FTLD-SB score = 0.5). AUC indicates area under the curve; CVLT, California Verbal Learning Test.
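
For orientation, the receiver operating characteristic analysis summarized in panels C and D reduces to computing an area under the curve (AUC) from continuous test scores and binary group labels. The sketch below is an illustration only, using simulated data and scikit-learn's roc_auc_score; it is not the authors' analysis code, and the group sizes and score distributions are hypothetical.

```python
# Illustration only: AUC for discriminating asymptomatic participants
# (CDR plus NACC-FTLD-SB = 0) from mildly symptomatic participants
# (CDR plus NACC-FTLD-SB = 0.5) using a simulated cognitive composite.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

labels = np.concatenate([np.zeros(100), np.ones(60)])  # 0 = asymptomatic, 1 = mildest symptoms
scores = np.concatenate([rng.normal(0.0, 1.0, 100),    # asymptomatic composite scores
                         rng.normal(-1.2, 1.0, 60)])   # symptomatic group scores lower

# Higher composite = better performance, so negate the scores so that larger
# values track the positive (symptomatic) class before computing the AUC.
auc = roc_auc_score(labels, -scores)
print(f"AUC = {auc:.2f}")
```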

eMethods. Instruments and Statistical Analysis

eResults. Participants

eTable 1. Participant Characteristics and Test Scores in Original and Validation Cohorts

eTable 2. Comparison of Diagnostic Accuracy for ALLFTD Mobile App Composite Score Across Cohorts

eTable 3. Number of Distractions Reported During the Remote Smartphone Testing Sessions

eTable 4. Qualitative Description of the Distractions Reported During Remote Testing Sessions

eFigure 1. Scatterplots of Test-Retest Reliability in a Mixed Sample of Adults Without Functional Impairment and Participants With FTLD

eFigure 2. Comparison of Test-Retest Reliability Estimates by Endorsement of Distractions

eFigure 3. Comparison of Test-Retest Reliability Estimates by Operating System

eFigure 4. Correlation Matrix in the Combined Cohort

eFigure 5. Neural Correlates of Smartphone Cognitive Test Performance

eReferences

Nonauthor Collaborators

Data Sharing Statement


Staffaroni AM, Clark AL, Taylor JC, et al. Reliability and Validity of Smartphone Cognitive Testing for Frontotemporal Lobar Degeneration. JAMA Netw Open. 2024;7(4):e244266. doi:10.1001/jamanetworkopen.2024.4266


Reliability and Validity of Smartphone Cognitive Testing for Frontotemporal Lobar Degeneration

  • 1 Department of Neurology, Memory and Aging Center, Weill Institute for Neurosciences, University of California, San Francisco
  • 2 Department of Neurology, Columbia University, New York, New York
  • 3 Department of Neurology, Mayo Clinic, Rochester, Minnesota
  • 4 Department of Quantitative Health Sciences, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
  • 5 Department of Neurology, Case Western Reserve University, Cleveland, Ohio
  • 6 Department of Neurosciences, University of California, San Diego, La Jolla
  • 7 Department of Radiology, University of North Carolina, Chapel Hill
  • 8 Department of Neurology, Indiana University, Indianapolis
  • 9 Department of Neurology, Vanderbilt University, Nashville, Tennessee
  • 10 Department of Neurology, University of Washington, Seattle
  • 11 Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota
  • 12 Department of Neurology, Institute for Precision Health, University of California, Los Angeles
  • 13 Department of Neurology, Knight Alzheimer Disease Research Center, Washington University, Saint Louis, Missouri
  • 14 Department of Psychiatry, Knight Alzheimer Disease Research Center, Washington University, Saint Louis, Missouri
  • 15 Department of Neuroscience, Mayo Clinic, Jacksonville, Florida
  • 16 Department of Neurology, University of Pennsylvania Perelman School of Medicine, Philadelphia
  • 17 Division of Neurology, University of British Columbia, Musqueam, Squamish & Tsleil-Waututh Traditional Territory, Vancouver, Canada
  • 18 Department of Neurosciences, University of California, San Diego, La Jolla
  • 19 Department of Neurology, Nantz National Alzheimer Center, Houston Methodist and Weill Cornell Medicine, Houston Methodist, Houston, Texas
  • 20 Department of Neurology, UCLA (University of California, Los Angeles)
  • 21 Department of Neurology, University of Colorado, Aurora
  • 22 Department of Neurology, David Geffen School of Medicine, UCLA
  • 23 Department of Neurology, University of Alabama, Birmingham
  • 24 Tanz Centre for Research in Neurodegenerative Diseases, Division of Neurology, University of Toronto, Toronto, Ontario, Canada
  • 25 Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston
  • 26 Department of Epidemiology and Biostatistics, University of California, San Francisco
  • 27 Department of Psychological & Brain Sciences, Washington University, Saint Louis, Missouri

Question   Can remote cognitive testing via smartphones yield reliable and valid data for frontotemporal lobar degeneration (FTLD)?

Findings   In this cohort study of 360 patients, remotely deployed smartphone cognitive tests showed moderate to excellent reliability, and their validity was supported by associations with criterion standard measures (in-person disease severity assessments and neuropsychological tests) and brain volumes. Smartphone tests accurately detected dementia and were more sensitive to the earliest stages of familial FTLD than standard neuropsychological tests.

Meaning   These findings suggest that remotely deployed smartphone-based assessments may be reliable and valid tools for evaluating FTLD and may enhance early detection, supporting the inclusion of digital assessments in clinical trials for neurodegeneration.

Importance   Frontotemporal lobar degeneration (FTLD) is relatively rare, behavioral and motor symptoms increase travel burden, and standard neuropsychological tests are not sensitive to early-stage disease. Remote smartphone-based cognitive assessments could mitigate these barriers to trial recruitment and success, but no such tools are validated for FTLD.

Objective   To evaluate the reliability and validity of smartphone-based cognitive measures for remote FTLD evaluations.

Design, Setting, and Participants   In this cohort study conducted from January 10, 2019, to July 31, 2023, controls and participants with FTLD performed smartphone application (app)–based executive functioning tasks and an associative memory task 3 times over 2 weeks. Observational research participants were enrolled through 18 centers of a North American FTLD research consortium (ALLFTD) and were asked to complete the tests remotely using their own smartphones. Of 1163 eligible individuals (enrolled in parent studies), 360 were enrolled in the present study; 364 refused and 439 were excluded. Participants were divided into discovery (n = 258) and validation (n = 102) cohorts. Among 329 participants with data available on disease stage, 195 were asymptomatic or had preclinical FTLD (59.3%), 66 had prodromal FTLD (20.1%), and 68 had symptomatic FTLD (20.7%) with a range of clinical syndromes.

Exposure   Participants completed standard in-clinic measures and remotely administered ALLFTD mobile app (ALLFTD-mApp) smartphone tests.

Main Outcomes and Measures   Internal consistency, test-retest reliability, association of smartphone tests with criterion standard clinical measures, and diagnostic accuracy.

Results   In the 360 participants (mean [SD] age, 54.0 [15.4] years; 209 [58.1%] women), smartphone tests showed moderate-to-excellent reliability (intraclass correlation coefficients, 0.77-0.95). Validity was supported by association of smartphone tests with disease severity (r range, 0.38-0.59), criterion standard neuropsychological tests (r range, 0.40-0.66), and brain volume (standardized β range, 0.34-0.50). Smartphone tests accurately differentiated individuals with dementia from controls (area under the curve [AUC], 0.93 [95% CI, 0.90-0.96]) and were more sensitive to early symptoms (AUC, 0.82 [95% CI, 0.76-0.88]) than the Montreal Cognitive Assessment (AUC, 0.68 [95% CI, 0.59-0.78]) (z of comparison, −2.49 [95% CI, −0.19 to −0.02]; P = .01). Reliability and validity findings were highly similar in the discovery and validation cohorts. Preclinical participants who carried pathogenic variants performed significantly worse than noncarrier family controls on 3 app tasks (eg, 2-back β = −0.49 [95% CI, −0.72 to −0.25]; P < .001) but not on a composite of traditional neuropsychological measures (β = −0.14 [95% CI, −0.42 to 0.14]; P = .32).

Conclusions and Relevance   The findings of this cohort study suggest that smartphones could offer a feasible, reliable, valid, and scalable solution for remote evaluations of FTLD and may improve early detection. Smartphone assessments should be considered as a complementary approach to traditional in-person trial designs. Future research should validate these results in diverse populations and evaluate the utility of these tests for longitudinal monitoring.

Frontotemporal lobar degeneration (FTLD) is a neurodegenerative pathology causing early-onset dementia syndromes with impaired behavior, cognition, language, and/or motor functioning. 1 Although over 30 FTLD trials are planned or in progress, there are several barriers to conducting FTLD trials. Clinical trials for neurodegenerative disease are expensive, 2 and frequent in-person trial visits are burdensome for patients, caregivers, and clinicians, 3 a concern magnified in FTLD by behavioral and motor impairments. Given the rarity and geographical dispersion of eligible participants, FTLD trials require global recruitment, 4 particularly for participants who live far from expert FTLD clinical trial centers. Furthermore, criterion standard neuropsychological tests are not adequately sensitive until symptoms are already noticeable to families, limiting their usefulness as outcomes in early-stage FTLD treatment trials. 4

Reliable, valid, and scalable remote data collection methods may help surmount these barriers to FTLD clinical trials. Smartphones are garnering interest across neurological conditions as a method for administering remote cognitive and motor evaluations. Preliminary evidence supports the feasibility, reliability, and/or validity of unsupervised smartphone cognitive and motor testing in older adults at risk for Alzheimer disease, 5 - 8 Parkinson disease, 9 and Huntington disease. 10 The clinical heterogeneity of FTLD necessitates a uniquely comprehensive smartphone battery. In the ALLFTD Consortium (Advancing Research and Treatment for Frontotemporal Lobar Degeneration [ARTFL] and Longitudinal Evaluation of Familial Frontotemporal Dementia Subjects [LEFFTDS]), the ALLFTD mobile application (ALLFTD-mApp) was designed to remotely monitor cognitive, behavioral, language, and motor functioning in FTLD research. Taylor et al 11 recently reported that unsupervised ALLFTD-mApp data collection through a multicenter North American FTLD research network was feasible and acceptable to participants. Herein, we extend that work by investigating the reliability and validity of unsupervised remote smartphone tests of executive functioning and memory in a cohort with FTLD that has undergone extensive phenotyping.

Participants were enrolled from ongoing FTLD studies requiring in-person assessment, including participants from 18 centers of the ALLFTD study 12 and University of California, San Francisco (UCSF) FTLD studies. To study the app in older individuals, a small group of older adults without functional impairment was recruited from the UCSF Brain Aging Network for Cognitive Health. All study procedures were approved by the UCSF or Johns Hopkins Central Institutional Review Board. All participants or legally authorized representatives provided written informed consent. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.

Inclusion criteria were age 18 years or older, having access to a smartphone, and reporting English as the primary language. Race and ethnicity were self-reported by participants using options consistent with the National Alzheimer’s Coordinating Center (NACC) Uniform Data Set (UDS) and were collected to contextualize the generalizability of these results. Participants were asked to complete tests on their own smartphones. Informants were encouraged for all participants and required for those with symptomatic FTLD (Clinical Dementia Rating Scale plus NACC FTLD module [CDR plus NACC-FTLD] global score ≥1). Recruitment targeted individuals with CDR plus NACC-FTLD global scores less than 2, but sites had discretion to enroll more severely impaired participants. Exclusion criteria were consistent with the parent ALLFTD study. 12

Participants were enrolled in the ALLFTD-mApp study within 90 days of annual ALLFTD study visits (including neuropsychological and neuroimaging data collection). Site research coordinators (including J.C.T., A.B.W., S.D., and M.M.) assisted participants with app download, setup, and orientation and observed participants completing the first questionnaire. All cognitive tasks were self-administered without supervision (except pilot participants, discussed below) in a predefined order with minor adjustments throughout the study. Study partners of participants with symptomatic FTLD were asked to remain nearby during participation to help navigate the ALLFTD-mApp but were asked not to assist with testing.

The baseline participation window was divided into three 25- to 35-minute assessment sessions occurring over 11 days. All cognitive tests were repeated in every session to enhance task reliability 6 , 13 and enable assessment of test-retest reliability, except for card sort, which was administered once every 6 months due to expected practice effects. Adherence was defined as the percentage of all available tasks that were completed. Participants were asked to complete the triplicate of sessions every 6 months for the duration of the app study. Only the baseline triplicate was analyzed in this study.

Replicability was tested by dividing the sample into a discovery cohort (n = 258) comprising all participants enrolled until the initial data freeze (October 1, 2022) and a validation cohort (n = 102) comprising participants enrolled after October 1, 2022, and 18 pilot participants 11 who completed the first session in person with an examiner present during cognitive pretesting. Sensitivity analyses excluded this small pilot cohort.

ALLFTD investigators partnered with Datacubed Health 14 to develop the ALLFTD-mApp on Datacubed Health’s Linkt platform. The app includes cognitive, motor, and speech tasks. This study focuses on 6 cognitive tests developed by Datacubed Health 11 comprising an adaptive associative memory task (Humi’s Bistro) and gamified versions of classic executive functioning paradigms: flanker (Ducks in a Pond), Stroop (Color Clash), 2-back (Animal Parade), go/no-go (Go Sushi Go!), and card sort (Card Shuffle) ( Figure 1 and eMethods in Supplement 1 ). Most participants with symptomatic FTLD (49 [72.1%]) were not administered Stroop or 2-back, as pilot studies identified these as too difficult. 11 The app test results were summarized as a composite score (eMethods in Supplement 1 ). Participants completed surveys to assess technological familiarity (daily or less than daily use of a smartphone) and distractions (present or absent).

Criterion standard clinical data were collected during parent project visits. Syndromic diagnoses were made according to published criteria 15 - 19 based on multidisciplinary conferences that considered neurological history, neurological examination results, and collateral interview. 20

The CDR plus NACC-FTLD module is an 8-domain rating scale based on informant and participant report. 21 A global score was calculated to categorize disease severity as asymptomatic or preclinical if a pathogenic variant carrier (0), prodromal (0.5), or symptomatic (1.0-3.0). 22 A sum of the 8 domain box scores (CDR plus NACC-FTLD sum of boxes) was also calculated. 22
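
A small sketch of the scoring logic described above (global-score categories and the sum of the 8 domain box scores); the thresholds follow the text, and the function names are ours.

```python
# Sketch of the CDR plus NACC-FTLD summaries described above: the global
# score maps to a severity group, and the sum of boxes is the sum of the
# 8 domain box scores. Function names are illustrative.
def severity_group(global_score: float) -> str:
    if global_score == 0:
        return "asymptomatic or preclinical"
    if global_score == 0.5:
        return "prodromal"
    return "symptomatic"  # global scores 1.0-3.0

def sum_of_boxes(domain_box_scores: list) -> float:
    assert len(domain_box_scores) == 8, "CDR plus NACC-FTLD has 8 domains"
    return float(sum(domain_box_scores))
```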

Participants completed the UDS Neuropsychological Battery, version 3.0 23 (eMethods in Supplement 1 ), which includes traditional neuropsychological measures and the Montreal Cognitive Assessment (MoCA), a global cognitive screen. Executive functioning and processing speed measures were summarized into a composite score (UDS3-EF). 24 Participants also completed a 9-item list-learning memory test (California Verbal Learning Test, 2nd edition, Short Form). 25 Most (339 [94.2%]) neuropsychological evaluations were conducted in person. In a subsample (n = 270), motor speed and dexterity were assessed using the Movement Disorder Society Unified Parkinson Disease Rating Scale 26 Finger Tapping subscale (0 indicates no deficits [n = 240]).

We acquired T1-weighted brain magnetic resonance imaging for 199 participants. Details of image acquisition, harmonization, preprocessing, and processing are provided in eMethods in Supplement 1 and prior publications. 27 Briefly, SPM12 (Statistical Parametric Mapping) was used for segmentation 28 and Large Deformation Diffeomorphic Metric Mapping for generating group templates. 29 Gray matter volumes were calculated in template space by integrating voxels and dividing by total intracranial volume in 2 regions of interest (ROIs) 30 : a frontoparietal and subcortical ROI and a hippocampal ROI. Voxel-based morphometry was used to test unbiased voxel-wise associations of volume with smartphone tests (eMethods in Supplement 1 ). 31 , 32
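
The full pipeline (SPM12 segmentation, LDDMM group templates, harmonization) is described in the eMethods and cited publications; the snippet below only illustrates the final ROI step as stated in the text, integrating gray matter over an ROI in template space and dividing by total intracranial volume. All inputs are assumptions about array layout, not the authors' code.

```python
# Illustrative sketch of the ROI step only (not the SPM12/LDDMM pipeline):
# integrate modulated gray matter over an ROI mask in template space and
# normalize by total intracranial volume (TIV). Inputs are assumptions.
import numpy as np

def roi_volume_fraction(gm_map: np.ndarray, roi_mask: np.ndarray,
                        voxel_volume_mm3: float, tiv_mm3: float) -> float:
    """gm_map: modulated gray matter image; roi_mask: boolean array of the
    same shape (e.g., frontoparietal/subcortical or hippocampal ROI)."""
    roi_mm3 = float(gm_map[roi_mask].sum()) * voxel_volume_mm3
    return roi_mm3 / tiv_mm3
```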

Participants in the ALLFTD study underwent genetic testing 33 at the University of California, Los Angeles. DNA samples were screened using targeted sequencing of a custom panel of genes previously implicated in neurodegenerative diseases, including GRN ( 138945 ) and MAPT ( 157140 ). Hexanucleotide repeat expansions in C9orf72 ( 614260 ) were detected using both fluorescent and repeat-primed polymerase chain reaction analysis. 34

Statistical analyses were conducted using Stata, version 17.0 (StataCorp LLC), and R, version 4.4.2 (R Project for Statistical Computing). All tests were 2-sided, with a statistical significance threshold of P < .05.

Psychometric properties of the smartphone tests were explored using descriptive statistics. Comparisons between CDR plus NACC-FTLD groups (ie, asymptomatic or preclinical, prodromal, and symptomatic) for continuous variables, including demographic characteristics and cognitive task scores (first exposure to each measure), were analyzed by fitting linear regressions. We used χ² tests for frequency data (eg, sex and race and ethnicity).
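
The analyses were run in Stata and R; purely as an illustration of the same comparisons, the sketch below uses statsmodels and scipy with assumed column names.

```python
# Sketch of the group comparisons (not the authors' Stata/R code):
# linear regression of a first-exposure task score on CDR plus NACC-FTLD
# group, and a chi-square test for frequency data. Column names assumed.
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

def compare_groups(df: pd.DataFrame):
    # Continuous outcome vs the 3-level severity group.
    fit = smf.ols("score ~ C(cdr_group)", data=df).fit()
    # Frequency data, e.g., sex by severity group.
    table = pd.crosstab(df["cdr_group"], df["sex"])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return fit, chi2, p
```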

Internal consistency, which measures reliability within a task, was estimated for participants’ first exposure to each test using Cronbach α (details in eMethods in Supplement 1 ). Test-retest reliability was estimated using intraclass correlation coefficients for participants who completed a task at least twice; all exposures were included. Reliability estimates are described as poor (<0.500), moderate (0.500-0.749), good (0.750-0.890), and excellent (≥0.900) 35 ; these are reporting rules of thumb, and clinical interpretation should consider raw estimates. We calculated 95% CIs via bootstrapping with 1000 samples.
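
Item-level definitions are in the eMethods; as a generic illustration of the reliability machinery, the sketch below computes Cronbach α on a participants-by-items matrix with a 1000-sample bootstrap 95% CI. It is not the study code, and test-retest ICCs across the three baseline sessions would be estimated separately (eg, with a mixed-effects ICC).

```python
# Generic sketch of the internal consistency estimate: Cronbach's alpha on
# a participants x items matrix of first-exposure scores, with a
# 1000-sample bootstrap 95% CI. Item definitions are in the eMethods.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def alpha_bootstrap_ci(items: np.ndarray, n_boot: int = 1000, seed: int = 0):
    rng = np.random.default_rng(seed)
    n = items.shape[0]
    boots = [cronbach_alpha(items[rng.integers(0, n, size=n)])
             for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])
```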

Validity analyses used participants’ first exposure to each test. Linear regressions with age, sex, and educational level as independent variables were fitted in participants without symptoms to understand the unique contribution of each demographic factor to cognitive test scores. Correlations and linear regression between the app-based tasks and disease severity (CDR plus NACC-FTLD sum of boxes score), neuropsychological test scores, and gray matter ROIs were used to investigate construct validity in the full sample. Demographic characteristics were not entered as covariates because the primary goal was to assess associations between app-based measures and criterion standards, rather than to understand the incremental predictive value of app measures. To address potential motor confounds, associations with disease severity were evaluated in a subsample without finger dexterity deficits on motor examination (using the Movement Disorder Society Unified Parkinson Disease Rating Scale Finger Tapping subscale). To complement ROI-based neuroimaging analysis based on a priori hypotheses, we conducted voxel-based morphometry (eMethods in Supplement 1 ) to uncover other potential neural correlates of test performance. 31 , 32 Finally, we evaluated the association of the number of distractions and operating system with reliability and validity, controlling for age and disease severity, which were associated with test performance in correlation analyses.
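
Again as an illustration only (variable names are ours, not the study's), the core of these validity checks reduces to correlations between app scores and criterion measures, plus a regression re-checking an association while adjusting for age and disease severity.

```python
# Illustrative sketch of the construct-validity checks: correlations of the
# app composite with criterion measures, and a regression re-checking an
# association while adjusting for age and disease severity. Names assumed.
import pandas as pd
import statsmodels.formula.api as smf

CRITERIA = ["cdr_sob", "uds3_ef", "frontoparietal_vol", "hippocampal_vol"]

def validity_checks(df: pd.DataFrame):
    corrs = df[["app_composite"] + CRITERIA].corr().loc["app_composite", CRITERIA]
    adjusted = smf.ols("app_composite ~ distractions + operating_system"
                       " + age + cdr_sob", data=df).fit()
    return corrs, adjusted.params
```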

To evaluate the app’s ability to select participants with prodromal or symptomatic FTLD for trial enrollment, we tested discrimination of participants without symptoms from those with prodromal and symptomatic FTLD. To understand the app’s utility for screening early cognitive impairment, we fit receiver operating characteristic curves testing the predictive value of the app composite, UDS3-EF, and MoCA for differentiating participants without symptoms and those with preclinical FTLD from those with prodromal FTLD; areas under the curve (AUCs) for the app and MoCA were compared using the DeLong test in participants with results for both predictive factors.
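
The paper compares AUCs with the DeLong test; the sketch below substitutes a simpler paired bootstrap for that comparison and assumes scores have been oriented so that higher values indicate greater likelihood of prodromal FTLD (eg, the negated cognitive composites), since worse cognition predicts impairment.

```python
# Sketch of the screening analysis with scikit-learn. The paper uses the
# DeLong test to compare AUCs; this sketch substitutes a paired bootstrap.
# Scores are assumed to be oriented so higher = more likely prodromal.
import numpy as np
from sklearn.metrics import roc_auc_score

def compare_aucs(y, score_a, score_b, n_boot=1000, seed=0):
    """y: 1 = prodromal, 0 = asymptomatic/preclinical."""
    y, score_a, score_b = map(np.asarray, (y, score_a, score_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        if len(np.unique(y[idx])) < 2:  # need both classes in a resample
            continue
        diffs.append(roc_auc_score(y[idx], score_a[idx])
                     - roc_auc_score(y[idx], score_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return roc_auc_score(y, score_a), roc_auc_score(y, score_b), (lo, hi)
```

If the bootstrap CI for the AUC difference excludes 0, the two predictors differ in discrimination under this rough substitute for the DeLong comparison.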

We compared app performance in preclinical participants who carried pathogenic variants with that in noncarrier controls using linear regression adjusted for age (a predictive factor in earlier models). For this analysis, we excluded those younger than 45 years to remove participants likely to be years from symptom onset based on natural history studies. 4 We analyzed memory performance in participants who carried MAPT pathogenic variants, as early executive deficits may be less prominent. 34 , 36

Of 1163 eligible participants, 360 were enrolled, 439 were excluded, and 364 refused to participate (additional details are provided in the eResults in Supplement 1 ). Participant characteristics are reported in Table 1 for the full sample. The discovery and validation cohorts did not significantly differ in terms of demographic characteristics, disease severity, or cognition (eTable 1 in Supplement 1 ). In the full sample, there were 209 women (58.1%) and 151 men (41.9%), and the mean (SD) age was 54.0 (15.4) years (range, 18-89 years). The mean (SD) educational level was 16.5 (2.3) years (range, 12-20 years). Among the 358 participants with racial and ethnic data available, 340 (95.0%) identified as White. For the 18 participants self-identifying as being of other race or ethnicity, the specific group was not provided to protect participant anonymity. Among the 329 participants with available CDR plus NACC-FTLD scores ( Table 1 ), 195 (59.3%) were asymptomatic or preclinical (global score, 0), 66 (20.1%) were prodromal (global score, 0.5), and 68 (20.7%) were symptomatic (global score, 1.0 or 2.0). Of those with available genetic testing results (n = 222), 100 (45.0%) carried a familial FTLD pathogenic variant, including 63 of 120 participants without symptoms and with available results. On average, participants completed 78% of available smartphone measures over a mean (SD) of 2.6 (0.6) sessions.

Descriptive statistics for each task are presented in Table 2 . Ceiling effects were not observed for any tests. A small percentage of participants were at the floor for flanker (19 [5.3%]), go/no-go (13 [4.0%]), and card sort (9 [3.3%]) scores. Floor effects were only observed in participants with prodromal or symptomatic FTLD.

Except for go/no-go, internal consistency estimates ranged from good to excellent (Cronbach α range, 0.84 [95% CI, 0.81-0.87] to 0.99 [95% CI, 0.99-0.99]), and test-retest reliabilities were moderate to excellent (intraclass correlation coefficient [ICC] range, 0.77 [95% CI, 0.69-0.83] to 0.95 [95% CI, 0.93-0.96]), with slightly higher estimates in participants with prodromal or symptomatic FTLD ( Table 2 , Figure 2 , and eFigure 1 in Supplement 1 ). Go/no-go reliability was particularly poor in participants without symptoms (ICC, 0.10 [95% CI, −0.37 to 0.48]) and was removed from subsequent validation analyses except the correlation matrix ( Figure 3 A and B). The 95% CIs for reliability estimates overlapped in the discovery and validation cohorts ( Figure 2 ). Reliability estimates showed overlapping 95% CIs regardless of distractions (eFigure 2 in Supplement 1 ) or operating systems (eFigure 3 in Supplement 1 ), with a pattern of slightly lower reliability estimates when distractions were endorsed for all comparisons except Stroop (Cronbach α).

In 57 participants without symptoms who did not carry pathogenic variants, older age was associated with worse performance on all measures (β range, −0.40 [95% CI, −0.68 to −0.13] to −0.78 [95% CI, −0.89 to −0.52]; P ≤ .03), except card sort (β = −0.22 [95% CI, −0.54 to 0.09]; P = .16) and go/no-go (β = −0.15 [95% CI, −0.44 to 0.14]; P = .31), though associations were in the expected direction. Associations with sex and educational level were not statistically significant.

Cognitive tests administered using the app showed evidence of convergent and divergent validity (eFigure 4 in Supplement 1 ), with very similar findings in the discovery ( Figure 3 A) and validation ( Figure 3 B) cohorts. App-based measures of executive functioning correlated more strongly with criterion standard in-person measures of these domains than with measures of other cognitive domains ( r range, 0.40-0.66). For example, the flanker task was most strongly associated with the UDS3-EF composite (β = 0.58 [95% CI, 0.48-0.68]; P < .001), with weaker associations with measures of visuoconstruction (β for Benson Figure Copy, 0.43 [95% CI, 0.32-0.54]; P = .01) and naming (β for Multilingual Naming Test, 0.25 [95% CI, 0.14-0.37]; P < .001). The app memory test was associated with criterion standard memory and executive functioning tests.

Worse performance on all app measures was associated with greater disease severity on CDR plus NACC-FTLD ( r range, 0.38-0.59) ( Table 1 , Figure 3 , and eFigure 4 in Supplement 1 ). The same pattern of results was observed after excluding those with finger dexterity issues. Except for go/no-go, performance of participants with prodromal FTLD was statistically significantly worse than that of participants without symptoms on all measures ( P  < .001).

The AUC for the app composite to distinguish participants without symptoms from those with dementia was 0.93 (95% CI, 0.90-0.96). The app also accurately differentiated participants without symptoms from those with prodromal or symptomatic FTLD (AUC, 0.87 [95% CI, 0.84-0.92]). Compared with the MoCA (AUC, 0.68 [95% CI, 0.59-0.78]), the app composite (AUC, 0.82 [95% CI, 0.76-0.88]) more accurately differentiated participants without symptoms from those with prodromal FTLD ( z of comparison, −2.49 [95% CI, −0.19 to −0.02]; P = .01), with similar accuracy to the UDS3-EF (AUC, 0.81 [95% CI, 0.73-0.88]); highly similar results (eTable 2 in Supplement 1 ) were observed in the discovery ( Figure 3 C) and validation ( Figure 3 D) cohorts.

In 56 participants without symptoms who were older than 45 years, those carrying GRN , C9orf72 , or other rare pathogenic variants performed significantly worse on 3 of 4 executive tests compared with noncarrier controls, including flanker (β = −0.26 [95% CI, −0.46 to −0.05]; P = .02), card sort (β = −0.28 [95% CI, −0.54 to −0.30]; P = .03), and 2-back (β = −0.49 [95% CI, −0.72 to −0.25]; P < .001). The estimated scores of participants who carried pathogenic variants were on average lower than those of noncarriers on a composite of criterion standard in-person tests, but the difference was not statistically significant (UDS3-EF β = −0.14 [95% CI, −0.42 to 0.14]; P = .32). Participants who carried preclinical MAPT pathogenic variants scored higher than noncarriers on the app memory test, though the difference was not statistically significant (β = 0.21 [95% CI, −0.50 to 0.58]; P = .19).

In prespecified ROI analyses, worse app executive functioning scores were associated with lower frontoparietal and/or subcortical volume ( Figure 3 A and B) (β range, 0.34 [95% CI, 0.22-0.46] to 0.50 [95% CI, 0.40-0.60]; P < .001 for all) and worse memory scores with smaller hippocampal volume (β = 0.45 [95% CI, 0.34-0.56]; P < .001). Voxel-based morphometry (eFigure 5 in Supplement 1 ) suggested worse app performance was associated with widespread atrophy, particularly in frontotemporal cortices.

Only for card sort were distractions (eTables 3 and 4 in Supplement 1 ) associated with task performance; those experiencing distractions unexpectedly performed better (β = 0.16 [95% CI, 0.05-0.28]; P  = .005). The iPhone operating system was associated with better performance on 2 speeded tasks: flanker (β = 0.16 [95% CI, 0.07-0.24]; P  < .001) and go/no-go (β = 0.16 [95% CI, 0.06-0.26]; P  = .002). In a sensitivity analysis, associations of all app tests with disease severity, UDS3-EF, and regional brain volumes remained after covarying for distractions and operating system, as did the models differentiating participants who carried preclinical pathogenic variants and noncarrier controls.

There is an urgent need to identify reliable and valid digital tools for remote neurobehavioral measurement in neurodegenerative diseases, including FTLD. Prior studies provided preliminary evidence that smartphones collect reliable and valid cognitive data in a variety of age-related and neurodegenerative illnesses. This is the first study, to our knowledge, to provide analogous support for the reliability and validity of remote cognitive testing via smartphones in FTLD and preliminary evidence that this approach improves early detection relative to traditional in-person measures.

Reliability, a prerequisite for a valid clinical trial end point, indicates measurements are consistent. In 2 cohorts, we found smartphone cognitive tests were reliable within a single administration (ie, internally consistent) and across repeated assessments (ie, test-retest reliability) with no apparent differences by operating system. For all measures except go/no-go, reliability estimates were moderate to excellent and on par with other remote digital assessments 5 , 6 , 10 , 37 , 38 and in-clinic criterion standards. 39 - 41 Go/no-go showed similar within- and between-person variability in participants without symptoms (ie, poor reliability), and participant feedback suggested instructions were confusing and the stimuli disappeared too quickly. Those endorsing distractions tended to have lower reliability, though 95% CIs largely overlapped; future research detailing the effect of the home environment on test performance is warranted.

Construct validity was supported by strong associations of smartphone tests with demographics, disease severity, neuroimaging, and criterion standard neuropsychological measures that replicated in a validation sample. These associations were similar to those observed among the criterion standard measures and similar to associations reported in other validation studies of smartphone cognitive tests. 5 , 6 , 10 Associations with disease severity were not explained by motor impairments. The iPhone operating system was associated with better performance on 2 time-based measures, consistent with prior findings. 6

A composite of brief smartphone tests was accurate in distinguishing dementia from cognitively unimpaired participants, screening out participants without symptoms, and detecting prodromal FTLD with greater sensitivity than the MoCA. Moreover, carriers of preclinical C9orf72 and GRN pathogenic variants performed significantly worse than noncarrier controls on 3 tests, whereas they did not significantly differ on criterion standard measures. These findings are consistent with previous studies showing digital executive functioning paradigms may be more sensitive to early FTLD than traditional measures. 42 , 43

This study has some limitations. Validation analyses focused on participants’ initial task exposure. Future studies will explore whether repeated measurements and more sophisticated approaches to composite building (current composite assumes equal weighting of tests) improve reliability and sensitivity, and a normative sample is being collected to better adjust for demographic effects on testing. 24 Longitudinal analyses will explore whether the floor effects in participants with symptomatic FTLD will affect the utility for monitoring. The generalizability of the findings is limited by the study cohort, which comprised participants who were college educated on average, mostly White, and primarily English speakers who owned smartphones and participated in the referring in-person research study. Equity in access to research is a priority in FTLD research 44 , 45 ; translations of the ALLFTD-mApp are in progress, cultural adaptations are being considered, and devices have been purchased for provisioning to improve the diversity of our sample.

The findings of this cohort study, coupled with prior reports indicating that smartphone testing is feasible and acceptable to patients with FTLD, 11 suggest that smartphones may complement traditional in-person research paradigms. More broadly, the scalability, ease of use, reliability, and validity of the ALLFTD-mApp suggest the feasibility and utility of remote digital assessments in dementia clinical trials. Future research should validate these results in diverse populations and evaluate the utility of these tests for longitudinal monitoring.

Accepted for Publication: February 2, 2024.

Published: April 1, 2024. doi:10.1001/jamanetworkopen.2024.4266

Open Access: This is an open access article distributed under the terms of the CC-BY License . © 2024 Staffaroni AM et al. JAMA Network Open .

Corresponding Author: Adam M. Staffaroni, PhD, Weill Institute for Neurosciences, Department of Neurology, Memory and Aging Center, University of California, San Francisco, 675 Nelson Rising Ln, Ste 190, San Francisco, CA 94158 ( [email protected] ).

Author Contributions: Dr Staffaroni had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Staffaroni, A. Clark, Taylor, Heuer, Wise, Forsberg, Miller, Hassenstab, Rosen, Boxer.

Acquisition, analysis, or interpretation of data: Staffaroni, A. Clark, Taylor, Heuer, Sanderson-Cimino, Wise, Dhanam, Cobigo, Wolf, Manoochehri, Mester, Rankin, Appleby, Bayram, Bozoki, D. Clark, Darby, Domoto-Reilly, Fields, Galasko, Geschwind, Ghoshal, Graff-Radford, Hsiung, Huey, Jones, Lapid, Litvan, Masdeu, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Ramos, Rascovsky, Roberson, Tartaglia, Wong, Kornak, Kremers, Kramer, Boeve, Boxer.

Drafting of the manuscript: Staffaroni, A. Clark, Taylor, Heuer, Wolf, Lapid.

Critical review of the manuscript for important intellectual content: Staffaroni, Taylor, Heuer, Sanderson-Cimino, Wise, Dhanam, Cobigo, Manoochehri, Forsberg, Mester, Rankin, Appleby, Bayram, Bozoki, D. Clark, Darby, Domoto-Reilly, Fields, Galasko, Geschwind, Ghoshal, Graff-Radford, Hsiung, Huey, Jones, Lapid, Litvan, Masdeu, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Ramos, Rascovsky, Roberson, Tartaglia, Wong, Miller, Kornak, Kremers, Hassenstab, Kramer, Boeve, Rosen, Boxer.

Statistical analysis: Staffaroni, A. Clark, Taylor, Heuer, Sanderson-Cimino, Cobigo, Kornak, Kremers.

Obtained funding: Staffaroni, Rosen, Boxer.

Administrative, technical, or material support: A. Clark, Taylor, Heuer, Wise, Dhanam, Wolf, Manoochehri, Forsberg, Darby, Domoto-Reilly, Ghoshal, Hsiung, Huey, Jones, Litvan, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Kramer, Boeve, Boxer.

Supervision: Geschwind, Miyagawa, Roberson, Kramer, Boxer.

Conflict of Interest Disclosures: Dr Staffaroni reported being a coinventor of 4 ALLFTD mobile application tasks (not analyzed in the present study) and receiving licensing fees from Datacubed Health; receiving research support from the National Institute on Aging (NIA) of the National Institutes of Health (NIH), Bluefield Project to Cure FTD, the Alzheimer’s Association, the Larry L. Hillblom Foundation, and the Rainwater Charitable Foundation; and consulting for Alector Inc, Eli Lilly and Company/Prevail Therapeutics, Passage Bio Inc, and Takeda Pharmaceutical Company. Dr Forsberg reported receiving research support from the NIH. Dr Rankin reported receiving research support from the NIH and the National Science Foundation and serving on the medical advisory board for Eli Lilly and Company. Dr Appleby reported receiving research support from the Centers for Disease Control and Prevention (CDC), the NIH, Ionis Pharmaceuticals Inc, Alector Inc, and the CJD Foundation and consulting for Acadia Pharmaceuticals Inc, Ionis Pharmaceuticals Inc, and Sangamo Therapeutics Inc. Dr Bayram reported receiving research support from the NIH. Dr Domoto-Reilly reported receiving research support from NIH and serving as an investigator for a clinical trial sponsored by Lawson Health Research Institute. Dr Bozoki reported receiving research funding from the NIH, Alector Inc, Cognition Therapeutics Inc, EIP Pharma, and Transposon Therapeutics Inc; consulting for Eisai and Creative Bio-Peptides Inc; and serving on the data safety monitoring board for AviadoBio. Dr Fields reported receiving research support from the NIH. Dr Galasko reported receiving research funding from the NIH; clinical trial funding from Alector Inc and Eisai; consulting for Eisai, General Electric Health Care, and Fujirebio; and serving on the data safety monitoring board of Cyclo Therapeutics Inc. Dr Geschwind reported consulting for Biogen Inc and receiving research support from Roche and Takeda Pharmaceutical Company for work in dementia. Dr Ghoshal reported participating in clinical trials of antidementia drugs sponsored by Bristol Myers Squibb, Eli Lilly and Company/Avid Radiopharmaceuticals, Janssen Immunotherapy, Novartis AG, Pfizer Inc, Wyeth Pharmaceuticals, SNIFF (The Study of Nasal Insulin to Fight Forgetfulness) study, and A4 (The Anti-Amyloid Treatment in Asymptomatic Alzheimer’s Disease) trial; receiving research support from Tau Consortium and the Association for Frontotemporal Dementia; and receiving funding from the NIH. Dr Graff-Radford reported receiving royalties from UpToDate; participating in multicenter therapy studies sponsored by Biogen Inc, TauRx Therapeutics Ltd, AbbVie Inc, Novartis AG, and Eli Lilly and Company; and receiving research support from the NIH. Dr Grossman reported receiving grant support from the NIH, Avid Radiopharmaceuticals, and Piramal Pharma Ltd; participating in clinical trials sponsored by Biogen Inc, TauRx Therapeutics Ltd, and Alector Inc; consulting for Bracco and UCB; and serving on the editorial board of Neurology . Dr Hsiung reported receiving grant support from the Canadian Institutes of Health Research, the NIH, and the Alzheimer Society of British Columbia; participating in clinical trials sponsored by Anavax Life Sciences Corp, Biogen Inc, Cassava Sciences, Eli Lilly and Company, and Roche; and consulting for Biogen Inc, Novo Nordisk A/S, and Roche. Dr Huey reported receiving research support from the NIH. Dr Jones reported receiving research support from the NIH.
Dr Litvan reported receiving research support from the NIH, the Michael J Fox Foundation, the Parkinson Foundation, the Lewy Body Association, CurePSP, Roche, AbbVie Inc, H Lundbeck A/S, Novartis AG, Transposon Therapeutics Inc, and UCB; serving as a member of the scientific advisory board for the Rossy PSP Program at the University of Toronto and for Amydis; and serving as chief editor of Frontiers in Neurology . Dr Masdeu reported consulting for and receiving research funding from Eli Lilly and Company; receiving personal fees from GE Healthcare; receiving grant funding and personal fees from Eli Lilly and Company; and receiving grant funding from Acadia Pharmaceutical Inc, Avanir Pharmaceuticals Inc, Biogen Inc, Eisai, Janssen Global Services LLC, the NIH, and Novartis AG outside the submitted work. Dr Mendez reported receiving research support from the NIH. Dr Miyagawa reported receiving research support from the Zander Family Foundation. Dr Pascual reported receiving research support from the NIH. Dr Pressman reported receiving research support from the NIH. Dr Ramos reported receiving research support from the NIH. Dr Roberson reported receiving research support from the NIA of the NIH, the Bluefield Project, and the Alzheimer’s Drug Discovery Foundation; serving on a data monitoring committee for Eli Lilly and Company; receiving licensing fees from Genentech Inc; and consulting for Applied Genetic Technologies Corp. Dr Tartaglia reported serving as an investigator for clinical trials sponsored by Biogen Inc, Avanex Corp, Green Valley, Roche/Genentech Inc, Bristol Myers Squibb, Eli Lilly and Company/Avid Radiopharmaceuticals, and Janssen Global Services LLC and receiving research support from the Canadian Institutes of Health Research (CIHR). Dr Wong reported receiving research support from the NIH. Dr Kornak reported providing expert witness testimony for Teva Pharmaceuticals Industries Ltd, Apotex Inc, and Puma Biotechnology and receiving research support from the NIH. Dr Kremers reported receiving research funding from NIH. Dr Kramer reported receiving research support from the NIH and royalties from Pearson Inc. Dr Boeve reported serving as an investigator for clinical trials sponsored by Alector Inc, Biogen Inc, and Transposon Therapeutics Inc; receiving royalties from Cambridge Medicine; serving on the Scientific Advisory Board of the Tau Consortium; and receiving research support from NIH, the Mayo Clinic Dorothy and Harry T. Mangurian Jr. Lewy Body Dementia Program, and the Little Family Foundation. Dr Rosen reported receiving research support from Biogen Inc, consulting for Wave Neuroscience and Ionis Pharmaceuticals, and receiving research support from the NIH. 
Dr Boxer reported being a coinventor of 4 of the ALLFTD mobile application tasks (not the focus of the present study) and previously receiving licensing fees; receiving research support from the NIH, the Tau Research Consortium, the Association for Frontotemporal Degeneration, Bluefield Project to Cure Frontotemporal Dementia, Corticobasal Degeneration Solutions, the Alzheimer’s Drug Discovery Foundation, and the Alzheimer’s Association; consulting for Aeovian Pharmaceuticals Inc, Applied Genetic Technologies Corp, Alector Inc, Arkuda Therapeutics, Arvinas Inc, AviadoBio, Boehringer Ingelheim, Denali Therapeutics Inc, GSK, Life Edit Therapeutics Inc, Humana Inc, Oligomerix, Oscotec Inc, Roche, Transposon Therapeutics Inc, TrueBinding Inc, and Wave Life Sciences; and receiving research support from Biogen Inc, Eisai, and Regeneron Pharmaceuticals Inc. No other disclosures were reported.

Funding/Support: This work was supported by grants AG063911, AG077557, AG62677, AG045390, NS092089, AG032306, AG016976, AG058233, AG038791, AG02350, AG019724, AG062422, NS050915, AG032289-11, AG077557, K23AG061253, and K24AG045333 from the NIH; the Association for Frontotemporal Degeneration; the Bluefield Project to Cure FTD; the Rainwater Charitable Foundation; and grant 2014-A-004-NET from the Larry L. Hillblom Foundation. Samples from the National Centralized Repository for Alzheimer’s Disease and Related Dementias, which receives government support under cooperative agreement grant U24 AG21886 from the NIA, were used in this study.

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Group Information: A complete list of the members of the ALLFTD Consortium appears in Supplement 2 .

Data Sharing Statement: See Supplement 3 .

Additional Contributions: We thank the participants and study partners for dedicating their time and effort, and for providing invaluable feedback as we learn how to incorporate digital technologies into FTLD research.

Additional Information: Dr Grossman passed away on April 4, 2023. We want to acknowledge his many contributions to this study, including data acquisition, and design and conduct of the study. He was an ALLFTD site principal investigator and contributed during the development of the ALLFTD mobile app.
