Science News Explores
When a study can’t be replicated.
Many factors can prevent scientists from repeating research and confirming results
Sometimes the findings of research that was done well can’t be replicated — confirmed by other scientists. The reasons may vary or never be fully understood, new studies find.
By Janet Raloff
September 11, 2015 at 6:00 am
In the world of science, the gold standard for accepting a finding is seeing it “replicated.” To achieve this, researchers must repeat a study and find the same conclusion. Doing so helps confirm that the original finding wasn’t a fluke — one due to chance.
Yet try as they might, many research teams cannot replicate, or match, an original study’s results. Sometimes that occurs because the original scientists faked the study. Indeed, a 2012 study looked at more than 2,000 published papers that had to be retracted — eventually labeled by the publisher as too untrustworthy to believe. Of these, more than 65 percent involved cases of misconduct, including fraud.
But even when research teams act honorably, their studies may still prove hard to replicate, a new study finds. Yet a second new analysis shows how important it is to try to replicate studies. It also shows what researchers can learn from the mistakes of others.
The first study focused on 100 human studies in the field of psychology. That field examines how animals or people respond to certain conditions and why. The second study looked at 38 research papers reporting possible explanations for global warming. The papers presented explanations for global warming that run contrary to those of the vast majority of the world’s climate scientists.
Both new studies set out to replicate the earlier ones. Both had great trouble doing so. Yet neither found evidence of fraud. These studies point to how challenging it can be to replicate research. Yet without that replication, the research community may find it hard to trust a study’s data or know how to interpret what those data mean.
Trying to make sense of the numbers
Brian Nosek led the first new study. He is a psychologist at the University of Virginia in Charlottesville. His research team recruited 270 scientists. Their mission: to reproduce the findings of 100 previously published studies. All of the studies had appeared in one of three major psychology journals in 2008. In the end, only 35 of the studies could be replicated by this group. The researchers described their efforts in the August 28 issue of Science.
Two types of findings proved hardest to confirm. The first were those that originally had been described as unexpected. The second were ones that had barely achieved statistical significance. That raises concerns, Nosek told Science News, about the common practice of publishing attention-grabbing results. Many of those types of findings appear to have come from data that had not been statistically strong. Such studies may have included too few individuals. Or they may have turned up only weak signs of an effect. There is a greater likelihood that such findings are the result of random chance.
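To see why weak statistical evidence and failed replication go hand in hand, consider a small simulation. The sketch below is purely illustrative: the effect size, sample size, and thresholds are assumptions chosen for the demonstration, not numbers from the studies described here. It simulates many small two-group studies of a weak effect and asks how often a study that just barely reaches statistical significance would be confirmed by an identical repeat.

```python
# Illustrative simulation only: the effect size (0.2 standard deviations),
# group size (25) and thresholds are assumptions, not data from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def run_study(true_effect, n_per_group):
    """Simulate one two-group study and return its p-value."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    return stats.ttest_ind(treated, control).pvalue

true_effect = 0.2          # a weak true effect
n_per_group = 25           # a small study
n_simulations = 20_000

barely_significant = 0     # originals with 0.01 < p < 0.05
replicated = 0             # repeats of those originals that reach p < 0.05

for _ in range(n_simulations):
    p_original = run_study(true_effect, n_per_group)
    if 0.01 < p_original < 0.05:
        barely_significant += 1
        if run_study(true_effect, n_per_group) < 0.05:
            replicated += 1

print(f"Barely significant original studies: {barely_significant}")
print(f"Share confirmed by an identical repeat: {replicated / barely_significant:.0%}")
```

In runs like this, only a small minority of the barely significant originals are confirmed by the repeat, even though neither study was done badly, which is exactly the pattern described above.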
No one can say why the tests by Nosek’s team failed to confirm findings in 65 percent of their tries. It’s possible the initial studies were not done well. But even if they had been done well, conflicting conclusions raise doubts about the original findings. For instance, they may not be applicable to groups other than the ones initially tested.
Rasmus Benestad works at the Norwegian Meteorological Institute in Oslo. He led the second new study. It focused on climate research.
In climate science, some 97 percent of reports and scientists have come to a similar conclusion: that human activities, mostly the burning of fossil fuels, are a major driver of recent global warming. The 97 percent figure came from the United Nations’ Intergovernmental Panel on Climate Change. This is a group of researchers active in climate science. The group reviewed nearly 12,000 abstracts of published research findings. It also received some 1,200 ratings by climate scientists of what the published data and analyses had concluded about climate change. Nearly all pointed to the same source: us.
But what about the other 3 percent? Was there something different about those studies? Or could there be something different about the scientists who felt that humans did not play a big role in global warming? That’s what this new study sought to probe. It took a close look at 38 of these “contrarian” papers.
Benestad’s team attempted to replicate the original analyses in these papers. In doing so, the team pored over the details of each study. Along the way, they identified several common problems. Many started with false assumptions, the new analysis says. Some used a faulty analysis. Others set up an improper hypothesis for testing. Still others used “incorrect statistics” for making their analyses, Benestad’s group reports. Several papers also set up a false either/or situation. They had argued that if one thing influenced global warming, then the other must not have. In fact, Benestad’s group noted, that logic was sometimes faulty. In many cases, both explanations for global warming might work together.
Mistakes or an incomplete understanding of previous work by others could lead to faulty assessments, Benestad’s team concluded. Its new analysis appeared August 20 in Theoretical and Applied Climatology.
What to make of this?
It might seem like it should be easy to copy a study and come up with similar findings. As the two new studies show, it’s not. And there can be a host of reasons why.
Some investigators have concluded that it may be next to impossible to redo a study exactly. This can be true especially when a study works with subjects or materials that vary greatly. Cells, animals and people are all things that have a lot of variation. Due to genetic or developmental differences, one cell or individual may respond differently to stimuli than another will. Stimuli might include foods, drugs, infectious germs or some other aspect of the environment.
Similarly, some studies involve conditions that are quite complicated. Examples can include the weather or how crowds of people behave. Consider climate studies. Computers are not yet big enough and fast enough to account for everything that affects climate, scientists note. Many of these factors will vary broadly over time and distance. So climate scientists choose to analyze the conditions that seem the most important. They may concentrate on those for which they have the best or the most data. If the next group of researchers uses a different set of data, their findings may not match the earlier ones.
Eventually, time and more data may show why the findings of an original study and a repeated one differ. One of the studies may be found weak or somewhat flawed. Perhaps both will be.
This points to what can make advancing science so challenging. “Science is never settled, and both the scientific consensus and alternative hypotheses should be subject to ongoing questioning,” Benestad’s group argues.
Researchers should try to prove or disprove even those things that have been considered common knowledge, they add. Resolving differences in an understanding of science and data is essential, they argue. That is true in climate science, psychology and every other field. After all, without a good understanding of science, they say, society won’t be able to make sound decisions on how to create a safer, healthier and more sustainable world.
National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. Washington (DC): National Academies Press (US); 2019 May 7.
7 Confidence in Science
The committee was asked to “draw conclusions and make recommendations for improving rigor and transparency in scientific and engineering research.” Certainly, reproducibility and replicability play an important role in achieving rigor and transparency, and for some lines of scientific inquiry, replication is one way to gain confidence in scientific knowledge. For other lines of inquiry, however, direct replications may be impossible due to the characteristics of the phenomena being studied. The robustness of science is less well represented by the replications between two individual studies than by a more holistic web of knowledge reinforced through multiple lines of examination and inquiry. In this chapter, the committee illustrates a spectrum of pathways to attain rigor and confidence in scientific knowledge, beginning with an overview of research synthesis and meta-analysis, and then citing illustrative approaches and perspectives from geoscience, genetics, psychology, and big data in social sciences. The chapter concludes with a consideration of public understanding and confidence in science.
When results are computationally reproduced or replicated, confidence in robustness of the knowledge derived from that particular study is increased. However, reproducibility and replicability are focused on the comparison between individual studies. By looking more broadly and using other techniques to gain confidence in results, multiple pathways can be found to consistently support certain scientific concepts and theories while rejecting others. Research synthesis is a widely accepted and practiced method for gauging the reliability and validity of bodies of research, although like all research methods, it can be used in ways that are more or less valid ( de Vrieze, 2018 ). The common principles of science—gathering evidence, developing theories and/or hypotheses, and applying logic—allow us to explore and predict systems that are inherently non-replicable. We use several of these systems below to highlight how scientists gain confidence when direct assessments of reproducibility or replicability are not feasible.
RESEARCH SYNTHESIS
As we note throughout this report, studies purporting to investigate similar scientific questions can produce inconsistent or contradictory results. Research synthesis addresses the central question of how the results of studies relate to each other, what factors may be contributing to variability across studies, and how study results coalesce or not in developing the knowledge network for a particular science domain. In current use, the term research synthesis describes the ensemble of research activities involved in identifying, retrieving, evaluating, synthesizing, interpreting, and contextualizing the available evidence from studies on a particular topic and comprises both systematic reviews and meta-analyses. For example, a research synthesis may classify studies based on some feature and then test whether the effect size is larger for studies with or without the feature compared with the other studies. The term meta-analysis is reserved for the quantitative analysis conducted as part of research synthesis.
Although the terms used to describe research synthesis vary, the practice is widely used, in fields ranging from medicine to physics. In medicine, Cochrane reviews are systematic reviews that are performed by a body of experts who examine and synthesize the results of medical research. 1 These reviews provide an overview of the best available evidence on a wide variety of topics, and they are updated periodically as needed. In physics, the Task Group on Fundamental Constants performs research syntheses as part of its task to adjust the values of the fundamental constants of physics. The task group compares new results to each other and to the current estimated value, and uses this information to calculate an adjusted value ( Mohr et al., 2016 ). The exact procedure for research synthesis varies by field and by the scientific question at hand; the following is a general description of the approach.
Research synthesis begins with formal definitions of the scientific issues and the scope of the investigation and proceeds to search for published and unpublished sources of potentially relevant information (e.g., study results). The ensemble of studies identified by the search is evaluated for relevance to the central scientific question, and the resulting subset of studies undergoes review for methodological quality, typically using explicit criteria and the assignment of quality scores. The next step is the extraction of qualitative and quantitative information from the selected studies. The former includes study-level characteristics of design and study processes; the latter includes quantitative results, such as study-level estimates of effects and variability overall as well as by subsets of study participants or units or individual-level data on study participants or units ( Institute of Medicine, 2011 , Chapter 4 ).
Using summary statistics or individual-level data, meta-analysis provides estimates of overall central tendencies, effect sizes, or association magnitudes, along with estimates of the variance or uncertainty in those estimates. For example, the meta-analysis of the comparative efficacy of two treatments for a particular condition can provide estimates of an overall effect in the target clinical population. Replicability of an effect is reflected in the consistency of effect sizes across the studies, especially when a variety of methods, each with different weaknesses, converge on the same conclusion. As a tool for testing whether patterns of results across studies are anomalous, meta-analyses have, for example, suggested that well-accepted results in a scientific field are or could plausibly be largely due to publication bias.
Meta-analyses also test for variation in effect sizes and, as a result, can suggest potential causes of non-replicability in existing research. Meta-analyses can quantify the extent to which results appear to vary from study to study solely due to random sampling variation or to varying in a systematic way by subgroups (including sociodemographic, clinical, genetic, and other subject characteristics), as well as by characteristics of the individual studies (such as important aspects of the design of studies, the treatments used, and the time period and context in which studies were conducted). Of course, these features of the original studies need to be described sufficiently to be retrieved from the research reports.
For example, a meta-analytic aggregation across 200 meta-analyses published in the top journal for reviews in psychology, Psychological Bulletin, showed that only 8 percent of studies had adequate statistical power; variation across studies testing the same hypothesis was very high, with 74 percent of variation due to unexplained heterogeneity; and reporting bias overall was low ( Stanley et al., 2018 ).
In social psychology, Malle (2006) conducted a meta-analysis of studies comparing how actors explain their own behavior with how observers explain it and identified an unrecognized confounder—the positivity of the behavior. In studies that tested positive behaviors, actors took credit for the action and attributed it more to themselves than did observers. In studies that tested negative behaviors, actors justified the behavior and viewed it as due to the situation they were in more than did observers. Similarly, meta-analyses have often shown that the association of obesity with various outcomes (e.g., dementia) depends on the age in life at which the obesity is considered.
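The quantitative core of a meta-analysis can be sketched in a few lines. The effect sizes and standard errors below are invented for illustration, and the DerSimonian-Laird random-effects estimator shown is only one common choice among several; the point is how a pooled estimate and heterogeneity measures (Cochran's Q, tau-squared, I-squared) are obtained from study-level summaries.

```python
# Illustrative only: the five effect sizes and standard errors are invented.
import numpy as np

effects = np.array([0.30, 0.45, 0.10, 0.55, 0.25])   # hypothetical study estimates
se = np.array([0.12, 0.20, 0.15, 0.25, 0.10])        # their standard errors

w_fixed = 1.0 / se**2                                  # inverse-variance weights
pooled_fixed = np.sum(w_fixed * effects) / np.sum(w_fixed)

# Cochran's Q measures variation across studies beyond sampling error.
Q = np.sum(w_fixed * (effects - pooled_fixed) ** 2)
df = len(effects) - 1

# DerSimonian-Laird estimate of between-study variance (tau^2).
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - df) / c)

# I^2: share of total variation attributable to between-study heterogeneity.
i2 = max(0.0, (Q - df) / Q) if Q > 0 else 0.0

# Random-effects pooled estimate and its standard error.
w_random = 1.0 / (se**2 + tau2)
pooled_random = np.sum(w_random * effects) / np.sum(w_random)
se_random = np.sqrt(1.0 / np.sum(w_random))

print(f"Fixed-effect pooled estimate:   {pooled_fixed:.3f}")
print(f"Random-effects pooled estimate: {pooled_random:.3f} (SE {se_random:.3f})")
print(f"Between-study variance tau^2:   {tau2:.3f}, I^2 = {100 * i2:.0f}%")
```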
Systematic reviews and meta-analyses are typically conducted as retrospective investigations, in the sense that they search and evaluate the evidence from studies that have been conducted. Systematic reviews and meta-analyses are susceptible to biased datasets, for example, if the scientific literature on which a systematic review or a meta-analysis is biased due to publication bias of positive results. However, the potential for a prospective formulation of evidence synthesis is clear and is beginning to transform the landscape. Some research teams are beginning to monitor the scientific literature on a particular topic and conduct periodic updates of systematic reviews on the topic. 2 Prospective research synthesis may offer a partial solution to the challenge of biased datasets.
Meta-research is a new field that involves evaluating and improving the practice of research. Meta-research encompasses and goes beyond meta-analysis. As Ioannidis et al. (2015) aptly argued, meta-research can go beyond single substantive questions to examine factors that affect rigor, reproducibility, replicability, and, ultimately, the truth of research results across many topics.
CONCLUSION 7-1: Further development in and the use of meta-research would facilitate learning from scientific studies. These developments would include the study of research practices such as research on the quality and effects of peer review of journal manuscripts or grant proposals, research on the effectiveness and side effects of proposed research practices, and research on the variation in reproducibility and replicability between fields or over time.
GEOSCIENCE
What distinguishes geoscience from much of chemistry, biology, and physics is its focus on phenomena that emerge out of uncontrolled natural environments, as well as its special concern with understanding past events documented in the geologic record. Emergent phenomena on a global scale include climate variations at Earth's surface, tectonic motions of its lithospheric plates, and the magnetic field generated in its iron-rich core. The geosystems responsible for these phenomena have been active for billions of years, and the geologic record indicates that many of the terrestrial processes in the distant geologic past were similar to those that are occurring today. Geoscientists seek to understand the geosystems that produced these past behaviors and to draw implications regarding the future of the planet and its human environment. While one cannot replicate geologic events, such as earthquakes or hurricanes, scientific methods are used to generate increasingly accurate forecasts and predictions.
Emergent phenomena from complex natural systems are infinite in their variety; no two events are identical, and in this sense, no event repeats itself. Events can be categorized according to their statistical properties, however, such as the parameters of their space, time, and size distributions. The satisfactory explanation of an emergent phenomenon requires building a geosystem model (usually a numerical code) that can replicate the statistics of the phenomenon by simulating the causal processes and interactions. In this context, replication means achieving sufficient statistical agreement between the simulated and observed phenomena.
Understanding of a geosystem and its defining phenomena is often measured by scientists' ability to replicate behaviors that were previously observed (i.e., retrospective testing) and predict new ones that can be subsequently observed (i.e., prospective testing). These evaluations can be in the form of null-hypothesis significance tests (e.g., expressed in terms of p-values) or in terms of skill scores relative to a prediction baseline (e.g., weather forecasts relative to mean-climate forecasts).
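As a concrete, simplified illustration of a skill score of this kind, the sketch below computes a Brier skill score for a handful of probabilistic forecasts against a climatological baseline. The forecasts and outcomes are invented; operational verification is far more elaborate, but the underlying comparison of a forecast's error against a reference forecast's error is the same.

```python
# Illustrative only: forecasts, outcomes, and the baseline rate are invented.
import numpy as np

outcomes = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])   # 1 = event occurred
forecasts = np.array([0.8, 0.2, 0.1, 0.7, 0.3, 0.9, 0.6, 0.2, 0.4, 0.1])
climatology = outcomes.mean()                           # baseline: constant climatological rate

def brier(p, y):
    """Mean squared error of probabilistic forecasts against 0/1 outcomes."""
    return np.mean((p - y) ** 2)

bs_model = brier(forecasts, outcomes)
bs_baseline = brier(np.full_like(forecasts, climatology), outcomes)

# Skill score: 1 is a perfect forecast, 0 matches the baseline, below 0 is worse.
bss = 1.0 - bs_model / bs_baseline
print(f"Brier score (model):    {bs_model:.3f}")
print(f"Brier score (baseline): {bs_baseline:.3f}")
print(f"Brier skill score:      {bss:.3f}")
```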
In the study of geosystems, reproducibility and replicability are closely tied to verification and validation. 3 Verification confirms the correctness of the model by checking that the numerical code correctly solves the mathematical equations. Validation is the process of deciding whether a model replicates the data-generating process accurately enough to warrant some specific application, such as the forecasting of natural hazards.
Hazard forecasting is an area of applied geoscience in which the issues of reproducibility and replicability are sharply framed by the operational demands for delivering high-quality information to a variety of users in a timely manner. Federal agencies tasked with providing authoritative hazard information to the public have undertaken substantial programs to improve reproducibility and replicability standards in operational forecasting. The cyberinfrastructure constructed to support operational forecasting also enhances capabilities for exploratory science in geosystems.
Natural hazards—from windstorms, droughts, floods, and wildfires to earthquakes, landslides, tsunamis, and volcanic eruptions—are notoriously difficult to predict because of the scale and complexity of the geosystems that produce them. Predictability is especially problematic for extreme events of low probability but high consequence that often dominate societal risk, such as the “500-year flood” or “2,500-year earthquake.” Nevertheless, across all sectors of society, expectations are rising for timely, reliable predictions of natural hazards based on the best available science. 4 A substantial part of applied geoscience now concerns the scientific forecasting of hazards and their consequences. A forecast is deemed scientific if it meets five criteria:
- formulated to predict measurable events
- respectful of physical laws
- calibrated against past observations
- as reliable and skillful as practical, given the available information
- testable against future observations
To account for the unavoidable sources of non-replicability (i.e., the randomness of nature and lack of knowledge about this variability), scientific forecasts must be expressed as probabilities. The goal of probabilistic forecasting is to develop forecasts of natural events that are statistically ideal—the best forecasts possible given the available information. Progress toward this goal requires the iterated development of forecasting models over many cycles of data gathering, model calibration, verification, simulation, and testing.
In some fields, such as weather and hydrological forecasting, the natural cycles are rapid enough and the observations are dense and accurate enough to permit the iterated development of system-level models with high explanatory and predictive power. Through steady advances in data collection and numerical modeling over the past several decades, the skill of the ensemble forecasting models developed and maintained by the weather prediction centers has been steadily improved ( Bauer et al., 2015 ). For example, forecasting skill in the range from 3 to 10 days ahead has been increasing by about 1 day per decade; that is, today's 6-day forecast is as accurate as the 5-day forecast was 10 years ago. This is a familiar illustration of gaining confidence in scientific knowledge without doing repeat experiments.
GENETICS
One of the principal tools to gain knowledge about genetic risk factors for disease is a genome-wide association study (GWAS). A GWAS is an observational study of a genome-wide set of genetic variants with the aim of detecting which variants may be associated with the development of a disease, or more broadly, associated with any expressed trait. These studies can be complex to mount, involve massive data collection, and require application of a range of sophisticated statistical methods for correct interpretation.
The community of investigators undertaking GWASs have adopted a series of practices and standards to improve the reliability of their results. These practices include a wide range of activities, such as:
- efforts to ensure consistency in data generation and extensive quality control steps to ensure the reliability of genotype data;
- genotype and phenotype harmonization;
- a push for large sample sizes through the establishment of large international disease consortia;
- rigorous study design and standardized statistical analysis protocols, including consensus building on controlling for key confounders, such as genetic ancestry/population stratification, the use of stringent criteria to account for multiple testing, and the development of norms for conducting independent replication studies and meta-analyzing multiple cohorts;
- a culture of large-scale international collaboration and sharing of data, results, and tools, empowered by strong infrastructure support; and
- an incentive system, which is created to meet scientific needs and is recognized and promoted by funding agencies and journals, as well as grant and paper reviewers, for scientists to perform reproducible, replicable, and accurate research.
For a description of the general approach taken by this community of investigators, see Lin (2018) .
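As a simplified illustration of the stringent multiple-testing criteria and independent replication step listed above, the sketch below applies the conventional genome-wide significance threshold and then rechecks the surviving hits in a second cohort. The variant names and p-values are entirely hypothetical, and real consortium pipelines involve far more than this screening step.

```python
# Hypothetical illustration: variant IDs and p-values are invented.
alpha = 0.05
assumed_independent_tests = 1_000_000      # rough standard assumption for common variants
genome_wide_threshold = alpha / assumed_independent_tests
print(f"Genome-wide significance threshold: {genome_wide_threshold:.0e}")  # 5e-08

# Discovery-stage p-values, screened at the stringent threshold.
discovery_pvalues = {"rs1111111": 3e-9, "rs2222222": 2e-6, "rs3333333": 4e-8}
hits = {snp: p for snp, p in discovery_pvalues.items() if p < genome_wide_threshold}
print("Pass genome-wide significance:", sorted(hits))

# Independent replication cohort: only the carried-forward hits are retested,
# so the multiple-testing burden there is far smaller (Bonferroni over the hits).
replication_pvalues = {"rs1111111": 0.004, "rs3333333": 0.21}
replication_alpha = alpha / max(len(hits), 1)
confirmed = [snp for snp in sorted(hits)
             if replication_pvalues.get(snp, 1.0) < replication_alpha]
print("Confirmed in replication cohort:", confirmed)
```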
PSYCHOLOGY
The idea that there is a “replication crisis” in psychology has received a good deal of attention in professional and popular media, including The New York Times, The Atlantic, National Review, and Slate. However, there is no consensus within the field on this point. Some researchers believe that the field is rife with lax methods that threaten validity, including low statistical power, failure to distinguish between a priori and a posteriori hypothesis testing, and the potential for p-hacking (e.g., Pashler and Wagenmakers, 2012 ; Simmons et al., 2011 ). Other researchers disagree with this characterization and have discussed the costs of what they see as misportraying psychology as a field in crisis, such as the possible chilling effects of such claims on young investigators and an overemphasis on Type I errors (i.e., false positives) at the expense of Type II errors (i.e., false negatives), and failing to discover important new phenomena ( Fanelli, 2018 ; Fiedler et al., 2012 ). Yet others have noted that psychology has long been concerned with improving its methodology, and the current discussion of reproducibility is part of the normal progression of science. An analysis of experimenter bias in the 1960s is a good example, especially as it spurred the use of double-blind methods in experiments ( Rosenthal, 1979 ). In this view, the current concerns can be situated within a history of continuing methodological improvements as psychological scientists continue to develop better understanding and implementation of statistical and other methods and reporting practices.
One reason to believe in the fundamental soundness of psychology as a science is that a great deal of useful and reliable knowledge is being produced. Researchers are making numerous replicable discoveries about the causes of human thought, emotion, and behavior ( Shiffrin et al., 2018 ). To give but a few examples, research on human memory has documented the fallibility of eyewitness testimony, leading to the release of many wrongly convicted prisoners ( Loftus, 2017 ). Research on “overjustification” shows that rewarding children can undermine their intrinsic interest in desirable activities ( Lepper and Henderlong, 2000 ). Research on how decisions are framed has found that more people participate in social programs, such as retirement savings or organ donation, when they are automatically enrolled and have to make a decision to leave (i.e., opt out), compared with when they have to make a decision to join (i.e., opt in) ( Jachimowicz et al., 2018 ). Increasingly, researchers and governments are using such psychological knowledge to meet social needs and solve problems, including improving educational outcomes, reducing government waste from ineffective programs, improving people's health, and reducing stereotyping and prejudice ( Walton and Wilson, 2018 ; Wood and Neal, 2016 ).
It is possible that accompanying this progress are lower levels of reproducibility than would be desirable. As discussed throughout this report, no field of science produces perfectly replicable results, but it may be useful to estimate the current level of replicability of published psychology results and ask whether that level is as high as the field believes it needs to be. Indeed, psychology has been at the forefront of empirical attempts to answer this question with large-scale replication projects, in which researchers from different labs attempt to reproduce a set of studies (refer to Table 5-1 in Chapter 5 ).
The replication projects themselves have proved to be controversial, however, generating wide disagreement about the attributes used to assess replication and the interpretation of the results. Some view the results of these projects as cause for alarm. In his remarks to the committee, for example, Brian Nosek observed: “The evidence for reproducibility [replicability] has fallen short of what one might expect or what one might desire.” ( Nosek, 2018 ). Researchers who agree with this perspective offer a range of evidence. 5
First, many of the replication attempts had similar or higher levels of rigor (e.g., sample size, transparency, preregistration) as the original studies, and yet many were not able to reproduce the original results ( Cheung et al., 2016 ; Ebersole et al., 2016a ; Eerland et al., 2016 ; Hagger et al., 2016 ; Klein et al., 2018 ; O'Donnell et al., 2018 ; Wagenmakers et al., 2016 ). Given the high degree of scrutiny on replication studies ( Zwaan et al., 2018 ), it is unlikely that most failed replications are the result of sloppy research practices.
Second, some of the replication attempts have focused specifically on results that have garnered a lot of attention, are taught in textbooks, and are in other ways high profile—results that one might expect have a high chance of being robust. Some of these replication attempts were successful, but many were not (e.g., Hagger et al., 2016 ; O'Donnell et al., 2018 ; Wagenmakers et al., 2016 ).
Third, a number of the replication attempts were collaborative, with researchers closely tied to the original result (e.g., the authors of the original studies or people with a great deal of expertise on the phenomenon) playing an active role in vetting the replication design and procedure ( Cheung et al., 2016 ; Eerland et al., 2016 ; Hagger et al., 2016 ; O'Donnell et al., 2018 ; Wagenmakers et al., 2016 ). This has not consistently led to positive replication results.
Fourth, when potential mitigating factors have been identified for the failures to replicate, these are often speculative and yet to be tested empirically. For example, failures to replicate have been attributed to context sensitivity, the idea that some phenomena are simply more difficult to recreate in another time and place ( Van Bavel et al., 2016 ). However, without prospective empirical tests of this or other proposed mitigating factors, it remains a real possibility that the original result is not replicable.
And fifth, even if a substantial portion (say, one-third) of failures to replicate are false negatives, it would still lead to the conclusion that the replicability of psychology results falls short of the ideal. Thus, to conclude that replicability rates are acceptable (say, near 80%), one would need to have confidence that most failed replications have significant flaws.
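The arithmetic behind this point is simple. The sketch below uses the 35-of-100 replication figure reported for the psychology project described earlier in this document, together with the assumed one-third false-negative share; both that share and the 80 percent benchmark are illustrative assumptions from the text above, not measured quantities.

```python
# Back-of-the-envelope only: the false-negative share and the 80% benchmark
# are illustrative assumptions, not measured quantities.
replicated = 35                      # studies confirmed out of 100 attempts
failures = 100 - replicated          # 65 failures to replicate
false_negative_share = 1 / 3         # generous assumption about the failures

implied_true_effects = replicated + failures * false_negative_share
print(f"Implied share of genuinely replicable results: {implied_true_effects:.0f}%")  # ~57%

# How large would the false-negative share have to be to reach an 80% rate?
needed_share = (80 - replicated) / failures
print(f"False-negative share needed to reach 80%: {needed_share:.0%}")  # ~69%
```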
Others, however, have a quite different view of the results of the replication projects that have been conducted so far and offer their own arguments and evidence. First, some replication projects have found relatively high rates of replication: for example, Klein et al. (2014) replicated 10 of 13 results. Second, some high-profile replication projects (e.g., Open Science Collaboration, 2015 ) may have underestimated the replication rate by failing to correct for errors and by introducing changes in the replications that were not in the original studies (e.g., Bench et al., 2017 ; Etz and Vandekerckhove, 2016 ; Gilbert et al., 2015 ; Van Bavel et al., 2016 ). Moreover, several cases have come to light in which studies failed to replicate because of methodological changes in the replications, rather than problems with the original studies, and when these changes were corrected, the study replicated successfully (e.g., Alogna et al., 2014 ; Luttrell et al., 2017 ; Noah et al., 2018 ). Finally, the generalizability of the replication results is unknown, because no project randomly selected the studies to be replicated, and many were quite selective in the studies they chose to try to replicate.
An unresolved question in any analysis of replicability is what criteria to use to determine success or failure. Meta-analysis across a set of results may be a more promising technique to assess replicability, because it can evaluate moderators of effects as well as uniformity of results. However, meta-analysis may not achieve sufficient power given only a few studies.
Despite opposing views about how to interpret large-scale replication projects, there seems to be an emerging consensus that it is not helpful, or justified, to refer to psychology as being in a state of “crisis.” Nosek put it this way in his comments to the committee: “How extensive is the lack of reproducibility in research results in science and engineering in general? The easy answer is that we don't know. We don't have enough information to provide an estimate with any certainty for any individual field or even across fields in general.” He added, “I don't like the term crisis because it implies a lot of things that we don't know are true.”
Moreover, even if there were a definitive estimate of replicability in psychology, no one knows the expected level of non-replicability in a healthy science. Empirical results in psychology, like science in general, are inherently probabilistic, meaning that some failures to replicate are inevitable. As we stress throughout this report, innovative research will likely produce inconsistent results as it pushes the boundaries of knowledge. Ambitious research agendas that, for example, link brain to behavior, genetic to environmental influences, computational models to empirical results, and hormonal fluctuations to emotions necessarily yield some dead ends and failures. In short, some failures to replicate can reflect normal progress in science, and they can also highlight a lack of theoretical understanding or methodological limitations.
Whatever the extent of the problem, scientific methods and data analytic techniques can always be improved, and this discussion follows a long tradition in psychology of methodological innovation. New practices, such as checks on the efficacy of experimental manipulations, are now accepted in the field. Funding proposals now include power analyses as a matter of course. Longitudinal studies no longer just note attrition (i.e., participant dropout), but instead routinely estimate its effects (e.g., intention-to-treat analyses). At the same time, not all researchers have adopted best practices, sometimes failing to keep pace with current knowledge ( Sedlmeier and Gigerenzer, 1989 ). Only recently are researchers starting to systematically use power calculations in research reports or to provide online access to data and materials. Pressures on researchers to improve practices and to increase transparency have been heightened in the past decade by new developments in information technology that increase public access to information and scrutiny of science ( Lupia, 2017 ).
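A power analysis of the kind now routinely included in funding proposals can be sketched in a few lines. The planning effect size below (Cohen's d of 0.4) is an invented assumption, and the statsmodels package is just one of several tools that perform this calculation.

```python
# Illustrative planning calculation; the assumed effect size is invented.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group are needed for 80% power at alpha = 0.05?
n_per_group = analysis.solve_power(effect_size=0.4, power=0.8, alpha=0.05,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")   # roughly 100

# The same tool, turned around, shows how little power a small study has.
power_small = analysis.power(effect_size=0.4, nobs1=25, alpha=0.05)
print(f"Power with 25 per group:        {power_small:.2f}")   # roughly 0.28
```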
SOCIAL SCIENCE RESEARCH USING BIG DATA
With close to 7 in 10 Americans now using social media as a regular news source ( Pew, 2018 ), social scientists in communication research, psychology, sociology, and political science routinely analyze a variety of information disseminated on commercial social media platforms, such as Twitter and Facebook, how that information flows through social networks, and how it influences attitudes and behaviors.
Analyses of data from these commercial platforms may rely on publicly available data that can be scraped and collected by any researcher without input from or collaboration with industry partners (model 1). Alternatively, industry staff may collaborate with researchers and provide access to proprietary data for analysis (such as code or underlying algorithms) that may not be made available to others (model 2). Variations on these two basic models will depend on the type of intellectual property being used in the research.
Both models raise challenges for reproducibility and replicability. In terms of reproducibility, when data are proprietary and undisclosed, the computation by definition is not reproducible by others. This might put this kind of research at odds with publication requirements of journals and other academic outlets. An inability to publish results from such industry partnerships may in the long term create a disincentive for work on datasets that cannot be made publicly available and increase pressure from within the scientific community on industry partners for more openness. This process may be accelerated if funding agencies only support research that follows the standards for full documentation and openness detailed in this report.
Both models also raise issues with replicability. Social media platforms, such as Twitter and Facebook, regularly modify their application programming interfaces (APIs) and other modalities of data access, which influences the ability of researchers to access, document, and archive data consistently. In addition, data are likely confounded by ongoing A/B testing 6 and tweaks to underlying algorithms. In model 1, these confounds are not transparent to researchers and therefore cannot be documented or controlled for in the original data collections or attempts to replicate the work. In model 2, they are known to the research team, but because they are proprietary they cannot be shared publicly. In both models, changes implemented by social media platforms in algorithms, APIs, and other internal characteristics over time make it impossible to computationally reproduce analytic models and to have confidence that equivalent data for reproducibility can be collected over time.
In summary, the considerations for social science using big data of the type discussed above illustrate a spectrum of challenges and approaches toward gaining confidence in scientific studies. In these and other scientific domains, science progresses through growing consensus in the scientific community of what counts as scientific knowledge. At the same time, public trust in science is premised on public confidence in the ability of scientists to demonstrate and validate what they assert is scientific knowledge.
In the examples above, diverse fields of science have developed methods for investigating phenomena that are difficult or impossible to replicate. Yet, as in the case of hazard prediction, scientific progress has been made as evidenced by forecasts with increased accuracy. This progress is built from the results of many trials and errors. Differentiating a success from a failure of a single study cannot be done without looking more broadly at the other lines of evidence. As noted by Goodman and colleagues (2016 , p. 3): “[A] preferred way to assess the evidential meaning of two or more results with substantive stochastic variability is to evaluate the cumulative evidence they provide.”
CONCLUSION 7-2: Multiple channels of evidence from a variety of studies provide a robust means for gaining confidence in scientific knowledge over time. The goal of science is to understand the overall effect or inference from a set of scientific studies, not to strictly determine whether any one study has replicated any other.
PUBLIC PERCEPTIONS OF REPRODUCIBILITY AND REPLICABILITY
The congressional mandate that led to this study expressed the view that “there is growing concern that some published research results cannot be replicated, which can negatively affect the public's trust in science.” The statement of task for this report reflected this concern, asking the committee to “consider if the lack of replicability and reproducibility impacts . . . the public's perception” of science (refer to Box 1-1 in Chapter 1 ). This committee is not aware of any data that have been collected that specifically address how non-reproducibility and non-replicability have affected the public's perception of science. However, there are data about topics that may shed some light on how the public views these issues. These include data about the public's understanding of science, the public's trust in science, and the media's coverage of science.
Public Understanding of Science
When examining public understanding of science for the purposes of this report, at least four areas are particularly relevant: factual knowledge, understanding of the scientific process, awareness of scientific consensus, and understanding of uncertainty.
Factual knowledge about scientific terms and concepts in the United States has been fairly stable in recent years. In 2016, Americans correctly answered an average of 5.6 of the 9 true-or-false or multiple-choice items asked on the Science & Engineering Indicators surveys. This number was similar to the averages from data gathered over the past decade. In other words, there is no indication that knowledge of scientific facts and terms has decreased in recent years. It is clear from the data, however, that “factual knowledge of science is strongly related to individuals' level of formal schooling and the number of science and mathematics courses completed” ( National Science Foundation, 2018e , p. 7-35).
Americans' understanding of the scientific process is mixed. The Science & Engineering Indicators surveys ask respondents about their understanding of three aspects related to the scientific process. In 2016, 64 percent could correctly answer two questions related to the concept of probability, 51 percent provided a correct description of a scientific experiment, and 23 percent were able to describe the idea of a scientific study. While these numbers have not been declining over time, they nonetheless indicate relatively low levels of understanding of the scientific process and suggest an inability of “[m]any members of the public . . . to differentiate a sound scientific study from a poorly conducted one and to understand the scientific process more broadly” ( Scheufele, 2013 , p. 14041).
Another area in which the public lacks a clear understanding of science is the idea of scientific consensus on a topic. There are widespread perceptions that no scientific consensus has emerged in areas that are supported by strong and consistent bodies of research. In a 2014 U.S. survey ( Funk and Rainie, 2015 , p. 8), for instance, two-thirds of respondents (67%) thought that scientists did “not have a clear understanding about the health effects of GM [genetically modified] crops,” in spite of more than 1,500 peer-refereed studies showing that there is no difference between genetically modified and traditionally grown crops in terms of their health effects for human consumption ( National Academies of Sciences, Engineering, and Medicine, 2016a ). Similarly, even though there is broad consensus among scientists, one-half of Americans (52%) thought “scientists are divided” that the universe was created in a single, violent event often called the big bang, and about one-third thought that scientists are divided on the human causes of climate change (37%) and on evolution (29%).
For the fourth area, the public's understanding of uncertainty, its role in scientific inquiry, and how uncertainty ought to be evaluated, research is sparse. Some data are available on uncertainties surrounding public opinion poll results. In a 2007 Harris Interactive poll, 7 for instance, only about 1 in 10 Americans (12%) could correctly identify the source of error quantified by margin-of-error estimates. Yet slightly more than one-half (52%) agreed that pollsters should use the phrase “margin of error” when reporting on survey results.
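The idea can be made concrete with the standard margin-of-error calculation for a simple random sample. The poll result and sample size below are invented; the point is that the margin of error quantifies sampling error only, and says nothing about other survey problems such as question wording or nonresponse.

```python
# Illustrative only: the poll result and sample size are invented.
import math

p_hat = 0.52        # hypothetical share of respondents giving a particular answer
n = 1000            # hypothetical number of respondents
z = 1.96            # multiplier for a 95 percent confidence level

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"Margin of error: +/- {100 * margin:.1f} percentage points")   # about 3.1

# So a reported 52 percent is consistent with a true value anywhere from
# roughly 49 to 55 percent, before considering any non-sampling errors.
```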
Some research has shown that scientists believe that the public is unable to understand or contend with uncertainty in science ( Besley and Nisbet, 2013 ; Davies, 2008 ; Ecklund et al., 2012 ) and that providing information related to uncertainty creates distrust, panic, and confusion ( Frewer et al., 2003 ). However, people appear to expect some level of uncertainty in scientific information, and seem to have a relatively high tolerance for scientific uncertainty ( Howell, 2018 ). Currently, research is being done to explore how best to communicate uncertainties to the public and how to help people accurately process uncertain information.
Public Trust in Science
Despite a sometimes shaky understanding of science and the scientific process, the public continues largely to trust the scientific community. In its biannual Science & Engineering Indicators reports, the National Science Board ( National Science Foundation, 2018e ) tracks public confidence in a range of institutions (see Figure 7-1 ). Over time, trust in science has remained stable—in contrast to other institutions, such as Congress, major corporations, and the press, which have all shown significant declines in public confidence over the past 50 years. Science has been eclipsed in public confidence only by the military during Operation Desert Storm in the early 1990s and since the 9/11 terrorist attacks.
FIGURE 7-1. Levels of public confidence in selected U.S. institutions over time. SOURCE: National Science Foundation (2018e, Figure 7-16) and General Social Survey (2018 data from http://gss.norc.org/Get-The-Data).
In the most recent iteration of the Science & Engineering Indicators surveys ( National Science Foundation, 2018e ), almost 9 in 10 (88%) Americans also “strongly agreed” or “agreed” with the statement that “[m]ost scientists want to work on things that will make life better for the average person.” A similar proportion (89%) “strongly agreed” or “agreed” that “[s]cientific researchers are dedicated people who work for the good of humanity.” Even for potentially controversial issues, such as climate change, levels of trust in scientists as information sources remain relatively high, with 71 percent in a 2015 Yale University Project on Climate Change survey saying that they trust climate scientists “as a source of information about global warming,” compared with 60 percent trusting television weather reporters as information sources, and 41 percent trusting mainstream news media. Controversies around scientific conduct, such as “climategate,” have not led to significant shifts in public trust. In fact, “more than a decade of public opinion research on global warming . . . [shows] that these controversies . . . had little if any measurable impact on relevant opinions of the nation as a whole” ( MacInnis and Krosnick, 2016 , p. 509).
In recent years, some scholars have raised concerns that unwarranted attention on emerging areas of science can lead to misperceptions or even declining trust among public audiences, especially if science is unable to deliver on early claims or subsequent research fails to replicate initial results ( Scheufele, 2014 ). Public opinion surveys show that these concerns are not completely unfounded. In national surveys, one in four Americans (27%) think that it is a “big problem” and almost one-half of Americans (47%) think it is at least a “small problem” that “[s]cience researchers overstate the implications of their research”; only one in four (24%) see no problem ( Funk et al., 2017 ). In other words, “science may run the risk of undermining its position in society in the long term if it does not navigate this area of public communication carefully and responsibly” ( Scheufele and Krause, 2019 , p. 7667).
Media Coverage of Science
The concerns noted above are exacerbated by the fact that the public's perception of science—and of reproducibility and replicability issues—is heavily influenced by the media's coverage of science. News is an inherently event-driven profession. Research on news values ( Galtung and Ruge, 1965 ) and journalistic norms ( Shoemaker and Reese, 1996 ) has shown that rare, unexpected, or novel events and topics are much more likely to be covered by news media than recurring or what are seen as routine issues. As a result, scientific news coverage often tends to favor articles about single-study, breakthrough results over stories that might summarize cumulative evidence, describe the process of scientific discovery, or delineate between systemic, application-focused, or intrinsic uncertainties surrounding science, as discussed throughout this report. In addition to being event driven, news is also subject to audience demand. Experimental studies have demonstrated that respondents prefer conflict-laden debates over deliberative exchanges ( Mutz and Reeves, 2005 ). Audience demand may drive news organizations to cover scientific stories that emphasize conflict—for example, studies that contradict previous work—rather than reporting on studies that support the consensus view or make incremental additions to existing knowledge.
In addition to what is covered by the media, there are also concerns about how the media cover scientific stories. There is some evidence that media stories contain exaggerations or make causal statements or inferences that are not warranted when reporting on scientific studies. For example, a study that looked at journal articles, press releases about these articles, and the subsequent news stories found that more than one-third of press releases contained exaggerated advice, causal claims, or inferences from animals to humans ( Sumner et al., 2016 ). When the press release contained these exaggerations, the news stories that followed were far more likely also to contain exaggerations in comparison with news stories based on press releases that did not exaggerate.
Public confidence in science journalism reflects this concern about coverage, with 73 percent of Americans saying that the “biggest problem with news about scientific research findings is the way news reporters cover it,” and 43 percent saying it is a “big problem” that the news media are “too quick to report research findings that may not hold up” ( Funk et al., 2017 ). Implicit in discussions of sensationalizing and exaggeration of research results is the concept of uncertainty. While scientific publications almost always include at least a brief discussion of the uncertainty in the results—whether presented in error bars, confidence intervals, or other metrics—this discussion of uncertainty does not always make it into news stories. When results are presented without the context of uncertainty, it can contribute to the perception of hyping or exaggerating a study's results.
In recent years, the term “replication crisis” has been used in both academic writing (e.g., Shrout and Rodgers, 2018 ) and in the mainstream media (see, e.g., Yong, 2016 ), despite a lack of reliable data about the existence of such a “crisis.” Some have raised concerns that highly visible instances of media coverage of the issue of replicability and reproducibility have contributed to a larger narrative in public discourse around science being “broken” ( Jamieson, 2018 ). The frequency and prominence with which an issue is covered in the media can influence the perceived importance among audiences about that issue relative to other topics and ultimately how audiences evaluate actors in their performance on the issue ( National Academies of Sciences, Engineering, and Medicine, 2016b ). However, large-scale analyses suggest that widespread media coverage of the issue is not the case. A preliminary analysis of print and online news outlets, for instance, shows that overall media coverage on reproducibility and replicability remains low, with fewer than 200 unique, on-topic articles captured for a 10-year period, from June 1, 2008, to April 30, 2018 ( Howell, 2018 ). Thus, there is currently limited evidence that media coverage of a replication crisis has significantly influenced public opinion.
Scientists also bear some responsibility for misrepresentation in the public's eye: as noted above, many Americans believe that scientists overstate the implications of their research. The purported existence of a replication crisis has been reported in several high-profile articles in the mainstream media; however, overall coverage remains low, and it is unclear whether the issue has registered with the general public.
CONCLUSION 7-3: Based on evidence from well-designed and long-standing surveys of public perceptions, the public largely trusts scientists. Understanding of the scientific process and methods has remained stable over time, though is not widespread. The National Science Foundation's most recent Science & Engineering Indicators survey shows that 51 percent of Americans understand the logic of experiments and 23 percent understand the idea of a scientific study.
As discussed throughout this report, uncertainty is an inherent part of science. Unfortunately, while people show some tolerance for uncertainty in science, it is often not well communicated by researchers or the media. There is, however, a large and growing body of research outlining evidence-based approaches for scientists to more effectively communicate different dimensions of scientific uncertainty to nonexpert audiences (for an overview, see Fischhoff and Davis, 2014 ). Similarly, journalism teachers and scholars have long examined how journalists cover scientific uncertainty (e.g., Stocking, 1999 ) and best practices for communicating uncertainty in science news coverage (e.g., Blum et al., 2005 ).
Broader trends in how science is promoted and covered in modern news environments may indirectly influence public trust in science related to replicability and reproducibility. Examples include concerns about hyperbolic claims in university press releases (for a summary, see Weingart, 2017 ) and a false balance in reporting, especially when scientific topics are covered by nonscience journalists: in these cases, the established scientific consensus around issues such as climate change is put on equal footing with nonfactual claims by nonscientific organizations or interest groups for the sake of “showing both sides” ( Boykoff and Boykoff, 2004 ).
RECOMMENDATION 7-1: Scientists should take care to avoid overstating the implications of their research and also exercise caution in their review of press releases, especially when the results bear directly on matters of keen public interest and possible action.
RECOMMENDATION 7-2: Journalists should report on scientific results with as much context and nuance as the medium allows. In covering issues related to replicability and reproducibility, journalists should help their audiences understand the differences between non-reproducibility and non-replicability due to fraudulent conduct of science and instances in which the failure to reproduce or replicate may be due to evolving best practices in methods or inherent uncertainty in science. Particular care in reporting on scientific results is warranted when
- the scientific system under study is complex, with limited control over alternative explanations or confounding influences;
- a result is particularly surprising or at odds with existing bodies of research;
- the study deals with an emerging area of science that is characterized by significant disagreement or contradictory results within the scientific community; and
- research involves potential conflicts of interest, such as work funded by advocacy groups, affected industry, or others with a stake in the outcomes.
Finally, members of the public and policy makers have a role to play in improving reproducibility and replicability. When a new discovery is reported in the media, one should ask about the uncertainties associated with the results and about what other evidence exists against which the discovery might be weighed.
RECOMMENDATION 7-3: Anyone making personal or policy decisions based on scientific evidence should be wary of making a serious decision based on the results, no matter how promising, of a single study. Similarly, no one should take a new, single contrary study as refutation of scientific conclusions supported by multiple lines of previous evidence.
For an overview of Cochrane, see http://www.cochrane.org.
In the broad area of health care research, for example, this approach has been adopted by Cochrane, an international group for systematic reviews, and by U.S. government organizations such as the Agency for Healthcare Research and Quality and the U.S. Preventive Services Task Force.
The meanings of the terms verification and validation, like reproducibility and replicability, differ among fields. Here we conform to the usage in computer and information science. In weather forecasting, a model is verified by its agreement with data—what is here called validation.
For example, the 2015 Paris Agreement adopted by the U.N. Framework Convention on Climate Change specifies that “adaptation action . . . should be based on and guided by the best available science.” And the California Earthquake Authority is required by law to establish residential insurance rates that are based on “the best available science” ( Marshall, 2018 , p. 106).
For a list of replication studies in psychology, see http://curatescience.org/#replicationssection.
A/B testing is a randomized experiment with two variants, analyzed with a statistical hypothesis test (a "two-sample hypothesis test," as the term is used in the field of statistics).
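As a rough illustration, the sketch below shows how such a two-variant comparison might be analyzed; the visitor and conversion counts are hypothetical, and a chi-squared test on the 2x2 table is just one common choice of two-sample test.

```python
# Minimal sketch of analyzing an A/B test with a two-sample hypothesis test.
# The counts below are hypothetical, not taken from any study in this report.
from scipy import stats

conversions = [120, 150]   # conversions observed for variants A and B
visitors = [2000, 2000]    # visitors assigned to each variant

# Build the 2x2 contingency table: converted vs. not converted, per variant.
table = [
    [conversions[0], visitors[0] - conversions[0]],
    [conversions[1], visitors[1] - conversions[1]],
]

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"p = {p_value:.3f}")  # a small p-value suggests the conversion rates differ
```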
See https://theharrispoll.com/wp-content/uploads/2017/12/Harris-Interactive-Poll-ResearchMargin-of-Error-2007-11.pdf.
Source: National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press; 2019. Chapter 7, Confidence in Science.
Why is Replication in Research Important?
Replication in research is important because it allows for the verification and validation of study findings, building confidence in their reliability and generalizability. It also fosters scientific progress by promoting the discovery of new evidence, expanding understanding, and challenging existing theories or claims.
Updated on June 30, 2023
Often viewed as a cornerstone of science, replication builds confidence in the scientific merit of a study's results. The philosopher Karl Popper argued that "we do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them."
As such, creating the potential for replication is a common goal for researchers. The methods section of scientific manuscripts is vital to this process as it details exactly how the study was conducted. From this information, other researchers can replicate the study and evaluate its quality.
This article discusses replication as a rational concept integral to the philosophy of science and as a process validating the continuous loop of the scientific method. By considering both the ethical and practical implications, we may better understand why replication is important in research.
What is replication in research?
As a fundamental tool for building confidence in the value of a study’s results, replication has power. Some would say it has the power to make or break a scientific claim when, in reality, it is simply part of the scientific process, neither good nor bad.
Nosek and Errington propose that a replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research; framed this way, replication regains its neutrality. The true purpose of replication, therefore, is to advance scientific discovery and theory by introducing new evidence that broadens the current understanding of a given question.
Why is replication important in research?
The great philosopher and scientist Aristotle asserted that a science is possible if and only if there are knowable objects involved. There cannot be a science of unicorns, for example, because unicorns do not exist. Therefore, a 'science' of unicorns lacks knowable objects and is not a science.
This philosophical foundation of science illustrates why replication is important in research. Put simply, when an outcome is not replicable, it is not knowable and, in this sense, does not truly exist. Conversely, each time a study or a result is replicated, its credibility and validity grow.
The lack of replicability is just as vital to the scientific process. It pushes researchers in new and creative directions, compelling them to continue asking questions and to never become complacent. Replication is as much a part of the scientific method as formulating a hypothesis or making observations.
Types of replication
Historically, replication has been divided into two broad categories:
- Direct replication : performing a new study that follows a previous study’s original methods and then comparing the results. While direct replication follows the protocols from the original study, the samples and conditions, time of day or year, lab space, research team, etc. are necessarily different. In this way, a direct replication uses empirical testing to reflect the prevailing beliefs about what is needed to produce a particular finding.
- Conceptual replication : performing a study that employs different methodologies to test the same hypothesis as an existing study. By applying diverse manipulations and measures, conceptual replication aims to operationalize a study’s underlying theoretical variables. In doing so, conceptual replication promotes collaborative research and explanations that are not based on a single methodology.
Though these general divisions provide a helpful starting point for both conducting and understanding replication studies, they are not polar opposites. There are nuances that produce countless subcategories such as:
- Internal replication : when the same research team conducts the same study while taking negative and positive factors into account
- Microreplication : conducting partial replications of the findings of other research groups
- Constructive replication : both manipulations and measures are varied
- Participant replication : changes only the participants
Many researchers agree these labels should be confined to study design, as direction for the research team, not a preconceived notion. In fact, Nosek and Errington conclude that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge.
How do researchers replicate a study?
Like all research studies, replication studies require careful planning. The Open Science Framework (OSF) offers a practical guide which details the following steps:
- Identify a study that is feasible to replicate given the time, expertise, and resources available to the research team.
- Determine and obtain the materials used in the original study.
- Develop a plan that details the type of replication study and research design intended.
- Outline and implement the study’s best practices.
- Conduct the replication study, analyze the data, and share the results.
These broad guidelines are expanded in Brown and Wood's article, "Which tests not witch hunts: a diagnostic approach for conducting replication research." Their findings are further condensed by Brown into a blog post outlining four main procedural categories:
- Assumptions : identifying the contextual assumptions of the original study and research team
- Data transformations : using the study data to answer questions about data transformation choices by the original team
- Estimation : determining if the most appropriate estimation methods were used in the original study and if the replication can benefit from additional methods
- Heterogeneous outcomes : establishing whether the data from an original study lends itself to exploring separate heterogeneous outcomes
At the suggestion of peer reviewers from the e-journal Economics, Brown elaborates with a discussion of what not to do when conducting a replication study, including:
- Do not use critiques of the original study’s design as a basis for replication findings.
- Do not perform robustness testing before completing a direct replication study.
- Do not omit communicating with the original authors, before, during, and after the replication.
- Do not label the original findings as errors solely based on different outcomes in the replication.
Again, replication studies are full-blown, legitimate research endeavors that contribute directly to scientific knowledge. They require the same levels of planning and dedication as any other study.
What happens when replication fails?
There are some obvious and agreed upon contextual factors that can result in the failure of a replication study such as:
- The detection of unknown effects
- Inconsistencies in the system
- The inherent nature of complex variables
- Substandard research practices
- Pure chance
While these variables affect all research studies, they have particular impact on replication as the outcomes in question are not novel but predetermined.
The constant flux of contexts and variables makes it tricky to assess replicability and to judge success or failure. A publication from the National Academies points out that replicability means obtaining consistent, not identical, results across studies aimed at answering the same scientific question. The report also provides eight core principles that are applicable to all disciplines.
While there are no straightforward criteria for determining whether a replication is a failure or a success, the National Library of Science and the Open Science Collaboration suggest asking some key questions, such as the following (a brief code sketch after this list illustrates how these checks can be run):
- Does the replication produce a statistically significant effect in the same direction as the original?
- Is the effect size in the replication similar to the effect size in the original?
- Does the original effect size fall within the confidence or prediction interval of the replication?
- Does a meta-analytic combination of results from the original experiment and the replication yield a statistically significant effect?
- Do the results of the original experiment and the replication appear to be consistent?
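To make these checks concrete, here is a minimal sketch using made-up summary data rather than any real study. It walks through the questions above with standard statistical tools: significance and direction, effect sizes, the replication's confidence interval, and a simple fixed-effect meta-analytic combination. The numbers and the one-sample t-test framing are illustrative assumptions, not a prescription.

```python
# Minimal sketch of the replication checks listed above, using simulated
# (hypothetical) data in place of real original and replication studies.
import numpy as np
from scipy import stats

orig = np.random.default_rng(1).normal(0.5, 1.0, 40)   # per-participant scores, original study
rep = np.random.default_rng(2).normal(0.2, 1.0, 200)   # per-participant scores, replication

# 1. Statistically significant effect in the same direction as the original?
t_rep, p_rep = stats.ttest_1samp(rep, 0.0)
same_direction = np.sign(rep.mean()) == np.sign(orig.mean())

# 2. Is the effect size (Cohen's d) in the replication similar to the original?
d_orig = orig.mean() / orig.std(ddof=1)
d_rep = rep.mean() / rep.std(ddof=1)

# 3. Does the original effect fall within the replication's 95% confidence interval?
ci_low, ci_high = stats.t.interval(0.95, len(rep) - 1,
                                   loc=rep.mean(), scale=stats.sem(rep))
original_inside_ci = ci_low <= orig.mean() <= ci_high

# 4. Does a meta-analytic (inverse-variance, fixed-effect) combination of the
#    two estimates yield a statistically significant pooled effect?
weights = np.array([len(orig) / orig.var(ddof=1), len(rep) / rep.var(ddof=1)])
estimates = np.array([orig.mean(), rep.mean()])
pooled = (weights * estimates).sum() / weights.sum()
z = pooled / weights.sum() ** -0.5
p_meta = 2 * stats.norm.sf(abs(z))

print(p_rep, same_direction, round(d_orig, 2), round(d_rep, 2),
      original_inside_ci, round(pooled, 2), p_meta)
```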
Many clearly have opinions about how and why replication fails, but calling a replication a "failure" is at best a null statement and at worst an unfair accusation. It misses the point and sidesteps the role of replication as a mechanism for furthering scientific endeavor by presenting new evidence on an existing question.
Can the replication process be improved?
The need both to restructure the definition of replication to account for variation across scientific fields and to recognize the range of potential outcomes when comparing results with the original data comes in response to the replication crisis. Listen to this Hidden Brain podcast from NPR for an intriguing case study on this phenomenon.
Considered academia’s self-made disaster, the replication crisis is spurring other improvements in the replication process. Most broadly, it has prompted the resurgence and expansion of metascience , a field with roots in both philosophy and science that is widely referred to as "research on research" and "the science of science." By holding a mirror up to the scientific method, metascience is not only elucidating the purpose of replication but also guiding the rigors of its techniques.
Further efforts to improve replication are threaded throughout the industry, from updated research practices and study design to revised publication practices and oversight organizations, such as:
- Requiring full transparency of the materials and methods used in a study
- Pushing for statistical reform , including redefining the significance of the p-value
- Using preregistered reports that present the study's plan for methods and analysis
- Adopting result-blind peer review allowing journals to accept a study based on its methodological design and justifications, not its results
- Founding organizations like the EQUATOR Network that promote transparent and accurate reporting
Final thoughts
In the realm of scientific research, replication is a form of checks and balances. Neither the probability of a finding nor prominence of a scientist makes a study immune to the process.
And, while a single replication does not validate or nullify the original study’s outcomes, accumulating evidence from multiple replications does boost the credibility of its claims. At the very least, the findings offer insight to other researchers and enhance the pool of scientific knowledge.
After exploring the philosophy and the mechanisms behind replication, it is clear that the process is not perfect, but evolving. Its value lies within the irreplaceable role it plays in the scientific method. Replication is no more or less important than the other parts, simply necessary to perpetuate the infinite loop of scientific discovery.
Charla Viera, MS
In Psychology And Other Social Sciences, Many Studies Fail The Reproducibility Test
Richard Harris
A researcher showed people a picture of The Thinker in an effort to study the link between analytical thinking and religious disbelief. In hindsight, the researcher called his study design "silly." The study could not be reproduced. Peter Barritt/Getty Images
The world of social science got a rude awakening a few years ago, when researchers concluded that many studies in this area appeared to be deeply flawed. Two-thirds could not be replicated in other labs.
Some of those same researchers now report those problems still frequently crop up, even in the most prestigious scientific journals.
But their study, published Monday in Nature Human Behaviour , also finds that social scientists can actually sniff out the dubious results with remarkable skill.
First, the findings. Brian Nosek, a psychology researcher at the University of Virginia and the executive director of the Center for Open Science, decided to focus on social science studies published in the most prominent journals, Science and Nature.
"Some people have hypothesized that, because they're the most prominent outlets they'd have the highest rigor," Nosek says. "Others have hypothesized that the most prestigious outlets are also the ones that are most likely to select for very 'sexy' findings, and so may be actually less reproducible."
To find out, he worked with scientists around the world to see if they could reproduce the results of key experiments from 21 studies in Science and Nature, typically psychology experiments involving students as subjects. The new studies on average recruited five times as many volunteers, in order to come up with results that were less likely due to chance.
The results were better than the average of a previous review of the psychology literature, but still far from perfect. Of the 21 studies, the experimenters were able to reproduce 13. And the effects they saw were on average only about half as strong as had been trumpeted in the original studies.
The remaining eight were not reproduced.
"A substantial portion of the literature is reproducible," Nosek concludes. "We are getting evidence that someone can independently replicate [these findings]. And there is a surprising number [of studies] that fail to replicate."
One of the eight studies that failed this test came from the lab of Will Gervais, when he was getting his PhD at the University of British Columbia. He and a colleague had run a series of experiments to see whether people who are more analytical are less likely to hold religious beliefs. In one test, undergraduates looked at pictures of statues.
"Half of our participants looked at a picture of the sculpture, 'The Thinker,' where here's this guy engaged in deep reflective thought," Gervais says. "And in our control condition, they'd look at the famous stature of a guy throwing a discus."
People who saw The Thinker, a sculpture by August Rodin, expressed more religious disbelief, Gervais reported in Science . And given all the evidence from his lab and others, he says there's still reasonable evidence that underlying conclusion is true. But he recognizes the sculpture experiment was really quite weak.
"Our study, in hindsight, was outright silly," says Gervais, who is now an assistant professor at the University of Kentucky.
A previous study also failed to replicate his experimental findings, so the new analysis is hardly a surprise.
But what interests him the most in the new reproducibility study is that scientists had predicted that his study – along with the seven others that failed to replicate – was unlikely to stand up to the challenge.
As part of the reproducibility study, about 200 social scientists were surveyed and asked to predict which results would stand up to the re-test and which would not. They also took part in a "prediction market," where they could buy or sell tokens that represented their views.
"They're taking bets with each other, against us," says Anna Dreber, an economics professor at the Stockholm School of Economics and coauthor of the new study.
It turns out, "these researchers were very good at predicting which studies would replicate," she says. "I think that's great news for science."
These forecasts could help accelerate the process of science. If you can get panels of experts to weigh in on exciting new results, the field might be able to spend less time chasing errant results known as false positives.
"A false positive result can make other researchers, and the original researcher, spend lots of time and energy and money on results that turn out not to hold," she says. "And that's kind of wasteful for resources and inefficient, so the sooner we find out that a result doesn't hold, the better."
But if social scientists were really good at identifying flawed studies, why did the editors and peer reviewers at Science and Nature let these eight questionable studies through their review process?
"The likelihood that a finding will replicate or not is one part of what a reviewer would consider," says Nosek. "But other things might influence the decision to publish. It may be that this finding isn't likely to be true, but if it is true, it is super important, so we do want to publish it because we want to get it into the conversation."
Nosek recognizes that, even though the new studies were more rigorous than the ones they attempted to replicate, that doesn't guarantee that the old studies are wrong and the new studies are right. No single scientific study gives a definitive answer.
Forecasting could be a powerful tool in accelerating that quest for the truth.
That may not work, however, in one area where the stakes are very high: medical research, where answers can have life-or-death consequences.
Jonathan Kimmelman at McGill University, who was not involved in the new study, says when he's asked medical researchers to make predictions about studies, the forecasts have generally flopped.
"That's probably not a skill that's widespread in medicine," he says. It's possible that the social scientists selected to make the forecasts in the latest study have deep skills in analyzing data and statistics, and their knowledge of the psychological subject matter is less important.
And forecasting is just one tool that could be used to improve the rigor of social science.
"The social-behavioral sciences are in the midst of a reformation," says Nosek. Scientists are increasingly taking steps to increase transparency, so that potential problems surface quickly. Scientists are increasingly announcing in advance the hypothesis they are testing; they are making their data and computer code available so their peers can evaluate and check their results.
Perhaps most important, some scientists are coming to realize that they are better off doing fewer studies, but with more experimental subjects, to reduce the possibility of a chance finding.
"The way to get ahead and get a job and get tenure is to publish lots and lots of papers," says Gervais. "And it's hard to do that if you are able run fewer studies, but in the end I think that's the way to go — to slow down our science and be more rigorous up front."
Gervais says when he started his first faculty job, at the University of Kentucky, he sat down with his department chair and said he was going to follow this path of publishing fewer, but higher quality studies. He says he got the nod to do that. He sees it as part of a broader cultural change in social science that's aiming to make the field more robust.
Replicability
One of the most important features of a scientific research paper is that the research must be replicable, which means that the paper gives readers enough detailed information that the research can be repeated (or 'replicated').
It is very important that research can be replicated, because it means that other researchers can test the findings of the research. Replicability keeps researchers honest and can give readers confidence in research.
For example, if a new research paper concludes that smoking is not related to lung cancer, readers would be very skeptical because it disagrees with the weight of existing evidence. If the paper explained in detail how the research was carried out, other researchers would be able to repeat the research and either confirm or oppose the findings. However, if the paper did not explain how the research was carried out, readers would have no way of testing the controversial conclusions.
Replicability is an essential concept throughout the Understanding Health Research tool. As readers, we cannot know for sure whether researchers have misrepresented or lied about their findings, but we can always ask whether the paper gives us enough detail to be able to replicate the research. If the research is replicable, then any false conclusions can eventually be shown to be wrong.
Understanding Science
- Scientists aim for their studies to be replicable — meaning that another researcher could perform a similar investigation and obtain the same basic results.
- When a study cannot be replicated, it suggests that our current understanding of the study system or our methods of testing are insufficient.
Copycats in science: The role of replication
Scientists aim for their studies’ findings to be replicable — so that, for example, an experiment testing ideas about the attraction between electrons and protons should yield the same results when repeated in different labs. Similarly, two different researchers studying the same dinosaur bone in the same way should come to the same conclusions regarding its measurements and composition—though they may interpret that evidence differently (e.g., regarding what it means about dinosaur growth patterns). This goal of replicability makes sense. After all, science aims to reconstruct the unchanging rules by which the universe operates, and those same rules apply, 24 hours a day, seven days a week, from Sweden to Saturn, regardless of who is studying them. If a finding can’t be replicated, it suggests that our current understanding of the study system or our methods of testing are insufficient.
Does this mean that scientists are constantly repeating what others before them have already done? No, of course not — or we would never get anywhere at all. The process of science doesn’t require that every experiment and every study be repeated, but many are, especially those that produce surprising or particularly important results. In some fields, it is standard procedure for a scientist to replicate his or her own results before publication in order to ensure that the findings were not due to some fluke or factors outside the experimental design.
The desire for replicability is part of the reason that scientific papers almost always include a methods section, which describes exactly how the researchers performed the study. That information allows other scientists to replicate the study and to evaluate its quality, helping ensure that occasional cases of fraud or sloppy scientific work are weeded out and corrected.
- Science in action
When a finding can’t be replicated, it spells trouble for the idea supported by that piece of evidence. Lack of replicability has challenged the idea of cold fusion. Read the full story: Cold fusion: A case study for scientific behavior .
Many scientific studies can’t be replicated. That’s a problem.
This post has been updated.
Maverick researchers have long argued that much of what gets published in elite scientific journals is fundamentally squishy — that the results tell a great story but can’t be reproduced when the experiments are run a second time.
Now a volunteer army of fact-checkers has published a new report that affirms that the skepticism was warranted. Over the course of four years, 270 researchers attempted to reproduce the results of 100 experiments that had been published in three prestigious psychology journals.
It was awfully hard. They ultimately concluded that they’d succeeded just 39 times.
The failure rate surprised even the leaders of the project, who had guessed that perhaps half the results wouldn’t be reproduced.
The new paper, titled "Estimating the reproducibility of psychological science," was published Thursday in the journal Science. The sweeping effort was led by the Center for Open Science, a nonprofit based in Charlottesville. The center's director, Brian Nosek, a University of Virginia psychology professor, said the review focused on the field of psychology because the leaders of the center are themselves psychologists.
Despite the rather gloomy results, the new paper pointed out that this kind of verification is precisely what scientists are supposed to do: “Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should.”
The phenomenon -- irreproducible results -- has been a nagging issue in the science world in recent years. That's partly due to a few spectacular instances of fraud, such as when Dutch psychologist Diederik Stapel admitted in 2011 that he’d been fabricating his data for years.
A more fundamental problem, say Nosek and other reform-minded scientists, is that researchers seeking tenure, grants or professional acclaim feel tremendous pressure to do experiments that have the kind of snazzy results that can be published in prestigious journals.
They don’t intentionally do anything wrong, but may succumb to motivated reasoning. That’s a subtle form of bias, like unconsciously putting your thumb on the scale. Researchers see what they want and hope to see, or tweak experiments to get a more significant result.
Moreover, there's the phenomenon of "publication bias.” Journals are naturally eager to publish significant results rather than null results. The problem is that, by random chance, some experiments will produce results that appear significant but are merely anomalies – spikes in the data that might mean nothing.
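The "random chance" problem is easy to see in a quick simulation. The sketch below is purely illustrative and uses invented data: it runs many experiments in which there is no real effect at all and counts how many nonetheless cross the conventional p < 0.05 threshold, roughly the pool of flukes that publication bias tends to favor.

```python
# Illustrative sketch only: with no true effect anywhere, about 5 percent of
# experiments still come out "statistically significant" purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 1000
false_positives = 0

for _ in range(n_experiments):
    control = rng.normal(0, 1, 30)      # no real difference between groups
    treatment = rng.normal(0, 1, 30)
    if stats.ttest_ind(control, treatment).pvalue < 0.05:
        false_positives += 1            # a fluke that might look publishable

print(false_positives / n_experiments)  # close to 0.05
```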
Reformers like Nosek want their colleagues to pre-register their experimental protocols and share their data so that the rest of the community can see how the sausage is made. Meanwhile, editors at Science, Nature and other top journals have crafted new standards that require more detailed explanations of how experiments are conducted.
Gilbert Chin, senior editor of the journal Science, said in a teleconference this week, “This somewhat disappointing outcome does not speak directly to the validity or the falsity of the theories. What it does say is that we should be less confident about many of the experimental results that were provided as empirical evidence in support of those theories.”
John Ioannidis, a professor of medicine at Stanford, has argued for years that most scientific results are less robust than researchers believe. He published a paper in 2005 with the instantly notorious title, "Why Most Published Research Findings Are False."
In an interview this week, Ioannidis called the new paper “a landmark for psychological science” and said it should have repercussions beyond the field of psychology. He said the paper validates his long-standing argument, “and I feel sorry for that. I wish I had been proven wrong.”
The 100 replication attempts, whether successful or unsuccessful, do not definitively prove or disprove the results of the original experiments, noted Marcia McNutt, editor-in-chief of the Science family of journals. There are many reasons that a replication might fail to yield the same kind of data.
Perhaps the replication was flawed in some key way – a strong possibility in experiments that have multiple moving parts and many human factors.
And science is conducted on the edge of the knowable, often in search of small, marginal effects.
“The only finding that will replicate 100 percent of the time is one that’s likely to be trite and boring and probably already known,” said Alan Kraut, executive director of the Association for Psychological Science. “I mean, yes, dead people can never be taught to read.”
One experiment that underwent replication had originally showed that students who drank a sugary beverage were better able to make a difficult decision about whether to live in a big apartment far from campus or a smaller one closer to campus. But that first experiment was conducted at Florida State University. The replication took place at the University of Virginia. The housing decisions around Charlottesville were much simpler -- effectively blowing up the experiment even before the first sugary beverage had been consumed.
Another experiment had shown, the first time around, that students exposed to a text that undermined their belief in free will were more likely to engage in cheating behavior. The replication, however, showed no such effect.
The co-author of the original paper, Jonathan Schooler, a psychologist at the University of California at Santa Barbara, said he still believes his original findings would hold up under specified conditions, but added, “Those conditions may be more narrowly specified than we originally appreciated.”
He has himself been an advocate for improving reproducibility, and said the new study shouldn’t tarnish the reputation of his field: “Psychology’s really leading the charge here in investigating the science of science.”
Nosek acknowledged that this new study is itself one that would be tricky to reproduce exactly, because there were subjective decisions made along the way and judgment calls about what, exactly, "reproduced" means. The very design of the review injected the possibility of bias, in that the volunteer scientists who conducted the replications were allowed to pick which experiments they wanted to do.
“At every phase of this process, decisions were made that might not be exactly the same kind of decision that another group would make,” Nosek said.
There are about 1.5 million scientific studies published a year, he said. This review looked at only 100 studies.
That’s a small sample size – another reason to be hesitant before declaring the discovery of a new truth.
What Is Replication in Psychology Research?
Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."
Replication refers to the repetition of a research study, generally with different situations and subjects, to determine if the basic findings of the original study can be applied to other participants and circumstances.
In other words, when researchers replicate a study, it means they reproduce the experiment to see if they can obtain the same outcomes.
Once a study has been conducted, researchers might be interested in determining if the results hold true in other settings or for other populations. In other cases, scientists may want to replicate the experiment to further demonstrate the results.
At a Glance
In psychology, replication is defined as reproducing a study to see if you get the same results. It's an important part of the research process that strengthens our understanding of human behavior. It's not always a perfect process, however, and extraneous variables and other factors can interfere with results.
For example, imagine that health psychologists perform an experiment showing that hypnosis can be effective in helping middle-aged smokers kick their nicotine habit. Other researchers might want to replicate the same study with younger smokers to see if they reach the same result.
Exact replication is not always possible. Ethical standards may prevent modern researchers from replicating studies that were conducted in the past, such as Stanley Milgram's infamous obedience experiments .
That doesn't mean that researchers don't perform replications; it just means they have to adapt their methods and procedures. For example, researchers have replicated Milgram's study using lower shock thresholds and improved informed consent and debriefing procedures.
Why Replication Is Important in Psychology
When studies are replicated and achieve the same or similar results as the original study, it gives greater validity to the findings. If a researcher can replicate a study’s results, it is more likely that those results can be generalized to the larger population.
Human behavior can be inconsistent and difficult to study. Even when researchers are cautious about their methods, extraneous variables can still create bias and affect results.
That's why replication is so essential in psychology. It strengthens findings, helps detect potential problems, and improves our understanding of human behavior.
How Do Scientists Replicate an Experiment?
When conducting a study or experiment , it is essential to have clearly defined operational definitions. In other words, what is the study attempting to measure?
When replicating earlier research, experimenters follow the same procedures but with a different group of participants. If the researcher obtains the same or similar results in follow-up experiments, it means that the original results are less likely to be a fluke.
The steps involved in replicating a psychology experiment often include the following:
- Review the original experiment : The goal of replication is to use the exact methods and procedures the researchers used in the original experiment. Reviewing the original study to learn more about the hypothesis, participants, techniques, and methodology is important.
- Conduct a literature review : Review the existing literature on the subject, including any other replications or previous research. Considering these findings can provide insights into your own research.
- Perform the experiment : The next step is to conduct the experiment. During this step, keeping your conditions as close as possible to the original experiment is essential. This includes how you select participants, the equipment you use, and the procedures you follow as you collect your data.
- Analyze the data : As you analyze the data from your experiment, you can better understand how your results compare to the original results.
- Communicate the results : Finally, you will document your processes and communicate your findings. This is typically done by writing a paper for publication in a professional psychology journal. Be sure to carefully describe your procedures and methods, describe your findings, and discuss how your results compare to the original research.
So what happens if the original results cannot be reproduced? Does that mean that the experimenters conducted bad research or that, even worse, they lied or fabricated their data?
In many cases, non-replicated research is caused by differences in the participants or in other extraneous variables that might influence the results of an experiment. Sometimes the differences might not be immediately clear, but other researchers might be able to discern which variables could have impacted the results.
For example, minor differences in things like the way questions are presented, the weather, or even the time of day the study is conducted might have an unexpected impact on the results of an experiment. Researchers might strive to perfectly reproduce the original study, but variations are expected and often impossible to avoid.
Are the Results of Psychology Experiments Hard to Replicate?
In 2015, a group of 271 researchers published the results of their five-year effort to replicate 100 different experimental studies previously published in three top psychology journals. The replicators worked closely with the original researchers of each study in order to replicate the experiments as closely as possible.
The results were less than stellar. Of the 100 experiments in question, 61% could not be replicated with the original results. Of the original studies, 97% of the findings were deemed statistically significant. Only 36% of the replicated studies were able to obtain statistically significant results.
As one might expect, these dismal findings caused quite a stir. You may have heard this referred to as the "replication crisis" in psychology.
Similar replication attempts have produced similar results. Another study published in 2018 replicated 21 social and behavioral science studies. In these studies, the researchers were only able to successfully reproduce the original results about 62% of the time.
So why are psychology results so difficult to replicate? Writing for The Guardian, John Ioannidis suggested that there are a number of reasons why this might happen, including competition for research funds and the powerful pressure to obtain significant results. There is little incentive to retest, so many results obtained purely by chance are simply accepted without further research or scrutiny.
The American Psychological Association suggests that the problem stems partly from the research culture. Academic journals are more likely to publish novel, innovative studies rather than replication research, creating less of an incentive to conduct that type of research.
Reasons Why Research Cannot Be Replicated
The project authors suggest that there are three potential reasons why the original findings could not be replicated.
- The original results were a false positive.
- The replicated results were a false negative.
- Both studies were correct but differed due to unknown differences in experimental conditions or methodologies.
The Nobel Prize-winning psychologist Daniel Kahneman has suggested that because published studies are often too vague in describing methods used, replications should involve the authors of the original studies to more carefully mirror the methods and procedures used in the original research.
In fact, one investigation found that replication rates are much higher when original researchers are involved.
While some might be tempted to look at the results of such replication projects and assume that psychology is more art than science, many suggest that such findings actually help make psychology a stronger science. Human thought and behavior are remarkably subtle and ever-changing subjects of study.
In other words, it's normal and expected for variations to exist when observing diverse populations and participants.
Some research findings might be wrong, but digging deeper, pointing out the flaws, and designing better experiments help strengthen the field. The APA notes that replication research represents a great opportunity for students: it can help strengthen research skills and contribute to science in a meaningful way.
Nosek BA, Errington TM. What is replication? PLoS Biol. 2020;18(3):e3000691. doi:10.1371/journal.pbio.3000691
Burger JM. Replicating Milgram: Would people still obey today? Am Psychol. 2009;64(1):1-11. doi:10.1037/a0010932
Makel MC, Plucker JA, Hegarty B. Replications in psychology research: How often do they really occur? Perspectives on Psychological Science. 2012;7(6):537-542. doi:10.1177/1745691612460688
Aarts AA, Anderson JE, Anderson CJ, et al. Estimating the reproducibility of psychological science. Science. 2015;349(6251). doi:10.1126/science.aac4716
Camerer CF, Dreber A, Holzmeister F, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637-644. doi:10.1038/s41562-018-0399-z
American Psychological Association. Leaning into the replication crisis: Why you should consider conducting replication research.
Kahneman D. A new etiquette for replication. Social Psychology. 2014;45(4):310-311.
By Kendra Cherry, MSEd
Science News
A massive 8-year effort finds that much cancer research can’t be replicated.
Unreliable preclinical studies could impede drug development later on
An effort to replicate nearly 200 preclinical cancer experiments that generated buzz from 2010 to 2012 found that only about a quarter could be reproduced. Prostate cancer cells are shown in this artist’s illustration.
Dr_Microbe/iStock/Getty Images Plus
By Tara Haelle
December 7, 2021 at 8:00 am
After eight years, a project that tried to reproduce the results of key cancer biology studies has finally concluded. And its findings suggest that like research in the social sciences, cancer research has a replication problem.
Researchers with the Reproducibility Project: Cancer Biology aimed to replicate 193 experiments from 53 top cancer papers published from 2010 to 2012. But only a quarter of those experiments were able to be reproduced, the team reports in two papers published December 7 in eLife.
The researchers couldn’t complete the majority of experiments because the team couldn’t gather enough information from the original papers or their authors about methods used, or obtain the necessary materials needed to attempt replication.
What’s more, of the 50 experiments from 23 papers that were reproduced, effect sizes were, on average, 85 percent lower than those reported in the original experiments. Effect sizes indicate how big the effect found in a study is. For example, two studies might find that a certain chemical kills cancer cells, but the chemical kills 30 percent of cells in one experiment and 80 percent of cells in a different experiment. The first experiment has less than half the effect size seen in the second one.
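To put numbers on that comparison, here is a minimal sketch based on the hypothetical 30 percent versus 80 percent example above; labeling one value "original" and the other "replication" is an illustrative assumption, not data from any experiment in the project.

```python
# Illustrative sketch: comparing effect sizes, using the hypothetical
# cell-killing percentages mentioned above (not real study data).
original_effect = 0.80     # fraction of cancer cells killed in one experiment
replication_effect = 0.30  # fraction killed when the experiment is repeated

ratio = replication_effect / original_effect
print(f"The replication effect is {ratio:.2f} times the original, "
      f"i.e. about {(1 - ratio) * 100:.0f}% smaller.")
# -> 0.38 times the original, about 62% smaller
```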
The team also measured if a replication was successful using five criteria. Four focused on effect sizes, and the fifth looked at whether both the original and replicated experiments had similarly positive or negative results, and if both sets of results were statistically significant. The researchers were able to apply those criteria to 112 tested effects from the experiments they could reproduce. Ultimately, just 46 percent, or 51, met more criteria than they failed, the researchers report.
“The report tells us a lot about the culture and realities of the way cancer biology works, and it’s not a flattering picture at all,” says Jonathan Kimmelman, a bioethicist at McGill University in Montreal. He coauthored a commentary on the project exploring the ethical aspects of the findings.
It’s worrisome if experiments that cannot be reproduced are used to launch clinical trials or drug development efforts, Kimmelman says. If it turns out that the science on which a drug is based is not reliable, “it means that patients are needlessly exposed to drugs that are unsafe and that really don’t even have a shot at making an impact on cancer,” he says.
At the same time, Kimmelman cautions against overinterpreting the findings as suggesting that the current cancer research system is broken. “We actually don’t know how well the system is working,” he says. One of the many questions left unresolved by the project is what an appropriate rate of replication is in cancer research, since replicating all studies perfectly isn’t possible. “That’s a moral question,” he says. “That’s a policy question. That’s not really a scientific question.”
The overarching lessons of the project suggest that substantial inefficiency in preclinical research may be hampering the drug development pipeline later on, says Tim Errington, who led the project. He is the director of research at the Center for Open Science in Charlottesville, Va., which cosponsored the research.
As many as 14 out of 15 cancer drugs that enter clinical trials never receive approval from the U.S. Food and Drug Administration. Sometimes that’s because the drugs lack commercial potential, but more often it is because they do not show the level of safety and effectiveness needed for licensure.
Much of that failure is expected. “We’re humans trying to understand complex disease, we’re never going to get it right,” Errington says. But given the cancer reproducibility project’s findings, perhaps “we should have known that we were failing earlier, or maybe we don’t understand actually what’s causing [an] exciting finding,” he says.
Still, it’s not that failure to replicate means that a study was wrong or that replicating it means that the findings are correct, says Shirley Wang, an epidemiologist at Brigham and Women’s Hospital in Boston and Harvard Medical School. “It just means that you’re able to reproduce,” she says, a point that the reproducibility project also stresses.
Scientists still have to evaluate whether a study’s methods are unbiased and rigorous, says Wang, who was not involved in the project but reviewed its findings. And if the results of original experiments and their replications do differ, it’s a learning opportunity to find out why and the implications, she adds.
Errington and his colleagues have reported on subsets of the cancer reproducibility project's findings before, but this is the first time that the effort's entire analysis has been released (SN: 1/18/17).
During the project, the researchers faced a number of obstacles, particularly that none of the original experiments included enough details in their published studies about methods to attempt reproduction. So the reproducibility researchers contacted the studies’ authors for additional information.
While authors for 41 percent of the experiments were extremely or very helpful, authors for another third of the experiments did not reply to requests for more information or were not otherwise helpful, the project found. For example, one of the experiments that the group was unable to replicate required the use of a mouse model specifically bred for the original experiment. Errington says that the scientists who conducted that work refused to share some of these mice with the reproducibility project, and without those rodents, replication was impossible.
Some researchers were outright hostile to the idea that independent scientists wanted to attempt to replicate their work, says Brian Nosek, executive director at the Center for Open Science and a coauthor on both studies. That attitude is a product of a research culture that values innovation over replication, and that prizes the academic publish-or-perish system over cooperation and data sharing, Nosek says.
Some scientists may feel threatened by replication because it is uncommon. “If replication is normal and routine, people wouldn’t see it as a threat,” Nosek says. But replication may also feel intimidating because scientists’ livelihoods and even identities are often so deeply rooted in their findings, he says. “Publication is the currency of advancement, a key reward that turns into chances for funding, chances for a job and chances for keeping that job,” Nosek says. “Replication doesn’t fit neatly into that rewards system.”
Even authors who wanted to help couldn’t always share their data for various reasons, including lost hard drives or intellectual property restrictions or data that only former graduate students had.
Calls from some experts about science's "reproducibility crisis" have been growing for years, perhaps most notably in psychology (SN: 8/27/18). Then in 2011 and 2012, pharmaceutical companies Bayer and Amgen reported difficulties in replicating findings from preclinical biomedical research.
But not everyone agrees on solutions, including whether replication of key experiments is actually useful or possible, or even what exactly is wrong with the way science is done or what needs to improve (SN: 1/13/15).
At least one clear, actionable conclusion emerged from the new findings, says Yvette Seger, director of science policy at the Federation of American Societies for Experimental Biology. That’s the need to provide scientists with as much opportunity as possible to explain exactly how they conducted their research.
“Scientists should aspire to include as much information about their experimental methods as possible to ensure understanding about results on the other side,” says Seger, who was not involved in the reproducibility project.
Ultimately, if science is to be a self-correcting discipline, there needs to be plenty of opportunities not only for making mistakes but also for discovering those mistakes, including by replicating experiments, the project’s researchers say.
“In general, the public understands science is hard, and I think the public also understands that science is going to make errors,” Nosek says. “The concern is and should be, is science efficient at catching its errors?” The cancer project’s findings don’t necessarily answer that question, but they do highlight the challenges of trying to find out.
Rigorous research practices improve scientific replication
Science has suffered a crisis of replication—too few scientific studies can be repeated by peers. A new study from Stanford and three leading research universities shows that using rigorous research practices can boost the replication rate of studies.
The research article described in this story has been retracted.
Science has a replication problem. In recent years, it has come to light that the findings of many studies, particularly those in social psychology, cannot be reproduced by other scientists. When this happens, the data, methods, and interpretation of the study’s results are often called into question, creating a crisis of confidence.
“When people don’t trust science, that’s bad for society,” said Jon Krosnick, the Frederic O. Glover Professor of Humanities and Social Sciences in the Stanford School of Humanities and Sciences. Krosnick is one of four co-principal investigators on a study that explored ways scientists in fields ranging from physics to psychology can improve the replicability of their research. The study, published Nov. 9 in Nature Human Behaviour, found that using rigorous methodology can yield near-perfect rates of replication.
“Replicating others’ scientific results is fundamental to the scientific process,” Krosnick argues. According to a paper published in 2015 in Science, fewer than half of the findings of psychology studies could be replicated—and only 30 percent for studies in the field of social psychology. Such findings “damage the credibility of all scientists, not just those whose findings cannot be replicated,” Krosnick explained.
Publish or perish
“Scientists are people, too,” said Krosnick, who is a professor of communication and of political science in H&S and of social sciences in the Stanford Doerr School of Sustainability. “Researchers want to make their funders happy and to publish head-turning results. Sometimes, that inspires researchers to make up or misrepresent data. Almost every day, I see a new story about a published study being retracted—in physics, neuroscience, medicine, you name it. Showing that scientific findings can be replicated is the only pathway to solving the credibility problem.”
Accordingly, Krosnick added that the publish-or-perish environment creates the temptation to fake data, or to analyze and reanalyze the data with various methods until a desired result finally pops out even though it is not actually real, a practice known as p-hacking.
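P-hacking is easier to picture with a toy simulation. The sketch below (Python, with made-up parameters; it is not part of the Stanford study) generates data in which there is no real effect, lets a hypothetical analyst test five different outcomes, and keeps whichever result looks best. Counting how often that cherry-picking yields “significance” shows why such findings are often not real.

```python
# A minimal sketch of p-hacking, assuming a hypothetical analyst who measures
# five outcomes and reports only the most "significant" one. There is no true
# effect in these data, so every significant result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations = 2_000
n_per_group = 30
n_outcomes = 5           # number of analyses the analyst gets to try
false_positives = 0

for _ in range(n_simulations):
    # Treatment and control come from the same distribution: no real effect.
    treatment = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    p_values = [stats.ttest_ind(t, c).pvalue for t, c in zip(treatment, control)]
    if min(p_values) < 0.05:   # keep whichever analysis "worked"
        false_positives += 1

print(f"False-positive rate with cherry-picking: {false_positives / n_simulations:.2f}")
```

With five chances at significance and no real effect, the odds of at least one p value below 0.05 are roughly 1 - 0.95^5, or about 23 percent, more than four times the 5 percent a single preregistered test would allow.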
In an effort to assess the true potential of rigorous social science findings to be replicated, Krosnick’s lab at Stanford and labs at the University of California, Santa Barbara; the University of Virginia; and the University of California, Berkeley set out to discover new experimental effects using best practices and to assess how often they could be reproduced. The four teams attempted to replicate the results of 16 studies using rigor-enhancing practices.
“The results reassure me that painstakingly rigorous methods pay off,” said Bo MacInnis, a Stanford lecturer and study co-author whose research on political communication was conducted under the parameters of the replicability study. “Scientific researchers can effectively and reliably govern themselves in a way that deserves and preserves the public’s highest trust.”
Matthew DeBell, director of operations at the American National Election Studies program at the Stanford Institute for Research in the Social Sciences, is also a co-author.
“The quality of scientific evidence depends on the quality of the research methods,” DeBell said. “Research findings do hold up when everything is done as well as possible, underscoring the importance of adhering to the highest standards in science.”
Transparent methods
In the end, the team found that when four “rigor-enhancing” practices were implemented, the replication rate was almost 90 percent. Although the recommended steps place additional burdens on the researchers, those practices are relatively straightforward and not particularly onerous.
These practices call for researchers to run confirmatory tests on their own studies to corroborate results prior to publication. Data should be collected from a sufficiently large sample of participants. Scientists should preregister all studies, committing to the hypotheses to be tested and the methods to be used to test them before data are collected, to guard against p-hacking. And researchers must fully document their procedures to ensure that peers can precisely repeat them.
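To make the sample-size practice concrete, here is a minimal sketch of the kind of power calculation a lab might run before collecting any data. The effect size is a guess chosen for illustration; nothing here is taken from the paper itself.

```python
# A rough power-analysis sketch (illustrative numbers, not the paper's own
# procedure): choose the per-group sample size that gives a 90% chance of
# detecting the expected effect, instead of collecting data until a result
# happens to look significant.
from statsmodels.stats.power import TTestIndPower

expected_effect = 0.4    # assumed standardized effect size (Cohen's d), a guess for illustration
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=expected_effect,
                                   alpha=0.05,       # significance threshold
                                   power=0.90,       # desired chance of detecting a real effect
                                   alternative="two-sided")
print(f"Participants needed per group: {n_per_group:.0f}")   # on the order of 130 for d = 0.4
```

A smaller expected effect drives the required sample up quickly, which is one reason underpowered studies with surprising results so often fail to replicate.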
The four labs conducted original research using these recommended rigor-enhancing practices. Then they submitted their work to the other labs for replication. Overall, of the 16 studies produced by the four labs during the five-year project, replication was successful in 86 percent of the attempts.
“The bottom line in this study is that when science is done well, it produces believable, replicable, and generalizable findings,” Krosnick said. “What I and the other authors of this study hope will be the takeaway is a wake-up call to other disciplines to doubt their own work, to develop and adopt their own best practices, and to change how we all publish by building in replication routinely. If we do these things, we can restore confidence in the scientific process and in scientific findings.”
Acknowledgements
Krosnick is also a professor, by courtesy, of psychology in H&S. Additional authors include lead author John Protzko of Central Connecticut State University; Leif Nelson, a principal investigator from the University of California, Berkeley; Brian Nosek, a principal investigator from the University of Virginia; Jordan Axt of McGill University; Matt Berent of Matt Berent Consulting; Nicholas Buttrick and Charles R. Ebersole of the University of Virginia; Sebastian Lundmark of the University of Gothenburg, Gothenburg, Sweden; Michael O’Donnell of Georgetown University; Hannah Perfecto of Washington University in St. Louis; James E. Pustejovsky of the University of Wisconsin, Madison; Scott Roeder of the University of South Carolina; Jan Walleczek of the Fetzer Franklin Fund; and senior author and project principal investigator Jonathan Schooler of the University of California, Santa Barbara.
This research was funded by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust.
Competing Interests
Nosek is the executive director of the nonprofit Center for Open Science. Walleczek was the scientific director of the Fetzer Franklin Fund that sponsored this research, and Nosek was on the fund’s scientific advisory board. Walleczek made substantive contributions to the design and execution of this research but as a funder did not have controlling interest in the decision to publish or not. All other authors declared no conflicts of interest.
A new replication crisis: Research that is less likely to be true is cited more
Papers that cannot be replicated are cited 153 times more because their findings are interesting.
Papers in leading psychology, economics and general science journals that fail to replicate, and therefore are less likely to be true, are often among the most cited papers in academic research, according to a new study by the University of California San Diego's Rady School of Management.
Published in Science Advances , the paper explores the ongoing "replication crisis" in which researchers have discovered that many findings in the fields of social sciences and medicine don't hold up when other researchers try to repeat the experiments.
The paper reveals that findings from studies that cannot be verified when the experiments are repeated have a bigger influence over time. The unreliable research tends to be cited as if the results were true long after the publication failed to replicate.
"We also know that experts can predict well which papers will be replicated," write the authors Marta Serra-Garcia, assistant professor of economics and strategy at the Rady School and Uri Gneezy, professor of behavioral economics also at the Rady School. "Given this prediction, we ask 'why are non-replicable papers accepted for publication in the first place?'"
Their possible answer is that review teams of academic journals face a trade-off. When the results are more "interesting," they apply lower standards regarding their reproducibility.
The link between interesting findings and nonreplicable research may also explain why such work is cited at a much higher rate: the authors found that papers that successfully replicate are cited 153 times less than those that failed to.
"Interesting or appealing findings are also covered more by media or shared on platforms like Twitter, generating a lot of attention, but that does not make them true," Gneezy said.
Serra-Garcia and Gneezy analyzed data from three influential replication projects that tried to systematically replicate the findings in top psychology, economics and general science journals (Nature and Science). In psychology, only 39 percent of the 100 experiments replicated successfully. In economics, 61 percent of the 18 studies replicated, as did 62 percent of the 21 studies published in Nature/Science.
With the findings from these three replication projects, the authors used Google Scholar to test whether papers that failed to replicate are cited significantly more often than those that were successfully replicated, both before and after the replication projects were published. The largest gap was in papers published in Nature/Science: non-replicable papers were cited 300 times more than replicable ones.
When the authors took into account several characteristics of the studies replicated -- such as the number of authors, the rate of male authors, the details of the experiment (location, language and online implementation) and the field in which the paper was published -- the relationship between replicability and citations was unchanged.
They also show that the impact of such citations grows over time. Yearly citation counts reveal a pronounced gap between papers that replicated and those that did not: on average, papers that failed to replicate are cited 16 times more per year. This gap remains even after the replication project is published.
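As a rough illustration of the kind of comparison described above, the sketch below uses invented numbers (not the authors' dataset) to compute the raw citation gap between replicated and non-replicated papers and then checks whether it survives controls for paper characteristics.

```python
# A toy version of the citation comparison, with fabricated data. The point is
# the structure of the analysis, not the numbers: compare average yearly
# citations by replication status, then add controls in a regression.
import pandas as pd
import statsmodels.formula.api as smf

papers = pd.DataFrame({
    "citations_per_year": [42, 65, 18, 80, 25, 12, 55, 30],
    "replicated":         [0,   0,  1,  0,  1,  1,  0,  1],   # 1 = finding replicated
    "n_authors":          [3,   5,  2,  4,  3,  2,  6,  3],
    "field":              ["psych", "econ", "psych", "nat_sci",
                           "econ", "psych", "nat_sci", "econ"],
})

# Raw gap: mean yearly citations for non-replicated (0) vs. replicated (1) papers.
print(papers.groupby("replicated")["citations_per_year"].mean())

# Regression version: does the gap persist after controlling for characteristics?
model = smf.ols("citations_per_year ~ replicated + n_authors + C(field)", data=papers).fit()
print(model.params["replicated"])    # negative estimate means replicated papers are cited less
```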
"Remarkably, only 12 percent of post-replication citations of non-replicable findings acknowledge the replication failure," the authors write.
The influence of an inaccurate paper published in a prestigious journal can have repercussions for decades. For example, the study Andrew Wakefield published in The Lancet in 1998 turned tens of thousands of parents around the world against the measles, mumps and rubella vaccine because of an implied link between vaccinations and autism. The incorrect findings were retracted by The Lancet 12 years later, but the claims that autism is linked to vaccines continue.
The authors added that journals may feel pressure to publish interesting findings, and so do academics. For example, in promotion decisions, most academic institutions use citations as an important metric in the decision of whether to promote a faculty member.
This may be the source of the "replication crisis," first identified in the early 2010s.
"We hope our research encourages readers to be cautious if they read something that is interesting and appealing," Serra-Garcia said. "Whenever researchers cite work that is more interesting or has been cited a lot, we hope they will check if replication data is available and what those findings suggest."
Gneezy added, "We care about the field and producing quality research and we want to it to be true."
Story Source:
Materials provided by the University of California San Diego. Original story written by Christine Clark.
Journal Reference:
- Marta Serra-Garcia and Uri Gneezy. Nonreplicable publications are cited more than replicable ones. Science Advances, 2021; 7(21): eabd1705. DOI: 10.1126/sciadv.abd1705
Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results
The massive project shows that reproducibility problems plague even top scientific journals
Brian Handwerk
Science Correspondent
Academic journals and the press regularly serve up fresh helpings of fascinating psychological research findings. But how many of those experiments would produce the same results a second time around?
According to work presented today in Science, fewer than half of 100 studies published in 2008 in three top psychology journals could be replicated successfully. The international effort included 270 scientists who re-ran other people's studies as part of The Reproducibility Project: Psychology, led by Brian Nosek of the University of Virginia.
The eye-opening results don't necessarily mean that those original findings were incorrect or that the scientific process is flawed. When one study finds an effect that a second study can't replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A's result may be false, or Study B's results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.
“This project is not evidence that anything is broken. Rather, it's an example of science doing what science does,” says Christopherson. “It's impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”
Across the sciences, research is considered reproducible when an independent team can conduct a published experiment, following the original methods as closely as possible, and get the same results. It's one key part of the process for building evidence to support theories. Even today, 100 years after Albert Einstein presented his general theory of relativity, scientists regularly repeat tests of its predictions and look for cases where his famous description of gravity does not apply.
"Scientific evidence does not rely on trusting the authority of the person who made the discovery," team member Angela Attwood , a psychology professor at the University of Bristol, said in a statement "Rather, credibility accumulates through independent replication and elaboration of the ideas and evidence."
The Reproducibility Project, a community-based crowdsourcing effort, kicked off in 2011 to test how well this measure of credibility applies to recent research in psychology. Scientists, some recruited and some volunteers, reviewed a pool of studies and selected one for replication that matched their own interest and expertise. Their data and results were shared online and reviewed and analyzed by other participating scientists for inclusion in the large Science study.
To help improve future research, the project analysis attempted to determine which kinds of studies fared the best, and why. They found that surprising results were the hardest to reproduce, and that the experience or expertise of the scientists who conducted the original experiments had little to do with successful replication.
The findings also offered some support for the oft-criticized statistical tool known as the P value, which helps gauge whether a result is statistically significant or likely due to chance. A higher value suggests a result may be a fluke, while a lower value suggests it is statistically significant.
The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated.
But Christopherson suspects that most of his co-authors would not want the study to be taken as a ringing endorsement of P values, because they recognize the tool's limitations. And at least one P value problem was highlighted in the research: the original studies had relatively little variability in their P values, because most journals have established a cutoff of 0.05 for publication. The trouble is that this value can be reached by being selective about data sets, which means scientists looking to replicate a result should also carefully consider the methods and the data used in the original study.
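One way to see why the original P value predicts replication is with a small simulation. The sketch below rests on assumptions of its own (a mix of real and null effects, modest samples) and is not the project's analysis, but it reproduces the pattern: among "significant" results, those that barely cross 0.05 replicate far less often than those with p below 0.001.

```python
# A simulation sketch, assuming 30% of tested ideas reflect a real effect and
# studies use modest samples. It tracks how often an exact repeat of the study
# also reaches p < 0.05, split by how strong the original P value was.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30                          # participants per group, in both original and replication
replications = {"p < 0.001": [], "0.04 < p < 0.05": []}

def run_study(effect):
    """Simulate one two-group study and return its P value."""
    a = rng.normal(effect, 1, n)
    b = rng.normal(0.0, 1, n)
    return stats.ttest_ind(a, b).pvalue

for _ in range(20_000):
    true_effect = 0.5 if rng.random() < 0.3 else 0.0    # only some hypotheses are real
    p_original = run_study(true_effect)
    if p_original < 0.001:
        replications["p < 0.001"].append(run_study(true_effect) < 0.05)
    elif 0.04 < p_original < 0.05:
        replications["0.04 < p < 0.05"].append(run_study(true_effect) < 0.05)

for label, outcomes in replications.items():
    print(f"original {label}: replication rate ~ {np.mean(outcomes):.2f} "
          f"({len(outcomes)} studies)")
# The p < 0.001 bin replicates at a noticeably higher rate than the 0.04-0.05 bin.
```

The gap arises because results that only just clear the 0.05 bar come disproportionately from cases where there was no real effect to find.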
It's also not yet clear whether psychology might be a particularly difficult field for reproducibility—a similar study is currently underway on cancer biology research. In the meantime, Christopherson hopes that the massive effort will spur more such double-checks and revisitations of past research to aid the scientific process.
“Getting it right means regularly revisiting past assumptions and past results and finding new ways to test them. The only way science is successful and credible is if it is self-critical,” he notes.
Unfortunately there are disincentives to pursuing this kind of research, he says: “To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they've been part of the problem, and that devaluing replication in this way has created a less stable literature than we'd like.”