Stanford University

Rigorous research practices improve scientific replication

Science has suffered a crisis of replication—too few scientific studies can be repeated by peers. A new study from Stanford and three leading research universities shows that using rigorous research practices can boost the replication rate of studies.

Science has a replication problem. In recent years, it has come to light that the findings of many studies, particularly those in social psychology, cannot be reproduced by other scientists. When this happens, the data, methods, and interpretation of the study’s results are often called into question, creating a crisis of confidence.

“When people don’t trust science, that’s bad for society,” said Jon Krosnick, the Frederic O. Glover Professor of Humanities and Social Sciences in the Stanford School of Humanities and Sciences. Krosnick is one of four co-principal investigators on a study that explored ways scientists in fields ranging from physics to psychology can improve the replicability of their research. The study, published Nov. 9 in Nature Human Behaviour, found that using rigorous methodology can yield near-perfect rates of replication.


“Replicating others’ scientific results is fundamental to the scientific process,” Krosnick said. According to a paper published in 2015 in Science, fewer than half of the findings of psychology studies could be replicated—and only 30 percent of those in the field of social psychology. Such findings “damage the credibility of all scientists, not just those whose findings cannot be replicated,” Krosnick explained.

Publish or perish

“Scientists are people, too,” said Krosnick, who is a professor of communication and of political science in H&S and of social sciences in the Stanford Doerr School of Sustainability. “Researchers want to make their funders happy and to publish head-turning results. Sometimes, that inspires researchers to make up or misrepresent data.

“Almost every day, I see a new story about a published study being retracted—in physics, neuroscience, medicine, you name it. Showing that scientific findings can be replicated is the only pathway to solving the credibility problem.”

Accordingly, Krosnick added, the publish-or-perish environment creates the temptation to fake the data, or to analyze and reanalyze the data with various methods until a desired but not actually real result finally pops out—a practice known as p-hacking.


In an effort to assess the true potential of rigorous social science findings to be replicated, Krosnick’s lab at Stanford and labs at the University of California, Santa Barbara; the University of Virginia; and the University of California, Berkeley set out to discover new experimental effects using best practices and to assess how often they could be reproduced. The four teams attempted to replicate the results of 16 studies using rigor-enhancing practices.

“The results reassure me that painstakingly rigorous methods pay off,” said Bo MacInnis, a Stanford lecturer and study co-author whose research on political communication was conducted under the parameters of the replicability study. “Scientific researchers can effectively and reliably govern themselves in a way that deserves and preserves the public’s highest trust.”

Matthew DeBell, director of operations at the American National Election Studies program at the Stanford Institute for Research in the Social Sciences, is also a co-author.

“The quality of scientific evidence depends on the quality of the research methods,” DeBell said. “Research findings do hold up when everything is done as well as possible, underscoring the importance of adhering to the highest standards in science.”


Transparent methods

In the end, the team found that when four “rigor-enhancing” practices were implemented, the replication rate was almost 90 percent. Although the recommended steps place additional burdens on the researchers, those practices are relatively straightforward and not particularly onerous.

These practices call for researchers to run confirmatory tests on their own studies to corroborate results prior to publication. Data should be collected from a sufficiently large sample of participants. Scientists should preregister all studies, committing to the hypotheses to be tested and the methods to be used to test them before data are collected, to guard against p-hacking. And researchers must fully document their procedures to ensure that peers can precisely repeat them.

The four labs conducted original research using these recommended rigor-enhancing practices. Then they submitted their work to the other labs for replication. Overall, of the 16 studies produced by the four labs during the five-year project, replication was successful in 86 percent of the attempts.

“The bottom line in this study is that when science is done well, it produces believable, replicable, and generalizable findings,” Krosnick said. “What I and the other authors of this study hope will be the takeaway is a wake-up call to other disciplines to doubt their own work, to develop and adopt their own best practices, and to change how we all publish by building in replication routinely. If we do these things, we can restore confidence in the scientific process and in scientific findings.”

Acknowledgements

Krosnick is also a professor, by courtesy, of psychology in H&S. Additional authors include lead author John Protzko of Central Connecticut State University; Leif Nelson, a principal investigator from the University of California, Berkeley; Brian Nosek, a principal investigator from the University of Virginia; Jordan Axt of McGill University; Matt Berent of Matt Berent Consulting; Nicholas Buttrick and Charles R. Ebersole of the University of Virginia; Sebastian Lundmark of the University of Gothenburg, Gothenburg, Sweden; Michael O’Donnell of Georgetown University; Hannah Perfecto of Washington University in St. Louis; James E. Pustejovsky of the University of Wisconsin–Madison; Scott Roeder of the University of South Carolina; Jan Walleczek of the Fetzer Franklin Fund; and senior author and project principal investigator Jonathan Schooler of the University of California, Santa Barbara.

This research was funded by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust.

Competing Interests

Nosek is the executive director of the nonprofit Center for Open Science. Walleczek was the scientific director of the Fetzer Franklin Fund that sponsored this research, and Nosek was on the fund’s scientific advisory board. Walleczek made substantive contributions to the design and execution of this research but as a funder did not have controlling interest in the decision to publish or not. All other authors declared no conflicts of interest.



Genuine replication and pseudoreplication: what’s the difference?

By Stanley E. Lazic (@StanLazic)

Replication is a key idea in science and statistics, but is often misunderstood by researchers because they receive little education or training on experimental design. Consequently, the wrong entity is replicated in many experiments, leading to pseudoreplication or the “unit of analysis” problem [1,2]. This results in exaggerated sample sizes and a potential increase in both false positives and false negatives – the worst of all possible worlds.

Replication can mean many things

Replication is not always easy to understand because many parts of an experiment can be replicated, and a non-exhaustive list includes:

  • Replicating the measurements taken on a set of samples. Examples include taking two blood pressure readings on each person or dividing a blood sample into two aliquots and measuring the concentration of a substance in each aliquot.
  • Replicating the application of a treatment or intervention to a biological entity of interest. This is the traditional way of increasing the sample size, by increasing the number of treatment–entity pairs; for example, the number of times a drug or vehicle control is randomly and independently applied to a set of rats.

  • Replicating the whole experimental procedure. Repeating the complete experimental procedure on several independent occasions under nominally identical conditions; for example, running an in vitro experiment on three separate days, with cells and reagents prepared fresh each time.

  • Replicating the experimental procedure under different conditions. Repeating the experimental procedure several times, but where a known source of variation is present on each occasion. An example is a multi-centre clinical trial where differences between centres may exist. Another example is a large animal experiment that is broken down into two smaller experiments to make it manageable, and each smaller experiment is run by a different technician.
  • Replicating the experiment by independent researchers. Repeating the whole experiment by researchers that were not part of the initial experiment. This occurs when a paper is published and others try to obtain the same results.

To add to the confusion, terms with related meanings exist, such as repeatability, reproducibility, and replicability. Furthermore, the reasons for having or increasing replication are diverse and include a need to increase statistical power, a desire to make the results more generalisable, or the result of a practical constraint, such as an inability to recruit enough patients in one centre and so multiple centres are needed.

Requirements for genuine replication

How do you design an experiment to have genuine replication and not pseudoreplication? First, ensure that replication is at the level of the biological question or scientific hypothesis. For example, to test the effectiveness of a drug in rats, give the drug to multiple rats, and compare the result with other rats that received a control treatment (corresponding to example 2 above). Multiple measurements on each rat (example 1 above) do not count towards genuine replication.

To test if a drug kills proliferating cells in a well compared to a control condition, you will need multiple drug and control wells, since the drug is applied on a per-well basis. But you may worry that the results from a single experimental run will not generalise – even if you can perform a valid statistical test – because results from in vitro experiments can be highly variable. You could then repeat the experiment four times (corresponding to example 3 above), and the sample size is now four, not the total number of wells that were used across all of the experimental runs. This second option requires more work, will take longer, and will usually have lower power, but it provides a more robust result because the experimenter’s ability to reproduce the treatment effect across multiple experimental runs has been replicated.

To test if pre-registered studies report different effect sizes from traditional studies that are not pre-registered, you will need multiple studies of both types (corresponding to example 5 above). The number of subjects in each of these studies is irrelevant for testing this study-level hypothesis.

Replication at the level of the question or hypothesis is a necessary but not sufficient condition for genuine replication – three criteria must be satisfied [1,3]:

  • For experiments, the biological entities of interest must be randomly and independently assigned to treatment groups. If this criterion holds, the biological entities are also called the experimental units [1,3].
  • The treatment(s) should be applied independently to each experimental unit. Injecting animals with a drug is an independent application of a treatment, whereas putting the drug in the drinking water shared by all animals in a cage is not.
  • The experimental units should not influence each other, especially on the measured outcome variables. This criterion is often impossible to verify – how do you prove that the aggressive behaviour of one rat in a cage is not influencing the behaviour of the other rats?

It follows that cells in a well or neurons in a brain or slice culture can rarely be considered genuine replicates because the above criteria are unlikely to be met, whereas fish in a tank, rats in a cage, or pigs in a pen could be genuine replicates in some cases but not in others. If the criteria are not met, the solution is to replicate one level up in the biological or technical hierarchy. For example, if you’re interested in the effect of a drug on cells in an in vitro experiment, but cannot use cells as genuine replicates, then the number of wells can be the replicates, and the measurements on cells within a well can be averaged so that the number of data points corresponds to the number of wells, that is, the sample size (hierarchical or multi-level models can also be used and don’t require values to be averaged because they take the structure of the data into account, but they are harder to implement and interpret compared with averaging followed by simpler statistical methods). Similarly, if rats in a cage cannot be considered genuine replicates, then calculating a cage-averaged value and using cages as genuine replicates is an appropriate solution (or a multi-level model).
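A minimal simulation sketch (Python; not from the original post, and all numbers are illustrative) shows why averaging to the level of the experimental unit matters: treating every cell as an independent data point inflates the apparent sample size, whereas analysing well means gives a test whose sample size equals the number of wells.

```python
# Sketch: pseudoreplication vs. analysis at the level of the experimental unit.
# All numbers (wells, cells per well, variance components, effect) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_wells_per_group = 4     # wells are the experimental units (drug vs. control)
n_cells_per_well = 50     # cells are sub-samples within each well
between_well_sd = 0.5     # variation between wells (plate position, pipetting, etc.)
within_well_sd = 1.0      # variation between cells within a well
drug_effect = 0.0         # the null hypothesis is true: the drug does nothing

def simulate_group(group_mean):
    """Return a (wells x cells) matrix of per-cell measurements for one group."""
    well_means = group_mean + rng.normal(0.0, between_well_sd, n_wells_per_group)
    return well_means[:, None] + rng.normal(
        0.0, within_well_sd, (n_wells_per_group, n_cells_per_well))

control = simulate_group(0.0)
drug = simulate_group(drug_effect)

# Pseudoreplication: every cell treated as a replicate (n = 200 per group)
p_cells = stats.ttest_ind(drug.ravel(), control.ravel()).pvalue

# Genuine replication: average within each well, wells are the replicates (n = 4)
p_wells = stats.ttest_ind(drug.mean(axis=1), control.mean(axis=1)).pvalue

print(f"per-cell analysis: p = {p_cells:.4f}  (inflated sample size)")
print(f"per-well analysis: p = {p_wells:.4f}  (sample size = number of wells)")
```

Run over many simulated datasets, the per-cell test rejects the true null far more often than the nominal 5 percent because cells within a well share the well effect, whereas the per-well test keeps the error rate at its nominal level.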

If genuine replication is too low, the experiment may be unable to answer any scientific questions of interest. Therefore issues about replication must be resolved when designing an experiment, not after the data have been collected. For example, if cages are the genuine replicates and not the rats, then putting fewer rats in a cage and having more cages will increase power; and power is maximised with one rat per cage, but this may be undesirable for other reasons.
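As a rough numerical illustration of that design trade-off, the sketch below uses a normal-approximation power calculation with assumed, purely illustrative variance components and effect size; it is not taken from the original post.

```python
# Rough design comparison: a fixed budget of 24 rats per group, cages as the replicates.
# Variance components and effect size are illustrative assumptions, not data.
import numpy as np
from scipy.stats import norm

sigma_between_cage = 1.0   # SD of cage-to-cage effects
sigma_within_cage = 1.0    # SD of rats within a cage
effect = 1.0               # assumed true treatment difference
total_rats = 24            # rats available per group
alpha = 0.05

for rats_per_cage in (1, 2, 4, 8):
    n_cages = total_rats // rats_per_cage
    # Variance of a cage mean, then SE of the difference between two groups of cage means
    var_cage_mean = sigma_between_cage**2 + sigma_within_cage**2 / rats_per_cage
    se_diff = np.sqrt(2 * var_cage_mean / n_cages)
    power = norm.cdf(effect / se_diff - norm.ppf(1 - alpha / 2))  # normal approximation
    print(f"{n_cages:2d} cages x {rats_per_cage} rats/cage: approximate power = {power:.2f}")
```

With the total number of animals held fixed, power falls as more rats are packed into fewer cages, which is exactly the point made above.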

Confusing pseudoreplication for genuine replication reduces our ability to learn from experiments, understand nature, and develop treatments for diseases. It is also easily fixed. The requirements for genuine replication, like the definition of a p-value, are often misunderstood by researchers, despite many papers on the topic. An open-access overview is provided in reference [1], and reference [3] has a detailed discussion along with analysis options for many experimental designs.

[1] Lazic SE, Clarke-Williams CJ, Munafò MR (2018). What exactly is “N” in cell culture and animal experiments? PLoS Biol 16(4):e2005282. https://doi.org/10.1371/journal.pbio.2005282

[2] Lazic SE (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11:5. https://doi.org/10.1186/1471-2202-11-5

[3] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge University Press, Cambridge, UK. https://www.cambridge.org/Lazic


Stanley E. Lazic is Co-founder and Chief Scientific Officer at Prioris.ai Inc.

Prioris.ai, Suite 459, 207 Bank Street, Ottawa ON, K2P 2N2, Canada.



Open access | Published: 17 January 2020

Low replicability can support robust and efficient science

  • Stephan Lewandowsky, ORCID: orcid.org/0000-0003-1655-2013 (1, 2)
  • Klaus Oberauer, ORCID: orcid.org/0000-0003-3902-7318 (3)

Nature Communications volume 11, Article number: 358 (2020)


A Publisher Correction to this article was published on 11 August 2020


There is broad agreement that psychology is facing a replication crisis. Even some seemingly well-established findings have failed to replicate. Numerous causes of the crisis have been identified, such as underpowered studies, publication bias, imprecise theories, and inadequate statistical procedures. The replication crisis is real, but it is less clear how it should be resolved. Here we examine potential solutions by modeling a scientific community under various replication regimes. In one regime, all findings are replicated before publication to guard against subsequent replication failures. In an alternative regime, individual studies are published and are replicated after publication, but only if they attract the community’s interest. We find that the publication of potentially non-replicable studies minimizes cost and maximizes efficiency of knowledge gain for the scientific community under a variety of assumptions. Our findings suggest that, provided it is properly managed, low replicability can support robust and efficient science.


Introduction

Replicability is fundamental to science 1 . Any finding that cannot be replicated at best fails to contribute to knowledge and, at worst, wastes other researchers’ time when they pursue a blind alley based on an unreliable result. The fact that the replicability of published findings is <30% in social psychology, hovers ~50% in cognitive psychology 2 , and remains at ~67% even for studies published in Nature and Science 3 , has therefore justifiably stimulated much concern and debate 4 , 5 . At one end of the spectrum, it has been suggested that failures to replicate are best considered as an interaction triggered by one or more (typically unknown) moderator variables that capture the idiosyncratic conditions prevailing during the original study, but that were absent during the replication attempt 6 . (An overview of this position can be found in ref. 1 ). On this account, what matters is not whether an exact replication can reproduce the original effect but whether the underlying theory finds further support in conceptual replications (studies that use a variety of different manipulations or measures to operationalize the crucial theoretical variables) 6 . Contrary to this position, the success of independent replications is no greater for effects that were initially reported together with conceptual replications than effects that were reported in isolation 7 .

At the other end of the spectrum is the view that low replicability arises for a number of reasons related to currently widespread—but suboptimal—research practices 8 . Several factors have been identified: (1) The use of small samples and the resultant low power of studies contributes to low replicability because the significant effect reported in an underpowered study is more likely to represent a type I error than the same effect obtained with a powerful study 9 . The harmful effects of low power can be amplified by questionable statistical practices, often referred to as p -hacking. (2) One form of p -hacking involves multiple sequential analyses that are used to inform further data collection. This process, known as the optional stopping rule 10 , can lead to dramatic increases in type I error rates 11 . If applied repeatedly, testing of additional participants can guarantee a significant result under the null hypothesis if data collection continues until the desired p value is ultimately obtained. (3) Data are explored without differentiating between a priori hypotheses and post hoc reasoning. This is known as Hypothesizing After the Results are Known (HARKing) and, because the same data are used to identify a hypothesis as well as test it, HARKing renders the reported p values uninformative because they are known to be inflated 12 . (4) Publication bias in favor of significant results 13 , 14 amplifies the preceding three problems and additionally prevents the community from discovering when findings have failed to replicate.

Recommendations to avoid suboptimal research practices 15 and introduce transparency 5 , such as through preregistration of method and analysis plan 16 , more stringent significance levels 17 , 18 , reliance on strong theories 19 , or reporting all data irrespective of significance 20 , therefore deserve support. Nonetheless, even flawless and transparent research may yield spurious results for the simple reason that all psychological measurements involve random variables and hence the possibility of type I errors. Spurious results can only be avoided if replications become a mainstream component of psychological research 1 . Highlighting the virtues of replications is, however, not particularly helpful without careful consideration of when, how, why, and by whom experiments should be replicated.

To examine those questions, we simulate an idealized and transparent scientific community that eschews p -hacking and other questionable research practices and conducts studies with adequate power ( \(P=.8\) ). We focus on an idealized community precisely because we wanted to examine the issues surrounding replication in the absence of contamination by questionable research practices, although we also show later that our conclusions are robust to the injection of questionable practices and fraud. We measure the community’s success (the number of correctly identified true phenomena that were of interest to the scientific community) and efficiency (the number of experiments conducted overall) under two different knowledge acquisition strategies and, orthogonally, two different replication regimes. The key attribute of our model is that not all findings are deemed to be equally interesting by the scientific community.

Knowledge acquisition is either discovery-oriented 21 or guided by theory (with the predictive merit of the theory being another design variable). Discovery-oriented research seeks to identify interesting findings by foraging across a wide landscape of possible variables and measures. On this approach, failure of any given experiment is uninformative because the underlying theory makes no exact predictions about specific phenomena, only that they should arise somewhere in the search space 19 . For example, researchers may look for instances in which people’s representation of time, and its tacit link to future events and progress, can be primed by a bodily action. Exploration of various options may eventually reveal that turning a crank clockwise rather than anticlockwise primes people’s openness to experience 22 . Discovery-oriented research is particularly vulnerable to producing nonreplicable results 23 because it relies on few constraints from theory or previous findings to select hypotheses for testing 24 . Moreover, because these hypotheses target eye-catching and counterintuitive findings that are a priori unexpected 24 , the chance of testing a true hypothesis is low. Low prior probabilities of hypotheses, in turn, imply low replicability 21 , 24 , 25 . Theory-testing research, by contrast, focuses on a tightly constrained space of variables and measures for which the theory necessarily predicts an effect 19 . That is, predictions are tightly tethered to the theory, and falsification of a hypothesis provides more information than in discovery-oriented research. For example, if a theory of memory predicts that temporally isolated items should be recalled better than those that are temporally crowded, the fact that they are not (or only under some conditions) challenges the theory 26 . Conversely, if the theory has survived initial test, then it is unlikely to be completely off target. In consequence, the hypotheses that are chosen for further test have greater prior odds of being true, which in turn implies that positive findings are more likely to replicate 21 .

In the simulation, the two knowledge acquisition regimes differ only in the manner in which true discoverable effects and the search for those effects are structured: for discovery-oriented research, both are random, whereas for theory-testing research, true effects are clustered together and the theory guides search in various degrees of proximity to the true cluster. For both regimes, 9% of all possible simulated experiments can discover a true effect ( \(P({{\rm{H}}}_{1})=.09\) ). This value reflects estimates of the baserate of a psychological hypothesis being true 4 .

Replication decisions are either private or public. Private replication decisions are modeled by investigators replicating any notable result and publishing only successfully replicated phenomena. Public replication decisions are modeled by investigators publishing any notable result, with the scientific community deciding which of those to replicate based on whether the results are deemed interesting. For discovery-oriented research, we consider only positive and significant results to be notable, because only the discovery of effects is informative 19 . By contrast, for theory-testing research, the discovery of reliable null effects is also informative—because they may falsify necessary predictions of a theory 19 —and in one of our simulations, we therefore also consider reliably established null effects to be notable candidates for replication.

Figure 1 contrasts the two replication regimes. In both regimes, decision making is shared between individual investigators and the scientific community (represented by orange shading), and in both regimes the scientific community determines whether or not a finding is deemed interesting. The principal difference is whether the community selects interesting findings from among those known to have been replicated (private regime; Fig. 1a) or selects from a larger published set of studies of unknown replicability and communally determines which of those studies deserve replication because of their potential interest (public regime; Fig. 1b).

figure 1

a Private replication regime: investigators independently conduct 100 studies, and each investigator replicates any significant result. If replication is successful, both studies are published. The scientific community (represented by orange shading) then determines which of those replicated findings are deemed interesting based on a stochastic decision rule. b Public replication regime: investigators independently conduct 100 studies, and any significant result is published without replication. The scientific community (orange shading) then determines which of those findings of unknown replicability are deemed interesting, and hence worthy of replication, based on the same stochastic decision rule.

Scientific interest is modeled on the basis of the observed citation patterns in psychology. Citations, by definition, are a good proxy for scientific interest, and expert judgment and analysis of replicability (Methods section) confirm that citations do not predict replicability. The actual distribution of citations is highly skewed, with nearly 40% of articles receiving five or fewer citations within 3 years of publication (Methods section) and only 1.3% of articles receiving >50 citations during that period. Very few articles receive high citations beyond common bibliometric time horizons 27 , confirming that lack of citations indicates lasting lack of scientific interest. For the simulations, irrespective of replication regime, the probability of replication of a study increases with the number of citations. Specifically, we consider a finding to be interesting—and hence a candidate for replication—if its citation rate, obtained by random sample from the modeled distribution of citations, exceeds the 90th percentile of that distribution (Methods section). Varying degrees of sharpness of the decision were explored by varying the temperature parameter of a logistic decision function centered on the 90th percentile. Larger values of temperature imply a more graded threshold of scientific interest, rendering it more likely that articles with fewer citations are considered interesting. Figure  2 shows the distribution of citations in psychology together with the logistic threshold functions with the three values of temperature (1, 5, and 10) used in the simulations.

figure 2

The blue histogram shows observed distribution of citations for all articles published in 2014. The best-fitting Pareto distribution is represented by the solid red line. The threshold centered on the 90th percentile of the fitted distribution is indicated by the vertical dashed line. The three logistic functions are centered on the threshold but have different temperatures (see legend). They determine the probability of the scientific community finding a phenomenon to be of interest. In the simulation, each finding is associated with a random draw from the best-fitting Pareto distribution, which is then converted into a probability of interest using the appropriate logistic decision function.

In summary, we model a scientific community under two different replication regimes. In one regime, all findings are replicated before publication to guard against subsequent replication failures. In the alternative regime, individual studies are published and are replicated after publication, but only if they attract the community’s interest. The outcome measure of interest is the efficiency of knowledge generation; specifically, we consider the number of experiments conducted by the community overall that are required to discover a set number of true effects. To foreshadow, we find that the publication of potentially nonreplicable single studies minimizes cost and maximizes efficiency of knowledge gain for the scientific community.

Discovery-oriented research

Discovery-oriented research used a random selection of an independent and a dependent variable for each of the 100 studies simulated during the first round of experimentation. Because selection was random with replacement, multiple identical studies could be conducted and the same effect discovered more than once. This mirrors scientific practice, which frequently gives rise to independent discoveries of the same phenomenon. Figure 3b provides an overview of the simulation procedure. Each study during the first round was classified as “significant” based either on its p value ( \(p\, <\, .05\) , two-tailed single-sample t -test) or its Bayes factor ( \({{\rm{BF}}}_{10}\, > \, 3\) , Bayesian single-sample t -test with Jeffreys–Zellner–Siow prior, Cauchy distribution on effect size, see ref. 28 ), irrespective of whether the null hypothesis was actually false. As is typical for discovery-oriented research, we were not concerned with detection of null effects. Some or all of the studies thus identified were then selected for replication according to the applicable regime (Fig. 1 ).

figure 3

a Landscape of true effects for discovery-oriented research. Each cell is randomly initialized to 0 (i.e., \({{\rm{H}}}_{0}\) is true) or 1 (i.e., \({{\rm{H}}}_{1}\) ) with probability .09, reflecting the estimated baserate of true effects in psychology 4 . The landscape is initialized anew for each of the 1000 replications in each simulation experiment. b For each replication, 100 discovery-oriented experiments are conducted, each with a randomly chosen dependent and independent variable (sampling is with replacement). When an experiment yields a significant result ( \(p\,<\, .05\) , two-tailed single-sample t -test or \({{\rm{BF}}}_{10}\,> \, 3\) , Bayesian single-sample t -test), the appropriate replication regime from Fig.  1 is applied.

For frequentist analysis, we set statistical power either at 0.5 or 0.8. Figure  4 shows the results for the higher (Fig. 4a, b ) and lower power (Fig. 4c, d ). The figure reveals that regardless of statistical power, the replication regime did not affect the success of scientific discovery (Fig. 4b, d ). Under both regimes, the number of true and interesting discovered effects increased with temperature, reflecting the fact that with a more diffuse threshold of scientific interest more studies were selected for replication in the public regime, or were deemed interesting after publication in the private regime. When power is low (Fig. 4d ), fewer effects are discovered than when power is high (Fig. 4b ). Note that nearly all replicated effects are also true: this is because the probability of two successive type I errors is small ( \({\alpha }^{2}=.0025\) ).

figure 4

Power was 0.8 (a, b) or 0.5 (c, d). Panels a and c show the cost (total number of experiments conducted) to generate the knowledge (true effects discovered) shown in b and d. Temperature refers to the temperature of the logistic decision function (Methods section). Successful replications are identified by dashed lines in b and d. Successful replications that are also true (i.e., the null hypothesis was actually false) are identified by plotting symbols and solid lines.

By contrast, the cost of generating knowledge differed strikingly between replication regimes (Fig. 4a, c ), again irrespective of statistical power. The private replication regime incurred an additional cost of around ten studies compared to public replications. This difference represents ~10% of the total effort the scientific community expended on data collection. Publication of single studies whose replicability is unknown thus boosts the scientific community’s efficiency, whereas replicating studies before they are published carries a considerable opportunity cost. This cost is nearly unaffected by statistical power. Because variation in power has no impact on our principal conclusions, we keep it constant at 0.8 from here on. Moreover, as shown in Fig.  5 , the opportunity cost arising from the private replication regime also persists when Bayesian statistics are used instead of conventional frequentist analyses.

figure 5

a shows the cost (total number of experiments conducted) to generate the knowledge (true effects discovered) shown in b . Temperature refers to the temperature of the logistic decision function (Methods section). Successful replications are identified by dashed lines in b . Successful replications that are also true (i.e., the null hypothesis was actually false) are identified by plotting symbols and solid lines.

The reasons for this result are not mysterious: Notwithstanding scientists’ best intentions and fervent hopes, much of their work is of limited interest to the community. Any effort to replicate such uninteresting work is thus wasted. To maximize scientific productivity overall, that effort should be spent elsewhere, for example in theory development and test, or in replicating published results deemed interesting.

Theory-testing research

The basic premise of theory-testing research is that the search for effects is structured and guided by the theory. The quality or plausibility of a theory is reflected in how well the theory targets real effects to be tested. We instantiated those ideas by introducing structure into the landscape of true effects and into the experimental search (Methods section). Figure  6 illustrates the role of theory. Across panels, the correspondence between the location of true effects and the search space guided by the theory (parameter \(\rho\) ) increases from 0.1 (poor theory) to 1 (perfect theory). A poor theory is targeting a part of the landscape that contains no real effects, whereas a highly plausible theory targets a segment that contains many real effects.

figure 6

Each panel shows the same landscape of ground truth as in Fig.  3 . In each panel, the gray contours illustrate the location of true effects (randomly chosen for each replication) and the red lines outline the space of experiments actually conducted as determined by the theory. Unlike in discovery-oriented research, true effects and experiments cluster together. The quality of the theory (determined by parameter \(\rho\) ) is reflected in the overlap between the true state of the world (gray cluster) and the experiments (red cluster). The leftmost panel shows a poor theory ( \(\rho =0.1\) ), the center panel a modestly powerful theory ( \(\rho =0.5\) ), and the rightmost panel a perfect theory ( \(\rho =1.0\) ). Each panel represents a single arbitrarily chosen replication.

Not unexpectedly, the introduction of theory boosts performance considerably. Figure  7 shows results when all statistical tests focus on rejecting the null hypothesis, with power kept constant at 0.8. When experimentation is guided by a perfect theory ( \(\rho =1\) ), the number of true phenomena being discovered under either replication regime with a diffuse decision threshold (high temperature) is approaching or exceeding the actual number of existing effects. (Because the same phenomenon can be discovered in multiple experiments, the discovery count can exceed the true number of phenomena.) The cost associated with those discoveries, however, again differs strikingly between replication regimes. In the extreme case, with the most powerful theory, the private replication regime required nearly 40% additional experimental effort compared to the public regime. The cost associated with private replications is thus even greater with theory-testing research than with discovery-oriented research. The greater penalty is an ironic consequence of the greater accuracy of theory-testing research, because the larger number of significant effects (many of them true) automatically entails a larger number of private replications and hence many additional experiments. As with discovery-oriented research, the cost advantage of the public regime persists irrespective of whether frequentist or Bayesian techniques are used to analyze the experiments.

figure 7

The frequentist analysis is shown in a and b , and Bayesian tests for the presence of effects in c and d . a and c show the cost (total number of experiments conducted) to generate the knowledge (true effects discovered) shown in b and d . In all panels, the thickness of lines and size of plotting symbols indicates the value of \(\rho\) , which captures overlap between the theory and reality. In increasing order of thickness, \(\rho\) was 0.1, 0.5, and 1.0. Temperature refers to the temperature of the logistic decision function (Methods section). All successful replications shown in b and d are true (i.e., the null hypothesis was actually false). Significant replications that did not capture true effects are omitted to avoid clutter.

There is nonetheless an important difference between the two classes of statistical techniques: Unlike frequentist statistics, Bayesian techniques permit rigorous tests of the absence of effects. This raises the issue of whether such statistically well-supported null results are of interest to the community, and if so, whether the interest follows the same distribution as for non-null results. In the context of discovery-oriented research, we assumed that null results are of little or no interest because failures to find an effect that is not a necessary prediction of any theory is of no theoretical or practical value 19 . The matter is very different with theory-testing research, where a convincing failure to find an effect counts against the theory that predicted it. We therefore performed a symmetrical Bayesian analysis for theory-testing research and assumed that the same process applied to determining interest in a null result as for non-null results. That is, whenever a Bayes Factor provided evidence for the presence of an effect ( \({{\rm{BF}}}_{10}\,> \, 3\) ) or for its absence ( \({{\rm{BF}}}_{10}\, <\, 1/3={{\rm{BF}}}_{01}\, > \, 3\) ), we considered it a notable candidate for replication. Figure  8 shows that when both presence and absence of effects are considered, the cost for the private replication regime is increased even further, to 50% or more. This is because there is now also evidence for null effects ( \({{\rm{BF}}}_{01}\,> \,3\) ) that require replication irrespective of whether they are deemed interesting by the community.

figure 8

Tests are conducted for the presence ( \({{\rm{BF}}}_{10}> 3\) ) as well as absence ( \({{\rm{BF}}}_{01}\,> \, 3\) ) of effects. a shows the cost (total number of experiments conducted) to generate the knowledge (true effects and true absences of effects discovered) in b . Because both presence and absence of an effect are considered, the maximum discoverable number of true outcomes is 100. In both panels, the thickness of lines and size of plotting symbols indicates the value of \(\rho\) , which captures overlap between the theory and reality. In increasing order of thickness, \(\rho\) was 0.1, 0.5, and 1.0. Temperature refers to the temperature of the logistic decision function (Methods section).

Another aspect of Fig.  8 is that the value of \(\rho\) matters considerably less than when only non-null effects are considered. This is because a poor theory that is being consistently falsified (by the failure to find predicted effects) generates as many interesting (null) results as a perfect theory that is consistently confirmed. Because our focus here is on empirical facts (i.e., effects and null-effects) rather than the welfare of particular theories, we are not concerned with the balance between confirmations and falsifications of a theory’s predictions.

Boundary conditions and limitations

We consider several conceptual and methodological boundary conditions of our model. One objection to our analysis might invoke doubts about the validity of citations as an indicator of scientific quality. This objection would be based on a misunderstanding of our reliance on citation rates. The core of our model is the assumption that the scientific community shows an uneven distribution of interest in phenomena. Any differentiation between findings, no matter how small, will render the public replication regime more efficient. It is only when there is complete uniformity and all effects are considered equally interesting, that the cost advantage of the public replication regime is eliminated (this result trivially follows from the fact that the public replication regime then no longer differs from the private regime). It follows that our analysis does not hinge on whether or not citation rates are a valid indicator of scientific quality or merit. Even if citations were an error-prone measure of scientific merit 29 , they indubitably are an indicator of attention or interest. An article that has never been cited simply cannot be as interesting to the community as one that has been cited thousands of times, whatever one’s personal judgment of its quality may be.

Another objection to our results might invoke the fact that we simulated an idealized scientific community that eschewed fraud or questionable research practices. We respond to this objection by showing that our model is robust to several perturbations of the idealized community. The first perturbation involves p -hacking. As noted at the outset, p -hacking may variously involve removal of outlying observations, switching of dependent measures, adding ad hoc covariates, such as participants’ gender, and so on. A shared consequence of all those questionable research practices is an increased type I error rate: the actual \(\alpha\) can be vastly greater than the value set by the experimenter (e.g., the conventional .05). Figure  9a, b shows the consequences of p -hacking with frequentist analysis, operationalized by setting \(\alpha =0.2\) in a simulation of discovery-oriented research. The most notable effect of p -hacking is that a greater number of interesting replicated effects are not true (difference between dashed and solid lines in Fig. 9b ). The opportunity cost associated with private replications, however, is unaffected.

figure 9

a and b show the effects of p -hacking on the number of experiments conducted and the number of effects discovered. p -hacking is operationalized by raising the type I error rate, \(\alpha =.2\) . c and d show the effects of keeping \(\alpha =.05\) but adding up to five batches of additional participants (each batch containing 1, 5, or 10 participants), if an effect failed to reach significance. Results are shown averaged across levels of temperature.

Figure 9c, d explores the consequences of an optional stopping rule, another common variant of p -hacking. This practice involves repeated testing of additional participants, if a desired effect has failed to reach significance with the initial sample. If this process is repeated sufficiently often, a significant outcome is guaranteed even if the null hypothesis is true 10 . We instantiated the optional stopping rule by adding \({N}_{ph}\in \{1,5,10\}\) additional participants, if an effect had not reached significance with the initial sample. This continued for a maximum of five additional batches or until significance had been reached. Optional stopping had little effect on the basic pattern of results, including the opportunity cost associated with the private replication regime, although persistent testing of additional participants, as expected, again increased the number of replicated results that did not represent true effects. Overall, Fig.  9 confirms that our principal conclusions hold even if the simulated scientific community engages in questionable research practices.
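The inflation of the type I error rate produced by such a stopping rule is easy to demonstrate directly. The sketch below is a simplified stand-alone illustration, not the authors' simulation code; it loosely mirrors the parameters described here (an initial sample of 34, batches of added participants, a maximum of five additional batches) under a true null hypothesis.

```python
# Sketch: optional stopping inflates the type I error rate when the null is true.
# Parameters loosely follow the text; exact values here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations = 5000
n_initial = 34          # initial sample size
batch_size = 10         # participants added per batch if not yet significant
max_batches = 5
alpha = 0.05

false_positives = 0
for _ in range(n_simulations):
    data = list(rng.normal(0.0, 2.0, n_initial))   # null hypothesis is true (mean 0)
    for batch in range(max_batches + 1):
        p = stats.ttest_1samp(data, 0.0).pvalue
        if p < alpha:                              # stop at the first "significant" result
            false_positives += 1
            break
        if batch < max_batches:                    # otherwise add another batch and retest
            data.extend(rng.normal(0.0, 2.0, batch_size))

print(f"nominal alpha = {alpha}, observed type I error rate = "
      f"{false_positives / n_simulations:.3f}")
```

The observed rejection rate comes out well above the nominal .05, which is the mechanism behind the extra non-true "replicated" effects visible in the figure.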

We examined two further and even more extreme cases (both simulations are reported in the online supplement). First, we considered the effects of extreme fraud, where all effects during the first round are arbitrarily declared significant irrespective of the actual outcome (Supplementary Fig. 3 ), and only subsequent public replications are honest (the private replication regime makes little sense when simulating fraud, as a fraudster would presumably just report a second faked significance level). Fraud was found to have two adverse consequences compared to truthful research: (a) it incurs a greater cost in terms of experiments conducted by other investigators (because if everything is declared significant at the first round, more effects will be of interest and hence require replication). (b) Fraud engenders a greater number of falsely identified interesting effects because all type I errors during the honest replications are assumed to represent successfully replicated findings. These results clarify that our public replication regime is not comparable to a scenario in which completely fictitious results are offered to the community for potential replication—this scenario would merely mislead the community by generating numerous ostensibly replicated results that are actually type I errors.

Second, we considered the consequences of true effects being absent from the landscape of ground truths ( \(P({{\rm{H}}}_{1})=0\) ). This situation likely confronts research in parapsychology. In these circumstances, significant results from the first round can only reflect type I errors. In consequence, the overall cost of experimentation is lower than when true effects are present, but the cost advantage of the public regime persists (Supplementary Fig. 4 ).

Waste of resources has been identified as a major adverse consequence of the replication crisis 30 . We have shown that prepublication replications are wasteful. Perhaps ironically, waste is reduced by withholding replication until after publication. Regardless of whether research is discovery-oriented or theory-testing, and regardless of whether frequentist or Bayesian statistics are employed, the community benefits from publication of findings that are of unclear replicability. The cost advantage of the public replication regime was robust to various perturbations of the idealized community, such as p -hacking, fraud, and the pursuit of nonexistent effects.

Our model is consonant with other recent approaches that have placed the merit of research within a cost-benefit framework 18 , 31 , 32 , 33 . For example, Miller and Ulrich 33 examined the trade-off between false positives (type I errors) and false negatives (type II) under different payoff scenarios. Their model could determine the optimal sample size to maximize researchers’ overall payoff, based on the recognition that although larger sample sizes increase power, they are also costly because they reduce the number of studies that can be carried out. In light of estimates that upward of 85% of research effort and resources are wasted because of correctable problems 34 , any new practice that can free up resources for deployment elsewhere—e.g., to conduct advisable replications—should be given careful consideration.

Although we have shown private replications to be wasteful, adoption of our model would heed calls for a replication culture 35 in several ways. Powerful and sophisticated replications require much investment 36 , and by reducing the opportunity cost associated with unnecessary replications, our model frees up the resources necessary for powerful replications. Another favorable aspect of our model is that public replications are most likely conducted by laboratories other than the original investigator’s. A recent expert survey (Methods section) revealed that 87% of experts considered a replication to be more informative if it is conducted by a different lab, with the remainder (13%) considering replications by the same investigator to be equally informative. Out of >100 respondents, none thought that a replication by the original author was preferable to a replication by others. The overwhelming expert judgment is consonant with the result that replicability (by others) does not increase with the number of (conceptual) replications reported by the same author together with the original finding 7 , and that too many low-powered replications in an article may reveal publication bias rather than indicate replicability 37 , 38 . In addition, our approach is entirely compatible with other solutions to the replication crisis, such as preregistration 16 or reliance on strong theory 19 .

There are, however, some legitimate concerns that can be raised about a public replication regime. First, given that the regime is justified by greater efficiency of data collection, the increased burden on editors and reviewers that the regime implies through the increased number of publications is problematic. This added burden, however, is fairly modest by comparison to the gains in efficiency. To illustrate, for discovery-oriented research, the expected number of initially published findings under the public replication regime (with \(P({{\rm{H}}}_{1})=.09\) , \(\alpha =.05\) , and power 0.8) is the expected number of significant results in the first round of experiments: \(0.8\,\times 9+0.05\times 91=11.75\) . This number is only slightly greater than the number of additional experiments carried out in the private replication regime (viz. the difference between private and public replication in Fig.  4a ). Hence, factoring in the costs of editing and reviewing would render the public replication policy more costly only if the cost of reviewing and editing one publication is significantly larger than the cost of running one replication study. We maintain that this is rarely, if ever, the case: according to a recent detailed analysis of the global peer review system 39 , each manuscript submission in psychology attracts 1.6 peer reviews on average, and the average duration to prepare a review is estimated at 5 h. It follows that total average reviewer workload for a manuscript is 8 h. Even if this estimate were doubled to accommodate the editor’s time (for inviting reviewers, writing action letters and so on), the total additional editorial workload for the public replication regime would be \(16\,\times 11.75=188\)  h. We consider it implausible to assume that this burden would exceed the time required to conduct the (roughly) ten additional replications required by the private regime. That said, careful consideration must be given to the distribution of workload: our analysis is limited to the aggregate level of the scientific community overall and does not consider potential inequalities across levels of seniority, gender, employment security, and so on. Our considerations here point to the broader need for a comprehensive cost-benefit analysis of all aspects of research, including replications under different regimes, that permits different payoffs to be applied to type I and type II errors 33 . However, this broader exploration goes beyond the scope of the current paper, in particular because the payoffs associated with statistical errors may vary with publication practice, as we discuss below.

A second concern arises from the perceived status of published nonreplicated results, which are an inevitable consequence of the public replication regime. It is likely that the media and the public would not understand the preliminary nature of such results, and even other researchers—especially when strapped for the resources required for replication—might be tempted to give undue credence to nonreplicated results. This is particularly serious for clinical trials, where a cautionary treatment of preliminary results is critical. Moreover, there is evidence that results, once published, are considered credible even if an article is retracted 40 , and published replication failures seemingly do not diminish a finding’s citation trajectory 41 . Hence, even if preliminary results are eventually subjected to public replication attempts, a failure to replicate may not expunge the initial fluke from the community’s knowledge base.

We take this concern seriously, but we believe that it calls for a reform of current publication practice rather than abandoning the benefits of the public replication regime. Adopting the public replication regime entails that published findings are routinely considered as preliminary, and gradually gain credibility through successful replication, or lose credibility when replications are unsuccessful. We suggest that the public replication regime can live up to its promise if (1) nonreplicated findings are published provisionally and with an embargo (e.g., 1 year) against media coverage or citation. (2) Provisional publications are accompanied by an invitation for replication by other researchers. (3) If the replication is successful, the replicators become coauthors, and an archival publication of record replaces the provisional version. (4) Replication failure leads to a public withdrawal of the provisional publication accompanied by a citable public acknowledgement of the replicators. This ensures that replication failures are known, thus eliminating publication bias. (5) If no one seeks to replicate a provisional finding, the original publication becomes archival after the embargo expires with a note that it did not attract interest in replication. This status can still change into (3) or (4) if a replication is undertaken later.

Although these cultural changes may appear substantial, in light of the replication crisis and wastefulness of current practice, cosmetic changes may be insufficient to move science forward. A recent initiative in Germany that provides free data collection for (preregistered) studies through a proposal submission process points in a promising direction ( https://leibniz-psychology.org/en/services/data-collection/ ).

Methods

All simulations involved 1000 replications. The simulation comprised three main components.

The landscape of true effects was modelled by a \(10\,\times 10\) grid that represented the ground truth. For discovery-oriented research, the grid was randomly initialized for each replication to 0 ( \({{\rm{H}}}_{0}\) ) or 1 ( \({{\rm{H}}}_{1}\) ), with \(P({{\rm{H}}}_{1})=.09\) (Fig.  3a ). The two dimensions of the grid are arbitrary but can be taken to represent potential independent and dependent variables, respectively. Each grid cell therefore involves a unique combination of an experimental intervention and an outcome measure, and the ground truth in that cell (1 or 0) can be understood as presence or absence, respectively, of a difference to a presumed control condition. For theory-testing research, the same landscape was used but all effects were randomly clustered within four rows and columns centered on a randomly chosen centroid (subject to the constraint that all effects fit within the \(10\times 10\) grid; Fig.  6 ).
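A minimal NumPy sketch of these two landscapes, using the parameters given in the text (10 × 10 grid, \(P({\rm{H}}_{1})=.09\), a 4 × 4 cluster for theory-testing research), might look as follows. Placing the cluster by a random top-left corner rather than a centroid is a simplification of the procedure described here.

```python
# Sketch of the two ground-truth landscapes (parameters from the text: 10 x 10 grid,
# P(H1) = .09, effects clustered in a 4 x 4 block for theory-testing research).
import numpy as np

rng = np.random.default_rng(42)
GRID, P_H1 = 10, 0.09

# Discovery-oriented research: true effects scattered at random over the grid
landscape = (rng.random((GRID, GRID)) < P_H1).astype(int)

# Theory-testing research: the same number of effects, packed into a 4 x 4 cluster
n_effects = min(int(landscape.sum()), 16)
clustered = np.zeros((GRID, GRID), dtype=int)
row0, col0 = rng.integers(0, GRID - 3, size=2)   # cluster must fit inside the grid
cluster_cells = [(r, c) for r in range(row0, row0 + 4) for c in range(col0, col0 + 4)]
for idx in rng.permutation(len(cluster_cells))[:n_effects]:
    clustered[cluster_cells[idx]] = 1

print("scattered effects:", int(landscape.sum()), "| clustered effects:", int(clustered.sum()))
```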

The second component was a decision module to determine scientific interest. The distribution of citations for 1665 articles published in psychology in 2014 (downloaded from Scopus in April 2018) was fit by a generalized Pareto distribution (shape parameter \(k = 0.115\), scale parameter \(\sigma = 8.71\), and location parameter \(\theta = 0\); Fig. 2). For the simulations reported here, the 90th percentile of the fitted distribution (\(q = 22.98\) citations) was used as the threshold in a logistic transfer function:

\(P(I_k) = \frac{1}{1 + e^{-(n_k - q)/t}}\)   (1)

where \(P(I_k)\) is the probability that finding \(k\) would be deemed interesting, \(n_k\) represents the finding’s citation count, and \(t \in \{1, 5, 10\}\) is the temperature of the logistic function. (The reciprocal of the temperature is known as the gain of the function.) Each \(n_k\) represented a random sample from the best-fitting Pareto distribution. Other cutoff values of \(q\) were explored, spanning the range from the 10th through the 90th percentile, which did not materially affect the results (Supplementary Figs. 1 and 2).
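A sketch of this decision module in Python (the original code is MATLAB, and the implementation details here are assumptions based only on the description above):

```python
import numpy as np
from scipy import stats

# Generalized Pareto fit to the 2014 citation distribution, parameters as reported above.
k, sigma, theta = 0.115, 8.71, 0.0
q = stats.genpareto.ppf(0.90, k, loc=theta, scale=sigma)   # 90th percentile, ~22.98 citations

def interest_probability(n_k, t):
    """P(I_k): probability that a finding with n_k citations is deemed interesting (Eq. 1)."""
    return 1.0 / (1.0 + np.exp(-(n_k - q) / t))

# Each finding's citation count is a random draw from the fitted distribution.
rng = np.random.default_rng(1)
citations = stats.genpareto.rvs(k, loc=theta, scale=sigma, size=100, random_state=rng)
p_interest = interest_probability(citations, t=5)          # t is the temperature
```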

The final component was an experimental module to run and interpret experiments. Each simulation run (that is, each of the 1000 replications) involved a first round of 100 experiments. Each experiment was simulated by sampling observations from a normal distribution with mean equal to the value of the targeted cell in the grid of ground truths (0 or 1) and standard deviation \(\sigma\). The sample size was determined by G*Power 42 to achieve the desired statistical power. Power was either .5 or .8, mapping into sample sizes of \(n = \{18, 34\}\). For frequentist analyses, \(\sigma = 2.0\) and \(\alpha = .05\) in all simulations. For Bayesian analyses, \(n = 34\) and \(\sigma = 1.5\) throughout, which achieved a “power” of ~0.8 with \(\mathrm{BF}_{10} = 3\). An experiment was declared “significant” if the single-sample \(t\)-statistic exceeded the appropriate two-tailed critical value for \(\alpha = .05\), or if \(\mathrm{BF}_{10} > 3\) for a Bayesian single-sample \(t\)-test as described in ref. 28.
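As a hedged sketch of the frequentist branch of this module (the Bayesian branch, which declares an effect when \(\mathrm{BF}_{10} > 3\), is omitted; the defaults below are the values given in the text):

```python
import numpy as np
from scipy import stats

def run_experiment(true_effect, n=34, sd=2.0, alpha=0.05, rng=None):
    """Simulate one experiment on a single grid cell.

    true_effect is the cell's ground truth (0 = H0, 1 = H1); the experiment is
    'significant' if a two-tailed one-sample t-test against zero rejects at alpha.
    """
    rng = rng if rng is not None else np.random.default_rng()
    sample = rng.normal(loc=true_effect, scale=sd, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    return bool(p_value < alpha)
```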

For discovery-oriented research, the targeted cell in the landscape was chosen randomly (Fig. 3b). Theory-testing research also used a \(10 \times 10\) grid to represent the ground truth, but all true effects (i.e., \(H_1\)) were constrained to fall within a \(4 \times 4\) grid that straddled a randomly chosen centroid. For each simulated experiment, the targeted cell was chosen randomly from another \(4 \times 4\) grid of predicted effects whose centroid was a prescribed distance from the centroid of true effects. The parameter \(\rho\) determined the proximity between the centroid of true effects and the centroid of the predicted effects targeted by theory-testing research (Fig. 6). When \(\rho = 1\), the centroids were identical, and for \(\rho < 1\), the theory’s centroid was moved \((1 - \rho) \times 9\) rows and columns away from the true centroid (subject to the constraint that all cells predicted by the theory had to fit within the \(10 \times 10\) grid). A perfect theory (\(\rho = 1\)) thus predicted effects to be present in precisely the same area in which they actually occurred, whereas a poor theory (\(\rho \simeq 0\)) would search for effects in a place where none actually occurred.
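The role of \(\rho\) can be illustrated with a small sketch; the direction of the shift and the clipping rule are arbitrary choices made here for illustration, and the original code may handle these details differently:

```python
import numpy as np

def predicted_block(true_centroid, rho, grid_size=10, span=4):
    """Cells of the 4 x 4 block predicted by a theory of quality rho.

    rho = 1 keeps the predicted centroid on the true centroid; smaller rho moves it
    (1 - rho) * 9 rows and columns away, clipped so the block stays inside the grid.
    """
    shift = (1.0 - rho) * (grid_size - 1)
    r0 = int(np.clip(round(true_centroid[0] - span / 2 + shift), 0, grid_size - span))
    c0 = int(np.clip(round(true_centroid[1] - span / 2 + shift), 0, grid_size - span))
    return [(r, c) for r in range(r0, r0 + span) for c in range(c0, c0 + span)]
```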

The first round of 100 experiments was followed by replications as determined by the applicable regime (Fig. 1). Thus, under the private regime, any significant result from the first round was replicated, whereas under the public regime, significant results were replicated with a probability proportional to their scientific interest as determined by Eq. (1). (In the simulation that also examined null effects, shown in Fig. 8, replication decisions were also based on Bayes factors for the null hypothesis.)
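A minimal sketch of the two regimes, treating \(P(I_k)\) from Eq. (1) directly as the replication probability under the public regime:

```python
import numpy as np

def replication_mask(significant, p_interest, regime, rng=None):
    """Boolean mask of first-round findings that receive a replication attempt.

    Private regime: every significant finding is replicated by its own authors.
    Public regime: a significant finding is replicated with probability P(I_k).
    """
    rng = rng if rng is not None else np.random.default_rng()
    significant = np.asarray(significant, dtype=bool)
    if regime == "private":
        return significant
    p_interest = np.asarray(p_interest, dtype=float)
    return significant & (rng.random(significant.shape) < p_interest)
```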

Expert survey

Attendees of a symposium on statistical and conceptual issues relating to replicability at the International Meeting of the Psychonomic Society in Amsterdam (May 2018) were given the opportunity to respond to a seven-item single-page survey that was distributed before the symposium started. Responses were collected after each talk until a final set of 102 responses was obtained.

Each item involved a quasi-continuous scale (14 cm horizontal line) with marked end points. Responses were indicated by placing a tick mark or cross along the scale. Responses were scored to a resolution of 0.5 cm (minimum 0, maximum 14, and midpoint 7). Items, scale end points, and summary of responses are shown in Table  1 .
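Expressed as a small (hypothetical) helper, the scoring rule amounts to quantizing each tick-mark position to the nearest 0.5 cm and clamping it to the 0–14 cm range:

```python
def score_response(position_cm):
    """Quantize a tick-mark position on the 14 cm scale to 0.5 cm resolution."""
    return min(max(round(position_cm * 2) / 2, 0.0), 14.0)
```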

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

Data availability

MATLAB code for the simulation and all results are available at https://git.io/fhHjg . A reporting summary for this Article is available as a Supplementary Information file.

Change history

11 August 2020

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Zwaan, R. A., Etz, A., Lucas, R. E. & Donnellan, M. B. Making replication mainstream. Behav. Brain Sci. 41 , E120 (2017).


Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349 , 1–8 (2015).

Camerer, C. F. et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).

Dreber, A. et al. Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl Acad. Sci. USA 112 , 15343–15347 (2015).


Morey, R. D. et al. The peer reviewers’ openness initiative: incentivizing open research practices through peer review. R. Soc. Open Sci. 2 , 15047 (2015).


Stroebe, W. & Strack, F. The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9 , 59–71 (2014).

Kunert, R. Internal conceptual replications do not increase independent replication success. Psychon. Bull. Rev. 23 , 1631–1638 (2016).

Simmons, J. P., Nelson, L. D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22 , 1359–1366 (2011).

Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14 , 365–376 (2013).


Wagenmakers, E.-J. A practical solution to the pervasive problems of p values. Psychon. Bull. Rev. 14 , 779–804 (2007).

Jennison, C. & Turnbull, B. W. Statistical approaches to interim monitoring of medical trials: a review and commentary. Stat. Sci. 5 , 299–317 (1990).


Kerr, N. L. HARKing: Hypothesizing after the results are known. Pers. Soc. Psychol. Rev. 2 , 196–217 (1998).

Ferguson, C. J. & Heene, M. A vast graveyard of undead theories: publication bias and psychological science’s aversion to the null. Perspect. Psychol. Sci. 7 , 555–561 (2012).

Ferguson, C. J. & Brannick, M. T. Publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychol. Methods 17 , 120–128 (2012).

Wagenmakers, E.-J., Wetzels, R., Borsboom, D. & Van Der Maas, H. L. J. Why psychologists must change the way they analyze their data: the case of Psi: comment on Bem (2011). J. Pers. Soc. Psychol. 100 , 426–432 (2011).

Nosek, B. A., Ebersole, C. R., DeHaven, A. C. & Mellor, D. T. The preregistration revolution. Proc. Natl Acad. Sci. USA 115 , 2600–2606 (2018).

Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2 , 6–10 (2018).

Miller, J. & Ulrich, R. The quest for an optimal alpha. PLoS ONE 14 , e0208631 (2019).

Oberauer, K. & Lewandowsky, S. Addressing the theory crisis in psychology. Psychon. Bull. Rev. 26 , 1596–1618 (2019).

van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B. & Wicherts, J. M. Why publishing everything is more effective than selective publishing of statistically significant results. PLoS ONE 9 , e84896 (2014).


Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2 , e124 (2005).

Topolinski, S. & Sparenberg, P. Turning the hands of time: clockwise movements increase preference for novelty. Soc. Psychol. Pers. Sci. 3 , 308–314 (2012).

Wagenmakers, E.-J. et al. Turning the hands of time again: a purely confirmatory replication study and a bayesian analysis. Front. Psychol. 6 , 494 (2015).

Wilson, B. M. & Wixted, J. T. The prior odds of testing a true effect in cognitive and social psychology. Adv. Methods Pract. Psychol. Sci. 1 , 186–197 (2018).

Miller, J. What is the probability of replicating a statistically significant effect? Psychon. Bull. Rev. 16 , 617–640 (2009).

Lewandowsky, S., Brown, G. D. A., Wright, T. & Nimmo, L. M. Timeless memory: evidence against temporal distinctiveness models of short-term memory for serial order. J. Mem. Lang. 54 , 20–38 (2006).

Glänzel, W., Schlemmer, B. & Thijs, B. Better late than never? On the chance to become highly cited only beyond the standard bibliometric time horizon. Scientometrics 58, 571–586 (2003).

Rouder, J. N., Speckman, P. I., Sun, D. & Morey, R. D. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16 , 225–237 (2009).

Eyre-Walker, A. & Stoletzki, N. The assessment of science: the relative merits of post-publication review, the impact factor, and the number of citations. PLoS Biol. 11 , e1001675 (2013).

Ioannidis, J. P. A. et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet 383 , 166–175 (2014).

Coles, N. A., Tiokhin, L., Scheel, A. M., Isager, P. M. & Lakens, D. The costs and benefits of replication studies. Behav. Brain Sci. 41 , e124 (2018).

Field, S. M., Hoekstra, R., Bringmann, L. F. & van Ravenzwaaij, D. When and why to replicate: as easy as 1, 2, 3? Collabra: Psychology 5 (2019).

Miller, J. & Ulrich, R. Optimizing research payoff. Perspect. Psychol. Sci. 11 , 664–691 (2016).

Chalmers, I. & Glasziou, P. Avoidable waste in the production and reporting of research evidence. Lancet 374 , 86–89 (2009).

Ioannidis, J. P. A. How to make more published research true. PLoS Med. 11 , e1001747 (2014).

Baribault, B. et al. Metastudies for robust tests of theory. Proc. Natl Acad. Sci. USA 115 , 2607–2612 (2018).

Francis, G. The psychology of replication and replication in psychology. Perspect. Psychol. Sci. 7 , 585–594 (2012).

Francis, G. Too good to be true: publication bias in two prominent studies from experimental psychology. Psychon. Bull. Rev. 19 , 151–156 (2012).

Publons global state of peer review 2018. Tech. Rep. (Clarivate Analytics, 2018). https://doi.org/10.14322/publons.gspr2018.

Greitemeyer, T. Article retracted, but the message lives on. Psychon. Bull. Rev. 21 , 557–561 (2014).

Arslan, R. Revised: Are studies that replicate cited more? https://rubenarslan.github.io/posts/2019-01-02-are-studies-that-replicate-cited-more/ (2019).

Mayr, S., Erdfelder, E., Buchner, A. & Faul, F. A short tutorial of GPower. Tutor. Quant. Methods Psychol. 3 , 51–59 (2007).


Acknowledgements

The authors do not have funding to acknowledge.

Author information

Authors and affiliations

School of Psychological Science, University of Bristol, 12A, Priory Road, Bristol, BS8 1TU, UK

Stephan Lewandowsky

School of Psychological Science, University of Western Australia, Perth, WA, Australia

Department of Psychology, University of Zurich, Zurich, Switzerland

Klaus Oberauer


Contributions

S.L. wrote and conducted the simulations and wrote the first draft of the paper. S.L. and K.O. jointly developed and discussed the project throughout.

Corresponding author

Correspondence to Stephan Lewandowsky .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Rolf Ulrich and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, a Peer Review File, and a Reporting Summary are available for this article.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Lewandowsky, S., Oberauer, K. Low replicability can support robust and efficient science. Nat Commun 11 , 358 (2020). https://doi.org/10.1038/s41467-019-14203-0


Received : 04 March 2019

Accepted : 17 December 2019

Published : 17 January 2020

DOI : https://doi.org/10.1038/s41467-019-14203-0


This article is cited by

Consensus definitions of perception-action-integration in action control.

  • Christian Frings
  • Christian Beste
  • Philip Schmalbrock

Communications Psychology (2024)

Open Times: The future of critique in the age of (un)replicability

  • Nathalie Cooke
  • Ronny Litvack-Katzman

International Journal of Digital Humanities (2024)

The replication crisis has led to positive structural, procedural, and community changes

  • Max Korbmacher
  • Flavio Azevedo
  • Thomas Evans

Communications Psychology (2023)

Disconnected psychology and neuroscience—implications for scientific progress, replicability and the role of publishing

Communications Biology (2021)

Exploring Bayesian analyses of a small-sample-size factorial design in human systems integration: the effects of pilot incapacitation

  • Daniela Schmid
  • Neville A. Stanton

Human-Intelligent Systems Integration (2019)



Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results

The massive project shows that reproducibility problems plague even top scientific journals

Brian Handwerk

Science Correspondent


Academic journals and the press regularly serve up fresh helpings of fascinating psychological research findings. But how many of those experiments would produce the same results a second time around?

According to work presented today in Science , fewer than half of 100 studies published in 2008 in three top psychology journals could be replicated successfully. The international effort included 270 scientists who re-ran other people's studies as part of The Reproducibility Project: Psychology , led by Brian Nosek of the University of Virginia .

The eye-opening results don't necessarily mean that those original findings were incorrect or that the scientific process is flawed. When one study finds an effect that a second study can't replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A's result may be false, or Study B's results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.

“This project is not evidence that anything is broken. Rather, it's an example of science doing what science does,” says Christopherson. “It's impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”

Across the sciences, research is considered reproducible when an independent team can conduct a published experiment, following the original methods as closely as possible, and get the same results. It's one key part of the process for building evidence to support theories. Even today, 100 years after Albert Einstein presented his general theory of relativity, scientists regularly repeat tests of its predictions and look for cases where his famous description of gravity does not apply.

"Scientific evidence does not rely on trusting the authority of the person who made the discovery," team member Angela Attwood , a psychology professor at the University of Bristol, said in a statement "Rather, credibility accumulates through independent replication and elaboration of the ideas and evidence."

The Reproducibility Project, a community-based crowdsourcing effort, kicked off in 2011 to test how well this measure of credibility applies to recent research in psychology. Scientists, some recruited and some volunteers, reviewed a pool of studies and selected one for replication that matched their own interest and expertise. Their data and results were shared online and reviewed and analyzed by other participating scientists for inclusion in the large Science study.

To help improve future research, the project analysis attempted to determine which kinds of studies fared the best, and why. They found that surprising results were the hardest to reproduce, and that the experience or expertise of the scientists who conducted the original experiments had little to do with successful replication.

The findings also offered some support for the oft-criticized statistical tool known as the P value, which indicates how probable it is that a result at least as strong would turn up by chance if there were no real effect. A higher value means a result is more likely to be a fluke, while a lower value means the result is deemed statistically significant.

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated.

But Christopherson suspects that most of his co-authors would not want the study to be taken as a ringing endorsement of P values, because they recognize the tool's limitations. And at least one P value problem was highlighted in the research: The original studies had relatively little variability in P value, because most journals have established a cutoff of 0.05 for publication. The trouble is that value can be reached by being selective about data sets , which means scientists looking to replicate a result should also carefully consider the methods and the data used in the original study.

It's also not yet clear whether psychology might be a particularly difficult field for reproducibility—a similar study is currently underway on cancer biology research. In the meantime, Christopherson hopes that the massive effort will spur more such double-checks and revisitations of past research to aid the scientific process.

“Getting it right means regularly revisiting past assumptions and past results and finding new ways to test them. The only way science is successful and credible is if it is self-critical,” he notes. 

Unfortunately there are disincentives to pursuing this kind of research, he says: “To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they've been part of the problem, and that devaluing replication in this way has created a less stable literature than we'd like.”


Brian Handwerk is a science correspondent based in Amherst, New Hampshire.


Replicates and repeats—what is the difference and is it significant?

David L. Vaux

1 The Walter and Eliza Hall Institute, and the Department of Experimental Biology, University of Melbourne, Melbourne, Australia.

Fiona Fidler

2 La Trobe University School of Psychological Science, Melbourne, Australia.

Geoff Cumming

Science is knowledge gained through repeated experiment or observation. To be convincing, a scientific paper needs to provide evidence that the results are reproducible. This evidence might come from repeating the whole experiment independently several times, or from performing the experiment in such a way that independent data are obtained and a formal procedure of statistical inference can be applied—usually confidence intervals (CIs) or statistical significance testing. Over the past few years, many journals have strengthened their guidelines to authors and their editorial practices to ensure that error bars are described in figure legends—if error bars appear in the figures—and to set standards for the use of image-processing software. This has helped to improve the quality of images and reduce the number of papers with figures that show error bars but do not describe them. However, problems remain with how replicate and independently repeated data are described and interpreted. As biological experiments can be complicated, replicate measurements are often taken to monitor the performance of the experiment, but such replicates are not independent tests of the hypothesis, and so they cannot provide evidence of the reproducibility of the main results. In this article, we put forward our view to explain why data from replicates cannot be used to draw inferences about the validity of a hypothesis, and therefore should not be used to calculate CIs or P values, and should not be shown in figures.


Let us suppose we are testing the hypothesis that the protein Biddelonin (BDL), encoded by the Bdl gene, is required for bone marrow colonies to grow in response to the cytokine HH-CSF. Luckily, we have wild-type (WT) and homozygous Bdl gene-deleted mice at our disposal, and a vial of recombinant HH-CSF. We prepare suspensions of bone marrow cells from a single WT and a single Bdl −/− mouse (same sex littermates from a Bdl +/− heterozygous cross) and count the cell suspensions by using a haemocytometer, adjusting them so that there are 1 × 10⁵ cells per millilitre in the final solution of soft agar growth medium. We add 1 ml aliquots of the suspension to sets of ten 35 × 10 mm Petri dishes that each contain 10 μl of either saline or purified recombinant mouse HH-CSF.

We therefore put in the incubator four sets of ten soft agar cultures: one set of ten plates has WT bone marrow cells with saline; the second has Bdl −/− cells with saline; the third has WT cells with HH-CSF, and the fourth has Bdl −/− cells with HH-CSF. After a week, we remove the plates from the incubator and count the number of colonies (groups of >50 cells) in each plate by using a dissecting microscope. The number of colonies counted is shown in Table 1 .

Table 1

                      Plate number
                      1    2    3    4    5    6    7    8    9    10
WT + saline           0    0    0    1    1    0    0    0    0    0
Bdl −/− + saline      0    0    0    0    0    1    0    0    0    2
WT + HH-CSF           61   59   55   64   57   69   63   51   61   61
Bdl −/− + HH-CSF      48   34   50   59   37   46   44   39   51   47

1 × 10⁵ WT or Bdl −/− bone marrow cells were plated in 1 ml soft agar cultures in the presence or absence of 1 μM HH-CSF. Colonies per plate were counted after 1 week. WT, wild type.

We could plot the counts of the plates on a graph. If we plotted just the colony counts of only one plate of each type ( Fig 1A shows the data for plate 1), it seems clear that HH-CSF is necessary for many colonies to form, but it is not immediately apparent whether the response of the Bdl −/− cells is significantly different to that of the WT cells. Furthermore, the graph does not look ‘sciency’ enough; there are no error bars or P -values. Besides, by showing the data for only one plate we are breaking the fundamental rule of science that all relevant data should be reported and subjected to analysis, unless good reasons can be given why some data should be omitted.

Figure 1

Displaying data from replicates—what not to do. ( A ) Data for plate 1 only (shown in Table 1 ). ( B ) Means ± SE for replicate plates 1–3 (in Table 1 ), * P > 0.05. ( C ) Means ± SE for replicate plates 1–10 (in Table 1 ), * P < 0.0001. ( D ) Means ± SE for HH-CSF-treated replicate plates 1–10 (in Table 1 ). Statistics should not be shown for replicates because they merely indicate the fidelity with which the replicates were made, and have no bearing on the hypothesis being tested. In each of these figures, n = 1 and the size of the error bars in ( B ), ( C ) and ( D ) reflect sampling variation of the replicates. The SDs of the replicates would be expected to be roughly the square root of the mean number of colonies. Also, axes should commence at 0, other than in exceptional circumstances, such as for log scales. SD, standard deviation; SE, standard error.

To make it look better, we could add the mean numbers of colonies in the first three plates of each type to the graph ( Fig 1B ), with error bars that report the standard error (SE) of the three values of each type. Now it is looking more like a figure in a high-profile journal, but when we use the data from the three replicate plates of each type to assess the statistical significance of the difference in the responses of the WT and Bdl −/− cells to HH-CSF, we find P > 0.05, indicating they are not significantly different.

As we have another seven plates from each group, we can plot the means and SEs of all ten plates and re-calculate P ( Fig 1C ). Now we are delighted to find that there is a highly significant difference between the Bdl −/− and WT cells, with P < 0.0001.
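For readers who want to check the arithmetic, the pattern described here can be reproduced from the counts in Table 1 with a short Python sketch, assuming an ordinary two-sample t-test on the HH-CSF-treated plates; as the following sections argue, neither P value is a valid test of the biological hypothesis, because every plate comes from a single suspension per mouse:

```python
from scipy import stats

# Colony counts for the HH-CSF-treated plates in Table 1.
wt = [61, 59, 55, 64, 57, 69, 63, 51, 61, 61]   # wild-type plates
ko = [48, 34, 50, 59, 37, 46, 44, 39, 51, 47]   # Bdl-/- plates

# First three replicate plates only (as in Fig 1B): P comes out above 0.05.
print(stats.ttest_ind(wt[:3], ko[:3]))

# All ten replicate plates (as in Fig 1C): P comes out below 0.0001.
print(stats.ttest_ind(wt, ko))
```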

However, although the differences are highly statistically significant, the heights of the columns are not dramatically different, and it is hard to see the error bars. To remedy this, we could simply start the y -axis at 40 rather than zero ( Fig 1D ), to emphasize the differences in the response to HH-CSF. Although this necessitates removing the saline controls, these are not as important as visual impact for high-profile journals.

With a small amount of effort, and no additional experiments, we have transformed an unimpressive result ( Fig 1A,B ) into one that gives strong support to our hypothesis that BDL is required for a response to HH-CSF, with a highly significant P -value, and a figure ( Fig 1D ) that looks like it could belong in one of the top journals.

So, what is wrong? The first problem is that our data do not confirm the hypothesis that BDL is required for bone marrow colonies to grow in response to HH-CSF, they actually refute it. Clearly, bone marrow colonies are growing in the absence of BDL, even if the number is not as great as when the Bdl genes are intact. Terms such as ‘required’, ‘essential’ and ‘obligatory’ are not relative, yet are still often incorrectly used when partial effects are seen. At the very least, we should reformulate our hypothesis, perhaps to “BDL is needed for a full response of bone marrow colony-forming cells to the cytokine HH-CSF”.


The second major problem is that the calculations of P and statistical significance are based on the SE of replicates, but the ten replicates in any of the four conditions were each made from a single suspension of bone marrow cells from just one mouse. As such, we can at best infer a statistically significant difference between the concentration of colony-forming cells in the bone marrow cell suspension from that particular WT mouse and the bone marrow suspension from that particular gene-deleted mouse. We have made just one comparison, so n = 1, no matter how many replicate plates we count. To make an inference that can be generalized to all WT mice and Bdl −/− mice, we need to repeat our experiments a number of times, making several independent comparisons using several mice of each type.

Rather than providing independent data, the results from the replicate plates are linked because they all came from the same suspension of bone marrow cells. For example, if we made any error in determining the concentration of bone marrow cells, this error would be systematically applied to all of the plates. In this case, we determined the initial number of bone marrow cells by performing a cell count using a haemocytometer, a method that typically only gives an accuracy of ±10%. Therefore, no matter how many plates are counted, or how small the error bars are in Fig 1 , it is not valid to conclude that there is a difference between the WT and Bdl −/− cells. Moreover, even if we had used a flow cytometer to sort exactly the same number of bone marrow cells into each of the plates, we would still have only tested cells from a single Bdl −/− mouse, so n would still equal 1 (see Fundamental principle 1 in Sidebar A ).

Sidebar A | Fundamental principles of statistical design

Fundamental principle 1

Science is knowledge obtained by repeated experiment or observation: if n = 1, it is not science, as it has not been shown to be reproducible. You need a random sample of independent measurements.

Fundamental principle 2

Experimental design, at its simplest, is the art of varying one factor at a time while controlling others: an observed difference between two conditions can only be attributed to Factor A if that is the only factor differing between the two conditions. We always need to consider plausible alternative interpretations of an observed result. The differences observed in Fig 1 might only reflect differences between the two suspensions, or be due to some other (of the many) differences between the two individual mice, besides the particular genotypes of interest.

Fundamental principle 3

A conclusion can only apply to the population from which you took the random sample of independent measurements: so if we have multiple measures on a single suspension from one individual mouse, we can only draw a conclusion about that particular suspension from that particular mouse. If we have multiple measures of the activity of a single vial of cytokine, then we can only generalize our conclusion to that vial.

Fundamental principle 4

Although replicates cannot support inference on the main experimental questions, they do provide important quality controls of the conduct of experiments. Values from an outlying replicate can be omitted if a convincing explanation is found, although repeating part or all of the experiment is a safer strategy. Results from an independent sample, however, can only be left out in exceptional circumstances, and only if there are especially compelling reasons to justify doing so.

To be convincing, a scientific paper describing a new finding needs to provide evidence that the results are reproducible. While it might be argued that a hypothetical talking dog would represent an important scientific discovery even if n = 1, few people would be convinced if someone claimed to have a talking dog that had been observed on one occasion to speak a single word. Most people would require several words to be spoken, with a number of independent observers, on several occasions. The cloning of Dolly the sheep represented a scientific breakthrough, but she was one of five cloned sheep described by Campbell et al [ 1 ]. Eight fetuses and sheep were typed by microsatellite analysis and shown to be identical to the cell line used to provide the donor nuclei.


Inferences can only be made about the population from which the independent samples were drawn. In our original experiment, we took individual replicate aliquots from the suspensions of bone marrow cells (Fig 2A). We can therefore only generalize our conclusions to the ‘population’ from which our sample aliquots came: in this case the population is that particular suspension of bone marrow cells. To test our hypothesis, it is necessary to carry out an experiment similar to that shown in Fig 2B. Here, bone marrow has been independently isolated from a random sample of WT mice and another random sample of Bdl −/− mice. In this case, we can draw conclusions about Bdl −/− mice in general, and compare them with WT mice (in general). In Fig 2A, the number of Bdl −/− mice that have been compared with WT mice (which is the comparison relevant to our hypothesis) is one, so n = 1, regardless of how many replicate plates are counted. Conversely, in Fig 2B we are comparing three Bdl −/− mice with WT controls, so n = 3, whether we plate three replicate plates of each type or 30. Note, however, that it is highly desirable for statistical reasons to have samples larger than n = 3, and/or to test the hypothesis by some other approach, for example, by using antibodies that block HH-CSF or BDL, or by re-expressing a Bdl cDNA in the Bdl −/− cells (see Fundamental principle 2 in Sidebar A).

Figure 2

Sample variation. Variation between samples can be used to make inferences about the population from which the independent samples were drawn (red arrows). For replicates, as in ( A ), inferences can only be made about the bone marrow suspensions from which the aliquots were taken. In ( A ), we might be able to infer that the plates on the left and the right contained cells from different suspensions, and possibly that the bone marrow cells came from two different mice, but we cannot make any conclusions about the effects of the different genotypes of the mice. In ( B ), three independent mice were chosen from each genotype, so we can make inferences about all mice of that genotype. Note that in the experiments in ( B ), n = 3, no matter how many replicate plates are created.

One of the most commonly used methods to determine the abundance of mRNA is real-time quantitative reverse transcription PCR (qRT-PCR; although the following example applies equally well to an ELISA or similar). Typically, multi-well plates are used so that many samples can be simultaneously read in a PCR machine. Let us suppose we are going to use qRT-PCR to compare levels of Boojum mRNA ( Bjm ) in control bone marrow cells (treated with medium alone) with Bjm levels in bone marrow cells treated with HH-CSF, in order to test the hypothesis that HH-CSF induces expression of the Bjm gene.

We isolate bone marrow cells from a normal mouse, and dispense equal aliquots containing a million cells into each of two wells of a six-well plate. For the moment we use only two of the six wells. We then add 4 ml of plain medium to one of the wells (the control), and 4 ml of a mixture of medium supplemented with HH-CSF to the other well (the experimental well). We incubate the plate for 24 h and then transfer the cells into two tubes, in which we extract the RNA using TRizol. We then suspend the RNA in 50 μl TRIS-buffered RNAse-free water.

We put 10 μl from each tube into each of two fresh tubes, so that both Actin (as a control) and Bjm message can be determined in each sample. We now have four tubes, each with 10 μl of mRNA solution. We make two sets of ‘reaction mix’ with the only difference being that one contains Actin PCR primers and the other Bjm primers. We add 40 μl of one or the other ‘reaction mix’ to each of the four tubes, so we now have 50 μl in each tube. After mixing, we take three aliquots of 10 μl from each of the four tubes and put them into three wells of a 384-well plate, so that 12 wells in total contain the RT-PCR mix. We then put the plate into the thermocycler. After an hour, we get an Excel spreadsheet of results.


We then calculate the ratio of the Bjm signal to the Actin signal for each of the three pairs of reactions that contained RNA from the HH-CSF-treated cells, and for each of the three pairs of control reactions. In this case, the variation among the three replicates will not be affected by sampling error (which was what caused most of the variation in colony number in the earlier bone marrow colony-forming assay), but will only reflect the fidelity with which the replicates were made, and perhaps some variation in the heating of the separate wells in the PCR machine. The three 10 μl aliquots each came from the same, single, mRNA preparation, so we can only make inferences about the contents of that particular tube. As in the previous example, in this case n still equals 1, and no inferences about the main experimental hypothesis can be made. The same would be true if each RNA sample were analysed in 10 or 100 wells; we are only comparing one control sample to one experimental sample, so n = 1 ( Fig 3A ). To draw a general inference about the effect of HH-CSF on Bjm expression, we would have to perform the experiment on several independent samples derived from independent cultures of HH-CSF-stimulated bone marrow cells ( Fig 3B ).

Figure 3

Means of replicates compared with means of independent samples. ( A ) The ratios of the three-replicate Bjm PCR reactions to the three-replicate Actin PCR reactions from the six aliquots of RNA from one culture of HH-CSF-stimulated cells and one culture of unstimulated cells are shown (filled squares). The means of the ratios are shown as columns. The close correlation of the three replicate values (blue lines) indicates that the replicates were created with high fidelity and the pipetting was consistent, but is not relevant to the hypothesis being tested. It is not appropriate to show P -values here, because n = 1. ( B ) The ratios of the replicate PCR reactions using mRNA from the other cultures (two unstimulated, and two treated with HH-CSF) are shown as triangles and circles. Note how the correlation between the replicates (that is, the groups of three shapes) is much greater than the correlation between the mean values for the three independent untreated cultures and the three independent HH-CSF-treated cultures (green lines). Error bars indicate SE of the ratios from the three independent cultures, not the replicates for any single culture. P > 0.05. SE, standard error.

For example, we could have put the bone marrow cells in all six wells of the tissue culture plate, and performed three independent cultures with HH-CSF, and three independent control cultures in medium without HH-CSF. mRNA could then have been extracted from the six cultures, and each split into six wells to measure Actin and Bjm mRNA levels by using qRT-PCR. In this case, 36 wells would have been read by the machine. If the experiment were performed this way, then n = 3, as there were three independent control cultures, and three independent HH-CSF-dependent cultures, that were testing our hypothesis that HH-CSF induces Bjm expression. We then might be able to generalize our conclusions about the effect of that vial of recombinant HH-CSF on expression of Bjm mRNA. However, in this case ( Fig 3B ) P > 0.05, so we cannot exclude the possibility that the differences observed were just due to chance, and that HH-CSF has no effect on Bjm mRNA expression. Note that we also cannot conclude that it has no effect; if P > 0.05, the only conclusion we can make is that we cannot make any conclusions. Had we calculated and shown errors and P -values for replicates in Fig 3A , we might have incorrectly concluded, and perhaps misled the readers to conclude that there was a statistically significant effect of HH-CSF in stimulating Bjm transcription (see Fundamental principle 3 in Sidebar A ).
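The safe analysis pattern can be sketched in a few lines of Python using made-up ratios (these are not the values plotted in Fig 3): collapse the technical replicates within each culture to a single value, and only then test across the independent cultures:

```python
import numpy as np
from scipy import stats

# Hypothetical Bjm/Actin ratios: rows are independent cultures, columns are the
# three technical PCR replicates of each culture (illustrative values only).
control = np.array([[1.02, 0.98, 1.00],
                    [0.80, 0.83, 0.81],
                    [1.21, 1.19, 1.23]])
hh_csf = np.array([[1.60, 1.57, 1.62],
                   [1.05, 1.08, 1.06],
                   [1.30, 1.33, 1.31]])

# Collapse technical replicates to one value per culture; only these enter inference.
control_means = control.mean(axis=1)   # n = 3 independent control cultures
hh_csf_means = hh_csf.mean(axis=1)     # n = 3 independent HH-CSF cultures
print(stats.ttest_ind(hh_csf_means, control_means))

# Testing all nine wells per group as if they were independent would inflate n
# from 3 to 9, which is exactly the error the article warns against.
```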

Why bother with replicates at all? In the previous sections we have seen that replicates do not allow inferences to be made, or allow us to draw conclusions relevant to the hypothesis we are testing. So should we dispense with replicates altogether? The answer, of course, is ‘no’. Replicates serve as internal quality checks on how the experiment was performed. If, for example, in the experiment described in Table 1 and Fig 1 , one of the replicate plates with saline-treated WT bone marrow contained 100 colonies, you would immediately suspect that something was wrong. You could check the plate to see if it had been mislabelled. You might look at the colonies using a microscope and discover that they are actually contaminating colonies of yeast. Had you not made any replicates, it is possible you would not have realized that a mistake had occurred.


Fig 4 shows the results of the same qRT-PCR experiment as in Fig 3 , but in this case, for one of the sets of triplicate PCR ratios there is much more variation than in the others. Furthermore, this large variation can be accounted for by just one value of the three replicates—that is, the uppermost circle in the graph. If you had results such as those in Fig 4A , you would look at the individual values for the Actin PCR and Bjm PCR for the replicate that had the strange result. If the Bjm PCR sample was unusually high, you could check the corresponding well in the PCR plate to see if it had the same volume as the other wells. Conversely, if the Actin PCR value was much lower than those for the other two replicates, on checking the well in the plate you might find that the volume was too low. Alternatively, the unusual results might have been due to accidentally adding two aliquots of RNA, or two of PCR primer-reaction mix. Or perhaps the pipette tip came loose, or there were crystals obscuring the optics, or the pipette had been blocked by some debris, etc., etc., etc. Replicates can thus alert you to aberrant results, so that you know when to look further and when to repeat the experiment. Replicates can act as an internal check of the fidelity with which the experiment was performed. They can alert you to problems with plumbing, leaks, optics, contamination, suspensions, mixing or mix-ups. But they cannot be used to infer conclusions.

Figure 4

Interpreting data from replicates. ( A ) Mean ± SE of three independent cultures each with ratios from triplicate PCR measurements. P > 0.05. This experiment is much like the one in Fig 3B . However, notice in this case, for one of the sets of replicates (the circles from one of the HH-CSF-treated replicate values), there is a much greater range than for the other five sets of triplicate values. Because replicates are carefully designed to be as similar to each other as possible, finding unexpected variation should prompt an investigation into what went wrong during the conduct of the experiment. Note how in this case, an increase in variation among one set of replicates causes a decrease in the SEs for the values for the independent HH-CSF results: the SE bars for the HH-CSF condition are shorter in Fig 4A than in Fig 3B . Failure to take note of abnormal variation in replicates can lead to incorrect statistical inferences. ( B ) Bjm mRNA levels (relative to Actin ) for three independent cultures each with ratios from triplicate PCR measurements. Means are shown by a horizontal line. The data here are the same as those for Fig 3B or Fig 4A with the aberrant value deleted. When n is as small as 3, it is better to just plot the data points, rather than showing statistics. SE, standard error.

Because replicate values are not relevant to the hypothesis being tested, they—and statistics derived from them—should not be shown in figures. In Fig 4B , the large dots show the means of the replicate values in Fig 4A , after the aberrant replicate value has been excluded. While in this figure you could plot the means and SEs of the mRNA results from the three independent medium- and HH-CSF-treated cultures, in this case, the independent values are plotted and no error bars are shown. When the number of independent data points is low, and they can easily be seen when plotted on the graph, we recommend simply doing this, rather than showing means and error bars.

What should we look for when reading papers? Although replicates can be a valuable internal control to monitor the performance of your experiments, there is no point in showing them in the figures in publications because the statistics from replicates are not relevant to the hypothesis being tested. Indeed, if statistics, error bars and P -values for replicates are shown, they can mislead the readers of a paper who assume that they are relevant to the paper's conclusions. The corollary of this is that if you are reading a paper and see a figure in which the error bars—whether standard deviation, SE or CI—are unusually small, it might alert you that they come from replicates rather than independent samples. You should carefully scrutinize the figure legend to determine whether the statistics come from replicates or independent experiments. If the legend does not state what the error bars are, what n is, or whether the results come from replicates or independent samples, ask yourself whether these omissions undermine the paper, or whether some knowledge can still be gained from reading it.


You should also be sceptical if the figure contains data from only a single experiment with statistics for replicates, because in this case, n = 1, and no valid conclusions can be made, even if the authors state that the results were ‘representative’—if the authors had more data, they should have included them in the published results (see Sidebar B for a checklist of what to look for). If you wish to see more examples of what not to do, search the Internet for the phrases ‘SD of one representative’, ‘SE of one representative’, ‘SEM of one representative’, ‘SD of replicates’ or ‘SEM of replicates’.

Sidebar B | Error checklist when reading papers

  • If error bars are shown, are they described in the legend?
  • If statistics or error bars are shown, is n stated?
  • If the standard deviations (SDs) are less than 10%, do the results come from replicates?
  • If the SDs of a binomial distribution are consistently less than √( np (1 – p ))—where n is the sample size and p is the success probability—are the data too good to be true?
  • If the SDs of a Poisson distribution are consistently less than √(mean), are the data too good to be true? (A minimal version of this check is sketched after this list.)
  • If the statistics come from replicates, or from a single ‘representative’ experiment, consider whether the experiments offer strong support for the conclusions.
  • If P -values are shown for replicates or a single ‘representative’ experiment, consider whether the experiments offer strong support for the conclusions.
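A minimal version of the Poisson check from the list above, applied to counts such as colonies per plate (the comparison itself comes from the checklist; the example data are the WT + HH-CSF counts from Table 1):

```python
import numpy as np

def poisson_sd_check(counts):
    """Return the observed SD of replicate counts and the Poisson expectation sqrt(mean)."""
    counts = np.asarray(counts, dtype=float)
    return counts.std(ddof=1), np.sqrt(counts.mean())

# WT + HH-CSF plate counts from Table 1: observed SD ~5.0 versus sqrt(mean) ~7.8,
# so these replicates are not suspiciously tight.
print(poisson_sd_check([61, 59, 55, 64, 57, 69, 63, 51, 61, 61]))
```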


Acknowledgments

This work was made possible through Victorian State Government Operational Infrastructure Support, and Australian Government NHMRC IRIISS and NHMRC grants 461221 and 433063.

The authors declare that they have no conflict of interest.

  • Campbell KH, McWhir J, Ritchie WA, Wilmut I (1996) Sheep cloned by nuclear transfer from a cultured cell line. Nature 380: 64–66


Science News

A massive 8-year effort finds that much cancer research can’t be replicated.

Unreliable preclinical studies could impede drug development later on


An effort to replicate nearly 200 preclinical cancer experiments that generated buzz from 2010 to 2012 found that only about a quarter could be reproduced. Prostate cancer cells are shown in this artist’s illustration.



By Tara Haelle

December 7, 2021 at 8:00 am

After eight years, a project that tried to reproduce the results of key cancer biology studies has finally concluded. And its findings suggest that like research in the social sciences, cancer research has a replication problem.

Researchers with the Reproducibility Project: Cancer Biology aimed to replicate 193 experiments from 53 top cancer papers published from 2010 to 2012. But only a quarter of those experiments were able to be reproduced , the team reports in two papers published December 7 in eLife .

The researchers couldn’t complete the majority of experiments because the team couldn’t gather enough information from the original papers or their authors about methods used, or obtain the necessary materials needed to attempt replication.

What’s more, of the 50 experiments from 23 papers that were reproduced, effect sizes were, on average, 85 percent lower than those reported in the original experiments. Effect sizes indicate how big the effect found in a study is. For example, two studies might find that a certain chemical kills cancer cells, but the chemical kills 30 percent of cells in one experiment and 80 percent of cells in a different experiment. The first experiment has less than half the effect size seen in the second one. 

The team also measured if a replication was successful using five criteria. Four focused on effect sizes, and the fifth looked at whether both the original and replicated experiments had similarly positive or negative results, and if both sets of results were statistically significant. The researchers were able to apply those criteria to 112 tested effects from the experiments they could reproduce. Ultimately, just 46 percent, or 51, met more criteria than they failed, the researchers report.

“The report tells us a lot about the culture and realities of the way cancer biology works, and it’s not a flattering picture at all,” says Jonathan Kimmelman, a bioethicist at McGill University in Montreal. He coauthored a commentary on the project exploring the ethical aspects of the findings.

It’s worrisome if experiments that cannot be reproduced are used to launch clinical trials or drug development efforts, Kimmelman says. If it turns out that the science on which a drug is based is not reliable, “it means that patients are needlessly exposed to drugs that are unsafe and that really don’t even have a shot at making an impact on cancer,” he says.

At the same time, Kimmelman cautions against overinterpreting the findings as suggesting that the current cancer research system is broken. “We actually don’t know how well the system is working,” he says. One of the many questions left unresolved by the project is what an appropriate rate of replication is in cancer research, since replicating all studies perfectly isn’t possible. “That’s a moral question,” he says. “That’s a policy question. That’s not really a scientific question.”

The overarching lessons of the project suggest that substantial inefficiency in preclinical research may be hampering the drug development pipeline later on, says Tim Errington, who led the project. He is the director of research at the Center for Open Science in Charlottesville, Va., which cosponsored the research.

As many as 14 out of 15 cancer drugs that enter clinical trials never receive approval from the U.S. Food and Drug Administration. Sometimes that’s because the drugs lack commercial potential, but more often it is because they do not show the level of safety and effectiveness needed for licensure.


Much of that failure is expected. “We’re humans trying to understand complex disease, we’re never going to get it right,” Errington says. But given the cancer reproducibility project’s findings, perhaps “we should have known that we were failing earlier, or maybe we don’t understand actually what’s causing [an] exciting finding,” he says.

Still, it’s not that failure to replicate means that a study was wrong or that replicating it means that the findings are correct, says Shirley Wang, an epidemiologist at Brigham and Women’s Hospital in Boston and Harvard Medical School. “It just means that you’re able to reproduce,” she says, a point that the reproducibility project also stresses.

Scientists still have to evaluate whether a study’s methods are unbiased and rigorous, says Wang, who was not involved in the project but reviewed its findings. And if the results of original experiments and their replications do differ, it’s a learning opportunity to find out why and the implications, she adds.

Errington and his colleagues have reported on subsets of the cancer reproducibility project’s findings before , but this is the first time that the effort’s entire analysis has been released ( SN: 1/18/17 ).

During the project, the researchers faced a number of obstacles, particularly that none of the original experiments included enough details in their published studies about methods to attempt reproduction. So the reproducibility researchers contacted the studies’ authors for additional information.

While authors for 41 percent of the experiments were extremely or very helpful, authors for another third of the experiments did not reply to requests for more information or were not otherwise helpful, the project found. For example, one of the experiments that the group was unable to replicate required the use of a mouse model specifically bred for the original experiment. Errington says that the scientists who conducted that work refused to share some of these mice with the reproducibility project, and without those rodents, replication was impossible.


Some researchers were outright hostile to the idea that independent scientists wanted to attempt to replicate their work, says Brian Nosek, executive director at the Center for Open Science and a coauthor on both studies. That attitude is a product of a research culture that values innovation over replication, and that prizes the academic publish-or-perish system over cooperation and data sharing, Nosek says.

Some scientists may feel threatened by replication because it is uncommon. “If replication is normal and routine, people wouldn’t see it as a threat,” Nosek says. But replication may also feel intimidating because scientists’ livelihoods and even identities are often so deeply rooted in their findings, he says. “Publication is the currency of advancement, a key reward that turns into chances for funding, chances for a job and chances for keeping that job,” Nosek says. “Replication doesn’t fit neatly into that rewards system.”

Even authors who wanted to help couldn’t always share their data for various reasons, including lost hard drives or intellectual property restrictions or data that only former graduate students had.

Calls from some experts about science’s “reproducibility crisis” have been growing for years, perhaps most notably in psychology (SN: 8/27/18). Then in 2011 and 2012, pharmaceutical companies Bayer and Amgen reported difficulties in replicating findings from preclinical biomedical research.

But not everyone agrees on solutions, including whether replication of key experiments is actually useful or possible, or even what exactly is wrong with the way science is done or what needs to improve (SN: 1/13/15).

At least one clear, actionable conclusion emerged from the new findings, says Yvette Seger, director of science policy at the Federation of American Societies for Experimental Biology. That’s the need to provide scientists with as much opportunity as possible to explain exactly how they conducted their research.

“Scientists should aspire to include as much information about their experimental methods as possible to ensure understanding about results on the other side,” says Seger, who was not involved in the reproducibility project.

Ultimately, if science is to be a self-correcting discipline, there needs to be plenty of opportunities not only for making mistakes but also for discovering those mistakes, including by replicating experiments, the project’s researchers say.

“In general, the public understands science is hard, and I think the public also understands that science is going to make errors,” Nosek says. “The concern is and should be, is science efficient at catching its errors?” The cancer project’s findings don’t necessarily answer that question, but they do highlight the challenges of trying to find out.


More social science studies just failed to replicate. Here’s why this is good.

What scientists learn from failed replications: how to do better science.

by Brian Resnick

Psychologists are still wondering: “What’s going on in there?” They’re just doing it with greater rigor.

One of the cornerstone principles of science is replication. This is the idea that experiments need to be repeated to find out if the results will be consistent. The fact that an experiment can be replicated is how we know its results contain a nugget of truth. Without replication, we can’t be sure.

For the past several years, social scientists have been deeply worried about the replicability of their findings. Incredibly influential, textbook findings in psychology — like the “ego depletion” theory of willpower, or the “marshmallow test” — have been bending or breaking under rigorous retests. And the scientists have learned that what they used to consider commonplace methodological practices were really just recipes to generate false positives. This period has been called the “replication crisis” by some.

And the reckoning is still underway. Recently, a team of social scientists — spanning psychologists and economists — attempted to replicate 21 findings published in the most prestigious general science journals: Nature and Science. Some of the retested studies have been widely influential in science and in pop culture, like a 2011 paper on whether access to search engines hinders our memories, or whether reading books improves a child’s theory of mind (meaning their ability to understand that other people have thoughts and intentions different from their own).

On Monday, they’re publishing their results in the journal Nature Human Behavior. Here’s their take-home lesson: Even studies that are published in the top journals should be taken with a grain of salt until they are replicated. They’re initial findings, not ironclad truth. And they can be really hard to replicate, for a variety of reasons.

Rigorous retests of social science studies often yield less impressive results

The scientists who ran the 21 replication tests didn’t just repeat the original experiments — they made them more rigorous. In some cases, they increased the number of participants by a factor of five, and preregistered their study and analysis designs before a single participant was brought into the lab.

All the original authors (save for one group that couldn’t be reached) signed off on the study designs too. Preregistering is like making a promise not to deviate from the plan in ways that would inject bias into the results.

Here are the results: 13 of the 21 findings replicated. But perhaps just as notable: even among the studies that did pass, the effect sizes (that is, the difference between the experimental group and the control group, or the size of the change the experimental manipulation made) shrank by around half, meaning that the original findings likely overstated the power of the experimental manipulation.
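To make the “effect size” idea concrete, here is a minimal sketch with made-up numbers (my own illustration, not the study’s analysis code). It computes a standard effect-size measure, Cohen’s d, for a hypothetical original experiment and a larger replication whose true effect is half as big:

```python
# Minimal sketch (illustrative only): Cohen's d for a hypothetical original
# experiment versus a larger replication with a smaller true effect.
import numpy as np

def cohens_d(treatment, control):
    """Difference in group means, scaled by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * np.var(treatment, ddof=1) +
                  (n2 - 1) * np.var(control, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)

# Hypothetical original study: 30 people per group, true effect of 0.6 SD.
orig_treat = rng.normal(0.6, 1.0, 30)
orig_ctrl = rng.normal(0.0, 1.0, 30)

# Hypothetical replication: five times the sample, true effect of 0.3 SD.
rep_treat = rng.normal(0.3, 1.0, 150)
rep_ctrl = rng.normal(0.0, 1.0, 150)

print(f"original d:    {cohens_d(orig_treat, orig_ctrl):.2f}")
print(f"replication d: {cohens_d(rep_treat, rep_ctrl):.2f}")
```

The halving reported in the paper is exactly this kind of shrinkage: the replication, with more people and a preregistered plan, measures a smaller standardized difference than the original did.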

“Overall, our study shows statistically significant scientific findings should be interpreted rather cautiously until they have been replicated, even if they have been published in the most renowned journals,” Felix Holzmeister, an Austrian economist and one of the study co-authors, says.

It’s not always clear why a study doesn’t replicate. Science is hard.

Many of the papers that were retested contained multiple experiments. Only one experiment from each paper was tested. So these failed replications don’t necessarily mean the theory behind the original findings is totally bunk.

For instance, the famous “Google Effects on Memory” paper — which found that we often don’t remember things as well when we know we can search for them online — did not replicate in this study. But the experiment chosen was a word-priming task (i.e., does thinking about the internet make it harder to retrieve information), and not the more real-world experiment that involved actually answering trivia statements. And other research since has bolstered that paper’s general argument that access to the internet is shifting the relationship we have with, and the utility of, our own memories.

There could be a lot of reasons a result doesn’t replicate. One is that the experimenters doing the replication messed something up.


Another reason can be that the study stumbled on a false positive.

One of the experiments that didn’t replicate was from University of Kentucky psychologist Will Gervais. The experiment tried to see if getting people to think more rationally would make them less willing to report religious belief.

“In hindsight, our study was outright silly,” Gervais says. They had people look at a picture of Rodin’s The Thinker or another statue. They thought The Thinker would nudge people to think harder.

“When we asked them a single question on whether they believe in God, it was a really tiny sample size, and barely significant ... I’d like to think it wouldn’t get published today,” Gervais says. (And note, this study was published in Science, a top journal.)

In other cases, a study may not replicate because the target — the human subjects — has changed. In 2012, MIT psychologist David Rand published a paper in Nature on human cooperation. The experiment involved online participants playing an economics game. He argues that a lot of online study participants have since grown familiar with this game, which makes it a less useful tool to probe real-life behaviors. His experiment didn’t replicate in the new study.

Finding out why a study didn’t replicate is hard work. But it’s exactly the type of work, and thinking, that scientists need to be engaged in. The point of this replication project, and others like it, is not to call out individual researchers. “It’s a reminder of our values,” says Brian Nosek, a psychologist and the director of the Center for Open Science, who collaborated on the new study. Scientists who publish in top journals should know their work may be checked up on. It’s also important, he notes, that the difficulty of replicating social science findings is itself a replicable finding.

Often, when studies don’t replicate, it’s not that the effort totally disproves the underlying hypothesis. And it doesn’t mean the original study authors were frauds. But replication results do often significantly change the story we tell about the experiment.

For instance, I recently wrote about a replication effort of the famous “marshmallow test” studies, which originally showed that the ability to delay gratification early in life is correlated with success later on. A new paper found this correlation, but when the authors controlled for factors like family background, the correlation went away.

Here’s how the story changed: Delay of gratification is not a unique lever to pull to positively influence other aspects of a person’s life. It’s a consequence of bigger-picture, harder-to-change components of a person.

In science, too often, the first demonstration of an idea becomes the lasting one. Replications are a reminder that in science, this isn’t supposed to be the case. Science ought to embrace and learn from failure.

The “replication crisis” in psychology has been going on for years now. And scientists are reforming their ways.

The “replication crisis” in psychology, as it is often called, started around 2010, when a paper using completely accepted experimental methods was published purporting to find evidence that people were capable of perceiving the future, which is impossible. This prompted a reckoning: Common practices like drawing on small samples of college students were found to be insufficient to find true experimental effects.

Scientists thought if you could find an effect in a small number of people, that effect must be robust. But often, significant results from small samples turn out to be statistical flukes. (For more on this, read our explainer on p-values.)
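A rough simulation shows why small-sample “hits” can be flukes. The sketch below is illustrative only (it is not drawn from any of the studies discussed here): even when the true effect is exactly zero, about 5 percent of small experiments still cross the p < 0.05 threshold, and those lucky flukes report sizable apparent effects.

```python
# Minimal sketch (my own simulation): with no true effect at all, roughly 5%
# of small two-group experiments still come out "significant", and those
# flukes report large apparent differences between the groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, n_experiments = 20, 10_000

significant, fluke_effects = 0, []
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)  # "treatment" group, true effect = 0
    b = rng.normal(0.0, 1.0, n_per_group)  # control group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:
        significant += 1
        fluke_effects.append(abs(a.mean() - b.mean()))

print(f"'significant' results: {significant / n_experiments:.1%}")        # about 5%
print(f"mean apparent effect among them: {np.mean(fluke_effects):.2f} SD") # large
```

This is the trap the crisis exposed: a significant result from a small study is not automatically robust, because chance alone generates a steady trickle of impressive-looking findings.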

The crisis intensified in 2015 when a group of psychologists, which included Nosek, published a report in Science with evidence of an overarching problem: When 270 psychologists tried to replicate 100 experiments published in top journals, only around 40 percent of the studies held up. The remainder either failed or yielded inconclusive data. And again, the replications that did work showed weaker effects than the original papers. The studies that tended to replicate had more highly significant results compared to the ones that just barely crossed the threshold of significance.

Another important reason to do replications, Nosek says, is to get better at understanding what types of studies are most likely to replicate, and to sharpen scientists’ intuitions about what hypotheses are worthy of testing and which are not.

As part of the new study, Nosek and his colleagues added a prediction component. A group of scientists took bets on which studies they thought would replicate and which they thought wouldn’t. The bets largely tracked with the final results.

As you can see in the chart below, the yellow dots are the studies that did not replicate, and they were all unfavorably ranked by the prediction market survey.

“These results suggest [there’s] something systematic about papers that fail to replicate,” Anna Dreber, a Stockholm-based economist and one of the study co-authors, says.

[Chart: the 21 studies ranked by prediction-market confidence; yellow dots mark the studies that did not replicate.]

One thing that stands out: Many of the papers that failed to replicate sound a little too good to be true. Take this 2010 paper, which found that simply washing your hands negates a common cognitive bias: when we make a tough choice, we often look back on the option we passed up unfavorably and are biased to find reasons to justify our decision. Washing hands in an experiment “seems to more generally remove past concerns, resulting in a metaphorical ‘clean slate’ effect,” the study’s abstract stated.

It all sounds a little too easy, too simple — and it didn’t replicate.

All that said, there are some promising signs that social science is getting better. More and more scientists are preregistering their study designs. This prevents them from cherry-picking the results and analyses that favor their preferred conclusions. Journals are getting better at demanding larger subject pools in experiments and are increasingly insisting that scientists share all the underlying data of their experiments for others to assess.

“The lesson out of this project,” Nosek says, “is a very positive message of reformation. Science is going to get better.”


100 psychology experiments repeated, less than half successful

Large-scale effort to replicate scientific studies produces some mixed results.

Cathleen O'Grady - Aug 28, 2015 1:29 pm UTC


Since November 2011, the Center for Open Science has been involved in an ambitious project: to repeat 100 psychology experiments and see whether the results are the same the second time round. The first wave of results will be released in tomorrow’s edition of Science, reporting that fewer than half of the original experiments were successfully replicated.

The studies in question were from social and cognitive psychology, meaning that they don’t have immediate significance for therapeutic or medical treatments. However, the project and its results have huge implications in general for science, scientists, and the public. The key takeaway is that a single study on its own is never going to be the last word, said study coordinator and psychology professor Brian Nosek.

“The reality of science is we're going to get lots of different competing pieces of information as we study difficult problems,” he said in a public statement. “We're studying them because we don't understand them, and so we need to put in a lot of energy in order to figure out what's going on. It's murky for a long time before answers emerge.”

Tuning up science's engines

A lack of replication is a problem  for many scientific disciplines, from psychology to biomedical science and beyond. This is because a single experiment is a very limited thing, with poor abilities to give definitive answers on its own.

Experiments need to operate under extremely tight constraints to avoid unexpected influences from toying with the results, which means they look at a question through a very narrow window. Meanwhile, experimenters have to make myriad individual decisions from start to finish: how to find the sample to be studied, what to include and exclude, what methods to use, how to analyse the results, how best to explain the results.

This is why it’s essential for a question to be peered at from every possible angle to get a clear understanding of how it really looks in its entirety, and for each experiment to be replicated: repeated again, and again, and again, to ensure that each result wasn’t a fluke, a mistake, a result of biased reporting or specific decisions—or, in worst-case scenarios, fraud.

And yet, the incentives for replications in scientific institutions are weak. “Novel, positive and tidy results are more likely to survive peer review,” said Nosek. Novel studies have a “wow” factor; replications are less exciting, and so they're less likely to get published.

It’s better for researchers’ careers to conduct and publish original research, rather than repeating studies someone else has already done. When grant money is scarce, it’s also difficult to direct it towards replications. With scientific journals more likely to accept novel research than replications, the incentives for researchers to participate in replication efforts diminish.

At the same time, studies that found what they set out to find—called a positive effect—are also more likely to be published, while less exciting results are more likely to languish in a file drawer. Over time, these problems combine to make “the published literature … more beautiful than the reality,” Nosek explained.

The more blemished reality is that it's impossible for all hunches to be correct. Many experiments will likely turn up nothing interesting, or show the opposite effect from what was expected, but these results are important in themselves. It helps researchers to know if someone else has already tried what they’re about to do, and found that it doesn’t work. And of course, if there are five published experiments showing that something works, and eight unpublished experiments showing it doesn’t, the published literature gives a very skewed image overall.

Many researchers are working to combat these problems in different ways, by tackling both the journals and the rewards systems in institutions. Some have called for all PhD candidates to be required to conduct at least one replication in order to graduate, although this could run the risk of making replication boring, low-prestige grunt work and do little to enhance its popularity.

Scratching the surface

In 2011, the Reproducibility Project: Psychology, coordinated by the Center for Open Science, started a massive replication effort: 100 psychology experiments from three important psychology journals, replicated by 270 researchers around the world.

As with all experiments, there were complicated decisions to be made along the way. Which experiments were most important to replicate first? How should they decide what level of expertise was necessary for the researchers doing the replicating? And most importantly, what counts as a successful replication?

The last question wasn’t an easy one to answer, so the researchers came up with a multitude of ways to assess it, and applied all the criteria to each replication.

Of the 100 original studies, 97 had results that were statistically significant; only a third of the replications, however, had statistically significant results. Around half of the replications had effect sizes that were roughly comparable to the original studies. The teams conducting the replications also reported whether they considered the effect to be replicated, and only 39 percent said that it had been. Taken together, these criteria suggest that fewer than half of the originals were successfully replicated.

So what does this mean? It’s easy to over-simplify what a successful or failed replication implies, the authors of the Science paper write. If the replication worked, all that means is that the original experiment produced a reliable, repeatable result. It doesn’t mean that the explanation for the results is necessarily correct.

There are often multiple different explanations for a particular pattern, and one set of authors might prefer one explanation, while others prefer another. Those questions remain unresolved with a simple replication, and they need different experiments to answer those concerns.

A failed replication, meanwhile, doesn’t necessarily mean that the original result was a false positive, although this is definitely possible. For a start, the replication result could have been a false negative. There’s also the possibility that small changes in the methods used for an experiment could change the results in unforeseen ways.

What’s really needed is multiple replications, as well as tweaks to the experiment to figure out when the effect appears, and when it disappears—this can help to figure out exactly what might be going on. If many different replications, trying different things, find that the original effect can’t be repeated, then it means that we can probably think about scrapping that original finding.


No clear answers, just hints. Obviously.

Part of what the Center for Open Science hoped to demonstrate with this effort is that, despite the incentives for novel research, it is possible to conduct huge replication efforts. In this project, there were incentives for researchers to invest, even if they weren’t the usual ones. “I felt I was taking part in an important groundbreaking effort and this motivated me to invest heavily in the replication study that I conducted,” said E. J. Masicampo, who led one of the replication teams.

Like all experiments, the meta-analysis of replications wasn’t able to answer every possible question at once. For instance, the project provided a list of potential experiments for volunteers to choose from, and it’s likely that there were biases in which experiments were chosen to be replicated. Because funding was thin on the ground, less resource-intensive experiments were likely to be chosen. It’s possible this affected the results in some way.

Another replication effort for another 100 experiments might turn up different results: the sample of original experiments will be different, the project coordinators might make different choices, and the analyses they choose might also change. That’s the point: uncertainty is “the reality of doing science, even if it is not appreciated in daily practice,” write the authors.

“After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true?” they continue. Their answer: zero. We also haven’t established that any of the effects are false. The first round of experiments offered the first bit of evidence; the replications added to that, and further replications will be needed to continue to build on that. And slowly, the joined dots begin to form a picture.

Science, 2015. DOI: 10.1126/science.aac4716


In Psychology And Other Social Sciences, Many Studies Fail The Reproducibility Test

Richard Harris


A researcher showed people a picture of The Thinker in an effort to study the link between analytical thinking and religious disbelief. In hindsight, the researcher called his study design “silly.” The study could not be reproduced. (Photo: Peter Barritt/Getty Images)

The world of social science got a rude awakening a few years ago, when researchers concluded that many studies in this area appeared to be deeply flawed. Two-thirds could not be replicated in other labs.

Some of those same researchers now report those problems still frequently crop up, even in the most prestigious scientific journals.

But their study, published Monday in Nature Human Behaviour, also finds that social scientists can actually sniff out the dubious results with remarkable skill.

First, the findings. Brian Nosek, a psychology researcher at the University of Virginia and the executive director of the Center for Open Science, decided to focus on social science studies published in the most prominent journals, Science and Nature.

"Some people have hypothesized that, because they're the most prominent outlets they'd have the highest rigor," Nosek says. "Others have hypothesized that the most prestigious outlets are also the ones that are most likely to select for very 'sexy' findings, and so may be actually less reproducible."

To find out, he worked with scientists around the world to see if they could reproduce the results of key experiments from 21 studies in Science and Nature, typically psychology experiments involving students as subjects. The new studies on average recruited five times as many volunteers, in order to come up with results that were less likely due to chance.
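The logic behind recruiting more volunteers can be shown with a short simulation. This is illustrative only; the assumed true effect of 0.3 standard deviations and the sample sizes are hypothetical, not taken from the replication project. The point is simply that a modest real effect, which a small study usually misses, becomes easy to detect when the sample is roughly five times larger.

```python
# Minimal sketch (illustrative only): statistical power at the original sample
# size versus a sample five times larger, for a hypothetical true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_per_group, true_d, trials=5_000):
    """Fraction of simulated experiments that reach p < 0.05."""
    hits = 0
    for _ in range(trials):
        treat = rng.normal(true_d, 1.0, n_per_group)
        ctrl = rng.normal(0.0, 1.0, n_per_group)
        if stats.ttest_ind(treat, ctrl).pvalue < 0.05:
            hits += 1
    return hits / trials

print(f"power at n=30 per group:  {power(30, 0.3):.0%}")   # roughly 20%
print(f"power at n=150 per group: {power(150, 0.3):.0%}")  # roughly 75%
```

Larger samples both reduce the chance of missing a real effect and make any effect that does show up less likely to be a statistical accident.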


The results were better than the average of a previous review of the psychology literature, but still far from perfect. Of the 21 studies, the experimenters were able to reproduce 13. And the effects they saw were on average only about half as strong as had been trumpeted in the original studies.

The remaining eight were not reproduced.

"A substantial portion of the literature is reproducible," Nosek concludes. "We are getting evidence that someone can independently replicate [these findings]. And there is a surprising number [of studies] that fail to replicate."

One of the eight studies that failed this test came from the lab of Will Gervais, when he was getting his PhD at the University of British Columbia. He and a colleague had run a series of experiments to see whether people who are more analytical are less likely to hold religious beliefs. In one test, undergraduates looked at pictures of statues.

“Half of our participants looked at a picture of the sculpture, ‘The Thinker,’ where here’s this guy engaged in deep reflective thought,” Gervais says. “And in our control condition, they’d look at the famous statue of a guy throwing a discus.”

People who saw The Thinker, a sculpture by Auguste Rodin, expressed more religious disbelief, Gervais reported in Science. And given all the evidence from his lab and others, he says there’s still reasonable evidence that the underlying conclusion is true. But he recognizes the sculpture experiment was really quite weak.

"Our study, in hindsight, was outright silly," says Gervais, who is now an assistant professor at the University of Kentucky.

A previous study also failed to replicate his experimental findings, so the new analysis is hardly a surprise.

But what interests him the most in the new reproducibility study is that scientists had predicted that his study, along with the seven others that failed to replicate, was unlikely to stand up to the challenge.

As part of the reproducibility study, about 200 social scientists were asked to predict which results would stand up to the retest and which would not. They filled out a survey predicting the winners and losers, and they also took part in a “prediction market,” where they could buy or sell tokens that represented their views.

“They’re taking bets with each other, against us,” says Anna Dreber, an economics professor at the Stockholm School of Economics and a coauthor of the new study.

It turns out, "these researchers were very good at predicting which studies would replicate," she says. "I think that's great news for science."
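The basic idea of checking the market against the outcomes can be sketched as follows. The numbers here are made up, and the actual market design is not described in this article; the sketch simply reads each study’s final token price, between 0 and 1, as the crowd’s estimated probability of replication and compares those estimates with what actually happened.

```python
# Minimal sketch of the idea, with hypothetical numbers: treat each study's
# final market price as the crowd's estimated probability that it will
# replicate, then compare those estimates with the observed outcomes.
final_prices = [0.85, 0.72, 0.60, 0.35, 0.90, 0.20, 0.55, 0.15]   # hypothetical
replicated   = [True, True, True, False, True, False, True, False]

def mean(xs):
    return sum(xs) / len(xs)

price_if_yes = mean([p for p, r in zip(final_prices, replicated) if r])
price_if_no = mean([p for p, r in zip(final_prices, replicated) if not r])

print(f"mean price for studies that replicated:  {price_if_yes:.2f}")
print(f"mean price for studies that did not:     {price_if_no:.2f}")
# A wide gap between these two averages is what "the bets tracked the
# results" looks like in the data.
```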

These forecasts could help accelerate the process of science. If you can get panels of experts to weigh in on exciting new results, the field might be able to spend less time chasing errant results known as false positives.


"A false positive result can make other researchers, and the original researcher, spend lots of time and energy and money on results that turn out not to hold," she says. "And that's kind of wasteful for resources and inefficient, so the sooner we find out that a result doesn't hold, the better."

But if social scientists were really good at identifying flawed studies, why did the editors and peer reviewers at Science and Nature let these eight questionable studies through their review process?

"The likelihood that a finding will replicate or not is one part of what a reviewer would consider," says Nosek. "But other things might influence the decision to publish. It may be that this finding isn't likely to be true, but if it is true, it is super important, so we do want to publish it because we want to get it into the conversation."

Nosek recognizes that, even though the new studies were more rigorous than the ones they attempted to replicate, that doesn't guarantee that the old studies are wrong and the new studies are right. No single scientific study gives a definitive answer.

Forecasting could be a powerful tool in accelerating that quest for the truth.

That may not work, however, in one area where the stakes are very high: medical research, where answers can have life-or-death consequences.

Jonathan Kimmelman at McGill University, who was not involved in the new study, says when he's asked medical researchers to make predictions about studies, the forecasts have generally flopped.

"That's probably not a skill that's widespread in medicine," he says. It's possible that the social scientists selected to make the forecasts in the latest study have deep skills in analyzing data and statistics, and their knowledge of the psychological subject matter is less important.

And forecasting is just one tool that could be used to improve the rigor of social science.

"The social-behavioral sciences are in the midst of a reformation," says Nosek. Scientists are increasingly taking steps to increase transparency, so that potential problems surface quickly. Scientists are increasingly announcing in advance the hypothesis they are testing; they are making their data and computer code available so their peers can evaluate and check their results.

Perhaps most important, some scientists are coming to realize that they are better off doing fewer studies, but with more experimental subjects, to reduce the possibility of a chance finding.

“The way to get ahead and get a job and get tenure is to publish lots and lots of papers,” says Gervais. “And it’s hard to do that if you are able to run fewer studies, but in the end I think that’s the way to go — to slow down our science and be more rigorous up front.”

Gervais says when he started his first faculty job, at the University of Kentucky, he sat down with his department chair and said he was going to follow this path of publishing fewer, but higher quality studies. He says he got the nod to do that. He sees it as part of a broader cultural change in social science that's aiming to make the field more robust.

