Level of replication

Authors and affiliations

T. Klebel, Know-Center

E. Kormann, Graz University of Technology

Version Revision date Revision Author
1.3 2024-12-06 Additions Eva Kormann
1.2 2023-08-30 Revisions Eva Kormann & Thomas Klebel
1.1 2023-07-20 Revisions Thomas Klebel
1.0 2023-04-26 First draft Eva Kormann

Description

Replication is often defined as the process of repeating a study with the same methodology: generating new data that can then be analysed similarly to the original study. A study is considered successfully replicated when the replication yields the same results as the original. The term replicability is closely related to reproducibility, and the two are sometimes used interchangeably. The terms can, however, be differentiated: reproduction refers to repeating the analysis with the original study data, while replication refers to repeating the entire study and creating new data to analyse (Goodman, Fanelli, and Ioannidis 2016).

A certain number of replication attempts is expected to fail simply due to the chance of false positives and false negatives in the original or replication studies (Marino 2018). However, higher proportions of failed replication attempts might, for instance, be signs of insufficient reporting, biases (cognitive or related to the publication process, i.e., publication bias), or methodological issues (such as low statistical power), and therefore challenge the validity and credibility of results. Low levels of replication indicate flaws in research practices and a potential waste of research effort (Munafò et al. 2017).

The level of successful replication is a direct indicator of reproducibility. It can also serve as an indicator of research quality in a broader sense, since issues related to reporting or methods increase the risk of failed replication. The extent to which research findings are replicable can be examined over time and in relation to the research practices employed.

Metrics

Number (%) of studies found to successfully replicate

The level of replication can be measured by counting the number, or calculating the proportion, of studies that were found to replicate successfully. Data on the success of replication attempts, however, are limited: re-performing a study requires substantial resources, and for a large share of the published literature no data on the number or share of successful replications are available. Additionally, some studies cannot be replicated at all, e.g., because a one-time event was studied. For these types of studies, levels of replication cannot be assessed.

Measurement

Levels of replication can be examined directly by analysing the proportion of successful replication attempts. To this end, the number of replication attempts and their success or failure need to be recorded. The percentage can then be calculated as the proportion of successful replications among all replication attempts. Difficulties, however, lie in defining what constitutes a successful replication. A common argument in the literature on replication and reproducibility is that exact replications are not possible, since the exact setting, context, sample, etc. usually cannot be recreated fully (Nosek et al. 2022; Schmidt 2009). A study might be seen as a replication “when the differences to the original study are believed to be irrelevant for obtaining the evidence about the same finding” (Nosek et al. 2022), but this judgement cannot easily be standardised and remains rather subjective.

To calculate the number or percentage of successful replications, a dichotomous indicator of replication success is needed (Nosek et al. 2022). Multiple ways of indicating success are in use (Nosek et al. 2022); a minimal calculation based on two of these criteria is sketched after the lists below:

  • The null hypothesis is rejected in the same direction (p < α).
  • An estimate is within a confidence or prediction interval.
  • The detected effect size is consistent with the original study.
  • The findings are similar when assessed subjectively.

There are also continuous measures that can be dichotomized:

  • Bayes factors for comparison of original and replication findings.
  • Bayesian tests to compare null distribution and posterior distribution of the original study.
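
The following sketch illustrates how the metric can be derived once replication outcomes have been recorded: it computes the share of successful replications under two of the dichotomous criteria listed above. All data, thresholds, and variable names are hypothetical and for demonstration only.

```python
# Minimal sketch: share of successful replications under two dichotomous
# criteria. All data below are hypothetical.
ALPHA = 0.05

# Each record pairs an original finding with its replication attempt.
replications = [
    {"same_direction": True,  "p_value": 0.01, "within_original_ci": True},
    {"same_direction": True,  "p_value": 0.20, "within_original_ci": True},
    {"same_direction": False, "p_value": 0.03, "within_original_ci": False},
    {"same_direction": True,  "p_value": 0.04, "within_original_ci": False},
]

# Criterion 1: null hypothesis rejected in the same direction (p < alpha).
successes_p = [r["same_direction"] and r["p_value"] < ALPHA for r in replications]

# Criterion 2: replication estimate falls within the original confidence interval.
successes_ci = [r["within_original_ci"] for r in replications]

def share(successes):
    """Proportion of successful replications among all attempts."""
    return sum(successes) / len(successes)

print(f"Significance criterion: {share(successes_p):.0%} replicated")
print(f"Confidence-interval criterion: {share(successes_ci):.0%} replicated")
```

As the example shows, the resulting level of replication depends on which success criterion is chosen, which is why the criterion should be defined before the replications are assessed.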

Datasources

There is no single data source for this metric. Data needs to be extracted from existing publications or gathered anew using the methodologies described below.

Existing methodologies

Replication projects/studies

Many studies and projects on the topic of replicability pursue the goal of conducting a multitude of replication attempts following the same process or using the same measures of success, in order to determine the proportion of replicable findings. Some of these studies and projects concentrate on specific fields of research, e.g., the large-scale replication project reported in “Estimating the Reproducibility of Psychological Science” (2015) and the SCORE project (including subprojects such as the repliCATS project) for the social-behavioural sciences. The “Many Labs” studies also had their initial focus on psychology but have since expanded into other disciplines (Klein et al. 2014; Stroebe 2019). Approaches to investigating replicability have by now reached a multitude of disciplines, including the humanities.

Scoping review papers

Since multiple re-performed studies are needed to determine the percentage of successful replications, standalone replication attempts provide only limited information. Scoping reviews are one way to synthesize individual replication attempts to gain an overview and to estimate the percentage of successful replications more precisely. However, the individual replication studies synthesized within a review might be inconsistent in their procedures and employ different measures of success, complicating synthesis and comparison.

Automated tools

Recently, automated tools have been developed and tested that are intended to discover replication studies and assess whether they were successful. Proposed approaches use text classification models to differentiate replication studies from original studies and to categorize them based on whether they succeeded. However, the performance of this approach is still too limited for fully automated use (Ruiter 2023).

Number (%) of studies reported to successfully replicate

Replication attempts might not be published (especially in the case of repeated replications of the same study) and might, for instance, only be conducted internally within a research group. To gather information about these replication attempts, researchers can be asked to report the total number of replications they have attempted and how many of those studies they were able to replicate.

Measurement

In addition to directly measuring the success of replication studies, the level of replication can also be assessed by surveying researchers. They can report retrospectively on their replication attempts and indicate or estimate the level of replication they encountered. These reports might be less systematic, detailed, or objective than studies or projects that directly re-perform studies, but they can be acquired with fewer resources.

Datasources

There is no single data source for this metric. Data needs to be extracted from existing publications or gathered anew using the methodologies described below.

Existing methodologies

Surveys

Researchers’ experiences with replications, and the level of replication they have encountered in their work, can be investigated through surveys. Such surveys can include questions about previous replication attempts and their success, as well as general estimates of replicability. The Nature survey by Baker (2016) employed this method with similar questions, although it used the term “reproducibility” instead of “replicability”.

Number (%) of studies predicted to successfully replicate

Since replication attempts require substantial resources, they cannot be conducted for all studies. Where neither replication studies nor researcher reports are available to assess levels of replication, the number of studies expected to successfully replicate can also be estimated through expert predictions.

Measurement

Levels of replication can also be measured prospectively, through expert predictions of the replicability of studies (mostly captured as the predicted probability of successful replication), without directly attempting a replication. A percentage or number of studies predicted to successfully replicate can be calculated after dichotomizing this probability (e.g., interpreting a probability > .5 as a prediction of success). While these predictions might be less accurate than other measures of replicability, studies do not actually have to be re-performed when only this measure is employed.
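
As a minimal sketch of this dichotomization step, the code below converts hypothetical expert-predicted probabilities into a count and share of studies predicted to replicate. The probabilities and the 0.5 threshold are illustrative assumptions, not values from any of the cited studies.

```python
# Minimal sketch: dichotomizing predicted replication probabilities.
# The probabilities below are hypothetical expert estimates, one per study.
predicted_probabilities = [0.82, 0.35, 0.61, 0.44, 0.73]

THRESHOLD = 0.5  # probability > .5 interpreted as a prediction of success

predicted_success = [p > THRESHOLD for p in predicted_probabilities]

n_predicted = sum(predicted_success)
share_predicted = n_predicted / len(predicted_success)

print(f"{n_predicted} of {len(predicted_success)} studies "
      f"({share_predicted:.0%}) predicted to replicate")
```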

Datasources

There is no single data source for this metric. Data needs to be extracted from existing publications or gathered anew using the methodologies described below. The studies cited below made their data available, which can be used to re-analyse or extend existing analyses. Note, however, that Forsell and colleagues announced a data repository but had not yet populated it as of June 2023.

Existing methodologies

The following methodologies, namely surveys and prediction markets, have so far mostly been used in conjunction with each other. They have also been validated by subsequently conducting full replication studies. Automated tools have also been proposed but require further development.

Surveys

In surveys, experts (mainly researchers) can be asked to estimate the probability that specific studies would be successfully re-performed, based on information they are given about these studies (e.g., hypothesis, effect size, p-value, link to the original paper). Such surveys have been validated with subsequent replication attempts and compared to prediction markets (see next section). While some studies find many prediction errors and low accuracy when predicting whether the null hypothesis will be rejected in the same direction (p < α) (Dreber et al. 2015; Forsell et al. 2019), other findings show better general accuracy of these predictions (Gordon et al. 2021) and better performance in predicting relative effect sizes compared to prediction markets (Forsell et al. 2019).

Prediction markets

In prediction markets, participants trade bets on a certain outcome, and the final market prices can be taken as an indicator of the probability of an event. In the context of replicability, experts (mainly researchers) are given a budget to bet on studies they think will successfully replicate. The final market price then serves as a proxy for the probability, as estimated by the entire market, that a study will replicate successfully (i.e., reach a previously defined replication criterion). Prediction markets have been shown to reach accuracies higher than 70% and to outperform surveys (Dreber et al. 2015; Forsell et al. 2019; Gordon et al. 2021). However, prediction markets do not yet appear to be established as a standalone measure for levels of replication.
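
The sketch below illustrates, with entirely hypothetical prices and outcomes, how final market prices can be treated as probabilities, dichotomized, and checked against the results of actual replication attempts. The accuracy calculation mirrors the validation logic described above; the Brier score is added here as a commonly used accuracy measure for probabilistic forecasts and is not taken from the cited studies.

```python
# Minimal sketch: final prediction market prices as probabilities of
# successful replication, checked against observed outcomes.
# Prices and outcomes below are hypothetical.
final_prices = [0.78, 0.42, 0.66, 0.31, 0.85]            # one market per study
observed_replication = [True, False, True, True, True]   # outcome of the replication

# Dichotomize market prices against the replication criterion defined in advance.
predicted = [price > 0.5 for price in final_prices]

correct = sum(p == o for p, o in zip(predicted, observed_replication))
accuracy = correct / len(predicted)

# Brier score: mean squared difference between price and outcome (lower is better).
brier = sum((price - int(o)) ** 2
            for price, o in zip(final_prices, observed_replication)) / len(final_prices)

print(f"Accuracy of market predictions: {accuracy:.0%}")
print(f"Brier score: {brier:.3f}")
```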

Automated tools

Algorithmic approaches have also been proposed to predict whether a study can be replicated. Such algorithms could be trained on historic examples of original studies, labelled according to whether they were successfully replicated, and then classify other studies based on their text. However, first attempts have yielded only low performance, partly due to a lack of sufficient training data (Ruiter 2023).
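
As a rough illustration of this general idea only, the sketch below trains a simple text classifier (TF-IDF features with logistic regression, using scikit-learn) on a handful of hypothetical labelled abstracts and predicts replication outcomes for a new study. The data, features, and model choice are assumptions for demonstration and do not reflect the specific systems evaluated by Ruiter (2023); a real application would require a large labelled corpus.

```python
# Minimal sketch: predicting replication success from study text.
# Abstracts and labels are hypothetical toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_abstracts = [
    "Large preregistered sample shows a robust effect of feedback on learning.",
    "Small exploratory study reports a surprising priming effect on behaviour.",
    "Multi-site trial with high statistical power confirms the intervention effect.",
    "Single small-sample experiment finds a subtle moderation effect.",
]
train_replicated = [1, 0, 1, 0]  # 1 = replication succeeded, 0 = it failed

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_abstracts, train_replicated)

new_abstracts = [
    "Preregistered design with a large online sample tests a well-established effect.",
]
predicted = model.predict(new_abstracts)
predicted_share = predicted.mean()  # share of new studies predicted to replicate

print(f"Predicted to replicate: {predicted_share:.0%} of the new studies")
```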

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Dreber, Anna, Thomas Pfeiffer, Johan Almenberg, Siri Isaksson, Brad Wilson, Yiling Chen, Brian A. Nosek, and Magnus Johannesson. 2015. “Using Prediction Markets to Estimate the Reproducibility of Scientific Research.” Proceedings of the National Academy of Sciences 112 (50): 15343–47. https://doi.org/10.1073/pnas.1516179112.
“Estimating the Reproducibility of Psychological Science.” 2015. Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.
Forsell, Eskil, Domenico Viganola, Thomas Pfeiffer, Johan Almenberg, Brad Wilson, Yiling Chen, Brian A. Nosek, Magnus Johannesson, and Anna Dreber. 2019. “Predicting Replication Outcomes in the Many Labs 2 Study.” Journal of Economic Psychology 75 (December): 102117. https://doi.org/10.1016/j.joep.2018.10.009.
Goodman, Steven N., Daniele Fanelli, and John P. A. Ioannidis. 2016. “What Does Research Reproducibility Mean?” Science Translational Medicine 8 (341). https://doi.org/10.1126/scitranslmed.aaf5027.
Gordon, Michael, Domenico Viganola, Anna Dreber, Magnus Johannesson, and Thomas Pfeiffer. 2021. “Predicting Replicability – Analysis of Survey and Prediction Market Data from Large-Scale Forecasting Projects.” Edited by Michelangelo Vianello. PLOS ONE 16 (4): e0248780. https://doi.org/10.1371/journal.pone.0248780.
Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams Jr., Štěpán Bahník, Michael J. Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability: A Many Labs Replication Project.” Social Psychology 45 (3): 142. https://doi.org/10.1027/1864-9335/a000178.
Marino, Michael J. 2018. “How Often Should We Expect to Be Wrong? Statistical Power, P Values, and the Expected Prevalence of False Discoveries.” Biochemical Pharmacology 151 (May): 226–33. https://doi.org/10.1016/j.bcp.2017.12.011.
Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie Du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1): 0021. https://doi.org/10.1038/s41562-016-0021.
Nosek, Brian A., Tom E. Hardwicke, Hannah Moshontz, Aurélien Allard, Katherine S. Corker, Anna Dreber, Fiona Fidler, et al. 2022. “Replicability, Robustness, and Reproducibility in Psychological Science.” Annual Review of Psychology 73 (1): 719–48. https://doi.org/10.1146/annurev-psych-020821-114157.
Ruiter, Bob de. 2023. “Automatically Finding and Categorizing Replication Studies.” arXiv. https://doi.org/10.48550/arXiv.2311.15055.
Schmidt, Stefan. 2009. “Shall We Really Do It Again? The Powerful Concept of Replication Is Neglected in the Social Sciences.” Review of General Psychology 13 (2): 90–100. https://doi.org/10.1037/a0015108.
Stroebe, Wolfgang. 2019. “What Can We Learn from Many Labs Replications?” Basic and Applied Social Psychology 41 (2): 91–103. https://doi.org/10.1080/01973533.2019.1577736.

Reuse

Open Science Impact Indicator Handbook © 2024 by PathOS is licensed under CC BY 4.0.

Citation

BibTeX citation:
@online{apartis2024,
  author = {Apartis, S. and Catalano, G. and Consiglio, G. and Costas,
    R. and Delugas, E. and Dulong de Rosnay, M. and Grypari, I. and
    Karasz, I. and Klebel, Thomas and Kormann, E. and Manola, N. and
    Papageorgiou, H. and Seminaroti, E. and Stavropoulos, P. and Stoy,
    L. and Traag, V.A. and van Leeuwen, T. and Venturini, T. and
    Vignetti, S. and Waltman, L. and Willemse, T.},
  title = {Open {Science} {Impact} {Indicator} {Handbook}},
  date = {2024},
  url = {https://handbook.pathos-project.eu/sections/5_reproducibility/level_of_replication.html},
  doi = {10.5281/zenodo.14538442},
  langid = {en}
}
For attribution, please cite this work as:
Apartis, S., G. Catalano, G. Consiglio, R. Costas, E. Delugas, M. Dulong de Rosnay, I. Grypari, et al. 2024. “Open Science Impact Indicator Handbook.” Zenodo. 2024. https://doi.org/10.5281/zenodo.14538442.