Level of replication
History
| Version | Revision date | Revision | Author |
|---------|---------------|----------|--------|
| 1.2 | 2023-08-30 | Revisions | Eva Kormann & Thomas Klebel |
| 1.1 | 2023-07-20 | Revisions | Thomas Klebel |
| 1.0 | 2023-04-26 | First draft | Eva Kormann |
Description
Replication is often defined as the process of repeating a study with the same methodology: generating new data that can then be analysed in the same way as in the original study. A study is considered successfully replicated when the replication yields the same results as the original. The term replicability is closely related to reproducibility, and the two are sometimes used interchangeably. However, they can be differentiated by referring to reproduction when the analysis is repeated with the original study data, and to replication when the entire study is repeated, creating new data to analyse (Goodman, Fanelli, and Ioannidis 2016).
A certain number of replication attempts is expected to fail due to the chance of false positives / false negatives in the original or replication studies (Marino 2018). However, higher proportions of failed replication attempts might, for instance, be signs of insufficient reporting, biases (cognitive or related to the publication process, i.e., publication bias) or methodological issues (such as low statistical power), and therefore challenge the validity and credibility of results. Low levels of replication indicate flaws in research practices and potential waste of research effort (Munafò et al. 2017).
The level of successful replication is a direct indicator of reproducibility. It can also serve as an indicator of research quality in a broader sense, since issues related to reporting or methods increase the risk of failed replication. The extent to which research findings are replicable can be examined over time and in relation to the research practices employed.
Metrics
Number (%) of studies found to successfully replicate
The level of replication can be measured by counting the number, or calculating the proportion, of studies that were found to successfully replicate. Data on the success of replication attempts, however, is limited: re-performing a study requires substantial resources, and for a large share of the published literature no data on the number or share of successful replications is available. Additionally, some studies might be impossible to replicate, e.g., because a one-time event was studied. For these types of studies, levels of replication cannot be assessed.
Measurement.
Levels of replication can be examined directly by analysing the proportion of successful replication attempts. To do so, the number of replication attempts and their success or failure needs to be measured; the percentage can then be calculated as the proportion of successful replications among all replication attempts. Difficulties, however, lie in defining what constitutes a successful replication. A common argument in the literature on replication and reproducibility is that exact replications are not possible, since the exact setting, context, sample, etc. usually cannot be recreated fully (Nosek et al. 2022; Schmidt 2009). A study might be seen as a replication “when the differences to the original study are believed to be irrelevant for obtaining the evidence about the same finding” (Nosek et al. 2022), but this cannot easily be determined and remains rather subjective.
To calculate the number or percentage of successful replications, a dichotomous indicator of replication success is needed (Nosek et al. 2022). Multiple ways of indicating success of replication are in use (Nosek et al. 2022); some of them are illustrated in the sketch following the lists below:
- The null hypothesis is rejected in the same direction (p < α).
- An estimate is within a confidence or prediction interval.
- The detected effect size is consistent with the original study.
- The findings are similar when assessed subjectively.
There are also continuous measures that can be dichotomized:
- Bayes factors for comparison of original and replication findings.
- Bayesian tests to compare null distribution and posterior distribution of the original study.
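To illustrate how such dichotomous criteria translate into a number or percentage of successful replications, the following Python sketch applies two of the criteria listed above (rejection of the null hypothesis in the same direction, and the original estimate falling within the replication's confidence interval) to a few hypothetical results. All field names, thresholds and values are illustrative assumptions, not data or code from the cited studies.

```python
# Illustrative sketch (not from the handbook): two common dichotomous criteria
# for replication success and the resulting share of successful replications.
# All field names and example values are assumptions.

ALPHA = 0.05  # significance threshold used for the "same direction" criterion

replications = [
    # hypothetical study-level results: original effect, replication effect,
    # replication p-value, 95% CI of the replication estimate
    {"orig_effect": 0.42, "rep_effect": 0.31, "rep_p": 0.01, "rep_ci": (0.10, 0.52)},
    {"orig_effect": 0.25, "rep_effect": -0.05, "rep_p": 0.40, "rep_ci": (-0.18, 0.08)},
    {"orig_effect": -0.30, "rep_effect": -0.22, "rep_p": 0.03, "rep_ci": (-0.40, -0.04)},
]

def significant_same_direction(rep):
    """Null hypothesis rejected in the same direction as the original (p < alpha)."""
    same_direction = rep["orig_effect"] * rep["rep_effect"] > 0
    return rep["rep_p"] < ALPHA and same_direction

def original_within_ci(rep):
    """Original estimate falls within the replication's confidence interval."""
    lower, upper = rep["rep_ci"]
    return lower <= rep["orig_effect"] <= upper

for criterion in (significant_same_direction, original_within_ci):
    successes = sum(criterion(rep) for rep in replications)
    share = successes / len(replications)
    print(f"{criterion.__name__}: {successes} of {len(replications)} ({share:.0%})")
```

Depending on which criterion is chosen, the same set of replication attempts can yield different success rates, which is why the criterion used should always be reported alongside the resulting percentage.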
Datasources
There is no single data source for this metric. Data needs to be extracted from existing publications or newly gathered by employing the methodologies below.
Existing methodologies:
Replication projects/studies
Many studies or projects on the topic of replicability pursue the goal of conducting a multitude of replication attempts following the same process or using the same measures of success, in order to determine the proportion of replicable findings. Some of these studies and projects concentrate on specific fields of research, e.g., “Estimating the Reproducibility of Psychological Science” (2015) and the SCORE Project (including subprojects like the repliCATS Project) for social-behavioural science. The “Many Labs” studies also had their initial focus on psychology but have since spread into other disciplines (Klein et al. 2014; Stroebe 2019). Approaches to investigate replicability have reached a multitude of disciplines, including the humanities.
Scoping review papers
Since multiple re-performed studies are needed to determine the percentage of successful replications, standalone replication attempts provide only limited information. Scoping reviews are one way to synthesize individual replication attempts to gain an overview and to estimate the percentage of successful replications more precisely. However, the individual replication studies synthesized within a review might be inconsistent in their procedures and employ different measures of success, complicating synthesis or comparison.
Number (%) of studies reported to successfully replicate
Replication attempts might not be published (especially in the case of repeated replications of the same study) and might, for instance, only be conducted internally within a research group. To gather information about these replication attempts, reports can be obtained from researchers about the total number of replications they attempted and the number of those studies that they were able to replicate.
Measurement.
In addition to directly measuring the success of replication studies, the level of replication can also be assessed by surveying researchers. They can report retrospectively on their replication attempts and indicate or estimate the level of replication they encountered. These reports might be less systematic, detailed or objective than dedicated replication studies or projects, but they can be acquired with fewer resources.
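As a minimal sketch of how such self-reports could be aggregated, the snippet below pools hypothetical survey responses in which each researcher reports the number of replications attempted and the number that succeeded. The response structure and all values are assumptions for illustration only.

```python
# Hypothetical survey responses: each researcher reports how many replications
# they attempted and how many of those succeeded (illustrative values only).
responses = [
    {"attempted": 5, "succeeded": 3},
    {"attempted": 2, "succeeded": 2},
    {"attempted": 10, "succeeded": 4},
]

total_attempted = sum(r["attempted"] for r in responses)
total_succeeded = sum(r["succeeded"] for r in responses)

# Pooling weights researchers by the number of attempts they report;
# averaging per-respondent rates would instead weight each researcher equally.
pooled_rate = total_succeeded / total_attempted
mean_respondent_rate = sum(r["succeeded"] / r["attempted"] for r in responses) / len(responses)

print(f"Pooled reported replication rate: {pooled_rate:.0%}")
print(f"Mean per-respondent rate: {mean_respondent_rate:.0%}")
```

The choice between pooling and per-respondent averaging matters when the reported numbers of attempts vary widely between researchers, so the aggregation method should be stated when reporting such survey-based rates.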
Datasources
There is no single data source for this metric. Data needs to be extracted from existing publications or newly gathered by employing the methodologies below.
Existing methodologies:
Surveys
Researchers' experiences with replications, and the level of replication they have encountered in their own work, can be investigated through surveys. Such surveys can include questions about previous replication attempts and their success, as well as general estimates of replicability. The Nature survey by Baker (2016) employed this method with similar questions, although it used the term “reproducibility” rather than “replicability”.
Number (%) of studies predicted to successfully replicate
Since replication attempts require substantial resources, they cannot be conducted for all studies. When no replication studies or researcher reports are available to assess levels of replication, the number of studies expected to successfully replicate can also be estimated through expert predictions.
Measurement.
Levels of replication can be measured prospectively through expert predictions of the replicability of studies (mostly captured as the predicted probability of successful replication), without directly attempting a replication. A percentage or number of studies predicted to successfully replicate can be calculated after dichotomizing this probability (e.g., interpreting a probability > .5 as a prediction of success). While these predictions might be less accurate than other measures of replicability, they do not require studies to actually be re-performed.
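A minimal sketch of this dichotomization step, using assumed probability values and the .5 threshold mentioned above (neither the values nor the variable names stem from the cited studies):

```python
# Expert-predicted probabilities of successful replication, one per study
# (illustrative values only).
predicted_probabilities = [0.82, 0.35, 0.61, 0.48, 0.90]

THRESHOLD = 0.5  # a probability above this counts as a prediction of success

predicted_success = [p > THRESHOLD for p in predicted_probabilities]
n_predicted = sum(predicted_success)
share = n_predicted / len(predicted_probabilities)

print(f"Studies predicted to replicate: {n_predicted} of {len(predicted_probabilities)} ({share:.0%})")
```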
Datasources
There is no single data source for this metric. Data needs to be extracted from existing publications or newly gathered by employing the methodologies below. The studies cited below made their data available, which can be used to re-analyse or extend existing analyses. Note, however, that Forsell and colleagues promised to populate a repository with the data but have not done so as of June 2023.
Existing methodologies:
The following methodologies, namely surveys and prediction markets, have so far most often been used in conjunction with each other. They have also been validated by subsequently conducting full replication studies.
Surveys
Experts (mainly researchers) can be asked in surveys to estimate the probability that specific studies will be successfully re-performed, based on information they are given about these studies (e.g., hypothesis, effect size, p-value, or a link to the original paper). Such surveys have been validated with subsequent replication attempts and compared to prediction markets (see next section). While some studies report many prediction errors and low accuracy when predicting whether the null hypothesis will be rejected (p < α) in the same direction (Dreber et al. 2015; Forsell et al. 2019), other findings show better general accuracy of these predictions (Gordon et al. 2021) and better performance than prediction markets in predicting relative effect sizes (Forsell et al. 2019).
Prediction markets
Prediction markets are used for trading bets on a certain outcome, and the final market prices can be taken as an indicator of the probability of that outcome. In the context of replicability, experts (mainly researchers) are given a budget to bet on studies they think will successfully replicate. The final market price then serves as a proxy for the probability of successful replication (reaching a previously defined replication criterion), as estimated by the market as a whole. Prediction markets have been shown to reach accuracies higher than 70% for their predictions and to outperform surveys (Dreber et al. 2015; Forsell et al. 2019; Gordon et al. 2021). However, prediction markets do not yet appear to be established as a standalone measure for levels of replication.
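As an illustration of how final market prices could be turned into predictions and checked against subsequent replication outcomes, the sketch below uses assumed prices (scaled to 0-1) and assumed outcomes; none of the values or names stem from the cited studies.

```python
# Final prediction-market prices (scaled 0-1), interpreted as probabilities of
# successful replication, and the outcomes of the subsequent replication
# attempts. All values are illustrative assumptions.
final_prices = [0.71, 0.22, 0.55, 0.90, 0.38]        # one market per study
replication_succeeded = [True, False, False, True, False]

# Dichotomize: prices above 0.5 count as a prediction of successful replication.
predictions = [price > 0.5 for price in final_prices]

correct = sum(pred == outcome for pred, outcome in zip(predictions, replication_succeeded))
accuracy = correct / len(predictions)

print(f"Prediction-market accuracy: {correct} of {len(predictions)} ({accuracy:.0%})")
```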
References
Reuse
Citation
@online{apartis2023,
author = {Apartis, S. and Catalano, G. and Consiglio, G. and Costas,
R. and Delugas, E. and Dulong de Rosnay, M. and Grypari, I. and
Karasz, I. and Klebel, Thomas and Kormann, E. and Manola, N. and
Papageorgiou, H. and Seminaroti, E. and Stavropoulos, P. and Stoy,
L. and Traag, V.A. and van Leeuwen, T. and Venturini, T. and
Vignetti, S. and Waltman, L. and Willemse, T.},
title = {PathOS - {D2.1} - {D2.2} - {Open} {Science} {Indicator}
{Handbook}},
date = {2023},
url = {https://handbook.pathos-project.eu/indicator_templates/quarto/5_reproducibility/level_of_replication.html},
doi = {10.5281/zenodo.8305626},
langid = {en}
}