Impact of Open Data in research

Author: P. Stavropoulos
Affiliation: Athena Research Center

Version | Revision date | Revision | Author
1.2 | 2023-08-30 | Revisions | Petros Stavropoulos
1.1 | 2023-07-19 | Revisions | Petros Stavropoulos
1.0 | 2023-05-11 | First draft | Petros Stavropoulos

Description

The impact of Open Data in research aims to capture the effect that making research data openly accessible and reusable has on the reproducibility of research results, as open and accessible data is a cornerstone for verification and validation in science.

The indicator can be used to assess the level of openness and accessibility of research data within a specific scientific community or field, and to identify potential barriers or incentives for the adoption of Open Data practices.

Connections to Academic Indicators

This indicator examines the broader effects of making research data openly accessible, focusing on how transparency and accessibility enhance reproducibility, collaboration, and innovation within the scientific community. This builds upon the Use of Data in Research, which evaluates how data is initially utilized within research activities, and the Reuse of Data in Research, which measures the extent to which datasets are adopted in subsequent studies. Together, these indicators provide a comprehensive view of how data contributes to scientific outputs, reusability and reproducibility.

Metrics

NCI for publications that have introduced/reused Open Data

This metric captures the Normalised Citation Impact (NCI) for publications that have either introduced or reused Open Data. By assessing citation impact, this indicator reflects the visibility and influence of research publications that contribute to or benefit from Open Data practices. Citation-based metrics, including the NCI, are extensively discussed under the academic indicator Citation Impact. For general details on the methodology, limitations, and measurement of NCI, refer to the academic indicator and its corresponding metrics.
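As a rough, illustrative sketch of how such normalisation works (the authoritative definition and measurement are given under the academic indicator Citation Impact), the snippet below computes a simple field- and year-normalised citation score for a set of publication records; the record schema, field labels, and example DOIs are hypothetical.

```python
# Illustrative sketch only, assuming a simple field-year normalisation baseline:
# each publication's citation count is divided by the mean citation count of
# publications from the same field and publication year.
from collections import defaultdict
from statistics import mean

def normalised_citation_impact(publications):
    """publications: list of dicts with 'citations', 'field', and 'year' keys (assumed schema)."""
    groups = defaultdict(list)
    for pub in publications:
        groups[(pub["field"], pub["year"])].append(pub["citations"])

    # Baseline = mean citations of the field-year group.
    baselines = {key: mean(counts) for key, counts in groups.items()}

    scored = []
    for pub in publications:
        baseline = baselines[(pub["field"], pub["year"])]
        nci = pub["citations"] / baseline if baseline > 0 else 0.0
        scored.append({**pub, "nci": nci})
    return scored

# Hypothetical records: an NCI above 1 means the publication is cited above its field-year average.
pubs = [
    {"doi": "10.5281/zenodo.0000001", "field": "NLP", "year": 2022, "citations": 12},
    {"doi": "10.5281/zenodo.0000002", "field": "NLP", "year": 2022, "citations": 4},
]
print(normalised_citation_impact(pubs))
```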

In this metric, we focus specifically on publications that have directly contributed to reproducibility by either introducing new Open Data or reusing existing Open Data. The reuse of Open Data can be identified using methodologies and metrics outlined in the academic indicator Use of Data in Research, which provides tools and techniques for tracking data usage in research publications. Additionally, this indicator highlights publications that explicitly document and share new Open Data repositories.

To measure the NCI for publications that have introduced Open Data, we identify relevant publications through metadata analysis of datasets, such as unique identifiers or DOIs associated with repositories like Zenodo, DataCite, or OpenAIRE. This process can be supported by automated tools, including the SciNoBo toolkit, which uses Deep Learning and Natural Language Processing (NLP) to extract metadata such as the dataset name, version, license, and URLs. These tools enable precise identification of publications introducing Open Data, making it possible to calculate their citation impact.
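A minimal sketch of this identification step is shown below: it queries the public DataCite REST API for dataset records and follows their related identifiers to the publications that introduced or used them. The endpoint and field names follow the DataCite REST API documentation, but the exact filters and relation types used here are assumptions to be verified for a production workflow.

```python
# Hedged sketch: find dataset DOIs in DataCite and collect the publication DOIs
# linked to them via related identifiers.
import requests

def publications_linked_to_datasets(query, page_size=100):
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={"query": query, "resource-type-id": "dataset", "page[size]": page_size},
        timeout=30,
    )
    resp.raise_for_status()

    linked_publications = set()
    for record in resp.json().get("data", []):
        for rel in record["attributes"].get("relatedIdentifiers") or []:
            # Relations such as "IsSupplementTo" or "IsCitedBy" typically point to
            # the article that introduced or reused the dataset.
            if rel.get("relatedIdentifierType") == "DOI" and rel.get("relationType") in (
                "IsSupplementTo", "IsReferencedBy", "IsCitedBy"
            ):
                linked_publications.add(rel["relatedIdentifier"].lower())
    return linked_publications

# The resulting publication DOIs can then be matched against a citation database to compute NCI.
print(publications_linked_to_datasets("climate reanalysis"))
```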

The NCI for publications that have introduced or reused Open Data is particularly relevant for reproducibility because it highlights how data sharing and reuse practices enable verification and extension of scientific findings. Highly cited publications introducing Open Data signal that the data provided novel, generalisable insights into scientific questions, thereby enabling other researchers to replicate and build upon the results. Similarly, publications with a high NCI that reuse Open Data indicate that openly shared datasets are not only accessible but integral to advancing research transparency and reproducibility.

By focusing on the NCI, we can compare publications across disciplines and timeframes, accounting for disparities in citation practices. This ensures that the contribution of Open Data to reproducibility is evaluated equitably, identifying impactful practices and outputs. Furthermore, normalised metrics allow us to track trends in Open Data adoption, evaluate the effectiveness of Open Data policies, and identify areas where further incentives for Open Data practices might be beneficial. Such analysis provides critical insights into the evolving role of Open Data in enhancing scientific reliability and collaboration across research communities.

Dataset downloads/usage counts/stars from repositories

This metric captures the level of interest and impact of Open Datasets by measuring repository activity such as downloads, usage counts, and stars. Metrics derived from repository platforms like Zenodo, DataCite, or GitHub can provide insight into how often datasets are accessed, favorited, or reused by other users. These indicators are extensively discussed under the academic indicator Use of Data in Research, particularly in the metric “Number (Avg.) of views/clicks/downloads from repository.” For detailed methodologies and measurement approaches, refer to the academic indicator.

In the context of reproducibility, this indicator highlights how repository usage statistics reflect the accessibility and usability of Open Datasets. High download counts, frequent views, or significant repository engagement often suggest that datasets are well-documented, standardized, and integral to the reproducibility of scientific findings. These metrics serve as proxies for the broader acceptance and utility of the datasets within the research community.

Furthermore, repository metrics provide insights into the integration of Open Data practices within the research ecosystem. Datasets that are widely accessed and reused facilitate cumulative science by enabling researchers to verify existing results, build upon prior work, and avoid redundant data collection efforts. By monitoring repository engagement, we can evaluate the effectiveness of Open Data practices in promoting transparency and reproducibility while identifying gaps in documentation, accessibility, or usability that may hinder broader adoption and reuse of Open Data.
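As a hedged illustration of how such engagement statistics can be collected programmatically, the sketch below reads view/download counts for a Zenodo record and the star count of a GitHub repository. The response fields ("stats", "stargazers_count") follow the public Zenodo and GitHub REST APIs; the record ID and repository name are placeholders.

```python
# Sketch: basic repository engagement metrics for a dataset (Zenodo) and a data repo (GitHub).
import requests

def zenodo_dataset_stats(record_id):
    resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
    resp.raise_for_status()
    stats = resp.json().get("stats", {})
    return {"views": stats.get("views"), "downloads": stats.get("downloads")}

def github_repo_stars(owner, repo):
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=30)
    resp.raise_for_status()
    return resp.json().get("stargazers_count")

# Placeholder identifiers for illustration only.
print(zenodo_dataset_stats(1234567))
print(github_repo_stars("example-org", "example-dataset"))
```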

Downloads / views of published DMPs

This metric measures the number of downloads or views of published Data Management Plans (DMPs) from data repositories, such as DataCite, Zenodo, or institutional repositories. A DMP is a document that outlines how research data will be managed throughout a research project, including details on data collection, storage, sharing, and preservation. The number of downloads or views of published DMPs can indicate the level of interest and engagement of researchers and other stakeholders in Open Data practices, and the importance attached to data management planning in the research process. In this way, it reflects the adoption of good data management practices that indirectly contribute to overall research reproducibility.

A limitation of this metric is that it only captures the number of downloads or views of DMPs, which may not necessarily indicate the actual implementation of the DMP or the quality of the data management practices. Therefore, it is important to use this metric in conjunction with other metrics in this document to obtain a more comprehensive assessment of the impact of Open Data practices on research output.

Measurement

To measure this metric, we can start by identifying published DMPs in data repositories, such as DataCite, Zenodo, or institutional repositories. To identify the relevant DMPs, we can utilize search features and application programming interfaces (APIs) provided by these data repositories, conduct keyword searches related to the specific research or project, and review the metadata associated with each DMP for relevance. Once we have identified the relevant DMPs, we can track the number of downloads or views of these DMPs over a specified period of time.

Potential measurement problems and limitations of this metric include the possibility of multiple downloads or views by the same user, which can inflate the metric. Additionally, the number of downloads or views may not reflect the actual use or implementation of the DMP, as some researchers may download or view DMPs out of curiosity or to gain insight into best practices. Therefore, it is important to interpret the results of this metric in the context of other metrics and qualitative data on the use and effectiveness of DMPs.

Existing data sources and methodologies for this metric include the data repositories and web analytics tools mentioned above. DataCite and Zenodo provide download counts for their published content, including DMPs, while Google Analytics can be used to track views of DMPs on institutional or funder websites. However, there may be gaps in the availability of download or view counts for DMPs published on other platforms or websites. In such cases, it may be necessary to manually track the number of downloads or views through user surveys or by contacting individual users who have downloaded or viewed the DMP.

Datasources

DataCite

DataCite is a global registry for research data that provides persistent identifiers (DOIs) for research datasets. To measure the number of downloads or views of published DMPs in DataCite, we can use the DataCite REST API to search for DMPs by the keyword “Data Management Plan” and filter the results by the download count or view count metadata. The API also allows filtering by date range and repository location, which can provide additional context for the measurement.
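A minimal sketch of this query is shown below, assuming the public DataCite REST API and its aggregated per-DOI usage counters ("viewCount", "downloadCount"); these field names and the phrase query should be verified against the current API documentation before use.

```python
# Hedged sketch: search DataCite for records mentioning "Data Management Plan"
# and read their aggregated view/download counters.
import requests

def dmp_usage_from_datacite(page_size=100):
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={"query": '"Data Management Plan"', "page[size]": page_size},
        timeout=30,
    )
    resp.raise_for_status()

    results = []
    for record in resp.json().get("data", []):
        attrs = record["attributes"]
        results.append({
            "doi": record["id"],
            "title": attrs["titles"][0]["title"] if attrs.get("titles") else None,
            "views": attrs.get("viewCount", 0),
            "downloads": attrs.get("downloadCount", 0),
        })
    # Sort so the most downloaded DMPs come first.
    return sorted(results, key=lambda r: r["downloads"], reverse=True)

print(dmp_usage_from_datacite()[:10])
```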

One potential limitation of using DataCite for this metric is that not all DMPs may be registered with DataCite, and the search results may not capture all relevant DMPs. Additionally, the download or view count metadata may not always accurately reflect the actual use or engagement with the DMP, as these metrics can be affected by factors such as availability, accessibility, and discoverability of the DMP.

Zenodo

Zenodo is a data repository that allows researchers to upload and share research outputs, including DMPs. To calculate the number of downloads or views of published DMPs on Zenodo, we can use the Zenodo REST API to retrieve the relevant metadata for each DMP, such as the number of views and downloads. This can be done by searching for DMPs on Zenodo using their unique identifiers or keywords, and then extracting the relevant metadata for each search result.
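The sketch below illustrates this approach, assuming the public Zenodo search API and its per-record "stats" object; the query phrase and response fields should be checked against the current Zenodo documentation.

```python
# Hedged sketch: search Zenodo for data management plans and collect view/download statistics.
import requests

def dmp_stats_from_zenodo(query='"data management plan"', size=100):
    resp = requests.get(
        "https://zenodo.org/api/records",
        params={"q": query, "size": size},
        timeout=30,
    )
    resp.raise_for_status()

    results = []
    for hit in resp.json().get("hits", {}).get("hits", []):
        stats = hit.get("stats", {})
        results.append({
            "doi": hit.get("doi"),
            "title": hit.get("metadata", {}).get("title"),
            "views": stats.get("views", 0),
            "downloads": stats.get("downloads", 0),
        })
    return results

for record in dmp_stats_from_zenodo()[:10]:
    print(record)
```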

One limitation of using Zenodo to measure this metric is that not all DMPs may be published on this repository. Additionally, the number of views and downloads may not necessarily reflect the actual use or implementation of the DMP, as users may simply be browsing or downloading the document for reference purposes. Finally, the number of downloads and views may be influenced by factors such as the popularity of the topic or the visibility of the DMP on the repository.

Number of datasets reused inside DMPs

This metric measures the number of datasets that are reused in Data Management Plans (DMPs). A DMP is a document that outlines how research data will be managed throughout a research project, including details on data collection, storage, sharing, and preservation. The number of datasets reused in DMPs can indicate the level of engagement of researchers in Open Data practices and the potential impact of data sharing and reuse practices on research output. This metric can also serve as a proxy for reproducibility, as datasets explicitly cited and reused in multiple DMPs are likely to be more robust and have undergone scrutiny, thus facilitating other researchers in verifying or replicating results. Furthermore, the standardization of data storage, management, and processing practices encouraged by DMPs can indirectly promote reproducibility.

A limitation of this metric is that it may not capture the full range of Open Data practices that are being utilized by researchers, such as the sharing of data outside of DMPs or the creation of new datasets for reuse. Additionally, the metric may not capture the quality or impact of the datasets being reused in DMPs. Therefore, it is important to use this metric in conjunction with other metrics in this document to obtain a more comprehensive assessment of the impact of Open Data practices on research output.

Measurement

To measure this metric, we can start by identifying published DMPs in data repositories, such as DataCite, Zenodo, or institutional repositories. We can then analyse the content of published DMPs to identify the datasets that are being reused through automated text mining techniques (e.g., using the SciNoBo toolkit).
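As a greatly simplified illustration of the text-mining step (the handbook's actual pipeline uses the SciNoBo toolkit, described under Existing methodologies), the sketch below scans DMP text for DOIs and for a hypothetical lookup list of dataset names, and flags sentences containing cues that suggest reuse. The cue words and the dataset list are assumptions for illustration only.

```python
# Simplified keyword/regex illustration of dataset-mention detection in DMP text.
import re

DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
REUSE_CUES = ("reuse", "re-use", "existing dataset", "obtained from", "derived from")
KNOWN_DATASETS = ["ERA5", "UK Biobank"]  # hypothetical lookup list

def find_dataset_mentions(dmp_text):
    mentions = []
    for sentence in re.split(r"(?<=[.!?])\s+", dmp_text):
        dois = DOI_PATTERN.findall(sentence)
        names = [name for name in KNOWN_DATASETS if name.lower() in sentence.lower()]
        if dois or names:
            mentions.append({
                "sentence": sentence.strip(),
                "dois": dois,
                "dataset_names": names,
                "likely_reuse": any(cue in sentence.lower() for cue in REUSE_CUES),
            })
    return mentions

example = "We will reuse the ERA5 dataset, https://doi.org/10.24381/cds.adbb2d47, for validation."
print(find_dataset_mentions(example))
```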

However, there are some limitations to this approach. One limitation is that not all DMPs are publicly available, which may limit the scope of the analysis. Additionally, automated techniques may not capture all instances of dataset reuse if they are not explicitly mentioned in the text of the DMP.

Datasources

DataCite

DataCite is a metadata repository that provides persistent identifiers for research datasets. It collects metadata from various sources, including data centres, publishers, and institutional repositories. The metadata includes information on the dataset, such as the title, author, publisher, date of publication, and the identifier of the dataset.

To measure the number of datasets reused inside DMPs using DataCite, we can search for published DMPs in DataCite, extract the metadata for each DMP, and analyse the content of the DMP to identify the datasets that are being reused. This can be done using automated text mining techniques to identify dataset names or identifiers mentioned in the DMP.

However, it is important to note that not all DMPs may be available in DataCite, and some datasets may not have persistent identifiers, which may limit the scope of the analysis. Additionally, automated text mining techniques may not capture all instances of dataset reuse if they are not explicitly mentioned in the text of the DMP.

To obtain the metadata for published DMPs in DataCite, we can use the DataCite REST API to search for DMPs that have been registered with DataCite. The metadata can be obtained in various formats, including JSON and XML.
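For example, a single DMP record can be retrieved as JSON as sketched below; the DOI used here is a placeholder and should be replaced with the DOI of an actual published DMP.

```python
# Minimal sketch: retrieve the DataCite metadata record for one DOI as JSON.
import requests

def datacite_record(doi):
    resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["attributes"]

attrs = datacite_record("10.1234/example-dmp")  # hypothetical DOI
print(attrs.get("titles"), attrs.get("publicationYear"), attrs.get("relatedIdentifiers"))
```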

Zenodo

Zenodo is a general-purpose data repository that allows users to upload any kind of research output, including datasets and data management plans (DMPs). Zenodo assigns a unique Digital Object Identifier (DOI) to each uploaded item, which can be used to track usage and reuse.

To measure the number of datasets reused inside DMPs based on Zenodo, we can search for published DMPs on Zenodo using keywords and filters, such as the “data management plan” keyword and the “DMP” tag. Once we have identified a set of DMPs, we can use automated text mining techniques to identify the datasets that are being reused. This can involve searching for mentions of dataset names or identifiers in the text of the DMPs.

However, it is important to note that not all DMPs on Zenodo may contain information on reused datasets, and some datasets may not be explicitly mentioned in the text of the DMP. Additionally, the automated text mining techniques used to identify reused datasets may not capture all instances of reuse, particularly if the datasets are referred to in a non-standard way or if they are combined with other datasets.

Existing methodologies

SciNoBo toolkit

The SciNoBo toolkit (Gialitsis, Kotitsas, and Papageorgiou 2022; Kotitsas et al. 2023) has a new component, currently undergoing evaluation: an automated tool that employs Deep Learning and Natural Language Processing techniques to identify datasets mentioned in scientific text (in this case, the text of a DMP) and extract metadata associated with them, such as name, version, license, and URLs. The tool can also classify whether the dataset has been (re)used by the authors of the DMP.

To use the tool to measure the proposed metric, we can provide a collection of DMPs as input to the tool to extract all the datasets mentioned in the text, along with their metadata. We can then analyse them to identify which datasets have been reused by the authors of the DMP, as classified by the machine learning algorithms of the tool.

One limitation of this methodology is that it may not capture all instances of dataset reuse if they are not explicitly mentioned in the text of the DMP. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a dataset has been reused, and may require manual validation.

References

Gialitsis, Nikolaos, Sotiris Kotitsas, and Haris Papageorgiou. 2022. In WWW '22: The ACM Web Conference 2022, 800–809. Virtual Event, Lyon, France: ACM. https://doi.org/10.1145/3487553.3524677.
Kotitsas, Sotiris, Dimitris Pappas, Natalia Manola, and Haris Papageorgiou. 2023. “SCINOBO: A Novel System Classifying Scholarly Communication in a Dynamically Constructed Hierarchical Field-of-Science Taxonomy.” Frontiers in Research Metrics and Analytics 8: 1149834. https://www.frontiersin.org/articles/10.3389/frma.2023.1149834/full.

Reuse

Open Science Impact Indicator Handbook © 2024 by PathOS is licensed under CC BY 4.0.

Citation

BibTeX citation:
@online{apartis2024,
  author = {Apartis, S. and Catalano, G. and Consiglio, G. and Costas,
    R. and Delugas, E. and Dulong de Rosnay, M. and Grypari, I. and
    Karasz, I. and Klebel, Thomas and Kormann, E. and Manola, N. and
    Papageorgiou, H. and Seminaroti, E. and Stavropoulos, P. and Stoy,
    L. and Traag, V.A. and van Leeuwen, T. and Venturini, T. and
    Vignetti, S. and Waltman, L. and Willemse, T.},
  title = {Open {Science} {Impact} {Indicator} {Handbook}},
  date = {2024},
  url = {https://handbook.pathos-project.eu/sections/5_reproducibility/impact_of_open_data_in_research.html},
  doi = {10.5281/zenodo.14538442},
  langid = {en}
}
For attribution, please cite this work as:
Apartis, S., G. Catalano, G. Consiglio, R. Costas, E. Delugas, M. Dulong de Rosnay, I. Grypari, et al. 2024. “Open Science Impact Indicator Handbook.” Zenodo. 2024. https://doi.org/10.5281/zenodo.14538442.