Use of data in research

Author: T. Willemse
Affiliation: Leiden University

| Version | Revision date | Revision    | Author      |
|---------|---------------|-------------|-------------|
| 1.0     | 2023-12-14    | First draft | T. Willemse |

Description

In today’s data-driven era, the utilization of data has revolutionized the landscape of research across diverse fields and disciplines. Data serves as the lifeblood of modern research endeavors, empowering scholars to make informed decisions and to unveil patterns that were previously hidden. Whether in the realms of social sciences, business analytics, healthcare, or natural sciences, the role of data in research has transcended mere information-gathering; it has become the cornerstone upon which discoveries, innovations, and evidence-based conclusions are built.

The open research data (ORD) movement within open science has advocated for making data more accessible and transparent in order to accelerate these developments in research (Quarati and Raffaghelli 2022). However, to assess whether or not these “open” measures are effective, one needs insight into the use of data in research. Assessing data use is not straightforward, though. It can, for instance, be ambiguous when data is actually “used” and how this should be quantified. Moreover, definitions of data use vary across the sciences, and consequently so do methods for assessing data use (Pasquetto, Randles, and Borgman 2017).

Sometimes a distinction is made between “reuse” and “use”, where “reuse” refers explicitly to the use of openly released data, whereas “use” refers to the use of data more generally. We do not make such a distinction here.

Nevertheless, this document attempts to summarize what indicators can be used to approximate data use in research.

Metrics

Number (Avg.) of times data is cited/mentioned in publications

Measurement

The number or percentage of times data is cited or mentioned in publications can be used to indicate data use in research. It specifically gives a representation of how often data is used to advance scientific publishing. However, it does not show what the data is being used for in the articles that cite it, and it does not capture what academic activities the data supports outside of scientific publishing.

Existing datasources:

Data repositories

Data repositories may show citation information on the datasets they index. Examples of data repositories that provide this information are Zenodo, Figshare, Dryad, DataCite, Mendeley Data, Dataverse and Harvard Dataverse.

In particular, the Data Citation Corpus (DataCite and Make Data Count 2024) by DataCite and Make Data Count is currently one of the most comprehensive data citation corpora available, even if it is limited in scope. It is based on an approach originally developed by the Chan Zuckerberg Initiative (CZI) to extract software mentions.

Existing methodologies

Citation scores

Based on the data citation information from data repositories, one can compile a score for datasets or repositories based on how many citations they received. If repositories provide DOIs for datasets, as for instance DataCite does, it is also possible to track citations by DOI, for example using OpenAlex.
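As an illustration, the sketch below queries the OpenAlex API for a dataset DOI and reads the citation count OpenAlex has recorded for it. The DOI used here is the Data Citation Corpus data file cited in this document and serves only as a placeholder; whether a given dataset is indexed as a work in OpenAlex is not guaranteed.

```python
# Minimal sketch: look up a dataset DOI in OpenAlex and read its citation count.
# OpenAlex only returns a record if the dataset is indexed as a "work".
import requests

def openalex_citation_count(doi):
    """Return the number of citing works OpenAlex records for a DOI, or None if not indexed."""
    url = f"https://api.openalex.org/works/https://doi.org/{doi}"
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None  # DOI not indexed in OpenAlex, or the request failed
    return response.json().get("cited_by_count")

# Placeholder DOI (the Data Citation Corpus data file referenced in this document)
print(openalex_citation_count("10.5281/zenodo.11216814"))
```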

OpenAIRE’s UsageCounts service aims to monitor and report how often research datasets hosted within OpenAIRE are accessed, downloaded, or used by the scholarly community. The service tracks various metrics related to data use in research, among which are statistics on data views and downloads.

Additionally, [`datastet`](https://github.com/kermitt2/datastet) can be used to find named and implicit research datasets in the academic literature. DataStet extends [dataseer-ml](https://github.com/dataseer/dataseer-ml) to identify implicit and explicit dataset mentions in scientific documents, with DataSeer also contributing back to `datastet`. It automatically characterizes dataset mentions as used or created in the research work, and classifies the identified datasets according to a hierarchy derived from MeSH. It can process various scientific article formats such as PDF, TEI, JATS/NLM, ScholarOne, BMJ, Elsevier staging format, OUP, PNAS, RSC, Sage, Wiley, etc. Docker is recommended to deploy and run the DataStet service; the repository linked above provides instructions for pulling the Docker image and running the service as a container.
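As a rough illustration, the sketch below sends a PDF to a locally running DataStet service and counts the dataset mentions in the response. The port (8060) and the endpoint path are assumptions based on DataStet’s GROBID-style setup, and the structure of the JSON response can vary between versions; both should be verified against the repository README.

```python
# Sketch only: query a locally running DataStet instance for dataset mentions in a PDF.
# The port and endpoint path below are assumptions; check the DataStet README.
import requests

DATASTET_URL = "http://localhost:8060/service/annotateDatasetPDF"  # assumed endpoint

with open("article.pdf", "rb") as pdf:  # any scientific article in PDF form
    response = requests.post(DATASTET_URL, files={"input": pdf}, timeout=120)
response.raise_for_status()

# Inspect the raw JSON first: the key holding the detected mentions may differ
# between DataStet versions.
result = response.json()
mentions = result.get("mentions", [])
print(f"Detected {len(mentions)} dataset mention(s)")
```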

Science resources

Scientific resources like bibliometric databases often provide citation information on the datasets they include. Examples of such sources are Crossref, Google Scholar, PubMed Central, OpenAIRE and DataCite. Note that the indexing and discoverability of datasets within platforms that focus on scientific article retrieval (mainly Google Scholar and PubMed Central) depend on several factors, including how the datasets are described in the publications, how they are tagged or linked within the platforms, and the availability of metadata provided by authors or publishers. If one is specifically looking for datasets or data mentions within the scholarly literature, the specialized data repositories or archives mentioned above under Data repositories will be preferable.

Existing methodologies

Citation score

Based on the data citation information from scientometric databases (e.g. OpenAlex), one can compile a score for datasets or repositories based on how many citations they received from other scientific publications.
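A simple way to turn such citation information into a repository-level score is to sum the citation counts over all dataset DOIs deposited in that repository. The sketch below does this with OpenAlex; the DOIs are placeholders, and DOIs that are not indexed contribute zero to the score.

```python
# Illustrative sketch: aggregate OpenAlex citation counts over a list of dataset DOIs
# (for example, all datasets deposited in one repository) into a single citation score.
import requests

def repository_citation_score(dois):
    """Sum the citations OpenAlex records for each DOI; unindexed DOIs contribute 0."""
    total = 0
    for doi in dois:
        url = f"https://api.openalex.org/works/https://doi.org/{doi}"
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            total += response.json().get("cited_by_count", 0)
    return total

# Placeholder DOIs standing in for a repository's dataset holdings
print(repository_citation_score(["10.5281/zenodo.1234567", "10.6084/m9.figshare.1234567"]))
```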

DataCite Data Citation Corpus Prototype

With the Data Citation Corpus prototype, DataCite aims to aggregate references to data citations that can be used to collect data citation counts in research (source). It will present aggregates of research data citations across articles, preprints and government documents. This endeavour is currently in its prototype phase, but once the full version is released it will provide a hub for data citation information that is not available elsewhere.

Existing methodologies

Citation score

Based on the data citation information from the DataCite Data Citation Corpus, one can compile a score for datasets or repositories based on how many citations they received.
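Because the corpus is distributed as a downloadable data file, such a score can be computed offline. The sketch below tallies citations per cited dataset from a local copy of the corpus; the file name and the field holding the cited dataset identifier are assumptions, and the actual schema should be checked against the corpus documentation.

```python
# Sketch only: count citations per dataset from a local copy of the Data Citation Corpus.
# The file name and the "dataset" field are assumptions; consult the corpus documentation
# for the actual file layout and field names.
import json
from collections import Counter

with open("data_citation_corpus.json", encoding="utf-8") as fh:
    records = json.load(fh)  # assumed: a list of citation records

# Tally how often each cited dataset identifier appears across the corpus.
counts = Counter(r["dataset"] for r in records if r.get("dataset"))

for dataset_id, n_citations in counts.most_common(10):
    print(dataset_id, n_citations)
```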

Number (Avg.) of views/clicks/downloads from repository

The number or percentage of views/clicks/downloads from repositories can be used to indicate data use more broadly. In contrast to data citations, this measure also captures potential use of data outside of academic publishing. However, one has no insight into the scientific objective for which the dataset was retrieved. Moreover, even though this measure focuses on academic data repositories, one cannot be certain that the data was retrieved for academic purposes. Additionally, there is not yet a standardized method across data repositories for keeping track of views/clicks/downloads. This results in different figures being tracked (views, clicks or downloads) and different definitions for these figures, which makes cross-platform comparisons more complex.

Measurement

Existing datasources:

Data repositories

Some data repositories, like Zenodo and Harvard Dataverse, provide download information for the datasets hosted in the repository. OpenAIRE also provides usage statistics for repositories based on other data sources. Information from these sources can be collected to compile a data use score.

Existing methodologies

Data retrieval score

Based on the data retrieval information from the data repositories, one can compile a score for datasets or repositories based on how many times they were downloaded or viewed.
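For Zenodo, such retrieval information is exposed through the public records API, which reports view and download statistics per record. The sketch below reads these statistics for a single record; the record id is a placeholder, and the exact keys under the statistics object (total versus unique counts) are best verified against the Zenodo API documentation.

```python
# Minimal sketch: read view and download counts for a Zenodo record via the public REST API.
import requests

def zenodo_retrieval_stats(record_id):
    """Return the usage statistics Zenodo reports for a record (views, downloads, ...)."""
    url = f"https://zenodo.org/api/records/{record_id}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json().get("stats", {})

stats = zenodo_retrieval_stats(11216814)  # placeholder record id
print(stats.get("views"), stats.get("downloads"))
```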

Standardization protocols

To address the lack of standardization in tracking clicks/views/downloads of research data, the COUNTER Code of Practice for Research Data gives guidance on how to standardize count systems for data use. Following these guidelines makes inquiries into data retrieval information more comparable among researchers.

References

DataCite, and Make Data Count. 2024. “Data Citation Corpus Data File.” DataCite. https://doi.org/10.5281/zenodo.11216814.
Pasquetto, Irene V., Bernadette M. Randles, and Christine L. Borgman. 2017. “On the Reuse of Scientific Data.” Data Science Journal 16 (0). https://doi.org/10.5334/dsj-2017-008.
Quarati, Alfonso, and Juliana E Raffaghelli. 2022. “Do Researchers Use Open Research Data? Exploring the Relationships Between Usage Trends and Metadata Quality Across Scientific Disciplines from the Figshare Case.” Journal of Information Science 48 (4): 423–48. https://doi.org/10.1177/0165551520961048.
