Use of code in research

Author
Affiliation

V.A. Traag

Leiden University

History

Version Revision date Revision Author
1.0 2023-11-24 First draft V.A. Traag

Description

Many, if not most, scientific analyses involve the use of code or software in one way or another. Code and software can be used for data handling, statistical estimation, visualisation, or various other tasks. Both open-source and closed-source software may be used for research. For instance, MATLAB and Mathematica are two commercial software packages that may be used in research, whereas Octave and SageMath are open-source alternatives. Here, we try to provide metrics that can serve as indicators of the use of code in research, where “code” refers to any type of software (e.g. a computer library, tool, or package) or any set of computer instructions (e.g. an R or Python script) used in the research cycle.

One challenge is that we are typically interested in the use of “research software”, not in all software per se. Defining what this encompasses is not straightforward. Gruenpeter et al. (2021) define it as code “that [was] created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software” (Gruenpeter et al., 2021, p. 16). As this clarifies, research might also involve the creation of new software that is released for other researchers to work with. However, this is not considered in this indicator, but in the indicator on open code. Almost any code depends on other code to work properly. Some of these dependencies might constitute research software themselves, but this is not necessarily the case. Instead of trying to classify software as “research software” or not, we take a more agnostic approach in the description of this indicator, and simply try to describe approaches to uncover the use of some code in research, regardless of whether it constitutes “research software” or not.

This indicator can be useful to provide a more comprehensive view of the impact of the contributions by researchers. Some researchers might be more involved in publishing, whereas others might be more involved in developing and maintaining research software (and possibly a myriad other activities).

Metrics

Most research software is not properly indexed. There are initiatives to have research software properly indexed and identified, such as the Research Software Directory, but these are far from comprehensive at the moment. Many repositories support uploading research software. For instance, Zenodo currently holds about 116,000 records of research software. However, there are also reports of repositories lacking support for including research software (Carlin et al., 2023).

Number of times code is cited/mentioned in scientific publications

If software is cited/mentioned in scientific publications, this provides a direct indication of the use of that software in research. The number of times code is cited or mentioned in publications is therefore a reasonable metric for the use of code in research.

The biggest limitation is that not all researchers report all research software they used. Some researchers might not report the software they used at all. Others might report some software but forget to mention packages that were used during some part of the research cycle. Researchers who actively and properly cite the software they use seem to be a minority, and citing software is a relatively rare event. In addition, software might not be mentioned in the main text of the publication, but only in appendices, (online) supplementary material or replication material. Moreover, some research software might never be mentioned in publications, even though it is a critical dependency of other research software that is mentioned in publications. In this metric, we do not consider the dependency structure of research software, but this is relevant, and is considered in a separate metric.

In addition, software might not be cited explicitly; instead, the paper associated with the software might be cited. The association between papers and software can be retrieved in various ways. Sometimes, software repositories are mentioned in papers, while vice versa, the software repository may include citation information. This may take various forms, such as a CITATION.cff file in a GitHub repository, or a CITATION file in an R package. The association between papers and code is also being tracked by https://paperswithcode.com/. However, it is difficult to distinguish between citations to a publication for the software it introduced and citations for other advances made in the paper. Nonetheless, it might be relevant to combine citation statistics for the paper with explicit citations or mentions of the research software.
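To illustrate how such citation information can be picked up automatically, the following Python sketch retrieves and parses a CITATION.cff file from a GitHub repository. The repository, branch and file location used here are assumptions for the sake of example, and the snippet relies on the PyYAML package to parse the file.

import requests
import yaml  # provided by the PyYAML package

owner = 'citation-file-format'
repo = 'citation-file-format'
# Assumed location: many projects keep CITATION.cff in the root of their
# default branch, but the branch name and path may differ per repository.
url = f'https://raw.githubusercontent.com/{owner}/{repo}/main/CITATION.cff'
response = requests.get(url)
response.raise_for_status()
citation = yaml.safe_load(response.text)
# Typical CITATION.cff keys include 'title', 'doi' and 'preferred-citation'.
print(citation.get('title'), citation.get('doi'))

If present, the preferred-citation key points to the publication that the authors ask to be cited, which can help to link a software repository to a paper.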

Measurement.

Existing data sources
Bibliometric databases

If software is indexed in a repository and has been assigned a DOI, it can in principle be cited. Zenodo, for instance, covers about 160,000 records that contain research software, each supplied with a DOI through DataCite. Hence, if researchers actively cite software in their publications using the DOI of the repository, the citations can be counted.
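As a minimal sketch, assuming the citationCount attribute that the DataCite REST API exposes for a DOI, such counts could be retrieved as follows; the DOI below is merely a placeholder.

import requests

# Placeholder DOI; replace with the DOI of an actual software record.
doi = '10.5281/zenodo.1234567'
url = f'https://api.datacite.org/dois/{doi}'
response = requests.get(url)
response.raise_for_status()
attributes = response.json()['data']['attributes']
# The reported count depends on which citation events are registered with DataCite.
print(attributes.get('citationCount', 0))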

Not all bibliometric databases actively track research software, and therefore not all bibliometric databases can be used for this purpose. Dimensions does index research software (although it is referred to as data), but it is not clear whether citations to the research software are also being tracked. OpenAlex does not index research software, and also does not track citations to research software. In any case, only very few explicit citations to research software should be expected.

Existing methodologies
Extract software mentions from full text

Especially because of the limited explicit references to software, it is important to also explore other possibilities to track the use of code in research. One possibility is to try to extract mentions of a software package or tool from the full text. This is done by Istrate et al. (2022), who trained a machine learning model to extract references to software from full text. They rely on the manual annotation of software mentions in PDFs by Du et al. (2021). The resulting dataset of software mentions is available from https://doi.org/10.5061/dryad.6wwpzgn2c.

Although the dataset of software mentions might provide a useful resource, it is a static dataset, and at the moment there do not yet seem to be initiatives to continuously monitor and scan the full text of publications. Additionally, its coverage is limited to mostly biomedical literature. For that reason, it might be necessary to run the proposed machine learning model oneself. The code is available from https://github.com/chanzuckerberg/software-mention-extraction.
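As an indication of how the static dataset could be used, the sketch below counts in how many publications each software package is mentioned. The file name and the column names software and paper_id are hypothetical; the actual layout of the files on Dryad should be checked before use.

import pandas as pd

# Hypothetical file and column names; the actual dataset is split into
# several large files whose layout should be verified before use.
mentions = pd.read_csv('software_mentions.csv.gz')
# Count in how many distinct publications each software package is mentioned.
counts = mentions.groupby('software')['paper_id'].nunique()
print(counts.sort_values(ascending=False).head(10))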

Repository statistics (# Forks/Clones/Stars/Downloads/Views)

Much (open-source) software is shared in version control repositories on online platforms. Various types of usage statistics that relate, in one way or another, to the general level of interest in the software can be derived from these platforms. These metrics range from how many other users have made copies of a repository (often called forks) to how many people have downloaded a particular release from the platform.

There are some clear limitations to this approach. Firstly, not all research software is necessarily shared through such online platforms, and sometimes it may only be shared as, for example, supplementary material. Secondly, the type of usage is not limited to research. Hence, some of these metrics might be as much an indicator of usage by industry as of usage by researchers. How to distinguish between the various types of use is not evident. Moreover, even if the source code is available through a repository platform, it might be distributed in other forms, for example through packaging indices such as the Python Package Index (PyPI) or the Comprehensive R Archive Network (CRAN).

The most common version control system at the moment is Git, which itself is open-source. There are other version control systems, such as Subversion or Mercurial, but these are less popular. The most common platform on which Git repositories are shared is GitHub, which is not open-source itself. There are also other repository platforms, such as Codeberg (built on Forgejo) and GitLab, which are themselves open-source, but they have not yet managed to reach the popularity of GitHub. We therefore limit ourselves to describing GitHub, although we might extend this in the future.

Measurement.

We propose three concrete metrics based on the GitHub API: the number of forks, the number of stars and the number of downloads of releases. There are additional traffic metrics available from the GitHub API, but these unfortunately require permissions for the specific repository.

Existing methodologies
Forks/Stars (GitHub API)

On GitHub, people can make a personal copy of a repository, which is called a fork. In addition, they can “star” a repository, in order to “save” it in their list of “favourite” repositories. The number of forks of a repository hence provides a metric of how many people have made personal copies of a repository, and the number of stars provides a metric of how many people have marked it as a “favourite”.

The calculation of the number of forks and the number of stars is straightforward. For a particular repo from a particular owner, one can get the counts from https://api.github.com/repos/owner/repo. For instance, for the repository openalex-guts from ourresearch, one can get the information from the URL https://api.github.com/repos/ourresearch/openalex-guts. The number of forks is then listed in the field forks_count and the number of stars is listed in stargazers_count. See the API documentation for more details.
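As a minimal sketch in Python, using the openalex-guts example above, these counts can be retrieved as follows.

import requests

owner = 'ourresearch'
repo = 'openalex-guts'
url = f'https://api.github.com/repos/{owner}/{repo}'
response = requests.get(url)
response.raise_for_status()
info = response.json()
# The fields forks_count and stargazers_count hold the counts of interest.
print(info['forks_count'], info['stargazers_count'])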

Downloads (GitHub API)

Versions of repositories on GitHub can be made available as a release. In addition to the source code for a specific version, a release can also include other files, for example binaries for different platforms. Releases can then easily be downloaded from GitHub.

Computing the number of downloads of releases requires slightly more work, since it depends on the exact releases and files (called assets) that are made available. For a particular repo from a particular owner, one can get the necessary information from https://api.github.com/repos/owner/repo/releases. For instance, for the repository igraph from the organisation igraph, one can get the information from the URL https://api.github.com/repos/igraph/igraph/releases. One needs to consider the field download_count for each asset listed in assets for each release. In Python code, this translates to

import requests

repo = 'igraph'
owner = 'igraph'
# Request up to 100 releases per page; repositories with more releases
# require paginating through subsequent pages.
url = f'https://api.github.com/repos/{owner}/{repo}/releases?per_page=100'
response = requests.get(url)
response.raise_for_status()
releases = response.json()
# Sum the download counts of all assets across the retrieved releases.
total_downloads = sum(asset['download_count']
                      for release in releases
                      for asset in release['assets'])

See the releases API documentation for more details.

Software dependencies

Software dependencies indicate to what extent software is being used by other software. This metric therefore provides some idea of the extent to which the software is used by others.

There are some clear limitations to this approach. First of all, such statistics can only be calculated when software is shared in a clear software ecosystem, such as the Python Package Index (PyPI) or the Comprehensive R Archive Network (CRAN). Not all research software is shared in such packaging indices, and software outside them cannot easily be tracked. Secondly, the dependencies are not limited to research only. Hence, some of these metrics might be as much an indicator of usage by industry as of usage by researchers. How to distinguish between the various types of use is not evident.

Measurement.

Measurement of such dependencies can potentially be provided by tools targeted at the various software ecosystems. Here, we limit ourselves to the Python Package Index (PyPI) and the Comprehensive R Archive Network (CRAN). Dependencies are usually only tracked within these package indices, and hence external packages that depend on a particular software package are not captured.

One common issue in the measurement of dependencies, similar to references in publications, is that it is usually easy in one direction, but difficult in the other direction. That is, it is easy to list all packages that a package A depends on (similar to listing all references of a publication), but more difficult to list all packages that depend on package A (similar to listing all citations to a publication). The latter requires going through all packages, and seeing whether they depend on package A. These are sometimes referred to as reverse dependencies.

One consideration is whether to also track transitive dependencies. That is, a package C might depend on package B, which in turn might depend on package A. In that case, package A has one direct dependent, namely package B, while package C is a transitive dependent of package A.
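To illustrate, given a mapping from each package to its direct dependents (i.e. its reverse dependencies), all transitive dependents can be collected with a simple graph traversal; the package names below are purely hypothetical.

# Toy mapping from each package to its direct dependents (reverse dependencies).
direct_dependents = {
    'A': ['B'],  # package B depends on package A
    'B': ['C'],  # package C depends on package B
    'C': [],
}

def transitive_dependents(package, direct_dependents):
    """Collect all packages that directly or transitively depend on `package`."""
    seen = set()
    stack = list(direct_dependents.get(package, []))
    while stack:
        dependent = stack.pop()
        if dependent not in seen:
            seen.add(dependent)
            stack.extend(direct_dependents.get(dependent, []))
    return seen

print(transitive_dependents('A', direct_dependents))  # {'B', 'C'}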

Existing methodologies
Python Package Index (PyPI)

Most Python packages are shared on the Python Package Index (PyPI), and declare their dependencies. Unfortunately, some of these dependencies can be dynamic in nature, and can depend on specific configurations and options. Nonetheless, it is possible to scan all packages in PyPI in order to list all dependencies. One option that can be considered in this context is https://www.wheelodex.org/, which lists Python packages and their reverse dependencies. It also includes an API for retrieving reverse dependencies; for instance, for the package requests the reverse dependencies can be retrieved via https://www.wheelodex.org/json/projects/requests/rdepends. Transitive dependencies are not directly tracked, but these could in principle be reconstructed manually.
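As a sketch, the reverse dependencies of a package could be retrieved from this API in Python as follows. The assumed structure of the JSON response (a total count and an items list with a name per entry) should be verified against the actual response.

import requests

package = 'requests'
url = f'https://www.wheelodex.org/json/projects/{package}/rdepends'
response = requests.get(url)
response.raise_for_status()
data = response.json()
# Assumed response structure: a 'total' count and an 'items' list of reverse
# dependencies, each with a 'name'; the response may also be paginated.
names = [item.get('name') for item in data.get('items', [])]
print(data.get('total'), names[:10])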

The code for gathering all the (reverse) dependencies from PyPI can be found at https://github.com/wheelodex/wheelodex.

Comprehensive R Archive Network (CRAN)

Most R packages are shared on the Comprehensive R Archive Network (CRAN). CRAN lists reverse dependencies on the webpage of each package, which allows for easy inspection of any individual package. For obtaining more extensive statistics, it is recommended to use automated tools. In this case, the tools::dependsOnPkgs function is particularly useful; for example

tools::dependsOnPkgs("lattice")

gives an explicit list of all packages that depend on the package lattice. In addition, the argument recursive can be set to TRUE to obtain the transitive dependencies.

References

Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A.-L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. Scientific Data, 9(1), Article 1. https://doi.org/10.1038/s41597-022-01710-x

Carlin, D., Rainer, A., & Wilson, D. (2023). Where is all the research software? An analysis of software in UK academic repositories. PeerJ Computer Science, 9, e1546. https://doi.org/10.7717/peerj-cs.1546

Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72(7), 870–884. https://doi.org/10.1002/asi.24454

Gruenpeter, M., Katz, D. S., Lamprecht, A.-L., Honeyman, T., Garijo, D., Struck, A., Niehues, A., Martinez, P. A., Castro, L. J., Rabemanantsoa, T., Chue Hong, N. P., Martinez-Ortiz, C., Sesink, L., Liffers, M., Fouilloux, A. C., Erdmann, C., Peroni, S., Martinez Lavanchy, P., Todorov, I., & Sinha, M. (2021). Defining Research Software: A controversial discussion (Version 1). Zenodo. https://doi.org/10.5281/ZENODO.5504016

Istrate, A.-M., Li, D., Taraborelli, D., Torkar, M., Veytsman, B., & Williams, I. (2022). A large dataset of software mentions in the biomedical literature (arXiv:2209.00693). arXiv. https://doi.org/10.48550/arXiv.2209.00693

PreprintMatch: A tool for preprint to publication detection shows global inequities in scientific publication. (n.d.). PLOS ONE. Retrieved November 20, 2023, from https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0281659