Science-industry collaboration

Author
Affiliation

P. Stavropoulos

Athena Research Center

Version Revision date Revision Author
1.4 2024-12-16 Feedback incorporation Petros Stavropoulos
1.3 2024-04-29 Second Draft Petros Stavropoulos
1.2 2024-04-24 Peer review V.A. Traag
1.1 2024-03-19 First draft Petros Stavropoulos, Erica Delugas (reviewer)

Description

This indicator aims to capture collaborative efforts between science (academia) and industry in the context of open science (OS). It focuses on the exchange of knowledge and resources that facilitates the development of new technologies, processes, or products, thereby advancing both scientific understanding and commercial applications. By measuring interactions between these sectors, particularly those leading to tangible outputs such as publications, OS outputs, and patents, this indicator sheds light on how OS principles are being integrated into collaborative efforts and how these collaborations contribute to economic development and innovation.

Metrics

# / % of Joint Projects between Academia and Industry

This metric measures the number or percentage of joint projects conducted between academia and industry. These projects could include research collaborations, funded grants, or other forms of cooperative endeavors. By capturing the extent of these joint activities, this metric offers insights into the intensity and nature of science-industry collaboration.

The inclusion of joint projects as a metric provides a broader view of collaborations beyond publications and patents. It emphasizes the collaborative process itself, which is foundational for the production of both tangible and intangible outputs.

Joint projects are a key indicator of active collaboration between academia and industry. They often lead to significant innovations and knowledge transfer. Tracking these projects can also highlight the alignment between research priorities in academia and the practical needs of industry.

Measurement

Step 1: Data Collection. Identify databases and repositories that catalog joint projects between academia and industry. Possible sources include:

  • CORDIS: Tracks funded projects under EU programs such as Horizon 2020 and Horizon Europe.
  • OpenAIRE: Aggregates information on funded projects, publications, and research data.
  • EU Community Innovation Survey (CIS): Provides data on innovation activities in enterprises, including collaborations, innovation drivers, and outcomes.

Step 2: Filtering Relevant Projects. Use metadata from OpenAIRE or the Research Organization Registry (ROR) to filter projects that involve both academic institutions and industry partners. Specifically, identify:

  • Projects where participants include at least one academic institution and one industry entity.
  • Projects explicitly mentioning collaboration.

Step 3: Quantification.

  • Count the number of such projects.
  • Calculate the percentage relative to the total number of projects funded in a given program or region.
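The quantification in Steps 2 and 3 can be expressed compactly. Below is a minimal sketch in Python, assuming each project record already carries participants labelled with a sector (e.g., via the ROR lookups of Step 2); the record structure is hypothetical.

```python
# Minimal sketch of Steps 2-3: flag and count academia-industry projects.
# Assumes a hypothetical record structure in which each participant has
# already been labelled "academic" or "industry" (e.g., via ROR).

def is_joint_project(project: dict) -> bool:
    """True if the project has at least one academic and one industry participant."""
    sectors = {p["sector"] for p in project["participants"]}
    return "academic" in sectors and "industry" in sectors

def joint_project_share(projects: list[dict]) -> tuple[int, float]:
    """Return the count of joint projects and their percentage of the total."""
    joint = sum(is_joint_project(p) for p in projects)
    return joint, (100 * joint / len(projects)) if projects else 0.0

# Example: one of two projects is a joint project -> (1, 50.0)
projects = [
    {"participants": [{"sector": "academic"}, {"sector": "industry"}]},
    {"participants": [{"sector": "academic"}, {"sector": "academic"}]},
]
print(joint_project_share(projects))
```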

Step 4: Additional Metadata. Extract additional information to enrich the metric, such as:

  • The financial size of the projects.
  • Duration of the projects.
  • Number of repeat collaborations.

Existing datasources

CORDIS

CORDIS provides detailed information on EU-funded projects, including participants, objectives, and funding details. It is a valuable source for identifying joint academia-industry projects. It also includes project metadata such as funding amount, duration, and collaboration types, making it an essential resource for this metric.

Use CORDIS to filter projects by participant metadata, identifying collaborations that involve both academia and industry, and analyze project metadata to gain insights into project size, type, and scope of collaboration.

OpenAIRE Research Graph

The OpenAIRE Research Graph is a comprehensive open access database that aggregates metadata on publications, research data, and project information across various disciplines. It includes details on open access publications and datasets, making it a valuable resource for tracking the output of academic-industry collaborations and their adherence to open science principles.

Use the OpenAIRE Research Graph to query for project records involving co-participation by academia and industry. Use filtering within OpenAIRE to identify collaborative projects and their outputs. This includes analyzing the affiliations of participants and extracting project metadata.

Research Organization Registry (ROR)

The Research Organization Registry (ROR) is a comprehensive, open, and community-driven registry that assigns unique identifiers to research organizations worldwide. It aims to solve the issue of institution name disambiguation by providing persistent identifiers, thus facilitating the accurate linking of research organizations to scholarly outputs and researchers. ROR is instrumental in tracking changes in organization names, mergers, and closures, thereby maintaining a current and accessible record of research entities.

Use ROR metadata to categorize participants in projects identified via CORDIS or OpenAIRE. Query organization identifiers to distinguish between academia and industry and validate participant affiliations.
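As an illustration, the sketch below matches a raw affiliation string against ROR and maps the matched organization's ROR type to a sector. The endpoint and response fields follow the public ROR matching API at the time of writing, and the type-to-sector mapping is an assumption to be reviewed against current ROR documentation.

```python
# Illustrative sketch: classify a raw affiliation string via the ROR
# matching API. Endpoint and response fields reflect the public v1 API;
# verify against current ROR documentation before relying on them.
import requests

ACADEMIC_TYPES = {"Education"}   # assumption: ROR types treated as academia
INDUSTRY_TYPES = {"Company"}     # assumption: ROR types treated as industry

def classify_affiliation(affiliation: str) -> str:
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": affiliation},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if item.get("chosen"):   # ROR's flag for a confident best match
            types = set(item["organization"].get("types", []))
            if types & ACADEMIC_TYPES:
                return "academic"
            if types & INDUSTRY_TYPES:
                return "industry"
    return "other"

# e.g., classify_affiliation("Athena Research Center") -> "academic"
```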

EU Community Innovation Survey (CIS)

The EU Community Innovation Survey (CIS) is the reference survey on innovation in enterprises. It collects data on innovation activities, including collaboration, innovation expenditures, and outcomes. CIS focuses on key aspects such as product and process innovation, barriers to innovation, and innovation cooperation.

Use CIS data to analyze the innovation outcomes of joint projects. Specifically:

  • Identify collaborations involving academia and industry from the “innovation cooperation” data.
  • Assess innovation outputs, such as product and process innovations, linked to joint projects.
  • Examine funding sources, barriers, and drivers of innovation to contextualize collaborative efforts.

# / % of Data and Code Repositories with Science-Industry Collaborations

This metric captures the number or percentage of data and code repositories that demonstrate collaborations between academia and industry. Such repositories are typically hosted on platforms like GitHub, GitLab, Huggingface, or other collaborative software and data-sharing platforms. By analyzing repository metadata and contributor affiliations, this metric provides insights into the extent of collaborative efforts in creating and maintaining shared scientific and technological resources.

Repositories often document affiliations in textual descriptions, metadata, or contributor profiles, enabling the identification of collaborations between academia and industry. This metric also emphasizes the importance of open-source practices and their role in fostering science-industry partnerships.

Tracking repositories with science-industry collaborations highlights the contribution of open-source software and data to scientific and industrial advancements. It underscores the growing role of digital collaboration in innovation and knowledge transfer. Additionally, it provides a lens to assess how academia and industry jointly contribute to the creation of accessible and reusable resources.

Some limitations to this approach are:

  • Not all repositories clearly state affiliations or maintain accurate metadata.
  • Contributors may not always disclose affiliations, leading to underestimation.
  • Collaboration types and intensity may not be easily discernible from repository metadata alone.

Measurement

Step 1: Data Collection. Identify repositories from major platforms (e.g., GitHub, GitLab, Huggingface) that are relevant to science and industry collaboration. Use APIs and publicly available datasets to extract repository data, including:

  • Repository descriptions.
  • Metadata (e.g., organization ownership, project topics).
  • Contributor and maintainer information.

Step 2: Identifying Collaborations. Analyze metadata and textual descriptions for evidence of collaboration. Specific actions include:

  • Affiliation Matching: Identify affiliations of contributors and maintainers from metadata or linked profiles (e.g., LinkedIn, institutional websites).
  • Organization Metadata: Use repository-level metadata to identify whether the repository is linked to academic institutions or industry organizations.
  • Textual Analysis: Analyze repository descriptions and README files for mentions of academic or industrial collaboration.
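Affiliation strings harvested this way are free text. A keyword heuristic offers a cheap first pass at classifying them; the sketch below is deliberately naive, its keyword lists are illustrative assumptions rather than a validated classifier, and ambiguous strings should still be resolved manually or against a registry such as ROR.

```python
# Naive keyword heuristic for free-text affiliations; a first-pass filter
# only. Keyword lists are illustrative assumptions, not a validated model.
import re

ACADEMIC_PAT = re.compile(
    r"\b(universit|institut|college|academy|polytech)\w*", re.I)
INDUSTRY_PAT = re.compile(
    r"\b(inc|ltd|llc|gmbh|corp|labs?|technologies)\b\.?", re.I)

def guess_sector(affiliation: str) -> str:
    """Best-effort sector guess from an affiliation string."""
    if ACADEMIC_PAT.search(affiliation):
        return "academic"
    if INDUSTRY_PAT.search(affiliation):
        return "industry"
    return "unknown"

# e.g., guess_sector("Acme Robotics Ltd") -> "industry"
```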

Step 3: Quantification.

  • Count the number of repositories identified as involving both academic and industry contributors.
  • Calculate the percentage of such repositories relative to the total number analyzed.

Step 4: Qualitative Analysis. Supplement quantitative measures with qualitative analysis, such as:

  • Examining the types of projects (e.g., datasets, software).
  • Assessing the depth of collaboration (e.g., co-maintenance, joint contributions).

Existing datasources

GitHub / GitLab / etc.

GitHub is one of the most widely used platforms for hosting code repositories. It provides detailed metadata, including information about repository ownership, contributors, maintainers, and organizational affiliations. GitHub’s APIs allow users to query repositories and access public data, making it a critical resource for analyzing collaborative efforts.

GitLab is a robust platform for hosting repositories and is often used by organizations for both private and public projects. It provides extensive metadata on repositories and contributors and offers APIs similar to GitHub for data extraction.

Use GitHub or GitLab to:

  • Extract metadata to identify repositories associated with academia and industry.
  • Query contributor profiles to determine affiliations and validate collaborations.
  • Analyze repository descriptions, README files, and tags to identify explicit mentions of partnerships.
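As a concrete example, the sketch below collects contributor affiliations for one repository through the GitHub REST API, using the documented public endpoints /repos/{owner}/{repo}/contributors and /users/{login}. The company field is free text set by users themselves (and often empty), so the output still needs the affiliation matching described above; the repository named in the usage comment is a placeholder.

```python
# Sketch: gather contributor "company" strings for a repository via the
# GitHub REST API. Unauthenticated requests are heavily rate-limited, so
# pass a personal access token for real runs.
import requests

API = "https://api.github.com"

def contributor_companies(owner: str, repo: str, token: str | None = None) -> dict:
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    contributors = requests.get(
        f"{API}/repos/{owner}/{repo}/contributors",
        headers=headers, timeout=30,
    ).json()
    companies = {}
    for c in contributors[:30]:  # limit to top contributors for the sketch
        user = requests.get(
            f"{API}/users/{c['login']}", headers=headers, timeout=30,
        ).json()
        companies[c["login"]] = user.get("company")  # free text, often None
    return companies

# e.g., contributor_companies("huggingface", "transformers")  # placeholder repo
```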

Huggingface

Huggingface specializes in hosting repositories for machine learning models, datasets, and related tools. It is popular among both academic researchers and industry practitioners, making it a unique platform for studying science-industry collaborations in cutting-edge technologies.

Use Huggingface to:

  • Identify collaborations through metadata and contributor profiles.
  • Analyze repositories for co-maintenance by academic and industrial entities.
  • Use metadata fields and descriptions to validate affiliations and collaborations.
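A sketch using the huggingface_hub client is shown below: it scans a model card's text for collaboration language, one way to operationalize the textual analysis above. The client calls (ModelCard.load, HfApi.list_models) belong to the library's public API, but the keyword list and the organization name in the usage comment are assumptions.

```python
# Sketch: crude textual check of Hugging Face model cards for collaboration
# language. Keywords are illustrative assumptions; matches need manual review.
from huggingface_hub import HfApi, ModelCard

api = HfApi()

COLLAB_KEYWORDS = ("in collaboration with", "joint work", "partnership")

def card_mentions_collaboration(repo_id: str) -> bool:
    """True if the model card body contains collaboration language."""
    card = ModelCard.load(repo_id)       # fetches the repository's README.md
    text = (card.text or "").lower()     # card body without the YAML header
    return any(kw in text for kw in COLLAB_KEYWORDS)

# e.g., scan an organisation's models (the org name is a placeholder):
# for m in api.list_models(author="some-university-lab", limit=50):
#     print(m.id, card_mentions_collaboration(m.id))
```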

# / % of Patents Filed by Industry in Collaboration with Academia that Cite Open Science (OS) Resources

This metric assesses the extent to which industrial patents resulting from academic collaborations acknowledge or build upon OS resources, such as data, publications, or methodologies. A high number or percentage of such patents indicates a strong science-industry linkage and a productive exchange of OS resources.

This metric is a good operationalization of the indicator because it provides a direct measure of the output of collaborative efforts. However, it may not capture the quality or impact of the collaboration. It differs from other metrics by focusing on legal intellectual property outcomes rather than purely academic outputs.

Measurement

Utilize the PATSTAT dataset (https://www.epo.org/en/searching-for-patents/business/patstat) to identify patents filed by industry partners in collaboration with academic institutions. Examine the citations within these patents for references to papers or other resources that are recognized as open science inputs. Specifically, determine if these patents cite papers that contribute to new open science resources or artifacts.

Methodology:

Step 1: Data Collection. Access the PATSTAT database for comprehensive patent data, focusing on patents that result from industry-academia collaborations, based on inventor affiliations and patent assignments.

Step 2: Citation Analysis. Examine the non-patent literature (NPL) citations within these patents for references to open science inputs such as papers, datasets, or software, identifying those that directly relate to open science principles.

Step 3: Identification of Open Science Inputs. Establish criteria for what qualifies as an open science input and validate these against recognized open access repositories and directories.

Step 4: Quantification. Calculate the number and percentage of identified patents that cite Open Science (OS) inputs, providing a measure of the extent of science-industry collaboration. To further assess the quality and impact of these patents, include the following dimensions, which can be acquired using the PATSTAT database:

  • Patent Citations: Analyze forward citations from other patents to measure the influence of the patents within the scientific and industrial communities. A higher number of citations suggests greater adoption and impact of the innovation.
  • Grant Status: Determine whether the patents were granted, as granted patents often reflect higher-quality innovations that meet stricter evaluation criteria. PATSTAT includes information on the legal status of patents, which can be used to identify granted patents.
  • Non-Patent Literature (NPL) Citations: Count the number of references to academic articles, datasets, or other Open Science resources within the patent. These citations, available in the NPL fields in PATSTAT, indicate the level of integration of scholarly outputs in the patent.

By leveraging PATSTAT data, these metrics can provide a more comprehensive and multidimensional measure of the extent and quality of science-industry collaborations utilizing Open Science inputs.

Step 5: Analysis and Reporting. Analyze the data to understand the nature and extent of the collaborations, including any limitations or challenges encountered in data collection and analysis, such as incomplete citation information or difficulties in distinguishing open science inputs.

Challenges may include accurately identifying collaborations and open science inputs, dealing with incomplete citation records, and accessing the necessary databases. This approach aims to provide a structured methodology for assessing the impact of open science on innovation and collaboration between academia and industry.

Other data sources are Orbis IP, The Lens, and EUIPO. Please refer to the “Innovation output” indicator for further details on The Lens and Orbis IP. The choice among these resources depends on the information to be processed. For instance, Orbis IP includes information on patent authors, which is not available in the other data sources, and also makes it possible to link companies to balance-sheet data, which can be useful for a comprehensive analysis of companies' economic growth.

Existing datasources

PATSTAT

PATSTAT is a global patent statistical database maintained by the European Patent Office (EPO) that offers a detailed set of patent data, including bibliographic data, citations, family links, and legal status information for patents across multiple jurisdictions. It is designed to facilitate statistical analysis on patents and their citations to understand trends in innovation.

To calculate the number / percentage of patents filed by industry in collaboration with academia that cite Open Science inputs, use the PATSTAT dataset to:

  1. Identify patents with co-inventors from academia and industry by analyzing the affiliations of inventors.
  2. Examine the citations within these patents for references to open science inputs, such as open access publications or datasets.
  3. Count the patents that cite these open science inputs and calculate this as a percentage of the total patents analyzed.
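PATSTAT is distributed as relational tables and queried with SQL (through PATSTAT Online or a local copy). The sketch below covers step 1 under stated assumptions: the table and field names (TLS201_APPLN, TLS207_PERS_APPLN, TLS206_PERSON) follow recent PATSTAT Global editions, and the psn_sector labels derive from the EEE-PPAT classification, so both should be verified against the edition in use.

```python
# Step 1 as SQL, embedded in Python for use with any DB-API connection to a
# PATSTAT instance. Table/field names and the 'UNIVERSITY' / 'COMPANY'
# psn_sector labels are assumptions based on recent PATSTAT Global editions.
JOINT_PATENT_APPLICATIONS_SQL = """
SELECT a.appln_id, a.appln_filing_year
FROM   tls201_appln a
JOIN   tls207_pers_appln pa ON pa.appln_id = a.appln_id
JOIN   tls206_person p      ON p.person_id = pa.person_id
WHERE  pa.applt_seq_nr > 0  -- applicants only, not inventor-only rows
GROUP  BY a.appln_id, a.appln_filing_year
HAVING SUM(CASE WHEN p.psn_sector = 'UNIVERSITY' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN p.psn_sector = 'COMPANY'    THEN 1 ELSE 0 END) > 0
"""
# For step 2, the applications returned above can be joined to their NPL
# citations via TLS211_PAT_PUBLN -> TLS212_CITATION -> TLS214_NPL_PUBLN.
```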

OpenAIRE Research Graph

The OpenAIRE Research Graph is a comprehensive open access database that aggregates metadata on publications, research data, and project information across various disciplines. It includes details on open access publications and datasets, making it a valuable resource for tracking the output of academic-industry collaborations and their adherence to open science principles.

To complement PATSTAT data, use the OpenAIRE Research Graph to identify which of the publications cited by patents are open access. This involves:

  1. Extracting publication references from identified patents in PATSTAT.
  2. Querying the OpenAIRE Research Graph to determine which cited publications are open access.
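The lookup in step 2 can be sketched as follows; the endpoint and field names are assumptions based on the OpenAIRE Graph API (v1) documentation at the time of writing and should be verified before use.

```python
# Sketch: check whether a cited DOI resolves to an open access record in
# the OpenAIRE Graph. Endpoint and response fields are assumptions based on
# the Graph API (v1) docs; verify them before relying on this.
import requests

GRAPH_API = "https://api.openaire.eu/graph/v1/researchProducts"

def is_open_access(doi: str) -> bool:
    resp = requests.get(GRAPH_API, params={"pid": doi}, timeout=30)
    resp.raise_for_status()
    for result in resp.json().get("results", []):
        label = (result.get("bestAccessRight") or {}).get("label", "")
        if label.upper() == "OPEN":
            return True
    return False

# e.g., share of OA references among NPL-cited DOIs extracted from PATSTAT:
# oa_share = sum(map(is_open_access, cited_dois)) / len(cited_dois)
```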

Research Organization Registry (ROR)

The Research Organization Registry (ROR) is a comprehensive, open, and community-driven registry that assigns unique identifiers to research organizations worldwide. It aims to solve the issue of institution name disambiguation by providing persistent identifiers, thus facilitating the accurate linking of research organizations to scholarly outputs and researchers. ROR is instrumental in tracking changes in organization names, mergers, and closures, thereby maintaining a current and accessible record of research entities.

Utilizing the ROR API to distinguish between academic institutions and industry players involves the following steps:

  1. Extract affiliation data from patents in PATSTAT and publications in the OpenAIRE Research Graph, focusing on the names or identifiers of the organizations involved.
  2. Use the ROR API to query each collected affiliation. The API supports searches by organization name or external identifiers, offering advanced query capabilities for more detailed searches.
  3. For each query response, examine the detailed metadata provided by ROR, which includes the organization’s type, related organizations, and activity fields. This metadata is crucial for categorizing organizations as either academic or industry.
  4. Based on the ROR metadata, categorize each organization involved in the patent or publication as either an academic institution or an industry entity.

There may be limitations in the coverage of certain types of organizations and the evolving nature of the ROR dataset as new organizations are added or existing records are updated.

Existing methodologies

SciNoBo Research Artifact Analysis (RAA) Tool

This is an automated tool (Stavropoulos et al. 2023), leveraging Deep Learning and Natural Language Processing techniques to identify research artifacts (datasets, software) mentioned in the scientific text and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the dataset has been reused or created by the authors of the scientific text.

To measure the proposed metric, the tool can be used to identify the reused and created OS inputs in the patent texts or in the OA publication texts that the patents cite.

One limitation of this methodology is that it may not capture all instances of research artifacts if they are not explicitly mentioned in the scientific text. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a research artifact has been reused or created and may require manual validation.

# / % of Publications Produced by Academia in Collaboration with Industry that Cite Open Science Resources

This metric evaluates how often publications resulting from academia-industry collaborations incorporate or reference open science artifacts. It signifies the degree to which collaborative research between these sectors utilizes open science as a foundation. It highlights a different aspect of collaboration by focusing on scholarly publications. Similar to the first metric, it is a measure of output but from an academic perspective.

Measurement

Access OA publications from comprehensive databases (e.g., the OpenAIRE Research Graph), scanning for those with co-authors from both academia and industry. Analyze these publications for citations or mentions of open science inputs, indicating the reuse or creation of open science artifacts.

Methodology:

Step 1: Data Collection. Gather data from comprehensive databases (e.g., OpenAIRE Research Graph), focusing on publications with co-authorship between academia and industry, determined through author affiliations.

Step 2: Identifying Open Science Inputs. Aim to identify publications that cite or are based on open science inputs, such as datasets or open-source software. This involves distinguishing these inputs from other types of references.

Step 3: Citation Analysis. Examine the citations in these publications to find references to known open science resources. Apply text mining and natural language processing (NLP) techniques to automate this process where feasible.

Step 4: Artifact Analysis. Perform an in-depth analysis of the publication texts themselves to find mentions of open science inputs within the body of the articles. This involves using NLP techniques to detect and extract mentions of datasets, software, and other artifacts that indicate direct use or contribution to open science, beyond mere citations.
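As a deliberately simple stand-in for these NLP techniques, the sketch below flags sentences that look like artifact mentions using surface patterns. The patterns are illustrative assumptions only; purpose-built tools such as the SciNoBo RAA tool (see below) are far more capable.

```python
# Rule-based stand-in for NLP-based artifact mention detection: flag
# sentences containing artifact-like noun phrases or availability cues.
# Patterns are illustrative assumptions, not a validated extraction model.
import re

MENTION_PAT = re.compile(
    r"(?:dataset|corpus|benchmark|toolkit|software|package|library)\s+"
    r"[A-Z][\w-]+"
    r"|[A-Z][\w-]+\s+(?:dataset|corpus|benchmark|toolkit)")

AVAILABILITY_PAT = re.compile(
    r"available at|released (?:at|on)|github\.com|zenodo\.org|huggingface\.co",
    re.I)

def find_artifact_mentions(text: str) -> list[str]:
    """Return sentences that plausibly mention a research artifact."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if MENTION_PAT.search(s) or AVAILABILITY_PAT.search(s)]

# e.g., find_artifact_mentions("We release the FooBar corpus on zenodo.org.")
```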

Step 5: Quantification. Calculate the number and percentage of publications citing open science inputs out of the total set of identified academia-industry collaborative publications. Note potential limitations due to database coverage and indexing quality.

Step 6: Reporting and Analysis. Analyze the data to extract insights on the extent and nature of open science in academia-industry collaborations. Document any limitations encountered, such as incomplete citation records or inaccuracies in affiliation data.

Existing datasources

OpenAIRE Research Graph

The OpenAIRE Research Graph is a comprehensive open access database that aggregates metadata on publications, research data, and project information across various disciplines. It includes details on open access publications and datasets, making it a valuable resource for tracking the output of academic-industry collaborations and their adherence to open science principles.

To calculate the number / percentage of publications produced by academia in collaboration with industry that cite open science inputs:

  1. Filter publications that are OA.
  2. Filter publications based on author affiliations that indicate academia-industry collaborations, by utilizing ROR.
  3. For each publication, examine the references field to identify citations of datasets or software.
  4. Count and calculate the percentage of these publications out of the total number of academia-industry collaborative publications identified in the dataset.

Research Organization Registry (ROR)

The Research Organization Registry (ROR) is a comprehensive, open, and community-driven registry that assigns unique identifiers to research organizations worldwide. It aims to solve the issue of institution name disambiguation by providing persistent identifiers, thus facilitating the accurate linking of research organizations to scholarly outputs and researchers. ROR is instrumental in tracking changes in organization names, mergers, and closures, thereby maintaining a current and accessible record of research entities.

Utilizing the ROR API to distinguish between academic institutions and industry players involves the following steps:

  1. Extract affiliation data from patents in PATSTAT and publications in the OpenAIRE Research Graph, focusing on the names or identifiers of the organizations involved.
  2. Use the ROR API to query each collected affiliation. The API supports searches by organization name or external identifiers, offering advanced query capabilities for more detailed searches.
  3. For each query response, examine the detailed metadata provided by ROR, which includes the organization’s type, related organizations, and activity fields. This metadata is crucial for categorizing organizations as either academic or industry.
  4. Based on the ROR metadata, categorize each organization involved in the patent or publication as either an academic institution or an industry entity.

There may be limitations in the coverage of certain types of organizations, potential inaccuracies in metadata, and the evolving nature of the ROR dataset as new organizations are added or existing records are updated.

Existing methodologies

SciNoBo Research Artifact Analysis (RAA) Tool

This is an automated tool (Stavropoulos et al. 2023), leveraging Deep Learning and Natural Language Processing techniques to identify research artifacts (datasets, software) mentioned in the scientific text and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the dataset has been reused or created by the authors of the scientific text.

To measure the proposed metric, the tool can be used to identify the reused and created OS inputs in the OA publication texts.

One limitation of this methodology is that it may not capture all instances of research artifacts if they are not explicitly mentioned in the scientific text. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a research artifact has been reused or created, and may require manual validation.

Existing datasources (applicable to all metrics)

ORBIS

ORBIS is a global database that contains detailed information on companies, including financial data, ownership structures, and industry classifications. It is particularly valuable for tracking established firms and linking companies to their activities and affiliations. ORBIS provides insights into company performance, size, and economic connections, making it ideal for robust analysis of industry players.

ORBIS can be utilized to validate and categorize company affiliations in academic-industry collaborations. It allows users to verify company profiles and gather detailed information such as industry sector, revenue, and geographic presence. This data is essential for distinguishing industry participants from academic institutions and for analyzing the economic dimensions of collaborations. ORBIS is particularly useful when deeper financial or structural insights into the companies involved are required.

Crunchbase

Crunchbase is a comprehensive platform that aggregates data on companies, founders, and their funding activities. It includes detailed metadata on company profiles, such as industry, location, funding rounds, and associated personnel. Crunchbase provides extensive information about startups, established firms, and their interconnections, making it a valuable resource for identifying industry affiliations.

Crunchbase can be used to identify the names and metadata of companies involved in academic-industry collaborations. By cross-referencing organizations from research outputs or project data, researchers can determine whether affiliations are from industry. Crunchbase’s metadata can also help uncover patterns such as funding sources, company size, and sectoral focus, which are crucial for understanding the scope and impact of collaborations.

Notes

The proposed metrics offer a solid foundation for measuring collaboration between science and industry; however, some aspects require further exploration:

  1. Broader Identification of Academic Patents: Current reliance on co-assigned patents may underestimate academic contributions. Identifying academic inventors using methodologies like those of Ljungberg and McKelvey (2012) and Crescenzi, Filippetti, and Iammarino (2017) could provide a more comprehensive view.
  2. Patent Novelty: Future work could leverage NLP and category analysis to assess the novelty of patents, complementing existing quality measures.
  3. Personnel Mobility: Data on personnel transitions between academia and industry (e.g., LinkedIn) could reveal valuable insights into collaboration dynamics, though privacy and data access pose challenges.
  4. Collaboration Types: Expanding analysis to include collaboration forms like licensing, consulting, and outsourcing would enrich understanding of interaction patterns.

Addressing these areas will deepen insights into the nature and impact of science-industry collaboration.

References

Crescenzi, Riccardo, Andrea Filippetti, and Simona Iammarino. 2017. “Academic Inventors: Collaboration and Proximity with Industry.” Journal of Technology Transfer 42: 730–62. https://doi.org/10.1007/s10961-016-9550-z.
Ljungberg, Daniel, and Maureen McKelvey. 2012. “What Characterizes Firms’ Academic Patents? Academic Involvement in Industrial Inventions in Sweden.” Industry and Innovation 19: 585–606. https://doi.org/10.1080/13662716.2012.726808.
Stavropoulos, Petros, Ioannis Lyris, Natalia Manola, Ioanna Grypari, and Harris Papageorgiou. 2023. “Empowering Knowledge Discovery from Scientific Literature: A Novel Approach to Research Artifact Analysis.” In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), 37–53. https://aclanthology.org/2023.nlposs-1.5/.

Reuse

Open Science Impact Indicator Handbook © 2024 by PathOS is licensed under CC BY 4.0.

Citation

BibTeX citation:
@online{apartis2024,
  author = {Apartis, S. and Catalano, G. and Consiglio, G. and Costas,
    R. and Delugas, E. and Dulong de Rosnay, M. and Grypari, I. and
    Karasz, I. and Klebel, Thomas and Kormann, E. and Manola, N. and
    Papageorgiou, H. and Seminaroti, E. and Stavropoulos, P. and Stoy,
    L. and Traag, V.A. and van Leeuwen, T. and Venturini, T. and
    Vignetti, S. and Waltman, L. and Willemse, T.},
  title = {Open {Science} {Impact} {Indicator} {Handbook}},
  date = {2024},
  url = {https://handbook.pathos-project.eu/sections/4_economic_impact/science_industry_collaboration.html},
  doi = {10.5281/zenodo.14538442},
  langid = {en}
}
For attribution, please cite this work as:
Apartis, S., G. Catalano, G. Consiglio, R. Costas, E. Delugas, M. Dulong de Rosnay, I. Grypari, et al. 2024. “Open Science Impact Indicator Handbook.” Zenodo. 2024. https://doi.org/10.5281/zenodo.14538442.