PathOS - D2.1 - D2.2 - Open Science Indicator Handbook

S. Apartis; G. Catalano; G. Consiglio; R. Costas; E. Delugas; M. Dulong de Rosnay; I. Grypari; I. Karasz; Thomas Klebel; E. Kormann; N. Manola; H. Papageorgiou; E. Seminaroti; P. Stavropoulos; L. Stoy; V.A. Traag; T. van Leeuwen; T. Venturini; S. Vignetti; L. Waltman; T. Willemse

doi:10.5281/zenodo.8305626

Author

Affiliation

P. Stavropoulos

Athena Research Center

Uptake by patient groups

History

Version	Revision date	Revision	Author
1.2	2024-04-24	Review	Thomas Klebel
1.1	2024-03-29	Review	Tommaso Venturini, Melanie Dulong de Rosnay
1.0	2024-03-26	First draft	Petros Stavropoulos

Description

A patient group, often referred to as a patient advocacy group or a patient organization, is a non-profit entity designed to support and advocate for individuals suffering from specific diseases, conditions, or health-related issues. These groups aim to improve the quality of life for their members by offering support, education, and resources. They often engage in activities such as raising public awareness, advocating for patient rights, funding research, providing direct support to patients and their families, and influencing health policy and research topics.

This indicator aims to capture the extent to which open science (OS) outputs such as code, data, and OA publications are being utilized by patient advocacy groups. By analyzing the mentions or references to these OS outputs within patient group websites and reports, we can gauge the influence and integration of OS in societal and patient-focused discussions and resources. This can serve as a measure of how well the knowledge generated through OS is disseminated beyond the academic community, impacting patient education, policy and research advocacy, and policy-making.

Metrics

Number / Percentage of References to OS Inputs in Patient Group Websites/Reports

This metric represents the number of references or mentions of OS outputs (OA publications, datasets, software) found within the content of patient group websites and reports as well as the ratio of OS resources over the total number of scientific references or mentions. It provides an operationalization of the indicator by offering a quantifiable measure of OS uptake. The metric highlights the integration of OS practices in the resources produced by patient groups, which can reflect the value and relevance of these outputs to patient advocacy and education.

A limitation of this metric could be its potential variability based on the size and scope of the patient group websites and reports reviewed. Some groups have very rich and updated websites offering a reliable window into the activities of the group, but others are less thorough or quick in their online publishing strategies, which means that their websites can offer a less dependable or complete description of their actions. Some patient groups may also use OS publications for self-training, to enable them to understand their disease, request medical tests, and participate to the advocacy and care work with other patients, without any report or publication as an output citing the OS materials they read.

Measurement.

To measure the proposed metric of the percentage of references to open science (OS) outputs in patient group websites and reports, a structured methodology encompassing both automated and manual techniques could be employed. This approach aims to systematically identify and quantify mentions of OA publications, datasets, and software within the targeted content.

One issue is the accessibility of the websites and reports, as some may have restricted access or be in formats that are difficult to process automatically.

Despite these challenges, leveraging existing web scraping and text analysis technologies offers a viable pathway to operationalize this metric.

Methodology:

Step 1: Select Patient Group Websites and Reports. Identify a comprehensive list of patient advocacy groups relevant to the study and compile their website URLs and any available reports (e.g., PDF, HTML format).

Step 2: Web Scraping & Text Extraction. Utilize web scraping tools such as Beautiful Soup (Python) or rvest (R) to programmatically extract content from the identified websites and reports. For reports in non-HTML formats (e.g., PDFs), employ text extraction tools like PyPDF2 (Python) or pdf_text (R) to convert document content into analysable text.

Step 3: Identifying OS Outputs. Apply text mining and natural language processing (NLP) techniques to search the extracted texts for references to scientific resources in general and to OS outputs (publications, software, datasets) in particular. This could involve regular expressions to detect DOI patterns for publications, URLs for datasets and software repositories, or specific keywords related to OA and OS materials. OpenAIRE Research Graph could be utilized to search whether the publication and dataset DOIs/links found are OA. The tool from “PathOS work in Task2.5” could also be utilized to extract mentions of datasets and software from the text with their metadata.

Step 4: Quantification and Analysis. Calculate the percentage of

documents or web pages that contain references to OS inputs. This involves tallying the occurrences of references and comparing them to the total number of documents or pages analyzed;
references or mentions to OS inputs over the total of the scientific resources mentioned in each website.

Existing datasources:

Patient Group Websites and Reports

The “Patient Groups Websites and Reports” data are not within a singular, pre-existing database but instead must be meticulously compiled through the scraping and crawling of numerous patient advocacy group websites and their published reports.

This involves identifying and cataloguing online resources across a vast array of patient groups, each potentially focusing on different health issues or advocacy goals (e.g. a list of patient group websites).

Then the development of web scraping tools and algorithms are required to extract the texts needed for the construction of the dataset.

Example short list of patient group websites:

Organization	Focus Area	Website
National Patient Advocate Foundation (NPAF)	Healthcare challenges	npaf.org
Patient Advocate Foundation (PAF)	Chronic, life-threatening, and debilitating illnesses	patientadvocate.org
Asthma and Allergy Foundation of America (AAFA)	Asthma and allergies	aafa.org
National Alliance on Mental Illness (NAMI)	Mental health	nami.org
European Patient Advocacy Groups (ePAGs) - EURORDIS	Rare diseases	eurordis.org

OpenAIRE Research Graph

The OpenAIRE Research Graph is a comprehensive open access database that aggregates metadata on publications, research data, and project information across various disciplines. It includes details on open access publications and datasets, making it a valuable resource for tracking the output of academic-industry collaborations and their adherence to open science principles.

We can use the OpenAIRE Research Graph to identify which of the publications cited in the patient group websites / reports are open access (OA). This involves:

Extracting publication references with unique identifiers (e.g. DOIs) from the patient group websites / reports.
Querying the OpenAIRE Research Graph to determine which of these publications are open access (OA).

Existing methodologies

SciNoBo Research Artifact Analysis (RAA) Tool

This is an automated tool (Stavropoulos et al. 2023), leveraging Deep Learning and Natural Language Processing techniques to identify research artifacts (datasets, software) mentioned in the scientific text and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the dataset has been reused or created by the authors of the scientific text.

To measure the proposed metric, the tool can be used to identify the reused and created OS inputs in the texts of the patient group websites / reports.

One limitation of this methodology is that it may not capture all instances of research artifacts if they are not explicitly mentioned in the scientific text. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a research artifact has been reused or created, and may require manual validation.

References

Stavropoulos, Petros, Ioannis Lyris, Natalia Manola, Ioanna Grypari, and Harris Papageorgiou. 2023. “Empowering Knowledge Discovery from Scientific Literature: A Novel Approach to Research Artifact Analysis.” In, 3753. https://aclanthology.org/2023.nlposs-1.5/.

Reuse

Citation

BibTeX citation:

@online{apartis2023,
  author = {Apartis, S. and Catalano, G. and Consiglio, G. and Costas,
    R. and Delugas, E. and Dulong de Rosnay, M. and Grypari, I. and
    Karasz, I. and Klebel, Thomas and Kormann, E. and Manola, N. and
    Papageorgiou, H. and Seminaroti, E. and Stavropoulos, P. and Stoy,
    L. and Traag, V.A. and van Leeuwen, T. and Venturini, T. and
    Vignetti, S. and Waltman, L. and Willemse, T.},
  title = {PathOS - {D2.1} - {D2.2} - {Open} {Science} {Indicator}
    {Handbook}},
  date = {2023},
  url = {https://handbook.pathos-project.eu/indicator_templates/quarto/3_societal_impact/uptake_by_patient_groups.html},
  doi = {10.5281/zenodo.8305626},
  langid = {en}
}

For attribution, please cite this work as:

Apartis, S., G. Catalano, G. Consiglio, R. Costas, E. Delugas, M. Dulong de Rosnay, I. Grypari, et al. 2023. “PathOS - D2.1 - D2.2 - Open Science Indicator Handbook.” Zenodo. 2023. https://doi.org/10.5281/zenodo.8305626.