Maria Levchenko | 26 February 2024 | 6 mins read

Moving to open source

Europe PMC POSI update – 2 years on

Two years have sailed by since Europe PMC adopted the Principles of Open Scholarly Infrastructure (POSI) in February 2021. POSI is a set of guidelines for open scholarly infrastructure providers that outlines how these organisations should be run and sustained, offering a framework to uphold transparency and accountability. Since Europe PMC joined the POSI adopter group, the POSI Posse, the group has grown to 15 organisations. In that time the principles have progressed to Version 1.1, as a result of work in which Europe PMC was involved.

POSI adopters can measure their progress against the 16 principles using a traffic light system. In our original blog post, we completed this exercise and indicated our level for each principle, as shown below.

Governance

🟢 Coverage across the research enterprise

🟢 Stakeholder governed

🟢 Non-discriminatory membership

🟢 Transparent operations

🟢 Cannot lobby

🟡 Living will

🟢 Formal incentives to fulfil mission & wind-down

Sustainability

🟡 Time-limited funds are used only for time-limited activities

🔴 Goal to generate surplus

🔴 Goal to create contingency fund to support operations for 12 months

🟢 Mission-consistent revenue generation

🟢 Revenue based on services, not data

Insurance

🟡 Open source

🟢 Open data (within constraints of privacy laws)

🟢 Available data (within constraints of privacy laws)

🟡 Patent non-assertion

No traffic lights have changed colour, but in this post we indicate where we have made significant progress.

Open source software is one of the POSI insurance components, along with open and available data and patent non-assertion. POSI states that all software and assets required to run the infrastructure should be available under an open-source licence.

Europe PMC has focussed on the open source principle over the past two years and has made significant progress as part of its continuous open source transition. All of Europe PMC’s open source software is available via our public-projects GitLab repository, which now contains 27 open projects. Since the last POSI review, we have continued to apply the open source principle to every new code base we develop. Where possible, we also update our legacy systems and make them publicly available, cementing our commitment to open source.

One of the fundamental advantages of open source is community involvement. Below we highlight how our open source journey has increased Europe PMC’s impact and supported innovation and discovery as part of our mission.

Open source projects for text-mining applications

Europe PMC uses dictionary-based text-mining to find biological entities, for example, gene names or organisms, citations to biological databases, and accession numbers mentioned in journal publications and preprints. Recently we have developed a new tool – a text-mining API – that extends this process to new types of content. This tool is now employed in Europe PMC to identify concepts of interest in supplementary files to unlock additional evidence associated with the article.
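To make the idea of dictionary-based text-mining concrete, here is a minimal sketch of the general technique. It is not Europe PMC's implementation: the dictionary entries, the identifiers and the function names (`build_pattern`, `annotate`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionary: surface form -> (entity type, identifier).
# Terms and identifiers are made up for this example; they are not taken
# from Europe PMC's dictionaries.
DICTIONARY = {
    "BRCA1": ("gene", "example:0001"),
    "Escherichia coli": ("organism", "example:0002"),
}


def build_pattern(dictionary):
    """Compile one regular expression that matches any dictionary term."""
    # Longest terms first, so multi-word names win over shorter overlaps.
    terms = sorted(dictionary, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")


def annotate(text, dictionary=DICTIONARY):
    """Return one annotation per dictionary term found in the text."""
    pattern = build_pattern(dictionary)
    annotations = []
    for match in pattern.finditer(text):
        entity_type, identifier = dictionary[match.group(1)]
        annotations.append({
            "term": match.group(1),
            "type": entity_type,
            "id": identifier,
            "start": match.start(),  # character offsets into the text
            "end": match.end(),
        })
    return annotations


if __name__ == "__main__":
    for hit in annotate("Mutations in BRCA1 were studied in Escherichia coli."):
        print(hit)
```

Matching here is exact and case-sensitive. A production pipeline would typically also need to handle synonyms, spelling variants and ambiguous terms, which this sketch leaves out.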

To enable reuse we have made the code and dictionaries required to run our text-mining infrastructure openly available. This permits other organisations to analyse data sources not currently available in Europe PMC. BioModels, an EMBL-EBI database for mathematical models of biological systems, plans to repurpose this software to enrich existing data by annotating biological models with gene ontology terms. In another example, the German Collection of Microorganisms and Cell Cultures at Leibniz Institute DSMZ is investigating how this tool can be used to extract details about relevant microorganisms from biological patents.

As a result of a collaboration with Open Targets, machine learning was applied to extract evidence from scientific literature for drug target identification and validation. This combined Named Entity Recognition (NER) for identifying genes/proteins, diseases, organisms, and chemicals/drugs within scientific texts, and entity normalisation to accurately map these entities to databases like Ensembl, Experimental Factor Ontology (EFO), and ChEMBL. The results are accessible through Europe PMC’s text-mining infrastructure. We are currently developing and extending the models to replace our dictionary-based text-mining applications. This work is being executed in the open via the Europe PMC GitLab site.

A previous project by Dr. Maaly Nassar developed a machine learning framework to enrich metagenomics data in MGnify, the microbiome sequence data analysis resource. As an outcome, Europe PMC’s collection of open access publications was annotated with metagenomic terms. Extending this effort to new articles as they become available in Europe PMC required further work, for example compressing the machine learning models to ensure sustainability and reduce computation. The new metagenomics pipeline and underlying machine learning models are being developed as open source in collaboration with SureChEMBL and PDBe. SureChEMBL will use the machine learning architecture and dictionaries to mine patents for biological entities, including chemicals/drugs, genes/proteins, diseases and organisms, and PDBe will use it to mine important residues in proteins from scientific articles, further delivering value to Europe PMC. This collaboration will advance the utility of the Europe PMC annotations platform through improved annotation and discovery tools. It has brought mutual benefits and reduced direct workloads for all teams.

Open source projects for new publication models

Supporting open science and emerging open publication workflows is a strategic goal for Europe PMC. Recognising the importance of preprints in the life sciences, we have taken steps to build trust, improve discoverability, and support preprint reuse.

Europe PMC incorporates preprints from over 30 preprint servers. Preprint information is pulled into Europe PMC from Crossref – a DOI registration agency used by many preprint providers. Europe PMC uses the Crossref REST API to retrieve preprint abstracts and metadata. The legacy software used to fetch and store this data in an Oracle database has been converted to open source. This code can now be reused by other tools and services that aggregate preprints across different platforms.
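As a rough illustration of the retrieval step described above, the sketch below builds a query against the public Crossref REST API `works` endpoint and pulls a few fields out of a response. It is not the Europe PMC code: the function names, the choice of filters and the record layout are assumptions made for this example. Crossref registers preprints under the type `posted-content`; check the Crossref documentation before relying on any particular filter.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_preprint_query(from_date, rows=100, cursor="*"):
    """Build a Crossref query URL for posted content since a given date.

    The filter combination is an assumption for illustration; cursor="*"
    asks Crossref to start cursor-based paging.
    """
    filters = f"type:posted-content,from-posted-date:{from_date}"
    params = {"filter": filters, "rows": rows, "cursor": cursor}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def extract_records(response):
    """Reduce a decoded Crossref JSON response to DOI, title and abstract."""
    records = []
    for item in response.get("message", {}).get("items", []):
        titles = item.get("title") or [""]  # Crossref returns titles as a list
        records.append({
            "doi": item.get("DOI"),
            "title": titles[0],
            "abstract": item.get("abstract"),  # absent for many records
        })
    return records


if __name__ == "__main__":
    print(build_preprint_query("2024-01-01", rows=50))
```

The sketch stops at building the URL and parsing a response, so it runs without network access. Fetching the URL, following the cursor returned with each page and storing the records are left to the caller.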

To support preprint evaluation Europe PMC links preprints with associated reviews. This is accomplished in collaboration with Sciety and EMBO’s Early Evidence Base – two services that collate preprint reviews from different initiatives. Both Sciety and Early Evidence Base share review information in the DocMaps format. To enable DocMaps adoption by the preprint review community we have developed a DocMap parser. This tool converts DocMap files into JATS XML, a format widely used by the scholarly communications community. The parser code is open and available for reuse and we hope this will support further growth around the use of the DocMaps framework.
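The kind of conversion such a parser performs can be sketched as follows. This is a heavily simplified illustration, not the Europe PMC DocMap parser. It assumes a DocMap-like structure with `steps`, `actions` and `outputs`, and that review outputs carry the type `review-article`. The function name and the choice of JATS elements (`sub-article`, `front-stub`, `article-id`) are likewise assumptions for this example.

```python
import xml.etree.ElementTree as ET


def docmap_reviews_to_jats(docmap):
    """Emit a JATS-like <article> with one <sub-article> per review output.

    The input layout is an assumed simplification of a DocMap; real files
    carry more fields, and a real converter would map many more of them.
    """
    root = ET.Element("article")
    for step in docmap.get("steps", {}).values():
        for action in step.get("actions", []):
            for output in action.get("outputs", []):
                if output.get("type") != "review-article":
                    continue  # skip anything that is not a review
                sub = ET.SubElement(
                    root, "sub-article", {"article-type": "reviewer-report"}
                )
                stub = ET.SubElement(sub, "front-stub")
                doi = ET.SubElement(stub, "article-id", {"pub-id-type": "doi"})
                doi.text = output.get("doi", "")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up example data; the DOI is a placeholder.
    example = {
        "steps": {
            "_:b0": {
                "actions": [
                    {"outputs": [{"type": "review-article",
                                  "doi": "10.1234/example-review"}]}
                ]
            }
        }
    }
    print(docmap_reviews_to_jats(example))
```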

Next chapter

Europe PMC is committed to providing open data supported by high-quality, sustainable, open and community-driven infrastructure. This re-audit demonstrates the progress Europe PMC has made, since committing to the principles in 2021, in increasing the proportion of our code that is open source.

As a grant-funded resource set within the context of EMBL, Europe PMC cannot set a goal to generate surplus or to create a contingency fund to support operations for 12 months. Europe PMC is one of the open biological data resources managed by EMBL-EBI. In February 2024 EMBL published a long-term data preservation statement, which provides greater transparency regarding sustainability plans for EMBL-EBI data resources. It also highlights the grounds for Europe PMC’s resilience despite the lack of an individual living will or a commitment to using time-limited funds only for time-limited activities: Europe PMC exists alongside many other databases run by EMBL-EBI, which already has a history of responsible data resource life cycle management and retirement. EMBL does not have a patent non-assertion covenant, but its Open Science and Open Access Policy states that “EMBL expects (…) software to be Open Source by default in both services and research, and made available in open/community software repositories”.

Since adopting the POSI principles, Europe PMC has been selected by the Global Biodata Coalition as a GBC Global Core Biodata Resource (in December 2022), further indicating that Europe PMC is of fundamental importance to the wider life-science community and to the long-term preservation of biological data. Europe PMC is at the heart of open science. It provides comprehensive access to life sciences literature and is available to anyone, anywhere, for free. As such, ensuring its sustainability as a critical resource is vital.

Tags: DocMaps, open source, POSI, text mining
