Tuesday, 28 March 2023

Improved affiliation search for grants

Integrating ROR IDs in Grant Finder

Funding and research organisations are increasingly looking to understand the impact of the research they support. To link institutions with grant awards from Europe PMC funders, the Europe PMC Grant Finder tool now incorporates Research Organization Registry (ROR) IDs.

Linking grants and institutions

The Grant Finder tool from Europe PMC allows users to search and access information about 98,000 biomedical research grants from 37 Europe PMC funders. This information includes the affiliation of the principal investigator, who is awarded the grant. However, a search for grants by institutional affiliation can miss relevant entries due to variations in institutional names.

For example, the same organisation may be referred to as University of Cambridge, Cambridge University, etc.

Screenshot of the Grant Finder prior to the ROR ID integration. Affiliation drop-down menu suggests multiple options for the same organisation.

A ROR ID is an open persistent identifier for research organisations. It helps disambiguate names for over 102,000 institutions. ROR makes it easy for anyone to connect research organisations to research outputs. 

To improve the affiliation search function of the Grant Finder tool we have decided to map institutions in the Europe PMC grant database to corresponding ROR IDs. 

Integrating ROR IDs

At the start of this project some Europe PMC funders were already providing ROR IDs in the funding data they supplied. However, as of July 2022, only ~10% of grants in Europe PMC were linked to a ROR ID.

To assign ROR IDs to remaining affiliations, we have used an automated script to query the ROR REST API for the ~23,000 institution names in the grants database. The ROR API response includes not only a matching ROR ID, but also the matching confidence score and the type of matching algorithm applied. Importantly, it provides a binary indicator of whether the score is high enough to consider the organisation correctly matched. 
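The matching step described above can be illustrated with a minimal Python sketch. It is not the script we used. The endpoint URL, the `affiliation` query parameter, and the response field names (`items`, `chosen`, `score`, `matching_type`, `organization`) are assumptions based on the public ROR affiliation-matching documentation and may differ between API versions.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL for the ROR REST API; check the current ROR documentation.
ROR_ORGANIZATIONS_ENDPOINT = "https://api.ror.org/organizations"


def build_affiliation_url(institution_name: str) -> str:
    """Build a query URL asking ROR to match a free-text institution name."""
    return f"{ROR_ORGANIZATIONS_ENDPOINT}?{urlencode({'affiliation': institution_name})}"


def pick_confident_match(response: dict) -> Optional[dict]:
    """Return the candidate that ROR flags as a confident match, if any.

    `response` is the decoded JSON body. The binary indicator is assumed
    to be the `chosen` field on each candidate item.
    """
    for item in response.get("items", []):
        if item.get("chosen"):
            return {
                "ror_id": item["organization"]["id"],
                "score": item["score"],
                "matching_type": item["matching_type"],
            }
    return None  # no confident match: leave the institution for manual review
```

Institutions for which `pick_confident_match` returns `None` would remain unmatched, which is consistent with only part of the institution list being matched automatically.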

Using the ROR API we were able to confidently match 41% of institutions. This was followed by a manual check for any potential errors. Institutions associated with multiple grants were prioritised in this step.

As a result, over 80% of grants in Europe PMC now have a ROR ID associated with the principal investigator’s institution. Over 2600 unique ROR IDs are recorded in the Europe PMC grants database and can be retrieved programmatically as part of the GRIST API core response.

Screenshot of the GRIST API core response query for (gid:081052) with institution name and ROR ID highlighted in red.

Improved grants search

To streamline user experience we then used the ROR ID to suggest matching affiliations in the Grant Finder search. When users enter an affiliation into the Grant Finder search, the auto-suggestion feature will display a single name for institutions associated with a ROR ID, consolidating different institutional aliases. The feature also displays the ROR ID in brackets.

Screenshot of the Grant Finder after the ROR ID integration. Affiliation drop-down menu consolidates aliases for Cambridge University under the organisation’s primary name, ‘University of Cambridge’, with the ROR ID displayed in brackets.
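The consolidation behaviour can be sketched in a few lines of Python. This is an illustration only, not the Grant Finder implementation: the record layout (`name`, `primary_name`, `ror_id`) is hypothetical, and the ROR ID in the usage example is a placeholder rather than a real identifier.

```python
def suggest_affiliations(typed_text: str, institutions: list) -> list:
    """Return auto-suggestion labels for the text typed so far.

    Each institution is a dict with a 'name' (as supplied in the grant data)
    and, when mapped, a 'ror_id' and the organisation's 'primary_name'.
    Aliases sharing a ROR ID collapse into a single label.
    """
    needle = typed_text.lower()
    labels = {}  # dict keys keep insertion order and remove duplicates
    for inst in institutions:
        if needle not in inst["name"].lower():
            continue
        if inst.get("ror_id"):
            label = f"{inst['primary_name']} ({inst['ror_id']})"
        else:
            label = inst["name"]  # unmapped names are shown as they are
        labels.setdefault(label, None)
    return list(labels)


# Usage with made-up records; "ROR-ID-PLACEHOLDER" is not a real ROR ID.
records = [
    {"name": "University of Cambridge", "primary_name": "University of Cambridge",
     "ror_id": "ROR-ID-PLACEHOLDER"},
    {"name": "Cambridge University", "primary_name": "University of Cambridge",
     "ror_id": "ROR-ID-PLACEHOLDER"},
    {"name": "Cambridge Example Institute"},
]
suggestions = suggest_affiliations("cambridge", records)
```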

Future plans

Our ambition is to enable grant search by ROR ID of the associated institution. In the current implementation, only the institution name is used for affiliation search via the GRIST API and the Grant Finder tool.

The current integration primarily supports research funders, who want to view grant allocations through an institutional lens. However, we believe that it creates many opportunities for wider adoption. Lessons learnt from incorporation of ROR IDs into the grant data could help with future steps, for example, enriching publications with ROR IDs. 

Supporting innovation by integrating related research outputs is an important part of Europe PMC’s mission. Over the years we have built a rich network of research objects connected via persistent identifiers. This includes DOIs/PMIDs/PMCIDs for journal articles, PPRIDs for preprints, ORCIDs for research authors, accession numbers and DOIs for data, and Grant IDs and DOIs for grants. We hope that addition of ROR IDs to Europe PMC will pave the way for new tools that build on this interconnected network of research.

Wednesday, 8 March 2023

Introducing Europe PMC Annotated Full-text Corpus for bioentities and associations

Europe PubMed Central (Europe PMC) is an open access repository of life science research, including peer-reviewed journal articles and preprints. It contains over 41 million abstracts and 8.7 million full-text articles, adding over 1.7 million new articles annually. To facilitate information discovery and foster literature–data integration, Europe PMC has incorporated text-mining approaches into its workflows. 

Text-mining is a powerful tool used in the field of biomedicine to extract relevant information from large amounts of text data, such as research papers, clinical reports, and patient records. It involves using machine learning algorithms and natural language processing techniques to identify key concepts and relationships between words and phrases in the text. 

With millions of new research papers published every year, it's impossible for researchers to read and synthesise all of the information available manually. Text-mining algorithms can be used to automatically extract key concepts, relationships, and findings from scientific papers, allowing researchers to quickly identify relevant information and stay up-to-date on the latest developments in their field.

At Europe PMC, the SciLite Annotations tool uses text-mining to highlight terms in research articles and preprints, allowing users to quickly scan the article for relevant concepts, such as diseases, chemicals, or protein interactions. Europe PMC contains ∼1.3 billion annotations sourced in-house and from 10 external providers. The annotations platform covers multiple annotation types, including bioentities ranging from accession numbers to Open Targets gene–disease relationships. Users can programmatically access the annotations using the Annotations API, reducing the time needed to extract facts and evidence and helping to advance the discovery process.

How Europe PMC developed annotations for named entities

Europe PMC has developed annotations for Gene/Protein, Disease, Organism and Chemical bioentities. For this purpose, established ontologies are used as dictionaries to pattern match entity terms in the text. For example, the Unified Medical Language System (UMLS) ontology is used for tagging diseases mentioned in articles. Although the dictionary-based approach is easy to understand and implement, it requires an exhaustive list of patterns to recall more entities, and that list needs regular updating to remain current. Moreover, because contextual information is missing, the approach creates ambiguity, especially when scientists use acronyms and abbreviations. For example, Cockayne syndrome Group A and Corporate Social Responsibility are both abbreviated to CSA. Overall, this approach has multiple challenges, which cause false positives, false negatives, and other technical issues, as listed below:

False positives

  • Short target names are confused with ambiguous abbreviations.

    •  Cockayne syndrome Group A (CSA) vs Corporate Social Responsibility (CSA)

  • Contextual differences to identify terms

    • Hearing (aids) vs AIDS the disease

  • Common English words

    • CAN gene vs ‘can’ being a common English word

  • Multiple entities or mentions of compound entity

    • It is difficult to pattern match BRCA1 and 2 or BRCA1/2 vs BRCA1 and BRCA2

False negatives

  • Entities that do not exist in the original ontology/dictionaries are missed

    • E.g., T2D for Type 2 Diabetes Mellitus

    • The ontology must be updated regularly to keep up with the growing literature

Technical issues

  • Encoding issues often create garbage values when text is converted from ASCII to UTF-8 format

    • TNFα vs TNFÎ±

  • Sentence boundary problems

    • It is difficult to delineate a sentence within a caption of a Figure or a Table

Challenges faced when using a dictionary-based approach to annotate entities in scientific text including false positives, ambiguity of abbreviations, special characters, and a lack of distinction between genes and proteins.
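A toy Python example makes some of these failure modes concrete. It is a simplified sketch of dictionary-based tagging, not the Europe PMC pipeline: the two small dictionaries are made up for illustration and matching is a plain case-insensitive whole-word search.

```python
import re


def dictionary_tag(text: str, dictionary: dict) -> list:
    """Tag every case-insensitive whole-word match of a dictionary term.

    Returns (start, end, matched_text, entity_type) tuples sorted by position.
    """
    hits = []
    for term, entity_type in dictionary.items():
        pattern = rf"\b{re.escape(term)}\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), entity_type))
    return sorted(hits)


# Illustrative dictionaries only; real ontologies hold many thousands of terms.
toy_dictionary = {"CAN": "Gene/Protein", "AIDS": "Disease"}
false_positives = dictionary_tag("Patients can benefit from hearing aids.", toy_dictionary)

gene_dictionary = {"BRCA1": "Gene/Protein", "BRCA2": "Gene/Protein"}
compound_mention = dictionary_tag("BRCA1 and 2 were sequenced.", gene_dictionary)
```

In the first sentence both hits are false positives: the common word "can" is tagged as a gene and "aids" as a disease. In the second sentence only BRCA1 is found, so the implied mention of BRCA2 becomes a false negative.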

Improving annotations using machine learning: gold standard dataset 

To address the challenges of using a dictionary-based approach to annotations, the Data Science team at Europe PMC has explored the use of machine/deep learning techniques.

To train any machine-learning algorithm, a gold standard dataset of annotations is needed. While corpora without annotations are good for learning semantics, text-mining tools trained on human-annotated corpora outperform those trained on non-annotated ones. Therefore, open-source gold-standard datasets are crucial for improving biomedical text-mining systems. With this in mind, the team methodically developed an annotated full-text corpus following a set of annotation guidelines. The corpus is a collection of 300 research articles from the Europe PMC open access subset. The selected articles have been annotated by humans to indicate mentions of three biomedical concepts: Gene/Protein, Disease, and Organism. The human annotators used the annotation guidelines to select the correct text span and type of annotation.

Problems faced while developing the gold standard dataset

The Europe PMC team faced several challenges in creating a gold standard training set to support machine learning approaches for entity extraction. One of the first hurdles was to select a small number of representative articles from the several million available in the Europe PMC database. For this purpose a strategy was designed and several techniques were employed to stratify articles and select the representative set. 

Out of approximately 6 million full-text articles in the Europe PMC repository, archived on the 31st of August, 2018 (v2018.09), approximately 1 million were open access with a Creative Commons licence and could be included in the training set. The training set was further refined by the following criteria: articles between 25 and 50 KB in size were selected, which resulted in a collection of approximately 0.5 million articles. The articles were then sorted by the number of entity mentions into low, medium, and high ‘bins’ for each entity type, that is Gene/Protein, Disease, and Organism. Articles containing few or no mentions of any of the entities were discarded, leaving over 460,000 articles, of which 300 were randomly selected. The final stage of the strategic workflow included working with the annotators iteratively to improve the annotation guidelines.
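The selection workflow can be outlined as a short Python sketch. It rests on several assumptions that are not part of the original method description: the bin thresholds (`low_max`, `medium_max`) are hypothetical, the article record layout is invented for illustration, and an article is discarded here when it falls in the low bin for any one of the three entity types.

```python
import random

ENTITY_TYPES = ("Gene/Protein", "Disease", "Organism")


def bin_label(count: int, low_max: int = 2, medium_max: int = 10) -> str:
    """Place a mention count in a low, medium, or high bin (hypothetical thresholds)."""
    if count <= low_max:
        return "low"
    if count <= medium_max:
        return "medium"
    return "high"


def select_articles(articles: list, sample_size: int = 300, seed: int = 42) -> list:
    """Filter by file size and mention bins, then draw a random sample."""
    candidates = []
    for article in articles:
        if not 25 <= article["size_kb"] <= 50:
            continue  # keep only articles between 25 and 50 KB
        bins = {t: bin_label(article["mentions"].get(t, 0)) for t in ENTITY_TYPES}
        if "low" in bins.values():
            continue  # assumption: a low bin for any entity type discards the article
        candidates.append(article)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(sample_size, len(candidates)))
```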

Collaborating with Molecular Connections to annotate the corpus

Europe PMC collaborated with Molecular Connections to annotate the corpus. The annotators were asked to use the hypothes.is tool, which was added to Europe PMC as a plug-in to support collaborative annotation work. Annotators saved their annotations to the hypothes.is server in JSON format, which was retrieved and converted to CSV format using in-house tools. This project used a triple-anonymous approach to annotation: three annotators annotated the same articles independently to ensure annotation quality and validate inter-annotator agreement. Annotation discrepancies were resolved by majority vote to ensure the best quality annotation. That is, at least two annotators must agree on the annotation boundary and the entity type of a term for it to pass the acceptance threshold. This maximised the total number of annotations. For example, if one annotator misses a term, it will likely be picked up by the two other annotators. The triple-anonymous method made it possible to conveniently assess inter-annotator agreement to ensure annotation quality. Using this approach, we were able to increase the annotations’ accuracy from 70% to 99%.
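The majority-vote rule can be expressed compactly in Python. This is a minimal sketch, assuming each annotation is represented as a `(start, end, entity_type)` tuple; that representation is an illustration and not the format of the in-house tools.

```python
from collections import Counter


def majority_vote(annotator_sets: list, min_agreement: int = 2) -> list:
    """Keep annotations on which at least `min_agreement` annotators agree.

    Two annotations agree only if both the boundaries (start, end) and the
    entity type are identical.
    """
    counts = Counter()
    for annotations in annotator_sets:
        counts.update(set(annotations))  # each annotator counts once per annotation
    return sorted(a for a, n in counts.items() if n >= min_agreement)


# Usage: three annotators, one article.
annotator_1 = [(0, 5, "Gene/Protein"), (10, 18, "Disease")]
annotator_2 = [(0, 5, "Gene/Protein"), (10, 17, "Disease")]   # different boundary
annotator_3 = [(0, 5, "Gene/Protein"), (10, 18, "Disease"), (20, 25, "Organism")]
accepted = majority_vote([annotator_1, annotator_2, annotator_3])
```

Here the gene mention is accepted unanimously, the disease mention is accepted on a two-to-one vote, and the organism mention seen by a single annotator is rejected.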


What can you do with the Europe PMC open source corpus?

The Europe PMC Annotations Corpus is among the largest human-annotated biomedical corpora publicly available. The corpus comes with the scripts that were used to clean and format the annotations from the Hypothes.is platform. The dataset is also available in the IOB format for input to deep learning algorithms. In addition, the annotation guidelines are available so that researchers can improve on or compare conclusions drawn from the results.
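For readers unfamiliar with the IOB (inside, outside, beginning) format, the Python sketch below shows how character-level span annotations map to one tag per token. It is an illustration under simple assumptions (whitespace tokenisation, non-overlapping spans) and is not the conversion script distributed with the corpus.

```python
import re


def tokenize(text: str) -> list:
    """Split on whitespace, keeping character offsets: (token, start, end)."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def to_iob(tokens: list, spans: list) -> list:
    """Assign B-/I-/O tags to tokens from (start, end, entity_type) spans."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, entity_type in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start >= span_start and tok_end <= span_end:
                # First token of an entity gets B-, the following ones I-.
                tags[i] = ("I-" if inside else "B-") + entity_type
                inside = True
    return tags


sentence = "Cockayne syndrome affects humans"
spans = [(0, 17, "Disease"), (26, 32, "Organism")]
tags = to_iob(tokenize(sentence), spans)
```

Each token then carries exactly one label, which is the input shape that sequence-labelling models typically expect.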

This open source gold-standard dataset can be used to improve the accuracy and reliability of life science natural language processing tools such as entity recognition, supporting advancements in scientific research. Furthermore, the corpus can support clinical decision-making by providing access to relevant clinical information that can be used to develop clinical decision support systems, which could improve patient outcomes and reduce healthcare costs.

The Annotated Corpus can accelerate life science research by providing large amounts of accurately labelled data that can be used to train machine learning models for various applications, including drug discovery and disease diagnosis. For example, the deep learning models that were trained using the Europe PMC Annotations Corpus are now being used for literature mining for Open Targets to identify and prioritise drug targets, provide evidence for drug target validation, and support drug repurposing efforts to accelerate and improve the efficiency of drug development. 

To find out more about the Europe PMC Annotations Corpus and details on how to access and reuse this open community resource: 

Written by Santosh Tirunagari, Senior Machine Learning Developer at EMBL-EBI

Wednesday, 11 January 2023

Europe PMC in 2022: a year in review

With the start of the new year, Europe PMC reflects on 2022. As many of us adjusted to hybrid working, with the opportunity to return to the office, meet new colleagues, and attend conferences and workshops in person for the first time since the pandemic, Europe PMC’s efforts continued to concentrate on open access to research, building trust in preprints, user-centred design, and long term sustainability.

Preprint highlights

In 2022, in line with our open access commitment, Europe PMC has increased the number of preprints available with the inclusion of five new preprint platforms: SciELO Preprints, Access Microbiology, ARPHA preprints, agriRxiv, and EcoEvoRxiv. Over 520,000 preprints from 28 preprint servers are now freely discoverable in Europe PMC.

Following the success of full text COVID-19 preprints in Europe PMC, conversion of full text preprints funded by Europe PMC funders was launched in April 2022 to improve the discoverability of science reported in preprints and increase the visibility of research supported by Europe PMC funders. These full text preprints are available via the Europe PMC website, as well as via automated bulk download for text-mining and programmatic analysis.

For those interested in reviewing preprints in a journal club setting, Europe PMC ran a webinar on how to use Europe PMC to select and evaluate preprints. Europe PMC also expanded its number of preprint review providers by including public preprint evaluations available from Sciety, a preprint review platform.

Finally, to help increase transparency and build trust in preprints, Europe PMC developed the Article status monitor tool, which can alert researchers when a preprint is withdrawn or removed, published in a journal, or has a new version. For further information and to learn more about the tool, see this poster presented by Europe PMC at the Reproducibility, Replicability and Trust in Science conference.


Service for users

Europe PMC places users at the heart of innovation. To provide even more powerful and customisable search tools, user research was carried out to inform the redesign of the advanced search and main search functionality. Based on what was learnt, some changes have already been implemented – for example, changing the default sort order to relevance. Europe PMC also completed a user research study investigating ways to collate and display preprint peer review status to help build trust in preprints. Preprint review is an area of increasing interest to the research community, and therefore an important part of the Europe PMC roadmap. For more information on Europe PMC’s user research and future plans in this area see our lightning talk at the Recognising preprint peer review conference.

A site-wide user survey was launched in November, to better understand how you use Europe PMC, what you like about it, and where improvements can be made. The survey results will help guide the development of many exciting projects that we can’t wait to share with you! Keep up to date with us by following @EuropePMC_news on Twitter. Europe PMC is always keen to learn from user feedback and hopes that a new and improved design of the feedback form will make sharing your thoughts easier than ever.



Work on the long term sustainability of Europe PMC has been approached in a number of ways this past year.

In February 2022 Europe PMC adopted the Principles of Open Scholarly Infrastructure (POSI) – a set of guidelines by which open scholarly infrastructures can be operated and sustained. The Europe PMC review against POSI demonstrated our commitment to provide high quality, sustainable, open, and community driven infrastructure, and highlighted potential improvements to increase the proportion of open source code Europe PMC is run on.

In December, Europe PMC was announced as a Global Core BioData Resource by the Global BioData Coalition, demonstrating the need for the long term funding and sustainability of Europe PMC as a critical global life science and biomedical research infrastructure.

To this end, Europe PMC was excited to announce that it welcomed three new funders to the Europe PMC funder family in 2022. The Medical Research Foundation funds research that will ‘advance medical research, improve human health and change people’s lives’. Health and Care Research Wales focuses on health and social care, with their goal ‘to ensure that today’s research makes a difference to tomorrow’s care’. The European and Developing Countries Clinical Trials Partnership (EDCTP) funds clinical research in line with their vision ‘to reduce the individual, social and economic burden of poverty-related infectious diseases in sub-Saharan Africa’. All Europe PMC funders expect research they fund to be made openly available, and we delivered a webinar explaining how to make research open with Europe PMC.

To help address the urgent challenges of climate change seen across the globe, and in line with EMBL’s sustainability strategy, Europe PMC has worked hard to migrate services to a new high performance computing farm to reduce our energy consumption and resultant impact on the planet.

Final remarks

Europe PMC is dedicated to providing an exceptional free and open service for the scientific community that meets your needs. Europe PMC will continue to be guided by user feedback as part of its mission to support innovation and discovery by engaging users, enabling contributors, and integrating related research outputs.

Take a look at the plans for 2023 on Europe PMC’s Roadmap and let us know what you think!

Monday, 20 June 2022

Medical Research Foundation joins Europe PMC

We are delighted to announce that the Medical Research Foundation joins Europe PMC as a new funder. This brings the Europe PMC funder family to 37 members.

The Medical Research Foundation is the charitable foundation of the Medical Research Council. With support from the scientific and medical communities and the public, the charity funds high-quality medical research that improves human health and changes people’s lives. The Foundation is a purely research-led organisation, meaning all its efforts are centred around funding scientists and research. This allows them to focus solely on finding and funding research that has the potential to change lives now, or may become important in the future.

Researchers funded by the Medical Research Foundation will join thousands of others who make their published research articles freely available from Europe PMC as soon as possible without any embargo period. If your research is supported by the Medical Research Foundation, you can submit your published manuscript for inclusion in Europe PMC via the Europe PMC plus manuscript submission system.

You can now find publications supported by the Medical Research Foundation, as well as Medical Research Foundation grant awards via Europe PMC search and the Grant finder tool. 

For more information about joining the Europe PMC funders group, visit our website: http://europepmc.org/Joining

Tuesday, 17 May 2022

Health and Care Research Wales joins Europe PMC funders group

We are delighted to announce that Health and Care Research Wales joins Europe PMC as a new funder. This brings the Europe PMC funder family to 36 members.

Health and Care Research Wales is a networked organisation which brings together a wide range of partners across the NHS in Wales, local authorities, universities, research institutions, third sector, and others. Health and Care Research Wales aims to ensure that today’s research makes a difference to tomorrow's care. To achieve this goal Health and Care Research Wales brings together partners to promote research into diseases, treatments, and services, which can lead to discoveries and innovations to improve and save lives. Health and Care Research Wales is supported by the Welsh government. 

Health and Care Research Wales researchers will join thousands of others who have made a commitment to making research open access through inclusion of research articles in Europe PMC. Health and Care Research Wales requires all peer-reviewed research articles submitted on or after 1st September 2022 to be published under the Creative Commons attribution licence (CC BY) (or Open Government Licence (OGL) when subject to Crown Copyright), made open access, and included in Europe PMC as soon as they are published without any embargo period. Authors are strongly encouraged by Health and Care Research Wales to self-archive peer-reviewed research articles submitted before 1st September 2022 in Europe PMC. If your research is funded by Health and Care Research Wales, you can submit your published manuscript for inclusion in Europe PMC via the Europe PMC plus manuscript submission system.

Tuesday, 3 May 2022

Europe PMC improves discoverability of preprints

Europe PMC now includes the full text preprints supported by Europe PMC funders

Open science is at the heart of Europe PMC, providing access to open content and data. Recognising the role that preprints play as a way for life science researchers to openly and rapidly share their findings, Europe PMC has made over 420,000 preprint abstracts from 24 preprint servers discoverable alongside journal publications. Following the success of the COVID-19 full text preprints initiative, which currently includes over 31,000 full text COVID-19 preprints, Europe PMC is expanding the number of searchable full text preprints to include those supported by Europe PMC funders. Overall, this new project aims to increase the discoverability of science reported in preprints, expand the collection of full text preprints for future analyses, and improve the visibility of preprints supported by Europe PMC funders.

Which preprints are included?

From April 1st 2022, Europe PMC includes the full text of preprints that acknowledge funding from at least one of the 36 Europe PMC funders and have a Creative Commons licence. As the first step Europe PMC has added preprints from medRxiv, bioRxiv, and Research Square, with plans to expand to other preprint servers in the future. 

How does it work?

Europe PMC converts the freely available full text in PDF format to a machine-readable XML format suitable for text-mining. A preview of how the preprint will appear in Europe PMC is then shared with the corresponding author. The full text is added to Europe PMC two weeks later or immediately after author approval. The full text of preprints supported by Europe PMC funders is made searchable along with other preprint abstracts through the Europe PMC website as well as programmatically via the API. It is also available for bulk download as part of the Preprints subset for future analyses. 

What are the benefits?

While the full text of each preprint is openly available from the corresponding preprint server, there are numerous advantages to including it in Europe PMC. 

Being able to view the full text directly on Europe PMC makes it more convenient to users and makes research presented in preprints more discoverable. For preprint authors supported by Europe PMC funders this means higher visibility and wider reach for their scientific findings. 

By default, Europe PMC search applies to the full text of indexed journal articles and preprints, not just abstracts. Therefore, having the full text of these preprints within Europe PMC means that they are surfaced even when the search terms appear only beyond the abstract. It also enables advanced search options, such as the ability to limit a search to specific sections of the preprint, for example Figures, Results, or Methods.

Making the full text of preprints available programmatically in a structured machine-readable format also supports text and data mining. The Europe PMC text and data mining pipeline, in collaboration with several text-mining groups, identifies key biological entities, such as data accessions or gene/protein names, experimental methods, protein interactions, mutations, gene-disease relationships and more, in the abstracts and available full text of preprints. This enables better linking of the literature and the data behind it. The text mining pipeline powers the Annotations tool, which allows readers to quickly scan preprints of interest to find data and evidence presented in the manuscript. 

Full text can also support future research on research, for example around the impact of peer review or data availability. For users carrying out bioinformatic studies or literature reviews, having open access to the full text preprint collection from multiple preprint servers both on the website and programmatically via RESTful APIs in Europe PMC makes analysis easier and further supports open sharing of data.

Finally, as an archive of scholarly content, Europe PMC contributes to longevity and continued access to scientific data and findings presented in preprints. We believe that preprints can remove barriers to open science and Europe PMC is committed to making the science reported in preprints more widely discoverable. 

For more information about preprints in Europe PMC, visit our website: https://europepmc.org/Preprints

Thursday, 3 March 2022

SciELO Preprints discoverable in Europe PMC


We are delighted to announce that SciELO Preprints are now discoverable in Europe PMC. 

SciELO (Scientific Electronic Library Online) is a bibliographic database, digital library, and cooperative electronic publishing model of open access journals. It was originally established in Brazil in 1997 and has since expanded to include collections from 16 countries, predominantly in Latin America. 

In 2020 SciELO and the Public Knowledge Project (PKP) launched the SciELO Preprints Collection to accelerate the availability of research articles and other scientific communications.

As an avid supporter of open science, Europe PMC has been indexing life science preprints alongside journal articles since 2018. Currently, over 400,000 preprints from over 20 different platforms are available in Europe PMC. Preprints in Europe PMC are enriched with links to open peer review materials, related data, citing articles, and other useful resources. 

Over 1000 SciELO Preprints are available in Europe PMC in their original language, Portuguese, Spanish, or English, and can be accessed using the following search: PUBLISHER:"SciELO Preprints".

An example of a SciELO preprint in Europe PMC.
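The same search can also be run programmatically. The Python sketch below only builds the request URL. The endpoint and the `format` and `pageSize` parameters are assumptions based on the public Europe PMC RESTful web service documentation, so please check them against the current documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the Europe PMC RESTful web service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 25) -> str:
    """Return a URL that runs `query` and asks for JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


# The query string is the one given above; quoting is handled by urlencode.
scielo_url = build_search_url('PUBLISHER:"SciELO Preprints"')
```

The resulting URL can be passed to any HTTP client to retrieve the matching preprint records.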

The preprint page displays the title, abstract, and author information. The preprint is linked to the journal published version from the preprint banner, as well as preprint reviews, including the recent integration between SciELO Preprints and PREreview, and also to the citation information and alternative metrics from the Citations & impact section on the left hand side. Readers can view genes, diseases and organisms mentioned in the preprint under the Annotations tool on the right hand side. The preprint can also be easily added to the ORCID profile by the authors using the Claim to ORCID option on the right.

An important outcome of the new collaboration between Europe PMC and SciELO is the push for changes to scholarly infrastructure to better handle multilingual content. Support for multilingual metadata is now part of Crossref’s public roadmap. Implementation of these changes would enable Europe PMC to host Portuguese, Spanish, and English versions for SciELO Preprints. But much more importantly, it could mean greater accessibility and discoverability of multilingual research across many scholarly platforms.

If you are interested in learning more about SciELO Preprints in Europe PMC, please register to join our live demo at 14.00 (GMT) on April 13th.