News blog

Updates from Europe PMC, a global database of life sciences literature

Summer Rosonovski | 10 October 2024 | 2 mins read | Editor's pick

Transforming protein research with AI and human expertise


How a novel machine learning system is bridging the gap between advanced models and expert insights to accelerate protein research

Proteins are essential molecules with many critical roles in the body. Determining a protein’s 3D structure helps scientists understand its functions within cells. This is crucial for advances in biology, medicine, and drug design.

Our paper published in Scientific Data introduces a novel AI-human hybrid workflow, developed by PDBe and Europe PMC. The workflow identifies links between protein structure and function. The data, models and code are freely available.

Human expertise and AI approach 

While human expertise is indispensable for its accuracy, it is often costly, time-consuming, and hard to scale. AI, on the other hand, offers speed and efficiency but requires expert input to train or fine-tune it to perform complex tasks effectively. This novel AI-human hybrid workflow combines the power of machine learning and expert curation to find links between protein structure and function.

The process starts with expert curators who manually review a set of scientific articles. They identify key terms related to protein structures. These key terms refer to the building blocks of proteins, known as amino acid residues. Residues’ specific locations and properties determine how the protein functions within the body. 

The key terms identified from curation are then used to fine-tune a specialised AI model called PubMedBERT. The resulting model is highly specific and can quickly find key terms related to protein structures in articles. This helps scale the process of identifying amino acid residue information and accelerates the discovery of connections between protein structures and their functions.
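As a rough illustration of how curated key terms could become training data for a token-classification model such as PubMedBERT, the sketch below converts character-level annotations into per-token BIO labels. The `residue` label, the whitespace tokenizer, and the example sentence are assumptions made for illustration; they are not the data format described in the paper.

```python
import re

def bio_tags(text, spans):
    """Label whitespace tokens with BIO tags from (start, end, label) character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical curated sentence: an expert marked "His 57" as a residue mention.
sentence = "Mutation of His 57 abolishes catalytic activity"
mention = "His 57"
start = sentence.index(mention)
curated = [(start, start + len(mention), "residue")]

tokens, tags = bio_tags(sentence, curated)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token and tag pairs like these are the usual input for fine-tuning a named-entity model; a real pipeline would use the model's own subword tokenizer rather than splitting on whitespace.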

Figure: Human expertise and AI approach. Step 1. Expert curators manually find key terms in the scientific literature. Step 2. The curated literature is used to fine-tune an AI model. Step 3. The AI model identifies key terms related to protein structures in articles. Step 4. The expert curators manually correct the AI model's annotations, and the corrections are fed back into the AI model to fine-tune it further. This cycle of manual curation and AI key term identification repeats to keep improving the model.
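The four steps can be sketched as a small feedback loop. This is a toy illustration only: the "model" here is a plain set of known terms standing in for a fine-tuned PubMedBERT, the curator is simulated with a fixed set of correct terms, and all function names and example sentences are hypothetical.

```python
def fine_tune(training_terms):
    """Toy stand-in for fine-tuning: the 'model' is just the set of known terms."""
    return set(training_terms)

def annotate(model, article):
    """Return the words in the article that the toy model recognises."""
    return {word for word in article.split() if word in model}

def curation_loop(articles, gold_terms, seed_terms, rounds=3):
    """Alternate model annotation and expert correction, retraining each round."""
    training_terms = set(seed_terms)      # Step 1: terms found by curators
    model = fine_tune(training_terms)     # Step 2: initial fine-tuning
    for _ in range(rounds):
        for article in articles:
            predicted = annotate(model, article)  # Step 3: model annotates
            # Step 4: simulated curator supplies the correct terms for this article.
            corrected = {word for word in article.split() if word in gold_terms}
            # Drop wrong predictions and add missed terms to the training data.
            training_terms = (training_terms - (predicted - corrected)) | corrected
        model = fine_tune(training_terms)  # feed corrections back into the model
    return model

articles = [
    "Arg45 contacts the ligand",
    "Mutating Ser195 and His57 reduces activity",
]
gold = {"Arg45", "Ser195", "His57"}
model = curation_loop(articles, gold, seed_terms={"Arg45", "ligand"}, rounds=2)
print(sorted(model))
```

In this sketch the seed set wrongly includes "ligand" and misses two residues; the simulated corrections remove the false positive and add the missed terms, which mirrors how curator feedback is meant to refine the model over successive rounds.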

Accelerating discovery: faster and scalable annotations

The workflow developed by Europe PMC and PDBe bridges the gap between advanced machine learning and expert knowledge. The resulting model makes the identification of residues (the key terms) related to protein structure faster and easier to scale. The annotated terms it produces can be used to generate functional insights, support drug target validation, and propel the development of novel treatments.

“This work was a first start for PDBe in the field of NLP and the initial step for a more ambitious project downstream. The details given in the paper will hopefully allow anyone new to this field to follow through like in a recipe book,” said Melanie Vollmar, PDBe Research Fellow (ARISE/Marie Curie).

Explore the article on Scientific Data to learn more about the AI-human approach and its potential impact on protein research.


Creative Commons Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

