Thursday, 8 November 2018

Mapping out the path to data

Data availability statements in biomedical literature
Every research paper is a story about data. Over 2.5 million articles in Europe PMC contain data of all sorts, from microscopy images to bird song recordings. While in the past, a research paper might have deciphered a single gene sequence, modern experiments often produce gigabytes of information at once. This means that data described in a paper might be spread across several databases, creating a challenge for researchers who want to access and reuse it.

To boost the reproducibility and reuse of research datasets, many scientific journals have introduced a data availability statement - a distinct article section that contains guidelines on data access. The data availability statement, as the name suggests, states whether the data is available, outlines the conditions for access, and includes hyperlinks to publicly archived datasets analysed or generated during the study.

In fact, to date, over 230,000 full-text publications in Europe PMC contain a dedicated data availability section, with the oldest record dating to 1980. However, it is only from 2014 that we see a sharp increase in the number of articles that include a data statement, reaching just over a quarter of all full-text articles published in 2018 so far.


[Figure: data availability statements in Europe PMC full-text articles over time]

*data analysis by Michael Parkin

This increase is a very positive trend in support of research reproducibility; however, there is significant variation in how Data Availability sections are included in publications across different journals. For example, they appear in a number of different places within an article, sometimes as a stand-alone section, and sometimes as part of the Methods, Results or Discussion. The many variations in the title of the Data Availability section also make these statements harder to discover across journals.



To improve access to scientific data reported in research papers, Europe PMC has built search filters that target data availability statements specifically. This should support trend analysis and research into data sharing practices, potentially providing insight into how data is shared and the downstream impact of data sharing.

The right kind of search

Granular search within article sections has been available in Europe PMC for a while. It allows you to restrict your searches to figure legends or the Materials and Methods section to get more relevant results. You can access this feature in the Europe PMC Advanced Search by selecting the section of interest from a drop-down menu.


We have recently extended the list of full-text article sections available for deep searching by adding a “Data Availability” category. It unifies all of the name variants mentioned above, and can be searched using the (DATA_AVAILABILITY:*) syntax. For example, if you would like to retrieve papers that have data deposited in Figshare, you could search for (DATA_AVAILABILITY:Figshare).
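The same query can also be sent programmatically. The following is a minimal sketch, not an official client: it assumes the Europe PMC RESTful search endpoint at the URL shown and its `query`, `format` and `pageSize` parameters, and the helper names (`data_availability_query`, `build_search_url`, `search`) are hypothetical, introduced only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC RESTful search service;
# check the current API documentation before relying on it.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def data_availability_query(term="*"):
    """Build a section-level query such as (DATA_AVAILABILITY:Figshare)."""
    return f"(DATA_AVAILABILITY:{term})"


def build_search_url(query, page_size=25):
    """Return the request URL for a JSON search response."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{SEARCH_URL}?{urlencode(params)}"


def search(query, page_size=25):
    """Send the query over the network and return the decoded JSON."""
    with urlopen(build_search_url(query, page_size)) as response:
        return json.load(response)
```

Calling `search(data_availability_query("Figshare"))` would then return the decoded response for the Figshare example above; the total number of matching articles is expected under a `hitCount` key, which is again an assumption about the response format.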

While the search tool developed by Europe PMC can help surface the data reported in scientific literature, it relies on the publication authors to provide sufficient information on how the data can be accessed. There is great variety in data sharing practices, from data being deposited in public community databases such as the European Nucleotide Archive (ENA) or the Protein Data Bank (PDB), to data included in supplemental files, in institutional repositories, on websites, or being available upon request from the authors.

As an example, the share of data availability statements including the word “request” has risen dramatically in the last three years, and has reached one third of all data availability statements in 2018.


[Figure: share of data availability statements containing the word “request”, by year]

*data analysis by Michael Parkin
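A share like the one described above could be estimated by comparing hit counts for two queries restricted to the same publication year. This is only a sketch under stated assumptions, not the method used for the analysis shown here: it assumes the `PUB_YEAR` search field can be combined with the section search using `AND`, and the helper names `year_query` and `share` are hypothetical.

```python
def year_query(term, year):
    """Build a data availability query restricted to one publication year."""
    return f"(DATA_AVAILABILITY:{term}) AND PUB_YEAR:{year}"


def share(matching, total):
    """Return the fraction of statements that match, or 0.0 if there are none."""
    if total == 0:
        return 0.0
    return matching / total


# Queries whose hit counts would be compared for 2018.
request_query = year_query("request", 2018)  # statements mentioning "request"
all_query = year_query("*", 2018)            # all data availability statements
```

Running each query and passing the two hit counts to `share` would give the proportion for that year; repeating this for several years would yield a trend over time.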

Getting to the data

Integrating research data and literature is an important part of the Europe PMC mission to support data discovery and reuse. We identify data DOIs and accession numbers for over 40 life science resources in abstracts and full text articles using a text-mining approach. Over 450,000 publications in Europe PMC cite 1,000,000 unique datasets.


By using the advanced search tools you can identify papers that have generated protein structures, or find articles that cite proteomics datasets. The data availability section search enables you to go a step further and map out the way to supporting data. The new data search filter is one of the latest additions to the Europe PMC tools suite for locating biological data cited in the literature, including the SciLite application powered by the Annotations API, and the Data tab powered by the Data module of the Europe PMC API.

Ready for a test drive?

We hope that this new search feature for data availability can help improve reproducibility of research results, by making it easier for scientists and data enthusiasts to track underlying data from thousands of papers, wherever it may be hosted.

It can also enable detailed analyses of research data sharing practices. We can get deeper insights into the effects of publishers’ and funders’ policies on data sharing, researchers’ preferences for discipline-specific vs generic data repositories, or variations in data citations.

Whatever the use case, give the new tool a try, play with the data, and share your thoughts. We are always keen to hear your ideas and suggestions.
