Monday, 18 April 2016

Europe PMC, Wikipedia and Wikidata - opportunities for deeper integration

Since our blog post last summer on the inclusion of Wikipedia as an external links provider, we have been lucky to host an intern, Tom Arrow, who has spent the last few months investigating possible further connections between Wikimedia projects and Europe PMC. This post highlights some of the ways Tom has been exploring these connections.


When have PMC/Europe PMC articles been added to Wikipedia?

Using an updated version of the dataset that was used to create the external links (as mentioned in our blog post in June), produced by A. Halfaker and D. Taraborelli (doi:10.6084/m9.figshare.1299540), Tom plotted the number of citations in English Wikipedia to articles in the PMC dataset against time.
The plot is available at: https://plot.ly/~tarrow/32/first-appearance-of-pmcids-on-wikipedia. It shows the continued increase in PMC citations on Wikipedia. The steep rises mark times when automated processes added PMCIDs to citations that previously only had DOIs.


Uploading article metadata to Wikidata

Wikidata is a Wikimedia project to store structured data for use both in projects like Wikipedia and by the world at large (doi:10.1145/2629489). Tom has been investigating ways to share metadata from Europe PMC with this project to increase public exposure to, and interaction with, the metadata. Initially we have focussed on metadata for Europe PMC Open Access articles that are cited in Wikipedia. From a total of around 70K articles, metadata was created for 15K.

He has been creating items about journals, journal articles and authors on a server running Wikibase, the same software as Wikidata. All of these items are being created using data consumed from the Europe PMC RESTful web services. While this work is still in progress, the results can be seen at Librarybase.

This was done with a set of Python scripts, all of which are available on GitHub. First, Tom created a Python client for the Europe PMC RESTful web services API, available at https://github.com/tarrow/epmclib. It supports both core and lite API queries for functions such as getting the title of an article, checking whether a PMID or PMCID resolves, and getting a dictionary of basic metadata about an article. This client could easily be reused by other consumers of the API.
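To give a flavour of what such a client does, here is a minimal sketch of a lookup against the Europe PMC search endpoint using only the Python standard library. It is not the epmclib code itself: the function names (build_search_url, basic_metadata, fetch_metadata) are hypothetical, and the response field names (resultList, result, title, pmid, pmcid, authorString, pubYear) are our assumption about the JSON returned for a lite query, so check them against the current API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Search endpoint of the Europe PMC RESTful web services.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(pmcid, result_type="lite"):
    """Build a search URL for one PMCID; result_type is 'lite' or 'core'."""
    params = {
        "query": f"PMCID:{pmcid}",
        "format": "json",
        "resultType": result_type,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def basic_metadata(response):
    """Pick basic fields from a parsed response, or None if nothing matched.

    The field names used here are assumptions about the response layout.
    """
    results = response.get("resultList", {}).get("result", [])
    if not results:
        return None  # the identifier did not resolve
    first = results[0]
    return {
        "title": first.get("title"),
        "pmid": first.get("pmid"),
        "pmcid": first.get("pmcid"),
        "authors": first.get("authorString"),
        "year": first.get("pubYear"),
    }


def fetch_metadata(pmcid, result_type="lite"):
    """Query the live API (needs network access) and return basic metadata."""
    with urlopen(build_search_url(pmcid, result_type), timeout=30) as reply:
        return basic_metadata(json.load(reply))
```

Calling fetch_metadata with a PMCID should return a small dictionary when the identifier resolves and None when it does not, which covers both the "does this ID resolve" and "get basic metadata" cases in one place.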

A second package of scripts is available at https://github.com/tarrow/librarybase-pwb, which uses and extends the popular pywikibot suite for interfacing with MediaWiki and Wikibase sites. These scripts form a foundation for making and curating Wikibase items relating to bibliographic metadata from Europe PMC.

Finally, Tom wrote two utilities for discovering which citations appear in which Wikipedia articles; both rely on the mwcites utility written by Aaron Halfaker. One processes the output of mwcites in bulk, for importing thousands of articles at a time into Librarybase (https://github.com/tarrow/queryCitefile). The other produces a realtime stream of citations as they are added to or removed from Wikipedia (https://github.com/tarrow/citationslivefeed), which can be used to keep Librarybase up to date.
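As an illustration of the bulk step, the sketch below groups PMC identifiers by Wikipedia page from mwcites-style output. It is not the queryCitefile code: the function name pmc_citations_by_page is hypothetical, and we assume the input is tab-separated with a header row containing page_title, type and id columns, with PMC citations marked by the type value "pmc". Adjust these if your mwcites output differs.

```python
import csv
import io
from collections import defaultdict


def pmc_citations_by_page(tsv_text, id_type="pmc"):
    """Map each page title to the set of identifiers of one type cited on it.

    Assumes tab-separated text with a header row that includes the
    columns 'page_title', 'type' and 'id' (an assumption about mwcites).
    """
    # QUOTE_NONE keeps quote characters in page titles as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pages = defaultdict(set)
    for row in reader:
        if row["type"] == id_type:
            pages[row["page_title"]].add(row["id"])
    return dict(pages)
```

Using a set per page means an identifier that appears in several revisions of the same article is only counted once, which is what a bulk import of distinct cited articles would need.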

This work demonstrates how the Europe PMC API can be used to share Europe PMC more widely, lowering the entry barrier to its use with a basic Python client, and provides a next step to link Europe PMC and the Wikimedia communities. It enables straightforward analysis of academic citations in Wikipedia, and may help people find more useful papers.