Research findings: going deeper than the article

We have recently released two new Article-Level Metrics (ALM) data sources: Europe PMC Database Citations and DataCite. The data from both sources are displayed on the metrics tab of PLOS articles, are available through the PLOS ALM API, and are available to all users of the open source ALM application. These new sources are both related to research data, and they represent a new breed of metrics in our ALM suite.

Europe PMC Database Citations and DataCite are our first ALM sources that track research data mentioning a particular PLOS article. They also differ from other ALM sources: although additional research data can of course cite an article after publication, these sources typically link datasets that are associated with an article and were created by the same research group. They discover links from the research data to the article, and in an ideal world these should be consistent with the links from the article to the research data. Unfortunately, we know from the work by Jo McEntyre and others at Europe PMC (see the May 2013 PLOS ONE article by Kafkas et al.) that the overlap between article-to-database citations and database-to-article citations is surprisingly small.
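
To make the idea of overlap concrete, here is a minimal sketch of how the two link directions could be compared. It is not the method used by Kafkas et al.; the function name, the identifiers, and the choice of a Jaccard-style ratio are all assumptions made for illustration.

```python
def link_overlap(article_to_data, data_to_article):
    """Compare links found in articles with links found in database records.

    article_to_data: iterable of (article_id, dataset_id) pairs
    data_to_article: iterable of (dataset_id, article_id) pairs
    """
    forward = set(article_to_data)
    # Flip the database-side pairs so both sets use (article, dataset) order.
    reverse = {(article, dataset) for dataset, article in data_to_article}
    both = forward & reverse
    union = forward | reverse
    return {
        "both": both,
        "only_article_to_data": forward - reverse,
        "only_data_to_article": reverse - forward,
        # Share of all known links that appear in both directions.
        "overlap_ratio": len(both) / len(union) if union else 0.0,
    }


# Hypothetical identifiers, for illustration only.
cited_in_articles = [("A1", "D1"), ("A1", "D2"), ("A2", "D3")]
cited_in_databases = [("D1", "A1"), ("D4", "A2")]
result = link_overlap(cited_in_articles, cited_in_databases)
```

In this made-up example only one of the four known links is recorded in both directions, which is the kind of gap the database-to-article ALM sources help to surface.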

PLOS and other publishers should of course do a better job of helping authors to properly cite research data in their submitted manuscripts, and the recently published Draft Declaration of Data Citation Principles is an excellent starting point. But providing the database-to-article links as ALM can also increase the visibility of datasets associated with an article. We know of course that we are not the first publisher to do this: earth and environmental sciences research data deposited in the Pangaea data archive and cited in Elsevier articles have been highlighted since 2009 using a similar mechanism, taking advantage of the database-to-article link provided by the DataCite DOI service.

Europe PMC Database Citations

Europe PubMed Central is a service of the Europe PMC Funders’ Group working in partnership with the European Bioinformatics Institute, University of Manchester and the British Library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine (NCBI/NLM). Europe PMC not only hosts many important databases in the life sciences in collaboration with PubMed and others – e.g., UniProt, Protein Data Bank (PDB), and the European Nucleotide Archive (ENA) – but also extracts the citations to journal articles in these datasets and makes them available via a public API. We harvest these citations in aggregate form and incorporate them into our newly expanded suite of ALMs. As with all ALMs (to the extent allowable by source), we link out to Europe PMC so that users can access the data from the source.

The methylthiotransferase protein entry in the UniProtKB/TrEMBL database cites a PLOS NTD publication.

Europe PMC DB citation example

Public databases vary in how much metadata they make available and in how well they link data to published results. EMBL’s European Nucleotide Archive shows a good way forward, as you can see in this example taken from the Mycobacterium tuberculosis genome.

ENA Example

DataCite

DataCite is a DOI registration agency that aims to establish easier access to research data on the Internet and to increase acceptance of research data as legitimate, citable contributions to the scholarly record. Although DataCite focuses on research data, it provides DOIs for all forms of content, e.g. PeerJ Preprints or all non-publisher content on Figshare. Most people are familiar with CrossRef DOIs, but CrossRef is not the only DOI registration agency (see Geoff Bilder’s September blog post for more background info). CrossRef and DataCite do of course collaborate, e.g. in DOI content negotiation.
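
DOI content negotiation means that the same DOI URL can return different representations depending on the HTTP Accept header. The sketch below only builds such a request and does not send it. The resolver URL (https://doi.org/), the helper name, the example DOI, and the assumption that the registration agency behind the DOI supports the application/x-bibtex media type are ours; check the agency documentation for the formats it actually supports.

```python
import urllib.parse
import urllib.request


def build_doi_request(doi, media_type="application/x-bibtex"):
    """Build (but do not send) a content-negotiation request for a DOI.

    The resolver base URL and the default media type are assumptions.
    """
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": media_type})


# Made-up DOI, for illustration only.
request = build_doi_request("10.5555/example-dataset")

# Sending the request requires network access, for example:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Because the DOI resolver redirects to whichever agency registered the DOI, the same client code can in principle be used for both CrossRef and DataCite DOIs.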

We have provided CrossRef ALM for a very long time, and have now added DataCite as an ALM source. The DataCite metadata schema is different from CrossRef’s: we search the relatedIdentifier metadata for PLOS DOIs, and as of today we find 240 DataCite DOIs related to PLOS papers. As with the Europe PMC Database Citations, most of these datasets are associated with a paper and were submitted by the same research group. Most DataCite DOIs linking to PLOS papers are from datasets deposited in the Dryad data repository (disclaimer: Martin is a member of the Dryad Board), followed by Pangaea.
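
As a rough illustration of what searching the relatedIdentifier metadata involves, the sketch below scans a single DataCite XML record for related DOIs that start with the PLOS prefix 10.1371. This is not the code used in the ALM application; the sample record, its DOIs, and the function names are made up, and a real harvester would query the DataCite search service rather than parse records one by one.

```python
import xml.etree.ElementTree as ET

PLOS_DOI_PREFIX = "10.1371/"

# Made-up record in the general shape of DataCite metadata (illustration only).
SAMPLE_RECORD = """
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.5555/example-dataset</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1371/journal.pone.0000000</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsDocumentedBy">http://example.org/readme</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def local_name(tag):
    """Strip the '{namespace}' part so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def related_plos_dois(xml_text):
    """Return (related DOI, relationType) pairs that point at PLOS articles."""
    root = ET.fromstring(xml_text.strip())
    matches = []
    for element in root.iter():
        if local_name(element.tag) != "relatedIdentifier":
            continue
        if element.get("relatedIdentifierType") != "DOI":
            continue
        doi = (element.text or "").strip()
        # DOIs are case-insensitive, so compare in lower case.
        if doi.lower().startswith(PLOS_DOI_PREFIX):
            matches.append((doi, element.get("relationType")))
    return matches


plos_links = related_plos_dois(SAMPLE_RECORD)
```

The relationType value is kept alongside each DOI because it describes how the dataset relates to the paper, which matters when deciding how to display the link.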

The 2011 paper by Jonathan Eisen et al., Stalking the Fourth Domain in Metagenomic Data, has underlying data deposited in Dryad, which is a member repository and is thus linked in DataCite. Similarly, the “Biogeochemical measurements associated to deep-sea wood falls” dataset is a supplement to the 2013 paper by Antje Boetius et al. and is cited as supporting data that informed the published research conclusions. Both PLOS papers also link to the dataset(s) using the DataCite DOI in the Materials and Methods section, as recommended by the PLOS Editorial Policies.

Linking Research Data and Journal Articles (and People)

We made a conscious decision to display both of these data-related ALMs in the Citations category. The Draft Declaration of Data Citation Principles states that “Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.” We think that the inverse is also true, i.e. citations of publications by datasets should have the same importance in the scholarly record and should be grouped together with citations by other publications. Data citation in both directions can enable easy reuse and verification of data, allow the impact of data to be tracked, and create a scholarly structure that recognizes and rewards data producers.

As a publisher, we call for a dedicated space on the article that seamlessly links to the public repositories where the data are available. At PLOS, there is much work to do, as this example from Jonathan Eisen et al. clearly demonstrates. Eisen has diligently catalogued all the datasets from this paper on his own blog, but few researchers have adopted this practice. More importantly, we need to display and preserve this information as part of the published record.

The careful observer will have noted that the two example articles by Jonathan Eisen and Antje Boetius mentioned above also have citations by other publications (here and here) that are tracked in Europe PMC Citations, another new ALM source, and that Europe PMC links to the authors’ ORCID identifiers (here and here). To come full circle, they can use the DataCite/ORCID integration tool to also import their Dryad or Pangaea datasets into their ORCID profiles, and can then use ORCID content negotiation (disclaimer: Martin was involved in the development of both ORCID tools) to automatically generate an RSS feed or BibTeX reference manager file of all their publications and datasets.

In this light, we anticipate that the upcoming Data Literature Integration Workshop at EMBL-EBI, headed by Jo McEntyre, Thomas Lemberger, and Ewan Birney on December 10-11, 2013, will address many of the pressing issues that apply across our community and lay down better practices for handling data-publication interactions.

This entry was posted in Tech.