Dealing with data

July 13, 2012 Theo Bloom Data Editorial policy PLOS Biology Publishing Research

PLoS Biology aspires to drive openness in data use and re-use, not just open access to the literature. We are pleased to announce a partnership with Dryad that helps our authors make their data re-useable and accessible by providing a home for data linked to publications.

Image: Wikimedia data. Source: Wikimedia, attributed to Victorgrigas.

It is not news to say that the current handling of research data within the life sciences is rather chaotic. Most funding agencies and research institutions want to ensure that the data generated by the researchers they fund is made easily available both to the research community, for replication and re-use, and to all other interested parties, so as to facilitate the advancement of science and technology. These organisations look to journals that publish original research to help make data available. We might imagine that, in an idealised future world, scientists will collect all types of data in an organised and structured way that links methodology, date and time of sample together with all relevant details alongside the data (such that all electronic data automatically has complete and well-structured metadata). We might also hope that researchers will want to share their research outputs with all interested parties, subject only to adequate attribution and certain limitations on specific types of data (for example, data about human patients whose privacy should be protected, and data on endangered species whose locations need similar safeguards). But we are not very close to this utopia right now. (As an aside to those purists who believe data ‘are’ plural, I refer you to a post on The Guardian’s blog that can be summarised in this quote: “It’s like agenda, a Latin plural that is now almost universally used as a singular. Technically the singular is datum/agendum, but we feel it sounds increasingly hyper-correct, old-fashioned and pompous to say ‘the data are’.”)

Even when data is shared – which is not always – where it should be housed remains a matter for discussion. In many cases, the most useful place for publicly available data is in well-structured, well-curated and long-term-sustainable databases. Such databases are often specific to a field in order to meet that field’s particular requirements for database structure and curation. Well-known examples include GenBank for gene sequences, the Protein Data Bank for protein structure and ArrayExpress for microarray data. But some kinds of data are insufficiently common to have driven the development of their own specific databases, or a particular field may not yet have agreed on the standards and formats required for database development and adoption. Some institutions and funders are therefore stepping in to provide suitable long-term and open access homes for such ‘orphan’ data (for example via OpenAIREplus in the European Union).

The publishers of research output – journals – are uniquely well-placed to help researchers ensure that all data underlying a study are made available alongside any published articles. Journals can also help to meet the needs that researchers and funders have in making data available in useful formats, while accruing appropriate credit to the people who generate the data. But many people have noted that ‘supplementary files’ provided with an article are not the best way to share data either. Current publisher practices (including PLoS’s own) tend to exacerbate problems around data re-use by requiring standardised figure formats, providing ‘flat’ file types such as PDFs, and discouraging the provision of raw data while simultaneously encouraging the ‘dumping’ of various combinations of data and text into ill-structured ‘supplementary files’.

An ideal that most funders of research and some authors aspire to is to allow both re-use and total replicability: anyone who wants to can both use the data for any future purpose and (re)analyse the data to get exactly the same results as the authors. A key barrier to achieving this ideal is that a significant number of ‘data generators’ do not want to share, for one reason or another. For those who believe that – having spent a long time collecting data – they own it and don’t want to lose their competitive edge by sharing it, we publishers can work with the funders of research to enforce sharing (just as we do for deposition of sequencing data in GenBank, protein structure data in PDB, and so on). But I believe the majority of researchers would share data if they felt it accrued the same kinds of benefits to them as does sharing their ideas through publication. Credit in the form of publications is used to assess researchers for funding, promotion and tenure, and researchers need similar incentives and credits to encourage them to share data. Some editors and journals are trying to fit data into the format of a published article, by launching data journals or contemplating data article types so as to use the established systems for citing research articles as a way to accrue credit for data sharing. But it seems we need to come up with better ways to give appropriate credit to those who generate and share data – methods that allow small and large data packets to be cited, and that allow citation of specific ‘bits’ of a dataset.

A step towards resolving the data problem: PLoS Biology partners with Dryad

In efforts to address some of the issues highlighted here with the sharing of data, PLoS has begun working with several partners on a number of fronts to improve systems for citing and crediting data sharing. One partnership we are pleased to announce formally today is with Dryad (www.datadryad.org), an open access repository of data underlying peer-reviewed articles that is being developed by the National Evolutionary Synthesis Center and the University of North Carolina Metadata Research Center, in coordination with a large group of Journals and Societies. From PLoS Biology’s perspective, Dryad provides a good answer for authors who don’t know where or how to store their data: it takes data ‘packages’ associated with published articles and makes them freely available. Dryad provides a unique identifier (DOI) for each data package, and allows authors to upload subsequent versions of their data (clearly indicated), as well as providing download statistics for each data package. By having a close partnership with Dryad, PLoS Biology can offer authors a seamless tying together of an article with its underlying data; we can also provide confidential access for editors and reviewers to data associated with articles under review (see Depositing data to Dryad guidelines). PLoS Biology is the first of the PLoS journals to have these close links to Dryad in place, but we plan to roll out this partnership to the other PLoS journals in the near future, and indeed all authors can submit data directly to Dryad: at the time of writing there are 36 articles in the PLoS corpus that have associated data in Dryad.

Please ‘watch this space’ as we announce further plans and partnerships in the data arena, and please do let us know what you think are the most pressing issues, by emailing us or starting a discussion here.

COI declaration: Theo Bloom is a member of the advisory boards of OpenAIREplus and Dryad.

Discussion

Science Policy Around the Web – July 13, 2012 « Science Policy For All says:

July 13, 2012 at 12:00 am

[…] to Access – Two new initiatives–one from the new journal GigaScience and one from PLoS and data repository, Dryad–aim to make today’s complex scientific data easier to publish, distribute, and re-use. […]

PLoS and Dryad partner to facilitate data sharing « Dryad news and views says:

July 13, 2012 at 12:00 am

[…] is delighted to join with PLoS today to announce our partnership with PLoS Biology, as described here on the official PLoS Biology blog, Biologue. As the first Public Library of Science (PLoS) […]

PLoS and Dryad partner to facilitate data sharing | Scholarly Communication at Mason says:

September 10, 2012 at 12:00 am

[…] re-useable and accessible by providing a home for data linked to publications, as described on the PLoS Biology blog. This entry was posted in Uncategorized. Bookmark the permalink. ← Concerns with OMICS […]

A Data Management Bibliography for Librarians « Kevin the Librarian says:

January 6, 2013 at 12:00 am

[…] 5. Bloom T. Dealing with data. PLOS Biologue [Internet]. 2012 [cited 2012 Nov 9]; Available from: https://blogs.plos.org/biologue/2012/07/13/dealing-with-data/ […]

PLOS Genetics partners with Dryad | PLOS Biologue says:

September 18, 2013 at 12:00 am

[…] Following PLOS Biology’s integration with Dryad last year (detailed by Theo Bloom in her piece on dealing with data), PLOS Genetics authors can now take advantage of a fully integrated Dryad submission process, […]

PLOS Data Policy: Update - PLOS Biologue says:

May 30, 2014 at 12:00 am

[…] somewhere being not on their own hard drives). We know that much more will be needed before we are dealing with data satisfactorily. For a small minority of commenters, we did not go far enough. For many […]

Make data sharing easy: PLOS launches its Data Repository Integration Partner Program | PLOS Tech says:

November 5, 2014 at 12:00 am

[…] research data discoverable, freely reusable, and citable. We have connected two PLOS journals, PLOS Biology in 2012 and PLOS Genetics in 2013, tying together the data deposition process and our submission […]

2 Docs Talk says:

October 3, 2016 at 12:00 am

[…] PLOS on Data […]

Episode 47: Which Statistics Should You Trust? - 2 Docs Talk says:

October 10, 2016 at 12:00 am

[…] PLOS on Data […]
