PLOS Data Policy: Update

Two months after the implementation of the PLOS journals’ data policy, what have we learned from our authors, reviewers, editors, correspondents, and commenters in the blogosphere?

More than 16,000 manuscripts have been submitted with a data availability statement

In order to optimise the re-use of data by readers and by data miners, authors of all new manuscripts submitted since March 3, 2014 have included a statement about where the data underlying their description of research can be found. At the time of writing, more than 16,000 sets of authors have included information about data availability with their submission. We have had fewer than 10 enquiries per week to data@plos.org from authors who need advice about ‘edge cases’ of data handling and availability (fewer than 1% of authors), and these cases have helped us to further update our FAQ, contributing to a decline in such enquiries over time. We would like to say a huge thank you to all the authors and editors who have worked with us in this period to iron out wrinkles in our submission processes and helped us make it as easy as possible to capture information about data availability. Special mention is due to the pioneering subset of authors who have already published articles in PLOS journals with a Data Availability Statement, whether they provide all data within the manuscript and its supplementary files or provide links to a domain-specific repository, a non-specific repository or more diverse resources (see Image).

 

[Image: examples of Data Availability Statements from published PLOS articles]

 

Some groups of authors still have concerns about data sharing

From our period of increasing public consultation about the PLOS data policy, we knew that at launch we would encounter authors with specific issues in two main areas: firstly, big datasets that are too large for hosting in most repositories (although some, such as the journal GigaScience, target precisely this domain), and secondly, patient confidentiality and the associated need to have oversight committees for instances in which access should be restricted to appropriate individuals. Since the launch we have heard these issues raised again, but we have also heard three main arguments, used by bloggers and others to justify not sharing the data underlying research articles, that we find hard to agree with. They can be summarised as follows.

It’s mine, I collected it. Most funders and institutions have moved away from this idea, but it persists in the minds of many researchers.

It’s complicated and unique: no-one else could understand it properly. This “data as a unique snowflake” argument supposes that no-one other than those who collected the data can understand it well enough to re-use it. Taken to an extreme, this argument would tend to suggest that there is no point in peer-reviewing or publishing research at all. We would rather work with those (e.g. BioSharing) who are working to develop standards and approaches to describing data so that it can indeed be used by others.

I’d like to share, but my lab is little and/or under-funded and/or in a lower-income country, and once I share the big guys can jump on the data and do cool things with it before I can do them myself. We of course have sympathy with this perspective, and PLOS journals work specifically with authors in less-developed countries to help them publish their work. However, as noted by panellist Joe DeRisi in a discussion of data sharing at UCSF earlier this month, it would be perverse to suggest that we delay, for example, progress in malaria research in order to allow researchers in the most-affected countries to contribute optimally.

There is one additional argument that has been made, and we acknowledge that this one reflects a genuine concern, namely that it takes work to make data sharing-ready. Previously we required all PLOS authors to share data “on request”, but some bloggers have noted that no-one ever requested much of their data, or that when it was requested they in fact refused, whereas now all data should be made ready for sharing, whether ultimately needed or not. We agree that this does require work; however, it takes less work to prepare the data at the same time as the publication than was previously required when trying to dig into archives to find material some time later (humorously summarised in a video cartoon). We would also note that funders increasingly require, as a condition of a grant, that a data-management plan be included, and that this is driving researchers and their institutions to have good systems in place that will meet the criteria of the PLOS policy.

We focused on where, when and how to share, but many are still concerned about what to share

The new PLOS data policy refers to sharing the data underlying a publication, just as our previous policy did. The new part of the policy is to ask for sharing ‘up front’, at the time a manuscript is submitted, rather than subsequently and on request. But an awful lot of the responses to the policy have focussed on the issue of which datasets need to be included. It was not quite as apparent from our prior consultation as it is now that researchers in many fields don’t know which data to archive and share and which should be considered ‘disposable’ moments en route to data worth preserving. Although funders such as NIH and community organisations such as MIBBI try to outline the requirements either generally or specifically, it seems it will be a Sisyphean task to provide detailed guidance for every type of experiment in every domain within science. We are therefore currently considering the extent to which we at PLOS can or should aim to provide this type of guidance, and would welcome your input on this issue.

There can be real difficulties about ‘limited sharing’, whether during peer review or after publication. Most repositories, whether subject-specific or general, institutional or international, are set up to allow full open access. But there are two main circumstances in which more limited sharing is appropriate. The first is during peer review, when editors and reviewers need access to the data but the authors may not want it to be public; we are aware of only a few databases (e.g. Dryad) that routinely provide this facility. The second circumstance is when datasets contain sensitive information (whether about patients or, for example, endangered species’ locations), such that it may be appropriate to share only a subset of the information more widely, and/or to share only with appropriately screened individuals. Several databases and repositories have plans to allow more limited access (e.g. Dataverse, figshare), which should help address concerns in this area, but this is a work in progress. For now, it remains a major challenge for clinical studies to both provide controlled access to the data and preserve patient confidentiality.

There is plenty still to do

The 2014 PLOS data policy deliberately set out to take just one step towards improved integration between the published literature and the data underlying it, by asking authors to say where their data can be found (that somewhere being not on their own hard drives). We know that much more will be needed before we are dealing with data satisfactorily. For a small minority of commenters, we did not go far enough. For many more, we leave too many open questions, and we agree there are many, most of which are not unique to PLOS and need further community discussion. Among the most pressing, from our perspective:

  • When should an author choose Supplementary Files vs. a repository vs. figures and tables?
  • Should software/code be treated any differently from ‘data’? How should materials-sharing differ?
  • What does peer review of data mean, and should reviewers and editors be paying more attention to data than they did previously, now that they can do so?
  • And getting at the reason why we encourage data sharing: how much data, metadata, and explanation is necessary for replication?
  • A crucial issue that is much wider than PLOS is how to cite data and give academic credit for data reuse, to encourage researchers to make data sharing part of their everyday routine.
  • And for long-term preservation, we must ask: who funds the costs of data sharing? What file formats should be acceptable, and what will happen in the future with data in obsolete file formats? Is there likely to be universal agreement on how long researchers should store data, given the different current requirements of institutions and funders?

As we continue to work on these issues and others, we would once again welcome your feedback and input, here on the blog, via individual journals, or at data@plos.org.

Theo Bloom and Jennifer Lin, for the PLOS Data Policy group.


8 Responses to PLOS Data Policy: Update

  1. Eric Ross says:

    Major props to PLOS for being willing to stick its neck out, listening to the community, and being frank and open about the challenges and gray areas in this process. And thanks for an interesting article with a lot of excellent links – I look forward to reading more of these articles* as the data policy matures!

    *Preferably with some data on the policy’s effect on submission rates, frequency of various rationales for not including raw data, etc! : )

  2.

    GREAT summary. I’m so happy (vindicated?) to see that the PLOS policy hasn’t caused the total collapse of science as we know it. Glad PLOS is part of the move towards open science!

  3. Pingback: PLOS’ Data Policy Update | Data Forwards

  4. Eric Ross says:

    I’ve been taking a look through some recently-published medical articles on the PLOS site to see how the new policy is playing out, and I’m a little disappointed by the way many articles have approached the new requirements. Here are some examples:

    One article on colorectal cancer includes all of the patient-level data that would be needed to replicate the statistical analyses. This is what I imagined the new policy would look like, and it’s great!

    An article on HIV treatment outcomes did not include the patient-level data, but the data statement explained that this was an IRB decision to ensure confidentiality. This seems very reasonable, and the data are available on request from either of two institutions. So far so good.

    Unfortunately, several other articles claim that “All data are included within the manuscript (and Supporting Information files)” but do not include any patient-level data. As far as I can tell, these articles contain no more data than they would have prior to the new data policy.

    What’s going on here? Is this expected to be the status quo going forward?

  5. Jennifer Lin says:

    We’ve been very pleased at the community’s willingness to comply with the PLOS data policy and provide a description of their compliance. Just a few months into this new policy, we are still in a learning phase, working with our editorial contributors and authors to make sure that the data policy is effectively implemented and integrating these requirements into existing processes.

    We hope that as this becomes more commonplace, authors, reviewers, and editors will become more accustomed to knowing what data they’re expected to provide. We recognize there are different expectations in different areas of research. As researchers across different fields become more aware of the benefits of sharing data, we expect that the standards for what data is required for reproducibility will become higher and more uniform, in a manner similar to the MIAME guidelines for microarray studies. As always, PLOS journal editors encourage researchers to contact them if they encounter difficulties in obtaining data from articles published in PLOS journals. We will work with the editors and the authors to provide them, issuing corrections if necessary.

  6. Leo Martins says:

    Regarding the treatment of software/code:

    If you want to achieve “replication, reanalysis, new analysis, interpretation, or inclusion into meta-analyses, and facilitates reproducibility of research, all providing a better ‘bang for the buck’ out of scientific research”, then obviously the source code must be openly accessible. Actually I can’t believe people want to treat the source code differently from other file formats. If anything, it must be more stringently reviewed since the data re-analysis may need some special hardware, while the source code doesn’t.

    For instance, how can we trust the article http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0093984, or its whole review process actually, when it only describes an algorithm claimed to be implemented, but that is not available at all? How can this article be useful for anybody?

  7. Pingback: What we’re reading: Population genetics of an invasive vine, demography and GWAS, | The Molecular Ecologist

  8. Pingback: May highlights in scientific publishing | sharmanedit
