Designer debacles and other misdemeanors

In the last issue of Nature, a news feature and a research highlight look at two recent high-profile paper retractions. The two papers, by biochemist Homme Hellinga, dealt with rational enzyme design. A second group could not reproduce the results, ultimately leading to the retractions. A third group then demonstrated that rational enzyme design is indeed possible.

The research highlight looks at the troubles of the second research group, led by John Richard, which spent a great deal of time and money trying to reproduce Hellinga's findings and in the end had nothing to show for it.

Non-reproducible work is a common problem in research, and papers containing such questionable work are rarely retracted. I would guess that most of the time this is unintentional. John P. A. Ioannidis explains this in a PLoS Medicine essay: Why Most Published Research Findings Are False.

Sometimes the reasons behind non-reproducible results can be quantified, and this includes drug trials in clinical medicine. The study Statistical Power of Negative Randomized Controlled Trials Presented at American Society of Clinical Oncology Annual Meetings found that more than half of the randomized controlled trials that showed no benefit for a new treatment did not enroll enough patients to detect even a medium-sized treatment effect.
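The idea of statistical power can be made concrete with a short calculation. This is my own illustrative sketch, not the method of the cited study: it approximates the power of a two-sided two-sample test with the normal distribution, for a standardized effect size d (Cohen's d) and a given number of patients per arm.

```python
import math

def normal_cdf(x):
    # Standard normal cumulative distribution, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_power(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided two-sample test at alpha = 0.05,
    for standardized effect size d and n patients per arm (normal
    approximation; a rough sketch, not a substitute for a proper
    power analysis)."""
    noncentrality = d * math.sqrt(n_per_group / 2.0)
    return normal_cdf(noncentrality - z_crit)

# A 'medium' effect is conventionally d = 0.5. With only 30 patients
# per arm, power is around 50% -- far below the usual 80% target.
small_trial = two_sample_power(0.5, 30)
large_trial = two_sample_power(0.5, 200)
```

A trial that is this small can report "no benefit" even when a real, clinically meaningful effect exists, which is exactly the problem the study describes.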

What should we do about this? The first step is to accept that a significant number of the research findings you read in papers are not reproducible. We should be careful about starting a PhD thesis or other research project based on just a few exciting papers, especially when the work was done by someone else. Thinking about it, I should have taken that advice myself before starting a particular project five years ago.

This entry was posted in Snippets.

23 Responses to Designer debacles and other misdemeanors

  1. Cameron Neylon says:

    Ooh, was wondering when someone would bring this up. I’ve also posted on this “over here”:

  2. Martin Fenner says:

    Cameron, it is interesting that your post focusses on a different aspect, i.e. all raw data should be made available. But I still think that non-reproducible research is very common.

  3. Cameron Neylon says:

    Absolutely, but providing the raw data makes it easier to figure out why you can’t reproduce something, or at least what the parameters are. I think this is a problem with the modern obsession with peer review, this notion that it’s about making sure the paper is ‘right’. What peer review can do is make a reasonable decision about whether the data provided support the claims made.
    I seem to have spent a lot of the last few years discovering things that didn’t work because of the wrong water, the wrong buffer, the wrong supplier, or the wrong brand of tube!

  4. Martin Fenner says:

    “Minimum Information About a Microarray Experiment”: (MIAME) would be one example of a standard for providing not only the raw data, but also annotations and the experimental design. Microarray data should be deposited in public databases such as “Gene Expression Omnibus”: Most journals now require MIAME-compliant data and deposition in these repositories for paper submissions. The 2002 Nature announcement of this policy can be found “here”:
    For many other areas of research there probably are no standard data formats, or the raw data are not available in digital form.

  5. Cameron Neylon says:

    I would go even further and say that for some types of data there _shouldn’t_ be any standard data format, because it doesn’t make sense to standardise. But this doesn’t mean that the data can’t be made available in a friendly, usable fashion. Even scanning lab book pages and putting them online would be a start in some cases.
    Interesting thing about MIAME and the array databases: it’s not actually the raw data. For a recent paper we wanted to get at some _really_ raw microarray data (i.e. the images and/or raw uncorrected intensities), and it’s actually not straightforward. Re-use often comes out of left field, so assumptions about what data should be deposited are often not useful. Another reason why I think it all should be made available (but we need to sort out cataloguing and minimal descriptions).

  6. Martin Fenner says:

    Cameron, we have a sort of parallel discussion, as you also wrote a “blog post”: about this topic and have started an interesting discussion. So I’m not sure where to respond.
    To me one recent trend appears to be the increasing amounts of supplementary information that is deposited together with a paper. If we encourage this trend, e.g. by thinking about standard data formats for this supplementary information, things would move in the direction you suggested. Open Notebook Science would obviously be another approach.

  7. Maxine Clarke says:

    Nature published a correspondence a year or so ago which audited SI — apparently many journals don’t even have it, so that the data are scattered about or “lost”.
    MIAME, which you discuss here, is a good example of where a community has got together and agreed on a standard, which makes it easy for journals. There are other areas where there aren’t standards (and/or curated public databases), which makes it harder for the journals, as it is quite hard to “impose” a standard on people independently of their community of peers having debated and agreed it. (It is hard even for writing style, gene nomenclature, etc.) So, standard data formats that are (1) agreed by the community via conference or other discussion; (2) annotated; (3) curated in a publicly accessible database are all very much welcomed by journals.

  8. Heather Etchevers says:

    _…standard data formats for this supplementary information_
    Yes, yes, yes!!! How many times have I seen locked PDFs of enormous tables in SI – it makes me grit my teeth… and such information is MUCH more likely to be available years after the fact through the journal than by a local website on which the authors store and try to disseminate the data. I’m quite keen on GEO, too, and similar public resources. As long as one can get from the citation in the paper to the data in a fairly permanent manner.
    A couple of years ago, after having struggled with a new-ish technique, I made a conscious decision not to apply any other such new techniques until at least a couple or few papers came out using them. One I had my eye on: nothing has happened with it in the four years since it appeared in _Cell_ – so I know I did the right thing. However, I see a new application for the approach with the so-called “next generation” sequencers – and so have plenty of other folks.
    What I meant is that perhaps open lab notebooks could have helped to avoid trying certain things that won’t work – but writing to the authors of earlier papers had helped me in that respect, too. Only when they don’t respond do you really lose time. Raw data wouldn’t help with this sort of problem – getting data _was_ the problem.
    Moral: if you’re working in a poorish lab, better to be among the front-runners than to try to be the leader of the pack. Use tried tools to attack new problems, instead of new tools to attack problems everyone is working on.

  9. Martin Fenner says:

    The Nautilus post about supplementary information by Maxine can be found “here”: and includes a link to the Nature correspondence “A paper should appear with all the information it needs”: by Larry Benson. The blog post started an interesting discussion of the pros and cons.

  10. Cameron Neylon says:

    Martin, you’re always welcome at “my place”: (as is anyone else) but seeing as this part of the conversation is here it seems to make sense to continue here.
    Maxine makes what I think is the critical point: how do you include data when there is no agreed format? The point I wanted to make was that even just scanned copies of the paper lab book pages or raw data could be valuable for the future. But as Jean-Claude points out in comments to “my post”: it is hard to make the case for making raw data available when people worry about having to reformat it for this journal or that journal.
    The important thing is to try and encourage people to collect data in such a way that it can be bundled up. The format actually wouldn’t matter very much if we could get a little bit of self describing information into the system. We need to build systems that will make this easy for people to do and this will involve re-thinking what we mean by formats and indeed data. This is a big project but one we are trying to get “started”:
    Open Notebook Science depends on tools that enable you to keep a lab book online, but one of the aims of developing these tools is to make it easier for other people to keep a good record of their research. Whether people choose to be open on the same timescale as we do is their business, but hopefully we encourage them to make more detail available once things are published.
    The point about what papers should include is an interesting one. I have a personal subscription to Nature (I only read it for the pictures, of course). But it is really irritating when you find a research paper that’s interesting while sitting on the train and can’t get to the critical piece of information you want because it’s only online. But if all of that were included in the print version, I wouldn’t have it on a train because it would be too heavy to carry. Bring on usable electronic readers!

  11. Martin Fenner says:

    Cameron, that is a very interesting point. Using a “Laboratory information management system”: (LIMS) is a first step towards standardized raw data. And there are many reasons to use Open Source software for this, whether or not you like to have your lab notebook open to the world.

  12. Cameron Neylon says:

    I would probably avoid the term ‘LIMS’ but that’s because of general baggage. But yes, if we had effective tools for capturing what happens in the laboratory (and that is not the same as recording what happens, and definitely not the same as planning) then it would go a long way to solving many of the issues.
    Open source is helpful, but what is critical is open standards. XML is powerful in this respect because it is open, extensible, and can be used to wrap pretty much anything up. Getting the wrapping right is seriously difficult, but I think it’s worth working on.
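The "wrapping" idea in this comment can be sketched in a few lines. The element and attribute names below (dataset, description, payload) are invented purely for illustration; no actual standard is being implemented. The point is that the payload stays in whatever format it was produced in, while the XML envelope carries just enough description for a future reader to know what the bits are.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_dataset(raw_bytes, description, fmt):
    """Wrap an opaque data file in a minimal self-describing XML
    envelope. Element and attribute names are hypothetical."""
    root = ET.Element("dataset", attrib={"format": fmt})
    ET.SubElement(root, "description").text = description
    payload = ET.SubElement(root, "payload", attrib={"encoding": "base64"})
    # Base64 lets arbitrary binary data travel safely inside XML text
    payload.text = base64.b64encode(raw_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")

envelope = wrap_dataset(b"\x00\x01raw instrument output",
                        "Uncorrected microarray intensities",
                        "vendor-binary")
```

Any consumer that understands the envelope can recover the original bytes and read the description, even if it cannot interpret the vendor format itself.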

  13. Henry Gee says:

    When I’m on the stump giving my Confessions Of A Nature Editor lecture I often throw the following lexical grenade at the audience:
    _Everything we publish in Nature is wrong – and I’m proud of it_.
    By this I mean that all results will be provisional (which, I admit, is not quite the same thing as non-reproducible) and will turn out to have been ‘wrong’, in that they will eventually come to be seen as a small part of a greater and as-yet-unimagined whole.
    Why am I proud of this? Because if science were not provisional, scientists would soon discover everything there is to know, and where would be the fun in that?
    A more general point is the importance of presenting all scientific work to the public as a set of provisionals, and not giving the misleading impression that scientists come up with complete solutions, even if this gives some fodder to creationists and other parasites.

  14. Maxine Clarke says:

    Cameron: your personal subscription carries full online access so as long as you have one of those wireless doo-dahs with you at all times, you are well away. (Me, forget it, as far as round-the-clock connectivity is concerned, but each to her own.) In fact you don’t even need a personal subscription because SI is free-access for _Nature_ journals.
    Cameron again: we’ve received a lot of advice, solicited and unsolicited, from readers and authors over the years about length in print vs online SI. Although the individual author of an individual paper would love to have the entire issue of the journal devoted solely to her/his great work, the vast majority of readers do not want to read the supporting data. They know they are reading peer-reviewed conclusions and seeing the key figures or tables in the print or online version, and that suits them, as they are receiving a focused account that is easier to absorb.
    For any given paper there are readers like you, fellow specialists who want to do things with the data. That’s fine too, but for the relatively small number of readers like you per paper we publish, and in view of the finite annual page budget, we think that SI online-only for background data is the optimal solution.
    As for formats, I think that for SI journals need to host the most commonly used formats, but it is always nice when a community has got together to agree standards and to create a database, because then the data can be hosted in a central place and all the journals can provide a link. That is more convenient for readers (users), I would imagine, than each journal hosting the information in whichever format the author decided to use and the journal could host.

  15. Maxine Clarke says:

    PS Cameron, in some ways what you are suggesting higher up, re scanning in lab notebooks and so on, sounds like a case for a preprint server — which again, if properly tagged and indexed for web searches, should be relatively independent of “which” preprint server? I understand the principle of e-lab notebooks, but not enough to know how securely archived they are compared with (say) a “proper” preprint server.

  16. Martin Fenner says:

    How do journals archive supplementary information? I believe that currently it is mostly just a bunch of files that are somehow linked to the article. Wouldn’t it be a step forward if all the supplementary information were stored in a searchable database? Right now it would probably be mostly .doc, .pdf, .jpg and .mov files, but those could be tagged with names for methods or materials. Over time, more specific data types could be added.
    Henry, thanks for picking up my original argument. The provisional nature of scientific results is often forgotten. And it is usually not because someone was cheating. Many scientific breakthroughs are so exciting because they destroy some of our previous knowledge. Bacteria causing cancer? Growing sheep from adult cells? Nonsense.
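The tagging suggestion in the comment above could be as simple as an inverted index from tags to supplementary files. This is a toy sketch; the file names and tags are invented for the example.

```python
from collections import defaultdict

class SupplementaryIndex:
    """Toy inverted index: map tags (method or material names)
    to the supplementary files that carry them."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, filename, tags):
        # Register one file under each of its tags
        for tag in tags:
            self._by_tag[tag].add(filename)

    def search(self, tag):
        # Return all files carrying a tag, in a stable order
        return sorted(self._by_tag.get(tag, set()))

index = SupplementaryIndex()
index.add("table_s1.pdf", ["methods", "western-blot"])
index.add("movie_s2.mov", ["imaging"])
```

Even this minimal structure turns "a bunch of files somehow linked to the article" into something a reader can query by method or material.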

  17. Maxine Clarke says:

    I think for a publisher there is an issue with what you propose, Martin, because this would take some investment, and it would be hard to justify to a publisher, who is looking at the bottom line, why the SI should then remain free.
    Nature has done a small, and popular, experiment, which is to integrate SI Methods into the online paper — HTML and PDF. The content of the additional methods is picked up via the full-text search engine along with all the other content.
    The rest of the SI is “attached” to the paper in the way you describe, and is not inter-operable (let alone between different publishers). It would be great if community initiatives such as PDB or MIAME or GEO could be extended to other fields. I imagine that on the whole, publishers would prefer to link to the data if safely archived in a reliable repository, rather than take the responsibility for maintaining an archive of it all on their own sites, which is what they do at the moment (Nature journals, anyway).

  18. Maxine Clarke says:

    Just saw this good blog post about “crystallography wikis”:, relevant to this discussion.

  19. Cameron Neylon says:

    Maxine, if I had one of the wireless thingummies I’d spend my time on trains (~2.5 hours a day) dealing with trivial emails. That’s actually why I have a print subscription. The problem is just that when I find an interesting research article, I always forget to go back and check the SI once I get home or to work. My train time is sacrosanct. Having net access would be giving in!
    I think you can worry too much about interoperability. There are some seriously clever people working on very sophisticated data wrapping systems that will (in their ideal form) wrap up a related set of data with enough information to explain what the bits are and how to read them. It’s not perfect yet, but it’s very powerful conceptually. Essentially this will enable us to stop getting stressed about data formats and instead sort out a description of how to read them and how they relate to each other. Data is meaningless; without context and metadata it has no value at all. And these wrapping approaches make a lot of that happen automatically (in principle).
    I’ve discussed “elsewhere”: a concept which might either enable publishers to offload the cost of running SI repositories or indeed turn the model on its head and pay publishers to deposit the data into the public domain. Still needs work (and money!) but it’s based on the premise that it isn’t the data that is worth money, but its accessibility. Fundamentally it’s science funders who should be paying for this and what I suggest is essentially a carrot to encourage scientists and publishers to put the extra work in to make it useful.

  20. Maxine Clarke says:

    Very interesting, Cameron, thanks.

  21. Martin Fenner says:

    This Nature Precedings paper is also relevant to this discussion: “A review of journal policies for sharing research data”: by Heather Piwowar and Wendy Chapman (see also their blog post “here”:). The authors looked at the data sharing policies of 70 journals, using microarray data as an example. One interesting finding: even in journals with a strong data sharing policy (such as the Nature policy “mentioned earlier”: in this discussion), only 29% of microarray papers had submission links to GEO. So even for microarray data, a lot still needs to be done. We need not just the technical infrastructure, but also the willingness of authors and journals to share the research data.

  22. Cameron Neylon says:

    And I believe if you actually try and get hold of data, even where the journal policies are strong, the success rates hover around 25% (too late to go find a citation for that) but a few people have done this kind of experiment I think.
    But yes the key thing is author willingness, hopefully author demand in the future, to make sure their data is out there and being used. Heather Piwowar has talked about a data re-use repository as a way of helping people get credit for having their data re-used which is an interesting concept.
    If re-use of data (i.e. its used often, it must be good stuff) was more highly valued this would help. But we, and funders, and journals, have an unhealthy obsession with novelty. And not enough value or resource is put into the proper recording and deposition of data.

  23. Maxine Clarke says:

    I’ve just seen this interesting Nature Network forum, called “Citation in Science”:, by Allan Sudlow (and friends). There is a talk at the British Library in London on 27 May (next week) and the forum is to continue the debate. Allan has posted a very interesting and pertinent list of topics. Please check it out. I would like to attend the talk so will see if I can.