The trouble with DOIs

ScienceCard is a new service that I started last month with the simple idea of automatically tracking all journal articles by a given author, and collecting the article-level metrics (citations, bookmarks, etc.) for these papers. ScienceCard requires unique identifiers for articles and authors to work. Unique identifiers for authors are a difficult topic regularly discussed in this blog. But I thought that using digital object identifiers (DOIs) for journal articles would be easy. The system managed by CrossRef was started 10 years ago, and almost all journal publishers now use DOIs; there were 49,350,542 registered CrossRef DOI links as of today.

Number buttons by fragmented on Flickr

The first problem I encountered is that many bibliographic databases don’t fully support DOIs. Most of them store DOIs, but not all of them allow queries using DOIs, and very few services allow linking to them using DOIs. In the end I had to store various other article identifiers in ScienceCard (currently PubMed ID, PubMed Central ID, Microsoft Academic Search ID, Mendeley UUID, Scopus ID). One side effect of this proliferation of identifiers is that (in very rare cases) DOIs are not unique in these bibliographic services. It also makes building tools based on DOIs more complicated than necessary. The members of CrossRef are publishers, and the other service providers (whether public or private) seem to be reluctant to fully support a service over which they have no direct influence.

The second problem with DOIs is that they are often not web-friendly. DOIs are really permanent URLs, and CrossRef has recently changed the display guidelines for DOIs to reflect this. Instead of doi: 10.1371/journal.pcbi.0010057 we are supposed to show DOIs as http://dx.doi.org/10.1371/journal.pcbi.0010057. The problem is that DOIs can contain characters such as “+”, “(”, “.” or “/” that need to be escaped or otherwise handled with care when used in URLs. Some ScienceCard examples include the following:

  1. http://dx.doi.org/10.1016/S0959-8049(05)80357-0
  2. http://dx.doi.org/10.1093/bioinformatics/12.4.357
  3. http://dx.doi.org/10.1021/bi980175+
  4. http://dx.doi.org/10.1642/0004-8038(2002)119[0088:SSCPEO]2.0.CO;2

These special characters can create problems when DOIs are used in software programs. ScienceCard for example wants to create links to articles in the format http://sciencecard.org/10.1642/0004-8038(2002)119[0088:SSCPEO]2.0.CO;2.xml, but this function is currently broken.
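A minimal sketch of one way around this, assuming the DOI is percent-encoded into a single URL path segment before the format suffix is added. The function names are hypothetical and this is not how ScienceCard is implemented; it only uses `quote` and `unquote` from the Python standard library.

```python
from urllib.parse import quote, unquote


def doi_to_path_segment(doi: str) -> str:
    """Percent-encode a DOI so it fits into one URL path segment.

    safe="" means that "/" is encoded as well; letters, digits and
    "_", ".", "-", "~" are never encoded by quote().
    """
    return quote(doi, safe="")


def path_segment_to_doi(segment: str) -> str:
    """Reverse the encoding to get the original DOI back."""
    return unquote(segment)


if __name__ == "__main__":
    for doi in (
        "10.1016/S0959-8049(05)80357-0",
        "10.1021/bi980175+",
        "10.1642/0004-8038(2002)119[0088:SSCPEO]2.0.CO;2",
    ):
        print(doi_to_path_segment(doi))
```

Dots are left untouched by this encoding, so a router that treats the last “.xml” as a format extension still has to split on the final dot only. Some web servers also decode or reject an encoded slash in the path, so this would need testing against the actual server setup.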

One possible solution is to use shortDOIs. Article (3) would for example become http://doi.org/dcp, whereas article (4) is rejected as an invalid DOI. I would love to use shortDOIs in ScienceCard and other places (e.g. Twitter), but haven’t found an API yet that automatically returns shortDOIs for DOIs.

Component DOIs directly link to a figure or table of a paper. This is an underused, but very useful feature, and is for example provided by the PLoS journals. Unfortunately component DOIs can confuse bibliographic databases and make it more difficult to track all the links to a given article. I had to write a little routine to detect component DOIs imported into ScienceCard.
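A rough sketch of what such a detection routine could look like. It is not the routine used in ScienceCard, and it rests on an assumption: that PLoS component DOIs are the article DOI followed by a suffix made of one letter and a number (for example “.g001” for a figure or “.t001” for a table). Other publishers may use different schemes that this pattern will not catch.

```python
import re
from typing import Optional

# Assumed pattern: PLoS article DOI plus a one-letter, numbered suffix.
PLOS_COMPONENT = re.compile(
    r"^(10\.1371/journal\.[a-z]+\.\d+)\.[a-z]\d+$", re.IGNORECASE
)


def parent_doi(doi: str) -> Optional[str]:
    """Return the article DOI if `doi` looks like a PLoS component DOI,
    otherwise None."""
    match = PLOS_COMPONENT.match(doi.strip())
    return match.group(1) if match else None


def is_component_doi(doi: str) -> bool:
    """True if `doi` matches the assumed component pattern."""
    return parent_doi(doi) is not None
```

With this, links to a figure or table can be counted towards the article they belong to by replacing the component DOI with the result of `parent_doi`.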

Articles are sometimes updated or corrected, and many publishers will use a different DOI for the updated article. This is a problem when you want to track all references to this particular article. http://dx.doi.org/10.1371/journal.pcbi.0020121 and http://dx.doi.org/10.1371/journal.pcbi.0020181 are for example DOIs for the same PLoS Computational Biology article (the latter is the corrected version). Nature Precedings uses a format that is easier to understand for computers - http://dx.doi.org/10.1038/npre.2011.4479.3 is for example a link to the third version of this particular manuscript. CrossMark is a new CrossRef service that will make it easier to track the different versions of a manuscript, including retractions.
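To illustrate why the Nature Precedings format is easy for software to handle, here is a small sketch that splits such a DOI into a base identifier and a version number. The pattern is inferred from the single example above and is an assumption, not a documented specification; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: 10.1038/npre.<year>.<number>.<version>
NPRE_VERSIONED = re.compile(r"^(10\.1038/npre\.\d{4}\.\d+)\.(\d+)$")


def split_version(doi: str) -> Tuple[str, Optional[int]]:
    """Return (base DOI, version). The version is None if the DOI does
    not match the assumed versioned pattern."""
    match = NPRE_VERSIONED.match(doi)
    if match is None:
        return doi, None
    return match.group(1), int(match.group(2))
```

All versions of a manuscript then share the same base, which makes it straightforward to group references to them.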

ScienceCard should of course not be limited to journal articles. I’m also interested in other scholarly content, e.g. preprints from arXiv or research datasets from DataCite. But I want to first solve the problems with DOIs for journal articles, before I tackle the much bigger problems with uniquely identifying and tracking other scholarly contributions. Science blog posts are a good example. It would be wonderful to track them in ScienceCard, but I don’t see how we can do that before we have a system in place that assigns unique and persistent identifiers to blog posts. For this and other reasons I really want unique identifiers for science blog posts, and we should also think about using DOIs for this purpose.

Update October 9: A ScienceCard example of multiple identifiers for the same paper:

  • DOI: 10.1007/s10654-011-9572-7
  • PubMed ID: 21461943
  • PubMed Central ID: 3115050
  • Microsoft Academic Search: 48849734
  • Mendeley: 5b0023f0-609e-11e0-8f54-0024e8453de6
  • Mendeley URL: http://www.mendeley.com/research/informativeness-indices-blood-pressure-obesity-serum-lipids-relation-ischaemic-heart-disease-mortality-huntii-study/
  • Scopus: 79959714408
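
A minimal sketch of how a record like this could be stored so that an article can be found by any of its identifiers. The class and field names are hypothetical and do not reflect the actual ScienceCard data model; the values are the ones listed above.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ArticleIds:
    doi: str
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    mas_id: Optional[str] = None         # Microsoft Academic Search
    mendeley_uuid: Optional[str] = None
    scopus_id: Optional[str] = None


def build_index(articles: Iterable[ArticleIds]) -> Dict[Tuple[str, str], ArticleIds]:
    """Map every (scheme, value) pair to its article record."""
    index: Dict[Tuple[str, str], ArticleIds] = {}
    for article in articles:
        for scheme, value in asdict(article).items():
            if value is not None:
                index[(scheme, value)] = article
    return index


example = ArticleIds(
    doi="10.1007/s10654-011-9572-7",
    pmid="21461943",
    pmcid="3115050",
    mas_id="48849734",
    mendeley_uuid="5b0023f0-609e-11e0-8f54-0024e8453de6",
    scopus_id="79959714408",
)
index = build_index([example])
```

Keying on the pair of scheme and value rather than the value alone avoids collisions between numeric identifiers from different services.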

17 Responses to The trouble with DOIs

  1. Joerg Heber says:

    DOIs for all sorts of scholarly content (blogs, Wikipedia pages?) would be great! Though someone would have to issue these. Anyway, you’re right, the problem with DOIs is an issue. It would require publishers to conform to stricter, web-friendly formats etc. It will probably be difficult to get everyone in line.

    By the way, for the journal I work for (Nature Materials) we indeed recently changed the doi in the reference list to the URL format…

  2. Martin Fenner says:

    Jörg, currently scholarly DOIs are issued by CrossRef and DataCite. I think they are useful for things beyond journal articles and datasets, but maybe not for microattributions such as Wikipedia articles. DOIs for blog posts would be great, but it would require a commitment to make both the DOI and content persistent.

    Have you thought about using shortDOIs for Nature Materials? I don’t know all the details about them, but they look like the web-friendly version of full DOIs.

  3. Joerg Heber says:

    well, Martin, I meant that any such scheme would have to be really easy for a blogger to implement.

    As for the shortened DOIs, it sounds good to me, I’ll pass it on. We already use an in-house URL shortener for other websites, so it might make good sense.

  4. Pingback: DOIs aren't as easy as they seem - materialsdave.com

  5. Phillip Lord says:

    We implemented DOIs for knowledgeblog.org (currently experiencing some unfortunate downtime, I’m afraid), using DataCite DOIs. We did this because people think that they add credibility, and not for any other technical reasons.

    The bottom line is, given all the hassles that you report from using DOIs, I fail to understand why adding them to blogs makes sense. They are a legacy technology anyway. DOIs will break pingbacks and trackbacks. Blogs can advertise their own metadata, and host their own permalinks. The technology is enough.

    Permanence is a social issue anyway. “Permalinks” do go away. The solution to this is to make sure your blog is archived by one of the Web Archives that exist. Even during our downtime at knowledgeblog, you can still get to the content, with identifiers.

  6. Phillip Lord says:

    Incidentally, we discussed our own experiences with DOIs back in February.

    For those who are interested….

    http://www.russet.org.uk/blog/2011/02/the-problem-with-dois
    http://blog.fuzzierlogic.com/archives/473

  7. Roderic Page says:

    DOIs like 10.1642/0004-8038(2002)119[0088:SSCPEO]2.0.CO;2 are based on SICIs, an identifier scheme for generating unique identifiers for articles based on metadata for the article itself. It’s a clever idea, although the spec is somewhat baroque. JSTOR used SICIs but now seems to have abandoned them in favour of URLs with integers. BioOne seems to have gone through a phase of using SICIs to construct DOIs, but now uses simpler identifiers.

    The URL encoding issue also affects Mendeley, which often fails to correctly encode BioOne URLs.

  8. Martin, you say you can’t find an API for generating short DOIs…

    The ‘API’ is simply to call the shortDOI service with a normal DOI. So, for example, to get the ShortDOI for the DOI 10.1038/452029a, call:

    http://shortdoi.org/10.1038/452029a?format=json

    Which results in:

    {
    "DOI": "10.1038/452029a",
    "ShortDOI": "10/dd3",
    "IsNew": true
    }

    This is documented at the bottom of the ShortDOI home page.

    –G
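
The call described above could be sketched in Python as follows. The URL format and the field names “DOI”, “ShortDOI” and “IsNew” are taken from the comment; the function names, the timeout and the UTF-8 decoding are assumptions, and the current behaviour of the service should be checked against its own documentation.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def shortdoi_url(doi: str) -> str:
    """Build the request URL; "/" is kept, other special characters
    in the DOI are percent-encoded."""
    return "http://shortdoi.org/" + quote(doi, safe="/") + "?format=json"


def parse_shortdoi_response(body: str) -> str:
    """Extract the short DOI from the JSON body of a response."""
    return json.loads(body)["ShortDOI"]


def fetch_shortdoi(doi: str, timeout: float = 10.0) -> str:
    """Ask the service for the short DOI of `doi` (needs network access)."""
    with urlopen(shortdoi_url(doi), timeout=timeout) as response:
        return parse_shortdoi_response(response.read().decode("utf-8"))
```

Keeping the URL building and the response parsing in separate functions means both can be checked without a network connection.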

  9. Geoff,

    sorry, I must have been blind. I will implement shortDOIs in ScienceCard, as I see several advantages. Of course I will continue storing regular DOIs internally.

  10. Roderic,

    thanks for explaining SICIs. I think that DOIs based on them are too complicated, and the Wikipedia page you link to says that they have not been used for DOIs since 2009.

  11. Phil,

    I’d like to disagree. We need both persistent identifiers and persistent availability of the blog post itself if we want to reliably cite science blog posts. URLs and Web Archives will not really work for this. It doesn’t have to be DOIs, but I don’t see what’s wrong with them or why they are a legacy technology. I’m open to other suggestions, but it is currently difficult to impossible to track the blog posts that discuss scholarly papers, or link to each other. I think this is a problem that we have to solve, and I don’t think that trackbacks and pingbacks alone will do.

  12. Phil Lord says:

    “It doesn’t have to be DOIs, but I don’t see what’s wrong with them”

    Then I am totally confused. If you don’t see what is wrong with them, why did you write an entire post called “The trouble with DOIs”?

    I didn’t hold up trackbacks and pingbacks as a solution — simply, as nice technologies that get broken.

    And as I say, permanence comes not from technology, but from organisations. The UK Web Archive offers a general purpose way of persisting websites, but it is not for the purposes of science, and therefore may not fulfil its needs exactly.

    The basic technology, though, is good enough. It shows that it can be done. The W3C, for example, label all of their recommendations with URIs. Do you think that these will go away?

    We should spend our time worrying about the social organisation, and not the technology. We should be asking our libraries why they don’t look after our identifiers, and persist our content for us; this is, after all, their job.

  13. Martin Fenner says:

    Phil, my particular interest with DOIs is in tracking all the citations, bookmarks, etc. of a particular scholarly contribution: journal article, dataset or even blog post. DOIs have many problems, but URLs are far worse. Not only do they change or break over time, but there are often already several URLs for the same content to begin with.

    I agree that this is a social problem and not a technological one. But I think this is not only the problem of our libraries – it is foremost our problem if we publish content on the web.

    And thanks for the links to your posts about DOIs. Something that has obviously changed since their writing is that CrossRef now provides DOIs as linked data.

  14. Phillip Lord says:

    You are still confusing technology with society. On one hand, you say “URIs break” and on the other hand CrossRef says “DOIs are URIs”.

    Some URIs break. This does not mean that all URIs break. I stand by the point made in my blog post six months ago. We have already added DOIs to our publishing environment; there was then and remains no technological advantage to this. It is purely for social reasons.

    I’ll leave it there, or we shall start going around in circles.

  15. Ed says:

    I’m glad that ShortDOIs are working for you and it seems that your main point isn’t about DOIs themselves but the fact that secondary/bibliographic databases don’t index DOIs – I, of course, agree that they should!

    I wanted to correct one error in your post. In the PLoS example where an article is corrected the second DOI is not for the “corrected article” it’s for the correction notice, which was published as a separate item in the journal and therefore correctly has a separate DOI.

  16. Ed, thanks for the clarification. I think I’m not the only one confused about the correction note: the PDF was downloaded several hundred times.

  17. WebCite (http://www.webcitation.org) assigns unique IDs to web material and can also assign DOIs to all sorts of web material (snapshots of blogs, preprints, datasets etc.). The problem is that somebody needs to pay for it. WebCite hasn’t figured out a business model yet, as neither CrossRef nor the publishers seem interested in paying for this, which would leave the authors. But would you pay (e.g. 50 cents) for assigning a DOI to your blog post?
    WebCite can, by the way, also archive datasets, as long as they are on the web.