Why BibTeX, RIS and Endnote XML will soon be broken

BibTeX is one of the most popular file formats for bibliographies, and is therefore commonly used to transfer bibliographies from one reference manager to another, or to other applications that handle bibliographic references. RIS and Endnote XML are probably the other two bibliographic file formats most commonly used. Most reference managers support all three formats, making it easy to move references around.

All three formats have been around for a while, BibTeX for example since 1985. Reference management has of course gone through many changes during this time, and an important change will happen next year: unique identifiers for scholarly authors. In 2012 the Open Researcher & Contributor ID (ORCID) initiative will start issuing unique identifiers for researchers, and researchers, universities, funding organizations, publishers and hopefully everyone else will start using them. But ORCID will only be successful if as many bibliographic tools as possible can handle ORCID identifiers, and if these tools can exchange these author identifiers.

None of the three bibliographic file formats mentioned above can handle unique author identifiers. If we take for example this paper from ScienceCard, then the authors would look like this in BibTeX:

author = {Kirstein, Janine and Dougan, David and Gerth, Ulf and Hecker, Michael and Turgay, Kürşad}

And like this in RIS:

AU  - Kirstein, Janine
AU  - Dougan, David
AU  - Gerth, Ulf
AU  - Hecker, Michael
AU  - Turgay, Kürşad

I suggest we extend the BibTeX format to understand author identifiers like this:

orcid = {1274643, 8474644, 847412, 9183414, 7461414}

And RIS:

AI  - 1274643
AI  - 8474644
AI  - 847412
AI  - 9183414
AI  - 7461414
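
A rough sketch of how an importer might read these proposed fields, pairing each name with the identifier in the same position. The function names `pair_bibtex` and `pair_ris` are made up for this illustration, and the name splitting is deliberately naive (real BibTeX name parsing also has to deal with braces):

```python
from typing import List, Tuple


def pair_bibtex(author_field: str, orcid_field: str) -> List[Tuple[str, str]]:
    """Pair names from `author` with IDs from the proposed `orcid` field."""
    # Simplification: BibTeX separates names with " and "; brace-protected
    # names containing " and " are not handled here.
    names = [n.strip() for n in author_field.split(" and ")]
    ids = [i.strip() for i in orcid_field.split(",")]
    if len(names) != len(ids):
        raise ValueError("author and orcid lists differ in length")
    return list(zip(names, ids))


def pair_ris(record: str) -> List[Tuple[str, str]]:
    """Pair AU lines with the proposed AI lines, in order of appearance."""
    names, ids = [], []
    for line in record.splitlines():
        tag, sep, value = line.partition("  - ")
        if not sep:
            continue  # not a "TAG  - value" line
        if tag == "AU":
            names.append(value.strip())
        elif tag == "AI":
            ids.append(value.strip())
    if len(names) != len(ids):
        raise ValueError("AU and AI counts differ")
    return list(zip(names, ids))


if __name__ == "__main__":
    print(pair_bibtex("Kirstein, Janine and Dougan, David", "1274643, 8474644"))
```

Both functions rely purely on position, so they only work if every author has an identifier.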

This would look similar in Endnote XML. Will we see these changes to BibTeX, RIS and Endnote XML in 2012? I don’t know, but I very much hope so. Imagine what your Zotero, Mendeley or Endnote could do if the application knew about unique author identifiers: find all papers by a particular author, alert you when a particular author publishes something new, or organize your reference library by author.

18 Responses to Why BibTeX, RIS and Endnote XML will soon be broken

  1. Alex says:

    I’m not sure it’s realistic to expect any addition to the BibTeX standard (I thought it’d been dormant for a while, hopefully I’m wrong), but I’m afraid that unique identifiers would be particularly contentious. Although I’m hardly aware of it — as you say, I currently deal with BibTeX primarily as a middleman between XML, CSV, and Mendeley — one of the defining elements of the standard used to be (according to some old CS friends) a unique identifier for every entry that served to reinforce links between large shared collections and (e.g.) the ACM Digital Library. Now, unless I’m looking at BibTeX in the still-excellent JabRef, this parameter is usually treated as deprecated and hidden, but it’s still in there, no doubt driving a handful of home-grown engineering department document servers. I’m afraid it might be exactly the wrong place for evolution — but I hope I’m wrong.

  2. Martin Fenner says:

    Alex, are you referring to the BibTeX crossref field?

    I agree that BibTeX is probably the least likely of the three file formats to change. It is also possible that another standard bibliographic file format will emerge, not because of ORCID, but because it is more web friendly: the Citation Style Language input format (in XML or JSON).

  3. Trevor says:

    Has RIS ever added stuff? It seems to me to be really unlikely that they would add anything to that specification. It might just be better to get some consensus on some made-up additions to the RIS spec and get folk who format RIS to start including that data in their files according to a common convention.

    Out of curiosity, is ORCID going to generate IDs for researchers from the past too? I’m interested to look into this a bit, curious to know if it leverages existing stuff like the LC Name Authority records…

  4. I think this post could just as easily have been titled, “Why ORCID is about to fail.”

    Declaring something a “new standard” is entirely meaningless if no one implements it (or even if it’s just inconsistently implemented). If the maintainers of BibTeX can’t be persuaded to implement it, then ORCID IDs will be effectively meaningless for huge swathes of the science, math, and computer science communities. We can blame them, but why shouldn’t I blame the ORCID initiative for failing to countenance the design of existing formats?

    It wouldn’t be the first standard effort to be consigned to the dustbin of history for doing so. Standard efforts “break” all the time.

  5. David says:

    There’s nothing stopping you from adding an orcid field to a BibTeX entry, as you can with ISBN or DOI etc. The question is whether the software (the BibTeX program itself, or others importing BibTeX files) recognises the field.

    Is it something that is going to be in the citation itself, e.g.
    Smith J. (orcid/abc123) and Jones G. (orcid/def456) “How to Reference”. London: PLC, 2011.

    In that case it should just be a case of writing modified BST files, shouldn’t it?

  6. Martin Fenner says:

    Trevor, I don’t know the answer to RIS, but Thomson Reuters is a strong supporter of ORCID. ORCID will initially focus on active researchers, but ORCID identifiers will in principle also be available for researchers from the past.

    Stephen, I agree that it is up to ORCID to convince the community, particularly vendors of reference management software, to implement ORCID fields for references and use them in BibTeX and RIS export. This blog post was intended as a start of this discussion.

    David, we haven’t really had the discussion whether ORCID identifiers should be displayed in bibliographies, or whether they are just part of the machine-readable metadata. I personally think that ORCID identifiers should be used as links if a citation is displayed as HTML5.

  7. Rintze Zelle says:

    I don’t speak for the CSL project, but I feel that the CSL input format (which is basically based on what Frank Bennett’s citeproc-js CSL processor uses, and which hasn’t been through much review) is at least currently inadequate as a comprehensive storage format for bibliographic information. It could be easily extended with support for ORCID IDs, though. I would also like to note that even if these IDs are not included in bibliographic entries, they might still be useful to the rendering engine for determining how to disambiguate names (e.g. http://www.zotero.org/support/kb/given_name_disambiguation ).

    This discussion also needs to address HTML-embedded metadata formats, but I’m not an expert on those (maybe Bruce D’Arcus can chime in).

  8. Bruce says:

    The one big problem I see with what you’re proposing, Martin, is that it’d be an all-or-nothing way to identify authors (and other contributors). For the sake of argument, what happens if two authors have ORCID identifiers, and two don’t? To handle that, you’d need to treat contributors as objects (not strings), with names, and optional identifiers.

    In theory CSL could be extended to do this. But so could FOAF. And so to bring things full circle, I’d prefer if ORCIDs were just URIs, which are inherently distributed. Maybe my Google+ profile should be able to serve to identify me as an author?

  9. Martin Fenner says:

    Rintze and Bruce,

    thanks a lot for your input. I haven’t studied the CSL input format enough to understand what is needed to make it into a full-fledged bibliographic file format. But it is certainly an interesting thought.

    Author identifiers should be optional fields. This works better with XML- or JSON-based bibliographic formats with contributors as objects as you suggest. With my RIS and BibTeX examples it is a little bit more difficult to associate author name and identifier if some authors don’t have an ORCID. One solution could be orcid = {1274643,, 847412,, 7461414} if the second and fourth author had no ORCID identifier (and a similar approach would work for RIS).

    You could use http://orcid.org/1274643 instead of 1274643. Whether using your Google+ profile as scholarly author identifier is a good strategy, is another discussion.
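
To make the "contributors as objects" idea concrete, here is a minimal sketch that reads the empty-slot convention above into objects with an optional identifier. The `Contributor` class and `read_contributors` function are hypothetical names for this illustration, and the URI form simply follows the http://orcid.org/1274643 example:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Contributor:
    name: str
    orcid: Optional[str] = None  # None when the author has no identifier

    @property
    def uri(self) -> Optional[str]:
        """Identifier as a URI, following the example in the comment above."""
        return f"http://orcid.org/{self.orcid}" if self.orcid else None


def read_contributors(author_field: str, orcid_field: str) -> List[Contributor]:
    """Build contributors from parallel lists; an empty slot means 'no ORCID'."""
    names = [n.strip() for n in author_field.split(" and ")]
    # "1274643,, 847412" splits into ["1274643", "", " 847412"]; blanks become None.
    ids = [i.strip() or None for i in orcid_field.split(",")]
    if len(names) != len(ids):
        raise ValueError("author and orcid lists differ in length")
    return [Contributor(n, i) for n, i in zip(names, ids)]


if __name__ == "__main__":
    people = read_contributors(
        "Kirstein, Janine and Dougan, David and Gerth, Ulf "
        "and Hecker, Michael and Turgay, Kürşad",
        "1274643,, 847412,, 7461414",
    )
    for person in people:
        print(person.name, person.uri)
```

The empty slots keep the two lists aligned, but a single dropped comma would silently shift every later identifier to the wrong author, which is the fragility Bruce is pointing at.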

  10. Martin, you don’t address RDF, or using ORCID as an attribute in some ontology… in a linked data world citations need to live there too.

    In other thoughts, in linguistic studies some of us find it useful to attach ISO 639-3 codes to our citations. There is only one paper I know of which uses these codes in the printed bibliography. However, I use Endnote and Papers2 and have custom fields for ISO 639-3 codes for both language of content and language being described. These codes also live outside of citations and citation managers. So perhaps what is needed is a way to enrich bibliographic data with outside content.

  11. Stephen, you mention “blaming the ORCID initiative for failing to countenance the design of existing formats”. I struggle to see the reasoning for judging ORCID as a failed effort before it’s even properly begun.

    Seems to me that starting out by making backwards compatibility with BibTeX (or other legacy formats for that matter) a requirement would be putting the cart before the horse. In fact, by this logic, any author identification infrastructure project (ORCID or not) ever conceived would “break” and be doomed to fail, solely on the basis that there isn’t an obvious way for author identifiers to fit into these legacy formats as is (or for existing bibliographic software to deal with them).

    [disclaimer: I am a member of the ORCID Technical Working Group (http://www.orcid.org/twg)]

  12. Martin, it’s pretty clear that there would need to be some kind of compound author/creator field which can hold together both A) the author name (and possibly additional information, like affiliation) and B) the identifier itself. Separate lists for names and identifiers would surely be far too easy to get subtly wrong in parsing or generating output.

    The DataCite metadata schema (http://schema.datacite.org) takes this approach, see this example (yes, I know it’s XML so it’s not directly comparable):

    Smith, Jane
    1422 4586 3573 0476

    [… more creators]

    Perhaps this could be done in a super-simple way in BibTeX like this, which wouldn’t even require a change in the format, only in parsing software:
    author = {Smith, John (orcid:[ID]) and Doe, Jane (orcid:[ID])}


    Just a thought.
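
A small sketch of what "only a change in parsing software" could mean for the inline `(orcid:...)` annotation suggested in this comment. The regular expression and the function name `split_annotated_authors` are assumptions made for this illustration, and the numeric ID is just the placeholder value used in the post:

```python
import re
from typing import List, Optional, Tuple

# Matches "Name (orcid:ID)" and captures the name and the ID separately.
ORCID_SUFFIX = re.compile(r"^(?P<name>.*?)\s*\(orcid:(?P<id>[^)]+)\)$")


def split_annotated_authors(author_field: str) -> List[Tuple[str, Optional[str]]]:
    """Split an author field, peeling off an optional '(orcid:ID)' suffix."""
    result = []
    for raw in author_field.split(" and "):
        raw = raw.strip()
        match = ORCID_SUFFIX.match(raw)
        if match:
            result.append((match.group("name"), match.group("id").strip()))
        else:
            result.append((raw, None))  # author without an identifier
    return result


if __name__ == "__main__":
    print(split_annotated_authors("Smith, John (orcid:1274643) and Doe, Jane"))
```

A parser that does not know the convention would keep the parenthesised part as part of the name, so the annotation would leak into formatted citations.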

  13. Martin Fenner says:

    Hugh, RDF, ORCID and bibliographies is material for a separate post. HTML5 and Schema.org would be yet another post. And yes, an extensible bibliographic file format would be great for other use cases as well.

    Mummi, I think that parsing BibTeX authors is already difficult as it is. I’m not sure I would want to stuff more information into that author field. RIS has separate fields for each author, so AU  - Kirstein, Janine, 1274643 should work. It is obviously easier with XML and JSON to create compound author fields.

  14. “I struggle to see the reasoning for judging ORCID as a failed effort before it’s even properly begun.”

    I wasn’t arguing that. I was saying that that judgment is no more or less valid than saying, “BibTeX, RIS and Endnote XML will soon be broken.”

    Standards efforts succeed or fail based on their adoption. And while the purveyors of the standard tend to blame the implementors in the case of failure, there’s an argument to be made the other way. This post, it seems to me, puts the onus entirely on the implementors — as if to say, “Alas, if only these poor besotted fools would realize how good our idea is.”

    Maybe it is a good idea. I would just like to suggest that if it’s not adopted, it might well be because the people who actually build these systems didn’t think so, and they might have a point. Declaring them “about to be broken” makes it sound as if *their* efforts have failed. I see no reason for that conclusion either.

  15. Rintze Zelle says:

    “RIS has separate fields for each author, so AU  - Kirstein, Janine, 1274643 should work.”

    The RIS specification already uses the third comma-separated value in the AU field to store the name suffix, e.g. “AU  - Lastname, Firstname, Suffix”.
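
The collision is easy to see in a short sketch. This assumes only the "Lastname, Firstname, Suffix" convention described in this comment; `RisName` and `parse_ris_author` are hypothetical names for the illustration:

```python
from typing import NamedTuple


class RisName(NamedTuple):
    last: str
    first: str
    suffix: str


def parse_ris_author(value: str) -> RisName:
    """Split an AU value as 'Lastname, Firstname, Suffix' (missing parts are empty)."""
    parts = [p.strip() for p in value.split(",", 2)]
    parts += [""] * (3 - len(parts))
    return RisName(*parts)


if __name__ == "__main__":
    # A parser following the existing convention reads the ID as a name suffix.
    print(parse_ris_author("Kirstein, Janine, 1274643"))
```

An importer written this way would treat the identifier as a suffix like "Jr.", which is why a separate tag is the safer place for it.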

  16. Hi all,

    Speaking on behalf of Qiqqa.com (but I am certain this would hold for at least Zotero), I believe that ORCID is a great and necessary advancement to bibliographic management, and that Qiqqa would certainly support ORCID from early on, both in terms of however it might be represented in BibTeX, and in whatever citeproc-js might need in order to process it for citation.

    I would recommend something like
    ,author = {Smith, John and Doe, Jane and Marek, Peter}
    ,orcid = {1274643, 847412, 7461414}
    over something like
    ,author = {Smith, John (orcid:[ID]) and Doe, Jane (orcid:[ID])}
    just so that the same BibTeX might be used in both an ORCID-aware processor and a legacy processor.

    Awesome progress: let’s hope ORCID flies!


  17. I’d like to follow up on Gudmundur’s suggestion above, but change it a little bit. I’d suggest just to add a semantically coded text string to stand in for the author’s name, instead of the changes to the format that you propose. I think it might address some of the problems described in the comments.

    For example, instead of “Martin Fenner”, or “Fenner, Martin”, the author string in the record would be “ORCID:123456”. I’m not familiar with the nitty-gritties of these format specifications, but hopefully they are flexible with regards to what is allowed in the author string. If they don’t allow the colon character, though, there are ways around it, like “ORCIDID, 123456” (Hopefully no-one has a last name “ORCIDID”).

    So the record would be like this in BibTeX:
    author = {Kirstein, Janine and Dougan, David and ORCID:123456}
    Or like this in RIS:
    AU - Kirstein, Janine
    AU - Dougan, David
    AU - ORCID:123456

    As has been mentioned, you shouldn’t expect the standards bodies to change the standards right away; or, even if they did, expect the software vendors to implement the changes quickly. I think it’s important to try to adapt to the existing landscape. Along the same lines, I think a first area of attack, from a strategy perspective, should be to get the open-source project developers (CSL/citeproc-js) to implement compatible changes, since they are much more agile.
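
One consequence of the stand-in string suggested in this comment is that the record no longer carries the author's name, so software has to look it up before it can display anything. A sketch of that step, where the `known` dictionary stands in for whatever lookup service would exist (it is purely hypothetical here, as is the function name `display_names`):

```python
import re
from typing import Dict, List

# An author entry consisting only of the stand-in token, e.g. "ORCID:123456".
ORCID_TOKEN = re.compile(r"^ORCID:(\d+)$")


def display_names(author_field: str, known: Dict[str, str]) -> List[str]:
    """Replace 'ORCID:<id>' stand-ins with names from `known` where possible."""
    names = []
    for entry in author_field.split(" and "):
        entry = entry.strip()
        match = ORCID_TOKEN.match(entry)
        if match:
            # Fall back to the raw token when the ID cannot be resolved.
            names.append(known.get(match.group(1), entry))
        else:
            names.append(entry)
    return names


if __name__ == "__main__":
    field = "Kirstein, Janine and Dougan, David and ORCID:123456"
    print(display_names(field, {"123456": "Fenner, Martin"}))
```

Software that does not know the convention would print the token itself as if it were a name.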

  18. Martin Fenner says:

    Jimme and Chris, please excuse the late approval of your comments. I have recently switched to approval of the first comment because of too much spam, and I haven’t gotten into the habit yet of checking regularly.

    As for your comments, I think these are good suggestions. I think it’s important to make changes that a) don’t break the existing tools and b) have the support of a critical mass of vendors – not everybody, but enough to get the ball rolling.