The Trouble with Bibliographies

The bibliography of a scholarly paper is interesting and important reading material. You can see whether the authors have cited the relevant literature, and you often find references to interesting papers you didn’t know about. Bibliographies are obviously also needed to count citations, and then do all kinds of useful and not so useful things with them.


Bibliography by quinn.anya at Flickr.

Unfortunately, almost all bibliographies are in the wrong format. What you want is at least a direct link to the cited work using the DOI (if available), and a lot of journals provide that. You don’t want a link to PubMed via the PubMed ID as the only option (as in PubMed Central), as this requires a few more mouse clicks to get to the full-text article. And you don’t want to go to an extra page, then use a link to search the PubMed database, and then use a few more mouse clicks to get to the full-text article (something that could happen to you with a PLoS journal).

A bibliography should really be made available in a downloadable format such as BibTeX. Unfortunately journal publishers – including Open Access publishers – in most cases don’t see that they can provide a lot of value here without too much extra work. One of the few publishers offering this service is BioMed Central – feel free to mention other journals that do the same in the comments.

This weekend Peter Murray-Rust invited Peter Sefton and me to Cambridge (UK) for a very interesting workshop about Scholarly HTML. Our goal is to discuss how we can define standards and build tools to make HTML the best platform for scholars and scholarly works. The event is in fact a hackfest, and we hope to have something to show by Sunday evening.

My idea for the hackfest is a tool that extracts all links (references and weblinks) out of an HTML document (or URL) and creates a bibliography. The generated bibliography should be in both HTML (using the Citation Style Language) and BibTeX formats, and should ideally also support the Citation Typing Ontology (CiTO) and COinS – a standard to embed bibliographic metadata in HTML. I will use PHP as a programming language and will try to build both a generic tool and something that can work as a WordPress plugin. Obviously I will not start from scratch, but will reuse several already existing libraries. Any feedback or help for this project is much appreciated.
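
To make the idea concrete, here is a minimal sketch of the first step: pull every link out of an HTML document and write a bare-bones BibTeX entry for each one. It is written in Python rather than PHP purely for brevity, and it is not the planned tool. The function names (`extract_links`, `to_bibtex`), the `ref1`, `ref2`, … citation keys, the choice of `@misc` for every entry, and the sample DOI are all assumptions made for illustration. No metadata is looked up, so the entries contain only a URL and, where the link looks like a DOI link, the DOI.

```python
import re
from html.parser import HTMLParser

# Recognizes DOI resolver links, e.g. https://doi.org/10.1234/example
# (illustrative pattern; real-world DOI links come in more variants).
DOI_LINK = re.compile(r"^https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)$", re.IGNORECASE)


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> element, in document order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href not in self.links:
                self.links.append(href)


def extract_links(html):
    """Return the unique link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def to_bibtex(links):
    """Turn a list of URLs into skeleton BibTeX entries (hypothetical keys ref1, ref2, ...)."""
    entries = []
    for number, url in enumerate(links, start=1):
        fields = []
        match = DOI_LINK.match(url)
        if match:
            fields.append(("doi", match.group(1)))
        fields.append(("url", url))
        body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields)
        entries.append(f"@misc{{ref{number},\n{body}\n}}")
    return "\n\n".join(entries)


if __name__ == "__main__":
    sample = (
        '<p>See <a href="https://doi.org/10.1234/example">a paper</a> and '
        '<a href="https://example.org/post">a blog post</a>.</p>'
    )
    print(to_bibtex(extract_links(sample)))
```

A real tool would still have to fill in authors, titles and dates, either from metadata embedded in the page or from a lookup service, and would need a better way to tell references apart from ordinary weblinks.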

If I had a tool with which I could create my own bibliographies (and in the formats I want), I would no longer care so much about journals not offering this service. One big problem would still persist: most subscription journals wouldn’t allow the redistribution of the bibliographies of their papers. A single citation can’t be copyrighted, but a compilation of citations can. I’m sure we will also discuss this topic at the workshop, as Peter Murray-Rust is one of the biggest proponents of Open Bibliographic Data.


19 Responses to The Trouble with Bibliographies

  1. Bruce says:

    Only comment is that you may find both BibTeX and COinS rather limiting.

    I don’t recall if I’ve mentioned to you before, but I’d like to see CSL libraries gravitate towards generating rich bibliographies, using something like microdata and/or RDFa. In general, we probably need to work on what the HTML markup really ought to look like, and do. Imagine, for example, a reader being able to hover over a citation in a blog post, and for a tooltip to pop-up with complete reference information, complete with a clickable link to the HTTP-reference full text source?

    Still, this shouldn’t hold you back; “release early and often” and such. Look forward to what you all come up with!

  2. Bruce, very good points. I agree that there are a lot of interesting things you can do with bibliographies and HTML. And I will certainly have a closer look at microformats. Are there any attempts to define a citation microformat – other than COinS? Something related and easy to implement is the HTML5 “data-” attribute.

  3. Bruce says:

    I’m not referring to the “data-” attributes, but to microdata (though I’m ambivalent about it; I tend to prefer RDFa, even if the tech around it is heavier-weight).

    As for your “attempts” question, my thought is to possibly take the CSL JSON input representation (a couple of us have started to formalize this with json schema but it’s not done), and cast that as microdata.

  4. Rintze says:

    I just discovered this, but there are some people working on a citation microformat:

  5. Bruce says:

    hCite is an old effort that never got finished AFAIK. I and other people were involved in that a couple of years ago, but kind of gave up after a while.

    And as a general proposition, I think microdata provides many of the benefits of microformats (ease of authoring, and easy mapping to JS), without some of the liabilities (like having to write custom parsers).

  6. Martin Fenner says:

    Very interesting discussion. I think it should be one of the goals of this Scholarly HTML initiative to define a microdata format for citations. Not something we will accomplish this weekend, but I hope we have a good start.

  7. Pingback: Unilever Centre for Molecular Informatics, Cambridge - Scholarly HTML hackfest « petermr's blog

  8. Peter Sefton says:

    A (micro) format for embedding citation info will definitely be on the agenda. I think it’s obvious to all of us here in the comments, but I just wanted to note that this should be declarative, and not tied to any particular platform or implementation. So to get Bruce’s suggested popup we would not be embedding anything in the HTML to do that; instead, servers or browsers would sense the data and do something with it. The same approach should work with Maths, Chemistry, Geography – declaratively specified data in HTML that comes alive when placed in an environment that understands it.

    I am interested in the whole toolchain, but I will probably work on authoring tools for word processors to help people create the Scholarly HTML. One thing I’m interested in looking at is how to get Zotero to embed citations that can be usefully reprocessed into Scholarly HTML (that is, embed as much as possible in the citation in-text).

  9. This weekend Peter Murray-Rust invited Peter Sefton and me to Cambridge (UK) for a very interesting workshop about Scholarly HTML.

    I can’t believe you fuckeasses had a snifter-snooter and DIDN’T INVITE ME!!!!!!!

  10. Phil Lord says:

    “My idea for the hackfest is a tool that extracts all links (references and weblinks) out of an HTML document (or URL) and creates a bibliography. The generated bibliography should be in both HTML (using the Citation Style Language) and BibTeX formats, and should ideally also support the Citation Typing Ontology (CiTO) and COinS”

    Martin, our tool, kcite, is already well on the way to achieving this. It already generates the bibliography, and it gathers metadata automatically. Our knowledgeblog posts already advertise their own metadata. Citation Style Language support is not that far away either; I have this working roughly (very roughly) on my machine now, and it’s slated for the release after next.


  11. Martin Fenner says:

    Phil, I know that kcite already does a lot of the things that I want to achieve with this tool. But there are also a few differences. For example, I’m interested in extracting all links out of a document, not just the kcite shortcodes. And I want to automatically generate a BibTeX file out of these links. I want to format the bibliography without accessing CrossRef or PubMed for every citation, as I think this creates too much overhead.

    What do you think is the best way to coordinate these efforts?

  12. Bruce says:

    How about putting stuff on github and coordinating through that (code fork/pull, wikis, etc.)?

  13. Martin Fenner says:

    I’ve created the CSL-Exporter repository on github for this project. I’ve also registered for a WordPress plugin with the same name. This is not the best name, but will do for now. Even though this is a WordPress plugin, I will try to make this generic enough that it is also useful in other environments, e.g. the command line.

  14. Phil Lord says:


    We’ve already fixed the problem of overloading CrossRef (or PubMed or whatever); the fix is in our Mercurial repository and will turn up in a release imminently. It now caches, so that it does one query per NEW citation, and only once per blog (or per WordPress installation; haven’t checked that yet), with a long timeout (once a month I think, which we will make configurable). I have a test article with 150 citations which formats quite happily. For really high-traffic websites, standard WP rules apply: install wp-cache (or equivalent), which my blog is running (totally gratuitously, as it doesn’t get that much traffic!). It’s this sort of thing that made me choose WP in the first place; the issues of scientific publishing are often not scientific; someone else has already fixed them.

    In terms of dragging all links out of an article — great idea, and something that we can happily make use of. We could use it for kciting legacy articles, apart from anything else.

    Oh, and BibTeX export. Well, to my mind, the ideal solution would be to abuse CSL (maybe someone has done it already), but if CSL can specify human-readable formats, can it specify machine-readable ones also? We should be able to get a CSL processor to produce author-date, [1] and BibTeX, all at the option of the reader. If not, I will find some other way of achieving it.

    In terms of co-ordinating, it’s probably time we had a conversation :-) I’d have loved to come to Cambridge for the hackfest, but I am in very low travel mode currently.


  15. Martin Fenner says:

    Phil, it’s probably easiest if we talk on the phone – I wrote you an email. Bruce can probably answer your question about whether it makes sense to use CSL for BibTeX export.

  16. Pingback: Hacking towards Scholarly HTML « ptsefton's Anotar discussion blog

  17. Pingback: Linktipps der Woche: Nutzungsstatistiken für Repositorien, AppStore versus Android Market und Hochschulen im Web | Wissenschaft und neue Medien

  18. Pingback: Scienceblogging Roundup – March 6-12

  19. Kurt says:

    I keep looking for these components to capture the spatial component of papers. Especially important for earth science papers.