Beyond the PDF … is ePub

Having breakfast at the end of a conference is a good way to recap what was discussed and can help to generate new ideas. Two years ago at ScienceOnline09, a conversation with Cameron Neylon that followed up on the session "Reputation, authority and incentives. Or: How to get rid of the Impact Factor" (moderated by Björn Brembs and Pete Binfield) started my interest in unique author identifiers for researchers. An interview with Geoff Bilder about author identifiers and the CrossRef Contributor ID project followed a month later – still one of my favorite blog posts. I have since become deeply involved with author identifiers and joined the Board of the Open Researcher & Contributor ID (ORCID) initiative last September.


Beyond the PDF group photo by lesliekwchan, on Flickr

Wednesday to Friday I attended the Beyond the PDF workshop in San Diego to discuss how we can do better in scholarly publishing. The limitations of the PDF format were just one topic; the main themes were annotation, data, provenance, new models, writing and reviewing, and impact. This is my presentation:

We had two very productive breakout sessions about writing and reading tools, and we agreed that we should build something that makes it much easier to describe and distribute our research data. Most people in the group took a very pragmatic approach and want to build simple tools appropriate for small research groups in the next few months. We thought that graduate students would be good early adopters, and we already have three principal investigators willing to test these tools in their labs.

Our first prototype, as drawn by Peter Murray Rust

Peter Sefton demonstrated his Fascinator tool that already has a lot of the required functionality. But it was the breakfast discussion before departing – again with Cameron Neylon, but this time also including Peter Murray Rust, Peter Sefton and Ana Nelson – that helped me to put all my thoughts into place.

ePub should become the standard document format for authoring, distributing and reading scholarly content.

The ePub format uses a collection of files held together in a zip archive. Content is displayed using a combination of XHTML and CSS – not different from web pages – and the ePub can also contain other files. Journal publishers use XML internally, and it is therefore easy to distribute journal articles in ePub format – some of them are already doing this routinely. ePub has several advantages over PDF, including:

  • ePub can be used for all steps in the creation of a scholarly document, including data collection, authoring, annotating and peer review. There is no need for time-consuming and expensive format conversions. Currently most manuscripts are submitted in Microsoft Word or LaTeX formats, and then converted first to XML and then to HTML and PDF. Metadata such as author identifiers, digital object identifiers and semantic information can be added early on and don’t get lost in a format conversion.
  • ePub makes it easy to include supplementary material, e.g. video and other multimedia content, the datasets used in the publication (particularly the data used for tables and figures), all cited references in BibTeX format, etc.
  • ePub is much better suited for reading on mobile devices, as the format allows reflowing of content. Most articles today are printed from the PDF and then read, but this behavior is rapidly changing.
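To make the container idea concrete, here is a minimal sketch in Python of an ePub that bundles an article with its dataset and references. It assumes the usual ePub container layout (a `mimetype` entry stored first and uncompressed, a `META-INF/container.xml` that points to an OPF package file). The file names, the DOI `10.0000/example`, the author and the function `build_epub` are all made up for illustration. The result has not been validated, and a fully conformant file would also need a navigation document.

```python
import io
import zipfile

# All names and values below are hypothetical placeholders.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

CONTENT_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">doi:10.0000/example</dc:identifier>
    <dc:title>Example article</dc:title>
    <dc:creator>Example Author</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="article" href="article.xhtml" media-type="application/xhtml+xml"/>
    <item id="style" href="style.css" media-type="text/css"/>
    <item id="data" href="data/figure1.csv" media-type="text/csv"/>
    <item id="refs" href="references.bib" media-type="application/x-bibtex"/>
  </manifest>
  <spine>
    <itemref idref="article"/>
  </spine>
</package>
"""

ARTICLE_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example article</title>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
    <h1>Example article</h1>
    <p>The data behind Figure 1 travel with the text in data/figure1.csv.</p>
  </body>
</html>
"""

FILES = {
    "OEBPS/content.opf": CONTENT_OPF,
    "OEBPS/article.xhtml": ARTICLE_XHTML,
    "OEBPS/style.css": "body { font-family: serif; }\n",
    "OEBPS/data/figure1.csv": "sample,value\nA,1.0\nB,2.5\n",
    "OEBPS/references.bib": "@article{example2011,\n  title = {Example}\n}\n",
}


def build_epub():
    """Return the bytes of a small ePub-style zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry goes first and is stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, text in FILES.items():
            zf.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    with open("example.epub", "wb") as fh:
        fh.write(build_epub())
```

The dataset and the BibTeX file are listed in the manifest but not in the spine, so they ship with the article without being part of the reading order.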

ePub is relatively new, and not many applications for scientists support this format yet. We want lab equipment that stores its data in ePub, lab notebooks that write all files from an experiment into an ePub file, reference managers that store and display papers in ePub format, authoring tools that import all these ePub files and thus make it much easier to write, annotate and submit a manuscript, and journal submission systems that take ePub files. I have written a lot about WordPress recently and this is of course a platform that would play nicely with ePub. At least two WordPress plugins support ePub, and it should be possible to modify them to the requirements of the scholarly paper.

Almost as important as the document format is the distribution mechanism of these ePub files. We need a system that makes it easy to collaborate on a document, and that includes version control. The simplest solution would of course be centralized and web-based, but I’m not sure that this is a realistic scenario. We talked a lot about Dropbox during the meeting, but a solution using git (and github), Amazon Simple Storage Service, Windows Live SkyDrive or the repository software ePrints or DSpace is also possible. As an ePub document can contain all required documents, the submission of a manuscript to a journal or institutional repository could become as simple as uploading a single file, and all the peer review (including reviewer comments and revisions) could be done with that file. Submissions of datasets to databases such as Dryad could of course also be done using ePub files. A versioned distribution system should also make it easier to automatically get information about corrections or retractions (e.g. using the CrossMark system that will launch in 2011), and to receive regular updates of article-level metrics, including new citations of the article or dataset.
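As a rough sketch of what a submission system or repository could do with such a single uploaded file, the following Python function reads the identifier, title and bundled files from an ePub. It assumes the standard container layout (`META-INF/container.xml` pointing to an OPF package file). The function name `describe_epub` is made up for this example, and a real system would need error handling and validation on top of this.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the ePub container and package files.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def describe_epub(epub_bytes):
    """Return identifier, title and manifest files of an ePub given as bytes."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml tells us where the package (OPF) file lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the folder that holds the OPF file.
    base = posixpath.dirname(opf_path)
    files = {
        posixpath.join(base, item.get("href")): item.get("media-type")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    return {
        "identifier": package.findtext("opf:metadata/dc:identifier", namespaces=NS),
        "title": package.findtext("opf:metadata/dc:title", namespaces=NS),
        "files": files,
    }
```

Called as `describe_epub(open("manuscript.epub", "rb").read())` (the file name is a placeholder), it returns a dictionary that a submission form could use to pre-fill the identifier and title and to list the attached datasets.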

Several of us attending the meeting will continue the discussion in the coming weeks, and I hope I can convince them of the advantages of ePub. It shouldn’t take us more than a month or two to produce a nice ePub of the sample PLoS Computational Biology article provided for the Beyond the PDF workshop. The next Science Online London Conference will be September 2-3 at the British Library. This is a good opportunity to discuss the progress of this project, ideally including reports about new ePub tools for scientists, more journals using ePub for their articles, and practical feedback from the first users.

32 Responses to Beyond the PDF … is ePub

  1. Mark says:

    Martin, I too am well versed in WordPress and would love to get involved in setting up the prototype. Please do not hesitate to get in touch if you’re looking for people to develop the idea.

  2. Martin Fenner says:

    Mark, that would be great, I will be in touch. This idea of an ePub container for data should work well with FigShare.

  3. Peter Sefton says:

  4. Peter Sefton says:

    Oops – jetlag. Last comment was supposed to read:

    Martin, it’s good that there is discussion of the merits of ePub and consideration of the roles it might play in scholarly communications.

    Further reading: see the post that I put up after Open Repositories 2010:

    There are some challenges to ePub, too, not the least of which is that there is no reader installed by default on the major desktop or mobile platforms, although a variety of reading and packaging apps are available.

  5. Peter,

    thanks for the link. I wrote the blog post on the plane and didn’t spend much time searching for the literature on using ePub in science. BagIt is an alternative, but I like ePub because it is widely supported.

  7. Euan Adie says:

    I posted some late night thoughts here:

  8. Much appreciated for the post, have passed the link around to other JISC colleagues. As to, JISC has put in strategic funding to support further work in ePub (#jiscPUB at Edina, being managed by Theo Andrew[1]), as well as Researcher Identifiers (more announcements coming soon from JISC’s Digital Infrastructure Team Blog). Both these themes are becoming low hanging fruit and it will be good to see the community come together to work on plausible solutions. Next question is, where can we showcase these prototype solutions to the wider community?


  9. David, the Beyond the PDF workshop created a lot of momentum. I think that there are a number of people willing to work on the concept of ePub (or maybe similar tools such as BagIt) as a container for research datasets. Three principal investigators (all based in the US) have agreed to become “beta testers” with one or more of their graduate students.

    Euan, thanks a lot for your comments. We unfortunately didn’t have an HTML vs. XML discussion at the Beyond the PDF workshop. I see many advantages of using XML in the publishing workflow, but HTML is the much friendlier format from an author perspective.

  15. Peter says:

    Late as usual, my 2 cents. I *really* like the idea; epub is just my cup of tea.

    I experimented a little with and conversion via calibre. So far I’m a little discouraged by the lack of mathematical ability of epub, but I’m ready to invest some spare time. I haven’t found any real mathematical “epubbing” to see its abilities. If anybody could share some references in that respect, I’d be thrilled. Maybe there’s a simple way to use, say, mathjax rendering for a nice workflow, if mathml works, who knows.

    Martin, is there a forum of some sort for the joined effort?

  16. Martin Fenner says:

    Peter, you will be happy to hear that the soon to be released ePub3 standard has support for MathML. I haven’t yet tested the various LaTeX for WordPress plugins with my ePub exporter, but I’m happy to do so. And I see that you have already found the new WordPress for Scientists Google Group.

  17. Peter says:

    Martin, thanks. I had heard of ePub3 including MathML. But MathML sucks and the workflow leaves much to be desired (not even thinking about things like accessibility). I have better hopes for a different idea: MathJax integration. Since epub already “does not forbid” JavaScript, I found a couple of examples using JS. When I get around to it, I’ll try to create an example with MathJax to render mathematics. Who knows, it might even work…

  26. Bryce Collins says:

    I agree ePub is really good because currently it is the most widely supported; however, many other places are still strong supporters of Mobipocket and other formats. I really recommend that when you convert your PDF to ePub you also convert it to other formats so you don’t miss out on tons of readers.

  31. Giacomo says:

    Martin, I agree with your (much appreciated) post. I think scientific publishing is in need of a “revolution” regarding drafting, versioning, sharing and formatting of scholarly work. Many pieces of very good software are already out there, but the hard goal will be (a) coordinating these efforts with the definition of good standards (the less restrictive and the more robust, the better) and (b) getting more and more researchers adopting the new practice.

  32. Martin Fenner says:

    Thanks Giacomo, I agree that coordinating all the good tools that are already out there is a very important step.