HTML5 or messages from beyond the PDF

In 1990 Tim Berners-Lee and others started HTML and the world wide web to facilitate scientific communications at CERN, the world’s largest particle physics laboratory.


xmos_basser by .M., on Flickr, November 1993

Although the world wide web profoundly changed scholarly publishing (and of course many other things), HTML did not become the standard document format for scientific papers. In fact, there is no standard document format. We have document formats for authors, for the internal workflow of publishers, and for the distribution and reading of papers.

There are of course many good reasons to use LaTeX for writing, XML for workflows, PDF to print papers or ePub for mobile devices. But reformatting a manuscript into different formats several times is both expensive (in terms of time and costs) and means that the formatting options used will be a compromise of what is available in all formats.

Scholarly Publishing Workflow

All this would be much easier if we just used HTML. With HTML, authors, publishers and readers can all use the same document format. And they will have an endless number of tools at their hands, including of course WordPress for writing and the web browser of choice for reading. HTML in 2010 is very different from HTML in 1990. HTML5 supports new semantic elements such as <article>, microdata, embedding of video without plugins, geolocation, and offline web applications.
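
To make the semantic elements and microdata mentioned above a little more concrete, here is a minimal sketch in Python. The markup, the schema.org ScholarlyArticle vocabulary, the file name experiment.webm and the helper extract_microdata are all assumptions chosen for illustration; none of them come from an actual publisher workflow.

```python
from html.parser import HTMLParser

# Hypothetical HTML5 markup for a paper. The itemprop names are borrowed from
# schema.org (an assumption); the content itself is made up.
PAPER_HTML = """
<article itemscope itemtype="http://schema.org/ScholarlyArticle">
  <h1 itemprop="headline">An example paper</h1>
  <p>By <span itemprop="author">A. Researcher</span></p>
  <section>
    <h2>Results</h2>
    <p itemprop="description">One document serves authors, publishers and readers.</p>
    <video src="experiment.webm" controls></video>
  </section>
</article>
"""


class MicrodataExtractor(HTMLParser):
    """Collect the text of every element that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop name of the element currently being read

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Naive: any closing tag ends the current property (no nesting support).
        self._current = None


def extract_microdata(html):
    """Return a dict mapping itemprop names to their text content."""
    parser = MicrodataExtractor()
    parser.feed(html)
    return parser.properties
```

Calling extract_microdata(PAPER_HTML) returns a dictionary with the headline, author and description, read from the same file a browser would display. The parser is deliberately naive and does not handle elements nested inside an itemprop element.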

An HTML-based scholarly publishing workflow will make it


10 Responses to HTML5 or messages from beyond the PDF

  1. Archiving and offline reading also need to be supported.

    It’s not clear that the 4 claims you make about HTML-based scholarly workflow are true; evidence for those would make this argument far stronger.

    I think EPUB and other HTML-based offline reading formats are very important for the future. I hope you’ll take a closer look at EPUB — it’s essentially HTML + CSS in a tasty zip package with some metadata. If zipping is distasteful, there are alternative offline reading formats (aPub) which remove the zip requirement and stay closer to HTML.
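
    The description of EPUB as HTML + CSS in a zip package with some metadata can be sketched in a few lines of Python. This is an illustration under assumptions, not a validated EPUB: the function name build_minimal_epub, the file names under OEBPS/ and the stripped-down metadata are all hypothetical, and a real e-reader or validator would likely ask for more (identifiers, a table of contents).

```python
import io
import zipfile

# Both XML snippets are stripped-down illustrations, not complete EPUB metadata.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PACKAGE_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
  </metadata>
  <manifest>
    <item id="paper" href="paper.xhtml" media-type="application/xhtml+xml"/>
    <item id="style" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="paper"/>
  </spine>
</package>
"""


def build_minimal_epub(title, body_html):
    """Return the bytes of a tiny EPUB-like zip archive holding one XHTML page."""
    # Note: title and body_html are inserted as-is, without escaping.
    page = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>{0}</title>'
        '<link rel="stylesheet" href="style.css"/></head>'
        "<body>{1}</body></html>"
    ).format(title, body_html)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry comes first and is stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("OEBPS/content.opf", PACKAGE_OPF.format(title=title))
        epub.writestr("OEBPS/paper.xhtml", page)
        epub.writestr("OEBPS/style.css", "body { font-family: serif; }")
    return buffer.getvalue()
```

    Unzipping the result gives back ordinary XHTML and CSS files, which is the sense in which EPUB stays close to HTML.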

  2. Jodi,

    you are right that I have to prove my claims about HTML in the scholarly publishing workflow. One example in support of claim #1 is this presentation. Claim #2 is an interesting one: most enhancements of papers that I have seen so far use Flash. We also haven’t really started to go beyond the PDF; most article formats still look very similar to how they did when journals were printed on paper.

    And when I talk about HTML, I also mean the other formats associated with it, including CSS and JavaScript. This would also include ePub, but I haven’t found the time yet to study ePub in more detail.

  3. Euan says:

    I definitely agree about outputs (though yeah, EPUB is just HTML). You do need the XML part in the middle still though.

    The problem is that HTML is geared towards presentation – it’s not designed for interoperability or querying. Mixing content and presentation is a recipe for disaster in the not much longer term, as any web developer will tell you. :)


    1) ten years ago this website could have been using tables for layout. Nowadays it uses enclosing divs and CSS styles. In 2012 it could be using CSS3 template layouts. The markup for each of these is very different. Are articles from different years going to look different?

    2) publisher x shows acknowledgements before references. They decide to start showing acknowledgements after instead. Their entire back archive needs to be modified.

    3) x’s full text needs to go to PubMedCentral. Either all publishers use a common HTML format or PMC needs a different parser for each one.

    Presentation shouldn’t be stored with documents – it should be added at the point of delivery to the canonical source, which is stored in such a way as to make it portable to other systems.

    This is the model that the publishing industry – trade and STM – is (slowly) moving towards.
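
    The model described above, a canonical source with presentation added at the point of delivery, can be sketched roughly as follows. The XML structure, the section type names and the render function are hypothetical and far simpler than a real publishing schema; the point is only that changing the order of acknowledgements and references becomes a delivery setting rather than an edit to the archive.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical canonical source: structure only, no presentation.
CANONICAL_XML = """
<paper>
  <title>An example paper</title>
  <section type="body">Main text.</section>
  <section type="acknowledgements">Thanks to everyone.</section>
  <section type="references">1. A reference.</section>
</paper>
"""


def render(xml_source, section_order):
    """Render the canonical XML as HTML, placing sections in the given order."""
    paper = ET.fromstring(xml_source.strip())
    sections = {s.get("type"): s.text for s in paper.findall("section")}
    parts = ["<h1>{}</h1>".format(escape(paper.findtext("title")))]
    for kind in section_order:
        parts.append(
            '<section class="{}">{}</section>'.format(kind, escape(sections[kind]))
        )
    return "<article>" + "".join(parts) + "</article>"


# The same stored document, delivered with two different layouts.
old_layout = render(CANONICAL_XML, ["body", "acknowledgements", "references"])
new_layout = render(CANONICAL_XML, ["body", "references", "acknowledgements"])
```

    Here the back archive is untouched when the publisher changes its mind; only the list passed to render changes.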

  4. Euan,

    you raise a very good point, and I used to completely agree with you. It’s just that the XML part in the middle of the workflow complicates things tremendously, and that the XML describing the paper (e.g. XML-DTD) can be rather rigid. I would be happy if a Nature paper from 10 years ago looks different than it does now, as long as I can still access the old paper. HTML is not as good as XML in separating content from presentation, but I think it’s good enough for scholarly papers.

    HTML is a compromise in many ways. But it just works.

  5. Peter says:

    A little late to the party: I think Euan raises good points, but I think there’s also a way out.

    In your post two things struck me as very odd: putting LaTeX and DOC on the same level and making them the starting point of the workflow. I don’t think replacing them with html will solve any problem by itself. But let me step back.

    Putting LaTeX and DOC on the same level seems rather bizarre. The current Microsoft format has lots of good technology (even though I’d prefer odf) but the problem is that most DOCuments are of extremely poor technical quality. I remember the story of a friend in the humanities editing a multi-author book in msword (because the publisher wanted it that way). Unfortunately, some of the co-authors did not know about automatic toc generation and one co-author actually did the footnotes ‘by hand’, i.e., inserted a line, typed tiny numbers and wrote the footnote text… My point being that a lot of people use wysiwyg like a white piece of paper, doodling away, rarely thinking about the formal structure of their documents.

    (La)TeX, of course, forces the author to learn its rules, its structures and its commands. It forces the author to think about content and to separate content from presentation. After understanding the basics you will automagically produce well structured, well typeset documents. Nevertheless, the price to pay is considerable because you have to put a lot of work into learning, understanding and practising the system (even though it’s easier these days of ctan, specialised forums and tex.stackexchange).

    Of course, for mathematicians this was no question — the payoff was huge! There was no alternative, you finally had an alternative! Thanks to Saint Knuth you could write and reproduce proper mathematical texts at home. Nowadays, mathematics publishers expect ready-to-print LaTeX code, so the pressure essentially remains, since you won’t get anything published if you can’t use TeX.

    Now what happens if we switch to html? I would expect nothing but lots of bad html code. If you’re lucky, simple but stupid code: huge paragraphs, instead of tags, manually inserted spaces, pictures and diagrams; if you’re unlucky, uber-complicated word-to-html export.

    Instead, I think, the question is what lies before the DOC/HTML part of the workflow. How can we get scientists to use a technology such as DOC/html5 ‘for real’? As Euan pointed out, a closely related question is: how do we do this in such a way that some backward compatibility can be guaranteed? TeX is a programming language as well, so there’ll always be a way to write any kind of output. Do you think that, say, virtual reality output of data would be possible with html5? With MicrosoftDOC?

    What I’m trying to say is, although I think your idea is excellent and needed, I think it’s missing the human factor a little. I think it would be good to develop a general meta-structure, the essential toolbox of scientific writing so to speak, and implement it as a soft meta-structure, much like pseudo code (and not as human-unreadable as LaTeX); a structure that can be communicated effectively. Maybe on top of that a simplified editor (as opposed to msword) implementing this.

    Oh dear, another one of those long comments of mine. I think I better stop. I hope I ended up making some sense anyway. Hope to read more about the BeyondPDF workshop!

  6. Martin Fenner says:

    Peter, thanks for your comment. I put DOC and LaTeX next to each other because these are two popular authoring tools used by researchers, despite all the differences. I used to think that the ideal authoring tool should write XML which can then be easily submitted to journals. The problem with this approach is twofold: I think the market for such a tool is too small, making it very difficult to commit the development resources required to write and maintain it. And XML was essentially abandoned as a format for web documents (work on XHTML 2.0 stopped in 2009), and some of the reasons for that are also relevant to scholarly publishing.

  7. Peter says:

    Martin, I absolutely agree that there’s little point in developing a new tool. I was thinking more of a modified Word interface (and html export), similar to a modern LaTeX editor, that helps you design a document well.

    But my main question is: how would the choice of html help anyone produce files that meet (useful) structural requirements? I think an initiative regarding the structuring of scientific writing as part of the scientific method would be important in any case. And I think the development in mathematics with regards to TeX shows that large scientific communities can have a successful community process in this respect.
    (PS: is there a comments feed for your blog? I can’t find it…)

  8. Martin Fenner says:

    Peter, Microsoft has released the Article Authoring Add-In for Microsoft Word (Windows only). The Add-In will produce XML conforming to the NLM-DTD specification.

    HTML5 allows for some document structure, but XML would be far better for that. But I have since given up on document structure as a top priority, and think that other aspects (served well by HTML) are more important.

    The RSS feed of my comments is here.

  9. Peter says:

    Martin, very interesting tool from MS. Do you know of any reviews from scientists?

    The feed, btw, does not work — it’s the feed to your articles, not your comments. I also played around with the feed link to find a hidden link, but no luck. A missing feature at plos blogs?

  10. Pingback: Beyond the PDF: Some ideas for document formats and authoring tools « ptsefton