Structured Documents for Science: JATS XML as Canonical Content Format

It’s only my 7th day on the job here at PLOS as a product manager for content management. So it’s early days, but I’m starting to think about the role of JATS XML in the journal publishing process.

I come from the book-publishing world, so my immediate challenge is to get up to speed on journal publishing. That includes learning the NISO standard JATS (Journal Article Tag Suite), which you may know by its older name, the NLM DTD. As journal-publishing folks know, JATS is used for delivering metadata, and sometimes full text, to the various journal archives.

But here’s where journal and book publishing share the same dilemma: XML is a critically important exchange format, but is it the best authoring format these days? Should it be the canonical storage format for full-text content? And how far upstream should XML be incorporated into the workflow?

Let’s look at books for a minute. The book-publishing world has standardized on EPUB (and its cousin, MOBI) as the electronic delivery format. This standardization has helped publishers narrow the list of viable canonical source formats. Even if most publishers haven’t yet jumped to adopt end-to-end HTML workflows, it’s clear to me that HTML makes a lot of sense for book publishing. Forward-thinking book publishers like O’Reilly are starting to replace their XML workflows with an HTML5/CSS3 workflow. HTML/CSS can provide a great authoring and editing experience, and it also gets you to print and electronic delivery with a minimum of processing, handling, or conversion. (O’Reilly’s Nellie McKesson gave a presentation about this at TOC 2013.) And which technology will get more traction and advance further in the next few years, XML or HTML? I know which one I’m betting on.
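
A taste of how print falls out of an HTML workflow: page geometry and running heads can be declared in CSS paged-media rules, which a formatter such as PrinceXML or Antenna House turns into PDF. A minimal sketch (the selectors and dimensions are illustrative, not any particular publisher’s stylesheet):

    /* Page geometry plus a running head pulled from the article title */
    @page {
      size: 8.5in 11in;
      margin: 1in 0.75in;
      @top-center { content: string(article-title); }
    }
    h1.title { string-set: article-title content(); }  /* capture the title */
    h2 { page-break-after: avoid; }  /* keep headings with their text */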

In terms of canonical file format, journal publishing may have one less worry than book publishing, because many journals are moving away from print to focus exclusively on electronic delivery whereas most books still have a print component. Electronic journal reading—or at least article discovery—happens in a browser; therefore, HTML is the de facto principal delivery format. And as much as I’d like to think HTML is the only format that matters, I know that many readers still like to download and read articles in PDF format. But as I mentioned, spinning off attractive, readable PDF from HTML is pretty easy to automate these days. So I ask:

If XML is being used as an interchange format only, what do we gain from moving the XML piece of the workflow any further upstream from final delivery?

Well, why does anyone adopt an XML workflow? The key benefits are:

- platform/software independence (which HTML also provides);
- managing and remixing content down to the node level (not terribly useful for journal articles); and
- transforming the content into a number of different output formats, such as PDF, HTML, and XML (HTML5/CSS3 can be used for this transformation as well, with a bit of toolchain development work; see the sketch below).
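
To picture that transformation concretely: in a classic XML-first workflow, the transformation step is usually XSLT. A minimal sketch mapping a few real JATS elements onto HTML (a production stylesheet would be far larger):

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Map JATS structures onto their HTML equivalents -->
      <xsl:template match="article-title">
        <h1><xsl:apply-templates/></h1>
      </xsl:template>
      <xsl:template match="sec/title">
        <h2><xsl:apply-templates/></h2>
      </xsl:template>
      <xsl:template match="p">
        <p><xsl:apply-templates/></p>
      </xsl:template>
    </xsl:stylesheet>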

But XML workflows come with a hefty price tag. The obvious cost is conversion, which is expensive in both money and time. Another downside is the learning curve for the people who actually interact with the XML (and how many people should that be?). In the real world, will you ever get authors, editors, and reviewers to agree to work with their content as XML? More likely than not, you will either need to hide the underlying XML behind a WYSIWYG-ish editor that you buy or build (both are expensive), or you will do your XML conversion toward the end of the process. On a similar note, experienced XSL-FO toolchain developers are hard to hire, while developers who work with HTML5, CSS3, and JavaScript are plentiful.

So building an entire content management system and workflow for journal publishing around XML, specifically JATS XML, which is just one delivery format and isn’t needed until nearly the end of the process, doesn’t seem like a slam dunk to me. I should clarify that using JATS XML for defining metadata does seem like the obvious way to go. But I’m not so sure it’s a good fit as the canonical storage format for the full text. One idea, sketched below, is to separate the article metadata from the article body text, to leverage the ease of editing HTML for the text itself.
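
Here is a rough sketch of that split (my own illustration with placeholder values, not an existing PLOS format): a slim JATS record for the metadata, with the narrative stored as plain HTML alongside it.

    <!-- Metadata record: real JATS elements, placeholder values -->
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0000000</article-id>
      <title-group>
        <article-title>Example Article Title</article-title>
      </title-group>
    </article-meta>

    <!-- Body text: stored and edited as ordinary HTML -->
    <article>
      <h1>Example Article Title</h1>
      <p>The narrative lives here, ready for the browser as-is.</p>
    </article>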

What about moving HTML upstream, and focusing efforts on delivering better, more readable HTML in the browser? What about shifting focus away from old print models and toward leveraging modern browser functionality, maybe by adding inline video or interactive models, or by making math, figures, and tables easier to read and work with?

Just to throw a curve ball into the discussion, I’m attending Markdown for Science this weekend, where Martin Fenner and Stian Håklev will lead the conversation about whether it makes sense to use markdown plus Git for academic authoring and collaboration. I want to hear from as many sides of the content format conversation as possible.

So, what do YOU think?


31 Responses to Structured Documents for Science: JATS XML as Canonical Content Format

  1. Louis Maddox says:

    Hi Molly,

    Sorry to go off topic a little here, but I don’t suppose you can tell me whether the JATS format is in use outside of PubMed Central?

    It seems to be a far better format than the DOI XML schema, which is at present really limiting what I can do with PubMed search results when only around 10% have a PMCID…

    One of the key differences I’m interested in is the lack of a corresponding-author field in other XML schemas. In looking for a way around this I’ve seen countless others with the same problem, and all I can do currently, where a PMCID is unavailable, is to check whether the surname matches within the corresponding email provided (although this is far from perfect).

    Unless I’m being totally ignorant of some generic XML method here, this info always seems to be provided by the individual journals’ databases, and can’t be retrieved in any general way…

    • Martin Fenner says:

      Louis, let me jump in here, as we did a hackathon project using PLOS author data at the #hack4ac conference in July (PLOS Author Contributions).

      Although there is a lot of info you can get out of the PLOS Search API, some things (e.g. an author together with their affiliation) are only found in the XML.

      I’m a bit confused about what you mean by the DOI XML schema; you should really be looking at the CrossRef metadata schema, what they call Unixref. Their contributor element has some information, but as I understand it, only whether an author is the first author, not the corresponding author.

      • Louis Maddox says:

        Cheers for pointing that out, Martin; I was just using that DOI schema on the recommendation of a peer at ResearchGate (see the RG Q&A thread).

        Sorry if my post was unclear. All the data I’m using is from a PubMed search (not a PLOS one).

        Obviously I’ve got the PubMed ID (PMID) for all the papers, but have only been able to get a DOI for some and a PubMed Central ID (PMCID) for a small number.

        The PMCID can give the corresponding author. I was trying to pull it into a spreadsheet, but Google Spreadsheets’ importXML function and Excel’s filterXML are both useless! At present I’m trying Watir (Web Application Testing in Ruby) instead, possibly accessing the individual web pages through the DOI-encoded document URL (at http://dx.doi.org/…).

        JATS seems to be the only way to get a corresponding-author attribute; it’s a shame it’s not more widely used.

        I can’t seem to find any documentation on the project. Did the #hack4ac conference find any novel ways to determine corresponding authors that I might be able to use?

        • Martin Fenner says:

          Louis, we didn’t look into corresponding author information at #hack4ac.

          There are a number of web services to get the DOI for a paper where you know the PMID, e.g. this one.

          Using Excel to extract XML sounds painful. If you know Ruby, you can use the nokogiri gem; there are similar libraries for other languages.
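
          For instance, a minimal sketch (the file name is assumed; in JATS, corresp="yes" on <contrib> flags the corresponding author):

              require 'nokogiri'

              # Parse a local JATS file; contributors flagged corresp="yes"
              # are the corresponding authors.
              doc = Nokogiri::XML(File.read('article.xml'))
              doc.xpath('//contrib[@corresp="yes"]').each do |contrib|
                surname = contrib.at_xpath('.//surname')
                given   = contrib.at_xpath('.//given-names')
                puts "Corresponding author: #{given && given.text} #{surname && surname.text}"
              end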

          • Louis Maddox says:

            Hi Martin, I looked into what you said and now have a more-or-less working Ruby script extracting the corresponding author. I could use some help on it now, but it’s undeniably set up and functions as expected!

            The repo’s on GitHub, as is my email if you could lend a hand and want to get in touch.

            Specifically, I could use help with parsing emails to deduce which author each belongs to (as sometimes this is the only way to tell who the corresponding author is in poorly annotated XML), and I would like to implement the PMID->DOI lookup as the next step.

            Doesn’t seem like this exists in any other language/resource.

          • Martin Fenner says:

            Louis, I am glad you have this mostly working for you. It would be great if you could wrap up your code into a Ruby gem, as this makes the code easier to reuse. I will contact you via email to discuss this further. For PMID->DOI there are already several services out there, including the PubMed EUtils.
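
            As a sketch of the EUtils route (the PMID is just an example; efetch reports the DOI in an <ArticleId IdType="doi"> element when PubMed has one on record):

                require 'open-uri'
                require 'nokogiri'

                pmid = '23193287'  # example PMID
                url  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" \
                       "?db=pubmed&id=#{pmid}&retmode=xml"
                doc  = Nokogiri::XML(URI.open(url))
                doi  = doc.at_xpath('//ArticleId[@IdType="doi"]')
                puts doi ? doi.text : 'No DOI on record for this PMID'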

  2. Mark Fortner says:

    Molly,
    Does PLoS have any plans for either creating or using web-based, collaborative tools to support the workflow? In my mind’s eye, I imagine being able to use something like Google Docs, with special plugins like the Google References plugin that can manage references. In addition, collaborative editing means that authors can work simultaneously on a single manuscript, and reviewers can add comments directly to it.

  3. Mark Fortner says:

    I’d be interested to know if PLoS has any plans to handle entity recognition in the toolchain. For example, if an author puts a gene symbol, pathway name, disease, or article reference in the text of a manuscript, it would be extremely useful to have it recognized and automatically linked to the appropriate resources. Hovering over a gene name might show you a brief synopsis of the gene, while right-clicking on it would display a menu with linkouts to EntrezGene or UniProt. The menus and “lenses” might be something that different scientific communities could contribute.
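
    As a sketch, the enrichment could be as simple as data attributes that a viewer script turns into hover cards and linkout menus (the attribute names here are hypothetical; the TP53 identifiers are real Entrez and UniProt accessions):

        <!-- A recognized gene symbol, annotated for hover synopses and linkouts -->
        <span class="entity gene" data-entrez="7157" data-uniprot="P04637">TP53</span>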

    • Molly Sharp says:

      Thanks for your comment, Mark; it’s always good to hear about the types of content enrichment that people are interested in for PLOS articles. This one is on our radar. Hmmm… makes me wonder if we should start a new post/thread about semantic enrichment – another topic about which people have very strong opinions.

  4. Michael Kay says:

    Let’s just think about citations, a key part of scientific articles. How do you mark up citations to maximize your chances of being able to control (a) how they are formatted in different house styles, and (b) how they can be used as the basis for an intelligent search? The HTML5 vocabulary doesn’t have the semantic richness to say “this is a volume number” and “this is the range of pages”. You need a specialized markup vocabulary for citations. You can create one in XML, but creating one in HTML5 can only be done by tag abuse or by exploiting the very limited capabilities in HTML5 for application extensions.
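
    To make that concrete, compare a JATS-style citation (these element names are real JATS) with the class-attribute workaround HTML5 forces on you (the values are placeholders):

        <!-- JATS: the markup itself carries the semantics -->
        <element-citation publication-type="journal">
          <article-title>An Example Article</article-title>
          <source>Example Journal</source>
          <year>2013</year>
          <volume>8</volume>
          <fpage>100</fpage>
          <lpage>110</lpage>
        </element-citation>

        <!-- HTML5: semantics only by convention, i.e. tag abuse -->
        <span class="citation">Example Journal <span class="volume">8</span>:
          <span class="fpage">100</span>-<span class="lpage">110</span> (2013)</span>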

    • Molly Sharp says:

      I agree, Michael; I don’t think HTML is the best strategy for references either. I would be inclined to treat them more as metadata and leverage DOIs whenever possible to automate the data collection. Let your toolchain *display* them in house style, but store structured data.
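
      For example, CrossRef DOIs support content negotiation, so a toolchain can fetch structured reference data automatically. A minimal Ruby sketch (the DOI is a placeholder; the Accept header asks for CSL JSON):

          require 'open-uri'
          require 'json'

          # Ask the DOI resolver for machine-readable citation data
          doi  = '10.1371/journal.pone.0000000'  # placeholder DOI
          data = URI.open("https://doi.org/#{doi}",
                          'Accept' => 'application/vnd.citationstyles.csl+json').read
          ref  = JSON.parse(data)
          puts "#{ref['container-title']} #{ref['volume']}: #{ref['page']}"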

      Good point about tag abuse, too.

  5. John Cowan says:

    Unicode was originally intended only as an interchange format, but the advantages of it becoming an end-to-end format turned out to be too huge to ignore.

  6. Molly Sharp says:

    Thank you for all of the comments so far; my intention for this post was to start a dialog. You’ve raised critical points, and I’d love to keep the discussion going. Here are a few thoughts inspired by your comments.

    It’s a fair point that different stages of a text’s lifecycle may call for different optimum file formats, depending on the user or consumer (human or machine) most impacted at the particular stage. Authors and editors clearly have a very specific set of needs during the editing process, so the underlying format of the text for that stage could be driven by the best editing tool for the job. Feature-rich web delivery may call for a different underlying format than the one used during editing. The bottom line is that conversions may very well be necessary, but wouldn’t it be nice if a single underlying format would suffice at least up to the point of publication?

    I really hate text file conversions. Let me clarify – I hate text file conversions if they require a separate workflow step that costs time and money. That cost generally means the conversion process can be applied only once, at one fixed stage in the life cycle of the text. So the idea of eliminating a manual conversion step is always coloring my take on these issues.

    In the modern landscape of preprints and Advance Online Publication, a journal publishing platform needs the agility and flexibility to output all the required consumable formats at any stage along the way, not just at one specified stage (currently, generally just before official publication).

    So a platform or publishing system that can handle all the necessary file transformations, at any stage, elegantly and “losslessly,” is my holy grail.

    On a related note, I attended the Markdown for Science hackathon on Saturday. It was a great event; I learned a good deal about how some scientists are already using markdown today, and about the power of the markdown tools that already exist. I believe the challenges that markdown advocates face now apply equally to promoters of XML-first or HTML-first workflows for science publishing: some authors are already on board with a given format, but what do you do for the rest, who are tied to Word for a plethora of reasons?

    I was amazed by what some of the scientists at the event are doing with markdown today. They described using a variety of powerful tools, such as Pandoc, R, IPython Notebooks, and dexy, to manage their entire corpus of work (not papers for publication). They can update a single data point and then, nearly instantly, output all the formats they need to redistribute the work to their colleagues. They are hand-crafting miniature but powerful publishing platforms of their own. There is a wiki for the entire topic of markdown for science, newly forked under the name Scholarly Markdown (since it’s not just for science), available here for further discussion.
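
    The single-source, many-outputs pattern they described is easy to sketch with Pandoc alone (the file names here are hypothetical):

        pandoc -s paper.md -o paper.html   # web delivery
        pandoc paper.md -o paper.pdf       # print, via a LaTeX engine
        pandoc paper.md -o paper.docx      # for Word-bound collaborators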

    I will respond more specifically to the comments inline. Again, I appreciate the thoughtful dialog. Let’s keep it going.

    • Gareth Oakes says:

      You will never control all of the authors all of the time :)

      In a far off future, I imagine the children of today may in fact prefer a simple online tool for authoring papers. It would be a fully integrated set of “cloud” services presented in a fashion very similar to Google Docs. The format produced would be structured and ready to move with little effort to production systems.

      In the interim, a transition needs to occur. Depending on discipline, authors submit in Word, TeX/LaTeX, PDF, etc. If you can’t control the authors, you can at least incentivise them. Imagine that a journal publisher provides a “Google Docs for Papers” tool and a carrot on a stick (e.g. a financial or turnaround-time incentive). Gold OA may provide a simple opportunity to test this theory: e.g. offer regular submissions at $3500, discounted to $2500 if the author uses the online tool instead.

      I circulated this idea around the SSP conference last week and it seemed to pique some interest.

  7. Gareth Oakes says:

    I think once the shift to electronic as the primary delivery format is mostly complete, HTML is the natural way to manage content, even full text. What is needed for more structured content, such as journals, is a semantic overlay on the HTML. This should be a standardised microformat, perhaps led by or compatible with JATS and its conventions.
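
    One possible shape for such an overlay (the attribute vocabulary here is hypothetical, reusing JATS names as values on plain HTML):

        <!-- Plain HTML underneath; JATS-derived semantics layered on top -->
        <section data-jats="abstract">
          <p>Background, methods, results…</p>
        </section>
        <p>…as shown in <span data-jats="taxon">Drosophila melanogaster</span>…</p>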

    BTW, this chain of thought is not entirely novel. I provide for your reference an article in a similar vein from 2010: http://quod.lib.umich.edu/j/jep/3336451.0013.106?rgn=main;view=fulltext

    • Molly Sharp says:

      Gareth, excellent point about the need for semantic overlay. I also agreed with much of the article you linked to by John Maxwell et al. – thanks, I hadn’t seen that. I especially agree with the idea of using the web as a production platform due to its unparalleled ubiquity. I very much agree with choosing free (libre) software and simple tools. The one addendum I would make to John’s article is that today, with CSS3’s paged media controls, we can avoid using layout tools altogether – we can produce attractive and readable print-quality materials and we can automate that process.

  8. Donat Agosti says:

    Dear Molly
    You make a good point regarding “wasting” resources. The motivation to use XML and create TaxPub, a JATS flavor for the taxonomic community (which discovers and describes the diversity of life), was the ability of XML to create semantically enhanced, structured documents, in addition to the linking.
    There is also an entire XML-based production workflow implemented at Pensoft, creating journals like ZooKeys and PhytoKeys, where the goal is not only to produce a journal with HTML, PDF, and print output, but to disseminate the content and integrate it as widely as possible: through exports to PubMed Central (which extracts the treatments, one of the main elements of the publications, and makes them explicitly accessible), to Species-ID (again exposing the treatments, in a wiki environment the crowd can edit), to Plazi, to EOL, and not least to Pensoft’s own species profiles.
    The question, then, is whether the same functionality, that is, semantic enhancement and wide dissemination, can efficiently be achieved using HTML5.

    • Molly Sharp says:

      Donat, I am right there with you on the question of whether the same functionality, that is, semantic enhancement and wide dissemination, can efficiently be achieved using HTML5. If someone has this answer, I’d love to hear it. Thank you for your comment.

  9. Paul Donohoe says:

    You asked: If XML is being used as an interchange format only, what do we gain from moving the XML piece of the workflow any further upstream from final delivery?

    I think that for journal publishers particularly, your supposition is incorrect: XML is being used for more than just interchange to PDF and HTML. Adoption of XML (with a markup standard such as JATS) earlier in the publishing workflow allows extraction and enrichment of semantically useful content: chemical and biological compounds, people such as authors and editors, places such as organisations and addresses, and the relations between such things. The content for such enrichment comes from many sources, such as databases, which have XML schemas in place. Admittedly, RDF is gaining ground as an alternative format for such semantic content mixing, but it too has international standards for semantic markup.

    Without common standards for semantic markup, adoption of HTML further back in publishing workstreams makes sharing and mixing content between publishers very challenging.

    • Molly Sharp says:

      Paul, I acknowledge your point that XML is useful for more than just interchange, although interchange seems to be the only thing some publishers use it for at the moment. I would say that until we have the answer to Donat’s question, we need to hang on to XML. My original question remains: at what point do we integrate XML into the workflow? I agree that enhancing, sharing, and mixing content are all things we need to do more ably in the very near future, and it would be a grave mistake to build systems or workflows that leave any of them out.

  10. Daniel says:

    I agree that the HTML5/CSS3 approach is promising indeed, but having XML-based workflows is certainly not a bad idea if you want the final document to be marked up semantically (e.g. for references, place names and coordinates, people, genes, taxa, and funding sources). In any case, I am looking forward to the results of the Scholarly Markdown workshop, as I cannot attend in person.

    In the meantime, I would be interested in getting your views on how to fix the JATS XML that PLOS journals are currently delivering to PubMed Central. Some of them have been highlighted (not specifically for PLOS, but with PLOS examples) here, and an abstract on the matter (again not specific to PLOS) has been submitted to JATS-Con – it would be nice to be able to report by then that these problems have been addressed at PLOS, perhaps starting with a typo that has already been reported twice.

    • Molly Sharp says:

      Daniel, we appreciate your raising these issues. At PLOS, we are actively working on a plan for more effective handling of certain types of XML errors. I would like to follow up with you about the more general issues about JATS; I’ll contact you offline. Thank you for your comment.

  11. John Callaway says:

    In my mind, XML and HTML(5) are totally different beasts. XML is a data-representation format, while HTML/CSS is a presentation technology. Now, I hate XML just as much as the average software developer, but I can’t deny that there is a huge existing toolchain and workforce trained to deal with it.

    My worry about having the canonical article representation in HTML5 is rather simple: what happens when HTML6 (or CSS4) comes along? Do we then have to go through the laborious process of updating our corpus?

    If I were designing a journal publishing system from scratch, I would personally use JSON as the representation format, but XML is a defensible choice too. Nothing has fundamentally changed with XML since about 2000, and it’s likely that data archaeologists centuries in the future will know how to parse it.

    I agree with you that few authors will want to touch XML or JSON. But I would argue the same for HTML5/CSS. I think it’s unavoidable that you will need a conversion step, or authoring tools, at the start of the article submission process.

    I’m just not convinced that HTML5 is anything more than a presentation technology, and using it as an archival format strikes me as a mistake.

    • Molly Sharp says:

      Hi, John. I, too, am entranced by the possibilities that JSON may provide. I look forward to discussing this further in the office. Look out, I’ll be cornering you when I see you! Thank you for the thoughtful comments.

      • Gareth Oakes says:

        The problem with JSON is how to effectively and efficiently represent mixed content models. SGML-derived standards such as XML/HTML make this easy.
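
        To see the problem, take one sentence of inline markup and one plausible JSON encoding of it (the encoding is hypothetical; there is no standard for this):

            <p>The <italic>E. coli</italic> genome was sequenced in 1997 <xref rid="bib1">[1]</xref>.</p>

            { "p": [ "The ",
                     { "italic": "E. coli" },
                     " genome was sequenced in 1997 ",
                     { "xref": { "rid": "bib1", "content": "[1]" } },
                     "." ] }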

        Another important point for XML that I don’t see brought up is the tool set. There is a wide availability of tools for dealing with XML (authoring, editing, conversion, validation, publishing, etc.). Any new format would also require an appropriate set of tools to allow people to work with it for production use.

  12. Kaveh says:

    I think you have hit the nail on the head, Molly. Or at least it is a very timely discussion. There is a lot to be said for “piggybacking” on the myriad tools available on the net for HTML5, instead of reinventing them all for the publishing industry. So I would suggest HTML5-first, as opposed to XML-first. The former is not as strict, but perhaps we can have a QA tool that checks the structure.

    So the definitive archive, or what we might call the “format of record,” will be the HTML5, effectively a blog post.

    • Molly Sharp says:

      Kaveh, I thoroughly enjoyed our discussions about all of this last week, and I look forward to more.

