Author: Molly Sharp

Getting to CrossMark

This week, we launched our participation in CrossRef’s CrossMark program. It’s an exciting step for PLOS, and getting there was a learning experience we hope you’ll find interesting.

The Program

CrossMark is a service of CrossRef that is gaining traction among scholarly publishers: more than 30 publishers participate to date, covering nearly half a million scholarly documents. The CrossMark logo on an article page gives researchers a consistent way to check the status of any article from any participating publisher. When someone clicks the CrossMark logo, from either the online version of the article or the PDF, they see a popup like this one. It indicates either that the article is up to date or that updates are available.


It’s clear that the CrossMark service is valuable for keeping content current, which supports the integrity and completeness of the scholarly record. It’s also worth highlighting that we’d like our initial CrossMark participation to be the first step toward additional exciting uses in the future. We could extend our CrossMark usage to…

  • support article versioning
  • display FundRef info
  • display info about our peer review process
  • link to related data
  • experiment with threaded publications
  • …and more

The Journey

Getting from “we want to participate in CrossMark” to “the CrossMark logo is live” was a process that took time. Seven months, if you want to know the truth! Don’t let that scare you if you’re a publisher interested in kicking off your own CrossMark participation. The main reason it took us seven months is that we bundled the CrossMark initiative into a larger corrections handling overhaul, which included a massive data migration effort. Anyone who has been through one of these will tell you the same thing: data migrations are not for the faint of heart. And in retrospect, this bundling of initiatives was a decidedly un-Agile way to go.

So the overall initiative included overhauling our corrections handling process, which meant switching systems for inputting and publishing correction notices. The new process required system development, which in turn required documentation, training, and hands-on practice for a sizable chunk of our staff. And then there was the data migration effort, which took a long time on its own. (None of this work was part of the CrossMark implementation itself.)

Then, we tackled the CrossMark piece, which was fairly straightforward in the scheme of the overall project. We added the CrossMark logo to articles: the CrossMark logo now appears on every PLOS article page on our journal sites, and on the downloadable PDFs for all newly-published articles going forward. And we updated our deposit toolchain to include the CrossMark metadata. But there were a few complications, because of the aforementioned data migration.
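To give a flavor of what “including the CrossMark metadata” in a deposit looks like, here’s a minimal sketch that assembles a CrossMark block with Python’s standard library. The element names follow the CrossMark section of CrossRef’s deposit schema as we understand it; the DOIs and domain are placeholders, not real PLOS identifiers, and this is an illustration rather than our actual deposit tooling.

```python
import xml.etree.ElementTree as ET

def build_crossmark(policy_doi, domain, update_doi=None,
                    update_type=None, update_date=None):
    """Assemble a minimal <crossmark> fragment for a CrossRef deposit."""
    cm = ET.Element("crossmark")
    # DOI of the page describing the publisher's CrossMark policy
    ET.SubElement(cm, "crossmark_policy").text = policy_doi
    # Domain(s) where the CrossMark button is allowed to resolve
    domains = ET.SubElement(cm, "crossmark_domains")
    dom = ET.SubElement(ET.SubElement(domains, "crossmark_domain"), "domain")
    dom.text = domain
    ET.SubElement(cm, "crossmark_domain_exclusive").text = "false"
    # Optional: point an update (e.g. a correction) at the article it amends
    if update_doi:
        updates = ET.SubElement(cm, "updates")
        upd = ET.SubElement(updates, "update",
                            {"type": update_type, "date": update_date})
        upd.text = update_doi
    return cm

frag = build_crossmark("10.9999/example-policy", "journals.example.org",
                       update_doi="10.9999/example.correction",
                       update_type="correction", update_date="2013-10-01")
xml_out = ET.tostring(frag, encoding="unicode")
print(xml_out)
```

In a real deposit this fragment sits inside the journal article’s metadata record, alongside the usual bibliographic elements.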

First, we chose to create a back-deposit of CrossMark data for our entire corpus. Over ten years of publishing adds up to somewhere around 110,000 articles, plus over 3,000 migrated corrections. Naturally, things change over time. How does a person get a grasp of the minor differences between article XML generated over ten years? You can look at a few files from various periods in each year, but that barely scratches the surface: you still have no clear idea of what might actually be different. A metaphorical needle in a gigantic digital haystack. So we wrote some XSL transforms, threw the whole lot at ’em, and temporarily kicked some cans down the road, figuring we’d let CrossRef’s submission results tell us if something was wrong. After sending off 110,000+ XML files (with a slight chuckle) and letting the script run for about twelve hours, we had a pretty decent success rate. After some slight tweaking, the rest were good to go as well.
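“Let the submission results tell us” works because CrossRef returns a per-record diagnostic log for each batch. Here’s a hedged sketch of triaging such a log with the standard library: the `record_diagnostic` element and its `status` attribute match the submission logs we’ve seen, but the sample below is invented, and the exact `<msg>` wording varies in practice.

```python
import xml.etree.ElementTree as ET

# Trimmed, invented example of a CrossRef submission log.
SAMPLE_LOG = """\
<doi_batch_diagnostic status="completed">
  <record_diagnostic status="Success">
    <doi>10.9999/example.0000001</doi>
    <msg>Successfully updated</msg>
  </record_diagnostic>
  <record_diagnostic status="Failure">
    <doi>10.9999/example.0000002</doi>
    <msg>Error processing deposit</msg>
  </record_diagnostic>
</doi_batch_diagnostic>
"""

def triage(log_xml):
    """Split a submission log into (succeeded, failed) DOI lists."""
    root = ET.fromstring(log_xml)
    ok, bad = [], []
    for rec in root.iter("record_diagnostic"):
        doi = rec.findtext("doi", default="(no doi)")
        (ok if rec.get("status") == "Success" else bad).append(doi)
    return ok, bad

ok, bad = triage(SAMPLE_LOG)
print(f"{len(ok)} succeeded, {len(bad)} need another look: {bad}")
```

Run over a night’s worth of batch logs, a script like this turns 110,000 results into a short list of records worth a human’s attention.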

Dealing with back-deposits for our migrated corrections was messier and required more clean-up. First the corrections had to be re-formatted for display on our website in their new form, and then mined for the needed CrossMark deposit information before we sent the XML off for deposit. Still, the vast majority of the work was accomplished with a small toolset: some .jar files provided by CrossRef and some XSLT files did most of the heavy lifting (thanks for those .jar files, CrossRef!). How you compile and prepare your corpus may vary from ours, though.

And now a few words about article PDFs for our CrossMark program. As we mentioned, the CrossMark logo appears on PDFs for articles we publish going forward. We chose to back-update the online versions of our articles to include full CrossMark functionality, but we decided not to update the 110,000+ downloadable PDFs for previously-published articles. It was a decision based more on our unique volume situation and less on the process of updating the PDFs. The marking and stamping process is simple once you have it set up. But we decided that the testing and remediation challenges associated with replacing 110,000+ active PDFs were too much to take on at this time. CrossRef leaves it up to the publisher whether to fully update the corpus or to participate in CrossMark from a given date onward. We took a hybrid approach: we added CrossMark functionality to all HTML articles, but only to PDFs for newly-published articles.

So there you have it! Overall, getting to CrossMark turned out to be a bit more of a journey than we anticipated, but we have arrived, and we’re glad we took the trip. We hope this post is useful to any of you who may be considering kicking off a CrossMark participation program of your own.


What Do We Hold In Highest Regard?

A page of printed letterpress at Arion Press, October 14, 2013


Beautiful black serif impressions on thick creamy paper

Mythical engravings you can feel, with a pass of your fingers over the paper

A substantial, weighty, J.C. Paul & Son press engraved with serpent decorations… and other solid metal equipment, all with a notable lack of safety features

Engraved letterpress page, Arion Press

Cover engraving at Arion Press

These were some of the sights on offer during a tour of Arion Press in the Presidio, San Francisco. Arion Press is a boutique San Francisco press where many of the production aspects of the publishing trade are practiced as they were in pre-World War II days. It would be hard to find a publisher more in love with exquisite typography than Arion. They still compose and cast their own type, using the foundry they purchased in 1989, Mackenzie & Harris. M&H is the oldest and largest surviving typefoundry in America. Arion Press also operates a letterpress and full book bindery on the premises. Mark Sarigianis, apprentice typecaster, gave us the full tour yesterday (public tours are available on Thursdays).

The best thing about touring Arion was getting this tangible view into the past, when working the typesetting keyboard was a craft requiring six years’ apprenticeship. When typos couldn’t be fixed with a backspace, because the letters were cast into words in metal. When the keyboard operator needed to calculate, line by line, whether to hyphenate a word, when he heard the chime signaling that the end of the line was drawing near. When signatures were imposed by positioning metal blocks of composed type in a frame. When paper was cut and folded by hand, and then sewn into bindings. Every step was tangible, every step carried its own set of sensations—sight, touch, smell, and sound.

J.C. Paul & Son press, Arion Press


The people at Arion Press have gone out of their way to hold on to those tenets of publishing they hold in highest regard. They may have discarded older ideas and processes that no longer served (gender inequality used to be prevalent in the printing industry, for example), but they held onto their core values of typography and design. The most obvious direct application to PLOS is thinking about typesetting and typography. But I prefer to consider how PLOS can hold on to what we most value in the scholarly publishing world, and ditch the rest—the things that no longer serve—much like acid etching away the negative space from an illustrative plate in an Arion Press book.

So tell us your thoughts, please:

What are the existing tenets of scholarly publishing that we should retain?

What are the ones that used to serve a purpose, but no longer do?

Metal type at Arion Press


Blocks of composed pages at Arion Press



Structured Documents for Science: JATS XML as Canonical Content Format

It’s only my 7th day on the job here at PLOS as a product manager for content management. So it’s early days, but I’m starting to think about the role of JATS XML in the journal publishing process.

I come from the book-publishing world, so my immediate challenge is to get up to speed on journal publishing. And that includes learning the NISO standard JATS (Journal Article Tag Suite). You may know JATS by its older name, the NLM DTD (the Journal Archiving and Interchange Tag Suite). As journal publishing folks know, JATS is used for delivering metadata, and sometimes full text, to the various journal archives.

But here’s where journal and book publishing share the same dilemma: just because XML is a critically important exchange format, is it the best authoring format these days? Should it be the canonical storage format for full text content? And how far upstream should XML be incorporated into the workflow?

Let’s look at books for a minute. The book-publishing world has standardized on an electronic delivery format of EPUB (and its cousin, MOBI). This standardization has helped publishers drill down to a shorter list of viable options for canonical source format. Even if most publishers haven’t yet jumped to adopt end-to-end HTML workflows, it’s clear to me that HTML makes a lot of sense for book publishing. Forward-thinking book publishers like O’Reilly are starting to replace their XML workflow with an HTML5/CSS3 workflow. HTML/CSS can provide a great authoring and editing experience, and then it also gets you to print and electronic delivery with a minimum of processing, handling, or conversion. (O’Reilly’s Nellie McKesson gave a presentation about this at TOC 2013.) And which technology will get the most traction and advance the most in the next few years, XML or HTML? I know which one I’m betting on.

In terms of canonical file format, journal publishing may have one less worry than book publishing: many journals are moving away from print to focus exclusively on electronic delivery, whereas most books still have a print component. Electronic journal reading—or at least article discovery—happens in a browser; therefore, HTML is the de facto principal delivery format. And as much as I’d like to think HTML is the only format that matters, I know that many readers still like to download and read articles in PDF format. But as I mentioned, spinning off attractive, readable PDF from HTML is pretty easy to automate these days. So I ask:

If XML is being used as an interchange format only, what do we gain from moving the XML piece of the workflow any further upstream from final delivery?

Well, why does anyone adopt an XML workflow? The key benefits are: platform/software independence (which HTML also provides), managing and remixing content to the node level (which is not terribly useful for journal articles), and transforming the content to a number of different output formats such as PDF, HTML, and XML (HTML5/CSS3 can be used for this transformation as well, with a bit of toolchain development work).

But XML workflows come with a hefty price tag. The obvious one is conversion, which is costly in both money and time. Another downside is the learning curve for the people actually interacting with the XML—and how many people should that be? In the real world, will you ever get authors, editors, and reviewers to agree to interact with their content as XML? More likely than not, you’ll either need to hide the underlying XML behind a WYSIWYG-ish editor that you buy or build (both are expensive), or you’ll do your XML conversion toward the end of the process. On a similar note, how easy is it to hire experienced XSL-FO toolchain developers? Developers who work in the world of HTML5, CSS3, and JavaScript, by contrast, are plentiful.

So building an entire content management system and workflow for journal publishing around XML—specifically JATS XML, which is just one delivery format that isn’t needed until basically the end of the process—doesn’t seem like a slam-dunk to me. I should clarify that using JATS XML for defining metadata does seem like the obvious way to go. But I’m not so sure it’s a good fit to serve as the canonical storage format for the full text. One idea is to separate article metadata from the article body text, to leverage the ease-of-editing of HTML for the text itself.

What about moving HTML upstream, and focusing efforts on delivering better, more readable HTML in the browser? What about shifting focus away from old print models and toward leveraging modern browser functionality, maybe by adding inline video or interactive models, or by making math, figures, and tables easier to read and work with?

Just to throw a curve ball into the discussion, I’m attending Markdown for Science this weekend, where Martin Fenner and Stian Håklev will lead the conversation about whether it makes sense to use markdown plus Git for academic authoring and collaboration. I want to hear from as many sides of the content format conversation as possible.

So, what do YOU think?
