My Paper Writing Dream Machine 1.0

I've written a similar post before, but I would like to talk about some of the features that I would like to see in an ideal paper writing application.

Intelligent Formatting
Content and formatting should be separated from each other. A manuscript should require as little formatting as possible, and that formatting (including the format of references) should be defined in a Journal style that is automatically applied to the manuscript. Publicon by Wolfram Research tried to achieve this, but unfortunately appears to be a dead product.
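As a minimal sketch of what separating content from formatting could look like, the Python snippet below stores a reference as plain data and renders it through interchangeable journal styles. The style names, field names and sample reference are invented for illustration and do not match the requirements of any real journal.

```python
# Hypothetical journal styles: each one is just a template applied to the same data.
JOURNAL_STYLES = {
    "journal_a": "{authors} ({year}). {title}. {journal} {volume}: {pages}.",
    "journal_b": "{authors}. {title}. {journal}. {year};{volume}:{pages}.",
}


def format_reference(reference: dict, style: str) -> str:
    """Render one reference according to the named (hypothetical) journal style."""
    return JOURNAL_STYLES[style].format(**reference)


# Invented sample reference; the content never changes, only the style does.
reference = {
    "authors": "Doe J, Roe R",
    "year": 2008,
    "title": "An example article",
    "journal": "Example Journal",
    "volume": 12,
    "pages": "34-56",
}

if __name__ == "__main__":
    for style in JOURNAL_STYLES:
        print(format_reference(reference, style))
```

Switching journals then means choosing a different style name, while the manuscript data itself stays untouched.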

References
A reference database should be integrated into the paper writing application. Ideally this would be a web-based database such as Connotea or CiteULike. Both Refworks and EndNote Web already offer some level of integration.

Versioning
Storing all versions of a manuscript is very important for obvious reasons: backup, keeping track of revisions and coordinating the input from more than one author. Version control is standard practice in software development, using tools like Subversion or the newer Git.
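To illustrate why keeping track of revisions works well for plain-text manuscripts, here is a small Python sketch that compares two drafts line by line using the standard library's difflib module. It is only a stand-in for what Subversion or Git do; the draft texts and the labels draft_v1 and draft_v2 are made up.

```python
import difflib


def manuscript_diff(old: str, new: str) -> list[str]:
    """Return a unified diff between two plain-text manuscript versions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="draft_v1",  # hypothetical version labels
            tofile="draft_v2",
            lineterm="",
        )
    )


# Invented example drafts that differ in a single sentence.
draft_v1 = "Title: My Paper\nWe measured nothing.\nThe end."
draft_v2 = "Title: My Paper\nWe measured everything.\nThe end."

if __name__ == "__main__":
    print("\n".join(manuscript_diff(draft_v1, draft_v2)))
```

The diff lists only the sentence that changed, which is the kind of overview a co-author needs when reviewing a revision.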

Integration with Online Submission Systems
Submitting a manuscript to an online manuscript submission system such as EditorialManager or Topaz is too complicated. This process could and should be automated.

Summary
My Paper Writing Dream Machine will in all likelihood turn out to be a web-based application, built on one of the more advanced platforms such as Google Gears, Microsoft Silverlight or Adobe Flex. The data will be in XML using a standard DTD. Some of the pieces of the puzzle already exist, but nobody has yet put them together in a way that creates a compelling alternative to the currently used systems.

This entry was posted in Snippets.

29 Responses to My Paper Writing Dream Machine 1.0

  1. Richard P. Grant says:

    Someone’s going to mention LaTeX any minute…

  2. Martin Fenner says:

    Richard, you are of course right. I’m not a LaTeX expert, but for me LaTeX is more of a typesetting application – it creates publication-ready manuscripts. I also don’t need most of the formatting features in Microsoft Word. Publicon was actually the right idea and I bought the software when it was released in 2005. But we are still at version 1.0.1 three years later.

  3. Sabine Hossenfelder says:

    This “Paper Recipe”:http://backreaction.blogspot.com/2006/06/paper-recipe.html might also come in handy ;-)

  4. Martin Fenner says:

    … or use the “SCIgen”:http://pdos.csail.mit.edu/scigen/ automatic paper generator.

  5. Maxine Clarke says:

    Many (or even most) journals have a print-first production workflow. Hence a lot of these clever programmes are incompatible with the codes the production department and typesetters use (eg at Nature we format manuscripts using a programme that provides xml tags for the typesetters to pick up automatically, but this programme is not compatible with codes the author might have inserted in the manuscript to do various things such as format references). It is quite a process to set up new typesetting rules.

  6. Brian Derby says:

    In the dim and distant past, when I was doing my Ph.D., someone in the computer lab in Cambridge wrote what was probably the original thesis/paper generating engine. For any computer geeks, this was written in BCPL. The product fooled my supervisor as a bit of thesis draft (for about 5 minutes).

  7. Martin Fenner says:

    Maxine, I’m obviously no expert of the manuscript workflow after paper submission. But I would believe that submitting a paper in a standard XML format would make things much easier for everybody involved. One possible standard XML format is the “Article Authoring Tag Set”:http://dtd.nlm.nih.gov/articleauthoring/tag-library/2.3/index.html developed by PubMed Central. And this XML format is also supported by the “Microsoft Word 2007 Authoring Add-In”:http://www.softpedia.com/get/Office-tools/Other-Office-Tools/Article-Authoring-Add-in-for-Microsoft-Office-Word.shtml that was released earlier this month.

  8. Scott Keir says:

    There’s probably an “Emacs command”:http://xkcd.com/378/ that does all that…

  9. Massimo Pinto says:

    I did use LaTeX for some time before and it worked very well for me. But having not used it in more than 6 years now, I would not know what progress (if any) LaTeX made in the directions that Martin is pointing out here in his post.
    The main problem I had with LaTeX is that my PhD supervisor did not use it, and I had to translate everything to rtf for him to comment. Huge waste of time. Not worth it. And yet, LaTeX helped me to produce a 240-page PhD thesis in PDF format, with index and glossary and all hyperlinks etc, within 2 megabytes. Mmmm.

  10. Maxine Clarke says:

    Martin, there is XML and XML. Microsoft’s, I hear quite rude things about it (spurious corrupt background codes, etc) from the “people who know about XML” but as you write, they (MS) are working on it. Basically, MS’s latest version and associated XML was not created in consultation with the STM (science, technical and medical) publishing industry; MS released it mainly having consulted with “business users” (presumably its main customer base) rather than professional publishers, but since then in view of outcries it has been consulting with the publishing industry. You can follow some of this on the Nascent blog. It is true they are retrospectively working on these compatibilities.
    The XML output by a publisher conforms to a DTD (document type definition). A lot of these very flexible and superb tools for authors (as in your post), while fantastic for the process of preparing and writing the paper, tend to fall down when it comes to a journal’s/publisher’s technical standard, which has to be the same for all the material published by the journal irrespective of the format in which the author submitted. MS’s own XML is nowhere near a journal DTD standard. (Which is not surprising as it is not a professional typesetting program.)
    The macro our sub and copyeditors use is one we purchase from a small company who created it precisely for the purpose: to allow us to edit and structure MS Word manuscripts accepted for publication, which does output XML to our DTD standard, for direct automatic use by the typesetter. When a ms is accepted for publication, it needs to slot into that workflow. Many if not all of the tools you mention, Martin, are fine for creating the paper, but when you submit your final version to the journal’s production process (assuming other journals are like the one I know), the code underlying these processes will probably need to be stripped out, either by you (the author) or the journal copy/sub editors, before they apply the journal’s own XML coding.
    Probably this is far more detail than you are interested in, apologies if so!
    Massimo, LaTeX is passionately loved by those who use it, mainly physical scientists but some biologists too; but its extreme customisation, which is one of the reasons its users love it, is a factor that makes it hard for the journal/publisher on standards. At Nature, most authors use MS Word and submit in that, so our workflows are set up for that. But a minority don’t and we have to accommodate them too. A ms that entirely consists of highly customised LaTeX equations is quite a challenge. However, our physical science journals use a TeX-based workflow (with all the associated DTD standards I mentioned above re Nature’s Word workflow), because most if not all of their authors use TeX formats.
    Again, apologies if I have gone too far off the topic of the post.

  11. Martin Fenner says:

    Maxine, thanks for the detailed comment. My post was intended more as a -dream- perspective of how things could be in the future and not so much about the current state of affairs. The submission process will become smoother once more applications support the NLM Article Authoring Tag Set (or another standard XML DTD) and more electronic submission systems (“eJournalPress”:http://www.ejpress.com, EditorialManager, “BenchPress”:http://benchpress.highwire.org/, etc.) can handle this format.
    Massimo, the ideal paper writing application is not a word processor. That’s why I don’t think that LaTeX or Microsoft Word are the future. I don’t need different fonts, font sizes, line spacing and many of the more sophisticated features. But I need a program that tells me that I have all the required elements (title, author list, keywords, abstract, etc.) in my manuscript and that they are formatted correctly (mostly the references, but also the word count in the abstract, the figure and table numbering, etc.).
    The fact that most people use Microsoft Word tells me that there wasn’t really a market for a paper writing application (as there is for example a market for “screenwriting software”:http://www.screenwriting.com/). But Web 2.0 applications are now sophisticated enough that someone might just create the perfect paper writing application. I have written about “Adobe Buzzword”:http://www.screenwriting.com/ and “Google Docs”:http://docs.google.com before, but they also try to become full-blown word processors and lack most of the features from my wish list.
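The kind of check described above (required elements present, abstract within its word limit) could be sketched roughly as follows in Python. This is an illustration under invented assumptions: the element names, the 250-word limit and the sample manuscript are hypothetical and do not follow the NLM DTD or any journal's actual rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical requirements; a real journal style would supply its own.
REQUIRED_ELEMENTS = ("title", "authors", "keywords", "abstract")
MAX_ABSTRACT_WORDS = 250


def check_manuscript(xml_text: str) -> list[str]:
    """Return a list of problems found in a (made-up) manuscript XML format."""
    root = ET.fromstring(xml_text)
    problems = []
    # Every required element must exist and contain some text.
    for name in REQUIRED_ELEMENTS:
        element = root.find(name)
        if element is None or not "".join(element.itertext()).strip():
            problems.append(f"missing required element: {name}")
    # The abstract, if present, must stay within the word limit.
    abstract = root.find("abstract")
    if abstract is not None:
        word_count = len("".join(abstract.itertext()).split())
        if word_count > MAX_ABSTRACT_WORDS:
            problems.append(
                f"abstract has {word_count} words (limit {MAX_ABSTRACT_WORDS})"
            )
    return problems


# Invented sample manuscript that lacks a keywords element.
sample = """<manuscript>
  <title>An example article</title>
  <authors><author>J. Doe</author></authors>
  <abstract>A very short abstract.</abstract>
</manuscript>"""

if __name__ == "__main__":
    for problem in check_manuscript(sample):
        print(problem)
```

For the sample manuscript the only reported problem is the missing keywords element.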

  12. Maxine Clarke says:

    Agree with the principle, certainly. It is hard for a journal operating within a deadline/production workflow to be constantly flexible and changing, when you scale up the operations (eg a typesetter is working with hundreds of journals).
    Detaching web from print workflows is one way to do it, which some journals do already (especially those that don’t have print editions ;-) ) and this will become increasingly prevalent I think.
    BTW I agree with you also, Martin, on web-based submission systems. Not only for journals, I hear quite a few “comments” about funding agencies and other similar – eg if one is asked to peer-review a grant, as well as submit one. Speaking personally, I write quite a few references for people; increasingly, I have to wrestle with these automatic submission systems for my reference, which makes a lot more work for me even if it makes it easier for the receiving organisation: end result, I am less inclined to write references for people.
    Some disciplines and publishers have already made it easier for authors to transfer manuscript submissions between titles, and I see no reason why this process could not develop further.

  13. Maxine Clarke says:

    Apologies for the missing characters above, I did not check preview (I often forget), and am working on a “remote” computer which seems to have a funny keyboard.

  14. Cameron Neylon says:

    Maxine, thanks for that description – I kind of knew that publishers final format was XML but I’d never thought about what it might look like. Can I ask Martin’s question kind of from a different angle?
    If Nature has an XML template would it be possible to imagine loading that into the application Martin is talking about in the same way that many journals now generate Word templates? Is the DTD XML any use as a document format for authoring? Or is it purely for typesetting?
    And how much does the format differ in practice from one publisher to another? XML is supposed to be easily convertible so that in principle any set of tags in one (properly formed) XML document can be (relatively) easily converted to another as long as a translation of appropriate tags can be agreed. So one can imagine a template XML document for authoring which has many tags stripped out and some converted for typesetting. I’m sure the devil is in the details.

  15. Maxine Clarke says:

    Cameron, I think each publisher has its own DTD. I’ll have to ask the tech guys that question when I’m at work (which I am not today, as it is Sunday ;-) )
    Yes, we would like authors to be able to create their own XML docs if they like, as you suggest — but we have to pay a license fee for every computer terminal that has the editing system. I have asked the publishers to look into the cost options of having a template that authors can use. Nevertheless, in practice what we do is to edit the manuscript and give the author a copy of the edited version before we send it to the typesetters. The author can then check all is OK and make minor changes electronically– which they like to do — and frankly all of them are grateful not to have to mess around with the XML formatting — apart from one or two LaTeX enthusiasts, but as mentioned above, Nature does not have a LaTeX typesetting workflow.

  16. Cameron Neylon says:

    Ah license fees. Well you know what my answer to that is likely to be :)
    But seriously I am interested in what this looks like – thinking about how you link the kind of things Martin is talking about with an expanded role for the journal template to enable more semantic authoring. It probably will require open formats though and that’s probably likely to involve serious workflow changes – but I would take that as meaning that if we want to make the case to publishers we’ve got to have really compelling demonstrators.

  17. Martin Fenner says:

    I would like to mention Publicon again, as it was ahead of its time when first released in 2004. Publicon can export to XML using the BiomedCentral DTD, an example (also showing beautiful PDF export) can be found “here”:http://www.wolfram.com/products/publicon/samples/bmc/. The Publicon XML file can then be used by BiomedCentral without file conversions.
    This concept could and should be picked up by other applications. The new Microsoft Word 2007 plugin for the NLM DTD is another step in this direction, but our university still uses Microsoft Word 2003.
    Maxine, you mention additional licensing fees. BiomedCentral is offering a discount on article processing charges when manuscripts are submitted in the XML format mentioned above.

  18. Maxine Clarke says:

    I meant a licence fee to use the structuring and editing tool – we pay a small company (who wrote it and who sell it) a fee for each machine on which we install it. Most STM publishers use this particular product (it is called eXtyles and written by a company called Inera).

  19. Martin Fenner says:

    This is again one of those posts where I learn a lot of stuff. For example the fact that Inera “helped develop”:http://www.inera.com/nlmresources.shtml the NLM DTD mentioned several times in the comments already – together with “NCBI”:http://www.ncbi.nih.gov/ and “Mulberry Technologies”:http://www.mulberrytech.com/. Bruce Rosenblum from Inera gave a nice introduction to the NLM DTD at “this ALPSP Meeting”:http://www.alpsp.org.uk/ngen_public/article.asp?id=335&did=47&aid=1244&st=&oaid=-1 in December 2007. A blog post by Chris Baker summarizing the meeting is “here”:http://usabilitynotes.typepad.com/usabilitynotes/2007/12/the-nlm-dtd.html.
    Using XML for manuscript writing has another advantage: it makes it easier to store different versions of a manuscript and have several authors work on the same text. This is because versioning software such as Subversion works best with text files rather than proprietary file formats such as Microsoft Word 97-2004.

  20. Duncan Hull says:

    It’s remarkable, isn’t it: authoring and publishing papers is at the core of what scientists do and we have all these systems which are all fundamentally inadequate in some way, LaTeX, MS Word and XML.
    What we seem to need is a paper writing dream machine version 3.0 but I’m not going to hold my breath waiting for it…

  21. Cameron Neylon says:

    I think the whole thing is very tied up with what the ‘paper’ turns into over the next decade. With data ‘publishing’ becoming increasingly the norm there will be a real shift towards automated generation of the documents – Acta Cryst E is an interesting example of round peg data shoe horned into a square peg journal shaped hole – I think the authoring of the traditional paper will be dragged along as those tools are developed.
    I would still maintain that XML has the potential to act as a superset of all these things. But the problem lies in making sure tag sets are orthogonal and the structure consistent. The fact that most STM publishers use the same product is encouraging – I guess the exceptions are mostly web only and open access publishers?

  22. Martin Fenner says:

    Today I found Paper Writing Dream Machine 0.1beta. “Lemon8”:http://www.lemon8.org/ is an open source web-based tool to convert OpenOffice and Microsoft Word documents into XML based on the NLM DTD. Lemon8 is in the early stages of development, but already has some impressive features: (1) importing of a number of document formats, (2) a reference manager including import from PubMed and (3) export to the NLM DTD and PDF. Lemon8 is probably intended as an open source alternative to eXtyles, but it has the potential for more.

  23. Maxine Clarke says:

    Yes, these are exciting times, as web programmers meet editorial production workflows — everything is possible!

  24. Pablo Fernicola says:

    Most of the functionality that Martin is asking for is coming together in the Article Authoring Add-in for Word. I encourage all to provide feedback and input into its development.
    “add-in blog”:http://blogs.msdn.com/exscientia/
    Also, the add-in will enable others to build on top of it, being able to benefit from conversion to the NLM DTD, and just focusing on adding their own unique value.
    -pablo

  25. Maxine Clarke says:

    When will this be released, Pablo?

  26. Martin Fenner says:

    I’m really looking forward to this tool, as the Add-In should help promote the XML format defined in the NLM DTD. But I still use Word 2003, mainly because my university doesn’t support Office 2007. What happened to the “initial problems”:http://blogs.nature.com/wp/nascent/2007/06/word_2007_and_the_stm_publishe.html with manuscripts submitted in Word 2007 format? And what percentage of manuscripts is now submitted in this format?

  27. Graham Steel says:

    I wonder where “OpenOffice.org”:http://www.openoffice.org/ (i.e. Open Source suite) knits in (or knot) with all of this.

  28. Martin Fenner says:

    To me OpenOffice (just as Microsoft Word) is too much of a word processor. Intelligent formatting is one example where this is a problem. I don’t need the formatting of a word processor (different fonts, font sizes, line spacing, etc.), but I need help with a different kind of formatting (title, abstract, keywords, material and methods, etc.). Lemon8 (mentioned above) does that.

  29. Pablo Fernicola says:

    @Maxine – we are looking at early in the second half of this year for the first release of the add-in
    @Martin Jun 30 – there were two related issues – the processing back-end for journals had not evolved to ingest docx files (the new format in Word 2007), and, when saving from Word 2007 into the older format, math equations were not editable (they were in fact pictures). Solutions to both issues are evolving, with back-ends being updated to process docx files and deal with the math equations.
    @Martin Jul 01 – over time the look and feel within the word processor will become irrelevant in relation to what gets presented in the published paper. The look and feel will still be relevant for the author’s experience while writing.
    The XML content will be transformed for presentation (as happens today with articles posted on PubMed Central, transformed from the XML to HTML or PDF for presentation). The add-in aims at incorporating the semantic structure which you refer to, independent of the look/styling.