A Call for Scholarly Markdown

Markdown is a lightweight markup language, originally created by John Gruber for writing content for the web. Other popular lightweight markup languages are Textile and Mediawiki. Whereas Mediawiki markup is of course popular thanks to the ubiquitous Wikipedia, Markdown seems to have gained momentum among scholars. Markdown really focuses on writing content; many of the features of today’s word processors (e.g. fonts, line spacing or style sheets) are just a distraction. Adding markup for document structure (e.g. title, authors or abstract), on the other hand, is overly complicated with tools such as Microsoft Word.

Fortunately or unfortunately there are several versions (or flavors) of Markdown. The original specification by John Gruber hasn’t been updated for years. Github uses Markdown with some minor modifications. Multimarkdown and Pandoc provide features important for scholarly content, e.g. citations, superscript and tables.

  • Markdown
  • Github-flavored Markdown
  • Multimarkdown
  • Pandoc

The Pandoc flavor of Markdown probably comes closest to the requirements of a scholar, but still has limitations, e.g. support for metadata and tables isn’t very flexible. I propose that we as a community create a new Scholarly Markdown flavor, which takes into account most of the use cases important for scholarly content.

One of the big advantages of Markdown is that the format can not only be translated to HTML, but also to other formats, and Pandoc is particularly good at translating to and from many different formats. We want to make sure that Scholarly Markdown not only translates into nice Scholarly HTML (with good support for HTML5 tags relevant for scholars), but also into Microsoft Word, LaTeX and PDF, as these are the formats typically required by manuscript tracking systems.

Some of the features required for Scholarly Markdown include:

  • Superscript and subscript
  • Highlighting text (supporting the HTML tag <mark>)
  • Captions for tables and figures (with support for the HTML tags <caption> and <figcaption>)
  • Support for document sections (the HTML5 tags <article>, <header>, <footer>, <section>)
  • Good table support
  • Math support
  • Good citation support
  • Support for comments and annotations

Multimarkdown and Pandoc of course already support many of these features. Tables and citations are two examples where it is important to not only support them, but support them in a non-intrusive way that doesn’t get in the way of the flow of writing.

BTW, this wouldn’t be the first community flavor for Markdown. The screenwriting community has done this already with Fountain.


56 Responses to A Call for Scholarly Markdown

  1. Mike Taylor says:

    Do you envisage this MarkDown translating directly into the XML format that gets deposited in PubMed and from which other formats are generated? If so, would this automate much of the workflow? Could citations and references be true pointers to a global reference DB instead of text that needs to be resolved? If so, this would be very well worth doing. (And using MarkDown would be a hundred times better than XML, as has sometimes been suggested for submissions.)

  2. Ian Mulvany says:

    Martin, hear hear!

    Markdown is an excellent format for capturing input. I think there are a couple of other parts of the equation that are also worth thinking about. These are the underlying document model and the publishing systems that can support the dissemination of the work, once written.

    Our current underlying document model is tending towards NLM XML. That’s better than nothing, but I don’t like it as much as the idea of moving towards a JSON based document model. The core advantage of moving to the latter is that you can start to capture transactional information. What this allows you is real time collaborative editing. This is a document model close to what Google Wave used. I think the advantages that this would bring should be self-evident, but to spell it out: it means that, where you needed to, you could collaborate with other people on the same document without needing to push track change versions around.

    The makers of substance are working towards making an open standard in this vein (http://interior.substance.io/modules/document.html). They need our support http://pledgie.com/campaigns/18902.

    The next thing is the underlying infrastructure for publishing. I love markdown, and there are a very large number of static site generation tools for taking markdown text and converting it into styleable sites. You can now host sites on Dropbox, Google Drive, GitHub, etc. My own blog runs this way. What are the key additional components we would want to see to support scholarly presentation?

  3. Mike Taylor says:

    BTW, check out the delightfully minimalist online MarkDown editor:
    http://pencil.asleepysamurai.com/intro#p

    Very quick and easy to use.

  4. Martin Fenner says:

    Ian and Mike, for the time being I would focus on using Markdown for writing, including of course collaborative writing with your coauthors. I assume that HTML, XML, PDF, ePub, etc. are the formats used further downstream, and changing that takes much longer and isn’t my first priority. It is of course a good question at which stage Markdown is translated into one of these other formats: before manuscript submission or after manuscript submission. As long as the text needs editing, Markdown might be the right choice; I’m not so sure after publication.

  5. Mike Taylor says:

    Martin, I didn’t mean to suggest that the format downstream should be changed — only that the MarkDown that we submit should be capable of carrying all the information that the downstream format needs, so that there is no manual processing required for references and other structured data. See also http://svpow.com/2012/12/12/where-do-formatted-references-come-from/

  6. Chris Rusbridge says:

    I thought it might be worth while summarising a twitter conversation between me and Martin here:

    Me: @mfenner is this the “solution to too many standards is to create another one” problem? (NB I like the idea…)

    Martin: @carusb yes, of course ;) . Pandoc markdown is very good start, but would for example like to add metadata a la Multimarkdown

    Me: @mfenner daft Q no 2; is an answer to extend pandoc to take multimarkdown input? Creates no new standards!

    Martin: @carusb extending pandoc is of course fine. The idea behind Scholarly Markdown is to come up with list of requirements, tools is 2nd step

    I suppose I’m suggesting some sort of Parsimony Principle: no new standards, markup or tools unless absolutely necessary (as each one of these fractures the community and reduces critical mass). Borrowing, enhancing and integrating existing standards, markup and tools fine, though.

    Anyway, this sounds like a Good Thing. Reminiscent of (but perhaps orthogonal to) the work @ptsefton used to do at USQ, eg as described in Sefton, P., Barnes, I., Ward, R., & Downing, J. (2009). Embedding Metadata and Other Semantics in Word Processing Documents. International Journal of Digital Curation, 4(2), 93–106. Retrieved from http://www.ijdc.net/index.php/ijdc/article/view/121/114

  7. Martin Fenner says:

    Mike, completely agree, and references are a good example. Pandoc supports many citation styles via Citation-Style Language (CSL) support, but I agree that it is an anachronism that authors should still worry about citation styles when journals reformat them anyway.

  8. Martin Fenner says:

    Chris, thanks for reporting our Twitter conversation, and for pointing out the risk of introducing yet another standard. With this post I wanted to highlight that there are different flavors of Markdown in common use (something not everyone is aware of), and that some of them are difficult to use for scholarly content.

    Markdown can only become a widely used standard for authoring scholarly content if a) you can use it for your typical scholarly manuscript and b) we have the tools for it. The best strategy would of course be to enhance one of the existing formats and tools (mainly Pandoc and/or Multimarkdown) just enough for what we need.

  9. Soli says:

    I think that the version of Markdown you’re looking for is called LaTeX…

  10. Mike Taylor says:

    For what it’s worth, Martin, I suspect that building the tools would be a fairly trivial exercise if we could once agree on the specification. I’ve written these kinds of tools many times before. (Too many!)

    I do wonder if Soli has a point, though. A LaTeX/BibTeX-based submission-to-production pipeline would do everything we want, wouldn’t it?

  11. Martin Fenner says:

    Soli, no, LaTeX is something else. It is a great authoring tool, but for me too heavyweight (e.g. I don’t need all the typesetting functionality). My goal is to have Markdown as a serious alternative to LaTeX and Microsoft Word, not to replace them. For similar reasons I don’t think HTML and XML are good markups for authoring tools.

  12. Thanks for the post Martin. I totally agree. Markdown is a good compromise between the expensive/unstable MS Word and the feature rich LaTeX. I can’t picture convincing any senior academic people to use LaTeX if they use Word, but I think they would use Markdown.

  13. Martin Fenner says:

    Thanks Scott. The number and variety of tools that can handle Markdown is amazing and growing. I’m particularly interested in Jekyll to produce static web pages, knitr to work with R, Texts as a Mac visual editor for Markdown files, and Marked to preview Markdown files. It seems that pandoc and discount are the most popular libraries to convert Markdown into HTML and other formats.

  14. Peter Sefton says:

    Thanks Martin, good idea.

    I think you need to add generalised support for semantic elements (to be rendered into RDFa or microdata) – ATM there’s no way to support this in any of the wiki-like markup languages without dropping in to HTML. The citations stuff would be a special case of this.

    pt

  15. Karthik Ram says:

    I wrote about how this might work in a recent blog post. Pandoc already supports citation styles with the --csl option. Math is also well supported via MathJax. Tables work fine for the most part, even without multi-markdown, but complex (especially multi line) tables are tricky. The great thing about this text based workflow is the easy conversion to any format + ability to version control.

    For comments, GitHub doesn’t allow line highlighting for markdown files (as it does for code, see this example). But it’s something that could easily be implemented.

    It would be easy to build this stack with some funding. Yoav and I have been talking about this. See his first stab at this effort.
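
As a rough illustration of the Pandoc features mentioned in this comment, the following sketch assembles a Pandoc command line with citation and MathJax support. The function name, parameters and file names are hypothetical; the flags are standard Pandoc options, and `--citeproc` assumes Pandoc 2.11 or later (older releases handled citations differently).

```python
def build_pandoc_command(source, output, bibliography, csl_style):
    """Return the argument list for a Pandoc run with citations and MathJax.

    Hypothetical helper for illustration; it only builds the command
    and does not execute Pandoc.
    """
    return [
        "pandoc",
        source,
        "--standalone",                    # emit a complete document, not a fragment
        "--citeproc",                      # assumption: Pandoc 2.11 or later
        f"--bibliography={bibliography}",  # reference database, e.g. a BibTeX file
        f"--csl={csl_style}",              # Citation Style Language file
        "--mathjax",                       # render TeX math with MathJax in HTML output
        "--output",
        output,
    ]


# Placeholder file names, chosen only for this example.
command = build_pandoc_command("paper.md", "paper.html", "refs.bib", "nature.csl")
print(" ".join(command))
```

The list could be handed to `subprocess.run(command, check=True)` on a machine where Pandoc is installed.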

  16. Martin Fenner says:

    Thanks Karthik. Do you think Pandoc does everything you need from Scholarly Markdown?

  17. Martin Fenner says:

    Pt, support for semantic elements would be great, but is probably a challenge for a lightweight markup language. The citation implementations in Multimarkdown and Pandoc seem to take a more traditional approach.

  18. Karthik Ram says:

    Hi Martin,
    Yes, I think it works fine for me for now. Additional features make it less lightweight and at some point one might as well use LaTeX or more “rich” markup. I think it would be fantastic to create markdown notebooks (just like ipython notebooks).

    I envision a document with an extension like .mdoc and a Python based web app opens it in a browser. There one can edit, link to local or remote bib files, and edit away. Live previews could be generated on the fly with the ability to export. One could share that `notebook` with others and that could serve as a container for all the associated files. Just an idea.

  19. Martin Fenner says:

    OK, it is important to not add too many features and focus on what is important for scholarly content.

  20. Peter Sefton says:

    I think the semantic support would be key to being able to map to formats like NLM XML – eg how do you say that a section is a ‘methodology’ in a portable way? I think the semantics could be added in a similar way to the linking conventions, ie put all the URIs at the end of the doc and refer to them in shorthand above.

    (I’ll cook up some examples when I get time)

  21. Frederik Elwert says:

    Have you evaluated reStructuredText? Pandoc also supports it as an input format, and it seems to be quite powerful in many regards. We have used it for storing text that should be converted to HTML and PDF (through LaTeX). It would still require some enhancements like improved citation support, but that seems to be doable, like the zot4rst extension shows.

  22. Martin Fenner says:

    Frederik, I have only briefly looked at reStructuredText, but it seems to have an interesting history. Markdown is only one of several lightweight markup formats, but may have the biggest momentum (if you don’t count MediaWiki markup).

  23. Chris Maloney says:

    This is a great idea, Martin. Since I haven’t noticed anyone else mention it in the comments above, let me point out that there is a W3C community group up and running to work, insofar as it is possible, towards standardizing Markdown: http://www.w3.org/community/markdown/. This was the result of a “call to action” blog post by Jeff Atwood of Stack Overflow fame: http://www.codinghorror.com/blog/2012/10/the-future-of-markdown.html.

    I would like very much to see this effort move forward, and perhaps I can interest some of the other PMC folks.

  24. Just to put an idea out there: a kind of mini-symposium via g+hangouts on this.

  25. Martin Fenner says:

    Thanks for the two links Chris. I like the call to action, although this may only address standard Markdown, not the specific issues important for scholars.

  26. Martin Fenner says:

    I like the idea.

  27. Mike Taylor says:

    Here’s my growing concern. How much complexity would we need to add to MarkDown for it to do everything we need for scholarly publishing? And by the time we’d done that, how much simpler would it be than LaTeX? I worry that we might be re-inventing an ad-hoc, poorly-expressed alternative to 90% of LaTeX.

  28. Martin Fenner says:

    Mike, this is exactly what I’m trying to figure out, see my blog post from yesterday. I think this can work if we say Markdown works for 80% of scholarly papers. Stuff that is heavy in math, has complicated tables, etc. is better written in LaTeX, but most papers are not.

  29. Robert Jacobson says:

    I am at a loss for what problem this solves. Is the problem that LaTeX is too hard for scientists to use? I just can’t believe that. Is LaTeX just too big? It’s big because the demands of scientific typesetting are great. Is LaTeX too complicated? Then learn just enough to do the basics, which is very, very little and very easy to do. Don’t want a tool that’s a hundred megabytes? Then you don’t want a tool that meets the demands of scientific typesetting.

    LaTeX isn’t good for the web, but HTML is designed to solve exactly the problems you seem to be outlining. HTML doesn’t do math well, but that problem is trivially solved by MathJax.

    So what exactly is the problem that Markdown is supposed to solve?

  30. Dana Ernst says:

    I second Peter’s comment. I’d like to at least listen in on such a G+ Hangout. By the way, there is a side conversation going on about this topic on one of my G+ posts.

  31. Martin Fenner says:

    Robert, LaTeX is a fine authoring tool and is the best writing tool for many scientists. You can say the same about Microsoft Word. I think there is a place for other authoring tools, and a tool built around Markdown looks like a good strategy to me. We shouldn’t reinvent LaTeX, but can try to do some things differently.

    You can of course write directly in HTML, but Markdown (and other lightweight markup languages such as Textile) were invented to make it easier to author HTML.

    The problem that Markdown is trying to solve is clear. It is also clear that other solutions to solve this problem exist already. We can build another solution with Markdown, and this solution could have some unique properties interesting to some scientists.

    Thanks for the G+ link, Dana.

  32. antistokes says:

    Half the reason I chose PLoS ONE for my manuscript on a cancer biomarker I submitted a couple of months ago was because you guys are one of the few bio journals with a decent LaTeX template. My lovely co-authors at the hospital I work with were very confused about this, especially when I told them that Word was extremely unprofessional.

    I’m not a math/physics person, I just know a lot of them; and they refused to help me format my theses unless I learned some simple LaTeX tricks. (When I asked for help with Word, they mocked me. Extensively. And then proceeded to inform me on how Bill Gates is evil personified.) I have never had a formatting issue since, and technically I’m “just a simple biochemist” (or at least, that’s what I tell my physics friends).

    LaTeX is very user friendly, usually you can find all you need to know with a simple Google search. I wouldn’t even dream of using anything but JabRef/BibTeX to manage my continuously growing bibliography, the Word citation management system is simply barbaric.

    Never even heard of this …. “Markdown” thing. It strikes me as someone trying to fix something that ain’t broken.

  33. Chris Maloney says:

    Here are my two cents on the “fix something that ain’t broken” argument. There is nothing wrong with having more than one authoring environment. Each markup syntax, whether it’s HTML or LaTeX or Markdown, has a learning curve, and while you may not have heard of Markdown, millions of others have, and tens or hundreds of thousands use it every day. It’s very popular among software developers, for instance.
    Since you know LaTeX, you are probably loath to learn yet another syntax … most people who know Markdown would probably feel the same way about LaTeX. I think a scholarly flavor of Markdown would lower the barrier for those would-be authors considerably.

  34. antistokes says:

    I thought the point of being in research- especially academic research- was to be constantly learning new things. So, I wouldn’t mind learning an entirely new system, so long as it was free. However, LaTeX is already the “gold standard” for math, physics, computer science, and statistics (to name a few– although I’m still trying to convince my chemistry professors that Word is of the devil). Which field, exactly, uses Markdown as its standard for publishing…? And can it force figures and tables to go where you want them to, and does it have a \cite{} and \ref{} capability so I don’t have to order my references and figures by hand…?

  35. Martin Fenner says:

    LaTeX is the gold standard for the disciplines you mention. In many other disciplines (I published most of my papers in life sciences) LaTeX is not the standard for authoring manuscripts.

    Markdown is very popular for authoring content that will be published on the web as HTML; it is not (yet) a standard for scholarly publishing. Some Markdown flavors (e.g. pandoc) handle tables, figures and references, but this certainly can be improved.

    In my mind a major problem with authoring in LaTeX (and Microsoft Word) is that these tools are not just authoring tools, but also publishing tools. When I write a manuscript I don’t really care about font faces, font sizes, line heights, document margins, etc. HTML nicely separates the content from the layout using CSS (and often Javascript), and Markdown follows a similar philosophy. I prefer to leave it up to the publisher (or the blog engine if you publish something on the web) to decide how the text should look.

  36. antistokes says:

    “When I write a manuscript I don’t really care about font faces, font sizes, line heights, document margins, etc.”

    That’s the nice thing about LaTeX, though: if the publisher hires a comp sci major to work out a template, it’s really just plug and play after that (I don’t write templates, I just annoy the comp sci majors when they don’t work). I was able to completely re-format this particular manuscript in the space of a day for two different ACS journals because I could just plug the plain text into the respective LaTeX templates. The bibliography and referencing styles between JACS and Biochemistry are different, but it’s all automated with LaTeX; so I didn’t have to worry about it. To be honest, one thing the life science journals could do to make it easy on authors would be to just accept plain text and figures, and then just plug the content into their particular template. As a side benefit, this would create a job for people who know LaTeX.

    “In many other disciplines (I published most of my papers in life sciences) LaTeX is not the standard for authoring manuscripts.”

    I know. Have you heard the old joke about how medicine is merely an uninteresting application of biology, biology is merely a small section of the periodic table, the periodic table is mostly just simplified condensed matter physics, and the mathematicians laugh at us all? I think biologists in particular need to stop talking to their peons in medicine and start talking to physicists (I usually just let the physicists talk to the math people, otherwise they get confused about improperly defined acronyms)– and respect the fact that the physical sciences already have a solution for the formatting problem. (Much like the math for quantum mechanics had been worked out long before physicists started delving into the nature of electronic transitions.) I started out as a 16 year old girl working at the Fred Hutch as a lab tech in the late 90s, and I’m more familiar with the genetics/molecular biology/etc. literature than any physical chemist should be (I do interdisciplinary cancer diagnostics work now; and trying to get the laser physicists, engineers, and neurosurgeons to agree on something is like trying to herd cats). I’ve published in both life science and physical science journals, and the life sciences ones always seemed overly ornate in their formatting rules to me…..

  37. The issue with both HTML and LaTeX as authoring languages is that they allow — and therefore inadvertently encourage — an author to include layout-specific instructions, rather than forcing the manuscript to be written in a way that’s independent of layout. And layout independence is going to be increasingly important as readers demand to be able to read stuff on different sized devices.

    At their core, both LaTeX and HTML support lots of layout independent stuff, which is of similar complexity to learn. In LaTeX you write

    \section{Introduction}

    in HTML it’s

    <h1>Introduction</h1>

    and whether you prefer writing one or the other, or getting some WYSIWYG authoring tool to generate that for you is just a matter of preference / habit. But the problem then is that both mechanisms then let you do stuff that presumes a particular layout for the reader; things like “Put this diagram here”, or “Leave a space of exactly 1cm before this next thing.”. Learning how to do that is hard (ish) in both systems; and if you do insist on doing it, it then makes it much harder to convert your manuscript into another format. The problem is, that as soon as a system offers you these possibilities, it’s quite tricky as an author not to use them. As Martin points out, Word and LaTeX (and also in my mind HTML) are combined authoring and publishing tools, and this muddies the waters considerably.

    My ideal world would include the following:

    1) A specification for what a scholarly article should contain in terms of metadata and structure elements. Who cares if this is stored in JSON / XML / something else; this should never need to be seen by authors, it’s only of interest to the people building infrastructure.

    2) Tools that target that spec that authors like to use. These could be graphical interfaces, extensions to Word, MarkDown itself (with or without a GUI), or ‘lint’ like parsers for HTML/LaTeX that check that nothing naughty has been included that can’t be mapped automatically to the spec from 1, or which assumes a particular layout of the final document.

    3) Agreement on how the spec from 1 is stored electronically in various formats (e.g. XML, JSON, RDF, blah blah etc)

    4) Mechanisms for turning the stuff stored in 3 into the formats that readers like to consume (PDF, ePub, HTML, TheNextNewExcitingThingWhateverItIs).

    5) A mechanism for keeping associations between the different representations in 4 ‘in sync’ (so that a comment made on one representation of an article, or an altmetric related to it, can be associated with the others).

    Then authors get to write in whatever tool they like; readers get to consume in whatever formats they like, and everything stays nicely in sync (a classic Model View Controller scenario for those of us of a nerdy persuasion)
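
The separation between one stored article model and several reader-facing formats can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Article` and `Section` classes, their fields and both renderer functions are hypothetical, not part of any existing specification, and the LaTeX renderer does no escaping of special characters.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Section:
    title: str
    paragraphs: list = field(default_factory=list)


@dataclass
class Article:
    # Hypothetical minimal model: a real specification would carry far more metadata.
    title: str
    authors: list
    sections: list


def to_html(article):
    """Render the model as layout-independent HTML5."""
    parts = [
        "<article><header>",
        f"<h1>{escape(article.title)}</h1>",
        f"<p>{escape(', '.join(article.authors))}</p>",
        "</header>",
    ]
    for section in article.sections:
        parts.append(f"<section><h2>{escape(section.title)}</h2>")
        parts.extend(f"<p>{escape(text)}</p>" for text in section.paragraphs)
        parts.append("</section>")
    parts.append("</article>")
    return "\n".join(parts)


def to_latex(article):
    """Render the same model as semantic LaTeX (no escaping in this sketch)."""
    authors = " \\and ".join(article.authors)
    lines = ["\\title{" + article.title + "}", "\\author{" + authors + "}"]
    for section in article.sections:
        lines.append("\\section{" + section.title + "}")
        lines.extend(section.paragraphs)
    return "\n\n".join(lines)


# "A. Author" is a placeholder name used only for this example.
paper = Article(
    title="A Call for Scholarly Markdown",
    authors=["A. Author"],
    sections=[Section("Introduction", ["Markdown focuses on content."])],
)
print(to_html(paper))
print(to_latex(paper))
```

Because neither renderer stores anything of its own, a comment or metric attached to the model applies to every output format.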

    I think this is a long rambling way of saying that I think Markdown (or something like it) will have a crucial role to play in next-gen publishing workflows.

  38. Kaveh says:

    I agree with Steven that LaTeX allows logical and visual formatting. Ideally it would only allow the former. But I would say it is better than Word, as at least you can see the mark-up, whereas Word hides it in its proprietary format. Our experience as typesetters, is that the majority of TeX users are well behaved, and like the idea of generic markup. So the clean-up necessary in order to create an XML file is less than the equivalent in Word.

    What is really depressing is that most of the publishing industry have moved to a Word-based workflow, as the majority of submissions to them are in Word. So some are positively discouraging LaTeX submissions. And most typesetters will convert a TeX file into Word, then use expensive software to do “semantic enrichment”!! Yes, that is the reality now. No wonder authors are angry!!

    At the risk of self-publicity, we take the opposite approach. We’ll convert all author content, even poetry, into TeX, then mark up semantically, and everything is automated thereafter. And we use only open free software!!

  39. One of the issues with LaTeX (and HTML) is that the markup is very verbose. As someone who has written 100s of Answers on StackOverflow and edited countless more Questions, I appreciate the paucity of keyboard strikes required to add common markup; wrapping text in “`foo`”, “*foo*” or “**foo**” will result in code/typewriter font, italic and bold respectively. Every time I write a blog post using WordPress I find myself hitting “`” to add code chunks rather than typing out “<code>” tags. Once you’ve used markdown a bit, writing documents in LaTeX or HTML is just a pain.

    Those espousing the virtues of LaTeX are perhaps missing that point.

    Also, Markdown is often used as an intermediary format allowing rapid production of a document that can subsequently be rendered to other formats or markup languages. People/Publishers could still use a LaTeX template and render the md source using that template if they wanted the layout control provided by LaTeX.

    Also, we don’t need to discard all our hard-won LaTeX fu; Pandoc flavour MD allows LaTeX math markup and also raw TeX markup, but including those has implications for the formats to which the MD can be rendered, the latter being the most restrictive.

    On the collaboration front; do we even need to think now about this? Collaborative text editors already exist and as MD is plain text, users could be free to employ whichever of these existing tools they like. Keeping to the MD source also would allow people to use other workflows such as version control via github etc.

    I do like the idea of extending MD to allow for scholarly-specific information or metadata. Much of what is wanted in Martin’s list is available via Pandoc, but not the metadata.

  40. A further thought; couldn’t we do enough with Pandoc markdown for the main markup but use the same ideas as static site generators such as Jekyll and add YAML (or whatever) metadata to the top of the md sources. Then have something process the file, including the YAML metadata where required and then use Pandoc to convert the resulting “compiled” md source into the final document?

    Jekyll is already doing this for web documents and Pandoc MD covers most things we need for scholarly publishing. Perhaps we don’t need to reinvent anything except a new Jekyll-like script that handles YAML but runs the document via Pandoc.
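
The first step of such a Jekyll-like script, separating the metadata block from the Markdown body, might look roughly like the sketch below. It is a simplification under stated assumptions: the function name is hypothetical, only flat `key: value` lines between two `---` delimiters are understood, and a real tool would use a proper YAML parser before handing the body to Pandoc.

```python
def split_front_matter(text):
    """Split a document into (metadata, body).

    Minimal sketch: only flat `key: value` lines between two `---`
    delimiter lines are understood, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[index + 1:])
            return metadata, body.lstrip("\n")
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


source = """---
title: A Call for Scholarly Markdown
author: Martin Fenner
---

Markdown is a lightweight markup language.
"""

metadata, body = split_front_matter(source)
print(metadata)
print(body)
```

The returned dictionary could then drive template variables, while the body goes through the normal Markdown conversion.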

  41. Apropos of nothing in particular (and yes, I know it’s too late for my letter to Santa), it would be really nice if the citation part of a markdown-like language supported citation by just giving a DOI, i.e. rather than having to create a reference list with its own numbering scheme and then refer to that from the body text, just being able to cite by doing something like [DOI:10.1087/20110309] in the source, and letting the pipeline turn that into something nicely human-readable, would be great (Endnote might not be so keen, of course). Obviously not every paper can be cited like this, but a growing number can.

    Right then, who is going to make all this happen?
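
One way the DOI-based citation idea could work in a pipeline is sketched below. The `[DOI:...]` marker is taken from the comment; the regular expression, the function name and the numbered output style are assumptions made for illustration, and no metadata is fetched for the DOIs.

```python
import re

# Assumed syntax, following the comment: [DOI:<doi>]
DOI_CITATION = re.compile(r"\[DOI:(10\.\d{4,9}/[^\]\s]+)\]")


def number_doi_citations(text):
    """Replace [DOI:...] markers with [n] and return the DOIs in citation order."""
    order = []

    def replace(match):
        doi = match.group(1)
        if doi not in order:
            order.append(doi)
        # Repeated citations of the same DOI reuse the same number.
        return f"[{order.index(doi) + 1}]"

    return DOI_CITATION.sub(replace, text), order


text = "Shown before [DOI:10.1087/20110309] and confirmed later [DOI:10.1087/20110309]."
numbered, dois = number_doi_citations(text)
# Build a plain reference list that links each DOI through the doi.org resolver.
references = [f"{n}. https://doi.org/{doi}" for n, doi in enumerate(dois, start=1)]
print(numbered)
print("\n".join(references))
```

A fuller version would look up each DOI's bibliographic record and format it with a citation style instead of printing the bare link.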

  42. Mike Taylor says:

    I’ve been following all these comments.

    I have to say I am increasingly of the opinion that whatever MarkDown variant we ended up coagulating to support all our needs would be nine tenths as difficult to use as LaTeX and only half as powerful. Plus we’d need to build all the tools and would at best achieve adding yet another format to the bestiary.

    So even though I am myself working in a field (palaeontology) where working and submitting in MS-Word is ubiquitous, I find myself agreeing that the right fix is just for everyone to switch to LaTeX (specifying semantics only, not presentation).

  43. As much as I love using LaTeX myself, one of the major downsides of it is that it’s a fairly hefty suite of software (the Mac package with all its styles and support tools is a 1Gb download) most of which is there to support fonts and layout stuff (it does lovely ligatures and kerning) rather than the ‘semantics’ that we’d want for scholarly publishing. Although builds exist for most platforms, it’s not a trivial beast to compile (anybody tried making it work on a tablet yet?).

    And, unless it was somehow an artificially cut down version of LaTeX, it would allow people to use the unsemantic layout-specific features which would make presentation on multiple devices a pain.

    I’ve tried for years to convince other non Computer Sciencey colleagues of the benefits of writing in LaTeX but to no avail; I think it’s just too daunting with no obvious benefit.

    So although I love it, I’m not convinced it’s the right thing for most authors.

    But please don’t tell any of my colleagues that.

  44. Carl Leubsdorf says:

    Very interesting discussion! I’ve been a fan of wiki markup, markdown, and similar writing methods for a long time, and I can appreciate the idea of a specialized markup for Scholarly content. The debate in this thread, including the sub-thread about whether or not LaTeX is a suitable composition language for articles, reminds me of the original discussions at Beyond The PDF and earlier about the difficulty of creating structured text in any form for writers.

    This challenge is something that Annotum attempts to solve via a somewhat imperfect WYSIWYM/WYSIWYG browser-based article editor, but there is something about the clarity and simplicity of markdown that makes it a very appealing approach.

  45. Carl Leubsdorf says:

    One of the great things about markdown is that there are many existing implementations that could be used as a base – I like the Github flavor but existing WordPress / PHP implementations are also fine – and I’d be very interested in working with others on some extensions including tables (I prefer the OpenWiki syntax, personally) and equations (easy enough using the Google Chart API and a bit of LaTeX).

    However, I’m stuck on something that Kaveh wrote:

    What is really depressing is that most of the publishing industry have moved to a Word-based workflow, as the majority of submissions to them are in Word.

    We get a lot of feedback about the inability of Annotum to accept content formatted in Word – and as many of us have experienced, Word produces, for want of a better word, garbage markup that is best completely stripped. But from an end-user perspective, people – scholars and others – like to use Word. I just don’t see end users flocking en masse to markup or any other text-based formatting/structuring approach. I think the biggest challenge with text-based article creation is that there’s always a ‘compilation’ step – you have to generate the viewable version, or in the case of XML, parse it against a schema; with Word, what you see on the screen is what prints out. That’s why I’m still chasing the holy grail of a true structured WYSIWYG editing tool that can emit NLM XML natively.

    In the end, the overall goal, as I see it, is to obtain structured text, in NLM XML, as easily as possible. Markdown may well be a way to do this, but it is unlikely to be the way.

    We are currently working on a version 2.0 of the Annotum article editor – a completely revamped WYSIWYG tool that will include better strict XML schema enforcement along with a much more robust editing experience, that should get us much closer than Annotum 1.1 did, but Martin’s article has inspired me to pursue a parallel path using markdown.

    As I look at some of the existing online markdown tools, for example Dillinger or online-markdown-editor, I see an interesting use-case for a tool that allows one to enter an article using a simple web form, and from it generate NLM XML directly – which could be imported into Annotum or sent straight up to NLM (or imported into existing publishing workflows that “speak NLM”). The current markdown tools already convert markdown to HTML, so it should be a small (huge?) leap to modify one of them to create XML.
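
    A rough sketch of that leap, in Python. It assumes blocks are separated by blank lines, handles only `#` headings and plain paragraphs, and emits nested `sec`, `title` and `p` elements in the spirit of NLM XML. The function name and the tiny supported subset are illustrative assumptions, and the output is not validated against any NLM schema.

```python
import xml.etree.ElementTree as ET


def markdown_to_sections(text):
    """Convert '#'-style headings and paragraphs to nested <sec> elements."""
    body = ET.Element("body")
    # Stack of (heading level, element); level 0 is the <body> root.
    stack = [(0, body)]
    for block in text.strip().split("\n\n"):
        # Join the lines of a block into one line of text.
        block = " ".join(line.strip() for line in block.splitlines()).strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            title = block.lstrip("#").strip()
            # Close any open sections at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            sec = ET.SubElement(stack[-1][1], "sec")
            ET.SubElement(sec, "title").text = title
            stack.append((level, sec))
        else:
            # A paragraph belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "p").text = block
    return body


sample = (
    "# Introduction\n\nSome text.\n\n"
    "## Background\n\nMore text.\n\n"
    "# Methods\n\nDetails."
)
xml_output = ET.tostring(markdown_to_sections(sample), encoding="unicode")
print(xml_output)
```

    A real converter would start from an existing markdown parser rather than splitting on blank lines, and would still need front matter, references, tables and figures.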

    I would be very interested in working on this – who’s with me?

  46. Kaveh says:

    @Steve: The world is going into the cloud, and there are now several cloud based Latex platforms, e.g.

    https://www.writelatex.com/
    https://www.sharelatex.com/

    which allow collaboration too. I think this is a good step and will reduce the threshold for using Latex. (We are working on our own version, to be released soon.)

    @Carl: I like your approach of using WordPress (in Annotum), and we are working along similar lines.

    I think that in the end authors will only change the way they work if:

    1. There is a real incentive, e.g. faster publication, lower APC, etc.
    2. A system is actually easier and nicer to use.

    So we cannot force authors to change ways, just give them alternatives which are more attractive.

  47. So, great discussion. But at the start, Martin says…

    “I propose that we as a community create a new Scholarly Markdown flavor, which takes into account most of the use cases important for scholarly content.”

    What we gonna do about it?

  48. Chris Maloney says:

    I see Martin has started a GitHub project here: https://github.com/mfenner/scholarly-markdown. Perhaps we could also start a Google group to provide a mailing list?

    The GitHub project has a wiki that, I see, is enabled for editing by anyone. Perhaps we could use that as a platform for beginning to enumerate requirements?

  49. I love this idea. I really believe in moving articles to light-weight markup languages that specify semantics, and are able to be rendered in many different formats (I want to be able to read articles on my Kindle, or iPhone, not having to scroll around in some fancy PDF). I know you are mostly talking about process, not final product – but I hope that changes too… The time I spent formatting my MA thesis to perfectly match my department’s specifications (first page in chapter has page number centered at bottom, subsequent pages have it right-aligned on the top) could have been used more productively.

    I know that LaTeX is used a lot in the hard sciences, but in my field (education) it is virtually unknown, and I personally think it looks very cluttered. The only output I ever see is PDFs – can you render HTML pages, or ebooks from LaTeX? Most of the articles I write have very simple layout – a few levels of headings, abstract, text, italics and bold, and of course citations etc. And perhaps some illustrations and references. I write all my articles in Scrivener anyway.

    And for more advanced stuff like graphs, I really enjoy the work being done with “executable documents” both by for example R and knitr, and iPython notebooks (both based on Markdown), see for example http://www.r-bloggers.com/knitr-github-and-a-new-phase-for-the-lab-notebook/.

    I also very strongly second one of the first comments, saying that we should include unique identifiers for cited documents, and have the system render the citations, rather than worry about BibTeX etc. I built an academic workflow around citekeys, which allow me to add a publication to BibDesk, and then cite it in my wiki, WordPress blog, and in scholarly publications as [@scardamalia1999knowledge] (http://reganmian.net/wiki/researchr:start), but it would be much better if this was universal. DOIs are a start, although they will likely never cover all the publications we need (DOIs cost money, and almost no OA journals provide them, for example).
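
    As a small illustration of the citekey idea, here is a hedged Python sketch that collects the keys cited in a text using the `[@key]` form shown above. The regular expressions and the function name are assumptions for illustration; they cover simple keys and multiple keys per bracket, not every key a citation processor might accept.

```python
import re

# A bracketed group that contains at least one "@", e.g. [@a; @b, p. 3].
BRACKET = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
# A citekey: "@" followed by letters, digits and a few punctuation marks.
KEY = re.compile(r"@([A-Za-z0-9_][\w:.-]*)")


def find_citekeys(text):
    """Return the distinct citekeys in order of first appearance."""
    keys = []
    for group in BRACKET.findall(text):
        for key in KEY.findall(group):
            key = key.rstrip(".:-")  # drop trailing punctuation
            if key and key not in keys:
                keys.append(key)
    return keys


example = (
    "Knowledge building [@scardamalia1999knowledge] matters, "
    "as others note [@smith2010; @jones2011, p. 3]. See again [@smith2010]."
)
print(find_citekeys(example))
```

    The keys `smith2010` and `jones2011` are made up for the example. A list like this is what a tool would resolve against a bibliography database or a set of identifiers before rendering the citations.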

    One big barrier to “plain-text” academic authoring in my experience, is a good system for collaborating. I love the idea of version controlling articles in plain-text (with markup), but whatever else you say about Word, “Track Changes” is incredibly useful. My supervisor is quite tech savvy, and I might convince him to read a draft I sent him in Markdown, but the tools for tracking changes in code are not good enough for editing prose. If you could easily show an in-line word-based diff in the editor, you’d be partway there, but you could still not accommodate for example comments. Almost all of my publications are collaborative, and even the ones that are not, are usually read and commented on by others, so this is an important point.
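
    An in-line word-based diff of this kind can be sketched with Python's standard `difflib` module. The `[-removed-]` and `{+added+}` markers and the function name are choices made for this example, not an existing editor feature, and whitespace is normalised to single spaces.

```python
import difflib


def word_diff(old, new):
    """Return an in-line diff of two texts, compared word by word."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        # A "replace" is shown as a deletion followed by an insertion.
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


before = "The results are very significant"
after = "The results are statistically significant"
print(word_diff(before, after))
# The results are [-very-] {+statistically+} significant
```

    This covers changed words only; comments attached to a passage would still need their own mechanism.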

    Anyway I hope this goes forward, and I’ll contribute if I can!

  50. Jason Moore says:

    I agree that scholars and publishers alike would benefit from a standard lightweight markup language. But maybe markdown doesn’t quite cut it. For example, reStructuredText already has all of your desired features. I wrote my entire dissertation in the rst format: moorepants.github.com/dissertation and it was a pleasure.

  51. Frederik Elwert says:

    Jason, I also think that reST is a great candidate, especially since it has a well-defined parsing model, which markdown currently seems to lack. And it is extensible, which might be a good way to implement specific features for scholars. That bibtex integration into reST via Sphinx is definitely interesting!

    But I also understand that markdown is quite popular in different places, so I also see its benefits.

  52. Phillip Lord says:

    @Mike Taylor — Mike, references could indeed be to a “universal database”. You just use links. Metadata can be resolved for DOIs from their registration agency, arXiv, pubmed IDs by theirs, and tools like our own Greycite can provide metadata from any URL.

    @Steve Pettifer — most or all of it exists in some format already:-)

  53. January Weiner says:

    Hi Martin,

    I think the idea is great; especially as I have been trying to stitch together something like that for a while now :-) . We all hate writing manuscripts not because we don’t like to write, but because we are forced to focus on boring and mundane things like formatting bibliography and *manually* keeping track of the changes. Yes, manually, because if you have ever used a proper version control system then you know that writing a manuscript between five groups and twenty authors using Word and Word-like tracking of changes *is* manual work.

    A ubiquitous, widely spread markdown would not only make the life of a single author simpler, but would also solve the problem of collaboration and version control (not to mention manuscript submission). The current situation is just painful.

    If you have specific ideas or would like to collaborate on that and other issues, contact me at my e-mail address.

  54. rmflight says:

    First: about the proposal

    This doesn’t seem like a bad idea. I have been using Markdown thanks to R and Knitr support for it as a format for reproducible research. I don’t know that we need anything separate for citations, as citation display is highly overrated: it was necessary in the age of print journals, but not so much anymore. Better support of internal links might help, but formatting support (a la CSL, for example) I don’t think is that important. Easy inclusion of citations is a plus (i.e. BibTeX), but I think this will come down to authoring environment and processing of intermediate (simple pointer insertion) to final document (full reference available).

    Semantics I think could be handled better through the simple use of tags. HTML can be directly incorporated into Markdown where necessary, and semantic markup using XML might be a decent compromise.

    Second: for all those saying Latex solves the problems

    Latex was a great solution when we were extremely worried about PRINT layout, and what the final document would look like. It is also possible to write a Latex document worrying more about the content rather than the layout, but that almost never happens, and the amount of backend to even write simple Latex can be a challenge (I’ve done R vignettes, and collaborated on a paper authored in Latex). Reading raw Latex to find mistakes is challenging, and collaborating with non-latex users will be challenging.

    In comparison, reading and finding mistakes in Markdown is pretty good, and the simple markup makes it apparent what is what. For example, here is a conference paper I recently did in Markdown (now, the original was in Word, but I like the Markdown better) https://raw.github.com/rmflight/affyMM/master/flightetal_draft.md

    Inclusion of math using MathML or Latex formulas is easy (thanks to MathJax library integration) for final display. I hope Markdown makes it as an authoring format. I know for people writing analyses in R and Python, using KnitR and iPython notebook for reproducible analyses means that Markdown is already making progress in this regard.

    I like Markdown because it forces me to think about the content, and only about the content, without worrying about the style until much later in the game. Latex makes me think about the style initially, because in order to read it much at all, I need more than the basic text editor I’m writing the Latex in, I want to see the PDF, because the markup is verbose.

  55. Chad Black says:

    I’m a bit late to this discussion. As an historian, I love writing in markdown. I use it for all of my notetaking, but also for transcribing documents, writing articles and presentations, and now for chapters of a book manuscript. For a long time I used multimarkdown (in part because of its integration with Scrivener), python markdown for web development, and now pandoc. The one thing I thought I missed in pandoc was multimarkdown metadata, as mentioned in comments above.

    But! But, there’s a not well documented extension/option in new versions of pandoc that allows for the use of mmd-style metadata. So, we may already be 95% there.
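
    For readers who have not seen mmd-style metadata: it is a block of `Key: value` lines at the very top of the document, ended by a blank line. The Python sketch below splits such a block from the body. It is an illustration only, not pandoc's or MultiMarkdown's actual parser. The function name is hypothetical, and it assumes indented lines continue the previous value and that a document without metadata does not start with a line containing a colon.

```python
def split_metadata(text):
    """Split a leading 'Key: value' block from the rest of the document."""
    metadata = {}
    lines = text.splitlines()
    last_key = None
    for index, line in enumerate(lines):
        if not line.strip():
            # The first blank line ends the metadata block.
            body_start = index + 1
            break
        if line[0].isspace() and last_key:
            # Indented line: continuation of the previous value.
            metadata[last_key] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            last_key = key.strip().lower()
            metadata[last_key] = value.strip()
        else:
            # Not a metadata line: treat the whole text as body.
            return {}, text
    else:
        body_start = len(lines)
    return metadata, "\n".join(lines[body_start:])


document = (
    "Title: A Call for Scholarly Markdown\n"
    "Author: Martin Fenner\n"
    "Keywords: markdown,\n"
    "    publishing\n"
    "\n"
    "Markdown is a lightweight markup language."
)
meta, body = split_metadata(document)
print(meta["title"], "|", body)
```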

    I usually use BibDesk, or single per-project bibliography files exported from Zotero, with pandoc for citation management.

  56. Ron DuPlain says:

    Interesting proposal, and I’ll comment on a few of the issues here. Some context: I am interested in a different kind of document, the technical memo for software developers. Since our output is machine executable, professional software developers love clear, unambiguous deliverables. We know READMEs in our fingers, but not technical memos. Docs or weblogs sometimes, but too often a key discussion goes undocumented. My goal is to build something with clear utility (.md or .rst converted into .html or .pdf) when working on software concerns which do not otherwise have clear deliverables. It is an important asset to build when your mind is fresh on the nuances. [Note I have specific engineers using this, so my feature set is finite and small.]

    That project is linked here, and is just a wrapper around pandoc.
    https://github.com/rduplain/memo-builder

    Multiple Objectives

    To understand the available tools and why they fall short, note that there are multiple objectives at play. Since this is a scholarly bunch, the topic to research is Pareto optimality.

    You can see Markdown’s objectives in its documented philosophy: “Markdown is intended to be as easy-to-read and easy-to-write as is feasible. Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.”

    LaTeX appears perfect, because it meets its objectives very well. Hence, depending on your objectives, LaTeX is indeed perfect, and you will wonder why anyone wants anything else. Readability of the source is not one of them, it seems. Spend some time with markdown and you will understand its ultimate objective of “Readability, however, is emphasized above all else.” Readability is an elusive quality, and should be evaluated subjectively.

    Adding features to markdown will trade readability for structured data, in general, but readability is the ultimate goal, allowing trade-offs where appropriate (insert other objectives here).

    reStructuredText

    If you are evaluating reStructuredText (.rst), note that pandoc’s implementation of .rst is not nearly as fully implemented (compared to docutils) or documented as markdown, particularly when it comes to rst directives. Pandoc’s design and code are so impressive, that it is still the place to start for any text formatting project.

    Extensions & Preprocessors

    The proposal for a Scholarly Markdown could be implemented as extensions to pandoc’s markdown. Pandoc already has a clear precedent for creating extensions and a configurable execution for enabling/disabling extensions. What is missing? Prototype an extension. Repeat.

    For those features which require yet more reduction for readability, you could think of a preprocessor approach. Use a simplified version of formatting (bibliographic references, tables, …), run some tool against it to either produce something pandoc understands or something that can be stitched together with pandoc’s output on the other end. That is, there is nothing to stop you from using pandoc to output .tex and stitch things together there.
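
    A minimal sketch of the preprocessor approach, in Python. The `{{cite: ...}}` shorthand is a hypothetical simplified syntax invented for this example. The script rewrites it into the bracketed `[@key1; @key2]` citation form that pandoc's markdown reader understands, so that the result can then be handed to pandoc.

```python
import re

# Hypothetical shorthand: {{cite: key1, key2}}
SHORTHAND = re.compile(r"\{\{cite:\s*([^{}]+?)\s*\}\}")


def expand_citations(text):
    """Rewrite {{cite: a, b}} into pandoc-style [@a; @b]."""
    def replace(match):
        keys = [k.strip() for k in match.group(1).split(",") if k.strip()]
        return "[" + "; ".join("@" + k for k in keys) + "]"
    return SHORTHAND.sub(replace, text)


draft = "As shown before {{cite: smith2010, jones2011}}."
print(expand_citations(draft))
# As shown before [@smith2010; @jones2011].
```

    The same pattern works in the other direction: let pandoc produce .tex, then run a small script over that output to stitch in whatever the shorthand stood for.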

    The Community vs. The Committee

    Committees require consensus. Communities can have BDFLs. Part of the reason for markdown’s success, I think, is that its specification existed in one head before it was published. I am pro-standards, but I would be wary of seeking too much consensus before getting a working prototype up and running which addresses the objectives. That prototype can be put out for comments. Which means you need to be really clear up front (now) on the *objectives*, not the *requirements*. These are your guiding principles in figuring out what stays and what goes.

    You know, minimize X and maximize Y with constraint Z.

    Collaboration

    I use git diffs for all sorts of content, and I am really productive doing so. I never miss a track-changes feature, but I’m usually overwhelmed by how much UI cruft it adds (arrows and colors, arrows and colors everywhere!). Specifically to prose, open source developers collaborate to write docs. Instead of tracking changes, we just commit and push to master if we’re confident or a topic branch if we’re not.

    That said, tools like Google Docs are amazing. Support for multiple editors in real-time and comment threads with resolve buttons are too productive to ignore. The tools are out there, though. For starters, just set a Google Doc to monospace font, and write a simple script against the Docs API. Therefore, I suggest that collaboration is out of scope for a Scholarly Markdown dialect.

    By the way, if you want to talk about cult-like preferences, I will wonder why anyone wants to use anything but vi or emacs.

    Sweave & Noweb

    I will add a new idea to the mix: could/should Scholarly Markdown support literate code? This is a question for the data-oriented crowd. It is the noweb approach, or Sweave for R. Instead of embedding a figure in the manuscript, you place code in the manuscript to generate the figure. That might sound strange, but it is very powerful for visualizing data.
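
    To make the idea concrete, here is a hedged Python sketch of the "tangle" half of the noweb approach: pulling the code out of a manuscript so it can be run. It assumes noweb-style chunks that open with `<<name>>=` on its own line and close with a lone `@`. The function name and the sample manuscript are made up for this example, and chunk options and chunk references are ignored.

```python
def tangle(text):
    """Collect noweb-style code chunks from a manuscript, keyed by name."""
    chunks = {}
    current = None  # name of the chunk being read, if any
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped.startswith("<<") and stripped.endswith(">>="):
                current = stripped[2:-3].strip()
                chunks.setdefault(current, [])
        elif stripped == "@" or stripped.startswith("@ "):
            # A lone "@" switches back from code to prose.
            current = None
        else:
            chunks[current].append(line)
    return {name: "\n".join(lines) for name, lines in chunks.items()}


manuscript = """\
Figure 1 shows the distribution of the measurements.

<<figure-1>>=
values = [1, 2, 3]
total = sum(values)
@

The total is reported in the text.
"""
code = tangle(manuscript)
print(code["figure-1"])
```

    The other half, "weave", would run each chunk and place its output (a figure, a table, a number) into the rendered manuscript at the position of the chunk.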