Rich Citations: Open Data about the Network of Research

Why are citations just binary links? There’s a huge difference between the article you cite once in the introduction alongside 15 others, and the data set that you cite eight times in the methods and results sections, and once more in the conclusions for good measure. Yet both appear in the list of references with a single chunk of undifferentiated plain text, and they’re indistinguishable in citation databases — databases that are nearly all behind paywalls. So literature searches are needlessly difficult, and maps of that literature are incomplete.

To address this problem, we need a better form of academic reference. We need citations that carry detailed information about the citing paper, the cited object, and the relationship between the two. And these citations need to be in a format that both humans and computers can read, available under an open license for anyone to use.

This is exactly what we’ve done here at PLOS. We’ve developed an enriched format for citations, called, appropriately enough, rich citations. Rich citations carry a host of information about the citing and cited entities (A and B, respectively), including:

  • Bibliographic information about A and B, including the full list of authors, titles, dates of publication, journal and publisher information, and unique identifiers (e.g. DOIs) for both;
  • The sections and locations in A where a citation to B appears;
  • The license under which B appears;
  • The CrossMark status of B (updated, retracted, etc);
  • How many times B is cited within A, and the context in which it is cited;
  • Whether A and B share any authors (self-citation);
  • Any additional works cited by A at the same location as B (i.e. citation groupings);
  • The data types of A and B (e.g. journal article, book, code, etc.).
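To make the list above concrete, here is a minimal sketch of how a single rich citation might be modeled in Python. The class and field names (`Work`, `Mention`, `RichCitation`) and the placeholder DOIs are illustrative assumptions, not the schema the rich citations API actually returns, and the self-citation check is a naive exact-name match rather than real author disambiguation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    """Bibliographic information for a citing or cited work (hypothetical schema)."""
    doi: Optional[str]
    title: str
    authors: List[str]
    work_type: str = "journal article"      # e.g. book, code, data set
    license: Optional[str] = None
    crossmark_status: Optional[str] = None  # e.g. "updated", "retracted"


@dataclass
class Mention:
    """One location in A where B is cited."""
    section: str                             # e.g. "Introduction", "Methods"
    context: str                             # text surrounding the citation
    cited_together: List[str] = field(default_factory=list)  # DOIs cited at the same spot


@dataclass
class RichCitation:
    citing: Work                             # A
    cited: Work                              # B
    mentions: List[Mention] = field(default_factory=list)

    @property
    def mention_count(self) -> int:
        """How many times B is cited within A."""
        return len(self.mentions)

    @property
    def is_self_citation(self) -> bool:
        """True if A and B share at least one author name (exact match only)."""
        return bool(set(self.citing.authors) & set(self.cited.authors))

    def sections(self) -> List[str]:
        """Sections of A that cite B, in order of first appearance."""
        seen: List[str] = []
        for mention in self.mentions:
            if mention.section not in seen:
                seen.append(mention.section)
        return seen


# Placeholder data for illustration only.
paper_a = Work(doi="10.1234/placeholder.a", title="Citing paper",
               authors=["A. Author", "B. Author"])
dataset_b = Work(doi="10.1234/placeholder.b", title="Cited data set",
                 authors=["B. Author", "C. Author"],
                 work_type="data set", license="CC BY")
example = RichCitation(
    citing=paper_a,
    cited=dataset_b,
    mentions=[
        Mention("Introduction", "as first reported in [3, 4]",
                cited_together=["10.1234/placeholder.c"]),
        Mention("Methods", "We used the data set from [3]"),
        Mention("Methods", "Samples in [3] were filtered by quality"),
    ],
)
print(example.mention_count, example.sections(), example.is_self_citation)
```

Keeping each in-text mention as its own record is what lets a reader tell the data set cited three times in the methods from the article cited once in passing.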

As a demonstration of the power of this new citation format, we’ve built a new overlay for PLOS papers, which displays much more information about the references in our papers, and also makes it easier to navigate and search through them. Try it yourself here:
The suite of open-source tools we’ve built makes it easy to extract and display rich citations for any PLOS paper. The rich citation API is available now for interested developers.

We’ve started collecting rich citations for all PLOS papers; currently, our database has over 10,000 PLOS papers, including nearly all PLOS Medicine papers. In a few weeks’ time, we’ll have indexed and extracted rich citations from the rest of the PLOS corpus. The ultimate goal is to collect rich citations for the entire scientific literature and provide it as open data for the research community. This kind of database would be a valuable resource not only in itself but also for the wide variety of applications that could be built using it. With a detailed database of the connections between scientific works, it would be much easier to trace the intellectual history of an idea or fact, and to see the true dependencies between different pieces of the scientific literature. We can also use this database to create better paper recommendation engines, helping readers find new and exciting work related to older work. Such software could also suggest additional works to read and cite while writing manuscripts. And a detailed map of the research literature would give us a more nuanced view of the relationships among published research than is currently available with traditional citation-based metrics. These are admittedly ambitious goals, but we have already started working with citation researchers and developers from outside of PLOS to make these ideas into a reality.

We hosted a hackathon here at PLOS this past weekend where we started to adapt our APIs to work with other publishers’ content. We also spoke with folks from the Wikidata project about the possibilities for other ways we can showcase this data and connect it to other resources like Wikipedia. Some computational researchers worked on better algorithms to detect whether two authors are the same person. And the bibliometric researchers we’ve spoken with about this project are keen to start playing with the rich citations dataset, which already contains over half a million distinct references.

We’re excited to roll out rich citations over the coming weeks. If you’ve got any suggestions for us, or if you’d like to hear more about it, please don’t hesitate to post in the comments, or to contact our Labs team directly. We’d love to hear from you!

Category: Tech | 6 Comments

Making Data Count: PLOS, CDL, and DataONE join forces to build incentives for data sharing


In partnership with the University of California Curation Center at the California Digital Library and DataONE, we are pleased to announce the launch of a new project to develop data-level metrics. This project is funded by an EAGER grant from the National Science Foundation. The project, titled “Making Data Count: Developing a Data Metrics Pilot”, will result in a suite of metrics that track and measure data use. The proposal is available on eScholarship.

Sharing data is time consuming and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The Data-Level Metrics (DLM) pilot will build from the successful open source Article-Level Metrics community project, Lagotto, originally started by PLOS in 2009. ALM provide a view into the activity surrounding an article after publication, across a broad spectrum of ways in which research is disseminated and used (e.g., viewed, shared, discussed, cited, and recommended).


Partners include:

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.


MozFest: Bringing the Web to Science

In just a month’s time, Mozilla will be hosting their annual Mozilla Festival (“MozFest” for short), which for the second year will feature a Science Track. MozFest is where communities working in technology, design, education, journalism, and research come together to innovate in the space of the Web. It’s for coders and non-coders alike. Bring everything from your laptop to knitting needles (and that includes kids; it’s for all ages!). But most importantly, it’s a place for people interested in the Web and science to come together not only to talk about but to hack away at how the Web can help us do more, do better, and connect with like-minded people from different fields around a common cause.

Who will be there?

You might have heard of the now year-old Mozilla Science Lab and their Software Carpentry program, whose mission is to teach scientists how to code using best practices and how to teach others to code. It’s an initiative that has been greatly needed across the sciences. The Software Carpentry gang will be there, as well as Kaitlin Thaney and the two newest members of Mozilla Science Lab (Bill Mills and Abby Cabunoc).

The entire Mozilla Foundation will also be there, including everyone from the Open Badges project to the Web Maker community.

When and Where?

October 24-26 in London near the O2.

Still not sure what this is about…

To use its own words, the science track of MozFest aims to use “the open web to re-define how we experiment, analyze and share scientific knowledge”. The best way to learn more about MozFest is to take a look at the website and its tracks. But also take a look at the list of accepted science sessions, which include working with some of the new authoring tools out there like Authorea and IPython Notebooks and working with the Open Badges Infrastructure to bring more credit to all the work one does as a researcher beyond publications and position titles:

Sprint sessions:

  • Science in the browser – Michael Saunby (Met Office)
  • Storytelling from space – Brian Jacobs (ProPublica)
  • Upscience – Francois Grey (CERN) et al
  • Open Science Badges for Contributorship – Amye Kennall (BMC), Patrick Polischuk (PLOS), Liz Allen (Wellcome Trust), Laura Paglione (ORCiD), and Amy Brand (Digital Science)
  • Authoring tools – Alberto Pepe (Authorea), John Lees-Miller + John Hammersley (writeLaTeX), Raghuram Korukonda
  • Improving Reference Management in Wikipedia – Daniel Mietchen (Wikimedia
  • Building a Repository of Open Tools – Kathleen Luschek (PLOS)
  • Text as data – humanities and social science – Fiona Tweedie (Univ. Melbourne)
  • Zooniverse: Open Source Citizen Science – Robert Simpson (Zooniverse)
  • Apps for Climate Change: using Appmaker for Citizen Science – Brian Fuchs (Mobile Collective)
  • Curriculum Mapping for Open Science – Fabiana Kubke (Univ. Auckland) ; Billy Meinke (Creative Commons)
  • Network Analysis Visualisation for the Web – Matt Hong (Knight Lab)
  • Hacking hacking the library – Dave Riordan (New York Public Library)
  • Web audio as an open protocol – Jeffrey Warren (Public Lab)
  • Working with open health data – Fran Bennett (Mastodon C)
  • Learn how to build cool things with weather data in Python – Jacob Tomlinson (Met Office)
  • PressureNET – Jared Kerim (Mozilla)
  • Building Pathways to Careers in Science – Lucas Blair

Trainings / Tutorials:

  • Academic Publishing Using Git and GitHub – Arfon Smith ; Jessica Lord (GitHub)
  • Intro to IPython Notebook – Kyle Kelley (Rackspace)
  • Collaborative Development: Teaching data on the web – Karthik Ram (rOpenSci)
  • Dealing with Messy Data – Milena Marin (OKFN)
  • Indie Science – Cindy Wu (
  • Learning Analytics – Adam Lofting, Doug Belshaw (Mozilla)
  • Spreadsheets – Milena (OKFN)

Other sessions:

  • Open Science Collaboration for Development – Angela Okune (iHub)
  • Teen-Driven Open Science – David Bild (Nature Museum)
  • Scientific peer review: identifying conflicts and fraud – Rebecca Lawrence (F1000)
  • “Can you help me ‘break’ my project?” – Knight Lab students

But it would be a waste to go to MozFest and never leave the science floor. Jump around to the different floors, explore. A full list of proposals is here.

Don’t forget to register. See you soon!


Open Source for Open Metrics

The scientific community is increasingly recognizing how the open science enterprise critically relies on access to scientific tooling. John Willinsky, Stanford scholar and Director of the Public Knowledge Project (PKP), presaged this development in 2005, calling it an “unacknowledged convergence” between open source, open access, and open science. Today, the vision has developed in material ways with a number of organizations coordinating the development of open source software to analyze data, connect it to existing and new channels for disseminating the results, and more. Michael Woelfle, Piero Olliaro, and Matthew Todd’s crystal clear summary of the advantages of open science is one of many compelling arguments circulating in presentations, blogs, and the media.

"Bellalagotto1" by Entheta - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons


PLOS’s mission to advance open access is a part of this larger enterprise – doing science in an open way also means communicating it openly for others to read and reuse. Furthermore, these research outputs entail not only the content itself, but metadata surrounding the object, including type, provenance, conditions of its production and its reception, etc. Article-level metrics are a part of this diverse family, enabling authors, readers, institutions, and funders to better understand the evolving state of the research object.

The ALM software, which harvests article-level activity, was started at PLOS in 2009 and made available as Open Source software in 2011. Over the last three years a small but growing community of developers has reported issues with the software, reminded us of missing documentation, and contributed code. Earlier this year we started a community site for the project, including a page listing all contributors.

We are happy to announce that last week – with big help from developer Jure Triglav – we also made the ALM Reports application available as open source software. ALM Reports uses the PLOS Search and ALM APIs to generate reports and visualizations of metrics for PLOS content. In the initial release ALM Reports only works with the PLOS Search API, but any enterprising developer can now take the ALM Reports application and tweak it to his or her needs.


Both ALM and ALM Reports are licensed under an MIT license, the most popular license for Ruby on Rails software (the web framework used to create them), and one of the open licenses recognized by the Open Source Initiative. The MIT license is a permissive open source license, allowing unrestricted commercial reuse, and several commercial and non-profit organizations are using the ALM software in their production systems.

In celebration of this new era, we have renamed the software Lagotto, in reference to the Lagotto Romagnolo, a water dog that originally comes from the Romagna region of Italy. The breed is especially gifted in hunting and retrieving, used not only as a water retriever but also for hunting truffles. In the same spirit, the ALM application hunts for activity surrounding a research paper, bringing it back and making it visible to the community.

The data collected by the PLOS ALM service have always been freely available via API and a monthly data package. Because the software used to collect and analyze the data is now also available under an open license, other users can validate the data, perform additional data quality checks, and even more critically, open up the discussion about the values underlying what each community is measuring and using for assessment.



Diving into the haystack to make more hay?

Diving into the haystack to make hay is one of the most inefficient activities imaginable (as well as a figurative absurdity). Doing science inevitably entails discovery, but the process has historically been far more difficult without effective tools to support it. With the quickening pace of scholarly communications, the vast volume of information available is overwhelming. The key puzzle piece might be as hard to find as a needle, and moreover, no scholar has the time to dive into the hay-verse to make more hay.

With recent advances in data science and information management, research discovery has become one of the most pressing and fastest growing areas of interest. Reference managers (ReadCube, Mendeley, Zotero) and dedicated services have been experimenting with novel ways to deliver relevant articles to scholars (PubChase, Sparrho, Eigenfactor Recommends). While we have by no means exhausted the number of ways to innovate and make them mainstream, we have yet to turn our attention to scholarly objects beyond the article. The utility of finding the full research narrative is established, a given. But the potential value of discovering and accessing scholarly outputs that are created before final results are communicated or never integrated into an article at all is almost untapped. Until now.

Scholarly Recommendations

Last year, our partnership with figshare began with the hosting and display of Supporting Information files on the article, accommodating the broad range of file types found in this mixing pot. Both the Supporting Information file viewer and figshare‘s PLOS figure and data portal increase the accessibility of the data and article component content associated with our articles. The latter makes PLOS figures, tables, and SI files searchable based upon key terms of interest.

Beyond the article: Today, we continue to build on the figshare offering with the launch of research recommendations, a service which delivers relevant outputs associated with PLOS articles and beyond. This begins to fill a critical need for tools that address the full breadth of research content. Rather than being limited by the article as a container, we can now present a far broader universe of scholarly objects: figures, datasets, media, code, software, filesets, scripts, etc.

figshare recommendations widget

 Hansen J, Kharecha P, Sato M, Masson-Delmotte V, Ackerman F, et al. (2013) Assessing “Dangerous Climate Change”: Required Reduction of Carbon Emissions to Protect Young People, Future Generations and Nature. PLoS ONE 8(12): e81648. doi: 10.1371/journal.pone.0081648

While papers tell the story of the final research narrative, data – the building blocks of science – are especially critical to the progress of research. They underlie the results expressed in a narrative that is published in papers. The most rigorous and expansive path of discovery includes not only related articles, but arguably even more fundamentally, data and a host of research outputs that lead up to the paper or may even be independent of an article. PLOS’ data availability policy was the foundational step, ensuring that data is publicly accessible. Delivering strong recommendations to surface relevant research now adds even more value for the scholarly community.

Beyond the publisher: the recommendations delivered by figshare extend beyond research outputs attached to PLOS publications. In fact, they are retrieved from the entire figshare corpus of over 1.5 million objects. We want to enrich the discovery experience for users using the breadth of possible OA research outputs, regardless of whether they have been published as part of a research paper. Not every scholarly output fits in an article, but many may well be instrumental to others’ research.

Right at your finger tips

The recommendations are displayed for every PLOS article on the Related Content tab. To select the most closely related ones, figshare uses Latent Semantic Analysis across the entire PLOS corpus to build a “semantic” matrix, which is then used to retrieve a list of the best related entries for each article. Five recommendations are displayed, with the option to load more. The type of file is denoted by icons or thumbnails when available, with a preview of the object upon hover-over. The full view of the file is available by clicking on the thumbnail. The content and all its metadata are available on figshare via the highlighted title. Keyword tags are also displayed, which can be used to find other associated content of that kind.
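For readers curious about what Latent Semantic Analysis looks like in practice, here is a toy sketch using NumPy. It is not figshare's implementation: the four sample documents, the function names, the raw term counts (production systems typically apply a weighting such as TF-IDF), and the choice of two latent dimensions are all assumptions made for illustration.

```python
import numpy as np


def build_term_doc_matrix(docs):
    """Count how often each vocabulary term occurs in each document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for col, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[row[word], col] += 1.0
    return matrix


def lsa_doc_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, singular_values, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each document becomes a row of (Sigma_k @ Vt_k) transposed.
    return (singular_values[:k, None] * vt[:k]).T


def recommend(doc_vectors, query_idx, top_n=5):
    """Rank the other documents by cosine similarity to the query document."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    norms[norms == 0] = 1.0                     # avoid division by zero
    unit = doc_vectors / norms[:, None]
    similarities = unit @ unit[query_idx]
    ranked = [int(i) for i in np.argsort(-similarities) if i != query_idx]
    return ranked[:top_n]


# Made-up documents standing in for article and figshare object text.
docs = [
    "climate carbon emissions warming",
    "carbon emissions reduction climate policy",
    "primate skeleton fossil eocene",
    "fossil primate morphology skeleton",
]
vectors = lsa_doc_vectors(build_term_doc_matrix(docs), k=2)
print(recommend(vectors, 0, top_n=1))
print(recommend(vectors, 2, top_n=1))
```

In this tiny corpus the two climate documents land next to each other in the latent space, as do the two fossil documents, so each is recommended for the other.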

figshare recommendations widget2

Franzen JL, Gingerich PD, Habersetzer J, Hurum JH, von Koenigswald W, et al. (2009) Complete Primate Skeleton from the Middle Eocene of Messel in Germany: Morphology and Paleobiology. PLoS ONE 4(5): e5723. doi: 10.1371/journal.pone.0005723

Mark Hahnel, founder of figshare, said “PLOS has continuously demonstrated their desire to advance academic publishing and we’re always very happy to play a part in their innovations. The latest developments will ultimately make figshare content more discoverable and benefits our user base as well as PLOS readers and authors.”

With the figshare recommendations, it is our aim to advance the process of discovery and accelerate the research lifecycle itself. Please check them out on the Related Content tab of every PLOS article, dig into the offerings, and see where your research goes. We welcome your thoughts and reflections. Are these useful and relevant? Would you like them delivered through additional channels? Feel free to comment here or contact @jenniferlin for PLOS information. figshare is also available via email, Twitter, Facebook or Google+.

Cross-posted on figshare blog.


A Step-by-Step Approach to Content Management


Image credit: Michael Morris

Today at PLOS, we celebrate an important milestone: We launched the first iteration of the new PLOS CMS (codename: Lemur). This first installment is a homepage editor for six of the PLOS journals (Biology, Medicine, Pathogens, Computational Biology, Genetics, and NTDs). Our homepage editor is a browser application that facilitates curating and preparing content to feature on the journal homepages. It lets our journal staff queue up items to feature, write blurbs about each item in a WYSIWYG editor, select images, and perform basic image manipulations. It has a drag-and-drop interface to easily reorder featured content, and as you’d expect, it includes previewing and publishing controls.

Our launch timing, not so coincidentally, corresponds to the launch of our updated homepage design for the aforementioned six journals. The new, more sophisticated homepage design called for sophisticated curation controls, so it was an easy call to make the homepage editor the first functional area for the new PLOS CMS to tackle. The old process for homepage updates was manual and restrictive. But we’re restricted no longer—journal staff now have all the controls they need at their fingertips, without having to be HTML experts.

PLOS’s new CMS is an internal tool… for today. But that could evolve over time, as the scientific publishing community moves to a brave new world that’s “beyond the journal”. I’ll dig deeper into the idea of meeting an evolving landscape by answering the question you’re probably asking yourself right now:

Why are we building our own CMS?

There are several categories of answers to this question, presented here in no particular order:

  • Flexibility: We could have gone with an off-the-shelf CMS. But it’s likely that we would have actually required two different off-the-shelf content management systems: one to manage our web content, and one to manage our article corpus. These two content types generally utilize two very different kinds of CMS. The idea of adopting and adapting two different CMSes (or selecting just one type of CMS and heavily customizing it to work with both content types) was not very appealing. This is a particularly unappealing solution when viewed through the lens of our ultimate goal, which is to blur the lines between all of our different content and media types, and interact with and present any combination of them seamlessly, in ways we can’t necessarily predict today. That’s why we have opted to separate the functions of content management, and build a curation layer for pulling the content together in the ways we need.
  • Innovation: We don’t want to miss opportunities to innovate on content delivery. We suspect that innovation would be slowed if we become locked into monolithic systems that are difficult to customize.
  • Technical considerations: Separation of content management functions from a technical perspective makes a lot of sense. You hear of a variety of publishing operations opting to roll their own for the same reasons, including the New York Times. This separation should also make us more agile in the sense of being able to respond relatively quickly to emerging needs, new ideas for innovation, or other interesting technological developments we want to try out.
  • Longer term community ideas: We start today by building Lemur with specific curation tasks in mind. But it’s easy to imagine a future in which anyone could curate PLOS (and other) content, using tools we provide. Imagine societies or educators using our content and curating it to their own needs, which may or may not be tied to the traditional model of the journal. Many examples of how this could evolve spring to mind!

Lean makes it better

We’ve been using Lean methodology to develop the content management system, component by component. We’ve been testing our wireframes throughout the design process. We’ve co-located the entire team in a conference room—since February, mind you! We’re reaping the benefits (and enjoying the process) of pair programming. Close collaboration among developers, UX, and product owner noticeably improves both our product and our velocity, while collective code ownership ensures maintainability. And here’s the part that requires lots of discipline: We release a minimum viable product, so we can gather observations of how it is used and see how it could be improved, rather than anticipate the entire product in a vacuum. As we improve the individual features with the findings from our observations, we will also incorporate that learning in the approaches we take with the rest of the system.



Delving into subject areas with PLOS Cloud Explorer

As discussed in a previous post in the PLOS Tech blog, PLOS uses a sophisticated approach to classify research articles according to what they are about. Using machine-aided indexing, articles are associated with subject areas from a thesaurus containing over ten thousand terms. You can now explore an interactive visualization of the entire thesaurus, which uses article data from PLOS journals to show how different fields of research are interrelated, and how that has changed over time. Check it out: PLOS Cloud Explorer.

We made this web app while we were students together in a course on working with open data at the UC Berkeley School of Information last spring. We were interested in doing something with open data pertaining to scholarly literature, which would enable both researchers and curious members of the general public to explore trends in research and interactions between research topics. Naturally, we looked to PLOS as a source for open data about scientific research. Because PLOS publishes open access journals, its articles and metadata are all licensed under Creative Commons Attribution. PLOS has an open search API as well, which provides access to full article data and metadata, including sets of subject area terms for each article, which specify the position of each term within the polyhierarchy of the thesaurus. We wanted to build a tool that would allow users to navigate across fields and reach real articles by harnessing this rich, faceted representation of research areas that is bound to PLOS article data. When we asked about the thesaurus, Rachel Drysdale kindly provided us with a full copy; it’s now also available on GitHub.


The fabulous complexity of PLOS’s classification of research articles hasn’t really been surfaced on the PLOS website. Although PLOS ONE has a subject area browser as part of its search interface, we found this difficult to navigate as part of an exploratory search, and started thinking of ways to add context to this kind of experience. We decided to create an interactive tree, using D3.js, that illuminates the larger structure of the relationships between research areas. As you browse the tree, graphs in the dashboard show how many articles have been published each year within the current field, and which other major disciplines those articles are also associated with. The word cloud shows which specific subject terms (the leaves of the tree) are most prevalent among articles in the selected field, and clicking on a word in the cloud takes you directly to a query of that term on the PLOS website. Early on in the making of this tool, we were inspired by a word cloud example of a specific query, and built PLOS Cloud Explorer around this notion of using a dynamic word cloud, filtered on interactive charts that provide context, to reach real documents of interest.
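The aggregation behind the dashboard can be sketched in a few lines. The app itself is built with D3.js; the Python below only illustrates the counting logic and is not taken from our source code. It assumes each article carries a publication year and a list of slash-delimited subject paths (for example "/Biology/Genetics/Genomics"); that record layout, the sample articles, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical article records: a year plus thesaurus paths for each subject term.
articles = [
    {"year": 2012, "subjects": ["/Biology/Genetics/Genomics", "/Medicine/Oncology"]},
    {"year": 2013, "subjects": ["/Biology/Genetics/Gene expression"]},
    {"year": 2013, "subjects": ["/Biology/Genetics/Genomics", "/Physical sciences/Chemistry"]},
    {"year": 2014, "subjects": ["/Medicine/Oncology"]},
]


def in_field(path, field):
    """True if a subject path is the selected field or sits beneath it."""
    return path == field or path.startswith(field + "/")


def articles_per_year(articles, field):
    """Time series: number of articles in the selected field, by year."""
    counts = Counter(
        a["year"] for a in articles
        if any(in_field(p, field) for p in a["subjects"])
    )
    return dict(sorted(counts.items()))


def leaf_term_counts(articles, field):
    """Word cloud weights: how many articles carry each leaf term in the field."""
    counts = Counter()
    for a in articles:
        leaves = {p.rsplit("/", 1)[-1] for p in a["subjects"] if in_field(p, field)}
        counts.update(leaves)
    return dict(counts)


def co_disciplines(articles, field):
    """Histogram: other top-level disciplines attached to the field's articles."""
    own_discipline = field.split("/")[1]
    counts = Counter()
    for a in articles:
        if any(in_field(p, field) for p in a["subjects"]):
            tops = {p.split("/")[1] for p in a["subjects"]}
            counts.update(tops - {own_discipline})
    return dict(counts)


selected = "/Biology/Genetics"
print(articles_per_year(articles, selected))
print(leaf_term_counts(articles, selected))
print(co_disciplines(articles, selected))
```

Because a term can sit under more than one parent in a polyhierarchy, matching on full paths rather than on bare term names keeps each count tied to the branch the user is actually browsing.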

PLOS Cloud Explorer reveals the interconnectedness of research areas that are represented and developed in PLOS journals. The word cloud and the histogram visualizations show that many fields of study are highly interconnected: PLOS articles tend to be associated with interdisciplinary research, such as combining Medicine and Physical Sciences. You can also observe and explore trends in the number of articles over time for a given field (using the time series graphs), and also trends in the collaborations among research areas (using the histogram and word cloud). We hope you enjoy exploring!

What you see in PLOS Cloud Explorer is based on data about all 126,718 articles published in PLOS journals up until July 21, 2014, and represents a snapshot of the PLOS Thesaurus in its current state of evolution. You can find our source code and documentation on GitHub.

About the authors: Anna Swigart and Colin Gerber are graduate students in the UC Berkeley School of Information. Akos Kokai is a graduate student in the Department of Environmental Science, Policy, and Management at UC Berkeley.


Making Metrics Count – ALM Article Feature Series

In this age when we are all obsessed by counting, should we be celebrating yet more sets of metrics? Albert Einstein famously quipped: “Not everything that can be counted counts, and not everything that counts can be counted.” While a well-worn sentiment, it does bear some thought. At PLOS, we believe we should celebrate: not journal-level metrics, but the diverse metrics of individual articles and the stories associated with them.

The ALM Article Feature series is an ongoing and regularly published set of posts that highlight articles that have caught the eye of the editorial teams across the PLOS journals. We examine their notable metrics and tell some of the stories behind the articles. We don’t have any fixed criteria for articles in this series, but rather have asked the journal teams to highlight articles that had meaningful metrics for their journal. As you’d expect, there will be an eclectic mix selected. This series will not only highlight individual articles but will celebrate what in the end is a core editorial function of journals: curating content that matters to their, and we hope wider, audiences.

Currently, the ALM Article Feature Series includes the following posts:

  1. From One to One Million Article Views on PLOS Medicine’s Why Most Published Research Findings are False by John Ioannidis
  2. You Just Read my Mind… on PLOS Biology’s Reconstructing Speech from Human Auditory Cortex by Brian Pasley, et al.
  3. “Low T” and Prescription Testosterone: Public Viewing of the Science Does Matter on PLOS ONE’s Increased Risk of Non-fatal Myocardial Infarction following Testosterone Therapy Prescription in Men by William Finkle, et al.
  4. Reflections on feces and its synonyms on PLOS NTD’s An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool by Stefanie J. Krauth, et al.
  5. How Much of Your Genome is Functional? on PLOS Genetics’ 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage by Chris Rands, et al.
  6. When Retroviral Research Goes Viral on PLOS Pathogens’ Highly Significant Antiviral Activity of HIV-1 LTR-Specific Tre-Recombinase in Humanized Mice by Ilona Hauber, et al.
  7. Measuring the success of an online bioinformatics resource on PLOS Computational Biology’s An Online Bioinformatics Curriculum by David Searls.

We will continue to update this list with the latest additions. You can also follow it on Twitter at #celebratingalms to discover the newest posts, join the community’s conversation about the articles chosen, and tell us which PLOS articles you want highlighted next. With article-level metrics, we look forward to sharing the breadth of fascinating ways in which PLOS science has impacted scholarly research and the broader world beyond.

Category: Tech

R Markdown for Scholarly Communication?

Open question: how much longer will researchers be limited to submitting manuscripts in formats like Word or LaTeX to STM journals?

I caught a glimpse of a potential alternative at a recent Software Carpentry Bootcamp hosted by UC Davis.

Software Carpentry is a volunteer organization that teaches basic computing skills to researchers. Instruction usually takes the form of two- to three-day hands-on workshops aimed at helping researchers, typically graduate students, work more effectively with data.

The event was a blast. Topics vary from bootcamp to bootcamp; the session I attended covered the following:

  • Basic UNIX Shell Functions
  • Data Munging
  • Basic Version Control using Git
  • Data Sharing with GitHub
  • Intro to Programming in R for data manipulation and visualization
  • Markdown and R Markdown for creating text documents that run R code
  • Intro to ggplot2, a plotting system for R that makes creating beautiful plots easy

Thanks in large part to Software Carpentry’s patient and capable volunteers, after two full days I was able to write short scripts, post them to GitHub, and create simple visualizations using sample data.

Figure (countries.png): GDP per capita from 1950 through 2010

After the first day of instruction (and over drinks) a discussion popped up over the relative merits of R Markdown and IPython Notebook – two potential alternatives to Word and LaTeX. What makes the formats compelling is their ability to leverage R or Python code to create dynamic documents.
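To make "dynamic document" concrete, here is a minimal sketch in Python of the underlying idea: prose that contains inline expressions, which are evaluated against the data when the document is rendered. This is not how R Markdown or IPython Notebook are implemented. The `{{ ... }}` placeholder syntax, the `render` function, and the sample numbers are all assumptions made up for illustration.

```python
import re
import statistics

# Hypothetical sample data; the values are invented for this illustration.
gdp_per_capita = {"1950": 2000.0, "1980": 5000.0, "2010": 11000.0}

# Document source: prose with inline {{ expression }} placeholders,
# loosely analogous to inline code in a dynamic document format.
SOURCE = (
    "We analysed {{ len(gdp_per_capita) }} years of data. "
    "Mean GDP per capita was "
    "{{ round(statistics.mean(gdp_per_capita.values())) }}."
)


def render(source: str, namespace: dict) -> str:
    """Replace each {{ expression }} with the result of evaluating it."""

    def evaluate(match: re.Match) -> str:
        # eval() is acceptable for a toy example, but it is unsafe
        # for untrusted input.
        return str(eval(match.group(1), dict(namespace)))

    return re.sub(r"\{\{\s*(.+?)\s*\}\}", evaluate, source)


if __name__ == "__main__":
    names = {"gdp_per_capita": gdp_per_capita, "statistics": statistics}
    print(render(SOURCE, names))
```

If the data changes, re-rendering updates every number in the text, which is the property that sets these formats apart from a static Word or LaTeX manuscript.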

Both R Markdown documents and IPython Notebooks can easily be converted to HTML, PDF and even Word – formats that scholarly publishers are generally familiar with. While these conversions make it easy for researchers to produce traditional research outputs, I wonder if there is a better way for publishers to leverage these formats. It also makes me curious to hear from the research community directly. How can publishers leverage formats like R Markdown and IPython Notebook to facilitate enhanced scholarly communication? Any and all suggestions are welcome.
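One possible starting point, offered as a sketch rather than a recommendation: a notebook file is structured JSON, so a publisher could in principle separate the narrative from the code without converting to HTML or PDF first. The example below uses only the Python standard library and a hand-written notebook in what I understand to be the version 4 notebook layout. The cell contents and the `split_cells` helper are hypothetical.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 JSON layout.
# The cell contents are hypothetical.
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Methods\n", "We plot GDP per capita."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["import statistics\n", "statistics.mean([1, 2, 3])"],
        },
    ],
})


def split_cells(notebook_text: str) -> dict:
    """Group cell sources by cell type so prose and code can be handled separately."""
    notebook = json.loads(notebook_text)
    grouped = {"markdown": [], "code": []}
    for cell in notebook["cells"]:
        source = cell["source"]
        # A cell's source may be a single string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        grouped.setdefault(cell["cell_type"], []).append(text)
    return grouped


if __name__ == "__main__":
    parts = split_cells(NOTEBOOK_JSON)
    print(len(parts["markdown"]), "prose cell(s),", len(parts["code"]), "code cell(s)")
```

With prose and code separated like this, a publisher could typeset the narrative as usual while keeping the code available for readers to inspect or re-run.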

Category: Tech

Why I am a product manager at PLOS: Linking up value across the research process

 Why am I a product manager at PLOS?

I am a product manager [1] at PLOS because of the thrilling opportunity to re-imagine and create better conditions for researchers to do and share science. Those far more insightful and eloquent have enumerated the existing system’s thousand points of failure to support the research enterprise. What follows is a set of thoughts that originate from traditions borrowed from outside the everyday practice of science and, more broadly, scholarly communications.

While the research enterprise comprises vastly diverse activities, the overall set can be generalized at the highest level with a simple rubric (Four C’s):

Create: do something that inspires or catalyzes a new idea
Capture: externalize and preserve the idea
Communicate: make the artifact extensible and disseminate it, making it accessible for others
Credit: integrate work into existing credit systems

Science easily fits into this model. A researcher will design a project and conduct experiments (Create), analyze and synthesize the results (Capture).  She will disseminate the narrative by publishing a research article (Communicate) and seek credit for work done (Credit).  In the traditional rendering of the research cycle, you move from one phase to another in an orderly and straightforward fashion.  The sequence seems to make sense.  (What is there to “communicate” before something is “captured,” if not scientific misconduct?)

But does the process of scientific discovery actually work this way?

Not really. Researchers continually discuss results so as to get feedback that is folded into their next set of experiments or analyses. They do it informally through lab meetings, departmental gatherings, digital platforms (e.g., Mendeley, Twitter, blogs), and any number of unplanned “water cooler” encounters. They also do it formally through conference presentations, seminar talks, research articles, etc. These recurring encounters are an intrinsic part of the process whereby scientific ideas are propagated, confirmed or refuted, and built upon by the community. Science is very much a complex, social enterprise in this sense. But this means that the four phases described above are more accurately depicted as overlapping modes of doing science. Here, we no longer have a simple sequence of discrete moments. Rather, we have a dynamic set of live events that may start at an identifiable origin, but then proceed to extend, fork into multiple branches, intertwine, double over, etc.

The model in which research moves by sequence through the research lifecycle seems to me a paltry simplification of the larger process.  It might get frenetic and messy with multiple projects underway with different collaborator networks, each at different points of development, moving at different speeds.  But with so many simultaneous movements at play inside and outside the lab, no wonder doing science is so exciting!

I will elaborate on the first three C’s in subsequent posts, but focus here on the last one. The fourth mode, Credit, stands apart from the others by its very nature and plays a more significant role in forming the environment where research acts play out. By and large, it is the mode almost exclusively identified with the production of value. The activity is embedded within formal institutions such as evaluation committees at institutions and funding organizations. Informal indicators may attest to a research output as a product of one’s work and thus accrue favor amongst colleagues. But credit is formalized by virtue of the outcomes available once credit is assigned. [2] In the strictest sense, it counts insofar as it advances one’s reputation within the established systems that confer benefit on those awarded. Credit forms the basis of an incentive structure, which then shapes the activities of those tied to it, and in this sense it is considered the least malleable of the four modes. It is considered the intended terminus, explicit or implicit, for every work unit started (in relation to the expected outcome of the work, not the personal motivations which drive the researcher to do it). The reward itself is presumed to account for the efforts entailed. In the existing incentive structure for scholars, this still means publication in a high-impact journal or of a highly cited article.

Not surprisingly, this way of thinking plays a large role in shaping and reinforcing the linear view of how science works. If there is no credit until the condition of citation is satisfied, no narrative is possible other than the straight progression of the discrete phases depicted. If the only work product that really counts is a research paper, published 5 years after the project commenced, which must wait another few years before citations accumulate, then we have a seriously protracted time lag before any evidence of contribution is formally recognized. In this environment, why would we be surprised to discover that the entire practice of research has been reduced to a simple, linear process? And this notion merely perpetuates the misconception that Credit, supposedly at the end of the single-link chain, can sufficiently reflect (and vindicate) the work which preceded it. Science is too fertile an enterprise for this to suffice. Researchers deserve better. Fortunately, the prevailing notions of value and credit which underlie the incentive structure are beginning to change.

My sense, one shared by my colleagues, is that value is created throughout the lifecycle, not just when Credit is awarded. If we share in a way that can be tracked, and establish measurements for such outputs, we capture value across practices: not just recorded activities but actual practices. If we capture value at each stage along the way, we can formally assess and recognize far more work products, not just surfaced pieces contrived to fit the final narrative. If we think of the Four C’s as separate strands with possible outcomes which might be identified, compared, and measured, we could glimpse a far richer view of how scientific ideas impact each other, extending far beyond the anemic one recognized today. If we take a broader view of the production of value and create formal mechanisms for its distribution and allocation, we create an economy that supports the myriad ways in which researchers are doing science and engaging with others’ work. And we create a holistic environment more conducive to the advancement of research and far more supportive of all involved in the enterprise.

I am a product manager because I think all this is possible with social change and structural realignments (of technology and policy) across the research ecosystem.  I am a product manager because it is possible to transform how we access, track, share, discuss, discover, interrogate, and evaluate research findings — nothing short of how we do science.


1. I respectfully put the oft-asked question of “what a product manager does” aside for another occasion. If hows also came before whys, we would never get started, given all the contradictory biographical minutiae needed to satisfy the opposing poles of narrative and truth.
2. Here, the point pertains to role, not person. A colleague may also be a formal evaluator of one’s work.

Category: Tech