Lessons learned developing scholarly open source software

Earlier this week we released version 3.8 of the Lagotto open source software (migrating the software to version 4 of the Rails framework). Lagotto is software that tracks events (views, saves, discussions and citations) around scholarly articles and provides these Article-Level Metrics for all journal articles published by PLOS. The project was started in 2008, and I have been the technical lead since May 2012. This is the 24th release since I became involved with the software, and I want to take the opportunity to go into more detail about some of the lessons learned along the way.


We achieved some important Article-Level Metrics milestones in the past few weeks. In October we collected 500 million events around PLOS articles – an event can be anything from a pageview, tweet or Facebook like to a mention on a Wikipedia page or a citation in another scholarly journal (data).


Most of these events are usage stats in the form of HTML page views and PDF downloads – including usage stats from PubMed Central and usage stats for supplementary information files hosted by figshare. But we are tracking events from a diverse range of data sources (data with logarithmic scaling):


Another important milestone was reached this October: PLOS is no longer the largest user of the Lagotto software. We were surpassed by the Public Knowledge Project, which provides a hosted ALM service for journals running the open source OJS software, and by CrossRef Labs, which loaded all 11 million DOIs registered with them since January 2011 into the software (data, unknown means software versions before 3.6).

Lagotto instances

Three features implemented in the last three months made this possible:

  1. tight integration with the CrossRef REST API to easily import articles by date and/or publisher (loading the 11 million CrossRef DOIs took a little over 24 hours)
  2. many performance improvements, in particular for database calls
  3. support for publisher-specific configurations and display of data.
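As a rough illustration of the first feature, the sketch below builds a CrossRef REST API query that selects works by publication date and/or publisher. It is not Lagotto's actual import code (which is written in Ruby). The function name `build_import_url` and the member ID in the usage note are made up for this example, and it assumes the public `/works` endpoint with its `from-pub-date`, `until-pub-date` and `member` filters.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_import_url(from_date=None, until_date=None, member_id=None, rows=1000):
    """Build a /works query restricted by publication date and/or publisher.

    Hypothetical helper for illustration; it only builds the URL and does
    not send a request.
    """
    filters = []
    if from_date:
        filters.append(f"from-pub-date:{from_date}")
    if until_date:
        filters.append(f"until-pub-date:{until_date}")
    if member_id:
        filters.append(f"member:{member_id}")

    params = {"rows": rows}
    if filters:
        # CrossRef expects several filters as one comma-separated value.
        params["filter"] = ",".join(filters)
    # Keep ":" and "," readable instead of percent-encoding them.
    return f"{CROSSREF_WORKS_URL}?{urlencode(params, safe=':,')}"
```

Calling `build_import_url(from_date="2011-01-01", member_id=1234)` (1234 is a placeholder member ID) gives a URL that asks for up to 1000 works from that publisher published since January 2011. A real importer would page through the results and store each DOI.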


Tracking events around scholarly articles is almost by definition a large-scale operation: collecting 500 million events and serving them via an API takes a lot of resources. Work on scalability has focussed on two areas: handling large numbers of articles and associated events, and making the API as fast as possible. Good performance monitoring is essential for any scalability work, and we have mainly used two tools. rack-mini-profiler is a profiler for development and production Ruby applications, and gives detailed real-time feedback:


To measure API performance we use Rails Notifications and track the duration of every incoming (from external data sources) and outgoing (to clients) API call. We filter and visualize the duration of the outgoing API calls in the admin dashboard using the D3 and crossfilter JavaScript libraries:
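The idea of timing every API call and then summarizing the durations can be sketched in a few lines. This is a simplified stand-in written in Python, not the Rails Notifications mechanism Lagotto uses; the class `CallTimer` and its methods are hypothetical names for illustration.

```python
import time
from contextlib import contextmanager
from statistics import mean


class CallTimer:
    """Record how long named API calls take (illustrative sketch only)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock      # injectable so the timer can be tested
        self.records = []       # list of (name, duration in milliseconds)

    @contextmanager
    def track(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped call raised an error.
            self.records.append((name, (self.clock() - start) * 1000.0))

    def summary(self):
        """Return {name: (call count, mean ms, max ms)}."""
        grouped = {}
        for name, ms in self.records:
            grouped.setdefault(name, []).append(ms)
        return {name: (len(v), mean(v), max(v)) for name, v in grouped.items()}
```

Wrapping a request in `with timer.track("api/articles"):` records one duration, and `timer.summary()` gives the kind of per-endpoint numbers a dashboard could filter and plot.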


One lesson learnt is that it is critical to understand the database calls in detail and not rely on an ORM (object-relational mapping) framework (ActiveRecord in the case of Rails) to always make the right decisions.
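A common example of why this matters is the "N+1 query" pattern, where code that looks harmless issues one query per record. The sketch below is an assumption-laden illustration using Python's built-in `sqlite3` module rather than ActiveRecord; the table layout, DOIs and function names are invented and do not reflect Lagotto's schema.

```python
import sqlite3


def setup_db():
    """Create a tiny in-memory database with made-up articles and events."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE articles (id INTEGER PRIMARY KEY, doi TEXT);
        CREATE TABLE events (article_id INTEGER, source TEXT, total INTEGER);
        INSERT INTO articles VALUES (1, '10.5555/a'), (2, '10.5555/b');
        INSERT INTO events VALUES
            (1, 'twitter', 5), (1, 'wikipedia', 2), (2, 'twitter', 7);
    """)
    return db


def totals_n_plus_one(db):
    """One query for the articles, then one more query per article."""
    totals = {}
    for article_id, doi in db.execute("SELECT id, doi FROM articles").fetchall():
        row = db.execute(
            "SELECT SUM(total) FROM events WHERE article_id = ?", (article_id,)
        ).fetchone()
        totals[doi] = row[0] or 0
    return totals


def totals_single_query(db):
    """The same result from a single JOIN with GROUP BY."""
    rows = db.execute("""
        SELECT a.doi, COALESCE(SUM(e.total), 0)
        FROM articles a LEFT JOIN events e ON e.article_id = a.id
        GROUP BY a.id
    """)
    return dict(rows.fetchall())


def count_statements(db, fn):
    """Run fn(db) and count the SQL statements it executes."""
    seen = []
    db.set_trace_callback(seen.append)
    try:
        result = fn(db)
    finally:
        db.set_trace_callback(None)
    return result, len(seen)
```

Both functions return the same totals, but `count_statements` shows that the first one issues more statements, and the gap grows with the number of articles. With millions of DOIs that difference dominates response time.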

The other important lesson learnt is that caching is absolutely critical. Lagotto uses memcached to do some pretty complex caching. For API calls we use fragment caching, and in the admin dashboard we also use model caching of some slow queries. Fragment caching is a common pattern nicely supported by Rails and the rabl library we use for JSON rendering, although we ran into some edge cases because our API responses are a bit too complex. Model caching is part of a custom solution where we cache all slow database queries in the admin dashboard and refresh the cache every 60 min via a cron job.
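The model caching approach described above, where slow queries are precomputed and refreshed on a schedule rather than on request, might look roughly like this. It is a minimal in-process sketch, assuming a plain dictionary in place of memcached; `RefreshedCache` and its methods are hypothetical names, and the scheduled job (cron in Lagotto's case) is represented only by a call to `refresh_all()`.

```python
class RefreshedCache:
    """Cache slow query results and refresh them from a scheduled job.

    Illustrative sketch: values live in a dict rather than in memcached.
    """

    def __init__(self):
        self._loaders = {}   # key -> function that runs the slow query
        self._values = {}    # key -> most recently computed result

    def register(self, key, loader):
        self._loaders[key] = loader

    def refresh_all(self):
        """Recompute every registered value; meant to be run periodically."""
        for key, loader in self._loaders.items():
            self._values[key] = loader()

    def get(self, key):
        # Cold cache: compute once on demand, then serve the stored value.
        if key not in self._values:
            self._values[key] = self._loaders[key]()
        return self._values[key]
```

The trade-off is that dashboard pages never wait for the slow query, but they can show data that is up to one refresh interval old.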


Making it easy to install and run your software is the most important thing you can do if you want your open source project to be used by other people and organizations. This is a complex topic and includes several aspects: testing, error tracking, automation, documentation, and user support.


code climate

The overriding aim is obviously to ship bug-free software. The approach that the Ruby community has taken is to write extensive tests for your application. We are using RSpec, Cucumber and Jasmine for tests, and track test coverage (and overall code quality) using Code Climate.

Lagotto is also using rubocop to make sure the code follows the Ruby style guide, and the brakeman security scanner. All these tools run in the Travis CI continuous integration environment.

Error tracking

Lagotto can create thousands of errors every day if there are problems with even one of the external APIs which the application talks to. For this reason we have implemented a custom error tracker that stores all errors in the database and makes it easy to search for them, receive an email for critical errors, etc.
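To make the idea concrete, here is a small sketch of an error tracker that stores errors in a database, supports searching, and sends a notification for critical errors. It is not Lagotto's implementation: the class `AlertStore`, the table layout and the `notify` callback (standing in for sending an email) are assumptions for this example, and it uses Python's built-in `sqlite3` module.

```python
import sqlite3


class AlertStore:
    """Store errors in a database and notify on critical ones (sketch)."""

    def __init__(self, notify=print):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE alerts ("
            "id INTEGER PRIMARY KEY, source TEXT, level TEXT, message TEXT)"
        )
        self.notify = notify  # stand-in for sending an email

    def record(self, source, message, level="warning"):
        self.db.execute(
            "INSERT INTO alerts (source, level, message) VALUES (?, ?, ?)",
            (source, level, message),
        )
        self.db.commit()
        if level == "critical":
            self.notify(f"[{source}] {message}")

    def search(self, text):
        """Return (source, level, message) rows whose message contains text."""
        rows = self.db.execute(
            "SELECT source, level, message FROM alerts "
            "WHERE message LIKE ? ORDER BY id",
            (f"%{text}%",),
        )
        return rows.fetchall()
```

Keeping errors in the application's own database means an outage at one external API produces many rows that can be filtered by source, instead of thousands of separate emails.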


The alerts dashboard of course also helps discover problems with the code that are missed by testing for a variety of reasons, such as in the example above.


Lagotto is a typical Rails web application for those who are familiar with Rails. Unfortunately many people are not. Automating installation and deployment of new releases is therefore critical. Since 2012 we have worked with three tools to automate the process:

  • Vagrant, a tool to easily create virtual development and deployment environments
  • Chef, an IT automation tool
  • Capistrano, the standard Rails deployment tool

All three tools are written in Ruby, making it straightforward to customize them for a Ruby application. Deploying a new Lagotto server from scratch can be done in less than 30 min, and we have refined the process over time, including some big changes in the recent 3.7 release. Thanks to Vagrant, Lagotto can be easily deployed to a cloud environment such as AWS or Digital Ocean. After some initial work with the platform as a service (PaaS) providers CloudFoundry and OpenShift (I didn’t try Heroku), I chose not to delve deeper, mainly because these tools added a layer of complexity and cost that wasn’t outweighed by the ease of deployment. The elephant in the room is of course Docker, an exciting new application platform that we hope to support in 2015.


The typical user of the Lagotto application isn’t hosting applications in a public cloud, and not all use virtual servers internally. We therefore still need good support for manual installation, and here documentation is critical.

All Lagotto documentation is written in markdown, making it easy to reuse nicely formatted documents (links, code snippets, etc.) in multiple places, e.g. via the Documentation menu from any Lagotto instance.


Writing documentation is hard, and all pages are evolving over time based on user feedback.

User Support Forum

In the past I have answered most support questions via IM, private email, mailing list or the GitHub issue tracker. There are some shortcomings to these approaches, and last month we therefore launched a new Lagotto forum (http://discuss.lagotto.io) as the central place for support.


The site runs Discourse, a great open source Ruby on Rails application that a number of (much larger) open source projects (e.g. EmberJS or Atom) have picked for user support. The forum supports the third-party login via Mozilla Persona that we are using in Lagotto (you can also sign in via Twitter or GitHub), and has nice markdown support so that we can simply paste in documentation snippets when creating posts. Please use this forum if you have any questions or comments regarding the Lagotto software.


Users of the Lagotto application need to know what high-level features are planned for the coming months, and be able to suggest features important to them. The Lagotto forum is a great place for this and we have posted the development roadmap for the next 4 months there. The planned high-level features are:

  • 4.0 – Data-Push Model (November 2014)
  • 4.1 – Data-Level Metrics (December 2014)
  • 4.2 – Server-to-Server Replication (January 2015)
  • 4.3 – Archiving and Auditing of Raw Data (February 2015)

Work on the data push API has already begun and will be the biggest change to the system architecture since summer 2012. This will allow the application to scale better, and to be more flexible in how data from external sources are collected.

The data-level metrics work relates to the recently started NSF-funded Making Data Count project. This work will also make Lagotto much more flexible in the kinds of things we want to track, going well beyond articles and other items that have a DOI associated with them.

One of the biggest risks for any software project is probably feature creep: adding more and more things that look nice in the beginning, but make maintenance harder and confuse users. Every piece of software, in particular open source software, needs a critical mass of users. I am therefore happy to expand the scope of the software beyond articles, and to accommodate both small organizations with hundreds of articles and large instances with millions of DOIs.

But we should keep Lagotto’s focus as software that tracks events around scholarly outputs such as articles and makes them available via an API. Lagotto is not a client application that the average user would interact with. Independent Lagotto client applications serve this purpose, in particular journal integrations with the API (at PLOS and elsewhere) as well as ALM Reports and the rOpenSci alm package.

Lagotto is also not an index service with rich metadata about the scholarly outputs it tracks. We only know about the persistent identifier, title, publication date and publisher of an article, not enough to provide a search interface, or a service that slices the data by journal, author, affiliation or funder.

The future for Lagotto looks bright.


Source: Wikimedia Commons

Category: Tech | Tagged , | 1 Comment

Make data sharing easy: PLOS launches its Data Repository Integration Partner Program


Over the past couple of years, we’ve been improving data access in a number of ways, notably: unlocking content in SI files, connecting PLOS Biology and PLOS Genetics with Dryad data, and making data discoverable through figshare recommendations.  Our update to the PLOS data policy in March 2014 undergirds and reinforces these services (cf. policy FAQs).  Once data is available without restrictions upon publication, we can build tools to support the research enterprise.  Today, our efforts to improve data access at PLOS continue with an exciting new chapter.

Data Repository Integration Partner Program: easy deposition and article submission

We announce the launch of the PLOS Data Repository Integration Partner Program, which integrates our submission process with those of a select set of data repositories to better support data sharing and author compliance with the PLOS data policy.  (PDRIPP didn’t make the cutting board for names.)  Through this program, we make it easier for researchers to deposit data and submit their manuscript to PLOS through a single, streamlined workflow.  Our submission system is sewn together with partner repositories to ensure that the article and its underlying data are fully paired – published together and linked together.

Community benefits of the data repository integration partner program include the following:

  • ensuring that data underlying the research article are publicly available for the community
  • making it easier for PLOS authors to comply with the data policy
  • making data underlying the research article available to peer review contributors even if not yet publicly released
  • establishing bi-directional linkage between article publication and data
  • enabling data citation so that data producers can gain professional credit with data Digital Object Identifiers (DOIs)

We recognize that data types can vary quite widely and repositories have different competencies.  Most importantly, researcher needs are ever diverse, and repository choice is important.  The program thus aims to accommodate the diversity of data requirements across subject areas as well as from funders and institutions.  While such partners are strong options for researchers, they are neither representative nor exhaustive of the suitable repositories available which satisfy the PLOS data policy. [1]

Dryad: our first integration partner


We are thrilled that Dryad is the first integration partner to join the program.  It is a curated resource that hosts a wide variety of data underlying publications in science and medicine, making research data discoverable, freely reusable, and citable.  We have connected two PLOS journals, PLOS Biology in 2012 and PLOS Genetics in 2013, tying together the data deposition process and our submission workflow.  Now, we expand this service offering to all PLOS authors across all seven journals.

Better yet, we have made the workflow more flexible and efficient for authors so that data deposition can now occur before article submission.  The steps are quite simple: once a researcher has deposited data in Dryad, s/he will receive a provisional dataset DOI along with a private reviewer URL.  The data DOI should be incorporated into the full Data Availability Statement as per standard procedure.  The reviewer URL is also uploaded into the submission system and will serve as a passcode for private access to the data before public release.  The manuscript then moves swiftly through peer review.  If it is accepted for publication, both article and dataset will be published together on the same day.  The fantastic Dryad blog post details the whole story.  If you have any questions about journal article submission, please contact the respective journal at plos[journal]@plos.org or data@plos.org.  Dryad repository questions can be directed to help@datadryad.org. [2]

More repos: growing the partnership to better serve researchers

PLOS is repository agnostic, provided that data centers meet certain baseline criteria (e.g. availability, reliability, preservation) that ensure trustworthiness and good stewardship of data.  We are expanding the current selection of partners to provide this service to more of our authors, across more data centers.  We have a few already slated to join in the next couple of months.  Stay tuned!

But we ask for your help, researchers:

which new repositories would most benefit your work?

Let us know your recommendations for additional partner candidates.  Your thoughts and feedback on the new program are always welcomed at data@plos.org.



1. Authors are under no obligation to use the data repositories that are part of the integration program.  We recommend that researchers continue to deposit their data based on the field-specific standards for preparation and recording of data and select repositories appropriate to their field. 

2. Please note that Dryad has a Data Publication Charge. For authors selecting the service, this fee will be charged by Dryad to the author if/when the related manuscript is accepted for publication, so there is no charge while manuscripts are under review.  PLOS does not gain financially from our association with Dryad, which is a nonprofit organization.


How do you DO data?


We all know that data are important for research. So how can we quantify that? How can you get credit for the data you produce? What do you want to know about how your data is used? If you are a researcher or data manager, we want to hear from you. Take this 5-10 minute survey and help us craft data-level metrics:


Please share widely! The survey closes December 1st.

The responses will be fed directly into a broader project to design and develop metrics that track and measure data use, i.e. “data-level metrics” (DLM).  See an earlier blog post for more detail on the NSF-funded project, Making Data Count, which is a partnership between PLOS, CDL, and DataONE.

DLM are a multi-dimensional suite of indicators, measuring the broad range of activity surrounding the reach and use of data as a research output. They will provide a clear and growing picture of the activity around research data, with direct, first-hand views of its dissemination and reach. These indicators capture the footprint of the data from the moment of deposition in a repository through to its dynamic evolution over time. DLM are automatically tracked, thus reducing the burden of reporting and potentially increasing consistency. Our aims in measuring the level and type of data usage across many channels are plain:

  • make it possible for data producers to get credit for their work
  • prototype a platform so that these footprints can be automatically harvested
  • make all DLM data freely available to all (open metrics!)

At the moment, we are canvassing researchers and data managers to better understand data sharing attitudes and perceptions, to identify core values for data use and reuse, and to describe existing norms surrounding the use and sharing of data. The survey answers, combined with previous and ongoing research (e.g. Tenopir et al., Callaghan et al.), will serve as the basis for the work ahead. They will be converted into requirements for an industry-wide data metrics platform. We will explore metrics that can be generalized across broad research areas and communities of practice, including life sciences, physical sciences, and social sciences. The resulting framework and prototype will represent the connections between data and the various channels in which engagement with the data is occurring. We will then test the validity of the pilot DLMs with real data in the wild and explore the extent to which automatic tracking is a viable approach for implementation.

Thank you in advance for your contributions to the data survey. Please fill it out and pass it on today! Read more about the project at mdc.plos.org or the CDL survey blog post.


1:AM Funding Available for Altmetrics Projects


We are extending the great enthusiasm from the 1:AM altmetrics conference (London, September 25-26, 2014) with a call for proposals of altmetrics projects. Much of our discussion was aimed at figuring out what we can do now to continue building and expanding our understanding and use of altmetrics – how to move altmetrics forward.  A few areas of need highlighted in the discussions include: altmetrics analysis of article data across sources or collections of papers, altmetrics education and outreach, building tools using altmetrics for research discovery and evaluation, etc.

Researchers, we know the best ideas will come from you – ones that address real needs and are most pressing for your work. What better way for the conference organizers to bolster this work than to help fund some of it?

We are asking researchers to send us their best ideas for projects aimed at research, promotion, and community-building in the altmetrics space. While we have limited funds, we are keen to help new projects get off the ground. Some ideas may include:

  • Run a session for your post-grads to demonstrate how altmetrics can help them identify how the articles they are reading were more widely received (funding could cover refreshments, materials, promotion, etc.)
  • Send a member of your team to a conference with a strong altmetrics theme
  • Do some research yourself (or support others) into what you can find out about a given set of publications, or indeed conduct a study with a broader scope
  • Build a plug-in to an existing application that displays altmetrics

Applications will be accepted through November 6. To submit your project or research funding request, please email altmetricsconf@gmail.com, and include the following details:

  • You, and your affiliation
  • What you would like to use the money for, and how much you think you would need
  • A timeframe for your project
  • The outcomes (e.g., write up of your workshop, a poster presentation, talk at a conference, etc.)

We’ll review the proposals and notify the selected applicants by mid-November. Please see the Call for Proposals page on the conference website for more information. Let the fun begin…


Rich Citations: Open Data about the Network of Research

Why are citations just binary links? There’s a huge difference between the article you cite once in the introduction alongside 15 others, and the data set that you cite eight times in the methods and results sections, and once more in the conclusions for good measure. Yet both appear in the list of references with a single chunk of undifferentiated plain text, and they’re indistinguishable in citation databases — databases that are nearly all behind paywalls. So literature searches are needlessly difficult, and maps of that literature are incomplete.

To address this problem, we need a better form of academic reference. We need citations that carry detailed information about the citing paper, the cited object, and the relationship between the two. And these citations need to be in a format that both humans and computers can read, available under an open license for anyone to use.

This is exactly what we’ve done here at PLOS. We’ve developed an enriched format for citations, called, appropriately enough, rich citations. Rich citations carry a host of information about the citing and cited entities (A and B, respectively), including:

  • Bibliographic information about A and B, including the full list of authors, titles, dates of publication, journal and publisher information, and unique identifiers (e.g. DOIs) for both;
  • The sections and locations in A where a citation to B appears;
  • The license under which B appears;
  • The CrossMark status of B (updated, retracted, etc);
  • How many times B is cited within A, and the context in which it is cited;
  • Whether A and B share any authors (self-citation);
  • Any additional works cited by A at the same location as B (i.e. citation groupings);
  • The data types of A and B (e.g. journal article, book, code, etc.).
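The fields listed above map naturally onto a small data structure. The sketch below is one possible way to model a rich citation; it is not the schema of the rich citations API, and every class and field name (`Work`, `Mention`, `RichCitation`, and so on) is a hypothetical choice for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    """Bibliographic information about a citing or cited entity."""
    doi: Optional[str]
    title: str
    authors: List[str]
    work_type: str = "journal article"      # e.g. book, code, data set
    license: Optional[str] = None
    crossmark_status: Optional[str] = None  # e.g. updated, retracted


@dataclass
class Mention:
    """One place in the citing work (A) where the cited work (B) appears."""
    section: str
    context: str
    cited_together_with: List[str] = field(default_factory=list)  # other DOIs


@dataclass
class RichCitation:
    citing: Work   # A
    cited: Work    # B
    mentions: List[Mention] = field(default_factory=list)

    @property
    def mention_count(self) -> int:
        """How many times B is cited within A."""
        return len(self.mentions)

    @property
    def is_self_citation(self) -> bool:
        """True if A and B share at least one author name."""
        return bool(set(self.citing.authors) & set(self.cited.authors))
```

Matching authors by name, as `is_self_citation` does here, is a deliberate simplification: telling whether two author names refer to the same person is a hard problem in its own right.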

As a demonstration of the power of this new citation format, we’ve built a new overlay for PLOS papers, which displays much more information about the references in our papers, and also makes it easier to navigate and search through them. Try it yourself here: http://alpha.richcitations.org.

The suite of open-source tools we’ve built makes it easy to extract and display rich citations for any PLOS paper. The rich citation API is available now for interested developers at http://api.richcitations.org.

We’ve started collecting rich citations for all PLOS papers; currently, our database has over 10,000 PLOS papers, including nearly all PLOS Medicine papers. In a few weeks’ time, we’ll have indexed and extracted rich citations from the rest of the PLOS corpus. The ultimate goal is to collect rich citations for the entire scientific literature and provide it as open data for the research community. This kind of database would be a valuable resource not only in itself but also for the wide variety of applications that could be built using it. With a detailed database of the connections between scientific works, it would be much easier to trace the intellectual history of an idea or fact, and to see the true dependencies between different pieces of the scientific literature. We can also use this database to create better paper recommendation engines, helping readers find new and exciting work related to older work. Such software could also suggest additional works to read and cite while writing manuscripts. And a detailed map of the research literature would give us a more nuanced view of the relationships among published research than is currently available with traditional citation-based metrics. These are admittedly ambitious goals, but we have already started working with citation researchers and developers from outside of PLOS to make these ideas into a reality.

We hosted a hackathon here at PLOS this past weekend where we started to adapt our APIs to work with other publishers’ content. We also spoke with folks from the Wikidata project about the possibilities for other ways we can showcase this data and connect it to other resources like Wikipedia. Some computational researchers worked on better algorithms to detect whether two authors are the same person. And the bibliometric researchers we’ve spoken with about this project are keen to start playing with the rich citations dataset, which already contains over half a million distinct references.

We’re excited to roll out rich citations over the coming weeks. If you’ve got any suggestions for us, or if you’d like to hear more about it, please don’t hesitate to post in the comments, or to contact our Labs team directly at ploslabs@plos.org. We’d love to hear from you!


Making Data Count: PLOS, CDL, and DataONE join forces to build incentives for data sharing


In partnership with the University of California Curation Center at the California Digital Library, and DataONE, we are pleased to announce the launch of a new project to develop data-level metrics. This project is funded by an EAGER grant from the National Science Foundation. The project, titled “Making Data Count: Developing a Data Metrics Pilot”, will result in a suite of metrics that track and measure data use. The proposal is available on eScholarship (http://escholarship.org/uc/item/9kf081vf).

Sharing data is time consuming and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The Data-Level Metrics (DLM) pilot will build on the successful open source Article-Level Metrics community project, Lagotto, originally started by PLOS in 2009. ALM provide a view into the activity surrounding an article after publication, across a broad spectrum of ways in which research is disseminated and used (e.g., viewed, shared, discussed, cited, and recommended).


Partners include:

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.


MozFest: Bringing the Web to Science

In just a month’s time, Mozilla will be hosting its annual Mozilla Festival (“MozFest” for short), which for the second year will feature a Science Track. MozFest is where communities working in technology, design, education, journalism, and research come together to innovate in the space of the Web. It’s for coders and non-coders alike. Bring everything from your laptop to knitting needles (and that includes kids – it’s for all ages!). But most importantly, it’s a place for people interested in the Web and science to come together not only to talk about but to hack away at how the Web can help us do more, do better, and connect with like-minded people from different fields around a common cause.

Who will be there?

You might have heard of the now year-old Mozilla Science Lab and their Software Carpentry program, whose mission is to teach scientists how to code using best practices and how to teach others to code. It’s an initiative that has been greatly needed across the sciences. The Software Carpentry gang will be there as well as Kaitlin Thaney and the two newest members of Mozilla Science Lab (Bill Mills and Abby Cabunoc).

The entire Mozilla Foundation will also be there – that includes everyone from the Open Badges project to the Web Maker community.

When and Where?

October 24-26 in London near the O2.

Still not sure what this is about…

To use its own words, the science track of MozFest aims to use “the open web to re-define how we experiment, analyze and share scientific knowledge”. The best way to learn more about MozFest is to take a look at the website and its tracks. But also take a look at the list of accepted science sessions, which include working with some of the new authoring tools out there like Authorea and IPython Notebooks, and working with the Open Badges Infrastructure to bring more credit to all the work one does as a researcher beyond publications and position titles:

Sprint sessions:

  • Science in the browser – Michael Saunby (Met Office)
  • Storytelling from space – Brian Jacobs (ProPublica)
  • Upscience – Francois Grey (CERN) et al
  • Open Science Badges for Contributorship – Amye Kennall (BMC), Patrick Polischuk (PLOS), Liz Allen (Wellcome Trust), Laura Paglione (ORCiD), and Amy Brand (Digital Science)
  • Authoring tools – Alberto Pepe (Authorea), John Lees-Miller + John Hammersley (writeLaTeX), Raghuram Korukonda
  • Improving Reference Management in Wikipedia – Daniel Mietchen (Wikimedia)
  • Building a Repository of Open Tools – Kathleen Luschek (PLOS)
  • Text as data – humanities and social science – Fiona Tweedie (Univ. Melbourne)
  • Zooniverse: Open Source Citizen Science – Robert Simpson (Zooniverse)
  • Apps for Climate Change: using Appmaker for Citizen Science – Brian Fuchs (Mobile Collective)
  • Curriculum Mapping for Open Science – Fabiana Kubke (Univ. Auckland) ; Billy Meinke (Creative Commons)
  • Network Analysis Visualisation for the Web – Matt Hong (Knight Lab)
  • Hacking hacking the library – Dave Riordan (New York Public Library)
  • Web audio as an open protocol – Jeffrey Warren (Public Lab)
  • Working with open health data – Fran Bennett (Mastodon C)
  • Learn how to build cool things with weather data in Python – Jacob Tomlinson (Met Office)
  • PressureNET – Jared Kerim (Mozilla)
  • Building Pathways to Careers in Science – Lucas Blair

Trainings / Tutorials:

  • Academic Publishing Using Git and GitHub – Arfon Smith ; Jessica Lord (GitHub)
  • Intro to IPython Notebook – Kyle Kelley (Rackspace)
  • Collaborative Development: Teaching data on the web – Karthik Ram (rOpenSci)
  • Dealing with Messy Data – Milena Marin (OKFN)
  • Indie Science – Cindy Wu (Experiment.com)
  • Learning Analytics – Adam Lofting, Doug Belshaw (Mozilla)
  • Spreadsheets – Milena (OKFN)

Other sessions:

  • Open Science Collaboration for Development – Angela Okune (iHub)
  • Teen-Driven Open Science – David Bild (Nature Museum)
  • Scientific peer review: identifying conflicts and fraud – Rebecca Lawrence (F1000)
  • “Can you help me ‘break’ my project?” – Knight Lab students

But it would be a waste to go to MozFest and never leave the science floor. Jump around to the different floors, explore. A full list of proposals is here.

Don’t forget to register. See you soon!


Open Source for Open Metrics

The scientific community is increasingly recognizing how the open science enterprise critically relies on access to scientific tooling. John Willinsky, Stanford scholar and Director of the Public Knowledge Project (PKP), presaged this development, calling it an “unacknowledged convergence” between open source, open access, and open science in 2005. Today, the vision has developed in material ways with a number of organizations coordinating the development of open source software to analyze data, connect it to existing and new channels for disseminating the results, and more. Michael Woefle, Piero Olliaro, and Matthew Todd’s crystal clear summary of the advantages of open science is one of many compelling arguments circulating in presentations, blogs, and the media.

"Bellalagotto1" by Entheta - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons


PLOS’s mission to advance open access is a part of this larger enterprise – doing science in an open way also means communicating it openly for others to read and reuse. Furthermore, these research outputs entail not only the content itself, but metadata surrounding the object, including type, provenance, conditions of its production and its reception, etc. Article-level metrics are a part of this diverse family, enabling authors, readers, institutions, and funders to better understand the evolving state of the research object.

The ALM software, which harvests article-level activity, was started at PLOS in 2009 and made available as Open Source software in 2011. Over the last three years a small but growing community of developers has reported issues with the software, reminded us of missing documentation, and contributed code. Earlier this year we started a community site for the project, including a page listing all contributors.

We are happy to announce that last week – with big help from developer Jure Triglav – we also made the ALM Reports application available as open source software. ALM Reports uses the PLOS Search and ALM APIs to generate reports and visualizations of metrics for PLOS content. In the initial release ALM Reports only works with the PLOS Search API, but any enterprising developer can now take the ALM Reports application and tweak it to his or her needs.


Both ALM and ALM Reports are licensed under an MIT license, the most popular license for Ruby on Rails software (the web framework used to create them), and one of the open licenses recognized by the Open Source Initiative. The MIT license is a permissive open source license, allowing unrestricted commercial reuse, and several commercial and non-profit organizations are using the ALM software in their production systems.

In celebration of this new era, we have renamed the software Lagotto, in reference to the Lagotto Romagnolo, a water dog that originally comes from the Romagna region of Italy. The breed is especially gifted in hunting and retrieving, used not only as a water retriever but also for hunting truffles. In the same spirit, the ALM application hunts for activity surrounding a research paper, bringing it back and making it visible to the community.

The data collected by the PLOS ALM service have always been freely available via API and a monthly data package. Now that the software used to collect and analyze the data is also available under an open license, other users can validate the data, perform additional data quality checks, and, even more critically, open up the discussion about the values underlying what each community is measuring and using for assessment.


Category: Tech

Diving into the haystack to make more hay?

Diving into the haystack to make hay is one of the most inefficient activities imaginable (as well as a figurative absurdity). Doing science inevitably entails discovery, but the process has historically been far more difficult without effective tools to support it. With the quickening pace of scholarly communications, the vast volume of information available is overwhelming. The key puzzle piece might be as hard to find as a needle, and moreover, no scholar has the time to dive into the hay-verse to make more hay.

With recent advances in data science and information management, research discovery has become one of the most pressing and fastest growing areas of interest. Reference managers (ReadCube, Mendeley, Zotero) and dedicated services have been experimenting with novel ways to deliver relevant articles to scholars (PubChase, Sparrho, Eigenfactor Recommends). While we have by no means exhausted the number of ways to innovate and make them mainstream, we have yet to turn our attention to scholarly objects beyond the article. The utility of finding the full research narrative is established, a given. But the potential value of discovering and accessing scholarly outputs that are created before final results are communicated or never integrated into an article at all is almost untapped. Until now.

Scholarly Recommendations

Last year, our partnership with figshare began with the hosting and display of Supporting Information files on the article, accommodating the broad range of file types found in this mixing pot. Both the Supporting Information file viewer and figshare’s PLOS figure and data portal increase the accessibility of the data and article component content associated with our articles. The latter makes PLOS figures, tables, and SI files searchable based upon key terms of interest.

Beyond the article: Today, we continue to build on the figshare offering with the launch of research recommendations, a service which delivers relevant outputs associated with PLOS articles and beyond. This begins to fill a critical need for tools that address the full breadth of research content. Rather than being limited by the article as a container, we can now present a far broader universe of scholarly objects: figures, datasets, media, code, software, filesets, scripts, etc.

figshare recommendations widget

 Hansen J, Kharecha P, Sato M, Masson-Delmotte V, Ackerman F, et al. (2013) Assessing “Dangerous Climate Change”: Required Reduction of Carbon Emissions to Protect Young People, Future Generations and Nature. PLoS ONE 8(12): e81648. doi: 10.1371/journal.pone.0081648

While papers tell the story of the final research narrative, data – the building blocks of science – are especially critical to the progress of research. They underlie the results expressed in a narrative that is published in papers. The most rigorous and expansive path of discovery includes not only related articles, but arguably even more fundamentally, data and a host of research outputs that lead up to the paper or may even be independent of an article. PLOS’ data availability policy was the foundational step, ensuring that data is publicly accessible. Delivering strong recommendations to surface relevant research now adds even more value for the scholarly community.

Beyond the publisher: The recommendations delivered by figshare extend beyond research outputs attached to PLOS publications. In fact, they are retrieved from the entire figshare corpus of over 1.5 million objects. We want to enrich the discovery experience for users using the breadth of possible OA research outputs, regardless of whether they have been published as part of a research paper. Not every scholarly output fits in an article, but many may still be instrumental to others’ research.

Right at your fingertips

The recommendations are displayed for every PLOS article on the Related Content tab. To select the most relevant ones, figshare uses Latent Semantic Analysis across the entire PLOS corpus to build a “semantic” matrix, which is then used to retrieve a list of the most closely related entries for each article. Five recommendations are displayed with the option to load more. The type of file is denoted by icons or thumbnails when available, with a preview of the object upon hover-over. The full view of the file is available by clicking on the thumbnail. The content and all its metadata are available on figshare via the highlighted title. Keyword tags are also displayed, which can be used to find other associated content of that kind.
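
For readers curious how this kind of matching can work, here is a minimal sketch of Latent Semantic Analysis in Python. It is not figshare's implementation: the mini-corpus, the TF-IDF weighting, the choice of two latent dimensions, the cosine ranking, and all function names are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

import numpy as np

# Hypothetical mini-corpus standing in for article and figshare object text.
DOCS = {
    "climate-article": "carbon emissions climate change warming temperature",
    "climate-dataset": "global temperature records carbon emissions dataset",
    "primate-article": "primate skeleton fossil eocene morphology",
    "fossil-figure": "fossil skeleton scan primate morphology figure",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Return (document ids, documents-by-terms TF-IDF matrix)."""
    ids = list(docs)
    tokens = [tokenize(docs[doc_id]) for doc_id in ids]
    vocab = sorted({term for doc in tokens for term in doc})
    column = {term: j for j, term in enumerate(vocab)}
    doc_freq = Counter(term for doc in tokens for term in set(doc))
    matrix = np.zeros((len(ids), len(vocab)))
    for row, doc in enumerate(tokens):
        for term, count in Counter(doc).items():
            idf = math.log(len(ids) / doc_freq[term]) + 1.0
            matrix[row, column[term]] = count * idf
    return ids, matrix


def lsa_embed(matrix, k):
    """Project documents onto the top-k latent dimensions of a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    embedding = u[:, :k] * s[:k]
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    # Unit-length rows make the dot product equal to cosine similarity.
    return embedding / np.where(norms == 0, 1.0, norms)


def recommend(doc_id, ids, embedding, top_n=5):
    """Rank all other documents by cosine similarity in the latent space."""
    query = embedding[ids.index(doc_id)]
    scores = embedding @ query
    ranked = sorted(
        ((score, other) for other, score in zip(ids, scores) if other != doc_id),
        reverse=True,
    )
    return [other for _, other in ranked[:top_n]]


ids, matrix = tfidf_matrix(DOCS)
embedding = lsa_embed(matrix, k=2)
print(recommend("climate-article", ids, embedding, top_n=1))
```

In this toy corpus the climate dataset ranks first for the climate article because the two share vocabulary, which the SVD collapses onto the same latent dimension. A production system would work on a far larger matrix, with many more latent dimensions.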

figshare recommendations widget2

Franzen JL, Gingerich PD, Habersetzer J, Hurum JH, von Koenigswald W, et al. (2009) Complete Primate Skeleton from the Middle Eocene of Messel in Germany: Morphology and Paleobiology. PLoS ONE 4(5): e5723. doi: 10.1371/journal.pone.0005723

Mark Hahnel, founder of figshare, said “PLOS has continuously demonstrated their desire to advance academic publishing and we’re always very happy to play a part in their innovations. The latest developments will ultimately make figshare content more discoverable and benefits our user base as well as PLOS readers and authors.”

With the figshare recommendations, it is our aim to advance the process of discovery and accelerate the research lifecycle itself. Please check them out on the Related Content tab of every PLOS article, dig into the offerings, and see where your research goes. We welcome your thoughts and reflections. Are these useful and relevant? Would you like them delivered through additional channels? Feel free to comment here or contact @jenniferlin for PLOS information. figshare is also available via email, Twitter, Facebook or Google+.

Cross-posted on figshare blog.

Category: Tech | 2 Comments

A Step-by-Step Approach to Content Management


Image credit: Michael Morris

Today at PLOS, we celebrate an important milestone: We launched the first iteration of the new PLOS CMS (codename: Lemur). This first installment is a homepage editor for six of the PLOS journals (Biology, Medicine, Pathogens, Computational Biology, Genetics, and NTDs). Our homepage editor is a browser application that facilitates curating and preparing content to feature on the journal homepages. It lets our journal staff queue up items to feature, write blurbs about each item in a WYSIWYG editor, select images, and perform basic image manipulations. It has a drag-and-drop interface to easily reorder featured content, and as you’d expect, it includes previewing and publishing controls.

Our launch timing, not so coincidentally, corresponds to the launch of our updated homepage design for the aforementioned six journals. The new, more sophisticated homepage design called for sophisticated curation controls, so it was an easy call to make the homepage editor the first functional area for the new PLOS CMS to tackle. The old process for homepage updates was manual and restrictive. But we’re restricted no longer—journal staff now have all the controls they need at their fingertips, without having to be HTML experts.

PLOS’s new CMS is an internal tool… for today. But that could evolve over time, as the scientific publishing community moves to a brave new world that’s “beyond the journal”. I’ll dig deeper into the idea of meeting an evolving landscape by answering the question you’re probably asking yourself right now:

Why are we building our own CMS?

There are several categories of answers to this question, presented here in no particular order:

  • Flexibility: We could have gone with an off-the-shelf CMS. But it’s likely that we would have actually required two different off-the-shelf content management systems: one to manage our web content, and one to manage our article corpus. These two content types generally call for two very different kinds of CMS. The idea of adopting and adapting two different CMSes (or selecting just one type of CMS and heavily customizing it to work with both content types) was not very appealing. It is particularly unappealing when viewed through the lens of our ultimate goal, which is to blur the lines between all of our different content and media types, and to interact with and present any combination of them seamlessly, in ways we can’t necessarily predict today. That’s why we have opted to separate the functions of content management, and build a curation layer for pulling the content together in the ways we need.
  • Innovation: We don’t want to miss opportunities to innovate on content delivery. We suspect that innovation would be slowed if we become locked into monolithic systems that are difficult to customize.
  • Technical considerations: Separation of content management functions from a technical perspective makes a lot of sense. You hear of a variety of publishing operations opting to roll their own for the same reasons, including the New York Times. This separation should also make us more agile in the sense of being able to respond relatively quickly to emerging needs, new ideas for innovation, or other interesting technological developments we want to try out.
  • Longer term community ideas: We start today by building Lemur with specific curation tasks in mind. But it’s easy to imagine a future in which anyone could curate PLOS (and other) content, using tools we provide. Imagine societies or educators using our content and curating it to their own needs, which may or may not be tied to the traditional model of the journal. Many examples of how this could evolve spring to mind!
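
To make the idea of separate content stores plus a curation layer more concrete, here is a small Python sketch. It does not describe Lemur's actual design: every class, method, and identifier below is hypothetical, and the in-memory stores merely stand in for a web content system and an article repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class ContentItem:
    item_id: str
    kind: str
    title: str


class ContentStore(Protocol):
    """Anything that can look up a content item by its identifier."""

    def get(self, item_id: str) -> ContentItem: ...


class InMemoryStore:
    """Hypothetical stand-in for a real content management backend."""

    def __init__(self, items):
        self._items = {item.item_id: item for item in items}

    def get(self, item_id):
        return self._items[item_id]


@dataclass
class CuratedCollection:
    """Curation layer: an ordered list of references into separate stores."""

    stores: Dict[str, ContentStore]
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def feature(self, store_name, item_id):
        # Look the item up first so unknown references fail with KeyError.
        self.stores[store_name].get(item_id)
        self.entries.append((store_name, item_id))

    def move(self, old_index, new_index):
        # Reorder featured items, as a drag-and-drop interface would.
        self.entries.insert(new_index, self.entries.pop(old_index))

    def render(self):
        # Resolve each reference against its own store at display time.
        return [self.stores[name].get(item_id) for name, item_id in self.entries]


articles = InMemoryStore(
    [ContentItem("article-1", "article", "An example research article")]
)
web = InMemoryStore([ContentItem("post-1", "blurb", "Welcome to the new homepage")])

homepage = CuratedCollection(stores={"articles": articles, "web": web})
homepage.feature("articles", "article-1")
homepage.feature("web", "post-1")
homepage.move(1, 0)
titles = [item.title for item in homepage.render()]
print(titles)
```

The curated collection holds only references and an ordering, so each store can evolve or be replaced independently, and new content types can be added by registering another store.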

Lean makes it better

We’ve been using Lean methodology to develop the content management system, component by component. We’ve been testing our wireframes throughout the design process. We’ve co-located the entire team in a conference room—since February, mind you! We’re reaping the benefits (and enjoying the process) of pair programming. Close collaboration among developers, UX, and product owner noticeably improves both our product and our velocity, while collective code ownership ensures maintainability. And here’s the part that requires lots of discipline: We release a minimum viable product, so we can gather observations of how it is used and see how it could be improved, rather than anticipate the entire product in a vacuum. As we improve the individual features with the findings from our observations, we will also incorporate that learning in the approaches we take with the rest of the system.


Category: Tech