Earlier this week we released version 3.8 of the Lagotto open source software (migrating the software to version 4 of the Rails framework). Lagotto is software that tracks events (views, saves, discussions and citations) around scholarly articles and provides these Article-Level Metrics to all journal articles published by PLOS. The project was started in 2008, and I am the technical lead since May 2012. This is the 24th release since I am involved with the software, and I want to take the opportunity to go into more detail about some of the lessons learned along the way.
We achieved some important Article-Level Metrics milestones in the past few weeks. In October we collected 500 million events around PLOS articles – an event can be anything from a pageview, tweet or Facebook like to a mention on a Wikipedia page to citation in another scholarly journal (data).
Most of these events are usage stats in the form of HTML page views and PDF downloads – including usage stats from PubMed Central and usage stats for supplementary information files hosted by figshare. But we are tracking events from a diverse range of data sources (data with logarithmic scaling):
Another important milestone was reached this October: PLOS is no longer the largest user of the Lagotto software. We were surpassed by the Public Knowledge Project that is providing a hosted ALM service for journals running the open source OJS software, and by CrossRef Labs, which loaded all 11 million DOIs registered with them since January 2011 into the software (data, unknown means software versions before 3.6).
Three features implemented in the last three months made this possible:
- tight integration with the CrossRef REST API to easily import articles by date and/or publisher (loading the 11 million CrossRef DOIs took a little over 24 hours)
- many performance improvements, in particular for database calls
- support for publisher-specific configurations and display of data.
Tracking events around scholarly articles is almost by definition a large-scale operation, collecting 500 million events and serving them via an API takes a lot of resources. Work on scalability has focussed on two areas: handling large numbers of articles and associated events, and making the API as fast as possible. Essential for any scalability work is good monitoring of the performance and we have mainly used two tools. rack-mini-profiler is a profiler for development and production Ruby applications, and gives detailed real-time feedback:
One lesson learnt is that it is critical to understand the database calls in detail and not rely on an ORM (object-relation mapping) framework (ActiveRecord in the case of Rails) to always make the right decisions.
The other important lesson learnt is that caching is absolutely critical. Lagotto uses memcached to do some pretty complex caching. For API calls we use fragment caching, and in the admin dashboard we also use model caching of some slow queries. Fragment caching is a common pattern nicely supported by Rails and the rabl library we use for JSON rendering, although we run into some edge cases because our API responses are a bit too complex. Model caching is part of a custom solution where we cache all slow database queries in the admin dashboard and refresh the cache every 60 min via a cron job.
Making it easy to install and run your software is the most important thing you can do if you want your open source project to be used by other people and organizations. This is a complex topic and includes several aspects: testing, error tracking, automation, documentation, and user support.
The overriding aim is obviously to ship bug-free software. The approach that the Ruby community has taken is to write extensive test coverage for your application. We are using Rspec, Cucumber and Jasmine for tests, and track test coverage (and overall code quality) using Code Climate.
Lagotto can create thousands of errors every day if there are problems with even one of the external APIs which the application talks to. For this reason we have implemented a custom error tracker that stores all errors in the database and makes it easy to search for them, receive an email for critical errors, etc.
The alerts dashboard of course also helps discover problems with the code that are missed by testing for a variety reasons, such as in the example above.
Lagotto is a typical Rails web application for those who are familiar with Rails. Unfortunately many people are not. Automating installation and deployment of new releases is therefore critical. Since 2012 we have worked with three tools to automate the process:
- Vagrant, a tool to easily create virtual development and deployment environments
- Chef, and IT automation tool
- Capistrano, the standard Rails deployment tool
All three tools are written in Ruby, making it straightforward to customize them for a Ruby application. Deploying a new Lagotto server from scratch can be done in less than 30 min, and we have refined the process over time, including some big changes in the recent 3.7 release. Thanks to Vagrant, Lagotto can be easily deployed to a cloud environment such as AWS or Digital Ocean. After some initial work with platform as a service (PaaS) providers CloudFoundry and OpenShift (I didn’t try Heroku). I chose not to delve deeper, mainly because these tools added a layer of complexity and cost that wasn’t outweighed by the easy of deployment. The elephant in the room is of course Docker, an exciting new application platform that we hope to support in 2015.
The typical user of the Lagotto application isn’t hosting applications in a public cloud, and not all use virtual servers internally. We therefore still need good support for manual installation, and here documentation is critical.
All Lagotto documentation is written in markdown, making it easy to reuse nicely formatted documents (links, code snippets, etc.) in multiple places, e.g. via the Documentation menu from any Lagotto instance.
Writing documentation is hard, and all pages are evolving over time based on user feedback.
User Support Forum
In the past I have answered most support questions via IM, private email, mailing list or Github issue tracker. There are some shortcomings to these approaches, and we have therefore last month launched a new Lagotto forum (http://discuss.lagotto.io) as the central place for support.
The site runs Discourse, a great open source Ruby on Rails application that a number of (much larger) open source projects (e.g. EmberJS or Atom) have picked for user support. The forum supports the third-party login via Mozilla Persona we are using in Lagotto (you can also sign in via Twitter or Github), and has nice markdown support so that we can simply paste in documentation snippets when creating posts. Please use this forum if you have any questions or comments regarding the Lagotto software.
Users of the Lagotto application need to know what high-level features are planned for the coming months, and be able to suggest features important to them. The Lagotto forum is a great place for this and we have posted the development roadmap for the next 4 months there. The planned high-level features are:
- 4.0 – Data-Push Model (November 2014)
- 4.1 – Data-Level Metrics (December 2014)
- 4.2 – Server-to-Server Replication (January 2015)
- 4.3 – Archiving and Auditing of Raw Data (February 2015)
Work on the data push API has already begun and will be the biggest change to the system architecture since summer 2012. This will allow the application to scale better, and to be more flexible in how data from external sources are collected.
The data-level metrics work relates to the recently started NSF-funded Making Data Count project. This work will also make Lagotto much more flexible in the kinds of things we want to track, going well beyond articles and other items that have a DOI associated with them.
One of the biggest risks for any software project is probably feature creep, adding more and more things that look nice in the beginning, but make maintenance harder and confuse users. Every software, in particular open source software, needs a critical mass of users. I am therefore happy to expand the scope of the software beyond articles, and to accommodate both small organizations with hundreds of articles as well as large instances with millions of DOIs.
But we should keep Lagotto’s focus as software that tracks events around scholarly outputs such as articles and makes them available via an API. Lagotto is not is a client application, which the average user would interact with. Independent Lagotto client applications serve this purpose, in particular journal integrations with the API (at PLOS and elsewhere) as well as ALM Reports and the rOpenSci alm package.
Lagotto is also not an index service with rich metadata about the scholarly outputs it tracks. We only know about the persistent identifier, title, publication date and publisher of an article, not enough to provide a search interface, or a service that slices the data by journal, author, affiliation or funder.
The future for Lagotto looks bright.