When Counting is Hard

Editor’s note: This post originally appeared on the DataCite blog. It is republished here with permission.

Counting is hard. But when it comes to research data, not in the way we thought it was (example 1, example 2, example 3). The Making Data Count (MDC) project aims to go further – to measurement. But to do so, we must start with basic counting: 1, 2, 3… uno, dos, tres…

MDC is an NSF-funded project to design and develop metrics that track and measure data use, “data-level metrics” (DLM). DLM are a multi-dimensional suite of indicators, measuring the broad range of activities surrounding the reach and use of data as a research output. Our team, made up of staff from the University of California Curation Center at the California Digital Library, PLOS, and DataONE, investigated the validity and feasibility of these metrics by collecting and analyzing harvested data to power discovery and reporting of datasets that are part of scholarly outputs.

To do this, we extended Lagotto, an open source application, to track datasets and collect a host of online activity surrounding them, from usage to references, social shares, discussions, and citations. During this pilot phase we ran DLM against a test corpus of all datasets in DataONE member repositories (~150k). In the early DLM data collected, the overall usage profile of datasets appears to be significantly different from that of scholarly journal articles.

Counting what we cannot see

Within this spectrum of indicators, citations – the single focus of this blog post – are considered by far the most interesting metric by both researchers and data managers (Kratz and Strasser, http://doi.org/10.1038/sdata.2015.39). However, citations currently pose a difficult measurement challenge. Article citation services are fairly well established – some are openly available (PubMed Central), while others require subscriptions to gain access (Scopus and Web of Science) or publisher membership to participate (Crossref). To date, only one major data citation service exists across journals, and it is relatively untested by the community, in part due to its subscription cost: Thomson Reuters’ Data Citation Index. There are, however, other initiatives beginning to explore this arena, such as bioCADDIE, an NIH Big Data to Knowledge initiative. One of the biggest challenges in understanding dataset usage stems from how researchers cite datasets. By and large, researchers do not cite datasets, and where datasets are cited, practice varies widely. Datasets may be mentioned in numerous places within the main body of the paper and/or formally cited in the reference section. There are emerging efforts by publishers, data repositories, and funders to standardize this practice and enable the propagation of research data across scholarly communications.

The DLM approach

Without access to a tested and openly available index of data citations across the journal literature, what are we to do today? Since we cannot plug into a system that has already aggregated them (as most sources do), we have to mine the literature directly and collect the connections ourselves. How does the DLM application do this, and what content is targeted? The DLM application conducts full text searches across a corpus of content, looking for any mention of a dataset via its persistent identifier (DOI, ARK, etc.). Regardless of where the dataset persistent identifier appears – methods section, results, reference list, figure legend, etc. – the DLM application is able to find it. DLM uses publishers’ open APIs to do this. To date, we have implemented this search for all publications from PLOS, BMC, and Nature (all journals listed on nature.com). Additionally, we ran the full text search on articles indexed in the Europe PMC corpus, a mirror of PubMed Central’s corpus with 3.3 million articles available.
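
To make the mechanics concrete, here is a rough sketch – not the actual DLM/Lagotto code – of what a full-text lookup for a single dataset identifier could look like against the Europe PMC REST search API. The parameters shown should be checked against the API documentation, and the dataset DOI is purely illustrative.

import requests

# Sketch only: query a full-text search API for any article that mentions a
# given dataset DOI. The endpoint and parameters are based on the public
# Europe PMC REST API; the dataset DOI below is a made-up example.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def articles_mentioning(dataset_doi, page_size=25):
    params = {
        "query": '"%s"' % dataset_doi,   # quoted so the DOI is matched as a phrase
        "format": "json",
        "pageSize": page_size,
    }
    resp = requests.get(EUROPE_PMC_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    hits = resp.json().get("resultList", {}).get("result", [])
    # Each hit is a candidate dataset-article link ("event" in DLM terms).
    return [(dataset_doi, hit.get("doi") or hit.get("id")) for hit in hits]

print(articles_mentioning("10.5061/dryad.example"))   # hypothetical dataset DOI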

Researchers who publish in journals whose content is openly available (i.e., open access), or whose corpus is openly searchable via an API, have a citation advantage and a better chance of getting credit (Eysenbach G., http://doi.org/10.1371/journal.pbio.0040157). In a recent analysis of gene expression microarray studies, this phenomenon was also identified for datasets (Piwowar, H. et al., http://doi.org/10.7717/peerj.175). We have now verified this more widely in our pilot system, with datasets representing a broader range of subject areas (life sciences and geosciences).

Counting events as ‘sorta’ counts

We found a number of interesting results from the data citations collected on DataONE datasets. While full analyses of the data (over time and across channels) are still underway, we are able to share some very preliminary findings:

  • BMC – 339 events
  • PLOS – 741 events
  • Nature OpenSearch – 388 events
  • Europe PMC – 2107 events

Data collected 12 Aug 2015

The counts above show the number of events collected and stored in DLM for matches found between a DataONE dataset DOI and a publication in the search sources. DLM counts an event when the persistent identifier for a dataset is detected in a paper and picked up by the source API. This is a data citation in the loose sense of the word. (Some may define data citations in a formal sense, based on requirements such as location, metadata identification, etc.) This brings us to our first finding, which, while manifestly evident, is still worth mentioning given the lack of real information on article–data connections: the counts reflect only the set of journals covered by each search. If we want an objective count of data citations, we cannot simply sum all the events collected across citation search sources. Data from publisher-specific sources are unique because each is limited to that publisher’s corpus, but Europe PMC covers the corpora of multiple publishers, some of which are already covered by the other sources. At the same time, the Europe PMC archive is limited to articles deposited by participating journals, leaving out the majority of articles from subscription-based publishers. In our current configuration, with the citation source list described above, the fullest set would be the unique list of dataset–article links from the combined results of Europe PMC and Nature OpenSearch. Early evidence shows that the DLM approach to searching the Europe PMC archive is effective at extracting linked publication connections for the open access publishers covered here, making it redundant to poll individual publishers that are already indexed. As we continue to develop the system, future opportunities include expanding the search source list to text mine a broader set of publisher content (Crossref TDM, etc.) and exploring connections with other emerging data–literature linking efforts (DLI Service, etc.).
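
As a toy illustration of why raw sums mislead – this is not the MDC implementation – treat each source as a list of (dataset, article) links and count unique pairs rather than events. The link data below is invented; only the source names are real.

from itertools import chain

# Invented example data: the same dataset-article link can be surfaced by
# more than one search source, so summing events over-counts citations.
events_by_source = {
    "plos":              [("10.5061/dryad.aaaa", "10.1371/journal.pone.1111111")],
    "europe_pmc":        [("10.5061/dryad.aaaa", "10.1371/journal.pone.1111111"),
                          ("10.5061/dryad.bbbb", "10.1371/journal.pbio.2222222")],
    "nature_opensearch": [("10.5061/dryad.cccc", "10.1038/article.example")],
}

raw_events   = sum(len(links) for links in events_by_source.values())
unique_links = set(chain.from_iterable(events_by_source.values()))

print("raw events:", raw_events)               # 4 - the naive sum over sources
print("unique citations:", len(unique_links))  # 3 - deduplicated dataset-article links

The unit we ultimately want to count is the unique dataset–article link, not the raw event.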

However, some events captured are not dataset citations in the formal sense. We discovered two types that are arguably not dataset citations and add unwanted noise. In rare instances, DLM pulls article corrections into a data citation search when the data reference is part of a formal literature update – in this instance it is treated as a separate publication. This has occurred in both the PLOS and Nature corpora. In the former case, PLOS publishes a data access statement along with every article. The new policy, which began in April 2014, has elevated the visibility and prominence of the underlying data, which may have contributed to the added corrections. Additionally, Nature article PDFs published before 2014 are consistently included in the Nature OpenSearch count along with the online article itself. For example, the Dryad dataset on somatic deleterious mutation rates (10.5061/dryad.t8q7t) pulls back two events from Nature: the associated article and the article PDF (search results here). Thus, we are double counting dataset citations in these articles.

Unlike the search sources, DataCite counts capture a different set of elements altogether for our sample corpus: the components in a dataset package are each counted separately, along with the article the dataset was originally part of, if applicable. For example, the Dryad dataset http://doi.org/10.5061/dryad.3qd54 returns a DataCite count of 8 events because the dataset has 7 component datasets (10.5061/DRYAD.3QD54/1 – 10.5061/DRYAD.3QD54/7) that are linked to one associated article. This tally provides us with meaningful information about the dataset, though the numerical count of events in itself should not be directly included in the data citation calculation.
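
One simple way to fold such component records back onto their parent dataset before counting – a sketch under the assumption that components follow the ‘/N’ suffix convention seen in the Dryad example above, not part of the DLM code – is to strip the trailing suffix:

import re

# Collapse component DOIs (e.g. 10.5061/DRYAD.3QD54/3) onto the parent dataset
# DOI so that eight component events tally as one dataset. The "/<digits>"
# suffix rule is an assumption based on the example above.
COMPONENT_SUFFIX = re.compile(r"/\d+$")

def parent_doi(doi):
    return COMPONENT_SUFFIX.sub("", doi.lower())

events = ["10.5061/DRYAD.3QD54/%d" % i for i in range(1, 8)] + ["10.5061/dryad.3qd54"]
datasets = {parent_doi(d) for d in events}
print(len(events), "events ->", len(datasets), "dataset")   # 8 events -> 1 dataset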

Counting in the early days (before the abacus)

Data citation is a very messy business in the absence of widespread and consistent data citation practices, and the DLM data provides further verification of the challenges and their downstream impacts. Citations to data appear primarily inline and only rarely in reference lists. We will investigate the frequency of reference-list citations further with the Europe PMC corpus. Datasets are almost always mentioned as part of the research process, but they are described in such a wide range of ways that it is extremely hard to automatically tag or classify them. Only on rare occasions is a dataset handled in the same manner as article references. These are ‘wild west’ days for data citation, as publishers continue to consider policies and practices for implementing data citations in their submission systems. Data repositories can also have a significant impact on the data citation and publication end. Currently, all of the top-cited DataONE datasets are associated with the Dryad Digital Repository. Dryad actively partners with publishers to integrate data and manuscript submission workflows, thereby facilitating data-article linkage. Dryad also provides clear instructions to researchers on how to cite datasets deposited in its repository.

Measurement is recognized as difficult, but counting proves to be its own challenge. The MDC pilot has begun to test the design of data metrics and its preliminary results have already begun to offer a richer view into the ways and degree in which researchers are really using scholarly data in the wild. Next up, we will begin to examine usage statistics, another fascinating area which we are eager to dive into. For full project background information and the latest progress updates, please visit the main MDC project page: http://mdc.lagotto.io.

This is a guest post by Jennifer Lin, project manager for the Making Data Count project, and since last week CrossRef Director of Product Management.


Crowdsourcing a Language for the Lab

Neither human nor machine communication can happen without language standards. Advancing science and technology demands standards for communication, but also adaptability to enable innovation. ResourceMiner is an open-source project attempting to provide both.

From Gutenberg’s invention of the printing press to the internet of today, technology has enabled faster communication, and faster communication has accelerated technology development. Today we can zip photos from a mountaintop in Switzerland back home to San Francisco with hardly a thought, but that wasn’t so trivial just a decade ago. It’s not just selfies that are being sent; it’s also product designs, manufacturing instructions, and research plans — all of it enabled by invisible technical standards (e.g., TCP/IP) and language standards (e.g., English) that allow machines and people to communicate.

But in the laboratory sciences (life, chemical, material, and other disciplines), communication remains inhibited by practices more akin to the oral traditions of a blacksmith shop than the modern internet. In a typical academic lab, the reference description of an experiment is the long-form narrative in the “Materials and Methods” section of a paper or a book. Similarly, industry researchers depend on basic text documents in the form of Standard Operating Procedures. In both cases, essential details of the materials and protocol for an experiment are typically written somewhere in a long-forgotten, hard-to-interpret lab notebook (paper or electronic). More typically, details are simply left to the experimenter to remember and to the “lab culture” to retain.

At the dawn of science, when a handful of researchers were working on fundamental questions, this may have been good enough. But nowadays this archaic method of protocol recordkeeping and sharing is so lacking that half of all biomedical studies are estimated to be irreproducible, wasting $28 billion each year of US government funding. With more than $400 billion invested each year in biological and chemical research globally, the full cost of irreproducible research to the public and private sector worldwide could be staggeringly large.

One of the main sources of this problem is that there is no shared method for communicating unambiguous protocols, no standard vocabulary, and no common design. This makes it harder to share, improve upon, and reuse experimental designs — imagine if a construction company had no blueprints to rely on, only ad hoc written documents describing their project. That’s more-or-less where science is today. It makes it hard, if not impossible, to compare and extend experiments run by different people and at different times. It also leads to unidentified data errors and missed observations of fundamental import.

To address this gap, we set out at Riffyn to give lab researchers the software design tools to communicate their work as effortlessly as sharing a Google doc, and with the precision of computer-aided design. But precise designs are only useful for communication if the underlying vocabulary is broadly understood. It occurred to us that development of such a common vocabulary is an ideal open-source project.

To that end, Riffyn teamed-up with PLOS to create ResourceMiner. ResourceMiner is an open-source project to use natural-language processing tools and a crowdsourced stream of updates to create controlled vocabularies that adapt to researchers’ experiments and continually incorporate new materials and equipment as they come into use.

A number of outstanding projects have produced, or are producing, standardized vocabularies (BioPortal, Allotrope, Global Standards Institute, Research Resource Initiative). However, the standards are constantly battling to stay current with the shifting landscape of practical use patterns. Even a standard that is extensible by design needs to be manually extended. ResourceMiner aims to build on the foundations of these projects and extend them by mining the incidental annotation of terminology that occurs in the scientific literature — a sort of indirect crowdsourcing of knowledge.

We completed the first stage of ResourceMiner during the second annual Mozilla Science Global Sprint in early June. The goal of the Global Sprint is to develop tools and lessons to advance open science and scientific communication, and this year’s was a rousing success (summary here). More than 30 sites around the world participated, including one at Riffyn for our project.

The base vocabulary for ResourceMiner is a collection of ontologies sourced from the National Center for Biomedical Ontology (Bioportal) and Wikipedia. We will soon incorporate ontologies from the Global Standards Initiative and the Research Resource Initiative as well. Our first project within ResourceMiner was to annotate this base vocabulary with usage patterns from subject-area tags available in PLOS publications. Usage patterns will enable more effective term search (akin to PageRank) within software applications.

Out of the full corpus of PLOS publications, about 120,000 papers included a specific methods section. The papers were loaded into a MongoDB instance and indexed on the Methods and Materials section for full-text searches. About 12,000 of 60,000 terms from the base vocabulary were matched to papers based on text string matches. The parsed papers and term counts can be accessed on our MongoDB server, and instructions on how to do that are in the project Github repo. We are incorporating the subject-area tags into Riffyn’s software to adapt search results to the user’s experiment and to nudge researchers into using the same vocabulary terms for the same real-world items. Riffyn’s software also provides users the ability to derive new, more precise terms as needed and then contribute those directly to the ResourceMiner database.
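
For the curious, the matching step boils down to something like the sketch below. The database, collection, and field names are illustrative stand-ins, not ResourceMiner’s actual schema.

from pymongo import MongoClient, TEXT

# Sketch only: store parsed papers with a "methods" text field, index it for
# full-text search, and count how many papers mention each vocabulary term.
client = MongoClient("mongodb://localhost:27017")
papers = client["resourceminer_demo"]["papers"]

papers.create_index([("methods", TEXT)])   # full-text index over Methods and Materials

def term_counts(vocabulary):
    counts = {}
    for term in vocabulary:
        query = {"$text": {"$search": '"%s"' % term}}   # quoted so phrases match exactly
        counts[term] = papers.count_documents(query)
    return {term: n for term, n in counts.items() if n > 0}

print(term_counts(["centrifuge", "PCR", "agarose gel"]))   # example terms only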

The next steps in the development of ResourceMiner (and where you can help!) are to (1) expand the controlled vocabulary of resources by text mining other collections of protocols, (2) apply subject tags to papers and protocols from other repositories based on known terms, and (3) add term counts from these expanded papers and protocols to the database. During the sprint, we identified two machine learning tools that could be useful in these efforts and could be explored further: the Named Entity Recognizer within Stanford NLP for new resource identification, and Maui for topic identification. Our existing vocabulary and subject-area database provide a set of training data.

Special thanks to Rachel Drysdale, Jennifer Lin, and John Chodacki at PLOS for their expertise, suggestions, and data resources; to Bill Mills and Kaitlin Thaney at MozillaScience for their enthusiasm and facilitating the kick-off event for ResourceMiner; and to the contributors to ResourceMiner.

If you’d like to contribute to this project or have ideas for other applications of ResourceMiner, please get in touch! Check out our project GitHub repo, in particular the wiki and issues, or contact us at hello@riffyn.com.


User Testing at PLOS

At PLOS we strive to provide world class tools that help scientists in publishing their work. This includes improving existing functionality, such as browsing and searching, as well as developing entirely new features to meet the needs of the researchers and others who read and publish in our journals.


To guarantee success in the real world we have to test our ideas with the people who will be using what we build. We already use many methods to gather feedback from our authors and readers, such as surveys and focus groups. However, observing someone actually using a tool or website is the most effective way to ensure that the things we build are easy to work with and will genuinely help our users accomplish their goals. This method is called usability testing.

In usability testing, we present a volunteer with the software or system we are designing. At the simplest level, this might be non-interactive sketches or designs, however it could also involve fully working software. We then ask our volunteer tester to perform the types of tasks the tool will be used for. As they give each task a try, we watch where they look and click, and listen to their thoughts. Observing how people try to use the tool while performing actual tasks gives us the very best feedback on whether our designs are intuitive, and effectively highlights rough spots that require further work.

In addition to providing us with important feedback, usability testing also functions as an important way in which we can collaborate with our customers and include them in the design process. For designers at PLOS, observing usability tests engenders empathy with our customers and provides a first-hand understanding of their needs. For our customers, participation allows them to contribute to the development of a successful tool designed specifically around their work.

If you would like to help PLOS build better tools and user experiences, sign up to participate in our usability tests.


Testing Made Awesome with Docker

As PLOS has grown (in users, articles, and developers) we have put a lot of effort into splitting our original applications into groups of services. My team is responsible for several web services. We have worked on finding open source tools that can help us and the other PLOS development teams across our growing service ecosystem. I would like to tell you about one of our favorite new tools: Docker.
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. For our purposes, Docker containers give us access to a jailed operating system that any of our services can run in, with almost none of the overhead of a virtual machine.

The many uses of Docker are still being realized and the broader community continues to determine what the best practices are. It is not just used by operations and development. For example, Melissa Gymrek wrote about using Docker to support reproducible research, using a PLOS Genetics article as a test case.

Some months back I began using Docker on a few projects that needed cross team development. We liked Docker, but we were not yet ready to take it to production deployments. Around the same time, our team was without a QA engineer, so we were tasked with setting up our own testing plan. As we talked about it, Docker seemed like a perfect fit for what we needed. We created test environments that mimic production using Docker containers and found that Docker makes it quick and easy to narrow the gap between production and development.

Consistent Environments

Dockerfiles are used to define an exact template for what each application needs, from operating system to libraries to open ports. In the git repository for each of our projects we include a Dockerfile right along with the source code. This way anyone who pulls the project can get it up and running with a few simple commands. The user no longer needs to install all the dependencies manually and then hope it works on their own computer, since this will all be taken care of in the Docker container. This means getting up and running is fast since the host machine requires nothing installed but the Docker service itself.

Docker containers are very lightweight, so it is beneficial to run only one application per container. Since your application likely has dependencies on other services (database, caching server, message queue, etc), you will need multiple Docker files to bring up a working stack. This is where Docker Compose comes in.

Docker Compose enables us to easily manage a collection of containers by defining them in a short yaml file. With a ‘docker-compose up’, our entire stack is up and running in a few seconds. Here is a sample compose.yml file:

  web:
    build: webServiceAPI            # location of our Dockerfile
    volumes:
      - ../target:/build            # share directory to get compiled build from
    links:
      - database:database           # network link between containers
    ports:
      - "8080:8080"                 # expose port 8080 to host
  database:
    image: mysql:5.5                # name of a community-supported image
    environment:
      - MYSQL_ROOT_PASSWORD=veryComplexPassword

Black Box Testing

Our web service container exposes port 8080 so it can be reached from the host. To run our API tests all we have to do is point them at that port, and we are testing the Dockerized API instead of a service running directly on the host. I wrote a few small scripts that orchestrate bringing those containers up and down as needed between various tests and now we have a very fast and comprehensive integration testing system.
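
A rough sketch of that kind of wrapper – with placeholder file names and a placeholder test command, not our actual scripts – looks something like this:

import subprocess
import time

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def with_stack(test_cmd):
    run(["docker-compose", "up", "-d"])       # start the whole stack in the background
    try:
        time.sleep(10)                        # crude wait for the service on port 8080
        run(test_cmd)                         # point the API tests at the exposed port
    finally:
        run(["docker-compose", "stop"])       # tear the stack down between test runs
        run(["docker-compose", "rm", "-f"])

with_stack(["./run_api_tests.sh", "http://localhost:8080"])   # hypothetical test runner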

Scalability Testing

You can use compose to bring up any number of instances of the services in your stack with something like:

> docker-compose scale web=10 database=3

This is great for testing things like database clusters or load balancers. I have seen hundreds of database containers run on a single laptop with no problem. Try doing that with a virtual machine.

Configuration Testing

It is easy to add very invasive automated tests here as well. Let’s say we want to automate the testing of different system configurations. We usually use MySQL, but this service can also work with a Postgres backend. This test can be added in a few minutes. Just create a Postgres container (or find an existing one on Docker Hub), and modify our test script to run as follows (a rough sketch of this loop appears after the list):

  • Bring up new stack (with MySQL backend), Seed with data, Run API tests, Tear down stack.
  • Bring up new stack (with Postgres backend), Seed with data, Run API tests, Tear down stack.
  • Report on any differences in test results between MySQL and Postgres backends.
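
Assuming one compose file per backend (the file names and the seed/test commands here are placeholders), the loop might look like:

import subprocess

BACKENDS = {
    "mysql":    "docker-compose.mysql.yml",      # hypothetical per-backend compose files
    "postgres": "docker-compose.postgres.yml",
}

results = {}
for backend, compose_file in BACKENDS.items():
    subprocess.check_call(["docker-compose", "-f", compose_file, "up", "-d"])
    try:
        subprocess.check_call(["./seed_test_data.sh"])                           # seed with data
        code = subprocess.call(["./run_api_tests.sh", "http://localhost:8080"])  # run API tests
        results[backend] = "pass" if code == 0 else "fail (exit %d)" % code
    finally:
        subprocess.check_call(["docker-compose", "-f", compose_file, "stop"])    # tear down stack
        subprocess.check_call(["docker-compose", "-f", compose_file, "rm", "-f"])

print(results)   # report any differences between the MySQL and Postgres runs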

Simplifying Development

In the past, some of our system tests used a different database technology than the production deployment. We used an in-memory H2 database since it is faster to bring up than a full database server. This has given us problems since we need to manage additional drivers specific to testing, and H2’s compatibility mode for MySQL only implements some of the MySQL feature set. Since Docker allows us to quickly create production-like systems, we have completely removed H2 and test right on a MySQL container. Now that we have fewer dependencies, we have less code, and fewer points of failure.

Simplifying Testing

Now that our stack is Dockerized, anyone can easily run the tests. New developers can run tests against their new code right away, leading to fewer checked in bugs. And QA can focus on tasks beyond regression testing.

Speaking of QA, we don’t need special servers that only QA has access to. Anyone with Docker installed can not only run the whole application stack but can run the full test suite. The same minimal requirements are in place on our continuous integration build server. We don’t need to take up any public facing ports. When the build finishes, it brings up a whole stack, tests it inside the local Docker service, and tears it down. Did I mention the overhead to run these tests against a brand new full stack is only seconds?! I probably did. But it’s worth saying again.


‘Open Source, Open Science’ meeting report – March 2015

Image credit: Olaus Linn, Noun Project

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.

Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholder’s interests. We also discussed industry interest in “open”, both in making good use of open data and in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question “Why go open?” for researchers.

Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training “decision tree” defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called “glue” – the technical infrastructure that is only noticed when it fails to function. One such collaboration is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.

Another repeated theme was the “open science bubble”. Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution – rather than “post preprints!”, say “get more citations!” Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as “open scientists” per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for “next meetings” such as hackathons, no decision has yet been reached. But, a more important decision was clear – the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.


Meeting participants:

Tal Yarkoni, University of Texas at Austin

Kara Woo, NCEAS

Andrew Updegrove, Gesmer Updegrove and ConsortiumInfo.org

Kaitlin Thaney, Mozilla Science Lab

Jeffrey Spies, Center for Open Science

Courtney Soderberg, Center for Open Science

Elliott Shore, Association of Research Libraries

Andrew Sallans, Center for Open Science

Karthik Ram, rOpenSci and Berkeley Institute for Data Science

Min Ragan-Kelley, IPython and UC Berkeley

Brian Nosek, Center for Open Science and University of Virginia

Erin C. McKiernan, Wilfrid Laurier University

Jennifer Lin, PLOS

Amye Kenall, BioMed Central

Mark Hahnel, figshare

C. Titus Brown, UC Davis

Sara D. Bowman, Center for Open Science


ALM Reports – more features, more power

We are pleased to share the newest features just launched in ALM Reports (http://almreports.plos.org/), which will better support the time-consuming and laborious work of tracking and reporting the reach of PLOS articles.

As before, you can get a broad view of the latest activity for any collection of PLOS publications via searches from key terms, individual DOI/PMIDs, and bulk upload of DOI/PMIDs. To enhance the search capabilities, we have introduced faceted search. With faceted search, users can do a broad search across the entire corpus and then winnow down results by restricting the searches to publication year, article type, and journal. Users can dive deeper into a search stream, peruse results, and come back up to target an adjacent one within the master search. Additionally, sorting based on ALM makes it easy to surface popular articles.

faceted search

These faceted filters will make direct search far more usable and boost the exploratory power of ALM in literature discovery.

ALM Reports also includes three new enhanced visualizations, built with the popular D3.js visualization library. The first displays article usage over time and allows users to select an additional source to add another dimension to the graph. Interesting trends may appear as you explore each paper’s profile across the various types of article engagement. Hover over each of the bubbles for additional article metadata (article title, journal, publication age, etc.).


The second graph provides a sunburst display of article usage by subject area.   Size correlates with total views from 85 articles.  Color intensity correlates with Scopus citation count.  Clicking on a subject area will zoom into its component subject areas. Clicking the center of the circle will zoom out to a more general view. Explore the activity of the article collection across subject areas by diving into the PLOS thesaurus of taxonomy terms.

subject area viz

The third visualization displays a geolocation map of authors (based on affiliation) for all the articles in the report.  Users can hover over each location on the map to display more detail such as institution, department, etc.  This can provide a potent view of where collaborators across multiple projects are based.


Finally, we have introduced a feature which we’ll be building out in future releases: user accounts.  We have integrated ALM Reports into PLOS journal user profiles.  Once logged in, ALM reports are automatically saved and available on the home page.

saved reports

In this initial release, reports are titled by a system number. In future releases, we will improve report naming to make report management easier. Starting now, though, users can get the latest ALM for any collection of papers by jumping directly to the report from the home page.

As discussed in a previous blog post, ALM Reports is an open source tool, which can be easily customized to deliver reports for any set of ALM data. We invite the community to use this tool in creative ways and consider how to integrate this new output into their research evaluation activities.

A big thanks to the lead developer of this major release, Jure Triglav, as well as to Martin Fenner, who managed this effort. We welcome your thoughts and comments on the tool at alm[at]plos.org.


An article phenome project, and reflections on the Music Genome Project®


I have an ongoing fascination with the distribution of data, no matter what form it takes, around our small blue planet. Thinking back thirty or so years, this may have all begun with the Grundig shortwave radio we had in our home when I was a child. How could this little box which sounded so good be grabbing “content” from almost halfway around the world where another part of my family lived? All that was necessary to listen to communications in Europe was the rotation of a band selector which provided focus on a portion of the shortwave spectrum, plus a bit of fine tuning with another knob. And there you had it – very slightly aged radio waves were captured and amplified for our enjoyment. Languages I’d never heard, events in different countries discussed, new discoveries broadcast for the masses.

Things have changed since then to say the very least. Content of all sorts has become overwhelmingly present. And content discovery in and of itself is hard, let alone keeping a finger on the pulse of that which interests an individual. Going a step further, let’s place hypothetical bookends on an equation and say one desires to consistently listen to a few portions of the broader spectrum of published, openly accessible scientific research. What is the best way to go about doing so? That now commonly asked question prompted many thoughts presented throughout this post.

Ultimately, while pondering the basis of how to enable selective consumption of content in a well-delineated system, this twofold question arises in my mind: what would an article “phenome” project entail, and how would it benefit us all?

Focus first on the nature of an individual article. What inherent traits contribute to its characterization? At PLOS, a thesaurus of thousands of Subject Area terms exists. Based on the presence of those terms or related concepts in the text of a given article, as well as their proximity to one another, the article itself will be associated with a Subject Area. The rules around association of terms with Subject Areas evolve over time. Through machine-aided indexing, those very human rules gate how applicable a Subject Area is to an article, assigning a weight factor along the way. A higher weight factor for a Subject Area indicates an associated term or concept appears more often within an article’s text. The PLOS taxonomy captures these article traits across the entire corpus.
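
As a toy illustration only – nothing like the actual machine-aided indexing rules, and with an invented mini-thesaurus – frequency-based weighting can be sketched in a few lines:

import re
from collections import Counter

SUBJECT_TERMS = {                     # invented mini-thesaurus for the example
    "Genomics": ["genome", "sequencing", "genome assembly"],
    "Ecology":  ["habitat", "species richness", "ecosystem"],
}

def subject_weights(text):
    """Weight each Subject Area by how often its associated terms appear."""
    text = text.lower()
    weights = Counter()
    for subject, terms in SUBJECT_TERMS.items():
        for term in terms:
            weights[subject] += len(re.findall(re.escape(term), text))
    return {subject: w for subject, w in weights.items() if w > 0}

article = "We report a draft genome assembly; sequencing revealed habitat-specific genes."
print(subject_weights(article))       # {'Genomics': 3, 'Ecology': 1}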

Though it may be a bit tempting to consider a Subject Area akin to a genetic trait, such a comparison breaks down when we adhere to a strict interpretation of a genome as a finite framework from which a trait is mapped forward. Indeed, it necessitates an implausible analogy in which every article is a species unto itself, never to be replicated word-for-word, which then inherits its genes from a proto-language of sorts.

For our purposes, a Subject Area might better be described as an atomic unit of article meaning – analogous to a phenotypic trait being an atomic unit of the expression of the underlying organism’s genome.

If so, how will these atomic units of an article ultimately interact with their environment to create a longstanding entity with unique characteristics? In other words, how will its environment affect such manifestation and result in the article’s phenotype? Can phenotypic patterns be made use of for the purposes of article discovery?

Beyond Subject Areas, consider characteristics such as readability and intrigue in the eyes of one audience versus the next. A sum total of parts – terms, figures, subsections, paragraph structure – plus the very nature of the article’s theme, all play a great role in how an article is able to express itself to its readers (one variety of phenotypic pattern). As a simple example, an article with great appeal to a broad audience per its topic and readability will likely be cited and referred to in numerous publications – spanning from the New York Times to Nature, Gizmodo and National Geographic Russia. Take for instance, “Sliding Rocks on Racetrack Playa, Death Valley National Park: First Observation of Rocks in Motion” (DOI: 10.1371/journal.pone.0105948). Collective data, as gathered through Article-Level Metrics can help us identify some of these patterns. And subsequent association of this data with an article as high-level metadata could begin to allow for methods of customizable content discovery and sharing.

Let’s step back for a minute and think about those same two elements in the context of the music industry – more specifically, in terms of Pandora’s Music Genome Project®. Although some have criticized Pandora for the name of the Music Genome Project (here’s an example), I’m squarely in the crowd that appreciates the listener end-benefits which come with definition of the 450 musical characteristics that Pandora’s project institutes – namely a way to navigate my own musical undiscovered country. As with our taxonomy, a human element is present in the systematic association of characteristics with every song in their catalog. And the platform Pandora has built to curate in-house, distribute, share and tune stations paved the way for today’s streaming services and their business models.

In turn, drawing a series of simple parallels between articles and individual works of music, collections (or channels) and stations becomes straightforward. Building on the individual characteristics of each, elevating those characteristics along with metrics, and the creation of customizable paths of distribution could very well push the open access model another great step forward.

If you had to further define aspects of an article phenome project for article discovery and distribution, where would you begin? How would you define patterns that would enable you to locate the research which interests you most? What metadata might aid in the definition of these? And how would you want to share the fruits of your labor?

I for one would want a socially-enabled ecosystem in which articles could thrive, and influence their own distribution through the expression of their phenotypes – while readers benefitted from accelerated article discovery.

For in the years to come, humans will birth articles as part of scientific research. And through their discovery, the cycle of life begins anew.



Summary of ALM data quality work

Article-level metrics data, like any other data, is subject to errors, inconsistencies, and bias. Data integrity is the foundation of trust for these metrics. Making sure data is correct is a difficult task, as collecting and providing article-level metrics data involves not only the party collecting it (here, PLOS), but also the sources (Crossref, Twitter) that provide the data. To this end, we must strive to account for all possible sources of problems with data and fix them whenever possible.

I’ve been doing some data quality work on article-level metrics data with PLOS. The goal was to explore data quality issues with article-level metrics data, using specifically the data collected on PLOS articles.

There were two sets of data:

  • monthly reports – these are basically spreadsheet dumps  of summary metrics for each data source for every article.
  • alerts data – The Lagotto application has an alert system that produces alerts for many things. I’ll explain some below.

I’ll do a high level summary of the findings from each set of data.

Monthly reports

A file is created at the end of each month. It holds all DOIs published up to that month, along with their article-level metrics. These files make it easy to analyze data for one particular month, or they can be summed together to get a picture of PLOS altmetrics over many months (as I do below). You can find these files on figshare.


The monthly data covers:

  • 16 months
  • Data from: 2012 (07, 08, 09, 10, 11, 12), 2013 (01, 04, 05, 08), 2014 (01, 03, 06, 07, 08, 09)
  • 128,986 DOIs
  • 43 article-level metrics variables

Summary statistics

The plots below use the data from the 2014-03 file, across all DOIs. Sources that have no data, or whose sum or mean is zero, are dropped.

Mean metric values

You can see that some article-level metrics are on average larger than others, with, for example, counter_html (number of html page views) much higher than twitter. Of the social media data sources, facebook has higher mean values than twitter. NB: this is a partial list of metrics.


Overview of some altmetrics variables through time (mean value across articles for each month before plotting)

Through time, metrics show different patterns. The DataCite citations source was only brought online in early 2014. Twitter has seen a steady increase, while Mendeley has been up and down but is trending upwards. NB: each panel has a different y-axis, while the x-axes are the same.



Distribution of the same subset of altmetrics variables (mean taken across dates for each article before plotting)

These panels show a few different things. First, some data sources have much more data than others (e.g., compare Crossref to DataCite). Second, some data sources have a tight grouping of values around the mean (e.g., counter_html), while others have long tails (e.g., Twitter). Note that panels have different y- and x-axes – the y-axis is log base 10 number of articles, and the x-axis is the metric value.


Some patterns

As you’d expect, some metrics are correlated, and some are not so related. If two metrics are tightly correlated, we can possibly predict metric A simply by knowing metric B. For seven metrics, how are they related to one another? The plot below shows the relationships among counter (html+pdf views), Mendeley readers, Crossref citations, Facebook likes, Twitter, Reddit shares, and Wikipedia mentions. Counter (html+pdf views) shows almost no correlation with Mendeley readers, while Crossref and Twitter seem to have a stronger negative relationship to one another. These kinds of academic exercises are useful and have been done before (e.g., Priem et al. 2011 [1], Eysenbach 2011 [2], Costas et al. 2014 [3]), but for our purposes we are more interested in the extent to which we can take advantage of the correlations among metrics. That is, if we think there may be some kind of gaming going on with metric X, and we cannot predict when that will happen, but we know there is some combination of two other metrics that correlates with (and thus predicts) metric X, we can possibly take advantage of those relationships. Unfortunately, I didn’t get any definitive answers with respect to this use case, partly due to the difficulty of detecting potential gaming.



Alerts

The Lagotto application collects and provides article-level metrics data for scholarly articles. As part of the data integrity process, Lagotto emits various alerts that help determine what may be going wrong with the application, with the data sources used in Lagotto, or with users requesting data from the Lagotto application. Analyzing these alerts helps to determine which errors are the most common, and what may lie behind them.

I’ve been working on an R client, called alm, for working with Lagotto application data. This R client can also interact with alerts data from Lagotto. Python and Ruby clients are also in the works. NB: accessing alerts data requires an extra level of permissions.

As other publishers are starting to use Lagotto, the below is a discussion mostly of PLOS data, but includes some discussion of other publishers.

Alerts data can be used for many things. One potential use is discovering potential gaming activities. For example, the EventCountIncreasingTooFastError alert flags articles for which an event count (e.g., Facebook likes) is increasing faster than some cutoff. These articles can then be investigated to determine whether the event counts are justifiable or not.

Another use falls under the broader arena of system operations. Alerts like Net::HTTPUnauthorized are not useful for the purpose of detecting potential gaming, but are useful for the publisher using the Lagotto application. Alerts can help determine if there is a problem with one of the data sources, and why the error occurred.

How to interpret alerts (partial list)

Alert class name                    Description
Net::HTTPUnauthorized               401 – authorization likely missing
Net::HTTPRequestTimeOut             408 – request timeout
Net::HTTPConflict                   409 – document update conflict
Net::HTTPServiceUnavailable         503 – service is down
Faraday::ResourceNotFound           404 – resource not found
ActiveRecord::RecordInvalid         title is usually blank, and can’t be
EventCountDecreasingError           event count decreased too fast, check on it
EventCountIncreasingTooFastError    event count increasing too fast, check on it
ApiResponseTooSlowError             alert if successful API responses took too long
HtmlRatioTooHighError               HTML/PDF ratio higher than 50
ArticleNotUpdatedError              alert if articles have not been updated within X days
CitationMilestoneAlert              alert if an article has been cited the specified number of times


PLOS currently has 142,136 articles available in its Lagotto instance as of 2015-01-06. Most of the alerts on PLOS articles in the last few weeks are WorkNotUpdatedError alerts – meaning works have not been updated recently. Most alerts have to do with system operations problems, not potential gaming problems.

[Figure: number of PLOS alerts per alert class name]

An interesting one is EventCountDecreasingError, which had 69 results. Let’s dig in further by searching for EventCountDecreasingError alerts, paying particular attention to a single source (Web of Science, or wos) to simplify things.

library("alm")     # rOpenSci client for the Lagotto API
library("dplyr")

alm_alerts(class_name = "EventCountDecreasingError",
           per_page = 10)$data %>%
  select(message, work, source)
message                  work                            source
decreased from 2 to 1    10.1371/journal.pone.0080825    pmceurope
decreased from 1 to 0    10.1371/journal.pone.0101947    pmceurope
decreased from 1 to 0    10.1371/journal.pone.0104703    pmceurope
decreased from 3 to 0    10.1371/journal.ppat.1002565    pmceuropedata
decreased from 2 to 1    10.1371/journal.pone.0034257    wos
decreased from 81 to 0   10.1371/journal.ppat.0010001    wos
decreased from 9 to 0    10.1371/journal.ppat.0010003    wos
decreased from 13 to 0   10.1371/journal.ppat.0010007    wos
decreased from 37 to 0   10.1371/journal.ppat.0010008    wos
decreased from 21 to 0   10.1371/journal.ppat.0010009    wos

One of the highest offenders is the article 10.1371/journal.ppat.0010001, for which Web of Science counts decreased from 81 to 0. Indeed, requesting metrics data for this DOI gives 0 for wos:

alm_ids("10.1371/journal.ppat.0010001")$data %>% 
   filter(.id == "wos")

This particular error can indicate something wrong with the Lagotto instance from which the data is collected and provided, but it might instead originate from the data source for any number of reasons.

Public Knowledge Project (PKP)

PKP has a growing collection of articles, with 158,368 as of 2015-01-06. An analysis on 2015-01-06 reveals that most of the alerts are Net::HTTPForbidden and Net::HTTPClientError. Again, as with PLOS, the message with alerts data is that most errors have to do with servers not responding at all, or too slowly, or some other technical error – but not an issue of potential gaming.

[Figure: number of PKP alerts per alert class name]

My work on altmetrics data quality has all been done with the goal of others reproducing and building on it. Code is in the articlemetrics/data-quality GitHub repository and includes scripts for reproducing this blog post, as well as other analyses I’ve done on the monthly data files and the alerts data. I’ve used the R packrat library with this project – if you are familiar with it, you can use it. Other options are cloning the repo if you are a git user, or downloading a zip file from the repo landing page. Reproducing the alerts scripts may not be possible for most people because you need higher-level permissions for those data.


If you are an altmetrics data provider, a simple way to start is visualizing your altmetrics data: How are different metrics distributed? How are metrics related to one another? Do they show any weird patterns that might suggest something is wrong? If you are at least a little experienced in R, I recommend trying the scripts and workflows I’ve provided in the articlemetrics/data-quality repo – if you try them, let me know if you have any questions. If you don’t like R, but know Python or Ruby, we are working on Lagotto clients for both languages at articlemetrics. In addition, alerts data is very useful in examining server-side errors in the collection or provision of altmetrics data. See the alerts folder in the articlemetrics/data-quality repo.

If you are an altmetrics researcher/observer, you are likely more interested in collecting altmetrics data. I’ve done data manipulations with the dplyr R package, which is great at handling big data and makes data manipulation quite easy. Starting with the scripts I’ve written and extending those should be an easy way to start collecting your altmetrics data programmatically.

From my experience doing this work I think one thing badly needed is easy access to log data for article access. A potentially big source of gaming could be a bot (or perhaps an actual human) hitting a web page over and over to increase metrics for an article. This data isn’t collected in the Lagotto application, but would more likely be in the server logs of the publisher’s web application. Perhaps publishers could use a tool like Logstash to collect and more easily inspect log data when potential gaming activity is found.

[1] Priem, Jason, Heather Piwowar, and B. Hemminger. “Altmetrics in the wild: An exploratory study of impact metrics based on social media.” arXiv preprint arXiv:1203.4745. http://arxiv.org/abs/1203.4745

[2]Eysenbach G. Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact. Federer A, ed. Journal of Medical Internet Research 2011;13(4):e123. doi:10.2196/jmir.2012. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278109/

[3]Costas, Rodrigo, Zohreh Zahedi, and Paul Wouters. “Do altmetrics correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective.” arXiv preprint arXiv:1401.4321 (2014). http://arxiv.org/abs/1401.4321


Under the hood at PLOS

You may have noticed some changes to various pages on the PLOS journal websites over the last few months. Last April we launched new mobile-optimized sites for all of our journals, and in July we released a major update to the homepages of some of our journals.

These projects are part of a larger effort to radically improve Ambra, our open-source publishing platform, and as of today we are serving all article pages from the new Ambra architecture.

If you don’t notice anything different, that’s because most of the action is happening under the hood. There are a few visible changes – like a new ALM signpost design and improved display and feedback options for subject area terms – but our overarching goal is to create a strong foundation for further feature development and project integrations.

While we’ve worked hard to hide all of the scaffolding and jackhammers from our readers, we ask for your patience while we work through such a huge undertaking. There are still a few rough edges, but please let us know if anything looks severely broken.


Make Data Rain

Credit: http://cdns2.freepik.com/free-photo/digital-data-raining-cloud-vector_21-97952666.jpg

Last October, UC3,  PLOS, and DataONE launched Making Data Count, a collaboration to develop data-level metrics (DLMs). This 12-month National Science Foundation-funded project will pilot a suite of metrics to track and measure data use that can be shared with funders, tenure & promotion committees, and other stakeholders.

To understand how DLMs might work best for researchers, we conducted an online survey and held a number of focus groups, which culminated on a very (very) rainy night last December in a discussion at the PLOS offices with researchers in town for the 2014 American Geophysical Union Fall Meeting.

Six eminent, senior researchers participated.

Much of the conversation concerned how to motivate researchers to share data. Sources of external pressure that came up included publishers, funders, and peers. Publishers can require (as PLOS does) that, at a minimum, the data underlying every figure be available. Funders might refuse to ‘count’ publications based on unavailable data, and refuse to renew funding for projects that don’t release data promptly. Finally, other researchers– in some communities, at least– are already disinclined to work with colleagues who won’t share data.

However, Making Data Count is particularly concerned with the inverse – not punishing researchers who don’t share, but rewarding those who do. For a researcher, metrics demonstrating data use serve not only to prove to others that their data is valuable, but also to affirm for themselves that taking the time to share their data is worthwhile. The researchers present regarded altmetrics with suspicion and overwhelmingly affirmed that citations are the preferred currency of scholarly prestige.

Many of the technical difficulties with data citation (e.g., citing dynamic data or a particular subset) came up in the course of the conversation. One interesting point was raised by many: when citing a data subset, the needs of reproducibility and credit diverge. For reproducibility, you need to know exactly what data has been used – at a maximum level of granularity. But credit is about resolving to a single product that the researcher gets credit for, regardless of how much of the dataset or what version of it was used – so less granular is better.

We would like to thank everyone who attended any of the focus groups. If you have ideas about how to measure data use, please let us know in the comments!

Cross-posted from CDL DataLib blog.



