‘Open Source, Open Science’ meeting report – March 2015

Image credit: Olaus Linn, Noun Project

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.

Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking, including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges arising from the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to address those coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, emphasizing reasons aligned with each stakeholder's interests. We also discussed industry interest in "open", both in making good use of open data and in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers answering the question "Why go open?" for researchers.

Many groups are providing training on tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training "decision tree" defining potential entry points and next steps for researchers. For example, the Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing, bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called "glue" – the technical infrastructure that is only noticed when it fails to function. One such infrastructure collaboration is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential: training provides the knowledge to enable change, and infrastructure provides the tools to enact it.

Another repeated theme was the “open science bubble”. Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution – rather than “post preprints!”, say “get more citations!” Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as “open scientists” per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services along with an article on best practices for scientific APIs, and an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for "next meetings", such as hackathons, no decision has yet been reached. But a more important decision was clear – the open science community is emerging, strong, and ready to work in concert to help daily scientific practice live up to core scientific values.

People

Tal Yarkoni, University of Texas at Austin

Kara Woo, NCEAS

Andrew Updegrove, Gesmer Updegrove and ConsortiumInfo.org

Kaitlin Thaney, Mozilla Science Lab

Jeffrey Spies, Center for Open Science

Courtney Soderberg, Center for Open Science

Elliott Shore, Association of Research Libraries

Andrew Sallans, Center for Open Science

Karthik Ram, rOpenSci and Berkeley Institute for Data Science

Min Ragan-Kelley, IPython and UC Berkeley

Brian Nosek, Center for Open Science and University of Virginia

Erin C. McKiernan, Wilfrid Laurier University

Jennifer Lin, PLOS

Amye Kenall, BioMed Central

Mark Hahnel, figshare

C. Titus Brown, UC Davis

Sara D. Bowman, Center for Open Science


ALM Reports – more features, more power

We are pleased to share the newest features just launched in ALM Reports (http://almreports.plos.org/), which will better support the time-consuming and laborious work of tracking and reporting the reach of PLOS articles.

As before, you can get a broad view of the latest activity for any collection of PLOS publications via keyword searches, individual DOIs/PMIDs, or bulk upload of DOIs/PMIDs. To enhance the search capabilities, we have introduced faceted search: users can run a broad search across the entire corpus and then winnow down the results by publication year, article type, and journal. Users can dive deep into one slice of the results, peruse them, and come back up to target an adjacent slice within the master search. Additionally, sorting based on ALM makes it easy to surface popular articles.

[Figure: faceted search filters in ALM Reports]

These faceted filters will make direct search far more usable and boost the exploratory power of ALM in literature discovery.

ALM Reports also contains three new enhanced visualizations, built with the popular D3.js visualization library. The first displays article usage over time and allows users to select an additional source to add another dimension to the graph. Interesting trends may appear as you explore deeper into each paper's profile across the various types of article engagement. Hover over each of the bubbles for additional article metadata (article title, journal, publication age, etc.).

[Figure: article usage over time]

The second graph provides a sunburst display of article usage by subject area.   Size correlates with total views from 85 articles.  Color intensity correlates with Scopus citation count.  Clicking on a subject area will zoom into its component subject areas. Clicking the center of the circle will zoom out to a more general view. Explore the activity of the article collection across subject areas by diving into the PLOS thesaurus of taxonomy terms.

[Figure: sunburst of article usage by subject area]

The third visualization displays a geolocation map of authors (based on affiliation) for all the articles in the report.  Users can hover over each location on the map to display more detail such as institution, department, etc.  This can provide a potent view of where collaborators across multiple projects are based.

[Figure: geolocation map of author affiliations]

Finally, we have introduced a feature which we’ll be building out in future releases: user accounts.  We have integrated ALM Reports into PLOS journal user profiles.  Once logged in, ALM reports are automatically saved and available on the home page.

[Figure: saved reports on the home page]

In this initial release, reports are titled by system number. In future releases, we will improve report naming to make report management more accessible. Starting now, though, users can get the latest ALM for any collection of papers by jumping directly to the report from the home page.

As discussed in a previous blog post, ALM Reports is an open source tool, which can be easily customized to deliver reports for any set of ALM data. We invite the community to use this tool in creative ways and consider how to integrate this new output into their research evaluation activities.

A big thanks to the lead developer of this major release, Jure Triglav, as well as Martin Fenner, who managed this effort. We welcome your thoughts and comments on the tool at alm[at]plos.org.


An article phenome project, and reflections on the Music Genome Project®


I have an ongoing fascination with the distribution of data, no matter what form it takes, around our small blue planet. Thinking back thirty or so years, this may have all begun with the Grundig shortwave radio we had in our home when I was a child. How could this little box which sounded so good be grabbing “content” from almost halfway around the world where another part of my family lived? All that was necessary to listen to communications in Europe was the rotation of a band selector which provided focus on a portion of the shortwave spectrum, plus a bit of fine tuning with another knob. And there you had it – very slightly aged radio waves were captured and amplified for our enjoyment. Languages I’d never heard, events in different countries discussed, new discoveries broadcast for the masses.

Things have changed since then to say the very least. Content of all sorts has become overwhelmingly present. And content discovery in and of itself is hard, let alone keeping a finger on the pulse of that which interests an individual. Going a step further, let’s place hypothetical bookends on an equation and say one desires to consistently listen to a few portions of the broader spectrum of published, openly accessible scientific research. What is the best way to go about doing so? That now commonly asked question prompted many thoughts presented throughout this post.

Ultimately, while pondering the basis of how to enable selective consumption of content in a well-delineated system, this twofold question arises in my mind: what would an article “phenome” project entail, and how would it benefit us all?

Focus first on the nature of an individual article. What inherent traits contribute to its characterization? At PLOS, a thesaurus of thousands of Subject Area terms exists. Based on the presence of those terms or related concepts in the text of a given article, as well as their proximity to one another, the article itself will be associated with a Subject Area. The rules around association of terms with Subject Areas evolve over time. Through machine-aided indexing, those very human rules gate how applicable a Subject Area is to an article, assigning a weight factor along the way. A higher weight factor for a Subject Area indicates an associated term or concept appears more often within an article’s text. The PLOS taxonomy captures these article traits across the entire corpus.

Though it may be a bit tempting to consider a Subject Area akin to a genetic trait, such a comparison breaks down when we adhere to a strict interpretation of a genome as a finite framework from which a trait is mapped forward. Indeed, it necessitates an implausible analogy in which every article would be a species unto itself, never to be replicated word-for-word, which then inherits its genes from a proto-language of sorts.

For our purposes, a Subject Area might better be described as an atomic unit of article meaning – analogous to a phenotypic trait being an atomic unit of the expression of the underlying organism’s genome.

If so, how will these atomic units of an article ultimately interact with their environment to create a longstanding entity with unique characteristics? In other words, how will its environment affect such manifestation and result in the article’s phenotype? Can phenotypic patterns be made use of for the purposes of article discovery?

Beyond Subject Areas, consider characteristics such as readability and intrigue in the eyes of one audience versus the next. A sum total of parts – terms, figures, subsections, paragraph structure – plus the very nature of the article’s theme, all play a great role in how an article is able to express itself to its readers (one variety of phenotypic pattern). As a simple example, an article with great appeal to a broad audience per its topic and readability will likely be cited and referred to in numerous publications – spanning from the New York Times to Nature, Gizmodo and National Geographic Russia. Take for instance, “Sliding Rocks on Racetrack Playa, Death Valley National Park: First Observation of Rocks in Motion” (DOI: 10.1371/journal.pone.0105948). Collective data, as gathered through Article-Level Metrics can help us identify some of these patterns. And subsequent association of this data with an article as high-level metadata could begin to allow for methods of customizable content discovery and sharing.

Let’s step back for a minute and think about those same two elements in the context of the music industry – more specifically, in terms of Pandora’s Music Genome Project®. Although some have criticized Pandora for the name of the Music Genome Project (here’s an example), I’m squarely in the crowd that appreciates the listener end-benefits which come with definition of the 450 musical characteristics that Pandora’s project institutes – namely a way to navigate my own musical undiscovered country. As with our taxonomy, a human element is present in the systematic association of characteristics with every song in their catalog. And the platform Pandora has built to curate in-house, distribute, share and tune stations paved the way for today’s streaming services and their business models.

In turn, drawing a series of simple parallels between articles and individual works of music, collections (or channels) and stations becomes straightforward. Building on the individual characteristics of each, elevating those characteristics along with metrics, and the creation of customizable paths of distribution could very well push the open access model another great step forward.

If you had to further define aspects of an article phenome project for article discovery and distribution, where would you begin? How would you define patterns that would enable you to locate the research which interests you most? What metadata might aid in the definition of these? And how would you want to share the fruits of your labor?

I for one would want a socially-enabled ecosystem in which articles could thrive, and influence their own distribution through the expression of their phenotypes – while readers benefitted from accelerated article discovery.

For in the years to come, humans will birth articles as part of scientific research. And through their discovery, the cycle of life begins anew.

 


Summary of ALM data quality work

Article-level metrics data, like any other data, is subject to errors, inconsistencies, and bias. Data integrity is the foundation of trust for these metrics. Making sure the data is correct is a difficult task, as collecting and providing article-level metrics involves not only the party collecting the data (here, PLOS), but also the sources (Crossref, Twitter) that provide it. To this end, we must strive to account for all possible sources of problems with the data, and fix them whenever possible.

I’ve been doing some data quality work on article-level metrics data with PLOS. The goal was to explore data quality issues with article-level metrics data, using specifically the data collected on PLOS articles.

There were two sets of data:

  • monthly reports – these are basically spreadsheet dumps of summary metrics for each data source for every article.
  • alerts data – the Lagotto application has an alert system that produces alerts for many kinds of events; I'll explain some below.

I’ll do a high level summary of the findings from each set of data.

Monthly reports

A file is created at the end of each month. It holds all DOIs published up to that month, together with their article-level metrics. These files make it easy to analyze the data for one particular month, or they can be combined to get a picture of PLOS altmetrics over many months (as I do below). You can find these files on figshare.
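As a minimal sketch of how one of these files might be read and summarized in R, assuming a local copy of a monthly dump (the file name and column names here are hypothetical; check the header of the file you actually download):

monthly <- read.csv("alm_report_2014-03.csv", stringsAsFactors = FALSE)

# mean value of a few metrics across all DOIs
# (column names are assumptions; compare against names(monthly))
metrics <- c("counter_html", "counter_pdf", "mendeley", "crossref", "facebook", "twitter")
colMeans(monthly[, metrics], na.rm = TRUE)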

Coverage

The monthly data covers:

  • 16 months
  • Data from: 2012 (07, 08, 09, 10, 11, 12), 2013 (01, 04, 05, 08), 2014 (01, 03, 06, 07, 08, 09)
  • 128986 DOIs
  • 43 article-level metrics variables

Summary statistics

The plots below use the data from the 2014-03 file, across all DOIs. Sources that have no data, or whose sum or mean is zero, are dropped.

Mean metric values

You can see that some article-level metrics are on average larger than others, with, for example, counter_html (number of html page views) much higher than twitter. Of the social media data sources, facebook has higher mean values than twitter. NB: this is a partial list of metrics.

[Figure: mean metric values by source]

Overview of some altmetrics variables through time (mean value across articles for each month before plotting)

Through time, metrics show different patterns. The Datacite citations source was only brought online in early 2014. Twitter has seen a steady increase, while Mendeley has been up and down but trending upwards. NB: each panel has a different y-axis, while the x-axes are the same.

 

[Figure: mean metric values per month for selected sources]
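A sketch of how per-month means like these could be assembled and plotted, assuming the monthly files sit in a local directory and share the hypothetical column names used in the sketch above:

library(ggplot2)

files <- list.files("monthly_reports", pattern = "\\.csv$", full.names = TRUE)

per_month <- do.call(rbind, lapply(files, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  data.frame(month = sub("\\.csv$", "", basename(f)),
             source = metrics,   # reuse the metrics vector from the sketch above
             mean_value = colMeans(d[, metrics], na.rm = TRUE))
}))

# one panel per source, free y-axes as in the figure above
ggplot(per_month, aes(month, mean_value, group = source)) +
  geom_line() +
  facet_wrap(~ source, scales = "free_y")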

Distribution of the same subset of altmetrics variables (mean taken across dates for each article before plotting)

These panels show a few different things. First, some data sources have much more data than others (e.g., compare Crossref to Datacite). Second, some data sources have a tight grouping of values around the mean (e.g., counter_html), while others have long tails (e.g., Twitter). Note that panels have different y- and x-axes – the y-axis is log base 10 number of articles, and the x-axis is the metric value.

[Figure: distributions of metric values by source]

Some patterns

As you'd expect, some metrics are correlated and some are not. If two metrics are tightly correlated, we can possibly predict metric A simply by knowing metric B. For seven metrics, how are they related to one another? The plot below shows the relationships among counter (html+pdf views), Mendeley readers, Crossref citations, Facebook likes, Twitter, Reddit shares, and Wikipedia mentions. Counter (html+pdf views) is largely uncorrelated with Mendeley readers, while Crossref and Twitter do seem to have a stronger negative relationship to one another. These kinds of academic exercises are useful and have been done (e.g., Priem et al. 2011[1], Eysenbach et al. 2011[2], Costas et al. 2014[3]), but for our purposes we are more interested in the extent to which we can take advantage of the correlations among metrics. That is, if we think there may be some kind of gaming going on with metric X and we cannot predict when that will happen, but we know there is some combination of two other metrics that correlates with or predicts metric X, we can possibly take advantage of those relationships. Unfortunately, I didn't get any definitive answers with respect to this use case, partly due to the difficulty of detecting potential gaming.

[Figure: pairwise relationships among selected metrics (ggpairs)]
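A pairwise view like the one above can be approximated with a plain correlation matrix or with GGally's ggpairs(); this sketch reuses the hypothetical monthly data frame and column names from the earlier sketches:

library(GGally)

pair_cols <- c("counter_html", "mendeley", "crossref", "facebook", "twitter", "wikipedia")

# correlation matrix, using pairwise complete observations
round(cor(monthly[, pair_cols], use = "pairwise.complete.obs"), 2)

# scatterplot matrix on a log scale to tame the long-tailed sources
ggpairs(log10(monthly[, pair_cols] + 1))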

Alerts

The Lagotto application collects and provides article-level metrics data for scholarly articles. As part of the data integrity process, Lagotto raises various alerts that help determine what may be going wrong with the application, with the data sources it uses, and with users' requests for data from the Lagotto API. Analyzing these alerts helps to determine which errors are the most common, and what may lie behind them.

I've been working on an R client, called alm, for Lagotto application data. This client can also interact with the alerts data from Lagotto. Python and Ruby clients are also in the works. NB: accessing alerts data requires an extra level of permissions.
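A minimal example of what working with the client looks like, using a DOI that appears later in this post (depending on the Lagotto instance, an API key may need to be configured first):

library(alm)

# summary metrics for one PLOS article
alm_ids("10.1371/journal.ppat.0010001")

# alerts require elevated permissions on the Lagotto instance
# alm_alerts(per_page = 10)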

As other publishers are starting to use Lagotto, the discussion below focuses mostly on PLOS data, but touches on other publishers as well.

Alerts data can be used for many things. One potential use is discovering potential gaming activity. For example, the EventCountIncreasingTooFastError alert flags articles in which an event count (e.g., Facebook likes) is increasing faster than some cutoff. These articles can then be investigated to see whether the event counts are justifiable or not.

Another use falls under the broader arena of system operations. Alerts like Net::HTTPUnauthorized are not useful for the purpose of detecting potential gaming, but are useful for the publisher using the Lagotto application. Alerts can help determine if there is a problem with one of the data sources, and why the error occurred.

How to interpret alerts (partial list)

Alert class name – Description
Net::HTTPUnauthorized – 401, authorization likely missing
Net::HTTPRequestTimeOut – 408, request timeout
Net::HTTPConflict – 409, document update conflict
Net::HTTPServiceUnavailable – 503, service is down
Faraday::ResourceNotFound – 404, resource not found
ActiveRecord::RecordInvalid – title is usually blank, and can't be
EventCountDecreasingError – event count decreased, check on it
EventCountIncreasingTooFastError – event count increasing too fast, check on it
ApiResponseTooSlowError – successful API responses took too long
HtmlRatioTooHighError – HTML/PDF ratio higher than 50
ArticleNotUpdatedError – articles have not been updated within X days
CitationMilestoneAlert – an article has been cited the specified number of times

PLOS

PLOS currently has 142,136 articles available in their Lagotto instance as of 2015-01-06. Most of the alerts for PLOS articles in the last few weeks are WorkNotUpdatedError alerts, meaning the works have not been updated recently. Most alerts have to do with system operations problems, not potential gaming problems.

Class Name                      N
WorkNotUpdatedError             17653
Net::HTTPServiceUnavailable     860
Net::HTTPNotFound               128
EventCountDecreasingError       69
Net::HTTPRequestTimeOut         15
Delayed::WorkerTimeout          12
NoMethodError                   9
Faraday::ClientError            2
Net::HTTPInternalServerError    2

An interesting one is EventCountDecreasingError, which had 69 results. Let's dig in further by listing the first few of these alerts along with the source each one came from.

# list the first few decreasing-count alerts and the source each came from
alm_alerts(class_name = "EventCountDecreasingError",
           per_page = 10)$data %>%
    select(message, work, source)
message                   work                            source
decreased from 2 to 1     10.1371/journal.pone.0080825    pmceurope
decreased from 1 to 0     10.1371/journal.pone.0101947    pmceurope
decreased from 1 to 0     10.1371/journal.pone.0104703    pmceurope
decreased from 3 to 0     10.1371/journal.ppat.1002565    pmceuropedata
decreased from 2 to 1     10.1371/journal.pone.0034257    wos
decreased from 81 to 0    10.1371/journal.ppat.0010001    wos
decreased from 9 to 0     10.1371/journal.ppat.0010003    wos
decreased from 13 to 0    10.1371/journal.ppat.0010007    wos
decreased from 37 to 0    10.1371/journal.ppat.0010008    wos
decreased from 21 to 0    10.1371/journal.ppat.0010009    wos

One of the highest offenders is the article 10.1371/journal.ppat.0010001, whose Web of Science count decreased from 81 to 0. Indeed, requesting metrics data for this DOI gives 0 for wos:

# metrics for this DOI, keeping only the Web of Science row
alm_ids("10.1371/journal.ppat.0010001")$data %>%
   filter(.id == "wos")
.id    pdf    html    readers    comments    likes    total
wos     NA      NA         NA          NA       NA        0

This particular error can indicate something wrong with the Lagotto instance from which the data is collected and provided, but it might instead originate from the data source for any number of reasons.

Public Knowledge Project (PKP)

PKP has a growing collection of articles, with 158,368 as of 2015-01-06. An analysis on 2015-01-06 reveals that most of the alerts are Net::HTTPForbidden and Net::HTTPClientError. Again, as with PLOS, the takeaway from the alerts data is that most errors have to do with servers not responding at all, responding too slowly, or some other technical problem – not with potential gaming.

Class Name                      N
Net::HTTPForbidden              23474
Net::HTTPClientError            13266
Net::HTTPServiceUnavailable     5281
DelayedJobError                 4916
Faraday::ResourceNotFound       1565
Net::HTTPInternalServerError    1032
StandardError                   351
Net::HTTPRequestTimeOut         275
Net::HTTPNotAcceptable          201
TooManyErrorsBySourceError      143

Reproducibility

My work on altmetrics data quality has all been done with the goal of others reproducing and building on it. Code is in the articlemetrics/data-quality GitHub repository, and includes scripts for reproducing this blog post and the other analyses I've done on the monthly data files as well as the alerts data. I've used the R packrat library with this project; if you are familiar with it, you can use it to restore the project's dependencies. Other options are cloning the repo if you are a git user, or downloading a zip file from the repo landing page. Reproducing the alerts analyses may not be possible for most people, because you need higher-level permissions for those data.

Conclusion

If you are an altmetrics data provider, a simple way to start is visualizing your altmetrics data: How are different metrics distributed? How are metrics related to one another? Do they show any weird patterns that might suggest something is wrong? If you are at least a little experienced in R, I recommend trying the scripts and workflows I've provided in the articlemetrics/data-quality repo – if you try them, let me know if you have any questions. If you don't like R but know Python or Ruby, we are working on Lagotto clients for both languages at articlemetrics. In addition, alerts data is very useful for examining server-side errors in the collection or provision of altmetrics data. See the alerts folder in the articlemetrics/data-quality repo.

If you are an altmetrics researcher or observer, you are likely more interested in collecting altmetrics data. I've done the data manipulations with the dplyr R package, which handles large data well and makes data manipulation quite easy. Starting with the scripts I've written and extending them should be an easy way to start collecting your altmetrics data programmatically.

From my experience doing this work I think one thing badly needed is easy access to log data for article access. A potentially big source of gaming could be a bot (or perhaps an actual human) hitting a web page over and over to increase metrics for an article. This data isn’t collected in the Lagotto application, but would more likely be in the server logs of the publisher’s web application. Perhaps publishers could use a tool like Logstash to collect and more easily inspect log data when potential gaming activity is found.
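As a purely hypothetical sketch of the kind of check that log access would enable (the file name, columns, and threshold below are all made up; real access logs would first need to be parsed into this shape):

library(dplyr)

# assumed pre-parsed log: one row per request, with client IP, article DOI, and date
hits <- read.csv("parsed_access_log.csv", stringsAsFactors = FALSE)

# flag client/article pairs with an implausible number of views in a single day
suspicious <- hits %>%
  count(ip, doi, date) %>%
  filter(n > 100)   # arbitrary threshold, for illustration only

suspicious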


[1] Priem, Jason, Heather Piwowar, and B. Hemminger. "Altmetrics in the wild: An exploratory study of impact metrics based on social media." (2011). http://arxiv.org/abs/1203.4745

[2] Eysenbach, Gunther. "Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact." Journal of Medical Internet Research 2011;13(4):e123. doi:10.2196/jmir.2012. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278109/

[3] Costas, Rodrigo, Zohreh Zahedi, and Paul Wouters. "Do altmetrics correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective." arXiv preprint arXiv:1401.4321 (2014). http://arxiv.org/abs/1401.4321


Under the hood at PLOS

You may have noticed some changes to various pages on the PLOS journal websites over the last few months. Last April we launched new mobile-optimized sites for all of our journals, and in July we released a major update to the homepages of some of our journals.

These projects are part of a larger effort to radically improve Ambra, our open-source publishing platform, and as of today we are serving all article pages from the new Ambra architecture.

If you don't notice anything different, that's because most of the action is happening under the hood. There are a few visible changes – like a new ALM signpost design and improved display and feedback options for subject area terms – but our overarching goal is to create a strong foundation for further feature development and project integrations.

While we’ve worked hard to hide all of the scaffolding and jackhammers from our readers, we ask for your patience while we work through such a huge undertaking. There are still a few rough edges, but please let us know if anything looks severely broken.


Make Data Rain

Image credit: http://cdns2.freepik.com/free-photo/digital-data-raining-cloud-vector_21-97952666.jpg

Last October, UC3, PLOS, and DataONE launched Making Data Count, a collaboration to develop data-level metrics (DLMs). This 12-month National Science Foundation-funded project will pilot a suite of metrics to track and measure data use that can be shared with funders, tenure and promotion committees, and other stakeholders.

To understand how DLMs might work best for researchers, we conducted an online survey and held a number of focus groups, which culminated on a very (very) rainy night last December in a discussion at the PLOS offices with researchers in town for the 2014 American Geophysical Union Fall Meeting.

Six eminent, senior researchers participated.

Much of the conversation concerned how to motivate researchers to share data. Sources of external pressure that came up included publishers, funders, and peers. Publishers can require (as PLOS does) that, at a minimum, the data underlying every figure be available. Funders might refuse to ‘count’ publications based on unavailable data, and refuse to renew funding for projects that don’t release data promptly. Finally, other researchers– in some communities, at least– are already disinclined to work with colleagues who won’t share data.

However, Making Data Count is particularly concerned with the inverse – not punishing researchers who don't share, but rewarding those who do. For a researcher, metrics demonstrating data use serve not only to prove to others that their data is valuable, but also to affirm for themselves that taking the time to share their data is worthwhile. The researchers present regarded altmetrics with suspicion and overwhelmingly affirmed that citations are the preferred currency of scholarly prestige.

Many of the technical difficulties with data citation (e.g., citing dynamic data or a particular subset) came up in the course of the conversation. One interesting point was raised by many: when citing a data subset, the needs of reproducibility and credit diverge. For reproducibility, you need to know exactly what data has been used – at a maximum level of granularity. But credit is about resolving to a single product that the researcher gets credit for, regardless of how much of the dataset or what version of it was used – so less granular is better.

We would like to thank everyone who attended any of the focus groups. If you have ideas about how to measure data use, please let us know in the comments!

Cross-posted from CDL DataLib blog.

 

 

 


Visualize PLOS ALM with Google Charts and Fusion Tables

Learn to filter, sort, and visualize Article Level Metrics with Google Fusion Tables and Google Charts

All scientific journal articles published by the Public Library of Science (PLoS) have Article Level Metrics (ALM) collected about them. These metrics include citation counts, web views, online bookmarks, and more. The metrics are available through a variety of sources including a bulk CSV file.

The CSV data can be uploaded to Google Fusion Tables, a free data hosting service that allows web-based browsing, filtering, and aggregation. The Fusion Tables API supports SQL-like queries and integrates tightly with Google Charts allowing visualization of results in a web browser using Javascript.

The advantage of Javascript compared to Flash or static images is that it works in any web browser including mobile devices, and allows interaction with the data. This makes the visualization easy to share with a large online audience. Javascript is also easy to update if you want to change the query. Visitors will see the new chart version as soon as the script is altered. This requires fewer manual steps than exporting a new image and uploading it.

Google Fusion Tables

The article metrics CSV file can be uploaded directly to Fusion Tables and browsed online. The web interface allows sorting, filtering, and other operations.

[Figure: browsing the data in Fusion Tables after uploading the CSV]

For example, we can filter by publication_date to show only articles from 2013, and sort by the most CrossRef citations.

[Figure: Fusion Tables web interface showing a filtered, sorted result – 2013 articles sorted by CrossRef citations]

Fusion Tables API

The query options from the web interface can also be used through an API. The syntax is similar to an SQL query. To see the top 10 CrossRef cited papers from 2013:

SELECT title, crossref, scopus, pubmed FROM 1zkfQ7rtG9UI5a8rPDk2bpD6d0QbgP63h2v2l9YzW WHERE publication_date >= '2013-01-01' AND publication_date < '2014-01-01' ORDER BY crossref desc LIMIT 10

The query commands closely match SQL:

  • SELECT Pick the columns to include in the results.
  • FROM Use the unique identifier for each data table which can be found in File > About this table or in the docid= part of the table URL.
  • WHERE Filter the results by the value of a column. Dates within a certain range can be filtered by using inequality operators, combined with AND for multiple filters. Use >= to include only articles published on or after 2013-01-01 and < to include only articles published before 2014-01-01.
  • ORDER BY Sort the results using a selected column and in the chosen sort direction.
  • LIMIT Return only the specified number of results. This prevents loading too much data into a browser causing it to slow down, and also is useful for testing queries.

The results of a query can be downloaded as a CSV by appending the query to this URL: https://www.google.com/fusiontables/exporttable?query=
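For example, the query shown above could be pulled into R (the language used elsewhere on this blog) roughly like this, assuming the table is publicly downloadable:

query <- "SELECT title, crossref, scopus, pubmed FROM 1zkfQ7rtG9UI5a8rPDk2bpD6d0QbgP63h2v2l9YzW WHERE publication_date >= '2013-01-01' AND publication_date < '2014-01-01' ORDER BY crossref DESC LIMIT 10"
export_url <- paste0("https://www.google.com/fusiontables/exporttable?query=",
                     URLencode(query, reserved = TRUE))
top2013 <- read.csv(export_url, stringsAsFactors = FALSE)
head(top2013)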

To make testing queries during development easier, use the Hurl.it HTTP Request tool.

Google Charts

Now that we have a working query, we can display the results using Google Charts. Because of the tight integration between Fusion Tables and Charts, the query can be passed directly as a Javascript parameter.

Starting from this sample code, we can modify a few lines of the drawChart() function to visualize our data.

 1 google.visualization.drawChart({
 2         "containerId": "visualization_div",
 3         "dataSourceUrl": 'http://www.google.com/fusiontables/gvizdata?tq=',
 4         "query":"SELECT Year, Austria, Bulgaria, Denmark, Greece FROM 641716",
 5         "refreshInterval": 5,
 6         "chartType": "BarChart",
 7         "options": {
 8           "title":"Yearly Coffee Consumption by Country",
 9           "vAxis": {"title": "Year"},
10           "hAxis": {"title": "Cups"}
11         }
12       });
  • Line 4 is where the query value is set. Replace the sample query with our custom query.
  • Remove refreshInterval on line 5 because the source data is not being changed.
  • Instead of BarChart, change chartType to ColumnChart to select the type of chart.

Change these settings in the options area:

  • title A title to describe the chart.
  • vAxis Control settings for the vertical axis, such as title to show what the axis is measuring.

Modified Version

 1 google.visualization.drawChart({
 2         "containerId": "visualization_div",
 3         "dataSourceUrl": 'http://www.google.com/fusiontables/gvizdata?tq=',
 4         "query":"select title,crossref,scopus,pubmed from 1zkfQ7rtG9UI5a8rPDk2bpD6d0QbgP63h2v2l9YzW where publication_date >= '2013-01-01 00:00:00' and publication_date < '2014-01-01 00:00:00' order by crossref desc limit 10",
 5         "chartType": "ColumnChart",
 6         "options": {
 7           "title": "Top 10 Crossref Cited PLoS Articles from 2013",
 8           "vAxis": {"title": "Citations"},
 9         }
10       });

The chart outputs to a visualization_div that must exist in the HTML. Like any HTML element, it can be styled, for example with a desired height and width. Adding text such as "Loading…" inside the div helps prevent users from being confused by a blank space while the visualization loads.

<div id="visualization_div" style="width: 100%; height: 800px;">Loading...</div>

Results

The final chart can be embedded in a web page or displayed on its own.

The result is a query performed live on a dataset, with the results visualized in any web browser. We can change the query by editing a single file, and the new results will be displayed instantly. The query and visualization can even be edited live in a web browser. This is an easy way to experiment and to share visualizations with others.

The chart allows interaction because it uses Javascript. When the user hovers their mouse on a column, a label appears showing the full article title and an exact citation count. This provides important detail without taking additional screen space or overloading the user with information.

The chart shows the top CrossRef-cited articles for 2013 and compares their citation counts. All of these articles had as many or more Scopus citations than CrossRef citations, except for "The Effectiveness of Mobile-Health Technology-Based Health…", which received more CrossRef citations than Scopus citations. This difference may be due to article awareness, citation inclusion methods, or publication date. By displaying the data visually, the viewer can identify outliers for further investigation.

Next Steps

Fusion Tables lets anyone develop their own query to run on the PLOS ALM dataset. Such a query can use SQL-like operations and runs across the entire 100k+ rows of data. Calculations on the data such as SUM() and AVERAGE() can also be applied.

Google Charts also supports more user interaction, for example filtering by user selected year or journal. Extra formatting can be added to the chart, such as links to read the full articles. There are also a range of chart types available to display query results.

When PLOS releases updated versions of the CSV data, you can upload it to the Fusion Table, and the associated chart will reflect the new version of the data.


Lessons learned developing scholarly open source software

Earlier this week we released version 3.8 of the Lagotto open source software (migrating the software to version 4 of the Rails framework). Lagotto is software that tracks events (views, saves, discussions, and citations) around scholarly articles and provides these Article-Level Metrics for all journal articles published by PLOS. The project was started in 2008, and I have been the technical lead since May 2012. This is the 24th release since I became involved with the software, and I want to take the opportunity to go into more detail about some of the lessons learned along the way.

Milestones

We achieved some important Article-Level Metrics milestones in the past few weeks. In October we collected 500 million events around PLOS articles – an event can be anything from a pageview, tweet, or Facebook like to a mention on a Wikipedia page or a citation in another scholarly journal (data).

Status

Most of these events are usage stats in the form of HTML page views and PDF downloads – including usage stats from PubMed Central and usage stats for supplementary information files hosted by figshare. But we are tracking events from a diverse range of data sources (data with logarithmic scaling):

[Figure: events by data source (logarithmic scale)]

Another important milestone was reached this October: PLOS is no longer the largest user of the Lagotto software. We were surpassed by the Public Knowledge Project, which provides a hosted ALM service for journals running the open source OJS software, and by CrossRef Labs, which loaded into the software all 11 million DOIs registered with them since January 2011 (data; "unknown" means software versions before 3.6).

[Figure: Lagotto instances]

Three features implemented in the last three months made this possible:

  1. tight integration with the CrossRef REST API to easily import articles by date and/or publisher (loading the 11 million CrossRef DOIs took a little over 24 hours)
  2. many performance improvements, in particular for database calls
  3. support for publisher-specific configurations and display of data.

Scalability

Tracking events around scholarly articles is almost by definition a large-scale operation; collecting 500 million events and serving them via an API takes a lot of resources. Work on scalability has focussed on two areas: handling large numbers of articles and associated events, and making the API as fast as possible. Essential for any scalability work is good monitoring of performance, and we have mainly used two tools. rack-mini-profiler is a profiler for development and production Ruby applications, and gives detailed real-time feedback:

[Figure: rack-mini-profiler output]

To measure API performance we use Rails Notifications and track the duration of every incoming (from external data sources) and outgoing (to clients) API call. We filter and visualize the duration of the outgoing API calls in the admin dashboard using the D3 and crossfilter Javascript libraries:

[Figure: outgoing API call durations visualized with D3 and crossfilter]

One lesson learnt is that it is critical to understand the database calls in detail and not rely on an ORM (object-relational mapping) framework (ActiveRecord in the case of Rails) to always make the right decisions.

The other important lesson learnt is that caching is absolutely critical. Lagotto uses memcached to do some pretty complex caching. For API calls we use fragment caching, and in the admin dashboard we also use model caching for some slow queries. Fragment caching is a common pattern nicely supported by Rails and the rabl library we use for JSON rendering, although we ran into some edge cases because our API responses are a bit complex. Model caching is part of a custom solution where we cache all slow database queries in the admin dashboard and refresh the cache every 60 minutes via a cron job.

Support

Making it easy to install and run your software is the most important thing you can do if you want your open source project to be used by other people and organizations. This is a complex topic and includes several aspects: testing, error tracking, automation, documentation, and user support.

Testing

[Figure: Code Climate dashboard]

The overriding aim is obviously to ship bug-free software. The approach that the Ruby community has taken is to write extensive test coverage for your application. We are using Rspec, Cucumber and Jasmine for tests, and track test coverage (and overall code quality) using Code Climate.

Lagotto also uses rubocop to make sure the code follows the Ruby style guide, as well as the brakeman security scanner. All these tools run in the Travis CI continuous integration environment.

Error tracking

Lagotto can create thousands of errors every day if there are problems with even one of the external APIs which the application talks to. For this reason we have implemented a custom error tracker that stores all errors in the database and makes it easy to search for them, receive an email for critical errors, etc.

[Figure: alerts dashboard]

The alerts dashboard of course also helps discover problems with the code that are missed by testing for a variety of reasons, such as in the example above.

Automation

Lagotto is a typical Rails web application for those who are familiar with Rails. Unfortunately many people are not. Automating installation and deployment of new releases is therefore critical. Since 2012 we have worked with three tools to automate the process:

  • Vagrant, a tool to easily create virtual development and deployment environments
  • Chef, an IT automation tool
  • Capistrano, the standard Rails deployment tool

All three tools are written in Ruby, making it straightforward to customize them for a Ruby application. Deploying a new Lagotto server from scratch can be done in less than 30 minutes, and we have refined the process over time, including some big changes in the recent 3.7 release. Thanks to Vagrant, Lagotto can easily be deployed to a cloud environment such as AWS or DigitalOcean. After some initial work with the platform-as-a-service (PaaS) providers CloudFoundry and OpenShift (I didn't try Heroku), I chose not to delve deeper, mainly because these tools added a layer of complexity and cost that wasn't outweighed by the ease of deployment. The elephant in the room is of course Docker, an exciting new application platform that we hope to support in 2015.

Documentation

The typical user of the Lagotto application isn’t hosting applications in a public cloud, and not all use virtual servers internally. We therefore still need good support for manual installation, and here documentation is critical.

All Lagotto documentation is written in markdown, making it easy to reuse nicely formatted documents (links, code snippets, etc.) in multiple places, e.g. via the Documentation menu from any Lagotto instance.

[Figure: documentation written in markdown]

Writing documentation is hard, and all pages are evolving over time based on user feedback.

User Support Forum

In the past I have answered most support questions via IM, private email, the mailing list, or the GitHub issue tracker. There are some shortcomings to these approaches, so last month we launched a new Lagotto forum (http://discuss.lagotto.io) as the central place for support.

[Figure: the Lagotto forum running Discourse]

The site runs Discourse, a great open source Ruby on Rails application that a number of (much larger) open source projects (e.g. EmberJS or Atom) have picked for user support. The forum supports the third-party login via Mozilla Persona that we use in Lagotto (you can also sign in via Twitter or Github), and has nice markdown support, so we can simply paste in documentation snippets when creating posts. Please use this forum if you have any questions or comments regarding the Lagotto software.

Roadmap

Users of the Lagotto application need to know what high-level features are planned for the coming months, and be able to suggest features important to them. The Lagotto forum is a great place for this and we have posted the development roadmap for the next 4 months there. The planned high-level features are:

  • 4.0 – Data-Push Model (November 2014)
  • 4.1 – Data-Level Metrics (December 2014)
  • 4.2 – Server-to-Server Replication (January 2015)
  • 4.3 – Archiving and Auditing of Raw Data (February 2015)

Work on the data push API has already begun and will be the biggest change to the system architecture since summer 2012. This will allow the application to scale better, and to be more flexible in how data from external sources are collected.

The data-level metrics work relates to the recently started NSF-funded Making Data Count project. This work will also make Lagotto much more flexible in the kinds of things we want to track, going well beyond articles and other items that have a DOI associated with them.

One of the biggest risks for any software project is probably feature creep: adding more and more things that look nice in the beginning, but make maintenance harder and confuse users. Any piece of software, in particular open source software, needs a critical mass of users. I am therefore happy to expand the scope of the software beyond articles, and to accommodate both small organizations with hundreds of articles and large instances with millions of DOIs.

But we should keep Lagotto's focus as software that tracks events around scholarly outputs such as articles and makes them available via an API. Lagotto is not a client application that the average user would interact with. Independent Lagotto client applications serve this purpose, in particular journal integrations with the API (at PLOS and elsewhere) as well as ALM Reports and the rOpenSci alm package.

Lagotto is also not an index service with rich metadata about the scholarly outputs it tracks. We only know about the persistent identifier, title, publication date and publisher of an article, not enough to provide a search interface, or a service that slices the data by journal, author, affiliation or funder.

The future for Lagotto looks bright.

[Image: lagotto (source: Wikimedia Commons)]


Make data sharing easy: PLOS launches its Data Repository Integration Partner Program


Over the past couple of years, we've been improving data access in a number of ways, notably: unlocking content in SI files, connecting PLOS Biology and PLOS Genetics with Dryad data, and making data discoverable through figshare recommendations. Our update to the PLOS data policy in March 2014 undergirds and reinforces these services (cf. the policy FAQs). Once data is available without restrictions upon publication, we can build tools to support the research enterprise. Today, our effort to improve data access at PLOS continues with an exciting new chapter.

Data Repository Integration Partner Program: easy deposition and article submission

We announce the launch of the PLOS Data Repository Integration Partner Program, which integrates our submission process with those of a select set of data repositories to better support data sharing and author compliance with the PLOS data policy. (PDRIPP didn't make the cut as a name.) Through this program, we make it easier for researchers to deposit data and submit their manuscript to PLOS through a single, streamlined workflow. Our submission system is sewn together with partner repositories to ensure that the article and its underlying data are fully paired – published together and linked together.

Community benefits of the data repository integration partner program include the following:

  • ensuring that data underlying the research article are publicly available for the community
  • making it easier for PLOS authors to comply with the data policy
  • making data underlying the research article available to peer review contributors even if not yet publicly released
  • establishing bi-directional linkage between article publication and data
  • enabling data citation so that data producers can gain professional credit with data Digital Object Identifiers (DOIs)

We recognize that data types can vary quite widely and repositories have different competencies. Most importantly, researcher needs are diverse, and repository choice is important. The program thus aims to accommodate the diversity of data requirements across subject areas as well as from funders and institutions. While these partners are strong options for researchers, they are neither representative nor exhaustive of the suitable repositories that satisfy the PLOS data policy. [1]

Dryad: our first integration partner

[Image: Dryad logo]

We are thrilled that Dryad is the first integration partner to join the program. It is a curated resource that hosts a wide variety of data underlying publications in science and medicine, making research data discoverable, freely reusable, and citable. We previously connected two PLOS journals, PLOS Biology in 2012 and PLOS Genetics in 2013, tying together the data deposition process and our submission workflow. Now, we expand this service to all PLOS authors across all seven journals.

Better yet, we have made the workflow more flexible and efficient for authors, so that data deposition can now occur before article submission. The steps are quite simple: once researchers have deposited data in Dryad, they will receive a provisional dataset DOI along with a private reviewer URL. The data DOI should be incorporated into the full Data Availability Statement as per standard procedure. The reviewer URL is also uploaded into the submission system and serves as a passcode for private access to the data before public release. The manuscript then moves swiftly through peer review. And if it is accepted for publication, both article and dataset will be published together on the same day. The fantastic Dryad blog post details the whole story. If you have any questions about journal article submission, please contact the respective journal at plos[journal]@plos.org or data@plos.org. Dryad repository questions can be directed to help@datadryad.org. [2]

More repos: growing the partnership to better serve researchers

PLOS is repository agnostic, provided that data centers meet certain baseline criteria (e.g., availability, reliability, preservation) that ensure trustworthiness and good stewardship of data. We are expanding the current selection of partners to provide this service to more of our authors, across more data centers. We have a few already slated to join in the next couple of months. Stay tuned!

But we ask for your help, researchers:

which new repositories would most benefit your work?

Let us know your recommendations for additional partner candidates.  Your thoughts and feedback on the new program are always welcomed at data@plos.org.

 

Footnotes:

1. Authors are under no obligation to use the data repositories that are part of the integration program.  We recommend that researchers continue to deposit their data based on the field-specific standards for preparation and recording of data and select repositories appropriate to their field. 

2. Please note that Dryad has a Data Publication Charge. For authors selecting the service, this fee will be charged by Dryad to the author if/when the related manuscript is accepted for publication, so there is no charge while manuscripts are under review.  PLOS does not gain financially from our association with Dryad, which is a nonprofit organization.


How do you DO data?


We all know that data are important for research. So how can we quantify that? How can you get credit for the data you produce? What do you want to know about how your data is used? If you are a researcher or data manager, we want to hear from you. Take this 5-10 minute survey and help us craft data-level metrics:

https://www.surveymonkey.com/s/makedatacount

Please share widely! The survey closes December 1st.

The responses will be fed directly into a broader project to design and develop metrics that track and measure data use, i.e. "data-level metrics" (DLM). See an earlier blog post for more detail on the NSF-funded project, Making Data Count, which is a partnership between PLOS, CDL, and DataONE.

DLM are a multi-dimensional suite of indicators, measuring the broad range of activity surrounding the reach and use of data as a research output. They will provide a clear and growing picture of the activity around research data: direct, first-hand views of its dissemination and reach. These indicators capture the footprint of the data from the moment of deposition in a repository through its dynamic evolution over time. DLM are automatically tracked, thus reducing the burden of reporting and potentially increasing consistency. Our aims in measuring the level and type of data usage across many channels are plain:

  • make it possible for data producers to get credit for their work
  • prototype a platform so that these footprints can be automatically harvested
  • make all DLM data freely available to all (open metrics!)

At the moment, we are canvassing researchers and data managers to better understand data-sharing attitudes and perceptions, identify core values for data use and reuse, and describe existing norms around the use and sharing of data. The survey answers, combined with previous and ongoing research (e.g., Tenopir et al., Callaghan et al.), will serve as the basis for the work ahead. They will be converted into requirements for an industry-wide data metrics platform. We will explore metrics that can be generalized across broad research areas and communities of practice, including the life sciences, physical sciences, and social sciences. The resulting framework and prototype will represent the connections between data and the various channels in which engagement with the data is occurring. We will then test the validity of the pilot DLMs with real data in the wild and explore the extent to which automatic tracking is a viable approach for implementation.

Thank you in advance for your contributions in the data survey. Please fill it out and pass it on today! Read more about the project at mdc.plos.org or the CDL survey blog post.
