Update of the PLoS Journals to Ambra 0.9.5

We will update the PLoS journals with the Ambra 0.9.5 release tonight. This release focused on development of the long-awaited PLoS Queue syndication module. Starting in January, PLoS will send article packages directly to PubMed Central at publication time. Article packages are currently sent to PubMed Central at the same time as they arrive at PLoS, so if the PLoS editorial or production teams find any errors, we need to send the fixed article packages back for revision. Sending these article packages at publication time will save a lot of time and effort when errors are found. This also allows us to automatically send article packages, XML and PDFs to other external repositories in the future. The admin panel and publication workflow were updated for this new syndication service.

New features and updates:

  • Added a new syndication service to the Admin panel. This will allow administrators to export article files and metadata to external sources. PLoS is using this syndication service with an internal application built on Apache Camel to send published files to PubMed Central and other external repositories. This is not an out-of-the-box solution as only the service hooks are available in Ambra.
  • Added RDFa for the Dublin Core elements in the article HTML. Look for a blog posting on this feature soon!
  • Added a new action to support a Google Scholar XML feed for published articles.
  • Updated the article right-hand column (RHC) to display articles cross-published in other journals.
  • Improvements to how XSL sheets are configured.
  • Upgraded the Dojo library to 1.3.2.

Bug fixes:

  • Promoting a note to a formal correction removed the annotation body.
  • Templates from other journals were displayed if freemarker.properties template_update_delay was set.
  • Template changes to fix a number of formatting errors encountered in IE6, IE7 and IE8.
  • Minor fixes to annotation displays, formal corrections, formatting and messaging.

For a complete list of 0.9.5 features and bug fixes, please see the Ambra 0.9.5 RC1 Tickets and the Ambra 0.9.5 RC2 Tickets.

Category: Technology | Leave a comment

Network Outage at UnitedLayer

It seems that the storm slamming the Bay Area also affected the co-location facility UnitedLayer. A power “glitch” in their building brought down one of their big UPS units. This caused a power outage to all of their network equipment. When their routers restarted, they were not serving external traffic, and their backup routers also failed.

We were notified of the PLoS website outages at 7:22am. We could not connect to any PLoS servers, which pointed to a network problem at the co-location facility UnitedLayer. A phone call confirmed the worst and we set about trying to get a backup environment running on Amazon EC2 (“the cloud”). We redirected PLoS ONE to an instance that was already running but the site was too slow to be useful. We then redirected website traffic to the everyONE blog. UnitedLayer fixed their network issues at 10:30am. We switched the redirects back and the journal websites are back online.

Category: Technology | Leave a comment

Update to the PLoS Journals

Last night (September 15), we updated the PLoS Journals to Ambra 0.9.4. This release culminates a many-sprint development effort and huge data migration to provide per-article usage statistics. The article usage statistics join the other article usage data (citations, bookmarks, blog posts, etc.) to allow users new ways to evaluate the value of articles. Mark Patterson has posted a blog entry about the Article-level Metrics at PLoS. More information about our article-level metrics program can be found in our PLoS FAQ and the Article-level Metrics website.

To get the article usage statistics, we ran Apache log files from the last four years through a massive data migration pipeline to provide per-article usage data for the number of HTML page views, PDF downloads and XML downloads. For this data, we conformed to the COUNTER 3 standards (industry standard guidelines used to report the usage of online journals to subscribing libraries). But we found that COUNTER’s list of robots was extremely limited, so we exceeded the COUNTER standards by excluding even more robots from the usage data. For detailed information about the usage data, see the Usage Data Help section of PLoS ONE.
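
The counting step described above can be sketched roughly as follows. This is a minimal illustration and not the actual PLoS pipeline: the `/article/<kind>/<doi>` URL layout, the short list of robot markers and the function names are all assumptions made for the example, and a real COUNTER-style exclusion list would be far longer.

```python
import re
from collections import Counter

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Hypothetical robot markers, for illustration only.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# Hypothetical URL layout: /article/<kind>/<doi>
ARTICLE_PATH = re.compile(r"^/article/(?P<kind>html|pdf|xml)/(?P<doi>[^?#]+)")


def is_robot(agent):
    """Return True if the user-agent string contains a known robot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def count_usage(lines):
    """Count successful, non-robot GET requests per (doi, kind)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in combined log format
        if match["method"] != "GET" or match["status"] != "200":
            continue  # only successful page or file retrievals count
        if is_robot(match["agent"]):
            continue  # exclude robot traffic from the usage data
        article = ARTICLE_PATH.match(match["path"])
        if article:
            counts[(article["doi"], article["kind"])] += 1
    return counts
```

Calling `count_usage` on an open log file returns a `Counter` keyed by `(doi, kind)`, which could then be summed per article to get HTML page views, PDF downloads and XML downloads.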

Every article now has a new tab called Metrics which displays a graph of the cumulative usage data. You can see a great example of the article usage data on the Ten Simple Rules for Getting Published article. Since we have all this data, we also created journal summary usage data showing the average lifetime usage per PLoS journal. And we provide a summary Excel file containing the full data set for every PLoS article up to July 31, 2009.

But that’s not all! Other features that have been implemented include:

  • We now display the citations to articles by CrossRef. You can see an example of the citations to this article as recorded by CrossRef.
  • Users can now search across multiple journals.
  • Allow administrators to override annotation citation information. Correction citations used to be dynamically generated from the article information. If that article information was itself the subject of the correction (e.g. title, author name), then the error was propagated into the Correction citation. Now, admins can update the citation information for any annotation to display the correct citation.
  • A number of outstanding bugs were fixed.

For a complete list of features implemented, see the release notes.

During the maintenance window last night, we also restarted all of the production servers. We run all of our servers on Linux CentOS so we don’t restart the servers very often. Most of the servers had been running non-stop for 450 days but one of the web servers had been up for 890 days! Unfortunately, one of the production servers rebooted with a bad drive and caused the maintenance window to be a bit longer than expected. We’re off to the colo today to fix the drive but this won’t affect the PLoS web sites while we replace the drive.

Congrats to everybody involved in this enormous effort to provide the article usage statistics!

Category: Technology | 2 Comments

Inaugural Meeting of the Concept Web Alliance

On May 7–8, I attended the inaugural meeting of the Concept Web Alliance. The CWA wants to enable interoperability between large triple stores like the Large Knowledge Collider (LarKC) and provide an Open Access mechanism for accessing the triple stores. This is great for projects in the life sciences, as semantic triple stores are becoming the de facto way to store data for gene expression and sequencing, biobanks, etc.

The CWA mission statement is:

“To enable an open collaborative environment to jointly address the challenges associated with high volume scholarly and professional data production, storage, interoperability and analyses for knowledge discovery.”

You can read the entire CWA declaration.

There were a number of representatives from the STM publishing world (Abel Packer from Bireme is a founding CWA member, Nature, Springer, SEED, The Scientist, Thomson Reuters) and I had some good conversations about the vision of the CWA in relation to STM publishers. All agree that the CWA is a much needed initiative but there are questions on how it can feed back into a revenue model. Most STM publishers don’t have triple stores that can be offered to the CWA endeavor. They publish the final result – research articles based on the triple stores.

But I see a few ways that publishers can benefit from working with the CWA:

1. The CWA can provide tools that link the data stores directly to the content of the research article. Search, data mining, and visualization tools can be created for publishers which would give their users new ways of interacting with the research article. Users can find the research articles that they really want and can dig into the underlying data even if the data is a massive data store. For the publisher, this can increase revenue by bringing more traffic, focused advertising campaigns, etc. CWA can provide these tools for a fee which would be used to sustain and further the CWA mission.

2. The CWA can provide tools to automate semantic encoding of the research article. As an example, David Shotton has shown how publishers can use this by encoding a PLoS NTD article – see Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. He also has another paper titled “Semantic Publishing: the coming revolution in scientific journal publishing” – preprint available here. Knewco also has some great technology in this space – check out their Concept Web.

3. Publishers can provide Open Data back to the CWA. Publishers could provide access to content tagged with RDFa for easier auto-machine discovery, allow access to the data from the supplemental information in the research article or provide triples generated from the content of the research article itself. PLoS is a bit ahead of the publishing curve as all of the PLoS journals run on the Ambra/Topaz platform which stores the content of the research articles as triples. We’re looking at ways to provide access to a subset of these triples (we would need to remove user information) through a SPARQL endpoint or other means of access. This would allow for direct access to the triples that could then be given back to the CWA.

What other ways can publishers interact with the CWA? The CWA wants to know how your organization can participate.

Category: Technology | Leave a comment

PLoS Biology Migration to Ambra/Topaz

Yesterday, we migrated PLoS Biology to the Ambra/Topaz platform. This completed a two-year, five-journal, ~9000-article migration involving many of the PLoS staff. Now all of the PLoS journals have the same feature set including notes, comments, ratings, article impact metrics, etc. Migrating all of the PLoS journals to a single platform is a major milestone for PLoS and will allow us to finally create cross-journal features such as cross-journal search.

We released a snapshot version of Ambra to production just before the PLoS Biology migration that fixed a number of bugs that were uncovered with the PLoS Medicine migration in March. These bug fixes will likely be unnoticed by most users.

We also standardized our environments for performance testing of the Ambra/Topaz platform in Amazon EC2. The development team created scripts to automate the launch of the Ambra/Topaz platform in the Amazon “cloud” with a snapshot of our production data. We will continue working on these scripts so that others can easily launch Ambra/Topaz instances and test the platform (stay tuned for more info).

Liz has provided a bit more information on migration at the everyONE blog.

Category: Technology | Leave a comment

PLoS Journal Outage on April 9

A number of factors contributed to the long outage today. The outage was caused by the sabotaged fiber-optic cable lines in San Jose. This affected the network traffic going to United Layer, our co-location facility. United Layer is supposed to have a redundant network line for failover in case something like this happens. I don’t know the details, but this redundant network line wasn’t working. Their engineers finally rerouted their customers’ traffic around the San Jose disruption at 1:43pm PST.

During the outage, we were able to redirect journal traffic to the everyONE Blog which Liz updated throughout the morning. We also launched an Amazon EC2 instance and were (literally) minutes away from having the sites running on EC2 albeit with a snapshot of production data from March 17.

United Layer will be held accountable for their part in the outage. We’ll also look to improve our disaster recovery plans to try and limit the downtime caused by future “catastrophic” outages.

A comment about the outage today left at sfgate.com: “The more complicated they make the plumbing, the easier it is to plug up the pipes.” – Lt. Com. Montgomery Scott

Category: Technology | 1 Comment

PLoS Journals upgrade to Topaz 0.9.2

Tonight, we upgraded the PLoS journal websites to Topaz 0.9.2. This release is chock full of user interface changes and enhancements. We’ve completely redesigned the article page to accommodate new features and give a better visual experience to the user. Since this is a significant design change for the article layout, we’d like to hear from our users. Email us or reply to this blog post and let us know what you think about the changes.

The article page now has three tabs:

  1. Article: Much of the content in the right hand column has been moved into the other tabs to make way for new features. Links to the appropriate issue or collection appear in the right hand column of the article page. In PLoS ONE, related subject categories have been added to the right hand column to allow easy access to other related articles. We’ve also designed the right hand column to allow for some new feature growth in the future (e.g. user tags).
  2. Related Content: Data from external sources is provided on this tab. Sources include the number of citations from PubMed Central and Scopus; the number of bookmarks from CiteULike and Connotea; and the number of blog posts linking to the article from Postgenomic, Nature Blogs and Bloglines. More sources will be added in the future.
  3. Comments: All of the comments, minor corrections and formal corrections are easily viewed in one location.

Many other features were added to this release:

  • Competing interest statements were added to all notes, comments and ratings.
  • The creation of retraction annotation types.
  • The Most Recently Published homepage block can be configured to display a set number of articles, articles from within a published date range and an article white list (e.g. display only research articles).
  • Articles can be ordered within an issue and in the table of contents.
  • Formal corrections and retractions are now displayed in the table of contents next to the appropriate article.
  • Support for NLM DTD 2.3.
  • The administration portal was overhauled to provide a better workflow for creating volumes and issues.
  • The administration portal was updated to allow manual ordering of articles within an issue.
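
To make the Most Recently Published configuration from the list above more concrete, here is a minimal sketch of how the three options (a set number of articles, a published date range and an article white list) could combine. It is not Ambra code: the `Article` fields, the function name and its parameters are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    # Hypothetical minimal article record for this example.
    doi: str
    article_type: str
    published: date


def most_recently_published(articles, limit, since=None, until=None,
                            allowed_types=None):
    """Return up to `limit` articles, newest first, after applying the
    optional date range and the optional article-type white list."""
    selected = [
        a for a in articles
        if (since is None or a.published >= since)
        and (until is None or a.published <= until)
        and (allowed_types is None or a.article_type in allowed_types)
    ]
    selected.sort(key=lambda a: a.published, reverse=True)
    return selected[:limit]
```

For example, passing `allowed_types={"Research Article"}` with `limit=5` would display only the five newest research articles.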

This is the first release for which the PLoS development team has used SCRUM for software development. We’re believers! We have a lot of lessons to learn from our first sprint, but the overall consensus is that SCRUM allowed us to quickly (in just six weeks!) develop a healthy number of new features for the Ambra platform. As we continue to improve on the process, we’ll be able to push out new features in rapid succession. The hope is for a new release every four weeks.

Category: Technology | 1 Comment

PLoS Journal Websites – Upgrade to Topaz 0.9.1

At 5pm PST, we will update the PLoS journal websites to Topaz 0.9.1. The journal websites will be offline for approximately one hour. Once the upgrade is complete, the websites will be a bit slow for the first couple of hours while the caches re-fill. We will run scripts before go-live to re-fill the caches for articles linked from the homepage and the current issue, so the most recent articles should display quickly to the end user.
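
A cache-warming script of the kind mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the script PLoS runs: it assumes that requesting an article page is enough to populate the cache, that article links contain a `/article/` marker, and the names `warm_cache` and `ArticleLinkParser` are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ArticleLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like article links."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href and href not in self.links:
            self.links.append(href)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def warm_cache(start_urls, marker="/article/", fetch=fetch):
    """Request every article linked from the given pages exactly once.

    `marker` is an assumed substring that identifies article links.
    Returns the list of article URLs that were requested.
    """
    warmed = []
    for start in start_urls:
        parser = ArticleLinkParser(marker)
        parser.feed(fetch(start))
        for href in parser.links:
            url = urljoin(start, href)  # resolve relative links
            if url not in warmed:
                fetch(url)  # the request itself fills the cache
                warmed.append(url)
    return warmed
```

The `fetch` parameter can be swapped for a stub so the script can be dry-run without touching the live site.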

The PLoS development team used Amazon EC2 extensively for performance tests and for benchmark tests vs. Topaz 0.9.0. We’ve also finished the herculean task of ingesting 6372 PLoS Medicine and PLoS Biology corpus articles required for the migration of those journals to Ambra/Topaz in the next few months.

Features implemented in Topaz 0.9.1 include:

  • Revamp of Ambra search to use Mulgara (does not include update to the search UI)
  • Support of article packages greater than 4 GB in size
  • Support of article packages in tar format
  • Support for very large assets (e.g. supporting info files) outside of memory
  • Fixes to formal corrections
  • Fixes to citations
  • Fixes to citation downloads
  • Article-level feeds for top-level notes/comments on an article
  • Updates to admin panel for administering notes/comments
  • Many fixes to the UI including line wrap for long gene sequences, browser-specific bugs, wrapping long URLs in annotations/articles, etc.
  • Cleanup of content models, back-end bug fixes, etc.

Category: Technology | 2 Comments

DNS Issue with PLoS.org – Resolved

We had an outage on our PLoS.org domain resources yesterday due to a DNS issue. As a result, www.plos.org and all of the plos.org subdomains were intermittent as well. The issue was addressed promptly but there was a 2-10 hour delay while DNS servers were updated and ~24 hour delay before a comprehensive international DNS update. Most people were able to access www.plos.org within a few hours.

The DNS issue did not affect any of the PLoS journal websites.

Please be aware that if doi.plos.org is down, you can use CrossRef for DOI resolution.

Category: Technology | 1 Comment

Journal Websites – Topaz 0.9 rc1 Upgrade

Last night, we upgraded the journal websites to Topaz 0.9 rc1 (rc1 because this is a “beta” 0.9 release). The development for this release focused on performance and stability – specifically to alleviate the sluggish speed of the websites and the pain of ingests.

The development included a major re-architecture of the publishing application and weeks of performance testing. There are a few minor issues but the sites are quite zippy and the new ingest times have, to quote James, “exceeded expectations.” We’ll probably have a few more quick restarts over the next few days as we sift through the logs/bugs but we won’t have to rebuild the article cache ever again (another pain point).

Thanks to Russ for the heroic migration day, Josh for the speedy drive rebuilds of the Mulgara server last night, and to the Topaz/PLoS developers that got this “beta” release out the door. We still have some cleanup after the dust settles, but I feel confident that the site performance/stability has greatly improved.

Category: Technology | Leave a comment