This document was prepared to support the European Commission’s ongoing discussion on Content Mining. In particular, it discusses publisher best practice in enabling content mining, and the challenges that can arise when particular types of traffic reach high levels, from the perspective of a purely Open Access publisher.
Enabling the discovery and creative re-use of content is a core aim of Open Access and of Open Access publishers. For those offering Open Access publication services enabling downstream users to discover and use published research is a crucial part of the value offering for customers. Content mining is an essential emerging means of supporting discovery of research content and of creating new derivative works that enhance the value of that content.
Content mining generally involves the computational reading of content, either by obtaining specific articles from a publisher website or by working on a downloaded corpus. Computational access to a publisher website has the potential in theory to create load issues that may degrade performance for human or other machine users.
In practice downloads that result from crawling and content mining contribute a trivial amount to the overall traffic at one of the largest Open Access publisher sites and are irrelevant compared to other sources of traffic. This is true both of average traffic levels and of unexpected spikes.
Managing high traffic users is a standard part of running a modern web service, and there is a range of technical and social approaches to managing that use. For large scale analysis a data dump is almost always going to be the preferred means of accessing data, and it removes traffic issues. Mechanisms exist to request that automated traffic be kept at certain levels, and these requests are widely followed; where they are not, technical measures are available to manage these problematic users.
Scale and scope of the problem
PLOS receives around 5 million page views per month from human users to a corpus of 100,000 articles, as reported by Google Analytics. This is a small proportion of the total traffic, as it does not include automated agents such as the Google bot. The total number of page views per month is over 60 million for PLOS ONE alone. Scaling this up to the whole literature suggests that there might be a total of 500 million to 5 billion page views per month across the industry, or up to seven million an hour from human users. As noted below, the largest traffic websites in the world provide guidance that automated agents should limit retrieving pages to a specified rate. Wikipedia suggests one page per second. PLOS requests a delay of 30 seconds between downloading pages.
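The hourly figure can be checked with a short calculation. The sketch below assumes a 30-day month, which is a simplification introduced here and not stated in the text; the monthly totals are the estimates quoted above.

```python
# Back-of-envelope check of the industry-wide estimates quoted above.
# Assumption (not from the source): a month is treated as 30 days.
HOURS_PER_MONTH = 30 * 24


def per_hour(page_views_per_month: float) -> float:
    """Convert a monthly page-view total into an average hourly rate."""
    return page_views_per_month / HOURS_PER_MONTH


low = per_hour(500_000_000)     # lower industry-wide estimate
high = per_hour(5_000_000_000)  # upper industry-wide estimate

print(f"low estimate:  {low:,.0f} page views per hour")
print(f"high estimate: {high:,.0f} page views per hour")
```

Under that assumption the upper estimate works out at a little under seven million page views an hour, consistent with the figure given above.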
PLOS infrastructure routinely deals with spikes of activity that are ten times the average traffic and is designed to manage loads of over 100 times average traffic without suffering performance problems. Thus it would require hundreds of thousands of simultaneously operating agents to even begin to degrade performance.
Content mining is a trivial and easily managed source of traffic compared to other sources, particularly coverage on popular social media sites. Coverage of an article on a site like Reddit often leads to tens of thousands of requests for a single page within an hour. By contrast, automated crawling usually leads to a smaller number of overall downloads and is spread out over longer time periods, making it much easier to manage. As an example, there are attempts made to artificially inflate article download counts, which involve tens of thousands of requests for the same article. We do not even attempt to catch these at the level of traffic spikes because they would be undetectable; they are detected through later analysis of the article usage data.
Sources of traffic that do cause problems are generally rogue agents and distributed denial of service attacks where hundreds of thousands or millions of requests occur per second. These sources of traffic are the main source of service degradation and need to be managed based on the scale of traffic and the likelihood of being a target for such attacks. The scale of content mining traffic for any given publisher will be dependent on the scale of interest in the content that publisher is providing.
There are broadly three complementary approaches to supporting content mining in a way that does not have any impact on user experience. While all of these approaches are implemented by effective scholarly publishers, it is worth examining them in the context of a truly high traffic site. Wikipedia is an excellent example of an extremely high traffic site that is also subject to large scale mining, scraping, and analysis.
Providing a data dump
The first and simplest approach is to provide a means of accessing a dump of all the content, which can then be obtained for offline analysis. Generally speaking, the aim of analysis is to mine a whole corpus, and enabling the user to obtain a single dump and process it offline improves the experience for the miner while removing any risk of impact on website performance. Wikipedia provides a regular full dump of all content for all language versions and recommends that this be the first source of content for analysis. Many Open Access publishers adopt a similar strategy, using deposition at PubMed Central or on their own websites as a means of providing access to a full dump of content. PLOS recommends that those wishing to parse the full corpus use PMC or Europe PMC as the source of that content.
This approach is especially useful for smaller publishers running their own infrastructure as it means they can use a larger third party to handle dumps. Of course for smaller publishers with a relatively small corpus the scale of such a data dump may also be such that readily available file sharing technologies suffice. For a small publisher with a very large backfile the imperative to ensure persistence and archiving for the future would be further supported by working with appropriate deposit sites to provide both access for content miners and preservation. Data dumps of raw content files are also unlikely to provide a viable alternative to access for human readers so need not concern subscription publishers.
Agreed rates of crawling
It is standard best practice for any high traffic website to provide a “robots.txt” file that includes information on which parts of the site may be accessed by machine agents, or robots, and at what rate. These files should always include a ‘crawl-delay’, which indicates the time in seconds that an agent should wait before downloading a new page. Wikipedia’s robots.txt file says, for instance, “Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please” and suggests a delay of at least one second between retrieving pages. This is not enforced technically but is a widely recognised mechanism that is respected by all major players; not following it is generally regarded as grounds for taking technical measures as described below.
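As an illustration of how a content mining agent can honour these requests, the sketch below reads a crawl delay and an access rule using Python's standard `urllib.robotparser` module. The robots.txt content, the agent name `example-miner` and the `publisher.example` URLs are all hypothetical and are not copied from any real publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
Disallow: /search
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

agent = "example-miner"  # hypothetical user agent name

# Seconds the site asks this agent to wait between page downloads.
delay = parser.crawl_delay(agent)

# Whether the site permits this agent to fetch each (hypothetical) URL.
article_allowed = parser.can_fetch(agent, "https://publisher.example/article/123")
search_allowed = parser.can_fetch(agent, "https://publisher.example/search")

print(delay, article_allowed, search_allowed)
```

A well-behaved agent would then pause for `delay` seconds (for example with `time.sleep`) between requests and skip any URL for which `can_fetch` returns `False`.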
PLOS currently requests a crawl delay of 30 seconds, BioMed Central asks for one second, and eLife for ten. When working with content from a large publisher, crawl delays of this magnitude mean that it is more sensible for large scale work to obtain a full data dump. Where a smaller number of papers are of interest, perhaps a few hundred or a few thousand, the level of traffic that results from even large numbers of content mining agents that respect the crawl delay is trivial compared to human and automated traffic from other sources.
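A rough calculation shows why a data dump is preferable for whole-corpus work. The sketch below assumes a single agent, ignores the time taken by each download, and applies the 100,000-article corpus size quoted earlier for PLOS to all three delays purely for comparison; the 1,000-paper job is a hypothetical example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(n_pages: int, delay_seconds: float) -> float:
    """Approximate days for one agent to fetch n_pages, pausing delay_seconds
    between requests (download time itself is ignored)."""
    return n_pages * delay_seconds / SECONDS_PER_DAY


corpus = 100_000  # corpus size quoted earlier for PLOS, reused for all rows

# Crawl delays in seconds, as reported in the text above.
for name, delay in [("PLOS", 30), ("eLife", 10), ("BioMed Central", 1)]:
    print(f"{name}: {crawl_days(corpus, delay):.1f} days at a {delay}s delay")

# A smaller job: a hypothetical 1,000 papers at a 30 second delay, in hours.
small_job_hours = crawl_days(1_000, 30) * 24
print(f"1,000 papers at a 30s delay: {small_job_hours:.1f} hours")
```

On these assumptions a full crawl at a 30 second delay takes roughly five weeks, whereas a job of a thousand papers completes within a working day.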
It is, however, the case that some actors will not respect crawl-delays and other restrictions in robots.txt. In our experience this is rarely the case with content miners and is much more frequently the result of malicious online activity, rogue automated agents, or, in several cases, security testing at research institutions, which sometimes involves attempts to overload local network systems.
Whether the cause is a spike in human traffic, malicious agents, or other sources of heavy traffic, maintaining a good service requires that these issues be managed. The robots.txt restrictions become useful here: when it is clear that an agent is exceeding those recommendations, it can be shut down. The basic approach is to “throttle” access from the specific IP address that is causing problems. This can be automated, although care is required because in some cases a single IP may represent a large number of users, for instance a research institution proxy. For PLOS such throttling is therefore only activated manually at present. This has been done in a handful of cases, none of which related to text mining.
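The idea of per-IP throttling can be sketched as a sliding-window counter. This is a minimal illustration and not a description of the PLOS infrastructure: the class `IpThrottle`, its parameters, the thresholds and the IP addresses (taken from documentation-reserved ranges) are all hypothetical. Addresses known to be shared, such as an institutional proxy, are flagged for manual review instead of being throttled automatically, reflecting the caution described above.

```python
from collections import defaultdict, deque


class IpThrottle:
    """Hypothetical sliding-window limiter keyed by client IP address."""

    def __init__(self, max_requests, window_seconds, review_only_ips=()):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Shared addresses (e.g. proxies) that must never be auto-throttled.
        self.review_only_ips = set(review_only_ips)
        self._recent = defaultdict(deque)  # ip -> timestamps inside the window

    def check(self, ip, now):
        """Return 'allow', 'throttle' or 'review' for a request at time `now`
        (in seconds). Every request, including a throttled one, is counted."""
        recent = self._recent[ip]
        # Discard timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) <= self.max_requests:
            return "allow"
        if ip in self.review_only_ips:
            return "review"
        return "throttle"


# A 30 second crawl delay corresponds to roughly 2 requests per 60 seconds.
throttle = IpThrottle(max_requests=2, window_seconds=60,
                      review_only_ips={"203.0.113.7"})

agent_results = [throttle.check("198.51.100.9", t) for t in (0, 10, 20, 90)]
proxy_results = [throttle.check("203.0.113.7", t) for t in (0, 1, 2)]
print(agent_results)
print(proxy_results)
```

In this sketch the third request inside the window is throttled for an ordinary address, while the same pattern from the listed proxy address only produces a `review` result for a human operator to assess.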
At larger scale, automated systems are needed, but again this is part of running any highly used website. Load-balancing, monitoring incoming requests, and managing the activity of automated agents are a standard part of running a good website. Identifying and throttling rogue activity is just one part of the suite of measures required.
Enabling content mining is a core part of the value offering for Open Access publication services. Downloads that result from crawling and content mining contribute a trivial amount to the overall traffic at one of the largest Open Access publisher sites and are irrelevant compared to other sources of traffic. This is true both of average traffic levels and of unexpected spikes.
Managing high traffic users is a standard part of running a modern web service, and there is a range of technical and social approaches to managing that use. For large scale analysis a data dump is almost always going to be the preferred means of accessing data, and it removes traffic issues. Mechanisms exist to request that automated traffic be kept at certain levels, and these requests are widely followed; where they are not, technical measures are available to manage these problematic users.
There are sources of traffic to publisher websites that can cause problems and performance degradation. Managing these issues is part of the competent management of any modern website. Content mining, even if it occurred at volumes orders of magnitude above what we see currently, would not be a significant source of issues.