We know that the Open Access policies that work are the ones with teeth. Both institutional and funder policies work better when tied to reporting requirements. The success of the University of Liege in filling its repository is in large part due to the fact that works not in the repository do not count for annual reviews. Both the NIH and Wellcome policies have seen substantial jumps in the proportion of articles reaching the repository when grantees' final payments, or their ability to apply for new grants, were withheld until issues were corrected.
The Liege, Wellcome and NIH policies all have something in common: they specify which repository content must be deposited in to count. This makes it straightforward to determine whether an article complies with the policy. For various reasons, other policies are less specific about where articles should go, which makes it harder to track policy implementation. The RCUK policy is particularly relevant, with a call for evidence currently out to support the implementation review being undertaken. However, the issues of implementation monitoring are equally relevant to the European Commission Horizon 2020 policy, Australian funder policies and the UK HEFCE policy, as well as implementation of the US White House order.
The challenges of implementation monitoring
Monitoring Open Access policy implementation requires three main steps:
- Identify the set of outputs that are to be audited for compliance
- Identify accessible copies of the outputs at publisher and/or repository sites
- Check whether the accessible copies are compliant with the policy
Each of these steps is difficult or impossible in our current data environment. Each of them could be radically improved with some small steps in policy design and metadata provision, alongside the wider release of data on funded outputs.
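The three steps above can be sketched as a simple audit loop. Everything here is a hypothetical placeholder (the field names, the toy repository index, the compliance test) intended only to show how the steps chain together, and how outputs without identifiers silently fall out of the audit:

```python
# A minimal sketch of the three-step audit loop. The data sources and
# field names here are hypothetical placeholders, not any real funder API.

def identify_outputs(policy_records):
    """Step 1: collect identifiers for outputs subject to the policy."""
    return [r["doi"] for r in policy_records if r.get("doi")]

def find_accessible_copy(doi, repository_index):
    """Step 2: look up an accessible copy in a (toy) repository index."""
    return repository_index.get(doi)

def is_compliant(copy):
    """Step 3: test the copy against (toy) policy requirements."""
    return bool(copy is not None and copy.get("open")
                and copy.get("license") == "CC BY")

records = [{"doi": "10.1234/a"}, {"doi": "10.1234/b"}, {"title": "no identifier"}]
index = {"10.1234/a": {"open": True, "license": "CC BY"}}

results = {doi: is_compliant(find_accessible_copy(doi, index))
           for doi in identify_outputs(records)}
# The third record is invisible to the audit: no identifier, no check.
```

Note that the record with no identifier never even reaches steps two and three, which is exactly the systematic under-reporting problem discussed below.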
Identifying relevant outputs
It may seem strange, but the hardest step in auditing policy implementation remains the first: identifying which outputs are subject to the policy. There is no comprehensive public database of research outputs. Crossref and Pubmed come closest to providing this information, but both have substantial weaknesses. Pubmed covers only a subset of the literature, missing most of the social sciences and virtually all of the humanities. Crossref does a better job of covering a wider range of disciplines.
Affiliation and funder are the two key signifiers of policy requirements. Pubmed only provides affiliation for corresponding authors, and Crossref metadata currently has very few entries with author affiliations. Crossref's Fundref project is gradually adding funder information but is currently limited in coverage, and Pubmed only has funder information for Pubmed partners. Private data sources such as Web of Knowledge and Scopus can provide some of this data but are also incomplete and cannot be publicly audited.
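To make the Fundref gap concrete: Crossref's REST API can filter works by funder identifier, but the result is only as good as the funder metadata publishers have deposited. The sketch below builds such a query and filters a trimmed, illustrative sample response (not live data); the funder ID is an example of the Fundref DOI form, not a claim about any particular funder:

```python
import json
from urllib.parse import urlencode

# Example Fundref-style funder ID (10.13039/... is the Fundref DOI prefix;
# this particular suffix is illustrative, not a real funder lookup).
FUNDER_ID = "10.13039/501100000690"

# Crossref REST API query for works claiming this funder.
query = "https://api.crossref.org/works?" + urlencode(
    {"filter": f"funder:{FUNDER_ID}", "rows": 100})

# A trimmed, illustrative sample of a Crossref works response.
sample_response = json.loads("""
{"message": {"items": [
  {"DOI": "10.1234/example.1",
   "funder": [{"DOI": "10.13039/501100000690", "award": ["EP/X01234/1"]}]},
  {"DOI": "10.1234/example.2", "funder": []}
]}}
""")

# Keep only works whose deposited funder metadata names our funder;
# works with empty funder arrays (the common case today) drop out.
funded = [item["DOI"] for item in sample_response["message"]["items"]
          if any(f.get("DOI") == FUNDER_ID for f in item.get("funder", []))]
```

The second item, with no funder metadata deposited, is invisible to this kind of query even if it was in fact funded, which is the coverage limitation described above.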
Funders and institutions rarely provide any public-facing list of their outputs. RCUK is probably the leader in this space, with Gateway to Research providing an API that allows querying by institution, funder, grant or person. GtR is a good system but is reliant on author reporting, so it can take some years for outputs to be registered. In principle the SHARE notification system could go some way to addressing this updating issue, but managing the process of keeping records updated at scale will require standards development. Pubmed and Europe PubMed Central provide the most up-to-date public information linking outputs to funding currently available but, as noted above, have disciplinary gaps and weaknesses in terms of affiliation information.
Identifiers for research outputs are crucial here. Pretty much any large-scale tool for identifying and auditing the implementation of any scholarly communications policy will need to pull data from multiple sources. To do this at scale requires that we can cross-reference outputs and compare data across these sources. Unique identifiers such as DOIs, ISBNs and Handles make a huge difference to accuracy. Without them, many outputs will simply be missed or the data will be too messy to handle. Disciplines that have not adopted identifiers will therefore be systematically under-represented and under-reported.
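Even where identifiers exist, cross-referencing across sources only works after normalisation: the same DOI arrives as a bare string, a URL, or a `doi:`-prefixed value, in mixed case. A small sketch of the cleanup step any aggregating tool needs:

```python
import re

def normalise_doi(raw):
    """Reduce the common DOI forms to a lowercase bare DOI, or None.

    Handles bare DOIs, doi.org / dx.doi.org URLs, and "doi:" prefixes.
    Anything that does not resolve to a "10." prefixed string (e.g. a
    Handle or free-text identifier) is rejected rather than guessed at.
    """
    if not raw:
        return None
    doi = raw.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi if doi.startswith("10.") else None

# Three forms of the same identifier all reduce to one key:
forms = ["10.1371/journal.pone.0001",
         "DOI:10.1371/journal.pone.0001",
         "http://dx.doi.org/10.1371/JOURNAL.PONE.0001"]
keys = {normalise_doi(f) for f in forms}
```

With a single canonical key, records from Crossref, Pubmed and repository feeds can be matched; without it, the same output counts as three different ones.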
Identifying accessible copies
Assuming we can create a list of relevant outputs, it might seem simple to test whether accessible copies can be found. A quick Google Scholar search should suffice. And this will work for one, ten, or perhaps a hundred outputs. But if we are to track implementation across a funder, a large institution or a country, we will be dealing with tens or hundreds of thousands, or millions, of outputs. Manual checks will be extremely labour-intensive (as the poor souls preparing returns from UK universities for the RCUK review can currently attest).
Check the publisher copy
As noted above, a substantial proportion of the scholarly literature does not have a unique ID, which means finding even the 'official' copy can be challenging. Where an ID is available it should be straightforward to reach the publisher copy, but determining whether this is 'accessible' is not trivial. While many publishers will mark accessible outputs in some way, this is done inconsistently across publishers. Currently outputs must be checked manually, an approach that will not scale. Consistent metadata is required to make it possible to check accessibility status by machine. This is gradually improving for journal articles, but books, with their wider range of mixed models for Open Access volumes, will remain a challenge for some time.
Find a repository copy
If the publisher copy can't be found or isn't accessible then it is important to find a copy in a repository…somewhere. Google may well be indexing the repository, but it does not provide an API, so each article needs to be checked by hand. Aggregators like CORE, BASE and OpenAIRE may be pulling from the repository in question, providing a search mechanism that scales, but many repositories do not provide information in the right form for aggregation.
While repository systems provide a standard metadata format, OAI-PMH, it is applied very differently by different repositories. Many repositories do not record publisher identifiers such as DOIs, and titles frequently differ from the publisher version, making it difficult to search the records efficiently. More importantly, OAI-PMH is a harvesting protocol: it is not designed for querying a repository to identify whether it holds resources relating to specific outputs.
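The identifier problem is visible in the Dublin Core metadata that OAI-PMH typically carries. There is no dedicated DOI field: if a DOI is present at all it is a free-text `dc:identifier` value mixed in with Handles and URLs, so matching a harvested record against a publisher DOI list means scanning and guessing. A minimal illustrative record (not from any real repository):

```python
import xml.etree.ElementTree as ET

# An illustrative OAI-PMH Dublin Core record fragment. Note the DOI,
# when present, is just one free-text dc:identifier among others.
record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Article, With A Repository-Edited Title</dc:title>
  <dc:identifier>http://hdl.handle.net/1234/5678</dc:identifier>
  <dc:identifier>doi:10.1234/example.1</dc:identifier>
</record>"""

DC = "{http://purl.org/dc/elements/1.1/}"
root = ET.fromstring(record)

# All identifiers, whatever they are: Handle, DOI, URL, free text...
identifiers = [e.text for e in root.iter(DC + "identifier")]

# ...from which we can only extract DOIs by pattern-matching conventions
# that individual repositories may or may not follow.
dois = [i[4:] for i in identifiers if i.lower().startswith("doi:")]
```

A repository that records the DOI as a bare string, a doi.org URL, or not at all defeats this heuristic, which is why harvested records are so hard to reconcile with publisher data.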
Even if we do find a record in a repository, that does not necessarily mean that a full text copy of the output has been deposited, nor that it is actually available. Institutional repositories are very inconsistent in the way they index articles and in the metadata they provide. CORE harvests files, so if a file is available it is generally full text. OpenAIRE provides metadata on whether an output is available. Both have limitations in coverage and neither has appropriate infrastructure funding for the long-term future.
Determining output compliance
Once accessible copies of outputs have been identified it remains to be determined whether all the policy requirements have been met. Requirements fall into two broad categories: the time of availability (i.e. any embargo on public access to the document) and licensing requirements. Neither of these can currently be tested at scale.
Embargos and availability
Most access policies require that outputs made available via repositories are made public within a specified period after publication. The precise wording often varies on this point: most policies specify that the output must be available after some acceptable embargo period, but differ on when the output should be deposited (on acceptance, on publication, or before the embargo ends).
The metadata provided by most OAI-PMH feeds does not provide sufficient information to determine whether a full text copy is available. Where any information is provided on copies that are deposited but not accessible this is not provided in a consistent form. OpenAIRE specifies requirements for repository metadata that define whether a full text copy is currently embargoed but only a subset of repositories are currently OpenAIRE compliant.
Overall it is currently not possible to comprehensively survey repositories to determine whether a full text copy has been deposited, whether it is available to read, and if not, when it will be. Confusion created by policy wording on when any acceptable embargo period commences does not help. If it starts on the date of publication, is that the date of release online or the formal date of publication (which can be months or years later)? If it is the date of acceptance, where is this recorded? And what does acceptance even mean for a monograph?
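The consequence of that ambiguity is easy to demonstrate: the same deposit can pass or fail a six-month embargo depending on which start date the policy means. The dates below are invented for illustration, and the month arithmetic is a deliberately simple sketch:

```python
from datetime import date

def embargo_end(start, months):
    """Date `months` months after `start` (day clamped to the 28th
    to sidestep end-of-month edge cases in this sketch)."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, min(start.day, 28))

# Invented example dates for one article:
online_first = date(2013, 11, 15)   # appeared online
formal_issue = date(2014, 6, 1)     # formal publication date, months later
made_public = date(2014, 8, 1)      # repository copy opened to readers

# The same deposit passes or fails a 6-month embargo depending on
# which start date the policy wording actually means:
passes_from_issue = made_public <= embargo_end(formal_issue, 6)
passes_from_online = made_public <= embargo_end(online_first, 6)
```

Here the deposit is compliant if the clock starts at formal publication, and non-compliant if it starts at online release. A policy expressed in measurable terms has to pick one and say so.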
The work on RIOXX metadata standards will address many aspects of this and illustrates the need for consistent metadata profiles to enable automated auditing. The challenges also illustrate the need for standardising policy language and for expressing policy requirements in measurable terms.
Publisher site licensing
If an output is made available via a journal then there are often requirements associated with this. For RCUK, Wellcome Trust and FWF, where an APC has been paid the article must be published under a CC BY license. The experience of implementation has been patchy, with many traditional publishers doing a fairly poor job of providing the correct licenses. This means that each and every output needs to be checked.
As with repository auditing this is a challenge. Different publishers, and different outputs from the same publisher, have differing and inconsistent ways of expressing license statements. Some journal publishers even manage to express licenses inconsistently on the same article.
To address this, PLOS funded a tool, built by Cottage Labs, which aims to check the license for individual journal articles. It does this by following the DOI to the article page, reading the HTML and checking for known license statements for that website. The tool provides an API that allows a user to query a thousand IDs at a time for available license information. This approach is limited: it only works when we have an identifier (DOI or PMID), it is focussed on journal articles, it breaks when publishers change their website design, and it can only recognize known statements.
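That style of page-scraping license detection can be shown in miniature. This is not the Cottage Labs implementation, just a sketch of the general technique: match the page HTML against a list of known license statements, here in the form of Creative Commons license URLs, and give up on anything unrecognised:

```python
import re

# Known license statement patterns (a tiny illustrative subset).
# Dict order matters: a plain "by" URL must not swallow "by-nc" etc.,
# which the literal "by/" in the pattern prevents.
KNOWN_LICENSES = {
    r"creativecommons\.org/licenses/by/(\d\.\d)": "CC BY",
    r"creativecommons\.org/licenses/by-nc/(\d\.\d)": "CC BY-NC",
}

def detect_license(html):
    """Return a license name found in the page HTML, or None."""
    for pattern, name in KNOWN_LICENSES.items():
        m = re.search(pattern, html)
        if m:
            return f"{name} {m.group(1)}"
    # Unknown statement, redesigned page, or license buried in the PDF.
    return None

# An illustrative article-page fragment:
page = '<a href="http://creativecommons.org/licenses/by/4.0/">CC BY</a>'
found = detect_license(page)
```

The brittleness is built in: a publisher that renames the link text, moves the statement into the PDF, or invents an 'enhanced' license variant simply falls out of the known-statements list, which is exactly the failure mode described next.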
But worst of all it depends on publishers actually making the license clear. While some traditional publishers (NPG and Oxford University Press deserve credit here) do a good job of this many do not. Taylor and Francis place a license statement only in the PDF (and in a context which makes it hard to detect which license applies). Springer sometimes do and sometimes don’t make the license statement available on the abstract page, sometimes only on the article page. Elsevier’s API (which we have to use because they make article pages difficult to parse) is not always consistent with the human readable license on the article. And the American Chemical Society create a link on the article with the text “CC BY” which links to a page which isn’t really the Creative Commons Attribution license but an ‘enhanced’ version with more limitations.
The NISO Accessibility and Licensing Information Working Group (full disclosure: I am a co-chair of this group) has proposed a metadata framework which could address these issues by providing a standardised way of expressing licenses – while not restricting the ability of publishers to choose which license to apply. Crossref is already offering a means for publishers to bulk upload license references for existing DOIs. This needs to be expanded across all publishers if we are to effectively monitor implementation of policies.
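Once license metadata is deposited with Crossref, a compliance check becomes a pure metadata test rather than page scraping. Crossref works metadata can carry a `license` array with a URL, a start date and a content version; the record below is an illustrative sample in that shape, not a real deposit:

```python
import json

# An illustrative Crossref-style works record carrying license metadata.
work = json.loads("""{
  "DOI": "10.1234/example.1",
  "license": [
    {"URL": "http://creativecommons.org/licenses/by/4.0/",
     "start": {"date-parts": [[2014, 1, 15]]},
     "content-version": "vor"}
  ]
}""")

def cc_by_start(work):
    """Return the start date (y, m, d) of a CC BY license on the
    version of record, or None if no such license is asserted."""
    for lic in work.get("license", []):
        if "/licenses/by/" in lic["URL"] and lic.get("content-version") == "vor":
            return tuple(lic["start"]["date-parts"][0])
    return None

start = cc_by_start(work)
```

Checking thousands of outputs then reduces to a bulk metadata query instead of thousands of fragile page fetches, which is why expanding this deposit practice across publishers matters so much for monitoring.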
As we move from the politics of policy development to the (social) engineering problem of policy implementation our needs are changing. It is no longer enough to simply state aspirations; we need to be able to test performance. At the moment this is being done via manual and ad hoc processes, which is inefficient and does not scale. At the same time, with the right information environment it should be possible not just to monitor our implementation of Open Access but to monitor it continuously, in real time.
The majority of public access policies to date have been designed as human readable documents. Little thought has gone into how the policy goals translate into auditable requirements. As a result the burden of monitoring implementation is going up. In many cases there are no mechanisms to monitor implementation at all.
As we move from aspirational policies to the details of implementation we need efficient means of generating data that help us to understand what works, and what does not. To do this we need to move towards requirements that are auditable at scale, that work from sustainable, public datasets using consistent metadata formats.
Policies are necessarily political documents. The devil is in the details of implementation. For clarity and consistency it would be valuable to develop formal requirements documents, alongside policy expressions that provide explicit detail on how implementation will be monitored. None of the infrastructure required is terribly difficult to build and much of it is already in place. What is required is coordination and a commitment to standardising the flow of information between all the stakeholders involved.
Identification of Relevant Outputs: Policy design should include mechanisms for identifying and publicly listing outputs that are subject to the policy. The use of community-standard, persistent and unique identifiers should be strongly recommended. Further work is needed on creating community mechanisms that identify author affiliations and funding sources across the scholarly literature.
Discovery of Accessible Versions: Policy design should express compliance requirements for repositories and journals in terms of metadata standards that enable aggregation and consistent harvesting. The infrastructure to enable this harvesting should be seen as a core part of the public investment in scholarly communications.
Auditing Policy Implementation: Policy requirements should be expressed in terms of metadata requirements that allow for automated implementation monitoring. RIOXX and ALI proposals represent a step towards enabling automated auditing but further work, testing and refinement will be required to make this work at scale.