In our call to the STM Association to withdraw their model licenses we drew attention to the fact that Creative Commons licenses are a de facto global standard. But it is sometimes claimed (as the STM Association did in their response) that CC licenses are somehow “not designed” for scholarly communications, or “not proven” in our space.
We thought it might be useful to get some data on just how many CC licensed peer reviewed articles are out there. This turns out to be a non-trivial exercise but I think it’s feasible to come up with a reasonable lower bound. The too-long didn’t-read version: there are at least 1.2M CC licensed scholarly articles in the wild, with over 720,000 of them being licensed CC BY.
Our first port of call is the Directory of Open Access Journals. Alongside its listing of journals, the DOAJ also lets publishers provide article metadata, including the default license for the journal. On the search page there is an option to limit the search to articles, and if you then click on the licenses selector tab you can get the number of articles registered under different CC licenses. When I looked this gave 547,138 CC BY licensed articles, 311,956 CC BY-NC articles and so on, for a total of just over one million CC licensed articles.
However this isn’t a complete representation of the picture. A number of large publishers (including [cough] PLOS) don’t deposit article level metadata with DOAJ. So 1M is undercounting. For some publishers of pure OA journals we have data from OASPA up to 2013 on CC BY licensed articles. The big contributors missing from the DOAJ dataset are Springer Open, PLOS, OUP and MDPI. These publishers contribute a further 144,203 articles up to the end of 2013, bringing our total to over 1.1M. I can add the 22k articles published by PLOS and 3,463 published by SpringerOpen in 2014 to this total (but not those from Biomed Central which are included in the DOAJ numbers).
There are some further gaps. NPG’s Scientific Reports uses CC licenses (5,793 articles according to Pubmed) and Nature Communications uses CC licenses for its free-to-read papers (I obtained a total of 1,026 free-to-read articles from this data set). Nature Communications illustrates a big gap in our knowledge. We know that there is substantial uptake of CC licenses by big publishers including Wiley, Taylor and Francis, Sage, OUP, and Elsevier for their hybrid offerings, but we have limited information on the scale of that at the moment. The sources and quality of information are likely to improve substantially by the end of the year, but for now the best I can do is guess that these might amount to a few tens of thousands of articles, not yet hundreds of thousands. I’m therefore leaving them out of my current estimates.
Some caveats – clearly I’m missing a range of articles here, particularly from smaller publishers that have journals not registered with DOAJ. But if I’m going to claim this is a reasonable lower bound I also need to ensure I’m not double counting. Searching DOAJ for each of the publishers whose articles I’ve added on top of the DOAJ numbers gives zero results, except for PLOS (650) and Springer (99). I’m also missing a substantial number of papers from Springer. They recently announced reaching 200,000 OA papers with CC licenses from various imprints including Biomed Central and Springer Open. The totals in my numbers are 136,895 papers from Biomed Central (via DOAJ) and 18,375 for Springer Open (based on the OASPA data and a search for 2014). Therefore there are another ~45k papers I’m missing. Similarly for OUP I’m missing maybe another 10,000 papers in journals like Nucleic Acids Research that are now mostly CC licensed.
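The Springer gap can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above; the variable names are mine, and it assumes the 200,000 announced by Springer covers exactly the same imprints as the two counts it is compared against.

```python
# Figures quoted in the paragraph above; variable names are illustrative.
springer_announced = 200_000   # Springer's announced CC licensed OA papers
biomed_central_doaj = 136_895  # Biomed Central papers counted via DOAJ
springer_open = 18_375         # Springer Open: OASPA data plus the 2014 search

# Papers already included in the running total.
springer_counted = biomed_central_doaj + springer_open

# Papers Springer reports that the running total does not capture.
springer_gap = springer_announced - springer_counted

print(f"counted: {springer_counted:,}")  # counted: 155,270
print(f"missing: {springer_gap:,}")      # missing: 44,730
```

The 44,730 here is the "~45k" mentioned in the text.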
One criticism of these figures might be the fact that the DOAJ does contain some journals that are currently being removed as they do not meet the stricter quality conditions being imposed. Am I therefore including dodgy journals in my figures? The counter argument is that those publishers that think about licensing and provision of article metadata tend to be the most reliable. The fact that the data is there at all is a good indicator of a serious publisher. Overall I think the balance of clear undercounting above vs the risk of these potential issues contributing significant numbers of articles is approximately a wash.
Overall the total figures come out to slightly over 1.2M articles with CC licenses. Of these at least 724,000 use the CC BY license. You can of course take the links I’ve given and check my maths. The data can also be split out by year, and although that gets more messy with missing data it looks like around 200k CC licensed articles were released in 2012 and 2013 making them a substantial proportion of the whole literature.
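For anyone who does want to check the maths, here is a rough sketch of the tally for everything added on top of DOAJ. It makes a few assumptions of its own: the PLOS 2014 figure is the rounded "22k", the 650 PLOS and 99 Springer search hits are treated as overlaps to subtract, and the exact DOAJ total (given only as "just over one million") is left as a parameter rather than guessed. The names `additions`, `overlap_with_doaj` and `lower_bound` are illustrative.

```python
# Articles added on top of the DOAJ numbers, as quoted in the post.
additions = {
    "Springer Open, PLOS, OUP, MDPI to end of 2013 (OASPA)": 144_203,
    "PLOS 2014 (rounded, quoted as 22k)": 22_000,
    "SpringerOpen 2014": 3_463,
    "Scientific Reports (Pubmed)": 5_793,
    "Nature Communications (free to read)": 1_026,
}

# Assumption: these DOAJ search hits are already in the DOAJ total,
# so they are subtracted to avoid double counting.
overlap_with_doaj = {"PLOS": 650, "Springer": 99}

net_additions = sum(additions.values()) - sum(overlap_with_doaj.values())


def lower_bound(doaj_total: int) -> int:
    """Lower bound on CC licensed articles for a given DOAJ article count."""
    return doaj_total + net_additions


print(f"net additions beyond DOAJ: {net_additions:,}")  # net additions beyond DOAJ: 175,736
```

Calling `lower_bound` with the DOAJ article count you get from the license selector gives the overall figure; it deliberately excludes the hybrid articles and the missing Springer and OUP papers discussed earlier.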
As we (and the 80+ organizations that have now signed the letter) said before, and we’ll say again: Creative Commons is the established, adopted and proven standard both for wider content on the web and for scholarly communications.