Plenary 3 covered Document Semantics and Data
Olivier Bodenreider shared background, details, developments, and the future of NLM’s Medical Subject Headings (MeSH). MeSH is a controlled vocabulary from NLM with 26,000 descriptors, < 100 qualifiers, and 214,000 supplementary concept records.
Protip: by default, PubMed retrieves articles indexed by a descriptor or any of its descendants. Prevent this with the mesh:noexp flag (“please don’t explode”).
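As a rough illustration of the difference, the sketch below builds two search URLs for NCBI's E-utilities `esearch` endpoint: one with the default (exploded) MeSH tag and one with `[mesh:noexp]`. The descriptor "Neoplasms" and the helper name `pubmed_search_url` are illustrative choices, not something from the talk, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public E-utilities search endpoint for NCBI databases.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(descriptor: str, explode: bool = True) -> str:
    """Build a PubMed search URL for one MeSH descriptor (hypothetical helper)."""
    # [mesh] also matches descendants of the descriptor; [mesh:noexp] does not.
    tag = "mesh" if explode else "mesh:noexp"
    term = f'"{descriptor}"[{tag}]'
    return f"{ESEARCH}?{urlencode({'db': 'pubmed', 'term': term})}"


exploded = pubmed_search_url("Neoplasms")
exact = pubmed_search_url("Neoplasms", explode=False)

# Decode the query strings again to show the search terms that would be sent.
print(parse_qs(urlparse(exploded).query)["term"][0])
print(parse_qs(urlparse(exact).query)["term"][0])
```

Comparing the result counts of the two queries for the same descriptor gives a quick sense of how much of a search comes from descendant terms.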
Synonymy and clustering are built into MeSH. Bodenreider explained that NLM is attempting to use some automatic indexing to improve MeSH, which has historically been maintained manually. Approaches include a hybrid method that extracts concepts from the title and abstract and then maps them from UMLS to MeSH, as well as extracting MeSH descriptors from related citations. Read about the Medical Text Indexer, which includes a diagram of NLM’s process. 3,600 new citations are processed every weeknight. To visualize this vocabulary, go here.
Javier Lacasta discussed his work converting thesauri into ontologies, including work that supports search prototypes built on thesauri from the European Urban Knowledge Network and Urbamet. Mapping a thesaurus directly onto the WordNet vocabulary was not possible on its own; it required additional heuristics. For example, “Educational Building” is related to “Primary School” and “High School” in the thesaurus, and these were remapped to “School” in WordNet. However, WordNet maintains ~7 meanings for “School”, including “a large group of fish.” One solution was to use context: when “Water sport” in the thesaurus matches WordNet’s “Water sport”, a child of “Sport”, then a thesaurus sibling such as “Winter sport” that has no match in WordNet can be mapped to the broader node in the same WordNet tree, namely “Sport”. This avoids the ambiguity of WordNet’s alternative definitions of “Sport”, such as a mutant biological organism. Lacasta also used term properties to define relationships in the hierarchy. The work tested reasonably well, with low error rates, and resulted in a map based on Urbamet data.
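The sibling heuristic described above can be sketched in a few lines. This is only a toy reconstruction under stated assumptions, not Lacasta's implementation: the `THESAURUS` and `LEXICON` dictionaries, the sense identifiers, and the function names are all invented for illustration, and a real system would query WordNet itself.

```python
from typing import List, Optional

# Hypothetical thesaurus fragment: broader term -> narrower terms.
THESAURUS = {"Sport": ["Water sport", "Winter sport"]}

# Hypothetical WordNet-like lexicon: lemma -> list of (sense, broader sense).
# The sense identifiers are made up for this example.
LEXICON = {
    "water sport": [("water_sport#activity", "sport#activity")],
    "sport": [
        ("sport#activity", "diversion#activity"),
        ("sport#mutant", "organism#being"),
    ],
}


def siblings(term: str) -> List[str]:
    """Return the other narrower terms that share a broader term with `term`."""
    for children in THESAURUS.values():
        if term in children:
            return [child for child in children if child != term]
    return []


def map_term(term: str) -> Optional[str]:
    """Map a thesaurus term to a lexicon sense, using siblings as context."""
    senses = LEXICON.get(term.lower(), [])
    if len(senses) == 1:
        return senses[0][0]  # unambiguous direct match
    # No usable direct match: borrow the broader sense of a sibling
    # that does map unambiguously.
    for sibling in siblings(term):
        sibling_senses = LEXICON.get(sibling.lower(), [])
        if len(sibling_senses) == 1:
            return sibling_senses[0][1]
    return None


print(map_term("Water sport"))   # direct match
print(map_term("Winter sport"))  # resolved through its sibling
```

Because “Winter sport” is resolved through its sibling rather than by looking up “sport” directly, the mutant-organism sense is never considered.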
Ágnes Sándor investigated “knowledge-level claims” in scholarly research (with special credit to Frederique Lisacek, Simon Buckingham Shum, and Anna de Liddo). Specifically, Sándor is searching for phrases, or “rhetorical formulas”, that denote contrast, open questions, emerging tendencies, and so on, such as “In contrast with previous hypotheses”, “but its function is poorly understood”, and “emerging as a promising approach”, respectively. Further, certain facts stated in an article may be more or less important, which is potentially discernible from such phrases. Sándor et al. use an incremental parser to find these elements, for instance to detect “paradigm shifts” or “claimed knowledge updates”, based on pre-defined grammars. For European Educational Research Quality Indicators (EERQI), they applied the same tool to social science work in English, French, German, and Swedish. The team also compared human and machine annotation for open educational research; one example with very high accuracy was given. As these analyses continue, we will hopefully have a chance to isolate both rhetorical tools and scientific facts. Sándor intends to launch an open web service on open.xerox.com so that others can use the tool.
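To make the idea of “rhetorical formulas” concrete, here is a deliberately naive sketch that tags sentences with regular expressions. It is not the incremental parser or the grammars Sándor described; the pattern set, the labels, and the function name `tag_sentence` are assumptions built only from the three example phrases quoted above.

```python
import re
from typing import List

# Hypothetical patterns, one per rhetorical category mentioned in the talk.
PATTERNS = {
    "contrast": re.compile(r"\bin contrast (?:with|to)\b", re.IGNORECASE),
    "open_question": re.compile(
        r"\b(?:poorly|not(?: yet| well)?) understood\b", re.IGNORECASE
    ),
    "emerging_tendency": re.compile(r"\bemerging as\b", re.IGNORECASE),
}


def tag_sentence(sentence: str) -> List[str]:
    """Return the labels of every pattern that occurs in the sentence."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(sentence)]


examples = [
    "In contrast with previous hypotheses, the protein binds directly.",
    "The gene is highly conserved, but its function is poorly understood.",
    "This method is emerging as a promising approach.",
    "Samples were stored at room temperature.",
]
for sentence in examples:
    print(tag_sentence(sentence), sentence)
```

A flat pattern list like this has no notion of syntax, which is one reason a parser with pre-defined grammars is the more robust choice.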
Similar work has been used for summarization and citation analysis, among other projects. Luckily, scientific jargon is static enough that it was not difficult to create a fairly complete list of terms and phrases; the list doesn’t seem to grow linearly as the corpus grows. During questions, Cameron Neylon asked about working in the reverse direction: rather than parsing existing texts, helping authors generate works that follow a rhetorical methodology or grammar. For now, Sándor explained, they do not prescribe, though it is an interesting application.
Anita de Waard noted that she is one of only 4 women presenters, and that she is incorrectly listed as Dr., since she does not yet have her PhD. De Waard noted that the “claimed knowledge updates” referenced in Sándor’s presentation (above) often rely on data that may not be accessible or sufficient to justify the claim. There are many data preservation initiatives, but scientists’ workflows are simply insular: data passes through many hands across the lab, changes formats (especially non-digital ones), and potentially gets lost along the way, especially in biology. Out of perhaps 1 million papers per year, the majority of data (90%?) is likely on local hard drives, perhaps 8% is in large, generic data repositories, and some 1-2% is in small, focused data repositories. The path is clear for her group at Elsevier: increase data digitization, improve data usability, improve repository interoperability, and develop sustainable models for the entire ecosystem. Elsevier launched a pilot program with CMU Urban Legend App, turning a paper-based lab into an electronic lab for better data capture. Researchers are pulled in 4 different directions when submitting data: their community asks for a domain-specific data repository, collaborators and research group members for a local data repository, funding agencies for a large data repository like Dryad, and universities for an institutional data repository. And they all want different metadata! There needs to be a better way.
Poster Sessions (See bottom).
covered Research Data
Dr. Wolfram Horstmann explored the complexities of Data Policies at every level. International Organizations, Governments, Associations, Universities (and Departments), Funders, Publishers, and Journals can all set policies that define requirements for the expected results of academic research. Horstmann covered the groups from whom we currently see policies: Funders, Publishers, and Journals. Horstmann concluded that some policies are aspirational, necessary “Zeitgeist” works, while others are practical, “Seachange”-style policies that include more methodology, infrastructure, advice, and funding components.
Donatella Castelli shared her investigations in data interoperability, especially across the heterogeneous data sources required for different tasks. One of Castelli’s insights was that data infrastructure can benefit from shared solutions. Overall, she isolates two types of issues: (a) how to make data available and (b) how to provide tools to interact with the data. One of the major problems for repositories and repository managers is providing the proper formats for various scientists’ needs, such as the different applications and services that they require. Castelli offered i-marine as an example of trying to meet these needs.
Kevin Ashley spoke on data quality and curation. His point was that quality means very different things to different people, each with different needs: provenance, timing, accuracy (which is already complex). As an example, Ashley described a government dataset on companies that contains some invalid data; you probably don’t want data with dates like Feb 31 in it. If social scientists clean it up so it’s usable for their purposes, it doesn’t necessarily meet every researcher’s needs. Researchers who want to analyze what data the government actually holds need a copy of the original set, and potentially a mapping of any changes and who made them. These needs can be very different. Data is usually focused on one kind of use case and one data provider, so how do you rethink questions of quality so that you can provide what people want when their needs differ? The problems involve resources and costs, in addition to the trade-off between provenance and accuracy of data. We need machine-readable change information, and to cater more for provenance.
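One way to picture “machine-readable change information” is a cleaning step that leaves the original records untouched and emits a change log alongside the cleaned copy. The sketch below is an assumption-laden illustration, not anything Ashley presented: the record layout, the field name `registered`, the `changed_by` label, and the function name `clean_dates` are all hypothetical.

```python
import json
from datetime import date
from typing import Dict, List, Tuple

Record = Dict[str, object]


def clean_dates(
    records: List[Record], changed_by: str
) -> Tuple[List[Record], List[Record]]:
    """Blank out impossible dates and log each change; the input is not modified."""
    cleaned: List[Record] = []
    change_log: List[Record] = []
    for row, record in enumerate(records):
        copy = dict(record)
        try:
            # fromisoformat rejects impossible dates such as 2012-02-31.
            date.fromisoformat(str(record["registered"]))
        except ValueError as error:
            copy["registered"] = None
            change_log.append(
                {
                    "row": row,
                    "field": "registered",
                    "old": record["registered"],
                    "new": None,
                    "reason": str(error),
                    "changed_by": changed_by,
                }
            )
        cleaned.append(copy)
    return cleaned, change_log


original = [
    {"company": "Example Ltd", "registered": "2012-02-28"},
    {"company": "Sample plc", "registered": "2012-02-31"},
]
cleaned, change_log = clean_dates(original, changed_by="cleaning-script")
print(json.dumps(change_log, indent=2))
```

Keeping `original`, `cleaned`, and `change_log` together serves both audiences: those who want usable data and those who want to study what the provider actually published.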
Tim Smith shared CERN’s experience and swath of tools for managing a global information infrastructure for storing, transferring, analyzing, and archiving massive amounts of data. Storing all detected data at CERN would be equivalent to storing about 1 petabyte per second. There are few places to store even the data they do save, which requires distributed data management using the Worldwide LHC Computing Grid. Smith uses an acronym for their service: “AAA – Any data, Any where, Any time… Almost.” CERN’s data needs are so big that the network used for data transfer has become a resource that needs to be scheduled. The CERN Document Server runs on the open source software Invenio, and already handles about 30 TB. Smith and folks at CERN write lots of open source software, including collaborating on Zenodo.
Tutorial sessions covered topics in Metrics, Metadata, OJS, NISO/OAI ResourceSync, and Open Access Q&A.
The Kick-off Plenary was a technical session.
Paul Groth opened up with a simple ask for publishers, and various content producers relevant to scholarly research, to add popular metadata formats to their pages. For instance: Twitter, Open Graph, RDF, COinS, microformats, and various others. Curious about your page’s metadata? Run it through Any23 to get a preview of some data possibly stored already. Also, follow David Shotton’s work on his blog. But more than a little metadata, Groth shared visions of an article-focused, web-developer-friendly, even journal-free future. Follow developments on data2semantics.org including one project, LinkItUp.
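For a sense of what such embedded metadata looks like to a program, the sketch below pulls Open Graph (`og:`) and Twitter card (`twitter:`) tags out of a page using only Python's standard library. It is a minimal stand-in, not how Any23 works; the sample HTML, its values, and the class name `MetaTagCollector` are made up for the example.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags whose key starts with 'og:' or 'twitter:'."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses the 'property' attribute; Twitter cards use 'name'.
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None and key.startswith(("og:", "twitter:")):
            self.found[key] = content


# Invented sample page for illustration only.
SAMPLE_PAGE = """
<html><head>
  <title>An Example Article</title>
  <meta name="viewport" content="width=device-width">
  <meta property="og:title" content="An Example Article">
  <meta property="og:type" content="article">
  <meta name="twitter:card" content="summary">
</head><body><p>Text.</p></body></html>
"""

collector = MetaTagCollector()
collector.feed(SAMPLE_PAGE)
for key, value in collector.found.items():
    print(f"{key}: {value}")
```

Tags like these cost a publisher a few lines per page template, which is what makes the ask a simple one.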
Robert Sanderson followed with updates on and descriptions of the W3C’s Open Annotation project. In particular, he highlighted the existing Open Annotation Community Group and encouraged others to join. Sanderson also shared several diagrams of the Open Annotation data model, with examples of how it can represent annotations such as comments. He also clarified that the intention behind Open Annotation is to use a simple JSON implementation to make the model practical and accessible. Learn more about his Memento project.
Henry Thompson rounded out this plenary with an analysis of naming on the web. He noted how auspicious it is to discuss the future of the web for scholarly research miles from its birthplace at CERN. Thompson pointed out that the web is tragically limited at the link layer: we still use links that often break, with few alternatives for finding the source content (e.g. archive.org’s Wayback Machine). Interestingly, rather than pointing to purely technical options for improving naming on the web (like refactoring DNS), Thompson suggested codifying a sort of Social Contract for web linking: agree to a socially-moderated non-reputable binding mechanism for scholarly references in URIs, and provide a companion fallback resolution process (e.g. w3c.org should never disappear due to non-renewal of its domain name).
Plenary 2 on Metrics covered analysis of scholarly work before and after publication
Johan Bollen stated that science is about “ideas not bricks”, with the assumption that not all ideas are equally valuable (he claims there are bad ideas that should not be communicated). Science is much like a gift economy, where sharing freely is positive and you are rewarded through social recognition of that sharing, as is clear in the case of citations. The impact factor is too simple to suffice as a reasonable metric for the value of scholarly communication. Citations, usage, and social media can all feed counts, normalized counts, social network metrics, and even trends; plus, the granularity expands beyond the journal to category, region, university, author(s), etc. Additionally, there are various algorithms for generating metric values from citations (e.g. walks, PageRank, Eigenfactor). Overall, Bollen is nervous about applying these new metrics to science and wants to separate funding from the metrics of scientific impact.
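To show how a network metric can differ from a raw citation count, here is a small PageRank-style power iteration over a toy citation graph. It is a textbook sketch rather than anything Bollen showed: the four papers, the damping factor of 0.85, the iteration count, and the function name `pagerank` are assumptions for illustration.

```python
from typing import Dict, List


def pagerank(
    citations: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Score papers by iterating the PageRank update on a citation graph."""
    papers = sorted(set(citations) | {p for cited in citations.values() for p in cited})
    n = len(papers)
    rank = {paper: 1.0 / n for paper in papers}
    for _ in range(iterations):
        # Every paper starts each round with the "random jump" share.
        new_rank = {paper: (1.0 - damping) / n for paper in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                # Split this paper's score among the papers it cites.
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper citing nothing spreads its score over all papers.
                for target in papers:
                    new_rank[target] += damping * rank[paper] / n
        rank = new_rank
    return rank


# Toy graph: paper -> papers it cites.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_citations)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 3))
```

In this toy graph C has the most direct citations, yet D, cited only by C, ends up with the highest score, which is the kind of divergence that makes the choice of metric consequential.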
Euan Adie is the founder of altmetric.com, a small London start-up funded by The Macmillan Group, owners of Nature Publishing Group. Altmetric, the company, is distinct from altmetrics, the broader field. Altmetric builds a big dataset of social media posts that mention academic works. There are some gaps in the data, including fringe cases where they cannot pull URLs from a PDF. The overall goal for Altmetric is to collect the data; they want other people to pull the data and build applications with it. Contact Euan, especially if you’re a non-profit, or browse their site to collaborate.
Jelte Wicherts reviews for a wide array of journals, mostly in his field of psychology. He pointed out that, especially for young Open Access journals, becoming stable or profitable is a primary motivating factor, potentially driving them to accept more papers, which may affect peer review. Wicherts shared his experiences of reviewing journal articles without a transparent process, not knowing whether there were other reviewers or who eventually decided whether to approve the articles. BioMed Central’s policy of publishing peer review reports along with the paper works really well. Jeffrey Beall has been investigating predatory publishers and is worth following to learn more. OASPA and COPE share guidelines for OA journals.
Post Session Gallery
The CERN Workshops on Innovations in Scholarly Communication (OAI8) by PLOS Blogs Network, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 4.0 International License.