Oxford e-Research Centre and Department of Zoology, University of Oxford, UK. Fifth Conference on Open Access Scholarly Publishing, Riga, Latvia, 20 September 2013.


The Open Citations Corpus – freeing scholarly citation data

David Shotton

© David Shotton, 2013. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence.

Scholarly communication today

Scholarly articles haven't really changed much in 346 years

[Figure: article pages from three dates; the last is dated March 2012]

Scholarly communication – an analogy

Scholarly communication, at this mid-point in the digital revolution, is in an ill-defined transitional state (a ‘horseless carriage’ state) that lies somewhere between the world of print and paper and the world of the web and computers, with the former still exercising significantly more influence than the latter

We started here:

We’re now here (online):

Great – that’s a significant start

Scholarly communication – an analogy

... but this is really where we need to be!

The importance of citations

What is a citation?

- The performative act of citing a published work that is relevant to the current work, typically made by including a reference in a reference list

Why are citations important?

- The act of bibliographic citation is central to scholarly communication – bibliographic references are the links that knit together independent scholarship
- Citations unify the whole world of scholarship into a giant citation network
- Citation networks reveal the development of academic disciplines
- Sir Isaac Newton: “If I have seen a little further, it is by standing on the shoulders of Giants”

How is the present situation imperfect?

- The present scholarly citation system inadequately exposes the knowledge networks that exist within the scholarly literature, linking papers, authors, funders, research projects and datasets
- Citation data are hidden behind subscription firewalls of commercial companies
- Academics are not free to use their own citation data as they please
- In this Open Access age, it is a scandal that reference lists from journal articles, the core elements of the academic data cycle, are not freely available for use by the scholars who created them
- Citation data now need to be recognized as a part of the Commons – those works that are freely and legally available for sharing

Nomenclature and metadata

Current citation practice

- Well-formed references in reference lists ... relate to clearly defined entities
- But extreme ambiguity in terminology!

[Figure label: “a reference”]

Recommended nomenclature for references and citations

[Diagram labels: Citing article, c4o:InTextReferencePointer, c4o:denotes, biro:BibliographicReference, biro:references, Cited article, cito:cites]

- This is the nomenclature used in our SPAR (Semantic Publishing and Referencing) Ontologies

Generic structured metadata required to record a citation

[Diagram, grouped by its legend]

- entities: Citing paper, Cited paper
- type: e.g. Journal article
- bibliographic metadata: Title, Publication date, Unique identifier
- relationship: Bibliographic citation (cito:cites)
- provenance: Source of citation info, e.g. CrossRef
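The metadata groups on this slide can be sketched as a small data structure. This is a minimal illustration, not the OCC's actual schema: the class names, field names and example identifiers are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """A citing or cited work (field names are illustrative assumptions)."""
    identifier: str                       # unique identifier, e.g. a DOI
    entity_type: str                      # e.g. "journal article"
    title: str
    publication_date: Optional[str] = None


@dataclass(frozen=True)
class Citation:
    """One citing -> cited relationship together with its provenance."""
    citing: Entity
    cited: Entity
    relationship: str = "cito:cites"
    source: str = "CrossRef"              # source of the citation information


def as_triple(citation: Citation) -> str:
    """Render the relationship as a single Turtle-style statement."""
    return (f"<{citation.citing.identifier}> {citation.relationship} "
            f"<{citation.cited.identifier}> .")


# Hypothetical identifiers and titles, used only for illustration.
citing = Entity("doi:10.1234/citing", "journal article", "A citing paper", "2013")
cited = Entity("doi:10.1234/cited", "journal article", "A cited paper", "2008")
example = Citation(citing, cited)
print(as_triple(example))
```

Keeping provenance on the citation itself, rather than on either entity, mirrors the slide's separation of relationship and provenance from bibliographic metadata.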

The Open Citations Corpus

The original Open Citations Corpus

- An open repository of bibliographic citation data created in 2011
  - available at
- Created with JISC funding of the Open Citations Project
  - project blog:
- Originally populated with ~6.4 million individual references from the reference lists of ~200,000 articles in the Open Access Subset of PubMed Central (as of January 2011)
- These refer to >3 million unique papers
  - ~20% of all PubMed papers published between 1950 and 2010, including all the highly cited papers in every biomedical field
  - Multiple citations of the same well-cited papers permitted us to perform error correction of the harvested citations (approx. 1% erroneous)
- These citations are encoded as Linked Open Data using the SPAR ontologies, and are freely available under a CC0 waiver from

Viewing citation networks at

The outward citation network of Reis et al. (2008)
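As a rough illustration of what an outward citation network is, the sketch below walks from one paper to the papers it cites, then to the papers those cite, up to a chosen depth. The graph, the paper labels and the function name are hypothetical, and this is not the code behind the visualisation shown in the talk.

```python
from collections import deque
from typing import Dict, List, Set


def outward_network(cites: Dict[str, List[str]], start: str, max_depth: int) -> Set[str]:
    """Return every paper reachable from `start` by following at most
    `max_depth` citing -> cited links (breadth-first traversal)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for cited in cites.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    seen.discard(start)  # report only the cited papers, not the root
    return seen


# Hypothetical citation graph: each key cites the papers in its list.
graph = {"A": ["B", "C"], "B": ["D"], "D": ["E", "A"]}
print(sorted(outward_network(graph, "A", 2)))
```

Tracking visited papers keeps the walk finite even when the harvested data contain citation cycles.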

Limitations of the original Open Citations Corpus

- A snapshot in time of the citation data in PubMed Central as of January 2011
  - becoming increasingly out of date
- Contains references from open access articles only
- Limited to the biomedical domain

Expanding the Open Citations Corpus

Expanding the Open Citations Corpus - Objectives

- Redesign the OCC data model
- Update the current ingest
- Increase the domain coverage
- Include reference lists from subscription-access journals
- Harvest references on an ongoing basis, as articles are published
- Improve the user interface and the user experience
- Publish the citation data both in BibJSON and in RDF as Linked Open Data
- Build added value services over the citation data

Redesigning the Open Citations Corpus data model

- Three record types: Entity Records, Personal Records and Citation Records
- A clear separation is made between
  - potentially erroneous citation information “as received” in text strings from article reference lists: ReferenceTextRecords containing NameTextRecords (of authors, editors), and
  - authoritative bibliographic metadata derived from trustworthy sources such as CrossRef, PubMed and the web pages of published articles: BibliographicRecords and PersonalRecords (of authors, editors)
- A distinction is also made between
  - an UnmatchedCitationRecord, where no BibliographicRecord exists within the OCC for the cited entity, and
  - a MatchedCitationRecord, where the cited entity has a BibliographicRecord within the OCC
- A unique internal identifier is created for each OCC record
- Provenance information details the source of each citation, the date it was acquired, its format, and the name of the curator responsible for its ingestion
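One way to picture the record types named above is as plain data classes. The record names come from the slide; every field name, the identifier scheme and the example values are assumptions made for this sketch, not the OCC implementation.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_counter = count(1)  # hypothetical source of unique internal identifiers


def new_id() -> str:
    return f"occ:{next(_counter)}"


@dataclass
class Provenance:
    source: str        # where the citation came from
    acquired: str      # date it was acquired
    data_format: str   # format in which it was received
    curator: str       # person responsible for its ingestion


@dataclass
class NameTextRecord:          # a name "as received" in a reference string
    text: str
    record_id: str = field(default_factory=new_id)


@dataclass
class ReferenceTextRecord:     # a reference "as received", possibly erroneous
    text: str
    names: List[NameTextRecord] = field(default_factory=list)
    record_id: str = field(default_factory=new_id)


@dataclass
class BibliographicRecord:     # authoritative metadata from a trusted source
    title: str
    doi: Optional[str] = None
    record_id: str = field(default_factory=new_id)


@dataclass
class UnmatchedCitationRecord:  # cited entity has no BibliographicRecord yet
    citing: BibliographicRecord
    cited_text: ReferenceTextRecord
    provenance: Provenance
    deprecated: bool = False
    record_id: str = field(default_factory=new_id)


@dataclass
class MatchedCitationRecord:    # both ends have BibliographicRecords
    citing: BibliographicRecord
    cited: BibliographicRecord
    provenance: Provenance
    record_id: str = field(default_factory=new_id)


# Hypothetical example values.
prov = Provenance("PubMed Central", "2013-09-01", "XML", "A. Curator")
paper = BibliographicRecord("A citing paper", doi="10.1234/citing")
ref = ReferenceTextRecord("Author A. A cited paper. 2008.", [NameTextRecord("Author A")])
pending = UnmatchedCitationRecord(paper, ref, prov)
```

Separating "as received" text records from authoritative records means a harvesting error never overwrites trusted metadata.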

Reconfiguring the Open Citations Corpus

- Underlying technical implementation being revised
  - Bibliographic information encoded in BibJSON
  - Data stored in BibServer, which handles BibJSON natively
- Data from different sources brought into a common BibJSON format as soon as possible
- Processing the whole ingest from either source takes over 24 hours
- Work still to be done on the ingest pipeline, since the parsing of citation information from the reference list entries is not yet 100% accurate
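To give a feel for bringing source data into a common JSON format, here is a small sketch that builds a BibJSON-style record. The key names are modelled loosely on BibJSON conventions and should be read as assumptions, as should the function name and the example values; this is not the OCC ingest code.

```python
import json
from typing import Dict, List


def to_bibjson_like(title: str, authors: List[str], year: str, doi: str) -> Dict:
    """Build a BibJSON-style dictionary (key names are assumptions)."""
    return {
        "type": "article",
        "title": title,
        "author": [{"name": name} for name in authors],
        "year": year,
        "identifier": [{"type": "doi", "id": doi}],
    }


# Hypothetical values, used only for illustration.
record = to_bibjson_like(
    title="An example article",
    authors=["A. Author", "B. Writer"],
    year="2011",
    doi="10.1234/example",
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Converting every source to one dictionary shape early on means the later matching and error-correction steps only need to understand a single format.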

Matching citation strings to bibliographic records

- When a new reference has been extracted from a reference list
  - a ReferenceTextRecord is created for the citation target, and
  - an UnmatchedCitationRecord is created between the BibliographicRecord of the citing paper and the citation target’s ReferenceTextRecord
- The ReferenceTextRecord is then compared with existing BibliographicRecords
- If a match is found, a new MatchedCitationRecord is created within the OCC between the BibliographicRecords of the citing and cited entities, and the pre-existing UnmatchedCitationRecord between the citing and cited entities is deprecated
- Similarly, a new NameTextRecord is created for each author and editor named in the new ReferenceTextRecord, and the OCC is then searched for matches to existing PersonalRecords
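The workflow above can be sketched in a few lines. The matching rule used here (a normalised title appearing inside the reference string) is a deliberately naive stand-in chosen for illustration; the slides do not say how the OCC compares records, and all names and sample data below are hypothetical.

```python
import re
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lower-case and collapse punctuation so strings compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_match(reference_text: str, bib_records: List[Dict]) -> Optional[Dict]:
    """Naive rule: a record matches if its title occurs in the reference."""
    ref = normalize(reference_text)
    for rec in bib_records:
        title = normalize(rec["title"])
        if title and title in ref:
            return rec
    return None


def ingest_reference(citing_id: str, reference_text: str,
                     bib_records: List[Dict], citations: List[Dict]) -> None:
    # Step 1: always record the citation as unmatched first.
    unmatched = {"kind": "unmatched", "citing": citing_id,
                 "cited_text": reference_text, "deprecated": False}
    citations.append(unmatched)
    # Step 2: compare against existing bibliographic records.
    match = find_match(reference_text, bib_records)
    if match is not None:
        # Step 3: add the matched record and deprecate the unmatched one.
        citations.append({"kind": "matched", "citing": citing_id,
                          "cited": match["id"]})
        unmatched["deprecated"] = True


bib = [{"id": "occ:1", "title": "A Study of Open Citations"}]
store: List[Dict] = []
ingest_reference("occ:9", "Smith J. A study of open citations. J Example. 2010.", bib, store)
ingest_reference("occ:9", "Jones K. An unrelated paper. 2009.", bib, store)
```

The unmatched record is deprecated rather than deleted, so the original "as received" evidence remains available for later error correction.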

Citation error correction

- Examples of errors in reference list entries vary
  - from the trivial: a non-English name with incorrect accents, or an article title containing “beta” instead of the correct “β”
  - to the serious: two papers in the same reference list with the same DOI
- Such errors can be detected by comparing a new ReferenceTextRecord with pre-existing BibliographicRecords, and a new NameTextRecord with pre-existing PersonalRecords
- Where there are several OCC ReferenceTextRecords referencing the same multiply-cited paper for which an authoritative OCC BibliographicRecord does not yet exist, we use voting algorithms for reference disambiguation and error correction, enabling the creation of a reliable BibliographicRecord for that entity even when we can find no external authority to provide it
- In future, we wish to offer an automated OCC reference correction service to third parties such as authors and journal editors, enabling them to spot and correct errors in the reference lists of submitted papers before publication
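The slides do not describe the voting algorithms themselves. As one plausible reading, the sketch below takes a simple per-field majority vote across several parsed versions of the same reference. The function name, field names and sample values are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def vote(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Build a consensus record by taking the most common non-empty
    value of each field across all versions of a reference."""
    consensus: Dict[str, str] = {}
    field_names = {key for record in records for key in record}
    for name in sorted(field_names):
        values = [record[name] for record in records if record.get(name)]
        if values:
            consensus[name] = Counter(values).most_common(1)[0][0]
    return consensus


# Three hypothetical parses of the same multiply-cited paper.
versions = [
    {"title": "TGF-β signalling", "year": "2008"},
    {"title": "TGF-beta signalling", "year": "2008"},
    {"title": "TGF-β signalling", "year": "2009"},
]
print(vote(versions))
```

Voting field by field lets a reference that is wrong in one place still contribute correct values elsewhere.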

New relationship types in the Open Citations Corpus

Entity type relationships

- The nature of the source entity and the target entity (e.g. journal article, book, dataset) are separately recorded in the OCC. We can thus infer the nature of each entity type relationship, for example:
  - Article-to-article bibliographic citation
  - Article-to-database data citation
  - Data_repository-to-article bibliographic citation

Relationships other than bibliographic citations

- Additional relationship types between entities in the OCC may be encoded using CiTO, the Citation Typing Ontology, if that information is available:
  - Citation: :EntityA cito:cites :EntityB .
  - Shared authorship: :EntityA cito:sharesAuthorsWith :EntityB .
  - Common funding: :EntityA cito:sharesFundingAgencyWith :EntityB .
  - Common institution: :EntityA cito:sharesAuthorInstitutionWith :EntityB .
  - Related: :EntityA dcterms:relation :EntityB .
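Both ideas on this slide can be sketched briefly: inferring a relationship label from the recorded entity types, and emitting a CiTO statement when two entities share an author. The rule for deciding what counts as a data citation, the helper names and the sample data are assumptions; only the property names are taken from the slide.

```python
from itertools import combinations
from typing import Dict, List

DATA_ENTITY_TYPES = {"database", "dataset"}  # assumed set of data-like targets


def relationship_kind(source_type: str, target_type: str) -> str:
    """Label a citation by the types of its source and target entities."""
    kind = ("data citation" if target_type in DATA_ENTITY_TYPES
            else "bibliographic citation")
    return f"{source_type}-to-{target_type} {kind}"


def triple(subject: str, predicate: str, obj: str) -> str:
    return f":{subject} {predicate} :{obj} ."


def shared_author_triples(authors_by_entity: Dict[str, List[str]]) -> List[str]:
    """Emit cito:sharesAuthorsWith for each pair with a common author."""
    triples = []
    for (a, a_authors), (b, b_authors) in combinations(
            sorted(authors_by_entity.items()), 2):
        if set(a_authors) & set(b_authors):
            triples.append(triple(a, "cito:sharesAuthorsWith", b))
    return triples


# Hypothetical entities and author names.
authors = {
    "EntityA": ["Author One"],
    "EntityB": ["Author One", "Author Two"],
    "EntityC": ["Author Three"],
}
print(relationship_kind("article", "database"))
print(shared_author_triples(authors))
```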

Expansion of the Open Citations Corpus coverage

- Ingest from the Open Access Subset of PubMed Central is being updated from ~200,000 articles in Jan 2011 to the current ~658,000 articles in September 2013
- Domain coverage is being expanded to include the physical sciences and mathematics, by the ingest of the reference lists from all ~872,000 preprints in the arXiv preprint repository at Cornell University Library
  - This will bring the total number of references from ~6.4 million to ~40 million
- We then intend to ingest all the references in CiteSeer and from Wikipedia, marking these with clear provenance information
- To this we will add citations from data repositories such as Dryad, which contain literature references associated with the datasets they hold, and from DataCite, which issues DOIs for datasets and harvests metadata that contain literature references

Citations from heritage literature – ‘The Future of the Past’

- Funding application just submitted to harvest references from the pre-digital biodiversity / biological taxonomy literature, where papers have lasting value
- We will use the Biodiversity Heritage Library as a source of references
- David King, a text mining colleague at the Open University, will use advanced text mining techniques to dig references out of ‘dirty’ OCR’d page images
- We will then ingest these data into the Open Citations Corpus and make them freely available
- This will be the only source of digital citation data from a major fraction of the world's heritage literature in the field of biodiversity / biological taxonomy, which is simply not available in digital form anywhere else

Additional citations from PubMed Central

- There are ~2.2 million articles in PubMed Central that are not part of the Open Access Subset, presently missing from the Open Citations Corpus
- These contain citations not only to other papers, but also to datasets, typically in the form of database accession numbers, buried within the full text or footnotes
- Recent text mining initiatives undertaken by Europe PubMed Central (EPMC) have extracted both the bibliographic citations and the data citations from all ~2.8 million PubMed Central articles, which are now freely available from Europe PubMed Central
- We propose to ingest all these EPMC literature and data citations into the expanded and improved Open Citations Corpus
  - This will increase the number of PMC articles for which the OCC holds citation information by about 330%
  - In addition, it will further expand the nature of the citation data held to include the data citations contained within these PMC articles
- However, these are just a fraction of the total scholarly citations, most of which are locked behind the pay walls of commercial providers
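The "about 330%" figure can be checked against the article counts quoted in the slides. The short calculation below uses only those rounded counts, so the result is approximate; the variable and function names are illustrative.

```python
oa_subset_articles = 658_000   # Open Access Subset, September 2013
all_pmc_articles = 2_800_000   # all PubMed Central articles mined by EPMC
non_oa_articles = 2_200_000    # articles outside the Open Access Subset


def percent_increase(before: int, after: int) -> float:
    """Percentage growth when a count goes from `before` to `after`."""
    return (after - before) / before * 100


# Growth computed from the total, and from the non-OA count directly.
increase_from_total = percent_increase(oa_subset_articles, all_pmc_articles)
increase_from_non_oa = non_oa_articles / oa_subset_articles * 100
print(f"{increase_from_total:.0f}% and {increase_from_non_oa:.0f}%")
```

The two estimates come out at roughly 326% and 334%, which brackets the rounded "about 330%" quoted above; the small gap reflects the rounding of the 2.2 and 2.8 million figures.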

Reference lists from subscription-access articles

- All fully open access publishers already publish article reference lists openly
- I am working to persuade other major scholarly publishers to do the same
  - i.e. to put article reference lists outside the subscription pay-wall, in the same way as abstracts and bibliographic metadata are freely available
- Last January, I published an Open Letter to Publishers requesting this
  - Claire Redhead kindly distributed it to all OASPA members
  - The letter is available at al_publishers_re_open_citations.pdf
- A number of leading STM publishers have expressed their willingness to open the reference lists from subscription-access journal articles
  - Nature, Science, Taylor & Francis, Royal Society Publishing, Portland Press, MIT Press and Oxford University Press are among the first
  - another has expressed willingness verbally, but has yet to commit formally

Opening article reference lists via CrossRef

- How can these be ingested into the Open Citations Corpus?
- Most publishers already submit their reference lists to CrossRef as part of its CitedBy Linking Service
  - If you do not at present, you should use this free service!
- With publisher’s permission, CrossRef can enable reference lists to be ‘opened’
  - on a publisher-by-publisher basis based on DOI prefixes
  - on a journal-by-journal basis
  - on an article-by-article basis for hybrid journals
- References are then available via the CrossRef API for ingest into the OCC
- However, because the default CrossRef CitedBy Linking Service agreement is not to publish reference lists, even Open Access publishers must specifically inform CrossRef that the reference lists of their journal articles should be open
- Geoff Bilder has a new CrossRef Metadata Best Practice Document that I will circulate, explaining how to specify this choice in your article metadata

Summary - Benefits of the Open Citations Corpus

- Created by scholars for scholars using scholarly data
  - No profit motive constraining free publication of the data
  - Will bring particular benefit to those who are NOT members of First World academic institutions whose libraries subscribe to commercial citation data from Thomson-Reuters or Elsevier
- Will provide integrated access to citation data from a variety of sources, both inside and outside traditional scholarly publishing, with provenance information
- Data are semantically described using the SPAR bibliographic ontologies
  - Citations thus become part of the Web of Linked Open Data
- Data available in a variety of formats including BibJSON, BibTeX and RDF for download by third parties for their own use or to build into cool services
  - indexing, search and browse (in prototype)
  - timeline visualizations (in prototype)
  - analysis of citation networks, co-authorship networks, etc.
  - trend identification, recommendation services, etc.

Sustainability

- The development of the Open Citations Corpus has been enabled by short-term grant funding, but this does not provide a sustainable financial model
- For the future, we seek one of the following long-term arrangements:
  - Adoption by a major institutional or national library
  - Adoption by a publishing organization such as CrossRef, with indirect support from publishers
  - Direct support by the scholarly publishing community
  - Social investment, i.e. the provision of capital to generate social as well as financial returns, to support open access to scholarly information
  - Income support by charging for added-value services over the open data
- I would be grateful for your views on the value of the Open Citations Corpus and the manner in which its ongoing development might be supported

Acknowledgements and thanks

- Alex Dutton, who developed the original Open Citations Corpus
- Richard Jones, Martyn Whitwell and Mark MacGillivray of Cottage Labs, who have undertaken more recent development work
- Silvio Peroni, my colleague in developing the suite of SPAR (Semantic Publishing and Referencing) Ontologies
- The JISC, who have funded the development of the Open Citations Corpus