Oxford e-Research Centre and Department of Zoology University of Oxford, UK Fifth Conference on Open Access Scholarly Riga, Latvia 20 September 2013 The.

1 Oxford e-Research Centre and Department of Zoology, University of Oxford, UK. Fifth Conference on Open Access Scholarly Riga, Latvia, 20 September 2013. The Open Citations Corpus – freeing scholarly citation data. David Shotton. © David Shotton, 2013. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence.

2 Scholarly communication today

3 Scholarly articles haven't really changed much in 346 years 4 th Aug st Jan th March 2012

4 Scholarly communication – an analogy. Scholarly communication, at this mid-point in the digital revolution, is in an ill-defined transitional state (a ‘horseless carriage’ state) that lies somewhere between the world of print and paper and the world of the web and computers, with the former still exercising significantly more influence than the latter. We started here: We’re now here (online): Great – that’s a significant start

5 Scholarly communication – an analogy... but this is really where we need to be!

6 The importance of citations


8 What is a citation? The performative act of citing a published work that is relevant to the current work, typically made by including a reference in a reference list. Why are citations important? The act of bibliographic citation is central to scholarly communication – bibliographic references are the links that knit together independent scholarship. Citations unify the whole world of scholarship into a giant citation network. Citation networks reveal the development of academic disciplines. Sir Isaac Newton: “If I have seen a little further, it is by standing on the shoulders of Giants”

9 How is the present situation imperfect? The present scholarly citation system inadequately exposes the knowledge networks that exist within the scholarly literature, linking papers, authors, funders, research projects and datasets. Citation data are hidden behind subscription firewalls of commercial companies. Academics are not free to use their own citation data as they please. In this Open Access age, it is a scandal that reference lists from journal articles, the core elements of the academic data cycle, are not freely available for use by the scholars who created them. Citation data now need to be recognized as a part of the Commons – those works that are freely and legally available for sharing.

10 Nomenclature and metadata

11 “a reference” Current citation practice → Well-formed references in reference lists ... relate to clearly defined entities → But extreme ambiguity in terminology!

12 Recommended nomenclature for references and citations. Diagram labels: Citing article, Cited article, c4o:InTextReferencePointer, biro:BibliographicReference, c4o:denotes, biro:references, cito:cites → This is the nomenclature used in our SPAR (Semantic Publishing and Referencing) Ontologies.

13 Generic structured metadata required to record a citation. Entities: Citing paper and Cited paper, each with a unique identifier, a type (e.g. Journal article) and bibliographic metadata (e.g. Title, Publication date). Relationship: Bibliographic citation (cito:cites). Provenance: Source of citation info, e.g. CrossRef.
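The metadata listed on this slide can be sketched as a small data structure. This is only an illustration, not the OCC's actual schema: the class names, field names and the example identifiers are hypothetical, and only the categories (entities, relationship, provenance) and the cito:cites property come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    """A citing or cited work (field names are illustrative assumptions)."""
    identifier: str                       # unique identifier, e.g. a DOI
    entity_type: str                      # e.g. "journal article"
    title: Optional[str] = None           # bibliographic metadata
    publication_date: Optional[str] = None


@dataclass(frozen=True)
class Citation:
    """One citation: two entities, a relationship and its provenance."""
    citing: Entity
    cited: Entity
    relationship: str = "cito:cites"      # relationship named on the slide
    provenance: str = "CrossRef"          # source of the citation information

    def as_triple(self) -> Tuple[str, str, str]:
        # Subject, predicate, object, in the spirit of an RDF statement.
        return (self.citing.identifier, self.relationship, self.cited.identifier)


# Made-up identifiers, used purely for illustration.
citing = Entity("doi:10.9999/example.citing", "journal article",
                "A citing paper", "2012-03-01")
cited = Entity("doi:10.9999/example.cited", "journal article",
               "A cited paper", "2008-01-01")
citation = Citation(citing, cited)
print(citation.as_triple())
```

Keeping provenance on the citation itself, rather than on either paper, mirrors the slide's separation of entities, relationship and provenance.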

14 The Open Citations Corpus

15 The original Open Citations Corpus. An open repository of bibliographic citation data created in 2011 → available at Created with JISC funding of the Open Citations Project → project blog: Originally populated with ~6.4 million individual references from the reference lists of ~200,000 articles in the Open Access Subset of PubMed Central (as of January 2011). These reference >3 million unique papers → ~20% of all PubMed papers published between 1950 and 2010, including all the highly cited papers in every biomedical field → Multiple citations of the same well-cited papers permitted us to perform error correction of the harvested citations (approx. 1% erroneous). These citations are encoded as Linked Open Data using the SPAR ontologies, and are freely available under a CC0 waiver from

16 Viewing citation networks at

17 The outward citation network of Reis et al. (2008)

18 Limitations of the original Open Citations Corpus. A snapshot in time of the citation data in PubMed Central as of January 2011 → becoming increasingly out of date. Contains references from open access articles only. Limited to the biomedical domain.

19 Expanding the Open Citations Corpus

20 Expanding the Open Citations Corpus - Objectives: Redesign the OCC data model. Update the current ingest. Increase the domain coverage. Include reference lists from subscription-access journals. Harvest references on a continuing basis, as articles are published. Improve the user interface and the user experience. Publish the citation data both in BibJSON and in RDF as Linked Open Data. Build added-value services over the citation data.

21 Redesigning the Open Citations Corpus data model. Three record types: Entity Records, Personal Records and Citation Records. A clear separation is made between potentially erroneous citation information ‘as received’ in text strings from article reference lists → ReferenceTextRecords containing NameTextRecords (of authors, editors), and authoritative bibliographic metadata derived from trustworthy sources such as CrossRef, PubMed and the web pages of published articles → BibliographicRecords and PersonalRecords (of authors, editors). A distinction is also made between an UnmatchedCitationRecord → where no BibliographicRecord exists within the OCC for the cited entity, and a MatchedCitationRecord → where the cited entity has a BibliographicRecord within the OCC. A unique internal identifier is created for each OCC record. Provenance information details the source of each citation, the date it was acquired, its format, and the name of the curator responsible for its ingestion.
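As a rough sketch of how these record types relate to one another, the classes below follow the record names on the slide. Everything else is an assumption made for illustration: the field names, the identifier scheme in `new_id`, and the example values. PersonalRecords are omitted for brevity.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

_counter = count(1)


def new_id(prefix: str) -> str:
    """Return a unique internal identifier (this scheme is an assumption)."""
    return f"{prefix}-{next(_counter)}"


@dataclass
class Provenance:
    source: str          # where the citation came from
    date_acquired: str   # when it was acquired
    source_format: str   # format in which it was received
    curator: str         # curator responsible for the ingest


@dataclass
class NameTextRecord:
    text: str            # author or editor name exactly as received
    record_id: str = field(default_factory=lambda: new_id("name-text"))


@dataclass
class ReferenceTextRecord:
    text: str            # reference string exactly as received
    names: List[NameTextRecord]
    provenance: Provenance
    record_id: str = field(default_factory=lambda: new_id("ref-text"))


@dataclass
class BibliographicRecord:
    title: str           # authoritative metadata from a trusted source
    provenance: Provenance
    record_id: str = field(default_factory=lambda: new_id("bib"))


@dataclass
class UnmatchedCitationRecord:
    citing: BibliographicRecord
    cited: ReferenceTextRecord      # no BibliographicRecord exists yet
    record_id: str = field(default_factory=lambda: new_id("citation"))


@dataclass
class MatchedCitationRecord:
    citing: BibliographicRecord
    cited: BibliographicRecord      # cited entity is known to the corpus
    record_id: str = field(default_factory=lambda: new_id("citation"))


# Made-up example values.
prov = Provenance("PubMed Central", "2013-09-20", "XML", "example curator")
citing = BibliographicRecord("A citing article", prov)
ref = ReferenceTextRecord("Smith J (2008) A cited article.",
                          [NameTextRecord("Smith J")], prov)
pending = UnmatchedCitationRecord(citing, ref)
print(pending.record_id)
```

The two citation classes differ only in the type of `cited`, which is the distinction the slide draws between text 'as received' and authoritative metadata.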

22 Reconfiguring the Open Citations Corpus. Underlying technical implementation being revised → Bibliographic information encoded in BibJSON → Data stored in BibServer, which handles BibJSON natively. Data from different sources brought into a common BibJSON format as soon as possible. Processing the whole ingest from either source takes over 24 hours. Work still to be done on the ingest pipeline, since the parsing of citation information from the reference list entries is not yet 100% accurate.

23 Matching citation strings to bibliographic records. When a new reference has been extracted from a reference list → a ReferenceTextRecord is created for the citation target, and → an UnmatchedCitationRecord is created between the BibliographicRecord of the citing paper and the citation target’s ReferenceTextRecord. The ReferenceTextRecord is then compared with existing BibliographicRecords. If a match is found, a new MatchedCitationRecord is created within the OCC between the BibliographicRecords of the citing and cited entities, and the pre-existing UnmatchedCitationRecord between the citing and cited entities is deprecated. Similarly, a new NameTextRecord is created for each author and editor named in the new ReferenceTextRecord, and the OCC is then searched for matches to existing PersonalRecords.
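The workflow on this slide might be sketched as follows. The slide does not say how a ReferenceTextRecord is compared with existing BibliographicRecords, so the comparison used here (exact match on a normalised title) is a deliberately naive stand-in. The `Corpus` class, its method names and the record identifiers are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation so trivially different strings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


@dataclass
class CitationRecord:
    citing_id: str
    target_id: str            # a ReferenceTextRecord id or a BibliographicRecord id
    matched: bool
    deprecated: bool = False


class Corpus:
    def __init__(self) -> None:
        self.bibliographic: Dict[str, str] = {}     # record id -> authoritative title
        self.reference_texts: Dict[str, str] = {}   # record id -> title as received
        self.citations: List[CitationRecord] = []

    def find_match(self, title: str) -> Optional[str]:
        wanted = normalise(title)
        for record_id, known_title in self.bibliographic.items():
            if normalise(known_title) == wanted:
                return record_id
        return None

    def add_reference(self, citing_id: str, ref_id: str, ref_title: str) -> CitationRecord:
        # Step 1: record the reference text and an unmatched citation.
        self.reference_texts[ref_id] = ref_title
        unmatched = CitationRecord(citing_id, ref_id, matched=False)
        self.citations.append(unmatched)
        # Step 2: compare against existing bibliographic records.
        match_id = self.find_match(ref_title)
        if match_id is None:
            return unmatched
        # Step 3: on a match, deprecate the unmatched record and add a matched one.
        unmatched.deprecated = True
        matched = CitationRecord(citing_id, match_id, matched=True)
        self.citations.append(matched)
        return matched


corpus = Corpus()
corpus.bibliographic["bib-1"] = "A citing paper"
corpus.bibliographic["bib-2"] = "The Structure of Citation Networks"
hit = corpus.add_reference("bib-1", "rtr-1", "the structure of citation networks.")
miss = corpus.add_reference("bib-1", "rtr-2", "An unknown work")
print(hit.target_id, miss.matched)
```

The unmatched record is deprecated rather than deleted, as the slide describes, so the original 'as received' link stays available for later inspection.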

24 Citation error correction. Examples of errors in reference list entries vary → from the trivial – a non-English name with incorrect accents or an article title containing “beta” instead of the correct “β” → to the serious – two papers in the same reference list with the same DOI. Such errors can be detected by comparing a new ReferenceTextRecord with pre-existing BibliographicRecords, and a new NameTextRecord with pre-existing PersonalRecords. Where there are several OCC ReferenceTextRecords referencing the same multiply-cited paper for which an authoritative OCC BibliographicRecord does not yet exist, we use voting algorithms for reference disambiguation and error correction, enabling the creation of a reliable BibliographicRecord for that entity even when we can find no external authority to provide it. In future, we wish to offer an automated OCC reference correction service to third parties such as authors and journal editors, enabling them to spot and correct errors in the reference lists of submitted papers before publication.
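The slide does not describe the voting algorithms themselves. One simple possibility, shown here purely as an assumption, is a field-by-field majority vote across several reference strings known to point at the same paper. The function name and the example records are made up.

```python
from collections import Counter
from typing import Dict, List


def vote(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Build a consensus record by taking the most common value of each field."""
    consensus: Dict[str, str] = {}
    field_names = sorted({name for record in records for name in record})
    for name in field_names:
        # Ignore records in which the field is missing or empty.
        values = [record[name] for record in records if record.get(name)]
        if values:
            consensus[name] = Counter(values).most_common(1)[0][0]
    return consensus


# Three made-up versions of the same reference, each with a different error.
records = [
    {"title": "TGF-β signalling in development", "year": "2008", "volume": "12"},
    {"title": "TGF-beta signalling in development", "year": "2008", "volume": "12"},
    {"title": "TGF-β signalling in development", "year": "2009", "volume": "12"},
]
print(vote(records))
```

In this example the “beta” variant and the wrong year are each outvoted two to one. A real system would also need a policy for ties and for weighting more trustworthy sources.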

25 New relationship types in the Open Citations Corpus. Entity type relationships: The nature of the source entity and the target entity (e.g. journal article, book, dataset) are separately recorded in the OCC. We can thus infer the nature of each entity type relationship, for example: → Article-to-article bibliographic citation → Article-to-database data citation → Data_repository-to-article bibliographic citation. Relationships other than bibliographic citations: Additional relationship types between entities in the OCC may be encoded using CiTO, the Citation Typing Ontology, if that information is available: → Citation :EntityA cito:cites :EntityB. → Shared authorship :EntityA cito:sharesAuthorsWith :EntityB. → Common funding :EntityA cito:sharesFundingAgencyWith :EntityB. → Common institution :EntityA cito:sharesAuthorInstitutionWith :EntityB. → Related :EntityA dcterms:relation :EntityB.
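A minimal sketch of both ideas on this slide follows: inferring the entity type relationship from the separately recorded source and target types, and writing a relationship as a statement. The rule that a database or dataset target makes a "data citation" is an assumption drawn from the three examples given, and the function names are hypothetical. The property names are those listed on the slide.

```python
# Target types treated as data rather than literature (an assumption).
DATA_TYPES = {"database", "dataset"}


def relationship_label(source_type: str, target_type: str) -> str:
    """Infer the entity type relationship from the two recorded entity types."""
    kind = "data citation" if target_type in DATA_TYPES else "bibliographic citation"
    return f"{source_type.capitalize()}-to-{target_type} {kind}"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format a relationship in the notation used on the slide."""
    return f":{subject} {predicate} :{obj} ."


print(relationship_label("article", "database"))
print(triple("EntityA", "cito:sharesFundingAgencyWith", "EntityB"))
```

Because the entity types are stored separately from the citation, the label never needs to be stored: it can be recomputed whenever an entity's type is corrected.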

26 Expansion of the Open Citations Corpus coverage. Ingest from the Open Access Subset of PubMed Central is being updated from ~200,000 articles in Jan 2011 to the current ~658,000 articles in September 2013. Domain coverage is being expanded to include the physical sciences and mathematics, by the ingest of the reference lists from all ~872,000 preprints in the arXiv preprint repository at Cornell University Library. This will bring the total number of references from ~6.4 million to ~40 million. We then intend to ingest all the references in CiteSeer and from Wikipedia, marking these with clear provenance information. To this we will add citations from data repositories such as Dryad, which contain literature references associated with the datasets they hold, and from DataCite, which issues DOIs for datasets and harvests metadata that contain literature references.

27 Citations from heritage literature – ‘The Future of the Past’. Funding application just submitted to harvest references from the pre-digital biodiversity / biological taxonomy literature, where papers have lasting value. We will use the Biodiversity Heritage Library as a source of references. David King, a text mining colleague at the Open University, will use advanced text mining techniques to dig references out of ‘dirty’ OCR’d page images. We will then ingest these data into the Open Citations Corpus and make them freely available. This will be the only source of digital citation data from a major fraction of the world's heritage literature in the field of biodiversity / biological taxonomy, which is simply not available in digital form anywhere else.

28 Additional citations from PubMed Central. There are ~2.2 million articles in PubMed Central that are not part of the Open Access Subset, presently missing from the Open Citations Corpus. These contain citations not only to other papers, but also to datasets, typically in the form of database accession numbers, buried within the full text or footnotes. Recent text mining initiatives undertaken by Europe PubMed Central (EPMC) have extracted both the bibliographic citations and the data citations from all ~2.8 million PubMed Central articles, which are now freely available from Europe PubMed Central. We propose to ingest all these EPMC literature and data citations into the expanded and improved Open Citations Corpus → This will increase the number of PMC articles for which the OCC holds citation information by about 330% → In addition, it will further expand the nature of the citation data held to include the data citations contained within these PMC articles. However, these are just a fraction of the total scholarly citations, most of which are locked behind the pay walls of commercial providers.
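The "about 330%" figure can be checked with a short calculation. It assumes the baseline is the ~658,000 Open Access Subset articles quoted on the coverage slide and the target is the ~2.8 million PubMed Central articles mentioned here. Both inputs are rounded, so the result is only approximate.

```python
# Rounded figures quoted in the slides.
oa_subset_articles = 658_000      # Open Access Subset, September 2013
all_pmc_articles = 2_800_000      # all PubMed Central articles mined by EPMC

# Percentage increase relative to the Open Access Subset baseline.
increase_pct = (all_pmc_articles - oa_subset_articles) / oa_subset_articles * 100
print(round(increase_pct))
```

With these rounded inputs the increase comes out at roughly 326%, which is consistent with the slide's "about 330%".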

29 Reference lists from subscription-access articles. All fully open access publishers already publish article reference lists openly. I am working to persuade other major scholarly publishers to do the same → i.e. to put article reference lists outside the subscription pay-wall, in the same way as abstracts and bibliographic metadata are freely available. Last January, I published an Open Letter to Publishers requesting this → Claire Redhead kindly distributed it to all OASPA members → The letter is available at al_publishers_re_open_citations.pdf A number of leading STM publishers have expressed their willingness to open the reference lists from subscription-access journal articles → Nature, Science, Taylor & Francis, Royal Society Publishing, Portland Press, MIT Press and Oxford University Press are among the first → another has expressed willingness verbally, but has yet to commit formally.


31 Opening article reference lists via CrossRef. How can these be ingested into the Open Citations Corpus? Most publishers already submit their reference lists to CrossRef as part of its CitedBy Linking Service → If you do not at present, you should use this free service! With the publisher’s permission, CrossRef can enable reference lists to be ‘opened’ → on a publisher-by-publisher basis based on DOI prefixes → on a journal-by-journal basis → on an article-by-article basis for hybrid journals. References are then available via the CrossRef API for ingest into the OCC. However, because the default CrossRef CitedBy Linking Service agreement is not to publish reference lists, even Open Access publishers must specifically inform CrossRef that the reference lists of their journal articles should be open. Geoff Bilder has a new CrossRef Metadata Best Practice Document that I will circulate, explaining how to specify this choice in your article metadata.

32 Summary - Benefits of the Open Citations Corpus. Created by scholars for scholars using scholarly data → No profit motive constraining free publication of the data → Will bring particular benefit to those who are NOT members of First World academic institutions whose libraries subscribe to commercial citation data from Thomson-Reuters or Elsevier. Will provide integrated access to citation data from a variety of sources, both inside and outside traditional scholarly publishing, with provenance information. Data are semantically described using the SPAR bibliographic ontologies → Citations thus become part of the Web of Linked Open Data. Data available in a variety of formats including BibJSON, BibTeX and RDF for download by third parties for their own use or to build into cool services → indexing, search and browse (in prototype) → timeline visualizations (in prototype) → analysis of citation networks, co-authorship networks, etc. → trend identification, recommendation services, etc.

33 Sustainability

34 The development of the Open Citations Corpus has been enabled by short-term grant funding, but this does not provide a sustainable financial model. For the future, we seek one of the following long-term arrangements: → Adoption by a major institutional or national library → Adoption by a publishing organization such as CrossRef, with indirect support from publishers → Direct support by the scholarly publishing community → Social investment, i.e. the provision of capital to generate social as well as financial returns, to support open access to scholarly information → Income support by charging for added-value services over the open data. I would be grateful for your views on the value of the Open Citations Corpus and the manner in which its ongoing development might be supported.

35 Acknowledgements and thanks. Alex Dutton, who developed the original Open Citations Corpus. Richard Jones, Martyn Whitwell and Mark MacGillivray of Cottage Labs, who have undertaken more recent development work. Silvio Peroni, my colleague in developing the suite of SPAR (Semantic Publishing and Referencing) Ontologies. The JISC, who have funded the development of the Open Citations Corpus.
