From RePEc to 3lib: the long march for free bibliographic data. Thomas Krichel, 2011-06-02.

structure –Prologue –RePEc: a digital library for economics –Ongoing work to build a general digital library for scholarly communication –Beers?

who is me? I was an economist. I was a leisure digital librarian. –NetEc since 1993 –RePEc since 1997 I am just another Perl hacker. I am a visionary, but I'm not like St. John the Baptist.

who is he?

he is "St. IGNUcius" A humorous creation of Richard M. Stallman (RMS). RMS is the father of the free software movement –a geek –a visionary St. IGNUcius puts the emphasis on the moral case for free software, rather than the business case.

moral case and business case Other folks in the free software movement avoid the "f" word –free can mean cheap –cheap can mean bad They stress the business case of free software. They use the term "open source software" (OSS).

RMS and us Amen, I tell you: we librarians need to learn more from the OSS movement. We need to make the concepts coming out of free software more a part of our business. Let us look at a key concept: free software.

free software according to RMS Free software comes with four freedoms –The freedom to run the software, for any purpose –The freedom to study how the program works, and adapt it to your needs –The freedom to redistribute copies so you can help your neighbor –The freedom to improve the program, and release your improvements to the public, so that the whole community benefits

what has this to do with us? Just replace free software with free information. Libraries are about free information. But the analogy is not quite that simple. –When we talk about free information, we usually mean things that we can freely read (download…). Free as in $0. –We do not usually mean information we are free to do things with. Free as in freedom.

moral and business There is a moral case for free information. –We rely on it. There is a business case for free information. –We need to make our own.

we rely on the moral case The citizen should be informed… Individuals in the organization should have free access… This is how we justify resources given to us. Often, members of the community who pay get privileged access.

from moral case to business case To form the business case for free information, think of "free information" as "freedom to do things" rather than $0. Thus libraries can make a crucial business case for themselves as agents who transform information. Recall that there are whole industries out there that produce free information.

Now for something different RePEc is an example of an Open Library. An Open Library is loosely defined as an application of the OSS principles to libraries. –vague –in the making –but has some history Looking at RePEc will fix ideas.

History It started with me as a research assistant in the Economics Department of Loughborough University of Technology. A predecessor of the Internet allowed me to download free software without effort, but academic papers had to be gathered in a painful way.

CoREJ published by HMSO –Photocopied contents tables of recently published economics journals received at the Department of Trade and Industry –Typed list of the working papers recently received by the University of Warwick library The latter was the more interesting.

working papers Early accounts of research findings, published by economics departments –in universities –in research centers –in some government offices –in multinational administrations Disseminated through exchange agreements. Important because of the 4-year publishing delay.

I planned to circulate the Warwick working paper list over listserv lists. I argued it would be good for them –increase incentives to contribute –increase revenue for ILL After many trials, Warwick refused. Towards the end of that time, I was offered a lectureship, and decided to get working on my own collection.

1993: BibEc and WoPEc Fethy Mili of Université de Montréal had a good collection of papers and gave me his data. I put his bibliographic data on a gopher and called the service "BibEc". I also gathered the first ever online electronic working papers on a gopher and called the service "WoPEc".

NetEc consortium –BibEc: printed papers –WoPEc: electronic papers –CodEc: software –WebEc: web resource listings –JokEc: jokes –HoPEc A lot of Ec!

WoPEc to RePEc WoPEc was a catalog record collection. WoPEc remained the largest web access point, but getting contributions was tough. In 1996 I wrote the basic architecture for RePEc. –ReDIF –Guildford Protocol

1996: RePEc principle Many archives –archives offer metadata about digital objects (mainly working papers) One database –The data from all archives forms one single logical database despite the fact that it is held on different servers. Many services –users can access the data through many interfaces. –providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.

RePEc is based on archives WoPEc EconWPA DEGREE S-WoPEc NBER CEPR US Fed in Print IMF OECD MIT University of Surrey CO PAH

to form a 1M+ item dataset –390,000 working papers –620,000 journal articles –2,100 software components –22,000 book and chapter listings –27,000 author contact and publication listings –12,000 institutional contact listings

RePEc is used in many services BibEc and WoPEc EconPapers NEP: New Economics Papers Inomics RePEc Author Service Journal of Economic Literature IDEAS RuPEc EDIRC LogEc CollEc CitEc

… describes documents
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Person: RePEc:per: :thomas_krichel
Author-
Author-Name: Paul Levine
Author-
Author-WorkPlace-Name: University of Surrey
Classification-JEL: C61; E21; E23; E62; O41
File-URL: ftp:// pub/RePEc/sur/surrec/surrec9601.pdf
File-Format: application/pdf
Creation-Date:
Revision-Date:
Handle: RePEc:sur:surrec:9601

… describes persons (RAS)
template-type: ReDIF-Person 1.0
name-full: MANKIW, N. GREGORY
name-last: MANKIW
name-first: N. GREGORY
handle: RePEc:per: :N__GREGORY_MANKIW
homepage: mankiw/mankiw.html
workplace-institution: RePEc:edi:deharus
workplace-institution: RePEc:edi:nberrus
Author-Article: RePEc:aea:aecrev:v:76:y:1986:i:4:p:
Author-Article: RePEc:aea:aecrev:v:77:y:1987:i:3:p:
Author-Article: RePEc:aea:aecrev:v:78:y:1988:i:2:p:
….

… describes institutions
Template-Type: ReDIF-Institution 1.0
Primary-Name: University of Surrey
Primary-Location: Guildford
Secondary-Name: Department of Economics
Secondary-Phone: (01483)
Secondary-
Secondary-Fax: (01483)
Secondary-Postal: Guildford, Surrey GU2 5XH
Secondary-Homepage:
Handle: RePEc:edi:desuruk
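ReDIF templates are plain text with one "Field: value" pair per line, and fields such as Author-Name may repeat. The following is a minimal sketch of reading such a template in Python, assuming no continuation lines and treating field names as case-insensitive (the slides mix "Template-Type" and "template-type"). The function names `parse_redif` and `values` are made up for this illustration and are not part of any RePEc software.

```python
from typing import List, Tuple


def parse_redif(text: str) -> List[Tuple[str, str]]:
    """Parse 'Field: value' lines into an ordered list of (field, value) pairs.

    A list is used instead of a dict because fields can repeat.
    Field names are lower-cased; this is a simplification for the sketch.
    """
    pairs = []
    for line in text.splitlines():
        # Split on the first colon only, so handles like RePEc:sur:... survive.
        field, sep, value = line.partition(":")
        if not sep or not field.strip():
            continue  # not a field line; ignored in this sketch
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


def values(pairs: List[Tuple[str, str]], field: str) -> List[str]:
    """Return all values of a (possibly repeated) field, in template order."""
    return [v for f, v in pairs if f == field.lower()]


sample = """Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Name: Paul Levine
Classification-JEL: C61; E21; E23; E62; O41
Handle: RePEc:sur:surrec:9601
"""

record = parse_redif(sample)
print(values(record, "Author-Name"))
print(values(record, "Handle")[0].split(":"))
```

Splitting the handle on colons gives its parts; reading them as authority, archive, series and item number is an interpretation suggested by the file path in the example, not something the slides spell out.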

what value added does RePEc provide? RePEc identifies records. RePEc relates identified records. These actions require human control. They prepare for assessment of performance.

key to success Have a small group of volunteers Disseminate as widely as possible Demonstrate to authors and institutions that it works for them. –institutional registration –author registration

institutional registration It started by one sad geezer making a list of departments that have a web site. I persuaded him that his data would be more widely used if integrated into the RePEc database. Now he is a happy geezer and one of our three crucial volunteers.

author registration It started when funding allowed us to hire a crazy programmer to write an author registration system. system went online as HoPEc in late has been renamed RePEc Author Service (RAS) A 2003 grant from OSI allows for a rewrite and expansion.

RePEc author service RePEc document data has author names as strings. The authors register with RAS to list contact details and identify the papers they wrote. This is classic access control, but done by the authors. In a ranking of the 1000 most important economists, over 80% are registered with RAS.

authors' incentives Authors perceive the registration as a way to achieve common advertising for their papers. Author records are used to aggregate usage logs across RePEc user services for all papers of an author. This stimulates an "I am bigger than you are" mentality. Size matters!

general outlook OK, it works for Economics, but what about the rest of the world? There are two trends that are related to RePEc –institutional repositories –open bibliographic data [let's start with the latter]

open bibliographic data There is a growing (albeit slowly growing) movement for open bibliographic data. Open bibliographic data is bibliographic data that comes with more liberal licensing conditions for its reuse. The main player here is the Bibliographic Data Working Group of the Open Knowledge Foundation (henceforth "the group").

types of data The group works on two types of data. –One is library catalog data. –The other is scholarly paper data. Both have different challenges and opportunities. The group has defined the open bibliographic principles.

the principles Good in principle. But in practice they are difficult to implement for large collections, which are often aggregates. For RePEc they would be difficult to implement.

the metadata The group does some collection work, mainly for cataloging data. This uses RDF and semantic web technologies. I have not looked at this in much detail.

the open library society Yours truly created it. Site at y.openlib.org. The society complements the group. OLS aggregates bibliographic data without insisting on legal conditions. OLS builds useful services that make contributors happy. It's the road that RePEc took in 1993.

OLS projects The society's aims are fairly broad. Basically we are conducting work to build a RePEc for all disciplines. Some components are there –RePEc bibliographic data → 3lib –RePEc Author Service → AuthorClaim –EDIRC → ARIW

relationship to free software Basically the OLS intends to build bibliographic and associated metadata in the way free software is built. Software is generally built by reusing basic elements (libraries). Over time, geeks have found object orientation as a way to improve reuse. We have no object-oriented metadata yet, and we can't wait for it.

emphasizing re-use OLS services are intended to re-use other sources. They are built for re-use of the data they generate. We are aiming toward making re-use as transparent as possible. That distinguishes them from commercial dead-ends.

OLS and RePEc Since RePEc has no legal personality, the OLS has lent it its legal personality. Thus RePEc is now an OLS project, but it is de facto run by a meritocratic board. See for details.

3lib 3lib is an initial attempt at building an aggregate of freely available (sort of, as opposed to open) bibliographic data. It's a project by OLS sponsored by OKFN. About 35 million records from the usual suspects: PubMed, OpenLibrary, DBLP, RePEc…

3lib record structure The data elements in 3lib are very simple –title –author name expressions –link to item page on provider site –identifier 3lib is meant to serve AuthorClaim. Other data elements could be added to 3lib structure if needed.
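The four data elements listed above are enough to sketch a record type. The following is an illustration only, assuming a flat record with one title, a tuple of author name expressions, one link and one identifier; the class name `ThreeLibRecord`, the field names and the example URL are hypothetical and do not reflect the actual 3lib storage format.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ThreeLibRecord:
    """One bibliographic record with the four elements named on the slide."""
    identifier: str                # identifier of the item
    title: str                     # title of the item
    author_names: Tuple[str, ...]  # author name expressions, as given by the provider
    item_url: str                  # link to the item page on the provider site


# Example values: the title, names and handle come from the ReDIF slide,
# the URL is a placeholder.
record = ThreeLibRecord(
    identifier="RePEc:sur:surrec:9601",
    title="Dynamic Aspect of Growth and Fiscal Policy",
    author_names=("Thomas Krichel", "Paul Levine"),
    item_url="https://example.org/item/surrec9601",
)

print(asdict(record))
```

Keeping the author names as plain strings matches the point made elsewhere in the talk: document data carries author names as strings, and it is the claiming service that links them to persons.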

AuthorClaim AuthorClaim is an authorship claiming service for 3lib data. It is modeled after the RePEc Author Service. It uses the same software, called ACIS. It has been running since early 2008 and predates ORCID.

[ORCID] This is a broad initiative to build an author identification system. It's been around since late There is a sandbox, but that may not be the direction the project will take. There is a new whitepaper. I am a member of the technical architecture group.

author claiming vs identification Author claiming records are NOT author identification records. The difference is what I call Klink's problem –A person can claim to be an author of a paper. –If there are several authors, we don't know which author (s)he is.

Klink's problem Examples of Klink's problem –Jane and John Smith write a paper. Its author list says J. Smith and J. Smith. –Barack Obama claims a paper. The author string says Obama B., Laden et alii. In practice Klink's problem is not very important, but in theory it is.
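The gap between claiming and identification can be made concrete with a small sketch. It assumes that a claim is located by comparing the claimant's name variations with the author strings of the paper after a crude normalization; the functions `normalize` and `locate_claim` are invented for this illustration and are not ACIS code.

```python
from typing import List, Tuple


def normalize(name: str) -> str:
    """Lower-case a name and drop dots and commas (a deliberately crude rule)."""
    return " ".join(name.replace(".", " ").replace(",", " ").lower().split())


def locate_claim(author_strings: List[str],
                 name_variations: List[str]) -> Tuple[str, List[int]]:
    """Say which positions in the author list a claimant could occupy.

    Returns a status and the matching positions:
    'identified' for exactly one match, 'ambiguous' for several,
    'unlocated' when the claim cannot be tied to any author string.
    """
    wanted = {normalize(v) for v in name_variations}
    positions = [i for i, a in enumerate(author_strings) if normalize(a) in wanted]
    if not positions:
        status = "unlocated"
    elif len(positions) == 1:
        status = "identified"
    else:
        status = "ambiguous"
    return status, positions


jane = ["Jane Smith", "J. Smith"]
print(locate_claim(["J. Smith", "J. Smith"], jane))    # two candidate positions
print(locate_claim(["J. Smith", "T. Krichel"], jane))  # one candidate position
print(locate_claim(["T. Krichel", "et al."], jane))    # claim valid, position unknown
```

In the first case the claim may well be true, but the record alone cannot say which "J. Smith" the claimant is; that is the sense in which a claiming record is not an identification record.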

AuthorClaim name details It contains the name details as they may be found in the bibliographic data –Krichel, Thomas –T. Krichel –Томас Крихель Sometimes the name of an author may not appear in the bibliographic data at all (et al.).

AuthorClaim contact details This is a set of trivial fields – . This detail is required but not exported by default. –homepage. This detail is optional. –phone number. This detail is optional –postal address. This detail is optional.

affiliations profile This is more complicated. Institutional data is kept as separate records, not as string data. Registrants can search for existing institutional records to create an affiliation with. Or they can propose a new record to be added by filling out a form.

research profile This is a collection of metadata about research documents the registrant has written. Available functions include –display a list of works in the profile –search for new suggested works –manual search for works by title –display refused research documents –change preferences for automatic updates (next)

automatic updates By default, a regular search using the name variations profile identifies a set of potential new documents and reports them to the user. If the registrant has accepted and refused documents, the list of suggested documents can be learned, i.e. sorted by relevance.
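The regular search can be sketched as follows. This is a simplified illustration, assuming exact (case-insensitive) matching of name variations against author name strings and documents held as small dicts; the function `suggest_documents` and the record layout are hypothetical, and the learned relevance sorting mentioned on the slide is not attempted here.

```python
from typing import Dict, Iterable, List


def suggest_documents(documents: List[Dict],
                      name_variations: Iterable[str],
                      accepted: Iterable[str],
                      refused: Iterable[str]) -> List[str]:
    """Return ids of documents that match a name variation and are still undecided."""
    names = {v.casefold() for v in name_variations}
    decided = set(accepted) | set(refused)
    suggestions = []
    for doc in documents:
        if doc["id"] in decided:
            continue  # already accepted or refused by the registrant
        if any(author.casefold() in names for author in doc["authors"]):
            suggestions.append(doc["id"])
    return suggestions


# Made-up document ids; the name variations are those shown on the slides.
docs = [
    {"id": "d1", "authors": ["T. Krichel", "P. Levine"]},
    {"id": "d2", "authors": ["Krichel, Thomas"]},
    {"id": "d3", "authors": ["Томас Крихель"]},
    {"id": "d4", "authors": ["N. Gregory Mankiw"]},
]
variations = ["Krichel, Thomas", "T. Krichel", "Томас Крихель"]

print(suggest_documents(docs, variations, accepted=["d1"], refused=["d2"]))
```

Keeping the refused documents in the profile is what stops the same suggestion from coming back at every update.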

document to document links ACIS has the ability to manage document to document links. Authors can say that two documents in the profile are related. This is not running in AuthorClaim.

full-text recognition ACIS has a feature to validate document full-text links: they can be confirmed or rejected. Typically such full-text files would be found by an automated search engine. I have worked on such a dataset, but that is a topic that would take us too far.

AuthorClaim data Available at ftp://ftp.authorclaim.org under CC0. Today there are 123 profiles. There is an obvious chicken-and-egg problem. It's the same problem that RePEc overcame.

AuthorClaim profiles Each profile contains 3lib data for each accepted and each refused document. Ideally, we would like to include source data (in XML) from the bibliographies we receive. Why on earth would we distribute data on refused papers?

AuthorClaim and ARIW Registrants can claim affiliation with an institution registered in ARIW. Registrants can propose a new institution to ARIW. If it is accepted, ARIW can change AuthorClaim records to point to the new ARIW record. AuthorClaim exports records back to ARIW. ARIW is using them to find AuthorClaim registrants for ARIW institutions.

AuthorClaim and 3lib CrossRef labs have an experimental search engine at f.org/search It is difficult to work with, to be polite. For AuthorClaim registered users, I run a query robot to this engine. The detail is at Results flow to 3lib. Copyright issues?

AuthorProfile AuthorProfile is an information service that combines 3lib and AuthorClaim data. Joint work with Robert J. Griffin III, since late The site is running at e.org. At this time I'd rather use the development site.

vision behind AuthorProfile We believe that scholarly communication is for authors more than for readers. Authors have to take priority. We want to invert bibliographic data from its conventional structure into a CV-like structure.

vertical integration Vertical integration explores collaboration between authors and aunexes. From aunexes, we can link to authors. We want to choose one upstream author or aunex to link to. The selection problem for this is complicated. We use an intensity of collaboration and then an off-centrality measure.

name problem Without author claiming, it is based on author name expressions. It is therefore error-prone. It gives us the chance to show how poor name data can be. It may also incentivize people to register.

auversion We build data files that contain all papers written by authors with the same name expression. This leaves us with a large number of auverted files. Each contains the documents by one author name expression, henceforth aunex.
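Auversion amounts to inverting document records, which list authors, into per-name records, which list documents. A minimal sketch under that reading, assuming in-memory dicts rather than files; the functions `auvert` and `coauthor_aunexes` are hypothetical names and this is not the AuthorProfile code, which is written in Perl and XSLT.

```python
from collections import defaultdict
from typing import Dict, List, Set


def auvert(documents: List[Dict]) -> Dict[str, List[str]]:
    """Group document ids by author name expression (aunex)."""
    by_aunex = defaultdict(list)
    for doc in documents:
        for aunex in doc["authors"]:
            by_aunex[aunex].append(doc["id"])
    return dict(by_aunex)


def coauthor_aunexes(documents: List[Dict], aunex: str) -> Set[str]:
    """Aunexes that share at least one document with the given aunex."""
    found = set()
    for doc in documents:
        if aunex in doc["authors"]:
            found.update(a for a in doc["authors"] if a != aunex)
    return found


# Made-up document ids for illustration.
docs = [
    {"id": "d1", "authors": ["Thomas Krichel", "Paul Levine"]},
    {"id": "d2", "authors": ["T. Krichel"]},
    {"id": "d3", "authors": ["Thomas Krichel"]},
]

print(auvert(docs))
print(coauthor_aunexes(docs, "Thomas Krichel"))
```

Note that "Thomas Krichel" and "T. Krichel" end up in separate groups: the inversion works on strings, not on persons, which is exactly why the data is error-prone without author claiming.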

navigation We can navigate auverted data through co-authorship relationships. In addition we build –horizontal integration –vertical integration –search

horizontal integration Horizontal integration links name expressions which may represent the same author. For example: F. Lefevre and Fred Lefevre. This feature has yet to be implemented.

AuthorProfile search A search feature gives priority to registered authors. If more than one author is identified by the same name, links to both authors shall be returned by the query. This feature has yet to be implemented. for example will perform a search.

ranking We intend to run a PageRank-type algorithm to rank authors. For this we have to do some network calculations, but not a lot. At this time authors are the only entry points to the collection.
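As an illustration of what a PageRank-type calculation on a co-authorship network involves, here is a textbook power-iteration sketch. It assumes an unweighted graph given as a dict that maps every author to a list of co-authors, with every co-author also present as a key; the damping factor 0.85 and the iteration count are conventional defaults, not values from the project, and the talk does not say which variant is intended.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain PageRank by power iteration; ranks sum to 1."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small base share.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A node without co-authors spreads its rank evenly.
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank


# A star-shaped network: A has written with B, C and D, who only wrote with A.
coauthors = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(coauthors)
for author in sorted(ranks, key=ranks.get, reverse=True):
    print(author, round(ranks[author], 3))
```

On this toy network the central author A comes out on top, and B, C and D tie. A real computation would have to weight links, for instance by the intensity of collaboration mentioned under vertical integration.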

AuthorProfile documentation There is a special documentation interface at It shows the entire software written for the project, both in Perl and XSLT. The aim is to have the entire internals of the project available. This is also done for ARIW, but here it's done in a smarter way.

obstacles Scale, scale, scale Representation of negativity. Instant data inject from AuthorClaim.

institutional repositories Running institutional repositories is one way (the only one?) that academic libraries can stay in business. Take-up has been slow. Same issues as with AuthorClaim. A layer of use/assessment is missing. OAI-DC data from repositories is notoriously difficult to use.

BASE Recently OLS have acquired a data feed from the BASE search engine. This contains post-processed harvested data for institutional repositories for inclusion in AuthorClaim. We want to develop an add-on to DSpace for AuthorClaim records.

AuthorClaim and IRs There is a theory, developed by Ivan Kurmanov, about levels of interoperability between AuthorClaim and institutional repositories. All levels were once implemented in a test using ACIS and EPrints.

the institutional repository IR Its main function is to collect document data. Author metadata comes as part of the metadata information with the document. The author may be identified within the scope of the collection. The user of the IR is called a submitter.

AuthorClaim AuthorClaim is a service that collects personal data and connects them with metadata about documents. The key data is author name data and document identifiers. Authors can contact AuthorClaim to identify themselves. Once they are registered they can say what documents they have authored. The AuthorClaim user is called a registrant.

AuthorClaim IR interoperability Interoperability comes in different levels. With each level up, we have more (better) interoperability. We have levels 0 to 4. At level zero, AuthorClaim and the IR simply live side by side, and no interaction is happening.

level 1 In level 1, the IR provides metadata about its documents to AuthorClaim. This will be intermediated by BASE. AuthorClaim processes the data periodically. –add new records to the document stock –perform probationary associations between documents and registrants

level 2 An IR delivers to AuthorClaim data for some of its authorships that point to data in AuthorClaim. ACIS will accept any of the following 3 identification avenues –an identifier known to AuthorClaim –a shortid, previously generated by AuthorClaim –an address, known to AuthorClaim as the login of a registrant. This data will have to be entered by the submitter.
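The three identification avenues can be pictured as three lookups tried against the registrant data. The sketch below is an assumption-laden illustration: the registrant fields, the example values, and the functions `build_index` and `resolve_authorship` are all hypothetical, and the order in which avenues are tried is a choice made here, not something the slides specify.

```python
from typing import Dict, List, Optional, Tuple

AVENUES = ("identifier", "shortid", "login")


def build_index(registrants: List[Dict]) -> Dict[Tuple[str, str], Dict]:
    """Index registrants under each of the three identification avenues."""
    index = {}
    for reg in registrants:
        index[("identifier", reg["identifier"])] = reg
        index[("shortid", reg["shortid"])] = reg
        # Login addresses are compared case-insensitively in this sketch.
        index[("login", reg["login"].lower())] = reg
    return index


def resolve_authorship(index: Dict[Tuple[str, str], Dict],
                       pointer: Dict[str, str]) -> Optional[Dict]:
    """Return the registrant an IR authorship pointer refers to, or None."""
    for avenue in AVENUES:
        value = pointer.get(avenue)
        if value is None:
            continue
        if avenue == "login":
            value = value.lower()
        registrant = index.get((avenue, value))
        if registrant is not None:
            return registrant
    return None


# Entirely made-up registrant data.
registrants = [
    {"identifier": "person-0001", "shortid": "pxy1",
     "login": "registrant@example.org", "name": "Example Registrant"},
]
index = build_index(registrants)

print(resolve_authorship(index, {"shortid": "pxy1"}))
print(resolve_authorship(index, {"login": "Registrant@Example.org"}))
print(resolve_authorship(index, {"identifier": "unknown"}))
```

An authorship that resolves to None would simply be treated as at level 1, that is, as a name string that may later lead to a probationary association.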

level 3 The IR helps submitters to find the data required for level 2 interoperability. While submitters enter authorship data, the IR performs searches in the AuthorClaim data. If matching records are found, the submitter is invited to select them. The document data is then exported to AuthorClaim in the usual way.

implementing level 3 AuthorClaim needs to expose registrant data to the IR. The data cannot be made available publicly if we want the login address to be an avenue of identification. The IR must –search the AuthorClaim data –display optional matches in an unobtrusive way –give submitters an easy way to choose an option.

level 4 The IR immediately notifies AuthorClaim about a document submission. AuthorClaim processes the notification, and the document is added to the research profiles of its identified authors. (you may argue: it's on a different level)

level dependency There is level dependency –Level 1 is really required for other levels. –Level 2 is a basis for level 3. –Level 4 can be done without either level 2 or level 3. It does not really matter: current ACIS code can implement all four levels. There is ACIS code for EPrints 2.0 that implements the dosus side of the interoperability.
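The dependencies between levels can be written down as a small table and checked mechanically. This is only a sketch of the three rules stated above, reading "level 1 is required for other levels" as applying to levels 2, 3 and 4; the names `REQUIRES` and `missing_levels` are invented for this illustration.

```python
from typing import Dict, Iterable, Set

# Which levels must already be in place before a given level can work.
REQUIRES: Dict[int, Set[int]] = {
    0: set(),     # side by side, no interaction
    1: set(),     # IR provides document metadata
    2: {1},       # IR points authorships to AuthorClaim data
    3: {1, 2},    # IR helps submitters find the level 2 data
    4: {1},       # immediate notification; needs neither level 2 nor 3
}


def missing_levels(level: int, implemented: Iterable[int]) -> Set[int]:
    """Return the prerequisite levels that are not yet implemented."""
    return REQUIRES[level] - set(implemented)


print(missing_levels(3, implemented=[1]))   # level 2 is still missing
print(missing_levels(4, implemented=[1]))   # nothing missing
print(missing_levels(2, implemented=[]))    # level 1 is missing
```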

Thank you for your attention!