From RePEc to 3lib: the long march for free bibliographic data Thomas Krichel 2011-06-02
structure Prologue RePEc: a digital library for economics Ongoing work to build a general digital library for scholarly communication Beers?
who is me? I was an economist. I was a leisure digital librarian. –NetEc since 1993 –RePEc since 1997 I am just another Perl hacker. I am a visionary, but I'm not like St. John the Baptist.
who is he?
he is "St. IGNUcius" A humorous creation of Richard M. Stallman (RMS) RMS is the father of the free software movement –a geek –a visionary St. IGNUcius shows an emphasis on the moral case for free software, rather than the business case
moral case and business case Other folks in the free software movement avoid the "f" word –free can mean cheap –cheap can mean bad They stress the business case of free software They use the term "open source software", (OSS)
RMS and us Amen, I tell you: we librarians need to learn more from the OSS movement. We need to make the concepts coming out of free software more a part of our business. Let us look at a key concept: free software.
free software according to RMS Free software comes with four freedoms –The freedom to run the software, for any purpose –The freedom to study how the program works, and adapt it to your needs –The freedom to redistribute copies so you can help your neighbor –The freedom to improve the program, and release your improvements to the public, so that the whole community benefits
what has this to do with us? Just replace free software with free information. Libraries are about free information. But the analogy is not quite as simple. –When we talk about free information, we usually mean things that we can freely read (download…). free as in: $0 –We do not usually mean free information as information we are free to do things with. Free as in freedom.
moral and business There is a moral case for free information. –We rely on it. There is a business case for free information. –We need to make our own.
we rely on the moral case The citizen should be informed… Individuals in the organization should have free access… This is how we justify resources given to us. Often, members of the community who pay get privileged access.
from moral case to business case To form the business case for free information, think of "free information" as "freedom to do things" rather than $0. Thus libraries can make a crucial business case for themselves as agents who transform information. Recall that there are whole industries out there that produce free information.
Now for something different RePEc is an example of an Open Library. An Open Library is, loosely defined, an application of OSS principles to libraries. –vague –in the making –but has some history Looking at RePEc will fix ideas.
History It started with me as a research assistant in the Economics Department of Loughborough University of Technology. A predecessor of the Internet allowed me to download free software without effort, but academic papers had to be gathered in a painful way.
CoREJ published by HMSO –Photocopied tables of contents of recently published economics journals received at the Department of Trade and Industry –Typed list of the recently received working papers at the University of Warwick library The latter was the more interesting.
working papers early accounts of research findings published by economics departments –in universities –in research centers –in some government offices –in multinational administrations disseminated through exchange agreements important because of 4 year publishing delay
I planned to circulate the Warwick working paper list over listserv lists. I argued it would be good for them –increase incentives to contribute –increase revenue for ILL After many trials, Warwick refused. Toward the end of that time, I was offered a lectureship, and decided to get working on my own collection.
1993: BibEc and WoPEc Fethy Mili of Université de Montréal had a good collection of papers and gave me his data. I put his bibliographic data on a gopher and called the service "BibEc" I also gathered the first ever online electronic working papers on a gopher and called the service "WoPEc".
NetEc consortium –BibEc: printed papers –WoPEc: electronic papers –CodEc: software –WebEc: web resource listings –JokEc: jokes –HoPEc A lot of Ec!
WoPEc to RePEc WoPEc was a catalog record collection. WoPEc remained the largest web access point, but getting contributions was tough. In 1996 I wrote the basic architecture for RePEc. –ReDIF –Guildford Protocol
1996: RePEc principle Many archives –archives offer metadata about digital objects (mainly working papers) One database –The data from all archives forms one single logical database despite the fact that it is held on different servers. Many services –users can access the data through many interfaces. –providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.
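The many-archives/one-database principle can be sketched as a simple union keyed by handle. A minimal sketch, assuming invented archive contents; RePEc handles are globally unique, which is what makes the union well defined:

```python
# Sketch of the RePEc principle: several archives each offer metadata,
# and their union forms one logical database keyed by handle.
# Archive contents here are invented for illustration.

archive_a = [
    {"handle": "RePEc:sur:surrec:9601",
     "title": "Dynamic Aspect of Growth and Fiscal Policy"},
]
archive_b = [
    {"handle": "RePEc:xxx:wpaper:0001",
     "title": "Some Other Working Paper"},
]

def merge_archives(*archives):
    """Union of archive records into one logical database, indexed by handle."""
    database = {}
    for archive in archives:
        for record in archive:
            database[record["handle"]] = record  # handles are globally unique
    return database

database = merge_archives(archive_a, archive_b)
# Any number of user services (IDEAS, EconPapers, ...) can now be
# built over `database` without coordinating with each other.
```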
RePEc is based on archives WoPEc EconWPA DEGREE S-WoPEc NBER CEPR US Fed in Print IMF OECD MIT University of Surrey CO PAH
to form a 1M+ item dataset 390,000 working papers 620,000 journal articles 2,100 software components 22,000 book and chapter listings 27,000 author contact and publication listings 12,000 institutional contact listings
RePEc is used in many services BibEc and WoPEc EconPapers NEP: New Economics Papers Inomics RePEc Author Service Journal of Economic Literature IDEAS RuPEc EDIRC LogEc CollEc CitEc
… describes documents
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Person: RePEc:per: :thomas_krichel
Author-Email:
Author-Name: Paul Levine
Author-Email:
Author-WorkPlace-Name: University of Surrey
Classification-JEL: C61; E21; E23; E62; O41
File-URL: ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.pdf
File-Format: application/pdf
Creation-Date:
Revision-Date:
Handle: RePEc:sur:surrec:9601
… describes institutions
Template-Type: ReDIF-Institution 1.0
Primary-Name: University of Surrey
Primary-Location: Guildford
Secondary-Name: Department of Economics
Secondary-Phone: (01483)
Secondary-Email:
Secondary-Fax: (01483)
Secondary-Postal: Guildford, Surrey GU2 5XH
Secondary-Homepage:
Handle: RePEc:edi:desuruk
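Reading such templates can be sketched in a few lines. This is a minimal sketch that only splits "Field: value" lines; real ReDIF has repeatable clusters (e.g. grouped Author-* fields) and an official parser, both of which this ignores:

```python
# Naive ReDIF-style template reader: collects "Field: value" lines into
# a dict of lists (fields such as Author-Name can repeat).
# Not a real ReDIF parser -- clusters and continuations are ignored.

def parse_redif(text):
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        field, _, value = line.partition(":")  # split on the first colon only
        record.setdefault(field.strip().lower(), []).append(value.strip())
    return record

sample = """Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Name: Paul Levine
Handle: RePEc:sur:surrec:9601"""

record = parse_redif(sample)
```

Splitting on the first colon only matters because values like handles (RePEc:sur:surrec:9601) themselves contain colons.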
what value does RePEc add? RePEc identifies records. RePEc relates identified records. These actions require human control. They prepare for assessment of performance.
key to success Have a small group of volunteers Disseminate as widely as possible Demonstrate to authors and institutions that it works for them. –institutional registration –author registration
institutional registration It started with one sad geezer making a list of departments that have a web site. I persuaded him that his data would be more widely used if integrated into the RePEc database. Now he is a happy geezer and one of our three crucial volunteers.
author registration It started when funding allowed us to hire a crazy programmer to write an author registration system. The system went online as HoPEc and has since been renamed the RePEc Author Service (RAS). A 2003 grant from OSI allowed for a rewrite and expansion.
RePEc author service RePEc document data has author names as strings. The authors register with RAS to list contact details and identify the papers they wrote. This is classic access control, but done by the authors. In a ranking of the 1000 most important economists, over 80% are registered with RAS.
author incentives Authors perceive the registration as a way to achieve common advertising for their papers. Author records are used to aggregate usage logs across RePEc user services for all papers of an author. This stimulates an "I am bigger than you are" mentality. Size matters!
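The usage aggregation described above can be sketched as a join between an author's claimed papers and per-service logs. The author id, service names, handles, and counts below are all invented for illustration:

```python
# Sketch of aggregating usage logs across several user services for all
# papers of one registered author. All data here is invented.

author_papers = {
    "pkr1": {"RePEc:sur:surrec:9601", "RePEc:sur:surrec:9702"},
}

# (service, paper handle, downloads) tuples from different user services
usage_logs = [
    ("IDEAS",      "RePEc:sur:surrec:9601", 12),
    ("EconPapers", "RePEc:sur:surrec:9601",  5),
    ("IDEAS",      "RePEc:sur:surrec:9702",  3),
]

def author_usage(author_id):
    """Total downloads across all services for all of one author's papers."""
    papers = author_papers[author_id]
    return sum(n for _service, handle, n in usage_logs if handle in papers)

total = author_usage("pkr1")  # 12 + 5 + 3 = 20
```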
general outlook OK, it works for economics, but what about the rest of the world? There are two trends that are related to RePEc –institutional repositories –open bibliographic data [let's start with the latter]
open bibliographic data There is a growing (albeit, slowly growing) movement for open bibliographic data. Open bibliographic data is bibliographic data that comes with more liberal licensing conditions for its reuse. The main player here is the Bibliographic Data Working Group of the Open Knowledge Foundation (the group)
types of data The group works on two types of data. –One is library catalog data. –The other is scholarly paper data. Both have different challenges and opportunities. The group has defined the open bibliographic principles.
the principles Good in principle. But in practice they are difficult to implement for large collections, which are often aggregates. For RePEc they would be difficult to implement.
the metadata The group does some collection work, mainly for cataloging data. This uses RDF and semantic web technologies. I have not looked at this in much detail.
the open library society Yours truly created it. Site at y.openlib.org. The society complements the group. OLS aggregates bibliographic data without insisting on legal conditions. OLS builds useful services that make contributors happy. It's the road that RePEc took in 1993.
OLS projects The society's aims are fairly broad. Basically we are conducting work to build a RePEc for all disciplines. Some components are there –RePEc bibliographic data → 3lib –RePEc Author Service → AuthorClaim –EDIRC → ARIW
relationship to free software Basically the OLS intends to build bibliographic and associated metadata in the way free software is built. Software is generally built by reusing basic elements (libraries). Over time, geeks have found object orientation to be a way to improve reuse. We have no object-oriented metadata yet, and we can't wait for it.
emphasizing re-use OLS services are intended to re-use other sources. They are built for re-use of the data they generate. We aim to make re-use as transparent as possible. That distinguishes them from commercial dead-ends.
OLS and RePEc Since RePEc has no legal personality, the OLS has lent it its legal personality. Thus RePEc is now an OLS project, but it is de facto run by a meritocratic board. See for details.
3lib 3lib is an initial attempt at building an aggregate of freely available (sort of, as opposed to open) bibliographic data. It's a project by OLS sponsored by OKFN. About 35 million records from the usual suspects: PubMed, OpenLibrary, DBLP, RePEc…
3lib record structure The data elements in 3lib are very simple –title –author name expressions –link to item page on provider site –identifier 3lib is meant to serve AuthorClaim. Other data elements could be added to 3lib structure if needed.
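A record under the four data elements listed above can be sketched as follows. The field names, the identifier scheme, and the sample URL are assumptions for illustration, not the actual 3lib serialization:

```python
# Sketch of a minimal 3lib-style record: identifier, title, author name
# expressions, and a link to the item page. Field names are invented.

REQUIRED = ("identifier", "title", "authors", "url")

def make_record(identifier, title, authors, url):
    """Build a record and refuse it if any required element is empty."""
    record = {"identifier": identifier, "title": title,
              "authors": list(authors), "url": url}
    missing = [f for f in REQUIRED if not record[f]]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record

rec = make_record(
    "3lib:repec:RePEc:sur:surrec:9601",                # invented id scheme
    "Dynamic Aspect of Growth and Fiscal Policy",
    ["Thomas Krichel", "Paul Levine"],                 # name expressions
    "http://example.org/paper/9601",                   # provider item page
)
```

Keeping the structure this small is what makes 3lib cheap to aggregate; anything beyond what AuthorClaim needs is left out until needed.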
AuthorClaim AuthorClaim is an authorship claiming service for 3lib data. It is modeled after the RePEc Author Service. It uses the same software, called ACIS. It has been running since early 2008 and predates ORCID.
[ORCID] This is a broad initiative to build an author identification system. There is a sandbox, but that may not be the direction the project will take. There is a new whitepaper. I am a member of the technical architecture group.
author claiming vs identification Author claiming records are NOT author identification records. The difference is what I call Klink's problem –A person can claim to be an author of a paper. –If there are several authors, we don't know which author (s)he is.
Klink's problem Examples of Klink's problem –Jane and John Smith write a paper. Its author list says J. Smith and J. Smith. –Barack Obama claims a paper. The author string says Obama B., Laden et alii. In practice Klink's problem is not very important, but in theory it is.
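The Smith example above can be made concrete: a claim ties a person to a paper, but not to a position in the paper's author list, so identical name expressions stay ambiguous. The data and function names are just this example written out:

```python
# Klink's problem in miniature: which "J. Smith" slot does a claimant
# occupy? The claim alone cannot say when name expressions coincide.

paper_authors = {"p1": ["J. Smith", "J. Smith"]}  # Jane and John Smith

def candidate_positions(claimant_name_expr, paper):
    """All author positions whose name expression matches the claimant."""
    names = paper_authors[paper]
    return [i for i, n in enumerate(names) if n == claimant_name_expr]

# Both Jane's and John's claims resolve to the same ambiguous set:
candidates = candidate_positions("J. Smith", "p1")  # [0, 1]
```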
AuthorClaim name details It contains the name details as they may be found in the bibliographic data –Krichel, Thomas –T. Krichel –Томас Крихель Sometimes an author's name may not appear in the bibliographic data at all, hidden behind an "et al."
AuthorClaim contact details This is a set of trivial fields –email. This detail is required but not exported by default. –homepage. This detail is optional. –phone number. This detail is optional. –postal address. This detail is optional.
affiliations profile This is more complicated. Institutional data is kept as separate records, not as string data. Registrants can search for existing institutional records to create an affiliation with. Or they can propose a new record to be added by filling out a form.
research profile This is a collection of metadata about research documents the registrant has written. Available functions include –display a list of works in the profile –search for newly suggested works –manual search for works by title –display refused research documents –change preferences for automatic updates (next)
automatic updates By default, a regular search using the name variations profile identifies a set of potential new documents and reports them to the user. If the registrant has accepted and refused documents, the suggested-documents list can be learned, i.e. sorted by relevance.
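One way such a list could be "learned" from the accept/refuse history is a crude word-overlap score: titles resembling accepted papers rise, titles resembling refused ones sink. This is an illustrative stand-in with invented titles, not ACIS's actual method:

```python
# Sketch of relevance-sorting suggested documents from a registrant's
# accept/refuse history, using naive title word overlap. Data invented.

accepted = ["dynamic growth fiscal policy",
            "fiscal policy in open economies"]
refused  = ["protein folding simulation"]

def score(title):
    """Overlap with accepted titles minus overlap with refused titles."""
    words = set(title.split())
    pos = sum(len(words & set(t.split())) for t in accepted)
    neg = sum(len(words & set(t.split())) for t in refused)
    return pos - neg

suggestions = ["fiscal policy and growth", "protein simulation methods"]
ranked = sorted(suggestions, key=score, reverse=True)
# the economics title sorts first; the biology title scores negative
```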
document-to-document links ACIS has the ability to manage document-to-document links. Authors can say that two documents in the profile are related. This is not running in AuthorClaim.
full-text recognition ACIS has a feature to validate document full-text: links can be confirmed or rejected. Typically such full-text files would be found by an automated search engine. I have worked on such a dataset, but that is a topic that would take us too far.
AuthorClaim data The data is at ftp://ftp.authorclaim.org, licensed CC0. Today there are 123 profiles. There is an obvious chicken-and-egg problem. It's the same problem that RePEc overcame.
AuthorClaim profiles Each profile contains 3lib data for each accepted and each refused document. Ideally, we would like to include source data (in XML) from the bibliographies we receive. Why on earth would we distribute data on refused papers?
AuthorClaim and ARIW Registrants can claim affiliation with an institution registered in ARIW. Registrants can propose new institutions to ARIW. If a proposal is accepted, ARIW can change AuthorClaim records to point to the new ARIW record. AuthorClaim exports records back to ARIW. ARIW uses them to find AuthorClaim registrants for ARIW institutions.
AuthorClaim and 3lib CrossRef Labs have an experimental search engine at f.org/search It is difficult to work with, to be polite. For AuthorClaim registered users, I run a query robot against this engine. Results flow to 3lib. Copyright issues?
AuthorProfile AuthorProfile is an information service that combines 3lib and AuthorClaim data. It is joint work with Robert J. Griffin III. The site is running at e.org. At this time I'd rather use the development site.
vision behind AuthorProfile We believe that scholarly communication is for authors more than for readers. Authors have to take priority. We want to invert bibliographic data from its conventional structure into a CV-like structure.
vertical integration Vertical integration explores collaboration between authors and aunexes. From aunexes, we can link to authors. We want to choose one upstream author or aunex to link to. The selection problem for this is complicated. We use an intensity-of-collaboration measure and then an off-centrality measure.
name problem Without author claiming, AuthorProfile is based on author name expressions. It is therefore error-prone. It gives us the chance to show how poor name data can be. It may also incentivize people to register.
auversion We build data files that contain all papers written by authors with the same name expression. This leaves us with a large number of auverted files. Each contains the documents carrying one author name expression, henceforth "aunex".
navigation We can navigate auverted data through co-authorship relationships. In addition we build –horizontal integration –vertical integration –search
horizontal integration Horizontal integration links name expressions which may represent the same author. For example: F. Lefevre and Fred Lefevre. This feature has yet to be implemented.
AuthorProfile search A search feature gives priority to registered authors. If more than one author is identified by the same name, links to all of them shall be returned by the query. This feature has yet to be implemented.
ranking We intend to compute a PageRank-type algorithm to rank authors. For this we have to do some network calculations, but not a lot. At this time authors are the only entry points to the collection.
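A PageRank-type ranking can be sketched as plain power iteration over a toy co-authorship graph. The graph, damping factor, and iteration count below are illustrative assumptions, not the planned AuthorProfile computation:

```python
# Power-iteration PageRank over a tiny co-authorship graph.
# Author "a" collaborates with both others, so both link to "a".

coauthors = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration (no dangling-node handling)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new = {v: (1 - damping) / n for v in graph}
        for v, out in graph.items():
            share = damping * rank[v] / len(out)
            for w in out:
                new[w] += share  # v passes an equal share to each neighbor
        rank = new
    return rank

ranks = pagerank(coauthors)
# "a" receives two in-links and ends up ranked highest
```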
AuthorProfile documentation There is a special documentation interface. It shows the entire software written for the project, both in Perl and XSLT. The aim is to have the entire internals of the project available. This is also done for ARIW, but there it's done in a smarter way.
obstacles Scale, scale, scale. Representation of negativity. Instant data injection from AuthorClaim.
institutional repositories Running institutional repositories is one way (the only one?) that academic libraries can stay in business. Take-up has been slow. Same issues as with AuthorClaim. A layer of use/assessment is missing. OAI-DC data from repositories is notoriously difficult to use.
BASE Recently OLS has acquired a data feed from the BASE search engine. This contains post-processed data harvested from institutional repositories, for inclusion in AuthorClaim. We want to develop an add-on to DSpace for AuthorClaim records.
AuthorClaim and IRs There is a theory, developed by Ivan Kurmanov, about levels of interoperability between AuthorClaim and institutional repositories. All levels were once implemented in a test using ACIS and EPrints.
the institutional repository (IR) Its main function is to collect document data. Author metadata comes as part of the metadata information with the document. The author may be identified within the scope of the collection. The user of the IR is called a submitter.
AuthorClaim AuthorClaim is a service that collects personal data and connects it with metadata about documents. The key data is author name data and document identifiers. Authors can contact AuthorClaim to identify themselves. Once they are registered, they can say what documents they have authored. The AuthorClaim user is called a registrant.
AuthorClaim IR interoperability Interoperability comes in different levels. With each level up, we have more (better) interoperability. We have levels 0 to 4. At level zero, AuthorClaim and the IR simply live side by side, and no interaction is happening.
level 1 In level 1, the IR provides metadata about its documents to AuthorClaim. This will be intermediated by BASE. AuthorClaim processes the data periodically –add new records to the document stock –perform probationary associations between documents and registrants
level 2 An IR delivers to AuthorClaim data for some of its authorships that point to data in AuthorClaim. ACIS will accept any of the following 3 identification avenues –an identifier known to AuthorClaim –a shortid, previously generated by AuthorClaim –an email address, known to AuthorClaim as the login of a registrant This data will have to be entered by the submitter.
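The three identification avenues above can be sketched as one resolution step on the AuthorClaim side. The registry contents, the sample tokens, and the function name are invented for illustration:

```python
# Sketch of resolving a level-2 identification token (full identifier,
# shortid, or login email) to an AuthorClaim registrant. Data invented.

registrants = {
    "pkr1": {"shortid": "tkr", "email": "krichel@example.org"},
}

def resolve(token):
    """Try the three avenues in turn; return the registrant id or None."""
    if token in registrants:                      # avenue 1: full identifier
        return token
    for rid, r in registrants.items():
        if token in (r["shortid"], r["email"]):   # avenues 2 and 3
            return rid
    return None

who = resolve("tkr")  # all three tokens lead to the same registrant
```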
level 3 The IR helps submitters to find the data required for level 2 interoperability. While submitters enter authorship data, the IR performs searches in the AuthorClaim data. If matching records are found, the submitter is invited to select them. The document data is then exported to AuthorClaim in the usual way.
implementing level 3 AuthorClaim needs to expose registrant data to the IR. The data cannot be made publicly available if we want the email to be an avenue of identification. The IR must –search the AuthorClaim data –display optional matches in an unobtrusive way –give submitters an easy way to choose an option.
level 4 The IR immediately notifies AuthorClaim about a document submission. AuthorClaim processes the notification, and the document is added to the research profiles of its identified authors. (You may argue: it's on a different level.)
level dependency There is level dependency –Level 1 is really required for the other levels. –Level 2 is a basis for level 3. –Level 4 can be done without either level 2 or level 3. It does not really matter: current ACIS code can implement all four levels. There is ACIS code for EPrints 2.0 that implements the dosus side of the interoperability.