
Acknowledgements Ellen Fischer for her hospitality. Michael Heinz for organizing the seminar.


1 acknowledgements Ellen Fischer for her hospitality. Michael Heinz for organizing the seminar.

2 background I have just come back from an invitational meeting held by JISC and SURF about next steps in institutional repositories. Four working groups – identification (led by Andrew Treloar) – citations (led by Les Carr) – organizational matters (led by Norbert Lossau) – repository handshake (led by Peter Burnhill)

3 overview – historical introduction to RePEc – lessons for the IR movement – steps forward: ariw, AuthorClaim – some more detail about ACIS

4 RePEc History It started with me as a research assistant in the Economics Department of Loughborough University of Technology. A predecessor of the Internet allowed me to download free software without effort, but academic papers had to be gathered in a painful way.

5 CoREJ published by HMSO –Photocopied contents tables of recently published economics journals received at the Department of Trade and Industry –Typed list of the working papers recently received by the University of Warwick library. The latter was the more interesting.

6 working papers Early accounts of research findings, published by economics departments –in universities –in research centers –in some government offices –in multinational administrations. Disseminated through exchange agreements. Important because of the 4-year publishing delay.

7 I planned to circulate the Warwick working paper list over listserv lists. I argued it would be good for them –increase incentives to contribute –increase revenue for ILL. After many trials, Warwick refused. Towards the end of that time, I was offered a lectureship, and decided to get working on my own collection.

8 1993: BibEc and WoPEc Fethy Mili of Université de Montréal had a good collection of papers and gave me his data. I put his bibliographic data on a gopher and called the service "BibEc". I also gathered the first ever online electronic working papers on a gopher and called the service "WoPEc".

9 NetEc consortium – BibEc: printed papers – WoPEc: electronic papers – CodEc: software – WebEc: web resource listings – JokEc: jokes – HoPEc. A lot of Ec!

10 WoPEc to RePEc WoPEc was a catalog record collection. WoPEc remained the largest web access point, but getting contributions was tough. In 1996 I wrote the basic architecture for RePEc. –ReDIF –Guildford Protocol

11 1997: RePEc principle Many archives –archives offer metadata about digital objects (mainly working papers) One database –The data from all archives forms one single logical database despite the fact that it is held on different servers. Many services –users can access the data through many interfaces. –providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.
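The "many archives, one database" principle can be sketched as a merge keyed by handle. This is only an illustration under stated assumptions, not RePEc's actual code: the function names `merge_archives` and `titles`, and the representation of a record as a dictionary with a `Handle` key, are made up for the example.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def merge_archives(archives: Dict[str, Iterable[Record]]) -> Dict[str, Record]:
    """Combine records from several archives into one logical database.

    Hypothetical sketch: each record is a dict with a "Handle" key, and
    handles are assumed to be unique across all archives.
    """
    database: Dict[str, Record] = {}
    for archive_name, records in archives.items():
        for record in records:
            handle = record["Handle"]
            if handle in database:
                raise ValueError(f"duplicate handle {handle!r} in {archive_name}")
            database[handle] = record
    return database


def titles(database: Dict[str, Record]) -> List[str]:
    """A trivial 'service': list all titles, ordered by handle."""
    return [database[handle].get("Title", "") for handle in sorted(database)]
```

Any number of such services can read the same merged dictionary, which is the point of the design: archives publish once, and every interface sees the data.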

12 based on close to 1000 archives – WoPEc – EconWPA – DEGREE – S-WoPEc – NBER – CEPR – Blackwell – US Fed in Print – IMF – OECD – MIT – University of Surrey – CO PAH – Elsevier

13 to form a 721k item dataset – 284,000 working papers – 430,000 journal articles – 1,700 software components – 5,200 books and book chapters – 19,000 author contact and publication listings – 11,000 institutional contact listings

14 RePEc is used in many services – EconPapers – Economists Online – NEP: New Economics Papers – OAI-PMH gateway – RePEc Author Service – IDEAS – RuPEc – EDIRC – LogEc – CitEc – MPRA

15 … describes documents
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Person: RePEc:per: :thomas_krichel
Author-
Author-Name: Paul Levine
Author-
Author-WorkPlace-Name: University of Surrey
Classification-JEL: C61; E21; E23; E62; O41
File-URL: pub/RePEc/sur/surrec/surrec9601.pdf
File-Format: application/pdf
Creation-Date:
Revision-Date:
Handle: RePEc:sur:surrec:9601
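A template like the one on this slide is a sequence of "Field: value" pairs, so a reader for it can be quite small. The following is a minimal sketch, not a full ReDIF implementation. It assumes one field per line and treats field names as case-insensitive; the name `parse_template` is invented for the example.

```python
from typing import Dict, List


def parse_template(text: str) -> Dict[str, List[str]]:
    """Parse one 'Field: value' template into a dict of value lists.

    Simplifying assumptions (not the full ReDIF specification): one field
    per line, field names compared case-insensitively, and repeated
    fields kept in order of appearance.
    """
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so handles keep their colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a field line; ignored in this sketch
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


sample = """\
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Name: Paul Levine
Classification-JEL: C61; E21; E23; E62; O41
Handle: RePEc:sur:surrec:9601
"""

record = parse_template(sample)
print(record["author-name"])  # ['Thomas Krichel', 'Paul Levine']
print(record["handle"])       # ['RePEc:sur:surrec:9601']
```

Keeping every field as a list is a deliberate choice here, since fields such as Author-Name repeat within one template.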

16 … describes persons (RAS)
template-type: ReDIF-Person 1.0
name-full: MANKIW, N. GREGORY
name-last: MANKIW
name-first: N. GREGORY
handle: RePEc:per: :N__GREGORY_MANKIW
homepage: mankiw/mankiw.html
workplace-institution: RePEc:edi:deharus
workplace-institution: RePEc:edi:nberrus
Author-Article: RePEc:aea:aecrev:v:76:y:1986:i:4:p:
Author-Article: RePEc:aea:aecrev:v:77:y:1987:i:3:p:
Author-Article: RePEc:aea:aecrev:v:78:y:1988:i:2:p:
….

17 … describes institutions
Template-Type: ReDIF-Institution 1.0
Primary-Name: University of Surrey
Primary-Location: Guildford
Secondary-Name: Department of Economics
Secondary-Phone: (01483)
Secondary-
Secondary-Fax: (01483)
Secondary-Postal: Guildford, Surrey GU2 5XH
Secondary-Homepage:
Handle: RePEc:edi:desuruk

18 probably best overall number There is an independent list of the 1000 most important economists, compiled by Tom Coupe. Of the people on that list, 80% have registered with the RePEc Author Service to manually claim their papers.

19 nature of RePEc RePEc is not a service, it is a library dataset. The library is freely reusable. Re-users of RePEc data make the augmented data available. A positive feedback mechanism is born. An example is NEP.

20 business model of RePEc The business model of RePEc is similar to open source. In fact RePEc can be thought of as an application of open source coding to library-like metadata. But the aim of RePEc is not centered on research users.

21 aim of RePEc RePEc is focused on the needs of research suppliers rather than research users. The end use of RePEc data generates evaluative data. Yes, without end use there is no evaluative data but the end use is only a means to an end. This is very difficult to understand for people with a library background.

22 the coordinators There are about 10 people who spend quite a bit of their time on RePEc. They provide crucial functions to the whole. Their contributions go way beyond what would be expected within their professional settings. There is no formal list of responsibilities and no command and control structure. The RePEc-run mailing list is used to communicate.

23 hosting Hosting is a critical issue for RePEc services. A number of RePEc machines are hosted under very informal agreements. Moving to acknowledged hosting seems difficult. Libraries, for example, talk the talk but don't walk the walk.

24 one key factor emerges It is most important that authors and institutions are registered in a reusable registration framework. This is a service to be built for all disciplines. I am working on this – –

25 back to repositories There are 1300 institutional repositories registered in OpenDOAR. Of these, roughly 50% have an OAI-PMH interface that works at any given point in time. There are some pockets of academic strength in some repositories. Much of the content is of low academic value.

26 central ideas of IRs The central ideas behind repositories are close to the ideas of RePEc – Repositories store the research work of the authors of an institution. – They can be federated so as to make them interoperable. But the current state looks sad.

27 peripheral ideas Authors don't spontaneously deposit, so repositories take in other stuff – student work – presentations – course regulations – digitized old documents – course modules – old library pictures. The academic research mission is being diluted.

28 beyond that The history of RePEc suggests that repositories can grow by researcher input without the requirement of mandates. It is mainly a job of learning from RePEc. I proposed a talk about this at the SPARC repository meeting. It was rejected.

29 what can be learned from RePEc The documents in the repositories must form a library. (i) They must be sufficiently well documented as to form a basis for meaningful service (ii) Actors (= Authors and institutions) must be identified. (iii) Usage reporting must be provided to them. (iv)

30 (i) repositories to libraries Currently there is no registry for repositories. Every repository is free to adopt its own identifications. Raw handles from repositories can clash. Some components of these raw handles make little sense, e.g. eprintsgeneric. No perspectives of evolution.

31 (ii) better metadata There is no common way to refer to a splash page or full-text files. There is no agreed way to provide collection information. There is a format that I defined with the help of some friends, the Academic Metadata Format. But the peripheral dilution prevents its adoption.

32 (iv) usage stats RePEc user services contribute to a single usage statistics aggregator. It collects two types of data – abstract views – full-text downloads. Both have some measurement problems. But they still give a rough idea.
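The aggregation step can be sketched as a per-document tally over events reported by several services. This is a hedged illustration only: the event tuple layout, the event type names `abstract_view` and `download`, and the function `aggregate` are assumptions, not the format the RePEc aggregator actually uses.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (service name, document handle, event type); all names are hypothetical
Event = Tuple[str, str, str]

EVENT_TYPES = ("abstract_view", "download")


def aggregate(events: Iterable[Event]) -> Dict[str, Counter]:
    """Sum abstract views and full-text downloads per document across services."""
    totals: Dict[str, Counter] = {}
    for _service, handle, kind in events:
        if kind not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {kind}")
        totals.setdefault(handle, Counter())[kind] += 1
    return totals
```

The sketch counts raw events and does nothing about robots or repeated clicks, which is where the measurement problems mentioned on the slide would need to be handled.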

33 (iii) author and institution registration Since a freely available digital library should mainly work for evaluation purposes, registration of authors and institutions is critical. Institutions can be centrally registered; I have such a site with data at

34 author registration ? Author registration is not disambiguation of names. Author registration is not authority control. Author registration is usually done by authors themselves. It involves two steps –Registrants put in some personal data. –Registrants find, in the document data, records about documents they have written.

35 personal data This contains required elements: –person's name –email address and optional elements –institutional affiliation –homepage URL

36 search for authorships This is based on a set of name variations. A name variation is a string by which document metadata may have referred to the registrant. Example: –Thomas Krichel –Крихель, Т. Registrants maintain a name variations profile.
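A search over name variations can be sketched as a normalized string match against the author strings of each document record. The sketch below is an assumption about one simple way to do it, not the ACIS search code; `normalize`, `suggest_documents` and the `authors` key are names invented here.

```python
import unicodedata
from typing import Iterable, List, Set


def normalize(name: str) -> str:
    """Apply Unicode NFKC, case-fold, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())


def suggest_documents(variations: Iterable[str], documents: List[dict]) -> List[dict]:
    """Return documents in which any author string equals a name variation."""
    wanted: Set[str] = {normalize(v) for v in variations}
    return [
        doc for doc in documents
        if any(normalize(author) in wanted for author in doc["authors"])
    ]
```

Because matching is done on whole normalized strings, a registrant who wants "T. Krichel" to match has to add it as a variation, which mirrors the idea that registrants maintain their own profile.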

37 authors An author is a registrant who has at least one work claim. Since author registration is a pioneering innovation by yours truly, its purpose is not yet clearly understood. A user who registers to gain access to data is called a bozo registrant. RAS managers periodically clear presumed bozo registrants.

38 free? as in $0 Registrants don't pay money for registration. Document data providers don't pay to have their document data listed. Registrants' data is freely available if they allow it.

39 free ? as in freedom Author records are freely available for any purpose, as long as we have the registrants' consent. Registrants' consent is assumed for anything but the address. By default addresses are not exported.

40 freedom is crucial Users will not register with the intention that the records will be used. They will prefer a system that has high re-usage. Therefore I am confident an open system will win over a closed system.

41 free document data In principle, document data has to contain only three fields –Title –Author name expressions –URL for further information and/or Such data is in principle not copyrightable. But there are still only few sources that have such data readily available.

42 service implementation scale Registration of authors can be conducted against any document dataset. What is the appropriate set? –type scale? –subject scale? RAS shows it works at the scale of a single discipline with research documents, both papers and articles. But economics is fairly insular.

43 Since 2008 yours truly has been working on an interdisciplinary system. This will be the last important project before my death. The idea is that it will help the fledgling repository movement. Since IRs currently are either empty or contain rubbish, AuthorClaim has to be primed with other contents.

44 competition: researcherID researcherID is a system by Thomson ISI. It allows authors to find their documents. It has been modeled after the RePEc Author Service. But the document and personal records are not freely available.

45 datasets The datasets used in AuthorClaim are –PubMed (problematic) –DBLP (XML file only) –CiteSeer –arXiv (not announced yet) –CIS (non-free dataset) –E-LIS. Work is under way to include a broad range of the repositories listed in DOAR.

46 PubMed The 800 pound gorilla of bibliographic datasets, with 17 million records. Free only as $0, through a convoluted license. In addition, NLM added the condition that I would not offer the personal records to them. Just saying that they would refuse them if I offered them was not enough for them.

47 DBLP Not freely available either. –only an XML dump of some records (individual documents) –only for non-commercial purposes Overlap with CiteSeer would be nice to clean up.

48 CIS This is the Current Index to Statistics. Not a free dataset at all, but yours truly has access to a database version from which to extract the three metadata fields that are required.

49 DOAR repositories DOAR repositories use the OAI-PMH protocol. Dirty UTF-8/XML seems to be the main culprit when harvesting fails. Roughly, out of 1200 registered repositories, half work on a particular day. For roughly two thirds we can get some records by trying and stopping when the first error occurs. BTW RePEc makes for the second-largest DOAR repository by record number.
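The "try and stop at the first error" approach can be sketched without any network access by treating each harvested response as a chunk of bytes. This is an illustrative sketch under assumptions: the chunks stand in for whatever an OAI-PMH harvester has already fetched, and `collect_until_error` is a name made up here, not part of any harvesting library.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def collect_until_error(chunks: Iterable[bytes]) -> List[ET.Element]:
    """Parse XML chunks in order and stop at the first one that fails.

    Everything parsed before the failure is kept, so a repository with
    one bad record still yields its earlier records.
    """
    parsed: List[ET.Element] = []
    for chunk in chunks:
        try:
            parsed.append(ET.fromstring(chunk))
        except ET.ParseError:
            break  # malformed XML: keep what we have so far
    return parsed
```

A more forgiving harvester could skip the bad chunk and continue, but with paged protocols a failed page usually means the pointer to the next page is lost as well, so stopping is the conservative choice.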

50 subject coverage and overlap The subject coverage of AuthorClaim will remain uneven unless publishers are giving data directly (replacing libraries, eventually). Overlap is less of a problem than lack of good data. RePEc routinely groups various versions of authors' work together. This is feasible if they are in the claimed set of a person.

51 scaling issue With 30 times the number of records, and with PubMed only using initials (phew!), registrants with common names have large sets of potential documents to work through. Clearly they also derive more benefits. Example: Joanna P. Davies currently has 795 proposed documents. Now think about Chen or Li.

52 machine learning In a new project Ilja Kurliov and Thomas Krichel are working on enhancing ACIS to provide help through machine learning. The idea is that the users will submit a few positive and negative examples, and machine learning sorts the most likely authored documents to the front. The assessment of such a system is really interesting.
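The idea of ranking from a few positive and negative examples can be illustrated with a deliberately crude stand-in: score each candidate title by the words it shares with accepted and refused examples. This is not the method of the project described on the slide, which is not specified there; the word-overlap scoring and the names `tokens` and `rank` are assumptions chosen only to show the shape of the interaction.

```python
from collections import Counter
from typing import Iterable, List


def tokens(title: str) -> List[str]:
    """Lower-cased alphabetic words of a title."""
    return [t for t in title.lower().split() if t.isalpha()]


def rank(candidates: Iterable[str], positives: Iterable[str],
         negatives: Iterable[str]) -> List[str]:
    """Sort candidate titles so that the most likely authored ones come first."""
    weights: Counter = Counter()
    for title in positives:
        weights.update(set(tokens(title)))    # +1 per word seen in an accepted title
    for title in negatives:
        weights.subtract(set(tokens(title)))  # -1 per word seen in a refused title

    def score(title: str) -> int:
        return sum(weights[t] for t in set(tokens(title)))

    return sorted(candidates, key=score, reverse=True)
```

A real system would presumably use richer features such as co-authors, venues and dates, and assessing it means checking how far down the sorted list a registrant has to read before all true works are found.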

53 ACIS This is the Academic Contribution Information System. It is a generic software to enable author registration services that are somewhat more general. Work on ACIS was sponsored by the Open Society Institute. The software was written by Ivan V. Kurmanov. It is verrrry complicated.

54 basic idea A contribution is a relationship between document data records and personal records that a registrant can claim. Authorship and editorship are built-in contribution types, but others can be configured. The contribution system allows registrants to provide information about their contribution.

55 no document creation Using ACIS, registrants cannot create document records. While many RAS registrants want to do this, it is considered out of scope for an ACIS installation. ACIS-based systems are not supposed to substitute for, but to complement, the work of publishers.

56 ACIS implementations and document services An ACIS implementation service (henceforth: AuthorClaim) can work with a document submission service (henceforth: IR). While such systems are distinct, on different machines etc., they can be so interconnected that they appear integrated to a naive user.

57 interoperability AuthorClaim and IR interoperability comes in different levels. With each level up, we have more (better) interoperability. We have levels 0 to 4. At level zero, an AuthorClaim and an IR simply live side by side, and no interaction is happening.

58 level 1 In level 1, an IR provides metadata about its documents to AuthorClaim. –The data is stored in files. –in a compatible format; for AuthorClaim this would be AMF, or be translated to it. AuthorClaim processes the data periodically. –adds new records to the document data set –performs probationary associations between documents and authors

59 level 2 An IR delivers to the AuthorClaim data for some of its authorships that point to data in the AuthorClaim. AuthorClaim can accept any of the following 3 identification avenues –an identifier known to AuthorClaim –a shortID, previously generated by AuthorClaim –an address, known to AuthorClaim as the login of a registrant. This data will have to be entered by a submitter.
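Resolving any of the three identification avenues to the same registrant can be sketched as a single lookup table. The class and field names below (`Registrant`, `Registry`, `login_address`) are hypothetical and do not reflect how ACIS stores its data; the sketch also assumes that the three kinds of value never collide with each other.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Registrant:
    identifier: str      # an identifier known to AuthorClaim
    short_id: str        # a shortID previously generated by AuthorClaim
    login_address: str   # the address used as the registrant's login


class Registry:
    """Hypothetical lookup over the three avenues named on the slide."""

    def __init__(self, registrants: Iterable[Registrant]) -> None:
        self._by_key: Dict[str, Registrant] = {}
        for registrant in registrants:
            keys = (registrant.identifier, registrant.short_id,
                    registrant.login_address.lower())
            for key in keys:
                self._by_key[key] = registrant

    def resolve(self, value: str) -> Optional[Registrant]:
        """Return the registrant for any accepted value, or None."""
        value = value.strip()
        # Try the exact value first, then a lower-cased login address.
        return self._by_key.get(value) or self._by_key.get(value.lower())
```

An IR would call `resolve` on whatever the submitter typed and attach the returned identifier to the authorship it exports.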

60 level 3 The IR helps submitters to find the data required for level 2 interoperability. While submitters enter authorship data, the IR performs searches in the AuthorClaim data. If matching records are found, the submitter is invited to select them. The document data is then exported to the AuthorClaim in the usual way.

61 implementing level 3 AuthorClaim needs to expose registrants' data to the IR. The data cannot be made available publicly if we want the address to be an avenue of identification. The IR must search the AuthorClaim data, display possible matches in an unobtrusive way, and give submitters an easy way to choose an option.

62 level 4 The IR immediately notifies AuthorClaim about a document submission. AuthorClaim processes the notification, and the document is added to the research profiles of its identified authors.

63 level dependency There is level dependency –level 1 is really required for other levels. –level 2 is a basis for level 3. –level 4 can be done without either level 2 or level 3. Current ACIS code can implement all four levels. There is code written for EPrints 2.0 that implements the IR side of the interoperability.
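The dependencies between levels can be written down as a small table and checked mechanically. This is a sketch of one reading of the slide, with level 1 taken as the only prerequisite of level 4; the names `REQUIRES` and `levels_needed` are invented for the example.

```python
from typing import Dict, Set

# Prerequisites as read from the slide: level 1 underlies all other levels,
# level 3 builds on level 2, and level 4 needs neither level 2 nor level 3.
REQUIRES: Dict[int, Set[int]] = {0: set(), 1: set(), 2: {1}, 3: {2}, 4: {1}}


def levels_needed(target: int) -> Set[int]:
    """Return the target level plus every level it depends on."""
    needed = {target}
    for prerequisite in REQUIRES[target]:
        needed |= levels_needed(prerequisite)
    return needed


print(sorted(levels_needed(3)))  # [1, 2, 3]
print(sorted(levels_needed(4)))  # [1, 4]
```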

64 ACIS components rid is a feeding daemon. It feeds records in files into a processor. It uses the Berkeley DB transactional database system. ARDB is a software suite that implements relational bibliographic datasets. There is a general web application layer. It fires up XSLT.

65 ACIS components, a few more A shortID system associates shortIDs with documents and, more importantly, with registrants. A userData system manages the data handled by users and feeds it back to the ARDB system. A resources system deals with searches and suggestions.

66 ACIS functionality Besides the association of documents with users, ACIS provides a range of functions that complement or extend the basic functionality. I will review some now.

67 ACIS contact details This is a set of trivial fields –email address. This detail is required but not exported by default. –homepage –phone number –postal address. We don't do pictures of the registrants' dogs etc.

68 affiliations profile This is more complicated. Institutional data is kept as separate records, not as string data. Registrants can search for existing institutional records to create an affiliation with. Or they can propose a new record to be added by filling out a form.

69 research profile This is a collection of metadata about research documents the registrant has written. Available functions include –display a list of works in the profile –search for new suggested works –manual search for works by title –display refused research documents –change preferences for automatic updates

70 automatic updates By default, when a document record quotes a person's shortID, the document is added to the profile. By default, a regular search using the name variations profile identifies a set of potential new documents and reports them to the user via email. The registrant may choose to have exact matches of these searches added to the research profile.
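The three rules on this slide can be sketched as one update run over new document records. The data layout (`Profile`, the `short_ids` and `authors` keys, the `auto_add_exact` preference) is assumed for illustration and is not the ACIS data model.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Profile:
    short_id: str
    name_variations: Set[str]
    auto_add_exact: bool = False          # the registrant's opt-in preference
    works: List[str] = field(default_factory=list)


def process(profile: Profile, documents: List[dict]) -> Tuple[List[str], List[str]]:
    """Return (added, suggested) document ids for one update run."""
    added: List[str] = []
    suggested: List[str] = []
    for doc in documents:
        if doc["id"] in profile.works:
            continue  # already claimed
        if profile.short_id in doc.get("short_ids", []):
            added.append(doc["id"])  # record quotes the shortID: add by default
        elif profile.name_variations & set(doc["authors"]):
            # Exact name match: add only if the registrant opted in,
            # otherwise report it as a suggestion.
            (added if profile.auto_add_exact else suggested).append(doc["id"])
    profile.works.extend(added)
    return added, suggested
```

The suggested list is what would be reported to the registrant after each run.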

71 document to document links Document to document links can be created for authors to say that two documents in the profile are related. Document full-text links can be confirmed or rejected. Typically such full-text files would be found by an automated search external to the AIS.

72 citations profile Within this profile, authors can partially manage citation information for items in the research profile. Just as a DSS (document submission service) may submit data to an AIS (ACIS implementation service), a citation discovery service may give citation data to an AIS. Such data can be maintained in the citations profile.

73 references processing References are processed to see if they may correspond to a document in the research profile. If a document in the profile has a potential citation it is called an interesting document. Once reference processing is done, registrants can navigate by decreasing level of interest.
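Reference processing can be sketched as matching reference strings against the titles in a research profile and then ordering the documents by how many potential citations each one has. Two things here are assumptions rather than ACIS behaviour: the word-overlap test with a threshold, and reading "level of interest" as the number of potential citations. The function names are invented.

```python
import re
from typing import Dict, List, Set


def words(text: str) -> Set[str]:
    """Lower-cased alphanumeric words of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def potential_citations(titles: List[str], references: List[str],
                        threshold: float = 0.8) -> Dict[str, List[str]]:
    """Map each profile title to the reference strings that may cite it."""
    hits: Dict[str, List[str]] = {title: [] for title in titles}
    for reference in references:
        reference_words = words(reference)
        for title in titles:
            title_words = words(title)
            if not title_words:
                continue
            share = len(title_words & reference_words) / len(title_words)
            if share >= threshold:
                hits[title].append(reference)
    return hits


def by_decreasing_interest(hits: Dict[str, List[str]]) -> List[str]:
    """List the interesting documents, most potential citations first."""
    interesting = [title for title, refs in hits.items() if refs]
    return sorted(interesting, key=lambda title: len(hits[title]), reverse=True)
```

The registrant would then walk through the suggestions for the first document in this ordering, confirming or refusing each one.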

74 suggestions processing Registrants navigate the set of suggested citations to see if the reference string really matches the research profile item. If the registrant refuses a citation, there is a screen where she can later overturn such a decision.

75 automatic citation updates If the reference is very close to citation data, the registrant can have it added automatically. When a co-author has identified a citation to an item in her profile, the registrant can allow it to be added automatically.


77 thank you for your attention!
