Presentation on theme: "Free author registration Thomas Krichel LIU & НГУ 2008-12-11."— Presentation transcript:
free author registration Thomas Krichel LIU & НГУ
me today I am working for the Palmer School of Library and Information Science in the College of Information and Computer Science of the CW Post Campus of Long Island University in Brookville NY, U.S.A. and for the Division of Information Systems in the Faculty of Information Technology at Novosibirsk State University in Novosibirsk, Russia. I do a lot of programming & sysadmin.
formerly I am a trained economist. My main claim to fame is the creation and coordination of the RePEc digital library for economics. My main area of work within RePEc is the NEP: New Economics Papers current awareness service. It's a totally different topic.
RePEc now It is a collection of data about academic economics. The bulk of the data is data about documents. And the bulk of that is –published article data –working paper data But the interesting data is the author, institution and usage data.
RePEc principle of 1997 many archives –archives offer metadata about digital objects (mainly working papers & journal articles) one database –the data from all archives forms one single logical database many services –users can access the data through many services –providers of archives offer their data to all services
RePEc is based on 900+ archives Blackwell MPRA DEGREE S-WoPEc NBER CEPR Taylor & Francis US Fed in Print IMF OECD MIT University of Surrey CO PAH Elsevier
to form a 630k item dataset 254,000 working papers 370,000 journal articles 1,600 software components 4,200 book and chapter listings 17,600 author records 10,800 institutional contact listings
RePEc is used in many services EconPapers NEP: new economics papers Google Scholar RePEc Author Service Twitter bulk posting (planned) LogEc IDEAS RuPEc EDIRC CitEc MPRA
… describes documents
template-type: redif-paper 1.0
title: dynamic aspect of growth and fiscal policy
author-name: thomas krichel
author-person: repec:per: :thomas_krichel
author-
author-name: paul levine
author-
author-workplace-name: university of surrey
classification-jel: c61; e21; e23; e62; o41
file-url: ftp://www.econ.surrey.ac.uk/pub/repec/sur/surrec/surrec9601.pdf
file-format: application/pdf
creation-date:
revision-date:
handle: repec:sur:surrec:9601
… describes institutions
template-type: redif-institution 1.0
primary-name: university of surrey
primary-location: guildford
secondary-name: department of economics
secondary-phone: (01483)
secondary-
secondary-fax: (01483)
secondary-postal: guildford, surrey gu2 5xh
secondary-homepage:
handle: repec:edi:desuruk
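Both templates are flat lists of field: value pairs, and a field such as author-name may repeat. A minimal sketch of reading such a template in Python follows. It assumes one field per line, and the function name parse_template is hypothetical; this is not the parser that RePEc services actually use.

```python
def parse_template(text):
    """Parse a ReDIF-style template into an ordered list of (field, value) pairs.

    Assumption: one `field: value` pair per line. A line that does not look
    like a field (no colon, or spaces before the colon) is treated as a
    continuation of the previous value.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        field, sep, value = line.partition(":")
        if sep and " " not in field:
            pairs.append((field.strip().lower(), value.strip()))
        elif pairs:
            # continuation line: glue it onto the previous value
            prev_field, prev_value = pairs[-1]
            pairs[-1] = (prev_field, (prev_value + " " + line).strip())
    return pairs


sample = """\
template-type: redif-paper 1.0
title: dynamic aspect of growth and fiscal policy
author-name: thomas krichel
author-name: paul levine
author-workplace-name: university of surrey
classification-jel: c61; e21; e23; e62; o41
handle: repec:sur:surrec:9601
"""

record = parse_template(sample)
# repeated fields are kept in order, so both authors survive
authors = [value for field, value in record if field == "author-name"]
handle = dict(record)["handle"]
```

A list of pairs rather than a dictionary is used because the order and repetition of author fields carry meaning in the template.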
author registration It started when JISC funding allowed us to hire a student to write an author registration system. The system went online as HoPEc in late It has been renamed RePEc Author Service (RAS). A 2002 grant from OSI allows for a rewrite and expansion.
researcherID ResearcherID is a system by Thomson ISI. It allows authors to find their documents. It has been modeled after the RePEc Author Service. But the document and personal records are not freely available.
success of RAS Measuring the success of an author registration service is difficult in general. In RePEc we are fortunate that an independent list of top 1000 authors exists. Of those 80% are registered.
author registration ? Author registration is not disambiguation of names. Author registration is not authority control. Author registration is usually done by authors themselves. It involves two steps –Registrants put in some personal data. –Registrants find in the document data the records about documents they have written.
personal data This contains required elements: –person's name –email address and optional elements –institutional affiliation –homepage URL
search for authorships This is based on a set of name variations. A name variation is a string by which document metadata authors may have referred to the registrant. Example: –Thomas Krichel –Крихель, Т. Registrants maintain a name variations profile.
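A search by name variations can be sketched as follows. This is only an illustration under stated assumptions: the normalization (case folding, dropping commas and periods), the function names and the document handles such as doc:1 are all made up here, and the matching in RAS may well differ.

```python
import re
import unicodedata


def normalize(name):
    """Casefold a name, drop commas and periods, and collapse whitespace."""
    name = unicodedata.normalize("NFKC", name).casefold()
    name = re.sub(r"[.,]", " ", name)
    return " ".join(name.split())


def suggest_documents(variations, documents):
    """Return handles of documents with an author string equal to any variation."""
    wanted = {normalize(v) for v in variations}
    return [
        doc["handle"]
        for doc in documents
        if any(normalize(author) in wanted for author in doc["authors"])
    ]


# the two name variations from the example above
profile = ["Thomas Krichel", "Крихель, Т."]

# hypothetical document records; the handles are invented for the example
documents = [
    {"handle": "doc:1", "authors": ["thomas krichel", "paul levine"]},
    {"handle": "doc:2", "authors": ["Крихель Т"]},
    {"handle": "doc:3", "authors": ["Paul Levine"]},
]

matches = suggest_documents(profile, documents)
```

The registrant would then accept or refuse each suggested document; the sketch only produces the suggestions.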
authors An author is a registrant who has at least one work claim. Since author registration is a pioneering innovation by yours truly, its purpose is not yet clearly understood. A user who registers to gain access to data is called a bozo registrant. RAS managers periodically clear presumed bozo registrants.
free? as in $0 Registrants don't pay money for registration. Document data providers don't pay to have their document data listed. Registrants' data is freely available if they allow it.
free ? as in freedom Author records are freely available for any purpose, as long as we have the registrants' consent. Registrants' consent is assumed for anything but the email address. By default email addresses are not exported.
freedom is crucial Users will not register with the intention that the records will be used. They will prefer a system that has high re-usage. Therefore I am confident an open system will win over a closed system.
free document data In principle, document data has to contain only three fields –Title –Author name expressions –URL for further information and/or Such data is in principle not copyrightable. But there are still only few sources that have such data readily available.
service implementation scale Registration of authors can be conducted against any document dataset. What is the appropriate set –type scale? –subject scale? RAS shows it works at the scale of a single discipline with research paper documents, both articles and working papers. But economics is fairly insular.
AuthorClaim.org Since 2008 yours truly has been working on an interdisciplinary system. This will be the last important project before my death. The idea is that it will help the fledgling institutional repository (IR) movement. Since IRs currently are either empty or contain rubbish, AuthorClaim has to be primed with other contents.
datasets The datasets used in AuthorClaim are –PubMed (problematic) –DBLP (XML file only) –CiteSeer –arXiv (not announced yet) –CIS (non-free dataset) –E-LIS Work is under way to include a broad range of the repositories listed in DOAR.
PubMed The 800 pound gorilla of bibliographic datasets, with 17 million records. Free only as $0, through a convoluted license. In addition, NLM added the condition that I would not offer the personal records to them. Just saying that they would refuse them if I offered them was not enough for them.
DBLP Not freely available either. –only an XML dump of some records (individual documents) –only for non-commercial purposes Overlap with CiteSeer would be nice to clean up.
CIS This is the Current Index to Statistics. Not a free dataset at all, but yours truly has access to a database version from which to extract the three metadata fields that are required.
DOAR repositories DOAR repositories use the OAI-PMH protocol. Harvesting often fails, and dirty UTF-8/XML seems to be the main culprit. Roughly, out of 1200 registered repositories, half work on a particular day. For roughly two thirds we can get some records by trying and stopping when the first error occurs. BTW RePEc makes for the second-largest DOAR repository by record number.
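The "try and stop at the first error" approach can be sketched like this. The sketch assumes the raw records have already been fetched and are available as strings; the function name and the sample records are hypothetical, and no actual OAI-PMH request is made.

```python
import xml.etree.ElementTree as ET


def harvest_until_error(raw_records):
    """Parse records in order and stop at the first one that is not well-formed.

    Returns the elements parsed so far and the error that stopped the run
    (None if every record parsed).
    """
    good = []
    for raw in raw_records:
        try:
            good.append(ET.fromstring(raw))
        except ET.ParseError as err:
            return good, err
    return good, None


# made-up records; the third has a bare ampersand and is not well-formed XML
raw_records = [
    "<record><title>First paper</title></record>",
    "<record><title>Second paper</title></record>",
    "<record><title>Broken & unescaped</title></record>",
    "<record><title>Never reached</title></record>",
]

records, error = harvest_until_error(raw_records)
titles = [record.findtext("title") for record in records]
```

Stopping early loses the records after the bad one, but it keeps whatever the repository delivered cleanly, which is the trade-off described above.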
subject coverage and overlap The subject coverage of AuthorClaim will remain uneven unless publishers are giving data directly (replacing libraries, eventually). Overlap is less of a problem than lack of good data. RePEc routinely groups various versions of authors' work together. This is feasible if they are in the claimed set of a person.
scaling issue With 30 times the number of records, and with PubMed only using initials (phew!), registrants with common names have large sets of potential documents to work through. Clearly they also derive more benefits. Example: Joanna P. Davies currently has 795 proposed documents. Now think about Chen or Li.
machine learning In a new project Илья Королёв and Thomas Krichel are working on enhancing ACIS to provide help through machine learning. The idea is that the users will submit a few positive and negative examples, and machine learning sorts the most likely authored documents to the front. The assessment of such a system is really interesting.
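To make the idea concrete, here is a toy stand-in in Python. It is not the method of the project, which the slides do not describe: it simply scores candidate titles by the words they share with accepted and refused titles. All titles, the length cut-off for words and the function names are assumptions made for the illustration.

```python
from collections import Counter


def tokens(text):
    """Lowercase words longer than three characters (a crude stop-word filter)."""
    return [t for t in text.lower().split() if len(t) > 3]


def rank_candidates(positives, negatives, candidates):
    """Sort candidate titles so that those resembling accepted titles come first."""
    weight = Counter()
    for title in positives:
        weight.update(set(tokens(title)))    # +1 per accepted title containing the word
    for title in negatives:
        weight.subtract(set(tokens(title)))  # -1 per refused title containing the word

    def score(title):
        return sum(weight[t] for t in set(tokens(title)))

    return sorted(candidates, key=score, reverse=True)


# made-up examples a registrant might have accepted and refused
positives = [
    "fiscal policy and economic growth",
    "open access digital library for economics",
]
negatives = ["protein folding in yeast cells"]

candidates = [
    "yeast protein expression under stress",
    "growth effects of fiscal policy rules",
    "a survey of galaxy clusters",
]

ranked = rank_candidates(positives, negatives, candidates)
```

A real system would use richer features than title words, such as co-authors or venues, but the interface stays the same: a few labelled examples in, a re-ordered suggestion list out.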
ACIS This is the Academic Contribution Information System. It is a generic software to enable author registration services that are somewhat more general. Work on ACIS was sponsored by the Open Society Institute. The software was written by Ivan V. Kurmanov. It is verrrry complicated.
basic idea A contribution is a relationship between document data records and personal records that a registrant can claim. Authorship and editorship are built-in contribution types, but others can be configured. The contribution system allows registrants to provide information about their contribution.
no document creation Using ACIS, registrants cannot create document records. While many RAS registrants want to do this, it is considered out of scope for an ACIS installation. ACIS-based systems are not supposed to substitute for but to complement the work of publishers.
ACIS implementations and document services An ACIS implementation service (AIS) can work with a document submission service (DSS). A DSS would typically run EPrints, DSpace or Fedora-Commons. While such systems are distinct, on different machines etc, they can be so interconnected that they appear integrated to a naive user.
interoperability AIS and DSS interoperability comes in different levels. With each level up, we have more (better) interoperability. We have levels 0 to 4. At level zero, an AIS and a DSS simply live side by side, and no interaction is happening.
level 1 In level 1, a DSS provides metadata about its documents to an AIS. –The data is stored in files. –in a compatible format. For ACIS this would be AMF or ReDIF. The AIS processes the data periodically. –adds new records to the document data set –performs probationary associations between documents and authors
level 2 A DSS delivers to the AIS data for some of its authorships that point to data in the AIS. The AIS can accept any of the following 3 identification avenues –an identifier known to the AIS –a shortID, previously generated by the AIS –an email address, known to the AIS as the login of a registrant. This data will have to be entered by a submitter.
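On the AIS side, resolving the three identification avenues listed above could look roughly like this. The field names (id, short_id, login_address), the order in which avenues are tried and the sample values are all assumptions for the sketch, not the ACIS data model.

```python
def resolve_registrant(reference, registrants):
    """Return the registrant matched by identifier, shortID or login address.

    The avenues are tried in that order; None means no registrant matched.
    """
    for key in ("id", "short_id", "login_address"):
        value = reference.get(key)
        if not value:
            continue
        for person in registrants:
            if person.get(key) == value:
                return person
    return None


# made-up registrant records
registrants = [
    {"id": "p1", "short_id": "s1", "login_address": "first@example.org", "name": "First Author"},
    {"id": "p2", "short_id": "s2", "login_address": "second@example.org", "name": "Second Author"},
]

by_short_id = resolve_registrant({"short_id": "s2"}, registrants)
by_login = resolve_registrant({"login_address": "first@example.org"}, registrants)
unknown = resolve_registrant({"short_id": "s9"}, registrants)
```

Because the submitter enters this data by hand, an unmatched reference should probably fall back to the ordinary name-based suggestion rather than be treated as an error.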
level 3 The DSS helps submitters to find the data required for level 2 interoperability. While submitters enter authorship data, the DSS performs searches in the AIS data. If matching records are found, the submitter is invited to select them. The document data is then exported to the AIS in the usual way.
implementing level 3 The AIS needs to expose registrants' data to the DSS. The data cannot be made available publicly if we want the email address to be an avenue of identification. The DSS must search the AIS data, display possible matches in an unobtrusive way, and give submitters an easy way to choose an option.
level 4 The DSS immediately notifies the AIS about a document submission. The AIS processes the notification, and the document is added to the research profiles of its identified authors.
level dependency There is level dependency –level 1 is really required for other levels. –level 2 is a basis for level 3. –level 4 can be done without either level 2 or level 3. Current ACIS code can implement all four levels. There is code written for EPrints 2.0 that implements the DSS side of the interoperability.
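The dependencies between levels can be written down as a small table and checked mechanically. This is only a sketch of the rules stated above; the function name and the representation of a configuration as a set of level numbers are assumptions.

```python
# prerequisites as stated: level 1 underlies all higher levels,
# level 2 is a basis for level 3, level 4 needs neither 2 nor 3
REQUIRES = {0: set(), 1: set(), 2: {1}, 3: {1, 2}, 4: {1}}


def missing_prerequisites(levels):
    """Return the sorted list of levels that must be added for `levels` to be consistent."""
    enabled = set(levels)
    missing = set()
    for level in enabled:
        missing |= REQUIRES[level] - enabled
    return sorted(missing)


ok = missing_prerequisites({1, 4})        # level 4 on top of level 1 only
needs_more = missing_prerequisites({3})   # level 3 without its basis
```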
ACIS components rid is a feeding daemon. It feeds records in files into a processor. It uses the Berkeley DB transactional database system. ARDB is a software suite that implements relational bibliographic datasets. There is a general web application layer. It fires up XSLT.
ACIS components, a few more A shortID system associates shortIDs with documents and, more importantly, registrants. A userData system manages the data handled by users and feeds it back to the ARDB system. A resources system deals with searches and suggestions.
ACIS functionality Besides the association of documents with users, ACIS provides a range of functionality that complements or extends the basic functionality. I will review some now.
ACIS contact details This is a set of trivial fields –email. This detail is required but not exported by default. –homepage –phone number –postal address We don't do pictures of the registrants' dogs etc.
affiliations profile This is more complicated. Institutional data is kept as separate records, not as string data. Registrants can search for existing institutional records to create an affiliation with. Or they can propose a new record to be added by filling out a form.
research profile This is a collection of metadata about research documents the registrant has written. Available functions include –display a list of works in the profile –search for new suggested works –manual search for works by title –display refused research documents –change preferences for automatic updates
automatic updates By default, when a document record quotes a person's shortID, the document is added to the profile. By default, a regular search using the name variations profile identifies a set of potential new documents and reports them to the user via email. The registrant may choose to have exact matches of these searches added to the research profile automatically.
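The update rules just described can be summarized in a few lines of Python. The field names (short_id, author_short_ids, name_variations, auto_add_exact) and the three outcome labels are hypothetical, and the sketch only handles exact name matches, whereas the regular search may also report looser ones.

```python
def process_new_document(doc, registrant):
    """Decide what happens to a new document record for one registrant.

    Returns "added", "suggested" or "ignored".
    """
    # rule 1: the record quotes the person's shortID, so it is added by default
    if registrant["short_id"] in doc.get("author_short_ids", []):
        return "added"

    # rule 2: an exact name-variation match is reported to the registrant,
    # and added directly only if the registrant has opted in
    variations = {v.casefold() for v in registrant["name_variations"]}
    if any(author.casefold() in variations for author in doc["authors"]):
        return "added" if registrant.get("auto_add_exact", False) else "suggested"

    return "ignored"


# made-up registrant and documents
registrant = {"short_id": "s1", "name_variations": ["Thomas Krichel"], "auto_add_exact": False}

quoted = process_new_document({"authors": ["T. K."], "author_short_ids": ["s1"]}, registrant)
by_name = process_new_document({"authors": ["thomas krichel"]}, registrant)
unrelated = process_new_document({"authors": ["Paul Levine"]}, registrant)
opted_in = process_new_document({"authors": ["thomas krichel"]}, dict(registrant, auto_add_exact=True))
```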
document to document links Document to document links can be created for authors to say that two documents in the profile are related. Document full-text links can be confirmed or rejected. Typically such full-text files would be found by an automated search external to the AIS.
citations profile Within this profile, authors can partially manage citation information for items in the research profile. Just as a DSS may submit data to an AIS, a citation discovery service may give citation data to an AIS. Such data can be maintained in the citations profile.
references processing References are processed to see if they may correspond to a document in the research profile. If a document in the profile has a potential citation it is called an interesting document. Once reference processing is done, registrants can navigate by decreasing level of interest.
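A rough sketch of this processing: each reference string is compared with the titles in the research profile, documents with at least one potential citation count as interesting, and they are listed by decreasing number of potential citations. Matching by normalized title containment, using the count as the "level of interest", and all sample data are assumptions made for the illustration.

```python
import re


def norm(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def interesting_documents(profile, references):
    """Return (handle, count) pairs for documents with potential citations.

    The list is sorted by decreasing number of potential citations.
    """
    counts = {}
    for doc in profile:
        title = norm(doc["title"])
        hits = sum(1 for ref in references if title in norm(ref))
        if hits:
            counts[doc["handle"]] = hits
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# made-up profile items and reference strings
profile = [
    {"handle": "doc:a", "title": "Dynamic aspect of growth and fiscal policy"},
    {"handle": "doc:b", "title": "An unrelated working paper"},
    {"handle": "doc:c", "title": "Free author registration"},
]
references = [
    "Krichel, T. and Levine, P. Dynamic aspect of growth and fiscal policy. Working paper.",
    "T. Krichel, P. Levine: dynamic aspect of growth and fiscal policy",
    "Krichel, T. Free author registration.",
]

by_interest = interesting_documents(profile, references)
```

Each potential citation would still go to the registrant for confirmation; the sketch only orders the work.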
suggestions processing Registrants navigate the set of suggested citations to see if the reference string really matches the research profile item. If the registrant refuses a citation, there is a screen where she can later overturn such a decision.
automatic citation updates If the reference is very close to citation data, the registrant can have it added automatically. When a co-author has identified a citation to an item in her profile, the registrant can allow it to be added automatically.