ariw and AuthorClaim: current state Thomas Krichel http://openlib.org/home/krichel prepared for the first retreat for disciplinary repositories Monterey 2009-10-19
introduction I am here representing two activities: RePEc, and the Open Library Society (OLS), which covers ariw, AuthorClaim and (RePEc?). RePEc is more established. OLS may become an umbrella organization for RePEc.
open library society This is a 501(c)(3) charity set up by Thomas Krichel to support the work on the registration systems. The purposes of the society are formulated quite generally. The society can support related purposes. In the next few weeks a formal alignment between RePEc and the society may come about, so as to enable legal representation, or at least support, through the society.
two official projects ARIW is a registry of academic and research institutions in the world. It lives at http://ariw.org. The data comes from a similar collection (academ.cc) and from Isidro Aguillo's data, which he uses to build his webometric rankings of academic institutions. It is not much maintained at this time.
data structure The data is in AMF, an XML-encoded format. Each record contains a unique id for the institution; contains an http URL; contains one or more name variations (full names, abbreviated names, names in different languages). There are country and US state units. ??? records
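The record structure listed on this slide can be sketched in Python as follows. This is only an illustration: the class name, the field names and the example values are hypothetical and are not AMF element names or actual ARIW data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Institution:
    """One institution record; field names are hypothetical, not AMF elements."""
    id: str                                   # unique id for the institution
    url: str                                  # http URL of the institution
    full_names: List[str] = field(default_factory=list)
    abbreviated_names: List[str] = field(default_factory=list)
    names_by_language: Dict[str, str] = field(default_factory=dict)

    def name_variations(self) -> List[str]:
        """All name variations: full, abbreviated, then other-language names."""
        return (self.full_names + self.abbreviated_names
                + list(self.names_by_language.values()))

    def is_valid(self) -> bool:
        """A record needs an id, an http URL and at least one name variation."""
        has_url = self.url.startswith(("http://", "https://"))
        return bool(self.id) and has_url and bool(self.name_variations())


# Made-up example record.
inst = Institution(
    id="example-1",
    url="http://www.example.edu",
    full_names=["Example University"],
    abbreviated_names=["EU"],
    names_by_language={"es": "Universidad de Ejemplo"},
)
print(inst.is_valid(), inst.name_variations())
```

Keeping the three kinds of name variation apart, rather than in one flat list, mirrors the distinction the slide makes between them.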
ariw.org web site It is designed to be redistributable. The entire site can be downloaded in one tarball. It will install with one Perl script. You may have to get some modules. The site is self-documenting. AMF data, XSLT data and scripts are all fully accessible on the site. The site's code was written by Thomas Krichel.
users At this time, AuthorClaim is the only official user of ARIW data. Registrants can claim affiliation to one or more ARIW-identified institutions. Registrants can also propose to add a new institution if they don't find theirs. As a result, an email is sent to the maintainers of ARIW.
AuthorClaim AuthorClaim is basically an implementation of the principal functions of the RePEc Author Service (RAS) on top of an interdisciplinary document dataset and the ARIW institution dataset. The most important non-principal function of RAS is citation data processing. There are no plans to integrate that.
ACIS ACIS is the academic contribution information system. It was written by Ivan V. Kurmanov. The development was funded by the Open Society Institute. There are about 150 Perl modules in the system, and a large pile of XSLT. The code is very complicated and very sparsely documented.
basic idea There are people. There are documents. There are relations between people and documents. ACIS lets people manage claims about these relations. Authorship is only one kind of claim; the system should really be called ContributorClaim, but that's not a good term from a marketing perspective.
basic process Users register. They maintain a name variations profile. The system searches the document data for matches of the name variations. On initial registration, the searches are conducted while the user waits. For registered users, the system conducts searches and informs registrants about new potential contributions that may have appeared.
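The search step of this process can be sketched in Python. This is a minimal illustration under assumptions: the function names, the normalization rule and the sample data are made up here, and the matching that ACIS actually performs may well be different.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def find_potential_claims(name_variations, documents):
    """Return ids of documents with an author string matching a name variation."""
    wanted = {normalize(v) for v in name_variations}
    return [
        doc["id"]
        for doc in documents
        if any(normalize(author) in wanted for author in doc["authors"])
    ]


# Made-up profile and document data.
profile = ["Jane Q. Doe", "Doe, Jane", "J. Doe"]
documents = [
    {"id": "d1", "authors": ["Doe, Jane", "Smith, John"]},
    {"id": "d2", "authors": ["John Smith"]},
    {"id": "d3", "authors": ["J Doe"]},
]
print(find_potential_claims(profile, documents))
```

Exact matching on normalized strings keeps the sketch short. It also shows why a profile of several name variations is needed: "Jane Doe" on its own would match none of these documents.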
exported data The personal data is exported in AMF. Some data elements that are not covered by AMF are represented through an acis namespace. The most frequent of these records papers that a registrant has refused. Why do we want to know about papers that somebody has not written, you may wonder.
important absence At this time, there are individual pages for registered users. But there is no way to search for them. More generally there is no user service on the personal data. There is no intention for the society to provide such a service at this time.
isolated uselessness The whole idea of AuthorClaim is to serve as an intermediary to which others can delegate a boring technicality. It is not meant to become a point where authors modify document data. In particular, it will never ever become a document (or metadata) submission system. That involves expertise that AuthorClaim cannot provide. AuthorClaim is a complement to, not a substitute for, the systems that feed it with document data.
document data in AuthorClaim AuthorClaim is rounding up the usual suspects: arXiv, CiteSeer, DBLP, E-LIS, PubMed, RePEc, SPIRES. Work in fall 2009 should bring in some major institutional repositories.
centralize author registration? Author registration is a simple factual claim. Claim verification does not require subject expertise. It is in principle the same process across different disciplines. It has been talked about a lot for years, but nothing much appears to have been done.
is a central system possible? On an IR level, registration of authors does not appear cost-effective. On a discipline-based r
document data format In principle ACIS uses AMF document data. De facto, it only uses four data elements: id; title; author name expressions (multiple); URL to provider site or other location data. Such data is in principle not copyrighted, a further advantage of the system.
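To make the four elements concrete, here is a small Python sketch that reduces a richer metadata record to just those four. The type name, the dictionary keys and the sample record are assumptions for illustration; they are not AMF element names.

```python
from typing import Any, Dict, List, NamedTuple


class DocumentRecord(NamedTuple):
    """The four elements used de facto; field names are hypothetical."""
    id: str
    title: str
    author_names: List[str]   # multiple author name expressions
    url: str                  # provider site or other location


def to_record(metadata: Dict[str, Any]) -> DocumentRecord:
    """Keep only the four elements and drop everything else."""
    return DocumentRecord(
        id=metadata["id"],
        title=metadata["title"],
        author_names=list(metadata["author_names"]),
        url=metadata["url"],
    )


# Made-up provider record with more fields than are needed.
rich = {
    "id": "example:0001",
    "title": "An example paper",
    "author_names": ["Doe, Jane", "John Smith"],
    "url": "http://www.example.org/papers/0001",
    "abstract": "Not needed for author claiming.",
    "year": 2009,
}
print(to_record(rich))
```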
at this stage It has not been off to a ballistic start: about 20,000,000 documents... about 30 profiles... but RePEc too took a long time to really take off. If there are no user services soon, Thomas Krichel will build one himself, possibly with the help of Jose Manuel Barrueco.
merger with RAS It is possible that in a couple of years, RAS will be merged into AuthorClaim. There are three obstacles: shortIds will have to change; citation processing will have to be included in RAS; profiling of institution choice will have to be implemented. Currently the main difference with RAS is the volume of document records.
processing of potential claims Each potential claim has to be manually processed. For people with common names this may involve processing hundreds of potential claims, especially since PubMed only recently added authors' first names. To make that easier, Thomas Krichel worked in the summer of 2009 on a learning system.
SVM learning Thomas Krichel has some experience with SVM because he is using it with great success in his current awareness work. In current awareness, learning aims to predict what papers will be included in the current version of the report. It looks at features of the papers included and excluded in past issues of the report.
learning about authors As soon as an author has accepted at least one document, and refused at least one document, it is possible to sort the suggested documents to bring the ones more likely to be accepted closer to the front. AuthorClaim now learns about suggested documents: when a new document arrives in the queue; when the user refuses documents; when the user accepts documents. This happens via a daemon. The daemon sorts the entries in a table of suggested documents.
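The sorting idea can be sketched in Python. This is a toy under stated assumptions, not the ACIS code: it uses title words as the only features and trains a small linear SVM by Pegasos-style sub-gradient descent, and all function names, parameters and titles are made up. The features and the SVM implementation that AuthorClaim actually uses are not described on the slide.

```python
from collections import defaultdict


def features(title):
    """Bag-of-words features: every lowercased title word gets weight 1.0."""
    return {word: 1.0 for word in title.lower().split()}


def dot(weights, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(k, 0.0) * v for k, v in x.items())


def train_linear_svm(examples, lam=0.01, epochs=20):
    """Pegasos-style sub-gradient descent on the regularised hinge loss.

    examples: list of (feature_dict, label), label +1 accepted, -1 refused.
    """
    weights = defaultdict(float)
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * dot(weights, x)      # computed before the update
            for k in weights:                 # regularisation: shrink weights
                weights[k] *= 1.0 - eta * lam
            if margin < 1.0:                  # inside the margin: move weights
                for k, v in x.items():
                    weights[k] += eta * y * v
    return weights


def sort_suggestions(accepted, refused, suggested):
    """Sort suggested titles, most likely to be accepted first."""
    examples = [(features(t), 1) for t in accepted]
    examples += [(features(t), -1) for t in refused]
    weights = train_linear_svm(examples)
    return sorted(suggested, key=lambda t: dot(weights, features(t)),
                  reverse=True)


# Made-up titles: one author profile with two accepted and two refused papers.
accepted = ["open access repositories in economics",
            "author registration for digital libraries"]
refused = ["protein folding in yeast cells", "yeast genome sequencing"]
suggested = ["gene expression in yeast", "metadata for digital libraries"]
ranked = sort_suggestions(accepted, refused, suggested)
print(ranked)
```

At least one accepted and one refused document are needed, as the slide says, because with only one class the classifier has nothing to separate.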
known document learning ACIS also learns about documents that have a known status: for all refused documents, put the most likely to be accepted first; for all accepted documents, put the most likely to be refused first. Learning about known documents is carried out when the user logs out.
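The two orderings can be written down in a few lines of Python. The sketch assumes that some scoring function already exists, with higher scores meaning "more likely to be accepted"; the function name, the ids and the scores are hypothetical.

```python
def order_known(accepted, refused, score):
    """Order documents of known status so the surprising ones come first.

    score: callable mapping a document to a number, higher = more likely accepted.
    """
    accepted_view = sorted(accepted, key=score)               # likeliest refusals first
    refused_view = sorted(refused, key=score, reverse=True)   # likeliest acceptances first
    return accepted_view, refused_view


# Made-up scores for four documents of known status.
scores = {"a1": 2.0, "a2": -0.5, "r1": -3.0, "r2": 0.4}
accepted_view, refused_view = order_known(["a1", "a2"], ["r1", "r2"], scores.get)
print(accepted_view, refused_view)
```

One reading of this ordering is that it surfaces the decisions most likely to be mistakes, so that a user reviewing a long list sees them first.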
competition ResearcherID is a clone of the RAS that has been done by Thomson / ISI. Neither input nor output data are freely available. CrossRef are rumoured to be working on a system codenamed CrossReg to do author registration. Competition, even if it is successful, will help to make the concept of author registration more widely known.