Open and self-sustaining digital library services: the example of NEP. Thomas Krichel, 2005-06-29
introduction The title "Open and self-sustaining digital libraries" was chosen before I was really aware of the needs of the audience. I read in the announcement that I am supposed to talk about "information retrieval and automatic text processing". This is an area I don't know that much about, but I hope to ask some interesting questions. I hope to find someone who is interested enough in some of them to work with me.
my background I am a trained economist. An economist knows the price of everything and the value of nothing. I am interested in free digital libraries. "Free" can mean "бесплатный" (free of charge) or "свободный" (free as in freedom). I am interested more in the former than in the latter. My work has mainly been on building such digital libraries. I am less concerned with the usage of such libraries. The building and maintenance of the library will generate costs. How can it be given away for $0?
automation Digital libraries could be entirely automated. This is true if the purpose of the digital library is mainly to retrieve information. Generally speaking, for information retrieval an automated system is quite sufficient. Examples are Google and CiteSeer.
limit to automation The limit comes in when the library is used to assess underlying facts. If we say "Thomas Krichel wrote paper X" the computer will not understand who Thomas Krichel is. Only a human can know for sure. When the library is used for evaluative purposes, it needs some controlled human intervention. By evaluative purpose I mean the purpose of saying how well a person or institution has performed.
evaluative purpose This seems vague, but here are some evaluative issues in academic libraries:
–which journal is the most cited in field X?
–who has written the most papers in field Y?
–which institution has the most researchers in field Z?
Human intervention is critical because of
–the identification problems that we have discussed
–the problem of abuse and fraud
why bother with evaluation? For a self-sustaining freely available digital library, the problem of contribution is critical. Providers of data will have good incentives if the data that they contribute is used to evaluate performance. In academic digital libraries a crucial ingredient that helps performance is visibility. Publish (in the sense of "make public") or perish, quite literally.
role of automated means Ideally a digital library will use a mixture of automated and human activity. We push automation as far as we can, and let humans do the rest. The design and successful implementation of such digital libraries is a complex long-run task. It can be helped if the digital library is also open.
Example: RePEc This is what I am most famous for. I founded the RePEc digital library. In fact its creation in 1997 goes back to efforts that I made as early as … RePEc is a digital library that aims to document key aspects of the discipline of Economics. It is essentially a metadata collection. But it goes beyond document and collection metadata to collect data about academic authors and institutions. These data on authors and institutions stand in relation to the document metadata.
RePEc is based on 440+ archives: WoPEc, EconWPA, DEGREE, S-WoPEc, NBER, CEPR, US Fed in Print, IMF, OECD, MIT, University of Surrey, CO PAH
to form a 300+k item dataset
–146,000 working papers
–154,000 journal articles
–1,600 software components
–900 book and chapter listings
–6,400 author contact and publication listings
–8,400 institutional contact listings
RePEc is used in many services: EconPapers, NEP: New Economics Papers, Inomics, RePEc author service, Z39.50 service by the DEGREE partners, IDEAS, RuPEc, EDIRC, LogEc, CitEc
institutional registration This works through a system called EDIRC. Christian Zimmermann started it as a list of departments that have a web site. I persuaded him that his data would be more widely used if integrated into the RePEc database. Now he is a crucial RePEc leader.
LogEc It is a service by Sune Karlsson that tracks usage of items in the RePEc database:
–abstract views
–downloads
Christian Zimmermann sends a monthly usage summary by mail to
–archive maintainers
–RAS registrants
authors' incentives Authors perceive the registration as a way to achieve common advertising for their papers. Author records are used to aggregate usage logs across RePEc user services for all papers of an author. This stimulates an "I am bigger than you are" mentality. Size matters!
NEP: New Economics Papers NEP is a current awareness service for new working papers in RePEc. Working papers are accounts of recent research findings prior to formal publication. Formal publication takes about four years in Economics, so no formal paper is new.
NEP reports NEP is a collection of subject-specific reports. Each report is a serial. It has issues, usually every week. Each report has
–a code, e.g. nep-mic
–a subject, e.g. microeconomics
–an editor, i.e. a human who controls the contents.
A special NEP report, nep-all, contains all new papers.
history Initially, I opened NEP in … John S. Irons agreed to be the general editor. The general editor is the person who
–prepares nep-all
–oversees the lists
In early 2005, the command structure was changed to
–a general editor, who prepares nep-all
–a managing director, who opens new reports and communicates with the editors
–a controller, who watches what editors are doing
edition control In the years 1999 to 2001 I took a rather peripheral interest in NEP. At this time many reports developed long editorial delays or were not issued at all. Despite that, the number of reports still grew. But there is no organization of reports along the lines of subjects in economics. The report subject space is linear, with most subjects being covered.
coverage ratio analysis In a paper by Krichel & Bakkalbasi, there is an effort to analyze the coverage ratio of NEP issues. This is the share of papers in a nep-all issue that make it into at least one subject report. Historical data shows that the mean coverage ratio is not improving over time. Rather it stays constant at around 70%. There are two theories that can help to explain the static nature of the coverage ratio.
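As an illustration of the definition above, here is a minimal sketch of how a coverage ratio could be computed for one nep-all issue. The function name, the paper handles and the report contents are hypothetical and are not taken from the Krichel & Bakkalbasi paper.

```python
def coverage_ratio(nep_all_issue, subject_reports):
    """Share of papers in one nep-all issue announced in at least one subject report."""
    all_papers = set(nep_all_issue)
    if not all_papers:
        raise ValueError("the nep-all issue is empty")
    announced = set()
    for papers in subject_reports.values():
        announced.update(papers)
    # Only papers that were actually in this nep-all issue count as covered.
    covered = all_papers & announced
    return len(covered) / len(all_papers)


# Hypothetical toy issue: the handles p1..p10 are made up for illustration.
issue = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
reports = {
    "nep-mic": ["p1", "p2", "p3"],
    "nep-mac": ["p3", "p4", "p5"],
    "nep-ecm": ["p6", "p7"],
}
ratio = coverage_ratio(issue, reports)
print(ratio)
```

A paper announced in several reports (p3 here) is counted once, since the ratio only asks whether a paper reached at least one subject report.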
coverage ratio theory I: target size When editors compose the subject report, they have an implicit report size in mind. When nep-all is large, the editors will be more selective. That is, they will take a narrow view of the subject area. The chances of a paper being included in the subject report are likely to be smaller when a nep-all issue is large.
coverage ratio theory II: quality Papers in RePEc have different quality. Some papers have problems with "substantive quality". They
–come from authors that are unknown
–come from institutions that have an unenviable research reputation
–appear in collections that are unknown.
Some papers have problems with "descriptive quality". They have
–no English text
–no abstract
–no keywords
Editors also filter for quality.
empirical study Krichel & Bakkalbasi investigate this by using a binary logistic regression analysis. This estimates, for every paper that appeared in nep-all, the probability that it will get announced in any subject report. They find support for both the target size and the quality theories. There is strong empirical support that the series matters. There is also some empirical support that author prolificacy matters. These results have been greeted with protests by the editors, who claim that they only consider the subject when making decisions.
pre-sorting reports As RePEc grows, the increasing size of nep-all threatens the survival of NEP. Editors simply don't want to cope with it. In 2001 I developed an idea to pre-sort the report for the editors. A computer program would look at past issues of the report, extract features, and make forecasts about the most likely papers. Editors would then only need to look at the top part of the pre-sorted nep-all issue, not at the bottom.
current state of play I extract the following features:
–author names
–title
–abstract
–keywords
–Journal of Economic Literature (JEL) classifications
–series
I remove punctuation, lowercase, and normalize using L2. I submit the result to svm_light for classification. I test using 300 records, and use the rest for training.
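The preprocessing steps on this slide can be sketched as follows. This is a minimal illustration, not the code used for NEP: the field-name prefix on each token, the raw-count weighting and the sample record are assumptions. The output line follows the SVMlight input convention of a target value followed by feature:value pairs with increasing feature numbers.

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def l2_normalized_counts(record):
    """Turn a record (dict of field name -> text) into an L2-normalized bag of words."""
    counts = Counter()
    for field, text in record.items():
        # Assumption: tokens are prefixed with the field name, so that a word
        # in the title and the same word in the keywords stay distinct features.
        for token in tokenize(text):
            counts[f"{field}:{token}"] += 1
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {feat: v / norm for feat, v in counts.items()} if norm else {}


def svmlight_line(label, features, vocabulary):
    """Format one example as '<target> <id>:<value> ...' with ids in increasing order."""
    pairs = []
    for feat, value in features.items():
        # Feature ids start at 1; an unseen feature gets the next free id.
        feat_id = vocabulary.setdefault(feat, len(vocabulary) + 1)
        pairs.append((feat_id, value))
    body = " ".join(f"{i}:{v:.6g}" for i, v in sorted(pairs))
    return f"{label:+d} {body}"


# Hypothetical record, made up for illustration.
record = {"title": "Growth, Trade and Inequality", "keywords": "growth; trade"}
features = l2_normalized_counts(record)
vocabulary = {}
line = svmlight_line(+1, features, vocabulary)
print(line)
```

The same `vocabulary` dictionary would have to be shared across all training and test records so that feature ids stay consistent.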
How well am I doing? This is not a trivial question. Precision and recall are useless: it does not matter which documents the system judges relevant. Only the ordering matters. We know the best and worst outcomes. Some measures have been proposed that do take ordering into account. But they still need to be applied to our case. Ideally I would have a measure that evaluates individual outcomes and that has some normalization properties:
–The value of the measure at the best outcome should be 1.
–The expected value of the measure under random ordering should be 0.
the hiking measure One measure that I have developed is what I call the hiking measure.
–I define a step as a swap of two neighboring documents in the outcome vector.
–I write s(x) for the number of steps that it takes to get from an outcome x to be evaluated to the best outcome.
–Then the hiking measure is h(x) = 1 – 2 s(x) / (r (n – r))
–where n is the total number of documents and r is the number of relevant documents.
example r=2 n=5 Here is the complete table of outcomes x and hikes h(x):
x          h(x)     x          h(x)
1,1,0,0,0   1.0     1,0,0,0,1   0.0
1,0,1,0,0   2/3     0,0,1,1,0  -1/3
0,1,1,0,0   1/3     0,1,0,0,1  -1/3
1,0,0,1,0   1/3     0,0,1,0,1  -2/3
0,1,0,1,0   0.0     0,0,0,1,1  -1.0
Problems:
–no strict ordering: different outcomes have the same hikes
–violation of a "natural order of outcomes"
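The table can be recomputed with a short sketch. It rests on two assumptions: a step swaps two neighboring documents, and the step count is divided by r(n – r), the largest possible number of steps. Under these assumptions the best outcome scores 1, the worst scores -1, and the values for r=2, n=5 average to 0. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def steps_to_best(outcome):
    """Minimal number of neighbor swaps that moves every 1 ahead of every 0."""
    steps = 0
    zeros_seen = 0
    for relevant in outcome:
        if relevant:
            steps += zeros_seen  # this relevant paper must pass every 0 to its left
        else:
            zeros_seen += 1
    return steps


def hike(outcome):
    """Hiking measure h(x) = 1 - 2 s(x) / (r (n - r)), as an exact fraction."""
    n = len(outcome)
    r = sum(outcome)
    if r == 0 or r == n:
        raise ValueError("need at least one relevant and one irrelevant document")
    return 1 - Fraction(2 * steps_to_best(outcome), r * (n - r))


def all_outcomes(n, r):
    """Every 0/1 vector of length n with exactly r ones."""
    for positions in combinations(range(n), r):
        yield tuple(1 if i in positions else 0 for i in range(n))


table = {x: hike(x) for x in all_outcomes(5, 2)}
for x, h in table.items():
    print(x, h)
```

The printout also shows the two problems listed on the slide: (0,1,1,0,0) and (1,0,0,1,0) share a hike of 1/3, and (1,0,0,0,1) scores above (0,0,1,1,0) even though its last relevant paper sits further down.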
natural order A conscientious editor will be concerned by how low the last relevant paper sinks. Thus, comparing two outcomes, the one that has the last relevant paper at a lower position number (that is, nearer the top) should be preferred. If two outcomes have the last relevant paper at the same position, the second-to-last relevant paper should be compared, and so on. This leads to a complete ordering of outcomes.
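The comparison rule can be written as a sort key. This is a minimal sketch that assumes all compared outcomes have the same number of relevant papers and that position 1 is the top of the list; the function names are illustrative.

```python
from itertools import combinations


def natural_key(outcome):
    """Positions (1 = top) of the relevant papers, with the last relevant paper first."""
    positions = [i for i, relevant in enumerate(outcome, start=1) if relevant]
    return tuple(reversed(positions))


def all_outcomes(n, r):
    """Every 0/1 vector of length n with exactly r ones."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


# Tuples compare element by element, so sorting by the key compares the last
# relevant paper first, then the second-to-last, and so on. Smaller is better.
ranked = sorted(all_outcomes(5, 2), key=natural_key)
for x in ranked:
    print(x, natural_key(x))
```

Because every outcome has a distinct key, no two outcomes tie, which is the completeness property claimed on the slide.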
conjecture A rational editor faces two penalties when composing the report:
–examining a new paper
–risking the loss of a relevant paper
I claim that under a large class of formulations of the editor's choice, ranking outcomes by the natural order is consistent with minimizing the loss experienced by the editor. But I cannot show this.
one way for the computational implementation of natural order Derive an algorithm that will associate consecutive natural numbers with each of the outcomes, ordered by the natural order. The expected value is then trivial to compute, and a measure can be defined. Does anyone know such an algorithm?
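One candidate answer, offered as a sketch and not as the author's solution: for outcomes with a fixed number r of relevant papers, comparing the last relevant position first, then the second-to-last, is the colexicographic order on the sets of relevant positions. The combinatorial number system then gives consecutive ranks directly. With 0-based relevant positions c_1 < c_2 < ... < c_r, the rank is C(c_1, 1) + C(c_2, 2) + ... + C(c_r, r). The normalization in `natural_measure` is an assumption chosen so that the best outcome scores 1 and a uniformly random ordering scores 0 on average.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def natural_rank(outcome):
    """Rank of an outcome in the natural order; 0 is the best outcome."""
    positions = [i for i, relevant in enumerate(outcome) if relevant]  # 0-based, ascending
    return sum(comb(c, k) for k, c in enumerate(positions, start=1))


def natural_measure(outcome):
    """1 at the best outcome, -1 at the worst, mean 0 under random ordering."""
    n, r = len(outcome), sum(outcome)
    worst = comb(n, r) - 1
    if worst == 0:
        raise ValueError("only one possible outcome; the measure is undefined")
    return 1 - Fraction(2 * natural_rank(outcome), worst)


# Check on the r=2, n=5 example: the ten outcomes receive the ranks 0..9.
outcomes = [
    tuple(1 if i in chosen else 0 for i in range(5))
    for chosen in combinations(range(5), 2)
]
ranks = sorted(natural_rank(x) for x in outcomes)
print(ranks)
```

Since the ranks are the consecutive integers 0 to C(n, r) – 1 and every ordering is equally likely under random ordering, the expected rank is (C(n, r) – 1) / 2, which is what makes the normalized measure average to 0.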
a more flexible way for the computational implementation of natural order Pick y > 1. Then evaluate any outcome as sum(y**p * i) over all positions
–where p is the position, counted from the right
–i=1 if the document at that position is relevant
–i=0 if not
Example: for y=2, interpret x as a binary number.
Example for y=3: … -> 3**1*0 + 3**2*0 + 3**3*1 + 3**4*1 + 3**5*…
Does anybody know the expected value?
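On the expected value, a sketch under stated assumptions: if every ordering of r relevant papers among n is equally likely, each position is relevant with probability r/n. By linearity of expectation the expected score is then (r/n) times the sum of y**p over all positions. The code checks this by enumeration. The `start` parameter is an assumption that covers both readings on the slide: `start=1` matches the exponents in the y=3 example, and `start=0` matches reading x as a binary number.

```python
from fractions import Fraction
from itertools import combinations


def weighted_score(outcome, y, start=1):
    """Sum of y**p over relevant documents, with p counted from the right."""
    n = len(outcome)
    # The rightmost document has exponent `start`, the leftmost `start + n - 1`.
    return sum(
        y ** (start + n - 1 - idx)
        for idx, relevant in enumerate(outcome)
        if relevant
    )


def expected_score(n, r, y, start=1):
    """Expected score when all orderings are equally likely."""
    return Fraction(r, n) * sum(y ** p for p in range(start, start + n))


def brute_force_mean(n, r, y, start=1):
    """Average score over every outcome with r relevant documents out of n."""
    scores = [
        weighted_score(tuple(1 if i in chosen else 0 for i in range(n)), y, start)
        for chosen in combinations(range(n), r)
    ]
    return Fraction(sum(scores), len(scores))


print(expected_score(5, 2, 3), brute_force_mean(5, 2, 3))
```

The closed form makes it possible to center the score without enumerating outcomes, which matters once n is the size of a real nep-all issue.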
outcome: average hike, 30 trials
exp 98.66, cis 98.35, spo 96.08, ets 95.75, tra …, hea 95.50
dcm 94.76, geo …, int …, ecm …, gth 94.09, dge …
mon 92.54, eff …, ene …, ifn 90.64, ino 90.31, cba 90.04
fmk 89.90, ure …, hpe 88.91, agr 88.89, evo 87.90, law …
env …, cul 86.39, cbe 85.76, ent 85.07, com …, net …
edu 83.80, lab …, dev …, cfn …, res …, sea 82.25
ias …, cmp …, tur …, fin …, tid …, pbe 78.99
pol …, mfd …, eec …, mac …, rmg 76.22, cdm 76.12
cwa 75.38, pub …, his 71.90, ltv 71.23, afr 69.72, acc 68.72
ind …, lam 66.20, mic 61.17, reg 59.12, pke 58.85, bec 57.76
some remarks There is a great diversity in the results. Some topics are easier to classify automatically than others. The value of the report lies in what the human says that goes beyond the recognition by the machine. Unfortunately, manual inspection of poorly forecast results suggests that the reason for the poor result may lie more in the inconsistency of editor decision making than in the forecasting technique. This suggests that forecasting could be used as an evaluation device for the editors. This was not intended when I started this work!
how to improve Clearly word ordering is important in this area, since different classes don't differ that much by word choice. I can use all the keyword data in the RePEc database to find phrases to add to my feature set. There may also be a way to automatically deduce significant word combinations from titles and abstracts. Finally, a combination with the quality criteria mentioned may be good, but it is not obvious how to do it.
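One simple way to look for word combinations, shown only as a sketch: count pairs of adjacent words and keep those that occur in several documents. The document-frequency threshold and the sample titles are assumptions made for illustration; this is not a method described in the talk.

```python
import string
from collections import Counter


def words(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def frequent_bigrams(texts, min_docs=2):
    """Adjacent word pairs that appear in at least `min_docs` different texts."""
    doc_freq = Counter()
    for text in texts:
        tokens = words(text)
        # A set is used so that each pair counts once per document.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return {" ".join(pair) for pair, df in doc_freq.items() if df >= min_docs}


# Hypothetical titles, made up for illustration.
titles = [
    "Monetary policy and exchange rate regimes",
    "Optimal monetary policy under uncertainty",
    "Exchange rate pass-through in small open economies",
]
phrases = frequent_bigrams(titles)
print(sorted(phrases))
```

Phrases found this way could be added to the feature set next to the single-word features, in the same way as the keyword phrases mentioned on the slide.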
conclusions To provide high quality digital library services, human intervention still appears to be desirable. However, we need ways to monitor how well the humans are doing and whether they take bad decisions. Forecastability can be one criterion. Timeliness and usage can be others. I will have to work further to develop better monitoring systems for editor behavior.