Metadata characteristics as predictors for editor selectivity in a current awareness service Thomas Krichel & Nisa Bakkalbasi 2005-10-31
outline Background to work that we did –RePEc (Research Papers in Economics) –NEP: New Economics Papers The research –Theory –Method –Results Other work done for NEP.
RePEc Digital library for academic Economics. It collects descriptions of –economics documents (working papers, articles etc) –collections of those documents –economists –collections of economists Pioneering effort to create a relational dataset describing an academic discipline as a whole. The data is freely available.
RePEc principle Many archives –Archives offer metadata about digital objects, or data about authors and institutions. One database Many services –Users can access the data through many interfaces. –Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.
it's the incentives, stupid RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library. The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.
some history Thomas Krichel in the early 1990s dreamed about a current awareness service for working papers. It would later have electronic papers. In 1993 he made the first economics working paper available online. In 1997 he wrote the key protocols that govern RePEc.
RePEc is based on 500+ archives: WoPEc, EconWPA, DEGREE, S-WoPEc, NBER, CEPR, Elsevier, US Fed in Print, IMF, OECD, MIT, University of Surrey, CO PAH, Blackwell
to form a 340+k item dataset: 161,000 working papers, 180,000 journal articles, 1,300 software components, 1,200 book and chapter listings, 8,000 author contact & publication listings, 9,100 institutional contact listings. More records than arXiv.org
RePEc is used in many services: EconPapers, NEP: New Economics Papers, Inomics, RePEc author service, Z39.50 service by the DEGREE partners, IDEAS, RuPEc, EDIRC, LogEc, CitEc
NEP: New Economics Papers This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old. Founded by Thomas Krichel in 1998. Supported by the Economics department at WUStL. Initial software was written by Jose Manuel Barrueco Cruz. First general editor was John S. Irons.
why NEP Public aim: Current awareness, if well done, can be an important service in its own right. It is sheltered from the competition of general search engines. Private aim: It is useful to have some classification information, even if it is limited. This should be useful in performance measures within subject areas.
modus operandi: stage 1 The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly. S/he filters out new descriptions of old papers –date field –handle heuristics The result is an issue of the nep-all report.
modus operandi: stage 2 Editors consider the papers in the nep-all report and pick out the papers that belong to their subject. This forms an issue of a subject report nep-???. nep-all and the subject reports are circulated via email. A special arrangement makes the data of NEP available to other RePEc services.
some numbers There are now 60+ NEP lists. Over 37k subscriptions. Close to 16k subscribers. Over 50k papers announced. Over 100k announcements. Homepage at http://nep.repec.org All this is a fantastic success!!
problem with the private aim We would need all the papers to be classified, not only the working papers. We would need to have 100% coverage of NEP. This means every paper in nep-all appears in at least one subject report.
coverage ratio We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report. We can define this ratio –for each nep-all issue –for a subset of nep-all issues –for NEP as a whole
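The definition above can be sketched in a few lines of Python. This is a minimal illustration only: the function name coverage_ratio and the paper handles are made up for the example and are not taken from the NEP software.

```python
def coverage_ratio(nep_all_issues, announced):
    """Return the share of nep-all papers announced in at least one subject report.

    nep_all_issues: iterable of issues, each an iterable of paper handles.
    announced: set of handles that appeared in any subject report.
    """
    papers = {handle for issue in nep_all_issues for handle in issue}
    if not papers:
        raise ValueError("no papers in the selected nep-all issues")
    return len(papers & set(announced)) / len(papers)


# Made-up handles, for illustration only.
issue_1 = ["p1", "p2", "p3", "p4"]
issue_2 = ["p5", "p6"]
announced = {"p1", "p2", "p3", "p5"}

ratio_issue_1 = coverage_ratio([issue_1], announced)            # a single issue
ratio_issue_2 = coverage_ratio([issue_2], announced)            # another issue
ratio_overall = coverage_ratio([issue_1, issue_2], announced)   # a set of issues
```

Passing one issue, several issues, or every issue gives the three variants of the ratio listed on the slide.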
coverage ratio theory & evidence Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. However, the evidence from research by Barrueco Cruz, Krichel and Trinidad is –The coverage ratio of different nep-all issues varies a great deal. –Overall, it remains at around 70%. We need some theory as to why.
two theories Target-size theory Quality theory –descriptive quality –substantive quality
theory 1: target size theory When editors compose a report issue, they have a target size for the issue in mind. If the nep-all issue is large, editors will take a narrow interpretation of the report subject. If the nep-all issue is small, editors will take a wide interpretation of the report subject.
target size theory & static coverage There are two things going on –The opening of new subject reports improves the coverage ratio. –The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that this growth makes the coverage ratio deteriorate. The static coverage ratio that we observe is the result of both effects canceling out.
theory 2: quality theory George W. Bush version of quality theory –Some papers are rubbish. They will not get announced. –The amount of rubbish in RePEc remains constant. –This implies constant coverage. Reality is slightly more subtle.
two versions of quality theory Descriptive quality theory: papers that are badly described –misleading titles –no abstract –languages other than English Substantive quality theory: papers that are well described, but not good –from unknown authors –issued by institutions with unenviable research reputation
practical importance We do care whether one or the other theory is true. –Target size theory implies that NEP should open more reports to achieve perfect coverage. –Quality theory suggests that opening more reports will have little to no impact on coverage. Since operating more reports is costly, there should be an optimal number of reports.
overall model We need an overall model that explains subject editors' behavior. We can feed this model with variables that represent theoretical determinants of behavior. We can then assess the strength of various factors empirically.
method The dependent variable is announced. It is 1 if the paper has been announced, 0 otherwise. Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known. That's why BLRA is popular in the life sciences.
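To make the technique concrete, here is a minimal sketch of binary logistic regression fitted by plain gradient descent. It is not the analysis reported in the talk: the data are invented, there is a single predictor (a nep-all issue size in hundreds of papers), and the names fit_logistic and predict are assumptions made for this example. A real study would use a statistics package that also reports standard errors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, rate=0.1, epochs=2000):
    """Fit P(announced = 1) = sigmoid(b0 + b1*x1 + ...) by gradient descent.

    Returns [b0, b1, ...], where b0 is the intercept.
    """
    n_features = len(rows[0])
    coefs = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(coefs[0] + sum(c * v for c, v in zip(coefs[1:], x)))
            error = p - y                      # gradient of the log-loss
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        coefs = [c - rate * g / len(rows) for c, g in zip(coefs, grad)]
    return coefs


def predict(coefs, x):
    """Predicted probability that a paper with predictors x is announced."""
    return sigmoid(coefs[0] + sum(c * v for c, v in zip(coefs[1:], x)))


# Invented toy data: issue size (in hundreds of papers) and the announced flag.
sizes = [[1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]]
announced = [1, 1, 1, 0, 1, 0, 0, 0]

coefs = fit_logistic(sizes, announced)
```

In this toy data the papers from larger issues are announced less often, so the fitted coefficient on size comes out negative, which is the sign that target size theory predicts.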
independent variables: size size is the size of the nep-all issue in which the paper appeared. This is the critical indicator of target size theory. We expect it to have a negative impact on announced.
independent variables: position position is the position of the paper in the nep-all issue. The presence of this variable can be justified by the combined assumption of target size and editor myopia. If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.
independent variables: title title is the length of the title of the paper, measured by the number of characters. This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that a paper is overlooked.
independent variables: abstract abstract is the presence/absence of an abstract to the paper. This is also motivated by descriptive quality theory. Note that we do not use the length of the abstract because that would be a highly skewed variable.
independent variables: language language indicates whether the metadata is in English or not. This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language. While there are a lot of multilingual editors, customizing this variable would have been rather hard.
independent variables: series series is the size of the series in which a paper appears. This variable is motivated by substantive quality theory. The larger a series is, the higher, usually, is its reputation. We can roughly classify series by size and quality –multi-institution series (NBER, CEPR) –large departments –small departments
independent variables: author author is the prolificacy of the authors of the paper. It is justified by substantive quality theory. This is the most difficult variable to measure. We use the number of papers written by the registered author with the highest number. Since about 50% of the papers have no registered author, a lot of them are excluded. But there should be no bias by the exclusion.
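The seven independent variables can be collected from a paper's metadata roughly as follows. This is a sketch under stated assumptions: the Paper record, its field names, and the "en" language code are hypothetical and do not reflect the actual RePEc metadata format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Paper:
    # Hypothetical record layout, for illustration only.
    title: str
    abstract: Optional[str]
    language: str                 # assumed code, e.g. "en"
    series_size: int              # number of papers in the paper's series
    author_paper_counts: List[int] = field(default_factory=list)  # one per registered author


def predictors(paper, position, issue_size):
    """Return the independent variables for one paper, or None if it is excluded."""
    if not paper.author_paper_counts:
        return None               # no registered author: excluded, as on the slide
    return {
        "size": issue_size,                               # size of the nep-all issue
        "position": position,                             # place within the issue
        "title": len(paper.title),                        # title length in characters
        "abstract": 1 if paper.abstract else 0,           # presence of an abstract
        "language": 1 if paper.language == "en" else 0,   # English metadata or not
        "series": paper.series_size,
        "author": max(paper.author_paper_counts),         # most prolific registered author
    }


example = Paper("Money demand", None, "en", 250, [3, 12])
row = predictors(example, position=7, issue_size=180)
```

Papers without a registered author yield None here, mirroring the exclusion described for the author variable.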
scandal! Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for the subject. The editors have rejected our findings. Almost all protest that there is no quality filtering.
consequences There has been no program to expand the list of reports. Instead, there has to be a concentrated effort to help editors really find the subject-specific papers. This can be done by –the use of a more efficient interface –the use of automated resource discovery methods.
ernad editing reports on new academic documents. It is a purpose-built software system for current awareness reports. It has been designed by Thomas Krichel, http://openlib.org/home/krichel/work/altai.html The system was written by Roman D. Shapiro.
statistical learning The idea is that a computer may be able to make decisions on the current nep-all reports based on the observation of earlier editorial decisions. ernad now works using support vector machines (SVM), with titles, abstracts, author names, classification values and series as features.
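The idea of learning from earlier editorial decisions can be sketched with a very small linear SVM. This is not ernad's implementation: it is a plain subgradient-descent trainer for the regularized hinge loss, it uses only title words as features (ernad also uses abstracts, author names, classification values and series), and the titles, labels and function names are invented for the example.

```python
def tokens(text):
    """Bag-of-words features: the set of lowercased words in the text."""
    return set(text.lower().split())


def train_linear_svm(texts, labels, epochs=100, rate=0.1, reg=0.01):
    """Train a linear SVM (no bias term) by subgradient descent on the hinge loss.

    labels are +1 (editor accepted the paper) or -1 (editor passed over it).
    """
    weights = {}
    for _ in range(epochs):
        for text, y in zip(texts, labels):
            words = tokens(text)
            margin = y * sum(weights.get(w, 0.0) for w in words)
            for w in weights:                  # L2 regularization shrinks every weight
                weights[w] *= 1.0 - rate * reg
            if margin < 1.0:                   # inside the margin: hinge-loss step
                for w in words:
                    weights[w] = weights.get(w, 0.0) + rate * y
    return weights


def score(weights, text):
    return sum(weights.get(w, 0.0) for w in tokens(text))


# Invented past decisions of a hypothetical report on monetary economics.
past_titles = [
    "monetary policy and inflation targeting",
    "inflation expectations and central bank policy",
    "central bank interest rate rules",
    "labor supply of married women",
    "wage inequality and education",
    "education returns in labor markets",
]
past_decisions = [1, 1, 1, -1, -1, -1]

weights = train_linear_svm(past_titles, past_decisions)

# Pre-sort a new nep-all issue: most promising papers first.
new_titles = ["education and wage growth", "inflation and interest rate policy"]
presorted = sorted(new_titles, key=lambda t: score(weights, t), reverse=True)
```

The editor still makes every decision; the scores only change the order in which the papers of the issue are presented.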
performance criteria We are not aware of performance criteria for the sorting of papers in a report. Precision and recall appear useless. Expected search length and average search length don't appear very attractive. Thus research into precise criteria is required.
SVM performance If we use average search length, we can do performance evaluations. It turns out that reports have very different forecastability. Some are almost perfect, others are weak. Again, this raises a few eyebrows!
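One common reading of average search length is the mean position, in the pre-sorted list, of the papers the editor ends up selecting. The sketch below assumes that reading; the slides do not spell out the exact definition used, and the function name and handles are made up.

```python
def average_search_length(ranked, selected):
    """Mean 1-based rank of the selected papers in the pre-sorted list."""
    ranks = [i for i, handle in enumerate(ranked, start=1) if handle in selected]
    if not ranks:
        raise ValueError("no selected paper appears in the ranked list")
    return sum(ranks) / len(ranks)


# Made-up pre-sorted issue of six papers.
ranked = ["a", "b", "c", "d", "e", "f"]

good_forecast = average_search_length(ranked, {"a", "b"})   # selected papers at the top
weak_forecast = average_search_length(ranked, {"e", "f"})   # selected papers at the bottom
```

Under this reading a lower value means the editor finds the relevant papers sooner, so reports can be compared by how close their value is to the best achievable one.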
what is the value of an editor? If the forecast is perfect, we don't need the editor. If the forecast is very weak, the editor may be a prankster.
pre-sorting reconceived We should not think of pre-sorting via SVM as something to replace the editor. We should not think of it as encouraging editors to be lazy. Instead, we should think of it as an invitation to examine some papers more closely than others.
headline vs. bottomline data The editors really have a three-stage decision process. –They read title, author names. –They read the abstract. –They read the full text. A lot of papers fail at the first hurdle. SVM can read the abstract and prioritize papers for abstract reading. Editors are happy with the pre-sorting system.
firstname.lastname@example.org http://openlib.org/home/krichel/ Thank you for your attention!