Quality assessment of an academic current awareness system: the case of NEP Thomas Krichel LIU & HГУ 2007–1121.
Presentation on theme: "Thomas Krichel LIU & HГУ 2007–11—21"— Presentation transcript:
1Thomas Krichel LIU & HГУ 2007–11—21 NEPThomas KrichelLIU & HГУ2007–11—21
2acknowledgments Thanks to organizers Andrew M. Cox. I thank everybody involved in RePEc and NEP, as well as JISC.
3current awarenessA “Current Awareness” aka “Selective Dissemination of Information” service informs users about new documents in their area of interest.It is niche activity that has been neglected by the general search engines.
4common features Current awareness generates a double classification in subject ...time ...matter.
5types of current awareness There is no official typology that I know of.I distinguishcommon classification basedkeyword basedusage history basedothersMost of them have a problem when it comes to academic information.
6common classification based You have a set of classifications say “sports”, “national politics”, “gossip”. Computer filters into the classes.Works very well as long as the vocabularies in documents of different classes is very distinct.Example: Google NewsAcademic information tends to be difficult to classify by computer.
7keyword basedA user builds a profile of her interest and files it with a service provider. The provider will mail documents that contain keywords.Example: Nexis contents alert.Works well if one gets the keywords right. Unfortunately this is difficult.For academic documents, getting the keywords right seems next to impossible.
8usage history basedA users logs into a system. The system tracks usage and build a profile of interest of the user.Works very badly if user interests change.Example: amazon.com suggestions.For academic documents there is no unified system the usage of which a computer could watch.
9academic current awareness All three types work badly with academic documents.But current awareness is the essence of academic publishing.Academic journals are, in essence, nothing but current awareness services.But there is more and more work published outside of formal academic journals.
10a new approachRather than only publishing in journals in some academic fields authors publish through other channels.This tendency is likely to grow.If there is a library that fills with records about these documents, we can build a new type current awareness system.We have to have a digital library first.
11the RePEc digital library RePEc is a digital library about economics.Founded by Thomas Krichel in 1997, based on work he started in 1993.It is compiled from over 800 sources.Currently describes over items.It has a current awareness system.
12introducing NEP“NEP: New Economics Papers” is a current awareness system for RePEc.It is based atSelections into topics is done manually by volunteers.
13who volunteers for NEP? Many volunteers are junior academics. They have good incentives. Theyneed to be aware of latest literatureare absent from some informal circulation channels of senior academicsneed to “get their name around” among peersSome are senior people but at second-tier institutions.
14NEP service modelThere is a basic model behind this service we could call the “NEP Service Model”.It is characterized bytwo stages...flat report space...
15general two stage setup First stage: A general editor compiles a list of all new papers. This forms an issues of the “allport”.Second stage: A group of subject editors filter the new allport issue into subject reports. Each editor does this independently for the subject reports she looks after.
16a flat spaceThere is a series of reports. Each reports has a number of issues over time.There is an “allport” a report that contains all papers that are new in the period covered by the issue.
17first stage in NEPGeneral editor compiles a list of recent additions to the RePEc working papers data.Computer generatedJournal articles are excludedExamined by the General Editor (GE, a person)This list forms an issue of nep-all.
18first stage in NEP nep-all contains all new papers Circulated to nep-all subscriberseditors of subject-reports
19second stageEach editor creates, independently, a subject report for her subject. She does this by removing from nep-all.A subject report issues (SRI) is the result of this process.There have been over 50,000 SRIs issued through the lifetime of NEP to date.
20historyThere are basically two phases in NEP the pre-ernad 1998 to 2004 and the post-ernad phase.I will deal with pre-ernad history here.Some research on NEP has been conducted in the pre-ernad phase.This has informed the work that went into ernad.
21early history System was conceived by Thomas Krichel Name “NEP” by Sune KarlssonImplemented by José Manuel Barrueco Cruz.Started to run in May 1998.
22starting setup First the system was all email based. The nep-all was composed as an .It was sent to editors as an .Editors used whatever tool they used to compose the .
23web interfaceJohn S. Irons issued the first web interface for report composition onThis would just compose the report.Editors would still cut and paste the results of the form into clients.
24historic mail support First mail support was given by mailbase.ac.uk. When this was closed in , NEP moved to jiscmail.ac.uk.Since the mailing list service was only supposed to be for UK academic community, it was deemed not sustainable.Thomas Krichel started hosting lists on It is a nightmare.
25Aeroflot documentThe Aeroflot document was a thinking piece that Thomas Krichel wrote as early as atThis paper already sets out ideas for what would be ernad.At that time the Siberian RePEc team promised help with building such a system.
26discover disasterIn Jeremiah Cochise Trinidad Christensen and Thomas Krichel were the first people to try to get a systematic picture of how NEP works.They discover that this is exceedingly difficult.
27mail log parsing Logs were not moved from to Maibase to JISCMail. Mailbase removed the logs in Thomas Krichel got them just before they were destroyed.The mail logs were the only source for historic NEP information.
28parsing targetshandles: severely compromised by cut-n- paste operations, editor locales, etc.date of issue: editors were free to set dates, nep-all dates may not be preservedtime of issue: an is almost impossible to time.
29state of pre-ernad data After a regular expressions orgy, we can get some approximate idea about the handles that were used.Thus the thematic component is roughly intact.We have a problem with a bug in the discovery program that made many papers appear several times in nep-all. This makes it difficult to associate subject and allport issues.
30state of pre-ernad data Timing of s is extremely difficult, even with full headers.The logs of the Mailbase system only have times for when the client said it sent the mail. This is the local editor's PC time, can be years out of whack!We still have some data for research...
31research conducted on NEP Most of the research conducted on ernad has been done in the pre-ernad phase.The difficulties of some of this work has informed the construction of ernad.
32Chu and Krichel (2003)Heting Chu & Thomas Krichel (2003) “NEP: Current awareness service of the RePEc digital library”. vaguely talks about NEP. Notes that there is a problem of timeliness in the subject report issue, despite the very shaky data.
33Barrueco Cruz et al. (2003)Jose Manuel Barrueco Cruz, Thomas Krichel and Jeremiah Cochise Trinidad- Chrisitensen “Organizing Current Awareness in a Large Digital Library” o.pdf have two themesoverlap between reports...coverage ratio...as well as history and suggestions.
34overlapBarrueco Cruz et al (2003) argue that overlap occurs not when two papers are appearing in the two reports, but when the two reports are read by the same readers.They have data on pairwise overlap between reports, based on crude membership data.
35overlap puzzle Here is a puzzle to think about If a person will be interested in two subject areas because they are close, she will subscribe to both reports.But since they are thematically close, she will sometimes receive the same papers twice.With mail technology and asynchronous issue generation, this appears difficult to solve.
36coverage ratioWe call the coverage ratio the number of papers in nep-all that have been announced in at least one subject report.We can define this ratiofor each nep-all issuefor a subset of nep-all issuesfor NEP as a whole
37coverage ratio theory & evidence Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase.However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad isThe coverage ratio of different nep-all issues varies a great deal.Overall, it remains at around 70%.We need some theory as to why.
38Krichel & Bakkalbaşı (2005) Thomas Krichel and Nisa Bakkalbaşı “Developing a predicitve model of editor selectivity in a current awareness service of a large digital library”. boston.pdf
39coverage ratio theories Krichel & Bakkalbaşı (2005) build two theories of the observations of Barrueco Cruz at al. (2003)They areTarget-size theoryQuality theorydescriptive qualitysubstantive quality
40theory 1: target size theory When editors compose a report issue, they have a size of the issue in mind.If the nep-all issue is large, editors will take a narrow interpretation of the report subject.If the nep-all ratio is small, editors will take a wide interpretation of the report subject.
41target size theory & static coverage There are two things going onThe opening new subject reports improves the coverage ratio.The expansion of RePEc implies that the size of nep-all, though varying in the short-run, grows in the long run. Target size theory implies that the coverage ratio deteriorates.The static coverage ratio is the result of both effects canceling out.
42theory 2: quality theory George W. Bush version of quality theorySome papers are rubbish. They will not get announced.The amount of rubbish in RePEc remains constant.This implies constant coverage.Reality is slightly more subtle.
432 versions of quality theory Descriptive quality theory: papers that are badly describedmisleading titlesno abstractlanguages other than EnglishSubstantive quality theory: papers that are well described, but not goodfrom unknown authorsissued by institutions with unenviable research reputation
44practical importanceWe do care whether one or the other theory is true.Target size theory implies that NEP should open more reports to achieve perfect coverage.Quality theory suggests that opening more report will have little to no impact on coverage.Since operating more reports is costly, there should be an optimal number of reports.
45resultsKrichel & Bakkalbaşı (2005) build a binary logistic regression analysis model.They find positive evidence for both target size and quality theory.The NEP editors don't like the results. They insist that they only filter by topic.
46Bátiz-Lazo & Krichel (2005) Bernardo Bátiz-Lazo and Thomas Krichel “On-line distribution of working papers through NEP: A Brief Business History” el.pdf has an early history of NEP that covers organizational details I don't talk about here.
47ernad stands for editing reports on new academic documents. Software system designed by Thomas Krichel at hel/work/altai.html.Software written in Perl by Roman D. Shapiro. Cost $2000.Started to work after
48cut editor freedom I Editors no longer send mail to lists. Only one address sends mail.But the mail appears like coming from the editor:From: Marcus DesjardinReply-To: Marcus Desjardin
49cut editor freedom IIEditors can no longer edit report issue s, e.g. add announcements of conferences.They are generated from XML files into standardized text and HTML files bound together by MIME multipart/alternative.They can not change dates of issue.
50help editorsProvide a simple-to-use interface for the composition of reportsprovide an easy to scroll inputallow for easy sorting of reportdo a better job at pretty-printingGet ready for the introduction of pre-sortingActually presorting was only introduced in
51statistical learningThe idea is that a computer may be able to make decision on the current nep-all reports based on the observation of earlier editorial decisions. This is known as pre-sorting.Thomas Krichel “Information retrieval performance measures for a current awareness report composition aid” sendai.pdf deals with the evaluation of presorting.
52presorting When an allport issue is created, it is presorted. In the allport rif each paper has a number in document order. That number is still reported in the presorted rif.The method is support vector machines svm, using svm_light.
53pre-sorting reconceived We should not think of pre-sorting via SVM as something to replace the editor.We should not think about it encouraging editors to be lazy.Instead, we should think it as an invitation to examine some papers more closely than others.
54headline vs. bottomline data The editors really have a three stage process of decision.They read title, author names.They read the abstract.They read the full text.A lot of papers fail at the first hurdle.SVM can read the abstract and prioritize papers for abstract reading.Editors are happy with the presorting system.
55what use is an editor?It turns out that reports have very different forecastability. Some are almost perfect, others are weak.If the forecast is very weak the editor may bea geniusa prankster.
56svm training setThe positive examples are taking from the report up to a certain time limit, called the experience length.The negative results are taken from nep-all, from the date of the last issued subject report until the experience length.The experience length is fixed in ernad.conf. For NEP it is 13 months.
57features selectionWe use individual words out of the contents from titles, author names, abstracts, classification data and the id of the RePEc series.We normalize the Euclidean sum of the feature weights.We run svm_light with the default settings.
58presorting timelineWhen a nep-all issues has been created, a customized version of its rif is created in the source/us directory.This issue is then presorted. The presorted version is stored in the source/ps directory.Presorting therefore only takes account of the information available at nep-all creation time.
59underlying technologies Written in Perl, using LibXSLT.Uses mod_perl under Apache 2.Runs on Debian GNU/Linux, could run on similar systems.Ernad needs to used some sort of mailing system but is not geared to a specific system. It basically just sends mail.
60underlying information AMF is a format for description of academic documents and academics.It is based on XML Schema, itself based on XML.Report data and issue data is encoded in AMF.
61ernad.conf Ernad uses one single configuration file ernad.conf. It has a simple attribute = value structure.
62affordances and domains There are basically four things that an ernad-based current awareness system provides for.For each of these “affordances”, we have a separate domain.This allows for distinct affordances to be run by distinct domains.
63the composition domain This is the domain used by the report issue composition interface.This is the virtual domain that the ernad apache is running under.The ernad process creates files so the apache server is best run as the user ernad.Recall ernad requires mod_perl, which in turn is incompatible with suexec.
64the service domainThis is where potential reader look at information about the ernad servicewhat reports are availablewho edits themThis domain is fixed through the reports configuration file report.amf.xml.
65the list domain This is the domain where the mailing lists are under domain of web interfacedomain of the mailing listsEach report report has a list where listdomain is the list domain.This domain is fixed through the reports configuration file report.amf.xml.
66delivery domainThe links to the full text use the encode the identifier of the paper and the identifier of the report.This allows to see what report readers requested.It is imperative that these links are not further disseminated. There should be no archives of nep lists.It is fixed in ernad.conf.
67reports.amf.xmlThe first part has the definition of NEP itself. (next slide)The second part as the definition of a report (slide after)The allport is the first listedreports.amf.xml fixesreport handleseditor information, (incl. here: editor ids)list domainservice domain
70operation A rif always as a name yyyy-mm-dd_tist, where yyyy is the year, mm month, dd day of the nep-all issue.tist is a UNIX time stamp, i.e. the number of seconds that have passed since the first of January 1970rifs are never deleted. When an operation is made, a new version of the rif with a new tist is written.
71creating a sri 0After login in to ernad, an editor sees a set of allport issues to work on.This is the report selection stage.If there is no allport issue needs working on a “sorry” message is displayed.
72creating a sri 1When a subject report issue is created, it is copied from the source/ps or source/us directory, depending on whether the editor chooses to work with the presorted or the unsorted version of the report.
73creating a sri 1.5If there is no paper worth to be included in the report, the editor can move back from the paper selection to the issue selection stage. There she can delete the issue.The rif of the issue is not deleted. Instead an empty issue is created in the final “sent” directory.
74creating a sri 2Once papers have been selected, a rif is created in the into the director “selected”. This rif only contains the selected papers.If there are changes in the selections, new rifs are created, as soon as the editor moves to the next screen.
75creating a sri 3Once papers have been ordered, a rif is created in the into the director “ordered”.If there are changes in the order, new rifs are created, as soon as the editor moves to the next screen.
76creating a sri 4Once the sri has been previewed the editor can click the “send” button. The rif is stored in the “sent” directory.Ernad created a mail file containing a HTML and text version of the sri and places it in the “mail” directory.This file can be sent again if there is an problem.
77a lot of data ernad@khufu:~$ date Thu Oct 18 05:18:37 EDT 2007 du -s ernad/var/reports/ernad/var/reports/find ernad/var/reports/ -type f | grep -c ^76043
78maintenanceThomas Krichel maintains a technical guide, mainly for the director and the general editor.It is atIt illustrates well that the technical maintenance is still quite heavy.A lot of maintenance scripts still have to be written by Thomas.
79assessment How well does NEP work? Some criteria are already discussed the literaturedelaycoverage ratiooverlap
80coverage to lossage There is a web site http://nep.repec.org/lossage that does show the papers that have not been sent, as well as the coverage ratio for each issue.It appears that coverage has improved.
81overlap There is no script to compute current overlap data. There is quite good historical subscriber for most of the post-ernad period.Thus it is possible to calculate overlap of reports for various nep-all issues.
82to improve coverageIt would be interesting to redo the work of Krichel and Bakkalbaşı (2005)For English papers, we can try to presort nep-all issues for a virtual “nep-no” report. This could help to identify thematic gaps.We open language-specific reports?
83delayThe site shows average delays of editors.Half of the editors appear to do a good job. A good job is when the average delay is below a week.
84editor activityThere is a web site containing activity data of editors.There appear some minor problems but overall it appears ok.Date is only available from , because of a misunderstanding between Roman D. Shapiro and Thomas Krichel.
85downloads This is the ultimate measure of success. Downloads from a report can be measured because of a GCI parameter identifying the report.Parsing the logs and matching the handles with handles in reports is difficult.
86downloads from issuesA lot of downloads are made by editors when they compose the report issue.A lot of others are made by robots.As for the rest, data is atAt this time this is preliminary.
87future extensions of ernad 1 Use editor identities to build a customized experience length.Use a collection of multi-word RePEc keywords to aid pre-sorting.
88future extensions of ernad 2 Deal better with duplicate papers under different handles.use lists before ordering as a basis for inclusion into pre-sorting. This will allow editors to delete duplicates without confusing the SVMpotentially detect duplicates at allport composition time.
89tough problemsFiltering for new papers is deficient as the date on papers is not mandatory. Presorting for age seems impossible.The fight with spam is a problem for anyone who sends out a lot of mail.
90recent developmentsChristian Zimmermann wrote “At I have attempted to classify registered authors by field. For this I used their papers as disseminated by NEP, and if at least one fourth are in a report, authors are considered to be a specialist of that field.”
91Leonardo Fernandes Souto is a Brazilian PhD student working on current awareness services.He has (correctly) identified NEP as the model that all should follow ;-)His questionnaire is at c.org/research/NEP_questionnaire_ doc.He is analyzing the 41 responses he received.
92Thank you for your attention! Thank you for your attention!