Metadata characteristics as predictors for editor selectivity in a current awareness service
Thomas Krichel & Nisa Bakkalbasi
2005-10-31

Similar presentations
1 of 16 Information Access The External Information Providers © FAO 2005 IMARK Investing in Information for Development Information Access The External.

28 April 2004Second Nordic Conference on Scholarly Communication 1 Citation Analysis for the Free, Online Literature Tim Brody Intelligence, Agents, Multimedia.
Digital scholarly communication in Economics: from NetEc to RePEc Thomas Krichel work partly sponsored by the Joint Information.
Open and self-sustaining digital library services: the example of NEP. Thomas Krichel
The RePEc model for the academic digital library Thomas Krichel work partly sponsored by the Joint Information Systems.
LIS618 lecture 2 Thomas Krichel Structure Theory: information retrieval performance Practice: more advanced dialog.
Distributed Current Awareness Services Thomas Krichel
Designing for the Discipline: Open Libraries and Scholarly Communication Thomas Krichel
Rclis in vision and reality Thomas Krichel
RePEc and OLS Thomas Krichel prepared for the first retreat for disciplinary repositories Monterey
RePEc: An Open Library for Economics Thomas Krichel Work partly supported by the Joint Information Systems Committee of.
Bringing scholarly communication in kicking and screaming into the Internet age Thomas Krichel
Current Awareness in a Large Digital Library José Manuel Barrueco Cruz Thomas Krichel Jeremiah Trinidad.
Information policy issues in RePEc Thomas Krichel
Open Archives and Open Libraries Thomas Krichel
The future of scholarly communication in Economics Thomas Krichel work partly sponsored by the Joint Information Systems.
New Century, New Metadata Thomas Krichel University of Surrey, Hitotsubashi University and Long Island University.
Use your bean. Count it. Thomas Krichel
My life and times Thomas Krichel LIU & НГУ
Four slides for the future Thomas Krichel given at 4 th International Socionet seminar Novosibirsk
Current work on CitEc José Manuel Barrueco Cruz Thomas Krichel
LIS618 lecture 6 Thomas Krichel structure DIALOG –basic vs additional index –initial database file selection (files) Lexis/Nexis.
LIS618 lecture 1 Thomas Krichel Structure of talk Recap on Boolean Before online searching Working with DIALOG –Overview –Search command –Bluesheets.
Electronic Library and Information Resources Introduction and overview.
Library Electronic Resources in the EUI Library Veerle Deckmyn, Library Director Aimee Glassel, Electronic Resources Librarian 07 September
LIBRARY WEBSITE, CATALOG, DATABASES AND FREE WEB RESOURCES.
Chapter 7 Sampling and Sampling Distributions
1 Aggregating with GeoscienceWorld (GSW) Whats in it for us?
Auto-Moto Financial Services- The Old Process
The basics for simulations
Configuration management
Chapter 11: Models of Computation
1 The information industry and the information market Summary.
Chapter 16 Goodness-of-Fit Tests and Contingency Tables
Reform and Innovation in Higher Education
CHAPTER 1 WHAT IS RESEARCH?.
Detecting Spam Zombies by Monitoring Outgoing Messages Zhenhai Duan Department of Computer Science Florida State University.
Chapter 2 Entity-Relationship Data Modeling: Tools and Techniques
DIKLA GRUTMAN 2014 Databases- presentation and training.
CINAHL Keyword Searching. This presentation will take you through the procedure of finding reliable information which can be used in your academic work.
Multiple Regression and Model Building
INFORMATION SOLUTIONS Citation Analysis Reports. Copyright 2005 Thomson Scientific 2 INFORMATION SOLUTIONS Provide highly customized datasets based on.
How the University Library can help you with your term paper
Workshop on Women in Science and Engineering Ruthanne D. Thomas, Chair Department of Chemistry University of North Texas
Review of Related Literature By Dr. Ajay Kumar Professor School of Physical Education DAVV Indore.
1 ONESEARCH/ WRITING A THESIS STATEMENT ENGLISH 115 Hudson Valley Community College Marvin Library Learning Commons.
H E L S I N G I N K A U P P A K O R K E A K O U L U H E L S I N K I S C H O O L O F E C O N O M I C S Orientaatiopäivät 1 Writing Scientific.
1 CS 430: Information Discovery Lecture 15 Library Catalogs 3.
Dr. Engr. Sami ur Rahman Assistant Professor Department of Computer Science University of Malakand Research Methods in Computer Science Lecture: Research.
Assessing a human mediated current awareness service International Symposium of Information Science (ISI 2015) Zadar, Zeljko Carevic 1, Thomas.
IL Step 1: Sources of Information Information Literacy 1.
AELDP ACADEMIC READING. Questions Do you have any questions about academic reading?
Research evaluation requirements José Manuel Barrueco Universitat de València (SPAIN) Servei de Biblioteques i Documentació May, 2011.
LIS510 lecture 3 Thomas Krichel information storage & retrieval this area is now more know as information retrieval when I dealt with it I.
IL Step 3: Using Bibliographic Databases Information Literacy 1.
I know of no more encouraging fact than the unquestionable ability of man to elevate his life by conscious endeavor. Henry David Thoreau.
CHAPTER 12 Descriptive, Program Evaluation, and Advanced Methods.
Organizing current awareness in a large volunteer-based digital library Thomas Krichel
Research Methods School of Economic Information Engineering Dr. Xu Yun :
A brief tour of Academic Search Premier. Agenda: Agenda: What is a database? What is a database? Searching keywords and using truncation. Searching keywords.
PSY 219 – Academic Writing in Psychology Fall Çağ University Faculty of Arts and Sciences Department of Psychology Inst. Nilay Avcı Week 9.
What is Research?. Intro.  Research- “Any honest attempt to study a problem systematically or to add to man’s knowledge of a problem may be regarded.
Chapter 20 Asking Questions, Finding Sources. Characteristics of a Good Research Paper Poses an interesting question and significant problem Responds.
Abstract  An abstract is a concise summary of a larger project (a thesis, research report, performance, service project, etc.) that concisely describes.
CitEc as a source for research assessment and evaluation José Manuel Barrueco Universitat de València (SPAIN) May, й Международной научно-практической.
Dr Hidayathulla Shaikh. Objectives At the end of the lecture student should be able to – Define journal club Mention types Discuss critical evaluation.
Supplementary Table 1. PRISMA checklist
The RePEc database about Economics
Building an autonomous citation index for grey literature: the
IL Step 3: Using Bibliographic Databases
Presentation transcript:

Metadata characteristics as predictors for editor selectivity in a current awareness service Thomas Krichel & Nisa Bakkalbasi

outline
Background to the work that we did
– RePEc (Research Papers in Economics)
– NEP: New Economics Papers
The research
– Theory
– Method
– Results
Other work done for NEP.

RePEc
Digital library for academic Economics. It collects descriptions of
– economics documents (working papers, articles, etc.)
– collections of those documents
– economists
– collections of economists
Pioneering effort to create a relational dataset describing an academic discipline as a whole. The data is freely available.

RePEc principle
Many archives
– Archives offer metadata about digital objects, or data about authors and institutions.
One database
Many services
– Users can access the data through many interfaces.
– Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.

it's the incentives, stupid RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library. The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.

some history In the early 1990s, Thomas Krichel dreamed about a current awareness service for working papers. It would later carry electronic papers. In 1993 he made the first economics working paper available online. In 1997 he wrote the key protocols that govern RePEc.

RePEc is based on 500+ archives
WoPEc, EconWPA, DEGREE, S-WoPEc, NBER, CEPR, Elsevier, US Fed in Print, IMF, OECD, MIT, University of Surrey, CO PAH, Blackwell

to form a 340+k item dataset
161,000 working papers
180,000 journal articles
1,300 software components
1,200 book and chapter listings
8,000 author contact & publication listings
9,100 institutional contact listings
more records than arXiv.org

RePEc is used in many services
EconPapers, NEP: New Economics Papers, Inomics, RePEc author service, Z39.50 service by the DEGREE partners, IDEAS, RuPEc, EDIRC, LogEc, CitEc

NEP: New Economics Papers This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old. Founded by Thomas Krichel. Supported by the Economics department at WUStL. Initial software was written by Jose Manuel Barrueco Cruz. First general editor was John S. Irons.

why NEP
Public aim: current awareness, if well done, can be an important service in its own right. It is sheltered from the competition of general search engines.
Private aim: it is useful to have some, even if limited, classification information. This should be useful in performance measures within subject areas.

modus operandi: stage 1
The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly. S/he filters out new descriptions of old papers using
– the date field
– handle heuristics
The result is an issue of the nep-all report.
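As a rough illustration only, such a filtering step could look like the sketch below; the record fields, the set of previously seen handles and the one-year cutoff are our assumptions, not a description of the actual NEP code.

```python
from datetime import date, timedelta

def filter_new_papers(records, seen_handles, cutoff_days=365):
    """Keep records that look like genuinely new papers.

    A record is dropped if its RePEc handle has been seen before, or if its
    date field indicates an old paper (i.e. a new *description* of an old
    paper). This mirrors the two heuristics named on the slide; the exact
    rules NEP uses are not given there.
    """
    fresh = []
    cutoff = date.today() - timedelta(days=cutoff_days)
    for rec in records:
        if rec["handle"] in seen_handles:
            continue                      # already announced earlier
        paper_date = rec.get("date")
        if paper_date is not None and paper_date < cutoff:
            continue                      # description of an old paper
        fresh.append(rec)
    return fresh
```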

modus operandi: stage 2 Editors consider the papers in the nep-all report and select the papers that belong to their subject. This forms an issue of a subject report nep-???. nep-all and the subject reports are circulated via email. A special arrangement makes the data of NEP available to other RePEc services.

some numbers
– There are now 60+ NEP lists.
– Over 37k subscriptions.
– Close to 16k subscribers.
– Over 50k papers announced.
– Over 100k announcements.
All this is a fantastic success!!

problem with the private aim We would need all the papers to be classified, not only the working papers. We would also need 100% coverage by NEP. This means every paper in nep-all appears in at least one subject report.

coverage ratio
We call the coverage ratio the proportion of papers in nep-all that have been announced in at least one subject report. We can define this ratio
– for each nep-all issue
– for a subset of nep-all issues
– for NEP as a whole
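Written out, with A the set of papers in the nep-all issues under consideration (notation ours, not from the slides):

```latex
\text{coverage ratio} \;=\;
\frac{\bigl|\{\, p \in A : p \text{ appears in at least one subject report} \,\}\bigr|}{|A|}
```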

coverage ratio theory & evidence
Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. However, the evidence from research by Barrueco Cruz, Krichel and Trinidad is that
– the coverage ratio of different nep-all issues varies a great deal
– overall, it remains at around 70%.
We need some theory as to why.

two theories
– Target-size theory
– Quality theory
  – descriptive quality
  – substantive quality

theory 1: target size theory When editors compose a report issue, they have a size of the issue in mind. If the nep-all issue is large, editors will take a narrow interpretation of the report subject. If the nep-all issue is small, editors will take a wide interpretation of the report subject.

target size theory & static coverage
There are two things going on:
– The opening of new subject reports improves the coverage ratio.
– The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run.
Target size theory implies that this growth makes the coverage ratio deteriorate. The static coverage ratio that we observe is the result of both effects canceling out.

theory 2: quality theory
George W. Bush version of quality theory:
– Some papers are rubbish. They will not get announced.
– The amount of rubbish in RePEc remains constant.
– This implies constant coverage.
Reality is slightly more subtle.

two versions of quality theory
Descriptive quality theory: papers that are badly described
– misleading titles
– no abstract
– languages other than English
Substantive quality theory: papers that are well described, but not good
– from unknown authors
– issued by institutions with unenviable research reputation

practical importance
We do care whether one or the other theory is true.
– Target size theory implies that NEP should open more reports to achieve perfect coverage.
– Quality theory suggests that opening more reports will have little to no impact on coverage.
Since operating more reports is costly, there should be an optimal number of reports.

overall model We need an overall model that explains subject editors' behavior. We can feed this model with variables that represent theoretical determinants of behavior. We can then assess the strength of the various factors empirically.

method The dependent variable is announced. It is 1 if the paper has been announced, 0 otherwise. Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known. That is why BLRA is popular in the life sciences.
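As an illustration of the method (not the authors' actual code), a BLRA fit with statsmodels could look like this, assuming the paper-level data has already been assembled into a DataFrame with the dummy variables introduced on the next slides; the file name and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per nep-all paper, with the binary outcome
# 'announced' and the categorical dummies described on the next slides.
df = pd.read_csv("nep_papers.csv")
predictors = ["size_1", "size_2", "title_1", "title_2",
              "position_1", "position_2", "series_1", "series_2",
              "abstract", "author", "language"]

X = sm.add_constant(df[predictors])   # intercept plus predictors
y = df["announced"]                   # 1 if announced, 0 otherwise

model = sm.Logit(y, X).fit()
print(model.summary())                # coefficients and p-values
print(np.exp(model.params))           # odds ratios in the same ordering
```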

independent variables: size size is the size of the nep-all issue in which the paper appeared. This is the critical indicator of target size theory. We expect it to have a negative impact on announced.

independent variables: position position is the position of the paper in the nep-all issue. The presence of this variable can be justified by the combined assumption of target size and editor myopia. If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.

independent variables: title title is the length of the title of the paper, measured by the number of characters. This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that the paper is overlooked.

independent variables: abstract abstract is the presence/absence of an abstract to the paper. This is also motivated by descriptive quality theory. Note that we do not use the length of the abstract because that would be a highly skewed variable.

independent variables: language language is an indicator of whether the metadata is in English or not. This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language. While there are a lot of multilingual editors, customizing this variable per editor would have been rather hard.

independent variables: series
series is the size of the series in which the paper appears. This variable is motivated by substantive quality theory. The larger a series is, the higher, usually, is its reputation. We can roughly classify by size and quality:
– multi-institution series (NBER, CEPR)
– large departments
– small departments

independent variables: author author is the prolificacy of the authors of the paper. It is justified by substantive quality theory. This is the most difficult variable to measure. We use the number of papers written by the registered author with the highest number. Since about 50% of the papers have no registered author, a lot of papers are excluded. But the exclusion should not introduce a bias.

create categorical variables
size_1 [179, 326), size_2 [326, 835]
title_1 [55, 77), title_2 [77, 1945]
position_1 [0.357, 0.704), position_2 [0.704, 1.000]
series_1 [98, 231), series_2 [231, 3654]
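A sketch of how these indicator variables could be built with pandas from the raw size, title, position and series values, using the cut points above; treating everything below the first cut point as the omitted reference category is our assumption, since the slide only lists the upper two bins.

```python
import pandas as pd

def make_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """Create 0/1 indicator columns from the raw continuous variables,
    using the cut points listed on the slide. Values below the first
    cut point fall into the (omitted) reference category."""
    cuts = {
        "size":     [179, 326, 835],
        "title":    [55, 77, 1945],
        "position": [0.357, 0.704, 1.000],
        "series":   [98, 231, 3654],
    }
    for var, (low, mid, high) in cuts.items():
        df[f"{var}_1"] = ((df[var] >= low) & (df[var] < mid)).astype(int)
        df[f"{var}_2"] = ((df[var] >= mid) & (df[var] <= high)).astype(int)
    return df
```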

results
P(announced = 1 | x) = exp(g(x)) / (1 + exp(g(x)))
g(x) = β0 + β1·size_1 + β2·size_2 + β3·title_1 + β4·title_2 + β5·abstract + β6·author + β7·language + β8·series_1 + β9·series_2
position is not significant. author just makes the cut.

odds ratio
size_1: 1.32 [1.22, 1.44]
size_2: 0.83 [0.76, 0.90]
title_1: 1.16 [1.07, 1.26]
title_2: 1.28 [1.18, 1.39]
abstract: 1.47 [1.34, 1.62]
language: 2.15 [1.85, 2.51]
series_1: 1.11 [1.02, 1.20]
series_2: 1.37 [1.26, 1.49]
author: 1.05 [1.01, 1.09]
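To read these numbers: each odds ratio is the exponential of the corresponding BLRA coefficient, with what appears to be a confidence interval in brackets. For instance, the language figure says that papers with English-language metadata have about 2.15 times the odds of being announced, other variables held equal:

```latex
\mathrm{OR}_j = e^{\beta_j},
\qquad
\mathrm{OR}_{\text{language}}
= \frac{\text{odds}(\text{announced} \mid \text{English})}
       {\text{odds}(\text{announced} \mid \text{not English})}
\approx 2.15
```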

scandal! Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for subject. The editors have rejected our findings. Almost all protest that there is no quality filtering.

consequences
There has been no program to expand the list of reports. Instead, a concentrated effort is needed to help editors really find the subject-specific papers. This can be done by
– the use of a more efficient interface
– the use of automated resource discovery methods.

ernad: editing reports on new academic documents. It is a purpose-built software system for current awareness reports. It was designed by Thomas Krichel. The system was written by Roman D. Shapiro.

statistical learning The idea is that a computer may be able to make decisions on the current nep-all issue based on the observation of earlier editorial decisions. ernad now works using support vector machines (SVM), with titles, abstracts, author names, classification values and series as features.
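A minimal sketch of this kind of pre-sorting classifier with scikit-learn, standing in for whatever SVM implementation ernad uses internally; here only the concatenated title and abstract text serve as features, and the tiny training set is purely illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data: text of past nep-all entries and the
# editor's past decision (1 = announced in this subject report, 0 = not).
train_texts = ["Monetary policy and inflation targeting ...",
               "A field experiment on school vouchers ..."]
train_labels = [1, 0]

# TF-IDF features over title + abstract, linear-kernel SVM on top.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                           LinearSVC())
classifier.fit(train_texts, train_labels)

# Rank the papers of a new nep-all issue by decision-function score,
# so the editor sees the most likely candidates first.
new_texts = ["Exchange rate pass-through in small open economies ..."]
scores = classifier.decision_function(new_texts)
ranked = sorted(zip(scores, new_texts), reverse=True)
```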

performance criteria We are not aware of established performance criteria for the sorting of papers in a report. Precision and recall appear useless. Expected search length and average search length don't appear very attractive either. Thus research into precise criteria is required.

SVM performance If we use average search length, we can do performance evaluations. It turns out that reports have very different forecastability. Some are almost perfect, others are weak. Again, this raises a few eyebrows!
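The slides do not spell out the measure; one simple reading of average search length, the mean rank at which the editor's actually selected papers sit in the machine-sorted issue, could be computed as follows (our reading, not necessarily the one used in the evaluation).

```python
def average_search_length(ranked_handles, selected_handles):
    """Mean position (1-based) of the editor-selected papers in the
    machine-sorted nep-all issue. A perfect pre-sorting puts all
    selected papers at the top, giving the smallest possible value."""
    positions = [rank for rank, handle in enumerate(ranked_handles, start=1)
                 if handle in selected_handles]
    return sum(positions) / len(positions) if positions else float("nan")

# Example: three selected papers sitting at ranks 1, 2 and 5.
ranking = ["p1", "p2", "p3", "p4", "p5"]
print(average_search_length(ranking, {"p1", "p2", "p5"}))  # -> 2.666...
```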

what is the value of an editor? If the forecast is perfect, we don't need the editor. If the forecast is very weak, the editor may be a prankster.

pre-sorting reconceived We should not think of pre-sorting via SVM as something to replace the editor. We should not think of it as encouraging editors to be lazy. Instead, we should think of it as an invitation to examine some papers more closely than others.

headline vs. bottomline data
The editors really have a three-stage decision process:
– They read the title and author names.
– They read the abstract.
– They read the full text.
A lot of papers fail at the first hurdle. SVM can read the abstract and prioritize papers for abstract reading. Editors are happy with the pre-sorting system.

Thank you for your attention!