Slide 1: Lecture 9: IR Evaluation
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/

Slide 2: Lecture Overview
Review
– Probabilistic IR
Evaluation of IR systems
– Precision vs. Recall
– Cutoff Points and other measures
– Test Collections/TREC
– Blair & Maron Study
– Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack.

Slide 3: Lecture Overview (agenda repeated; same as Slide 2).

Slide 4: Probability Ranking Principle
"If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
– Stephen E. Robertson, Journal of Documentation, 1977

Slide 5: Model 1 – Maron and Kuhns
Concerned with estimating probabilities of relevance at the point of indexing:
– If a patron came with a request using term t_i, what is the probability that she/he would be satisfied with document D_j?

Slide 6: Model 2
Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought-after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought-after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.
– Robertson, Maron & Cooper, 1982

Slide 7: Model 2 – Robertson & Sparck Jones
Given a term t and a query q, the collection splits by document relevance and document indexing (whether the document contains t):

                            Relevant (+)    Not relevant (-)    Total
  Indexed with t (+)             r               n - r             n
  Not indexed with t (-)       R - r         N - n - R + r       N - n
  Total                          R               N - R             N

Slide 8: Robertson-Sparck Jones Weights
Retrospective formulation (formula shown as an image on the slide).
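The formula itself is not preserved in the transcript. As a hedged reconstruction using the contingency table of Slide 7, the retrospective Robertson-Sparck Jones weight for a query term is usually written as:

    w_t = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}

that is, the log of the odds that a relevant document is indexed with the term over the odds that a non-relevant document is, computed with full (retrospective) knowledge of the relevance judgments.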

Slide 9: Robertson-Sparck Jones Weights
Predictive formulation (formula shown as an image on the slide).
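Again the image is missing; the usual predictive form adds 0.5 to each cell of the contingency table so the weight can be estimated before complete relevance information is available. A sketch of the standard formulation (not necessarily the slide's exact notation):

    w_t = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}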

Slide 10: Probabilistic Models: Some Unifying Notation
D = all present and future documents
Q = all present and future queries
(D_i, Q_j) = a document-query pair
x = class of similar documents, y = class of similar queries
Relevance (R) is a relation (definition shown as an image on the slide).
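A hedged reconstruction of the relation, consistent with the notation above:

    R \subseteq D \times Q, \qquad R = \{ (D_i, Q_j) \mid D_i \text{ is judged relevant to } Q_j \}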

Slide 11: Probabilistic Models
Model 1 – Probabilistic Indexing, P(R | y, D_i)
Model 2 – Probabilistic Querying, P(R | Q_j, x)
Model 3 – Merged Model, P(R | Q_j, D_i)
Model 0 – P(R | y, x)
Probabilities are estimated based on prior usage or relevance estimation.

Slide 12: Probabilistic Models
[Diagram: the document space D and query space Q, with a class of similar documents x containing D_i and a class of similar queries y containing Q_j.]

Slide 13: Logistic Regression
Another approach to estimating the probability of relevance
Based on work by William Cooper, Fred Gey, and Daniel Dabney
Builds a regression model for relevance prediction based on a set of training data
Uses less restrictive independence assumptions than Model 2
– Linked Dependence
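The regression equation is not in the transcript. As a hedged sketch of the general form used in this approach, the log odds of relevance for a query-document pair is modeled as a linear function of matching statistics x_1, ..., x_m, with coefficients c_k fit to training data:

    \log O(R \mid Q_j, D_i) = c_0 + \sum_{k=1}^{m} c_k x_k,
    \qquad P(R \mid Q_j, D_i) = \frac{1}{1 + e^{-\log O(R \mid Q_j, D_i)}}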

Slide 14: Logistic Regression
[Plot: relevance (0-100) plotted against term frequency in the document (0-60).]

Slide 15: Relevance Feedback
Main idea:
– Modify the existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - And/or re-weight the terms already in the query
– Two main approaches:
  - Automatic (pseudo-relevance feedback)
  - Users select relevant documents
  - Users/system select terms from an automatically-generated list

Slide 16: Rocchio Method
(The Rocchio formula was shown as an image on the slide.)
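A hedged reconstruction of the standard Rocchio formula, where Q_0 is the original query vector, D_r and D_n are the sets of judged relevant and non-relevant documents, and α, β, γ control the mix of original query and feedback:

    Q' = \alpha\,Q_0 + \frac{\beta}{|D_r|} \sum_{d \in D_r} d \;-\; \frac{\gamma}{|D_n|} \sum_{d \in D_n} d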

Slide 17: Rocchio/Vector Illustration
[Plot: the query and document vectors below drawn in a two-dimensional space with axes "retrieval" and "information".]
Q_0 = retrieval of information = (0.7, 0.3)
D_1 = information science = (0.2, 0.8)
D_2 = retrieval systems = (0.9, 0.1)
Q' = ½ Q_0 + ½ D_1 = (0.45, 0.55)
Q'' = ½ Q_0 + ½ D_2 = (0.80, 0.20)
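A minimal Python sketch (not from the slides) that reproduces the arithmetic in the illustration, assuming α = β = ½ and no negative-feedback term (γ = 0):

    # Rocchio update: new query = alpha * original query + beta * centroid of relevant docs
    def rocchio(q0, relevant, alpha=0.5, beta=0.5):
        centroid = [sum(vals) / len(relevant) for vals in zip(*relevant)]
        return [alpha * q + beta * c for q, c in zip(q0, centroid)]

    q0 = [0.7, 0.3]                    # "retrieval of information"
    d1 = [0.2, 0.8]                    # "information science"
    d2 = [0.9, 0.1]                    # "retrieval systems"
    print(rocchio(q0, [d1]))           # [0.45, 0.55]  -> Q'
    print(rocchio(q0, [d2]))           # [0.80, 0.20]  -> Q'' (up to float rounding)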

Slide 18: Lecture Overview (agenda repeated; same as Slide 2).

Slide 19: IR Evaluation
Why evaluate?
What to evaluate?
How to evaluate?

Slide 20: Why Evaluate?
Determine if the system is desirable
Make comparative assessments
– Is system X better than system Y?
Others?

Slide 21: What to Evaluate?
How much of the information need is satisfied
How much was learned about a topic
Incidental learning:
– How much was learned about the collection
– How much was learned about other topics
Can serendipity be measured?
How inviting is the system?

Slide 22: Relevance (Revisited)
In what ways can a document be relevant to a query?
– Answer a precise question precisely
– Partially answer the question
– Suggest a source for more information
– Give background information
– Remind the user of other knowledge
– Others...

Slide 23: Relevance (Revisited)
How relevant is the document?
– For this user, for this information need
Subjective, but measurable to some extent
– How often do people agree a document is relevant to a query?
How well does it answer the question?
– Complete answer? Partial?
– Background information?
– Hints for further exploration?

Slide 24: What to Evaluate? Effectiveness
What can be measured that reflects users' ability to use the system? (Cleverdon 66)
– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall: proportion of relevant material actually retrieved
– Precision: proportion of retrieved material actually relevant
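A minimal Python sketch with assumed document IDs, showing how the recall and precision proportions above are computed from a retrieved set and a set of known relevant documents:

    retrieved = {"d1", "d2", "d3", "d4", "d5"}      # what the system returned
    relevant  = {"d2", "d4", "d6", "d7"}            # what the judges marked relevant

    hits = retrieved & relevant                     # relevant documents actually retrieved
    precision = len(hits) / len(retrieved)          # 2 / 5 = 0.40
    recall    = len(hits) / len(relevant)           # 2 / 4 = 0.50
    print(precision, recall)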

Slide 25: Relevant vs. Retrieved
[Venn diagram: the set of relevant documents and the set of retrieved documents overlapping within the set of all documents.]

Slide 26: Precision vs. Recall
[The same Venn diagram, used to define precision and recall in terms of the overlap between the relevant and retrieved sets.]

Slide 27: Why Precision and Recall?
Get as much good stuff as possible while at the same time getting as little junk as possible.

Slide 28: Retrieved vs. Relevant Documents
[Diagram] Very high precision, very low recall.

Slide 29: Retrieved vs. Relevant Documents
[Diagram] Very low precision, very low recall (0 in fact).

Slide 30: Retrieved vs. Relevant Documents
[Diagram] High recall, but low precision.

Slide 31: Retrieved vs. Relevant Documents
[Diagram] High precision, high recall (at last!).

Slide 32: Precision/Recall Curves
There is a well-known tradeoff between precision and recall, so we typically measure precision at different (fixed) levels of recall.
Note: this is an AVERAGE over MANY queries.
[Plot: precision (y-axis) against recall (x-axis), with measured points marked "x".]

Slide 33: Precision/Recall Curves
Difficult to determine which of these two hypothetical results is better:
[Plot: two hypothetical precision/recall curves.]

Slide 34: TREC (Manual Queries)
[Figure not preserved in the transcript.]

Slide 35: Lecture Overview (agenda repeated; same as Slide 2).

Slide 36: Document Cutoff Levels
Another way to evaluate:
– Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
– Measure precision at each of these levels
– (Possibly) take the average over levels
This is a way to focus on how well the system ranks the first k documents.
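A minimal Python sketch (assumed ranking and relevance judgments, not from the lecture) of precision measured at document cutoff levels:

    # Precision@k: fraction of the top-k retrieved documents that are relevant.
    def precision_at_k(ranking, relevant, k):
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    ranking  = ["d3", "d7", "d1", "d9", "d2", "d8", "d5", "d4", "d6", "d10"]
    relevant = {"d1", "d2", "d3", "d4"}
    for k in (5, 10):
        print(k, precision_at_k(ranking, relevant, k))   # 5 -> 0.6, 10 -> 0.4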

Slide 37: Problems with Precision/Recall
Can't know the true recall value
– Except in small collections
Precision and recall are related
– A combined measure is sometimes more appropriate
Assumes batch mode
– Interactive IR is important and has different criteria for successful searches
– We will touch on this in the UI section
Assumes that a strict rank ordering matters

Slide 38: Relation to Contingency Table

                           Doc is relevant    Doc is NOT relevant
  Doc is retrieved               a                    b
  Doc is NOT retrieved           c                    d

Accuracy: (a + d) / (a + b + c + d)
Precision: a / (a + b)
Recall: ?
Why don't we use accuracy for IR evaluation? (Assuming a large collection)
– Most docs aren't relevant
– Most docs aren't retrieved
– This inflates the accuracy value
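A small worked example with assumed numbers, showing why accuracy is inflated when most documents are neither relevant nor retrieved (it also spells out the standard answer to the recall question above, recall = a / (a + c)):

    # A collection of 100,000 documents with 100 relevant ones: a system that
    # retrieves nothing at all still scores 99.9% accuracy, but 0 recall.
    a, b, c, d = 0, 0, 100, 99_900
    accuracy = (a + d) / (a + b + c + d)    # 0.999
    recall   = a / (a + c)                  # 0.0
    print(accuracy, recall)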

Slide 39: The E-Measure
Combine precision and recall into one number (van Rijsbergen 79)
P = precision, R = recall
β = measure of the relative importance of P or R. For example:
– β = 1 means the user is equally interested in precision and recall
– β = ∞ means the user doesn't care about precision
– β = 0 means the user doesn't care about recall
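The E-measure formula was an image on the slide; a hedged reconstruction of van Rijsbergen's β form (equivalently stated with α = 1/(β² + 1)) is:

    E = 1 - \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R}

Lower E is better: β = 1 weights precision and recall equally, very large β makes E depend only on recall (1 - R), and β = 0 reduces it to 1 - P.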

Slide 40: F Measure (Harmonic Mean)
(The formula was shown as an image on the slide.)
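The standard harmonic-mean F measure, and its weighted variant (which is just 1 - E from the previous slide), are:

    F = \frac{2\,P\,R}{P + R} = \frac{2}{\frac{1}{P} + \frac{1}{R}},
    \qquad F_\beta = \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R}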

Slide 41: Lecture Overview (agenda repeated; same as Slide 2).

Slide 42: Test Collections
Cranfield 2
– 1400 documents, 221 queries
– 200 documents, 42 queries
INSPEC – 542 documents, 97 queries
UKCIS – > 10,000 documents, multiple sets, 193 queries
ADI – 82 documents, 35 queries
CACM – 3204 documents, 50 queries
CISI – 1460 documents, 35 queries
MEDLARS (Salton) – 273 documents, 18 queries

Slide 43: TREC
Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology): http://trec.nist.gov
– 13th TREC in mid-November
Collection: > 6 gigabytes (5 CD-ROMs), > 1.5 million docs
– Newswire & full-text news (AP, WSJ, Ziff, FT)
– Government documents (Federal Register, Congressional Record)
– Radio transcripts (FBIS) in multiple languages
– Web "subsets" (a separate "Large Web" set with 18.5 million pages of Web data, about 100 GB); the new GOV2 collection is nearly 1 TB
– Patents

Slide 44: TREC (cont.)
Queries + relevance judgments
– Queries devised and judged by "Information Specialists"
– Relevance judgments done only for those documents retrieved, not the entire collection!
Competition
– Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a recall level of 1000 documents
Following slides are from TREC overviews by Ellen Voorhees of NIST.

Slides 45-49: [Figures from Ellen Voorhees's TREC overview slides (NIST); not preserved in the transcript.]

Slide 50: Sample TREC Query (Topic)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

Slides 51-55: [Further figures from Ellen Voorhees's TREC overview slides (NIST); not preserved in the transcript.]

Slide 56: TREC
Benefits:
– Made research systems scale to large collections (at least pre-WWW "large")
– Allows for somewhat controlled comparisons
Drawbacks:
– Emphasis on high recall, which may be unrealistic for what many users want
– Very long queries, also unrealistic
– Comparisons still difficult to make, because systems are quite different on many dimensions
– Focus on batch ranking rather than interaction; there is an interactive track, but not a lot is being learned, given the constraints of the TREC evaluation process

Slide 57: TREC is Changing
Emphasis on specialized "tracks":
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish, Arabic)
– Filtering track
– High-Precision
– High-Performance
– Very Large Scale (terabyte track)
http://trec.nist.gov/

Slide 58: Other Test Forums/Collections
CLEF (Cross-Language Evaluation Forum)
– Collections in English, French, German, Spanish, and Italian, with new languages (Russian, Finnish, etc.) being added. Primarily European.
NTCIR (NII-NACSIS Test Collection for IR Systems)
– Primarily Japanese, Chinese, and Korean, with partial English
INEX (Initiative for the Evaluation of XML Retrieval)
– Main track uses about 525 MB of XML data from the IEEE. Combines structure and content.

Slide 59: Lecture Overview (agenda repeated; same as Slide 2).

Slide 60: Blair and Maron 1985
A classic study of retrieval effectiveness
– Earlier studies were on unrealistically small collections
Studied an archive of documents for a lawsuit
– ~350,000 pages of text
– 40 queries
– Focus on high recall
– Used IBM's STAIRS full-text system
Main result:
– The system retrieved less than 20% of the relevant documents for a particular information need
– Lawyers thought they had 75%
– But many queries had very high precision

Slide 61: Blair and Maron (cont.)
How they estimated recall:
– Generated partially random samples of unseen documents
– Had users (unaware these were random) judge them for relevance
Other results:
– Two lawyers' searches had similar performance
– The lawyers' recall was not much different from the paralegals'
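A minimal Python sketch with assumed numbers (not Blair and Maron's actual data) of the sampling idea above: relevant documents found in a random sample of the unretrieved documents are scaled up to estimate the total number of relevant documents, and hence recall:

    relevant_retrieved = 60        # relevant docs found among those retrieved
    unretrieved_total  = 39_000    # documents the searches never retrieved
    sample_size        = 500       # random sample of unretrieved docs that was judged
    relevant_in_sample = 4         # judged relevant within that sample

    est_relevant_unretrieved = relevant_in_sample / sample_size * unretrieved_total  # 312
    est_recall = relevant_retrieved / (relevant_retrieved + est_relevant_unretrieved)
    print(round(est_recall, 2))    # ~0.16, i.e. recall well under 20%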

Slide 62: Blair and Maron (cont.)
Why recall was low:
– Users can't foresee the exact words and phrases that will indicate relevant documents
  - "accident" referred to by those responsible as "event," "incident," "situation," "problem," ...
  - Differing technical terminology
  - Slang, misspellings
– Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

Slide 63: Lecture Overview (agenda repeated; same as Slide 2).

Slide 64: An Evaluation of Retrieval Effectiveness (Blair & Maron)
Questions from Shufei Lei:
– Blair and Maron concluded that a full-text retrieval system such as IBM's STAIRS was ineffective because recall was very low (20% on average) when searching for documents in a large database (about 40,000 documents). However, the lawyers who were asked to perform this test were quite satisfied with the results of their search. Think about how you search the web today. How do you evaluate the effectiveness of a full-text retrieval system (user satisfaction or recall rate)?
– The design of the full-text retrieval system is based on the assumption that it is a simple matter for users to foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents. The authors pointed out some factors that invalidate this assumption: misspellings, using different terms to refer to the same event, synonyms, etc. What can we do to help overcome these problems?

Slide 65: Rave Reviews (Belew)
Questions from Scott Fisher:
– What are the drawbacks of using an "expert" to evaluate documents in a collection for relevance?
– RAVEUnion follows the pooling procedure used by many evaluators. What is a weakness of this procedure? How do the RAVE researchers try to overcome this weakness?

Slide 66: A Case for Interaction (Koenemann & Belkin)
Questions from Lulu Guo:
– It is reported that people thought that using the feedback component as a suggestion device made them "lazy," since the task of generating terms was replaced by selecting terms. Is there any potential problem with this "laziness"?
– In evaluating the effectiveness of the second search task, the authors reported median precision (M) instead of mean (x-bar) precision. What's the difference between the two, and which do you think is more appropriate?

Slide 67: Work Tasks and Socio-Cognitive Relevance (Hjørland & Christensen)
Questions from Kelly Snow:
– Schizophrenia research has a number of different theories (psychosocial, biochemical) leading to different courses of treatment. According to the reading, finding a 'focus' is crucial for the search process. When prevailing consensus has not been reached, how might a Google-like page-rank approach be a benefit? How might it pose problems?
– The article discusses relevance ranking by the user as a subjective measure. Relevance ranking can be a reflection of a user's uncertainty about an item's relevance. It can also reflect relevance to a specific situation at a certain time: a document might be relevant for discussion with a colleague but not for clinical treatment. Does this insight change the way you've been thinking about relevance as discussed in the course so far?

Slide 68: Social Information Filtering (Shardanand & Maes)
Questions from Judd Antin:
– Would carelessly rating albums or artists 'break' Ringo? Why or why not? How would you break Ringo if you wanted to?
– Is the accuracy or precision of predicted target values a good measure of system performance? What good is a social filtering system if it never provides information which leads to new or different behavior? How do we measure performance in a practical sense?
– One important criticism of social information filtering is that it does not situate information in its sociocultural context: that liking or disliking a piece of music is an evolving relationship between the music and the listening environment. So, in this view, social information filtering fails because a quantitative, statistical measure of preference is not enough to account for the reality of any individual user's preference. How might a system account for this failing? Would it be enough to include additional metadata such as 'Mood,' 'Genre,' 'First Impression,' etc.?

Slide 69: Next Time
WEDNESDAY 10:30-11:30, Rm 205: A gentle (re)introduction to math notation and IR
Thursday: In-class workshop on IR evaluation. Bring a computer and/or calculator!
Readings:
– MIR Chapter 10

