1 2003.11.18 SLIDE 1IS 202 – FALL 2003 Lecture 20: Evaluation Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003 http://www.sims.berkeley.edu/academics/courses/is202/f03/ SIMS 202: Information Organization and Retrieval

2 2003.11.18 SLIDE 2IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

3 2003.11.18 SLIDE 3IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

4 2003.11.18 SLIDE 4IS 202 – FALL 2003 Probability Ranking Principle “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” Stephen E. Robertson, J. Documentation 1977

5 2003.11.18 SLIDE 5IS 202 – FALL 2003 Model 1 – Maron and Kuhns Concerned with estimating probabilities of relevance at the point of indexing: –If a patron came with a request using term t_i, what is the probability that she/he would be satisfied with document D_j?

6 2003.11.18 SLIDE 6IS 202 – FALL 2003 Model 2 Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties. Robertson, Maron & Cooper, 1982

7 2003.11.18 SLIDE 7IS 202 – FALL 2003 Model 2 – Robertson & Sparck Jones Given a term t and a query q, the document indexing vs. document relevance contingency table (N = collection size, R = relevant documents for q, n = documents indexed with t, r = relevant documents indexed with t):

                         Relevant (+)    Not relevant (-)    Total
  Term present (+)       r               n - r               n
  Term absent (-)        R - r           N - n - R + r       N - n
  Total                  R               N - R               N

8 2003.11.18 SLIDE 8IS 202 – FALL 2003 Robertson-Sparck Jones Weights Retrospective formulation
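
The retrospective formulation itself was an equation image that did not survive the transcript; using the contingency-table counts from slide 7 (r, n, R, N), the standard Robertson–Sparck Jones retrospective term weight is presumably what was shown:

```latex
w_t \;=\; \log \frac{r \,/\, (R - r)}{(n - r) \,/\, (N - n - R + r)}
```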

9 2003.11.18 SLIDE 9IS 202 – FALL 2003 Robertson-Sparck Jones Weights Predictive formulation
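
The predictive formulation was likewise an image; the usual predictive form adds 0.5 to each cell so the weight stays defined when counts are zero (a reconstruction, not a verbatim copy of the slide):

```latex
w_t \;=\; \log \frac{(r + 0.5) \,/\, (R - r + 0.5)}{(n - r + 0.5) \,/\, (N - n - R + r + 0.5)}
```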

10 2003.11.18 SLIDE 10IS 202 – FALL 2003 Probabilistic Models: Some Unifying Notation D = All present and future documents Q = All present and future queries (D_i, Q_j) = A document/query pair x = class of similar documents, y = class of similar queries, Relevance (R) is a relation:
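
The relation itself appeared as an equation image; a plausible reconstruction consistent with the notation above:

```latex
R \;=\; \{\, (D_i, Q_j) \in D \times Q \;:\; D_i \text{ is judged relevant to } Q_j \,\}
```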

11 2003.11.18 SLIDE 11IS 202 – FALL 2003 Probabilistic Models Model 1 -- Probabilistic Indexing, P(R | y, D_i) Model 2 -- Probabilistic Querying, P(R | Q_j, x) Model 3 -- Merged Model, P(R | Q_j, D_i) Model 0 -- P(R | y, x) Probabilities are estimated based on prior usage or relevance estimation

12 2003.11.18 SLIDE 12IS 202 – FALL 2003 Probabilistic Models [Diagram: document space D and query space Q, with a class of similar documents x containing D_i and a class of similar queries y containing Q_j]

13 2003.11.18 SLIDE 13IS 202 – FALL 2003 Logistic Regression Another approach to estimating probability of relevance Based on work by William Cooper, Fred Gey and Daniel Dabney Builds a regression model for relevance prediction based on a set of training data Uses less restrictive independence assumptions than Model 2 –Linked Dependence
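
The slides do not show the actual regression, and this is not the Cooper/Gey/Dabney model itself; it is only a minimal sketch of the idea in Python, fitting a logistic model that maps assumed query–document matching features to a probability of relevance used for ranking:

```python
# Toy sketch: learn P(relevant | features) from labeled query-document pairs.
# The feature columns are hypothetical (e.g. log term frequency, summed IDF,
# document length), not the features used in the published model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([
    [2.1, 5.3, 0.8],
    [0.0, 1.2, 1.5],
    [3.0, 6.1, 0.9],
    [0.5, 2.0, 1.1],
])
y_train = np.array([1, 0, 1, 0])  # 1 = judged relevant, 0 = not relevant

model = LogisticRegression()
model.fit(X_train, y_train)

# Rank unseen documents for a query by estimated probability of relevance.
X_new = np.array([[1.8, 4.9, 1.0], [0.2, 1.0, 1.4]])
print(model.predict_proba(X_new)[:, 1])  # P(R | features) for each pair
```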

14 2003.11.18 SLIDE 14IS 202 – FALL 2003 Logistic Regression [Plot: relevance (0–100) against term frequency in the document]

15 2003.11.18 SLIDE 15IS 202 – FALL 2003 Relevance Feedback Main Idea: –Modify existing query based on relevance judgements Extract terms from relevant documents and add them to the query And/or re-weight the terms already in the query –Two main approaches: Automatic (pseudo-relevance feedback) Users select relevant documents –Users/system select terms from an automatically-generated list

16 2003.11.18 SLIDE 16IS 202 – FALL 2003 Rocchio Method
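
The update rule was an equation image; the standard Rocchio formulation (presumably what the slide showed) moves the query toward the centroid of the judged-relevant documents D_r and away from the centroid of the judged non-relevant documents D_nr:

```latex
Q' \;=\; \alpha\, Q_0
  \;+\; \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j
  \;-\; \frac{\gamma}{|D_{nr}|} \sum_{d_k \in D_{nr}} d_k
```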

17 2003.11.18 SLIDE 17IS 202 – FALL 2003 Rocchio/Vector Illustration [Plot: the vectors below in the two-term (retrieval, information) space] Q_0 = retrieval of information = (0.7, 0.3) D_1 = information science = (0.2, 0.8) D_2 = retrieval systems = (0.9, 0.1) Q’ = ½·Q_0 + ½·D_1 = (0.45, 0.55) Q” = ½·Q_0 + ½·D_2 = (0.80, 0.20)
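
The slide's arithmetic is easy to verify; a quick check in Python, using the Rocchio form above with α = β = ½ and no negative feedback (matching the slide's numbers):

```python
import numpy as np

q0 = np.array([0.7, 0.3])   # "retrieval of information"
d1 = np.array([0.2, 0.8])   # "information science"
d2 = np.array([0.9, 0.1])   # "retrieval systems"

# Simple Rocchio step: equal weight on the old query and one fed-back document.
q_prime  = 0.5 * q0 + 0.5 * d1
q_dprime = 0.5 * q0 + 0.5 * d2
print(q_prime)   # [0.45 0.55]
print(q_dprime)  # [0.8  0.2 ]
```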

18 2003.11.18 SLIDE 18IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

19 2003.11.18 SLIDE 19IS 202 – FALL 2003 IR Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

20 2003.11.18 SLIDE 20IS 202 – FALL 2003 Why Evaluate? Determine if the system is desirable Make comparative assessments –Is system X better than system Y? Others?

21 2003.11.18 SLIDE 21IS 202 – FALL 2003 What to Evaluate? How much of the information need is satisfied How much was learned about a topic Incidental learning: –How much was learned about the collection –How much was learned about other topics How inviting the system is

22 2003.11.18 SLIDE 22IS 202 – FALL 2003 Relevance (revisited) In what ways can a document be relevant to a query? –Answer precise question precisely –Partially answer question –Suggest a source for more information –Give background information –Remind the user of other knowledge –Others...

23 2003.11.18 SLIDE 23IS 202 – FALL 2003 Relevance (revisited) How relevant is the document? –For this user for this information need Subjective, but Measurable to some extent –How often do people agree a document is relevant to a query? How well does it answer the question? –Complete answer? Partial? –Background Information? –Hints for further exploration?

24 2003.11.18 SLIDE 24IS 202 – FALL 2003 What to Evaluate? Effectiveness What can be measured that reflects users’ ability to use the system? (Cleverdon 66) –Coverage of information –Form of presentation –Effort required/ease of use –Time and space efficiency –Recall: proportion of relevant material actually retrieved –Precision: proportion of retrieved material actually relevant

25 2003.11.18 SLIDE 25IS 202 – FALL 2003 Relevant vs. Retrieved [Venn diagram: the relevant set and the retrieved set within all docs]

26 2003.11.18 SLIDE 26IS 202 – FALL 2003 Precision vs. Recall [Venn diagram: the relevant set and the retrieved set within all docs]

27 2003.11.18 SLIDE 27IS 202 – FALL 2003 Why Precision and Recall? Get as much good stuff as possible while at the same time getting as little junk as possible
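
Stated as code, with toy document IDs, a minimal sketch of the two definitions from the Cleverdon slide above:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|;
    Recall = |retrieved & relevant| / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 10 docs retrieved, 4 of them relevant; 8 relevant docs exist in total.
retrieved = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d3", "d4", "d20", "d21", "d22", "d23"}
print(precision_recall(retrieved, relevant))  # (0.4, 0.5)
```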

28 2003.11.18 SLIDE 28IS 202 – FALL 2003 Retrieved vs. Relevant Documents Very high precision, very low recall [Venn diagram: retrieved set vs. relevant set]

29 2003.11.18 SLIDE 29IS 202 – FALL 2003 Retrieved vs. Relevant Documents Very low precision, very low recall (0 in fact) [Venn diagram: retrieved set vs. relevant set]

30 2003.11.18 SLIDE 30IS 202 – FALL 2003 Retrieved vs. Relevant Documents High recall, but low precision [Venn diagram: retrieved set vs. relevant set]

31 2003.11.18 SLIDE 31IS 202 – FALL 2003 Retrieved vs. Relevant Documents High precision, high recall (at last!) [Venn diagram: retrieved set vs. relevant set]

32 2003.11.18 SLIDE 32IS 202 – FALL 2003 Precision/Recall Curves There is a well-known tradeoff between Precision and Recall So we typically measure Precision at different (fixed) levels of Recall Note: this is an AVERAGE over MANY queries [Plot: precision vs. recall curve with measured points]
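
One common convention uses 11 standard recall levels (0.0, 0.1, ..., 1.0) with interpolated precision; below is a sketch for a single query with a made-up ranking (the curves on the slide would come from averaging these per-query vectors over many queries):

```python
def interpolated_precision_at_recall(ranked_relevance, total_relevant, recall_levels=None):
    """Interpolated precision at fixed recall levels for ONE query.

    ranked_relevance: list of 0/1 relevance flags, one per retrieved doc, in rank order.
    total_relevant:   number of relevant documents that exist for the query.
    """
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    # (recall, precision) after each rank position.
    points = []
    hits = 0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))
    # Interpolation: at recall level r, take the max precision at any recall >= r.
    interp = []
    for r in recall_levels:
        candidates = [p for (rec, p) in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

# Toy ranking: relevant docs appear at ranks 1, 3, 6; assume 3 relevant docs exist.
print(interpolated_precision_at_recall([1, 0, 1, 0, 0, 1], total_relevant=3))
```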

33 2003.11.18 SLIDE 33IS 202 – FALL 2003 Precision/Recall Curves Difficult to determine which of these two hypothetical results is better: [Plot: two hypothetical precision vs. recall curves]

34 2003.11.18 SLIDE 34IS 202 – FALL 2003 TREC (Manual Queries)

35 2003.11.18 SLIDE 35IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

36 2003.11.18 SLIDE 36IS 202 – FALL 2003 Document Cutoff Levels Another way to evaluate: –Fix the number of documents retrieved at several levels: Top 5 Top 10 Top 20 Top 50 Top 100 Top 500 –Measure precision at each of these levels –(Possibly) take the average over levels This is a way to focus on how well the system ranks the first k documents
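
A minimal sketch of precision at a document cutoff level, with hypothetical document IDs and relevance judgments:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Precision over the top-k retrieved documents (a document cutoff level)."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

ranking = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6", "d0"]
relevant = {"d1", "d2", "d3", "d4"}
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))  # 0.4 at both cutoffs here
```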

37 2003.11.18 SLIDE 37IS 202 – FALL 2003 Problems with Precision/Recall Can’t know true recall value –Except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches –We will touch on this in the UI section Assumes that a strict rank ordering matters

38 2003.11.18 SLIDE 38IS 202 – FALL 2003 Relation to Contingency Table

                         Doc is relevant    Doc is NOT relevant
  Doc is retrieved       a                  b
  Doc is NOT retrieved   c                  d

Accuracy: (a+d) / (a+b+c+d) Precision: a/(a+b) Recall: ? Why don’t we use Accuracy for IR Evaluation? (Assuming a large collection) –Most docs aren’t relevant –Most docs aren’t retrieved –Inflates the accuracy value
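
A small numeric illustration, with made-up counts, of why accuracy is misleading when nearly every document is neither relevant nor retrieved (it also fills in the recall formula the slide leaves as a question):

```python
# Contingency cells: a = retrieved & relevant, b = retrieved & not relevant,
# c = not retrieved & relevant, d = not retrieved & not relevant.
a, b, c, d = 20, 80, 30, 999_870  # a 1,000,000-document collection, 50 relevant docs

accuracy = (a + d) / (a + b + c + d)
precision = a / (a + b)
recall = a / (a + c)   # the "?" on the slide

print(f"accuracy  = {accuracy:.4f}")   # ~0.9999 despite a poor search
print(f"precision = {precision:.2f}")  # 0.20
print(f"recall    = {recall:.2f}")     # 0.40
```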

39 2003.11.18 SLIDE 39IS 202 – FALL 2003 The E-Measure Combine Precision and Recall into one number (van Rijsbergen 79) P = precision R = recall β = measure of relative importance of P or R For example, β = 1 means user is equally interested in precision and recall, β = ∞ means user doesn’t care about precision, β = 0 means user doesn’t care about recall
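
The formula itself was an image; van Rijsbergen’s E-measure in the β parameterization, which is consistent with the examples above, is:

```latex
E \;=\; 1 \;-\; \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R}
```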

40 2003.11.18 SLIDE 40IS 202 – FALL 2003 F Measure (Harmonic Mean)
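
Again the equation was an image; the F measure is the harmonic mean of precision and recall (the complement of E at β = 1):

```latex
F \;=\; \frac{2\, P\, R}{P + R} \;=\; \frac{2}{\tfrac{1}{P} + \tfrac{1}{R}}
```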

41 2003.11.18 SLIDE 41IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

42 2003.11.18 SLIDE 42IS 202 – FALL 2003 Test Collections Cranfield 2 – 1400 Documents, 221 Queries – 200 Documents, 42 Queries INSPEC – 542 Documents, 97 Queries UKCIS – >10,000 Documents, multiple sets, 193 Queries ADI – 82 Documents, 35 Queries CACM – 3204 Documents, 50 Queries CISI – 1460 Documents, 35 Queries MEDLARS (Salton) – 273 Documents, 18 Queries

43 2003.11.18 SLIDE 43IS 202 – FALL 2003 TREC Text REtrieval Conference/Competition –Run by NIST (National Institute of Standards & Technology) –1999 was the 8th year – 9th TREC in early November Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs –Newswire & full text news (AP, WSJ, Ziff, FT) –Government documents (Federal Register, Congressional Record) –Radio Transcripts (FBIS) –Web “subsets” (“Large Web” separate with 18.5 Million pages of Web data – 100 Gbytes) –Patents

44 2003.11.18 SLIDE 44IS 202 – FALL 2003 TREC (cont.) Queries + Relevance Judgments –Queries devised and judged by “Information Specialists” –Relevance judgments done only for those documents retrieved—not entire collection! Competition –Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66) –Results judged on precision and recall, going up to a recall level of 1000 documents Following slides are from TREC overviews by Ellen Voorhees of NIST

45 2003.11.18 SLIDE 45IS 202 – FALL 2003

46 2003.11.18 SLIDE 46IS 202 – FALL 2003

47 2003.11.18 SLIDE 47IS 202 – FALL 2003

48 2003.11.18 SLIDE 48IS 202 – FALL 2003

49 2003.11.18 SLIDE 49IS 202 – FALL 2003

50 2003.11.18 SLIDE 50IS 202 – FALL 2003 Sample TREC Query (Topic) Number: 168 Topic: Financing AMTRAK Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

51 2003.11.18 SLIDE 51IS 202 – FALL 2003

52 2003.11.18 SLIDE 52IS 202 – FALL 2003

53 2003.11.18 SLIDE 53IS 202 – FALL 2003

54 2003.11.18 SLIDE 54IS 202 – FALL 2003

55 2003.11.18 SLIDE 55IS 202 – FALL 2003

56 2003.11.18 SLIDE 56IS 202 – FALL 2003 TREC Benefits: –Made research systems scale to large collections (at least pre-WWW “large”) –Allows for somewhat controlled comparisons Drawbacks: –Emphasis on high recall, which may be unrealistic for what many users want –Very long queries, also unrealistic –Comparisons still difficult to make, because systems are quite different on many dimensions –Focus on batch ranking rather than interaction There is an interactive track but not a lot is being learned, given the constraints of the TREC evaluation process

57 2003.11.18 SLIDE 57IS 202 – FALL 2003 TREC is Changing Emphasis on specialized “tracks” –Interactive track –Natural Language Processing (NLP) track –Multilingual tracks (Chinese, Spanish) –Filtering track –High-Precision –High-Performance http://trec.nist.gov/

58 2003.11.18 SLIDE 58IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

59 2003.11.18 SLIDE 59IS 202 – FALL 2003 Blair and Maron 1985 A classic study of retrieval effectiveness –Earlier studies were on unrealistically small collections Studied an archive of documents for a legal suit –~350,000 pages of text –40 queries –Focus on high recall –Used IBM’s STAIRS full-text system Main Result: –The system retrieved less than 20% of the relevant documents for a particular information need –Lawyers thought they had 75% But many queries had very high precision

60 2003.11.18 SLIDE 60IS 202 – FALL 2003 Blair and Maron (cont.) How they estimated recall –Generated partially random samples of unseen documents –Had users (unaware these were random) judge them for relevance Other results: –The two lawyers’ searches had similar performance –The lawyers’ recall was not much different from the paralegals’
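
A sketch, with made-up numbers rather than Blair and Maron’s actual data, of the sampling idea: judge a random sample of the unretrieved documents, scale the sample’s relevance rate up to the whole unretrieved set, and estimate recall from that:

```python
def estimated_recall(relevant_retrieved, unretrieved_total,
                     sample_size, relevant_in_sample):
    """Estimate recall when the full set of relevant documents is unknown."""
    # Scale the sample proportion up to all unretrieved documents.
    est_relevant_unretrieved = unretrieved_total * (relevant_in_sample / sample_size)
    return relevant_retrieved / (relevant_retrieved + est_relevant_unretrieved)

# Hypothetical numbers, not from the study.
print(estimated_recall(relevant_retrieved=120, unretrieved_total=40_000,
                       sample_size=500, relevant_in_sample=8))  # ~0.16
```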

61 2003.11.18 SLIDE 61IS 202 – FALL 2003 Blair and Maron (cont.) Why recall was low –Users can’t foresee exact words and phrases that will indicate relevant documents “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” … Differing technical terminology Slang, misspellings –Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

62 2003.11.18 SLIDE 62IS 202 – FALL 2003 Lecture Overview Review –Probabilistic IR Evaluation of IR systems –Precision vs. Recall –Cutoff Points and other measures –Test Collections/TREC –Blair & Maron Study –Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

63 2003.11.18 SLIDE 63IS 202 – FALL 2003 Carolyn Cracraft Questions 1. It would seem that some of the problems that contributed to poor recall, particularly spelling, might still be mitigated automatically. Given your experience with spell check, do you think it would be at all reasonable to run it on a document without human intervention? Sometimes it does make rather fanciful suggestions for replacements. Is there any spell check system out there that includes similarity calculations - i.e. where you could set a threshold and only automatically correct spelling if the word and the suggested replacement were, say, 80% similar? Should I have realized by this point in the course that computers are NEVER going to deal adequately with nuances of language?

64 2003.11.18 SLIDE 64IS 202 – FALL 2003 Carolyn Cracraft Questions 2. Certainly, the system seems to cry out for some metadata or a controlled vocabulary. But given the scenario in which the database was created (i.e. paralegals indexing documents relevant to just one case), it seems like it would be unreasonable to rigorously develop a vocabulary for every case handled by a law firm. But would even an ad-hoc version help with recall? If the paralegals just quickly agreed on a set of keywords relevant to the case and assigned these words as appropriate to the documents entered, would that have enough of an effect on recall to justify the extra time spent?

65 2003.11.18 SLIDE 65IS 202 – FALL 2003 Megan Finn Questions The experimenters mention that one problem with their method was that they didn't give clear instructions on how long to spend on each document. It also seems like seeing the queries relative to each other might influence how system users rank documents. (For example, a user might say that a document seems more relevant to query 1 than query 2, rather than judging them solely on their own merits.) What are some problems that you see with their experiment design?

66 2003.11.18 SLIDE 66IS 202 – FALL 2003 Megan Finn Questions It seems like one of the most challenging parts of this experiment is getting enough people to sit down and use their tools for two hours. Evaluating IR systems by tracking human interaction with query results seems like it could be easier. How would you measure the effectiveness of an IR system through tracking human action (something like clickthroughs)? Would tracking human action be likely to give you the same results as RAVe Reviews?

67 2003.11.18 SLIDE 67IS 202 – FALL 2003 Megan Finn Questions If the ultimate goal of IR is to provide the user with results that are relevant to them (not necessarily to everyone), is there a way to utilize the results of this experiment to return results that are more relevant to that user?

68 2003.11.18 SLIDE 68IS 202 – FALL 2003 Margaret Spring Questions A Case For Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness (Koenemann & Belkin) The article distinguishes between opaque, transparent and penetrable feedback. Since even inexperienced users (in 1996) had notably better success using penetrable feedback, why isn’t this approach seen more often in online search/retrieval tools?

69 2003.11.18 SLIDE 69IS 202 – FALL 2003 Margaret Spring Questions The researchers seemed dismayed that users became "lazy" in term generation & relied too heavily on term selection from feedback. Doesn’t the willingness and preference to focus a search through multiple interactions of feedback indicate further support for penetrable feedback? Is this indicative of an outdated perspective on user patience with/expectations of system processing abilities?

70 2003.11.18 SLIDE 70IS 202 – FALL 2003 Margaret Spring Questions Analysis of individual tests seemed to indicate that if a user was guided to particular search topics via feedback, the user was far more likely to successfully identify the relevant documents. Does this search topic “favoritism” indicate an application weakness in not connecting the term more effectively to other topics or does it indicate a discrepancy in user vocabulary and document vocabulary?

71 2003.11.18 SLIDE 71IS 202 – FALL 2003 Jeff Towle Questions 1) GroupLens, Ringo and other collaborative filtering systems (such as Amazon's) are all applicable in domains that I would describe as very 'taste'-sensitive. Does collaborative filtering have further applications? Or is information retrieval in general a method of matching tastes? 2) The GroupLens authors propose filter-bots as a method of dealing with the sparsity problem. Would such bots provide meaningful results, or is sparsity simply another method of rating that would be lost with the use of bots?

72 2003.11.18 SLIDE 72IS 202 – FALL 2003 Rebecca Shapley Questions Ringo was successful, and the second-to-last paragraph essentially predicts Amazon.com's current recommendation system. Can you think of other good applications of computerized social filtering? What characteristics of these applications make them most amenable to a social filtering system? We've considered and even experienced how values are incorporated into classification structures. What values are built into the Ringo example? More broadly, how are values built into social filtering systems? What implications does that have?

73 2003.11.18 SLIDE 73IS 202 – FALL 2003 Rebecca Shapley Questions This paper shows that a "constrained Pearson" calculation of similarity of user profile works best out of the three approaches they tried. The "constrained Pearson" considers similarity of user profiles both for artists they like AND for ones they don't like, and adjusts the equation to incorporate the specific numbers resulting from the 7-point scale. As you read the article, were there any ideas that occurred to you to try, to see if those would improve the performance of the system at recommending?
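
To make the calculation concrete, here is a small sketch of a constrained Pearson similarity as the Ringo paper describes it (deviations taken from the midpoint of the 7-point scale rather than from each user's mean, so agreement on disliked items counts too); the ratings below are made up:

```python
import numpy as np

def constrained_pearson(ratings_a, ratings_b, midpoint=4.0):
    """Constrained Pearson similarity between two users' ratings of the same items.

    Deviations are measured from the scale midpoint (4 on a 1-7 scale)
    instead of from each user's own mean rating.
    """
    a = np.asarray(ratings_a, dtype=float) - midpoint
    b = np.asarray(ratings_b, dtype=float) - midpoint
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

# Two users' ratings of the same five artists on a 1-7 scale (made-up numbers).
print(constrained_pearson([7, 6, 2, 1, 4], [6, 7, 1, 2, 5]))  # ~0.91
```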

74 2003.11.18 SLIDE 74IS 202 – FALL 2003 Rebecca Shapley Questions Who else might find the collected user-profiles from a social filtering system useful? Mapping the user profiles in n-dimensional hyperspace, they might cluster into groups roughly representing stereotypical consumer appetites, or they might be more spread out, more web-like. Would marketers look at this structure or the texture of their product in this web for useful info? Would users like to "travel" this web, seeing what people sort-of like them like? Would your user profile in something like Ringo or Amazon.com be an asset in a dating service? A liability under the Patriot Act?

75 2003.11.18 SLIDE 75IS 202 – FALL 2003 Next Time Assignment 8 Web Searching and Crawling Readings/Discussion –The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Sergey and Page, Lawrence); Jesse –Mercator: A Scalable, Extensible Web Crawler (Heydon, Allan and Najork, Marc) Yuri

