
Slide 1: Lecture 20: Evaluation
IS 202 / SIMS 202: Information Organization and Retrieval, Fall 2002
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday, 10:30 am - 12:00 pm
http://www.sims.berkeley.edu/academics/courses/is202/f02/

Slide 2: Lecture Overview
Review
– Lexical Relations
– WordNet
– Can Lexical and Semantic Relations Be Exploited to Improve IR?
Evaluation of IR Systems
– Precision vs. Recall
– Cutoff Points
– Test Collections/TREC
– Blair & Maron Study
Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack.

Slide 3: Syntax
The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of that language.
These rules codify permissible combinations of classes of word forms.

Slide 4: Semantics
Semantics is the study of linguistic meaning.
Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics):
– (1) Compositional
– (2) Relational

Slide 5: Pragmatics
Deals with the relation between signs or linguistic expressions and their users.
Deixis (literally "pointing out")
– E.g., "I'll be back in an hour" depends upon the time of the utterance
Conversational implicature
– A: "Can you tell me the time?"
– B: "Well, the milkman has come." [I don't know exactly, but perhaps you can deduce it from some extra information I give you.]
Presupposition
– "Are you still such a bad driver?"
Speech acts
– Constatives vs. performatives
– E.g., "I second the motion."
Conversational structure
– E.g., turn-taking rules

Slide 6: Major Lexical Relations
Synonymy
Polysemy
Metonymy
Hyponymy/Hyperonymy
Meronymy
Antonymy

Slide 7: Thesauri and Lexical Relations
Polysemy: same word, different senses of meaning
– Slightly different concepts expressed similarly
Synonymy: different words, related senses of meaning
– Different ways to express similar concepts
Thesauri help draw all of these together.
Thesauri also commonly define a set of relations between terms that is similar to lexical relations
– BT (broader term), NT (narrower term), RT (related term)
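As a toy illustration (not from the original slides) of how BT/NT/RT links might be used for query expansion, consider the following sketch; the thesaurus entries and terms are invented:

```python
# Hypothetical thesaurus-based query expansion using BT (broader term),
# NT (narrower term), and RT (related term) links. The dictionary below is
# illustrative, not a real controlled vocabulary.
THESAURUS = {
    "automobile": {"BT": ["vehicle"], "NT": ["sedan", "coupe"], "RT": ["car", "motorcar"]},
    "car":        {"BT": ["vehicle"], "NT": [], "RT": ["automobile"]},
}

def expand_query(terms, relations=("RT", "NT")):
    """Add thesaurus neighbours of each query term along the chosen relations."""
    expanded = set(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        for rel in relations:
            expanded.update(entry.get(rel, []))
    return sorted(expanded)

print(expand_query(["automobile"]))  # ['automobile', 'car', 'coupe', 'motorcar', 'sedan']
```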

Slide 8: WordNet
Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University.
Can be downloaded for free:
– www.cogsci.princeton.edu/~wn/
"In terms of coverage, WordNet's goals differ little from those of a good standard college-level dictionary, and the semantics of WordNet is based on the notion of word sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that WordNet aspires to innovation."
– (Miller, 1998, Chapter 1)

Slide 9: WordNet: Size
POS        Unique Strings   Synsets
Noun       107930           74488
Verb       10806            12754
Adjective  21365            18523
Adverb     4583             3612
Total      144684           109377
WordNet uses "synsets" – sets of synonymous terms.
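As an aside (not part of the original slides), synsets can be inspected programmatically. The sketch below assumes NLTK is installed and the WordNet corpus has been fetched with nltk.download("wordnet"):

```python
# Look up the first few senses (synsets) of a polysemous word with NLTK's
# WordNet interface, printing lemmas and hypernyms for each sense.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
    print("  lemmas:   ", [lemma.name() for lemma in synset.lemmas()])
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```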

Slides 10-12: Structure of WordNet [figures not preserved in the transcript]

Slide 13: Lexical Relations and IR
Recall that most IR research has primarily looked at statistical approaches to inferring the topicality or meaning of documents
– I.e., statistics imply semantics – is this really true or correct?
How has WordNet been used (or how might it be used) to provide more functionality in searching?
What about other thesauri, classification schemes, and ontologies?

Slide 14: Using NLP (Strzalkowski)
[Pipeline diagram: text → NLP (tagger, parser) → NLP representation → terms → database search]

Slide 15: NLP & IR: Possible Approaches
Indexing
– Use of NLP methods to identify phrases
  - Test weighting schemes for phrases
– Use of more sophisticated morphological analysis
Searching
– Use of two-stage retrieval: statistical retrieval followed by more sophisticated NLP filtering (a toy sketch follows below)
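The following is only a toy sketch of the two-stage idea; statistical_score and nlp_filter are placeholders standing in for a real ranking function and a real linguistic filter, not any particular system:

```python
# Two-stage retrieval sketch: a cheap statistical ranking pass over all
# documents, then a more expensive "NLP" filter over the top candidates.

def statistical_score(query, doc):
    """First stage: crude term-overlap score standing in for tf.idf ranking."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def nlp_filter(query, doc):
    """Second stage placeholder: accept a document only if a deeper check
    (phrase match, parse-based match, ...) succeeds."""
    return query.lower() in doc.lower()  # stand-in for real linguistic analysis

def two_stage_retrieve(query, docs, k=100):
    ranked = sorted(docs, key=lambda d: statistical_score(query, d), reverse=True)
    return [d for d in ranked[:k] if nlp_filter(query, d)]
```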

Slide 16: Can Statistics Approach Semantics?
One approach is the Entry Vocabulary Index (EVI) work being done here…
(The following slides are from my presentation at JCDL 2002.)

Slide 17: What Is an Entry Vocabulary Index?
EVIs are a means of mapping from users' vocabulary to the controlled vocabulary of a collection of documents…

Slide 18: Solution: Entry Vocabulary Indexes
[Diagram: an EVI maps a free-text entry phrase to a controlled-vocabulary heading, e.g., "pass mtr veh spark ign eng" → "Automobile"]

Slide 19: [Diagram: finding "Plutonium" in Arabic, Chinese, Greek, Japanese, Korean, Russian, and Tamil digital library resources via statistical association]
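To make the association idea on slides 17-19 concrete, here is a toy sketch: learn word-to-category associations from documents that already carry controlled-vocabulary labels, then map query words to likely categories. The counting scheme and data are illustrative only, not the published EVI method:

```python
# Simplified Entry-Vocabulary-Index-style mapping: count co-occurrences of
# free-text words with category labels, then score categories for a new query.
from collections import Counter, defaultdict

def train_evi(labelled_docs):
    """labelled_docs: iterable of (text, category) pairs."""
    word_cat = defaultdict(Counter)
    for text, category in labelled_docs:
        for word in text.lower().split():
            word_cat[word][category] += 1
    return word_cat

def map_to_categories(word_cat, query, top_n=3):
    scores = Counter()
    for word in query.lower().split():
        scores.update(word_cat.get(word, Counter()))
    return scores.most_common(top_n)

evi = train_evi([("pass mtr veh spark ign eng", "Automobile"),
                 ("diesel truck engine", "Trucks")])
print(map_to_categories(evi, "spark ign eng"))  # [('Automobile', 3)]
```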

Slide 20: Lecture Overview
Review
– Lexical Relations
– WordNet
– Can Lexical and Semantic Relations Be Exploited to Improve IR?
Evaluation of IR Systems
– Precision vs. Recall
– Cutoff Points
– Test Collections/TREC
– Blair & Maron Study
Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack.

Slide 21: IR Evaluation
Why evaluate?
What to evaluate?
How to evaluate?

Slide 22: Why Evaluate?
Determine if the system is desirable
Make comparative assessments
– Is system X better than system Y?
Others?

Slide 23: What to Evaluate?
How much of the information need is satisfied
How much was learned about a topic
Incidental learning:
– How much was learned about the collection
– How much was learned about other topics
How inviting the system is

Slide 24: Relevance
In what ways can a document be relevant to a query?
– Answer a precise question precisely
– Partially answer the question
– Suggest a source for more information
– Give background information
– Remind the user of other knowledge
– Others…

Slide 25: Relevance
How relevant is the document?
– For this user, for this information need
Subjective, but measurable to some extent
– How often do people agree a document is relevant to a query?
How well does it answer the question?
– Complete answer? Partial?
– Background information?
– Hints for further exploration?

Slide 26: What to Evaluate? Effectiveness
What can be measured that reflects users' ability to use the system? (Cleverdon, 1966)
– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall: proportion of relevant material actually retrieved
– Precision: proportion of retrieved material actually relevant
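A minimal set-based sketch of these two measures (the document IDs are invented for illustration):

```python
# Recall = |relevant ∩ retrieved| / |relevant|
# Precision = |relevant ∩ retrieved| / |retrieved|

def recall_precision(relevant, retrieved):
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

print(recall_precision(relevant={1, 2, 3, 4}, retrieved={2, 4, 7, 8, 9}))  # (0.5, 0.4)
```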

Slide 27: Relevant vs. Retrieved
[Venn diagram: the set of retrieved documents and the set of relevant documents within all documents]

Slide 28: Precision vs. Recall
[Venn diagram: precision and recall defined by the overlap of the retrieved and relevant sets within all documents]

Slide 29: Why Precision and Recall?
Get as much good stuff as possible while at the same time getting as little junk as possible.

Slide 30: Retrieved vs. Relevant Documents
[Diagram] Very high precision, very low recall

Slide 31: Retrieved vs. Relevant Documents
[Diagram] Very low precision, very low recall (0 in fact)

Slide 32: Retrieved vs. Relevant Documents
[Diagram] High recall, but low precision

Slide 33: Retrieved vs. Relevant Documents
[Diagram] High precision, high recall (at last!)

Slide 34: Precision/Recall Curves
There is a tradeoff between precision and recall, so measure precision at different levels of recall.
Note: this is an AVERAGE over MANY queries.
[Plot: precision vs. recall curve]

Slide 35: Precision/Recall Curves
It is difficult to determine which of these two hypothetical results is better:
[Plot: two hypothetical precision vs. recall curves]
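A sketch of how such averaged curves are commonly computed: interpolated precision at eleven standard recall points, averaged over queries. The helper names and data layout are my own, not from the lecture:

```python
# Build an 11-point interpolated precision-recall curve per query, then
# average the curves across queries.

def interpolated_curve(ranked, relevant, points=11):
    relevant = set(relevant)
    precisions, recalls, hits = [], [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / len(relevant))
    curve = []
    for i in range(points):
        r = i / (points - 1)
        # interpolated precision at r: best precision achieved at any recall >= r
        curve.append(max((p for p, rec in zip(precisions, recalls) if rec >= r),
                         default=0.0))
    return curve

def averaged_curve(runs):
    """runs: dict mapping query id -> (ranked_doc_ids, relevant_doc_ids)."""
    curves = [interpolated_curve(ranked, rel) for ranked, rel in runs.values()]
    return [sum(vals) / len(vals) for vals in zip(*curves)]
```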

Slide 36: TREC (Manual Queries) [figure not preserved in the transcript]

Slide 37: Document Cutoff Levels
Another way to evaluate:
– Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
– Measure precision at each of these levels
– Take a (weighted) average over results
This is a way to focus on how well the system ranks the first k documents.
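A minimal sketch of precision at fixed cutoffs (the ranking and relevance judgments below are invented):

```python
# Precision@k for several document cutoff levels.

def precision_at_cutoffs(ranked, relevant, cutoffs=(5, 10, 20, 50, 100, 500)):
    relevant = set(relevant)
    return {k: len(relevant & set(ranked[:k])) / k for k in cutoffs}

ranking = [3, 7, 1, 9, 4, 12, 2, 8, 5, 6]
print(precision_at_cutoffs(ranking, relevant={1, 2, 3, 4}, cutoffs=(5, 10)))
# {5: 0.6, 10: 0.4}
```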

Slide 38: Problems with Precision/Recall
Can't know the true recall value
– Except in small collections
Precision and recall are related
– A combined measure is sometimes more appropriate
Assumes batch mode
– Interactive IR is important and has different criteria for successful searches
– We will touch on this in the UI section
Assumes a strict rank ordering matters

Slide 39: Relation to Contingency Table
                       Doc is relevant   Doc is NOT relevant
Doc is retrieved             a                   b
Doc is NOT retrieved         c                   d
Accuracy: (a+d) / (a+b+c+d)
Precision: a / (a+b)
Recall: a / (a+c)
Why don't we use accuracy for IR evaluation? (Assuming a large collection)
– Most docs aren't relevant
– Most docs aren't retrieved
– This inflates the accuracy value
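A small sketch (with invented counts) showing why accuracy is misleading when, as the slide notes, most documents are neither relevant nor retrieved:

```python
# Accuracy, precision, and recall from the retrieval contingency table
# (a = relevant & retrieved, b = not relevant & retrieved,
#  c = relevant & not retrieved, d = not relevant & not retrieved).

def contingency_measures(a, b, c, d):
    return {
        "accuracy":  (a + d) / (a + b + c + d),
        "precision": a / (a + b) if a + b else 0.0,
        "recall":    a / (a + c) if a + c else 0.0,
    }

print(contingency_measures(a=20, b=80, c=30, d=999_870))
# accuracy ≈ 0.9999, precision = 0.2, recall = 0.4 — accuracy looks great regardless
```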

Slide 40: The E-Measure
Combine precision and recall into one number (van Rijsbergen, 1979)
P = precision, R = recall, b = measure of relative importance of P or R
For example, b = 0.5 means the user is twice as interested in precision as in recall.
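The formula itself was an image on the original slide and did not survive the transcript. A common statement of van Rijsbergen's E measure, consistent with the parameters listed above (an assumed reconstruction, not copied from the slide), is:

```latex
E = 1 - \frac{(1 + b^{2})\,P\,R}{b^{2}P + R}
```

With b < 1 precision is weighted more heavily; b = 0 reduces E to 1 - P, and b = 0.5 corresponds to the "twice as interested in precision" example.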

Slide 41: F Measure (Harmonic Mean)
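The harmonic-mean formula was likewise an image on the original slide. The usual definition, which is simply 1 - E, is:

```latex
F = \frac{2\,P\,R}{P + R},
\qquad
F_{b} = \frac{(1 + b^{2})\,P\,R}{b^{2}P + R} = 1 - E
```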

Slide 42: Test Collections
Cranfield 2
– 1400 documents, 221 queries
– 200 documents, 42 queries
INSPEC – 542 documents, 97 queries
UKCIS – >10,000 documents (multiple sets), 193 queries
ADI – 82 documents, 35 queries
CACM – 3204 documents, 50 queries
CISI – 1460 documents, 35 queries
MEDLARS (Salton) – 273 documents, 18 queries

Slide 43: TREC
Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology)
– 1999 was the 8th year; the 9th TREC is in early November
Collection: >6 gigabytes (5 CD-ROMs), >1.5 million docs
– Newswire & full-text news (AP, WSJ, Ziff, FT)
– Government documents (Federal Register, Congressional Record)
– Radio transcripts (FBIS)
– Web "subsets" (a separate "Large Web" collection has 18.5 million pages of Web data – 100 GB)
– Patents

Slide 44: TREC (cont.)
Queries + relevance judgments
– Queries devised and judged by "information specialists"
– Relevance judgments done only for those documents retrieved – not the entire collection!
Competition
– Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a recall level of 1000 documents
The following slides are from TREC overviews by Ellen Voorhees of NIST.

Slides 45-49: [TREC overview figures from Ellen Voorhees, NIST; not preserved in the transcript]

Slide 50: Sample TREC Query (Topic)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

Slides 51-55: [further TREC overview figures from Ellen Voorhees, NIST; not preserved in the transcript]

Slide 56: TREC
Benefits:
– Made research systems scale to large collections (pre-WWW)
– Allows for somewhat controlled comparisons
Drawbacks:
– Emphasis on high recall, which may be unrealistic for what most users want
– Very long queries, also unrealistic
– Comparisons still difficult to make, because systems are quite different on many dimensions
– Focus on batch ranking rather than interaction
  - There is an interactive track

Slide 57: TREC Is Changing
Emphasis on specialized "tracks"
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish)
– Filtering track
– High precision
– High performance
http://trec.nist.gov/

Slide 58: Blair and Maron 1985
A classic study of retrieval effectiveness
– Earlier studies were on unrealistically small collections
Studied an archive of documents for a legal suit
– ~350,000 pages of text
– 40 queries
– Focus on high recall
– Used IBM's STAIRS full-text system
Main result:
– The system retrieved less than 20% of the relevant documents for a particular information need, while the lawyers thought they had found 75%
– But many queries had very high precision

Slide 59: Blair and Maron (cont.)
How they estimated recall (see the sketch below)
– Generated partially random samples of unseen documents
– Had users (unaware these were random) judge them for relevance
Other results:
– Two lawyers' searches had similar performance
– The lawyers' recall was not much different from the paralegals'
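A rough sketch of the sampling logic, with invented numbers; this illustrates the idea of extrapolating from a judged sample of unretrieved documents, not the exact Blair & Maron procedure:

```python
# Estimate recall when the true number of relevant documents is unknown:
# extrapolate the relevance rate found in a sample of unretrieved documents
# to the whole unretrieved set, then divide relevant-retrieved by the estimate.

def estimated_recall(relevant_retrieved, num_unretrieved, sample_size, relevant_in_sample):
    est_relevant_unretrieved = num_unretrieved * (relevant_in_sample / sample_size)
    est_total_relevant = relevant_retrieved + est_relevant_unretrieved
    return relevant_retrieved / est_total_relevant

# e.g. 100 relevant docs retrieved; a 500-doc sample of 50,000 unretrieved docs
# turns up 5 more relevant ones -> estimated recall of 100 / (100 + 500) ≈ 0.167
print(round(estimated_recall(100, 50_000, 500, 5), 3))
```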

Slide 60: Blair and Maron (cont.)
Why recall was low
– Users can't foresee the exact words and phrases that will indicate relevant documents
  - The "accident" was referred to by those responsible as an "event," "incident," "situation," "problem," …
  - Differing technical terminology
  - Slang, misspellings
– Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

