
1 Information Retrieval: Problem Formulation & Evaluation ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 Research Process
– Identification of a research question/topic
– Propose a possible solution/answer (formulate a hypothesis)
– Implement the solution
– Design experiments (measures, data, etc.)
– Test the solution/hypothesis
– Draw conclusions
– Repeat the cycle of question-answering or hypothesis-formulation-and-testing if necessary
Today's lecture: problem formulation and evaluation

3 Part 1: IR Problem Formulation

4 Basic Formulation of TR (traditional)
– Vocabulary V = {w1, w2, …, wN} of a language
– Query q = q1, …, qm, where qi ∈ V
– Document di = di1, …, dimi, where dij ∈ V
– Collection C = {d1, …, dk}
– Set of relevant documents R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a “hint” on which docs are in R(q)
– Task = compute R'(q), an approximation of R(q) (i.e., decide which documents to return to a user)

5 Computing R(q)
Strategy 1: Document selection
– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
– The system must decide whether a doc is relevant or not (“absolute relevance”)
Strategy 2: Document ranking
– R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℜ is a relevance measure function and θ is a cutoff
– The system must decide whether one doc is more likely to be relevant than another (“relative relevance”)
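To make the contrast concrete, here is a minimal Python sketch (not from the slides); the word-overlap scoring function and the 0.5 threshold are hypothetical stand-ins for a real relevance measure f(d,q).

```python
# Minimal sketch (not from the slides): document selection vs. document ranking.
# The overlap-based score and the threshold are hypothetical placeholders.

def score(doc: str, query: str) -> float:
    """A toy relevance measure f(d, q): fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def select(collection, query, classifier_threshold=0.5):
    """Strategy 1: document selection -- return an unordered 'relevant' set."""
    return {d for d in collection if score(d, query) >= classifier_threshold}

def rank(collection, query, cutoff=None):
    """Strategy 2: document ranking -- sort by f(d, q); the user decides where to stop."""
    ranked = sorted(collection, key=lambda d: score(d, query), reverse=True)
    return ranked[:cutoff] if cutoff else ranked

docs = ["information retrieval evaluation", "car repair manual", "retrieval models"]
print(select(docs, "information retrieval"))   # unordered set, absolute relevance
print(rank(docs, "information retrieval"))     # ordered list, relative relevance
```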

6 Document Selection vs. Ranking
[Figure: document selection applies f(d,q) ∈ {0,1} to split the collection into an accepted set R'(q), which may differ from the true R(q), and a rejected set; document ranking assigns each doc a score with f(d,q) ∈ ℜ (e.g., 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -) and the user sets the threshold.]

7 Problems of Doc Selection/Boolean model [Cooper 88]
– The classifier is unlikely to be accurate
  - “Over-constrained” query (terms are too specific): no relevant documents found
  - “Under-constrained” query (terms are too general): over-delivery
  - It is hard to find the right position between these two extremes (hard for users to specify constraints)
– Even if it is accurate, not all relevant documents are equally relevant; prioritization is needed since a user can only examine one document at a time

8 Ranking is often preferred
– A user can stop browsing anywhere, so the boundary is controlled by the user
  - High-recall users would view more items
  - High-precision users would view only a few
– Theoretical justification: the Probability Ranking Principle [Robertson 77]

9 Probability Ranking Principle [Robertson 77]
Seeks a more fundamental justification:
– Why is ranking based on probability of relevance reasonable?
– Is there a better way of ranking documents?
– What is the optimal way of ranking documents?
Theoretical justification for ranking (Probability Ranking Principle): returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions (do they hold?):
– The utility of a document (to a user) is independent of the utility of any other document
– A user would browse the results sequentially

10 Two Justifications of PRP
Optimization of traditional retrieval effectiveness measures:
– Given an expected level of recall, ranking based on the PRP maximizes precision
– Given a fixed rank cutoff, ranking based on the PRP maximizes both precision and recall
Optimal decision making:
– Regardless of the tradeoff (e.g., favoring high precision vs. high recall), ranking based on the PRP optimizes the expected utility of a binary (independent) retrieval decision (i.e., to retrieve or not to retrieve a document)
Intuition: if a user sequentially examines one doc at a time, we'd like the user to see the very best ones first

11 According to the PRP, all we need is “a relevance measure function f” which satisfies:
For all q, d1, d2: f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)
Most existing research on IR models so far has fallen into this line of thinking… (Limitations?)

12 Modeling Relevance: Roadmap for Retrieval Models
[Diagram of retrieval-model families:]
– Relevance ≈ similarity between representations (Rep(q), Rep(d)): different representations & similarity measures, e.g., Vector space model (Salton et al., 75), Prob. distr. model (Wong & Yao, 89), …
– Relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}: Regression model (Fuhr 89); generative models via doc generation (Classical prob. model, Robertson & Sparck Jones, 76) or query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a); Learning to Rank (Joachims 02, Burges et al. 05)
– Relevance as probabilistic inference, P(d→q) or P(q→d): different inference systems, e.g., Prob. concept space model (Wong & Yao, 95), Inference network model (Turtle & Croft, 91)
– Also on the slide: Div. from Randomness (Amati & Rijsbergen 02); Relevance constraints [Fang et al. 04]

13 Part 2: IR Evaluation

14 Evaluation: Two Different Reasons
Reason 1: to assess how useful an IR system/technology would be (for an application)
– Measures should reflect the utility to users in a real application
– Usually done through user studies (interactive IR evaluation)
Reason 2: to compare different systems and methods (to advance the state of the art)
– Measures only need to be correlated with the utility to actual users, so they don't have to reflect the exact utility to users
– Usually done through test collections (test-set IR evaluation)

15 What to Measure?
Effectiveness/Accuracy: how accurate are the search results?
– Measures the system's ability to rank relevant documents above non-relevant ones
Efficiency: how quickly can a user get the results? How much computing resource is needed to answer a query?
– Measures space and time overhead
Usability: how useful is the system for real user tasks?
– Assessed through user studies

16 The Cranfield Evaluation Methodology
A methodology for laboratory testing of system components, developed in the 1960s
Idea: build reusable test collections & define measures
– A sample collection of documents (simulates a real document collection)
– A sample set of queries/topics (simulates user queries)
– Relevance judgments (ideally made by the users who formulated the queries) → ideal ranked list
– Measures to quantify how well a system's result matches the ideal ranked list
A test collection can then be reused many times to compare different systems
This methodology is general and applicable to evaluating any empirical task

17 Test Collection Evaluation
Components: a document collection (D1, D2, D3, …, D48, …), a set of queries (Q1, Q2, Q3, …, Q50), and relevance judgments (Q1 D1 +, Q1 D2 +, Q1 D3 –, Q1 D4 –, Q1 D5 +, …; Q2 D1 –, Q2 D2 +, Q2 D3 +, Q2 D4 –, …; Q50 D1 –, Q50 D2 –, Q50 D3 +, …)
Example for Query = Q1:
– System A returns D2 +, D1 +, D4 –, D5 +  =>  Precision = 3/4, Recall = 3/3
– System B returns D1 +, D4 –, D3 –, D5 +  =>  Precision = 2/4, Recall = 2/3
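A short Python sketch (added here for illustration, not part of the slide) that reproduces the slide's numbers for Q1, treating each result list as a set:

```python
# Sketch: set-based precision and recall for the slide's Q1 example.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant_q1 = {"D1", "D2", "D5"}                                 # judged + for Q1
print(precision_recall(["D2", "D1", "D4", "D5"], relevant_q1))   # System A: (0.75, 1.0)
print(precision_recall(["D1", "D4", "D3", "D5"], relevant_q1))   # System B: (0.5, ~0.67)
```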

18 Measures for evaluating a set of retrieved documents
Doc / Action       Retrieved                  Not Retrieved
Relevant           a (relevant retrieved)     b (relevant rejected)
Not relevant       c (irrelevant retrieved)   d (irrelevant rejected)
Precision = a/(a+c)    Recall = a/(a+b)
Ideal results: Precision = Recall = 1.0
In reality, high recall tends to be associated with low precision (why?)

19 How to measure a ranking?
– Compute the precision at every recall point
– Plot a precision-recall (PR) curve
[Figure: two precision-recall curves (precision on the y-axis, recall on the x-axis)] Which is better?

20 Computing Precision-Recall Curve
Ranked list: D1 +, D2 +, D3 –, D4 –, D5 +, D6 –, D7 –, D8 +, D9 –, D10 –
Total number of relevant documents in the collection: 10
Precision and recall at each cutoff:
– D1: precision = 1/1, recall = 1/10
– D2: precision = 2/2, recall = 2/10
– D3: precision = 2/3, recall = 2/10
– D5: precision = 3/5, recall = 3/10
– D8: precision = 4/8, recall = 4/10
– …
– recall = 10/10: precision = ?
[Figure: the resulting precision-recall curve]
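The computation behind this slide can be sketched in a few lines of Python (an illustrative addition, assuming binary relevance labels in ranked order and a known total number of relevant documents):

```python
# Sketch: precision/recall at each new recall point of a ranked list, given
# binary relevance labels for the ranking and the total number of relevant
# documents in the collection (10 in the slide's example).
def precision_recall_points(labels, total_relevant):
    points, hits = [], 0
    for rank, is_rel in enumerate(labels, start=1):
        hits += is_rel
        if is_rel:  # a new recall point is reached at each relevant doc
            points.append((hits / total_relevant, hits / rank))
    return points  # list of (recall, precision) pairs

labels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # D1..D10 from the slide
print(precision_recall_points(labels, total_relevant=10))
# [(0.1, 1.0), (0.2, 1.0), (0.3, 0.6), (0.4, 0.5)]
```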

21 How to summarize a ranking?
Same ranked list as slide 20 (D1 +, D2 +, D3 –, D4 –, D5 +, D6 –, D7 –, D8 +, D9 –, D10 –), with 10 relevant documents in the collection in total
Precision at each relevant document: 1/1, 2/2, 3/5, 4/8, and 0 for each relevant document that is never retrieved
Average Precision = ?

22 Summarize a Ranking: MAP
Given that n docs are retrieved:
– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1), …, p(k), if we have k relevant docs
– E.g., if the first relevant doc is at the 2nd rank, then p(1) = 1/2
– If a relevant document is never retrieved, we assume the precision corresponding to that relevant doc to be zero
Compute the average over all the relevant documents:
– Average Precision = (p(1) + … + p(k)) / k
This average precision captures both precision and recall and is sensitive to the rank of each relevant document
Mean Average Precision (MAP):
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
– Which one should be used?
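A sketch of average precision and MAP consistent with these definitions (illustrative code, not from the slides; unretrieved relevant documents contribute zero precision):

```python
# Sketch: average precision (AP) for one query and MAP over a set of topics,
# following the slide's convention that missing relevant docs contribute 0.
def average_precision(labels, total_relevant):
    precisions, hits = [], 0
    for rank, is_rel in enumerate(labels, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant   # unretrieved relevant docs add 0

def mean_average_precision(runs):
    """runs: list of (labels, total_relevant) pairs, one per topic."""
    aps = [average_precision(labels, rel) for labels, rel in runs]
    return sum(aps) / len(aps)

# Ranking from slides 20-21: AP = (1/1 + 2/2 + 3/5 + 4/8 + 0*6) / 10 = 0.31
print(average_precision([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], total_relevant=10))
```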

23 What if we have multi-level relevance judgments?
Relevance levels: r = 1 (non-relevant), 2 (marginally relevant), 3 (very relevant)
Ranked list with gains: D1:3, D2:2, D3:1, D4:1, D5:3, D6:1, D7:1, D8:2, D9:1, D10:1
Cumulative Gain: 3, 3+2, 3+2+1, 3+2+1+1, …
Discounted Cumulative Gain: 3, 3+2/log 2, 3+2/log 2+1/log 3, …
DCG@10 = 3 + 2/log 2 + 1/log 3 + … + 1/log 10
IdealDCG@10 = 3 + 3/log 2 + 3/log 3 + … + 3/log 9 + 2/log 10
(Assume there are 9 documents rated “3” in total in the collection)
Normalized DCG = ?

24 Summarize a Ranking: NDCG
What if relevance judgments are on a scale of [1, r], with r > 2?
Cumulative Gain (CG) at rank n:
– Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
– CG = r1 + r2 + … + rn
Discounted Cumulative Gain (DCG) at rank n:
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
– We may use any base b for the logarithm
– For rank positions below b, do not discount
Normalized Discounted Cumulative Gain (NDCG) at rank n:
– Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
– The ideal ranking first returns the documents with the highest relevance level, then the next highest relevance level, etc.
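A minimal nDCG sketch consistent with these definitions (illustrative; it assumes a base-2 logarithm with no discount at rank 1, and reuses the gains from slide 23):

```python
import math

# Sketch: DCG and nDCG with a base-2 logarithm; rank 1 is not discounted
# (log2(1) would be 0). Mirrors the structure of slide 23's example.
def dcg(gains):
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains):
    return dcg(gains) / dcg(ideal_gains)

system_gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]   # D1..D10 from slide 23
ideal_gains  = [3] * 9 + [2]                    # nine docs rated "3" exist, then a "2"
print(ndcg(system_gains, ideal_gains))
```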

25 Other Measures
Precision at k documents (e.g., prec@10docs):
– easier to interpret than MAP (why?)
– also called breakeven precision when k equals the number of relevant documents
Mean Reciprocal Rank (MRR):
– Reciprocal Rank = 1/rank-of-the-first-relevant-doc; MRR averages this over topics
– Same as MAP when there is only 1 relevant document per topic
F-Measure (F1): harmonic mean of precision and recall
– F_β = (β² + 1)·P·R / (β²·P + R), where P is precision, R is recall, and β is a parameter (often set to 1)
– F1 = 2PR / (P + R)
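Illustrative sketches of precision@k, reciprocal rank, and the F-measure (not from the slides; labels are binary relevance in ranked order):

```python
# Sketch: precision@k, reciprocal rank, and F-measure as defined on this slide.
def precision_at_k(labels, k):
    return sum(labels[:k]) / k

def reciprocal_rank(labels):
    for rank, is_rel in enumerate(labels, start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

labels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(precision_at_k(labels, 10))       # 0.4
print(reciprocal_rank([0, 0, 1, 0]))    # first relevant doc at rank 3 -> 1/3
print(f_measure(0.4, 0.5))              # F1 = 2PR/(P+R) ~ 0.444
```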

26 Challenges in creating early test collections
Challenge of obtaining documents:
– Salton had students manually transcribe Time magazine articles
– Not a problem now!
Challenge of distributing a collection:
– TREC started when CD-ROMs became available
– Not a problem now!
Challenge of scale, limited by qrels (relevance judgments):
– The idea of “pooling” (Sparck Jones & Rijsbergen 75)

27 Larger collections created in the 1980s
Name    Docs.   Qrys.  Year  Size, Mb  Source document
INSPEC  12,684  77     1981  –         Title, authors, source, abstract and indexing information from Sep-Dec 1979 issues of Computer and Control Abstracts.
CACM    3,204   64     1983  2.2       Title, abstract, author, keywords and bibliographic information from articles of Communications of the ACM, 1958-1979.
CISI    1,460   112    1983  2.2       Author, title/abstract, and co-citation data for the 1460 most highly cited articles and manuscripts in information science, 1969-1977.
LISA    6,004   35     1983  3.4       Taken from the Library and Information Science Abstracts database.
Commercial systems at the time routinely supported searching over millions of documents → pressure on researchers to use larger collections for evaluation

28 The Ideal Test Collection Report [Sparck Jones & Rijsbergen 75]
Introduced the idea of pooling:
– Have assessors judge only a pool of top-ranked documents returned by various retrieval systems
Other recommendations (the vision was later implemented in TREC):
"1. that an ideal test collection be set up to facilitate and promote research;
2. that the collection be of sufficient size to constitute an adequate test bed for experiments relevant to modern IR systems …
3. that the collection(s) be set up by a special purpose project carried out by an experienced worker, called the Builder;
4. that the collection(s) be maintained in a well-designed and documented machine form and distributed to users, by a Curator;
5. that the curating (sic) project be encouraged to promote research via the ideal collection(s), and also via the common use of other collection(s) acquired from independent projects."

29 TREC (Text REtrieval Conference)
1990: DARPA funded NIST to build a large test collection
1991: NIST proposed to distribute the data set through TREC (leader: Donna Harman)
Nov. 1992: first TREC meeting
Goals of TREC:
– create test collections for a set of retrieval tasks;
– promote as widely as possible research in those tasks;
– organize a conference for participating researchers to meet and disseminate their research work using TREC collections.

30 The “TREC Vision” (mass collaboration for creating a pool)
“Harman and her colleagues appear to be the first to realize that if the documents and topics of a collection were distributed for little or no cost, a large number of groups would be willing to load that data into their search systems and submit runs back to TREC to form a pool, all for no costs to TREC. TREC would use assessors to judge the pool. The effectiveness of each run would then be measured and reported back to the groups. Finally, TREC could hold a conference where an overall ranking of runs would be published and participating groups would meet to present work and interact. It was hoped that a slight competitive element would emerge between groups to produce the best possible runs for the pool.” (Sanderson 10)

31 The TREC Ad Hoc Retrieval Task & Pooling
– Simulates an information analyst (high recall)
– Multi-field topic description
– News documents + government documents
– Relevance criterion: “a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document)”
– Each submitted run returns 1000 documents per topic for evaluation with various measures
– The top 100 documents of each run were taken to form a pool
– All the documents in the pool were judged
– Unjudged documents are often assumed to be non-relevant (problem?)
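A tiny sketch of how a pool is formed (illustrative only; the run names and document ids are hypothetical):

```python
# Sketch of pooling for one topic: take the top-k documents from each
# submitted run, union them, and send only that pool to the assessors.
# Documents outside the pool remain unjudged (usually treated as non-relevant).
def build_pool(runs, depth=100):
    """runs: dict mapping run name -> ranked list of doc ids for one topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

runs = {
    "runA": ["D7", "D2", "D9", "D1"],   # hypothetical ranked results
    "runB": ["D2", "D5", "D7", "D3"],
}
print(sorted(build_pool(runs, depth=3)))   # the documents assessors would judge
```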

32 An example TREC topic

33 Typical TREC Evaluation Result
[Figure: a typical evaluation report showing a precision-recall curve, Mean Avg. Precision (MAP), breakeven precision (precision when precision = recall), Recall = 3212/4728 (out of 4728 relevant docs, the system retrieved 3212), and Precision@10docs (about 5.5 docs in the top 10 docs are relevant)]
Worked example: ranked list D1 +, D2 +, D3 –, D4 –, D5 +, D6 –; total # relevant docs = 4; the system returns 6 docs
Average Precision = (1/1 + 2/2 + 3/5 + 0)/4; the denominator is 4, not 3 (why?)

34 What Query Averaging Hides
(Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation)

35 Statistical Significance Tests
How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?
Experiment 1 (per-query scores for queries 1-7):
– System A: 0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21  (average 0.20)
– System B: 0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41  (average 0.40)
Experiment 2 (per-query scores for queries 1-7):
– System A: 0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12  (average 0.20)
– System B: 0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46  (average 0.40)
(Slide from Doug Oard)

36 Statistical Significance Testing
Per-query scores for queries 1-7 (Experiment 2):
– System A: 0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12  (average 0.20)
– System B: 0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46  (average 0.40)
Sign test on the per-query differences: +, -, +, -, -, +, -  =>  p = 1.0
Wilcoxon signed-rank test on the differences (+0.74, -0.32, +0.21, -0.37, -0.02, +0.82, -0.38)  =>  p = 0.9375
[Figure: distribution of differences centered at 0, with 95% of outcomes marked]
Try some out at: http://www.fon.hum.uva.nl/Service/CGI-Inline/HTML/Statistics.html
(Slide from Doug Oard)
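A sketch of running both tests in Python, assuming SciPy is installed; it uses the per-query scores shown above, so the p-values come out large (not significant), which is the slide's point, though they need not match the printed values exactly:

```python
# Sketch (assumes SciPy is available): sign test and Wilcoxon signed-rank test
# on per-query score differences between two systems.
from scipy.stats import binomtest, wilcoxon

system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]
diffs = [b - a for a, b in zip(system_a, system_b)]

# Sign test: two-sided binomial test on the number of positive differences.
positives = sum(d > 0 for d in diffs)
print(binomtest(positives, n=len(diffs), p=0.5).pvalue)

# Wilcoxon signed-rank test on the paired scores.
print(wilcoxon(system_b, system_a).pvalue)

# Both p-values are large: the 0.40 vs. 0.20 averages could easily be an
# artifact of the particular queries chosen.
```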

37 Live Labs: Involve Real Users in Evaluation
Stuff I’ve Seen [Dumais et al. 03]:
– Real systems deployed with hypothesis testing in mind (different interfaces + logging capability)
– Search logs can then be used to analyze hypotheses about user behavior
The “A-B Test”:
– Initial proposal by Cutting at a panel [Lesk et al. 97]
– First research work published by Joachims [Joachims 03]
– Great potential, but only a few follow-up studies

38 What You Should Know
– Why is the retrieval problem often framed as a ranking problem?
– The two assumptions of the PRP
– What is the Cranfield evaluation methodology?
– How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, nDCG, F1, MRR, breakeven precision)
– How does “pooling” work?
– Why is it necessary to do statistical significance tests?

39 Open Challenges in IR Evaluation
Almost all issues are still open for research!
– What are the best measures for various search tasks (especially newer tasks such as subtopic retrieval)?
– What’s the best way of doing statistical significance tests?
– What’s the best way to adopt the pooling strategy in practice?
– How can we assess the quality of a test collection? Can we create representative test sets?
– New paradigms for evaluation? An open IR system for A-B tests?

