
1 IR Evaluation
Evaluate what?
– user satisfaction on a specific task
– speed
– presentation (interface) issues
– etc.
My focus today:
– comparative performance of system components
– “Cranfield tradition”

2 Cranfield Tradition
Laboratory testing of system components
– fine control over variables
– abstraction from the operational setting
– comparative testing
– TREC is a modern example of the tradition
Test collections
– set of documents
– set of questions
– relevance judgments

3 Relevance Judgments
Main source of criticism of the Cranfield tradition
– In test collections, judgments are usually binary, static, and assumed to be complete.
– But...
  “relevance” is highly idiosyncratic
  relevance does not entail utility
  documents have different degrees of relevance
  relevance can change over time for the same user
  for realistic collections, judgments cannot be complete

4 Cranfield Tradition
Despite the abstraction, laboratory tests are useful
– evaluation technology is predictive (i.e., results transfer to operational settings)
– while different sets of relevance judgments produce different absolute scores, they almost always produce the same comparative result
  assumes comparing averages over sets of queries
– incomplete judgments are OK if the judged sample is unbiased with respect to the systems tested

5 Assessor Agreement

6 Average Precision by Qrel

7 Stability of Laboratory Tests
Mean Kendall τ between system rankings produced from different qrel sets: .938
Similar results held for
– different query sets
– different evaluation measures
– different assessor types
– single-opinion vs. group-opinion judgments
How is filtering (with its strong learning component) affected?

8 Incompleteness
Effects of TREC pooling
– studied by Harman (TREC-4) and Zobel (SIGIR-98)
– additional relevant documents were found
  roughly uniform across systems
  highly skewed across topics
– systems that do not contribute to the pool are not harmed
– the need for unbiased judgments argues against newer pooling schemes

9 Effectiveness Measures
Given a well-constructed test collection, what should you measure?
– assume ranked retrieval results
– assume evaluation over a set of queries (> 25)
– desirable to have one summary number

10 Current Practice
precision: ratio of retrieved documents that are relevant
recall: ratio of relevant documents that are retrieved
query-based averaging
interpolation & extrapolation to plot precision at a standard set of recall levels
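To make these definitions concrete, here is a minimal Python sketch for a single topic. The ranking and the relevance judgments are invented for illustration; it computes precision and recall at each rank, then interpolated precision at the standard 11 recall levels (interpolated precision at level r taken as the maximum precision at any recall ≥ r).

```python
# Minimal sketch: precision/recall at each rank and interpolated precision
# at the standard 11 recall levels, for one topic. The ranking and the
# relevance judgments below are invented, not real TREC data.

ranking = ["d3", "d7", "d1", "d9", "d4", "d8"]   # system output, best first
relevant = {"d3", "d9", "d4", "d5"}              # judged-relevant documents (R = 4)

R = len(relevant)
rel_so_far = 0
points = []                                      # (recall, precision) after each rank
for i, doc in enumerate(ranking, start=1):
    if doc in relevant:
        rel_so_far += 1
    points.append((rel_so_far / R, rel_so_far / i))

# Interpolated precision at level r = max precision at any recall >= r.
standard_levels = [i / 10 for i in range(11)]    # 0.0, 0.1, ..., 1.0
interpolated = []
for level in standard_levels:
    candidates = [p for (r, p) in points if r >= level]
    interpolated.append(max(candidates) if candidates else 0.0)

print(list(zip(standard_levels, interpolated)))
```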

11 Recall-Precision Graph

12 Uninterpolated R-P Curve for a Single Topic

13 Interpolated R-P Curves for Individual Topics

14 Single Number Summary Scores
Precision(X): # rel in top X / X
Recall(Y): # rel in top Y / R
Average precision: Avg_r(Prec(rank of r)), over all relevant documents r
R-Precision: Prec(R)
Recall at .5 precision
– use Prec(10) if precision < .5 in the top 10
Rank of first relevant (expected search length)
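A small Python sketch of these summary scores on one invented topic (the document IDs, relevance judgments, and the cutoff of 5 are made up for illustration).

```python
# Sketch of the single-number summary scores for one invented topic.

ranking = ["d3", "d7", "d1", "d9", "d4", "d8"]   # system output, best first
relevant = {"d3", "d9", "d4", "d5"}              # judged-relevant documents
R = len(relevant)

def prec_at(x):
    """Precision at cutoff x: fraction of the top x documents that are relevant."""
    return sum(1 for d in ranking[:x] if d in relevant) / x

def recall_at(y):
    """Recall at cutoff y: fraction of all R relevant documents in the top y."""
    return sum(1 for d in ranking[:y] if d in relevant) / R

# Average precision: mean of Prec(rank of r) over all R relevant documents;
# relevant documents that are never retrieved contribute 0.
ap = sum(prec_at(i) for i, d in enumerate(ranking, start=1) if d in relevant) / R

r_precision = prec_at(R)                         # precision at rank R

first_rel = next(i for i, d in enumerate(ranking, start=1) if d in relevant)

print("P@5 =", prec_at(5), "R@5 =", recall_at(5),
      "AP =", round(ap, 3), "R-Prec =", r_precision,
      "first relevant at rank", first_rel)
```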

15 Runs Ranked by Different Measures
Ranked by each measure averaged over 50 topics

16 Correlations Between Rankings
Kendall’s τ computed between pairs of rankings
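As a sketch of what this correlation measures, the Python below computes Kendall’s τ between two rankings of the same runs by counting concordant and discordant pairs. The run names and orderings are invented, and ties are not handled.

```python
# Sketch: Kendall's tau between two rankings of the same set of runs.
# tau = (concordant pairs - discordant pairs) / total pairs.
from itertools import combinations

ranking_by_ap  = ["runA", "runB", "runC", "runD", "runE"]   # e.g., ranked by average precision
ranking_by_p10 = ["runB", "runA", "runC", "runE", "runD"]   # e.g., ranked by Prec(10)

def kendall_tau(rank1, rank2):
    pos1 = {run: i for i, run in enumerate(rank1)}
    pos2 = {run: i for i, run in enumerate(rank2)}
    concordant = discordant = 0
    for a, b in combinations(rank1, 2):
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank1) * (len(rank1) - 1) / 2
    return (concordant - discordant) / n_pairs

print(kendall_tau(ranking_by_ap, ranking_by_p10))   # 1.0 would mean identical orderings
```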

17 Document Level Measures
Advantage
immediately interpretable
Disadvantages
don’t average well
– a different number of relevant documents implies topics are in different parts of the recall-precision curve
– theoretical maximums are impossible to reach
insensitive to ranking: only the number of relevant documents that cross the cut-off affects the score
– less useful for tuning a system
less predictive of performance in other environments (?)

18 Number Relevant

19 Average Precision
Advantages
sensitive to the entire ranking: changing a single rank will change the final score
– facilitates failure analysis
stable: a small change in ranking makes a relatively small change in score (relative to the number of relevant documents)
has both precision- and recall-oriented factors
– ranks closest to 1 receive the largest weight
– computed over all relevant documents
Disadvantages
less easily interpreted

20 Set-based Evaluation
Required for some tasks, e.g., filtering
2 main approaches
– utility functions
  assign a reward for retrieving a relevant doc and a penalty for retrieving a non-relevant doc, e.g., 3R+ - 2N+
  hard to normalize, and scores can’t be interpreted or averaged if not normalized
– combinations of recall & precision
  average set precision = recall × precision
  system not penalized for retrieving many irrelevant documents when there are no relevant ones
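A sketch contrasting the two approaches on one invented filtering result: the linear utility 3R+ - 2N+ mentioned above, and the recall × precision combination.

```python
# Sketch: set-based scores for one invented filtering result.

retrieved = {"d3", "d7", "d9", "d8"}          # documents the filter accepted
relevant  = {"d3", "d9", "d4", "d5"}          # judged-relevant documents

r_plus = len(retrieved & relevant)            # relevant documents retrieved
n_plus = len(retrieved - relevant)            # non-relevant documents retrieved

# Linear utility: reward of 3 per relevant retrieved, penalty of 2 per non-relevant.
utility = 3 * r_plus - 2 * n_plus

# Recall/precision combination: average set precision = recall * precision.
precision = r_plus / len(retrieved) if retrieved else 0.0
recall = r_plus / len(relevant) if relevant else 0.0
avg_set_precision = recall * precision

print(utility, avg_set_precision)             # 2 and 0.25 for this example
```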

21 Filtering Results

22 Evaluation Research Questions
How to accommodate query variability?
– express the variability of results within one run
– appropriate tests for statistical significance (see the sketch below)
– “Holy Grail” of IR: an ontology of query types (i.e., which queries act the same for particular system types)
  improve retrieval effectiveness
  construct test collections that are balanced for difficulty and/or that target specific functionality
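One common way to approach the significance question is a paired test over per-topic scores for two systems. The sketch below uses a Wilcoxon signed-rank test from SciPy on invented per-topic average precision values (a paired t-test is another option); it only illustrates the idea and does not recommend a specific test.

```python
# Sketch: paired significance test over per-topic average precision for two
# systems. The score lists are invented; in practice there is one value per
# topic (e.g., 50 topics) for each system.
from scipy.stats import wilcoxon   # paired, non-parametric signed-rank test

ap_system1 = [0.42, 0.10, 0.55, 0.31, 0.28, 0.60, 0.12, 0.47]
ap_system2 = [0.38, 0.09, 0.50, 0.35, 0.20, 0.52, 0.11, 0.40]

statistic, p_value = wilcoxon(ap_system1, ap_system2)
print(p_value)   # a small p-value suggests the difference is unlikely to be chance
```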

23 Evaluation Research Questions
Facilitate user-based experiments
– time-based evaluation measures
  time used as a measure of user effort
  TREC-7 high-precision track
– good experimental design of user studies
  TREC-6 interactive track designed to allow cross-site comparisons of user experiments

