IR Theory: Evaluation Methods

1 IR Theory: Evaluation Methods

2 Evaluation of IR Systems: Why?
Which is the best? IR model, term weighting, indexing method, user interface, search system
How do we determine whether a given approach is effective?
- Is my new term weighting formula better than Okapi?
- Is Google better than Naver?
- Is VSM better than Boolean?
Without standard evaluation methods, assessment can be anecdotal, biased, or incorrect.

3 Evaluation of IR Systems: How?
Measure the degree to which the information need is satisfied:
- Number and ranks of relevant documents in the search results
- User's assessment
- Usage: how often is the system used? return rate
- Change in the user's knowledge
- Task completion
How to determine whether a given approach is effective: compare a baseline system against a test system.
- Baseline system: a standard (top-performing) system
- Test system: identical to the baseline except for the methods under test
- Assessment is easy when the differences between systems are restricted; when multiple factors differ (e.g., IR model + term weighting), their effects compound and it is difficult to isolate the contribution of each factor.

4 Evaluation of IR Systems: Challenges
- What measure to use? Effectiveness, relevance, utility, user satisfaction
- Who should judge? Individual user, all users, subject expert
- How to measure? Binary, multi-level, continuous
Difficulties of evaluation:
- At its core, evaluation is a subjective process
- Information needs change over time and with learning
- Evaluation criteria vary by user and context

5 IR Evaluation: Standard Approach
- What measure to use? Effectiveness, relevance, utility, user satisfaction
- Who should judge? Individual user, all users, subject expert
- How to measure? Binary, multi-level, continuous
Process (a minimal sketch of this loop follows the list):
1. Collect a set of documents (Collection)
2. Construct a set of queries (Queries/Topics)
3. For each query, human assessors identify the relevant documents (Relevance Judgments)
4. IR systems run the queries against the collection to produce search results
5. Evaluate the results by comparing them to the relevance judgments
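A minimal Python sketch of this loop, assuming TREC-style input files (qrels lines of the form "topic iteration docid relevance", run lines of the form "topic Q0 docid rank score tag"); the file names and the P@10 cutoff are illustrative, not from the slide:

```python
from collections import defaultdict

def load_qrels(path):
    """Map each topic to the set of docids judged relevant (step 3)."""
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docid, rel = line.split()
            if int(rel) > 0:
                relevant[topic].add(docid)
    return relevant

def load_run(path):
    """Map each topic to one system's ranked list of docids (step 4)."""
    ranked = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _q0, docid, _rank, _score, _tag = line.split()
            ranked[topic].append(docid)
    return ranked

qrels = load_qrels("qrels.txt")     # relevance judgments (illustrative file name)
run = load_run("system_a.run")      # one system's search results
for topic, docs in run.items():     # step 5: compare results to judgments
    hits = sum(1 for d in docs[:10] if d in qrels[topic])
    print(topic, "P@10 =", hits / 10)
```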

6 IR Evaluation: TREC Approach
Making relevance judgments for a large collection is practically impossible:
- #docs to review = #queries * #documents (collection size)
- For 50 queries and 1 million documents, assessors would need to review 50 million documents
Pooled relevance judgments (a pooling sketch follows the list):
- Create a pool of documents to review by collecting only the top n (e.g., 100) documents from each system's results
- Justification (previous research finding): different systems retrieve different sets of documents
- Assumption: a document pool created from the top results will include the majority of the relevant documents in the collection
- TREC finding: pooled relevance judgments do not affect the relative rankings of systems
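A sketch of pool construction, assuming each system's result is a dict mapping a topic to its ranked list of docids (the names and the depth parameter are illustrative):

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents per topic across all system runs."""
    pool = {}
    for run in runs:
        for topic, ranked_docs in run.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool

# Assessors judge only the pooled documents: for 50 topics and, say,
# 20 systems at depth 100, at most 50 * 20 * 100 = 100,000 candidates
# (fewer after removing overlap), rather than 50 * 1,000,000.
```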

7 Evaluation Measures: Overview
Test collection:
- Document corpus
- Topics: descriptions of information needs
- Relevance judgments: the set of known relevant documents for each topic
Evaluation measures (a sketch computing all four follows the list):
- Precision (P): a measure of the system's ability to present only relevant items
  P = number of relevant items retrieved / total number of items retrieved
- Recall (R): a measure of the system's ability to retrieve all the relevant items
  R = number of relevant items retrieved / number of relevant items in the collection
- R-Precision (RP): de-emphasizes the exact ranking of documents (good when many relevant documents are retrieved)
  RP = precision at rank R, where R = number of relevant items in the collection
- Reciprocal Rank (RR): good for known-item search
  RR = 1 / rank, where rank is the position of the first relevant item retrieved
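A minimal sketch of the four measures, assuming `ranking` is a system's ranked list of docids for one topic and `relevant` is the set of docids judged relevant for that topic (both names are illustrative):

```python
def precision(ranking, relevant):
    return sum(1 for d in ranking if d in relevant) / len(ranking)

def recall(ranking, relevant):
    return sum(1 for d in ranking if d in relevant) / len(relevant)

def r_precision(ranking, relevant):
    r = len(relevant)                      # R = number of relevant items
    return sum(1 for d in ranking[:r] if d in relevant) / r

def reciprocal_rank(ranking, relevant):
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1 / rank                # rank of the first relevant item
    return 0.0                             # no relevant item retrieved
```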

8 Evaluation Measures: Precision & Recall
[Venn diagram: within the document collection, the set of retrieved documents overlaps the set of relevant documents; the overlap is the set of relevant documents retrieved.]
P = relevant documents retrieved / retrieved documents
R = relevant documents retrieved / relevant documents

9 Evaluation Measures: Precision vs. Recall
[Recall-precision tradeoff plot, precision (0 to 1) vs. recall (0 to 1): the high-precision, low-recall corner retrieves mostly relevant documents but misses many relevant documents; the high-recall, low-precision corner retrieves most of the relevant documents along with lots of junk; optimum performance is the upper-right corner, where both precision and recall are 1.]

10 Evaluation Measures: Recall-Precision
Recall-precision graph: precision at the 11 standard recall levels. Commonly used to compare systems; the closer the curve is to the upper right, the better the system. Example values (a sketch of the computation follows the table):

Recall:    0.0     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1.0
Precision: 0.6169  0.4517  0.3938  0.3243  0.2725  0.2224  0.1642  0.1342  0.0904  0.0472  0.0031
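A sketch of the computation, assuming the standard interpolation rule (the interpolated precision at recall level r is the maximum precision observed at any recall >= r); the slide does not spell this rule out, so it is an assumption here. `ranking` and `relevant` are as in the earlier sketch:

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    n_rel = len(relevant)
    points = []                        # (recall, precision) at each relevant hit
    hits = 0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / n_rel, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]
```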

11 Evaluation Measures: Average Precision
Average Precision (AP):
- A single-valued measure that reflects performance over all relevant documents
- Rewards systems that retrieve relevant documents at high ranks
- A standard evaluation measure used to evaluate IR systems, devised by TREC (a sketch follows)
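A minimal sketch of (non-interpolated) average precision, assuming the usual convention that relevant documents never retrieved contribute zero; `ranking` and `relevant` are as before:

```python
def average_precision(ranking, relevant):
    """Average of the precision at the rank of each retrieved relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)   # unretrieved relevant docs add 0
```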

12 Examples: Precision & Recall
Ranked result list for one query (Nrel = 10 relevant documents in the collection); the running precision and recall values are reproduced by the sketch after the table:

rank  Doc#  Relevant?  Precision     Recall (Nrel = 10)
  1   42    Yes        1/1  = 1.00   1/10 = 0.10
  2   221   No         1/2  = 0.50   1/10 = 0.10
  3   123   No         1/3  = 0.33   1/10 = 0.10
  4   21    Yes        2/4  = 0.50   2/10 = 0.20
  5   111   Yes        3/5  = 0.60   3/10 = 0.30
  6   11    No         3/6  = 0.50   3/10 = 0.30
  7   93    No         3/7  = 0.43   3/10 = 0.30
  8   234   No         3/8  = 0.38   3/10 = 0.30
  9         Yes        4/9  = 0.44   4/10 = 0.40
 10   254   No         4/10 = 0.40   4/10 = 0.40
 11   333   Yes        5/11 = 0.45   5/10 = 0.50
 12   421   Yes        6/12 = 0.50   6/10 = 0.60
 13   45    No         6/13 = 0.46   6/10 = 0.60
 14   761   Yes        7/14 = 0.50   7/10 = 0.70
 15         Yes        8/15 = 0.53   8/10 = 0.80
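A small sketch that reproduces the Precision and Recall columns; the True/False flags are read off the table's Relevant? column:

```python
# Relevance flags for ranks 1-15 (Nrel = 10; only 8 of the relevant
# documents appear in the top 15).
flags = [True, False, False, True, True, False, False, False,
         True, False, True, True, False, True, True]
n_rel = 10
hits = 0
for rank, is_rel in enumerate(flags, start=1):
    if is_rel:
        hits += 1
    print(f"rank {rank:2d}: P = {hits}/{rank} = {hits/rank:.2f}, "
          f"R = {hits}/{n_rel} = {hits/n_rel:.2f}")
```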

13 Examples: Average Precision
The same ranked list as in the previous example (Nrel = 10): the eight retrieved relevant documents appear at ranks 1, 4, 5, 9, 11, 12, 14, and 15, with precision values 1/1, 2/4, 3/5, 4/9, 5/11, 6/12, 7/14, and 8/15 at those ranks.
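As a worked value (assuming the usual convention that the two relevant documents not retrieved in the top 15 contribute zero precision):

AP = (1/1 + 2/4 + 3/5 + 4/9 + 5/11 + 6/12 + 7/14 + 8/15) / 10 ≈ 4.53 / 10 ≈ 0.45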

