Super Awesome Presentation Dandre Allison Devin Adair.


1 Super Awesome Presentation Dandre Allison Devin Adair

2 Comparing the Sensitivity of Information Retrieval Metrics Filip Radlinski Microsoft Cambridge, UK Nick Craswell Microsoft Redmond, WA, USA

3 How do you evaluate Information Retrieval effectiveness? Precision (P) Mean Average Precision (MAP) Normalized Discounted Cumulative Gain (NDCG)

4 Precision For a given query, take the fraction of the top 5 documents that are relevant Average over all queries
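A minimal sketch of the precision slide above, assuming binary relevance labels listed in rank order. The function names and the sample labels are illustrative and not from the presentation.

```python
def precision_at_k(relevance, k=5):
    """Fraction of the top-k results that are relevant.

    `relevance` is a list of 0/1 labels in rank order (an assumption of
    this sketch). Missing positions count as non-relevant.
    """
    return sum(1 for rel in relevance[:k] if rel) / k


def mean_precision_at_k(per_query_relevance, k=5):
    """Average precision@k over all queries."""
    scores = [precision_at_k(rel, k) for rel in per_query_relevance]
    return sum(scores) / len(scores)


# Two made-up queries: 2 of 5 relevant, then 5 of 5 relevant.
example = [[1, 0, 1, 0, 0], [1, 1, 1, 1, 1]]
print(mean_precision_at_k(example))
```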

5 Mean Average Precision For each relevant document in the top 10, find the precision up to its rank for a given query Sum the precisions and normalize by the number of known relevant documents Average over all queries
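The MAP steps above can be sketched as follows, assuming binary labels in rank order and a separately supplied count of known relevant documents per query. The names and the sample data are hypothetical.

```python
def average_precision(relevance, num_known_relevant, k=10):
    """Sum precision at the rank of each relevant document in the top k,
    then normalize by the number of known relevant documents."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision up to this rank
    return total / num_known_relevant if num_known_relevant else 0.0


def mean_average_precision(queries, k=10):
    """`queries` is a list of (relevance, num_known_relevant) pairs."""
    scores = [average_precision(rel, n, k) for rel, n in queries]
    return sum(scores) / len(scores)


# Made-up query: relevant documents at ranks 1 and 3, two known relevant.
print(average_precision([1, 0, 1, 0], num_known_relevant=2))
```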

6 Normalized Discounted Cumulative Gain Normalize the Discounted Cumulative Gain by the Ideal Discounted Cumulative Gain for a given query Average over all queries

7 Normalized Discounted Cumulative Gain Discounted Cumulative Gain – Give more emphasis to relevant documents by using 2^relevance – Give more emphasis to earlier ranks by using a logarithmic reduction factor – Sum over top 5 Ideal Discounted Cumulative Gain – Same as DCG, but with documents sorted by relevance
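A sketch of NDCG at 5 under stated assumptions: the gain is taken as 2^relevance - 1 and the discount as log2(rank + 1), which is a common formulation; the slides only say "2^relevance" and "logarithmic reduction factor", so the exact form used by the presenters is not confirmed. Function names and sample labels are illustrative.

```python
import math


def dcg_at_k(labels, k=5):
    """Discounted cumulative gain over the top k graded labels.

    Assumed gain: 2**rel - 1. Assumed discount: log2(rank + 1).
    """
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(ranked_labels, judged_labels=None, k=5):
    """DCG normalized by the ideal DCG (labels sorted by relevance).

    If `judged_labels` (all judged labels for the query) is omitted,
    the ideal ordering is built from the ranked labels themselves.
    """
    pool = ranked_labels if judged_labels is None else judged_labels
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def mean_ndcg_at_k(per_query_labels, k=5):
    """Average NDCG@k over all queries."""
    scores = [ndcg_at_k(labels, k=k) for labels in per_query_labels]
    return sum(scores) / len(scores)


# Made-up graded labels (0 = not relevant, 3 = highly relevant).
print(ndcg_at_k([0, 3]))
```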

8 What’s the problem? Sensitivity Might reject small but significant improvements Bias Judges removed from search process Fidelity Evaluation should reflect user success!!

9 Alternative Evaluation Use actual user searches Judges become actual users Evaluation becomes user success

10 Interleaving System A Results + System B Results Team-Draft Algorithm

11 Captain Ahab Captain Barnacle

12 Captain Ahab Captain Barnacle Interleaved List
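A sketch of team-draft interleaving, in which the two rankers act like team captains (Ahab and Barnacle on the slides) taking turns picking their best remaining result. The coin flip on equal team sizes follows the usual description of the algorithm; stopping as soon as either ranking has nothing new to offer, and the function and variable names, are assumptions of this sketch.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Return (interleaved_list, team_a, team_b).

    Each round, the captain with fewer picks chooses next; a coin flip
    decides when the teams are the same size. A pick is the captain's
    highest-ranked document not already in the interleaved list.
    """
    rng = rng or random.Random()
    interleaved, team_a, team_b = [], set(), set()

    def next_new(ranking):
        # Highest-ranked document that neither team has taken yet.
        for doc in ranking:
            if doc not in team_a and doc not in team_b:
                return doc
        return None

    while True:
        pick_a, pick_b = next_new(ranking_a), next_new(ranking_b)
        if pick_a is None or pick_b is None:
            break  # assumption: stop once either ranking is exhausted
        a_picks = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        doc, team = (pick_a, team_a) if a_picks else (pick_b, team_b)
        interleaved.append(doc)
        team.add(doc)
    return interleaved, team_a, team_b


# Made-up rankings from two systems.
mixed, ahab, barnacle = team_draft_interleave(
    ["a", "b", "c", "d"], ["b", "d", "a", "e"], random.Random(0))
print(mixed, ahab, barnacle)
```

Because the captain with fewer picks always goes next, the two teams never differ in size by more than one document.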

13 Crediting Whoever has the most distinct clicks is considered “better” In case of tie - ignored
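The crediting rule above can be sketched as a small function, assuming each impression records which documents were clicked and which team contributed each document. The names are hypothetical.

```python
def credit_impression(clicked_docs, team_a, team_b):
    """Return "A" or "B" for the system with more distinct clicks,
    or None when the impression is a tie and should be ignored."""
    distinct_clicks = set(clicked_docs)  # repeat clicks count once
    clicks_a = len(distinct_clicks & set(team_a))
    clicks_b = len(distinct_clicks & set(team_b))
    if clicks_a > clicks_b:
        return "A"
    if clicks_b > clicks_a:
        return "B"
    return None


# Made-up impression: two distinct clicks on A's documents, one on B's.
print(credit_impression(["a", "a", "c", "d"], {"a", "c"}, {"b", "d"}))
```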

14 Retrieval System Pairs Major improvements – majorAB – majorBC – majorAC Minor improvements – minorE – minorD

15 Evaluation 12,000 queries – Sample n times with replacement – Count sampled queries where rankers differ – Ignore ties Percent where better ranker scores better
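One possible reading of the sampling procedure above, as a sketch: draw n queries with replacement, compare the two rankers' total scores on that sample, discard tied samples, and report the fraction of decided samples in which the ranker known to be better comes out ahead. The slide does not spell out these details, so the trial count, the comparison by summed score, and all names are assumptions.

```python
import random


def agreement_rate(scores_better, scores_worse, n, trials=1000, rng=None):
    """Fraction of resampled query sets on which the better ranker wins.

    `scores_better[i]` and `scores_worse[i]` are per-query metric values
    for the same query i. Tied samples are ignored. Returns None if
    every sample is tied.
    """
    rng = rng or random.Random()
    indices = range(len(scores_better))
    wins = losses = 0
    for _ in range(trials):
        sample = rng.choices(indices, k=n)  # n queries, with replacement
        total_better = sum(scores_better[i] for i in sample)
        total_worse = sum(scores_worse[i] for i in sample)
        if total_better > total_worse:
            wins += 1
        elif total_better < total_worse:
            losses += 1
    decided = wins + losses
    return wins / decided if decided else None


# Synthetic per-query scores, for illustration only.
better = [0.6, 0.4, 0.8, 0.5, 0.7]
worse = [0.5, 0.5, 0.6, 0.5, 0.6]
print(agreement_rate(better, worse, n=3, rng=random.Random(0)))
```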





20 Interleaving Evaluation



23 Credit Assignment Alternatives Shared top k – Ignore? – Lower clicks treated the same Not all clicks are created equal – log(rank) – 1/rank – Top – Bottom
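A hedged sketch of the "not all clicks are created equal" idea: each click contributes a weight that depends on the rank at which it occurred, and the system with the larger weighted total wins. The slide only names log(rank) and 1/rank; how the presenters applied them is not stated, so the function below is an illustration, and the "Top", "Bottom", and shared top k variants are not covered.

```python
import math


def weighted_credit(clicked_ranks_a, clicked_ranks_b, weight):
    """Compare weighted click totals for the two systems.

    `clicked_ranks_a` / `clicked_ranks_b` hold the 1-based ranks in the
    interleaved list of clicked documents credited to each system.
    `weight` maps a rank to a credit value. Returns "A", "B", or None.
    """
    total_a = sum(weight(rank) for rank in clicked_ranks_a)
    total_b = sum(weight(rank) for rank in clicked_ranks_b)
    if total_a > total_b:
        return "A"
    if total_b > total_a:
        return "B"
    return None


def inverse_rank(rank):
    return 1.0 / rank  # clicks near the top count more


def log_rank(rank):
    return math.log(rank)  # taken literally: a click at rank 1 gets zero credit


# Made-up impression: A's document clicked at rank 1, B's at rank 2.
print(weighted_credit([1], [2], inverse_rank))
print(weighted_credit([1], [2], log_rank))
```

The two weightings pull in opposite directions, so the same clicks can favor different systems depending on which one is chosen.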

24 Conclusions Performance measured by: – Judgment-based – Usage-based Surprise, surprise: small sample size is stupid – (check out that alliteration) Interleaving is transitive

