Published by Iyana Tarkington. Modified over 3 years ago.

1
Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents
Harr Chen and David R. Karger
MIT CSAIL
SIGIR 2006
4/30/2007

2
Abstract
Probability Ranking Principle (PRP)
– Rank documents in decreasing order of probability of relevance.
Propose a greedy algorithm that approximately optimizes the following objectives
– %no metric: the percentage of queries for which no relevant documents are retrieved.
– The diversity of results.

3
Introduction
Probability Ranking Principle
– Rule of thumb: “optimal”.
TREC robust track
– %no metric
– Question answering and finding a homepage.
Diversity
– For example, “Trojan horse”
– A PRP-based method may choose one “most likely” interpretation.
Greedy algorithm
– Fill each position in the ranking by assuming that all previous documents in the ranking are not relevant.

4
Introduction (Cont.)
Other measures
– Search length (SL)
– Reciprocal rank (RR)
– Instance recall: the number of different subtopics covered by a given result set.
Retrieving for Diversity
– Diversity arises automatically as a consequence of the objective function.

5
Related Work
Algorithm
– Zhai and Lafferty: a risk minimization framework
– Bookstein: a sequential learning retrieval system
Diversity
– Zhai et al.: novelty and redundancy
– Clustering is an approach to quickly cover a diverse range of query interpretations.

6
Evaluation Metrics
MSL (mean search length)
MRR (mean reciprocal rank)
%no
– k-call at n: 1 if at least k of the top n docs returned by the system for the given query are deemed relevant; otherwise 0.
– mean 1-call: one minus the %no metric
– n-call at n: perfect precision
Instance recall at rank n
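The per-query versions of these metrics can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper. The function names are hypothetical. The conventions for a ranking with no relevant document (search length equal to the list length, reciprocal rank 0) are assumptions, and so is reporting instance recall as a fraction of the t subtopics.

```python
from typing import Sequence, Set


def search_length(rels: Sequence[bool]) -> int:
    """Number of non-relevant docs ranked above the first relevant one.

    Assumption: returns len(rels) when no relevant doc is retrieved.
    """
    for i, relevant in enumerate(rels):
        if relevant:
            return i
    return len(rels)


def reciprocal_rank(rels: Sequence[bool]) -> float:
    """1 / (1-based rank of the first relevant doc); 0.0 if none (assumption)."""
    for i, relevant in enumerate(rels):
        if relevant:
            return 1.0 / (i + 1)
    return 0.0


def k_call_at_n(rels: Sequence[bool], k: int, n: int) -> int:
    """1 if at least k of the top n docs are relevant, otherwise 0."""
    return 1 if sum(1 for r in rels[:n] if r) >= k else 0


def instance_recall_at_n(doc_subtopics: Sequence[Set[str]], n: int, t: int) -> float:
    """Fraction of the t subtopics covered by the top n docs (assumed convention)."""
    covered = set().union(*doc_subtopics[:n])
    return len(covered) / t


# One query: relevance judgments for a ranked list of five docs.
rels = [False, False, True, True, False]
sl = search_length(rels)
rr = reciprocal_rank(rels)
one_call_at_2 = k_call_at_n(rels, k=1, n=2)
one_call_at_3 = k_call_at_n(rels, k=1, n=3)
```

Averaging `k_call_at_n(rels, 1, n)` over all queries gives mean 1-call, and one minus that average is the %no metric.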

7
Bayesian Retrieval
Standard Bayesian Information Retrieval
– The documents in a corpus should be ranked by Pr[r|d]
– By a monotonic transformation [equation missing]
– Focus on the objective function, so use a Naïve Bayes framework with multinomial models (θ_i) as the family of distributions.
– Determine the parameters (training)
– Dirichlet prior: a prior probability distribution over the parameters (θ_i).
– Estimate the parameters of the relevant distribution (i.e., Pr[d|r]).
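The formula behind the "monotonic transformation" bullet is not preserved in this transcript. One standard way to obtain such a transformation follows from Bayes' rule; the step below is a reconstruction under that assumption, not a quotation from the slide. The symbol ¬r (document not relevant) is introduced here for illustration.

```latex
\Pr[r \mid d]
  = \frac{\Pr[d \mid r]\,\Pr[r]}{\Pr[d \mid r]\,\Pr[r] + \Pr[d \mid \neg r]\,\Pr[\neg r]}
  = \left( 1 + \frac{\Pr[\neg r]}{\Pr[r]} \cdot \frac{\Pr[d \mid \neg r]}{\Pr[d \mid r]} \right)^{-1}
```

Because Pr[r] is the same for every document, Pr[r|d] increases exactly when the ratio Pr[d|r] / Pr[d|¬r] increases, so ranking by that ratio yields the same order as ranking by Pr[r|d].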

8
Objective Function
Consider optimizing for the k-call at n metric.
– k=1: the probability that at least one of the first n relevance variables is true
– For arbitrary k: the probability that at least k docs are relevant
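Written out, these objectives might look as follows. The notation is an assumption chosen to match the 0-indexed d_0, r_0 used on the later slides: r_i is the event that the document d_i at position i is relevant.

```latex
% k = 1: at least one of the first n documents is relevant
\Pr[\, r_0 \cup r_1 \cup \dots \cup r_{n-1} \mid d_0, d_1, \dots, d_{n-1} \,]

% arbitrary k: at least k of the first n documents are relevant
\Pr[\, \text{at least } k \text{ of } r_0, \dots, r_{n-1} \text{ hold} \mid d_0, d_1, \dots, d_{n-1} \,]
```

Since k-call at n is a 0/1 indicator, its expected value is exactly this probability, which is why the probability can serve as the objective.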

9
Optimization Methods
NP-hard problem
– It is NP-hard to perfectly optimize the k-call objective function over sets of n docs from a corpus of m docs, because [equation missing]
Greedy algorithm (approximately optimizes it)
– Successively select each result of the result set.
1. Select the first result by applying the conventional PRP.
2. For the ith result, hold results 1 through i-1 at their already selected values, and consider every remaining corpus document as a candidate for document i.
3. Pick the document with the highest k-call score as the ith result.
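The three greedy steps can be sketched as a short Python loop. This is an illustrative skeleton under stated assumptions, not the authors' implementation. `greedy_rank`, `ScoreFn`, and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for the k-call score (it is not the Naïve Bayes model from the slides). With an empty list of selected results, the score is expected to reduce to the usual PRP score, which gives step 1.

```python
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, str, float]  # (name, topic, base relevance probability): toy data only
ScoreFn = Callable[[Sequence[Doc], Doc], float]


def greedy_rank(corpus: Sequence[Doc], n: int, score: ScoreFn) -> List[Doc]:
    """Fill positions one at a time, holding earlier picks fixed."""
    selected: List[Doc] = []
    remaining = list(corpus)
    while remaining and len(selected) < n:
        # Steps 2 and 3: score every remaining doc given the fixed prefix
        # and keep the best one.
        best = max(remaining, key=lambda d: score(selected, d))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score(selected: Sequence[Doc], d: Doc) -> float:
    """Hypothetical score: base relevance, halved for each earlier pick on the same topic."""
    _, topic, p = d
    same_topic = sum(1 for s in selected if s[1] == topic)
    return p * (0.5 ** same_topic)


corpus = [
    ("a1", "horse-myth", 0.9),
    ("a2", "horse-myth", 0.8),
    ("b1", "malware", 0.6),
]
ranking = greedy_rank(corpus, 2, toy_score)
names = [name for name, _, _ in ranking]
```

In this toy run the second slot goes to the document on the other topic, because the stand-in score discounts documents that resemble earlier picks. Each position costs one scoring pass over the remaining corpus, so filling n positions takes on the order of n times m score evaluations.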

10
Applying the Greedy Approach
k=1
– First, choose the doc d_0 maximizing Pr[r_0|d_0].
– Next, choose d_1 maximizing the following quantity: [equation missing]
– Choose d_2 by maximizing [equation missing]
– In general, select the optimal d_i that maximizes [equation missing]
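The quantities on this slide are not preserved in this transcript. A plausible reconstruction follows from the 1-call objective and the rule stated in the introduction (assume all previous documents are not relevant). It assumes that the relevance of d_0 does not depend on which d_1 is chosen, and it writes ¬r_j for "d_j is not relevant".

```latex
\Pr[r_0 \cup r_1 \mid d_0, d_1]
  = \Pr[r_0 \mid d_0] + \Pr[\neg r_0 \mid d_0]\,\Pr[r_1 \mid \neg r_0, d_0, d_1]

% with d_0 already fixed, only the last factor depends on d_1, so choose
d_1 = \arg\max_{d}\ \Pr[r_1 \mid \neg r_0, d_0, d_1 = d]

% and in general
d_i = \arg\max_{d}\ \Pr[r_i \mid \neg r_0, \dots, \neg r_{i-1}, d_0, \dots, d_{i-1}, d_i = d]
```

Conditioning on the earlier documents being non-relevant is what pushes later picks away from documents that resemble the earlier ones.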

11
Applying the Greedy Approach (Cont.)
k=n (perfect precision)
– Select the ith document according to: [equation missing]
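The selection rule itself is not preserved in this transcript. A reconstruction from the chain rule, offered as a sketch rather than the slide's exact formula: for n-call at n every document must be relevant, and the joint probability factors as

```latex
\Pr[r_0 \cap \dots \cap r_i \mid d_0, \dots, d_i]
  = \Pr[r_0 \cap \dots \cap r_{i-1} \mid d_0, \dots, d_{i-1}]
    \cdot \Pr[r_i \mid r_0, \dots, r_{i-1}, d_0, \dots, d_i]

% the first factor is fixed by earlier choices, so choose
d_i = \arg\max_{d}\ \Pr[r_i \mid r_0, \dots, r_{i-1}, d_0, \dots, d_{i-1}, d_i = d]
```

This mirrors the k=1 case with the conditioning reversed: each new document is chosen as if all earlier ones were relevant. As in the k=1 sketch, it assumes that the relevance of earlier documents does not depend on the later choice.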

12
Optimizing for Other Metrics
Optimizing 1-call
– Choose greedily, conditioned on no previous document being relevant.
– Equivalent to minimizing expected search length and maximizing expected reciprocal rank.
– Also optimizes the instance recall metric, which measures the number of distinct subtopics retrieved. If a query has t subtopics, then instance recall is [equation missing]
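The instance recall formula is not preserved in this transcript. If instance recall is taken as the fraction of the t subtopics covered by the top n results (an assumption; the slides describe it only as the number of distinct subtopics retrieved), it can be written as an average of per-subtopic 1-call values. The subscript j, indexing subtopics, is introduced here for illustration.

```latex
\text{instance recall at } n
  = \frac{1}{t} \sum_{j=1}^{t} \bigl( \text{1-call at } n \text{ for subtopic } j \bigr)
```

Each term is 1 when at least one of the top n documents covers subtopic j and 0 otherwise, which is one way to see why a 1-call optimizer also helps instance recall.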

13
Google Examples
Two ambiguous queries: “Trojan horse” and “virus”
– Used the titles, summaries, and snippets of Google’s results to form a corpus of 1000 docs for each query.

14
Experiments
Methods
– 1-greedy, 10-greedy, and conventional PRP
Datasets
– Ad hoc topics from TREC-1, TREC-2, and TREC-3, used to set the weight parameters of the model appropriately.
– TREC 2004 robust track
– TREC-6, 7, 8 interactive track
– TREC-4 and TREC-6 ad hoc tracks

15
Tuning the Weights
Key weights
– For the proposed model, the key weights are the strengths of the relevant-distribution and irrelevant-distribution priors relative to the strength of the docs.
TRECs 1, 2, and 3
– About 724,000 docs and 150 topics (topics 51-200)
– Used for tuning the weights

16
Robust Track Experiments
TREC 2004 robust track
– 249 topics in total, about 528,000 docs
– 50 topics were selected by TREC as being “difficult” queries.

17
Instance Retrieval Experiments
TREC-6, 7, and 8 interactive track
– Tests how well the method retrieves diverse results
– 20 topics in total, with between 7 and 56 aspects each, and about 210,000 docs.
– Zhai et al.’s LM approach is better for aspect retrieval.

18
Multiple Annotator Experiments
TREC-4 and TREC-6
– Multiple independent annotators were asked to make relevance judgments for the same topics over the same corpus.
– TREC-4 had three annotators, TREC-6 had two.

19
Query Analysis
A specific topic: topic 100
– The description is: [description missing]

20
Conclusions and Future Work
Conclusions
– Identify that the PRP is not always optimal, and give an approach to directly optimize other desired objectives.
– The approach is algorithmically feasible.
Future work
– Other objective functions
– More sophisticated techniques, such as local search algorithms.
– Other models for the likelihood of relevance of collections of docs: the two-Poisson model, language models
