
1 Probabilistic IR Models
Based on probability theory. Basic idea: given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d).
Compared to previous models:
- Boolean and vector models: ranking based on a relevance value interpreted as a similarity measure between q and d
- Probabilistic models: ranking based on the estimated likelihood of d being relevant for query q

2 Probabilistic IR Models
Again: documents and queries are represented as vectors with binary weights (i.e. w_ij = 0 or 1).
Relevance is seen as a relationship between an information need (expressed as query q) and a document: a document d is relevant if and only if a user with information need q "wants" d.
Relevance is a function of various parameters, is subjective, and cannot always be specified exactly.
Hence: a probabilistic description of relevance, i.e. instead of a vector space we operate in an event space Q x D (Q = set of possible queries, D = set of all documents in the collection).
Interpretation: if a user with information need q draws a random document d from the collection, how big is its likelihood of being relevant, i.e. P(R|q,d)?
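A minimal sketch (not from the slides) of the binary-weight representation: each document and query becomes a 0/1 vector over a fixed vocabulary. The vocabulary and terms below are illustrative placeholders.

```python
# Binary term-weight vectors: w_ij = 1 if term j occurs in item i, else 0.

def binary_vector(item_terms, vocabulary):
    """Return the binary weight vector of a document or query."""
    present = set(item_terms)
    return [1 if t in present else 0 for t in vocabulary]

vocab = ["probabilistic", "retrieval", "model", "vector"]
d = binary_vector(["probabilistic", "retrieval", "retrieval"], vocab)
q = binary_vector(["retrieval", "model"], vocab)
print(d)  # [1, 1, 0, 0] -- repeated occurrences still give weight 1
print(q)  # [0, 1, 1, 0]
```

Note that term frequency is deliberately discarded here; the models on the following slides only distinguish presence from absence.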

3 The Probability Ranking Principle
Probability Ranking Principle (Robertson, 1977): optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query. (Informal definition)
This involves two assumptions:
1. Dependencies between documents are ignored.
2. It is assumed that the probabilities can be estimated in the best possible way.
Main task: estimation of the probability P(R|q,d) for every document d in the document collection D.
Reference: F. Crestani et al., "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in IR [3]
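The principle itself is simple to state in code: once P(R|q,d) has been estimated (by whatever model), the optimal ranking is by descending probability. A sketch with placeholder estimates:

```python
# Probability Ranking Principle: rank documents by estimated P(R|q,d),
# highest first. The probability values below are illustrative only.

def rank_by_probability(prob_relevant):
    """prob_relevant: dict doc_id -> estimated P(R|q,d)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)

estimates = {"d1": 0.8, "d2": 0.3, "d3": 0.55}
print(rank_by_probability(estimates))  # ['d1', 'd3', 'd2']
```

The hard part, as the slide says, is not the sorting but the estimation; the following slides are all about obtaining the probabilities.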

4 Probabilistic Modeling
Given: documents d_j = (t_1, t_2, ..., t_n) and queries q_i (n = number of index terms). We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: slightly different notation here than before!)
Estimating P(R|d,q) directly is often impossible in practice. Instead, use Bayes' theorem, i.e.

  P(R|q,d) = P(q,d|R) * P(R) / P(q,d)

or, for a fixed query q,

  P(R|d) = P(d|R) * P(R) / P(d)
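The Bayes inversion can be sketched numerically. All probabilities below are illustrative placeholders; P(d) is expanded by the law of total probability over the events R (relevant) and NR (non-relevant):

```python
# Bayes' theorem for relevance: P(R|d) = P(d|R) * P(R) / P(d),
# with P(d) = P(d|R) * P(R) + P(d|NR) * P(NR).

def posterior_relevance(p_d_given_r, p_d_given_nr, p_r):
    """Return P(R|d) from the two likelihoods and the prior P(R)."""
    p_nr = 1.0 - p_r
    p_d = p_d_given_r * p_r + p_d_given_nr * p_nr  # total probability
    return p_d_given_r * p_r / p_d

# A document pattern four times as likely under relevance, prior P(R) = 0.1:
print(posterior_relevance(0.4, 0.1, 0.1))  # ~0.3077, i.e. 4/13
```

The point of the inversion is that P(d|R) can be estimated from relevance judgments (next slides), whereas P(R|d) cannot be observed directly.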

5 Probabilistic Modeling as Decision Strategy
The decision about which documents should be returned is based on a threshold calculated with a cost function C_j. Example:

  C_j(R, dec)     Retrieved   Not retrieved
  Relevant doc.   0           1
  Non-rel. doc.   2           0

Decision based on a risk function that minimizes the expected cost: retrieve d if and only if

  C(non-rel., retrieved) * P(NR|q,d) <= C(rel., not retrieved) * P(R|q,d)
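A sketch of this decision rule with the example costs from the table (cost 1 for not retrieving a relevant document, cost 2 for retrieving a non-relevant one): retrieve when the expected cost of retrieving does not exceed the expected cost of withholding.

```python
# Cost-based retrieval decision: retrieve d iff
#   c_false_alarm * P(NR|q,d) <= c_miss * P(R|q,d)

def retrieve(p_rel, c_miss=1.0, c_false_alarm=2.0):
    """p_rel: estimated P(R|q,d). Costs default to the slide's example."""
    p_nonrel = 1.0 - p_rel
    return c_false_alarm * p_nonrel <= c_miss * p_rel

print(retrieve(0.7))  # True:  2 * 0.3 = 0.6 <= 1 * 0.7
print(retrieve(0.5))  # False: 2 * 0.5 = 1.0 >  1 * 0.5
```

With these costs the implied threshold is P(R|q,d) >= 2/3: penalizing false alarms twice as heavily as misses makes the system more conservative about retrieving.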

6 Probabilistic Modeling as Decision Strategy (Cont.)

7 Probability Estimation
Different approaches to estimating P(d|R) exist:
- Binary Independence Retrieval model (BIR)
- Binary Independence Indexing model (BII)
- Darmstadt Indexing Approach (DIA)
Generally we assume stochastic independence between the terms of one document, i.e.

  P(d|R) = P(t_1, ..., t_n | R) = P(t_1|R) * P(t_2|R) * ... * P(t_n|R)
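The independence assumption turns the joint likelihood into a simple product. A sketch with placeholder per-term probabilities:

```python
import math

# Term-independence assumption: P(d|R) factorizes into a product of
# per-term probabilities P(t_i|R). Values below are placeholders.

def doc_likelihood(term_probs):
    """P(d|R) as the product of the per-term probabilities P(t_i|R)."""
    return math.prod(term_probs)

print(doc_likelihood([0.5, 0.2, 0.8]))  # ~0.08
```

In practice one works with log-probabilities (sums instead of products) to avoid underflow on long documents, but the model is the same.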

8 Binary Independence Retrieval Model (BIR)
Learning: estimation of the probability distribution based on
- a query q_k
- a set of documents d_j
- respective relevance judgments
Application: generalization to different documents from the collection (but restricted to the same query and the terms from training)
[Diagram: docs/terms/queries event space, showing the learning region and the application region of BIR]
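A hedged sketch of the BIR learning step: from relevance judgments for one query, estimate for each term t the probabilities p_t = P(t present | relevant) and q_t = P(t present | non-relevant). The add-0.5 smoothing used here is a common choice, not something stated on the slide; the documents and judgments are invented examples.

```python
# BIR learning sketch: per-term presence probabilities from judged documents.

def estimate_bir(docs, relevant_ids, vocabulary):
    """docs: dict doc_id -> set of terms; relevant_ids: set of doc_ids.
    Returns (p, q): dicts term -> P(t|R) and term -> P(t|NR)."""
    rel_docs = [terms for doc_id, terms in docs.items() if doc_id in relevant_ids]
    non_docs = [terms for doc_id, terms in docs.items() if doc_id not in relevant_ids]
    p, q = {}, {}
    for t in vocabulary:
        # Add-0.5 smoothing so unseen terms get neither 0 nor 1.
        p[t] = (sum(t in d for d in rel_docs) + 0.5) / (len(rel_docs) + 1.0)
        q[t] = (sum(t in d for d in non_docs) + 0.5) / (len(non_docs) + 1.0)
    return p, q

docs = {"d1": {"a", "b"}, "d2": {"b"}, "d3": {"a"}}
p, q = estimate_bir(docs, relevant_ids={"d1", "d2"}, vocabulary=["a", "b"])
print(p["b"], q["b"])  # ~0.833 vs 0.25: "b" indicates relevance here
```

As the slide says, these estimates only generalize to other documents for the same query and the same training terms.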

9 Binary Independence Indexing Model (BII)
Learning: estimation of the probability distribution based on
- a document d_j
- a set of queries q_k
- respective relevance judgments
Application: generalization to different queries (but restricted to the same document and the terms from training)
[Diagram: docs/terms/queries event space, showing the learning region and the application region of BII]

10 Darmstadt Indexing Approach (DIA)
Learning: estimation of the probability distribution based on
- a set of queries q_k
- an abstract description of a set of documents d_j
- respective relevance judgments
Application: generalization to different queries and documents
[Diagram: docs/terms/queries event space, showing the learning region and the application region of DIA]

11 DIA - Description Step
Basic idea: instead of term-document pairs, consider relevance descriptions x(t_i, d_m). These contain the values of certain attributes of term t_i, of document d_m, and of their relation to each other. Examples:
- dictionary information about t_i (e.g. IDF)
- parameters describing d_m (e.g. length or number of unique terms)
- information about the appearance of t_i in d_m (e.g. in title or abstract), its frequency, the distance between two query terms, etc.
Reference: Fuhr, Buckley [4]

12 DIA - Decision Step
Estimation of the probability P(R | x(t_i, d_m)): the probability of a document d_m being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(t_i, d_m).
Advantages:
- abstraction from specific term-document pairs, and thus generalization to random documents and queries
- enables individual, application-specific relevance descriptions
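A sketch of computing such a relevance description. The two attributes chosen here, an in-title flag and a capped occurrence count, mirror the simple example on the next slide; the function and its inputs are otherwise invented for illustration.

```python
# DIA-style relevance description: map a (term, document) pair to a small
# tuple of attribute values x(t_i, d_m) = (x_1, x_2).

def relevance_description(term, title_terms, doc_terms):
    """title_terms: terms of the document title; doc_terms: list of all terms."""
    x1 = 1 if term in title_terms else 0          # term appears in title?
    count = doc_terms.count(term)
    x2 = 2 if count >= 2 else (1 if count == 1 else 0)  # capped frequency
    return (x1, x2)

print(relevance_description("ir", ["ir"], ["ir", "models", "ir"]))  # (1, 2)
```

Because different term-document pairs can share the same description x, probabilities estimated per description carry over to unseen documents and queries, which is exactly the generalization the slide claims.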

13 DIA - (Very) Simple Example
Relevance description: x(t_i, d_m) = (x_1, x_2) with

  x_1 = 1 if t_i occurs in the title of d_m, 0 otherwise
  x_2 = 1 if t_i occurs in d_m once, 2 if t_i occurs in d_m at least twice

Training set: q_1, q_2, d_1, d_2, d_3

Event space:

  Query  Doc.  Rel.      Terms               x
  q_1    d_1   rel.      t_1, t_2, t_3       (1,1) (0,1) (1,2)
  q_1    d_2   not rel.  t_1, t_3, t_4       (0,2) (1,1) (0,1)
  q_2    d_1   rel.      t_2, t_5, t_6, t_7  (0,2) (1,1) (1,2) (0,2)
  q_2    d_3   not rel.  t_5, t_7            (0,1) (0,1)

Estimated probabilities (fraction of relevant events per description x):

  x      E(x)
  (0,1)  1/4
  (0,2)  2/3
  (1,1)  2/3
  (1,2)  1
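The estimation in this example is just relative-frequency counting per description value, which can be sketched directly:

```python
from collections import defaultdict

# E(x): fraction of relevant events among all training events whose
# relevance description equals x.

def estimate_e(events):
    """events: list of (x, relevant) pairs; x a tuple, relevant a bool."""
    total = defaultdict(int)
    rel = defaultdict(int)
    for x, relevant in events:
        total[x] += 1
        rel[x] += relevant  # True counts as 1
    return {x: rel[x] / total[x] for x in total}

# e.g. the three (1,1) events from the table, two of them relevant:
print(estimate_e([((1, 1), True), ((1, 1), True), ((1, 1), False)]))
# {(1, 1): 0.666...}
```

Feeding in all twelve (x, relevance) events of the training set reproduces the E(x) column of the table.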

