IR Models: Review Vector Model and Probabilistic.

1 IR Models: Review Vector Model and Probabilistic

2 IR Model Taxonomy
[Figure: taxonomy of IR models, organized by user task]
User task: Retrieval (ad hoc, filtering) and Browsing
Retrieval models:
–Classic models: Boolean, Vector, Probabilistic
–Set-theoretic: Fuzzy, Extended Boolean
–Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
–Probabilistic: Inference Network, Belief Network
–Structured models: Non-Overlapping Lists, Proximal Nodes
Browsing models: Flat, Structure Guided, Hypertext

3 Classic IR Models - Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for capturing the document's main themes
The importance of an index term is represented by a weight associated with it
Let
–k_i be an index term
–d_j be a document
–w_ij be the weight associated with the pair (k_i, d_j)
The weight w_ij quantifies the importance of the index term for describing the document's contents
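
The (k_i, d_j, w_ij) triple maps naturally onto a nested mapping from documents to term weights. A minimal sketch (the document and term names are illustrative, not from the slides):

```python
from collections import defaultdict

# weights[dj][ki] = w_ij; index terms absent from a document implicitly get weight 0
weights: dict[str, dict[str, float]] = defaultdict(dict)
weights["d1"]["k1"] = 2.0
weights["d1"]["k3"] = 1.0

def w(dj: str, ki: str) -> float:
    """Return w_ij, the weight of index term ki in document dj (0 if absent)."""
    return weights.get(dj, {}).get(ki, 0.0)

print(w("d1", "k1"), w("d1", "k2"))  # 2.0 0.0
```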

4 Vector Model Similarity
Sim(q, d_j) = cos(θ) = (vec(d_j) · vec(q)) / (|d_j| * |q|) = (Σ_i w_ij * w_iq) / (|d_j| * |q|)
TF-IDF term-weighting scheme
–w_ij = [freq(i,j) / max_l freq(l,j)] * log(N / n_i)
Default query term weights
–w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)
[Figure: query vector q and document vector d_j separated by angle θ]
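
A minimal sketch of the cosine ranking with the TF-IDF weights above. The corpus, term names, and helper functions are illustrative rather than part of the slides, and it assumes every query term occurs in at least one document so that log(N / n_i) is defined:

```python
import math
from collections import Counter

def tfidf_weights(doc_terms, df, N):
    """Document weights: w_ij = (freq(i,j) / max_l freq(l,j)) * log(N / n_i)."""
    freq = Counter(doc_terms)
    max_f = max(freq.values())
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in freq.items()}

def query_weights(query_terms, df, N):
    """Query weights: w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = Counter(query_terms)
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(N / df[t]) for t, f in freq.items()}

def cosine(d, q):
    """sim(q, d_j) = (sum_i w_ij * w_iq) / (|d_j| * |q|)."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Tiny illustrative corpus: three documents over index terms k1, k2, k3
docs = {"d1": ["k1", "k3", "k3"], "d2": ["k1"], "d3": ["k2", "k3"]}
N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))  # n_i per term
weighted = {dj: tfidf_weights(terms, df, N) for dj, terms in docs.items()}
q = query_weights(["k1", "k3"], df, N)
for dj, dvec in sorted(weighted.items(), key=lambda kv: -cosine(kv[1], q)):
    print(dj, round(cosine(dvec, q), 3))
```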

5 Example 1: no weights
[Figure: documents d1–d7 placed by which of the index terms k1, k2, k3 they contain]
      k1  k2  k3   q · d_j
d1     1   0   1      2
d2     1   0   0      1
d3     0   1   1      2
d4     1   0   0      1
d5     1   1   1      3
d6     1   1   0      2
d7     0   1   0      1
q      1   1   1

6 Example 2: query weights
[Figure: documents d1–d7 placed by which of the index terms k1, k2, k3 they contain]
      k1  k2  k3   q · d_j
d1     1   0   1      4
d2     1   0   0      1
d3     0   1   1      5
d4     1   0   0      1
d5     1   1   1      6
d6     1   1   0      3
d7     0   1   0      2
q      1   2   3

7 Example 3: query and document weights
[Figure: documents d1–d7 placed by which of the index terms k1, k2, k3 they contain]
      k1  k2  k3   q · d_j
d1     2   0   1      5
d2     1   0   0      1
d3     0   1   3     11
d4     2   0   0      2
d5     1   2   4     17
d6     1   2   0      5
d7     0   5   0     10
q      1   2   3
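
A quick way to check the q · d_j column of Example 3 is to compute the dot product directly; the short sketch below reproduces that ranking (unnormalized, as in the examples, with the helper function being illustrative):

```python
# Term weights taken from the Example 3 table above; q = (1, 2, 3)
docs = {
    "d1": {"k1": 2, "k3": 1},
    "d2": {"k1": 1},
    "d3": {"k2": 1, "k3": 3},
    "d4": {"k1": 2},
    "d5": {"k1": 1, "k2": 2, "k3": 4},
    "d6": {"k1": 1, "k2": 2},
    "d7": {"k2": 5},
}
q = {"k1": 1, "k2": 2, "k3": 3}

def dot(d, q):
    """Unnormalized similarity q . d_j."""
    return sum(w * q.get(t, 0) for t, w in d.items())

for dj, d in sorted(docs.items(), key=lambda kv: -dot(kv[1], q)):
    print(dj, dot(d, q))  # d5 17, d3 11, d7 10, d1 5, d6 5, d4 2, d2 1
```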

8 Summary of Vector Space Model
Advantages:
–term weighting improves the quality of the answer set
–partial matching allows retrieval of documents that approximate the query conditions
–the cosine ranking formula sorts documents according to their degree of similarity to the query
Disadvantages:
–assumes index terms are independent, though it is not clear that this hurts in practice

9 Probabilistic Model
Objective: to capture the IR problem within a probabilistic framework
Given a user query, there is an ideal answer set
Querying is a specification of the properties of this ideal answer set (akin to clustering)
–But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve by iteration

10 Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected)
The IR system uses this information to refine the description of the ideal answer set
By repeating this process, the description of the ideal answer set is expected to improve
The description of the ideal answer set is modeled in probabilistic terms

11 Probabilistic Ranking Principle
Given a user query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find d_j relevant.
The model assumes that this probability of relevance depends only on the query and document representations.
The ideal answer set is referred to as R and should maximize the probability of relevance.
Documents in the set R are predicted to be relevant.
But,
–how do we compute these probabilities?

12 The Ranking
Probabilistic ranking is computed as:
–sim(q, d_j) = P(d_j relevant-to q) / P(d_j non-relevant-to q)
–This is the odds of the document d_j being relevant
–Taking the odds minimizes the probability of an erroneous judgement
Definitions:
–w_ij ∈ {0,1}, w_iq ∈ {0,1}
–P(R | vec(d_j)): probability that the given document is relevant
–P(¬R | vec(d_j)): probability that the document is not relevant

13 The Ranking
sim(d_j, q) = P(R | vec(d_j)) / P(¬R | vec(d_j))
            = [P(vec(d_j) | R) * P(R)] / [P(vec(d_j) | ¬R) * P(¬R)]
            ~ P(vec(d_j) | R) / P(vec(d_j) | ¬R)
P(vec(d_j) | R): probability of randomly selecting the document d_j from the set R of relevant documents

14 The Ranking
sim(d_j, q) ~ P(vec(d_j) | R) / P(vec(d_j) | ¬R)
            ~ [Π_{k_i ∈ d_j} P(k_i | R) × Π_{k_i ∉ d_j} P(¬k_i | R)] / [Π_{k_i ∈ d_j} P(k_i | ¬R) × Π_{k_i ∉ d_j} P(¬k_i | ¬R)]
P(k_i | R): probability that the index term k_i is present in a document randomly selected from the set R of relevant documents
Assumes the independence of index terms.

15 The Ranking
sim(d_j, q) ~ [Π_{k_i ∈ d_j} P(k_i | R) × Π_{k_i ∉ d_j} P(¬k_i | R)] / [Π_{k_i ∈ d_j} P(k_i | ¬R) × Π_{k_i ∉ d_j} P(¬k_i | ¬R)]
math happens (take logs and drop document-independent factors; the rearrangement is sketched below)...
~ Σ_i w_iq * w_ij * ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
where
P(¬k_i | R) = 1 - P(k_i | R)
P(¬k_i | ¬R) = 1 - P(k_i | ¬R)
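
The "math happens" step is the usual binary-independence rearrangement; a sketch in the slide's notation, writing p_i = P(k_i | R) and q_i = P(k_i | ¬R):

```latex
\log\frac{P(\vec{d_j}\mid R)}{P(\vec{d_j}\mid \bar{R})}
  \;=\; \sum_{i:\,w_{ij}=1}\log\frac{p_i}{q_i} \;+\; \sum_{i:\,w_{ij}=0}\log\frac{1-p_i}{1-q_i}
  \;=\; \sum_{i:\,w_{ij}=1}\left(\log\frac{p_i}{1-p_i}+\log\frac{1-q_i}{q_i}\right)
        \;+\; \underbrace{\sum_{i}\log\frac{1-p_i}{1-q_i}}_{\text{same for every } d_j}
```

Adding and subtracting Σ_{i: w_ij=1} log[(1-p_i)/(1-q_i)] gives the second form; the last sum does not depend on d_j and can be dropped for ranking, and multiplying by w_iq restricts the sum to query terms, which yields the formula on the slide.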

16 The Initial Ranking
sim(d_j, q) ~ Σ_i w_iq * w_ij * ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
Probabilities P(k_i | R) and P(k_i | ¬R)?
Estimates based on assumptions:
–P(k_i | R) = 0.5
–P(k_i | ¬R) = n_i / N, where n_i is the number of docs that contain k_i
–Use this initial guess to retrieve an initial ranking
–Improve upon this initial ranking
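
A small sketch of the initial weights (variable names are illustrative, not from the slides). With P(k_i | R) = 0.5 the first log term vanishes, so the initial score reduces to an idf-like weight log((N - n_i) / n_i) summed over the query terms that appear in the document:

```python
import math

def initial_term_weight(ni: int, N: int) -> float:
    """Initial estimates: P(ki|R) = 0.5, P(ki|notR) = ni/N (assumes 0 < ni < N)."""
    p_rel, p_nonrel = 0.5, ni / N
    return (math.log(p_rel / (1 - p_rel))            # = 0 with the 0.5 guess
            + math.log((1 - p_nonrel) / p_nonrel))   # = log((N - ni) / ni)

def initial_score(doc_terms: set, query_terms: set, df: dict, N: int) -> float:
    """sim(dj, q) with binary w_ij, w_iq and the initial estimates."""
    return sum(initial_term_weight(df[t], N) for t in query_terms & doc_terms)

# Example: N = 1000 docs, a term occurring in 100 of them gets weight log(9) ≈ 2.2
print(initial_term_weight(100, 1000))
```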

17 Improving the Initial Ranking
sim(d_j, q) ~ Σ_i w_iq * w_ij * ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
Let
–V: set of docs initially retrieved (these are treated as relevant even if we are not sure)
–V_i: subset of the retrieved docs that contain k_i
Re-evaluate the estimates:
–P(k_i | R) = V_i / V
–P(k_i | ¬R) = (n_i - V_i) / (N - V)
Repeat recursively

18 Improving the Initial Ranking
sim(d_j, q) ~ Σ_i w_iq * w_ij * ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
Need to avoid problems with small known document sets (e.g. when V = 1 and V_i = 0):
–P(k_i | R) = (V_i + 0.5) / (V + 1)
–P(k_i | ¬R) = (n_i - V_i + 0.5) / (N - V + 1)
But we can use the frequency in the corpus instead:
–P(k_i | R) = (V_i + n_i/N) / (V + 1)
–P(k_i | ¬R) = (n_i - V_i + n_i/N) / (N - V + 1)
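
A sketch of the re-estimation step using the smoothed formulas above (V and V_i are counts here, and the 0.5 adjustment can be swapped for n_i/N as on the slide; the example numbers are illustrative):

```python
import math

def reestimate(Vi: int, V: int, ni: int, N: int, corpus_smoothing: bool = False):
    """Updated (P(ki|R), P(ki|notR)) from the retrieved set of size V."""
    adj = ni / N if corpus_smoothing else 0.5
    p_rel = (Vi + adj) / (V + 1)
    p_nonrel = (ni - Vi + adj) / (N - V + 1)
    return p_rel, p_nonrel

def term_weight(p_rel: float, p_nonrel: float) -> float:
    """log(P(ki|R)/P(notki|R)) + log(P(notki|notR)/P(ki|notR))."""
    return (math.log(p_rel / (1 - p_rel))
            + math.log((1 - p_nonrel) / p_nonrel))

# One refinement round: N = 1000 docs, ni = 50 contain ki,
# V = 20 docs retrieved so far, Vi = 8 of them contain ki.
print(term_weight(*reestimate(8, 20, 50, 1000)))
```

Re-ranking with these updated weights and repeating the loop implements the "repeat recursively" step from the previous slide.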

19 Probabilistic Model
Advantages:
–Documents are ranked in decreasing order of their probability of relevance
Disadvantages:
–need to guess initial estimates for P(k_i | R)
–the method does not take tf and idf factors into account
New use for the model:
–Meta-search: use probabilities to model the value of different search engines for different topics

20 Brief Comparison of Classic Models
The Boolean model does not provide for partial matches and is considered the weakest classic model
Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
This also seems to be the view of the research community

