# The Probabilistic Model


## Probabilistic Model

- Objective: to capture the IR problem using a probabilistic framework.
- Given a user query, there is an ideal answer set.
- Querying is the specification of the properties of this ideal answer set (clustering).
- But what are these properties?
- Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
- Improve by iteration.

## Probabilistic Model

- An initial set of documents is retrieved somehow.
- The user inspects these documents looking for the relevant ones (only the top 10-20 need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, the description of the ideal answer set is expected to improve.
- Always keep in mind the need to guess the description of the ideal answer set at the very beginning.
- The description of the ideal answer set is modeled in probabilistic terms.

## Probabilistic Ranking Principle

- Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find $d_j$ interesting (i.e., relevant).
- The model assumes that this probability of relevance depends only on the query and document representations.
- The ideal answer set is referred to as $R$ and should maximize the probability of relevance.
- Documents in the set $R$ are predicted to be relevant.

## Probabilistic Ranking Principle

- But:
  - How do we compute the probabilities?
  - What is the sample space?

## The Ranking

- The probabilistic ranking is computed as:
  - $\mathrm{sim}(d_j, q) = \dfrac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$
  - This is the odds of the document $d_j$ being relevant.
  - Taking the odds minimizes the probability of an erroneous judgment.
- Definitions:
  - $w_{ij} \in \{0, 1\}$: the index term weights are binary.
  - $P(R \mid d_j)$: probability that the given document is relevant.
  - $P(\bar{R} \mid d_j)$: probability that the document is not relevant.

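## The Ranking

- Applying Bayes' rule:

$$
\mathrm{sim}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})} \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
$$

- The last step drops $P(R)$ and $P(\bar{R})$, which are the same for every document and so do not affect the ranking.
- $P(d_j \mid R)$: probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.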

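## The Ranking

- Assuming independence of index terms:

$$
\mathrm{sim}(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \sim \frac{\left[\prod_{k_i \in d_j} P(k_i \mid R)\right] \left[\prod_{k_i \notin d_j} P(\bar{k}_i \mid R)\right]}{\left[\prod_{k_i \in d_j} P(k_i \mid \bar{R})\right] \left[\prod_{k_i \notin d_j} P(\bar{k}_i \mid \bar{R})\right]}
$$

- $P(k_i \mid R)$: probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents.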
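The product form can be sketched in a few lines of Python. This is a minimal illustration, assuming binary term occurrence and term independence; the function name `odds_product` and all probability values are hypothetical and not part of the original slides.

```python
from typing import Dict, Set


def odds_product(doc_terms: Set[str],
                 p_rel: Dict[str, float],
                 p_nonrel: Dict[str, float]) -> float:
    """Ratio P(d_j | R) / P(d_j | not R) under term independence.

    p_rel[k] stands for P(k | R) and p_nonrel[k] for P(k | not R).
    Both are assumed to lie strictly between 0 and 1.
    """
    ratio = 1.0
    for term in p_rel:
        if term in doc_terms:
            # Term present: contributes P(k | R) / P(k | not R).
            ratio *= p_rel[term] / p_nonrel[term]
        else:
            # Term absent: contributes P(not k | R) / P(not k | not R).
            ratio *= (1.0 - p_rel[term]) / (1.0 - p_nonrel[term])
    return ratio


# Hypothetical estimates for a two-term vocabulary.
p_rel = {"model": 0.8, "query": 0.5}
p_nonrel = {"model": 0.2, "query": 0.5}

score_with = odds_product({"model"}, p_rel, p_nonrel)
score_without = odds_product({"query"}, p_rel, p_nonrel)
```

In this toy setting the document containing "model" gets the higher odds, because "model" is assumed to be more likely in relevant documents than in non-relevant ones, while "query" is equally likely in both and does not change the ratio.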

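## The Ranking

$$
\mathrm{sim}(d_j, q) \sim \sum_{i} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)} + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
$$

where:

- $P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)$
- $P(\bar{k}_i \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$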
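As a rough sketch of how this sum could be evaluated, the Python below scores a few toy documents. The names `term_weight` and `sim`, the documents, and the probability values are all assumptions made for illustration.

```python
import math
from typing import Dict, Set


def term_weight(p_rel: float, p_nonrel: float) -> float:
    """log(P(k|R) / P(not k|R)) + log(P(not k|not R) / P(k|not R)).

    Both probabilities must lie strictly between 0 and 1.
    """
    return (math.log(p_rel / (1.0 - p_rel))
            + math.log((1.0 - p_nonrel) / p_nonrel))


def sim(query: Set[str], doc: Set[str],
        p_rel: Dict[str, float], p_nonrel: Dict[str, float]) -> float:
    # With binary weights, w_iq * w_ij is 1 only for terms that occur
    # in both the query and the document, so the sum runs over those.
    return sum(term_weight(p_rel[k], p_nonrel[k]) for k in query & doc)


# Hypothetical probabilities and documents, for illustration only.
p_rel = {"probabilistic": 0.5, "model": 0.5}
p_nonrel = {"probabilistic": 0.1, "model": 0.5}
query = {"probabilistic", "model"}
docs = {
    "d1": {"probabilistic", "model"},
    "d2": {"model", "vector"},
    "d3": {"boolean"},
}

scores = {name: sim(query, terms, p_rel, p_nonrel)
          for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `d1` ranks first because it matches "probabilistic", which is assumed to be rare among non-relevant documents. The term "model" is assumed to be equally likely in relevant and non-relevant documents, so it contributes nothing to any score.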

## The Initial Ranking

- How can we compute the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
- Estimation based on assumptions:
  - $P(k_i \mid R) = 0.5$
  - $P(k_i \mid \bar{R}) = n_i / N$, where $n_i$ is the number of documents that contain $k_i$ and $N$ is the total number of documents in the collection.
  - Use this initial guess to retrieve an initial ranking.
  - Improve upon this initial ranking.
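The initial estimates can be computed directly from document counts. The sketch below assumes documents are represented as sets of index terms; the function name `initial_estimates` and the toy collection are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def initial_estimates(docs: List[Set[str]],
                      query: Set[str]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Initial guesses: P(k|R) = 0.5 and P(k|not R) = n_i / N."""
    n_docs = len(docs)  # N: size of the collection
    p_rel = {k: 0.5 for k in query}
    # n_i / N: fraction of all documents that contain the term.
    p_nonrel = {k: sum(1 for d in docs if k in d) / n_docs for k in query}
    return p_rel, p_nonrel


# Toy collection, for illustration only.
docs = [
    {"probabilistic", "model"},
    {"model", "ranking"},
    {"boolean", "retrieval"},
    {"fuzzy", "sets"},
]
p_rel, p_nonrel = initial_estimates(docs, {"model", "ranking"})
```

With $P(k_i \mid R) = 0.5$ the first logarithm in the ranking formula is zero, so each matching term contributes $\log\frac{N - n_i}{n_i}$ to the initial score. A term that occurs in no document or in every document gives an estimate of 0 or 1, for which that logarithm is undefined, so a caller would need to handle those cases separately.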

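## Improving the Initial Ranking

- Let:
  - $V$: set of documents initially retrieved
  - $V_i$: subset of the retrieved documents that contain $k_i$
- Re-evaluate the estimates (with $V$ and $V_i$ also standing for the number of documents in each set):
  - $P(k_i \mid R) = V_i / V$
  - $P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)$
- Repeat recursively.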

## Improving the Initial Ranking

- To avoid problems with $V = 1$ and $V_i = 0$:
  - $P(k_i \mid R) = \dfrac{V_i + 0.5}{V + 1}$
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + 0.5}{N - V + 1}$
- Also:
  - $P(k_i \mid R) = \dfrac{V_i + n_i/N}{V + 1}$
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + n_i/N}{N - V + 1}$
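One re-estimation step with the 0.5 adjustment might look like the following sketch. It assumes the retrieved set $V$ is given as a list of document indices (for example, the top-ranked documents from the previous pass); the function name `reestimate` and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def reestimate(docs: List[Set[str]], query: Set[str],
               retrieved: List[int]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Re-estimate P(k|R) and P(k|not R) from the retrieved set V.

    Uses the adjusted formulas (V_i + 0.5) / (V + 1) and
    (n_i - V_i + 0.5) / (N - V + 1).
    """
    n_docs = len(docs)       # N
    v_size = len(retrieved)  # V
    p_rel: Dict[str, float] = {}
    p_nonrel: Dict[str, float] = {}
    for k in query:
        n_i = sum(1 for d in docs if k in d)            # docs containing k
        v_i = sum(1 for j in retrieved if k in docs[j])  # retrieved docs containing k
        p_rel[k] = (v_i + 0.5) / (v_size + 1)
        p_nonrel[k] = (n_i - v_i + 0.5) / (n_docs - v_size + 1)
    return p_rel, p_nonrel


# Toy collection; documents 0 and 1 are assumed to be the retrieved set V.
docs = [
    {"probabilistic", "model"},
    {"model", "ranking"},
    {"boolean", "retrieval"},
    {"vector", "model"},
    {"fuzzy", "sets"},
]
p_rel, p_nonrel = reestimate(docs, {"model", "ranking"}, retrieved=[0, 1])
```

Feeding the new estimates back into the ranking formula, taking the new top documents as $V$, and calling the function again gives the repeated refinement the slides describe.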

## Pluses and Minuses

- Advantages:
  - Documents are ranked in decreasing order of their probability of relevance.
- Disadvantages:
  - Initial estimates for $P(k_i \mid R)$ have to be guessed.
  - The method does not take tf and idf factors into account.

## Brief Comparison of Classic Models

- The Boolean model does not provide for partial matches and is considered to be the weakest classic model.
- Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model with general collections.
- This also seems to be the view of the research community.

## Alternative Models

- Models based on fuzzy sets
- Extensions of the Boolean model: continuous weights in the interval [0, 1]
- Models based on latent semantic analysis (LSA)
- Models based on neural networks
- Models based on Bayesian networks
- Models for structured documents
- Models for browsing
- …
