Probabilistic Information Retrieval


1 Probabilistic Information Retrieval
CSE Database Exploration, Gautam Das. Thursday, March. Z. M. Joseph, Spring 2006, CSE, UTA

2 Basic Rules of Probability
Recall the product rule and Bayes' theorem:
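The slide's equation images are not included in the transcript; the standard forms being recalled are:

P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)    (product rule)

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}    (Bayes' theorem)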

3 Basic Assumptions
Assume a database D consisting of a set of objects: documents, tuples, etc. Q is a query and R is the 'relevant set' of tuples for Q. The goal is to find an R for each Q, given D. Rather than a deterministic answer, consider a probabilistic ordering: a ranking/scoring function should reflect the degree of relevance of each document. Thus, given a document d: Score(d) = P(R|d) [1]. According to this, if the relevance set were known exactly, R's members would have probability 1 (the maximum score) and all other documents would have probability 0.
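Equation [1] displayed, together with the limiting case the slide describes (R known exactly):

\text{Score}(d) = P(R \mid d), \qquad \text{with } P(R \mid d) = 1 \text{ if } d \in R \text{ and } 0 \text{ otherwise when } R \text{ is known.}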

4 Simplification
From [1], take the ratio of the probability that the document is in R to the probability that it is not in R. This retains the original ordering, while also factoring in the elements of D that lie outside R (the equation is reconstructed below).
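A reconstruction of the missing equation, in the standard odds form:

\text{Score}(d) = \frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(R \mid d)}{1 - P(R \mid d)}

Because x / (1 - x) is monotonically increasing on [0, 1), ranking by these odds gives the same ordering as ranking by P(R \mid d) itself.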

5 Applying Bayes' Theorem
Simplify as follows (the equation is reconstructed below):
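The simplification itself is missing from the transcript; the standard derivation it refers to applies Bayes' theorem to the numerator and denominator of the odds:

\frac{P(R \mid d)}{P(\bar{R} \mid d)} = \frac{P(d \mid R)\,P(R)}{P(d \mid \bar{R})\,P(\bar{R})} \;\propto\; \frac{P(d \mid R)}{P(d \mid \bar{R})}

The factor P(R) / P(\bar{R}) is the same for every document, so it can be dropped without changing the ranking.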

6 Observations
This ratio forms the scoring function. The equation still contains R, which we do not know; even so, ranking documents by this equation yields the same ordering as ranking by P(R|d).

7 Derivation for Keyword Queries
Now assume the query is a vector of words, with zero probability assigned to any word that does not occur in a document. Applying the previous equation to each word w (instead of to the whole document) and combining the contributions of all the query words gives the expression reconstructed below.
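The combined expression is not in the transcript; under the usual term-independence assumption it takes the form

\text{Score}(d) \;\propto\; \prod_{w \in Q \cap d} \frac{P(w \mid R)}{P(w \mid \bar{R})}

a product of the per-word odds over the query words that occur in d (the slide's exact notation may differ).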

8 Search for “Microsoft Corporation”
For this two-word query the score is the product of the per-word factors. Assume two documents:
D1: contains 'Microsoft' but not 'Corporation'
D2: contains 'Corporation' but not 'Microsoft'
The resulting scores are reconstructed below.
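Reconstructed expressions (the slide's equation images are missing). Since each document contains only one of the two query words:

\text{Score}(D_1) \;\propto\; \frac{P(\text{Microsoft} \mid R)}{P(\text{Microsoft} \mid \bar{R})} \qquad \text{Score}(D_2) \;\propto\; \frac{P(\text{Corporation} \mid R)}{P(\text{Corporation} \mid \bar{R})}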

9 Search for “Microsoft Corporation”
Because 'Corporation' is more common in the database D, P(Corporation|D) will be far higher than P(Microsoft|D). Since these collection probabilities stand in for the denominators (a word's probability in non-relevant documents), the 'Corporation' factor is smaller, and Score(D1) is higher than Score(D2). Thus the document containing 'Microsoft' gets the higher ranking, because that word is more specific than 'Corporation'. This is similar to relevance ranking in the vector space model, where rarer terms carry more weight. A small numeric sketch follows.
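A minimal numeric sketch of this argument in Python, with made-up word frequencies standing in for P(w|D) (used here as the estimate of the word's probability in non-relevant documents) and an assumed equal P(w|R) for both terms:

# Hypothetical frequencies: 'corporation' is far more common in the database D
# than 'microsoft'. Both values are invented for illustration.
p_word_in_db = {"microsoft": 0.01, "corporation": 0.20}
# Assume, for simplicity, the same likelihood of each word in relevant documents.
p_word_in_rel = {"microsoft": 0.9, "corporation": 0.9}

def score(words_in_doc):
    # Product of per-word odds P(w|R) / P(w|D) over query words present in the document.
    s = 1.0
    for w in words_in_doc:
        s *= p_word_in_rel[w] / p_word_in_db[w]
    return s

print(score(["microsoft"]))    # D1: 90.0  (0.9 / 0.01)
print(score(["corporation"]))  # D2: 4.5   (0.9 / 0.20) -> D1 ranks higher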

10 Relevance Feedback
R can be fine-tuned by collecting user feedback on the initial rankings. Once a better estimate of R is known, better scoring and ranking of matches is possible; one common re-estimation scheme is sketched below.
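One common way to exploit such feedback (a sketch, not necessarily what the slide intends) is to re-estimate the per-word probabilities from the documents the user has judged, using add-0.5 smoothing to avoid zero counts:

def reestimate(word, judged_relevant, judged_nonrelevant):
    # judged_relevant / judged_nonrelevant: documents the user has marked,
    # each represented here as a set of words.
    r = sum(word in doc for doc in judged_relevant)
    n = sum(word in doc for doc in judged_nonrelevant)
    # Add-0.5 smoothing keeps the estimates away from exactly 0 or 1.
    p_word_given_rel = (r + 0.5) / (len(judged_relevant) + 1.0)
    p_word_given_nonrel = (n + 0.5) / (len(judged_nonrelevant) + 1.0)
    return p_word_given_rel, p_word_given_nonrel

# Example: two results judged relevant, one judged non-relevant.
rel = [{"microsoft", "corporation"}, {"microsoft"}]
nonrel = [{"corporation"}]
print(reestimate("microsoft", rel, nonrel))  # (0.8333..., 0.25)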

11 PIR Applied to Databases
PIR was originally applied to documents, not to databases. Applying it to databases is not easy, because several aspects are hard to capture. One is the different values an attribute can take: PIR is based on the words in a document, and if a car in a database is blue, black, etc., that is not as easily captured. Would you assign each color as a keyword? Another is deciding what to sacrifice in the ranking: if a user's preference is black cars, how is PIR applied when listing results that do not match entirely?

