
1 CS 430: Information Discovery
Lecture 10 Probabilistic Information Retrieval

2 Course Administration
Assignment 1 • You should receive results by
Assignment 2 • Due date changed to October 18

3 Three Approaches to Information Retrieval
Many authors divide the methods of information retrieval into three categories:
• Boolean (based on set theory)
• Vector space (based on linear algebra)
• Probabilistic (based on Bayesian statistics)
In practice, the latter two have considerable overlap.

4 Probability Ranking Principle
"If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately a possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." W.S. Cooper

5 Probabilistic Ranking
Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen

6 Probability Theory -- Bayesian Formulas
Notation: let a and b be two events; P(a | b) is the probability of a given b, and ā is the event not a.
Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
Derivation: P(a | b) P(b) = P(a ∧ b) = P(b | a) P(a); dividing both sides by P(b) gives the theorem.

7 Example of Bayes Theorem
Let a be the event "weight over 200 lb" and b the event "height over 6 ft". Divide the population into four groups: A (over 6 ft only), B (neither), C (over 200 lb only) and D (both over 200 lb and over 6 ft), so that D is P(a ∧ b). Then P(a | b) = D / (A + D) = D / P(b) and P(b | a) = D / (D + C) = D / P(a).
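A minimal sketch of this calculation, assuming made-up proportions for the four groups (the numbers are illustrative, not from the lecture):

```python
# Hypothetical proportions of a population, split by the two events:
# a = "weight over 200 lb", b = "height over 6 ft".
A = 0.10  # over 6 ft but not over 200 lb
B = 0.75  # neither
C = 0.08  # over 200 lb but not over 6 ft
D = 0.07  # both, i.e. P(a and b)

p_a = C + D   # P(a)
p_b = A + D   # P(b)

p_a_given_b = D / p_b   # P(a | b) = D / (A + D)
p_b_given_a = D / p_a   # P(b | a) = D / (D + C)

# Bayes Theorem recovers one conditional probability from the other.
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b, p_b_given_a)
```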

8 Concept
R is a set of documents that are guessed to be relevant and R̄ is the complement of R.
1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of approximations to R.

9 Probabilistic Principle
The probability that a document is relevant to a query is assumed to depend only on the terms in the query and the terms used to index the document. Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant. The ideal answer set, labeled R, is the set that maximizes the overall probability of relevance.

10 Probabilistic Principle
Given a user query q and a document dj, the model estimates the probability that the user finds dj relevant, i.e., P(R | dj).
similarity (dj, q) = P(R | dj) / P(R̄ | dj)
= [P(dj | R) P(R)] / [P(dj | R̄) P(R̄)]   (by Bayes Theorem)
= [P(dj | R) / P(dj | R̄)] x constant
P(dj | R) is the probability of randomly selecting dj from R.

11 Binary Independence Retrieval Model (BIR)
Suppose that the weights for term i in document dj and query q are wi,j and wi,q, where all weights are 0 or 1. Let P(ki | R) be the probability that index term ki is present in a document randomly selected from the set R, and P(ki | R̄) the corresponding probability for R̄. If the index terms are independent, after some mathematical manipulation, taking logs and ignoring factors that are constant for all documents:
similarity (dj, q) = Σi wi,q x wi,j x ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | R̄)) / P(ki | R̄)] )
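A small sketch of the ranking formula above, using binary term weights and externally supplied probability estimates; the function and variable names are illustrative, not from the lecture:

```python
import math

def bir_similarity(doc_terms, query_terms, p_rel, p_nonrel):
    """Binary Independence Retrieval score for one document.

    doc_terms, query_terms: sets of index terms (binary weights).
    p_rel[k]:    estimate of P(k | R), term present in a relevant document.
    p_nonrel[k]: estimate of P(k | R-bar), term present in a non-relevant document.
    """
    score = 0.0
    for k in query_terms & doc_terms:   # wi,q x wi,j is 1 only for shared terms
        pr, pn = p_rel[k], p_nonrel[k]
        score += math.log(pr / (1.0 - pr)) + math.log((1.0 - pn) / pn)
    return score

# Toy example with made-up probability estimates.
p_rel = {"probabilistic": 0.8, "retrieval": 0.6}
p_nonrel = {"probabilistic": 0.2, "retrieval": 0.4}
doc = {"probabilistic", "retrieval", "model"}
query = {"probabilistic", "retrieval"}
print(bir_similarity(doc, query, p_rel, p_nonrel))
```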

12 Estimates of P(ki | R)
Initial guess, with no information to work from:
P(ki | R) = c
P(ki | R̄) = ni / N
where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain ki
N is the total number of documents in the collection

13 Improving the Estimates of P(ki | R)
Human feedback -- relevance feedback
Automatically:
(a) Run query q using the initial values. Consider the t top-ranked documents. Let r be the number of these documents that contain the term ki.
(b) The new estimates are:
P(ki | R) = r / t
P(ki | R̄) = (ni - r) / (N - t)
Note: The ratio of these two terms, with minor changes of notation and taking logs, gives w2 on page 368 of Frakes.
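A sketch of the automatic re-estimation step, assuming a first ranking is already available; the counts and names are illustrative:

```python
def reestimate(N, n_i, t, r):
    """Update the probability estimates for one term k_i.

    N   : total number of documents in the collection
    n_i : number of documents containing k_i
    t   : number of top-ranked documents treated as relevant
    r   : number of those top-ranked documents containing k_i
    """
    p_ki_R = r / t                    # P(k_i | R)
    p_ki_notR = (n_i - r) / (N - t)   # P(k_i | R-bar)
    return p_ki_R, p_ki_notR

# e.g. 1000 documents, 100 contain the term,
# 10 top-ranked documents inspected, 6 of them contain the term.
print(reestimate(N=1000, n_i=100, t=10, r=6))
```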

14 Continuation
similarity (dj, q)
= Σi wi,q x wi,j x ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | R̄)) / P(ki | R̄)] )
= Σi wi,q x wi,j x ( log [r / (t - r)] + log [(N - r) / (N + r - t - ni)] )
= Σi wi,q x wi,j x log [ {r / (t - r)} / {(N + r - t - ni) / (N - r)} ]
Note: With a minor change of notation, this is w4 on page 368 of Frakes.

15 Probabilistic Weighting
N  number of documents in the collection
R  number of relevant documents for query q
n  number of documents with term t
r  number of relevant documents with term t
w = log [ {r / (R - r)} / {(n - r) / (N - R)} ]
where r / (R - r) is the number of relevant documents with term t divided by the number of relevant documents without term t, and (n - r) / (N - R) is the number of non-relevant documents with term t divided by the number of non-relevant documents in the collection.
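A one-line check of this weight with made-up counts (N, R, n, r are as defined on this slide; the values are illustrative):

```python
import math

N, R, n, r = 1000, 20, 100, 15   # illustrative counts only
w = math.log((r / (R - r)) / ((n - r) / (N - R)))
print(w)   # larger when the term is concentrated in the relevant documents
```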

16 Discussion of Probabilistic Model
Advantages
• Based on a firm theoretical foundation
Disadvantages
• The initial definition of R has to be guessed.
• Weights ignore term frequency.
• Assumes independent index terms (as does the vector model).

17 Review of Weighting
The objective is to measure the similarity between a document and a query using statistical (not linguistic) methods. The concept is to weight terms by some factor based on the distribution of terms within and between documents. In general:
(a) Weight is an increasing function of the number of times that the term appears in the document.
(b) Weight is a decreasing function of the number of documents that contain the term (or of the total number of occurrences of the term).
(c) Weight needs to be adjusted for documents that differ greatly in length.

18 Normalization of Within Document Frequency (Term Frequency)
Normalization to moderate the effect of high-frequency terms.
Croft's normalization: cfij = K + (1 - K) fij / mi   (fij > 0)
fij is the frequency of term j in document i
cfij is Croft's normalized frequency
mi is the maximum frequency of any term in document i
K is a constant between 0 and 1 that is adjusted for the collection
K should be set to low values (e.g., 0.3) for collections with long documents (35 or more terms), and to higher values (greater than 0.5) for collections with short documents.
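A minimal sketch of Croft's normalized frequency as defined above (the parameter names are illustrative):

```python
def croft_tf(f_ij, m_i, K=0.3):
    """Croft's normalized within-document frequency.

    f_ij : frequency of term j in document i (must be > 0)
    m_i  : maximum frequency of any term in document i
    K    : collection-dependent constant between 0 and 1
    """
    return K + (1 - K) * f_ij / m_i

# The most frequent term always gets weight 1; the rarest terms approach K.
print(croft_tf(10, 10, K=0.3), croft_tf(1, 10, K=0.3))
```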

19 Normalization of Within Document Frequency (Term Frequency)
Examples of Croft's normalization, cfij = K + (1 - K) fij / mi (fij > 0), tabulated by document length, K, mi, weight of the most frequent term and weight of the least frequent term. (The numeric rows of the table are not preserved in this transcript.)
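Since the original numbers are not preserved, here is a hedged reconstruction with assumed values, showing how K and mi determine the weights of the most and least frequent terms:

```python
def croft_tf(f_ij, m_i, K):
    return K + (1 - K) * f_ij / m_i

# Assumed example documents: (label, K, m_i). The values are illustrative only.
examples = [("long document", 0.3, 35), ("short document", 0.5, 5)]
for label, K, m_i in examples:
    most = croft_tf(m_i, m_i, K)   # most frequent term: f_ij = m_i, weight = 1
    least = croft_tf(1, m_i, K)    # least frequent term: f_ij = 1, weight near K
    print(f"{label}: K={K}, mi={m_i}, most={most:.2f}, least={least:.2f}")
```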

20 Inverse Document Frequency (IDF)
(a) Simplest to use is 1 / dk (Salton), where dk is the number of documents that contain term k.
(b) Normalized forms:
IDFi = log2 (N / ni) + 1, or
IDFi = log2 (maxn / ni)   (Sparck Jones)
N is the number of documents in the collection
ni is the total number of occurrences of term i in the collection
maxn is the maximum frequency of any term in the collection
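A sketch of the two normalized IDF forms listed above (the counts are illustrative):

```python
import math

def idf_collection(N, n_i):
    """IDF_i = log2(N / n_i) + 1, using the collection size N."""
    return math.log2(N / n_i) + 1

def idf_maxn(maxn, n_i):
    """IDF_i = log2(maxn / n_i), using the most frequent term's count."""
    return math.log2(maxn / n_i)

# e.g. 10,000 documents; term i occurs 100 times in the collection;
# the most frequent term occurs 2,000 times.
print(idf_collection(10_000, 100), idf_maxn(2_000, 100))
```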

21 Measures of Within Document Frequency
(c) Salton and Buckley recommend using different weightings for documents and queries:
documents: fik for terms in collections of long documents; 1 for terms in collections of short documents
queries: cfik with K = 0.5 for general use; fik for long queries (i.e., cfik with K = 0)

22 Ranking -- Practical Experience
1. The basic method is the inner (dot) product with no weighting.
2. The cosine measure (dividing by the product of the vector lengths) normalizes for vectors of different lengths.
3. Term weighting using the frequency of terms in the document usually improves ranking.
4. Term weighting using an inverse function of terms in the entire collection (e.g., IDF) improves ranking.
5. Weightings for document structure improve ranking.
6. Relevance weightings after initial retrieval improve ranking.
The effectiveness of these methods depends on the characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.

23 Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the index term vector space into a lower-dimensional space, using singular value decomposition.

24 The index term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: two document vectors d1 and d2 shown in a two-dimensional space with term axes t1 and t2.]

25 Mathematical concepts
Vector space theory (Singular Value Decomposition)
Define X as the term-document matrix, with t rows (number of index terms) and n columns (number of documents). There exist matrices T, S and D' such that:
X = T S D'
T is the matrix of eigenvectors of XX'
D is the matrix of eigenvectors of X'X (D' is its transpose)
S is an r x r diagonal matrix, where r is the rank of X, usually the smaller of t and n, and every element of S is non-negative.

26 Reduction of dimension
Select the s largest elements of S and the corresponding columns of T and D'. This gives a reduced matrix: Xs = Ts Ss Ds'. It is claimed that the rows of this matrix represent concepts. Therefore, calculating the similarity between a query expressed in this space and a document is more effective than in the index term vector space.
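A minimal numpy sketch of the decomposition and reduction described on the last two slides, using a tiny made-up term-document matrix; the query folding and cosine comparison at the end follow one common LSI convention and are illustrative, not part of the lecture:

```python
import numpy as np

# Toy term-document matrix X: t = 4 index terms (rows), n = 3 documents (columns).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Full decomposition X = T S D' (numpy returns T, the diagonal of S, and D').
T, s, Dt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)
assert np.allclose(X, T @ S @ Dt)

# Keep only the s_dim largest singular values (the "concept" dimensions).
s_dim = 2
Ts, Ss, Dts = T[:, :s_dim], S[:s_dim, :s_dim], Dt[:s_dim, :]
Xs = Ts @ Ss @ Dts              # best rank-s approximation to X

# Documents and a folded-in query, represented in the reduced space,
# compared with cosine similarity.
docs_reduced = Ss @ Dts                     # one column per document
q = np.array([1.0, 0.0, 1.0, 0.0])          # query using terms 1 and 3
q_reduced = Ts.T @ q
sims = (docs_reduced.T @ q_reduced) / (
    np.linalg.norm(docs_reduced, axis=0) * np.linalg.norm(q_reduced))
print(sims)
```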

