CS 430: Information Discovery


CS 430: Information Discovery Lecture 10 Probabilistic Information Retrieval

Course Administration

Assignment 1
• You should receive results by email.

Assignment 2
• Due date changed to October 18.

Three Approaches to Information Retrieval

Many authors divide the methods of information retrieval into three categories:

• Boolean (based on set theory)
• Vector space (based on linear algebra)
• Probabilistic (based on Bayesian statistics)

In practice, the latter two have considerable overlap.

Probability Ranking Principle "If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." W.S. Cooper

Probabilistic Ranking Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen

Probability Theory -- Bayesian Formulas

Notation: Let a, b be two events. P(a | b) is the probability of a given b. ā is the event "not a".

Bayes Theorem:

P(a | b) = P(b | a) P(a) / P(b)

Derivation:

P(a | b) P(b) = P(a ∧ b) = P(b | a) P(a)

Example of Bayes Theorem

Let a be the event "weight over 200 lb" and b the event "height over 6 ft". Divide the population into four regions: A (over 6 ft only), B (neither), C (over 200 lb only), and D (both), so that D is P(a ∧ b).

P(a | b) = D / (A + D) = D / P(b)
P(b | a) = D / (D + C) = D / P(a)
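The identities above can be checked numerically; the population counts below are hypothetical, chosen only to illustrate the slide's example:

```python
# Hypothetical counts for the four regions of the slide's example:
# A = over 6 ft only, B = neither, C = over 200 lb only, D = both.
A, B, C, D = 20, 50, 15, 15
N = A + B + C + D

p_a = (C + D) / N          # P(a): weight over 200 lb
p_b = (A + D) / N          # P(b): height over 6 ft
p_ab = D / N               # P(a and b)

p_a_given_b = p_ab / p_b   # = D / (A + D)
p_b_given_a = p_ab / p_a   # = D / (D + C)

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
```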

Concept

R is a set of documents that are guessed to be relevant; R̄ is the complement of R.

1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of approximations to R.

Probabilistic Principle The probability that a document is relevant to a query is assumed to depend only on the terms in the query and the terms used to index the document. Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant. The ideal answer set is labeled R, which is the set that maximizes the overall probability of relevance.

Probabilistic Principle

Given a user query q and a document dj, the model estimates the probability that the user finds dj relevant, i.e., P(R | dj).

similarity (dj, q) = P(R | dj)
                   = P(dj | R) P(R) / P(dj)    (by Bayes Theorem)
                   = P(dj | R) x constant

P(dj | R) is the probability of randomly selecting dj from R. P(R) is the same for every document, and P(dj) is assumed uniform, so both fold into the constant.

Binary Independence Retrieval Model (BIR)

Suppose that the weights for term i in document dj and query q are wi,j and wi,q, where all weights are 0 or 1.

Let P(ki | R) be the probability that index term ki is present in a document randomly selected from the set R, and P(ki | R̄) the corresponding probability for R̄.

If the index terms are independent, after some mathematical manipulation, taking logs and ignoring factors that are constant for all documents:

similarity (dj, q) = Σi wi,q x wi,j x ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | R̄)) / P(ki | R̄) ] )
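A minimal sketch of this formula, assuming binary weights are represented as term sets and that probability estimates exist for every term shared by document and query (the function name is mine, not from the lecture):

```python
import math

def bir_similarity(doc_terms, query_terms, p_rel, p_nonrel):
    """Binary Independence Retrieval score (a sketch of the slide's formula).

    doc_terms, query_terms: sets of index terms (binary weights).
    p_rel[k]    ~ P(k | R):     probability term k occurs in a relevant doc.
    p_nonrel[k] ~ P(k | R-bar): probability term k occurs in a non-relevant doc.
    """
    score = 0.0
    # wi,q * wi,j is 1 only for terms present in both document and query.
    for k in doc_terms & query_terms:
        score += math.log(p_rel[k] / (1 - p_rel[k])) \
               + math.log((1 - p_nonrel[k]) / p_nonrel[k])
    return score
```

With the initial guess P(k | R) = 0.5, the first log term vanishes and the score reduces to an IDF-like sum over the second term.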

Estimates of P(ki | R)

Initial guess, with no information to work from:

P(ki | R) = c
P(ki | R̄) = ni / N

where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain ki
N is the total number of documents in the collection

Improving the Estimates of P(ki | R)

Human feedback -- relevance feedback

Automatically:
(a) Run query q using the initial values. Consider the t top-ranked documents. Let r be the number of these documents that contain the term ki.
(b) The new estimates are:

P(ki | R) = r / t
P(ki | R̄) = (ni - r) / (N - t)

Note: The ratio of these two terms, with minor changes of notation and taking logs, gives w2 on page 368 of Frakes.

Continuation

similarity (dj, q) = Σi wi,q x wi,j x ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | R̄)) / P(ki | R̄) ] )

                   = Σi wi,q x wi,j x ( log [ r / (t - r) ] + log [ (N + r - t - ni) / (ni - r) ] )

                   = Σi wi,q x wi,j x log { [ r / (t - r) ] / [ (ni - r) / (N + r - t - ni) ] }

Note: With a minor change of notation, this is w4 on page 368 of Frakes.
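The re-estimation step can be sketched as follows (`rsj_weight` is a name chosen here; the sketch assumes 0 < r < t and r < ni so that all logs are defined):

```python
import math

def rsj_weight(r, t, n_i, N):
    """Per-term weight after pseudo relevance feedback (sketch of the
    derivation above).

    r   : top-ranked documents containing the term
    t   : number of top-ranked documents (treated as relevant)
    n_i : documents in the collection containing the term
    N   : total documents in the collection
    """
    p_rel = r / t                   # P(ki | R)     = r / t
    p_non = (n_i - r) / (N - t)     # P(ki | R-bar) = (ni - r) / (N - t)
    return math.log(p_rel / (1 - p_rel)) \
         + math.log((1 - p_non) / p_non)
```

Algebraically this equals log [r / (t - r)] + log [(N + r - t - ni) / (ni - r)], the combined form on the slide.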

Probabilistic Weighting

N number of documents in the collection
R number of relevant documents for query q
n number of documents with term t
r number of relevant documents with term t

w = log [ ( r / (R - r) ) / ( (n - r) / (N - R) ) ]

The numerator is the number of relevant documents with term t over the number of relevant documents without term t; the denominator is the number of non-relevant documents with term t over the number of non-relevant documents in the collection.

Discussion of Probabilistic Model

Advantages
• Based on a firm theoretical foundation

Disadvantages
• The initial definition of R has to be guessed.
• Weights ignore term frequency.
• Assumes independent index terms (as does the vector model).

Review of Weighting

The objective is to measure the similarity between a document and a query using statistical (not linguistic) methods. The concept is to weight terms by some factor based on the distribution of terms within and between documents. In general:

(a) Weight is an increasing function of the number of times that the term appears in the document.
(b) Weight is a decreasing function of the number of documents that contain the term (or of the total number of occurrences of the term).
(c) Weight needs to be adjusted for documents that differ greatly in length.

Normalization of Within Document Frequency (Term Frequency)

Normalization to moderate the effect of high-frequency terms.

Croft's normalization: cfij = K + (1 - K) fij / mi    (fij > 0)

fij is the frequency of term j in document i
cfij is Croft's normalized frequency
mi is the maximum frequency of any term in document i
K is a constant between 0 and 1 that is adjusted for the collection

K should be set to low values (e.g., 0.3) for collections with long documents (35 or more terms). K should be set to higher values (greater than 0.5) for collections with short documents.

Normalization of Within Document Frequency (Term Frequency) -- Examples

Croft's normalization: cfij = K + (1 - K) fij / mi    (fij > 0)

document length    K     mi    weight (most      weight (least
                               frequent term)    frequent term)
20                 0.3    5    1.00              0.44
20                 0.3    2    1.00              0.65
100                0.5   25    1.00              0.52
100                0.5    2    1.00              0.75
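The table can be reproduced directly from the formula; this is a minimal sketch (`croft_tf` is a name chosen here), where the least frequent term is assumed to occur once:

```python
def croft_tf(f, m, K):
    """Croft's normalized within-document frequency:
    cf = K + (1 - K) * f / m for f > 0, else 0."""
    return K + (1 - K) * f / m if f > 0 else 0.0

# Reproduce the slide's table.
assert round(croft_tf(5, 5, 0.3), 2) == 1.00    # most frequent term, m = 5
assert round(croft_tf(1, 5, 0.3), 2) == 0.44
assert round(croft_tf(1, 2, 0.3), 2) == 0.65
assert round(croft_tf(1, 25, 0.5), 2) == 0.52
assert round(croft_tf(1, 2, 0.5), 2) == 0.75
```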

Inverse Document Frequency (IDF)

(a) Simplest to use is 1 / dk (Salton), where dk is the number of documents that contain term k.

(b) Normalized forms (Sparck Jones):

IDFi = log2 (N / ni) + 1
or
IDFi = log2 (maxn / ni) + 1

N number of documents in the collection
ni total number of occurrences of term i in the collection
maxn maximum frequency of any term in the collection
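The first normalized form is a one-liner; a sketch (the function name is mine):

```python
import math

def idf(N, n_i):
    """Sparck Jones normalized IDF from the slide: log2(N / n_i) + 1."""
    return math.log2(N / n_i) + 1

# A term occurring in every document gets the minimum weight 1;
# rarer terms get progressively larger weights.
```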

Measures of Within Document Frequency

(c) Salton and Buckley recommend using different weightings for documents and queries:

documents:
fik for terms in collections of long documents
1 for terms in collections of short documents

queries:
cfik with K = 0.5 for general use
fik for long queries (i.e., cfik with K = 0)

Ranking -- Practical Experience

1. The basic method is the inner (dot) product with no weighting.
2. Cosine (dividing by the product of vector lengths) normalizes for vectors of different lengths.
3. Term weighting using the frequency of terms in a document usually improves ranking.
4. Term weighting using an inverse function of terms in the entire collection (e.g., IDF) improves ranking.
5. Weightings for document structure improve ranking.
6. Relevance weightings after initial retrieval improve ranking.

The effectiveness of these methods depends on the characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
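Points 1 and 2 can be sketched in a few lines; this is a minimal illustration (the term-count representation and function name are mine):

```python
import math
from collections import Counter

def cosine(doc_terms, query_terms):
    """Inner product of two term-frequency vectors, divided by the
    product of their lengths (points 1 and 2 above)."""
    d, q = Counter(doc_terms), Counter(query_terms)
    dot = sum(d[t] * q[t] for t in q)            # inner product
    norm = math.sqrt(sum(v * v for v in d.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

For example, cosine(['cat', 'cat', 'dog'], ['cat']) is 2 / √5 ≈ 0.894, while identical vectors score 1.0.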

Latent Semantic Indexing Objective Replace indexes that use sets of index terms by indexes that use concepts. Approach Map the index term vector space into a lower dimensional space, using singular value decomposition.

The index term vector space

The space has as many dimensions as there are terms in the word list. [Figure: two document vectors d1 and d2 plotted against term axes t1 and t2, separated by an angle θ.]

Mathematical concepts

Vector space theory (Singular Value Decomposition)

Define X as the term-document matrix, with t rows (number of index terms) and n columns (number of documents). There exist matrices T, S and D', such that:

X = T S D'

T is the matrix of eigenvectors of XX'
D is the matrix of eigenvectors of X'X (D' is its transpose)
S is an r x r diagonal matrix, where r is the rank of X, usually the smaller of t and n, and every element of S is non-negative.

Reduction of dimension

Select the s largest elements of S and the corresponding columns of T and rows of D'. This gives a reduced matrix:

Xs = Ts Ss Ds'

It is claimed that the rows of this matrix represent concepts. Therefore calculation of the similarity between a query expressed in this space and a document is more effective than in the index term vector space.
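The decomposition and truncation can be sketched with NumPy; the matrix values below are hypothetical, and numpy.linalg.svd returns D' directly as its third result:

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix X (t = 4 rows, n = 3 columns).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# X = T S D': T holds eigenvectors of XX', D holds eigenvectors of X'X.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values: Xs = Ts Ss Ds'.
k = 2
Xk = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# Xk is the best rank-k least-squares approximation of X; similarities
# are then computed in this reduced space, not the full term space.
```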