Chapter 7 Retrieval Models

Retrieval Models
• Provide a mathematical framework for defining the search process
• Include an explanation of assumptions
• Are the basis of many ranking algorithms
• Progress in retrieval models has corresponded with improvements in effectiveness
• Theories about relevance

Relevance
• A complex concept that has been studied for some time
• Many factors to consider
• People often disagree when making relevance judgments
• Retrieval models make various assumptions about relevance to simplify the problem
  • e.g., topical vs. user relevance
  • e.g., binary vs. multi-valued relevance

Retrieval Model Overview
• Older models
  • Boolean retrieval
  • Vector Space model
• Probabilistic models
  • BM25
  • Language models
• Combining evidence
  • Inference networks
  • Learning to Rank

Searching Based on Retrieved Documents
• Sequence of queries driven by the number of retrieved documents
• Example. A "Lincoln" search of news articles:
  • president AND Lincoln
  • president AND Lincoln AND NOT (automobile OR car)
  • president AND Lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
  • president AND Lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)

Boolean Retrieval
• Two possible outcomes for query processing
  • TRUE and FALSE
• "Exact-match" retrieval
  • No ranking at all
• Query usually specified using Boolean operators
  • AND, OR, NOT

Boolean Retrieval
• Advantages
  • Results are predictable and relatively easy to explain
  • Many different document features, such as metadata (e.g., type/date), can be incorporated
  • Efficient processing, since many documents can be eliminated from the search
• Disadvantages
  • Simple queries usually don't work well
  • Complex queries are difficult to construct
  • Effectiveness depends entirely on the user

Vector Space Model
• Simple and intuitively appealing
• Documents and queries are represented by vectors of term weights
• A collection is represented by a matrix of term weights

Vector Space Model
• [Figure: vector representation of stemmed documents without stopwords]

Vector Space Model
• 3-D pictures are useful, but can be misleading for high-dimensional spaces

Vector Space Model
• Documents are ranked by the difference between the points representing the query and each document
  • Using a similarity measure rather than a distance or dissimilarity measure
  • e.g., cosine correlation

Similarity Calculation
• Consider two documents D1 and D2, and a query Q
  • D1 = (0.5, 0.8, 0.3)
  • D2 = (0.9, 0.4, 0.2)
  • Q = (1.5, 1.0, 0)
• Cosine(D1, Q) ≈ 0.87 and Cosine(D2, Q) ≈ 0.97, so D2 is more similar to Q than D1
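A minimal Python sketch of this calculation (the helper name is ours; the vectors are the ones from the example above):

```python
import math

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0)

print(cosine(D1, Q))  # ~0.87
print(cosine(D2, Q))  # ~0.97, so D2 ranks above D1
```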

Term Weights
• TF-IDF weight
• Term frequency weight measures importance in the document; one common form normalizes by document length:
  tf_ik = f_ik / Σ_j f_ij
• Inverse document frequency measures importance in the collection:
  idf_k = log(N / n_k)
• Some heuristic modifications are common, e.g., adding a constant to ensure non-zero weights
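A sketch of one common tf-idf variant in Python (the exact heuristic varies between systems; the names are ours):

```python
import math

def tf_idf(f_ik, doc_len, N, n_k):
    """f_ik: occurrences of term k in document i; doc_len: total term count of
    document i; N: number of documents; n_k: documents containing term k."""
    tf = f_ik / doc_len      # importance of the term within the document
    idf = math.log(N / n_k)  # importance of the term within the collection
    return tf * idf

# A term occurring 15 times in a 1,800-word document and appearing
# in 40,000 of 500,000 documents:
print(tf_idf(15, 1800, 500_000, 40_000))  # ~0.021
```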

Vector Space Model
• Advantages
  • Simple computational framework for ranking
  • Any similarity measure or term weighting scheme can be used
• Disadvantages
  • Assumption of term independence
  • No assumption about whether relevance is binary or multi-valued

Probabilistic Models
• According to [Greiff 02]:
  • In probabilistic approaches to IR, the occurrence of a query term in a document D contributes to the probability that D will be judged relevant
  • The weight assigned to a query term should be based on the expected value of that contribution

[Greiff 02] Greiff, et al. The Role of Variance in Term Weighting for Probabilistic Information Retrieval. In Proc. of Intl. Conf. on Information and Knowledge Management (CIKM).

IR as Classification
• [Figure: a document D is assigned to the relevant or non-relevant class using the Bayes decision rule, based on the conditional probabilities P(R | D) and P(NR | D)]

Bayes Classifier
• Bayes decision rule
  • A document D is relevant if P(R | D) > P(NR | D), where P(R | D) and P(NR | D) are conditional probabilities
• Estimating probabilities
  • Use Bayes' Rule: P(R | D) = P(D | R) P(R) / P(D), where P(D | R) is based on the probability of occurrence of words in D that are in R, and P(R) is the prior probability of relevance
  • Classify a document as relevant if P(D | R) / P(D | NR) > P(NR) / P(R); the L.H.S. is the likelihood ratio of D being relevant

Estimating P(D | R)
• Assume word independence (Naïve Bayes assumption) and use individual term/word (d_i) probabilities
• Binary (weights in doc) independence (of words) model
  • A document is represented by a vector of binary features indicating term occurrence (or non-occurrence)
  • p_i is the probability that term i occurs in a relevant document; s_i is the probability of occurrence of i in a non-relevant document
• Example. Given a document representation (1, 0, 0, 1, 1), the probability of the document occurring in the relevant set is p_1 × (1 - p_2) × (1 - p_3) × p_4 × p_5

Binary Independence Model
• Computing the likelihood ratio of D using p_i and s_i:

  P(D | R) / P(D | NR) = Π_{i: d_i = 1} (p_i / s_i) × Π_{i: d_i = 0} ((1 - p_i) / (1 - s_i))
                       = Π_{i: d_i = 1} [p_i (1 - s_i) / (s_i (1 - p_i))] × Π_i ((1 - p_i) / (1 - s_i))

• The first product is over words i in D (probability of i in R over probability of i in NR); the second, over words i not in D, is rearranged so that the remaining factor is the same for all documents and can be ignored for ranking

Binary Independence Model
• The scoring function is
  Σ_{i: d_i = q_i = 1} log [p_i (1 - s_i) / (s_i (1 - p_i))]
  • Using logs to avoid multiplying lots of small numbers
• If there is no other information on the relevant set, i.e., neither R nor NR is known, then
  • p_i is set to a constant (= 0.5)
  • s_i is approximated as n_i / N, where n_i is the number of documents in the collection of N documents that include i
  • the resulting term weight, log((N - n_i) / n_i), is similar to the IDF-like weight
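A minimal sketch of this scoring function with no relevance information, so p_i = 0.5 and s_i = n_i / N (the function and variable names are ours):

```python
import math

def bim_score(query_terms, doc_terms, doc_freq, N):
    """Binary independence model score with no relevance information:
    each query term matching the document contributes log((N - n_i) / n_i)."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            n_i = doc_freq[t]
            score += math.log((N - n_i) / n_i)
    return score

doc_freq = {"president": 40_000, "lincoln": 300}
doc_terms = {"president", "lincoln", "biography"}
print(bim_score(["president", "lincoln"], doc_terms, doc_freq, 500_000))  # ~9.86
```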

Contingency Table, if (Non-)Relevant Sets Available
• Gives the scoring function:

  Σ_i log [ ((r_i + 0.5) / (R - r_i + 0.5)) / ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)) ]

  where
  • r_i = number of relevant documents containing term i
  • n_i = number of documents in the collection containing i
  • N = number of documents in the collection
  • R = number of relevant documents in the collection

BM25 Ranking Algorithm
• A ranking algorithm based on probabilistic arguments and experimental validation, but it is not a formal model
• Adds document D and query Q term weights:

  Σ_{i ∈ Q} log [ ((r_i + 0.5) / (R - r_i + 0.5)) / ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)) ] × ((k_1 + 1) f_i / (K + f_i)) × ((k_2 + 1) qf_i / (k_2 + qf_i))

• f_i is the frequency of term i in a document D
• k_1, a constant, plays the role of tf along with f_i
• qf_i is the frequency of term i in query Q, usually much lower than f_i
• k_2 plays the same role as k_1 but is applied to the query term weight qf_i
• K = k_1 ((1 - b) + b · dl / avdl), where dl is the document length and avdl is the average length of a doc, regulated by b (0 ≤ b ≤ 1)
• k_1, k_2, and b are set empirically, e.g., to 1.2, 100, and 0.75, respectively

BM25 Example
• Given a query with two terms, "president lincoln" (qf_i = 1 for each)
• No relevance information (r_i and R are zero)
• N = 500,000 documents
• "president" occurs in 40,000 documents (n_1 = 40,000)
• "lincoln" occurs in 300 documents (n_2 = 300)
• "president" occurs 15 times in doc D (f_1 = 15)
• "lincoln" occurs 25 times in doc D (f_2 = 25)
• document length is 90% of the average length (dl / avdl = 0.9)
• k_1 = 1.2, b = 0.75, and k_2 = 100
• K = 1.2 × (0.25 + 0.75 × 0.9) = 1.11
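A sketch that reproduces this example in Python (natural logs here; the slides do not state the log base, so the absolute score depends on that choice; the names are ours):

```python
import math

def bm25_term(f_i, qf_i, n_i, N, dl_ratio, r_i=0, R=0, k1=1.2, k2=100, b=0.75):
    """BM25 weight of a single query term (no relevance information: r_i = R = 0)."""
    K = k1 * ((1 - b) + b * dl_ratio)
    idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                        ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
    tf_part = ((k1 + 1) * f_i) / (K + f_i)
    qtf_part = ((k2 + 1) * qf_i) / (k2 + qf_i)
    return idf_part * tf_part * qtf_part

N, dl_ratio = 500_000, 0.9
score = (bm25_term(f_i=15, qf_i=1, n_i=40_000, N=N, dl_ratio=dl_ratio) +
         bm25_term(f_i=25, qf_i=1, n_i=300, N=N, dl_ratio=dl_ratio))
print(score)  # ~5.00 + ~9.37 = ~14.4; the rarer term "lincoln" dominates
```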

BM25 Example
• [Figure: step-by-step computation of the BM25 score for "president lincoln"]

BM25 Example
• Effect of term frequencies, especially on "lincoln"
• A large number of occurrences of a single important query term can yield a higher score than the occurrence of both query terms

Language Model (LM)
• A language model is associated with each doc and has been successfully applied to search applications to represent the topical content of a document and a query
• An LM is a probability distribution over sequences of words, i.e., it puts a probability measure over strings s from some vocabulary V: Σ_{s ∈ V*} P(s) = 1
• The original and basic method for using LMs in IR is the query likelihood model, in which a document d is constructed from a language model M_d, i.e.,
  • Estimate an LM for each document d
  • Rank documents by the likelihood of the query according to the estimated LM, i.e., P(Q | M_d)

LMs for Retrieval
• Three possible models of topical relevance:
  • Probability of generating the query text from a document (language) model, i.e., the query likelihood model:
    P_mle(Q | M_d) = Π_{t ∈ Q} P_mle(t | M_d), where P_mle(t | M_d) = tf_{t,d} / dl_d (the frequency of t in d over the length of d)
  • Probability of generating the document text from a query (language) model, i.e., P_mle(D | M_q) = P_mle(D | R), where R is the set of relevant documents for a query q
  • Comparing the language models representing the query and document topics

LMs for Retrieval
• Three ways of developing the LM modeling approach:
  a) The Query Likelihood Model: P(Q | M_d) or P(Q | D)
  b) The Document Likelihood Model: P(D | M_q) or P(D | R)
  c) Comparison of the query and document LMs using KL-divergence

LMs for Retrieval
• The Query Likelihood Model
  • Documents are ranked based on the probability that the query Q could be generated by the document model (i.e., is on the same topic) based on doc D, i.e., P(Q | D)
  • Given a query Q = q_1 ... q_n and a document D = d_1 ... d_m, compute the conditional probability P(D | Q):
    P(D | Q) ∝ P(Q | D) P(D)
    where P(Q | D) is the query likelihood given D, and P(D) is the prior belief that D is relevant to any query (assumed to be uniform)

Estimating Probabilities
• The obvious estimate for unigram probabilities is the maximum likelihood estimate (mle):
  P(q_i | D) = f_{q_i,D} / |D|
  where f_{q_i,D} is the number of times query word q_i occurs in D, and |D| is the number of words in D
• This makes the observed value of f_{q_i,D} most likely
• If any query word is missing from the document, the score will be zero
  • Missing 1 out of 4 query words is the same as missing 3 out of 4

LMs for Retrieval
• Example. The Query Likelihood Model
• Given the two (unigram) LMs, M1 and M2, shown below:

  word     P(word | M1)   P(word | M2)
  the      0.2            0.15
  a        0.1            0.12
  frog     0.01           0.0002
  toad     0.01           0.0001
  said     0.03           0.03
  likes    0.02           0.04
  that     0.04           0.04
  dog      0.005          0.01
  cat      0.003          0.015
  monkey   0.001          0.002
  ...      ...            ...

• Q: frog said that toad likes that dog
  P(Q | M1) = 0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.04 × 0.005 = 4.8 × 10^-13
  P(Q | M2) = 0.0002 × 0.03 × 0.04 × 0.0001 × 0.04 × 0.04 × 0.01 = 3.84 × 10^-16
  P(Q | M1) > P(Q | M2)
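A sketch reproducing the comparison (the model dictionaries restate the table above; the helper name is ours):

```python
M1 = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01, "said": 0.03,
      "likes": 0.02, "that": 0.04, "dog": 0.005, "cat": 0.003, "monkey": 0.001}
M2 = {"the": 0.15, "a": 0.12, "frog": 0.0002, "toad": 0.0001, "said": 0.03,
      "likes": 0.04, "that": 0.04, "dog": 0.01, "cat": 0.015, "monkey": 0.002}

def query_likelihood(query, model):
    """P(Q | M) under a unigram LM: the product of per-term probabilities."""
    p = 1.0
    for term in query.split():
        p *= model.get(term, 0.0)
    return p

Q = "frog said that toad likes that dog"
print(query_likelihood(Q, M1))  # 4.8e-13
print(query_likelihood(Q, M2))  # 3.84e-16, so M1 is the better model for Q
```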

The Language Models
• Example. Using language models, we estimate whether "Barbie car" (treated as Q) is generated from content targeting younger readers (Simple Wikipedia) or adults (Wikipedia), each treated as a model M:
  P(M_SimpleWiki | "Barbie car") ∝ P("Barbie car" | M_SimpleWiki) × P(M_SimpleWiki)
  P(M_Wikipedia | "Barbie car") ∝ P("Barbie car" | M_Wikipedia) × P(M_Wikipedia)
• [Table: per-word probabilities of Blue, Car, Barbie, Auburn, Software, and Toy on Simple Wikipedia vs. Wikipedia]
• Based on the results, it is more likely that "Barbie car" was generated from content written for younger audiences

LMs for Retrieval
• The Document Likelihood Model
  • Instead of using the probability of a document language model (M_d) generating the query, use the probability of a query language model (M_q) generating the document
  • Creating a document likelihood model is less appealing, since there is much less text available to estimate an LM based on the query text
  • Solution: incorporate relevance feedback into the model and update the query language model M_q
    e.g., the relevance model of Lavrenko & Croft (2001) is an instance of a document likelihood model that incorporates pseudo-relevance feedback into an LM approach

Language Model (LM) - Example
• Consider the query Q = "Frozen Movie" and the following documents, where R is relevant and NR is non-relevant:

  D1: "Frozen Character Anna" (R)
  D2: "Frozen Food from Fridge" (NR)
  D3: "Disney Movie Elsa" (R)
  D4: "Frozen Disney Sven" (R)

• Query language model estimated from the 9 words of the relevant documents:

  word (in R)   P(w)
  Frozen        2/9
  Character     1/9
  Anna          1/9
  Disney        2/9
  Movie         1/9
  Elsa          1/9
  Sven          1/9

Language Model (LM) - Example
• Using the query language model (M_q) above, we estimate the probability of docs D1 and D2 with respect to their relevance for the query Q = "Frozen Movie"
• D1: "Sven, Anna, Disney, Frozen"
  P(D1 | M_q) = Π_{t ∈ D1} P(t | M_q) = 1/9 × 1/9 × 2/9 × 2/9 = 4/6561 = 0.00061
• D2: "Frozen ice cream"
  P(D2 | M_q) = Π_{t ∈ D2} P(t | M_q) = 2/9 × 0 × 0 = 0
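A sketch of this document-likelihood scoring, with M_q estimated as the MLE over the relevant documents (the names are ours):

```python
from collections import Counter

relevant_docs = ["Frozen Character Anna", "Disney Movie Elsa", "Frozen Disney Sven"]

# Query language model M_q: MLE over all words of the relevant documents
counts = Counter(w.lower() for d in relevant_docs for w in d.split())
total = sum(counts.values())  # 9
M_q = {w: c / total for w, c in counts.items()}

def doc_likelihood(doc, model):
    """P(D | M_q): the product of query-model probabilities of the doc's words."""
    p = 1.0
    for w in doc.lower().split():
        p *= model.get(w, 0.0)  # unseen words zero out the score (no smoothing)
    return p

print(doc_likelihood("Sven Anna Disney Frozen", M_q))  # 4/6561 ~ 0.00061
print(doc_likelihood("Frozen ice cream", M_q))         # 0.0
```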

LMs for Retrieval
• Comparing the Query Likelihood Model and the Document Likelihood Model
  • Rather than generating either a document language model (M_d) or a query language model (M_q), create an LM from both the document and the query and find the difference
  • To model the risk of returning a document d as relevant to a query q, use the KL-divergence:
    R(d; q) = KL(M_d || M_q) = Σ_{t ∈ V} P(t | M_q) log [P(t | M_q) / P(t | M_d)]
  • The KL-divergence measures how bad the document model M_d is at modeling the query distribution M_q
  • It has been shown that the comparison model outperforms both the query-likelihood and document-likelihood models and is useful for ad hoc retrieval

KL-Divergence
• Example. Consider the probability distributions generated using texts from Simple Wikipedia (a simplified version of Wikipedia) and Wikipedia
• Probabilities generated using text targeting younger versus more mature audiences:

  word      P(w) based on Simple Wiki   P(w) based on Wikipedia
  Blue      0.0043                      0.0042
  Car       0.0048                      0.0039
  Barbie    0.002                       0.0056
  Auburn    0.0439                      ...
  Software  0.0756                      0.0003
  Toy       0.0037                      0.0076

KL-Divergence
• Example (Cont.). Using KL-divergence on the vocabulary:

  KL(SimpleWiki || Wikipedia) = 0.0043 log(0.0043/0.0042) + 0.0048 log(0.0048/0.0039) + 0.002 log(0.002/0.0056) + 0.0439 log(0.0439/...) + 0.0756 log(0.0756/0.0003) + 0.0037 log(0.0037/0.0076)

• The vocabulary distribution for children versus more mature audiences is indeed different
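A sketch of this computation (Auburn's Wikipedia probability is missing from the source, so the 0.0001 below is a hypothetical placeholder; with it, the sum comes to roughly 0.68 using natural logs):

```python
import math

# Per-word probabilities from the table above. Auburn's Wikipedia value is not
# in the source; 0.0001 is a HYPOTHETICAL placeholder used for illustration.
simple_wiki = {"blue": 0.0043, "car": 0.0048, "barbie": 0.002,
               "auburn": 0.0439, "software": 0.0756, "toy": 0.0037}
wikipedia = {"blue": 0.0042, "car": 0.0039, "barbie": 0.0056,
             "auburn": 0.0001, "software": 0.0003, "toy": 0.0076}

def kl_divergence(p, q):
    """KL(p || q) = sum over words w of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

print(kl_divergence(simple_wiki, wikipedia))  # ~0.68 > 0: the distributions differ
```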

Smoothing
• Document texts are a sample from the language model
  • Missing words (in a document) should not have zero probability of occurring
• Smoothing is a technique for estimating probabilities for missing (or unseen) words, i.e., words that are not seen in the text
  • Lower (or discount) the probability estimates for words that are seen in the document text
  • Assign the "left-over" probability to the estimates for unseen words

Smoothing
• As stated in [Zhai 04]:
  • Smoothing is the problem of adjusting the maximum likelihood estimator to compensate for data sparseness; it is required to avoid assigning zero probability to unseen words
  • Smoothing accuracy is directly related to retrieval performance: retrieval performance is generally sensitive to the smoothing parameters
  • Smoothing plays two different roles:
    • Making the estimated document LM more accurate
    • Explaining the non-informative words in the query

Estimating Probabilities
• The estimate for unseen words is α_D P(q_i | C)
  • P(q_i | C) is the probability of query word q_i in the collection language model for collection C (the background probability)
  • α_D, which can be a constant, is a coefficient of the probability assigned to unseen words, depending on the document D
• The estimate for words that occur is (1 - α_D) P(q_i | D) + α_D P(q_i | C)
• Different forms of estimation come from different α_D

Estimating Probabilities
• Different forms of estimation come from different α_D (or α_d); in general:

  P(w | d) = p_s(w | d) if word w is seen in d, and α_d p(w | C) otherwise

  where p_s(w | d) is the smoothed probability of a word seen in d, p(w | C) is the collection language model, and α_d is a coefficient controlling the probability mass assigned to unseen words (all probabilities must sum to one)

Jelinek-Mercer Smoothing
• α_D is a constant, λ (= 0.1 for short queries or 0.7 for long queries in TREC evaluations)
• Gives the estimate
  P(q_i | D) = (1 - λ) f_{q_i,D} / |D| + λ P(q_i | C)
• Ranking score
  log P(Q | D) = Σ_{i=1}^{n} log [(1 - λ) f_{q_i,D} / |D| + λ P(q_i | C)]
  • Using logs for convenience: solves the accuracy problem of multiplying many small numbers
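A minimal sketch of Jelinek-Mercer scoring (the names are ours; the statistics in the usage lines are those of the "President Lincoln" example below):

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.1):
    """Jelinek-Mercer query likelihood: log P(Q | D) with
    P(q_i | D) = (1 - lam) * f_{q_i,D}/|D| + lam * c_{q_i}/|C|."""
    score = 0.0
    for t in query:
        p = (1 - lam) * doc_tf.get(t, 0) / doc_len + lam * coll_tf[t] / coll_len
        score += math.log(p)
    return score

doc_tf = {"president": 15, "lincoln": 25}
coll_tf = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc_tf, 1_800, coll_tf, 10**9))  # ~ -9.27
```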

Where is the tf-idf Weight?
• Rewriting the Jelinek-Mercer ranking score:

  log P(Q | D) = Σ_{i: f_{q_i,D} > 0} log [((1 - λ) f_{q_i,D} / |D| + λ P(q_i | C)) / (λ P(q_i | C))] + Σ_{i=1}^{n} log (λ P(q_i | C))

• The first sum, over matching terms, is proportional to the term frequency (TF) and inversely proportional to the collection frequency (IDF)
• The second sum, over all query terms, is the same for all documents and can be ignored for ranking

Dirichlet Smoothing
• α_D depends on document length:
  α_D = μ / (|D| + μ), where μ is a parameter value set empirically
• Gives the probability estimate
  P(q_i | D) = (f_{q_i,D} + μ P(q_i | C)) / (|D| + μ)
• and the document score
  log P(Q | D) = Σ_{i=1}^{n} log [(f_{q_i,D} + μ P(q_i | C)) / (|D| + μ)]

Query Likelihood
• Example. Given:
  • Let Q be "President Lincoln" and D (a doc) in C(ollection)
  • For the term "President", let f_{q_1,D} = 15 and c_{q_1} = 160,000 (its frequency in C)
  • For the term "Lincoln", let f_{q_2,D} = 25 and c_{q_2} = 2,400
  • Let the number of word occurrences in D (i.e., |D|) be 1,800
  • Let the number of word occurrences in C be 10^9 = 500,000 (docs) × 2,000 (average number of words in a doc)
  • μ = 2,000
• log P(Q | D) = log [(15 + 2000 × 160000/10^9) / (1800 + 2000)] + log [(25 + 2000 × 2400/10^9) / (1800 + 2000)] ≈ -5.51 + (-5.02) = -10.53
• The score is negative due to summing the logs of small numbers

Query Likelihood
• [Figure: step-by-step query likelihood calculation for the example above]
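A sketch that reproduces the example with Dirichlet smoothing (natural logs; the names are ours):

```python
import math

def dirichlet_score(query, doc_tf, doc_len, coll_tf, coll_len, mu=2000):
    """Dirichlet-smoothed query likelihood:
    log P(Q | D) = sum over q_i of log((f_{q_i,D} + mu * c_{q_i}/|C|) / (|D| + mu))."""
    score = 0.0
    for t in query:
        p = (doc_tf.get(t, 0) + mu * coll_tf[t] / coll_len) / (doc_len + mu)
        score += math.log(p)
    return score

# The "President Lincoln" example: |D| = 1,800, |C| = 10^9, mu = 2,000
doc_tf = {"president": 15, "lincoln": 25}
coll_tf = {"president": 160_000, "lincoln": 2_400}
print(dirichlet_score(["president", "lincoln"], doc_tf, 1_800, coll_tf, 10**9))
# -5.51 + (-5.02) = ~ -10.53
```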

Set Theoretic Models
• The Boolean model imposes a binary criterion for deciding relevance
• The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
• Two set theoretic models for this extension are
  • the Fuzzy Set Model
  • the Extended Boolean Model

Fuzzy Queries
• Binary queries are based on binary logic, in which a document either satisfies or does not satisfy a query
• Fuzzy queries are based on fuzzy logic, in which a query term is considered a fuzzy set and each document is given a degree of relevance with respect to a query, usually in [0, 1]
• Example.
  • In binary logic: the set of "tall people" is defined as those over 6 feet, so a person 5'9.995" is not considered tall
  • In fuzzy logic: there is no clear distinction on "tall people"; every person is given a degree of membership in the set of tall people, e.g.,
    • a person 7'0" will have a grade of 1.0
    • a person 4'5" will have a grade of 0.0
    • a person 6'2" will have a grade of 0.85
    • a person 5'8" will have a grade of 0.7

Fuzzy Queries
• Binary Logic vs. Fuzzy Logic
• [Figure: membership in the set TALL as a step function at the height threshold (binary/crisp logic) vs. a gradually increasing curve around 6.5 feet (fuzzy logic)]

Fuzzy Set Model
• Queries and documents are represented by sets of index terms, and matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
  • with each (query) term is associated a fuzzy set
  • each document has a degree of membership (0 ≤ μ ≤ 1) in this fuzzy set
• This interpretation provides the foundation for many IR models based on fuzzy theory
• The model proposed by Ogawa, Morita & Kobayashi (1991) is discussed here

Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• The thesaurus is built as follows:
  • Let c be a term-term correlation matrix (or keyword connection matrix)
  • Let c(i, l) be a normalized correlation factor for the term pair (k_i, k_l):

    c(i, l) = n(i, l) / (n_i + n_l - n(i, l))

    n_i: number of documents which contain k_i
    n_l: number of documents which contain k_l
    n(i, l): number of documents which contain both k_i and k_l
• This captures the notion of proximity among index terms

Fuzzy Information Retrieval
• The correlation factor c(i, l) can be used to define the fuzzy set membership of a document d_j in the fuzzy set for query term k_i as follows:

  μ(i, j) = 1 - Π_{k_l ∈ d_j} (1 - c(i, l))

  μ(i, j): the degree of membership of document d_j in the fuzzy subset associated with k_i
• The above expression computes an algebraic sum over all terms in d_j with respect to k_i, implemented as the complement of a negated algebraic product
• A document d_j belongs to the fuzzy set for k_i if its own terms are associated with k_i

Fuzzy Information Retrieval
• μ(i, j) = 1 - Π_{k_l ∈ d_j} (1 - c(i, l))
  • μ(i, j): membership of document d_j in the fuzzy subset associated with query term k_i
• If document d_j contains any index term k_l which is closely related to query term k_i, we have
  • if ∃ l such that c(i, l) ≈ 1, then μ(i, j) ≈ 1, and
  • query index term k_i is a good fuzzy index for document d_j
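A minimal sketch of the membership computation (the correlation factors below are hypothetical; the names are ours):

```python
def membership(i, doc_terms, c):
    """mu(i, j) = 1 - product over terms l in d_j of (1 - c(i, l)):
    the degree of membership of the document in the fuzzy set of term i."""
    prod = 1.0
    for l in doc_terms:
        prod *= 1.0 - c.get((i, l), 0.0)
    return 1.0 - prod

# Hypothetical correlation factors between query term "retrieval" and doc terms:
c = {("retrieval", "search"): 0.8, ("retrieval", "index"): 0.4}
print(membership("retrieval", ["search", "index", "the"], c))
# 1 - (0.2 * 0.6 * 1.0) = 0.88
```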

Fuzzy IR
• Example.
  q = k_a ∧ (k_b ∨ ¬k_c)
  q_dnf = (1,1,1) + (1,1,0) + (1,0,0) (binary weighted)
        = cc_1 + cc_2 + cc_3 (conjunctive components)
  μ(q, d_j) = μ(cc_1 + cc_2 + cc_3, j)
            = 1 - (1 - μ(a, j) μ(b, j) μ(c, j)) × (1 - μ(a, j) μ(b, j) (1 - μ(c, j))) × (1 - μ(a, j) (1 - μ(b, j)) (1 - μ(c, j)))
• [Figure: Venn diagram of K_a, K_b, K_c showing the conjunctive components cc_1, cc_2, cc_3]
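A sketch of this query evaluation (the membership values in the usage line are hypothetical):

```python
def fuzzy_query_score(mu_a, mu_b, mu_c):
    """mu(q, d_j) for q = k_a AND (k_b OR NOT k_c), written in disjunctive
    normal form as cc_1 + cc_2 + cc_3 as in the example above."""
    cc1 = mu_a * mu_b * mu_c
    cc2 = mu_a * mu_b * (1 - mu_c)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)
    # Algebraic sum: the complement of the product of the complements
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(fuzzy_query_score(0.9, 0.7, 0.2))  # ~0.66 for a doc strongly about k_a, k_b
```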

Fuzzy Information Retrieval
• Fuzzy sets are useful for representing vagueness and imprecision
• Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
• Experiments with standard test collections are not available
• The models are difficult to compare, because previously conducted experiments used only small collections