Information Retrieval Models: Language Models

ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

Modeling Relevance: Roadmap for Retrieval Models
Relevance can be modeled in three broad ways:
- Similarity: relevance ≈ similarity(Rep(q), Rep(d)); models differ in representation and similarity measure
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
  - …
- Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  - Regression model (Fuhr 89)
  - Learning to rank (Joachims 02, Burges et al. 05)
  - Generative model
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
- Probabilistic inference: P(d→q) or P(q→d); models differ in inference system
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
Also on the roadmap: Div. from Randomness (Amati & Rijsbergen 02) and relevance constraints [Fang et al. 04]

Query Generation ( Language Models for IR)
Query likelihood p(Q| D,R=1) Document prior Assuming uniform prior, we have Now, the question is how to compute ? Generally involves two steps: (1) estimate a language model based on D (2) compute the query likelihood according to the estimated model P(Q|D, R=1) Prob. that a user who likes D would pose query Q How to estimate it?

The Basic LM Approach [Ponte & Croft 98]
Each document induces a language model with unknown word probabilities:
Text mining paper → Language Model: text ?, mining ?, association ?, clustering ?, …
Food nutrition paper → Language Model: food ?, nutrition ?, healthy ?, diet ?, …
Query = “data mining algorithms”
Which model would most likely have generated this query?

Ranking Docs by Query Likelihood
Documents d1, d2, …, dN → one Doc LM per document → query likelihoods p(q|d1), p(q|d2), …, p(q|dN) for query q; documents are ranked by these values

Modeling Queries: Different Assumptions
Multi-Bernoulli: Modeling word presence/absence
q = (x_1, …, x_|V|), x_i = 1 for presence of word w_i; x_i = 0 for absence
$p(q|d) = \prod_{i=1}^{|V|} p(w_i = x_i|d) = \prod_{i: x_i=1} p(w_i=1|d) \prod_{i: x_i=0} p(w_i=0|d)$
Parameters: {p(w_i=1|d), p(w_i=0|d)}, with p(w_i=1|d) + p(w_i=0|d) = 1
Multinomial (Unigram LM): Modeling word frequency
q = q_1, …, q_m, where q_j is a query word; c(w_i,q) is the count of word w_i in query q
$p(q|d) = \prod_{j=1}^{m} p(q_j|d) = \prod_{i=1}^{|V|} p(w_i|d)^{c(w_i,q)}$
Parameters: {p(w_i|d)}, with p(w_1|d) + … + p(w_|V||d) = 1
[Ponte & Croft 98] uses Multi-Bernoulli; most other work uses multinomial
Multinomial seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04]
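
A minimal sketch of the two query models, assuming the document model is already given as a dictionary of probabilities. The function names and the toy probabilities are hypothetical, and the multinomial coefficient is left out because it does not depend on the document.

```python
import math
from collections import Counter

def multinomial_log_likelihood(query_tokens, p_w_given_d):
    """log p(q|d) = sum over query words of c(w,q) * log p(w|d)."""
    counts = Counter(query_tokens)
    return sum(c * math.log(p_w_given_d[w]) for w, c in counts.items())

def multi_bernoulli_log_likelihood(query_tokens, p_present):
    """log p(q|d): every vocabulary word contributes, present or absent."""
    present = set(query_tokens)
    total = 0.0
    for w, p in p_present.items():
        total += math.log(p) if w in present else math.log(1.0 - p)
    return total

# Hypothetical toy document models over a 4-word vocabulary.
doc_multinomial = {"data": 0.4, "mining": 0.3, "algorithms": 0.2, "food": 0.1}
doc_bernoulli = {"data": 0.9, "mining": 0.8, "algorithms": 0.5, "food": 0.1}

query = ["data", "mining", "data"]
ll_multinomial = multinomial_log_likelihood(query, doc_multinomial)
ll_bernoulli = multi_bernoulli_log_likelihood(query, doc_bernoulli)
```

Note that the repeated query word "data" counts twice under the multinomial model but only once (as "present") under the multi-Bernoulli model.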

Retrieval as LM Estimation
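Document ranking based on query likelihood:
$\log p(q|d) = \sum_{j=1}^{m} \log p(q_j|d) = \sum_{i=1}^{|V|} c(w_i,q) \log p(w_i|d)$, where p(w_i|d) is the document language model
The retrieval problem thus becomes the estimation of p(w_i|d)
Smoothing is an important issue, and distinguishes different approaches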

How to Estimate p(w|d)?
Simplest solution: Maximum Likelihood Estimator. P(w|d) = relative frequency of word w in d
What if a word doesn’t appear in the text? Then P(w|d) = 0
In general, what probability should we give a word that has not been observed?
If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
This is what “smoothing” is about …

Language Model Smoothing (Illustration)
[Figure: P(w) plotted against word w, comparing the max. likelihood estimate with the smoothed LM]

How to Smooth?
All smoothing methods try to
- discount the probability of words seen in a document
- re-allocate the extra counts so that unseen words will have a non-zero count
Method 1 Additive smoothing [Chen & Goodman 98]: Add a constant δ to the counts of each word, e.g., “add 1” (“Add one”, Laplace):
$p(w|d) = \frac{c(w,d) + 1}{|d| + |V|}$
where c(w,d) is the count of w in d, |V| is the vocabulary size, and |d| is the length of d (total counts)
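
The contrast between the maximum likelihood estimate and additive smoothing can be sketched as follows. This is an illustration under assumed toy data; the function names, vocabulary, and document are hypothetical.

```python
from collections import Counter

def ml_estimate(doc_tokens, vocab):
    """Maximum likelihood estimate: relative frequency of each word in d."""
    counts = Counter(doc_tokens)
    return {w: counts[w] / len(doc_tokens) for w in vocab}

def additive_smoothing(doc_tokens, vocab, delta=1.0):
    """Add delta to every count; delta=1 is "add one" (Laplace) smoothing."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

# Hypothetical toy data: "food" and "diet" never occur in the document.
vocab = ["text", "mining", "food", "diet"]
doc = ["text", "mining", "text"]

p_ml = ml_estimate(doc, vocab)            # unseen words get probability 0
p_add1 = additive_smoothing(doc, vocab)   # unseen words get (0 + 1) / (3 + 4)
```

Under the ML estimate any query containing "food" would get likelihood zero; after add-one smoothing every word has non-zero probability and the distribution still sums to one.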

Improve Additive Smoothing
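Should all unseen words get equal probabilities?
We can use a reference model to discriminate unseen words:
$p(w|d) = \begin{cases} p_{DML}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w|REF) & \text{otherwise} \end{cases}$
where p_DML(w|d) is the discounted ML estimate and p(w|REF) is the reference language model
$\alpha_d = \frac{1 - \sum_{w \text{ seen in } d} p_{DML}(w|d)}{\sum_{w \text{ unseen in } d} p(w|REF)}$
The numerator is the prob. mass for unseen words; the denominator is the normalizer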

Other Smoothing Methods
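Method 2 Absolute discounting [Ney et al. 94]: Subtract a constant δ from the counts of each word
$p(w|d) = \frac{\max(c(w,d) - \delta,\, 0) + \delta\, |d|_u\, p(w|REF)}{|d|}$, where |d|_u is the # unique words in d
Method 3 Linear interpolation [Jelinek-Mercer 80]: “Shrink” uniformly toward p(w|REF)
$p(w|d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w|REF)$, where c(w,d)/|d| is the ML estimate and λ is the parameter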
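
A small sketch of absolute discounting and Jelinek-Mercer interpolation against a reference model, assuming the reference model covers the whole vocabulary and sums to one. The function names, the reference probabilities, and the parameter values are hypothetical.

```python
from collections import Counter

def jelinek_mercer(doc_tokens, p_ref, lam):
    """(1 - lam) * ML estimate + lam * p(w|REF)."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    return {w: (1 - lam) * counts[w] / n + lam * p for w, p in p_ref.items()}

def absolute_discounting(doc_tokens, p_ref, delta):
    """Subtract delta from each seen count and give the freed mass to p(w|REF)."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    unique = len(counts)  # number of unique words in the document
    return {w: (max(counts[w] - delta, 0.0) + delta * unique * p) / n
            for w, p in p_ref.items()}

# Hypothetical reference model and document.
p_ref = {"text": 0.1, "mining": 0.1, "food": 0.4, "diet": 0.4}
doc = ["text", "mining", "text"]

p_jm = jelinek_mercer(doc, p_ref, lam=0.5)
p_ad = absolute_discounting(doc, p_ref, delta=0.5)
```

Both estimates remain proper distributions: the mass removed from seen words is exactly the mass handed to the reference model.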

Other Smoothing Methods (cont.)
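Method 4 Dirichlet Prior/Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02]: Assume pseudo counts μ p(w|REF)
$p(w|d) = \frac{c(w,d) + \mu\, p(w|REF)}{|d| + \mu} = \frac{|d|}{|d|+\mu}\cdot\frac{c(w,d)}{|d|} + \frac{\mu}{|d|+\mu}\, p(w|REF)$, where μ is the parameter
Method 5 Good Turing [Good 53]: Assume total # unseen events to be n_1 (# of singletons), and adjust the seen events in the same way
$p(w|d) = \frac{c^*(w,d)}{|d|}$, with $c^*(w,d) = (r+1)\,\frac{n_{r+1}}{n_r}$ for $r = c(w,d)$, where n_r is the # of words occurring r times
Heuristics needed (e.g., when some n_r = 0)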

Dirichlet Prior Smoothing
ML estimator: $\hat{M} = \arg\max_M p(d|M)$
Bayesian estimator:
- First consider posterior: p(M|d) = p(d|M) p(M) / p(d)
- Then, consider the mean or mode of the posterior dist.
p(d|M): Sampling distribution (of data)
p(M) = p(θ_1, …, θ_N): our prior on the model parameters
conjugate = prior can be interpreted as “extra”/“pseudo” data
Dirichlet distribution is a conjugate prior for multinomial sampling distribution:
$Dir(\theta|\alpha_1, \ldots, \alpha_N) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_N)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_N)} \prod_{i=1}^{N} \theta_i^{\alpha_i - 1}$
“extra”/“pseudo” word counts: α_i = μ p(w_i|REF)

Dirichlet Prior Smoothing (cont.)
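Posterior distribution of parameters, with pseudo counts α_i = μ p(w_i|REF):
$p(\theta|d) = Dir(\theta|c(w_1)+\alpha_1, \ldots, c(w_N)+\alpha_N)$
The predictive distribution is the same as the mean:
$p(w_i|\hat{\theta}) = \int p(w_i|\theta)\, p(\theta|d)\, \mathrm{d}\theta = \frac{c(w_i) + \alpha_i}{|d| + \sum_{i=1}^{N}\alpha_i} = \frac{c(w_i) + \mu\, p(w_i|REF)}{|d| + \mu}$
This is Dirichlet prior smoothing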
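
As a sketch, the posterior-mean estimate can be written directly from the pseudo counts μ p(w|REF), and it coincides with a linear interpolation whose weight μ/(|d|+μ) shrinks as the document gets longer. The function names and toy values are hypothetical.

```python
from collections import Counter

def dirichlet_smoothing(doc_tokens, p_ref, mu):
    """Posterior-mean estimate with pseudo counts mu * p(w|REF)."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    return {w: (counts[w] + mu * p) / (n + mu) for w, p in p_ref.items()}

def as_interpolation(doc_tokens, p_ref, mu):
    """Same estimate written as interpolation with a length-dependent weight."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    lam_d = mu / (n + mu)  # weight on the reference model
    return {w: (1 - lam_d) * counts[w] / n + lam_d * p for w, p in p_ref.items()}

# Hypothetical reference model and document.
p_ref = {"text": 0.1, "mining": 0.1, "food": 0.4, "diet": 0.4}
doc = ["text", "mining", "text"]

p_dir = dirichlet_smoothing(doc, p_ref, mu=2.0)
p_int = as_interpolation(doc, p_ref, mu=2.0)
```

The two dictionaries agree up to floating-point rounding, which illustrates why Dirichlet prior smoothing applies less smoothing to long documents than to short ones.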

Smoothing with Collection Model Illustrated
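Estimation of a (unigram) language model θ for a document, p(w|θ) = ?, smoothed with the collection LM P(w|C)
Unknown: text ?, mining ?, association ?, database ?, …, query ?, …, network ?
Document (total #words = 100), word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
ML estimates c(w,d)/100: 10/100, 5/100, 3/100, …, 1/100, and 0/100 for unseen words
Collection LM P(w|C): the 0.1, a .., computer 0.02, database 0.01, ……, text 0.001, network 0.001, mining …
Jelinek-Mercer: $p(w|\theta) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, P(w|C)$
Dirichlet prior: $p(w|\theta) = \frac{c(w,d) + \mu\, P(w|C)}{|d| + \mu}$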
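
A worked example with the counts from this slide may help. The values λ = 0.5 and μ = 1000 are assumptions chosen only for illustration; the slide does not fix them. "text" occurs 10 times in the 100-word document with P(text|C) = 0.001, and "network" does not occur at all with P(network|C) = 0.001.

```latex
\begin{align*}
\text{JM, } \lambda = 0.5:\quad
  p(\text{text}|\theta)    &= 0.5 \cdot \tfrac{10}{100} + 0.5 \cdot 0.001 = 0.0505 \\
  p(\text{network}|\theta) &= 0.5 \cdot \tfrac{0}{100} + 0.5 \cdot 0.001 = 0.0005 \\[4pt]
\text{Dirichlet, } \mu = 1000:\quad
  p(\text{text}|\theta)    &= \frac{10 + 1000 \cdot 0.001}{100 + 1000} = \frac{11}{1100} = 0.01 \\
  p(\text{network}|\theta) &= \frac{0 + 1000 \cdot 0.001}{100 + 1000} = \frac{1}{1100} \approx 0.00091
\end{align*}
```

In both cases the unseen word "network" receives a small non-zero probability proportional to its collection probability.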

Query Likelihood Retrieval Functions
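Rank-equivalent forms of log p(q|d) (document-independent terms dropped):
With Jelinek-Mercer (JM):
$S_\lambda(q,d) = \sum_{w \in q,\, w \in d} c(w,q) \log\left(1 + \frac{1-\lambda}{\lambda}\cdot\frac{c(w,d)}{|d|\, p(w|C)}\right)$
With Dirichlet Prior (DIR):
$S_\mu(q,d) = \sum_{w \in q,\, w \in d} c(w,q) \log\left(1 + \frac{c(w,d)}{\mu\, p(w|C)}\right) + |q| \log\frac{\mu}{\mu + |d|}$
What assumptions have we made in order to derive these functions?
Do they capture the same retrieval heuristics (TF-IDF, Length Norm) as a vector space retrieval function?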
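
The two retrieval functions can be sketched as scoring routines that sum only over query words that occur in the document, with the Dirichlet variant adding a length term |q| log(μ/(μ+|d|)). The toy collection, the function names, and the parameter values (λ = 0.5, μ = 10) are hypothetical choices for a tiny example.

```python
import math
from collections import Counter

def score_jm(query, doc, p_c, lam=0.5):
    """Rank-equivalent JM query-likelihood score (matched words only)."""
    q, d, n = Counter(query), Counter(doc), len(doc)
    return sum(cq * math.log(1 + (1 - lam) / lam * d[w] / (n * p_c[w]))
               for w, cq in q.items() if d[w] > 0)

def score_dir(query, doc, p_c, mu=10.0):
    """Rank-equivalent Dirichlet score: matched-word sum plus a length term."""
    q, d, n = Counter(query), Counter(doc), len(doc)
    matched = sum(cq * math.log(1 + d[w] / (mu * p_c[w]))
                  for w, cq in q.items() if d[w] > 0)
    return matched + len(query) * math.log(mu / (mu + n))

# Hypothetical toy collection; the collection LM is estimated from all tokens.
docs = {
    "text_mining": "text mining data mining algorithms clustering data".split(),
    "food": "food nutrition healthy diet food data".split(),
}
all_tokens = [w for tokens in docs.values() for w in tokens]
p_c = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

query = "data mining algorithms".split()
ranking_jm = sorted(docs, key=lambda name: score_jm(query, docs[name], p_c), reverse=True)
ranking_dir = sorted(docs, key=lambda name: score_dir(query, docs[name], p_c), reverse=True)
```

Both functions place the text mining document first for this query, since it matches all three query words.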

So, which method is the best?
It depends on the data and the task!
Cross validation is generally used to choose the best method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Backoff smoothing [Katz 87] doesn’t work well due to a lack of 2nd-stage smoothing…
Note that many other smoothing methods exist. See [Chen & Goodman 98] and other publications in speech recognition…

Comparison of Three Methods [Zhai & Lafferty 01a]
Comparison is performed on a variety of test collections

Understanding Smoothing
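The general smoothing scheme:
$p(w|d) = \begin{cases} p_{DML}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w|REF) & \text{otherwise} \end{cases}$
where p_DML(w|d) is the discounted ML estimate and p(w|REF) is the reference language model
Retrieval formula using the general smoothing scheme:
$\log p(q|d) = \sum_{w \in q} c(w,q) \log p(w|d) = \sum_{w \in q,\, w \in d} c(w,q) \log p_{DML}(w|d) + \sum_{w \in q,\, w \notin d} c(w,q) \log \alpha_d\, p(w|REF)$
The key rewriting step: $\sum_{w \in q,\, w \notin d} = \sum_{w \in q} - \sum_{w \in q,\, w \in d}$, which gives
$\log p(q|d) = \sum_{w \in q,\, w \in d} c(w,q) \log\frac{p_{DML}(w|d)}{\alpha_d\, p(w|REF)} + |q| \log\alpha_d + \sum_{w \in q} c(w,q) \log p(w|REF)$
Similar rewritings are very common when using LMs for IR…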

Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]
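Plugging the general smoothing scheme into the query likelihood retrieval formula, with the collection model as reference (p(w|REF) = p(w|C)), we obtain
$\log p(q|d) = \sum_{w \in q,\, w \in d} c(w,q) \log\frac{p_{DML}(w|d)}{\alpha_d\, p(w|C)} + |q| \log\alpha_d + \sum_{w \in q} c(w,q) \log p(w|C)$
- The first sum runs over words in both query and doc
- p_DML(w|d) in the numerator: TF weighting
- p(w|C) in the denominator: IDF-like weighting
- |q| log α_d: doc length normalization (long doc is expected to have a smaller α_d)
- The last term can be ignored for ranking
Smoothing with p(w|C) ≈ TF-IDF + length norm.
Smoothing implements traditional retrieval heuristics
LMs with simple smoothing can be computed as efficiently as traditional retrieval models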
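
The rewriting can be checked numerically. The sketch below uses Jelinek-Mercer smoothing, for which α_d = λ, and compares the direct sum of log p(w|d) over query words with the three-part form (matched-word sum, |q| log α_d, and the document-independent collection term). All names and toy numbers are hypothetical.

```python
import math
from collections import Counter

def jm_prob(w, d_counts, d_len, p_c, lam):
    """Jelinek-Mercer smoothed p(w|d)."""
    return (1 - lam) * d_counts[w] / d_len + lam * p_c[w]

def direct_log_likelihood(query, doc, p_c, lam):
    """Sum of log p(w|d) over every query token."""
    d_counts = Counter(doc)
    return sum(math.log(jm_prob(w, d_counts, len(doc), p_c, lam)) for w in query)

def rewritten_log_likelihood(query, doc, p_c, lam):
    """Matched-word sum + length normalization + document-independent term."""
    d_counts, q_counts = Counter(doc), Counter(query)
    alpha_d = lam  # for Jelinek-Mercer, alpha_d equals lambda
    matched = sum(
        cq * math.log(jm_prob(w, d_counts, len(doc), p_c, lam) / (alpha_d * p_c[w]))
        for w, cq in q_counts.items() if d_counts[w] > 0)
    length_norm = len(query) * math.log(alpha_d)
    background = sum(cq * math.log(p_c[w]) for w, cq in q_counts.items())
    return matched + length_norm + background

# Hypothetical collection model, document, and query.
p_c = {"data": 0.3, "mining": 0.2, "algorithms": 0.1, "food": 0.4}
doc = ["data", "mining", "data", "food"]
query = ["data", "mining", "algorithms"]

direct = direct_log_likelihood(query, doc, p_c, lam=0.4)
rewritten = rewritten_log_likelihood(query, doc, p_c, lam=0.4)
```

Only the matched-word sum needs the document's term counts, which is why the score can be computed with an inverted index much like a TF-IDF function.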

The Dual-Role of Smoothing [Zhai & Lafferty 02]
[Figure: smoothing sensitivity curves for keyword queries (short, long) and verbose queries (short, long)]
Why does query type affect smoothing sensitivity?

Two-stage Smoothing [Zhai & Lafferty 02]
$P(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)$
Stage-1
- Explain unseen words
- Dirichlet prior (Bayesian), with parameter μ and collection LM p(w|C)
Stage-2
- Explain noise in query
- 2-component mixture, with parameter λ and user background model p(w|U)
- p(w|U) can be approximated by p(w|C)
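
A minimal sketch of the two-stage estimate: Dirichlet smoothing of the document model first, then interpolation with a user background model. Here p(w|U) defaults to p(w|C), following the approximation mentioned on the slide; the function name and toy values are hypothetical.

```python
from collections import Counter

def two_stage(doc_tokens, p_c, mu, lam, p_u=None):
    """(1 - lam) * Dirichlet-smoothed p(w|d) + lam * p(w|U)."""
    p_u = p_c if p_u is None else p_u  # approximate p(w|U) by p(w|C)
    counts, n = Counter(doc_tokens), len(doc_tokens)
    return {w: (1 - lam) * (counts[w] + mu * p_c[w]) / (n + mu) + lam * p_u[w]
            for w in p_c}

# Hypothetical collection model and document.
p_c = {"text": 0.1, "mining": 0.1, "food": 0.4, "diet": 0.4}
doc = ["text", "mining", "text"]

p_two_stage = two_stage(doc, p_c, mu=2.0, lam=0.3)
p_stage1_only = two_stage(doc, p_c, mu=2.0, lam=0.0)  # reduces to Dirichlet smoothing
```

Setting λ = 0 recovers plain Dirichlet prior smoothing, so the second stage is purely an add-on that accounts for noise words in the query.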

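Estimating μ using leave-one-out [Zhai & Lafferty 02]
Leave one word out at a time and predict it from the rest of the document: P(w_1|d − w_1), P(w_2|d − w_2), ..., P(w_n|d − w_n)
Leave-one-out log-likelihood:
$\ell_{-1}(\mu|C) = \sum_{i=1}^{N} \sum_{w \in d_i} c(w,d_i) \log\frac{c(w,d_i) - 1 + \mu\, p(w|C)}{|d_i| - 1 + \mu}$
Maximum Likelihood Estimator: $\hat{\mu} = \arg\max_\mu \ell_{-1}(\mu|C)$, computed with Newton’s Method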
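
The estimation can be sketched as Newton's method on the derivative of the leave-one-out log-likelihood, where each left-out word is predicted with probability (c(w,d) − 1 + μ p(w|C)) / (|d| − 1 + μ). The starting point, the stopping rule, and the halving/doubling safeguards below are assumptions added for robustness, not details from the slide, and the toy collection is hypothetical.

```python
from collections import Counter

def loo_derivatives(mu, docs, p_c):
    """First and second derivative of the leave-one-out log-likelihood in mu."""
    g = h = 0.0
    for doc in docs:
        n = len(doc)
        for w, c in Counter(doc).items():
            a = c - 1 + mu * p_c[w]  # leave-one-out numerator
            b = n - 1 + mu           # leave-one-out denominator
            g += c * (p_c[w] / a - 1.0 / b)
            h += c * (-(p_c[w] / a) ** 2 + 1.0 / b ** 2)
    return g, h

def estimate_mu(docs, p_c, mu=1.0, tol=1e-10, max_iter=100):
    """Newton iterations mu <- mu - g/h, with simple safeguards (assumed)."""
    for _ in range(max_iter):
        g, h = loo_derivatives(mu, docs, p_c)
        if h >= 0:                       # not locally concave: move along the gradient sign
            new_mu = mu * 2 if g > 0 else mu / 2
        else:
            new_mu = mu - g / h
            if new_mu <= 0:              # keep mu strictly positive
                new_mu = mu / 2
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical toy collection: each document repeats one word and has one singleton.
docs = [["a"] * 5 + ["b"], ["c"] * 5 + ["d"]]
all_tokens = [w for doc in docs for w in doc]
p_c = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

mu_hat = estimate_mu(docs, p_c)
```

For this toy collection the stationarity condition can be solved by hand and gives μ = 8/3, which the iteration reaches in a few steps.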

Why would “leave-one-out” work?
Suppose we keep sampling and get 10 more words. Which author is likely to “write” more new words?
20 words by author 1: abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e
20 words by author 2: abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s
Now, suppose we leave “e” out…
- Author 1 (“e” still occurs elsewhere in the sample): μ doesn’t have to be big
- Author 2 (“e” becomes unseen): μ must be big! more smoothing
The amount of smoothing is closely related to the underlying vocabulary size

Automatic 2-stage results ≈ Optimal 1-stage results [Zhai & Lafferty 02]
Average precision (3 DB’s + 4 query types, 150 topics); * indicates significant difference
Completely automatic tuning of parameters IS POSSIBLE!

Feedback and Doc/Query Generation
Classic Prob. Model: P(D|Q,R=1) (rel. doc model) vs. P(D|Q,R=0) (nonrel. doc model)
Query likelihood (“Language Model”): P(Q|D,R=1) (“rel. query” model)
Parameter estimation from (query, doc, relevance) examples:
- Doc-based feedback, for P(D|Q,R=1) and P(D|Q,R=0): (q1,d1,1) (q1,d2,1) (q1,d3,1) (q1,d4,0) (q1,d5,0)
- Query-based feedback, for P(Q|D,R=1): (q3,d1,1) (q4,d1,1) (q5,d1,1) (q6,d2,1) (q6,d3,0)
Initial retrieval:
- query as rel doc vs. doc as rel query
- P(Q|D,R=1) is more accurate
Feedback:
- P(D|Q,R=1) can be improved for the current query and future doc
- P(Q|D,R=1) can also be improved, but for current doc and future query

Difficulty in Feedback with Query Likelihood
Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]
- Improvement is reported, but there is a conceptual inconsistency
- What’s an expanded query, a piece of text or a set of terms?
Avoid expansion
- Query term reweighting [Hiemstra 01, Hiemstra 02]
- Translation models [Berger & Lafferty 99, Jin et al. 02]
- Only achieving limited feedback
Doing relevant query expansion instead [Nallapati et al 03]
The difficulty is due to the lack of a query/relevance model
The difficulty can be overcome with alternative ways of using LMs for retrieval (e.g., relevance model [Lavrenko & Croft 01], query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b])
© ChengXiang Zhai, 2007

Two Alternative Ways of Using LMs
Classic Probabilistic Model: Doc-generation as opposed to query-generation
- Natural for relevance feedback
- Challenge: Estimate p(D|Q,R=1) without relevance feedback; relevance model [Lavrenko & Croft 01] provides a good solution
Probabilistic Distance Model: Similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors
- A popular distance function: Kullback-Leibler (KL) divergence, covering query likelihood as a special case
- Retrieval is now to estimate query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]
Both methods outperform the basic LM significantly
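
To illustrate why KL-divergence scoring covers query likelihood as a special case, the sketch below scores a document by −D(θ_Q || θ_D). When θ_Q is the ML estimate from the query, this equals log p(q|d)/|q| plus the query-model entropy, which is the same for every document. The function names and toy models are hypothetical, and the document models are assumed to be already smoothed.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)), over words with p(w) > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def neg_cross_entropy(theta_q, theta_d):
    """sum_w p(w|theta_Q) * log p(w|theta_D): the document-dependent part."""
    return sum(pw * math.log(theta_d[w]) for w, pw in theta_q.items() if pw > 0)

# Hypothetical query and (already smoothed) document models.
query = ["data", "mining", "data"]
theta_q = {w: c / len(query) for w, c in Counter(query).items()}  # ML query model
theta_d1 = {"data": 0.5, "mining": 0.3, "food": 0.2}
theta_d2 = {"data": 0.1, "mining": 0.1, "food": 0.8}

score_d1 = -kl_divergence(theta_q, theta_d1)
score_d2 = -kl_divergence(theta_q, theta_d2)
entropy_q = -sum(p * math.log(p) for p in theta_q.values())
log_lik_d1 = sum(math.log(theta_d1[w]) for w in query)
```

Because the entropy term is constant across documents, ranking by −KL with an ML query model gives the same order as ranking by query likelihood; feedback then amounts to replacing θ_Q with a better query model.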

Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]
Question: How to estimate a better query model than the ML estimate based on the original query?
“Massive feedback”: Improve a query model through co-occurrence patterns learned from
- A document-term Markov chain that outputs the query [Lafferty & Zhai 01b]
- Thesauri, corpus [Bai et al. 05, Collins-Thompson & Callan 05]
Model-based feedback: Improve the estimate of query model by exploiting pseudo-relevance feedback
- Update the query model by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b]
- Estimate a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06]

Feedback as Model Interpolation [Zhai & Lafferty 01b]
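Query Q → query model θ_Q; Document D → document model θ_D; results are ranked by the divergence D(θ_Q' || θ_D)
Feedback docs F = {d1, d2, …, dn} → feedback model θ_F, estimated with a generative model or by divergence minimization
$\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$
α = 0: No feedback (θ_Q' = θ_Q)
α = 1: Full feedback (θ_Q' = θ_F)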
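
The interpolation step itself is simple: the updated query model is (1 − α) times the original query model plus α times the feedback model. A minimal sketch, with hypothetical word probabilities for the “airport security” style of query:

```python
def interpolate(theta_q, theta_f, alpha):
    """Updated query model: (1 - alpha) * theta_Q + alpha * theta_F."""
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

# Hypothetical original query model and feedback model.
theta_q = {"airport": 0.5, "security": 0.5}
theta_f = {"airport": 0.3, "security": 0.3, "passenger": 0.2, "screening": 0.2}

no_feedback = interpolate(theta_q, theta_f, alpha=0.0)
half_feedback = interpolate(theta_q, theta_f, alpha=0.5)
full_feedback = interpolate(theta_q, theta_f, alpha=1.0)
```

With α = 0.5 the expansion words “passenger” and “screening” enter the query model with probability 0.1 each, while the original query words keep most of the mass.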

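θ_F Estimation Method I: Generative Mixture Model
Each word w in F = {D1, …, Dn} is generated from one of two sources, chosen with P(source):
- with prob. 1−λ, topic words from P(w|θ)
- with prob. λ, background words from P(w|C)
$\log p(F|\theta) = \sum_{i}\sum_{w} c(w;D_i) \log\left((1-\lambda)\, p(w|\theta) + \lambda\, p(w|C)\right)$
Maximum Likelihood: $\theta_F = \arg\max_\theta \log p(F|\theta)$
The learned topic model is called a “parsimonious language model” in [Hiemstra et al. 04]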
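
One common way to fit such a two-component mixture is EM, sketched below; the slide itself only states the maximum likelihood objective, so the choice of EM, the iteration count, and the toy counts are assumptions. The background model P(w|C) and the mixing weight λ are held fixed, and only the topic model θ is learned.

```python
def estimate_topic_model(fb_counts, p_c, lam=0.5, n_iter=100):
    """EM for theta in the mixture (1 - lam) * p(w|theta) + lam * p(w|C)."""
    total = sum(fb_counts.values())
    theta = {w: c / total for w, c in fb_counts.items()}  # start from the ML estimate
    for _ in range(n_iter):
        # E-step: probability that an occurrence of w came from the topic model.
        z = {w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * p_c[w])
             for w in fb_counts}
        # M-step: re-estimate theta from the expected topic counts.
        norm = sum(fb_counts[w] * z[w] for w in fb_counts)
        theta = {w: fb_counts[w] * z[w] / norm for w in fb_counts}
    return theta

# Hypothetical pooled counts from feedback documents and collection probabilities.
fb_counts = {"the": 50, "airport": 30, "security": 20}
p_c = {"the": 0.5, "airport": 0.001, "security": 0.002}

theta_f = estimate_topic_model(fb_counts, p_c, lam=0.5)
```

The common word "the" is largely explained by the background model, so its probability under θ_F drops below its raw relative frequency of 0.5, while "airport" and "security" gain mass.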

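θ_F Estimation Method II: Empirical Divergence Minimization
Find θ that is close to the feedback document models θ_D1, …, θ_Dn (F = {D1, …, Dn}) and far (λ) from the background model C
Empirical divergence: $D_{emp}(\theta; F, C) = \frac{1}{|F|}\sum_{i=1}^{n} D(\theta\,\|\,\theta_{D_i}) - \lambda\, D(\theta\,\|\,\theta_C)$
Divergence minimization: $\theta_F = \arg\min_\theta D_{emp}(\theta; F, C)$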
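
Assuming the objective is the average KL divergence from θ to the feedback document models minus λ times the KL divergence from θ to the collection model, setting the Lagrangian derivative to zero gives, for λ < 1, a closed form: p(w|θ_F) ∝ exp( [mean_i log p(w|θ_Di) − λ log p(w|C)] / (1 − λ) ). The sketch below implements that form; the names and toy models are hypothetical, and the document models are assumed to be smoothed so that no probability is zero.

```python
import math

def divergence_min_model(doc_models, p_c, lam=0.5):
    """Closed-form minimizer of the empirical divergence (requires lam < 1)."""
    n = len(doc_models)
    scores = {}
    for w in p_c:
        avg_log = sum(math.log(m[w]) for m in doc_models) / n
        scores[w] = (avg_log - lam * math.log(p_c[w])) / (1 - lam)
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {w: math.exp(s - top) for w, s in scores.items()}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Hypothetical smoothed feedback document models and collection model.
doc_models = [{"the": 0.6, "airport": 0.4}, {"the": 0.5, "airport": 0.5}]
p_c = {"the": 0.9, "airport": 0.1}

theta_f = divergence_min_model(doc_models, p_c, lam=0.5)
```

For these numbers the unnormalized weights are 1/3 for "the" and 2 for "airport", so "airport" ends up with probability 6/7: words frequent in the feedback documents but rare in the collection dominate the feedback model.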

Example of Feedback Query Model
TREC topic 412: “airport security”; mixture model approach; Web database; top 10 docs; feedback query models shown for λ=0.9 and λ=0.7

Model-based feedback Improves over Simple LM [Zhai & Lafferty 01b]

What You Should Know
- Derivation of query likelihood retrieval model using query generation (what are the assumptions made?)
- Dirichlet prior and Jelinek-Mercer smoothing methods
- Connection between query likelihood and TF-IDF weighting + doc length normalization
- The basic idea of two-stage smoothing
- KL-divergence retrieval model
- Basic idea of feedback methods (mixture model)
