Slide 1: Language Modeling Approaches for Information Retrieval (Rong Jin)
Slide 2: A Probabilistic Framework for Information Retrieval
Given a collection of documents d1, ..., d1000 and a query q = 'bush Kerry':
- Estimate some statistics θ for each document
- Estimate the likelihood p(q|θ) of generating the query from each document's statistics
Slide 3: A Probabilistic Framework for Information Retrieval
Three fundamental questions:
- What statistics should be chosen to describe the characteristics of documents?
- How do we estimate these statistics?
- How do we compute the likelihood of generating queries given the statistics?
Slide 4: Unigram Language Model
Probabilities for single words: θ = {p(w) for every word w in the vocabulary V}.
Estimating a unigram language model by simple counting: given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|.
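The counting estimate on this slide can be sketched in a few lines of Python (the tokenized toy document is my own example, not from the slides):

```python
from collections import Counter

def unigram_mle(document_tokens):
    """ML estimate of a unigram language model: p(w) = c(w, d) / |d|."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {word: c / total for word, c in counts.items()}

d = ["bush", "kerry", "debate", "bush", "vote", "bush"]
model = unigram_mle(d)
print(model["bush"])   # 3/6 = 0.5
print(model["kerry"])  # 1/6
```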
Slide 5: Statistical Inference
C1: h, h, h, h, t, h → bias b1 = 5/6
C2: t, t, h, t, h, h → bias b2 = 1/2
C3: t, h, t, t, t, h → bias b3 = 1/3
Why does counting provide a good estimate of a coin's bias?
Slide 6: Maximum Likelihood Estimation (MLE)
Observation o = {o1, o2, ..., on}.
Maximum likelihood estimation chooses the parameter that maximizes the probability of the observation: b* = argmax_b Pr(o|b).
E.g.: o = {h, h, h, t, h, h} gives Pr(o|b) = b^5·(1-b), which is maximized at b = 5/6.
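The maximizer for this example follows by setting the derivative of the log-likelihood to zero; a one-line derivation:

```latex
\log \Pr(o \mid b) = 5 \log b + \log(1-b), \qquad
\frac{d}{db}\log \Pr(o \mid b) = \frac{5}{b} - \frac{1}{1-b} = 0
\;\Longrightarrow\; 5(1-b) = b \;\Longrightarrow\; b = \frac{5}{6}.
```

This is exactly the counting estimate from the previous slide: heads over total tosses.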
Slide 7: Unigram Language Model
Observation: d = {tf1, tf2, ..., tfn}.
Unigram language model: θ = {p(w1), p(w2), ..., p(wn)}.
Maximum likelihood estimation gives p(wi) = tfi / |d|, i.e. term frequency over document length.
Slide 8: Maximum A Posteriori Estimation
Consider a special case where we toss each coin only twice:
C1: h, t → b1 = 1/2
C2: h, h → b2 = 1 (?)
C3: t, t → b3 = 0 (?)
MLE is poor when the number of observations is small. This is called the "sparse data" problem!
Slide 9: Solutions to the Sparse Data Problem
- Shrinkage
- Maximum a posteriori (MAP) estimation
- Bayesian approach
Slide 10: Shrinkage: Jelinek-Mercer Smoothing
Linearly interpolate between the document language model and the collection language model:
p(w|d) = λ·p_ml(w|d) + (1-λ)·p(w|C)
where p_ml(w|d) is estimated from the individual document, p(w|C) from the corpus, and 0 < λ < 1 is a smoothing parameter.
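A minimal sketch of this interpolation; the toy document and collection probabilities are assumed numbers for illustration:

```python
def jm_smooth(p_ml_doc, p_collection, lam=0.5):
    """Jelinek-Mercer: p(w|d) = lam * p_ml(w|d) + (1 - lam) * p(w|C)."""
    vocab = set(p_ml_doc) | set(p_collection)
    return {w: lam * p_ml_doc.get(w, 0.0) + (1 - lam) * p_collection.get(w, 0.0)
            for w in vocab}

p_d = {"bush": 0.5, "kerry": 0.5}                  # ML document model (toy)
p_c = {"bush": 0.1, "kerry": 0.05, "vote": 0.85}   # collection model (toy)
p = jm_smooth(p_d, p_c, lam=0.8)
print(p["vote"])  # 0.2 * 0.85 = 0.17: an unseen word is no longer zero
```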
Slide 11: Smoothing & TF-IDF Weighting
Are they totally unrelated?
Slide 12: Smoothing & TF-IDF Weighting
Expanding the smoothed query likelihood yields a score similar to TF.IDF weighting, plus a term that does not depend on the document.
Slide 13: Maximum A Posteriori Estimation
Introduce a prior on b: most coins are more or less unbiased.
Use a Dirichlet prior on b (for a two-outcome coin, this is a Beta distribution).
Slide 14: Maximum A Posteriori Estimation
Observation o = {o1, o2, ..., on}.
Maximum a posteriori estimation chooses the parameter that maximizes the posterior: b* = argmax_b Pr(b|o) = argmax_b Pr(o|b)·p(b).
Slide 17: Maximum A Posteriori Estimation
Observation o = {o1, o2, ..., on}.
Under the Dirichlet prior, the MAP estimate adds the hyper-parameters to the observed counts: they act as pseudo counts (or pseudo experiments).
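A sketch of the pseudo-count effect, assuming a Beta(αh, αt) prior on the coin bias (the two-outcome case of the Dirichlet); the closed-form MAP estimate is (nh + αh - 1)/(n + αh + αt - 2):

```python
def map_bias(n_heads, n_tails, alpha_h=2.0, alpha_t=2.0):
    """MAP estimate under a Beta(alpha_h, alpha_t) prior:
    b = (n_h + alpha_h - 1) / (n_h + n_t + alpha_h + alpha_t - 2).
    The hyper-parameters act as pseudo counts added to the real tosses."""
    return (n_heads + alpha_h - 1) / (n_heads + n_tails + alpha_h + alpha_t - 2)

# The two-toss coins from the sparse-data slide, smoothed toward fairness:
print(map_bias(2, 0))  # C2 = h, h -> 3/4 instead of the MLE's 1
print(map_bias(0, 2))  # C3 = t, t -> 1/4 instead of 0
```

With only two real tosses, the two pseudo tosses of the prior pull the extreme MLE values back toward 1/2.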
Slide 18: Dirichlet Prior
Given a distribution p = {p1, p2, ..., pn}, a Dirichlet distribution for p is defined as
Dir(p|α) ∝ ∏i pi^(αi - 1)
where the αi are called hyper-parameters.
Slide 19: Dirichlet Prior
Full Dirichlet distribution:
Dir(p|α) = Γ(α1 + ... + αn) / (Γ(α1)···Γ(αn)) · ∏i pi^(αi - 1)
where Γ(x) is the gamma function.
Slide 20: Dirichlet Prior
- The Dirichlet is a distribution over distributions.
- The prior knowledge about the distribution p is encoded in the hyper-parameters.
- The maximum (mode) of the Dirichlet distribution is at pi = (αi - 1)/(α1 + α2 + ... + αn - n), so pi ∝ αi - 1; setting αi = c·pi + 1 places the mode at the desired pi.
- Example: the prior knowledge that most coins are fair, b = 1 - b = 1/2, gives α1 = α2 = c for some constant c > 1.
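A quick numeric check of the mode formula above (the concrete α values are my own example):

```python
def dirichlet_mode(alphas):
    """Mode of Dir(alpha): p_i = (alpha_i - 1) / (sum_j alpha_j - n),
    valid when every alpha_i > 1."""
    n = len(alphas)
    denom = sum(alphas) - n
    return [(a - 1) / denom for a in alphas]

# Fair-coin prior from the slide: alpha_1 = alpha_2 = c places the mode at 1/2
print(dirichlet_mode([5, 5]))    # [0.5, 0.5] for any c > 1
print(dirichlet_mode([4, 2, 2]))  # skewed prior over three outcomes
```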
Slide 21: Unigram Language Model
Simple counting gives zero probabilities to words that do not appear in the document.
Introduce a Dirichlet prior to smooth the language model. How do we construct the Dirichlet prior?
Slide 22: Dirichlet Prior for the Unigram LM
A prior for what distribution? The document model θd = {p(w1|d), p(w2|d), ..., p(wn|d)}.
How do we determine appropriate values for the hyper-parameters αi?
Slide 23: Determining the Hyper-parameters
The most likely language model under the Dirichlet prior (its mode) has p(wi|θd) ∝ αi - 1.
What is the most likely p(wi|θd) without looking into the content of the document d?
Slide 24: Determining the Hyper-parameters
The most likely p(wi|θd) without looking into the content of the document d is the unigram probability of the collection: θC = {p(w1|C), p(w2|C), ..., p(wn|C)}.
So an appropriate choice is αi = μ·p(wi|C) + 1 for some constant μ > 0.
Slide 27: Dirichlet Prior for the Unigram LM
MAP estimation for the best unigram language model.
Solution: p(wi|d) = (tfi + μ·p(wi|C)) / (|d| + μ)
Slide 30: Dirichlet Prior for the Unigram LM
In the MAP solution p(wi|d) = (tfi + μ·p(wi|C)) / (|d| + μ), the term μ·p(wi|C) is a pseudo term frequency, and μ in the denominator acts as a pseudo document length.
Slide 32: Dirichlet-Smoothed Unigram LM
What does p(w|d) look like if μ is small? It stays close to the maximum likelihood estimate tfi/|d|.
What does p(w|d) look like if μ is large? It is pulled toward the collection model p(w|C).
Slide 34: Dirichlet-Smoothed Unigram LM
With smoothing there are no longer zero probabilities for unseen words.
Slide 35: Dirichlet-Smoothed Unigram LM
Step 1: compute the collection-based unigram language model by simple counting: p(w|C) = Σk c(w, dk) / Σk |dk|.
Step 2: for each document dk, compute its smoothed unigram language model as p(w|dk) = (c(w, dk) + μ·p(w|C)) / (|dk| + μ).
Slide 37: Dirichlet-Smoothed Unigram LM
For a given query q = {tf1(q), tf2(q), ..., tfn(q)} and each document d, compute the likelihood
p(q|d) = ∏i p(wi|d)^tfi(q), i.e. log p(q|d) = Σi tfi(q)·log p(wi|d).
The larger the likelihood, the more relevant the document is to the query.
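Ranking by this likelihood is a one-liner in log space; the two smoothed document models below are assumed toy numbers:

```python
import math

def query_log_likelihood(query_tf, p_wd):
    """log p(q|d) = sum_i tf_i(q) * log p(w_i|d); higher means more relevant."""
    return sum(tf * math.log(p_wd[w]) for w, tf in query_tf.items())

# Toy smoothed models for two documents (assumed numbers):
p_d1 = {"bush": 0.4, "kerry": 0.3, "vote": 0.3}
p_d2 = {"bush": 0.1, "kerry": 0.1, "vote": 0.8}
q = {"bush": 1, "kerry": 1}
scores = {name: query_log_likelihood(q, p) for name, p in [("d1", p_d1), ("d2", p_d2)]}
print(max(scores, key=scores.get))  # d1 ranks higher for this query
```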
Slide 38: Smoothing & TF-IDF Weighting
Are they totally unrelated?
Slide 39: Smoothing & TF-IDF Weighting
Expanding the smoothed query likelihood separates a document-dependent part from a document-independent part.
Slide 42: Document normalization: the |d| + μ denominator of the smoothed estimate plays the role of document length normalization.
Slide 43: Smoothing & TF-IDF Weighting: the remaining document-dependent term grows with tf and shrinks with p(w|C), behaving like a TF.IDF weight.
Slide 44: Shrinkage vs. Dirichlet Smoothing
Both linearly interpolate between the document language model and the collection language model:
JM smoothing: p(w|d) = λ·p_ml(w|d) + (1-λ)·p(w|C), with a constant interpolation weight λ.
Dirichlet smoothing: the same interpolation with weight λd = |d| / (|d| + μ), which is document dependent.
Slide 45: Current Probabilistic Framework for Information Retrieval
Given documents d1, ..., d1000 and a query q = 'bush Kerry': estimate some statistics θ for each document, then estimate the likelihood p(q|θ).
Slide 46: A language model θk is estimated for each document dk (θ1, θ2, ..., θ1000).
Slide 47: At query time, only the estimated models θ1, ..., θ1000 are needed to compute the likelihood p(q|θk).
Slide 48: Bayesian Approach
A single point estimate θk per document ignores the uncertainty in model inference; we need to take this uncertainty into account.
Slide 49: Bayesian Approach
Consider a set of candidate models θ1, θ2, ..., θn. A document d induces a likelihood p(d|θi) for each model, and a query q induces p(q|θi).
Averaging over models: p(q|d) = Σi p(q|θi)·p(θi|d).
Slide 51: Bayesian Approach
Assume that p(d) and p(θi) follow uniform distributions; then p(θi|d) ∝ p(d|θi), and
p(q|d) = Σi p(q|θi)·p(d|θi) / Σj p(d|θj).
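Under the uniform-prior assumption, model averaging reduces to a weighted sum; a minimal sketch with assumed toy likelihoods for three candidate models:

```python
def bayesian_query_likelihood(p_d_given_theta, p_q_given_theta):
    """p(q|d) = sum_i p(q|theta_i) * p(theta_i|d), where under uniform
    p(d) and p(theta_i) we have p(theta_i|d) = p(d|theta_i) / sum_j p(d|theta_j)."""
    z = sum(p_d_given_theta)
    return sum(pq * pd / z for pq, pd in zip(p_q_given_theta, p_d_given_theta))

# Toy numbers (assumed): three candidate models theta_1..theta_3
p_d = [0.6, 0.3, 0.1]   # p(d|theta_i)
p_q = [0.2, 0.5, 0.1]   # p(q|theta_i)
print(bayesian_query_likelihood(p_d, p_q))
```

Models that explain the document well get more weight in the query likelihood, instead of betting everything on the single MAP model.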