
1 Language Modeling Approaches for Information Retrieval Rong Jin

2 A Probabilistic Framework for Information Retrieval  Documents d1 … d1000, query q: 'bush Kerry'. Which documents match the query? Estimate some statistics θ for each document; estimate the likelihood p(q|θ)

3 A Probabilistic Framework for Information Retrieval  Three fundamental questions: What statistics θ should be chosen to describe the characteristics of documents? How do we estimate these statistics? How do we compute the likelihood of generating queries given the statistics θ?

4 Unigram Language Model  Probabilities for single words p(w): θ = {p(w) for any word w in vocabulary V}  Estimating a unigram language model by simple counting: given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|
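
A minimal sketch of this counting estimate in Python (the whitespace tokenization and the toy document are illustrative assumptions, not from the slides):

```python
from collections import Counter

def unigram_mle(document: str) -> dict[str, float]:
    """Maximum-likelihood unigram model: p(w) = c(w, d) / |d|."""
    tokens = document.lower().split()      # naive whitespace tokenization (assumption)
    counts = Counter(tokens)               # term frequencies c(w, d)
    return {w: c / len(tokens) for w, c in counts.items()}

# Toy example
model = unigram_mle("bush kerry debate bush election")
print(model["bush"])   # 2 / 5 = 0.4
```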

5 Statistical Inference  C1: h, h, h, h, t, h → bias b1 = 5/6  C2: t, t, h, t, h, h → bias b2 = 1/2  C3: t, h, t, t, t, h → bias b3 = 1/3  Why does counting provide a good estimate of the coin bias?

6 Maximum Likelihood Estimation (MLE)  Observation o = {o1, o2, …, on}  Maximum likelihood estimation: b* = argmax_b Pr(o|b)  E.g.: o = {h, h, h, t, h, h} → Pr(o|b) = b^5(1 − b)
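
A short worked version of the maximum-likelihood step for this example (the calculus is standard; it is not written out in the transcript):

```latex
\log \Pr(o \mid b) = 5 \log b + \log(1 - b), \qquad
\frac{d}{db} \log \Pr(o \mid b) = \frac{5}{b} - \frac{1}{1 - b} = 0
\;\Longrightarrow\; b^{*} = \frac{5}{6}
```

which matches the counting estimate b1 = 5/6 from the previous slide.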

7 Unigram Language Model  Observation: d = {tf1, tf2, …, tfn}  Unigram language model θ = {p(w1), p(w2), …, p(wn)}  Maximum likelihood estimation: p(wi) = tfi / (tf1 + tf2 + … + tfn) = tfi / |d|

8 Maximum A Posteriori Estimation  Consider a special case: we only toss each coin twice  C1: h, t → b1 = 1/2  C2: h, h → b2 = 1  C3: t, t → b3 = 0 ?  The MLE estimate is poor when the number of observations is small. This is called the "sparse data" problem!

9 Solutions to the Sparse Data Problem  Shrinkage  Maximum a posteriori (MAP) estimation  Bayesian approach

10 Shrinkage: Jelinek-Mercer Smoothing  Linearly interpolate between the document language model (estimated from the individual document) and the collection language model (estimated from the corpus): p(w|d) = λ·p_ml(w|d) + (1 − λ)·p(w|c), where 0 < λ < 1 is a smoothing parameter
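
A minimal sketch of Jelinek-Mercer smoothing under these definitions (the function name, argument layout, and λ = 0.5 default are illustrative assumptions):

```python
def jelinek_mercer(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """p(w|d) = lam * p_ml(w|d) + (1 - lam) * p(w|c), with 0 < lam < 1."""
    p_doc = doc_counts.get(word, 0) / doc_len      # estimate from the individual document
    p_coll = coll_counts.get(word, 0) / coll_len   # estimate from the corpus
    return lam * p_doc + (1 - lam) * p_coll
```

Words unseen in the document still receive a nonzero probability through the collection term.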

11 Smoothing & TF-IDF Weighting  Are they totally unrelated?

12 Smoothing & TF-IDF Weighting  Expanding the smoothed query likelihood yields one factor similar to TF.IDF weighting and another factor that is irrelevant to (does not depend on) the documents

13 Maximum A Posteriori Estimation  Introduce a prior on b: most coins are more or less unbiased  A Dirichlet prior on b

14 Maximum A Posteriori Estimation  Observation o = {o1, o2, …, on}  Maximum a posteriori estimation: b* = argmax_b Pr(b|o) = argmax_b Pr(o|b)·Pr(b)

17 Maximum A Posteriori Estimation  Observation o = {o1, o2, …, on}  Maximum a posteriori estimation: the prior contributes pseudo counts (or pseudo experiments) to the estimate
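
A sketch of how pseudo counts arise, using a Dirichlet (Beta) prior on the coin bias with hyper-parameters α_h and α_t (this notation is my assumption; the slide's formula is not preserved): with n_h observed heads and n_t tails,

```latex
b^{*} = \arg\max_{b} \Pr(o \mid b)\,\Pr(b)
      = \frac{n_h + \alpha_h - 1}{n_h + n_t + \alpha_h + \alpha_t - 2}
```

so the hyper-parameters act like extra heads and tails added to the observed counts, which is exactly the pseudo-experiment reading above.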

18 Dirichlet Prior  Given a distribution p = (p1, p2, …, pn), a Dirichlet distribution for p is defined as Dir(p; α1, …, αn) ∝ p1^(α1−1) · p2^(α2−1) ··· pn^(αn−1)  The αi are called hyper-parameters

19 Dirichlet Prior  Full Dirichlet distribution: Dir(p; α1, …, αn) = [Γ(α1 + … + αn) / (Γ(α1)···Γ(αn))] · p1^(α1−1) ··· pn^(αn−1), where Γ(x) is the gamma function

20 Dirichlet Prior  The Dirichlet is a distribution over distributions  The prior knowledge about the distribution p is encoded in the hyper-parameters  The maximum point of the Dirichlet distribution is at pi = (αi − 1)/(α1 + α2 + … + αn − n), so pi ∝ αi − 1; choosing αi = c·pi + 1 puts the maximum at p  Example: prior knowledge that most coins are fair, b = 1 − b = 1/2, gives α1 = α2 = c/2 + 1

21 Unigram Language Model  Simple counting leads to zero probabilities for unseen words  Introduce Dirichlet priors to smooth the language model  How do we construct the Dirichlet prior?

22 Dirichlet Prior for Unigram LM  Prior for what distribution? θd = {p(w1|d), p(w2|d), …, p(wn|d)}  How do we determine the appropriate values for the hyper-parameters αi?

23 Determine Hyper-parameters  The most likely language model under the Dirichlet distribution has p(wi|θd) ∝ αi  What is the most likely p(wi|θd) without looking into the content of the document d?

24 Determine Hyper-parameters  The most likely p(wi|θd) without looking into the content of the document d is the unigram probability of the collection: θc = {p(w1|c), p(w2|c), …, p(wn|c)}  So what is the appropriate value for αi?
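
A sketch of the choice this suggests, combining the rule αi = c·pi + 1 from slide 20 with the collection model (writing s for the constant, to match the pseudo document length used on later slides; the exact value is not given in the transcript):

```latex
\alpha_i = s \, p(w_i \mid c) + 1
```

With this prior, the most likely language model before seeing the document's content is the collection model θc.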

27 Dirichlet Prior for Unigram LM  MAP estimation of the best unigram language model  Solution: see the sketch after slide 31 below

30 Dirichlet Prior for Unigram LM  MAP estimation of the best unigram language model  Solution: the prior contributes a pseudo term frequency for each word

31 Dirichlet Prior for Unigram LM  MAP estimation of the best unigram language model  Solution: the prior also contributes a pseudo document length
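
A sketch of the resulting MAP solution with this prior (the formula itself is lost from the transcript, so this is the standard Dirichlet-smoothed estimate implied by the pseudo-count annotations):

```latex
p(w_i \mid d) = \frac{tf(w_i, d) + s\, p(w_i \mid c)}{|d| + s}
```

Here s·p(wi|c) is the pseudo term frequency added for word wi, and s is the pseudo document length added to |d|.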

32 Dirichlet Smoothed Unigram LM  What does p(w|d) look like if s is small?  What does p(w|d) look like if s is large?

33 Dirichlet Smoothed Unigram LM  If s is small, p(w|d) stays close to the document's maximum likelihood estimate; if s is large, p(w|d) is pulled toward the collection model p(w|c)

34 Dirichlet Smoothed Unigram LM  No more zero probabilities: any word seen in the collection now gets a nonzero probability in every document

35 Dirichlet Smoothed Unigram LM  Step 1: compute the collection-based unigram language model by simple counting  Step 2: for each document dk, compute its smoothed unigram language model as in the MAP solution above

37 Dirichlet Smoothed Unigram LM  For a given query q = {tf1(q), tf2(q), …, tfn(q)}, compute the likelihood p(q|d) for each document d: the larger the likelihood, the more relevant the document is to the query
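
A minimal end-to-end sketch of the two steps plus query-likelihood ranking (the tokenizer, the default s = 2000, and the toy corpus are illustrative assumptions):

```python
import math
from collections import Counter

def dirichlet_rank(query: str, docs: list[str], s: float = 2000.0) -> list[tuple[int, float]]:
    """Rank documents by Dirichlet-smoothed query log-likelihood."""
    doc_tokens = [d.lower().split() for d in docs]

    # Step 1: collection-based unigram language model by simple counting
    coll = Counter(t for toks in doc_tokens for t in toks)
    coll_len = sum(coll.values())
    p_c = {w: c / coll_len for w, c in coll.items()}

    q_counts = Counter(query.lower().split())
    scores = []
    for k, toks in enumerate(doc_tokens):
        counts, dlen = Counter(toks), len(toks)
        log_lik = 0.0
        for w, qtf in q_counts.items():
            # Step 2: smoothed model p(w|d) = (tf(w,d) + s*p(w|c)) / (|d| + s)
            p_wd = (counts.get(w, 0) + s * p_c.get(w, 0.0)) / (dlen + s)
            # Guard for query words unseen anywhere in the collection
            log_lik += qtf * math.log(p_wd if p_wd > 0 else 1e-12)
        scores.append((k, log_lik))

    # The larger the likelihood, the more relevant the document
    return sorted(scores, key=lambda kv: kv[1], reverse=True)

print(dirichlet_rank("bush kerry", ["bush kerry debate", "stock market news today"]))
```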

38 Smoothing & TF-IDF Weighting  Are they totally unrelated?

39 Smoothing & TF-IDF Weighting

42 Document normalization

43 Smoothing & TF-IDF Weighting  The remaining term weight behaves like TF.IDF
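
A sketch of the decomposition slides 38–43 appear to walk through (the algebra is not preserved in the transcript; this is the standard analysis of the Dirichlet-smoothed query log-likelihood): up to a document-independent constant,

```latex
\log p(q \mid d) \;=\;
  \sum_{w \in q \cap d} tf(w, q)\,
    \log\!\left(1 + \frac{tf(w, d)}{s\, p(w \mid c)}\right)
  \;+\; |q| \log \frac{s}{|d| + s}
  \;+\; \text{const}
```

The first sum rewards term frequency in the document and down-weights common words through p(w|c), an IDF-like effect; the last term depends only on the document length, i.e. document normalization.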

44 Shrinkage vs. Dirichlet Smoothing  Both JM smoothing and Dirichlet smoothing linearly interpolate between the document language model and the collection language model  The linear weight λ is a constant for JM smoothing; it is document dependent for Dirichlet smoothing
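
A short check of that claim (standard algebra, not spelled out in the transcript): the Dirichlet-smoothed estimate can be rewritten as an interpolation whose weight depends on the document length,

```latex
\frac{tf(w, d) + s\, p(w \mid c)}{|d| + s}
  = \lambda_d\, p_{\mathrm{ml}}(w \mid d) + (1 - \lambda_d)\, p(w \mid c),
\qquad \lambda_d = \frac{|d|}{|d| + s}
```

so long documents trust their own counts more, while short documents lean on the collection model; JM smoothing instead uses one fixed λ for every document.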

45 Current Probabilistic Framework for Information Retrieval  Documents d1 … d1000, query q: 'bush Kerry'. Estimate some statistics θ for each document; estimate the likelihood p(q|θ)

46 Current Probabilistic Framework for Information Retrieval  Documents d1 … d1000, query q: 'bush Kerry'. Estimated models θ1, θ2, …, θ1000, one per document; estimate the likelihood p(q|θ)

47 Current Probabilistic Framework for Information Retrieval  Query q: 'bush Kerry' is scored against the point estimates θ1, θ2, …, θ1000 by estimating the likelihood p(q|θ)

48 Bayesian Approach  Documents d1 … d1000, query q: 'bush Kerry'. Estimating some statistics θ for each document and the likelihood p(q|θ) as before, but we need to consider the uncertainty in model inference

49 Bayesian Approach  Document d, candidate models θ1, θ2, …, θn; each model explains the document with probability p(d|θi) and would generate the query q with probability p(q|θi)

51 Bayesian Approach  Document d, candidate models θ1, θ2, …, θn, with p(d|θi) and p(q|θi) as above  Assume that p(d) and p(θi) follow uniform distributions
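
A sketch of where the uniform-distribution assumption leads (the combination step is implied rather than written out in the transcript): averaging the query likelihood over the candidate models,

```latex
p(q \mid d) = \sum_i p(q \mid \theta_i)\, p(\theta_i \mid d)
            = \sum_i p(q \mid \theta_i)\, \frac{p(d \mid \theta_i)\, p(\theta_i)}{p(d)}
            \;\propto\; \sum_i p(q \mid \theta_i)\, p(d \mid \theta_i)
```

where the last step uses the assumption that p(d) and p(θi) are uniform, so ranking only requires the two likelihoods p(d|θi) and p(q|θi).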

