
1 Language Models
Hongning Wang

2 Notion of Relevance
Relevance can be modeled in three broad ways:
– Similarity: relevance ≈ similarity(Rep(q), Rep(d))
  – Different representations & similarity measures: vector space model (Salton et al., 75), prob. distribution model (Wong & Yao, 89), ...
– Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  – Regression model (Fox, 83)
  – Generative models:
    – Document generation: classical probabilistic model (Robertson & Sparck Jones, 76)
    – Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– Probabilistic inference: P(d → q) or P(q → d)
  – Prob. concept space model (Wong & Yao, 95)
  – Different inference systems: inference network model (Turtle & Croft, 91)

3 What is a statistical LM?
A model specifying a probability distribution over word sequences:
– p("Today is Wednesday") should be relatively high
– p("Today Wednesday is") should be much lower (ungrammatical word order)
– p("The eigenvalue is positive") should be low in everyday text (domain-specific vocabulary)
It can be regarded as a probabilistic mechanism for "generating" text, and is therefore also called a "generative" model.

4 Why is an LM useful?
Provides a principled way to quantify the uncertainties associated with natural language.
Allows us to answer questions like:
– Given that we see "John" and "feels", how likely is "happy" (as opposed to "habit") to be the next word? (speech recognition)
– Given that we observe "baseball" three times and "game" once in a news article, how likely is the article about sports? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely is the user to use "baseball" in a query? (information retrieval)

5 Source-Channel framework [Shannon 48]
Source --X--> Transmitter (encoder) --> Noisy Channel p(Y|X) --Y--> Receiver (decoder) --X'--> Destination
Goal: recover the source from the observation, X' = argmax_X p(X|Y) = argmax_X p(Y|X) p(X) (Bayes' rule).
When X is text, p(X) is a language model.
Many examples:
– Speech recognition: X = word sequence, Y = speech signal
– Machine translation: X = English sentence, Y = Chinese sentence
– OCR error correction: X = correct word, Y = erroneous word
– Information retrieval: X = document, Y = query
– Summarization: X = summary, Y = document
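To make the decoding step concrete, here is a minimal noisy-channel sketch in Python. The candidate texts, prior values, and channel function are made-up illustrations, not from the slides:

# Noisy-channel decoding: X* = argmax_X p(X|Y) = argmax_X p(Y|X) * p(X).
candidates = {          # hypothetical source texts X with prior p(X)
    "the cat": 0.6,
    "the hat": 0.4,
}

def channel(y, x):
    # Hypothetical channel model p(Y|X): chance that X was corrupted into Y.
    return 0.9 if y == x else 0.2

y = "the hat"           # observed (possibly corrupted) signal Y
best = max(candidates, key=lambda x: channel(y, x) * candidates[x])
print(best)             # decoded source text X*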

6 Unigram language model
Generate each word independently of the others:
p(w_1 w_2 ... w_n) = ∏_i p(w_i)
The model is fully specified by the word probabilities {p(w)}, which sum to one over the vocabulary.
The simplest and most popular choice!
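A minimal sketch of a unigram LM in Python, using a toy word distribution of my own (not from the slides). Note that word order does not matter under this model, which is exactly the limitation discussed on the next slides:

import math

unigram = {"today": 0.002, "is": 0.05, "wednesday": 0.001}  # toy p(w)

def seq_log_prob(words, lm):
    # log p(w1 ... wn) = sum_i log p(wi); logs avoid numerical underflow
    return sum(math.log(lm[w]) for w in words)

# A unigram model assigns the same probability to any reordering:
print(seq_log_prob(["today", "is", "wednesday"], unigram) ==
      seq_log_prob(["today", "wednesday", "is"], unigram))  # True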

7 More sophisticated LMs
N-gram language models condition each word on its predecessors, e.g., a bigram model uses p(w_1 w_2 ... w_n) = p(w_1) ∏_{i>1} p(w_i | w_{i-1}); other models capture longer-range dependencies or syntactic structure.

8 Why just unigram models?
Difficulty in moving toward more complex models:
– They involve more parameters, so more data is needed to estimate them
– They significantly increase computational complexity, in both time and space
Capturing word order or structure may not add much value for "topical inference".
But using more sophisticated models can still be expected to improve performance.

9 Generative view of text documents
Each topic is a (unigram) language model p(w|θ), and documents are generated by sampling words from it:
– Topic 1 (text mining), p(w|θ_1): text 0.2, mining 0.1, clustering 0.02, association 0.01, ..., food (near zero), ...
– Topic 2 (health), p(w|θ_2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, text 0.01, ...
Sampling from θ_1 yields a text mining document; sampling from θ_2 yields a food nutrition document.

10 Estimation of language models
Maximum likelihood estimation: p(w|θ) = c(w,d) / |d|.
Example: a "text mining" paper with 100 words in total:
– counts: text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
– estimates: p(text) = 10/100, p(mining) = 5/100, p(database) = 3/100, ..., p(query) = 1/100, ...
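A minimal sketch of the ML estimate in Python; the toy document below is a stand-in for the slide's 100-word paper, so the exact fractions differ:

from collections import Counter

def mle_unigram(doc_words):
    # Maximum likelihood estimate: p(w|theta) = c(w, d) / |d|
    counts = Counter(doc_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"]
theta = mle_unigram(doc)
print(theta["text"], theta["query"])  # 10/19 and 1/19 for this toy doc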

11 Language models for IR [Ponte & Croft SIGIR'98]
Estimate a language model for each document:
– Text mining paper → θ_1: text ?, mining ?, association ?, clustering ?, ..., food ?, ...
– Food nutrition paper → θ_2: food ?, nutrition ?, healthy ?, diet ?, ...
Given the query "data mining algorithms", which model would most likely have generated it?

12 Ranking docs by query likelihood
Estimate a language model θ_d for each document d_1, ..., d_N, then rank the documents by their query likelihoods p(q|θ_{d_1}), p(q|θ_{d_2}), ..., p(q|θ_{d_N}).
Justification: the probability ranking principle (PRP).
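A minimal query-likelihood ranker in Python over two toy documents of my own. It uses the unsmoothed ML estimate, so a query word absent from a document yields -inf, which is precisely the problem slide 15 raises:

import math
from collections import Counter

def query_log_likelihood(query_words, doc_words):
    # log p(q|theta_d) = sum_w c(w,q) log p(w|theta_d), ML-estimated
    counts, dlen = Counter(doc_words), len(doc_words)
    return sum(math.log(counts[w] / dlen) if counts[w] > 0 else float("-inf")
               for w in query_words)

docs = {"d1": "text mining algorithms for text data".split(),
        "d2": "food nutrition and healthy diet".split()}
query = "data mining algorithms".split()
ranked = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d]),
                reverse=True)
print(ranked)  # ['d1', 'd2']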

13 Justification from PRP
By Bayes' rule, p(d|q) ∝ p(q|d) p(d), where p(q|d) is the query likelihood (query generation) and p(d) is the document prior.
Assuming a uniform document prior, ranking by p(d|q) is equivalent to ranking by the query likelihood p(q|θ_d).

14 Retrieval as language model estimation
Under a unigram document language model θ_d, the query likelihood is
log p(q|θ_d) = Σ_i log p(w_i|θ_d) = Σ_{w∈V} c(w,q) log p(w|θ_d)
so retrieval reduces to estimating the document language model p(w|θ_d).

15 Problem with MLE
What probability should we give a word that has not been observed in the document? The ML estimate assigns zero, and log 0 = −∞, so a single unseen query word zeroes out the entire query likelihood.
If we want to assign non-zero probabilities to such words, we will have to discount the probabilities of the observed words.
This is so-called "smoothing".

16 General idea of smoothing
All smoothing methods try to:
1. Discount the probability of words seen in a document
2. Re-allocate the extra probability mass so that unseen words get a non-zero probability

17 Illustration of language model smoothing
[Figure: p(w|d) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM: probability is discounted from the seen words and reallocated to assign nonzero probabilities to the unseen words.]

18 Smoothing methods
Method 1: Additive smoothing. Add a constant δ to the count of every word:
p(w|d) = (c(w,d) + δ) / (|d| + δ|V|)
where c(w,d) is the count of w in d, |d| is the document length (total counts), and |V| is the vocabulary size; δ = 1 gives "add-one" (Laplace) smoothing.
Problems? Hint: are all words equally important?
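A short sketch of additive smoothing in Python, with a toy vocabulary and document of my own:

from collections import Counter

def additive_smoothing(doc_words, vocab, delta=1.0):
    # p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|); delta=1 is Laplace
    counts, dlen, V = Counter(doc_words), len(doc_words), len(vocab)
    return {w: (counts[w] + delta) / (dlen + delta * V) for w in vocab}

vocab = {"text", "mining", "food", "diet"}
p = additive_smoothing("text mining text".split(), vocab)
print(p["food"])  # unseen word now gets (0+1)/(3+4) instead of 0

Note that every unseen word receives the same probability here, which motivates the refinement on the next slide.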

19 Refine the idea of smoothing
Should all unseen words get equal probabilities? We can use a reference language model p(w|REF) to discriminate among the unseen words:
– p(w|d) = p_seen(w|d) if w is seen in d (a discounted ML estimate)
– p(w|d) = α_d p(w|REF) otherwise
where α_d controls how much probability mass is re-allocated to the unseen words.

20 Smoothing methods (cont.)
Method 2: Absolute discounting. Subtract a constant δ from the count of each seen word:
p(w|d) = (max(c(w,d) − δ, 0) + δ|d|_u · p(w|REF)) / |d|
where |d|_u is the number of unique words in d.
Problems? Hint: varied document length?
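A minimal sketch of absolute discounting in Python; p_ref stands for any reference model p(w|REF), passed in as a function:

from collections import Counter

def absolute_discounting(doc_words, p_ref, delta=0.5):
    # p(w|d) = max(c(w,d)-delta, 0)/|d| + (delta * |d|_u / |d|) * p(w|REF)
    counts, dlen = Counter(doc_words), len(doc_words)
    uniq = len(counts)  # |d|_u: number of unique words in d
    return lambda w: (max(counts[w] - delta, 0.0) / dlen
                      + delta * uniq / dlen * p_ref(w))

p = absolute_discounting("text mining text".split(), lambda w: 0.001)
print(p("text"), p("food"))  # seen word discounted, unseen word nonzero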

21 Smoothing methods (cont.)
Method 3: Linear interpolation (Jelinek-Mercer). "Shrink" the ML estimate uniformly toward p(w|REF):
p(w|d) = (1 − λ) p_ML(w|d) + λ p(w|REF), with parameter λ ∈ [0,1].
Problems? Hint: what is missing?
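A minimal Jelinek-Mercer sketch in Python, again treating the reference model as a passed-in function. Note that the interpolation weight λ is the same for every document regardless of its length, which is one reading of the slide's hint:

from collections import Counter

def jelinek_mercer(doc_words, p_ref, lam=0.1):
    # p(w|d) = (1 - lambda) * p_ML(w|d) + lambda * p(w|REF)
    counts, dlen = Counter(doc_words), len(doc_words)
    return lambda w: (1 - lam) * counts[w] / dlen + lam * p_ref(w)

p = jelinek_mercer("text mining text".split(), lambda w: 0.001)
print(p("mining"), p("food"))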

22 Smoothing methods (cont.)
Method 4: Dirichlet prior (Bayesian). Assume μ pseudo counts distributed as p(w|REF):
p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ), with parameter μ > 0.
Problems?
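A minimal Dirichlet prior smoothing sketch in Python. Rewriting the formula shows it equals Jelinek-Mercer with a document-dependent weight λ = μ / (|d| + μ), so longer documents are smoothed less:

from collections import Counter

def dirichlet_prior(doc_words, p_ref, mu=2000.0):
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    counts, dlen = Counter(doc_words), len(doc_words)
    return lambda w: (counts[w] + mu * p_ref(w)) / (dlen + mu)

p = dirichlet_prior("text mining text".split(), lambda w: 0.001)
print(p("text"), p("food"))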

23 Dirichlet prior smoothing
Treat the unigram LM as a multinomial parameter θ and place a Dirichlet prior over it, Dir(θ | μ_1, ..., μ_N), where the hyper-parameters act as "extra"/"pseudo" word counts; we set μ_i = μ p(w_i|REF).
The posterior over models is then proportional to the likelihood of the document given the model times this prior.

24 Some background knowledge
Conjugate prior: the posterior distribution is in the same family as the prior.
Dirichlet distribution: continuous; a sample from it gives the parameters of a multinomial distribution.
Conjugate pairs (prior → likelihood): Gaussian → Gaussian, Beta → Binomial, Dirichlet → Multinomial.

25 Dirichlet prior smoothing (cont.)
Posterior distribution of the parameters: Dir(θ | c(w_1,d) + μp(w_1|REF), ..., c(w_N,d) + μp(w_N|REF)).
The predictive distribution is the same as the posterior mean:
p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
which is exactly Dirichlet prior smoothing.

26 Estimating μ using leave-one-out [Zhai & Lafferty 02]
Hold out each word occurrence w_i in turn and predict it from the rest of the document, p(w_i | d − w_i), using the Dirichlet-smoothed estimate. Choose μ to maximize the resulting leave-one-out log-likelihood over the collection.
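A sketch of the leave-one-out objective in Python, assuming the Dirichlet-smoothed estimator from slide 22; the grid search is my simplification (Zhai & Lafferty 02 optimize μ with Newton's method):

import math
from collections import Counter

def loo_log_likelihood(docs, p_ref, mu):
    # Each occurrence of w is held out and predicted from the rest of its
    # doc: count c(w,d)-1 and length |d|-1, with Dirichlet smoothing.
    ll = 0.0
    for doc in docs:
        counts, dlen = Counter(doc), len(doc)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_ref(w)) / (dlen - 1 + mu))
    return ll

def estimate_mu(docs, p_ref, grid=(1, 10, 100, 1000, 5000)):
    # Pick the mu on the grid that maximizes the leave-one-out likelihood
    return max(grid, key=lambda mu: loo_log_likelihood(docs, p_ref, mu))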

27 Why would "leave-one-out" work?
Compare two 20-word documents:
– by author1 (small vocabulary, words repeat): abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e
– by author2 (larger vocabulary, many words occur once): abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s
Now suppose we leave "e" out. In author2's document many held-out words become unseen, so μ must be big (more smoothing); in author1's document a held-out word usually still occurs elsewhere, so μ doesn't have to be big.
The amount of smoothing is closely related to the underlying vocabulary size.

28 Understanding smoothing
Query = "the algorithms for data mining". Under the ML estimates p_ML(w|d1) and p_ML(w|d2):
– p("algorithms"|d1) = p("algorithms"|d2)
– p("data"|d1) < p("data"|d2)
– p("mining"|d1) < p("mining"|d2)
Intuitively, d2 should score higher on these content words, yet p(q|d1) > p(q|d2) because d1 happens to assign higher probability to the function words "the" and "for".
So we should make p("the") and p("for") less different across documents, and smoothing with p(w|REF) helps achieve exactly this goal.

29 Understanding smoothing
Plugging the general smoothing scheme (p(w|d) = p_seen(w|d) for words seen in d, α_d p(w|C) otherwise) into the query likelihood retrieval formula, we obtain
log p(q|d) = Σ_{w∈q∩d} c(w,q) log [ p_seen(w|d) / (α_d p(w|C)) ] + |q| log α_d + Σ_{w∈q} c(w,q) log p(w|C)
– the first term combines TF weighting (via p_seen(w|d)) with IDF weighting (via 1/p(w|C))
– |q| log α_d acts as doc length normalization (a longer doc is expected to have a smaller α_d)
– the last term does not depend on the document and can be ignored for ranking
Smoothing with p(w|C) ≈ TF-IDF weighting + doc-length normalization.
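A small Python check of this decomposition for the Dirichlet case, where p_seen(w|d) = (c(w,d) + μp(w|C))/(|d| + μ) and α_d = μ/(|d| + μ); the data is a toy stand-in, but the two scores agree up to floating point:

import math
from collections import Counter

def scores(query, doc, p_c, mu=2000.0):
    cq, cd, dlen = Counter(query), Counter(doc), len(doc)
    alpha = mu / (dlen + mu)                     # doc-length normalizer

    def p_w(w):                                  # smoothed p(w|d)
        return (cd[w] + mu * p_c(w)) / (dlen + mu)

    direct = sum(n * math.log(p_w(w)) for w, n in cq.items())
    decomposed = (sum(n * math.log(p_w(w) / (alpha * p_c(w)))
                      for w, n in cq.items() if cd[w] > 0)  # TF-IDF part
                  + len(query) * math.log(alpha)            # length norm
                  + sum(n * math.log(p_c(w)) for w, n in cq.items()))
    return direct, decomposed

print(scores("data mining".split(), "text mining text data".split(),
             lambda w: 0.001))  # both values are equal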

30 Smoothing & TF-IDF weighting
Retrieval formula using the general smoothing scheme, with the smoothed/discounted ML estimate for seen words and the reference language model for unseen words.
The key rewriting step is splitting the sum over query words into seen and unseen words, then pulling the unseen-word part out as the α_d term plus a document-independent constant (where did we see this before?).
Similar rewritings are very common when using probabilistic models for IR.

31 What you should know
– How to estimate a language model
– The general idea of smoothing and the different smoothing methods
– The effect of smoothing

