1 2010 © University of Michigan Probabilistic Models in Information Retrieval SI650: Information Retrieval, Winter 2010, School of Information, University of Michigan. Most slides are from Prof. ChengXiang Zhai's lectures.

2 2010 © University of Michigan Essential Probability & Statistics

3 2010 © University of Michigan Prob/Statistics & Text Management Probability & statistics provide a principled way to quantify the uncertainties associated with natural language. Allow us to answer questions like: –Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) –Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

4 2010 © University of Michigan Basic Concepts in Probability Random experiment: an experiment with uncertain outcome (e.g., flipping a coin, picking a word from text) Sample space: all possible outcomes, e.g., –Tossing 2 fair coins, S = {HH, HT, TH, TT} Event: a subset of the sample space –E ⊆ S, E happens iff the outcome is in E, e.g., –E={HH} (all heads) –E={HH,TT} (same face) –Impossible event ({}), certain event (S) Probability of an event: 0 ≤ P(E) ≤ 1, s.t. –P(S)=1 (outcome always in S) –P(A ∪ B)=P(A)+P(B), if A ∩ B = ∅ (e.g., A=same face, B=different face)
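To make these definitions concrete, here is a minimal Python sketch (not from the slides): it enumerates the sample space for two fair coins and computes event probabilities by summing outcome probabilities.

```python
from itertools import product

# Sample space for tossing 2 fair coins; each outcome is equally likely.
sample_space = [''.join(outcome) for outcome in product('HT', repeat=2)]  # ['HH', 'HT', 'TH', 'TT']
p_outcome = {s: 1 / len(sample_space) for s in sample_space}

def prob(event):
    """P(E) = sum of the probabilities of the outcomes in E (E is a subset of the sample space)."""
    return sum(p_outcome[s] for s in event)

all_heads = {'HH'}
same_face = {'HH', 'TT'}
diff_face = {'HT', 'TH'}

print(prob(all_heads))                  # 0.25
print(prob(same_face))                  # 0.5
print(prob(set(sample_space)))          # P(S) = 1.0
# Additivity for disjoint events: P(same face ∪ different face) = P(same) + P(different)
print(prob(same_face | diff_face) == prob(same_face) + prob(diff_face))  # True
```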

5 2010 © University of Michigan Example: Flip a Coin Sample space: {Head, Tail} Fair coin: –p(H) = 0.5, p(T) = 0.5 Unfair coin, e.g.: –p(H) = 0.3, p(T) = 0.7 Flipping two fair coins: –Sample space: {HH, HT, TH, TT} Example in modeling text: –Flip a coin to decide whether or not to include a word in a document –Sample space = {appearance, absence}

6 2010 © University of Michigan Example: Toss a Die Sample space: S = {1,2,3,4,5,6} Fair die: –p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6 Unfair die: p(1) = 0.3, p(2) = 0.2,... N-sided die: –S = {1, 2, 3, 4, …, N} Example in modeling text: –Toss a die to decide which word to write in the next position –Sample space = {cat, dog, tiger, …}

7 2010 © University of Michigan 7 Basic Concepts of Prob. (cont.) Joint probability: P(A ∩ B), also written as P(A, B) Conditional probability: P(B|A) = P(A ∩ B)/P(A) –P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B) –So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule) –For independent events, P(A ∩ B) = P(A)P(B), so P(A|B)=P(A) Total probability: If A1, …, An form a partition of S, then –P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?) –So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1)+…+P(B|An)P(An)] –This allows us to compute P(Ai|B) based on P(B|Ai)
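A small worked example of the total-probability and Bayes' rule computations; the classes and numbers below are made up for illustration, not taken from the lecture.

```python
# Hypothetical example: two article classes partition the collection.
# A1 = "sports", A2 = "politics"; B = the event that the word "baseball" appears.
p_A = {'sports': 0.3, 'politics': 0.7}           # priors P(Ai) (assumed numbers)
p_B_given_A = {'sports': 0.2, 'politics': 0.01}  # likelihoods P(B|Ai) (assumed numbers)

# Total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)

# Bayes' rule: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = {a: p_B_given_A[a] * p_A[a] / p_B for a in p_A}

print(p_B)        # ~0.067
print(posterior)  # {'sports': ~0.896, 'politics': ~0.104}
```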

8 2010 © University of Michigan 8 Revisit: Bayes’ Rule Hypothesis space: H = {H1, …, Hn}; Evidence: E. Bayes’ rule: P(Hi|E) = P(E|Hi)P(Hi)/P(E), where P(Hi|E) is the posterior probability of Hi, P(Hi) is the prior probability of Hi, and P(E|Hi) is the likelihood of the data/evidence if Hi is true. If we want to pick the most likely hypothesis H*, we can drop P(E). In text classification: H: class space; E: data (features)

9 2010 © University of Michigan 9 Random Variable X: S → ℝ (“measure” of outcome) –E.g., number of heads, all same face?, … Events can be defined according to X –E(X=a) = {si | X(si)=a} –E(X≥a) = {si | X(si)≥a} So, probabilities can be defined on X –P(X=a) = P(E(X=a)) –P(X≥a) = P(E(X≥a)) Discrete vs. continuous random variable (think of “partitioning the sample space”)

10 2010 © University of Michigan Getting to Statistics... We are flipping an unfair coin, but P(Head)=? (parameter estimation) –If we see the results of a huge number of random experiments, then we can estimate P(Head) by the relative frequency count(Heads)/count(flips) –But, what if we only see a small sample (e.g., 2)? Is this estimate still reliable? We flip twice and get two tails; does it mean P(Head) = 0? In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
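A quick simulation sketch (assuming a coin with true P(Head)=0.3, a number chosen here for illustration) showing that the relative-frequency estimate is reliable for a huge sample but not for a sample of size 2:

```python
import random

def ml_estimate_p_head(flips):
    """Maximum likelihood estimate: P(Head) ≈ #Heads / #flips."""
    return flips.count('H') / len(flips)

random.seed(0)
true_p_head = 0.3  # assumed "unfair coin" parameter for the simulation

# Huge number of flips: the relative frequency converges to the true parameter.
big_sample = random.choices('HT', weights=[true_p_head, 1 - true_p_head], k=100_000)
print(ml_estimate_p_head(big_sample))   # close to 0.3

# Tiny sample: two tails gives an estimate of 0, which is clearly unreliable.
print(ml_estimate_p_head(['T', 'T']))   # 0.0
```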

11 2010 © University of Michigan 11 Parameter Estimation General setting: –Given a (hypothesized & probabilistic) model that governs the random experiment –The model gives a probability of any data p(D|θ) that depends on the parameter θ –Now, given actual sample data X={x1,…,xn}, what can we say about the value of θ? Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data” Generally an optimization problem

12 2010 © University of Michigan 12 Maximum Likelihood vs. Bayesian Maximum likelihood estimation –“Best” means “data likelihood reaches maximum” –Problem: small sample Bayesian estimation –“Best” means being consistent with our “prior” knowledge and explaining data well –Problem: how to define prior?

13 2010 © University of Michigan 13 Illustration of Bayesian Estimation Prior: p(θ); Likelihood: p(X|θ), X=(x1,…,xN); Posterior: p(θ|X) ∝ p(X|θ)p(θ). [Figure: prior, likelihood, and posterior curves over θ, with θ0 the prior mode, θml the ML estimate, and the posterior mode lying between them]

14 2010 © University of Michigan 14 Maximum Likelihood Estimate Data: a document d with counts c(w1), …, c(wN), and length |d| Model: multinomial distribution M with parameters {p(wi)} Likelihood: p(d|M) Maximum likelihood estimator: M = argmax_M p(d|M), which gives p(wi) = c(wi)/|d| Problem: a document is short; if c(wi) = 0, does this mean p(wi) = 0?

15 2010 © University of Michigan 15 If you want to know how we get it… Data: a document d with counts c(w1), …, c(wN), and length |d| Model: multinomial distribution M with parameters {p(wi)} Log-likelihood: l(d|M) = Σi c(wi) log p(wi) Maximum likelihood estimator: M = argmax_M p(d|M) We’ll tune p(wi) to maximize l(d|M) subject to Σi p(wi) = 1 Use the Lagrange multiplier approach: L = Σi c(wi) log p(wi) + λ(Σi p(wi) − 1) Set partial derivatives to zero: ∂L/∂p(wi) = c(wi)/p(wi) + λ = 0, so p(wi) ∝ c(wi); normalizing gives the ML estimate p(wi) = c(wi)/Σj c(wj) = c(wi)/|d|

16 2010 © University of Michigan 16 Statistical Language Models

17 2010 © University of Michigan 17 What is a Statistical Language Model? A probability distribution over word sequences –p(“Today is Wednesday”) ≈ 0.001 –p(“Today Wednesday is”) ≈ 0.0000000000001 –p(“The eigenvalue is positive”) ≈ 0.00001 Context-dependent! Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

18 2010 © University of Michigan 18 Why is a Language Model Useful? Provides a principled way to quantify the uncertainties associated with natural language Allows us to answer questions like: –Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition) –Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) –Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

19 2010 © University of Michigan 19 Source-Channel Framework (Communication System) [Diagram: Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination. The source emits X with probability P(X); the channel turns X into Y with probability P(Y|X); the receiver recovers X’ by asking P(X|Y)=?] When X is text, p(X) is a language model

20 2010 © University of Michigan 20 Speech Recognition [Diagram: Speaker → Noisy Channel → Recognizer. The speaker intends words W with P(W) (language model); the channel produces the acoustic signal A with P(A|W) (acoustic model); the recognizer outputs the recognized words by asking P(W|A)=?] Given acoustic signal A, find the word sequence W

21 2010 © University of Michigan 21 Machine Translation [Diagram: English Speaker → Noisy Channel → Translator. The speaker produces English words E with P(E) (English language model); the channel produces Chinese words C with P(C|E) (English→Chinese translation model); the translator recovers the English translation by asking P(E|C)=?] Given Chinese sentence C, find its English translation E

22 2010 © University of Michigan 22 Spelling/OCR Error Correction [Diagram: Original Text → Noisy Channel → Corrector. The original words O occur with P(O) (“normal” language model); the channel produces “erroneous” words E with P(E|O) (spelling/OCR error model); the corrector recovers the corrected text by asking P(O|E)=?] Given corrupted text E, find the original text O

23 2010 © University of Michigan 23 Basic Issues Define the probabilistic model –Event, Random Variables, Joint/Conditional Prob’s –P(w1 w2 ... wn) = f(θ1, θ2, …, θm) Estimate model parameters –Tune the model to best fit the data and our prior knowledge –θi = ? Apply the model to a particular task –Many applications

24 2010 © University of Michigan 24 The Simplest Language Model (Unigram Model) Generate a piece of text by generating each word INDEPENDENTLY Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn) Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size) Essentially a multinomial distribution over words A piece of text can be regarded as a sample drawn according to this word distribution

25 2010 © University of Michigan Modeling Text As Tossing a Die Sample space: S = {w1, w2, w3, … wN} Die is unfair: –P(cat) = 0.3, P(dog) = 0.2, ... –Toss a die to decide which word to write in the next position E.g., sample space = {cat, dog, tiger}, P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2 Text = “cat cat dog” P(Text) = P(cat) * P(cat) * P(dog) = 0.3 * 0.3 * 0.5 = 0.045
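The slide's arithmetic can be checked with a few lines of Python; this is a sketch using the slide's three-word vocabulary.

```python
from functools import reduce

# Word distribution from the slide's example.
p = {'cat': 0.3, 'dog': 0.5, 'tiger': 0.2}

def unigram_prob(text, p):
    """Probability of a text under a unigram model: the product of word probabilities."""
    return reduce(lambda acc, w: acc * p[w], text.split(), 1.0)

print(unigram_prob('cat cat dog', p))  # 0.3 * 0.3 * 0.5 = 0.045 (up to floating-point rounding)
```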

26 2010 © University of Michigan 26 Text Generation with Unigram LM [Diagram: a (unigram) language model θ, p(w|θ), is sampled to generate a document. Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … generates a text mining paper. Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … generates a food nutrition paper.]

27 2010 © University of Michigan 27 Estimation of Unigram LM [Diagram: estimating a (unigram) language model θ, p(w|θ)=?, from a document, a “text mining paper” with total #words=100 and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, … The estimates are p(text)=10/100, p(mining)=5/100, p(association)=p(database)=3/100, …, p(query)=1/100, …]
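A minimal sketch of this maximum likelihood estimation, p(w) = c(w,d)/|d|, on a toy stand-in document (the slide's actual 100-word paper is not reproduced here):

```python
from collections import Counter

def ml_unigram(tokens):
    """ML estimate of a unigram LM: p(w) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Toy stand-in for the slide's "text mining paper".
doc = 'text mining text database algorithm text mining query'.split()
theta = ml_unigram(doc)
print(theta['text'])                 # 3/8
print(theta['mining'])               # 2/8
print(theta.get('retrieval', 0.0))   # 0.0 -- an unseen word, the problem smoothing addresses
```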

28 2010 © University of Michigan 28 Empirical distribution of words There are stable language-independent patterns in how people use natural languages A few words occur very frequently; most occur rarely. E.g., in news articles, –Top 4 words: 10~15% of word occurrences –Top 50 words: 35~40% of word occurrences The most frequent word in one corpus may be rare in another

29 2010 © University of Michigan 29 Revisit: Zipf’s Law rank * frequency ≈ constant [Figure: word frequency vs. word rank (by frequency). The highest-frequency words are stop words (the biggest data structure); the mid-frequency words are the most useful (Luhn 57); for the long tail, is “too rare” a problem?] Generalized Zipf’s law: applicable in many domains
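A quick way to eyeball Zipf's law on your own corpus; this is a sketch only, `corpus.txt` is a hypothetical file, and the pattern only emerges clearly on reasonably large corpora.

```python
from collections import Counter

def rank_times_freq(tokens, top_k=10):
    """Return (rank, frequency, rank*frequency) for the top_k most frequent words."""
    counts = Counter(tokens)
    ranked = counts.most_common(top_k)
    return [(rank, freq, rank * freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Substitute any tokenized corpus here.
tokens = open('corpus.txt').read().lower().split()  # hypothetical file name
for rank, freq, product in rank_times_freq(tokens):
    print(rank, freq, product)  # rank * frequency should stay roughly constant
```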

30 2010 © University of Michigan 30 Problem with the ML Estimator What if a word doesn’t appear in the text? In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words This is what “smoothing” is about …

31 2010 © University of Michigan 31 Language Model Smoothing (Illustration) [Figure: P(w) vs. w. The maximum likelihood estimate assigns zero probability to unseen words; the smoothed LM lowers the probabilities of seen words and gives every unseen word a small non-zero probability]

32 2010 © University of Michigan 32 How to Smooth? All smoothing methods try to –discount the probability of words seen in a document –re-allocate the extra counts so that unseen words will have a non-zero count Method 1 (Additive smoothing): Add a constant to the counts of each word, e.g., “add one” (Laplace) smoothing: p(w|d) = (c(w,d) + 1)/(|d| + |V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size Problems?
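A sketch of add-one (Laplace) smoothing over a fixed vocabulary; the vocabulary and document below are made up for illustration.

```python
from collections import Counter

def add_one_smoothing(tokens, vocabulary):
    """Laplace ("add one") smoothing: p(w|d) = (c(w, d) + 1) / (|d| + |V|)."""
    counts = Counter(tokens)
    d_len = len(tokens)
    V = len(vocabulary)
    return {w: (counts[w] + 1) / (d_len + V) for w in vocabulary}

vocab = {'text', 'mining', 'database', 'query', 'retrieval'}
doc = 'text mining text database'.split()
p = add_one_smoothing(doc, vocab)
print(p['text'])       # (2 + 1) / (4 + 5) = 1/3
print(p['retrieval'])  # (0 + 1) / (4 + 5) ≈ 0.111 -- unseen words now get non-zero probability
print(abs(sum(p.values()) - 1.0) < 1e-9)  # True: still a proper distribution
```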

33 2010 © University of Michigan 33 How to Smooth? (cont.) Should all unseen words get equal probabilities? We can use a reference model to discriminate unseen words: p(w|d) = p_seen(w|d) if w is seen in d (the discounted ML estimate), and α_d p(w|REF) otherwise, where p(w|REF) is the reference language model and α_d is set so the probabilities sum to one

34 2010 © University of Michigan 34 Other Smoothing Methods Method 2 (Absolute discounting): Subtract a constant δ from the counts of each word: p_seen(w|d) = max(c(w,d) − δ, 0)/|d|, with α_d = δ|d|_u/|d|, where |d|_u is the # of unique words in d Method 3 (Linear interpolation, Jelinek-Mercer): “Shrink” uniformly toward p(w|REF): p(w|d) = (1−λ) c(w,d)/|d| + λ p(w|REF), where c(w,d)/|d| is the ML estimate and λ is a parameter, with α_d = λ

35 2010 © University of Michigan 35 Other Smoothing Methods (cont.) Method 4 (Dirichlet Prior/Bayesian): Assume μ pseudo counts distributed as p(w|REF): p(w|d) = (c(w,d) + μ p(w|REF))/(|d| + μ), where μ is a parameter and α_d = μ/(|d|+μ) Method 5 (Good-Turing): Assume the total # of unseen events to be n1 (# of singletons), and adjust the counts of seen events in the same way: c*(w,d) = (c(w,d)+1) · n_{c(w,d)+1}/n_{c(w,d)}
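A sketch of Methods 2-4 under the general scheme, taking the reference model p(w|REF) to be a given word distribution; the toy numbers below are illustrative, and the default parameter values are common choices rather than values prescribed by the slides.

```python
from collections import Counter

def smoothed_lm(doc_tokens, p_ref, method='dirichlet', lam=0.5, mu=2000, delta=0.7):
    """Return p(w|d) over the reference vocabulary for one of three smoothing methods.
    Assumes every document word appears in the reference vocabulary."""
    counts = Counter(doc_tokens)
    d_len = len(doc_tokens)
    uniq = len(counts)

    def p(w):
        c = counts[w]
        if method == 'jelinek-mercer':   # (1-λ) * ML + λ * p(w|REF)
            return (1 - lam) * c / d_len + lam * p_ref[w]
        if method == 'dirichlet':        # (c + μ p(w|REF)) / (|d| + μ)
            return (c + mu * p_ref[w]) / (d_len + mu)
        if method == 'absolute':         # max(c-δ,0)/|d| + (δ|d|_u/|d|) p(w|REF)
            return max(c - delta, 0) / d_len + delta * uniq / d_len * p_ref[w]
        raise ValueError(method)

    return {w: p(w) for w in p_ref}

# Toy reference (collection) model and document.
p_ref = {'text': 0.3, 'mining': 0.2, 'database': 0.2, 'query': 0.2, 'retrieval': 0.1}
doc = 'text mining text database'.split()
for m in ('jelinek-mercer', 'dirichlet', 'absolute'):
    p = smoothed_lm(doc, p_ref, method=m)
    print(m, round(sum(p.values()), 6))  # each method still sums to 1 over the vocabulary
```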

36 2010 © University of Michigan 36 So, which method is the best? It depends on the data and the task! Many other sophisticated smoothing methods have been proposed… Cross validation is generally used to choose the best method and/or set the smoothing parameters… For retrieval, Dirichlet prior performs well… Smoothing will be discussed further in the course…

37 2010 © University of Michigan 37 Language Models for IR

38 2010 © University of Michigan 38 Text Generation with Unigram LM [Diagram, repeated from slide 26: a (unigram) language model θ, p(w|θ), is sampled to generate a document. Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … generates a text mining paper. Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … generates a food nutrition paper.]

39 2010 © University of Michigan 39 Estimation of Unigram LM [Diagram, repeated from slide 27: estimating a (unigram) language model θ, p(w|θ)=?, from a document, a “text mining paper” with total #words=100 and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, … The estimates are p(text)=10/100, p(mining)=5/100, p(association)=p(database)=3/100, …, p(query)=1/100, …]

40 2010 © University of Michigan 40 Language Models for Retrieval (Ponte & Croft 98) [Diagram: estimate a language model for each document, e.g., the text mining paper’s model (text ?, mining ?, association ?, clustering ?, …, food ?, …) and the food nutrition paper’s model (food ?, nutrition ?, healthy ?, diet ?, …). Query = “data mining algorithms”. Which model would most likely have generated this query?]

41 2010 © University of Michigan 41 Ranking Docs by Query Likelihood [Diagram: for each document d1, d2, …, dN, estimate a document LM θ_d1, θ_d2, …, θ_dN; given a query q, compute the query likelihoods p(q|θ_d1), p(q|θ_d2), …, p(q|θ_dN) and rank the documents by them]
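A minimal end-to-end sketch of query-likelihood ranking, here with Jelinek-Mercer smoothing against a collection model; the toy corpus, query, and helper names are mine, not from the slides.

```python
import math
from collections import Counter

def collection_model(docs):
    """Collection LM p(w|C): relative frequency of w over all documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_query_likelihood(query, doc, p_c, lam=0.5):
    """log p(q|θ_d) with Jelinek-Mercer smoothing: p(w|d) = (1-λ) c(w,d)/|d| + λ p(w|C)."""
    counts = Counter(doc)
    return sum(math.log((1 - lam) * counts[w] / len(doc) + lam * p_c[w]) for w in query)

# Toy corpus: d1 is about text mining, d2 about food/nutrition.
docs = {
    'd1': 'text mining text clustering mining algorithms'.split(),
    'd2': 'food nutrition healthy diet food'.split(),
}
p_c = collection_model(docs.values())
query = 'data mining algorithms'.split()
query = [w for w in query if w in p_c]  # drop words never seen in the collection (p(w|C)=0)

ranking = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d], p_c), reverse=True)
print(ranking)  # ['d1', 'd2'] -- the text-mining document most likely generated the query
```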

42 2010 © University of Michigan 42 Retrieval as Language Model Estimation Document ranking based on query likelihood: log p(q|d) = Σ_i log p(wi|d), where q = w1 w2 … wn and p(wi|d) is the document language model Retrieval problem ≈ estimation of p(wi|d) Smoothing is an important issue, and distinguishes different approaches

43 2010 © University of Michigan 43 How to Estimate p(w|d)? Simplest solution: Maximum Likelihood Estimator –P(w|d) = relative frequency of word w in d –What if a word doesn’t appear in the text? P(w|d)=0 In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words This is what “smoothing” is about …

44 2010 © University of Michigan 44 A General Smoothing Scheme All smoothing methods try to –discount the probability of words seen in a doc –re-allocate the extra probability so that unseen words will have a non-zero probability Most use a reference model (collection language model) to discriminate unseen words: p(w|d) = p_seen(w|d) if w is seen in d (the discounted ML estimate), and α_d p(w|C) otherwise, where p(w|C) is the collection language model

45 2010 © University of Michigan 45 Smoothing & TF-IDF Weighting Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain log p(q|d) = Σ_{w∈q∩d} c(w,q) log [p_seen(w|d)/(α_d p(w|C))] + |q| log α_d + Σ_{w∈q} c(w,q) log p(w|C). The last sum does not depend on the document and can be ignored for ranking; the ratio term provides TF weighting together with IDF weighting (dividing by p(w|C)); and |q| log α_d provides doc length normalization (a long doc is expected to have a smaller α_d). Smoothing with p(w|C) ≈ TF-IDF + length norm.

46 2010 © University of Michigan 46 If you want to know how we get this… Start from the general smoothing scheme: p(w|d) = p_seen(w|d) for words seen in d (discounted ML estimate) and α_d p(w|C) otherwise (reference language model). The retrieval formula using the general smoothing scheme is log p(q|d) = Σ_{w∈q} c(w,q) log p(w|d); the key rewriting step splits this sum into seen and unseen words, giving log p(q|d) = Σ_{w∈q∩d} c(w,q) log [p_seen(w|d)/(α_d p(w|C))] + |q| log α_d + Σ_{w∈q} c(w,q) log p(w|C). Similar rewritings are very common when using LMs for IR…

47 2010 © University of Michigan 47 Three Smoothing Methods (Zhai & Lafferty 01) Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C): p(w|d) = (1−λ) p_ml(w|d) + λ p(w|C) Dirichlet prior (Bayesian): Assume μ pseudo counts distributed as p(w|C): p(w|d) = (c(w,d) + μ p(w|C))/(|d| + μ) Absolute discounting: Subtract a constant δ from each seen word’s count: p(w|d) = (max(c(w,d) − δ, 0) + δ|d|_u p(w|C))/|d|

48 2010 © University of Michigan Assume We Use Dirichlet Smoothing 48 The (rank-equivalent) query likelihood score becomes log p(q|d) = Σ_{w∈q∩d} c(w,q) · log(1 + c(w,d)/(μ p(w|C))) + |q| · log(μ/(|d| + μ)). Only words in the query contribute; c(w,q) is the query term frequency, c(w,d) is the doc term frequency, the factor 1/p(w|C) plays a role similar to IDF, and |d| is the doc length
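A sketch of this Dirichlet-smoothed scoring function in Python; a document-independent term is dropped, so the scores are rank-equivalent to the query likelihood, and `p_c` is a collection language model such as the one built in the earlier ranking sketch.

```python
import math
from collections import Counter

def dirichlet_score(query_tokens, doc_tokens, p_c, mu=2000):
    """Rank-equivalent Dirichlet-smoothed query likelihood:
    sum over matched terms of c(w,q) * log(1 + c(w,d) / (mu * p(w|C))),
    plus the length-normalization term |q| * log(mu / (|d| + mu))."""
    c_q = Counter(query_tokens)
    c_d = Counter(doc_tokens)
    matched = sum(
        c_q[w] * math.log(1 + c_d[w] / (mu * p_c[w]))
        for w in c_q if c_d[w] > 0 and w in p_c
    )
    length_norm = len(query_tokens) * math.log(mu / (len(doc_tokens) + mu))
    return matched + length_norm
```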

49 2010 © University of Michigan 49 The Notion of Relevance [Diagram: three families of retrieval models. (1) Relevance as similarity Δ(Rep(q), Rep(d)), with different representations & similarity measures: vector space model (Salton et al., 75), prob. distr. model (Wong & Yao, 89), … (2) Relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}: regression model (Fox 83) and generative models, which split into doc generation (classical prob. model, Robertson & Sparck Jones, 76) and query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a). (3) Relevance as probabilistic inference P(d→q) or P(q→d), with different inference systems: prob. concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91)]

50 2010 © University of Michigan Query Generation and Document Generation 50 Document ranking based on the conditional probability of relevance P(R=1|Q,D), with R ∈ {0,1} Ranking based on how the model is decomposed: Query generation → Language models; Document generation → Okapi/BM25

51 2010 © University of Michigan 51 The Notion of Relevance [Diagram, repeated from slide 49: relevance as similarity Δ(Rep(q), Rep(d)): vector space model (Salton et al., 75), prob. distr. model (Wong & Yao, 89), …; relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}: regression model (Fox 83), classical prob. model (Robertson & Sparck Jones, 76) via doc generation, LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a) via query generation; relevance as probabilistic inference P(d→q) or P(q→d): prob. concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91)]

52 2010 © University of Michigan The Okapi/BM25 Retrieval Formula 52 score(q,d) = Σ_{w∈q∩d} log[(N − df(w) + 0.5)/(df(w) + 0.5)] · [(k1+1) c(w,d)]/[k1((1−b) + b·|d|/avdl) + c(w,d)] · [(k3+1) c(w,q)]/[k3 + c(w,q)]. Only words in the query contribute; c(w,q) is the query term frequency, c(w,d) is the doc term frequency, the first factor is the IDF component, and |d|/avdl normalizes for doc length. Classical probabilistic model – among the best performance
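A sketch of one common BM25 variant matching the components named above; the parameter defaults k1=1.2, b=0.75, k3=8 are conventional choices, not values given on the slide.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, df, N, avgdl, k1=1.2, b=0.75, k3=8):
    """Okapi BM25 score of a document for a query (one common variant).
    df[w]: number of documents containing w; N: total number of documents; avgdl: average doc length."""
    c_q = Counter(query_tokens)
    c_d = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for w, qtf in c_q.items():           # only words in the query contribute
        tf = c_d[w]
        if tf == 0 or w not in df:
            continue
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))                # IDF component
        doc_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))  # doc TF + length norm
        query_part = qtf * (k3 + 1) / (qtf + k3)                         # query TF component
        score += idf * doc_part * query_part
    return score
```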

53 2010 © University of Michigan What are the Essential Features in a Retrieval Formula? 53 Document TF – how many occurrences of each query term do we observe in a document Query TF – how many occurrences of each term do we observe in a query IDF – how distinctive/important is each query term (a term with a higher IDF is more distinctive) Document length – how long the document is (we want to penalize long documents, but don’t want to over-penalize them!) Others?

