1 Integrating Term Relationships into Language Models for Information Retrieval Jian-Yun Nie RALI, Dept. IRO University of Montreal, Canada.


1 1 Integrating Term Relationships into Language Models for Information Retrieval Jian-Yun Nie RALI, Dept. IRO University of Montreal, Canada

2 2 Overview Language model Interesting theoretical framework Efficient probability estimation and smoothing methods Good effectiveness Limitations Most approaches use uni-grams, and independence assumption Just a different way to weight terms? Extensions Integrating term relationships? Experiments Conclusions

3 3 Principle of language modeling Goal: create a statistical model so that one can calculate the probability of a sequence of words s = w_1, w_2, …, w_n in a language. General approach: estimate the probabilities of the elements observed in a training corpus; the resulting model assigns a probability P(s) to any sequence s.

4 4 Examples of utilization Speech recognition: training corpus = signals + words; probabilities: P(word|signal), P(word2|word1); utilization: signals → sequence of words. Statistical tagging: training corpus = words + tags (n, v); probabilities: P(word|tag), P(tag2|tag1); utilization: sentence → sequence of tags.

5 5 Prob. of a sequence of words Elements to be estimated: P(s) = ∏_{i=1}^{n} P(w_i | h_i), where h_i = w_1 … w_{i-1} is the history of w_i. If h_i is too long, one cannot observe (h_i, w_i) in the training corpus, and the estimate for (h_i, w_i) is hard to generalize. Solution: limit the length of h_i.

6 6 n-grams Limit h_i to the n-1 preceding words. Most used cases: uni-gram: P(s) = ∏_i P(w_i); bi-gram: P(s) = ∏_i P(w_i | w_{i-1}); tri-gram: P(s) = ∏_i P(w_i | w_{i-2} w_{i-1}).

7 7 A simple example (corpus = 10,000 words, 10,000 bi-grams) Uni-gram: P(I, talk) = P(I) P(talk) = 0.001 * 0.0008; P(I, talks) = P(I) P(talks) = 0.001 * 0.0008. Bi-gram: P(I, talk) = P(I | #) P(talk | I) = 0.008 * 0.2; P(I, talks) = P(I | #) P(talks | I) = 0.008 * 0. The uni-gram model cannot distinguish the two sequences; the bi-gram model can.
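The toy bi-gram numbers above can be reproduced with a few lines of MLE counting. A minimal sketch; the tiny corpus below is made up for illustration (not the 10,000-word corpus of the slide), and `#` marks the sentence start as on the slide:

```python
from collections import Counter

def train_ngram(corpus):
    """Estimate uni-gram and bi-gram MLE probabilities from a list of
    tokenized sentences. '#' is the sentence-start symbol."""
    uni = Counter()
    bi = Counter()
    for sent in corpus:
        toks = ["#"] + sent
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    # P(w2 | w1) = count(w1, w2) / count(w1)
    p_bi = {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}
    return p_uni, p_bi

# Hypothetical toy corpus
corpus = [["I", "talk"], ["I", "talk"], ["you", "talk"], ["I", "sleep"]]
p_uni, p_bi = train_ngram(corpus)
```

Note that any bi-gram never seen in training (e.g. ("I", "talks")) simply gets no entry, i.e. probability 0, which is exactly the sparseness problem that motivates smoothing on the next slides.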

8 8 Estimation History: short ↔ long; modeling: coarse ↔ refined; estimation: easy ↔ difficult. Maximum likelihood estimation (MLE): if (h_i, w_i) is not observed in the training corpus, then P(w_i | h_i) = 0, e.g. P(they, talk) = P(they | *) P(talk | they) = 0 → smoothing is needed.

9 9 Smoothing Goal: assign a low (but non-zero) probability to words or n-grams not observed in the training corpus. (Figure: P_MLE vs. the smoothed probability, per word.)

10 10 Smoothing methods n-gram: change the freq. of occurrences. Laplace smoothing (add-one): P(w | h) = (c(h, w) + 1) / (c(h) + |V|). Good-Turing: change the freq. r to r* = (r + 1) n_{r+1} / n_r, where n_r = no. of n-grams of freq. r.
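Both formulas are short enough to state directly in code. A minimal sketch under the standard definitions (function names are ours, not from the slides):

```python
def laplace(count, context_count, vocab_size):
    """Add-one (Laplace) smoothing: P(w|h) = (c(h,w) + 1) / (c(h) + |V|).
    An unseen event gets 1 / (c(h) + |V|) instead of zero."""
    return (count + 1) / (context_count + vocab_size)

def good_turing_adjusted(r, freq_of_freqs):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r,
    where freq_of_freqs[r] = n_r = number of n-grams seen exactly r times."""
    return (r + 1) * freq_of_freqs.get(r + 1, 0) / freq_of_freqs[r]
```

For example, if 100 bi-grams occur once and 25 occur twice, Good-Turing discounts the singletons to an adjusted count of 2 * 25 / 100 = 0.5.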

11 11 Smoothing (cont’d) Combine a model with a lower-order model. Backoff (Katz): fall back on the lower-order model for unseen n-grams. Interpolation (Jelinek-Mercer): P(w | h) = λ P_ML(w | h) + (1 − λ) P(w | h′), with h′ a shorter history. In IR, combine the document model with the corpus model.

12 12 Smoothing (cont’d) Dirichlet: P(w | D) = (c(w, D) + μ P(w | C)) / (|D| + μ). Two-stage: smooth the document model with a Dirichlet prior first, then interpolate with the corpus model.
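The Jelinek-Mercer, Dirichlet, and two-stage estimators can be sketched as follows. The parameter defaults (λ, μ) are illustrative only, and the two-stage form follows the common Zhai and Lafferty formulation, which the slide does not spell out:

```python
def p_jm(c_wd, dlen, p_wc, lam=0.5):
    """Jelinek-Mercer: linear interpolation of the document MLE
    with the corpus model P(w|C), mixing weight lam on the corpus side."""
    p_ml = c_wd / dlen if dlen else 0.0
    return (1 - lam) * p_ml + lam * p_wc

def p_dirichlet(c_wd, dlen, p_wc, mu=2000):
    """Dirichlet prior: P(w|D) = (c(w,D) + mu * P(w|C)) / (|D| + mu)."""
    return (c_wd + mu * p_wc) / (dlen + mu)

def p_two_stage(c_wd, dlen, p_wc, mu=2000, lam=0.1):
    """Two-stage: Dirichlet smoothing of the document model,
    then Jelinek-Mercer interpolation with the corpus model."""
    return (1 - lam) * p_dirichlet(c_wd, dlen, p_wc, mu) + lam * p_wc
```

Note how Dirichlet smoothing adapts to document length: for a long document the μ P(w|C) term matters less, which matches the intuition that long documents need less smoothing.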

13 13 Using LM in IR Principle 1: Document D: Language model P(w|M D ) Query Q = sequence of words q 1, q 2, …,q n (uni-grams) Matching: P(Q|M D ) Principle 2: Query Q: Language model P(w|M Q ) Document D = sequence of words d 1, d 2, …,d n Matching: P(D|M Q ) Principle 3: Document D: Language model P(w|M D ) Query Q: Language model P(w|M Q ) Matching: comparison between P(w|M D ) and P(w|M Q ) Principle 4: Translate D to Q

14 14 Principle 1: Document LM Document D: model M_D. Query Q: q_1, q_2, …, q_n (uni-grams). P(Q|D) = P(Q|M_D) = P(q_1|M_D) P(q_2|M_D) … P(q_n|M_D). Problem of smoothing: a short document yields a coarse M_D and unseen words. Smoothing: change word frequencies; smooth with the corpus.
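Principle 1 reduces to summing log-probabilities of the query terms under the smoothed document model. A sketch using Dirichlet smoothing (one possible choice, not mandated by the slide; the 1e-9 floor for words missing from the corpus model is our assumption):

```python
import math
from collections import Counter

def score_query_likelihood(query, doc, corpus_model, mu=2000):
    """log P(Q|M_D) under Principle 1. Dirichlet smoothing ensures an
    unseen query word does not zero out the whole product."""
    tf = Counter(doc)          # term frequencies in the document
    dlen = len(doc)
    score = 0.0
    for q in query:
        p = (tf[q] + mu * corpus_model.get(q, 1e-9)) / (dlen + mu)
        score += math.log(p)
    return score
```

Documents are then ranked by this score; working in log space avoids underflow for long queries.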

15 15 Determine λ Expectation maximization (EM): choose the λ that maximizes the likelihood of the text. Initialize λ; E-step; M-step; loop on E and M.

16 16 Principle 2: Query LM Query Q: M_Q. Document D: d_1, d_2, …, d_n. Matching: P(Q|D) = P(D|M_Q) P(M_Q) / P(D) ∝ P(D|M_Q) / P(D). A query is even shorter than a document, so P(D|M_Q) is difficult to estimate; this principle is not directly used.

17 17 Principle 3: Doc. likelihood / divergence between M_D and M_Q Question: Is the document likelihood increased when a query is submitted? (Is the query likelihood increased when D is retrieved?) Score by the ratio P(Q|D) / P(Q), where P(Q|D) is calculated with P(Q|M_D) and P(Q) is estimated as P(Q|M_C).

18 18 Divergence of M_D and M_Q KL: Kullback-Leibler divergence, measuring the divergence between two probability distributions: score(Q, D) = −KL(M_Q ‖ M_D) = Σ_w P(w|M_Q) log [P(w|M_D) / P(w|M_Q)]. Assume Q follows a multinomial distribution.
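The KL ranking can be sketched directly from the formula. Here `q_model` and `d_model` are assumed to be plain word → probability dicts, and the document model must already be smoothed so every query word has non-zero probability:

```python
import math

def neg_kl_score(q_model, d_model):
    """Rank documents by -KL(M_Q || M_D)
    = sum_w P(w|M_Q) * log( P(w|M_D) / P(w|M_Q) ).
    Higher (closer to 0) means M_D is closer to M_Q."""
    return sum(p_q * math.log(d_model[w] / p_q)
               for w, p_q in q_model.items())
```

The score is 0 when the two distributions agree on the query words and strictly negative otherwise, so ranking by this score is equivalent to ranking by smallest divergence.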

19 19 Principle 4: IR as translation Noisy channel: message sent → message received. Transmit D through the channel and receive Q: P(Q|D) = ∏_i Σ_j P(q_i | w_j) P(w_j | D), where P(w_j | D) is the prob. that D generates w_j, and P(q_i | w_j) the prob. of translating w_j by q_i. Possibility to consider relationships between words. How to estimate P(q_i | w_j)? Berger & Lafferty: pseudo-parallel texts (align each sentence with its paragraph).
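The translation model above is a double loop: a sum over document words inside a product over query words. A sketch assuming the translation table `trans` is already given (estimating it, e.g. from pseudo-parallel texts as in Berger and Lafferty, is the hard part and is not shown):

```python
def p_query_translation(query, doc_model, trans):
    """Translation-style P(Q|D) = prod_i sum_j P(q_i|w_j) P(w_j|D).
    doc_model: dict w -> P(w|D); trans: dict (q, w) -> P(q|w)."""
    p = 1.0
    for q in query:
        p *= sum(trans.get((q, w), 0.0) * p_w
                 for w, p_w in doc_model.items())
    return p
```

With self-translation P(w|w) = 1 and zero elsewhere, this collapses to the classical uni-gram query-likelihood model; non-zero cross-word entries are what let related words (e.g. a synonym) contribute.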

20 20 Summary on LM Can a query be generated from a document model? Does a document become more likely when a query is submitted (or reverse)? Is a query a "translation" of a document? Smoothing is crucial Often use uni-grams

21 21 Beyond uni-grams Bi-grams: condition each word on the preceding word. Bi-terms: like bi-grams, but without considering word order, so (analysis, data) and (data, analysis) are equivalent.

22 22 Relevance model LM does not capture “relevance”. Use pseudo-relevance feedback: construct a “relevance” model from the top-ranked documents. Combine: document model + relevance model (feedback) + corpus model.

23 23 Model using document clusters Document smoothing with the collection: some documents are more similar to the given document than the whole collection is (document clustering). Different levels of smoothing: (document + cluster) + collection.
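The (document + cluster) + collection scheme amounts to a three-way linear interpolation. A minimal sketch with made-up weights (the slide does not give values, and in practice they would be tuned or estimated):

```python
def p_cluster_smoothed(w, doc_model, cluster_model, corpus_model,
                       lam_d=0.6, lam_k=0.3):
    """Three-level smoothing: document, then its cluster, then the
    whole collection. Weights sum to 1; the collection weight is
    1 - lam_d - lam_k. All models are dicts w -> probability."""
    lam_c = 1.0 - lam_d - lam_k
    return (lam_d * doc_model.get(w, 0.0)
            + lam_k * cluster_model.get(w, 0.0)
            + lam_c * corpus_model.get(w, 0.0))
```

The cluster model sits between the two extremes: more specific than the collection, less sparse than a single document.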

24 24 Experimental results LM vs. Vector space model with tf*idf (Smart) Usually better LM vs. Prob. model (Okapi) Often similar bi-gram LM vs. uni-gram LM Slight improvements (but with much larger model)

25 25 Comparison: LM vs. tf*idf log P(Q|D) behaves like a VSM score with tf*idf weighting and document length normalization; smoothing plays a role similar to idf plus length normalization.

26 26 Contributions of LM to IR Well founded theoretical framework Exploit the mass of data available Techniques of smoothing for probability estimation Explain some empirical and heuristic methods by smoothing Interesting experimental results Existing tools for IR using LM (Lemur)

27 27 Problems Increased complexity. Limitation to uni-grams: no dependence between words. Problems with bi-grams: they consider all adjacent word pairs (noise), cannot consider more distant dependencies, and word order is not always important for IR. Entirely data-driven, no external knowledge (e.g. the relation between “programming” and “computer”). Logic well hidden behind numbers; key = smoothing. Maybe too much emphasis on smoothing, and too little on the underlying logic. Direct comparison between D and Q requires that D and Q contain identical words (except the translation model); cannot deal with synonymy and polysemy.

28 28 Extensions Classical LM: Document (t1, t2, …) → Query (independent terms). 1. Document (comp. archi.) → Query (dependent terms). 2. Document (prog. comp.) → Query (term relations).

29 29 Extensions (1): link terms in document and query Dependence LM (Gao et al. 04): capture more distant dependencies within a sentence, via syntactic analysis and statistical analysis. Only retain the most probable dependencies in the query. Example: (how) (has) affirmative action affected (the) construction industry.

30 30 Estimate the prob. of links (EM) For a corpus C: 1. Initialization: link each pair of words within a window of 3 words. 2. For each sentence in C: apply the link prob. to select the strongest links that cover the sentence. 3. Re-estimate link prob. 4. Repeat 2 and 3.

31 31 Calculation of P(Q|D) 1. Determine the links in Q (the required links). 2. Calculate the likelihood of Q (words and links): a requirement on both words and bi-term links.

32 32 Experiments

33 33 Extension (2): Inference in IR Logical deduction: (A → B) ∧ (B → C) ⊢ A → C. In IR: (D → Q’) [direct matching] ∧ (Q’ → Q) [inference on query] ⊢ D → Q; (D → D’) [inference on doc.] ∧ (D’ → Q) [direct matching] ⊢ D → Q.

34 34 How to make inference in IR? - Language modeling Translation model: P(Q|D) = ∏_i Σ_j P(q_i | w_j) P(w_j | D). Classical LM: the special case where P(q_i | w_j) is non-zero only when q_i = w_j.

35 35 How to make inference in IR simply? - Language modeling Term relationships from co-occurrences: use the document collection to estimate P(w_2 | w_1). Term relationships from a thesaurus: use term relationships in Wordnet (synonymy, hypernymy, …) plus co-occurrence information to estimate their prob. Combine both through smoothing.

36 36 Illustration: Bayesian network (Figure: each query term q_i is generated from the document words w_1, w_2, …, w_n through three models interpolated with weights λ_1, λ_2, λ_3: a WordNet model with P_WN(q_i | w_1) and P_WN(w_i | D), a co-occurrence model with P_CO(q_i | w_1) and P_CO(w_i | D), and a uni-gram model with P_UG(w_i | D).)

37 37 Experimental results (Cao et al. 05)

Coll. | Unigram model: AvgP, Rec. | Dependency model with unique WN relations: AvgP, %change, Rec. | Dependency model with typed WN relations: AvgP, %change, Rec.
WSJ | 0.2466, 1659/2172 | 0.2597, +5.31*, 1706/2172 | 0.2623, +6.37*, 1719/2172
AP | 0.1925, 3289/6101 | 0.2128, +10.54**, 3523/6101 | 0.2141, +11.22**, 3530/6101
SJM | 0.2045, 1417/2322 | 0.2142, +4.74, 1572/2322 | 0.2155, +5.38, 1558/2322

Integrating different types of relationships in LM may improve effectiveness.

38 38 Doc expansion vs. query expansion: either expand the document model or expand the query model.

39 39 Question: How to implement QE in LM? Considered a difficult task. With KL divergence ranking, score(Q, D) ∝ Σ_w P(w|M_Q) log P(w|M_D), so expansion can be implemented by changing the query model P(w|M_Q).

40 40 Expanding the query model: combine the classical LM query model with a relation model.

41 41 Using co-occurrence information Using an external knowledge base (e.g. Wordnet) Other term relationships

42 42 Defining relational model HAL (Hyperspace Analogue to Language): a special co-occurrence matrix (Bruza & Song). In “the effects of pollution on the population”, “effects” and “pollution” co-occur in 2 windows of length L = 3, so HAL(effects, pollution) = 2 = L − distance + 1.
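The HAL weight on the slide, L − distance + 1 within a sliding window, is easy to compute directly. A minimal sketch (directional: it counts w_2 occurring at most L positions after w_1, which is enough to reproduce the slide's example; the function name is ours):

```python
from collections import defaultdict

def hal_matrix(tokens, L=3):
    """HAL-style co-occurrence weights: for each pair (w1, w2) with w2
    at most L positions after w1, add L - distance + 1, so closer
    words get a stronger weight."""
    M = defaultdict(float)
    for i, w1 in enumerate(tokens):
        for d in range(1, L + 1):
            if i + d < len(tokens):
                M[(w1, tokens[i + d])] += L - d + 1
    return M
```

On the slide's sentence, “effects” and “pollution” are 2 positions apart, so the pair receives 3 − 2 + 1 = 2, matching HAL(effects, pollution) = 2.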

43 43 From HAL to inference relation Example: superconductors. Combining terms: space ⊕ program, with different importance given to space and to program.

44 44 From HAL to inference relation (information flow) space ⊕ program |- {program: 1.00, space: 1.00, nasa: 0.97, new: 0.97, U.S.: 0.96, agency: 0.95, shuttle: 0.95, …, science: 0.88, scheduled: 0.87, reagan: 0.87, director: 0.87, programs: 0.87, air: 0.87, put: 0.87, center: 0.87, billion: 0.87, aeronautics: 0.87, satellite: 0.87, …}

45 45 Two types of term relationship Pairwise: P(t_2 | t_1). Inference relationship: derived from a combination of terms (information flow). Inference relationships are less ambiguous and produce less noise (Qiu & Frei 93).

46 46 1. Query expansion with pairwise term relationships Select a set (85) of strongest HAL relationships

47 47 2. Query expansion with IF term relationships 85 strongest IF relationships

48 48 Experiments (Bai et al. 05) (AP89 collection, queries 1-50)

AvgPr:
Doc. smooth. | LM baseline | QE with HAL | QE with IF | QE with IF & FB
Jelinek-Mercer | 0.1946 | 0.2037 (+5%) | 0.2526 (+30%) | 0.2620 (+35%)
Dirichlet | 0.2014 | 0.2089 (+4%) | 0.2524 (+25%) | 0.2663 (+32%)
Absolute | 0.1939 | 0.2039 (+5%) | 0.2444 (+26%) | 0.2617 (+35%)
Two-stage | 0.2035 | 0.2104 (+3%) | 0.2543 (+25%) | 0.2665 (+31%)

Recall:
Doc. smooth. | LM baseline | QE with HAL | QE with IF | QE with IF & FB
Jelinek-Mercer | 1542/3301 | 1588/3301 (+3%) | 2240/3301 (+45%) | 2366/3301 (+53%)
Dirichlet | 1569/3301 | 1608/3301 (+2%) | 2246/3301 (+43%) | 2356/3301 (+50%)
Absolute | 1560/3301 | 1607/3301 (+3%) | 2151/3301 (+38%) | 2289/3301 (+47%)
Two-stage | 1573/3301 | 1596/3301 (+1%) | 2221/3301 (+41%) | 2356/3301 (+50%)

49 49 Experiments (AP88-90, topics 101-150)

AvgPr:
Doc. smooth. | LM baseline | QE with HAL | QE with IF | QE with IF & FB
Jelinek-Mercer | 0.2120 | 0.2235 (+5%) | 0.2742 (+29%) | 0.3199 (+51%)
Dirichlet | 0.2346 | 0.2437 (+4%) | 0.2745 (+17%) | 0.3157 (+35%)
Absolute | 0.2205 | 0.2320 (+5%) | 0.2697 (+22%) | 0.3161 (+43%)
Two-stage | 0.2362 | 0.2457 (+4%) | 0.2811 (+19%) | 0.3186 (+35%)

Recall:
Doc. smooth. | LM baseline | QE with HAL | QE with IF | QE with IF & FB
Jelinek-Mercer | 3061/4805 | 3142/4805 (+3%) | 3675/4805 (+20%) | 3895/4805 (+27%)
Dirichlet | 3156/4805 | 3246/4805 (+3%) | 3738/4805 (+18%) | 3930/4805 (+25%)
Absolute | 3031/4805 | 3125/4805 (+3%) | 3572/4805 (+18%) | 3842/4805 (+27%)
Two-stage | 3134/4805 | 3212/4805 (+2%) | 3713/4805 (+18%) | 3901/4805 (+24%)

50 50 Observations Possible to implement query/document expansion in LM. Expansion using inference relationships is more context-sensitive: better than context-independent expansion (Qiu & Frei). Every kind of knowledge is useful (co-occ., Wordnet, IF relationships, etc.). LM gains some inferential power.

51 51 Conclusions LM = suitable model for IR Classical LM = independent terms (n-grams) Possibility to integrate term relationships: Within document and within query (link constraint ~ compound term) Between document and query (inference) Both (future work) Automatic parameter estimation = powerful tool for data-driven IR First experiments showed encouraging results

