# 1 Fuchun Peng Microsoft Bing 7/23/2010. 2  Query is often treated as a bag of words  But when people are formulating queries, they use “concepts” as.

## Presentation on theme: "1 Fuchun Peng Microsoft Bing 7/23/2010. 2  Query is often treated as a bag of words  But when people are formulating queries, they use “concepts” as."— Presentation transcript:

1 Fuchun Peng Microsoft Bing 7/23/2010

2
- Query is often treated as a bag of words
- But when people are formulating queries, they use “concepts” as building blocks
- Q: simmons college sports psychology
  - A1: “simmons college”, “sports psychology” (simmons college’s sports psychology course)
  - A2: “college sports”
- Can we automatically segment the query to recover the concepts?

3
- Summary of Segmentation approaches
- Use for Improving Search Relevance
  - Query rewriting
  - Ranking features
- Conclusions

4
- Supervised learning (Bergsma et al, EMNLP-CoNLL07)
  - Binary decision at each possible segmentation point
  - Features: POS, web counts, the, and, …
- Example: for the words w1 w2 w3 w4 w5, the decisions at the four gaps are N, N, Y, Y
- Problem:
  - Limited-range context
  - Features specifically designed for noun phrases

5
- Manual Data Preparation
  - Linguistic driven: [San jose international airport]
  - Relevance driven: [San jose] [international airport]

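6
- Compute the mutual information (MI) of each adjacent word pair in w1 w2 w3 w4 w5: MI(1,2), MI(2,3), MI(3,4), MI(4,5)
- $\mathrm{MI}(w_1, w_2) = \dfrac{P(w_1 w_2)}{P(w_1)\,P(w_2)}$
- Compare each MI value with a threshold and insert a segment boundary where it falls below, e.g. w1 w2 | w3 w4 w5
- Problem:
  - Only captures short-range correlation (between adjacent words)
  - What about “my heart will go on”?
- Iterative update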
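The thresholding step on this slide can be illustrated with a minimal Python sketch. The function name `mi_segment`, the threshold value, and all probabilities below are hypothetical, chosen only to show the mechanics; the slide does not give real values, and the iterative update it mentions is not modelled here.

```python
def mi_segment(words, unigram_p, bigram_p, threshold):
    """Split `words` wherever the MI ratio of an adjacent pair is below `threshold`.

    Uses the ratio form from the slide: P(w1 w2) / (P(w1) * P(w2)).
    Unseen bigrams are treated as probability 0 (an assumption).
    """
    if not words:
        return []
    segments = [[words[0]]]
    for left, right in zip(words, words[1:]):
        ratio = bigram_p.get((left, right), 0.0) / (unigram_p[left] * unigram_p[right])
        if ratio < threshold:
            segments.append([right])      # weak association: start a new segment
        else:
            segments[-1].append(right)    # strong association: extend the segment
    return [" ".join(seg) for seg in segments]


# Hypothetical probabilities, for illustration only.
toy_unigram = {w: 0.001 for w in ["simmons", "college", "sports", "psychology"]}
toy_bigram = {
    ("simmons", "college"): 0.0001,
    ("college", "sports"): 0.000002,
    ("sports", "psychology"): 0.00005,
}
toy_result = mi_segment(
    "simmons college sports psychology".split(), toy_unigram, toy_bigram, threshold=10.0
)
```

Because each decision looks at one adjacent pair only, a phrase such as “my heart will go on” can be split at a weak internal pair even though the whole phrase is a single concept, which is the limitation the slide points out.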

7

8
- Assume the query is generated by independent sampling from a probability distribution of concepts (a unigram model)
- Example: simmons college sports psychology
  - [simmons college] [sports psychology]: P(simmons college) = 0.000016, P(sports psychology) = 0.000002, so P = 0.000016 × 0.000002
  - [simmons] [college sports] [psychology]: P(simmons) = 0.000007, P(college sports) = 0.000006, P(psychology) = 0.000024, so P = 0.000007 × 0.000006 × 0.000024
  - The first segmentation has the higher probability
- Enumerate all possible segmentations; rank by probability of being generated by the unigram model
- How to estimate parameters P(w) for the unigram model?
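A short Python sketch of the enumerate-and-rank step, using the five concept probabilities quoted on this slide. The helper names `segmentations` and `rank` are hypothetical, and skipping any segmentation that contains a concept with no listed probability is a simplifying assumption; a real system would need some estimate for every candidate segment.

```python
from itertools import combinations
from math import prod


def segmentations(words):
    """Yield every way to split `words` into contiguous segments."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def rank(words, concept_p):
    """Score each segmentation as the product of its concept probabilities."""
    scored = []
    for seg in segmentations(words):
        if all(concept in concept_p for concept in seg):  # assumption: skip unknowns
            scored.append((prod(concept_p[c] for c in seg), seg))
    return sorted(scored, reverse=True)


# Probabilities as given on the slide.
concept_p = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}
ranked = rank("simmons college sports psychology".split(), concept_p)
best_probability, best_segmentation = ranked[0]
```

With these numbers the two-concept segmentation ranks first, matching the comparison on the slide. Exhaustive enumeration grows as 2^(n-1) in the number of words, which is workable for short queries.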

9
- We have ngram (n = 1..5) counts in a web corpus
  - 464M documents; L = 33B tokens
  - Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
- $\#(ABC) = \#(AB) + \#(BC) - \#(AB \text{ OR } BC) \ge \#(AB) + \#(BC) - \#(B)$
- Solved by DP
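One way the inequality on this slide could be turned into a left-to-right dynamic program is sketched below in Python. This is an illustration, not necessarily the authors' procedure: the function name `count_bounds` is hypothetical, the upper bound (a longer ngram cannot be more frequent than any of its parts) is an addition not stated on the slide, and the toy counts are invented.

```python
def count_bounds(words, counts, max_n=5):
    """Return (lower, upper) bounds on the corpus count of the ngram `words`.

    `counts` maps space-joined ngrams of length <= max_n to exact counts;
    missing entries are treated as 0 (an assumption).
    """
    def c(tokens):
        return counts.get(" ".join(tokens), 0)

    if len(words) <= max_n:
        exact = c(words)
        return exact, exact

    lower = upper = c(words[:max_n])
    for end in range(max_n + 1, len(words) + 1):
        bc = c(words[end - max_n:end])       # BC: the last max_n words of the prefix
        b = c(words[end - max_n:end - 1])    # B: overlap between AB and BC
        lower = max(0, lower + bc - b)       # #(ABC) >= #(AB) + #(BC) - #(B)
        upper = min(upper, bc)               # #(ABC) <= min(#(AB), #(BC))
    return lower, upper


# Invented counts with max_n=2, so that "a b c" has to be bounded.
toy_counts = {"a b": 10, "b c": 8, "b": 12}
toy_bounds = count_bounds(["a", "b", "c"], toy_counts, max_n=2)
```

Each step extends the known prefix by one word, so the bounds for a query of length m need m − max_n updates.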

10
- Maximum Likelihood Estimate: $P_{\mathrm{MLE}}(t) = \#(t) / N$
- Problem:
  - #(potter and the goblet of) = 6765
  - P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
  - What is needed is not the probability of seeing t in text, but the probability of seeing t as a self-contained concept in text

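11 Query-relevant web corpus
- Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)
- Notation:
  - t: a query substring
  - C(t): longest matching count of t
  - D = {(t, C(t))}: query-relevant corpus
  - s(t): a segmentation of t
  - θ: unigram model parameters (ngram probabilities)

$$\theta = \arg\max_{\theta} P(D \mid \theta)\,P(\theta) = \arg\max_{\theta} \big[\log P(D \mid \theta) + \log P(\theta)\big]$$

$$\log P(D \mid \theta) = \sum_{t} C(t)\,\log P(t \mid \theta), \qquad P(t \mid \theta) = \sum_{s(t)} P(s(t) \mid \theta)$$

- The product P(D|θ)P(θ) is the posterior probability; log P(D|θ) corresponds to the description length (DL) of the corpus and log P(θ) to the DL of the parameters

| ngram | longest matching count | raw frequency |
|---|---|---|
| harry | 1657108 | 2003112 |
| harry potter | 277736 | 346004 |
| harry potter and | 10436 | 68268 |
| harry potter and the | 51330 | 57832 |
| harry potter and the goblet | 101 | 6502 |
| harry potter and the goblet of | 618 | 6401 |
| harry potter and the goblet of fire | 5783 | 5783 |
| … | … | … |
| fire | 4200957 | 4478774 |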

12

13
- Three human-segmented data sets: training, validation, and testing, with 500 queries each
- Segmented by three editors: A, B, C

14
- Evaluation metric:
  - Boundary classification accuracy
  - Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
  - Segment accuracy: the percentage of segments being recovered
- Example for segment precision: truth [abc] [de] [fg], prediction [abc] [de fg]
- Example for boundary classification: for the words w1 w2 w3 w4 w5, the decisions at the four gaps are N, N, Y, Y
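The three metrics can be worked through on the slide's example with a small Python sketch. It assumes each letter in [abc] [de] [fg] stands for one word, and it reports segment accuracy as both precision and recall; the function names are hypothetical.

```python
def spans(segments):
    """Map a segmentation to the set of (start, end) word spans it contains."""
    out, start = set(), 0
    for seg in segments:
        out.add((start, start + len(seg)))
        start += len(seg)
    return out


def evaluate(truth, pred):
    """Compare two segmentations of the same word sequence."""
    n = sum(len(seg) for seg in truth)
    true_spans, pred_spans = spans(truth), spans(pred)
    # Internal boundaries are span ends other than the end of the query.
    true_b = {end for _, end in true_spans} - {n}
    pred_b = {end for _, end in pred_spans} - {n}
    agree = sum((i in true_b) == (i in pred_b) for i in range(1, n))
    hits = len(true_spans & pred_spans)
    return {
        "boundary_accuracy": agree / (n - 1) if n > 1 else 1.0,
        "segment_precision": hits / len(pred_spans),
        "segment_recall": hits / len(true_spans),
        "whole_query_correct": true_spans == pred_spans,
    }


# Slide example, assuming each letter is one word.
truth = [list("abc"), list("de"), list("fg")]
pred = [list("abc"), list("defg")]
scores = evaluate(truth, pred)
```

Under that reading, only [abc] is recovered: one of the two predicted segments is correct, one of the three true segments is found, and five of the six gap decisions agree with the truth.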

15

16

17
- Summary of Segmentation approaches
- Use for Improving Search Relevance
  - Query rewriting
  - Ranking features
- Conclusions

18
- Phrase Proximity Boosting
- Phrase Level Query Expansion

19
- Classifying a segment into one of three categories
  - Strong concept: no word reordering, no word insertion/deletion
    - Treat the whole segment as a single unit in matching and ranking
  - Weak concept: allow word reordering or deletion/insertion
    - Boost documents matching the weak concepts
  - Not a concept
    - Do nothing

20
- Concept based BM25
  - Weighted by the confidence of concepts
- Concept based min coverage
  - Weighted by the confidence of concepts

21
- Phrase level replacement
  - [San Francisco] -> [sf]
  - [red eye flight] -> [late night flight]

22
- Significant relevance boosting
  - Affects 40% of query traffic
  - Significant DCG gain (1.5% for affected queries)
  - Significant online CTR gain (0.5% overall)

23
- Summary of Segmentation approaches
- Use for Improving Search Relevance
  - Query rewriting
  - Ranking features
- Conclusions

24
- Data is important for query segmentation
- Phrases are important for improving relevance

25
- Bergsma et al., EMNLP-CoNLL 2007
- Risvik et al., WWW 2003
- Hagen et al., SIGIR 2010
- Tan & Peng, WWW 2008

26

27
- Solution 1: segment the web corpus offline, then collect counts for ngrams that appear as segments
- Example segmented text: … | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | …
  - harry potter and the goblet of fire += 1
  - potter and the goblet of += 0
- Technical difficulties
- References:
  - C. G. de Marcken, Unsupervised Language Acquisition, 96
  - Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01

28
- Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
- Text: … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling …
- Q = harry potter and the goblet of fire
  - harry potter and the goblet of fire += 1
  - the += 2
  - harry potter += 1

29

30
- Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
- Text: … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling …
- Q = potter and the goblet
  - potter and the goblet += 1
  - the += 2
  - potter += 1
- Directly compute longest matching counts using raw ngram frequency: $O(|Q|^2)$
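A hedged Python sketch of how longest matching counts might be derived from raw ngram frequencies. It uses inclusion-exclusion over the one-word extensions of a query substring inside the query; this is an assumption that reproduces the prefix rows of the table on slide 11, not a formula stated in the slides. The function name is hypothetical, and ngrams missing from the frequency table are treated as 0.

```python
def longest_match_count(words, i, j, raw):
    """Longest matching count of the query substring words[i:j].

    Counts occurrences of words[i:j] that cannot be extended, to the left or
    to the right, into a longer substring of the same query.
    """
    def freq(a, b):
        if a < 0 or b > len(words):
            return 0                      # extension would fall outside the query
        return raw.get(" ".join(words[a:b]), 0)

    return freq(i, j) - freq(i - 1, j) - freq(i, j + 1) + freq(i - 1, j + 1)


query = "harry potter and the goblet of fire".split()
# Raw frequencies of the query prefixes, as listed in the table on slide 11.
raw_freq = {
    "harry": 2003112,
    "harry potter": 346004,
    "harry potter and": 68268,
    "harry potter and the": 57832,
    "harry potter and the goblet": 6502,
    "harry potter and the goblet of": 6401,
    "harry potter and the goblet of fire": 5783,
}
prefix_counts = [longest_match_count(query, 0, j, raw_freq) for j in range(1, 8)]
```

For a prefix there is no left extension, so the count reduces to the raw frequency minus the raw frequency of the next longer prefix. A query of |Q| words has on the order of |Q|² substrings and each needs four lookups, which is consistent with the O(|Q|²) cost quoted on the slide.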
