
1 Information Retrieval at NLC
Jianfeng Gao, NLC Group, Microsoft Research China

2 Outline
- People
- Projects
- Systems
- Research

3 People
- Jianfeng Gao, Microsoft Research, China
- Guihong Cao, Tianjin University, China
- Hongzhao He, Tianjin University, China
- Min Zhang, Tsinghua University, China
- Jian-Yun Nie, Université de Montréal
- Stephen Robertson, Microsoft Research, Cambridge
- Stephen Walker, Microsoft Research, Cambridge

4 Systems
- SMART (master: Hongzhao)
  - Traditional IR system – vector space model (VSM), TF-IDF weighting
  - Holds a collection of more than 500 MB
  - Runs on Linux
- Okapi (master: Guihong)
  - Modern IR system – probabilistic model, BM25 ranking (sketched below)
  - Holds a collection of more than 10 GB
  - Runs on Windows 2000
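To make the two ranking models concrete, here is a minimal sketch of the BM25 weighting used by Okapi-style systems. The parameter values k1 = 1.2 and b = 0.75 are common defaults, not values taken from these experiments, and the function and argument names are illustrative.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula.

    doc_terms is the document as a token list; doc_freq maps a term to the
    number of documents containing it; k1 and b are the usual free parameters.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        # Robertson/Sparck Jones style inverse document frequency
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        # term-frequency saturation with document-length normalization
        norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```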

5 Projects
- CLIR – TREC-9 (Japanese NTCIR-3)
  - System: SMART
  - Focus:
    - Chinese indexing unit [Gao et al., 00] [Gao & He, 01]
    - Query translation [Gao et al., 01]
- Web retrieval – TREC-10
  - System: Okapi
  - Focus:
    - Blind feedback … [Zhang et al., 01]
    - Link-based retrieval (anchor text) … [Craswell et al., 01]

6 Research Topics
- Best indexing unit for Chinese IR
- Query translation
- Using link information for web retrieval
- Blind feedback for web retrieval
- Improving the effectiveness of IR with clustering and fusion

7 Best indexing unit for Chinese IR
- Motivation
  - What is the best basic indexing unit for Chinese IR – word, n-gram, or combined …?
  - Does the accuracy of word segmentation have a significant impact on IR performance?
- Experiment 1 – indexing units
- Experiment 2 – the impact of word segmentation

8 Experiment 1 – settings
- System – SMART (modified version)
- Corpus – TREC-5&6 Chinese collection
- Experiments (two of the unit types are sketched below)
  - Impact of the dictionary – longest matching with a small dictionary vs. a large dictionary
  - Combining the first method with single characters
  - Using full segmentation
  - Using bigrams and unigrams (characters)
  - Combining words with bigrams and characters
  - Unknown word detection using NLPWin
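As a concrete illustration of two of the unit types compared, here is a small sketch: greedy forward longest matching against a word dictionary, and the overlapping bigram/unigram units that need no dictionary. The dictionary format and the 4-character word-length limit are assumptions for illustration.

```python
def longest_match_segment(text, dictionary, max_word_len=4):
    """Greedy forward longest-matching segmentation: at each position, take
    the longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

def character_bigram_units(text):
    """Dictionary-free indexing units: single characters plus overlapping
    character bigrams, which can also be combined with dictionary words."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams
```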

9 Experiment 1 – results
- Word + character + (bigram) + unknown words

10 Experiment 2 – settings
- System
  - SMART system
  - Songrou's segmentation & evaluation system
- Corpus
  - (1) TREC-5&6 Chinese IR collection
  - (2) Songrou's corpus
    - 12rst.txt, 181 KB
    - 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists)
  - (3) Sampling from Songrou's corpus
    - test.txt, 20 KB (random sampling from 12rst.txt)
    - standard.src, 28 KB (standard segmentation corresponding to test.txt)

11 Experiment 2 – results

Notes A: segmentation components – 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix
Notes B: feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)

Segmentation | Larger corpus (precision) | Sampling (precision) | Proper noun (precision) | Proper noun (recall) | SMART result (11-pt avg) | Feedback (10, 500) (11-pt avg) | Feedback (100, 500) (11-pt avg)
1            | 9.62% | 8.90% | ----- | 80.29% | 0.3726 | 0.4183 | 0.4067
1+2          | 8.35% | 7.52% | ----- | 81.77% | 0.3770 | 0.4231 | 0.4105
1+3          | 7.35% | 7.81% | 92.20% | 80.30% | 0.3777 | 0.4218 | 0.4114
1+3+4        | 6.57% | 6.91% | 91.39% | 94.09% | 0.3724 | 0.4252 | 0.4073
1+2+3+4      | 5.28% | 6.91% | 92.20% | 93.10% | 0.3747 | 0.4260 | 0.4094
1+2+3+4+5    | 5.29% | 5.57% | 95.98% | 94.09% | 0.3822 | 0.4356 | 0.4189

12 Query translation
- Motivation – problems of simple lexicon-based approaches
  - The lexicon is incomplete
  - It is difficult to select correct translations
- Solution – improved lexicon-based approach
  - Term disambiguation using co-occurrence
  - Phrase detection and translation using a language model (LM)
  - Translation coverage enhancement using a translation model (TM)

13 Term disambiguation
- Assumption – correct translation words tend to co-occur in Chinese text
- A greedy algorithm (sketched below):
  - for English terms Te = (e1, …, en), find their Chinese translations Tc = (c1, …, cn) such that Tc = argmax SIM(c1, …, cn)
- Term-similarity matrix – trained on a Chinese corpus
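The slide gives only the argmax objective; below is one simple greedy instantiation, a sketch assuming the lexicon maps each English term to its candidate Chinese translations and sim() is the co-occurrence similarity from the term-similarity matrix.

```python
def greedy_translate(english_terms, lexicon, sim):
    """Pick one Chinese translation per English term so that the chosen set
    has high pairwise co-occurrence similarity (Tc = argmax SIM(c1, ..., cn))."""
    chosen = []
    for term in english_terms:
        candidates = lexicon.get(term, [])
        if not candidates:
            continue  # out-of-lexicon term; slide 15 addresses this with a TM
        if not chosen:
            chosen.append(candidates[0])  # no context yet; take the first sense
            continue
        # greedily keep the candidate most similar to the translations so far
        best = max(candidates, key=lambda c: sum(sim(c, p) for p in chosen))
        chosen.append(best)
    return chosen
```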

14 Phrase detection and translation
- Multi-word phrases are detected by a base-NP detector
- Translation pattern (PAT_Te), e.g. …
- Phrase translation (a sketch follows below):
  - Tc = argmax P(O_Tc | PAT_Te) P(Tc)
  - P(O_Tc | PAT_Te): probability of the translation pattern
  - P(Tc): probability of the phrase under the Chinese LM
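A sketch of the argmax, assuming candidate phrase translations come paired with a word-order tag O_Tc, pattern_prob holds P(O_Tc | PAT_Te), and chinese_lm returns the phrase probability under the Chinese LM; all container shapes are illustrative, not the system's actual data structures.

```python
import math

def best_phrase_translation(english_pattern, candidates, pattern_prob, chinese_lm):
    """Pick Tc = argmax P(O_Tc | PAT_Te) * P(Tc), computed in log space.

    candidates is a list of (chinese_phrase, order_tag) hypotheses;
    pattern_prob maps (order_tag, english_pattern) to a probability.
    """
    best, best_logp = None, float("-inf")
    for phrase, order_tag in candidates:
        p_pattern = pattern_prob.get((order_tag, english_pattern), 1e-9)
        logp = math.log(p_pattern) + math.log(max(chinese_lm(phrase), 1e-12))
        if logp > best_logp:
            best, best_logp = phrase, logp
    return best
```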

15 Using a translation model (TM)
- Enhances the coverage of the lexicon
- Using the TM: Tc = argmax P(Te|Tc) SIM(Tc)
- Parallel texts are mined from the Web for TM training
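The TM feeds the same coherence measure used on slide 13; a sketch of the coverage-enhancement step, assuming the TM maps an English term to (Chinese word, probability) pairs — the shapes and the top-n cutoff are assumptions for illustration.

```python
def candidate_translations(term, lexicon, tm, top_n=3):
    """Return lexicon candidates when available; otherwise fall back to the
    translation model's most probable Chinese words for the term."""
    if term in lexicon:
        return lexicon[term]
    ranked = sorted(tm.get(term, []), key=lambda wp: wp[1], reverse=True)
    return [word for word, _prob in ranked[:top_n]]
```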

16 Experiments on TREC-5&6
- Monolingual
- Simple translation: lexicon look-up
- Best-sense translation: simple translation plus manual selection
- Improved translation (our method)
- Machine translation: using the IBM MT system

17 Summary of Experiments

  | Translation method          | Avg. P. | % of Mono. IR
1 | Monolingual                 | 0.5150  |
2 | Simple translation (m-mode) | 0.2722  | 52.85%
3 | Simple translation (u-mode) | 0.3041  | 59.05%
4 | Best-sense translation      | 0.3762  | 73.05%
5 | Improved translation        | 0.3883  | 75.40%
6 | Machine translation         | 0.3891  | 75.55%
7 | 5 + 6                       | 0.4400  | 85.44%

18 Using link information for web retrieval
- Motivation
  - The effectiveness of link-based retrieval
  - Evaluation on the TREC web collection
- Link-based web retrieval – the state of the art
  - Recommendation – a high in-degree is better
  - Topic locality – connected pages are similar
  - Anchor description – a page is represented by the anchor text pointing to it (sketched below)
- Link-based retrieval in TREC – no good results so far
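A sketch of the anchor-description idea: represent each page by the text of the anchors pointing at it rather than by its own content. The (source, target, anchor_text) triple format is an assumption for illustration.

```python
from collections import defaultdict

def anchor_documents(links):
    """Build an anchor-text surrogate document for every link target.

    links is an iterable of (source_url, target_url, anchor_text) triples.
    """
    surrogate = defaultdict(list)
    for _source, target, text in links:
        surrogate[target].append(text)
    # each surrogate can then be indexed and ranked like ordinary text
    return {url: " ".join(texts) for url, texts in surrogate.items()}
```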

19 Experiments on TREC-9
- Baseline – content-based IR
- Anchor description
  - Used alone – much worse than the baseline
  - Combined with the content description – trivial improvement
- Re-ranking – trivial improvement
- Spreading – no positive effect

20 Summary of Experiments

  | Technique             | Average precision
1 | Baseline              | 22.08%
2 | 1 + QE                | 22.21%
3 | 1 + anchor text       | 22.23%
4 | 2 + anchor text       | 22.84%
5 | 4 + anchor re-ranking | 23.28%

21 Blind feedback for web retrieval
- Motivation
  - Web queries are short
  - The web collection is huge and highly mixed
- Blind feedback – refine web queries (a sketch of one feedback round follows below)
  - Using the global web collection
  - Using a local web collection
  - Using another well-organized collection, e.g. Encarta
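A minimal sketch of one blind-feedback round: treat the top-ranked documents from the first pass as relevant and expand the query with their salient terms. The frequency-based term selection and the cutoffs here are placeholders; the slides do not specify the actual term weighting.

```python
from collections import Counter

def blind_feedback_expand(query_terms, ranked_docs, top_k=10, n_terms=20):
    """Expand a query by assuming the top_k first-pass documents are relevant.

    ranked_docs is the first-pass ranking as a list of token lists; the
    n_terms most frequent terms from the assumed-relevant pool are appended.
    """
    pool = Counter()
    for doc in ranked_docs[:top_k]:
        pool.update(doc)
    expansion = [t for t, _ in pool.most_common(n_terms) if t not in query_terms]
    return list(query_terms) + expansion
```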

22 Experiments on TREC-9
- Baseline – two-stage pseudo-relevance feedback (PRF) using the global web collection
- Local context analysis [Xu et al., 96] – two-stage PRF using the local web collection retrieved in the first stage
- Two-stage PRF using the Encarta collection in the first stage

23 Summary of Experiments
???

24 Improving the effectiveness of IR with clustering and fusion
- Clustering hypothesis – documents relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together.
- Fusion hypothesis – different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents (a CombSUM-style sketch follows below).
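One common way to act on the fusion hypothesis is the CombSUM rule; a sketch, assuming each input ranking is a list of (doc_id, score) pairs with higher scores better.

```python
def comb_sum(ranked_lists):
    """Fuse ranked lists by normalizing each list's scores to [0, 1] and
    summing a document's scores across lists (the CombSUM rule)."""
    fused = {}
    for ranking in ranked_lists:
        if not ranking:
            continue
        top = max(score for _, score in ranking)
        for doc_id, score in ranking:
            norm = score / top if top > 0 else 0.0
            fused[doc_id] = fused.get(doc_id, 0.0) + norm
    # documents that appear with high scores in several lists rise to the top
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```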

25 Thanks!

