Information Retrieval at NLC
Jianfeng Gao, NLC Group, Microsoft Research China


Outline
- People
- Projects
- Systems
- Research

People
- Jianfeng Gao, Microsoft Research, China
- Guihong Cao, Tianjin University, China
- Hongzhao He, Tianjin University, China
- Min Zhang, Tsinghua University, China
- Jian-Yun Nie, Université de Montréal
- Stephen Robertson, Microsoft Research, Cambridge
- Stephen Walker, Microsoft Research, Cambridge

Systems
- SMART (Master: Hongzhao)
  - Traditional IR system – vector space model (VSM), TF-IDF weighting
  - Holds a collection of more than 500 MB
  - Runs on Linux
- Okapi (Master: Guihong)
  - Modern IR system – probabilistic model, BM25 ranking (see the sketch below)
  - Holds a collection of more than 10 GB
  - Runs on Windows 2000
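For reference, the sketch below shows the BM25 weighting that Okapi popularized. It is a minimal illustration, not the Okapi code itself; the parameter defaults (k1 = 1.2, b = 0.75), the IDF smoothing, and the function names are assumptions.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    query_terms: list of query tokens
    doc_terms:   list of document tokens
    doc_freq:    dict mapping term -> number of documents containing it
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Robertson-style IDF, floored away from zero by the +1.0
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # length-normalized term frequency
        score += idf * (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```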

Projects
- CLIR – TREC-9 (Japanese NTCIR-3)
  - System: SMART
  - Focus:
    - Chinese indexing units [Gao et al., 00] [Gao & He, 01]
    - Query translation [Gao et al., 01]
- Web retrieval – TREC-10
  - System: Okapi
  - Focus:
    - Blind feedback … [Zhang et al., 01]
    - Link-based retrieval (anchor text) … [Craswell et al., 01]

Research
- Best indexing unit for Chinese IR
- Query translation
- Using link information for Web retrieval
- Blind feedback for Web retrieval
- Improving the effectiveness of IR with clustering and fusion

Best indexing unit for Chinese IR
- Motivation
  - What is the basic unit of indexing in Chinese IR – word, n-gram, or a combination?
  - Does the accuracy of word segmentation have a significant impact on IR performance?
- Experiment 1 – indexing units
- Experiment 2 – the impact of word segmentation

Experiment 1 – settings
- System – SMART (modified version)
- Corpus – TREC-5&6 Chinese collection
- Experiments (the sketch after this list shows how combined units can be built)
  - Impact of the dictionary – longest matching with a small dictionary vs. a large dictionary
  - Combining the first method with single characters
  - Using full segmentation
  - Using bigrams and unigrams (single characters)
  - Combining words with bigrams and characters
  - Unknown-word detection using NLPWin
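To make the combined indexing schemes concrete, here is a minimal sketch that builds a unit set from longest-match words, single characters, and character bigrams. The dictionary handling and function names are assumptions for illustration, not the modified SMART implementation.

```python
def longest_match_segment(text, dictionary, max_word_len=4):
    """Greedy forward longest-match word segmentation.

    dictionary is a set of known Chinese words; unmatched characters
    fall back to single-character units.
    """
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

def indexing_units(text, dictionary):
    """Combine longest-match words, single characters, and character bigrams."""
    words = longest_match_segment(text, dictionary)
    chars = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return words + chars + bigrams
```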

Experiment 1 – results
- Best combination: word + character + (bigram) + unknown words

Experiment 2 – settings
- Systems
  - SMART
  - Songrou's segmentation & evaluation system
- Corpora
  - (1) TREC-5&6 Chinese IR collection
  - (2) Songrou's corpus
    - 12rst.txt, 181 KB
    - 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists)
  - (3) Sample drawn from Songrou's corpus
    - test.txt, 20 KB (random sample from 12rst.txt)
    - standard.src, 28 KB (standard segmentation corresponding to test.txt)

Experiment 2 – results
- Notes A: segmentation variants are 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix
- Notes B: feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)
- Columns of the results table: segmentation variant; precision on the larger corpus; precision on the sample; proper-noun precision; proper-noun recall; SMART result (11-pt avg.); feedback result (10, 500) (11-pt avg.); feedback result (100, 500) (11-pt avg.)
- The numeric cells of the table were garbled in the transcript and are not reproduced here

Query translation
- Motivation – problems of simple lexicon-based approaches
  - The lexicon is incomplete
  - It is difficult to select correct translations
- Solution – an improved lexicon-based approach
  - Term disambiguation using co-occurrence
  - Phrase detection and translation using a language model (LM)
  - Translation coverage enhancement using a translation model (TM)

Term disambiguation
- Assumption – correct translation words tend to co-occur in Chinese text
- A greedy algorithm (sketched below): for English terms Te = (e1, …, en), find their Chinese translations Tc = (c1, …, cn) such that Tc = argmax SIM(c1, …, cn)
- The term-similarity matrix is trained on a Chinese corpus
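A minimal sketch of the greedy selection described above, assuming each English term comes with a non-empty list of candidate translations and a pairwise co-occurrence similarity function; the control flow and names are illustrative, not the exact NLC algorithm.

```python
def greedy_disambiguate(candidates, sim):
    """Greedily pick one Chinese translation per English term so that the
    chosen set Tc has high pairwise co-occurrence similarity SIM.

    candidates: list of lists; candidates[i] holds the translation options
                of the i-th English term (each list assumed non-empty)
    sim:        callable sim(c1, c2) -> similarity from the term matrix
    """
    chosen = []
    remaining = list(range(len(candidates)))
    while remaining:
        best = None
        for i in remaining:
            for c in candidates[i]:
                if chosen:
                    # cohesion with translations already fixed
                    score = sum(sim(c, x) for x in chosen)
                else:
                    # nothing fixed yet: cohesion with all other candidates
                    score = sum(sim(c, x)
                                for j in remaining if j != i
                                for x in candidates[j])
                if best is None or score > best[0]:
                    best = (score, i, c)
        _, i, c = best
        chosen.append(c)
        remaining.remove(i)
    return chosen
```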

Phrase detection and translation
- Multi-word phrases are detected by a base-NP detector
- Each English phrase is associated with a translation pattern PAT_Te
- Phrase translation: Tc = argmax P(O_Tc | PAT_Te) · P(Tc)
  - P(O_Tc | PAT_Te): probability of the translation pattern
  - P(Tc): probability of the phrase under the Chinese LM
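The scoring rule above can be illustrated as follows; the candidate representation, the smoothing floors, and the function names are assumptions, not the actual system.

```python
import math

def best_phrase_translation(candidates, pattern_prob, lm_prob):
    """Pick the candidate maximizing P(O_Tc | PAT_Te) * P(Tc), in log space.

    candidates:   list of (chinese_phrase, word_order) pairs
    pattern_prob: dict mapping a word order O_Tc to P(O_Tc | PAT_Te)
    lm_prob:      callable returning P(phrase) under a Chinese language model
    """
    def log_score(item):
        phrase, order = item
        # small floors avoid log(0) for unseen orders or phrases
        return (math.log(pattern_prob.get(order, 1e-9))
                + math.log(max(lm_prob(phrase), 1e-12)))
    return max(candidates, key=log_score)
```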

Using a translation model (TM)
- Goal: enhance the coverage of the lexicon
- Selection with the TM: Tc = argmax P(Te | Tc) · SIM(Tc)
- Parallel texts mined from the Web are used for TM training
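A one-line sketch of the TM-based selection rule, assuming the translation-model probability P(Te | Tc) and the cohesion score SIM(Tc) are available as callables; the names are illustrative.

```python
def best_tm_translation(english_terms, candidates, tm_prob, cohesion):
    """Select the Chinese translation set Tc maximizing P(Te | Tc) * SIM(Tc).

    english_terms: the English query terms Te
    candidates:    iterable of candidate Chinese term tuples Tc
    tm_prob:       callable returning P(Te | Tc) from the translation model
    cohesion:      callable returning the co-occurrence score SIM(Tc)
    """
    return max(candidates,
               key=lambda tc: tm_prob(english_terms, tc) * cohesion(tc))
```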

Experiments on TREC-5&6
1. Monolingual
2. Simple translation: lexicon lookup
3. Best-sense translation: method 2 plus manual selection of translations
4. Improved translation (our method)
5. Machine translation: using an IBM MT system

Summary of Experiments
- Translation methods compared: 1 Monolingual; 2 Simple translation (m-mode); 3 Simple translation (u-mode); 4 Best-sense translation; 5 Improved translation; 6 Machine translation
- Columns reported: average precision and % of monolingual IR (the numeric values are not preserved in this transcript)

Using link information for Web retrieval
- Motivation
  - How effective is link-based retrieval?
  - Evaluation on the TREC Web collection
- Link-based Web retrieval – the state of the art
  - Recommendation – a high in-degree is better
  - Topic locality – connected pages are similar
  - Anchor description – a page is represented by its anchor text
- Link-based retrieval in TREC – no good results so far

Experiments on TREC-9
- Baseline – content-based IR
- Anchor description
  - Used alone – much worse than the baseline
  - Combined with the content description – trivial improvement (see the score-combination sketch below)
  - Re-ranking – trivial improvement
  - Spreading – no positive effect
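The "combined with the content description" run suggests a simple interpolation of content and anchor-text scores. Below is a minimal sketch assuming a page record with "body" and "anchor_texts" fields, generic scoring callables, and an interpolation weight alpha; this is not the actual TREC-9 run configuration.

```python
def combined_score(query, page, content_score, anchor_score, alpha=0.8):
    """Interpolate a content-based score with an anchor-text score.

    page is assumed to carry the page body under "body" and the anchor
    texts of incoming links under "anchor_texts"; content_score and
    anchor_score can be any retrieval function (e.g. BM25) over the
    corresponding representation.
    """
    s_content = content_score(query, page["body"])
    s_anchor = anchor_score(query, page["anchor_texts"])
    # alpha close to 1 keeps the content score dominant, in line with the
    # observation that anchor text alone performs much worse
    return alpha * s_content + (1 - alpha) * s_anchor
```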

Summary of Experiments

   Technique            Average precision
1  Baseline             22.08%
2  QE                   22.21%
3  Anchor text          22.23%
4  Anchor text          22.84%
5  Anchor re-ranking    23.28%

Blind feedback for Web retrieval
- Motivation
  - Web queries are short
  - The Web collection is huge and highly mixed
- Blind feedback – refine Web queries using
  - the global Web collection
  - a local Web collection
  - another well-organized collection, e.g. Encarta

Experiments on TREC-9
- Baseline – 2-stage pseudo-relevance feedback (PRF) using the global Web collection (a generic sketch of the 2-stage loop follows this list)
- Local context analysis [Xu et al., 96] – 2-stage PRF using the local Web collection retrieved in the first stage
- 2-stage PRF using the Encarta collection in the first stage
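A generic sketch of the 2-stage pseudo-relevance feedback loop shared by the three runs, assuming callable search and document-access functions; the term-selection heuristic (raw frequency in the top documents) is a placeholder, not the weighting actually used.

```python
from collections import Counter

def two_stage_prf(query_terms, first_search, second_search, doc_terms,
                  top_k=10, n_expansion=20):
    """Generic 2-stage pseudo-relevance feedback.

    first_search(terms)  -> ranked doc ids from the feedback collection
                            (global Web, local result set, or Encarta)
    second_search(terms) -> ranked doc ids from the target Web collection
    doc_terms(docid)     -> list of terms in a feedback document
    """
    # Stage 1: retrieve from the feedback collection and pick frequent
    # terms from the top documents as expansion terms
    top_docs = first_search(query_terms)[:top_k]
    counts = Counter()
    for docid in top_docs:
        counts.update(doc_terms(docid))
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:n_expansion]
    # Stage 2: run the expanded query against the target collection
    return second_search(list(query_terms) + expansion)
```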

Summary of Experiments
???

Improving the effectiveness of IR with clustering and fusion
- Clustering hypothesis – documents relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together
- Fusion hypothesis – different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents (a simple fusion sketch follows)
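The fusion hypothesis is commonly exploited with simple score combination such as CombSUM. Here is a minimal sketch with per-list min-max normalization, offered as an illustration rather than the method used at NLC.

```python
def comb_sum(ranked_lists):
    """Fuse ranked lists by summing min-max normalized scores per document.

    ranked_lists: list of dicts mapping doc_id -> retrieval score
    Returns (doc_id, fused_score) pairs sorted by descending fused score.
    """
    fused = {}
    for scores in ranked_lists:
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for flat lists
        for doc_id, s in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + (s - lo) / span
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```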

Thanks!