Retrieval Models for Question and Answer Archives
Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft
Computer Science Department, University of Massachusetts; Google, Inc. (SIGIR 2008)
Speaker: Chin-Wei Cho
Advisor: Dr. Jia-Ling Koh
Date: 2009/1/8

Outline
- Introduction
- Query likelihood language model vs. IBM translation model
- The retrieval model for Question and Answer Archives
- Learning word-to-word translation probabilities
- Experiments
- Conclusion and Future Work

Introduction
- Question and Answer (Q&A) archives have become an important information resource on the Web (e.g., Yahoo! Answers, Live QnA).
- The retrieval task in a Q&A archive is to find relevant question-answer pairs for new questions posed by the user.

Introduction
- Advantages of Q&A retrieval over Web search:
  - The user can pose the query in natural language instead of only keywords, and can therefore express his/her information need more clearly.
  - The system returns several possible answers directly instead of a long list of ranked documents, which increases the efficiency of finding the required answer.
  - Q&A retrieval can also be considered an alternative solution to the general Question Answering (QA) problem: since the answers in the archive are generated by humans, the difficult QA task of extracting a correct answer is transformed into the Q&A retrieval task.

Introduction
- Challenge for Q&A retrieval: word mismatch between the user's question and the question-answer pairs in the archive.
  - "What is Steve Jobs best known for?" and "Who is the CEO of Apple Inc?" are similar questions but have no words in common.
- We focus on translation-based approaches, since the relationships between words can be explicitly modeled through word-to-word translation probabilities.

Introduction
- Design the translation-based retrieval model:
  - IBM translation model 1
  - Query likelihood language model
- Learn good word-to-word translation probabilities:
  - The asker and the answerer may express similar meanings with different words.
  - Use the question-answer pairs as the "parallel corpus".
  - Source : Target => Q:A, A:Q, or both?

Introduction
- For the question part, the query is generated by our proposed translation-based language model.
- For the answer part, the query is simply generated by the query likelihood language model.
- Our final model for Q&A retrieval is a combination of the above models.

Query likelihood language model vs. IBM model
- q is the query and D is the document.
- C is the background collection.
- λ is the smoothing parameter.
- |D| and |C| are the lengths of D and C.
- #(t,D) denotes the frequency of term t in D.
- P(w|null) is the probability that the term w is translated (generated) from the null term.
- P(w|t) is the translation probability from word t to word w.
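The scoring equations themselves appeared as images in the original slides. Below is a minimal sketch of the two standard forms, written with the symbols defined above (the exact smoothing formulation used in the paper may differ slightly):

```latex
% Query likelihood language model (Jelinek-Mercer smoothing)
P(q \mid D) = \prod_{w \in q} \Big[(1-\lambda)\, P_{ml}(w \mid D) + \lambda\, P_{ml}(w \mid C)\Big],
\qquad P_{ml}(w \mid D) = \frac{\#(w, D)}{|D|}

% IBM translation model 1 applied to retrieval
P(q \mid D) = \prod_{w \in q} \frac{1}{|D| + 1}\Big[P(w \mid \mathrm{null}) + \sum_{t \in D} \#(t, D)\, P(w \mid t)\Big]
```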

Query likelihood language model vs. IBM model
- P_ml(w|C) vs. P(w|null)
  - Query likelihood: the background distribution generates the common terms that connect content words.
  - IBM: spurious terms in the target sentence are generated from the null term, which is a little awkward and less stable.
- λ vs. 1
  - The lack of a mechanism to control background smoothing in the IBM model leads to poor performance.
- P_ml(w|D) vs. P_tr(w|D)
  - Query likelihood: uses the maximum likelihood estimator, which gives zero probabilities to words unseen in the document.
  - IBM: every word in the document has some probability of being translated into a target word, and these probabilities are added up to calculate the sampling probability.

Query likelihood language model vs. IBM model
- However, we cannot simply choose the sampling method used in the IBM model, because of the self-translation problem: since the target and the source languages are the same, every word has some probability of translating into itself.
  - Low self-translation probabilities reduce retrieval performance by giving very low weights to the matching terms.
  - Very high self-translation probabilities do not exploit the merits of the translation approach.

The retrieval model
- Our final translation-based language model for the question part.
- C denotes the whole archive, C = {(q,a)_1, (q,a)_2, ..., (q,a)_L}.
- Q denotes the set of all questions in C, Q = {q_1, q_2, ..., q_M}.
- A denotes the set of all answers in C, A = {a_1, a_2, ..., a_N}.
- Given a new user question q_new, the task of Q&A retrieval is to rank the pairs (q,a)_i according to score(q_new, (q,a)_i).
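The model equation was shown as an image in the slides; the following is a hedged reconstruction of the translation-based language model (TransLM) for the question part, using the notation above (the exact parameterization of λ and β in the paper may differ):

```latex
\mathrm{score}(q_{new}, (q,a)_i) = \prod_{w \in q_{new}} P(w \mid q_i)

P(w \mid q_i) = (1-\lambda)\Big[(1-\beta)\, P_{ml}(w \mid q_i) + \beta\, P_{tr}(w \mid q_i)\Big] + \lambda\, P_{ml}(w \mid C)

P_{tr}(w \mid q_i) = \sum_{t \in q_i} P(w \mid t)\, P_{ml}(t \mid q_i)
```

Mixing the maximum likelihood estimate with the translation estimate is what lets the model keep exact-match evidence without depending on the translation table's self-translation probabilities, which is the issue raised on the previous slide.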

The retrieval model
- Linearly mix two different estimations: maximum likelihood estimation and translation-based estimation.
- Query likelihood language model for the answer part.
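The combined formula was also an image in the slides. One plausible form, assuming a simple linear interpolation of the question-part and answer-part estimates (the mixture weights α, β, γ are assumptions for illustration):

```latex
P(w \mid (q,a)_i) = (1-\lambda)\Big[\alpha\, P_{ml}(w \mid q_i) + \beta\, P_{tr}(w \mid q_i) + \gamma\, P_{ml}(w \mid a_i)\Big] + \lambda\, P_{ml}(w \mid C),
\qquad \alpha + \beta + \gamma = 1
```

Setting γ = 0 recovers the question-part model above, while setting β = 0 reduces to the word-level combination of the question and answer language models (LM-Comb) used later as a baseline.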

Learning word-to-word translation probabilities
- In a Q&A archive, question-answer pairs can be considered a type of parallel corpus, which is used for estimating word-to-word translation probabilities.
- In the original IBM translation model 1, English is the source language and French is the target language.
- Since the questions and answers in a Q&A archive are written in the same language, the word-to-word translation probability can be calculated by setting either one as the source and the other as the target.
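As an illustration of how such probabilities can be learned, below is a minimal IBM Model 1 EM trainer over (source, target) token-sequence pairs. This is a generic sketch rather than the exact training setup of the paper (toolkits such as GIZA++ were typically used for this); all function and variable names are illustrative. Feeding (question, answer) pairs gives P(A|Q); swapping the order of each pair gives P(Q|A).

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """EM training of IBM Model 1 word translation probabilities P(target_word | source_word).

    pairs: list of (source_tokens, target_tokens) tuples,
           e.g. (question tokens, answer tokens) to learn P(A|Q).
    Returns a dict mapping (target_word, source_word) -> probability.
    A "<null>" token is added to every source side, mirroring the null-word generation in IBM Model 1.
    """
    # Uniform initialization over the target vocabulary.
    target_vocab = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(target_word, source_word)
        total = defaultdict(float)   # expected counts c(source_word)

        # E-step: distribute each target word's count over the possible source words.
        for src, tgt in pairs:
            src = ["<null>"] + list(src)
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac

        # M-step: re-estimate t(w|s) = c(w, s) / c(s).
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})

    return t

# Usage sketch on toy data: P(A|Q) uses questions as source, answers as target;
# P(Q|A) simply reverses each pair.
qa_pairs = [("who is the ceo of apple".split(), "steve jobs runs apple".split())]
p_a_given_q = train_ibm_model1(qa_pairs)
p_q_given_a = train_ibm_model1([(a, q) for q, a in qa_pairs])
```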

Learning word-to-word translation probabilities
- P(A|Q) denotes the word-to-word translation probability with the question as the source and the answer as the target.
- P(Q|A) denotes the opposite configuration.
- Example:
  - Question word "cheat" => answer words "trust", "forgive", "dump", "leave"
  - Answer word "cheat" => question words "husband", "boyfriend"

Learning word-to-word translation probabilities  w2 should be more similar to w1 than w3. This intuition will be considered implicitly by combining P(Q|A) and P(A|Q), since P(w2|w1) will get contributions from both P(Q|A) and P(A|Q), but P(w3|w1) only gets the contribution from P(A|Q). Q A Q A w1 w2 w2 w1 w3

Learning word-to-word translation probabilities  Combine P(Q|A) and P(A|Q) instead of choosing just one of them linearly combines pools the Q-A pairs used for learning P(A|Q) and the A- Q pairs used for learning P(Q|A) together, and learn the combined word-to-word translation probabilities

Experiments
- The Wondir collection: 1 million Q-A pairs.
  - Topics for questions are very diverse, ranging from restaurant recommendations to rocket science.
  - The average length of the question part and the answer part is 27 words and 28 words, respectively.
  - Spelling errors are very common in this collection, which makes the word mismatch problem very serious.
- 50 questions from the TREC-9 QA track are used for testing.

Experiments
- Since the relevance of the answer to its corresponding question is usually guaranteed, the retrieval performance of a system can be measured by the rank of the relevant questions it returns.
- Ranking algorithms first output question-answer pair ranks, which are then transformed into question ranks.
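A minimal sketch of that rank transformation, assuming the convention is to keep the best-ranked pair for each distinct question (the exact de-duplication rule is an assumption):

```python
def to_question_ranking(ranked_pairs):
    """Collapse a ranked list of (question, answer) pairs into a ranked list of questions,
    keeping only the first (i.e., best-ranked) occurrence of each question."""
    seen = set()
    ranked_questions = []
    for question, _answer in ranked_pairs:
        if question not in seen:
            seen.add(question)
            ranked_questions.append(question)
    return ranked_questions
```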

Experiments-1
- Shows the importance of the question part and the answer part for Q&A retrieval.
- The query likelihood retrieval model was used with the question parts, the answer parts, and the question-answer pairs.

Experiments-2: Comparison
- Three types of baselines:
  - Type I: Query likelihood language model (LM), Okapi BM25 (Okapi), and the relevance model (RM). This type of baseline represents state-of-the-art retrieval models.
  - Type II: The combination model that combines the language models estimated from the question part and the answer part at the word level (LM-Comb). This model is equivalent to setting β to zero.
  - Type III: Other translation-based models. This type of baseline represents previous work on translation-based language models.

Experiments-2: Comparison
- The TransLM model performs better than the state-of-the-art retrieval systems.
- P(A|Q) is more effective than P(Q|A), which can be explained by the question source being more important than the answer source for generating the user question.

Experiments-3

Experiments-4
- Compares the effect of P_lin and P_pool with P(A|Q) and P(Q|A) when used with TransLM.

Experiments-5

Experiments-6
- TransLM+QL: our retrieval model for question-answer pairs that incorporates the answer part.
- Compares TransLM+QL with TransLM and LM-Comb. P_pool is used as the method for estimating translation probabilities.

Experiments

Conclusion and Future Work
- Q&A retrieval has become an important issue due to the popularity of Q&A archives on the Web. In this paper, we propose a novel translation-based language model to solve this problem.
- The final model combines the translation-based language model estimated using the question part with the query likelihood language model estimated using the answer part.
- Different configurations of question-answer pairs are used to improve the quality of the translation probabilities.
- Phrase-based machine translation models have shown superior performance compared to word-based translation models in translation applications; we plan to study the effectiveness of these models in the Q&A setting.