Query Dependent Pseudo-Relevance Feedback based on Wikipedia (SIGIR '09)
Advisor: Dr. Koh Jia-Ling
Speaker: Lin, Yi-Jhen
Date: 2010/01/24

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Introduction
A query is often too short to express the specific information need clearly. Among query expansion methods, we prefer pseudo-relevance feedback (PRF) because it requires no user input. The problem is that the top retrieved documents frequently contain non-relevant documents, which introduce noise into the expanded query.

Introduction
Meanwhile, resources have emerged on the web that can potentially supplement the initial search in PRF, e.g. Wikipedia. Wikipedia covers a great many topics and reflects the diverse interests and information needs of users. The basic entry in Wikipedia is an entity page: an article that contains information focusing on one single entity.

Introduction
The aim of this study is to explore the utility of Wikipedia as a resource for improving PRF in IR. We propose three methods for expansion term selection, each modeling the Wikipedia-based pseudo-relevance information from a different perspective. We incorporate the expansion terms into the original query and use language-modeling IR to evaluate these methods.

Introduction: Query Categorization
Using PRF on the basis of Wikipedia, we categorize a query into one of three types:
1) EQ: queries about a specific entity, e.g. "Scalable Vector Machine"
2) AQ: ambiguous queries, e.g. "Apple", "Pirate", "Poaching"
3) BQ: broader queries (neither EQ nor AQ)

Introduction: Pseudo-Relevant Documents
Pseudo-relevant documents are generated in one of two ways according to the query type:
1) using the top-ranked articles retrieved from Wikipedia (treated as a collection) in response to the query
2) using the Wikipedia entity page corresponding to the query

Introduction: Term Distribution and Structure
In selecting expansion terms, term distributions and the structure of Wikipedia pages are taken into account. We propose and compare a supervised method and an unsupervised (field-based) method for this task.

Introduction: Example
For the query "data mining": check Wikipedia to find pseudo-relevant documents, select terms from them to expand the query, then retrieve the top-ranked documents with the expanded query.

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Query Categorization
We first briefly summarize the relevant features of Wikipedia (the data set used in our study), and then examine the different categories of queries typically encountered in user queries.

Wikipedia Data Set
An analysis of 200 randomly selected articles, each describing a single topic, showed that only 4% failed to adhere to a standard format.

Query Categorization
We define three types of queries according to their relationship to Wikipedia topics:
1) EQ: queries about a specific entity, e.g. "Scalable Vector Machine"
2) AQ: ambiguous queries, e.g. "Apple", "Pirate", "Poaching"
3) BQ: broader queries (neither EQ nor AQ)

Query Categorization
Strategy for AQ (a disambiguation process). Given an ambiguous query:
1) retrieve the top 100 documents with a query-likelihood language model (initial search)
2) cluster these documents using K-means
3) rank the clusters with a cluster-based language model
4) compare the top-ranked cluster to all referents (entity pages) extracted from the disambiguation page associated with the query
5) choose the top-matching entity page for the query
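The matching in the last two steps of the disambiguation process can be sketched as follows. This is a minimal illustration, not the paper's exact comparison measure: a smoothed unigram model is built from the top cluster's term counts, and the referent whose page text is most likely under that model is chosen. The term counts and referent names below are hypothetical toy data.

```python
import math

def cluster_lm(tf, mu=1.0, vocab=None):
    """Unigram language model over a cluster's term counts,
    with add-mu smoothing over the shared vocabulary."""
    vocab = vocab or set(tf)
    total = sum(tf.values()) + mu * len(vocab)
    return {w: (tf.get(w, 0) + mu) / total for w in vocab}

def match_referent(top_cluster_tf, referents):
    """Pick the referent entity page whose text is most likely
    under the top-ranked cluster's language model."""
    vocab = set(top_cluster_tf)
    for terms in referents.values():
        vocab |= set(terms)
    lm = cluster_lm(top_cluster_tf, vocab=vocab)
    def loglik(terms):
        return sum(math.log(lm[w]) for w in terms)
    return max(referents, key=lambda name: loglik(referents[name]))

# Hypothetical toy data for the ambiguous query "apple"
cluster = {"fruit": 5, "tree": 3, "orchard": 2}
referents = {
    "Apple (fruit)": ["fruit", "tree", "orchard"],
    "Apple Inc.": ["computer", "iphone", "company"],
}
best = match_referent(cluster, referents)
```

Here the top cluster is clearly about the fruit sense, so the fruit referent scores highest.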

Query Categorization
Step 3 ranks the clusters with the cluster-based language model proposed by Lee et al. [19].

Query Categorization: Evaluation
650 queries in total. Each participant was asked to judge whether a query is ambiguous. If it was, the participant determined which referent from the disambiguation page is most likely to be mapped to the query. If it was not, the participant searched Wikipedia manually with the query to identify whether it is defined by Wikipedia (EQ).

Query Categorization: Evaluation
Participants were in general agreement (87%) on whether a query is ambiguous. On which referent a query should be mapped to, there was only 54% agreement.

Query Categorization: Evaluation
We also evaluate the effectiveness of the cluster-based disambiguation process (most queries from the TREC topic sets are AQ). For each query, if at least two participants indicated a referent as the most likely mapping target, that target was used as the answer; a query with no answer was not counted in the evaluation. Experimental results show that our disambiguation process leads to an accuracy of 57% for AQ.

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Query Expansion Method
#1 Relevance Model (baseline)
A relevance model is a query expansion approach based on the language modeling framework. In this model, the query words q1, q2, ..., qm and a word w in the relevant documents are sampled independently and identically from a distribution R.
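The baseline relevance model can be sketched as below: P(w|R) is estimated as the sum over pseudo-relevant documents of P(w|D)·P(Q|D), with a uniform document prior. The smoothing (add-one) and the toy documents are illustrative simplifications, not the paper's exact estimation details.

```python
import math
from collections import Counter

def doc_model(tf, vocab):
    """Add-one smoothed unigram model P(w|D) over a shared vocabulary."""
    total = sum(tf.values()) + len(vocab)
    return {w: (tf.get(w, 0) + 1) / total for w in vocab}

def relevance_model(query, docs):
    """RM1: P(w|R) proportional to sum_D P(w|D) * P(Q|D).
    `docs` maps a doc id to a term-count dict."""
    vocab = set(query)
    for tf in docs.values():
        vocab |= set(tf)
    pwr = Counter()
    for tf in docs.values():
        pd = doc_model(tf, vocab)
        p_q_given_d = math.prod(pd[q] for q in query)  # query likelihood
        for w in vocab:
            pwr[w] += pd[w] * p_q_given_d
    total = sum(pwr.values())
    return {w: p / total for w, p in pwr.items()}

# Toy pseudo-relevant documents for the query ["data", "mining"]
docs = {
    "d1": {"data": 4, "mining": 3, "pattern": 2},
    "d2": {"data": 1, "music": 5},
}
rm = relevance_model(["data", "mining"], docs)
```

Because d1 matches both query terms, its vocabulary (e.g. "pattern") dominates the estimated relevance model over terms from the weaker-matching d2 (e.g. "music").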

Query Expansion Method
#2 Strategy for Entity/Ambiguous Queries
Both EQ and AQ can be associated with a specific Wikipedia entity page. Instead of the top-ranked documents from the test collection, only the corresponding entity page from Wikipedia is used as the pseudo-relevant information.

Query Expansion Method
#2 ranks all terms in the entity page and chooses the top K for expansion:
Score(t) = tf * idf, where idf = log(N / df) and N is the number of documents in the Wikipedia collection.
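The tf·idf term ranking above is straightforward to sketch. The term frequencies, document frequencies, and collection size below are hypothetical toy numbers.

```python
import math

def score_terms(entity_tf, df, n_docs, k=3):
    """Rank terms in the entity page by Score(t) = tf * idf,
    with idf = log(N / df); return the top k terms for expansion."""
    scored = {t: tf * math.log(n_docs / df[t]) for t, tf in entity_tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical counts: term frequencies in the entity page, and
# document frequencies in a 1000-document Wikipedia collection
entity_tf = {"kernel": 8, "margin": 5, "the": 40}
df = {"kernel": 50, "margin": 30, "the": 1000}
top = score_terms(entity_tf, df, n_docs=1000)
```

Note how idf suppresses the stopword: "the" occurs in every document, so log(N/df) = 0 and its score is zero despite its high term frequency.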

Query Expansion Method: Term Distribution and Structure
#3 Field Evidence for Query Expansion
For example, the importance of a term appearing in the overview may differ from that of a term appearing in an appendix. We examine two methods for utilizing evidence from different fields:
#3.1 Unsupervised Method
#3.2 Supervised Method

#3.1 Unsupervised Method
We replace the term frequency of a pseudo-relevant document in the original relevance model with a linear combination of weighted field term frequencies.
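The weighted field term frequency can be sketched as below: tf'(t, D) = Σ_f w_f · tf(t, f), which then replaces the flat tf in the relevance model. The field names and weight values here are illustrative assumptions, not the paper's tuned settings.

```python
def weighted_tf(term, field_tf, field_weights):
    """Linearly combined field term frequency:
    tf'(t, D) = sum over fields f of w_f * tf(t, f)."""
    return sum(field_weights[f] * field_tf.get(f, {}).get(term, 0)
               for f in field_weights)

# Illustrative six-field entity-page layout with hypothetical weights
weights = {"title": 3.0, "overview": 1.0, "content": 2.0,
           "categories": 1.5, "links": 2.0, "infobox": 1.0}
fields = {"title": {"mining": 1},
          "content": {"mining": 4, "data": 6},
          "links": {"data": 2}}
wtf_mining = weighted_tf("mining", fields, weights)  # 3.0*1 + 2.0*4 = 11.0
```

A term in a heavily weighted field (e.g. the title) thus contributes more to the relevance model than the same raw count in a low-weight field.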

#3.2 Supervised Method
An alternative way of utilizing field evidence is to transform it into features for supervised learning. A radial basis function (RBF) kernel SVM is used, and each candidate expansion term is represented by a feature vector (10 features).

#3.2 Supervised Method
The first group of features consists of term distributions (TD) in the PRF documents and the collections:
1) TD in the test collection
2) TD in the PRF documents (top 10) from the test collection
3) TD in the Wikipedia collection
4) TD in the PRF documents (top 10) from the Wikipedia collection

#3.2 Supervised Method
The second group of features is based on field evidence (structure). As described before, we divide each entity page into six fields, and one feature is defined for each field.
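Assembling the 10-dimensional feature vector (four term-distribution features plus one per field) might look as follows. The normalization and the six field names are illustrative assumptions, not the paper's exact formulas; the resulting vectors would then be fed to an RBF-kernel SVM classifier.

```python
def term_features(term, coll_tf, prf_tf, wiki_tf, wiki_prf_tf,
                  field_tf, fields):
    """Build the 10-dimensional feature vector for a candidate
    expansion term: four term-distribution (TD) features, then one
    frequency feature per entity-page field."""
    def dist(tf):
        total = sum(tf.values()) or 1
        return tf.get(term, 0) / total
    td = [dist(coll_tf), dist(prf_tf), dist(wiki_tf), dist(wiki_prf_tf)]
    fld = [field_tf.get(f, {}).get(term, 0) for f in fields]
    return td + fld

# Illustrative six fields; all counts below are hypothetical toy data
FIELDS = ["title", "overview", "content", "categories", "links", "infobox"]
vec = term_features("mining",
                    coll_tf={"mining": 2, "data": 8},
                    prf_tf={"mining": 5, "data": 5},
                    wiki_tf={"mining": 1, "data": 9},
                    wiki_prf_tf={"mining": 4, "data": 6},
                    field_tf={"title": {"mining": 1},
                              "content": {"mining": 4}},
                    fields=FIELDS)
```

Each candidate term yields one such vector, and the SVM learns to separate good expansion terms from bad ones.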

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Experiment
Experiment settings: documents are retrieved for a given query by the query-likelihood language model (initial search). Experiments were conducted on four TREC collections, and retrieval effectiveness is measured by Mean Average Precision (MAP).

Experiment
Baselines:
- query-likelihood model (QL)
- relevance model on the test collection (RMC)
- relevance model based on Wikipedia (RMW)
Parameters for the relevance model: N = 10 (pseudo-relevant documents), K = 100 (expansion terms), λ = 0.6.
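The λ parameter above suggests the usual interpolation of the original query model with the feedback (relevance) model. A sketch, assuming λ weights the original query model (the slide does not state the direction, so this is an assumption following common practice):

```python
def interpolate(query_model, feedback_model, lam=0.6):
    """Final query model: P'(w|Q) = lam * P(w|Q) + (1 - lam) * P(w|R).
    Both models are unigram distributions given as dicts."""
    vocab = set(query_model) | set(feedback_model)
    return {w: lam * query_model.get(w, 0.0)
               + (1 - lam) * feedback_model.get(w, 0.0)
            for w in vocab}

# Toy models: original query vs. expansion terms from feedback
q = {"data": 0.5, "mining": 0.5}
fb = {"data": 0.4, "mining": 0.3, "pattern": 0.3}
mixed = interpolate(q, fb)
```

The expansion term "pattern" enters the final query model with weight (1 - λ)·0.3 = 0.12, while the original query terms keep most of their mass.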

Experiment
Using entity pages for relevance feedback: our method utilizes only the entity page corresponding to the query for PRF (RE). Since not all queries can be mapped to a specific Wikipedia entity page, this method is applicable only to EQ and AQ.

Experiment
Field-based expansion: note that in these experiments, the top retrieved documents (not an entity page) are treated as pseudo-relevant documents for EQ and AQ, so BQ can also be handled here.

Experiment
Field-based expansion: for the supervised method, we compare two ways of incorporating expansion terms for retrieval. The first adds the top 100 ranked good terms (SL); the second adds the top 10 ranked good terms, each weighted by its classification probability (SLW). For the unsupervised method, the relevance model with weighted term frequencies is denoted RMWTF.

Experiment
Performance improves as the weights for Links and Content increase, while increasing the weight of Overview degrades performance. This shows that the position where a term appears affects how well it indicates term relevance.

Experiment
Query-dependent expansion: RE is used for EQ and AQ, and RMWTF for BQ. We denote this query-dependent method as QD.

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Conclusion
We have explored the use of Wikipedia in PRF. TREC topics are categorized into three types based on Wikipedia, and the proposed methods were evaluated on four TREC collections. We proposed and studied different methods for term selection using pseudo-relevance information from Wikipedia entity pages. Our experimental results show that the query-dependent approach improves over a baseline relevance model.

Outline
- Introduction
- Query Categorization
- Query Expansion Methods
- Experiments
- Conclusion
- Future Work

Future Work
More investigation is needed for the broader queries. For ambiguous queries, if the disambiguation process can achieve higher accuracy, the effectiveness of the final retrieval will improve. For the supervised term selection method, the results obtained are not yet satisfactory in terms of accuracy. Finally, by combining the initial results from the test collection and from Wikipedia, one may be able to develop an expansion strategy that is robust to the query being degraded by either of the resources.