LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker: Yi-Ling Tai Date: 2010/03/15


OUTLINE
- Introduction
- Proximity Measures
- Correlations of Measures
- Proximity Retrieval Model
- Experimental Setup
- Benchmarks
- GP Process
- Training
- Experimental Results
- Conclusion

INTRODUCTION Traditional ad hoc retrieval models do not take into account the closeness or proximity of terms. Document scores in these models are primarily based on the occurrences of query-terms. Intuitively, documents in which query-terms occur closer together should be ranked higher.

INTRODUCTION The contributions of this paper:
- outline several term-term proximity measures.
- develop an intuitive framework into which useful term-term proximity functions can be incorporated.
- use a learning approach to combine proximity measures into a useful proximity function.
- evaluate the best learned proximity functions, showing that they achieve a significant increase in performance.

PROXIMITY MEASURES The following sample document D is used to explain the proximity measures: query-term a occurs at positions 1, 5, and 12, query-term b occurs at positions 2 and 6, and the document length is 14.

PROXIMITY MEASURES
- The minimum distance between any occurrences of a and b in D, i.e. min_dist is 1.
- The difference between the average positions of a and b in D, i.e. ((1+5+12)/3) − ((2+6)/2) = 2.

PROXIMITY MEASURES
- The average distance between a and b over all position combinations in D, i.e. ((1 + 5) + (3 + 1) + (10 + 6))/(2 · 3) = 26/6 = 4.33.
- The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term, i.e. ((2 − 1) + (6 − 5))/2 = 1.

PROXIMITY MEASURES
- The smallest distance achievable when each occurrence of a term is uniquely matched to an occurrence of the other term, i.e. ((2 − 1) + (6 − 5))/2 = 1.
- The maximum distance between any two occurrences of a and b, i.e. (12 − 6) = 6.

PROXIMITY MEASURES
- The sum of the term frequencies of a and b in D, i.e. 3 + 2 = 5.
- The product of the term frequencies of a and b, i.e. 3 × 2 = 6.
- The length of the span of the document that covers all occurrences of the query-terms, i.e. 12.

PROXIMITY MEASURES
- The length of the shortest span of the document that covers all query-terms at least once, i.e. 2.
- The length of the document, i.e. 14.
- The number of unique terms that match both the document and the query, i.e. 2.
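As a minimal sketch, the worked example above can be reproduced from the term-position lists reconstructed from the slides (a at positions 1, 5, 12; b at positions 2, 6; document length 14). The function names below are descriptive labels, not the paper's official measure names:

import math
from itertools import product

a_pos = [1, 5, 12]   # occurrences of query-term a in D
b_pos = [2, 6]       # occurrences of query-term b in D
doc_len = 14

def min_dist(xs, ys):
    """Minimum distance between any occurrence of a and any occurrence of b."""
    return min(abs(x - y) for x, y in product(xs, ys))

def diff_avg_pos(xs, ys):
    """Difference between the average positions of the two terms."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def avg_dist(xs, ys):
    """Average distance over all position combinations."""
    return sum(abs(x - y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def avg_min_dist(xs, ys):
    """Average of the shortest distance from each occurrence of the least
    frequent term to any occurrence of the other term."""
    rare, other = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    return sum(min(abs(r - o) for o in other) for r in rare) / len(rare)

def full_cover(xs, ys):
    """Length of the span covering all occurrences of both terms."""
    return max(xs + ys) - min(xs + ys) + 1

def min_cover(xs, ys):
    """Length of the shortest span containing both terms at least once
    (two-term simplification; a sliding window is needed for more terms)."""
    return min(abs(x - y) + 1 for x, y in product(xs, ys))

print(min_dist(a_pos, b_pos))                # 1
print(diff_avg_pos(a_pos, b_pos))            # 2.0
print(round(avg_dist(a_pos, b_pos), 2))      # 4.33
print(avg_min_dist(a_pos, b_pos))            # 1.0
print(full_cover(a_pos, b_pos))              # 12
print(min_cover(a_pos, b_pos))               # 2
print(len(a_pos) + len(b_pos))               # sum of term frequencies = 5
print(len(a_pos) * len(b_pos))               # product of term frequencies = 6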

CORRELATIONS OF MEASURES We use the FBIS, FT, and FR collections from TREC disks 4 and 5 as test collections, together with their associated topics. For each set of topics we create a short query set and a medium-length query set.

CORRELATIONS OF MEASURES We analyse each proximity measure independently, in order to predict which measures may be most useful when incorporated into a proximity function. For each query we analyse the top 1000 retrieved documents and examine the correlations.
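The transcript does not state exactly what the correlations are computed between; purely as an illustration, pairwise rank correlations among the measure values over the top-ranked documents could be computed as follows (the data matrix here is random stand-in data, not the paper's):

import numpy as np
from scipy.stats import spearmanr

# Hypothetical matrix: one row per retrieved document, one column per
# proximity measure (e.g. computed as in the previous sketch).
rng = np.random.default_rng(0)
measure_values = rng.random((1000, 12))     # top 1000 docs x 12 measures

rho, _ = spearmanr(measure_values)          # 12 x 12 Spearman correlation matrix
print(np.round(rho[:3, :3], 2))             # correlations among the first 3 measures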

CORRELATIONS OF MEASURES Table 2 shows the average values of the individual measures per query.

PROXIMITY RETRIEVAL MODEL The representation of a document in our model is as follows: w() is a standard term-weighting scheme and p() is a proximity function. The entire score of the document can now be defined as the sum of all term-term relationships:
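As a rough illustration of this pairwise idea (a minimal sketch only; the w and p used below are toy placeholders, not the paper's actual functions):

from itertools import combinations

def pairwise_score(doc_positions, query_terms, doc_len, w, p):
    """Sketch of the pairwise framework: the document score is a standard
    term weight w(t, D) for every matching query-term plus a proximity
    contribution p(ti, tj, D) for every pair of matching query-terms."""
    matched = [t for t in query_terms if t in doc_positions]
    score = sum(w(t, doc_positions, doc_len) for t in matched)
    for ti, tj in combinations(matched, 2):
        score += p(doc_positions[ti], doc_positions[tj], doc_len)
    return score

# Toy usage on the sample document D (w and p are crude placeholders).
doc = {"a": [1, 5, 12], "b": [2, 6]}
w = lambda t, pos, dl: len(pos[t]) / dl                        # simple tf / dl weight
p = lambda xs, ys, dl: 1.0 / (1 + min(abs(x - y) for x in xs for y in ys))
print(round(pairwise_score(doc, ["a", "b"], 14, w, p), 3))     # 0.857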

EXPERIMENTAL SETUP Benchmarks: the traditional BM25 scheme, in which
- tf is the frequency of term t in document D
- dl is the document length
- df is the document frequency of term t
- avg_dl is the average document length
- k1 is the term-frequency influence parameter (set to 1.2)
- qtf is the frequency of term t in the query
- b is the document-length normalization parameter (set to 0.75)
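For reference, one standard formulation of BM25 consistent with this parameter list (N is the number of documents in the collection; the exact variant used on the slide may differ slightly) is:

\mathrm{BM25}(D,Q) = \sum_{t \in Q} \mathit{qtf}_t \cdot \frac{(k_1 + 1)\,\mathit{tf}_{t,D}}{\mathit{tf}_{t,D} + k_1\left(1 - b + b\,\frac{\mathit{dl}}{\mathit{avg\_dl}}\right)} \cdot \log\frac{N - \mathit{df}_t + 0.5}{\mathit{df}_t + 0.5}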

EXPERIMENTAL SETUP where cft is the frequency of t in the entire collection proximity function We will label these functions BM25 + t() and ES + t(). 16

GP PROCESS Initially, a population of solutions is created randomly. Individuals are selected for reproduction based on their fitness values. Once selection has occurred, recombination can start. Recombination creates new solutions for the next generation by means of crossover and mutation. The process usually ends when a predefined number of generations has been completed.
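A minimal sketch of this generational loop, assuming a generic representation and a toy fitness function (the init, fitness, crossover, and mutate helpers below are illustrative placeholders, not the paper's implementation):

import random

def evolve(init, fitness, crossover, mutate, pop_size=50, generations=30,
           tournament=3, mutation_rate=0.1):
    """Generic generational loop: create, select, recombine, repeat."""
    population = [init() for _ in range(pop_size)]      # random initial population
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        def select():                                    # tournament selection on fitness
            return max(random.sample(scored, tournament))[1]
        next_gen = []
        while len(next_gen) < pop_size:
            child = crossover(select(), select())        # recombination by crossover
            if random.random() < mutation_rate:
                child = mutate(child)                    # and occasional mutation
            next_gen.append(child)
        population = next_gen
    return max((fitness(ind), ind) for ind in population)[1]

# Toy instantiation: evolve a pair of weights (purely illustrative).
best = evolve(init=lambda: [random.random(), random.random()],
              fitness=lambda v: -(v[0] - 0.3) ** 2 - (v[1] - 0.7) ** 2,
              crossover=lambda a, b: [a[0], b[1]],
              mutate=lambda v: [x + random.gauss(0, 0.05) for x in v])
print(best)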

GP PROCESS We use all 12 of the proximity measures introduced above as input terminals to the GP. Three constants are used for scaling: {1, 10, 0.5}. The fitness function used in the experiments is MAP. We also use the following functions as inputs to the GP: +, −, ×, /, √, sq(), log(). We then ran the GP six times for 30 generations.
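A minimal sketch of the GP representation under these settings: candidate proximity functions are expression trees over the 12 measure terminals, the three constants, and the function set above. The measure names and helper code are illustrative assumptions; in the paper the trees would be evolved with the generational loop sketched earlier, using MAP on a training collection as the fitness rather than the toy evaluation shown here.

import math
import random

MEASURES = ["min_dist", "diff_avg_pos", "avg_dist", "avg_min_dist",
            "match_dist", "max_dist", "sum_tf", "prod_tf",
            "full_cover", "min_cover", "dl", "q_match"]
TERMINALS = MEASURES + [1.0, 10.0, 0.5]                  # 12 measures + 3 scaling constants

BINARY = {"+": lambda x, y: x + y,
          "-": lambda x, y: x - y,
          "*": lambda x, y: x * y,
          "/": lambda x, y: x / y if y else 1.0}         # protected division
UNARY = {"sqrt": lambda x: math.sqrt(abs(x)),
         "sq": lambda x: x * x,
         "log": lambda x: math.log(abs(x) + 1e-9)}

def random_tree(depth=3):
    """Grow a random expression tree, encoded as nested tuples."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    if random.random() < 0.7:
        op = random.choice(list(BINARY))
        return (op, random_tree(depth - 1), random_tree(depth - 1))
    return (random.choice(list(UNARY)), random_tree(depth - 1))

def evaluate(tree, values):
    """Evaluate a tree against a dict of proximity-measure values."""
    if isinstance(tree, str):
        return float(values[tree])
    if not isinstance(tree, tuple):
        return float(tree)
    op, *args = tree
    evaluated = [evaluate(a, values) for a in args]
    return BINARY[op](*evaluated) if op in BINARY else UNARY[op](evaluated[0])

# Measure values from the worked example in the earlier slides.
example = {"min_dist": 1, "diff_avg_pos": 2, "avg_dist": 26 / 6,
           "avg_min_dist": 1, "match_dist": 1, "max_dist": 6,
           "sum_tf": 5, "prod_tf": 6, "full_cover": 12,
           "min_cover": 2, "dl": 14, "q_match": 2}

random.seed(0)
candidate = random_tree()
print(candidate, "->", round(evaluate(candidate, example), 3))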

EXPERIMENTAL RESULTS We used a subset of the Financial Times collection as a training collection for the GP. Table 3 shows the results of the best three runs of the GP.

EXPERIMENTAL RESULTS The three best functions produced by the GP:

EXPERIMENTAL RESULTS Tests on Unseen Data

CONCLUSION We have outlined a list of measures that may be used to capture the notion of proximity. We have outlined an IR framework which incorporates the term-term similarities of all possible query-term pairs. We have adopted a learning technique (GP) which learns useful proximity functions. The research described shows that it is possible to use combinations of proximity measures to improve the performance of IR systems.