Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval. Ronan Cummins, Colm O'Riordan, Digital Enterprise Research Institute. SIGIR '09.

Presentation transcript:

Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval. Ronan Cummins, Colm O'Riordan, Digital Enterprise Research Institute. SIGIR '09. Summarized by Jaehui Park, IDS Lab., Seoul National University.

Copyright © 2008 by CEBT

CONTENTS
– INTRODUCTION
– RELATED RESEARCH
– PROXIMITY MEASURES
– PROXIMITY RETRIEVAL MODEL
– EXPERIMENTS
  – SETUP
  – RESULTS
– CONCLUSION

INTRODUCTION
– The occurrences of the query terms in the document
  – Intuition: documents in which query terms occur closer together should be ranked higher than documents in which the query terms appear far apart.
– The relationships between all query terms
  – The pairwise similarity between terms
– Contributions
  – A list of term-term proximity measures
  – An intuitive framework for the proximity model
  – A machine learning approach to search the space of term-term proximity functions
  – Performance evaluations

PROXIMITY MEASURES
– Running example: D = ⟨a b c d a b d e f g h a i j⟩, Q = {a, b}
  – pos(D,a) = {1,5,12}, pos(D,b) = {2,6}
  – tf(D,a) = 3, tf(D,b) = 2
– 12 measures are introduced:
  – The distance between the positions of a pair of terms in a document (1–6)
  – Combinations of the term frequencies of each term in the document (7, 8)
  – Measures over the terms in the entire query (9, 10)
  – Normalization measures (11, 12)
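The running example can be reproduced with two small helpers. This is a minimal sketch: the function names `pos` and `tf` simply mirror the slide's notation and are not from the paper's code.

```python
def pos(doc, term):
    """1-based positions at which `term` occurs in `doc`."""
    return [i for i, t in enumerate(doc, start=1) if t == term]

def tf(doc, term):
    """Term frequency of `term` in `doc`."""
    return sum(1 for t in doc if t == term)

# The 14-term example document and the two-term query from the slide.
D = list("abcdabdefghaij")
Q = ["a", "b"]

print(pos(D, "a"))             # [1, 5, 12]
print(pos(D, "b"))             # [2, 6]
print(tf(D, "a"), tf(D, "b"))  # 3 2
```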

PROXIMITY MEASURES
– min_dist(a,b,D) = 1
  – The minimum distance between any occurrences of a and b in D.
  – closeness -> relatedness
– diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 2
  – The difference between the average positions of a and b in D.
  – Captures where each term tends to occur
– avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
  – The average distance between a and b over all possible position combinations in D.
  – Promotes terms that consistently occur close to one another in a localised area
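A sketch of these first three distance measures, assuming 1-based position lists as in the running example; the function names follow the slide's notation, not the paper's code.

```python
def min_dist(pa, pb):
    """Minimum distance between any occurrence of a and any occurrence of b."""
    return min(abs(i - j) for i in pa for j in pb)

def diff_avg_pos(pa, pb):
    """Difference between the average positions of a and b."""
    return sum(pa) / len(pa) - sum(pb) / len(pb)

def avg_dist(pa, pb):
    """Average distance over all position pairs of a and b."""
    return sum(abs(i - j) for i in pa for j in pb) / (len(pa) * len(pb))

pa, pb = [1, 5, 12], [2, 6]   # pos(D, a), pos(D, b) from the example
print(min_dist(pa, pb))       # 1
print(diff_avg_pos(pa, pb))   # 2.0
print(avg_dist(pa, pb))       # 4.333...
```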

PROXIMITY MEASURES
– avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
  – The occurrence of a at position 12 may be completely unrelated to b.
– match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The smallest distance achievable when each occurrence of a term is uniquely matched to another occurrence of a term.
– max_dist(a,b,D) = (12-6) = 6
  – The maximum distance between any two occurrences of a and b.
  – A useful normalization factor
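The first two of these can be sketched directly from the running example; `max_dist` is left out because its worked value on the slide depends on details of the definition that the transcript does not capture. Names again follow the slide's notation rather than the paper's code.

```python
from itertools import permutations

def avg_min_dist(pa, pb):
    """Average, over each occurrence of the rarer term, of the distance to
    the nearest occurrence of the other term."""
    short, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return sum(min(abs(i - j) for j in other) for i in short) / len(short)

def match_dist(pa, pb):
    """Smallest average distance achievable when each occurrence of the rarer
    term is uniquely matched to a distinct occurrence of the other term."""
    short, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    best = min(sum(abs(i - j) for i, j in zip(short, perm))
               for perm in permutations(other, len(short)))
    return best / len(short)

pa, pb = [1, 5, 12], [2, 6]   # pos(D, a), pos(D, b)
print(avg_min_dist(pa, pb))   # 1.0
print(match_dist(pa, pb))     # 1.0
```

Note how `avg_min_dist` ignores the stray a at position 12: only the b occurrences (the rarer term) are averaged over, which is exactly the motivation given on the slide.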

PROXIMITY MEASURES
– sum(tf(a),tf(b)) = 3+2 = 5
  – The sum of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms
– prod(tf(a),tf(b)) = 3*2 = 6
  – The product of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms
– fullcover(Q,D) = 12
  – The length of the part of the document that covers all occurrences of the query terms.
  – A query-specific measure
– min_cover(Q,D) = 2
  – The length of the part of the document that covers all query terms at least once.
  – Equals min_dist + 1 for a two-term query
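The two cover measures can be sketched as span computations over the position lists. This is an illustrative brute-force version (fine for slide-sized examples), and the names `full_cover`/`min_cover` are adaptations of the slide's notation, not the paper's code.

```python
def full_cover(positions):
    """Length of the span covering ALL occurrences of every query term.
    `positions` maps each query term to its 1-based position list."""
    all_pos = [p for ps in positions.values() for p in ps]
    return max(all_pos) - min(all_pos) + 1

def min_cover(positions):
    """Length of the shortest span covering every query term at least once
    (brute force over candidate windows bounded by term occurrences)."""
    all_pos = sorted(p for ps in positions.values() for p in ps)
    best = all_pos[-1] - all_pos[0] + 1
    for lo in all_pos:
        for hi in all_pos:
            if hi < lo:
                continue
            if all(any(lo <= p <= hi for p in ps) for ps in positions.values()):
                best = min(best, hi - lo + 1)
    return best

positions = {"a": [1, 5, 12], "b": [2, 6]}  # the running example
print(full_cover(positions))  # 12
print(min_cover(positions))   # 2
```

The output matches the slide: the window [1, 2] already contains both a and b, so min_cover = 2 = min_dist + 1 for this two-term query.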

PROXIMITY MEASURES
– dl(D) = 14
  – The length of the document
  – A useful factor for normalization in IR
– qt(Q,D) = 2
  – The number of unique terms that appear in both the query and the document
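The last two measures are one-liners on the running example; `dl` and `qt` follow the slide's names as a sketch.

```python
def dl(d):
    """Document length in terms."""
    return len(d)

def qt(q, d):
    """Number of unique terms occurring in both the query and the document."""
    return len(set(q) & set(d))

D = list("abcdabdefghaij")
Q = ["a", "b"]
print(dl(D))     # 14
print(qt(Q, D))  # 2
```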

PROXIMITY MEASURES
– Correlations of the measures with relevance
  – FBIS, FT, FR collections from TREC disks 4 and 5
  – OHSUMED collection
– Method: re-ranking the top-N (= 1000) documents from an initial ranked list using a proximity function

PROXIMITY MEASURES
– Inverse correlations: for most distance measures, smaller values indicate relevance.
– Exception: qt is positively correlated with relevance.

PROXIMITY RETRIEVAL MODEL
– Extending a vector model
  – Documents and queries as matrices
    – Ex) a 3-term query
    – w(): a standard term-weighting scheme
    – p(): a proximity function
– No theoretical basis
  – An intuitive extension of a vector-based approach
– Genetic Programming (GP) technique
  – Combines some or all of the 12 proximity measures
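One plausible reading of this extension (an assumption, since the matrix layout is not fully captured in the transcript) is that w() contributes one score per query term and p() one score per query-term pair. The toy functions below, `tf_weight` and `inv_min_dist`, are hypothetical stand-ins, not the paper's BM25 weighting or its learned proximity function.

```python
def score(query, doc, w, p):
    """Sum a term weight for each query term plus a proximity contribution
    for every unordered pair of query terms."""
    s = sum(w(t, doc) for t in query)
    for i in range(len(query)):
        for j in range(i + 1, len(query)):
            s += p(query[i], query[j], doc)
    return s

doc = list("abcdabdefghaij")

def tf_weight(t, d):            # toy stand-in for a real scheme such as BM25
    return sum(1 for x in d if x == t)

def inv_min_dist(t1, t2, d):    # toy proximity: closer pairs score higher
    p1 = [i for i, x in enumerate(d, 1) if x == t1]
    p2 = [i for i, x in enumerate(d, 1) if x == t2]
    if not p1 or not p2:
        return 0.0
    return 1.0 / min(abs(i - j) for i in p1 for j in p2)

# tf(a)=3, tf(b)=2, min_dist(a,b)=1  ->  3 + 2 + 1/1 = 6.0
print(score(["a", "b"], doc, tf_weight, inv_min_dist))  # 6.0
```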

EXPERIMENTAL SETUP
– Term-weighting schemes
  – The BM25 scheme
  – A scheme from previous work (ES)
– Proximity function: t()
– The benchmark proximity functions
  – BM25 + t()
  – ES + t()

EXPERIMENTAL SETUP
– GP process
  – A heuristic stochastic search algorithm
– Training
  – Financial Times documents
  – Queries: 25 title-only, 30 title + description
  – Fitness function: MAP
– GP runs
  – Ranking documents using the weighting scheme over the top 3000 documents
  – 6 runs of GP
    – Initial population of 2000 for 30 generations
    – Elitist strategy
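The elitist GP loop described above can be sketched as a generic evolutionary skeleton. Everything here is a hedged illustration: `random_individual`, `crossover`, `mutate`, and `fitness` are hypothetical placeholders for the paper's tree-based operators and its MAP fitness, and the demo evolves a single number rather than a proximity function.

```python
import random

def gp_search(random_individual, crossover, mutate, fitness,
              pop_size=2000, generations=30, seed=0):
    """Skeletal elitist evolutionary loop: carry the best individual into the
    next generation unchanged and breed the rest from the top half."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]  # truncation selection
        children = [ranked[0]]                     # elitist strategy
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness)

# Toy demo: evolve a number toward 3 (fitness is higher when closer).
best = gp_search(
    random_individual=lambda r: r.uniform(-10, 10),
    crossover=lambda a, b, r: (a + b) / 2,
    mutate=lambda x, r: x + r.gauss(0, 0.1),
    fitness=lambda x: -(x - 3) ** 2,
    pop_size=200, generations=20,
)
print(best)
```

Elitism makes the best fitness monotone across generations, which is why the paper can report the best learned function from each of its 6 independent runs.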

EXPERIMENTAL RESULTS
– Comparison tables for the benchmark functions; statistical significance tested with the Wilcoxon signed-rank test

CONCLUSION
– We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
– We have indicated the potential correlation between each of the individual measures and relevance.
  – min_dist is highly correlated with relevance.
– We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
  – We adopt a population-based learning technique (GP) which learns useful proximity functions.
– An evaluation of three proximity functions shows that combinations of proximity measures can improve the performance of IR systems for both short and long queries.