
1 Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval
Ronan Cummins, Colm O’Riordan, Digital Enterprise Research Institute, SIGIR 2009
2010. 07. 09. Summarized by Jaehui Park, IDS Lab., Seoul National University

2 CONTENTS
- INTRODUCTION
- RELATED RESEARCH
- PROXIMITY MEASURES
- PROXIMITY RETRIEVAL MODEL
- EXPERIMENTS
  – SETUP
  – RESULTS
- CONCLUSION

3 INTRODUCTION
- The occurrences of the query-terms in the document
  – Intuition: documents in which the query-terms occur closer together should be ranked higher than documents in which the query-terms appear far apart.
- The relationships between all query-terms
  – The pairwise similarity between terms
- Contributions
  – A list of term-term proximity measures
  – An intuitive framework for the proximity model
  – A machine learning approach to search through the space of term-term proximity functions
  – Performance evaluations

4 PROXIMITY MEASURES
Running example:

  Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  D:        a b c d a b d e f g  h  a  i  j
  Q:        a b

- pos(D,a) = {1, 5, 12}, pos(D,b) = {2, 6}
- tf(D,a) = 3, tf(D,b) = 2
- 12 measures are introduced:
  – the distance between the positions of a pair of terms in a document (1–6)
  – combining the term frequencies of each term in the document (7, 8)
  – the terms in the entire query (9, 10)
  – normalization measures (11, 12)
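The running example can be reproduced in a few lines of Python. This is a minimal sketch (not code from the paper): the document is a list of tokens, and the position list and term frequency are derived from it.

```python
# Running example from the slide: D = a b c d a b d e f g h a i j, Q = {a, b}
D = ["a", "b", "c", "d", "a", "b", "d", "e", "f", "g", "h", "a", "i", "j"]
Q = ["a", "b"]

def pos(doc, term):
    """1-based positions at which term occurs in doc."""
    return [i + 1 for i, t in enumerate(doc) if t == term]

def tf(doc, term):
    """Term frequency of term in doc."""
    return len(pos(doc, term))

print(pos(D, "a"), pos(D, "b"))  # [1, 5, 12] [2, 6]
print(tf(D, "a"), tf(D, "b"))    # 3 2
```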

5 PROXIMITY MEASURES
- min_dist(a,b,D) = 1
  – The minimum distance between any occurrences of a and b in D.
  – Closeness -> relatedness.
- diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 6 - 4 = 2
  – The difference between the average positions of a and b in D.
  – Indicates where each term tends to occur.
- avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
  – The average distance between a and b over all possible position combinations in D.
  – Promotes terms that consistently occur close to one another in a localised area.
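A small sketch of the three position-based measures above, continuing the example and reusing D, pos() and tf() from the previous snippet. The definitions follow the slide's worked values; the paper's exact formulations may differ slightly.

```python
def min_dist(a, b, doc):
    """Minimum distance between any occurrence of a and any occurrence of b."""
    return min(abs(i - j) for i in pos(doc, a) for j in pos(doc, b))

def diff_avg_pos(a, b, doc):
    """Difference between the average positions of a and b."""
    pa, pb = pos(doc, a), pos(doc, b)
    return sum(pa) / len(pa) - sum(pb) / len(pb)

def avg_dist(a, b, doc):
    """Average distance over all position pairs of a and b."""
    pa, pb = pos(doc, a), pos(doc, b)
    return sum(abs(i - j) for i in pa for j in pb) / (len(pa) * len(pb))

print(min_dist("a", "b", D))      # 1
print(diff_avg_pos("a", "b", D))  # 6.0 - 4.0 = 2.0
print(avg_dist("a", "b", D))      # 26/6 = 4.33...
```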

6 PROXIMITY MEASURES
- avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
  – The occurrence of a at position 12 may be completely unrelated to b.
- match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The smallest distance achievable when each occurrence of a term is uniquely matched to an occurrence of the other term.
- max_dist(a,b,D) = (12-6) = 6
  – The maximum distance between any two occurrences of a and b.
  – A useful normalization factor.
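Continuing the sketch for the three measures on this slide. avg_min_dist and match_dist reproduce the worked values directly; for max_dist the slide's example (12 - 6 = 6) suggests taking the largest of the per-occurrence nearest-neighbour distances rather than the largest distance over all position pairs, so that interpretation is used here and should be treated as an assumption.

```python
from itertools import permutations

def avg_min_dist(a, b, doc):
    """Average, over occurrences of the rarer term, of the shortest distance
    to any occurrence of the other term."""
    pa, pb = pos(doc, a), pos(doc, b)
    rare, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return sum(min(abs(i - j) for j in other) for i in rare) / len(rare)

def match_dist(a, b, doc):
    """Smallest average distance when each occurrence of the rarer term is
    matched to a distinct occurrence of the other term (brute force)."""
    pa, pb = pos(doc, a), pos(doc, b)
    rare, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    best = min(sum(abs(i - j) for i, j in zip(rare, m))
               for m in permutations(other, len(rare)))
    return best / len(rare)

def max_dist(a, b, doc):
    """Largest nearest-neighbour distance between occurrences of a and b
    (assumed interpretation; it reproduces the slide's 12 - 6 = 6)."""
    pa, pb = pos(doc, a), pos(doc, b)
    return max(max(min(abs(i - j) for j in pb) for i in pa),
               max(min(abs(i - j) for j in pa) for i in pb))

print(avg_min_dist("a", "b", D))  # 1.0
print(match_dist("a", "b", D))    # 1.0
print(max_dist("a", "b", D))      # 6
```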

7 PROXIMITY MEASURES
- sum(tf(a),tf(b)) = 3+2 = 5
  – The sum of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms.
- prod(tf(a),tf(b)) = 3*2 = 6
  – The product of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms.
- fullcover(Q,D) = 12
  – The length of the span of the document that covers all occurrences of the query-terms.
  – A query-specific measure.
- min_cover(Q,D) = 2
  – The length of the shortest span of the document that covers all query-terms at least once.
  – Equals min_dist + 1 for a two-term query.
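The frequency- and cover-based measures from this slide, again continuing the running example. min_cover uses a simple brute-force window search, which is fine for a sketch.

```python
def fullcover(query, doc):
    """Length of the span covering ALL occurrences of the query-terms."""
    positions = [p for t in query for p in pos(doc, t)]
    return max(positions) - min(positions) + 1

def min_cover(query, doc):
    """Length of the shortest span covering every query-term at least once."""
    terms = [t for t in query if t in doc]
    best = len(doc)
    for start in range(len(doc)):
        for end in range(start, len(doc)):
            if all(t in doc[start:end + 1] for t in terms):
                best = min(best, end - start + 1)
                break
    return best

print(tf(D, "a") + tf(D, "b"))  # sum  = 5
print(tf(D, "a") * tf(D, "b"))  # prod = 6
print(fullcover(Q, D))          # 12
print(min_cover(Q, D))          # 2
```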

8 PROXIMITY MEASURES
- dl(D) = 14
  – The length of the document.
  – A useful factor for normalization in IR.
- qt(Q,D) = 2
  – The number of unique terms that match both the document and the query.
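The two normalization measures are trivial to compute; a short sketch of each, in the same style as above:

```python
def dl(doc):
    """Document length in tokens."""
    return len(doc)

def qt(query, doc):
    """Number of unique query-terms that also occur in the document."""
    return len(set(query) & set(doc))

print(dl(D))     # 14
print(qt(Q, D))  # 2
```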

9 PROXIMITY MEASURES
- Correlations of the measures with relevance
  – FBIS, FT and FR collections from TREC disks 4 and 5
  – OHSUMED collection
- Measured by re-ranking the top-N (N = 1000) documents from an initial ranked list using a proximity function.
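A hedged sketch of the re-ranking procedure described above: the top-N documents of a baseline ranking are re-ordered by a single proximity measure so that its relationship to relevance can be examined. Function and variable names are illustrative, not the authors'.

```python
def rerank_top_n(initial_ranking, query, measure, top_n=1000, ascending=True):
    """Re-rank the top-N (doc_id, tokens) pairs of a baseline ranking by one
    proximity measure. Distance-like measures are sorted ascending (smaller
    is better); set ascending=False for measures such as qt or prod(tf)."""
    head, tail = initial_ranking[:top_n], initial_ranking[top_n:]
    head = sorted(head, key=lambda item: measure(query, item[1]),
                  reverse=not ascending)
    return head + tail

# e.g. re-rank a baseline run by min_cover (a distance-like, query-level measure):
# reranked = rerank_top_n(baseline_run, Q, min_cover)
```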

10 PROXIMITY MEASURES
- Most measures show inverse correlations with relevance.
- Exception: qt is positively correlated with relevance.

11 PROXIMITY RETRIEVAL MODEL
- Extends a vector model
  – Documents and queries are represented as matrices (e.g. for a 3-term query)
  – w(): a standard term-weighting scheme
  – p(): a proximity function
- No theoretical basis
  – An intuitive extension of a vector-based approach
  – A Genetic Programming (GP) technique combines some or all of the 12 proximity measures.
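A minimal sketch of how such a pairwise score might be assembled: "diagonal" entries of the query-document matrix carry standard term weights w(), while off-diagonal entries carry pairwise proximity scores p() for every pair of matching query-terms. The exact way the paper combines the entries is not shown on this slide, so the plain summation below is an assumption.

```python
def pairwise_score(query_terms, doc, w, p):
    """Sum of term weights plus pairwise proximity scores.
    w(term, doc) is a standard term-weighting scheme; p(a, b, doc) is a
    proximity function built from the measures above. The additive
    combination here is an illustrative assumption, not the paper's model."""
    matching = [t for t in query_terms if t in doc]
    score = sum(w(t, doc) for t in matching)
    for i, a in enumerate(matching):
        for b in matching[i + 1:]:
            score += p(a, b, doc)
    return score
```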

12 EXPERIMENTAL SETUP
- Term-weighting scheme
  – BM25 scheme
  – Previous work
- Proximity function
- The benchmark proximity functions
  – BM25 + t()
  – ES + t()
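One way to read the benchmark functions "BM25 + t()" is a standard BM25 score with an additive proximity bonus t(Q, D). The sketch below uses the usual BM25 formulation with typical default parameters and reuses tf() from the earlier sketch; the exact form of t() (and of the ES scheme) is not given on this slide, so everything here is an assumption for illustration.

```python
import math

def bm25_weight(term, doc, N, df, avg_dl, k1=1.2, b=0.75):
    """Standard BM25 weight of one term in one document.
    N: number of documents in the collection, df: document frequency of the
    term, avg_dl: average document length (all supplied by the caller)."""
    f = tf(doc, term)
    if f == 0 or df == 0:
        return 0.0
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_dl))

def bm25_plus_proximity(query, doc, N, dfs, avg_dl, t):
    """BM25 summed over the query terms plus an additive proximity bonus
    t(query, doc), e.g. one of the pair-based sketches above."""
    base = sum(bm25_weight(q, doc, N, dfs.get(q, 0), avg_dl) for q in set(query))
    return base + t(query, doc)
```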

13 EXPERIMENTAL SETUP
- GP process
  – A heuristic stochastic search algorithm
- Training
  – Financial Times collection: 69500 documents
  – Queries: 25 title-only, 30 title + description
  – Fitness function: MAP
- GP
  – Documents ranked using the weighting scheme; top 3000 documents used
  – 6 runs of GP
  – Initial population of 2000 for 30 generations
  – Elitist strategy
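The slide only names the GP settings (population 2000, 30 generations, elitism, MAP fitness), so the snippet below is not the authors' system: it is a tiny elitist evolutionary loop over weight vectors, standing in for the real GP, which evolves expression trees over the 12 proximity measures and scores each candidate by MAP on the training queries.

```python
import random

def evolve(fitness, dim, pop_size=50, generations=30, elite=2, sigma=0.1):
    """Toy elitist evolutionary loop (illustrative stand-in for the GP run).
    An individual is a weight vector; `fitness` is supplied by the caller
    (in the paper it would be MAP over the training queries)."""
    population = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        nxt = ranked[:elite]                                          # elitist strategy
        while len(nxt) < pop_size:
            parent = random.choice(ranked[:max(2, pop_size // 5)])    # truncation selection
            nxt.append([w + random.gauss(0, sigma) for w in parent])  # mutation
        population = nxt
    return max(population, key=fitness)

# Toy usage: recover a known target weight vector.
target = [0.5, -0.2, 0.8]
best = evolve(lambda w: -sum((a - b) ** 2 for a, b in zip(w, target)), dim=3)
print(best)
```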

14 EXPERIMENTAL RESULTS
- Significance tested with the Wilcoxon signed-rank test.

15 EXPERIMENTAL RESULTS
- Significance tested with the Wilcoxon signed-rank test.
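The Wilcoxon signed-rank test compares two retrieval runs on their paired per-query scores (e.g. average precision). A minimal example with made-up numbers, using SciPy; the figures are not from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average-precision scores for a baseline run and a
# proximity-enhanced run (illustrative numbers only).
baseline  = [0.31, 0.12, 0.45, 0.20, 0.38, 0.05, 0.27, 0.33]
proximity = [0.35, 0.15, 0.44, 0.26, 0.41, 0.06, 0.30, 0.37]

stat, p = wilcoxon(baseline, proximity)
print(f"W = {stat}, p = {p:.3f}")  # a small p-value indicates a significant difference
```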

16 CONCLUSION
- We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
- We have indicated the potential correlation between each of the individual measures and relevance.
  – min_dist is highly correlated with relevance.
- We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
  – We adopt a population-based learning technique (GP) which learns useful proximity functions.
- An evaluation of three proximity functions shows that it is possible to use combinations of proximity measures to improve the performance of IR systems for both short and long queries.

