
1 Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval
Ronan Cummins, Colm O’Riordan, Digital Enterprise Research Institute, SIGIR 2009
2010. 07. 09. Summarized by Jaehui Park, IDS Lab., Seoul National University

2 CONTENTS
- INTRODUCTION
- RELATED RESEARCH
- PROXIMITY MEASURES
- PROXIMITY RETRIEVAL MODEL
- EXPERIMENTS
  – SETUP
  – RESULTS
- CONCLUSION

3 INTRODUCTION
- The occurrences of the query-terms in the document
  – Intuition: documents in which the query-terms occur closer together should be ranked higher than documents in which the query-terms appear far apart.
- The relationships between all query-terms
  – The pairwise similarity between terms
- Contributions
  – A list of term-term proximity measures
  – An intuitive framework for the proximity model
  – A machine learning approach to search through the space of term-term proximity functions
  – Performance evaluations

4 PROXIMITY MEASURES
Running example:

  Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  D:        a b c d a b d e f g  h  a  i  j
  Q:        a b

- pos(D,a) = {1, 5, 12}, pos(D,b) = {2, 6}
- tf(D,a) = 3, tf(D,b) = 2
- 12 measures are introduced:
  – the distance between the positions of a pair of terms in a document (1–6)
  – combining the term frequencies of each term in the document (7, 8)
  – the terms in the entire query (9, 10)
  – normalization measures (11, 12)
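The running example can be reproduced in a few lines of Python. This is a minimal sketch (not code from the paper): the document is a list of tokens, and the position list and term frequency are derived from it.

```python
# Running example from the slide: D = a b c d a b d e f g h a i j, Q = {a, b}
D = ["a", "b", "c", "d", "a", "b", "d", "e", "f", "g", "h", "a", "i", "j"]
Q = ["a", "b"]

def pos(doc, term):
    """1-based positions at which term occurs in doc."""
    return [i + 1 for i, t in enumerate(doc) if t == term]

def tf(doc, term):
    """Term frequency of term in doc."""
    return len(pos(doc, term))

print(pos(D, "a"), pos(D, "b"))  # [1, 5, 12] [2, 6]
print(tf(D, "a"), tf(D, "b"))    # 3 2
```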

5 PROXIMITY MEASURES
- min_dist(a,b,D) = 1
  – The minimum distance between any occurrences of a and b in D.
  – Closeness -> relatedness.
- diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 6 - 4 = 2
  – The difference between the average positions of a and b in D.
  – Indicates where each term tends to occur.
- avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
  – The average distance between a and b over all possible position combinations in D.
  – Promotes terms that consistently occur close to one another in a localised area.
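A small sketch of the three position-based measures above, continuing the example and reusing D, pos() and tf() from the previous snippet. The definitions follow the slide's worked values; the paper's exact formulations may differ slightly.

```python
def min_dist(a, b, doc):
    """Minimum distance between any occurrence of a and any occurrence of b."""
    return min(abs(i - j) for i in pos(doc, a) for j in pos(doc, b))

def diff_avg_pos(a, b, doc):
    """Difference between the average positions of a and b."""
    pa, pb = pos(doc, a), pos(doc, b)
    return sum(pa) / len(pa) - sum(pb) / len(pb)

def avg_dist(a, b, doc):
    """Average distance over all position pairs of a and b."""
    pa, pb = pos(doc, a), pos(doc, b)
    return sum(abs(i - j) for i in pa for j in pb) / (len(pa) * len(pb))

print(min_dist("a", "b", D))      # 1
print(diff_avg_pos("a", "b", D))  # 6.0 - 4.0 = 2.0
print(avg_dist("a", "b", D))      # 26/6 = 4.33...
```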

6 PROXIMITY MEASURES
- avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
  – The occurrence of a at position 12 may be completely unrelated to b.
- match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  – The smallest distance achievable when each occurrence of a term is uniquely matched to an occurrence of the other term.
- max_dist(a,b,D) = (12-6) = 6
  – The maximum distance between any two occurrences of a and b.
  – A useful normalization factor.
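Continuing the sketch for the three measures on this slide. avg_min_dist and match_dist reproduce the worked values directly; for max_dist the slide's example (12 - 6 = 6) suggests taking the largest of the per-occurrence nearest-neighbour distances rather than the largest distance over all position pairs, so that interpretation is used here and should be treated as an assumption.

```python
from itertools import permutations

def avg_min_dist(a, b, doc):
    """Average, over occurrences of the rarer term, of the shortest distance
    to any occurrence of the other term."""
    pa, pb = pos(doc, a), pos(doc, b)
    rare, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return sum(min(abs(i - j) for j in other) for i in rare) / len(rare)

def match_dist(a, b, doc):
    """Smallest average distance when each occurrence of the rarer term is
    matched to a distinct occurrence of the other term (brute force)."""
    pa, pb = pos(doc, a), pos(doc, b)
    rare, other = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    best = min(sum(abs(i - j) for i, j in zip(rare, m))
               for m in permutations(other, len(rare)))
    return best / len(rare)

def max_dist(a, b, doc):
    """Largest nearest-neighbour distance between occurrences of a and b
    (assumed interpretation; it reproduces the slide's 12 - 6 = 6)."""
    pa, pb = pos(doc, a), pos(doc, b)
    return max(max(min(abs(i - j) for j in pb) for i in pa),
               max(min(abs(i - j) for j in pa) for i in pb))

print(avg_min_dist("a", "b", D))  # 1.0
print(match_dist("a", "b", D))    # 1.0
print(max_dist("a", "b", D))      # 6
```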

7 PROXIMITY MEASURES
- sum(tf(a),tf(b)) = 3+2 = 5
  – The sum of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms.
- prod(tf(a),tf(b)) = 3*2 = 6
  – The product of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms.
- fullcover(Q,D) = 12
  – The length of the span of the document that covers all occurrences of the query-terms.
  – A query-specific measure.
- min_cover(Q,D) = 2
  – The length of the shortest span of the document that covers all query-terms at least once.
  – Equals min_dist + 1 for a two-term query.
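The frequency- and cover-based measures from this slide, again continuing the running example. min_cover uses a simple brute-force window search, which is fine for a sketch.

```python
def fullcover(query, doc):
    """Length of the span covering ALL occurrences of the query-terms."""
    positions = [p for t in query for p in pos(doc, t)]
    return max(positions) - min(positions) + 1

def min_cover(query, doc):
    """Length of the shortest span covering every query-term at least once."""
    terms = [t for t in query if t in doc]
    best = len(doc)
    for start in range(len(doc)):
        for end in range(start, len(doc)):
            if all(t in doc[start:end + 1] for t in terms):
                best = min(best, end - start + 1)
                break
    return best

print(tf(D, "a") + tf(D, "b"))  # sum  = 5
print(tf(D, "a") * tf(D, "b"))  # prod = 6
print(fullcover(Q, D))          # 12
print(min_cover(Q, D))          # 2
```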

8 PROXIMITY MEASURES
- dl(D) = 14
  – The length of the document.
  – A useful factor for normalization in IR.
- qt(Q,D) = 2
  – The number of unique terms that match both the document and the query.
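The two normalization measures are trivial to compute; a short sketch of each, in the same style as above:

```python
def dl(doc):
    """Document length in tokens."""
    return len(doc)

def qt(query, doc):
    """Number of unique query-terms that also occur in the document."""
    return len(set(query) & set(doc))

print(dl(D))     # 14
print(qt(Q, D))  # 2
```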

9 PROXIMITY MEASURES
- Correlations of the measures with relevance
  – FBIS, FT and FR collections from TREC disks 4 and 5
  – OHSUMED collection
- Measured by re-ranking the top-N (N = 1000) documents from an initial ranked list using a proximity function.
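A hedged sketch of the re-ranking procedure described above: the top-N documents of a baseline ranking are re-ordered by a single proximity measure so that its relationship to relevance can be examined. Function and variable names are illustrative, not the authors'.

```python
def rerank_top_n(initial_ranking, query, measure, top_n=1000, ascending=True):
    """Re-rank the top-N (doc_id, tokens) pairs of a baseline ranking by one
    proximity measure. Distance-like measures are sorted ascending (smaller
    is better); set ascending=False for measures such as qt or prod(tf)."""
    head, tail = initial_ranking[:top_n], initial_ranking[top_n:]
    head = sorted(head, key=lambda item: measure(query, item[1]),
                  reverse=not ascending)
    return head + tail

# e.g. re-rank a baseline run by min_cover (a distance-like, query-level measure):
# reranked = rerank_top_n(baseline_run, Q, min_cover)
```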

10 PROXIMITY MEASURES
- Most measures show inverse correlations with relevance.
- Exception: qt is positively correlated with relevance.

11 PROXIMITY RETRIEVAL MODEL
- Extends a vector model
  – Documents and queries are represented as matrices (e.g. for a 3-term query)
  – w(): a standard term-weighting scheme
  – p(): a proximity function
- No theoretical basis
  – An intuitive extension of a vector-based approach
  – A Genetic Programming (GP) technique combines some or all of the 12 proximity measures.
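A minimal sketch of how such a pairwise score might be assembled: "diagonal" entries of the query-document matrix carry standard term weights w(), while off-diagonal entries carry pairwise proximity scores p() for every pair of matching query-terms. The exact way the paper combines the entries is not shown on this slide, so the plain summation below is an assumption.

```python
def pairwise_score(query_terms, doc, w, p):
    """Sum of term weights plus pairwise proximity scores.
    w(term, doc) is a standard term-weighting scheme; p(a, b, doc) is a
    proximity function built from the measures above. The additive
    combination here is an illustrative assumption, not the paper's model."""
    matching = [t for t in query_terms if t in doc]
    score = sum(w(t, doc) for t in matching)
    for i, a in enumerate(matching):
        for b in matching[i + 1:]:
            score += p(a, b, doc)
    return score
```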

12 EXPERIMENTAL SETUP
- Term-weighting scheme
  – BM25 scheme
  – Previous work
- Proximity function
- The benchmark proximity functions
  – BM25 + t()
  – ES + t()
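One way to read the benchmark functions "BM25 + t()" is a standard BM25 score with an additive proximity bonus t(Q, D). The sketch below uses the usual BM25 formulation with typical default parameters and reuses tf() from the earlier sketch; the exact form of t() (and of the ES scheme) is not given on this slide, so everything here is an assumption for illustration.

```python
import math

def bm25_weight(term, doc, N, df, avg_dl, k1=1.2, b=0.75):
    """Standard BM25 weight of one term in one document.
    N: number of documents in the collection, df: document frequency of the
    term, avg_dl: average document length (all supplied by the caller)."""
    f = tf(doc, term)
    if f == 0 or df == 0:
        return 0.0
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_dl))

def bm25_plus_proximity(query, doc, N, dfs, avg_dl, t):
    """BM25 summed over the query terms plus an additive proximity bonus
    t(query, doc), e.g. one of the pair-based sketches above."""
    base = sum(bm25_weight(q, doc, N, dfs.get(q, 0), avg_dl) for q in set(query))
    return base + t(query, doc)
```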

13 EXPERIMENTAL SETUP
- GP process
  – A heuristic stochastic search algorithm
- Training
  – Financial Times collection: 69500 documents
  – Queries: 25 title-only, 30 title + description
  – Fitness function: MAP
- GP
  – Documents ranked using the weighting scheme; top 3000 documents used
  – 6 runs of GP
  – Initial population of 2000 for 30 generations
  – Elitist strategy
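The slide only names the GP settings (population 2000, 30 generations, elitism, MAP fitness), so the snippet below is not the authors' system: it is a tiny elitist evolutionary loop over weight vectors, standing in for the real GP, which evolves expression trees over the 12 proximity measures and scores each candidate by MAP on the training queries.

```python
import random

def evolve(fitness, dim, pop_size=50, generations=30, elite=2, sigma=0.1):
    """Toy elitist evolutionary loop (illustrative stand-in for the GP run).
    An individual is a weight vector; `fitness` is supplied by the caller
    (in the paper it would be MAP over the training queries)."""
    population = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        nxt = ranked[:elite]                                          # elitist strategy
        while len(nxt) < pop_size:
            parent = random.choice(ranked[:max(2, pop_size // 5)])    # truncation selection
            nxt.append([w + random.gauss(0, sigma) for w in parent])  # mutation
        population = nxt
    return max(population, key=fitness)

# Toy usage: recover a known target weight vector.
target = [0.5, -0.2, 0.8]
best = evolve(lambda w: -sum((a - b) ** 2 for a, b in zip(w, target)), dim=3)
print(best)
```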

14 EXPERIMENTAL RESULTS
- Significance tested with the Wilcoxon signed-rank test.

15 EXPERIMENTAL RESULTS
- Significance tested with the Wilcoxon signed-rank test.
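The Wilcoxon signed-rank test compares two retrieval runs on their paired per-query scores (e.g. average precision). A minimal example with made-up numbers, using SciPy; the figures are not from the paper.

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average-precision scores for a baseline run and a
# proximity-enhanced run (illustrative numbers only).
baseline  = [0.31, 0.12, 0.45, 0.20, 0.38, 0.05, 0.27, 0.33]
proximity = [0.35, 0.15, 0.44, 0.26, 0.41, 0.06, 0.30, 0.37]

stat, p = wilcoxon(baseline, proximity)
print(f"W = {stat}, p = {p:.3f}")  # a small p-value indicates a significant difference
```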

16 CONCLUSION
- We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
- We have indicated the potential correlation between each of the individual measures and relevance.
  – min_dist is highly correlated with relevance.
- We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
  – We adopt a population-based learning technique (GP) which learns useful proximity functions.
- An evaluation of three proximity functions shows that it is possible to use combinations of proximity measures to improve the performance of IR systems for both short and long queries.

