
1 LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker: Yi-Ling Tai, Date: 2010/03/15

2 OUTLINE
- Introduction
- Proximity Measures
- Correlations of Measures
- Proximity Retrieval Model
- Experimental Setup
  - Benchmarks
  - GP Process
  - Training
- Experimental Results
- Conclusion

3 INTRODUCTION Traditional ad hoc retrieval models do not take into account the closeness or proximity of terms. Document scores in these models are primarily based on the occurrences of query-terms. Intuitively, documents in which query-terms occur closer together should be ranked higher.

4 INTRODUCTION
The contributions of this paper:
- outline several term-term proximity measures.
- develop an intuitive framework into which useful term-term proximity functions can be incorporated.
- use a learning approach to combine proximity measures into a useful proximity function.
- evaluate the best learned proximity functions and show that they achieve a significant increase in performance.

5 PROXIMITY MEASURES
The following sample document D will be used to explain the proximity measures. (The worked examples on the following slides imply that term a occurs at positions 1, 5 and 12, term b at positions 2 and 6, and that D has length 14.)

6 PROXIMITY MEASURES
- the minimum distance between any occurrences of a and b in D, i.e. min_dist is 1.
- the difference between the average positions of a and b in D, i.e. ((1+5+12)/3) − ((2+6)/2) = 2.
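
As an aside (not from the paper), these two measures can be computed with a few lines of Python, assuming, as the worked examples imply, that a occurs at positions 1, 5 and 12 of D and b at positions 2 and 6:

```python
a_pos = [1, 5, 12]   # positions of term a in D (inferred from the examples)
b_pos = [2, 6]       # positions of term b in D

# Minimum distance between any occurrences of a and b: 1
min_dist = min(abs(i - j) for i in a_pos for j in b_pos)

# Difference between the average positions of a and b: (18/3) - (8/2) = 2
diff_avg_pos = sum(a_pos) / len(a_pos) - sum(b_pos) / len(b_pos)

print(min_dist, diff_avg_pos)   # -> 1 2.0
```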

7 PROXIMITY MEASURES
- the average distance between a and b over all position combinations in D, i.e. ((1 + 5) + (3 + 1) + (10 + 6))/(2 · 3) = 26/6 = 4.33.
- the average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term, i.e. ((2 − 1) + (6 − 5))/2 = 1.
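
Continuing the same illustrative sketch with the same assumed positions, these two averaging-based measures can be computed as follows:

```python
a_pos = [1, 5, 12]   # assumed positions of a (as above)
b_pos = [2, 6]       # assumed positions of b

# Average distance over all position combinations: 26 / 6 = 4.33
avg_dist = sum(abs(i - j) for i in a_pos for j in b_pos) / (len(a_pos) * len(b_pos))

# Average of the shortest distance from each occurrence of the rarer term (b)
# to any occurrence of the other term: (1 + 1) / 2 = 1
rare, other = (b_pos, a_pos) if len(b_pos) <= len(a_pos) else (a_pos, b_pos)
avg_min_dist = sum(min(abs(i - j) for j in other) for i in rare) / len(rare)

print(avg_dist, avg_min_dist)   # -> 4.333..., 1.0
```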

8 PROXIMITY MEASURES
- the smallest distance achievable when each occurrence of one term is uniquely matched to an occurrence of the other term, i.e. ((2 − 1) + (6 − 5))/2 = 1.
- the maximum distance between any two occurrences of a and b, i.e. (12 − 6) = 6.
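
The matching-based measure can be computed by brute force over one-to-one matchings; the sketch below again uses the assumed positions (the maximum-distance measure is not included in the sketch):

```python
from itertools import permutations

a_pos = [1, 5, 12]   # assumed positions of a
b_pos = [2, 6]       # assumed positions of b

# Smallest average distance over one-to-one matchings of the rarer term's
# occurrences to occurrences of the other term: ((2-1) + (6-5)) / 2 = 1
rare, other = (b_pos, a_pos) if len(b_pos) <= len(a_pos) else (a_pos, b_pos)
match_dist = min(
    sum(abs(i - j) for i, j in zip(rare, perm)) / len(rare)
    for perm in permutations(other, len(rare))
)
print(match_dist)   # -> 1.0
```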

9 PROXIMITY MEASURES
- the sum of the term frequencies of a and b in D, i.e. 3 + 2 = 5.
- the product of the term frequencies of a and b, i.e. 3 × 2 = 6.
- the length of the part of the document that covers all occurrences of the query-terms, i.e. 12.

10 PROXIMITY MEASURES
- the length of the part of the document that covers all query-terms at least once, i.e. 2.
- the length of the document, i.e. 14.
- the number of unique terms that match both document and query, i.e. 2.
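
The remaining measures are simple counts and spans. A sketch over the same assumed document (length 14, with query terms a and b):

```python
a_pos = [1, 5, 12]          # assumed positions of a
b_pos = [2, 6]              # assumed positions of b
doc_len = 14                # assumed document length
query_terms = {"a", "b"}
doc_terms = {"a", "b"}      # query terms that actually appear in D

sum_tf  = len(a_pos) + len(b_pos)        # 3 + 2 = 5
prod_tf = len(a_pos) * len(b_pos)        # 3 * 2 = 6

all_pos = a_pos + b_pos
full_cover = max(all_pos) - min(all_pos) + 1    # span covering all occurrences: 12

# Smallest span covering each of the two query terms at least once:
# positions 1..2, so length 2
min_cover = min(abs(i - j) + 1 for i in a_pos for j in b_pos)

matches = len(query_terms & doc_terms)   # unique terms in both query and document: 2

print(sum_tf, prod_tf, full_cover, min_cover, doc_len, matches)
```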

11 CORRELATIONS OF MEASURES
We use the FBIS, FT and FR collections from TREC disks 4 and 5, together with their associated topics, as test collections. For each set of topics we create a short query set and a medium-length query set.

12 CORRELATIONS OF MEASURES
We analyse each proximity measure independently, to predict which measures may be most useful when incorporated into a proximity function. For each query we analyse the top 1000 returned documents and examine the correlation.
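
The transcript does not show what the measures are correlated against, so the following is only an assumed illustration: correlating one measure's values over the top 1000 retrieved documents with binary relevance labels using NumPy (all data here is made up):

```python
import numpy as np

# Hypothetical data for one query: a proximity-measure value per retrieved
# document and a binary relevance label for each (both fabricated for the sketch).
measure_values = np.random.rand(1000)
relevance = np.random.randint(0, 2, size=1000)

# Pearson correlation between the measure and relevance.
corr = np.corrcoef(measure_values, relevance)[0, 1]
print(corr)
```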

13 CORRELATIONS OF MEASURES Table 2 shows the average values of the individual measures per query.

14 PROXIMITY RETRIEVAL MODEL
The representation of a document in our model scores every pair of query-terms: w() is a standard term-weighting scheme and p() is a proximity function. The entire score of the document can now be defined as the sum of all term-term relationships.
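
The model's formula is not reproduced in the transcript. A minimal sketch of this kind of pairwise scoring, under the assumption that w() is applied to each query-term on its own and p() to each pair of distinct query-terms, might look like this (w and p are placeholders):

```python
from itertools import combinations

def pairwise_score(query_terms, doc, w, p):
    """Score a document as the sum of all term-term relationships.

    w(t, doc)      -- a standard term-weighting scheme for a single term
    p(t1, t2, doc) -- a proximity function over a pair of query-terms
    Both are placeholders; the paper's exact formulation is not shown
    in the transcript.
    """
    score = sum(w(t, doc) for t in query_terms)               # single-term weights
    score += sum(p(t1, t2, doc)                               # pairwise proximity
                 for t1, t2 in combinations(query_terms, 2))
    return score
```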

15 EXPERIMENTAL SETUP
Benchmarks: the traditional BM25 scheme, whose score depends on
- the frequency of term t in document D
- the document length
- the document frequency of term t
- the average document length
- the term-frequency influence parameter k1 (1.2)
- the frequency of term t in the query
- the document normalization parameter b (0.75)
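
The BM25 formula itself is not reproduced in the transcript; the sketch below uses one standard Okapi BM25 form with the parameter values listed above (k1 = 1.2, b = 0.75). The exact variant used in the paper may differ, for example in the IDF smoothing or the query-term-frequency component:

```python
import math

def bm25_score(query_terms, doc_terms, df, N, dl_avg, k1=1.2, b=0.75):
    """One common Okapi BM25 variant (a sketch, not the paper's exact formula).

    query_terms -- list of query terms
    doc_terms   -- list of terms in the document D
    df          -- dict: term -> document frequency in the collection
    N           -- number of documents in the collection
    dl_avg      -- average document length
    """
    dl = len(doc_terms)
    score = 0.0
    for t in set(query_terms):
        tf = doc_terms.count(t)                      # frequency of t in D
        if tf == 0 or t not in df:
            continue
        qtf = query_terms.count(t)                   # frequency of t in the query
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)   # smoothed IDF
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / dl_avg))
        score += idf * norm * qtf                    # qtf used as a simple multiplier
    return score
```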

16 EXPERIMENTAL SETUP
The second benchmark is the ES weighting scheme, where cf_t is the frequency of t in the entire collection, together with a baseline proximity function t(). We will label the proximity-augmented functions BM25 + t() and ES + t().

17 GP PROCESS
- Initially, a population of solutions is created randomly.
- Individuals are selected for reproduction based on their fitness value.
- Once selection has occurred, recombination can start.
- Recombination creates new solutions for the next generation by use of crossover and mutation.
- The process usually ends when a predefined number of generations is complete.

18 GP PROCESS
- All 12 proximity measures introduced previously are used as input terminals to the GP.
- Three constants are used for scaling: {1, 10, 0.5}.
- The following functions are also used as inputs to the GP: +, −, ×, /, √, sq(), log().
- The fitness function used in the experiments is MAP.
- We then ran the GP six times for 30 generations.
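
Purely as an illustration (the paper's actual GP implementation is not shown), a minimal sketch of generating and evaluating random candidate proximity functions from the terminal and function sets above could look like the following; the measure names and the protected operators are assumptions, and the selection/crossover/mutation loop driven by MAP fitness is omitted:

```python
import math
import random

# Terminals: the 12 proximity measures (names assumed) plus the scaling constants.
TERMINALS = ["min_dist", "diff_avg_pos", "avg_dist", "avg_min_dist",
             "match_dist", "max_dist", "sum_tf", "prod_tf",
             "full_cover", "min_cover", "doc_len", "matches",
             1.0, 10.0, 0.5]

# Function set with protected division, sqrt and log so random trees never crash.
FUNCTIONS = {
    "+":    (2, lambda x, y: x + y),
    "-":    (2, lambda x, y: x - y),
    "*":    (2, lambda x, y: x * y),
    "/":    (2, lambda x, y: x / y if y != 0 else 1.0),
    "sqrt": (1, lambda x: math.sqrt(abs(x))),
    "sq":   (1, lambda x: x * x),
    "log":  (1, lambda x: math.log(abs(x)) if x != 0 else 0.0),
}

def random_tree(depth=3):
    """Grow a random expression tree over the terminals and functions."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    name = random.choice(list(FUNCTIONS))
    arity, _ = FUNCTIONS[name]
    return (name, [random_tree(depth - 1) for _ in range(arity)])

def evaluate(tree, measures):
    """Evaluate a tree for one query-term pair, given its measure values."""
    if isinstance(tree, float):
        return tree
    if isinstance(tree, str):
        return measures[tree]
    name, children = tree
    _, fn = FUNCTIONS[name]
    return fn(*(evaluate(c, measures) for c in children))

# Example: score one random candidate on made-up measure values.
example = {m: random.uniform(0.5, 5.0) for m in TERMINALS if isinstance(m, str)}
candidate = random_tree()
print(candidate, "->", evaluate(candidate, example))
```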

19 EXPERIMENTAL RESULTS We used a subset of the Financial Times collection as a training collection for the GP. Table 3 shows the results of the best three runs of the GP.

20 EXPERIMENTAL RESULTS The three best proximity functions produced by the GP.

21 EXPERIMENTAL RESULTS Tests on Unseen Data

22 CONCLUSION
- We have outlined a list of measures that may be used to capture the notion of proximity.
- We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
- We adopt a learning technique (GP) which learns useful proximity functions.
- The research described shows that it is possible to use combinations of proximity measures to improve the performance of IR systems.

