1 LANGUAGE MODELS FOR RELEVANCE FEEDBACK 2003. 07. 14 Lee Won Hee

2 Abstract  The language modeling approach to IR  - Query: a random event  - Documents: ranked according to the likelihood of generating the query  - Users are assumed to have a prototypical document in mind and to choose query terms accordingly  - Inferences about the semantic content of documents need not be made, resulting in a conceptually simple model

3 1. Introduction  The language modeling approach to IR  - Developed by Ponte and Croft, 1998  - Query: a random event generated according to a probability distribution  - Document ranking: estimate a model of the term generation probabilities for the query terms for each document, then rank documents by the probability of generating the query  The main advantages of the language modeling approach  - Document boundaries are not predefined; document-level statistics play the role of tf and idf  - Uncertainty is modeled by probabilities, which suits noisy data such as OCR text and automatically recognized speech transcripts  - It extends naturally to relevance feedback and document routing

4 2. The Language Modeling Approach to IR  The query generation probability  The probability is estimated starting from the maximum likelihood estimate of term t in document d:  - p_ml(t | M_d) = tf(t,d) / dl_d  - tf(t,d): the raw term frequency of term t in document d  - dl_d: the total number of tokens in document d
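The maximum likelihood estimate above can be sketched in a few lines (function and variable names are mine, not from the paper):

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate of term t under document model M_d:
    tf(t,d) / dl_d, i.e. the relative frequency of t among d's tokens."""
    tf = Counter(doc_tokens)
    return tf[term] / len(doc_tokens)

doc = ["the", "quick", "brown", "fox", "the"]
print(p_ml("the", doc))  # 0.4 (tf = 2, dl_d = 5)
print(p_ml("cat", doc))  # 0.0 -- the zero-probability problem of Section 2.1
```

The second call shows why this estimator alone is unusable: any document missing a single query term would score zero.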

5 2.1 Insufficient Data  Two problems with the maximum likelihood estimator  - We do not wish to assign a probability of zero to a document that is missing one or more of the query terms; if a user included several synonyms in the query, a document missing even one of them would not be retrieved. A more reasonable fallback is the collection probability cf_t / cs  - We only have a document-sized sample from the underlying distribution, so the variation in the raw counts may partially be accounted for by randomness  * cf_t: the raw count of term t in the collection  * cs: the raw collection size, i.e. the total number of tokens in the collection

6 2.2 Averaging  The mean probability estimate of t in documents containing it  - p_avg(t) = ( Σ_{d : t ∈ d} p_ml(t | M_d) ) / df_t  - circumvents the problem of insufficient data  - some risk remains: if the mean were used by itself, there would be no distinction between documents with different term frequencies  Combining the two estimates using the geometric distribution (Ghosh et al., 1983)  - gives robust estimation and minimizes the risk  - R_{t,d} = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^{tf(t,d)}  * df_t: the document frequency of t  * f̄_t: the mean term frequency of term t in documents containing it

7 2.3 Combining the Two Estimates  The estimate of the probability of producing the query for a given document model  - p(Q | M_d) = ∏_{t ∈ Q} p̂(t | M_d) × ∏_{t ∉ Q} (1 − p̂(t | M_d))  - first term: the probability of producing the terms in the query  - second term: the probability of not producing other terms, which favors better discriminators of the document  The combined term estimate  - p̂(t | M_d) = p_ml(t | M_d)^{(1 − R_{t,d})} × p_avg(t)^{R_{t,d}}   if tf(t,d) > 0  - p̂(t | M_d) = cf_t / cs   otherwise
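The whole scoring pipeline can be sketched on a toy collection (all names are mine; the risk function follows the geometric form with mean term frequency f̄_t, and this is an illustrative sketch rather than the paper's exact implementation):

```python
import math
from collections import Counter

def build_stats(docs):
    """Collection statistics: cf_t, cs, mean ML probability p_avg(t)
    over documents containing t, and mean term frequency f̄_t."""
    cf, p_sum, df = Counter(), Counter(), Counter()
    for d in docs:
        tf = Counter(d)
        cf.update(tf)
        for t, n in tf.items():
            df[t] += 1
            p_sum[t] += n / len(d)
    cs = sum(cf.values())
    p_avg = {t: p_sum[t] / df[t] for t in df}
    f_mean = {t: cf[t] / df[t] for t in df}
    return cf, cs, p_avg, f_mean

def p_hat(t, doc, cf, cs, p_avg, f_mean):
    """Combined estimate: geometric-risk mixture of the ML and mean
    estimates when tf(t,d) > 0, collection frequency cf_t/cs otherwise."""
    tf = Counter(doc)
    if tf[t] == 0:
        return cf[t] / cs
    f = f_mean[t]
    risk = (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf[t]
    return (tf[t] / len(doc)) ** (1.0 - risk) * p_avg[t] ** risk

def score(query, doc, vocab, stats):
    """log p(Q | M_d): produce the query terms, do not produce the rest."""
    s = 0.0
    for t in vocab:
        p = p_hat(t, doc, *stats)
        s += math.log(p) if t in query else math.log(1.0 - p)
    return s

docs = [["language", "model", "retrieval"],
        ["feedback", "relevance", "model"],
        ["cooking", "recipe", "pasta"]]
stats = build_stats(docs)
vocab = set(stats[0])
q = {"relevance", "feedback"}
ranked = sorted(range(len(docs)),
                key=lambda i: score(q, docs[i], vocab, stats),
                reverse=True)
print(ranked)  # the document containing both query terms ranks first
```

Note that even the documents missing the query terms get a nonzero score, which is exactly the point of the fallback to cf_t / cs.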

8 3. Related Work  1. The Harper and van Rijsbergen model  2. The Rocchio method  3. The INQUERY model  4. Exponential models

9 3.1 The Harper and van Rijsbergen Model (1978)  Goal: obtain better estimates of the probability of relevance of a document given the query  An approximation of the dependence between query terms was defined by the authors by means of a maximal spanning tree  - each node of the tree: a single query term  - edges between nodes: weighted by a measure of term dependency  - the tree spans all of the nodes and maximizes the expected mutual information  - I(x_i, x_j) = Σ P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]  * P(x_i, x_j): the probability of terms x_i and x_j occurring together in a relevant document  * P(x_i): the probability of term x_i occurring in a relevant document  * P(x_j): the probability of term x_j occurring in a relevant document
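The expected mutual information used to weight the tree edges can be estimated from binary term occurrence in the judged-relevant set; a minimal sketch (names and the toy data are mine):

```python
import math

def emim(docs, ti, tj):
    """Expected mutual information between terms ti and tj, estimated from
    binary occurrence over a set of (relevant) documents. Sums over the four
    present/absent combinations; zero-probability cells contribute nothing."""
    n = len(docs)
    total = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ij = sum(((ti in d) == a) and ((tj in d) == b) for d in docs) / n
            p_i = sum((ti in d) == a for d in docs) / n
            p_j = sum((tj in d) == b for d in docs) / n
            if p_ij > 0:
                total += p_ij * math.log(p_ij / (p_i * p_j))
    return total

rel = [{"model", "language", "pasta"},
       {"model", "language"},
       {"pasta"},
       {"retrieval"}]
# "model" and "language" always co-occur; "model" and "pasta" are independent
print(emim(rel, "model", "language"))  # positive (strong dependency)
print(emim(rel, "model", "pasta"))     # 0.0 (independent terms)
```

With these weights on every term pair, a standard maximum spanning tree algorithm gives the dependence tree the authors describe.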

10 3.2 The Rocchio Method (1971)  The Rocchio method  - provides a mechanism for the selection and weighting of expansion terms  - can be used to rank the terms in the judged documents; the top N are then added to the query and weighted  - is a reasonable solution to the problem of relevance feedback that works very well in practice  - the optimal values of α, β, γ are determined empirically  * β: the weight assigned to terms occurring in relevant documents  * γ: the weight assigned to terms occurring in non-relevant documents
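The classic Rocchio update, q' = α·q + (β/|R|)·Σ d_rel − (γ/|N|)·Σ d_nonrel, can be sketched with term-weight dictionaries (the parameter values here are illustrative defaults, not the slide's tuned values):

```python
def rocchio(query, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).
    Vectors are {term: weight} dicts; negative weights are clipped to zero."""
    out = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((rel, beta), (nonrel, -gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                out[t] = out.get(t, 0.0) + coef * w / len(docs)
    return {t: w for t, w in out.items() if w > 0}

q = {"relevance": 1.0}
rel = [{"relevance": 2.0, "feedback": 1.0}]
non = [{"cooking": 3.0}]
print(rocchio(q, rel, non))  # {'relevance': 2.5, 'feedback': 0.75}
```

Ranking the resulting weights and keeping the top N terms is exactly the expansion-term selection the slide describes.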

11 3.3 The INQUERY Model (1/2)  INQUERY inference network (Turtle, 1991)  - document portion: computed in advance  - query portion: computed at retrieval time  Document network  - document nodes: d_1 ... d_i  - text nodes: t_1 ... t_j  - concept representation nodes: r_1 ... r_k  Query network  - query concept nodes: c_1 ... c_m  - query nodes: q_1, q_2  - information need: I  Uncertainty  - due to differences in word sense  Figure 3.1 Example inference network

12 3.3 The INQUERY Model (2/2)  Relevance feedback  - the theoretical relevance feedback was implemented by Haines (1996)  Annotated query network  - proposition nodes: k_1, k_2  - observed relevance judgment nodes: j_1, j_2  - AND nodes: required for an annotation to have an effect on the score  The drawback of this technique  - it requires inferences of considerable complexity  - each relevance judgment requires two additional layers of inference and several new propositions  Figure 3.3 Annotated query network

13 3.4 Exponential Models  An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)  The model uses ratios of long-range and short-range language models to predict useful terms  Topic shift  - predicted when the long-range language model can no longer predict the next word better than the short-range language model  * P_l(x): the probability of seeing word x given the context of the last 500 words  * P_s(x): the probability of seeing word x given the two previous words
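The shift criterion reduces to watching the per-word log ratio log(P_l(x)/P_s(x)) and flagging positions where it drops below zero. A toy sketch, with deliberately crude stand-ins for the two models (these stand-ins are my assumption, not Beeferman et al.'s actual exponential models):

```python
import math

def topic_shift_signal(tokens, p_long, p_short):
    """Per-position log ratio of a long-range to a short-range language model.
    A topic shift is suggested where the long-range model stops out-predicting
    the short-range one, i.e. where the signal falls below zero."""
    return [math.log(p_long(tokens, i) / p_short(tokens, i))
            for i in range(len(tokens))]

# Toy stand-in: the "long-range" model favors words seen earlier in the text.
def p_long(tokens, i):
    seen = tokens[:i]
    return (seen.count(tokens[i]) + 1) / (len(seen) + 10)

# Toy stand-in: a flat short-range model.
def p_short(tokens, i):
    return 0.1

text = "model model model recipe".split()
signal = topic_shift_signal(text, p_long, p_short)
print(signal)  # the ratio drops at the vocabulary change ("recipe")
```

The real models are trained exponential models over much longer contexts; only the ratio-based detection logic carries over from this sketch.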

14 4. Query Expansion in the Language Modeling Approach  Assumption of this approach  - users can choose query terms that are likely to occur in documents in which they would be interested  This assumption has been developed into a ranking formula by means of probabilistic language models

15 4.1 Interactive Retrieval with Relevance Feedback  Relevance feedback  - a small number of documents are judged relevant by the user  - the relevance of all remaining documents is unknown to the system

16 4.2 Document Routing  Document routing  - the task is to choose terms associated with documents of interest and to avoid terms associated with other documents  - a training collection is available with a large number of relevance judgments, both positive and negative, for a particular query  Ratio method  - can utilize this additional information by estimating probabilities for both sets

17 4.3 The Ratio Method  The ratio method  - predicts useful terms  - terms are ranked by the log ratio of their mean probability under the relevant document models to their collection probability: log [ ( Σ_{d ∈ R} P(t | M_d) / |R| ) / ( cf_t / cs ) ]  - the top N terms are added to the initial query  * R: the set of relevant documents  * P(t | M_d): the probability of term t given the document model for d  * cf_t: the raw count of term t in the collection  * cs: the raw collection size
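A minimal sketch of this ranking, using the plain ML estimate in place of the full smoothed document model for brevity (the function name and toy data are mine):

```python
import math
from collections import Counter

def ratio_rank(rel_docs, all_docs, top_n=5):
    """Rank candidate expansion terms by the log ratio of their mean ML
    probability in the judged-relevant documents to their collection
    probability cf_t / cs. Only terms seen in relevant docs are candidates."""
    cf = Counter(t for d in all_docs for t in d)
    cs = sum(cf.values())
    terms = {t for d in rel_docs for t in d}
    scores = {}
    for t in terms:
        p_rel = sum(d.count(t) / len(d) for d in rel_docs) / len(rel_docs)
        scores[t] = math.log(p_rel / (cf[t] / cs))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

docs = [["feedback", "relevance", "feedback"],
        ["relevance", "model"],
        ["cooking", "pasta", "recipe"],
        ["pasta", "sauce", "recipe"]]
rel = docs[:2]
print(ratio_rank(rel, docs, top_n=2))
```

Terms that are common in the relevant set but rare in the collection score highest, which is exactly the behavior wanted from an expansion-term selector.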

18 4.4 Evaluation  Results are measured using recall and precision
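For reference, the two measures in their set-based form (per-query; TREC-style evaluations average these over queries and recall levels):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall    = |retrieved ∩ relevant| / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)  # 0.5 and 2/3: half the retrieved set is relevant,
             # two of the three relevant documents were found
```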

19 4.5 Experiments (1/2)  Comparison of the Rocchio method vs. the language modeling approach  - language model: log ratio of the probability in the judged relevant set to the collection probability  - Rocchio: the weighting function was tf·idf, with no negative feedback (γ = 0)  - the language modeling approach works well

20 4.5 Experiments (2/2)

21 4.6 Information Routing  Ratio methods with more data  - Ratio 1  - Ratio 2: the log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents  Result  - the language modeling approach is a good model for retrieval
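Ratio 2 is the only variant described on the slide, and it differs from the basic ratio method by using the judged non-relevant documents as the denominator. A sketch (the small epsilon smoothing is my assumption; the slide does not say how zero counts are handled):

```python
import math

def ratio2(term, rel_docs, nonrel_docs, eps=1e-6):
    """Log ratio of the term's mean ML probability in judged-relevant
    documents to its mean probability in judged non-relevant documents.
    eps guards against division by zero for terms absent from either set."""
    p_rel = sum(d.count(term) / len(d) for d in rel_docs) / len(rel_docs)
    p_non = sum(d.count(term) / len(d) for d in nonrel_docs) / len(nonrel_docs)
    return math.log((p_rel + eps) / (p_non + eps))

rel = [["relevance", "feedback"], ["feedback", "model"]]
non = [["cooking", "pasta"], ["pasta", "sauce"]]
print(ratio2("feedback", rel, non))  # large positive: favored by relevant set
print(ratio2("pasta", rel, non))     # large negative: favored by non-relevant set
```

Negative evidence is what the interactive-feedback setting lacks, so this variant only applies in the routing setting where both judgment sets are large.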

22 5. Query Term Weighting  Probability estimation  - maximum likelihood probability  - the average probability (combined via a geometric risk function)  Risk function  - the current risk function treats all terms equally  - the proposed change is to mix the estimation: a useless term or stop word is assigned an equal probability estimate for every document, so that it has no effect on the ranking  User-specified language models  - queries: a specific type of text produced by the user  - term weights: equivalent to the generation probabilities of the query model
