A Language Modeling Approach to Information Retrieval
한경수, 2002-04-02


1 A Language Modeling Approach to Information Retrieval
한경수, 2002-04-02
 Introduction
 Previous Work
 Model Description
 Empirical Results
 Conclusions and Future Work
 Relevance Feedback in LM

2 Indexing model of probabilistic retrieval models (Introduction)
 A model of the assignment of indexing terms to documents
 Indexing model of the 2-Poisson model
Indicates the useful indexing terms by means of the differences in their rate of occurrence between documents that are elite for a given term and those without the property of eliteness.
 The current indexing models have not led to improved retrieval results, due to two unwarranted assumptions:
Documents are members of pre-defined classes.
–Combinatorial explosion of elite sets
The parametric assumption
–It is unnecessary to construct a parametric model of the data when we have the actual data.

3 Retrieval based on probabilistic LM (Introduction)
 Treat the generation of queries as a random process.
 Approach
Infer a language model for each document.
Estimate the probability of generating the query according to each of these models.
Rank the documents according to these probabilities.
 Intuition: users
–Have a reasonable idea of terms that are likely to occur in documents of interest.
–Will choose query terms that distinguish these documents from others in the collection.
 Collection statistics
Are integral parts of the language model.
Are not used heuristically, as in many other approaches.

4 Probabilistic IR (Introduction)
(Diagram: an information need is expressed as a query, which is matched against documents d1, d2, …, dn in the document collection.)

5 IR based on LM (Introduction)
(Diagram: each document d1, d2, …, dn in the collection has its own language model, and the query expressing the information need is generated from these models.)

6 Previous Work
 Difference from the 2-Poisson model
Don't make distributional assumptions.
Don't distinguish a subset of specialty words.
–Don't assume a preexisting classification of documents into elite and non-elite sets.
 Difference from the Robertson & Sparck Jones model and the Croft & Harper model
Don't focus on relevance, except to the extent that the process of query production is correlated with it.
 Related work: the Fuhr model; INQUERY; Kwok; Wong & Yao; Kalt

7 Query generation probability (Model Description)
 Ranking formula: the probability of producing the query given the language model of document d,
p(Q | M_d) = Π_{t ∈ Q} p̂(t | M_d) × Π_{t ∉ Q} (1 − p̂(t | M_d))
 Assumption: given a particular language model, the query terms occur independently.
 Maximum likelihood estimate:
p_ml(t | M_d) = tf_(t,d) / dl_d
where M_d is the language model of document d, tf_(t,d) is the raw term frequency of term t in document d, and dl_d is the total number of tokens in document d.
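As a minimal sketch of the per-term maximum likelihood estimate and the independence assumption (the toy document and query terms below are illustrative, not from the paper):

```python
from collections import Counter
from math import prod

def p_ml(term, doc_tokens):
    # Maximum likelihood estimate: p_ml(t | M_d) = tf_(t,d) / dl_d
    return Counter(doc_tokens)[term] / len(doc_tokens)

def query_likelihood(query_terms, doc_tokens):
    # Independence assumption: multiply the per-term probabilities
    # over the query terms.
    return prod(p_ml(t, doc_tokens) for t in query_terms)

doc = "the satellite launch contract covers two launch vehicles".split()
print(p_ml("launch", doc))                             # 2/8 = 0.25
print(query_likelihood(["satellite", "launch"], doc))  # (1/8) * (2/8) = 0.03125
```

The full ranking formula also multiplies (1 − p̂(t | M_d)) over terms not in the query; that factor is omitted here for brevity. Note that any query term missing from the document drives this product to zero, which is exactly the problem the next slide addresses.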

8 Insufficient data (Model Description)
 Zero probability
We don't wish to assign a probability of zero to a document that is missing one or more of the query terms; it would be a somewhat radical assumption to infer that p(t | M_d) = 0.
 Assumption: a non-occurring term is possible, but no more likely than would be expected by chance in the collection. If tf_(t,d) = 0,
p̂(t | M_d) = cf_t / cs
where cf_t is the raw count of term t in the collection and cs is the raw collection size (total number of tokens in the collection).
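A sketch of this back-off to the collection probability (toy texts; whitespace splitting stands in for real tokenization):

```python
from collections import Counter

def p_hat(term, doc_tokens, collection_tokens):
    # Use the ML estimate when the term occurs in the document;
    # otherwise fall back to the collection probability cf_t / cs.
    tf = Counter(doc_tokens)[term]
    if tf > 0:
        return tf / len(doc_tokens)
    return Counter(collection_tokens)[term] / len(collection_tokens)

doc = "launch contract signed".split()
collection = "launch contract signed rocket payload rocket".split()
print(p_hat("rocket", doc, collection))  # absent from doc: cf_t/cs = 2/6
print(p_hat("launch", doc, collection))  # present: tf/dl = 1/3
```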

9 Averaging for robustness (Model Description)
 If we could get an arbitrarily sized sample of data from M_d, we could be reasonably confident in the maximum likelihood estimator; but we only have a document-sized sample from that distribution.
 To circumvent this problem, we need an estimate from a larger amount of data:
p_avg(t) = ( Σ_{d' : t ∈ d'} p_ml(t | M_{d'}) ) / df_t
where df_t is the document frequency of t.
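The averaged estimate can be sketched as follows (the three toy documents are illustrative):

```python
from collections import Counter

def p_ml(term, doc_tokens):
    return Counter(doc_tokens)[term] / len(doc_tokens)

def p_avg(term, docs):
    # Average the ML estimates over the df_t documents that contain the term.
    ests = [p_ml(term, d) for d in docs if term in d]
    return sum(ests) / len(ests)  # len(ests) == df_t

docs = ["rocket launch".split(),
        "launch launch pad crew".split(),
        "crew training manual".split()]
print(p_avg("launch", docs))  # (1/2 + 2/4) / 2 = 0.5
```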

10 The risk (Model Description)
 We cannot and are not assuming that every document containing t is drawn from the same language model.
 So there is some risk in using the mean to estimate p(t | M_d): if we used the mean by itself, there would be no distinction between documents with different term frequencies.
 The risk for a term t in a document d is modeled with a geometric distribution:
R̂_(t,d) = (1 / (1 + f̄_t)) × ( f̄_t / (1 + f̄_t) )^tf_(t,d)
where f̄_t = p_avg(t) × dl_d is the mean term frequency of term t in documents where t occurs, normalized by document length.
 As tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.
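A sketch of the risk function (the numbers below are made up purely to show the behaviour):

```python
def risk(tf, doc_len, p_avg_t):
    # Geometric-distribution risk:
    #   R(t,d) = (1 / (1 + f)) * (f / (1 + f)) ** tf
    # with f = p_avg(t) * dl_d, the mean tf normalized by document length.
    f = p_avg_t * doc_len
    return (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf

# With f = 0.01 * 100 = 1, the risk falls quickly as tf rises above the mean:
print(risk(tf=1, doc_len=100, p_avg_t=0.01))  # 0.25
print(risk(tf=4, doc_len=100, p_avg_t=0.01))  # 0.5 * 0.5**4 = 0.03125
```

A small R means the document's term frequency is far above the normalized mean, so (in the combined estimate on the next slide) more weight goes to the document's own ML estimate and less to the averaged one.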

11 Combining the two estimates (Model Description)
p̂(t | M_d) = p_ml(t | M_d)^(1 − R̂_(t,d)) × p_avg(t)^(R̂_(t,d))   if tf_(t,d) > 0
p̂(t | M_d) = cf_t / cs                                           otherwise
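Putting the pieces together, the combined per-term estimate can be sketched end to end (toy data; a sketch of the model, not the paper's implementation):

```python
from collections import Counter

def p_ponte(term, doc_tokens, docs, collection_tokens):
    # Combined estimate: a geometric mixture of p_ml and p_avg, weighted by
    # the risk R(t,d); back off to cf_t / cs when the term is absent.
    tf = Counter(doc_tokens)[term]
    if tf == 0:
        return Counter(collection_tokens)[term] / len(collection_tokens)
    dl = len(doc_tokens)
    p_ml = tf / dl
    ests = [Counter(d)[term] / len(d) for d in docs if term in d]
    p_avg = sum(ests) / len(ests)
    f = p_avg * dl                                   # normalized mean tf
    r = (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf    # geometric risk
    return p_ml ** (1.0 - r) * p_avg ** r

docs = ["rocket launch".split(), "launch launch pad crew".split()]
collection = [t for d in docs for t in d]
# Here p_ml = p_avg = 0.5 for "launch" in the second doc, so any mixture of
# the two is 0.5 regardless of the risk weight:
print(p_ponte("launch", docs[1], docs, collection))
```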

12 Analysis of the formulation (Model Description)
 Generalization: a formulation of the LM approach for IR, combining a general language model (the collection term, cf_t / cs) with an individual-document model (the p_ml / p_avg term).
 Conception
The user has a document in mind, and generates the query from this document.
The equation represents the probability that the document the user had in mind was in fact this one.

13 Experiment environment (Empirical Results)
 Data
TREC topics 202–250 on TREC disks 2 and 3
–Natural language queries consisting of one sentence each
TREC topics 51–100 on TREC disk 3, using the concept fields
–Lists of good terms
Example topic:
Number: 054
Domain: International Economics
Topic: Satellite Launch Contracts
Description: …
Concept(s):
1. Contract, agreement
2. Launch vehicle, rocket, payload, satellite
3. Launch services, …

14 Recall/Precision Experiments (1) (Empirical Results; results table omitted)

15 Recall/Precision Experiments (2) (Empirical Results; results table omitted)

16 Improving the basic model (1) (Empirical Results)
 Smoothing the estimate of the average probability for terms with low document frequency
The estimate is based on a small amount of data, so it can be sensitive to outliers.
 Binned estimate
Bin the low-frequency data by document frequency.
–Cutoff: df = 100
Use the binned estimate for the average.

17 Improving the basic model (2) (Empirical Results; results table omitted)

18 Improving the basic model (3) (Empirical Results; results table omitted)

19 Conclusions & Future Work
 Conclusions
A novel way of looking at the problem of text retrieval, based on probabilistic language modeling
–Conceptually simple and explanatory
LM will provide effective retrieval, and can be improved to the extent that the following conditions are met:
–Our language models are accurate representations of the data.
–Users understand our approach to retrieval.
–Users have some sense of term distribution.
The ability to think about retrieval in a new way
 Future Work
Estimate of the default probability
–The current estimator could, in some strange cases, assign a higher probability to a non-occurring query term.
Query expansion

20 LM approach to multiple relevant documents (Relevance Feedback in LM)
 Current LM approach
Allows for N + 1 language models
–N individual-document models (one per document in the collection) plus the general language model
The relationship between the general language model and the individual-document models is never raised.
–How can a document be generated from one language model when the entire collection is generated from a different one?
 We need a general model for some accumulation of text, which is modified (not replaced) by a local model for some smaller part of the same text.

21 3-level model (1) (Relevance Feedback in LM)
 3-level model
Whole-collection model
Specific-topic model (relevant-documents model)
Individual-document model
 Relevance hypothesis
A request (query; topic) is generated from a specific-topic model.
If and only if a document is relevant to the topic, the same model will apply to the document.
–It will replace part of the individual-document model in explaining the document.
The probability of relevance of a document is
–The probability that this model explains part of the document
–The probability that the {collection, topic, document} combination explains the document better than the {collection, document} combination

22 3-level model (2) (Relevance Feedback in LM)
(Diagram: the query expressing the information need is generated through the three-level model, with documents d1, d2, …, dn in the collection explained by the collection, topic, and individual-document models.)

23 Geometric distribution (1)
 Geometric distribution
Repeat Bernoulli trials with success probability p until the first success. If X is the total number of trials, then X follows the geometric distribution:
P(X = k) = (1 − p)^(k−1) p,  k = 1, 2, …,  with E[X] = 1/p.

24 Geometric distribution (2)
 Example
A single run of an experiment costs 100,000 won. The experiment succeeds with probability 0.2 and is repeated until it succeeds. What should we expect the total cost to be?
Since E[X] = 1/p = 1/0.2 = 5 trials, the expected total cost is 5 × 100,000 = 500,000 won.
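A quick Monte Carlo check of this expected value (E[X] = 1/0.2 = 5 trials, i.e. 500,000 won at 100,000 won per trial; the estimate is approximate by nature):

```python
import random

def trials_until_success(p, rng):
    # Count Bernoulli trials up to and including the first success.
    n = 1
    while rng.random() >= p:
        n += 1
    return n

rng = random.Random(0)
samples = [trials_until_success(0.2, rng) for _ in range(100_000)]
mean_trials = sum(samples) / len(samples)
print(mean_trials)            # close to E[X] = 5
print(mean_trials * 100_000)  # expected cost in won, close to 500,000
```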

