Smoothing Methods for LM in IR (Alejandro Figueroa)


1 Smoothing Methods for LM in IR. Alejandro Figueroa.

2 Outline
– The linguistic phenomena behind the retrieval of documents.
– The Language Modeling approach.
– Smoothing methods: overview, methods, parameter setting.
– Interpolation vs. back-off.
– Comparison of methods.
– Combination of methods.
– Personal outlook and conclusions.

3 The Linguistic Phenomena behind IR. "Reducing Information Variation on Texts" (Agata Savary and Christian Jacquemin); work in our QA group at DFKI.

4 Information Variation
The problem: simple keyword matching is not enough to retrieve the best documents for a query, because the same information can be phrased in several ways. For example, for the query "When was Albert Einstein born?":
– "The Nobel Prize of Physics Albert Einstein was born in 1879 in Ulm, Germany."
– "Born: 14 March 1879 in Ulm, Württemberg, Germany."
– "Physics Nobel Prize Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879."
– "Died 18 Apr 1955 (born 14 Mar 1879) German-American physicist."

5 Information Variation
Kinds of variation:
– Graphic: "14 March 1879" vs. "14 Mar 1879".
– Morphological: "Physics Nobel Prize".
– Syntactic: "German-American physicist".
– Semantic: "Albert Einstein was born at Ulm" vs. "German-American physicist".
Appropriateness criteria: precision and economy.

6 Language Modeling Approach. "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" (Chengxiang Zhai and John Lafferty).

7 Language Modeling
Rank documents by the probability that the query Q was generated by a probabilistic model based on the document d. Uni-gram model: P(q|d) is a product over the query terms, so without smoothing a single unseen query term drives P(q|d) to 0.
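A minimal sketch of the unsmoothed query-likelihood model (toy data and whitespace tokenization are assumptions for illustration); it shows why P(q|d) collapses to zero as soon as one query term is unseen in the document:

```python
from collections import Counter

def mle(word, doc_tokens):
    """Maximum-likelihood estimate P(w|d) = c(w; d) / |d|."""
    return Counter(doc_tokens)[word] / len(doc_tokens)

def query_likelihood(query_tokens, doc_tokens):
    """Unsmoothed unigram query likelihood: P(q|d) = prod over query terms of P(w|d)."""
    p = 1.0
    for w in query_tokens:
        p *= mle(w, doc_tokens)
    return p

doc = "albert einstein was born in ulm".split()
print(query_likelihood("einstein born".split(), doc))  # (1/6) * (1/6)
print(query_likelihood("einstein died".split(), doc))  # "died" is unseen, so the whole score is 0.0
```

This zero-probability problem is exactly what the smoothing methods below repair.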

8 Language Modeling
Smoothing methods make use of two probabilities in the model: P_s(w|d) for words seen in the document and P_u(w|d) for unseen words.

9 Language Modeling
The score can be computed over the matched terms only. Longer documents need less smoothing, so they receive a greater penalty through the length-normalization factor.

10 Smoothing Methods

11 Overview
The problem: adjust the MLE to compensate for data sparseness.
The role of smoothing:
– Make the LM estimate more accurate.
– Explain the non-informative words in the query.
Goals of the work:
– How sensitive is retrieval performance to the smoothing of a document LM?
– How should the model and its parameters be chosen?

12 Overview
The unsmoothed model is the MLE: P_ml(w|d) = c(w; d) / |d|, where c(w; d) is the count of w in d and |d| is the document length.

13 Overview
Smoothing tackles the effect of statistical variability in small training sets.
Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.

14 Smoothing Methods
Based on the Good-Turing idea: estimate the probability of new events by taking the count of singleton events and dividing it by the total number of events, which yields a value in (0, 1).

15 Good-Turing Idea
The probability of a term with frequency tf is given by:
  P_GT(tf) = (tf + 1) * E(N_{tf+1}) / (E(N_tf) * N_d)
where:
– N_d = total number of terms occurring in d.
– N_tf = number of terms with frequency tf in the document.
– E(N_tf) = expected value of N_tf.
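A sketch of the Good-Turing estimate under the (strong) assumption that the raw frequency-of-frequency counts N_tf can stand in for their expectations E(N_tf); real implementations smooth these counts first, and on tiny samples the raw estimate is unreliable:

```python
from collections import Counter

def good_turing(doc_tokens):
    counts = Counter(doc_tokens)
    n_d = len(doc_tokens)                 # N_d: total term occurrences in d
    n = Counter(counts.values())          # raw N_tf: number of terms with frequency tf

    def prob(word):
        tf = counts[word]
        if tf == 0:
            # total mass reserved for unseen events: N_1 / N_d
            return n.get(1, 0) / n_d
        if n.get(tf + 1, 0) == 0:
            return tf / n_d               # fall back to the MLE when N_{tf+1} = 0
        return (tf + 1) * n[tf + 1] / (n[tf] * n_d)

    return prob

p = good_turing("a a b b c".split())
```

The `n.get(...)` fallbacks stand in for the expectation E(N_tf), which a real implementation would obtain by curve fitting.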

16 Smoothing Methods
Jelinek-Mercer method: a linear interpolation of the ML model with the collection model,
  P_λ(w|d) = (1 - λ) P_ml(w|d) + λ P(w|C).

17 Smoothing Methods
Absolute discounting: decrease the probability of seen words by subtracting a constant δ from their counts,
  P_δ(w|d) = max(c(w; d) - δ, 0) / |d| + σ P(w|C),
where σ = δ |d|_u / |d| and |d|_u is the number of unique terms in d.

18 Smoothing Methods
Bayesian smoothing using Dirichlet priors: the model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution,
  P_μ(w|d) = (c(w; d) + μ P(w|C)) / (|d| + μ).
The idea is to adjust the probabilities with pseudo-counts μ P(w|C) contributed by the prior.

19 Summary: Smoothing Methods

Method                 P_s(w|d)                               α_d               Parameter
Jelinek-Mercer         (1-λ) P_ml(w|d) + λ P(w|C)             λ                 λ
Dirichlet              (c(w;d) + μ P(w|C)) / (|d| + μ)        μ / (|d| + μ)     μ
Absolute discounting   max(c(w;d)-δ, 0)/|d| + σ P(w|C)        δ |d|_u / |d|     δ
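The three rows of the summary table can be sketched as follows (a toy implementation over whitespace tokens; the default parameter values λ=0.1, μ=2000, δ=0.7 are the ballpark figures reported in the parameter-setting slides, not universal constants):

```python
from collections import Counter

def make_smoothers(doc_tokens, collection_tokens):
    c, d_len = Counter(doc_tokens), len(doc_tokens)
    coll, coll_len = Counter(collection_tokens), len(collection_tokens)
    p_c = lambda w: coll[w] / coll_len          # collection model P(w|C)

    def jelinek_mercer(w, lam=0.1):
        # (1 - lam) * P_ml(w|d) + lam * P(w|C)
        return (1 - lam) * c[w] / d_len + lam * p_c(w)

    def dirichlet(w, mu=2000.0):
        # (c(w; d) + mu * P(w|C)) / (|d| + mu)
        return (c[w] + mu * p_c(w)) / (d_len + mu)

    def absolute_discount(w, delta=0.7):
        # max(c(w; d) - delta, 0)/|d| + (delta * |d|_u / |d|) * P(w|C)
        sigma = delta * len(c) / d_len          # len(c) = number of unique terms
        return max(c[w] - delta, 0) / d_len + sigma * p_c(w)

    return jelinek_mercer, dirichlet, absolute_discount
```

All three assign non-zero mass to unseen words via P(w|C), which is what keeps P(q|d) from collapsing to zero.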

20 Parameter Setting
Five databases from TREC:
– Financial Times on disk 4.
– FBIS on disk 5.
– Los Angeles Times on disk 5.
– Disks 4 and 5 minus the Congressional Record.
– The TREC-8 web data.
Queries:
– Topics 351-400 (TREC-7 ad hoc task).
– Topics 401-450 (TREC-8 ad hoc / web task).

21 Parameter Setting
An example TREC-7 topic:
Number: 384
Title: space station moon
Description: Identify documents that discuss the building of a space station with the intent of colonizing the moon.
Narrative: A relevant document will discuss the purpose of a space station, initiatives towards colonizing the moon, impediments which thus far have thwarted such a project, plans currently underway or in the planning stages for such a venture; cost, countries prepared to make a commitment of men, resources, facilities and money to accomplish such a feat.

22 Parameter Setting
An example TREC-8 topic:
Number: 414
Title: Cuba, sugar, exports
Description: How much sugar does Cuba export and which countries import it?
Narrative: A relevant document will provide information regarding Cuba's sugar trade. Sugar production statistics are not relevant unless exports are mentioned explicitly.

23 Parameter Setting
Interaction of query length and type: two different versions of each set of queries:
– Title only (2 or 3 words).
– A long version (title + description + narrative).
The performance of each method is optimized by means of the non-interpolated average precision.

24 Parameter Setting
Jelinek-Mercer smoothing. The weight of a matched term is:
  log(1 + ((1 - λ)/λ) * P_ml(w|d) / P(w|C))
As λ -> 1 the weight goes to zero, i.e. more smoothing means less emphasis on relative term weighting.

25 Parameter Setting
Dirichlet priors. The weight of a matched term is:
  log(1 + c(w; d) / (μ P(w|C)))
α_d = μ / (|d| + μ) is a document-dependent length normalization factor that penalizes long documents.

26 Parameter Setting
Absolute discounting. α_d = δ |d|_u / |d| is document-dependent:
– Larger for a document with a flatter distribution of words.
– The weight of a matched term is:
  log(1 + (c(w; d) - δ) / (δ |d|_u P(w|C)))

27 Parameter Setting
Conclusions for Jelinek-Mercer:
– Precision is much more sensitive to λ for long queries than for title queries. Long queries need more smoothing, that is, less emphasis on the relative weighting of terms.
– On the web collection, precision was sensitive to smoothing for title queries too.
– For title queries, retrieval performance tends to be optimized around λ = 0.1.

28 Parameter Setting
Conclusions for Dirichlet priors:
– Precision is more sensitive to μ for long queries than for title queries, especially when μ is small.
– When μ is large, all long queries performed better than short queries; the opposite holds when μ is small.
– The optimal value of μ tends to be larger for long queries than for title queries.
– The optimal μ tends to vary from collection to collection.

29 Parameter Setting
Conclusions for absolute discounting:
– Precision is more sensitive to δ for long queries than for title queries.
– The optimal value δ ≈ 0.7 does not seem to differ much between title queries and long queries.
– Smoothing plays a more important role for long, verbose queries than for concise queries.

30 Interpolation vs. Back-off

31 Interpolation vs. Back-off
Interpolation-based methods: the counts of seen words are discounted, and the extra counts are shared by both seen and unseen words.
Back-off: trust the MLE for high-count words; discount and redistribute mass only for the less common terms.
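The structural difference can be sketched with Jelinek-Mercer-style discounting in both flavors (toy data; the renormalization of the reserved mass over unseen words only is exactly what back-off adds):

```python
from collections import Counter

def make_models(doc_tokens, collection_tokens, lam=0.5):
    c, n = Counter(doc_tokens), len(doc_tokens)
    coll, cn = Counter(collection_tokens), len(collection_tokens)
    p_c = lambda w: coll[w] / cn

    def interpolate(w):
        # the collection model contributes to every word, seen or not
        return (1 - lam) * c[w] / n + lam * p_c(w)

    def back_off(w):
        if c[w] > 0:
            return (1 - lam) * c[w] / n        # trust the (discounted) MLE for seen words
        # redistribute the reserved mass lam over unseen words only
        unseen = sum(p_c(v) for v in coll if c[v] == 0)
        return lam * p_c(w) / unseen if unseen else 0.0

    return interpolate, back_off
```

Because back-off concentrates the reserved mass on unseen words, its estimates react more sharply to the smoothing parameter, which matches the sensitivity results below.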

32 Interpolation vs. Back-off: Interpolation.

33 Interpolation vs. Back-off: Back-off.

34 Interpolation vs. Back-off
Results:
– The performance of the back-off strategy is more sensitive to the smoothing parameters, especially for Jelinek-Mercer and Dirichlet priors.
– This sensitivity is smaller for the absolute discounting method, due to the lower upper bound.

35 Comparison of Methods

36 Comparison of methods
For title queries:
– Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer.
– Dirichlet prior performed extremely well on the web collection and is insensitive to the value of μ there.
– Many non-optimal Dirichlet runs were still better than the other two methods.

37 Comparison of methods
For long queries:
– Jelinek-Mercer is better than Dirichlet, which is better than absolute discounting.
– Jelinek-Mercer is much more effective for long, verbose queries.
– All three methods perform better on long queries than on short queries.

38 Comparison of methods
General remark:
– The strong correlation between the effect of smoothing and the type of query is unexpected, since smoothing only improves the accuracy of estimating the unigram language model based on a document.
– Is this an effect of verbose queries?

39 Query Length/Verbosity
Four types of query, generated for TREC topics 1-150:
– Short keyword: only the title of the topic description.
– Short verbose: only the description field.
– Long keyword: the concept field (28 keywords on average).
– Long verbose: the title, description, and narrative fields (more than 50 words on average).
Both keyword query types behaved similarly, and likewise the two verbose types. Retrieval performance is much less sensitive to smoothing for the keyword queries than for the verbose queries.

40 Combining Methods. "A General Language Model for Information Retrieval" (Fei Song and W. Bruce Croft).

41 A general LM for IR
They propose an extensible model based on:
– The Good-Turing estimate.
– Curve-fitting functions.
– Model combinations.
The idea of using n-grams is to take the local context into account; uni-gram models assume term independence.

42 A general LM for IR
The new model:
1. Smooth each document with the Good-Turing estimate.
2. Expand each document model with the corpus.
3. Consider term pairs and expand the unigram model to a bi-gram model.

43 Step 1: The Good-Turing Idea, Revisited
The probability of a term with frequency tf is given by:
  P_GT(tf) = (tf + 1) * E(N_{tf+1}) / (E(N_tf) * N_d)
where:
– N_tf = number of terms with frequency tf in a document.
– E(N_tf) = expected value of N_tf.
– N_d = total number of terms occurring in d.

44 Step 2: Expanding the document model with the corpus.

45 Step 3: Modeling the query as a sequence of terms.

46 Step 4: Combining uni-grams and bi-grams.
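Step 4 can be sketched under simple assumptions: maximum-likelihood bi-gram estimates, a single interpolation weight `lam`, and a small `eps` floor (Song and Croft's actual model also applies the smoothing of steps 1 and 2 to the component models):

```python
from collections import Counter
import math

def combined_log_likelihood(query, doc, lam=0.7, eps=1e-9):
    """Score a query with an interpolated uni-gram/bi-gram document model."""
    uni = Counter(doc)
    bi = Counter(zip(doc, doc[1:]))
    n = len(doc)
    p_uni = lambda w: uni[w] / n
    # ML bi-gram estimate P(w | prev, d)
    p_bi = lambda prev, w: bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

    logp = math.log(p_uni(query[0]) + eps)     # first query term: uni-gram only
    for prev, w in zip(query, query[1:]):
        logp += math.log(lam * p_uni(w) + (1 - lam) * p_bi(prev, w) + eps)
    return logp
```

A query whose adjacent terms also co-occur as pairs in the document scores higher than independence alone would give, which is the point of the bi-gram expansion.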

47 Results
Two collections:
– The Wall Street Journal (WSJ): 250 MB, 74,520 documents.
– TREC-4: 2 GB, 567,529 documents.
Phrases of word pairs can be useful in improving retrieval performance, and the strategy can be easily extended.

48 Personal Outlook / Conclusions

49 Personal Outlook / Conclusions
Open points: the stop list, the Porter stemmer, and the document prior P(d).
N-grams cannot capture large-span relationships in the language, and the performance of the n-gram model has reached a plateau.

50 Principal Component Analysis
A low-dimensional representation of the data that captures relations between features. PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.

51 Latent Semantic Analysis
Semantic information is extracted by means of the Singular Value Decomposition (SVD). LSI uses a reduction to the first k columns of U.
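A minimal numpy sketch of the reduction (the term-document matrix and k=2 are toy assumptions; rows are terms, columns are documents):

```python
import numpy as np

# toy term-document co-occurrence matrix A (terms x documents)
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 1, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of A
doc_coords = np.diag(s[:k]) @ Vt[:k, :]            # documents in the k-dim latent space
```

Queries and documents are then compared in the k-dimensional space spanned by the first k columns of U rather than in the raw term space.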

52 Latent Semantic Analysis
– The eigenvectors for a set of documents can be viewed as concepts, described by a linear combination of terms chosen in such a way that documents are described as accurately as possible using only k such concepts.
– Terms that co-occur frequently will tend to align along the same eigenvectors.

53 Latent Semantic Analysis
SVD is expensive to compute; Cristianini developed an approximation strategy based on the Gram-Schmidt decomposition.
Multilinguality: the semantic space provides an ideal representation for performing multilingual information retrieval.

54 Personal Outlook / Conclusions
What happens if we use LSA to improve smoothing?
– Idea: smooth terms by assigning probability mass according to their semantic distance to the terms in the collection/query.
– Problem: scalability of the model. If a term is not in the set W from which the SVD decomposition was computed, we must resort to an approximation.

55 Personal Outlook / Conclusions
What happens if we use LSA to improve smoothing?
– Problem: if the documents belong to diverse topics, the classification in the new space becomes too heterogeneous, and the classification of the words is ambiguous.

56 Personal Outlook / Conclusions
Conclusions:
– Smoothing methods are simple and efficient.
– They provide an elegant way to deal with the data sparseness problem.
– They can be chosen according to the taste of the consumer.
– But they do not model the linguistic phenomena behind the scenes... at least for the moment.
– Even though the techniques do not require language knowledge, the Markov assumption introduces some sort of language dependency.

57 Questions?
– English only?
– Query expansion?
– How would smoothing help the Question Answering task?
– Which method would help a QA system in a more appropriate way? Why?

