
1 Query Suggestion

2
- A variety of automatic or semi-automatic query suggestion techniques have been developed
  - Goal is to improve effectiveness by matching related/similar terms
  - Semi-automatic techniques require user interaction to select best suggested terms
- Query expansion is a related technique
  - Alternative queries, usually offer more terms

3 Query Suggestion
- Approaches usually based on an analysis of term co-occurrence
  - Either in the entire document collection, a large collection of queries, or the top-ranked documents in a result list
  - Query-based stemming also a suggestion technique
- Automatic suggestion based on general thesaurus not effective
  - Does not take context into account, e.g., “aquarium” is a good suggestion for “tank” in the query “tropical fish tank”, but not for “armor for tanks”

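4 Term Association Measures
- Dice’s Coefficient
  $\mathrm{Dice}(a, b) = \frac{2 \cdot n_{ab}}{n_a + n_b} \stackrel{\mathrm{rank}}{=} \frac{n_{ab}}{n_a + n_b}$
  where $\stackrel{\mathrm{rank}}{=}$ stands for rank equivalent
- Mutual Information Measure (MIM)
  $\mathrm{MIM}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)} = \log \frac{N \cdot n_{ab}}{n_a \cdot n_b} \stackrel{\mathrm{rank}}{=} \frac{n_{ab}}{n_a \cdot n_b}$
  where N is the number of documents in a collection, P(a) = n_a/N, P(b) = n_b/N, P(a, b) = n_ab/N
  - Measures the extent to which the words occur independently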

5 Term Association Measures
- The Mutual Information Measure (MIM) favors low-frequency terms
- The Expected Mutual Information Measure (EMIM) addresses this problem of MIM by weighting MIM using P(a, b)
  - Strictly, this is only the one part of EMIM that focuses on word co-occurrence
  - EMIM, however, favors high-frequency terms
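
Written out with the definitions P(a) = n_a/N, P(b) = n_b/N, P(a, b) = n_ab/N, the co-occurrence part of EMIM described on this slide would take the form below. This is a reconstruction from the slide's wording ("weighting MIM using P(a, b)"), not the slide's own formula, which did not survive in the transcript.

```latex
\mathrm{EMIM}(a, b)
  = P(a, b) \cdot \log \frac{P(a, b)}{P(a)\,P(b)}
  = \frac{n_{ab}}{N} \cdot \log \frac{N \cdot n_{ab}}{n_a \cdot n_b}
  \stackrel{\mathrm{rank}}{=} n_{ab} \cdot \log \frac{N \cdot n_{ab}}{n_a \cdot n_b}
```

The leading factor n_ab grows with the raw co-occurrence count, which is consistent with the slide's remark that EMIM favors high-frequency terms.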

6 Term Association Measures
- Pearson’s Chi-squared (χ²) measure
  - Compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words occurred independently
  - Normalizes this comparison by the expected number
  - Also a limited form, focused on word co-occurrence
  - Favors low-frequency terms
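
Assuming the same counts as for the other measures (n_a, n_b, n_ab, and N), the comparison described on this slide can be written as follows. The expected number of co-occurrences under independence is N · P(a) · P(b) = n_a · n_b / N. This is a reconstruction from the slide's description, since the slide's formula image is missing from the transcript.

```latex
\chi^2(a, b)
  = \frac{\left( n_{ab} - N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N} \right)^2}
         {N \cdot \frac{n_a}{N} \cdot \frac{n_b}{N}}
  \stackrel{\mathrm{rank}}{=}
    \frac{\left( n_{ab} - \frac{1}{N} \cdot n_a \cdot n_b \right)^2}{n_a \cdot n_b}
```

The rank-equivalent form drops the constant factor N from the denominator, which does not change the ordering of candidate terms.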

7 Association Measure Summary
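
The four measures can be compared side by side in a few lines of code. The sketch below uses the rank-equivalent forms (they order candidate terms the same way as the full formulas); the function names and the counts in the example are hypothetical and chosen only for illustration.

```python
import math


def dice(n_a, n_b, n_ab):
    """Rank-equivalent Dice's coefficient: n_ab / (n_a + n_b)."""
    return n_ab / (n_a + n_b)


def mim(n_a, n_b, n_ab):
    """Rank-equivalent mutual information measure: n_ab / (n_a * n_b)."""
    return n_ab / (n_a * n_b)


def emim(n_a, n_b, n_ab, n):
    """Rank-equivalent EMIM (co-occurrence part): n_ab * log(N * n_ab / (n_a * n_b))."""
    if n_ab == 0:
        return 0.0  # the term vanishes when the words never co-occur
    return n_ab * math.log(n * n_ab / (n_a * n_b))


def chi_squared(n_a, n_b, n_ab, n):
    """Rank-equivalent chi-squared: (n_ab - n_a * n_b / N)^2 / (n_a * n_b)."""
    return (n_ab - n_a * n_b / n) ** 2 / (n_a * n_b)


# Hypothetical counts: word a occurs in 100 of N = 1000 documents.
N = 1000
rare = dict(n_a=100, n_b=2, n_ab=2)        # b is rare and always occurs with a
common = dict(n_a=100, n_b=400, n_ab=80)   # b is frequent and often occurs with a

for name, counts in (("rare", rare), ("common", common)):
    print(name,
          "dice=%.4f" % dice(**counts),
          "mim=%.4f" % mim(**counts),
          "emim=%.4f" % emim(n=N, **counts),
          "chi2=%.4f" % chi_squared(n=N, **counts))
```

With these made-up counts, MIM scores the rare word higher while EMIM scores the frequent word higher, which matches the frequency biases noted on the earlier slides.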

8 Association Measure Example
Most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Annotations on the results table:
- Identical ranking & favor low-frequency words
- More general than MIM & χ²

9 Association Measure Example
Most strongly associated words for “fish”, a high-frequency term, in a collection of TREC news stories.
Annotation on the results table:
- Similar top-ranked words in MIM & χ²

10 Association Measure Example
Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
Annotations on the results table:
- Still favor low-frequency terms
- Most stable & reliable regardless of the window sizes
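
Counting co-occurrences in windows rather than whole documents changes only how n_a, n_b, n_ab, and N are obtained. A minimal sketch follows, assuming non-overlapping windows (the slide does not say whether the windows overlap); `window_counts` and the token list are hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_counts(tokens, size=5):
    """Count, over non-overlapping windows, how many windows contain
    each word and how many contain each unordered pair of words."""
    word_counts = Counter()
    pair_counts = Counter()
    n_windows = 0
    for start in range(0, len(tokens), size):
        window = set(tokens[start:start + size])  # each word counted once per window
        n_windows += 1
        word_counts.update(window)
        pair_counts.update(frozenset(p) for p in combinations(sorted(window), 2))
    return word_counts, pair_counts, n_windows


# Made-up text, split into windows of 5 words.
tokens = "tropical fish tank with tropical fish food and a fish net".split()
word_counts, pair_counts, n_windows = window_counts(tokens, size=5)

n_a = word_counts["tropical"]
n_b = word_counts["fish"]
n_ab = pair_counts[frozenset(("tropical", "fish"))]
print(n_a, n_b, n_ab, n_windows)
```

The resulting counts can be fed into any of the association measures; N is then the number of windows instead of the number of documents.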

11 Association Measures
- The words associated with the individual terms are of little use for expanding the query “tropical fish”
- Expansion based on the whole query takes context into account
  - e.g., using Dice with the phrase “tropical fish” gives the following highly associated words: goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet
- Impractical for all possible queries, so other approaches are used to achieve this effect

12 Other Approaches
- Pseudo-relevance feedback
  - Expansion terms based on top retrieved docs for initial query
- Context vectors
  - Represent words by the words that co-occur with them, e.g., top 35 most strongly associated words for “aquarium” (using Dice’s coefficient)
  - Rank words for a query by ranking context vectors
- Challenges (computational & accuracy): due to huge size & variability in quality of the collections
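
Pseudo-relevance feedback can be illustrated with a deliberately simple sketch: take the top-ranked documents for the initial query and pick the most frequent non-query terms. Raw term frequency is a stand-in chosen here for brevity, not the weighting the slides prescribe; `prf_expansion_terms` and the toy ranking are hypothetical.

```python
from collections import Counter


def prf_expansion_terms(query_terms, ranked_docs, top_docs=2, num_terms=2):
    """Pick expansion terms from the top-ranked documents of an initial query.

    ranked_docs is a list of token lists, best match first.
    Terms are scored by raw frequency (an assumption made for this sketch).
    """
    query = set(query_terms)
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:num_terms]]


# Made-up result list for the query "tropical fish".
ranked_docs = [
    ["tropical", "fish", "aquarium", "coral"],
    ["tropical", "fish", "aquarium", "pet"],
    ["fish", "market", "price"],
]
print(prf_expansion_terms(["tropical", "fish"], ranked_docs))
```

Only the first `top_docs` documents contribute, so an off-topic document further down the ranking (the third one here) does not affect the suggested terms.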

13 Other Approaches
- Query logs
  - Best source of information about queries & related terms: short pieces of text & click data
  - e.g., most frequent words in queries containing “tropical fish” from MSN log: stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies
  - Query suggestion based on finding similar queries: group based on click data
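
The "most frequent words in queries containing a phrase" idea is easy to sketch. The log below is invented for illustration and is not the MSN log mentioned on the slide; `frequent_cowords` is a hypothetical helper.

```python
from collections import Counter


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens appears as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


def frequent_cowords(query_log, phrase, top_n=3):
    """Most frequent words in logged queries that contain the phrase,
    excluding the words of the phrase itself."""
    phrase_tokens = phrase.lower().split()
    counts = Counter()
    for query in query_log:
        tokens = query.lower().split()
        if contains_phrase(tokens, phrase_tokens):
            counts.update(t for t in tokens if t not in phrase_tokens)
    # Descending count, then alphabetical, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


# Invented query log.
query_log = [
    "tropical fish stores",
    "tropical fish pictures",
    "live tropical fish",
    "tropical fish stores near me",
    "fish recipes",
]
print(frequent_cowords(query_log, "tropical fish", top_n=2))
```

Matching on whole tokens rather than substrings avoids counting queries such as "tropical fishing" as containing the phrase.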

14 Query Expansion
- Search engines suggest expanded/alternative queries in response to a query Q
  - Using some form of thesaurus to perform global analysis
    - For each term t in Q, Q is expanded with synonyms and related words of t from the thesaurus
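
The per-term expansion step can be sketched as a dictionary lookup. The thesaurus below is a toy built from the "interest rate" example that appears later in the deck; `expand_query` is a hypothetical name, and a real thesaurus would be far larger.

```python
def expand_query(query_terms, thesaurus):
    """Append the thesaurus entries of every query term, keeping the
    original terms first and skipping duplicates."""
    expanded = list(query_terms)
    seen = set(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded


# Toy thesaurus: each term maps to its synonyms and related words.
thesaurus = {
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}
print(expand_query(["interest", "rate"], thesaurus))
```

Because each term is looked up in isolation, the expansion ignores the context supplied by the rest of the query, which is how an unhelpful query such as "interest rate fascinate evaluate" arises.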

15 Query Expansion
- Methods for building a thesaurus for query expansion
  1. Use of a controlled vocabulary maintained by human editors, such as the Library of Congress Subject Headings (LCSH), e.g., the LCSH of “American Revolutionary War” is United States -- History -- Revolution, 1775-1783
  2. An automatically derived thesaurus, constructed using word co-occurrence statistics over a collection of docs
  3. Query reformulations based on query log mining by exploring the manual query reformulations of other users to make suggestions to a user
- Thesaurus-based query expansion does not require any user input to increase recall

16 Query Expansion
- Automatic thesaurus generation using word co-occurrence
  - A simple approach is based on term-term similarities
    - Start with a term-document matrix A, where each cell A_{t,d} is a weighted count w_{t,d} for term t & document d
    - Calculate C = AA^T, in which C_{u,v} is a similarity score between terms u and v; the larger the score, the more similar the terms
  - An example of a derived thesaurus with good/bad suggestions
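
The matrix product C = AA^T is small enough to write out directly. The sketch below uses plain Python lists and raw counts as the weights w_{t,d}; the terms, the matrix values, and the helper names are made up for illustration.

```python
def term_term_similarity(a):
    """Return C = A A^T for a term-document matrix given as a list of rows
    (one row per term, one column per document)."""
    return [[sum(x * y for x, y in zip(row_u, row_v)) for row_v in a] for row_u in a]


def most_similar(term, terms, c):
    """Rank the other terms by their similarity score to the given term."""
    u = terms.index(term)
    others = [(terms[v], c[u][v]) for v in range(len(terms)) if v != u]
    return sorted(others, key=lambda kv: (-kv[1], kv[0]))


terms = ["tropical", "fish", "tank", "armor"]
A = [
    [1, 1, 0],  # tropical
    [2, 1, 0],  # fish
    [1, 0, 1],  # tank
    [0, 0, 1],  # armor
]
C = term_term_similarity(A)
print(most_similar("tank", terms, C))
```

In this toy matrix "tank" scores above zero with both "fish" and "armor", a small instance of the ambiguity problem in automatically generated thesauri.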

17 Query Expansion
- The quality of term association is typically a problem in an automatically generated thesaurus
  - Term ambiguity easily introduces irrelevant, statistically correlated terms, e.g., “Apple” can be expanded to “Apple red fruit computer”
    - Such thesauri suffer from false positives (FP) and false negatives (FN)
  - High cost to manually produce and update a thesaurus
  - Query expansion often increases recall, but may also significantly decrease precision, especially when the query contains ambiguous terms, e.g., interest rate → interest rate fascinate evaluate is unlikely to be useful

