Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.

1 Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University

2 SIGMOD 2008 Snippets in Text Search
Snippets are widely used in text search engines to help users quickly identify relevant query results.

3 Fragment of an XML Search Result
Query: find the apparel retailers in Texas. Keyword search: Texas, apparel, retailer.
[Figure: XML tree of one query result. The retailer (name: Brook Brothers, product: apparel) has store 1 (state: Texas, city: Houston, name: Galleria) and store 2 (state: Texas, city: Austin, name: West Village). Each store has a merchandises node containing clothes nodes (clothes 1 to 5) with fitting, situation, and category attributes.]
There can be many large search results. Good snippets help users judge relevance quickly and easily.

4 A Sample Snippet
[Figure: snippet tree. The retailer (name: Brook Brothers, product: apparel) with one store (state: Texas, city: Houston) whose merchandises include one clothes node (fitting: men) and another (situation: casual, category: outwear).]
From the snippet, we know:
- The corresponding query result contains matches to all keywords.
- The retailer is "Brook Brothers".
- This retailer has many stores in Houston.
- The clothes featured by this retailer.
It helps us differentiate this query result from other apparel retailers (e.g. Carter's).
How to generate good snippets for XML search? No existing work on XML snippet generation yet.

5 Challenges and Our Contributions
- What are desirable properties of a good snippet? Identified three properties: self-contained, distinguishable, representative.
- What information in the query result is significant in order to achieve the properties? Designed an algorithm to generate a ranked list of significant information: IList.
- How to generate a snippet that maximally covers the significant information within a size bound? Proved the NP-hardness of this problem. Designed an efficient and effective algorithm for snippet generation.
eXtract: the first system for snippet generation in XML search.

6 Roadmap
- Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
- Constructing an information list (IList): a ranked list of the significant information in the query result, chosen to achieve those properties
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

7 Self-contained Snippet
Snippets should be self-contained in order to be understandable.
- Text search: snippets usually preserve self-contained semantic units, i.e. the phrases or sentences surrounding keyword matches.
- XML search: semantic units should be preserved.
Challenge: What is a semantic unit?

8 Query Result Fragment (revisited)
Data contain entities and attributes. A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.
[Figure: fragment of the query result tree: the retailer (name: Brook Brothers, product: apparel), store 1 (state: Texas, city: Houston, name: Galleria), and clothes 1 to 3 with their fitting, situation, and category attributes.]
Adding keywords and their corresponding entity names to IList.
IList: Texas, apparel, retailer, store

9 Distinguishable Snippet
Snippets should be distinguishable, so that users can easily differentiate query results.
- Text search: the title of the document is included.
- XML search: the "key" of the result should be included.
Challenge: What is the key of an XML search result?

10 Query Result Fragment
We identify two types of entities in a query result: return entities and support entities.
Inferring return entities: an entity whose name or attribute name matches a keyword; otherwise, the highest entity.
We can mine keys of entities. Key of a query result: the keys of its return entities.
[Figure: fragment of the query result tree with the return entity and the support entities labeled.]
Adding the key of the query result to IList.
IList: Texas, apparel, retailer, store, Brook Brothers
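The return-entity rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` class, the `depth` field, and case-insensitive exact matching are hypothetical choices, not details given in the talk, and the entity list is a hand-written reading of the example tree.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str    # entity name, e.g. "store"
    depth: int   # distance from the root of the query result (assumed field)
    attribute_names: list = field(default_factory=list)


def infer_return_entities(entities, keywords):
    """Entities whose name or attribute name matches a keyword;
    if there is none, fall back to the highest entity or entities."""
    kws = {k.lower() for k in keywords}
    matched = [
        e for e in entities
        if e.name.lower() in kws
        or any(a.lower() in kws for a in e.attribute_names)
    ]
    if matched:
        return matched
    top = min(e.depth for e in entities)
    return [e for e in entities if e.depth == top]


# Hypothetical encoding of the entities in the example query result.
entities = [
    Entity("retailer", 0, ["name", "product"]),
    Entity("store", 1, ["state", "city", "name"]),
    Entity("clothes", 3, ["fitting", "situation", "category"]),
]

by_name = infer_return_entities(entities, ["Texas", "apparel", "retailer"])
by_attribute = infer_return_entities(entities, ["city", "Houston"])
fallback = infer_return_entities(entities, ["Texas", "Houston"])
```

For the running query (Texas, apparel, retailer) the keyword "retailer" matches an entity name, so the retailer is the return entity, which is consistent with its key, Brook Brothers, being added to IList.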

11 Representative Snippet
Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results.
- Text search: an active research area; sometimes the first and/or last sentence of a paragraph is used as a summary.
- XML search: include "dominant features" of query results.
Challenges: What are features? What are dominant features?

12 Features of Query Result
We define a feature as (entity, attribute, value). A feature type is an (entity, attribute) pair.
[Figure: fragment of the query result tree.]
Some feature statistics (value: # of occurrences):
entity   attribute  values
store    city       Houston: 6, Austin: 1, other values (3): 3
clothes  fitting    Men: 600, Women: 360, Children: 40
clothes  situation  Casual: 700, Formal: 300
clothes  category   Outwear: 220, Suit: 120, Skirt: 80, Sweaters: 70, other values (7): 510

13 Dominant Features of Query Result
[Table: the same feature statistics as on slide 12. Figure: fragment of the query result tree.]
A feature that occurs often is likely to be dominant. But this is not always reliable.
Dominance score = (# of occurrences of a feature) / (average # of occurrences of features of the same type)
Dominant features: features with dominance score ≥ 1
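As a worked example of the dominance score, the sketch below applies the definition to the feature statistics shown on the slides. It assumes that an entry such as "other values (3): 3" means three further distinct values with three occurrences in total, so the average for a feature type is the total count divided by the number of distinct values; the `STATS` layout and the function names are hypothetical.

```python
# (entity, attribute) -> (named value counts,
#                         number of other distinct values,
#                         total occurrences of those other values)
STATS = {
    ("store", "city"): ({"Houston": 6, "Austin": 1}, 3, 3),
    ("clothes", "fitting"): ({"Men": 600, "Women": 360, "Children": 40}, 0, 0),
    ("clothes", "situation"): ({"Casual": 700, "Formal": 300}, 0, 0),
    ("clothes", "category"): (
        {"Outwear": 220, "Suit": 120, "Skirt": 80, "Sweaters": 70}, 7, 510),
}


def dominance_scores(stats):
    """Score = occurrences of a feature / average occurrences in its type."""
    scores = {}
    for (entity, attribute), (named, other_values, other_total) in stats.items():
        total = sum(named.values()) + other_total
        distinct = len(named) + other_values
        average = total / distinct
        for value, count in named.items():
            scores[(entity, attribute, value)] = count / average
    return scores


def dominant_features(stats):
    """Features with score >= 1, highest score first."""
    ranked = sorted(dominance_scores(stats).items(),
                    key=lambda kv: kv[1], reverse=True)
    return [(feature, score) for feature, score in ranked if score >= 1]


scores = dominance_scores(STATS)
dominant = dominant_features(STATS)
```

Under this reading, Houston scores 6 / (10 / 5) = 3, and the four highest-scoring features are Houston, Outwear, Men, and Casual, in that order. Suit and Women also reach the threshold of 1 here, while Austin (score 0.5) does not.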

14 Representative Snippet
Adding dominant features to IList in the order of their dominance scores.
[Table: the same feature statistics as on slide 12. Figure: fragment of the query result tree.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

15 Roadmap
- Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
- Constructing an information list (IList): a ranked list of the significant information in the query result, chosen to achieve those properties
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

16 Roadmap
- Identifying desirable properties of a good snippet
  - Self-contained: related entity names
  - Distinguishable: key of the query result (return entities)
  - Representative: dominant features
  These three kinds of items make up the IList.
- Constructing an information list (IList): a ranked list of the significant information in the query result, chosen to achieve those properties
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

18 Instance Selection Problem
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Input: query result R, IList, a snippet size bound B
Output: snippet S
Instance Selection Problem: how to select node instances in R that cover as many IList items as possible, in ranked order, to form S within the bound B?

19 Instance Selection Problem
[Figure: the full query result tree with two alternative selections of node instances, one labeled Good and one labeled Bad.]
Input: query result R, IList, a snippet size bound B
Output: snippet S
Instance Selection Problem: how to select node instances in R that cover as many IList items as possible, in ranked order, to form S within the bound B?
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

20 Instance Selection Problem
Challenges:
- The cost of covering an IList item is dynamic.
- The number of IList items that can be covered is unknown until the very end.
The Instance Selection Problem is NP-hard. We designed an efficient and effective greedy algorithm to tackle this problem.

21 Instance Selection Algorithm
[Figure: the full query result tree.]
IList items and their weights:
Texas: 1, apparel: 1, retailer: 1, store: 1/2, Brook Brothers: 1/4, Houston: 1/8, outwear: 1/16, men: 1/32, casual: 1/64
Path-based instance selection:
- Coverage: the entities on the path and their attributes
- Benefit: the total weight of the IList items covered
- Cost: the path length
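To make benefit and cost concrete, here is a small Python sketch that scores two candidate paths using the weights listed on this slide. The coverage sets are a hand-written reading of the example tree, and measuring path length in edges between entity nodes is an assumption; the function names are hypothetical.

```python
from fractions import Fraction

ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
# Weights as listed on the slide, kept exact with Fraction.
WEIGHT = dict(zip(ILIST, [
    Fraction(1), Fraction(1), Fraction(1), Fraction(1, 2), Fraction(1, 4),
    Fraction(1, 8), Fraction(1, 16), Fraction(1, 32), Fraction(1, 64),
]))


def benefit(covered_items):
    """Total weight of the IList items a path covers."""
    return sum((WEIGHT[i] for i in covered_items if i in WEIGHT), Fraction(0))


def benefit_cost_ratio(covered_items, path_length):
    """Benefit divided by cost, where cost is the path length."""
    return benefit(covered_items) / path_length


# Path retailer -> store 1 (assumed length 1): covers both entities
# and their attributes.
store_path = {"retailer", "apparel", "Brook Brothers",
              "store", "Texas", "Houston"}
# Path retailer -> store 1 -> merchandises 1 -> clothes 1 (assumed length 3):
# additionally covers the attributes of clothes 1.
clothes_path = store_path | {"men", "casual"}

store_ratio = benefit_cost_ratio(store_path, 1)      # 31/8
clothes_ratio = benefit_cost_ratio(clothes_path, 3)  # 251/192
```

The longer path covers more IList items, but its extra benefit (1/32 + 1/64) is small compared with its extra cost, so its ratio is lower.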

22 Instance Selection Algorithm
[Figure: the full query result tree.]
1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.
2. Update the benefits and costs of the other paths.
3. Go to step 1 until the size bound is reached or the whole IList is covered.
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
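The three steps above can be sketched as a short greedy loop. This is a toy model rather than the eXtract implementation: candidate paths and their coverage are written out by hand (`PATHS`, `Path`, and `edges` are hypothetical), the size bound counts edges between entity nodes only, the weights are those shown on the previous slide, and an item whose paths do not fit in the remaining budget is simply skipped, which is an assumption. Step 2 is realized by recomputing each path's benefit and cost against what the snippet already contains.

```python
from dataclasses import dataclass

ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
WEIGHTS = dict(zip(ILIST, [1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64]))


@dataclass(frozen=True)
class Path:
    name: str
    edges: frozenset   # edges from the root down to the selected entity
    covers: frozenset  # IList items covered by the entities on the path


def edges(*nodes):
    return frozenset(zip(nodes, nodes[1:]))


RETAILER = {"retailer", "apparel", "Brook Brothers"}
STORE1 = RETAILER | {"store", "Texas", "Houston"}
STORE2 = RETAILER | {"store", "Texas"}
PATHS = [  # hypothetical toy data loosely based on the example tree
    Path("store1", edges("retailer", "store1"), frozenset(STORE1)),
    Path("store2", edges("retailer", "store2"), frozenset(STORE2)),
    Path("clothes1", edges("retailer", "store1", "merch1", "clothes1"),
         frozenset(STORE1 | {"men", "casual"})),
    Path("clothes3", edges("retailer", "store1", "merch1", "clothes3"),
         frozenset(STORE1 | {"outwear", "casual"})),
    Path("clothes5", edges("retailer", "store2", "merch2", "clothes5"),
         frozenset(STORE2 | {"outwear"})),
]


def greedy_select(ilist, weights, paths, bound):
    snippet_edges, covered, chosen = set(), set(), []
    for item in ilist:                       # step 1: ranked order
        if item in covered:
            continue
        best, best_ratio = None, -1.0
        for p in paths:
            if item not in p.covers:
                continue
            cost = len(p.edges - snippet_edges)   # only new edges cost
            if len(snippet_edges) + cost > bound:
                continue                          # would exceed the bound
            gain = sum(weights[i] for i in p.covers - covered)
            ratio = gain / cost if cost else float("inf")
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            continue                         # assumption: skip this item
        snippet_edges |= best.edges          # step 2: later costs and
        covered |= best.covers               # benefits see these updates
        chosen.append(best.name)
        if len(snippet_edges) >= bound:      # step 3: bound reached
            break
    return chosen, covered, len(snippet_edges)


chosen4, covered4, size4 = greedy_select(ILIST, WEIGHTS, PATHS, bound=4)
chosen3, covered3, size3 = greedy_select(ILIST, WEIGHTS, PATHS, bound=3)
```

With a bound of 4 edges the toy run picks `store1`, then `clothes3`, then `clothes1`, covering the whole IList; note that `clothes1` costs only one new edge once `store1` and `merch1` are already in the snippet, which is the "dynamic cost" challenge from slide 20. With a bound of 3 the loop stops before "men" is covered.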

24 Final Snippet
[Figure: the full query result tree with the node instances selected for the final snippet highlighted.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

25 Roadmap
- Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
- Constructing an information list (IList): a ranked list of the significant information in the query result, chosen to achieve those properties
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

26 Experimental Setup
Comparing the performance of:
- Greedy algorithm for instance selection (eXtract)
- Optimal (but exponential) algorithm for instance selection
- Google Desktop
Measurements: search quality, speed, scalability
Data sets: Films, Retailer
Query sets: eight queries for each data set

27 Search Quality: User Study
Ten users were asked to score the snippets generated by the three approaches on the same query results.
- The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.
- The Greedy algorithm (eXtract) scores close to the Optimal algorithm.

28 Search Quality: Precision & Recall
The ground truth for the snippets was obtained through another user study.
The snippets generated by the Greedy algorithm in eXtract have precision and recall close to those of the Optimal algorithm.
[Figure: two charts, Precision and Recall.]

29 Speed
[Figure: two charts, Film Data Set and Retailer Data Set.]
The Greedy algorithm is much faster than the Optimal algorithm.

30 Scalability
[Figure: two charts, Scalability on Snippet Size (number of edges) and Scalability on Query Result Size (KB).]
The Greedy algorithm scales much better than the Optimal algorithm.

31 Conclusions
- The first work that generates result snippets for keyword search on XML data
- Identified the desirable properties for snippets: self-contained, distinguishable, representative
- Designed an algorithm to generate IList as a ranked list of significant items to be included in snippets
- Proved that the instance selection problem is NP-hard
- Designed an efficient algorithm to cover IList in building a snippet within a size bound
- Experiments verified the effectiveness and efficiency

32 Thank You! Questions?
You are welcome to visit the eXtract demo at VLDB 2008: http://eXtract.asu.edu/

33 Architecture of eXtract
[Figure: architecture diagram. Components: Index Builder, XML Index, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Data labels: Query & Result; IList, Query Result; Result Snippet.]

34 Snippets Comparison
[Figure: the snippet produced by eXtract (the tree from slide 4: retailer Brook Brothers, apparel, a Texas store in Houston, clothes for men, casual outwear) next to the snippet produced by Google Desktop.]

