
1 Flexible and Efficient XML Search with Complex Full-Text Predicates
Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
Emiran Curtmola - University of California San Diego
Alin Deutsch - University of California San Diego

2 Introduction (SIGMOD, June 2006)
Need for complex full-text predicates beyond simple keyword search
- Library of Congress (LoC)
- Biomedical data
- ACM, IEEE publications
- INEX data collection
- Wikipedia XML data set

3 XML real fragment from LoC
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
[Figure: XML tree of the document fragment; only its text and element names survive in this transcript]
Text: Congress on education and workforce, comments to appropriate services. 109th Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson on May 2, 2004 Joe Jefferson introduced the following bill. The bill was reintroduced later and was referred to the committee on education and workforce sponsored by Joe Jefferson House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson Jefferson and services … HR2739
Element names: committee-name, action-desc, bill, congress-info, nbrsponsors, action, legis-session, legis, legis-body, legis-desc

4 Query with complex FT predicates
Document fragments (nodes) that contain the keywords “Jefferson” and “education” and satisfy the predicates
- within a window of 10 words,
- with “Jefferson” ordered before “education”
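The two predicates on this slide can be read as tests on a single match, that is, one word position per keyword. The following is only a sketch under assumptions that the transcript does not state: the function names are hypothetical, a window of n words is taken to mean that the inclusive span from the first to the last position is at most n, and the sample positions are borrowed from the matching tables shown later in the talk.

```python
from typing import Tuple

# One word position per keyword of the pattern, e.g. ("Jefferson", "education").
Match = Tuple[int, ...]


def ordered(match: Match) -> bool:
    """True if the positions appear in the same order as the pattern's keywords."""
    return all(a < b for a, b in zip(match, match[1:]))


def within_window(match: Match, size: int) -> bool:
    """True if all positions fit in a span of at most `size` words.

    Assumption: the span is counted inclusively (last - first + 1).
    """
    return max(match) - min(match) + 1 <= size


# (Jefferson position, education position) pairs from the talk's example.
for m in [(28, 45), (51, 45)]:
    print(m, ordered(m), within_window(m, 10))
```

Under these assumptions (28, 45) is ordered but spans 18 words, while (51, 45) fits a 10-word window but has “education” before “Jefferson”.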

5 Example: LoC document
[Figure: LoC document fragment, as on slide 3]

6 Example: LoC document
[Figure: LoC document fragment, as on slide 3]
Return document fragments
Naive solution: test the query at each node → redundant
Need for efficient evaluation of full-text predicates
- use structural relationship between nodes
- avoid redundant computation

7 Existing languages
Many XML full-text search languages
- expressive power, semantics, scores [BAS-06]
XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery
Efficient query evaluation limited to
- Conjunctive keyword search (no predicates)
- Full-text predicates in isolation
Need for a universal optimization framework
- Guarantee the universality of the solution

8 Contributions
Formal semantics for XQFT-class
- Unified framework
- Capture family of tf*idf scoring methods
Structure-aware algorithms to efficiently evaluate XQFT-class languages
- XFT full-text algebra
- Enable new optimizations inspired by relational rewritings

9 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion

10 Formalization: design goals
Capture existing full-text languages
Language semantics in terms of
- keyword patterns
- pattern matches
- predicates evaluated through matches
Manipulate tuples
- enable relational query evaluation and rewritings

11 Formalization: patterns
Pattern = tuple of simultaneously matching keywords
Query expression: “Jefferson” and “education”
- within a window of 10 words,
- with “Jefferson” ordered before “education”
Pattern: (“Jefferson”, “education”)

12 Formalization: patterns
Formalization specifies
- patterns ← conjunction of keywords
- set of patterns ← disjunction of keywords
- exclusion patterns ← negation of keywords
No matches in the document

13 Formalization: matches
[Figure: LoC document fragment, as on slide 3]
Matches of (“Jefferson”, “education”): (22, 3)

14 Formalization: matches
[Figure: LoC document fragment, as on slide 3]
Matches of (“Jefferson”, “education”): (22, 3) (22, 45)

15 Formalization: matches
[Figure: LoC document fragment, as on slide 3]
Matches of (“Jefferson”, “education”): (22, 3) (22, 45) (22, 67)

16 Formalization: matches
[Figure: LoC document fragment, as on slide 3]
Matches of (“Jefferson”, “education”): (22, 3) (22, 45) (22, 67) (51, 3) …
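The matches listed on slides 13 to 16 are consistent with pairing every position of “Jefferson” with every position of “education”. A minimal sketch of that reading follows; the `matches` helper is a hypothetical name, and the position lists are the ones the talk later gives for the root node.

```python
from itertools import product

# Word positions of each keyword in the whole document (root node).
positions = {
    "Jefferson": [22, 28, 51, 54, 72],
    "education": [3, 45, 67],
}


def matches(pattern, positions):
    """All combinations of one position per keyword of the pattern."""
    return list(product(*(positions[keyword] for keyword in pattern)))


all_matches = matches(("Jefferson", "education"), positions)
print(len(all_matches))
print(all_matches[:3])
```

This enumerates 5 × 3 = 15 matches before any predicate is applied, which is the redundancy the later evaluation slides aim to reduce.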

17 Formalization: matching tables
Matching table represents
- Nested relation
- Each node in the document
- Each pattern in the query
- Set of matches

18 Formalization: matching tables
[Figure: LoC document fragment, as on slide 3]
Node     Pattern                    Matches
action   “Jefferson”, “education”   (28, 45) (51, 45)
…        …                          …

19 XFT Algebra
Similar to relational algebra
- Manipulate matching tables
- Leverage relational query evaluation + optimization techniques
XFT operators
- construct matching table R_k for each keyword k: get(k)
- manipulate matching tables: R_1 or R_2, R_1 and R_2, R_1 minus R_2, σ_times(R), σ_ordered(R), σ_window(R), σ_distance(R)

20 XFT Algebra
Query: Nodes that contain the keywords “Jefferson” and “education”
- within a window of 10 words,
- with “Jefferson” ordered before “education”
[Figure: algebra plan for the query; only the symbol × survives in this transcript]
Benefit: equivalent query rewritings

21 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion

22 Query evaluation: AllNodes
Straightforward implementation of the XFT algebra
Each node is considered separately
- Each tuple is self-contained
Relational-style evaluation
- Joins → equi-joins
- Predicates → selections on set of matches

23 Example: LoC document
[Figure: LoC document fragment, as on slide 3, with nodes labeled by identifiers]
Node identifiers: 1, 1.1, 1.2, 1.3, 1.1.1, 1.1.2, 1.1.3, 1.2.2, 1.2.2.2, 1.3.1, 1.3.2, 1.3.1.2

24 Inputs to the join (×)
Node      Pattern       Matches
1         “Jefferson”   22, 28, 51, 54, 72
1.1       “Jefferson”   22
1.1.3     “Jefferson”   22
1.2       “Jefferson”   28, 51
1.2.2     “Jefferson”   51
1.2.2.2   “Jefferson”   51
1.3       “Jefferson”   54, 72
1.3.1     “Jefferson”   54
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

Node      Pattern       Matches
1         “education”   3, 45, 67
1.1       “education”   3
1.1.1     “education”   3
1.2       “education”   45
1.2.2     “education”   45
1.2.2.2   “education”   45
1.3       “education”   67
1.3.2     “education”   67

25 Result of the join (×)
[Input tables for “Jefferson” and “education” as on slide 24]
Node      Pattern                    Matches
1         “Jefferson”, “education”   (22, 45), (72, 67) …
1.1       “Jefferson”, “education”   (22, 3)
1.2       “Jefferson”, “education”   (28, 45), (51, 45)
1.2.2     “Jefferson”, “education”   (51, 45)
1.2.2.2   “Jefferson”, “education”   (51, 45)
1.3       “Jefferson”, “education”   (54, 67), (72, 67)
1.3.2     “Jefferson”, “education”   (72, 67)

26 [Input tables and join result as on slide 25]
Predicate operates one tuple at a time
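The AllNodes evaluation on slides 22 to 26 can be sketched as an equi-join on the node column followed by a selection that filters each tuple's set of matches on its own. This is an illustrative reading, not the authors' implementation: the dictionary layout and the names `and_join`, `select` and `ordered` are assumptions, and dropping nodes whose match set becomes empty is likewise assumed. The data is taken from the tables on slide 24.

```python
from itertools import product

# Matching tables from slide 24: node identifier -> keyword positions.
jefferson = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}
education = {
    "1": [3, 45, 67], "1.1": [3], "1.1.1": [3],
    "1.2": [45], "1.2.2": [45], "1.2.2.2": [45],
    "1.3": [67], "1.3.2": [67],
}


def as_tuples(table):
    """Turn single positions into 1-tuples so matches can be concatenated."""
    return {node: [(p,) for p in ps] for node, ps in table.items()}


def and_join(r1, r2):
    """Equi-join on node; matches of a shared node are combined pairwise."""
    return {
        node: [m1 + m2 for m1, m2 in product(r1[node], r2[node])]
        for node in r1
        if node in r2
    }


def select(table, predicate):
    """Filter each tuple's matches independently; drop nodes left with none."""
    out = {}
    for node, ms in table.items():
        kept = [m for m in ms if predicate(m)]
        if kept:
            out[node] = kept
    return out


def ordered(match):
    """True if positions follow the keyword order of the pattern."""
    return all(a < b for a, b in zip(match, match[1:]))


joined = and_join(as_tuples(jefferson), as_tuples(education))
print(sorted(joined))
print(joined["1.2"])
print(select(joined, ordered))
```

The join yields the seven nodes of slide 25. Every tuple carries all of its own matches, so the selection never needs to look at another node.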

27 Example: LoC document
[Figure: LoC document fragment with node identifiers, as on slide 23]

28 Query evaluation: SCU
AllNodes = straightforward algorithm
Reduce size of intermediate results
- structural relationships between nodes
- avoid redundant match representation
SCU = Smallest Containing Unit

29 Matching tables → SCU tables
→ captures same information

SCU table:
Node      Pattern       Matches
1.1.3     “Jefferson”   22
1.2.2.2   “Jefferson”   51
1.2       “Jefferson”   28
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

Matching table:
Node      Pattern       Matches
1         “Jefferson”   22, 28, 51, 54, 72
1.1       “Jefferson”   22
1.1.3     “Jefferson”   22
1.2       “Jefferson”   28, 51
1.2.2     “Jefferson”   51
1.2.2.2   “Jefferson”   51
1.3       “Jefferson”   54, 72
1.3.1     “Jefferson”   54
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72
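One way to see why the SCU table captures the same information: every position stored at a node also belongs to each of that node's ancestors, so the full matching table can be rebuilt by propagating positions upward. The sketch below assumes the node identifiers are Dewey-style paths whose prefixes name the ancestors; `ancestors_or_self` and `expand_scu` are hypothetical helper names.

```python
def ancestors_or_self(dewey):
    """All prefixes of a dotted identifier, e.g. "1.1.3" -> ["1", "1.1", "1.1.3"]."""
    parts = dewey.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def expand_scu(scu_table):
    """Rebuild the per-node matching table from an SCU table."""
    full = {}
    for node, positions in scu_table.items():
        for anc in ancestors_or_self(node):
            full.setdefault(anc, set()).update(positions)
    return {node: sorted(ps) for node, ps in full.items()}


# SCU table for "Jefferson" from the slide.
scu_jefferson = {
    "1.1.3": [22], "1.2.2.2": [51], "1.2": [28], "1.3.1.2": [54], "1.3.2": [72],
}

full = expand_scu(scu_jefferson)
for node in sorted(full):
    print(node, full[node])
```

Five SCU rows expand into the ten rows of the matching table, which is the size reduction the SCU representation is after.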

30 SCU tables as inputs to the join (×)
Node      Pattern       Matches
1.1.3     “Jefferson”   22
1.2.2.2   “Jefferson”   51
1.2       “Jefferson”   28
1.3.1.2   “Jefferson”   54
1.3.2     “Jefferson”   72

Node      Pattern       Matches
1.1.1     “education”   3
1.2.2.2   “education”   45
1.3.2     “education”   67

31 Equi-join does not work
[SCU tables for “Jefferson” and “education” as on slide 30]
Node      Pattern                    Matches
1.2.2.2   “Jefferson”, “education”   (51, 45)
1.3.2     “Jefferson”, “education”   (72, 67)
Need to compute LCA

32 1.1 is the LCA of 1.1.3 and 1.1.1
[SCU tables for “Jefferson” and “education” as on slide 30]
Node      Pattern                    Matches
1.1       “Jefferson”, “education”   (22, 3)
1.2.2.2   “Jefferson”, “education”   (51, 45)
1.2       “Jefferson”, “education”   (28, 45)
1.3.2     “Jefferson”, “education”   (72, 67)
1.3       “Jefferson”, “education”   (54, 67)
1         “Jefferson”, “education”   (22, 45) …
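The LCA step can be illustrated with a naive join over the two SCU tables: each pair of tuples contributes its combined match to the lowest common ancestor of the two nodes. This sketch assumes Dewey-style identifiers, so the LCA is the longest common prefix, and it compares all pairs. It is not the stack-based single-scan algorithm the talk refers to, and the names `lca` and `scu_join` are hypothetical.

```python
from collections import defaultdict


def lca(a, b):
    """Lowest common ancestor of two dotted identifiers (longest common prefix)."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


def scu_join(r1, r2):
    """Pairwise join of two SCU tables; each match is stored at the pair's LCA."""
    out = defaultdict(list)
    for n1, ps1 in r1.items():
        for n2, ps2 in r2.items():
            node = lca(n1, n2)
            for p1 in ps1:
                for p2 in ps2:
                    out[node].append((p1, p2))
    return dict(out)


# SCU tables from slide 30.
scu_jefferson = {
    "1.1.3": [22], "1.2.2.2": [51], "1.2": [28], "1.3.1.2": [54], "1.3.2": [72],
}
scu_education = {"1.1.1": [3], "1.2.2.2": [45], "1.3.2": [67]}

result = scu_join(scu_jefferson, scu_education)
for node in sorted(result):
    print(node, result[node])
```

On the slide's data this produces the same six nodes as the result table, with (22, 3) stored at 1.1 and the cross-subtree matches collected at the root.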

33 [SCU tables and their join result as on slide 32]
Further tables on the slide, in transcript order:

Node      Pattern                    Matches
1.2       “Jefferson”, “education”   (28, 45)
1.3       “Jefferson”, “education”   (54, 67)
1         “Jefferson”, “education”   (22, 45) …

Node      Pattern                    Matches
EMPTY !!!

34 [SCU tables and their join result (×) as on slide 32]

35 [SCU tables and their join result (×) as on slide 32]
Additional table on the slide:
Node   Pattern                    Matches
1.3    “Jefferson”, “education”   (54, 67)
1      “Jefferson”, “education”   (22, 45) …

36 [SCU tables and their join result (×) as on slide 32]
Additional table on the slide:
Node   Pattern                    Matches
1.3    “Jefferson”, “education”   (54, 67) (72, 67)
1      “Jefferson”, “education”   (22, 45) …
Postorder Stack supports single scan

37 SCU summary
Equivalent to AllNodes
Structure-awareness reduces size of intermediate results
Increase computation cost
- Compute LCAs of nodes
- Match propagation
Stack-based techniques

38 Related work on LCA for XML
LCA for conjunctive keyword search
- XRank [GSBS-03]
- Schema-free XQuery [LYJ-04]
- XKSearch [XP-05]
Shortcomings
- No postprocessing, not compositional
  Input in document order
  Output postorder traversal
- Support for complex predicates is not straightforward

39 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion

40 Experimental goals
AllNodes vs. SCU
- AllNodes: redundant representation
- SCU: smaller sizes, more computation
SCU Overhead
- Stack
- Match propagation
Benefit of Rewritings
- Relational-style rewritings

41 Experimental setup
Centrino 1.8GHz with 1GB of RAM
XMark generated datasets
- Size ranges from 50 MB to 300 MB

42 Experiments: AllNodes vs. SCU
Varying document size (q1 - query without predicates)
q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

43 Experiments: SCU Overhead
Queries
- q4 = σ_window>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
- q5 = σ_window>90000000(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
Recall that
- q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

44 Experiments: SCU Overhead
Varying query predicates (not pushed)
- q4 always true → no match propagation, just the stack overhead
- q5 always false → propagate all matches

45 Experiments: Benefit of Rewritings
Queries
- q2 = σ_orderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1)
- q3 = push selections in q2
Recall that
- q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
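The transcript does not show the rewritten plan for q3. One plausible reading of "push selections" is that the ordering test is also applied to partial joins, so out-of-order prefixes are discarded before further keywords are joined in. The sketch below checks that equivalence on a small, entirely hypothetical example with three keywords on a single node; the table layout and function names are assumptions as well.

```python
from itertools import product


def and_join(r1, r2):
    """Join two matching tables on node and concatenate their matches pairwise."""
    return {
        node: [m1 + m2 for m1, m2 in product(r1[node], r2[node])]
        for node in r1
        if node in r2
    }


def select_ordered(table):
    """Keep only matches whose positions are strictly increasing."""
    out = {}
    for node, ms in table.items():
        kept = [m for m in ms if all(a < b for a, b in zip(m, m[1:]))]
        if kept:
            out[node] = kept
    return out


# Hypothetical single-node tables for three keywords a, b, c.
ra = {"1": [(5,), (40,)]}
rb = {"1": [(10,), (20,)]}
rc = {"1": [(15,), (50,)]}

# Selection applied once, on top of the full join.
unpushed_input = and_join(and_join(ra, rb), rc)
unpushed = select_ordered(unpushed_input)

# Selection also applied to the partial join, before c is joined in.
pushed_input = and_join(select_ordered(and_join(ra, rb)), rc)
pushed = select_ordered(pushed_input)

print(unpushed == pushed)
print(len(unpushed_input["1"]), len(pushed_input["1"]))
```

Both plans return the same matches, while the pushed plan feeds fewer intermediate matches into the final selection. This illustrates the kind of saving a relational-style rewriting can give, not the measured result of the talk.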

46 Experiments: Benefit of Rewritings
Varying document size (query with predicates)
40% improvement for relational-like query rewritings

47 Conclusion
A unified logical framework for XML full-text search languages
Algebra admits
- Efficient algorithms for operator evaluation
- Rewritings of queries into more efficient forms
- Facilitate XML joint optimizations of queries on both structure and text search
Future work
- Score-aware logical framework

48 Thank you!

