On the Memory Requirements of XPath Evaluation over XML Streams Ziv Bar-Yossef Marcus Fontoura Vanja Josifovski IBM Almaden Research Center.


1 On the Memory Requirements of XPath Evaluation over XML Streams. Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center)

2 Preliminaries: XML. Example document (tree): root (x0) > conference (x1). The conference has the children name (x2) = PODS, speaker (x3) with name (x4) = Josifovski and paper_cnt (x5) = 1, and speaker (x6) with name (x7) = Fagin and paper_cnt (x8) = 3.

3 Preliminaries: XPath 1.0. Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name. Query tree: root > conference [name = PODS] > speaker [paper_cnt > 1] > name. Document: the example tree with nodes x0 to x8 (conference PODS; speakers Josifovski with paper_cnt 1 and Fagin with paper_cnt 3). Result: { x7 }
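
As a rough illustration of the query on this slide, the sketch below evaluates /conference[name = PODS]/speaker[paper_cnt > 1]/name on the example document with Python's standard xml.etree.ElementTree. The XML serialization of the tree and the function name `evaluate` are assumptions made for this example, and the predicates are checked by hand because ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example document (nodes x1..x8);
# x0 is the document root above <conference>.
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

def evaluate(xml_text):
    """Return the text of every name node selected by the query."""
    conference = ET.fromstring(xml_text)
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference node.
    if not any(n.text == "PODS" for n in conference.findall("name")):
        return []
    selected = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker node.
        if any(int(p.text) > 1 for p in speaker.findall("paper_cnt")):
            selected.extend(n.text for n in speaker.findall("name"))
    return selected

print(evaluate(DOC))
```

Only the second speaker satisfies paper_cnt > 1, so the single selected node corresponds to x7 in the slide's numbering.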

4 XML Streams. XML stream: an XML document arriving as a one-way stream. Critical resources: memory and processing time. Why XML streams? For transferring XML between systems, and for efficient access to large XML documents.

5 Streaming XML Algorithms: XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]; X-scan [Ives, Levy, and Weld 00]; XMLTK [Avila-Campillo et al 02]; XTrie [Chan et al 02]; SPEX [Olteanu, Kiesling, and Bry 03]; Lazy DFAs [Green et al 03]; The XPush Machine [Gupta and Suciu 03]; XSQ [Peng and Chawathe 03]; TurboXPath [Josifovski, Fontoura, and Barta 04]; …

6 Our Results. Space lower bounds for evaluating XPath on XML streams. A streaming XML algorithm that matches the lower bounds on a large fragment of the language and uses space sub-linear in the query size rather than exponential in the query size.

7 Related Work. Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]. Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]. Space complexity of select-project-join queries over relational data streams [Arasu et al 02].

8 Data Complexity [Vardi 82]. Φ(Q, D): evaluation function of a query Q on a document D. Φ_Q(D): evaluation function of a fixed query Q on a document D. Data complexity of Q: complexity of the best algorithm for Φ_Q on the worst D. Worst-case data complexity: max over Q of the complexity of Φ_Q. We characterize the data complexity of Φ_Q separately for each Q (not just the worst-case one).

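9 XPath Fragment. 1. Queries are subsumption-free. Not subsumption-free: root > conference with the predicates name = PODS and name != SIGMOD (any name equal to PODS also differs from SIGMOD, so one predicate subsumes the other). Subsumption-free: root > conference with the single predicate name != SIGMOD.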

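10 XPath Fragment (cont.) 2. Queries are univariate. Not univariate: root > conference with the predicate paper_cnt < author_cnt (two nodes compared with each other). Univariate: root > conference with the predicates paper_cnt < 30 and author_cnt > 30 (each node compared with a constant).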

11 XPath Fragment (cont.) 3. Queries consist of conjunctions only 4. Queries are “star-restricted”

12 Query Frontier Size. Definitions: 1. Frontier at u: u, its siblings, and the siblings of its ancestors. 2. FrontierSize(Q): size of the largest frontier. Theorem 1: For all queries Q in the fragment, stream-space(Φ_Q) = Ω(FrontierSize(Q)). Example query tree: root > conference [name = PODS] > speaker [paper_cnt > 1] > name.
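
The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration that follows the wording of the slide literally (u, its siblings, and the siblings of its ancestors); the class `QNode` and the function `frontier_size` are hypothetical names, not part of the original work.

```python
from dataclasses import dataclass, field

@dataclass
class QNode:
    label: str
    children: list = field(default_factory=list)

def frontier_size(root):
    """Size of the largest frontier over all nodes of the query tree."""
    best = 1  # the frontier at the root contains only the root

    def visit(node, above):
        # `above` = number of siblings of `node` and of all its ancestors.
        nonlocal best
        k = len(node.children)
        if k:
            # Frontier at any child: the k children plus the `above` nodes.
            best = max(best, k + above)
            for child in node.children:
                visit(child, above + k - 1)

    visit(root, 0)
    return best

# Query tree of /conference[name = PODS]/speaker[paper_cnt > 1]/name
query = QNode("root", [
    QNode("conference", [
        QNode("name = PODS"),
        QNode("speaker", [QNode("name"), QNode("paper_cnt > 1")]),
    ]),
])
print(frontier_size(query))
```

For the example query the largest frontier is the one at the speaker's name node: that node, its sibling paper_cnt, and the conference name (the sibling of its ancestor speaker).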

13 Document Recursion Depth. Definition: recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q. Theorem 2: For all queries Q in the fragment that have at least one “//” node, stream-space(Φ_Q) = Ω(recDepth_Q(D)). Figure: query Q (root > //part with name and number children) and document D (nodes x0 to x7) with nested part elements; values shown: Compressor, 12, Refrigerator, 456.
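
A minimal sketch of the recursion-depth idea, under a strong simplification: for a query node such as //part, “path match” is reduced here to plain tag equality. The document below, including how the names and numbers are paired, is an assumed reading of the figure, and `rec_depth` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed nesting: a part that contains another part.
DOC = """
<part>
  <name>Refrigerator</name><number>456</number>
  <part><name>Compressor</name><number>12</number></part>
</part>
"""

def rec_depth(element, tag):
    """Largest number of elements named `tag` on any single root-to-leaf path."""
    here = 1 if element.tag == tag else 0
    below = max((rec_depth(child, tag) for child in element), default=0)
    return here + below

print(rec_depth(ET.fromstring(DOC), "part"))
```

Two part elements lie on the same root-to-leaf path here, so the document is recursive with respect to //part.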

14 Document Depth. Definition: depth(D): length of the longest root-to-leaf path. Theorem 3: For all queries Q in the fragment that have at least one “/” node, stream-space(Φ_Q) = Ω(log depth(D)). Figure: the document D of the recursion-depth example (nested part elements).
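
To give some intuition for the log depth(D) term, the sketch below tracks the current level of a streamed document with a single counter; storing that counter takes roughly log2(depth) bits. It is an illustration, not part of the proof. The helper `stream_depth` is a hypothetical name, depth is counted in elements along the path (an assumption about the definition), and the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

def stream_depth(xml_bytes):
    """Return the maximum nesting level seen while streaming the document."""
    level = 0
    deepest = 0
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            level += 1
            deepest = max(deepest, level)
        else:
            level -= 1
            elem.clear()  # drop content that has already been processed
    return deepest

doc = (b"<root><part><name>Refrigerator</name>"
       b"<part><name>Compressor</name></part></part></root>")
depth = stream_depth(doc)
print(depth, depth.bit_length())  # depth and the bits needed to store it
```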

15 New Algorithm. Theorem 4(a): For all queries Q in univariate XPath: Space: O(|Q| · recDepth(D) · log depth(D)). Time: O(|D| · |Q| · recDepth(D)). Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D: Space: O(FrontierSize(Q) · log depth(D)). Time: O(|D| · FrontierSize(Q)).

16 Proof of Theorem 1. Fragment: “subsumption-free”, “univariate”, conjunctions only, “star-restricted”. Theorem 1: For all queries Q in the fragment, stream-space(Φ_Q) = Ω(FrontierSize(Q)). Example query tree: root > conference [name = PODS] > speaker [paper_cnt > 1] > name.

17 Critical Document. Definition: Document D is critical for query Q if: (1) D matches Q. (2) If we remove any node from D, it no longer matches Q. Figure: query Q (root > conference [name = PODS] > speaker [paper_cnt > 1] > name) and the example document D (nodes x0 to x8).
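
The definition can be checked by brute force on small documents. The sketch below is an illustration only: the matcher is hard-coded for the example query, removing a node is taken to remove its whole subtree (an assumption), the root itself is never removed, and all helper names are hypothetical.

```python
import copy

def node(label, text=None, *children):
    return {"label": label, "text": text, "children": list(children)}

def kids(n, label):
    return [c for c in n["children"] if c["label"] == label]

def matches(root):
    """Hard-coded test for /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    for conf in kids(root, "conference"):
        if not any(n["text"] == "PODS" for n in kids(conf, "name")):
            continue
        for sp in kids(conf, "speaker"):
            many = any(int(c["text"]) > 1 for c in kids(sp, "paper_cnt"))
            if many and kids(sp, "name"):
                return True
    return False

def removals(root):
    """Yield copies of the document with one non-root node (and subtree) removed."""
    def paths(n, prefix=()):
        for i, c in enumerate(n["children"]):
            yield prefix + (i,)
            yield from paths(c, prefix + (i,))
    for path in list(paths(root)):
        doc = copy.deepcopy(root)
        parent = doc
        for i in path[:-1]:
            parent = parent["children"][i]
        del parent["children"][path[-1]]
        yield doc

def is_critical(root):
    return matches(root) and not any(matches(d) for d in removals(root))

fagin = node("speaker", None, node("name", "Fagin"), node("paper_cnt", "3"))
josifovski = node("speaker", None, node("name", "Josifovski"), node("paper_cnt", "1"))
small = node("root", None, node("conference", None, node("name", "PODS"), fagin))
full = node("root", None,
            node("conference", None, node("name", "PODS"), josifovski, fagin))
print(is_critical(small), is_critical(full))
```

The six-node document is critical, while the nine-node document is not: removing the speaker with paper_cnt = 1 still leaves a match.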

18 Main Lemmas. Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Φ_Q) = Ω(FrontierSize(D)). Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q). Together these give Theorem 1: For all queries Q in the fragment, stream-space(Φ_Q) = Ω(FrontierSize(Q)).

19 One-way Communication Complexity. Alice holds x and Bob holds y. Alice sends a single message m to Bob, who outputs f(x, y), where f: X × Y → Z. CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

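20 Reduction. The document D is split into a prefix α (held by Alice) and a suffix β (held by Bob). A: streaming algorithm for Φ_Q using space S. Alice runs A on α and sends state_A(α) to Bob; Bob resumes A from state_A(α) on β and outputs Φ_Q(D). Theorem: stream-space(Φ_Q) ≥ CC(Φ_Q).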
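
The reduction can be simulated directly: whatever a streaming algorithm remembers after the prefix is the only message Alice needs to send. The sketch below uses a toy matcher for /a[b and c] as the streaming algorithm; the class `MatchABC`, the event encoding, and the JSON message format are all assumptions made for this example, and the JSON length overstates the number of bits a careful encoding would need.

```python
import json

class MatchABC:
    """Toy streaming matcher for /a[b and c]; its state is four small fields."""

    def __init__(self, state=None):
        self.state = state or {"level": 0, "a": False, "b": False, "c": False}

    def feed(self, events):
        s = self.state
        for kind, tag in events:
            if kind == "start":
                s["level"] += 1
                if s["level"] == 1:
                    s["a"] = (tag == "a")  # is the document element an a?
            else:
                if s["level"] == 2 and s["a"] and tag in ("b", "c"):
                    s[tag] = True          # a child b or c of a has closed
                s["level"] -= 1
        return self

    def result(self):
        return self.state["b"] and self.state["c"]

def one_way_protocol(prefix, suffix):
    alice = MatchABC().feed(prefix)
    message = json.dumps(alice.state)            # the only thing Alice sends
    bob = MatchABC(json.loads(message)).feed(suffix)
    return bob.result(), 8 * len(message.encode())

events = [("start", "a"), ("start", "c"), ("end", "c"),
          ("start", "b"), ("end", "b"), ("end", "a")]
answer, bits = one_way_protocol(events[:3], events[3:])
print(answer, bits)
```

Bob's answer equals what the algorithm would output on the whole stream, so the space used by the algorithm bounds the message length from above, which is the direction the theorem exploits.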

21 Fooling Set Technique. Partitioned document D_{α,β}: a document prefix α followed by a document suffix β. Definition: A set T of partitioned documents is a fooling set for Φ_Q if: 1. All documents in T match Q. 2. For any two distinct documents D_{α,β}, D_{α',β'} in T, either D_{α,β'} does not match Q or D_{α',β} does not match Q. Theorem: For any fooling set T, CC(Φ_Q) = Ω(log |T|).
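
The two conditions of the definition translate into a short check. The sketch below is a toy model, not the construction used in the proof: a partitioned document is a (prefix, suffix) pair listing the child tags of a single a element, and `matches` stands in for “the document matches Q” with Q = /a[b and c]. All names are hypothetical.

```python
from itertools import combinations

def matches(prefix, suffix):
    """Toy match test for /a[b and c] over the child tags of a."""
    return {"b", "c"} <= set(prefix) | set(suffix)

def is_fooling_set(docs):
    # Condition 1: every document in the set matches Q.
    if not all(matches(p, s) for p, s in docs):
        return False
    # Condition 2: for every pair, at least one crossed document fails.
    for (p1, s1), (p2, s2) in combinations(docs, 2):
        if matches(p1, s2) and matches(p2, s1):
            return False
    return True

good = [(("b",), ("c",)), (("c",), ("b",))]
bad = [(("b", "c"), ()), (("b", "c"), ("b",))]
print(is_fooling_set(good), is_fooling_set(bad))
```

In `good`, crossing the first prefix with the second suffix yields children b, b, which no longer match. In `bad`, both crossed documents still match, so the set does not fool any protocol.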

22 Proof of Lemma 1. Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Φ_Q) = Ω(FS(D)), where FS(D) = FrontierSize(D). Example: query Q (root > conference [name = PODS] > speaker [paper_cnt > 1] > name) and critical document D: root (x0) > conference (x1), with children name (x2) = PODS and speaker (x3); the speaker has children name (x4) = Fagin and paper_cnt (x5) = 3.

23 Proof of Lemma 1. For each subset S of Frontier(D), define a partitioned document D_S. Example: S = { x2, x5 }, for the query Q and the critical document D of the previous slide (nodes x0 to x5).

24 Proof of Lemma 1 (cont.) Claim: { D_S : S is a subset of Frontier(D) } is a fooling set. Hence stream-space(Φ_Q) ≥ log(2^FS(D)) = FS(D). Proof of Claim: 1. For all S, D_S matches Q. 2. If S ≠ T, need: either D_ST or D_TS does not match Q.
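
The counting behind the bound can be replayed on the running example. The sketch below rests on explicit assumptions that go beyond the slide text: D_S is modeled as carrying the frontier nodes in S in its prefix and the remaining frontier nodes in its suffix, the non-frontier nodes x0, x1, x3 are taken to be present in every document, and a document is taken to match Q exactly when all three frontier nodes appear. The helper names are hypothetical.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})  # frontier of the critical document

def subsets(items):
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def prefix_of(S):
    return S                # assumed: frontier nodes in S arrive in the prefix

def suffix_of(S):
    return FRONTIER - S     # assumed: the other frontier nodes arrive later

def matches(prefix, suffix):
    # Abstraction: a match needs every frontier node somewhere in the document.
    return FRONTIER <= (prefix | suffix)

family = subsets(FRONTIER)

# 1. Every D_S matches Q.
all_match = all(matches(prefix_of(S), suffix_of(S)) for S in family)

# 2. For S != T, at least one of the crossed documents fails to match.
crossing_fails = all(
    not (matches(prefix_of(S), suffix_of(T)) and matches(prefix_of(T), suffix_of(S)))
    for S, T in combinations(family, 2)
)

print(all_match, crossing_fails, len(family), log2(len(family)))
```

With S = { x2, x5 } and T = { x4, x5 }, the prefix of D_T combined with the suffix of D_S lacks x2, which mirrors the “conference name missing” case in the example that follows.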

25 Proof of Claim (example). S = { x2, x5 }, T = { x4, x5 }. Documents D_S and D_T are built from the critical document (root x0, conference x1, name x2 = PODS, speaker x3, name x4 = Fagin, paper_cnt x5 = 3). The mixed document D_TS contains root (x0), conference (x1), speaker (x3), name = Fagin (x4, appearing twice) and paper_cnt = 3 (x5), but the conference name (x2) is missing, so D_TS does not match Q.

26 Algorithm. Uses the query as an NFA. Based on three global data structures: pointer array, validation array, and level array. Matches the lower bounds for a fragment of XPath.

27 Algorithm Example Run. Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3). Input XML: <a><c>1</c><b>1</b>...</a>. Data structures: level array, validation array, and pointer array with one entry. Entries are written as (query node, flag, level). Initial entry: (a, F, 1).

28 Algorithm Example Run. Event: <a>. Entries: (a, F, 1); index 0: (b, F, 2); index 1: (c, F, 2).

29 Algorithm Example Run. Event: <c>. Entries unchanged: (a, F, 1); (b, F, 2); (c, F, 2).

30 Algorithm Example Run. Event: </c>. The entry (c, F, 2) becomes (c, T, 2); (b, F, 2) is unchanged.

31 Algorithm Example Run. Event: <b>. Entries unchanged: (b, F, 2); (c, T, 2).

32 Algorithm Example Run. Event: </b>. The entry (b, F, 2) becomes (b, T, 2); (c, T, 2) is unchanged.

33 Algorithm Example Run. Event: </a>. The entry (a, F, 1) becomes (a, T, 1). Return TRUE.
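
The run above can be mimicked with a very small program. This is a simplified sketch inspired by the example, not the algorithm from the talk: it handles only the query /a[b and c], keeps one flag per query node instead of the three arrays, and takes the input as a list of (kind, tag) events, which is an assumed encoding.

```python
def evaluate_a_b_and_c(events):
    """Return True if the streamed document matches /a[b and c]."""
    level = 0                      # current depth in the document
    validation = {"a": False}      # one flag per active query node
    for kind, tag in events:
        if kind == "start":
            level += 1
            if level == 1 and tag == "a":
                # Candidate match for /a: open unvalidated entries for b and c.
                validation["b"] = False
                validation["c"] = False
        else:  # kind == "end"
            if level == 2 and tag in ("b", "c") and tag in validation:
                validation[tag] = True           # child of a validated on close
            elif level == 1 and tag == "a":
                validation["a"] = (validation.get("b", False)
                                   and validation.get("c", False))
            level -= 1
    return validation["a"]

run = [("start", "a"), ("start", "c"), ("end", "c"),
       ("start", "b"), ("end", "b"), ("end", "a")]
print(evaluate_a_b_and_c(run))
```

The flags change at the same events as in the example run: c on </c>, b on </b>, and a on </a>.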

34 Conclusion: Our Contributions. Space lower bounds on the instance data complexity of XPath on XML streams: 1. In terms of query frontier size. 2. In terms of document recursion depth. 3. In terms of document depth. A streaming XML algorithm that matches the lower bounds on a fragment of the language and does not use finite-state automata.

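35 XPath 1.0. Query Q: /conference/name, with query nodes $ (u0), /C (u1), /N (u2). Document D: the example document with nodes x0 to x8, labels abbreviated to C (conference), N (name), S (speaker), P (paper_cnt), and values PODS, Josifovski, Fagin, 1, 3. Result: { x2 }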

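36 XPath 1.0. Query Q: /conference//name, with query nodes $ (u0), /C (u1), //N (u2). Document D: the example document with nodes x0 to x8, labels abbreviated to C (conference), N (name), S (speaker), P (paper_cnt), and values PODS, Josifovski, Fagin, 1, 3. Result: { x2, x4, x7 }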
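
The difference between the child axis and the descendant axis in these two queries can be reproduced with Python's standard xml.etree.ElementTree. The XML serialization of the example document is an assumption, and ElementTree paths are written relative to the conference element rather than to the document root.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example document (nodes x1..x8).
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

conference = ET.fromstring(DOC)

# /conference/name: only name elements that are direct children.
child_axis = [n.text for n in conference.findall("name")]

# /conference//name: name elements at any depth below conference.
descendant_axis = [n.text for n in conference.findall(".//name")]

print(child_axis)
print(descendant_axis)
```

The first list holds the single conference name (x2); the second also holds the two speaker names (x4 and x7).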

37 Reduction. A: S-space streaming algorithm for Φ_Q. r ≥ 1: integer (r = 6 in the figure). Figure: the document D is split into consecutive segments processed alternately by Alice and Bob, who exchange the states s0, s1, ..., s6 of A. Theorem: S ≥ CC(Φ_Q^r) / r.

