Published by Kristopher Wickliffe. Modified over 2 years ago.

1
On the Memory Requirements of XPath Evaluation over XML Streams
Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center

2
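Preliminaries: XML
[Figure: example XML document drawn as a tree. root (x0) has the child conference (x1). conference has the children name (x2) = PODS, speaker (x3), and speaker (x6). Speaker x3 has the children name (x4) = Josifovski and paper_cnt (x5) = 1. Speaker x6 has the children name (x7) = Fagin and paper_cnt (x8) = 3.]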

3
Preliminaries: XPath 1.0
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
[Figure: the query drawn as a tree (root, conference, name = PODS, speaker, paper_cnt > 1, name) next to the example document tree with nodes x0 to x8.]
Result: { x7 }
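The query on this slide can be reproduced on a small in-memory document. The following is a minimal, non-streaming sketch in Python. The XML text is an assumed serialization of the tree on the slide, and `evaluate` is a hypothetical helper that hard-codes this one query. The predicates are checked by hand because `xml.etree.ElementTree` supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example document from the slide.
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

def evaluate(xml_text):
    """Evaluate /conference[name = PODS]/speaker[paper_cnt > 1]/name by hand."""
    conference = ET.fromstring(xml_text.strip())
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference node.
    if not any(n.text == "PODS" for n in conference.findall("name")):
        return []
    results = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker node.
        if any(float(c.text) > 1 for c in speaker.findall("paper_cnt")):
            results.extend(n.text for n in speaker.findall("name"))
    return results

print(evaluate(DOC))
```

Only the second speaker satisfies `paper_cnt > 1`, so the selected name is the one the slide labels x7.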

4
XML Streams
XML stream: an XML document arriving as a one-way stream
Critical resources:
  Memory
  Processing time
Why XML streams?
  For transferring XML between systems
  For efficient access to large XML documents
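To make the one-way access model concrete, here is a small Python sketch that reads a document in a single pass with `xml.etree.ElementTree.iterparse`. The function name `count_elements` and the sample input are illustrative assumptions, not part of the talk.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(stream):
    """Count elements in one forward pass over the stream."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        count += 1
        # Drop the element's text and children once it is closed, so that
        # finished subtrees do not accumulate in memory.
        elem.clear()
    return count

print(count_elements(io.BytesIO(b"<a><b/><c/></a>")))
```

The loop never looks back at earlier input, which is the constraint the lower bounds in the talk are about.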

5
Streaming XML Algorithms
XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
TurboXPath [Josifovski, Fontoura, and Barta 04]
…

6
Our Results
Space lower bounds for evaluating XPath on XML streams
A streaming XML algorithm
  Matches the lower bounds on a large fragment of the language
  Uses space linear in the query size rather than exponential in the query size

7
Related Work
Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

8
Data Complexity [Vardi 82]
(Q, D): evaluation function of a query Q on document D.
Q(D): evaluation function of a fixed query Q on document D.
Data complexity of Q: complexity of the best algorithm for Q on the worst D.
Worst-case data complexity: max over Q of (complexity of Q).
We characterize the data complexity of Q separately for each Q (not just the worst-case one).

9
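XPath Fragment
1. Queries are subsumption-free
[Figure: two query trees. Not subsumption-free: root, conference, with two name children carrying the predicates = PODS and != SIGMOD. Subsumption-free: root, conference, with one name child carrying the predicate != SIGMOD.]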

10
XPath Fragment (cont.)
2. Queries are univariate
[Figure: two query trees over root, conference, paper_cnt, and author_cnt. Not univariate: paper_cnt and author_cnt are compared to each other with <. Univariate: each is compared to a constant (< 30 and > 30).]

11
XPath Fragment (cont.) 3. Queries consist of conjunctions only 4. Queries are “star-restricted”

12
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree with root, conference, name = PODS, speaker, name, paper_cnt > 1.]
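The frontier definition can be checked on the example query with a few lines of Python. This is a sketch under stated assumptions: the query tree is encoded as nested `(name, children)` tuples, predicates are ignored, and ancestors themselves are not counted, following the wording of the definition on the slide. `QUERY` and `frontier_size` are illustrative names.

```python
# Query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name,
# encoded as (node name, list of children). Predicates are omitted.
QUERY = ("root", [
    ("conference", [
        ("name", []),
        ("speaker", [("name", []), ("paper_cnt", [])]),
    ]),
])

def frontier_size(node, carried=0):
    """Return the size of the largest frontier in the subtree of `node`.

    `carried` is the number of siblings of `node` plus the number of
    siblings of all its ancestors, so the frontier at `node` has
    1 + carried members.
    """
    _name, children = node
    best = 1 + carried
    for child in children:
        # Each child has len(children) - 1 siblings of its own.
        best = max(best, frontier_size(child, carried + len(children) - 1))
    return best

print(frontier_size(QUERY))
```

For this query the largest frontier is at a child of speaker: the node itself, its one sibling, and the name sibling of speaker.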

13
Document Recursion Depth
Definition: recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
[Figure: Query Q is a tree with root, a //part node, and its children name and number. Document D is a tree with nodes x0 to x7 in which part elements are nested and have name and number children; the values shown are Refrigerator, 456, Compressor, and 12.]
Theorem 2: For all queries Q in the fragment that have at least one “//” node, stream-space(Q) = Ω(recDepth_Q(D)).
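For a single query node reached through "//", recursion depth reduces to counting how deeply elements with that tag are nested. The Python sketch below makes that simplifying assumption (every `part` element path-matches the `//part` node). The document text is an assumed reconstruction of the slide's example, and `rec_depth` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed document: a part that contains another part.
DOC = """
<part>
  <name>Refrigerator</name><number>456</number>
  <part><name>Compressor</name><number>12</number></part>
</part>
"""

def rec_depth(elem, tag, on_path=0):
    """Max number of `tag` elements on any root-to-leaf path below `elem`."""
    if elem.tag == tag:
        on_path += 1
    # A leaf contributes the count accumulated on its own path.
    return max([on_path] + [rec_depth(child, tag, on_path) for child in elem])

print(rec_depth(ET.fromstring(DOC.strip()), "part"))
```

A document without nested `part` elements has recursion depth 1 for this query, which corresponds to the non-recursive case.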

14
Document Depth
Definition: depth(D): length of the longest root-to-leaf path.
[Figure: document D as a tree with nodes x0 to x7, containing nested part elements with name and number children; the values shown are Refrigerator, 456, Compressor, and 12.]
Theorem 3: For all queries Q in the fragment that have at least one “/” node, stream-space(Q) = Ω(log depth(D)).
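The log depth(D) term corresponds to the bits needed to remember the current nesting level. The Python sketch below tracks that level with one counter while streaming. It assumes depth is counted in element nesting levels (the slide's virtual root node is not counted), and `stream_depth` is an illustrative name.

```python
import io
import xml.etree.ElementTree as ET

def stream_depth(stream):
    """Return the deepest element nesting level seen in one pass."""
    level = deepest = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            level += 1
            deepest = max(deepest, level)
        else:
            level -= 1
            elem.clear()  # closed subtrees are no longer needed
    return deepest

doc = b"<part><name/><part><name/></part></part>"
depth = stream_depth(io.BytesIO(doc))
# The level counter needs about log2(depth) bits.
print(depth, depth.bit_length())
```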

15
New Algorithm
Theorem 4(a): For all queries Q in “Univariate XPath”:
  Space: O(|Q| · recDepth(D) · log depth(D)).
  Time: O(|D| · |Q| · recDepth(D)).
Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D:
  Space: O(FrontierSize(Q) · log depth(D)).
  Time: O(|D| · FrontierSize(Q)).

16
Proof of Theorem 1
Fragment:
  “subsumption-free”
  “univariate”
  Conjunctions only
  “star-restricted”
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree with root, conference, name = PODS, speaker, name, paper_cnt > 1.]

17
Critical Document
Definition: Document D is critical for query Q if:
(1) D matches Q.
(2) If we remove any node from D, it no longer matches Q.
[Figure: Query Q (root, conference, name = PODS, speaker, name, paper_cnt > 1) next to Document D, the example tree with nodes x0 to x8 and values PODS, Josifovski, 1, Fagin, 3.]
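Criticality can be tested by brute force on small documents. The Python sketch below assumes that "removing a node" means deleting an element together with its subtree, and it hard-codes the running example query in a hypothetical `matches` helper. Under these assumptions, the document with a single qualifying speaker is critical, while the document with both speakers is not.

```python
import xml.etree.ElementTree as ET

def matches(conference):
    """True if /conference[name = PODS]/speaker[paper_cnt > 1]/name is non-empty."""
    if conference.tag != "conference":
        return False
    if not any(n.text == "PODS" for n in conference.findall("name")):
        return False
    return any(
        speaker.find("name") is not None
        and any(float(c.text) > 1 for c in speaker.findall("paper_cnt"))
        for speaker in conference.findall("speaker")
    )

def is_critical(root):
    """True if root matches and deleting any one subtree breaks the match."""
    if not matches(root):
        return False
    pairs = [(parent, child) for parent in root.iter() for child in list(parent)]
    for parent, child in pairs:
        index = list(parent).index(child)
        parent.remove(child)            # try the document without this node
        still_matches = matches(root)
        parent.insert(index, child)     # restore the original document
        if still_matches:
            return False
    return True

SMALL = ("<conference><name>PODS</name>"
         "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>")
FULL = ("<conference><name>PODS</name>"
        "<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
        "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>")
print(is_critical(ET.fromstring(SMALL)), is_critical(ET.fromstring(FULL)))
```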

18
Main Lemmas
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q).
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).

19
One-way Communication Complexity
f: X × Y → Z
[Figure: Alice holds x and Bob holds y. Alice sends a single message m to Bob, who outputs f(x, y).]
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

20
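Reduction
A: streaming algorithm for Q using space S.
[Figure: the document D is split between Alice and Bob. Alice runs A on her part and sends the state of A to Bob; Bob resumes A from that state on his part and outputs Q(D).]
Theorem: stream-space(Q) >= CC(Q)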
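The reduction can be simulated directly: whatever a streaming algorithm remembers between the two halves of the input is exactly the message Alice must send. The Python sketch below uses a toy matcher for /a[b and c] over a list of tag tokens (a leading "/" marks a closing tag). The class `Matcher`, the token encoding, and `one_way_protocol` are assumptions made for illustration.

```python
class Matcher:
    """Toy streaming matcher for /a[b and c]; its state is a small tuple."""

    def __init__(self, state=(0, False, False, False)):
        self.level, self.root_ok, self.seen_b, self.seen_c = state

    def feed(self, tokens):
        for tok in tokens:
            if tok.startswith("/"):          # closing tag
                self.level -= 1
                continue
            self.level += 1                  # opening tag
            if self.level == 1:
                self.root_ok = tok == "a"
            elif self.level == 2 and self.root_ok:
                self.seen_b = self.seen_b or tok == "b"
                self.seen_c = self.seen_c or tok == "c"
        return self

    def state(self):
        return (self.level, self.root_ok, self.seen_b, self.seen_c)

    def result(self):
        return self.root_ok and self.seen_b and self.seen_c

def one_way_protocol(prefix, suffix):
    """Alice runs the matcher on the prefix and sends its state to Bob."""
    message = Matcher().feed(prefix).state()       # Alice's single message
    answer = Matcher(message).feed(suffix).result()  # Bob resumes from it
    return answer, message

tokens = ["a", "c", "/c", "b", "/b", "/a"]
print(one_way_protocol(tokens[:3], tokens[3:]))
```

The message is never larger than the algorithm's memory, which is why a communication lower bound carries over to streaming space.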

21
Fooling Set Technique
Partitioned document: D_{α,β}, consisting of a document prefix α and a document suffix β.
Definition: A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents D_{α,β}, D_{α',β'} in T, either D_{α,β'} does not match Q or D_{α',β} does not match Q.
Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
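The two fooling-set conditions are easy to check mechanically for a small function. The Python sketch below uses string equality as a stand-in for "the partitioned document matches Q"; this is an illustrative assumption, not the construction from the talk. `is_fooling_set` and `equal` are hypothetical names.

```python
from itertools import combinations, product
from math import log2

def is_fooling_set(pairs, f):
    """Check the two fooling-set conditions for a list of (x, y) pairs."""
    # Condition 1: f accepts every pair in the set.
    if not all(f(x, y) for x, y in pairs):
        return False
    # Condition 2: for two distinct pairs, at least one crossed pair is rejected.
    return all(
        not f(x1, y2) or not f(x2, y1)
        for (x1, y1), (x2, y2) in combinations(pairs, 2)
    )

def equal(x, y):
    return x == y

# All pairs (s, s) over 3-bit strings form a fooling set for equality.
T = [(s, s) for s in map("".join, product("01", repeat=3))]
print(is_fooling_set(T, equal), log2(len(T)))
```

With 8 pairs the bound is 3 bits: Alice cannot send the same message for two different 3-bit strings.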

22
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FS(D)).
[Figure: Query Q (root, conference, name = PODS, speaker, name, paper_cnt > 1) next to the critical Document D: root (x0), conference (x1), name (x2) = PODS, speaker (x3), name (x4) = Fagin, paper_cnt (x5) = 3.]

23
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
[Figure: Query Q (root, conference, name = PODS, speaker, name, paper_cnt > 1) next to the partitioned Document D_S built from the critical document with nodes x0 to x5 and values PODS, Fagin, 3.]

24
Proof of Lemma 1 (cont.)
Claim: { D_S }, where S ranges over the subsets of Frontier(D), is a fooling set.
Consequence: stream-space(Q) >= log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_{S,T} or D_{T,S} does not match Q.
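The counting step behind the bound can be written out as one chain. This is a sketch that assumes distinct subsets S give distinct partitioned documents D_S, and that reads the fooling-set theorem as a bound of log |T| bits with constant factors ignored:

```latex
\begin{align*}
|T| &= \bigl|\{\, D_S : S \subseteq \mathrm{Frontier}(D) \,\}\bigr| = 2^{\mathrm{FS}(D)}, \\
\text{stream-space}(Q) &\ge \mathrm{CC}(Q) \ge \log_2 |T| = \log_2 2^{\mathrm{FS}(D)} = \mathrm{FS}(D).
\end{align*}
```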

25
Proof of Claim (example)
S = { x2, x5 }, T = { x4, x5 }
[Figure: three trees built from the critical document (root x0, conference x1, name x2 = PODS, speaker x3, name x4 = Fagin, paper_cnt x5 = 3): Document D_S, Document D_T, and the crossed Document D_{T,S}. In D_{T,S} the conference name is missing, so it does not match Q.]

26
Algorithm
Uses the query as an NFA
Based on three global data structures:
  Pointer array
  Validation array
  Level array
Matches the lower bounds for a fragment of XPath.

27
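Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: input XML and the initial state of the three arrays: the level array, the validation array, and the pointer array with one entry. Validation array entry (node, matched, level): (a, F, 1).]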

28
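Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the opening tag a. Array positions Index 0 and Index 1 are shown. Validation array entries (node, matched, level): (a, F, 1); (b, F, 2), (c, F, 2).]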

29
Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the opening tag c. Validation array entries (node, matched, level): (a, F, 1); (b, F, 2), (c, F, 2).]

30
Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the closing tag /c. Validation array entries (node, matched, level): (a, F, 1); (b, F, 2), (c, T, 2).]

31
Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the opening tag b. Validation array entries (node, matched, level): (a, F, 1); (b, F, 2), (c, T, 2).]

32
Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the closing tag /b. Validation array entries (node, matched, level): (a, F, 1); (b, T, 2), (c, T, 2).]

33
Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3)
[Figure: state after reading the closing tag /a. Validation array entries (node, matched, level): (a, T, 1), with (b, T, 2) and (c, T, 2) already matched.]
Return TRUE
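The example run can be imitated with a much simpler program than the one in the talk. The Python sketch below is not the authors' algorithm: it handles only queries of the form /root[child1 and child2 ...], keeps one validation flag per predicate child plus a level counter, and sets a flag when the child's closing tag arrives, as in the run above. The function name and the input documents are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def eval_conjunctive(stream, root_tag, child_tags):
    """Streaming check of /root_tag[child1 and child2 and ...]."""
    validation = {tag: False for tag in child_tags}  # one flag per predicate
    level = 0
    root_ok = False
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            level += 1
            if level == 1:
                root_ok = elem.tag == root_tag
        else:
            # A predicate child is validated when its closing tag is read.
            if level == 2 and root_ok and elem.tag in validation:
                validation[elem.tag] = True
            if level == 1:
                # Closing tag of the document element: report the result.
                return root_ok and all(validation.values())
            level -= 1
            elem.clear()
    return False

print(eval_conjunctive(io.BytesIO(b"<a><c>1</c><b>1</b></a>"), "a", ["b", "c"]))
```

The memory used is one flag per predicate child plus the level counter, which mirrors the FrontierSize(Q) and log depth(D) terms for this query shape.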

34
Conclusion: Our Contributions
Space lower bounds on the instance data complexity of XPath on XML streams:
1. In terms of Query Frontier Size
2. In terms of Document Recursion Depth
3. In terms of Document Depth
A streaming XML algorithm
  Matches the lower bounds on a fragment of the language
  Does not use finite-state automata

35
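XPath 1.0
Query Q: /conference/name, with query nodes $ (u0), /C (u1), /N (u2)
[Figure: document D as a tree with abbreviated labels ($ root, C conference, N name, S speaker, P paper_cnt), nodes x0 to x8, and values PODS, Josifovski, 1, Fagin, 3.]
Result: { x2 }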

36
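XPath 1.0
Query Q: /conference//name, with query nodes $ (u0), /C (u1), //N (u2)
[Figure: document D as a tree with abbreviated labels ($ root, C conference, N name, S speaker, P paper_cnt), nodes x0 to x8, and values PODS, Josifovski, 1, Fagin, 3.]
Result: { x2, x4, x7 }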

37
Reduction
A: S-space streaming algorithm for Q.
r >= 1: integer (r = 6 in the figure).
[Figure: the document D is cut into segments that alternate between Alice and Bob; the states s0, s1, ..., s6 of A are passed between them, and the final output is Q(D).]
Theorem: S >= CC( Q r ) / r
