Bose, Fegaras, Levine, Chaluvadi DBPL

A Query Algebra for Fragmented XML Stream Data
Sujoe Bose, Leonidas Fegaras, David Levine, Vamsi Chaluvadi
University of Texas at Arlington
Processing Streamed XML Data

Most web servers are pull-based: a client submits a request and the server returns the requested data. This does not scale well to a large number of clients or to large query results.

Alternative method: push-based dissemination
– The server broadcasts or multicasts data in a continuous stream
– The client connects to multiple streams and evaluates queries locally
– No handshaking, no error correction
– All processing is done at the client side
– The only task performed by the server is slicing, scheduling, and broadcasting data:
  – Critical data may be repeated more often than non-critical data
  – Invalid data may be revoked
  – New updates may be broadcast as soon as they become available
A Framework for Processing XML Streams

The server slices an XML data source into XML fragments. Each fragment:
– is a filler that fills a hole
– may contain holes that can be filled by other fragments
– is wrapped with control information, such as its unique hole ID, the path that reaches this fragment, etc.

The client opens connections to streams and evaluates XQueries against these streams:
– For large streams, it is a bad idea to reconstruct the streamed data in the client's memory; the client needs to process fragments as soon as they become available from the server
– There are blocking operators that require unbounded memory:
  – Sorting
  – Joins between two streams, or self-joins
  – Group-by with aggregation
The Fragmented Hole-Filler Model

[Figure: an example XML document sliced into hole-filler fragments. Recoverable text values: "Wal-Mart"; "PDA", "HP", "PalmPilot"; "Calculator", "Casio", "FX".]
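The hole-filler model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Hole`, `Element`, `Fragment`), the element tags, and the particular way the store document is sliced are hypothetical and not taken from the slides. Filler id 0 is treated as the root fragment, as the reconstruction slide suggests.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Hole:
    """Placeholder for content that is delivered by another fragment."""
    id: int


@dataclass
class Element:
    tag: str
    children: List[Union["Element", Hole, str]] = field(default_factory=list)


@dataclass
class Fragment:
    """A filler: XML content wrapped with the id of the hole it fills."""
    filler_id: int  # assumption: 0 denotes the root fragment
    content: Element


def holes_in(node) -> List[int]:
    """Collect the ids of all holes inside a piece of XML."""
    if isinstance(node, Hole):
        return [node.id]
    if isinstance(node, Element):
        return [h for child in node.children for h in holes_in(child)]
    return []  # plain text contains no holes


# Hypothetical slicing of a small store document; fragments may arrive out of order.
stream = [
    Fragment(0, Element("store", [Element("name", ["Wal-Mart"]), Hole(1), Hole(2)])),
    Fragment(2, Element("item", [Element("name", ["Calculator"]), Element("vendor", ["Casio"])])),
    Fragment(1, Element("item", [Element("name", ["PDA"]), Element("vendor", ["HP"])])),
]

root_holes = holes_in(stream[0].content)
```

Here the root fragment carries two holes, and the two item fragments are the fillers for them; neither item fragment contains further holes.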
An Algebra for Stored XML Data

Based on the nested-relational algebra:

source access, parameters v and T        access the XML data source T using v
\(\sigma_{pred}(X)\)                     select fragments from X that satisfy pred
\(\pi_{v_1,\ldots,v_n}(X)\)              project
\(X \cup Y\)                             merge
\(X \bowtie_{pred} Y\)                   join
\(\mu^{pred}_{v,path}(X)\)               unnest (retrieve descendants of elements)
\(\Delta^{\oplus,h}_{pred}(X)\)          apply h and reduce by \(\oplus\)
\(\Gamma^{gs,pred}_{v,\oplus,h}(X)\)     group by gs, apply h to each group, and reduce each group by \(\oplus\)
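Semantics

source access (v, T) \( = \{\, \langle v = T \rangle \,\} \)
\( \sigma_{pred}(X) = \{\, t \mid t \in X,\ pred(t) \,\} \)
\( \pi_{v_1,\ldots,v_n}(X) = \{\, \langle v_1 = t.v_1, \ldots, v_n = t.v_n \rangle \mid t \in X \,\} \)
\( X \cup Y = X \mathbin{+\!+} Y \)
\( X \bowtie_{pred} Y = \{\, t_x \circ t_y \mid t_x \in X,\ t_y \in Y,\ pred(t_x, t_y) \,\} \)
\( \mu^{pred}_{v,path}(X) = \{\, t \circ \langle v = w \rangle \mid t \in X,\ w \in \mathrm{PATH}(t, path),\ pred(t, w) \,\} \)
\( \Delta^{\oplus,h}_{pred}(X) = \oplus/\{\, h(t) \mid t \in X,\ pred(t) \,\} \)
\( \Gamma^{gs,pred}_{v,\oplus,h}(X) = \ldots \)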
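The set-comprehension semantics above translate almost directly into list comprehensions. The following is a minimal Python sketch, assuming that tuples are dictionaries from variable names to values and that tuple concatenation is a dictionary merge. The function names, the `unit` argument (the identity of the reducing operator), and the sample data are assumptions made for illustration. Group-by is omitted because the slide elides its definition.

```python
from functools import reduce as fold


def select(pred, X):
    """Keep the tuples of X that satisfy pred."""
    return [t for t in X if pred(t)]


def project(names, X):
    """Keep only the listed variables of each tuple."""
    return [{v: t[v] for v in names} for t in X]


def merge(X, Y):
    """Merge is list concatenation (X ++ Y)."""
    return X + Y


def join(pred, X, Y):
    """Concatenate every pair of tuples that satisfies pred."""
    return [{**tx, **ty} for tx in X for ty in Y if pred(tx, ty)]


def unnest(v, path_of, X, pred=lambda t, w: True):
    """Bind v to each value w reached from t, keeping the pairs that satisfy pred."""
    return [{**t, v: w} for t in X for w in path_of(t) if pred(t, w)]


def reduce_by(op, unit, h, X, pred=lambda t: True):
    """Apply h to each qualifying tuple and fold the results with op."""
    return fold(op, (h(t) for t in X if pred(t)), unit)


# Hypothetical sample data.
users = [{"u": {"userid": "U1", "name": "Ann"}},
         {"u": {"userid": "U2", "name": "Bob"}}]
bids = [{"b": {"userid": "U1", "itemno": 7}},
        {"b": {"userid": "U1", "itemno": 9}}]

pairs = join(lambda tx, ty: tx["u"]["userid"] == ty["b"]["userid"], users, bids)
items = reduce_by(lambda a, b: a + b, [], lambda t: [t["b"]["itemno"]], pairs)
```

In this run, `pairs` holds the two user/bid combinations for user U1, and `items` collects their item numbers by reducing with list concatenation.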
Example #1

[XQuery listing: only the keyword "where" is recoverable.]
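Example #1 (cont.)

[Figure: algebraic plan for Example #1. Recoverable labels: the source document("… bound to $v; the path $v/bib/book bound to $b; the predicate $b/publisher = "Addison-Wesley" and … > 1991; and the result constructor element("book", $b/title).]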
Example #2

for $u in document("users.xml")//user_tuple
return
  <user>
    { $u/name }
    { for $b in document("bids.xml")//bid_tuple[userid=$u/userid]/itemno,
          $i in document("items.xml")//item_tuple[itemno=$b]
      return <bid>{ $i/description/text() }</bid>
      sortby(.) }
  </user>
  sortby(name)

[Figure: algebraic plan for Example #2. Recoverable labels:
– sources: document("users.xml") bound to $us, document("bids.xml") bound to $bs, document("items.xml") bound to $is
– unnest paths: $us/users/user_tuple bound to $u, $bs/bids/bid_tuple bound to $c, $is/items/item_tuple bound to $i, $c/itemno bound to $b
– predicates: $c/userid = $u/userid and $i/itemno = $b
– inner reduction: sort, elem("bid", $i/description/text())
– outer reduction: sort($u/name), elem("user", $u/name ++ …)]
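XPath Expressions

Path evaluation is central to the algebra:
  \( \mathrm{PATH} : (\text{XML-data}, \text{simple-XPath}) \rightarrow \mathrm{set}(\text{XML-data}) \)

Some rules for stored XML data:
  \( \mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A/path) = \mathrm{PATH}(x, path) \)
  \( \mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A) = \{\, x \,\} \)
  \( \mathrm{PATH}(x_1\ x_2,\ path) = \mathrm{PATH}(x_1, path) \cup \mathrm{PATH}(x_2, path) \)
  \( \mathrm{PATH}(x, path) = \emptyset \)   otherwise

Predicates have existential semantics:
  $v/A/B = "text"   means   \( \exists x \in \mathrm{PATH}(v, A/B) : x = \text{"text"} \)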
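The PATH rules can be read as a small recursive function. The sketch below is one possible Python rendering, assuming XML data is a list of nodes where an element is a `(tag, children)` pair and text is a string, and a simple path is a list of tag names. The representation and the sample book data are illustrative assumptions.

```python
def path(x, steps):
    """Evaluate a simple path (list of tag names) over a sequence of nodes x."""
    results = []
    for node in x:                      # PATH(x1 x2, p) = PATH(x1, p) U PATH(x2, p)
        if isinstance(node, tuple) and node[0] == steps[0]:
            _, children = node
            if len(steps) == 1:
                results.append(children)                   # PATH(<A>x</A>, A) = { x }
            else:
                results.extend(path(children, steps[1:]))  # PATH(<A>x</A>, A/p) = PATH(x, p)
        # any other node contributes nothing (the "otherwise" rule)
    return results


def exists_equal(x, steps, text):
    """Existential predicate: some value reached by the path equals the given text."""
    return any(value == [text] for value in path(x, steps))


# Hypothetical sample document.
doc = [("bib", [
    ("book", [("title", ["TCP/IP Illustrated"]), ("publisher", ["Addison-Wesley"])]),
    ("book", [("title", ["Data on the Web"]), ("publisher", ["Morgan Kaufmann"])]),
])]

titles = path(doc, ["bib", "book", "title"])
has_aw = exists_equal(doc, ["bib", "book", "publisher"], "Addison-Wesley")
```

The predicate holds as soon as one of the values reached by the path matches, which is the existential reading stated on the slide.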
The Streamed XML Algebra

Much like the stored XML algebra, but works on streams.

A stream takes one of two forms:
  \( t ; s' \)   a fragment t followed by the rest of the stream s'
  eos            end of stream

Each stored XML algebraic operator has a streamed counterpart, written here with a hat. E.g., for selection over a stream \( t ; s \):
  \( \hat{\sigma}_{pred}(t ; s) = t ; \hat{\sigma}_{pred}(s) \)   if pred is true for t
  \( \hat{\sigma}_{pred}(t ; s) = \hat{\sigma}_{pred}(s) \)       otherwise
  \( \hat{\sigma}_{pred}(\mathrm{eos}) = \mathrm{eos} \)

but … we may not be able to validate pred due to holes in t.
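Streamed Algebra Semantics

To keep the suspended fragments, each streamed algebraic operator has
– one state \(S_0\) for the output and
– optional state(s) \(S_1\) / \(S_2\) for the input(s)

The result of PATH may now be unspecified (\(\bot\)). For a hole with id m:
  \( \mathrm{PATH}(\mathrm{hole}(m), path) = \mathrm{PATH}(S_1(m), path) \)   if \( m \in S_1 \)
  \( \mathrm{PATH}(\mathrm{hole}(m), path) = \{\, \bot \,\} \)                 otherwise

When PATH is used in predicates, this requires three-valued logic.

Incomplete fragments are suspended when necessary, e.g. for the streamed selection \(\hat{\sigma}\) over a stream \( t ; s \):
  \( \hat{\sigma}_{pred}(t ; s) = t ; \hat{\sigma}_{pred}(s) \)   if \( \mathit{true} \in \mathrm{PATH}(t, pred) \)
  \( \hat{\sigma}_{pred}(t ; s) = \hat{\sigma}_{pred}(s) \)       otherwise
  \( S_0 \leftarrow S_0 \cup \{t\} \)                             if \( \bot \in \mathrm{PATH}(t, pred) \)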
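A streamed selection with suspension might be sketched as follows in Python. This is a simplified illustration, not the authors' implementation: the stream is assumed to be a sequence of `("tuple", t)` and `("filler", (m, content))` events, a hole is the pair `("hole", m)`, `None` plays the role of the unspecified truth value, and suspended tuples are re-checked whenever a new filler arrives (that re-check policy is an assumption).

```python
def lookup(t, name, fillers):
    """Value of field `name`; None (unspecified) if it is a hole that is still unfilled."""
    v = t.get(name)
    if isinstance(v, tuple) and v[0] == "hole":
        return fillers.get(v[1])
    return v


def eq3(value, constant):
    """Three-valued equality: None means the answer is not yet known."""
    return None if value is None else value == constant


def streamed_select(stream, pred):
    fillers = {}     # input state: fillers seen so far, keyed by hole id
    suspended = []   # output state: tuples whose predicate is still unspecified
    for kind, payload in stream:
        if kind == "filler":
            m, content = payload
            fillers[m] = content
            candidates = suspended[:]   # re-check everything that was waiting
            suspended.clear()
        else:
            candidates = [payload]
        for t in candidates:
            verdict = pred(t, fillers)
            if verdict is True:
                yield t                 # predicate validated: emit
            elif verdict is None:
                suspended.append(t)     # still blocked on an unfilled hole
            # verdict False: the tuple is dropped


# Hypothetical stream: the publisher of book "A" arrives late, as filler 1.
stream = [
    ("tuple", {"title": "A", "publisher": ("hole", 1)}),
    ("tuple", {"title": "B", "publisher": "Morgan Kaufmann"}),
    ("tuple", {"title": "C", "publisher": "Addison-Wesley"}),
    ("filler", (1, "Addison-Wesley")),
]

is_aw = lambda t, fillers: eq3(lookup(t, "publisher", fillers), "Addison-Wesley")
selected = [t["title"] for t in streamed_select(stream, is_aw)]
```

Book "C" is emitted immediately, while book "A" is held back until its filler arrives and is then emitted.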
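Join

Much like the main-memory symmetric join. States:
– \(S_0\): all suspended output tuples due to unfilled holes
– \(S_1\): all tuples from the left stream
– \(S_2\): all tuples from the right stream

Here \(\hat{\bowtie}\) is the streamed join, \(s_1\) and \(s_2\) are the input streams, and \(\bot\) is the unspecified value.

A tuple \(t_1\) from the left stream:
  \( (t_1 ; s_1) \,\hat{\bowtie}_{pred}\, s_2 = \{\, t_1 \circ t_2 \mid t_2 \in S_2,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (s_1 \,\hat{\bowtie}_{pred}\, s_2) \)
  \( S_1 \leftarrow S_1 \cup \{t_1\} \)
  \( S_0 \leftarrow S_0 \cup \{\, t_1 \circ t_2 \mid t_2 \in S_2,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} \)

A tuple \(t_2\) from the right stream:
  \( s_1 \,\hat{\bowtie}_{pred}\, (t_2 ; s_2) = \{\, t_1 \circ t_2 \mid t_1 \in S_1,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (s_1 \,\hat{\bowtie}_{pred}\, s_2) \)
  \( S_2 \leftarrow S_2 \cup \{t_2\} \)
  \( S_0 \leftarrow S_0 \cup \{\, t_1 \circ t_2 \mid t_1 \in S_1,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} \)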
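The three join states can be sketched as a small Python class. This is a minimal illustration under assumptions: tuples are dictionaries, the predicate returns `True`, `False`, or `None` (unspecified, standing in for a value hidden behind an unfilled hole), and the class and method names are hypothetical. Re-evaluating suspended pairs when fillers arrive is left out.

```python
class SymmetricJoin:
    def __init__(self, pred):
        self.pred = pred
        self.left = []        # state 1: all tuples seen on the left stream
        self.right = []       # state 2: all tuples seen on the right stream
        self.suspended = []   # state 0: pairs whose predicate is unspecified

    def push(self, side, t):
        """Feed one tuple from the 'left' or 'right' stream; return the new output pairs."""
        if side == "left":
            mine, others = self.left, self.right
        else:
            mine, others = self.right, self.left
        out = []
        for o in others:
            pair = (t, o) if side == "left" else (o, t)
            verdict = self.pred(*pair)
            if verdict is True:
                out.append(pair)             # joined immediately
            elif verdict is None:
                self.suspended.append(pair)  # wait for the missing filler
        mine.append(t)                       # remember t for future arrivals
        return out


def same_user(u, b):
    """Three-valued equality on userid; None models an unfilled hole."""
    if u["userid"] is None or b["userid"] is None:
        return None
    return u["userid"] == b["userid"]


# Hypothetical interleaving of two streams.
j = SymmetricJoin(same_user)
r1 = j.push("left", {"userid": "U1", "name": "Ann"})
r2 = j.push("right", {"userid": "U1", "itemno": 7})
r3 = j.push("right", {"userid": None, "itemno": 9})
r4 = j.push("left", {"userid": "U2", "name": "Bob"})
```

Each arriving tuple is probed against the stored tuples of the opposite side and then stored itself, so neither input has to be read to completion before output starts. Both stored input states grow without bound, which is why the join counts among the blocking, memory-hungry operators.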
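Reconstructing the XML Data

\( \rho : \mathrm{set}(\mathrm{int} \times \text{XML-data}) \) is an environment that binds filler ids to XML.

\( \rho(x) \) replaces holes with fillers in x using the environment \(\rho\):
  \( \rho(\texttt{<A>}x\texttt{</A>}) = \texttt{<A>}\rho(x)\texttt{</A>} \)
  \( \rho(x_1\ x_2) = \rho(x_1)\ \rho(x_2) \)
  \( \rho(\mathrm{hole}(m)) = \rho[m] \)   if \( m \in \rho \)
  \( \rho(x) = x \)                        otherwise

\( R(s) \) returns a pair \( (a, \rho) \), where a is \( \rho[0] \) (the reconstructed data). For a filler with id m and content x, if \( R(s) = (a, \rho) \) then
  \( R(\mathrm{filler}(m, x) ; s) = (\rho(x), \rho) \)      if \( m = 0 \)
  \( R(\mathrm{filler}(m, x) ; s) = (\rho'(a), \rho') \)    if \( m \neq 0 \)
  where \( \rho' = \{(m, \rho(x))\} \cup \rho[m/x] \)
  \( R(\mathrm{eos}) = (\ldots, \ldots) \)

Basically, \( R(t ; s) = f(R(s)) \).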
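Hole replacement can be illustrated with a short Python sketch. It is a simplified batch version rather than the recursive definition of R on the slide: it first collects all fillers into an environment and then fills the root (id 0). The node representation (`("elem", tag, children)`, `("hole", m)`, and strings for text) and the sample fragments are assumptions, and fillers are assumed not to refer to each other cyclically.

```python
def fill(x, env):
    """Replace every hole in x whose filler is known; other holes are left in place."""
    if isinstance(x, tuple) and x[0] == "hole":
        m = x[1]
        return fill(env[m], env) if m in env else x
    if isinstance(x, tuple):               # ("elem", tag, children)
        _, tag, children = x
        return ("elem", tag, [fill(c, env) for c in children])
    return x                               # plain text


def reconstruct(stream):
    """Bind filler ids to content, then rebuild the document from the root filler 0."""
    env = {}
    for m, x in stream:
        env[m] = x                         # a repeated id replaces the earlier filler
    return fill(env[0], env) if 0 in env else None


# Hypothetical fragments, delivered out of order.
stream = [
    (1, ("elem", "item", ["PDA"])),
    (0, ("elem", "store", ["Wal-Mart", ("hole", 1), ("hole", 2)])),
    (2, ("elem", "item", ["Calculator"])),
]

document = reconstruct(stream)
```

Because fillers are looked up by id, the order in which fragments arrive does not matter, and a later filler with the same id simply replaces the earlier one.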
Equivalence Between Stored & Streamed Algebras

If we reconstruct the XML document from the streamed fragments and evaluate a query using the stored algebra, we get the same result as when we evaluate the equivalent streamed algebra over the streamed XML fragments and then reconstruct the result.

[Figure: commuting diagram. One route: XML fragments, reconstruction, XML document, stored XML algebra, result. Other route: XML fragments, streamed XML algebra, XML fragments, reconstruction, result.]

Proof sketch: we prove \( R(\hat{p}(s)) = p(R(s)) \) inductively, where \(\hat{p}\) is the stream version of p and s is a stream.
If \( \mathit{true} \in \mathrm{PATH}(t, pred) \), then
  \( R(\hat{p}(t ; s)) = R(t ; \hat{p}(s)) = f(R(\hat{p}(s))) = f(p(R(s))) = p(f(R(s))) = p(R(t ; s)) \) …
Conclusion

Fragmented XML data are easier to handle and synchronize than an infinitely long stream.
Associating holes with fillers takes care of out-of-sequence transmission, repetitions, replacements, and removals.
Our streamed algebra has operators similar to those of our stored algebra, but with different semantics.
Our algebra can capture most non-recursive XQueries.
Our future work includes
– the development of main-memory algorithms for processing XML data streams under memory and power constraints
– the development of a comprehensive approach to optimizing XQueries that utilizes our main-memory algorithms