The Joy of SAX
Leonidas Fegaras, University of Texas at Arlington


1 The Joy of SAX
Leonidas Fegaras
University of Texas at Arlington
fegaras@cse.uta.edu
http://lambda.uta.edu/

2 Design Goals
Want to build an XQuery engine based entirely on SAX handlers
– all the way from the point where the input documents are read by the SAX parser up to the point where the query results are printed
This engine should consist of operators that
– naturally reflect the syntactic structures of XQuery and
– can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
The XQuery translation should be concise, clean, and completely compositional
Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much on performance in terms of memory and computational overhead
But... it should be able to beat transducers for complex predicates and deeply nested queries

3 Pull-Based Approach
Based on iterators:

    class Iterator {
        Tuple current ();   // current tuple from stream
        void open ();       // open the stream iterator
        Tuple next ();      // get the next tuple from stream
        boolean eos ();     // is this the end of stream?
    }

An iterator reads data from the input stream(s) and delivers data to the output stream
Connected through pipelines
– an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
– to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
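The producer/consumer protocol above can be sketched with two toy iterators. This is only an illustration under stated assumptions, not the engine's code: `PullIterator`, `ListSource`, `Filter`, and `PullDemo` are hypothetical names, and plain strings stand in for tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical interface mirroring the four methods on the slide;
// String stands in for Tuple to keep the sketch short.
interface PullIterator {
    void open();        // position the cursor on the first element
    String current();   // element under the cursor (null at end of stream)
    String next();      // advance and return the new current element
    boolean eos();      // true when the stream is exhausted
}

// Leaf producer: iterates over an in-memory list.
class ListSource implements PullIterator {
    private final List<String> items;
    private int pos = 0;
    ListSource(List<String> items) { this.items = items; }
    public void open() { pos = 0; }
    public String current() { return eos() ? null : items.get(pos); }
    public String next() { pos++; return current(); }
    public boolean eos() { return pos >= items.size(); }
}

// A consumer that is also a producer: it pulls from its input only when
// its own consumer asks for more, skipping elements that fail the test.
class Filter implements PullIterator {
    private final PullIterator input;
    private final Predicate<String> test;
    Filter(PullIterator input, Predicate<String> test) {
        this.input = input;
        this.test = test;
    }
    // Pull from the input until it rests on a matching element or ends.
    private void skip() {
        while (!input.eos() && !test.test(input.current())) input.next();
    }
    public void open() { input.open(); skip(); }
    public String current() { return input.current(); }
    public String next() { input.next(); skip(); return current(); }
    public boolean eos() { return input.eos(); }
}

class PullDemo {
    // Drains a pipeline the way the last consumer in a chain would.
    static List<String> drain(PullIterator it) {
        List<String> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }
}
```

For example, `PullDemo.drain(new Filter(new ListSource(List.of("a", "bb", "c", "dd")), s -> s.length() == 2))` keeps only the two-character strings. Note that `Filter` may pull several input elements to deliver one output element, which is the behavior described in the last bullet.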

4 What is a Tuple?
A vector of components:
– one component for each scoped for-variable
– has a fixed size at each point in a pipeline (known at compile time)
– doesn't need to include the variable names
A tuple component is the unit of communication between iterators
Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
– initially, we would have to pass the entire document as a tree!
The unit of communication should be
– a single event or
– a fragment (a reference to an XML element in a document); this requires a structural index for fragments
Proposals for pull parsers:
– XML Pull Parser 3, www.xmlpull.org
– BEA/XQRL token stream & token iterators

5 Event-Oriented Approach
A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event
Single-node event sequence
– depth-first unfolding of a single XML node
[Figure: a tuple with 3 components]
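A small sketch may help make the depth-first unfolding concrete. Everything here is assumed for illustration: `XNode` and `Unfold` are hypothetical names, events are rendered as strings, and a tuple is taken to be the concatenation of its components' event sequences followed by a single EOT marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node: an element (tag != null) with children,
// or a text leaf (tag == null).
class XNode {
    final String tag;
    final String text;
    final List<XNode> children = new ArrayList<>();
    XNode(String tag, String text) { this.tag = tag; this.text = text; }
    static XNode elem(String tag, XNode... kids) {
        XNode n = new XNode(tag, null);
        for (XNode k : kids) n.children.add(k);
        return n;
    }
    static XNode text(String s) { return new XNode(null, s); }
}

class Unfold {
    // Depth-first unfolding of one node into its event sequence.
    static void unfold(XNode n, List<String> out) {
        if (n.tag == null) {
            out.add("Text(" + n.text + ")");
            return;
        }
        out.add("Start(" + n.tag + ")");
        for (XNode c : n.children) unfold(c, out);
        out.add("End(" + n.tag + ")");
    }

    // Assumed layout: one event sequence per component, closed by EOT.
    static List<String> tuple(XNode... components) {
        List<String> out = new ArrayList<>();
        for (XNode c : components) unfold(c, out);
        out.add("EOT");
        return out;
    }
}
```

Calling `Unfold.tuple(XNode.elem("a", XNode.text("x")), XNode.elem("b"))` yields the events for `<a>x</a>`, then those for `<b/>`, then the EOT marker.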

6 Element vs Event Granularity
Stream unit is a single event:

    abstract class Event {}
    class Start extends Event { String tag; }
    class End extends Event { String tag; }
    class Text extends Event { String text; }
    class EOT extends Event {}

    class Child extends Iterator {
        Iterator input;
        String tagname;
        boolean keep = false;   // are we inside a matching child?
        int nest = 0;           // current nesting depth
        Event next () {
            while (!input.eos()) {
                current = input.current();
                if (current instanceof Start) {
                    // a Start seen at depth 1 opens a child of the context node
                    if (nest++ == 1)
                        keep = ((Start) current).tag.equals(tagname);
                } else if (current instanceof End)
                    // an End seen at depth 1 closes the context node itself
                    if (nest-- == 1)
                        keep = false;
                input.next();
                if (keep)
                    return current;
            }
            return null;   // end of stream
        }
    }

Stream unit is a DOM-like element:

    abstract class Element {}
    class Node extends Element { String tag; Element[] sequence; }
    class Text extends Element { String text; }
    class Tuple { Element[] components; }

    class Child extends Iterator {
        Iterator input;
        String tagname;
        int index = 0;   // position of the next child to inspect
        Tuple next () {
            while (!input.eos()) {
                if (input.current().get(0) instanceof Node) {
                    Node ce = (Node) input.current().get(0);
                    if (index < ce.sequence.length)
                        if (ce.sequence[index] instanceof Node
                            && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                            current = new Tuple(ce.sequence[index++]);
                            return current;
                        } else index++;
                    else { index = 0; input.next(); }   // all children seen: move to the next input tuple
                } else { index = 0; input.next(); }     // not an element: skip it
            }
            return null;   // end of stream
        }
    }

7 For-Loop using Iterators
Need a stepper for a for-loop:
[Figure: Loop with inputs left and right; the right pipeline starts at a Step (right_step), which Loop feeds through set]

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
        Tuple next () {
            if (!left.eos()) {
                // right is exhausted: advance left and restart right on the new left tuple
                while (right.eos()) {
                    left.next();
                    right_step.set(left.current());
                    right.open();
                }
                current = left.current().append(right.current());
                right.next();
                return current;
            }
            return null;   // left stream exhausted
        }
    }

Not a good idea if right reads a document!
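The cost of restarting the right pipeline can be seen in a compact sketch. It swaps the slide's Iterator and Step classes for `java.util.Iterator` and a factory function that plays the role of "set the left tuple, then open right"; `NestedLoop` and `collect` are hypothetical names, and strings stand in for tuples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Pull-based for-loop: for every left tuple, the right pipeline is
// re-opened and drained before the next left tuple is requested.
class NestedLoop implements Iterator<String> {
    private final Iterator<String> left;
    // Stands in for right_step.set(t) followed by right.open().
    private final Function<String, Iterator<String>> openRight;
    private String leftTuple;
    private Iterator<String> right = Collections.emptyIterator();

    NestedLoop(Iterator<String> left, Function<String, Iterator<String>> openRight) {
        this.left = left;
        this.openRight = openRight;
    }

    @Override public boolean hasNext() {
        // Restart the right side until it yields something or left runs out.
        while (!right.hasNext() && left.hasNext()) {
            leftTuple = left.next();
            right = openRight.apply(leftTuple);
        }
        return right.hasNext();
    }

    @Override public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return leftTuple + "," + right.next();   // append right to left
    }

    static List<String> collect(Iterator<String> left,
                                Function<String, Iterator<String>> openRight) {
        List<String> out = new ArrayList<>();
        new NestedLoop(left, openRight).forEachRemaining(out::add);
        return out;
    }
}
```

`openRight` is invoked once per left tuple, so if opening the right side means parsing a document, that document is parsed once for every tuple of the left stream.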

8 Let-Bindings using Iterators
Let-bindings are harder to implement:
– the let-value may be a sequence
– one producer -- many consumers
– we do not want to materialize the let-value in memory
[Figure: a queue with head and tail; the slowest consumer and the fastest consumer read at different positions, with the backlog between them]
Some cases are hopeless:

    let $v:=e return ($v,$v)
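One way to picture the queue is a shared buffer with one cursor per consumer, where only the elements between the slowest and the fastest consumer stay in memory. The sketch below is an assumption about how such a queue could look, not the author's implementation; `SharedQueue` and its methods are hypothetical names, and at least one consumer is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// One producer, many consumers: an element is dropped from the buffer
// as soon as every consumer has read it.
class SharedQueue<T> {
    private final List<T> buffer = new ArrayList<>(); // elements not yet read by everyone
    private long base = 0;        // stream position of buffer.get(0)
    private final long[] cursor;  // next stream position, per consumer

    SharedQueue(int consumers) { cursor = new long[consumers]; }

    void put(T item) { buffer.add(item); }

    // Next element for this consumer, or null if it has caught up
    // with the producer.
    T take(int consumer) {
        int offset = (int) (cursor[consumer] - base);
        if (offset >= buffer.size()) return null;
        T item = buffer.get(offset);
        cursor[consumer]++;
        trim();
        return item;
    }

    // Discard everything the slowest consumer has already passed.
    private void trim() {
        long slowest = Long.MAX_VALUE;
        for (long c : cursor) slowest = Math.min(slowest, c);
        while (base < slowest && !buffer.isEmpty()) {
            buffer.remove(0);
            base++;
        }
    }

    // Number of elements currently held in memory.
    int backlog() { return buffer.size(); }
}
```

With `($v,$v)`, the first consumer reads the whole sequence before the second one starts, so the backlog grows to the full length of the let-value, which is why the slide calls this case hopeless.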

9 Push-based Pipelines
Unit of communication between pipelines: messages rather than events
Pipeline components are SAX-like event handlers
– they are instances of Operator subclasses:

    abstract class Operator {
        void suspend ();
        void release ();
        void startDocument ( int node );
        void endDocument ( int node );
        Status endTuple ( int node );
        Status startElement ( int node, String tag );
        Status endElement ( int node, String tag );
        Status characters ( int node, String text );
    }

('node' identifies a for-variable)

10 The Child Operator

    class Child extends Operator {
        Operator next;
        String tagname;
        int nest = 0;
        boolean keep = false;
        Status startElement ( int node, String tag ) {
            // a start tag seen at depth 1 opens a child of the context node
            if (nest++ == 1)
                keep = tagname.equals(tag);
            if (keep)
                return next.startElement(node,tag);
            else return invalid;
        }
    }

Example: document("...")/A/*//B
[Figure: pipeline of operators labeled Document, Child "A", Any, Descendant "B", Kick, Print]
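To see a push-based Child step run end to end, the sketch below drives a simplified operator chain from the JDK's SAX parser. It is a sketch under several assumptions: the `Op` interface drops the node ids and Status results of the slide's Operator class, the document's root element plays the role of the context node, the `endElement` logic is a guess (the slide shows only `startElement`), and `Op`, `ChildOp`, `PrintOp`, and `SaxDriver` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified operator interface (no node ids, no Status).
interface Op {
    void startElement(String tag);
    void endElement(String tag);
    void characters(String text);
}

// Forwards only the children of the context node whose tag matches.
class ChildOp implements Op {
    private final String tagname;
    private final Op next;
    private int nest = 0;
    private boolean keep = false;
    ChildOp(String tagname, Op next) { this.tagname = tagname; this.next = next; }
    public void startElement(String tag) {
        if (nest++ == 1) keep = tagname.equals(tag);  // a child of the context node opens
        if (keep) next.startElement(tag);
    }
    public void endElement(String tag) {
        nest--;
        if (keep) next.endElement(tag);
        if (nest == 1) keep = false;                  // that child has just closed
    }
    public void characters(String text) { if (keep) next.characters(text); }
}

// Last operator in the pipeline: serializes whatever reaches it.
class PrintOp implements Op {
    final StringBuilder out = new StringBuilder();
    public void startElement(String tag) { out.append('<').append(tag).append('>'); }
    public void endElement(String tag) { out.append("</").append(tag).append('>'); }
    public void characters(String text) { out.append(text); }
}

// Adapter from SAX callbacks to the first operator of the pipeline.
class SaxDriver extends DefaultHandler {
    private final Op first;
    SaxDriver(Op first) { this.first = first; }
    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        first.startElement(qName);
    }
    @Override public void endElement(String uri, String local, String qName) {
        first.endElement(qName);
    }
    @Override public void characters(char[] ch, int start, int length) {
        first.characters(new String(ch, start, length));
    }

    // Parses xml and returns the serialized children named childTag.
    static String run(String xml, String childTag) {
        PrintOp printer = new PrintOp();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new SaxDriver(new ChildOp(childTag, printer)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return printer.out.toString();
    }
}
```

For the input `<r><a>x</a><b>y</b><a><a>z</a></a></r>`, `SaxDriver.run(xml, "a")` keeps the two `a` children of `r` (including the nested `a` inside the second one) and drops `b`. No event is buffered: each callback is either forwarded immediately or discarded.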

11 For-Loops
One thread per document reader
Need to queue one tuple from the outer stream each time

    for $x in E1, $y in E2 return ...

[Figure: Loop operator with a Queue for $x and output next; E1 is the outer stream (For $x), E2 the inner stream (For $y)]
Loop handles incoming events as follows:
– startElement, endElement, ...: if node=$x, insert the event into Queue; else emit the event to the output (next)
– endTuple: if node=$x, suspend the outer stream and send all events in Queue to E2; else emit all events in Queue to the output (next)
– endDocument: if node=$y, clear Queue & release the outer stream
Not a good idea if E2 reads a document
– the document is read as many times as there are tuples in E1
– but we can cache the output of E2 and push the cached data instead

12 Other Issues
Let-bindings can be easily done using splitters (repeaters)
– no caching is necessary
But... binary concatenation needs to cache the second stream
– so, let $v:=e return ($v,$v) is still hopeless
We don't need to cache path/FLWOR conditionals
– the returned status of the condition events determines the predicate outcome (existential semantics)
– initially, Predicate sends a suspend() event to the next stream and then the input events are propagated as is (to both pred and next)
– if and when the predicate becomes true, the output is released
[Figure: Predicate operator with branches pred (condition, Sink) and next]
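The suspend/release protocol for predicates can be sketched as follows. This is an illustration under assumptions, not the engine's design: the suspended output stage is assumed to hold events in a buffer until released, the condition is reduced to "some event equals a given string" (existential semantics), and `BufferingSink` and `PredicateOp` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical output stage that holds events back while suspended.
class BufferingSink {
    final List<String> output = new ArrayList<>();
    private final List<String> held = new ArrayList<>();
    private boolean suspended = false;

    void suspend() { suspended = true; }

    // Flush everything held so far and let later events pass directly.
    void release() {
        suspended = false;
        output.addAll(held);
        held.clear();
    }

    void event(String e) {
        if (suspended) held.add(e); else output.add(e);
    }

    // End of tuple: events that were never released are dropped.
    void endTuple() {
        held.clear();
        suspended = false;
    }
}

// Hypothetical predicate: true as soon as one event equals `wanted`.
class PredicateOp {
    private final String wanted;
    private final BufferingSink next;
    private boolean started = false;

    PredicateOp(String wanted, BufferingSink next) {
        this.wanted = wanted;
        this.next = next;
    }

    void event(String e) {
        if (!started) {          // first event of a tuple: hold the output
            next.suspend();
            started = true;
        }
        next.event(e);           // events are propagated as is
        if (e.equals(wanted)) next.release();   // predicate became true
    }

    void endTuple() {
        next.endTuple();
        started = false;
    }
}
```

A tuple whose events include the wanted one reaches the output in full, including the events that arrived before the match; a tuple that never satisfies the condition leaves no trace in the output.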

13 So, to Pull or to Push?
For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
– a matter of programming style
– push-based is a bit more difficult to program and harder to debug (threads)
But... if you want to use indexes, pulling is better
For indexing, fragments are a better alternative to events
– fragment = a reference to an element in a document
– a fragment corresponds to a tree node, and you need an index to access descendants
– need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
– examples:
  structural indexes based on region encoding or on preorder/postorder ranks
  IR-style content-based inverted indexes
– see my recent work on XQuery processing with relevance ranking: http://lambda.uta.edu/XQueryRank.pdf
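A merge join over sorted fragments can be sketched with region encoding, where each fragment carries the start and end positions of its element and an element contains another exactly when its region encloses the other's. The code below is a generic stack-based ancestor/descendant merge join, offered as an assumption about what such an operator could look like; `Region` and `StructuralJoin` are hypothetical names, and both inputs are assumed to come from one document, sorted by start position.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical fragment: the region (start, end) of one element.
class Region {
    final int start, end;
    Region(int start, int end) { this.start = start; this.end = end; }
    @Override public String toString() { return start + "-" + end; }
}

class StructuralJoin {
    // Single merge pass: returns "ancestor//descendant" pairs,
    // ordered by descendant, innermost ancestor first.
    static List<String> join(List<Region> anc, List<Region> desc) {
        List<String> out = new ArrayList<>();
        Deque<Region> stack = new ArrayDeque<>();  // ancestors still open, innermost on top
        int a = 0;
        for (Region d : desc) {
            // Push every candidate ancestor that starts before d.
            while (a < anc.size() && anc.get(a).start < d.start) {
                Region next = anc.get(a++);
                // Drop ancestors that closed before this one opens.
                while (!stack.isEmpty() && stack.peek().end < next.start) stack.pop();
                stack.push(next);
            }
            // Drop ancestors that closed before d opens.
            while (!stack.isEmpty() && stack.peek().end < d.start) stack.pop();
            // Regions nest properly, so everything left on the stack encloses d.
            for (Region s : stack) out.add(s + "//" + d);
        }
        return out;
    }
}
```

Each input is scanned once and never rewound, which is why the operator depends on the index delivering fragments in sorted order.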

14 Related Work
Joost: XSLT transformation based on SAX
BEA/XQRL: pull-based XQuery processing
Apache Cocoon: user-constructed pipelines made out of SAX handlers
Many XQuery processors: Galax, Xalan, Qizx, Saxon, ...
Lots of work on XPath/XQuery processing based on transducers

