The Joy of SAX
Leonidas Fegaras, University of Texas at Arlington



Slide 2: Design Goals

Want to build an XQuery engine based entirely on SAX handlers
– all the way from the point where the input documents are read by the SAX parser to the point where the query results are printed

This engine should consist of operators that
– naturally reflect the syntactic structures of XQuery, and
– can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries

The XQuery translation should be concise, clean, and completely compositional

Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much on performance in terms of memory and computational overhead

But ... it should be able to beat transducers for complex predicates and deeply nested queries

Slide 3: Pull-Based Approach

Based on iterators:

    abstract class Iterator {
        abstract Tuple current ();   // current tuple from stream
        abstract void open ();       // open the stream iterator
        abstract Tuple next ();      // get the next tuple from stream
        abstract boolean eos ();     // is this the end of stream?
    }

An iterator reads data from the input stream(s) and delivers data to the output stream

Connected through pipelines
– an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
– to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
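The demand-driven behaviour described above can be illustrated with a minimal sketch. It assumes a simplified protocol in which `open()` positions an iterator on its first element and `next()` advances it; the names `PullIterator`, `ListSource`, `EvenFilter`, and `PullDemo` are hypothetical and not part of the original slides. The `pulls` counter shows that the producer is asked only for as many elements as the consumer needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pull-based protocol, loosely modelled on the slide's Iterator.
abstract class PullIterator<T> {
    protected T current;
    T current() { return current; }
    abstract void open();     // position on the first element, if any
    abstract void next();     // advance to the following element
    abstract boolean eos();   // true once the stream is exhausted
}

// Producer: walks an in-memory list and counts how often it is pulled.
class ListSource<T> extends PullIterator<T> {
    private final List<T> items;
    private int pos;
    int pulls = 0;
    ListSource(List<T> items) { this.items = items; }
    void open() { pos = 0; load(); }
    void next() { pos++; load(); }
    boolean eos() { return pos >= items.size(); }
    private void load() {
        if (!eos()) { current = items.get(pos); pulls++; }
    }
}

// A consumer of its input and a producer for the next operator:
// it pulls as many input elements as it needs to produce one even number.
class EvenFilter extends PullIterator<Integer> {
    private final PullIterator<Integer> input;
    private boolean done;
    EvenFilter(PullIterator<Integer> input) { this.input = input; }
    void open() { input.open(); done = false; advance(); }
    void next() { input.next(); advance(); }
    boolean eos() { return done; }
    private void advance() {
        while (!input.eos() && input.current() % 2 != 0) input.next();
        if (input.eos()) done = true; else current = input.current();
    }
}

class PullDemo {
    // The final consumer asks for two results only.
    static List<Integer> firstTwoEvens(ListSource<Integer> src) {
        EvenFilter f = new EvenFilter(src);
        List<Integer> out = new ArrayList<>();
        for (f.open(); !f.eos() && out.size() < 2; ) {
            out.add(f.current());
            if (out.size() < 2) f.next();
        }
        return out;
    }
    public static void main(String[] args) {
        ListSource<Integer> src = new ListSource<>(List.of(1, 2, 3, 4, 5, 6));
        System.out.println(firstTwoEvens(src) + " after " + src.pulls + " pulls");
    }
}
```

In this sketch the elements 5 and 6 are never read from the source, because nothing downstream requests them.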

Slide 4: What is a Tuple?

A vector of components:
– one component for each scoped for-variable
– has a fixed size at each point in a pipeline (known at compile time)
– doesn't need to include the variable names

A tuple component is the unit of communication between iterators

Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
– initially, we would have to pass the entire document as a tree!

The unit of communication should be
– a single event, or
– a fragment (a reference to an XML element in a document); this requires a structural index for fragments

Proposals for pull parsers:
– XML Pull Parser 3
– BEA/XQRL token stream & token iterators

Slide 5: Event-Oriented Approach

A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event

Single-node event sequence
– depth-first unfolding of a single XML node

[Figure: a tuple with 3 components]
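As a rough illustration of a single-node event sequence, the sketch below uses the standard JAXP SAX parser to unfold one XML node depth-first into a list of events and then appends an EOT marker. The class name `EventUnfolder`, the string rendering of events, and the choice to drop whitespace-only text are assumptions made for this example; the slides do not prescribe a concrete representation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX callbacks as a flat event sequence.
class EventUnfolder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("Start(" + qName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("End(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("Text(" + text + ")");  // assumption: skip blank text
    }

    // Unfolds one XML node into its event sequence, terminated by EOT.
    static List<String> unfold(String xml) {
        EventUnfolder handler = new EventUnfolder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        handler.events.add("EOT");
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(unfold("<a><b>x</b><c/></a>"));
    }
}
```

A tuple with several components would carry one such sequence per component before the EOT event; this sketch covers the single-component case only.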

Slide 6: Element vs Event Granularity

Stream unit is a single event:

    abstract class Event {}
    class Start extends Event { String tag; }
    class End extends Event { String tag; }
    class Text extends Event { String text; }
    class EOT extends Event {}

    class Child extends Iterator {
        Iterator input;
        String tagname;
        boolean keep = false;
        int nest = 0;
        Event next () {
            while (!input.eos()) {
                current = input.current();
                input.next();
                boolean emit = keep;
                if (current instanceof Start) {
                    // a child of the context node starts at nesting depth 1
                    if (nest++ == 1) keep = ((Start) current).tag.equals(tagname);
                    emit = keep;
                } else if (current instanceof End) {
                    // the End of a matching child is still emitted, then keep is reset
                    if (--nest == 1) keep = false;
                }
                if (emit) return current;
            }
            return null;   // end of stream
        }
    }

Stream unit is a DOM-like element:

    abstract class Element {}
    class Node extends Element { String tag; Element[] sequence; }
    class Text extends Element { String text; }
    class Tuple {
        Element[] components;
        Tuple ( Element... c ) { components = c; }
        Element get ( int i ) { return components[i]; }
    }

    class Child extends Iterator {
        Iterator input;
        String tagname;
        int index = 0;   // position among the children of the current input node
        Tuple next () {
            while (!input.eos()) {
                if (input.current().get(0) instanceof Node) {
                    Node ce = (Node) input.current().get(0);
                    if (index < ce.sequence.length) {
                        if (ce.sequence[index] instanceof Node
                            && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                            current = new Tuple(ce.sequence[index++]);
                            return current;
                        } else index++;
                    } else { index = 0; input.next(); }
                } else { index = 0; input.next(); }
            }
            return null;   // end of stream
        }
    }

Slide 7: For-Loop using Iterators

Need a stepper for a for-loop:

[Figure: Loop with a left input and a right pipeline; the right pipeline starts at a Step (right_step), which Loop feeds through set]

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
        Tuple next () {
            while (right.eos()) {              // inner stream exhausted: advance the outer one
                if (left.eos()) return null;   // no outer tuples left
                left.next();
                right_step.set(left.current());
                right.open();
            }
            current = left.current().append(right.current());
            right.next();
            return current;
        }
    }

Not a good idea if right reads a document!
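The cost noted on this slide can be made concrete with a small sketch of a dependent nested loop, here written against `java.util.Iterator` rather than the slide's own iterator class. The names `NestedLoop` and `LoopDemo` and the `opens` counter are hypothetical. The inner stream is re-opened once per outer tuple, which is exactly what becomes expensive when opening it means reading a document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Sketch of: for $x in outer, $y in inner($x) return ($x, $y)
class NestedLoop<X, Y> implements Iterator<List<Object>> {
    private final Iterator<X> outer;
    private final Function<X, Iterator<Y>> inner;  // plays the role of Step + right pipeline
    private X x;
    private Iterator<Y> ys = Collections.emptyIterator();

    NestedLoop(Iterator<X> outer, Function<X, Iterator<Y>> inner) {
        this.outer = outer;
        this.inner = inner;
    }

    public boolean hasNext() {
        // When the inner stream is exhausted, advance the outer one and re-open the inner.
        while (!ys.hasNext() && outer.hasNext()) {
            x = outer.next();
            ys = inner.apply(x);
        }
        return ys.hasNext();
    }

    public List<Object> next() {
        if (!hasNext()) throw new NoSuchElementException();
        return List.of(x, ys.next());  // output tuple: outer component followed by inner one
    }
}

class LoopDemo {
    static int opens = 0;  // how many times the inner stream was opened

    static List<String> run() {
        opens = 0;
        NestedLoop<Integer, String> loop = new NestedLoop<>(
            List.of(1, 2).iterator(),
            x -> { opens++; return List.of("a" + x, "b" + x).iterator(); });
        List<String> out = new ArrayList<>();
        while (loop.hasNext()) out.add(loop.next().toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run() + ", inner opened " + opens + " times");
    }
}
```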

Slide 8: Let-Bindings using Iterators

Let-bindings are harder to implement:
– the let-value may be a sequence
– one producer, many consumers
– we do not want to materialize the let-value in memory

[Figure: a queue with head and tail, read by the slowest consumer and the fastest consumer; the elements between them form the backlog]

Some cases are hopeless: let $v:=e return ($v,$v)
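The backlog problem can be sketched with a shared buffer that keeps only the elements not yet read by every consumer. This is an illustrative model under stated assumptions, not the author's implementation: `SharedStream`, `LetDemo`, and `maxBacklog` are hypothetical names, `null` is used as the end-of-stream marker (so null elements are not supported), and an `ArrayList` stands in for a proper queue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// One producer, several consumers, each reading at its own pace.
class SharedStream<T> {
    private final Iterator<T> producer;
    private final List<T> buffer = new ArrayList<>();  // elements between slowest and fastest consumer
    private int base = 0;          // stream position of buffer.get(0)
    private final int[] cursor;    // next stream position of each consumer
    int maxBacklog = 0;            // largest buffer size observed

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursor = new int[consumers];
    }

    // Next element for consumer c, or null at end of stream.
    T next(int c) {
        int pos = cursor[c];
        if (pos == base + buffer.size()) {      // c is the fastest consumer: pull from the producer
            if (!producer.hasNext()) return null;
            buffer.add(producer.next());
            maxBacklog = Math.max(maxBacklog, buffer.size());
        }
        T item = buffer.get(pos - base);
        cursor[c] = pos + 1;
        trim();
        return item;
    }

    // Drop elements that every consumer has already read.
    private void trim() {
        int slowest = Integer.MAX_VALUE;
        for (int p : cursor) slowest = Math.min(slowest, p);
        while (base < slowest) { buffer.remove(0); base++; }
    }
}

class LetDemo {
    static int backlog(int n, boolean interleaved) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < n; i++) data.add(i);
        SharedStream<Integer> s = new SharedStream<>(data.iterator(), 2);
        if (interleaved) {
            while (s.next(0) != null) s.next(1);   // consumers advance in lockstep
        } else {
            while (s.next(0) != null) { }          // like ($v,$v): first copy is read fully...
            while (s.next(1) != null) { }          // ...before the second copy starts
        }
        return s.maxBacklog;
    }

    public static void main(String[] args) {
        System.out.println("lockstep backlog:   " + backlog(5, true));
        System.out.println("sequential backlog: " + backlog(5, false));
    }
}
```

When the consumers advance together the buffer never holds more than one element, whereas the `($v,$v)` access pattern forces the whole let-value into the buffer.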

Slide 9: Push-based Pipelines

Unit of communication between pipelines: messages rather than events

Pipeline components are SAX-like event handlers
– they are instances of Operator subclasses:

    abstract class Operator {
        abstract void suspend ();
        abstract void release ();
        abstract void startDocument ( int node );
        abstract void endDocument ( int node );
        abstract Status endTuple ( int node );
        abstract Status startElement ( int node, String tag );
        abstract Status endElement ( int node, String tag );
        abstract Status characters ( int node, String text );
    }

('node' identifies a for-variable)

Slide 10: The Child Operator

    class Child extends Operator {
        Operator next;
        String tagname;
        int nest = 0;
        boolean keep = false;
        Status startElement ( int node, String tag ) {
            if (nest++ == 1) keep = tagname.equals(tag);
            if (keep) return next.startElement(node,tag);
            else return invalid;
        }
        // ...
    }

Example: document("...")/A/*//B

[Figure: operator pipeline for the example, with the labels Document, Child "A", Any, Descendant "B", Kick, Print]
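A reduced version of such a push pipeline might look as follows. This sketch deliberately simplifies the slide's `Operator`: it drops the `node` argument and the `Status` result, and it assumes that each child step receives the event sequence of its context node with that node at nesting depth 0. The names `PushOperator`, `ChildStep`, `Collector`, and `PushDemo`, and the token-based `feed` driver, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the slide's Operator (no node ids, no Status).
abstract class PushOperator {
    abstract void startElement(String tag);
    abstract void endElement(String tag);
    abstract void characters(String text);
}

// Terminal operator: records whatever reaches the end of the pipeline.
class Collector extends PushOperator {
    final List<String> out = new ArrayList<>();
    void startElement(String tag) { out.add("<" + tag + ">"); }
    void endElement(String tag) { out.add("</" + tag + ">"); }
    void characters(String text) { out.add(text); }
}

// Forwards only the subtrees rooted at children (depth 1) named tagname.
class ChildStep extends PushOperator {
    private final PushOperator next;
    private final String tagname;
    private int nest = 0;
    private boolean keep = false;

    ChildStep(String tagname, PushOperator next) {
        this.tagname = tagname;
        this.next = next;
    }
    void startElement(String tag) {
        if (nest++ == 1) keep = tagname.equals(tag);
        if (keep) next.startElement(tag);
    }
    void endElement(String tag) {
        if (keep) next.endElement(tag);
        if (--nest == 1) keep = false;   // the matching child has just been closed
    }
    void characters(String text) {
        if (keep) next.characters(text);
    }
}

class PushDemo {
    // Pushes a hand-written token sequence into the pipeline.
    static void feed(PushOperator op, String... tokens) {
        for (String t : tokens) {
            if (t.startsWith("</")) op.endElement(t.substring(2, t.length() - 1));
            else if (t.startsWith("<")) op.startElement(t.substring(1, t.length() - 1));
            else op.characters(t);
        }
    }

    // Evaluates the path A/B relative to the root element r.
    static List<String> run() {
        Collector sink = new Collector();
        PushOperator pipeline = new ChildStep("A", new ChildStep("B", sink));
        feed(pipeline, "<r>", "<A>", "<B>", "1", "</B>", "<C>", "2", "</C>", "</A>",
                       "<A>", "<B>", "3", "</B>", "</A>", "</r>");
        return sink.out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Composing two `ChildStep` instances mirrors how the two path steps are composed in the query, which is the compositional property the design goals ask for.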

Slide 11: For-Loops

One thread per document reader

Need to queue one tuple from the outer stream each time

for $x in E1, $y in E2 return ...

[Figure: E1 (outer, For $x) and E2 (inner, For $y) connected to Loop $x, which holds a Queue and emits to next]

Loop $x handles events as follows:
– startElement, endElement, ...: if node=$x, insert the event into Queue; else emit the event to the output (next)
– endTuple: if node=$x, suspend the outer stream and send all events in Queue to E2; else emit all events in Queue to the output (next)
– endDocument: if node=$y, clear Queue and release the outer stream

Not a good idea if E2 reads a document
– the document is read as many times as there are tuples in E1
– but we can cache the output of E2 and push the cached data instead

Slide 12: Other Issues

Let-bindings can be easily done using splitters (repeaters)
– no caching is necessary

But ... binary concatenation needs to cache the second stream
– so, let $v:=e return ($v,$v) is still hopeless

We don't need to cache path/FLWOR conditionals
– the returned status of the condition events determines the predicate outcome (existential semantics)
– initially, Predicate sends a suspend() event to the next stream, and then the input events are propagated as is (to both pred and next)
– if and when the predicate becomes true, the output is released

[Figure: Predicate operator with the labels pred, condition, Sink, and next]
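The suspend/release protocol for predicates can be sketched as below. This is a simplification under several assumptions: the condition pipeline (pred) is reduced to a test for one specific event, tuple boundaries are signalled by explicit `startTuple`/`endTuple` calls, and a `discard` step (not named on the slide) drops the pending events of a tuple whose predicate never became true. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Downstream operator that holds events back while it is suspended.
class BufferingSink {
    private final List<String> pending = new ArrayList<>();
    final List<String> output = new ArrayList<>();
    private boolean suspended = false;

    void suspend() { suspended = true; }
    void release() { suspended = false; output.addAll(pending); pending.clear(); }
    void discard() { suspended = false; pending.clear(); }   // assumption: not on the slide
    void event(String e) { if (suspended) pending.add(e); else output.add(e); }
}

// Existential predicate: satisfied as soon as any event of the tuple equals `wanted`.
class ContainsPredicate {
    private final String wanted;
    private final BufferingSink next;
    private boolean satisfied;

    ContainsPredicate(String wanted, BufferingSink next) {
        this.wanted = wanted;
        this.next = next;
    }
    void startTuple() { satisfied = false; next.suspend(); }
    void event(String e) {
        next.event(e);                       // events are propagated as is
        if (!satisfied && e.equals(wanted)) {
            satisfied = true;
            next.release();                  // the outcome is known: stop holding output back
        }
    }
    void endTuple() { if (!satisfied) next.discard(); }
}

class PredicateDemo {
    static List<String> run() {
        BufferingSink sink = new BufferingSink();
        ContainsPredicate p = new ContainsPredicate("<price>", sink);
        String[][] tuples = {
            {"<book>", "<price>", "9", "</price>", "</book>"},
            {"<book>", "<title>", "t", "</title>", "</book>"}
        };
        for (String[] tuple : tuples) {
            p.startTuple();
            for (String e : tuple) p.event(e);
            p.endTuple();
        }
        return sink.output;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch, once the predicate is satisfied the remaining events of the tuple pass straight through, so only the events seen before that point are ever held.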

Slide 13: So, to Pull or to Push?

For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
– a matter of programming style
– push-based is a bit more difficult to program and harder to debug (threads)

But ... if you want to use indexes, pulling is better

For indexing, fragments are a better alternative to events
– fragment = a reference to an element in a document
– a fragment corresponds to a tree node, and you need an index to access descendants
– need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
– examples:
    structural indexes based on region encoding or on preorder/postorder ranks
    IR-style content-based inverted indexes
– see my recent work on XQuery processing with relevance ranking
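To illustrate the region-encoding idea, the sketch below assigns each element a (begin, end) pair from a single counter that is incremented at every start and end tag. Under this numbering, one element is an ancestor of another exactly when its region strictly contains the other's. The names `Fragment` and `RegionIndex` and the token-based input are assumptions made for this example, not the author's index structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical fragment: a reference to an element, identified by its region.
final class Fragment {
    final String tag;
    final int begin;
    int end = -1;   // filled in when the end tag is seen

    Fragment(String tag, int begin) { this.tag = tag; this.begin = begin; }

    // Region containment: this element encloses d.
    boolean isAncestorOf(Fragment d) { return begin < d.begin && d.end < end; }
}

class RegionIndex {
    // Builds the fragment list from start/end tag tokens; other tokens are ignored.
    static List<Fragment> build(String... tokens) {
        List<Fragment> result = new ArrayList<>();
        Deque<Fragment> open = new ArrayDeque<>();
        int counter = 0;
        for (String t : tokens) {
            if (t.startsWith("</")) {
                open.pop().end = counter++;
            } else if (t.startsWith("<")) {
                Fragment f = new Fragment(t.substring(1, t.length() - 1), counter++);
                open.push(f);
                result.add(f);   // appended in document order, so the list is sorted by begin
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> f = build("<a>", "<b>", "<c>", "</c>", "</b>", "<d>", "</d>", "</a>");
        for (Fragment x : f) System.out.println(x.tag + " [" + x.begin + "," + x.end + "]");
        System.out.println("a ancestor of c: " + f.get(0).isAncestorOf(f.get(2)));
        System.out.println("b ancestor of d: " + f.get(1).isAncestorOf(f.get(3)));
    }
}
```

Because fragments come out sorted by their begin position, two such lists can be combined with a merge-style scan, which is the property the slide asks indexes to guarantee.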

Slide 14: Related Work

Joost: XSLT transformation based on SAX

BEA/XQRL: pull-based XQuery processing

Apache Cocoon: user-constructed pipelines made out of SAX handlers

Many XQuery processors: Galax, Xalan, Qizx, Saxon, ...

Lots of work on XPath/XQuery processing based on transducers