Efficient XML Stream Processing with Automata and Query Algebra A Master Thesis Presentation Student: Jinhui Jian Advisor: Prof. Elke A. Rundensteiner Reader: Prof. Kathi Fisler

Presentation transcript:

1 Efficient XML Stream Processing with Automata and Query Algebra A Master Thesis Presentation Student: Jinhui Jian Advisor: Prof. Elke A. Rundensteiner Reader: Prof. Kathi Fisler

2 The Need for XML Stream Processing XML Relational HTMLnews Internet XML data streams XML Stream Processing Engine New paradigms  Distributed data provider  Distributed data consumer New applications  Monitoring (e.g., sensor network)  Information Filtering (e.g., news, ) New challenges  Arbitrarily nested structure  Incomplete knowledge

3 Two Existing Approaches Automata-based [xfilter01, yfilter02, x-scan01,…] Algebraic [tukwila01, rainbow02, …] This thesis aims to integrate both existing approaches into one system

4 A Running Example Give me book titles whose price is greater than 50: FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title TCP/IP Illustrated Stevens W. Addison-Wesley Data on the Web Abiteboul Serge Buneman Peter Suciu Dan Morgan Kaufmann Publishers Advanced Programming in the Unix environment Stevens W. Addison-Wesley TCP/IP Illustrated Advanced Programming in the Unix environment

5 XML as a Stream of Tokens timeline TCP/IP Illustrated Stevens … … Input XML stream bib book title author last first publisher price Text A token can be: an open tag, a close tag, or PCDATA
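The token view on this slide can be sketched with Python's standard SAX parser. This is only an illustration, not the thesis implementation (which the setup slide says uses the Xerces SAX parser from Java); the names `TokenCollector`, `tokenize` and `SAMPLE`, and the price value, are assumptions made for the example.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into a flat list of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(("open", name))      # an open tag

    def endElement(self, name):
        self.tokens.append(("close", name))     # a close tag

    def characters(self, content):
        text = content.strip()
        if text:                                # skip whitespace-only text
            self.tokens.append(("pcdata", text))


def tokenize(xml_text):
    """Parse a small XML string and return its token sequence."""
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


# Hypothetical sample; the price value is assumed for illustration.
SAMPLE = ("<bib><book><title>TCP/IP Illustrated</title>"
          "<price>65.95</price></book></bib>")

if __name__ == "__main__":
    for token in tokenize(SAMPLE):
        print(token)
```

A SAX parser may deliver long text in several `characters` calls, so a production tokenizer would concatenate adjacent PCDATA chunks; the sketch ignores this for brevity.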

6 Basic State-Transition Model TCP/IP Illustrated … 120 book ε 3 price * input active states011,211 1,3…… stack[0] [1] [0] [1] [1,2] [0] [1] [1,2] [0] [1] [0] [1] [1,2] …… Q := //book/price FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
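A minimal sketch of the stack-based state-transition model for Q := //book/price, using the state numbers shown on the slide (0, ε to 1, wildcard self-loop on 1, book to 2, price to 3). The table names, the `run` function and the token list are assumptions for illustration; an open tag pushes the next set of active states and a close tag pops it.

```python
# Automaton for //book/price, numbered as on the slide:
# 0 --epsilon--> 1, 1 --*--> 1 (self-loop), 1 --book--> 2, 2 --price--> 3.
EPSILON = {0: {1}}
WILDCARD = {1: {1}}
NAMED = {(1, "book"): 2, (2, "price"): 3}
FINAL = {3}


def closure(states):
    """Add every state reachable through epsilon transitions."""
    result = set(states)
    work = list(states)
    while work:
        state = work.pop()
        for target in EPSILON.get(state, ()):
            if target not in result:
                result.add(target)
                work.append(target)
    return result


def step(states, tag):
    """Active states after reading the open tag `tag`."""
    nxt = set()
    for state in closure(states):
        nxt |= WILDCARD.get(state, set())
        if (state, tag) in NAMED:
            nxt.add(NAMED[(state, tag)])
    return nxt


def run(tokens):
    """Return (number of //book/price matches, trace of active states)."""
    stack = [{0}]
    matches = 0
    trace = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(step(stack[-1], value))   # push on open tag
            if stack[-1] & FINAL:
                matches += 1
        elif kind == "close":
            stack.pop()                            # backtrack on close tag
        trace.append(sorted(stack[-1]))
    return matches, trace


# Hypothetical token stream for one book; the price value is assumed.
TOKENS = [
    ("open", "bib"), ("open", "book"),
    ("open", "title"), ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
    ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
    ("close", "book"), ("close", "bib"),
]

if __name__ == "__main__":
    print(run(TOKENS))
```

Keeping the active state sets on a stack means a close tag restores the automaton to where it was before the matching open tag, without recomputing anything.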

7 Extended with Data Buffer and Buffer Operations FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title Data-driven  Token at a time  Fixed order 1. eval pred and set/clear flag 2. output if buffer not empty 120 book ε 3 title * 4 price 1. write buffer 2. output if flag is set bufferflag * *
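The buffer and flag actions listed on this slide can be sketched as a single pass over the token stream. The function name `evaluate`, the helper `book` and the sample prices are assumptions for illustration; the comments map each step to the actions named on the slide, and the reset at the end of each book is an added assumption.

```python
def evaluate(tokens, threshold=50.0):
    """Token-at-a-time evaluation of:
    FOR $b in //book WHERE $b/price > 50 RETURN $b/title
    """
    results = []
    path = []        # names of the currently open elements
    buffer = []      # titles seen in the current book, not yet output
    flag = False     # True once the price predicate holds for this book
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "pcdata":
            if path[-2:] == ["book", "title"]:
                buffer.append(value)               # 1. write buffer
                if flag:                           # 2. output if flag is set
                    results.extend(buffer)
                    buffer.clear()
            elif path[-2:] == ["book", "price"]:
                flag = float(value) > threshold    # 1. eval pred, set/clear flag
                if flag and buffer:                # 2. output if buffer not empty
                    results.extend(buffer)
                    buffer.clear()
        elif kind == "close":
            path.pop()
            if value == "book":                    # assumed reset per book
                buffer.clear()
                flag = False
    return results


def book(title, price, price_first=False):
    """Build the token list for one hypothetical book element."""
    t = [("open", "title"), ("pcdata", title), ("close", "title")]
    p = [("open", "price"), ("pcdata", price), ("close", "price")]
    body = p + t if price_first else t + p
    return [("open", "book")] + body + [("close", "book")]


# Prices are assumed values; the transcript does not preserve them.
TOKENS = (
    [("open", "bib")]
    + book("TCP/IP Illustrated", "65.95")
    + book("Data on the Web", "39.95")
    + book("Advanced Programming in the Unix environment", "65.95",
           price_first=True)
    + [("close", "bib")]
)

if __name__ == "__main__":
    print(evaluate(TOKENS))
```

The third book puts the price before the title, which exercises the "output if flag is set" branch; the other two exercise "output if buffer not empty". The order of evaluation is dictated by the data, which is the fixed-order limitation the slide points out.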

8 Algebraic Query Plan FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title Set at a time Postponed operation Extract //book Navigate //book, price Select price > 50 Tagger Navigate //book, title
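For contrast, the set-at-a-time plan on this slide can be sketched as a chain of operators over tuples. The operator names follow the slide (Extract, Navigate, Select, Tagger), but their Python signatures, the use of `xml.etree.ElementTree`, and the sample document with its prices are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book document; prices are assumed values.
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title><price>65.95</price></book>
  <book><title>Data on the Web</title><price>39.95</price></book>
</bib>"""


def extract(root, tag, var):
    """Extract //tag: one tuple per matching element, bound to `var`."""
    return [{var: element} for element in root.iter(tag)]


def navigate(tuples, source, child, target):
    """Navigate source, child: bind `target` to the text of source/child."""
    return [{**t, target: t[source].findtext(child)} for t in tuples]


def select(tuples, predicate):
    """Select: keep only the tuples that satisfy the predicate."""
    return [t for t in tuples if predicate(t)]


def tagger(tuples, column, tag):
    """Tagger: rebuild XML tags around the chosen column."""
    return ["<%s>%s</%s>" % (tag, t[column], tag) for t in tuples]


def run_plan(xml_text):
    root = ET.fromstring(xml_text)
    books = extract(root, "book", "$b")
    priced = navigate(books, "$b", "price", "$p")
    selected = select(priced, lambda t: float(t["$p"]) > 50)
    # The title navigation is postponed until after the selection.
    titled = navigate(selected, "$b", "title", "$t")
    return tagger(titled, "$t", "title")


if __name__ == "__main__":
    print(run_plan(BIB))
```

Because each operator consumes and produces a whole set of tuples, the title navigation can be placed after the selection and is then applied only to the books that passed it.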

9 Exploit the Flexibility of Postponed Operations FOR $b in doc (bib.xml) //book WHERE $b/price > 50 and $b/author/last = “Stevens” RETURN $b/title Extract //book Navigate //book, price Select price > 50 Tagger Navigate //book, author/last Select last = “Stevens” Navigate //book, title

10 Query Optimization in Algebraic Systems Logical optimization  Selection pushdown  Projection pushdown  Join order selection Physical optimization  Operator algorithms Runtime optimization  Scheduling  Resource allocation

11 Thesis Overview Motivation  The Automata model is good for on-the-fly pattern matching/retrieval  The Algebraic model is good for optimizing complex queries Major challenges  How to integrate the two models?  How to optimize a query within the integrated query model?

12 The Raindrop Approach Integration Optimization

13 Path Bindings in XQuery FOR $b in doc (bib.xml) //book WHERE $b/price > 50 and RETURN $b/title FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t FLWR expression: FOR…LET...WHERE…RETURN… Path bindings; Filtering and restructuring “The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]

14 A Two-Tier System Architecture Automata plan Master plan Tuple stream XML data stream Query answer

15 Modeling the Master Plan: Algebraic Navigate //book, price Select price > 50 Tagger Navigate //book, author/last Select last = … Navigate //book, title

16 Modeling the Automata Plan: Black Box vs. White Box Automata Plan Q1 := //book Q2 := //book/price Q3 := //book/title SJoin //book Extract //book/price Extract //book/title

17 How to optimize it? Automata plan Master plan Tuple stream XML data stream Query answer

18 Optimization: A Unified Process in the Logical View 0 1 Extract //book ε * Navigate //book, //book/price 2 book Select //book/price > 50 Navigate //book, //book/title Extract //book Navigate //book, price Select price > 50 Navigate //book, title Automata Plan Master Plan cBa Cba $c$b$a

19 The Algebra Core
Op: Semantic
Selection: Filter tuples based on the predicate pred
Projection: Filter columns in the input tuples based on the variable list v
Join: Join input tuples based on the predicate pred
Aggregate: Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger: Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate: Take input elements of path p1 and output ancestor elements of path p2
Extract: Identify elements of path p from the input stream
Structural Join: Join input tuples on their structural relationship, e.g., the common parent relationship p

20 The Extract Operator 120 book ε * Extract //book/title TCP/IP Illustrated … … 1 title TCP/IP Illustrated Data on the Web Advanced Programming in the Unix environment

21 The Structural Join Operator 120 book ε 3 title * 4 price Extract //book/title Extract //book/price SJoin //book FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t … TCP/IP Illustrated … … …
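A rough sketch of how two Extract operators and an SJoin on //book could cooperate for the LET query on this slide. It assumes that each Extract buffers its matches and that the join fires when the enclosing book closes; this reading of the figure, the function names and the sample prices are assumptions, not the thesis implementation.

```python
def sjoin_stream(tokens):
    """Pair each title with each price that shares the same book element."""
    path = []
    titles, prices = [], []   # buffers of Extract //book/title and //book/price
    tuples = []
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "pcdata":
            if path[-2:] == ["book", "title"]:
                titles.append(value)
            elif path[-2:] == ["book", "price"]:
                prices.append(value)
        elif kind == "close":
            path.pop()
            if value == "book":
                # SJoin //book: combine matches with the same book ancestor.
                tuples.extend((t, p) for t in titles for p in prices)
                titles, prices = [], []
    return tuples


def answer(tokens, threshold=50.0):
    """WHERE $p > 50 RETURN $t, applied to the joined ($t, $p) tuples."""
    return [t for t, p in sjoin_stream(tokens) if float(p) > threshold]


# Hypothetical token stream; prices are assumed values.
TOKENS = [
    ("open", "bib"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
    ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
    ("close", "book"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "Data on the Web"), ("close", "title"),
    ("open", "price"), ("pcdata", "39.95"), ("close", "price"),
    ("close", "book"),
    ("close", "bib"),
]

if __name__ == "__main__":
    print(sjoin_stream(TOKENS))
    print(answer(TOKENS))
```

Joining at the close tag of book means the tuples never mix titles and prices from different books, and no value comparison is needed to establish the relationship.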

22 The Navigate Operator TCP/IP Illustrated Stevens W. Addison-Wesley … … … … … … … … … Navigate //book, title A navigate operation can be postponed, independent of the input stream

23 A Special Optimization: In or Out? Automata plan Master plan Tuple stream XML data stream Query answer

24 Two Options: Bottom-up vs. Top-down … TCP/IP Illustrated Stevens W. Addison-Wesley …

25 Exploiting the Options for Optimization 0 1 Extract //book ε * Navigate //book, price 2 book Select price > 50 Navigate //book, title The pull-out plan Extract //book/price title price Extract //book/title ε * SJoin //book 2 book Select //book/price > 50 The push-in plan Tagger

26 Query Optimization by Rewriting Rules Algebraic transformation: Navigate push-in, Redundant SJoin, Redundant Extract, Selection Pushdown, etc.

27 Runtime Optimization: Why? Optimization relies on cost estimation, which in turn relies on statistics: statistics unknown; statistics change. Extract //book Navigate //book, price Select price > 50 Navigate //book, title Tagger

28 Runtime Optimization Steps Stat Collection Decision Making Plan Migration
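The first two steps (statistics collection and decision making) can be sketched as follows. Everything here is hypothetical: the sliding-window selectivity statistic, the 0.5 cutoff and the rule "low selectivity favors pull-out" are toy assumptions chosen for illustration, not the cost model used in the thesis.

```python
from collections import deque


class SelectivityMonitor:
    """Stat collection: fraction of recent books that passed the predicate."""

    def __init__(self, window=4):
        self.samples = deque(maxlen=window)   # sliding window of 0/1 outcomes

    def record(self, passed):
        self.samples.append(1 if passed else 0)

    def selectivity(self):
        if not self.samples:
            return None                       # no statistics yet
        return sum(self.samples) / len(self.samples)


def choose_plan(current_plan, selectivity, cutoff=0.5):
    """Decision making with an assumed toy rule: if few books pass the
    predicate, postpone title navigation (pull-out); otherwise let the
    automaton extract titles directly (push-in)."""
    if selectivity is None:
        return current_plan
    return "push-in" if selectivity >= cutoff else "pull-out"


def optimize(current_plan, monitor):
    """Return (plan to use next, whether a plan migration is needed)."""
    wanted = choose_plan(current_plan, monitor.selectivity())
    return wanted, wanted != current_plan


if __name__ == "__main__":
    monitor = SelectivityMonitor()
    for passed in (True, False, False, False):
        monitor.record(passed)
    print(monitor.selectivity(), optimize("push-in", monitor))
```

The third step, plan migration, is only signalled here by the boolean; when the switch may actually happen is a separate question.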

29 Why Do We Need Migration? When to interrupt the executor: Master plan; Automata plan. Normal execution Prepare for migration Decision making Plan modification Legend: Executor, Optimizer Optimization cycle The migration process

30 Modifying the Automata: A Bad Example 0 1 Extract //book ε * Navigate //book, //book/price 2 book Select //book/price > 50 Navigate //book, //book/title Extract //book/price title price Extract //book/title ε * SJoin //book 2 book Select //book/price > 50 TCP/IP Illustrated … ……

31 Modifying the Automata: A Safe Approach … … … … … Safe point Unsafe point 0 1 ε * 2 book title price ε * 2 book FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
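One way to picture the safe approach is to queue a requested plan change and apply it only at a safe point. The sketch below assumes that a safe point is any moment when no book element is open, which is an interpretation of the figure rather than a definition taken from the thesis; the class and method names are hypothetical.

```python
class MigratingExecutor:
    """Defers a requested plan change until no book element is open."""

    def __init__(self, plan):
        self.plan = plan
        self.pending = None      # plan waiting for the next safe point
        self.open_books = 0      # number of currently open book elements
        self.log = []            # (kind, value, plan that handled the token)

    def request_migration(self, new_plan):
        self.pending = new_plan

    def at_safe_point(self):
        # Assumed definition: safe when no book is partially processed.
        return self.open_books == 0

    def on_token(self, kind, value):
        if kind == "open" and value == "book":
            self.open_books += 1
        elif kind == "close" and value == "book":
            self.open_books -= 1
        self.log.append((kind, value, self.plan))
        if self.pending is not None and self.at_safe_point():
            self.plan, self.pending = self.pending, None


if __name__ == "__main__":
    ex = MigratingExecutor("pull-out")
    for token in [("open", "bib"), ("open", "book"), ("open", "title")]:
        ex.on_token(*token)
    ex.request_migration("push-in")      # arrives in the middle of a book
    for token in [("pcdata", "TCP/IP Illustrated"), ("close", "title"),
                  ("close", "book"), ("open", "book")]:
        ex.on_token(*token)
    print(ex.log)
```

Every token of the first book is handled by the old plan, and the new plan takes over from the next book onward, so no book is processed half by one automaton and half by another.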

32 Experimental Study Is it feasible to integrate the automata model and the algebraic model? Is push-in vs. pull-out a feasible optimization? Is runtime optimization worthwhile?

33 Experimental Setup Java 1.4 Pentium III-750MHz, 384MB Windows XP Professional Third-party components: Xerces SAX parser, the Kweelt XQuery parser, Rainbow core

34 Exp1: System Throughput

35 Exp2: Push-in vs. Pull-out

36 Exp3: Runtime Optimization

37 Related work Automata-based XML processing: XFilter, YFilter, X-Scan, XTrie, XPush, … Algebraic XQuery engines: XPeranto, LegoDB, Rainbow, Timber, … Runtime optimization: Tukwila, TelegraphCQ, …

38 Contribution While much recent XML stream work (e.g., in SIGMOD03) processes XPath queries, we are among the first to deal with XQuery. We are the first to consider the flexible automata and query algebra integration problem. Push-in vs. pull-out optimization techniques. Prototype system. Experimental study.

39 Conclusion Combining automata and query algebra results in a very powerful query model for XML stream processing. Special optimization techniques (e.g., push-in vs. pull-out) can be applied in the integrated system. Data statistics collected at runtime can be exploited via runtime optimization techniques.

40 Thanks to: Prof. Elke A. Rundensteiner Prof. Kathi Fisler The Raindrop/Rainbow team All DSRG members

41 Questions?