Efficient XML Stream Processing with Automata and Query Algebra: A Master Thesis Presentation. Student: Jinhui Jian. Advisor: Prof. Elke A. Rundensteiner.

Presentation transcript:

1 Efficient XML Stream Processing with Automata and Query Algebra
A Master Thesis Presentation
Student: Jinhui Jian
Advisor: Prof. Elke A. Rundensteiner
Reader: Prof. Kathi Fisler

2 The Need for XML Stream Processing
[Figure: sources labeled XML, Relational, HTML, and news, connected through the Internet as XML data streams to an XML Stream Processing Engine.]
New paradigms
- Distributed data provider
- Distributed data consumer
New applications
- Monitoring (e.g., sensor network)
- Information filtering (e.g., news, email)
New challenges
- Arbitrarily nested structure
- Incomplete knowledge

3 Two Existing Approaches
Automata-based [xfilter01, yfilter02, x-scan01, …]
Algebraic [tukwila01, rainbow02, …]
This thesis integrates both existing approaches into one system.

4 A Running Example
Give me book titles whose price is greater than 50:
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
Input books in bib.xml:
- title: TCP/IP Illustrated; author: Stevens W.; publisher: Addison-Wesley; price: 65.95
- title: Data on the Web; authors: Abiteboul Serge, Buneman Peter, Suciu Dan; publisher: Morgan Kaufmann Publishers; price: 39.95
- title: Advanced Programming in the Unix environment; author: Stevens W.; publisher: Addison-Wesley; price: 65.95
Result titles: TCP/IP Illustrated; Advanced Programming in the Unix environment
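As a reference point for the running example, the query can be evaluated over a fully loaded document. This is only a sketch: the `BIB` string is a hypothetical miniature of bib.xml (authors and publishers omitted), `titles_over` is an illustrative name, and loading the whole document is exactly what a stream engine avoids.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of bib.xml; element names follow the slides.
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title><price>65.95</price></book>
  <book><title>Data on the Web</title><price>39.95</price></book>
  <book><title>Advanced Programming in the Unix environment</title><price>65.95</price></book>
</bib>"""

def titles_over(xml_text, limit):
    """Return titles of books whose price exceeds limit (non-streaming)."""
    root = ET.fromstring(xml_text)
    result = []
    for book in root.iter("book"):            # FOR $b in //book
        price = book.findtext("price")        # $b/price
        if price is not None and float(price) > limit:   # WHERE $b/price > limit
            result.append(book.findtext("title"))        # RETURN $b/title
    return result

print(titles_over(BIB, 50))
```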

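5 XML as a Stream of Tokens
[Figure: the input XML stream laid out as tokens along a timeline, starting with the TCP/IP Illustrated book by Stevens; the element names shown are bib, book, title, author, last, first, publisher, and price, plus text.]
A token can be:
- An open tag
- A close tag
- PCDATA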
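The three token kinds map directly onto SAX callbacks. The sketch below uses Python's standard `xml.sax` module rather than the Xerces parser used in the thesis; `TokenCollector` and `tokenize` are illustrative names, and whitespace-only text is dropped for readability.

```python
import xml.sax

class TokenCollector(xml.sax.ContentHandler):
    """Turn SAX events into (kind, value) tokens: open, close, or text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(("open", name))

    def endElement(self, name):
        self.tokens.append(("close", name))

    def characters(self, content):
        # SAX may deliver long text in several chunks; each chunk becomes a token.
        if content.strip():
            self.tokens.append(("text", content.strip()))

def tokenize(xml_text):
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

for token in tokenize("<book><title>TCP/IP Illustrated</title></book>"):
    print(token)
```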

6 Basic State-Transition Model
Q := //book/price
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
[Figure: automaton for Q with states 0 to 3 and transitions labeled ε, *, book, and price, shown with a trace of the active states and the state stack ([0], [1], [1,2], …) as each input token arrives.]
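A minimal sketch of the stack-based state-transition idea for Q := //book/price follows. The state numbering is a simplification chosen for this example and does not match the slide's figure: state 0 loops on any tag (the "//" step), `book` leads to state 1, and `price` leads to the accepting state 2. Each open tag pushes the new set of active states and each close tag pops it.

```python
# Hypothetical automaton for //book/price (numbering is an assumption).
TRANSITIONS = {(0, "book"): 1, (1, "price"): 2}
SELF_LOOP = {0}      # states that stay active on any tag
ACCEPTING = {2}

def count_matches(tokens):
    """Count elements matching //book/price in a stream of (kind, name) tokens."""
    stack = [{0}]                    # one set of active states per open element
    matches = 0
    for kind, name in tokens:
        if kind == "open":
            current = stack[-1]
            nxt = {s for s in current if s in SELF_LOOP}
            nxt |= {TRANSITIONS[(s, name)] for s in current if (s, name) in TRANSITIONS}
            stack.append(nxt)
            if nxt & ACCEPTING:
                matches += 1
        elif kind == "close":
            stack.pop()              # backtrack to the parent's active states
    return matches

stream = [("open", "bib"), ("open", "book"), ("open", "title"), ("close", "title"),
          ("open", "price"), ("close", "price"), ("close", "book"), ("close", "bib")]
print(count_matches(stream))
```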

7 Extended with Data Buffer and Buffer Operations
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
Data-driven
- Token at a time
- Fixed order
[Figure: automaton with states 0 to 4 and transitions labeled ε, *, book, title, and price, extended with a buffer and a flag. One accepting state carries the actions "1. eval pred and set/clear flag, 2. output if buffer not empty"; the other carries "1. write buffer, 2. output if flag is set".]
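The buffer and flag actions can be sketched as a token-at-a-time loop. This is an illustration under assumptions, not the thesis implementation: tokens are `(kind, value)` pairs, the price action evaluates the predicate and sets the flag, the title action writes the buffer, and both are cleared when a book closes.

```python
def evaluate(tokens, limit=50.0):
    """Emit titles of books with price > limit, one token at a time."""
    out, buffer, flag, current = [], [], False, None
    for kind, value in tokens:
        if kind == "open":
            current = value
        elif kind == "text" and current == "title":
            buffer.append(value)              # 1. write buffer
            if flag:                          # 2. output if flag is set
                out.extend(buffer)
                buffer.clear()
        elif kind == "text" and current == "price":
            flag = float(value) > limit       # 1. eval pred and set/clear flag
            if flag and buffer:               # 2. output if buffer not empty
                out.extend(buffer)
                buffer.clear()
        elif kind == "close":
            current = None
            if value == "book":               # reset state between books
                buffer.clear()
                flag = False
    return out

def book(parts):
    """Helper: build the token list for one book from (tag, text) pairs."""
    tokens = [("open", "book")]
    for tag, text in parts:
        tokens += [("open", tag), ("text", text), ("close", tag)]
    return tokens + [("close", "book")]

stream = (book([("title", "A"), ("price", "65.95")])
          + book([("price", "39.95"), ("title", "B")])
          + book([("price", "70"), ("title", "C")]))
print(evaluate(stream))
```

The order of operations is fixed by the arrival order of tokens, which is the "fixed order" limitation the slide points out.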

8 Algebraic Query Plan
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
Set at a time
Postponed operation
[Figure: query plan with the operators Extract //book; Navigate //book, price; Select price > 50; Tagger; Navigate //book, title.]
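The plan can be sketched as set-at-a-time operators over tuples. Everything here is an assumption made for illustration: tuples are Python dicts, `extract`, `navigate`, `select`, and `tagger` are hypothetical stand-ins for the plan's operators, and the input is a small in-memory sample rather than a stream. The title navigation is postponed until after the selection, so it runs only on qualifying books.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of bib.xml.
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title><price>65.95</price></book>
  <book><title>Data on the Web</title><price>39.95</price></book>
</bib>"""

def extract(root, tag):                       # Extract //book
    return [{tag: elem} for elem in root.iter(tag)]

def navigate(tuples, source, child, column):  # Navigate //book, child
    return [{**t, column: t[source].findtext(child)} for t in tuples]

def select(tuples, predicate):                # Select pred
    return [t for t in tuples if predicate(t)]

def tagger(tuples, column, tag):              # Tagger: rebuild XML tags
    return ["<{0}>{1}</{0}>".format(tag, t[column]) for t in tuples]

def run_plan(xml_text):
    root = ET.fromstring(xml_text)
    books = extract(root, "book")
    priced = navigate(books, "book", "price", "price")
    kept = select(priced, lambda t: float(t["price"]) > 50)
    titled = navigate(kept, "book", "title", "title")   # postponed navigation
    return tagger(titled, "title", "title")

print(run_plan(BIB))
```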

9 Exploit the Flexibility of Postponed Operations
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 and $b/author/last = “Stevens” RETURN $b/title
[Figure: query plan with the operators Extract //book; Navigate //book, price; Select price > 50; Tagger; Navigate //book, author/last; Select last = “Stevens”; Navigate //book, title.]

10 Query Optimization in Algebraic Systems
Logical optimization
- Selection pushdown
- Projection pushdown
- Join order selection
Physical optimization
- Operator algorithms
Runtime optimization
- Scheduling
- Resource allocation

11 Thesis Overview
Motivation
- The automata model is good for on-the-fly pattern matching/retrieval
- The algebraic model is good for optimizing complex queries
Major challenges
- How to integrate the two models?
- How to optimize a query within the integrated query model?

12 The Raindrop Approach
Integration
Optimization

13 Path Bindings in XQuery
FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title
FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
FLWR expression: FOR…LET…WHERE…RETURN… (path bindings, then filtering and restructuring)
“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]

14 A Two-Tier System Architecture
[Figure: XML data stream -> automata plan -> tuple stream -> master plan -> query answer.]

15 Modeling the Master Plan: Algebraic
[Figure: master plan with the operators Navigate //book, price; Select price > 50; Tagger; Navigate //book, author/last; Select last = …; Navigate //book, title.]

16 Modeling the Automata Plan: Black Box vs. White Box
[Figure: an automata plan for Q1 := //book, Q2 := //book/price, Q3 := //book/title, shown beside the operators SJoin //book, Extract //book/price, and Extract //book/title.]

17 How to optimize it?
[Figure: XML data stream -> automata plan -> tuple stream -> master plan -> query answer.]

18 Optimization: A Unified Process in the Logical View
[Figure: automaton with states 0 to 2 and transitions labeled ε, *, and book, feeding Extract //book, then Navigate //book, //book/price; Select //book/price > 50; Navigate //book, //book/title. The plan is labeled Automata Plan and Master Plan and repeated with short paths: Extract //book; Navigate //book, price; Select price > 50; Navigate //book, title.]

19 The Algebra Core
Op: Semantics
- Selection: Filter tuples based on the predicate pred
- Projection: Filter columns in the input tuples based on the variable list v
- Join: Join input tuples based on the predicate pred
- Aggregate: Aggregate over input tuples with the aggregate function f, e.g., sum and average
- Tagger: Format outputs based on the pattern pt, i.e., reconstruct XML tags
- Navigate: Take input elements of path p1 and output ancestor elements of path p2
- Extract: Identify elements of path p from the input stream
- Structural Join: Join input tuples on their structural relationship, e.g., the common parent relationship p

20 The Extract Operator
Extract //book/title
[Figure: automaton with transitions labeled ε, *, book, and title; from the input stream it extracts the title elements TCP/IP Illustrated, Data on the Web, and Advanced Programming in the Unix environment.]

21 The Structural Join Operator
FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
[Figure: automaton with states 0 to 4 and transitions labeled ε, *, book, title, and price; Extract //book/title and Extract //book/price feed SJoin //book; a sample output tuple begins with TCP/IP Illustrated.]
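One way to picture the structural join on //book is as a join on the identity of the enclosing book. The sketch below assumes each extracted title and price arrives tagged with a hypothetical `book_id`; how the thesis actually tracks the common parent is not stated on the slide, so treat this purely as an illustration.

```python
def sjoin(titles, prices):
    """Join (book_id, title) and (book_id, price) pairs that share a book."""
    prices_by_book = {}
    for book_id, price in prices:
        prices_by_book.setdefault(book_id, []).append(price)
    return [(book_id, title, price)
            for book_id, title in titles
            for price in prices_by_book.get(book_id, [])]

# Hypothetical outputs of Extract //book/title and Extract //book/price.
titles = [(1, "TCP/IP Illustrated"), (2, "Data on the Web")]
prices = [(1, 65.95), (2, 39.95)]

joined = sjoin(titles, prices)                  # tuples ($b, $t, $p)
answer = [t for _, t, p in joined if p > 50]    # WHERE $p > 50 RETURN $t
print(answer)
```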

22 The Navigate Operator
Navigate //book, title
[Figure: a book element (TCP/IP Illustrated, Stevens W., Addison-Wesley, 65.95) passed through the Navigate operator.]
A navigate operation can be postponed, independent of the input stream.

23 A Special Optimization: In or Out?
[Figure: XML data stream -> automata plan -> tuple stream -> master plan -> query answer.]

24 Two Options: Bottom-up vs. Top-down
[Figure: the TCP/IP Illustrated book element (Stevens W., Addison-Wesley, 65.95) traced through both options.]

25 Exploiting the Options for Optimization
The pull-out plan: automaton with states 0 to 2 (transitions ε, *, book); Extract //book; Navigate //book, price; Select price > 50; Navigate //book, title.
The push-in plan: automaton with states 0 to 4 (transitions ε, *, book, title, price); Extract //book/price; Extract //book/title; SJoin //book; Select //book/price > 50; Tagger.

26 Query Optimization by Rewriting Rules
Algebraic transformation:
- Navigate push-in
- Redundant SJoin
- Redundant Extract
- Selection pushdown
- Etc.

27 Runtime Optimization: Why?
Optimization relies on cost estimation, which in turn relies on statistics
- Statistics unknown
- Statistics change
[Figure: plan with the operators Extract //book; Navigate //book, price; Select price > 50; Navigate //book, title; Tagger.]

28 Runtime Optimization Steps
1. Stat Collection
2. Decision Making
3. Plan Migration
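The three steps can be sketched on a small case: reordering two selections by observed selectivity. This is a generic heuristic chosen for illustration and not the thesis's cost model; `AdaptiveSelect`, its `window` parameter, and the sample predicates are hypothetical. Statistics are recorded only for predicates that actually run, so they are conditional on the current order.

```python
class AdaptiveSelect:
    """Conjunction of predicates whose order is revised at runtime."""

    def __init__(self, predicates, window=100):
        self.order = list(predicates)                            # [(name, fn), ...]
        self.stats = {name: [0, 0] for name, _ in predicates}    # [seen, passed]
        self.window = window
        self.count = 0

    def process(self, tup):
        result = True
        for name, fn in self.order:
            ok = fn(tup)
            self.stats[name][0] += 1          # step 1: stat collection
            self.stats[name][1] += ok
            if not ok:
                result = False
                break
        self.count += 1
        if self.count % self.window == 0:
            self.migrate()                    # only between tuples
        return result

    def selectivity(self, name):
        seen, passed = self.stats[name]
        return passed / seen if seen else 1.0

    def migrate(self):
        # Steps 2 and 3: decide on the most selective predicate first, then swap.
        self.order.sort(key=lambda p: self.selectivity(p[0]))

sel = AdaptiveSelect([("price", lambda t: t["price"] > 50),
                      ("last", lambda t: t["last"] == "Stevens")], window=4)
data = [{"price": 60, "last": "A"}, {"price": 70, "last": "B"},
        {"price": 80, "last": "C"}, {"price": 90, "last": "Stevens"}]
results = [sel.process(t) for t in data]
print(results, [name for name, _ in sel.order])
```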

29 Why Is Migration Needed?
When to interrupt the executor
- Master plan
- Automata plan
[Figure: the optimization cycle and the migration process, with a legend for executor and optimizer; phases shown are normal execution, prepare for migration, decision making, and plan modification.]

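30 Modifying the Automata: A Bad Example
[Figure: a plan with an automaton of states 0 to 2 (transitions ε, *, book), Extract //book, Navigate //book, //book/price, Select //book/price > 50, and Navigate //book, //book/title, shown next to a plan with an automaton of states 0 to 4 (transitions ε, *, book, title, price), Extract //book/price, Extract //book/title, SJoin //book, and Select //book/price > 50; the input shown is a book with title TCP/IP Illustrated and price 36.65.]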

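31 Modifying the Automata: A Safe Approach
FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
[Figure: the input stream marked with a safe point and an unsafe point; an automaton with states 0 to 2 (transitions ε, *, book) and an automaton with states 0 to 4 (transitions ε, *, book, title, price).]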
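A rough sketch of deferring a plan change to a safe point follows. It assumes, as a simplification not confirmed by the slide, that a point in the stream is safe when no book element is open, so no partially matched book is in flight. `migration_points` and `request_at` are hypothetical names.

```python
def migration_points(tokens, request_at):
    """Return token indexes at which a requested migration is applied.

    tokens: list of (kind, name) pairs; request_at: set of indexes at which
    the optimizer asks for a migration.
    """
    open_books = 0
    pending = False
    applied = []
    for i, (kind, name) in enumerate(tokens):
        if i in request_at:
            pending = True
        if kind == "open" and name == "book":
            open_books += 1
        elif kind == "close" and name == "book":
            open_books -= 1
        if pending and open_books == 0:      # safe point: no book in flight
            applied.append(i)
            pending = False
    return applied

stream = [("open", "bib"), ("open", "book"), ("open", "title"), ("close", "title"),
          ("close", "book"), ("open", "book"), ("close", "book"), ("close", "bib")]
print(migration_points(stream, {2}))   # requested inside a book, applied after it closes
```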

32 Experimental Study
Is it feasible to integrate the automata model and the algebraic model?
Is push-in vs. pull-out a feasible optimization?
Is runtime optimization worthwhile?

33 Experimental Setup
Java 1.4
Pentium III 750 MHz, 384 MB
Windows XP Professional
Third-party components
- Xerces SAX parser
- The Kweelt XQuery parser
- Rainbow core

34 Exp1: System Throughput

35 Exp2: Push-in vs. Pull-out

36 Exp3: Runtime Optimization

37 Related Work
Automata-based XML processing
- XFilter, YFilter, X-Scan, XTrie, XPush, …
Algebraic XQuery engines
- XPeranto, LegoDB, Rainbow, Timber, …
Runtime optimization
- Tukwila, TelegraphCQ, …

38 Contribution
While much recent XML stream work (e.g., in SIGMOD03) processes XPath queries, we are among the first to deal with XQuery
We are the first to consider the problem of flexibly integrating automata and query algebra
Push-in vs. pull-out optimization techniques
Prototype system
Experimental study

39 Conclusion
Combining automata and query algebra results in a very powerful query model for XML stream processing
Special optimization techniques (e.g., push-in vs. pull-out) can be applied in the integrated system
Data statistics collected at runtime can be exploited via runtime optimization techniques

40 Thanks to:
Prof. Elke A. Rundensteiner
Prof. Kathi Fisler
The Raindrop/Rainbow team
All DSRG members

41 Questions?

