Elke A. Rundensteiner: Topics and projects in database and information systems, such as web information systems, distributed databases, etc. Database Systems Research Lab.


1 Elke A. Rundensteiner
Topics and projects in database and information systems, such as web information systems, distributed databases, etc.
Database Systems Research Lab
Email: rundenst@cs.wpi.edu
Office: Fuller 238
Phone: x5815
Webpages: http://www.cs.wpi.edu/~rundenst and http://davis.wpi.edu/dsrg

2 Core Projects in a Nutshell
Distributed Data Sources:
- EVE: Data Warehousing over Distributed Data Sources
- TOTAL-ETL: Distributed Extract Transform Load Tools
Web Information Systems:
- RAINBOW: XML Query Engine and Tools
- XML to Relational Mapping Technology
Databases and Visualization:
- Visualization-Driven Data Caching
- Prefetching based on User Access Patterns
Stream Monitoring Systems:
- New Generation of Query Engines
- Adaptive Distributed Optimizations

3 DSRG Research Group
- 2 faculty
- 8 PhD students
- 5 MS students
- 4 MQP teams

4 Stream Monitoring Applications
- Monitor troop movements during combat and warn when soldiers veer off course
- Send an alert when a patient's vital signs begin to deteriorate
- Monitor incoming news feeds to see stories on "Iraq"
- Scour network traffic logs looking for intruders

5 Properties of Monitoring Applications
- Queries and monitors run continuously, possibly without end
- Applications have varying service preferences:
  - Patient monitoring only wants the freshest data
  - Remote sensors have limited memory
  - News service wishes maximal throughput
  - Taking 60 seconds to process vital signs and sound an alert may be too long

6 Properties of Streaming Data
- Possibly never-ending stream of data
- Unpredictable arrival patterns:
  - Network congestion
  - Weather (for external sensors)
  - Sensor moves out of range

7 Querying and Monitoring Streaming Data
New data format: querying and analyzing continuous, unbounded, rapid, time-varying streams of data elements.
Traditional DB vs. Stream System:
- Persistent finite relations vs. transient infinite data streams
- One-time queries vs. large number of continuously running queries
- Random access vs. sequential access
- Query plan determined by static design decisions vs. unpredictable data characteristics and arrival patterns
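
The contrast between one-time and continuously running queries can be sketched in Python. This is only an illustration: the `heart_rate` field, the threshold, and the `sensor` source are assumptions made for the example and do not come from the slides.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Reading = Dict[str, float]


def one_time_query(relation: List[Reading], threshold: float) -> List[Reading]:
    """Traditional DB style: evaluate once over a finite, stored relation."""
    return [r for r in relation if r["heart_rate"] > threshold]


def standing_query(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """Stream style: read each tuple once, in arrival order, and emit
    matches immediately; this never returns if the stream never ends."""
    for reading in stream:
        if reading["heart_rate"] > threshold:
            yield reading


def sensor() -> Iterator[Reading]:
    """Hypothetical unbounded source; the values are made up for illustration."""
    t = 0
    while True:
        yield {"ts": float(t), "heart_rate": 60.0 + (t % 5) * 20.0}
        t += 1


def first_alerts(n: int, threshold: float = 120.0) -> List[Reading]:
    """Take only the first n results, since the full result is unbounded."""
    return list(islice(standing_query(sensor(), threshold), n))
```

The standing query holds no stored relation and cannot revisit earlier tuples, which is the sequential-access restriction listed above.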

8 Databases Upside Down
[Figure: a traditional database applies one-time queries to static data; a stream system applies standing queries to streams of data.]

9 CAPE: Constraint-Aware Adaptive Processing Engine
Elke A. Rundensteiner, G. Heineman, Luping Ding, Timothy Sutherland, Yali Zhu, Brad Pielech, Nishant Mehta, Natasha Bogdanova, Mariana Jbantova
Extracted from VLDB 2004 CAPE Demo.
http://davis.wpi.edu/dsrg/CAPE/index.html

10 Uncertainties in Stream Query Processing
[Figure: continuous queries are registered with a stream query engine, which consumes streaming data and produces a streaming result.]
- Registered continuous queries may have different QoS requirements.
- Streaming data may have time-varying rates and data distribution.
- Available resources for executing each operator may vary over time.
- Adaptations are required for the stream query engine.

11 What is CAPE?
Constraint-aware Adaptive Continuous Query Processing Engine
- Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.
- Incorporate heterogeneous-grained adaptivity at all query processing levels:
  - Adaptive query operator execution
  - Adaptive query plan re-optimization
  - Adaptive operator scheduling
  - Adaptive query plan distribution
- Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.

12 CAPE System Architecture
[Figure: architecture diagram with the components Distribution Manager, Plan Reoptimizer, Operator Scheduler, and Operator Configurator.]

13 Research Contributions
- Scalable Query Operators (Punctuations)
  - Adapt and select among tasks such as memory purging, stream reading, memory-to-disk shuffling, punctuation propagation, etc.
- Cooperative Plan Optimization
  - Operators request and selectively provide punctuations (meta-knowledge) to improve their performance.
- Adaptive Operator Scheduling
  - A selector scores alternative scheduling algorithms based on their effect on QoS requirements and selects a candidate.
- On-line Query Plan Migration
  - On-line plan restructuring followed by on-line migration to the new plan, even for stateful operators.
- Distributed Plan Execution
  - Adaptively distribute computations across multiple machines to optimize QoS requirements without information loss.

14 Constraint-Aware Query Operators
Example bid stream ending with a punctuation:
    Item_id  Bidder_id  Bid_price  Bid_time
    1082     Marlie     820.00     Nov-13-03 11:02:00
    1080     Ultrasale  1000.00    Nov-13-03 11:05:00
    1082     Jocelyn    850.00     Nov-13-03 11:14:00
    1082     *          *          *
    (The last row is a punctuation: no more bids will be placed on item 1082.)
- Punctuations [TMS+03] are embedded inside data streams to announce that no tuple after a punctuation will match this punctuation.
- Sliding windows characterize the data sets that the current result is based on.
Window join:
[Figure: timeline starting at T0 for Stream A and Stream B, with windows Wa and Wb, timestamps TSa and TSb, and the bounds TSb - Wa and TSa - Wb.]
- Utilize data-level constraints, punctuations, to purge no-longer-needed data
- Exploit time-based constraints, sliding windows, to discard expired data
- Incorporate optimizations enabled by interactions between different constraints
- Include ability to propagate punctuations to help downstream operators
- Employ adaptive execution logic to cope with dynamic stream environments
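
A minimal Python sketch of how a join operator might combine the two constraints above. It is not CAPE's implementation: the class name `WindowJoin`, the single shared window (the slide shows separate windows Wa and Wb), and the assumption that timestamps arrive in non-decreasing order across both streams are simplifications made here for illustration.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Set, Tuple

Entry = Tuple[float, Any]  # (timestamp, payload)


class WindowJoin:
    """Symmetric equi-join of streams "A" and "B" with one shared time window."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.state: Dict[str, DefaultDict[Any, List[Entry]]] = {
            "A": defaultdict(list),
            "B": defaultdict(list),
        }
        # Keys for which a stream has sent a punctuation ("no more tuples").
        self.closed: Dict[str, Set[Any]] = {"A": set(), "B": set()}

    def on_tuple(self, side: str, ts: float, key: Any, payload: Any) -> List[Tuple[Any, Any]]:
        other = "B" if side == "A" else "A"
        # Time-based constraint: drop partners that fell out of the window.
        live = [(t, p) for (t, p) in self.state[other][key] if ts - t <= self.window]
        if live:
            self.state[other][key] = live
        else:
            self.state[other].pop(key, None)
        # Data-level constraint: store the new tuple only if the other
        # stream may still send a matching tuple.
        if key not in self.closed[other]:
            self.state[side][key].append((ts, payload))
        # Results are always ordered (A payload, B payload).
        return [(payload, p) if side == "A" else (p, payload) for (_, p) in live]

    def on_punctuation(self, side: str, key: Any) -> None:
        """Stream `side` promises no further tuples with this key."""
        self.closed[side].add(key)
        other = "B" if side == "A" else "A"
        # Stored tuples of the other stream with this key can never join again.
        self.state[other].pop(key, None)

    def state_size(self) -> int:
        return sum(len(v) for side in self.state.values() for v in side.values())
```

With the bid stream as stream A, the punctuation on item 1082 lets the operator drop everything stream B has stored for that item, and later B tuples for item 1082 only probe the remaining A state without being stored.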

15 [Figure: two panels, "Punctuation-Exploiting Join Optimization" and "Window-Based Join Optimization". Each shows joins over streams A, B, and C together with the AB and AC join states; the window-based panel annotates tuples with timestamps (TS) and uses windows of size 6 and 8.]

16 Run-time Plan Migration
- The last step of plan re-optimization: after the optimizer generates a new query plan, how can the currently running plan be replaced by the new plan on the fly?
- A new challenge in streaming systems because of stateful operators.
- A unique feature of the CAPE system.
But can we just take out the old plan and plug in the new plan?
Steps:
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan
Key Observation: Purge of tuples in states relies on processing of new tuples.
Problem: Deadlock waiting
[Figure: old and new join plans over streams A, B, and C, annotated "(2) All tuples drained", "(3) Old replaced by new", and "(4) Processing resumed".]

17 Run-time Plan Re-Optimization
- Step 1: Decide when to optimize (Statistics Monitoring)
- Step 2: Generate new query plan (Query Optimization)
- Step 3: Replace current plan by new plan (Plan Migration)

18 Plan Migration Strategy: Out with the Old and In with the New?
Migration Steps:
- Pause execution of old plan
- Drain out all tuples inside old plan
- Replace old plan by new plan
- Resume execution of new plan
Problem: Works for stateless operators only
[Figure: old and new join plans over streams A, B, and C.]

19 Stateful Operator in CQ
Why stateful:
- Need non-blocking operators in CQ
- Operator needs to output partial results
- State data structure keeps received tuples
Example: Symmetric NL join with window constraints
[Figure: join of streams A and B with State A and State B; tuples b1 to b5 and ax, with join results (ax, b2) and (ax, b3).]
Key Observation: The purge of tuples in states relies on processing of new tuples.
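
The key observation can be made concrete with a small Python sketch of a symmetric nested-loop join. The class `SymmetricNLJoin`, its `match` predicate, and the `is_drained` helper are hypothetical names chosen for this example; the sketch assumes timestamps arrive in non-decreasing order.

```python
from typing import Callable, List, Tuple

Entry = Tuple[float, str]  # (timestamp, value)


class SymmetricNLJoin:
    """Symmetric nested-loop join of streams A and B with a time window."""

    def __init__(self, window: float, match: Callable[[str, str], bool]) -> None:
        self.window = window
        self.match = match  # join predicate over (a_value, b_value)
        self.state_a: List[Entry] = []
        self.state_b: List[Entry] = []

    def _process(self, ts: float, value: str, own: List[Entry],
                 other: List[Entry], a_first: bool) -> List[Tuple[str, str]]:
        # The opposite state is purged only when a new tuple arrives here.
        other[:] = [(t, v) for (t, v) in other if ts - t <= self.window]
        own.append((ts, value))
        # Non-blocking: partial results are emitted right away.
        pairs = [(value, v) if a_first else (v, value) for (_, v) in other]
        return [p for p in pairs if self.match(*p)]

    def insert_a(self, ts: float, value: str) -> List[Tuple[str, str]]:
        return self._process(ts, value, self.state_a, self.state_b, True)

    def insert_b(self, ts: float, value: str) -> List[Tuple[str, str]]:
        return self._process(ts, value, self.state_b, self.state_a, False)

    def is_drained(self) -> bool:
        return not self.state_a and not self.state_b
```

If the inputs are paused, neither `insert_a` nor `insert_b` is ever called, so `is_drained()` stays false however long the system waits. This is why a pause-then-drain migration can wait indefinitely on stateful operators.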

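20 Migration Strategy Revisited
Steps:
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan
Problem: Deadlock waiting. Once execution is paused no new tuples are processed, so stateful operators never purge their states and the drain in step (2) may never complete.
[Figure: old and new join plans over streams A, B, and C, annotated "(2) All tuples drained", "(3) Old replaced by new", and "(4) Processing resumed".]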

21 Dynamic Plan Migration
Input (two migration boxes):
- One contains old plan
- One contains new plan
- Both have the same input and output queues
Result: Old box is replaced by new box
Two Strategies:
- Moving State Strategy
- Parallel Track
Guarantee Valid Migration:
- No missing tuples
- No duplicates
Key points:
- Involved plans contain stateful operators
- Migrate yet still retain useful states and discard useless states

22 Distribution Manager
[Figure: Distribution Manager with the components Connection Manager, Query Plan Manager, Runtime Monitor, Distribution Decision Maker, Distributed Strategy Repository, Cost Model Repository, and Configuration Repository, connected to CAPE Engines (query processors).]
- Connection Manager: synchronizes query plan distribution between query processors.
- Query Plan Manager: maintains the location of each query operator and verifies that a particular distribution is valid.
- Runtime Monitor: receives statistics from each query processor and calculates respective load over time.
- Distribution Decision Maker: creates initial distribution and at runtime reallocates query operators to different machines.

23 Initial Distribution Policies
Round-Robin Distribution:
- Goal: To equalize workload per machine.
- Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.
Grouping Distribution:
- Goal: To minimize network connectivity.
- Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.
[Figure: example placements of operators on machines M1, M2, and M3 under each policy, with a legend for the machines.]
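
Both policies can be sketched in a few lines of Python. The sketch assumes each query plan is a linear chain of operators, so that "neighbouring" means adjacent in the list, and it breaks ties in favour of the first machine; the slides specify neither detail. The function names are illustrative.

```python
from math import ceil
from typing import Dict, List

Placement = Dict[str, str]  # operator name -> machine name


def round_robin(plans: List[List[str]], machines: List[str]) -> Placement:
    """Place each operator on the machine that currently has the fewest operators."""
    counts = {m: 0 for m in machines}
    placement: Placement = {}
    for plan in plans:
        for op in plan:
            target = min(machines, key=counts.get)  # ties go to the first machine
            placement[op] = target
            counts[target] += 1
    return placement


def grouping(plans: List[List[str]], machines: List[str]) -> Placement:
    """Split each plan into contiguous sub-plans so that neighbours share a machine."""
    placement: Placement = {}
    for plan in plans:
        size = ceil(len(plan) / len(machines))  # operators per sub-plan
        for i, op in enumerate(plan):
            placement[op] = machines[i // size]
    return placement


def cross_machine_edges(plans: List[List[str]], placement: Placement) -> int:
    """Count neighbouring operator pairs that end up on different machines."""
    return sum(
        placement[a] != placement[b]
        for plan in plans
        for a, b in zip(plan, plan[1:])
    )
```

For two four-operator chains on two machines, round-robin gives every machine four operators but cuts six plan edges, while grouping cuts only two. The grouping result for this input coincides with the distribution table shown on slide 24, although the slides do not say which policy produced that table.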

24 Distribution Manager: Initial Distribution Process
- Step 1: Create distribution table using initial distribution algorithm.
- Step 2: Send distribution information to processing machines.
Distribution Table:
    Operator    Machine
    Operator 1  M1
    Operator 2  M1
    Operator 3  M2
    Operator 4  M2
    Operator 5  M1
    Operator 6  M1
    Operator 7  M2
    Operator 8  M2
[Figure: a stream source feeds operators 1 to 8, which are placed on machines M1 and M2 and deliver results to the application.]

25 Redistribution - New Distribution Balance Computation
Statistics Table (machine capacity: 4500 tuples):
    Machine  Tuples
    M1       2000 tuples
    M2       4100 tuples
Distribution Table:
    Operator    Machine
    Operator 1  M1
    Operator 2  M1
    Operator 3  M2
    Operator 4  M2
    Operator 5  M1
    Operator 6  M1
    Operator 7  M2
    Operator 8  M2
Cost Table (current):
    Machine  Cost  Operator cost
    M1       .44   Op 1: .25, Op 2: .25, Op 5: .25, Op 6: .25
    M2       .91   Op 3: .3, Op 4: .2, Op 7: .3, Op 8: .2
Cost Table (desired, after Balance):
    Machine  Cost  Operator cost
    M1       .71   Op 1: .15, Op 2: .15, Op 5: .15, Op 6: .15, Op 7: .4
    M2       .64   Op 3: .4, Op 4: .3, Op 8: .3
- Cape's costs: number of tuples in memory and network output rate.
- Cost per machine is determined as the percentage of memory filled with tuples.
- Operators are redistributed based on the redistribution policy.
- Redistribution policies of Cape: Balance and Degradation.
- 6-Step Protocol: Moving Operators Across Machines
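
The numbers in the cost tables can be reproduced with a short calculation. The sketch below assumes that an operator's cost is its share of the tuples held on its machine, which is consistent with the tables but not stated outright on the slide; the function names are illustrative.

```python
from typing import Dict, Tuple

CAPACITY = 4500  # tuples per machine, from the statistics table


def operator_tuples(machine_tuples: float, shares: Dict[str, float]) -> Dict[str, float]:
    """Tuples held per operator, assuming share * tuples on that machine."""
    return {op: share * machine_tuples for op, share in shares.items()}


def machine_cost(ops: Dict[str, float], capacity: float = CAPACITY) -> float:
    """Cost per machine: fraction of memory filled with tuples."""
    return sum(ops.values()) / capacity


def balance_example() -> Tuple[Tuple[float, float], Tuple[float, float]]:
    m1 = operator_tuples(2000, {"Op 1": 0.25, "Op 2": 0.25, "Op 5": 0.25, "Op 6": 0.25})
    m2 = operator_tuples(4100, {"Op 3": 0.3, "Op 4": 0.2, "Op 7": 0.3, "Op 8": 0.2})
    before = (machine_cost(m1), machine_cost(m2))
    # Balance: move Op 7 (about 1230 tuples) from the loaded M2 to M1.
    m1["Op 7"] = m2.pop("Op 7")
    after = (machine_cost(m1), machine_cost(m2))
    return before, after
```

Before the move the costs are 2000/4500 ≈ 0.44 and 4100/4500 ≈ 0.91. After moving Op 7 they become 3230/4500 ≈ 0.72 and 2870/4500 ≈ 0.64, which agrees with the desired cost table up to rounding (the slide shows .71 for M1).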

26 Move Operators Across Machines

27 Experimental Results of Distribution and Re-distribution Algorithms
Query plan performance with a query plan of 40 operators.
Observations:
- Initial distribution is important for query plan performance.
- Redistribution at run-time improves query plan performance.

28 So Many Fun Challenges, So Little Time …
Motion Query Processing: Sensors on Cars
- Novel classes of location-aware queries
- Queries such as "Find me the five nearest restaurants to my current (moving) location"
- Not just data but also queries are streaming in.
- Massive Query Sharing and Shared Execution

29 Publications, TRs & URLs
[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter, July 2004. http://www.cs.uno.edu/~nauman/streamBook/
[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.
[SPR04] T. Sutherland, B. Pielech and E. A. Rundensteiner, "Adaptive Scheduling Framework for A Continuous Query System". Tech Report, WPI-CS-TR-04-16, 2004.
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18, 2004.
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html

30 Raindrop: XQueries on XML Streams (Automaton Meets Algebra)
Hong Su, Elke Rundensteiner, Murali Mani, Ming Li
Worcester Polytechnic Institute, Worcester, MA
Extracted from VLDB 2004 demo

31 XML Stream Processing
[Figure: data sources and data requesters connected through networks.]

32 What's Special for XML Stream Processing?
- Token-by-token access manner: the XML stream arrives along a timeline as a sequence of tokens (the slide's example carries the values "Dream Catcher", "King", "S.", "Bt Bound", "30", …).
- Token: not a direct counterpart of a tuple.
- Processing involves pattern retrieval + filtering + restructuring.
Example query:
    FOR $b in stream(biditems.xml)//book
    LET $p := $b/price,
        $t := $b/title
    WHERE $p < 20
    Return $t
Pattern retrieval on token streams assembles tuples such as:
    price  publisher  first  last  title  year
    30     Bt Bound   S.     King  Dream  2001
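
A rough Python illustration of token-by-token access, using the standard library's `xml.etree.ElementTree.XMLPullParser`. This is not Raindrop's engine. The function name `cheap_titles` and the element layout (`book` elements with direct `title` and `price` children) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator


def cheap_titles(chunks: Iterable[str], max_price: float) -> Iterator[str]:
    """Evaluate "titles of books with price < max_price" over an XML stream
    that arrives in arbitrary text chunks."""
    parser = ET.XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)  # a chunk may end in the middle of a tag
        for event, elem in parser.read_events():
            # Pattern retrieval: a book is complete only at its end token.
            if event == "end" and elem.tag == "book":
                price = elem.findtext("price")
                title = elem.findtext("title")
                # Filtering, then restructuring (return only the title).
                if price is not None and title is not None and float(price) < max_price:
                    yield title
                elem.clear()  # release the buffered subtree
```

The sketch shows why a token is not a tuple: the query can decide about a book only after several tokens have been buffered, and the buffer can be released as soon as the closing token has been handled.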

33 Two Computation Paradigms
- Automata-based [yfilter, xscan, xpush…]
- Algebraic [niagara00, …]
Example query:
    FOR $a in stream(bids)//auction,
        $b in $a/seller[homepage],
        $c in $a/bidder[sameAddr]
    WHERE $b/*/phone = "508"
    Return $b, $c
[Figure, automata: states 1 to 9 with transitions labeled *, auction, seller, bidder, bid, homepage, sameAddr, and phone.]
[Figure, algebra: Navigate stream(bids), //auction->$a; Navigate $a, /seller->$b; Navigate $a, /bidder->$c; Tagger.]
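
The automata-based paradigm can be illustrated with a small stack-based matcher for simple paths such as //auction/seller. This is a generic sketch of the technique, not the algorithm of YFilter or of any other cited system; the token format and the function name `match_path` are assumptions made here.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # ("start" or "end", tag name)


def match_path(tokens: Iterable[Token], steps: List[str]) -> List[int]:
    """Return positions of start tokens whose element matches //step1/step2/...

    State i means "the first i steps are matched"; "*" matches any tag.
    """
    stack: List[Set[int]] = [{0}]
    matches: List[int] = []
    for pos, (kind, tag) in enumerate(tokens):
        if kind == "start":
            active = stack[-1]
            nxt = {0}  # leading "//": the start state stays active at any depth
            for state in active:
                if state < len(steps) and steps[state] in (tag, "*"):
                    nxt.add(state + 1)
            if len(steps) in nxt:
                matches.append(pos)  # accepting state reached
            stack.append(nxt)
        else:
            stack.pop()  # an end token restores the parent's active states
    return matches
```

The stack keeps one set of active states per open element, so an end token simply backtracks. Filtering on values and restructuring of results are not covered by this matcher, which is the gap the comparison of the two paradigms points to.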

34 Comparison of Two Paradigms
- Either paradigm has deficiencies
- Both paradigms complement each other
Automata Paradigm | Algebra Paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Need patches for filtering and restructuring | Good for filtering and restructuring
Present all details on same low level | Support multiple descriptive levels (e.g., logical plan, physical plan)
Little studied as query processing paradigm | Well studied as query processing paradigm

35 This Raindrop framework integrates both paradigms into one uniform model.

36 One Uniform Algebraic View
[Figure: algebraic stream logical plan. An XML data stream enters a token-based plan (automata plan), which passes a tuple stream to a tuple-based plan that produces the query answer.]

37 Example Uniform Algebraic Plan
    FOR $b in stream(biditems.xml)//book
    LET $p := $b/price,
        $t := $b/title
    WHERE $p < 30
    Return $t
[Figure: the plan is split into a tuple-based plan on top of a token-based plan (automata plan).]

38 Example Uniform Algebraic Plan
    FOR $b in stream(biditems.xml)//book
    LET $p := $b/price,
        $t := $b/title
    WHERE $p < 30
    Return $t
Operators shown below the tuple-based plan:
- StructuralJoin $b
- ExtractNest $b, $p
- ExtractNest $b, $t
- Navigate $b, /price->$p
- Navigate $b, /title->$t
- Navigate $S1, //book->$b

39 Example Uniform Algebraic Plan
    FOR $b in stream(biditems.xml)//book
    LET $p := $b/price,
        $t := $b/title
    WHERE $p < 30
    Return $t
Operators shown in the complete plan:
- Tagger "Inexpensive", $t->$r
- Select $p<30
- StructuralJoin $b
- ExtractNest $b, $p
- ExtractNest $b, $t
- Navigate $b, /price->$p
- Navigate $b, /title->$t
- Navigate $S1, //book->$b
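
The tuple-based part of such a plan can be sketched as a pipeline of Python generators. The sketch assumes the token-based operators and the structural join have already produced one tuple per book with bindings `p` (price) and `t` (title); the function names and the dictionary representation are illustrative and not taken from Raindrop.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List
from xml.sax.saxutils import escape

Row = Dict[str, Any]  # variable bindings of one tuple, e.g. {"p": 12.0, "t": "..."}


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Select: keep only tuples that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def tagger(rows: Iterable[Row], tag: str, source: str, target: str) -> Iterator[Row]:
    """Tagger: wrap the value bound to `source` in a new element bound to `target`."""
    for row in rows:
        wrapped = f"<{tag}>{escape(str(row[source]))}</{tag}>"
        yield {**row, target: wrapped}


def run_plan(rows: Iterable[Row]) -> List[str]:
    """Tagger "Inexpensive", $t->$r  over  Select $p<30."""
    selected = select(rows, lambda row: row["p"] < 30)
    return [row["r"] for row in tagger(selected, "Inexpensive", "t", "r")]
```

Because every operator consumes and produces a stream of tuples, the operators compose freely, which is what lets an optimizer reorder or replace them.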

40 Automata Push-In/Out?
[Figure: XML data stream, token-based plan (automata plan), tuple stream, tuple-based plan, query answer.]
- Pattern retrieval in semantics-focused plan
- Apply "push into automata"

41 Optimization Example: Computation Into or Out of Automata?
[Figure: alternative plans for the same query above the automata plan, related by "Out of Automata" and "Into Automata" rewrites. The operator groups appear in the following order.]
First group:
- StructuralJoin $a
- ExtractUnnest $a, $b
- ExtractUnnest $a, $c
- TokenNavigate $a, /seller->$b
- TokenNavigate $a, /bid/bidder->$c
- TokenNavigate stream(bids), //auction->$a
Second group:
- NavigateUnnest $a, /seller->$b
- NavigateUnnest $a, /bid/bidder->$c
- ExtractUnnest stream(bids), $a
- TokenNavigate stream(bids), //auction->$a
Third group:
- NavigateUnnest $a, /seller->$b
- NavigateUnnest $a, /bid/bidder->$c
- NavUnnest stream(bids), //auction->$a

42 Research Challenges
- Semantic Query Optimization
- Automata versus Algebra Optimization
- Memory-Sensitive Non-Blocking Evaluation
- On-the-fly Plan Retargeting
- XML Load Shedding
- …

43 Want More?
- See DSRG Web Site
- Visit DSRG Research Labs 318 & 319
- Attend DSRG Research Meeting
- Come and talk

