Continuous Stream Monitoring Technology Elke A. Rundensteiner Database Systems Research Laboratory Department of Computer Science Worcester Polytechnic.


1 Continuous Stream Monitoring Technology
Elke A. Rundensteiner
Database Systems Research Laboratory, Department of Computer Science
Worcester Polytechnic Institute, USA
rundenst@cs.wpi.edu
October 2006

2 Project Topics in a Nutshell
- Distributed Data Sources:
  - EVE: Data Warehousing over Distributed Data
  - TOTAL-ETL: Distributed Extract Transform Load
  - [NSF’96, NSF02, IBM]
- XML/Web Data Systems:
  - RAINBOW: XML to Relational Databases
  - MASS: Native XQuery Processing System
  - [Verizon, IBM, NSF05]
- Databases & Visualization:
  - Scalable Visual High-Dim. Data Exploration
  - Data and Visual Quality Support in XMDV
  - [NSF’97, NSF01, NSF05]
- Stream Monitoring System:
  - Scalable Query Engine for Data Streams
  - Fire Prediction and Monitoring Appl.
  - [NSF06, NEC]

3 Why Database Technology?
- Vast amounts of electronic information in organisations, companies, and scientific institutes need to be organized, stored securely, and accessed efficiently
- Database management systems (DBMSs) provide:
  - Model for logical structure of information
  - Query languages to access and modify data
  - Persistent data storage over long time
  - Index technologies
  - Efficient query processing and optimization
  - Concurrent access for multiple users
  - Access rights and security
  - Scalability in query workload and data size
[Diagram: a DBMS over a stored database, answering the query "Select name from employee;"]

4 Generations of DBMSs
- Early DBMSs: navigational access
- Relational DBMSs: traditional tables and SQL queries
- Object-oriented DBMSs: object modeling and extensibility
- Object-relational DBMSs: combine declarative queries with OO modeling
- XML DBMSs: support web and semi-structured data types

5 Question...
What is common among these DBMSs?
[Diagram: a DBMS over a stored database, answering the query "Select name from employee;"]

6 Answer...
Three common steps:
- Make schema design
- Load database
- Query static database
Key differences:
- Different data models
[Diagram: a DBMS over a stored database, answering the query "Select name from employee;"]

7 So what next?
[Diagram: a DBMS over a stored database, answering the query "Select name from employee;"]

8 A Look at Modern Applications
- Digital radio telescopes
- Network traffic monitoring
- Environmental monitoring
- Tracking using RFID tags
- Sensor networks
- Analyses of web usage logs
- Financial analysis of stock exchanges
- Out-patient critical care
- ...
Filter & Transform: select fft(s) from radiosignal s where source(s) = “Antenna1”;

9 A Look at Modern Applications
- What do those applications have in common?
Filter & Transform: select fft(s) from radiosignal s where source(s) = “Antenna1”;

10 Continuous Queries on Data Streams
Online Stream Monitoring

11 Databases: A Paradigm Shift!
- Before: ad-hoc one-time queries issued against static data
- Now: continuous standing queries evaluated over streams of data

12 Data Streams and Continuous Queries
- Data streams:
  - Continuous on-line ordered sequences
  - Produced by sensors, simulations, and instruments
  - Data pushed to reactive applications
  - Results are also continuous output streams
- Stream queries:
  - Continuous long-running or even infinite queries
  - On-the-fly real-time processing as data arrives
  - Constrained processing time and memory usage
  - Selective stream storage (often of recent past)

13 Requirements for Data Stream Management Systems (DSMSs)
- Non-blocking operators in query plans
- Windows: cut infinite streams into finite sub-streams
- One-pass query algorithms
- Approximate query answers
- Real-time response when unusual behavior is detected
- Adaptation to environmental changes
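
The window and one-pass requirements above can be illustrated with a small sketch. It is not taken from the presentation: the function name `sliding_average`, the count-based window, and the sample readings are assumptions chosen for illustration. Each reading is touched once, and only the last `window_size` readings are kept in memory.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the mean of the most recent `window_size` readings after each arrival.

    Non-blocking: one result is produced per input, without waiting for the
    (possibly infinite) stream to end.
    """
    window = deque()
    total = 0.0
    for value in stream:
        window.append(value)
        total += value
        if len(window) > window_size:
            total -= window.popleft()  # expire the oldest reading
        yield total / len(window)


# Hypothetical temperature readings; the last one is an unusual spike.
readings = [20.0, 22.0, 24.0, 90.0]
averages = list(sliding_average(readings, window_size=3))
```

A time-based window would expire readings by timestamp instead of by count; the one-pass structure stays the same.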

14 DSMS Provides:
- High-level query language (declarative interface)
- Data independence from physical stream implementations
- Query optimization (for performance)
- Scalability in data volume and query workload
- Shared execution of similar queries
- Adaptive distributed processing

15 Real-time Stream Query Processing: Parallelism
- Process queries on shared-nothing architectures (cluster or Grid)
- Make use of aggregated resources (main memory, CPU)
[Diagram: query workload distributed over a network of machine clusters]
Acquired NSF equipment grant 2006 for purchase of a high-performance cluster for stream processing applications

16 Three Types of Parallelism We Exploit
- Pipelined: operators are composed into producer-consumer relationships
- Independent: independent operators run simultaneously on distinct machines
- Partitioned: a single operator is replicated and run on multiple machines
Adaptation is considered within each processing paradigm
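
As a rough illustration of partitioned parallelism, the sketch below routes tuples to operator replicas by hashing a key, so that all tuples with the same key reach the same replica. The function `partition_by_key`, the `sensor` field, and the sample data are hypothetical; the presentation does not specify how partitioning is implemented.

```python
def partition_by_key(tuples, key, num_machines):
    """Route each tuple to one operator replica based on its key value."""
    partitions = [[] for _ in range(num_machines)]
    for t in tuples:
        # Integer keys hash to themselves in Python, so routing is deterministic.
        partitions[hash(t[key]) % num_machines].append(t)
    return partitions


# Hypothetical sensor readings split across two replicas of one operator.
readings = [
    {"sensor": 1, "temp": 20},
    {"sensor": 2, "temp": 21},
    {"sensor": 1, "temp": 22},
    {"sensor": 3, "temp": 19},
]
parts = partition_by_key(readings, "sensor", num_machines=2)
```

Because both readings of sensor 1 land in the same partition, a per-sensor aggregate or join can run on each replica without consulting the others.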

17 Project 1: Mobile Wireless Application Streams
- moving objects
- dynamic range query
- dynamic kNN query

18 Scuba Project: Mobile Application Streams
- Scalability
  - Large number of objects
  - Large number of queries
- Limited resources
  - Memory
  - CPU
- Real-time response requirement
The challenge is to provide fast query response in update-intensive environments
Legend: moving objects, dynamic range query, dynamic kNN query
Novel idea: exploit the fact that objects naturally move in groups (i.e., clusters) to optimize query evaluation

19 Spatio-Temporal Continuous Tracking
- Monitor the traffic in the red areas
- Continuously return the area covered by the herd during the migration

20 Main Idea: Moving Clusters
Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects (SCUBA)
- Abstract individual objects into a cluster based on common attributes:
  - Direction
  - Speed
  - Spatial position
- With cluster abstractions, minimize the number of unnecessary individual object/query joins, thus optimizing query evaluation
Example query: continuously retrieve the police car closest to me

21 Advantage of Moving Cluster Abstraction
- When clusters don’t overlap, we avoid many joins of individual objects within those clusters
- Example: clusters m1 and m2 do not overlap, so there is no need to join objects/queries in m1 with queries/objects in m2
- If two abstractions do not overlap, we can discard negative candidates and avoid individual joins for spatio-temporal range queries
Legend: moving object, spatio-temporal range query
Scuba presented April 2006 at EDBT’06
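
The pruning idea can be sketched as a cheap cluster-level overlap test that runs before any object-level join. This is only an illustration under stated assumptions: clusters are modeled as circles, and the names `Cluster`, `overlaps`, and `candidate_pairs` are hypothetical rather than taken from SCUBA itself.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    cx: float      # centroid x
    cy: float      # centroid y
    radius: float  # assumed circular extent of the cluster
    members: list  # ids of the moving objects and queries inside


def overlaps(a, b):
    """Two circular clusters overlap if their centers are within the sum of radii."""
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= a.radius + b.radius


def candidate_pairs(clusters):
    """Return only the cluster pairs whose members still need individual joins."""
    pairs = []
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            if overlaps(a, b):
                pairs.append((a.members, b.members))
    return pairs


m1 = Cluster(0.0, 0.0, 1.0, ["o1", "o2", "q1"])
m2 = Cluster(10.0, 0.0, 1.0, ["o3", "q2"])
m3 = Cluster(1.5, 0.0, 1.0, ["o4", "q3"])
pairs = candidate_pairs([m1, m2, m3])
```

Here m1 and m2 are far apart, so none of their members are compared individually; only the m1/m3 pair survives the filter.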

22 Stream Queries for Mobile Traffic Services
- Range query: monitor the traffic in the red areas
- Reverse-NN query: send e-coupons to all cars for which I am the nearest gas station
- Range query: how many cars are in the highlighted area?

23 Raindrop: XQueries on XML Streams (or, Automaton Meets Algebra)
Funded by NSF 2005; in collaboration with Prof. Mani

24 What’s Special for XML Stream Processing?
- Token-by-token access manner along the timeline
- Token: not a direct counterpart of a tuple
- Pattern retrieval + filtering + restructuring
Example query:
  FOR $b in stream(biditems.xml)//book
  LET $p := $b/price
      $t := $b/title
  WHERE $p < 20
  Return $t
Example data (one book in the token stream):
  title: Dream Catcher; last: King; first: S.; publisher: Bt Bound; price: 30; year: 2001
Pattern Retrieval on Token Streams

25 Automata-Based Paradigm
Query:
  FOR $b in stream(biditems.xml)//book
  LET $p := $b/price
      $t := $b/title
  WHERE $p < 20
  Return $t
[Diagram: automaton with states 1 to 4 and transitions labeled *, book, title, and price, matching //book, //book/title, and //book/price]
Auxiliary structures for:
1. Buffering data
2. Filtering
3. Restructuring
…
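
A minimal sketch of token-by-token processing for the query above, using Python's standard `xml.etree.ElementTree.iterparse` as the token source. It is not Raindrop's implementation: the stack of open element names stands in for the automaton state, the `title` and `price` variables play the role of the auxiliary buffers, and the second book in the sample input is made up so that the filter has something to return.

```python
import io
import xml.etree.ElementTree as ET


def cheap_titles(xml_source, max_price=20):
    """Yield titles of books priced below max_price, reading start/end tokens once."""
    path = []             # stack of open element names (the automaton state)
    title = price = None  # buffered values for the book currently open
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if elem.tag == "book":
                title = price = None
            continue
        path.pop()
        if elem.tag == "book":
            # The whole book has been seen: filter, then release its memory.
            if price is not None and price < max_price:
                yield title
            elem.clear()
        elif path and path[-1] == "book":
            if elem.tag == "title":
                title = elem.text
            elif elem.tag == "price":
                price = float(elem.text)


# Hypothetical input; only "Dream Catcher" and its price come from the slides.
sample = (
    b"<biditems>"
    b"<book><title>Dream Catcher</title><price>30</price></book>"
    b"<book><title>Sample Title</title><price>15</price></book>"
    b"</biditems>"
)
titles = list(cheap_titles(io.BytesIO(sample)))
```

The filter can only be applied at the closing `book` token, which is why the title has to be buffered even when the price later turns out to be too high.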

26 Observations
- Either paradigm has deficiencies
- Both paradigms complement each other
Automata paradigm | Algebra paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Needs patches for filtering and restructuring | Good for filtering and restructuring
Presents all details on the same low level | Supports multiple descriptive levels (declarative -> procedural)
Little studied as a query processing paradigm | Well studied as a query processing paradigm

27 Towards One Uniform Algebraic View
[Diagram: algebraic stream plan: XML data stream -> token-based plan (automata plan) -> tuple stream -> tuple-based plan -> query answer]

28 Example Algebraic Plan
Query:
  FOR $b in stream(biditems.xml)//book
  LET $p := $b/price
      $t := $b/title
  WHERE $p < 30
  Return $t
[Diagram: plan split into a tuple-based plan and a token-based plan (automata plan)]

29 Example Uniform Algebraic Plan
Query:
  FOR $b in stream(biditems.xml)//book
  LET $p := $b/price
      $t := $b/title
  WHERE $p < 30
  Return $t
Plan operators shown:
- StructuralJoin $b
- ExtractNest $b, $p
- ExtractNest $b, $t
- Navigate $b, /price->$p
- Navigate $b, /title->$t
- Navigate $S1, //book->$b
Tuple-based plan

30 Example Uniform Algebraic Plan
Query:
  FOR $b in stream(biditems.xml)//book
  LET $p := $b/price
      $t := $b/title
  WHERE $p < 30
  Return $t
Plan operators shown:
- StructuralJoin $b
- ExtractNest $b, $p
- ExtractNest $b, $t
- Navigate $b, /price->$p
- Navigate $b, /title->$t
- Navigate $S1, //book->$b
- Select $p<30
- Tagger “Inexpensive”, $t->$r

31 Plan Rewriting: In or Out?
[Diagram: XML data stream -> token-based plan (automata plan) -> tuple stream -> tuple-based plan -> query answer]
- Pattern retrieval in semantics-focused plan
- Apply “push into automata”

32 Raindrop Plan Alternatives
[Diagram: three alternative plans arranged from “Out” to “In”, according to how much pattern retrieval is pushed into the automaton. Operator labels as they appear on the slide:]
- Nav $b, /price->$p; ExtractNest $b, $p; ExtractNest $b, $t; SJoin //book; Select price < 30; Tagger; Nav $b, /title->$t; Nav $S1, //book->$b
- ExtractNest $S1, $b; Navigate /price; Select price<30; Navigate book/title; Tagger; Nav $S1, //book->$b
- NavUnnest $S1, //book->$b; NavNest $b, /price->$p; NavNest $b, /title->$t; Select $p<30; Tagger “Inexpensive”, $t->$r
Statistics Collection and On-line Plan Migration

33 Raindrop: Research Contributions and Issues
- Costing/query optimization of plans
- On-the-fly migration into/out of automaton
- Physical implementation strategies of operators
- Exploit XML schema constraints for query optimization
- Load-shedding from an automaton
- Early memory release optimization
Published in CIKM’03, ER’03, DKE’06 Journal, VLDB’05, VLDB’06.

34 FireEngine Project: Sensors in Buildings

35 Fire Monitoring Queries
- Ambient queries: what are the typical temperature and humidity in given rooms, based on the environment?
- Detection queries: are unusual behaviors or patterns detected?
- Tracking queries: track smoke and heat clouds (moving clusters) in terms of their sizes and speeds.
- Analysis queries: is there an outlier (prank), or an actual fire?
- Reliability assessment: are any sensors faulty, and should thus be ignored?
- Prediction queries: match sensor readings of a fire with a fire stream simulation to determine similarity.
FireStream demo to be presented at ICDE’07

36 Project: RFID Event Stream Monitoring
- Given: potentially infinite, heterogeneous, high-speed event streams
- Goal: detect interesting patterns among events
  - Supply chain management, e.g., (“insufficient inventory” → “no-backup”) or “inventory overflow”
  - Business service optimization, e.g., “search ticket” → “timeout”
  - Anomaly detection, e.g., “pick item” → “no checkout” → “exit”
  - And more …
- Complex query patterns to be answered in real-time
Supported by NEC Cupertino and NSF Princeton

37 Event Processing Example
- Event stream: pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), …
- Event pattern query:
    EVENT SEQ(PICK p, !(CHECKOUT c), EXIT e)
    WHERE p.id=c.id AND c.id=e.id
    WITHIN 12 hours
- Processing:
  - Sequence scan & construction: (p, e) pairs
  - Selection: apply predicates
  - Window: check time constraints
  - Negation: check for negation
  - Transformation: make complex output event
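
The pattern above (a pick followed by an exit for the same id, with no checkout in between, within 12 hours) can be sketched in a few lines. This is an illustrative single-pass version, not the operator pipeline listed on the slide: it folds sequence construction, selection, window, and negation into one loop. The function name `unpaid_exits` and the timestamps attached to the events are assumptions.

```python
def unpaid_exits(events, window_hours=12):
    """Detect SEQ(PICK p, !(CHECKOUT c), EXIT e) with equal ids within the window.

    `events` is an iterable of (timestamp_in_hours, kind, item_id) in time order.
    """
    pending = {}  # item_id -> time of a pick that has not been checked out
    alerts = []
    for ts, kind, item_id in events:
        if kind == "pick":
            pending[item_id] = ts
        elif kind == "checkout":
            pending.pop(item_id, None)  # negation: a checkout cancels the match
        elif kind == "exit":
            picked_at = pending.pop(item_id, None)
            if picked_at is not None and ts - picked_at <= window_hours:
                # Transformation: emit one complex output event.
                alerts.append({"id": item_id, "picked_at": picked_at, "exit_at": ts})
    return alerts


# The event stream from the slide, with hypothetical timestamps added.
stream = [
    (0, "pick", 1), (1, "pick", 2), (2, "pick", 3),
    (3, "checkout", 3), (4, "pick", 4), (5, "exit", 2),
]
alerts = unpaid_exits(stream)
```

On this stream only item 2 leaves without a checkout, so it is the only alert raised.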

38 Challenges for High-Performance Processing
- Use “workflows” to terminate pattern queries early
- Optimize event pattern queries using rewriting
- Prefix sharing of multiple event pattern queries
- Scalable processing using a cluster

39 CAPE: Uncertainties in Stream Query Processing
[Diagram: continuous queries are registered with a scalable stream query engine, which consumes streaming data (push-based paradigm) and produces a streaming result]
- Real-time and accurate responses required
- Streaming data may have time-varying rates and high volumes
- Available resources for executing each operator may vary over time
- Distribution and adaptations are required
- High workload of queries
- Memory and CPU resource limitations (continuous evaluation)

40 CAPE: Continuous Adaptive Processing Engine -- Adaptation at all Layers
- Reactive Operator Algorithms
- Adaptive Scheduling of Operators
- On-Line Query Plan Reshaping
- Multi-Query Pipeline Sharing
- Synchronized Data Tree Spilling
- Adaptive Cluster-Driven Load Shedding
- Dynamic Workload Distribution over Cluster
- Data-Partitioning for Parallel Stream Processing

41 Adaptation Techniques in CAPE
- On-Line Query Plan Reshaping (with Yali Zhu and G. Heineman)
Published in ACM SIGMOD 2004, and in submission to the TODS journal

42 Run-time Plan Re-Optimization
- Step 1: Decide when to optimize (statistics monitoring)
- Step 2: Generate new query plan (query optimization)
- Step 3: Replace current plan by new plan (plan migration)

43 Naïve Plan Migration Strategy
- Migration steps:
  - Pause execution of old plan
  - Drain out all tuples inside old plan
  - Replace old plan by new plan
  - Resume execution of new plan
[Diagram: old and new join plans over streams A, B, and C, with joins labeled AB and BC]
Problem: works for stateless operators only

44 Stateful Operator in CQ
- Why stateful?
  - Need non-blocking operators in CQ
  - Operator needs to output partial results
- Example: symmetric hash join of A and B, keeping State A and State B
  - For each new tuple from A: purge State B, join with State B, insert into State A
Key observation: the purge of tuples in states relies on the processing of new tuples.
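
The purge / join / insert cycle of a symmetric hash join can be sketched as follows. This is a simplified illustration, not CAPE's operator: the class `WindowState`, the function `on_arrival`, the time-based window, and the sample tuples are all assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class WindowState:
    """State of one join input: a hash index plus arrival order for purging."""

    def __init__(self):
        self.by_time = deque()            # (ts, key) in arrival order
        self.by_key = defaultdict(deque)  # key -> deque of (ts, payload)

    def purge(self, cutoff):
        """Drop tuples older than the cutoff timestamp."""
        while self.by_time and self.by_time[0][0] < cutoff:
            _, key = self.by_time.popleft()
            self.by_key[key].popleft()

    def probe(self, key):
        return [payload for _, payload in self.by_key.get(key, ())]

    def insert(self, ts, key, payload):
        self.by_time.append((ts, key))
        self.by_key[key].append((ts, payload))


def on_arrival(states, side, ts, key, payload, window):
    """Process one new tuple: purge the opposite state, join with it, insert."""
    other = "B" if side == "A" else "A"
    states[other].purge(ts - window)
    matches = states[other].probe(key)
    states[side].insert(ts, key, payload)
    if side == "A":
        return [(payload, m) for m in matches]
    return [(m, payload) for m in matches]


states = {"A": WindowState(), "B": WindowState()}
r1 = on_arrival(states, "A", 0, 1, "a0", window=10)
r2 = on_arrival(states, "B", 5, 1, "b5", window=10)
r3 = on_arrival(states, "A", 20, 1, "a20", window=10)
```

After the third arrival, State A still holds the expired tuple `a0`: nothing has arrived on B since time 5 to trigger a purge of State A, which is exactly the dependency the key observation points out.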

45 Naïve Migration Strategy Revisited
- Steps:
  (1) Pause execution of old plan
  (2) Drain out all tuples inside old plan
  (3) Replace old plan by new plan
  (4) Resume execution of new plan
[Diagram: join plans over A, B, and C, annotated “(2) All tuples drained”, “(3) Old replaced by new”, “(4) Processing resumed”]
Problem: deadlock waiting

46 Proposed Dynamic Migration Strategies
- Moving State Strategy
- Parallel Track Strategy

47 Moving State Strategy
- Basic idea: share common states between two boxes
- Key steps:
  - State matching: identify common states
  - State moving: share common states
  - State recomputing: recompute unmatched states

48 Moving State Strategy
- State matching
  - Each state in the old box has a unique ID
  - During rewriting, a new ID is given to each newly generated state in the new box
  - When rewriting is done, match states based on IDs
- State moving
  - Between matched states
  - On the same machine, create new pointers for matched states in the new box
- What’s left?
  - Unmatched states in the new box
[Diagram: old box and new box, each over input queues Q_A, Q_B, Q_C, Q_D with output queue Q_ABCD; the old box holds states S_A, S_B, S_AB, S_C, S_ABC, S_D; the new box holds states S_A, S_B, S_C, S_BC, S_D, S_BCD]

49 Unmatched States
- State recomputing
  - Recursively recompute unmatched S_BC and S_BCD by joining matched states
- Why always possible?
  - Old and new boxes have the same input queues
  - The states associated with input queues always match
- Why necessary?
[Diagram: new box with states S_A, S_B, S_C, S_BC, S_D, S_BCD over input queues Q_A, Q_B, Q_C, Q_D and output queue Q_ABCD]
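
State matching, moving, and recomputing can be sketched as below. This is a toy model under loud assumptions: states are plain lists of dictionaries, state names double as the unique IDs, every join is an equi-join on a shared attribute `k`, and the helper names `join` and `migrate` are hypothetical. The real strategy works on operator states inside a query engine.

```python
def join(left, right):
    """Equi-join two states on the assumed shared attribute 'k'."""
    return [{**l, **r} for l in left for r in right if l["k"] == r["k"]]


def migrate(old_states, new_plan):
    """Build the states of the new box from the states of the old box.

    `new_plan` lists (state_id, left_child_id, right_child_id) bottom-up;
    states fed directly by an input queue have no children.
    """
    new_states = {}
    for state_id, left, right in new_plan:
        if state_id in old_states:
            # State moving: a matched state is shared by reference, not copied.
            new_states[state_id] = old_states[state_id]
        else:
            # State recomputing: join the (already available) child states.
            new_states[state_id] = join(new_states[left], new_states[right])
    return new_states


old_states = {
    "S_A": [{"k": 1, "a": "a1"}],
    "S_B": [{"k": 1, "b": "b1"}],
    "S_C": [{"k": 1, "c": "c1"}],
    "S_D": [{"k": 1, "d": "d1"}],
    "S_AB": [{"k": 1, "a": "a1", "b": "b1"}],
    "S_ABC": [{"k": 1, "a": "a1", "b": "b1", "c": "c1"}],
}
new_plan = [
    ("S_A", None, None), ("S_B", None, None),
    ("S_C", None, None), ("S_D", None, None),
    ("S_BC", "S_B", "S_C"),
    ("S_BCD", "S_BC", "S_D"),
]
new_states = migrate(old_states, new_plan)
```

The input-queue states always match, so the recursion in `migrate` always has something to start from; S_AB and S_ABC have no counterpart in the new box and are simply dropped.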

50 MS Migration Pros and Cons
- Pros
  - Fast when # of tuples in states is small (low input rates or small window size)
- Cons
  - Output silence during entire migration stage
- Can we output results even during migration?
  - Motivation for Parallel Track Strategy

51 Parallel Track Strategy
- Basic idea
  - Execute both old and new plans in parallel
  - Gradually “push” old tuples out of old box by purging
- Key steps
  - Connect new box
  - Execute both boxes in parallel
  - Remove old box once “expired”: it contains only new tuples, and no old tuples or sub-tuples

52 Parallel Track Strategy
- Connect boxes
- Execute in parallel until all old tuples are purged
- Disconnect old box
[Diagram: old box (states S_A, S_B, S_AB, S_C, S_ABC, S_D) and new box (states S_A, S_B, S_C, S_BC, S_D, S_BCD) connected to the same input queues Q_A, Q_B, Q_C, Q_D and output queue Q_ABCD; example: a tuple ABC in S_ABC]

53 PT Migration Pros and Cons
- Pros
  - Keeps producing results even during migration (MS produces no results during migration)
- Cons
  - Migration duration is at least 2W
  - MS may be faster, depending on the # of tuples in states

54 Summary: Stream Plan Migration
- First run-time solution for stateful operators
- Two migration methods:
  - Moving State Strategy
  - Parallel Track Strategy
- Cost models and experimental evaluations
What next?
- Scope of optimization?
- Support of other stateful operators?
- Migration in distributed stream systems?

55 Overall Summary: So Much Left to Do!
- Large variety of challenging stream applications
- Generic core technology for stream processing engines
- Our central theme: Optimization via Adaptation
  - Part I: Plan migration
  - Part II: Plan distribution
  - Part III: Plan-level spill
- Many open questions remain...

56 Thank You For Your Patience! The End

57 Acknowledgments
- All the students (Ph.D., MS, and undergraduate) in the DSRG lab who have contributed to this research project directly or indirectly.
- Most notably: Luping Ding, Yali Zhu, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Venky Raghavan, Song Wang, Natasha Bogdanova, Mingzhu Wei, Ming Li, and others.
- To the National Science Foundation for partial support via IDM grants, to WPI for the RDC grant, and to IBM and NEC.

58 Selected CAPE Publications and Reports
[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter, July 2004. http://www.cs.uno.edu/~nauman/streamBook/
[ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604.
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003.
[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration Paper. VLDB 2004.
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18, 2004.
[SPR04] T. Sutherland, B. Pielech, Yali Zhu, Luping Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005.
[SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005.
[LR05] Bin Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005.
[B05] Bin Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report, WPI-CS-TR-05, 2005. (in submission)
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html

