Presentation is loading. Please wait.

Presentation is loading. Please wait.

Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004.

Similar presentations


Presentation on theme: "Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004."— Presentation transcript:

1 Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

2 Outline Motivation of DSMS Aurora (Brown, Brandeis, MIT) Model Operator Scheduling Storage/Memory Management QoS issue STREAM (Stanford) System Architecture Query Language Query Plans and Execution Performance Issues Approximation Techniques STREAM Interface Conclusion

3 Motivation HADP  DAHP Continuous data and static queries Monitoring using sensor Military Traffic Environment Financial analysis Object tracking

4 Aurora

5 Aurora – Model General Purpose DSMS Continuous stream data comes Flow through a set of operators Output to application or materialized

6 Aurora – Model Components Storage manager Scheduler Load Shedder Router QoS Monitor GUI

7 Aurora – Model 3 kinds of query supported Continuous View Ad-Hoc Query

8 Aurora – Model 8 primitive operators (Box) Windowed Slide Tumble Latch Resample Non-windowed Filter Map GroupBy Join

9 Aurora – Operator Optimization Each operator associated with Selectivity: s(b), sel(b) Computation time: c(b), cost(b) General Optimization Techniques Pushing projection upstream Combining boxes Reordering boxes

10 Aurora – Operator Optimization Case 1 : cost of a  b c(a) + s(a)c(b) Case 2: cost of b  a c(b) + s(b)c(a) Criteria for switching box position c(a)+s(a)c(b) > c(b)+s(b)c(a) ab ba

11 Aurora – Operator Scheduling Scheduling by OS One thread per box, shift the job to OS Easier to program Aurora Scheduler Single thread for the scheduler The scheduler pick a box with highest priority and call the box to consume tuples from queue Allow finer control of resource Scalable !

12 Aurora – Operator Scheduling

13 Problem: which box to execute next? Min-Cost (MC) Reduce computation cost Min-Latency (ML) Return result as soon as possible Min-Memory (MM) Reduce memory usage of queue

14 Aurora – Operator Scheduling Example b4 b5 b6 b2 b3 b1 streams application Downstream

15 Aurora – Operator Scheduling Min-Cost Objective: avoid overhead of calling boxes Min-Latency Prefer box which can produce tuples in the output at a shorter period of time Min-Memory Give preference to box which will consume more tuples with less computation time Similar to “Chain Operator Scheduling” More at: Operator Scheduling in a Data Stream Manager, VLDB 2003

16 Aurora – Storage/Memory Management Manage the queue in front of each box 2 boxes sharing the same queue windowed operator The initial queue size is 128 KB Queues are managed as a circular queue If overflow, double the queue size, or vice versa

17 Aurora – Storage/Memory Management Swap in/out between memory / disk based on priority of boxes using it Work with Operator Scheduler to exchange box priority and buffer-state information Connection Point Management A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query

18 Aurora – Storage/Memory Management

19 Aurora – QoS Issue Different queries/applications have different QoS requirement Stock market monitoring Average temperature of a set of sensor QoS Graph

20 Latency-based QoS Graph b time QoS 0 D(b) eol(b) est(b) latency(b) cost(D(b)) Critical Point

21 Aurora – QoS-driven Scheduling Assign priority to each box based on priority (b) = [utility (b), est (b)] utility (b) = gradient (eol (b)) How is the QoS degrading by the time the tuple leave the system when we process it now. est (b) How soon it will exhibit another performance degradation if we don’t process it now. Performance 200 queries/application, each with 5 boxes Round robin - 0.43 QoS driven scheduling – 0.85

22 Aurora – Current Status Main components of a DSMS are introduced Operator scheduler Memory/storage management QoS concept in stress environment Load shedding Implemented in C++, with Java-based GUI Dependent on a few software/library More? Distributed architecture – Aurora* Fault tolerance or disaster recovery ?

23 STREAM

24 STREAM – Introduction General-purpose prototype DSMS Supports data streams and stored relations Declarative language for registering continuous queries Flexible query plans and execution strategies Aggressive sharing of state and computation among queries

25 STREAM – Introduction Designed to cope with Stream rates that may be high, variable, bursty Continuous query loads that may be high, volatile Primary coping techniques Graceful approximation as necessary Careful resource allocation and use Continuous self-monitoring and reoptimization

26 DSMS Scratch Store STREAM – System Architecture Input streams Register Query Streamed Result Stored Result Archive Stored Relations

27 STREAM – Query Language Continuous Query Language – CQL Extends SQL with Streams as new data type Stream: Unbounded bag of pairs Relation: time-varying bags of tuples Continuous instead of one-time semantics Three classes of operators Relation-to-relation Stream-to-relation Relation-to-stream

28 STREAM – CQL Operators Relation-to-relation SQL constructs Stream-to-relation Tuple-based sliding window: [Rows N], [Rows Unbounded] Time-based sliding window: [Range ω], [Now] Partitioned sliding window: [Partition By A 1,…A k Rows N] Relation-to-stream Istream: insert stream Dstream: delete stream Rstream: relation stream

29 STREAM – Example Query 1 Two example streams: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”: Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

30 STREAM – Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

31 STREAM – Simplified Query 2 Result is a relation, updated as stream elements arrive: Select F.clerk, Max(O.cost) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

32 STREAM – Simplified Query 2 Result is streamed: Emits stream element whenever max changes for a clerk (or new clerk): Select Istream(F.clerk, Max(O.cost)) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

33 STREAM – Example Query 3 Relation: CurPrice(stock, price) Average price over last day for each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Istream provides history of CurPrice Window on history (back to relation), group and aggregate

34 STREAM – Query plans and Execution When a continuous query is registered, generate a query plan New plan merged with existing plans Users can also create & manipulate plans directly Plans composed of three main components: Operators Flag: insertion(+), deletion (-) Elements: tuple-timestamp-flag tuples Streams: only + elements Relations: both + and - elements Queues Enforce nondecreasing timestamps (“heartbeats”) Mechanisms for buffering tuples States (Synopses) Global scheduler for plan execution

35 STREAM – States States (Synopses) Summarize elements seen so far (exact or approximate) for operators requiring history To implement windows Example: synopsis join Sliding-window join Approximation of full join State 1 State 2 ⋈

36 STREAM – Simple Query Plan Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

37 STREAM – Performance Issues Synopsis Sharing Eliminate data redundancy Exploiting Constraints Selectively discard data to reduce state Operator Scheduling Reduce queue sizes

38 STREAM – Synopsis Sharing Eliminate redundancy by replacing the nearly identical synopses with light weight stubs a single store to hold the actual tuples Store tracks the progress of each stub, presents the appropriate view to each stub. The store contains the union of its corresponding stubs

39 STREAM – Synopsis Sharing Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10 Select A, Max(B) From S1 [Rows 200] Group By A

40 STREAM – Exploiting Constraints Specify an adherence parameter k to capture how closely a given stream or sets of streams adheres to a constraint of that type Referential integrity k-constraint Ordered-arrival k-constraint Clustered-arrival k-constraint Query execution plans reduce or eliminate sate based on k-constraints If constraint violated, get approximate result

41 STREAM – Operator Scheduling Goal: Goal: minimize total queue size for unpredictable, bursty stream arrival patterns Chain Scheduling Algorithm: Chain Scheduling Algorithm: 1. Mark the first operator in the plan as the “current” operator 2. Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time. 3. Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains. 4. Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order. Proven: Proven: within constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries Empirical results: Empirical results: large savings over naive strategies for many queries But minimizing queue sizes is at odds with minimizing latency

42 STREAM – Approximation CPU-Limited Approximation Insufficient CPU time to process each stream element due to the high data arrival rate. load-shedding sampling operators Approximate by probabilistically dropping elements before they are processed Memory-Limited Approximation The total state required for all registered queries exceeds available memory. The system selectively shrinks or discards synopses.

43 STREAM – Query Interface View the structure of query plans the their component entities. View the detailed properties of each entity. Dynamically adjust entity properties. View monitoring graphs that display time- varying entity properties plotted dynamically against time. Queue sizes, throughput, overall memory usage, and join selectivity.

44 STREAM – Query Plan Monitoring

45 STREAM – Current Status Version 1.0 up and running Includes a new monitoring and adaptive query processing infrastructure – StreaMon Executor runs query plans to produce results. Profiler collects and maintains statistics about stream and plan characteristics. Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics. Web demo available at http://shark.stanford.edu:8080/http://shark.stanford.edu:8080/ Future Directions: Distributed Stream Processing Crash Recovery Improved Approximation Classification of Applications

46 Conclusion Ideal DSMS Well defined and flexible query language User-friendly interface Scalable Operator scheduling Storage management Synopsis sharing Approximation Quality assurance Fault tolerant

47 References R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in proceedings of the 1st CIDR Conference, 2003. S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in proceedings of SIGMOD Conference, 2002 D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of VLDB conference, 2002. D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of VLDB conference, 2003 Stanford STREAM Project Website: http://www- db.stanford.edu/stream/index.htmlhttp://www- db.stanford.edu/stream/index.html Aurora Project Website: http://www.cs.brown.edu/research/aurorahttp://www.cs.brown.edu/research/aurora

48 End


Download ppt "Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004."

Similar presentations


Ads by Google