Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004.

Outline Motivation of DSMS Aurora (Brown, Brandeis, MIT) Model Operator Scheduling Storage/Memory Management QoS issue STREAM (Stanford) System Architecture Query Language Query Plans and Execution Performance Issues Approximation Techniques STREAM Interface Conclusion

Motivation HADP → DAHP (human-active, DBMS-passive → DBMS-active, human-passive) Continuous data and static queries Monitoring using sensors: military, traffic, environment Financial analysis Object tracking

Aurora

Aurora – Model General Purpose DSMS Continuous stream data comes Flow through a set of operators Output to application or materialized

Aurora – Model Components Storage manager Scheduler Load Shedder Router QoS Monitor GUI

Aurora – Model 3 kinds of query supported Continuous View Ad-Hoc Query

Aurora – Model 8 primitive operators (Box) Windowed Slide Tumble Latch Resample Non-windowed Filter Map GroupBy Join

Aurora – Operator Optimization Each operator associated with Selectivity: s(b), sel(b) Computation time: c(b), cost(b) General Optimization Techniques Pushing projection upstream Combining boxes Reordering boxes

Aurora – Operator Optimization Case 1: cost of a → b is c(a) + s(a)c(b). Case 2: cost of b → a is c(b) + s(b)c(a). Criterion for switching box positions (a → b becomes b → a): c(a) + s(a)c(b) > c(b) + s(b)c(a).
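The switching criterion on this slide can be checked numerically. The sketch below is only an illustration: the function names `pipeline_cost` and `should_swap` and the sample costs and selectivities are assumptions, not part of Aurora.

```python
def pipeline_cost(c_first, s_first, c_second):
    """Expected per-tuple cost of running `first` and then `second`.

    Every tuple pays c_first; only the fraction s_first that survives
    the first box also pays c_second.
    """
    return c_first + s_first * c_second


def should_swap(c_a, s_a, c_b, s_b):
    """True when the order b -> a is strictly cheaper than a -> b."""
    return pipeline_cost(c_a, s_a, c_b) > pipeline_cost(c_b, s_b, c_a)


# Hypothetical boxes: a is expensive and passes most tuples,
# b is cheap and filters most tuples out.
cost_ab = pipeline_cost(4.0, 0.9, 1.0)   # a -> b
cost_ba = pipeline_cost(1.0, 0.2, 4.0)   # b -> a
swap = should_swap(4.0, 0.9, 1.0, 0.2)
```

With these assumed numbers the cheap, selective box b should run first, which is what the criterion reports.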

Aurora – Operator Scheduling Scheduling by OS: one thread per box, shifting the job to the OS; easier to program. Aurora Scheduler: a single thread for the scheduler; the scheduler picks the box with the highest priority and calls the box to consume tuples from its queue; allows finer control of resources; scalable.
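A single-threaded scheduler of the kind described here can be sketched in a few lines. This is a simplified illustration, not Aurora's implementation: the `Box` class, the fixed numeric priorities, and the rule that a chosen box drains its whole queue in one call are all assumptions.

```python
from collections import deque


class Box:
    """Hypothetical operator box with an input queue in front of it."""

    def __init__(self, name, fn, priority, downstream=None):
        self.name = name
        self.fn = fn                  # maps one tuple to a list of output tuples
        self.priority = priority      # assumed static; Aurora derives it dynamically
        self.downstream = downstream  # next box, or None if it feeds the application
        self.queue = deque()


def run_scheduler(boxes, sink):
    """Repeatedly run the highest-priority box that has queued tuples.

    Returns the order in which boxes were called.
    """
    trace = []
    while True:
        ready = [b for b in boxes if b.queue]
        if not ready:
            return trace
        box = max(ready, key=lambda b: b.priority)
        trace.append(box.name)
        # The chosen box consumes everything currently in its queue.
        while box.queue:
            for out in box.fn(box.queue.popleft()):
                if box.downstream is None:
                    sink.append(out)
                else:
                    box.downstream.queue.append(out)
```

Because only one thread runs, the scheduler decides exactly which box works next and for how long, which is the finer control of resources the slide refers to.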

Aurora – Operator Scheduling

Problem: which box to execute next? Min-Cost (MC) Reduce computation cost Min-Latency (ML) Return result as soon as possible Min-Memory (MM) Reduce memory usage of queue

Aurora – Operator Scheduling Example [Figure: a network of boxes b1 to b6 between the input streams and the application; tuples flow downstream]

Aurora – Operator Scheduling Min-Cost Objective: avoid overhead of calling boxes Min-Latency Prefer box which can produce tuples in the output at a shorter period of time Min-Memory Give preference to box which will consume more tuples with less computation time Similar to “Chain Operator Scheduling” More at: Operator Scheduling in a Data Stream Manager, VLDB 2003

Aurora – Storage/Memory Management Manage the queue in front of each box (including 2 boxes sharing the same queue, and windowed operators). The initial queue size is 128 KB. Queues are managed as circular queues. On overflow the queue size is doubled, and vice versa.
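The doubling behaviour can be illustrated with a small growable circular queue. This is a sketch under assumptions: capacity is counted in tuples rather than kilobytes, the class name `CircularQueue` is made up, and shrinking (the "vice versa" case) is left out because the slide does not say when it happens.

```python
class CircularQueue:
    """Array-backed circular queue that doubles its capacity on overflow."""

    def __init__(self, capacity=4):
        self.buf = [None] * capacity
        self.head = 0   # index of the oldest element
        self.size = 0   # number of stored elements

    def push(self, item):
        if self.size == len(self.buf):
            self._resize(2 * len(self.buf))
        self.buf[(self.head + self.size) % len(self.buf)] = item
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty queue")
        item = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        return item

    def _resize(self, new_capacity):
        # Copy elements out in FIFO order so the new buffer starts at index 0.
        items = [self.buf[(self.head + i) % len(self.buf)] for i in range(self.size)]
        self.buf = items + [None] * (new_capacity - self.size)
        self.head = 0
```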

Aurora – Storage/Memory Management Swap in/out between memory / disk based on priority of boxes using it Work with Operator Scheduler to exchange box priority and buffer-state information Connection Point Management A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query

Aurora – Storage/Memory Management

Aurora – QoS Issue Different queries/applications have different QoS requirements, e.g. stock market monitoring vs. the average temperature of a set of sensors. QoS Graph

Latency-based QoS Graph [Figure: QoS plotted against time for a box b, with a critical point; labels: latency(b), D(b), cost(D(b)), eol(b), est(b)]

Aurora – QoS-driven Scheduling Assign a priority to each box: priority(b) = [utility(b), est(b)]. utility(b) = gradient(eol(b)): how much the QoS will have degraded by the time the tuple leaves the system if we process it now. est(b): how soon it will exhibit another performance degradation if we don't process it now. Performance: 200 queries/applications, each with 5 boxes. Round robin QoS driven scheduling – 0.85
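One way to read the priority pair is sketched below. Several things here are assumptions rather than facts from the slides: the QoS graph is taken to be piecewise linear, utility is taken as the magnitude of its slope at the expected output latency, and ties are broken in favour of the smaller est(b). The names `qos_slope` and `order_boxes` are hypothetical.

```python
import bisect


def qos_slope(points, latency):
    """Slope of a piecewise-linear latency/QoS graph at the given latency.

    `points` is a list of (latency, qos) pairs sorted by latency.
    Latencies outside the graph use the nearest segment.
    """
    xs = [x for x, _ in points]
    i = bisect.bisect_right(xs, latency) - 1
    i = max(0, min(i, len(points) - 2))
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return (y1 - y0) / (x1 - x0)


def order_boxes(boxes, graph):
    """Sort boxes by (utility descending, est ascending).

    Each box is a dict with keys 'name', 'eol' (expected output latency)
    and 'est' (expected slack time).
    """
    def key(box):
        utility = -qos_slope(graph, box["eol"])  # steeper QoS drop = more urgent
        return (-utility, box["est"])
    return [b["name"] for b in sorted(boxes, key=key)]


# Assumed graph: QoS stays at 1.0 until the critical point (latency 2),
# then falls linearly to 0.0 at latency 6.
graph = [(0, 1.0), (2, 1.0), (6, 0.0)]
boxes = [
    {"name": "x", "eol": 1, "est": 0.1},   # still before the critical point
    {"name": "y", "eol": 3, "est": 1.0},   # already losing QoS
    {"name": "z", "eol": 4, "est": 0.5},   # losing QoS, less slack than y
]
order = order_boxes(boxes, graph)
```

Under these assumptions, boxes whose tuples are already on the falling part of the graph are served first, and among those the one with the least slack wins.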

Aurora – Current Status Main components of a DSMS are introduced: operator scheduler, memory/storage management, QoS concept in stress environments, load shedding. Implemented in C++, with a Java-based GUI. Depends on a few external software packages/libraries. More? Distributed architecture – Aurora* Fault tolerance or disaster recovery?

STREAM

STREAM – Introduction General-purpose prototype DSMS Supports data streams and stored relations Declarative language for registering continuous queries Flexible query plans and execution strategies Aggressive sharing of state and computation among queries

STREAM – Introduction Designed to cope with Stream rates that may be high, variable, bursty Continuous query loads that may be high, volatile Primary coping techniques Graceful approximation as necessary Careful resource allocation and use Continuous self-monitoring and reoptimization

STREAM – System Architecture [Figure: the DSMS receives input streams and registered queries and produces streamed and stored results; it uses a scratch store, an archive, and stored relations]

STREAM – Query Language Continuous Query Language – CQL Extends SQL with streams as a new data type Stream: unbounded bag of <tuple, timestamp> pairs Relation: time-varying bag of tuples Continuous instead of one-time semantics Three classes of operators: Relation-to-relation, Stream-to-relation, Relation-to-stream

STREAM – CQL Operators Relation-to-relation: SQL constructs. Stream-to-relation: tuple-based sliding window: [Rows N], [Rows Unbounded]; time-based sliding window: [Range ω], [Now]; partitioned sliding window: [Partition By A1, …, Ak Rows N]. Relation-to-stream: Istream (insert stream), Dstream (delete stream), Rstream (relation stream).
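The interplay of a stream-to-relation window and the relation-to-stream operators can be sketched as follows. This is an illustration only, not STREAM code: it assumes at most one arrival per timestamp, models a relation as a `Counter` (a bag), and uses the hypothetical function names `rows_window`, `istream` and `dstream`.

```python
from collections import Counter


def rows_window(stream, n):
    """[Rows n]: after each arrival, the relation holds the last n tuples.

    `stream` is a list of (tuple, timestamp) pairs in timestamp order.
    Returns a list of (timestamp, relation) snapshots.
    """
    window, states = [], []
    for tup, ts in stream:
        window.append(tup)
        window = window[-n:]
        states.append((ts, Counter(window)))
    return states


def istream(states):
    """Emit (tuple, ts) for each tuple inserted into the relation at ts."""
    out, prev = [], Counter()
    for ts, rel in states:
        out.extend((tup, ts) for tup in (rel - prev).elements())
        prev = rel
    return out


def dstream(states):
    """Emit (tuple, ts) for each tuple deleted from the relation at ts."""
    out, prev = [], Counter()
    for ts, rel in states:
        out.extend((tup, ts) for tup in (prev - rel).elements())
        prev = rel
    return out


states = rows_window([("a", 1), ("b", 2), ("c", 3)], n=2)
inserted = istream(states)
deleted = dstream(states)
```

In this example "a" falls out of the two-row window when "c" arrives, so it is the only element of the delete stream.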

STREAM – Example Query 1 Two example streams: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”: Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

STREAM – Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

STREAM – Simplified Query 2 Result is a relation, updated as stream elements arrive: Select F.clerk, Max(O.cost) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

STREAM – Simplified Query 2 Result is streamed: Emits stream element whenever max changes for a clerk (or new clerk): Select Istream(F.clerk, Max(O.cost)) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

STREAM – Example Query 3 Relation: CurPrice(stock, price) Average price over last day for each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Istream provides history of CurPrice Window on history (back to relation), group and aggregate

STREAM – Query Plans and Execution When a continuous query is registered, a query plan is generated and merged with existing plans. Users can also create & manipulate plans directly. Plans are composed of three main components: Operators (elements are tuple-timestamp-flag triples; the flag marks insertion (+) or deletion (-); streams carry only + elements, relations carry both + and - elements), Queues (enforce nondecreasing timestamps (“heartbeats”); mechanisms for buffering tuples), and States (synopses). Global scheduler for plan execution

STREAM – States States (synopses) summarize elements seen so far (exact or approximate) for operators requiring history, e.g. to implement windows. Example: synopsis join, used for a sliding-window join or an approximation of a full join [Figure: a join operator (⋈) with State 1 and State 2]

STREAM – Simple Query Plan Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

STREAM – Performance Issues Synopsis Sharing Eliminate data redundancy Exploiting Constraints Selectively discard data to reduce state Operator Scheduling Reduce queue sizes

STREAM – Synopsis Sharing Eliminate redundancy by replacing nearly identical synopses with lightweight stubs and a single store that holds the actual tuples. The store tracks the progress of each stub and presents the appropriate view to each stub. The store contains the union of its corresponding stubs.
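The store-and-stub idea can be sketched for the simple case of tuple-based windows over one stream. Everything here is a simplifying assumption: the class names `SharedStore` and `RowsStub` are made up, and only [Rows N] views are modelled.

```python
class RowsStub:
    """Lightweight stub: a [Rows n] view onto tuples held by a shared store."""

    def __init__(self, store, rows):
        self.store = store
        self.rows = rows

    def view(self):
        return self.store.tuples[-self.rows:]


class SharedStore:
    """Holds each tuple once and keeps the union of what its stubs need."""

    def __init__(self):
        self.tuples = []
        self.stubs = []

    def stub(self, rows):
        s = RowsStub(self, rows)
        self.stubs.append(s)
        return s

    def insert(self, tup):
        self.tuples.append(tup)
        if self.stubs:
            # The largest window covers every other stub's window.
            keep = max(s.rows for s in self.stubs)
            del self.tuples[:-keep]
```

Two queries over the same stream with different window sizes then share one copy of the data; memory is bounded by the largest window rather than by the sum of all windows.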

STREAM – Synopsis Sharing Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10 Select A, Max(B) From S1 [Rows 200] Group By A

STREAM – Exploiting Constraints Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of that type: referential integrity k-constraint, ordered-arrival k-constraint, clustered-arrival k-constraint. Query execution plans reduce or eliminate state based on k-constraints. If a constraint is violated, the result is approximate.

STREAM – Operator Scheduling Goal: minimize total queue size for unpredictable, bursty stream arrival patterns. Chain Scheduling Algorithm: 1. Mark the first operator in the plan as the “current” operator. 2. Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time. 3. Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains. 4. Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order. Proven: within a constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries. Empirical results: large savings over naive strategies for many queries. But minimizing queue sizes is at odds with minimizing latency.
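Steps 1 to 3 (partitioning a plan into chains) can be sketched as below. This is an illustration under assumptions: the plan is a single linear path, each operator is described by a hypothetical (time, selectivity) pair with positive time, a tuple entering the plan has size 1, and ties between equally steep blocks go to the shorter block. The greedy cross-chain scheduling of step 4 is not shown.

```python
def chain_partition(ops):
    """Split a linear operator path into chains.

    `ops` is a list of (time_per_tuple, selectivity) pairs.
    Returns a list of (operator_indices, slope) where slope is the
    reduction in queue size per unit time achieved by that chain.
    """
    # Progress chart: cumulative processing time and remaining tuple size.
    times, sizes = [0.0], [1.0]
    for t, s in ops:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains, cur = [], 0
    while cur < len(ops):
        best_j, best_slope = None, None
        # Step 2: find the block starting at `cur` with the steepest drop.
        for j in range(cur + 1, len(ops) + 1):
            slope = (sizes[cur] - sizes[j]) / (times[j] - times[cur])
            if best_slope is None or slope > best_slope:
                best_j, best_slope = j, slope
        chains.append((list(range(cur, best_j)), best_slope))
        cur = best_j  # Step 3: continue after the block just assigned
    return chains


# Hypothetical plan: a weak filter, a strong filter, then a slow operator.
chains = chain_partition([(1.0, 0.9), (1.0, 0.1), (2.0, 0.5)])
```

Here the first two operators form one chain because running them together shrinks the queue faster per unit time than the first operator does alone.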

STREAM – Approximation CPU-Limited Approximation Insufficient CPU time to process each stream element due to the high data arrival rate. load-shedding sampling operators Approximate by probabilistically dropping elements before they are processed Memory-Limited Approximation The total state required for all registered queries exceeds available memory. The system selectively shrinks or discards synopses.
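Probabilistic dropping can be sketched with a simple sampling operator. The function names `shed` and `estimated_sum` and the parameter `drop_prob` are assumptions for illustration; scaling the sum by 1 / (1 - drop_prob) is the usual correction for uniform sampling, not something stated on the slide.

```python
import random


def shed(stream, drop_prob, rng=None):
    """Yield stream elements, dropping each one with probability drop_prob."""
    rng = rng or random.Random()
    for element in stream:
        if rng.random() >= drop_prob:
            yield element


def estimated_sum(kept, drop_prob):
    """Approximate the sum over the full stream from the surviving elements."""
    if drop_prob >= 1.0:
        raise ValueError("cannot estimate when every element is dropped")
    return sum(kept) / (1.0 - drop_prob)


# Seeded generator so the run is repeatable.
kept = list(shed(range(1000), drop_prob=0.5, rng=random.Random(0)))
approx = estimated_sum(kept, drop_prob=0.5)
```

Dropping before the operators run saves the CPU time those tuples would have cost, at the price of an approximate answer.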

STREAM – Query Interface View the structure of query plans and their component entities. View the detailed properties of each entity. Dynamically adjust entity properties. View monitoring graphs that display time-varying entity properties plotted dynamically against time: queue sizes, throughput, overall memory usage, and join selectivity.

STREAM – Query Plan Monitoring

STREAM – Current Status Version 1.0 up and running Includes a new monitoring and adaptive query processing infrastructure – StreaMon Executor runs query plans to produce results. Profiler collects and maintains statistics about stream and plan characteristics. Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics. Web demo available at Future Directions: Distributed Stream Processing Crash Recovery Improved Approximation Classification of Applications

Conclusion Ideal DSMS Well defined and flexible query language User-friendly interface Scalable Operator scheduling Storage management Synopsis sharing Approximation Quality assurance Fault tolerant

References
R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference.
S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002.
D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference.
D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003.
Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html
Aurora Project Website:

End