PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research.

Presentation transcript:

PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research Society (DFG) grant Se 553/4-2

2 Information Landscape (figure with the labels Input and Output)

3 Outline: Motivation and problem definition; Sliding Windows; Query Processing in PIPES (Data Stream Model, Logical Operators, Algebraic Query Optimization, Physical Operators, Runtime Environment); Dynamic Plan Migration; Conclusions

4 Example Application: traffic monitoring. Data format: continuous dataflow (streams); variable stream rates; time and location dependence. Queries: continuous, long-running, for example “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?” Schema: HighwayStream( lane, speed, length, timestamp )

5 Data Streams. Continuously arriving sequence of records: time as an integral component. Autonomous data sources: sensors, mobile devices, software agents, … Important type of data: miniaturization of hardware; ubiquitous networks.

6 Requirements. Declarative query language: expressive like (temporal) SQL; join of data streams according to time; combination of data streams with persistent databases; assigns meaning to data; query results as a data stream. Publish/subscribe paradigm: Subscribe: users register new queries; Publish: continuous report of results. Quality of Service (QoS): e.g. at least one record per second. Scalability: number of data sources; number of subscribed queries.

7 Stream Query Processing. Similar to a traditional DBMS: 1. Queries expressed in CQL (SQL-like query language). 2. Logical query plan (algebra with “relational” operators). 3. Query optimization (algebraic rules; simple but accurate cost model). 4. Physical query plan (select physical operators). 5. Processing of the query.

8 What is special about PIPES? PIPES provides an infrastructure for DSMS (DSMS = Data Stream Management System; PIPES = Public Infrastructure for Processing and Exploring Data Streams). Differences to DBMS: semantics is borrowed from temporal databases (expressiveness, query optimization); data-driven query processing (publish/subscribe); adaptive runtime environment (dynamic assignment of resources at runtime; scalability, QoS); continuous optimization of queries (plan migration; scalability, QoS).

10 2. Sliding Windows. Requirements of users: no impact of outdated data on the result; integration of different streams according to time. Moving temporal windows: finite subsequence of an infinite stream; query processing is restricted to the most recent data; important for expressive and efficient query processing. Options: count-based windows (FIFO queue of size w); time-based windows (t: time stamp of an element; t + w + 1: end of the validity of an element).
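The two window options on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `count_window` and `time_window_validity` are hypothetical and not part of PIPES, and timestamps are taken to be integers.

```python
from collections import deque

def count_window(stream, w):
    """Yield the contents of a FIFO window holding the last w elements."""
    window = deque(maxlen=w)  # the oldest element is evicted automatically
    for element in stream:
        window.append(element)
        yield list(window)

def time_window_validity(t, w):
    """Return (start, end) of an element's validity: start t, end t + w + 1."""
    return (t, t + w + 1)

# Usage: a count-based window of size 2 over three elements,
# and the validity of an element stamped t = 10 with w = 5.
print(list(count_window(["a1", "a2", "a3"], 2)))
print(time_window_validity(10, 5))
```

The count-based variant depends only on arrival order, whereas the time-based variant depends only on the element's own timestamp.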

11 Problem: Determinism. Data-driven processing with count-based windows (w = 2) is non-deterministic: the result of a query depends on scheduling. Example: symmetric join of the streams a1, a2, a3 and b1, b2, b3 (figure: depending on the schedule the join produces different results, e.g. a3b1, a3b2, a2b3, a3b3 versus a1b3, a2b3, a3b2, a3b3).
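The scheduling dependence can be reproduced with a small sketch. It is not the PIPES join and does not try to reproduce the exact result sets of the slide's figure; `count_window_join` is a hypothetical helper that joins every arriving element with the current count-based window (size w) of the opposite stream, using a join predicate that is always true.

```python
from collections import deque

def count_window_join(schedule, w):
    """Symmetric join over count-based windows of size w.

    schedule: list of (side, element) pairs in processing order,
    where side is "A" or "B". Returns the set of result pairs (a, b).
    """
    windows = {"A": deque(maxlen=w), "B": deque(maxlen=w)}
    results = set()
    for side, element in schedule:
        other = "B" if side == "A" else "A"
        # Probe the opposite window, then insert into the own window.
        for partner in windows[other]:
            results.add((element, partner) if side == "A" else (partner, element))
        windows[side].append(element)
    return results

a_first = [("A", "a1"), ("A", "a2"), ("A", "a3"),
           ("B", "b1"), ("B", "b2"), ("B", "b3")]
alternating = [("A", "a1"), ("B", "b1"), ("A", "a2"),
               ("B", "b2"), ("A", "a3"), ("B", "b3")]

# Same input streams, different schedules, different results.
print(count_window_join(a_first, 2) == count_window_join(alternating, 2))
```

With w = 2, the pair (a1, b1) appears only under the alternating schedule, because in the other schedule a1 has already been evicted from its window when b1 arrives.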

12 Temporal Windows in CQL. “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL … UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;

14 3. Query Processing in PIPES. Data stream model: input streams (autonomous sources); logical streams (semantics); physical streams (implementation of the semantics, but more expressive).

15 Input Streams. Sequence of records: arbitrary but fixed schema; no limitation to the relational model; records with timestamps; temporally ordered. Schema: HighwayStream( short lane, float speed, float length, Timestamp timestamp ). Input stream: (5; 18.28; 5.27; 5:00:08) (2; 21.33; 4.62; 5:01:32) (4; 19.69; 9.97; 5:02:16) …

16 Physical Stream. PIPES: time intervals instead of points. Validity of an element e: processing of e is restricted to its time interval; removal of invalid records. Sequence of tuples (e, [t_S, t_E)), ordered by t_S and t_E. Transformation input stream → physical stream: ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09)) ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33)) ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17)) …
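A minimal sketch of this transformation, assuming timestamps are integer seconds since midnight and that the smallest time unit is one second (as the example intervals suggest). The names `hms` and `to_physical_stream` are hypothetical and not taken from PIPES.

```python
def hms(h, m, s):
    """Convert a time of day to seconds since midnight."""
    return h * 3600 + m * 60 + s

def to_physical_stream(input_stream):
    """Map (record, t) pairs to (record, (t, t + 1)).

    The pair (t, t + 1) stands for the half-open validity interval
    [t, t + 1): each input record is valid for exactly one time unit.
    """
    for record, t in input_stream:
        yield (record, (t, t + 1))

highway = [
    ((5, 18.28, 5.27), hms(5, 0, 8)),
    ((2, 21.33, 4.62), hms(5, 1, 32)),
    ((4, 19.69, 9.97), hms(5, 2, 16)),
]
for element in to_physical_stream(highway):
    print(element)
```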

17 Data Stream Operators. Window operator. Relational operators: “relational” algebra on data streams (projection, selection, Cartesian product, union, difference); temporal extension of operators.

18 Window Operator. Purpose: extension of the validity of an element by w time units; overlap of windows of elements; elements need to be processed together. Example with a sliding window of w = 15 minutes: (e1, [5:00:08, 5:15:09)) (e2, [5:01:32, 5:16:33)) (e3, [5:02:16, 5:17:17)) … (figure: an element with start time t_S is valid until t_S + 1 + w, an interval of length w + 1).
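The window operator can be sketched as a function that only moves the end of each validity interval. The sketch assumes integer seconds since midnight; `window` and `fmt` are hypothetical names, not PIPES operators.

```python
def window(physical_stream, w):
    """Extend the validity of every element by w time units."""
    for element, (t_s, t_e) in physical_stream:
        yield (element, (t_s, t_e + w))

def fmt(seconds):
    """Format seconds since midnight as h:mm:ss."""
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

# e1 from the slide: valid in [5:00:08, 5:00:09) before the window operator.
e1 = ("e1", (5 * 3600 + 8, 5 * 3600 + 9))
for element, (t_s, t_e) in window([e1], 15 * 60):
    print(element, fmt(t_s), fmt(t_e))
```

For e1 this gives the interval [5:00:08, 5:15:09) shown on the slide.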

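19 Relational Stream Operators: Snapshot-Reducibility. Snapshot: mapping of a physical stream to a non-temporal relation; the relation comprises all elements valid at point t. (Figure: snapshots of the input streams S_1, …, S_n give the relations R_1, …, R_n; applying the relational operator to them yields R_out, which equals the snapshot of the output S_out of the relational stream operator.)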
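Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below uses a filter (selection) as the relational operator and finite lists instead of unbounded streams; `snapshot`, `stream_filter` and `relational_filter` are hypothetical names chosen for illustration.

```python
def snapshot(physical_stream, t):
    """Return all elements whose half-open interval [t_s, t_e) contains t."""
    return [e for e, (t_s, t_e) in physical_stream if t_s <= t < t_e]

def stream_filter(physical_stream, predicate):
    """Stream version of selection: keeps elements and their intervals."""
    return [(e, interval) for e, interval in physical_stream if predicate(e)]

def relational_filter(relation, predicate):
    """Ordinary selection on a non-temporal relation."""
    return [e for e in relation if predicate(e)]

S = [(1, (0, 4)), (2, (2, 6)), (3, (5, 9))]
is_odd = lambda e: e % 2 == 1

# At every point in time, filtering the snapshot gives the same relation
# as taking the snapshot of the filtered stream.
for t in range(10):
    assert snapshot(stream_filter(S, is_odd), t) == relational_filter(snapshot(S, t), is_odd)
```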

20 Query Optimization  Application of Well-known Rules from Temporal Databases  Slivinskas, Jensen, Snodgrass (ICDE 2000)  Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering  many rules directly applicable to streams  conventional + temporal rules  Basis for Effective Query Optimization

21 Steps: 1) Query, 2) Logical Query Plan, 3) Query Optimization, 4) Physical Query Plan. Query: SELECT sectionID FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID FROM HighwayStream1 [Range 15 minutes] UNION ALL … UNION ALL SELECT AVG(speed) AS avgSpeed, 20 AS sectionID FROM HighwayStream20 [Range 15 minutes] ) WHERE avgSpeed < 15; (“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”) Operators of the logical plan (figure): Map: projection on sectionID; Filter: avgSpeed < 15; Union: merge of data streams; Aggregation: average speed (avgSpeed); Map: projection on speed, assigning sectionID; Window: w = 15 minutes.

22 Physical Operators. Stateless operators: processing of an element is independent of the previous ones; examples: filter, map. Stateful operators: processing of an element depends on previous elements; restricted to the elements in the sliding window; explicit management of state; examples: join, aggregation.

23 Data-driven Joins. Input: streams A and B with a sliding window of size w; join predicate P. Output: records ((a,b), [t_S, t_E)) for which P(a,b) holds and the intervals of a and b overlap (figure: intervals of a and b and the interval [t_S, t_E) of the result (a,b)).

24 Methodology. Adaptation of the sweepline technique: t_A = start time of the last element of A; t_B = start time of the last element of B. Status for each input: the status of A holds the elements of A with end time ≥ t_B; the status of B holds the elements of B with end time ≥ t_A. Continuous processing of A and B: insertion, probing and reorganisation (figure).
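The sweepline idea can be sketched as a symmetric join over two status lists. This is a simplified illustration, not the PIPES implementation: it assumes a single merged input ordered by start time, plain lists as status structures, and that the result interval is the intersection of the two input intervals; `symmetric_join` is a hypothetical name.

```python
def symmetric_join(events, predicate):
    """Join two physical streams with a sweepline over start times.

    events: iterable of (side, element, (t_s, t_e)) ordered by t_s,
    where side is "A" or "B". Returns a list of ((a, b), (t_s, t_e)).
    """
    status = {"A": [], "B": []}
    results = []
    for side, element, (t_s, t_e) in events:
        other = "B" if side == "A" else "A"
        # Reorganisation: keep only opposite elements with end time >= t_s.
        status[other] = [(e, iv) for e, iv in status[other] if iv[1] >= t_s]
        # Probing: pair with every opposite element whose interval overlaps.
        for o_elem, (o_s, o_e) in status[other]:
            start, end = max(t_s, o_s), min(t_e, o_e)
            if start < end:
                pair = (element, o_elem) if side == "A" else (o_elem, element)
                if predicate(*pair):
                    results.append((pair, (start, end)))
        # Insertion into the own status.
        status[side].append((element, (t_s, t_e)))
    return results

events = [
    ("A", "a1", (0, 5)),
    ("B", "b1", (3, 8)),
    ("A", "a2", (6, 9)),
    ("B", "b2", (10, 12)),
]
print(symmetric_join(events, lambda a, b: True))
```

Because validity is expressed by time intervals rather than by arrival counts, the result does not depend on how elements with different start times are interleaved by the scheduler, as long as each input is processed in order.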

25 Runtime Environment of PIPES (figure: a query graph inside PIPES connecting sources to sinks)

27 4. Plan Migration. Re-optimization of query plans at runtime: identification of poorly performing subgraphs in the query graph. Plan migration: substitution of the old plan by a new one. Requirements: preservation of snapshot reducibility; continuous production of results; short migration time.

28 Example (figure: query graph with the sources R, S, T, U and the sinks C1, C2)

29 Semantic Problems. Duplicates: parallel insertion of new elements into both plans. Loss of results: exclusive insertion of new elements into the new plan.

30 Split Approach in PIPES. Assumptions: streams A and B; window of length w; equivalent query plans P_old and P_new. Earliest split time: t_split = max {t_A, t_B} + w. Splitting of the input at split time t_split.

31 Approach in PIPES. Production of results: acceptance of all results received from the old plan P_old; selection of results received from the new plan P_new, accepted only if start time > t_split. (Figure: the inputs A and B are split and fed to P_old and P_new.)
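The split time and the acceptance rule for results can be written down directly. The sketch covers only these two steps, not the splitting of the input itself; `split_time` and `merge_results` are hypothetical names, and results are modelled as (element, (t_s, t_e)) pairs.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: t_split = max{t_A, t_B} + w."""
    return max(t_a, t_b) + w

def merge_results(old_results, new_results, t_split):
    """Accept every result of the old plan; accept a result of the
    new plan only if its start time is greater than t_split."""
    accepted = list(old_results)
    accepted += [(e, (t_s, t_e)) for e, (t_s, t_e) in new_results if t_s > t_split]
    return accepted

t_split = split_time(t_a=100, t_b=104, w=20)
old = [("r1", (110, 118)), ("r2", (120, 124))]
new = [("r2", (120, 124)), ("r3", (125, 130))]
print(t_split, merge_results(old, new, t_split))
```

In this usage example the new plan's copy of r2 starts before t_split and is dropped, so r2 is reported once.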

32 Properties  Method is broadly applicable  Arbitrary plans  Many data streams  Different window sizes  Migration Time  Worst-case: w time units

34 5. Conclusions  Applications  Traffic management  Alarming systems  Observation of production lines  Basic ideas of stream processing in PIPES  Temporal Databases  Data-driven query processing  Adaptivity at runtime  Continuous Optimization at runtime  Dynamic Plan Migration  Broadly applicable approach

35 Current Work  Problems  Cost models for optimization  New techniques  Strategies for adaptation  Memory  CPU  QoS  Runtime environment  Realtime applications  Real applications for DSMS  Observation of patients in hospitals  Processing of sensor data  Coupling of PIPES and commercial products

36 Related Work  Abadi, Carney, Cetintemel et al.  Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2): ,  Arasu, Babu, and Widom  The CQL continuous query language: Semantic foundations and query execution. Technical Report , Stanford University,  Tucker, Maier, Sheard, and Faragas  Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3): ,  Law, Wang, and Zaniolo  Query languages and data models for database sequences and data streams. In VLDB, pages , 2004.

37 Papers on PIPES/XXL  Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE  Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe - Germany, March, 2-4.  Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe - Germany, March, 2-4.  Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe - New Mexico,  Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13, (2004).  Jürgen Krämer, Bernhard Seeger: PIPES–A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo)  Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 

38 Future Work  Query optimization  Adequate cost model  Not only stream rates  Runtime statistics: delays, memory usage, etc.  Static query optimization  Multi query optimization  Subquery sharing  Dynamic query optimization  Detection of suitable subgraphs  Plan migration at runtime  Temporal aspects  Coalesce

Thank you! Any questions? For more information check our website: Home/Research/Projects/PIPES

40 Reorganization. Which elements can be discarded from internal data structures, and why? Restriction of memory usage: all elements where t_E ≤ min { t_Sj }, with t_Sj the latest start timestamp of input stream j. Ordering invariant: no temporal overlap with future stream elements.

41 Aggregation. Incremental computation. Efficient implementation: aggregation segment-tree; amortized logarithmic costs per element. (Figure, example Sum: current state (aggregates) over time T, a new element, insertion and reorganization.)
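What a temporal Sum has to produce can be illustrated with a simple sweep over interval endpoints. This is a batch sketch over a finite list and deliberately not the incremental segment-tree mentioned on the slide; `temporal_sum` is a hypothetical name.

```python
def temporal_sum(physical_stream):
    """Compute the sum of all valid values for every time segment.

    physical_stream: list of (value, (t_s, t_e)) with half-open intervals.
    Returns a list of (sum, (start, end)) for segments with valid elements.
    """
    deltas = {}  # endpoint -> [change of sum, change of element count]
    for value, (t_s, t_e) in physical_stream:
        deltas.setdefault(t_s, [0, 0])
        deltas.setdefault(t_e, [0, 0])
        deltas[t_s][0] += value
        deltas[t_s][1] += 1
        deltas[t_e][0] -= value
        deltas[t_e][1] -= 1
    result, total, count, prev = [], 0, 0, None
    for t in sorted(deltas):
        if prev is not None and count > 0:
            result.append((total, (prev, t)))
        total += deltas[t][0]
        count += deltas[t][1]
        prev = t
    return result

print(temporal_sum([(10, (0, 5)), (20, (3, 8))]))
```

Each output segment corresponds to one snapshot-constant stretch of time: the sum changes only where an element becomes valid or expires.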

42 Outline  Motivation and problem definition  Query formulation  Our temporal approach  Stream types  Logical query plans  Query optimization  Physical query plans  Query execution  Exploration of Data Streams  Conclusions

43 Exploration of Data Streams. Example: estimation of the selectivity of continuous range queries at runtime: select * from Stream S where S.measure between min and max. Our approach: exploit the density p of the distribution; it represents all information about the distribution; suitable for estimating the selectivity of multiple queries.

44 Requirement. Problem: the density is unknown. Adaptation of a non-parametric density estimation technique: kernels; wavelets; sampling and CDF. Requirements: low resource consumption (memory and CPU); memory and CPU adaptive; increasing memory size → higher accuracy; valid estimation at each point in time; adapt to a changing distribution.

45 Reservoir Sampling  CDF is built on top of the iid samples  Disadvantages  Estimation relies on a few elements  No advantage from an increasing memory  Advantage  Low processing overhead
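For reference, a minimal sketch of reservoir sampling (the classic one-pass scheme often called Algorithm R) together with a selectivity estimate taken from the empirical distribution of the sample. The slides do not say which variant is used, so this is an assumption; `reservoir_sample` and `estimate_selectivity` are hypothetical names.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k elements from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # uniform in 0..i, both ends inclusive
            if j < k:
                reservoir[j] = x      # replace with probability k / (i + 1)
    return reservoir

def estimate_selectivity(sample, lo, hi):
    """Fraction of sampled values inside the range [lo, hi]."""
    if not sample:
        return 0.0
    return sum(lo <= x <= hi for x in sample) / len(sample)

sample = reservoir_sample(range(1000), 50, random.Random(0))
print(estimate_selectivity(sample, 0, 499))
```

The per-element cost is constant, which matches the low processing overhead named on the slide; the estimate, however, rests on only k elements.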

46 Blockwise Estimation  Stream is transformed into blocks  For simplicity: blocks are of the same size  Idea  Estimation of the first k blocks is available  Compute the estimation of k+1 blocks iteratively  Example (Average)  Generalization for density functions  Straightforward Extension  Problem: Violates the requirement of limited memory
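The average example can be sketched as follows. The formula on the original slide is not preserved in the transcript, so the update rule here is the standard running mean over equally sized blocks and is an assumption; `update_blockwise_mean` is a hypothetical name.

```python
def update_blockwise_mean(mean_k, k, block):
    """Combine the mean of the first k blocks with a new block.

    Assumes all blocks have the same size, so every block mean
    carries the same weight: mean_{k+1} = (k * mean_k + mean(block)) / (k + 1).
    """
    block_mean = sum(block) / len(block)
    return (k * mean_k + block_mean) / (k + 1)

blocks = [[1, 2], [3, 4], [5, 6]]
mean = 0.0
for k, block in enumerate(blocks):
    mean = update_blockwise_mean(mean, k, block)
print(mean)
```

Only the current estimate and the block counter are kept for the average. For density functions the straightforward extension has to keep a growing representation, which is the memory problem the slide points out.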

47 Cumulative-Compressed Estimation  Compression  Cubic splines  Weighting strategies  Amortized cost for updates  O(log M)

48 Experimental Comparison  Streaming data from a real traffic data set  Arithmetic weights  Memory size: 5000