PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research Society (DFG) grant Se 553/4-2

2 Information Landscape Input Output

3 Outline  Motivation and problem definition  Sliding Windows  Query Processing in PIPES  Data Stream Model  Logical Operators  Algebraic Query Optimization  Physical Operators  Runtime Environment  Dynamic Plan Migration  Conclusions

4 Example Application  Traffic monitoring  Data format  Continuous dataflow  streams  Variable stream rates  Time + location dependence  Queries  Continuous, long-running “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes ?” HighwayStream( lane, speed, length, timestamp )

5 Data Streams  Continuously Arriving Sequence of Records  time as an integral component  Autonomous Data Sources  sensors, mobile devices, software agents, …  Important Type of Data  miniaturization of hardware  ubiquitous networks

6 Requirements  Declarative Query Language  Expressive like (Temporal) SQL  join of data streams according to time  combination of data streams with persistent databases  assigns meaning to data  query results as a data stream  Publish/Subscribe Paradigm  Subscribe: users register new queries  Publish: continuous report of results  Quality of Service (QoS)  e.g. at least one record per second  scalability  number of data sources  number of subscribed queries

7 Stream Query Processing  Similar to Traditional DBMS 1. Queries expressed in CQL  SQL-like query language 2. Logical Query Plan  algebra with „relational“ operators 3. Query Optimization  algebraic rules  simple, but accurate cost model 4. Physical Query Plan  select physical operators 5. Processing of the Query

8 What is special about PIPES?  PIPES provides an Infrastructure for DSMS  DSMS = Data Stream Management System  PIPES = Public Infrastructure for Processing and Exploring Data Streams  Differences to DBMS  Semantics is borrowed from Temporal Databases  Expressiveness  Query Optimization  Data Driven Query Processing  Publish/Subscribe  Adaptive Runtime Environment  Dynamic assignment of resources at runtime  scalability, QoS  Continuous Optimization of Queries  plan migration  scalability, QoS

10 2. Sliding Windows  Requirement of Users  no impact of outdated data on our result  integration of different streams according to time  Moving Temporal Windows  Finite subsequence of an infinite stream  Query processing is restricted to the most recent data  Important for an expressive and efficient query processing  Options  Count-based windows  FIFO queue of size w  Time-based windows  t: timestamp of an element  t + w + 1: end of the validity of an element
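
The two window options can be sketched in a few lines of Python. This is an illustration only, assuming integer timestamps; the names `CountWindow` and `validity` are made up for this sketch and are not part of PIPES.

```python
from collections import deque

class CountWindow:
    """Count-based window: a FIFO queue holding the w most recent elements."""
    def __init__(self, w):
        self.items = deque(maxlen=w)  # the oldest element drops out automatically

    def insert(self, element):
        self.items.append(element)
        return list(self.items)

def validity(t, w):
    """Time-based window: an element stamped t is valid until t + w + 1 (exclusive)."""
    return (t, t + w + 1)

cw = CountWindow(2)
for e in ["a1", "a2", "a3"]:
    current = cw.insert(e)
print(current)          # ['a2', 'a3']
print(validity(10, 5))  # (10, 16)
```

The count-based window depends only on arrival order, while the time-based validity depends only on the timestamp carried by the element.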

11 Problem: Determinism  Data-driven Processing  Count-based Windows  w = 2  Non-Determinism  Result of a query depends on scheduling  Example: Symmetric Join [Figure: symmetric join of the streams a1, a2, a3 and b1, b2, b3 with count-based windows of size 2; different schedules yield different sets of result pairs]
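
The scheduling dependence can be reproduced with a small simulation. The sketch below assumes a join predicate that is always true and two hand-picked schedules; `symmetric_join` is a hypothetical helper, not PIPES code.

```python
from collections import deque

def symmetric_join(schedule, w=2):
    """Symmetric join with one count-based window of size w per input.

    `schedule` is the order in which the scheduler delivers elements;
    each entry is a (stream, value) pair with stream 'A' or 'B'.
    The join predicate is always true to keep the sketch short.
    """
    windows = {"A": deque(maxlen=w), "B": deque(maxlen=w)}
    results = set()
    for stream, value in schedule:
        other = "B" if stream == "A" else "A"
        windows[stream].append(value)       # insert into the own window
        for partner in windows[other]:      # probe the opposite window
            pair = (value, partner) if stream == "A" else (partner, value)
            results.add(pair)
    return results

a = [("A", f"a{i}") for i in (1, 2, 3)]
b = [("B", f"b{i}") for i in (1, 2, 3)]

a_first = symmetric_join(a + b)                                      # all of A, then all of B
interleaved = symmetric_join([x for pair in zip(a, b) for x in pair])  # a1, b1, a2, b2, ...
print(sorted(a_first))
print(sorted(interleaved))
print(a_first == interleaved)  # False
```

Both runs see exactly the same input elements, yet the first schedule never produces a pair containing a1 because a1 has already left the window when the first b arrives.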

12 Temporal Windows in CQL SELECT sectionID FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID FROM HighwayStream1 [Range 15 minutes] UNION ALL … UNION ALL SELECT AVG(speed) AS avgSpeed, 20 AS sectionID FROM HighwayStream20 [Range 15 minutes] ) WHERE avgSpeed < 15; “ At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes ?”

14 3. Query Processing in PIPES  Data Streams Model  Input Streams  Autonomous Source  Logical Streams  Semantics  Physical Streams  Implementation of the Semantics, but more expressive

15 Input Streams  Sequence of Records  Arbitrary, but fixed schema  No limitation to the relational model  Records with timestamps  Temporal ordered Schema: HighwayStream( short lane, float speed, float length, Timestamp timestamp ) Input Stream: (5; 18.28; 5.27; 5:00:08) (2; 21.33; 4.62; 5:01:32) (4; 19.69; 9.97; 5:02:16) …

16 Physical Stream  PIPES: Time Intervals instead of Points  Validity of an element e  Processing of e restricted to its time interval  Removal of invalid records  Sequence of tuples (e, [t_S, t_E))  Ordered by t_S and t_E ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09)) ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33)) ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17)) … Transformation: input stream → physical stream
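
The transformation from input stream to physical stream can be sketched as follows. The sketch assumes timestamps in whole seconds since midnight, so that each record is valid for exactly one second as in the example above; `Record`, `to_physical` and `hms` are illustrative names.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

class Record(NamedTuple):
    lane: int
    speed: float
    length: float
    timestamp: int  # seconds since midnight (assumed granularity: 1 second)

Physical = Tuple[Record, Tuple[int, int]]  # (e, [t_S, t_E))

def to_physical(stream: Iterable[Record]) -> Iterator[Physical]:
    """Attach the half-open validity interval [t, t + 1) to each record."""
    for e in stream:
        yield (e, (e.timestamp, e.timestamp + 1))

def hms(h: int, m: int, s: int) -> int:
    return h * 3600 + m * 60 + s

inputs = [
    Record(5, 18.28, 5.27, hms(5, 0, 8)),
    Record(2, 21.33, 4.62, hms(5, 1, 32)),
    Record(4, 19.69, 9.97, hms(5, 2, 16)),
]
physical = list(to_physical(inputs))
print(physical[0][1])  # (18008, 18009)
```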

17 Data Stream Operators  Window Operator  Relational Operator  „relational“ algebra on data streams  projection  selection  Cartesian product  union  difference  temporal extension of operators

18 Window Operator  Purpose  Extension of the validity of an element by w time units.  Overlap of windows of elements  Elements need to be processed together  Window: w = 15 minutes (e_1, [5:00:08, 5:15:09)) (e_2, [5:01:32, 5:16:33)) (e_3, [5:02:16, 5:17:17)) … [Figure: sliding window of 15 minutes; the validity of an element runs from t_S to t_S + 1 + w]
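
A minimal sketch of the window operator, assuming intervals are pairs of seconds since midnight; it reproduces the three intervals listed above. The helpers `hms` and `fmt` exist only for this example.

```python
def window(stream, w):
    """Extend the validity of each element by w time units: [t_S, t_E) becomes [t_S, t_E + w)."""
    for e, (t_s, t_e) in stream:
        yield (e, (t_s, t_e + w))

def hms(text):
    h, m, s = (int(x) for x in text.split(":"))
    return h * 3600 + m * 60 + s

def fmt(seconds):
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

physical = [
    ("e1", (hms("5:00:08"), hms("5:00:09"))),
    ("e2", (hms("5:01:32"), hms("5:01:33"))),
    ("e3", (hms("5:02:16"), hms("5:02:17"))),
]
windowed = list(window(physical, 15 * 60))
for e, (t_s, t_e) in windowed:
    print(e, fmt(t_s), fmt(t_e))
```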

19 Relational Stream Operators Snapshot-Reducibility  Snapshot  Mapping of a physical stream to a non-temporal relation.  Relation comprises all valid elements at point t [Figure: snapshots of the input streams S_1, …, S_n give the relations R_1, …, R_n; the relational operator maps them to R_out, which is the snapshot of the output S_out of the relational stream operator]
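
Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below does this for selection, with a small hand-made stream of (speed, interval) pairs; all function names are assumptions of this example.

```python
from collections import Counter

def snapshot(stream, t):
    """Non-temporal relation (a multiset) of all elements valid at time t."""
    return Counter(e for e, (t_s, t_e) in stream if t_s <= t < t_e)

def stream_filter(stream, pred):
    """Stream version of selection: keeps qualifying elements with their intervals."""
    return [(e, iv) for e, iv in stream if pred(e)]

def relational_filter(relation, pred):
    """Ordinary selection on a multiset."""
    return Counter({e: n for e, n in relation.items() if pred(e)})

S = [(12, (0, 5)), (20, (2, 8)), (9, (4, 6)), (20, (5, 9))]
slow = lambda speed: speed < 15

# For every point in time, the snapshot of the stream operator's output must
# equal the relational operator applied to the snapshot of the input.
reducible = all(
    snapshot(stream_filter(S, slow), t) == relational_filter(snapshot(S, t), slow)
    for t in range(0, 10)
)
print(reducible)  # True
```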

20 Query Optimization  Application of Well-known Rules from Temporal Databases  Slivinskas, Jensen, Snodgrass (ICDE 2000)  Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering  many rules directly applicable to streams  conventional + temporal rules  Basis for Effective Query Optimization

21 Steps: 1) Query 2) Logical Query Plan 3) Query Optimization 4) Physical Query Plan SELECT sectionID FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID FROM HighwayStream1 [Range 15 minutes] UNION ALL … UNION ALL SELECT AVG(speed) AS avgSpeed, 20 AS sectionID FROM HighwayStream20 [Range 15 minutes] ) WHERE avgSpeed < 15; “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?” Map: projection on sectionID Filter: avgSpeed < 15 Union: merge of data streams Aggregation: average speed (avgSpeed) Map: projection on speed, assigning sectionID Window: w = 15 minutes
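
To make the plan concrete, here is a strongly simplified Python sketch of the same pipeline (window, aggregation, filter, map). It is not how PIPES evaluates the plan: it works on timestamps instead of validity intervals and only re-evaluates when a record arrives. The input format `(section_id, timestamp_seconds, speed)` and the function name `slow_sections` are assumptions.

```python
from collections import deque

def slow_sections(records, w=15 * 60, threshold=15.0):
    """Yield (timestamp, section_id) whenever the windowed average speed is below threshold.

    records: iterable of (section_id, timestamp_seconds, speed), ordered by timestamp.
    """
    windows = {}  # section_id -> deque of (timestamp, speed) inside the window
    sums = {}     # section_id -> running sum of the speeds in the window
    for section, t, speed in records:
        win = windows.setdefault(section, deque())
        win.append((t, speed))
        sums[section] = sums.get(section, 0.0) + speed
        # Window: drop elements older than w seconds
        while win[0][0] < t - w:
            _, old_speed = win.popleft()
            sums[section] -= old_speed
        # Aggregation + filter + map: emit the section id if the average is too low
        if sums[section] / len(win) < threshold:
            yield (t, section)

records = [(1, 0, 20.0), (1, 60, 8.0), (2, 90, 25.0), (1, 1000, 12.0)]
alerts = list(slow_sections(records))
print(alerts)  # [(60, 1), (1000, 1)]
```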

22 Physical Operators  Stateless Operators  Processing of an element is independent from the previous ones.  Examples: filter, map  Stateful Operators  Processing of an element depends on previous elements  Restrict to elements in sliding window  Explicit management of status  Examples: join, aggregation

23 Data-driven Joins  Input  streams A and B and sliding window of size w  join predicate P  Output  records ((a,b), [t_S, t_E))  P(a,b)  overlapping intervals of a and b [Figure: the interval [t_S, t_E) of a result (a,b) is the overlap of the intervals of a and b]

24 Methodology  Adaptation of Sweepline Technique t_A = Start time of last element of A t_B = Start time of last element of B  Status for each input  Status of A: elements of A with end time ≥ t_B  Status of B: elements of B with end time ≥ t_A  Continuous Processing [Figure: elements of A and B enter Status A and Status B; each arrival triggers insertion, probing and reorganisation]
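
The three steps (reorganisation, probing, insertion) can be sketched as a small class. This is a list-based illustration under the assumption that each input delivers its elements ordered by start time; `SweeplineJoin` and its method names are hypothetical, and a real operator would use more efficient status structures and take care of output ordering.

```python
class SweeplineJoin:
    """Data-driven join over two interval streams, one status per input."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.status = {"A": [], "B": []}

    def process(self, side, element, interval):
        other = "B" if side == "A" else "A"
        t_s, t_e = interval
        # Reorganisation: keep only partners whose end time is >= the new start time
        self.status[other] = [(x, iv) for x, iv in self.status[other] if iv[1] >= t_s]
        results = []
        # Probing: emit matching partners with the overlap of both intervals
        for x, (x_s, x_e) in self.status[other]:
            lo, hi = max(t_s, x_s), min(t_e, x_e)
            a, b = (element, x) if side == "A" else (x, element)
            if lo < hi and self.predicate(a, b):
                results.append(((a, b), (lo, hi)))
        # Insertion: remember the new element for future partners
        self.status[side].append((element, interval))
        return results

join = SweeplineJoin(lambda a, b: a == b)
events = [
    ("A", "x", (0, 10)),
    ("B", "x", (5, 12)),
    ("A", "y", (11, 20)),
    ("B", "y", (15, 25)),
]
output = []
for side, element, interval in events:
    output.extend(join.process(side, element, interval))
print(output)  # [(('x', 'x'), (5, 10)), (('y', 'y'), (15, 20))]
```

After the last event, the element x of A has been removed from Status A because its end time 10 lies before the latest start time 15 of B.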

25 Runtime Environment of PIPES Sources Sinks Query graph PIPES

27 4. Plan Migration  Re-Optimization of Query Plans at Runtime  Identification of poorly performing subgraphs in the query graph  Plan Migration  Substitution of old plan by a new one Requirements  Preserving of snapshot reducibility  Continuous production of results  Short migration time

28 Example [Figure: query graph with the sources R, S, T, U and the sinks C1, C2]

29 Semantic Problems  Duplicates  Parallel insertion of new elements into both plans  Loss of Results  Exclusive insertion of new element in the new plan

30 Split Approach in PIPES  Assumptions  Streams A and B  Window of length w  equivalent query plans P_old and P_new  Earliest split time  t_split = max {t_A, t_B} + w  Splitting of the input at split time t_split
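
The split time follows directly from the formula above. How an element whose interval straddles the split time is handled is not spelled out on the slide; the sketch below assumes one possible reading, namely that its validity interval is cut at the split time so that the old plan covers everything before it and the new plan everything from it onwards. The names `split_time` and `split` are illustrative.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: max{t_A, t_B} + w."""
    return max(t_a, t_b) + w

def split(element, interval, t_split):
    """Cut a validity interval at t_split (assumed reading of the Split step).

    Returns (old_part, new_part); a part is None if it would be empty.
    """
    t_s, t_e = interval
    old_part = (element, (t_s, min(t_e, t_split))) if t_s < t_split else None
    new_part = (element, (max(t_s, t_split), t_e)) if t_e > t_split else None
    return old_part, new_part

t_split = split_time(100, 104, 10)
print(t_split)                            # 114
print(split("e", (110, 121), t_split))    # (('e', (110, 114)), ('e', (114, 121)))
print(split("e", (100, 111), t_split))    # (('e', (100, 111)), None)
print(split("e", (120, 131), t_split))    # (None, ('e', (120, 131)))
```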

31 Approach in PIPES  Production of Results  Acceptance of all results received from the old plan P_old  Selection of results received from the new plan P_new  Acceptance only if start time > t_split [Figure: a Split operator routes the inputs A and B to P_old and P_new]

32 Properties  Method is broadly applicable  Arbitrary plans  Many data streams  Different window sizes  Migration Time  Worst-case: w time units

34 5. Conclusions  Applications  Traffic management  Alarm systems  Observation of production lines  Basic ideas of stream processing in PIPES  Temporal Databases  Data-driven query processing  Adaptivity at runtime  Continuous Optimization at runtime  Dynamic Plan Migration  Broadly applicable approach

35 Current Work  Problems  Cost models for optimization  New techniques  Strategies for adaptation  Memory  CPU  QoS  Runtime environment  Realtime applications  Real applications for DSMS  Observation of patients in hospitals  Processing of sensor data  Coupling of PIPES and commercial products

36 Related Work  Abadi, Carney, Cetintemel et al.  Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003.  Arasu, Babu, and Widom  The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003.  Tucker, Maier, Sheard, and Fegaras  Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003.  Law, Wang, and Zaniolo  Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.

37 Papers on PIPES/XXL  Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE 2006.  Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe - Germany, March, 2-4.  Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe - Germany, March, 2-4.  Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe - New Mexico, 2005.  Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13, (2004).  Jürgen Krämer, Bernhard Seeger: PIPES–A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo)  Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001. 

38 Future Work  Query optimization  Adequate cost model  Not only stream rates  Runtime statistics: delays, memory usage, etc.  Static query optimization  Multi query optimization  Subquery sharing  Dynamic query optimization  Detection of suitable subgraphs  Plan migration at runtime  Temporal aspects  Coalesce

Thank you! Any questions? For more information check our website: http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES

40 Reorganization  Restriction of memory usage  All elements where t_E ≤ min_j { t_Sj }  t_Sj: latest start timestamp of input stream j  Ordering invariant  no temporal overlap with future stream elements Which elements can be discarded in internal data structures? Why?
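
A possible way to implement this purge is a priority queue ordered by end timestamp. The sketch assumes half-open intervals [t_S, t_E) and takes the discard condition to be t_E ≤ min over all inputs of the latest start timestamp, which is safe because future elements start no earlier than that minimum. The class `State` and its methods are invented for this illustration.

```python
import heapq

class State:
    """Operator state ordered by end timestamp, purged against the progress of all inputs."""

    def __init__(self, n_inputs):
        self.latest_start = [None] * n_inputs  # t_Sj for each input j
        self.heap = []                          # entries: (t_E, sequence number, element)
        self.seq = 0                            # tie-breaker so elements are never compared

    def insert(self, j, element, interval):
        t_s, t_e = interval
        self.latest_start[j] = t_s
        heapq.heappush(self.heap, (t_e, self.seq, element))
        self.seq += 1

    def reorganize(self):
        """Remove and return all elements that cannot overlap any future element."""
        if any(t is None for t in self.latest_start):
            return []  # some input has delivered nothing yet: nothing is safe to drop
        bound = min(self.latest_start)
        dropped = []
        while self.heap and self.heap[0][0] <= bound:
            dropped.append(heapq.heappop(self.heap)[2])
        return dropped

state = State(2)
state.insert(0, "a1", (0, 5))
state.insert(1, "b1", (3, 9))
first = state.reorganize()    # bound = min(0, 3) = 0, nothing expires
state.insert(0, "a2", (7, 12))
state.insert(1, "b2", (8, 14))
second = state.reorganize()   # bound = min(7, 8) = 7, a1 (end 5) expires
print(first, second)          # [] ['a1']
```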

41 Aggregation  Incremental computation  Efficient implementation  Aggregation segment-tree  Amortized logarithmic costs per element [Figure: example for Sum; a new element is inserted into the current state T (aggregates), followed by reorganization]

42 Outline  Motivation and problem definition  Query formulation  Our temporal approach  Stream types  Logical query plans  Query optimization  Physical query plans  Query execution  Exploration of Data Streams  Conclusions

43 Exploration of Data Streams  Example  Estimation of selectivity during runtime of continuous range queries: select * from Stream S where S.measure between min and max  Our Approach  Exploit the density p of the distribution  Represents all information about the distribution  Suitable for estimating the selectivity of multiple queries
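
The link between the density p and the selectivity of a range query can be written out as follows. X stands for the measured value and F for the cumulative distribution function of p; both symbols are introduced here and do not appear on the slide. The identity assumes a continuous distribution.

```latex
\mathrm{sel}(\mathit{min}, \mathit{max})
  \;=\; P(\mathit{min} \le X \le \mathit{max})
  \;=\; \int_{\mathit{min}}^{\mathit{max}} p(x)\,dx
  \;=\; F(\mathit{max}) - F(\mathit{min})
```

Once an estimate of p (or of F) is maintained over the stream, the selectivity of any number of range queries can be read off without touching the data again.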

44 Requirement  Problem  Density is unknown  Adaptation of a non-parametric density estimation technique  Kernels  Wavelets  Sampling and CDF  Requirements  Low resource consumption (memory and CPU)  Memory and CPU adaptive  Increasing memory size → higher accuracy  Valid estimation at each point in time  Adapt to a changing distribution

45 Reservoir Sampling  CDF is built on top of the iid samples  Disadvantages  Estimation relies on a few elements  No advantage from an increasing memory  Advantage  Low processing overhead
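
A sketch of this baseline, assuming the classic reservoir algorithm (Algorithm R) and an empirical CDF over the sample. The synthetic Gaussian speed values and the function names are assumptions made for the example; the slide does not say which reservoir variant was used.

```python
import random
from bisect import bisect_left, bisect_right

def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform random sample of size k from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = x      # element i replaces a slot with probability k / (i + 1)
    return sample

def selectivity(sample, lo, hi):
    """Empirical-CDF estimate of the fraction of values in [lo, hi]."""
    s = sorted(sample)
    return (bisect_right(s, hi) - bisect_left(s, lo)) / len(s)

rng = random.Random(42)
stream = (rng.gauss(25.0, 5.0) for _ in range(100_000))  # synthetic speeds, not real data
sample = reservoir_sample(stream, 500, rng)
estimate = selectivity(sample, 20.0, 30.0)
print(len(sample), estimate)
```

The per-element cost is constant, which matches the low processing overhead noted above, but the estimate rests on the sample alone.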

46 Blockwise Estimation  Stream is transformed into blocks  For simplicity: blocks are of the same size  Idea  Estimation of the first k blocks is available  Compute the estimation of k+1 blocks iteratively  Example (Average)  Generalization for density functions  Straightforward Extension  Problem: Violates the requirement of limited memory
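
For the average, the iterative step can be sketched as below, assuming blocks of equal size as stated above. Only the previous estimate and the current block are needed; `blockwise_average` is an illustrative name.

```python
def blockwise_average(blocks):
    """Yield the running average after each block, using only the previous estimate and the new block."""
    estimate, k = 0.0, 0
    for block in blocks:
        block_avg = sum(block) / len(block)
        k += 1
        # Same as (k - 1) / k * estimate + 1 / k * block_avg
        estimate += (block_avg - estimate) / k
        yield estimate

blocks = [[10, 20], [30, 40], [50, 60]]
estimates = list(blockwise_average(blocks))
print(estimates)  # [15.0, 25.0, 35.0]
```

For a single number this update needs constant memory. The difficulty named above arises when the estimate is a function, such as a density, whose representation grows with every block.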

47 Cumulative-Compressed Estimation  Compression  Cubic splines  Weighting strategies  Amortized cost for updates  O(log M)

48 Experimental Comparison  Streaming data from a real traffic data set  Arithmetic weights  Memory size: 5000
