Published by Clarissa Harrell. Modified over 8 years ago.
1
PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research Society (DFG) grant Se 553/4-2
2
2 Information Landscape Input Output
3
3 Outline Motivation and problem definition Sliding Windows Query Processing in PIPES Data Stream Model Logical Operators Algebraic Query Optimization Physical Operators Runtime Environment Dynamic Plan Migration Conclusions
4
4 Example Application Traffic monitoring Data format Continuous dataflow streams Variable stream rates Time + location dependence Queries Continuous, long-running “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?” HighwayStream( lane, speed, length, timestamp )
5
5 Data Streams Continuously Arriving Sequence of Records time as an integral component Autonomous Data Sources sensors, mobile devices, software agents, … Important Type of Data miniaturization of hardware ubiquitous networks
6
6 Requirements Declarative Query Language Expressive like (Temporal) SQL join of data streams according to time combination of data streams with persistent databases assigns meaning to data query results as a data stream Publish/Subscribe Paradigm Subscribe: users register new queries Publish: continuous report of results Quality of Service (QoS) e.g. at least one record per second scalability number of data sources number of subscribed queries
7
7 Stream Query Processing
Similar to Traditional DBMS
1. Queries expressed in CQL: SQL-like query language
2. Logical Query Plan: algebra with "relational" operators
3. Query Optimization: algebraic rules; simple, but accurate cost model
4. Physical Query Plan: select physical operators
5. Processing of the Query
8
8 What is special about PIPES? PIPES provides an Infrastructure for DSMS DSMS = Data Stream Management System PIPES = Public Infrastructure for Processing and Exploring Data Streams Differences to DBMS Semantics is borrowed from Temporal Databases Expressiveness Query Optimization Data Driven Query Processing Publish/Subscribe Adaptive Runtime Environment Dynamic assignment of resources at runtime scalability, QoS Continuous Optimization of Queries plan migration scalability, QoS
9
9 Outline Motivation and problem definition Sliding Windows Query Processing in PIPES Data Stream Model Logical Operators Algebraic Query Optimization Physical Operators Runtime Environment Dynamic Plan Migration Conclusions
10
10 2. Sliding Windows
Requirement of Users
  no impact of outdated data on our result
  integration of different streams according to time
Moving Temporal Windows
  Finite subsequence of an infinite stream
  Query processing is restricted to the most recent data
  Important for expressive and efficient query processing
Options
  Count-based windows: FIFO queue of size w
  Time-based windows: t = time stamp of an element; t + w + 1 = end of the validity of an element
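The two window options can be sketched in a few lines of Python. This is only an illustration, not PIPES code: the names `CountWindow`, `validity` and `valid_at` are hypothetical, timestamps are assumed to be integers, and the end of validity is assumed to be exclusive.

```python
from collections import deque


class CountWindow:
    """Count-based window: a FIFO queue that keeps the w most recent elements."""

    def __init__(self, w):
        self.items = deque(maxlen=w)  # deque evicts the oldest element when full

    def insert(self, element):
        self.items.append(element)

    def contents(self):
        return list(self.items)


def validity(t, w):
    """Time-based window: an element stamped t is valid from t up to t + w + 1."""
    return (t, t + w + 1)


def valid_at(t, w, now):
    """Assumption: the end of validity is exclusive (half-open interval)."""
    start, end = validity(t, w)
    return start <= now < end
```

With `w = 2`, inserting a1, a2, a3 into a `CountWindow` leaves a2 and a3; an element stamped 10 with `w = 5` is valid at time 15 but no longer at time 16.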
11
11 Problem: Determinism
Data-driven Processing
Count-based Windows, w = 2
Non-Determinism: result of a query depends on scheduling
Example: Symmetric Join
[Figure: symmetric join of the streams a1, a2, a3 and b1, b2, b3; the join results (e.g. a3b1, a3b2, a2b3, a3b3 versus a1b3, a2b3, a3b2, a3b3) differ with the scheduling]
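The scheduling dependence can be reproduced with a small Python sketch. It assumes a symmetric join without a join predicate (a cross product), where each arriving element is first inserted into its own count-based window and then probed against the other window. The function name `symmetric_join` and this exact insert-then-probe order are assumptions for illustration; the sketch does not claim to reproduce the figure on the slide.

```python
from collections import deque


def symmetric_join(schedule, w=2):
    """Join elements named like 'a1' or 'b2' over count-based windows of size w."""
    windows = {"a": deque(maxlen=w), "b": deque(maxlen=w)}
    results = []
    for element in schedule:
        side = element[0]                    # 'a' or 'b'
        other = "b" if side == "a" else "a"
        windows[side].append(element)        # insertion (evicts the oldest if full)
        for partner in windows[other]:       # probing of the opposite window
            a, b = (element, partner) if side == "a" else (partner, element)
            results.append(a + b)
    return results


# Same input elements, two different interleavings.
batch = symmetric_join(["a1", "a2", "a3", "b1", "b2", "b3"])
alternating = symmetric_join(["a1", "b1", "a2", "b2", "a3", "b3"])
```

The batch schedule never produces a pair containing a1, because a1 has already been evicted when the first b element arrives, while the alternating schedule produces a1b1 and a1b2. The result therefore depends on how the scheduler interleaves the inputs.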
12
12 Temporal Windows in CQL
SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;
“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
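To make the meaning of the query concrete, here is a rough Python sketch that evaluates it once, at a single point in time. It is not how a DSMS would execute the query continuously. The function `slow_sections`, the dictionary layout and the inclusive window boundaries are assumptions made for this illustration; timestamps are in seconds.

```python
def slow_sections(streams, now, range_s=15 * 60, threshold=15.0):
    """Return the section IDs whose average speed over the last range_s seconds
    is below threshold.

    streams maps sectionID -> list of (timestamp, speed) records.
    Assumption: a record is inside the window if now - range_s <= timestamp <= now.
    """
    result = []
    for section_id, records in sorted(streams.items()):
        speeds = [speed for ts, speed in records if now - range_s <= ts <= now]
        # A section without records in the window has no average and is skipped.
        if speeds and sum(speeds) / len(speeds) < threshold:
            result.append(section_id)
    return result


example = {
    1: [(0, 30.0), (1000, 10.0), (1100, 12.0)],
    2: [(1000, 25.0)],
    3: [],
}
```

At `now = 1200` only section 1 qualifies (average 11.0 m/s over the two records inside the window); at `now = 900` no section does.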
13
13 Outline Motivation and problem definition Sliding Windows Query Processing in PIPES Data Stream Model Logical Operators Algebraic Query Optimization Physical Operators Runtime Environment Dynamic Plan Migration Conclusions
14
14 3. Query Processing in PIPES Data Streams Model Input Streams Autonomous Source Logical Streams Semantics Physical Streams Implementation of the Semantics, but more expressive
15
15 Input Streams
Sequence of Records
  Arbitrary, but fixed schema
  No limitation to the relational model
Records with timestamps
  Temporally ordered
Schema: HighwayStream( short lane, float speed, float length, Timestamp timestamp )
Input Stream:
  (5; 18.28; 5.27; 5:00:08)
  (2; 21.33; 4.62; 5:01:32)
  (4; 19.69; 9.97; 5:02:16)
  …
16
16 Physical Stream
PIPES: Time Intervals instead of Points
  Validity of an element e
  Processing of e restricted to its time interval
  Removal of invalid records
Sequence of tuples (e, [t_S, t_E))
  Ordered by t_S and t_E
  ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09))
  ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33))
  ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17))
  …
Transformation: input stream → physical stream
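The transformation from input stream to physical stream can be sketched as follows. The helper names `seconds` and `to_physical` are hypothetical, and the sketch assumes a granularity of one second, so that each record is initially valid for exactly one time unit, as in the example tuples of this slide.

```python
def seconds(clock):
    """Convert an 'h:mm:ss' string into seconds since midnight."""
    h, m, s = (int(part) for part in clock.split(":"))
    return 3600 * h + 60 * m + s


def to_physical(input_stream):
    """Attach the half-open validity interval [t, t + 1) to every record.

    Assumption: the timestamp is the last field of the record.
    """
    for record in input_stream:
        t = seconds(record[-1])
        yield (record, (t, t + 1))


highway = [
    (5, 18.28, 5.27, "5:00:08"),
    (2, 21.33, 4.62, "5:01:32"),
    (4, 19.69, 9.97, "5:02:16"),
]
physical = list(to_physical(highway))
```

The first record is mapped to the interval (18008, 18009), which corresponds to [5:00:08, 5:00:09).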
17
17 Data Stream Operators Window Operator Relational Operator „relational“ algebra on data streams projection selection Cartesian product union difference temporal extension of operators
18
18 Window Operator
Purpose
  Extension of the validity of an element by w time units
  Overlap of windows of elements
  Elements need to be processed together
Window: w = 15 minutes
  (e_1, [5:00:08, 5:15:09))
  (e_2, [5:01:32, 5:16:33))
  (e_3, [5:02:16, 5:17:17))
  …
Sliding window: 15 minutes
[Figure: time axis showing the validity of an element from t_S to t_S + 1 + w, a span of w + 1 time units]
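A minimal Python sketch of the window operator, assuming elements are pairs of a payload and a half-open interval given in seconds. The names `window` and `clock` are hypothetical.

```python
def window(physical_stream, w):
    """Extend the validity of every element by w time units."""
    for element, (t_s, t_e) in physical_stream:
        yield (element, (t_s, t_e + w))


def clock(seconds):
    """Format seconds since midnight as h:mm:ss."""
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


# e_1 from the slide: valid in [5:00:08, 5:00:09) before the window operator.
e1 = ("e1", (18008, 18009))
(_, (start, end)), = window([e1], w=15 * 60)
```

Here `clock(start)` is 5:00:08 and `clock(end)` is 5:15:09, matching the first tuple shown on the slide.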
19
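19 Relational Stream Operators
Snapshot-Reducibility
Snapshot
  Mapping of a physical stream to a non-temporal relation
  Relation comprises all valid elements at point t
[Figure: commuting diagram. The relational stream operator maps the streams S_1, …, S_n to S_out; the relational operator maps the relations R_1, …, R_n to R_out; snapshots map each S_i to R_i and S_out to R_out]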
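Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below does this for a filter, under the assumption that a stream is a list of (element, (t_S, t_E)) pairs with half-open intervals and that relations are multisets. All function names are hypothetical.

```python
from collections import Counter


def snapshot(stream, t):
    """Multiset of all elements that are valid at point t."""
    return Counter(e for e, (t_s, t_e) in stream if t_s <= t < t_e)


def stream_filter(stream, predicate):
    """Filter as a stream operator: the intervals pass through unchanged."""
    return [(e, interval) for e, interval in stream if predicate(e)]


def relational_filter(relation, predicate):
    """Filter as a conventional operator on a multiset relation."""
    return Counter({e: n for e, n in relation.items() if predicate(e)})


def is_snapshot_reducible(stream, predicate, times):
    """At every t, the snapshot of the stream result must equal the relational
    result computed on the snapshot of the input."""
    return all(
        snapshot(stream_filter(stream, predicate), t)
        == relational_filter(snapshot(stream, t), predicate)
        for t in times
    )


stream = [(10, (0, 5)), (20, (2, 8)), (10, (3, 6))]
```

For this input, the check holds for the predicate `e < 15` at every point in time from 0 to 9.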
20
20 Query Optimization Application of Well-known Rules from Temporal Databases Slivinskas, Jensen, Snodgrass (ICDE 2000) Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering many rules directly applicable to streams conventional + temporal rules Basis for Effective Query Optimization
21
21 Steps: 1) Query 2) Logical Query Plan 3) Query Optimization 4) Physical Query Plan
SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;
“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
Query plan operators, from output to input:
  Map: projection on sectionID
  Filter: avgSpeed < 15
  Union: merge of data streams
  Aggregation: average speed (avgSpeed)
  Map: projection on speed, assigning sectionID
  Window: w = 15 minutes
22
22 Physical Operators Stateless Operators Processing of an element is independent from the previous ones. Examples: filter, map Stateful Operators Processing of an element depends on previous elements Restrict to elements in sliding window Explicit management of status Examples: join, aggregation
23
23 Data-driven Joins
Input streams A and B and sliding window of size w
Join predicate P
Output records ((a,b), [t_S, t_E))
  P(a,b) holds
  overlapping intervals of a and b
[Figure: time axis with the intervals of a and b; the result (a,b) is valid on their overlap [t_S, t_E)]
24
24 Methodology
Adaptation of Sweepline Technique
  t_A = start time of last element of A
  t_B = start time of last element of B
Status for each input
  Status of A: elements of A with end time ≥ t_B
  Status of B: elements of B with end time ≥ t_A
Continuous Processing
[Figure: inputs A and B with Status A and Status B; steps: insertion, probing & reorganisation]
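The following Python sketch illustrates the sweepline idea with plain lists as status structures. It is a simplified illustration, not the PIPES implementation: the function name `sweepline_join`, the tuple layout of the arrivals and the use of lists instead of more efficient structures are assumptions. Each input is assumed to arrive ordered by start time.

```python
def sweepline_join(arrivals, predicate):
    """Join two streams of (element, [t_s, t_e)) pairs on overlapping intervals.

    arrivals: iterable of (side, element, (t_s, t_e)) with side 'A' or 'B'.
    Returns the join results and the final status of both inputs.
    """
    status = {"A": [], "B": []}
    results = []
    for side, element, (t_s, t_e) in arrivals:
        other = "B" if side == "A" else "A"
        # Insertion into the status of the element's own input.
        status[side].append((element, (t_s, t_e)))
        # Reorganisation: keep only partners with end time >= the new start time;
        # the others cannot overlap this or any later element of this input.
        status[other] = [(e, iv) for e, iv in status[other] if iv[1] >= t_s]
        # Probing: emit a result for each partner that overlaps and satisfies P.
        for partner, (p_s, p_e) in status[other]:
            start, end = max(t_s, p_s), min(t_e, p_e)
            a, b = (element, partner) if side == "A" else (partner, element)
            if start < end and predicate(a, b):
                results.append(((a, b), (start, end)))
    return results, status


arrivals = [
    ("A", 1, (0, 5)),
    ("B", 1, (2, 7)),
    ("A", 2, (6, 11)),
    ("B", 2, (12, 17)),
]
results, status = sweepline_join(arrivals, lambda a, b: a == b)
```

With an equality predicate, the only result is the pair (1, 1), valid on the overlap [2, 5). When the last B element arrives with start time 12, both A elements are removed from the status of A because their intervals end earlier.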
25
25 Runtime Environment of PIPES Sources Sinks Query graph PIPES
26
26 Outline Motivation and problem definition Sliding Windows Query Processing in PIPES Data Stream Model Logical Operators Algebraic Query Optimization Physical Operators Runtime Environment Dynamic Plan Migration Conclusions
27
27 4. Plan Migration Re-Optimization of Query Plans at Runtime Identification of poorly performing subgraphs in the query graph Plan Migration Substitution of old plan by a new one Requirements Preservation of snapshot reducibility Continuous production of results Short migration time
28
28 Example
[Figure: query graph with the labels Sources, R, S, T, U, C1, C2 and Sinks]
29
29 Semantics Problems Duplicates Parallel insertion of new elements into both plans Loss of Results Exclusive insertion of new element in the new plan
30
30 Split Approach in PIPES
Assumptions
  Streams A and B
  Window of length w
  Equivalent query plans P_old and P_new
Earliest split time t_split = max{t_A, t_B} + w
Splitting of the input at split time t_split
31
31 Approach in PIPES
Production of Results
  Acceptance of all results received from the old plan P_old
  Selection of results received from the new plan P_new
  Acceptance only if start time > t_split
[Figure: inputs A and B pass through Split into P_old and P_new]
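The split time and the acceptance rule for results can be written down as a short Python sketch. It only transcribes the two rules stated on the slides and does not model the splitting of the inputs or the plans themselves. The function names are hypothetical, and the strict comparison is taken literally from the slide.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: t_split = max{t_A, t_B} + w."""
    return max(t_a, t_b) + w


def merge_results(old_results, new_results, t_split):
    """Accept every result of the old plan; accept a result of the new plan
    only if its start time is greater than t_split.

    Results are (element, (t_s, t_e)) pairs.
    """
    accepted = list(old_results)
    accepted += [(e, (t_s, t_e)) for e, (t_s, t_e) in new_results if t_s > t_split]
    return accepted


t_split = split_time(t_a=100, t_b=104, w=20)
old = [("r1", (110, 124))]
new = [("r1", (110, 124)), ("r2", (125, 140))]
merged = merge_results(old, new, t_split)
```

In this example `t_split` is 124. The copy of r1 produced by the new plan is rejected, so no duplicate appears, while r2 is accepted from the new plan.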
32
32 Properties Method is broadly applicable Arbitrary plans Many data streams Different window sizes Migration Time Worst-case: w time units
33
33 Outline Motivation and problem definition Sliding Windows Query Processing in PIPES Data Stream Model Logical Operators Algebraic Query Optimization Physical Operators Runtime Environment Dynamic Plan Migration Conclusions
34
34 5. Conclusions Applications Traffic management Alarming systems Observation of production lines Basic ideas of stream processing in PIPES Temporal Databases Data-driven query processing Adaptivity at runtime Continuous Optimization at runtime Dynamic Plan Migration Broadly applicable approach
35
35 Current Work Problems Cost models for optimization New techniques Strategies for adaptation Memory CPU QoS Runtime environment Realtime applications Real applications for DSMS Observation of patients in hospitals Processing of sensor data Coupling of PIPES and commercial products
36
36 Related Work
Abadi, Carney, Cetintemel et al.: Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003.
Arasu, Babu, and Widom: The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003.
Tucker, Maier, Sheard, and Fegaras: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003.
Law, Wang, and Zaniolo: Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.
37
37 Papers on PIPES/XXL
Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE 2006.
Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe - Germany, March, 2-4.
Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe - Germany, March, 2-4.
Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe - New Mexico, 2005.
Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13, (2004).
Jürgen Krämer, Bernhard Seeger: PIPES–A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo)
Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001.
38
38 Future Work Query optimization Adequate cost model Not only stream rates Runtime statistics: delays, memory usage, etc. Static query optimization Multi query optimization Subquery sharing Dynamic query optimization Detection of suitable subgraphs Plan migration at runtime Temporal aspects Coalesce
39
Thank you! Any questions? For more information check our website: http://dbs.mathematik.uni-marburg.de/ Home/Research/Projects/PIPES
40
40 Reorganization
Restriction of memory usage
  All elements where t_E < min_j t_Sj
  t_Sj: latest start timestamp of input stream j
Ordering invariant
  no temporal overlap with future stream elements
Which elements can be discarded in internal data structures? Why?
41
41 Aggregation
Incremental computation
Efficient implementation
  Aggregation segment-tree
  Amortized logarithmic costs per element
[Figure: example for Sum showing the current state (aggregates) over time T and a new element, with the steps Reorganization and Insertion]
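The interplay of insertion and reorganization can be illustrated with a much simpler structure than the aggregation segment-tree named on the slide. The sketch below keeps a running sum in a queue. It assumes that elements arrive ordered by start time, that all validity intervals have the same length (so end timestamps arrive in order as well), and that intervals are half-open. The class `SlidingSum` is hypothetical and does not reproduce the example values of the figure.

```python
from collections import deque


class SlidingSum:
    """Incrementally maintained sum over all currently valid elements."""

    def __init__(self):
        self.elements = deque()  # (value, t_end) in arrival order
        self.total = 0

    def reorganize(self, now):
        """Remove every element whose validity ended at or before now."""
        while self.elements and self.elements[0][1] <= now:
            value, _ = self.elements.popleft()
            self.total -= value

    def insert(self, value, t_start, t_end):
        """Expire old elements, add the new one, and return the sum at t_start."""
        self.reorganize(t_start)
        self.elements.append((value, t_end))
        self.total += value
        return self.total


s = SlidingSum()
sums = [s.insert(4, 0, 10), s.insert(2, 3, 13), s.insert(5, 10, 20)]
```

The three insertions return 4, 6 and 7: when the third element arrives at time 10, the first element (valid in [0, 10)) is removed before the new value is added.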
42
42 Outline Motivation and problem definition Query formulation Our temporal approach Stream types Logical query plans Query optimization Physical query plans Query execution Exploration of Data Streams Conclusions
43
43 Exploration of Data Streams Example Estimation of selectivity during runtime of continuous range queries: select * from Stream S where S.measure between min and max Our Approach Exploit the density p of the distribution Represents all information about the distribution Suitable for estimating the selectivity of multiple queries
44
44 Requirement Problem Density is unknown Adaptation of a non-parametric density estimation technique Kernels Wavelets Sampling and CDF Requirements Low resource consumption (memory and CPU) Memory and CPU adaptive Increasing memory size → higher accuracy Valid estimation at each point in time Adapt to a changing distribution
45
45 Reservoir Sampling CDF is built on top of the iid samples Disadvantages Estimation relies on a few elements No advantage from an increasing memory Advantage Low processing overhead
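As an illustration, the classic reservoir sampling scheme (often called Algorithm R) and an empirical CDF built on the sample can be sketched in Python as follows. Whether PIPES uses exactly this variant is not stated on the slide, so treat it as an assumption. The names `reservoir_sample`, `empirical_cdf` and `selectivity` are hypothetical.

```python
import random
from bisect import bisect_right


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform position in [0, n)
            if j < k:                 # happens with probability k / n
                sample[j] = x
    return sample


def empirical_cdf(sample):
    """CDF estimate: fraction of sample values less than or equal to x."""
    ordered = sorted(sample)
    return lambda x: bisect_right(ordered, x) / len(ordered)


def selectivity(cdf, lo, hi):
    """Estimated fraction of values in (lo, hi]; for continuous data this
    approximates a 'between lo and hi' range query."""
    return cdf(hi) - cdf(lo)


sample = reservoir_sample(range(1000), 10, random.Random(0))
cdf = empirical_cdf([1, 2, 3, 4])
```

Each update costs constant time, which matches the low processing overhead noted on the slide. The estimate rests on only `k` elements, however many elements the stream has delivered.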
46
46 Blockwise Estimation Stream is transformed into blocks For simplicity: blocks are of the same size Idea Estimation of the first k blocks is available Compute the estimation of k+1 blocks iteratively Example (Average) Generalization for density functions Straightforward Extension Problem: Violates the requirement of limited memory
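For the average, the iterative step from k to k+1 blocks can be sketched in Python as follows. The function name `blockwise_average` is hypothetical, and the sketch assumes non-empty blocks of equal size, so that the mean of the block averages equals the overall mean.

```python
def blockwise_average(blocks):
    """Yield the running average after each block.

    Update rule: estimate_{k+1} = estimate_k + (block_avg_{k+1} - estimate_k) / (k + 1)
    """
    estimate, k = 0.0, 0
    for block in blocks:
        block_avg = sum(block) / len(block)
        k += 1
        estimate += (block_avg - estimate) / k
        yield estimate


estimates = list(blockwise_average([[1, 3], [5, 7], [9, 11]]))
```

The estimates after one, two and three blocks are 2.0, 4.0 and 6.0. For the average, a single number and a block counter are enough state to carry from one block to the next.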
47
47 Cumulative-Compressed Estimation Compression Cubic splines Weighting strategies Amortized cost for updates O(log M)
48
48 Experimental Comparison Streaming data from a real traffic data set Arithmetic weights Memory size: 5000