1 11. Streaming Data Management
Chapter 18, Current Issues: Streaming Data and Cloud Computing (3rd edition of the textbook)
2 Finding a Database Problem
- Pick a simple but fundamental assumption underlying traditional database systems
  - Drop it
- Reconsider all aspects of data management and query processing
  - Many Ph.D. theses
  - Prototype from scratch
3 Facts
- Dropped assumptions
  - Data has a fixed schema declared in advance
  - All data is accurate, consistent, and complete
  - First load data, then index it, then run queries
    - Continuous data streams
    - Continuous queries
4 Streaming Data
- Continuous, unbounded, rapid, time-varying streams of data elements
- Occurring in a variety of modern applications
  - Network monitoring and traffic engineering
  - Sensor networks, RFID tags
  - Telecom call records
  - Financial applications
  - Web logs and click-streams
  - Manufacturing processes
- DSMS = Data Stream Management System
5 DBMS versus DSMS
DBMS:
- Persistent relations
- One-time queries
- Random access
- Access plan determined by query processor and physical DB design
DSMS:
- Transient streams (and persistent relations)
- Continuous queries
- Sequential access
- Unpredictable data characteristics and arrival patterns
6 Continuous Queries
- One-time queries: run once to completion over the current data set.
- Continuous queries: issued once and then evaluated continuously as data arrives, e.g.,
  - Notify me when the temperature drops below X
  - Tell me when the price of stock Y exceeds 300
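The difference between the two query types can be sketched in a few lines of Python. This is only an illustration, not any particular DSMS interface: `continuous_query`, the `readings` data, and `THRESHOLD` (standing in for X) are hypothetical names, and the trigger is simplified to "temperature is below X" rather than detecting the moment it drops.

```python
from typing import Callable, Iterable, Iterator


def continuous_query(stream: Iterable[dict],
                     predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Registered once; emits a notification for each arriving element
    that satisfies the predicate, for as long as the stream produces data."""
    for element in stream:
        if predicate(element):
            yield element


# Hypothetical sensor readings; a real stream would be unbounded.
readings = [
    {"sensor": "s1", "temperature": 21.5},
    {"sensor": "s1", "temperature": 17.0},
    {"sensor": "s1", "temperature": 19.9},
    {"sensor": "s1", "temperature": 23.1},
]

THRESHOLD = 20.0  # plays the role of X in the slide
alerts = list(continuous_query(readings, lambda r: r["temperature"] < THRESHOLD))
```

Because the query is a generator, it produces each alert as soon as the matching element arrives instead of waiting for the input to end.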
7 The (Simplified) Big Picture
[Figure: DSMS architecture diagram. Labels: Input streams, Register Query, Streamed Result, Stored Result, Scratch Store, Archive, Stored Relations]
9 Making Things Concrete
[Figure: ALICE and BOB connected through two Central Offices, each feeding the DSMS]
- Outgoing (call_ID, caller, time, event)
- Incoming (call_ID, callee, time, event)
- event = start or end
10 Query 1 (SELF-JOIN)
- Find all outgoing calls longer than 2 minutes
    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2)
      AND O1.call_ID = O2.call_ID
      AND O1.event = 'start'
      AND O2.event = 'end'
- Result requires unbounded storage
- Can provide result as data stream
  - Can output after 2 min, without seeing end
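A minimal sketch of how Query 1 could be answered as a stream follows. It assumes that tuples arrive in time order, that `time` is measured in minutes, and that the stream's own timestamps serve as the clock; the function name `long_calls` and the sample data are hypothetical.

```python
def long_calls(outgoing, limit=2):
    """Streamed answer to Query 1.

    `outgoing` yields (call_ID, caller, time, event) tuples in time order.
    A call is reported as soon as the stream's clock shows it has been open
    for more than `limit` minutes, without waiting for its 'end' tuple.
    """
    open_calls = {}  # call_ID -> (caller, start time); only unreported open calls
    for call_id, caller, time, event in outgoing:
        # Report every open call that has now exceeded the limit.
        overdue = [cid for cid, (_, started) in open_calls.items()
                   if time - started > limit]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short calls are forgotten; long ones were already reported.
            open_calls.pop(call_id, None)


events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),      # 1 minute: too short
    (3, "carol", 3, "start"),  # the clock reaches 3, so call 1 is reported here
    (3, "carol", 4, "end"),
    (1, "alice", 9, "end"),
]
result = list(long_calls(events))
```

In this sketch the state holds only calls that are still open and younger than the limit, which is one way to read the slide's remark that the answer can be emitted after 2 minutes without seeing the end.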
11 Query 2 (JOIN)
- Pair up callers and callees
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID
- Can still provide result as data stream
- Requires unbounded temporary storage ...
- ... unless streams are near-synchronized
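The storage remark can be made concrete with a symmetric hash join, sketched below under several assumptions: both streams are merged into one time-ordered sequence tagged with its source, and "near-synchronized" is modeled by a hypothetical `max_skew` parameter bounding how far apart matching tuples may arrive. The name `pair_up` and the tuple layout are illustrative only, and the eviction scan is deliberately naive.

```python
from collections import defaultdict


def pair_up(merged, max_skew=None):
    """Symmetric hash join of Outgoing and Incoming on call_ID.

    `merged` yields (source, call_ID, party, time) in time order, where
    source is "out" (party is the caller) or "in" (party is the callee).
    With max_skew=None every tuple is kept forever (unbounded state).
    With a numeric max_skew, tuples older than that are evicted.
    """
    seen = {"out": defaultdict(list), "in": defaultdict(list)}
    for source, call_id, party, time in merged:
        if max_skew is not None:
            # Naive eviction: drop stored tuples that are too old to match.
            for table in seen.values():
                for cid in list(table):
                    table[cid] = [(p, t) for p, t in table[cid]
                                  if time - t <= max_skew]
                    if not table[cid]:
                        del table[cid]
        other = "in" if source == "out" else "out"
        # Probe the opposite stream's table, then remember this tuple.
        for other_party, _ in seen[other].get(call_id, []):
            yield (party, other_party) if source == "out" else (other_party, party)
        seen[source][call_id].append((party, time))


merged = [
    ("out", 1, "alice", 0),
    ("in", 1, "bob", 1),
    ("in", 2, "dave", 2),
    ("out", 2, "carol", 3),
]
pairs = list(pair_up(merged))
```

Setting `max_skew` trades completeness for bounded state: a match that arrives later than the bound is silently lost.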
12 Query 3 (group-by aggregation)
- Total connection time for each caller
    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
      AND O1.event = 'start'
      AND O2.event = 'end')
    GROUP BY O1.caller
- Cannot provide result in (append-only) stream
  - Output updates?
  - Provide current value on demand?
  - Memory?
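One of the options raised above, emitting updates instead of an append-only answer, might look like the following sketch. It assumes time-ordered input and that each call has exactly one start and one end tuple; `connection_time_updates` and the sample data are hypothetical.

```python
from collections import defaultdict


def connection_time_updates(outgoing):
    """Update-stream version of Query 3.

    `outgoing` yields (call_ID, caller, time, event) in time order.
    Each completed call emits (caller, new running total), which replaces
    any total emitted earlier for that caller.
    """
    started = {}                 # call_ID -> start time of calls in progress
    totals = defaultdict(float)  # caller -> total connection time so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = time
        elif event == "end" and call_id in started:
            totals[caller] += time - started.pop(call_id)
            yield caller, totals[caller]


events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (1, "alice", 4, "end"),
    (3, "alice", 5, "start"),
    (2, "bob", 6, "end"),
    (3, "alice", 7, "end"),
]
updates = list(connection_time_updates(events))
```

The `totals` dictionary grows with the number of distinct callers, which is the memory concern the slide raises; it could equally serve the "current value on demand" option by being read directly.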
13 DSMS – Architecture & Issues
- Data streams and stored relations: architectural differences
- Declarative language for registering continuous queries
- Flexible query plans and execution strategies
- Centralized? Distributed?
14 DSMS – Options
- Relation: Tuple Set or Sequence?
- Updates: Modification or Append?
- Query Answer: Exact or Approximate?
- Query Evaluation: One or Multiple Pass?
- Query Plan: Fixed or Adaptive?
15 Architectural Comparison
DSMS:
- Resource (memory, per-tuple computation) limited
- Reasonably complex, near real time, query processing
- Useful to identify what data to populate in database
- Query evaluation: one pass
- Query plan: adaptive
DBMS:
- Resource (memory, disk, per-tuple computation) rich
- Extremely sophisticated query processing, analysis
- Useful to audit query results of data stream systems
- Query evaluation: arbitrary
- Query plan: fixed
16 DSMS Challenges
- Must cope with:
  - Stream rates that may be high, variable, bursty
  - Stream data that may be unpredictable, variable
  - Continuous query loads that may be high, variable
- Overload: need to use resources very carefully
- Changing conditions: adaptive strategy
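One simple response to overload is random load shedding, which the deck names later among the approximation techniques. The sketch below assumes the arrival rate and processing capacity are known numbers in the same unit (tuples per second, say); `shed_load` and its parameters are hypothetical, and real systems typically choose what to drop more carefully.

```python
import random


def shed_load(stream, arrival_rate, capacity, rng=None):
    """Randomly drop tuples so that the expected rate after shedding does
    not exceed `capacity`. Requires arrival_rate > 0."""
    rng = rng or random.Random()
    keep_probability = min(1.0, capacity / arrival_rate)
    for item in stream:
        if rng.random() < keep_probability:
            yield item


# 200 tuples/s arriving, 50 tuples/s of capacity: keep roughly one in four.
kept = list(shed_load(range(10_000), arrival_rate=200, capacity=50,
                      rng=random.Random(1)))
# A count over the shed stream can be scaled back up to estimate the true count.
estimated_count = len(kept) / (50 / 200)
```

Since the arrival rate changes over time, an adaptive system would recompute the keep probability as conditions change rather than fix it once as this sketch does.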
17 Query Model
[Figure: User/Application, DSMS, Query Processor]
18 Query Processing
- Query Language
- Operators
- Optimization
- Multi-Query Optimization
19 Stream Query Language
- SQL extension
- Queries reference/produce relations or streams
- Examples: GSQL, CQL
[Figure labels: Stream or Finite Relation; Stream Query Language]
20 Continuous Query Language – CQL
Start with SQL, then add:
- Streams as new data type
- Continuous instead of one-time semantics
- Windows on streams (derived from SQL-99)
- Sampling on streams (basic)
21 Impact of Limited Memory
- Continuous streams grow unboundedly
- Queries may require unbounded memory
- One solution: Approximate query evaluation
22 Approximate Query Evaluation
- Why?
  - Handling load: streams coming too fast
  - Avoid unbounded storage and computation
  - Ad hoc queries need approximate history
- How?
  - Sliding windows, synopses, samples, load shedding
23 Approximate Query Evaluation (cont.)
- Major Issues
  - Metric for set-valued queries
  - Composition of approximate operators
  - How is it understood/controlled by user?
  - Integrate into query language
  - Query planning and interaction with resource allocation
  - Accuracy-efficiency-storage tradeoff and global metric
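As an illustration of the "samples" technique, the sketch below uses reservoir sampling (the classic Algorithm R), which keeps a fixed-size uniform random sample of a stream of unknown length. The slides do not name a specific sampling method, so this choice is an assumption; `reservoir_sample` is a hypothetical helper.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of at most k items using O(k) memory.

    The n-th item (counting from 1) replaces a random reservoir slot with
    probability k/n, which keeps every item seen so far equally likely to
    be in the sample.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = item
    return sample


# Approximate the average of a long stream from 100 retained values.
sample = reservoir_sample(range(100_000), k=100, rng=random.Random(0))
approx_mean = sum(sample) / len(sample)
```

The accuracy of such an estimate depends on the sample size, which is one face of the accuracy-efficiency-storage tradeoff listed among the major issues.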
24 Windows
- Mechanism for extracting a finite relation from an infinite stream
- Various window proposals for restricting operator scope
  - Windows based on ordering attribute (e.g. time)
  - Windows based on tuple counts
  - Windows based on explicit markers (e.g. punctuations)
  - Variants (e.g., partitioning tuples in a window)
[Figure: Stream -> (window specifications) -> finite relations manipulated using SQL -> (streamify) -> Stream]
25 Windows (cont.)
- Terminology
[Figure: timeline from start time to current time with marks t1 to t5, illustrating a sliding window and a tumbling window]
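The two window types named in the figure can be sketched as follows for time-based windows. The sketch assumes a time-ordered stream of `(time, value)` pairs and a positive window width; the function names and the boundary conventions (half-open intervals) are choices made for this example.

```python
from collections import deque


def tumbling_windows(stream, width):
    """Non-overlapping windows [k*width, (k+1)*width).

    Yields (window_start, values) when a window closes; empty windows
    are skipped.
    """
    current_start, bucket = None, []
    for time, value in stream:
        start = (time // width) * width
        if current_start is not None and start != current_start:
            yield current_start, bucket
            bucket = []
        current_start = start
        bucket.append(value)
    if bucket:
        yield current_start, bucket  # flush the last, still-open window


def sliding_window(stream, width):
    """After each arrival, yields the values whose timestamps fall in
    (time - width, time]."""
    window = deque()
    for time, value in stream:
        window.append((time, value))
        while window[0][0] <= time - width:
            window.popleft()  # expire tuples that slid out of the window
        yield time, [v for _, v in window]


stream = [(0, "a"), (1, "b"), (3, "c"), (4, "d"), (7, "e")]
tumbling = list(tumbling_windows(stream, width=3))
sliding = list(sliding_window(stream, width=3))
```

A tumbling window assigns each tuple to exactly one window, while a sliding window lets a tuple appear in every result produced during the `width` time units after it arrives.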
26 Query Operators
- Selection: WHERE clause
- Projection: SELECT clause
- Join: FROM clause
- Group-by (aggregation): GROUP BY clause
27 Query Operators (cont.)
- Selection and projection on streams: straightforward
  - Local per-element operators
- Projection may need to include ordering attribute
- Join: problematic
  - May need to join tuples that are arbitrarily far apart
  - Equijoin on stream ordering attributes may be tractable
- Majority of the work focuses on join using windows
28 Blocking Operators
- Blocking
  - No output until entire input seen
  - Streams: input never ends
- Simple aggregate: output "update" stream
- Set output (sort, group-by)
  - Root: could maintain output data structure
  - Intermediate nodes: try non-blocking analogs
- Join
  - Apply sliding-window restrictions
29 Optimization in DSMS
- Traditionally, query optimizers rely on table-based cardinalities
  - Goal of query optimizer: minimize the size of intermediate results
- Problematic in a streaming environment
  - All streams are unbounded = infinite size!
- Need novel optimization objectives that are relevant when the input sources are streams
30 Query Optimization in DSMS
- Novel notions of optimization:
  - Stream rate based [e.g. NiagaraCQ]
  - QoS based [e.g. Aurora]
- Continuous adaptive optimization
- Possibilities that objectives cannot be met:
  - Resource constraints
  - Bursty arrivals under limited processing capabilities
31 Typical Stream Projects
- Amazon/Cougar (Cornell) – sensors
- Aurora (Brown/MIT) – sensor monitoring, dataflow
- Hancock (AT&T) – telecom streams
- Niagara (OGI/Wisconsin) – Internet XML databases
- OpenCQ (Georgia) – triggers, incr. view maintenance
- STREAM (Stanford) – general-purpose DSMS
- Tapestry (Xerox) – pub/sub content-based filtering
- Telegraph (Berkeley) – adaptive engine for sensors
- Tribeca (Bellcore) – network monitoring
- ...
32 Conclusion
- Conventional DBMS technology is inadequate.
- We need to reconsider all aspects of data management in the presence of streaming data.