11. Streaming Data Management
Chapter 18, Current Issues: Streaming Data and Cloud Computing (the 3rd edition of the textbook)

Slide 1: 11. Streaming Data Management
Chapter 18, Current Issues: Streaming Data and Cloud Computing (the 3rd edition of the textbook)

Slide 2: Finding a Database Problem
- Pick a simple but fundamental assumption underlying traditional database systems
  - Drop it
- Reconsider all aspects of data management and query processing
  - Many Ph.D. theses
  - Prototype from scratch

Slide 3: Facts
- Dropped assumptions
  - Data has a fixed schema declared in advance
  - All data is accurate, consistent, and complete
  - First load data, then index it, then run queries
    - Continuous data streams
    - Continuous queries

Slide 4: Streaming Data
- Continuous, unbounded, rapid, time-varying streams of data elements
- Occurring in a variety of modern applications
  - Network monitoring and traffic engineering
  - Sensor networks, RFID tags
  - Telecom call records
  - Financial applications
  - Web logs and click-streams
  - Manufacturing processes
- DSMS = Data Stream Management System

Slide 5: DBMS versus DSMS
DBMS:
- Persistent relations
- One-time queries
- Random access
- Access plan determined by query processor and physical DB design
DSMS:
- Transient streams (and persistent relations)
- Continuous queries
- Sequential access
- Unpredictable data characteristics and arrival patterns

Slide 6: Continuous Queries
- One-time queries: run once to completion over the current data set.
- Continuous queries: issued once and then continuously evaluated over the data as it arrives, e.g.,
  - Notify me when the temperature drops below X
  - Tell me when the price of stock Y exceeds 300
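As a rough illustration of the continuous-query idea, the sketch below registers a predicate once and evaluates it against every element as it arrives. The function name `continuous_query`, the tuple layout, and the sample ticks are all hypothetical and only mirror the "stock Y > 300" example.

```python
from typing import Callable, Dict, Iterable, Iterator


def continuous_query(stream: Iterable[Dict], predicate: Callable[[Dict], bool]) -> Iterator[Dict]:
    """Evaluate a registered predicate on each element as it arrives."""
    for element in stream:
        if predicate(element):
            # A result is emitted immediately; the query never "completes".
            yield element


# Hypothetical price ticks (assumed layout: symbol and price fields).
ticks = [
    {"symbol": "Y", "price": 295},
    {"symbol": "Y", "price": 301},
    {"symbol": "Z", "price": 400},
    {"symbol": "Y", "price": 310},
]

# "Tell me when the price of stock Y exceeds 300"
alerts = list(continuous_query(ticks, lambda t: t["symbol"] == "Y" and t["price"] > 300))
```

Because the function is a generator, it would work unchanged over an unbounded source; the finite list here is only for demonstration.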

Slide 7: The (Simplified) Big Picture
[Diagram labels: Input streams; Register Query; DSMS; Streamed Result; Stored Result; Scratch Store; Archive; Stored Relations]

Slide 8: (Simplified) Network Monitoring
[Diagram labels: Network measurements, Packet traces; Register Monitoring Queries; DSMS; Intrusion Warnings; Online Performance Metrics; Scratch Store; Archive; Lookup Tables]

Slide 9: Making Things Concrete
- Outgoing (call_ID, caller, time, event)
- Incoming (call_ID, callee, time, event)
- event = start or end
[Diagram labels: ALICE; Central Office; DSMS; Central Office; BOB]

Slide 10: Query 1 (SELF-JOIN)
- Find all outgoing calls longer than 2 minutes
    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
- Result requires unbounded storage
- Can provide result as data stream
  - Can output after 2 min, without seeing end
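One way to picture how Query 1 can emit a result after 2 minutes without waiting for the end event is sketched below. It assumes that Outgoing tuples arrive in time order, that `time` is measured in minutes, and that stream time only advances when a new tuple arrives; the function name `long_calls` and the sample events are hypothetical.

```python
def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) once a call has been open longer than `threshold`.

    Assumes tuples (call_ID, caller, time, event) arrive in time order.
    """
    open_calls = {}  # call_ID -> (caller, start_time)
    for call_id, caller, time, event in outgoing:
        # Stream time has advanced to `time`: report every call that has
        # now been open for longer than the threshold, then forget it.
        for cid, (who, started) in list(open_calls.items()):
            if time - started > threshold:
                yield (cid, who)
                del open_calls[cid]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short calls end here unreported; long ones were already emitted.
            open_calls.pop(call_id, None)


events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),      # 1 minute: not reported
    (1, "alice", 5, "end"),    # 5 minutes: reported
    (3, "carol", 6, "start"),
    (4, "dave", 9, "start"),   # carol's call is reported here, before its end
]
result = list(long_calls(events))
```

In this sketch the operator state holds only calls that are still open and unreported; the result stream itself is append-only and grows without bound.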

Slide 11: Query 2 (JOIN)
- Pair up callers and callees
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID
- Can still provide result as data stream
- Requires unbounded temporary storage …
- … unless streams are near-synchronized
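A symmetric hash join is one common way to evaluate a join such as Query 2 over two streams; the slides do not name a specific algorithm, so this is an illustrative choice. The sketch assumes the two streams are interleaved into a single sequence of tagged tuples (`"O"` for Outgoing, `"I"` for Incoming), which is a simplification introduced here.

```python
from collections import defaultdict


def symmetric_hash_join(tagged_stream):
    """Join Outgoing and Incoming tuples on call_ID as they arrive.

    Each element is (source, call_ID, party) with source "O" or "I".
    Both hash tables grow without bound, which is the storage problem
    the slide points out.
    """
    callers = defaultdict(list)  # call_ID -> callers seen on Outgoing
    callees = defaultdict(list)  # call_ID -> callees seen on Incoming
    for source, call_id, party in tagged_stream:
        if source == "O":
            callers[call_id].append(party)
            for callee in callees.get(call_id, ()):
                yield (party, callee)
        else:
            callees[call_id].append(party)
            for caller in callers.get(call_id, ()):
                yield (caller, party)


interleaved = [
    ("O", 1, "alice"),
    ("I", 1, "bob"),
    ("I", 2, "dave"),
    ("O", 2, "carol"),
]
pairs = list(symmetric_hash_join(interleaved))
```

If the streams were known to be near-synchronized, entries could be discarded shortly after their partner arrives, which would keep the tables small.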

Slide 12: Query 3 (group-by aggregation)
- Total connection time for each caller
    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller
- Cannot provide result in (append-only) stream
  - Output updates?
  - Provide current value on demand?
  - Memory?
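The "output updates" option for Query 3 could look roughly like the following: each time a call ends, the caller's running total is revised and the new value is emitted as an update. The function name `connection_time_updates` and the sample events are hypothetical, and tuples are assumed to arrive in time order.

```python
def connection_time_updates(outgoing):
    """Emit (caller, new_total) every time one of the caller's calls ends."""
    started = {}  # call_ID -> (caller, start_time) for calls still open
    totals = {}   # caller -> total connection time so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = (caller, time)
        elif event == "end" and call_id in started:
            who, start_time = started.pop(call_id)
            totals[who] = totals.get(who, 0) + (time - start_time)
            # An update, not an append: it supersedes the earlier total.
            yield (who, totals[who])


events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (1, "alice", 4, "end"),
    (3, "alice", 5, "start"),
    (2, "bob", 6, "end"),
    (3, "alice", 7, "end"),
]
updates = list(connection_time_updates(events))
```

The memory question from the slide is visible here: `totals` needs one entry per distinct caller, so it grows with the number of callers ever seen.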

Slide 13: DSMS: Architecture & Issues
- Data streams and stored relations: architectural differences
- Declarative language for registering continuous queries
- Flexible query plans and execution strategies
- Centralized? Distributed?

Slide 14: DSMS: Options
- Relation: Tuple Set or Sequence?
- Updates: Modification or Append?
- Query Answer: Exact or Approximate?
- Query Evaluation: One or Multiple Pass?
- Query Plan: Fixed or Adaptive?

Slide 15: Architectural Comparison
DSMS:
- Resource (memory, per-tuple computation) limited
- Reasonably complex, near real time, query processing
- Useful to identify what data to populate in database
- Query evaluation: one pass
- Query plan: adaptive
DBMS:
- Resource (memory, disk, per-tuple computation) rich
- Extremely sophisticated query processing, analysis
- Useful to audit query results of data stream systems
- Query evaluation: arbitrary
- Query plan: fixed

Slide 16: DSMS Challenges
- Must cope with:
  - Stream rates that may be high, variable, bursty
  - Stream data that may be unpredictable, variable
  - Continuous query loads that may be high, variable
- Overload: need to use resources very carefully
- Changing conditions: adaptive strategy

Slide 17: Query Model
[Diagram labels: User/Application; Query Processor; DSMS]

Slide 18: Query Processing
- Query Language
- Operators
- Optimization
- Multi-Query Optimization

Slide 19: Stream Query Language
- SQL extension
- Queries reference/produce relations or streams
- Examples: GSQL, CQL
[Diagram labels: Stream or Finite Relation; Stream Query Language; Stream or Finite Relation]

Slide 20: Continuous Query Language (CQL)
Start with SQL, then add:
- Streams as new data type
- Continuous instead of one-time semantics
- Windows on streams (derived from SQL-99)
- Sampling on streams (basic)

Slide 21: Impact of Limited Memory
- Continuous streams grow unboundedly
- Queries may require unbounded memory
- One solution: Approximate query evaluation

Slide 22: Approximate Query Evaluation
- Why?
  - Handling load: streams coming too fast
  - Avoid unbounded storage and computation
  - Ad hoc queries need approximate history
- How?
  - Sliding windows, synopsis, samples, load-shedding
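As one concrete instance of the "samples" approach, the sketch below uses reservoir sampling (Algorithm R) to keep a fixed-size uniform random sample of a stream of unknown length. The slides do not prescribe a particular sampling method, so this choice, the function name `reservoir_sample`, and the seed parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k elements using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1000), k=5, seed=42)
```

An aggregate computed over `sample` then approximates the same aggregate over the full stream, trading accuracy for bounded storage.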

Slide 23: Approximate Query Evaluation (cont.)
- Major Issues
  - Metric for set-valued queries
  - Composition of approximate operators
  - How is it understood/controlled by user?
  - Integrate into query language
  - Query planning and interaction with resource allocation
  - Accuracy-efficiency-storage tradeoff and global metric

Slide 24: Windows
- Mechanism for extracting a finite relation from an infinite stream
- Various window proposals for restricting operator scope.
  - Windows based on ordering attribute (e.g. time)
  - Windows based on tuple counts
  - Windows based on explicit markers (e.g. punctuations)
  - Variants (e.g., partitioning tuples in a window)
[Diagram labels: Stream; Window specifications; Finite relations manipulated using SQL; streamify]
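The first two window types listed above (ordering attribute and tuple count) can be sketched as follows. Both functions yield the finite relation visible after each arrival; the names `count_window` and `time_window`, and the convention that a time window of width w covers timestamps in (t - w, t], are assumptions made for this example.

```python
from collections import deque


def count_window(stream, size):
    """Tuple-count window: the last `size` elements after each arrival."""
    window = deque(maxlen=size)  # the deque drops the oldest element itself
    for item in stream:
        window.append(item)
        yield list(window)


def time_window(stream, width):
    """Time-based window over (timestamp, value) pairs in timestamp order.

    After a tuple with timestamp t arrives, the window holds the values
    whose timestamps lie in (t - width, t].
    """
    window = deque()
    for ts, value in stream:
        window.append((ts, value))
        while window[0][0] <= ts - width:
            window.popleft()
        yield [v for _, v in window]


last_two = list(count_window([1, 2, 3, 4], size=2))
last_three_units = list(time_window([(1, "a"), (2, "b"), (4, "c"), (5, "d")], width=3))
```

Each yielded list is an ordinary finite relation, so any one-time operator (a sum, a join, a sort) can be applied to it.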

Slide 25: Windows (cont.)
- Terminology
[Figure: two timelines from start time to current time with marks t1 through t5, one illustrating a sliding window and one a tumbling window]
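To contrast with a sliding window, which moves forward with every arrival and overlaps its previous position, a tumbling window partitions time into consecutive, non-overlapping intervals. The sketch below sums values per tumbling window; it assumes integer timestamps in non-decreasing order, and the name `tumbling_sums` is hypothetical.

```python
def tumbling_sums(stream, width):
    """Yield (window_start, sum) for consecutive non-overlapping windows.

    Window k covers timestamps in [k * width, (k + 1) * width).
    """
    current = None  # index of the window being filled
    total = 0
    for ts, value in stream:
        bucket = ts // width
        if current is not None and bucket != current:
            # A tuple from a later window closes the current one.
            yield (current * width, total)
            total = 0
        current = bucket
        total += value
    if current is not None:
        # Flush the last window when the (finite, demo) input ends.
        yield (current * width, total)


sums = list(tumbling_sums([(0, 1), (1, 2), (5, 3), (6, 4), (12, 5)], width=5))
```

Windows that receive no tuples (none in this sample) simply produce no output in this sketch.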

Slide 26: Query Operators
- Selection: WHERE clause
- Projection: SELECT clause
- Join: FROM clause
- Group-by (Aggregation): GROUP BY clause

Slide 27: Query Operators (cont.)
- Selection and projection on streams: straightforward
  - Local per-element operators
- Projection may need to include ordering attribute
- Join: problematic
  - May need to join tuples that are arbitrarily far apart.
  - Equijoin on stream ordering attributes may be tractable.
- Majority of the work focuses on join using windows.
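A join using windows might be sketched as below: each side keeps only its recent tuples, so tuples that are far apart in time are never joined and state stays bounded by the window contents. The tagged-tuple layout, the name `window_join`, and the rule that a tuple expires once it is more than `width` time units old are assumptions made for illustration.

```python
from collections import deque


def window_join(tagged_stream, width):
    """Equijoin two streams on `key`, restricted to a sliding time window.

    Elements are (side, timestamp, key, value) with side "L" or "R",
    arriving in timestamp order.
    """
    windows = {"L": deque(), "R": deque()}
    for side, ts, key, value in tagged_stream:
        other = "R" if side == "L" else "L"
        # Expire tuples older than `width` from both windows.
        for w in windows.values():
            while w and w[0][0] < ts - width:
                w.popleft()
        # Probe the opposite window, then insert the new tuple.
        for _, other_key, other_value in windows[other]:
            if other_key == key:
                yield (value, other_value) if side == "L" else (other_value, value)
        windows[side].append((ts, key, value))


arrivals = [
    ("L", 0, "k1", "a"),
    ("R", 1, "k1", "x"),
    ("R", 10, "k1", "y"),
    ("L", 11, "k1", "b"),
]
joined = list(window_join(arrivals, width=5))
```

Note that "a" and "y" share a key but are never paired, because they are 10 time units apart and the window is 5 wide.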

Slide 28: Blocking Operators
- Blocking
  - No output until entire input seen
  - Streams: input never ends
- Simple aggregate: output “update” stream
- Set output (sort, group-by)
  - Root: could maintain output data structure
  - Intermediate nodes: try non-blocking analogs
- Join
  - Apply sliding-window restrictions

Slide 29: Optimization in DSMS
- Traditionally, table-based cardinalities are used in the query optimizer.
  - Goal of query optimizer: minimize the size of intermediate results.
- Problematic in a streaming environment: all streams are unbounded, i.e. of infinite size!
- Need novel optimization objectives that are relevant when the input sources are streams.

Slide 30: Query Optimization in DSMS
- Novel notions of optimization:
  - Stream rate based [e.g. NiagaraCQ]
  - QoS based [e.g. Aurora]
- Continuous adaptive optimization
- Possibilities that objectives cannot be met:
  - Resource constraints
  - Bursty arrivals under limited processing capabilities.

Slide 31: Typical Stream Projects
- Amazon/Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia): triggers, incr. view maintenance
- Stream (Stanford): general-purpose DSMS
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tribeca (Bellcore): network monitoring
- …

Slide 32: Conclusion
- Conventional DBMS technology is inadequate.
- We need to reconsider all aspects of data management in the presence of streaming data.

Slide 33: Question & Answer
