1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.

Slides:



Advertisements
Similar presentations
Chapter 10: Designing Databases
Advertisements

Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification.
Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Data Stream Computation Lecture Notes in COMP 9314 modified from those by Nikos Koudas (Toronto U), Divesh Srivastava (AT & T), and S. Muthukrishnan (Rutgers)
Data Streams & Continuous Queries The Stanford STREAM Project stanfordstreamdatamanager.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification.
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations Presenter: Liyan Zhang Presentation of ICS
File Systems and Databases
Aurora Proponent Team Wei, Mingrui Liu, Mo Rebuttal Team Joshua M Lee Raghavan, Venkatesh.
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
Copyright ©2009 Opher Etzion Event Processing Course Engineering and implementation considerations (related to chapter 10)
An Adaptive Multi-Objective Scheduling Selection Framework For Continuous Query Processing Timothy M. Sutherland Bradford Pielech Yali Zhu Luping Ding.
Chapter 14 The Second Component: The Database.
Database Systems More SQL Database Design -- More SQL1.
The Stanford Data Streams Research Project Profs. Rajeev Motwani & Jennifer Widom And a cast of full- and part-time students: Arvind Arasu, Brian Babcock,
SWIM 1/9/20031 QoS in Data Stream Systems Rajeev Motwani Stanford University.
Models and Issues in Data Streaming Presented By :- Ankur Jain Department of Computer Science 6/23/03 A list of relevant papers is available at
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Chapter 4: Organizing and Manipulating the Data in Databases
Optimizing Queries and Diverse Data Sources Laura M. Hass Donald Kossman Edward L. Wimmers Jun Yang Presented By Siddhartha Dasari.
Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
CSE314 Database Systems More SQL: Complex Queries, Triggers, Views, and Schema Modification Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Query Processing, Resource Management, and Approximation in a Data Stream Management System.
Master’s Thesis (30 credits) By: Morten Lindeberg Supervisors: Vera Goebel and Jarle Søberg Design, Implementation, and Evaluation of Network Monitoring.
Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.
1 CPS216: Advanced Database Systems Notes 04: Operators for Data Access Shivnath Babu.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
1 Data Warehouses BUAD/American University Data Warehouses.
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
Adaptive Query Processing in Data Stream Systems Paper written by Shivnath Babu Kamesh Munagala, Rajeev Motwani, Jennifer Widom stanfordstreamdatamanager.
Data Stream Management Systems
Aum Sai Ram Security for Stream Data Modified from slides created by Sujan Pakala.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Data Mining: Concepts and Techniques Mining data streams
W. Hong & S. Madden – Implementation and Research Issues in Query Processing for Wireless Sensor Networks, ICDE 2004.
Hyperion :High Volume Stream Archival Divya Muthukumaran.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
1 10 Systems Analysis and Design in a Changing World, 2 nd Edition, Satzinger, Jackson, & Burd Chapter 10 Designing Databases.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
1 Advanced Database Systems: DBS CB, 2 nd Edition Advanced Topics of Interest: DB the Cloud, and SQL & Stream Processing.
More SQL: Complex Queries, Triggers, Views, and Schema Modification
Mining Data Streams (Part 1)
Advanced Database Systems: DBS CB, 2nd Edition
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
COMP3211 Advanced Databases
Dieter Gawlick, Oracle October, 2005 (GGF15 in Boston)
CPS216: Data-intensive Computing Systems
The Stream Model Sliding Windows Counting 1’s
Parallel Databases.
Applying Control Theory to Stream Processing Systems
Adaptive Query Processing (Background)
An Analysis of Stream Processing Languages
Presentation transcript:

1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook

2 Finding a Database Problem v Pick a simple but fundamental assumption underlying traditional database systems  Drop it v Reconsider all aspects of data management and query processing  Many Ph.D. theses  Prototype from scratch

3 Facts v Dropped assumptions  Data has a fixed schema declared in advance  All data is accurate, consistent, and complete  First load data, then index it, then run queries –Continuous data streams –Continuous queries

4 Streaming Data v Continuous, unbounded, rapid, time-varying streams of data elements v Occurring in a variety of modern applications  Network monitoring and traffic engineering  Sensor networks, RFID tags  Telecom call records  Financial applications  Web logs and click-streams  Manufacturing processes v DSMS = Data Stream Management System

5 DBMS versus DSMS v Persistent relations v One-time queries v Random access v Access plan determined by query processor and physical DB design v Transient streams (and persistent relations) v Continuous queries v Sequential access v Unpredictable data characteristics and arrival patterns

6 Continuous Queries v One time queries – run once to completion over the current data set. v Continuous queries – issued once and continuously evaluated over the data, e.g.,  Notify me when the temperature drops below X  Tell me when prices of stock Y > 300

7 The (Simplified) Big Picture DSMS Scratch Store Input streams Register Query Streamed Result Stored Result Archive Stored Relations

8 (Simplified) Network Monitoring Register Monitoring Queries DSMS Scratch Store Network measurements, Packet traces Intrusion Warnings Online Performance Metrics Archive Lookup Tables

9 Making Things Concrete DSMS Outgoing (call_ID, caller, time, event) Incoming (call_ID, callee, time, event) event = start or end Central Office Central Office ALICE BOB

10 Query 1 (SELF-JOIN) v Find all outgoing calls longer than 2 minutes SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time – O1.time > 2) AND O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end ) v Result requires unbounded storage v Can provide result as data stream  Can output after 2 min, without seeing end

11 Query 2 (JOIN) v Pair up callers and callees SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID v Can still provide result as data stream v Requires unbounded temporary storage … v … unless streams are near-synchronized

12 Query 3 (group-by aggregation) v Total connection time for each caller SELECT O1.caller, sum(O2.time – O1.time) FROM Outgoing O1, Outgoing O2 WHERE (O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end ) GROUP BY O1.caller v Cannot provide result in (append-only) stream w Output updates? w Provide current value on demand? w Memory?

13 DSMS – Architecture & Issues v Data streams and stored relations – architectural differences v Declarative language for registering continuous queries v Flexible query plans and execution strategies v Centralized ? Distributed ?

14 DSMS – Options v Relation: Tuple Set or Sequence? v Updates: Modification or Append? v Query Answer: Exact or Approximate? v Query Evaluation: One or Multiple Pass? v Query Plan: Fixed or Adaptive?

15 Architectural Comparison DSMS DBMS v Resource (memory, per- tuple computation) limited v Reasonably complex, near real time, query processing v Useful to identify what data to populate in database v Query evaluation: one pass v Query plan: adaptive v Resource (memory, disk, per-tuple computation) rich v Extremely sophisticated query processing, analysis v Useful to audit query results of data stream systems v Query evaluation: arbitrary v Query plan: fixed

16 DSMS Challenges v Must cope with:  Stream rates that may be high, variable, bursty  Stream data that may be unpredictable, variable  Continuous query loads that may be high, variable v Overload – need to use resources very carefully v Changing conditions – adaptive strategy

17 Query Model 17 User/ Application DSMS Query Processor

18 Query Processing v Query Language v Operators v Optimization v Multi-Query Optimization

19 Stream Query Language v SQL extension v Queries reference/produce relations or streams v Examples: GSQL, CQL Stream or Finite Relation Stream Query Language

20 Continuous Query Language – CQL Start with SQL Then add… v Streams as new data type v Continuous instead of one-time semantics v Windows on streams (derived from SQL-99) v Sampling on streams (basic)

21 Impact of Limited Memory v Continuous streams grow unboundedly v Queries may require unbounded memory v One solution: Approximate query evaluation

22 Approximate Query Evaluation v Why?  Handling load – streams coming too fast  Avoid unbounded storage and computation  Ad hoc queries need approximate history v How?  Sliding windows, synopsis, samples, load-shedding

23 Approximate Query Evaluation ( cont.) v Major Issues  Metric for set-valued queries  Composition of approximate operators  How is it understood/controlled by user?  Integrate into query language  Query planning and interaction with resource allocation  Accuracy-efficiency-storage tradeoff and global metric

24 Windows v Mechanism for extracting a finite relation from an infinite stream v Various window proposals for restricting operator scope.  Windows based on ordering attribute (e.g. time)  Windows based on tuple counts  Windows based on explicit markers (e.g. punctuations)  Variants (e.g., partitioning tuples in a window) Stream Finite relations manipulated using SQL Window specifications streamify

25 Windows ( cont.) v Terminology Start timeCurrent time time t1t2t3 t4t5 Sliding Window timeTumbling Window

26 Query Operators v Selection - Where clause v Projection - Select clause v Join - From clause v Group-by (Aggregation) – Group-by clause

27 Query Operators ( cont.) v Selection and projection on streams - straightforward  Local per-element operators v Projection may need to include ordering attribute v Join – Problematic  May need to join tuples that are arbitrarily far apart.  Equijoin on stream ordering attributes may be tractable. v Majority of the work focuses on join using windows.

28 Blocking Operators v Blocking  No output until entire input seen  Streams – input never ends v Simple Aggregate – output “update” stream v Set Output (sort, group-by)  Root – could maintain output data structure  Intermediate nodes – try non-blocking analogs v Join  Apply sliding-window restrictions

29 Optimization in DSMS v Traditionally table-based cardinalities used in query optimizer.  Goal of query optimizer: Minimize the size of intermediate results. v Problematic in a streaming environment – All streams are unbounded = infinite size! v Need novel optimization objectives that are relevant when the input sources are streams.

30 Query Optimization in DSMS v Novel notions of optimization:  Stream rate based [e.g. NiagaraCQ]  QoS based [e.g. Aurora] v Continuous adaptive optimization v Possibilities that objectives cannot be met:  Resource constraints  Bursty arrivals under limited processing capabilities.

31 Typical Stream Projects v Amazon/Cougar (Cornell) – sensors v Aurora (Brown/MIT) – sensor monitoring, dataflow v Hancock (AT&T) – telecom streams v Niagara (OGI/Wisconsin) – Internet XML databases v OpenCQ (Georgia) – triggers, incr. view maintenance v Stream (Stanford) – general-purpose DSMS v Tapestry (Xerox) – pub/sub content-based filtering v Telegraph (Berkeley) – adaptive engine for sensors v Tribeca (Bellcore) – network monitoring v ……

32 Conclusion v Conventional DMS technology is inadequate. v We need to reconsider all aspects of data management in presence of streaming data.

33 Question & Answer