PODS 20021 Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)

Slides:



Advertisements
Similar presentations
Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.
Advertisements

Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.
Distributed DBMS© M. T. Özsu & P. Valduriez Ch.6/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Adaptive Monitoring of Bursty Data Streams Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani.
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Aurora Proponent Team Wei, Mingrui Liu, Mo Rebuttal Team Joshua M Lee Raghavan, Venkatesh.
SWiM Panel on Engine Implementation Jennifer Widom.
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
Chain: Operator Scheduling for Memory Minimization in Data Stream Systems Authors: Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani (Dept.
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
The Stanford Data Streams Research Project Profs. Rajeev Motwani & Jennifer Widom And a cast of full- and part-time students: Arvind Arasu, Brian Babcock,
SWIM 1/9/20031 QoS in Data Stream Systems Rajeev Motwani Stanford University.
Models and Issues in Data Streaming Presented By :- Ankur Jain Department of Computer Science 6/23/03 A list of relevant papers is available at
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Efficient Scheduling of Heterogeneous Continuous Queries Mohamed A. Sharaf Panos K. Chrysanthis Alexandros Labrinidis Kirk Pruhs Advanced Data Management.
Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
Maintaining Variance and k-Medians over Data Stream Windows Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O’Callaghan. Presentation by.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
Query Processing, Resource Management, and Approximation in a Data Stream Management System.
Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
PODS Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.
Adaptive Query Processing in Data Stream Systems Paper written by Shivnath Babu Kamesh Munagala, Rajeev Motwani, Jennifer Widom stanfordstreamdatamanager.
Data Stream Management Systems
Aum Sai Ram Security for Stream Data Modified from slides created by Sujan Pakala.
Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
BARD / April BARD: Bayesian-Assisted Resource Discovery Fred Stann (USC/ISI) Joint Work With John Heidemann (USC/ISI) April 9, 2004.
Introduction to Query Optimization, R. Ramakrishnan and J. Gehrke 1 Introduction to Query Optimization Chapter 13.
1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Introduction to Query Optimization Chapter 13.
Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002.
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Data Mining: Concepts and Techniques Mining data streams
By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
Review Lecture DB A/18-849B/95-811A/19-729A Internet-Scale Sensor Systems: Design and Policy Review Lecture Databases Phil Gibbons May 1, 2003.
Lecture A/18-849B/95-811A/19-729A Internet-Scale Sensor Systems: Design and Policy Lecture 15 Sensor Databases & Data Stream Systems Phil Gibbons.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
1 Advanced Database Systems: DBS CB, 2 nd Edition Advanced Topics of Interest: DB the Cloud, and SQL & Stream Processing.
Mining Data Streams (Part 1)
Advanced Database Systems: DBS CB, 2nd Edition
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
COMP3211 Advanced Databases
The Stream Model Sliding Windows Counting 1’s
Models and Issues in Data Stream Systems
Introduction to Query Optimization
Load Shedding Techniques for Data Stream Systems
Streaming Sensor Data Fjord / Sensor Proxy Multiquery Eddy
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
Models and Issues in Data Stream Systems
Range-Efficient Computation of F0 over Massive Data Streams
Evaluation of Relational Operations: Other Techniques
Adaptive Query Processing (Background)
Presentation transcript:

PODS Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom) STREAM Project Members: Arvind Arasu, Gurmeet Manku, Liadan O’Callaghan, Justin Rosentein, Qi Sun, Rohit Varma

PODS Data Streams data setsTraditional DBMS – data stored in finite, persistent data sets data streamsNew Applications – data input as continuous, ordered data streams –Network monitoring and traffic engineering –Telecom call records –Network security –Financial applications –Sensor networks –Manufacturing processes –Web logs and clickstreams –Massive data sets

PODS Data Stream Management System User/Application Register Query Stream Query Processor Results Scratch Space (Memory and/or Disk) Data Stream Management System (DSMS)

PODS Meta-Questions Killer-apps –Application stream rates exceed DBMS capacity? –Can DSMS handle high rates anyway? Motivation – Need for general-purpose DSMS? – Not ad-hoc, application-specific systems? Non-Trivial –DSMS = merely DBMS with enhanced support for triggers, temporal constructs, data rate mgmt?

PODS Sample Applications Network security (e.g., iPolicy, NetForensics/Cisco, Niksun) –Network packet streams, user session information –Queries: URL filtering, detecting intrusions & DOS attacks & viruses Financial applications (e.g., Traderbot) –Streams of trading data, stock tickers, news feeds –Queries: arbitrage opportunities, analytics, patterns –SEC requirement on closing trades

PODS Executive Summary Data Stream Management Systems (DSMS) –Highlight issues and motivate research –Not a tutorial or comprehensive survey Caveats –Personal view of emerging field Stanford STREAM Project bias  Cannot cover all projects in detail

PODS DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Passive repository Relatively low update rate No real-time services Assume precise data Access plan determined by query processor, physical DB design Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Active stores Possibly multi-GB arrival rate Real-time requirements Data stale/imprecise Unpredictable/variable data arrival and characteristics

PODS Making Things Concrete DSMS Outgoing (call_ID, caller, time, event) Incoming (call_ID, callee, time, event) event = start or end Central Office Central Office ALICE BOB

PODS Query 1 ( self-join ) Find all outgoing calls longer than 2 minutes SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time – O1.time > 2 AND O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) Result requires unbounded storage Can provide result as data stream Can output after 2 min, without seeing end

PODS Query 2 ( join ) Pair up callers and callees SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID Can still provide result as data stream Requires unbounded temporary storage … … unless streams are near-synchronized

PODS Query 3 ( group-by aggregation ) Total connection time for each caller SELECT O1.caller, sum(O2.time – O1.time) FROM Outgoing O1, Outgoing O2 WHERE (O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) GROUP BY O1.caller Cannot provide result in (append-only) stream –Output updates? –Provide current value on demand? –Memory?

PODS Outline of Remaining Talk Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion

PODS Data Model Append-only –Call records Updates –Stock tickers Deletes –Transactional data Meta-Data –Control signals, punctuations System Internals – probably need all above

PODS Query Model User/ Application DSMS Query Processor

PODS Related Database Technology DSMS must use ideas, but none is substitute –Triggers, Materialized Views in Conventional DBMS –Main-Memory Databases –Distributed Databases –Pub/Sub Systems –Active Databases –Sequence/Temporal/Timeseries Databases –Realtime Databases –Adaptive, Online, Partial Results Novelty in DSMS –Semantics: input ordering, streaming output, … –State: cannot store unending streams, yet need history –Performance: rate, variability, imprecision, …

PODS Stream Projects Amazon/CougarAmazon/Cougar (Cornell) – sensors Aurora (Brown/MIT) – sensor monitoring, dataflow HancockHancock (AT&T) – telecom streams Niagara (OGI/Wisconsin) – Internet XML databases OpenCQOpenCQ (Georgia) – triggers, incr. view maintenance Stream (Stanford) – general-purpose DSMS TapestryTapestry (Xerox) – pub/sub content-based filtering Telegraph (Berkeley) – adaptive engine for sensors TribecaTribeca (Bellcore) – network monitoring

PODS Aurora/STREAM Overview Input streams Users issue continuous and ad-hoc queries Administrator monitors query execution and adjusts run-time parameters Applications register continuous queries Output streams  x  x Waiting Op Ready Op Running Op Synopses Query Plans Historical Storage

PODS Adaptivity (Telegraph) Runtime Adaptivity Multi-query Optimization Framework – implements arbitrary schemes EDDY R STeMs for join Output Queues Input Streams S T RST R x S x T grouped filter (S.B) grouped filter (R.A)

PODS Query-Split Scheme (Niagara) Aggregate subscription for efficiency Split – evaluate trigger only when file updated Triggers – multi-query optimization constant table join scan split scan trig.Act.itrig.Act.j file ifile j Quotes.XML Symbol = Const.Value … IBM MSFT file i file j ………

PODS Shared Predicates [Niagara, Telegraph] R.A > 1 R.A > 7 R.A > 11 R.A < 3 R.A < 5 R.A = 6 R.A = 8 R.A ≠ 9 Predicates for R.A A>7A>11 9 A< A<5 A>1 > < = ≠ Tuple A=8

PODS Outline of Remaining Talk Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion

PODS Blocking Operators Blocking –No output until entire input seen –Streams – input never ends Simple Aggregates – output “update” stream Set Output (sort, group-by) –Root – could maintain output data structure –Intermediate nodes – try non-blocking analogs –Example – juggle for sort [Raman,R,Hellerstein] –Punctuations and constraints Join –non-blocking, but intermediate state? –sliding-window restrictions

PODS Punctuations Punctuations [Tucker, Maier, Sheard, Fegaras] Assertion about future stream contents Unblocks operators, reduces state Future Work –Inserted at source or internal (operator signaling)? –Does P unblock Q? Exists P? Rewrite Q? –Relation between P and memory for Q? R.A<10 R.A≥10 P: S.A≥10 State/Index group-by X RS

PODS Constraints –Schema-level: ordering, referential integrity, many-one joins –Instance-level: punctuations –Query-level: windowed join (nearby tuples only) [Babu-Widom] –Input – multi-stream SPJ query, schema-level constraints –Output – plan with low intermediate state for joins Future Work –Query-level constraints? Combining constraints? –Relaxed constraints (near-sorted, near-clustered) –Exploiting constraints in intra-operator signaling

PODS Impact of Limited Memory Continuous streams grow unboundedly Queries may require unbounded memory [ABBMW 02] –a priori memory bounds for query –Conjunctive queries with arithmetic comparisons –Queries with join need domain restrictions –Impact of duplication elimination Open – general queries

PODS Approximate Query Evaluation Why? –Handling load – streams coming too fast –Avoid unbounded storage and computation –Ad hoc queries need approximate history How? Sliding windows, synopsis, samples, load-shed Major Issues? –Metric for set-valued queries –Composition of approximate operators –How is it understood/controlled by user? –Integrate into query language –Query planning and interaction with resource allocation –Accuracy-efficiency-storage tradeoff and global metric

PODS Sliding Window Approximation Why? –Approximation technique for bounded memory –Natural in applications (emphasizes recent data) –Well-specified and deterministic semantics Issues –Extend relational algebra, SQL, query optimization –Algorithmic work –Timestamps?

PODS Timestamps Explicit –Injected by data source –Models real-world event represented by tuple –Tuples may be out-of-order, but if near-ordered can reorder with small buffers Implicit –Introduced as special field by DSMS –Arrival time in system –Enables order-based querying and sliding windows Issues –Distributed streams? –Composite tuples created by DSMS?

PODS Timestamps in JOIN Output Approach 1 User-specified, with defaults Compute output timestamp Must output in order of timestamps Better for Explicit Timestamp Need more buffering Get precise semantics and user-understanding Approach 2 Best-effort, no guarantee Output timestamp is exit-time Tuples arriving earlier more likely to exit earlier Better for Implicit Timestamp Maximum flexibility to system Difficult to impose precise semantics S x R T

PODS Approximate via Load-Shedding Input Load-Shedding Sample incoming tuples Use when scan rate is bottleneck Positive – online aggregation [Hellerstein, Haas, Wang] Negative – join sampling [Chaudhuri, Motwani, Narasaya] Output Load-Shedding Buffer input infrequent output Use when query processing is bottleneck Example – XJoin [Urhan, Franklin] Exploit synopses Handles scan and processing rate mismatch

PODS Distributed Query Evaluation Logical stream = many physical streams –maintain top 100 Yahoo pages Correlate streams at distributed servers –network monitoring Many streams controlled by few servers –sensor networks Issues –Move processing to streams, not streams to processors –Approximation-bandwidth tradeoff

PODS Example: Distributed Streams Maintain top 100 Yahoo pages –Pages served by geographically distributed servers –Must aggregate server logs –Minimize communication Pushing processing to streams –Most pages not in top 100 –Avoid communicating about such pages –Send updates about relevant pages only –Requires server coordination

PODS Stream Query Language? SQL extension Sliding windows as first-class construct –Awkward in SQL, needs reference to timestamps –SQL-99 allows aggregations over sliding windows Sampling/approximation/load-shedding/QoS support? Stream relational algebra and rewrite rules –Aurora and STREAM –Sequence/Temporal Databases

PODS Outline of Remaining Talk Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion

PODS Aurora Run-time Architecture Scheduler Catalogs Load Shedder QoS Monitor Router Persistent Store Buffer Manager π σ x Box Processors Q1 Q2 Q3 Q4 Q5 InputsOutputs

PODS DSMS Internals Query plans: operators, synopses, queues Memory management –Dynamic Allocation – queries, operators, queues, synopses –Graceful adaptation to reallocation –Impact on throughput and precision Operator scheduling –Variable-rate streams, varying operator/query requirements –Response time and QoS –Load-shedding –Interaction with queue/memory management

PODS Queue Memory and Scheduling Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani] Goal –Given – query plan and selectivity estimates –Schedule – tuples through operator chains Minimize total queue memory –Best-slope scheduling is near-optimal –Danger of starvation for some tuples Minimize tuple response time –Schedule tuple completely through operator chain –Danger of exceeding memory bound Open – graceful combination and adaptivity

PODS Queue Memory and Scheduling Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani] Time selectivity = 0.0 selectivity = 0.6 selectivity = 0.2 Net Selectivity σ1σ1 σ2σ2 σ3σ3 best slope σ3σ3 σ2σ2 σ1σ1 Input Output starvation point

PODS Precision-Resource Tradeoff Resources – memory, computation, I/O Global Optimization Problem –Input: queries with alternate plans, importance weights –Precision: function of resource allocation to queries/operators –Goal: select plans, allocate resources, maximize precision Memory Allocation Algorithm [Varma, Widom] –Model – single query plan, simple precision model –Rules for precision of composed operators –Non-linear numerical optimization formulation Open – Combinatorial algorithm? General case?

PODS Rate-Based & QoS Optimization [Viglas, Naughton] –Optimizer goal is to increase throughput –Model for output-rates as function of input-rates –Designing optimizers? Aurora – QoS approach to load-shedding % tuples delivered QoS DelayOuput-value QoS Static: drop-basedRuntime: delay-basedSemantic: value-based

PODS Outline of Remaining Talk Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion

PODS Synopses Queries may access or aggregate past data Need bounded-memory history-approximation Synopsis? –Succinct summary of old stream tuples –Like indexes/materialized-views, but base data is unavailable Examples –Sliding Windows –Samples –Sketches –Histograms –Wavelet representation

PODS Model of Computation Increasing time Synopses/Data Structures Data Stream Memory: poly(1/ε, log N) Query/Update Time: poly(1/ε, log N) N: # tuples so far, or window size ε: error parameter

PODS Sketching Techniques [Alon,Matias,Szegedy] frequency moments [Feigenbaum etal, Indyk] extended to Lp norm [Dobra et al] complex aggregates over joins Key Subproblem – Self-Join Size Estimation –Stream of values from D = {1,2,…,N} –Let f i = frequency of value i –Self-join size S = Σ f i 2 –Question – estimating S in small space?

PODS Self-Join Size Estimation AMS Technique (randomized sketches) –Given (f 1,f 2,…,f N ) –Z i = random{-1,1} –X = Σ f i Z i (X incrementally computable) Theorem Exp[X 2 ] = Σ f i 2 –Cross-terms f i Z i f j Z j have 0 expectation –Square-terms f i Z i f i Z i = f i 2 Space = log (N + Σ f i ) Independent samples X k reduce variance

PODS Sliding Window Computations Sliding Window Computations [Datar, Gionis, Indyk, Motwani] Goal: statistics/queries Memory: o(N), preferably poly(1/ε, log N) Problem: count/sum/variance, histogram, clustering,… Sample Results: (1+ε)-approximation –Counting: Space O(1/ε (log N)) bits, Time O(1) amortized –Sum over [0,R]: Space O(1/ε log N (log N + log R)) bits, Time O(log R/log N) amortized –Lp sketches: maintain with poly(1/ε, log N) space overhead –Matching space lower bounds

PODS Sliding Window Histograms Key Subproblem – Counting 1’s in bit-stream Goal – Space O(log N) for window size N Problem – Accounting for expiring bits Idea –Partition/track buckets of known count –Error in oldest bucket only –Future 0’s? …

PODS Exponential Histograms Buckets of exponentially increasing size Between K/2 and K/2+1 buckets of each size K = 1/ε and ε = relative error

PODS Bucket sizes = 4,2,2,2,1.Bucket sizes = 4,4,2,1.Bucket sizes = 4,2,2,1,1,1.Bucket sizes = 4,2,2,1,1.Bucket sizes = 4,2,2,1. Exponential Histograms … … K = 2 Element arrived this step. Future C i-1 + C i-2 +…+ C 2 + C >= (K/2) C i Buckets of exponentially increasing size Between K/2 and K/2+1 buckets of each size K = 1/ε and ε = relative error

PODS Many other results … Histograms –V-Opt Histograms [Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss], [Indyk] –End-Biased Histograms (Iceberg Queries) [Manku, Motwani], [Fang, Shiva, Garcia-Molina, Motwani, Ullman] –Equi-Width Histograms (Quantiles) [Manku, Rajagopalan, Lindsay], [Khanna, Greenwald] –Wavelets Seminal work [Vitter, Wang, Iyer] + many others! Data Mining –Stream Clustering [Guha, Mishra, Motwani, O’Callaghan] [O’Callaghan, Meyerson, Mishra, Guha, Motwani] –Decision Trees [Domingos, Hulten], [Domingos, Hulten, Spencer]

PODS Conclusion: Future Work Query Processing –Stream Algebra and Query Languages –Approximations –Blocking, Constraints, Punctuations Runtime Management –Scheduling, Memory Management, Rate Management –Query Optimization (Adaptive, Multi-Query, Ad-hoc) –Distributed processing Synopses and Algorithmic Problems Systems –UI, statistics, crash recovery and transaction management –System development and deployment

PODS Thank You!