
Stream Processing Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 30, 2005.


1 Stream Processing Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 30, 2005

2 Administrivia
 Thursday, L101, 3PM:
   Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems
 Next readings:
   Monday – read and review the Madden paper
   Wednesday – read and summarize the Brin and Page paper

3 Today’s Trivia Question

4 Data Stream Management
 Basic idea: static queries, dynamic data
 Applications:
   Publish-subscribe systems
   Stock tickers, news headlines
   Data acquisition, e.g., from sensors, traffic monitoring, …
 The two main projects that are purely “stream processors”:
   Stanford STREAM
   MIT/Brown/Brandeis Aurora/Medusa

5 Summary from Last Time
 Streams are time-varying data series
 STREAM maps them into timestamped sets
   (Aurora doesn’t seem to do this)
 Most operations on streams resemble normal DB queries:
   Filtering, projection; grouping and aggregation; join
   (though the latter are computed over windows)
 STREAM started with an SQL-like language called CQL
   All stream operations go “through” relations
 Query plan operators have queues and synopses
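The window semantics mentioned above can be sketched in a few lines. This is a minimal illustration of a time-based sliding-window aggregate (a running average), not STREAM's actual code; the class name and fields are invented for this example.

```python
from collections import deque

class SlidingWindowAvg:
    """Running average over the last `window` time units of a
    timestamped stream. Each element is a (timestamp, value) pair."""

    def __init__(self, window):
        self.window = window
        self.buf = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def insert(self, ts, value):
        # Evict elements that have fallen out of the window.
        while self.buf and self.buf[0][0] <= ts - self.window:
            _, old = self.buf.popleft()
            self.total -= old
        self.buf.append((ts, value))
        self.total += value
        return self.total / len(self.buf)
```

A windowed join would follow the same pattern: keep a synopsis (here, the deque) per input, evict expired tuples on each arrival, and probe the other input's synopsis.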

6 Some Tricks for Performance
 Sharing synopses across multiple operators
   In a few cases, more than one operator may join with the same synopsis
 Can exploit punctuations or “k-constraints”
   Analogous to interesting orders
   Referential-integrity k-constraint: bound of k between the arrival of a “many” element and its corresponding “one” element
   Ordered-arrival k-constraint: a window of at most k suffices to sort
   Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
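The ordered-arrival k-constraint above can be made concrete: if every element arrives at most k positions out of order, a buffer of k+1 elements is enough to emit the stream in sorted order. A sketch (illustrative only, not STREAM's implementation):

```python
import heapq

def sort_with_k_constraint(stream, k):
    """Yield the elements of `stream` in sorted order, assuming an
    ordered-arrival k-constraint: no element arrives more than k
    positions later than its sorted position."""
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        # Once more than k elements are buffered, the minimum is safe
        # to emit: nothing smaller can still arrive.
        if len(heap) > k:
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)
```

The payoff is bounded memory: without the constraint, sorting an unbounded stream would require an unbounded synopsis.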

7 Query Processing – “Chain Scheduling”
 Similar in many ways to eddies
 Combination of locally greedy and FIFO scheduling
 Operators are applied to data as follows:
   Assume we know how many tuples each operator can process in a time unit
   Cluster operators into “chains” that maximize the reduction in queue size per unit time (i.e., the most selective operators per time unit)
   Greedily forward tuples into the most selective chain
   Within a chain, process the data in FIFO order
 STREAM also does a form of join reordering
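The greedy step above can be sketched as follows. This is a simplified illustration, not STREAM's scheduler; the chain fields (`queued`, `selectivity`, `cost`) are hypothetical names for the statistics the scheduler would track.

```python
def pick_chain(chains):
    """Greedy step of chain scheduling: among chains with queued input,
    choose the one with the steepest drop in total queue size per unit
    of processing time.

    Each chain is a dict with:
      'queued'      - number of tuples waiting at the chain's input
      'selectivity' - fraction of input tuples that survive the chain
      'cost'        - time units to push one tuple through the chain
    Returns None if no chain has queued input."""
    best, best_rate = None, -1.0
    for c in chains:
        if c['queued'] == 0:
            continue
        # Queue-size reduction per time unit: a highly selective,
        # cheap chain drains queues fastest.
        rate = (1.0 - c['selectivity']) / c['cost']
        if rate > best_rate:
            best, best_rate = c, rate
    return best
```

Within the chosen chain, tuples are then processed in FIFO order, as the slide says.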

8 Scratching the Surface: Approximation
 They point out two areas where we might need to approximate output:
   CPU is limited, and we need to drop some stream elements according to some probabilistic metric
     Collect statistics via a profiler
     Use the Hoeffding inequality to derive a sampling rate that maintains a confidence interval
     This is generally termed load shedding
   May need to do similar things if memory usage is a constraint
 Are there other options? When might they be useful?
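The Hoeffding-based derivation works roughly like this: for values bounded in [0, 1], Hoeffding's inequality gives P(|sample mean − true mean| ≥ ε) ≤ 2·exp(−2nε²), which can be solved for the sample size n needed to hit a target error ε with confidence 1 − δ. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n such that, by Hoeffding's inequality, the mean of n
    samples of a [0, 1]-valued quantity is within `epsilon` of the true
    mean with probability at least 1 - delta.

    Solving 2 * exp(-2 * n * epsilon**2) <= delta for n gives
    n >= ln(2 / delta) / (2 * epsilon**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

A load shedder could use this to decide how many tuples it must keep (and hence how many it may safely drop) while still reporting aggregates within a stated confidence interval.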

9 STREAM in General
 “Logical semantics first”
   Starts with a basic data model: streams as timestamped sets
   Develops a language and semantics, heavily based on SQL
 Proposes a relatively straightforward implementation
   Interesting ideas like k-constraints
   Interesting approaches like chain scheduling
 No real consideration of distributed processing

10 Aurora
 “Implementation first; mix and match operations from past literature”
 Basic philosophy: most of the ideas in streams existed in previous research
   Sliding windows, load shedding, approximation, …
   So let’s borrow those ideas and focus on how to build a real system with them!
 Emphasis is on building a scalable, robust system
 Distributed implementation: Medusa

11 Queries in Aurora
 Oddly: no declarative query language!
 Queries are workflows of physical query operators (SQuAl)
 Many operators resemble relational algebra ops

12 Example Query

13 Some Interesting Aspects
 A relatively simple adaptive query optimizer
   Can push filtering and mapping into many operators
   Can reorder some operators (e.g., joins, unions)
 Built-in error handling is needed
   If a data source fails to respond within a certain amount of time, create a special alarm tuple
   This propagates through the query plan
 Incorporates built-in load shedding and real-time scheduling to support QoS
 Has a notion of combining a query over historical data with data from a stream
   Switches from a pull-based mode (reading from disk) to a push-based mode (reading from the network)
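The alarm-tuple idea above can be sketched as a timeout wrapper around a source. This is an illustration of the mechanism, not Aurora's code; `source` is a hypothetical callable that returns the next tuple, or None if nothing is available yet, and the alarm-tuple shape is invented for this example.

```python
import time

def read_with_alarm(source, timeout):
    """Poll `source` for up to `timeout` seconds. If it produces a
    tuple, return it; otherwise return a special alarm tuple, which
    downstream operators can propagate through the query plan."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        t = source()
        if t is not None:
            return t
        time.sleep(0.01)   # back off briefly before polling again
    return ('ALARM', 'source timed out')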

14 The Medusa Processor
 Distributed coordinator between many Aurora nodes
 Scalability through federation and distribution
 Fail-over
 Load balancing

15 Main Components
 Lookup
   Distributed catalog – schemas, where to find streams, where to find queries
 Brain
   Query setup; load monitoring via I/O queues and statistics
   A load distribution and balancing scheme is used
   Very reminiscent of Mariposa!

16 Load Balancing
 Migration – an operator can be moved from one node to another
   The initial implementation didn’t support moving of state
     The state is simply dropped, and operator processing resumes
     Implications on semantics?
   There are plans to support state migration
 “Agoric system model to create incentives”
   Clients pay nodes for processing queries
   Nodes pay each other to handle load – pairwise contracts negotiated offline
   Bounded-price mechanism – a price for migrating load, plus a spec for what a node will take on
 Does this address the weaknesses of the Mariposa model?

17 Some Applications They Tried
 Financial services (stock ticker)
   The main issue is not volume, but problems with feeds
   Two-level alarm system, where the higher-level alarm helps diagnose problems
   Shared computation among queries
   User-defined aggregation and mapping
 Linear Road (sensor monitoring)
   Traffic sensors on a toll road – the toll changes depending on how many cars are on the road
   Combination of historical and continuous queries
 Environmental monitoring
   Sliding-window calculations

18 The Big Application?
 Military battalion monitoring
   Positions & images of friends and foes
 Load shedding is important
   Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
   Based on a QoS utility function

19 Lessons Learned
 Historical data is important – not just stream data
   (Summaries?)
 Sometimes synchronization is needed for consistency
   “ACID for streams”?
 Streams can be out of order and bursty
   “Stream cleaning”?
 Adaptors and XML are important
   … but we already knew that!
 Performance is critical
   They spent a great deal of time on microbenchmarks and optimization

20 Borealis
 Aurora is now commercial
 Borealis follows up with some new directions:
   Dynamic revision of results, i.e., corrections to stream data
   Dynamic query modification – change queries on the fly
     “Control lines”: change parameters
     “Time travel”: support execution of multiple queries, starting from different points in time (past through future)
   Distributed optimization
   Combining stream- and sensor-processing ideas (we’ll talk about sensor nets next time)
     Sensor-heavy vs. server-heavy optimization

21 Streams and Integration
 How do streams and data integration relate?
 Are streams the future, or just an interesting vista point on the side of the road?

