Presentation is loading. Please wait.

Presentation is loading. Please wait.

Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.

Similar presentations


Presentation on theme: "Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream."— Presentation transcript:

1 Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream stanfordstreamdatamanager

2 2 Data Streams Continuous, unbounded, rapid, time-varying streams of data elements Occur in a variety of modern applications –Network monitoring and traffic engineering –Sensor networks, RFID tags –Telecom call records –Financial applications –Web logs and click-streams –Manufacturing processes DSMSDSMS = Data Stream Management System

3 stanfordstreamdatamanager 3 DBMS versus DSMS Persistent relations One-time queries Random access Access plan determined by query processor and physical DB design Transient streams (and persistent relations) Continuous queries Sequential access Unpredictable data characteristics and arrival patterns

4 stanfordstreamdatamanager 4 DSMS Scratch Store The (Simplified) Big Picture Input streams Register Query Streamed Result Stored Result Archive Stored Relations

5 stanfordstreamdatamanager 5 (Simplified) Network Monitoring Register Monitoring Queries DSMS Scratch Store Network measurements, Packet traces Intrusion Warnings Online Performance Metrics Archive Lookup Tables

6 stanfordstreamdatamanager 6 The STREAM System Data streams and stored relations Declarative language for registering continuous queries Designed to cope with high data rates and query workloads –Graceful approximation when needed –Careful resource allocation and usage

7 stanfordstreamdatamanager 7 The STREAM System Designed to cope with bursty streams and changing workloads –Continuous monitoring of data and system behavior –Continuous reoptimization Relational, centralized (for now)

8 stanfordstreamdatamanager 8 Contributions to Date Semantics for continuous queries Query plans Query optimization and dynamic reoptimization Exploiting stream constraints Operator scheduling Approximation techniques Resource allocation to maximize precision Initial running prototype

9 stanfordstreamdatamanager 9 Contributions to Date Semantics for continuous queries Query plans Query optimization and dynamic reoptimization Exploiting stream constraints Operator scheduling Approximation techniques Resource allocation to maximize precision Initial running prototype

10 stanfordstreamdatamanager 10 Semantics for Continuous Queries Abstract:Abstract: window specification language and relational query language as “black boxes” Concrete:Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences Goals –CQs over multiple streams and relations –Exploit relational semantics to the extent possible –Easy queries should be easy to write –Queries should do what you expect

11 stanfordstreamdatamanager 11 Concrete Language – CQL Relational query language: SQL Window spec. language derived from SQL-99 –Tuple-based, time-based, partitioned Syntactic shortcuts and defaultsSyntactic shortcuts and defaults –So easy queries are easy to write and queries do what you expect EquivalencesEquivalences –Basis for query-rewrite optimizations –Includes all relational equivalences, plus new stream-based ones

12 stanfordstreamdatamanager 12 CQL Example Two streams, contrived for example: Orders (orderID, cost) Fulfillments (orderID, clerk) Maximum cost of orders fulfilled by each clerk over his/her last 100 fulfillments:

13 stanfordstreamdatamanager 13 CQL Example Two streams, contrived for example: Orders (orderID, cost) Fulfillments (orderID, clerk) Maximum cost of orders fulfilled by each clerk over his/her last 100 fulfillments: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 100] Where O.orderID = F.orderID Group By F.clerk

14 stanfordstreamdatamanager 14 CQL Example Two streams, contrived for example: Orders (orderID, cost) Fulfillments (orderID, clerk) Maximum cost of orders fulfilled by each clerk over his/her last 100 fulfillments (streamed result): Select Istream(F.clerk, Max(O.cost)) From Orders O, Fulfillments F [Partition By clerk Rows 100] Where O.orderID = F.orderID Group By F.clerk

15 stanfordstreamdatamanager 15 Query Execution query planWhen a continuous query is registered, generate a query plan –Users can also register plans directly Plans composed of three main components: –Operators –Operators (as in most conventional DBMS’s) Queues –Inter-operator Queues (as in many conventional DBMS’s) –State (synopses) schedulerGlobal scheduler for plan execution

16 stanfordstreamdatamanager 16 Simple Query Plan State 4 ⋈ State 3  Stream 1 Stream 2 Stream 3 Q1Q1 Q2Q2 State 1 State 2 ⋈ Scheduler

17 stanfordstreamdatamanager 17 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky

18 stanfordstreamdatamanager 18 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky Goal: minimize memory use while providing timely, accurate answersGoal: minimize memory use while providing timely, accurate answers

19 stanfordstreamdatamanager 19 Reducing Memory Overhead Two main techniques to date constraints on streamsreduce state 1)Exploit constraints on streams to reduce state operator schedulingreduce queue sizes 2)Clever operator scheduling to reduce queue sizes

20 stanfordstreamdatamanager 20 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams

21 stanfordstreamdatamanager 21 Exploiting Stream Constraints arbitraryFor most queries, unbounded memory is required for arbitrary streams But streams may exhibit constraints that reduce, bound, or even eliminate state

22 stanfordstreamdatamanager 22 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams But streams may exhibit constraints that reduce, bound, or even eliminate state Conventional database constraints –Redefined for streams –“Relaxed” for stream environment

23 stanfordstreamdatamanager 23 Stream Constraints adherence parameterEach constraint type defines adherence parameter k Clustered(k) for attribute S.A Ordered(k) for attribute S.A Referential-Integrity(k) for join S 1  S 2

24 stanfordstreamdatamanager 24 Algorithm for Exploiting Constraints Input –Any Select-Project-Join query over streams –Any set of k-constraints Output –Query execution plan that reduces or eliminates state based on k-constraints –If constraints violated, get approximate result

25 stanfordstreamdatamanager 25 Operator Scheduling Global scheduler invokes run method of query plan operators with “timeslice” parameter Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, … –First scheduler: round-robin  Second scheduler: minimize queue sizes –Third scheduler: minimize combination of queue sizes and latency

26 stanfordstreamdatamanager 26 CHAIN Scheduling Algorithm Goal: minimize total queue size for unpredictable, bursty stream arrival patterns Proven: within constant factor of any “clairvoyant” strategy for some queries Empirical results: large savings over naive strategies for many queries But minimizing queue sizes is at odds with minimizing latency

27 stanfordstreamdatamanager 27 CHAIN Algorithm (contd.) ready chain Always schedule ready chain with steepest downslope progress chart Plan progress chart: memory delta per unit time Independent scheduling of operators makes mistakes chains Consider chains of operators Memory 1 3 48 1 0.9 2 0.2 Op1 Op2 Op3 Op4 Computation time

28 stanfordstreamdatamanager 28 Approximation Why approximate? –Memory requirement too high, even with constraints and clever scheduling –Can’t process streams fast enough for query load

29 stanfordstreamdatamanager 29 Approximation (cont’d) Static:Static: rewrite queries to add (or shrink) sampling or windows +User can participate, predictable behavior –Doesn’t consider dynamic conditions Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding +Adapts to current resource availability (major open issue) –How to convey to user? (major open issue)

30 stanfordstreamdatamanager 30 The Stream Systems Landscape (At least) three general-purpose DSMS prototypes underway –STREAM –STREAM (Stanford) –Aurora –Aurora (Brown, Brandeis, MIT) –TelegraphCQ –TelegraphCQ (Berkeley) All will be demo’d at SIGMOD ’03 stream system benchmarkCooperating to develop stream system benchmark  Goal: demonstrate that conventional systems are far inferior for data stream applications

31 Conclusion: We’re building a Data Stream Management System http://www-db.stanford.edu/stream stanfordstreamdatamanager


Download ppt "Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream."

Similar presentations


Ads by Google