Presentation is loading. Please wait.

Presentation is loading. Please wait.

Query Processing, Resource Management, and Approximation in a Data Stream Management System Selected subset of slides taken from talk by Jennifer Widom.

Similar presentations


Presentation on theme: "Query Processing, Resource Management, and Approximation in a Data Stream Management System Selected subset of slides taken from talk by Jennifer Widom."— Presentation transcript:

1 Query Processing, Resource Management, and Approximation in a Data Stream Management System Selected subset of slides taken from talk by Jennifer Widom at NEDS. stanfordstreamdatamanager

2 2 Data Streams Continuous, unbounded, rapid, time-varying streams of data elements Occur in a variety of modern applications –Network monitoring and traffic engineering –Sensor networks –Telecom call records –Financial applications –Web logs and click-streams –Manufacturing processes DSMSDSMS = Data Stream Management System

3 stanfordstreamdatamanager 3 The STREAM System Declarative language for registering continuous queries; considering data streams and stored relations

4 stanfordstreamdatamanager 4 Contributions to Date Semantics for continuous queries Query plans Exploiting stream constraints Operator scheduling Approximation techniques

5 stanfordstreamdatamanager 5 DSMS Scratch Store The (Simplified) Big Picture Input streams Register Query Streamed Result Stored Result Archive Stored Relations

6 stanfordstreamdatamanager 6 (Simplified) Network Monitoring Register Monitoring Queries DSMS Scratch Store Network measurements, Packet traces Intrusion Warnings Online Performance Metrics Archive Lookup Tables

7 stanfordstreamdatamanager 7 Declarative Language for Continuous Queries A distinction between STREAM and Aurora : –Aurora users directly manipulate one large execution plan –STREAM compiles declarative queries into individual plans, system may merge plans –STREAM also supports direct entry of plans Syntax based on SQL, additional constructs for sliding windows and sampling

8 stanfordstreamdatamanager 8 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

9 stanfordstreamdatamanager 9 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

10 stanfordstreamdatamanager 10 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) Fulfillments F [Range 1 Day] From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

11 stanfordstreamdatamanager 11 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) Orders O From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

12 stanfordstreamdatamanager 12 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] And F.clerk = “Sue” Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe” And O.customer = “Joe”

13 stanfordstreamdatamanager 13 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

14 stanfordstreamdatamanager 14 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

15 stanfordstreamdatamanager 15 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

16 stanfordstreamdatamanager 16 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

17 stanfordstreamdatamanager 17 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

18 stanfordstreamdatamanager 18 Semantics of Database Languages An often neglected topic Traditional relational databases are in reasonable shape –Relational algebra  SQL But triggers were a mess The semantics of an innocent-looking continuous query over data streams may not be obvious

19 stanfordstreamdatamanager 19 A Nonobvious Continuous Query Stream of stock quotes: Stocks(ticker,price) Monitor last 10 minutes of quotes: Select  From Stocks [Range 10 minutes] Is result a relation, a stream, or something else? If a relation, what exactly does it contain? If a stream, how does query differ from: Select  From Stocks [Range 1 minute] or Select  From Stocks [  ]

20 stanfordstreamdatamanager 20 Our Semantics and Language for Continuous Queries Abstract:Abstract: interpretation for CQs based on certain “black boxes” Concrete:Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences Goals –CQs over multiple streams and relations –Exploit relational semantics to the extent possible –Easy queries should be easy to write, simple queries should do what you expect

21 stanfordstreamdatamanager 21 Relations and Streams Assume global, discrete, ordered time domain (more on this later) Relation –Maps time T to set-of-tuples R Stream –Set of (tuple,timestamp) elements

22 stanfordstreamdatamanager 22 Conversions StreamsRelations Window specification Special operators: Istream, Dstream, Rstream Any relational query language

23 stanfordstreamdatamanager 23 Conversion Definitions Stream-to-relation –S [W] is a relation — at time T it contains all tuples in window W applied to stream S up to T –When W = , contains all tuples in stream S up to T Relation-to-stream –Istream(R) contains all (r,T ) where r  R at time T but r  R at time T–1 –Dstream(R) contains all (r,T ) where r  R at time T–1 but r  R at time T –Rstream(R) contains all (r,T ) where r  R at time T

24 stanfordstreamdatamanager 24 Abstract Semantics Take any relational query language Can reference streams in place of relations –But must convert to relations using any window specification language ( default window = [  ] ) Can convert relations to streams –For streamed results –For windows over relations (note: converts back to relation)

25 stanfordstreamdatamanager 25 Query Result at Time T Use all relations at time T Use all streams up to T, converted to relations Compute relational result Convert result to streams if desired

26 stanfordstreamdatamanager 26 Time Easiest: global system clock –Stream elements and relation updates timestamped on entry to system Application-defined time –Streams and relation updates contain application timestamps, may be out of order –Application generates “heartbeat” Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress –Query results in application time

27 stanfordstreamdatamanager 27 Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [  ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk Maximum-cost order fulfilled by each clerk in last 1000 fulfillments

28 stanfordstreamdatamanager 28 Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [  ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk At time T: entire stream O and last 1000 tuples of F as relations Evaluate query, update result relation at T

29 stanfordstreamdatamanager 29 Abstract Semantics – Example 1 Istream Select Istream(F.clerk, Max(O.cost)) From O [  ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk At time T: entire stream O and last 1000 tuples of F as relations Evaluate query, update result relation at T Streamed result:Streamed result: New element (,T) whenever changes from T–1

30 stanfordstreamdatamanager 30 Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Average price over last day for each stock

31 stanfordstreamdatamanager 31 Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Istream provides history of CurPrice Window on history, back to relation, group and aggregate

32 stanfordstreamdatamanager 32 Concrete Language – CQL Relational query language: SQL Window spec. language derived from SQL-99 –Tuple-based, time-based, partitioned Syntactic shortcuts and defaultsSyntactic shortcuts and defaults –So easy queries are easy to write and simple queries do what you expect EquivalencesEquivalences –Basis for query-rewrite optimizations –Includes all relational equivalences, plus new stream-based ones

33 stanfordstreamdatamanager 33 Two Extremely Simple CQL Examples Select  From Strm Had better return Strm (It does) –Default  window for Strm –Default Istream for result Select  From Strm, Rel Where Strm.A = Rel.B Often want “ NOW ” window for Strm But may not want as default

34 stanfordstreamdatamanager 34 Query Execution query planWhen a continuous query is registered, generation a query plan –Users can also register plans directly Plans composed of three main components: –Operators –Operators (as in most conventional DBMS’s) Queues –Inter-operator Queues (as in many conventional DBMS’s) –State (synopses) schedulerGlobal scheduler for plan execution

35 stanfordstreamdatamanager 35 Operators and State synopsesState (synopses) –Summarize tuples seen so far (exact or approximate) for operators requiring history –To implement windows Example: synopsis join –Sliding-window join –Approximation of full join State 1 State 2 ⋈

36 stanfordstreamdatamanager 36 Simple Query Plan State 4 ⋈ State 3  Stream 1 Stream 2 Stream 3 Q1Q1 Q2Q2 State 1 State 2 ⋈ Scheduler

37 stanfordstreamdatamanager 37 Some Issues in Query Plan Generation +/- streamsCompatibility and conversions for streams and relations (+/- streams) State sharing, incremental computation Windowed joinsWindowed joins: Multiway versus 2-way Windows in general: push down, pull up, split, merge, … Time coordination, operator-level heartbeats

38 stanfordstreamdatamanager 38 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky

39 stanfordstreamdatamanager 39 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky Goal: minimize memory use while providing timely, accurate answersGoal: minimize memory use while providing timely, accurate answers

40 stanfordstreamdatamanager 40 Reducing Memory Overhead Two main techniques to date constraints on streamsreduce state 1)Exploit constraints on streams to reduce state operator schedulingreduce queue sizes 2)Clever operator scheduling to reduce queue sizes

41 stanfordstreamdatamanager 41 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams [PODS ’01]

42 stanfordstreamdatamanager 42 Exploiting Stream Constraints arbitraryFor most queries, unbounded memory is required for arbitrary streams [PODS ’01] But streams may exhibit constraints that reduce, bound, or even eliminate state

43 stanfordstreamdatamanager 43 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams [PODS ’01] But streams may exhibit constraints that reduce, bound, or even eliminate state Conventional database constraints –Redefined for streams –“Relaxed” for stream environment

44 stanfordstreamdatamanager 44 Stream Constraints adherence parameterEach constraint type defines adherence parameter k Clustered(k) for attribute S.A Ordered(k) for attribute S.A Referential-Integrity(k) for join S 1  S 2

45 stanfordstreamdatamanager 45 Algorithm for Exploiting Constraints Input –Any Select-Project-Join query over streams –Any set of k-constraints Output –Query execution plan that reduces or eliminates state based on k-constraints –If constraints violated, get approximate result

46 stanfordstreamdatamanager 46 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O

47 stanfordstreamdatamanager 47 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non- matching F’s

48 stanfordstreamdatamanager 48 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non- matching F’s Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples

49 stanfordstreamdatamanager 49 Operator Scheduling Global scheduler invokes run method of query plan operators with “timeslice” parameter Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, … –First scheduler: round-robin  Second scheduler: minimize queue sizes –Third scheduler: minimize combination of queue sizes and latency

50 stanfordstreamdatamanager 50 Approximation Why approximate? –Memory requirement too high, even with constraints and clever scheduling –Can’t process streams fast enough for query load

51 stanfordstreamdatamanager 51 Approximation (cont’d) Static:Static: rewrite queries to add (or shrink) sampling or windows +User can participate, predictable behavior –Doesn’t consider dynamic conditions Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding +Adapts to current resource availability (major open issue) –How to convey to user? (major open issue)

52 stanfordstreamdatamanager 52 The Holy Grail Given: –Declarative query –Resources –Constraints on streams Generate plan and resource allocation that takes advantage of constraints and maximizes precision Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

53 stanfordstreamdatamanager 53 http://www-db.stanford.edu/stream Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma


Download ppt "Query Processing, Resource Management, and Approximation in a Data Stream Management System Selected subset of slides taken from talk by Jennifer Widom."

Similar presentations


Ads by Google