
1 Aurora: A new model and architecture for data stream management
Group 19: Chu Xuân Tình, Trần Nhật Tuấn, Huỳnh Thái Tâm
Lecturer: Associate Professor Dr.techn. Dang Tran Khanh

2 Outline
- Introduction
- The Aurora stream query algebra
- Run-time architecture

3 Aurora-system architecture
- Aurora: a new model and architecture for data stream management; a new system to manage data streams for monitoring applications.
- A software system that must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires rethinking the fundamental architecture of a DBMS for this application area.
- Aurora is a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T.

4 Currently used DB systems
- Classical DBMS:
  - Passive repository storing data (HADP: human-active, DBMS-passive model)
  - Only the current state of the data is important
  - Data is synchronized; queries have exact answers (no support for approximation)
- Monitoring applications are difficult to implement in a traditional DBMS:
  - The basic computation model is wrong: DBMSs have a HADP model, while monitoring applications often require a DAHP model
  - Triggers and alerters are second-class citizens
  - Getting the required data from historical time series is problematic
  - Developing dedicated middleware is expensive
- Conclusion: these systems are ill-suited to applications that alert humans when an abnormal situation occurs (such applications expect the DAHP model: DBMS-active, human-passive)

5 Aurora: main assumptions
- Data comes from various uniquely identified data sources (data streams)
- Each incoming tuple is timestamped
- Aurora is expected to process incoming streams
- Tuples flow through a loop-free, directed graph of processing operations (boxes)
- Outputs from the system are presented to applications
- Aurora maintains historical storage


7 Aurora system overview
- Any box can filter a stream (select operation)
- A box can compute stream aggregates by applying an aggregate function across a window of values in the stream
- The output of any box can be an input for several other boxes (split operation)
- Each box can gather tuples from many inputs (union operation)

8 Aurora query model
[Figure: query network of boxes b1 to b7 feeding an application; labels: connection points, storage S1, S2, S3, continuous query, view, ad-hoc query, "Keep 2 hr", QoS spec]
- Each connection point (CP) and view should have a persistence specification (e.g. "keep data for 2 hr")
- Each output is associated with a QoS specification (helps to allocate the processing elements along the path)

9 Queries in Aurora
- Continuous queries
  - The query continuously processes tuples
  - Output tuples are delivered to an application
- Ad-hoc queries
  - The system processes data and delivers answers starting from the earliest time stored in the connection point
  - The semantics are the same as for a continuous query that started execution at t_now - (persistence specification)
  - The query continues until explicit termination
- Views
  - Similar to materialized or partially materialized views in classical DB systems
  - An application may connect to the end of this path whenever there is a need

10 Queries in Aurora: connection points
- Support dynamic modification of the network
- Support data caching (persistence specification), which is helpful for ad-hoc queries
- A connection point without an upstream node can be used as a stored data set (as in a classical DBMS)
- Tuples from a connection point can be pushed through the system (e.g. when the connection point is "materialized" and the stored tuples are passed as a stream to the downstream nodes)
- Alternatively, a downstream node can pull the data (helpful in the execution of filtering or joining operations)
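The persistence specification can be pictured as a bounded history kept at the connection point. The following is a minimal sketch, not Aurora's implementation: the `ConnectionPoint` class, its method names, and the dictionary tuple layout with a `"TS"` key are all assumptions made for illustration, and tuples are assumed to arrive in timestamp order.

```python
from collections import deque


class ConnectionPoint:
    """Hypothetical connection point that keeps `keep` time units of history."""

    def __init__(self, keep):
        self.keep = keep
        self.history = deque()

    def push(self, t):
        """Store a new tuple and discard tuples older than the retention window."""
        self.history.append(t)
        cutoff = t["TS"] - self.keep
        while self.history and self.history[0]["TS"] < cutoff:
            self.history.popleft()

    def replay(self):
        """What an ad-hoc query would see first: everything still stored."""
        return list(self.history)


cp = ConnectionPoint(keep=7200)  # "keep data for 2 hr", timestamps in seconds
for ts in (0, 3600, 7200, 9000):
    cp.push({"TS": ts, "A": ts // 60})
stored = [t["TS"] for t in cp.replay()]
```

An ad-hoc query attached at time 9000 would start from the tuple with timestamp 3600, which matches the "t_now minus persistence specification" semantics described on the previous slide.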

11 Application Domains
- Online Auctions
- Network Traffic Management
- Habitat Monitoring
- Military Logistics
- Immersive Environments
- Road Traffic Monitoring
- System Monitoring

12 SQuAl
- The Aurora [S]tream [Qu]ery [Al]gebra
- 7 operators:
  - Order-agnostic (Filter, Map, Union)
  - Order-sensitive (BSort, Aggregate, Join, Resample)
- Model:
  - A stream is an append-only sequence of tuples with uniform type
  - A stream type has the form: (TS, A_1, ..., A_n)
  - Stream tuples have the form: (ts, v_1, ..., v_n)
  - A_i: application-specific data fields; ts: timestamp

13 Order-agnostic operators
- Input tuples have the form: t = (TS = ts, A_1 = v_1, ..., A_k = v_k)
- 3 operators:
  - Filter: similar to relational selection; filters on multiple predicates and routes tuples according to which predicates they satisfy
  - Map: similar to relational projection; applies arbitrary functions to tuples (including user-defined functions)
  - Union: merges 2 or more streams of common schema

14 Filter
- Acts much like a case statement
- Can be used to route input tuples to alternative streams
- Form: Filter(P_1, ..., P_m)(S)
  - P_i: predicates over tuples on the input stream S
- Its output consists of m + 1 streams
- Output tuples have the same schema and values as input tuples, including their QoS timestamps
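The case-statement behaviour of Filter can be sketched as follows. This is an illustration only: the function name `filter_box` and the dictionary tuple layout are assumptions, and routing each tuple to the first predicate it satisfies (with the extra stream collecting tuples that satisfy none) is how the sketch reads "acts much like a case statement".

```python
def filter_box(predicates, stream):
    """Route each tuple to the output of the first predicate it satisfies.

    Returns m + 1 output lists; the last one collects tuples that satisfy
    no predicate. First-match routing is an assumption of this sketch.
    """
    outputs = [[] for _ in range(len(predicates) + 1)]
    for t in stream:
        for i, pred in enumerate(predicates):
            if pred(t):
                outputs[i].append(t)  # tuple is passed through unchanged
                break
        else:
            outputs[-1].append(t)  # no predicate matched
    return outputs


readings = [{"TS": 1, "A": 3}, {"TS": 2, "A": 7}, {"TS": 3, "A": 12}]
high, medium, rest = filter_box(
    [lambda t: t["A"] > 10, lambda t: t["A"] > 5], readings
)
```

Here two predicates produce three output streams, and every tuple keeps its schema, values, and timestamp.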

15 Map
- Is a generalized projection operator
- Form: Map(B_1 = F_1, ..., B_m = F_m)(S)
  - B_i: name of an attribute
  - F_i: function over tuples on the input stream S
- The output tuple for each input tuple t has the form: (TS = t.TS, B_1 = F_1(t), ..., B_m = F_m(t))
- The resulting stream can have a different schema than the input stream, but the timestamps of input tuples are preserved in the corresponding output tuples
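A minimal sketch of Map under the same assumptions (dictionary tuples with a `"TS"` key; `map_box` and the attribute names are hypothetical):

```python
def map_box(functions, stream):
    """Apply each F_i to every input tuple and emit (TS, B_1, ..., B_m)."""
    for t in stream:
        out = {"TS": t["TS"]}  # the input timestamp is preserved
        for name, fn in functions.items():
            out[name] = fn(t)
        yield out


celsius = [{"TS": 1, "A": 0.0}, {"TS": 2, "A": 100.0}]
fahrenheit = list(map_box({"F": lambda t: t["A"] * 9 / 5 + 32}, celsius))
```

The output schema (TS, F) differs from the input schema (TS, A), while each output tuple carries the timestamp of the tuple it was computed from.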

16 Union
- Is used to merge 2 or more streams into a single output stream
- Form: Union(S_1, ..., S_n)
  - S_i: streams with a common schema
- Union can output tuples in any order
- Output tuples have the same schema and values as input tuples, including their QoS timestamps
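Because Union may emit tuples in any order, any interleaving of the inputs is a valid result. The sketch below picks round-robin interleaving purely as one possible order; `union_box` is a hypothetical name.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking an exhausted input stream


def union_box(*streams):
    """Merge streams of a common schema; the output order is arbitrary."""
    for group in zip_longest(*streams, fillvalue=_MISSING):
        for t in group:
            if t is not _MISSING:
                yield t  # tuples pass through unchanged


s1 = [{"TS": 1, "A": 10}, {"TS": 3, "A": 30}]
s2 = [{"TS": 2, "A": 20}]
merged = list(union_box(s1, s2))
```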

17 Order-sensitive operators
- Require order specification arguments
- An order specification describes the tuple arrival order the operator expects
- Order specifications have the form: Order(On A, Slack n, GroupBy B_1, ..., B_m)
  - A, B_i: attributes
  - n: non-negative integer
- 4 operators:
  - BSort: an approximate sort operator with semantics equivalent to a bounded-pass bubble sort
  - Aggregate: applies a window function to sliding windows over its input stream
  - Join: a binary operator that resembles a band join applied to infinite streams
  - Resample: an interpolation operator used to align streams

18 BSort
- Is an approximate sort operator
- Form: BSort(Assuming O)(S)
  - O = Order(On A, Slack n, GroupBy B_1, ..., B_m) is a specification of the assumed ordering over the output stream
- Performs a buffer-based approximate sort
- Equivalent to n passes of a bubble sort
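One way to picture the buffer-based sort is a buffer of n + 1 tuples from which the smallest is evicted whenever the buffer fills. The sketch below follows that reading; the function name `bsort` and the end-of-input flush are assumptions, and the GroupBy part of the order specification is ignored.

```python
def bsort(stream, key, slack):
    """Approximate sort with a buffer of slack + 1 tuples."""
    buf = []
    for t in stream:
        buf.append(t)
        if len(buf) > slack:  # buffer now holds slack + 1 tuples
            smallest = min(buf, key=key)
            buf.remove(smallest)
            yield smallest
    # A real stream never ends; this flush only makes the finite example complete.
    yield from sorted(buf, key=key)


nearly_sorted = list(bsort([2, 1, 3, 5, 4], key=lambda v: v, slack=1))
too_disordered = list(bsort([3, 2, 1], key=lambda v: v, slack=1))
```

With Slack 1 the first input comes out fully sorted, while [3, 2, 1] comes out as [2, 1, 3], the same result as a single bubble-sort pass, which shows why the sort is only approximate.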

19 BSort

20 Aggregate
- Applies "window functions" to sliding windows over its input stream
- Form: Aggregate(F, Assuming O, Size s, Advance i)(S)
  - F: "window function" (an SQL-style aggregate operation or a Postgres-style user-defined function)
  - O = Order(On A, Slack n, GroupBy B_1, ..., B_m) is an order specification over the input stream S
  - s: size of the window (measured in terms of values of A)
  - i: an integer or predicate that specifies how to advance the window when it slides
- Output tuples have the form: (TS = ts, A = a, B_1 = u_1, ..., B_m = u_m) ++ (F(W))
  - W: "window" of tuples from the input stream with values of A between a and a + s - 1
  - ts: the smallest timestamp associated with tuples in W
  - ++: denotes concatenation of 2 tuples

21 Aggregate

22
- Slack = 1 or more
- Blocking: Aggregate waits for lost or late tuples to arrive in order to finish window calculations
- Optional Timeout argument: Aggregate(F, Assuming O, Size s, Advance i, Timeout t)
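The windowing rule of Aggregate (slide 20) can be sketched over a finite, already ordered input. This is a simplification under stated assumptions: Slack is 0, there is no GroupBy and no Timeout, A is an integer, Advance is an integer, empty windows are skipped, and the result of F is stored under the hypothetical key `"F"`.

```python
def aggregate(f, size, advance, stream, on="A"):
    """Slide a window of `size` values of `on` over an ordered, finite stream."""
    if advance <= 0:
        raise ValueError("advance must be positive")
    tuples = list(stream)
    if not tuples:
        return []
    out = []
    a = tuples[0][on]
    last = tuples[-1][on]
    while a <= last:
        # W: tuples whose value of A lies between a and a + size - 1
        window = [t for t in tuples if a <= t[on] <= a + size - 1]
        if window:
            ts = min(t["TS"] for t in window)  # smallest timestamp in W
            out.append({"TS": ts, on: a, "F": f(window)})
        a += advance
    return out


readings = [{"TS": 9 + a, "A": a, "V": 10 * a} for a in range(1, 6)]
totals = aggregate(lambda w: sum(t["V"] for t in w), 3, 2, readings)
```

With Size 3 and Advance 2 the windows start at A = 1, 3, and 5. In a live stream the last window could only be closed by later tuples or by the Timeout argument; the finite input here sidesteps that blocking issue.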

23 Join
- Is a binary join operator
- Form: Join(P, Size s, Left Assuming O_1, Right Assuming O_2)(S_1, S_2)
  - P: predicate over pairs of tuples from input streams S_1 and S_2
  - s: integer
  - O_1: order specification on some numeric or time-based attribute of S_1 (A)
  - O_2: order specification on some numeric or time-based attribute of S_2 (B)
- For every in-order tuple t in S_1 and u in S_2, the concatenation of t and u (t ++ u) is output if:
  - |t.A - u.B| ≤ s
  - P holds of t and u
- The QoS timestamp of the output tuple is the minimum of the timestamps of t and u
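The output condition of Join can be sketched with a nested loop over two finite inputs. Assumptions of this sketch: both inputs are already in order, the function name `join` is hypothetical, and the concatenation t ++ u is represented by keeping both tuples under the keys `"left"` and `"right"` to avoid attribute-name clashes.

```python
def join(p, size, s1, s2, left_on="A", right_on="B"):
    """Band join: pair t and u when |t.A - u.B| <= size and P(t, u) holds."""
    out = []
    for t in s1:
        for u in s2:
            if abs(t[left_on] - u[right_on]) <= size and p(t, u):
                out.append({
                    "TS": min(t["TS"], u["TS"]),  # minimum timestamp of t and u
                    "left": t,
                    "right": u,
                })
    return out


bids = [{"TS": 1, "A": 10, "X": "a"}, {"TS": 5, "A": 20, "X": "b"}]
asks = [{"TS": 2, "B": 11, "X": "a"}, {"TS": 3, "B": 30, "X": "b"}]
matches = join(lambda t, u: t["X"] == u["X"], 2, bids, asks)
```

Only the first bid and first ask lie within the band of size 2 and agree on X, so a single output tuple with timestamp 1 is produced.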

24 Join

25 Resample
- Is an asymmetric, semijoin-like synchronization operator
- Can be used to align pairs of streams
- Form: Resample(F, Size s, Left Assuming O_1, Right Assuming O_2)(S_1, S_2)
  - F: "window function" over S_1
  - s: integer
  - O_1: order specification on some numeric or time-based attribute of S_1 (A)
  - O_2: order specification on some numeric or time-based attribute of S_2 (B)
- For every tuple t from S_1, the output tuple is: (B_1: u.B_1, ..., B_m: u.B_m, A: t.A) ++ F(W(t))
  - W(t) = {u ∈ S_2 | u in order wrt O_2 in S_2 ∧ |t.A - u.B| ≤ s}
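The window W(t) and its use for interpolation can be sketched as follows. This is a simplified reading: the GroupBy attributes B_1, ..., B_m are omitted, tuples with an empty window produce no output, and the names `resample` and `"F"` are hypothetical.

```python
def resample(f, size, s1, s2, left_on="A", right_on="B"):
    """For each t in S1, apply F to the S2 tuples within `size` of t.A."""
    out = []
    for t in s1:
        window = [u for u in s2 if abs(t[left_on] - u[right_on]) <= size]
        if window:
            out.append({left_on: t[left_on], "F": f(window)})
    return out


ticks = [{"A": 10}, {"A": 20}]
samples = [{"B": 9, "V": 1.0}, {"B": 11, "V": 3.0}, {"B": 19, "V": 5.0}]
aligned = resample(
    lambda w: sum(u["V"] for u in w) / len(w), 1, ticks, samples
)
```

Averaging the neighbouring samples is one simple choice of interpolation function; it yields a value of S_2 aligned to each point of S_1.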

26 Resample

27 Run-time architecture
[Figure: run-time architecture; components: inputs, router, queues Q1, Q2, ..., Qi, ..., Qj, ..., Qn, scheduler, box processors, QoS monitor, load shedder, storage manager, buffer manager, persistent storage, outputs]

28 Quality of Service (QoS)
- QoS, in general, is a multidimensional function of several attributes of an Aurora system:
  - Response times (production of output tuples)
  - Tuple drops
  - Values produced (importance of produced values)
- The administrator specifies QoS graphs for each output based on one or more of these functions
- Other types of QoS functions can be defined too

29 QoS graphs
- Graphs are expected to be normalized
- Graphs should allow a properly sized network to operate with all outputs in a "good zone"
- Graphs should be convex (the value-based graph is an exception)
[Figure: three QoS graphs, each with a vertical axis from 0 to 1: one over delay (with the good zone marked), one over % tuples delivered, one over output value]
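A normalized QoS graph can be modelled as a piecewise-linear function from a measured quantity to a utility between 0 and 1. The sketch below is illustrative only: the function name `qos_utility` and the breakpoints of the delay graph are made-up values, and the breakpoints are assumed to have strictly increasing x coordinates.

```python
def qos_utility(points, x):
    """Linearly interpolate a QoS graph given as sorted (x, utility) points."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical delay-based graph: full utility up to 100 ms (the good zone),
# then falling linearly to zero at 300 ms.
delay_graph = [(0, 1.0), (100, 1.0), (300, 0.0)]
in_good_zone = qos_utility(delay_graph, 50)
degraded = qos_utility(delay_graph, 200)
```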

30 Aurora Storage Manager (ASM): queue management
- There is one queue at the output of each box; this queue is shared by all successor boxes
- Queues are stored in memory and on disk
- Queues may change length
[Figure: queue organization over time for boxes b1 and b2, showing processed tuples]

31 Scheduling in Aurora
- The scheduler (and Aurora) aims to reduce overall tuple execution cost
- It exploits two nonlinearities in tuple processing
- Interbox nonlinearity:
  - Minimize tuple thrashing (if buffer space is not sufficient, tuples have to be shuttled between memory and disk)
  - Avoid copying data from output to buffer (the ASM can be bypassed when one box is scheduled right after another)
- Intrabox nonlinearity:
  - The cost of tuple processing may decrease as the number of available tuples in the queue increases

32 Scheduling in Aurora
- Aurora's approach: (1) have boxes queue as many tuples as possible, (2) process them at once (train scheduling), and (3) pass them to subsequent boxes without going to disk (superbox scheduling)
- Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple
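The effect of trains on the number of box calls can be illustrated with a toy model. The `Box` class and `run_chain` function below are hypothetical and ignore disk I/O and scheduling overhead entirely; they only count how often each box is invoked when tuples are pushed through a chain of boxes in batches.

```python
class Box:
    """Toy box that applies a function to a batch and counts its invocations."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def run(self, batch):
        self.calls += 1
        return [self.fn(t) for t in batch]


def run_chain(boxes, tuples, train_size):
    """Push trains of tuples through consecutive boxes, superbox style."""
    out = []
    for start in range(0, len(tuples), train_size):
        batch = tuples[start:start + train_size]
        for box in boxes:  # the train stays in memory between boxes
            batch = box.run(batch)
        out.extend(batch)
    return out


single = [Box(lambda v: v + 1), Box(lambda v: v * 2)]
trains = [Box(lambda v: v + 1), Box(lambda v: v * 2)]
out_single = run_chain(single, list(range(6)), train_size=1)
out_trains = run_chain(trains, list(range(6)), train_size=3)
```

Both runs produce the same output, but tuple-at-a-time execution calls each box six times while trains of three call each box only twice.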

33 Scheduler performance
[Figure: bar chart of time (ms), axis from 0 to 300, showing execution costs and scheduling overhead for three strategies: tuple at a time, trains, superboxes]

34 Priority assignment in the scheduler
- The latency of each output tuple is the sum of the tuple's processing delay and its waiting delay (the latter is primarily a function of scheduling)
- The goal of the scheduler: assign priorities to box outputs so as to maximize the overall QoS
- The scheduler's approach has two aspects:
  - State-based analysis, which assigns priorities to outputs and picks for scheduling the output with the highest utility
  - Feedback-based analysis, which observes the overall system and increases the priorities of outputs that are not doing well (based on the QoS graph)

35 Load shedding
- A reaction to overload
- Drop is a system-level operator that randomly drops tuples from a stream at a specified rate
- Two approaches:
  1. Load shedding by dropping tuples
  2. Load shedding by filtering tuples

36 Load shedding by dropping tuples
- Reduces the amount of Aurora processing by dropping randomly selected tuples at strategic points in the network

37 Load shedding by filtering tuples
- Idea: remove less important tuples rather than randomly chosen ones
- It uses value-based QoS information
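The two shedding strategies can be contrasted in a short sketch. The function names, the utility function, and the threshold below are assumptions chosen for illustration; how Aurora picks drop locations and rates is not shown.

```python
import random


def drop_box(stream, drop_rate, rng):
    """Randomly drop tuples at the given rate (0.0 keeps all, 1.0 drops all)."""
    return [t for t in stream if rng.random() >= drop_rate]


def value_filter(stream, utility, threshold):
    """Keep only tuples whose value-based utility reaches the threshold."""
    return [t for t in stream if utility(t) >= threshold]


rng = random.Random(0)  # seeded only to make the example repeatable
readings = [{"TS": i, "A": v} for i, v in enumerate([1, 9, 5, 2, 8])]
sampled = drop_box(readings, 0.5, rng)
important = value_filter(readings, lambda t: t["A"] / 10, 0.5)
```

Random dropping treats all tuples alike, whereas the value-based filter sheds the tuples that a value-based QoS graph would rate as least useful.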


39 Question 1: Which of the following operators output tuples that have the same schema and values as input tuples?
a. Aggregate
b. BSort (x)
c. Filter (x)
d. Join
e. Map
f. Resample
g. Union (x)

40 Question 2: What does Aurora's primary run-time architecture include?
a. Router
b. Storage manager (x)
c. Scheduler (x)
d. Box processor
e. QoS monitor (x)
f. Resample
g. Load shedder (x)

41 Three broad application types
- Aurora addresses three broad application types in a single, unique framework:
  1. Real-time monitoring applications continuously monitor the present state of the world and are, thus, interested in the most current data as it arrives from the environment. In these applications, there is little or no need (or time) to store such data.
  2. Archival applications are typically interested in the past. They are primarily concerned with processing large amounts of finite data stored in a time-series repository.
  3. Spanning applications involve both the present and past states of the world, requiring combining and comparing incoming live data and stored historical data. These applications are the most demanding, as there is a need to balance real-time requirements with efficient processing of large amounts of disk-resident data.

