
Aurora – A New Model and Architecture for Data Stream Management
Group 19: Chu Xuân Tình, Trần Nhật Tuấn, Huỳnh Thái Tâm
Lecturer: Associate Professor Dr.techn. Dang Tran Khanh

Outline
- Introduction
- The Aurora stream query algebra
- Run-time architecture

Aurora – system architecture
- Aurora: a new model and architecture for data stream management, a new system to manage data streams for monitoring applications.
- The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires rethinking the fundamental architecture of a DBMS for this application area.
- Aurora is a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T.

Currently used DB systems
- Classical DBMS:
  - A passive repository storing data (HADP – human-active, DBMS-passive model)
  - Only the current state of the data is important
  - Data are synchronized; queries have exact answers (no support for approximation)
- Monitoring applications are difficult to implement in a traditional DBMS:
  - First, the basic computation model is wrong: DBMSs follow a HADP model, while monitoring applications often require a DAHP model
  - Triggers and alerters are second-class citizens
  - Getting the required data out of historical time series is problematic
  - Developing dedicated middleware is expensive
- Conclusion: these systems are ill suited for applications that must alert humans when an abnormal situation occurs (a DAHP model – DBMS-active, human-passive – is expected)

Aurora – main assumptions
- Data comes from various, uniquely identified data sources (data streams)
- Each incoming tuple is timestamped
- Aurora is expected to process the incoming streams
- Tuples are transferred through a loop-free, directed graph of processing boxes
- Outputs from the system are presented to applications
- Aurora also maintains historical storage


Aurora system overview
- Any box can filter a stream (select operation)
- A box can compute stream aggregates, applying an aggregate function across a window of values in the stream
- The output of any box can be an input for several other boxes (split operation)
- Each box can gather tuples from many inputs (union operation)

Aurora query model
[Diagram: a query network of boxes b1–b7 with connection points, storage units S1–S3, and three output paths – a continuous query, a view, and an ad-hoc query – each with a QoS spec; one connection point keeps data for 2 hr]
- Each connection point (CP) and view should have a persistence specification (e.g. "keep data for 2 hr")
- Each output is associated with a QoS specification (this helps to allocate the processing elements along the path)

Queries in Aurora
- Continuous queries
  - The query continuously processes tuples
  - Output tuples are delivered to an application
- Ad-hoc queries
  - The system processes data and delivers answers from the earliest time stored in the connection point
  - The semantics are the same as a continuous query that started execution at t_now − (persistence specification)
  - The query continues until explicitly terminated
- Views
  - Similar to materialized or partially materialized views in classical DB systems
  - An application may connect to the end of this path whenever there is a need

Queries in Aurora – connection points
- Support dynamic modification of the network
- Support data caching (persistence specification), which is helpful for ad-hoc queries
- A connection point with no upstream stream can be used as a stored data set (as in a classical DBMS)
- Tuples from a connection point can be pushed through the system (e.g. when the connection point is "materialized" and the stored tuples are passed as a stream to the downstream nodes)
- Alternatively, a downstream node can pull the data (helpful when executing filtering or joining operations)

Application Domains
- Online Auctions
- Network Traffic Management
- Habitat Monitoring
- Military Logistics
- Immersive Environments
- Road Traffic Monitoring
- System Monitoring

SQuAl
- The Aurora [S]tream [Qu]ery [Al]gebra
- 7 operators:
  - Order-agnostic (Filter, Map, Union)
  - Order-sensitive (BSort, Aggregate, Join, Resample)
- Model:
  - A stream is an append-only sequence of tuples with a uniform type
  - A stream type has the form (TS, A1, …, An)
  - Stream tuples have the form (ts, v1, …, vn), where the Ai are application-specific data fields and ts is a timestamp

Order-agnostic operators
- Input tuples have the form t = (TS = ts, A1 = v1, …, Ak = vk)
- 3 operators:
  - Filter: similar to relational selection; filters on multiple predicates and routes tuples according to which predicates they satisfy
  - Map: similar to relational projection; applies arbitrary functions to tuples (including user-defined functions)
  - Union: merges 2 or more streams with a common schema

Filter
- Acts much like a case statement
- Can be used to route input tuples to alternative streams
- Form: Filter(P1, …, Pm)(S), where the Pi are predicates over tuples on the input stream S
- Its output consists of m + 1 streams: one per predicate, plus one for the tuples that satisfy none of them
- Output tuples have the same schema and values as the input tuples, including their QoS timestamps
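As a concrete illustration, here is a minimal Python sketch of a Filter box under simplifying assumptions (tuples as plain dicts, predicates as callables, finite lists standing in for streams); the names and fields are illustrative, not Aurora's actual API.

```python
# Minimal sketch of an Aurora-style Filter box (assumed API, not Aurora's code).

def filter_box(predicates, stream):
    """Route each tuple to the output of the first predicate it satisfies.

    Returns len(predicates) + 1 output streams; the last one collects tuples
    that satisfy none of the predicates. Tuples pass through unchanged, so
    schema, values, and the QoS timestamp are preserved.
    """
    outputs = [[] for _ in range(len(predicates) + 1)]
    for t in stream:
        for i, p in enumerate(predicates):
            if p(t):
                outputs[i].append(t)
                break
        else:  # no predicate matched
            outputs[-1].append(t)
    return outputs

# Example: split a sensor stream into "hot", "warm", and "everything else".
stream = [{"TS": 1, "temp": 41}, {"TS": 2, "temp": 25}, {"TS": 3, "temp": 33}]
hot, warm, rest = filter_box([lambda t: t["temp"] > 40,
                              lambda t: t["temp"] > 30], stream)
```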

Map
- Is a generalized projection operator
- Form: Map(B1 = F1, …, Bm = Fm)(S), where Bi is the name of an output attribute and Fi is a function over tuples of the input stream S
- The output tuple for each input tuple t has the form (TS = t.TS, B1 = F1(t), …, Bm = Fm(t))
- The resulting stream can have a different schema than the input stream, but the timestamps of input tuples are preserved in the corresponding output tuples
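A minimal sketch of a Map box under the same simplifying assumptions (dict tuples, illustrative attribute names and functions):

```python
# Minimal sketch of an Aurora-style Map box (generalized projection).

def map_box(output_fns, stream):
    """For each input tuple t, emit (TS = t['TS'], B1 = F1(t), ..., Bm = Fm(t)).

    output_fns maps output attribute names to functions of the whole tuple,
    so the output schema may differ from the input schema while the
    timestamp is carried over unchanged.
    """
    for t in stream:
        out = {"TS": t["TS"]}
        for name, fn in output_fns.items():
            out[name] = fn(t)
        yield out

# Example: convert Celsius readings to Fahrenheit and keep the sensor id.
readings = [{"TS": 1, "sensor": "s7", "temp_c": 20.0}]
converted = list(map_box({"sensor": lambda t: t["sensor"],
                          "temp_f": lambda t: t["temp_c"] * 9 / 5 + 32},
                         readings))
```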

Union
- Is used to merge 2 or more streams into a single output stream
- Form: Union(S1, …, Sn), where the Si are streams with a common schema
- Union can output tuples in any order
- Output tuples have the same schema and values as the input tuples, including their QoS timestamps
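A minimal sketch of a Union box, again with finite lists standing in for streams (an assumption for illustration only):

```python
# Minimal sketch of an Aurora-style Union box: interleave any number of
# streams with a common schema into one stream, in no guaranteed order.
from itertools import chain

def union_box(*streams):
    """Concatenate the input streams; tuples pass through unchanged."""
    return chain(*streams)

merged = list(union_box([{"TS": 1, "v": 10}], [{"TS": 2, "v": 20}]))
```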

Order-sensitive operators
- Require order specification arguments
- An order specification describes the tuple arrival order the operator expects
- Order specifications have the form Order(On A, Slack n, GroupBy B1, …, Bm), where A and the Bi are attributes and n is a non-negative integer
- 4 operators:
  - BSort: an approximate sort operator with semantics equivalent to a bounded-pass bubble sort
  - Aggregate: applies a window function to sliding windows over its input stream
  - Join: a binary operator that resembles a band join applied to infinite streams
  - Resample: an interpolation operator used to align streams

BSort
- Is an approximate sort operator
- Form: BSort(Assuming O)(S), where O = Order(On A, Slack n, GroupBy B1, …, Bm) is a specification of the assumed ordering over the output stream
- Performs a buffer-based approximate sort
- Equivalent to n passes of a bubble sort
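A minimal sketch of a buffer-based approximate sort in the spirit of BSort; holding slack + 1 tuples and always releasing the current minimum is one common way to realize the bounded-pass bubble-sort semantics, and the heap, field names, and single-attribute ordering here are simplifying assumptions.

```python
# Minimal sketch of a BSort-like buffer-based approximate sort.
import heapq

def bsort_box(stream, on, slack):
    """Approximately sort `stream` on attribute `on` with the given slack."""
    buf = []  # heap keyed by the sort attribute
    for t in stream:
        heapq.heappush(buf, (t[on], id(t), t))
        if len(buf) > slack + 1:          # buffer full: release the minimum
            yield heapq.heappop(buf)[2]
    while buf:                            # flush the remaining buffer
        yield heapq.heappop(buf)[2]

# Example: a stream that is "almost" sorted on TS.
out = list(bsort_box([{"TS": 2}, {"TS": 1}, {"TS": 4}, {"TS": 3}], "TS", 1))
```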

BSort

Aggregate
- Applies "window functions" to sliding windows over its input stream
- Form: Aggregate(F, Assuming O, Size s, Advance i)(S)
  - F: "window function" (an SQL-style aggregate operation or a Postgres-style user-defined function)
  - O = Order(On A, Slack n, GroupBy B1, …, Bm): an order specification over the input stream S
  - s: size of the window (measured in terms of values of A)
  - i: an integer or predicate that specifies how far the window advances when it slides
- Output tuples have the form (TS = ts, A = a, B1 = u1, …, Bm = um) ++ (F(W))
  - W: the "window" of tuples from the input stream with values of A between a and a + s − 1
  - ts: the smallest timestamp associated with the tuples in W
  - ++: denotes concatenation of two tuples
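A minimal sketch of the sliding-window idea, assuming the stream is already ordered on A and ignoring grouping, slack, and timeouts; the function and field names are illustrative.

```python
# Minimal sketch of an Aurora-style Aggregate box (no GroupBy, Slack, Timeout).

def aggregate_box(F, on, size, advance, stream):
    """Emit one output per window [a, a + size) of the attribute `on`.

    Each output carries the window's opening value of `on`, the smallest
    timestamp in the window, and the window function's result.
    """
    tuples = list(stream)
    if not tuples:
        return
    a = tuples[0][on]
    last = tuples[-1][on]
    while a <= last:
        window = [t for t in tuples if a <= t[on] < a + size]
        if window:
            yield {"TS": min(t["TS"] for t in window), on: a, "agg": F(window)}
        a += advance

# Example: average temperature over 1-hour windows that advance by 1 hour.
readings = [{"TS": 0, "hour": 0, "temp": 20}, {"TS": 1, "hour": 1, "temp": 24},
            {"TS": 2, "hour": 1, "temp": 26}]
hourly = list(aggregate_box(lambda w: sum(t["temp"] for t in w) / len(w),
                            "hour", 1, 1, readings))
```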

Aggregate

Aggregate (cont.)
- Slack = 1 or more tolerates slightly out-of-order input
- Blocking: waiting for lost or late tuples to arrive in order to finish window calculations
- An optional Timeout argument limits blocking: Aggregate(F, Assuming O, Size s, Advance i, Timeout t)

Join
- Is a binary join operator
- Form: Join(P, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
  - P: predicate over pairs of tuples from the input streams S1 and S2
  - s: integer
  - O1: order specification on some numeric or time-based attribute A of S1
  - O2: order specification on some numeric or time-based attribute B of S2
- For every in-order tuple t in S1 and u in S2, the concatenation of t and u (t ++ u) is output if |t.A − u.B| ≤ s and P holds of t and u
- The QoS timestamp of the output tuple is the minimum timestamp of t and u
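A minimal sketch of the band-join semantics over two finite prefixes; a real stream implementation would keep bounded buffers and purge tuples that can no longer join, whereas this naive nested loop (with assumed dict tuples and key prefixes) only illustrates the output condition.

```python
# Minimal sketch of an Aurora-style band Join:
# output t ++ u when |t[A] - u[B]| <= s and P(t, u) holds.

def join_box(P, s, A, B, s1, s2):
    for t in s1:
        for u in s2:
            if abs(t[A] - u[B]) <= s and P(t, u):
                out = {**{f"l_{k}": v for k, v in t.items()},
                       **{f"r_{k}": v for k, v in u.items()}}
                out["TS"] = min(t["TS"], u["TS"])   # QoS timestamp
                yield out

# Example: match readings from two sensors taken at most 2 time units apart.
left = [{"TS": 1, "t": 10}]
right = [{"TS": 2, "t": 11}, {"TS": 9, "t": 30}]
pairs = list(join_box(lambda t, u: True, 2, "t", "t", left, right))
```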

Join

Resample
- Is an asymmetric, semijoin-like synchronization operator
- Can be used to align pairs of streams
- Form: Resample(F, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
  - F: "window function" over windows of tuples from S2
  - s: integer
  - O1: order specification on some numeric or time-based attribute A of S1
  - O2: order specification on some numeric or time-based attribute B of S2
- For every tuple t from S1, output the tuple (A = t.A) ++ (F(W(t)))
  - W(t) = {u ∈ S2 | u is in order wrt O2 in S2 ∧ |t.A − u.B| ≤ s}
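A minimal sketch of the alignment idea: for every tuple of S1, interpolate a value from the S2 tuples whose B attribute lies within s of t's A attribute. Finite lists, dict tuples, and the averaging window function are assumptions for illustration.

```python
# Minimal sketch of an Aurora-style Resample box.

def resample_box(F, s, A, B, s1, s2):
    for t in s1:
        window = [u for u in s2 if abs(t[A] - u[B]) <= s]
        if window:                       # emit t's position plus F(window)
            yield {A: t[A], **F(window)}

# Example: estimate sensor 2's value at each of sensor 1's timestamps by
# averaging nearby readings.
s1 = [{"time": 10}, {"time": 20}]
s2 = [{"time": 9, "v": 1.0}, {"time": 11, "v": 3.0}, {"time": 19, "v": 5.0}]
aligned = list(resample_box(
    lambda w: {"v": sum(u["v"] for u in w) / len(w)}, 2, "time", "time", s1, s2))
# -> [{'time': 10, 'v': 2.0}, {'time': 20, 'v': 5.0}]
```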

Resample

Run-time architecture
[Diagram: inputs enter through a router into the box processors; the storage manager (buffer manager, queues Q1 … Qn, persistent storage), scheduler, QoS monitor, and load shedder cooperate to produce the outputs]

Quality of Service – QoS
- QoS, in general, is a multidimensional function of several attributes of an Aurora system:
  - Response times (production of output tuples)
  - Tuple drops
  - Values produced (the importance of the produced values)
- The administrator specifies QoS graphs for each output based on one or more of the functions above
- Other types of QoS functions can be defined too

QoS graphs
- Graphs are expected to be normalized
- Graphs should allow a properly sized network to operate with all outputs in a 'good zone'
- Graphs should be convex (the value-based graph is an exception)
[Diagram: three normalized QoS graphs plotting QoS (0–1) against delay, % tuples delivered, and output value, with the 'good zone' marked]
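A minimal sketch of how such a normalized, piecewise-linear QoS graph could be evaluated, e.g. a latency-based graph mapping output delay to a utility in [0, 1]; the breakpoints below are made up for illustration.

```python
# Minimal sketch of evaluating a piecewise-linear QoS graph.

def qos(graph, x):
    """Linearly interpolate utility at x from (x, utility) breakpoints."""
    pts = sorted(graph)
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, u0), (x1, u1) in zip(pts, pts[1:]):
        if x <= x1:
            return u0 + (u1 - u0) * (x - x0) / (x1 - x0)
    return pts[-1][1]

# Delay-based graph: full utility up to 2 s, dropping to 0 at 10 s.
delay_graph = [(0.0, 1.0), (2.0, 1.0), (10.0, 0.0)]
print(qos(delay_graph, 4.0))   # -> 0.75
```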

Aurora Storage Manager (ASM) – queue management
- There is one queue at the output of each box; this queue is shared by all successor boxes
- Queues are stored in memory and on disk
- Queues may change length
[Diagram: queue organization over time for boxes b1 and b2, showing already-processed tuples]

Scheduling in Aurora
- The scheduler (and Aurora as a whole) aims to reduce the overall tuple execution cost
- It exploits two nonlinearities in tuple processing:
  - Interbox nonlinearity:
    - Minimize tuple thrashing (if buffer space is insufficient, tuples have to be shuttled between memory and disk)
    - Avoid copying data from an output to a buffer (ASM can be bypassed when one box is scheduled right after another)
  - Intrabox nonlinearity: the cost of tuple processing may decrease as the number of tuples available in the queue increases

Scheduling in Aurora
- Aurora's approach: (1) have box queues accumulate as many tuples as possible, (2) process them all at once – train scheduling, and (3) pass them to subsequent boxes without going to disk – superbox scheduling
- Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple
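A minimal sketch of the train-scheduling idea: instead of handing a box one tuple at a time, the scheduler picks a box with queued input and lets it consume its whole queue in a single call, amortizing the per-call overhead. The Box class, the longest-queue policy, and the in-memory hand-off are illustrative assumptions, not Aurora's actual scheduler.

```python
# Minimal sketch of train scheduling with a naive longest-queue-first policy.
from collections import deque

class Box:
    def __init__(self, fn, downstream=None):
        self.fn, self.downstream, self.queue = fn, downstream, deque()

    def run_train(self):
        """Process every queued tuple ("tuple train") in one scheduling step."""
        train, self.queue = list(self.queue), deque()
        for t in train:
            out = self.fn(t)
            if out is not None and self.downstream is not None:
                self.downstream.queue.append(out)   # hand off in memory

def schedule(boxes):
    """Naive policy: always run the box with the longest input queue."""
    while any(b.queue for b in boxes):
        max(boxes, key=lambda b: len(b.queue)).run_train()

# Example: filter then map, fed with a small burst of tuples.
sink = Box(lambda t: print(t))
mapper = Box(lambda t: {"v2": t["v"] * 2}, sink)
src = Box(lambda t: t if t["v"] > 0 else None, mapper)
src.queue.extend([{"v": 1}, {"v": -1}, {"v": 3}])
schedule([src, mapper, sink])
```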

Scheduler performance
[Chart: execution cost and scheduling overhead (time in ms) for tuple-at-a-time, train, and superbox scheduling]

Priority assignment in the Scheduler
- The latency of each output tuple is the sum of the tuple's processing delay and its waiting delay (the latter is primarily a function of scheduling)
- The scheduler's goal: assign priorities to box outputs so as to maximize the overall QoS
- The scheduler's approach has two aspects:
  - a state-based analysis that assigns priorities to outputs and picks for scheduling the output with the highest utility
  - a feedback-based analysis that observes the overall system and increases the priorities of outputs that are not doing well (based on the QoS graphs)

Load shedding
- A reaction to overload
- Drop is a system-level operator that drops randomly selected tuples from a stream at a specified rate
- Two techniques:
  1. Load shedding by dropping tuples
  2. Load shedding by filtering tuples

Load shedding by dropping tuples
- Reduces the amount of Aurora processing by dropping randomly selected tuples at strategic points in the network

Load shedding by filtering tuples
- Idea: remove the less important tuples rather than randomly chosen ones
- It uses value-based QoS information
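A minimal sketch of the two load-shedding flavors: a random drop that sheds tuples at a given rate, and a semantic drop that filters out the tuples judged least important by a value-based QoS function. The function names, the utility callable, and the threshold are assumptions for illustration; Aurora inserts such drop boxes at strategic points in the network.

```python
# Minimal sketch of random vs. semantic (filter-based) load shedding.
import random

def random_drop(stream, drop_rate):
    """Forward each tuple with probability 1 - drop_rate."""
    for t in stream:
        if random.random() >= drop_rate:
            yield t

def semantic_drop(stream, value_qos, threshold):
    """Forward only tuples whose value-based utility reaches the threshold."""
    for t in stream:
        if value_qos(t) >= threshold:
            yield t

# Example: shed 30% of tuples at random, or keep only high-utility readings.
readings = [{"TS": i, "temp": 20 + i} for i in range(10)]
kept_random = list(random_drop(readings, 0.3))
kept_semantic = list(semantic_drop(readings,
                                   lambda t: 1.0 if t["temp"] > 25 else 0.2,
                                   0.5))
```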

Question 1: Which of the following operators output tuples that have the same schema and values as the input tuples?
a. Aggregate
b. BSort (x)
c. Filter (x)
d. Join
e. Map
f. Resample
g. Union (x)

Question 2: What does Aurora's primary run-time architecture include?
a. Router
b. Storage manager (x)
c. Scheduler (x)
d. Box processor
e. QoS monitor (x)
f. Resample
g. Load shedder (x)

Three broad application types
- Aurora addresses three broad application types in a single, unique framework:
1. Real-time monitoring applications continuously monitor the present state of the world and are thus interested in the most current data as it arrives from the environment. In these applications there is little or no need (or time) to store such data.
2. Archival applications are typically interested in the past. They are primarily concerned with processing large amounts of finite data stored in a time-series repository.
3. Spanning applications involve both the present and past states of the world, requiring incoming live data to be combined and compared with stored historical data. These applications are the most demanding, as they must balance real-time requirements with efficient processing of large amounts of disk-resident data.