Analysis of: Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom
Agenda Overview of Stream Processing Aurora Project Goals Aurora Processing Example Aurora Architecture Multi-Thread Vs. Single-Thread processing Important Definitions Superbox Scheduling and Processing Tuple Batching Experimental Evaluation Quality of Service (QoS) Scheduling QoS Scheduling Scalability Related Work
Overview of Stream Processing Stream Processing is the processing of potentially unbounded, continuous streams of data Data streams are created via micro-sensors, GPS devices, monitoring devices Examples include: soldier location tracking, traffic sensors, stock market exchanges, heart monitors Data may be received evenly or in bursts
Aurora Project Goals To build a data stream manager that addresses the performance and processing requirements of stream-based applications To support multiple concurrent continuous queries on one or more application data streams To use Quality-of-Service (QoS) based criteria to make resource allocation decisions
Aurora Processing Example [Diagram: input data streams flow through a network of operator boxes (continuous and ad hoc queries), with access to historical storage, producing output to applications]
Multi-Thread vs. Single-Thread Processing Multi-Thread Processing –Each query is processed in its own thread –The operating system manages resource allocation –Advantages Processing can take advantage of efficient operating system algorithms Easier to program –Disadvantages Software has limited control of resource management Additional overhead due to cache misses, lock contention and context switching
Multi-Thread vs. Single-Thread Processing Single-Thread Processing –All operations are processed within a single thread –All resource allocation decisions are made by the scheduler –Advantages Allows processing to be scheduled based on latency and other Quality of Service factors tied to the needs of each query Avoids the limitations of multi-thread processing –Disadvantages More complex to program Aurora has chosen to implement a single-threaded scheduling model
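The single-threaded model above can be sketched as a user-space scheduler loop. This is a minimal toy, not Aurora's actual implementation: the `Operator` class, the queue-length policy, and all names are illustrative assumptions.

```python
from collections import deque

class Operator:
    """Toy operator: applies fn to each queued tuple (hypothetical class)."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.queue = deque()

    def run(self):
        # One operator call drains the whole input queue.
        out = [self.fn(t) for t in self.queue]
        self.queue.clear()
        return out

def scheduler_loop(operators, max_iterations=100):
    """Single thread: the application scheduler, not the OS, decides
    which operator runs next. Policy here (longest queue first) is a
    stand-in for Aurora's QoS-driven decisions."""
    results = []
    for _ in range(max_iterations):
        ready = [op for op in operators if op.queue]
        if not ready:
            break
        op = max(ready, key=lambda o: len(o.queue))
        results.extend(op.run())
    return results

double = Operator("double", lambda t: t * 2)
double.queue.extend([1, 2, 3])
print(scheduler_loop([double]))  # [2, 4, 6]
```

Because every scheduling decision happens inside this loop, latency-based policies can be implemented directly, at the cost of writing the scheduler yourself.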
Important Definitions Quality of Service (QoS) – Specific requirements that represent the needs of a specific query. In Aurora, the primary QoS factor is latency Query Tree – The set of operators (boxes) and data streams that represent a query. Superbox – A sequence of operators that are scheduled and executed as an atomic group. Aurora treats each query as a separate superbox. Two-Level Scheduling – Scheduling is done at two levels: first at the superbox level (deciding which superbox to process), and second at the operator level (deciding the order in which to execute the operators within the selected superbox).
Important Definitions (Cont.) Scheduling Plan – The combination of dynamic superbox scheduling and algorithm-based operator execution order within the superbox is called a scheduling plan. Application-at-a-time (AAAT) is a term used in Aurora that statically defines each query (application) as a superbox Box-at-a-time (BAAT) refers to scheduling at the box level rather than the superbox level Static and dynamic scheduling approaches – Static approaches to scheduling are defined prior to runtime. Dynamic scheduling approaches use runtime information and statistics to adjust and prioritize scheduling order during execution Traversing a superbox – This refers to the order in which the operators within a superbox are scheduled and executed
Superbox Traversal Superbox traversal refers to the order in which the operators within a superbox are executed Min-Cost (MC) – Attempts to optimize per-output-tuple processing costs by minimizing the number of operator calls per output tuple Min-Latency (ML) – Attempts to produce initial output tuples as soon as possible Min-Memory (MM) – Attempts to minimize memory usage
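The MC/ML trade-off can be made concrete by counting operator calls on a toy superbox. Assume (a simplification) the superbox is a simple chain of operators and each call has fixed overhead; the function names and the chain are illustrative, not from the paper.

```python
def min_cost_calls(chain, n_tuples):
    """Min-Cost (MC): run each operator exactly once over its full queue,
    so the number of calls is independent of the number of tuples."""
    return len(chain)

def min_latency_calls(chain, n_tuples):
    """Min-Latency (ML): push each tuple through the whole chain so the
    first output appears as early as possible -- more calls overall."""
    return len(chain) * n_tuples

chain = ["filter", "map", "aggregate"]
print(min_cost_calls(chain, 4))     # 3 calls total
print(min_latency_calls(chain, 4))  # 12 calls, but first output after only 3
```

With per-call overhead, MC minimizes total cost per output tuple, while ML pays extra calls to get the first result out sooner.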
Tuple Batching (Train Processing) A Tuple Train is a batch of tuples executed within a single operator call. The goal of Tuple Train processing is to reduce the overall processing cost per tuple Advantages of Tuple Train processing are: –Decreased number of total operator executions –Cuts down on low-level overhead such as context switching, scheduling, memory management and execution queue maintenance –Some windowing and merge-join operators work more efficiently when tuples are batched
Experimental Evaluation Definitions Stream-based applications do not currently have a standardized benchmark Aurora modeled queries as a rooted tree structure from a stream input box to an application output box Trees are categorized based on depth and fan-out –Depth is the number of box levels from input to output –Fan-out is the average number of children of each box
Experimental Evaluation Results At low volumes, Round-Robin Box-At-A-Time (RR-BAAT) scheduling was almost as efficient as Minimum-Cost Application-At-A-Time (MC-AAAT), but it was much less efficient at higher volumes –At low volumes, the gains of MC-AAAT were offset by its more complex scheduling overhead –As volumes increased, the gains of MC-AAAT became more apparent as scheduling overhead became a smaller percentage of total processing Experimentation was also done to compare the ML, MC and MM scheduling techniques –As expected, each technique minimized its specified attribute (latency, cost and memory respectively) –However, at very low processing levels the simplest algorithms tended to do the best (but who cares :)
Quality of Service (QoS) Scheduling Definitions –Utility – how useful a tuple will be when it exits the query –Urgency – represented by the angle of the downward slope of the utility QoS parameter; in other words, how fast the utility deteriorates Approach –Keep track of the latency of tuples that reside in the queues and pick for processing the tuples whose execution will deliver the highest aggregate QoS to the applications.
Latency-Utility Relationship [Chart: Quality of Service (utility, from 1 down to 0) plotted against latency, with critical points where the slope changes] The older the data gets, the less it is worth, and the lower the quality of service. Aurora combines the QoS charts of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next. The idea is to maintain, on average, the highest quality of service.
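A latency-utility chart like the one above can be modeled as a piecewise-linear function defined by its critical points. The specific points below are invented for illustration; the helper name and interface are assumptions, not Aurora's API.

```python
def make_qos(critical_points):
    """Build a piecewise-linear latency -> utility function from
    (latency, utility) critical points, e.g. [(0, 1.0), (5, 1.0), (10, 0.0)].
    Beyond the last point, utility stays at its final value."""
    def utility(latency):
        pts = critical_points
        if latency <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if latency <= x1:
                # Linear interpolation between critical points;
                # the slope of this segment is the "urgency".
                return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
        return pts[-1][1]
    return utility

qos = make_qos([(0, 1.0), (5, 1.0), (10, 0.0)])
print(qos(2))    # 1.0  (fresh data, fully useful)
print(qos(7.5))  # 0.5  (halfway down the sloped segment)
print(qos(12))   # 0.0  (too old to be useful)
```

The steeper the segment a tuple currently sits on, the more urgent it is to process, which is exactly what the scheduler exploits.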
QoS Scheduling Scalability Problem –A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it Solution –Latency is not calculated at the tuple level; rather, it is calculated as the average latency of tuples in the box input queue –Priority is given based on the combination of utility and urgency –Once a box’s priority (its priority tuple or “p-tuple”) is calculated, the boxes are placed in logical buckets based on their priority value –Scheduling is then done based on the priority of the bucket –All boxes in a given bucket are considered equal
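The bucket scheme above might be sketched as follows. The priority formula (`utility * urgency`), the bucket count, and the box names are all assumptions made for the sketch; the paper combines utility and urgency but we gloss over the exact formula.

```python
def bucket_schedule(boxes, n_buckets=4, max_priority=10.0):
    """Place boxes into coarse priority buckets and schedule whole
    buckets, highest first; boxes within a bucket are treated as equals.
    Each box is a dict with per-queue average 'utility' and 'urgency'."""
    buckets = [[] for _ in range(n_buckets)]
    width = max_priority / n_buckets
    for box in boxes:
        p = box["utility"] * box["urgency"]          # stand-in "p-tuple"
        idx = min(int(p // width), n_buckets - 1)    # clamp to top bucket
        buckets[idx].append(box["name"])
    schedule = []
    for bucket in reversed(buckets):                 # high priority first
        schedule.extend(bucket)
    return schedule

boxes = [
    {"name": "join",   "utility": 0.9, "urgency": 8.0},  # p = 7.2
    {"name": "filter", "utility": 0.5, "urgency": 2.0},  # p = 1.0
    {"name": "agg",    "utility": 0.8, "urgency": 9.5},  # p = 7.6
]
print(bucket_schedule(boxes))  # ['agg', 'join', 'filter']
```

Bucketing trades fine-grained ordering for cheap maintenance: only the bucket index must be kept up to date, not a totally ordered queue of per-tuple priorities.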
Related Work Eddies – has a tuple-at-a-time scheduler providing adaptability, but does not scale well Urhan – works on rate-based pipeline scheduling of data between operators NiagaraCQ – query optimization for streaming data from wide-area information sources STREAM – provides comprehensive data stream management using chain scheduling algorithms Note that none of the above projects has a notion of QoS