Adaptive Online Scheduling in Storm Paper by Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni Presentation by Keshav Santhanam

Motivation

- Big data: 2.5 quintillion bytes of data generated per day [IBM]; volume, velocity, variety
- Need a complex event processing engine
- Represent data as a real-time flow of events and analyze it as quickly as possible

Storm

- Processing engine for high-throughput data streams
- Used by Groupon, Yahoo, Flipboard, etc.

Storm
- Topology: directed graph of spouts and bolts
- [Diagram: a spout (data source) emits tuples into a chain of bolts, which process them and emit output tuples]

Storm
- [Architecture diagram: the topology G(V, T), w and a scheduler plugin S feed Nimbus, which produces a deployment plan; each worker node runs a supervisor and offers slots that host worker processes, which in turn run executors]

Storm
- Grouping strategies:
- Shuffle grouping: the target task is chosen randomly, ensuring an even distribution of tuples
- Fields grouping: the tuple is forwarded to a task based on the content of the tuple, e.g. tuples whose key begins with A-I go to one task, J-R to another task, and so on
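As a rough illustration of how these groupings are declared when building a topology (SourceSpout, ParserBolt, CounterBolt and the field name "key" are hypothetical; package names vary across Storm versions, with older releases using backtype.storm):

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExample {
    // SourceSpout, ParserBolt and CounterBolt are placeholder component classes.
    static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", new SourceSpout(), 2);
        // Shuffle grouping: each tuple goes to a randomly chosen parser task,
        // which spreads the load evenly across tasks.
        builder.setBolt("parser", new ParserBolt(), 4).shuffleGrouping("source");
        // Fields grouping: tuples with the same "key" value always reach the
        // same counter task, so per-key state stays local to one task.
        builder.setBolt("counter", new CounterBolt(), 4)
               .fieldsGrouping("parser", new Fields("key"));
        return builder.createTopology();
    }
}
```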

Storm
- EvenScheduler: round-robin allocation strategy
- First phase: assigns executors to workers evenly
- Second phase: assigns workers to worker nodes evenly
- Problem: does not take network communication overhead into account
- Solution: identify the "hot edges" of the topology and map them to inter-process channels
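A minimal sketch of the round-robin idea in plain Java (not Storm's actual EvenScheduler code; executors and workers are just strings and bucket indices here):

```java
import java.util.ArrayList;
import java.util.List;

// Toy round-robin assignment in the spirit of Storm's EvenScheduler:
// executors are dealt out to workers, and workers to worker nodes, evenly.
public class RoundRobinSketch {
    static <T> List<List<T>> dealEvenly(List<T> items, int buckets) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < buckets; i++) out.add(new ArrayList<>());
        for (int i = 0; i < items.size(); i++) {
            out.get(i % buckets).add(items.get(i)); // i-th item goes to bucket i mod buckets
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> executors = List.of("e1", "e2", "e3", "e4", "e5");
        // Phase 1: deal executors across 2 workers; Phase 2 would deal workers across nodes the same way.
        System.out.println(dealEvenly(executors, 2)); // [[e1, e3, e5], [e2, e4]]
    }
}
```

Note that nothing in this assignment looks at how much traffic flows between executors, which is exactly the shortcoming the adaptive schedulers address.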

Adaptive Scheduling in Storm

Adaptive Schedulers
- Key idea: place executors that frequently communicate with each other in the same slot, thus reducing network traffic
- Offline scheduler: examines the topology before deployment and uses a heuristic to place the executors
- Online scheduler: analyzes network traffic at runtime and periodically re-computes a new schedule
- Assumptions: only acyclic topologies; an upper bound on the number of hops a tuple takes as it traverses the topology
- A parameter α ∈ [0, 1] controls the maximum number of executors in a single slot

Topology-based Scheduling

Offline Scheduler
1. Create a partial ordering of the components:
   - If component c_i emits tuples that are consumed by another component c_j, then c_i < c_j
   - If c_i < c_j and c_j < c_k, then c_i < c_k (transitivity)
   - There can be components c_i and c_j such that neither c_i < c_j nor c_j < c_i holds
2. Use the partial order to create a linearization φ:
   - If c_i < c_j, then c_i appears before c_j in φ
   - The first element of φ is a spout
3. Iterate over φ and, for each component c_i, place its executors in the slots that already contain executors of the components that directly emit tuples towards c_i (a sketch of this heuristic follows below)
4. Assign the slots to worker nodes in round-robin fashion
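A toy sketch of steps 3-4 under simplifying assumptions (one executor per component, three slots, and a fixed per-slot capacity standing in for the α bound); this illustrates the heuristic, not the authors' implementation:

```java
import java.util.*;

// Toy offline placement: each component is placed in the (non-full) slot that
// already holds most of the components that emit tuples to it, approximating
// the "co-locate communicating executors" heuristic. Component names, the
// slot capacity and the fallback rule are illustrative.
public class OfflineSketch {
    public static void main(String[] args) {
        // parents.get(c) = components that directly emit tuples to c.
        // The insertion order is already a valid linearization (parents first).
        Map<String, List<String>> parents = new LinkedHashMap<>();
        parents.put("C1", List.of());              // spout
        parents.put("C2", List.of());              // spout
        parents.put("C3", List.of("C1"));
        parents.put("C4", List.of("C3"));
        parents.put("C5", List.of("C2"));
        parents.put("C6", List.of("C4", "C5"));

        int numSlots = 3, maxPerSlot = 2;          // maxPerSlot stands in for the alpha bound
        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i < numSlots; i++) slots.add(new ArrayList<>());

        int rr = 0;                                // round-robin pointer for parentless components
        for (String c : parents.keySet()) {
            int best = -1, bestOverlap = 0;
            for (int s = 0; s < numSlots; s++) {
                if (slots.get(s).size() >= maxPerSlot) continue;
                int overlap = 0;
                for (String p : parents.get(c)) if (slots.get(s).contains(p)) overlap++;
                if (overlap > bestOverlap) { bestOverlap = overlap; best = s; }
            }
            if (best < 0) {                        // no co-location possible: fall back to round robin
                while (slots.get(rr % numSlots).size() >= maxPerSlot) rr++;
                best = rr++ % numSlots;
            }
            slots.get(best).add(c);
        }
        System.out.println(slots);                 // [[C1, C3], [C2, C5], [C4, C6]]
    }
}
```

Dealing the three resulting slots round-robin across two worker nodes then gives node 1 the {C1, C3} and {C4, C6} slots and node 2 the {C2, C5} slot, which matches the walkthrough below.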

Offline Scheduler
- Problem: a worker that receives no executor simply gets ignored
- Solution: use a tuning parameter β ∈ [0, 1] to force the scheduler to use its empty slots
- Use a higher β if traffic is expected to be heavier among upstream components

Offline Scheduler (example walkthrough)
- Example topology: spouts C1 and C2 (data sources) and bolts C3-C6, with C1 < C3, C3 < C4 < C6 and C2 < C5 < C6
- Linearization φ: C1 < C2 < C3 < C4 < C5 < C6
- [Animation: the executors are placed one at a time following φ: C1 into worker process 1, C2 into worker process 3, C3 next to C1, C4 into worker process 2, C5 next to C2, and C6 next to C4]
- Final placement: worker node 1 hosts worker process 1 {C1, C3} and worker process 2 {C4, C6}; worker node 2 hosts worker process 3 {C2, C5}

Traffic-based Scheduling

Online Scheduler
- Goal: dynamically adapt the schedule as the load on the nodes changes
- Must satisfy constraints on:
  1. the number of workers for each topology
  2. the number of slots available on each worker node
  3. the computational power available on each node

Storm Architecture with Online Scheduler
- [Architecture diagram: as before, the topology G(V, T), w and the scheduler plugin S feed Nimbus, which produces the deployment plan for the worker nodes (supervisors, slots, worker processes, executors); in addition, a performance log is fed back to the scheduler plugin so that the schedule can be re-computed at runtime]

Online Scheduler
Phase I: partition the executors among the workers
  1. Iterate over all pairs of communicating executors, most traffic first
  2. If neither executor has been assigned yet, assign both to the least loaded worker
  3. Otherwise, determine the best assignment among the executors' current workers and the least loaded worker
Phase II: allocate workers to the available slots
  1. Iterate over all pairs of communicating workers, most traffic first
  2. If neither worker has been assigned yet, assign both to the least loaded node
  3. Otherwise, determine the best assignment among the workers' current nodes and the least loaded node
A sketch of Phase I follows below.
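A toy version of Phase I under simplifying assumptions (the traffic values, a worker capacity of two executors, and load measured as executor count are all illustrative; the case where both executors are already placed is elided here, whereas the paper compares their current workers against the least loaded one):

```java
import java.util.*;

// Toy Phase I of the online heuristic: executor pairs are visited by
// descending measured traffic and greedily packed so that heavy pairs end up
// in the same worker.
public class OnlinePhase1Sketch {
    record Pair(String a, String b, int traffic) {}

    static final int NUM_WORKERS = 4, MAX_PER_WORKER = 2;

    public static void main(String[] args) {
        List<Pair> pairs = new ArrayList<>(List.of(
                new Pair("C5", "C6", 50), new Pair("C4", "C6", 40),
                new Pair("C1", "C4", 30), new Pair("C2", "C5", 20),
                new Pair("C1", "C3", 10)));
        pairs.sort((x, y) -> Integer.compare(y.traffic(), x.traffic())); // most traffic first

        Map<String, Integer> assignment = new LinkedHashMap<>();        // executor -> worker
        int[] load = new int[NUM_WORKERS];                              // executors per worker

        for (Pair p : pairs) {
            Integer wa = assignment.get(p.a()), wb = assignment.get(p.b());
            if (wa == null && wb == null) {
                int w = leastLoaded(load);                 // both unplaced: least loaded worker
                place(p.a(), w, assignment, load);
                place(p.b(), w, assignment, load);
            } else if (wa == null) {
                place(p.a(), pickFor(wb, load), assignment, load);  // try to join the partner
            } else if (wb == null) {
                place(p.b(), pickFor(wa, load), assignment, load);
            }
        }
        System.out.println(assignment); // {C5=0, C6=0, C4=1, C1=1, C2=2, C3=3}
    }

    // Prefer the partner's worker if it still has room, otherwise the least loaded one.
    static int pickFor(int partnerWorker, int[] load) {
        return load[partnerWorker] < MAX_PER_WORKER ? partnerWorker : leastLoaded(load);
    }

    static int leastLoaded(int[] load) {
        int best = 0;
        for (int i = 1; i < load.length; i++) if (load[i] < load[best]) best = i;
        return best;
    }

    static void place(String exec, int worker, Map<String, Integer> a, int[] load) {
        a.put(exec, worker);
        load[worker]++;
    }
}
```

Phase II applies the same greedy pattern one level up, treating workers as the items and worker nodes as the bins.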

Online Scheduler (example walkthrough)
- Pairs of communicating executors sorted by traffic: [(C5, C6), (C4, C6), (C1, C4), (C2, C5), (C1, C3)]
- Initial placement: worker process 1 {C1, C3}, worker process 2 {C4, C6}, worker process 3 {C2, C5}, worker process 4 empty
- [Animation, Phase I: the pairs are visited in order and heavily communicating executors are migrated towards the least loaded worker, ending with worker process 4 {C5, C6}, worker process 2 {C1, C4}, worker process 3 {C2}, worker process 1 {C3}]
- [Animation, Phase II: the most heavily communicating worker processes (4 and 2) are placed on worker node 1, and the remaining worker processes (3 and 1) on worker node 2]

Evaluation

Topologies
- General-case reference topology
- DEBS 2013* Grand Challenge dataset
Key metrics
- Average latency for an event to traverse the entire topology
- Average inter-node traffic at runtime
Cluster specifications: 8 worker nodes, each with 5 worker slots, Ubuntu, 2.8 GHz CPUs, 3 GB RAM, and 15 GB of disk storage
*The 7th ACM International Conference on Distributed Event-Based Systems

Evaluation: Reference Topology
- Each spout executor emits tuples at a fixed rate, and the average of these rates is exactly R
- Bolts forward the received value half the time and a different constant value the rest of the time
- [Diagram: a chain of N stages consisting of the spout, simple bolts, stateful bolts, and a final ack stage]

Evaluation
- Reference topology settings: 7 stages; replication factor of 4 for the spout, 3 for simple bolts, 2 for stateful bolts
- Each point represents the average latency over a window of 10 events

Evaluation Parameters: α = 0, β = 0.5, average data rate R = 100 tuples/s, variance V = 20%

Evaluation Parameters: α = 0, β = 0.5, average data rate R = 100 tuples/s, variance V = 20%

Evaluation Parameters: 5 stage topology, replication factor 5, R = 1000 tuples/s, variance V = 20%

Evaluation: 2013 DEBS Grand Challenge
- Sensors in soccer players' shoes emit position and speed data at 200 Hz
- Goal: maintain up-to-date statistics such as average speed, distance walked, etc.
Grand Challenge Topology
- A spout for the sensors (sensor)
- A bolt that computes instantaneous speed and receives tuples by shuffle grouping (speed)
- A bolt that maintains and updates statistics as tuples arrive from the speed bolt (analysis)

Evaluation
- [Topology diagram: sensor spout (×8) → speed bolt (×4) → analysis bolt (×2)]
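A hedged sketch of how this three-stage topology might be wired (SensorSpout, SpeedBolt, AnalysisBolt and the "playerId" field are hypothetical; the slides only give the replication factors and that the speed bolt receives tuples by shuffle grouping):

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Hypothetical wiring of the Grand Challenge topology described above.
// The parallelism hints follow the replication factors on the slide (8, 4, 2).
public class GrandChallengeTopology {
    static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor", new SensorSpout(), 8);
        // The speed bolt receives sensor readings by shuffle grouping.
        builder.setBolt("speed", new SpeedBolt(), 4).shuffleGrouping("sensor");
        // Statistics are per player, so a fields grouping on a player id is a
        // plausible choice here (the slide does not state which grouping is used).
        builder.setBolt("analysis", new AnalysisBolt(), 2)
               .fieldsGrouping("speed", new Fields("playerId"));
        return builder.createTopology();
    }
}
```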

Evaluation

Personal Thoughts

Pros:
- The key idea (scheduling so as to minimize network communication) can directly improve the average processing time of Storm topologies
- The offline algorithm is relatively simple and does not require significant architectural changes
- The online algorithm is conceptually simple to understand despite its length

Personal Thoughts
Cons:
- The authors did not prove the online algorithm's greedy heuristic correct
- The online algorithm does not consider the load due to I/O-bound operations or network communication with external systems
- The authors acknowledge that the online algorithm as presented ignores corner cases; what are those corner cases?

Questions?

Storm
Key constructs:
- Topology: directed graph representing a Storm application
- Tuple: ordered list of elements
- Stream: sequence of tuples
- Spout: node that serves as a source of tuples
- Bolt: node that processes tuples and produces an output stream

Nimbus
- Java process responsible for accepting a topology, deploying it across the cluster, and detecting failures
- Serves as the master to the supervisors
- Uses ZooKeeper for coordination
- Includes the scheduler, which deploys a topology in two phases:
  1. Assign executors to workers
  2. Assign workers to slots

Execution Terminology
- Component: node in the topology graph
- Task: instance of a component
- Worker node: physical machine in the Storm cluster
- Worker: Java process running on a worker node
- Supervisor: Java process running on a worker node that launches and monitors each worker
- Slot: space for a worker on a worker node
- Executor: thread running within a worker

Execution Constraints
- The number of tasks for a given component is fixed by the user
- The number of executors for each component is also fixed by the user, and must be less than or equal to the number of tasks to avoid idle executors (see the example below)
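For instance (hypothetical bolt and numbers), the parallelism hint sets the executor count while setNumTasks fixes the task count:

```java
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    // CounterBolt is a hypothetical bolt class and "source" a hypothetical spout id.
    static void declare(TopologyBuilder builder) {
        builder.setBolt("counter", new CounterBolt(), 4)  // parallelism hint: 4 executors (threads)
               .setNumTasks(8)                            // 8 task instances of this component
               .shuffleGrouping("source");                // executors <= tasks, so none sit idle
    }
}
```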

Custom Storm Schedulers
- Input: topology G(V, T), w plus user-defined parameters (α, β, …)
- Output: deployment plan
- The IScheduler API is provided to plug in a custom scheduler
- Its schedule method takes two parameters:
  1. an object containing the definition of all currently-running topologies
  2. an object representing the physical cluster
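A skeleton of a custom scheduler against this API might look roughly as follows, using the 0.x-era backtype.storm package names from the paper's time frame (newer Storm releases use org.apache.storm.scheduler and slightly different method signatures); the placement logic itself is left as a stub:

```java
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;

// Skeleton of a pluggable scheduler. The actual offline/online heuristics
// from the paper would go where the comments indicate.
public class AdaptiveSchedulerSkeleton implements IScheduler {

    @Override
    public void prepare(Map conf) {
        // Read user-defined parameters (e.g. alpha, beta) from the Storm config here.
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        for (TopologyDetails topology : topologies.getTopologies()) {
            if (!cluster.needsScheduling(topology)) {
                continue;  // this topology already has a complete assignment
            }
            // 1. Decide an executor -> worker partition (offline or online heuristic).
            // 2. Map workers to the slots reported as available by the cluster
            //    and commit each slot assignment back to the cluster object.
        }
    }
}
```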