1 A Distributed Task Scheduler Optimizing Data Transfer Time
Taura lab. Kei Takahashi (56428)

2 Task Schedulers
A system which distributes many serial tasks onto a grid environment
► Task assignments
► File transfers
Many serial tasks can be executed in parallel
Some constraints need to be considered:
► Machine availability
► Data location

3 Data Intensive Applications
A computation using large data
► From gigabytes to petabytes
► Natural language processing, data mining, etc.
► A simple algorithm can extract useful general knowledge
A scheduler additionally needs to consider:
► Reduction in data transfers
► Effective placement of data replicas

4 An Example of Scheduling
The scheduler maps tasks onto machines
[Figure: a scheduler assigns tasks t0 (requires file f0), t1 and t2 (require file f1) to machines A and B; placing tasks near their input files gives a shorter processing time.]

5 Related Work
Schedulers for data intensive applications
► GrADS: predicts transfer time, but only uses a static bandwidth value between two hosts
► PBS: schedules parallel jobs onto nodes considering network topology
Efficient multicast
Topology-aware bandwidth simulation

6 Topology-aware Transfer
If a bandwidth topology map is given, file transfer times can be estimated more precisely
► Congestion can be detected
This estimation enables more precise optimization of the task schedule
[Figure: several source nodes send to destination nodes through one switch; congestion occurs on the shared link.]

7 Research Purpose
Design and implement a distributed task scheduler for data intensive applications
► Predict data transfer times by using a network topology map with bandwidth
► Map tasks onto machines
► Plan efficient data transfers

8 Input and Output
Given information:
► File locations and sizes
► Network topology
Output:
► Task schedule: assignments of tasks to nodes
► Transfer schedule: from/to which host each file is transferred
Bandwidth can be limited during a transfer if needed
The amount of data transferred changes depending on the task schedule

9 Agenda
Background
Purpose
Our Approach
Related Work
Conclusion

10 Our Approach
Task execution time is hard to predict in general
► Schedule one task per host at a time
► Assign new tasks when a certain number of hosts have completed, or after a certain time has passed
When the data sizes and bandwidths are known, file transfer time is predictable
► Optimize data transfer time by using network topology and bandwidth information
Multicast
Transfer priorities

11 Problem Formulation
Final goal: minimize the makespan
Intermediate goal: minimize the sum of the times before each task can start
More immediate goal: minimize the sum of the arrival times of every file on every node:
$\min \sum_i \sum_j \mathrm{arrival}(\mathrm{file}_{i,j})$
($\mathrm{file}_{i,j}$: the $j$-th file required by task $i$)
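As an illustration (not part of the original deck), this immediate objective can be read as a small evaluation function; arrival_time is an assumed estimator, not part of the scheduler's API:

# Sketch only: evaluate the immediate objective for a candidate schedule.
# required_files[i] lists the files needed by task i (the file_{i,j} above);
# assigned_node[i] is the node the candidate schedule gives task i;
# arrival_time(f, n) is an assumed estimator of when file f arrives at node n.
def objective(required_files, assigned_node, arrival_time):
    return sum(
        arrival_time(f, assigned_node[i])
        for i, files in enumerate(required_files)
        for f in files
    )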

12 (An image will come here)

13 Algorithm
When some nodes are unscheduled:
► Create an initial candidate task schedule
► For each candidate task schedule:
Decide priorities for each file transfer (including ongoing transfers)
Plan an efficient file transfer schedule
Estimate the usage of each link, and find the most crowded link
► Search for a better schedule whose most crowded link is less crowded, using heuristics such as GA or SA (see the sketch below)
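A minimal sketch of this search loop, with hill climbing as a stand-in for the GA/SA heuristics the deck mentions; initial, mutate and cost are supplied by the caller and are assumptions, not the scheduler's API:

def search_schedule(initial, mutate, cost, iterations=1000):
    # cost(schedule) should return the rough transfer time of the most
    # crowded link (slide 21); mutate(schedule) returns a neighboring
    # candidate with altered task assignments.
    best, best_cost = initial, cost(initial)
    for _ in range(iterations):
        candidate = mutate(best)
        c = cost(candidate)
        if c < best_cost:          # keep the less crowded schedule
            best, best_cost = candidate, c
    return best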

14 Transfer Priorities (1)
If several transfers share a link, they divide the bandwidth of that link
Since a task cannot start before the entire file has arrived, it is more efficient to put priorities on the transfers
► If two transfers share a link, it is more efficient to transfer the smaller file first
[Figure: two 1 GB files, F0 and F1, on a 100 MBps link. Shared at 50 MBps each, both arrive after 20 seconds; with priority, F0 completes after 10 seconds and F1 still arrives after 20 seconds.]
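The arithmetic behind the figure, as a quick check (using 1 GB ≈ 1000 MB for simplicity; not from the deck):

size_mb, link_mbps = 1000.0, 100.0          # one 1 GB file, one 100 MBps link

# Fair sharing: F0 and F1 each get 50 MBps and finish together.
shared_finish = size_mb / (link_mbps / 2)   # 20.0 seconds for both files

# Prioritized: F0 takes the whole link first, then F1 follows.
f0_finish = size_mb / link_mbps             # 10.0 seconds
f1_finish = f0_finish + size_mb / link_mbps # 20.0 seconds, no worse than before

Prioritizing strictly improves F0's arrival while leaving F1's unchanged.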

15 Transfer Priorities (2)
Objective: minimize $\sum_i \sum_j \mathrm{arrival}(\mathrm{file}_{i,j})$
Priorities are determined by these criteria (see the sketch below):
► A file needed by more tasks
► A smaller file
► A file with fewer replicas (sources)
($\mathrm{file}_{i,j}$: the $j$-th file required by task $i$)
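These three criteria translate directly into a lexicographic sort key; a sketch, assuming each file is a dict with hypothetical num_tasks, size and num_replicas fields:

def transfer_priority_key(f):
    # Needed by more tasks first, then smaller size, then fewer replicas.
    return (-f["num_tasks"], f["size"], f["num_replicas"])

# pending_files.sort(key=transfer_priority_key)   # highest priority first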

16 Transfer Planning
Transfers are planned starting from the file with the highest priority (greedy method, sketched below)
► The highest-priority transfer can use as much bandwidth as possible
► A lower-priority transfer can use the rest of the bandwidth
The bandwidth of ongoing transfers is also reassigned
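A sketch of this greedy assignment, assuming each transfer is routed over a known path of links (the shapes of transfers and capacity are assumptions for illustration):

def plan_transfers(transfers, capacity):
    # transfers: list of (name, path) in priority order, where path is a
    # list of link ids; capacity: link id -> remaining bandwidth.
    assigned = {}
    for name, path in transfers:
        bw = min(capacity[link] for link in path)  # bottleneck of the residual path
        for link in path:
            capacity[link] -= bw                   # leave the rest to lower priorities
        assigned[name] = bw
    return assigned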

17 Transfer Priorities (example)
Task0, Task1 and Task2 are scheduled
► File0: 3 GB, needed by Task0
► File1: 2 GB, needed by Task0, Task1 and Task2
► File2: 1 GB, needed by Task1
Transfer priority: File1 > File2 > File0
File1 is transferred using a multicast pipeline
[Figure: the total bandwidth over time, divided among the transfers of File1 (to Task0, Task1, Task2), File2 (to Task1) and File0 (to Task0).]

18 Pipeline Multicast (1)
For a given schedule, it is known which nodes require which files
When multiple nodes need a common file, a pipeline multicast shortens transfer time (in the case of large files)
The speed of a pipeline broadcast is limited by the narrowest link in the tree
A broadcast can be sped up by efficiently using multiple sources

19 Pipeline Multicast (2)
The tree is constructed in a depth-first manner
Every related link is used only twice (upward/downward)
Since disk access is as slow as the network, the disk access bandwidth should also be counted
[Figure: a pipeline from the source winding depth-first through the destination nodes.]

20 Multi-source Multicast
M nodes have the same source data; N nodes need it
For each link, in descending order of bandwidth (see the sketch below):
► If the link connects two nodes/switches which are already connected to a source node: discard the link
► Otherwise: adopt the link
(Kruskal's algorithm: it maximizes the narrowest link in the pipelines)
[Figure: two pipelines grow from two sources; a link that would join them is discarded.]
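A sketch of this Kruskal-style construction with union-find; the link and node representations are assumptions, since the deck only states the rule:

def multicast_forest(links, sources):
    # links: (bandwidth, u, v) tuples; sources: nodes already holding the data.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    has_source = {find(s): True for s in sources}
    adopted = []
    for bw, u, v in sorted(links, reverse=True):    # widest link first
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if has_source.get(ru) and has_source.get(rv):
            continue                # discard: it would merge two pipelines
        parent[ru] = rv             # adopt the link
        has_source[rv] = has_source.pop(ru, False) or has_source.get(rv, False)
        adopted.append((bw, u, v))
    return adopted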

21 Find Crowded Link
For a given schedule, for each link:
► List every transfer using that link
► Sum up the transfer sizes
► Calculate a "rough transfer time" as (total transfer size) / (bandwidth)
Then find the longest rough transfer time (see the sketch below)

                     Link 0      Link 1      Link 2   Link 3
Bandwidth            bw0         bw1         bw2      bw3
File 0               size0       0           0        size0
File 1               0           size1       0        size1
File 2               0           0           0        size2
Rough transfer time  size0/bw0   size1/bw1   0        (size0+size1+size2)/bw3
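The table above as a few lines of code (a sketch; the dict shapes are assumptions for illustration):

def most_crowded_link(link_files, bandwidth):
    # link_files: link id -> sizes of every file transferred over that link;
    # bandwidth: link id -> link bandwidth.
    rough = {link: sum(sizes) / bandwidth[link]
             for link, sizes in link_files.items()}
    return max(rough, key=rough.get)    # link with the longest rough transfer time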

22 Improve the Schedule
After the most crowded link is found, the scheduler tries to reduce the transfer size by altering task assignments
We are considering GA or simulated annealing; since the most crowded link is known, the mutation phase can try to reduce the traffic over that link
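For the simulated-annealing variant, the standard acceptance rule would apply; a sketch (delta is the change in the most crowded link's rough transfer time):

import math, random

def accept(delta, temperature):
    # Always accept improvements; accept a worse schedule with
    # probability exp(-delta / temperature), so the search can escape
    # local optima while mutating traffic away from the crowded link.
    return delta < 0 or random.random() < math.exp(-delta / temperature)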

23 Actual Transfers
After the transfer schedule has been determined, the plan is performed as simulated
In a pipeline transfer, the bandwidth is the same across all links
► The bandwidth of the narrowest link is limited to the calculated value
When a significant change in bandwidth is detected, the schedule is reconstructed
► The bandwidth is measured using existing methods (e.g. nettimer, bprobe)

24 Re-scheduling Transfers
When a file transfer has finished, the transfer schedule is recalculated
► Recalculate the priority of each transfer
► Reassign bandwidth to each transfer
A new bandwidth value is assigned to each transfer, but the pipeline is not changed

25 Task Description
A user needs to specify required files, output files and dependencies for each task
In order to enable flexible task description, we provide a scheduling API for script languages
► Files are identified by URIs (e.g. "abc.com:/home/kay/some/location")
► The scheduler analyses dependencies from file paths

26 Task Submission
A task is submitted by calling submit() with the required file paths

fs = sched.list_files("abc.com:/home/kay/some/location/**")
parsed_files = []
for fn in fs:
    new_fn = fn + ".parsed"                   # output path for this input
    command = "parser " + fn + " " + new_fn   # shell command to execute
    sched.submit(command, [new_fn], [])       # submit the task with its files
    parsed_files.append(new_fn)
sched.gather(parsed_files, "abc.com:/home/kay/parsed/")  # collect all outputs

27 Conclusion
Introduced a new scheduling algorithm
► Predicts transfer times using the network topology, and searches for a better task schedule
► Transfers files efficiently by limiting bandwidth
► Dynamically re-schedules transfers
Current status:
► The implementation is ongoing

28 Publications
Kei Takahashi, Kenjiro Taura, Takashi Chikayama. A Distributed Collection Object Supporting Migration. Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP 2005), Takeo, August 2005.
Kei Takahashi, Kenjiro Taura, Takashi Chikayama. A Distributed Collection Object Supporting Migration. Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2005), Tsukuba, May 2005.