
1 A Distributed Task Scheduler Optimizing Data Transfer Time. Taura lab. Kei Takahashi (56428)





2 Task Schedulers
A system which distributes many serial tasks over a grid environment
► Task assignments
► File transfers
Many serial tasks can be executed in parallel
Some constraints need to be considered
► Machine availability
► Data location

3 Data Intensive Applications
A computation using large data
► Some gigabytes to petabytes
► Natural language processing, data mining, etc.
► A simple algorithm can extract useful general knowledge
A scheduler must additionally consider:
► Reduction in data transfers
► Effective placement of data replicas

4 An Example of Scheduling
The scheduler maps tasks onto machines
(Figure: a scheduler assigns tasks t0, t1, t2, requiring files f0 and f1, to machines A and B; placing tasks near their input files yields a shorter processing time.)

5 Related Work
Schedulers for data intensive applications
► GrADS: predicts transfer time, but only uses a static bandwidth value between two hosts
Efficient multicast
Topology-aware bandwidth simulation
► PBS: schedules parallel jobs to nodes considering network topology

6 Topology-aware Transfer
If a bandwidth topology map is given, file transfer times can be estimated more precisely
► Detecting congestion
With this estimation, more precise optimization of the task schedule becomes possible
(Figure: congestion occurs at a switch between source nodes and destination nodes.)

7 Research Purpose
Design and implement a distributed task scheduler for data intensive applications
► Predict data transfer time by using a network topology map with bandwidth
► Map tasks onto machines
► Plan efficient data transfers

8 Input and Output
Given information:
► File locations and sizes
► Network topology
Output:
► Task schedule: assignment of nodes to tasks
► Transfer schedule: from/to which hosts data are transferred; limit bandwidth during the transfer if needed
The amount of data transferred changes depending on the task schedule

9 Agenda
Background / Purpose / Our Approach / Related Work / Conclusion

10 Our Approach
Task execution time is hard to predict in general
► Schedule one task for each host at a time
► Assign new tasks when a certain number of hosts have completed, or a certain time has passed
When the data size and bandwidth are known, file transfer time is predictable
► Optimize data transfer time by using network topology and bandwidth information
Multicast
Transfer priorities

11 Problem Formulation
Final goal: minimizing the makespan
Intermediate goal: minimizing the sum of the time each task waits before it can start
More immediate goal: minimizing the sum of the arrival times of every required file on every node
(file i,j : the j-th file required by task i)
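The chain of relaxed objectives can be written out explicitly. The formula image on the original slide is not recoverable, so the notation below is an assumed reconstruction, with arrival(file_{i,j}) denoting the time the j-th file required by task i arrives at the node running task i:

```latex
\min \; \max_i \, \mathrm{finish}(t_i)
\;\;\rightsquigarrow\;\;
\min \; \sum_i \mathrm{start}(t_i)
\;\;\rightsquigarrow\;\;
\min \; \sum_i \sum_j \mathrm{arrival}(\mathit{file}_{i,j})
```

Each relaxation trades faithfulness for tractability: file arrival times, unlike task finish times, are fully predictable from sizes and bandwidths.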

12 (An image will come here)

13 Algorithm
When some nodes are unscheduled:
► Create an initial candidate task schedule
► For each candidate task schedule:
Decide priorities for each file transfer (including ongoing transfers)
Plan an efficient file transfer schedule
Estimate the usage of each link, and find the most crowded link
► Search for a better schedule whose most crowded link is less crowded (using heuristics such as GA or SA)
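The search step above can be sketched as a simple local search; GA or simulated annealing would replace this loop, and all names here (`neighbors`, `cost`, the toy imbalance objective) are illustrative assumptions, not the authors' implementation:

```python
import random

def local_search(initial, neighbors, cost, iterations=200, seed=0):
    """Greedy local search over candidate task schedules: repeatedly
    propose a neighboring schedule and keep it if its cost (e.g. the
    'rough transfer time' of the most crowded link) is lower.
    A stand-in for the GA/SA heuristics mentioned on the slide."""
    rng = random.Random(seed)
    best, best_cost = initial, cost(initial)
    for _ in range(iterations):
        candidate = neighbors(best, rng)
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost

# Toy usage: balance four tasks across two machines (0 and 1).
def flip_one(schedule, rng):
    s = list(schedule)
    i = rng.randrange(len(s))
    s[i] = 1 - s[i]          # move one task to the other machine
    return s

def imbalance(schedule):
    return abs(schedule.count(0) - schedule.count(1))

best, c = local_search([0, 0, 0, 0], flip_one, imbalance)
```

The mutation operator (`flip_one` here) is where knowledge of the most crowded link would be injected, by preferentially moving tasks whose files cross that link.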

14 Transfer Priorities (1)
If several transfers share a link, they divide the bandwidth of that link
Since a task cannot start before the entire file has arrived, it is more efficient to put priorities on the transfers
► If two transfers are in progress, it is more efficient to transfer the smaller file first
(Example: two 1 GB files, F0 and F1, share a 100 MB/s link. Split at 50 MB/s each, both arrive after 20 seconds; sent one at a time at 100 MB/s, F0 completes after 10 seconds and F1 still arrives after 20 seconds.)
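The example can be checked numerically. A small sketch (function names assumed) comparing fair sharing of a link against shortest-file-first sequential transfer:

```python
def processor_sharing(sizes_mb, bw_mbps):
    """Finish times when all active transfers split the link equally."""
    remaining = dict(enumerate(sizes_mb))
    finish = {}
    t = 0.0
    while remaining:
        share = bw_mbps / len(remaining)       # equal split of the link
        i, r = min(remaining.items(), key=lambda kv: kv[1])
        dt = r / share                         # run until the smallest finishes
        t += dt
        for j in remaining:
            remaining[j] -= share * dt
        finish[i] = t
        del remaining[i]
    return [finish[i] for i in range(len(sizes_mb))]

def shortest_first(sizes_mb, bw_mbps):
    """Finish times when files are sent one at a time, smallest first."""
    finish = [0.0] * len(sizes_mb)
    t = 0.0
    for i in sorted(range(len(sizes_mb)), key=lambda i: sizes_mb[i]):
        t += sizes_mb[i] / bw_mbps
        finish[i] = t
    return finish

# Two 1 GB files on a 100 MB/s link, as on the slide:
shared = processor_sharing([1000, 1000], 100)   # both arrive at 20 s
ordered = shortest_first([1000, 1000], 100)     # one at 10 s, one at 20 s
```

The last arrival is 20 s either way, but prioritizing halves one file's arrival time, which is exactly what the sum-of-arrival-times objective rewards.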

15 Transfer Priorities (2)
Objective: minimizing the sum of the arrival times of every required file
Priorities are determined by these criteria:
► A file needed by more tasks
► A smaller file
► A file with fewer replicas (sources)
(file i,j : the j-th file required by task i)
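The three criteria can be combined into one lexicographic sort key; the field names below are assumptions for illustration:

```python
def priority_key(f):
    """Sort ascending: more consumers first, then smaller size,
    then fewer replicas. A lower key means a higher transfer priority."""
    return (-f["needed_by"], f["size_gb"], f["replicas"])

files = [
    {"name": "File0", "size_gb": 3, "needed_by": 1, "replicas": 1},
    {"name": "File1", "size_gb": 2, "needed_by": 3, "replicas": 1},
    {"name": "File2", "size_gb": 1, "needed_by": 1, "replicas": 1},
]
order = [f["name"] for f in sorted(files, key=priority_key)]
# Matches the worked example two slides later: File1 > File2 > File0
```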

16 Transfer Planning
Transfers are planned starting from the file with the highest priority (greedy method)
► The highest priority transfer can use as much bandwidth as possible
► A lower priority transfer can use the rest of the bandwidth
The bandwidth of ongoing transfers is also reassigned
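The greedy allocation can be sketched as follows: walk the transfers in priority order and give each one the minimum residual capacity along its route. The route and capacity structures are assumptions for illustration:

```python
def assign_bandwidth(transfers, link_capacity):
    """Greedy: the highest-priority transfer takes as much bandwidth as
    its route allows; each lower-priority transfer gets what remains."""
    residual = dict(link_capacity)
    rates = {}
    for t in transfers:                        # already sorted by priority
        rate = min(residual[link] for link in t["route"])
        rates[t["name"]] = rate
        for link in t["route"]:
            residual[link] -= rate
    return rates

# Two transfers sharing link A; link B bottlenecks the first route,
# so the leftover capacity of A goes to the second transfer.
rates = assign_bandwidth(
    [{"name": "t1", "route": ["A", "B"]},
     {"name": "t2", "route": ["A"]}],
    {"A": 100, "B": 50})
```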

17 Transfer Priorities (example)
Task0, Task1, and Task2 are scheduled
► File0: 3 GB, needed by Task0
► File1: 2 GB, needed by Task0, Task1, Task2
► File2: 1 GB, needed by Task1
Transfer priority: File1 > File2 > File0
File1 is transferred using a multicast pipeline
(Figure: a timeline of the total bandwidth, with File1 multicast to all three tasks first, then File2 and File0.)

18 Pipeline Multicast (1)
For a given schedule, it is known which nodes require which files
When multiple nodes need a common file, a pipeline multicast shortens transfer time (in the case of large files)
The speed of a pipeline broadcast is limited by the narrowest link in the tree
A broadcast can be sped up by efficiently using multiple sources

19 Pipeline Multicast (2)
The tree is constructed in a depth-first manner
Every related link is used only twice (once upward, once downward)
Since disk access is as slow as the network, disk access bandwidth should also be counted
(Figure: a pipeline from the source node through the switches to the destination nodes.)

20 Multi-source Multicast
M nodes have the same source data; N nodes need it
For each link, in descending order of bandwidth:
► If the link connects two nodes/switches which are already connected to a source node: discard the link
► Otherwise: adopt the link
(Kruskal's algorithm: it maximizes the narrowest link in the pipelines)
(Figure: two pipelines grown from two sources; a link joining already-connected components is discarded.)
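The link-scanning rule above is Kruskal's algorithm with all source nodes pre-merged into one component. A sketch with a small union-find (node numbering and input format assumed):

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def multicast_forest(n, links, sources):
    """links: (bandwidth, a, b) tuples. Scan in descending bandwidth order
    and adopt a link unless its endpoints are already connected (to each
    other or, via the pre-merge, to the sources), as on the slide."""
    parent = list(range(n))
    for s in sources[1:]:                     # treat all sources as one component
        parent[find(parent, s)] = find(parent, sources[0])
    adopted = []
    for bw, a, b in sorted(links, reverse=True):
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue                          # would connect already-connected parts: discard
        parent[ra] = rb
        adopted.append((bw, a, b))
    return adopted

# Nodes 0 and 1 hold the data; nodes 2 and 3 need it.
tree = multicast_forest(
    4, [(10, 0, 2), (8, 0, 1), (5, 1, 3), (1, 2, 3)], sources=[0, 1])
```

Because links are scanned in descending bandwidth order, every destination is attached through the widest feasible path, which maximizes the narrowest link in each pipeline.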

21 Find Crowded Link
For a given schedule, for each link:
► list every transfer using that link
► sum up the transfer sizes
► calculate a "rough transfer time" as (total transfer size) / (bandwidth)
Find the link with the longest "rough transfer time"

             Link 0      Link 1      Link 2      Link 3
Bandwidth    bw0         bw1         bw2         bw3
File 0       size0       0           0           size0
File 1       0           size1       0           size1
File 2       0           0           size2       size2
Rough time   size0/bw0   size1/bw1   size2/bw2   (size0+size1+size2)/bw3
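The per-link bookkeeping in the table can be computed directly; a sketch mirroring the table, with link names, sizes, and routes assumed:

```python
def rough_transfer_times(transfers, bandwidth):
    """Sum the sizes of every transfer crossing each link, divide by the
    link's bandwidth, and report the most crowded link."""
    load = {link: 0.0 for link in bandwidth}
    for size, route in transfers:
        for link in route:
            load[link] += size
    times = {link: load[link] / bandwidth[link] for link in bandwidth}
    return times, max(times, key=times.get)

# Three files whose routes all cross Link3, as in the table:
times, worst = rough_transfer_times(
    [(10, ["Link0", "Link3"]),
     (20, ["Link1", "Link3"]),
     (30, ["Link2", "Link3"])],
    {"Link0": 10, "Link1": 10, "Link2": 10, "Link3": 10})
```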

22 Improve the Schedule
After the most crowded link is found, the scheduler tries to reduce the transfer size by altering task assignments
We are considering GA or simulated annealing; since the most crowded link is known, the mutation phase can specifically try to reduce the traffic on that link

23 Actual Transfers
After the transfer schedule has been determined, the plan is carried out as simulated
In a pipeline transfer, the bandwidth is the same across all links
The bandwidth of the narrowest link is limited to the calculated value
When a significant change in bandwidth is detected, the schedule is reconstructed
► Bandwidth is measured using existing methods (e.g. nettimer, bprobe)

24 Re-scheduling Transfers
When a file transfer finishes, the transfer schedule is recalculated
► Recalculate the priority of each transfer
► Reassign bandwidth to each transfer
A new bandwidth value is assigned to each transfer, but the pipeline itself is not changed

25 Task Description
A user needs to specify the required files, output files, and dependencies for each task
To enable flexible task description, we provide a scheduling API for scripting languages
► Files are identified by URIs (e.g. "abc.com:/home/kay/some/location")
► The scheduler analyses dependencies from file paths

26 Task Submission
A task is submitted by calling submit() with the required file paths

fs = sched.list_files("abc.com:/home/kay/some/location/**")
parsed_files = []
for fn in fs:
    new_fn = fn + ".parsed"
    command = "parser " + fn + " " + new_fn
    sched.submit(command, [new_fn], [])
    parsed_files.append(new_fn)
sched.gather(parsed_files, "abc.com:/home/kay/parsed/")

27 Conclusion
Introduced a new scheduling algorithm
► Predicts transfer time using network topology, and searches for a better task schedule
► Transfers files efficiently by limiting bandwidth
► Dynamically re-schedules transfers
Current status:
► The implementation is ongoing

28 Publications
Kei Takahashi, Kenjiro Taura, Takashi Chikayama. Distributed Collection Objects Supporting Migration. Summer Workshop on Parallel, Distributed and Cooperative Processing (SWoPP 2005), Takeo, August 2005.
Kei Takahashi, Kenjiro Taura, Takashi Chikayama. Distributed Collection Objects Supporting Migration. Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2005), Tsukuba, May 2005.




