The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems
Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema
ACM/IEEE Int’l. Symposium on High Performance Distributed Computing
Parallel and Distributed Systems Group, TU Delft

2 The VL-e project
A grid project in the Netherlands (2004-)
Natural gas money: VL-e 45 MEuro / 800 MEuro total research package
Overall aim: … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).
Goals:
1. create prototypes of application-specific e-science environments
2. design and develop re-usable ICT/grid components
3. validate with real-life applications in testbeds
Natural gas price → $$ for grid computing

3 The VL-e project: application areas
[Figure: VL-e layer diagram.
  Application areas: Data Intensive Science, Bio-Diversity, Bio-Informatics, Food Informatics, Medical Diagnosis & Imaging, Dutch Telescience
  Virtual Laboratory (VL): Application Oriented Services
  Grid Services: Harness multi-domain distributed resources
  Management of comm. & computing
  Philips, Unilever, IBM]

4 The VL-e project: application areas
[Figure: the same VL-e layer diagram as on slide 3, now with the label Bags-of-Tasks added]

6 The Challenge
Complete scientific work better, …
  User-oriented performance metrics (time a critical performance component)
  Bags-of-tasks for ease-of-use
… in real systems
  Workloads (now that real traces are available)
  Information unavailability
What to do?
Hint: the next 10% improvement won’t cut it!

7 The Challenge (cont’d.)
System model
  What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks?
Input model
  What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?
What is the best setup for such system/input?
  How to find the best?
  If a best is found, can there be another?

8 The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
1. Introduction and Motivation
2. Context: System Model
3. Workload Model
4. Design Space Exploration
5. Conclusion

9 Context: System Model [1/4] Overview
System Model
1. Clusters execute jobs
2. Resource managers coordinate job execution
3. Resource management architectures route jobs among resource managers
4. Task selection policies create the eligible set
5. Task scheduling policies schedule the eligible set

10 Context: System Model [2/4]
Resource Management Architectures route jobs among resource managers
  Separated Clusters (sep-c)
  Centralized (csp)
  Decentralized (fcondor)

11 Context: System Model [3/4]
Task Selection Policies create the eligible set
Age-based:
1. S-T: Select Tasks in the order of their arrival.
2. S-BoT: Select BoTs in the order of their arrival.
User priority based:
3. S-U-Prio: Select the tasks of the User with the highest Priority.
Based on fairness in resource consumption:
4. S-U-T: Select the Tasks of the User with the lowest resource consumption.
5. S-U-BoT: Select the BoTs of the User with the lowest resource consumption.
6. S-U-GRR: Select the User Round-Robin; take all tasks for this user.
7. S-U-RR: Select the User Round-Robin; take one task for this user.
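The selection policies listed above can be illustrated with a minimal Python sketch. It covers only S-T, S-BoT, S-U-T, and S-U-RR. The `Task` record, the function names, and the `consumption` dictionary are hypothetical and not taken from the slides or from DGSim. The sketch also assumes that a BoT's arrival time is the arrival time of its earliest task, and it returns the whole eligible ordering at once rather than one task per scheduling event.

```python
from collections import namedtuple, defaultdict
from itertools import zip_longest

# Hypothetical task record; field names are illustrative only.
Task = namedtuple("Task", ["task_id", "user", "bot_id", "arrival"])

def s_t(queue):
    """S-T: tasks in the order of their arrival."""
    return sorted(queue, key=lambda t: (t.arrival, t.task_id))

def s_bot(queue):
    """S-BoT: BoTs in the order of their arrival (assumed: earliest task)."""
    bot_arrival = {}
    for t in queue:
        bot_arrival[t.bot_id] = min(t.arrival, bot_arrival.get(t.bot_id, t.arrival))
    return sorted(queue, key=lambda t: (bot_arrival[t.bot_id], t.bot_id,
                                        t.arrival, t.task_id))

def s_u_t(queue, consumption):
    """S-U-T: tasks of the user with the lowest resource consumption so far."""
    if not queue:
        return []
    users = {t.user for t in queue}
    chosen = min(users, key=lambda u: (consumption.get(u, 0.0), u))
    return s_t([t for t in queue if t.user == chosen])

def s_u_rr(queue):
    """S-U-RR: visit users round-robin, taking one task per user per round."""
    per_user = defaultdict(list)
    for t in s_t(queue):
        per_user[t.user].append(t)
    rounds = zip_longest(*(per_user[u] for u in sorted(per_user)))
    return [t for rnd in rounds for t in rnd if t is not None]

# Made-up queue: two users, two BoTs, tasks trickling in over time.
queue = [
    Task(1, "alice", "b1", 0.0),
    Task(2, "alice", "b1", 2.0),
    Task(3, "bob", "b2", 1.0),
    Task(4, "bob", "b2", 3.0),
    Task(5, "alice", "b1", 2.5),
]

if __name__ == "__main__":
    for name, order in [("S-T", s_t(queue)), ("S-BoT", s_bot(queue)),
                        ("S-U-T", s_u_t(queue, {"alice": 10.0, "bob": 4.0})),
                        ("S-U-RR", s_u_rr(queue))]:
        print(name, [t.task_id for t in order])
```

On this made-up queue the four policies produce different orderings, which is the point of separating task selection from task scheduling in the system model.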

12 Context: System Model [4/4]
Task Scheduling Policies schedule the eligible set
Information availability:
  Known (K)
  Unknown (U)
  Historical records (H)
Sample policies:
  Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))
  Fastest Processor First (FPF)
  (Dynamic) Fastest Processor Largest Task ((D)FPLT)
  Shortest Task First w/ Replication (STFR)
  Work Queue w/ Replication (WQR)
[Table: Task Information (K, H, U) by Resource Information (K, H, U). Entries: ECT, FPLT | FPF | ECT-P | DFPLT, MQD | STFR | RR, WQR]
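To make the role of information availability concrete, here is a minimal sketch of two of the sample policies, under simplifying assumptions that are not from the slides: all tasks are queued at time 0, processors are dedicated, and a task with reference runtime `rt` takes `rt / speed` on a processor with relative speed `speed`. The function names are hypothetical. ECT uses both task runtimes and processor speeds to make its decision; FPF decides from processor speeds only (the runtimes appear in `fpf_schedule` only to advance the simulated clock).

```python
def ect_schedule(runtimes, speeds):
    """Earliest Completion Time: place each task where it would finish first.

    Decision uses task runtimes AND processor speeds (both known).
    Returns (assignment, makespan); assignment[i] is the processor of task i.
    """
    free_at = [0.0] * len(speeds)
    assignment = []
    for rt in runtimes:
        finish = [free_at[p] + rt / speeds[p] for p in range(len(speeds))]
        best = min(range(len(speeds)), key=lambda p: finish[p])
        free_at[best] = finish[best]
        assignment.append(best)
    return assignment, max(free_at)

def fpf_schedule(runtimes, speeds):
    """Fastest Processor First: give the next task to the fastest idle processor.

    Decision uses processor speeds only; task runtimes are treated as unknown
    by the policy and are used here just to simulate when processors free up.
    """
    free_at = [0.0] * len(speeds)
    assignment = []
    for rt in runtimes:
        now = min(free_at)  # next moment at which some processor is idle
        idle = [p for p in range(len(speeds)) if free_at[p] <= now]
        best = max(idle, key=lambda p: speeds[p])
        free_at[best] = now + rt / speeds[best]
        assignment.append(best)
    return assignment, max(free_at)

if __name__ == "__main__":
    runtimes = [1.0, 8.0]   # made-up reference runtimes
    speeds = [1.0, 2.0]     # made-up relative processor speeds
    print("ECT:", ect_schedule(runtimes, speeds))
    print("FPF:", fpf_schedule(runtimes, speeds))
```

In this made-up case ECT waits briefly for the fast processor and finishes sooner, while FPF sends the long task to the slow processor because that is the only idle one. The example is illustrative only and says nothing about the measured results in the article.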

14 Workload Modeling 101: What Matters
Job arrival process & job service time: Self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94]
Job grouping: bags-of-tasks dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07]
Job size: almost always 1 CPU [IosupDELW Grid’06]
[Figure: No. Packets / Time Unit vs. Time Units, at Time Unit = 0.01s and Time Unit = 100s; annotation: Longer queues]

15 A Bag-of-Tasks Workload Model
Model: Users, Bags-of-Tasks, Tasks
Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads
More details (e.g., parameter values): see article
Validation data: the Grid Workloads Archive (http://gwa.ewi.tudelft.nl/)
  7 long-term grid traces
  >5 million tasks
  >2500 users
  >40k CPUs
  Domains: HEP, graphics, AI, math, biomed, climate, finance, aero…
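A rough sketch of how a users / bags-of-tasks / tasks workload with heavy-tailed inter-arrival and service times could be generated is given below. Everything specific in it is an assumption made for illustration: the choice of Weibull (shape below 1), Pareto, and lognormal distributions, every parameter value, the uniform choice of user, and the name `generate_bots`. The fitted distributions and parameter values are in the article, not in the slides, and are not reproduced here.

```python
import random

def generate_bots(n_bots, n_users, seed=0):
    """Generate a synthetic BoT workload (illustrative parameters only).

    Returns a list of dicts with keys: bot_id, user, arrival, runtimes.
    """
    rng = random.Random(seed)
    now = 0.0
    bots = []
    for bot_id in range(n_bots):
        # Heavy-tailed BoT inter-arrival time: Weibull(scale=60 s, shape=0.7).
        # Both values are hypothetical placeholders.
        now += rng.weibullvariate(60.0, 0.7)
        # Submitting user chosen uniformly (hypothetical simplification).
        user = rng.randrange(n_users)
        # BoT size: Pareto-distributed, truncated to an integer >= 1.
        size = int(rng.paretovariate(1.5))
        # Task service times on a reference processor: lognormal (hypothetical).
        runtimes = [rng.lognormvariate(5.0, 1.0) for _ in range(size)]
        bots.append({"bot_id": bot_id, "user": user,
                     "arrival": now, "runtimes": runtimes})
    return bots

if __name__ == "__main__":
    for bot in generate_bots(5, n_users=3, seed=42):
        print(bot["bot_id"], bot["user"], round(bot["arrival"], 1),
              len(bot["runtimes"]))
```

Seeding a private `random.Random` instance keeps the generated workload reproducible, which matters when the same workload is replayed across many design points.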

17 Design Space Exploration [1/5] Overview
Design space exploration: time to understand how our solutions fit into the complete system.
Study the impact of:
  The Task Scheduling Policy (s policies)
  The Workload Characteristics (P characteristics)
  The Dynamic System Information (I levels)
  The Task Selection Policy (S policies)
  The Resource Management Architecture (A policies)
s x 7 P x I x S x A x (environment) → >2M design points

18 Design Space Exploration [2/5] Experimental Setup
Simulator: DGSim [IosupETFL SC’07, IosupSE EuroPar’08]
System: DAS + Grid’5000 [Cappello & Bal CCGrid’07]
  >3,000 CPUs: relative perf. 1-1.75
Metrics:
  Makespan
  Normalized Schedule Length ~ speed-up
Workloads:
  Real: DAS + Grid’5000
  Realistic: system load 20-95% (from workload model)
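The two metrics can be sketched per BoT as follows. The makespan (MS) is taken as the time from the BoT's first submission to the completion of its last task. The slide only says that the Normalized Schedule Length (NSL) is related to speed-up; the definition used here, makespan divided by the sum of the tasks' runtimes on a reference processor, is an assumption and should be checked against the article. Function and parameter names are hypothetical.

```python
def makespan(submit_times, finish_times):
    """Time from the first task submission to the last task completion."""
    return max(finish_times) - min(submit_times)

def normalized_schedule_length(submit_times, finish_times, ref_runtimes):
    """NSL (assumed definition): makespan / total reference runtime.

    Under this assumed definition, lower is better: a value of 1.0 would
    correspond to running all tasks back to back on one reference processor.
    """
    return makespan(submit_times, finish_times) / sum(ref_runtimes)

if __name__ == "__main__":
    # Made-up BoT with three tasks, all submitted at t = 0.
    submit = [0.0, 0.0, 0.0]
    finish = [4.0, 6.0, 5.0]
    ref_runtimes = [4.0, 6.0, 2.0]
    print("MS :", makespan(submit, finish))
    print("NSL:", normalized_schedule_length(submit, finish, ref_runtimes))
```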

19 Design Space Exploration [3/5] Selected Results A
Design Guidelines for Scheduling Policies
Influence of the information type:
  (K,K): best balance between MS and NSL
  (*,U),(U,*): surprisingly good (FPF) to surprisingly poor (WQR4x)
  (*,H),(H,*): poor. Simple runtime predictors don’t work (see article)
Where to invest time?
  K -> H, K -> U: adapt for information type with lowest variation
[Figure labels: WQR4x, FPF]

20 Design Space Exploration [4/5] Selected Results B
Task Selection Only for Busy Systems
Not much difference until system load over 50%.
For DAS + Grid’5000 no change of task selection policy.
[Figure labels: Same performance; S-BoT; S-T]

21 Design Space Exploration [5/5] Selected Results C
Resource Management Architecture
Centralized, separated, or distributed?
Centralized is best [Note: job overhead not considered.]
Distributed: good for system load below 50%; over 50% it does not finish all tasks.

Conclusion
  System Model = Resource Management Architecture + Task Selection Policy + Task Scheduling Policy
  Information availability framework
  BoT workload model
  Design space exploration: the performance of bags-of-tasks
Future Work
  Better predictors
  (H,H) task scheduling policies
  ?
[Table: Task Information (K, H, U) by Resource Information (K, H, U). Entries: ECT, FPLT | FPF | ECT-P | DFPLT, MQD | STFR | RR, WQR]

24 Thank you! Questions? Remarks? Observations?
Help building the Grid Workloads Archive: http://gwa.ewi.tudelft.nl
Contact: A.Iosup@gmail.com [google “Iosup“]
Web sites:
  http://www.vl-e.nl : VL-e project
  http://www.pds.ewi.tudelft.nl : PDS group articles & software

25 What About Other Workloads? (High Performance vs. High Throughput Computing)
Parallel jobs vs. bags-of-tasks
Workflows…
We need your traces! We work blindly without them.
For parallel jobs, the architecture counts much more [IosupETFL SC’07]
For workflows, we don’t know much about performance.
