The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema ACM/IEEE Int’l.


The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema ACM/IEEE Int’l. Symposium on High Performance Distributed Computing Parallel and Distributed Systems Group, TU Delft

2 The VL-e project
A grid project in the Netherlands (2004-)
Natural gas money: VL-e 45 MEuro / 800 MEuro total research package
Overall aim: … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).
Goals:
1. create prototypes of application-specific e-science environments
2. design and develop re-usable ICT/grid components
3. validate with real-life applications in testbeds
Natural gas price → $$ for grid computing

5 The VL-e project: application areas
Grid Services
Harness multi-domain distributed resources
Management of comm. & computing
Virtual Laboratory (VL)
Application Oriented Services
Data Intensive Science, Bio-Diversity, Bio-Informatics, Food Informatics, Medical Diagnosis & Imaging, Dutch Telescience
Philips, Unilever, IBM
Bags-of-Tasks

6 The Challenge
Complete scientific work better, …
  User-oriented performance metrics (time a critical performance component)
  Bags-of-tasks for ease-of-use
… in real systems
  Workloads (now that real traces are available)
  Information unavailability
What to do? Hint: the next 10% improvement won’t cut it!

7 The Challenge (cont’d.)
System model: What is a good model for the study of large-scale distributed computing systems that run bag-of-tasks?
Input model: What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?
What is the best setup for such system/input? How to find the best? If a best is found, can there be another?

8 The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
1. Introduction and Motivation
2. Context: System Model
3. Workload Model
4. Design Space Exploration
5. Conclusion

9 Context: System Model [1/4] Overview
System Model
1. Clusters execute jobs
2. Resource managers coordinate job execution
3. Resource management architectures route jobs among resource managers
4. Task selection policies create the eligible set
5. Task scheduling policies schedule the eligible set

10 Context: System Model [2/4] Resource Management Architectures route jobs among resource managers
Separated Clusters (sep-c)
Centralized (csp)
Decentralized (fcondor)

11 Context: System Model [3/4] Task Selection Policies create the eligible set
Age-based:
1. S-T: Select Tasks in the order of their arrival.
2. S-BoT: Select BoTs in the order of their arrival.
User priority based:
3. S-U-Prio: Select the tasks of the User with the highest Priority.
Based on fairness in resource consumption:
4. S-U-T: Select the Tasks of the User with the lowest res. cons.
5. S-U-BoT: Select the BoTs of the User with the lowest res. cons.
6. S-U-GRR: Select the User Round-Robin/all tasks for this user.
7. S-U-RR: Select the User Round-Robin/one task for this user.
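
The difference between an age-based policy and a round-robin policy can be sketched in a few lines of Python. This is only an illustration of S-T and S-U-RR as they are described on the slide: the `Task` record, its field names, and the choice to visit users in order of their first arrival are assumptions, and the eligible set is represented as an ordered list.

```python
from collections import OrderedDict, namedtuple

# Hypothetical task record; the field names are illustrative, not from the slides.
Task = namedtuple("Task", ["task_id", "user", "bot_id", "arrival"])

def select_s_t(queue):
    """S-T: tasks in the order of their arrival (ties keep queue order)."""
    return sorted(queue, key=lambda t: t.arrival)

def select_s_u_rr(queue):
    """S-U-RR: cycle over users, taking one task per user per turn."""
    per_user = OrderedDict()
    for task in select_s_t(queue):
        per_user.setdefault(task.user, []).append(task)
    eligible = []
    while per_user:
        for user in list(per_user):
            eligible.append(per_user[user].pop(0))
            if not per_user[user]:
                del per_user[user]  # this user has no tasks left
    return eligible

queue = [
    Task("a1", "alice", "bot1", 0.0),
    Task("a2", "alice", "bot1", 0.0),
    Task("a3", "alice", "bot1", 0.0),
    Task("b1", "bob", "bot2", 1.0),
]
print([t.task_id for t in select_s_t(queue)])     # ['a1', 'a2', 'a3', 'b1']
print([t.task_id for t in select_s_u_rr(queue)])  # ['a1', 'b1', 'a2', 'a3']
```

Under S-T the late-arriving user waits behind the whole earlier bag; under S-U-RR that user's task is interleaved after the first task of the earlier user.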

12 Context: System Model [4/4] Task Scheduling Policies schedule the eligible set
Information availability: Known (K), Historical records (H), Unknown (U)
Sample policies:
Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))
Fastest Processor First (FPF)
(Dynamic) Fastest Processor Largest Task ((D)FPLT)
Shortest Task First w/ Replication (STFR)
Work Queue w/ Replication (WQR)
Policies by Task Information x Resource Information (each K, H, or U): ECT, FPLT; FPF; ECT-P; DFPLT, MQD; STFR; RR, WQR
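
As an illustration of a policy that needs both task and resource information to be known, here is a minimal sketch of an earliest-completion-time assignment. It is not the implementation used in the study: the function name, the work units, and the greedy task-by-task order are assumptions made for the example.

```python
def ect_schedule(task_work, proc_speed):
    """Assign each task, in the given order, to the processor that finishes it earliest.

    task_work:  work per task (hypothetical unit: seconds on a speed-1.0 processor)
    proc_speed: relative speed per processor
    Returns (assignment, ready): the processor index chosen for each task and
    the time at which each processor becomes free again.
    """
    ready = [0.0] * len(proc_speed)
    assignment = []
    for work in task_work:
        # Completion time on a processor = time it becomes free + runtime there.
        completion = [ready[p] + work / proc_speed[p] for p in range(len(proc_speed))]
        best = min(range(len(proc_speed)), key=lambda p: completion[p])
        ready[best] = completion[best]
        assignment.append(best)
    return assignment, ready

assignment, ready = ect_schedule([8.0, 4.0, 4.0, 2.0], [2.0, 1.0])
print(assignment)  # [0, 1, 0, 1]
print(max(ready))  # 6.0
```

With unknown runtimes or speeds, the `completion` estimate cannot be computed, which is why the other policies on the slide fall back on predictions, processor speed alone, or replication.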

14 Workload Modeling 101: What Matters
Job arrival process & job service time: self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94]
Job grouping: bags-of-tasks are the dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07]
Job size: almost always 1 CPU [IosupDELW Grid’06]
[Figure: number of packets per time unit vs. time units, shown for time unit = 0.01s and time unit = 100s; annotation: longer queues]

15 Model: Users, Bags-of-Tasks, Tasks
Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads
More details (e.g., parameter values): see article
Validation data: the Grid Workloads Archive
7 long-term grid traces, >5 million tasks, >2500 users, >40k CPUs
Domains: HEP, graphics, AI, math, biomed, climate, finance, aero…
A Bag-of-Tasks Workload Model
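
A toy generator shows what drawing inter-arrival and service times from heavy-tailed distributions looks like in practice. The distribution families (Weibull with shape below 1, log-normal), every parameter value, and the fixed number of tasks per bag are placeholders chosen for this sketch; the actual model and its parameter values are in the article.

```python
import random

def generate_bot_workload(n_bots, tasks_per_bot=3, seed=0):
    """Return a list of (arrival_time, [task_runtimes]) tuples, one per bag-of-tasks."""
    rng = random.Random(seed)
    arrival = 0.0
    workload = []
    for _ in range(n_bots):
        # Weibull with shape < 1 has a heavier-than-exponential tail, so gaps
        # between bags are mostly short with occasional very long pauses.
        arrival += rng.weibullvariate(60.0, 0.5)  # scale=60 s, shape=0.5 (placeholders)
        # Log-normal runtimes: most tasks are short, a few are very long.
        runtimes = [rng.lognormvariate(5.0, 1.5) for _ in range(tasks_per_bot)]
        workload.append((arrival, runtimes))
    return workload

workload = generate_bot_workload(n_bots=5, seed=42)
for arrival, runtimes in workload:
    print(f"BoT at t={arrival:9.1f}s, {len(runtimes)} tasks, longest {max(runtimes):9.1f}s")
```

Fixing the seed keeps a generated workload reproducible, which matters when the same input has to be replayed against many system configurations.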

17 Design Space Exploration [1/5] Overview
Design space exploration: time to understand how our solutions fit into the complete system.
Study the impact of:
The Task Scheduling Policy (s policies)
The Workload Characteristics (P characteristics)
The Dynamic System Information (I levels)
The Task Selection Policy (S policies)
The Resource Management Architecture (A policies)
s x 7 P x I x S x A x (environment) → >2M design points
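
The size of such a design space is the product of the sizes of its dimensions. The sketch below enumerates only a sub-space built from names that appear on the system-model slides (seven sample scheduling policies, seven selection policies, three architectures); it leaves out the workload, information-level, and environment dimensions, whose values are not listed in the slides.

```python
from itertools import product

# Names taken from the system-model slides; this is a sub-space, not the full study.
scheduling_policies = ["ECT", "ECT-P", "FPF", "FPLT", "DFPLT", "STFR", "WQR"]
selection_policies = ["S-T", "S-BoT", "S-U-Prio", "S-U-T", "S-U-BoT", "S-U-GRR", "S-U-RR"]
architectures = ["sep-c", "csp", "fcondor"]

design_points = list(product(scheduling_policies, selection_policies, architectures))

print(len(design_points))  # 147
print(design_points[0])    # ('ECT', 'S-T', 'sep-c')
```

Each additional dimension multiplies this count, which is how a handful of options per dimension grows into millions of design points.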

18 Design Space Exploration [2/5] Experimental Setup
Simulator: DGSim [IosupETFL SC’07, IosupSE EuroPar’08]
System: DAS + Grid’5000 [Cappello & Bal CCGrid’07], >3,000 CPUs: relative perf
Metrics: Makespan; Normalized Schedule Length ~ speed-up
Workloads:
Real: DAS + Grid’5000
Realistic: system load 20-95% (from workload model)
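
The two metrics can be illustrated with a small calculation. The definitions used here are assumptions, since the slide gives only the names: makespan is taken as the time from the submission of a bag until its last task finishes, and normalized schedule length as the makespan divided by the sum of the task runtimes on a reference processor. The article has the exact definitions.

```python
def makespan(submit_time, finish_times):
    """Time from BoT submission until its last task finishes (assumed definition)."""
    return max(finish_times) - submit_time

def normalized_schedule_length(submit_time, finish_times, reference_runtimes):
    """Makespan divided by the total task runtime on a reference processor (assumed)."""
    return makespan(submit_time, finish_times) / sum(reference_runtimes)

# Hypothetical bag of three tasks submitted at t=10 s.
ms = makespan(10.0, [40.0, 70.0, 55.0])
nsl = normalized_schedule_length(10.0, [40.0, 70.0, 55.0], [30.0, 60.0, 30.0])
print(ms)   # 60.0
print(nsl)  # 0.5
```

Under this definition a value below 1 means the bag finished faster than running its tasks back to back on the reference processor, which is why the slide relates the metric to speed-up.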

19 Design Space Exploration [3/5] Selected Results A: Design Guidelines for Scheduling Policies
Influence of the information type:
(K,K): best balance between MS and NSL
(*,U),(U,*): surprisingly good (FPF) to surprisingly poor (WQR4x)
(*,H),(H,*): poor. Simple runtime predictors don’t work (see article)
Where to invest time? K -> H, K -> U: adapt for information type with lowest variation
[Figure labels: WQR4x, FPF]

20 Design Space Exploration [4/5] Selected Results B: Task Selection Only for Busy Systems
Not much difference until system load is over 50%.
For DAS + Grid’5000, no change of task selection policy.
[Figure labels: same performance, S-BoT, S-T]

21 Design Space Exploration [5/5] Selected Results C: Resource Management Architecture
Centralized, separated, or distributed?
Centralized is best [Note: job overhead not considered.]
Distributed: good for system load below 50%; over 50% it does not finish all tasks.

23 Conclusion
System Model = Resource Management Architecture + Task Selection Policy + Task Scheduling Policy
Information availability framework
BoT workload model
Design space exploration: the performance of bags-of-tasks
Future Work
Better predictors
(H,H) task scheduling policies
Policies by Task Information x Resource Information (each K, H, or U): ECT, FPLT; FPF; ECT-P; DFPLT, MQD; STFR; RR, WQR; ?

24 Thank you! Questions? Remarks? Observations?
Help building the Grid Workloads Archive:
Contact: [google “Iosup“]
Web sites:
http:// : VL-e project
http:// : PDS group articles & software

25 What About Other Workloads? (High Performance vs. High Throughput Computing)
Parallel jobs vs. bags-of-tasks
Workflows…
We need your traces! We work blindly without them.
For parallel jobs, the architecture counts much more [IosupETFL SC’07]
For workflows, we don’t know much about performance.