19 November 2013 Exploring Portfolio Scheduling for Long-term Execution of Scientific Workloads in IaaS Clouds Alexandru Iosup Delft University of Technology.

Slides:



Advertisements
Similar presentations
Challenge the future Delft University of Technology Overprovisioning for Performance Consistency in Grids Nezih Yigitbasi and Dick Epema Parallel.
Advertisements

Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters Presenter: Xiaoyu Sun.
Hadi Goudarzi and Massoud Pedram
Scheduling Criteria CPU utilization – keep the CPU as busy as possible (from 0% to 100%) Throughput – # of processes that complete their execution per.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Scheduling of parallel jobs in a heterogeneous grid environment Scheduling of parallel jobs in a heterogeneous grid environment Each site has a homogeneous.
Service Level Agreement based Allocation of Cluster Resources: Handling Penalty to Enhance Utility Chee Shin Yeo and Rajkumar Buyya Grid Computing and.
Proactive Prediction Models for Web Application Resource Provisioning in the Cloud _______________________________ Samuel A. Ajila & Bankole A. Akindele.
All Hands Meeting, 2006 Title: Grid Workflow Scheduling in WOSE (Workflow Optimisation Services for e- Science Applications) Authors: Yash Patel, Andrew.
XENMON: QOS MONITORING AND PERFORMANCE PROFILING TOOL Diwaker Gupta, Rob Gardner, Ludmila Cherkasova 1.
The Forgotten Factor: FACTS on Performance Evaluation and its Dependence on Workloads Dror Feitelson Hebrew University.
JSSPP-11, Boston, MA June 19, Pitfalls in Parallel Job Scheduling Evaluation Designing Parallel Operating Systems using Modern Interconnects Pitfalls.
Managing Risk of Inaccurate Runtime Estimates for Deadline Constrained Job Admission Control in Clusters Chee Shin Yeo and Rajkumar Buyya Grid Computing.
Efficient Autoscaling in the Cloud using Predictive Models for Workload Forecasting Roy, N., A. Dubey, and A. Gokhale 4th IEEE International Conference.
June 1, Inter-Operating Grids through Delegated MatchMaking Alexandru Iosup, Dick Epema PDS Group, TU Delft, NL Todd Tannenbaum, Matt Farrellee,
The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema ACM/IEEE Int’l.
Inter-Operating Grids through Delegated MatchMaking Alexandru Iosup, Dick Epema, Hashim Mohamed,Mathieu Jan, Ozan Sonmez 3 rd Grid Initiative Summer School,
1 A Performance Study of Grid Workflow Engines Alexandru Iosup and Dick Epema PDS Group Delft University of Technology The Netherlands Corina Stratan Parallel.
Fault-tolerant Adaptive Divisible Load Scheduling Xuan Lin, Sumanth J. V. Acknowledge: a few slides of DLT are from Thomas Robertazzi ’ s presentation.
Security-Driven Heuristics and A Fast Genetic Algorithm for Trusted Grid Job Scheduling Shanshan Song, Ricky Kwok, and Kai Hwang University of Southern.
Present by Chen, Ting-Wei Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt,
An Adaptive Multi-Objective Scheduling Selection Framework For Continuous Query Processing Timothy M. Sutherland Bradford Pielech Yali Zhu Luping Ding.
June 28, Resource and Test Management in Grids Rapid Prototyping in e-Science VL-e Workshop, Amsterdam, NL Dick Epema, Catalin Dumitrescu, Hashim.
July 13, “How are Real Grids Used?” The Analysis of Four Grid Traces and Its Implications IEEE Grid 2006 Alexandru Iosup, Catalin Dumitrescu, and.
Euro-Par 2008, Las Palmas, 27 August DGSim : Comparing Grid Resource Management Architectures Through Trace-Based Simulation Alexandru Iosup, Ozan.
Is 99% Utilization of a Supercomputer a Good Thing? Scheduling in Context: User Utility Functions Cynthia Bailey Lee Department of Computer Science and.
1 Efficient Management of Data Center Resources for Massively Multiplayer Online Games V. Nae, A. Iosup, S. Podlipnig, R. Prodan, D. Epema, T. Fahringer,
Host Load Prediction in a Google Compute Cloud with a Bayesian Model Sheng Di 1, Derrick Kondo 1, Walfredo Cirne 2 1 INRIA 2 Google.
August 28, Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing Berkeley, CA, USA Alexandru Iosup, Nezih Yigitbasi,
Abstract Cloud data center management is a key problem due to the numerous and heterogeneous strategies that can be applied, ranging from the VM placement.
Euro-Par 2007, Rennes, 29th August 1 The Characteristics and Performance of Groups of Jobs in Grids Alexandru Iosup, Mathieu Jan *, Ozan Sonmez and Dick.
Integrated Risk Analysis for a Commercial Computing Service Chee Shin Yeo and Rajkumar Buyya Grid Computing and Distributed Systems (GRIDS) Lab. Dept.
MobSched: An Optimizable Scheduler for Mobile Cloud Computing S. SindiaS. GaoB. Black A.LimV. D. AgrawalP. Agrawal Auburn University, Auburn, AL 45 th.
1 TUD-PDS A Periodic Portfolio Scheduler for Scientific Computing in the Data Center Kefeng Deng, Ruben Verboon, Kaijun Ren, and Alexandru Iosup Parallel.
1 Cloud Computing Research at TU Delft – A. Iosup Alexandru Iosup Parallel and Distributed Systems Group Delft University of Technology The Netherlands.
Marcos Dias de Assunção 1,2, Alexandre di Costanzo 1 and Rajkumar Buyya 1 1 Department of Computer Science and Software Engineering 2 National ICT Australia.
Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis [1] 4/24/2014 Presented by: Rakesh Kumar [1 ]
Resource Provisioning based on Lease Preemption in InterGrid Mohsen Amini Salehi, Bahman Javadi, Rajkumar Buyya Cloud Computing and Distributed Systems.
A Broker for Cost-efficient QoS aware Resource Allocation in EC2. Kurt Vermeersch Coordinator: Kurt Vanmechelen.
An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.
Meta Scheduling Sathish Vadhiyar Sources/Credits/Taken from: Papers listed in “References” slide.
임규찬. 1. Abstract 2. Introduction 3. Design Goals 4. Sample-Based Scheduling for Parallel Jobs 5. Implements.
1 Challenge the future KOALA-C: A Task Allocator for Integrated Multicluster and Multicloud Environments Presenter: Lipu Fei Authors: Lipu Fei, Bogdan.
Euro-Par, A Resource Allocation Approach for Supporting Time-Critical Applications in Grid Environments Qian Zhu and Gagan Agrawal Department of.
1 ROIA 2009 – CAMEO: Continuous Analytics for Massively Multiplayer Online Games CAMEO: Continuous Analytics for Massively Multiplayer Online Games Alexandru.
BOF: Megajobs Gracie: Grid Resource Virtualization and Customization Infrastructure How to execute hundreds of thousands tasks concurrently on distributed.
Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters
Cost-effective Cloud HPC Resource Provisioning by Building Semi-Elastic Virtual Clusters Shuangcheng Niu 1, Jidong Zhai 1, Xiaosong Ma 2,3 Xiongchao Tang.
VGreen: A System for Energy Efficient Manager in Virtualized Environments G. Dhiman, G Marchetti, T Rosing ISLPED 2009.
November 29, Our team: Undergrad Thomas de Ruiter, Anand Sawant, Ruben Verboon, … Grad Siqi Shen, Guo Yong, Nezih Yigitbasi Staff Henk Sips, Dick.
Rassul Ayani 1 Performance of parallel and distributed systems  What is the purpose of measurement?  To evaluate a system (or an architecture)  To compare.
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
Performance Analysis of Preemption-aware Scheduling in Multi-Cluster Grid Environments Mohsen Amini Salehi, Bahman Javadi, Rajkumar Buyya Cloud Computing.
Job Scheduling P. (Saday) Sadayappan Ohio State University.
QoPS: A QoS based Scheme for Parallel Job Scheduling M. IslamP. Balaji P. Sadayappan and D. K. Panda Computer and Information Science The Ohio State University.
Scheduling MPI Workflow Applications on Computing Grids Juemin Zhang, Waleed Meleis, and David Kaeli Electrical and Computer Engineering Department, Northeastern.
1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.
Jacob R. Lorch Microsoft Research
CPU Scheduling G.Anuradha
Module 5: CPU Scheduling
3: CPU Scheduling Basic Concepts Scheduling Criteria
P. (Saday) Sadayappan Ohio State University
A Characterization of Approaches to Parrallel Job Scheduling
Chapter 6: CPU Scheduling
The Performance of Big Data Workloads in Cloud Datacenters
Operating System , Fall 2000 EA101 W 9:00-10:00 F 9:00-11:00
Mihai Neacşu, BSc. Prof.dr.eng. Alexandru Iosup Ir. Laurens Versluis
Module 5: CPU Scheduling
MagnaData: Scheduling Complex Workflows with Non-Functional Requirements in Datacenters Laurens Versluis Massivizing Computer research.
Module 5: CPU Scheduling
Presentation transcript:

19 November 2013 Exploring Portfolio Scheduling for Long-term Execution of Scientific Workloads in IaaS Clouds Alexandru Iosup Delft University of Technology Delft, the Netherlands Kefeng Deng, Junqiang Song, and Kaijun Ren National University of Defense Technology Changsha, China SC 2013, Denver, USA I am an Ad! Dutch Research Center #3322

2 Portfolio Scheduling What do These Workloads Have in Common? Hint: ok, they’re scientific and bursty, but also … Sources: The Parallel Workloads Archive The Grid Workloads Archive T1. KTH SP2T2. SDSC SP2 T3. DAS fs0T4. LPC EGEE

3 Portfolio Scheduling Old scheduling aspects Workloads evolve over time and exhibit periods of distinct characteristics No one-size-fits-all policy: hundreds exist, each good for specific conditions Data centers increasingly popular (also not new) Constant deployment since mid-1990s Users moving their computation to IaaS-cloud data centers Consolidation efforts in mid- and large-scale companies New scheduling aspects New workloads New data center architectures New cost models Developing a scheduling policy is risky and ephemeral Selecting a scheduling policy for your data center is difficult Combining the strengths of multiple scheduling policies is … Why Portfolio Scheduling?

4 Portfolio Scheduling What is Portfolio Scheduling? In a Nutshell, for Data Centers Create a set of scheduling policies Resource provisioning and allocation policies, in this work Online selection of the active policy, at important moments Periodic selection, in this work Same principle for other changes: pricing model, system, …

5 Portfolio Scheduling Agenda 1.Why portfolio scheduling? 2.What is portfolio scheduling? In a nutshell… 3.Our periodic portfolio scheduler for the data center 1.Generic process 2.A portfolio scheduler architecture, in practice 3.Time-constrained simulation 4.Experimental results How useful is our portfolio scheduler? How does it work in practice? 5.Our ongoing work on portfolio scheduling 6.How novel is our portfolio scheduler? A comparison with related work 7.ConclusionConclusion Deng, Verboon, Ren, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13. Please read

6 Portfolio Scheduling What is Portfolio Scheduling? The Generic Process CreationSelection ReflectionApplication Which policies to include? Which policy to activate? Explain to sysadmin Which resources? What to log? Validate selection immediately Which changes to the portfolio?

7 Portfolio Scheduling The Portfolio Scheduler Selection Creation Application Reflection CreationSelection ReflectionApplication

8 Portfolio Scheduling The Creation of a Policy Portfolio (1) Creation Deng, Verboon, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Can add any policy here. The portfolio scheduler ideally combines strengths by always selecting well.

9 Portfolio Scheduling The Creation of a Policy Portfolio (2) Creation

10 Portfolio Scheduling Other Ingredients of a Portfolio Scheduler Runtime Predictor = Tsafrir et al. Uses the average runtime of the two most recently submitted and completed jobs from the same user as the predicted runtime Online Simulator and Policy Selector For each policy, simulate scheduling all the queued jobs, then output an utility value (score) Select the policy with the highest score for real-world operation Q: Can you see a problem with running the simulator for each policy?

11 Portfolio Scheduling Time-Constrained Simulation (1) Given N: Total number of policies and ∆: Constrained simulation time Approach (for uniform execution time of per-policy simulation) Three classes of policies, Smart, Poor, Stale (not recently explored) Explore speculatively policies from the three classes To be simulated Simulated & Sorted by score K=‖Q‖=10 λ=0.6 ‖Smart‖=λK Cannot simulate now

12 Portfolio Scheduling Time-Constrained Simulation (2) The second round of policy simulation; enter a stable state A policy in Poor (historically performing badly) can still deliver excellent performance in the future! Q: Why keep running policies from Poor (historically a bad performer)?

13 Portfolio Scheduling Agenda 1.Why portfolio scheduling? 2.What is portfolio scheduling? In a nutshell… 3.Our periodic portfolio scheduler for the data center 1.Generic process 2.A portfolio scheduler architecture, in practice 3.Time-constrained simulation 4.Experimental results How useful is our portfolio scheduler? How does it work in practice? 5.Our ongoing work on portfolio scheduling 6.How novel is our portfolio scheduler? A comparison with related work 7.ConclusionConclusion Deng, Verboon, Ren, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13. Please read

14 Portfolio Scheduling Performance Evaluation Simulation Software: DGSim Simulation Environment: A virtual cluster comprised of homogeneous VM instances The maximum number of concurrent VMs that can be rented is seconds delay for VM instance acquisition and booting Performance metrics: Average bounded job slowdown Charged cost: runtime of rented VMs (rounded up to the next hour) Utility score: (Default, κ=100, α=1, β=1) Experimental Setup (1): the system Iosup et al., Performance Analysis of Cloud Computing Services for Many Tasks Scientific Computing, (IEEE TPDS 2011).

15 Portfolio Scheduling Performance Evaluation Experimental Setup (2): workload traces Four traces from the Parallel Workloads Archive (PWA) Use from these traces jobs requesting up to 64 processors T1. KTH SP2T2. SDSC SP2T3. DAS fs0T4. LPC EGEE

16 Portfolio Scheduling Performance Evaluation 1) Effect of Portfolio Scheduling (1) Portfolio scheduling is 8%, 11%, 45%, and 30% better than the best constituent policy A portfolio scheduler can be better than any of its constituent policies Q: What can prevent a portfolio scheduler from being better than any of its constituent policies?

17 Portfolio Scheduling Performance Evaluation 1) Effect of Portfolio Scheduling (2) For KTH-SP2 and SDSC-SP2, ODB and ODX are dominant ~ many long jobs For DAS-fs0 and LPC-EGEE, ODB and ODE are dominant ~ short jobs, load Not performance-related, but: A portfolio scheduler can support each decision with realistic data. Q: Can our sysadmin do this? Can we? (Rhetorical)

18 Portfolio Scheduling Performance Evaluation 1) Effect of Portfolio Scheduling (2) Job selection policies such as UNICEF and LXF that favor short jobs have the best performance Q: How well do you think a single (provisioning, job selection, VM selection) policy would perform? Will it be dominant? (Rhetorical) Q: What prevents a portfolio scheduler from being better than any of its constituent policies?

19 Portfolio Scheduling Performance Evaluation 5) Impact of Simulation Time Constraint Expectedly, having more time to simulate leads in general to better results Here, sufficient to simulate 10—20 policies (nearly all dominant policies selected) Job slowdown shows different sensitivity (please read article) The charged cost exhibits a similar trend (please read article)

20 Portfolio Scheduling Agenda 1.Why portfolio scheduling? 2.What is portfolio scheduling? In a nutshell… 3.Our periodic portfolio scheduler for the data center 1.Generic process 2.A portfolio scheduler architecture, in practice 3.Time-constrained simulation 4.Experimental results 5.Our ongoing work on portfolio scheduling Portfolio scheduling for different workloads and constituent policies 6.How novel is our portfolio scheduler? A comparison with related work 7.ConclusionConclusion Deng, Verboon, Ren, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13. Please read

21 Portfolio Scheduling Portfolio Scheduling for Online Gaming and Scientific Workloads CoH = Cloud-based, online, Hybrid scheduling Intuition: keep rental cost low by finding good mix of machine configurations and billing options Main idea: portfolio scheduler = run both solver of an Integer Programming Problem and various heuristics, then pick best schedule at deadline Additional feature: Can use reserved cloud instances Promising early results, for Gaming (and scientific) workloads Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13.

22 Portfolio Scheduling Agenda 1.Why portfolio scheduling? 2.What is portfolio scheduling? In a nutshell… 3.Our periodic portfolio scheduler for the data center 1.Generic process 2.A portfolio scheduler architecture, in practice 3.Time-constrained simulation 4.Experimental results 5.Our ongoing work on portfolio scheduling 6.How novel is our portfolio scheduler? A comparison with related work 7.ConclusionConclusion Deng, Verboon, Ren, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13. Please read

23 Portfolio Scheduling Related Work Computational portfolio design Huberman’97, Streeter et al.’07 ’12, Bougeret’09, Goldman’12, Gagliolo et al.’06 ’11, Feitelson et al. JSSPP’13 (Intel’s clusters) We focus on dynamic, scientific workloads We use an utility function that combines slowdown and utilization Modern portfolio theory in finance Markowitz’52, Magill and Constantinides’76, Black and Scholes’76 Dynamic problem set vs fixed problem set Different workloads and utility functions Selection and Application very different General scheduling Rice’76: algorithm selection problem Hyper-scheduling, meta-scheduling The learning rule may defeat the purpose, due to historical bias to dominant policy Different processes (esp. Selection, Reflection)

24 Portfolio Scheduling Agenda 1.Why portfolio scheduling? 2.What is portfolio scheduling? In a nutshell… 3.Our periodic portfolio scheduler for the data center 1.Generic process 2.A portfolio scheduler architecture, in practice 3.Time-constrained simulation 4.Experimental results 5.Our ongoing work on portfolio scheduling 6.How novel is our portfolio scheduler? A comparison with related work 7.Conclusion Deng, Verboon, Ren, Iosup. A Periodic Portfolio Scheduler for Scientific Computing in the Data Center. JSSPP’13. Shen, Deng, Iosup, and Epema. Scheduling Jobs in the Cloud Using On-demand and Reserved Instances, EuroPar’13. Please read

25 Portfolio Scheduling Conclusion Take-Home Message Portfolio Scheduling = set of scheduling policies, online selection Creation, Selection, Application, Reflection Time constraints, here in Selection step Periodic portfolio scheduler for data centers Explored Creation, Selection, simple Reflection Portfolio scheduler in general better than its constituent policies Good results for real traces (also for synthetic) Easy to setup, easy to trust JSSPP’13, EuroPar’13, SC’13, (future) new workload types and constituent policies + there is still much to explore about process Reality Check (future): we will apply it in our DAS multi-cluster. How about your system? Alexandru Iosup #3322

26 Portfolio Scheduling

27 Portfolio Scheduling Information PDS group home page and publications database: KOALA web site: Grid Workloads Archive (GWA): gwa.ewi.tudelft.nl Failure Trace Archive (FTA): fta.inria.fr

28 Portfolio Scheduling Time-Constrained Simulation (3) The third round of policy simulation A policy in Poor (historically performing badly) can still deliver excellent performance in the future! Q: Why keep running policies from Poor (historically a bad performer)?

29 Portfolio Scheduling Performance Evaluation 1) Effect of Portfolio Scheduling (1) With accurate runtime information ODA-* is the best of all policies using ODA VM provisioning policy For utility, portfolio scheduling is 8%, 11%, 45%, and 30% better than the best constituent policy

30 Portfolio Scheduling Performance Evaluation 1) Effect of Portfolio Scheduling (2) For KTH-SP2 and SDSC-SP2, ODB and ODX are the dominant policies (a result of many long jobs) For DAS-fs0 and LPC-EGEE, ODB and ODE are the dominant policies (as the majority of the jobs are very short) Job selection policies such as UNICEF and LXF that favor short jobs have the best performance

31 Portfolio Scheduling Performance Evaluation 2) Effect of Utility Function (α—Cost-Efficiency) Keep the task-urgency factor β = 1 and change the cost-efficiency factor α from 1 to 4 (the extreme setting β = 0—ignoring the job slowdown)

32 Portfolio Scheduling Performance Evaluation 2) Effect of Utility Function (β—Task-Urgency) Vary the task-urgency factor, in the same way as cost-efficiency factor Suggestion: instead of putting effort to find sophisticated algorithms to reduce the cost, it is more worthwhile to find methods to improve the performance metrics that users are interested in, such as job slowdown and wait time.

33 Portfolio Scheduling Performance Evaluation 3) Impact of Runtime Prediction Inaccuracy Portfolio scheduling is not sensitive to the inaccurate runtime estimation Policies using job runtime are adversely affected by inaccuracy Using Predicted Runtime Using User Estimated Runtime

34 Portfolio Scheduling Performance Evaluation 4) Impact of Portfolio Selection Period (1) Portfolio selection period: the interval between two consecutive selection processes (multiple times of the scheduling periods, which is 20 seconds) The selection period has an insignificant impact on job slowdown (<10%) The impact on charged cost differs (50% for DAS2-fs0; 15% for LPC-EGEE)

35 Portfolio Scheduling Performance Evaluation 4) Impact of Portfolio Selection Period (2) Portfolio selection period: the interval between two consecutive selection processes (multiple times of the scheduling periods, which is 20 seconds) The utility has an opposite trend in comparison with the charged cost The # of invocations decreases near-exponentially with the selection period Suggestion: 8 for KTH-SP2 and SDSC-SP2; 2 for LPC-EGEE; 1 for DAS2-fs0

36 Portfolio Scheduling Performance Evaluation 5) Impact of Simulation Time Constraint Add a 10 milliseconds overhead for each scheduling policy Job slowdown shows different sensitivity The charged cost exhibits a similar trend According to the utility, simulating 1/3 of policies (60) is sufficient, as it covers almost all the dominant policies