Predicting Queue Waiting Time in Batch Controlled Systems Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli Computer Science Department University.

Presentation transcript:

Predicting Queue Waiting Time in Batch Controlled Systems
Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli
Computer Science Department, University of California, Santa Barbara

Problem: Predicting Delay in Batch Queues
- Time in queue is experienced as application delay
- Sounds like an easy problem, but
  - Distribution of load from users is a matter of some debate
  - Scheduling policy is partially hidden
  - Sites need to change the policies dynamically and without warning
  - Job execution times are difficult to predict
- Much research in this area over the past 20 years, but few solutions
  - Current commercial systems provide high-variance estimates
  - Most sites simply disable this feature

Hard Problem

For Scheduling: It’s all about the big Q
- Predictions of the form
  - “What is the maximum time my job will wait with X% certainty?”
  - “What is the minimum time my job will wait with X% certainty?”
- Requires two estimates if certainty is to be quantified
  - Estimate the (1-X) quantile for the distribution of availability => Q_x
  - Estimate the upper or lower X% confidence bound on the statistic Q_x => Q_(x,lb)
- If the estimates are unbiased and the distribution is stationary, future availability duration will be larger than Q_(x,lb) X% of the time, guaranteed
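
The two-estimate idea on this slide can be sketched with the standard distribution-free order-statistic argument: the number of samples falling below the true q-quantile is Binomial(n, q), so a suitably chosen order statistic bounds that quantile with a stated confidence. This is a minimal sketch under that textbook assumption, not necessarily the estimator used in the talk; the function names and the `waits` input are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p), summed directly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound_on_quantile(waits, q=0.95, confidence=0.95):
    """Return an observed wait that is >= the true q-quantile with at least
    the requested confidence, or None if the history is too short.

    Assumes independent samples from a stationary, continuous distribution.
    The k-th smallest sample is >= the q-quantile exactly when at most k-1
    samples fall below it, which has probability P(Binomial(n, q) <= k-1).
    """
    xs = sorted(waits)
    n = len(xs)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, q) >= confidence:
            return xs[k - 1]  # smallest order statistic that qualifies
    return None


if __name__ == "__main__":
    history = list(range(1, 101))  # hypothetical wait times in seconds
    print(upper_bound_on_quantile(history))
```

With q = 0.95 and 95% confidence, fewer than 59 samples cannot support a bound under this argument, so the function returns None for such short histories.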

New Predictive Methodology
- New quantile estimator based on the Binomial distribution
  - Requires a carefully engineered numerical system to deal with large-scale combinatorics
- New changepoint detector
  - Binomial method in a time series context is difficult
  - Need a system to determine
    - Stationary regions in the data
    - Minimum statistically meaningful history in each region
- New clustering methodology (coming soon)
  - More accurate estimates are possible if predictions are made from jobs with similar characteristics
  - Takes dynamic policy changes into account more effectively
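
One way to read "minimum statistically meaningful history" and "changepoint detector" is sketched below. Both functions are illustrative assumptions rather than the method from the talk: the first computes the smallest history for which the largest sample bounds the q-quantile at a given confidence (from 1 - q^n >= confidence), and the second flags a possible changepoint when several consecutive observations exceed the current bound. The names and the `run_length` threshold are hypothetical.

```python
import math


def minimum_history(q=0.95, confidence=0.95):
    """Smallest n with 1 - q**n >= confidence, i.e. the shortest history for
    which the maximum of n independent samples is an upper bound on the
    q-quantile at the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(q))


def first_changepoint(observed, bound, run_length=3):
    """Index at which `run_length` consecutive observations have exceeded
    `bound`, or None. A long run of misses is unlikely if the bound is still
    valid, so it hints that the underlying distribution has shifted.
    `run_length` is an illustrative threshold, not a calibrated value."""
    run = 0
    for i, wait in enumerate(observed):
        run = run + 1 if wait > bound else 0
        if run >= run_length:
            return i
    return None


if __name__ == "__main__":
    print(minimum_history())                             # samples needed
    print(first_changepoint([1, 2, 9, 9, 9, 1], bound=5))
```

After a flagged changepoint, a predictor could discard older history and keep only the most recent `minimum_history()` observations; that trimming policy is likewise an assumption here.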

Ten Years of Supercomputing

See it In Action

Predicting Things Upside Down
- Deadline scheduling: My job needs to start in the next X seconds for the results to be meaningful.
  - Amitava Mujumdar, Tharaka Devaditha, Adam Birnbaum (SDSC)
    - Need to run a 4 minute image reconstruction that completes in the next 8 minutes
- Given a
  - Machine
  - Queue
  - Processor count
  - Run time
  - Deadline
- What is the probability that a job will meet the deadline?
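
The inverted question can be sketched as follows: a job meets its deadline if its queue wait is at most the deadline minus the run time, so the fraction of comparable historical waits within that slack gives a point estimate, and a binomial argument gives a conservative lower bound on it. This is a hedged illustration assuming independent, stationary waits for the same machine, queue and processor count; the function names and the sample history are hypothetical and not taken from the talk.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p), summed directly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def deadline_probability(waits, run_time, deadline, confidence=0.95):
    """Return (point_estimate, lower_bound) for P(wait <= deadline - run_time).

    The lower bound is the largest q for which the k-th smallest wait (the
    largest one still within the slack) bounds the q-quantile from above
    with the requested confidence. It is found by bisection on q.
    """
    if not waits:
        raise ValueError("need at least one historical wait")
    slack = deadline - run_time
    n = len(waits)
    k = sum(1 for w in waits if w <= slack)  # past jobs that would have made it
    point = k / n
    if k == 0:
        return point, 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k - 1, n, mid) >= confidence:
            lo = mid  # bound still holds at this quantile, try a larger one
        else:
            hi = mid
    return point, lo


if __name__ == "__main__":
    # Slide scenario: a 4 minute run that must complete within 8 minutes.
    history = [60] * 59  # hypothetical waits in seconds
    print(deadline_probability(history, run_time=240, deadline=480))
```

Reporting the lower bound rather than the point estimate keeps the answer conservative when the history is short.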

How Well Does it Work with an Application?
[Figure: EMAN workflow diagram; labels: Electron Micrograph, Particles, Preliminary 3D model, Refine, Final 3D model]
EMAN has been developed at Baylor College of Medicine by the research group of Wah Chiu and Steven Ludtke.

VGrADS EMAN Batch Scheduler
- EMAN emulator
  - Run the EMAN scheduler to determine a job launch sequence
  - Launch the jobs by submitting them to the queues specified by the scheduler
  - When an EMAN job acquires the processors, exit and “sleep” the emulator for the predicted execution time
    - Saves system allocation time
  - Record the overall makespan
- Experiment:
  - Chicago TeraGrid, SDSC TeraGrid, NCSA TeraGrid and CNSI Dell at UCSB
  - 57 separate runs
- Results: mean observed and mean predicted makespans are not significantly different at alpha = 0.05

95% Upper Bound on Median

Clustering
- RMS ratio of the Binomial method with clustering to the Binomial method without clustering
  - Both achieve 95% correctness
  - Measures “tightness” improvement through clustering
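
A tightness ratio of this kind could be computed as in the sketch below. The slide does not state exactly which error is fed into the RMS, so this assumes it is the gap between each predicted bound and the wait actually observed; the function names and sample numbers are hypothetical.

```python
import math


def rms_error(bounds, actual):
    """Root-mean-square gap between predicted bounds and observed waits."""
    if len(bounds) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((b - a) ** 2 for b, a in zip(bounds, actual)) / len(actual))


def rms_ratio(clustered_bounds, unclustered_bounds, actual):
    """RMS error with clustering divided by RMS error without it.
    A value below 1 means clustering gives tighter bounds."""
    return rms_error(clustered_bounds, actual) / rms_error(unclustered_bounds, actual)


if __name__ == "__main__":
    actual = [10, 20]        # hypothetical observed waits
    clustered = [13, 24]     # hypothetical bounds with clustering
    unclustered = [16, 28]   # hypothetical bounds without clustering
    print(rms_ratio(clustered, unclustered, actual))
```

The ratio is only meaningful when both methods reach the same correctness level, which is why the slide notes that both achieve 95%.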

Batch Queue Prediction for Grid Systems
- A good point-valued prediction remains elusive
- Grid users certainly can use bounds instead
  - Early job completion is okay, typically
  - Bounds give a good intuitive feel for which queue will be quickest
- Automatic schedulers are coming
  - EMAN doesn’t use ranges…it should
  - VGrADS is developing new schedulers (workflow)
  - NEESGrid and ISI are in development (workflow)
  - Large-scale sensor network simulation

What’s Next?
- Open questions:
  - Does the availability of predictions affect load?
    - Rolling out production tools now and we will be monitoring
    - Job cancellation does not affect results
  - If it does, will allocations be stable?
    - Grid economies
- Virtual resource reservations (VGrADS)
  - Conditional prediction and resubmission
  - Virtual Cluster??
- Thanks
  - NSF SCI, VGrADS, SDSC, TACC
- Us: