Automatic Statistical Evaluation of Resources for Condor
Daniel Nurmi, John Brevik, Rich Wolski
University of California, Santa Barbara


Motivation
Distributed System/Grid applications execute on wide variety of architectures
  – Clusters
  – Large SMP systems
  – Interactive workstation networks
Condor provides vast, easily accessible resource pool, but is best suited to Condor applications

Condor As Resource Pool
Provides many required features
  – Resource manager
  – Account manager
  – Scheduler
Resource availability very dynamic
  – Controlled by large number of variables including overall load, user priority, occupancy time, owner revocation, etc.
  – Resources free up and drop out frequently
Long running apps must be checkpointed

Checkpointing Schemes
Condor checkpointing
  – Standard Universe uses system call liftoff
  – Core file is used to capture process state for restart
Application-level checkpointing
  – Application developer must generate checkpoints from within the application
  – Disk storage may be limited (none available locally)

Condor Checkpointing
Checkpointing is invisible to application developer, but…
  – No threads
  – No forking
  – Single architecture support
  – Must use compiler supported by Condor (e.g. no GMP)

Application-Level Checkpointing
No support from Condor for checkpointing in Vanilla universe
  – Left to the application
No restrictions on system calls or compilation
  – If it compiles it will run
No local disk storage
  – Checkpoints must traverse the network to a machine with stable storage
Checkpoint schedule major performance concern

Checkpoint Scheduling
Given a long running application and a volatile resource, determine the amount of time to perform useful computation between checkpoints such that the overhead of checkpointing is minimized
Well studied
  – K. M. Chandy, C. V. Ramamoorthy. Rollback and recovery strategies for computer systems.
  – M. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson. A survey of rollback-recovery protocols in message passing systems.
  – A. Duda. The effects of checkpointing on program execution time.
  – N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme.
We use the Markov-model-based approach proposed by N. H. Vaidya.
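To make the trade-off concrete, here is a minimal sketch of interval selection under the simplest assumption, exponentially distributed availability. It is not Vaidya's Markov model and not the software described in this talk. It uses the standard closed form for the expected time to complete an uninterrupted stretch of length x at a constant failure rate, then searches for the work interval with the lowest overhead. All function names and numeric values are hypothetical.

```python
import math


def expected_segment_time(work, checkpoint_cost, recovery_cost, failure_rate):
    """Expected wall-clock time to finish `work` seconds of computation plus
    one checkpoint, assuming failures arrive at a constant rate (exponential
    availability) and each failure forces a recovery followed by a retry of
    the whole segment."""
    span = work + checkpoint_cost
    return (math.exp(failure_rate * recovery_cost)
            * math.expm1(failure_rate * span) / failure_rate)


def overhead_ratio(work, checkpoint_cost, recovery_cost, failure_rate):
    """Extra expected time per second of useful work."""
    expected = expected_segment_time(work, checkpoint_cost, recovery_cost, failure_rate)
    return expected / work - 1.0


def best_interval(checkpoint_cost, recovery_cost, mean_uptime, max_interval=20000):
    """Grid search in 1-second steps for the work interval with least overhead."""
    rate = 1.0 / mean_uptime
    return min(range(1, max_interval + 1),
               key=lambda t: overhead_ratio(t, checkpoint_cost, recovery_cost, rate))


# Hypothetical numbers: 50 s to write a checkpoint, 50 s to recover, and a
# machine that stays available for 10,000 s on average.
interval = best_interval(checkpoint_cost=50, recovery_cost=50, mean_uptime=10_000)
overhead = overhead_ratio(interval, 50, 50, 1 / 10_000)
print(interval, overhead)
```

For small checkpoint costs the result lands close to the familiar rule of thumb sqrt(2 × checkpoint cost × mean uptime). A non-exponential availability distribution changes the expected-time formula, which is why the choice of distribution matters for the schedule.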

Checkpoint Interval Selection
Model requires statistical distribution describing resource availability
  – Vaidya, and later Plank, assume exponential distributions

What is the Availability Distribution?
Weibull
  – T. Heath, P. M. Martin, T. D. Nguyen. The shape of failure.
  – J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked Windows NT system field failure data analysis.
Hyperexponential
  – M. Mutka, M. Livny. Profiling workstations’ available capacity for remote execution.
  – I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh. Measurement-based evaluation of operating system fault tolerance.
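For reference, the textbook density functions of the candidate availability distributions are given below, for x ≥ 0. The symbol names are a notational choice made here and do not come from the slides.

```latex
% Exponential with rate \lambda > 0
\[ f_{\mathrm{exp}}(x) = \lambda e^{-\lambda x} \]

% Weibull with shape \alpha > 0 and scale \beta > 0
\[ f_{\mathrm{W}}(x) = \frac{\alpha}{\beta}
   \left(\frac{x}{\beta}\right)^{\alpha - 1} e^{-(x/\beta)^{\alpha}} \]

% k-phase hyperexponential with p_i \ge 0, \sum_{i=1}^{k} p_i = 1, \lambda_i > 0
\[ f_{\mathrm{H}}(x) = \sum_{i=1}^{k} p_i \lambda_i e^{-\lambda_i x} \]
```

The Weibull reduces to the exponential when α = 1, and the hyperexponential reduces to it when k = 1.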

Generating Statistical Models
Network Weather Service monitoring of Condor pool over 2 year period
  – 708 machines observed
Automatic model fitting software
  – Takes as input distribution type and historical Condor uptime values
  – Outputs best fit parameters for given distribution
Design experiment to test overall work efficiency of checkpointing scheme using four different distributions
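As an illustration of what "distribution type plus uptime values in, best-fit parameters out" can look like, the sketch below computes maximum-likelihood estimates for the exponential and Weibull cases using only the Python standard library. It is a hypothetical stand-in, not the fitting software described in the talk, and it does not cover hyperexponential fitting. The function names and the synthetic data are assumptions.

```python
import math
import random


def fit_exponential(uptimes):
    """MLE for an exponential distribution: rate = 1 / sample mean."""
    return len(uptimes) / sum(uptimes)


def fit_weibull(uptimes, lo=0.01, hi=50.0, iterations=100):
    """MLE for a two-parameter Weibull; returns (shape, scale).

    Assumes all uptimes are positive and not all identical. The shape solves
    the usual likelihood equation, found here by bisection on [lo, hi]."""
    n = len(uptimes)
    largest = max(uptimes)
    # The shape equation is scale-invariant, so rescale into (0, 1] to keep
    # x ** shape from overflowing.
    scaled = [x / largest for x in uptimes]
    mean_log = sum(math.log(x) for x in scaled) / n

    def score(shape):
        powers = [x ** shape for x in scaled]
        weighted = sum(p * math.log(x) for p, x in zip(powers, scaled))
        return weighted / sum(powers) - 1.0 / shape - mean_log

    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if score(mid) < 0:      # score increases with shape: root lies above
            lo = mid
        else:
            hi = mid
    shape = (lo + hi) / 2.0
    scale = largest * (sum(x ** shape for x in scaled) / n) ** (1.0 / shape)
    return shape, scale


# Synthetic "uptime" data (hypothetical): Weibull with shape 0.7, scale 1000 s.
rng = random.Random(42)
sample = [x for x in (rng.weibullvariate(1000.0, 0.7) for _ in range(5000)) if x > 0]
shape, scale = fit_weibull(sample)
print(shape, scale, fit_exponential(sample))
```

A shape estimate below 1 indicates that a machine becomes less likely to drop out the longer it has already been available, which an exponential model cannot express.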

Checkpoint Experiment
Test application submitted to Condor and when it runs…
  – Sends resource information to central server
  – Model fitting software estimates model parameters using MLE or EMpht methods
  – Checkpoint scheduler solves the Markov model using tested distribution
  – Application uses schedule, checkpoints its memory, and records performance
Test different distributions
Checkpointing to disks at UCSB

Empirical Results: Execution Time

Empirical Results: Network Utilization

Moral
We can determine optimal checkpoint schedules for Condor jobs automatically
  – Execution performance impact is about the same until checkpoint costs get big
  – Network load improvements are substantial (particularly useful in wide area)
Software is real, but non-NWS parts are in prototype
  – We want to bring them into the NWS release cycle
Paper in submission to HPDC

What’s Next
Better Models
  – Brevik Method: we can predict the percentiles of availability with provable confidence bounds using less data
  – Can’t use it (yet) for Markov model
Better Utility
  – Provide information to Condor itself
  – Automatic fault and anomaly detection
Better Information for users
  – Publish availability predictions in the matchmaker
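The slides do not spell out the Brevik Method, so the sketch below shows only a generic, distribution-free way to bound a percentile with a stated confidence, using order statistics and a binomial tail. It may differ from the method the authors have in mind. It assumes independent, identically distributed observations from a continuous distribution, and all names are hypothetical.

```python
import math


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def lower_bound_rank(n, quantile, confidence):
    """Largest 1-based rank k such that the k-th smallest of n observations
    lies at or below the true `quantile` with probability >= `confidence`.
    Returns None when n is too small for any rank to qualify."""
    best = None
    for k in range(1, n + 1):
        # The k-th smallest value is below the quantile exactly when at least
        # k observations fall below it, a Binomial(n, quantile) event.
        if binomial_tail(n, quantile, k) >= confidence:
            best = k
        else:
            break       # the tail only shrinks as k grows
    return best


def availability_lower_bound(uptimes, quantile=0.05, confidence=0.95):
    """Value that, with the given confidence, does not exceed the requested
    percentile of the availability distribution."""
    rank = lower_bound_rank(len(uptimes), quantile, confidence)
    return None if rank is None else sorted(uptimes)[rank - 1]


print(lower_bound_rank(59, 0.05, 0.95), lower_bound_rank(58, 0.05, 0.95))
```

With these settings, 59 observations are the minimum for which any bound exists, which gives a sense of how little history such an approach needs compared with fitting a full parametric model.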

Thanks
Rich Wolski
John Brevik
Miron Livny
NSF Next Generation Software program
VGrADS Project (NSF ITR, Ken Kennedy, PI)
NSF Middleware Initiative (NWS)
Questions?

Simulation Results: Execution Time

Simulation Results: Network Utilization