Scheduling Generic Parallel Applications – Meta-scheduling
Sathish Vadhiyar
Sources/Credits/Taken from: Papers listed in the "References" slides

Scheduling Architectures
- Centralized schedulers
  - Single-site scheduling – a job does not span across sites
  - Multi-site scheduling – a job may span across sites
- Hierarchical structures – a central scheduler (metascheduler) for global scheduling and local schedulers on individual sites
- Decentralized scheduling – distributed schedulers interact, exchange information, and submit jobs to remote systems
  - Direct communication – a local scheduler directly contacts remote schedulers and transfers some of its jobs
  - Communication via a central job pool – jobs that cannot be executed immediately are pushed to a central pool; other local schedulers pull jobs out of the pool

Various Scheduling Architectures

Metascheduler across MPPs – Types
- Centralized
  - A metascheduler and local dispatchers
  - Jobs are submitted to the metascheduler
- Hierarchical
  - Combination of central and local schedulers
  - Jobs are submitted to the metascheduler
  - The metascheduler sends each job to the site with the earliest expected start time
  - Local schedulers can follow their own policies
- Distributed
  - Each site has a metascheduler and a local scheduler
  - Jobs are submitted to the local metascheduler
  - Jobs can be transferred to the site with the lowest load

Evaluation of Schemes
- Centralized
  1. Global knowledge of all resources – hence optimized schedules
  2. Can become a bottleneck for a large number of resources and jobs
  3. May take time to transfer jobs from the metascheduler to local schedulers – needs strategic positioning of the metascheduler
- Hierarchical
  1. Medium-level overhead
  2. Sub-optimal schedules
  3. Still needs strategic positioning of the central scheduler
- Distributed
  1. No bottleneck – workload evenly distributed
  2. Needs all-to-all connections between MPPs

Evaluation of Various Scheduling Architectures
- Experiments to evaluate slowdowns in the 3 schemes
- Based on an actual trace from a supercomputer centre – a 5000-job set
- 4 sites were simulated – 2 with the same load as the trace, the other 2 with run times multiplied by 1.7
- FCFS with EASY backfilling was used
- slowdown = (wait_time + run_time) / run_time
- 2 more schemes:
  - Independent – local schedulers act independently, i.e. the sites are not connected
  - United – the processors of all sites are combined to form a single site
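FCFS with EASY backfilling underlies all of these experiments. Below is a minimal sketch, in C, of the core EASY admission test; the function and parameter names are assumptions introduced here for illustration, not code from the referenced papers. Under EASY, the job at the head of the queue holds a reservation, and a later job may start early only if it does not delay that reservation.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int procs;        /* processors requested */
    double runtime;   /* (estimated) run time in seconds */
} job_t;

/* A later job may be backfilled now iff it can start on the currently idle
   processors and does not delay the head job's reservation: either it
   finishes before the reservation starts, or it uses only processors that
   will still be free when the reservation starts. */
bool can_backfill(const job_t *cand, double now, int idle_procs,
                  double reserve_time, int extra_procs_at_reserve)
{
    if (cand->procs > idle_procs)
        return false;                              /* cannot start right away */
    if (now + cand->runtime <= reserve_time)
        return true;                               /* done before the reservation */
    return cand->procs <= extra_procs_at_reserve;  /* leaves reserved processors alone */
}

int main(void)
{
    job_t j = { .procs = 4, .runtime = 600.0 };
    printf("backfill? %d\n",
           can_backfill(&j, /*now=*/0.0, /*idle_procs=*/8,
                        /*reserve_time=*/300.0, /*extra_procs_at_reserve=*/4));
    return 0;
}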

Results

Observations
1. Centralized and hierarchical performed slightly better than united
   a. Compared to hierarchical, in united scheduling decisions have to be made for all jobs and all resources – the overhead, and hence the wait time, is high
   b. Comparing united and centralized:
      i. 4 categories of jobs corresponding to the 4 combinations of 2 parameters – execution time (short, long) and number of resources requested (narrow, wide)
      ii. Usually a larger number of long narrow jobs than short wide jobs
      iii. Why are centralized and hierarchical better than united?
2. Distributed performed poorly
   a. Short narrow jobs incurred more slowdown
   b. Short narrow jobs are large in number and the best candidates for backfilling
   c. Backfilling dynamics are complex
   d. A site with a light average load may not always be the best choice – short narrow (SN) jobs may find the earliest holes in a heavily loaded site

Newly Proposed Models
- K-distributed model
  - Distributed scheme in which the local metascheduler distributes each job to the k least loaded sites
  - When the job starts on one site, a notification is sent to the local metascheduler, which in turn asks the other k-1 schedulers to dequeue it
- K-Dual queue model
  - 2 queues are maintained at each site – one for local jobs and the other for remote jobs
  - Remote jobs are executed only when they do not affect the start times of the local jobs
  - Local jobs are given priority during backfilling
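A minimal sketch of the K-distributed submission and dequeue steps described above, assuming hypothetical site_submit()/site_dequeue() hooks into the local schedulers and a scalar per-site load metric; these are illustrative assumptions, not an interface from the paper.

#include <stdio.h>
#include <stdlib.h>

#define NSITES 4

typedef struct { int id; double load; } site_t;
typedef struct { int id; int procs; } job_t;

/* Hypothetical hooks into the per-site local schedulers (stubbed here). */
static void site_submit(int site, const job_t *j)  { printf("submit job %d to site %d\n", j->id, site); }
static void site_dequeue(int site, const job_t *j) { printf("dequeue job %d at site %d\n", j->id, site); }

static int by_load(const void *a, const void *b)
{
    const site_t *x = a, *y = b;
    return (x->load > y->load) - (x->load < y->load);   /* ascending load */
}

/* Enqueue a copy of the job at the k least loaded sites and remember where. */
static void k_distributed_submit(site_t sites[], int nsites,
                                 const job_t *j, int k, int chosen[])
{
    qsort(sites, nsites, sizeof(site_t), by_load);
    for (int i = 0; i < k && i < nsites; i++) {
        chosen[i] = sites[i].id;
        site_submit(sites[i].id, j);
    }
}

/* When one copy starts, the originating metascheduler asks the other
   k-1 schedulers to dequeue their copies. */
static void on_job_start(int started_site, const job_t *j, const int chosen[], int k)
{
    for (int i = 0; i < k; i++)
        if (chosen[i] != started_site)
            site_dequeue(chosen[i], j);
}

int main(void)
{
    site_t sites[NSITES] = { {0, 0.9}, {1, 0.2}, {2, 0.5}, {3, 0.7} };
    job_t job = { .id = 42, .procs = 16 };
    int chosen[2];

    k_distributed_submit(sites, NSITES, &job, 2, chosen);  /* picks sites 1 and 2 */
    on_job_start(/*started_site=*/2, &job, chosen, 2);     /* dequeues at site 1  */
    return 0;
}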

Results – Benefits of the New Schemes
(Figure; annotations: 45% improvement, 15% improvement)

Results – Usefulness of the K-Dual Scheme
(Figure: jobs grouped by whether they were submitted at lightly loaded or heavily loaded sites)

Assessment and Enhancement of Meta-Schedulers (Sabin et al.)
- Metascheduling working examples (LSF and Moab)
- 2 different modes:
  - Standard or centralized – all scheduling decisions are made in a centralized manner; forces local sites to accept advance reservations from the metascheduler
  - Delegated – does not provide a known scheduling policy for grid jobs

Centralized
- The metascheduler queries the local schedulers to obtain information about their current schedules
- The metascheduler makes an advance reservation on the "best" of the local schedulers
- Reservations are honored by the local sites, possibly delaying local jobs
- The metascheduler tries to find better reservations for all jobs at periodic intervals
- If a better reservation is found, the metascheduler cancels the existing reservation and moves the job to another local scheduler
- This model requires close interaction between the local schedulers and the metascheduler
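A sketch of the periodic re-reservation step in this centralized mode. The query_earliest_start(), make_reservation() and cancel_reservation() calls are assumed stand-ins for the metascheduler/local-scheduler interaction, not an actual LSF or Moab API; the stub bodies return canned values purely for illustration.

#define NSITES 4

typedef struct { int site; double start; } reservation_t;

/* Assumed hooks into the local schedulers (stubbed). */
static double query_earliest_start(int site, int job) { (void)job; return 100.0 * (site + 1); }
static void   make_reservation(int site, int job, double t) { (void)site; (void)job; (void)t; }
static void   cancel_reservation(int site, int job)         { (void)site; (void)job; }

/* Periodically revisit one grid job: if some local scheduler can now start it
   earlier than its current advance reservation, move the reservation there. */
void revisit_reservation(int job, reservation_t *r)
{
    int best_site = r->site;
    double best_start = r->start;

    for (int s = 0; s < NSITES; s++) {
        double t = query_earliest_start(s, job);
        if (t < best_start) { best_start = t; best_site = s; }
    }
    if (best_site != r->site) {                 /* found a better reservation */
        cancel_reservation(r->site, job);
        make_reservation(best_site, job, best_start);
        r->site  = best_site;
        r->start = best_start;
    }
}

In the real system this step would run at periodic intervals over all grid jobs; the sketch handles a single job.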

Delegated
- The metascheduler determines the "best" site for each grid job
- It delegates the scheduling responsibility to the local schedulers
- After the job is sent to the local site, there is no interaction between the metascheduler and the local scheduler
- The metascheduler "queries" the local schedulers for the metric that serves as the basis for the site choice
- This model is more scalable and allows local schedulers to retain autonomy

Evaluation – System-Wide Average Response Time
- Centralized outperforms delegated, since centralized revisits its scheduling decisions

Evaluation – Average Response Time of Jobs from the Least Loaded Site
- Metascheduling has a detrimental effect on users at the least loaded site
- At low loads, centralized is best – jobs submitted at the least loaded site may run faster at another site
- This is a case of least loaded sites being discouraged from joining the grid!

To Avoid Deterioration at the Least Loaded Sites: Dues-Based Queues
- The goal is to improve the priority of jobs originating from lightly loaded sites
- For each site pair, a relative resource-usage surplus/deficit is maintained
- Each site maintains the processor-seconds it has provided to other sites' jobs, and the processor-seconds its own jobs have consumed at other sites
- Site s_i sets the priority of all of s_j's jobs to dues[s_j]
- For lightly loaded sites the balance is usually a surplus; hence other sites have to pay "dues" to the lightly loaded sites by increasing the priorities of jobs submitted there

Dues-Based Queues – Example
- s1 runs a 100 processor-second job for s2: dues[s2] = -100; dues[s1] = 100
- s2 then runs a 300 processor-second job for s1 (s2 is "paying dues" to s1): dues[s2] = 200; dues[s1] = -200
- Queue order at each site is determined by the dues value of the submitting site
- Can be implemented in centralized mode (dues-based queuing at the metascheduler) or in delegated mode (dues-based queues at the local schedulers)
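A minimal sketch of the dues bookkeeping, using a single dues[] balance per site as in the two-site example above; the per-site-pair table mentioned on the previous slide would generalize this to a 2-D array, and the function names are illustrative assumptions.

#include <stdio.h>

#define NSITES 4

static double dues[NSITES];   /* positive: the grid owes this site dues */

/* host ran a (procs * seconds) job on behalf of origin. */
void account_usage(int host, int origin, double proc_seconds)
{
    dues[host]   += proc_seconds;   /* host provided cycles to others   */
    dues[origin] -= proc_seconds;   /* origin consumed cycles elsewhere */
}

/* Priority of a job is the dues value of its submitting site:
   jobs from sites with a surplus (lightly loaded sites) move up the queue. */
double job_priority(int origin) { return dues[origin]; }

int main(void)
{
    account_usage(/*host=*/0, /*origin=*/1, 100.0);  /* s1 runs 100 p-s for s2 */
    account_usage(/*host=*/1, /*origin=*/0, 300.0);  /* s2 runs 300 p-s for s1 */
    printf("dues[s1]=%.0f dues[s2]=%.0f\n", dues[0], dues[1]);  /* -200 and 200 */
    return 0;
}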

Evaluation – System-Wide Average Response Time
- The dues-based scheme performs worse than the corresponding schemes

Evaluation – Average Response Time of Jobs from the Least Loaded Site
- Centralized dues performs the best

Another Method: Local Priority with Job Sharing
- Dual queue
  - Dual queue at the local schedulers
  - Local jobs have higher priority than remote jobs
- Dual queue with local copy
  - In the dual-queue model, remote jobs may suffer starvation
  - Jobs from a lightly loaded site sent to a remote site may suffer
  - In this scheme, every job has a copy sent to the originating site's scheduler in addition to one remote site

Evaluation – System-Wide Average Response Time
- Dual queue with local copy performs the best

Evaluation – Average Response Times of Jobs from the Least Loaded Site
- Dual queue with local copy performs as well as the no-sharing scheme

Summary

References
- T. L. Casavant and J. G. Kuhl. "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems." IEEE Transactions on Software Engineering, Volume 14, Issue 2, February 1988.
- Volker Hamscher, Uwe Schwiegelshohn, Achim Streit, and Ramin Yahyapour. "Evaluation of Job-Scheduling Strategies for Grid Computing." Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Lecture Notes in Computer Science, 2000.
- Vijay Subramani, Rajkumar Kettimuthu, Srividya Srinivasan, and P. Sadayappan. "Distributed Job Scheduling on Computational Grids using Multiple Simultaneous Requests." Proceedings of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC 2002), July 2002.

References
- Sabin et al. "Assessment and Enhancement of Meta-Schedulers for Multi-Site Job Scheduling." HPDC 2005.

References
- Vadhiyar, S., Dongarra, J. and YarKhan, A. "GrADSolve – RPC for High Performance Computing on the Grid." Euro-Par 2003, 9th International Euro-Par Conference, Proceedings, Springer, LNCS 2790, 2003.
- Vadhiyar, S. and Dongarra, J. "Metascheduler for the Grid." Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, July 2002, Edinburgh, Scotland.
- Vadhiyar, S. and Dongarra, J. "GrADSolve – A Grid-based RPC System for Parallel Computing with Application-level Scheduling." Journal of Parallel and Distributed Computing, Volume 64.
- Petitet, A., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K., Vadhiyar, S. "Numerical Libraries and the Grid: The GrADS Experiments with ScaLAPACK." Journal of High Performance Applications and Supercomputing, Vol. 15, Number 4, Winter 2001.

Coallocation in Multicluster Systems
- Processor coallocation – allowing jobs to use processors in multiple clusters simultaneously
- Jobs consist of one or more components, each of which has to be scheduled on a different cluster
- A multi-component job is thus scheduled across a number of clusters equal to its number of components

Queuing Structures
- A single central scheduler with one global queue for the entire set of clusters: all clusters submit single- and multi-component jobs to the global queue
- Local schedulers with only local queues at the clusters: each cluster submits single- and multi-component jobs to its local queue
- A global queue for the system and local queues for the clusters: a cluster submits single-component jobs to its local queue and multi-component jobs to the global queue

Scheduling Multi-Component Jobs: WorstFit
- Order the job components in decreasing size
- Order the clusters by decreasing number of idle processors
- Traverse both lists one by one, trying to fit job components onto clusters
- Leaves as much room as possible in each cluster for subsequent jobs
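A minimal sketch of the WorstFit placement rule described above, pairing the i-th largest component with the i-th emptiest cluster; the types, the qsort-based ordering, and the "one component per cluster" check are assumptions about details the slide leaves open.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;            /* sort descending */
}

/* Fill placement[i] with the index (into idle[]) of the cluster chosen for
   component i; return false if the job does not fit. */
static bool worstfit_place(int components[], int ncomp,
                           const int idle[], int nclusters,
                           int order[], int placement[])
{
    if (ncomp > nclusters)
        return false;

    qsort(components, ncomp, sizeof(int), desc);          /* largest component first */

    /* Order cluster indices by decreasing number of idle processors. */
    for (int c = 0; c < nclusters; c++) order[c] = c;
    for (int i = 0; i < nclusters; i++)
        for (int j = i + 1; j < nclusters; j++)
            if (idle[order[j]] > idle[order[i]]) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    /* Traverse both lists, fitting the i-th component on the i-th cluster. */
    for (int i = 0; i < ncomp; i++) {
        if (components[i] > idle[order[i]])
            return false;
        placement[i] = order[i];
    }
    return true;
}

int main(void)
{
    int comps[] = { 8, 4 };                  /* a two-component job         */
    int idle[]  = { 10, 6, 3 };              /* idle processors per cluster */
    int order[3], placement[2];

    if (worstfit_place(comps, 2, idle, 3, order, placement))
        printf("components %d,%d -> clusters %d,%d\n",
               comps[0], comps[1], placement[0], placement[1]);
    return 0;
}

Matching the largest components with the emptiest clusters is what leaves, per the slide, as much room as possible in each cluster for subsequent jobs.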

Scheduling
- Invoked at job departure
- A queue is enabled when the corresponding scheduler is allowed to start jobs from that queue
- When a queue is enabled, the job at the head of the queue is scheduled if it fits
- When a job departs, all or some of the non-empty queues are enabled
- Enabled queues are repeatedly visited in some order
- Which non-empty queues are enabled, and in what order they are visited, is defined by the scheduling policy

Scheduling Policies
- GS – global scheduler policy with a single queue
- LS – each cluster has only a local queue. At a job departure, in which order should the non-empty queues be enabled? The local schedulers that have not scheduled jobs for the longest time get the first chance
- For systems with both a global queue and local queues:
  - GP – global priority: local queues are enabled only when the global queue is empty
  - LP – local priority: the global queue is enabled only when at least one local queue is empty. In which order should the local queues and the global queue be enabled? The global queue is enabled first and then the local queues
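A sketch of the job-departure pass under the GP and LP rules described above. The convention that queues[0] is the global queue and the single-pool fits() check are simplifications introduced here; the real system places job components across clusters (e.g. with WorstFit). GS and LS reduce to visiting a single global queue or only the local queues, so they are not shown.

#include <stdbool.h>
#include <stddef.h>

typedef struct job { struct job *next; int procs; } job_t;
typedef struct { job_t *head; } queue_t;
typedef enum { POLICY_GP, POLICY_LP } policy_t;

/* Toy single-pool resource state. */
static int idle_procs = 32;
static bool fits(const job_t *j) { return j->procs <= idle_procs; }
static void start_job(job_t *j)  { idle_procs -= j->procs; }

/* Invoked at a job departure.  queues[0] is the global queue and
   queues[1..n-1] are the local queues (a convention assumed for this sketch).
   Enabled queues are visited repeatedly; the head job of an enabled queue is
   started whenever it fits, until nothing more can be started. */
void on_job_departure(queue_t *queues[], size_t n, policy_t policy)
{
    bool progress = true;
    while (progress) {
        progress = false;

        bool global_empty = (queues[0]->head == NULL);
        bool some_local_empty = false;
        for (size_t i = 1; i < n; i++)
            if (queues[i]->head == NULL) some_local_empty = true;

        for (size_t i = 0; i < n; i++) {
            bool enabled = (i == 0)
                ? (policy == POLICY_GP || some_local_empty)  /* global queue  */
                : (policy == POLICY_LP || global_empty);     /* a local queue */
            queue_t *q = queues[i];
            if (q->head != NULL && enabled && fits(q->head)) {
                job_t *j = q->head;
                q->head = j->next;
                start_job(j);
                progress = true;
            }
        }
    }
}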

Coallocation Rules
- [no] – only single-component jobs are admitted; no coallocation
- [co] – both single- and multi-component jobs; no restrictions
- [rco] – restriction on the size of job components
- [fco] – restriction on the size and number of job components

Testbed
- DAS system in the Netherlands – 5 clusters: one with 72 nodes, the others with 32 nodes
- Intra-cluster communication – Myrinet LAN (1200 Mbit/s)
- Inter-cluster communication – 100 Mbit/s WAN

Evaluation
- 2 applications:
  - Ensflow – simulating streams and eddies in the ocean
  - Poisson – solution of the 2-D Poisson equation
- Execution times were measured on the DAS

Results

Conclusions
- [co] gives the worst performance, due to the simultaneous presence of large single-component jobs and jobs with many components
- [rco] and [fco] improve performance
- LS and LP provide the best results in the coallocation cases; GS performs better when there are only single-component jobs

Conclusions
- Processor coallocation is beneficial, at least when the overhead due to wide-area communication is not high
- Restrictions on the job component sizes and on the number of job components improve the performance of coallocation

Reference
- Bucur and Epema. "Scheduling Policies for Processor Coallocation in Multicluster Systems." IEEE TPDS, July 2007.

Grid Application Development Software (GrADS) Architecture
(Diagram: the User invokes the Grid Routine / Application Manager, which uses a Resource Selector (consulting MDS and NWS) and a Performance Modeler; inputs: matrix size and block size, resource characteristics, problem characteristics; output: the final schedule – a subset of the resources)

Performance Modeler
(Diagram: the Grid Routine / Application Manager passes all resources and the problem parameters to the Performance Modeler, which consists of a Scheduling Heuristic and a Simulation Model; the heuristic hands candidate resources to the simulation model and receives an execution cost, and returns the final schedule – a subset of the resources)
- The scheduling heuristic passed only those candidate schedules that had "sufficient" memory
- This is determined by calling a function in the simulation model

Simulation Model
- Simulation of the ScaLAPACK right-looking LU factorization
- More about the application:
  - Iterative – each iteration corresponds to a block
  - Parallel application in which the columns are block-cyclically distributed
  - Right-looking LU – based on Gaussian elimination

Operations
The LU application in each iteration involves:
- Block factorization – floating-point operations on the (ib:n, ib:ib) block
- Broadcast for the multiply – message size approximately equals n * block_size
- Each process does its own multiply – the remaining columns are divided among the processors
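The per-iteration cost accumulated by the simulation model on the next slide can be summarized as follows; this is a hedged restatement of the description above, with N_b the number of blocks (iterations) and s_p the effective speed of processor p under its current load (symbols introduced here for illustration, not taken from the source):

\[
  T_{\mathrm{total}} \;\approx\; \sum_{i=1}^{N_b}\Bigl(t^{(i)}_{\mathrm{fact}} + t^{(i)}_{\mathrm{bcast}} + t^{(i)}_{\mathrm{update}}\Bigr),
  \qquad
  t^{(i)}_{\mathrm{bcast}} = \max_{p}\, t_{\mathrm{bcast}}(p,i),
  \qquad
  t^{(i)}_{\mathrm{update}} = \max_{p}\, \frac{\mathrm{flops}_{\mathrm{mult}}(p,i)}{s_p}
\]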

Back to the Simulation Model

double getExecTimeCost(int matrix_size, int block_size, candidate_schedule)
{
    for (i = 0; i < number_of_blocks; i++) {
        /* Find the processor owning this column; note its speed and its
           connections to the other processors. */

        /* Simulate block factorization: depends on processor speed,
           machine load, and the flop count of the factorization. */
        tfact += ...;

        /* ScaLAPACK follows a split-ring broadcast; simulate the broadcast
           algorithm for each processor. Depends on the number of matrix
           elements to be broadcast, connection bandwidth and latency. */
        tbcast += max(bcast times for each proc.);

        /* Depends on the flop count of the matrix multiply, processor
           speed and load. */
        tupdate += max(matrix multiplies across all procs.);
    }
    return (tfact + tbcast + tupdate);
}

Initial GrADS Architecture
(Diagram: User, Grid Routine / Application Manager, Resource Selector (MDS, NWS), Performance Modeler, Application Launcher, Contract Monitor, Application; inputs: matrix size and block size, resource characteristics, problem characteristics; the launcher receives the problem, its parameters, the application location and the final schedule)

Performance Model Evaluation

GrADS Benefits
(Figure: runs on the MSC cluster alone vs. the MSC & TORC clusters, e.g. 8 MSCs and 8 TORCs)
- Even though performance worsened when using multiple clusters, larger problem sizes can be solved without incurring costly disk accesses