




1 Scheduling Generic Parallel Applications – classification, Meta-scheduling
Sathish Vadhiyar
Sources/Credits/Taken from: papers listed in the "References" slide

2 Scheduling Architectures
Centralized schedulers
- Single-site scheduling – a job does not span across sites
- Multi-site scheduling – a job may span several sites
Hierarchical structures – a central scheduler (metascheduler) for global scheduling, with local scheduling on the individual sites
Decentralized scheduling – distributed schedulers interact, exchange information and submit jobs to remote systems
- Direct communication – a local scheduler directly contacts remote schedulers and transfers some of its jobs
- Communication via a central job pool – jobs that cannot be executed immediately are pushed to a central pool; other local schedulers pull jobs out of the pool

3 Various Scheduling Architectures


5 Metascheduler across MPPs
Types:
Centralized
- A metascheduler and local dispatchers
- Jobs are submitted to the metascheduler
Hierarchical
- Combination of central and local schedulers
- Jobs are submitted to the metascheduler
- The metascheduler sends each job to the site with the earliest expected start time (see the sketch below)
- Local schedulers can follow their own policies
Distributed
- Each site has a metascheduler and a local scheduler
- Jobs are submitted to the local metascheduler
- Jobs can be transferred to the sites with the lowest load
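As a hedged illustration of the hierarchical placement rule above (not code from any of the referenced papers), the metascheduler can poll each site's local scheduler for an expected start time and submit to the minimum; earliest_start is an assumed query function, not a real API:

/* Pick the site with the earliest expected start time for a job that
   needs `procs` processors for `runtime` time units. */
int pick_site(int nsites,
              double (*earliest_start)(int site, int procs, double runtime),
              int procs, double runtime)
{
    int best = 0;
    double best_t = earliest_start(0, procs, runtime);
    for (int s = 1; s < nsites; s++) {
        double t = earliest_start(s, procs, runtime);
        if (t < best_t) { best_t = t; best = s; }  /* keep the earliest site */
    }
    return best;
}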

6 Evaluation of schemes
Centralized:
1. Global knowledge of all resources – hence optimized schedules
2. Can act as a bottleneck for large numbers of resources and jobs
3. May take time to transfer jobs from the metascheduler to the local schedulers – needs strategic positioning of the metascheduler
Hierarchical:
1. Medium-level overhead
2. Suboptimal schedules
3. Still needs strategic positioning of the central scheduler
Distributed:
1. No bottleneck – workload evenly distributed
2. Needs all-to-all connections between MPPs

7 Evaluation of Various Scheduling Architectures
Experiments to evaluate the slowdowns in the 3 schemes
Based on an actual trace from a supercomputer centre – a 5000-job set
4 sites were simulated – 2 with the same load as the trace, the other 2 with run times multiplied by 1.7
FCFS with EASY backfilling was used (see the sketch below)
slowdown = (wait_time + run_time) / run_time
2 more schemes:
- Independent – the local schedulers act independently, i.e. the sites are not connected
- United – the resources of all sites are combined to form a single site
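The deck does not show the simulator's code; as a minimal sketch of the EASY-backfilling rule it used, a waiting job may jump ahead of the queue head only if it does not delay the head job's reservation. Job, shadow_time (the head job's earliest start) and extra_procs (processors unused at that start) are illustrative names:

#include <stdio.h>

typedef struct { int procs; double runtime; } Job;

/* EASY test: start j now if it finishes before the reservation, or if it
   only uses processors the head job will not need. */
int can_backfill(Job j, int free_procs, double now,
                 double shadow_time, int extra_procs)
{
    if (j.procs > free_procs) return 0;           /* not enough free processors */
    if (now + j.runtime <= shadow_time) return 1; /* done before the reservation */
    return j.procs <= extra_procs;                /* fits in the leftover processors */
}

int main(void)
{
    Job j = { 4, 10.0 };
    printf("backfill ok: %d\n", can_backfill(j, 8, 0.0, 5.0, 4));
    /* the slide's metric: slowdown = (wait_time + run_time) / run_time */
    double wait = 30.0, run = 10.0;
    printf("slowdown = %.1f\n", (wait + run) / run);  /* prints 4.0 */
    return 0;
}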

8 Results

9 Observations
1. Centralized and hierarchical performed slightly better than united
   a. Compared to hierarchical, in united the scheduling decisions have to be made for all jobs and all resources at once – the overhead, and hence the wait time, is high
   b. Comparing united and centralized:
      i. 4 categories of jobs, corresponding to the 4 combinations of 2 parameters – execution time (short, long) and number of resources requested (narrow, wide)
      ii. There are usually many more long narrow jobs than short wide jobs
      iii. Why are centralized and hierarchical better than united?
2. Distributed performed poorly
   a. Short narrow jobs incurred more slowdown
   b. Short narrow jobs are large in number and the best candidates for backfilling
   c. Backfilling dynamics are complex
   d. A site with a light average load may not always be the best choice; short narrow jobs may find the earliest holes in a heavily loaded site

10 Newly Proposed Models
K-distributed model (see the sketch below)
- Distributed scheme in which the local metascheduler distributes each job to the k least loaded sites
- When the job starts on one site, a notification is sent to the local metascheduler, which in turn asks the other k-1 schedulers to dequeue the job
K-Dual queue model
- 2 queues are maintained at each site – one for local jobs and the other for remote jobs
- Remote jobs are executed only when they do not affect the start times of the local jobs
- Local jobs are given priority during backfilling
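A hedged sketch of the K-distributed submission step (types and fields are illustrative, not from the paper): sort the sites by reported load and enqueue a copy of the job at the k lightest ones; when one copy starts, the local metascheduler would tell the remaining k-1 sites to dequeue theirs.

#include <stdlib.h>

typedef struct { int id; double load; } Site;

static int by_load(const void *a, const void *b)
{
    double d = ((const Site *)a)->load - ((const Site *)b)->load;
    return (d > 0) - (d < 0);   /* ascending load */
}

/* Fill `chosen` with the ids of the k least loaded sites; note that
   qsort reorders the caller's array. */
void k_least_loaded(Site *sites, int nsites, int k, int *chosen)
{
    qsort(sites, nsites, sizeof(Site), by_load);
    for (int i = 0; i < k && i < nsites; i++)
        chosen[i] = sites[i].id;
}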

11 Results – Benefits of new schemes (figure): 45% improvement; 15% improvement

12 Results – Usefulness of K-Dual scheme Grouping jobs submitted at lightly loaded sites and heavily loaded sites

13 Meta scheduler with AppLeS Local Schedulers

14 Goals
The aim was to overcome the deficiencies of using plain AppLeS agents, and to have global policies:
- Resolving the different claims of applications
- Improving the response times of individual applications
- Taking care of load dynamics
Work done as part of the GrADS project:
- Grid Application Development Software
- Collaboration between different universities

15 Initial GrADS Architecture (figure): the User gives the matrix size and block size to the Grid Routine / Application Manager; the Resource Selector obtains resource characteristics from MDS and NWS; the Performance Modeler takes the resource and problem characteristics and produces the final schedule – a subset of the resources.

16 Performance Modeler (figure): the Grid Routine / Application Manager passes all resources and the problem parameters to the Performance Modeler, which consists of a Scheduling Heuristic and a Simulation Model. The heuristic proposes candidate resources, the simulation model returns their execution cost, and the final schedule – a subset of the resources – is returned. The scheduling heuristic passed on only those candidate schedules that had "sufficient" memory; this is determined by calling a function in the simulation model.

17 Simulation Model
Simulation of the ScaLAPACK right-looking LU factorization
More about the application:
- Iterative – each iteration corresponds to a block
- Parallel application in which the columns are distributed block-cyclically
- Right-looking LU – based on Gaussian elimination

18 Gaussian Elimination – Review
For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
    for j = i+1 to n           /* for each row j below row i */
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)

(figure: after step i, A(i+1:n, i) holds the finished multipliers, A(i, i+1:n) is the finished row of U, and A(i+1:n, i+1:n) is the trailing submatrix still being updated)
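For concreteness, the loop above can be made runnable; this is a plain C rendering (no pivoting, row-major storage, 0-based indexing in place of the slide's 1-based ranges):

/* Unblocked right-looking GE on a dense n x n matrix A (row-major). */
void ge_unblocked(double *A, int n)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++) {
            A[j*n + i] /= A[i*n + i];                   /* multiplier l(j,i) */
            for (int k = i + 1; k < n; k++)
                A[j*n + k] -= A[j*n + i] * A[i*n + k];  /* update trailing row */
        }
}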

19 Need for blocking – BLAS
Basic Linear Algebra Subroutines. The memory hierarchy is exploited more efficiently by the higher-level BLAS. 3 levels of BLAS:

Level                      Example                             Memory refs   Flops   Flops/Memory refs
Level-1 (vector)           y = y + a*x;  z = y . x             3n            2n      2/3
Level-2 (matrix-vector)    y = y + A*x;  A = A + alpha*x*y^T   n^2           2n^2    2
Level-3 (matrix-matrix)    C = C + A*B                         4n^2          2n^3    n/2

20 Converting BLAS2 to BLAS3
Use blocking to obtain optimized matrix multiplies (BLAS3)
Matrix multiplies via delayed updates:
- Save several updates to the trailing matrix
- Apply the saved updates as one matrix multiply

21 Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)

for ib = 1 to n-1 step b   /* process matrix b columns at a time */
    end = ib+b-1
    /* Apply the BLAS2 version of GE to factor A(ib:n, ib:end).
       Let LL denote the unit lower triangular portion of A(ib:end, ib:end) */
    A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)   /* update next b rows of U */
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
    /* apply the delayed updates with a single matrix multiply */

(figure: the completed parts of L and U, the panel A(ib:end, ib:end) with A(end+1:n, ib:end) below it, and the trailing matrix A(end+1:n, end+1:n))
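A compact, unpivoted C rendering of the blocked algorithm above, under the same assumptions as the earlier sketch (row-major, 0-based); ScaLAPACK's real routine adds pivoting and distributed BLAS:

/* Blocked right-looking GE with block size b; equivalent to ge_unblocked. */
void ge_blocked(double *A, int n, int b)
{
    for (int ib = 0; ib < n; ib += b) {
        int end = ib + b < n ? ib + b : n;   /* panel holds columns [ib, end) */
        /* BLAS2-style factorization of the panel A(ib:n, ib:end) */
        for (int i = ib; i < end; i++)
            for (int j = i + 1; j < n; j++) {
                A[j*n + i] /= A[i*n + i];
                for (int k = i + 1; k < end; k++)
                    A[j*n + k] -= A[j*n + i] * A[i*n + k];
            }
        /* triangular solve: A(ib:end, end:n) = LL^-1 * A(ib:end, end:n) */
        for (int i = ib; i < end; i++)
            for (int j = i + 1; j < end; j++)
                for (int k = end; k < n; k++)
                    A[j*n + k] -= A[j*n + i] * A[i*n + k];
        /* delayed updates as one matrix multiply:
           A(end:n, end:n) -= A(end:n, ib:end) * A(ib:end, end:n) */
        for (int j = end; j < n; j++)
            for (int i = ib; i < end; i++)
                for (int k = end; k < n; k++)
                    A[j*n + k] -= A[j*n + i] * A[i*n + k];
    }
}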

22 Operations
So, in each iteration the LU application involves:
- Block factorization – floating point operations on the block column A(ib:n, ib:end)
- Broadcast for the multiply – message size approximately n * block_size
- Each process then does its own multiply: the remaining columns are divided across the processors

23 Back to the simulation model

double getExecTimeCost(int matrix_size, int block_size, schedule_t *candidate_schedule)
{
    double tfact = 0.0, tbcast = 0.0, tupdate = 0.0;
    for (i = 0; i < number_of_blocks; i++) {
        /* find the proc. belonging to this block column; note its speed
           and its connections to the other procs. */
        tfact += ...   /* simulate block factorization: depends on
                          {processor_speed, machine_load, flop_count of factorization} */
        tbcast += max(bcast times for each proc.)
                       /* ScaLAPACK follows a split-ring broadcast; simulate the
                          broadcast algorithm for each proc.: depends on {elements of
                          the matrix to be broadcast, connection bandwidth and latency} */
        tupdate += max(matrix multiplies across all procs.)
                       /* depends on {flop count of matrix multiply, processor speed, load} */
    }
    return (tfact + tbcast + tupdate);
}

24 Initial GrADS Architecture (figure, extended): the User gives the matrix size and block size to the Grid Routine / Application Manager; the Resource Selector queries MDS and NWS for resource characteristics; the Performance Modeler produces the final schedule; the App Launcher receives the problem, its parameters, the application location and the final schedule, launches the Application, and the Contract Monitor observes the running application.

25 Contract Monitor Architecture (figure): Autopilot Manager, Application Sensors and Contract Monitor. The application forks and its sensors register with the Autopilot Manager; the Contract Monitor obtains sensor information from the manager and then obtains information about a variable x from the sensors.

26 Performance Model Evaluation

27 GrADS Limitations
Each application was scheduled in isolation, based only on its own view of the resources. Hence a metascheduler that has global knowledge of all applications is needed.

28 Metascheduler
- To ensure that applications are scheduled based on correct resource information
- To accommodate as many new applications as possible
- To improve the performance contracts of new applications
- To minimize the impact of new applications on executing applications
- To employ policies to migrate executing applications

29 Modified GrADS Architecture (figure): the initial architecture – User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, Contract Developer, App Launcher, Contract Monitor, Application, MDS and NWS – extended with the metascheduler components: Permission Service, RSS, Contract Negotiator, Rescheduler and Database Manager.

30 Database Manager
A persistent service listening for requests from the clients
Maintains a global clock
Has event notification capabilities – clients can express their interest in various events
Stores various information:
- Application states
- Initial machines
- Resource information
- Final schedules
- Locations of the various daemons
- Average number of contract violations

31 Database Manager (contd.)
When an application stops or completes, the database manager calculates the percentage completion time of the application (one reading is sketched below), using:
- time_diff: (current_time – the time when the application instance started)
- avg_ratio: the average of (actual costs / predicted costs)
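The slide names the two inputs but not the exact formula; one plausible reading (an assumption, not the paper's code) is to scale the predicted cost by the observed actual/predicted ratio and divide the elapsed time by the result:

/* Hypothetical percentage-completion calculation from the slide's inputs. */
double percent_completed(double current_time, double start_time,
                         double predicted_cost, double avg_ratio)
{
    double time_diff = current_time - start_time;  /* elapsed wall-clock time */
    double adjusted  = avg_ratio * predicted_cost; /* load-adjusted prediction */
    return 100.0 * time_diff / adjusted;
}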

32 Permission Service
After collecting resource information from NWS, GrADS applications contact the Permission Service (PS). The PS makes decisions based on the problem requirements and the resource characteristics (see the sketch below):
- If the resources have enough capacity, permission is given
- If not, the permission service either
  - waits for resource-consuming applications that will end soon, or
  - preempts resource-consuming applications to accommodate short applications
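A hedged sketch of the decision above; the slide does not state how the service chooses between waiting and preempting, so the condition used here (wait only if some resource-consuming application ends before the new application could have finished anyway) is purely an assumption, as are all names:

typedef enum { GRANT, WAIT, PREEMPT } decision_t;

decision_t permission(double required_capacity, double available_capacity,
                      double soonest_big_app_end, double new_app_length)
{
    if (required_capacity <= available_capacity)
        return GRANT;                    /* enough capacity now */
    if (soonest_big_app_end < new_app_length)
        return WAIT;                     /* a big app ends soon anyway */
    return PREEMPT;                      /* make room for a short app */
}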

33 Permission Service (Pseudo code)


35 Permission Service – determining resource-consuming applications
For each currently executing GrADS application i:
- contact the Database Manager and obtain the NWS resource information
- determine the change in resources caused by application i
- add the change to the current resource characteristics to obtain the resource parameters in the absence of application i (sketched below)
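In code form this is just an addition; a sketch with illustrative fields (the slide does not define the resource record):

typedef struct { double cpu_free; double mem_free; } res_t;

/* Resource picture in the absence of application i: add back what i uses. */
res_t without_app(res_t current, res_t used_by_app_i)
{
    res_t r;
    r.cpu_free = current.cpu_free + used_by_app_i.cpu_free;
    r.mem_free = current.mem_free + used_by_app_i.mem_free;
    return r;
}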

36 Determining remaining execution time
Whenever a metascheduler component wants to determine the remaining execution time of an application, it contacts the application's contract monitor:
- retrieves the average of the ratios between actual times and predicted times
- uses {average, predicted time, percentage completion time} to determine the remaining execution time (one reading is sketched below)
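The exact combination of the three quantities is given only in the pseudo-code slides that follow (which survive here only as titles), so this sketch is an assumption: refine the prediction by the observed ratio and subtract the completed fraction:

/* Hypothetical remaining-execution-time estimate from the slide's inputs. */
double remaining_exec_time(double avg_ratio, double predicted_time,
                           double percent_completed)
{
    double adjusted = avg_ratio * predicted_time;   /* refined total estimate */
    return adjusted * (1.0 - percent_completed / 100.0);
}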

37 Determining r.e.t. (pseudo code)

38 Determining r.e.t. (pseudo code)

39 Contract Negotiator
Main functionalities:
- Ensures applications make their decisions based on updated resource information
- Improves the performance of a current application, possibly by stopping and later continuing an executing big application
- Reduces the impact caused by current applications on executing applications
When a contract is approved, the application starts using the resources
When a contract is rejected, the application goes back to obtain new resource characteristics and generates a new schedule
Enforces an ordering between applications whose application-level schedules use the same resources:
- approves the contract of one application
- waits for that application to start using the resources
- rejects the contract of the other

40 Contract Negotiator (pseudo code) Ensuring the application has made its scheduling decision based on correct resource information

41 Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

42 Contract Negotiator – 3 scenarios
- t1 – average completion time of the current app. and the big app. when the big app. is preempted, the current app. is accommodated, and the big app. is then continued
- t2 – average completion time of the current app. and the big app. when the big app. is allowed to complete and the current app. is then accommodated
- t3 – average completion time of the current app. and the big app. when both applications are executed simultaneously

if (t1 < 25% of min(t2, t3))   case 1
else if (t3 > 1.2 * t2)        case 2
else                           case 3
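The slide's rule transcribed directly into C ("25% of min(t2, t3)" read literally; the 0.25 and 1.2 thresholds are the slide's own constants):

/* Returns which of the three scenarios the negotiator should choose. */
int choose_case(double t1, double t2, double t3)
{
    double tmin = t2 < t3 ? t2 : t3;
    if (t1 < 0.25 * tmin) return 1;  /* preempt big app, run current, continue big app */
    if (t3 > 1.2 * t2)    return 2;  /* let big app complete, then run current app */
    return 3;                        /* run both simultaneously */
}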

43 Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

44 Contract Negotiator (pseudo code) Reducing the impact of the current app. on an executing app. by modifying the schedule

45 Contract Negotiator (pseudo code) Reducing the impact of the current app. on an executing app. by modifying the schedule

46 Application and Metascheduler Interactions (flowchart):
1. The user supplies the problem parameters; resource selection yields an initial list of machines.
2. The application requests permission from the Permission Service. If permission is denied, it either aborts or gets new resource information and tries again.
3. Given permission, application-specific scheduling produces an application-specific schedule, and contract development submits it to the Contract Negotiator.
4. If the contract is not approved, the application gets new resource information and repeats the scheduling step; if approved, the application is launched with the problem parameters and final schedule.
5. If the application was stopped, it waits for a restart signal and then gets new resource information; otherwise it runs to completion and exits.

47 Experiments and Results Demonstration of Permission Service

48 References
T. L. Casavant and J. G. Kuhl. "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems". IEEE Transactions on Software Engineering, 14(2):141-154, February 1988.
V. Hamscher, U. Schwiegelshohn, A. Streit and R. Yahyapour. "Evaluation of Job-Scheduling Strategies for Grid Computing". Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Lecture Notes in Computer Science, pp. 191-202, 2000. ISBN 3-540-41403-7.
V. Subramani, R. Kettimuthu, S. Srinivasan and P. Sadayappan. "Distributed Job Scheduling on Computational Grids using Multiple Simultaneous Requests". Proceedings of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC 2002), July 2002.

49 References
S. Vadhiyar, J. Dongarra and A. YarKhan. "GrADSolve – RPC for High Performance Computing on the Grid". Euro-Par 2003, 9th International Euro-Par Conference Proceedings, Springer, LNCS 2790, pp. 394-403, August 26-29, 2003.
S. Vadhiyar and J. Dongarra. "A Metascheduler for the Grid". Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, pp. 343-351, July 2002, Edinburgh, Scotland.
S. Vadhiyar and J. Dongarra. "GrADSolve – A Grid-based RPC System for Parallel Computing with Application-level Scheduling". Journal of Parallel and Distributed Computing, 64:774-783, 2004.
A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche and S. Vadhiyar. "Numerical Libraries and the Grid: The GrADS Experiments with ScaLAPACK". International Journal of High Performance Computing Applications, 15(4):359-374, Winter 2001.

50 Metascheduler Components (figure)
Permission Service – receives requests from applications for permission to execute on the Grid; makes decisions based on resource capacities; can stop an executing resource-consuming application.
Contract Negotiator – receives application-level schedules from the applications; can accept or reject contracts; acts as a queue manager; ensures scheduling based on correct information; improves performance contracts; minimizes impact.
Database Manager – storing and retrieval of the states of the applications.
Rescheduler – receives requests for migration; reschedules executing applications to escape from heavy load or to use free resources.

51 Taxonomy of scheduling for distributed heterogeneous systems – Casavant and Kuhl (1988)

52 Taxonomy
Local vs. global
- Local – scheduling processes to time slices on a single processor
- Global – deciding which processor a job should go to
Approximate vs. heuristic
- Approximate – stop when you find a "good" solution; uses the same formal computational model as an optimal search. The ability to succeed depends on:
  - the availability of a function to evaluate a solution
  - the time required to evaluate a solution
  - the ability to judge the value of a solution according to some metric
  - a mechanism to intelligently prune the solution space
- Heuristic – works on assumptions about the impact of "important" parameters; the assumptions and the amount of impact cannot always be quantified

53 Also…
Flat characteristics:
- Adaptive vs. non-adaptive
- Load balancing
- Bidding – e.g. Condor
- Probabilistic – random searches
- One-time assignment vs. dynamic reassignment

54 Evaluation – Subramani et al.

55 Results – Usefulness of K-Dual scheme Grouping jobs submitted at lightly loaded sites and heavily loaded sites

56 Experiments and Results
Practical Experiments
5 applications were integrated into GrADS – ScaLAPACK LU, QR, Eigen, PETSc CG and a heat equation solver
Integration involved developing performance models and instrumenting with SRS
50 problems with different arrival rates:
- Poisson distribution with different mean arrival rates for job submission
- uniform distributions for problem types and problem sizes
Different statistics were collected with the metascheduler enabled or disabled

57 Experiments and Results Practical Experiments – Total Throughput Comparison

58 Experiments and Results
Practical Experiments – Performance Contract Violations
A contract violation occurs when (measured time / expected time) exceeds the maximum allowed (measured time / expected time); see the check below.
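The violation test from the slide, written out (max_ratio is the slide's "maximum allowed measured/expected"):

/* Returns 1 if the application has violated its performance contract. */
int contract_violated(double measured, double expected, double max_ratio)
{
    return measured / expected > max_ratio;
}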




