
1 CSE 160/Berman Mapping and Scheduling W+A: Chapter 4

2 CSE 160/Berman Outline
– Mapping and Scheduling
– Static Mapping Strategies
– Dynamic Mapping Strategies
– Scheduling

3 CSE 160/Berman Mapping and Scheduling Models
Basic models:
– Program model is a task graph with dependencies
– Platform model is a set of processors with an interconnection network

4 CSE 160/Berman Mapping and Scheduling
Mapping and scheduling involve the following activities:
– Select a set of resources on which to schedule the task(s) of the application.
– Assign application task(s) to compute resources.
– Distribute data or co-locate data and computation.
– Order tasks on compute resources.
– Order communication between tasks.

5 CSE 160/Berman Mapping and Scheduling Terminology
1. Select a set of resources on which to schedule the task(s) of the application.
2. Assign application task(s) to compute resources.
3. Distribute data or co-locate data and computation.
4. Order tasks on compute resources.
5. Order communication between tasks.
Activity 1 = resource selection; activities 1-3 are generally termed mapping; activities 4-5 are generally termed scheduling. For many researchers, "scheduling" is also used to describe activities 1-5. Mapping is an assignment of tasks in space; scheduling focuses on ordering in time.

6 CSE 160/Berman Goals
We want the mapping and scheduling algorithms and models to promote the assignment/ordering with the smallest execution time.
[Figure: "Accuracy vs. Ranking", comparing how candidate assignments A, B, A', B' and the optimum rank under the model versus on the real system ("Model" vs. "Real Stuff").]

7 CSE 160/Berman What is the best mapping?
[Figure: a weighted task graph (node weights 3, 4, 1, 7, 2, with edge labels 1, 1, 1, 2, 2 in the second version) mapped onto processors P1 and P2 in two different ways.]
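To make the question concrete, here is a minimal brute-force sketch, assuming the five node weights from the figure are independent task costs and ignoring the dependency edges and communication costs: it enumerates every assignment of the tasks to P1 and P2 and reports the smallest makespan.

/* Brute-force search for the best assignment of independent tasks to two
 * processors, minimizing the makespan (time at which the last processor
 * finishes).  The weights are the node costs from the slide's figure;
 * dependency edges and communication costs are ignored in this sketch. */
#include <stdio.h>

int main(void) {
    int w[] = {3, 4, 1, 7, 2};
    int n = 5;
    int best = 1 << 30;

    for (int mask = 0; mask < (1 << n); mask++) {   /* each bit: task -> P1 or P2 */
        int t1 = 0, t2 = 0;
        for (int i = 0; i < n; i++) {
            if (mask & (1 << i)) t1 += w[i]; else t2 += w[i];
        }
        int makespan = (t1 > t2) ? t1 : t2;
        if (makespan < best) best = makespan;
    }
    printf("best makespan = %d\n", best);            /* 9 for these weights */
    return 0;
}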

8 CSE 160/Berman Static and Dynamic Mapping Strategies
Static methods generate the partitioning prior to execution.
– Static mapping strategies work well when we can reasonably predict the time to perform application tasks during execution.
Dynamic methods generate the partitioning during execution.
– When it is not easy to predict task execution time, dynamic strategies may be more performance-efficient.
– For example, workqueue and master/slave (M/S) are dynamic methods, as sketched below.
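As a rough illustration of a dynamic method, here is a minimal workqueue sketch (not from the slides; the task costs, thread count, and do_task body are made up): idle workers repeatedly claim the next task from a shared counter, so the partitioning emerges at run time and faster workers naturally take more tasks.

/* Minimal workqueue (self-scheduling) sketch with POSIX threads. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTASKS 16
#define NWORKERS 4

static int next_task = 0;                  /* index of next unclaimed task */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void do_task(int t) { usleep(1000 * (t % 5)); }  /* stand-in for real work */

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = next_task < NTASKS ? next_task++ : -1;   /* claim a task */
        pthread_mutex_unlock(&lock);
        if (t < 0) break;                                /* queue is empty */
        do_task(t);
        printf("worker %ld did task %d\n", id, t);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}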

9 CSE 160/Berman Static Mapping
Static mapping can involve:
– partitioning of tasks (functional decomposition); the Sieve of Eratosthenes is an example
– partitioning of data (data decomposition); the fixed decomposition of Mandelbrot (k blocks per processor) is an example (a small block-decomposition sketch follows)
[Figure: labels P2, P3, P5, P7, P11, P13, P17, presumably the tasks for primes 2, 3, 5, 7, 11, 13, 17 in the sieve example.]
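For contrast with the dynamic workqueue above, here is a minimal sketch of the fixed Mandelbrot-style data decomposition: each processor is statically assigned k blocks of image rows before execution begins. The image height, processor count, and k are illustrative values, not from the slides.

/* Sketch of a fixed (static) data decomposition: the image rows are split
 * into k blocks per processor, assigned round-robin, so each processor gets
 * k fixed blocks regardless of how expensive its pixels turn out to be. */
#include <stdio.h>

int main(void) {
    int height = 480, nprocs = 4, k = 2;        /* k blocks per processor */
    int nblocks = nprocs * k;
    int rows_per_block = (height + nblocks - 1) / nblocks;

    for (int b = 0; b < nblocks; b++) {
        int owner = b % nprocs;                  /* round-robin block assignment */
        int lo = b * rows_per_block;
        int hi = lo + rows_per_block;
        if (hi > height) hi = height;
        printf("block %d: rows [%d,%d) -> processor %d\n", b, lo, hi, owner);
    }
    return 0;
}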

10 CSE 160/Berman Load Balancing
Load balancing = a strategy for partitioning the application so that
– all processors perform an equivalent amount of work
– (all processors finish in an equivalent amount of time; this is really time-balancing, since equivalent amounts of work may take different amounts of time)
Load balancing is an important technique in parallel processing.
– There are many ways to achieve a balanced load.
– Both dynamic and static load-balancing techniques exist.

11 CSE 160/Berman Static and Dynamic Mapping for the N-body Problem
The N-body problem: given n bodies in 3D space, determine the gravitational force F between each pair of bodies at any given point in time:
F = G * m_a * m_b / r^2
where G is the gravitational constant, r is the distance between the bodies, and m_a and m_b are the masses of the bodies.

12 CSE 160/Berman Exact N-body serial pseudo-code
At each time step t, the velocity v and position x of each body i may change. The real problem is a bit more complicated than this; see 4.2.3 in the book.

for (t = 0; t < tmax; t++) {
    for (i = 0; i < N; i++) {
        F = Force_routine(i);              /* total force on body i */
        v_new[i] = v[i] + F * dt;          /* update velocity */
        x_new[i] = x[i] + v_new[i] * dt;   /* update position */
    }
    for (i = 0; i < N; i++) {              /* commit new state for the next step */
        x[i] = x_new[i];
        v[i] = v_new[i];
    }
}
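The pseudo-code leaves Force_routine unspecified; one plausible version (a sketch only, with the body arrays, N, and the units assumed) sums the pairwise gravitational forces from slide 11, which is why a single iteration over all n bodies costs O(n^2).

/* One possible Force_routine(i): sum the gravitational pull of every other
 * body on body i, with magnitude G * m_i * m_j / r^2 directed along the
 * separation vector.  The mass and position arrays are assumed to be
 * defined elsewhere. */
#include <math.h>

#define N 1024
extern double m[N], x[N][3];      /* masses and 3D positions (assumed) */
static const double G = 6.674e-11;

void Force_routine(int i, double F[3]) {
    F[0] = F[1] = F[2] = 0.0;
    for (int j = 0; j < N; j++) {             /* O(n) work per body */
        if (j == i) continue;
        double d[3] = { x[j][0]-x[i][0], x[j][1]-x[i][1], x[j][2]-x[i][2] };
        double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
        double r  = sqrt(r2);
        double f  = G * m[i] * m[j] / r2;     /* force magnitude */
        for (int k = 0; k < 3; k++)
            F[k] += f * d[k] / r;             /* project onto unit direction */
    }
}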

13 CSE 160/Berman Exact N-body and static partitioning
We can parallelize n-body by tagging the velocity and position of each body and updating bodies using correctly tagged information. This can be implemented as a data parallel algorithm.
What is the worst-case complexity of a single iteration? How should we partition this?
– Static partitioning can be a bad strategy for the n-body problem.
– Load can be very unbalanced for some configurations.

14 CSE 160/Berman Improving the complexity of the N-body code
The complexity of the serial n-body algorithm is very large: O(n^2) for each iteration. The communication structure is not local: each body must gather data from all other bodies. The most interesting problems are those where n is large, so the exact method is not feasible for them. The Barnes-Hut algorithm is a well-known approximation to the exact n-body problem and can be efficiently parallelized.

15 CSE 160/Berman Barnes-Hut Approximation
The Barnes-Hut algorithm is based on the observation that a cluster of distant bodies can be approximated as a single distant body:
– total mass = aggregate mass of the bodies in the cluster
– distance to the cluster = distance to the center of mass of the cluster
This clustering idea can be applied recursively.

16 CSE 160/Berman Barnes-Hut idea
Dynamic divide-and-conquer approach:
– Each region (cube) of space is divided into 8 subcubes.
– If a subcube contains more than 1 body, it is recursively subdivided.
– If a subcube contains no bodies, it is removed from consideration.
In the 2D analogue (shown on the original slide), each 2D region is divided into 4 subregions.

17 CSE 160/Berman Barnes-Hut idea
For a 3D decomposition, the result is an octtree; for a 2D decomposition, the result is a quadtree (pictured on the original slide). A small sketch of the quadtree construction follows.
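A minimal sketch of building the 2D quadtree, assuming a square region as in the picture; the Node layout, field names, and the handling of coincident bodies are illustrative choices, not taken from the book.

/* A square region is split into 4 subsquares whenever it holds more than
 * one body; empty subsquares are simply never created. */
#include <stdlib.h>

typedef struct Node {
    double cx, cy, half;        /* center and half-width of this square    */
    double mass, mx, my;        /* total mass and mass-weighted position   */
    int    nbodies;
    struct Node *child[4];      /* NW, NE, SW, SE; NULL if empty           */
} Node;

static Node *make_node(double cx, double cy, double half) {
    Node *n = calloc(1, sizeof(Node));
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

static int quadrant(Node *n, double x, double y) {
    return (x >= n->cx ? 1 : 0) + (y < n->cy ? 2 : 0);
}

/* Insert a body; subdivide recursively while a square holds more than one. */
void insert(Node *n, double x, double y, double m) {
    if (n->nbodies == 1) {                       /* push the resident body down */
        double ox = n->mx / n->mass, oy = n->my / n->mass;
        int q = quadrant(n, ox, oy);
        double h = n->half / 2;
        if (!n->child[q])
            n->child[q] = make_node(n->cx + (q & 1 ? h : -h),
                                    n->cy + (q & 2 ? -h : h), h);
        insert(n->child[q], ox, oy, n->mass);
    }
    if (n->nbodies >= 1) {                       /* route the new body down too */
        int q = quadrant(n, x, y);
        double h = n->half / 2;
        if (!n->child[q])
            n->child[q] = make_node(n->cx + (q & 1 ? h : -h),
                                    n->cy + (q & 2 ? -h : h), h);
        insert(n->child[q], x, y, m);
    }
    n->mass += m; n->mx += m * x; n->my += m * y;   /* aggregate mass/center */
    n->nbodies++;
}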

18 CSE 160/Berman Barnes-Hut Pseudo-code

for (t = 0; t < tmax; t++) {
    Build octtree;
    Compute the total mass and center of mass of each subcube;
    Traverse the tree, computing the forces;
    Update the position and velocity of all bodies;
}

Notes:
– The total mass and center of mass of each subcube are stored at its root.
– The tree traversal stops at a node when the clustering approximation can be used for a particular body. In the gravitational n-body problem described here, this can happen when r >= d/c, where r is the distance to the center of mass of a subcube of side d and c is a constant.
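A sketch of the "traverse the tree, computing the forces" step for the 2D quadtree above, reusing the Node type from the previous sketch; the opening constant c, the value of G, and the 2D simplification are assumptions for illustration.

/* A whole subtree is treated as one distant body whenever r >= d/c,
 * otherwise the traversal descends into its children. */
#include <math.h>

static const double G = 6.674e-11;
static const double C = 1.0;          /* opening constant c (assumed value) */

/* Node is the quadtree node type from the previous sketch. */
void tree_force(const Node *n, double x, double y, double m, double F[2]) {
    if (!n || n->mass == 0.0) return;
    double cmx = n->mx / n->mass, cmy = n->my / n->mass;   /* center of mass */
    double dx = cmx - x, dy = cmy - y;
    double r = sqrt(dx*dx + dy*dy);
    double d = 2.0 * n->half;                              /* side of the square */
    if (r == 0.0) return;                                  /* skip the body itself */

    if (n->nbodies == 1 || r >= d / C) {
        /* far enough away (or a single body): use the cluster approximation */
        double f = G * m * n->mass / (r * r);
        F[0] += f * dx / r;
        F[1] += f * dy / r;
    } else {
        for (int q = 0; q < 4; q++)
            tree_force(n->child[q], x, y, m, F);           /* open the cell */
    }
}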

19 CSE 160/Berman Barnes-Hut Complexity
Partitioning is dynamic: the whole octtree must be reconstructed at each time step because the bodies will have moved. Constructing the tree can be done in O(n log n). Computing the forces can be done in O(n log n). So Barnes-Hut is O(n log n) per iteration [compare to O(n^2) per iteration for the exact solution].

20 CSE 160/Berman Generalizing the Barnes-Hut approach
The approach can be used for applications which repeatedly perform some calculation on particles/bodies/data indexed by position.
Recursive bisection (sketched below):
– Divide the region in half each time so that the particles are balanced.
– Map the resulting rectangular regions onto processors so that the load is balanced.
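A minimal recursive-bisection sketch, assuming a 2D particle set and a power-of-two processor count; splitting at the median coordinate (found here by sorting, which is simple but not the fastest way) gives each processor the same number of particles.

/* Recursive (orthogonal) bisection: at each level the particles in a region
 * are split into two equal halves at the median coordinate, cutting along x
 * and y alternately, until one region per processor remains. */
#include <stdlib.h>

typedef struct { double x, y; int proc; } Particle;

static int cmp_x(const void *a, const void *b) {
    double d = ((const Particle *)a)->x - ((const Particle *)b)->x;
    return (d > 0) - (d < 0);
}
static int cmp_y(const void *a, const void *b) {
    double d = ((const Particle *)a)->y - ((const Particle *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assign particles p[0..n) to processors [proc_lo, proc_lo + nprocs). */
void bisect(Particle *p, int n, int nprocs, int proc_lo, int depth) {
    if (nprocs == 1) {                       /* one region left: label it */
        for (int i = 0; i < n; i++) p[i].proc = proc_lo;
        return;
    }
    qsort(p, n, sizeof(Particle), depth % 2 ? cmp_y : cmp_x);
    int half = n / 2;                        /* equal particle counts => balanced load */
    bisect(p,        half,     nprocs / 2, proc_lo,              depth + 1);
    bisect(p + half, n - half, nprocs / 2, proc_lo + nprocs / 2, depth + 1);
}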

21 CSE 160/Berman Recursive Bisection Programming Issues
– How do we keep track of the regions mapped to each processor?
– What should the density of each region be? [granularity!]
– What is the complexity of performing the partitioning?
– How often should we repartition to optimize the load balance?
– How can locality of communication or the processor configuration be leveraged?

22 CSE 160/Berman Scheduling
Application scheduling: ordering and allocation of tasks/communication/data to processors.
– Application-centric performance measure, e.g. minimal execution time.
Job scheduling: ordering and allocation of jobs on an MPP.
– System-centric performance measure, e.g. processor utilization, throughput.

23 CSE 160/Berman Job Scheduling Strategies
– Gang scheduling
– Batch scheduling using backfilling

24 CSE 160/Berman Gang scheduling
Gang scheduling is a technique for allocating a collection of jobs on an MPP:
– One or more jobs are clustered as a gang.
– Gangs share time slices on the whole machine.
The strategy combines time-sharing (gangs get time slices) and space-sharing (gangs partition space) approaches. There are many flavors of gang scheduling in the literature.

25 CSE 160/Berman Gang Scheduling Formal definition from Dror Feitelson: Gang scheduling is a scheme that combines three features: –The threads of a set of jobs are grouped into gangs with the threads in a single job considered to be a single gang. –The threads in each gang execute simultaneously on distinct PEs, using a 1-1 mapping. –Time slicing is used, with all the threads in a gang being preempted and rescheduled at the same time.

26 CSE 160/Berman Why gang scheduling?
Gang scheduling promotes efficient performance of individual jobs as well as efficient utilization and fair allocation of machine resources. Gang scheduling leads to two desirable properties:
– It promotes efficient fine-grain interactions among the threads of a gang, since they are executing simultaneously.
– Periodic preemption prevents long jobs from monopolizing system resources (the overhead of preemption can reduce performance and so must be implemented efficiently).
Gang scheduling was used as the scheduling policy for the CM-5, Meiko CS-2, Paragon, etc.

27 CSE 160/Berman Batch Job Scheduling
Problem: how to schedule jobs waiting in a queue to run on a multicomputer?
– Each job requests some number n of nodes and some time t to run.
Goal: promote utilization of the machine, fairness to jobs, and short queue wait times.

28 CSE 160/Berman One approach: Backfilling
Main idea: pack the jobs in the processor/time space (a small sketch follows this list).
– Allow the job at the head of the queue to be scheduled in the first available slot.
– If other jobs in the queue can run without changing the start time of previous jobs in the queue, schedule them.
– Promote jobs if they can start earlier.
Many versions of backfilling:
– EASY: promote jobs as long as they don't delay the start time of the first job in the queue.
– Conservative: promote jobs as long as they don't delay the start time of any job in the queue.
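A small sketch of conservative backfilling over a discretized processors x time grid, assuming each job's requested runtime is exact; the machine size, horizon, and job list are made up for illustration. Because jobs are placed in queue order and earlier reservations are never moved, a later job can only slide into holes left by earlier ones.

/* Each queued job (n nodes for t time units) is placed at the earliest slot
 * that can hold it without moving any job already placed. */
#include <stdio.h>
#include <string.h>

#define P 8          /* processors in the machine */
#define T 64         /* length of the time horizon considered */

static int used[T];  /* processors already reserved in each time step */

typedef struct { int nodes, len; } Job;

static int fits(int start, Job j) {
    for (int t = start; t < start + j.len; t++)
        if (t >= T || used[t] + j.nodes > P) return 0;
    return 1;
}

int main(void) {
    Job queue[] = { {4, 10}, {6, 5}, {2, 8}, {8, 4}, {1, 20} };
    int nq = sizeof(queue) / sizeof(queue[0]);
    memset(used, 0, sizeof(used));

    for (int i = 0; i < nq; i++) {                /* queue (arrival) order */
        for (int start = 0; start < T; start++) {
            if (fits(start, queue[i])) {
                for (int t = start; t < start + queue[i].len; t++)
                    used[t] += queue[i].nodes;    /* reserve the slot */
                printf("job %d (%d nodes x %d) starts at t=%d\n",
                       i, queue[i].nodes, queue[i].len, start);
                break;
            }
        }
    }
    return 0;
}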

29 CSE 160/Berman Backfilling Example
Submitting five requests…
[Figure: the five requested jobs laid out in the processors x time space.]

30 CSE 160/Berman Backfilling Example
Submitting five requests… Using backfilling...
[Figure: the same five requests in the processors x time space, before and after backfilling.]

31 CSE 160/Berman Backfilling Example
[Figure only: processors x time diagrams.]

32 CSE 160/Berman Backfilling Example
[Figure only: processors x time diagrams.]

33 CSE 160/Berman Backfilling Example
An existing job finishes; backfilling promotes the yellow job and then schedules the purple job.
[Figure: processors x time diagrams before and after this step.]

34 CSE 160/Berman Backfilling Scheduling
Backfilling is used in the Maui Scheduler at SDSC on the SP-2, PBS at NASA, the Computing Condominium Scheduler at Penn State, etc.
Backfilling issues:
– What if the processors of the platform have different capacities (are not homogeneous)?
– What if some jobs get priority over others?
– Should parallel jobs be treated differently from serial jobs?
– If multiple queues are used, how should they be administered?
– Should users be charged to wait in the queue as well as to run on the machine?

35 Optimizing Application Performance
Backfilling and MPP scheduling strategies typically optimize for throughput. Optimizing throughput and optimizing application performance (e.g. execution time) can often conflict. How can applications optimize performance in an MPP environment?
Moldable jobs = jobs which can run with more than one partition size.
Question: what is the optimal partition size for moldable jobs? We can answer this question when the MPP scheduler runs a conservative backfilling strategy and publishes the list of available nodes.
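A minimal sketch of the idea (not the actual SA code discussed on the next slide): given the availability list published by a conservative-backfilling scheduler and a speedup curve for the job, try candidate partition sizes and pick the one with the smallest estimated turnaround (queue wait plus run time). The availability-list format, the Amdahl-style speedup model, and all constants here are assumptions for illustration only.

/* Choose a partition size for a moldable job by minimizing estimated
 * turnaround = earliest start time on n nodes + run time on n nodes. */
#include <stdio.h>

typedef struct { double time; int free_nodes; } Avail;

/* Run time on n nodes under an Amdahl-style model with serial fraction s. */
static double run_time(double t1, double s, int n) {
    return t1 * (s + (1.0 - s) / n);
}

/* Earliest time at which n nodes are free, per the availability list. */
static double earliest_start(const Avail *a, int na, int n) {
    for (int i = 0; i < na; i++)
        if (a[i].free_nodes >= n) return a[i].time;
    return 1e30;                                /* never, within this horizon */
}

int main(void) {
    Avail avail[] = { {0.0, 4}, {2.0, 16}, {6.0, 64} };   /* made-up profile */
    int na = 3;
    double t1 = 100.0, s = 0.05;                /* 100 time units serially   */

    int best_n = 1;
    double best_turnaround = 1e30;
    for (int n = 1; n <= 64; n *= 2) {
        double ta = earliest_start(avail, na, n) + run_time(t1, s, n);
        printf("n=%2d: estimated turnaround %.1f\n", n, ta);
        if (ta < best_turnaround) { best_turnaround = ta; best_n = n; }
    }
    printf("request %d nodes\n", best_n);
    return 0;
}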

36 Optimizing Applications Targeted to a Batch-scheduled MPP
SA = a generic AppLeS scheduler developed for jobs submitted to a backfilling MPP.
– SA uses the availability list of the MPP scheduler to determine the size of the partition to be requested by the application.
– Speedup curve known for Gas applications.
Static = jobs submitted without SA.
The workload is taken from KTH (Swedish Royal Institute of Technology). Experiments developed by Walfredo Cirne.

