
1 New Experimental Results in Communication-Aware Processor Allocation for Supercomputers
Michael Bender, SUNY Stony Brook
David Bunde, Knox College
Vitus Leung, Sandia National Laboratories
Kevin Pedretti, Sandia National Laboratories
Cynthia Phillips, Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

2 Computational Plant (Cplant)
Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components)
– Up to 2048 processors
– Production computing environment
Our job: improve parallel node allocation on Cplant to optimize performance.

3 The Cplant System
DEC Alpha processors
Myrinet interconnect (Sandia modified)
MPI
Different sizes/topologies: usually a 2D or 3D grid with toroidal wraps
– Ross = 2048-proc 3D mesh
– Zermatt = 128-proc 2D mesh
– Alaska = ~600 proc, heavily-augmented 2D mesh (cannibalized)
Modified Linux OS (now public domain)
Four processors per switch (compute, I/O, and service nodes)

4 Scheduling Environment
Users submit jobs to a queue (online)
Users specify the number of processors and a runtime estimate
– If a job runs past this estimate by 5 min, it is killed
No preemption, no migration, no multitasking (security)
Actual runtime depends on the set of processors allocated and on the placement of other jobs
Goals:
– Users: minimum response time
– Bureaucracy (GAO): high utilization

5 Scheduler/Allocator Association
The scheduler and the allocator affect each other's performance.
[Figure: scheduler and allocator linked by mutual performance dependencies]

6 Scheduler/Allocator Dissociation
Scheduler enforces policy
– Management sets priorities for access and a utilization policy
Allocator can then optimize performance
[Figure: a user submits a job (executable, # processors, requested time) to the Cplant queue; the PBS scheduler draws jobs from the queue and hands them to the node allocator]

7 What's a Good Allocation?
Objective: allocate jobs to processors to minimize network contention, i.e., maximize processor locality.
Especially important for commodity networks.
[Figure: side-by-side good (compact) and bad (scattered) allocations on a 2D mesh]

8 Quantitative Effect of Processor Locality
But: a speed-up anomaly, with one allocation running 2× faster than another.
[Figure: the two allocations compared; legend: blank square = empty processor]

9 Communication Hops on a 2D Grid
L1 distance = number of hops (~ number of switches) between 2 processors on the grid.
[Figure: two processor pairs on a 2D grid, at L1 distances 5 and 4]
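To make the metric concrete, here is a minimal Python sketch (the coordinates below are illustrative, not taken from the slide's figure, which shows pairs at distances 5 and 4):

```python
def l1_distance(p, q):
    """Manhattan (L1) distance between two processors at grid
    coordinates p and q; approximates the number of switch hops."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(l1_distance((0, 0), (2, 3)))  # -> 5
print(l1_distance((1, 1), (3, 3)))  # -> 4
```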

10 Allocation Problem
Given the available points on a grid (some points unavailable), find a set of k available points with minimum average (or total) pairwise L1 distance.
Example: the green allocation has total distance 3(2) + 3(1) = 9.
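Since the complexity of this problem on grids is open (see slide 12), an exhaustive reference solver is only usable on tiny instances. The sketch below uses hypothetical helper names; the final line reproduces the slide's 3(2) + 3(1) = 9 total with one four-processor shape that realizes it (the slide's actual green allocation may differ):

```python
from itertools import combinations

def total_pairwise_l1(points):
    """Total L1 distance over all pairs of allocated processors."""
    return sum(abs(px - qx) + abs(py - qy)
               for (px, py), (qx, qy) in combinations(points, 2))

def best_allocation(free_points, k):
    """Exhaustive baseline: the k free points minimizing total pairwise
    L1 distance. Exponential in k; a reference oracle only."""
    return min(combinations(free_points, k), key=total_pairwise_l1)

# Four processors in a T shape: 3 pairs at distance 1 and 3 pairs at
# distance 2, i.e. 3(2) + 3(1) = 9.
print(total_pairwise_l1([(0, 0), (1, 0), (2, 0), (1, 1)]))  # -> 9
```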

11 Empirical Correlation
Leung et al., 2002
Related support: Mache and Lo, 1996

12 Previous Work
Various work forcing a convex allocation
– Insufficient processor utilization
Mache, Lo, and Windisch: MC algorithm
Krumke et al.: 2-approximation; NP-hard with a general metric
– Complexity open for grids
Dispersion problem (maximizing distance): linear time for fixed k (Fekete and Meijer)

13 Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004]
Almost a circle, but not quite: only 0.05 percent difference in area.
[Figure: the optimal unconstrained allocation shape compared with a circle]

14 Previous Results (Bender et al. 2005)
7/4-approximation in 2D ((2 − 1/(2d))-approximation in d dimensions)
PTAS ((1+ε)-approximation in polynomial time for fixed ε)
MC is a 4-approximation
Linear-time exact dynamic program in 1D
O(n log n) time for k = 3
Simulations (performance on job streams)

15 Experiments: Placement Algorithm MC
Search in shells outward from a minimum-size region of the preferred shape.
Weight free processors by shell number.
Return the processor set with minimum total weight.
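A simplified sketch of the shell idea, assuming shells grow around a single candidate processor rather than around the minimum-size region of the preferred shape that MC actually uses; all function names are hypothetical:

```python
def shell_cost(center, free, k):
    """Sum of L1 shell numbers (distances from `center`) of the k
    nearest free processors; infinite if fewer than k are free."""
    shells = sorted(abs(x - center[0]) + abs(y - center[1])
                    for (x, y) in free)
    return sum(shells[:k]) if len(shells) >= k else float('inf')

def mc_allocate(free, k):
    """Try every free processor as a candidate center and return the
    k free processors around the cheapest one."""
    center = min(free, key=lambda c: shell_cost(c, free, k))
    by_dist = sorted(free, key=lambda p: abs(p[0] - center[0]) +
                                         abs(p[1] - center[1]))
    return by_dist[:k]
```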

16 Alternative: One-Dimensional Reduction
Order processors so that processors close in the linear order are close in the physical processor graph.
Then solve one-dimensional processor allocation:
– Bin packing heuristics (best fit, first fit, sum of squares)
– Pack jobs onto the line (or ring), allowing fragmentation (see the sketch below)
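A minimal sketch of the 1D reduction under a best-fit policy, assuming `order` lists the processors along the chosen curve and `busy` is the set of already-allocated processors (both names are mine, not from the talk):

```python
def best_fit_1d(order, busy, k):
    """Allocate k processors from the smallest free interval along the
    curve that fits the job (best fit), spilling into the largest
    intervals (fragmentation) if no single interval is big enough."""
    # Collect maximal runs of free processors along the curve.
    intervals, run = [], []
    for p in order:
        if p in busy:
            if run:
                intervals.append(run)
            run = []
        else:
            run.append(p)
    if run:
        intervals.append(run)
    # Best fit: smallest interval that holds the whole job.
    fitting = [iv for iv in intervals if len(iv) >= k]
    if fitting:
        return min(fitting, key=len)[:k]
    # Otherwise fragment, filling the largest intervals first.
    alloc = []
    for iv in sorted(intervals, key=len, reverse=True):
        alloc.extend(iv[:k - len(alloc)])
        if len(alloc) == k:
            return alloc
    return None  # not enough free processors
```

First fit would instead take the first interval that holds the job, and sum of squares scores intervals by a different rule; only the interval-selection line changes.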

17 New System: Red Storm
12,960 dual-core 2.4 GHz AMD Opterons
39.19 TB memory, 340 TB disk
124 TF peak performance
3D mesh interconnect

18 Impact
Changed the node allocator on Cplant
– 1D default allocator
– 2D algorithms implemented
Carried over to the Red Storm system software
– 1D and 2D algorithms implemented
– Selectable at compile time
R&D 100 winner (Leung, Bender, Bunde, Pedretti, Phillips 2006)

19 Red Storm Development Machine
One Cray XT3/4 cabinet
[Figure: cabinet node layout; legend: I/O node, compute node]

20 Does Bandwidth Make a Difference?
                      Real time (s)   User time (s)   Sys time (s)
1/4 link bandwidth    15623.353       1012.302        50.298
Full bandwidth        6314.818        1010.752        50.003
Yes! Quartering the link bandwidth makes real time roughly 2.5× longer, while user and system time are essentially unchanged.

21 Red Storm Development Machine: YZ S-Curve
[Figure: the YZ S-curve ordering of nodes; legend: I/O node, compute node]

22 Red Storm Development Machine: ZY S-Curve
[Figure: the ZY S-curve ordering of nodes; legend: I/O node, compute node]

23 Hilbert (Space-Filling) Curves
For 2D and 3D grids
Previous applications:
– I/O-efficient and cache-oblivious computation
– Compression (images)
– Domain decomposition
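For the 2D case, a standard bit-twiddling conversion (the textbook construction, not necessarily the code used in Zoltan or on Red Storm) maps each grid point to its index along the curve; sorting processors by that index gives the locality-preserving linear order the one-dimensional reduction needs:

```python
def xy2d(n, x, y):
    """Index of grid point (x, y) along a Hilbert curve filling an
    n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is in canonical position.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Locality-preserving linear order for an 8x8 processor grid:
order = sorted(((x, y) for x in range(8) for y in range(8)),
               key=lambda p: xy2d(8, p[0], p[1]))
```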

24 Red Storm Development Machine: Zoltan Hilbert Space-Filling Curve
[Figure: the Zoltan Hilbert space-filling-curve ordering of nodes; legend: I/O node, compute node]

25 Red Storm Development Machine: Spliced Hilbert Space-Filling Curve
[Figure: the spliced Hilbert space-filling-curve ordering of nodes; legend: I/O node, compute node]

26 Results (Makespan in Seconds)
         YZ       ZY       random   Zoltan   spliced
MC1x1    5807.1   (ordering-independent)
SS       5830.6   7003.2   6610.1   6699.6   6021.1
FF       5868.6   7039.5   6639.6   6758.7   6052.3
BF       5826.2   7022.6   6631.9   6739.1   6023.4
simple   6102.4   (ordering-independent)
Consistent with simulations (Bender et al. 2005)

27 Results (Makespan Normalized to MC1x1)
         YZ       ZY       random   Zoltan   spliced
MC1x1    1        (ordering-independent)
SS       1.0040   1.206    1.1383   1.1537   1.0369
FF       1.0106   1.2122   1.1434   1.1639   1.0422
BF       1.0033   1.2093   1.1420   1.1605   1.0372
simple   1.0509   (ordering-independent)
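These entries are simply the slide-26 makespans divided by the MC1x1 baseline of 5807.1 s, e.g.:

```python
# Best fit on the YZ curve, from the makespan table on slide 26:
print(round(5826.2 / 5807.1, 4))  # -> 1.0033
```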

28 Red Storm Development Machine
Is it I/O or interprocess communication?
[Figure: node layout; legend: I/O node, compute node]

29 Results (Makespan Normalized)
         YZ   ZY       random   Zoltan   spliced
BF       1    1.2053   1.1383   1.1567   1.0338
BF2      1    1.2398   1.176    1.1828   1.0443
Not I/O.
Consistent with Cplant experiments (Leung et al. 2002)
Consistent with Pittsburgh Supercomputing Center experiments (Weisser et al. 2006)

30 Experiments: Test Set
All-to-all communications
Job size   Number of jobs
2          1820
5          660
15         620
20         660
High communication: the best case for runtime improvements.
Small number of repetitions (3).
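The talk does not include the benchmark source; below is a hedged sketch of such an all-to-all job using mpi4py (an assumption on my part — the actual jobs were presumably compiled MPI codes):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Each rank sends one integer to every other rank and receives one
# from each: the dense communication pattern this test set stresses.
send = np.full(size, comm.Get_rank(), dtype='i')
recv = np.empty(size, dtype='i')
comm.Alltoall(send, recv)
```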

31 Questions
What's the right allocation for a stream of jobs (online)?
How should scheduling and allocation of MPP jobs be combined?

