Slide 1: Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program
John M. Dennis: dennis@ucar.edu
Elizabeth R. Jessup: jessup@cs.colorado.edu
April 5, 2006

Petascale Computation for the Geosciences Workshop

Slide 2: Motivation
- Outgrowth of PhD thesis
  - Memory-efficient iterative solvers
  - Data movement is expensive
  - Developed techniques to improve memory efficiency
- Apply Automated Memory Analysis to POP
- Parallel Ocean Program (POP) solver
  - Large % of time
  - Scalability issues

Slide 3: Outline
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Slide 4: Automated Memory Analysis?
- Analyze algorithm written in Matlab
- Predicts data movement if algorithm written in C/C++ or Fortran -> Minimum Required
- Predictions allow:
  - Evaluate design choices
  - Guide performance tuning

Slide 5: POP data structure
[Figure: POP using 20x24 blocks (gx1v3)]
- Flexible block structure
  - Land block elimination
- Small blocks
  - Better {load balance, land block elimination}
  - Larger halo overhead
- Larger blocks
  - Smaller halo overhead
  - Load imbalanced
  - No land block elimination
- Grid resolutions:
  - test: (128x192)
  - gx1v3: (320x384)
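The block-size trade-off on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each block carries a halo of `nghost` points on every side (a width of 2 is an assumption, not stated in the slides) and measures overhead as halo points per interior point. The 20x24 and 40x48 block sizes appear in the deck; 10x12 is a hypothetical smaller size added for comparison.

```python
def halo_overhead(bx, by, nghost=2):
    """Halo points per interior point for a bx-by-by block.

    nghost is the assumed halo width on each side of the block.
    """
    interior = bx * by
    padded = (bx + 2 * nghost) * (by + 2 * nghost)
    return (padded - interior) / interior


def blocks_per_side(nx, ny, bx, by):
    """Number of blocks in each direction (ceiling division)."""
    return (-(-nx // bx), -(-ny // by))


# gx1v3 grid (320x384) from the slide, with three candidate block sizes.
for bx, by in [(10, 12), (20, 24), (40, 48)]:
    nbx, nby = blocks_per_side(320, 384, bx, by)
    print(f"{bx}x{by}: {nbx}x{nby} blocks, "
          f"halo overhead {halo_overhead(bx, by):.1%}")
```

Smaller blocks give more blocks to balance and more chances to drop an all-land block, at the price of a larger halo fraction per block.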

Slide 6: Alternate Data Structure
2D data structure
- Advantages
  - Regular stride-1 access
  - Compact form of stencil operator
- Disadvantages
  - Includes land points
  - Problem-specific data structure
1D data structure
- Advantages
  - No more land points
  - General data structure
- Disadvantages
  - Indirect addressing
  - Larger stencil operator
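To illustrate the two layouts, here is a minimal sketch that applies the same operator once on a masked 2D grid and once on a compressed 1D vector of ocean points with neighbor index lists (indirect addressing). The 5-point operator, the mask, and all function names are assumptions chosen for brevity; they are not the POP stencil or the authors' code.

```python
OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # N, S, W, E neighbors


def build_1d(mask):
    """Map ocean points (mask == 1) to 1D indices and list their ocean neighbors."""
    index_of = {}
    for j, row in enumerate(mask):
        for i, is_ocean in enumerate(row):
            if is_ocean:
                index_of[(j, i)] = len(index_of)
    neighbors = []
    for (j, i) in index_of:  # dicts keep insertion order
        nbrs = []
        for dj, di in OFFSETS:
            k = index_of.get((j + dj, i + di))
            if k is not None:
                nbrs.append(k)
        neighbors.append(nbrs)
    return index_of, neighbors


def apply_2d(mask, u):
    """2D form: regular access, but every land point is still visited."""
    ny, nx = len(mask), len(mask[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            if not mask[j][i]:
                continue
            acc = 4.0 * u[j][i]
            for dj, di in OFFSETS:
                jj, ii = j + dj, i + di
                if 0 <= jj < ny and 0 <= ii < nx and mask[jj][ii]:
                    acc -= u[jj][ii]
            out[j][i] = acc
    return out


def apply_1d(neighbors, u1):
    """1D form: ocean points only, neighbors reached through index lists."""
    out = []
    for k, nbrs in enumerate(neighbors):
        acc = 4.0 * u1[k]
        for n in nbrs:
            acc -= u1[n]
        out.append(acc)
    return out


# Hypothetical 4x5 grid: 1 = ocean, 0 = land.
MASK = [[1, 1, 0, 1, 1],
        [1, 1, 0, 0, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 1, 1]]
U = [[float(10 * j + i) for i in range(5)] for j in range(4)]

INDEX_OF, NEIGHBORS = build_1d(MASK)
U1 = [U[j][i] for (j, i) in INDEX_OF]
V2D = apply_2d(MASK, U)
V1D = apply_1d(NEIGHBORS, U1)
```

The 1D vector holds only the ocean points, but each point now needs its own neighbor list, which is the "larger stencil operator" and "indirect addressing" cost named on the slide.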

Slide 8: Data movement
- Working set load size (WSL): (MM --> L1 cache)
- Measure using PAPI (WSL_M)
- Compute platforms:
  - Sun Ultra II (400 MHz)
  - IBM POWER4 (1.3 GHz)
  - SGI R14K (500 MHz)
- Compare with prediction (WSL_P)

Slide 9: Predicting Data Movement
- solver w/2D (Matlab): 4902 Kbytes
- solver w/1D (Matlab): 3218 Kbytes
- 1D data structure --> 34% reduction in data movement
- Predicts WSL_P

Slide 10: Measured versus Predicted data movement

Solver        WSL_P   Ultra II        POWER4          R14K
                      WSL_M   err     WSL_M   err     WSL_M   err
PCG2+2D v1    4902    5163     5%     5068     3%     5728    17%
PCG2+2D v2    4902    4905     0%     4865    -1%     4854    -1%
PCG2+1D       3218    3164    -2%     3335     4%     3473     8%

(WSL values in Kbytes.)
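The error columns are consistent with a relative error of measured against predicted load size, (WSL_M - WSL_P) / WSL_P. The sketch below recomputes them from the values on slides 9 and 10; the formula is inferred from the numbers rather than stated in the deck, and the variable names are illustrative.

```python
# Predicted working set load sizes (Kbytes), from slide 9.
WSL_P = {"PCG2+2D v1": 4902, "PCG2+2D v2": 4902, "PCG2+1D": 3218}

# Measured working set load sizes (Kbytes) per platform, from slide 10.
WSL_M = {
    "PCG2+2D v1": {"Ultra II": 5163, "POWER4": 5068, "R14K": 5728},
    "PCG2+2D v2": {"Ultra II": 4905, "POWER4": 4865, "R14K": 4854},
    "PCG2+1D":    {"Ultra II": 3164, "POWER4": 3335, "R14K": 3473},
}


def pct_err(measured, predicted):
    """Relative error of the measurement against the prediction, in percent."""
    return 100.0 * (measured - predicted) / predicted


def pct_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


ERR = {
    solver: {plat: round(pct_err(m, WSL_P[solver])) for plat, m in row.items()}
    for solver, row in WSL_M.items()
}
REDUCTION_1D = round(pct_reduction(WSL_P["PCG2+2D v2"], WSL_P["PCG2+1D"]))

for solver, row in ERR.items():
    print(solver, row)
print("predicted reduction, 2D -> 1D:", REDUCTION_1D, "%")
```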

Slide 11: Measured versus Predicted data movement
(Same table as slide 10.)
Annotation: Excessive data movement

Slide 12: Two blocks of source code

PCG2+2D v1 (w0 array accessed after loop!)

    do i=1,nblocks
       p(:,:,i)  = z(:,:,i) + p(:,:,i)*β
       q(:,:,i)  = A*p(:,:,i)
       w0(:,:,i) = q(:,:,i)*p(:,:,i)
    enddo
    delta = gsum(w0,lmask)

PCG2+2D v2 (extra access of w0 eliminated)

    ldelta = 0
    do i=1,nblocks
       p(:,:,i) = z(:,:,i) + p(:,:,i)*β
       q(:,:,i) = A*p(:,:,i)
       w0       = q(:,:,i)*p(:,:,i)
       ldelta   = ldelta + lsum(w0,lmask)
    enddo
    delta = gsum(ldelta)
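The difference between the two versions is a loop fusion: v1 stores `w0` for every block and reads it all again in the reduction, while v2 reduces each block's products while they are still hot and keeps only a scalar. The sketch below mimics that pattern on small Python lists. The function names, the flat per-block lists, and the sample data are illustrative assumptions, and the global reduction across processes is left out.

```python
import math


def delta_v1(p_blocks, q_blocks, masks):
    """v1 pattern: fill w0 for all blocks, then make a second pass to sum it."""
    w0 = []
    for p, q in zip(p_blocks, q_blocks):
        w0.append([pi * qi for pi, qi in zip(p, q)])
    # Second sweep over the whole w0 array, after the loop has finished.
    total = 0.0
    for block, mask in zip(w0, masks):
        total += sum(w for w, keep in zip(block, mask) if keep)
    return total


def delta_v2(p_blocks, q_blocks, masks):
    """v2 pattern: reduce each block inside the loop, keep only a scalar."""
    ldelta = 0.0
    for p, q, mask in zip(p_blocks, q_blocks, masks):
        w0 = [pi * qi for pi, qi in zip(p, q)]  # block-sized temporary
        ldelta += sum(w for w, keep in zip(w0, mask) if keep)
    return ldelta


# Hypothetical data: 3 blocks of 4 points each; mask 0 marks a land point.
P = [[1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5], [2.0, 2.0, 2.0, 2.0]]
Q = [[4.0, 3.0, 2.0, 1.0], [1.0, 1.0, 1.0, 1.0], [0.5, 0.25, 0.5, 0.25]]
MASKS = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1]]

D1 = delta_v1(P, Q, MASKS)
D2 = delta_v2(P, Q, MASKS)
```

Both functions return the same value; only the amount of data kept live between the product and the sum changes.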

Slide 13: Measured versus Predicted data movement
(Same table as slide 10.)
Annotation: Data movement matches predicted!

Slide 15: Using 1D data structures in POP2 solver (serial)
- Replace solvers.F90
- Execution time on cache microprocessors
- Examine two CG algorithms w/ diagonal preconditioner
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test [128x192 grid points] w/ (16x16)
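For reference, a textbook conjugate gradient iteration with a diagonal preconditioner looks like the sketch below. It performs two inner products per iteration, which is the property the slide attributes to PCG2. This is not the POP solver: the small tridiagonal test matrix, the stopping test on r·z, and all names are assumptions. The single-inner-product PCG1 variant is not shown.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pcg_diag(matvec, diag, b, tol=1e-10, max_iter=200):
    """Diagonally preconditioned CG for a symmetric positive definite operator."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                # residual for x = 0
    z = [r[i] / diag[i] for i in range(n)]     # apply preconditioner
    p = list(z)
    rz = dot(r, z)
    if rz == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        q = matvec(p)
        delta = dot(p, q)                      # inner product 1
        alpha = rz / delta
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * q[i] for i in range(n)]
        z = [r[i] / diag[i] for i in range(n)]
        rz_new = dot(r, z)                     # inner product 2
        if rz_new <= tol * tol:
            return x, it
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]   # p = z + p*beta
        rz = rz_new
    return x, max_iter


def tridiag_matvec(diag):
    """Operator with the given diagonal and -1 on both off-diagonals."""
    n = len(diag)

    def matvec(v):
        out = []
        for i in range(n):
            acc = diag[i] * v[i]
            if i > 0:
                acc -= v[i - 1]
            if i < n - 1:
                acc -= v[i + 1]
            out.append(acc)
        return out

    return matvec


# Hypothetical 8x8 diagonally dominant system whose exact solution is all ones.
DIAG = [3.0 + i for i in range(8)]
A = tridiag_matvec(DIAG)
B = A([1.0] * 8)
X, ITERS = pcg_diag(A, DIAG, B)
```

In a parallel solver each inner product becomes a global reduction, which is why the number of inner products per iteration matters for scalability.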

Slide 16: Serial execution time on IBM POWER4 (test)
56% reduction in cost/iteration

Slide 18: Using 1D data structure in POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms w/ diagonal preconditioner
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology:
  - Hypre (LLNL) http://www.llnl.gov/CASC/linear_solvers
    - PCG solver
    - Preconditioners: Diagonal
  - Hypre integration -> Work in progress

Slide 19: Solver execution time for POP2 (20x24) on BG/L (gx1v3)
Annotations: 48% cost/iteration; 27% cost/iteration

Slide 20: 64 processors != PetaScale

Slide 22: 0.1 degree POP
- Global eddy-resolving
- Computational grid: 3600 x 2400 x 40
- Land creates problems:
  - Load imbalances
  - Scalability
- Alternative partitioning algorithm:
  - Space-filling curves
- Evaluate using benchmark:
  - 1 day / Internal grid / 7 minute timestep

Slide 23: Partitioning with Space-filling Curves
- Map 2D -> 1D
- Variety of sizes
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [New]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [New]
- Partitioning 1D array
[Figure label: Nb]
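As an illustration of the 2D -> 1D mapping for the Hilbert case (Nb = 2^n), the sketch below uses the widely published iterative conversion from block coordinates to position along the curve. It covers only the pure Hilbert curve; the Peano, Cinco, and mixed curves listed above are not shown, and the function names are illustrative rather than taken from POP.

```python
def hilbert_index(nb, x, y):
    """Position of block (x, y) along a Hilbert curve on an nb-by-nb grid.

    nb must be a power of two.
    """
    d = 0
    s = nb // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = nb - 1 - x
                y = nb - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(nb):
    """All blocks of an nb-by-nb grid, sorted along the curve."""
    cells = [(x, y) for y in range(nb) for x in range(nb)]
    return sorted(cells, key=lambda c: hilbert_index(nb, c[0], c[1]))


# Nb = 2^4 = 16 blocks per side, as for gx1v3 with 20x24 blocks.
ORDER = hilbert_order(16)
```

Consecutive blocks along the curve are always neighbors in 2D, so a contiguous piece of the 1D ordering is a compact patch of the grid.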

Slide 24: Partitioning with SFC
Partition for 3 processors

Slide 25: POP using 20x24 blocks (gx1v3)

Slide 26: POP (gx1v3) + Space-filling curve

Slide 27: Space-filling curve (Hilbert Nb=2^4)

Slide 28: Remove Land blocks

Slide 29: Space-filling curve partition for 8 processors
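The sequence on slides 25 to 29 (order the blocks along the curve, remove land blocks, cut the remaining 1D list into one contiguous piece per processor) can be sketched as follows. The splitting rule here, piece sizes differing by at most one block, is an assumption about the details, and the block ids and land set are made up for the example.

```python
def partition_curve(ordered_blocks, land, nprocs):
    """Drop land blocks, then split the curve into nprocs contiguous pieces."""
    ocean = [b for b in ordered_blocks if b not in land]
    base, extra = divmod(len(ocean), nprocs)
    parts, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)  # first ranks take the remainder
        parts.append(ocean[start:start + size])
        start += size
    return parts


# Hypothetical example: 16 blocks already numbered along the curve,
# three of them entirely land, distributed over 8 processors.
BLOCKS = list(range(16))
LAND = {3, 4, 10}
PARTS = partition_curve(BLOCKS, LAND, 8)

for rank, part in enumerate(PARTS):
    print(f"processor {rank}: blocks {part}")
```

Because land blocks are removed before the cut, no processor is assigned work that contains no ocean points.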

Slide 30: POP 0.1 degree benchmark on Blue Gene/L

Slide 31: POP 0.1 degree benchmark
Courtesy of Y. Yoshida, M. Taylor, P. Worley

Slide 32: Conclusions
- 1D data structures in Barotropic Solver
  - No more land points
  - Reduces execution time vs 2D data structure
    - 48% reduction in Solver time! (64 procs BG/L)
    - 9.5% reduction in Total time! (64 procs POWER4)
  - Allows use of solver/preconditioner packages
  - Implementation quality critical!
- Automated Memory Analysis (SLAMM)
  - Evaluate design choices
  - Guide performance tuning

Slide 33: Conclusions (cont)
- Good scalability to 32K processors on BG/L
  - Increase simulation rate by 2x on 32K processors
    - SFC partitioning
    - 1D data structure in solver
  - Modify 7 source files
- Future work
  - Improve scalability
    - 55% efficiency 1K => 32K
  - Better preconditioners
  - Improve load balance
    - Different block sizes
    - Improve partitioning algorithm

Slide 34: Acknowledgements/Questions?
Thanks to:
- F. Bryan (NCAR)
- J. Edwards (IBM)
- P. Jones (LANL)
- K. Lindsay (NCAR)
- M. Taylor (SNL)
- H. Tufo (NCAR)
- W. Waite (CU)
- S. Weese (NCAR)
Blue Gene/L time:
- NSF MRI Grant
- NCAR
- University of Colorado
- IBM (SUR) program
- BGW Consortium Days
- IBM research (Watson)

Slide 35: Serial Execution time on Multiple platforms (test)

Slide 36: Total execution time for POP2 (40x48) on POWER4 (gx1v3)
9.5% reduction
Eliminate need for ~216,000 CPU hours per year @ NCAR

Slide 37: POP 0.1 degree

blocksize   Nb     Nb^2     Max ||
36x24       100    10000     7545
30x20       120    14400    10705
24x16       150    22500    16528
18x12       200    40000    28972
15x10       240    57600    41352
12x8        300    90000    64074

Increasing || -->
Decreasing overhead -->
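The Nb and Nb^2 columns follow directly from the 3600 x 2400 horizontal grid and the block size, as the sketch below checks. It also reports Max || divided by Nb^2, on the assumption that "Max ||" counts the blocks remaining after land block elimination; that reading is an inference, not something the slide states.

```python
GRID = (3600, 2400)  # horizontal grid of the 0.1 degree configuration

# (block size, Nb, Nb^2, Max ||) as listed on the slide.
TABLE = [
    ((36, 24), 100, 10000, 7545),
    ((30, 20), 120, 14400, 10705),
    ((24, 16), 150, 22500, 16528),
    ((18, 12), 200, 40000, 28972),
    ((15, 10), 240, 57600, 41352),
    ((12, 8), 300, 90000, 64074),
]


def blocks_per_side(grid, block):
    """Blocks along each axis; the block size must divide the grid evenly."""
    (nx, ny), (bx, by) = grid, block
    if nx % bx or ny % by:
        raise ValueError("block size does not divide the grid")
    return nx // bx, ny // by


ROWS = []
for block, nb, nb2, max_par in TABLE:
    nbx, nby = blocks_per_side(GRID, block)
    ROWS.append((block, nbx, nby, nbx * nby, max_par / nb2))
    print(f"{block[0]}x{block[1]}: Nb = {nbx}, Nb^2 = {nbx * nby}, "
          f"Max || / Nb^2 = {max_par / nb2:.3f}")
```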
