Published by Davion Goodier. Modified over 2 years ago.

1
Applying Automated Memory Analysis to Improve the Iterative Solver in the Parallel Ocean Program
John M. Dennis
Elizabeth R. Jessup
April 5, 2006

2
Petascale Computation for the Geosciences Workshop
Motivation
- Outgrowth of PhD thesis
  - Memory-efficient iterative solvers
  - Data movement is expensive
  - Developed techniques to improve memory efficiency
- Apply Automated Memory Analysis to POP
- Parallel Ocean Program (POP) solver
  - Large percentage of execution time
  - Scalability issues

3
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

4
Automated Memory Analysis?
- Analyzes an algorithm written in Matlab
- Predicts the data movement if the algorithm were written in C/C++ or Fortran -> minimum required
- Predictions allow us to:
  - Evaluate design choices
  - Guide performance tuning

5
POP data structure
- Flexible block structure
- Land block elimination
- Small blocks
  - Better {load balance, land block elimination}
  - Larger halo overhead
- Larger blocks
  - Smaller halo overhead
  - Load imbalanced
  - No land block elimination
- Grid resolutions:
  - test: (128x192)
  - gx1v3: (320x384)
[Figure: POP using 20x24 blocks (gx1v3)]
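The block-size trade-off can be made concrete with a little arithmetic. The sketch below counts the blocks needed to tile the gx1v3 grid and estimates the halo overhead per block. The halo width of 2 and the function names are assumptions made for illustration; they are not taken from the slides.

```python
def block_count(nx, ny, bx, by):
    """Number of bx-by-by blocks needed to tile an nx-by-ny grid."""
    # Ceiling division, so partially filled edge blocks are still counted.
    return (-(-nx // bx)) * (-(-ny // by))


def halo_overhead(bx, by, halo=2):
    """Ratio of halo points to interior points for one block (halo width assumed)."""
    interior = bx * by
    padded = (bx + 2 * halo) * (by + 2 * halo)
    return (padded - interior) / interior


print("gx1v3 blocks at 20x24:", block_count(320, 384, 20, 24))
print("halo overhead, 20x24 blocks:", round(halo_overhead(20, 24), 3))
print("halo overhead, 40x48 blocks:", round(halo_overhead(40, 48), 3))
```

Under these assumptions the smaller block carries roughly twice the relative halo cost of the larger one, which is the overhead side of the trade-off listed above.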

6
Alternate Data Structure
- 2D data structure
  - Advantages
    - Regular stride-1 access
    - Compact form of stencil operator
  - Disadvantages
    - Includes land points
    - Problem-specific data structure
- 1D data structure
  - Advantages
    - No more land points
    - General data structure
  - Disadvantages
    - Indirect addressing
    - Larger stencil operator
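A minimal sketch of the two layouts may help. It applies the same operator once on a masked 2D grid and once on a packed 1D vector that stores ocean points only and reaches neighbours through an index table. The 5-point stencil, the tiny mask, and all names are assumptions for illustration; POP's actual operator and data layout are not reproduced here.

```python
# Offsets of the four neighbours in a 5-point stencil (an assumed stand-in operator).
OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def stencil_2d(u, mask):
    """Apply the stencil on the full 2D grid; land points are skipped and stay 0."""
    ny, nx = len(u), len(u[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            if not mask[j][i]:
                continue  # land point: stored in the 2D array but never updated
            acc = 4.0 * u[j][i]
            for dj, di in OFFSETS:
                jj, ii = j + dj, i + di
                if 0 <= jj < ny and 0 <= ii < nx and mask[jj][ii]:
                    acc -= u[jj][ii]
            out[j][i] = acc
    return out


def build_1d(mask):
    """Number the ocean points and record, per point, its ocean neighbours' indices."""
    index = {}
    for j, row in enumerate(mask):
        for i, is_ocean in enumerate(row):
            if is_ocean:
                index[(j, i)] = len(index)
    points = list(index)  # dicts keep insertion order, so position == 1D index
    neighbours = [
        [index[(j + dj, i + di)] for dj, di in OFFSETS if (j + dj, i + di) in index]
        for (j, i) in points
    ]
    return points, neighbours


def stencil_1d(v, neighbours):
    """Same stencil on the packed 1D vector, using indirect addressing."""
    return [4.0 * v[k] - sum(v[n] for n in neighbours[k]) for k in range(len(v))]


mask = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
u = [[float(4 * j + i) for i in range(4)] for j in range(3)]

points, neighbours = build_1d(mask)
v = [u[j][i] for (j, i) in points]  # pack ocean values into the 1D vector
out_2d = stencil_2d(u, mask)
out_1d = stencil_1d(v, neighbours)
print(len(points), "ocean points stored instead of", 3 * 4)
```

The 1D form stores no land points but pays for an explicit neighbour table, which matches the advantages and disadvantages listed on the slide.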

7
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

8
Data movement
- Working set load size (WSL) (MM --> L1 cache)
- Measure using PAPI (WSL_M)
- Compute platforms:
  - Sun Ultra II (400 MHz)
  - IBM POWER4 (1.3 GHz)
  - SGI R14K (500 MHz)
- Compare with prediction (WSL_P)

9
Predicting Data Movement
- solver w/2D (Matlab): 4902 Kbytes
- solver w/1D (Matlab): 3218 Kbytes
- 1D data structure --> 34% reduction in data movement
- Predicts WSL_P
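The 34% figure follows directly from the two predicted sizes. A short check, with a helper name chosen here for illustration:

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


wsl_2d_kbytes = 4902  # predicted for the solver with the 2D data structure
wsl_1d_kbytes = 3218  # predicted for the solver with the 1D data structure

reduction = percent_reduction(wsl_2d_kbytes, wsl_1d_kbytes)
print(f"predicted reduction: {reduction:.1f}%")
```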

10
Measured versus Predicted data movement

Solver        WSL_P   Ultra II        POWER4          R14K
                      WSL_M   err     WSL_M   err     WSL_M   err
PCG2+2D v1    …       …       …%      5068    3%      5728    17%
PCG2+2D v2    …       …       …%      4865    -1%     4854    -1%
PCG2+1D       …       …       …%      3335    4%      3473    8%
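The surviving error columns can be cross-checked against the predictions quoted on the "Predicting Data Movement" slide (4902 Kbytes for 2D, 3218 Kbytes for 1D). The sketch below assumes that err is the relative difference (WSL_M - WSL_P) / WSL_P and that both 2D variants share the 2D prediction. Neither assumption is stated in the slides, but under them the rounded results agree with the table.

```python
# Predicted working set load sizes in Kbytes, from the prediction slide.
PREDICTED = {"2D": 4902, "1D": 3218}

# Measured values in Kbytes that survive in the table: (solver, layout, platform).
MEASURED = {
    ("PCG2+2D v1", "2D", "POWER4"): 5068,
    ("PCG2+2D v1", "2D", "R14K"): 5728,
    ("PCG2+2D v2", "2D", "POWER4"): 4865,
    ("PCG2+2D v2", "2D", "R14K"): 4854,
    ("PCG2+1D", "1D", "POWER4"): 3335,
    ("PCG2+1D", "1D", "R14K"): 3473,
}


def relative_error_percent(measured, predicted):
    """Assumed definition of the err column: (measured - predicted) / predicted."""
    return 100.0 * (measured - predicted) / predicted


errors = {
    key: round(relative_error_percent(value, PREDICTED[key[1]]))
    for key, value in MEASURED.items()
}
for (solver, _layout, platform), err in errors.items():
    print(f"{solver:11s} {platform:7s} {err:+d}%")
```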

11
Measured versus Predicted data movement (table as on slide 10)
Callout: Excessive data movement

12
Two blocks of source code

PCG2+2D v1 (w0 array accessed after loop!)

    do i=1,nblocks
       p(:,:,i)  = z(:,:,i) + p(:,:,i)*beta
       q(:,:,i)  = A*p(:,:,i)
       w0(:,:,i) = q(:,:,i)*p(:,:,i)      ! w0 stored for every block
    enddo
    delta = gsum(w0,lmask)                ! second pass over all of w0

PCG2+2D v2 (extra access of w0 eliminated)

    ldelta = 0
    do i=1,nblocks
       p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
       q(:,:,i) = A*p(:,:,i)
       w0       = q(:,:,i)*p(:,:,i)       ! block-sized temporary
       ldelta   = ldelta + lsum(w0,lmask) ! reduced while still in cache
    enddo
    delta = gsum(ldelta)
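The difference between the two versions can be mimicked in a few lines of Python. This is a sketch only: `apply_operator` is a stand-in scaling rather than POP's stencil, blocks are flat lists, and the function names are invented for illustration. It shows that accumulating the masked sum inside the loop gives the same result without keeping a full-size `w0` for a second pass.

```python
import copy


def apply_operator(block):
    """Stand-in for the operator A (assumption: a plain scaling, not POP's stencil)."""
    return [2.0 * value for value in block]


def masked_sum(values, mask):
    """Sum only the entries whose mask flag is set (the role played by lmask)."""
    return sum(v for v, keep in zip(values, mask) if keep)


def update_v1(z, p, beta, lmask):
    """v1: store w0 for every block, then reduce it in a second pass."""
    nblocks = len(p)
    q = [None] * nblocks
    w0 = [None] * nblocks  # full-size temporary, one entry per block
    for i in range(nblocks):
        p[i] = [zi + pi * beta for zi, pi in zip(z[i], p[i])]
        q[i] = apply_operator(p[i])
        w0[i] = [qi * pi for qi, pi in zip(q[i], p[i])]
    # w0 is read again after the loop: one extra sweep through memory.
    delta = sum(masked_sum(w0[i], lmask[i]) for i in range(nblocks))
    return q, delta


def update_v2(z, p, beta, lmask):
    """v2: reduce each block's w0 immediately, so only a block-sized temporary exists."""
    nblocks = len(p)
    q = [None] * nblocks
    ldelta = 0.0
    for i in range(nblocks):
        p[i] = [zi + pi * beta for zi, pi in zip(z[i], p[i])]
        q[i] = apply_operator(p[i])
        w0 = [qi * pi for qi, pi in zip(q[i], p[i])]
        ldelta += masked_sum(w0, lmask[i])
    return q, ldelta  # a parallel code would follow this with a global sum


z = [[1.0, 2.0], [3.0, 4.0]]
p = [[1.0, 1.0], [2.0, 2.0]]
lmask = [[True, False], [True, True]]

q1, delta1 = update_v1(z, copy.deepcopy(p), 0.5, lmask)
q2, delta2 = update_v2(z, copy.deepcopy(p), 0.5, lmask)
print(delta1, delta2)
```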

13
Measured versus Predicted data movement (table as on slide 10)
Callout: Data movement matches predicted!

14
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

15
Using 1D data structures in POP2 solver (serial)
- Replace solvers.F90
- Execution time on cache microprocessors
- Examine two CG algorithms w/ diagonal preconditioner
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test [128x192 grid points] w/ (16x16)
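For readers unfamiliar with the structure being compared, here is a minimal textbook preconditioned conjugate gradient loop with a diagonal preconditioner. It has two inner products per iteration, which is the shape the slide labels PCG2. It is a generic sketch on a small dense matrix, not POP's solver, and the single-inner-product PCG1 variant of [D'Azevedo 93] is not shown.

```python
def dot(a, b):
    """Inner product; in a parallel solver each call implies a global reduction."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    """Dense matrix-vector product, standing in for the stencil operator."""
    return [dot(row, x) for row in A]


def pcg_diagonal(A, b, tol=1e-10, max_iter=200):
    """Textbook PCG with M = diag(A); returns (solution, iterations used)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # diagonal preconditioner
    x = [0.0] * n
    r = list(b)  # residual for the zero initial guess
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)  # setup inner product, done once
    if rz == 0.0:
        return x, 0  # right-hand side is zero: nothing to solve
    for iteration in range(1, max_iter + 1):
        q = matvec(A, p)
        alpha = rz / dot(p, q)  # inner product 1 of 2
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)  # inner product 2 of 2
        if rz_new ** 0.5 <= tol:  # reuse it as the convergence measure
            return x, iteration
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [1.0, 2.0, 3.0]
solution, iterations = pcg_diagonal(A, b)
print(solution, iterations)
```

Each inner product is a synchronization point in a parallel run, which is why a variant needing only one per iteration is worth examining.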

16
Serial execution time on IBM POWER4 (test)
[Figure callout: 56% reduction in cost/iteration]

17
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

18
Using 1D data structure in POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms w/ diagonal preconditioner
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology:
  - Hypre (LLNL)
    - PCG solver
    - Preconditioners: Diagonal
  - Hypre integration -> Work in progress

19
Solver execution time for POP2 (20x24) on BG/L (gx1v3)
[Figure callouts: 48% cost/iteration; 27% cost/iteration]

20
64 processors != PetaScale

21
Outline:
- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

22
0.1 degree POP
- Global eddy-resolving
- Computational grid: 3600 x 2400 x 40
- Land creates problems:
  - load imbalances
  - scalability
- Alternative partitioning algorithm:
  - Space-filling curves
- Evaluate using Benchmark:
  - 1 day / Internal grid / 7 minute timestep

23
Partitioning with Space-filling Curves
- Map 2D -> 1D
- Variety of sizes
  - Hilbert (Nb=2^n)
  - Peano (Nb=3^m)
  - Cinco (Nb=5^p) [New]
  - Hilbert-Peano (Nb=2^n 3^m)
  - Hilbert-Peano-Cinco (Nb=2^n 3^m 5^p) [New]
- Partitioning 1D array
[Figure label: Nb]
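The 2D -> 1D mapping can be sketched for the Hilbert case with the widely used bit-manipulation formulation of the Hilbert index. This is an illustration under the assumption that Nb counts blocks per side. It is not the authors' implementation, and the Peano and Cinco curves are not covered.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n-by-n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def curve_order(n):
    """All cells of the n-by-n grid, listed in the order the curve visits them."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda cell: hilbert_index(n, cell[0], cell[1]))


order = curve_order(16)  # Nb = 2^4 blocks per side, 256 blocks in total
print(len(order), "blocks; first four:", order[:4])
```

Because consecutive cells on the curve are always grid neighbours, cutting the resulting 1D list into contiguous pieces yields spatially compact partitions.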

24
Partitioning with SFC
[Figure: Partition for 3 processors]

25
POP using 20x24 blocks (gx1v3)

26
POP (gx1v3) + Space-filling curve

27
Space-filling curve (Hilbert Nb=2^4)

28
Remove Land blocks

29
Space-filling curve partition for 8 processors
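The last two steps of the sequence, dropping land blocks and cutting the remaining curve into one piece per processor, can be sketched as follows. The serpentine ordering is only a self-contained stand-in for a real space-filling curve, and the land set, the equal-count splitting rule, and the function names are assumptions for illustration.

```python
def partition_along_curve(curve, is_ocean, nprocs):
    """Drop land blocks, then split the remaining curve into nprocs contiguous pieces."""
    ocean_blocks = [blk for blk in curve if is_ocean(blk)]
    base, extra = divmod(len(ocean_blocks), nprocs)
    parts, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        parts.append(ocean_blocks[start:start + size])
        start += size
    return parts


def serpentine(n):
    """Row-by-row back-and-forth ordering, used here in place of a Hilbert curve."""
    order = []
    for y in range(n):
        xs = range(n) if y % 2 == 0 else range(n - 1, -1, -1)
        order.extend((x, y) for x in xs)
    return order


land = {(1, 1), (2, 1), (3, 3)}  # hypothetical land blocks on a 4x4 block grid
parts = partition_along_curve(serpentine(4), lambda blk: blk not in land, 8)
for rank, blocks in enumerate(parts):
    print(rank, blocks)
```

Splitting by block count treats every ocean block as equal work. A weighted split would be needed if blocks differ in their number of ocean points.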

30
POP 0.1 degree benchmark on Blue Gene/L

31
POP 0.1 degree benchmark
Courtesy of Y. Yoshida, M. Taylor, P. Worley

32
Conclusions
- 1D data structures in Barotropic Solver
  - No more land points
  - Reduces execution time vs 2D data structure
    - 48% reduction in Solver time! (64 procs BG/L)
    - 9.5% reduction in Total time! (64 procs POWER4)
  - Allows use of solver/preconditioner packages
  - Implementation quality critical!
- Automated Memory Analysis (SLAMM)
  - Evaluate design choices
  - Guide performance tuning

33
Conclusions (cont)
- Good scalability to 32K processors on BG/L
  - Increase simulation rate by 2x on 32K processors
    - SFC partitioning
    - 1D data structure in solver
  - Modify 7 source files
- Future work
  - Improve scalability
    - 55% Efficiency 1K => 32K
    - Better preconditioners
  - Improve load-balance
    - Different block sizes
    - Improve partitioning algorithm
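To put the efficiency figure in perspective: if the 55% is read as parallel efficiency of the 32K run relative to the 1K run (an assumption, since the slide does not define it), the implied speedup follows from the processor ratio.

```python
def implied_speedup(efficiency, procs_small, procs_large):
    """Speedup of the large run over the small one implied by a relative efficiency."""
    return efficiency * (procs_large / procs_small)


# Assumes 1K and 32K mean 1024 and 32768 processors; only the ratio of 32 matters.
speedup = implied_speedup(0.55, 1024, 32768)
print(f"{speedup:.1f}x faster on 32x as many processors")
```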

34
Acknowledgements/Questions?
Thanks to:
- F. Bryan (NCAR)
- J. Edwards (IBM)
- P. Jones (LANL)
- K. Lindsay (NCAR)
- M. Taylor (SNL)
- H. Tufo (NCAR)
- W. Waite (CU)
- S. Weese (NCAR)
Blue Gene/L time:
- NSF MRI Grant
- NCAR
- University of Colorado
- IBM (SUR) program
- BGW Consortium Days
- IBM research (Watson)

35
Serial execution time on multiple platforms (test)

36
Total execution time for POP2 (40x48) on POWER4 (gx1v3)
- 9.5% reduction
- Eliminate need for ~216,000 CPU hours per NCAR

37
POP 0.1 degree
Table columns: blocksize, Nb, Nb^2, Max ||
(table values garbled: 36x x x x x x)
Increasing || -->
Decreasing overhead -->
