Slide 1: Chip-Multiprocessors & You. John Dennis (dennis@ucar.edu), March 16, 2007

Slide 2: Intel "Tera Chip"
- 80-core chip
- 1 teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process:
  - 45 nm technology
  - high-K
- 2D mesh network:
  - each processor has a 5-port router
  - connects to "3D memory"

Slide 3: Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Slide 4: Moore's Law
- Most things are twice as nice every 18 months:
  - transistor count
  - processor speed
  - DRAM density
- Historical result:
  - solve a problem twice as large in the same time
  - solve the same size problem in half the time
  --> Inactivity leads to progress!

Slide 5: The Advent of Chip-Multiprocessors (Moore's Law gone bad!)

Slide 6: New Implications of Moore's Law
- Every 18 months:
  - number of cores per socket doubles
  - memory density doubles
  - clock cycle may increase slightly
- 18 months from now:
  - 8 cores per socket
  - slight increase in clock cycle (~15%)
  - same memory per core!

Slide 7: New Implications of Moore's Law (cont'd)
- Inactivity leads to no progress!
- Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size with scalable memory: more processors enable ~2x reduction in time to solution
  - Bigger problem size with non-scalable memory: may limit the number of processors that can be used; waste half the cores on a socket just to use its memory?
- All components of the application must scale to benefit from Moore's Law increases. The memory footprint problem will not solve itself!

Slide 8: Questions?

Slide 9: Parallel I/O Library (PIO). John Dennis (dennis@ucar.edu) and Ray Loy (rloy@mcs.anl.gov), March 16, 2007

Slide 10: Introduction
- All component models need parallel I/O
- Serial I/O is bad:
  - increased memory requirement
  - typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

Slide 11: Design Goals
- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement

Slide 12: Design Goals (cont'd)
- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats:
  - {sequential, direct} binary
  - netCDF
- Preserve the format of input/output files
- Supports 1D, 2D, and 3D arrays
  - currently XY
  - extensible to XZ or YZ

Slide 13: Terms and Concepts
- PnetCDF [ANL]:
  - high-performance I/O
  - different interface
  - stable
- netCDF4 + HDF5 [NCSA]:
  - same interface
  - needs the HDF5 library
  - less stable
  - lower performance
  - no support on Blue Gene

Slide 14: Terms and Concepts (cont'd)
- Processor stride: allows matching a subset of MPI I/O nodes to the system hardware
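To make the stride idea concrete, here is a minimal sketch (Python, not the actual PIO interface; the function name and arguments are illustrative) of selecting every stride-th compute rank as an I/O task:

```python
# Illustrative only: pick every "stride"-th MPI rank as an I/O task so the
# I/O tasks can be lined up with hardware (e.g., one per node card or per
# I/O link). Not the PIO API; names here are hypothetical.

def io_task_ranks(num_comp_tasks, num_io_tasks, stride, base=0):
    """Return the ranks of the compute tasks that will also perform I/O."""
    ranks = list(range(base, num_comp_tasks, stride))
    if len(ranks) < num_io_tasks:
        raise ValueError("stride too large for the requested number of I/O tasks")
    return ranks[:num_io_tasks]

print(io_task_ranks(num_comp_tasks=32, num_io_tasks=4, stride=8))  # [0, 8, 16, 24]
```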

Slide 15: Terms and Concepts (cont'd)
- I/O decomposition vs. compute decomposition:
  - if the I/O decomp == the compute decomp: MPI-IO + message aggregation
  - if the I/O decomp != the compute decomp: need a rearranger (MCT)
- No component-specific information in the library
  - pair with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the sketch below)
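The flattening requirement can be illustrated with a small sketch. The helper below is hypothetical (not part of PIO) and assumes a rectangular block of a row-major global 2D grid; it returns the 1D data plus the global degree-of-freedom indices that a rearranger such as MCT could use to map between compute and I/O decompositions:

```python
import numpy as np

def flatten_block(local2d, j0, i0, nx_global):
    """Flatten one rectangular block (rows starting at j0, columns at i0)
    of a global (ny, nx) array and return the data together with its
    1-based, row-major global degrees of freedom."""
    nyl, nxl = local2d.shape
    data1d = local2d.reshape(nyl * nxl)                     # flattened local data
    jj, ii = np.meshgrid(np.arange(nyl), np.arange(nxl), indexing="ij")
    gdof = (j0 + jj) * nx_global + (i0 + ii) + 1            # 1-based global index
    return data1d, gdof.reshape(nyl * nxl)

block = np.arange(6.0).reshape(2, 3)                        # a 2x3 local block
data, dof = flatten_block(block, j0=1, i0=2, nx_global=8)
print(dof)                                                  # [11 12 13 19 20 21]
```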

Slide 16: Component Model 'Issues'
- POP & CICE:
  - missing blocks
  - update of neighbors' halos
  - who writes the missing blocks?
  - asymmetry between read and write
  - 'sub-block' decompositions are not rectangular
- CLM:
  - decomposition not rectangular
  - who writes the missing data?

Slide 17: What Works
- Binary I/O [direct]:
  - tested on POWER5, BG/L
  - rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF:
  - rearrange with MCT [new]
  - reduced memory
- PnetCDF:
  - rearrange with MCT
  - no rearrangement
  - tested on POWER5, BG/L

Slide 18: What Works (cont'd)
- Prototype added to POP2:
  - reads restart and forcing files correctly
  - writes binary restart files correctly
  - necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]:
  - writes netCDF history files correctly
- POPIO benchmark:
  - 2D array [3600 x 2400] (70 MB)
  - test code for correctness and performance
  - tested on 30K BG/L processors in Oct 2006
- Performance:
  - POWER5: 2-3x the serial I/O approach
  - BG/L: mixed

Slide 19: Complexity / Remaining Issues
- Multiple ways to express a decomposition (compared in the sketch below):
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - subarrays: start + count (PnetCDF)
    - limited expressiveness
    - will not support 'sub-block' decompositions in POP & CICE, or CLM
- Need a common language for the interface between component model and library
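As a sketch of the two descriptions, here is the same rectangular piece of a 4 x 8 global array expressed both ways (names and conventions are mine, not the library's); only the GDOF form can also describe non-rectangular point sets such as the POP/CICE 'sub-block' or CLM decompositions:

```python
ny, nx = 4, 8
start = (1, 2)    # 0-based (row, col) origin of the local block
count = (2, 3)    # the block spans 2 rows by 3 columns

# PnetCDF-style subarray description: compact, but rectangles only.
subarray = {"start": start, "count": count}

# GDOF-style description: an explicit list of 1-based, row-major global
# indices; verbose, but it can express arbitrary point sets.
gdof = [r * nx + c + 1
        for r in range(start[0], start[0] + count[0])
        for c in range(start[1], start[1] + count[1])]

print(subarray)   # {'start': (1, 2), 'count': (2, 3)}
print(gdof)       # [11, 12, 13, 19, 20, 21]
```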

Slide 20: Conclusions
- Working prototype:
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress:
  - multiple efforts underway
  - accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM Subversion repository

Slide 21: Fun with Large Processor Counts: POP, CICE. John Dennis (dennis@ucar.edu), March 16, 2007

Slide 22: Motivation
- Can the Community Climate System Model (CCSM) be a petascale application?
  - use 10-100K processors per simulation
- Increasingly common access to large systems:
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - lag time in the Top 500 list: 4-5 years
  - at NCAR before 2015

Slide 23: Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Slide 24: Status of POP
- Access to 17K Cray XT4 processors:
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won a BGW cycle allocation: "Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport" [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - includes a suite of dye-like tracers
  - simulate the eddy diffusivity tensor

Slide 25: Status of POP (cont'd)
- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write the history file
- Start runs in 4-6 weeks

Slide 26: Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Slide 27: Status of CICE
- Tested CICE @ 1/10°:
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC):
  - erfc
  - climatology

Slide 28: POP (gx1v3) + space-filling curve (figure)

Slide 29: Space-filling curve partition for 8 processors (figure)

Slide 30: Weighted Space-Filling Curves
- Estimate the work for each grid block:
  Work_i = w0 + P_i * w1
  where:
    w0: fixed work for all blocks
    w1: additional work if the block contains sea ice
    P_i: probability that the block contains sea ice
- For our experiments: w0 = 2, w1 = 10
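A one-line sketch of this work estimate with the weights quoted on the slide:

```python
W0, W1 = 2.0, 10.0   # fixed work per block; extra work if the block has sea ice

def block_work(p_ice):
    """Estimated cost of a block: Work_i = w0 + P_i * w1."""
    return W0 + p_ice * W1

print(block_work(0.0), block_work(1.0))   # 2.0 for ice-free, 12.0 for ice-covered
```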

Slide 31: Probability Function
- Error function:
  P_i = erfc((μ - max(|lat_i|)) / σ)
  where:
    lat_i: maximum latitude in block i
    μ: mean sea-ice extent
    σ: variance in sea-ice extent
- μ_NH = 70°, μ_SH = 60°, σ = 5°
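A sketch of this error-function probability using the parameters from the slide; capping the result at 1 is my assumption (erfc itself ranges over (0, 2)), not something stated in the deck:

```python
import math

MU_NH, MU_SH, SIGMA = 70.0, 60.0, 5.0   # degrees, from the slide

def ice_probability(max_abs_lat, northern=True):
    """P_i = erfc((mu - max|lat_i|) / sigma), capped at 1 (an assumption)."""
    mu = MU_NH if northern else MU_SH
    return min(1.0, math.erfc((mu - max_abs_lat) / SIGMA))

for lat in (40.0, 65.0, 75.0, 85.0):
    print(lat, round(ice_probability(lat), 3))   # rises from ~0 toward 1 near the pole
```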

Slide 32: 1° CICE4 on 20 processors (figure). Small domains @ high latitudes; large domains @ low latitudes.

Slide 33: 0.1° CICE4
- Developed at LANL
- Finite difference
- Models sea ice
- Shares grid and infrastructure with POP
  - reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - use weighted space-filling curves?
- Evaluate using a benchmark:
  - 1 day / initial run / 30-minute timestep / no forcing

Slide 34: CICE4 @ 0.1° (figure)

Slide 35: Timings for 1°, npes=160, μ_NH = 70° (figure). Load imbalance: Hudson Bay is south of 70°.

Slide 36: Timings for 1°, npes=160, μ_NH = 55° (figure)

Slide 37: Better Probability Function
- Climatological function (the formula itself was an image and is not in the transcript), where:
  - a_ij: climatological maximum sea-ice extent [satellite observation] (the original Greek symbol was lost; a_ij stands in for it here)
  - n_i: the number of points within block i with non-zero a_ij
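Since the formula did not survive the transcript, here is one plausible reading, stated purely as an assumption: take P_i to be the fraction of points in block i whose climatological maximum sea-ice extent is non-zero.

```python
import numpy as np

# Assumption, not the slide's formula: P_i = (points with non-zero
# climatological ice extent in block i) / (total points in block i).

def climatological_probability(ice_extent_block):
    """ice_extent_block: 2D array of climatological max ice extent for one block."""
    n_nonzero = np.count_nonzero(ice_extent_block)   # the slide's n_i
    return n_nonzero / ice_extent_block.size

block = np.array([[0.0, 0.3, 0.8],
                  [0.0, 0.0, 0.5]])
print(climatological_probability(block))   # 0.5
```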

Slide 38: Timings for 1°, npes=160, climate-based weighting (figure). Reduces dynamics sub-cycling time by 28%!

Slide 39: Acknowledgements / Questions?
- Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
- Computer time:
  - Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
  - Cray XT3/4 time: ORNL, Sandia

Slide 40: Partitioning with Space-Filling Curves
- Map 2D -> 1D
- Variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partition the resulting 1D array of Nb blocks (see the sketch below)
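The final "partition the 1D array" step can be sketched as a simple greedy cut over per-block work estimates; the heuristic below is mine, not necessarily the partitioner used in POP/CICE:

```python
def partition_curve(weights, nparts):
    """Walk the blocks in space-filling-curve order and assign each block to
    the partition whose share of the total work its midpoint falls into,
    giving contiguous chunks of roughly equal weight."""
    target = float(sum(weights)) / nparts
    owner, cum = [], 0.0
    for w in weights:
        part = min(nparts - 1, int((cum + w / 2.0) // target))
        owner.append(part)
        cum += w
    return owner

# ice-covered blocks (weight 12) at the ends of the curve, ice-free (2) between
weights = [12, 12, 2, 2, 2, 2, 12, 12]
print(partition_curve(weights, 4))   # [0, 1, 1, 1, 2, 2, 2, 3]
```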

Slide 41: Scalable Data Structures
- Common problem among applications:
  - WRF: serial I/O [fixed]; duplication of lateral boundary values
  - POP & CICE: serial I/O
  - CLM: serial I/O; duplication of grid info

Slide 42: Scalable Data Structures (cont'd)
- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!

Slide 43: Remove land blocks (figure)

Slide 44: Case Study: Memory Use in CLM
- CLM configuration:
  - 1x1.25 grid
  - no RTM
  - MAXPATCH_PFT = 4
  - no CN, DGVM
- Measure stack and heap on 32-512 BG/L processors

Slide 45: Memory use of CLM on BG/L (figure)

Slide 46: Motivation (cont'd)
- Multiple efforts underway:
  - CAM scalability + high-resolution coupled simulation [A. Mirin]
  - sequential coupler [M. Vertenstein, R. Jacob]
  - single-executable coupler [J. Wolfe]
  - CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
  - HOMME in CAM [J. Edwards]

Slide 47: Outline
- Chip-Multiprocessor
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

Slide 48: Status of CLM
- Work of T. Craig:
  - elimination of global memory
  - reworking of decomposition algorithms
  - addition of PIO
- Short-term goal:
  - participation in BGW days, June 2007
  - investigate scalability at 1/10°

Slide 49: Status of CLM Memory Usage
- May 1, 2006:
  - memory usage increases with processor count
  - can run 1x1.25 on 32-512 BG/L processors
- July 10, 2006:
  - memory usage scales to an asymptote
  - can run 1x1.25 on 32-2K BG/L processors
  - ~350 persistent global arrays [24 GB/proc @ 1/10 degree]
- January 2007:
  - ~150 persistent global arrays [10.5 GB/proc @ 1/10 degree]
  - 1/2 degree runs on 32-2K BG/L processors
- February 2007:
  - 18 persistent global arrays [1.2 GB/proc @ 1/10 degree]
- Target:
  - no persistent global arrays
  - 1/10 degree runs on a single rack of BG/L

Slide 50: Proposed Petascale Experiment
- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - land model @ 0.1°
- Sequential design (105 days per run): 32K BG/L / 10K XT3 processors
- Concurrent design (33 days per run): 120K BG/L / 42K XT3 processors

Slide 51: POPIO benchmark on BGW (figure)

Slide 52: CICE Results (cont'd)
- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - large domains at low latitude -> higher boundary-exchange cost
  - small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost? Work in progress!

