
Slide 1: Titanium Review - Ti Parallel Benchmarks
Titanium NAS Parallel Benchmarks
Kaushik Datta and Kathy Yelick
http://titanium.cs.berkeley.edu
U.C. Berkeley, September 9, 2004

Slide 2: Benchmarks
Current Titanium NAS benchmarks:
- MG (Multigrid)
- FT (Fast Fourier Transform)
- CG (Conjugate Gradient - Armando)
- IS (Integer Sort - Omair)
- EP (Embarrassingly Parallel - Meling)
Today's focus is on MG and FT.

Slide 3: Platforms
Seaborg: NERSC IBM SP RS/6000
- 16-way SMP nodes
- 375 MHz Power3 processors, 1.5 GFlop/s peak
- 64 KB L1 data cache, 8 MB L2 cache
Also briefly mentioned: the Compaq AlphaServer, AMD Opteron, and Intel Itanium 2 processors.

Slide 4: MG Benchmark
- The Class A problem is 4 iterations on a 256^3 grid
- All computations are nearest-neighbor across 3D grids
- For coarse grids, all computations are done on one processor to minimize fine-grained communication
- The communication pattern for updating ghost cells is very regular
- Tests both the computation and communication aspects of the platform

Slide 5: Major MG Components
Computation (applies a 27-point 3D stencil; sketched below):
- ApplySmoother
- EvaluateResidual
- Coarsen
- Prolongate
Communication (very regular):
- UpdateBorder
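
As a rough illustration of the kind of kernel these computation components share (a hedged sketch, not the Titanium or Fortran source), a 27-point stencil sweep over a grid with one ghost layer can be written as below. The array names u and r, the grid size n, and the coefficient array c (grouped by how far each neighbor is from the center point, as in the NAS MG stencils) are all assumptions for the example.

    #include <stdlib.h>   /* abs() */

    /* Minimal 27-point stencil sketch over the interior of an (n+2)^3 grid
     * with one layer of ghost cells on each side.  Coefficients are grouped
     * by "distance class": c[0] center, c[1] faces, c[2] edges, c[3] corners. */
    void stencil27(int n, double u[n+2][n+2][n+2],
                   double r[n+2][n+2][n+2], const double c[4])
    {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                for (int k = 1; k <= n; k++) {
                    double sum = 0.0;
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++)
                            for (int dk = -1; dk <= 1; dk++)
                                sum += c[abs(di) + abs(dj) + abs(dk)]
                                     * u[i + di][j + dj][k + dk];
                    r[i][j][k] = sum;
                }
    }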

Slide 6: Possible Serial Optimizations
To improve the performance of the naive MG code, we first tried to make the serial code faster. The optimizations that seemed most promising were:
- Cache blocking
- Common subexpression elimination (CSE)

Slide 7: Possible Serial Optimization #1 - Cache Blocking
- Cache blocking takes a portion of the grid (sized to fit into a given level of cache) and performs all necessary computations on it before proceeding to the next cache block
- In our case, we used cubic 3D cache blocks and varied the side length of the cube
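
A minimal sketch of cubic 3D cache blocking for such a sweep, reusing the hypothetical u, r, and c names from the stencil sketch above. The block side B stands in for the tunable side length mentioned on the slide; the intent is that one B x B x B block's working set stays resident in a chosen level of cache while the block is fully processed.

    #include <stdlib.h>   /* abs() */

    /* Sketch: walk the n^3 interior in B x B x B cache blocks, finishing all
     * stencil work on one block before moving to the next. */
    void blocked_sweep(int n, int B,
                       double u[n+2][n+2][n+2], double r[n+2][n+2][n+2],
                       const double c[4])
    {
        for (int ii = 1; ii <= n; ii += B)
         for (int jj = 1; jj <= n; jj += B)
          for (int kk = 1; kk <= n; kk += B)
            /* all computation for this one cache block */
            for (int i = ii; i < ii + B && i <= n; i++)
             for (int j = jj; j < jj + B && j <= n; j++)
              for (int k = kk; k < kk + B && k <= n; k++) {
                  double sum = 0.0;
                  for (int di = -1; di <= 1; di++)
                   for (int dj = -1; dj <= 1; dj++)
                    for (int dk = -1; dk <= 1; dk++)
                        sum += c[abs(di) + abs(dj) + abs(dk)]
                             * u[i + di][j + dj][k + dk];
                  r[i][j][k] = sum;
              }
    }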

Slide 8: Possible Serial Optimization #1 - Cache Blocking
- Cache blocking seems to help slightly on the Itanium 2, but not on the Power3

Slide 9: Possible Serial Optimization #1 - Cache Blocking
- The AlphaServer and Opteron processors do not benefit from cache blocking

Slide 10: Possible Serial Optimization #2 - Common Subexpression Elimination
- CSE is a technique that reduces the flop count by memoizing intermediate results
- However, it may not always reduce the overall running time, since each pencil through the grid needs to be traversed twice:
  - The first traversal memoizes certain partial results
  - The second traversal then uses these results to compute the final answer
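
A minimal sketch of the CSE idea for the 27-point stencil, again reusing the hypothetical u, r, and c names from the sketches above (the exact choice of memoized sums is an assumption for the example, not the benchmark source): the first traversal of an i-pencil memoizes two partial sums per point (the four face neighbors and the four diagonal neighbors in the j-k plane), and the second traversal combines those memoized sums into the final stencil values, so each partial sum is shared by three output points instead of being recomputed.

    /* Sketch: CSE along one i-pencil (fixed j and k).  Pass 1 memoizes partial
     * sums; pass 2 reuses them, reducing the flop count per output point. */
    void cse_pencil(int n, int j, int k,
                    double u[n+2][n+2][n+2], double r[n+2][n+2][n+2],
                    const double c[4])
    {
        double s1[n + 2], s2[n + 2];   /* memoized partial sums for this pencil */

        /* first traversal: in-plane partial sums at every i */
        for (int i = 0; i <= n + 1; i++) {
            s1[i] = u[i][j-1][k] + u[i][j+1][k] + u[i][j][k-1] + u[i][j][k+1];
            s2[i] = u[i][j-1][k-1] + u[i][j+1][k-1]
                  + u[i][j-1][k+1] + u[i][j+1][k+1];
        }

        /* second traversal: combine memoized sums into the 27-point result */
        for (int i = 1; i <= n; i++)
            r[i][j][k] = c[0] * u[i][j][k]
                       + c[1] * (u[i-1][j][k] + u[i+1][j][k] + s1[i])
                       + c[2] * (s2[i] + s1[i-1] + s1[i+1])
                       + c[3] * (s2[i-1] + s2[i+1]);
    }

The two traversals are the reason CSE is not a guaranteed win: the pencil is touched twice, trading extra memory traffic for fewer arithmetic operations.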

Slide 11: Possible Serial Optimization #2 - Common Subexpression Elimination
- CSE does a good job of lowering the running time, partly because it reduces the flop count
- Note: the Fortran MG benchmark uses CSE

Slide 12: Chosen Serial Optimizations
- Based on these results, we kept the CSE optimization but omitted any type of cache blocking
- This mimics the Fortran code

Slide 13: Parallel Optimizations
- Have each processor block communicate with only its 6 nearest neighbors, instead of 27, to update its border (Dan)
- Eliminate "static" timers; this gets rid of a level of indirection
- Dan reduced false sharing in static variables by grouping each processor's static variables together
- Force bulk arraycopy by using contiguous array buffers with manual packing/unpacking (see the packing sketch after this list)
- Use the "local" keyword to let each processor know that all of its computations are local
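
A language-neutral sketch (written in C, not the Titanium source) of the manual packing/unpacking behind the bulk-arraycopy optimization: a boundary face of the block is strided in memory, so it is gathered into one contiguous buffer, moved in a single bulk copy, and scattered into the neighbor's ghost layer. The names grid and buf and the choice of the j-face are assumptions for the example.

    #include <stddef.h>

    /* Sketch: pack the j = 1 boundary face of an (n+2)^3 block (a strided
     * slice) into a contiguous buffer of n*n elements ... */
    void pack_face_j(int n, double grid[n+2][n+2][n+2], double *buf)
    {
        size_t idx = 0;
        for (int i = 1; i <= n; i++)
            for (int k = 1; k <= n; k++)
                buf[idx++] = grid[i][1][k];      /* gather strided elements */
    }

    /* ... and scatter a received buffer into the j = n+1 ghost layer. */
    void unpack_face_j(int n, double grid[n+2][n+2][n+2], const double *buf)
    {
        size_t idx = 0;
        for (int i = 1; i <= n; i++)
            for (int k = 1; k <= n; k++)
                grid[i][n+1][k] = buf[idx++];    /* fill ghost cells */
    }

Because the buffer is contiguous, the transfer between processors can then be one bulk arraycopy instead of many small strided copies.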

Slide 14: Seaborg SMP Performance of MG, Class A Problem
- Titanium does about as well as Fortran up to the 16-processor case
- Our serial tuning seems to be successful in this case

Slide 15: FT Benchmark
- The Class A problem is 6 iterations of a 256^2 x 128 problem
- Each 3D FFT is performed as 3 sets of 1D FFTs and 2 transposes (see the sketch after this list)
- 1D FFTs:
  - All are local
  - Currently library calls (using FFTW)
- Transposes:
  - One local transpose and one all-to-all transpose
  - The all-to-all transpose tests machine bisection bandwidth
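
A hedged, purely local sketch of the 3D-FFT-as-1D-FFTs decomposition using FFTW (for simplicity a cubic n x n x n array rather than the 256^2 x 128 Class A size; the function names fft_batch, rotate_dims, and fft3d are made up for the example and this is not the benchmark code). Each stage is a batch of contiguous 1D FFTs; between stages a transpose makes the next dimension contiguous, and in the parallel benchmark the second of those transposes is the all-to-all exchange.

    #include <stddef.h>
    #include <fftw3.h>

    /* In-place batch of 'count' contiguous 1D FFTs of length 'len'. */
    static void fft_batch(fftw_complex *a, int count, int len)
    {
        fftw_plan p = fftw_plan_many_dft(1, &len, count,
                                         a, NULL, 1, len,
                                         a, NULL, 1, len,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p);
        fftw_destroy_plan(p);
    }

    /* Transpose (dimension rotation): in[d0][d1][d2] (d2 fastest) becomes
     * out[d2][d0][d1] (d1 fastest), so the next FFT stage again runs over
     * contiguous pencils. */
    static void rotate_dims(const fftw_complex *in, fftw_complex *out, int n)
    {
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                for (int c = 0; c < n; c++) {
                    size_t src = ((size_t)a * n + b) * n + c;
                    size_t dst = ((size_t)c * n + a) * n + b;
                    out[dst][0] = in[src][0];
                    out[dst][1] = in[src][1];
                }
    }

    /* 3D FFT of a[x][y][z] (z fastest); tmp is scratch of the same size.
     * The result is left in 'a' in a rotated ([y][z][x]) layout. */
    void fft3d(fftw_complex *a, fftw_complex *tmp, int n)
    {
        fft_batch(a, n * n, n);      /* 1D FFTs along z (already contiguous) */
        rotate_dims(a, tmp, n);      /* local transpose                      */
        fft_batch(tmp, n * n, n);    /* 1D FFTs along y                      */
        rotate_dims(tmp, a, n);      /* all-to-all transpose in the parallel
                                        code: tests bisection bandwidth      */
        fft_batch(a, n * n, n);      /* 1D FFTs along x                      */
    }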

Slide 16: Major FT Components
Computation:
- 1D FFTs (part of the 3D FFT)
- Evolve
- Checksum
Communication:
- Transposes (part of the 3D FFT)
Both:
- Setup

Slide 17: Serial FT Optimizations
- Removed an unnecessary transpose in the 3D FFT
- Memoized the time-evolution array to reduce the flop count
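
A hedged sketch of what memoizing the time-evolution array means here (the names, the real/imaginary split, and the flat 1D indexing are assumptions, not the benchmark code): the exponential factor applied by Evolve depends only on the shifted wavenumbers of each spectral point, so it can be computed once during setup; each iteration then only multiplies, instead of re-evaluating exp() at every point.

    #include <math.h>
    #include <stddef.h>

    /* Setup: memoize base[p] = exp(-4*alpha*pi^2*kbar^2) once per point. */
    void setup_evolution(int nx, int ny, int nz, double alpha, double *base)
    {
        const double pi = 3.141592653589793;
        for (int x = 0; x < nx; x++)
            for (int y = 0; y < ny; y++)
                for (int z = 0; z < nz; z++) {
                    /* wavenumbers folded into [-n/2, n/2) */
                    int kx = (x < nx / 2) ? x : x - nx;
                    int ky = (y < ny / 2) ? y : y - ny;
                    int kz = (z < nz / 2) ? z : z - nz;
                    double k2 = (double)kx*kx + (double)ky*ky + (double)kz*kz;
                    base[((size_t)x * ny + y) * nz + z] =
                        exp(-4.0 * alpha * pi * pi * k2);
                }
    }

    /* One Evolve step.  factor[] must start at 1.0 everywhere, so after t
     * calls factor[p] == base[p]^t and u1 = u0 * base^t. */
    void evolve(size_t npts, const double *u0_re, const double *u0_im,
                double *u1_re, double *u1_im, double *factor, const double *base)
    {
        for (size_t p = 0; p < npts; p++) {
            factor[p] *= base[p];
            u1_re[p] = u0_re[p] * factor[p];
            u1_im[p] = u0_im[p] * factor[p];
        }
    }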

Slide 18: Seaborg SMP Performance of FT, Class A Problem
- Titanium does slightly better than Fortran, but we are calling the FFTW library
- We will compare each component of the benchmark separately

Slide 19: Seaborg SMP Performance of Setup
- Setup creates distributed arrays and memoizes an array used in later computations
- This method is only called once, but it still needs tuning

Slide 20: Seaborg SMP Performance of 1D FFTs
- The Titanium code calls the FFTW library in this case
- We are in the process of converting the FFT into pure Titanium code

Slide 21: Seaborg SMP Performance of Transpose
- The Titanium and Fortran codes perform similarly using shared memory
- Note: the Fortran code does cache blocking for the local transpose (a possible Titanium optimization)
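
A hedged sketch of the cache-blocked local transpose mentioned in the note above (not the Fortran or Titanium source; the tile size B is a tunable parameter and the complex elements are represented as pairs of doubles): working tile by tile keeps both the source and destination tiles cache-resident, which is what makes the blocked version faster than a straight row-by-row transpose.

    #include <stddef.h>

    /* Sketch: blocked transpose of an n x n matrix of complex values stored
     * as (re, im) pairs.  Each B x B tile is read and written while it is
     * still in cache. */
    void blocked_transpose(int n, int B, const double (*in)[2], double (*out)[2])
    {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int j = jj; j < jj + B && j < n; j++) {
                        out[(size_t)j * n + i][0] = in[(size_t)i * n + j][0];
                        out[(size_t)j * n + i][1] = in[(size_t)i * n + j][1];
                    }
    }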

Slide 22: Seaborg SMP Performance of Evolve
- Evolve consists of purely local floating-point computations
- The Titanium code performs slightly worse than the Fortran code, but scales better

Slide 23: Conclusion
- On Seaborg, Titanium serial and SMP performance is slightly worse than or comparable to Fortran with MPI in most cases

Slide 24: Future Work
- Examine and tune the multinode performance of the MG and FT benchmarks
- Convert the FT benchmark into pure Titanium (instead of calling FFTW)
- Start profiling and tuning the serial versions of the CG, IS, and EP benchmarks
- Check the performance of the benchmarks across several different platforms

