S3D: Comparing Performance of XT3+XT4 with XT4
Sameer Shende

1 S3D: Comparing Performance of XT3+XT4 with XT4
Sameer Shende

2 TAU Performance System - S3D Scalability Study
Acknowledgements
• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]
The performance data presented here is available at:

3 TAU Parallel Performance System
• Multi-level performance instrumentation
  ◦ Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  ◦ Computer system architectures and operating systems
  ◦ Different programming languages and compilers
• Support for multiple parallel programming paradigms
  ◦ Multi-threading, message passing, mixed-mode, hybrid

4 The Story So Far...
• Scalability study of S3D using TAU
• 3D scatter plots and the mapping of ranks to physical processors point to the XT3/XT4 partitioning
• Memory and network on the XT3 partition cause the rest of the application to slow down
• Hypothesis: Running S3D on a ‘pure’ XT4 system will improve performance significantly
• Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...

5 3D Scatter Plots
• Plot four routines along the X, Y, Z, and color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
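A minimal sketch of the mapping described above, assuming each routine's per-rank time is scaled linearly into its own (min, max) range. The function names, the routine name LOOP_A, and all timing values are hypothetical and only illustrate the idea; this is not TAU's implementation.

```python
from typing import Dict, List, Tuple


def normalize(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1] using their own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All ranks spent the same time: place them at the axis origin.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_points(
    times: Dict[str, List[float]],
    axes: Tuple[str, str, str, str],
) -> List[Tuple[float, ...]]:
    """Return one (x, y, z, color) tuple per rank for the four chosen routines."""
    columns = [normalize(times[name]) for name in axes]
    return list(zip(*columns))


# Hypothetical exclusive times in seconds, one entry per rank (4 ranks).
times = {
    "MPI_Wait": [15.0, 400.0, 420.0, 435.0],
    "MPI_Barrier": [30.0, 10.0, 12.0, 11.0],
    "WRITE_SAVEFILE": [5.0, 60.0, 62.0, 58.0],
    "LOOP_A": [200.0, 120.0, 118.0, 121.0],  # hypothetical compute routine
}

points = scatter_points(times, ("MPI_Wait", "MPI_Barrier", "WRITE_SAVEFILE", "LOOP_A"))
for rank, point in enumerate(points):
    print(rank, tuple(round(c, 2) for c in point))
```

Ranks that behave alike end up close together, so a group of slower nodes shows up as a separate cluster.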

6 Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!
Previous work proved: blue nodes are XT3, red are XT4

7 3D Triangle Mesh Display
• Plot MPI rank, routine name, and exclusive time along the X, Y, and Z axes
• Color can show a fourth metric
• Scalable view
• Suitable for a very large number of processors

8 XT3+XT4: MPI_Wait
Gap represents XT3 nodes

9 3D View: Large MPI_Wait times on most CPUs
To improve performance, we must reduce MPI_Wait time on the other CPUs

10 3D View: XT3 Partition, Imbalance
On XT3: MPI_Wait takes less time, other routines take more time!

11 Getting Back to MPI_Wait()
MPI_Wait takes less time on XT3 nodes
Other routines take longer

12 XT3+XT4: MPI_Wait - Sorted by Exclusive Time
MPI_Wait takes 435.84 seconds on rank 3101
It takes 15.49 seconds on rank 0!
Rank 3101 is on XT4, rank 0 is on XT3
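To put the two quoted values side by side, the short sketch below sorts them by exclusive time and computes the ratio between the slowest and fastest rank. Only the two ranks reported on the slide are used; the variable names are illustrative.

```python
# MPI_Wait exclusive time in seconds, keyed by MPI rank (values from the slide).
mpi_wait_exclusive = {3101: 435.84, 0: 15.49}

# Sort ranks by exclusive time, largest first, as in the sorted profile view.
ranked = sorted(mpi_wait_exclusive.items(), key=lambda kv: kv[1], reverse=True)
(worst_rank, worst_time), (best_rank, best_time) = ranked[0], ranked[-1]

ratio = worst_time / best_time
print(f"rank {worst_rank} waits {ratio:.1f}x longer than rank {best_rank}")
```

The XT4 rank spends roughly 28 times longer in MPI_Wait than the XT3 rank, which is consistent with faster nodes idling while they wait for slower ones.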

13 Comparing XT4 and XT3 ranks (Best vs worst)

14 Improving S3D Performance
• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait

15 XT4 Profile: Main Window

16 XT4: Mean Profile Sorted by Exclusive Time
MPI_Wait has moved down!

17 XT4: Mean Profile Sorted by Inclusive Time

18 Comparing XT4 with XT3+XT4
On XT4, MPI_Wait takes 26% of the time it takes on the combined XT3+XT4 system!

19 Comparing Mean Inclusive Time

20 XT4: 3D View
The “exp” loop [~1GFlop] takes most time now!

21 XT3+XT4: Scatter Plot (Before)

22 XT4 Scatter Plot (After)
MPI_Wait takes from 78 to 121 s now!

23 Comparing Performance
• Hypothesis confirmed: XT4 is faster than XT3+XT4
  ◦ Inclusive time down from 1935 to 1702 s
  ◦ 12% improvement
  ◦ Saved 24853.3 minutes (414 hours) of wallclock time, summed over all 6400 cores!
• Reduction in MPI_Wait time is most significant
  ◦ 390 s (mean) down to 104 s (mean)
• Lessons learned:
  ◦ Slower XT3 nodes can have a significant impact on a large-scale S3D run
  ◦ S3D harness testcase does not perform well on non-homogeneous nodes
  ◦ We recommend running S3D on the XT4 partition only!
  ◦ #PBS -lfeature=xt4
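The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above and assumes the 24853.3-minute saving is the per-core saving multiplied by the 6400 cores of the run; the variable names are illustrative.

```python
CORES = 6400

# Mean inclusive time in seconds, before (XT3+XT4) and after (XT4 only).
before_s, after_s = 1935.0, 1702.0
saved_per_core_s = before_s - after_s        # 233 s per core
improvement = saved_per_core_s / before_s    # fraction of the original time

# Aggregate saving across every core in the run.
aggregate_s = saved_per_core_s * CORES
aggregate_min = aggregate_s / 60
aggregate_h = aggregate_s / 3600

# Mean MPI_Wait time in seconds, before and after.
wait_before_s, wait_after_s = 390.0, 104.0
wait_fraction = wait_after_s / wait_before_s

print(f"improvement: {improvement:.1%}")
print(f"aggregate saving: {aggregate_min:.1f} min ({aggregate_h:.0f} h)")
print(f"MPI_Wait now takes {wait_fraction:.1%} of its earlier mean time")
```

With the rounded means of 390 s and 104 s the MPI_Wait fraction comes out slightly under 27%, close to the 26% quoted in the earlier XT4 versus XT3+XT4 comparison.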

24 Discussion
• Did we get optimal performance on XT4 nodes?
• Are the nodes performing at similar rates uniformly now?
• Let us see the std. deviation plot of all routines...
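One way to answer the uniformity question is to rank routines by the standard deviation of their per-rank times. The sketch below is a minimal illustration with hypothetical timing values and a hypothetical helper name; it does not read TAU profile files.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def spread_by_routine(profile: Dict[str, List[float]]) -> List[Tuple[str, float, float]]:
    """Return (routine, mean, std. deviation) rows, largest deviation first."""
    rows = [(name, mean(times), pstdev(times)) for name, times in profile.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical exclusive times in seconds, one entry per rank (4 ranks).
profile = {
    "MPI_Wait": [100.0, 104.0, 108.0, 104.0],
    "WRITE_SAVEFILE": [5.0, 60.0, 62.0, 58.0],
    "MPI_Barrier": [50.0, 10.0, 8.0, 12.0],
}

rows = spread_by_routine(profile)
for name, avg, dev in rows:
    print(f"{name:15s} mean={avg:7.2f} s  stddev={dev:6.2f} s")
```

Routines at the top of this list are the ones that behave least uniformly across ranks and are the first candidates for a closer look.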

25 XT4: Standard Deviation
I/O routines!

26 Scatter Plot: One CPU...
WRITE_SAVEFILE

27 WRITE_SAVEFILE
Rank 0 is quicker!

28 MPI_Barrier

29 I/O is not performed uniformly

30 I/O Becomes a Bottleneck: XT3, XT3+XT4...
MPI_Wait, WRITE_SAVEFILE

31 Conclusions
• Using pure XT4 improved performance by 12%
• Need to investigate I/O in XT4/Lustre further to achieve better performance...
• Discuss I/O issues with S3D developers

32 S3D - Building with TAU
• Change name of compiler in build/make.XT3
  ◦ ftn =>
  ◦ cc =>
• Set compile-time environment variables
  ◦ setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    » Disabled tracking message communication statistics in TAU
    » MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    » Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  ◦ setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    » Selective instrumentation file eliminates instrumentation in lightweight routines
    » Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script:
  ◦ export TAU_THROTTLE=1
  ◦ export COUNTER1=GET_TIME_OF_DAY
  ◦ export COUNTER2=PAPI_FP_INS
  ◦ export COUNTER3=PAPI_L1_DCM
  ◦ export COUNTER4=PAPI_TOT_INS
  ◦ export COUNTER5=PAPI_L2_DCM


34 Getting Access to TAU on Jaguar
• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  ◦ Makefile.tau-mpi-pdt-pgi (flat profile)
  ◦ Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  ◦ Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in:
  ◦ ~sameer/scratch/S3D-BINARIES
  ◦ withtau
    » papi, multiplecounters, mpi, pdt, pgi options
  ◦ without_tau

35 Concluding Discussion
• Performance tools must be used effectively
• More intelligent performance systems for productive use
  ◦ Evolve to application-specific performance technology
  ◦ Deal with scale by “full range” performance exploration
  ◦ Autonomic and integrated tools
  ◦ Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  ◦ More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.

36 Support Acknowledgements
• Department of Energy (DOE)
• Office of Science
• LLNL, LANL, ORNL, ASC
• PERI
