Optimizing Applications on Blue Waters

1 Optimizing Applications on Blue Waters
April 11, 2017
Blue Waters Institute, June 5, 2014
Victor Anisimov, NCSA Science and Engineering Applications Support
Optimization is about improving application performance.
Optimization is the next most exciting thing after research (algorithm development).
HPC and optimization are all about improving the researcher's productivity.

2 Performance Optimization on Blue Waters
Overview
  Hardware
  Topology Aware Scheduling
  Balanced Injection: Sharing the Network
  Compilers
  Libraries
  Profiling: Finding Hot Spots in the Code
  File System and I/O Performance Optimization
Training materials: cp -pr /u/staff/anisimov/training_bw .
Optimize the usage of resources
Optimize your code (reuse and improve)
Optimize your optimization efforts (do planning)

3 Hardware Overview

4 Blue Waters – Configuration
CPU: AMD 6276 Interlagos (2.3 – 2.6 GHz)
GPU: NVIDIA K20X (Kepler GK110)
Network: 3D torus topology
22,640 XE6 nodes (dual CPU), 64 GB RAM each
4,224 XK7 nodes (CPU + GPU), 32 GB RAM each
File system: 26.4 PB, aggregate I/O bandwidth 1 TB/s

5 Cray XE6 Blade has 2 Compute nodes
Node Characteristics
  Number of cores            32 (2 AMD 6276 Interlagos processors)
  Peak performance           313 Gflops
  Memory size                64 GB per node
  Memory bandwidth (peak)    102.4 GB/s
Peak performance: 313 Gflops = 2.45 GHz * 16 * 8 ops/cycle, where 2.45 GHz is the midpoint of the 2.3 -> 2.6 GHz Turbo Core range
[Figure: XE6 blade with X, Y, Z torus axes]
Cray Proprietary
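The peak-performance figure on this slide can be reproduced with a few lines of arithmetic. The sketch below only restates the slide's formula; reading the factor of 16 as the number of FP units per node (one per Bulldozer module) is an interpretation based on the node description elsewhere in this deck, and the constant names are illustrative.

```python
# Peak double-precision rate of one XE6 node, following the slide's formula.
CLOCK_GHZ = 2.45         # midpoint of the 2.3-2.6 GHz Turbo Core range
FP_UNITS_PER_NODE = 16   # assumed meaning of the slide's "16": one FP unit per Bulldozer module
OPS_PER_CYCLE = 8        # per FP unit, as given on the slide


def peak_gflops(clock_ghz: float, fp_units: int, ops_per_cycle: int) -> float:
    """GHz * FP units * ops per cycle gives Gflops."""
    return clock_ghz * fp_units * ops_per_cycle


node_peak = peak_gflops(CLOCK_GHZ, FP_UNITS_PER_NODE, OPS_PER_CYCLE)
print(f"{node_peak:.1f} Gflops per XE6 node")
```

The result is slightly above the 313 Gflops quoted on the slide because the slide truncates 313.6.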

6 AMD 6276 Interlagos Processor
Compute node (XE6) contains 2 processors
Processor has 2 NUMA nodes
NUMA node has 4 Bulldozer modules
Each Bulldozer module has a single FP unit and 2 integer cores
[Figure labels: processor socket, NUMA node, shared L3 cache, NB/HT links, memory controller]
Summary: 2 cores/core module, 4 core modules/die, 2 dies/processor, 2 processors/XE6 node
Takeaway: memory is not quite flat. Each die has its own cache, each processor has its own memory

7 Job Submitting Performance Options: XE6
BW node: 64 GB RAM, 16 Bulldozer (BD) cores, 32 compute cores, 4 NUMA domains
aprun enumerates cores from 0 to 31; each consecutive pair (0-1, 2-3, …) represents one BD core
A BD core consists of 2 compute cores sharing a single FP unit
Default task placement: fill up from 0 to 31
  -N16 uses cores 0,1,…,15 and therefore only 8 FP units
  -N16 -d2 uses cores 0,2,4,6,8,10,12,14,…,24,26,28,30 and all 16 FP units
aprun options to specify a particular number of tasks per XE6 node:
  -N1              1 process
  -N2 -cc 0,16     2 processes
  -N4 -d8          4 processes, one per NUMA node (cache optimal)
  -N8 -d4          8 processes, two per NUMA node
  -N16 -d2         16 processes, interleaved, 16 BD cores
  -N32             32 processes (may improve performance over -N16)
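The effect of -N and -d on FP-unit usage can be checked with a small model. This is a simplified sketch rather than a description of aprun's real placement logic: it assumes each task is pinned to the first core of its block of `depth` cores, and that cores 2k and 2k+1 share one FP unit, as the slide describes. The function names are illustrative.

```python
from typing import List

CORES_PER_NODE = 32


def cores_used(tasks_per_node: int, depth: int = 1) -> List[int]:
    """First core of each task when tasks are spaced `depth` cores apart (-N, -d)."""
    cores = [task * depth for task in range(tasks_per_node)]
    if cores and cores[-1] >= CORES_PER_NODE:
        raise ValueError("placement does not fit on one 32-core node")
    return cores


def fp_units_used(cores: List[int]) -> int:
    """Cores 2k and 2k+1 belong to the same Bulldozer module and share its FP unit."""
    return len({core // 2 for core in cores})


for n, d in [(16, 1), (16, 2), (4, 8)]:
    placement = cores_used(n, d)
    print(f"-N{n} -d{d}: cores {placement} -> {fp_units_used(placement)} FP units")
```

Under these assumptions the model reproduces the slide's counts: 8 FP units for -N16 and 16 FP units for -N16 -d2.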

8 Test 01: Optimal MPI task placement
Question: When do 16 cores do a better job than 32 cores? How can fewer than 32 cores per XE6 node be used optimally?
Running the test:
  Show task placement: export MPICH_CPUMASK_DISPLAY=1
  cc app.c; qsub run
  aprun -n32 ./a.out         (Time = 7.613)
  aprun -n16 ./a.out         (Time = 5.995)
  aprun -n16 -d2 ./a.out     (Time = 5.640)
Take-home lesson: Contention degrades performance. Test different task placements and choose the best one before starting production computations.
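To put the three timings on a common scale, the following sketch computes each placement's speedup relative to the fully packed -n32 run. The timings are the ones quoted on the slide; the variable names are illustrative.

```python
baseline = 7.613  # seconds, aprun -n32
timings = {"-n16": 5.995, "-n16 -d2": 5.640}

# Speedup = baseline time / time with the alternative placement.
speedup = {opts: baseline / seconds for opts, seconds in timings.items()}

for opts, factor in speedup.items():
    print(f"aprun {opts}: {factor:.2f}x faster than -n32")
```

For this test, half as many tasks finish sooner, and spreading them one per Bulldozer module helps further.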

9 3D Torus in XY plane
[Figure: 3D torus in the XY plane; labels: PCIe Gen2, axes X, Y, Z]

10 Cray 3D Torus Topology
[Figure: VMD image of the 3D torus. XK nodes: red; service nodes: blue; compute nodes: gray]
Torus is a periodic box in 3D
Link speed: Slower – Y, X, Z – Faster
Routing: X, then Y, then Z
Routing path depends on application placement on the 3D torus
Reference: Bob Fiedler, Cray Inc.: pdf

11 Why My Job Runs Slow - Performance Variation
[Figure panels: dedicated-machine performance; performance variation due to job-job interaction]
Example of defragmented node allocation: yellow – defragmented 1000-node job, red – XK nodes, blue – service nodes; other jobs not shown

12 Moab: Nodesets
Available shapes and sizes (number of nodes in them):
  Sheets: 1100, 2200, 6700, 8200
  Bars: 6700
  Cubes: 3300
Specifying a nodeset in a PBS script:
  #PBS -l nodes=3360:ppn=32
  #PBS -l nodeset=ONEOF:FEATURE:c1_3300n:c2_3300n:c3_3300n:c5_3300n:c6_3300n:c7_3300n
[Figure: job placement in a cube nodeset, shown in yellow]

13 Moab: Hostlist
  #PBS -l nodes=1000:ppn=32
  #PBS -l hostlist= …+8207^
[Figure: job placement in the 3D torus, shown in yellow]

14 Test 02: Optimal job placement on 3D torus
Challenge: Choose a pair of adjacent nodes on the 3D torus. Find the application performance on 0-1, Y-Y, and Z-Z links.
Running the test:
  Show node ids and X,Y,Z coordinates: cc app.c; qsub run
Example of 0-1, Z-Z, Y-Y links:
  aprun -n64 -L 24124, /a.out
  aprun -n64 -L 24125, /a.out
  aprun -n64 -L 24108, /a.out (non-local)
  aprun -n64 -L 23074, /a.out
Try your own node ids to test your understanding
Take-home lesson: Different links have different network bandwidth: 0-1 > Z-Z = X-X > Y-Y
[Figure: node 0 and node 1 with X, Y, Z axes]

15 Congestion Protection
Network congestion is a condition that occurs when the volume of traffic on the high-speed network (HSN) exceeds the capacity to handle it. To "protect" the network from data loss, congestion protection (CP) globally "throttles" injection bandwidth per node. If CP happens often, application performance degrades.
At job completion you might see the following message reported to stdout:
  Application network throttled: 4459 nodes throttled, 25:31:21 node-seconds
  Application balanced injection 100, after throttle 63
Throttling disrupts the work on the entire machine.

16 Types of congestion events
There are two main forms of congestion: many-to-one and long-path. The former is easy to detect and correct. The latter is harder to detect and may not be correctable.
Many-to-one congestion occurs in some algorithms and can be corrected. See "Modifying Your Application to Avoid Gemini Network Congestion Errors" in the balanced injection section of the portal.
Long-path congestion is typically due to a combination of communication pattern and node allocation. It can also be due to a combination of jobs running on the system.
We monitor for cases of congestion protection and try to determine the most likely cause.

17 Congestion on a Shared, Torus Network
HSN uses dimension-ordered routing: X-then-Y-then-Z between two locations on the torus. Note that AB≠BA.
Shortest route can sometimes cause traffic to pass through geminis used by other jobs.
Non-compact node allocations can have traffic that passes through geminis used by other jobs.
I/O traffic can lead to network hot-spots.
We are working with Adaptive and Cray on eliminating some of the above causes of congestion with better node allocation: shape, location, etc.
[Figure: routes between nodes A and B]
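Why AB≠BA under dimension-ordered routing can be seen in a toy model. The sketch below is not the Gemini routing algorithm: it assumes a small hypothetical 8x8x8 torus, takes the shorter way around each ring, and breaks ties in the positive direction. The function names and coordinates are illustrative.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def step_toward(src: int, dst: int, size: int) -> int:
    """Return +1 or -1: the shorter direction around a ring of `size` nodes."""
    forward = (dst - src) % size
    return 1 if forward <= size - forward else -1


def route(a: Coord, b: Coord, dims: Coord) -> List[Coord]:
    """Dimension-ordered route from a to b: finish X, then Y, then Z."""
    pos = list(a)
    path = [tuple(pos)]
    for axis in range(3):
        while pos[axis] != b[axis]:
            step = step_toward(pos[axis], b[axis], dims[axis])
            pos[axis] = (pos[axis] + step) % dims[axis]
            path.append(tuple(pos))
    return path


DIMS = (8, 8, 8)            # hypothetical torus size, for illustration only
A, B = (0, 0, 0), (2, 1, 0)
a_to_b = route(A, B, DIMS)
b_to_a = route(B, A, DIMS)
print("A->B:", a_to_b)
print("B->A:", b_to_a)
```

In this model the two directions share no intermediate nodes: A to B travels along the row y=0 before turning, while B to A travels along y=1. This is how a job's traffic can cross nodes outside its own allocation in one direction but not the other.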

18 Balanced Injection
Balanced Injection (BI) is a mechanism that attempts to reduce compute node injection bandwidth in order to prevent throttling; it may also improve application performance for certain communication patterns.
BI can be applied per job using an environment variable or a user-accessible API:
  export APRUN_BALANCED_INJECTION=63
Can be set from (100 = no BI). The relation between BI and application performance is not linear.
MPI-based applications have "balanced injection" enabled in collective MPI calls that locally "throttle" injection bandwidth.

19 Compilers

20 Available Compilers
Cray Compilers - Cray Compiling Environment (CCE)
  Fortran 2003, Co-arrays, UPC, PGAS, OpenACC
GNU Compiler Collection (GCC)
Portland Group Inc (PGI) Compilers
  OpenACC
Intel Compilers (to be available soon)
All compilers provide Fortran, C, C++, and OpenMP support
Use the cc, CC, and ftn wrappers for C, C++, and Fortran
So which compiler do I choose?
  Experiment with various compilers
  Work with BW staff
  Mixing libraries created by different compilers may cause issues

21 Compiler Choices – Relative Strength
CCE – outstanding Fortran, very good C, and okay C++
  Very good vectorization
  Very good Fortran language support; only real choice for coarrays
  C support is very good, with UPC support
  Very good scalar optimization and automatic parallelization
  Clean implementation of OpenMP 3.0 with tasks
  Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)
  No inline assembly support
  Excellent support from Cray (bugs, issues, performance, etc.)

22 Compiler Choices – Relative Strength
PGI – very good Fortran, okay C and C++
  Good vectorization
  Good functional correctness with optimization enabled
  Good automatic prefetch capabilities
  Company (NVIDIA) focused on HPC market
  Excellent working relationship with Cray
  Slow bug fixing

23 Compiler Choices – Relative Strength
GNU – good Fortran, outstanding C and C++
  Obviously, the best gcc compatibility
  Scalable optimizer was recently rewritten and is very good
  Vectorization capabilities focus mostly on inline assembly
  A few releases have been incompatible with each other and require recompilation of modules (4.3, 4.4, 4.5)
  Aimed at general-purpose applications, not necessarily HPC

24 Recommended CCE Compilation Options
Use default optimization levels
  It's the equivalent of most other compilers' -O3 or -fast
Use -O3,fp3 (or -O3 -hfp3 or some variation)
  -O3 gives slightly more than -O2
  -hfp3 gives a lot more floating point optimizations, especially for 32-bit
If an application is intolerant of floating point reassociation, try a lower hfp number: try hfp1 first, and hfp0 only if absolutely necessary
  Might be needed for tests that require strict IEEE conformance
  Or for applications that have validated results from a different compiler
We do not suggest using -Oipa5, -Oaggress and so on; higher numbers are not always correlated with better performance
Compiler feedback: -rm (Fortran), -hlist=m (C)
If you don't want OpenMP: -xomp or -Othread0 or -hnoomp
Man pages: crayftn, craycc, crayCC

25 Starting Point for PGI Compilers
Suggested option: -fast
Interprocedural analysis allows the compiler to perform whole-program optimizations: -Mipa=fast(,safe)
If you can be flexible with precision, also try -Mfprelaxed
Option -Msmartalloc, which calls the subroutine mallopt in the main routine, can have a dramatic impact on the performance of programs that use dynamic memory allocation
Compiler feedback: -Minfo=all, -Mneginfo
Man pages: pgf90, pgcc, pgCC

26 Additional PGI Compiler Options
-default64: Fortran driver option for -i8 and -r8
-i8, -r8: treat INTEGER and REAL variables in Fortran as 8 bytes (use the ftn -default64 option to link the right libraries)
-byteswapio: reads big-endian files in Fortran
-Mnomain: uses the ftn driver to link programs with the main program written in C or C++ and one or more subroutines written in Fortran

27 Starting Point for GNU Compilers
-O3 -ffast-math -funroll-loops
Compiler feedback: -ftree-vectorizer-verbose=2
Man pages: gfortran, gcc, g++

28 Test 03: Manual code optimization
Challenge: Find and fix two programming blunders in the test code. Modify app2.c and app3.c so that the runtimes satisfy T(app1) > T(app2) > T(app3).
Running the test:
  cc -hlist=m -o app1.x app1.c
  cc -hlist=m -o app2.x app2.c
  cc -hlist=m -o app3.x app3.c
  qsub run
Target: T(app1)=5.665s, T(app2)=0.242s, T(app3)=0.031s

29 Test 03: Solution
[Figures: source listings of app1.c, app2.c, app3.c]

30 Libraries

31 Libraries: Where to Start
Libraries motto: Reuse rather than reinvent
Libraries are tailored to the programming environment
  Choose a programming environment: PrgEnv-[cray,gnu,pgi]
  Load the library module: module load <libname>
  See the actual path: module show <libname>
  Location path: ls -l $CRAY_LIBSCI_PREFIX_DIR/lib
Use compiler wrappers: ftn, cc, CC
For most applications, the default settings work very well
OpenMP-threaded BLAS/LAPACK libraries are available
  The serial version is used if OMP_NUM_THREADS is not set or is set to 1

32 PETSc (Argonne National Laboratory)
Programmable, Extensible Toolkit for Scientific Computing
Widely used collection of many different types of linear and non-linear solvers
Actively under development; very responsive team
Can also interface with numerous optional external packages (e.g., SLEPC, HYPRE, ParMETIS, …)
Optimized version installed by Cray, along with many external packages
Use "module load petsc[/version]"
Pronounced "PET-see"; unsure why the "c" is lower-case.

33 Other Numerical Libraries
ACML (AMD Core Math Library)
  BLAS, LAPACK, FFT, Random Number Generators
Trilinos (from Sandia National Laboratories)
  Somewhat similar to PETSc; interfaces to a large collection of preconditioners, solvers, and other computational tools
GSL (GNU Scientific Library)
  Collection of numerous computational solvers and tools for C and C++ programs
See all available modules: "module avail"

34 Cray Scientific Library (libsci)
Contains optimized versions of several popular scientific software routines
Available by default; see available versions with "module avail cray-libsci" and load a particular version with "module load cray-libsci[/version]"
General:
  BLAS (Basic Linear Algebra Subroutines)
  BLACS (Basic Linear Algebra Communication Subprograms)
  LAPACK (Linear Algebra routines)
  ScaLAPACK (parallel Linear Algebra routines)
  FFT (Fast Fourier Transforms)
  FFTW (Fastest FFT in the West)
Unique to Cray (affects portability):
  CRAFFT (Cray Adaptive FFT routines) – a library of serial and parallel, single- and double-precision, Fortran and C routines that compute the discrete Fourier transform in one, two, or three dimensions. CRAFFT provides a simplified interface to several FFT libraries and allows automatic, dynamic selection of the fastest FFT kernel.
  CASE (Cray Adaptive Simple Eigensolver) – a collection of simplified interfaces into high-performance LAPACK and ScaLAPACK eigensolver routines that construct the sometimes complicated LAPACK and ScaLAPACK eigensolver calling sequences and work arrays from simple arguments.
  IRT (Iterative Refinement Toolkit) – a library of solvers and tools that provides solutions to linear systems using single-precision factorizations while preserving accuracy through mixed-precision iterative refinement.

35 Cray Accelerated Libraries
Cray LibSci accelerated BLAS, LAPACK, and ScaLAPACK libraries
Requires the PrgEnv-cray or PrgEnv-gnu programming environment
  module add craype-accel-nvidia35
In the source code:
  call libsci_acc_init()
  … (your code) …
  call libsci_acc_finalize()
The library interface automatically initiates the appropriate execution mode (CPU, GPU, hybrid). When BLAS or LAPACK routines are called from applications built with the Cray or GNU compilers, Cray LibSci automatically loads and links libsci_acc libraries upon execution if it determines performance will be enhanced.
Execution control from source code:
  routine_name        invokes the automated method
  routine_name_cpu    executes on the host CPU only
  routine_name_acc    executes on the GPU only
See "man intro_libsci_acc" for more information.

36 Cray Accelerated Libraries, continued
Libsci_acc is not thread safe. It will fail when called concurrently from OpenMP threads.
Environment variables:
CRAY_LIBSCI_ACC_MODE specifies the execution mode for libsci_acc routines:
  0  Use automated mode. Adds slight overhead. This is the default.
  1  Forces all supported auto-tuned routines to execute on the accelerator.
  2  Forces all supported routines to execute on the CPU if the data is located within host processor address space.
LIBSCI_ACC_BYPASS_FUNCTION specifies the execution mode for FUNCTION:
  1  Call the version that handles data addresses resident on the GPU.
  2  Call the version that handles data addresses resident on the CPU.
  3  For SGEMM, DGEMM, CGEMM, ZGEMM, this will call the accelerated version.
Warning: Passing a wrong CPU / GPU address will cause the program to crash.

37 Test 04: Speed up matrix multiplication by using OpenMP-threaded AMD Core Math Library
Challenge: Speed up the application by using a threaded library
Running the test:
  module swap PrgEnv-cray PrgEnv-pgi
  module add acml
  cc -lacml -Wl,-ydgemm_ -o app1.x app.c
    /opt/acml/5.3.1/pgi64_fma4/lib/libacml.a(dgemm.o): definition of dgemm_
  cc -lacml -Wl,-ydgemm_ -mp=nonuma -o app2.x app.c
  export OMP_NUM_THREADS=32
  aprun -n1 -cc none -d32 ./app1.x
  aprun -n1 -cc none -d32 ./app2.x
  aprun -n d32 ./app2.x
Output: Time = , Time = , Time =
Take-home lesson: Reuse instead of reinvent. "-Wl,-ydgemm_" reports where dgemm() was resolved from. "-cc none" allows OpenMP threads to migrate.
05-libsci/

38 Profiling

39 The Cray Performance Analysis Tools
Supports traditional post-mortem performance analysis
  Automatic identification of performance problems
  Indication of causes of problems
  Suggestions of modifications for performance improvement
pat_build: provides automatic instrumentation
CrayPAT run-time library: collects measurements (transparent to the user)
pat_report: performs analysis and generates text reports
pat_help: online help utility
To start working with CrayPAT: module load perftools

40 Application Instrumentation with pat_build
Supports two categories of experiments:
  Asynchronous experiments (sampling) capture values from the call stack or the program counter at specified intervals or when a specified counter overflows
  Event-based experiments (tracing) count events such as the number of times a specific system call is executed
While tracing provides the most detailed information, it can be very heavy if the application runs on a large number of cores for a long period of time
Sampling (-S; default: -Oapa = sampling + HWPC + MPI tracing) can be useful as a starting point, providing a first overview of the work distribution

41 Example Runtime Environment Variables
Optional timeline view of the program is available:
  export PAT_RT_SUMMARY=0     (minimize volume of tracing data)
Request hardware performance counter information:
  export PAT_RT_PERFCTR=1     (FLOP count)

42 Predefined Trace Wrappers (-g tracegroup)
  blas        Basic Linear Algebra subprograms
  caf         Co-Array Fortran (Cray CCE compiler only)
  hdf         manages extremely large data collection
  heap        dynamic heap
  io          includes stdio and sysio groups
  lapack      Linear Algebra Package
  math        ANSI math
  mpi         MPI
  omp         OpenMP API
  pthreads    POSIX threads
  shmem       SHMEM
  sysio       I/O system calls
  system      system calls
  upc         Unified Parallel C (Cray CCE compiler only)
For a full list, please see the pat_build(1) man page

43 Example Experiments
> pat_build -O apa
  Gets top time-consuming routines
  Least overhead
> pat_build -u -g mpi ./my_program
  Collects information about user functions and MPI
> pat_build -w ./my_program
  Collects information for MAIN
  Lightest-weight tracing
> pat_build -g netcdf,mpi ./my_program
  Collects information about netcdf routines and MPI

44 Steps to Collecting Performance Data
Access performance tools software
  % module load perftools
Build the application, keeping .o files (CCE: -h keepfiles)
  % make clean ; make
Instrument the application for automatic profiling analysis
  % pat_build a.out
  You should get an instrumented program a.out+pat
Run the application to get the top time-consuming routines
  % aprun … a.out+pat (or qsub <pat script>)

45 Steps to Collecting Performance Data (2)
You should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
Generate the report
  % pat_report <sdatafile>.xf > sampling_report

46 Example: HW counter data
PAPI_TLB_DM: data translation lookaside buffer misses
PAPI_L1_DCA: level 1 data cache accesses
PAPI_FP_OPS: average rate of floating point operations per single MPI task
MFLOPS (aggregate): total FLOP rate of the entire job
FLOP rate shows the efficiency of hardware utilization
[Sample pat_report output, USER group. Fields reported: Time%, Time (secs), Imb. Time (secs), Imb. Time%, Calls, PAPI_L1_DCM, PAPI_TLB_DM, PAPI_L1_DCA, PAPI_FP_OPS, User time (approx), Average Time per Call, CrayPat Overhead: Time, HW FP Ops / User time (4.1% of peak, DP), HW FP Ops / WCT, Computational intensity (ops/cycle, ops/ref), MFLOPS (aggregate), TLB utilization (refs/miss), D1 cache hit/miss ratios, D1 cache utilization (refs/miss)]

47 Example: Time spent in User and MPI functions
Table 1: Profile by Function
  Samp % | Samp | Imb. Samp | Imb. Samp % | Group / Function (PE='HIDE')
  100.0% |  775 |           |             | Total
   94.2% |  730 |           |             | USER
   43.4% |  336 |           |        2.6% |   mlwxyz_
   16.1% |  125 |           |        4.9% |   half_
    8.0% |   62 |           |        9.5% |   full_
    6.8% |   53 |           |        3.5% |   artv_
    4.9% |   38 |           |        3.6% |   bnd_
    3.6% |   28 |           |        6.9% |   currenf_
    2.2% |   17 |           |        8.6% |   bndsf_
    1.7% |   13 |           |       13.5% |   model_
    1.4% |   11 |           |       12.2% |   cfl_
    1.3% |   10 |           |        7.0% |   currenh_
    1.0% |    8 |           |       41.9% |   bndbo_
    1.0% |    8 |           |       53.4% |   bndto_
    5.4% |   42 |           |             | MPI
    1.9% |   15 |           |       23.9% |   mpi_sendrecv_
    1.8% |   14 |           |       55.0% |   mpi_bcast_
    1.7% |   13 |           |       30.7% |   mpi_barrier_
Computation intensity vs. communication intensity
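The Samp % column in Table 1 is each row's share of the 775 total samples, which is easy to verify. The sketch below recomputes three of the percentages from the sample counts in the table; the dictionary layout and function name are illustrative.

```python
# Sample counts copied from Table 1 (CrayPat sampling profile).
total_samples = 775
samples = {"USER": 730, "MPI": 42, "mlwxyz_": 336}


def percent(count: int, total: int) -> float:
    """Share of all samples, rounded to one decimal as in the report."""
    return round(100.0 * count / total, 1)


shares = {name: percent(count, total_samples) for name, count in samples.items()}
print(shares)  # {'USER': 94.2, 'MPI': 5.4, 'mlwxyz_': 43.4}
```

With about 94% of samples in USER code and about 5% in MPI, this profile describes a computation-dominated run.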

48 Test 05: Obtain FLOP count and User/MPI time
Challenge: What is the application's FLOP count? Use CrayPat to perform a profiling experiment and compare the result with a manual count.
Running the test:
  module add perftools
  cc -c app.c                  // compile the application
  cc -o app.x app.o            // link the application
  pat_build app.x              // instrument the application
  qsub run                     // submit the job
  aprun -n16 -d2 ./app.x+pat
  pat_report app.x+pat*.xf > report.out
Results:
  MPI: 88.9%; USER: 11.1%; communication-intensive job
  FLOP count = 2 GF (hint: convert FLOP rate to FLOP count)
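The hint about converting FLOP rate to FLOP count amounts to multiplying the reported rate by the time over which it was measured. A minimal sketch follows. The rate and runtime are hypothetical values, chosen only so that the result matches the 2 GF quoted above; they are not taken from an actual report.

```python
def flop_count(aggregate_mflops: float, runtime_seconds: float) -> float:
    """Total floating-point operations = rate (ops/s) * time (s)."""
    return aggregate_mflops * 1.0e6 * runtime_seconds


# Hypothetical report values (assumptions for illustration only).
example_rate_mflops = 400.0
example_runtime_s = 5.0

total_flop = flop_count(example_rate_mflops, example_runtime_s)
print(f"{total_flop / 1e9:.1f} GFLOP")
```

When applying this to a real report, the time must be the one the rate was normalized by (user time versus wall-clock time), since the report lists both.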

49 I/O optimization
[Figure: I/O software stack on Blue Waters. Labels: Scientist…, Application…, I/O Library, HDF5, PnetCDF, Adios, I/O Middleware, MPI-IO, IOBUF, Damaris, Utilities, Darshan, Parallel File System, Lustre]

50 Lustre File System: Striping
[Figure: logical vs. physical view of a striped file]
File striping: single files are distributed across a series of OSTs
  File size can grow to the aggregate size of available OSTs (rather than a single disk)
  Accessing multiple OSTs concurrently increases I/O bandwidth

51 Performance Impact: Configuring File Striping
lfs is the Lustre utility for viewing/setting file striping info
  Stripe count: the number of OSTs across which the file can be striped
  Stripe size: the size of the blocks that a file will be broken into
  Stripe offset: the ID of the OST for Lustre to start with when deciding which OSTs a file will be striped across (leave at default value)
Configurations should focus on stripe count/size
Blue Waters defaults:
  $> touch test
  $> lfs getstripe test
  test
  lmm_stripe_count: 1
  lmm_stripe_size:
  lmm_stripe_offset: 708
  obdidx objid objid group
  x20faa

52 Setting Striping Patterns
  $> lfs setstripe -c 5 -s 32m test
  $> lfs getstripe test
  test
  lmm_stripe_count: 5
  lmm_stripe_size:
  lmm_stripe_offset: 1259
  obdidx objid objid group
  x20ff7d x210c x x20fb x20fa
Note: a file's striping pattern is permanent, and is set upon creation
  lfs setstripe creates a new, 0-byte file
The striping pattern can be changed for a directory; every new file or directory created within it will inherit its striping pattern
A simple API is available for configuring striping; it is portable to other Lustre systems
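To see what a stripe count of 5 and a stripe size of 32 MB mean for a file's layout, the sketch below maps a byte offset to the stripe that holds it. It assumes the plain round-robin (RAID-0 style) layout. The returned slot is an index into the file's own list of OSTs, not a system-wide OST ID, and the function name is illustrative.

```python
from typing import Tuple

MIB = 1024 * 1024


def stripe_location(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (OST slot, byte offset inside that OST's object)."""
    stripe_number = offset // stripe_size
    ost_slot = stripe_number % stripe_count           # round-robin over the file's OSTs
    full_rounds = stripe_number // stripe_count       # complete passes over all OSTs so far
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return ost_slot, object_offset


# Layout from the example above: lfs setstripe -c 5 -s 32m
for offset_mib in (0, 32, 64, 160, 170):
    slot, obj = stripe_location(offset_mib * MIB, 32 * MIB, 5)
    print(f"file offset {offset_mib:>3} MiB -> OST slot {slot}, object offset {obj // MIB} MiB")
```

A process writing 160 MiB sequentially touches all five OSTs once, which is where the concurrent bandwidth comes from. Small writes confined to one 32 MiB stripe touch only a single OST.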

53 IOBUF – I/O Buffering Library
Optimize I/O performance with minimal effort
  Asynchronous prefetch
  Write-back caching
  stdin, stdout, stderr disabled by default
No code changes needed
  module load iobuf
  Recompile and relink the code
Ideal for sequential read or write operations
[Figure: software stack, top to bottom: Application, IOBUF, Linux IO infrastructure, File Systems / Lustre]

54 IOBUF – I/O Buffering Library
Enable globally by setting IOBUF_PARAMS; disable by unsetting it
Fine-grained control
  Control buffer size, count, synchronicity, prefetch
  Disable iobuf per file
Example:
  export IOBUF_PARAMS='*:verbose'
  export IOBUF_PARAMS='*.in:count=4:size=32M,*.out:count=8:size=64M'

55 IOBUF – MPI-IO Sample Verbose Output

56 I/O Utility: Darshan
Darshan was developed at Argonne
It is "a scalable HPC I/O characterization tool… designed to capture an accurate picture of application I/O behavior… with minimum overhead"
I/O characterization
  Sheds light on the intricacies of an application's I/O
  Useful for application I/O debugging
  Pinpointing causes of extremes
  Analyzing/tuning hardware for optimizations

57 Darshan Specifics
Darshan collects per-process statistics (organized by file)
  Counts I/O operations, e.g. unaligned and sequential accesses
  Times for file operations, e.g. opens and writes
  Accumulates read/write bandwidth info
  Creates data for simple visual representation
More
  Requires no code modification (only re-linking)
  Small memory footprint
  Includes a job summary tool

58 Summary Tool Example Output

59 Performance Optimization on Blue Waters

60 Test 06: Use Darshan to perform I/O analysis
Challenge: How much time does the application spend in write operations?
Running the test:
  module unload perftools
  module load darshan
  ftn app.f                    // compile the application
  qsub run                     // submit the job
  export DARSHAN_LOGPATH=./
  aprun -n16 -d2 ./a.out input.psf input.pdb
  *.gz                         // get job summary
  darshan-parser *.gz > darshan.log    // get darshan log
Results:
  Calculation time = seconds
  IO write time = seconds

61 Test 06: Three possible solutions
Look in the Darshan job-summary PDF file (figure above): 6.2 sec
Get from the Darshan log: CP_F_CLOSE_TIMESTAMP - CP_F_OPEN_TIMESTAMP = – = 5.5 sec
Manually put timers in designated spots in the source code: 6.0 sec

62 Good I/O Practices
Opening a file for writing/appending is expensive, so:
  If possible, open files as read-only
  Avoid large numbers of small writes, such as:
    while(forever){ open("myfile"); write(a_byte); close("myfile"); }
Be gentle with metadata (or suffer its wrath)
  Limit the number of files in a single directory
  Instead, opt for a hierarchical directory structure
  ls contacts the metadata server; ls -l communicates with every OST assigned to a file (for all files)
  Avoid wildcards such as rm -rf *; expanding them is expensive over many files
  It may even be more efficient to pass metadata through MPI than to have all processes hit the MDS (calling stat)
  Avoid updating last access time for read-only operations (NO_ATIME)
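The open/write/close anti-pattern above, and the usual remedy of opening once and writing in bulk, can be contrasted in a short sketch. It runs against a local temporary directory rather than Lustre, so it counts open calls instead of measuring metadata-server load. The function and file names are illustrative.

```python
import os
import tempfile


def write_byte_at_a_time(path: str, data: bytes) -> int:
    """Anti-pattern: one open/write/close cycle per byte. Returns the number of opens."""
    opens = 0
    for i in range(len(data)):
        with open(path, "ab") as f:   # every iteration reopens the file
            f.write(data[i:i + 1])
        opens += 1
    return opens


def write_in_one_pass(path: str, data: bytes) -> int:
    """Preferred: open once, write everything, close once. Returns the number of opens."""
    with open(path, "wb") as f:
        f.write(data)
    return 1


payload = b"x" * 1000
with tempfile.TemporaryDirectory() as tmp:
    slow_path = os.path.join(tmp, "slow.dat")
    fast_path = os.path.join(tmp, "fast.dat")
    slow_opens = write_byte_at_a_time(slow_path, payload)
    fast_opens = write_in_one_pass(fast_path, payload)
    with open(slow_path, "rb") as f_slow, open(fast_path, "rb") as f_fast:
        same_contents = f_slow.read() == f_fast.read()

print(f"opens: {slow_opens} vs {fast_opens}; identical output: {same_contents}")
```

Both functions produce the same file, but the first issues a thousand open and close requests where the second issues one. On a parallel file system each of those requests involves the metadata server.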

63 Acknowledgements
Thanks to Kalyana Chadalavada, Robert Brunner, Manisha Gajbe, and Galen Arnold for contributing to this presentation.
