The Jacquard Programming Environment Mike Stewart NUG User Training, 10/3/05
2 Outline Compiling and Linking. Optimization. Libraries. Debugging. Porting from Seaborg and other systems.
3 Pathscale Compilers Default compilers: Pathscale Fortran 90, C, and C++. Module “path” is loaded by default and points to the current default version of the Pathscale compilers (currently 2.2.1). Other versions available: module avail path. Extensive vendor documentation available on-line at http://pathscale.com/docs.html. Commercial product: well supported and optimized.
4 Compiling Code Compiler invocation: –No MPI: pathf90, pathcc, pathCC. –MPI: mpif90, mpicc, mpicxx The mpi compiler invocation will use the currently loaded compiler version. The mpi and non-mpi compiler invocations have the same options and arguments.
5 Compiler Optimization Options 4 numeric levels: -On, where n ranges from 0 (no optimization) to 3. Default level: -O2 (unlike IBM). -g without a -O option changes the default to -O0.
6 -O1 Optimization Minimal impact on compilation time compared to a -O0 compile. Applies only optimizations on straight-line code (basic blocks), such as instruction scheduling.
7 -O2 Optimization Default when no optimization arguments given. Optimizations that always increase performance. Can significantly increase compilation time. -O2 optimization examples: –Loop nest optimization. –Global optimization within a function scope. –2 passes of instruction scheduling. –Dead code elimination. –Global register allocation.
8 -O3 Optimization More extensive optimizations that may in some cases slow down performance. Optimizes whole loop nests rather than just inner loops, e.g. by interchanging loop indices. “Safe” optimizations: produce answers identical to those produced by -O0. NERSC's recommended level, based on experience with benchmarks.
9 -Ofast Optimization Equivalent to -O3 -ipa -fno-math-errno -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed. ipa – interprocedural analysis. –Optimizes across function boundaries. –Must be specified both at compile and link time. Aggressive “unsafe” optimizations: –Change the order of evaluation. –Deviate from the IEEE 754 standard to obtain better performance. There are some known problems with this level of optimization in the current release, 2.2.1.
10 NAS B Serial Benchmarks Performance (MOP/S)
Benchmark  Seaborg Best    -O0    -O1    -O2    -O3  -Ofast
BT                 99.6  157.2  348.1  633.6  739.8   750.9
CG                 46.3  101.2  128.3  236.9  223.1   224.5
FT                130.1  186.2  231.5  572.4  592.7   did not compile
IS                  5.8   16.9   22.0   25.6   27.0    26.8
LU                169.8  129.0  342.4  700.0  809.9   903.2
MG                163.3  109.0  257.9  747.7  518.5   530.0
SP                 78.2  104.7  225.7  507.3  462.9   516.6
EP (one of six values missing): 3.7 15.1 17.5 21.9 21.8
11 NAS B Serial Benchmarks Compile Times (seconds)
Benchmark   -O0   -O1   -O2   -O3  -Ofast
BT          2.1   9.0   4.9   9.1    30.7
FT          0.4   0.5   0.8   1.5    did not compile
LU          2.1   4.2   5.7  11.4    17.4
MG          0.5   0.7   1.1   2.2     2.9
SP          1.6   2.0   3.2  10.0    14.4
CG (one of five values missing): 0.4 0.7 0.9 1.5
EP (garbled in source): .126.96.36.199
IS (garbled in source): .188.8.131.52
12 NAS B Optimization Arguments Used by LNXI Benchmarkers
Benchmark  Arguments
BT         -O3 -ipa -WOPT:aggstr=off
CG         -O3 -ipa -CG:use_movlpd=on -CG:movnti=1
EP         -LNO:fission=2 -O3 -LNO:vintr=2
FT         -O3 -LNO:opt=0
IS         -Ofast -DUSE_BUCKETS
LU         -Ofast -LNO:fusion=2:prefetch=0:full_unroll=10:ou_max=5 -OPT:ro=3:fold_unsafe_relops=on:fold_unsigned_relops=on:unroll_size=256:unroll_times_max=16:fast_complex -CG:cflow=off:p2align_freq=1 -fno-exceptions
MG         -O3 -ipa -WOPT:aggstr=off -CG:movnti=0
SP         -Ofast
13 NAS C FT (32 Proc)
Optimization  Mops/Proc  Compile Time (seconds)
Seaborg Best       86.5  N/A
-O0               148.8  0.7
-O1               180.6  0.9
-O2               356.5  1.4
-O3               347.4  2.4
-Ofast            346.0  3.4
14 SuperLU MPI Benchmark Based on the SuperLU general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. Mostly C with some Fortran 90 routines. Run on 64 processors/32 nodes. Uses BLAS routines from ACML.
15 SLU (64 procs)
Optimization  Elapsed run time (seconds)  Compile Time (seconds)
Seaborg Best                       742.5  N/A
-O0                                276.7  5.8
-O1                                241.5  7.1
-O2                                213.5  10.6
-O3                                212.1  14.6
-Ofast                               N/A  Did not compile
17 ACML Library AMD Core Math Library: a set of numerical routines tuned specifically for AMD64 platform processors. –BLAS –LAPACK –FFT To use with Pathscale: –module load acml (built with pathscale compilers) –Compile and link with $ACML To use with gcc: –module load acml_gcc (built with pathscale compilers) –Compile and link with $ACML
18 Matrix Multiply Optimization Example 3 ways to multiply 2 dense matrices: –Directly in Fortran with nested loops –Matmul F90 intrinsic –dgemm from ACML Example: two 1000 by 1000 double-precision matrices. Order of indices: ijk means the loops are nested with i outermost and k innermost: – do i=1,n – do j=1,n – do k=1,n