1 PGI® Compilers: Tools for Scientists and Engineers
Brent Leback – Dave Norton – September 2006

2 Outline of Today's Topics
Introduction to PGI Compilers and Tools
Documentation, Getting Help
Basic Compiler Options
Optimization Strategies
6.2 Features and Roadmap
Questions and Answers

3 PGI Compilers and Tools, features
Optimization – state-of-the-art vector, parallel, IPA, feedback, …
Cross-platform – AMD & Intel, 32/64-bit, Linux & Windows
PGI Unified Binary for AMD and Intel processors
Tools – integrated OpenMP/MPI debug & profile, IDE integration
Parallel – MPI, OpenMP 2.5, auto-parallel for multi-core
Comprehensive OS support – Red Hat 7.3–9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1–10.1, SLES 8/9/10, Windows XP, Windows x64

4 PGI Tools Enable Developers to:
View x64 as a unified CPU architecture
Extract peak performance from x64 CPUs
Ride innovation waves from both Intel and AMD
Use a single source base and toolset across Linux and Windows
Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP

5 PGI Documentation and Support
PGI-provided documentation
PGI User Forums, at
PGI FAQs, Tips & Techniques pages
E-mail support, via
Web support, a form-based system similar to e-mail support
Fax support

6 PGI Docs & Support, cont.
Legacy phone support, direct access, etc.
PGI download web page
PGI prepared/personalized training
PGI ISV program
PGI Premier Service program

7 PGI Basic Compiler Options
Basic usage
Language dialects
Target architectures
Debugging aids
Optimization switches

8 PGI Basic Compiler Usage
A compiler driver interprets options and invokes pre-processors, compilers, the assembler, the linker, etc.
Option precedence: if options conflict, the last option on the command line takes precedence
Use -Minfo to see a listing of optimizations and transformations performed by the compiler
Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
Use man pages for more details on options, e.g. "man pgf90"
Use -v to see under the hood

9 Flags to support language dialects
Fortran
  pgf77, pgf90, pgf95, pghpf tools
  Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  -Mextend, -Mfixed, -Mfreeform
  Type size: -i2, -i4, -i8, -r4, -r8, etc.
  -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
C/C++
  pgcc, pgCC (aka pgcpp)
  Suffixes: .c, .C, .cc, .cpp, .i
  -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

10 Specifying the target architecture
Not an issue on the XT3; defaults to the type of processor/OS you are running on
Use the -tp switch:
  -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
  -tp amd64e for AMD Opteron rev E or later
  -tp x64 for a unified binary
  -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code

11 Flags for debugging aids
-g generates symbolic debug information used by a debugger
-gopt generates debug information in the presence of optimization
-Mbounds adds array bounds checking
-v gives verbose output, useful for debugging system or build problems
-Mlist will generate a listing
-Minfo provides feedback on optimizations made by the compiler
-S or -Mkeepasm to see the exact assembly generated

12 Basic optimization switches
Traditional optimization is controlled through -O[n], where n is 0 to 4
The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
  For -Munroll, c specifies completely unrolling loops with this loop count or less
  -Munroll=n:m says unroll other loops m times
  -Mnoframe does not set up a stack frame
  -Mlre is loop-carried redundancy elimination

13 Basic optimization switches, cont.
The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
-fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
-Mcache_align aligns top-level arrays and objects on cache-line boundaries
-Mflushz flushes SSE denormal numbers to zero

14 Optimization Strategies
Establish a workload
Optimize from the top down
Use proper tools and methods
Processor-level optimizations, parallel methods
Different flags/features for different types of code

15 Node level tuning
 Vectorization – packed SSE instructions maximize performance
 Interprocedural Analysis (IPA) – use it! motivating examples
 Function Inlining – especially important for C and C++
 Parallelization – for Cray XD1 and multi-core processors
 Miscellaneous Optimizations – hit or miss, but worth a try

16 Vectorizable F90 Array Syntax

    !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353 Do Index = 1, NodeCount
354   IX = MOD (Index - 1, NodesX)
355   IY = ((Index - 1) / NodesX)
356   CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357   CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358   JetSim (Index) = SUM (Graph (:, :, Index) * &
359 &   GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360   VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX)
361   VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX)
362 End Do

Data is REAL*4. The inner "loop" at line 358 is vectorizable and can use packed SSE instructions.

17 -fastsse to Enable SSE Vectorization, -Minfo to List Optimizations to stderr

% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
…
localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
…

18 Facerec Scalar: sec, Facerec Vector: 84.3 sec

Scalar SSE:
.LB6_668:       # lineno: 358
    movss -12(%rax),%xmm2
    movss -4(%rax),%xmm3
    subl $1,%edx
    mulss -12(%rcx),%xmm2
    addss %xmm0,%xmm2
    mulss -4(%rcx),%xmm3
    movss -8(%rax),%xmm0
    mulss -8(%rcx),%xmm0
    addss %xmm0,%xmm2
    movss (%rax),%xmm0
    addq $16,%rax
    addss %xmm3,%xmm2
    mulss (%rcx),%xmm0
    addq $16,%rcx
    testl %edx,%edx
    addss %xmm0,%xmm2
    movaps %xmm2,%xmm0
    jg .LB6_625

Vector SSE:
.LB6_1245:      # lineno: 358
    movlps (%rdx,%rcx),%xmm2
    subl $8,%eax
    movlps 16(%rcx,%rdx),%xmm3
    prefetcht0 64(%rcx,%rsi)
    prefetcht0 64(%rcx,%rdx)
    movhps 8(%rcx,%rdx),%xmm2
    mulps (%rsi,%rcx),%xmm2
    movhps 24(%rcx,%rdx),%xmm3
    addps %xmm2,%xmm0
    mulps 16(%rcx,%rsi),%xmm3
    addq $32,%rcx
    testl %eax,%eax
    addps %xmm3,%xmm0
    jg .LB6_1245

19 Vectorizable C Code Fragment?

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Minfo functions.c
func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency

20 Pointer Arguments Inhibit Vectorization

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

21 C Constant Inhibits Vectorization

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop

217 void func4(float *u1, float *u2, float *u3, …
…
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

22 -Msafeptr Option and Pragma

-M[no]safeptr[=all | arg | auto | dummy | local | static | global]
    all      All pointers are safe
    arg      Argument pointers are safe
    local    Local pointers are safe
    static   Static local pointers are safe
    global   Global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all},…
where scope is global, routine, or loop

23 Common Barriers to SSE Vectorization
 Potential dependencies & C pointers – give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
 Function calls – try inlining with -Minline or -Mipa=inline
 Type conversions – manually convert constants or use flags
 Large number of statements – try -Mvect=nosizelimit
 Too few iterations – usually better to unroll the loop
 Real dependencies – must restructure the loop, if possible

24 Barriers to Efficient Execution of Vector SSE Loops
 Not enough work – vectors are too short
 Vectors not aligned to a cache-line boundary
 Non-unity strides
 Code bloat if altcode is generated

25
 Vectorization – packed SSE instructions maximize performance
 Interprocedural Analysis (IPA) – use it! motivating example
 Function Inlining – especially important for C and C++
 Parallelization – for Cray XD1 and multi-core processors
 Miscellaneous Optimizations – hit or miss, but worth a try

26 What can Interprocedural Analysis and Optimization with -Mipa do for You?
 Interprocedural constant propagation
 Pointer disambiguation
 Alignment detection, alignment propagation
 Global variable mod/ref detection
 F90 shape propagation
 Function inlining
 IPA optimization of libraries, including inlining

27 Effect of IPA on the WUPWISE Benchmark

PGF95 Compiler Options          Execution Time in Seconds
-fastsse
-fastsse -Mipa=fast
-fastsse -Mipa=fast,inline      91.72

 -Mipa=fast => constant propagation => compiler sees complex matrices are all 4x3 => completely unrolls loops
 -Mipa=fast,inline => small matrix multiplies are all inlined

28 Using Interprocedural Analysis
 Must be used at both compile time and link time
 Non-disruptive to the development process – edit/build/run
 Speed-ups of 5%–10% are common
 -Mipa=safe: – safe to optimize functions which call or are called from unknown function/library name
 -Mipa=libopt – perform IPA optimizations on libraries
 -Mipa=libinline – perform IPA inlining from libraries

29
 Vectorization – packed SSE instructions maximize performance
 Interprocedural Analysis (IPA) – use it! motivating examples
 Function Inlining – especially important for C and C++
 SMP Parallelization – for Cray XD1 and multi-core processors
 Miscellaneous Optimizations – hit or miss, but worth a try

30 Explicit Function Inlining

-Minline[=[lib:]inlib | [name:]func | except:func | size:n | levels:n]
    [lib:]inlib   Inline extracted functions from inlib
    [name:]func   Inline function func
    except:func   Do not inline function func
    size:n        Inline only functions smaller than n statements (approximate)
    levels:n      Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!

31 Other C++ Recommendations
 Encapsulation, data hiding – small functions, inline!
 Exception handling – use --no_exceptions until 7.0
 Overloaded operators, overloaded functions – okay
 Pointer chasing – -Msafeptr, restrict qualifier, 32 bits?
 Templates, generic programming – now okay
 Inheritance, polymorphism, virtual functions – runtime lookup or check, no inlining, potential performance penalties

32
 Vectorization – packed SSE instructions maximize performance
 Interprocedural Analysis (IPA) – use it! motivating examples
 Function Inlining – especially important for C and C++
 SMP Parallelization – for Cray XD1 and multi-core processors
 Miscellaneous Optimizations – hit or miss, but worth a try

33 SMP Parallelization
 -Mconcur for auto-parallelization on multi-core
   Compiler strives for parallel outer loops, vector SSE inner loops
   -Mconcur=innermost forces a vector/parallel innermost loop
   -Mconcur=cncall enables parallelization of loops with calls
 -mp to enable the OpenMP 2.5 parallel programming model
   See the PGI User's Guide or the OpenMP 2.5 standard
   OpenMP programs compiled without -mp "just work"
   Not supported on Cray XT3 – would require some custom work
 -Mconcur and -mp can be used together!

34 MGRID Benchmark Main Loop

      DO 10 I3=2, N-1
      DO 10 I2=2,N-1
      DO 10 I1=2,N-1
 10   R(I1,I2,I3) = V(I1,I2,I3)
     & -A(0)*(U(I1,I2,I3))
     & -A(1)*(U(I1-1,I2,I3)+U(I1+1,I2,I3)
     &      +U(I1,I2-1,I3)+U(I1,I2+1,I3)
     &      +U(I1,I2,I3-1)+U(I1,I2,I3+1))
     & -A(2)*(U(I1-1,I2-1,I3)+U(I1+1,I2-1,I3)
     &      +U(I1-1,I2+1,I3)+U(I1+1,I2+1,I3)
     &      +U(I1,I2-1,I3-1)+U(I1,I2+1,I3-1)
     &      +U(I1,I2-1,I3+1)+U(I1,I2+1,I3+1)
     &      +U(I1-1,I2,I3-1)+U(I1-1,I2,I3+1)
     &      +U(I1+1,I2,I3-1)+U(I1+1,I2,I3+1) )
     & -A(3)*(U(I1-1,I2-1,I3-1)+U(I1+1,I2-1,I3-1)
     &      +U(I1-1,I2+1,I3-1)+U(I1+1,I2+1,I3-1)
     &      +U(I1-1,I2-1,I3+1)+U(I1+1,I2-1,I3+1)
     &      +U(I1-1,I2+1,I3+1)+U(I1+1,I2+1,I3+1))

35 Auto-parallel MGRID: Overall Speed-up is 40% on Dual-core AMD Opteron

% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
    , Parallel code for non-innermost loop activated if loop count >= 33; block distribution
    291, 4 loop-carried redundant expressions removed with 12 operations and 16 arrays
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop

36
 Vectorization – packed SSE instructions maximize performance
 Interprocedural Analysis (IPA) – use it! motivating examples
 Function Inlining – especially important for C and C++
 SMP Parallelization – for Cray XD1 and multi-core processors
 Miscellaneous Optimizations – hit or miss, but worth a try

37 Miscellaneous Optimizations (1)
 -Mfprelaxed – single-precision sqrt, rsqrt, div performed using a reduced-precision reciprocal approximation
 -lacml and -lacml_mp – link in the AMD Core Math Library
 -Mprefetch=d:,n: – control prefetching distance and the max number of prefetch instructions per loop
 -tp k8-32 – can result in a big performance win on some C/C++ codes that don't require >2GB addressing; pointer and long data become 32 bits

38 Miscellaneous Optimizations (2)
 -O3 – more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
 For C++ codes: --no_exceptions -Minline=levels:10
 -M[no]movnt – disable/force non-temporal moves
 -V[version] – switch between PGI releases at the file level
 -Mvect=noaltcode – disable multiple versions of loops

39 What's New in PGI 6.2
Industry-leading SPECFP06 and SPECINT06 performance
PGI Visual Fortran for Windows x64 & Windows XP
Full-featured PGI Workstation/Server for 32-bit Windows XP
PGI Unified Binary performance enhancements
More gcc extensions/compatibility
New SSE intrinsics
PGI CDK ROLL for ROCKS clusters
MPICH1 and MPICH2 support in the PGI CDK
Incremental debugger/profiler enhancements
Limited tuning for Intel Core2 (Woodcrest et al.)

40 PGI Visual Fortran 6.2
Deep integration with Visual Studio 2005
  PGI-custom Fortran-aware text editor
  Syntax coloring, keyword completion
  Fortran 95 intrinsics tips
  PGI-custom project system and icons
  PGI-custom property pages
  One-touch project build/execute
MS Visual C++ interoperability
  Mixed VC++ / PGI Fortran applications
PGI-custom parallel F95 debug engine
  OpenMP 2.5 / threads debugging
  Just-in-time debugging features
DVF/CVF compatibility features
  Win32 API support
Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
PGI Unified Binary executables
Auto-parallel for multi-core CPUs
Native OpenMP 2.5 parallelization
World-class performance
64-bit Windows x64 support
32-bit Windows 2000/XP support
Optimization/support for AMD64
Optimization/support for Intel EM64T
DEC/IBM/Cray compatibility features
cpp-compatible pre-processing
Visual Studio 2005 bundled*
MSDN Library bundled*
GUI parallel debugging/profiling*
Assembly-optimized BLAS/LAPACK/FFTs*
Boxed CD-ROM/Manuals media kit*
*PVF Workstation Complete only

41 On the PGI Roadmap
PGI Unified Binary directives and enhancements
Aggressive Intel Core2 and next-gen AMD64 tuning
Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows/AMD/Intel/32/64
Incremental PGDBG enhancements, improved C++ support
MPI debugging/profiling for Windows x64 CCS clusters
All-new cross-platform PGPROF performance profiler
Fortran 2003/C99 language features
GCC front-end compatibility, g++ interoperability
PGC++ tuning, PGC++/VC++ interoperability
Windows SUA and Apple/MacOS platform support
De facto standard scalable C/Fortran language/tools extensions

42 Reach me at
Thanks for your time. Questions?

