1 The Portland Group, Inc. Brent Leback HPC User Forum, Broomfield, CO September 2009.


1 The Portland Group, Inc.
Brent Leback
brent.leback@pgroup.com
www.pgroup.com
HPC User Forum, Broomfield, CO
September 2009

2 High Level Languages for Clusters
- Many failures in this area, academically and commercially
  - Lack of Supply?
  - Lack of Standards?
  - Bad/Buggy Implementations?
  - Lack of Generality?
  - Lack of Performance?
- CAF is headed for the Fortran Standard (?) (!)
  - Is it a good idea?
  - Is it mature enough to standardize?
  - Will anyone in attendance use it?
- Given our experience with HPF, PGI will be conservative on this front

3 Performance Across Platforms: PGI Unified Binary
- PGI Unified Binary has been available since 2005
  - A single x64 binary includes optimized code sequences for multiple target processor cores.
  - The -tp switch specifies the target processor type; a number of AMD and Intel processor families are currently supported.
  - Especially important to ISVs
  - AVX support is in progress
- PGI Unified Binary now supports accelerated/non-accelerated binaries
  - A single x64 binary detects the presence of a GPU and runs the PGI accelerated versions there if available.
  - The -ta switch specifies the target accelerator; currently only -ta=nvidia is supported.
  - Use -ta=nvidia,host to generate code for both cases
- The target processor and target accelerator switches can be used together. Today, Intel64, AMD64, + NVIDIA is the full gamut.
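
The idea of one binary carrying both a host and an accelerated version of a routine can be sketched in C as a function-pointer dispatch. This is only an illustration of the concept, not PGI's actual mechanism: the names saxpy_host, saxpy_accel_stub, and select_saxpy, and the gpu_present flag, are all hypothetical, and the "accelerated" variant is a stand-in that runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every variant of the routine. */
typedef void (*saxpy_fn)(float a, float *x, const float *y, size_t n);

/* Host variant: a plain x64 loop. */
void saxpy_host(float a, float *x, const float *y, size_t n) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        x[i] = a * x[i] + y[i];
    }
}

/* Stand-in for the accelerated variant. In a real unified binary this would
   launch compiler-generated GPU code; here it reuses the host loop so the
   sketch runs on any machine. */
void saxpy_accel_stub(float a, float *x, const float *y, size_t n) {
    saxpy_host(a, x, y, n);
}

/* Hypothetical selector: gpu_present would come from a runtime probe that
   the slides do not describe. */
saxpy_fn select_saxpy(int gpu_present) {
    return gpu_present ? saxpy_accel_stub : saxpy_host;
}
```

A caller would obtain the function once, for example with select_saxpy(0) on a machine without a GPU, and then call through the returned pointer. The point of the sketch is that the choice happens at run time inside a single executable, so the same file can ship to both kinds of system.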

4 The “Full Gamut” Isn’t Very Full

5 PGI Accelerator Compilers

Fortran source:

    SUBROUTINE SAXPY (A,X,Y,N)
    INTEGER N
    REAL A,X(N),Y(N)
    !$ACC REGION
    DO I = 1, N
      X(I) = A*X(I) + Y(I)
    ENDDO
    !$ACC END REGION
    END

Host x64 asm file:

    saxpy_:
        …
        movl (%rbx), %eax
        movl %eax, -4(%rbp)
        call __pgi_cu_init
        ...
        call __pgi_cu_function
        …
        call __pgi_cu_alloc
        …
        call __pgi_cu_upload
        …
        call __pgi_cu_call
        …
        call __pgi_cu_download
        …

Auto-generated GPU code:

    typedef struct dim3{ unsigned int x,y,z; }dim3;
    typedef struct uint3{ unsigned int x,y,z; }uint3;
    extern uint3 const threadIdx, blockIdx;
    extern dim3 const blockDim, gridDim;
    static __attribute__((__global__)) void pgicuda(
        __attribute__((__shared__)) int tc,
        __attribute__((__shared__)) int i1,
        __attribute__((__shared__)) int i2,
        __attribute__((__shared__)) int _n,
        __attribute__((__shared__)) float* _c,
        __attribute__((__shared__)) float* _b,
        __attribute__((__shared__)) float* _a )
    {
        int i; int p1; int _i;
        i = blockIdx.x * 64 + threadIdx.x;
        if( i < tc ){
            _a[i+i2-1] = ((_c[i+i2-1]+_c[i+i2-1])+_b[i+i2-1]);
            _b[i+i2-1] = _c[i+i2];
            _i = (_i+1);
            p1 = (p1-1);
        }
    }

Host x64 asm file + auto-generated GPU code: unified a.out

compile, link, execute … no change to existing makefiles, scripts, IDEs, programming environment, etc.

6 Supporting Heterogeneous Cores: PGI Accelerator Model
- Minimal changes to the language: directives/pragmas, in the same vein as vector or OpenMP parallel directives. As simple as
      !$ACC REGION
      !$ACC END REGION
- Minimal library calls: usually none
- Standard x64 toolchain: no changes to makefiles, linkers, build process, standard libraries, other tools
- Not a “platform”: binaries will execute on any compatible x64+GPU hardware system
- Performance feedback: learn from and leverage the success of vectorizing compilers in the 1970s and 1980s
- Incremental program migration: put migration decisions in the hands of developers
- PGI Unified Binary Technology: ensures continued portability to non GPU-enabled targets
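
For C, the same SAXPY loop might look like the sketch below, using the `#pragma acc region` form of the PGI Accelerator directive. The function name and argument order are illustrative choices, and a real accelerator build may also need data clauses describing the array extents, which are omitted here. A compiler that does not recognize the pragma simply ignores it and runs the loop on the host, which is the portability property the slide describes.

```c
#include <assert.h>

/* x := a*x + y, matching the Fortran SAXPY in the slides.
   restrict tells the compiler that x and y do not overlap. */
void saxpy(int n, float a, float *restrict x, const float *restrict y) {
    assert(n >= 0);
    /* Directive marking the loop as a candidate for the accelerator.
       Compilers without accelerator support skip this line. */
#pragma acc region
    {
        for (int i = 0; i < n; ++i) {
            x[i] = a * x[i] + y[i];
        }
    }
}
```

Removing the pragma leaves a valid, unchanged host program, so the directive can be added to one loop at a time during migration.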

7 Programmer Productivity: Compiler-to-Programmer Feedback

[Diagram labels: HPC Code; PGI Compiler; x64; CCFF; Trace; PGPROF; HPC User; Acc +; Directives, Options, RESTRUCTURING; Performance]

CCFF provides: how/when a function was compiled, IPA optimizations, profile feedback runtime values, info on vectorization and parallelization, compute intensity, and missed opportunities

8 Supporting Third-Parties
- PGI 9.0 supports OpenMP 3.0 for Fortran and C/C++.
  - OpenMP 3.0 tasks are supported in all languages
  - OpenMP runtime overhead as measured by the EPCC benchmark is lower than our competition's
- PGI is currently working with the OpenMP committee to investigate support for an accelerator programming model as part of OpenMP and/or other standards bodies.
  - Michael Wolfe is our OpenMP representative
- IMSL and NAG are already supported with PGI compilers; we're enabling them to migrate incrementally to heterogeneous manycore.
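
As a reminder of what OpenMP 3.0 tasks look like in C, here is a minimal sketch of a recursive array sum. The function names and the TASK_CUTOFF value are illustrative assumptions, not taken from the slides. Built without OpenMP enabled, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical cutoff below which a range is summed serially. */
#define TASK_CUTOFF 1024

/* Sum v[lo..hi) by splitting the range into two tasks. */
long sum_range(const int *v, int lo, int hi) {
    if (hi - lo <= TASK_CUTOFF) {
        long s = 0;
        for (int i = lo; i < hi; ++i) {
            s += v[i];
        }
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left = 0, right = 0;
    /* Each half becomes a task; the results are shared with the parent. */
#pragma omp task shared(left)
    left = sum_range(v, lo, mid);
#pragma omp task shared(right)
    right = sum_range(v, mid, hi);
    /* Wait for both child tasks before combining. */
#pragma omp taskwait
    return left + right;
}

/* Entry point: one thread creates the root task, the team executes tasks. */
long parallel_sum(const int *v, int n) {
    assert(n >= 0);
    long total = 0;
#pragma omp parallel
    {
#pragma omp single
        total = sum_range(v, 0, n);
    }
    return total;
}
```

The cutoff keeps task-creation overhead from dominating on small ranges; a suitable value depends on the machine and runtime.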

9 Availability and Additional Information
- PGI Accelerator Programming Model: supported for x64+NVIDIA Linux targets in the PGI 9.0 Fortran and C compilers, available now
- PGI CUDA Fortran: supporting explicit programming of x64+NVIDIA targets, will be available in a production release of the PGI Fortran 95/03 compiler currently scheduled for release in November 2009
- Other GPU and Accelerator Targets: are being studied by PGI, and may be supported in the future as the necessary low-level software infrastructure (e.g. OpenCL) becomes more widely available
- Further Information: see www.pgroup.com/accelerate for a detailed specification of the PGI Accelerator model, an FAQ, and related articles and white papers
- CCFF: the Common Compiler Feedback Format is described at www.pgroup.com/resources/ccff.htm

