1 The Portland Group, Inc. Brent Leback HPC User Forum, Broomfield, CO September 2009

2 High Level Languages for Clusters
- Many failures in this area, academically and commercially
  - Lack of Supply?
  - Lack of Standards?
  - Bad/Buggy Implementations?
  - Lack of Generality?
  - Lack of Performance?
- CAF is headed for the Fortran Standard (?) (!)
  - Is it a good idea?
  - Is it mature enough to standardize?
  - Will anyone in attendance use it?
- Given our experience with HPF, PGI will be conservative on this front
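For readers who have not seen CAF (Coarray Fortran), here is a minimal, hedged sketch of standard coarray syntax. It is not taken from the slides; it only illustrates the kind of language extension whose standardization is being questioned above.

    program caf_sum
      implicit none
      real :: partial[*]                  ! coarray: one copy per image (process)
      real :: total
      integer :: img
      partial = real(this_image())        ! each image contributes its own value
      sync all                            ! make every image's value visible
      if (this_image() == 1) then
         total = 0.0
         do img = 1, num_images()
            total = total + partial[img]  ! remote read from image img
         end do
         print *, 'sum over all images =', total
      end if
    end program caf_sum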

3 Performance Across Platforms: PGI Unified Binary
- PGI Unified Binary has been available since 2005
  - A single x64 binary including optimized code sequences for multiple target processor cores
  - -tp switch to specify target processor type; a number of AMD and Intel processor families currently supported
  - Especially important to ISVs
  - AVX support is in progress
- Now PGI Unified Binary supports accelerated/non-accelerated binaries
  - A single x64 binary recognizes the existence of a GPU and runs PGI accelerated versions there if available
  - -ta switch to specify target accelerator; currently only -ta=nvidia is supported
  - Use -ta=nvidia,host to generate code for both cases
- Target processor and target accelerator switches can be used together; today, Intel64, AMD64, + NVIDIA is the full gamut

4 The “Full Gamut” Isn’t Very Full

5 PGI Accelerator Compilers

Fortran source with accelerator directives:

    SUBROUTINE SAXPY (A,X,Y,N)
    INTEGER N
    REAL A,X(N),Y(N)
    !$ACC REGION
      DO I = 1, N
        X(I) = A*X(I) + Y(I)
      ENDDO
    !$ACC END REGION
    END

Host x64 asm file (accelerator runtime calls inserted by the compiler):

    saxpy_:
        ...
        movl    (%rbx), %eax
        movl    %eax, -4(%rbp)
        call    __pgi_cu_init
        ...
        call    __pgi_cu_function
        ...
        call    __pgi_cu_alloc
        ...
        call    __pgi_cu_upload
        ...
        call    __pgi_cu_call
        ...
        call    __pgi_cu_download
        ...

Auto-generated GPU code:

    typedef struct dim3{ unsigned int x,y,z; }dim3;
    typedef struct uint3{ unsigned int x,y,z; }uint3;
    extern uint3 const threadIdx, blockIdx;
    extern dim3 const blockDim, gridDim;
    static __attribute__((__global__)) void pgicuda(
        __attribute__((__shared__)) int tc,
        __attribute__((__shared__)) int i1,
        __attribute__((__shared__)) int i2,
        __attribute__((__shared__)) int _n,
        __attribute__((__shared__)) float* _c,
        __attribute__((__shared__)) float* _b,
        __attribute__((__shared__)) float* _a )
    {
        int i; int p1; int _i;
        i = blockIdx.x * 64 + threadIdx.x;
        if( i < tc ){
            _a[i+i2-1] = ((_c[i+i2-1]+_c[i+i2-1])+_b[i+i2-1]);
            _b[i+i2-1] = _c[i+i2];
            _i = (_i+1);
            p1 = (p1-1);
        }
    }

+ Unified a.out: compile, link, execute ... no change to existing makefiles, scripts, IDEs, programming environment, etc.

6 Supporting Heterogeneous Cores: PGI Accelerator Model
- Minimal changes to the language: directives/pragmas, in the same vein as vector or OpenMP parallel directives. As simple as

      !$ACC REGION
      !$ACC END REGION

- Minimal library calls: usually none
- Standard x64 toolchain: no changes to makefiles, linkers, build process, standard libraries, other tools
- Not a "platform": binaries will execute on any compatible x64+GPU hardware system
- Performance feedback: learn from and leverage the success of vectorizing compilers in the 1970s and 1980s
- Incremental program migration: put migration decisions in the hands of developers
- PGI Unified Binary technology: ensures continued portability to non-GPU-enabled targets
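As an illustration of the directive style listed above, the following is a small, hedged sketch (not from the slides) of a loop nest wrapped in a PGI Accelerator region; the routine and array names are invented for the example.

    subroutine smooth(n, m, a, b)
      integer :: n, m, i, j
      real :: a(n,m), b(n,m)
    !$acc region
      do j = 2, m-1
         do i = 2, n-1
            b(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
         end do
      end do
    !$acc end region
    end subroutine smooth

Compiled with the accelerator target enabled (for example, pgf90 -ta=nvidia,host -Minfo=accel), the compiler generates both GPU and host-fallback versions of the region and reports what it did; exact flags and messages vary by release.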

7 Programmer Productivity: Compiler-to-Programmer Feedback

[Diagram: HPC code is fed to the PGI compiler, which produces the x64 binary plus CCFF feedback; a runtime trace is viewed in PGPROF; the HPC user responds with accelerator directives, options, and restructuring, closing the performance loop.]

CCFF provides: how/when a function was compiled, IPA optimizations, profile-feedback runtime values, info on vectorization and parallelization, compute intensity, and missed opportunities.

8 Supporting Third Parties
- PGI 9.0 supports OpenMP 3.0 for Fortran and C/C++
  - OpenMP 3.0 tasks supported in all languages
  - OpenMP runtime overhead as measured by the EPCC benchmark is lower than our competition's
- PGI is currently working with the OpenMP committee to investigate the support of an accelerator programming model as part of OpenMP and/or other standards bodies
  - Michael Wolfe is our OpenMP representative
- IMSL and NAG are already supported with PGI compilers; we're enabling them to migrate incrementally to heterogeneous manycore
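For readers who have not used OpenMP 3.0 tasks, the following is a minimal, hedged Fortran sketch; the list_mod module, node type, and process routine are hypothetical, and only standard OpenMP 3.0 directives are assumed.

    subroutine traverse_list(head)
      use list_mod, only: node, process   ! hypothetical list type and work routine
      type(node), pointer :: head, p
    !$omp parallel
    !$omp single
      p => head
      do while (associated(p))
    !$omp task firstprivate(p)
         call process(p)                  ! each list element becomes an independent task
    !$omp end task
         p => p%next
      end do
    !$omp end single
    !$omp end parallel
    end subroutine traverse_list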

9 Availability and Additional Information
- PGI Accelerator Programming Model: supported for x64+NVIDIA Linux targets in the PGI 9.0 Fortran and C compilers, available now
- PGI CUDA Fortran: supporting explicit programming of x64+NVIDIA targets, will be available in a production release of the PGI Fortran 95/03 compiler currently scheduled for release in November 2009
- Other GPU and accelerator targets: being studied by PGI, and may be supported in the future as the necessary low-level software infrastructure (e.g. OpenCL) becomes more widely available
- Further information: see www.pgroup.com/accelerate for a detailed specification of the PGI Accelerator model, an FAQ, and related articles and white papers
- CCFF: the Common Compiler Feedback Format is described at
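Since CUDA Fortran is mentioned above without an example, here is a minimal, hedged sketch of what explicit CUDA Fortran programming looks like. The kernel and variable names are invented, and details may differ from the eventual PGI release.

    module saxpy_mod
    contains
      attributes(global) subroutine saxpy_kernel(n, a, x, y)
        integer, value :: n
        real, value :: a
        real :: x(*), y(*)
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x   ! global thread index (1-based)
        if (i <= n) y(i) = a * x(i) + y(i)
      end subroutine saxpy_kernel
    end module saxpy_mod

    program test_saxpy
      use cudafor
      use saxpy_mod
      implicit none
      integer, parameter :: n = 4096
      real :: x(n), y(n)
      real, device :: x_d(n), y_d(n)            ! device-resident copies
      x = 1.0; y = 2.0
      x_d = x; y_d = y                          ! host-to-device copies via assignment
      call saxpy_kernel<<<(n+255)/256, 256>>>(n, 2.0, x_d, y_d)
      y = y_d                                   ! device-to-host copy
      print *, y(1), y(n)
    end program test_saxpy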