Presentation transcript:

images source: AMD

image source: AMD

…and also non-Microsoft… through standards & open specification

performance portability productivity

C++ AMP “Hello World”
File -> New -> Project -> Visual C++ -> Empty Project
Project -> Add New Item -> Empty C++ file

#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);

    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);

    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

C++ AMP “Hello World”
File -> New -> Project -> Empty Project
Project -> Add New Item -> Empty C++ file

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);

    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

C++ AMP “Hello World”
File -> New -> Project -> Empty Project
Project -> Add New Item -> Empty C++ file

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);

    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

- amp.h: header for the C++ AMP library
- concurrency: namespace for the library
- array_view: wraps the data to operate on the accelerator; array_view variables are captured and the associated data is copied to the accelerator (on demand)
- parallel_for_each: executes the lambda on the accelerator once per thread
- extent: the number and shape of threads that execute the lambda
- index: the ID of the thread running the lambda, used to index into the data
- restrict(amp): tells the compiler to check that the code conforms to the C++ AMP subset of C++

void AddArrays(int n, int * pA, int * pB, int * pSum)
{
    for (int i = 0; i < n; i++)
    {
        pSum[i] = pA[i] + pB[i];
    }
}

How do we take the serial code above, which runs on the CPU, and convert it to run on an accelerator like the GPU?

Serial (CPU):

void AddArrays(int n, int * pA, int * pB, int * pSum)
{
    for (int i = 0; i < n; i++)
    {
        pSum[i] = pA[i] + pB[i];
    }
}

C++ AMP:

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pSum)
{
    array_view<int, 1> a(n, pA);
    array_view<int, 1> b(n, pB);
    array_view<int, 1> sum(n, pSum);

    parallel_for_each(
        sum.extent,
        [=](index<1> i) restrict(amp)
        {
            sum[i] = a[i] + b[i];
        }
    );
}

void AddArrays(int n, int * pA, int * pB, int * pSum)
{
    array_view<int, 1> a(n, pA);
    array_view<int, 1> b(n, pB);
    array_view<int, 1> sum(n, pSum);

    parallel_for_each(
        sum.extent,
        [=](index<1> i) restrict(amp)
        {
            sum[i] = a[i] + b[i];
        }
    );
}

- array_view: wraps the data to operate on the accelerator; array_view variables are captured and the associated data is copied to the accelerator (on demand)
- parallel_for_each: executes the lambda on the accelerator once per thread
- extent: the number and shape of threads that execute the lambda
- index: the ID of the thread running the lambda, used to index into the data
- restrict(amp): tells the compiler to check that this code conforms to the C++ AMP language restrictions

index<1> i(2);          // 1-D
index<2> i(0, 2);       // 2-D
index<3> i(2, 0, 1);    // 3-D

extent<1> e(6);         // 1-D
extent<2> e(3, 4);      // 2-D
extent<3> e(3, 2, 2);   // 3-D

vector<int> v(10);
extent<2> e(2, 5);
array_view<int, 2> a(e, v);
// the above two lines can also be written as
// array_view<int, 2> a(2, 5, v);

index<2> i(1, 3);
int o = a[i];    // or a[i] = 16;
                 // or int o = a(1, 3);
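To make the relationship between extent, index, and array_view concrete, here is a small sketch that is not from the original slides (names and sizes are illustrative): each thread writes its own linearized coordinate into a 3 x 4 view, and the host reads the result back through the same view.

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

int main()
{
    std::vector<int> data(3 * 4);
    extent<2> e(3, 4);                 // 3 rows x 4 columns = 12 threads
    array_view<int, 2> av(e, data);

    parallel_for_each(e, [=](index<2> idx) restrict(amp)
    {
        // each thread records its own linearized coordinate
        av[idx] = idx[0] * 4 + idx[1];
    });

    for (int row = 0; row < 3; row++)
    {
        for (int col = 0; col < 4; col++)
            std::cout << av(row, col) << ' ';   // CPU access synchronizes the view back
        std::cout << '\n';
    }
}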

parallel_for_each(
    e,                                // e is of type extent<N>
    [=](index<N> idx) restrict(amp)
    {
        // kernel code
    }
);

introduction.aspx

double cos( double d );                      // 1a: cpu code
double cos( double d ) restrict(amp);        // 1b: amp code

double bar( double d ) restrict(cpu, amp);   // 2:  common subset of both

void some_method(array_view<double>& c)
{
    parallel_for_each(c.extent, [=](index<1> idx) restrict(amp)
    {
        //…
        double d0 = c[idx];
        double d1 = bar(d0);   // ok, bar's restrictions include amp
        double d2 = cos(d0);   // ok, chooses the amp overload
        //…
    });
}
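As a hedged illustration of writing your own dual-restricted helper (clamp01 is a made-up name, not part of the deck or the library), a function marked restrict(cpu, amp) can be called from ordinary CPU code and from inside an amp-restricted lambda:

// callable from both CPU code and amp-restricted code
float clamp01(float x) restrict(cpu, amp)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}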

[Diagram: Host connected over PCIe to an Accelerator (e.g. a discrete GPU)]

// enumerate all accelerators
vector<accelerator> accs = accelerator::get_all();

// choose one based on your criteria
accelerator acc = accs[0];

// launch a kernel on it
parallel_for_each(acc.default_view, my_extent, [=]…);
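One possible selection policy, sketched here as an assumption rather than anything from the slides: prefer a non-emulated accelerator and break ties by dedicated memory. The helper name pick_accelerator is illustrative.

#include <amp.h>
#include <vector>
using namespace concurrency;

accelerator pick_accelerator()
{
    std::vector<accelerator> accs = accelerator::get_all();
    accelerator best = accs[0];
    for (const accelerator& a : accs)
    {
        // prefer real hardware over the software emulators,
        // then prefer the device with the most dedicated memory
        bool better = (!a.is_emulated && best.is_emulated) ||
                      (a.is_emulated == best.is_emulated &&
                       a.dedicated_memory > best.dedicated_memory);
        if (better)
            best = a;
    }
    return best;
}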

vector<int> v(8 * 12);
extent<2> e(8, 12);
accelerator acc = …;
array<int, 2> a(e, acc.default_view);

copy_async(v.begin(), v.end(), a);

parallel_for_each(e, [&](index<2> idx) restrict(amp)
{
    a[idx] += 1;
});

copy(a, v.begin());
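copy_async returns a concurrency::completion_future, so the host-to-accelerator copy can overlap other CPU work. A minimal sketch, assuming the default accelerator (this is an illustration, not part of the original slide):

#include <amp.h>
#include <vector>
using namespace concurrency;

int main()
{
    std::vector<int> v(8 * 12, 1);
    extent<2> e(8, 12);
    array<int, 2> a(e);                               // on the default accelerator

    completion_future f = copy_async(v.begin(), v.end(), a);
    // ... unrelated CPU work could overlap with the copy here ...
    f.get();                                          // wait for the copy to finish

    parallel_for_each(e, [&a](index<2> idx) restrict(amp) { a[idx] += 1; });
    copy(a, v.begin());                               // synchronous copy back to the host
}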

void MatrixMultiplySerial( vector<float>& vC, const vector<float>& vA,
                           const vector<float>& vB, int M, int N, int W )
{
    for (int row = 0; row < M; row++)
    {
        for (int col = 0; col < N; col++)
        {
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
}

void MatrixMultiplyAMP( vector<float>& vC, const vector<float>& vA,
                        const vector<float>& vB, int M, int N, int W )
{
    array_view<const float, 2> a(M, W, vA), b(W, N, vB);
    array_view<float, 2> c(M, N, vC);
    c.discard_data();
    parallel_for_each(c.extent, [=](index<2> idx) restrict(amp)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i++)
            sum += a(row, i) * b(i, col);
        c[idx] = sum;
    });
}
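As a small usage sketch (not from the deck; the sizes and fill values are arbitrary), the AMP version above can be driven like this; the results should land back in vC when the array_view over it goes out of scope at the end of MatrixMultiplyAMP:

#include <vector>
using std::vector;

int main()
{
    const int M = 64, N = 64, W = 64;       // arbitrary sizes for illustration
    vector<float> vA(M * W, 1.0f);
    vector<float> vB(W * N, 2.0f);
    vector<float> vC(M * N);

    MatrixMultiplyAMP(vC, vA, vB, M, N, W); // defined in the slide above

    // every element of vC should now equal W * 1.0f * 2.0f
}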

[Diagram: threads with per-thread registers, all reading and writing global memory]

[Diagram: several groups of threads, each thread with per-thread registers, sharing global memory]

[Diagram: per-thread registers, a programmable cache per group of threads, and global memory]

array_view<int, 1> data(12, my_data);

parallel_for_each(data.extent,
    [=](index<1> idx) restrict(amp) { /* … */ });

parallel_for_each(data.extent.tile<6>(),
    [=](tiled_index<6> t_idx) restrict(amp) { /* … */ });

array_view<int, 2> data(2, 6, p_my_data);

parallel_for_each(data.extent.tile<2, 3>(),
    [=](tiled_index<2, 3> t_idx) restrict(amp) { /* … */ });
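A hedged sketch of what the different tiled_index coordinates mean for the 2 x 6 example above; the 2 x 3 tile size and the helper name label_tiles are assumptions, chosen so that two tiles cover the domain:

#include <amp.h>
using namespace concurrency;

void label_tiles(int* p_my_data)   // expects 2 x 6 = 12 ints
{
    array_view<int, 2> data(2, 6, p_my_data);

    parallel_for_each(data.extent.tile<2, 3>(),
        [=](tiled_index<2, 3> t_idx) restrict(amp)
    {
        index<2> g = t_idx.global;        // position in the full 2 x 6 domain
        index<2> l = t_idx.local;         // position within this thread's 2 x 3 tile
        index<2> t = t_idx.tile;          // coordinates of the tile itself
        index<2> o = t_idx.tile_origin;   // global position of the tile's (0,0) element

        data[g] = t[1];                   // e.g. record which tile column each element is in
    });
}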

imagine the code here

void MatrixMultSimple(vector<float>& vC, const vector<float>& vA,
                      const vector<float>& vB, int M, int N, int W )
{
    array_view<const float, 2> a(M, W, vA), b(W, N, vB);
    array_view<float, 2> c(M, N, vC);
    c.discard_data();
    parallel_for_each(c.extent,
        [=](index<2> idx) restrict(amp)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < W; k++)
            sum += a(row, k) * b(k, col);
        c[idx] = sum;
    });
}

void MatrixMultTiled(vector<float>& vC, const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W )
{
    static const int TS = 16;
    array_view<const float, 2> a(M, W, vA), b(W, N, vB);
    array_view<float, 2> c(M, N, vC);
    c.discard_data();
    parallel_for_each(c.extent.tile<TS, TS>(),
        [=](tiled_index<TS, TS> t_idx) restrict(amp)
    {
        int row = t_idx.global[0];
        int col = t_idx.global[1];
        float sum = 0.0f;
        for (int k = 0; k < W; k++)
            sum += a(row, k) * b(k, col);
        c[t_idx.global] = sum;
    });
}

void MatrixMultTiled(vector<float>& vC, const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W )
{
    static const int TS = 16;
    array_view<const float, 2> a(M, W, vA), b(W, N, vB);
    array_view<float, 2> c(M, N, vC);
    c.discard_data();
    parallel_for_each(c.extent.tile<TS, TS>(),
        [=](tiled_index<TS, TS> t_idx) restrict(amp)
    {
        int row = t_idx.local[0];
        int col = t_idx.local[1];
        tile_static float locA[TS][TS], locB[TS][TS];
        float sum = 0.0f;
        for (int i = 0; i < W; i += TS)
        {
            // Phase 1: each thread stages one element of A and B into tile_static memory
            locA[row][col] = a(t_idx.global[0], col + i);
            locB[row][col] = b(row + i, t_idx.global[1]);
            t_idx.barrier.wait();

            // Phase 2: compute on the staged tiles
            for (int k = 0; k < TS; k++)
                sum += locA[row][k] * locB[k][col];
            t_idx.barrier.wait();
        }
        c[t_idx.global] = sum;
    });
}

/* Trying to use the REF emulator on a machine that does not have it
   installed throws a runtime_exception */
try
{
    accelerator a(accelerator::direct3d_ref);
}
catch (runtime_exception& ex)
{
    std::cout << ex.what() << std::endl;
}
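You can also steer away from the REF emulator up front. A hedged sketch (the helper name choose_default_accelerator is made up) that falls back to the WARP software accelerator when the current default happens to be REF:

#include <amp.h>
#include <iostream>
using namespace concurrency;

void choose_default_accelerator()
{
    accelerator def;   // the current default accelerator
    if (def.device_path == accelerator::direct3d_ref)
    {
        // REF is very slow; fall back to WARP if no hardware GPU was found
        if (!accelerator::set_default(accelerator::direct3d_warp))
            std::wcout << L"could not change the default accelerator\n";
    }
}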

C++ AMP type        DirectX type               C++ AMP interop API
array               ID3D11Buffer*              get_buffer, make_array
texture             ID3D11Texture1D/2D/3D*     get_texture, make_texture
accelerator_view    ID3D11Device*              get_device, create_accelerator_view

amp.aspx

#include <amp.h>
#include <amp_math.h>
#include <vector>
using namespace concurrency;
using namespace concurrency::fast_math;   // using namespace concurrency::precise_math;

int main()
{
    float a = 2.2f, b = 3.5f;
    float result = pow(a, b);

    std::vector<float> v(1);
    array_view<float> av(1, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] = pow(a, b);
    });
}
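precise_math requires full double-precision support on the accelerator, while fast_math trades accuracy for speed and works in single precision. A hedged capability check before relying on precise_math (not from the slides; it simply queries the default accelerator):

#include <amp.h>
using namespace concurrency;

bool can_use_precise_math()
{
    accelerator acc;   // the default accelerator
    // precise_math needs full double support; fast_math does not
    return acc.supports_double_precision;
}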
