C++  PPL  AMP: when no branches remain between a micro-op and its retirement to the visible architectural state, it is no longer speculative.


Presentation transcript:

C++  PPL  AMP

When no branches remain between a micro-op and its retirement to the visible architectural state, it is no longer speculative.
    foo; 190: r4 = 0


“addps xmm1, xmm0” adds four packed floats in one instruction; xmm0 and xmm1 hold the two inputs:
      B[0]          B[1]          B[2]          B[3]
    + A[0]          A[1]          A[2]          A[3]
    = A[0] + B[0]   A[1] + B[1]   A[2] + B[2]   A[3] + B[3]

SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2
VECTOR (N operations): vadd v3, v1, v2 computes v3 = v1 + v2 across the whole vector length

Arrays of Parallel Threads - SPMD
All threads run the same code (SPMD). Each thread has an ID (threadID) that it uses to compute memory addresses and make control decisions.
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

    for (i = 0; i < 1000; i++)
        A[i] = B[i] + C[i];

The compiler looks across loop iterations and emits:

    for (i = 0; i < 1000/4; i++) {
        movps xmm0, [ecx]
        movps xmm1, [eax]
        addps xmm0, xmm1
        movps [edx], xmm0
    }

C++ or Klingon


A(3) = ?

Vector: ALL loads before ALL stores
    A(2:5) = A(1:4) + A(3:6)
    VR1 = LOAD(A(1:4))
    VR2 = LOAD(A(3:6))
    VR3 = VR1 + VR2          // A(3) = F(A(2), A(4))
    STORE(A(2:5)) = VR3

Instead - load store load store...
    FOR (j = 2; j <= 257; j++)
        A(j) = A(j-1) + A(j+1)

    A(2) = A(1) + A(3)
    A(3) = A(2) + A(4)       // A(3) = F(A(1), A(2), A(3), A(4))
    A(4) = A(3) + A(5)
    A(5) = A(4) + A(6)
    …

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)

Complex C++ Not just arrays!

    void foo() {
        #pragma loop(hint_parallel(4))
        for (int i = 0; i < 1000; i++)
            A[i] = B[i] + C[i];
    }

The compiler turns this into a call to the runtime plus an outlined, vectorized loop body (parallelism + vector):

    void foo() {
        CompilerParForLib(0, 1000, 4, &foo$par1, A, B, C);
    }

    foo$par1(int T1, int T2, int *A, int *B, int *C) {
        for (int i = T1; i < T2; i += 4)
            movps xmm0, [ecx]
            movps xmm1, [eax]
            addps xmm0, xmm1
            movps [edx], xmm0
    }

Runtime:
    foo$par1(0, 249, A, B, C);      core 1
    foo$par1(250, 499, A, B, C);    core 2
    foo$par1(500, 749, A, B, C);    core 3
    foo$par1(750, 999, A, B, C);    core 4

Vectorized and parallel.

Original loop body:

    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;
    if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }

Split into two loops:

    for (k = 1; k <= M; k++) {
        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < -INFTY) dc[k] = -INFTY;
    }
    for (k = 1; k <= M; k++) {
        if (k < M) {
            ic[k] = mpp[k] + tpmi[k];
            if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
            ic[k] += is[k];
            if (ic[k] < -INFTY) ic[k] = -INFTY;
        }
    }

Fold the k < M test into the loop bound:

    for (k = 1; k < M; k++) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }

“pmax xmm1, xmm0” takes the lane-wise maximum of A[0..3] and B[0..3], held in xmm1 and xmm0:
    A[0] > B[0] ? A[0] : B[0]
    A[1] > B[1] ? A[1] : B[1]
    A[2] > B[2] ? A[2] : B[2]
    A[3] > B[3] ? A[3] : B[3]

    if (__isa_availability > SSE2 && NO_ALIASING) {
        Vector loop
    } else {
        Scalar loop
    }

15x faster

    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        a[i] = sin(a[i]);
    }

is vectorized as:

    for (i = 0; i < n; i = i + VL) {
        a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);      HW SIMD instruction
        a(i : i+VL-1) = _svml_Sin(a(i : i+VL-1));           NEW run-time library
    }

    void Foo(float *a, float *b, float *c) {
        #pragma loop(hint_parallel(N))
        for (auto i = 0; i < N; i++) {
            *c++ = (*a++) * bar(b++);
        }
    }

Pointers and procedure calls with escaped pointers prevent analysis for auto-parallelization. Use simple directives (pragmas).

    for (int l = top; l < bottom; l++) {
        for (int m = left; m < right; m++) {
            int y = *(blurredImage + (l * dimX) + m);
            ySourceRed   += (unsigned int)(y & 0x00FF0000) >> 16;
            ySourceGreen += (unsigned int)(y & 0x0000ff00) >> 8;
            ySourceBlue  += (unsigned int)(y & 0x000000FF);
            averageCount++;
        }
    }

    {
        if (iter >= maxIter)
            *a = 0xFF000000;                 // black
        else                                 // a gradient from red to yellow
        {
            unsigned short redYellowComponent = ~(iter * 32000 / maxIter);
            unsigned int xx = 0x00FFF000;    // 0 Alpha + RED
            xx = xx | redYellowComponent;
            xx <<= 8;
            *a = xx;
        }
    }

    #include <iostream>

    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

        for (int idx = 0; idx < 11; idx++)
        {
            v[idx] += 1;
        }
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(v[i]);
    }

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

        for (int idx = 0; idx < 11; idx++)
        {
            v[idx] += 1;
        }
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(v[i]);
    }

amp.h: header for the C++ AMP library
concurrency: namespace for the library

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);
        for (int idx = 0; idx < 11; idx++)
        {
            v[idx] += 1;
        }
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(v[i]);
    }

array_view: wraps the data to operate on the accelerator. array_view variables are captured, and the associated data is copied to the accelerator on demand.

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);
        for (int idx = 0; idx < 11; idx++)
        {
            av[idx] += 1;
        }
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(av[i]);
    }

array_view: wraps the data to operate on the accelerator. array_view variables are captured, and the associated data is copied to the accelerator on demand.

C++ AMP “Hello World”
File -> New -> Project: Empty Project
Project -> Add New Item: Empty C++ file

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
        {
            av[idx] += 1;
        });
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(av[i]);
    }

parallel_for_each: executes the lambda on the accelerator once per thread
extent: the parallel loop bounds or “shape”
index: the ID of the thread running the lambda, used to index into the data

C++ AMP “Hello World”
File -> New -> Project: Empty Project
Project -> Add New Item: Empty C++ file

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
        {
            av[idx] += 1;
        });
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(av[i]);
    }

restrict(amp): tells the compiler to check that the code conforms to the supported C++ subset, and to target the GPU

C++ AMP “Hello World”
File -> New -> Project: Empty Project
Project -> Add New Item: Empty C++ file

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;
    int main()
    {
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
        {
            av[idx] += 1;
        });
        for (unsigned int i = 0; i < 11; i++)
            std::cout << static_cast<char>(av[i]);
    }

array_view: automatically copied to the accelerator if required
array_view: automatically copied back to the host when and if required

Process roadmap: 32nm, 22nm, 22nm, 14nm, 10nm
Vector width: 128-bit SSE, 256-bit AVX, 256-bit AVX(2)
You are here (3D tri-gate transistors)