1 EECE571R -- Harnessing Massively Parallel Processors
Lecture 1: Introduction to GPU Programming
By Samer Al-Kiswany
Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian and Mark Harris

2 Outline
Hardware
Software Programming Model
Optimizations

3-10 GPU Architecture Intuition (figure-only build slides)

11 GPU Architecture
(Figure: the host machine connected to the GPU. The GPU contains multiprocessors 1..N; each multiprocessor holds processors 1..M, an instruction unit, registers, and shared memory. The device also provides constant, texture, and global memory.)

12 GPU Architecture
SIMD architecture. Four memories:
Device (a.k.a. global): slow (hundreds of cycles access latency), large (256MB - 1GB)
Shared: fast (4 cycles access latency), small (16KB)
Texture: read only
Constant: read only
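A minimal sketch of the two read/write memory spaces above (kernel name, sizes, and the scale-by-a-factor operation are illustrative, not from the slides): global memory is allocated from the host, while the shared-memory buffer lives on chip inside the kernel. There is no data reuse here, so shared memory brings no speedup; the code only shows where each memory lives.

// Hypothetical kernel: stage elements of slow global memory in fast shared memory.
__global__ void scale(float *data, int n, float factor)
{
    __shared__ float tile[256];              // on-chip shared memory, ~4-cycle access
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];         // read from global (device) memory
        tile[threadIdx.x] *= factor;         // operate on the on-chip copy
        data[i] = tile[threadIdx.x];         // write the result back to global memory
    }
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));   // global (device) memory
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}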

13 GPU Architecture – Program Flow
1. Preprocessing (host)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing (host)

TTotal = TPreprocessing + TDataHtoG + TProcessing + TDataGtoH + TPostProc
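A hedged host-side sketch of the five-step flow above; the kernel name process, the array size, and the element type are made up for illustration.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void process(float *d, int n)     // GPU processing (step 3), trivial stand-in
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc((void **)&d, bytes);

    // 1. Preprocessing on the host (fill h ...)
    // 2. Data transfer in: host -> GPU
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    // 3. GPU processing
    process<<<(n + 255) / 256, 256>>>(d, n);
    // 4. Data transfer out: GPU -> host
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    // 5. Postprocessing on the host (use h ...)
    // TTotal = TPreprocessing + TDataHtoG + TProcessing + TDataGtoH + TPostProc

    cudaFree(d);
    free(h);
    return 0;
}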

14 Outline
Hardware
Software Programming Model
Optimizations

15 GPU Programming Model
Programming model: a software representation of the hardware.

16 GPU Programming Model
(Figure: a grid of thread blocks.)
Kernel: a function executed on the grid.
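As a hedged illustration of the kernel/grid/block terms (the vector-add example is standard CUDA practice, not taken from these slides):

// Kernel: a function launched over a grid of thread blocks.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one element from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch: a grid of (n + 255) / 256 blocks, each block holding 256 threads.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);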

17-18 GPU Programming Model (figure-only slides)

19 GPU Programming Model
In reality, the scheduling granularity is a warp (32 threads) → 4 cycles for a warp to complete a single instruction.

20 GPU Programming Model
In reality, the scheduling granularity is a warp (32 threads) → 4 cycles for a warp to complete a single instruction.
Threads in a block can share state through shared memory.
Threads in a block can synchronize.
Global atomic operations are available.
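A small sketch (not from the slides) tying the three bullets together: threads in a block share a buffer in shared memory, synchronize with __syncthreads(), and one thread folds the block's result into a global counter with an atomic operation. The kernel name and the assumption of 256 threads per block are illustrative.

__global__ void blockSum(const int *in, int *globalSum, int n)
{
    __shared__ int partial[256];                 // state shared within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();                             // all threads in the block synchronize

    // Tree reduction within the block (blockDim.x assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(globalSum, partial[0]);        // global atomic operation
}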

21 Outline
Hardware
Software Programming Model
Optimizations

22 Optimizations
Can be roughly grouped into the following categories:
Memory related
Computation related
Data transfer related

23 Optimizations - Memory
Use shared memory
Use texture (1D, 2D, or 3D) and constant memory (see the constant-memory sketch below)
Avoid shared memory bank conflicts
Coalesced memory access (one approach: padding)
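A hedged example of the read-only constant memory mentioned above; the coefficient table and kernel name are invented for illustration.

__constant__ float c_coeffs[16];                 // read-only, cached constant memory

__global__ void applyCoeffs(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= c_coeffs[i % 16];             // every thread reads the cached constants
}

// Host side: copy the table into constant memory before launching.
// float coeffs[16] = { ... };
// cudaMemcpyToSymbol(c_coeffs, coeffs, sizeof(coeffs));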

24 Optimizations - Memory
Shared Memory Complications
(Figure: shared memory organized as banks 0 through 15, 4 bytes wide each.)
Shared memory is organized into 16 banks of 1KB each.
Complication I: concurrent accesses to the same bank are serialized (bank conflict) → slowdown.
Tip: assign different threads to different banks (see the padding sketch below).
Complication II: banks are interleaved at 4-byte granularity (addresses 0, 4, 8, ... map to consecutive banks).
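A common padding trick, shown here as a sketch rather than the slide's own example: adding one extra column to a 16x16 shared-memory tile staggers the rows across the 16 banks, so column-wise accesses by a half-warp no longer hit the same bank. The transpose kernel assumes a square matrix whose width is a multiple of 16.

#define TILE 16                                   // one 4-byte element per bank

__global__ void transposeTile(const float *in, float *out, int width)
{
    // Padding the row length to TILE + 1 shifts each row into a different bank,
    // so reading a column (tile[0][x], tile[1][x], ...) avoids bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}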

25 Optimizations - Memory
Global Memory Coalesced Access

26 Optimizations - Memory
Global Memory Non-Coalesced Access
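A hedged illustration of the difference between the two access patterns above (the kernel is invented): consecutive threads reading consecutive 4-byte words coalesce into a single memory transaction, while a strided pattern is serviced as many separate transactions.

__global__ void copyExamples(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread i touches element i, so a warp reads one contiguous segment.
    if (i < n)
        out[i] = in[i];

    // Non-coalesced: thread i would touch element i * stride, so a warp's accesses
    // are scattered across memory and cannot be combined.
    // if (i * stride < n)
    //     out[i] = in[i * stride];
}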

27 Optimizations
Can be roughly grouped into the following categories:
Memory related
Computation related
Data transfer related

28 Optimizations - Computation
Use 1000s of threads to best utilize the GPU hardware.
Use full warps (32 threads): make block sizes a multiple of 32.
Lower code branch divergence.
Avoid synchronization.
Loop unrolling (fewer instructions, more room for compiler optimizations).
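A small sketch of the last bullet (the polynomial-evaluation kernel is made up): #pragma unroll asks the compiler to unroll a fixed-count loop, removing the loop counter and branch and leaving more room for instruction scheduling.

__global__ void polyEval(const float *x, float *y, int n)
{
    // Coefficients held in a small local array; the loop count is known at compile time.
    const float c[4] = { 1.0f, 0.5f, 0.25f, 0.125f };

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        #pragma unroll
        for (int k = 3; k >= 0; --k)      // unrolled: no loop branch at run time
            acc = acc * x[i] + c[k];
        y[i] = acc;
    }
}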

29 Optimizations
Can be roughly grouped into the following categories:
Memory related
Computation related
Data transfer related

30 Optimizations – Data Transfer
Reduce the amount of data transferred between host and GPU.
Hide transfer overhead by overlapping transfer and computation (asynchronous transfer).
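A hedged sketch of the second bullet: with page-locked (pinned) host memory and two CUDA streams, copying one chunk can overlap the kernel working on another. The function and kernel names, chunk layout, and the two preallocated device buffers are assumptions for illustration.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)          // stand-in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// d_buf: two device buffers of at least `chunk` floats each, allocated by the caller.
void pipelined(float *d_buf[2], int chunk, int nChunks)
{
    float *h_buf;
    cudaMallocHost((void **)&h_buf, (size_t)nChunks * chunk * sizeof(float));  // pinned host memory

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;
        // Asynchronous copy and kernel issued on the same stream; the two streams overlap.
        cudaMemcpyAsync(d_buf[s], h_buf + (size_t)c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    cudaFreeHost(h_buf);
}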

31 Summary GPUs are highly parallel devices.
Easy to program for (functionality). Hard to optimize for (performance).
Optimization: many optimizations exist, but often you do not need them all (iterate between profiling and optimization).
Optimizations may bring hard tradeoffs (more computation vs. less memory, more computation vs. better memory access, etc.).

