Slide 1: Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science

Slide 2: Why GPUs?
–Every mobile and desktop system will have one
–Affordable and high performance
–Over-provisioned
–Programmable
[Image: Sony PlayStation Phone]

Slide 3: GPU Architecture
[Diagram: the CPU launches Kernel 1 and Kernel 2 onto the GPU over time; the GPU contains 30 streaming multiprocessors (SM 0 through SM 29), each with eight cores, a register file, and shared memory, connected through an interconnection network to global (device) memory]

Slide 4: GPU Programming Model
–Threads → Blocks → Grid
–All the threads run one kernel
–Registers are private to each thread; registers spill to local memory
–Shared memory is shared between the threads of a block
–Global memory is shared between all blocks
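
A minimal CUDA sketch (mine, not from the talk) showing where each level of this model appears in a kernel; the kernel name and sizes are hypothetical, and the launch assumes 256 threads per block:

    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];                      // shared memory: one copy per block
        int tid = blockIdx.x * blockDim.x + threadIdx.x; // grid -> blocks -> threads
        if (tid < n)
            tile[threadIdx.x] = in[tid];                 // global memory -> shared memory
        __syncthreads();                                 // every thread of the block reaches the barrier
        if (tid < n) {
            float r = 2.0f * tile[threadIdx.x];          // r lives in a register, private to this thread
            out[tid] = r;                                // register -> global memory
        }
    }

    // Hypothetical launch: the <<<blocks, threads>>> configuration defines the grid.
    // scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);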

Slide 5: GPU Execution Model
[Diagram: the blocks of Grid 1 are distributed across the streaming multiprocessors; each SM executes its assigned blocks using its own registers and shared memory]

Slide 6: GPU Execution Model
[Diagram: Blocks 0 to 3 assigned to SM 0; within a block, threads are grouped into warps (Warp 0 = threads 0 to 31, Warp 1 = threads 32 to 63) that share the SM's registers and shared memory]
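
For reference, a thread can recover its warp and lane from its thread index; a minimal sketch (mine, not from the talk), assuming the standard 32-thread warp size:

    __device__ void warp_position(int *warpId, int *laneId) {
        *warpId = threadIdx.x / 32;   // warp 0 holds threads 0-31, warp 1 holds threads 32-63, ...
        *laneId = threadIdx.x % 32;   // this thread's slot within its warp
        // All 32 lanes of a warp issue the same instruction in lockstep.
    }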

Slide 7: GPU Programming Challenges
–Restructuring data efficiently for the complex memory hierarchy
  –Global memory, shared memory, registers
–Partitioning work between the CPU and GPU
–Lack of portability between different generations of GPU
  –Number of registers, active warps, size of global memory, size of shared memory
–Variation will only grow
  –Newer high-performance cards, e.g. NVIDIA's Fermi
  –Mobile GPUs with fewer resources
[Plots: the same code optimized for a GeForce GTX 285 vs. optimized for a GeForce 8400 GS]

Slide 8: Nonlinear Optimization Space [Ryoo, CGO '08]
[Plot: SAD optimization space, 908 configurations]
We need a higher level of abstraction!

Slide 9: Goals
–Write-once parallel software
–Free the programmer from low-level details
[Diagram: a single parallel specification maps to multiple targets: shared-memory processors (C + Pthreads), SIMD engines (C + intrinsics), FPGAs (Verilog/VHDL), and GPUs (CUDA/OpenCL)]

Slide 10: Streaming
–Higher level of abstraction
–Decouples computation and memory accesses
–Coarse-grained exposed parallelism, exposed communication
–Programmers can focus on the algorithm instead of low-level details
–Streaming actors use buffers to communicate
–A lot of recent work on extending the portability of streaming applications

Slide 11: Sponge
–Generates optimized CUDA for a wide variety of GPU targets
–Performs an array of optimizations on stream graphs
–Optimizes and ports across different GPU generations
–Utilizes the memory hierarchy (registers, shared memory, coalescing)
–Efficiently utilizes the streaming cores
[Diagram: compilation flow: reorganization and classification, memory layout, graph restructuring, register optimization; covering shared/global memory, helper threads, bank-conflict resolution, loop unrolling, and software prefetching]

Slide 12: GPU Performance Model
M = memory instructions, C = computation instructions
–Memory-bound kernels: total time ≈ memory time (the computation hides under the memory accesses)
–Computation-bound kernels: total time ≈ computation time (the memory accesses hide under the computation)
[Diagram: interleavings of memory instructions M0 to M7 and computation instructions C0 to C7 for the two cases]
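
One plausible back-of-the-envelope reading of this model (my notation, not necessarily the paper's exact formulation):

    T_kernel  ≈ max(T_memory, T_compute)
    T_memory  ≈ (# memory instructions) × (average memory latency) ÷ (overlap from active warps)
    T_compute ≈ (# computation instructions) × (issue cost per instruction)

Whichever term dominates the max classifies the kernel as memory-bound or computation-bound.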

Slide 13: Actor Classification
High-Traffic actors (HiT)
–Large number of memory accesses per actor
–With shared memory, fewer threads fit per SM
–Using shared memory underutilizes the processors
Low-Traffic actors (LoT)
–Fewer memory accesses per actor
–More threads fit per SM
–Using shared memory increases performance

Slide 14: Global Memory Accesses
A[i, j] → actor A has i pops and j pushes
[Diagram: four threads running actor A[4,4] directly out of global memory; thread t pops elements 4t..4t+3, so at each step the warp touches words {0, 4, 8, 12}, then {1, 5, 9, 13}, and so on]
–Large access latency
–The threads do not access the words in sequence
–No coalescing
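
A hedged sketch of this uncoalesced pattern (names and the squaring "work" are hypothetical): each thread pops its own 4 consecutive items, so at every step neighboring threads read addresses 4 elements apart and the hardware cannot coalesce the warp's loads:

    __global__ void actorA_uncoalesced(const float *in, float *out) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        // Thread t reads in[4t], in[4t+1], in[4t+2], in[4t+3]:
        // at step k the warp touches in[k], in[4+k], in[8+k], ... -- stride 4, not coalesced.
        for (int k = 0; k < 4; ++k) {
            float v = in[4 * t + k];   // pop
            out[4 * t + k] = v * v;    // push (hypothetical work)
        }
    }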

Slide 15: Shared Memory
–First bring the data into shared memory with coalescing
  –Each filter also brings data for the other filters
  –Satisfies the coalescing constraints
–Once the data is in shared memory, each filter accesses its own portion
–Improves bandwidth and performance
[Diagram: the threads cooperatively copy consecutive chunks {0..3}, {4..7}, {8..11}, {12..15} from global to shared memory (coalesced), compute out of shared memory, then copy the results back the same way]
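
A minimal sketch of this staging idea, assuming a block of T threads each consuming 4 items; indices and the squaring "work" are illustrative:

    __global__ void actorA_staged(const float *in, float *out) {
        extern __shared__ float tile[];              // 4 * blockDim.x floats
        int base = 4 * blockIdx.x * blockDim.x;
        // Coalesced copy in: at each step, consecutive threads read consecutive words.
        for (int k = 0; k < 4; ++k)
            tile[k * blockDim.x + threadIdx.x] = in[base + k * blockDim.x + threadIdx.x];
        __syncthreads();
        // Each thread now works on its own 4 consecutive items out of shared memory.
        // (These stride-4 shared accesses can bank-conflict; slides 27-28 address that.)
        for (int k = 0; k < 4; ++k) {
            float v = tile[4 * threadIdx.x + k];
            tile[4 * threadIdx.x + k] = v * v;       // hypothetical work
        }
        __syncthreads();
        // Coalesced copy back out to global memory.
        for (int k = 0; k < 4; ++k)
            out[base + k * blockDim.x + threadIdx.x] = tile[k * blockDim.x + threadIdx.x];
    }

It would be launched as actorA_staged<<<blocks, T, 4 * T * sizeof(float)>>>(d_in, d_out) so the dynamic shared-memory allocation matches the tile.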

Slide 16: Using Shared Memory
–Shared memory is 100x faster than global memory
–Coalesces all global memory accesses
–The number of threads is limited by the size of the shared memory
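
A quick worked example under assumed numbers: with 16 KB of shared memory per SM (the GTX 285 generation) and an actor A[4,4] staging 4 pops plus 4 pushes of floats (32 bytes) per thread, at most 16384 / 32 = 512 threads' worth of data fits, so shared-memory capacity, not the hardware thread limit, caps the thread count.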

Slide 17: Helper Threads
–Shared memory limits the number of threads
–Underutilized processors can fetch data
–All the helper threads are in one warp (no control-flow divergence)
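
A hedged sketch of the idea (my construction, not Sponge's generated code): worker threads compute while one extra warp does nothing but stage data, so the divergence between the two roles falls exactly on a warp boundary:

    __global__ void kernel_with_helpers(const float *in, float *out) {
        extern __shared__ float tile[];              // 4 * WORKERS floats
        const int WORKERS = blockDim.x - 32;         // the last warp is the helper warp
        if (threadIdx.x >= WORKERS) {
            // Helper warp: its 32 threads cooperatively stage the block's data (coalesced).
            int lane = threadIdx.x - WORKERS;
            for (int i = lane; i < 4 * WORKERS; i += 32)
                tile[i] = in[4 * blockIdx.x * WORKERS + i];
        }
        __syncthreads();
        if (threadIdx.x < WORKERS) {
            // Worker threads: compute on the staged data (illustrative access and work).
            float v = tile[4 * threadIdx.x];
            out[blockIdx.x * WORKERS + threadIdx.x] = v * v;
        }
    }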

Slide 18: Data Prefetch
–Better register utilization
–Data for iteration i+1 is moved into registers
–Data for iteration i is moved from registers to shared memory
–Allows the GPU to overlap instructions
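
A minimal sketch of the register double-buffering pattern the slide describes (loop bounds and names hypothetical, indexing simplified to a single block for clarity):

    __global__ void prefetch_loop(const float *in, float *out, int iters) {
        extern __shared__ float tile[];                    // blockDim.x floats
        float next = in[threadIdx.x];                      // prefetch iteration 0 into a register
        for (int i = 0; i < iters; ++i) {
            float cur = next;                              // data for iteration i, already in a register
            if (i + 1 < iters)                             // prefetch i+1 into a register; this load
                next = in[(i + 1) * blockDim.x + threadIdx.x];  // overlaps with the work below
            tile[threadIdx.x] = cur;                       // register -> shared memory
            __syncthreads();
            out[i * blockDim.x + threadIdx.x] = 2.0f * tile[threadIdx.x];  // hypothetical work
            __syncthreads();                               // don't overwrite tile while still in use
        }
    }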

Slide 19: Loop Unrolling
–Similar to traditional unrolling
–Allows the GPU to overlap instructions
–Better register utilization
–Less loop-control overhead
–Can also be applied to memory-transfer loops
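
In CUDA this can be as simple as the standard #pragma unroll directive on the pop/push loops (a generic illustration, not Sponge output):

    __global__ void unrolled(const float *in, float *out) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        #pragma unroll                      // nvcc replicates the body, removing loop-control code
        for (int k = 0; k < 4; ++k)
            out[4 * t + k] = 2.0f * in[4 * t + k];
    }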

Slide 20: Methodology
–Set of benchmarks from the StreamIt suite
–3 GHz Intel Core 2 Duo CPU with 6 GB RAM
–NVIDIA GeForce GTX 285

Slide 21: Results (baseline: CPU)
[Plot: speedup over the CPU baseline across the benchmarks; the legible chart annotations read 10 and 24]

Slide 22: Results (baseline: GPU)
[Plot: improvement over the unoptimized GPU baseline; the legible chart annotations read 64%, 3%, and 16%]

Slide 23: Conclusion
–Future systems will be heterogeneous
–GPUs are an important part of such systems
–Programming complexity is a significant challenge
–Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
–Provides portability by performing an array of optimizations on stream graphs

Slide 24: Questions

Slide 25: Spatial Intermediate Representation
StreamIt main constructs:
–Filter → encapsulates computation
–Pipeline → expresses pipeline parallelism
–Splitjoin → expresses task-level parallelism
–Other constructs not relevant here
Exposes different types of parallelism
–Composable, hierarchical
–Stateful and stateless filters
[Diagram: graph shapes for filter, pipeline, and splitjoin]

Slide 26 (backup): Nonlinear Optimization Space [Ryoo, CGO '08]
[Plot: SAD optimization space, 908 configurations]

Slide 27: Bank Conflict
data = buffer[BaseAddress + s * ThreadId]
[Diagram: actor A[8,8] reading shared memory with stride s = 8; with 16 banks, threads 0, 1, 2 hit banks 0, 8, 0 at each step, so threads 0 and 2 conflict on the same bank]
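
To make the conflict concrete (assuming the 16 shared-memory banks of this GPU generation): the bank hit by thread t is (BaseAddress + s·t) mod 16, so threads t1 and t2 collide exactly when s·(t1 − t2) ≡ 0 (mod 16). For s = 8 this pairs up threads that are 2 apart, e.g. threads 0 and 2 both land on bank 0, which matches the conflict shown on the slide.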

Slide 28: Removing Bank Conflict
data = buffer[BaseAddress + s * ThreadId]
–If GCD(number of banks, s) is 1, there is no bank conflict → s must be odd
[Diagram: with stride s = 9, threads 0, 1, 2 hit banks 0, 9, 2: all distinct, no conflict]
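
A hedged sketch of the usual way to force an odd stride, padding each thread's row of the shared buffer by one word (sizes hypothetical, assuming 64 threads per block; the coalesced global-memory staging from slide 15 is omitted for brevity):

    __global__ void conflict_free(const float *in, float *out) {
        // 8 items per thread; pad the stride from 8 to 9 so that GCD(16, 9) = 1.
        __shared__ float buffer[64 * 9];                 // 64 threads * padded stride
        const int s = 9;                                  // odd stride -> no bank conflicts
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < 8; ++k)
            buffer[s * threadIdx.x + k] = in[8 * t + k];
        __syncthreads();
        float acc = 0.0f;
        for (int k = 0; k < 8; ++k)                       // strided reads now map to distinct banks
            acc += buffer[s * threadIdx.x + k];
        out[t] = acc;                                     // hypothetical work: sum of the 8 items
    }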

