University of Michigan, Electrical Engineering and Computer Science
Adaptive Input-aware Compilation for Graphics Engines
Mehrzad Samadi (1), Amir Hormati (2), Mojtaba Mehrara (3), Janghaeng Lee (1) and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor, (2) Microsoft Research, (3) NVIDIA Research

GPU Performance Gap
High performance at low cost
Peak performance is difficult to achieve
[Figure: performance of GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 590 and GeForce GTX 680, with a series labeled "In Practice"]

Transposed Matrix-Vector Multiplication (TMV) Performance on Various Inputs
[Figure: two panels, "Square Matrix" and "Rectangular Matrix"]

GPU Execution Model
[Figure: Grid 1 mapped onto streaming multiprocessors SM 0, SM 1, SM 2, SM 3, ..., SM 7. Each SM has shared memory, registers and eight cores numbered 0 to 7; each core executes a thread]

Transposed Matrix-Vector Multiplication (4 x 1M)
[Figure: SM 0 to SM 7, each with cores 0 to 7, registers and shared memory. Block 0 (Thread 0 ~ 15), Block 1, Block 2 and Block 3 are the only blocks shown; the figure marks SMs as IDLE]

Transposed Matrix-Vector Multiplication (1M x 4)
[Figure: SM 0 to SM 7, each with cores 0 to 7, registers and shared memory. Blocks 0 ~ 7, Blocks 8 ~ 15, and so on up to Block 1,000,000 are spread across the SMs: 125,000 blocks / SM]
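
The two TMV slides contrast 4 blocks with 1,000,000 blocks on an 8-SM GPU. The arithmetic can be sketched in Python as below. This is a simplified model, not the hardware scheduler: the round-robin assignment and the function names `blocks_per_sm` and `idle_sms` are assumptions made for illustration.

```python
def blocks_per_sm(num_blocks, num_sms=8):
    """Spread thread blocks over SMs round-robin (assumed, simplified policy)."""
    counts = [num_blocks // num_sms] * num_sms
    # Hand out the remainder one block at a time, starting from SM 0.
    for sm in range(num_blocks % num_sms):
        counts[sm] += 1
    return counts


def idle_sms(num_blocks, num_sms=8):
    """Number of SMs that receive no block at all."""
    return sum(1 for c in blocks_per_sm(num_blocks, num_sms) if c == 0)


few_blocks = blocks_per_sm(4)            # the 4 x 1M case on the slide
many_blocks = blocks_per_sm(1_000_000)   # the 1M x 4 case on the slide
```

With 4 blocks, half of the 8 SMs receive nothing; with 1,000,000 blocks, every SM has to work through 125,000 blocks, matching the figure on the slide.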

GPU Programming Challenge - Portability
Goal: fastest matrix-vector multiplication for any GPU and for any input size

GPU architecture              Input matrix size   Source code
GTX 285 (2008, 240 cores)     4 x 1M              GTX285_MV_4_1M.cu
                              128 x 32K           GTX285_MV_128_32K.cu
                              32K x 128           GTX285_MV_32K_128.cu
                              1M x 4              GTX285_MV_1M_4.cu
GTX 580 (2011, 512 cores)     4 x 1M              GTX580_MV_4_1M.cu
                              128 x 32K           GTX580_MV_128_32K.cu
                              32K x 128           GTX580_MV_32K_128.cu
                              1M x 4              GTX580_MV_1M_4.cu
GTX 680 (2012, 1536 cores)    4 x 1M              GTX680_MV_4_1M.cu
                              128 x 32K           GTX680_MV_128_32K.cu
                              32K x 128           GTX680_MV_32K_128.cu
                              1M x 4              GTX680_MV_1M_4.cu

Adaptic
Adaptive Input-aware Compilation for GPUs
–Device-portable
–Input-portable
–Programmers can focus on algorithms without worrying about low-level details
Streaming Language
–Higher level of abstraction
–Separates memory access from the algorithm
–e.g., StreamIt

StreamIt
Higher level of abstraction
Decouples computation from memory accesses
Coarse-grained exposed parallelism and exposed communication
Streaming actors use buffers to communicate
Much recent work on extending the portability of streaming applications

Compilation Flow in Adaptic
Inputs: StreamIt Code, Target GPU, Input Range
Offline Compilation
–Performance Model
–Input-unaware Optimization
–Input-aware Optimization
Memory Access Optimization
–Why? Global memory accesses have large access latency
–Optimizations: Memory Restructuring, Coalesced Access, Neighboring Access, Data Reuse
Actor Segmentation
–Splits actors
–More blocks will be generated
–Alleviates resource under-utilization
–Optimizations: Stream Reduction, Intra-actor Parallelization
Actor Integration
–Integrates actors
–Merges several actors into one
–Alleviates high resource contention
–Optimizations: Vertical Integration, Horizontal Integration
Executable
–Several CUDA kernels for various input ranges (Kernel 0, Kernel 1, Kernel 2, Kernel 3), covering Smallest Input, Small Input, Large Input and Largest Input
–The input size decides which kernel to launch
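
The last step of the flow, picking one of several pre-compiled kernels from the input size, can be sketched in Python as below. The range boundaries and the names `RANGE_STARTS`, `KERNELS` and `select_kernel` are hypothetical; the slides do not give the actual thresholds or how Adaptic stores them.

```python
from bisect import bisect_right

# Hypothetical boundaries: the first input size handled by kernels 1, 2 and 3.
RANGE_STARTS = [1_000, 100_000, 10_000_000]
# Stand-ins for the four CUDA kernels produced by offline compilation.
KERNELS = ["kernel_0", "kernel_1", "kernel_2", "kernel_3"]


def select_kernel(input_size):
    """Return the kernel whose input range contains input_size."""
    if input_size < 0:
        raise ValueError("input size must be non-negative")
    # bisect_right counts how many range starts are <= input_size,
    # which is exactly the index of the matching kernel.
    return KERNELS[bisect_right(RANGE_STARTS, input_size)]
```

Under these assumed boundaries, `select_kernel(500)` picks `kernel_0` and `select_kernel(10_000_000)` picks `kernel_3`. The lookup costs a binary search per launch, so the run-time overhead stays small compared with the kernel itself.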

Memory Optimization
Global memory: large access latency
Threads do not access the words in sequence
No coalescing
A[i, j]: Actor A has i pops and j pushes
[Figure: Thread 0 to Thread 3 run actor A[4,4] over global memory elements 0 to 15. The elements are shown in sequence (0 to 15) and grouped with stride 4 (0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15)]

Memory Optimization
[Figure: next build of the same A[4,4] example. Thread 0 to Thread 3 access global memory elements 0 to 15, shown in sequence and grouped with stride 4 (0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15), on both the input and the output side of the actor]
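
The A[4,4] example can be worked through in Python as below. This is a sketch under the assumption that memory restructuring amounts to transposing the layout so that the k-th pop of every thread sits in adjacent words. The helper names `restructure`, `addresses_at_step` and `is_coalesced` are made up for illustration and are not Adaptic's API.

```python
def restructure(data, num_threads, pops):
    """Reorder data so the k-th pop of thread t lands at index k * num_threads + t."""
    if len(data) != num_threads * pops:
        raise ValueError("data must hold exactly num_threads * pops elements")
    out = [None] * len(data)
    for t in range(num_threads):
        for k in range(pops):
            out[k * num_threads + t] = data[t * pops + k]
    return out


def addresses_at_step(k, num_threads, pops, restructured):
    """Addresses touched by all threads when each performs its k-th pop."""
    if restructured:
        return [k * num_threads + t for t in range(num_threads)]
    return [t * pops + k for t in range(num_threads)]


def is_coalesced(addresses):
    """True when neighboring threads touch neighboring words."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))
```

In the original layout the four threads touch addresses 0, 4, 8, 12 on their first pop, which is the strided, non-coalesced pattern from the slide. After restructuring they touch 0, 1, 2, 3, and the reordered buffer holds the elements in the order 0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15.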

Actor Segmentation
4 x 1M Transposed Matrix-Vector Multiplication
[Figure: the original Block 0, Block 1, Block 2 and Block 3 are segmented into more blocks: Block 0 ~ Block 31, with further groups starting at Block 32, Block 64 and Block 96]
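
One way to picture the segmentation is to split each long dot product into chunks that separate blocks could compute, then add up the partial sums. The Python sketch below models that idea on the CPU. Treating the 4 x 1M case as four long dot products and combining segments by partial sums are assumptions; the names `tmv_segmented` and `block_count` are hypothetical.

```python
def tmv_segmented(matrix, vector, segments):
    """Compute one dot product per row, split into up to `segments` partial sums."""
    results = []
    for row in matrix:
        n = len(row)
        # Ceiling division; at least 1 so the range step below is valid.
        chunk = max(1, (n + segments - 1) // segments)
        partials = [
            sum(row[i] * vector[i] for i in range(start, min(start + chunk, n)))
            for start in range(0, n, chunk)
        ]
        # Final reduction over the partial sums of this row.
        results.append(sum(partials))
    return results


def block_count(rows, segments):
    """Blocks launched when every row is split into `segments` pieces."""
    return rows * segments
```

With 4 rows and 32 segments per row, `block_count(4, 32)` gives 128 blocks, in line with the block numbering on the slide, which is enough to occupy every SM instead of only four.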

Actor Integration
Merges several actors or threads to balance threads' workloads
Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory
Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions
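
The effect of vertical integration on intermediate traffic can be sketched with a small CPU model in Python. A Python list stands in for an off-chip buffer and a local variable stands in for shared memory; both function names and the word-count bookkeeping are assumptions for illustration, not the compiler's implementation.

```python
def run_separately(data, actor_a, actor_b):
    """Run two actors back to back through an intermediate buffer."""
    # The intermediate list models a buffer written to off-chip memory.
    intermediate = [actor_a(x) for x in data]
    output = [actor_b(y) for y in intermediate]
    return output, len(intermediate)  # intermediate words written off-chip


def run_vertically_integrated(data, actor_a, actor_b):
    """Run the merged actor: the intermediate value never leaves local storage."""
    output = []
    for x in data:
        y = actor_a(x)             # stays local, like a value kept in shared memory
        output.append(actor_b(y))
    return output, 0               # no intermediate buffer is written
```

Both versions produce the same output; the merged one simply skips the intermediate buffer, which is the traffic the slide says vertical integration removes.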

Experimental Setup
CPU: Intel Xeon X5650
GPU
–NVIDIA Tesla C2050, 3GB GDDR5
–NVIDIA GTX 285, 2GB GDDR2
Benchmarks
–CUBLAS Library 3.2
–NVIDIA SDK 3.1

Results (Matrix-Vector Multiplication)

Results (Speedup)
[Figure: speedup; axis label "Input Size"]

Results (BiCGSTAB)
[Figure: BiCGSTAB results, including a series labeled "Input unaware"]

Summary
GPU performance is affected by
–GPU model / input
CUDA / OpenCL programming model
–Lacks architecture and input portability
Scientific applications use irregular input
–Hard to get optimized performance
Proposed Adaptic
–Architecture- and input-portable with a streaming language
–Showed speedup over CUBLAS / SDK across various input ranges

Q & A
