Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt
Electrical and Computer Engineering, University of British Columbia
MICRO-40, Dec. 5, 2007

Slide 2: Motivation
- GPU: a massively parallel architecture
  - SIMD pipeline: the most computation out of the least silicon and energy
- Goal: apply the GPU to non-graphics computing
  - Many challenges
  - This talk: a hardware mechanism for efficient control flow

Slide 3: Programming Model
- Modern graphics pipeline (OpenGL/DirectX: vertex shader, pixel shader)
- CUDA-like programming model
  - Hides the SIMD pipeline from the programmer
  - Single-Program-Multiple-Data (SPMD)
  - Programmer expresses parallelism using threads
  - Similar to stream processing
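To make the SPMD model concrete, here is a minimal CUDA kernel sketch (my illustration, not from the slides; the names saxpy, n, a, x, and y are made up): every thread runs the same program on its own data, and the hardware, not the programmer, groups threads into SIMD warps.

// A minimal SPMD kernel: every scalar thread runs this same code; the
// grouping of threads into SIMD warps is invisible to the programmer.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique scalar-thread id
    if (i < n)                                      // guard threads past the end of the data
        y[i] = a * x[i] + y[i];
}

// Host-side launch: enough thread blocks to cover n elements, e.g.
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);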

Slide 4: Programming Model
- Warp = threads grouped into a SIMD instruction
- From the Oxford Dictionary: in the textile industry, the "warp" is "the threads stretched lengthwise in a loom to be crossed by the weft".

Slide 5: The Problem: Control Flow
- The GPU uses a SIMD pipeline to save area on control logic
  - Scalar threads are grouped into warps
- Branch divergence occurs when threads within a warp branch to different execution paths (figure: a branch splits the warp between Path A and Path B)
- 50.5% performance loss with SIMD width = 16
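A hedged sketch of the kind of code that causes branch divergence (my example, not from the slides): threads of one warp evaluate the same branch on different data, so the SIMD pipeline must run Path A and then Path B with part of the warp masked off each time.

// Threads of one warp evaluate the same branch on different data, so some
// take Path A and others Path B; the hardware serializes the two paths.
__global__ void divergent(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (data[i] & 1) {          // Path A: taken by the odd-valued elements
        out[i] = data[i] * 3;
    } else {                    // Path B: taken by the rest of the warp
        out[i] = data[i] / 2;
    }
}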

Slide 6: Dynamic Warp Formation
- Consider multiple warps reaching the same divergent branch (figure: two warps split between Path A and Path B); the opportunity is to regroup their Path A threads into one full warp
- 20.7% speedup with a 4.7% area increase

Slide 7: Outline
- Introduction
- Baseline Architecture
- Branch Divergence
- Dynamic Warp Formation and Scheduling
- Experimental Results
- Related Work
- Conclusion

Slide 8: Baseline Architecture
(Figure: several shader cores connected through an interconnection network to memory controllers backed by GDDR3 memory; a timeline shows the CPU spawning work onto the GPU, waiting for it to finish, and spawning again.)

Slide 9: SIMD Execution of Scalar Threads
- All threads run the same kernel
- Warp = threads grouped into a SIMD instruction
(Figure: scalar threads W, X, Y, and Z share a common PC and form one thread warp; warps such as 3, 7, and 8 are issued down the SIMD pipeline.)
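For illustration only (assuming the 32-wide warps of current NVIDIA hardware; the paper treats SIMD width as a machine parameter), the mapping from a scalar thread to its warp and SIMD lane is a simple divide and modulo.

// How scalar threads map onto the SIMD hardware, assuming 32-wide warps.
__device__ void warp_and_lane(int *warp_id, int *lane_id)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // flat scalar-thread id
    *warp_id = tid / 32;   // which SIMD instruction group (warp) the thread joins
    *lane_id = tid % 32;   // which lane of the SIMD pipeline it occupies
}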

Slide 10: Latency Hiding via Fine-Grained Multithreading
- Interleave warp execution to hide latencies
- Register values of all threads stay in the register file
- Needs 100 to 1000 threads; graphics has millions of pixels
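A small host-side sketch of the latency-hiding idea (my illustration, not the paper's scheduler): each cycle the core issues from some warp that is not stalled, so one warp's long memory latency is overlapped with useful work from the others.

#include <vector>

// Host-side sketch of fine-grained multithreading: issue from the next warp
// that is not stalled, skipping warps waiting on long-latency operations.
struct Warp { int pc; int stall_cycles; };

int pick_ready_warp(const std::vector<Warp> &warps, int last_issued)
{
    int n = (int)warps.size();
    for (int k = 1; k <= n; ++k) {                 // round-robin after the last issuer
        int w = (last_issued + k) % n;
        if (warps[w].stall_cycles == 0) return w;  // ready: issue its next instruction
    }
    return -1;                                     // every warp is stalled this cycle
}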

Slide 11: SPMD Execution on SIMD Hardware: The Branch Divergence Problem
(Figure: a thread warp of threads 1 to 4 with a common PC, and a control-flow graph of basic blocks A through G containing a divergent branch.)

Slide 12: Baseline: PDOM (immediate post-dominator reconvergence)
(Figure: the four-thread warp executes block A with active mask 1111, diverges at B into C with mask 1001 and D with mask 0110, reconverges at E with mask 1111, and finishes at G. A per-warp stack of entries, each holding a reconvergence PC, a next PC, and an active mask, serializes the two paths: the C and D entries both carry reconvergence PC E and are popped once their next PC reaches E. Execution order over time: A, B, C, D, E, G.)
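The following host-side sketch reconstructs the slide's stack behaviour under stated assumptions (the entry layout of reconvergence PC, next PC, and active mask comes from the slide; the function names are mine): on a divergent branch the top entry is redirected to the reconvergence point and one entry per path is pushed, and an entry is popped when its path reaches that reconvergence PC.

#include <stdint.h>
#include <vector>

// Sketch of a per-warp PDOM reconvergence stack.
// Each entry: reconvergence PC, next PC to fetch, and the active thread mask.
struct StackEntry { int reconv_pc; int next_pc; uint32_t mask; };

// On a divergent branch: the top entry is rewritten to wait at the
// reconvergence point, and one entry per taken path is pushed.
void on_divergence(std::vector<StackEntry> &stk, int reconv_pc,
                   int taken_pc, uint32_t taken_mask,
                   int fallthru_pc, uint32_t fallthru_mask)
{
    stk.back().next_pc = reconv_pc;                     // resume here after both paths finish
    stk.push_back(StackEntry{reconv_pc, fallthru_pc, fallthru_mask});
    stk.push_back(StackEntry{reconv_pc, taken_pc, taken_mask});  // executed first (top of stack)
}

// Each cycle the warp fetches stk.back().next_pc under stk.back().mask.
// When that PC reaches the entry's reconvergence PC, the entry is popped.
void on_reach_pc(std::vector<StackEntry> &stk, int pc)
{
    if (stk.size() > 1 && pc == stk.back().reconv_pc) stk.pop_back();
}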

Slide 13: Dynamic Warp Formation: Key Idea
- Idea: form new warps at a divergence point
- With enough threads branching to each path, full new warps can be created

Slide 14: Dynamic Warp Formation: Example
(Figure: two warps, x and y, both execute basic block A with full active masks (1111) and then diverge through blocks B (x: 1110, y: 0011), C (1000 / 0010), D (0110 / 0001), E (1110 / 0011), and F (0001 / 1100) before reconverging at G with full masks. The baseline timeline serializes every path within each warp, while dynamic warp formation creates a new warp from the scalar threads of both warp x and warp y that are executing at basic block D, so fewer issue slots are spent on partially full warps.)
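A simplified host-side sketch of the regrouping step (my illustration of the idea, not the actual hardware; SIMD_WIDTH is an assumed parameter): scalar threads are binned by the PC they branch to, and threads from different original warps that agree on the next basic block are packed into new, denser warps.

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

static const int SIMD_WIDTH = 32;  // illustrative warp width

// Regroup scalar threads by their next PC and pack them into new warps.
std::vector<std::vector<int> >
form_warps(const std::vector<std::pair<int, int> > &threads)  // (thread id, next PC)
{
    std::map<int, std::vector<int> > pool;          // warp pool indexed by next PC
    for (size_t i = 0; i < threads.size(); ++i)
        pool[threads[i].second].push_back(threads[i].first);

    std::vector<std::vector<int> > warps;
    for (std::map<int, std::vector<int> >::iterator it = pool.begin();
         it != pool.end(); ++it) {
        std::vector<int> &tids = it->second;        // all threads headed to this PC
        for (size_t i = 0; i < tids.size(); i += SIMD_WIDTH) {
            size_t end = std::min(i + SIMD_WIDTH, tids.size());
            warps.push_back(std::vector<int>(tids.begin() + i, tids.begin() + end));
        }
    }
    return warps;
}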

Slide 15: Dynamic Warp Formation: Hardware Implementation
(Figure: two warps carrying threads 1 2 3 4 and 5 6 7 8 execute the branch "A: BEQ R2, B". The warp-update logic accumulates the threads taking the branch into a warp forming for block B, in the example threads 5, 2, 3, and 8, and the remaining threads, 1 and 4, into a warp forming for block C, with no lane conflict: each thread keeps its home lane so its registers stay in that lane's register bank.)
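A hedged sketch of the lane-conflict rule (my simplification of the slide's hardware figure; names and the warp width are illustrative): because a thread's registers live in the register bank of its home lane, the warp-pool update logic only merges a thread into a forming warp whose slot for that lane is still free, and otherwise starts a new warp entry for the same PC.

#include <map>
#include <vector>

static const int WARP_WIDTH = 4;                 // width used in the slide's example

typedef std::vector<int> FormingWarp;            // one slot per lane, -1 = lane free

// Insert a thread into the warp pool for its target PC without a lane
// conflict: the thread may only occupy its home lane, because its register
// values live in that lane's register bank.
void insert_thread(std::map<int, std::vector<FormingWarp> > &pool,
                   int pc, int tid, int home_lane)   // home_lane in [0, WARP_WIDTH)
{
    std::vector<FormingWarp> &warps = pool[pc];  // warps currently forming for this PC
    for (size_t w = 0; w < warps.size(); ++w) {
        if (warps[w][home_lane] == -1) {         // lane free: no conflict, merge here
            warps[w][home_lane] = tid;
            return;
        }
    }
    FormingWarp fresh(WARP_WIDTH, -1);           // every candidate conflicts: start a new warp
    fresh[home_lane] = tid;
    warps.push_back(fresh);
}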

Slide 16: Methodology
- Created a new cycle-accurate simulator from SimpleScalar (version 3.0d)
- Selected benchmarks from SPEC CPU2006, SPLASH-2, and the CUDA demos
  - Manually parallelized
  - Similar programming model to CUDA

Slide 17: Experimental Results
(Figure: IPC of the baseline PDOM mechanism, dynamic warp formation, and an ideal MIMD machine across the benchmarks hmmer, lbm, Black, Bitonic, FFT, LU, and Matrix, plus their harmonic mean (HM).)

Slide 18: Dynamic Warp Scheduling
(Figure: comparison of dynamic warp scheduling policies; lane conflicts are ignored here, which accounts for roughly a 5% difference.)

Slide 19: Area Estimation
- Estimated with CACTI 4.2 (90 nm process)
- Size of the scheduler: 2.471 mm^2 per shader core
- Total over the 8 shader cores: 22.39 mm^2, about 4.7% of a GeForce 8800 GTX (~480 mm^2)

Slide 20: Related Work
- Predication: converts control dependence into data dependence
- Lorie and Strong: JOIN and ELSE instructions at the beginning of divergence
- Cervini: an abstract/software proposal for "regrouping" on an SMT processor
- Liquid SIMD (Clark et al.): forms SIMD instructions from scalar instructions
- Conditional Routing (Kapasi): transforms code into multiple kernels to eliminate branches

Slide 21: Conclusion
- Branch divergence can significantly degrade a GPU's performance: 50.5% performance loss with SIMD width = 16
- Dynamic warp formation and scheduling: 20.7% better than reconvergence on average, at a 4.7% area cost
- Future work: warp scheduling and its area/performance trade-off

Slide 22: Thank You. Questions?

Slide 23: Shared Memory
- Banked local memory accessible by all threads within a shader core (a block)
- Idea: break each load/store into two micro-code operations: address calculation, then memory access
- After address calculation, use a bit vector to track bank accesses, just like lane conflicts in the scheduler
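A host-side sketch of the bank-tracking idea (my illustration; the bank count and word size are assumptions, and real hardware may broadcast when lanes read the same word): after the address-calculation micro-op, the banks touched by the warp's lanes are recorded in a bit vector, and a lane that needs an already-claimed bank signals a conflict to be serialized, analogous to lane-conflict tracking in the scheduler.

#include <stdint.h>
#include <vector>

// Fill a bit vector of shared-memory banks touched by the warp's lane
// addresses and report whether two lanes claim the same bank this cycle.
bool has_bank_conflict(const std::vector<uint32_t> &lane_addrs, uint32_t *bank_vec)
{
    const int NUM_BANKS = 16;                             // assumption: 16 banks, 4-byte words
    uint32_t banks = 0;
    bool conflict = false;
    for (size_t i = 0; i < lane_addrs.size(); ++i) {
        uint32_t bank = (lane_addrs[i] / 4) % NUM_BANKS;  // word address mod bank count
        if (banks & (1u << bank)) conflict = true;        // bank already claimed: serialize
        banks |= (1u << bank);                            // mark the bank this lane needs
    }
    *bank_vec = banks;
    return conflict;
}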