Reducing Branch Divergence in GPU Programs — T. D. Han and T. Abdelrahman, GPGPU 2011
©Sudhakar Yalamanchili unless otherwise noted


(1) ©Sudhakar Yalamanchili unless otherwise noted Reducing Branch Divergence in GPU Programs T. D. Han and T. Abdelrahman GPGPU 2011

(2) Reading T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011

(3) Goal
Improve the utilization of the SIMD core
Temporal compaction of control flow paths
- Across successive loop iterations
- Re-align iterations across threads in a warp to increase the number of threads taking the same control flow path

(4) Basic Idea
Delay iterations in a thread so that the majority of threads in a warp take the same control flow path
Improve utilization and reduce dynamic instruction count
Local decision on when to delay an iteration?
Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011
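The delaying idea can be sketched as a small simulation (our own illustration, not code from the paper): without delaying, every divergent iteration pays for both control flow paths; with delaying, a thread whose branch outcome disagrees with the warp majority holds its iteration back until it can run with a majority.

```python
# Toy cost model (our assumption): executing one warp step costs one unit
# per distinct control flow path taken by the active threads.

def simd_cost(outcomes_per_iteration):
    """Lock-step cost: each iteration pays for every distinct path taken."""
    return sum(len(set(step)) for step in outcomes_per_iteration)

def delayed_cost(outcomes_per_thread):
    """Iteration-delaying cost: at each warp step only the majority outcome
    executes; minority threads keep their iteration queued for later."""
    queues = [list(q) for q in outcomes_per_thread]
    cost = 0
    while any(queues):
        heads = [q[0] for q in queues if q]
        majority = max(set(heads), key=heads.count)
        cost += 1                      # one path executes per step
        for q in queues:
            if q and q[0] == majority:
                q.pop(0)               # this thread's iteration completes
    return cost

# Branch outcomes for a 4-thread warp over 3 iterations (True = taken).
outcomes = [
    [True,  True,  True,  False],
    [False, False, False, True],
    [True,  True,  True,  False],
]
```

With these outcomes every iteration diverges, so the lock-step schedule pays for 6 path executions, while the delayed schedule needs only 4 warp steps.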

(5) Basic Idea (2): 2 Warps
[Figure: two warps over successive iteration numbers; legend — taken / not taken for if(…) {…} else {…}, idle thread; iterations delayed in the minority threads]

(6) Basic Idea (3) Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011

(7) Operation
Collect branch conditions in a 32-bit register
Count the number of 1’s in the 32-bit pattern
Compute this iteration?
Iteration count adjusted on a per-thread basis
Thread-specific local variable
Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011
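A minimal sketch of the decision logic above, assuming a 32-thread warp (the helper names are ours, not the paper's generated code):

```python
def pack_conditions(conds):
    """Collect one branch condition per thread into a 32-bit register image."""
    mask = 0
    for lane, cond in enumerate(conds):
        if cond:
            mask |= 1 << lane
    return mask

def majority_taken(mask, warp_size=32):
    """Count the 1's in the pattern; the warp follows the majority outcome
    (ties go to the taken path here -- an arbitrary choice)."""
    return bin(mask).count("1") * 2 >= warp_size

def compute_this_iteration(my_cond, mask):
    """A thread computes the current iteration only when its own condition
    agrees with the warp majority; otherwise it delays the iteration and
    adjusts its private, thread-specific iteration counter."""
    return my_cond == majority_taken(mask)
```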

(8) Example
Loops must be barrier-free
Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011

(9) Optimization
Starvation of threads that take the less-traveled path
- Periodically reverse the decision → complement the branch choice
Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011
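The periodic reversal can be grafted onto a majority vote like this (a sketch; the reversal period k is an assumed tuning knob, not a value from the paper):

```python
def choose_outcome(mask, step, k=8, warp_size=32):
    """Normally follow the warp majority, but every k-th decision follow
    the minority so threads on the less-traveled path are not starved."""
    majority = bin(mask).count("1") * 2 >= warp_size
    if step % k == k - 1:        # periodically complement the decision
        return not majority
    return majority
```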

(10) Properties
Applicable to other forms of control flow in a loop, e.g., switch
Performance depends on
- Code size in the branch paths relative to the fixed instruction overhead of delayed branching
- Code size outside of the branch paths
- Branching pattern in the thread → a high degree of branching can be effectively exploited
Impact on memory coalescing
- No worse than without delaying iterations?

(11) Branch Distribution
Code transformation to reduce the impact of branch divergence
Aggressively hoist common code structure, going beyond classic common subexpression elimination
Produce smaller divergent code blocks
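A toy before/after view of branch distribution (the expressions are invented for illustration): the common multiply is hoisted out of the divergent region, leaving a minimal divergent body at the cost of an extra temporary, i.e., added register pressure.

```python
def original(cond, a, b):
    # Both paths repeat the same multiply; only the constant differs,
    # so the whole expression sits inside the divergent region.
    if cond:
        return a * b + 1.0
    else:
        return a * b - 1.0

def distributed(cond, a, b):
    t = a * b                    # common code, executed by all threads
    k = 1.0 if cond else -1.0    # divergent body shrinks to this pick
    return t + k                 # extra temporaries = added register pressure
```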

(12) Example
- Smaller divergent code body
- Overhead: greater register pressure
Figure from T. D. Han and T. Abdelrahman, “Reducing Branch Divergence in GPGPU Programs,” GPGPU 2011

(13) Summary
Up to 20% performance improvement reported when both techniques are used together
Does not require any special-purpose hardware support
Compatible with other compiler optimizations

(14) Dynamic Warp Subdivision
Meng, Tarjan, and Skadron, ISCA 2010
©Sudhakar Yalamanchili unless otherwise noted

(15) Goals
Understand the interactions between divergence behaviors and memory and control flow behaviors
Integrated handling of branch and memory divergence behaviors
Application of warp splitting and re-combining (reconvergence) to counteract the disadvantages of lock-step, synchronous execution of thread bundles

(16) Reading J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(17) Memory Divergence Behavior
Wider warps → more likely to exhibit memory divergence
- Fewer total warps, but a greater need for warps for latency hiding
- A single thread’s miss can hold up an entire warp
Effective throughput depends on the footprint in the L1
- Think of the number of warps suspended on cache misses
- Cache contention
Need for intra-warp latency tolerance
Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010
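The single-miss effect can be captured by a toy latency model (our assumption, with illustrative cycle counts): under lock-step execution a warp's memory access completes only when its slowest lane does.

```python
def warp_access_latency(lane_latencies):
    """One missing lane holds up the entire warp: max over all lanes."""
    return max(lane_latencies)

HIT, MISS = 1, 100               # illustrative cycle counts, not measured
lanes = [HIT] * 31 + [MISS]      # 31 hits and a single miss
```

Even with 31 of 32 lanes hitting, the warp waits the full miss latency, which is why intra-warp latency tolerance matters.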

(18) Memory Divergence and Control Flow
PDOM reconvergence permits only one branch path to execute at a time → limiting latency hiding?
What happens on cache misses for some threads?
- Misses on code block C but not D?
Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(19) Memory Divergence
How do we hide latency between threads in a warp?
The “last thread” effect due to implicit lock-step progress
Dynamic Warp Subdivision (DWS) as a solution
[Figure annotation: cache miss]

(20) Dynamic Warp Subdivision
Dynamically split warps into smaller, independently schedulable units
- Overlap miss behavior as much as possible without adding new threads
- Allow the active threads to run ahead
When to split and when to reconverge?
How do we orchestrate (schedule) better memory behavior?
[Figure annotation: overlap memory behavior]
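A minimal sketch of the split step (the mask layout and names are ours): on a divergent memory access the warp divides into a run-ahead split of lanes that hit and a suspended split of lanes that missed, each independently schedulable.

```python
WARP_MASK = 0xFFFFFFFF           # 32 lanes

def split_on_miss(active_mask, miss_mask):
    """Divide an active mask into (run_ahead, suspended) warp splits."""
    suspended = active_mask & miss_mask               # lanes waiting on memory
    run_ahead = active_mask & ~miss_mask & WARP_MASK  # lanes free to run ahead
    return run_ahead, suspended
```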

(21) MLP for Divergent Branches Overlap memory access in different branch paths Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(22) Run-Ahead Execution Early issue of memory instructions Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(23) Intra-Warp Latency Hiding Threads in the same warp can progress overlapping memory access latency Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(24) Prefetching Behavior of Warp Splits Pre-fetch effect for lagging threads Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(25) Reconvergence Behaviors
Over-subdivision → when?
- Over-splitting degrades the benefits of warps
- Judicious splits → e.g., when throughput has dropped due to a large number of suspended warps
Reconvergence → when?
- Can still take place at the TOS of the reconvergence stack
- Keep track of warp splits and use PC-based reconvergence
  - Check every cycle?

(26) Reconvergence Re-convergence using PCs Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010

(27) PC vs. Stack-Based Reconvergence
PC-based reconvergence can reconverge across loop iterations
Stack-based reconvergence occurs only at the PDOM
Figure from J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance,” ISCA 2010
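PC-based reconvergence can be sketched as follows (a simplification of the hardware: each warp split is a (pc, mask) pair, and splits merge whenever their PCs match, e.g., checked each scheduling cycle). Nothing ties the merge point to the postdominator, which is why splits can also re-join across loop iterations.

```python
def try_reconverge(splits):
    """Merge warp splits whose PCs are equal; returns the new split list."""
    merged = {}
    for pc, mask in splits:
        merged[pc] = merged.get(pc, 0) | mask    # union the thread masks
    return sorted(merged.items())
```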

(28) Summary
Exploit memory-level parallelism within a warp
- Optimization of memory behavior drives the management of threads/warps rather than computation
Balancing warp size, warp count, and memory latency
Integrated solution leads to up to 1.7X speedup
Low area overhead

(29) The Dual Path Execution Model for Efficient GPU Control Flow
M. Rhu and M. Erez, HPCA 2013
©Sudhakar Yalamanchili unless otherwise noted

(30) Goal
Reconvergence-based techniques permit only a single divergent path to be active at any point in time
- Elegant, but serializes execution
Support parallel execution of divergent paths using stack-based reconvergence
- Combine the principles of dynamic warp subdivision and stack-based reconvergence

(31) Baseline Execution Profile Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013 Only one active path at a time for a warp Serialization of path execution

(32) PDOM Execution Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(33) Dual Path Execution
Store the state of both paths
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013
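One way to picture the stored state (field names are ours, not the paper's structures): each reconvergence stack entry keeps (PC, mask) state for both the taken and the not-taken path, so the scheduler can interleave the two instead of running one at a time.

```python
class DualPathEntry:
    """A dual-path reconvergence stack entry: state for BOTH paths at once."""
    def __init__(self, rpc, taken_pc, taken_mask, nottaken_pc, nottaken_mask):
        self.rpc = rpc                            # reconvergence PC
        self.paths = [
            {"pc": taken_pc, "mask": taken_mask},
            {"pc": nottaken_pc, "mask": nottaken_mask},
        ]

    def schedulable_paths(self):
        """Every divergent path with live threads can issue instructions."""
        return [p for p in self.paths if p["mask"] != 0]
```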

(34) Concurrency Behavior
Note: only two active paths at a time within a single stack (one nested control flow segment of code)
Top-level divergent sets of threads are not interleaved
[Figure annotations: interleaved execution; suspended execution pending completion of nested sub-warps]
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(35) Handling Dependencies in a Warp
What about data dependencies?
- Dual instruction issue from a warp
Need a per-warp scoreboard

(36) Dual Path Scoreboard
Replicate the scoreboard for each path
How do we distinguish cross-path vs. in-path dependencies?
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(37) Dependency Management
Case I: incorrect execution → deadlock
Case II: pending writes before reconvergence
Case III: false dependencies
Case IV: distinct registers
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(38) Dependency Management (2)
On divergence, shadow bits are set
Shadow bits not set → these false dependencies avoided
The write (B) is actually unrelated to the reads (D & E)
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(39) Updating the Scoreboard
Copy P → S at divergence → indicates a cross-path dependency due to a pending write before divergence
When checking the scoreboard during issue, also check the shadow bits in the other path’s scoreboard
- E.g., make sure the Path A write has completed before issuing Path C
Writes need a single bit to indicate which scoreboard to clear
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013
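The P/S mechanism can be modeled with sets standing in for bit vectors (a simplified sketch of our reading of the scheme, not the actual hardware): pending-write bits P are copied into shadow bits S at divergence, and issue checks both the in-path P bits and the shadow bits.

```python
class DualScoreboard:
    """Per-path pending-write bits (P) plus shadow bits (S) for cross-path
    dependences created by writes still pending at the point of divergence."""
    def __init__(self):
        self.P = [set(), set()]   # registers with pending writes, per path
        self.S = [set(), set()]   # shadowed (pre-divergence) pending writes

    def diverge(self, pending_before_split):
        # Copy P -> S at divergence: writes issued before the split
        # constrain BOTH child paths.
        for s in self.S:
            s.update(pending_before_split)

    def can_issue(self, path, regs):
        """Issue only if no in-path pending write and no shadowed write
        covers any source or destination register."""
        return not (set(regs) & (self.P[path] | self.S[path]))

    def write_complete(self, reg):
        # A completing pre-divergence write clears its shadow bits.
        for s in self.S:
            s.discard(reg)
```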

(40) Warp Scheduler Impact
[Figure: SM pipeline — I-Fetch, Decode, I-Buffer, Issue, RF/PRF, scalar pipelines, D-Cache (all hit?), Writeback, pending warps]
Typically multiple warp schedulers per SM
Dual path execution can double the number of schedulable entries
Expect it to be more sensitive to the warp scheduling policy

(41) Opportunity
Non-interleavable vs. interleavable branches
How do we assess opportunity?
- Number of dynamic instructions in a warp
- Number of paths at instruction i
Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(42) Application Profiles Figure from M. Rhu and M. Erez, “The Dual Path Execution Model for Efficient GPU Control Flow,” HPCA 2013

(43) Some Points to Note
Achieves a reduction in idle cycles → expected
Impact on L1 cache misses → can increase
Use for handling memory divergence
Path forwarding to enable concurrency across stack levels (not just at the TOS) → not clear this is worth the cost

(44) Summary
Looking for structured concurrency → living with control divergence handling solutions
Low-cost overhead for applications that have a high percentage of interleavable divergent branches
Extends the performance of PDOM