A Variable Warp-Size Architecture. Timothy G. Rogers, Daniel R. Johnson, Mike O’Connor, Stephen W. Keckler.

Presentation transcript:

Timothy G. Rogers, Daniel R. Johnson, Mike O’Connor, Stephen W. Keckler. A Variable Warp-Size Architecture

Contemporary GPU  Massively multithreaded  10,000s of threads concurrently executing on 10s of Streaming Multiprocessors (SMs)

Contemporary Streaming Multiprocessor (SM)  1,000s of schedulable threads  Amortize front-end and memory overheads by grouping threads into warps  The size of the warp is fixed by the architecture

Contemporary GPU Software  Regular, structured computation  Predictable memory access and control flow patterns  Can take advantage of hardware amortization for increased performance and energy efficiency. Execute efficiently on a GPU today: Graphics Shaders, Matrix Multiply, …

Forward-Looking GPU Software  Still massively parallel  Less structured  Memory access and control flow patterns are less predictable. Execute efficiently on a GPU today: Graphics Shaders, Matrix Multiply, … Less efficient on today’s GPU: Raytracing, Molecular Dynamics, Object Classification, …

Divergence: Source of Inefficiency  Regular hardware that amortizes front-end and memory overhead  Irregular software with many different control flow paths and less predictable memory accesses. Branch divergence (… if (…) { … } …) can cut function unit utilization to 1/32. Memory divergence: a single load (Load R1, 0(R2)) from a 32-wide warp may wait for up to 32 cache lines from main memory.
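The branch-divergence penalty above can be sketched numerically. This is an illustrative toy model, not the talk's simulator: with a fixed-width warp, each distinct control-flow path issues on its own cycle with the other lanes masked off, so utilization is the average fraction of active lanes per issue.

```python
from collections import Counter

def utilization(warp_width, thread_paths):
    """Average function-unit utilization for one warp.

    thread_paths: one control-flow path ID per thread. Each unique path
    executes serially; only that path's lanes are active when it issues.
    """
    assert len(thread_paths) == warp_width
    counts = Counter(thread_paths)
    return sum(c / warp_width for c in counts.values()) / len(counts)

# Fully convergent warp: every lane takes the same path.
print(utilization(32, [0] * 32))         # 1.0
# Worst case: all 32 threads take unique paths -> 1/32 per issue.
print(utilization(32, list(range(32))))  # 0.03125
```

With a 4-wide warp the same worst case only drops utilization to 1/4, which is the intuition behind smaller warps for divergent code.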

Irregular “Divergent” Applications: perform better with a smaller warp size. Branch divergence: more threads can proceed concurrently, increasing function unit utilization. Memory divergence: each load (Load R1, 0(R2)) waits on fewer accesses to main memory.

Negative effects of smaller warp size  Less front-end amortization  Increase in fetch/decode energy  Negative performance effects  Scheduling skew increases pressure on the memory system

Regular “Convergent” Applications: perform better with a wider warp. With a 32-wide warp, GPU memory coalescing means one memory system request can service all 32 threads of a load (Load R1, 0(R2)). Smaller warps mean less coalescing: 8 redundant memory accesses to main memory that no longer occur together.
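The coalescing effect can be sketched by counting distinct cache lines touched per warp. A minimal sketch, assuming a 128-byte line size; the address pattern and helper are hypothetical, chosen to reproduce the slide's 1-vs-8 request example.

```python
LINE_BYTES = 128  # assumed cache-line size

def transactions(addresses, warp_size):
    """Line-granularity memory requests when `addresses` (one per thread)
    are issued by warps of `warp_size` threads: each warp issues one
    request per distinct line its lanes touch."""
    total = 0
    for i in range(0, len(addresses), warp_size):
        warp = addresses[i:i + warp_size]
        total += len({a // LINE_BYTES for a in warp})
    return total

# 32 threads loading consecutive 4-byte words: one 128B line in total.
unit_stride = [4 * t for t in range(32)]
print(transactions(unit_stride, 32))  # 1: one request services all 32 threads
print(transactions(unit_stride, 4))   # 8: eight warps each re-request the same line
```

The eight 4-wide warps touch the same line but issue separately, which is exactly the redundancy the slide describes.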

Performance vs. Warp Size  165 applications, grouped into convergent applications, warp-size insensitive applications, and divergent applications

Goals  Convergent applications: maintain wide-warp performance and front-end efficiency  Warp-size insensitive applications: maintain front-end efficiency  Divergent applications: gain small-warp performance. Approach: set the warp size based on the executing application

Sliced Datapath + Ganged Scheduling  Split the SM datapath into narrow slices that can execute independently but share an L1 I-cache and memory unit  Extensively studied 4-thread slices  Gang slice execution to regain the efficiencies of a wider warp

Initial operation  Slices begin execution in ganged mode: the ganging unit drives the slices, and instructions are fetched and decoded once  This mirrors the baseline 32-wide warp system  Question: when should the gang be broken?

Breaking Gangs on Control Flow Divergence  The ganging unit observes different PCs from each slice  PCs common to more than one slice form a new gang  Slices that follow a unique PC are transferred to independent control, no longer managed by the ganging unit
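The gang-break policy above can be sketched in a few lines. The data structures are hypothetical (the talk describes hardware, not software): the ganging unit groups slices by their next PC, keeps shared-PC groups ganged, and releases unique-PC slices to independent control.

```python
def break_gang(slice_pcs):
    """slice_pcs maps slice id -> next PC after a divergent branch.

    Returns (gangs, independent): PCs common to more than one slice form
    new gangs; slices following a unique PC run independently."""
    by_pc = {}
    for sid, pc in slice_pcs.items():
        by_pc.setdefault(pc, []).append(sid)
    gangs = {pc: sids for pc, sids in by_pc.items() if len(sids) > 1}
    independent = [sids[0] for sids in by_pc.values() if len(sids) == 1]
    return gangs, independent

# Eight 4-thread slices reach a branch; slices 2 and 7 take unique paths.
pcs = {0: 0x100, 1: 0x100, 2: 0x180, 3: 0x100,
       4: 0x100, 5: 0x100, 6: 0x100, 7: 0x1C0}
gangs, independent = break_gang(pcs)
print(gangs)        # {256: [0, 1, 3, 4, 5, 6]}  (0x100 gang)
print(independent)  # [2, 7]
```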

Breaking Gangs on Memory Divergence  The latency of accesses from each slice can differ, e.g. one slice hits in the L1 while another goes to memory  We evaluated several heuristics for breaking the gang when this occurs

Gang Reformation  Performed opportunistically  The ganging unit checks for gangs or independent slices at the same PC and forms them into a gang  More details in the paper
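The opportunistic check above can be sketched as the mirror image of the break policy. Again a hypothetical sketch of the idea, not the paper's mechanism: independent slices that have arrived at the same PC are merged back into a gang.

```python
def reform(independent_pcs):
    """independent_pcs maps independent slice id -> current PC.

    Returns (new_gangs, still_independent): slices sharing a PC are
    re-ganged; the rest stay under independent control."""
    by_pc = {}
    for sid, pc in independent_pcs.items():
        by_pc.setdefault(pc, []).append(sid)
    new_gangs = [sids for sids in by_pc.values() if len(sids) > 1]
    still_independent = [sids[0] for sids in by_pc.values() if len(sids) == 1]
    return new_gangs, still_independent

# Slices 2 and 7 diverged earlier but have reconverged at PC 0x200.
new_gangs, loners = reform({2: 0x200, 7: 0x200, 5: 0x300})
print(new_gangs)  # [[2, 7]]
print(loners)     # [5]
```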

Methodology  In-house, cycle-level streaming multiprocessor model  1 in-order core  64KB L1 data cache  128KB L2 data cache (one SM’s worth)  48KB shared memory  Texture memory unit  Limited-bandwidth memory system  Greedy-Then-Oldest (GTO) issue scheduler

Configurations  Warp Size 32 (WS 32)  Warp Size 4 (WS 4)  Inelastic Variable Warp Sizing (I-VWS): gangs break on control flow divergence and are not reformed  Elastic Variable Warp Sizing (E-VWS): like I-VWS, except gangs are opportunistically reformed  We studied 5 applications from each category in detail; the paper explores many more configurations

Divergent Application Performance  Configurations compared: Warp Size 4; I-VWS (break on control flow only); E-VWS (break + reform)

Divergent Application Fetch Overhead  Used as a proxy for energy consumption  Configurations compared: Warp Size 4; I-VWS (break on control flow only); E-VWS (break + reform)

Convergent Application Performance  Configurations compared: Warp Size 4; I-VWS (break on control flow only); E-VWS (break + reform)  Warp-size insensitive applications are unaffected

Convergent/Insensitive Application Fetch Overhead  Configurations compared: Warp Size 4; I-VWS (break on control flow only); E-VWS (break + reform)

165 Application Performance  Grouped into convergent, warp-size insensitive, and divergent applications

Related Work  Compaction/Formation: improves function unit utilization but decreases thread-level parallelism  Subdivision/Multipath: increases thread-level parallelism but decreases function unit utilization  Variable Warp Sizing: improves function unit utilization and increases thread-level parallelism  Area cost estimate for VWS: 5% for 4-wide slices, 2.5% for 8-wide slices

Conclusion  Explored the design space surrounding warp size and performance  Varying the size of the warp to meet the demands of the workload yields a 35% performance improvement on divergent applications with no performance degradation on convergent applications  Narrow slices with ganged execution improve both SIMD efficiency and thread-level parallelism. Questions?