Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor
Mark Gebhart (1,2), Stephen W. Keckler (1,2), Brucek Khailany (2), Ronny Krashinsky (2), William J. Dally (2,3)
(1) The University of Texas at Austin, (2) NVIDIA, (3) Stanford University

Motivation
- GPUs have thousands of resident threads on chip
- On-chip storage per thread is very limited
- On-chip storage is split between the register file, scratchpad (shared memory), and cache
- Applications have diverse requirements across these three types of on-chip storage
- Using on-chip storage efficiently improves both performance and energy

Overview
- An automated algorithm determines the most efficient allocation
- Overheads are mitigated by leveraging prior work on register file hierarchy
[Figure: traditional design with fixed register file, shared memory, and cache partitions vs. the proposed unified design, which repartitions a single structure per program (Program A, Program B)]

Contemporary GPUs
- Large number of SMs per chip
- High-bandwidth memory system
- Each SM contains:
  - Parallel SIMT lanes
  - High-capacity register file
  - Programmer-controlled shared memory
  - Primary data cache
- Coarse-grain configurability between shared memory and cache (see the sketch below)
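On current hardware that configurability is exposed only as a few fixed presets. A minimal sketch of the CUDA runtime calls that select the coarse-grain shared-memory/L1 split the slide refers to; my_kernel is a placeholder name, not a kernel from the talk:

#include <cuda_runtime.h>

// Placeholder kernel; any __global__ function works here.
__global__ void my_kernel(float *data) { /* ... */ }

int main() {
    // Ask for the preset that favors shared memory over L1 cache for this kernel.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

    // Or the preset that favors L1 cache instead. Only a few coarse presets
    // exist; the talk argues for a fully unified, finer-grain alternative.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    my_kernel<<<1, 32>>>(nullptr);
    return cudaDeviceSynchronize();
}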

Baseline Streaming Processor
- Each SM contains:
  - 32 SIMT lanes
  - 256KB main register file
  - 64KB shared memory
  - 64KB primary data cache
  - Register file hierarchy [Gebhart, MICRO 2011]
[Figure: SM block diagram showing the SIMT lanes (ALU, SFU, TEX, MEM units), the register file hierarchy (L0 RF and L1 RF feeding the ALUs, backed by the main register file), shared memory, and cache]

Outline
- Motivation
- GPU background
- Unified GPU on-chip storage
- Sensitivity study
- Microarchitecture
- Results
- Conclusions

Sensitivity Study
- Evaluate the performance impact of the capacity of three structures:
- Larger register file
  - Increase the number of registers per thread
  - Increase the number of concurrent threads
- Larger shared memory (see the sketch after this list)
  - Refactor code to use more shared memory per thread
  - Increase the number of concurrent threads
- Larger cache
  - Better exploit locality
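To make the shared-memory knob concrete, here is a minimal sketch (my own illustration, not a benchmark from the talk) of the kind of refactoring the study varies: staging a tile of the input in shared memory, where the tile size trades shared memory per thread block against the number of thread blocks that can run concurrently.

#define TILE 128  // tunable: more shared memory per block vs. fewer concurrent blocks

// Launch with TILE threads per block, e.g. blocked_sum<<<n / TILE, TILE>>>(in, out, n).
__global__ void blocked_sum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];          // shared memory footprint grows with TILE
    int base = blockIdx.x * TILE;
    int tid  = threadIdx.x;

    // Stage one tile of the input into on-chip shared memory.
    tile[tid] = (base + tid < n) ? in[base + tid] : 0.0f;
    __syncthreads();

    // Tree reduction over the staged tile, avoiding repeated global-memory reads.
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}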

Register File Sensitivity Study
[Plot: performance as register file capacity varies, for 256, 512, 768, and 1024 threads per SM]

Register File Sensitivity Study (continued)
[Plot: additional register file sensitivity results]

Shared Memory Sensitivity Study
[Plot: performance of Needle as shared memory capacity varies, for 256, 512, 768, and 1024 threads per SM]

Shared Memory Sensitivity Study (continued)
[Plot: shared memory sensitivity for Needle, PCR, STO, and LU]

Cache Capacity Sensitivity Study
[Plot: performance as cache capacity varies]

Cache Capacity Sensitivity Study (continued)
[Plot: additional cache capacity sensitivity results]

Workload Characterization Summary
- Ideal capacities vary widely across the three types of memory
- Performance is most sensitive to excessive register spills
- Some applications see significant benefits from large caches
- Fewer DRAM accesses both improve performance and reduce energy

Proposed Design
- Challenges:
  - Performance overhead of bank conflicts
  - Energy overhead of bank access
  - Allocation decisions
[Figure: traditional design with fixed register file, shared memory, and cache partitions vs. the proposed unified design, repartitioned per program (Program A, Program B)]

Baseline Microarchitecture
- SM is composed of 8 clusters
- Total of 96 banks:
  - 32 register file banks
  - 32 L1 cache banks
  - 32 shared memory banks

Unified Microarchitecture
- Total of 32 unified storage banks
- Increase in bank conflicts (illustrated below)
- Increase in bank access energy
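A minimal host-side sketch, under assumptions of my own (32 banks, 4-byte words, word-interleaved mapping) rather than the paper's actual microarchitecture, of why folding three structures into 32 shared banks can cost cycles: accesses from one warp that collide on a bank must be serialized.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

constexpr uint32_t kNumBanks  = 32;  // one unified bank per SIMT lane (assumed)
constexpr uint32_t kWordBytes = 4;

// Word-interleaved mapping: consecutive 4-byte words land in consecutive banks.
uint32_t bank_of(uint32_t byte_addr) {
    return (byte_addr / kWordBytes) % kNumBanks;
}

// A warp's access takes as many cycles as its most-contended bank:
// conflict-free accesses finish in 1 cycle, colliding ones are serialized.
uint32_t cycles_for(const std::vector<uint32_t> &lane_addrs) {
    std::unordered_map<uint32_t, uint32_t> hits_per_bank;
    uint32_t worst = 0;
    for (uint32_t addr : lane_addrs)
        worst = std::max(worst, ++hits_per_bank[bank_of(addr)]);
    return worst;
}

int main() {
    std::vector<uint32_t> unit_stride, stride_64;
    for (uint32_t lane = 0; lane < 32; ++lane) {
        unit_stride.push_back(lane * 4);   // each lane hits a different bank
        stride_64.push_back(lane * 64);    // lanes pile up on banks 0 and 16
    }
    std::cout << cycles_for(unit_stride) << " cycle(s)\n";  // prints 1
    std::cout << cycles_for(stride_64)   << " cycle(s)\n";  // prints 16
    return 0;
}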

Allocation Algorithm
- Allocate enough registers to eliminate spills
- The programmer dictates the shared memory blocking
- Maximize the thread count subject to the register and shared memory requirements
- Devote the remaining storage to cache (a sketch of this policy follows below)
[Diagram: the compiler supplies registers per thread, the programmer supplies bytes of shared memory per thread, and the runtime scheduler picks the number of threads to run and the resulting cache capacity]
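A minimal sketch of this policy as I read it from the slide, not the paper's actual allocator. The function name, the 384KB unified capacity (the baseline 256KB + 64KB + 64KB), the 1024-thread cap, and the whole-warp rounding are my assumptions.

#include <algorithm>
#include <cstdint>

struct Allocation {
    uint32_t threads;       // concurrent threads the SM will run
    uint32_t reg_bytes;     // unified storage carved out for the register file
    uint32_t shared_bytes;  // unified storage carved out for shared memory
    uint32_t cache_bytes;   // whatever remains becomes primary data cache
};

// regs_per_thread comes from the compiler (enough to avoid spills);
// shared_per_thread comes from the programmer's blocking choice.
Allocation allocate(uint32_t regs_per_thread, uint32_t shared_per_thread,
                    uint32_t unified_bytes = 384 * 1024,  // assumed unified capacity
                    uint32_t max_threads   = 1024,
                    uint32_t warp_size     = 32) {
    // Per-thread footprint: 4 bytes per 32-bit register plus shared memory.
    const uint32_t per_thread = regs_per_thread * 4 + shared_per_thread;

    // Maximize the thread count subject to the register and shared memory needs,
    // rounded down to a whole number of warps.
    uint32_t threads = std::min(max_threads, unified_bytes / per_thread);
    threads -= threads % warp_size;

    Allocation a;
    a.threads      = threads;
    a.reg_bytes    = threads * regs_per_thread * 4;
    a.shared_bytes = threads * shared_per_thread;
    a.cache_bytes  = unified_bytes - a.reg_bytes - a.shared_bytes;  // leftover -> cache
    return a;
}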

Methodology
- Generated execution and address traces with Ocelot
- Performance and energy estimates come from a custom SM trace-based simulator
- 30 CUDA benchmarks drawn from the CUDA SDK, Parboil, Rodinia, and GPGPU-sim:
  - 22 with limited memory requirements that do not benefit
  - 8 that see significant benefits

Overheads
- For applications that do not benefit:
  - <1% performance overhead
  - <1% energy overhead

Allocation Decisions
- Different allocation decisions are made across benchmarks
- Register file usage ranges from 50KB to 250KB
- Cache usage ranges from 50KB to 300KB
- Needle requires a large amount of shared memory

Results
- Performance improvements range from 5% to 71%
- Energy and DRAM access reductions of up to 33%
- Leads to substantial efficiency improvements

Comparison with Limited Flexibility
- The unified design outperforms a limited-flexibility design that unifies only shared memory and cache
- mummergpu underperforms with the unified design due to interactions with the scheduler

Summary
- Applications have diverse needs for on-chip storage
- The unified memory introduces minimal overheads
- The register file hierarchy mitigates bank conflicts
- Moderate performance gains for a large number of applications
- Enables a more flexible system