OWL: Cooperative Thread Array (CTA) Aware Scheduling Techniques for Improving GPGPU Performance Adwait Jog, Onur Kayiran, Nachiappan CN, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Presentation transcript:

OWL: Cooperative Thread Array (CTA) Aware Scheduling Techniques for Improving GPGPU Performance Adwait Jog, Onur Kayiran, Nachiappan CN, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

GPUs are everywhere! Super Computers, Desktops, Laptops, Tablets, Smartphones, Gaming Consoles (Source: NVIDIA)

Executive Summary
Limited DRAM bandwidth is a critical performance bottleneck. Thousands of threads execute concurrently on a GPU, but they:
 May not always be enough to hide long memory latencies
 Share small caches, leading to high cache contention
Proposal: a comprehensive scheduling policy that
 Reduces cache miss rates
 Improves DRAM bandwidth
 Improves the latency-hiding capability of GPUs

Off-chip Bandwidth is Critical!
[Chart: percentage of total execution cycles wasted waiting for data to come back from DRAM, across GPGPU applications (Type-1 and Type-2). Type-1 applications waste around 55% of their cycles; the average across all applications is 32%.]

Outline  Introduction  Background  CTA-Aware Scheduling Policy: 1. Reduces cache miss rates, 2. Improves DRAM bandwidth  Evaluation Results and Conclusions

High-Level View of a GPU
[Diagram: SIMT cores, each with a scheduler, ALUs, and L1 caches, connected through an interconnect to the L2 cache and DRAM. Threads are grouped into warps (W), and warps are grouped into Cooperative Thread Arrays (CTAs).]

CTA-Assignment Policy (Example)
[Diagram: a multi-threaded CUDA kernel is divided into CTA-1 through CTA-4, which are distributed across SIMT Core-1 and SIMT Core-2 (CTA-1 and CTA-3 on one core, CTA-2 and CTA-4 on the other). Each core has its own warp scheduler, ALUs, and L1 caches.]
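To make the CTA terminology concrete, here is a minimal CUDA C++ sketch (not from the slides; the kernel and launch sizes are illustrative). Every thread block launched below is one CTA, and the hardware's CTA scheduler is free to assign blocks to SIMT cores in any order, as in the diagram above.

```cpp
#include <cstdio>

// Illustrative kernel: each thread block (CTA) processes one slice of a vector.
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // blockIdx.x identifies the CTA
    if (idx < n)
        data[idx] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 4096 blocks of 256 threads: 4096 CTAs, each made of 8 warps of 32 threads.
    // The hardware CTA scheduler assigns these CTAs to the available SIMT cores.
    scale<<<4096, 256>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```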

Warp Scheduling Policy
All launched warps on a SIMT core have equal priority  Round-robin execution
Problem: many warps stall at long-latency operations at roughly the same time.
[Timeline: all warps of a CTA compute with equal priority, then send their memory requests together, and the SIMT core stalls until the data returns.]
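A host-side C++ sketch of the baseline policy (my own illustration, not the simulator's code): the scheduler cycles over all ready warps in round-robin order, so every warp advances at roughly the same rate and all of them tend to reach their long-latency loads together.

```cpp
#include <vector>
#include <cstddef>

struct Warp {
    bool ready;   // not currently stalled on a memory request
    // per-warp state (PC, scoreboard, ...) omitted
};

// Baseline round-robin: every ready warp gets an equal turn, so the warps
// progress in near-lockstep and issue their memory requests around the same time.
int select_warp_rr(const std::vector<Warp>& warps, std::size_t& last_issued) {
    if (warps.empty()) return -1;
    for (std::size_t i = 1; i <= warps.size(); ++i) {
        std::size_t cand = (last_issued + i) % warps.size();
        if (warps[cand].ready) {
            last_issued = cand;
            return static_cast<int>(cand);
        }
    }
    return -1;  // every warp is stalled: the SIMT core idles
}
```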

Solution: form warp-groups (Narasiman+ MICRO 2011) using CTA-aware grouping; the group switch is round-robin.
[Timeline: while one group's memory requests are in flight, the next group computes, so the SIMT core no longer stalls and cycles are saved relative to the baseline in which all warps compute and send memory requests together.]
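A sketch of this grouped scheduling (again my own illustration, with assumed data structures): warps are partitioned into groups by CTA, the scheduler issues only from the current group, and it rotates to the next group once every warp in the current group is stalled, so the stalled group's memory latency is overlapped with the next group's compute.

```cpp
#include <vector>
#include <cstddef>

struct Warp { bool ready; int cta_id; };

// CTA-aware warp grouping with round-robin group switching (sketch).
// 'groups' holds the warp ids belonging to each CTA group on this core.
int select_warp_grouped(const std::vector<std::vector<int>>& groups,
                        const std::vector<Warp>& warps,
                        std::size_t& current_group) {
    if (groups.empty()) return -1;
    for (std::size_t g = 0; g < groups.size(); ++g) {
        std::size_t grp = (current_group + g) % groups.size();
        for (int w : groups[grp]) {
            if (warps[w].ready) {
                current_group = grp;   // stay on this group while it can issue
                return w;
            }
        }
        // whole group stalled on memory: fall through to the next group
    }
    return -1;  // every group is stalled
}
```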

Outline  Introduction  Background  CTA-Aware Scheduling Policy "OWL": 1. Reduces cache miss rates, 2. Improves DRAM bandwidth  Evaluation Results and Conclusions

OWL Philosophy (1): OWL focuses on "one work (group) at a time"

OWL Philosophy (1)
What does OWL do?
 Selects a group (finds food)
 Always prioritizes it (focuses on food); the group switch is NOT round-robin
Benefits:
 Less cache contention
 The latency-hiding benefits of grouping are still present
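A sketch of this prioritization (assumed data structures; the paper's hardware logic may differ): other groups are used only as a fallback when the prioritized group is fully stalled, and the priority moves on only once the prioritized group has finished, so fewer CTAs touch the L1 cache at any given time.

```cpp
#include <vector>
#include <cstddef>

struct Warp { bool ready; bool finished; };

// OWL-style group prioritization (sketch): always prefer the currently
// prioritized CTA group; advance the priority only when that group finishes.
int select_warp_owl(const std::vector<std::vector<int>>& groups,  // warp ids per CTA group
                    const std::vector<Warp>& warps,
                    std::size_t& prio_group) {
    if (groups.empty()) return -1;

    auto group_done = [&](std::size_t g) {
        for (int w : groups[g])
            if (!warps[w].finished) return false;
        return true;
    };
    while (prio_group < groups.size() && group_done(prio_group))
        ++prio_group;                       // move priority past finished groups only

    for (std::size_t off = 0; off < groups.size(); ++off) {
        std::size_t g = (prio_group + off) % groups.size();
        for (int w : groups[g])
            if (warps[w].ready && !warps[w].finished)
                return w;                   // prioritized group first, others as fallback
    }
    return -1;  // nothing can issue
}
```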

Objective 1: Improve Cache Hit Rates
[Timeline: CTAs 1, 3, 5, and 7 share a core. Without switching back, the scheduler keeps moving through the CTAs after CTA 1's data arrives, and 4 CTAs access the cache within time T. With switching back to the prioritized CTA 1 as soon as its data arrives, only 3 CTAs access the cache in the same time T.]
Fewer CTAs accessing the cache concurrently  less cache contention

Reduction in L1 Miss Rates
[Chart: average L1 miss-rate reduction relative to round-robin: about 8% with CTA-grouping and about 18% with CTA-grouping plus prioritization.]
Benefits are limited for cache-insensitive applications; additional analysis is in the paper.

Outline  Introduction  Background  CTA-Aware Scheduling Policy "OWL": 1. Reduces cache miss rates, 2. Improves DRAM bandwidth (via enhancing bank-level parallelism and row buffer locality)  Evaluation Results and Conclusions

More Background
Independent execution property of CTAs  CTAs can execute and finish in any order
CTA DRAM data layout  Consecutive CTAs (and, in turn, their warps) can have good spatial locality (more details to follow)

CTA Data Layout (A Simple Example)
[Diagram: a data matrix A stored in row-major order in DRAM, with successive portions of the matrix mapped to Bank 1 through Bank 4. The matrix is partitioned among CTA 1 through CTA 4, so consecutive CTAs access data that falls in the same DRAM rows.]
Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%
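A minimal C++ sketch of why consecutive CTAs share DRAM rows (the address-mapping parameters and tile sizes are assumed for illustration, not the paper's configuration): with a row-major layout and a simple interleaved bank mapping, elements touched by neighboring CTAs often land in the same (bank, row) pair.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative DRAM mapping: 4 banks, 2 KB rows, addresses interleaved at
// row granularity (assumed parameters).
constexpr uint64_t kRowBytes = 2048;
constexpr uint64_t kNumBanks = 4;

struct DramLoc { uint64_t bank, row; };

DramLoc map_address(uint64_t addr) {
    uint64_t row_id = addr / kRowBytes;          // which row-sized chunk of memory
    return { row_id % kNumBanks, row_id / kNumBanks };
}

int main() {
    // A 256x256 float matrix in row-major order; CTA i processes a 64-column
    // tile, so CTA 0 and CTA 1 (consecutive CTAs) read adjacent columns of the
    // same matrix rows, and therefore often the same DRAM row.
    constexpr uint64_t kCols = 256, kTile = 64, kElem = 4;
    for (int cta = 0; cta < 2; ++cta) {
        uint64_t addr = (0 * kCols + cta * kTile) * kElem;  // first element the CTA reads in matrix row 0
        DramLoc loc = map_address(addr);
        std::printf("CTA %d, matrix row 0 -> bank %llu, DRAM row %llu\n",
                    cta, (unsigned long long)loc.bank, (unsigned long long)loc.row);
    }
    return 0;
}
```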

Implications of High CTA-Row Sharing
[Diagram: SIMT Core-1 (running CTA-1 and CTA-3) and SIMT Core-2 (running CTA-2 and CTA-4) each prioritize their lowest-numbered CTA, which are consecutive across cores. Because consecutive CTAs share DRAM rows, their requests, after the L2 cache, concentrate on a subset of Banks 1-4 and Rows 1-4 while the remaining banks sit idle.]

Analogy
[Image: two ticket counters. Counter 1: "Those who have time: stand in line here." Counter 2: "Those who don't have time: stand in line here."]
Which counter will you prefer?

[Diagram: when all requests queue for one open row in a single bank, row locality is high but bank-level parallelism is low; when requests are spread across Bank-1 (Row-1) and Bank-2 (Row-2), row locality is lower but bank-level parallelism is higher.]
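A back-of-the-envelope C++ illustration of this tradeoff (the latencies are made up for the example, not real device timings): serving two requests from one open row avoids a second activation, but serving them from two banks lets the activations overlap, which is why higher bank-level parallelism can win even at lower row locality.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Assumed, illustrative DRAM timings in cycles (not real device parameters).
    const int act = 40;   // precharge + activate a row
    const int col = 20;   // column access on an already-open row

    // Case A -- high row locality, low bank-level parallelism:
    // both requests hit the same open row of one bank.
    int case_a = act + col + col;                   // 40 + 20 + 20 = 80 cycles

    // Case B -- lower row locality, higher bank-level parallelism:
    // each request activates a row in a different bank, but the two banks
    // work in parallel, so the activations overlap (ignoring data-bus
    // contention for simplicity).
    int case_b = std::max(act + col, act + col);    // ~60 cycles

    std::printf("same row, one bank: %d cycles; two banks in parallel: %d cycles\n",
                case_a, case_b);
    return 0;
}
```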

OWL Philosophy (2)
What does OWL do now?
 Intelligently selects a group (intelligently finds food)
 Always prioritizes it (focuses on food)
OWL selects non-consecutive CTAs across cores  attempts to access as many DRAM banks as possible
Benefits:
 Improves bank-level parallelism
 The latency-hiding and cache-hit-rate benefits are still preserved
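A minimal sketch of the idea (my own illustration; the paper's actual selection logic may differ): if every core prioritizing its lowest-numbered, consecutive CTA funnels traffic into the same banks, then skewing which CTA each core prioritizes spreads concurrently issued requests over more banks.

```cpp
#include <vector>
#include <cstddef>

// Sketch: each SIMT core holds several CTA groups. Instead of every core
// prioritizing its lowest-numbered (consecutive) CTA, core i starts from a
// different offset, so the CTAs being prioritized at the same time are
// non-consecutive and tend to map to different DRAM banks.
int prioritized_cta(const std::vector<int>& cta_ids_on_core,  // CTAs assigned to this core
                    int core_id) {
    if (cta_ids_on_core.empty()) return -1;
    std::size_t offset = static_cast<std::size_t>(core_id) % cta_ids_on_core.size();
    return cta_ids_on_core[offset];                // skew the prioritized CTA by core id
}
```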

Objective 2: Improving Bank-Level Parallelism
[Diagram: with OWL's CTA prioritization order, SIMT Core-1 and SIMT Core-2 prioritize non-consecutive CTAs, so their requests are spread across Banks 1-4 instead of piling onto a few rows.]
Result: 11% increase in bank-level parallelism, but a 14% decrease in row buffer locality.

Objective 3: Recovering Row Locality
[Diagram: memory-side prefetching pulls the remaining cache lines of an already-open DRAM row into the L2 cache, so subsequent requests to that row from the SIMT cores become L2 hits.]

Outline  Introduction  Background  CTA-Aware Scheduling Policy "OWL": 1. Reduces cache miss rates, 2. Improves DRAM bandwidth  Evaluation Results and Conclusions

Evaluation Methodology
Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
Baseline architecture:
 28 SIMT cores, 8 memory controllers, mesh connected
 1300 MHz, SIMT width = 8, max threads/core
 32 KB L1 data cache, 8 KB texture and constant caches
 GDDR3 at 800 MHz
Applications considered (38 in total) from:
 MapReduce applications
 Rodinia: heterogeneous applications
 Parboil: throughput-computing-focused applications
 NVIDIA CUDA SDK: GPGPU applications

IPC Results (Normalized to Round-Robin)
[Chart: IPC of the OWL schemes normalized to the round-robin baseline; labels on the chart: "44%" and "within 11% of Perfect L2".]
More details in the paper.

Conclusions
Many GPGPU applications exhibit sub-par performance, primarily because of limited off-chip DRAM bandwidth.
The OWL scheduling policy improves:
 The latency-hiding capability of GPUs (via CTA grouping)
 Cache hit rates (via CTA prioritization)
 DRAM bandwidth (via intelligent CTA scheduling and prefetching)
Result: 33% average IPC improvement over the round-robin warp scheduling policy across Type-1 applications

THANKS! QUESTIONS?

OWL: Cooperative Thread Array (CTA) Aware Scheduling Techniques for Improving GPGPU Performance Adwait Jog, Onur Kayiran, Nachiappan CN, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Related Work
Warp scheduling: Rogers+ MICRO 2012, Gebhart+ ISCA 2011, Narasiman+ MICRO 2011
DRAM scheduling: Ausavarungnirun+ ISCA 2012, Lakshminarayana+ CAL 2012, Jeong+ HPCA 2012, Yuan+ MICRO 2009
GPGPU hardware prefetching: Lee+ MICRO 2009

Memory-Side Prefetching
Prefetch the so-far-unfetched cache lines of an already-open row into the L2 cache, just before the row is closed.
What to prefetch?
 Sequentially prefetch the cache lines of the row that were not accessed by demand requests
 More sophisticated schemes are left as future work
When to prefetch?
 Opportunistic in nature
 Option 1: Prefetching stops as soon as a demand request arrives for another row (demands are always critical)
 Option 2: Give prefetching more time and make demands wait when there are few of them (demands are NOT always critical)
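A minimal sketch of the described policy (assumed data structures and callbacks; not the simulator's implementation): before an open row is closed, the memory controller walks the cache lines of that row that were never demanded and issues prefetches for them into the L2, stopping (under Option 1) as soon as a demand for another row arrives.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

struct OpenRow {
    uint64_t row_base;                  // base address of the open DRAM row
    std::vector<bool> line_demanded;    // one flag per cache line in the row
};

// Memory-side prefetching, Option 1 (sketch): sequentially prefetch into the
// L2 the lines that demand requests skipped, and stop immediately once a
// demand request arrives for a different row.
void prefetch_open_row(const OpenRow& row, uint64_t line_bytes,
                       bool (*demand_pending_for_other_row)(),
                       void (*issue_l2_prefetch)(uint64_t addr)) {
    for (std::size_t i = 0; i < row.line_demanded.size(); ++i) {
        if (demand_pending_for_other_row())
            return;                                   // demands are critical: stop prefetching
        if (!row.line_demanded[i])
            issue_l2_prefetch(row.row_base + i * line_bytes);
    }
}
```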

CTA Row Sharing
Our experiment-driven study shows that:
 Across the 38 applications studied, the percentage of consecutive CTAs (out of total CTAs) accessing the same row is 64%, averaged across all open rows
 Example: if CTAs 1, 2, 3, and 4 all access a single row, the CTA row sharing percentage for that row is 100%
 The applications considered include many irregular applications, which do not show high row-sharing percentages