Managing DRAM Latency Divergence in Irregular GPGPU Applications
Niladrish Chatterjee, Mike O’Connor, Gabriel H. Loh, Nuwan Jayasena, Rajeev Balasubramonian

Irregular GPGPU Applications
- Conventional GPGPU workloads access vector- or matrix-based data structures: predictable strides, large data parallelism
- Emerging irregular workloads: pointer-based data structures and data-dependent memory accesses
- Result: memory latency divergence on SIMT platforms
- Proposal: warp-aware memory scheduling to reduce DRAM latency divergence

SIMT Execution Overview
[Diagram: warps issue from the warp scheduler onto the SIMD lanes of each SIMT core; memory requests travel through the interconnect to the L2 slices and memory controllers in each memory partition, which drive the GDDR5 channels.]
- Threads in a warp execute in lockstep
- A warp stalls while it waits on a memory access

Memory Latency Divergence
- The coalescer has limited efficacy in irregular workloads
- Partial hits in L1 and L2 are the first source of latency divergence
- DRAM requests can have varied latencies; the warp is stalled until its last request returns (DRAM latency divergence)
[Diagram: a load instruction from the 32 SIMD lanes passes through the access coalescing unit to L1, L2, and GDDR5.]
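A compact way to state the problem (notation ours, not from the slides): a warp's memory stall time is set by its slowest request, so a single long-latency DRAM access holds up all 32 lanes.

```latex
% Warp w issues the request set R(w) at time t_issue(w); it can resume only
% when its last request completes, so its memory stall time is
T_{\mathrm{stall}}(w) \;=\; \max_{r \in R(w)} t_{\mathrm{complete}}(r) \;-\; t_{\mathrm{issue}}(w)
```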

GPU Memory Controller (GMC)
- Optimized for high throughput
- Harvests channel and bank parallelism: address mapping spreads cache lines across channels and banks
- Achieves a high row-buffer hit rate: deep queuing and aggressive reordering of requests for row-hit batching
- Not cognizant of the need to service requests from a warp together: interleaves requests from different warps, leading to latency divergence
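A minimal C++ sketch of the kind of throughput-first selection described above (our own illustration; the request fields and the exact policy are assumptions, not the paper's controller): row hits are preferred over older row misses, and warp membership plays no role, which is why requests from different warps end up interleaved.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical request and bank state; names are ours, not from the simulator.
struct Request {
    uint64_t arrival_cycle;
    uint32_t warp_id;   // tracked, but the baseline policy ignores it
    uint32_t bank;
    uint64_t row;
};

struct Bank {
    uint64_t open_row = 0;
    bool row_valid = false;
};

// Row-hit-first, then oldest-first selection: good for row-buffer hits and bus
// utilization, but it freely mixes warps, so a warp's last request can sit
// behind other warps' row hits for a long time.
int pick_next(const std::vector<Request>& queue, const std::vector<Bank>& banks) {
    int best = -1;
    bool best_hit = false;
    for (size_t i = 0; i < queue.size(); ++i) {
        const Request& r = queue[i];
        bool hit = banks[r.bank].row_valid && banks[r.bank].open_row == r.row;
        if (best < 0 ||
            (hit && !best_hit) ||
            (hit == best_hit && r.arrival_cycle < queue[best].arrival_cycle)) {
            best = static_cast<int>(i);
            best_hit = hit;
        }
    }
    return best;  // index of the request to issue next, or -1 if the queue is empty
}
```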

Warp-Aware Scheduling
[Example: two SMs issue loads from warps A and B to the memory controller. Under baseline GMC scheduling, requests from A and B are interleaved, so both warps accumulate stall cycles before they can use their data. Under warp-aware scheduling, all of A's requests are serviced first, then all of B's, reducing the average memory stall time.]

Impact of DRAM Latency Divergence
- If all requests from a warp were returned in perfect sequence from the DRAM: ~40% improvement
- If there were only one request per warp: 5x improvement

Key Idea
- Form batches of requests from each warp: warp-groups
- Schedule all requests from a warp-group together
- The scheduling algorithm arbitrates between warp-groups to minimize the average stall time of warps
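A minimal sketch, assuming a simple per-warp batching structure of our own design (not the authors' code), of how requests could be collected into warp-groups before scheduling:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Request {
    uint32_t warp_id;
    uint64_t addr;
};

// A warp-group batches all pending DRAM requests from one warp so the
// transaction scheduler can treat them as a single schedulable unit.
struct WarpGroup {
    uint32_t warp_id = 0;
    std::vector<Request> reqs;
};

class WarpGroupQueue {
public:
    void enqueue(const Request& r) {
        WarpGroup& g = groups_[r.warp_id];   // create the group on first request
        g.warp_id = r.warp_id;
        g.reqs.push_back(r);                 // same warp -> same warp-group
    }

    // The scheduler arbitrates between whole groups (e.g., shortest estimated
    // service time first, as the following slides describe).
    std::unordered_map<uint32_t, WarpGroup>& groups() { return groups_; }

private:
    std::unordered_map<uint32_t, WarpGroup> groups_;
};
```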

Controller Design

Warp-Group Scheduling: Single Channel
[Figure: pending warp-groups feed a warp-group priority table that drives the transaction scheduler.]
- Each warp-group is assigned a priority that reflects the completion time of its last request
- Priority factors: number of requests in the warp-group, row hit/miss status of the requests, and queuing delay in the command queues
- Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
- Priorities are updated dynamically
- The transaction scheduler picks the warp-group with the lowest run time: shortest-job-first based on actual service time
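The shortest-job-first arbitration can be sketched as follows; the cost model and timing constants are illustrative assumptions of ours, not the paper's actual service-time estimator:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct GroupState {
    uint32_t warp_id;
    uint32_t num_reqs;
    uint32_t num_row_misses;   // requests that need precharge + activate
    uint32_t queued_ahead;     // commands already waiting in the target bank queues
};

// Illustrative GDDR5-like costs in controller cycles (placeholders only).
constexpr uint32_t kRowHitCost  = 4;    // back-to-back column access
constexpr uint32_t kRowMissCost = 40;   // precharge + activate + column access

// Estimate how long it would take to finish this warp-group's last request.
uint64_t estimated_service_time(const GroupState& g) {
    uint32_t hits = g.num_reqs - g.num_row_misses;
    return static_cast<uint64_t>(g.queued_ahead) * kRowHitCost +
           hits * kRowHitCost + g.num_row_misses * kRowMissCost;
}

// Transaction scheduler: pick the warp-group expected to finish soonest.
const GroupState* pick_group(const std::vector<GroupState>& groups) {
    auto it = std::min_element(groups.begin(), groups.end(),
        [](const GroupState& a, const GroupState& b) {
            return estimated_service_time(a) < estimated_service_time(b);
        });
    return it == groups.end() ? nullptr : &*it;
}
```

In effect the scheduler approximates each warp-group's completion time from its size, its row-buffer locality, and the load on its target banks, and serves the group expected to finish soonest.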

WG Scheduling
[Chart: latency divergence vs. bandwidth utilization. WG moves from the GMC baseline toward the ideal by reducing latency divergence.]

Multiple Memory Controllers
- Channel-level parallelism: a warp's requests are sent to multiple memory channels, with independent scheduling at each controller
- A subset of a warp's requests can therefore be delayed at one or a few memory controllers
- Solution: coordinate scheduling between controllers
- Prioritize a warp-group that has already been serviced at other controllers
- A coordination message is broadcast to the other controllers on completion of a warp-group

Warp-Group Scheduling: Multi-Channel
[Figure: pending warp-groups feed a priority table that drives the transaction scheduler.]
- Priority factors: number of requests in the warp-group, row hit/miss status of the requests, queuing delay in the command queues, and the status of the warp-group in other channels
- Periodic messages inform the other channels about completed warp-groups
- The scheduler picks the warp-group with the lowest run time
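A sketch of the cross-channel coordination under stated assumptions (the message transport and the priority weighting are ours, not from the paper): when a controller finishes a warp-group it broadcasts the warp id, and the other controllers favor that warp's remaining group so the warp is not left waiting on a single straggling channel.

```cpp
#include <cstdint>
#include <unordered_set>

struct ChannelScheduler {
    // Warp ids whose warp-groups have already completed at peer channels.
    std::unordered_set<uint32_t> completed_elsewhere;

    // Called when a completion broadcast arrives from another memory controller.
    void on_peer_completion(uint32_t warp_id) {
        completed_elsewhere.insert(warp_id);
    }

    // Fold coordination into the service-time estimate: groups whose warp is
    // already done at other channels get an effectively higher priority.
    uint64_t adjusted_priority(uint32_t warp_id, uint64_t est_service_time) const {
        const uint64_t kBoost = 2;  // illustrative weighting only
        return completed_elsewhere.count(warp_id) ? est_service_time / kBoost
                                                  : est_service_time;
    }
};
```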

WG-M Scheduling
[Chart: latency divergence vs. bandwidth utilization, with WG-M added alongside the GMC baseline and WG, moving closer to the ideal.]

Bandwidth-Aware Warp-Group Scheduling
- Warp-group scheduling negatively affects bandwidth utilization: reduced row-hit rate
- Conflicting objectives: issue the row-miss request from the current warp-group vs. issue row-hit requests to maintain bus utilization
- Activate and precharge idle cycles can be hidden by row hits in other banks
- Delay the row-miss request until the right slot is found

Bandwidth-Aware Warp-Group Scheduling
- Minimum efficient row burst (MERB): the minimum number of row hits needed in other banks to overlap (tRTP + tRP + tRCD), determined by GDDR timing parameters
- Stored in a ROM looked up by the transaction scheduler
- More banks with pending row hits means a smaller MERB
- Schedule the row miss after MERB row hits have been issued to the bank
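One plausible reading of the MERB mechanism, sketched below with illustrative timing values and ROM contents (the real table is derived from GDDR5 parameters; ours is only a placeholder): the scheduler keeps issuing row hits and releases a pending row miss only once enough hits are available to hide the precharge/activate gap.

```cpp
#include <array>
#include <cstdint>

// Illustrative GDDR5-like timing parameters in DRAM clock cycles (placeholders,
// not the values used in the paper).
constexpr uint32_t tRTP = 6, tRP = 12, tRCD = 12, tBURST = 4;

// Row hits needed on the data bus to cover the precharge/activate gap of a row miss.
constexpr uint32_t kGapCycles  = tRTP + tRP + tRCD;
constexpr uint32_t kHitsToHide = (kGapCycles + tBURST - 1) / tBURST;

// "ROM" of MERB values indexed by the number of other banks with pending row hits.
// More banks able to supply hits -> the gap is easier to hide -> smaller MERB.
// Entries are illustrative only.
constexpr std::array<uint32_t, 9> kMerbRom = {
    kHitsToHide,  // 0 banks: the gap cannot really be hidden
    kHitsToHide,  // 1 bank
    6, 5, 4, 3, 3, 2, 2
};

// Release the pending row miss only once MERB row hits have been issued
// since the last row miss; otherwise keep issuing row hits.
bool can_issue_row_miss(uint32_t hits_issued_since_last_miss,
                        uint32_t banks_with_pending_hits) {
    uint32_t idx = banks_with_pending_hits < kMerbRom.size()
                       ? banks_with_pending_hits
                       : static_cast<uint32_t>(kMerbRom.size() - 1);
    return hits_issued_since_last_miss >= kMerbRom[idx];
}
```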

WG-Bw Scheduling
[Chart: latency divergence vs. bandwidth utilization, with WG-Bw recovering bandwidth utilization relative to WG and WG-M while keeping latency divergence low.]

Warp-Aware Write Draining
- Writes are drained in batches, starting at the high watermark; this can stall small warp-groups
- When the write queue reaches a threshold lower than the high watermark, drain singleton warp-groups only
- Reduces write-induced latency
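A minimal sketch of one reading of this policy (the threshold names and the notion of a "singleton-only" drain mode are our assumptions): between the lower threshold and the high watermark, only singleton warp-groups are drained, and a full batch drain starts only at the high watermark.

```cpp
#include <cstdint>

struct WriteQueueState {
    uint32_t occupancy;            // writes currently buffered
    uint32_t high_watermark;       // start a full batch drain here
    uint32_t singleton_threshold;  // lower threshold: drain singletons only
};

enum class DrainMode { None, SingletonsOnly, FullBatch };

// Decide how aggressively to drain writes, trading write-induced latency
// against stalling multi-request warp-groups.
DrainMode choose_drain_mode(const WriteQueueState& wq) {
    if (wq.occupancy >= wq.high_watermark)       return DrainMode::FullBatch;
    if (wq.occupancy >= wq.singleton_threshold)  return DrainMode::SingletonsOnly;
    return DrainMode::None;
}
```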

WG Scheduling: Putting It Together
[Chart: latency divergence vs. bandwidth utilization, showing the GMC baseline, WG, WG-M, WG-Bw, and WG-W design points; the full scheme approaches the ideal.]

Methodology
- GPGPU-Sim v3.1: cycle-accurate GPGPU simulator
- USIMM v1.3: cycle-accurate DRAM simulator, modified to model the GMC baseline and GDDR5 timings
- Irregular and regular workloads from Parboil, Rodinia, Lonestar, and MARS

SM Cores: 30
Max Threads/Core: 1024
Warp Size: 32 threads/warp
L1 / L2: 32 KB / 128 KB
DRAM: 6 Gbps GDDR5
DRAM Channels / Banks: 6 channels, 16 banks/channel

Performance Improvement
[Chart: performance improvement results, showing reduced latency divergence and restored bandwidth utilization.]

Impact on Regular Workloads
- Effective coalescing yields high spatial locality within a warp-group
- WG scheduling behaves similarly to the GMC baseline: no performance loss
- WG-Bw and WG-W provide minor benefits

Energy Impact of Reduced Row Hit Rate
- Scheduling row misses over row hits reduces the row-buffer hit rate by 16%
- In GDDR5, power consumption is dominated by I/O
- The increase in DRAM power is negligible compared to the execution speedup: net improvement in system energy

Conclusions
- Irregular applications place new demands on the GPU's memory system
- Memory scheduling can alleviate the issues caused by latency divergence
- Carefully orchestrating the scheduling of commands can help regain the bandwidth lost to warp-aware scheduling
- Future techniques must also involve the cache hierarchy in reducing latency divergence

Thanks!

Backup Slides

Performance Improvement: IPC

Average Warp Stall Latency

DRAM Latency Divergence

Bandwidth Utilization

Memory Controller Microarchitecture

Warp-Group Scheduling
- Every batch is assigned a priority score: the completion time of its longest request
- Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
- Priorities are updated after each warp-group is scheduled
- The warp-group with the lowest service time is selected: shortest-job-first based on actual service time, not number of requests