Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc.


Introduction (Many-Thread Aware Prefetching Mechanisms, MICRO-43)
 General-purpose GPUs (GPGPUs) are becoming popular
  High performance (NVIDIA GeForce GTX 580: 1.5 TFLOPS)
  Many cores with large-scale multi-threading and SIMD units
 CUDA programming model
  SIMT (Single Instruction, Multiple Threads)
  Hierarchy of thread groups: thread, thread block
[Diagram: cores with SIMD execution units and shared memory, a memory request buffer, and DRAM]

Memory Latency Problem
 Tolerating memory latency is critical in CPUs
  Many techniques have been proposed: caches, prefetching, multi-threading, etc.
 GPGPUs have relied on multi-threading
 Memory latency is critical in GPGPUs as well, because thread-level parallelism can be limited
  By application behavior: the algorithm simply lacks parallelism
  By resource constraints: registers per thread, threads per block, shared-memory usage per block

Multi-threading Example
 Example 1: enough threads (4 active threads)
  Switching to another thread hides the memory latency: no stalls
 Example 2: not enough threads (2 active threads)
  The memory latency cannot be fully hidden: stall cycles remain
[Timeline diagram: C = computation, M = memory access, D = instruction dependent on memory]
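The arithmetic behind the two examples can be captured in a toy model (the cycle counts below are illustrative, not from the paper): with T interleaved threads, the memory latency seen by one thread can be overlapped with the compute phases of the other T - 1 threads.

```python
def stall_cycles(num_threads, compute_cycles, mem_latency):
    """Toy round-robin model: each thread computes, issues a memory
    request (latency mem_latency), then needs the result.  While one
    thread waits, the other num_threads - 1 threads run their compute
    phases, hiding up to (num_threads - 1) * compute_cycles of the
    latency."""
    hidden = (num_threads - 1) * compute_cycles
    return max(0, mem_latency - hidden)

# Example 1: 4 active threads fully hide a 12-cycle latency.
print(stall_cycles(4, compute_cycles=4, mem_latency=12))  # 0
# Example 2: 2 active threads leave stall cycles.
print(stall_cycles(2, compute_cycles=4, mem_latency=12))  # 8
```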

Prefetching in GPGPUs
 Problem: when multi-threading is not enough, we need other mechanisms to hide memory latency
 Other solutions
  Caching (NVIDIA Fermi)
  Prefetching (this talk)
 Many prefetchers have been proposed for CPUs: stride, stream, Markov, CDP, GHB, helper threads, etc.
 Question: will these existing mechanisms work in GPGPUs?

Characteristic #1: Many Threads
 Problem #1: training the prefetcher
  Accesses from many threads are interleaved
  Thread-ID indexing reduces the effective prefetcher size
  Scalability becomes the issue
[Diagram: in a CPU, one or two threads train the prefetcher; in a GPGPU, many threads share it]

Characteristic #2: Data-Level Parallelism
 Problem #2: short thread lifetime
  Due to parallelization, each thread in a parallel program is shorter
  A thread may terminate before its prefetches become useful, removing prefetching opportunities
[Diagram: in a sequential thread, a prefetch issued well before the demand access hides the memory latency; a short-lived parallel thread has no room between creation and termination for a useful prefetch]

Characteristic #3: SIMT
 Problem #3: Single-Configuration Many-Threads (SCMT)
  Too many threads are controlled by a single configuration
  Prefetch degree: the number of prefetches per trigger
  Example: with degree 1 the prefetched data fits in the cache, but with degree 2 it far exceeds the cache size and causes capacity misses
 Problem #4: amplified negative effects
  One useless prefetch request per thread becomes many useless prefetches
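The cache-pressure arithmetic behind the degree example can be made concrete (the thread count and line size below are illustrative assumptions; only the 16 KB per-core prefetch cache appears later in the talk): because one configuration applies to every thread, the total prefetch footprint scales with the thread count.

```python
def prefetch_footprint_bytes(num_threads, degree, line_bytes=64):
    """Total data brought in when every thread issues `degree`
    prefetches of one cache line each (SCMT: the same configuration
    is applied to all threads)."""
    return num_threads * degree * line_bytes

CACHE_BYTES = 16 * 1024  # 16 KB prefetch cache per core
# 200 threads, degree 1: 12.5 KB, fits in the cache.
print(prefetch_footprint_bytes(200, 1) < CACHE_BYTES)   # True
# 200 threads, degree 2: 25 KB, overflows it -> capacity misses.
print(prefetch_footprint_bytes(200, 2) > CACHE_BYTES)   # True
```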

Goal
 Design hardware/software prefetching mechanisms for GPGPU applications
 Step 1: a prefetcher for many-thread architectures
  Many-thread aware prefetching mechanisms (scalability, short thread lifetime)
  H/W prefetcher in this talk; S/W prefetcher in the paper
 Step 2: a feedback mechanism to reduce negative effects
  Prefetch throttling (SCMT, amplified negative effects)

Many-Thread Aware Hardware Prefetcher
 (Conventional) stride prefetcher
 Promotion table for the stride prefetcher (scalability)
 Inter-thread (IP) prefetcher (short thread lifetime)
 Decision logic
[Diagram: PC, address, and thread ID feed the stride prefetcher, the promotion table, and the IP prefetcher; stride entries can be promoted; the decision logic selects the prefetch address]

Solving the Scalability Problem
 Problem #1: training the prefetcher (scalability)
 Stride promotion
  Threads often have similar (or even identical) access patterns (SIMT)
  Without promotion, the table is occupied by redundant per-thread entries
  With promotion, storage is managed effectively
  Training time is reduced by reusing earlier threads' information
[Table example: a conventional stride table holds redundant entries for the same PC (0x1a) and stride across TIDs 0-3; the promotion table collapses them into a single PC-indexed entry]
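A minimal sketch of the promotion idea (the table sizes, field names, and promotion threshold are assumptions, not the paper's exact hardware): per-(PC, TID) stride entries are collapsed into a single per-PC entry once enough threads exhibit the same stride.

```python
class PromotingStrideTable:
    """Toy model: train per-(pc, tid) strides; once PROMOTE_THRESHOLD
    threads agree on the same stride for a pc, promote it to a single
    shared entry so later threads skip training entirely."""
    PROMOTE_THRESHOLD = 3

    def __init__(self):
        self.last_addr = {}   # (pc, tid) -> last address seen
        self.stride = {}      # (pc, tid) -> detected stride
        self.promoted = {}    # pc -> shared (promoted) stride

    def train(self, pc, tid, addr):
        if pc in self.promoted:
            return  # already promoted: no per-thread training needed
        key = (pc, tid)
        if key in self.last_addr:
            self.stride[key] = addr - self.last_addr[key]
        self.last_addr[key] = addr
        # Promote when enough threads show the same stride for this pc.
        strides = [s for (p, _), s in self.stride.items() if p == pc]
        if len(strides) >= self.PROMOTE_THRESHOLD and len(set(strides)) == 1:
            self.promoted[pc] = strides[0]

    def predict(self, pc, tid, addr):
        if pc in self.promoted:
            return addr + self.promoted[pc]
        s = self.stride.get((pc, tid))
        return addr + s if s is not None else None
```

With the shared entry in place, a thread the table has never seen still gets a prediction immediately, which is how promotion shortens training time.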

Solving the Short Thread Lifetime Problem
 Problem #2: short thread lifetime
  Highly parallelized code often removes prefetching opportunities
 In a sequential loop, software prefetching works: the loop runs long enough for prefetch(A[ii+D]) to hide the memory latency of the access D iterations later:

for (ii = 0; ii < 100; ++ii) {
  prefetch(A[ii+D]);
  prefetch(B[ii+D]);
  C[ii] = A[ii] + B[ii];
}

 In the parallelized CUDA version, each thread has no loop and only a few instructions, so there is no opportunity to prefetch for itself:

// there are 100 threads
__global__ void KernelFunction(...) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  int varA = aa[tid];
  int varB = bb[tid];
  C[tid] = varA + varB;
}

Inter-Thread Prefetching
 Instead, a thread can prefetch for other threads: inter-thread prefetching (IP)
 In CUDA, the memory index is typically a function of the thread ID

// there are 100 threads
__global__ void KernelFunction(...) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  int next_tid = tid + 32;
  prefetch(aa[next_tid]);
  prefetch(bb[next_tid]);
  int varA = aa[tid];
  int varB = bb[tid];
  C[tid] = varA + varB;
}

[Diagram: each thread prefetches the data that the thread 32 IDs ahead will access, e.g. T0 for T32, T32 for T64]

IP Pattern Detection in Hardware
 Detect strides across threads and launch prefetch requests
 Example (IP table entry: PC, Addr 1, TID 1, Addr 2, TID 2, Train, Delta):
  Req 1 (PC 0x1a, addr 200, TID 1) allocates an entry
  Req 2 (PC 0x1a, addr 400, TID 3): per-thread delta = (400 - 200) / (3 - 1) = 100
  Req 3 (PC 0x1a, addr 1100, TID 10): per-thread delta = (1100 - 400) / (10 - 3) = 100
  The deltas are all the same: a pattern is found (stride 100)
  Req 4 (PC 0x1a, addr 2100, TID 1): prefetch addr + stride = 2200
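The cross-thread training in the example can be sketched as follows (the entry layout and the two-delta training threshold mirror the slide's IP table, but the details are assumptions): the table keeps the previous request's (TID, address) per PC and checks whether the per-thread address delta stays constant.

```python
class IPTable:
    """Toy inter-thread (IP) pattern detector.  For each PC it tracks
    the previous request's (tid, addr) and counts how many consecutive
    requests yield the same per-thread delta
    (address difference / tid difference)."""
    TRAIN_THRESHOLD = 2   # two consecutive equal deltas => trained

    def __init__(self):
        self.entries = {}  # pc -> [last_tid, last_addr, train, delta]

    def access(self, pc, tid, addr):
        """Train on a demand request; return a prefetch address once
        the entry is trained, else None."""
        e = self.entries.get(pc)
        if e is None:
            self.entries[pc] = [tid, addr, 0, None]
            return None
        last_tid, last_addr, train, delta = e
        if train >= self.TRAIN_THRESHOLD:
            # Pattern found: prefetch for the next thread in the stride.
            e[0], e[1] = tid, addr
            return addr + delta
        if tid == last_tid:
            e[0], e[1] = tid, addr
            return None
        d = (addr - last_addr) // (tid - last_tid)
        train = train + 1 if d == delta else 1
        self.entries[pc] = [tid, addr, train, d]
        return None

# Replaying the slide's four requests:
ip = IPTable()
for tid, addr in [(1, 200), (3, 400), (10, 1100)]:
    ip.access(0x1a, tid, addr)          # training, no prefetch yet
print(ip.access(0x1a, 1, 2100))          # 2200 = addr + stride
```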

MT-Aware Hardware Prefetcher: Decision Logic
 Priority: promotion table > stride prefetcher > IP prefetcher
  Stride behavior within a thread is more common
  Entries in the promotion table have been trained for a longer time
 Lookup proceeds over three cycles: promotion table (cycle 1), IP table (cycle 2), stride prefetcher (cycle 3)

Promotion (1st cycle) | IP table (2nd cycle) | Stride prefetcher (3rd cycle) | Action
HIT | - | not accessed | generate stride prefetch requests
HIT | MISS | not accessed | generate stride prefetch requests
MISS | HIT | not accessed | generate IP prefetch requests
MISS | MISS | accessed | generate stride prefetch requests on a hit; update the promotion table if necessary
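The action table can be encoded as a small selection function (a sketch mirroring the table rows, not the exact hardware; the names of the returned actions are made up here):

```python
def decide(promotion_hit, ip_hit, stride_hit):
    """Select a prefetch action per the decision-logic table: the
    promotion table is consulted first, then the IP table; the stride
    table is only accessed when both of those miss."""
    if promotion_hit:
        return "stride prefetch"   # from the promoted entry
    if ip_hit:
        return "IP prefetch"       # stride table not accessed
    if stride_hit:
        return "stride prefetch"   # may also update the promotion table
    return "no prefetch"
```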

Goal
 Design hardware/software prefetching mechanisms for GPGPU applications
 Step 1: a prefetcher for many-thread architectures
  Many-thread aware prefetching mechanisms (scalability, short thread lifetime)
 Step 2: a feedback mechanism to reduce negative effects
  Prefetch throttling (SCMT, amplified negative effects)

Designing GPGPU Prefetch Throttling
 GPGPU-specific metrics are needed to identify whether prefetching is effective
 An extension of feedback-directed prefetching for CPUs [Srinath07]
 Useful prefetches: accurate and timely
 Harmful prefetches: inaccurate or too early
 Some late prefetches are tolerable: multi-threading hides part of the latency, making them less harmful

Throttling Metrics
 Merged memory requests
  A new request with the same address as an existing entry is merged inside the core (in the MSHR)
  These would be late prefetches in a CPU; in a GPGPU they indicate accuracy, since massive multi-threading tolerates the lateness
 Early block eviction from the prefetch cache
  Caused by capacity misses, regardless of accuracy
 Periodic updates
  To cope with changing runtime behavior

Heuristic for Prefetch Throttling
 Throttle degree varies from 0 (prefetch all) to 5 (no prefetch); default: 2

Early eviction | Merge ratio | Action | Note
high | high | no prefetch | too aggressive
medium | - | less prefetch |
low | high | more prefetch |
low | low | no prefetch | inaccurate *
high | low | no prefetch | inaccurate

* The ideal case (accurate prefetches with perfect timing) also shows low early eviction and a low merge ratio.
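The heuristic table can be written directly as a decision function (the low/medium/high buckets are taken as inputs; the numeric thresholds behind them are not given on the slide):

```python
def throttle_action(early_eviction, merge_ratio):
    """Map the two feedback metrics ('low' / 'medium' / 'high') to a
    throttling action, mirroring the heuristic table above."""
    if early_eviction == "high" and merge_ratio == "high":
        return "no prefetch"      # too aggressive
    if early_eviction == "medium":
        return "less prefetch"
    if early_eviction == "low" and merge_ratio == "high":
        return "more prefetch"
    if early_eviction == "low" and merge_ratio == "low":
        return "no prefetch"      # likely inaccurate (ideal case aside)
    if early_eviction == "high" and merge_ratio == "low":
        return "no prefetch"      # inaccurate
    return "keep degree"
```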

Outline
 Motivation
 Step 1: Many-Thread Aware Prefetching
 Step 2: Prefetch Throttling
 Evaluation
 Conclusion

Evaluation Methodology
 MacSim simulator
  A cycle-accurate, in-house, trace-driven simulator (traces from GPUOcelot [Diamos10])
 Baseline: NVIDIA G80 architecture
  14 cores (8-wide SIMD) at 900 MHz; 16 banks / 8 channels; 1.2 GHz memory frequency; 900 MHz bus; FR-FCFS scheduling
 14 memory-intensive benchmarks
  From the CUDA SDK, Merge, Rodinia, and Parboil suites
  Stride, MP (massively parallel), and uncoalesced types
  Non-memory-intensive benchmarks are covered in the paper

Evaluation Methodology
 Prefetchers
  Stream, stride, and GHB prefetchers evaluated
  16 KB prefetch cache per core (results for other sizes are in the paper)
  Prefetch distance: 1, degree: 1 (the optimal configuration)
 Results
  Hardware prefetcher (this talk)
  Software prefetcher (in the paper)

Results: MT Hardware Prefetching
 GHB and stride prefetchers do not work on mp-type and uncoalesced-type benchmarks
 IP (inter-thread prefetching) can be effective
 Stride promotion improves the performance of a few benchmarks
 Overall: 15% speedup over the stride prefetcher

Results: MT-HWP with Throttling
 GHB with feedback (GHB+F) improves performance
 MT-HWP with throttling (MT-HWP+T) eliminates the negative effects (e.g., of the stream prefetcher)
 Overall: 15% speedup over stride prefetching with throttling
 The feedback mechanism is even more effective for software prefetching

Conclusion
 Memory latency is an important problem in GPGPUs as well
 GPGPU prefetching faces four problems: scalability, short thread lifetime, SCMT, and amplified negative effects
 Goal: design hardware/software prefetchers
  Step 1: a many-thread aware prefetcher (stride promotion, IP)
  Step 2: prefetch throttling
 The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects
 Future work
  Study other many-thread architectures
  Other programming models; architectures with caches

THANK YOU!

NVIDIA Fermi Results

Different Prefetch Cache Sizes

Software MT Prefetcher Results

Hardware Prefetcher without TID

Hardware Prefetcher with TID

Benefit Because of Few Threads?
[Chart over benchmarks: Black, Conv, Mersenne, Monte, PNS, Scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia]
 Some benchmarks have enough threads but still cannot fully hide memory latency

Inter-Thread Prefetching
 IP may not be useful in some cases
 Case 1: demand requests have already been generated
  Threads are not executed in a strict sequential order (out-of-order execution among threads)
  The redundant prefetches are merged in the memory system, so they are less harmful
 Case 2: out-of-array-range effect
  The last threads in a block generate requests for threads mapped to a different core
  Unless an inter-core merge occurs in the DRAM controller, these prefetches are useless