
1 Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc

2 Introduction (Many-Thread Aware Prefetching Mechanisms, MICRO-43)
 General-purpose GPUs (GPGPUs) are becoming popular
 High performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)
 Many cores with large-scale multi-threading and SIMD units
 CUDA programming model
 SIMT (Single Instruction, Multiple Threads)
 Hierarchy of thread groups: thread, thread block
[Figure: a core with SIMD execution units, shared memory, and a memory request buffer connected to DRAM]

3 Memory Latency Problem
 Tolerating memory latency is critical in CPUs
 Many techniques have been proposed: caches, prefetching, multi-threading, etc.
 GPGPUs have employed multi-threading
 Memory latency is critical in GPGPUs as well
 Thread-level parallelism can be limited
 By application behavior: the algorithm simply lacks parallelism
 By resource constraints: registers per thread, threads per block, shared memory usage per block

4 Multi-threading Example
 Example 1: Enough threads (4 active threads, T0-T3): when one thread stalls on a memory access, the core switches to another; the other threads' computation covers the memory latency, so there are no stall cycles.
 Example 2: Not enough threads (2 active threads, T0-T1): the two threads' computation cannot cover the memory latency, so the core stalls.
[Figure: execution timelines; C = computation, M = memory access, D = instruction dependent on memory]
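The two examples above can be captured by a back-of-the-envelope model. This is a minimal sketch, not the paper's model: the function name, the round-robin assumption, and all the cycle counts below are illustrative.

```cpp
#include <algorithm>

// Idealized round-robin model: each thread runs `compute` cycles of
// independent work before waiting on a memory request that takes
// `latency` cycles. While one thread waits, the other threads' compute
// phases can overlap the latency; the core stalls only for whatever
// latency is left uncovered.
int stall_cycles(int threads, int compute, int latency) {
    int overlap = (threads - 1) * compute;  // work available from other threads
    return std::max(0, latency - overlap);
}
```

With an 8-cycle latency and 4 compute cycles per thread, four threads fully hide the latency (4 - 1 threads x 4 cycles = 12 >= 8), while two threads leave 4 stall cycles, matching the slide's two examples qualitatively.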

5 Prefetching in GPGPUs
 Problem: when multi-threading is not enough, we need other mechanisms to hide memory latency
 Other solutions
 Caching (NVIDIA Fermi)
 Prefetching (the focus of this talk)
 Many prefetchers have been proposed for CPUs: stride, stream, Markov, CDP, GHB, helper thread, etc.
 Question: will the existing mechanisms work in GPGPUs?

6 Characteristic #1. Many Threads
 Problem #1. Training of the prefetcher
 Accesses from many threads are interleaved
 Indexing by thread ID reduces the effective prefetcher size
 Scalability
[Figure: a CPU prefetcher trained by one or two threads vs. a GPGPU prefetcher shared by many threads]

7 Characteristic #2. Data-Level Parallelism
 Problem #2. Short thread lifetime
 Due to parallelization, each thread in a parallel program is shorter
 This removes prefetching opportunities: the thread terminates before a prefetch could cover the memory latency
[Figure: in a sequential thread, a prefetch issued well before the demand is useful; in parallel threads, the lifetime between create and terminate is too short for any opportunity]

8 Characteristic #3. SIMT
 Problem #3. Single-Configuration Many-Threads (SCMT)
 Too many threads are controlled by one configuration
 Prefetch degree: the number of prefetches per trigger
 Degree 1: total prefetched data is smaller than the cache, so it fits
 Degree 2: total prefetched data is much larger than the cache, causing capacity misses
 Problem #4. Amplified negative effects
 One useless prefetch request per thread  many useless prefetches
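The SCMT pressure on the prefetch cache is simple arithmetic: one degree setting is applied to every thread, so total prefetched data scales with the thread count. The sketch below uses hypothetical sizes (a 16 KB prefetch cache with 64-byte lines, 200 active threads); the function and constants are illustrative, not from the paper.

```cpp
// With SCMT, one prefetch-degree setting applies to every active
// thread, so the prefetched footprint is degree x threads x line size.
bool prefetches_fit(int degree, int active_threads) {
    const int kLineBytes  = 64;         // assumed cache-line size
    const int kCacheBytes = 16 * 1024;  // assumed prefetch-cache size
    return degree * active_threads * kLineBytes <= kCacheBytes;
}
```

For 200 threads, degree 1 gives 12.5 KB (fits), while degree 2 gives 25 KB (capacity misses): the smallest possible degree change doubles the footprint.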

9 Goal
 Design hardware/software prefetching mechanisms for GPGPU applications
 Step 1. Prefetcher for many-thread architectures
 Many-Thread Aware Prefetching Mechanisms (scalability, short thread lifetime)
 Step 2. Feedback mechanism to reduce negative effects
 Prefetch Throttling (SCMT, amplified negative effects)

10 Goal
 Design hardware/software prefetching mechanisms for GPGPU applications
 Step 1. Prefetcher for many-thread architectures
 Many-Thread Aware Prefetching Mechanisms (scalability, short thread lifetime)
 * H/W prefetcher: in this talk; S/W prefetcher: in the paper
 Step 2. Feedback mechanism to reduce negative effects
 Prefetch Throttling (SCMT, amplified negative effects)

11 Many-Thread Aware Hardware Prefetcher
 (Conventional) stride prefetcher
 Promotion table for the stride prefetcher (addresses scalability)
 Inter-Thread Prefetcher (IP) (addresses short thread lifetime)
 Decision logic
[Figure: (PC, ADDR, TID) feed the promotion table, IP prefetcher, and stride prefetcher; stride entries are promoted into the promotion table; the decision logic produces the prefetch address]

12 Solving the Scalability Problem
 Problem #1. Training of the prefetcher (scalability)
 Stride Promotion
 Threads have similar (or even identical) access patterns (SIMT)
 Without promotion, the table is occupied by redundant per-thread entries
 With promotion, storage is managed effectively
 Training time is reduced by reusing earlier threads' information

Conventional stride table (redundant entries):
PC 0x1a, TID 1, stride 65536
PC 0x1a, TID 3, stride 65536
PC 0x1a, TID 10, stride 65536
PC 0x1a, TID 7, stride 65536

Promotion table (one shared entry):
PC 0x1a, stride 65536
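The promotion step can be sketched as follows. This is a minimal illustration under assumed details: the structure names, the promotion threshold of three agreeing threads, and the use of `std::map` in place of fixed-size hardware tables are all mine, not the paper's.

```cpp
#include <map>
#include <utility>

// Sketch of stride promotion: per-thread entries are keyed by
// (PC, TID); once kPromoteThreshold threads agree on the same stride
// for a PC, the entry is promoted to a shared table indexed by PC
// alone, and the redundant per-thread copies are dropped.
const int kPromoteThreshold = 3;  // assumed value

struct Tables {
    std::map<std::pair<unsigned, int>, long> per_thread;  // (PC, TID) -> stride
    std::map<unsigned, long> promoted;                    // PC -> stride
    std::map<std::pair<unsigned, long>, int> agree;       // (PC, stride) -> votes

    void train(unsigned pc, int tid, long stride) {
        if (promoted.count(pc)) return;  // already promoted; nothing to train
        per_thread[{pc, tid}] = stride;
        if (++agree[{pc, stride}] >= kPromoteThreshold) {
            promoted[pc] = stride;       // promote and free per-thread entries
            for (auto it = per_thread.begin(); it != per_thread.end();)
                it = (it->first.first == pc) ? per_thread.erase(it) : ++it;
        }
    }
};
```

Replaying the slide's example (TIDs 1, 3, 10, all with stride 65536 at PC 0x1a) leaves a single promoted entry and an empty per-thread table.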

13 Solving the Short Thread Lifetime Problem
 Problem #2. Short thread lifetime
 Highly parallelized code often removes prefetching opportunities

CPU loop (the loop gives each prefetch time to cover the memory latency):

    for (ii = 0; ii < 100; ++ii) {
        prefetch(A[ii + D]);
        prefetch(B[ii + D]);
        C[ii] = A[ii] + B[ii];
    }

GPGPU kernel (no loop, few instructions, no opportunity):

    // there are 100 threads
    __global__ void KernelFunction(...) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }

14 Inter-Thread Prefetching
 Instead, a thread can prefetch for other threads
 Inter-Thread Prefetching (IP)
 In CUDA, the memory index is a function of the thread ID

    // there are 100 threads
    __global__ void KernelFunction(...) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int next_tid = tid + 32;
        prefetch(aa[next_tid]);
        prefetch(bb[next_tid]);
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }

[Figure: threads T0-T31 prefetch the data that threads T32-T63 will access, and so on]

15 IP Pattern Detection in Hardware
 Detect strides across threads, then launch prefetch requests
 IP table entry: PC, Addr1, TID1, Addr2, TID2, Train bit, Delta

Example (PC 0x1a):
 Req 1 (Addr 400, TID 3): fills Addr1/TID1
 Req 2 (Addr 1100, TID 10): fills Addr2/TID2
 Req 3 (Addr 200, TID 1): all pairwise per-thread deltas (addr delta / TID delta) equal 100; a pattern is found, so the Train bit is set and Delta = 100
 Req 4 (Addr 2100): prefetch (addr + stride) = 2100 + 100
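The training walk-through above can be sketched in code. This is an illustrative model under assumed details: the function names, the map-based table, and the -1 "still training" return are mine; only the field names and the numeric example follow the slide.

```cpp
#include <map>

// Sketch of IP-table training: two sampled (address, thread-id) pairs
// per PC define a per-thread delta (addr difference / TID difference);
// a third request matching the same delta sets the Train bit, after
// which each access prefetches addr + Delta.
struct IPEntry {
    long addr1 = -1, addr2 = -1;
    int  tid1  = 0,  tid2  = 0;
    bool trained = false;
    long delta   = 0;
};

std::map<unsigned, IPEntry> ip_table;

// Returns the prefetch address once trained, or -1 while training.
long ip_access(unsigned pc, long addr, int tid) {
    IPEntry& e = ip_table[pc];
    if (e.trained) return addr + e.delta;
    if (e.addr1 < 0) { e.addr1 = addr; e.tid1 = tid; return -1; }
    if (e.addr2 < 0) { e.addr2 = addr; e.tid2 = tid; return -1; }
    // Third request: the pairwise per-thread deltas must agree.
    long d12 = (e.addr2 - e.addr1) / (e.tid2 - e.tid1);
    long d13 = (addr - e.addr1) / (tid - e.tid1);
    if (d12 == d13) { e.trained = true; e.delta = d12; }
    return -1;
}
```

Replaying the slide's four requests: (400, TID 3), (1100, TID 10), and (200, TID 1) all give a delta of 100 per thread, so the entry trains, and the fourth request at address 2100 triggers a prefetch of 2200.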

16 MT-Aware Hardware Prefetcher
 Decision logic: promotion table > stride prefetcher > IP prefetcher
 Stride behavior within a thread is more common
 Entries in the promotion table have been trained for a longer time
 Lookup order: promotion table (1st cycle), IP table (2nd cycle), stride prefetcher (3rd cycle)

Promotion | IP table | Stride prefetcher | Action
HIT | - | not accessed | generate stride prefetch requests
MISS | HIT | not accessed | generate IP prefetch requests
MISS | MISS | accessed | generate stride prefetch requests on a hit; update the promotion table if necessary
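The selection among the three tables reduces to a small priority function. A minimal sketch, following the lookup order in the slide's table; the enum and function names are mine, and the single-call form collapses what the hardware does over three cycles.

```cpp
// Sketch of the decision logic: the promotion table is consulted
// first (its entries have been trained longest), then the IP table,
// and the per-thread stride prefetcher only when both miss.
enum class Action { PromotedStride, InterThread, PerThreadStride, None };

Action decide(bool promo_hit, bool ip_hit, bool stride_hit) {
    if (promo_hit)  return Action::PromotedStride;   // stride prefetch from shared entry
    if (ip_hit)     return Action::InterThread;      // IP prefetch
    if (stride_hit) return Action::PerThreadStride;  // stride prefetch; may update promotion table
    return Action::None;
}
```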

17 Goal
 Design a hardware/software prefetcher for GPGPU applications
 Step 1. Prefetcher for many-thread architectures
 Many-Thread Aware Prefetching Mechanisms (scalability, short thread lifetime)
 Step 2. Feedback mechanism to reduce negative effects
 Prefetch Throttling (SCMT, amplified negative effects)

18 Designing GPGPU Prefetch Throttling
 We need GPGPU-specific metrics to identify whether prefetching is effective
 An extension of feedback-directed prefetching for CPUs [Srinath07]
 Useful prefetches: accurate and timely
 Harmful prefetches: inaccurate or too early
 Some late prefetches are tolerable: multi-threading hides part of the latency, so they are less harmful

19 Throttling Metrics
 Merged memory requests
 A new request with the same address as an existing entry is merged inside the core (in the MSHR)
 These would be late prefetches in a CPU; under massive multi-threading they instead indicate accuracy
 Early block eviction from the prefetch cache
 Caused by capacity misses, regardless of accuracy
 Periodic updates
 To cope with changing runtime behavior

20 Heuristic for Prefetch Throttling
 Throttle degree: varies from 0 (prefetch all) to 5 (no prefetch); default: 2

Early Eviction | Merge Ratio | Action | Note
High | High | NO prefetch | too aggressive
Medium | - | LESS prefetch |
Low | High | MORE prefetch |
Low | Low | NO prefetch | inaccurate *
High | Low | NO prefetch | inaccurate

* The ideal case (accurate with perfect timing) also shows low early eviction and a low merge ratio.
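The heuristic table maps the two metrics to a degree update. The sketch below is an assumed encoding: the numeric cutoffs separating "low", "medium", and "high" (0.1 and 0.5) are invented for illustration, and only the degree range 0-5 with default 2 comes from the slide.

```cpp
#include <algorithm>

// Degree 0 = prefetch everything, 5 = prefetch nothing; default 2.
// Thresholds splitting low/medium/high are assumed for illustration.
int update_degree(int degree, double early_evict, double merge_ratio) {
    const double kHi = 0.5, kLo = 0.1;                       // assumed cutoffs
    if (early_evict >= kHi) return 5;                        // too aggressive (or inaccurate)
    if (early_evict >= kLo) return std::min(5, degree + 1);  // medium eviction: LESS prefetch
    // Low early eviction:
    if (merge_ratio >= kHi) return std::max(0, degree - 1);  // accurate and timely: MORE prefetch
    return 5;                                                // few merges: inaccurate, NO prefetch
}
```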

21 Outline
 Motivation
 Step 1. Many-Thread Aware Prefetching
 Step 2. Prefetch Throttling
 Evaluation
 Conclusion

22 Evaluation Methodology
 MacSim simulator
 A cycle-accurate, trace-driven, in-house simulator (traces from GPUOcelot [Diamos10])
 Baseline
 14 cores (8-wide SIMD) at 900 MHz; 16 banks / 8 channels; 1.2 GHz memory frequency; 900 MHz bus; FR-FCFS scheduling
 Modeled after the NVIDIA G80 architecture
 14 memory-intensive benchmarks
 From CUDA SDK, Merge, Rodinia, and Parboil
 Stride, MP (massively parallel), and uncoalesced types
 Non-memory-intensive benchmarks (in the paper)

23 Evaluation Methodology
 Prefetching
 Stream, stride, and GHB prefetchers evaluated
 16 KB prefetch cache per core (results for other sizes are in the paper)
 Prefetch distance: 1, degree: 1 (the optimal configuration)
 Results
 Hardware prefetcher (in this talk)
 Software prefetcher (in the paper)

24 Results: MT Hardware Prefetching
 GHB and stride prefetchers do not work on mp- and uncoal-type benchmarks
 IP (Inter-Thread Prefetching) can be effective
 Stride Promotion improves the performance of a few benchmarks
[Graph: MT-HWP achieves a 15% speedup over the stride prefetcher]

25 Results: MT-HWP with Throttling
 GHB+F (with feedback) improves performance
 MT-HWP+T (with throttling) eliminates the negative effects (e.g., on stream)
 * The feedback mechanism is even more effective with software prefetching
[Graph: 15% speedup over Stride with throttling]

26 Outline
 Motivation
 Step 1. Many-Thread Aware Prefetching
 Step 2. Prefetch Throttling
 Evaluation
 Conclusion

27 Conclusion
 Memory latency is an important problem in GPGPUs as well
 GPGPU prefetching faces four problems:
 Scalability, short thread lifetime, SCMT, and amplified negative effects
 Goal: design hardware/software prefetchers
 Step 1. Many-Thread aware prefetcher (Stride Promotion, IP)
 Step 2. Prefetch throttling
 The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects
 Future work
 Study other many-thread architectures: other programming models, architectures with caches

28 THANK YOU!

29 Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc

30 NVIDIA Fermi Result

31 Different Prefetch Cache Sizes

32 Software MT Prefetcher Results

33 Hardware Prefetcher without TID

34 Hardware Prefetcher with TID

35 Benefit Because of Few Threads?
 Some benchmarks have enough threads but still cannot fully hide memory latency
[Table: active thread counts per benchmark: Black, Conv, Mersenne, Monte, PNS, Scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia]

36 Inter-Thread Prefetching
 IP may not be useful in some cases
 Case 1. The demand requests have already been generated
 Threads are not executed in a strict sequential order (out-of-order execution among threads)
 The redundant prefetches are merged with the demands in the memory system, so they are less harmful
 Case 2. Out-of-array-range effect
 The last thread in a block generates a request for a thread mapped to a different core
 Unless an inter-core merge occurs in the DRAM controller, these are useless prefetches

