
1 TLP-Aware Cache Management Policy (HPCA-18)

2 |Introduction
|Background
|TAP (TLP-Aware Cache Management Policy)
 Core sampling
 Cache block lifetime normalization
 TAP-UCP and TAP-RRIP
|Evaluation Methodology
|Evaluation Results
|Conclusion

3 |Combining GPU cores with conventional CMPs is a trend: Intel's Sandy Bridge, AMD's Fusion, NVIDIA's Project Denver.
|Various resources are shared between CPU and GPU cores: the LLC, on-chip interconnect, memory controller, and DRAM.
|The shared cache is one of the most important of these resources.

4 |Many researchers have proposed various cache mechanisms:
 Dynamic cache partitioning - Suh+ [HPCA'02], Kim+ [PACT'04], Qureshi+ [MICRO'06]
 Dynamic cache insertion policies - Qureshi+ [ISCA'07], Jaleel+ [PACT'08, ISCA'10], Wu+ [MICRO'11, MICRO'11]
 Many other mechanisms
|All of these mechanisms target CMPs.
|They may not be directly applicable to CPU-GPU heterogeneous architectures, because CPU and GPU cores have different characteristics.

5 |GPU cores: SIMD, massive threading, lack of speculative execution, ...
|GPU cores have an order of magnitude more threads: CPU cores run 1-4 way SMT, while a GPU core has 10s of active threads.
|GPU cores therefore have much higher TLP (Thread-Level Parallelism) than CPU cores.
|TLP has a significant impact on how caching affects application performance.

6 |With low TLP vs. with high TLP, the same cache behavior affects performance differently.
|Three application types, by how MPKI and CPI respond to cache size: compute-intensive or thrashing, TLP-dominant, and cache-friendly.
|The TLP-dominant type is hardly found in CPU applications.
[Figure: MPKI and CPI vs. cache size for the three application types]

7 |Cache-oriented metrics cannot differentiate the TLP-dominant and cache-friendly types: their MPKI curves are identical, but their CPI curves differ, because such metrics cannot capture the effect of TLP.
|We need to directly monitor the performance effect of caching.
[Figure: MPKI vs. cache size is identical for the two types, while CPI vs. cache size differs]

8 |Core sampling runs GPU cores with different cache policies:
 One sampled core bypasses the LLC (no L3); another uses the MRU insertion policy in the LLC (labeled POL1 and POL2).
 The remaining GPU cores are followers.
 CPU and GPU cores share the last-level cache; DRAM sits below it.
[Figure: CPUs and GPUs with private L1s over a shared last-level cache; two sampled GPU cores (bypassing LLC, MRU insertion) and the remaining follower cores]

9 |The core sampling controller measures the performance difference between the sampled policies:
 1. Collect performance samples: IPC1 from the POL1 core and IPC2 from the POL2 core.
 2. Calculate the performance delta Δ(IPC1, IPC2).
 3. Make a decision: if Δ > threshold, the application is cache-friendly (caching improves performance); otherwise it is not cache-friendly (caching does not affect performance).
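A minimal C++ sketch of that decision (the controller is hardware in the paper; the function names and the 5% threshold here are our assumptions):

```cpp
// Hypothetical sketch of the core sampling controller's decision.
// One sampled core bypasses the LLC, the other uses MRU insertion;
// the threshold value is an assumption, not the paper's.
#include <iostream>

constexpr double kDeltaThreshold = 0.05;  // assumed 5% IPC delta

// Returns true if caching improves performance (cache-friendly).
bool IsCacheFriendly(double ipc_bypass, double ipc_mru_insert) {
  // Relative performance delta between the two sampled policies.
  double delta = (ipc_mru_insert - ipc_bypass) / ipc_bypass;
  return delta > kDeltaThreshold;
}

int main() {
  // Example: MRU insertion gives 10% higher IPC than bypassing the LLC.
  std::cout << std::boolalpha << IsCacheFriendly(1.0, 1.1) << '\n';  // true
}
```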

10 |Applying core sampling to the two types:
 For a cache-friendly application, the MRU-insertion core outperforms the LLC-bypassing core, so Δ > threshold: cache-friendly.
 For a TLP-dominant application, the two sampled cores perform similarly, so Δ < threshold: not cache-friendly.
[Figure: MPKI and CPI vs. cache size for the TLP-dominant and cache-friendly types, with the bypassing-LLC and MRU-insertion sample points marked]

11 |Core sampling applies different LLC policies to different cores to identify the effect of the last-level cache.
|Main goal: finding cache-friendly GPGPU applications.
|Why core sampling is viable: GPGPU applications follow the SPMD (Single Program, Multiple Data) model.
 Each GPU core runs the same program.
 GPGPU applications usually have symmetric behavior across their running GPU cores.
 Performance variance between GPU cores is therefore very small.

12 |GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores.
|GPU cores generate an order of magnitude more cache accesses.
|GPUs have a higher tolerance for cache misses thanks to TLP: they keep generating cache accesses from different threads without stalling.
|SIMD execution: one SIMD instruction can generate multiple memory requests.

13 [Figure: on a cache miss, a single CPU thread stalls the processor and generates fewer cache accesses, while GPU threads continue without stalls and generate many more; in the slide's example, fewer than 100 accesses for 1 CPU core vs. more than 500 for 6 GPU cores]
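To make the miss-tolerance argument concrete, here is a back-of-the-envelope model (a toy formula of our own with illustrative numbers, not from the slides):

```cpp
// Illustrative latency-hiding arithmetic (our own toy model, not the
// paper's): with T ready threads, a core can overlap a miss of L cycles
// with work from other threads, so the exposed stall shrinks to about L/T.
#include <algorithm>
#include <cstdio>

double ExposedStallCycles(double miss_latency, double ready_threads) {
  return miss_latency / std::max(1.0, ready_threads);
}

int main() {
  // A single-threaded CPU core exposes the full 200-cycle miss...
  std::printf("CPU: %.0f cycles\n", ExposedStallCycles(200, 1));   // 200
  // ...while a GPU core with 32 ready threads hides most of it.
  std::printf("GPU: %.2f cycles\n", ExposedStallCycles(200, 32));  // 6.25
}
```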

14 |Why are the much more frequent accesses from GPGPU applications problematic?
 GPGPU applications severely interfere with CPU applications (e.g., under the base LRU replacement policy).
 The performance impact of cache hits differs between applications: the miss penalty is not equal for the two - PenaltyCPU(cache miss) > PenaltyGPU(cache miss).
|We have to account for the different degrees of cache accesses.
|We propose cache block lifetime normalization.

15 |A simple monitoring mechanism: monitor the cache access rate difference between CPU and GPGPU applications and periodically calculate their ratio.
 Two counters track CPU and GPU cache accesses; their ratio is stored in the XSRATIO register.
|XSRATIO serves as a hint to the proposed TAP mechanisms about the access rate difference.
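A minimal sketch of the monitoring counters (hypothetical interface; the paper specifies hardware counters, not code):

```cpp
// Hypothetical sketch of cache block lifetime normalization monitoring.
// Two counters track LLC accesses from CPU and GPU cores; at the end of
// each period their ratio is latched into XSRATIO and the counters reset.
#include <algorithm>
#include <cstdint>
#include <cstdio>

class LifetimeNormalizer {
 public:
  void OnCpuAccess() { ++cpu_accesses_; }
  void OnGpuAccess() { ++gpu_accesses_; }

  // Called once per monitoring period (e.g., every N cycles).
  void EndPeriod() {
    // XSRATIO approximates how many GPU accesses arrive per CPU access.
    xsratio_ = std::max<uint64_t>(
        1, gpu_accesses_ / std::max<uint64_t>(1, cpu_accesses_));
    cpu_accesses_ = gpu_accesses_ = 0;
  }

  uint64_t xsratio() const { return xsratio_; }

 private:
  uint64_t cpu_accesses_ = 0;
  uint64_t gpu_accesses_ = 0;
  uint64_t xsratio_ = 1;  // 1 means no normalization needed
};

int main() {
  LifetimeNormalizer n;
  for (int i = 0; i < 10; ++i) n.OnCpuAccess();
  for (int i = 0; i < 200; ++i) n.OnGpuAccess();
  n.EndPeriod();
  // 200 GPU accesses / 10 CPU accesses -> XSRATIO = 20
  std::printf("XSRATIO = %llu\n", (unsigned long long)n.xsratio());
}
```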

16 |TAP combines two components:
 Core sampling - to find cache-friendly applications.
 Lifetime normalization - to account for different degrees of cache accesses.
|Applied to two existing policies:
 UCP (Utility-based Cache Partitioning) -> TAP-UCP (in this talk).
 RRIP (Re-Reference Interval Prediction) -> TAP-RRIP (in the paper).

17 |TAP-UCP extends UCP [Qureshi and Patt, MICRO-2006], which maintains a per-application ATD (LRU stack) and way hit counters in the LLC and feeds them to a partitioning algorithm that finds the optimal partition.
|TAP adds three components:
 Core sampling: the core sampling controller sets UCP-Mask = 1 if the GPGPU application is not cache-friendly; if UCP-Mask == 1, the partitioning algorithm assigns only 1 way to the GPGPU application.
 Cache block lifetime normalization: the GPGPU application's hit counters are divided by the XSRATIO register value to balance cache space.
 The UCP-Mask register itself.
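A sketch of how the two TAP additions could slot into the partitioning step (assumptions: a greedy marginal-utility allocator stands in for UCP's lookahead algorithm, and all names are ours; a usage example follows the Case 2 slide below):

```cpp
// Hypothetical sketch of TAP-UCP's two modifications to UCP partitioning.
#include <cstdint>
#include <utility>
#include <vector>

// Assigns `num_ways` LLC ways to a CPU and a GPGPU application given
// their per-way ATD hit counters (ordered MRU to LRU).
std::pair<int, int> TapUcpPartition(std::vector<uint64_t> cpu_hits,
                                    std::vector<uint64_t> gpu_hits,
                                    uint64_t xsratio, bool ucp_mask,
                                    int num_ways) {
  // Core sampling: not cache-friendly -> give the GPGPU app only 1 way.
  if (ucp_mask) return {num_ways - 1, 1};

  // Lifetime normalization: divide GPU hit counters by XSRATIO so the
  // GPU's much higher access rate does not dominate the utility math.
  for (auto& h : gpu_hits) h /= xsratio;

  // Greedy partitioning by marginal utility: each round, the next way
  // goes to whichever application gains more hits from it.
  int cpu_ways = 0, gpu_ways = 0;
  for (int w = 0; w < num_ways; ++w) {
    uint64_t cpu_mu = cpu_ways < (int)cpu_hits.size() ? cpu_hits[cpu_ways] : 0;
    uint64_t gpu_mu = gpu_ways < (int)gpu_hits.size() ? gpu_hits[gpu_ways] : 0;
    (cpu_mu >= gpu_mu) ? ++cpu_ways : ++gpu_ways;
  }
  return {cpu_ways, gpu_ways};
}
```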

18 Case 1: not cache-friendly (Δ < threshold)
|Marginal utility: how many more hits are expected if N more ways are given to an application.
|UCP, driven by the raw CPU and GPU way hit counters, partitions the 8-way cache as 1 CPU way : 7 GPU ways.
|TAP-UCP: core sampling reports that caching has little effect on the GPGPU application's performance, so only 1 way is assigned to it - 7 CPU ways : 1 GPU way - which performs better.
[Figure: step-by-step partitioning over the CPU and GPU hit counters (MRU to LRU) under UCP vs. TAP-UCP, and the resulting performance of the 1 CPU : 7 GPU, 4 : 4, and 7 : 1 splits]

19 Case 2: cache-friendly (Δ > threshold)
|UCP again partitions as 1 CPU way : 7 GPU ways, because the GPGPU application's raw hit counters dominate the utility calculation.
|TAP-UCP divides the GPU hit counters by XSRATIO (= 2 in this example) before partitioning, yielding a balanced 4 CPU ways : 4 GPU ways, which performs better.
[Figure: the same step-by-step partitioning with normalized GPU hit counters, and the resulting performance of the 1 CPU : 7 GPU, 4 : 4, and 7 : 1 splits]
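Continuing the hypothetical TapUcpPartition sketch from above, the two cases play out like this (counter values are illustrative, not the slide's exact numbers):

```cpp
// Usage of the TapUcpPartition sketch above (illustrative counters).
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<uint64_t> cpu_hits = {16, 38, 20, 58, 32, 8, 4, 2};
  std::vector<uint64_t> gpu_hits = {32, 61, 64, 40, 10, 16, 6, 4};

  // Case 1: core sampling flags the GPGPU app as not cache-friendly.
  auto p1 = TapUcpPartition(cpu_hits, gpu_hits, /*xsratio=*/2,
                            /*ucp_mask=*/true, /*num_ways=*/8);
  std::printf("case 1: %d CPU : %d GPU\n", p1.first, p1.second);  // 7 : 1

  // Case 2: cache-friendly, so GPU hit counters are divided by XSRATIO=2
  // before partitioning, shifting ways toward the CPU (5 : 3 here, vs.
  // 4 : 4 with raw counters).
  auto p2 = TapUcpPartition(cpu_hits, gpu_hits, /*xsratio=*/2,
                            /*ucp_mask=*/false, /*num_ways=*/8);
  std::printf("case 2: %d CPU : %d GPU\n", p2.first, p2.second);
}
```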

20 |Introduction
|Background
|TAP (TLP-Aware Cache Management Policy)
 Core sampling
 Cache block lifetime normalization
|TAP-UCP
|Evaluation Methodology
|Evaluation Results
|Conclusion

21 |MacSim simulator (http://code.google.com/p/macsim): trace-driven, timing simulator, x86 + PTX instructions.
|Workloads - CPU: SPEC 2006; GPGPU: CUDA SDK, Parboil, Rodinia, ERCBench.
 1-CPU (1 CPU + 1 GPU): 152 workloads
 2-CPU (2 CPUs + 1 GPU): 150 workloads
 4-CPU (4 CPUs + 1 GPU): 75 workloads
 Stream-CPU (Stream CPU + 1 GPU): 25 workloads
|Configuration:
 CPU (1-4 cores): 4-wide out-of-order, private L1/L2.
 GPU (6 cores): 16-wide SIMD, private L1.
 LLC: shared, 32-way, 8MB (base replacement policy: LRU).
 DRAM: DDR3-1333, 41.6 GB/s bandwidth, FR-FCFS scheduling.

22 |UCP is effective with thrashing GPGPU applications, but less effective with cache-sensitive ones.
|RRIP is generally less effective on heterogeneous workloads.
[Figure: speedups over LRU on 1-CPU workloads; TAP-UCP averages 11% and TAP-RRIP 12%]

23 |Case study: Sphinx3 (CPU) + Stencil (GPGPU), where Stencil is TLP-dominant.
|MPKI: significant decrease for the CPU, considerable increase for the GPGPU; overall MPKI increases.
|Performance: huge improvement for the CPU, no change for the GPU; overall, a huge improvement.

24 |TAP mechanisms show higher benefits with more CPU applications.
[Figure: average speedups over LRU - TAP-UCP: 11% (1-CPU), 12.5% (2-CPU), 17.5% (4-CPU); TAP-RRIP: 12% (1-CPU), 14% (2-CPU), 24% (4-CPU)]

25 |CPU-GPU heterogeneous architectures are a popular trend, which makes the resource sharing problem more significant.
|We propose TAP, the first cache management proposal to consider the resource sharing problem in CPU-GPU heterogeneous architectures.
|We introduce core sampling, a technique that runs GPU cores with different policies to identify cache-friendliness.
|The two TAP mechanisms improve system performance significantly:
 TAP-UCP: 11% over LRU and 5% over UCP.
 TAP-RRIP: 12% over LRU and 9% over RRIP.

