TLP-Aware Cache Management Policy (HPCA-18)

Presentation transcript:

|Introduction |Background |TAP (TLP-Aware Cache Management Policy) Core sampling Cache block lifetime normalization TAP-UCP and TAP-RRIP |Evaluation Methodology |Evaluation Results |Conclusion TLP-Aware Cache Management Policy (HPCA-18) 2/25

|Combining GPU cores with conventional CMPs is a trend. |Various resources are shared between CPU and GPU cores: the LLC, on-chip interconnect, memory controllers, and DRAM. |The shared cache is one of the most important resources. Examples: Intel’s Sandy Bridge, AMD’s Fusion, and the Denver Project. TLP-Aware Cache Management Policy (HPCA-18) 3/25

|Many researchers have proposed various cache mechanisms. Dynamic cache partitioning  Suh+[HPCA’02], Kim+[PACT’04], Qureshi+[MICRO’06] Dynamic cache insertion policies  Qureshi+[ISCA’07], Jaleel+[PACT’08,ISCA’10], Wu+[MICRO’11,MICRO’11] Many other mechanisms |All mechanisms target CMPs. |These may not be directly applicable to CPU-GPU heterogeneous architectures because CPU and GPU cores have different characteristics. TLP-Aware Cache Management Policy (HPCA-18) 4/25

|SIMD, massive threading, lack of speculative execution, … |GPU cores have an order of magnitude more threads: CPU: 1-4 way SMT; GPU: 10s of active threads in a core. |GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores. |TLP has a significant impact on how caching affects application performance. TLP-Aware Cache Management Policy (HPCA-18) 5/25

|With low TLP vs. with high TLP. [Figure: MPKI and CPI vs. cache size for three application types: compute-intensive or thrashing, TLP-dominant, and cache-friendly. The TLP-dominant type, where more cache reduces MPKI but not CPI, is hardly found in CPU applications] TLP-Aware Cache Management Policy (HPCA-18) 6/25

|Cache-oriented metrics cannot differentiate the two types; they are unable to recognize the effect of TLP. |We need to directly monitor the performance effect of caching. [Figure: for the TLP-dominant and cache-friendly types, the MPKI-vs.-cache-size curves are identical, but the CPI-vs.-cache-size curves are different] TLP-Aware Cache Management Policy (HPCA-18) 7/25

|Samples GPU cores with different cache policies. [Diagram: CPU and GPU cores with private L1s share the last-level cache and DRAM; two sampled GPU cores, POL1 and POL2, use different policies (one bypasses the LLC, i.e., no L3; the other uses the MRU insertion policy in the LLC), and the remaining GPU cores are followers] TLP-Aware Cache Management Policy (HPCA-18) 8/25

|Measures performance differences between the sampled cores. The core sampling controller (1) collects performance samples IPC1 and IPC2 from the two sampled cores (one bypassing the LLC, the other using MRU insertion in the LLC); (2) calculates the performance delta ∆(IPC1, IPC2); and (3) makes a decision: if ∆ > threshold, the application is cache-friendly (caching improves performance); otherwise it is not cache-friendly (caching does not affect performance). TLP-Aware Cache Management Policy (HPCA-18) 9/25
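A minimal sketch of this decision step in C++; the structure, function names, and the 5% threshold are illustrative assumptions, not values from the paper:

```cpp
#include <cmath>

// Hypothetical core-sampling controller: compares the IPC of the two sampled
// GPU cores (one bypassing the LLC, the other inserting at MRU) and decides
// whether the running GPGPU application is cache-friendly.
struct CoreSamplingController {
    double threshold = 0.05;  // assumed 5% relative delta; a tunable design parameter

    // ipc_bypass: IPC of the sampled core that bypasses the LLC
    // ipc_mru:    IPC of the sampled core that uses MRU insertion in the LLC
    bool isCacheFriendly(double ipc_bypass, double ipc_mru) const {
        // Relative performance delta between the two sampled policies
        // (guard against a zero IPC sample).
        double delta = std::fabs(ipc_mru - ipc_bypass) / std::fmax(ipc_bypass, 1e-9);
        // A noticeable gap means caching affects performance: cache-friendly.
        return delta > threshold;
    }
};
```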

[Figure: core sampling applied to the two application types. For the TLP-dominant type, the LLC-bypassing and MRU-insertion sampled cores perform about the same, so ∆ < threshold: not cache-friendly. For the cache-friendly type, the two sampled cores show a clear performance gap, so ∆ > threshold: cache-friendly] TLP-Aware Cache Management Policy (HPCA-18) 10/25

|Core sampling applies different LLC policies to different cores to identify the effect of the last-level cache. |Main goal: finding cache-friendly GPGPU applications. |Why core sampling is viable: under the SPMD (Single Program, Multiple Data) model, each GPU core runs the same program; GPGPU applications usually behave symmetrically across their GPU cores; and the performance variance between GPU cores is very small. TLP-Aware Cache Management Policy (HPCA-18) 11/25

|GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores. |GPU cores generate an order of magnitude more cache accesses. |GPUs have higher tolerance for cache misses due to TLP: they generate cache accesses from different threads without stalling. |SIMD execution: one SIMD instruction can generate multiple memory requests. TLP-Aware Cache Management Policy (HPCA-18) 12/25

[Figure: on a cache miss, a CPU thread stalls the processor (stalled, fewer cache accesses), while GPU threads keep executing other threads (no stalls, many more cache accesses); CPU, 1-core: < 100 vs. GPU, 6-core: > 500] TLP-Aware Cache Management Policy (HPCA-18) 13/25

|Why are the much more frequent accesses from GPGPU applications problematic? They cause severe interference by GPGPU applications (e.g., under the base LRU replacement policy), and the performance impact of cache hits differs between applications: is Perf. Penalty_CPU(cache miss) = Perf. Penalty_GPU(cache miss)? In fact, Perf. Penalty_CPU(cache miss) > Perf. Penalty_GPU(cache miss). |We have to consider the different degree of cache accesses. |We propose Cache Block Lifetime Normalization. TLP-Aware Cache Management Policy (HPCA-18) 14/25

|Simple monitoring mechanism: monitor the cache access rate difference between CPU and GPGPU applications and periodically calculate the ratio. |This provides hints for the proposed TAP mechanisms regarding access rate differences: XSRATIO = GPU $ access counter / CPU $ access counter. TLP-Aware Cache Management Policy (HPCA-18) 15/25
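A minimal sketch of this monitoring mechanism, assuming two per-period access counters and an XSRATIO recomputed as their ratio at the end of each period; the counter-reset behavior and all identifiers are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical cache block lifetime normalization monitor: counts LLC accesses
// from CPU and GPU cores and periodically recomputes their ratio (XSRATIO),
// which the TAP mechanisms use as a hint about the access-rate difference.
struct LifetimeNormalizer {
    uint64_t cpu_accesses = 0;
    uint64_t gpu_accesses = 0;
    uint32_t xsratio = 1;  // GPU-to-CPU cache access-rate ratio

    void onLLCAccess(bool from_gpu) { (from_gpu ? gpu_accesses : cpu_accesses)++; }

    void endOfPeriod() {
        // Keep the ratio at least 1 and avoid division by zero.
        xsratio = static_cast<uint32_t>(
            std::max<uint64_t>(1, gpu_accesses / std::max<uint64_t>(1, cpu_accesses)));
        cpu_accesses = gpu_accesses = 0;  // start a new monitoring period
    }
};
```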

[Diagram: TAP combines core sampling (to find cache-friendly applications) with cache block lifetime normalization (to account for the different degree of cache accesses). Applied to UCP (Utility-based Cache Partitioning) it yields TAP-UCP, covered in this talk; applied to RRIP (Re-Reference Interval Prediction) it yields TAP-RRIP, covered in the paper] TLP-Aware Cache Management Policy (HPCA-18) 16/25

|TAP-UCP builds on UCP [Qureshi and Patt, MICRO-2006], which keeps a per-application ATD (LRU stack) and way hit counters and feeds them to a partitioning algorithm that computes the optimal partition. |TAP adds two inputs: a UCP-Mask register from the core sampling controller (UCP-Mask = 1 if the GPGPU application is not cache-friendly, in which case the partitioning algorithm assigns only 1 way to the GPGPU application) and the XSRATIO register from cache block lifetime normalization (the GPU way hit counters are divided by the XSRATIO value to balance cache space before partitioning). TLP-Aware Cache Management Policy (HPCA-18) 17/25
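A simplified sketch of how the two TAP adjustments could wrap a UCP-style partitioner for one CPU and one GPGPU application sharing the LLC; it uses a plain greedy marginal-utility assignment rather than UCP's actual lookahead algorithm, and every identifier here is an illustrative assumption:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// hits[i]: hits an application would receive in LRU-stack position i+1
// (from the per-application ATD way-hit counters).
struct WayPartition { int cpu_ways; int gpu_ways; };

WayPartition tapUcpPartition(std::vector<uint64_t> cpu_hits,
                             std::vector<uint64_t> gpu_hits,
                             int total_ways,
                             bool gpu_cache_friendly,  // from core sampling (UCP-Mask == 0)
                             uint32_t xsratio) {       // from lifetime normalization
    if (!gpu_cache_friendly) {
        // Core sampling says caching barely helps the GPGPU application:
        // give it a single way and the rest to the CPU application.
        return {total_ways - 1, 1};
    }
    // Balance the GPU's much higher access rate by scaling its hit counters.
    for (auto& h : gpu_hits) h /= std::max<uint32_t>(1u, xsratio);

    // Greedy assignment: each remaining way goes to whichever application
    // gains more hits from its next way (marginal utility); one way each minimum.
    WayPartition p{1, 1};
    for (int w = 2; w < total_ways; ++w) {
        uint64_t cpu_gain = (p.cpu_ways < (int)cpu_hits.size()) ? cpu_hits[p.cpu_ways] : 0;
        uint64_t gpu_gain = (p.gpu_ways < (int)gpu_hits.size()) ? gpu_hits[p.gpu_ways] : 0;
        if (cpu_gain >= gpu_gain) ++p.cpu_ways; else ++p.gpu_ways;
    }
    return p;
}
```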

|Case 1: non-cache-friendly GPGPU application (∆ < threshold). |UCP uses marginal utility (how many more hits are expected if N ways are given to an application) over the CPU and GPU way hit counters (MRU to LRU) and arrives at a final partition of 1 CPU way and 7 GPU ways. |TAP-UCP knows from core sampling that caching has little effect on the GPGPU application's performance, so it assigns only 1 way to the GPGPU application: a final partition of 7 CPU ways and 1 GPU way. [Figure: performance for partitions ranging from 1 CPU : 7 GPU ways to 4 CPU : 4 GPU to 7 CPU : 1 GPU] TLP-Aware Cache Management Policy (HPCA-18) 18/25

|Case 2: cache-friendly GPGPU application (∆ > threshold). |UCP again arrives at a final partition of 1 CPU way and 7 GPU ways. |TAP-UCP first divides the GPU hit counters by XSRATIO (= 2 in this example), which leads to a final partition of 4 CPU ways and 4 GPU ways. [Figure: performance for partitions ranging from 1 CPU : 7 GPU ways to 4 CPU : 4 GPU to 7 CPU : 1 GPU] TLP-Aware Cache Management Policy (HPCA-18) 19/25

|Introduction |Background |TAP (TLP-Aware Cache Management Policy) Core sampling Cache block lifetime normalization |TAP-UCP |Evaluation Methodology |Evaluation Results |Conclusion TLP-Aware Cache Management Policy (HPCA-18) 20/25

|MacSim simulator [GT]: trace-driven, timing simulator, x86 + PTX instructions |Workloads: CPU: SPEC CPU2006; GPGPU: CUDA SDK, Parboil, Rodinia, ERCBench. Configurations: 1-CPU (1 CPU + 1 GPU), 2-CPU (2 CPUs + 1 GPU), 4-CPU (4 CPUs + 1 GPU), Stream-CPU (Stream CPU + 1 GPU) |System: CPU (1-4 cores): OOO, 4-wide, private L1/L2; GPU (6 cores): 16-wide SIMD, private L1; LLC: 32-way, 8MB, shared (base: LRU); DRAM: DDR3-1333, 41.6 GB/s bandwidth, FR-FCFS TLP-Aware Cache Management Policy (HPCA-18) 21/25

|UCP is effective with thrashing GPGPU applications, but less effective with cache-sensitive GPGPU applications. |RRIP is generally less effective on heterogeneous workloads. [Results figure: TAP-UCP improves performance by 11% and TAP-RRIP by 12% over LRU] TLP-Aware Cache Management Policy (HPCA-18) 22/25

|Case study: Sphinx3 (CPU) + Stencil (GPGPU), where Stencil is TLP-dominant. |MPKI: the CPU's decreases significantly, the GPGPU's increases considerably, and the overall MPKI increases. |Performance: the CPU improves hugely, the GPU is unchanged, and overall performance improves hugely. TLP-Aware Cache Management Policy (HPCA-18) 23/25

|TAP mechanisms show higher benefits with more CPU applications. [Results figure: improvements of 11%, 12.5%, and 17.5% for TAP-UCP and 12%, 14%, and 24% for TAP-RRIP on the 1-CPU, 2-CPU, and 4-CPU workloads, respectively] TLP-Aware Cache Management Policy (HPCA-18) 24/25

|The CPU-GPU heterogeneous architecture is a popular trend, which makes the resource sharing problem more significant. |We propose TAP for CPU-GPU heterogeneous architectures, the first proposal to consider this resource sharing problem. |We introduce a core sampling technique that samples GPU cores with different policies to identify cache-friendliness. |The two TAP mechanisms improve system performance significantly: TAP-UCP by 11% over LRU and 5% over UCP, TAP-RRIP by 12% over LRU and 9% over RRIP. TLP-Aware Cache Management Policy (HPCA-18) 25/25