Dynamic Cache Clustering for Chip Multiprocessors

Dynamic Cache Clustering for Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Dept. of Computer Science University of Pittsburgh

Tiled CMP Architectures Tiled CMP architectures have recently been advocated as a scalable design. They replicate identical building blocks (tiles) and connect them with a switched network-on-chip (NoC). A tile typically incorporates a private L1 cache and an L2 cache bank. There are two traditional practices for CMP caches: one-bank-to-one-core assignment (the private scheme) and one-bank-to-all-cores assignment (the shared scheme).

Private and Shared Schemes Private scheme: a core maps and locates a cache block, B, to and from its local L2 bank. Coherence maintenance is required at both the L1 and the L2 levels. Data reads are very fast, but the cache miss rate can be high. Shared scheme: a core maps and locates a cache block, B, to and from a target tile referred to as the static home tile (SHT) of B, selected by the home select (HS) bits of B's physical address. Coherence is required only at the L1 level. The cache miss rate is low, but data reads are slow (NUCA design).

The Degree of Sharing Sharing Degree (SD), or the number of cores that share a given pool of cache banks, can be set anywhere between the private and shared designs: 1-1 assignment (private design), 2-2, 4-4, and 8-8 assignments, up to 16-16 assignment (shared design).

Static Designs' Principal Deficiency The aforementioned static designs are subject to a principal deficiency: in reality, computer applications exhibit different cache demands, and a single application may demonstrate different phases corresponding to distinct code regions invoked during its execution. Program phases can be characterized by different L2 cache miss rates and durations. All static designs entail a fixed partitioning of the available cache capacity and do not tolerate the variability among working sets, or among the phases of a single working set.

Our work We dynamically monitor the behavior of the programs running on the different CMP cores, adapt to each program's cache demand by offering fine-grained bank-to-core assignments (a technique we refer to as cache clustering), and introduce novel mapping and location strategies to manage dynamic cache designs in tiled CMPs. (CD = cluster dimension.)

Talk roadmap The proposed dynamic cache clustering (DCC) scheme. Performance metrics. DCC algorithm. DCC mapping strategy. DCC location strategy. Quantitative evaluation. Concluding remarks.

The Proposed Scheme We denote the L2 cache banks that can be assigned to a specific core i as i's cache cluster, and the number of banks that the cache cluster of core i consists of as the cache cluster dimension of core i (CDi). We propose a dynamic cache clustering (DCC) scheme where each core initially starts up with a specific cache cluster, and after every time period T (a potential re-clustering point) the cache cluster of a core is dynamically contracted, expanded, or kept intact, depending on the cache demand experienced by that core.

Performance Metrics The basic trade-offs of varying the dimension of a cache cluster are the average L2 access latency and the L2 miss rate. Average L2 access latency (AAL) increases strictly with the cluster dimension, while the L2 miss rate (MR) is inversely proportional to it. Improving either AAL or MR alone doesn't necessarily correlate with an improvement in overall system performance; improving a metric that combines them typically does.
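
For reference, the combined metric referred to above is the per-core average memory access time (AMAT). The slide does not show the exact formula, so the standard textbook decomposition below is an assumption; it is consistent with the AAL/MR trade-off just described:

$$\mathrm{AMAT}_i = \mathrm{HitTime}_{L1} + \mathrm{MissRate}_{L1}\cdot\left(\mathrm{AAL}_i + \mathrm{MR}_i\cdot\mathrm{MemoryLatency}\right)$$

Expanding core i's cluster raises AAL_i but lowers MR_i, so AMAT_i captures the net effect of both.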

DCC Algorithm The AMAT metric can be utilized to judiciously gauge the benefit of varying the cache cluster dimension of a certain core i. At every potential re-clustering point, the AMAT experienced by a process P running on core i is evaluated and stored (AMATi,current), and the difference from the previously stored value (AMATi,previous) is computed. Assume a contraction action was taken previously: a positive difference (AMATi has increased) means the contraction hurt, so we reverse it and expand P's cluster; a negative difference (AMATi has decreased) means it helped, so we contract P's cluster a step further, predicting more benefit. The symmetric rules apply after an expansion.
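
As an illustration, here is a minimal Python sketch of this feedback loop. The repeat-or-reverse policy and all names are illustrative assumptions, not the paper's exact controller:

```python
# Minimal sketch of one DCC re-clustering decision, taken per core at
# every potential re-clustering point (period T). Cluster dimensions are
# assumed to move along {1, 2, 4, 8, 16}.
DIMS = [1, 2, 4, 8, 16]

def recluster(cd, amat_prev, amat_cur, last_action):
    """Repeat the last action if AMAT improved, otherwise reverse it.
    Returns the new cluster dimension and the action taken."""
    improved = amat_cur < amat_prev
    if last_action == "contract":
        action = "contract" if improved else "expand"
    else:  # last action was "expand"
        action = "expand" if improved else "contract"
    i = DIMS.index(cd)
    if action == "contract" and i > 0:
        i -= 1
    elif action == "expand" and i < len(DIMS) - 1:
        i += 1
    return DIMS[i], action

# A contraction was taken previously and AMAT rose -> expand back.
print(recluster(4, amat_prev=9.5, amat_cur=10.2, last_action="contract"))
# (8, 'expand')
```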

DCC Mapping Strategy Varying the cache cluster dimension (CD) of each core over time requires a function that maps cache blocks to cache clusters exactly as required. Assume that a core i requests a cache block B. If CDi < 16 (for a 16-tile CMP), B is mapped to a dynamic home tile (DHT) that may differ from the static home tile (SHT) of B; the DHT of B depends on CDi. With CDi smaller than 16, only a subset of the bits of the HS field of B's physical address needs to be utilized to determine B's DHT (i.e., 3 HS bits are used if CDi = 8). We developed the following generic function to determine the DHT of block B, where ID is the binary representation of core i, MB are masking bits determined by CDi, and MB̄ is their complement: DHT = (HS & MB) + (ID & MB̄).

DCC Mapping Strategy: A Working Example Assume core 5 (ID = 0101) requests cache block B with HS = 1111. CD = 16: DHT = (1111 & 1111) + (0101 & 0000) = 1111. CD = 8: DHT = (1111 & 0111) + (0101 & 1000) = 0111. CD = 4: DHT = (1111 & 0101) + (0101 & 1010) = 0101. CD = 2: DHT = (1111 & 0001) + (0101 & 1110) = 0101. CD = 1: DHT = (1111 & 0000) + (0101 & 1111) = 0101.
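
A minimal Python sketch of the mapping function, assuming a 16-tile CMP with 4-bit tile IDs; the per-CD masking bits are read directly off the example above:

```python
# DHT = (HS & MB) + (ID & ~MB). Masking bits (MB) per cluster dimension,
# as used in the worked example above.
MB = {1: 0b0000, 2: 0b0001, 4: 0b0101, 8: 0b0111, 16: 0b1111}

def dht(hs, core_id, cd):
    """Dynamic home tile for a block with home-select bits `hs`,
    requested by `core_id` under cluster dimension `cd`."""
    mb = MB[cd]
    # The two masked terms select disjoint bits, so '+' acts as bitwise OR.
    return (hs & mb) + (core_id & ~mb & 0b1111)

# Reproduce the example: core 5 (0101) requests block B with HS = 1111.
for cd in (16, 8, 4, 2, 1):
    print(f"CD = {cd:2d}: DHT = {dht(0b1111, 0b0101, cd):04b}")
```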

DCC Location Strategy The generic mapping function we defined can't be used straightforwardly to locate cache blocks. Assume a cache block B with HS = 1111 is requested by core 0 (ID = 0000) with CD0 = 8: DHT = (1111 & 0111) + (0000 & 1000) = 0111. Assume the cache cluster of core 0 is afterward contracted to CD0 = 4 and B is requested again: DHT = (1111 & 0101) + (0000 & 1010) = 0101. B still resides at tile 7, but the lookup is now directed to tile 5 and misses.

DCC Location Strategy Solution 1: re-copy all blocks upon a re-clustering action. Very expensive. Solution 2: after missing at B's current DHT (tile 5: DHT = (1111 & 0101) + (0000 & 1010) = 0101), access B's SHT (tile 15) to locate B at tile 7. Slow: inter-tile communication between tiles 0, 5, 15, 7, and lastly 0. Solution 3: send the L2 request directly to B's SHT instead of sending it first to B's DHT and then possibly to B's SHT. Still slow: inter-tile communication between tiles 0, 15, 7, and lastly 0.

DCC Location Strategy Solution 4: send simultaneous requests to only the tiles that are potential DHTs of B. The potential DHTs of B can be easily determined by varying MB and MB̄ of the DCC mapping function over the range of CDs 1, 2, 4, 8, and 16. The number of messages per request is at most log2(number of tiles) + 1 and at least 1, with an average of 1 + (1/2)·log2(n) (i.e., 3 messages per request for 16 tiles).
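
Building on the hypothetical dht helper sketched earlier, the potential DHTs can be enumerated by evaluating the mapping function for every CD; duplicates collapse into a set, which is why the message count averages well below the five candidate dimensions:

```python
def potential_dhts(hs, core_id):
    """Tiles that could be B's DHT under some cluster dimension;
    simultaneous location requests are sent only to these tiles."""
    return {dht(hs, core_id, cd) for cd in (1, 2, 4, 8, 16)}

# Core 0 locating a block with HS = 1111 probes tiles 0, 1, 5, 7, and 15.
print(sorted(potential_dhts(0b1111, 0b0000)))  # [0, 1, 5, 7, 15]
```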

Quantitative Evaluation: Methodology System parameters: we simulate a 16-tile CMP on Simics 3.0.29 (Solaris OS). Cache line size: 64 bytes. L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle. L2 size/ways/latency: 512KB per bank / 16 ways / 12 cycles. Latency per hop: 3 cycles. Memory latency: 300 cycles. L1 and L2 replacement policy: LRU. Benchmarks: SPECjbb, OCEAN, BARNES, LU, RADIX, FFT, MIX1 (16 copies of HMMER), MIX2 (16 copies of SPHINX), and MIX3 (BARNES, LU, MILC, MCF, BZIP2, and HMMER; 2 threads/copies each).

Comparing With Static Schemes We first study the average L1 miss time (AMT) across FS1, FS2, FS4, FS8, FS16, and DCC. DCC outperforms FS16, FS8, FS4, FS2, and FS1 by averages of 6.5%, 8.6%, 10.1%, 10%, and 4.5%, respectively, and by as much as 21.3%.

Comparing With Static Schemes We next study the L2 miss rate across FS1, FS2, FS4, FS8, FS16, and DCC. No single static scheme provides the best miss rate for all the benchmarks, whereas DCC always provides miss rates comparable to the best static alternative.

Comparing With Static Schemes We then study execution time across FS1, FS2, FS4, FS8, FS16, and DCC. The superiority of DCC in AMT translates to better overall performance: DCC always performs comparably to the best static alternative.

Sensitivity Study We also study the sensitivity of DCC to different {T, Tl, Tg} values. DCC is not very sensitive to the values of these parameters; overall, it performs slightly better with T = 100K than with T = 300K.

Comparing With Cooperative Caching Finally, we compare DCC against the cooperative caching (CC) scheme, which is based on FS1 (the private scheme). DCC outperforms CC by an average of 1.59%. The basic problem with CC is that it spills blocks without knowing whether spilling helps or hurts cache performance (a problem addressed recently in HPCA '09).

Concluding Remarks This paper proposes DCC, a distributed cache management scheme for large-scale chip multiprocessors. Contrary to static designs, DCC adapts to working-set irregularities. We propose generic mapping and location strategies that can be utilized for both static designs (with different sharing degrees) and dynamic designs in tiled CMPs. The proposed DCC location strategy can be further improved (by reducing the number of messages per request) by maintaining a small history of each cluster's expansions and contractions. For instance, with an activity chain of 16-8-4, we can predict that a requested block cannot reside at a DHT corresponding to CD = 1 or 2, and is more likely to reside at the DHTs corresponding to CD = 4 and 8 than at the DHT corresponding to CD = 16; see the sketch below.
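
A hedged sketch of that history-based improvement, reusing the hypothetical dht helper from the mapping slide: probe only the DHTs of dimensions recorded in the activity chain, most recent first.

```python
def likely_dhts(hs, core_id, activity_chain):
    """Probe order for a block given the recorded chain of cluster
    dimensions, e.g. [16, 8, 4]: most recent CDs first; DHTs that
    correspond only to unseen CDs are never probed."""
    order = []
    for cd in reversed(activity_chain):
        tile = dht(hs, core_id, cd)
        if tile not in order:
            order.append(tile)
    return order

# With activity chain 16-8-4, the DHTs for CD = 1 or 2 are skipped.
print(likely_dhts(0b1111, 0b0000, [16, 8, 4]))  # [5, 7, 15]
```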

Thank you! Dynamic Cache Clustering for Chip Multiprocessors, M. Hammoud, S. Cho, and R. Melhem, Dept. of Computer Science, University of Pittsburgh.