Dynamic Cache Clustering for Chip Multiprocessors


1 Dynamic Cache Clustering for Chip Multiprocessors
Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Dept. of Computer Science, University of Pittsburgh

2 Tiled CMP Architectures
Tiled CMP architectures have recently been advocated as a scalable design. They replicate identical building blocks (tiles) and connect them with a switched network-on-chip (NoC). A tile typically incorporates a private L1 cache and an L2 cache bank. There are two traditional practices for CMP caches: one-bank-to-one-core assignment (the private scheme), and one-bank-to-all-cores assignment (the shared scheme).

3 Private and Shared Schemes
Private scheme: a core maps and locates a cache block, B, to and from its local L2 bank. Coherence maintenance is required at both the L1 and L2 levels. Data is read very fast, but the cache miss rate can be high. Shared scheme: a core maps and locates a cache block, B, to and from a target tile, referred to as the static home tile (SHT) of B, selected by some bits (the home select, or HS, bits) of B's physical address. Coherence is required only at the L1 level. The cache miss rate is low, but data reads are slow (NUCA design).
[Figure: layout of B's physical address, showing the HS field used to select the home tile.]
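To make the two placement policies concrete, here is a minimal Python sketch, assuming a 16-tile CMP and a 4-bit HS field; the exact bit position of the HS field within the physical address is an assumption, since the slide's address-layout figure is not reproduced in this transcript.

```python
# Minimal sketch of block placement under the two static schemes,
# assuming a 16-tile CMP with 4 HS bits. The position of the HS field
# (HS_SHIFT) is an assumption, not taken from the slides.

NUM_TILES = 16
HS_SHIFT = 12              # assumed: HS bits sit above the set-index bits
HS_MASK = NUM_TILES - 1    # 4 home-select bits -> 0b1111

def private_home(core_id: int, block_addr: int) -> int:
    """Private scheme: a block is mapped to the requesting core's own bank."""
    return core_id

def shared_home(core_id: int, block_addr: int) -> int:
    """Shared scheme: the HS bits of the physical address pick the static
    home tile (SHT), regardless of which core makes the request."""
    return (block_addr >> HS_SHIFT) & HS_MASK
```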

4 The Degree of Sharing
The sharing degree (SD), or the number of cores that share a given pool of cache banks, can be set anywhere between the shared and the private designs: from 1-1 assignment (the private design), through 2-2, 4-4, and 8-8 assignments, up to 16-16 assignment (the shared design).

5 Static Designs’ Principal Deficiency
The aforementioned static designs are subject to a principal deficiency: they all entail static partitioning of the available cache capacity and don't tolerate variability among working sets, or among the phases of a single working set. In reality, computer applications exhibit different cache demands. A single application may demonstrate different phases, corresponding to distinct code regions invoked during its execution, and program phases can be characterized by different L2 cache miss rates and durations.

6 Our work Dynamically monitor the behaviors of the programs running on different CMP cores. Adapt to each program cache demand by offering a fine-grained banks-to-cores assignments (a technique we refer to as cache clustering). Introduce novel mapping and location strategies to manage dynamic cache designs in tiled CMPs. (CD = Cluster Dimension)

7 Talk roadmap The proposed dynamic cache clustering (DCC) scheme.
Performance metrics. DCC algorithm. DCC mapping strategy. DCC location strategy. Quantitative evaluation. Concluding remarks.

8 The Proposed Scheme We denote the L2 cache banks that can be assigned to a specific core, i, as i’s cache cluster. We further denote the number of banks that the cache cluster of core i consists of as cache cluster dimension of core i (CDi). We propose a dynamic cache clustering (DCC) scheme where: Each core is initially started up with a specific cache cluster. After every period time T (potential re-clustering point), the cache cluster of a core is dynamically contracted, expanded, or kept intact, depending on the cache demand experienced by that core.

9 Performance Metrics The basic trade-offs of varying the dimension of a cache cluster are the average L2 access latency and the L2 miss rate. Average L2 access latency (AAL) increases strictly with the cluster dimension. L2 miss rate (MR) is inversely proportional to the cluster dimension. Improving either AAL or MR doesn’t necessarily correlate to an improvement in the overall system performance. Improving one of the following metrics typically translates to a better system performance.

10 DCC Algorithm The AMAT metric can be utilized to judiciously gauge the benefit of varying the cache cluster dimension of a certain core i. At every potential re-clustering point: The AMATi (AMATi current) experienced by a process P running on core i is evaluated and stored. AMATi current is subtracted from the previously stored AMATi (AMATi previous). Assume a contraction action has been taken previously: A positive subtraction value indicates that AMATi has increased. Hence, we retard and expand P’s cluster. A negative value indicates that AMATi has decreased. We hence contract P’s cluster a step further predicting more benefit.

11 DCC Mapping Strategy Varying a cache cluster dimension (CD) of each core over time requires a function that maps cache blocks to cache clusters exactly as required. Assume that a core i requests a cache block B. If CDi < 16 (for instance), B is mapped to a dynamic home tile (DHT) different than the static home tile (SHT) of B. DHT of B depends on CDi. With CDi smaller than 16 only a subset of bits from the HS field of B’s physical address needs to be utilized to determine B’s DHT (i.e., 3 bits from HS are used if CDi = 8). We developed the following generic function to determine the DHT of block B (ID is the binary representation of core i and MB are masking bits):

12 DCC Mapping Strategy: A Working Example
Assume core 5 (ID = 0101) requests cache block B with HS = 1111.
CD = 16: DHT = (1111 & 1111) + (0101 & 0000) = 1111
CD = 8: DHT = (1111 & 0111) + (0101 & 1000) = 0111
CD = 4: DHT = (1111 & 0101) + (0101 & 1010) = 0101
CD = 2: DHT = (1111 & 0001) + (0101 & 1110) = 0101
CD = 1: DHT = (1111 & 0000) + (0101 & 1111) = 0101
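A runnable sketch of the mapping function, with the per-dimension masking bits (MB) read directly off the five computations above; it reproduces the example's results.

```python
# Sketch of the generic DCC mapping function for a 4x4 mesh. The MB
# values per cluster dimension are taken from the worked example above;
# log2(CD) bits of HS survive the mask, and the rest come from ID.

MB = {16: 0b1111, 8: 0b0111, 4: 0b0101, 2: 0b0001, 1: 0b0000}

def dht(hs: int, core_id: int, cd: int) -> int:
    """DHT = (HS & MB) + (ID & MBbar), with MBbar the 4-bit complement."""
    mb = MB[cd]
    return (hs & mb) + (core_id & (~mb & 0b1111))

# Reproduces the slide's example: core 5 (ID = 0101) requests B, HS = 1111.
for cd in (16, 8, 4, 2, 1):
    print(cd, format(dht(0b1111, 0b0101, cd), '04b'))
# -> 1111, 0111, 0101, 0101, 0101
```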

13 DCC Location Strategy
The generic mapping function we defined can't be used straightforwardly to locate cache blocks. Assume a cache block B with HS = 1111 is requested by core 0 (ID = 0000) while CD0 = 8:
DHT = (1111 & 0111) + (0000 & 1000) = 0111
Assume the cache cluster of core 0 is then contracted (to CD0 = 4) and B is afterward requested again by core 0:
DHT = (1111 & 0101) + (0000 & 1010) = 0101
The lookup now probes tile 0101 and misses, although B still resides at tile 0111.

14 DCC Location Strategy
Solution 1: re-copy all blocks upon a re-clustering action. Very expensive.
Solution 2: after missing at B's DHT (tile 5 in the example above), access B's SHT (tile 15) to locate B at tile 7. Slow: inter-tile communications between tiles 0, 5, 15, 7, and lastly 0.
Solution 3: send the L2 request directly to B's SHT instead of sending it first to B's DHT and then possibly to B's SHT. Still slow: inter-tile communications between tiles 0, 15, 7, and lastly 0.

15 DCC Location Strategy
Solution 4: send simultaneous requests to only the tiles that are potential DHTs of B. The potential DHTs of B can be easily determined by varying MB (and MBbar) of the DCC mapping function over the range of CDs 1, 2, 4, 8, and 16.
Messages per request: upper bound = log2(NumberOfTiles) + 1; lower bound = 1; average = 1 + (1/2) log2(n) (i.e., for 16 tiles, 3 messages per request).
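A small sketch of Solution 4, repeating the mapping function from the previous sketch so it runs standalone; for core 5 and HS = 1111 it finds exactly three distinct potential DHTs, matching the stated average for 16 tiles.

```python
# Sketch of Solution 4: probe only the tiles that are potential DHTs of
# B, found by evaluating the mapping function for every possible CD.

MB = {16: 0b1111, 8: 0b0111, 4: 0b0101, 2: 0b0001, 1: 0b0000}

def dht(hs: int, core_id: int, cd: int) -> int:
    return (hs & MB[cd]) + (core_id & (~MB[cd] & 0b1111))

def potential_dhts(hs: int, core_id: int) -> set:
    """Distinct tiles that may hold B under some CD in {1, 2, 4, 8, 16};
    one probe message goes to each."""
    return {dht(hs, core_id, cd) for cd in (16, 8, 4, 2, 1)}

# Core 5, HS = 1111: three distinct tiles, matching the stated average
# of 1 + (1/2) * log2(16) = 3 messages per request.
print(sorted(format(t, '04b') for t in potential_dhts(0b1111, 0b0101)))
# -> ['0101', '0111', '1111']
```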

16 Quantitative Evaluation: Methodology
System parameters: we simulate a 16-tile CMP.
Simulator: Simics (Solaris OS).
Cache line size: 64 bytes.
L1 I/D size/associativity/latency: 16KB / 2 ways / 1 cycle.
L2 size/associativity/latency: 512KB per bank / 16 ways / 12 cycles.
Latency per hop: 3 cycles.
Memory latency: 300 cycles.
L1 and L2 replacement policy: LRU.
Benchmarks: SPECJBB, OCEAN, BARNES, LU, RADIX, FFT, MIX1 (16 copies of HMMER), MIX2 (16 copies of SPHINX), and MIX3 (Barnes, Lu, Milc, Mcf, Bzip2, and Hmmer; 2 threads/copies each).

17 Comparing With Static Schemes
We first study the average L1 miss time (AMT) across FS1, FS2, FS4, FS8, FS16, and DCC.
[Figure: AMT for FS1 through FS16 and DCC across the benchmarks.]
DCC outperforms FS16, FS8, FS4, FS2, and FS1 by averages of 6.5%, 8.6%, 10.1%, 10%, and 4.5%, respectively, and by as much as 21.3%.

18 Comparing With Static Schemes
We next study the L2 miss rate across FS1, FS2, FS4, FS8, FS16, and DCC. No single static scheme provides the best miss rate for all the benchmarks; DCC always provides miss rates comparable to the best static alternative.

19 Comparing With Static Schemes
We then study the execution time across FS1, FS2, FS4, FS8, FS16, and DCC. The superiority of DCC in AMT translates to better overall performance; DCC always provides performance comparable to the best static alternative.

20 Sensitivity Study
We also study the sensitivity of DCC to different {T, Tl, Tg} values. DCC does not depend much on the values of the parameters {T, Tl, Tg}. Overall, DCC performs a little better with T = 100K than with T = 300K.

21 Comparing With Cooperative Caching
Finally, we compare DCC against the cooperative caching (CC) scheme, which is based on FS1 (the private scheme).
[Figure: performance of DCC, FS1, and CC across the benchmarks.]
DCC outperforms CC by an average of 1.59%. The basic problem with CC is that it spills blocks without knowing whether spilling helps or hurts cache performance (a problem recently addressed in HPCA09).

22 Concluding Remarks
This paper proposes DCC, a distributed cache management scheme for large-scale chip multiprocessors. Contrary to static designs, DCC adapts to irregularities in working sets. We propose generic mapping and location strategies that can be utilized for both static designs (with different sharing degrees) and dynamic designs in tiled CMPs. The proposed DCC location strategy can be improved (in regard to reducing the number of messages per request) by maintaining a small history of a specific cluster's expansions and contractions. For instance, with a given activity chain of contractions and expansions, we can predict that a requested block can't exist at a DHT corresponding to CD = 1 or 2, and has a higher probability of existing at the DHTs corresponding to CD = 4 and 8 than at the DHT corresponding to CD = 16. (A small sketch of this idea follows.)
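A hedged sketch of that history idea; the history depth, the eviction policy, and the example activity chain are illustrative assumptions, since the slide's concrete chain is elided in this transcript.

```python
# Sketch of the history idea: remember which cluster dimensions this
# core has recently used, and probe only the DHTs those dimensions map
# to, most recent first. Depth and policy are illustrative assumptions.

from collections import deque

class ClusterHistory:
    def __init__(self, depth: int = 4):
        self._cds = deque(maxlen=depth)   # recent CDs, oldest first

    def record(self, cd: int) -> None:
        self._cds.append(cd)

    def candidate_cds(self) -> list:
        """CDs a cached copy of a block could have been mapped under; a
        DHT for a dimension the cluster never recently had need not be
        probed. Ordered most-recent-first as a probe priority."""
        seen, out = set(), []
        for cd in reversed(self._cds):
            if cd not in seen:
                seen.add(cd)
                out.append(cd)
        return out

# Example with an assumed chain 16 -> 8 -> 4 -> 8: only the DHTs for
# CD = 8, 4, and 16 need probing; CD = 1 and 2 are ruled out.
h = ClusterHistory()
for cd in (16, 8, 4, 8):
    h.record(cd)
print(h.candidate_cds())   # -> [8, 4, 16]
```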

23 Dynamic Cache Clustering for Chip Multiprocessors
Thank you! Dynamic Cache Clustering for Chip Multiprocessors M. Hammoud, S. Cho, and R. Melhem Dept. of Computer Science, University of Pittsburgh

