
1 Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs
Javier Lira ψ, Carlos Molina ф, Antonio González λ
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, antonio.gonzalez@intel.com
ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, carlos.molina@urv.net
ψ Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, javier.lira@ac.upc.edu
Euro-Par 2009, Delft (The Netherlands), August 27, 2009

2 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

3 Introduction
Chip multiprocessors (CMPs) have emerged as a dominant paradigm in system design because they (1) keep improving performance while reducing power consumption and (2) take advantage of thread-level parallelism. Commercial CMPs are already available. CMPs incorporate larger, shared last-level caches, and for these caches wire delay is a key constraint.

4 NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1]. NUCA divides a large cache into smaller, faster banks. Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02.
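The distance-dependent latency can be illustrated with a simple model. This is a sketch under assumptions: the defaults are the bank, router and wire delays from the simulated configuration under Methodology, while the round-trip mesh model (request and reply each cross `hops` links, each link costing one router delay plus one wire delay) and the name `nuca_access_latency` are hypothetical.

```python
def nuca_access_latency(hops: int,
                        bank_latency: int = 2,
                        router_delay: int = 1,
                        wire_delay: int = 1) -> int:
    """Round-trip latency in cycles to a NUCA bank `hops` network hops away.

    Assumption: the request and the reply each traverse `hops` links, and
    every link costs one router delay plus one wire delay.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    per_hop = router_delay + wire_delay
    return bank_latency + 2 * hops * per_hop


# Banks close to the controller are cheaper to reach than banks farther away.
for hops in (0, 1, 4, 8):
    print(hops, nuca_access_latency(hops))
```

Under this model every extra hop adds four cycles, so a bank eight hops away costs 34 cycles against 2 cycles for a bank next to the controller.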

5 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

6 Methodology
Simulation tools: Simics + GEMS, CACTI v6.0, PARSEC Benchmark Suite
Number of cores: 8
Core processor: out-of-order SPARCv9
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
L1 cache latency: 3 cycles
NUCA bank latency: 2 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 350 cycles (from core)
Private L1 data caches: 8 KBytes
Private L1 instruction caches: 8 KBytes
Shared L2 NUCA cache: 1 MByte, 256 banks
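A quick calculation from the table gives the capacity of each NUCA bank. The 64-byte block size is an assumption, since the slide does not state it.

```python
TOTAL_L2_BYTES = 1 * 1024 * 1024  # shared L2 NUCA cache: 1 MByte
NUM_BANKS = 256                   # banks in the NUCA cache
LINE_BYTES = 64                   # ASSUMPTION: block size not given on the slide

bank_bytes = TOTAL_L2_BYTES // NUM_BANKS   # capacity of one bank
lines_per_bank = bank_bytes // LINE_BYTES  # blocks one bank can hold

print(bank_bytes, lines_per_bank)
```

Each bank therefore holds 4 KBytes, or 64 blocks under the assumed block size.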

7 Baseline NUCA cache architecture
8 cores, 256 banks [2]
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04.

8 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

9 Last Bank
Data movements concentrate the most frequently accessed data in a few banks. As a result, data replacements in these HOT banks are unfair.
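One way data movements can concentrate hot data is gradual promotion: each hit moves a block one bank closer to the requester. The sketch below assumes a single chain of banks ordered from closest (index 0) to farthest, and swaps the promoted block with the last block of the closer bank. The function name `access` and the swap choice are hypothetical, not the exact migration policy of the baseline.

```python
from typing import List, Optional


def access(chain: List[List[int]], addr: int) -> Optional[int]:
    """Look up `addr` in banks ordered closest (index 0) to farthest.

    On a hit in bank i > 0, the block is promoted one bank closer by swapping
    it with the last block of bank i - 1, which is demoted in exchange.
    Returns the index of the bank that hit, or None on a miss.
    """
    for i, bank in enumerate(chain):
        if addr in bank:
            if i > 0:
                j = bank.index(addr)
                closer = chain[i - 1]
                bank[j], closer[-1] = closer[-1], bank[j]
            return i
    return None


chain = [[1, 2], [3, 4], [5, 6]]
for _ in range(3):
    print(access(chain, 5), chain)
```

Repeated hits pull block 5 into the closest bank, while the blocks it displaces drift outward; this is how a few banks become HOT.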

10 Last Bank
An extra bank, the Last Bank, is added to the NUCA cache. It acts as a victim cache, although it is not fully associative, and gives evicted data a second chance to stay in the NUCA.
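The slide does not give the Last Bank's geometry, so the sketch below models it as a small set-associative victim store with per-set LRU order. The class `LastBank`, its methods, `num_sets`, `ways` and the modulo set index are hypothetical choices for illustration.

```python
from collections import OrderedDict
from typing import List, Optional


class LastBank:
    """Set-associative (not fully associative) victim store for NUCA evictions."""

    def __init__(self, num_sets: int, ways: int) -> None:
        self.ways = ways
        # One OrderedDict per set; insertion order tracks LRU (first) to MRU (last).
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(num_sets)]

    def _set(self, addr: int) -> OrderedDict:
        return self.sets[addr % len(self.sets)]

    def insert_victim(self, addr: int) -> Optional[int]:
        """Store a block evicted from the NUCA banks.

        Returns the block pushed out to main memory, or None if nothing left.
        """
        s = self._set(addr)
        if addr in s:
            s.move_to_end(addr)
            return None
        dropped = None
        if len(s) >= self.ways:
            dropped, _ = s.popitem(last=False)  # LRU way leaves the NUCA
        s[addr] = None
        return dropped

    def lookup_and_remove(self, addr: int) -> bool:
        """On a NUCA miss, a Last Bank hit hands the block back to the NUCA."""
        s = self._set(addr)
        if addr in s:
            del s[addr]
            return True
        return False


lb = LastBank(num_sets=2, ways=2)
print(lb.insert_victim(0), lb.insert_victim(2), lb.insert_victim(4))
print(lb.lookup_and_remove(4), lb.lookup_and_remove(0))
```

Because only the ways of one set are searched, a lookup stays cheap, at the cost of conflict evictions that a fully associative victim cache would avoid.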

11 Last Bank performance
Performance benefits are restricted by the Last Bank size, yet the performance potential is significant. Reused addresses are therefore analysed to find points for improvement.

12 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

13 Characterization of replacements in NUCA
How many evicted addresses are later reused? How many cycles does a reused address usually spend outside the NUCA before being reinserted? Where were reused addresses located within the NUCA just before being evicted? What action motivated the eviction of reused addresses from the NUCA?

14 Reused address statistics
Nearly 70% of evicted addresses return to the NUCA cache, and most reused addresses return to the NUCA at least twice.
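Both statistics can be computed from an eviction/insertion trace. This sketch assumes a hypothetical trace format of `(cycle, event, addr)` tuples in cycle order, with `event` either `"evict"` or `"insert"`, and counts per distinct evicted address.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def reuse_stats(trace: Iterable[Tuple[int, str, str]]) -> Tuple[float, float]:
    """Return (fraction of evicted addresses later reinserted,
    fraction of those reused addresses that return at least twice)."""
    evicted = set()
    returns = defaultdict(int)  # addr -> reinsertions after an eviction
    for _cycle, event, addr in trace:
        if event == "evict":
            evicted.add(addr)
        elif event == "insert" and addr in evicted:
            returns[addr] += 1
    if not evicted:
        return 0.0, 0.0
    reused = [a for a in evicted if returns[a] > 0]
    multi = sum(1 for a in reused if returns[a] >= 2)
    return len(reused) / len(evicted), (multi / len(reused) if reused else 0.0)


trace = [(0, "insert", "A"), (10, "evict", "A"), (15, "evict", "B"),
         (16, "evict", "C"), (20, "insert", "A"), (30, "evict", "A"),
         (40, "insert", "A"), (50, "insert", "B")]
print(reuse_stats(trace))
```

The first insertion of an address is not counted as a return, because the address has not been evicted yet at that point.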

15 Time between eviction and reinsertion
Nearly 30% of evicted addresses return in less than 100,000 cycles. In blackscholes, almost 50% of reused addresses return to the NUCA in less than 1,000 cycles.
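The time an address spends outside the NUCA can be bucketed with the thresholds quoted above. The sketch assumes the same hypothetical `(cycle, event, addr)` trace format and counts each reinsertion separately, measuring from the most recent eviction of that address.

```python
from bisect import bisect_right
from typing import Iterable, List, Sequence, Tuple


def reinsertion_gap_buckets(trace: Iterable[Tuple[int, str, str]],
                            bounds: Sequence[int] = (1_000, 100_000)) -> List[int]:
    """Count reinsertions by cycles spent outside the NUCA.

    Bucket i holds gaps below bounds[i] (and not below bounds[i - 1]);
    the last bucket holds gaps of at least bounds[-1] cycles.
    """
    last_evict = {}
    counts = [0] * (len(bounds) + 1)
    for cycle, event, addr in trace:
        if event == "evict":
            last_evict[addr] = cycle
        elif event == "insert" and addr in last_evict:
            gap = cycle - last_evict.pop(addr)
            counts[bisect_right(bounds, gap)] += 1
    return counts


trace = [(0, "evict", "A"), (500, "insert", "A"), (1000, "evict", "B"),
         (2000, "evict", "C"), (3000, "evict", "D"), (51000, "insert", "B"),
         (302000, "insert", "C")]
print(reinsertion_gap_buckets(trace))
```

Addresses that are evicted and never come back, such as D, do not appear in any bucket.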

16 Last location within the NUCA
Most reused addresses were evicted from Local Banks, whereas most addresses replaced from Central Banks are not reused later.
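The breakdown by last location can be obtained by tagging each eviction with the kind of bank the block left. The sketch assumes a hypothetical trace of `(cycle, event, addr, bank_kind)` tuples in cycle order, where `bank_kind` is meaningful only on `"evict"` events.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def reuse_by_location(
        trace: Iterable[Tuple[int, str, str, Optional[str]]]
) -> Dict[str, Tuple[int, int]]:
    """Return {bank_kind: (evictions later reused, total evictions)}."""
    pending = {}      # addr -> bank kind of its most recent eviction
    total = Counter()
    reused = Counter()
    for _cycle, event, addr, kind in trace:
        if event == "evict":
            pending[addr] = kind
            total[kind] += 1
        elif event == "insert" and addr in pending:
            reused[pending.pop(addr)] += 1
    return {k: (reused[k], total[k]) for k in total}


trace = [(0, "evict", "A", "local"), (5, "insert", "A", None),
         (10, "evict", "B", "central"), (20, "evict", "C", "local"),
         (30, "evict", "D", "central"), (40, "insert", "C", None)]
print(reuse_by_location(trace))
```

A result where Local Banks show a high reuse ratio and Central Banks a low one is the pattern this slide reports.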

17 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

18 Selective Last Bank
Target: reduce pollution in the Last Bank. This mechanism selects which evicted data blocks are stored in the Last Bank. The implemented Selective Last Bank stores a data block if and only if it was evicted from a Local Bank; otherwise, the block is sent back to main memory.
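The selection rule above reduces to a single check on where the victim came from. In this sketch the Last Bank and the memory write-back path are plain lists standing in for the real structures; `BankKind` and `on_nuca_eviction` are hypothetical names.

```python
from enum import Enum
from typing import List


class BankKind(Enum):
    LOCAL = "local"
    CENTRAL = "central"


def on_nuca_eviction(addr: int, from_bank: BankKind,
                     last_bank: List[int], to_memory: List[int]) -> bool:
    """Selective Last Bank filter: keep a victim only if it left a Local Bank.

    Returns True if the block was stored in the Last Bank.
    """
    if from_bank is BankKind.LOCAL:
        last_bank.append(addr)   # second chance inside the NUCA
        return True
    to_memory.append(addr)       # otherwise the block goes back to main memory
    return False


last_bank: List[int] = []
to_memory: List[int] = []
on_nuca_eviction(1, BankKind.LOCAL, last_bank, to_memory)
on_nuca_eviction(2, BankKind.CENTRAL, last_bank, to_memory)
print(last_bank, to_memory)
```

The filter relies on the characterization results: blocks evicted from Central Banks are rarely reused, so keeping them would mostly pollute the Last Bank.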

19 LRU Prioritising Last Bank
Target: keep reused addresses in the NUCA cache. The data eviction policy of the NUCA banks is modified to prioritise lines that come from the Last Bank during the data replacement process.
[Figure: a 4-way set ordered MRU (way 0) to LRU (way 3). Before replacement: @A P:0, @B P:0, @C P:0, @D P:1. After: @D P:0, @A P:0, @B P:0, @C P:0.]
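Our reading of the figure is that a line refilled from the Last Bank carries a priority bit P = 1; when such a line reaches the LRU position it gets a second chance (P is cleared and it moves to MRU), and the next LRU line is evicted instead. The sketch below encodes that reading; `Line`, `evict_with_priority` and `insert_line` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    addr: str
    p: int = 0  # priority bit: 1 if the block was refilled from the Last Bank


def evict_with_priority(ways: List[Line]) -> Line:
    """Pick a victim from a non-empty set ordered MRU (index 0) to LRU (index -1).

    A prioritised LRU line has its bit cleared and moves to MRU instead of
    being evicted. The loop terminates because each pass clears one bit.
    """
    while ways[-1].p:
        line = ways.pop()
        line.p = 0
        ways.insert(0, line)
    return ways.pop()


def insert_line(ways: List[Line], addr: str, from_last_bank: bool) -> None:
    """Insert a refilled block at MRU, marking blocks that came from the Last Bank."""
    ways.insert(0, Line(addr, 1 if from_last_bank else 0))


ways = [Line("A"), Line("B"), Line("C"), Line("D", 1)]
victim = evict_with_priority(ways)
print(victim.addr, [(l.addr, l.p) for l in ways])
```

With the set from the figure, @D is spared and moved to MRU with P:0, and @C becomes the victim.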

20 Results
Both optimizations increase the performance benefits of the Last Bank, but there is still room for improvement. Adaptive filters will be analysed in future work.

21 Outline
Introduction; Methodology; Last Bank; Characterization of replacements in NUCA; Last Bank optimizations; Conclusions

22 Conclusions
Data movements provoke unfair replacements in HOT banks. The Last Bank reduces the access latency of promptly reused addresses and shows a large performance potential. Two optimizations are proposed: Selective Last Bank, which reduces pollution in the Last Bank, and LRU Prioritising Last Bank, which keeps reused addresses in the NUCA cache.

23 Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs Questions?

