LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors Javier Lira ψ Carlos Molina ψ,ф Antonio González ψ,λ λ Intel Barcelona Research.


1 LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors
Javier Lira ψ, Carlos Molina ψ,ф, Antonio González ψ,λ
ψ Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain. javier.lira@ac.upc.edu
ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain. carlos.molina@urv.net
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain. antonio.gonzalez@intel.com
ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009

2 Outline: Introduction, Methodology, LRU-PEA, Results, Conclusions

3 Introduction
CMPs have emerged as a dominant paradigm in system design:
1. They keep improving performance while reducing power consumption.
2. They take advantage of thread-level parallelism.
Commercial CMPs are currently available.
CMPs incorporate larger, shared last-level caches.
Wire delay is a key constraint.

4 NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].
NUCA divides a large cache into smaller, faster banks.
Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02.

5 NUCA Policies
Bank Placement Policy
Bank Access Policy
Bank Replacement Policy
Bank Migration Policy

7 Methodology
Simulation tools: Simics + GEMS, CACTI v6.0, PARSEC Benchmark Suite
Parameter               Value
Number of cores         8 (UltraSPARC IIIi)
Frequency               1.5 GHz
Main memory size        4 GBytes
Memory bandwidth        512 Bytes/cycle
L1 cache latency        3 cycles
NUCA bank latency       4 cycles
Router delay            1 cycle
On-chip wire delay      1 cycle
Main memory latency     250 cycles (from core)
Private L1 caches       8 x 32 KBytes, 2-way
Shared L2 NUCA cache    8 MBytes, 256 banks
NUCA bank               32 KBytes, 8-way
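
The configuration values above can be cross-checked with a little arithmetic. The sketch below confirms that 256 banks of 32 KBytes add up to the 8 MBytes quoted for the shared L2, and illustrates why farther banks are slower. The latency formula (each hop costs one router delay plus one wire delay, then one bank access) is an assumption made for illustration, not the model used in the slides; the function names are hypothetical.

```python
# Values copied from the methodology slide.
NUM_BANKS = 256
BANK_SIZE_KB = 32
BANK_LATENCY = 4   # cycles
ROUTER_DELAY = 1   # cycles
WIRE_DELAY = 1     # cycles


def total_capacity_mb(num_banks=NUM_BANKS, bank_size_kb=BANK_SIZE_KB):
    """Total NUCA capacity in MBytes (1 MByte = 1024 KBytes)."""
    return num_banks * bank_size_kb / 1024


def bank_access_latency(hops):
    """One-way latency in cycles to reach and access a bank.

    ASSUMPTION: every hop costs one router delay plus one wire delay,
    and the bank access itself costs BANK_LATENCY cycles.
    """
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY


print(total_capacity_mb())      # 8.0
print(bank_access_latency(0))   # 4
print(bank_access_latency(3))   # 10
```

Under this assumed model, a bank three hops away costs 10 cycles against 4 cycles for an adjacent bank, which is the non-uniformity that migration policies try to exploit.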

8 Baseline NUCA cache architecture
CMP-DNUCA
8 cores
256 banks
Non-inclusive
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04.

9 Outline: Introduction, Methodology, LRU-PEA (Background; How does it work?), Results, Conclusions

10 Background
Entrance into the NUCA:
  Off-chip memory
  L1 cache replacements
Migration movements:
  Promotion
  Demotion

11 Data categories
1. Off-chip
2. L1 cache replacements
3. Promoted data
4. Demoted data

12 LRU-PEA
LRU with Priority Eviction Approach: a replacement policy for CMP-NUCA architectures.
Data Eviction Policy: chooses the data to evict from a NUCA bank.
Data Target Policy: determines the destination bank of the evicted data.
LRU-PEA globalizes replacement decisions to the whole NUCA.

13 Data Eviction Policy
Based on the LRU replacement policy.
Static prioritisation of NUCA data categories.
The lowest-category data is evicted from the NUCA bank.
PROBLEM: the highest category could monopolize the NUCA cache.
To limit this, category comparison is restricted to the LRU and LRU-1 positions.
[Table: static priority of data categories, from highest (+) to lowest (-), for Local and Central banks. The column layout was lost in extraction; the surviving cells read, in order: L1 Replacements, Promoted, Promoted, Off-chip, Demoted, L1 Replacements.]

14 Data Eviction Policy
Example (NUCA bank, 4-way)**:
Position   0 (MRU)       1            2            3 (LRU)
Line       @A Promoted   @B Demoted   @C Offchip   @D Promoted
Legend (data categories): L1 Replacement, Promoted, Offchip, Demoted.
LRU-PEA compares the LRU-1 and LRU positions (@C Offchip and @D Promoted); afterwards @D Promoted remains and one way is available.
** The set associativity assumed in this work for NUCA banks is 8-way.
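
The eviction rule described in these two slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the priority ordering in `PRIORITY` is an assumption taken from the order of the legend in the example, the tie-break (fall back to plain LRU) is also an assumption, and `choose_victim` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# ASSUMED ordering for illustration (higher number = higher priority),
# following the order of the legend in the example slide.
PRIORITY: Dict[str, int] = {
    "L1 Replacement": 4,
    "Promoted": 3,
    "Offchip": 2,
    "Demoted": 1,
}


def choose_victim(lru_stack: List[Tuple[str, str]],
                  priority: Dict[str, int] = PRIORITY) -> int:
    """Return the index of the line to evict.

    lru_stack is ordered from MRU (index 0) to LRU (last index); each
    entry is (address, category). Only the LRU and LRU-1 positions are
    compared, so a high-priority category cannot be kept forever.
    """
    if not lru_stack:
        raise ValueError("empty set")
    lru = len(lru_stack) - 1
    if lru == 0:
        return 0
    lru_1 = lru - 1
    # Evict the lower-priority line of the two candidates.
    # ASSUMPTION: on a tie, fall back to plain LRU.
    if priority[lru_stack[lru_1][1]] < priority[lru_stack[lru][1]]:
        return lru_1
    return lru


example = [("@A", "Promoted"), ("@B", "Demoted"),
           ("@C", "Offchip"), ("@D", "Promoted")]
print(choose_victim(example))  # 2, i.e. @C Offchip
```

Note that @B (Demoted) has the lowest category in the set but is not considered, because it sits outside the LRU and LRU-1 positions.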

15 Data Target Policy
Migration movements provoke bank usage imbalance in the NUCA cache.
Replacements in the most accessed banks are unfair.
LRU-PEA globalizes replacement decisions to evict the most appropriate data from the NUCA cache.

16 Data Target Policy
Example (256 NUCA banks, 16 possible placements):
Current eviction: Off-chip, P2, Central
Step 1: L1 Replac., P1, Local
Step 2: Off-chip, P2, Central
Step 3: Demoted, P4, Local
…
Cascade mode: the current eviction becomes Demoted, P4, Local
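
One possible reading of this example is sketched below. It assumes that the evicted line visits the other candidate banks in order, that it takes the place of the first candidate whose priority is strictly lower (P1 is highest), and that in cascade mode the displaced line then continues along the remaining banks. These rules, the single pass over the bank list, and the names `Line` and `relocate` are assumptions for illustration only; the slide itself only shows the steps listed above.

```python
from typing import List, NamedTuple, Tuple


class Line(NamedTuple):
    category: str
    priority: int  # 1 = highest priority (P1); larger number = lower priority


def relocate(evicted: Line, bank_victims: List[Line],
             cascade: bool = True) -> Tuple[Line, List[int]]:
    """Walk the candidate banks and return (line leaving the NUCA, swap steps).

    bank_victims[i] is the line that bank i+1 would give up.
    ASSUMPTION: a swap happens only when the bank's candidate has strictly
    lower priority than the line currently being evicted.
    """
    current = evicted
    swaps: List[int] = []
    for step, victim in enumerate(bank_victims, start=1):
        if victim.priority > current.priority:
            swaps.append(step)   # current takes the victim's place
            current = victim     # the victim becomes the current eviction
            if not cascade:
                break            # without cascade, stop after one swap
    return current, swaps


result = relocate(
    Line("Off-chip", 2),
    [Line("L1 Replac.", 1), Line("Off-chip", 2), Line("Demoted", 4)],
)
print(result)  # (Line(category='Demoted', priority=4), [3])
```

With the three steps from the slide, the Off-chip line is rejected at steps 1 and 2 and placed at step 3, after which the Demoted line becomes the current eviction.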

18 Increasing network congestion
Values in percentage (%)
Steps      No Cascade   Cascade Enabled
                        Direct   Provoked
1 step     64           54       20
2 steps    12           7        7
3 steps    4            2        4
4 steps    3            2        4
5 steps    3            2        3
6 steps    2            1        4
7 steps    2            1        3
8 steps    2            1        4
9 steps    1            1        3
10 steps   1            1        4
11 steps   1            1        3
12 steps   1            1        6
13 steps   1            1        6
14 steps   1            1        30
15 steps   3            21       -

19 NUCA miss rate analysis

20 Performance analysis

21 Dynamic EPI analysis

23 Conclusions
LRU-PEA is proposed as an alternative to the traditional LRU replacement policy in CMP-NUCA architectures.
It defines four novel NUCA data categories and prioritises them to find the most appropriate data to evict.
In a D-NUCA architecture, data movements provoke unfair replacements in the most accessed banks.
LRU-PEA globalizes replacement decisions taken in a single bank to the whole NUCA cache.
Compared to the traditional LRU policy, LRU-PEA reduces the miss rate, increases performance with parallel applications, and reduces the energy consumed per instruction.

24 LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors Questions?

