
1 ECE8833 Polymorphous and Many-Core Computer Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Lecture 6 Fair Caching Mechanisms for CMP

2 ECE8833 H.-H. S. Lee 2009 Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04] [diagram: two processor cores, each with a private L1 $, sharing one L2 $] Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

3 ECE8833 H.-H. S. Lee 2009 Cache Sharing in CMP [diagram: thread t1 on Processor Core 1 fills the shared L2 $] Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

4 ECE8833 H.-H. S. Lee 2009 Cache Sharing in CMP [diagram: thread t2 on Processor Core 2 starts filling the shared L2 $] Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

5 ECE8833 H.-H. S. Lee 2009 Cache Sharing in CMP [diagram: t1 occupies most of the shared L2 $, squeezing t2] t2's throughput is significantly reduced due to unfair cache sharing. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

6 ECE8833 H.-H. S. Lee 2009 Shared L2 Cache Space Contention Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

7 ECE8833 H.-H. S. Lee 2009 Impact of Unfair Cache Sharing [diagram: time slices of t1..t4 under uniprocessor scheduling vs. 2-core CMP scheduling on P1 and P2] If gzip is set to run at a higher priority, it gets more time slices than the others, yet unfair cache sharing can still make it run slower than them (priority inversion). It can further slow down the other processes (starvation). The overall throughput is thus reduced, because the OS scheduler assumes all co-runners experience a uniform slowdown.

8 ECE8833 H.-H. S. Lee 2009 Stack Distance Profiling Algorithm [Qureshi+, MICRO-39] [diagram: cache tags ordered MRU to LRU, with one hit counter per recency position, CTR Pos 0 through CTR Pos 3] Example counter values: CTR Pos 0 = 30, CTR Pos 1 = 20, CTR Pos 2 = 15, CTR Pos 3 = 10; Misses = 25.

9 ECE8833 H.-H. S. Lee 2009 Stack Distance Profiling One counter C_i per cache way; C_{>A} is the counter for misses. The counters show the reuse frequency of each recency position in the cache, and can be used to predict the miss count for any associativity smaller than A. For example, misses for a 2-way cache for gzip = C_{>A} + Σ C_i for i = 3 to 8 (on an 8-way cache). art does not need all the space, likely because of poor temporal locality. If the space given to art is halved and handed to gzip, what happens?
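Since the slide predicts misses at smaller associativities from the per-position counters, a minimal sketch may help; the class and method names are illustrative, not from the paper.

```python
# A sketch of stack distance profiling for one cache set.
from collections import OrderedDict

class StackDistanceProfiler:
    def __init__(self, assoc):
        self.assoc = assoc
        self.hits = [0] * assoc      # C_1..C_A: one counter per recency position
        self.misses = 0              # C_{>A}: accesses deeper than the cache holds
        self.stack = OrderedDict()   # tags ordered MRU -> LRU

    def access(self, tag):
        if tag in self.stack:
            pos = list(self.stack).index(tag)        # 0 = MRU position
            self.hits[pos] += 1
            self.stack.move_to_end(tag, last=False)  # promote to MRU
        else:
            self.misses += 1
            if len(self.stack) >= self.assoc:
                self.stack.popitem()                 # evict the LRU tag
            self.stack[tag] = None
            self.stack.move_to_end(tag, last=False)  # insert at MRU

    def misses_if(self, ways):
        # Predicted misses at a smaller associativity `ways`:
        # every hit deeper than `ways` would have been a miss.
        return self.misses + sum(self.hits[ways:])
```

With ways = 2 on an 8-way profile, `misses_if` computes exactly the slide's C_{>A} + Σ C_i for i = 3 to 8.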

10 ECE8833 H.-H. S. Lee 2009 Fairness Metrics [Kim et al. PACT'04] Uniform slowdown. Execution time of t_i when it runs alone. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

11 ECE8833 H.-H. S. Lee 2009 Fairness Metrics [Kim et al. PACT'04] Uniform slowdown. Execution time of t_i when it shares the cache with others. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

12 ECE8833 H.-H. S. Lee 2009 Fairness Metrics [Kim et al. PACT'04] Uniform slowdown. We want to minimize the gap between threads' slowdowns; ideally, try to equalize the ratio of miss increase of each thread. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

13 ECE8833 H.-H. S. Lee 2009 Fairness Metrics [Kim et al. PACT'04] Uniform slowdown. We want to minimize the fairness metric; ideally all threads see the same slowdown (the slide's formulas are reconstructed below). Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
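The formula images on slides 12-13 did not survive the transcript; the following is a hedged reconstruction based on the Kim et al. PACT'04 definitions, with X_i as used by the M3 metric that the next slides optimize.

```latex
% Ideal fairness: every thread suffers the same slowdown.
\[
\frac{T_i^{\mathrm{shared}}}{T_i^{\mathrm{alone}}}
  \;=\;
\frac{T_j^{\mathrm{shared}}}{T_j^{\mathrm{alone}}}
\qquad \forall\, i, j
\]
% The M3 metric approximates this with miss rates: minimize
\[
M_3 \;=\; \sum_{i \neq j} \bigl| X_i - X_j \bigr|,
\qquad
X_i \;=\; \frac{\mathit{MissRate}_i^{\mathrm{shared}}}{\mathit{MissRate}_i^{\mathrm{alone}}}
\]
```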

14 ECE8833 H.-H. S. Lee 2009 Partitionable Cache Hardware Modified LRU cache replacement policy (G. E. Suh et al., HPCA 2002) with a per-thread counter. [diagram: P2 misses; Current Partition P1: 448B / P2: 576B, Target Partition P1: 384B / P2: 640B] Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

15 ECE8833 H.-H. S. Lee 2009 Partitionable Cache Hardware Modified LRU cache replacement policy (G. E. Suh et al., HPCA 2002). [diagram: on P2's miss, P1's LRU line is evicted, bringing the Current Partition (now P1: 384B / P2: 640B) to match the Target Partition] Partition granularity could be as coarse as one entire cache way. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

16 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm Example: optimizing the M3 metric. Three pieces of per-thread state: counters for the miss rate when running alone (MissRate alone, from stack distance profiling), counters for the dynamic miss rate when running with a shared cache (MissRate shared), and counters for the target partition size. A repartitioning interval of 10K accesses was found to work best. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

17 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm 1st interval: MissRate alone P1: 20%, P2: 5%; measured MissRate shared P1: 20%, P2: 15%; Target Partition P1: 256KB, P2: 256KB. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

18 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm Repartition! Evaluate M3: P1: 20%/20% = 1.0, P2: 15%/5% = 3.0. New Target Partition: P1: 192KB, P2: 320KB (partition granularity: 64KB). Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

19 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm 2nd interval: runs with Target Partition P1: 192KB, P2: 320KB; the previous MissRate shared (P1: 20%, P2: 15%) is kept while the new rates are measured. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

20 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm Repartition! Evaluate M3: P1: 20%/20% = 1.0, P2: 10%/5% = 2.0. New Target Partition: P1: 128KB, P2: 384KB. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

21 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm 3rd interval: runs with Target Partition P1: 128KB, P2: 384KB; the newly measured MissRate shared is P1: 25%, P2: 9% (previous: P1: 20%, P2: 10%). Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

22 ECE8833 H.-H. S. Lee 2009 Dynamic Fair Caching Algorithm Repartition! Do rollback if Δ < T_rollback, where Δ = MR_old - MR_new. For P2: Δ = 10% - 9% = 1%, below the threshold, so the Target Partition is rolled back from P1: 128KB / P2: 384KB to P1: 192KB / P2: 320KB. The best T_rollback threshold was found to be 20%. Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
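A minimal sketch of one repartitioning step plus the rollback test, using the constants from these slides (64KB granularity, T_rollback = 20%); the names, and the reading of the threshold as an absolute miss-rate delta, are assumptions rather than the paper's code.

```python
# Sketch: one interval of the dynamic fair caching algorithm.
# Miss rates are fractions (0.20 == 20%); all names are illustrative.
def repartition_step(mr_alone, mr_shared, target_kb, unit_kb=64):
    # X_i = shared miss rate / standalone miss rate (the M3 ratio)
    x = {t: mr_shared[t] / mr_alone[t] for t in target_kb}
    worst = max(x, key=x.get)             # most-slowed thread gains space
    best = min(x, key=x.get)              # least-slowed thread gives space
    if worst != best and target_kb[best] > unit_kb:
        target_kb[best] -= unit_kb
        target_kb[worst] += unit_kb
    return worst                          # remember who gained, for rollback

def should_rollback(mr_old, mr_new, t_rollback=0.20):
    # Roll back if the gaining thread's improvement, delta = MR_old - MR_new,
    # fell short of T_rollback (absolute delta assumed; the slide does not
    # say whether the 20% is absolute or relative).
    return (mr_old - mr_new) < t_rollback
```

On the slide's numbers, P2's delta of 0.01 is below 0.20, so `should_rollback` returns True and the previous target partition is restored.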

23 ECE8833 H.-H. S. Lee 2009 Generic Repartitioning Algorithm Pick the threads with the largest and the smallest slowdown ratios as a pair for repartitioning; repeat for all remaining candidate processes (see the sketch below).
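A hedged sketch of that generic pass: sort threads by slowdown ratio, pair extremes, and move one partition unit within each pair. Names and the single-unit transfer are illustrative assumptions.

```python
# Sketch: generic repartitioning over N threads.
def generic_repartition(x, target_kb, unit_kb=64):
    """x: thread -> slowdown ratio X_i; target_kb: thread -> partition size."""
    order = sorted(x, key=x.get)             # ascending X_i
    lo, hi = 0, len(order) - 1
    while lo < hi:
        giver, taker = order[lo], order[hi]  # smallest paired with largest
        if target_kb[giver] > unit_kb:       # never empty a partition
            target_kb[giver] -= unit_kb
            target_kb[taker] += unit_kb
        lo += 1
        hi -= 1
```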

24 Utility-Based Cache Partitioning (UCP)

25 ECE8833 H.-H. S. Lee 2009 Running Processes on a Dual-Core [Qureshi & Patt, MICRO-39] [plot: misses vs. number of ways given (1 to 16) for equake and vpr] LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr. UTIL: how much you use (in a set) is how much you will get; ideally, 3 ways to equake and 13 to vpr.

26 ECE8833 H.-H. S. Lee 2009 Defining Utility Utility U_a^b = Misses with a ways - Misses with b ways. [plot: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, showing low-utility, high-utility, and saturating-utility applications] Slide courtesy: Moin Qureshi, MICRO-39
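Because LRU is a stack algorithm, this utility falls straight out of per-recency-position hit counters like UMON's; a small sketch with illustrative names.

```python
# Sketch: compute U(a, b) from per-recency-position hit counters.
def misses_with(hit_ctrs, ways, total_accesses):
    # With `ways` ways, only hits at recency positions < ways remain hits.
    return total_accesses - sum(hit_ctrs[:ways])

def utility(hit_ctrs, a, b, total_accesses):
    # U_a^b = Misses(a) - Misses(b): extra misses when shrinking from b to a ways.
    return (misses_with(hit_ctrs, a, total_accesses)
            - misses_with(hit_ctrs, b, total_accesses))
```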

27 ECE8833 H.-H. S. Lee 2009 Framework for UCP Three components: Utility Monitors (UMON) per core, a Partitioning Algorithm (PA), and replacement support to enforce partitions. [diagram: two cores with private I$/D$, per-core UMONs and the PA attached to the shared L2 cache, backed by main memory] Slide courtesy: Moin Qureshi, MICRO-39

28 ECE8833 H.-H. S. Lee 2009 Utility Monitors (UMON) For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD). UMON-global: one way-counter for all sets. Hit counters in the ATD count hits per recency position; because LRU is a stack algorithm, hit counts give utility directly, e.g., hits(2 ways) = H0 + H1. [diagram: ATD over sets A-H with hit counters H0 (MRU) through H15 (LRU)]

29 ECE8833 H.-H. S. Lee 2009 Utility Monitors (UMON) Extra tags incur hardware and power overhead; Dynamic Set Sampling (DSS) reduces that overhead [Qureshi et al. ISCA'06]. [diagram: ATD tracking only a sampled subset of the sets]

30 ECE8833 H.-H. S. Lee 2009 Utility Monitors (UMON) Extra tags incur hardware and power overhead; DSS reduces it [Qureshi et al. ISCA'06]. 32 sets are sufficient based on Chebyshev's inequality; the paper samples every 32nd set (simple static sampling). Storage < 2KB per UMON (about 0.17% of the L2). [diagram: DSS-based UMON tracking only sampled sets B, E, F]

31 ECE8833 H.-H. S. Lee 2009 Partitioning Algorithm (PA) Evaluate all possible partitions and select the best. With a ways to core1 and (16-a) ways to core2: Hits_core1 = H0 + H1 + … + H(a-1) (from UMON1); Hits_core2 = H0 + H1 + … + H(16-a-1) (from UMON2). Select the a that maximizes Hits_core1 + Hits_core2. Partitioning is done once every 5 million cycles; after each partitioning interval, the hit counters in all UMONs are halved to retain some past information.
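The dual-core search on this slide is small enough to write out; a sketch assuming h1 and h2 are the UMON hit-counter arrays (names are illustrative).

```python
# Sketch: exhaustive dual-core partitioning over a 16-way cache.
def best_partition(h1, h2, total_ways=16):
    best_a, best_hits = 1, -1
    for a in range(1, total_ways):                   # at least one way each
        hits = sum(h1[:a]) + sum(h2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, total_ways - best_a               # ways for core1, core2

def halve_counters(umon_counters):
    # After each 5M-cycle interval, halve all hit counters to retain
    # some history while still adapting to phase changes.
    for h in umon_counters:
        for i in range(len(h)):
            h[i] //= 2
```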

32 ECE8833 H.-H. S. Lee 2009 Replacement Policy to Reach the Desired Partition Use way partitioning [Suh+ HPCA'02, Iyer ICS'04]: each line contains core-id bits. On a miss, count the ways occupied in the set by the miss-causing app. Binary decision for dual-core (in this paper): if ways_occupied < ways_given, the victim is the LRU line of the other app; otherwise, the victim is the LRU line of the miss-causing app.
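A sketch of that binary victim choice; the data-structure names are assumed (a set is a list ordered MRU to LRU, and each line records the core that installed it).

```python
# Sketch: way-partitioning victim selection for a dual-core CMP.
def pick_victim(set_lines, miss_core, ways_given):
    """set_lines: list ordered MRU -> LRU; each line has a .core_id field."""
    occupied = sum(1 for ln in set_lines if ln.core_id == miss_core)
    if occupied < ways_given[miss_core]:
        # Under quota: victim is the LRU line of the *other* app.
        candidates = [ln for ln in set_lines if ln.core_id != miss_core]
    else:
        # At quota: victim is the LRU line of the miss-causing app.
        candidates = [ln for ln in set_lines if ln.core_id == miss_core]
    return candidates[-1] if candidates else set_lines[-1]   # LRU-most line
```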

33 ECE8833 H.-H. S. Lee 2009 UCP Performance (Weighted Speedup) UCP improves average weighted speedup by 11% (dual core).

34 ECE8833 H.-H. S. Lee 2009 UCP Performance (Throughput) UCP improves average throughput by 17%.

35 Dynamic Insertion Policy

36 ECE8833 H.-H. S. Lee 2009 Conventional LRU [diagram: an incoming block is inserted at the MRU position] Slide Source: Yuejian Xie

37 ECE8833 H.-H. S. Lee 2009 Conventional LRU [diagram: a never-reused block drifts from MRU to LRU] A dead block occupies a cache line for a long time with no benefit! Slide Source: Yuejian Xie

38 ECE8833 H.-H. S. Lee 2009 LIP: LRU Insertion Policy [Qureshi et al. ISCA'07] [diagram: the incoming block is inserted at the LRU position instead] Slide Source: Yuejian Xie

39 ECE8833 H.-H. S. Lee 2009 LIP: LRU Insertion Policy [Qureshi et al. ISCA'07] Useless block: evicted at the next eviction. Useful block: moved to the MRU position on a hit. Adapted slide from Yuejian Xie

40 ECE8833 H.-H. S. Lee 2009 LIP: LRU Insertion Policy [Qureshi et al. ISCA'07] Useless block: evicted at the next eviction. Useful block: moved to the MRU position on a hit. LIP is not entirely new: Intel tried this in 1998 when designing "Timna" (which integrated the CPU and a graphics accelerator sharing the L2). Slide Source: Yuejian Xie

41 ECE8833 H.-H. S. Lee 2009 BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07] LIP may not age older lines, so BIP infrequently inserts lines at the MRU position. With e = the bimodal throttle parameter: if (rand() < e), insert at the MRU position (as the LRU replacement policy would); else, insert at the LRU position. A block is promoted to MRU if reused.
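A runnable sketch of BIP insertion and hit promotion; the set is a Python list ordered MRU to LRU, and epsilon = 1/32 is an illustrative small value for the throttle parameter, not a value this slide specifies.

```python
import random

# Sketch: BIP for one cache set (list ordered MRU -> LRU).
def bip_insert(cache_set, block, assoc, epsilon=1/32):
    if len(cache_set) >= assoc:
        cache_set.pop()                    # evict the LRU block
    if random.random() < epsilon:
        cache_set.insert(0, block)         # rarely: MRU insertion (as LRU policy)
    else:
        cache_set.append(block)            # usually: LRU insertion (as LIP)

def on_hit(cache_set, block):
    cache_set.remove(block)                # any reuse promotes to MRU
    cache_set.insert(0, block)
```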

42 ECE8833 H.-H. S. Lee 2009 DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07] Workloads come in two types: LRU-friendly or BIP-friendly. DIP can be implemented by: 1. Monitor both policies (LRU and BIP). 2. Choose the best-performing policy. 3. Apply the best policy to the cache. This needs a cost-effective implementation → "Set Dueling". [diagram: DIP selects between LRU and BIP, where BIP mixes LIP (probability 1-ε) with LRU insertion (probability ε)]

43 ECE8833 H.-H. S. Lee 2009 Set Dueling for DIP [Qureshi et al. ISCA'07] Divide the cache into three: dedicated LRU sets, dedicated BIP sets, and follower sets (which use the winner of LRU vs. BIP). An n-bit saturating counter tracks both: misses to the LRU sets increment the counter; misses to the BIP sets decrement it. The counter's MSB decides the policy for the follower sets: MSB = 0, use LRU; MSB = 1, use BIP. Monitor → choose → apply, using a single counter. Slide Source: Moin Qureshi
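A sketch of the single-counter mechanism. The mapping of dedicated sets here is a simple static scheme of my own for illustration; the paper's constituency-based mapping differs.

```python
# Sketch: set dueling with one n-bit saturating PSEL counter.
class SetDueling:
    def __init__(self, n_bits=10, sample_stride=32):
        self.max = (1 << n_bits) - 1
        self.msb = 1 << (n_bits - 1)
        self.psel = self.max // 2             # start near the midpoint
        self.stride = sample_stride

    def set_kind(self, set_idx):
        # Simple static choice of dedicated sets (assumed mapping).
        if set_idx % self.stride == 0:
            return "LRU"                      # dedicated LRU set
        if set_idx % self.stride == self.stride // 2:
            return "BIP"                      # dedicated BIP set
        return "FOLLOWER"

    def on_miss(self, set_idx):
        kind = self.set_kind(set_idx)
        if kind == "LRU":
            self.psel = min(self.psel + 1, self.max)   # LRU-set miss: counter++
        elif kind == "BIP":
            self.psel = max(self.psel - 1, 0)          # BIP-set miss: counter--

    def follower_policy(self):
        # MSB = 0 -> use LRU; MSB = 1 -> use BIP
        return "BIP" if self.psel & self.msb else "LRU"
```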

44 Promotion/Insertion Pseudo Partitioning

45 ECE8833 H.-H. S. Lee 2009 PIPP [Xie & Loh ISCA'09] What's PIPP? Promotion/Insertion Pseudo Partitioning, achieving both capacity management (as in UCP) and dead-time management (as in DIP). Eviction: the LRU block is the victim. Insertion: the core's quota (target allocation) worth of positions away from LRU, e.g., insert position = 3 for a quota of 3. Promotion: on a hit, move toward MRU by only one position. Slide Source: Yuejian Xie
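PIPP's three rules compress to a few lines; a sketch for one set (a list ordered MRU to LRU, with quotas from a UCP-style monitor; all names are assumptions).

```python
# Sketch: PIPP insertion and promotion for one cache set.
def pipp_insert(cache_set, block, core, quota, assoc):
    if len(cache_set) >= assoc:
        cache_set.pop()                          # evict the true LRU block
    # Insert the core's quota worth of positions away from LRU.
    pos = max(len(cache_set) - quota[core], 0)
    cache_set.insert(pos, block)

def pipp_hit(cache_set, block):
    # Promote toward MRU by only one position per hit, not straight to MRU.
    i = cache_set.index(block)
    if i > 0:
        cache_set[i - 1], cache_set[i] = cache_set[i], cache_set[i - 1]
```

A block with a large quota thus starts near MRU (capacity management), while a block that is never reused stays near LRU and is evicted quickly (dead-time management).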

46 ECE8833 H.-H. S. Lee 2009 PIPP Example Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: Core1 requests block D; D is inserted 3 positions from LRU, matching Core1's quota] Slide Source: Yuejian Xie

47 ECE8833 H.-H. S. Lee 2009 PIPP Example Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: Core0 requests block 6; 6 is inserted 5 positions from LRU, matching Core0's quota] Slide Source: Yuejian Xie

48 ECE8833 H.-H. S. Lee 2009 PIPP Example Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: Core0 requests block 7; 7 is likewise inserted at Core0's quota position, 5 away from LRU] Slide Source: Yuejian Xie

49 ECE8833 H.-H. S. Lee 2009 PIPP Example Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: a hit on block D promotes it by one position toward MRU] Slide Source: Yuejian Xie

50 ECE8833 H.-H. S. Lee 2009 How PIPP Does Both Kinds of Management

Core:   Core0  Core1  Core2  Core3
Quota:    6      4      4      2

Cores with smaller quotas insert closer to the LRU position. Slide Source: Yuejian Xie

51 ECE8833 H.-H. S. Lee 2009 Pseudo Partitioning Benefits Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: under a strict partition, each core's new blocks move only within its own MRU0-LRU0 / MRU1-LRU1 region] Slide Source: Yuejian Xie

52 ECE8833 H.-H. S. Lee 2009 Pseudo Partitioning Benefits Core0 quota: 5 blocks, Core1 quota: 3 blocks. [diagram: under the pseudo partition, Core1's new insertion "stole" a line from Core0, letting demand temporarily exceed quota] Slide Source: Yuejian Xie

53 ECE8833 H.-H. S. Lee 2009 Pseudo Partitioning Benefits

54 ECE8833 H.-H. S. Lee 2009 Single Reuse Block [diagram: on a reuse, TADIP moves the block directly to MRU, whereas PIPP promotes it by only one position] Slide Source: Yuejian Xie

55 ECE8833 H.-H. S. Lee 2009 Algorithm Comparison

Algorithm     Capacity Management   Dead-time Management   Note
LRU           No                    No                     Baseline, no explicit management
UCP           Yes                   No                     Strict partitioning
DIP / TADIP   No                    Yes                    Insert at LRU and promote to MRU on hit
PIPP          Yes                   Yes                    Pseudo-partitioning and incremental promotion

Slide Source: Yuejian Xie

