1 PageNUCA: Selected Policies for Page-grain Locality Management in Large Shared CMP Caches Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in

2 PageNUCA (IIT, Kanpur) Talk in one slide
Large shared caches in CMPs are designed as a collection of a number of smaller banks
The banks are distributed across the floor of the chip and connected to the cores by a point-to-point interconnect, giving rise to a NUCA
We explore page-grain dynamic data migration in a NUCA and compare it with block-grain migration and OS-assisted static page-to-bank mapping techniques (first touch and application-directed)

3 PageNUCA (IIT, Kanpur) Sketch
 Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
Simulation results
An analytical model
Summary

4 PageNUCA (IIT, Kanpur) Preliminaries: Example floorplan
[Floorplan diagram: eight cores (C0–C7), each with a private L1 cache, and sixteen L2 cache banks (B0–B15) with their bank controllers, all connected by a ring; the memory controllers also sit on the ring]

5 PageNUCA (IIT, Kanpur) Preliminaries: Baseline mapping
Virtual address to physical address mapping is demand-based L2 cache-aware bin-hopping
  – Good for reducing L2 cache conflicts
An L2 cache block is found in a unique bank at any point in time
  – Home bank maintains the directory entry of each block in the bank as an extended state
  – Home bank is a function of the physical address coming out of the L1 cache controller
  – Home bank may change as a block migrates
  – Replication not explored in this work

6 PageNUCA (IIT, Kanpur) Preliminaries: Baseline mapping
Physical address to bank mapping is page-interleaved
  – Bank number bits are located right next to the page offset bits
  – Delivers performance and energy-efficiency similar to the more popular block-interleaved scheme
Private L1 caches are kept coherent via a home-based MESI directory protocol
  – Every L1 cache request is forwarded to the home bank first for consulting the directory entry
  – The cache hierarchy maintains inclusion
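A minimal sketch of the page-interleaved address-to-bank mapping described above. The 16 banks match the example floorplan; the 4 KB page size is an assumption for illustration.

```python
PAGE_OFFSET_BITS = 12   # 4 KB page: an assumption
NUM_BANKS = 16          # matches the example floorplan (B0-B15)

def home_bank(physical_addr: int) -> int:
    """Bank number bits sit right next to the page offset bits."""
    return (physical_addr >> PAGE_OFFSET_BITS) & (NUM_BANKS - 1)

# Consecutive physical pages land on consecutive banks and wrap around.
assert home_bank(0x0000_0000) == 0
assert home_bank(0x0000_1000) == 1
assert home_bank(0x0001_0000) == 0
```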

7 PageNUCA (IIT, Kanpur) Preliminaries: Why page-grain
Past research has explored block-grain data migration and replication in NUCAs
  – See paper for a detailed account
Learning dynamic reference patterns at coarse grain requires less storage
Can pipeline the transfer of multiple cache blocks (amortizes the overhead)
Page-grain is particularly attractive
  – Contiguous physical data exceeding a page may include completely unrelated virtual pages (we compare the two ends of the spectrum)
  – Success in NUMAs (Origin 2000 and Wildfire)

8 PageNUCA (IIT, Kanpur) Preliminaries: Observations
[Chart: fraction of all pages and of L2 cache accesses for Barnes, Matrix, Equake, FFTW, Ocean, Radix, binned by per-page access count (>= 32, [16, 31], [8, 15], [1, 7]); the two bar groups show solo pages and their access coverage]

9 PageNUCA (IIT, Kanpur) Preliminaries: Observations
For five out of six applications, more than 75% of the pages accessed in a 0.1M-cycle sample period are solo
For five out of six applications, more than 50% of L2 cache accesses are covered by these solo pages
A major portion of L2 cache accesses is covered by solo pages with 32 or more accesses
  – Potential for compensating the migration overhead by enjoying subsequent reuses

10 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
 Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
Simulation results
An analytical model
Summary

11 PageNUCA (IIT, Kanpur) Dynamic page migration
Fully hardwired solution composed of four central algorithms
  – When to migrate a page
  – Where to migrate a candidate page
  – How to locate a cache block belonging to a migrated page
  – How the physical data transfer takes place
Definition: an L2 cache bank B is local to a core C if B is in LOCAL(C) = {x | RTWD(x, C) ≤ RTWD(y, C) for all y ≠ x}
  – A core can have multiple local banks
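A small sketch of the LOCAL(C) definition above, assuming a round-trip wire delay (RTWD) table indexed by (bank, core); the toy delays are made up purely for illustration.

```python
def local_banks(core, rtwd):
    """LOCAL(core): the banks whose round-trip wire delay to the core is
    minimal. rtwd maps (bank, core) -> delay; several banks can tie."""
    delays = {bank: d for (bank, c), d in rtwd.items() if c == core}
    best = min(delays.values())
    return {bank for bank, d in delays.items() if d == best}

# Toy example: core 0 sits next to banks 0 and 1, so both are local to it.
rtwd = {(0, 0): 2, (1, 0): 2, (2, 0): 4, (3, 0): 6}
assert local_banks(0, rtwd) == {0, 1}
```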

12 PageNUCA (IIT, Kanpur) When to migrate a page
Migration of page P is considered when an L1 cache request from core R for address A belonging to P arrives at the L2 cache, provided
  – HOME(A) is not in LOCAL(R)
Sharer mode migration decision
  – SHARER(P) > 1 and MaxAccess(P) – SecondMaxAccess(P) < T2
Solo mode migration decision
  – (SHARER(P) == 1 or MaxAccess(P) – SecondMaxAccess(P) ≥ T1) and R is in MaxAccessCluster(P)
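The two triggers above can be written as a short decision function. This is a hedged sketch: the sharer-mode comparison is read as "difference < T2" (the operator is garbled in the transcript), and the threshold values passed in below are arbitrary, not the tuned values from the paper.

```python
def should_migrate(home_is_local, sharer_count, max_access,
                   second_max_access, requester_in_max_cluster, T1, T2):
    """Return 'solo', 'sharer', or None for an L2 access by core R to page P."""
    if home_is_local:                     # HOME(A) already in LOCAL(R): no move
        return None
    diff = max_access - second_max_access
    if (sharer_count == 1 or diff >= T1) and requester_in_max_cluster:
        return 'solo'
    if sharer_count > 1 and diff < T2:
        return 'sharer'
    return None

# Example: a page dominated by one core cluster migrates in solo mode.
assert should_migrate(False, 1, 40, 0, True, T1=8, T2=4) == 'solo'
```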

13 PageNUCA (IIT, Kanpur) When to migrate a page
Hardware support
  – Page access counter table (PACT) per L2 cache bank and associated logic
  – PACT is a set-associative cache that maintains several pieces of information about a page
  – Valid, tag, LRU states
  – Saturating counters keeping track of the access count from each topologically close cluster of cores (a pair of adjacent cores)
  – Max. and second max. counts, max. cluster
  – Sharer bitvector and population count
  – Count of accesses since last sharer added
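The PACT fields listed above can be pictured with the following per-entry sketch; the counter width, the pairing of adjacent cores into clusters, and the method names are assumptions for illustration, not the exact hardware layout.

```python
class PactEntry:
    COUNTER_MAX = 255                     # saturating-counter width: an assumption

    def __init__(self, tag, num_clusters):
        self.valid = True
        self.tag = tag                    # page tag; LRU state is kept per set
        self.counts = [0] * num_clusters  # accesses per adjacent-core cluster
        self.sharers = 0                  # sharer bitvector over cores
        self.since_last_sharer = 0        # accesses since a sharer was added

    def record_access(self, core):
        cluster = core // 2               # topologically adjacent pair of cores
        self.counts[cluster] = min(self.counts[cluster] + 1, self.COUNTER_MAX)
        if (self.sharers >> core) & 1:
            self.since_last_sharer += 1
        else:
            self.sharers |= 1 << core
            self.since_last_sharer = 0

    def max_and_second(self):
        """Max count, second-max count, and the cluster holding the max."""
        top = sorted(range(len(self.counts)),
                     key=lambda c: self.counts[c], reverse=True)
        return self.counts[top[0]], self.counts[top[1]], top[0]

    def sharer_count(self):
        return bin(self.sharers).count('1')
```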

14 PageNUCA (IIT, Kanpur) When to migrate a page
[Diagram: PACT organization. The PACT is a k-way set-associative structure with N sets; each PACT set corresponds to a contiguous group of Psz/Bsz L2 cache sets in the bank, i.e., the index range covered by one page]

15 PageNUCA (IIT, Kanpur) Where to migrate a page
Consists of two sub-algorithms
  – Find a destination bank of migration
  – Find an appropriate "region" in the destination bank for holding the migrated page
Find a destination bank D for a candidate page P for solo mode migration
  – Definition: the load on a bank is the number of pages mapped onto that bank either by the OS or dynamically by migration
  – Set D to the least loaded bank in LOCAL(R), where R is the requesting core for the current transaction

16 PageNUCA (IIT, Kanpur) Where to migrate a page
Find a destination bank D for a candidate page P for sharer mode migration
  – Ideally we want D = argmin over banks x of Σ_i a_i(P)·RTWD(x, S_i(P)), where i ranges over the sharers of P (read out from the PACT), a_i(P) is the number of accesses from the i-th sharer to page P, and S_i(P) is the i-th sharer
  – Simplification: assume a_i(P) == a_j(P)
  – Maintain a "Proximity ROM" of 2^#C entries per L2 cache bank, indexed by the sharer vector of P and returning the top four solutions of the minimization problem; cancel the migration if HOME(P) is one of these four
  – Set D to the candidate with the least load
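A dict-based sketch of the sharer-mode destination choice just described; the Proximity ROM contents and the bank loads below are placeholders, not values from the paper.

```python
def pick_destination(sharer_vector, home_bank, proximity_rom, bank_load):
    """Proximity ROM gives the four banks minimizing total (unweighted) RTWD
    to the sharers; the least loaded of them wins, unless HOME(P) is one."""
    candidates = proximity_rom[sharer_vector]      # top four banks, precomputed
    if home_bank in candidates:
        return None                                # already good enough: cancel
    return min(candidates, key=lambda b: bank_load[b])

# Toy usage: sharers are cores 2 and 3 (bitvector 0b1100).
proximity_rom = {0b1100: [4, 5, 2, 3]}
bank_load = {2: 12, 3: 9, 4: 30, 5: 11}
assert pick_destination(0b1100, home_bank=0,
                        proximity_rom=proximity_rom, bank_load=bank_load) == 3
```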

17 PageNUCA (IIT, Kanpur) Where to migrate a page
Find a region in destination bank D for migrated page P
  – A design decision: migration is done by swapping the contents of a page frame P' mapping to D with those of P in HOME(P); no gradual migration => saves power
  – Look for an invalid entry in PACT(D) => an unused index range covering a page in D; generate a frame id P' outside the physical address range mapping to that index range
  – If none is found, let P' be the LRU page in a randomly picked non-MRU set in PACT(D)

18 PageNUCA (IIT, Kanpur) How to locate a cache block in L2$
The migration process is confined within the boundaries of the L2 cache only
  – Not visible to the OS, TLBs, L1 caches, or the external memory system (which may contain other CMP nodes)
  – Definition: the OS-generated physical address (OS PA) is the address assigned to a page at the time of a page fault
  – Definition: the L2 cache address (L2 CA) of a cache block is the address of the block within the L2 cache
  – Appropriate translation must be carried out between OS PA and L2 CA at the L2 cache boundaries
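A sketch of the OS PA to L2 CA rewrite performed at the L2 cache boundary: only the page-number bits change, the page offset is preserved. The 4 KB page size and the dict standing in for the map table are assumptions.

```python
PAGE_OFFSET_BITS = 12                     # 4 KB page: an assumption

def to_l2_cache_address(os_pa, page_map):
    """Rewrite the page-number bits of an OS physical address to the L2 frame
    currently holding the page; identity if the page never migrated."""
    os_ppn = os_pa >> PAGE_OFFSET_BITS
    offset = os_pa & ((1 << PAGE_OFFSET_BITS) - 1)
    l2_ppn = page_map.get(os_ppn, os_ppn)  # one-to-one mapping
    return (l2_ppn << PAGE_OFFSET_BITS) | offset

# Example: OS page 0x42 has migrated to L2 frame 0x17; the offset survives.
assert to_l2_cache_address(0x42ABC, {0x42: 0x17}) == 0x17ABC
```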

19 PageNUCA (IIT, Kanpur) How to locate a cache block in L2$
[Diagram: on-core translation of OS PA to L2 CA, shown for L1 data cache misses only. The dTLB produces the OS PPN; the per-core dL1Map (one-to-one, filled on a dTLB miss, exercised by all L1-to-L2 transactions) translates the OS PPN to the L2 PPN before the request leaves the core on the ring]

20 PageNUCA (IIT, Kanpur) How to locate a cache block in L2$
[Diagram: uncore translation between OS PA and L2 CA at an L2 cache bank. The inverse L2Map translates an L2 CA back to the OS PPN (e.g., for misses heading to the memory controller), while the forward L2Map translates an incoming OS PA (refills, external hits) to the current L2 CA; the PACT is consulted alongside the bank]

21 PageNUCA (IIT, Kanpur) How to locate a cache block in L2$
Storage overhead
  – L1Maps: instruction and data maps per core; organization same as the iTLB and dTLB; filled at TLB-miss time from the forward L2Map (if not found, filled with the identity mapping)
  – Forward and inverse L2Maps per L2 cache bank: organized as set-associative caches; sized to achieve a small volume of replacements
Invariant: Map(P, Q) ∈ fL2Map(HOME(P)) iff Map(Q, P) ∈ iL2Map(HOME(Q))
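The forward/inverse map invariant above can be checked mechanically. In this sketch the per-bank maps are plain dicts and home() is the baseline page-interleaved function, both stand-ins for the hardware tables.

```python
NUM_BANKS = 16
home = lambda ppn: ppn % NUM_BANKS        # baseline page-interleaved home bank

def install_mapping(fl2map, il2map, os_ppn, l2_ppn):
    """Record that OS page os_ppn currently lives in L2 frame l2_ppn."""
    fl2map[home(os_ppn)][os_ppn] = l2_ppn
    il2map[home(l2_ppn)][l2_ppn] = os_ppn

def check_invariant(fl2map, il2map):
    """Map(P, Q) in fL2Map(HOME(P)) implies Map(Q, P) in iL2Map(HOME(Q))."""
    for bank in fl2map:
        for os_ppn, l2_ppn in fl2map[bank].items():
            assert il2map[home(l2_ppn)][l2_ppn] == os_ppn

fl2map = {b: {} for b in range(NUM_BANKS)}
il2map = {b: {} for b in range(NUM_BANKS)}
install_mapping(fl2map, il2map, os_ppn=0x42, l2_ppn=0x17)
check_invariant(fl2map, il2map)
```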

22 PageNUCA (IIT, Kanpur) How to locate a cache block in L2$
Implications on miss paths
  – The L1Map lookup can be hidden under the write to the outbound queue in the local switch
  – The L2 cache miss path gets lengthened: on a miss, the request must be routed to the original home bank over the ring for allocating the MSHR and going through the proper memory controller
  – On an L2 cache refill or external intervention, the transaction arrives at the original home bank and must be routed to its migrated bank (if any)

23 PageNUCA (IIT, Kanpur) How data is transferred
Page P from bank B is being swapped with page P' from bank B'
  – Note that these are L2 CAs
  – Step 1: iL2Map(B) produces the OS PA of P (call it Q) and iL2Map(B') produces the OS PA of P' (call it Q'); swap these two entries
  – Step 2: fL2Map(HOME(Q)) must contain Map(Q, P) and fL2Map(HOME(Q')) must contain Map(Q', P'); swap these two entries
  – Step 3: send the new forward maps, i.e. Map(Q, P') and Map(Q', P), to the sharing cores of P and P' [obtained from PACT(B) and PACT(B')] so that they can update their L1Maps

24 PageNUCA (IIT, Kanpur) How data is transferred
Page P from bank B is being swapped with page P' from bank B' (continued)
  – Step 4: the sharing cores acknowledge the L1Map update
  – Step 5: start the pipelined transfer of data blocks, coherence states, and directory entries
  – Banks B and B' stop accepting any request until the migration is complete
  – The migration protocol may evict cache blocks from B or B' to make room for the migrated blocks (a perfect swap may not be possible)
  – A cycle-free virtual lane dependence graph guarantees freedom from deadlock
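Steps 1–3 of the swap boil down to exchanging map entries. The sketch below uses dicts for the maps and a callback for the L1Map notifications; the acknowledgements and the pipelined data transfer of steps 4–5 are deliberately omitted, so this is only the bookkeeping half of the protocol.

```python
def swap_pages(P, Pp, il2map, fl2map, home, notify_sharers):
    """Swap L2 frame P (bank B) with L2 frame Pp (bank B'); P and Pp are L2 CAs."""
    B, Bp = home(P), home(Pp)
    # Step 1: inverse maps give the OS addresses Q and Q'; swap the two entries.
    Q, Qp = il2map[B][P], il2map[Bp][Pp]
    il2map[B][P], il2map[Bp][Pp] = Qp, Q
    # Step 2: swap the matching forward-map entries held at the OS home banks.
    fl2map[home(Q)][Q], fl2map[home(Qp)][Qp] = Pp, P
    # Step 3: send the new forward maps to the sharers so they patch their L1Maps.
    notify_sharers(Q, Pp)
    notify_sharers(Qp, P)
```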

25 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
 Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
Simulation results
An analytical model
Summary

26 PageNUCA (IIT, Kanpur) Dynamic cache block migration
Modeled as a special case of page-grain migration where the grain is a single L2 cache block
  – PACT is replaced by a BACT, tightly coupled with the L2 cache tag array (doesn't require separate tags and LRU states)
  – T1 and T2 are retuned for best performance
  – The destination bank selection algorithm is similar, except the load on a bank is the number of cache block fills to the bank
  – The destination set is selected by first looking for the next round-robin set with an invalid way, resorting to a random selection if none is found

27 PageNUCA (IIT, Kanpur) Dynamic cache block migration
The algorithm for locating a cache block in the L2 cache is similar
  – The per-core L1Map is now a replica of the forward L2Map so that an L1 cache miss request can be routed to the correct bank
  – As an optimization, the target set and way are also stored in the L1Map so that the L2 cache tag access latency can be eliminated (races with migration are resolved by NACKing the racing L1 cache request)
  – The forward and inverse L2Maps get bigger (same organization as the L2 cache)
  – The inverse L2Map shares its tag array with the L2 cache

28 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
 OS-assisted static page mapping
Simulation environment
Simulation results
An analytical model
Summary

29 PageNUCA (IIT, Kanpur) OS-assisted first touch mapping
The OS-assisted techniques (static or dynamic) change the default VA to PA mapping to indirectly achieve a "good" PA to L2 cache bank mapping
  – Contrast with the hardware techniques, which keep the VA to PA mapping unchanged and introduce a new PA to shadow PA indirection
  – First touch mapping is a static technique where the OS assigns a PA to a virtual page such that the PA maps to a bank local to the core touching the page for the first time
  – Resort to a spill mechanism if all local page frames are exhausted (e.g., pick the globally least loaded bank)
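A sketch of the first-touch policy with its spill path. The data structures here (per-bank free-frame lists and load counters) are illustrative stand-ins, not the OS interface used in the paper.

```python
def first_touch_frame(core, local_banks, free_frames, bank_load):
    """Return a physical frame for a page first touched by `core`.
    free_frames maps bank -> list of free frame numbers in that bank."""
    for bank in sorted(local_banks[core], key=lambda b: bank_load[b]):
        if free_frames[bank]:
            bank_load[bank] += 1
            return free_frames[bank].pop()
    # Spill: local frames exhausted; fall back to the globally least loaded bank.
    bank = min((b for b in free_frames if free_frames[b]),
               key=lambda b: bank_load[b])
    bank_load[bank] += 1
    return free_frames[bank].pop()
```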

30 PageNUCA (IIT, Kanpur) OS-assisted application-directed
The application can provide a one-time (manually coded) hint to the OS about the affinity of its data structures
  – The hint is sent through special system calls just before the first parallel section begins
  – Completely private data structures can provide accurate hints
  – Shared pages provide hints such that they are placed round-robin within the local banks of the sharing cores
  – Flushing the re-mapped pages from the cache hierarchy or copying them in memory is avoided by leveraging the hardware page-grain map tables

31 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
 Simulation environment
Simulation results
An analytical model
Summary

32 PageNUCA (IIT, Kanpur) Simulation environment
Single-node CMP with eight OOO cores
  – Private L1 caches: 32KB 4-way LRU
  – Shared L2 cache: 1MB 16-way LRU banks, 16 banks distributed over a bidirectional ring
  – Round-trip L2 cache hit latency from the L1 cache: maximum 20 ns, minimum 7.5 ns (local access), mean 13.75 ns (assumes uniform access distribution) [65 nm process, M5 for the ring with optimally placed repeaters]
  – Ring widths evaluated: 1024 bits, 512 bits, 256 bits (area based on wiring pitch: 30 mm², 15 mm², 7.5 mm²)
  – Off-die DRAM latency: 70 ns row miss, 30 ns row hit

33 PageNUCA (IIT, Kanpur) Simulation environment
Shared memory applications
  – Barnes, Ocean, Radix from SPLASH-2; Matrix (a sparse solver using iterative CG) from DIS; Equake from SPEC; FFTW
  – All optimized with array-based queue locks and tree barriers
Multi-programmed workloads
  – Mix of SPEC 2000 and BioBench
  – We report the average turn-around time (i.e., average CPI) for each application to commit a representative set of one billion dynamic instructions (identified using SimPoint)

34 PageNUCA (IIT, Kanpur) Storage overhead
Comparison of storage overhead between page-grain and block-grain migration
  – Page-grain: Proximity ROM (8 KB) + PACT (49 KB) + L1Maps (7.1 KB) + Forward L2Map (392 KB) + Inverse L2Map (392 KB) = 848.1 KB (4.8% of total L2 cache storage)
  – Block-grain: Proximity ROM (8 KB) + BACT (1088 KB) + L1Map (4864 KB) + Forward L2Map (608 KB) + Inverse L2Map (208 KB) = 6776 KB (28.5%)
  – Idealized block-grain: only one L1Map (608 KB) shared by all cores; total = 2520 KB (12.9%) [hard to plan the floor of the chip]

35 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
 Simulation results
An analytical model
Summary

36 PageNUCA (IIT, Kanpur) Performance comparison
[Chart: execution cycles for Barnes, Matrix, Equake, FFTW, Ocean, Radix, normalized to the baseline (lower is better), comparing page migration, block migration, first touch, application-directed, and perfect placement; two off-scale bars are labeled 1.46 and 1.69 and attributed to lock placement. Geometric mean reduction: 18.7% for page migration versus 22.5% for perfect placement]

37 PageNUCA (IIT, Kanpur) Performance comparison
[Chart: average cycles for multiprogrammed mixes MIX1–MIX8, normalized to the baseline (lower is better), comparing page migration, block migration, first touch, and perfect placement; one annotation marks a spill effect. Geometric mean reduction: 12.6% for page migration versus 15.2% for perfect placement]

38 PageNUCA (IIT, Kanpur) Performance analysis
Why page-grain sometimes outperforms block-grain (counter-intuitive)
  – Pipelined block transfer during page migration helps amortize the cost and allows page migration to be tuned more aggressively for T1 and T2
  – The degree of aggression is reflected in the local L2 cache access percentage:

            Base     Page     Block    FT       AP
    ShMem   21.0%    81.7%    72.6%    43.1%    54.1%
    MProg   21.6%    85.3%    84.0%    69.6%    —

39 PageNUCA (IIT, Kanpur) Performance analysis
Impact of ring bandwidth
  – Results presented so far assume a bidirectional data ring 1024 bits wide in each direction
  – A 256-bit data ring increases the execution time of page migration by 3.6% for the shared memory applications and by 1.3% for the multiprogrammed workloads
  – Block migration is more tolerant of ring bandwidth variation

40 PageNUCA (IIT, Kanpur) L1 cache prefetching
Impact of a per-core stride prefetcher with 16 read/write streams (execution time reduction):

            L1 Pref.   Page Mig.   Both
    ShMem   14.5%      18.7%       25.1%
    MProg   4.8%       12.6%       13.0%

41 PageNUCA (IIT, Kanpur) Energy Savings
Energy savings originate from
  – Reduced execution time
Potential show-stoppers
  – Extra dynamic interconnect energy due to migration
  – Extra leakage in the added SRAMs
  – Extra dynamic energy in consulting the additional tables and logic

42 PageNUCA (IIT, Kanpur) Energy Savings
Good news
  – Dynamic page migration is the most energy-efficient among all the options
  – Saves 14% energy for the shared memory applications and 11% for the multiprogrammed workloads compared to the baseline static NUCA
  – Extra leakage in the large tables kills block migration: it saves only 4% and 2% energy for the shared memory and multiprogrammed workloads, respectively

43 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
Simulation results
 An analytical model
Summary

44 PageNUCA (IIT, Kanpur) An analytical model
The normalized execution time with data migration is

    N = [rA + (1 - r)(s + t(1 - s))] / (rA + 1 - r)

where
  – r = L2$ miss rate
  – A = L2$ miss latency / avg. L2$ hit latency
  – s = ratio of the average hit latency after migration to that before migration
  – t = fraction of busy cycles
Observations: lim_{r→1} N = 1, lim_{s→1} N = 1, lim_{t→1} N = 1, lim_{A→∞} N = 1
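A quick numeric check of the model above; the parameter values plugged in are illustrative, not measured.

```python
def normalized_time(r, A, s, t):
    """N = [rA + (1-r)(s + t(1-s))] / (rA + 1 - r)."""
    return (r * A + (1 - r) * (s + t * (1 - s))) / (r * A + 1 - r)

# Example: 10% L2 miss rate, a miss costs 5x the average hit, migration cuts
# the average hit latency to 60% of before, 30% of cycles are busy cycles.
print(round(normalized_time(r=0.1, A=5.0, s=0.6, t=0.3), 3))   # ~= 0.82

# Limiting behaviour matches the slide: N -> 1 as r, s, or t -> 1.
assert abs(normalized_time(1.0, 5.0, 0.6, 0.3) - 1.0) < 1e-9
assert abs(normalized_time(0.1, 5.0, 1.0, 0.3) - 1.0) < 1e-9
assert abs(normalized_time(0.1, 5.0, 0.6, 1.0) - 1.0) < 1e-9
```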

45 PageNUCA (IIT, Kanpur) Sketch
Preliminaries
  – Why page-grain
  – Hypothesis and observations
Dynamic page migration
Dynamic cache block migration
OS-assisted static page mapping
Simulation environment
Simulation results
An analytical model
 Summary

46 PageNUCA (IIT, Kanpur) Summary
Explored hardwired and OS-assisted page migration in CMPs
Page migration reduces execution time by 18.7% for shared memory applications and 12.6% for multiprogrammed workloads
Storage overhead of page migration is less than 5%
Performance-optimized block migration algorithms come close to page migration, but require at least 13% extra storage

47 PageNUCA (IIT, Kanpur) Acknowledgments
Intel Research Council
  – Financial support
Gautam Doshi
  – Moral support, useful "tele-brain-storming"
Vijay Degalahal, Jugash Chandarlapati
  – HSPICE simulations for leakage modeling
Sreenivas Subramoney
  – Detailed feedback on an early manuscript
Kiran Panesar, Shubhra Roy, Manav Subodh
  – Initial connections

48 PageNUCA: Selected Policies for Page-grain Locality Management in Large Shared CMP Caches Mainak Chaudhuri, IIT Kanpur mainakc@iitk.ac.in [Presented at HPCA’09] THANK YOU!

