
1 Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. Nikos Hardavellas, Northwestern University. Team: M. Ferdman, B. Falsafi, A. Ailamaki (Northwestern, Carnegie Mellon, EPFL).

2 Moore's Law Is Alive And Well. Device scaling continues for at least another 10 years. [Figure: a 90nm transistor (Intel, 2005) next to the Swine Flu A/H1N1 virus (CDC) for scale; process-node roadmap from 65nm onward, extending to 2019.]

3 Moore's Law Is Alive And Well. But the good days ended Nov. [Yelick09] "New" Moore's Law: 2× cores with every generation. On-chip cache grows commensurately to supply all cores with data.

4 Larger Caches Are Slower Caches. Large caches mean slow access; increasing access latency forces caches to be distributed.

5 Cache design trends. As caches become bigger, they get slower. Split the cache into smaller "slices" and balance cache slice access with network latency.

6 Modern Caches: Distributed. Split the cache into "slices" and distribute them across the die, one L2 slice per core tile.

7 Data Placement Determines Performance. Goal: place data on chip close to where they are used. [Figure: tiled CMP; each core paired with its own L2 cache slice.]

8 Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture). Data may exhibit arbitrarily complex behaviors ...but few that matter! Learn the behaviors at run time & exploit their characteristics:  make the common case fast, the rare case correct  resolve conflicting requirements

9 Reactive Nonuniform Cache Architecture. Cache accesses can be classified at run time  each class is amenable to a different placement. Per-class block placement:  simple, scalable, transparent  no need for HW coherence mechanisms at the LLC  up to 32% speedup (17% on average)  −5% on avg. from an ideal cache organization. Rotational interleaving:  data replication and fast single-probe lookup. [Hardavellas et al., ISCA 2009] [Hardavellas et al., IEEE Micro Top Picks 2010]

10 Outline: Introduction  Why do Cache Accesses Matter?  Access Classification and Block Placement  Reactive NUCA Mechanisms  Evaluation  Conclusion

11 Cache accesses dominate execution; the bottleneck shifts from memory stalls to L2-hit stalls. [Chart: execution-time breakdown on a 4-core CMP running DSS (TPC-H/DB2, 1GB database), with an "Ideal" bar; lower is better. Hardavellas et al., CIDR 2007]

12 How much do we lose? We lose half the potential throughput. [Chart: throughput on a 4-core CMP running DSS (TPC-H/DB2, 1GB database); higher is better.]

13 Outline: Introduction  Why do Cache Accesses Matter?  Access Classification and Block Placement  Reactive NUCA Mechanisms  Evaluation  Conclusion

14 Terminology: Data Types. Private: read or written by a single core. Shared Read-Only: read by multiple cores. Shared Read-Write: read and written by multiple cores. [Figure: cores accessing blocks in the distributed L2.]

15 Distributed shared L2. Blocks are address-interleaved across the slices (address mod #slices), giving a unique location for any block (private or shared). Maximum capacity, but slow access (30+ cycles).
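For concreteness, a minimal sketch of this address-interleaved home-slice computation, assuming 64-byte blocks and 16 slices (illustrative parameters, not specified on the slide):

```c
#include <stdint.h>

#define NUM_SLICES 16  /* assumed tile count; one L2 slice per tile */
#define BLOCK_BITS 6   /* assumed 64-byte cache blocks */

/* Conventional distributed shared L2: every block has a unique home
 * slice chosen by interleaving on the block address (address mod
 * #slices), regardless of whether the data is private or shared. */
static inline unsigned shared_home_slice(uint64_t paddr)
{
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
}
```

A block's home may thus be many network hops away from the only core that uses it, which is what makes the average access slow.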

16 Distributed private L2. On every access, allocate the data at the local L2 slice. Fast access to core-private data.

17 Distributed private L2: shared-RO access. Shared read-only data are replicated across the L2 slices (every access allocates at the local slice). Wastes capacity due to replication.

18 Distributed private L2: shared-RW access. Shared read-write data must be kept coherent via indirection through a directory (dir), with every access still allocating at the local slice. Slow for shared read-write; wastes capacity (directory overhead) and bandwidth.

19 Conventional Multi-Core Caches. Shared: address-interleave blocks  high capacity, but slow access. Private: each block cached locally  fast (local) access, but low capacity (replicas) and coherence via indirection (a distributed directory). We want: high capacity (shared) + fast access (private).

20 Where to Place the Data? Close to where they are used! Accessed by a single core: migrate locally. Accessed by many cores: replicate (?)  if read-only, replication is OK  if read-write, coherence is a problem. Low reuse: evenly distribute across the sharers. [Chart axes: #sharers vs. read-write/read-only, with migrate, replicate, and share regions; see the sketch below.]
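The sketch referenced above: a hedged rendering of the slide's decision rules, where the type names and the single-sharer threshold are illustrative rather than taken from a concrete implementation:

```c
/* Placement policies from the slide: migrate (single core), replicate
 * (read-only, many sharers), or share/interleave (read-write, many
 * sharers, low reuse). */
typedef enum { PLACE_MIGRATE, PLACE_REPLICATE, PLACE_SHARE } placement_t;

static placement_t choose_placement(unsigned num_sharers, int is_written)
{
    if (num_sharers <= 1)
        return PLACE_MIGRATE;    /* accessed by one core: keep it local */
    if (!is_written)
        return PLACE_REPLICATE;  /* read-only: replicas stay coherent */
    return PLACE_SHARE;          /* read-write, low reuse: one copy,
                                    evenly distributed across sharers */
}
```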

21 Methodology. Flexus: full-system cycle-accurate timing simulation [Hardavellas et al., SIGMETRICS-PER 2004; Wenisch et al., IEEE Micro 2006]. Model parameters:  tiled CMP, LLC = L2  server/scientific workloads: 16 cores, 1MB/core  multi-programmed workloads: 8 cores, 3MB/core  OoO cores, 2GHz, 96-entry ROB  folded 2D torus, 2-cycle router, 1-cycle link  45ns memory. Workloads:  OLTP: TPC-C WH (IBM DB2 v8, Oracle 10g)  DSS: TPC-H queries 6, 8, 13 (IBM DB2 v8)  SPECweb99 on Apache 2.0  multiprogrammed: SPEC CPU2000  scientific: em3d

22 Cache Access Classification Example. Each bubble represents cache blocks shared by x cores; bubble size is proportional to the % of L2 accesses; the y-axis is the % of blocks in the bubble that are read-write.

23 Cache Access Clustering. Accesses naturally form 3 clusters: read-write blocks accessed by one core  migrate locally; read-write blocks with many sharers  share (addr-interleave); read-only blocks with many sharers  replicate. [Charts: Server apps and Scientific/MP apps, #sharers vs. % read-write blocks in bubble.]

24 Instruction Replication. The instruction working set is too large for one cache slice: distribute it within a cluster of neighbors, and replicate it across clusters.

25 Reactive NUCA in a nutshell. To place cache blocks, we first need to classify them. Classify accesses:  private data: like the private scheme (migrate)  shared data: like the shared scheme (interleave)  instructions: controlled replication (middle ground)

26 Outline: Introduction  Access Classification and Block Placement  Reactive NUCA Mechanisms  Evaluation  Conclusion

27 Classification Granularity. Per-block classification:  high area/power overhead (cuts the L2 size by half)  high latency (indirection through a directory). Per-page classification (utilizing the OS page table):  persistent structure  the core accesses the page table on every access anyway (TLB)  utilizes already-existing SW/HW structures and events  page classification is accurate (<0.5% error). So: classify entire data pages, with the page table/TLB doing the bookkeeping.

28 Classification Mechanisms. Instruction classification: all accesses from the L1-I (per-block). Data classification: private/shared, per page, at TLB miss. On the 1st access (core i misses in the TLB on Ld A): the OS marks the page private to i. On an access by another core j (TLB miss on Ld A): the OS re-classifies the page as shared. Bookkeeping happens through the OS page table and TLB, as sketched below.
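A rough sketch of that OS flow, assuming a minimal page-table extension; all type and helper names here are hypothetical:

```c
#include <stdint.h>

/* Hypothetical helpers: invalidate the old owner's TLB entry and flush
 * the page's blocks from its L2 slice before re-classifying. */
void tlb_shootdown(int core);
void flush_slice_copies(int core, uint64_t vpage);

typedef enum { PAGE_INVALID, PAGE_PRIVATE, PAGE_SHARED } page_class_t;

typedef struct {
    uint64_t     vpage;
    page_class_t cls;    /* P/S/I classification             */
    int          owner;  /* core that first touched the page */
} pte_ext_t;

/* Run by the OS on a data-page TLB miss, following the slide's flow. */
void classify_on_tlb_miss(pte_ext_t *pte, int core)
{
    if (pte->cls == PAGE_INVALID) {        /* 1st access anywhere */
        pte->cls   = PAGE_PRIVATE;         /* page private to this core */
        pte->owner = core;
    } else if (pte->cls == PAGE_PRIVATE && pte->owner != core) {
        /* Access by another core: demote private -> shared. */
        tlb_shootdown(pte->owner);
        flush_slice_copies(pte->owner, pte->vpage);
        pte->cls = PAGE_SHARED;
    }
    /* PAGE_SHARED stays shared; the TLB fill then proceeds as usual. */
}
```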

29 Page Table and TLB Extensions. Page granularity allows simple + practical HW. Page table entry: vpage, ppage, L2 id (log(n) bits), P/S/I (2 bits). TLB entry: vpage, ppage, P/S (1 bit). The core accesses the page table on every access anyway (TLB), so we pass information from the "directory" to the core and utilize already-existing SW/HW structures and events.
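In C terms, the extensions might look as follows; the vpage/ppage field widths are illustrative, while the 2-bit class, log2(n)-bit L2 id, and 1-bit P/S flag follow the slide:

```c
/* Page-table entry extension: the usual translation plus a 2-bit
 * P/S/I class and a log2(n)-bit L2 slice id (n = 16 cores -> 4 bits). */
struct rnuca_pte {
    unsigned vpage : 24;  /* illustrative width */
    unsigned ppage : 24;  /* illustrative width */
    unsigned l2_id : 4;   /* log2(n) bits */
    unsigned cls   : 2;   /* private / shared / invalid */
};

/* TLB entry extension: only one extra bit reaches the core. */
struct rnuca_tlb_entry {
    unsigned vpage  : 24;
    unsigned ppage  : 24;
    unsigned shared : 1;  /* P/S */
};
```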

30 Data Class Bookkeeping and Lookup. Page table entry: vpage, ppage, L2 id, P/S. TLB entry for a private page (vpage, ppage, P): place the data in the local L2 slice. TLB entry for a shared page (vpage, ppage, S): place the data in the aggregate L2, address-interleaved (the physical address splits into tag, cache index, and offset).
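A compact sketch of the resulting lookup, reusing the illustrative slice count and block size from earlier:

```c
#include <stdint.h>

#define NUM_SLICES 16  /* assumed tile count */
#define BLOCK_BITS 6   /* assumed 64-byte blocks */

/* Steer a data access using only the TLB's P/S bit: private pages go
 * to the requesting core's local slice; shared pages go to the
 * address-interleaved slice of the aggregate L2. */
static inline unsigned data_home_slice(uint64_t paddr, int shared,
                                       unsigned local_slice)
{
    if (!shared)
        return local_slice;
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
}
```

Note the contrast with the conventional shared L2 earlier: the same interleaving function is used, but only for pages the OS has marked shared.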

31 Coherence: No Need for HW Mechanisms at the LLC. The Reactive NUCA placement guarantees that each read-write datum lives in a unique and known location: private data in the local slice, shared data address-interleaved. Fast access, eliminates HW overhead, SIMPLE.

32 Instructions Lookup: Rotational Interleaving. Size-4 clusters: the local slice + 3 neighbors. Each slice has a log2(k)-bit rotational ID (RID), and each slice caches the same blocks on behalf of any cluster that contains it. Fast access (nearest-neighbor, simple lookup), balancing access latency with capacity constraints, and equal capacity pressure at the overlapped slices. [Figure: PC 0xfa480 mapped to a slice via RID and address bits; see the sketch below.]
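The sketch referenced above: a single-probe instruction lookup for size-4 clusters. The cluster layout is passed in explicitly here; the paper instead fixes a tiling in which every tile plus its three cluster neighbors covers all four RIDs, so the loop below always finds exactly one match:

```c
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte blocks */

typedef struct {
    int tile;   /* tile id of a slice in this core's size-4 cluster */
    int rid;    /* its 2-bit rotational ID (log2(4) bits)           */
} cluster_slice_t;

/* An instruction block's home within the cluster is the one slice
 * whose RID matches two bits of the address, so the requesting core
 * always probes exactly one nearest-neighbor slice. */
static int instr_home_slice(uint64_t pc, const cluster_slice_t cluster[4])
{
    int want = (int)((pc >> BLOCK_BITS) & 0x3);
    for (int i = 0; i < 4; i++)
        if (cluster[i].rid == want)
            return cluster[i].tile;
    return cluster[0].tile;  /* unreachable when RIDs cover 0..3 */
}
```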

33 Outline: Introduction  Access Classification and Block Placement  Reactive NUCA Mechanisms  Evaluation  Conclusion

34 Evaluation. R-NUCA delivers robust performance across workloads. Vs. shared: the same for Web and DSS, 17% better for OLTP and MIX. Vs. private: 17% better for OLTP, Web, and DSS, the same for MIX. [Chart: Shared (S), R-NUCA (R), and Ideal (I) per workload.]

35 Conclusions. Data may exhibit arbitrarily complex behaviors ...but few that matter! Learn the behaviors that matter at run time  make the common case fast, the rare case correct. Reactive NUCA: near-optimal cache block placement  simple, scalable, low-overhead, transparent, no coherence  robust performance: matches the best alternative or beats it by 17%, up to 32%  near-optimal placement (−5% avg. from ideal)

36 Thank You! For more information: N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), January/February 2010. N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.

37 BACKUP SLIDES

38 Why Are Caches Growing So Large? Increasing number of cores: the cache grows commensurately  fewer but faster cores have the same effect. Increasing datasets: growing faster than Moore's Law! Power/thermal efficiency: caches are "cool", cores are "hot"  so it's easier to fit more cache in a power budget. Limited bandwidth: a large cache means more data on chip  off-chip pins are used less frequently.

39 Backup Slides: ASR

40 ASR vs. R-NUCA Configurations (ASR-1 / ASR-2 / R-NUCA):
 Core type: In-Order (ASR) vs. OoO (R-NUCA)
 L2 size (MB): 4 vs. 16
 Memory latency (relative to local L2): 12.5× / 25.0× / 5.6×
 Avg. shared-L2 latency (relative to local L2): 2.1× / 2.2× / 38%

41 ASR design space search

42 Backup Slides: Prior Work

43 Prior Work. Several proposals for CMP cache management (ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA), but they suffer from shortcomings:  complex, high-latency lookup/coherence  don't scale  lower effective cache capacity  optimize only for a subset of accesses. We need a simple, scalable mechanism for fast access to all data.

44 Shortcomings of prior work. L2-Private:  wastes capacity  high latency (3 slice accesses + 3 hops on shared data). L2-Shared:  high latency. Cooperative Caching:  doesn't scale (centralized tag structure). CMP-NuRapid:  high latency (pointer dereference, 3 hops on shared data). OS-managed L2:  wastes capacity (migrates all blocks)  spilling to neighbors is useless (all cores run the same code).

45 Shortcomings of Prior Work. D-NUCA:  no practical implementation (lookup?). Victim Replication:  high latency (like L2-Private)  wastes capacity (the home always stores the block). Adaptive Selective Replication (ASR):  high latency (like L2-Private)  capacity pressure (replicates at slice granularity)  complex (4 separate HW structures to bias the coin).

46 Backup Slides: Classification and Lookup

47 Data Classification Timeline. Core i, Ld A, TLB miss: the OS marks the page private to i (vpage, ppage, i, P) and A is allocated in i's local slice. Core j (j ≠ i), Ld A, TLB miss: the OS re-classifies the page as shared (vpage, ppage, x, S), invalidates the entry in TLBi, and evicts A from i's slice; core k then allocates A at its interleaved location and receives the reply. Fast & simple lookup for data.

48 Misclassifications at Page Granularity. A page may service multiple access types, but one type always dominates the accesses, so classification at page granularity is accurate. [Chart: accesses from pages with multiple access types vs. actual access misclassifications.]

49 Backup Slides: Placement

50 Private Data Placement. Store in the local L2 slice (as in a private cache). Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads.

51 Private Data Working Set.  OLTP: small per-core working set (3MB / 16 cores ≈ 200KB/core)  Web: primary working set <6KB/core; the remainder is <1.5% of L2 refs  DSS: policy doesn't matter much (>100MB working set, <13% of L2 refs  very low reuse on private data)

52 Shared Data Placement. Address-interleave in the aggregate L2 (as in a shared cache). Read-write data + a large working set + low reuse  unlikely to still be in the local slice for reuse. Also, the next sharer is random [WMPI'04].

53 Shared Data Working Set

54 Instruction Placement. Share in clusters of neighbors, replicate across clusters. The working set is too large for one slice  slices store private & shared data too  4 L2 slices provide sufficient capacity.

55 Instructions Working Set

56 Backup Slides: Rotational Interleaving

57 Instruction Classification and Lookup. Identification: all accesses from the L1-I. But the working set is too large to fit in one cache slice: share within a cluster of neighbors, replicate across clusters.

58 Rotational Interleaving. Each tile has a TileID and a log2(k)-bit RotationalID. Fast access (nearest-neighbor, simple lookup); equalizes capacity pressure at the overlapping slices. [Figure: tile grid annotated with tile IDs and rotational IDs.]

59 Nearest-neighbor size-8 clusters

