Slide 1: Managing Wire Delay in Large CMP Caches
Bradford M. Beckmann and David A. Wood
Multifacet Project, University of Wisconsin-Madison
MICRO 2004, 12/8/04
Slide 2: Overview
Managing wire delay in shared CMP caches
Three techniques extended to CMPs:
1. On-chip strided prefetching (not in talk; see paper)
   – Scientific workloads: 10% average reduction
   – Commercial workloads: 3% average reduction
2. Cache block migration (e.g., D-NUCA)
   – Block sharing limits the average reduction to 3%
   – Depends on a difficult-to-implement smart search
3. On-chip transmission lines (e.g., TLC)
   – Reduce runtime by 8% on average
   – Bandwidth contention accounts for 26% of L2 hit latency
Combining techniques:
   + Potentially alleviates isolated deficiencies
   – Up to 19% reduction vs. baseline
   – Implementation complexity
Slide 3: Current CMP: IBM Power 5
[figure: 2 CPUs, each with L1 I$ and D$, sharing 3 L2 cache banks]
Slide 4: CMP Trends
[figure: a 2-CPU CMP in 2004 technology vs. an 8-CPU CMP in 2010 technology; the on-chip distance reachable per cycle shrinks relative to the die]
Slide 5: Baseline: CMP-SNUCA
[figure: 8 CPUs with private L1 I$ and D$ surrounding a shared, statically mapped (S-NUCA) L2]
Slide 6: Outline
Global interconnect and CMP trends
Latency management techniques
Evaluation
– Methodology
– Block migration: CMP-DNUCA
– Transmission lines: CMP-TLC
– Combination: CMP-Hybrid
Slide 7: Block Migration: CMP-DNUCA
[figure: same layout as CMP-SNUCA; blocks A and B migrate through the banks toward their requesting CPUs]
Slide 8: On-chip Transmission Lines
Similar to contemporary off-chip communication
Provide a different latency/bandwidth tradeoff
Wires behave more like transmission lines as frequency increases
– Utilize transmission-line qualities to our advantage
– No repeaters: route directly over large structures
– ~10x lower latency across long distances
Limitations
– Require thick wires and dielectric spacing
– Increase manufacturing cost
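The ~10x latency claim above can be pictured with a back-of-the-envelope comparison: an on-chip transmission line carries a signal near the speed of light in the dielectric, while a conventional repeated RC wire accumulates delay roughly linearly with length. The sketch below is illustrative only; the permittivity and the 60 ps/mm repeated-wire figure are assumptions, not numbers from the paper.

```python
# Rough latency comparison: repeated RC wire vs. on-chip transmission line.
# All constants here are illustrative assumptions, not the paper's figures.

C = 3e8            # speed of light in vacuum, m/s
EPS_R = 3.9        # assumed relative permittivity of an SiO2-like dielectric

def transmission_line_delay_ns(length_mm):
    """LC-mode propagation: near the speed of light in the dielectric."""
    velocity = C / EPS_R ** 0.5                  # m/s
    return (length_mm * 1e-3) / velocity * 1e9   # ns

def repeated_wire_delay_ns(length_mm, delay_per_mm_ns=0.06):
    """Optimally repeated RC wire: delay grows linearly with length;
    60 ps/mm is an assumed figure for a global-layer wire."""
    return length_mm * delay_per_mm_ns

length = 20.0  # mm, roughly the span of a large die
tl = transmission_line_delay_ns(length)
rc = repeated_wire_delay_ns(length)
print(f"transmission line: {tl:.3f} ns, repeated wire: {rc:.3f} ns, "
      f"ratio: {rc / tl:.1f}x")
```

Under these assumed numbers the ratio comes out near an order of magnitude, consistent with the slide's "~10x lower latency across long distances".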
Slide 9: Transmission Lines: CMP-TLC
[figure: 8 CPUs and their L1 I$/D$ caches at the chip edges, connected to a central L2 by 16 8-byte transmission-line links]
Slide 10: Combination: CMP-Hybrid
[figure: the CMP-SNUCA layout augmented with 8 32-byte transmission-line links from the CPUs to the center banks]
Slide 11: Outline (repeat of slide 6)
Slide 12: Methodology
Full-system simulation
– Simics
– Timing-model extensions
  Out-of-order processor
  Memory system
Workloads
– Commercial: apache, jbb, oltp, zeus
– Scientific
  Splash: barnes & ocean
  SpecOMP: apsi & fma3d
Slide 13: System Parameters
Memory system
– L1 I & D caches: 64 KB, 2-way, 3 cycles
– Unified L2 cache: 16 MB, 256 x 64 KB banks, 16-way, 6-cycle bank access
– L1/L2 cache block size: 64 bytes
– Memory latency: 260 cycles
– Memory bandwidth: 320 GB/s
– Memory size: 4 GB of DRAM
– Outstanding memory requests per CPU: 16
Dynamically scheduled processor
– Clock frequency: 10 GHz
– Reorder buffer / scheduler: 128 / 64 entries
– Pipeline width: 4-wide fetch & issue
– Pipeline stages: 30
– Direct branch predictor: 3.5 KB YAGS
– Return address stack: 64 entries
– Indirect branch predictor: 256 entries (cascaded)
Slide 14: Outline (repeat of slide 6)
Slide 15: CMP-DNUCA: Organization
[figure: L2 banks grouped into bankclusters: local, inter., and center, arranged around the 8 CPUs]
Slide 16: Hit Distribution: Grayscale Shading
[figure: CMP layout shaded per bank; darker shading marks a greater % of L2 hits]
Slide 17: CMP-DNUCA: Migration
Migration policy
– Gradual movement
– Increases local hits and reduces distant hits
[figure: blocks migrate from other bankclusters to my center bankcluster, then my inter. bankcluster, then my local bankcluster]
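The gradual-movement policy above can be sketched as follows. The bankcluster labels and the one-step-per-hit rule are a hypothetical simplification of the paper's policy: on each L2 hit, the block moves one bankcluster closer to the requesting CPU rather than jumping straight to it.

```python
# Sketch of gradual block migration (assumed simplification): each L2 hit
# promotes the block one step along the chain
#   another CPU's bankcluster -> my center -> my inter. -> my local.

def next_bankcluster(current, cpu):
    """One migration step toward `cpu`; bankclusters are labeled
    ('local', i), ('inter', i), or ('center', i) for CPU i."""
    kind, owner = current
    if owner != cpu:
        return ("center", cpu)               # other CPU's cluster -> my center
    return {"center": ("inter", cpu),        # my center -> my inter.
            "inter": ("local", cpu),         # my inter. -> my local
            "local": ("local", cpu)}[kind]   # already local: stay put

class Block:
    """A cache block that migrates gradually toward its requesters."""
    def __init__(self, home):
        self.loc = home

    def on_l2_hit(self, cpu):
        self.loc = next_bankcluster(self.loc, cpu)
```

Repeated hits from a single CPU pull the block into that CPU's local bankcluster, while alternating hits from two sharers keep dragging it back toward each requester's center, which is one way to picture why sharing limits migration's benefit.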
Slide 18: CMP-DNUCA: Hit Distribution (Ocean, per CPU)
[figure: per-CPU L2 hit distributions for CPUs 0-7]
Slide 19: CMP-DNUCA: Hit Distribution (Ocean, all CPUs)
[figure] Block migration successfully separates the data sets.
Slide 20: CMP-DNUCA: Hit Distribution (OLTP, all CPUs)
[figure]
Slide 21: CMP-DNUCA: Hit Distribution (OLTP, per CPU)
[figure: per-CPU L2 hit distributions for CPUs 0-7]
Hit clustering: most L2 hits are satisfied by the center banks.
Slide 22: CMP-DNUCA: Search
Search policy
– Uniprocessor DNUCA solution: partial tags
  A quick summary of the L2 tag state kept at the CPU
  No known practical implementation for CMPs:
  – Size impact of multiple partial tags
  – Coherence between block migrations and partial-tag state
– CMP-DNUCA solution: two-phase search
  1st phase: probe the CPU's local, inter., and the 4 center banks
  2nd phase: probe the remaining 10 banks
  Penalty: 2nd-phase hits and L2 misses are slow
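The two-phase policy above can be sketched as follows. The bank numbering and the CPU-to-bankcluster mapping here are hypothetical stand-ins, since the slide does not spell them out; the structure (6 nearby banks first, the remaining 10 second) follows the slide.

```python
# Sketch of the CMP-DNUCA two-phase search over 16 banks (numbering is an
# assumption): bank i is CPU i's local bank (0-7), banks 8-11 are shared
# intermediate banks, and banks 12-15 are the center banks.

LOCAL = {cpu: cpu for cpu in range(8)}
INTER = {cpu: 8 + cpu // 2 for cpu in range(8)}  # assumed CPU pairing
CENTER = [12, 13, 14, 15]

def two_phase_search(cpu, probe):
    """probe(bank) -> True on a tag match. Returns (bank, phase) on a
    hit, or (None, 2) on an L2 miss after probing all 16 banks."""
    phase1 = [LOCAL[cpu], INTER[cpu], *CENTER]
    for bank in phase1:                      # 1st phase: 6 nearby banks
        if probe(bank):
            return bank, 1
    for bank in range(16):                   # 2nd phase: remaining 10 banks
        if bank not in phase1 and probe(bank):
            return bank, 2
    return None, 2                           # miss: both phases paid for
```

The latency penalty falls out of the structure: a 2nd-phase hit or a miss only resolves after both rounds of probes, which is why fast searches matter so much for CMP-DNUCA.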
Slide 23: CMP-DNUCA: L2 Hit Latency
[chart]
Slide 24: CMP-DNUCA Summary
Limited success
– Ocean successfully splits: a regular scientific workload with little sharing
– OLTP congregates in the center: a commercial workload with significant sharing
Smart search mechanism
– Necessary for a performance improvement
– No known implementations
– Upper bound: perfect search
Slide 25: Outline (repeat of slide 6)
Slide 26: L2 Hit Latency
[chart; bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]
Slide 27: Overall Performance
[chart] Transmission lines improve both L2 hit and L2 miss latency.
Slide 28: Conclusions
Individual latency management techniques
– Strided prefetching: covers only a subset of misses
– Cache block migration: sharing impedes migration
– On-chip transmission lines: limited bandwidth
Combination: CMP-Hybrid
– Potentially alleviates bottlenecks
– Disadvantages: relies on a smart-search mechanism; manufacturing cost of transmission lines
Slide 29: Backup Slides
Slide 30: Strided Prefetching
Exploits repeatable memory access patterns
– Covers a subset of misses
– Tolerates latency within the memory hierarchy
Our implementation
– Similar to Power4
– Detects unit and non-unit stride misses
– Prefetches at both levels: L1 to L2, and L2 to memory
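A minimal stride-prefetcher sketch in the spirit described above (unit and non-unit strides, detected from the miss stream). The PC-indexed table, the two-miss confirmation threshold, and the prefetch degree of four are illustrative assumptions, not the paper's parameters.

```python
# Sketch of a stride prefetcher: track (last address, stride, confidence)
# per instruction, and issue prefetches once a stride repeats enough times.
# Table indexing, threshold, and degree are assumed, not from the paper.

class StridePrefetcher:
    def __init__(self, threshold=2, degree=4):
        self.table = {}              # pc -> (last_addr, stride, confidence)
        self.threshold = threshold   # repeats needed before prefetching
        self.degree = degree         # how many blocks ahead to fetch

    def on_miss(self, pc, addr):
        """Called on a cache miss; returns addresses to prefetch."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None and stride != 0 and addr - last == stride:
            conf += 1                # stride confirmed again
        else:
            stride = addr - last if last is not None else 0
            conf = 0                 # new or broken stride: retrain
        self.table[pc] = (addr, stride, conf)
        if conf >= self.threshold and stride != 0:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []
```

Because the detector keys on the observed stride rather than assuming stride 1, the same logic covers unit strides (consecutive 64-byte blocks) and non-unit strides, matching the slide's "unit and non-unit stride misses".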
Slide 31: On- and Off-chip Prefetching
[chart: commercial and scientific benchmarks]
Slide 32: CMP Sharing Patterns
[chart]
Slide 33: CMP Request Distribution
[chart]
Slide 34: CMP-DNUCA: Search Strategy
[figure: bankclusters grouped as local, inter., and center; the 1st search phase covers the requesting CPU's nearby banks, the 2nd phase the remainder]
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
Slide 35: CMP-DNUCA: Migration Strategy
[figure: migration steps from other local, other inter., and other center bankclusters through my center and my inter. to my local bankcluster]
Slide 36: Uncontended Latency Comparison
[chart]
Slide 37: CMP-DNUCA: L2 Hit Distribution
[chart: per-benchmark breakdown]
Slide 38: CMP-DNUCA: L2 Hit Latency
[chart]
Slide 39: CMP-DNUCA: Runtime
[chart]
Slide 40: CMP-DNUCA Problems
Hit clustering
– Shared blocks move to the center banks, equally far from all processors
Search complexity
– 16 separate bankclusters
– Partial tags impractical: distributed information, synchronization complexity
Slide 41: CMP-TLC: L2 Hit Latency
[chart; bars labeled D: CMP-DNUCA, T: CMP-TLC]
Slide 42: Runtime: Isolated Techniques
[chart]
Slide 43: CMP-Hybrid: Performance
[chart]
Slide 44: Energy Efficiency
[chart]