Slide 1: Managing Wire Delay in Large CMP Caches
Bradford M. Beckmann and David A. Wood
Multifacet Project, University of Wisconsin-Madison
MICRO 2004, 12/8/04
Slide 2: Overview
Managing wire delay in shared CMP caches
Three techniques extended to CMPs:
1. On-chip strided prefetching (not in talk; see paper)
   – Scientific workloads: 10% average reduction
   – Commercial workloads: 3% average reduction
2. Cache block migration (e.g., D-NUCA)
   – Block sharing limits the average reduction to 3%
   – Depends on a difficult-to-implement smart search
3. On-chip transmission lines (e.g., TLC)
   – Reduce runtime by 8% on average
   – Bandwidth contention accounts for 26% of L2 hit latency
Combining techniques:
   + Potentially alleviates isolated deficiencies
   – Up to 19% reduction vs. baseline
   – Implementation complexity
Slide 3: Current CMP: IBM Power 5
[figure: 2 CPUs, each with L1 I$ and D$, sharing 3 L2 cache banks]
Slide 4: CMP Trends
[figure: a 2-CPU CMP in 2004 technology vs. an 8-CPU CMP in 2010 technology; the on-chip distance reachable per cycle shrinks relative to the die]
Slide 5: Baseline: CMP-SNUCA
[figure: 8 CPUs with private L1 I$ and D$ surrounding a shared, statically mapped (S-NUCA) L2]
Slide 6: Outline
Global interconnect and CMP trends
Latency management techniques
Evaluation
– Methodology
– Block migration: CMP-DNUCA
– Transmission lines: CMP-TLC
– Combination: CMP-Hybrid
Slide 7: Block Migration: CMP-DNUCA
[figure: same layout as CMP-SNUCA; blocks A and B migrate through the banks toward their requesting CPUs]
Slide 8: On-chip Transmission Lines
Similar to contemporary off-chip communication
Provide a different latency/bandwidth tradeoff
Wires behave more like transmission lines as frequency increases
– Utilize transmission-line qualities to our advantage
– No repeaters: route directly over large structures
– ~10x lower latency across long distances
Limitations
– Require thick wires and dielectric spacing
– Increase manufacturing cost
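The ~10x latency claim above can be pictured with a back-of-the-envelope comparison: an on-chip transmission line carries a signal near the speed of light in the dielectric, while a conventional repeated RC wire accumulates delay roughly linearly with length. The sketch below is illustrative only; the permittivity and the 60 ps/mm repeated-wire figure are assumptions, not numbers from the paper.

```python
# Rough latency comparison: repeated RC wire vs. on-chip transmission line.
# All constants here are illustrative assumptions, not the paper's figures.

C = 3e8            # speed of light in vacuum, m/s
EPS_R = 3.9        # assumed relative permittivity of an SiO2-like dielectric

def transmission_line_delay_ns(length_mm):
    """LC-mode propagation: near the speed of light in the dielectric."""
    velocity = C / EPS_R ** 0.5                  # m/s
    return (length_mm * 1e-3) / velocity * 1e9   # ns

def repeated_wire_delay_ns(length_mm, delay_per_mm_ns=0.06):
    """Optimally repeated RC wire: delay grows linearly with length;
    60 ps/mm is an assumed figure for a global-layer wire."""
    return length_mm * delay_per_mm_ns

length = 20.0  # mm, roughly the span of a large die
tl = transmission_line_delay_ns(length)
rc = repeated_wire_delay_ns(length)
print(f"transmission line: {tl:.3f} ns, repeated wire: {rc:.3f} ns, "
      f"ratio: {rc / tl:.1f}x")
```

Under these assumed numbers the ratio comes out near an order of magnitude, consistent with the slide's "~10x lower latency across long distances".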
Slide 9: Transmission Lines: CMP-TLC
[figure: 8 CPUs and their L1 I$/D$ caches at the chip edges, connected to a central L2 by 16 8-byte transmission-line links]
Slide 10: Combination: CMP-Hybrid
[figure: the CMP-SNUCA layout augmented with 8 32-byte transmission-line links from the CPUs to the center banks]
Slide 11: Outline (repeat of slide 6)
Slide 12: Methodology
Full-system simulation
– Simics
– Timing-model extensions
  Out-of-order processor
  Memory system
Workloads
– Commercial: apache, jbb, oltp, zeus
– Scientific
  Splash: barnes & ocean
  SpecOMP: apsi & fma3d
Slide 13: System Parameters
Memory system
– L1 I & D caches: 64 KB, 2-way, 3 cycles
– Unified L2 cache: 16 MB, 256 x 64 KB banks, 16-way, 6-cycle bank access
– L1/L2 cache block size: 64 bytes
– Memory latency: 260 cycles
– Memory bandwidth: 320 GB/s
– Memory size: 4 GB of DRAM
– Outstanding memory requests per CPU: 16
Dynamically scheduled processor
– Clock frequency: 10 GHz
– Reorder buffer / scheduler: 128 / 64 entries
– Pipeline width: 4-wide fetch & issue
– Pipeline stages: 30
– Direct branch predictor: 3.5 KB YAGS
– Return address stack: 64 entries
– Indirect branch predictor: 256 entries (cascaded)
Slide 14: Outline (repeat of slide 6)
Slide 15: CMP-DNUCA: Organization
[figure: L2 banks grouped into bankclusters: local, inter., and center, arranged around the 8 CPUs]
Slide 16: Hit Distribution: Grayscale Shading
[figure: CMP layout shaded per bank; darker shading marks a greater % of L2 hits]
Slide 17: CMP-DNUCA: Migration
Migration policy
– Gradual movement
– Increases local hits and reduces distant hits
[figure: blocks migrate from other bankclusters to my center bankcluster, then my inter. bankcluster, then my local bankcluster]
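The gradual-movement policy above can be sketched as follows. The bankcluster labels and the one-step-per-hit rule are a hypothetical simplification of the paper's policy: on each L2 hit, the block moves one bankcluster closer to the requesting CPU rather than jumping straight to it.

```python
# Sketch of gradual block migration (assumed simplification): each L2 hit
# promotes the block one step along the chain
#   another CPU's bankcluster -> my center -> my inter. -> my local.

def next_bankcluster(current, cpu):
    """One migration step toward `cpu`; bankclusters are labeled
    ('local', i), ('inter', i), or ('center', i) for CPU i."""
    kind, owner = current
    if owner != cpu:
        return ("center", cpu)               # other CPU's cluster -> my center
    return {"center": ("inter", cpu),        # my center -> my inter.
            "inter": ("local", cpu),         # my inter. -> my local
            "local": ("local", cpu)}[kind]   # already local: stay put

class Block:
    """A cache block that migrates gradually toward its requesters."""
    def __init__(self, home):
        self.loc = home

    def on_l2_hit(self, cpu):
        self.loc = next_bankcluster(self.loc, cpu)
```

Repeated hits from a single CPU pull the block into that CPU's local bankcluster, while alternating hits from two sharers keep dragging it back toward each requester's center, which is one way to picture why sharing limits migration's benefit.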
Slide 18: CMP-DNUCA: Hit Distribution (Ocean, per CPU)
[figure: per-CPU L2 hit distributions for CPUs 0-7]
Slide 19: CMP-DNUCA: Hit Distribution (Ocean, all CPUs)
[figure] Block migration successfully separates the data sets.
Slide 20: CMP-DNUCA: Hit Distribution (OLTP, all CPUs)
[figure]
Slide 21: CMP-DNUCA: Hit Distribution (OLTP, per CPU)
[figure: per-CPU L2 hit distributions for CPUs 0-7]
Hit clustering: most L2 hits are satisfied by the center banks.
Slide 22: CMP-DNUCA: Search
Search policy
– Uniprocessor DNUCA solution: partial tags
  A quick summary of the L2 tag state kept at the CPU
  No known practical implementation for CMPs:
  – Size impact of multiple partial tags
  – Coherence between block migrations and partial-tag state
– CMP-DNUCA solution: two-phase search
  1st phase: probe the CPU's local, inter., and the 4 center banks
  2nd phase: probe the remaining 10 banks
  Penalty: 2nd-phase hits and L2 misses are slow
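The two-phase policy above can be sketched as follows. The bank numbering and the CPU-to-bankcluster mapping here are hypothetical stand-ins, since the slide does not spell them out; the structure (6 nearby banks first, the remaining 10 second) follows the slide.

```python
# Sketch of the CMP-DNUCA two-phase search over 16 banks (numbering is an
# assumption): bank i is CPU i's local bank (0-7), banks 8-11 are shared
# intermediate banks, and banks 12-15 are the center banks.

LOCAL = {cpu: cpu for cpu in range(8)}
INTER = {cpu: 8 + cpu // 2 for cpu in range(8)}  # assumed CPU pairing
CENTER = [12, 13, 14, 15]

def two_phase_search(cpu, probe):
    """probe(bank) -> True on a tag match. Returns (bank, phase) on a
    hit, or (None, 2) on an L2 miss after probing all 16 banks."""
    phase1 = [LOCAL[cpu], INTER[cpu], *CENTER]
    for bank in phase1:                      # 1st phase: 6 nearby banks
        if probe(bank):
            return bank, 1
    for bank in range(16):                   # 2nd phase: remaining 10 banks
        if bank not in phase1 and probe(bank):
            return bank, 2
    return None, 2                           # miss: both phases paid for
```

The latency penalty falls out of the structure: a 2nd-phase hit or a miss only resolves after both rounds of probes, which is why fast searches matter so much for CMP-DNUCA.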
Slide 23: CMP-DNUCA: L2 Hit Latency
[chart]
Slide 24: CMP-DNUCA Summary
Limited success
– Ocean successfully splits: a regular scientific workload with little sharing
– OLTP congregates in the center: a commercial workload with significant sharing
Smart search mechanism
– Necessary for a performance improvement
– No known implementations
– Upper bound: perfect search
Slide 25: Outline (repeat of slide 6)
Slide 26: L2 Hit Latency
[chart; bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]
Slide 27: Overall Performance
[chart] Transmission lines improve both L2 hit and L2 miss latency.
Slide 28: Conclusions
Individual latency management techniques
– Strided prefetching: covers only a subset of misses
– Cache block migration: sharing impedes migration
– On-chip transmission lines: limited bandwidth
Combination: CMP-Hybrid
– Potentially alleviates bottlenecks
– Disadvantages: relies on a smart-search mechanism; manufacturing cost of transmission lines
Slide 29: Backup Slides
Slide 30: Strided Prefetching
Exploits repeatable memory access patterns
– Covers a subset of misses
– Tolerates latency within the memory hierarchy
Our implementation
– Similar to Power4
– Detects unit and non-unit stride misses
– Prefetches at both levels: L1 to L2, and L2 to memory
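A minimal stride-prefetcher sketch in the spirit described above (unit and non-unit strides, detected from the miss stream). The PC-indexed table, the two-miss confirmation threshold, and the prefetch degree of four are illustrative assumptions, not the paper's parameters.

```python
# Sketch of a stride prefetcher: track (last address, stride, confidence)
# per instruction, and issue prefetches once a stride repeats enough times.
# Table indexing, threshold, and degree are assumed, not from the paper.

class StridePrefetcher:
    def __init__(self, threshold=2, degree=4):
        self.table = {}              # pc -> (last_addr, stride, confidence)
        self.threshold = threshold   # repeats needed before prefetching
        self.degree = degree         # how many blocks ahead to fetch

    def on_miss(self, pc, addr):
        """Called on a cache miss; returns addresses to prefetch."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None and stride != 0 and addr - last == stride:
            conf += 1                # stride confirmed again
        else:
            stride = addr - last if last is not None else 0
            conf = 0                 # new or broken stride: retrain
        self.table[pc] = (addr, stride, conf)
        if conf >= self.threshold and stride != 0:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []
```

Because the detector keys on the observed stride rather than assuming stride 1, the same logic covers unit strides (consecutive 64-byte blocks) and non-unit strides, matching the slide's "unit and non-unit stride misses".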
Slide 31: On- and Off-chip Prefetching
[chart: commercial and scientific benchmarks]
Slide 32: CMP Sharing Patterns
[chart]
Slide 33: CMP Request Distribution
[chart]
Slide 34: CMP-DNUCA: Search Strategy
[figure: bankclusters grouped as local, inter., and center; the 1st search phase covers the requesting CPU's nearby banks, the 2nd phase the remainder]
Uniprocessor DNUCA: partial tag array for smart searches
Significant implementation complexity for CMP-DNUCA
Slide 35: CMP-DNUCA: Migration Strategy
[figure: migration steps from other local, other inter., and other center bankclusters through my center and my inter. to my local bankcluster]
Slide 36: Uncontended Latency Comparison
[chart]
Slide 37: CMP-DNUCA: L2 Hit Distribution
[chart: per-benchmark breakdown]
Slide 38: CMP-DNUCA: L2 Hit Latency
[chart]
Slide 39: CMP-DNUCA: Runtime
[chart]
Slide 40: CMP-DNUCA Problems
Hit clustering
– Shared blocks move to the center banks, equally far from all processors
Search complexity
– 16 separate bankclusters
– Partial tags impractical: distributed information, synchronization complexity
Slide 41: CMP-TLC: L2 Hit Latency
[chart; bars labeled D: CMP-DNUCA, T: CMP-TLC]
Slide 42: Runtime: Isolated Techniques
[chart]
Slide 43: CMP-Hybrid: Performance
[chart]
Slide 44: Energy Efficiency
[chart]