Slide 1: Interconnect Design Considerations for Large NUCA Caches
Naveen Muralimanohar, Rajeev Balasubramonian (University of Utah)

Slide 2: Large Caches
- Cache hierarchies will dominate chip area
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data and addresses
[Figure: Intel Montecito die photo highlighting the caches]

Slide 3: Wire Delay/Power
- Wire delays are costly for both performance and power
- Latencies of 60 cycles to reach the far end of a chip at 32 nm (@ 5 GHz)
- 50% of dynamic power goes to interconnect switching (Magen et al., SLIP '04)
- CACTI (version 3.2) access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology
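
As a rough sanity check on the first figure (my arithmetic, not the slide's): 60 cycles at 5 GHz is 60 × 0.2 ns = 12 ns; for a die edge on the order of 20 mm, that corresponds to roughly 0.6 ns per mm of repeated global wire, which is consistent with ITRS-era projections for the 32 nm node.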

Slide 4: Contributions
- A methodology to compute the optimal baseline NUCA organization, which performs 51% better than prior NUCA models
- Heterogeneity in the on-chip network, for an additional 15% improvement in performance

Slide 5: Cache Design Basics
[Figure: cache read path — the input address drives the decoder; wordlines and bitlines read the tag and data arrays; column muxes, sense amps, comparators, mux drivers, and output drivers produce the valid-output signal and the data output]

Slide 6: Existing Model - CACTI
[Figure: cache models with 4 and with 16 sub-arrays, annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay

Slide 7: CACTI Shortcomings
- Access delay equals the delay of the slowest sub-array → very high hit time for large caches
- Multi-banked caches employ a separate bus for each cache bank → not scalable
- Potential solution: NUCA — exploit different wire types and network design choices to reduce access latency; extend CACTI to model NUCA

Slide 8: Non-Uniform Cache Access (NUCA) (Kim et al., ASPLOS '02)
- The large cache is broken into a number of small banks
- An on-chip network is employed for communication
- Access delay ∝ distance between the bank and the cache controller
[Figure: CPU & L1 in one corner of a grid of cache banks]
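
To make the proportionality concrete, here is a minimal sketch (not from the talk; the function name and link cycle count are assumptions, with the 3-cycle router pipeline taken from slide 19) of per-bank latency on a grid network:

```python
def nuca_bank_latency(bank, controller, link_cycles=1, router_cycles=3):
    """One-way latency (cycles) from the cache controller to a bank on a
    grid network: Manhattan-distance hops, each paying a link traversal
    plus a router traversal."""
    hops = abs(bank[0] - controller[0]) + abs(bank[1] - controller[1])
    return hops * (link_cycles + router_cycles)

# Banks near the controller are fast, distant ones slow:
print(nuca_bank_latency((0, 1), (0, 0)))  # 4 cycles
print(nuca_bank_latency((3, 3), (0, 0)))  # 24 cycles
```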

Slide 9: Extensions to CACTI
- On-chip network: wire model based on ITRS 2005 parameters; grid network; 3-stage speculative router pipeline
- Network latency vs. bank access latency tradeoff: iterate over different bank sizes and calculate the average network delay from the number of banks and the bank size (see the sketch below)
- The power consumed by each organization is evaluated in the same way
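
A sketch of that search loop (illustrative only: the uniform-traffic assumption, the grid shapes, and the helper names are mine; the bank access times come from the table on slide 14):

```python
import itertools

def avg_access_time(rows, cols, bank_access, link_cycles=1, router_cycles=3):
    # Mean one-way network delay over all banks (controller assumed at
    # grid position (0, 0)), plus the bank access itself; assumes every
    # bank is accessed equally often.
    banks = list(itertools.product(range(rows), range(cols)))
    net = sum((x + y) * (link_cycles + router_cycles) for x, y in banks) / len(banks)
    return net + bank_access

# Bank access times (cycles) by bank count, from the table on slide 14:
bank_access = {4: 62, 16: 17, 64: 6, 256: 4}
grid = {4: (2, 2), 16: (4, 4), 64: (8, 8), 256: (16, 16)}

best = min(bank_access, key=lambda n: avg_access_time(*grid[n], bank_access[n]))
print(best)  # 16 under these assumptions, matching slide 14's 8x-wire optimum
```

Few large banks pay a huge bank access; many tiny banks pay a huge network traversal; the sweep finds the knee between the two.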

Slide 10: Effect of Network Delay (32 MB cache)
[Chart: access time vs. bank count, with the earlier NUCA model's operating point marked]

Slide 11: Power-Centric Design (32 MB cache)
[Chart]

Slide 12: Wire Design Space
- Wires can be tuned for low latency or low power
- Global wires: B-wires (8x plane)
- Semi-global wires: W-wires (4x plane)
- Power-optimized: PW-wires (4x plane) — low-power wires with smaller, fewer repeaters
- Fast, low-bandwidth: L-wires (8x plane) — fat, low-bandwidth, fast wires

Slide 13: Wire Model (65 nm process; ref: Banerjee et al., IEEE TED 2002)
[Figure: distributed RC wire model with per-segment resistance and capacitance, including side-wall capacitance (C_side-wall) and coupling capacitance to adjacent wires (C_adj)]

Wire Type  | Relative Latency | Relative Area | Dynamic Power | Static Power
B-Wire 8x  | 1x               | 1x            | 2.65α         | 1x
B-Wire 4x  | 1.6x             | 0.5x          | 2.9α          | 1.13x
L-Wire 8x  | 0.25x            | 8x            | 1.46α         | 0.55x
PW-Wire 4x | 3.2x             | 0.5x          | 0.87α         | 0.3x

(α is the wire's switching activity factor.)
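
The knobs in this table follow the standard distributed-RC wire model (background, not stated on the slide): an unrepeated wire of length L with per-unit resistance r and capacitance c has Elmore delay ≈ 0.38·r·c·L², and inserting optimally sized repeaters makes delay roughly linear in L at the cost of repeater power. L-wires cut delay by widening the wire and its spacing (lower r and lower coupling c) at an 8x area cost, while PW-wires use smaller, sparser repeaters, trading a 3.2x latency penalty for the lowest dynamic and static power.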

Slide 14: Access time for different link types (all times in cycles)

Bank Count | Bank Access Time | Avg Access Time: 8x-wires | 4x-wires | L-wires
2          | 77               | 122                       | 157      | 98
4          | 62               | 119                       | 156      | 90
8          | 26               | 80                        | 110      | 56
16         | 17               | 70                        | 99       | 54
32         | 9                | 72                        | 99       | 57
64         | 6                | 86                        | 127      | 70
128        | 5                | 108                       | 149      | 101
256        | 4                | 147                       | 179      | 132
512        | 3                | 210                       | 240      | 195

Slide 15: Cache Look-Up
- Total cache access time = network delay (4-6 bits to identify the cache bank) + bank access (decoder, WL, BL: 10-15 bits of the address; comparator and output driver: the rest of the address) + data transfer
- The entire access happens in a sequential manner

16 University of Utah16 Early Look-Up  Send partial address in L-wires  Initiate the bank lookup  In parallel send the complete address  Complete the access L Early lookup (10-15 bits of address) Tag match  We can hide ~70% of the bank access delay + Data transfer LookupTag Address Traditional Access

Slide 17: Aggressive Look-Up
- Send an additional 8 bits of the address on the L-wires for a partial tag match
- The tag match is performed at the cache controller
[Figure: timelines comparing traditional access, early lookup, and aggressive lookup; a full tag entry (1101…1101111100010) is matched against its 8-bit partial tag (11100010)]

Slide 18: Aggressive Look-Up
+ Reduction in link delay (for the address transfer)
+ Increase in traffic due to false matches is < 1%
+ Marginal increase in link overhead (an additional 8 bits)
- More logic at the cache controller for the tag match
- Address transfer for writes happens on L-wires
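
A minimal sketch of the partial tag match at the controller (the 8-bit width comes from the slide; the field names and the choice of low-order bits are assumptions):

```python
PARTIAL_BITS = 8
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1

def partial_tag_match(way_tags, addr_tag):
    """Compare only 8 bits of each way's stored tag at the cache controller.
    A match here lets the data be fetched speculatively; the full tag
    comparison later squashes the rare (< 1%) false matches."""
    want = addr_tag & PARTIAL_MASK
    return [i for i, tag in enumerate(way_tags) if (tag & PARTIAL_MASK) == want]
```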

Slide 19: Heterogeneous Network
- Routers introduce significant overhead (especially in the L-network): L-wires can transfer a signal across four banks in four cycles, yet the router adds three cycles at each hop
- Modify the network topology to take advantage of wire properties: different topologies for address and data transfers
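
A rough illustration using the slide's numbers (the topology interpretation is mine): sending an address across eight banks with a router at every bank costs 8 link cycles + 8 × 3 router cycles = 32 cycles, but if L-wire segments span four banks between routers it costs 8 + 2 × 3 = 14 cycles. The router pipeline, not the wire, dominates, which is what motivates a shallower topology for the address network.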

Slide 20: Hybrid Network
- Combination of point-to-point links and shared buses
+ Reduction in latency; reduction in power; efficient use of L-wires
- Low bandwidth
[Figure: cores connect through the L2 controller to routers; cache banks hang off shared buses]

Slide 21: Experimental Setup
- Simplescalar, with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (SNUCA organization)
- 32 KB L1 I-cache and 32 KB L1 D-cache, each with a hit latency of 3 cycles
- Main memory latency: 300 cycles

Slide 22: CMP Setup
- Eight-core CMP (Simplescalar tool)
- 32 MB, 8-way set-associative L2 (SNUCA organization)
- Two cache controllers
- Main memory latency: 300 cycles
[Figure: cores C1-C8 placed around a grid of L2 banks]

Slide 23: Network Model
- Virtual-channel flow control: four virtual channels per physical channel; credit-based flow control (for backpressure)
- Adaptive routing: each hop must reduce the Manhattan distance to the destination (see the sketch below)
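
A sketch of that routing rule on grid coordinates (names and the congestion tie-break are assumptions):

```python
def legal_next_hops(cur, dst):
    """Adaptive-routing constraint: a neighbor is a legal next hop only if
    it strictly reduces the Manhattan distance to the destination."""
    x, y = cur
    hops = []
    if dst[0] != x:
        hops.append((x + (1 if dst[0] > x else -1), y))
    if dst[1] != y:
        hops.append((x, y + (1 if dst[1] > y else -1)))
    return hops  # the router chooses among these, e.g. by buffer occupancy
```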

Slide 24: Cache Models

Model | Bank Access (cycles) | Bank Count | Network Link | Description
1     | 3                    | 512        | B-wires      | Based on prior work
2     | 17                   | 16         | B-wires      | CACTI-L2
3     | 17                   | 16         | B & L-wires  | Early lookup
4     | 17                   | 16         | B & L-wires  | Aggressive lookup
5     | 17                   | 16         | B & L-wires  | Hybrid network
6     | 17                   | 16         | B-wires      | Upper bound

Slide 25: Performance Results (Uniprocessor)
- Latency-sensitive benchmarks: ~70% of the SPEC suite
[Chart: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal; CACTI-L2 improves on the prior-work model by 73%, and by 114% for L2-sensitive benchmarks]

Slide 26: Performance Results (Uniprocessor)
- Latency-sensitive benchmarks: ~70% of the SPEC suite
[Chart: improvements over the CACTI-L2 baseline — Early 6% (8% for L2-sensitive), Aggressive 8% (9%), Hybrid 15% (20%), Ideal 19% (26%)]

Slide 27: Performance Results (CMP)
[Chart]

Slide 28: Performance Results (4x Wires)
- Wire-delay-constrained model; performance improvements are larger:
  - Early lookup: 7%
  - Aggressive model: 20%
  - Hybrid model: 29%

Slide 29: Conclusion
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overheads, performs 51% better than previous models
- A methodology to compute an optimal baseline NUCA

Slide 30: Conclusion
- Wires can be tuned for different metrics; routers impose non-trivial overhead; address and data have different bandwidth needs
- We introduce heterogeneity at three levels:
  - Different types of wires for address and data transfers
  - Different topologies for the address and data networks
  - Different architectures within the address network (point-to-point and bus)
- This yields an additional performance improvement of 15% over the optimal baseline NUCA

Slide 31: Performance Results (Uniprocessor)
[Chart: Prior work, CACTI-L2, Early, Aggr., Hybrid, Ideal. Model derived from CACTI; improvement over the model assumed in prior work: 73% (114% for L2-sensitive benchmarks)]

Slide 32: Performance Results (Uniprocessor)
[Chart: early lookup technique; average improvement over Model 2: 6% (8% for L2-sensitive benchmarks)]

Slide 33: Performance Results (Uniprocessor)
[Chart: aggressive lookup technique; average improvement over Model 2: 8% (9% for L2-sensitive benchmarks)]

Slide 34: Performance Results (Uniprocessor)
[Chart: hybrid model; average improvement over Model 2: 15% (20% for L2-sensitive benchmarks)]

Slides 35-38: Outline (section-divider slides repeated through the talk)
Problem Overview; Cache Design Basics; Extensions to CACTI; Effect of Network Parameters; Wire Design Space; Exploiting Heterogeneous Wires; Methodology; Results

Slide 39: Aggressive Look-Up (detail)
- Aggressive lookup sends an additional 8 bits of the address on L-wires for a partial tag match; the tag match happens at the cache controller
[Figure: full tag entries for Way 1 … Way n (e.g., 1101…1101111100010) compared against the 8-bit partial tag 11100010]

