
1 Wire Aware Architecture. Naveen Muralimanohar. Advisor: Rajeev Balasubramonian, University of Utah

2 Effect of Technology Scaling
- Power wall
- Temperature wall
- Reliability issues: process variation, soft errors
- Wire scaling: communication is expensive but computation is cheap

3 Wire Delay – A Compelling Opportunity
- Existing proposals address wire delay only indirectly:
  - Hide wire delay: pre-fetching, speculative coherence, run-ahead execution
  - Reduce communication to save power
- Wire-level optimizations are still limited to circuit designers

4 Thesis Statement
"The growing cost of on-chip wire delay requires a thorough understanding of wires. The dissertation advocates exposing wire properties to architects and proposes microarchitectural wire management."

5 Wire Delay/Power
- The Pentium 4 (at 90nm) spent two cycles to send a signal across the chip
- Wire delays are costly for both performance and power
- At 32nm (at 5 GHz), latencies of 60 cycles to reach the ends of a chip
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)

6 Large Caches
- Cache hierarchies will dominate chip area
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
[Figure: Intel Montecito cache floorplan]

7 On-Chip Cache Challenges
[Chart: cache access time for 4 MB, 16 MB, and 64 MB caches at 130nm (~1X), 65nm (~1.5X), and 32nm (~2X) processes; access times calculated using CACTI]

8 Effect of L2 Hit Time
[Chart: speedups on an aggressive out-of-order processor when L2 hit time is reduced from 30 to 15 cycles; average improvement = 17%]

9 Coherence Traffic
- CMPs have already become ubiquitous, and they require coherence among multiple cores
- Coherence operations entail frequent communication
- Different coherence messages have different latency and bandwidth needs
[Figure: cores with L1 caches sharing an L2; read-miss messages (read request, forwarded request to owner, latest copy) and write-miss messages (exclusive request, invalidate request, invalidate ack)]

10 L1 Accesses
- Highly latency-critical in aggressive out-of-order processors (such as a clustered processor)
- The choice of inter-cluster communication fabric has a high impact on performance

11 On-Chip Traffic
[Figure: 16-cluster processor (P0-P15, each with instruction delivery), cache controller, and L2 banks, showing the three traffic classes: cache reads and writes, coherence transactions, and L1 accesses]

12 Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions

13 Wire Characteristics
- Wire resistance and capacitance per unit length are set by wire width and spacing: wider wires lower resistance, wider spacing lowers capacitance, and both cost bandwidth, since fewer wires fit in the same metal area
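For reference, the first-order relations behind these knobs, in the form used by CACTI-style wire models (symbols: ρ resistivity; w, t, s, h wire width, thickness, spacing, and dielectric height; K a coupling factor). This is a standard textbook model, not a formula taken from the slide:

```latex
R_{wire} = \frac{\rho}{w\,t}
\qquad
C_{wire} \approx \varepsilon_0\left( 2K\,\varepsilon_{horiz}\,\frac{t}{s}
                                   + 2\,\varepsilon_{vert}\,\frac{w}{h} \right) + \text{fringe}
```

Widening the wire cuts resistance (and hence delay) roughly linearly; widening the spacing cuts the dominant sidewall coupling term in the capacitance.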

14 Design Space Exploration: Wire Width and Spacing
- Base case: B-wires
- Increasing width and spacing yields L-wires: delay drops, but so does bandwidth, so they are fast but low-bandwidth

15 Design Space Exploration: Repeater Size and Spacing
- Traditional wires: large repeaters at the delay-optimal spacing
- Power-optimal wires: smaller repeaters at increased spacing trade delay for power

16 ED Trade-off in a Repeated Wire
[Chart: energy-delay trade-off as repeater size and spacing are varied]
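The trade-off in the chart can be reproduced with the classic repeated-wire Elmore-delay model. Below is a minimal sketch; the device and wire constants and the two configurations are illustrative assumptions, not numbers from the dissertation:

```python
# Classic repeated-wire Elmore model: a wire of total length `length_mm` is
# broken into segments of `seg_mm`, each driven by a repeater of `size`
# (in multiples of a minimum inverter). All constants below are assumed.

R0 = 10e3      # min-size inverter output resistance (ohm), assumed
C0 = 1e-15     # min-size inverter input capacitance (F), assumed
RW = 400.0     # wire resistance per mm (ohm/mm), assumed
CW = 0.2e-12   # wire capacitance per mm (F/mm), assumed

def segment_delay(size, seg_mm):
    """Elmore delay (s) of one repeater driving one wire segment."""
    r_drv, c_in = R0 / size, C0 * size        # bigger repeater: stronger drive, more cap
    r_w, c_w = RW * seg_mm, CW * seg_mm
    return 0.69 * (r_drv * (c_w + c_in) + r_w * (c_w / 2 + c_in))

def wire_delay_energy(length_mm, size, seg_mm, vdd=1.1, activity=0.15):
    """Total delay (s) and dynamic switching energy (J) of the repeated wire."""
    n = length_mm / seg_mm                    # number of repeated segments
    c_total = CW * length_mm + n * C0 * size  # wire cap plus repeater cap
    return n * segment_delay(size, seg_mm), activity * c_total * vdd ** 2

# Large repeaters at tight spacing minimize delay; smaller repeaters at wider
# spacing give up some delay for energy -- the ED trade-off in the chart.
for label, (size, seg_mm) in (("delay-tuned", (80, 0.5)),
                              ("power-tuned", (25, 1.0))):
    d, e = wire_delay_energy(10.0, size, seg_mm)
    print(f"{label}: {d * 1e9:.2f} ns, {e * 1e12:.2f} pJ over 10 mm")
```

With these assumed constants, the power-tuned configuration is roughly 45% slower but spends about a third less switching energy, which is the shape of the curve on the slide.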

17 Design Space Exploration: Wire Types
Wire type                          | Latency | Power | Area
B-wires (base case)                | 1x      | 1x    | 1x
W-wires (bandwidth optimized)      | 1.6x    | 0.9x  | 0.5x
PW-wires (power and B/W optimized) | 3.2x    | 0.3x  | 0.5x
L-wires (fast, low bandwidth)      | 0.5x    | 0.5x  | 4x

18 Wire Model
[Figure: distributed RC wire model with sidewall capacitance (C_side-wall) and coupling capacitance to adjacent wires (C_adj)]
Wire type    | Relative latency | Relative area | Dynamic power | Static power
B-wire (8x)  | 1x               | 1x            | 2.65α         | 1x
B-wire (4x)  | 1.6x             | 0.5x          | 2.9α          | 1.13x
L-wire (8x)  | 0.5x             | 4x            | 1.46α         | 0.55x
PW-wire (4x) | 3.2x             | 0.5x          | 0.87α         | 0.3x
(α = switching activity factor)
Ref: Banerjee et al.; 65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X, and 8X planes

19 Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions

20 Cache Design Basics
[Figure: cache read path: the input address drives the decoder; wordlines and bitlines access the tag and data arrays; column muxes and sense amps feed the comparators (valid output?) and mux drivers; output drivers produce the data output]

21 Existing Model: CACTI
[Figure: cache models with 4 and 16 sub-arrays, breaking access time into decoder delay and wordline/bitline delay]
Decoder delay = H-tree delay + logic delay

22 CACTI Shortcomings
- Access delay equals the delay of the slowest sub-array, giving very high hit times for large caches
- Multi-banked caches employ a separate bus for each cache bank, which is not scalable
- Potential solution: exploit different wire types and network design choices to reduce access latency (NUCA), and extend CACTI to model NUCA

23 Non-Uniform Cache Access (NUCA)*
- A large cache is broken into a number of small banks
- An on-chip network is employed for communication
- Access delay is proportional to the distance between the bank and the cache controller
[Figure: CPU and L1 with an array of cache banks]
*(Kim et al., ASPLOS 02)

24 Extension to CACTI
- On-chip network: wire model based on ITRS 2005 parameters, grid network, 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off: iterate over different bank sizes, calculate the average network delay from the number and size of banks, and factor in contention values for different cache configurations (see the sketch below)
- Power consumed by each organization is evaluated the same way
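A minimal sketch of this search loop. The delay functions are stand-ins for where CACTI-6 would use its detailed bank and router models; every constant here is a hypothetical placeholder:

```python
# NUCA design-space search: fewer banks mean slower, larger banks but fewer
# network hops; more banks mean fast banks but a bigger network.
import math

def bank_access_cycles(bank_mb):
    """Stand-in for CACTI's bank model: bigger banks are slower (assumed)."""
    return 3 + 2 * math.sqrt(bank_mb * 4)

def avg_network_cycles(n_banks, router_cycles=3, link_cycles=1):
    """Average hop cost on a sqrt(n) x sqrt(n) grid (rough, assumed)."""
    side = max(1, round(math.sqrt(n_banks)))   # grid dimension
    return side * (router_cycles + link_cycles)  # ~average Manhattan distance

def best_nuca(cache_mb=32, contention_cycles=4):
    """Return the delay-optimal (total_cycles, bank_count) configuration."""
    configs = []
    for n_banks in (2, 4, 8, 16, 32, 64, 128, 256, 512):
        total = (bank_access_cycles(cache_mb / n_banks)
                 + avg_network_cycles(n_banks) + contention_cycles)
        configs.append((total, n_banks))
    return min(configs)   # a power-optimal search iterates the same way

print(best_nuca())
```

The same loop with a power model in place of the delay model gives the power-centric design point shown a few slides later.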

25 Trade-off Analysis (32 MB Cache)
[Chart: total access time vs. bank count, marking the delay-optimal point]

26 Effect of Core Count
[Chart: effect of core count on the trade-off analysis]

27 Power-Centric Design (32 MB Cache)
[Chart: power-optimal trade-off analysis]

28 Search Space of Old CACTI
[Chart: design space with global wires optimized for delay]

29 Search Space of CACTI-6
[Chart: design space with various wire types, marking the least-delay point, the 30% delay-penalty region, and low-swing wires]

30 Earlier NUCA Models
- Made simplified assumptions for network parameters: minimum bank access time, minimum network hop latency, single-cycle router pipeline
- Employed 512 banks for a 32 MB cache: more bandwidth, but 2.5X less efficient in terms of delay

31 Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions

32 Cache Look-Up
- The entire access happens sequentially: network routing logic consumes 4-6 address bits, the bank decoder consumes 10-15 bits, and tag comparison completes before data leaves the bank
[Figure: core/L1 sending an address through routing logic to an L2 bank's decoder, tag and data arrays, and comparator]

33 Early Look-Up
- Break the sequential access: send the critical lower-order bits ahead so the bank can begin its lookup before the full address arrives
- Hides 70% of the bank access time
[Figure: core/L1 sends the critical lower-order bits early to the L2 bank; the comparator waits for the remaining tag bits]

34 Aggressive Look-Up
[Figure: core/L1 sends the critical lower-order bits plus 8 tag bits (address 1101…1101111100010, partial tag 11100010); the bank's comparator performs a partial tag match before the full tag arrives]

35 Aggressive Look-Up
- Reduction in link delay (for address transfer)
- Increase in traffic due to false matches: < 1%
- Marginal increase in link overhead: an additional 8 bits
- Costs: more logic at the cache controller for tag match; address transfer for writes happens on L-wires
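A minimal sketch of the partial-match step, assuming illustrative field widths (6 offset bits for 64 B lines, 16 index bits, 8 partial-tag bits; the actual configuration may differ):

```python
# Aggressive lookup: ship the index plus 8 low-order tag bits on fast L-wires
# and do a partial tag match early; the full tag, arriving later on slower
# B-wires, resolves any false match.
OFFSET_BITS, INDEX_BITS, PARTIAL_TAG_BITS = 6, 16, 8   # assumed widths

def fields(addr):
    """Split an address into (set index, full tag)."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return index, tag

def aggressive_lookup(set_tags, addr):
    """Return the ways whose low 8 tag bits match -- data for these can be
    fetched speculatively before the full tag arrives."""
    _, tag = fields(addr)
    mask = (1 << PARTIAL_TAG_BITS) - 1
    partial = tag & mask
    return [way for way, t in enumerate(set_tags) if (t & mask) == partial]

# Demo: tags 0xA1E2 and 0x33E2 share the low byte 0xE2 with the request.
addr = (0x55E2 << (OFFSET_BITS + INDEX_BITS)) | (0x1234 << OFFSET_BITS)
print(aggressive_lookup([0xA1E2, 0x33E2, 0x9B01], addr))   # -> [0, 1]
```

With 8 partial-tag bits, a given non-matching way aliases with probability about 2^-8 ≈ 0.4%, which lines up with the slide's < 1% false-match traffic.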

36 Heterogeneous Network
- Routers introduce significant overhead, especially in the L-network: L-wires can transfer a signal across four banks in four cycles, while a router adds three cycles at every hop (see the comparison sketch below)
- Modify the network topology to take advantage of wire properties: different topologies for address and data transfers
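A back-of-the-envelope comparison using only the two figures quoted above (a 3-cycle router per hop, one bank per cycle on L-wires). The 4-bank bus segmentation is an assumed illustration of the hybrid idea on the next slide, not the dissertation's exact topology:

```python
# Compare crossing `banks` banks with a router at every bank versus with
# L-wire bus segments that each span four banks.

def routed_latency(banks, router=3, link_per_bank=1):
    """A router plus a link at every bank."""
    return banks * (router + link_per_bank)

def bus_latency(banks, banks_per_segment=4, router=3, link_per_bank=1):
    """L-wire bus segments spanning four banks, a router only between segments."""
    segments = -(-banks // banks_per_segment)              # ceiling division
    return (segments * banks_per_segment * link_per_bank
            + (segments - 1) * router)

print(routed_latency(8))   # 32 cycles with a router at every bank
print(bus_latency(8))      # 11 cycles with two 4-bank bus segments
```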

37 Hybrid Network
- A combination of point-to-point links and buses
- Benefits: reduction in latency, reduction in power, efficient use of L-wires
- Cost: low bandwidth
[Figure: cores and the L2 controller connected by routers, with shared buses spanning groups of banks]

38 Experimental Setup
- Simplescalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative on-chip L2 cache (S-NUCA organization)
- 32 KB L1 I-cache and 32 KB L1 D-cache with a 3-cycle hit latency
- Main memory latency: 300 cycles

39 CMP Setup
- Eight-core CMP (Simplescalar tool)
- 32 MB, 8-way set-associative L2 (S-NUCA organization)
- Two cache controllers
- Main memory latency: 300 cycles
[Figure: eight cores (C1-C8) around a grid of L2 banks]

40 Network Model
- Virtual-channel flow control: four virtual channels per physical channel, credit-based flow control for backpressure
- Adaptive routing: each hop must reduce the Manhattan distance between the source and the destination (see the sketch below)
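A minimal sketch of that routing rule; the grid coordinates and the tie-breaking comment are illustrative:

```python
# Adaptive routing on a grid: at each router, only forward along directions
# that reduce the Manhattan distance to the destination.

def productive_directions(cur, dst):
    """cur, dst: (x, y) grid coordinates; returns the candidate next hops."""
    x, y = cur
    dx, dy = dst[0] - x, dst[1] - y
    moves = []
    if dx:
        moves.append((x + (1 if dx > 0 else -1), y))   # step toward dst in x
    if dy:
        moves.append((x, y + (1 if dy > 0 else -1)))   # step toward dst in y
    return moves   # the router picks whichever hop has a free virtual channel

# From (1, 1) to (3, 0): either (2, 1) or (1, 0) shortens the path.
print(productive_directions((1, 1), (3, 0)))
```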

41 Cache Models
Model | Bank access (cycles) | Bank count | Network link | Description
1     | 3                    | 512        | B-wires      | Based on prior work
2     | 17                   | 16         | B-wires      | CACTI-6
3     | 17                   | 16         | B & L-wires  | Early lookup
4     | 17                   | 16         | B & L-wires  | Aggressive lookup
5     | 17                   | 16         | B & L-wires  | Hybrid network
6     | 17                   | 16         | B-wires      | Upper bound

42 Performance Results (Uniprocessor)
[Chart: performance of the models (prior work, CACTI-L2, early, aggressive, hybrid, ideal)]
Model derived from CACTI: improvement over the model assumed in prior work – 73% (L2-sensitive benchmarks – 114%)

43 Performance Results (Uniprocessor)
[Chart: same models]
Early lookup: average improvement over Model 2 – 6% (L2-sensitive – 8%)

44 Performance Results (Uniprocessor)
[Chart: same models]
Aggressive lookup: average improvement over Model 2 – 8% (L2-sensitive – 9%)

45 Performance Results (Uniprocessor)
[Chart: same models]
Hybrid model: average improvement over Model 2 – 15% (L2-sensitive – 20%)

46 Performance Results (CMP)
[Chart]

47 Performance Results (4X Wires)
- In this wire-delay-constrained model the improvements are larger: early lookup – 7%, aggressive lookup – 20%, hybrid model – 29%

48 NUCA Design
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overhead, performs 51% better than previous models
- Provides a methodology to compute an optimal baseline NUCA

49 NUCA Design II
- Wires can be tuned for different metrics, routers impose non-trivial overhead, and address and data have different bandwidth needs
- We therefore introduce heterogeneity at three levels: different wire types for address and data transfers, different topologies for the address and data networks, and different architectures within the address network (point-to-point and bus)
- This yields an additional 15% performance improvement over the optimal baseline NUCA

50 Outline
- Overview
- Wire Design Space
- Methodology to Design Scalable Caches
- Heterogeneous Wires for Large Caches
- Heterogeneous Wires for Coherence Traffic
- Conclusions

51 Directory-Based Protocol (Write-Invalidate)
- Map critical/small messages onto L-wires and non-critical messages onto PW-wires (a sketch of the mapping follows)
- Candidates identified by hop imbalance in the protocol: read-exclusive request for a block in shared state, read request for a block in exclusive state, negative ack (NACK) messages
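A minimal sketch of such a mapping table. The message taxonomy follows these slides, but the individual class assignments here are illustrative, not the protocol's exact policy:

```python
# Map coherence message types to wire classes: L-wires for small, critical
# control messages; B-wires for bandwidth-bound data; PW-wires for traffic
# that hop imbalance keeps off the critical path. Assignments are assumed.
WIRE_FOR_MESSAGE = {
    "nack": "L",                     # small and critical
    "rd_ex_req_shared_block": "L",
    "rd_req_exclusive_block": "L",
    "invalidate": "L",
    "invalidate_ack": "L",
    "data_reply": "B",               # full cache line, bandwidth-bound
    "writeback_data": "PW",          # off the critical path
}

def send(msg_type, payload_bytes):
    wire = WIRE_FOR_MESSAGE.get(msg_type, "B")   # default: baseline B-wires
    print(f"{msg_type}: {payload_bytes} bytes on {wire}-wires")

send("invalidate", 8)        # small, critical control message -> L-wires
send("writeback_data", 64)   # full line, non-critical -> PW-wires
```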

52 Exclusive Request for a Shared Copy
1. Read-exclusive request from processor 1
2. Directory sends a clean copy to processor 1
3. Directory sends an invalidate message to processor 2
4. Cache 2 sends an acknowledgement back to processor 1
Hop imbalance makes message 2 non-critical: the 1-3-4 invalidation chain takes three hops, so the two-hop data reply can travel on slower wires without delaying the transaction.
[Figure: processor/cache 1, L2 & directory, and processor/cache 2, with messages 1-4 labeled critical or non-critical]

53 Read to an Exclusive Block
[Figure: processor 1's L1 sends a read request to the L2 & directory, which issues a speculative reply and forwards the request to the owner (processor 2); the owner forwards the dirty copy to the requester (critical) and writes the data back to L2 (non-critical), followed by an ack]

54 Evaluation Platform and Simulation Methodology
- Virtutech Simics functional simulator
- Ruby timing model (GEMS)
- SPLASH benchmark suite
[Figure: simulated CMP of processors and L2 cache]

55 Heterogeneous Model
[Figure: CMP in which each link is composed of L-wires, B-wires, and PW-wires]
- 11% performance improvement
- 22.5% power savings in wires

56 Summary
- Coherence messages have diverse needs
- Intelligent mapping of these messages to the wires of a heterogeneous network can improve both performance and power
- Low-bandwidth, high-speed links improve performance by 11% on the SPLASH benchmark suite
- Placing non-critical traffic on the power-optimized network decreases wire power by 22.5%
Ref: Interconnect Aware Coherence Protocol (ISCA 06), in collaboration with Liqun Cheng

57 On-Core Communications
- L-wires: narrow bit-width operands (see the sketch below), branch mispredict signal
- PW-wires: non-critical register values (ready registers, store data)
- 11% improvement in ED^2
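A minimal sketch of the narrow-operand test this implies: a result whose upper bits are just sign extension can ride the narrow, fast L-wires between clusters. The 16-bit threshold is an assumption for illustration:

```python
# Narrow-operand check: values sign-representable in `bits` bits fit on the
# narrow L-wires; everything else takes the regular B-wires.

def fits_l_wires(value, bits=16):
    """True if a two's-complement value is sign-representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi

print(fits_l_wires(42))        # True  -> send on fast, narrow L-wires
print(fits_l_wires(1 << 40))   # False -> send on regular B-wires
```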

58 Results Summary
- Cache reads and writes: 114% processor performance improvement, 50% power savings
- Coherence transactions: 11% performance improvement, 22.5% power savings in wires
- L1 accesses: 7% performance improvement, 11% ED^2 improvement
[Figure: the 16-cluster processor, cache controller, and L2 banks from slide 11, annotated with these results]

59 Conclusion
- The impact of interconnect choices on modern processors is significant
- Architectural-level wire management can improve both the power and the performance of future communication-bound processors
- Architects have a lot to offer in the area of wire-aware design

60 Future Research
- Exploit upcoming technologies: low-swing wires, optical interconnects, RF, transmission lines, etc.
- Transactional memory
- Networks to support register-register communication
- Dynamic adaptation

61 Acknowledgements
- Committee members: Rajeev, Al, John, Erik, and Shubu (Intel)
- External: Dr. Norm Jouppi (HP Labs), Dr. Ravi Iyer (Intel)
- CS front office staff
- Lab-mates: Karthik, Niti, Liqun, and other fellow grads

62 Avenues Explored
- Inter-core communication (ISCA 2006)
- Memory hierarchy (ISCA 2007)
- CACTI 6.0, publicly released (MICRO 2007; IEEE Micro Top Picks 2008)
- Out-of-order core (HPCA 2005; IEEE Micro 2006)
- Power- and temperature-aware architectures (ISPASS 2006)
Current projects or work under submission:
- Scalable and reliable transactional memory (PACT 08)
- Rethinking fundamentals: route wires or packets?
- 3D reconfigurable caches

