
1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah, February 14th, 2005

2 Overview/Motivation
- Wire delays are costly for performance and power
- Latencies of 30 cycles to reach ends of a chip
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
- Abundant number of metal layers

3 Wire Characteristics
- Wire resistance and capacitance per unit length depend on width and spacing
- Increasing width and spacing reduces delay (as delay ∝ RC) and reduces bandwidth
[Table: effect of width and spacing on resistance, capacitance, and bandwidth]

4 Design Space Exploration
- Tuning wire width and spacing
[Figure: B wires (dimension d) compared with L wires (dimension 2d), annotated with resistance, capacitance, and bandwidth]

5 Transmission Lines
- Allow extremely low delay
- High implementation complexity and overhead!
  - Large width
  - Large spacing between wires
  - Design of sensing circuit
  - Shielding power and ground lines adjacent to each line
- Implemented in test CMOS chips
- Not employed in this study

6 Design Space Exploration
- Tuning repeater size and spacing
- Traditional wires: large repeaters, optimum spacing
- Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay versus power]

7 Design Space Exploration
- Base case: B wires
- Bandwidth optimized: W wires
- Power optimized: P wires
- Power and B/W optimized: PW wires
- Fast, low bandwidth: L wires

8 Outline
- Overview
- Wire Design Space Exploration
- Employing L wires for Performance
- PW wires: The Power Optimizers
- Results
- Conclusions

9 Evaluation Platform
- Centralized front-end
  - I-Cache & D-Cache
  - LSQ
  - Branch Predictor
- Clustered back-end
[Figure labels: L1 D Cache, Cluster]

10 Cache Pipeline
- Baseline: effective address transfer (10 cycles), memory dependence resolution (5 cycles), cache access (5 cycles); data return at 20 cycles
- With L wires: effective address transfer (10 cycles), 8-bit transfer (5 cycles), partial memory dependence resolution (3 cycles), cache access (5 cycles); data return at 14 cycles
[Figure: pipeline between the functional unit, LSQ, and L1 D cache]

11 L wires: Accelerating cache access
- Transmit LSB bits of effective address through L wires
- Faster memory disambiguation
  - Partial comparison of loads and stores in LSQ
  - Introduces false dependences (< 9%)
- Indexing data and tag RAM arrays
  - LSB bits can prefetch data out of L1$
  - Reduce access latency of loads
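The partial comparison on this slide can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the 8-bit width, the function names, and the simplification that addresses are already aligned to the access granularity are all assumptions. A load whose low-order bits differ from a store's is provably independent; a match on the low-order bits only means a dependence cannot yet be ruled out.

```python
# Hypothetical sketch: partial load/store address comparison in the LSQ.
LSB_BITS = 8  # assumed number of low-order address bits sent early on L wires
LSB_MASK = (1 << LSB_BITS) - 1


def partial_match(load_addr: int, store_addr: int) -> bool:
    """True if the low-order bits agree, so a dependence cannot be ruled out."""
    return (load_addr & LSB_MASK) == (store_addr & LSB_MASK)


def full_match(load_addr: int, store_addr: int) -> bool:
    """Exact comparison, possible once the full address has arrived."""
    return load_addr == store_addr


def classify(load_addr: int, store_addr: int) -> str:
    """Classify a load against one earlier store."""
    if not partial_match(load_addr, store_addr):
        return "independent"      # differing LSBs prove the addresses differ
    if full_match(load_addr, store_addr):
        return "true dependence"
    return "false dependence"     # LSBs collide but the full addresses differ
```

The "false dependence" case is the cost of the early comparison: the load waits on a store it does not actually depend on until the full address resolves the ambiguity.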

12 L wires: Narrow Bit Width Operands
- PowerPC: data bit-width determines FU latency
- Transfer of 10-bit integers on L wires
- Can introduce scheduling difficulties
  - A predictor table of saturating counters
  - Accuracy of 98%
- Reduction in branch mispredict penalty
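A predictor table of saturating counters, as mentioned on this slide, might look like the sketch below. Only the 10-bit operand width comes from the slide; the table size, the 2-bit counter width, the prediction threshold, and indexing by instruction address are assumptions made for illustration.

```python
# Hypothetical sketch: predicting narrow results with saturating counters.
NARROW_BITS = 10   # operand width from the slide
TABLE_SIZE = 1024  # assumed number of table entries
COUNTER_MAX = 3    # assumed 2-bit saturating counters
THRESHOLD = 2      # assumed: predict "narrow" when counter >= THRESHOLD


def is_narrow(value: int) -> bool:
    """True if a non-negative integer fits in NARROW_BITS bits."""
    return 0 <= value < (1 << NARROW_BITS)


class NarrowWidthPredictor:
    def __init__(self) -> None:
        self.counters = [0] * TABLE_SIZE

    def _index(self, pc: int) -> int:
        # Assumed indexing: low-order bits of the instruction address.
        return pc % TABLE_SIZE

    def predict_narrow(self, pc: int) -> bool:
        """Predict whether the instruction at pc produces a narrow result."""
        return self.counters[self._index(pc)] >= THRESHOLD

    def update(self, pc: int, result: int) -> None:
        """Train the counter with the actual result once it is known."""
        i = self._index(pc)
        if is_narrow(result):
            self.counters[i] = min(COUNTER_MAX, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A prediction made before the result is known lets the scheduler plan for an L-wire transfer in advance, which is presumably where the scheduling difficulty on the slide arises.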

13 Power Efficient Wires
- Idea: steer non-critical data through energy-efficient PW interconnect
[Figure: base case B wires compared with power and B/W optimized PW wires]

14 PW wires: Power/Bandwidth Efficient
- Ready register operands
  - Transfer of data at instruction dispatch
  - Transfer of input operands to remote register file
  - Covered by long dispatch-to-issue latency
- Store data
  - Could stall commit process
  - Delay dependent loads
[Figure: Rename & Dispatch feeding four clusters (IQ, Regfile, FU); operand is ready at cycle 90, consumer instruction dispatched at cycle 100]
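The steering idea on this slide can be expressed as a small decision function. This is a sketch under stated assumptions: the transfer-kind strings, the function name, and the default of sending everything else on B wires are hypothetical and not taken from the presentation.

```python
# Hypothetical sketch: choosing a wire class for an inter-cluster transfer.
from enum import Enum


class Wire(Enum):
    B = "B"    # baseline wires
    PW = "PW"  # power and bandwidth optimized, higher latency


def steer(kind: str, operand_ready_at_dispatch: bool = False) -> Wire:
    """Send transfers that can tolerate extra latency over PW wires."""
    if kind == "store_data":
        # Assumed non-critical; the slide notes this can delay dependent loads.
        return Wire.PW
    if kind == "register_operand" and operand_ready_at_dispatch:
        # The transfer overlaps with the dispatch-to-issue latency.
        return Wire.PW
    return Wire.B  # assumed default for everything else
```

For example, an operand that is already ready when its consumer is dispatched would be sent with `steer("register_operand", operand_ready_at_dispatch=True)`, which selects the PW wires.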

15 Outline
- Overview
- Wire Design Space Exploration
- Employing L wires for Performance
- PW wires: The Power Optimizers
- Results
- Conclusions

16 Evaluation Methodology
- Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
- Crossbar interconnects (L, B and PW wires)
[Figure: clusters and L1 D cache connected by B wires (2 cycles), L wires (1 cycle), and PW wires (3 cycles)]

17 Heterogeneous Interconnects
- Intercluster global interconnect
- 72 B wires (64 data bits and 8 control bits)
  - Repeaters sized and spaced for optimum delay
- 18 L wires
  - Wide wires and large spacing
  - Occupies more area
  - Low latencies
- 144 PW wires
  - Poor delay
  - High bandwidth
  - Low power
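The wire counts on this slide appear to be chosen so that each wire set occupies the same metal area. The sketch below checks that arithmetic. The per-wire area ratios (an L wire taking 4 times the area of a B wire, a PW wire taking half) are not stated on this slide; they are inferred here from the relative metal area column of the four-cluster results later in the presentation and should be read as an assumption.

```python
# Assumed per-wire metal area relative to one B wire (inferred, not stated
# on this slide).
RELATIVE_AREA = {"B": 1.0, "L": 4.0, "PW": 0.5}


def metal_area(wires: dict[str, int]) -> float:
    """Total metal area of a link, in units of one B wire."""
    return sum(RELATIVE_AREA[kind] * count for kind, count in wires.items())


b_link = metal_area({"B": 72})     # 72 B wires
l_link = metal_area({"L": 18})     # 18 L wires
pw_link = metal_area({"PW": 144})  # 144 PW wires
```

Under these ratios all three wire sets come out to the same area, which matches the slide's point that L wires trade bandwidth for latency and PW wires trade latency for bandwidth and power.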

18 Analytical Model
- RC model of the wire: $C = C_a + W_s C_b + \frac{C_c}{W_s}$
  - Term 1 ($C_a$): fringing capacitance
  - Term 2 ($W_s C_b$): capacitance between different layers of wires
  - Term 3 ($C_c / W_s$): capacitance between wires of the same metal layer
- Total Power = Short-Circuit Power + Switching Power + Leakage Power
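The two expressions on this slide can be transcribed directly into code. This is only a sketch of the formulas as they appear in the transcript: the meaning and units of `w_s` are not defined there, and the function and parameter names are chosen here for illustration.

```python
# Hypothetical sketch: the slide's capacitance and power expressions.
def wire_capacitance(c_a: float, c_b: float, c_c: float, w_s: float) -> float:
    """C = C_a + W_s * C_b + C_c / W_s, as written on the slide."""
    if w_s <= 0:
        raise ValueError("w_s must be positive")
    fringing = c_a            # term 1
    inter_layer = w_s * c_b   # term 2
    same_layer = c_c / w_s    # term 3
    return fringing + inter_layer + same_layer


def total_power(short_circuit: float, switching: float, leakage: float) -> float:
    """Total power as the sum of the three components named on the slide."""
    return short_circuit + switching + leakage
```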

19 Evaluation Methodology
- Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
- Ring latencies
  - B wires (4 cycles)
  - PW wires (6 cycles)
  - L wires (2 cycles)
[Figure labels: I-Cache, D-cache, LSQ, Cluster, Crossbar, Ring interconnect]
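The per-wire latencies quoted in the presentation can be collected into a small lookup. The delay values come from the slides; how they combine is an assumption made here for illustration, namely that a transfer pays one crossbar traversal plus a fixed cost for each ring hop. The presentation does not spell out this composition.

```python
# Latencies in cycles, as quoted in the presentation.
CROSSBAR_DELAY = {"L": 1, "B": 2, "PW": 3}
RING_HOP_DELAY = {"L": 2, "B": 4, "PW": 6}


def transfer_latency(wire: str, ring_hops: int = 0) -> int:
    """Assumed model: one crossbar traversal plus ring_hops ring hops."""
    if ring_hops < 0:
        raise ValueError("ring_hops must be non-negative")
    return CROSSBAR_DELAY[wire] + ring_hops * RING_HOP_DELAY[wire]
```

Under this assumed model, `transfer_latency("L", 2)` gives 5 cycles while `transfer_latency("PW", 2)` gives 15, which shows how the gap between wire types widens with distance.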

20 IPC improvements: L wires
- L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system

21 Four Cluster System: ED^2 Improvements
Link         | Relative metal area | IPC  | Relative processor energy (10%) | Relative ED^2 (10%) | Relative ED^2 (20%)
144 B        | 1.0                 | 0.95 | 100                             | 100                 | 100
288 PW       | 1.0                 | 0.92 | 97                              | 103.4               | 100.2
288 PW, 36 L | 2.0                 | 0.97 | 99                              | 94.4                | 93.2
144 B, 36 L  | 2.0                 | 0.99 | 101                             | 93.3                | 94.5
288 B        | 2.0                 | 0.98 | 103                             | 96.6                | 99.2
144 PW, 36 L | 1.5                 | 0.96 | 97                              | 95.0                | 92.1
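The relative ED^2 column can be approximately recomputed from the energy and IPC columns. The sketch below assumes that delay is inversely proportional to IPC, so that relative ED^2 is the relative energy scaled by the square of the IPC ratio against the 144 B baseline. That relationship is an assumption, not something the slide states, and because the tabulated energy and IPC values are rounded, the results agree with the 10% column only to within about one point.

```python
# Hypothetical sketch: recomputing relative ED^2 (10% column) from the table.
BASE_IPC = 0.95  # IPC of the 144 B baseline


def relative_ed2(rel_energy: float, ipc: float, base_ipc: float = BASE_IPC) -> float:
    """Relative ED^2 with the baseline at 100, assuming delay scales as 1/IPC."""
    return rel_energy * (base_ipc / ipc) ** 2


# link -> (relative processor energy, IPC, reported relative ED^2 at 10%)
ROWS = {
    "288 PW": (97, 0.92, 103.4),
    "288 PW, 36 L": (99, 0.97, 94.4),
    "144 B, 36 L": (101, 0.99, 93.3),
    "288 B": (103, 0.98, 96.6),
    "144 PW, 36 L": (97, 0.96, 95.0),
}

for link, (energy, ipc, reported) in ROWS.items():
    computed = relative_ed2(energy, ipc)
    print(f"{link}: computed {computed:.1f}, reported {reported}")
```

Squaring the delay term is why a configuration that uses more energy than the baseline (for example 144 B, 36 L) can still show a net ED^2 improvement when its IPC is higher.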

22 Sixteen Cluster System: ED^2 Gains
Link         | IPC  | Relative processor energy (20%) | Relative ED^2 (20%)
144 B        | 1.11 | 100                             | 100
144 PW, 36 L | 1.05 | 94                              | 105.3
144 B, 36 L  | 1.19 | 102                             | 88.7
288 B, 36 L  | 1.22 | 107                             | 88.7
288 B        | 1.18 | 105                             | 93.1

23 Conclusions
- Exposing the wire design space to the architecture
- A case for micro-architectural wire management!
- A low-latency, low-bandwidth network alone helps improve performance by up to 7%
- ED^2 improvements of about 11% compared to a baseline processor with homogeneous interconnect
- Entails hardware complexity

24 Future work
- 3-D wire model for the interconnects
- Design of heterogeneous clusters
- Interconnects for cache coherence and L2$

25 Questions and Comments? Thank you!

26 Backup

27 L wires: Accelerating cache access
- TLB access for page lookup
  - Transmit a few bits of the virtual page number on L wires
  - Prefetch data out of L1$ and TLB
  - 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)
Wire type | Crossbar delay (cycles) | Ring hop delay (cycles)
PW wires  | 3                       | 6
B wires   | 2                       | 4
L wires   | 1                       | 2

28 Model parameters
- Simplescalar-3.0 with separate integer and floating point queues
- 32 KB 2-way instruction cache
- 32 KB 4-way data cache
- 128-entry 8-way I and D TLB

29 Overview/Motivation
- Three wire implementations employed in this study
- B wires: traditional
  - Optimal delay
  - Huge power consumption
- L wires
  - Faster than B wires
  - Lower bandwidth
- PW wires
  - Reduced power consumption
  - Higher bandwidth compared to B wires
  - Increased delay through the wires


