

1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatand Venkatachalapathy
University of Utah, February 14th, 2005
Processor Architecture

2 Overview/Motivation
- Wire delays hamper performance.
- Power incurred in movement of data
  - 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
  - MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al. 2003)
- Abundant number of metal layers

3 Wire characteristics
- Wire resistance (R) and capacitance (C) per unit length depend on width and spacing
  - Increasing width lowers R and raises C
  - Increasing spacing lowers C
- Result of wider, more widely spaced wires: lower delay (as delay ∝ RC), lower bandwidth
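The trade-off on this slide can be illustrated with a first-order model. The sketch below is not from the talk: it assumes a rectangular wire whose resistance per unit length scales as 1/(width × thickness), and a parallel-plate capacitance with a sidewall term (thickness/spacing) and a ground-plane term (width/height). The function names, the unit-less dimensions, and the fixed track budget are all hypothetical.

```python
def wire_rc(width, spacing, thickness=1.0, height=1.0, rho=1.0, eps=1.0):
    """Return (R, C) per unit length under an assumed first-order model."""
    resistance = rho / (width * thickness)           # wider wire: lower R
    sidewall_cap = 2.0 * eps * thickness / spacing   # more spacing: lower C
    plate_cap = 2.0 * eps * width / height           # wider wire: higher C
    return resistance, sidewall_cap + plate_cap


def relative_delay(width, spacing):
    """Delay is taken to be proportional to the RC product."""
    r, c = wire_rc(width, spacing)
    return r * c


def wires_in_budget(width, spacing, budget=64.0):
    """Number of wires that fit in a fixed metal width (a bandwidth proxy)."""
    return int(budget // (width + spacing))


if __name__ == "__main__":
    for w, s in [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]:
        print(f"width={w} spacing={s} "
              f"RC={relative_delay(w, s):.2f} wires={wires_in_budget(w, s)}")
```

Under these assumed parameters, doubling both width and spacing lowers the RC product while halving the number of wires that fit in the same metal area, which is the delay-versus-bandwidth trade-off the slide describes.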

4 Design space exploration
- Tuning wire width and spacing
[Figure: wire cross-sections at spacing d and 2d, labeled B Wires, with resistance, capacitance, and bandwidth annotations.]

5 Transmission Lines
- Similar to L wires: extremely low delay
- Constraining implementation requirements!
  - Large width
  - Large spacing between wires
  - Design of sensing circuits
- Implemented in test CMOS chips

6 Design space exploration
- Tuning repeater size and spacing
  - Traditional wires: large repeaters, optimum spacing
  - Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay and power for the two repeater configurations.]

7 Design space exploration
- Delay optimized: B wires
- Bandwidth optimized: W wires
- Power optimized: P wires
- Power and B/W optimized: PW wires
- Fast, low bandwidth: L wires

8 Heterogeneous Interconnects
- Intercluster global interconnect
  - 72 B wires
    - Repeaters sized and spaced for optimum delay
  - 18 L wires
    - Wide wires and large spacing
    - Occupies more area
    - Low latencies
  - 144 PW wires
    - Poor delay
    - High bandwidth
    - Low power

9 Outline
- Overview
- Design Space Exploration
- Heterogeneous Interconnects
- Employing L wires for performance
- PW wires: The power optimizers
- Evaluation
- Results
- Conclusion

10 L1 Cache pipeline
[Figure: timeline of a load through the LSQ and L1 D cache.]
- Effective address transfer: 10 cycles
- Memory dependence resolution: 5 cycles
- Cache access: 5 cycles
- Data return at 20 cycles

11 Exploiting L-Wires
[Figure: timeline of a load through the LSQ and L1 D cache.]
- Effective address transfer: 10 cycles
- Partial memory dependence resolution: 3 cycles
- Cache access: 5 cycles
- 8-bit transfer: 5 cycles
- Data return at 14 cycles

12 L wires: Accelerating cache access
- Transmit the least significant bits (LSBs) of the effective address through L wires
- Partial comparison of loads and stores in LSQ
  - Faster memory disambiguation
  - Introduces false dependences (< 9%)
- Indexing data and tag RAM arrays
  - LSBs can prefetch data out of L1$
  - Reduces access latency of loads
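The partial comparison in the LSQ can be sketched as follows. This is an illustration only, not the hardware described in the talk: it assumes 8 low-order address bits arrive early (matching the 8-bit transfer on the pipeline slide), it treats addresses as exact-match keys and ignores access sizes, and the function names are hypothetical.

```python
LSB_BITS = 8                      # assumed width of the early partial address
LSB_MASK = (1 << LSB_BITS) - 1


def partial_conflicts(load_lsb, older_store_addrs):
    """Stores whose low bits match the load's low bits (possible dependences)."""
    return [s for s in older_store_addrs if (s & LSB_MASK) == load_lsb]


def true_conflicts(load_addr, older_store_addrs):
    """Stores that match the full load address (real dependences)."""
    return [s for s in older_store_addrs if s == load_addr]


if __name__ == "__main__":
    stores = [0x1000_0040, 0x2000_0080, 0x3000_00C4]
    load = 0x5000_0080

    early = partial_conflicts(load & LSB_MASK, stores)   # known when the LSBs arrive
    late = true_conflicts(load, stores)                  # known when the full address arrives

    print("possible dependences from LSBs:", [hex(s) for s in early])
    print("real dependences:", [hex(s) for s in late])
```

A load with no partial match is certainly independent of the older stores and can proceed early. A partial match that the full comparison later rejects, as in this example, is one of the false dependences the slide mentions.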

13 L wires: Narrow bit width operands
- Transfer of 10-bit integers on L wires
  - Schedule wake-up operations
  - Reduction in branch mispredict penalty
- A predictor table of 8K two-bit counters
  - Identifies 95% of all narrow bit-width results
  - Accuracy of 98%
- Implemented in the PowerPC!
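A table of two-bit saturating counters that predicts narrow results might look like the sketch below. The slide gives only the table size (8K counters) and the 10-bit operand width. The PC-based indexing, the unsigned definition of "narrow", the initial counter value, and the class name are assumptions made for illustration.

```python
class NarrowWidthPredictor:
    """Two-bit saturating counters predicting whether a result is narrow."""

    def __init__(self, entries=8192, narrow_bits=10):
        self.entries = entries
        self.narrow_bits = narrow_bits
        self.counters = [0] * entries        # assumed start state: predict "wide"

    def _index(self, pc):
        # Assumed indexing: drop the 2 alignment bits, then wrap into the table.
        return (pc >> 2) % self.entries

    def is_narrow(self, value):
        # Assumed definition: the value fits in narrow_bits as an unsigned integer.
        return 0 <= value < (1 << self.narrow_bits)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, value):
        i = self._index(pc)
        if self.is_narrow(value):
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    predictor = NarrowWidthPredictor()
    pc = 0x0040_0100
    for result in (5, 17, 900):              # three narrow results in a row
        predictor.update(pc, result)
    print("predict narrow:", predictor.predict(pc))
```

An instruction predicted narrow could have its result scheduled on the L wires; a misprediction would need a fallback transfer on the wider wires, which this sketch does not model.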

14 PW wires: Power/Bandwidth efficient
- Idea: steer non-critical data through the energy-efficient PW interconnect
- Transfer of data at instruction dispatch
  - Transfer of input operands to remote register file
  - Covered by long dispatch-to-issue latency
- Store data
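The steering idea can be written as a small selection rule. The sketch below is a hypothetical policy, not the one evaluated in the talk: it assumes each transfer carries a criticality flag and a bit width, sends non-critical traffic to PW wires, sends critical values of at most 10 bits to L wires, and uses B wires otherwise. The cycle counts are the four-cluster crossbar latencies reported in this deck.

```python
from dataclasses import dataclass

CROSSBAR_CYCLES = {"L": 1, "B": 2, "PW": 3}   # four-cluster crossbar latencies
L_WIRE_PAYLOAD_BITS = 10                      # deck mentions 10-bit integers on L wires


@dataclass
class Transfer:
    bits: int        # payload width in bits
    critical: bool   # hypothetical flag: is a consumer waiting on this value?


def choose_wires(transfer):
    """Pick a wire class for one transfer (assumed policy)."""
    if not transfer.critical:
        return "PW"                            # e.g. store data, dispatch-time operands
    if transfer.bits <= L_WIRE_PAYLOAD_BITS:
        return "L"                             # narrow and critical: fastest wires
    return "B"                                 # wide and critical: default wires


if __name__ == "__main__":
    for t in (Transfer(64, False), Transfer(8, True), Transfer(64, True)):
        wires = choose_wires(t)
        print(t, "->", wires, f"({CROSSBAR_CYCLES[wires]} cycles)")
```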

15 Evaluation methodology
- A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0
- Crossbar interconnects
  - B wires (2 cycles)
  - L wires (1 cycle)
  - PW wires (3 cycles)
- Centralized front-end
  - I-Cache & D-Cache
  - LSQ
  - Branch Predictor
[Figure: clusters and the L1 D cache connected by B, L, and PW wires.]

16 Evaluation methodology
- A dynamically scheduled 16-cluster processor, modeled in Simplescalar-3.0
- Ring latencies
  - B wires (4 cycles)
  - PW wires (6 cycles)
  - L wires (2 cycles)
[Figure: I-Cache, D-cache, LSQ, clusters, crossbar, and ring interconnect.]

17 IPC improvements: L wires
- L wires improve performance by 4% on a four-cluster system and by 7.1% on a sixteen-cluster system

18 Four cluster system: ED² gains

| Link | Relative metal area | IPC | Relative processor energy (10%) | Relative ED² (10%) | Relative ED² (20%) |
|---|---|---|---|---|---|
| 144 B | 1.0 | 0.95 | 100 | 100 | 100 |
| 288 PW | 1.0 | 0.92 | 97 | 103.4 | 100.2 |
| 144 PW, 36 L | 1.5 | 0.96 | 97 | 95.0 | 92.1 |
| 288 B | 2.0 | 0.98 | 103 | 96.6 | 99.2 |
| 288 PW, 36 L | 2.0 | 0.97 | 99 | 94.4 | 93.2 |
| 144 B, 36 L | 2.0 | 0.99 | 101 | 93.3 | 94.5 |
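The metal-area column in the four-cluster table is consistent with a simple accounting in which one PW wire takes half the area of a B wire and one L wire takes four times the area. These ratios are inferred from the table rows and are not stated on the slide, so the sketch below should be read as an assumption that happens to reproduce the reported numbers.

```python
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}   # inferred ratios, in B-wire units
BASELINE_B_WIRES = 144                            # the 144 B link has relative area 1.0


def relative_metal_area(link):
    """Metal area of a link, relative to the 144 B baseline."""
    total = sum(AREA_PER_WIRE[kind] * count for kind, count in link.items())
    return total / (BASELINE_B_WIRES * AREA_PER_WIRE["B"])


# (link composition, relative metal area reported in the table)
REPORTED = [
    ({"B": 144}, 1.0),
    ({"PW": 288}, 1.0),
    ({"PW": 144, "L": 36}, 1.5),
    ({"B": 288}, 2.0),
    ({"PW": 288, "L": 36}, 2.0),
    ({"B": 144, "L": 36}, 2.0),
]

if __name__ == "__main__":
    for link, reported in REPORTED:
        print(link, "computed:", relative_metal_area(link), "reported:", reported)
```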

19 Sixteen cluster system: ED² gains

| Link | IPC | Relative processor energy (20%) | Relative ED² (20%) |
|---|---|---|---|
| 144 B | 1.11 | 100 | 100 |
| 144 PW, 36 L | 1.05 | 94 | 105.3 |
| 288 B | 1.18 | 105 | 93.1 |
| 144 B, 36 L | 1.19 | 102 | 88.7 |
| 288 B, 36 L | 1.22 | 107 | 88.7 |
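The ED² column in the sixteen-cluster table can be approximately reproduced from the IPC and energy columns. ED² is energy times delay squared; the sketch below assumes a fixed instruction count and clock, so delay is proportional to 1/IPC, and normalizes to the 144 B baseline. The function name is hypothetical, and small differences from the reported values are expected because the slide's IPC figures are rounded.

```python
BASE_IPC = 1.11        # 144 B baseline, sixteen-cluster system
BASE_ENERGY = 100.0


def relative_ed2(ipc, rel_energy, base_ipc=BASE_IPC, base_energy=BASE_ENERGY):
    """Relative ED^2 (baseline = 100), with delay taken as proportional to 1/IPC."""
    delay_ratio = base_ipc / ipc
    return 100.0 * (rel_energy / base_energy) * delay_ratio ** 2


# (link, IPC, relative energy, relative ED^2 reported on the slide)
ROWS = [
    ("144 B", 1.11, 100, 100.0),
    ("144 PW, 36 L", 1.05, 94, 105.3),
    ("288 B", 1.18, 105, 93.1),
    ("144 B, 36 L", 1.19, 102, 88.7),
    ("288 B, 36 L", 1.22, 107, 88.7),
]

if __name__ == "__main__":
    for link, ipc, energy, reported in ROWS:
        print(f"{link:14s} computed={relative_ed2(ipc, energy):6.1f} reported={reported}")
```

The last two rows show the trade-off behind the conclusions: adding L wires raises IPC enough that ED² falls by about 11% even though processor energy rises slightly.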

20 Conclusions
- Exposing the wire design space to the architecture
- A case for micro-architectural wire management!
- A low-latency, low-bandwidth network alone helps improve performance by up to 7%
- ED² improvements of about 11% compared to a baseline processor with homogeneous interconnect
- Entails hardware complexity

21 Future work
- A preliminary evaluation looks promising
- Heterogeneous interconnect entails complexity
- Design of heterogeneous clusters
- Energy efficient interconnect

22 Questions and Comments? Thank you!

23 Backup

24 L wires: Accelerating cache access
- TLB access for page lookup
  - Transmit a few bits of the virtual page number on L wires
  - Prefetch data out of L1$ and TLB
  - 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)

| Wire type | Crossbar delay (cycles) | Ring hop delay (cycles) |
|---|---|---|
| PW wires | 3 | 6 |
| B wires | 2 | 4 |
| L wires | 1 | 2 |

25 Model parameters
- Simplescalar-3.0 with separate integer and floating point queues
- 32 KB 2-way instruction cache
- 32 KB 4-way data cache
- 128-entry 8-way I and D TLBs

26 Overview/Motivation
- Three wire implementations employed in this study
- B wires: traditional
  - Optimal delay
  - Huge power consumption
- L wires:
  - Faster than B wires
  - Lower bandwidth
- PW wires:
  - Reduced power consumption
  - Higher bandwidth compared to B wires
  - Increased delay through the wires

