Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah, February 14th 2005

Overview/Motivation
- Wire delays are costly for performance and power
- Latencies of 30 cycles to reach the ends of a chip
- 50% of dynamic power is spent in interconnect switching (Magen et al., SLIP '04)
- Abundant number of metal layers available

Wire Characteristics
- Wire resistance and capacitance per unit length are set by the wire's width and spacing
- Increasing width lowers resistance; increasing spacing lowers coupling capacitance
- Both reduce delay (delay is proportional to RC) but reduce bandwidth, since fewer wires fit in the same metal area (see the sketch below)
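
To make the width/spacing tradeoff concrete, here is a minimal first-order sketch in Python; the coefficients and the simple RC/track model are illustrative assumptions, not numbers from the talk.

```python
# Illustrative first-order sketch of how wire width and spacing affect
# resistance, capacitance, delay, and bandwidth. All constants are
# made-up, normalized values, not process data from the paper.

def wire_properties(width, spacing, rho=1.0, c_fringe=0.2, c_layer=0.5, c_coupling=1.0):
    """Per-unit-length R and C, relative delay, and wire tracks per unit of
    metal area, with width/spacing normalized to the minimum-pitch wire."""
    r = rho / width                                         # wider wire -> lower resistance
    c = c_fringe + c_layer * width + c_coupling / spacing   # more spacing -> less coupling capacitance
    delay = r * c                                           # first-order delay ~ RC
    tracks = 1.0 / (width + spacing)                        # fewer wires fit in the same area
    return r, c, delay, tracks

base = wire_properties(1.0, 1.0)   # B-wire-like: minimum width and spacing
fast = wire_properties(2.0, 2.0)   # L-wire-like: double width and spacing
print("relative delay:    ", fast[2] / base[2])   # < 1: lower delay
print("relative bandwidth:", fast[3] / base[3])   # < 1: fewer wires per unit area
```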

Design Space Exploration: Tuning Wire Width and Spacing
[Figure: B wires laid out at minimum width and spacing d versus L wires at doubled width and spacing 2d; the wider, more widely spaced L wires have lower resistance and capacitance per unit length but offer less bandwidth per unit of metal area.]

Transmission Lines
- Allow extremely low delay
- High implementation complexity and overhead:
  - Large width
  - Large spacing between wires
  - Design of the sensing circuit
  - Shielding power and ground lines adjacent to each line
- Implemented in test CMOS chips
- Not employed in this study

Design Space Exploration: Tuning Repeater Size and Spacing
- Traditional wires: large repeaters at delay-optimal spacing; high power
- Power-optimal wires: smaller repeaters with increased spacing; higher delay, lower power (see the sketch below)
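
A rough sketch of the repeater tradeoff, using a textbook Elmore-style delay approximation for a repeated wire; the formula choice and all constants below are illustrative assumptions, not the model used in this work.

```python
# First-order Elmore-style delay of a repeated wire (textbook approximation);
# all values are normalized and illustrative, not the paper's wire model.

R0, C0 = 50.0, 0.5     # unit repeater output resistance / input capacitance
Rw, Cw = 10.0, 200.0   # total wire resistance / capacitance

def repeated_wire(k, h):
    """Delay and total repeater input capacitance (a proxy for repeater power)
    for k repeaters of size h inserted along the wire."""
    seg_r, seg_c = Rw / k, Cw / k
    seg_delay = 0.69 * (R0 / h) * (C0 * h + seg_c) + seg_r * (0.38 * seg_c + 0.69 * C0 * h)
    return k * seg_delay, k * h * C0

delay_b, cap_b = repeated_wire(k=7, h=45.0)     # many large repeaters, delay-optimal style
delay_pw, cap_pw = repeated_wire(k=3, h=15.0)   # fewer, smaller repeaters, power-optimal style
print(f"delay ratio  (power-optimal / traditional): {delay_pw / delay_b:.2f}")  # ~1.5x slower
print(f"repeater cap (power-optimal / traditional): {cap_pw / cap_b:.2f}")      # ~0.14x the capacitance
```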

Design Space Exploration
- Base case: B wires
- Bandwidth optimized: W wires
- Power optimized: P wires
- Power and bandwidth optimized: PW wires
- Fast, low bandwidth: L wires

Outline
- Overview
- Wire design space exploration
- Employing L wires for performance
- PW wires: the power optimizers
- Results
- Conclusions

Evaluation Platform
- Centralized front-end: I-cache and D-cache, LSQ, branch predictor
- Clustered back-end
[Figure: L1 D-cache and the clusters of the partitioned back-end.]

Cache Pipeline
- Baseline: effective address transfer 10 cycles, memory dependence resolution 5 cycles, cache access 5 cycles; data returns to the functional unit at cycle 20
- With L wires: 8-bit address transfer on L wires 5 cycles, partial memory dependence resolution 3 cycles, cache access 5 cycles (overlapped with the full 10-cycle address transfer on B wires); data returns at cycle 14
[Figure: pipeline diagrams contrasting the two cases.]

L wires: Accelerating Cache Access
- Transmit the LSB bits of the effective address on L wires
- Faster memory disambiguation
  - Partial comparison of loads and stores in the LSQ (a sketch follows below)
  - Introduces false dependences (< 9%)
- Indexing the data and tag RAM arrays
  - The LSB bits can prefetch data out of the L1 cache
  - Reduces the access latency of loads
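
A minimal sketch of the partial-comparison idea, assuming a simple software LSQ list and an 8-bit early slice (matching the 8-bit transfer in the cache-pipeline slide); a real design would do this with partial tags in the LSQ CAM. Comparing only low-order bits is conservative: it may report false dependences but never misses a true one.

```python
# Sketch of partial memory disambiguation using only the low-order address
# bits that arrive early on L wires. Hypothetical structures, not the
# paper's exact hardware.

def partial_conflict(load_addr_lsb, older_stores, lsb_bits=8):
    """Return True if the load may depend on an earlier store.
    Comparing only the low lsb_bits can yield false positives (different
    addresses sharing the same low bits) but no false negatives."""
    mask = (1 << lsb_bits) - 1
    for store in older_stores:
        if store["addr_lsb"] is None:                           # store address still unknown
            return True
        if (store["addr_lsb"] & mask) == (load_addr_lsb & mask):
            return True                                         # possible (maybe false) dependence
    return False

older_stores = [{"addr_lsb": 0x1A4 & 0xFF}, {"addr_lsb": 0x7A4 & 0xFF}]
print(partial_conflict(0x3A4 & 0xFF, older_stores))  # True: same low byte, conservative match
```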

L wires: Narrow Bit-Width Operands
- PowerPC: data bit-width determines FU latency
- Narrow (10-bit) integers can be transferred on L wires
- Can introduce scheduling difficulties
- A predictor table of saturating counters identifies narrow operands (accuracy of 98%; a sketch follows below)
- Reduction in the branch mispredict penalty
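
A sketch of one plausible narrow-operand predictor, assuming a PC-indexed table of 2-bit saturating counters; the table size, the prediction threshold, and the 10-bit cutoff are illustrative choices rather than the exact configuration used in the study.

```python
# Sketch of a narrow-operand predictor: a PC-indexed table of 2-bit
# saturating counters (values 0..3). All sizing choices are illustrative.

TABLE_SIZE = 4096
counters = [0] * TABLE_SIZE

def index(pc):
    return (pc >> 2) % TABLE_SIZE

def predict_narrow(pc):
    """Predict that the result is narrow when the counter is in the upper half."""
    return counters[index(pc)] >= 2

def train(pc, result):
    """Update the counter once the actual result width is known."""
    i = index(pc)
    is_narrow = -(1 << 9) <= result < (1 << 9)   # fits in 10 signed bits
    if is_narrow:
        counters[i] = min(3, counters[i] + 1)
    else:
        counters[i] = max(0, counters[i] - 1)

# Usage: predict before choosing the wires, train when the result is produced.
pc = 0x40081C
use_l_wires = predict_narrow(pc)
train(pc, result=37)
```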

Power Efficient Wires
- Base case: B wires
- Power and bandwidth optimized: PW wires
- Idea: steer non-critical data through the energy-efficient PW interconnect

PW wires: Power/Bandwidth Efficient
- Ready register operands
  - Transferred to the remote register file at instruction dispatch
  - The extra latency is covered by the long dispatch-to-issue delay (see the sketch below)
- Store data
  - Could stall the commit process and delay dependent loads
[Figure: rename and dispatch stage feeding four clusters (issue queue, register file, FUs in each); an operand ready at cycle 90 whose consumer is dispatched at cycle 100 can tolerate the slower PW wires.]
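
A sketch of the steering decision implied by the slide's example (operand ready at cycle 90, consumer dispatched at cycle 100): if the dispatch-to-issue slack hides the extra latency of the PW wires, the value can take the slow path. The latencies match the 4-cluster crossbar numbers from the evaluation slide; the helper itself is hypothetical.

```python
# Sketch of slack-based steering between B and PW wires. Latencies are the
# 4-cluster crossbar values from the talk; the helper is hypothetical.

B_LATENCY, PW_LATENCY = 2, 3

def choose_wires(operand_ready_cycle, earliest_issue_cycle):
    """Use the slow, power-efficient PW wires whenever their latency is
    hidden by the consumer's dispatch-to-issue delay."""
    slack = earliest_issue_cycle - operand_ready_cycle
    return "PW" if slack >= PW_LATENCY else "B"

# Slide example: operand ready at cycle 90, consumer dispatched at cycle 100.
print(choose_wires(operand_ready_cycle=90, earliest_issue_cycle=100))  # -> "PW"
```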

Outline
- Overview
- Wire design space exploration
- Employing L wires for performance
- PW wires: the power optimizers
- Results
- Conclusions

Evaluation Methodology (four clusters)
- Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
- Crossbar interconnect with L, B, and PW wires
- Crossbar latencies: L wires 1 cycle, B wires 2 cycles, PW wires 3 cycles
[Figure: four clusters and the L1 D-cache connected by the heterogeneous crossbar.]

Heterogeneous Interconnects (inter-cluster global interconnect; restated in code below)
- 72 B wires (64 data bits and 8 control bits), with repeaters sized and spaced for optimal delay
- 18 L wires: wide wires and large spacing; occupy more area, but low latency
- 144 PW wires: poor delay, high bandwidth, low power
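
The same composition restated as a small configuration table in code; the wire counts and notes come from this slide, and the crossbar latencies from the 4-cluster evaluation slide.

```python
# The heterogeneous inter-cluster link as a configuration table; the numbers
# restate the slides (wire counts and 4-cluster crossbar latencies).

from dataclasses import dataclass

@dataclass
class WireClass:
    name: str
    count: int            # wires available per link
    crossbar_cycles: int  # latency across the 4-cluster crossbar
    note: str

LINK = [
    WireClass("B",  72,  2, "64 data + 8 control bits, delay-optimal repeaters"),
    WireClass("L",  18,  1, "wide, widely spaced, low latency, more area"),
    WireClass("PW", 144, 3, "small sparse repeaters: slow, high bandwidth, low power"),
]

for w in LINK:
    print(f"{w.name:>2}: {w.count:3d} wires, {w.crossbar_cycles} cycle(s) -- {w.note}")
```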

Analytical Model (RC model of the wire)
- Wire capacitance: C = C_a + W_s·C_b + C_c/W_s, where C_a is the fringing capacitance, W_s·C_b the capacitance between wires on different metal layers, and C_c/W_s the capacitance between wires on the same metal layer
- Total Power = Short-Circuit Power + Switching Power + Leakage Power
(A code transcription of these equations follows below.)
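
A direct transcription of the two equations into code; only the equation forms come from the slide, while the coefficient values and the standard alpha*C*Vdd^2*f switching term are placeholder assumptions.

```python
# Transcription of the slide's analytical model. Coefficient values and the
# switching-power inputs are placeholders; only the equation forms are from the slide.

def wire_capacitance(ws, ca, cb, cc):
    """C = Ca + Ws*Cb + Cc/Ws
    Ca: fringing capacitance; Ws*Cb: capacitance to wires on other metal
    layers; Cc/Ws: coupling capacitance to wires on the same layer."""
    return ca + ws * cb + cc / ws

def switching_power(alpha, c, vdd, freq):
    """Standard dynamic-power term (an assumption, not on the slide): alpha*C*Vdd^2*f."""
    return alpha * c * vdd ** 2 * freq

def total_power(short_circuit, switching, leakage):
    """Total Power = Short-Circuit + Switching + Leakage (from the slide)."""
    return short_circuit + switching + leakage

c = wire_capacitance(ws=2.0, ca=0.2, cb=0.5, cc=1.0)       # placeholder W_s and coefficients
p = total_power(0.05, switching_power(0.15, c, 1.0, 3e9), 0.02)
print(f"C = {c:.2f} (normalized), total power = {p:.3e} (arbitrary units)")
```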

Evaluation Methodology (sixteen clusters)
- Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
- Crossbar plus ring interconnect
- Ring hop latencies: L wires 2 cycles, B wires 4 cycles, PW wires 6 cycles
[Figure: I-cache, D-cache, and LSQ at the front-end; clusters connected by crossbars and a ring interconnect.]

IPC Improvements: L wires
- L wires improve performance by 4.2% on the four-cluster system and by 7.1% on the sixteen-cluster system

Four Cluster System: ED² Improvements
[Chart: relative ED², relative processor energy, IPC, and relative metal area for link configurations combining B, PW, and 36 L wires; the ED² metric is defined in the sketch below.]
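
For reference, the ED² metric plotted in these charts is the standard energy-delay-squared product; a trivial helper, with hypothetical numbers in the usage line:

```python
def ed2(energy, exec_time):
    """Energy-delay-squared product: E * D^2, the metric used in these charts."""
    return energy * exec_time ** 2

# Relative ED^2 of a configuration vs. the homogeneous baseline (hypothetical values):
relative_ed2 = ed2(0.9, 0.96) / ed2(1.0, 1.0)   # ~0.83, i.e. ~17% better
```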

Sixteen Cluster System: ED² Gains
[Chart: relative ED², relative processor energy, and IPC for link configurations combining B, PW, and 36 L wires.]

Conclusions
- Exposing the wire design space to the architecture: a case for microarchitectural wire management
- A low-latency, low-bandwidth network alone improves performance by up to 7%
- ED² improvements of about 11% compared to a baseline processor with a homogeneous interconnect
- Entails additional hardware complexity

Future Work
- 3-D wire model for the interconnects
- Design of heterogeneous clusters
- Interconnects for cache coherence and the L2 cache

Questions and Comments? Thank you!

Backup Slides

L wires: Accelerating Cache Access (backup)
- TLB access for page lookup: transmit a few bits of the virtual page number on L wires
- Prefetch data out of the L1 cache and the TLB
- 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits; an illustrative bit-slicing sketch follows below)

Wire type   Crossbar delay   Ring hop delay
PW wires    3                6
B wires     2                4
L wires     1                2
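
An illustrative bit-slicing of an effective address into the 18 early bits named on this slide (6 partial-tag, 8 L1-index, and 4 TLB-index bits); the bit positions, line size, and page size below are assumptions, not the actual cache geometry.

```python
# Slice an effective address into the 18 early bits sent on L wires
# (6 partial-tag + 8 L1-index + 4 TLB-index bits, per this backup slide).
# Field positions are illustrative assumptions, not the real cache geometry.

BLOCK_OFFSET_BITS = 6    # assumed 64-byte cache lines
L1_INDEX_BITS = 8
PARTIAL_TAG_BITS = 6
PAGE_OFFSET_BITS = 12    # assumed 4 KB pages
TLB_INDEX_BITS = 4

def early_bits(addr):
    """Extract the bits shipped early on L wires to start the L1 and TLB lookups."""
    l1_index = (addr >> BLOCK_OFFSET_BITS) & ((1 << L1_INDEX_BITS) - 1)
    partial_tag = (addr >> (BLOCK_OFFSET_BITS + L1_INDEX_BITS)) & ((1 << PARTIAL_TAG_BITS) - 1)
    tlb_index = (addr >> PAGE_OFFSET_BITS) & ((1 << TLB_INDEX_BITS) - 1)
    return partial_tag, l1_index, tlb_index   # 6 + 8 + 4 = 18 bits total

print(early_bits(0x7FFF_A3C4_5678))
```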

Model Parameters
- Simplescalar-3.0 with separate integer and floating-point queues
- 32 KB 2-way instruction cache
- 32 KB 4-way data cache
- 128-entry 8-way I- and D-TLBs

Overview/Motivation (backup)
- Three wire implementations employed in this study:
  - B wires: traditional; optimal delay but high power consumption
  - L wires: faster than B wires, lower bandwidth
  - PW wires: reduced power consumption, higher bandwidth than B wires, increased delay through the wires
