Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction.

Presentation transcript:

Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction Scheduler
Dan Ernst, Andrew Hamel, Todd Austin
Advanced Computer Architecture Lab, The University of Michigan

Challenges in High-Speed Dynamic Scheduling
Broadcast-based dynamic scheduler circuits are:
- Highly complex
- Power-hungry
- Poorly scaling
Global synchronization is becoming increasingly expensive
- More pipeline stages + slow long wires + increasing clock speeds = difficult global signal design
- Example: pipeline stalling
Memory scheduling is a "second class citizen"
- Non-deterministic latencies don't fit well into the current popular dynamic scheduling paradigm

Goals
1) Design a competitive, completely broadcast-free scheduler
   - Minimize global synchronization
2) Address memory scheduling in a "first class" way
3) Minimize "loose loops"

Difference in Approaches
From an instruction's point of view, scheduling is just figuring out how long to wait.
Broadcast approach
- Instruction's schedule is "recomputed" every cycle
- Polling ("Can I go now? How about now?")
Cyclone approach
- Schedule based on a single timing computation
- Instruction is given an execution time once, so no re-computation is needed
- Put in a timed "router" to execute the schedule as best it can
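The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustrative toy, not the circuit described in the talk: the function names `broadcast_wait` and `cyclone_wait` are hypothetical, and operand ready times are assumed to be known exactly, which a real predictor cannot guarantee.

```python
def broadcast_wait(operand_ready_times, now=0):
    """Polling style: readiness is re-evaluated every cycle.

    Returns (issue_cycle, number_of_readiness_checks).
    """
    cycle = now
    checks = 0
    while True:
        checks += 1
        # "Can I go now?" is asked again on every cycle.
        if all(t <= cycle for t in operand_ready_times):
            return cycle, checks
        cycle += 1


def cyclone_wait(operand_ready_times, now=0):
    """Single timing computation: the wait is computed once, up front.

    Returns (issue_cycle, number_of_timing_computations).
    """
    issue_cycle = max([now, *operand_ready_times])
    return issue_cycle, 1
```

Both functions pick the same issue cycle when the ready times are accurate; the difference is how much work is repeated while the instruction waits.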

Conceptual Overview
[Figure: block diagram in which an instruction (I) passes through a Timing Predictor and a Dependence Check into a Routing/Timing Network that feeds the FU]

Pre-scheduler Design
[Figure: (a) pre-scheduler stages PSCHED 0 and PSCHED 1 handling instructions I0 to I3, with a timing table, max and + units, a dependence check, MUX control, and a "reschedule?" signal; (b) an example schedule]
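The "timing table", "max", and "+" labels in the pre-scheduler figure suggest a computation along the following lines. This is a minimal sketch under stated assumptions, not the authors' circuit: the function name `preschedule`, the tuple layout of an instruction, and the use of absolute cycle numbers (rather than hardware countdown values) are all hypothetical simplifications.

```python
def preschedule(instructions, timing_table=None):
    """Assign each instruction a predicted start cycle.

    instructions: list of (dest, sources, latency) tuples in program order.
    timing_table: optional dict mapping register name -> predicted ready cycle.
    Returns (schedule, updated_table), where schedule[i] is the predicted
    start cycle of instructions[i].
    """
    table = dict(timing_table or {})
    schedule = []
    for dest, sources, latency in instructions:
        # "max": wait for the latest-arriving source operand.
        # Registers missing from the table are assumed ready at cycle 0.
        start = max((table.get(src, 0) for src in sources), default=0)
        schedule.append(start)
        if dest is not None:
            # "+": the result is predicted ready after the operation latency.
            table[dest] = start + latency
    return schedule, table
```

Processing the group in program order lets a later instruction see the prediction written by an earlier one in the same group, which stands in for the dependence check and MUX control shown in the figure.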

Cyclone Scheduler
[Figure: scheduler diagram showing the instruction pre-scheduler, store set predictor, branch predictor, countdown/replay queue, main queue (includes timing information), switchback datapaths, register file ready bits, bypass, fn units, and a "replay?" check, with pipeline stage labels (REG, EX/MEM, SCHED); instruction I0 is annotated "Not ready!" and then "Ready!"]
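The "replay?" check and the "Not ready!" / "Ready!" annotations in the scheduler figure can be illustrated with a small sketch. It assumes, for illustration only, that an instruction arriving at execution consults the ready state of its sources and, if any is missing, is sent around again after a fixed delay. The function name `issue_with_replay` and the `replay_delay` parameter are hypothetical; the slides do not specify how a replayed instruction is re-timed.

```python
def issue_with_replay(predicted_cycle, source_ready_cycles, replay_delay=2):
    """Return (actual_issue_cycle, replay_count) for one instruction.

    predicted_cycle: cycle at which the instruction first reaches execution.
    source_ready_cycles: cycles at which each source value is actually ready.
    replay_delay: assumed number of cycles a replayed instruction takes to
        come back around (hypothetical value).
    """
    cycle = predicted_cycle
    replays = 0
    # No wakeup broadcast: the instruction learns that an operand is late
    # only by checking ready state when it arrives at execution.
    while any(ready > cycle for ready in source_ready_cycles):
        replays += 1
        cycle += replay_delay
    return cycle, replays
```

A correct timing prediction costs nothing extra, while a mispredicted latency (for example a load that misses in the cache) shows up as one or more replays.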

Cyclone: Switchback Conflict
[Figure: the Cyclone scheduler diagram with numbered instructions in the queues, illustrating a switchback conflict]

Cyclone: Switchback Conflict (continued)
[Figure: the same scheduler diagram one step later, with the numbered instructions advanced]

Architectural Methodology
Baseline architectural model
- Derived from SimpleScalar 3.0
- More sophisticated scheduling support
  - Separated ROB and RS
  - Variable-length pipelines
  - Selective scheduler replay on memory latency misprediction
- Store Set predictor
Cyclone model
- Replaced scheduling portion of pipeline with Cyclone model
- Added timing information to store set predictor
Simulated SPEC2000 (INT and FP)

Cyclone IPC

Circuit Timing and Area Methodology
Timing: SPICE models
- Critical paths of Cyclone
  - Switchback paths were very fast
  - Pre-scheduler dependence check was the critical path
- CAM-style broadcast windows
  - Used models from last year's ISCA (Tag Elimination)
- Both used the TSMC 0.18 µm process at 1.8 V
- Presented here as throughput in instructions per nanosecond (IPns)
Area analysis: Register Bit Equivalent (RBE)
- Process-independent analytical model of chip area
- Assumed RAM/CAM area scaled quadratically with number of ports
- Modeled scheduler structures and extra tables (also RF)
- More information in Mulder, Quach, and Flynn [17]
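The two metrics on this slide reduce to simple arithmetic, sketched below. Throughput in instructions per nanosecond is IPC divided by the clock period in nanoseconds, and the quadratic port assumption scales a structure's area by the square of the port ratio. The function names and the numbers in the usage lines are hypothetical and are not results from the talk.

```python
def throughput_ipns(ipc, clock_period_ns):
    """Instructions per nanosecond = instructions per cycle / ns per cycle."""
    return ipc / clock_period_ns


def scaled_area_rbe(base_area_rbe, base_ports, ports):
    """Scale a RAM/CAM area estimate (in RBE) quadratically with port count."""
    return base_area_rbe * (ports / base_ports) ** 2


# Hypothetical numbers, for illustration only.
slow_but_smart = throughput_ipns(ipc=2.0, clock_period_ns=1.0)
fast_but_simple = throughput_ipns(ipc=1.8, clock_period_ns=0.75)
doubled_ports = scaled_area_rbe(base_area_rbe=100.0, base_ports=2, ports=4)
```

With these made-up values the design with the lower IPC still delivers more instructions per nanosecond because of its shorter clock period, which is why the talk reports throughput rather than IPC alone.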

4-wide Complexity Analysis

8-wide Complexity Analysis

Design Space Overview

Complexity Options
Run at higher frequency
- Deeper pipelines
Make the total scheduler size larger
- Increase IPC
Run at same frequency
- Much lower power

Conclusions
Competitive broadcast-free scheduling
- Allows high-speed circuits at the expense of IPC
- Saves chip area
Power savings…
Alternative to stalling
- Avoid broadcasting across stages by using the replay mechanism

Future Directions…
Close the IPC gap
- Wider queues?
Complete power analysis
- Trade-off between size and activity rate
Further opportunities to pipeline the control system
- Global synchronization without fast global communication


Cyclone Extras

Current/Future Work
Pipelined global control: Low Power Razor
- Average-case design opportunities
Simple and effective selective replay implementations (WDDD)
- Spawned from previous work (Tag Elimination, ISCA '02)
Removing as much global control as possible from pipelines