Trace cache and Back-end Oper. CSE 4711 Instruction Fetch Unit Using I-cache I-cache I-TLB Decoder Branch Pred Register renaming Execution units.

Slides:



Advertisements
Similar presentations
Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.
Advertisements

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 3, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Introduction)
A scheme to overcome data hazards
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Instruction-Level Parallelism (ILP)
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
National & Kapodistrian University of Athens Dep.of Informatics & Telecommunications MSc. In Computer Systems Technology Advanced Computer Architecture.
Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters J. Nelson Amaral.
1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Mult. Issue CSE 471 Autumn 011 Multiple Issue Alternatives Superscalar (hardware detects conflicts) –Statically scheduled (in order dispatch and hence.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
Branch Target Buffers BPB: Tag + Prediction
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.
OOO Pipelines - II Smruti R. Sarangi IIT Delhi 1.
OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.
CS203 – Advanced Computer Architecture ILP and Speculation.
IBM System 360. Common architecture for a set of machines
/ Computer Architecture and Design
Smruti R. Sarangi IIT Delhi
CS203 – Advanced Computer Architecture
Lecture: Out-of-order Processors
Smruti R. Sarangi Computer Science and Engineering, IIT Delhi
Microprocessor Microarchitecture Dynamic Pipeline
High-level view Out-of-order pipeline
A Dynamic Algorithm: Tomasulo’s
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 11: Memory Data Flow Techniques
Smruti R. Sarangi Computer Science and Engineering, IIT Delhi
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Checking for issue/dispatch
How to improve (decrease) CPI
Static vs. dynamic scheduling
Control unit extension for data hazards
Instruction Level Parallelism (ILP)
15-740/ Computer Architecture Lecture 10: Out-of-Order Execution
Control unit extension for data hazards
Control unit extension for data hazards
Lecture 10: ILP Innovations
Lecture 9: ILP Innovations
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Trace cache and Back-end Oper. CSE 4711 Instruction Fetch Unit Using I-cache I-cache I-TLB Decoder Branch Pred Register renaming Execution units

Trace cache and Back-end Oper. CSE 4712 Instruction Fetch Unit Using Trace Cache Trace CacheIcache I-TLB Branch Predictor Trace Predictor Decoder Register renaming Execution units Fill Unit

Trace cache and Back-end Oper. CSE 4713 Two Traces can have the same tag Assume traces <= 16 instructions Traces B1B2B4 and B1B3B4 have same tag (address L1) Differentiated by trace predictor or something else (e.g.,#of branches taken) B1: 6 instructions B2: 5 instructions B3: 4 instructions B4: 5 instructions L1: thenelse

Trace cache and Back-end Oper. CSE 4714 Back-end Operations (OOO) Instruction scheduling –Detecting a ready instruction: wake-up –Maybe more than m ready instructions in an m-way superscalar: need to select An often “important“ instruction” is load –Load dependencies are bottlenecks –Load latencies are variable –Does a given load conflict with previous store? Load speculation Other optimizations –Value prediction??? Critical instructions??? Clustering of functional units???

Trace cache and Back-end Oper. CSE 4715 Reservation Stations and Functional Units ProcessorRS typeNumber of res. stations Functional units Int l/s fp IBM Power PC 620distributed 154 (1) 1 1 IBM Power 4distributed 314 (2) 2 2 Intel P6 (Pentium III)centralized 203 (3) 2 4 (4) Intel Pentium 4hybrid (1 for mem.op, 1 for rest) 126 (72,54)5 (5) 2 2 (6) AMD K6centralized AMD Opterondistributed MIPS 10000hybrid (1 for int, 1 for mem, 1 for fp) 48 (16,16,16) Alpha 21264hybrid (1 for int /mem 1 for fp) 35 (20,15) Ultra SparcIIIIn-order queue (7)

Trace cache and Back-end Oper. CSE 4716 Wake-up If f functional units, then up to f results per cycle Hence f comparators per operand in reservation station If w reservation stations then need of 2fw comparators –From previous slide over a 1000 comparators! Can be reduced by –Res. Stations distributed by function –There might not be f broadcast buses

Trace cache and Back-end Oper. CSE 4717 Select Hardwired priority –Enforced by a hardware encoder: woken-up instruction sends a request for issue to encoder –In general “oldest woken-up instruction” first Examples of difficulty: –Result register name of an instruction must be broadcast one cycle before the result is computed so that a dependent instruction can be woken-up in time to get the forwarded result –In case of a cache access, this is speculative so need to be able to recover, i.e., not execute a selected instruction at a given time but let it remain in the instruction window

Trace cache and Back-end Oper. CSE 4718 Load Speculation Load = Address computation + Get memory contents Two flavors of speculation –Address speculation: Used for prefetching (see cache techniques later on) –Memory dependence prediction: dependence between loads and previous stores. The so-called memory disambiguation in Intel Core architecture, for example

Trace cache and Back-end Oper. CSE 4719 Store Buffer Once the address to where to store has been generated, the store will be put in a store buffer if either –The result of the store depends on an uncompleted instruction –The result of the store is known but the store instruction is not committed An entry in the store buffer consists of: –A bit to indicate that the entry is free (state AV) –The store has been woken-up, the store address has been computed but the result is not there (state AD) –Address and result are there but the store has not been committed (state RE) –The store instruction has been committed (state CO)

Trace cache and Back-end Oper. CSE Load/Store Unit AGU Load/store reservation stations or instruction window Store buffer Status Address Data Store unit Data Cache Address Data Load buffer Load Unit

Trace cache and Back-end Oper. CSE Load Issue Simple scheme: Load and store issue (to AGU) in program order –Simplest: Load can issue only if store buffer empty –Simpler: load bypassing – load issue if no address conflict with addresses in store buffer Requires to check if preceding store instruction has entered the address in the store buffer If there is a match in state AD or RE the load is aborted (contents discarded) –Next: load forwarding Take advantage of states RE and CO and forward result to result register of load.

Trace cache and Back-end Oper. CSE More load speculation Stores issue in program order but a load can issue before some store (i.e., load/store res. station is not a queue) Pessimistic approach (previous slide) + check that there is no store left “unissued” in reservation station before the load –Used in Pentium Optimistic approach: always issue loads –Need of a load buffer so we can recover Dependence prediction –Like optimistic but use of a predictor of memory dependencies and hence fewer recoveries

Trace cache and Back-end Oper. CSE Example Prior to this instruction all stores have been committed i1: st R1, memadd1 …………………. i2: st R2, memadd2 ……………………….. i3: ld R3, memadd3 …………………. i4: ld R4, memadd4 True mem.dep. Ready to issue

Trace cache and Back-end Oper. CSE Example (c’ed) Pessimistic: –no load can issue until i2 has computed its address and put it in store buffer –Then i4 can issue –i3 will have to wait till i2 has computed result and can forward (state RE) Optimistic –i3 and i4 issue and are put in load buffer. –When i1 computes its address, nothing happens in the load buffer –When i2 reaches state RE (or AD depending on implementation), i3 and i4 are removed from the load buffer and will have to reissue (i4 because it might depend on i3, again depending on implementation) Dependence prediction –If dependence between i2 and i3 is predicted, i3 cannot issue but i4 can (if not dependent on i3)