ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 9 Instruction-Level Parallelism – Part 2 Benjamin Lee Electrical and Computer Engineering Duke.

Slides:



Advertisements
Similar presentations
Out-of-Order Execution & Register Renaming
Advertisements

Krste Asanovic Electrical Engineering and Computer Sciences
Electrical and Computer Engineering
A scheme to overcome data hazards
2/28/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism – Part 1 Benjamin Lee Electrical and Computer Engineering Duke.
COMP25212 Advanced Pipelining Out of Order Processors.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
February 28, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP II Steve Ko Computer Sciences and Engineering University at Buffalo.
February 28, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction.
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP III Steve Ko Computer Sciences and Engineering University at Buffalo.
March 11, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 14 - Advanced Superscalars Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP I Steve Ko Computer Sciences and Engineering University at Buffalo.
March 2, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer.
CS 152 Computer Architecture and Engineering Lecture 14 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
March 4, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer Sciences University of California.
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences.
CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer Sciences University of California.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of.
March 2, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
CSC 4250 Computer Architectures October 13, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.
CS6461 – Computer Architecture Fall 2015 Morris Lancaster Adapted from Professor Stephen Kaisler’s Slides Lecture 8 Instruction level Parallelism (continued)
CS 152 Computer Architecture and Engineering Lecture 15 - Out-of-Order Memory, Complex Superscalars Review Krste Asanovic Electrical Engineering and Computer.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
1 Lecture 5: Dependence Analysis and Superscalar Techniques Overview Instruction dependences, correctness, inst scheduling examples, renaming, speculation,
Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.
March 1, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
COMP25212 Advanced Pipelining Out of Order Processors.
CS203 – Advanced Computer Architecture ILP and Speculation.
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering.
Dynamic Scheduling Why go out of style?
CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction John Wawrzynek Electrical Engineering.
Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
/ Computer Architecture and Design
Out of Order Processors
Dr. George Michelogiannakis EECS, University of California at Berkeley
CS203 – Advanced Computer Architecture
Out of Order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Electrical and Computer Engineering
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste Asanovic Electrical Engineering.
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
Adapted from the slides of Prof
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Advanced Computer Architecture
Adapted from the slides of Prof
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 11 – Out-of-Order Execution Krste Asanovic Electrical Engineering.
Conceptual execution on a processor which exploits ILP
Presentation transcript:

ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 9 Instruction-Level Parallelism – Part 2 Benjamin Lee Electrical and Computer Engineering Duke University

ECE 552 / CPS 5502 ECE552 Administrivia 27 September – Homework #2 Due -Use blackboard forum for questions -Attend office hours with questions - for separate meetings 2 October – Class Discussion Roughly one reading per class. Do not wait until the day before! 1.Srinivasan et al. “Optimizing pipelines for power and performance” 2.Mahlke et al. “A comparison of full and partial predicated execution support for ILP processors” 3.Palacharla et al. “Complexity-effective superscalar processors” 4.Yeh et al. “Two-level adaptive training branch prediction” 4 October – Midterm Exam

ECE 552 / CPS 5503 In-Order Issue Pipeline IF ID WB ALUMem Fadd Fmul Fdiv Issue GPR’s FPR’s

ECE 552 / CPS 5504 Scoreboard Busy[FU#]: a bit-vector to indicate functional unit availability where FU = {Int, Add, Mutl, Div} WP[#regs]: a bit-vector to record the registers to which writes are pending - Bits are set to true by issue logic - Bits are set to false by writeback stage - Each functional unit’s pipeline registers must carry ‘dest’ field and a flag to indicate if it’s valid: “the (we, ws) pair” Issue logic checks instruction (opcode, dest, src1, src2) against scoreboard (busy, wp) to dispatch - FU available?Busy[FU#] - RAW?WP[src1] or WP[src2] - WAR?Cannot arise - WAW?WP[dest]

ECE 552 / CPS 5505 Limitations of In-Order Issue InstructionOperandsLatency 1: LDF2, 34(R2)1 2: LDF4, 45(R3)long 3: MULTDF6, F4, F23 4: SUBDF8, F2, F21 5: DIVDF4, F2, F84 6: ADDDF10, F6, F41 In-order: 1 (2 1) ………… … In-order restriction keeps instruction 4 from issuing

ECE 552 / CPS 5506 Out-of-Order Issue -Issue stage buffer holds multiple instructions waiting to issue -Decode stage adds next instruction to buffer if there is space and next instruction does not cause a WAR or WAW hazard -Any instruction in buffer whose RAW hazards are satisfied can issue -When instruction commits, a new instruction can issue IFIDWB ALUMem Fadd Fmul Issue

ECE 552 / CPS 5507 Limitations of Out-of-Order Issue InstructionOperandsLatency 1: LDF2, 34(R2)1 2: LDF4, 45(R3)long 3: MULTDF6, F4, F23 4: SUBDF8, F2, F21 5: DIVDF4, F2, F84 6: ADDDF10, F6, F41 In-order: 1 (2 1) ………… … Out-of-order: 1 (2 1) 4 4 …….2 3… … Out-of-order execution has no gain. Why did we not issue instruction 5?

ECE 552 / CPS 5508 Instructions In-Flight What features of an ISA limit the number of instructions in the pipeline? Number of registers What features of a program limit the number of instructions in the pipeline? Control transfers Out-of-order issue does not address these other limitations.

ECE 552 / CPS 5509 Mitigating Limited Register Names Floating point pipelines often cannot be filled with small number of registers - IBM 360 had only 4 floating-point registers Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility? - In 1967, Robert Tomasulo’s solution was dynamic register renaming.

ECE 552 / CPS Little’s Law Throughput (T) = Number-in-flight (N) / Latency (L) - Example: 4 floating-point registers, 8 cycles per floating-point op - Little’s Law  ½ issue per cycle WB Issue Execution

ECE 552 / CPS ILP via Renaming InstructionOperandsLatency 1: LDF2, 34(R2)1 2: LDF4, 45(R3)long 3: MULTDF6, F4, F23 4: SUBDF8, F2, F21 5: DIVDF4, F2, F84 6: ADDDF10, F6, F41 In-order: 1 (2 1) ………… … Out-of-order: 1 (2 1) ….2 (3, 5) Any anti-dependence can be eliminated by renaming (requires additional storage). Renaming can be done in hardware! X

ECE 552 / CPS Register Renaming -Decode stage renames registers and adds instructions to the reorder buffer (ROB) -ROB tracks in-flight instructions in program order -ROB renames registers to eliminate WAR or WAW hazards -ROB instructions with resolved RAW hazards can issue (source operands are ready) -This is called “out-of-order” or “dataflow” execution IFIDWB ALUMem Fadd Fmul Issue

ECE 552 / CPS Reorder Buffer (ROB) Instruction slot is candidate for execution when… -Instruction is valid (“use” bit is set) -Instruction is not already executing (“exec” bit is clear) -Operands are available (“p1” and “p2” are set for “src1” and “src2”) Reorder buffer t1t2...tnt1t2...tn ptr 2 next to deallocate ptr 1 next available Ins# use exec op p1 src1 p2 src2

ECE 552 / CPS Renaming Registers and the ROB 1.Insert instruction into ROB (after decoding it) i.ROB entry is used, use  1 ii.Instruction is not yet executing, exec  1 iii.Specify operation in ROB entry 2.Update renaming table i.Identify instruction’s destination register (e.g., F1) ii.Look up register (e.g., F1) in renaming table iii.Insert pointer to instruction’s ROB entry 3.When instruction executes, exec  1 4.When instruction writes-back, replace pointer to ROB with produced value

ECE 552 / CPS Example 1: LD F2, 34 (R2) 2: LD F4, 45 (R3) 3: MUTLD F6, F4, F2 4: SUBD F8, F2, F2 5: DIVD F4, F2, F8 6: ADDD F10, F6, F4 When are names in sources replaced by data? When a functional unit produces data When can a name be re-used? When an instruction completes Renaming tableReorder buffer Ins# use exec op p1 src1 p2 src2 t1t2t3t4t5..t1t2t3t4t5.. data / t i p data F1 F2 F3 F4 F5 F6 F7 F8 t LD t LD DIV 1 v1 0 t SUB 1 v1 1 v1 t MUL 0 t2 1 v1 t3 t5 v LD SUB 1 v1 1 v1 4 0 v DIV 1 v1 1 v LD MUL 1 v2 1 v1

ECE 552 / CPS Renaming Registers and the ROB 1.Insert instruction into ROB (after decoding it) i.ROB entry is used, use  1 ii.Instruction is not yet executing, exec  1 iii.Specify operation in ROB entry 2.Update renaming table i.Identify instruction’s destination register (e.g., F1) ii.Look up register (e.g., F1) in renaming table iii.Insert pointer to instruction’s ROB entry 3.When instruction executes, exec  1 4.When instruction writes-back, replace pointer to ROB with produced value

ECE 552 / CPS Register Renaming - Decode stage allocates instruction template (i.e., tag t) and stores tag in register file. - When instruction completes, tag is de-allocated. Load Unit FU Store Unit Ins# use exec op p1 src1 p2 src2 t1t2..tnt1t2..tn Renaming Table & Register File Reorder Buffer

ECE 552 / CPS Allocating/Deallocating Templates - Reorder buffer is managed circularly. - Field “exec” is set when instruction begins execution. - Field “use” is cleared when instruction completes - Ptr2 increments when “use” bit is cleared. Reorder buffer t1t2...t1t2... ptr 2 next to deallocate prt 1 next available Ins# use exec op p1 src1 p2 src2

ECE 552 / CPS Reservation Stations Mult pdata p 1212 p load buffers (from memory) Adder pdata p Floating-point Register File & Renaming Table store buffers (to memory)... instructions Common bus ensures that data is made available immediately to all the instructions waiting for it IBM 360/91 distributes instruction templates (ROB) by functional units. Also known as reservation stations. p data

ECE 552 / CPS Effectiveness History - Renaming/out-of-order execution first introduction in 360/91 in However, implementation did not re-appear until mid-90s - Why? Limitations - Effective on a very small class of problems - Memory latency was a much bigger problem in the 1960s - Problem-1: Exceptions were not precise - Problem-2: Control transfers

ECE 552 / CPS Precise Interrupts Definition - It must appear as if an interrupt is taken between two instructions - Consider instructions k, k+1 - Effect of all instructions up to and including k is totally complete - No effect of any instruction after k has taken place Interrupt Handler - Aborts program or restarts at instruction k+1

ECE 552 / CPS Out-of-Order & Interrupts Out-of-order Completion - Precise interrupts are difficult to implement at high performance - Want to start execution of later instructions before exception checks are finished on earlier instructions I 1 DIVDf6, f6,f4 I 2 LDf2,45(r3) I 3 MULTDf0,f2,f4 I 4 DIVDf8,f6,f2 I 5 SUBDf10,f0,f6 I 6 ADDDf6,f8,f2 out-of-order comp restore f2 restore f10 interrupts

ECE 552 / CPS Exception Handling (in-order) -- Hold exception flags in pipeline until commit point -- Exceptions earlier in program order override those later in program order -- Inject external interrupts, which over-ride others, at commit point -- If exception at commit: (1) update Cause and EPC registers, (2) kill all stages, (3) inject handler PC into fetch stage Asynchronous Interrupts Exc D PC D PC Inst. Mem D Decode EM Data Mem W + Exc E PC E Exc M PC M Cause EPC Kill D Stage Kill F Stage Kill E Stage Illegal Opcode Overflow Data Addr Except PC Address Exceptions Kill Writeback Select Handler PC Commit Point

ECE 552 / CPS Phases of Instruction Execution Fetch: Instruction bits retrieved from cache. I-cache Fetch Buffer Issue Buffer Func. Units Arch. State Execute: Instructions and operands sent to execution units. When execution completes, all results and exception flags are available. Decode: Instructions placed in appropriate issue (aka “dispatch”) buffer Result Buffer Commit: Instruction irrevocably updates architectural state (aka “graduation” or “completion”). PC

ECE 552 / CPS Exception Handling (out-of-order) In-Order Commit for Precise Exceptions - Instructions fetched, decoded into reorder buffer (ROB) in-order - Instructions executed, completed out-of-order - Instructions committed in-order - Instruction commit writes to architectural state (e.g., register file, memory) Need temporary storage for results before commit FetchDecode Execute Commit Reorder Buffer In-order Out-of-order Kill Exception? Inject handler PC

ECE 552 / CPS Supporting Precise Exceptions - Add fields to instruction template - pd (1 if result ready), dest (target register), data (result computed) - cause (reason for interrupt/exception) - Commit instructions to register file and memory in-order - On exception, clear re-order buffer (reset ptr-1 = ptr-2) - Store instructions must commit before modifying memory ptr 2 next to commit ptr 1 next available Inst# use exec op p1 src1 p2 src2 pd dest data cause

ECE 552 / CPS Renaming and Rollbacks Register file no longer contains renaming tags. How does decode stage find tag of a source register? Search the ‘dest’ field in reorder buffer (ROB). Register File (now holds only committed state) Reorder buffer Load Unit FU Store Unit t1t2..tnt1t2..tn Ins# use exec op p1 src1 p2 src2 pd dest data Commit

ECE 552 / CPS Renaming and Rollbacks Register File (now holds only committed state) Reorder Buffer Load Unit FU Store Unit t1t2..tnt1t2..tn Ins# use exec op p1 src1 p2 src2 pd dest data Commit Rename Table Renaming table is a cache, speeds up register name look-up. Table is cleared after each exception. When else are valid bits cleared? Control transfers. r1r1 tv r2r2 tag valid bit

ECE 552 / CPS Control Transfer Penalty I-cache Fetch Buffer Issue Buffer Func. Units Arch. State Execute Decode Result Buffer Commit PC Fetch Branch executed Next fetch started Modern processors may have >10 pipeline stages between next PC calculation and branch resolution. How much work is lost if pipeline does not follow correct instruction flow? [Loop Length] x [Pipeline Width]

ECE 552 / CPS Branches and Jumps Each instruction fetch depends on 1-2 pieces of information from preceding instruction: 1. Is preceding instruction a branch? 2. If so, what is the target address? InstructionTaken known?Target known? Jafter decodeafter decode JRafter decodeafter fetch BEQZ/BNEZafter fetch*after decode *assuming zero? detect when register read

ECE 552 / CPS Reducing Control Flow Penalty Software Solutions 1. Eliminate branches -- loop unrolling increases run length before branch 2. Reduce resolution time – instruction scheduling moves instruction that produces condition earlier Hardware Solutions 1. Find other work – delay slots and software cooperation 2. Speculate – predict branch result and execute instructions beyond branch

ECE 552 / CPS Acknowledgements These slides contain material developed and copyright by -Arvind (MIT) -Krste Asanovic (MIT/UCB) -Joel Emer (Intel/MIT) -James Hoe (CMU) -John Kubiatowicz (UCB) -Alvin Lebeck (Duke) -David Patterson (UCB) -Daniel Sorin (Duke)