ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism – Part 1 Benjamin Lee Electrical and Computer Engineering Duke.

Slides:



Advertisements
Similar presentations
Complex Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring
Advertisements

© Krste Asanovic, 2014CS252, Spring 2014, Lecture 5 CS252 Graduate Computer Architecture Spring 2014 Lecture 5: Out-of-Order Processing Krste Asanovic.
Electrical and Computer Engineering
Computer Architecture: A Constructive Approach Six Stage Pipeline/Bypassing Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts Institute.
Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.
A scheme to overcome data hazards
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
2/28/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction Krste.
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
COMP25212 Advanced Pipelining Out of Order Processors.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
February 28, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Instruction-Level Parallelism (ILP)
February 28, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction.
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP I Steve Ko Computer Sciences and Engineering University at Buffalo.
March 2, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer.
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer Sciences University of California.
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences.
EENG449b/Savvides Lec /20/04 February 12, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
CS 152 Computer Architecture and Engineering Lecture 12 - Complex Pipelines Krste Asanovic Electrical Engineering and Computer Sciences University of California.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
EENG449b/Savvides Lec 5.1 1/27/04 January 27, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 9 Instruction-Level Parallelism – Part 2 Benjamin Lee Electrical and Computer Engineering Duke.
CET 520/ Gannod1 Section A.8 Dynamic Scheduling using a Scoreboard.
CS 152 Computer Architecture and Engineering Lecture 15 - Out-of-Order Memory, Complex Superscalars Review Krste Asanovic Electrical Engineering and Computer.
1 Images from Patterson-Hennessy Book Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91,
© Krste Asanovic, 2015CS252, Fall 2015, Lecture 6 CS252 Graduate Computer Architecture Fall 2015 Lecture 6: Out-of-Order Processors Krste Asanovic
COMP25212 Advanced Pipelining Out of Order Processors.
CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.
Instruction-Level Parallelism and Its Dynamic Exploitation
IBM System 360. Common architecture for a set of machines
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering.
Instruction-Level Parallelism
CS252 Graduate Computer Architecture Fall 2015 Lecture 5: Out-of-Order Processing Krste Asanovic
CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction John Wawrzynek Electrical Engineering.
Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
/ Computer Architecture and Design
Lecture 07: Pipelining Multicycle, MIPS R4000, and More
Out of Order Processors
Step by step for Tomasulo Scheme
CS203 – Advanced Computer Architecture
Microprocessor Microarchitecture Dynamic Pipeline
Pipelining: Advanced ILP
CMSC 611: Advanced Computer Architecture
Out of Order Processors
Pipelining Multicycle, MIPS R4000, and More
Superscalar Processors & VLIW Processors
Electrical and Computer Engineering
Krste Asanovic Electrical Engineering and Computer Sciences
CS 704 Advanced Computer Architecture
Krste Asanovic Electrical Engineering and Computer Sciences
Krste Asanovic Electrical Engineering and Computer Sciences
How to improve (decrease) CPI
CSCE430/830 Computer Architecture
CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with Scoreboards.
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 4 – Pipelining Part II Krste Asanovic Electrical Engineering.
September 20, 2000 Prof. John Kubiatowicz
CMSC 611: Advanced Computer Architecture
Presentation transcript:

ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism – Part 1 Benjamin Lee Electrical and Computer Engineering Duke University

ECE 552 / CPS 5502 ECE552 Administrivia 27 September – Homework #2 Due -Use blackboard forum for questions -Attend office hours with questions - for separate meetings 2 October – Class Discussion Roughly one reading per class. Do not wait until the day before! 1.Srinivasan et al. “Optimizing pipelines for power and performance” 2.Mahlke et al. “A comparison of full and partial predicated execution support for ILP processors” 3.Palacharla et al. “Complexity-effective superscalar processors” 4.Yeh et al. “Two-level adaptive training branch prediction”

ECE 552 / CPS 5503 Complex Pipelining Pipelining becomes complex when we want high performance in the presence of… -Long latency or partially pipelined floating-point units -Memory systems with variable access time -Multiple arithmetic and memory units MIPS Floating Point -Interaction between floating-point (FP), integer datapath defined by ISA -Architect separate register files for floating point (FPR) and integer (GPR) -Define separate load/store instructions for FPR, GPR -Define move instructions between register files -Define FP branches in terms of FP-specific condition codes

ECE 552 / CPS 5504 Floating-Point Unit (FPU) FPU requires much more hardware than integer unit Single-cycle FPU a bad idea -Why? -It is common to have several, different types of FPUs (Fadd, Fmul, etc.) -FPU may be pipelined, partially pipelined, or not pipelined Floating-point Register File (FPR) -To operate several FPUs concurrently, FPR requires several read/write ports

ECE 552 / CPS 5505 Pipelining FPUs Functional units have internal pipeline registers -Inputs to a functional unit (e.g., register file) can change during a long latency operation -Operands are latched when an instruction enters the functional unit fully pipelined partially pipelined 1cyc 2 cyc

ECE 552 / CPS 5506 Realistic Memory Systems Latency of main memory access usually greater than one cycle and often unpredictable -Solving this problem is a central issue in computer architecture Improving memory performance -Separate instruction and data memory ports, no self-modifying code -Caches -- size L1 cache for single-cycle access -Caches -- L1 miss stalls pipeline -Memory – interleaving memory allows multiple simultaneous access -Memory – bank conflicts stall the pipeline

ECE 552 / CPS 5507 Multiple Functional Units IF ID WB ALUMem Fadd Fmul Fdiv Issue GPR’s FPR’s

ECE 552 / CPS 5508 Complex Pipeline Control Implications of multi-cycle instructions -FPU or memory unit requires more than one cycle -Structural conflict in execution stage, if FPU or memory unit is not pipelined Different functional unit latencies -Structural conflict in writeback stage due to different latencies -Out-of-order write conflicts due to variable latencies How to handle exceptions?

ECE 552 / CPS 5509 Complex In-Order Pipeline Delay writeback so all operations have same latency to writeback stage. Write ports never over- subscribed – Every cycle has one instruction in and one instruction out How do we prevent increased writeback latency from slowing down single-cycle integer operations? Forwarding PC Inst. Mem D Decode X1X2 Data Mem W + GPRs X2W Fadd X3 FPRs X1 X2 Fmul X3 X2 FDiv X3 Unpipelined divider Commit Point

ECE 552 / CPS Complex In-Order Pipeline How do we handle data hazards for very long latency operations? Stall pipeline on long latency operations (e.g., divides, cache misses) Exceptions handled in program order at commit point PC Inst. Mem D Decode X1X2 Data Mem W + GPRs X2W Fadd X3 FPRs X1 X2 Fmul X3 X2 FDiv X3 Unpipelined divider Commit Point

ECE 552 / CPS Superscalar In-Order Pipeline Fetch 2 instructions per cycle. Issue both simultaneously if instruction mix matches functional unit mix. Increases instruction throughput. How do we further increase issue width? (a) duplicate functional units, (b) increase register file ports, (c) increase forwarding paths 2 PC Inst. Mem D Dual Decode X1X2 Data Mem W + GPRs X2W Fadd X3 FPRs X1X2 Fmul X3 X2 FDiv X3 Unpipelined divider Commit Point

ECE 552 / CPS Dependence Analysis Consider executing a sequence of instructions of the form: Rk  (Ri) op (Rj) Data Dependence R3  (R1) op (R4)# RAW hazard (R3) R5  (R3) op (R4) Anti-dependence R3  (R1) op (R2)# WAR hazard (R1) R1  (R4) op (R5) Output-dependence R3  (R1) op (R2)# WAW hazard (R3) R3  (R6) op (R7)

ECE 552 / CPS Detecting Data Hazards Range and Domain of Instruction (j) R(j) = registers (or other storage) modified by instruction j D(j) = registers (or other storage) read by instruction j Suppose instruction k follows instruction j in program order. Executing instruction k before the effect of instruction j has occurred can cause… RAW hazard if R(j)  D(k)  # j modifies a register read by k WAR hazard if D(j)  R(k)  # j reads a register modified by k WAW hazard if R(j)  R(k)  # j, k modify the same register

ECE 552 / CPS Registers vs Memory Dependence Data hazards due to register operands can be determined at decode stage Data hazards due to memory operands can be determined only after computing effective address in execute stage storeM[R1 + disp1]  R2 loadR3  M[R4 + disp2] (R1 + disp1) == (R4 + disp2)?

I 1 DIVDf6, f6,f4 I 2 LDf2,45(r3) I 3 MULTDf0,f2,f4 I 4 DIVDf8,f6,f2 I 5 SUBDf10,f0,f6 I 6 ADDDf6,f8,f2 ECE 552 / CPS Data Hazards Example RAW Hazards WAR Hazards WAW Hazards

ECE 552 / CPS Instruction Scheduling I6I6 I2I2 I4I4 I1I1 I5I5 I3I3 Valid Instruction Orderings in-orderI 1 I 2 I 3 I 4 I 5 I 6 out-of-order I2 I1 I3 I4 I5I6I2 I1 I3 I4 I5I6 I1 I2I3 I5 I4I6I1 I2I3 I5 I4I6 I 1 DIVDf6, f6,f4 I 2 LDf2,45(r3) I 3 MULTDf0,f2,f4 I 4 DIVDf8,f6,f2 I 5 SUBDf10,f0,f6 I 6 ADDDf6,f8,f2

ECE 552 / CPS Out-of-Order Completion Let k indicate when instruction k is issued. Let k denote when instruction k is completed. Latency I 1 DIVDf6, f6,f4 4 I 2 LDf2,45(r3)1 I 3 MULTDf0,f2,f43 I 4 DIVDf8,f6,f24 I 5 SUBDf10,f0,f61 I 6 ADDDf6,f8,f21 I6I6 I2I2 I4I4 I1I1 I5I5 I3I3

ECE 552 / CPS Out-of-Order Completion Latency I 1 DIVDf6, f6,f4 4 I 2 LDf2,45(r3)1 I 3 MULTDf0,f2,f43 I 4 DIVDf8,f6,f24 I 5 SUBDf10,f0,f61 I 6 ADDDf6,f8,f21 in-order comp1 2 out-of-order comp I6I6 I2I2 I4I4 I1I1 I5I5 I3I3

ECE 552 / CPS Scoreboard Up until now, we assumed user or compiler statically examines instructions, detecting hazards and scheduling instructions Scoreboard is a hardware data structure to dynamically detect hazards

ECE 552 / CPS Cray CDC6600 Seymour Cray, Fast, pipelined machine with 60-bit words Kword main memory capacity, 32-banks - Ten functional units (parallel, unpipelined) - Floating-point: adder, 2 multipliers, divider - Integer: adder, 2 incrementers - Dynamic instruction scheduling with scoreboard - Ten peripheral processors for I/O -More than 400K transistors, 750 sq-ft, 5 tons, 150kW with novel Freon-based cooling - Very fast clock, 10MHz (FP add in 4 clocks) - Fastest machine in world for 5 years - Over 100 sold ($7-10M each)

ECE 552 / CPS IBM Memo on CDC6600 Thomas Watson Jr., IBM CEO, August 1963 “Last week, Control Data….announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers…Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership by letting someone else offer the world’s most powerful computer.” To which Cray replied… “It seems like Mr. Watson has answered his own question.”

ECE 552 / CPS Multiple Functional Units IF ID WB ALUMem Fadd Fmul Fdiv Issue GPR’s FPR’s Previously, resolved write hazards (WAR, WAW) by equalizing pipeline depths and forwarding. Is there an alternative?

ECE 552 / CPS Conditions for Instruction Issue When is it safe to issue an instruction? - Suppose a data structure tracks all instructions in all functional units Before issuing instruction, issue logic must check: - Is the required functional unit available? Check for structural hazard. - Is the input data available? Check for RAW hazard. - Is it safe to write the destination? Check for WAR, WAW hazard - Is there a structural hazard at the write back stage?

ECE 552 / CPS Issue Logic and Data Structure In issue stage, instruction j consults the table - Functional unit available? Check the busy column - RAW?Search the Dest column for j’s sources - WAR?Search the Src1/2 columns for j’s destination - WAW?Search the Dest column for j’s destination If no hazard, add entry and issue instruction Upon instruction write back, remove entry NameBusyOpDestSrc1Src2 Int Mem Add1 Add2 Add3 Mult1 Mult2 Div

ECE 552 / CPS Simplifying the Data Structure Assume instructions issue in-order Assume issue logic does not dispatch instruction if it detects RAW hazard or busy functional unit Assume functional unit latches operands when the instruction is issued

ECE 552 / CPS Simplifying the Data Structure Can the dispatched instruction cause WAR hazard? - No, because instructions issue in order and operands are read at issue No WAR Hazards - No need to track source-1 and source-2 Can the dispatched instruction cause WAW hazard? - Yes, because instructions may complete out-of-order Do not issue instruction in case of WAW hazard - In scoreboard, a register name occurs at most once in ‘dest’ column

ECE 552 / CPS Scoreboard Busy[FU#]: a bit-vector to indicate functional unit availability (FU = Int, Add, Mutl, Div) WP[#regs]: a bit-vector to record the registers to which writes are pending - Bits are set to true by issue logic - Bits are set to false by writeback stage - Each functional unit’s pipeline registers must carry ‘dest’ field and a flag to indicate if it’s valid: “the (we, ws) pair” Issue logic checks instruction (opcode, dest, src1, src2) against scoreboard (busy, wp) to dispatch - FU available?Busy[FU#] - RAW?WP[src1] or WP[src2] - WAR?Cannot arise - WAW?WP[dest]

ECE 552 / CPS I 1 DIVDf6, f6,f4 I 2 LDf2,45(r3) I 3 MULTDf0,f2,f4 I 4 DIVDf8,f6,f2 I 5 SUBDf10,f0,f6 I 6 ADDDf6,f8,f2 Busy-Functional Units Status Writes Pending (WP) Int(1) Add(1) Mult(3) Div(4) WB t0 I 1 f6 f6 t1 I 2 f2 f6f6, f2 t2 f6 f2 f6, f2I 2 t3 I 3 f0 f6 f6, f0 t4 f0 f6 f6, f0 I 1 t5 I 4 f0 f8 f0, f8 t6 f8 f0 f0, f8I 3 t7 I 5 f10f8 f8, f10 t8 f8 f10 f8, f10I 5 t9 f8 f8I 4 t10 I 6 f6 f6 t11 f6 f6I 6 Instruction Issue Logic FU available? Busy[FU#] RAW? WP[src1] or WP[src2] WAR? Cannot arise WAW? WP[dest]

ECE 552 / CPS Scoreboard Detect hazards dynamically Issue instructions in-order Complete instructions out-of-order Increases instruction-level-parallelism by - More effectively exploiting multiple functional units - Reducing the number of pipeline stalls due to hazards

ECE 552 / CPS Acknowledgements These slides contain material developed and copyright by -Arvind (MIT) -Krste Asanovic (MIT/UCB) -Joel Emer (Intel/MIT) -James Hoe (CMU) -John Kubiatowicz (UCB) -Alvin Lebeck (Duke) -David Patterson (UCB) -Daniel Sorin (Duke)