
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Cont. Computer Architecture A Quantitative Approach, Fifth Edition

2 Copyright © 2012, Elsevier Inc. All rights reserved. Dynamic Scheduling
Rearrange the order of instructions to reduce stalls while maintaining data flow
Instructions initiate and/or complete out of order
Advantages:
- Compiler doesn’t need to have knowledge of the microarchitecture
- Handles cases where dependencies are complex or unknown at compile time
Disadvantages:
- Substantial increase in hardware complexity
- New types of data hazards

3 Name Dependencies
A name dependence occurs when two otherwise unrelated instructions use the same register
- Not an issue if the instructions are executed in the original order
Antidependence: instruction i must read register R before instruction j writes to R, so that i reads the correct data
Output dependence: instruction i must write register R before instruction j writes to R, so that R ends up holding the correct value
Copyright © 2012, Elsevier Inc. All rights reserved.

4 Data Hazards
RAW (read after write)
- An instruction tries to read data before a previous instruction has written it (i.e., the data is not yet ready)
- Solution: stall the pipeline until the data is ready, or forward the data
WAR (write after read)
- Due to reordering, an instruction reads incorrect data because a later instruction has already written to the register (antidependence)
- Solution: maintain the original order or rename registers
WAW (write after write)
- Due to reordering, a register ends up holding incorrect data because a later instruction has already written it when an earlier instruction writes (output dependence)
- Solution: maintain the original order or rename registers
Copyright © 2012, Elsevier Inc. All rights reserved.

5 Register Renaming
Example:
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  S.D   F6,0(R1)
  SUB.D F8,F10,F14
  MUL.D F6,F10,F8
Name dependences involve F6 and F8: antidependences on F8 (ADD.D reads it, SUB.D writes it) and F6 (S.D reads it, MUL.D writes it), plus an output dependence on F6 (ADD.D and MUL.D both write it)

6 Copyright © 2012, Elsevier Inc. All rights reserved. Register Renaming
Example, with the conflicting uses of F6 and F8 renamed to S and T:
  DIV.D F0,F2,F4
  ADD.D S,F0,F8
  S.D   S,0(R1)
  SUB.D T,F10,F14
  MUL.D F6,F10,T
Now only true RAW hazards remain, and these can be strictly ordered
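The same renaming can be done mechanically. Below is a minimal sketch in Python (not part of the original slides): a single pass that gives every destination a fresh name and redirects later reads to the most recent name. The tuple encoding of instructions and the helper names are assumptions of this sketch, not the hardware mechanism.

# Minimal register-renaming sketch (illustrative only).
# Each instruction is a tuple (op, dest, src1, src2); None marks an unused field.
def rename(instructions):
    mapping = {}      # architectural register -> most recent new name
    counter = 0
    renamed = []
    for op, dest, src1, src2 in instructions:
        # Reads are redirected to the most recent name for each source register.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Every write gets a fresh name, which removes WAR and WAW dependences.
        if dest is not None:
            counter += 1
            mapping[dest] = f"T{counter}"
            dest = mapping[dest]
        renamed.append((op, dest, s1, s2))
    return renamed

code = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("S.D",   None, "F6", "R1"),    # the store reads F6; its base register is shown as a source
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]
for instr in rename(code):
    print(instr)

Running this on the five-instruction sequence above produces code whose only remaining dependences are the true RAW dependences, just as in the renamed example on the slide.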

7 Tomasulo’s Approach
Key components:
Register renaming
- Allows multiple copies of register contents
- Data is buffered per instruction instead of per register
- Eliminates WAR hazards
Register status
- Tracks the instructions writing to registers, to enforce write order
- Eliminates WAW hazards
Common Data Bus (CDB)
- Broadcast medium for distributing results
- Data is forwarded as soon as it is ready; no need to wait for the register file
Copyright © 2012, Elsevier Inc. All rights reserved.

8 Reservation Stations
Register renaming is provided by reservation stations (RS)
Each RS contains:
- The operation to perform (Op)
- Buffered operand values (Vj, Vk), when available
- The reservation station numbers of the instructions providing the operand values (Qj, Qk)
As instructions are issued, operand values already in registers are buffered in the RS
If an operand value is not yet in a register, record the RS that will produce it and listen for the value on the CDB
- This handles RAW hazards: the instruction simply waits until the value is broadcast
There may be more reservation stations than registers
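As a rough illustration (not from the slides), one reservation-station entry can be modeled as a small record. The field names follow the Vj/Vk/Qj/Qk convention above; everything else, including the readiness check, is an assumption of this sketch.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """One reservation-station entry in the spirit of Tomasulo's scheme."""
    busy: bool = False
    op: Optional[str] = None     # operation to perform (e.g. "ADD.D")
    vj: Optional[float] = None   # value of the first operand, once known
    vk: Optional[float] = None   # value of the second operand, once known
    qj: Optional[str] = None     # RS that will produce Vj (None if Vj is ready)
    qk: Optional[str] = None     # RS that will produce Vk (None if Vk is ready)

    def ready(self) -> bool:
        # An instruction may begin execution when both operands are available.
        return self.busy and self.qj is None and self.qk is None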

9 Copyright © 2012, Elsevier Inc. All rights reserved. Tomasulo’s Algorithm
Load and store buffers
- Contain data and addresses, and act like reservation stations
Top-level design:

10 Copyright © 2012, Elsevier Inc. All rights reserved. Tomasulo’s Algorithm
Three steps:
Issue
- Get the next instruction from the FIFO queue
- If a reservation station is available, issue the instruction to that RS
- Stall if no RS is available
Execute
- When all operands are ready, begin execution
- Must also wait until all preceding branches have completed (no speculation yet)
Write result
- Write the result on the CDB into reservation stations, store buffers, and registers
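To make the three steps concrete, here is a heavily simplified Python sketch (an illustration only, not the actual hardware algorithm or the textbook's tables): reservation stations and the register-status table are plain dictionaries, loads, stores, and branches are omitted, and latencies are whatever the caller supplies.

def make_stations(n):
    return [dict(busy=False, op=None, vj=None, vk=None, qj=None, qk=None,
                 remaining=0, result=None) for _ in range(n)]

ALU = {"ADD.D": lambda a, b: a + b, "SUB.D": lambda a, b: a - b,
       "MUL.D": lambda a, b: a * b, "DIV.D": lambda a, b: a / b}

def issue(queue, stations, reg_status, regs):
    """Issue: take the next instruction from the queue if an RS is free."""
    if not queue:
        return
    free = next((rs for rs in stations if not rs["busy"]), None)
    if free is None:
        return                                   # stall: no reservation station available
    instr = queue.pop(0)
    free.update(busy=True, op=instr["op"], remaining=instr["latency"], result=None)
    for field, src in (("j", instr["src1"]), ("k", instr["src2"])):
        producer = reg_status.get(src)           # RS that will write this register, if any
        if producer is not None:
            free["q" + field], free["v" + field] = producer, None
        else:
            free["q" + field], free["v" + field] = None, regs[src]
    reg_status[instr["dest"]] = free             # later readers will wait on this RS

def execute(stations):
    """Execute: every RS whose operands are both available makes progress."""
    for rs in stations:
        if rs["busy"] and rs["qj"] is None and rs["qk"] is None and rs["remaining"]:
            rs["remaining"] -= 1
            if rs["remaining"] == 0:
                rs["result"] = ALU[rs["op"]](rs["vj"], rs["vk"])

def write_result(stations, reg_status, regs):
    """Write result: broadcast finished results on the CDB."""
    for rs in stations:
        if rs["busy"] and rs["remaining"] == 0 and rs["result"] is not None:
            for other in stations:               # waiting reservation stations grab the value
                if other["qj"] is rs:
                    other["qj"], other["vj"] = None, rs["result"]
                if other["qk"] is rs:
                    other["qk"], other["vk"] = None, rs["result"]
            for reg in [r for r, p in reg_status.items() if p is rs]:
                regs[reg] = rs["result"]         # the register file picks it up as well
                del reg_status[reg]
            rs["busy"] = False

A tiny driver shows the flow for the first two instructions of the running example, with made-up register values:

regs = {"F2": 6.0, "F4": 2.0, "F8": 1.0, "F10": 5.0, "F14": 3.0}
queue = [dict(op="DIV.D", dest="F0", src1="F2", src2="F4", latency=40),
         dict(op="ADD.D", dest="F6", src1="F0", src2="F8", latency=2)]
stations, reg_status = make_stations(3), {}
for cycle in range(50):
    write_result(stations, reg_status, regs)
    execute(stations)
    issue(queue, stations, reg_status, regs)
print(regs["F0"], regs["F6"])    # 3.0 4.0 once both results have been written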

11 Tomasulo Example
The example uses an FP instruction sequence in which instructions require multiple execution cycles
Hardware assumptions:
- One dedicated integer addressing unit
- Three memory loaders
- Three DP adders (which also handle subtraction)
- Two DP multipliers (which also handle division)
Latency assumptions:
- Memory operation: 2 cycles
- FP add: 2 cycles
- FP multiply: 10 cycles
- FP divide: 40 cycles
Copyright © 2012, Elsevier Inc. All rights reserved.

12 Tomasulo Example Cycle 0

13 Tomasulo Example Cycle 1

14 Tomasulo Example Cycle 2

15 Tomasulo Example Cycle 3 Note: register names are removed (“renamed”) in the reservation stations; MULT issued. Load1 completing; what is waiting for Load1?

16 Tomasulo Example Cycle 4 Load2 completing; what is waiting for it?

17 Tomasulo Example Cycle 5

18 Tomasulo Example Cycle 6 Issue ADDD here

19 Tomasulo Example Cycle 7 Add1 (SUBD) completing; what is waiting for it?

20 Tomasulo Example Cycle 8

21 Tomasulo Example Cycle 9

22 Tomasulo Example Cycle 10 Add2 completing; what is waiting for it?

23 Tomasulo Example Cycle 11 Write result of ADDD here

24 Tomasulo Example Cycle 12 Note: all of the quick instructions have already completed

25 Tomasulo Example Cycle 13

26 Tomasulo Example Cycle 14

27 Tomasulo Example Cycle 15 Mult1 completing; what is waiting for it?

28 Tomasulo Example Cycle 16 Note: Just waiting for divide

29 Tomasulo Example Cycle 55

30 Tomasulo Example Cycle 56 Mult2 completing; what is waiting for it?

31 Tomasulo Example Cycle 57

32 Tomasulo’s Algorithm Summary
Reservation stations: renaming to a larger set of “registers” plus buffering of source operands
- Prevents the register file from becoming a bottleneck
- Avoids WAR and WAW hazards
- Allows loop unrolling in hardware
- Not limited to basic blocks
Lasting contributions:
- Dynamic scheduling
- Register renaming
Copyright © 2012, Elsevier Inc. All rights reserved.

33 Copyright © 2012, Elsevier Inc. All rights reserved. Hardware-Based Speculation
Branch prediction
- Predict the branch outcome to allow fetching and decoding of subsequent instructions
Branch speculation
- Predict the branch outcome to allow fetching, decoding, and execution of subsequent instructions
- Relatively easy in a simple pipeline: just turn the speculatively fetched instructions into no-ops before they complete if the prediction was incorrect
- More difficult in out-of-order processors: requires treating the speculative instructions as a transaction

34 Copyright © 2012, Elsevier Inc. All rights reserved. Hardware-Based Speculation
Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
- Add an instruction commit phase to Tomasulo’s design
- Only allow an instruction to update the register file when the instruction is no longer speculative
- Need an additional piece of hardware to buffer instructions until they commit

35 Copyright © 2012, Elsevier Inc. All rights reserved. Reorder Buffer
Reorder buffer (ROB): an ordered buffer that holds the results of instructions between completion and commit
- Instructions enter in issue order and are committed in that order as they complete
Four fields per entry:
- Instruction type: branch / store / register
- Destination field: register number
- Value field: output value
- Ready field: has the instruction completed execution?
Modify the reservation stations:
- Get operand values from the ROB instead of from other reservation stations
- Store the ROB identifier for each instruction
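A minimal sketch (not from the slides) of one ROB entry as a Python record, mirroring the four fields listed above; the field types are assumptions chosen for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    """One reorder-buffer entry with the four fields listed above."""
    instr_type: str                  # "branch", "store", or "register"
    dest: Optional[str] = None       # destination register (or store address)
    value: Optional[float] = None    # result, filled in when execution completes
    ready: bool = False              # has the instruction finished executing?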

36 Copyright © 2012, Elsevier Inc. All rights reserved. Reorder Buffer
When an instruction reaches the head of the ROB, handle it based on its type
Register/store types
- Commit the instruction by writing the value to the register file (or to memory, for a store) and removing the entry from the ROB
Branch type
- If the prediction was correct, commit normally
- If the prediction was incorrect, flush the ROB and restart from the correct path
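The commit rule can be sketched as a small loop over the ROB head (illustrative only; the dict-based entries, the misprediction flag, and the simplified store addresses are assumptions of this sketch):

from collections import deque

def commit(rob, regs, memory):
    """In-order commit from the head of the ROB (sketch, not the hardware tables)."""
    while rob and rob[0]["ready"]:
        entry = rob.popleft()
        if entry["type"] == "register":
            regs[entry["dest"]] = entry["value"]     # architectural register updated only here
        elif entry["type"] == "store":
            memory[entry["dest"]] = entry["value"]   # memory is written only at commit
        elif entry["type"] == "branch" and entry["mispredicted"]:
            rob.clear()      # wrong-path work is discarded...
            break            # ...and fetch restarts on the correct path

rob = deque([
    dict(type="register", dest="F6", value=4.0, ready=True),
    dict(type="store", dest=0x100, value=4.0, ready=True),
    dict(type="branch", dest=None, value=None, ready=True, mispredicted=True),
    dict(type="register", dest="F8", value=9.9, ready=True),   # never commits
])
regs, memory = {}, {}
commit(rob, regs, memory)
print(regs, memory)   # {'F6': 4.0} {256: 4.0}; the squashed F8 write never reaches architectural state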

37 Reorder Buffer Copyright © 2012, Elsevier Inc. All rights reserved.

38 Extending Speculation
Branch-target buffer (BTB)
- Stores the target PC with each prediction
- Allows the predicted next instruction to be fetched immediately after the prediction
Return-address predictor
- Predicts the next PC after a function return
- Functions may be called from many different addresses, so the buffer is organized as a stack
- The buffer then imitates the call stack
Copyright © 2012, Elsevier Inc. All rights reserved.
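A minimal return-address-stack sketch in Python (an illustration only; the depth, the overflow policy, and the example addresses are assumptions): calls push the fall-through PC, and returns pop the prediction.

class ReturnAddressStack:
    def __init__(self, depth=16):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        # Push the address of the instruction after the call.
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # oldest entry is lost when the stack overflows
        self.stack.append(return_pc)

    def predict_return(self):
        # Pop the predicted return target; fall back to None if empty.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1004)                  # call from the outer function
ras.on_call(0x2008)                  # nested call
print(hex(ras.predict_return()))     # 0x2008: matches the call nesting
print(hex(ras.predict_return()))     # 0x1004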

39 Copyright © 2012, Elsevier Inc. All rights reserved. Multiple Issue and Static Scheduling
To achieve CPI < 1, we need to issue multiple instructions per clock cycle
Solutions:
- Statically scheduled superscalar processors (in-order execution)
- VLIW (Very Long Instruction Word) processors (in-order execution)
- Dynamically scheduled superscalar processors (out-of-order execution)

40 Copyright © 2012, Elsevier Inc. All rights reserved. Multiple Issue

41 Copyright © 2012, Elsevier Inc. All rights reserved. Multiple Issue, Static Scheduling
Modern energy-efficient microarchitectures: static scheduling + multiple issue
- The compiler performs the scheduling
- The processor issues instructions in order
- Up to a fixed number of instructions can be issued simultaneously if there are no interdependencies
- At least one instruction is always issued
- Issue logic is relatively simple: it just needs to detect interdependencies (see the sketch below)
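A toy version of that interdependence check for a dual-issue machine (a sketch under simplified assumptions, not any real processor's issue logic): the younger instruction may issue in the same cycle only if it shares no conflicting registers with the older one.

def issue_pair(i1, i2):
    independent = (i2["src1"] != i1["dest"] and i2["src2"] != i1["dest"]   # no RAW
                   and i2["dest"] != i1["dest"]                            # no WAW
                   and i2["dest"] not in (i1["src1"], i1["src2"]))         # conservative WAR check
    return 2 if independent else 1     # always issue at least one instruction

a = dict(op="ADD", dest="R1", src1="R2", src2="R3")
b = dict(op="SUB", dest="R4", src1="R1", src2="R5")   # reads R1, so it must wait
c = dict(op="MUL", dest="R6", src1="R7", src2="R8")
print(issue_pair(a, b), issue_pair(a, c))   # prints: 1 2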

42 Copyright © 2012, Elsevier Inc. All rights reserved. Multiple Issue, Dynamic Scheduling
Modern high-performance microarchitectures: dynamic scheduling + multiple issue + speculation
Two approaches:
- Assign reservation stations and update the pipeline control tables in half clock cycles
  - Only supports 2 instructions per clock
- Design the logic to handle any possible dependencies between the instructions issued together
Hybrid approaches combining the two are also used

43 Copyright © 2012, Elsevier Inc. All rights reserved. Multiple Issue, Dynamic Scheduling
Limit the number of instructions of a given class that can be issued in a “bundle”
- e.g., one FP, one integer, one load, one store
- Usually one per reservation station type
Examine all the interdependencies among the instructions in the bundle
- If interdependencies exist within the bundle, encode them in the reservation stations
Issue logic is a major bottleneck: the number of dependence checks grows quadratically with the number of instructions issued per clock (see the sketch below)
Also need multiple completion/commit paths
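A rough illustration of why wide issue is expensive (a back-of-the-envelope sketch, not a real hardware cost model): every pair of instructions in the bundle must be cross-checked for dependences, so the number of comparisons grows roughly as n*(n-1)/2.

def pairwise_checks(issue_width):
    # Number of instruction pairs that must be cross-checked each cycle.
    return issue_width * (issue_width - 1) // 2

for width in (2, 4, 8, 16):
    print(width, "wide issue ->", pairwise_checks(width), "dependence comparisons per cycle")
# 2 -> 1, 4 -> 6, 8 -> 28, 16 -> 120: quadratic growth in the issue logic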

44 Copyright © 2012, Elsevier Inc. All rights reserved. Dynamic Scheduling, Multiple Issue, and Speculation Multiple Issue, Dynamic Scheduling

45 Copyright © 2012, Elsevier Inc. All rights reserved. Example
  Loop: LD     R2,0(R1)    ; R2 = array element
        DADDIU R2,R2,#1    ; increment R2
        SD     R2,0(R1)    ; store result
        DADDIU R1,R1,#8    ; increment pointer
        BNE    R2,R3,Loop  ; branch if not last element

46 Copyright © 2012, Elsevier Inc. All rights reserved. Dynamic Scheduling, Multiple Issue, and Speculation Example (No Speculation)

47 Copyright © 2012, Elsevier Inc. All rights reserved. Dynamic Scheduling, Multiple Issue, and Speculation Example

48 Limits of ILP
Get an idea of the limits of ILP by finding the available parallelism in the SPEC benchmarks
Assume an ideal architecture:
- Infinite register renaming: no register pressure
- Perfect branch prediction: the branch outcome is always known immediately
- Perfect caches: no cache misses of any kind
- Infinite window size: the issue logic can examine the entire program
Copyright © 2012, Elsevier Inc. All rights reserved.

49 Limits of ILP Copyright © 2012, Elsevier Inc. All rights reserved.

50 Limits of ILP
Now assume a more realistic architecture:
- Up to 64 issues per clock cycle (still more than 10x current processors)
- Tournament branch predictor with 1K entries (reasonable)
- Register renaming with 64 integer and 64 FP registers in reservation stations and 128 reorder-buffer entries (reasonable)
- Vary the window size
Copyright © 2012, Elsevier Inc. All rights reserved.

51 Limits of ILP Copyright © 2012, Elsevier Inc. All rights reserved.

52 A8 Processor
Dual issue, statically scheduled
- In-order issue, in-order execution
- Up to two instructions per cycle
One core, no FP
Two-level cache hierarchy
1 GHz clock rate
2 W power design
ARM ISA (RISC)
Copyright © 2012, Elsevier Inc. All rights reserved.

53 A8 Pipeline Copyright © 2012, Elsevier Inc. All rights reserved.

54 A8 Pipeline Decode Copyright © 2012, Elsevier Inc. All rights reserved.

55 A8 Pipeline Execute Copyright © 2012, Elsevier Inc. All rights reserved.

56 A8 CPI Copyright © 2012, Elsevier Inc. All rights reserved.

57 A9 Vs A8 Copyright © 2012, Elsevier Inc. All rights reserved.

58 i7 920 Processor
Multiple issue, dynamically scheduled
- In-order issue, out-of-order execution
- Up to four instructions per cycle (plus fusion)
Four cores, each with FP
Three-level cache hierarchy
2.66 GHz clock rate
130 W power design
x86-64 ISA (CISC)
- 1-17 byte instructions are decoded into RISC microinstructions
Copyright © 2012, Elsevier Inc. All rights reserved.

59 i7 Pipeline Copyright © 2012, Elsevier Inc. All rights reserved.

60 i7 CPI Copyright © 2012, Elsevier Inc. All rights reserved.

61 Atom 230 Processor
Dual issue, statically scheduled
- In-order issue, in-order execution
- Up to two instructions per cycle
One core (a dual-core version is available), with FP
Two-level cache hierarchy
1.66 GHz clock rate
4 W power design
x86-64 ISA (CISC)
- Decodes to RISC microinstructions
Copyright © 2012, Elsevier Inc. All rights reserved.

62 i7 Vs Atom Copyright © 2012, Elsevier Inc. All rights reserved.

63 Fallacies
- Processors with lower CPIs will always be faster
- Processors with higher clock rates will always be faster
Copyright © 2012, Elsevier Inc. All rights reserved.
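Both claims fall to the basic performance equation: execution time = instruction count x CPI / clock rate, so no single factor decides the winner. A back-of-the-envelope comparison in Python, using made-up numbers purely for illustration:

# Execution time = instruction count * CPI / clock rate (hypothetical numbers).
def exec_time(instr_count, cpi, clock_hz):
    return instr_count * cpi / clock_hz

program = 1e9                                     # one billion instructions
a = exec_time(program, cpi=0.8, clock_hz=1.0e9)   # lowest CPI:    0.80 s
b = exec_time(program, cpi=1.5, clock_hz=2.5e9)   # middle ground: 0.60 s
c = exec_time(program, cpi=4.0, clock_hz=3.0e9)   # fastest clock: about 1.33 s
print(a, b, c)   # neither the lowest CPI (a) nor the highest clock rate (c) wins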