
1 Chapter 3 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2011 Computer Applications Textbook slides: Computer Architecture: A Quantitative Approach, 4th Edition, John L. Hennessy & David A. Patterson, with modifications.

2 Dr. Amr Talaat Elect 707 Recall from Pipelining Review
Pipelined CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
- Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
- Structural hazards: HW cannot support this combination of instructions
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
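The CPI relation above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the stall rates (average stall cycles per instruction) are hypothetical values chosen only for illustration, not measurements from the course or the textbook.

```python
def pipelined_cpi(ideal_cpi, structural, data, control):
    """Pipelined CPI = ideal CPI + average stall cycles per instruction
    contributed by each hazard class."""
    return ideal_cpi + structural + data + control


# Hypothetical stall rates, in stall cycles per instruction (assumed values).
cpi = pipelined_cpi(ideal_cpi=1.0, structural=0.05, data=0.25, control=0.20)

# How much faster the same pipeline would be if every stall were removed.
slowdown_from_stalls = cpi / 1.0

print(f"pipelined CPI = {cpi:.2f}")
print(f"slowdown relative to the ideal pipeline = {slowdown_from_stalls:.2f}x")
```

Every technique in this chapter attacks one of the three stall terms; dynamic scheduling mainly reduces the data hazard term.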

3 Dr. Amr Talaat Elect 707 Reduction of Pipeline Hazards Techniques Today

4 Dr. Amr Talaat Elect 707 Instruction Level Parallelism
- Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance.
- 2 main approaches to exploit ILP:
  1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4)
  2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)

5 Dr. Amr Talaat Elect 707 Floating-Point Pipeline

6 Dr. Amr Talaat Elect 707 Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases where dependences are unknown at compile time
- It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler

7 Dr. Amr Talaat Elect 707 HW Schemes: Instruction Parallelism
- Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We distinguish when an instruction begins execution from when it completes execution; between the 2 times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards
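To see why SUBD may proceed while ADDD must wait for DIVD, the register dependences in the three-instruction sequence can be classified mechanically. The sketch below is illustrative only: the `Instr` tuple and the `hazards` helper are hypothetical names, and `war_variant` is an invented instruction added to show a WAR case that the slide's sequence does not contain.

```python
from collections import namedtuple

# Hypothetical minimal instruction record: opcode, destination, two sources.
Instr = namedtuple("Instr", "op dest src1 src2")


def hazards(earlier, later):
    """Return the dependence types that `later` has on `earlier` (program order)."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads what earlier writes (true dependence)
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites a register that earlier still reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # {'RAW'}: ADDD needs F0 from the long-running DIVD
print(hazards(divd, subd))  # set(): SUBD is independent of DIVD
print(hazards(addd, subd))  # set(): SUBD is independent of ADDD

# Invented variant: a SUBD that writes F8, which ADDD reads.
# If it ran ahead of ADDD without renaming, ADDD could read the wrong F8.
war_variant = Instr("SUBD", "F8", "F12", "F14")
print(hazards(addd, war_variant))  # {'WAR'}
```

The last case is the kind of hazard that out-of-order execution introduces and that register renaming later removes.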

8 Dr. Amr Talaat Elect 707 Dynamic Scheduling Step 1
- Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

9 Dr. Amr Talaat Elect 707 A Dynamic Algorithm: Tomasulo’s
- For IBM 360/91 (before caches!)
  - Long memory latency
- Goal: high performance without special compilers
- Small number of floating-point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer?
  - The descendants of this are used in: Alpha 21264, Pentium 4, AMD Opteron, Power 5, …

10 Dr. Amr Talaat Elect 707 Tomasulo Organization
[Figure: block diagram with the FP Op Queue, FP Registers, Load Buffers (Load1 to Load6, from memory), Store Buffers (to memory), and Reservation Stations for the FP adders (Add1, Add2, Add3) and FP multipliers (Mult1, Mult2)]

11 Dr. Amr Talaat Elect 707 Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of the source operands
Qj, Qk: Reservation stations producing the source registers (value to be written)
- Note: Qj,Qk=0 => ready
- Store buffers only have Qi for the RS producing the result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
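The fields listed above map naturally onto a small record type. The following is a sketch under stated assumptions rather than the 360/91's actual encoding: the class and variable names are hypothetical, Python's `None` stands in for "Qj, Qk = 0", and the operand values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False           # Busy: station holds an instruction
    op: Optional[str] = None     # Op: operation to perform
    vj: Optional[float] = None   # Vj, Vk: operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # Qj, Qk: name of the producing station;
    qk: Optional[str] = None     # None plays the role of "Q = 0" (ready)

    def ready(self):
        """Both operands are present, so execution may begin."""
        return self.busy and self.qj is None and self.qk is None


# Register result status: which station will write each register.
# A register with no entry has no pending writer (the "blank" case).
reg_status = {"F0": "Mult1"}

# ADDD F10,F0,F8 issued while Mult1 is still producing F0 (F8 = 2.5 is assumed).
add1 = ReservationStation("Add1", busy=True, op="ADDD",
                          qj=reg_status.get("F0"), vk=2.5)
waiting = add1.ready()       # False: Qj still names Mult1

# Mult1's result arrives (value assumed): capture it and clear the tag.
add1.vj, add1.qj = 6.0, None
now_ready = add1.ready()     # True: both Q fields are clear

print(waiting, now_ready)
```

Note that the station stores either a value (V) or a tag (Q) for each operand, never a register name; that substitution is the renaming.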

12 Dr. Amr Talaat Elect 707 Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execute: operate on operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.
- Normal data bus: data + destination (“go to” bus)
- Common data bus: data + source
- Example speed: 3 clocks for floating-point +, -; 10 for *; 40 clocks for /
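The three stages can be exercised with a small timing-only simulator. This is a sketch under several assumptions, not the lecture's own model: it tracks tags but not operand values, ignores address registers for loads, and uses a single CDB. The instruction stream is the classic six-instruction sequence from the Hennessy & Patterson textbook, which is assumed (not confirmed by this transcript) to be the one shown in the example slides. The latencies of 2, 2, 10 and 40 clocks are likewise assumed and differ from the 3-clock add quoted above.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}  # assumed
UNIT = {"LD": "Load", "ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Mult"}
STATIONS = {"Load": 3, "Add": 3, "Mult": 2}  # stations per unit type


@dataclass
class RS:
    name: str
    kind: str
    busy: bool = False
    qj: Optional[str] = None  # tag of the station producing operand j
    qk: Optional[str] = None
    remaining: int = 0        # execution cycles still needed
    instr: int = -1           # index of the instruction held


def simulate(program):
    stations = [RS(f"{k}{i + 1}", k) for k, n in STATIONS.items() for i in range(n)]
    reg_status = {}                     # register -> station that will write it
    timing = [dict() for _ in program]  # per-instruction issue/exec_done/write
    pc = cycle = 0
    while pc < len(program) or any(rs.busy for rs in stations):
        cycle += 1
        # Snapshots are taken before this cycle's broadcast, so a result
        # written in cycle t is first used by an execution in cycle t + 1.
        executing = [rs for rs in stations if rs.busy and rs.remaining > 0
                     and rs.qj is None and rs.qk is None]
        finished = [rs for rs in stations if rs.busy and rs.remaining == 0]
        # Write result: one broadcast per cycle on the single CDB, oldest first.
        if finished:
            rs = min(finished, key=lambda r: r.instr)
            for other in stations:
                if other.qj == rs.name:
                    other.qj = None
                if other.qk == rs.name:
                    other.qk = None
            dest = program[rs.instr][1]
            if reg_status.get(dest) == rs.name:  # skip if a later writer renamed it
                del reg_status[dest]
            timing[rs.instr]["write"] = cycle
            rs.busy = False
        # Execute: count down every station whose operands were ready.
        for rs in executing:
            rs.remaining -= 1
            if rs.remaining == 0:
                timing[rs.instr]["exec_done"] = cycle
        # Issue: in order, only if a station of the right type is free.
        if pc < len(program):
            op, dest, *srcs = program[pc]
            free = next((rs for rs in stations
                         if rs.kind == UNIT[op] and not rs.busy), None)
            if free is not None:
                tags = [reg_status.get(s) for s in srcs] + [None, None]
                free.busy, free.qj, free.qk = True, tags[0], tags[1]
                free.remaining, free.instr = LATENCY[op], pc
                reg_status[dest] = free.name  # rename the destination
                timing[pc]["issue"] = cycle
                pc += 1
    return timing


PROGRAM = [("LD", "F6"), ("LD", "F2"), ("MULTD", "F0", "F2", "F4"),
           ("SUBD", "F8", "F6", "F2"), ("DIVD", "F10", "F0", "F6"),
           ("ADDD", "F6", "F8", "F2")]

for instr, t in zip(PROGRAM, simulate(PROGRAM)):
    print(instr, t)
```

Under these assumptions the instructions issue in cycles 1 to 6, ADDD writes its result in cycle 11 while DIVD is still waiting, and DIVD writes last in cycle 57: in-order issue with out-of-order completion.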

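13 Dr. Amr Talaat Elect 707 Tomasulo Example
[Figure: annotated status tables showing the instruction stream, the clock cycle counter, the FU count down, 3 load buffers, 3 FP adder reservation stations, and 2 FP multiplier reservation stations]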

14 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 1

15 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 2 Note: Can have multiple loads outstanding

16 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 3 Note: register names are removed (“renamed”) in the reservation stations; MULT issued. Load1 completing; what is waiting for Load1?

17 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 4 Load2 completing; what is waiting for Load2?

18 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 5 Timer starts down for Add1, Mult1

19 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 6 Issue ADDD here despite name dependency on F6?

20 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 7 Add1 (SUBD) completing; what is waiting for it?

21 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 8

22 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 9

23 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 10 Add2 (ADDD) completing; what is waiting for it?

24 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 11 Write result of ADDD here? All quick instructions complete in this cycle!

25 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 12

26 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 13

27 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 14

28 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 15 Mult1 (MULTD) completing; what is waiting for it?

29 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 16 Just waiting for Mult2 (DIVD) to complete

30 Dr. Amr Talaat Elect 707 (skip a couple of cycles)

31 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 55

32 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 56 Mult2 (DIVD) is completing; what is waiting for it?

33 Dr. Amr Talaat Elect 707 Tomasulo Example Cycle 57 Once again: in-order issue, out-of-order execution and out-of-order completion.

34 Dr. Amr Talaat Elect 707 Why can Tomasulo overlap iterations of loops?
- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall
- Other perspective: Tomasulo builds the data flow dependency graph on the fly

35 Dr. Amr Talaat Elect 707 Tomasulo Loop Example
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
- Multiply takes 4 clocks
- Loads have cache misses
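Renaming is what lets two iterations of this loop be in flight at once. The sketch below is a simplification with hypothetical names: it renames every destination to a fresh tag `T1`, `T2`, ... (standing in for reservation station or load buffer names), leaves out the store and the branch because they write no register, and applies renaming to the integer register R1 as well, which goes beyond the FP-only scope of the original machine.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh tag; sources read the latest tag."""
    latest = {}        # architectural register -> most recent tag
    fresh = count(1)
    renamed = []
    for op, dest, *srcs in program:
        # Look up sources before redefining the destination, so an
        # instruction such as SUBI R1,R1 reads the previous R1.
        new_srcs = [latest.get(s, s) for s in srcs]
        tag = f"T{next(fresh)}"
        latest[dest] = tag
        renamed.append((op, tag, *new_srcs))
    return renamed


# Register-writing part of the loop body (immediates and offsets omitted).
body = [("LD", "F0", "R1"), ("MULTD", "F4", "F0", "F2"), ("SUBI", "R1", "R1")]

# Two back-to-back iterations, as the hardware sees them after the branch.
renamed = rename(body + body)
for instr in renamed:
    print(instr)
```

After renaming, the second iteration's load writes `T4` rather than reusing `T1`, so it no longer conflicts with the first iteration's multiply that is still reading `T1`. Only the true dependences (load to multiply, SUBI to the next load) remain.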

36 Dr. Amr Talaat Elect 707 Loop Example Cycle 0

37 Dr. Amr Talaat Elect 707 Loop Example Cycle 1

38 Dr. Amr Talaat Elect 707 Loop Example Cycle 2

39 Dr. Amr Talaat Elect 707 Loop Example Cycle 3

40 Dr. Amr Talaat Elect 707 Loop Example Cycle 4

41 Dr. Amr Talaat Elect 707 Loop Example Cycle 5

42 Dr. Amr Talaat Elect 707 Loop Example Cycle 6

43 Dr. Amr Talaat Elect 707 Loop Example Cycle 7

44 Dr. Amr Talaat Elect 707 Loop Example Cycle 8

45 Dr. Amr Talaat Elect 707 Loop Example Cycle 9

46 Dr. Amr Talaat Elect 707 Loop Example Cycle 10

47 Dr. Amr Talaat Elect 707 Loop Example Cycle 11

48 Dr. Amr Talaat Elect 707 Loop Example Cycle 12 Structural hazard: no MULT unit available

49 Dr. Amr Talaat Elect 707 Loop Example Cycle 13

50 Dr. Amr Talaat Elect 707 Loop Example Cycle 14

51 Dr. Amr Talaat Elect 707 Loop Example Cycle 15

52 Dr. Amr Talaat Elect 707 Loop Example Cycle 16

53 Dr. Amr Talaat Elect 707 Loop Example Cycle 17

54 Dr. Amr Talaat Elect 707 Loop Example Cycle 18

55 Dr. Amr Talaat Elect 707 Loop Example Cycle 19 …

56 Dr. Amr Talaat Elect 707 Loop Example Cycle 20

57 Dr. Amr Talaat Elect 707 Loop Example Cycle 21

58 Dr. Amr Talaat Elect 707 Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic
  - Distributed reservation stations and the CDB
  - If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards
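The first advantage, simultaneous release by broadcast, can be shown in a few lines. This is a minimal sketch with hypothetical names and made-up operand values: each waiting station is a plain dictionary, and `broadcast` models one (tag, value) pair appearing on the CDB.

```python
def broadcast(tag, value, stations):
    """Deliver one CDB result to every station waiting on `tag`.

    Returns the names of the stations that became ready in this broadcast.
    """
    released = []
    for rs in stations:
        matched = False
        for q, v in (("qj", "vj"), ("qk", "vk")):
            if rs[q] == tag:          # associative match on the source tag
                rs[v], rs[q] = value, None
                matched = True
        if matched and rs["qj"] is None and rs["qk"] is None:
            released.append(rs["name"])
    return released


# Three stations wait on Mult1; the first two already hold their other operand.
stations = [
    {"name": "Add1", "qj": "Mult1", "vj": None, "qk": None, "vk": 1.0},
    {"name": "Add2", "qj": None, "vj": 2.0, "qk": "Mult1", "vk": None},
    {"name": "Mult2", "qj": "Mult1", "vj": None, "qk": "Add1", "vk": None},
]

released = broadcast("Mult1", 6.0, stations)
print(released)  # ['Add1', 'Add2']: both released by the same broadcast
```

Mult2 also captures the value but stays blocked on Add1. With a centralized register file, Add1 and Add2 would instead each have to wait for a register read port after the write.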

59 Dr. Amr Talaat Elect 707 Tomasulo Drawbacks
- Complexity
  - Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units => high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs => more FU logic for parallel associative stores
- Non-precise interrupts!
  - We will address this later

60 Dr. Amr Talaat Elect 707 Sections 3.4 & 3.5 (5th Edition)

