Advanced Pipelining: Optimally Scheduling Code; Optimally Programming Code; Scheduling for Superscalars (6.9); Exceptions (5.6, 6.8)


1 Advanced Pipelining
  Optimally Scheduling Code
  Optimally Programming Code
  Scheduling for Superscalars (6.9)
  Exceptions (5.6, 6.8)

2 Optimally schedule code for(i=0;i...

3 1. Identify Dependencies
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  $t0: lw -> addi (RAW)
  $t0: addi -> sw (RAW)
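The dependency check on this slide can be sketched in C. This is a minimal illustration, not part of the original deck: the `Instr` record, the `classify_deps` function, and the convention that register number -1 means "unused" are all assumptions made for the example. Register numbers follow the usual MIPS numbering ($t0 = 8, $s1 = 17).

```c
#include <assert.h>

/* Hypothetical instruction record: one destination and up to two sources.
   A register number of -1 means "no register in this slot". */
typedef struct { int dest, src1, src2; } Instr;

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Returns a bitmask of the dependencies from `first` to a later `second`. */
int classify_deps(Instr first, Instr second) {
    int deps = 0;
    assert(first.dest >= -1 && second.dest >= -1);
    if (first.dest >= 0 &&
        (second.src1 == first.dest || second.src2 == first.dest))
        deps |= DEP_RAW;   /* second reads what first wrote (true dependency) */
    if (second.dest >= 0 &&
        (first.src1 == second.dest || first.src2 == second.dest))
        deps |= DEP_WAR;   /* second overwrites what first read (false) */
    if (first.dest >= 0 && first.dest == second.dest)
        deps |= DEP_WAW;   /* both write the same register (false) */
    return deps;
}
```

With this sketch, `lw $t0` followed by `addi $t0, $t0, 10` reports RAW (and also WAW, since both write $t0), while `sw $t0, 0($s1)` followed by `addi $s1, $s1, 4` reports only WAR, which is the false dependency the later slides remove.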

4 2. Draw timing diagram WITH DATA FORWARDING
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  [Timing diagram: stages F D X M W]

5 3. Remove WAR/WAW dependencies
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  Dependency types: RAW, WAR, WAW
  [Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne]
  Target the false dependencies

6 3. Remove WAR/WAW dependencies
  Original:
    lw   $t0, 0($s1)
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
  Incorrect:
    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    sw   $t0, 0($s1)
  Correct:
    lw   $t0, 0($s1)
    addi
    sw

7 Reordered:
    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    addi $t0, $t0, 10
    sw   $t0, ____($s1)
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop
  Original:
    lw   $t0, 0($s1)
    addi $t0, $t0, 10
    sw   $t0, 0($s1)
    addi $s1, $s1, 4
    slt  $t1, $s1, $s2
    bne  $t1, $0, loop

8 3. Remove WAR/WAW dependencies
    lw   $t0, 0($s1)
    addi $s1, $s1, 4
    addi $t0, $t0, 10
    slt  $t1, $s1, $s2
    sw   $t0, -4($s1)
    bne  $t1, $0, loop
  [Timing diagram: stages F D X M W for lw, addi, sw, addi, slt, bne]
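To see why the store offset becomes -4 once the pointer increment moves above the store, here is a C rendering of both orderings. It is a sketch under assumptions: the function names are invented for illustration, the array is assumed to hold at least one element (the assembly loop is do-while shaped), and `$s1`/`$s2` are modeled as the pointers `p` and `end`.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: load, add, store, then advance the pointer. */
void add10_original(int *a, size_t n) {
    int *p = a, *end = a + n;
    assert(a != NULL && n > 0);
    do {
        int t = *p;            /* lw   $t0, 0($s1)   */
        t += 10;               /* addi $t0, $t0, 10  */
        *p = t;                /* sw   $t0, 0($s1)   */
        p++;                   /* addi $s1, $s1, 4   */
    } while (p < end);         /* slt + bne          */
}

/* Reordered: the pointer advances before the store, so the store must
   reach back one element to hit the same address. */
void add10_reordered(int *a, size_t n) {
    int *p = a, *end = a + n;
    assert(a != NULL && n > 0);
    for (;;) {
        int t = *p;            /* lw   $t0, 0($s1)          */
        p++;                   /* addi $s1, $s1, 4 (moved)  */
        t += 10;               /* addi $t0, $t0, 10         */
        int more = p < end;    /* slt  $t1, $s1, $s2        */
        p[-1] = t;             /* sw   $t0, -4($s1)         */
        if (!more) break;      /* bne  $t1, $0, loop        */
    }
}
```

Both functions leave the array in the same state; only the instruction order and the store offset differ.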

9 Software Control Hazard Removal
    if ((x % 2) == 1) isodd = 1;
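One way to replace this branch with a calculation is sketched below. It assumes `x` is unsigned and that `isodd` starts at 0 (neither is stated on the slide); the function names are made up for the example. An optimizing compiler may well produce the branch-free form on its own.

```c
#include <assert.h>

/* Branching form from the slide, assuming isodd starts at 0. */
unsigned isodd_branch(unsigned x) {
    unsigned isodd = 0;
    if ((x % 2) == 1) isodd = 1;
    return isodd;
}

/* Branch-free form: the low bit of an unsigned value is its parity. */
unsigned isodd_calc(unsigned x) {
    return x & 1u;
}

/* Helper that checks the two forms agree for a given input. */
void check_isodd(unsigned x) {
    assert(isodd_branch(x) == isodd_calc(x));
}
```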

10 Software Control Hazard Removal
    if (x == true) y = false; else y = true;
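The if/else here computes a logical negation, so it can be written as a single expression. A small sketch, assuming `x` and `y` are C `bool` values (the slide does not give their types) and using invented function names:

```c
#include <assert.h>
#include <stdbool.h>

/* Branching form from the slide. */
bool invert_branch(bool x) {
    bool y;
    if (x == true) y = false; else y = true;
    return y;
}

/* Calculated form: no control flow in the source. */
bool invert_calc(bool x) {
    return !x;
}

/* Helper that checks the two forms agree for a given input. */
void check_invert(bool x) {
    assert(invert_branch(x) == invert_calc(x));
}
```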

11 Software Control Hazard Removal
    if ((x == MON) || (x == TUE) || (x == WED)) { }
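If the day constants are consecutive integers, the three comparisons can collapse into one range check. The enum encoding below is an assumption (the slide does not define MON, TUE, WED), as are the function names; this is one possible rewrite, not necessarily the one the lecture intended.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed encoding: consecutive values, SUN = 0 through SAT = 6. */
enum Day { SUN, MON, TUE, WED, THU, FRI, SAT };

/* Form from the slide: up to three short-circuit branches. */
bool early_week_branchy(enum Day x) {
    return (x == MON) || (x == TUE) || (x == WED);
}

/* Single comparison: unsigned subtraction wraps around for x < MON,
   so one test covers both the lower and the upper bound. */
bool early_week_range(enum Day x) {
    return (unsigned)x - (unsigned)MON <= (unsigned)WED - (unsigned)MON;
}

/* Helper that checks the two forms agree for a given day. */
void check_early_week(enum Day x) {
    assert(early_week_branchy(x) == early_week_range(x));
}
```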

12 Increasing Branch Performance
    if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) { }

13 What does it all mean? Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

14 The moral is….. Calculation is less expensive than …..

15 Superscalars - Parallelism
  Ford mass produces cars. We want to “mass produce” instructions.
  Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.
  Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

16 “Superpipelining” (deep pipelining – many stages) Limiting returns because…. Register delays are __________________________ of clock Difficult to __________________

17 SuperScalars __________ parts of pipeline Multiple instructions in _______ stage at once

18 SuperScalars Which instructions can execute in parallel? Fetching multiple instructions per cycle

19 Static Scheduling – VLIW or EPIC (Itanium) __________ schedules the instructions If one instruction stalls, all following instructions stall Book Example: SuperScalar MIPS: Two instructions / cycle one alu/branch, one ld/st each cycle

20 Schedule for SS MIPS
  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $zero, Loop
  PC    ALU/branch    ld/st
  0
  8
  16
  24
  32

21 SuperScalars - Static
  [Pipeline diagram: stages Fetch, Decode (Read Values), Execute, Memory, WriteBack (Write Values); instructions lw, addi, addu, sw, bne]

22 Loop Problem
  Problem:
  – Too many _______________ in loop
  – Not enough ______________ to fill in holes
  Solution:
  – Do ______________ at once
  – More instructions
  – Only one branch

23 Loop Unrolling 1. Unroll Loop
  Unrolled:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop
  Original:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop

24 Loop Unrolling 2. Rename Registers
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t1, 0($s1)
        addi $s1, $s1, -4
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
  But wait!!! How has this helped? There are tons of dependencies! Whatever are we to do? Register Renaming!!!

25 Loop Unrolling 2. Rename Registers (Repeated slide for your reference)
  Renamed:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t1, 0($s1)
        addi $s1, $s1, -4
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
  Before renaming:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, 4($s1)
        bne  $s1, $zero, Loop

26 Loop Unrolling 3. Reduce Instructions
  Reduced:
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -8
        addu $t0, $t0, $s2
        sw   $t0, 8($s1)
        lw   $t1, 4($s1)
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
  Before (offsets left blank):
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4
        addu $t0, $t0, $s2
        sw   $t0, ___($s1)
        lw   $t1, ___($s1)
        addu $t1, $t1, $s2
        sw   $t1, 4($s1)
        bne  $s1, $zero, Loop
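The effect of steps 1 to 3 can be shown at the C level. This is a rough sketch rather than a translation of the assembly: it uses an index instead of a byte pointer, the function names are invented, and the unrolled version assumes the element count is a positive even number.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element and one branch per iteration,
   walking from the top of the array down. */
void add_scalar(int *a, size_t n, int s) {
    size_t i = n;
    assert(a != NULL && n > 0);
    do {
        i--;
        a[i] += s;
    } while (i != 0);
}

/* Unrolled by two, with renamed temporaries (t0, t1) and a single
   index update per iteration. Assumes n is even. */
void add_scalar_unrolled2(int *a, size_t n, int s) {
    size_t i = n;
    assert(a != NULL && n > 0 && n % 2 == 0);
    do {
        i -= 2;                 /* one update serves both copies      */
        int t0 = a[i + 1];      /* first copy uses t0                 */
        int t1 = a[i];          /* second copy renamed to t1          */
        t0 += s;
        t1 += s;                /* independent of t0, can overlap     */
        a[i + 1] = t0;
        a[i] = t1;
    } while (i != 0);           /* one branch per two elements        */
}
```

Because `t0` and `t1` are distinct, the two copies have no false dependencies between them, which is what lets a scheduler interleave them.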

27 Loop Unrolling 4. Schedule
  Loop: lw1   $t0, 0($s1)
        addi  $s1, $s1, -8
        addu1 $t0, $t0, $s2
        sw1   $t0, 8($s1)
        lw2   $t1, 4($s1)
        addu2 $t1, $t1, $s2
        sw2   $t1, 4($s1)
        bne   $s1, $zero, Loop
  ALU/branch    lw/sw
                lw1

28 Performance Comparison: Original / Unrolled
  Original:
    ALU/branch               ld/st
                             lw   $t0, 0($s1)
    addi $s1, $s1, -4
    addu $t0, $t0, $s2
    bne  $s1, $zero, L       sw   $t0, 4($s1)

29 Static Scheduling Summary Code size ______________ (because of nops) It cannot resolve __________ dependencies If one instruction stalls, ___________________

30 Dynamic Scheduling _________ schedules ready instructions Only ___________ instructions stall _______________ resolved in hardware

31 4-wide Dynamic Superscalar: Fetch
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Fetch 4 instructions each cycle

32 4-wide Dynamic Superscalar: Decode
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Register Alias Table records
  1. Current Register Number (WAW/WAR – Register Renaming) or

33 4-wide Dynamic Superscalar: Decode
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Register Alias Table records
  1. Current Register Number (WAW/WAR – Register Renaming) or
  2. Functional Unit (RAW – result not ready)
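A register alias table of the kind described on this slide can be modeled very simply. The sketch below is illustrative only: the `AliasTable` type, the function names, and the integer tags standing in for functional units (such as ldst1 or 1add1) are assumptions, and real hardware tracks considerably more state.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32
#define IN_REGFILE (-1)   /* value is ready in the register file */

/* For each architectural register: IN_REGFILE, or the tag of the
   functional unit that will produce its newest value. */
typedef struct { int producer[NUM_ARCH_REGS]; } AliasTable;

void rat_init(AliasTable *rat) {
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rat->producer[i] = IN_REGFILE;
}

/* Source lookup at decode: IN_REGFILE means read the register file;
   any other value is the unit to wait on (RAW, result not ready). */
int rat_lookup(const AliasTable *rat, int reg) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    return rat->producer[reg];
}

/* Destination update at decode: later readers see only the newest
   producer, which is what removes WAW and WAR hazards. */
void rat_set_producer(AliasTable *rat, int reg, int tag) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS && tag >= 0);
    rat->producer[reg] = tag;
}

/* When a unit finishes, clear the mapping only if that unit is still
   the newest producer of the register. */
void rat_complete(AliasTable *rat, int reg, int tag) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    if (rat->producer[reg] == tag)
        rat->producer[reg] = IN_REGFILE;
}
```

In the slide's example, decoding `lw r2, 0(r1)` would set the producer of r2 to the load unit's tag, so the following `addu r2, r2, r5` reads its first source as ldst1 rather than r2.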

34 4-wide Dynamic Superscalar: Execute
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Wait until your inputs are ready

35 4-wide Dynamic Superscalar: Execute
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Execute once they are ready

36 4-wide Dynamic Superscalar: Memory
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  First calculate the address

37 4-wide Dynamic Superscalar: Memory
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Ld/St Queue checks memory addresses – out of order lw/sw
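The address check a load/store queue performs can be sketched as follows. This shows one conservative policy, chosen here for illustration and not taken from the slides: a load may run ahead of older stores only when every older store has a known address that differs from the load's. The `MemOp` type and `load_may_issue` function are hypothetical, and all accesses are assumed to be aligned words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry; entries are kept in program order. */
typedef struct {
    bool is_store;
    bool addr_known;   /* has the address been calculated yet? */
    uint32_t addr;
} MemOp;

/* queue[0 .. load_idx-1] are older than the load at queue[load_idx].
   Returns true if the load can safely access memory now. */
bool load_may_issue(const MemOp *queue, size_t load_idx) {
    assert(!queue[load_idx].is_store && queue[load_idx].addr_known);
    for (size_t i = 0; i < load_idx; i++) {
        if (!queue[i].is_store) continue;        /* loads never conflict  */
        if (!queue[i].addr_known) return false;  /* might alias: wait     */
        if (queue[i].addr == queue[load_idx].addr)
            return false;                        /* same word: must order */
    }
    return true;
}
```

Real designs often go further, for example by forwarding the store's data to a matching load or by speculating and recovering, but those options are beyond what the slide states.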

38 4-wide Dynamic Superscalar: Commit
  [Diagram: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
  Key: waiting for value / reading value
  Loop: lw r2, 0(r1); addu r2, r2, r5; sw r2, 0(r1); addi r1, r1, -4; bne r1, r7, Loop
  Instructions shown in the diagram: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop
  Instructions wait until all previous instructions have completed
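In-order commit can be modeled as draining a buffer from its oldest entry and stopping at the first instruction that has not finished. A minimal sketch, with an invented `commit_ready` function and a plain array standing in for the commit buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* done[i] is true once instruction i (in program order) has finished
   executing. Entries head..tail-1 are in flight. Commits at most
   `width` instructions and returns the new head index. */
size_t commit_ready(const bool *done, size_t head, size_t tail, size_t width) {
    size_t committed = 0;
    assert(done != NULL && head <= tail);
    while (head < tail && committed < width && done[head]) {
        head++;        /* oldest instruction is done: retire it */
        committed++;
    }
    return head;       /* first instruction that must still wait */
}
```

An instruction that finished early (out of order) stays in the buffer until every older instruction ahead of it has committed.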

39 Fallacies & Pitfalls Pipelining is easy –______________ is difficult Instruction set has no impact on pipelining –Complicated _____________ & _____________________ instructions complicate pipelining immensely

40 Technology Influences Pipelining ideas are good ideas regardless of technology –Only recently, with extra chip space, has ___________________ become better than ____________________ –Now, pipelining limited by ________

41 Exceptions – Unexpected Events
  Internal / External

42 Definitions
  a. Anything unexpected happens
  b. External event occurs
  c. Internal event occurs
  d. Change in control flow
             Exception    Interrupt
  PowerPC
  Intel
  MIPS

43 Exception-Handling
  Stop
  Transfer control to OS
  Tell OS what happened
  Begin executing where we left off

44 1. Detect Exception Add control lines to detect errors

45 Step 2: Store PC into EPC
  [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory]

46 Step 3: Tell OS the problem Store error code in the _________ Use vectored interrupts –Use error code to determine _________

47 Cause Register Set a flag in the cause register How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
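The slide's question has a short answer: mask or shift out bit 5. A sketch in C, taking the slide's premise that overflow is flagged by bit 5 of the cause register; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAUSE_OVERFLOW_BIT 5u   /* bit position given on the slide */

/* OS side: test whether the overflow flag is set. */
bool overflow_occurred(uint32_t cause) {
    return ((cause >> CAUSE_OVERFLOW_BIT) & 1u) != 0;
}

/* Hardware side: set the overflow flag, leaving other flags alone. */
uint32_t set_overflow(uint32_t cause) {
    uint32_t updated = cause | (UINT32_C(1) << CAUSE_OVERFLOW_BIT);
    assert(overflow_occurred(updated));   /* sanity check on the mask */
    return updated;
}
```

Equivalently, the handler can AND the cause value with 0x20 (that is, 1 << 5) and compare the result against zero.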

48 Vectored Interrupts
  The address of the trap handler is determined by the cause
  Exception type            Exception vector address (in hex)
  Undefined Instruction     C0 00 00 00
  Arithmetic Overflow       C0 00 00 20
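Vectoring amounts to a table lookup keyed by exception type. The sketch below uses the two addresses from the slide's table; the enum, the array, and the function name are assumptions for illustration, and in hardware this selection would be wired logic rather than code.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    EXC_UNDEFINED_INSTRUCTION,
    EXC_ARITHMETIC_OVERFLOW,
    EXC_COUNT
} ExceptionType;

/* Vector addresses taken from the slide's table. */
static const uint32_t vector_table[EXC_COUNT] = {
    [EXC_UNDEFINED_INSTRUCTION] = 0xC0000000u,
    [EXC_ARITHMETIC_OVERFLOW]   = 0xC0000020u,
};

/* Returns the handler address the PC should be loaded with. */
uint32_t handler_address(ExceptionType type) {
    assert((unsigned)type < (unsigned)EXC_COUNT);
    return vector_table[type];
}
```

The two entries are 0x20 bytes apart, which leaves room for eight 4-byte instructions per handler stub before the next vector begins.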

49 Cause Register – Go to OS
  [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory; added for exceptions: EPC (PC - 4), Cause, Handler PC]

50 Vectored Interrupt – Go to OS
  [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory; added for exceptions: EPC (PC - 4), Cause, Vector Table]

51 Steps for Exceptions
  1. Detect exception
  2. Place processor in state before offending instruction
  3. Record exception type
  4. Record instruction’s PC in EPC
  5. Transfer control to OS
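The steps above can be sketched as a tiny software model of a single-cycle machine. Everything named here is an assumption for illustration: the `Cpu` struct, the `take_exception` function, and the `HANDLER_PC` value (a placeholder entry point, not an address confirmed by the slides). The model assumes the PC has already advanced by 4 past the faulting instruction, which matches the "-4" box in the datapath slides, and that the faulting instruction has written no state, so step 2 needs no extra work.

```c
#include <assert.h>
#include <stdint.h>

#define HANDLER_PC UINT32_C(0xC0000000)  /* assumed OS entry point */

typedef struct {
    uint32_t pc;     /* address of the NEXT instruction to fetch */
    uint32_t epc;    /* exception program counter                */
    uint32_t cause;  /* why the exception happened               */
} Cpu;

/* Called once detection (step 1) has flagged the instruction at pc - 4. */
void take_exception(Cpu *cpu, uint32_t cause_code) {
    assert(cpu != 0 && cpu->pc >= 4);
    cpu->cause = cause_code;   /* step 3: record exception type      */
    cpu->epc = cpu->pc - 4;    /* step 4: PC of the offending instr. */
    cpu->pc = HANDLER_PC;      /* step 5: transfer control to the OS */
}
```

After the handler runs, the OS can use the saved EPC to decide where execution resumes.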

52 1. Detection
  What happens if the third instruction is undefined?
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3
  [Pipeline timing diagram: cycles 1 to 8, one row per instruction, stages IF, ID, ..., MEM, WB]
  In what stage is it detected? In what cycle?

53 Must associate exception with proper instruction What happens if multiple exceptions happen in the same cycle?
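One common answer to the question on this slide is to service the exception belonging to the earliest instruction in program order. In a simple in-order pipeline, that is the instruction furthest along the pipeline. The sketch below encodes that policy; the stage enum and the function name are assumptions, and the slides themselves leave the question open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stages ordered from the youngest instruction (IF) to the oldest (WB). */
enum { STAGE_IF, STAGE_ID, STAGE_EX, STAGE_MEM, STAGE_WB, NUM_STAGES };

/* raised[s] is true if the instruction currently in stage s raised an
   exception this cycle. Returns the stage holding the earliest such
   instruction in program order, or -1 if none was raised. */
int stage_to_service(const bool raised[NUM_STAGES]) {
    assert(raised != NULL);
    for (int s = NUM_STAGES - 1; s >= 0; s--) {
        if (raised[s]) return s;   /* later stage means older instruction */
    }
    return -1;
}
```

For example, if an undefined instruction is detected in decode in the same cycle that an older load faults in the memory stage, this policy handles the load's exception first.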

54 2. Preserve state before instruction
    add $s0, $0, $0
    lw  $s1, 0($t0)
    undefined
    or  $s3, $s4, $t3
  [Pipeline timing diagram: cycles 1 to 8, one row per instruction]
  What? What does that mean?!?

55 3. Record exception type Place value in cause register or Use vectored interrupts –(exception routine address dependent on exception type)

56 4. Record PC in EPC
  Machine in detection cycle
  [Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; instructions in flight: add, lw, Undef, or]

57 4. Record PC in EPC
  Machine in before transfer
  [Pipelined datapath diagram: PC, Inst Mem, Reg File, ALU, Data Mem; Undef instruction in flight]
  Where is the proper PC? Long gone!!!

58 4. Record PC in EPC Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory) Precise exceptions Imprecise exceptions

59 5. Transfer control to OS Same as before

