Presentation stolen from the web (with changes) from the Univ of Aberta and Espen Skoglund and Thomas Richards (470 alum) and Our textbook’s authors IA-64:

1 IA-64: Advanced Loads, Speculative Loads, Software Pipelining

2 IA-64
128 64-bit general registers
–Use a register window, similar to SPARC
128 82-bit fp registers
64 1-bit predicate registers
8 64-bit branch target registers

3 Explicit Parallelism
Groups
–Instructions which could be executed in parallel if hardware resources are available.
Bundle
–Code format: 3 instructions fit into a 128-bit bundle.
–5 bits of template, 41*3 bits of instruction.
»Template specifies which execution units each instruction requires.

4 Instruction groups
IA-64 instructions are bound in instruction groups
–No read-after-write dependencies
–No write-after-write dependencies
–Any instruction in the group may be executed in parallel
–New processors can easily take advantage of the existing ILP in the instruction group
Instruction groups are indicated by stop bits in the template
Instruction groups may end dynamically on branches

5 Instruction bundles
Instruction bundles contain
–3 instructions
–A template field which maps instructions to execution units
Processor dispatches all three instructions in parallel
Instruction group may end in the middle of a bundle
Bundles are aligned on 16-byte boundaries
[Figure: instruction bundle layout, left to right: Slot 3 | Slot 2 | Slot 1 | Template]
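The bundle layout can be sketched in C as below. This is only an illustration: the type and function names (`bundle_t`, `bundle_pack`, `bundle_template`, `bundle_slot`) are made up here, slots are numbered 0 to 2 rather than 1 to 3 as in the figure, and the bit positions assume the usual encoding in which the template occupies the lowest 5 bits and the three 41-bit slots follow in order.

```c
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves:
   lo = bits 0..63, hi = bits 64..127. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} bundle_t;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)   /* 41 one-bits */

/* Assemble a bundle from a 5-bit template and three 41-bit slots. */
bundle_t bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    bundle_t b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1f) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

/* Template: bits 0..4. */
unsigned bundle_template(bundle_t b) {
    return (unsigned)(b.lo & 0x1f);
}

/* Slot n (0..2): 41 bits starting at bit 5 + 41*n. */
uint64_t bundle_slot(bundle_t b, int n) {
    switch (n) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;                    /* bits 5..45   */
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46..86  */
    default: return (b.hi >> 23) & SLOT_MASK;                   /* bits 87..127 */
    }
}
```

The middle slot straddles the two 64-bit halves, which is why 5 + 3*41 = 128 bits cannot be split into three byte-aligned instructions.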

6 Predication
Use predicates to eliminate branches
Predicates are one-bit registers (total of 64)
Most instructions can be predicated
  (qp) mnemonic dest = source
Predicates are set by compare instructions
  (qp) cmp.crel px,py = source
C code:
  if (a == b)
    y += 3;
  else
    y += 4;
x86 assembly:
         cmp a, b
         je .eq
         add $4, y
         jmp .done
  .eq:   add $3, y
  .done:
IA-64 assembly:
         cmp.eq p1,p2 = a,b
  (p1)   add y = y, 3
  (p2)   add y = y, 4
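The same if-conversion can be mimicked in C. This is a rough sketch, not what a compiler would literally emit: `update_branch` and `update_predicated` are names invented here, and the two booleans stand in for the predicate registers p1 and p2.

```c
#include <stdbool.h>

/* Branching version, as in the C code on the slide. */
int update_branch(int a, int b, int y) {
    if (a == b)
        y += 3;
    else
        y += 4;
    return y;
}

/* If-converted version: one compare produces two complementary
   predicates, and each add is guarded by its predicate. */
int update_predicated(int a, int b, int y) {
    bool p1 = (a == b);   /* cmp.eq p1,p2 = a,b */
    bool p2 = !p1;
    y += p1 * 3;          /* (p1) add y = y, 3 : adds 0 when p1 is false */
    y += p2 * 4;          /* (p2) add y = y, 4 : adds 0 when p2 is false */
    return y;
}
```

Both adds are always issued; the predicate only decides whether the result takes effect, which is what removes the branch.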

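7 Advanced loads and speculative loads
Advanced loads
–Used to address data dependencies
Speculative loads
–Used to address control dependencies
[Figure: an advanced load is hoisted above a store (st), with a check load left at the original load position; a speculative load is hoisted above a predicated branch ((p) br), with a speculation check left at the original load position]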

8 Advanced loads
addr1 and addr2 in the example might point to the same address
If different:
–Datum in addr2 can be prefetched
If same:
–Datum in addr2 cannot be prefetched
C code example:
  int foo (int *addr1, int *addr2)
  {
    int h;
    *addr1 = 4;
    h = *addr2;
    return h+1;
  }

9 Advanced loads
Insert advanced loads (ld.a) to prefetch data (store in ALAT)
Use check data instruction (ld.c) in place of original load
If memory contents have changed, perform real load
Advanced loads do not defer exceptions (e.g., page faults)
Regular load:
  add r3 = 4,r0 ;;
  st4 [r32] = r3
  ld4 r2 = [r33]        // regular load
  add r5 = r2,r3        // use data
Advanced load:
  ld4.a r2 = [r33]      // advanced load
  add r3 = 4,r0 ;;
  st4 [r32] = r3
  ld4.c r2 = [r33] ;;   // verify data
  add r5 = r2,r3        // use data

10 Speculative loads
If addr in the example is legal, we can prefetch its value
If addr is illegal, prefetching the value would cause an exception
Any exception should be delayed until the code path has been resolved
C code example:
  int add5 (int *addr)
  {
    if (addr == NULL)
      return (-1);
    else
      return (*addr+5);
  }

11 Speculative loads
Insert speculative loads (ld.s) to prefetch data
Verify load using check instruction (chk.s)
NaT bit/NaTVal is used to track success of load
Might also be combined with advanced loads (ld.sa and chk.a)
Assembly code:
  add5:
         ld8.s r1 = [r32]
         cmp.eq p6,p5 = r32,r0 ;;
  (p6)   add r8 = -1,r0
  (p6)   br.ret
  (p5)   chk.s r1,return_error
         add r8 = 5,r1
         br.ret ;;
  return_error:
         recovery code

12 Code example “Why hoist loads?”
  add  r15 = r2,r3   //A
  mult r4 = r15,r2   //B
  mult r4 = r4,r4    //C
  st8  [r12] = r4    //D
  ld8  r5 = [r15]    //E
  div  r6 = r5,r7    //F
  add  r5 = r6,r2    //G
Assume latencies are:
  add, store: +0
  mult, div: +3
  ld: +4
[Figure: dependence graph of instructions A through G; node labels partly lost in extraction: A:1 D:1 G:1 C:4 E:5 B:4 F:]

13 Advanced Loads Recovery
Case A – Hoist just the load.
»In this case, if there is a memory dependency we just re-execute the load.
  // Case A: Advanced Load
  ld.a r2 = [r10]
  st8 [r1] = r9
  ld.c r2 = [r10]
  add r15 = r2, r3
  st8 [r18] = r19
Case B – Hoist the load and dependent instructions.
»In this case, we need to re-execute all of the dependent instructions.
  // Case B: Advanced Load
  // With Speculative Add
  ld.a r2 = [r10]
  add r5 = r2, r3
  st8 [r1] = r9
  ld.c r2 = [r10] // Wrong
  st8 [r18] = r19
A ld.c will only re-execute the load; in Case B, r5 is still wrong after the ld.c!

14 Advanced Load-Use Recovery: Compiler Generated Recovery Code
  // Solution: Using the chk.a instruction
  ld8.a r2 = [r10]
  add r5 = r2, r3
  st8 [r1] = r9
  chk.a r2, fixup
return: // Return Point
  st8 [r18] = r
fixup: // Re-execute load and all speculative uses
  ld8 r2 = [r10]
  add r5 = r2, r3
  br return
Use ld.c if JUST a load is speculative.
Use chk.a if a load and an instruction that is dependent on the load are both speculative.

15 The Advanced Load Address Table (ALAT)
The ALAT tells us if we need to recover from an Advanced Load.
When an advanced load is executed
–Save the type of load, size of load, and load address to the ALAT (indexed by PR, the load's target register).
When we execute a ld.c or chk.a, look for the entry in the ALAT. If it is missing, run the recovery code.
Remove an entry from the ALAT if
–A store address overlaps an ALAT entry.
–A capacity/associativity eviction occurs.
–Another advanced load indexes the same PR.
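A toy software model of these ALAT rules, offered only as a sketch: it keeps one entry per target register and a full address, so there are no capacity evictions, whereas real hardware uses a small associative table. All names (`alat_model_t`, `ld_a`, `st`, `ld_c`, `foo_advanced`) are invented for this illustration; `foo_advanced` replays the `foo` example from the advanced-loads slides with the load hoisted above the store.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS 128

typedef struct {
    bool      valid;
    uintptr_t addr;   /* address the advanced load read from */
    size_t    size;   /* number of bytes loaded */
} alat_entry_t;

typedef struct {
    int64_t      reg[NUM_REGS];
    alat_entry_t alat[NUM_REGS];   /* simplification: one entry per register */
} alat_model_t;

/* ld.a: load early and record address and size under target register r. */
void ld_a(alat_model_t *m, int r, const int32_t *p) {
    m->reg[r] = *p;
    m->alat[r].valid = true;
    m->alat[r].addr  = (uintptr_t)p;
    m->alat[r].size  = sizeof *p;
}

/* st: store, then drop every ALAT entry whose bytes overlap the store. */
void st(alat_model_t *m, int32_t *p, int32_t v) {
    uintptr_t a = (uintptr_t)p;
    *p = v;
    for (int i = 0; i < NUM_REGS; i++) {
        alat_entry_t *e = &m->alat[i];
        if (e->valid && a < e->addr + e->size && e->addr < a + sizeof *p)
            e->valid = false;
    }
}

/* ld.c: keep the early value if the entry survived, otherwise reload.
   Returns true when a reload was needed. */
bool ld_c(alat_model_t *m, int r, const int32_t *p) {
    if (m->alat[r].valid)
        return false;
    m->reg[r] = *p;
    return true;
}

/* foo with the load of *addr2 hoisted above the store to *addr1. */
int foo_advanced(int32_t *addr1, int32_t *addr2) {
    alat_model_t m;
    memset(&m, 0, sizeof m);
    ld_a(&m, 2, addr2);     /* ld4.a r2 = [addr2] */
    st(&m, addr1, 4);       /* st4 [addr1] = 4    */
    ld_c(&m, 2, addr2);     /* ld4.c r2 = [addr2] */
    return (int)m.reg[2] + 1;
}
```

When the two pointers alias, the store invalidates the entry and `ld_c` reloads the fresh value; otherwise the early value is used unchanged.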

16 Control Speculation and Recovery
What if we want to move a load above a branch?
–Problem is that the load maybe shouldn’t have executed and might have thrown a spurious exception.
Similar to Advanced Load, but no ALAT.
–Instead, check NaT bit for deferred exceptions.
»See next slide.
–Use chk.s for recovery (instead of chk.a or ld.c).
  // Control Speculation and Recovery
  ld8.s r1 = [r10]              // load moved above the branch
  st8 [r11] = r9
  (p1) br.cond branch_label     // (p1) is a predication bit
  chk.s r1, recovery
return:
  add r2 = r1, r2
chk.s checks r1 to see if the NaT bit is set. If so, branch to recovery code (re-execute instructions if necessary).

17 Not a Thing Bit (NaT)
If a control speculative load causes an exception, the processor can set this bit, which defers the exception.
NaT bits propagate.
–Propagation allows a single check for multiple ld.s.
  ld8.s r1 = [r10]
  ld8.s r2 = [r11]
  add r3 = r1, r2
  ld8.s r4 = [r3]
  st8 [r11] = r9
  (p1) br.cond branch_label
  chk.s r4, recovery
[Figure: IA64 register: 64 bits + 1 NaT bit]
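NaT propagation can be modeled in C by pairing each value with a flag. This is a simplified sketch: the only "exception" it knows about is a NULL pointer, and the names `nat_reg`, `ld_s`, `add_nat`, `chk_s`, `speculative_sum` and this version of `add5` are illustrative, not part of any real API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A register value plus its NaT bit (64 bits + 1). */
typedef struct {
    int64_t value;
    bool    nat;    /* true: a deferred exception is pending */
} nat_reg;

/* ld.s: a load that would fault sets NaT instead of trapping. */
nat_reg ld_s(const int64_t *p) {
    nat_reg r = { 0, false };
    if (p == NULL)
        r.nat = true;
    else
        r.value = *p;
    return r;
}

/* add: the result is NaT if either source is NaT. */
nat_reg add_nat(nat_reg a, nat_reg b) {
    nat_reg r = { a.value + b.value, a.nat || b.nat };
    return r;
}

/* chk.s: true means "branch to recovery code". */
bool chk_s(nat_reg r) {
    return r.nat;
}

/* Two speculative loads, one add, one check on the final result. */
bool speculative_sum(const int64_t *p, const int64_t *q, int64_t *out) {
    nat_reg r1 = ld_s(p);
    nat_reg r2 = ld_s(q);
    nat_reg r3 = add_nat(r1, r2);
    if (chk_s(r3))
        return true;            /* recovery needed; *out untouched */
    *out = r3.value;
    return false;
}

/* add5 from the speculative-loads slides, with the load hoisted above the test. */
int64_t add5(const int64_t *addr) {
    nat_reg r1 = ld_s(addr);    /* ld8.s r1 = [r32] */
    if (addr == NULL)
        return -1;              /* (p6) path: r1 is never checked */
    if (chk_s(r1)) {            /* (p5) chk.s r1, return_error */
        r1.value = *addr;       /* recovery: redo the load non-speculatively */
        r1.nat = false;
    }
    return r1.value + 5;
}
```

A single `chk_s` on `r3` covers both loads because the flag travels with the data.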

18 Software pipelining on IA-64
Lots of tricks
–Rotating registers
–Special counters
Often don’t need prolog and epilog.
–Special counters and predication let us execute only those instructions we need to.

19 Prolog and epilog
From before!!!!!
        r3=r3-8            // Needed to check legal!
        r4=MEM[r2+0]       // A(1)
        r1=r4*2            // B(1)
        r4=MEM[r2+4]       // A(2)
Loop:   MEM[r2+0]=r1       // C(n)
        r1=r4*2            // B(n+1)
        r4=MEM[r2+8]       // A(n+2)
        r2=r2+4            // D(n)
        bne r2 r3 Loop     // E(n)
        MEM[r2+0]=r1       // C(x-1)
        r1=r4*2            // B(x)
        MEM[r2+4]=r1       // C(x)
        r3=r3+8            // Could have used tmp var.
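For comparison, here is the same hand-pipelined schedule written in C, as a sketch under a few assumptions: the array holds `int` elements, the function name `double_pipelined` is invented, and a `for` loop replaces the bottom-tested `bne` loop so that an array of two elements also works.

```c
#include <stddef.h>

/* Doubles every element of a[0..n-1] in place, n >= 2.
   A = load, B = multiply, C = store; each kernel pass overlaps
   stage C of element i, B of element i+1 and A of element i+2. */
void double_pipelined(int *a, size_t n) {
    size_t i = 0;
    int r4, r1;

    /* Prolog: fill the pipeline. */
    r4 = a[0];              /* A(1) */
    r1 = r4 * 2;            /* B(1) */
    r4 = a[1];              /* A(2) */

    /* Kernel. */
    for (; i + 2 < n; i++) {
        a[i] = r1;          /* C(n)   */
        r1 = r4 * 2;        /* B(n+1) */
        r4 = a[i + 2];      /* A(n+2) */
    }

    /* Epilog: drain the pipeline. */
    a[i] = r1;              /* C(x-1) */
    r1 = r4 * 2;            /* B(x)   */
    a[i + 1] = r1;          /* C(x)   */
}
```

The prolog and epilog are exactly the code that the IA-64 loop counters and rotating predicates aim to make unnecessary.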

20 There are three special purpose registers used in IA-64 for software pipelining
Loop counter (LC) indicates how many times to run through the loop (prolog/kernel)
–Initialized to N-1 before starting loop code
–Decremented until LC == 0
Epilog counter (EC) indicates how many times to run the loop after the loop counter is exhausted (epilog)
–Needed to flush the software pipeline
–Initialized to num-stages before entering loop code
–Decremented if LC == 0 and EC > 0

21 And RRB (Register Rename Base)
Add internal counter RRB to register number to get actual used register
–Counter decreased by special loop branch instructions
–May be reset by clrrrb instruction
–Use modular lookup (so we wrap around!)
Rotated predicate registers
–Initially reset using: mov pr.rot = value
–pr 63 is written (1 or 0) before every rotation, so it shows up as pr 16 afterwards

22 How does register rotation work? (Basics)
Rotated registers:
–General: gr 32 - gr N (as specified by alloc instruction)
–Predicate: pr 16 - pr 63
–Floating point: fr 32 - fr 127
Registers are rotated to higher numbers
–Register r n is renamed to r n+1, r max is renamed to r min
Registers are rotated by specific loop branch instructions
–br.ctop, br.cexit (for counted loops)
–br.wtop, br.wexit (for while loops)
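The renaming is just a modular offset. A minimal sketch follows, with a made-up function name (`rotate_map`) and RRB kept as a signed count that goes negative, as the slides draw it (hardware would more likely keep it already reduced modulo the region size).

```c
/* Map a logical rotating register number to a physical one.
   first: lowest rotating register (32 for general registers, 16 for predicates)
   size:  number of registers in the rotating region
   rrb:   rename base, starts at 0 and is decremented on every rotation */
int rotate_map(int logical, int first, int size, int rrb) {
    int off = (logical - first + rrb) % size;
    if (off < 0)
        off += size;    /* C's % may yield a negative remainder */
    return first + off;
}
```

With eight rotating general registers, `rotate_map(32, 32, 8, 0)` and `rotate_map(33, 32, 8, -1)` both give physical register 32: the value written as r32 in one iteration is read as r33 in the next, without any data being moved.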

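23 How they relate
[Figure: flowchart of what br.ctop and br.cexit do, testing LC first and then EC]
LC != 0 (prolog/kernel): LC--, EC=EC, PR[63]=1, RRB--; ctop: branch, cexit: fall-thru
LC == 0 (epilog), EC > 1: LC=LC, EC--, PR[63]=0, RRB--; ctop: branch, cexit: fall-thru
LC == 0 (epilog), EC == 1: LC=LC, EC--, PR[63]=0, RRB--; ctop: fall-thru, cexit: branch
LC == 0 (epilog), EC == 0 (special unrolled loops): LC=LC, EC=EC, PR[63]=0, RRB=RRB; ctop: fall-thru, cexit: branch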
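The four cases can be written down as a small state machine. This sketch covers br.ctop only; the struct and function names (`loop_state_t`, `br_ctop`) are illustrative, and RRB is again kept as a signed count.

```c
#include <stdbool.h>

typedef struct {
    unsigned lc;     /* loop counter */
    unsigned ec;     /* epilog counter */
    int      rrb;    /* register rename base */
    bool     pr63;   /* value written to PR[63] before the rotation */
} loop_state_t;

/* Apply one br.ctop; returns true if the branch is taken. */
bool br_ctop(loop_state_t *s) {
    if (s->lc != 0) {            /* prolog/kernel */
        s->lc--;
        s->pr63 = true;
        s->rrb--;
        return true;
    }
    if (s->ec > 1) {             /* epilog, more stages left to drain */
        s->ec--;
        s->pr63 = false;
        s->rrb--;
        return true;
    }
    if (s->ec == 1) {            /* last epilog pass */
        s->ec--;
        s->pr63 = false;
        s->rrb--;
        return false;
    }
    s->pr63 = false;             /* ec == 0: nothing is rotated */
    return false;
}
```

Starting from LC = 4 and EC = 3, the branch is taken six times and falls through on the seventh, leaving LC = 0, EC = 0 and RRB = -7.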

24-60 Software Pipelining Example in the IA-64
loop:
  (p16) ld1 r32 = [r12], 1
  (p17) add r34 = 1, r33
  (p18) st1 [r13] = r35, 1
        br.ctop loop
[Figure: animation over slides 24 to 60 showing General Registers (Physical), General Registers (Logical), Predicate Registers, Memory, LC, EC and RRB as the loop executes. Memory initially holds x1 to x5. Start: LC = 4, EC = 3, RRB = 0. Every br.ctop decrements RRB; LC counts down from 4 to 0, then EC counts down from 3 to 0. The results y1 to y5 are stored to memory one at a time. End: LC = 0, EC = 0, RRB = -7.]
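The animation can be reproduced with a small simulator. This is a sketch under stated assumptions: eight rotating general registers (r32 to r39), byte-sized data, only p16 set on entry, and invented names throughout (`sim_t`, `wrap`, `GR`, `PR`, `sim_br_ctop`, `pipelined_increment`). It runs the three predicated instructions of the loop body and then applies the br.ctop rules.

```c
#include <stdbool.h>
#include <stdint.h>

#define GR_ROT 8     /* rotating general registers r32..r39 (assumed size) */
#define PR_ROT 48    /* rotating predicates p16..p63 */

typedef struct {
    uint8_t  gr[GR_ROT];   /* physical rotating general registers */
    bool     pr[PR_ROT];   /* physical rotating predicates */
    int      rrb;          /* rename base, decremented per rotation */
    unsigned lc, ec;
} sim_t;

static int wrap(int x, int n) { int r = x % n; return r < 0 ? r + n : r; }

/* Logical register number -> pointer to the physical register. */
static uint8_t *GR(sim_t *s, int r) { return &s->gr[wrap(r - 32 + s->rrb, GR_ROT)]; }
static bool    *PR(sim_t *s, int p) { return &s->pr[wrap(p - 16 + s->rrb, PR_ROT)]; }

/* br.ctop: write PR[63], then rotate so it becomes p16. */
static bool sim_br_ctop(sim_t *s) {
    bool taken;
    if (s->lc != 0)      { s->lc--; *PR(s, 63) = true;  taken = true;  }
    else if (s->ec > 1)  { s->ec--; *PR(s, 63) = false; taken = true;  }
    else if (s->ec == 1) { s->ec--; *PR(s, 63) = false; taken = false; }
    else                 { *PR(s, 63) = false; return false; }
    s->rrb--;
    return taken;
}

/* dst[i] = src[i] + 1 for n >= 1 bytes. Returns how often the loop body ran. */
int pipelined_increment(const uint8_t *src, uint8_t *dst, unsigned n) {
    sim_t s = {0};
    const uint8_t *r12 = src;
    uint8_t *r13 = dst;
    int iters = 0;

    s.lc = n - 1;            /* LC = N-1 */
    s.ec = 3;                /* EC = number of stages */
    *PR(&s, 16) = true;      /* only the load stage is enabled at entry */

    do {
        if (*PR(&s, 16)) *GR(&s, 32) = *r12++;                      /* (p16) ld1 r32 = [r12], 1 */
        if (*PR(&s, 17)) *GR(&s, 34) = (uint8_t)(*GR(&s, 33) + 1);  /* (p17) add r34 = 1, r33  */
        if (*PR(&s, 18)) *r13++ = *GR(&s, 35);                      /* (p18) st1 [r13] = r35, 1 */
        iters++;
    } while (sim_br_ctop(&s));
    return iters;
}
```

For five input bytes the body runs seven times (N + stages - 1); the first two passes and the last two are the prolog and epilog, carried out by the same three instructions with some predicates off.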

61 IA-64 Software pipelining Review
No prolog or epilog in code
–But we execute a lot of no-ops.
Rotated registers help
–In this case, we just didn’t have to reverse the code ordering
»But in general, better still. Could move load from use more than one loop iteration apart.
Looks good at least in this case…

62 IA-64 review
Some problems
–ALAT difficult for compilers to use.
»Recall Colwell talking about “once we figure out how to do this…”
–128/3 instruction size makes I-cache worse.
–Big register file has disadvantages
»Context switch mainly.
–So many dependencies with special purpose instructions, dynamic OoO is unlikely.
But…
–If the compiler could do a good job, there really does look like the potential for a big win.

