Presentation on theme: "Exploitation of Instruction-Level Parallelism (ILP)" - Presentation transcript:

1 Exploitation of Instruction-Level Parallelism (ILP)

2 Reading List: Slides: Topic4x; Henn & Patt: Chapter 4; other assigned readings from homework and classes

3 Design Space for Processors [Figure: cycles per instruction (0.05 to 20) versus clock rate (5 to 1000 MHz), locating scalar CISC, scalar RISC, superpipelined, superscalar RISC, VLIW, vector supercomputer, and multithreaded designs; the "most likely future processor space" raises the question: enough parallelism? [TheobaldGaoHen 1992,1993,1994]]

4 Pipelining - A Review. Hazards: Structural: resource conflicts, when the hardware cannot support all possible combinations of instructions in overlapped execution. Data: an instruction depends on the result of a previous instruction. Control: due to branches and other instructions that change the PC. A hazard causes a "stall", and in a pipeline a stall is serious - it holds up multiple instructions.

5 RISC Concepts Revisited. What makes it a success? - pipelining - caches. What prevents CPI = 1? - hazards and their resolution. Definition: dependence graph.

6 Structural Hazards: - non-pipelined functional units - one register-file port - one memory port. Data hazards: for some (e.g. between ALU operations) the solution is forwarding (bypassing); for others a pipeline interlock plus a pipeline stall is needed, because bypassing cannot deliver the value in time. Example: LD R1, A followed by ADD R4, R1, R7 may need a "stall" (bubble).

7 Example of Structural Hazard

Instruction        Clock cycle number
                   1    2    3    4      5    6    7    8    9
Load instruction   IF   ID   EX   MEM    WB
Instruction i+1         IF   ID   EX     MEM  WB
Instruction i+2              IF   ID     EX   MEM  WB
Instruction i+3                   stall  IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM

8 Data Hazard

Instruction        Clock cycle
                   1    2    3    4    5    6
ADD instruction    IF   ID   EX   MEM  WB      <- data written here (WB)
SUB instruction         IF   ID   EX   MEM  WB <- data read here (ID)

The ADD instruction writes a register that is a source operand for the SUB instruction. But the ADD doesn't finish writing the data into the register file until three clock cycles after the SUB begins reading it!
(1) The data hazard may cause the SUB to read the wrong value.
(2) This is dangerous: the result may be non-deterministic.
(3) Solution: forwarding (bypassing).

9 A set of instructions in the pipeline that need results to be forwarded (each instruction after the ADD reads R1 before the ADD's WB stage has written it back):

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

10 Pipeline Bypassing [Figure: the ALU with the register file and a multiplexer on each input (R1, R4); bypass paths feed the ALU inputs from the ALU result buffers and from the result write bus.]
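
A minimal C sketch of the decision a bypass multiplexer makes (my own illustration, not the course's hardware: the Latch struct and pick_operand are hypothetical names): prefer the newest in-flight result whose destination matches the operand's source register, otherwise fall back to the register file.

/* Sketch of bypass (forwarding) selection for one ALU input. */
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical latches holding in-flight results (EX/MEM is newer than MEM/WB). */
typedef struct { bool writes_reg; int dest; int value; } Latch;

/* Select the value of a source register: bypass from EX/MEM first,
   then from MEM/WB, otherwise read the register file. */
static int pick_operand(int src, Latch ex_mem, Latch mem_wb, const int regfile[32])
{
    if (ex_mem.writes_reg && ex_mem.dest == src && src != 0)
        return ex_mem.value;      /* forwarded from the ALU result buffer */
    if (mem_wb.writes_reg && mem_wb.dest == src && src != 0)
        return mem_wb.value;      /* forwarded from the result write bus */
    return regfile[src];          /* no hazard: normal register-file read */
}

int main(void)
{
    int regs[32] = {0};
    regs[2] = 7; regs[3] = 5;

    /* ADD R1, R2, R3 has just computed 12 and sits in EX/MEM;
       the following SUB R4, R1, R5 needs R1 right now. */
    Latch ex_mem = { true, 1, 12 };
    Latch mem_wb = { false, 0, 0 };

    printf("SUB sees R1 = %d (bypassed, not the stale %d in the register file)\n",
           pick_operand(1, ex_mem, mem_wb, regs), regs[1]);
    return 0;
}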

11 A = B + C  ...  E = A + D    Flow dependence (R/W conflict)

12 A = B + C  ...  A = B - C    Output dependence (W/W conflict): leaves A in the wrong state if the order is changed

13 A = A + B  ...  A = C + D    Anti-dependence (W/R conflict): the later write of A conflicts with the earlier read of A
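
A small C fragment of my own (not from the slides) that exhibits all three kinds of scalar dependence from slides 11-13; the comments name each pair.

/* Illustrative only: flow, output and anti dependences between statements. */
#include <stdio.h>

int main(void)
{
    int a, b = 2, c = 3, d = 4, e;

    a = b + c;    /* S1: writes a                                               */
    e = a + d;    /* S2: reads a  -> flow dependence S1 -> S2 (read after write) */
    a = b - c;    /* S3: writes a -> output dependence S1 -> S3 (write after write)
                                  and anti-dependence S2 -> S3 (write after read) */

    printf("a = %d, e = %d\n", a, e);   /* any reordering that violates these
                                           dependences changes the printed values */
    return 0;
}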

14 How about arrays?   A[i] = ...   ... = A[i-1] + ...
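
A C illustration of my own (not the slide's): in the first loop every iteration reads what the previous iteration wrote, so the iterations cannot safely be reordered or overlapped; the second loop carries no such dependence across iterations.

/* Illustrative: loop-carried vs. independent array updates. */
#define N 8

void carried(double a[N])
{
    /* a[i] depends on a[i-1], written one iteration earlier:
       a loop-carried flow dependence serializes the iterations. */
    for (int i = 1; i < N; i++)
        a[i] = a[i-1] + 1.0;
}

void independent(double a[N], double s)
{
    /* each iteration touches only its own element, so the iterations
       are independent and can be unrolled or overlapped freely. */
    for (int i = 0; i < N; i++)
        a[i] = a[i] + s;
}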

15 "Shared datum" conflicts between a later instruction j and an earlier instruction i - which combinations can cause a hazard in DLX?

  j       i       DLX
  Read    Read    (never a conflict)
  Write   Write   no  (WAW cannot arise in the in-order DLX pipeline)
  Read    Write   yes (RAW - the data hazards handled above)
  Write   Read    no  (WAR cannot arise in the in-order DLX pipeline)
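
A tiny C helper of my own (the enum and function names are illustrative, not part of the course material) that classifies a shared-datum conflict from the pair of accesses, mirroring the table above.

/* Classify the dependence between two accesses to the same datum. */
#include <stdio.h>

typedef enum { READ, WRITE } Access;

static const char *classify(Access earlier, Access later)
{
    if (earlier == READ  && later == READ)  return "none (read/read)";
    if (earlier == WRITE && later == READ)  return "RAW (flow)";
    if (earlier == READ  && later == WRITE) return "WAR (anti)";
    return "WAW (output)";
}

int main(void)
{
    printf("%s\n", classify(WRITE, READ));   /* RAW (flow)   */
    printf("%s\n", classify(READ,  WRITE));  /* WAR (anti)   */
    printf("%s\n", classify(WRITE, WRITE));  /* WAW (output) */
    return 0;
}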

16 Not all data hazards can be eliminated by bypassing:

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

17 Load latency cannot be eliminated by forwarding alone. It is often handled by a "pipeline interlock", which detects the hazard and stalls the pipeline; the delay cycle is called a stall or "bubble".

Any instruction   IF   ID    EX    MEM   WB
LW  R1, 32(R6)         IF    ID    EX    MEM   WB
ADD R4, R1, R7               IF    ID    stall EX    MEM   WB
SUB R5, R1, R8                     IF    stall ID    EX    MEM   WB
AND R6, R1, R7                           stall IF    ID    EX    MEM
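
A minimal C sketch of the interlock test (the struct fields are my own names, not DLX hardware signals): stall when the instruction in ID needs the register that the load ahead of it will only produce after its MEM stage.

/* Load-use interlock: the load's value is not available until after MEM,
   so a dependent instruction immediately behind it must stall one cycle. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { bool is_load; int dest; } ExStage;   /* instruction in EX */
typedef struct { int src1, src2; } IdStage;           /* instruction in ID */

static bool must_stall(ExStage ex, IdStage id)
{
    return ex.is_load && (ex.dest == id.src1 || ex.dest == id.src2);
}

int main(void)
{
    ExStage lw    = { true, 1 };    /* LW R1, 32(R6)                           */
    IdStage add   = { 1, 7 };       /* ADD R4, R1, R7 - uses R1                */
    IdStage other = { 6, 8 };       /* a hypothetical instruction not using R1 */

    printf("ADD behind LW:   %s\n", must_stall(lw, add)   ? "stall" : "issue");
    printf("other behind LW: %s\n", must_stall(lw, other) ? "stall" : "issue");
    return 0;
}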

18 "Issue" - passing the ID stage; "issued instructions" are those that have passed it. DLX issues an instruction only when there is no hazard. Detecting the interlock early in the pipeline has the advantage that the hardware never needs to suspend an issued instruction and undo its state changes.

19 Exploitation of Instruction-Level Parallelism
Static scheduling: simple scheduling, loop unrolling, loop unrolling + scheduling, software pipelining.
Dynamic scheduling: out-of-order execution, dataflow computers.

20 Constraint Graph. Directed edges: data dependences. Undirected edges: resource constraints. An edge (u,v), directed or undirected, of length e represents an interlock between nodes u and v: they must be separated by e time units. [Figure: an example constraint graph on nodes S1-S6 with edges labeled by operation latencies.]

21 Code Scheduling for a Single Pipeline (the CSSP problem)
Input: a constraint graph G = (V, E).
Output: a sequence of operations in G, v1, v2, ..., vn, containing no more than k no-ops, such that:
1. if the no-ops are deleted, the result is a topological sort of G;
2. any two nodes u, v in the sequence are separated by a distance >= d(u,v).
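
One way to produce such a sequence is a greedy list-scheduling pass: emit the nodes in topological order and, before each node, pad with no-ops until every already-scheduled predecessor is far enough away. The C sketch below is my own illustration (the example graph is hypothetical, the nodes are assumed to be given in topological order, and the greedy pass yields a legal schedule but is not guaranteed to minimize the number of no-ops).

/* Greedy list scheduling for a single pipeline (sketch). */
#include <stdio.h>

#define N 4            /* number of operations             */
#define E 3            /* number of constraint-graph edges */

typedef struct { int u, v, d; } Edge;   /* v must come >= d slots after u */

int main(void)
{
    /* Hypothetical example: 0 -> 1 (d = 2), 0 -> 2 (d = 1), 2 -> 3 (d = 2). */
    const Edge edges[E] = { {0, 1, 2}, {0, 2, 1}, {2, 3, 2} };
    const char *name[N] = { "op0", "op1", "op2", "op3" };

    int slot[N] = {0};  /* slot assigned to each operation */
    int next = 0;       /* next free slot in the sequence  */

    for (int v = 0; v < N; v++) {           /* nodes arrive in topological order */
        int earliest = next;
        for (int e = 0; e < E; e++)         /* honor every incoming edge */
            if (edges[e].v == v && slot[edges[e].u] + edges[e].d > earliest)
                earliest = slot[edges[e].u] + edges[e].d;

        while (next < earliest)             /* pad with no-ops */
            printf("slot %d: noop\n", next++);

        slot[v] = next;
        printf("slot %d: %s\n", next++, name[v]);
    }
    return 0;
}

For the example graph this prints op0, a no-op, op1, op2, a no-op, op3: deleting the no-ops leaves a topological sort, and every pair of nodes is separated by at least its edge length.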

22 Advanced Pipelining
- instruction reordering/scheduling within the loop body
- loop unrolling: the code is not compact
- superscalar: compact code + multiple issue of different classes of instructions
- VLIW

23 An Example: X + a

Loop: LD    F0, 0(R1)     ; load the vector element
      ADDD  F4, F0, F2    ; add the scalar in F2
      SD    0(R1), F4     ; store the vector element
      SUB   R1, R1, #8    ; decrement the pointer by 8 bytes (per DW)
      BNEZ  R1, LOOP      ; branch when it is not zero
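
The high-level loop this DLX code implements looks roughly like the C below (a hedged reconstruction: the function name, the double element type and the backward trip over n elements are my assumptions; R1 walks the array in 8-byte steps and F2 holds the scalar).

/* Assumed source loop: add a scalar a to every element of vector x. */
void add_scalar(double x[], int n, double a)
{
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + a;
}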

24 Latencies of FP operations used in this section

Instruction producing result   Destination instruction   Latency in clock cycles
FP ALU op                      Another FP ALU op         3
FP ALU op                      Store double              2
Load double                    FP ALU op                 1
Load double                    Store double              0

The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit, like the one we described for DLX in the last chapter. The major change versus the DLX FP pipeline was to reduce the latency of FP multiply; this helps keep our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.

25 Without any scheduling the loop executes as follows:

                              Clock cycle issued
Loop: LD    F0, 0(R1)         1
      stall                   2
      ADDD  F4, F0, F2        3
      stall                   4
      stall                   5
      SD    0(R1), F4         6
      SUB   R1, R1, #8        7
      BNEZ  R1, LOOP          8
      stall                   9

This requires 9 clock cycles per iteration.
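
A small check of my own that the 9-cycle count follows from the latency table on slide 24: 5 instructions plus 1 stall after the load, 2 stalls after the FP add, and 1 stall after the branch (the branch stall is taken from the schedule above, not from the table).

/* Reproduce the per-iteration cycle count of the unscheduled loop. */
#include <stdio.h>

int main(void)
{
    int load_to_fp_alu  = 1;   /* LD   -> ADDD (load double to FP ALU op)  */
    int fp_alu_to_store = 2;   /* ADDD -> SD   (FP ALU op to store double) */
    int branch_stall    = 1;   /* stall after BNEZ in the schedule above   */

    int instructions = 5;      /* LD, ADDD, SD, SUB, BNEZ */
    int stalls = load_to_fp_alu + fp_alu_to_store + branch_stall;

    printf("cycles per iteration = %d\n", instructions + stalls);  /* 9 */
    return 0;
}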

26 We can schedule the loop to obtain:

Loop: LD    F0, 0(R1)
      stall
      ADDD  F4, F0, F2
      SUB   R1, R1, #8
      BNEZ  R1, LOOP      ; delayed branch
      SD    8(R1), F4     ; offset changed from 0 because the SD was interchanged with the SUB

Average: 6 cycles per element

27 Loop unrolling: here is the result after dropping the unnecessary SUB and BNEZ operations duplicated during unrolling.

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4      ; drop SUB & BNEZ
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8     ; drop SUB & BNEZ
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12   ; drop SUB & BNEZ
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUB   R1, R1, #32
      BNEZ  R1, LOOP

Average: 6.8 cycles per element
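
At the source level, unrolling by four corresponds roughly to the C sketch below (mine, not the slide's; it assumes the trip count n is a multiple of 4, just as the DLX version assumes the element count divides evenly by 4).

/* Sketch of the source loop unrolled by a factor of 4 (assumes n % 4 == 0). */
void add_scalar_unrolled(double x[], int n, double a)
{
    for (int i = n - 1; i >= 3; i -= 4) {
        x[i]     = x[i]     + a;   /* offset   0 in the DLX version */
        x[i - 1] = x[i - 1] + a;   /* offset  -8                    */
        x[i - 2] = x[i - 2] + a;   /* offset -16                    */
        x[i - 3] = x[i - 3] + a;   /* offset -24                    */
    }
}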

28 Unrolling + Scheduling: the unrolled loop from the previous example after it has been scheduled on DLX.

Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SD    -16(R1), F12
      SUB   R1, R1, #32    ; branch dependence
      BNEZ  R1, LOOP
      SD    8(R1), F16     ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 6.8 per element before scheduling.

29 Simple unrolling: y[i] = X[i] + a [Figure: four independent LD -> ADDD (+a) -> SD chains, one per element, at offsets 0, -8, -16 and -24 from R1.] We have eliminated three branches and three decrements of R1, and the addresses on the loads and stores have been compensated accordingly. Without scheduling, every operation is followed by a dependent operation and thus causes a stall. This loop runs in 27 clock cycles - each LD takes 2 clock cycles, each ADDD 3, the branch 2, and all other instructions 1 - or 27 cycles / 4 elements = 6.8 cycles per element.

30 Unrolling + Scheduling [Figure: the same four LD -> ADDD (+a) -> SD chains, now interleaved so the four loads issue first, then the four adds, then the stores, hiding the load and FP-add latencies.] 14 cycles / 4 elements = 3.5 cycles per element

