Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture:

Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture: A Quantitative Approach 4th E dition, John L. Hennessy & David A. Patterso with modifications.

Dr. Amr Talaat Elect 707 Control Hazard on Branches Three Stage Stall 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg

Dr. Amr Talaat Elect 707 Solving Branch Problems  Prediction  Static: Software  Dynamic: hardware  Speculation

Dr. Amr Talaat Elect 707 Static Branch Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken  Execute successor instructions in sequence  “Squash” instructions in pipeline if branch actually taken  Advantage of late pipeline state update  47% MIPS branches not taken on average  PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken  53% MIPS branches taken on average  But haven’t calculated branch target address in MIPS  MIPS still incurs 1 cycle branch penalty  Other solutions: branch target known before outcome

Dr. Amr Talaat Elect 707 Dynamic Branch Prediction  Performance = ƒ(accuracy, cost of misprediction)  Branch History Table: Lower bits of PC address index table of 1-bit values  Says whether or not branch taken last time  No address check  Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):  End of loop case, when it exits instead of looping as before  First time through loop on next time through code, when it predicts exit instead of looping

Dr. Amr Talaat Elect 707 Branch History Table T Predict Taken Predict not Taken 1 0 T NT 1-bit prediction NT Feedback 01 BHT branch PC

Dr. Amr Talaat Elect 707  Solution: 2-bit scheme where change prediction only if get misprediction twice  Adds hysteresis to decision making process Dynamic Branch Prediction T T NT Predict Taken Predict Not Taken Predict Taken Predict Not Taken T NT T

Dr. Amr Talaat Elect 707 BHT Accuracy  Mispredict because either:  Wrong guess for that branch  Got branch history of wrong branch when index the table  4096 entry table:

Dr. Amr Talaat Elect 707 9 Correlating Branches Code example showing the potential Assemble code Observation: if BNEZ1 and BNEZ2 is not taken, then BNEZ3 is taken

Dr. Amr Talaat Elect 707 10 Correlating Branch Predictor Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behav ior)  Then behavior of recent branches selects between, say, 2 predictions of next branch, updating just that prediction  (1,1) predictor: 1-bit global, 1-bit local Branch address (4 bits) 1-bits per branch local predictors Prediction 1-bit global branch history (0 = not taken)

Dr. Amr Talaat Elect 707 Correlated Branch Prediction  Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table  In general, (m,n) predictor means record last m branches to select between 2 m history tables, each with n-bit counters  Thus, old 2-bit BHT is a (0,2) predictor  Global Branch History: m-bit shift register keeping T/NT sta tus of last m branches.  Each entry in table has m n-bit predictors.

Dr. Amr Talaat Elect 707 Correlating Branches (2,2) predictor –Behavior of recent branches selects between four predictio ns of next branch, updating jus t that prediction Branch address 2-bits per branch predictor Prediction 2-bit global branch history 4

Dr. Amr Talaat Elect 707 0%0% Frequency of Mispredictions 0% 1% 5% 6% 11% 4% 6% 5% 1% 2% 4% 6% 8% 10% 12% 14% 16% 18% 20% 4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2) Accuracy of Different Schemes 4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT nasa7 matrix300doducdspicefppppgccexpressoeqntottlitomcatv

Dr. Amr Talaat Elect 707 Loop Example Loop:LDF00R1 SDF40R1 SUBIR1R1#8 MULTDF4F0F2 BNEZF4Loop  very long wait for branch?

Dr. Amr Talaat Elect 707 spec·u·la·tion  spec·u·la·tion : [spek-yuh-ley-shuhn] Show IPA 1. the contemplation or consideration of some subject: toenga ge in speculation on humanity's ultimate destiny. 2. a single instance or process of consideration. 3. a conclusion or opinion reached by such contemplation: T hese speculations are impossible to verify. 4. conjectural consideration of a matter; conjecture or surmis e:a report based on speculation rather than facts. 5. engagement in business transactions involving considerable risk but offering the chance of large gains, especially tradin g in commodities, stocks, etc., in the hope of profit from ch anges in the market price

Dr. Amr Talaat Elect 707 Speculation to greater ILP  Greater ILP: Overcome control dependence by hardwar e speculating on outcome of branches and executing p rogram as if guesses were correct  Dynamic scheduling  only fetches and issues instruction s  Speculation  fetch, issue, and execute instructions as if branch predictions were always correct  Essentially a data flow execution model: Operations exe cute as soon as their operands are available

Dr. Amr Talaat Elect 707 Speculation to greater ILP 3 components of HW-based speculation: 1. Dynamic branch prediction to choose which instructions to execute 2. Speculation to allow execution of instructions before con trol dependences are resolved + ability to undo effects of incorrectly speculated sequence 3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks

Dr. Amr Talaat Elect 707 Adding Speculation to Tomasulo  Must separate execution from allowing instruction to finish or “commit”  This additional step called instruction commit  When an instruction is no longer speculative, allow it to update the register file or memory  Requires additional set of buffers to hold results of in structions that have finished execution but have not committed  This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

Dr. Amr Talaat Elect 707 Reorder Buffer (ROB)  In Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file  With speculation, the register file is not updated until the instruction commits  (we know definitively that the instruction should execute)  Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit  ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s algorithm  ROB extends architecture registers like RS

Dr. Amr Talaat Elect 707 Reorder Buffer Entry  Each entry in the ROB contains four fields: 1. Instruction type a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations) 2. Destination Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written 3. Value Value of instruction result until the instruction commits 4. Ready Indicates that instruction has completed execution, and the value is ready

Dr. Amr Talaat Elect 707 Reorder Buffer operation  Holds instructions in FIFO order, exactly as issued  When instructions complete, results placed into ROB  Supplies operands to other instruction between execution complete & commit  more registers like RS  Tag results with ROB buffer number instead of reservation station  Instructions commit  values at head of ROB placed in registers  As a result, easy to undo speculated instructions on mispredicted branches or on exceptions Reorder Buffer FP Op Queue FP Adder Res Stations FP Regs Commit path

Dr. Amr Talaat Elect 707 4 Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue in str & send operands & reorder buffer no. for destination ( this stage sometimes called “dispatch”) 2.Execution—operate on operands (EX) When both operands ready then execute; if not ready, wa tch CDB for result; when both in reservation station, exec ute; checks RAW (sometimes called “issue”) 3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available. 4.Commit—update register with reorder result When instr. at head of reorder buffer & result present, up date register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reo rder buffer (sometimes called “graduation”)

Dr. Amr Talaat Elect 707 Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 F0 LD F0,10(R2) N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers

Dr. Amr Talaat Elect 707 2 ADDD R(F4),ROB1 Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 F10 F0 ADDD F10,F4,F0 LD F0,10(R2) N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers

Dr. Amr Talaat Elect 707 3 DIVD ROB2,R(F6) 2 ADDD R(F4),ROB1 Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers

Dr. Amr Talaat Elect 707 3 DIVD ROB2,R(F6) 2 ADDD R(F4),ROB1 6 ADDD ROB5, R(F6) Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 F0 ADDD F0,F4,F6 N N F4 LD F4,0(R3) N N -- BNE F2, N N F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers 5 0+R3

Dr. Amr Talaat Elect 707 3 DIVD ROB2,R(F6) 2 ADDD R(F4),ROB1 6 ADDD ROB5, R(F6) Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 -- F0 ROB5 ST 0(R3),F4 ADDD F0,F4,F6 N N N N F4 LD F4,0(R3) N N -- BNE F2, N N F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory Dest Reorder Buffer Registers 1 10+R2 5 0+R3

Dr. Amr Talaat Elect 707 3 DIVD ROB2,R(F6) Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 -- F0 M[10] ST 0(R3),F4 ADDD F0,F4,F6 Y Y N N F4 M[10] LD F4,0(R3) Y Y -- BNE F2, N N F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers 2 ADDD R(F4),ROB1 6 ADDD M[10],R(F6)

Dr. Amr Talaat Elect 707 3 DIVD ROB2,R(F6) 2 ADDD R(F4),ROB1 Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 -- F0 M[10] ST 0(R3),F4 ADDD F0,F4,F6 Y Y Y Y F4 M[10] LD F4,0(R3) Y Y -- BNE F2, N N F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers

Dr. Amr Talaat Elect 707 -- F0 M[10] ST 0(R3),F4 ADDD F0,F4,F6 Y Y Y Y F4 M[10] LD F4,0(R3) Y Y -- BNE F2, N N 3 DIVD ROB2,R(F6) 2 ADDD R(F4),ROB1 Tomasulo With Reorder buffer: To Memory FP adders FP multipliers Reservation Stations FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 F2 F10 F0 DIVD F2,F10,F6 ADDD F10,F4,F0 LD F0,10(R2) N N N N N N Done? Dest Oldest Newest from Memory 1 10+R2 Dest Reorder Buffer Registers

Dr. Amr Talaat Elect 707 Getting CPI below 1  CPI ≥ 1 if issue only 1 instruction every clock cycle  Multiple-issue processors come in 3 flavors: 1. statically-scheduled superscalar processors, 2. dynamically-scheduled superscalar processors, and 3. VLIW (very long instruction word) processors  2 types of superscalar processors issue varying number s of instructions per clock  use in-order execution if they are statically scheduled, or  out-of-order execution if they are dynamically scheduled  VLIW processors, in contrast, issue a fixed number of in structions formatted either as one large instruction or as a fixed instruction packet with the parallelism among ins tructions explicitly indicated by the instruction (Intel/HP I tanium)

Dr. Amr Talaat Elect 707  Reading Assignment: Intel Core i7 prediction scheme  Sections 3.3 & 3.6

Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture:

Similar presentations

Presentation on theme: "Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture:"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture:

Similar presentations

Presentation on theme: "Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer Applications Text book slides: Computer Architec ture:"— Presentation transcript:

Similar presentations

About project

Feedback