Data Hazard Solution 2: Data Forwarding
Our naïve pipeline would experience many data stalls
  • Register isn’t written until completion of write-back stage
  • Source operands are read from register file in decode stage
  • Values need to be in register file at start of stage
  • Leads to many more stall cycles than necessary
Key observation
  • The value we want is generated in execute or memory stage
  • It is “available” 1-2 cycles before write-back
Trick: go get it!
  • Pass value directly from stage of generating instruction to decode stage
  • Must be available before end of decode stage to avoid stall

Detecting Stall Condition
    # demo-h2.ys              1  2  3  4  5  6  7  8  9 10 11
    0x000: irmovl $10,%edx    F  D  E  M  W
    0x006: irmovl $3,%eax        F  D  E  M  W
    0x00c: nop                      F  D  E  M  W
    0x00d: nop                         F  D  E  M  W
           bubble                               E  M  W
    0x00e: addl %edx,%eax                 F  D  D  E  M  W
    0x010: halt                              F  F  D  E  M  W
Cycle 6
  • Write-back stage (W): W_dstE = %eax, W_valE = 3
  • Decode stage (D): srcA = %edx, srcB = %eax

Data Forwarding Example
  • irmovl in write-back stage
  • Destination value in W pipeline register
  • “Forward” as valB for decode stage
  • addl instruction can proceed without stalling
  • When do we actually know the values of %eax and %edx?

Data Forwarding Example #2
Register %edx
  • Generated by ALU during previous cycle
  • Forward from memory as valA
Register %eax
  • Value just generated by ALU
  • Forward from execute as valB

Forwarding Hardware (PIPE)
  • Feedback paths from E, M, and W registers to decode stage
  • Logic blocks to select source for valA and valB in decode stage
  • Note: we either do forwarding or stall on data hazards
  • Forwarding has better performance, higher cost

Forwarding Control (PIPE)
    ## Actions of “Sel+Fwd A” block
    ## Pick the correct A value
    ## Order is important!
    int new_E_valA = [
        # Use incremented PC
        D_icode in { ICALL, IJXX } : D_valP;
        # Forward valE from execute
        d_srcA == E_dstE : e_valE;
        # Forward valM from memory
        d_srcA == M_dstM : m_valM;
        # Forward valE from memory
        d_srcA == M_dstE : M_valE;
        # Forward valM from write back
        d_srcA == W_dstM : W_valM;
        # Forward valE from write back
        d_srcA == W_dstE : W_valE;
        # Use value read from register file
        1 : d_rvalA;
    ];
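The priority ordering in the “Sel+Fwd A” block can be mimicked in a few lines of Python. This is only a sketch of the selection rule, not part of the Y86 tools: the dictionary layout, the string register names, and the sample signal values are assumptions made for illustration. It shows why order matters: when several stages hold a pending write to the same register, the youngest one (closest to decode) must win.

```python
RNONE = 0xF  # Y86 register ID meaning "no register"

def sel_fwd_a(s):
    """Return (source name, value) for E_valA; the first matching case wins."""
    cases = [
        (s["D_icode"] in ("ICALL", "IJXX"), "D_valP"),  # use incremented PC
        (s["d_srcA"] == s["E_dstE"], "e_valE"),         # forward valE from execute
        (s["d_srcA"] == s["M_dstM"], "m_valM"),         # forward valM from memory
        (s["d_srcA"] == s["M_dstE"], "M_valE"),         # forward valE from memory
        (s["d_srcA"] == s["W_dstM"], "W_valM"),         # forward valM from write back
        (s["d_srcA"] == s["W_dstE"], "W_valE"),         # forward valE from write back
        (True, "d_rvalA"),                              # value read from register file
    ]
    for matched, name in cases:
        if matched:
            return name, s[name]

# Hypothetical snapshot: %edx has a pending write in both M and W.
signals = {
    "D_icode": "IOPL", "D_valP": 0x010, "d_srcA": "%edx", "d_rvalA": 0,
    "E_dstE": "%eax", "e_valE": 3,
    "M_dstM": RNONE, "m_valM": 0, "M_dstE": "%edx", "M_valE": 10,
    "W_dstM": RNONE, "W_valM": 0, "W_dstE": "%edx", "W_valE": 99,
}
choice = sel_fwd_a(signals)
print(choice)
```

In this snapshot the memory-stage value is chosen over the older write-back value, which is the behavior the "Order is important!" comment asks for.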

Forwarding Example (PIPE)
At clock cycle 4
    d_srcA =
    d_srcB =
    E_dstE =
    e_valE =
    M_dstE =
    M_valE =
  • What are the forwarding conditions?
  • Highlight the data path in D, E, M stages.

Forwarding Example (PIPE)
At clock cycle 4
    d_srcA = %ecx
    d_srcB = %edx
    E_dstE = %ecx
    e_valE = 3
    M_dstE = %edx
    M_valE = 128
  • What are the forwarding conditions?

Limitation of Forwarding
Load-use dependency
  • Value needed by end of decode stage in cycle 7
  • Value read from memory in memory stage of cycle 8
Terminology
  • This is a load hazard
  • Only hardware solution is a load stall
  • Preferred: have the compiler avoid it

Load/Use Hazard: Desired Behavior
Best we can do in hardware
  • Stall reading instruction for one cycle
  • Then forward value from memory stage
Better yet
  • Have compiler avoid in code it generates

Addressing Load/Use Hazard
Detection
  • Previous instr. is loading from memory to src register
  • dstM in E register matches srcA or srcB (and not 0xF)
Action
  • Stall instruction in decode

Interrupts and Exceptions
Basic interrupt mechanism
  • CPU running current process: … instr i, instr i+1, instr i+2, instr i+3, instr i+4, instr i+5, instr i+6 …
  • Event occurs that needs attention (e.g., disk read finishes)
  • HW asserts CPU interrupt line
  • Control transferred to interrupt handler (think HW-induced function call)
  • Handler: instr 1, instr 2, instr 3 …
Questions
  • How is state of interrupted process saved?
  • How is location of handler determined?

Interrupt Handling
Calling handler
  • Save return address (PC) on stack
      - Address of next instruction to be executed for this process
      - Depending on event, either current or next instruction
      - PC usually passed through pipeline along with instruction
      - Precise exception: all instructions up to PC executed, none past (“Clean Break”)
  • Jump to handler address
      - Usually obtained from table stored at fixed memory address
      - Index to table entry determined by exception type/interrupt priority level
      - Interrupt vector table written by software, accessed by hardware
Implementation
  • Critical for real hardware
  • Seldom implemented in simulators: no OS running to pass control to!

Exceptions
  • Events occurring within processor under which pipeline cannot continue normal operation
Possible causes
  • Halt instruction executed (Current)
  • Bad address for instruction or data (Previous)
  • Invalid instruction (Previous)
  • Pipeline control error (Previous)
  • System calls, page faults, math errors (not in Y86)
Desired action
  • Complete instructions to specific point
      - Either current or previous (depends on exception type)
  • Discard instructions that follow
  • Transfer control to exception handler in OS
      - Save return address, get handler address from table

Exception Examples
Detect in fetch stage
    jmp $-1                      # Invalid jump target
    .byte 0xFF                   # Invalid instruction code
    halt                         # Halt instruction
Detect in memory stage
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address (for Y86 tools)

Exceptions in Pipeline Processor (#1)
Desired behavior
  • rmmovl should cause exception (1st in sequential machine)
  • Tricky because invalid instruction code detected first
    # demo-exc1.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address
    nop
    .byte 0xFF                   # Invalid instruction code
Pipeline diagram
    # demo-exc1.ys                    1  2  3  4  5
    0x000: irmovl $100,%eax           F  D  E  M  W
    0x006: rmmovl %eax,0x10000(%eax)     F  D  E  M
    0x00c: nop                              F  D  E
    0x00d: .byte 0xFF                          F  D
  • Exception detected: .byte 0xFF in fetch (cycle 4), before rmmovl reaches memory (cycle 5)

Exceptions in Pipeline Processor (#2)
Desired behavior
  • No exception should occur
  • Must match behavior and results of sequential execution
    # demo-exc2.ys
    0x000: xorl %eax,%eax        # Set condition codes
    0x002: jne t                 # Not taken
    0x007: irmovl $1,%eax
    0x00d: irmovl $2,%edx
    0x013: halt
    0x014: t: .byte 0xFF         # Target
Pipeline diagram
    # demo-exc2.ys          1  2  3  4  5  6  7  8  9
    0x000: xorl %eax,%eax   F  D  E  M  W
    0x002: jne t               F  D  E  M  W
    0x014: t: .byte 0xFF          F  D  E  M  W
    0x???: (I’m lost!)               F  D  E  M  W
    0x007: irmovl $1,%eax               F  D  E  M  W
  • Exception detected: .byte 0xFF in fetch (cycle 3), on the mispredicted path

Correct Exception Handling
Challenges: respond to exceptions in program order, and only those that “really” occur
  • Motivation for exception status field (stat) in pipeline registers
  • Fetch stage sets to either “AOK,” “ADR” (bad fetch address), or “INS” (illegal instruction)
  • Decode & execute stages pass values through
  • Memory stage either passes through or sets to “ADR”
  • CPU responds to exception only when instruction reaches write back
Pipeline register fields
    W: stat icode valE valM dstE dstM
    M: stat icode Cnd valE valA dstE dstM
    E: stat icode ifun valC valA valB dstE dstM srcA srcB
    D: stat icode ifun rA rB valC valP
    F: predPC

Avoiding Side Effects
Desired behavior
  • rmmovl should cause exception
  • No following instruction should change any state
  • Note special challenge of condition codes!
    # demo-exc3.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address
    addl %eax,%eax               # Sets condition codes
Pipeline diagram
    # demo-exc3.ys                    1  2  3  4  5
    0x000: irmovl $100,%eax           F  D  E  M  W
    0x006: rmmovl %eax,0x10000(%eax)     F  D  E  M
    0x00c: addl %eax,%eax                   F  D  E
  • Cycle 5: exception detected (rmmovl in memory) while condition code set (addl in execute)

Avoiding Side Effects
Exception should disable state update for following instructions
  • When exception detected in memory stage
      - Disable condition code setting in execute
      - Must happen in same clock cycle
  • When exception passes to write-back stage
      - Disable memory write in memory stage
      - Disable condition code setting in execute stage
Let’s see how these are handled in PIPE processor

PIPE: Fetch Details
Main points
  • Branch prediction
  • Branch misprediction recovery
  • Return handling
  • Stat initialization
[Figure: PIPE fetch stage. Blocks: Select PC, Instruction memory, Split (Byte 0 / Bytes 1-5), Align, Instr valid, Need regids, Need valC, PC incr., Predict PC, Stat. Signals into Select PC: predPC, M_icode, M_Cnd, M_valA, W_icode, W_valM. Internal signals: f_pc, imem_error. Fields written to D: stat, icode, ifun, rA, rB, valC, valP.]

PIPE: Decode and Write-back
Main points
  • Forwarding logic, paths
  • Forwarding priority
  • valA and valP merged

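PIPE: Execute
Main points
  • CC update inhibited by prior exceptions
  • Values for forwarding
  • Special handling for dstE (?)
[Figure: PIPE execute stage. Blocks: ALU A, ALU B, ALU fun., ALU, Set CC, CC, cond, dstE. Inputs from E: stat, icode, ifun, valC, valA, valB, dstE, dstM, srcA, srcB. Inputs to Set CC: m_stat, W_stat. Outputs: e_valE, e_Cnd, e_dstE. Fields written to M: stat, icode, Cnd, valE, valA, dstE, dstM.]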

PIPE: Memory and Write-back
Main points
  • Values for forwarding
  • Stat update logic
  • Feedback for branch misprediction recovery
[Figure: PIPE memory and write-back stages. Blocks: Addr, Mem. read, Mem. write, Data memory (data in, data out), Stat. Inputs from M: stat, icode, Cnd, valE, valA, dstE, dstM. Internal signals: dmem_error, m_stat, m_valM. Fields written to W: stat, icode, valE, valM, dstE, dstM. Fed back to earlier stages: M_icode, M_Cnd, M_valA, M_valE, M_dstE, M_dstM, m_valM, W_icode, W_valE, W_valM, W_dstE, W_dstM.]

Pipeline Control: Register Modes
Register holds state x, input is y; behavior at rising clock:
    Mode     stall  bubble  Output after rising clock
    Normal     0      0     y
    Stall      1      0     x
    Bubble     0      1     nop
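The three register modes can be modeled with a tiny Python class. This is a behavioral sketch only; the class name `PipeReg` and the use of the string "nop" as the bubble value are assumptions for illustration, not names from the Y86 simulator. Asserting stall and bubble together is treated as an error, in line with the later slide that calls this combination a pipeline error.

```python
NOP = "nop"  # placeholder for the bubble (nop) state

class PipeReg:
    """Pipeline register with normal, stall, and bubble modes."""

    def __init__(self, state=NOP):
        self.state = state

    def clock(self, inp, stall=False, bubble=False):
        """Apply one rising clock edge and return the new output."""
        if stall and bubble:
            raise ValueError("pipeline error: stall and bubble both asserted")
        if stall:
            pass               # keep current state x
        elif bubble:
            self.state = NOP   # replace state with a nop
        else:
            self.state = inp   # normal: load input y
        return self.state

reg = PipeReg("x")
after_stall = reg.clock("y", stall=True)     # output stays x
after_normal = reg.clock("y")                # output becomes y
after_bubble = reg.clock("z", bubble=True)   # output becomes nop
```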

PIPE Control Logic
Handles special cases
  • Handles ret, load/use hazards, misprediction recovery, exceptions
  • Existing PIPE logic handles forwarding, branch prediction
[Figure: pipe control logic block alongside pipeline registers F, D, E, M, W. Inputs: D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Cnd, m_stat, W_stat. Outputs: F_stall, D_stall, D_bubble, E_bubble, M_bubble, W_stall, set_cc.]

PIPE: Actual ret handling
Simplified view
    # prog7                                   1  2  3  4  5  6  7  8  9 10 11
    0x000: irmovl Stack,%edx                  F  D  E  M  W
    0x006: call proc                             F  D  E  M  W
    0x020: ret                                      F  D  E  M  W
           bubble                                      F  D  E  M  W
           bubble                                         F  D  E  M  W
           bubble                                            F  D  E  M  W
    0x00b: irmovl $10,%edx  # Return point                      F  D  E  M  W
What hardware actually does
    # prog7                                   1  2  3  4  5  6  7  8  9 10 11
    0x000: irmovl Stack,%edx                  F  D  E  M  W
    0x006: call proc                             F  D  E  M  W
    0x020: ret                                      F  D  E  M  W
    0x021: rrmovl %edx,%ebx  # Not executed            F
           bubble                                         D  E  M  W
    0x021: rrmovl %edx,%ebx  # Not executed               F
           bubble                                            D  E  M  W
    0x021: rrmovl %edx,%ebx  # Not executed                  F
           bubble                                               D  E  M  W
    0x00b: irmovl $10,%edx  # Return point                      F  D  E  M  W

PIPE: Actual exception handling
Scenario: pushl uses bad memory address
  • Actions: disable CC, inject bubbles into memory stage, stall write-back
    # prog10                            1  2  3  4  5  6  7  8  9 10
    0x000: irmovl $1,%eax               F  D  E  M  W
    0x006: xorl %esp,%esp  # CC = 100      F  D  E  M  W
    0x008: pushl %eax                         F  D  E  M  W  W  W  W
    0x00a: addl %eax,%eax                        F  D  E
    0x00c: irmovl $2,%eax                           F  D  E
Cycle 6
  • M stage (pushl): mem_error = 1
  • E stage (addl): New CC = 000, set_cc ← 0

Special Control Cases: Exceptions
Detection
    Condition   Trigger
    Exception   m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT }
Action (on next cycle)
    Condition   F       D       E       M       W
    Exception   normal  normal  normal  bubble  stall
  • Also: disable setting of condition codes in execute in current cycle

Special Control Cases: Non-exceptions
Detection
    Condition            Trigger
    Processing ret       IRET in { D_icode, E_icode, M_icode }
    Load/Use Hazard      E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
    Mispredicted Branch  E_icode == IJXX && !e_Cnd
Action (on next cycle)
    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Load/Use Hazard      stall   stall   bubble  normal  normal
    Mispredicted Branch  normal  bubble  bubble  normal  normal

Pipeline Control, rev. 1.0
    bool F_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool D_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Cnd) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool E_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Cnd) ||
        # Load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };
How do we know this works?

Analysis: Control Combinations
  • Special cases that can arise on same clock cycle
Combination A
  • Not-taken branch
  • ret instruction at branch target
Combination B
  • Instruction that reads from memory to %esp
  • Followed by ret instruction

Control Combination A
  • Should be handled as mispredicted branch
  • Combination will also stall F pipeline register
  • But PC selection logic will be using M_valA anyway
  • Correct action taken!
Combination A: JXX in E (mispredict) with ret in D
    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Mispredicted branch  normal  bubble  bubble  normal  normal
    Combination          stall   bubble  bubble  normal  normal

Control Combination B
  • Would assert both bubble and stall for pipeline register D
  • Should be signaled by processor as pipeline error
  • Combination not handled correctly in control code 1.0
  • But it passed many simulation tests; caught only with systematic analysis
Combination B: load in E with ret (the use) in D
    Condition            F       D               E       M       W
    Processing ret       stall   bubble          normal  normal  normal
    Load/use hazard      stall   stall           bubble  normal  normal
    Combination          stall   bubble + stall  bubble  normal  normal

Control Combination B: Correct Handling
  • Load/use hazard should get priority
  • ret instruction should be held in decode stage for additional cycle
Combination B: load in E with ret (the use) in D
    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Load/use hazard      stall   stall   bubble  normal  normal
    Combination          stall   stall   bubble  normal  normal

Corrected Pipeline Control Logic
  • Load/use hazard should get priority
  • ret instruction should be held in decode stage for additional cycle
    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Load/use hazard      stall   stall   bubble  normal  normal
    Combination          stall   stall   bubble  normal  normal
New D_bubble logic
    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Cnd) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode }
        # but not condition for a load/use hazard
        && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });
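The difference between rev. 1.0 and the corrected logic can be checked with a small Python model of the control equations. It is a sketch under assumptions: signals are passed in a plain dictionary, instruction codes are strings, and the Combination B snapshot (a load into %esp in execute, ret in decode) uses made-up field values.

```python
LOADS = {"IMRMOVL", "IPOPL"}

def control(s, corrected=True):
    """Compute stall/bubble signals from pipeline state s (a dict)."""
    load_use = s["E_icode"] in LOADS and s["E_dstM"] in (s["d_srcA"], s["d_srcB"])
    ret = "IRET" in (s["D_icode"], s["E_icode"], s["M_icode"])
    mispredict = s["E_icode"] == "IJXX" and not s["e_Cnd"]
    if corrected:
        # Load/use hazard gets priority: no D bubble for ret while it holds.
        d_bubble = mispredict or (ret and not load_use)
    else:
        d_bubble = mispredict or ret  # rev. 1.0
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        "D_bubble": d_bubble,
        "E_bubble": mispredict or load_use,
    }

# Combination B: mrmovl loading %esp in execute, ret (reads %esp) in decode.
combo_b = {
    "D_icode": "IRET", "E_icode": "IMRMOVL", "M_icode": "INOP",
    "E_dstM": "%esp", "d_srcA": "%esp", "d_srcB": "%esp", "e_Cnd": False,
}
rev1 = control(combo_b, corrected=False)
fixed = control(combo_b, corrected=True)
print(rev1["D_stall"], rev1["D_bubble"])    # both asserted: pipeline error
print(fixed["D_stall"], fixed["D_bubble"])  # stall only
```

A systematic check of this kind, enumerating pairs of conditions that can hold in the same cycle, is the sort of analysis the slides credit with catching the bug.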

Lesson Learned
  • Extensive and thorough testing is good, but it can’t prove a design correct
  • Formal verification important, but field not mature enough for large-scale designs
      - Important and active research area

Performance Metrics
Clock rate
  • Measured in Megahertz or Gigahertz
  • Function of stage partitioning and circuit design
  • To increase: keep amount of work per stage small
Rate at which instructions executed
  • CPI: cycles per instruction
  • On average, how many clock cycles does each instruction require (after completion of previous instruction)?
  • CPI a function of pipeline design and the program
      - How frequently are branches mispredicted?
      - How frequent are load stalls?
      - How frequent are ret instructions?

CPI for PIPE
Ideal CPI = 1.0
  • Fetch instruction each clock cycle
  • Process new instruction every cycle
  • Although each individual instruction has latency of 5 cycles
Actual CPI > 1.0
  • Due to pipeline stalls, branch mispredictions
Computing CPI
  • C clock cycles
  • I instructions executed to completion
  • B bubbles injected (C = I + B)
  • CPI = C/I = (I+B)/I = 1.0 + B/I
  • B/I represents average penalty (per instruction) due to bubbles

CPI for PIPE (Cont.)
B/I = LP + MP + RP (typical values shown)
  • LP: Penalty due to load/use hazard stalling
      - Fraction of instructions that are loads: 0.25
      - Fraction of load instructions requiring stall: 0.20
      - Number of bubbles injected each time: 1
      - LP = 0.25 * 0.20 * 1 = 0.05
  • MP: Penalty due to mispredicted branches
      - Fraction of instructions that are cond. jumps: 0.20
      - Fraction of cond. jumps mispredicted: 0.40
      - Number of bubbles injected each time: 2
      - MP = 0.20 * 0.40 * 2 = 0.16
  • RP: Penalty due to ret instructions
      - Fraction of instructions that are returns: 0.02
      - Number of bubbles injected each time: 3
      - RP = 0.02 * 3 = 0.06
  • Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
  • CPI = 1.27 (Not bad! Assumes perfect memories.)
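The penalty arithmetic is easy to replay, which also makes it simple to try other instruction mixes. The function below is a minimal sketch; its name and parameter names are invented for illustration, and the default bubble counts are the ones quoted on the slide.

```python
def pipe_cpi(load_frac, load_stall_frac, jump_frac, mispredict_frac, ret_frac,
             load_bubbles=1, mispredict_bubbles=2, ret_bubbles=3):
    """CPI = 1.0 + LP + MP + RP, each penalty in bubbles per instruction."""
    lp = load_frac * load_stall_frac * load_bubbles
    mp = jump_frac * mispredict_frac * mispredict_bubbles
    rp = ret_frac * ret_bubbles
    return 1.0 + lp + mp + rp

# Typical values from the slide.
typical = pipe_cpi(0.25, 0.20, 0.20, 0.40, 0.02)
# Hypothetical what-if: a better predictor that halves the misprediction rate.
better_prediction = pipe_cpi(0.25, 0.20, 0.20, 0.20, 0.02)
print(round(typical, 2), round(better_prediction, 2))
```

With these numbers the misprediction term dominates, so improving branch prediction moves CPI more than removing either of the other two penalties.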

State-of-the-Art Pipelining
What have we ignored in our Y86 implementation?
  • Balancing delay in each stage
      - Which stage is longest, how might we speed it up?
  • Multicycle instructions
  • Realistic memory systems

Fetch Logic Revisited
During fetch cycle
  1. Select PC
  2. Read bytes from instruction memory
  3. Examine icode to determine instruction length
  4. Increment PC
Timing
  • Steps 2 & 4 require significant amount of time
[Figure: PIPE fetch stage, as in “PIPE: Fetch Details”.]

Standard Fetch Timing
  • Must perform everything in sequence: Select PC, then Mem. Read, then need_regids/need_valC, then Increment, all within 1 clock cycle
  • Can’t compute incremented PC until we know value to increment it with
  • Why is increment slow?
  • How could we speed this up?

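A Fast PC Increment Circuit
[Figure: PC is split into high-order 29 bits and low-order 3 bits. Slow path: a 29-bit incrementer adds 1 to the high-order bits. Fast path: a 3-bit adder combines the low-order bits with need_valC, 0, need_regids and a carry-in of 1. The adder’s carry-out drives a MUX that selects the original or incremented high-order 29 bits; MUX output plus the 3-bit sum form incrPC.]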

Modified Fetch Timing
29-bit incrementer
  • Acts as soon as PC selected
  • Output not needed until final MUX
  • Works in parallel with memory read
[Figure: timing within 1 clock cycle. Select PC, then Mem. Read and the 29-bit incrementer in parallel, then need_regids/need_valC, 3-bit add, MUX; shorter than the standard cycle.]
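The split-adder idea can be checked against an ordinary 32-bit addition in Python. This sketch follows one reading of the circuit figure, which is an assumption: the 3-bit adder receives the bit pattern (need_valC, 0, need_regids) plus a carry-in of 1, so it adds the instruction length 1, 2, 5, or 6. The function names are illustrative.

```python
def fast_pc_increment(pc, need_regids, need_valC):
    """Increment a 32-bit PC by the instruction length using the split datapath."""
    high = pc >> 3                 # high-order 29 bits
    low = pc & 0x7                 # low-order 3 bits
    # Slow path: starts as soon as the PC is selected, in parallel with memory read.
    high_plus_1 = (high + 1) & 0x1FFFFFFF
    # Fast path: 3-bit adder with addend (need_valC, 0, need_regids), carry-in 1.
    addend = (int(need_valC) << 2) | int(need_regids)
    total = low + addend + 1
    carry, new_low = total >> 3, total & 0x7
    # MUX: the carry-out chooses between the two candidate high-order halves.
    new_high = high_plus_1 if carry else high
    return (new_high << 3) | new_low

def reference_increment(pc, need_regids, need_valC):
    """Plain 32-bit addition, used only for comparison."""
    return (pc + 1 + int(need_regids) + 4 * int(need_valC)) & 0xFFFFFFFF

# addl at 0x00e needs a register byte but no constant: next PC is 0x010.
next_pc = fast_pc_increment(0x00E, need_regids=True, need_valC=False)
print(hex(next_pc))
```

Because the addend plus carry-in never exceeds 6, the 3-bit sum produces at most one carry, which is why a simple incrementer (rather than a full 29-bit adder) suffices for the high-order bits.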

State-of-the-Art Pipelining
Other issues to consider
  • More complex instructions: consider FP divide, sqrt
      - Take many cycles to execute
      - Forwarding can’t resolve hazards: more data stalls
      - Important for compiler to schedule code
  • Deeper pipelines to allow faster cycle times
      - Increased penalty from misprediction, data hazard stalls, etc.
      - Increased emphasis on branch prediction
  • Actual memory hierarchy issues (will increase CPI)
      - Difficult to complete memory access in one cycle!
      - Possibility of cache misses, TLB misses, page faults
  • Superscalar/VLIW: process multiple instructions/cycle
  • Dynamic scheduling (discussed in Chapter 5)
      - Scheduling = determining instruction execution order
      - Hardware decides based on data dependencies, resources

Pentium 4 Pipeline
Very deep pipeline
  • Enables very high clock rates, but 20+ cycle branch penalty
  • Slower than Pentium III for a given clock rate
[Figure: 20-stage pipeline. Stage names in order: TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, RF, Ex, Flgs, Br Ck, Drive.]

Multicycle FP Operations
Multiple functional units: one approach
  • Special-purpose hardware for FP operations
  • Increased latency causes more frequent stalls
  • Can cause “structural” hazards
[Figure: F and D stages feed parallel execution units (single-cycle integer unit, fully pipelined multiplier, non-pipelined divider, fully pipelined FP adder), followed by M and W stages.]

Dynamic scheduling
Out-of-order execution engine: one view (Pentium 4)
  • Fetching, decoding, translation of x86 instrs to uops
  • To support precise exceptions and recovery from mispredicted branches
[Figure: Pentium 4 block diagram. Image from www.xbitlabs.com]

Branch Prediction: Simplistic
Branch history
  • Encode information about prior history of each individual branch instruction; store in hash table indexed by instr. address
  • Use history to predict branch outcome
State machine stores history
  • States, left to right: Yes!, Yes?, No?, No!
  • Each time branch taken (T), move to left
  • Each time branch not taken (NT), move to right
  • In state Yes*, predict taken; in state No*, predict not taken
  • Can be encoded using 2 bits per table entry
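The four-state machine can be written down directly. The sketch below is illustrative only: the class name, the 10-bit table index taken from the low address bits, and the initial state No? for branches not yet seen are all assumptions, since the slide does not specify them.

```python
STATES = ["Yes!", "Yes?", "No?", "No!"]  # left to right, as on the slide

class TwoBitPredictor:
    """Per-branch 2-bit history, indexed by low bits of the branch address."""

    def __init__(self, index_bits=10, initial=2):
        self.mask = (1 << index_bits) - 1
        self.initial = initial   # assumed starting state: "No?"
        self.table = {}          # index -> position in STATES

    def _pos(self, addr):
        return self.table.get(addr & self.mask, self.initial)

    def state(self, addr):
        return STATES[self._pos(addr)]

    def predict(self, addr):
        """True means predict taken (state Yes! or Yes?)."""
        return self._pos(addr) <= 1

    def update(self, addr, taken):
        """Move left on taken, right on not taken, saturating at the ends."""
        pos = self._pos(addr)
        pos = max(pos - 1, 0) if taken else min(pos + 1, 3)
        self.table[addr & self.mask] = pos

# A loop branch: taken three times, then falls through once, then taken again.
p = TwoBitPredictor()
history = []
for outcome in [True, True, True, False, True]:
    history.append(p.predict(0x040))
    p.update(0x040, outcome)
final_state = p.state(0x040)
```

The single not-taken outcome at loop exit only moves the state from Yes! to Yes?, so the branch is still predicted taken the next time the loop is entered; that hysteresis is the reason for using two bits rather than one.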

Branch Prediction: Realistic
Alpha 21264 “tournament” predictor
  • Predictor selection: 4K entries, each 2 bits, to select predictor (8K bits)
  • Global predictor
      - 12-bit shift register of last branch outcomes globally
      - 4K entries, standard 2-bit branch predictors (8K bits)
  • Local predictor
      - Indexed by 10 bits of branch address
      - 1K entries, each 10 bits, history of behavior of this branch (10K bits)
      - 1K entries, each a 3-bit saturating counter (3K bits)
  • Predictor size: 8K + 8K + 10K + 3K = 29K bits!

Discussion Questions
Data hazards on register values can be dealt with by stalling or by forwarding.
  • Can hazards occur on condition codes?
  • Can hazards occur on data memory accesses?
Can software be responsible for pipeline correctness?
  • Schedule instructions, use nops
  • DSP chips historically have had exposed pipelines
Relationship of hazards and dependencies
  • Does every data dependence cause a data hazard?
  • Is every data hazard caused by a data dependence?
