
1 Superscalar Processors
Superscalar Execution
–How it can help
–Issues: Maintaining Sequential Semantics
Scheduling
–Scoreboard
–Superscalar vs. Pipelining
Example: Alpha 21164 and 21064

2 Sequential Execution Semantics
Contract: the machine should appear to behave like this.

3 Sequential Execution Semantics
We will be studying techniques that exploit the semantics of Sequential Execution.
Sequential Execution Semantics:
–Instructions appear as if they executed in the program-specified order
–and one after the other
Alternatively:
–At any given point in time we should be able to identify an instruction such that:
1. All preceding instructions have executed
2. None following has executed

4 Pipelined Execution
Pipelining: partial overlap of instructions
–Initiate one instruction per cycle
–Subsequent instructions overlap partially
–Commit one instruction per cycle
(Diagram: program order vs. pipelined execution)

5 Superscalar - In-order
Two or more consecutive instructions in the original program order can execute in parallel
–This is the dynamic execution order
N-way Superscalar:
–Can issue up to N instructions per cycle
–2-way, 3-way, …
(Diagram: program order vs. pipelining vs. superscalar)

6 Superscalar vs. Pipelining
sum += a[i--]:
loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
(Diagram: fetch/decode/execute timelines — pipelining starts consecutive instructions one cycle apart; the superscalar fetches and decodes two per cycle)
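To make the comparison concrete, here is a minimal Python sketch (not the slides' hardware, just an idealized model) that counts issue cycles for an in-order N-way machine, assuming every instruction has one-cycle latency and only RAW dependences and issue-slot limits cause stalls:

```python
# Idealized sketch: issue-cycle count for an in-order N-way superscalar.
# Assumes 1-cycle result latency and dependence-only stalls (no cache misses).
def issue_cycles(instrs, width):
    """instrs: list of (targets, sources) register-name tuples, in program order."""
    ready = {}               # register -> cycle its value becomes available
    cycle, slots = 0, width  # current issue cycle and slots left in it
    for tgts, srcs in instrs:
        earliest = max([ready.get(r, 0) for r in srcs], default=0)
        if earliest > cycle or slots == 0:   # RAW stall, or this cycle is full
            cycle = max(cycle + 1, earliest)
            slots = width
        slots -= 1
        for r in tgts:
            ready[r] = cycle + 1             # result usable next cycle
    return cycle + 1                         # cycles to issue the whole group

# The loop body from the slide:
loop = [(("r2",), ("r1",)),        # ld  r2, 10(r1)
        (("r3",), ("r3", "r2")),   # add r3, r3, r2
        (("r1",), ("r1",)),        # sub r1, r1, 1
        ((), ("r1", "r0"))]        # bne r1, r0, loop
```

Under these assumptions one iteration takes 4 issue cycles on the scalar pipeline but only 3 on a 2-way machine: the `sub` pairs with the `add`, while the RAW chains (`ld`→`add`, `sub`→`bne`) keep the speedup below 2×.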

7 A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto
Superscalar Performance
Performance spectrum?
–What if all instructions were dependent? Speedup = 1; superscalar buys us nothing
–What if all instructions were independent? Speedup = N, where N = superscalarity
Again the key is typical program behavior: some parallelism exists

8 “Real-Life” Performance
OLTP = Online Transaction Processing
Source: P. Ranganathan, K. Gharachorloo, S. Adve, and L. A. Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors,” ASPLOS ’98

9 “Real-Life” Performance
SPEC CPU 2000, SimpleScalar simulation: 32K I$ and D$, 8K bpred

10 Superscalar Issue
An instruction at decode can execute if:
–No RAW dependence: input operands are available
–No WAR or WAW dependence
Must check against instructions:
–Simultaneously decoded
–In progress in the pipeline (i.e., previously issued)
–Recall the register vector from pipelining
Increasingly complex with the degree of superscalarity: 2-way, 3-way, …, n-way

11 Issue Rules
Stall at decode if:
–RAW dependence and no data available: source registers against previous targets
–WAR or WAW dependence: target register against previous targets + sources
–No resource available
This check is done in program order
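The issue rules above can be sketched directly. This is a hedged illustration, not the 21164's actual logic: a candidate instruction's register sets are compared, in program order, against earlier in-flight instructions:

```python
# Sketch of the decode-stage issue check: compare the candidate's registers
# against earlier in-flight instructions (checked in program order).
def must_stall(candidate, in_flight):
    """candidate / in_flight entries: dicts with 'srcs' and 'tgts' register sets."""
    for prev in in_flight:
        if candidate["srcs"] & prev["tgts"]:   # RAW: needs a value not yet written
            return "RAW"
        if candidate["tgts"] & prev["tgts"]:   # WAW: would overwrite a pending write
            return "WAW"
        if candidate["tgts"] & prev["srcs"]:   # WAR: would clobber a pending read
            return "WAR"
    return None                                # no dependence: free to issue
```

A resource-availability check (not shown) would sit alongside these comparisons.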

12 Issue Mechanism – A Group of Instructions at Decode
Assume 2 source & 1 target registers max per instruction
–Comparators for 2-way: 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
–Comparators for 4-way:
2nd instr: 3 tgt and 2 src
3rd instr: 6 tgt and 4 src
4th instr: 9 tgt and 6 src
Simplifications may be possible; resource checking not shown
(Diagram: per-instruction tgt/src comparator fields, in program order)
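The comparator arithmetic above follows a simple pattern: instruction i in the group (counting from 1) compares its target against each earlier target (WAW) and each earlier pair of sources (WAR), and its two sources against each earlier target (RAW). A small sketch of that count:

```python
# Comparator count for an n-way decode group, assuming 2 sources and 1 target
# per instruction (as on the slide).
def comparators(n_way):
    # Instruction i has (i-1) earlier instructions, each contributing
    # 1 target + 2 sources to compare the target against, and 1 target
    # to compare each of the 2 sources against.
    tgt = sum(3 * (i - 1) for i in range(1, n_way + 1))  # WAW + WAR checks
    src = sum(2 * (i - 1) for i in range(1, n_way + 1))  # RAW checks
    return tgt, src

# comparators(2) -> (3, 2); comparators(4) -> (18, 12)
```

The quadratic growth in n is why the slide calls issue "increasingly complex with the degree of superscalarity".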

13 Issue – Checking for Dependences with In-Flight Instructions
Naïve implementation:
–Compare registers with all outstanding registers: RAW, WAR and WAW
–How many comparators do we need? Stages x Superscalarity x Regs per Instruction
–Priority enforcers?
–But we need some of this for bypassing (RAW)

14 Issue – Checking for Dependences with In-Flight Instructions
Scoreboard:
–Pending Write per register, one bit: set at decode, reset at writeback
–Pending Read? Not needed if all reads are done in order (WAR and WAW not possible)
–Can handle structural hazards: busy indicators per resource
–Can handle bypass: record where a register value is produced, e.g., R0 busy, in ALU0, at time +3
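A minimal sketch of the scoreboard idea, under the slide's assumptions (one pending-write bit per register, in-order reads). The check on the candidate's own target is my addition so a pending write to the same register drains before reuse; the bypass/busy-indicator machinery is omitted:

```python
# Minimal scoreboard sketch: one pending-write bit per register,
# set at decode, cleared at writeback.
class Scoreboard:
    def __init__(self, n_regs=32):
        self.pending_write = [False] * n_regs

    def can_issue(self, srcs, tgts):
        # RAW: stall if any source still has an outstanding write.
        # With in-order reads WAR/WAW need no extra bits (per the slide);
        # we also wait for a pending write to our own target to drain.
        return not any(self.pending_write[r] for r in srcs + tgts)

    def issue(self, srcs, tgts):
        for r in tgts:
            self.pending_write[r] = True     # set at decode

    def writeback(self, tgts):
        for r in tgts:
            self.pending_write[r] = False    # reset at writeback
```

Usage mirrors the pipeline: decode calls `can_issue` then `issue`; the writeback stage calls `writeback`, unblocking dependents.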

15 Implications
Need to multiport some structures:
–Register File: multiple reads and writes per cycle
–Register Availability Vector (scoreboard): multiple reads and writes per cycle, from Decode and Commit; also need to worry about WAR and WAW
Resource tracking: additional issue conditions
Many superscalars had additional restrictions:
–E.g., execute one integer and one floating-point op
–One branch, or one store/load

16 Preserving Sequential Semantics
In principle not much different than pipelining
Program order is preserved in the pipeline
Some instructions proceed in parallel, but order is clearly defined
Defer interrupts to the commit stage (i.e., writeback):
–Flush all subsequent instructions; this may include instructions committing simultaneously
–Allow all preceding instructions to commit
Recall comparisons are done in program order
Must have sufficient time in the clock cycle to handle these

17 Preserving Sequential Semantics
sum += a[i--]:
loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
(Diagram: pipelined and superscalar fetch/decode/execute timelines for the loop)

18 Interrupts Example
(Diagram: ld/add/div/bne flowing through the pipeline — the div raises an exception, the exception is taken at commit, and the following bne is flushed and refetched)

19 Superscalar and Pipelining
In principle they are orthogonal:
–Superscalar non-pipelined machine
–Pipelined non-superscalar machine
–Superscalar and pipelined (common)
Additional functionality needed by superscalar:
–Another bound on the clock cycle
–At some point it limits the number of pipeline stages

20 Superscalar vs. Superpipelining
Superpipelining: vaguely defined as deep pipelining, i.e., lots of stages
Superscalar issue complexity limits superpipelining
How do they compare? 2-way superscalar vs. twice the stages: not much difference
(Diagram: 2-way superscalar fetch/decode/execute timeline vs. split-stage F1 F2 / D1 D2 / E1 E2 timeline)

21 Superscalar vs. Superpipelining
WANT 2X PERFORMANCE:
(Diagram: both designs double throughput — the 2-way superscalar issues two instructions per cycle, the superpipeline halves each stage)

22 Superscalar vs. Superpipelining
WANT 2X PERFORMANCE:
(Diagram: the same comparison in the presence of a RAW dependence — both designs stall similarly)

23 Pipeline Performance
g = fraction of time the pipeline is filled
1-g = fraction of time the pipeline is not filled (stalled)
As 1-g grows, performance suffers

24 Superscalar vs. Superpipelining: Another View
Source: Lipasti, Shen, Wood, Hill, Sohi, Smith (CMU/Wisconsin)
Amdahl’s Law:
f = fraction that is vectorizable (parallelism)
v = speedup for f
Overall speedup = 1 / ((1 - f) + f / v)
(Diagram: work performed over time vs. number of processors N — fraction f runs in parallel, 1-f serially)

25 Amdahl’s Law: Sequential Part Limits Performance
Parallelism can’t help if there isn’t any
Even if v is infinite, performance is limited by the nonvectorizable portion: speedup ≤ 1 / (1 - f)
(Diagram: as above)
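Amdahl's law as used on these slides is easy to state in a couple of lines; the sketch below also shows the ceiling the slide warns about:

```python
# Amdahl's law: fraction f is sped up by factor v, the rest is unchanged.
def amdahl_speedup(f, v):
    return 1.0 / ((1.0 - f) + f / v)

# Even with an effectively infinite v, speedup is capped at 1 / (1 - f):
# for f = 0.9 the overall speedup can never reach 10x.
```

For example, `amdahl_speedup(0.5, 2.0)` is only 4/3 — speeding up half the work by 2× gains far less than 2× overall.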

26 Amdahl’s Law

27 Case Study: Alpha 21164

28 21164: Int. Pipe

29 21164: Memory Pipeline

30 21164: Floating-Point Pipe

31 Performance Comparison
(Chart: 4-way vs. 2-way; source not shown)

32 CPI Comparison

33 Compiler Impact
(Chart: optimized vs. base performance)

34 Issue Cycle Distribution - 21164

35 Issue Cycle Distribution - 21064

36 Stall Cycles - 21164
(Chart: no-instruction stalls and data-dependence/data stalls)

37 Stall Cycles Distribution
Model: count a stall cycle when no instruction is committing
Does not capture overlapping factors:
–Stall due to a dependence while committing
–Stall due to a cache miss while committing

38 Replay Traps
Tried to do something and couldn’t:
–Store and the write buffer is full: can’t complete the instruction
–Load and the miss-address file is full: can’t complete the instruction
–Assumed a cache hit that was a miss: dependent instructions already executed and must re-execute
Re-execute the instruction and everything that follows

39 Replay Traps Explained
(Diagram: ld r1 followed by a dependent add, each flowing through F/D/E/M/W — on a cache hit both proceed; on a miss the add, scheduled for the hit case, must be replayed)

40 Optimistic Scheduling
(Diagram: ld r1 and a dependent add — scheduling decisions start before hit/miss is known, yet the scheduler must decide when the add executes; on a hit the add should start as soon as the loaded data can bypass)

41 Optimistic Scheduling #2
(Diagram: as before, but the scheduler guesses hit/miss — the add is scheduled assuming a hit, before the hit/miss outcome is actually known)
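The hit-speculation idea on these two slides can be sketched as follows. The latencies are illustrative assumptions, not the 21164's actual numbers: the scheduler must pick the dependent add before the load's hit/miss is known, so it guesses "hit"; on a miss the add is squashed and replayed when the data finally arrives:

```python
# Hedged sketch of optimistic (hit-speculative) load-use scheduling.
# hit_latency / miss_latency are illustrative, not the 21164's real values.
def schedule_load_use(load_hits, hit_latency=2, miss_latency=10):
    trace = [("ld", 0)]
    add_cycle = hit_latency                 # schedule the add assuming a hit
    if not load_hits:
        # The guess was wrong: squash the speculatively scheduled add
        # (and, on real hardware, everything that followed it), then replay.
        trace.append(("add-squashed", add_cycle))
        add_cycle = miss_latency
    trace.append(("add", add_cycle))
    return trace
```

The payoff is the common case: when loads usually hit, the add starts cycles earlier than if the scheduler waited for the hit/miss outcome; the cost is replay work on every miss.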

42 Stall Distribution

43 21164 Microarchitecture
–Instruction Fetch/Decode + Branch Units
–Integer Execution Unit
–Floating-Point Execution Unit
–Memory Address Translation Unit
–Cache Control and Bus Interface
–Data Cache
–Instruction Cache
–Second-Level Cache

44 Instruction Decode/Issue
Up to four instructions/cycle
Naturally aligned groups:
–Must start at a 16-byte boundary (INT16)
–Simplifies the fetch path (more in a moment)
All of a group must issue before the next group gets in
–Simplifies scheduling: no need for reshuffling

45 Instruction Decode/Issue
Up to four instructions/cycle
Naturally aligned groups:
–Must start at a 16-byte boundary (INT16)
–Simplifies the fetch path
The CPU needs an I-Cache: where instructions come from

46 Fetching Four Instructions
The CPU fetches groups from the I-Cache
Software must guarantee alignment at 16-byte boundaries
–Costs lots of NOPs

47 Instruction Buffer and Prefetch
I-buffer feeds issue
4-entry, 8-instruction prefetch buffer (PB)
Check the I-Cache and PB in parallel:
–PB hit: fill the cache, feed the pipeline
–PB miss: prefetch four lines

48 Branch Execution
One-cycle delay to calculate the target PC
–Naïve implementation: can fetch only every other cycle
Branch prediction avoids the delay
Pending branches: up to four at stage 2 (assignment to functional units), one at stage 3 (instruction scheduling/issue), one at stage 4 (instruction execution)
When full, stall; resolve and execute from the right PC

49 Return Address Stack
Returns: the target address changes, so conventional branch prediction can’t handle them
But the change is predictable: the return address is the call site’s return point
Detect calls:
–Push the return address onto a hardware stack
–A return pops the address (speculatively)
–12-entry “stack”
Implemented as a circular queue, so overflow/underflow messes it up
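A sketch of that circular return-address stack, with illustrative structure rather than the 21164's exact implementation. Because it is a circular queue, overflow silently overwrites the oldest entry, so call chains deeper than the stack mispredict on the way back out — exactly the "overflow messes it up" point above:

```python
# Circular return-address stack sketch. Calls push the return PC,
# returns pop the prediction; overflow overwrites the oldest entry.
class ReturnAddressStack:
    def __init__(self, size=12):
        self.entries = [None] * size
        self.top = 0            # index of the next free slot (mod size)

    def push(self, return_pc):  # on a call
        self.entries[self.top % len(self.entries)] = return_pc
        self.top += 1

    def pop(self):              # on a return: the predicted target
        self.top -= 1
        return self.entries[self.top % len(self.entries)]
```

Pushing three addresses onto a 2-entry stack and popping three times returns the newest two correctly, then a stale value instead of the oldest — the misprediction the slide warns about.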

50 Instruction Translation Buffer
Translates virtual addresses to physical
–48-entry, fully associative
–Pages 8KB to 4MB
–Not-last-used/not-MRU replacement
–128 address-space identifiers

51 Integer Execution Unit
Two of: adder, logic
One of: barrel shifter, byte manipulation, multiply
Asymmetric unit configurations are common:
–Tradeoff between flexibility/performance and area/cost/complexity
–How to decide? Common application behavior

52 Integer Register File
32+8 registers (8 are legacy DEC)
Four read ports, two write ports
–Support for up to two integer ops per cycle

53 Floating-Point Unit
FP add and FP multiply pipes: 2 ops per cycle
Divides take multiple cycles
32 registers; five reads, four writes:
–Four reads and two writes for the FP pipes
–One read for stores (handled by the integer pipe)
–One write for loads (handled by the integer pipe)

54 Memory Unit
Up to two accesses per cycle
Data translation buffer:
–512 entries, not-MRU replacement
–Loads access it in parallel with the D-cache
Miss Address File (pending misses):
–Six data loads, four instruction reads
–Merges loads to the same block

55 Store/Load Conflicts
A load immediately after a store (to the same address):
–Can’t see the data
–Detect and replay: flush the pipe and re-execute
The compiler can help:
–Schedule the load three cycles after the store
–Two cycles apart still stalls the load at issue/address generation
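One reading of that scheduling guideline, sketched as a decision function (my interpretation of the slide, not documented 21164 behavior): a same-address load three or more cycles after the store proceeds; exactly two cycles after, it stalls at issue/address generation; closer than that, the conflict is detected too late and the load is replayed:

```python
# Hypothetical sketch of the store-to-load distance rule described above.
def load_outcome(cycles_after_store, same_address):
    if not same_address or cycles_after_store >= 3:
        return "proceed"         # far enough apart (or unrelated): no conflict
    if cycles_after_store <= 1:
        return "replay"          # conflict seen too late: flush and re-execute
    return "stall"               # 2 cycles apart: load stalls at issue/addr-gen
```

The asymmetry explains why compiler scheduling matters here: a stall costs a cycle or two, but a replay flushes the pipe.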

56 Write Buffer
Six 32-byte entries
Defers stores until a cache port is available
Loads can read from the write buffer
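A behavioral sketch of such a write buffer (structure and merging policy are illustrative assumptions, not the 21164's documented design): deferred stores sit in a small buffer, a later load to a buffered address forwards the value, and entries drain to memory oldest-first when a port is free:

```python
from collections import OrderedDict

# Behavioral write-buffer sketch: a few entries holding deferred stores,
# with load forwarding and oldest-first drain.
class WriteBuffer:
    def __init__(self, n_entries=6):
        self.n_entries = n_entries
        self.pending = OrderedDict()     # address -> value, oldest first

    def store(self, addr, value):
        if addr not in self.pending and len(self.pending) == self.n_entries:
            return False                 # full: the store (and pipeline) waits
        self.pending[addr] = value       # merge with an existing entry if any
        self.pending.move_to_end(addr)
        return True

    def load(self, addr, memory):
        # Loads read from the write buffer before the store drains.
        return self.pending.get(addr, memory.get(addr, 0))

    def drain_one(self, memory):
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            memory[addr] = value         # write the oldest entry out
```

A full buffer refusing a store is exactly the "store and write buffer is full" replay-trap case from slide 38.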

57 Pipeline Processing: Front-End

58 Integer Add

59 Floating-Point Add

60 Load Hit

61 Load Miss

62 Store Hit

63 80486 Pipeline
Fetch: load 16 bytes from memory into the prefetch buffer
Decode 1: determine instruction length and type
Decode 2: compute the memory address; generate immediate operands
Execute: register read, ALU, memory read/write
Write-back: update the register file
(Source: CS740 CMU, ’97, all slides on the 486)

64 80486 Pipeline Detail
Fetch:
–Moves 16 bytes of the instruction stream into the code queue
–Not required every cycle; about 5 instructions fetched at once (avg. length 2.5 bytes)
–Only useful if we don’t branch
–Avoids the need for a separate instruction cache
D1:
–Determines total instruction length
–Signals the code-queue aligner where the next instruction begins
–May require two cycles, when multiple operands must be decoded (about 6% of a “typical” DOS program)

65 80486 Pipeline
D2:
–Extracts memory displacements and immediate operands
–Computes memory addresses: adds the base register, and possibly a scaled index register
–May require two cycles, if an index register is involved or both an address & an immediate operand must be handled (approx. 5% of executed instructions)
EX:
–Read register operands
–Compute the ALU function
–Read or write memory (data cache)
WB:
–Update the register result

