Instructional Parallelism


Presentation on theme: "Instructional Parallelism". Presentation transcript:

1 Instructional Parallelism

2 Getting Faster - So Far
Speed up clock
Reduce CPI
Reduce Instructions/Program
Clock, CPI, instruction count, and power are tightly interrelated - no free lunches
Performance: $\frac{\text{program}}{\text{second}} = \frac{\text{program}}{\text{instructions}} \times \frac{\text{instructions}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$
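
A minimal sketch of the performance relation on this slide, written as execution time (seconds per program), which is the reciprocal of programs per second. The instruction count, CPI, and clock values are made-up numbers chosen only for illustration.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    return instructions * cpi / clock_hz


# Hypothetical workload: one million instructions.
base = execution_time(1_000_000, cpi=2.0, clock_hz=1e9)

# Two of the levers from the slide, applied one at a time.
faster_clock = execution_time(1_000_000, cpi=2.0, clock_hz=2e9)
lower_cpi = execution_time(1_000_000, cpi=1.0, clock_hz=1e9)

print(base, faster_clock, lower_cpi)
```

With these numbers, doubling the clock and halving the CPI give the same execution time, which is why the three terms are treated as interchangeable levers.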

3 Getting Faster - So Far
Pipelined processor
Ideal speedup = N times more throughput for N stages
But latency increases
Branches / conflicts mean limited returns after a certain point
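
The ideal-speedup claim can be sketched with a simple cycle count. This assumes a textbook model that is not stated on the slide: every stage takes one clock, and the unpipelined machine spends `n_stages` clocks per instruction. The function names and the `stall_cycles` parameter are illustrative.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction runs start to finish before the next one begins.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    # The first instruction needs n_stages clocks to fill the pipeline;
    # every later one finishes one clock after its predecessor, plus stalls.
    return n_stages + (n_instructions - 1) + stall_cycles


def pipeline_speedup(n_instructions, n_stages, stall_cycles=0):
    return (unpipelined_cycles(n_instructions, n_stages)
            / pipelined_cycles(n_instructions, n_stages, stall_cycles))


print(pipeline_speedup(9, 3))          # short program: well below 3x
print(pipeline_speedup(1_000_000, 3))  # long program: approaches 3x
```

Under this model the speedup approaches N only for long instruction streams with no stalls; any stall cycles added by branches or conflicts push it lower.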

4 Getting Faster - ILP
Instruction Level Parallelism
Ability to run multiple instructions at the same time

5 Superscalar Superscalar : processor with multiple pipelines

6 Conventional vs Superscalar
Not all parts need duplication

7 Conventional vs Superscalar
Width generally focused on execution units
  Slowest part of pipeline
  Most specialized

8 Superscalar
Multi-issue: fetch multiple instructions, issue to dispatch units

9 Superscalar
Dispatch: instructions transmitted to functional units for execution

10 ARM A7 & A15
A7: partial dual issue, 8 stage integer pipeline
A15: 3 instruction issue, 15 stage integer pipeline

11 AMD Zen
10 wide execution
  4 Integer ALUs
  4 Floating Point Units
  2 Address Generation Units

12 Superscalar
Dependency issues just got MUCH harder...
  Instructions packed closer
  More to keep track of

13 Sample Program 9 instructions

14 Sample Program 9 instructions RAW Dependencies

15 In Order Issue
Time 1: Issue 1 & 2
Time 2: Issue 3 & 4, start execution

16 In Order Issue
Time 3: I5 issued, I4 is blocked on I1

17 In Order Issue
Time 4: I5 blocked on I4, I6 issued

18 In Order Issue
Time 5: I5 blocked on I4, I7 issued

19 In Order Issue
Time 6: I8 issued, I7 blocked on I5

20 In Order Issue
Time 7: I9 issued, I7 blocked on I5

21 In Order Issue Time 8

22 In Order Issue Time 9

23 Even More Complicated Reality is even more complicated

24 Analysis Single wide 3-stage pipeline:
Those 9 instructions in 12 clocks

25 Analysis Double width 3-stage pipeline:
Those 9 instructions in 9 clocks

26 Analysis - This Case
Ideal speedup for double wide pipeline: 2x
Speedup for dual pipelines in this case:
$S = \frac{\text{single-wide time}}{\text{2-wide time}} = \frac{12}{9} \approx 1.33$
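
As a quick check of the arithmetic, the sketch below recomputes the speedup from the two clock counts on the previous slides (12 and 9). The "efficiency" figure, measured speedup divided by the ideal 2x, is an added illustration and not a quantity from the slides.

```python
from fractions import Fraction


def speedup(baseline_clocks, improved_clocks):
    """Ratio of clock counts, kept exact as a fraction."""
    return Fraction(baseline_clocks, improved_clocks)


s = speedup(12, 9)        # single-wide clocks / double-wide clocks
efficiency = s / 2        # share of the ideal 2x actually achieved

print(float(s), float(efficiency))
```

The second pipeline buys about a third more throughput here, roughly two thirds of what the ideal case promises.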

27 Out Of Order Execution
Out of Order execution:
  Allow execution units to process instructions out of order
  Reduce waiting
  Guarantee same behavior as in order

28 Out Of Order Execution Out of Order example:
RAW dependency with R1 and R6

29 Out Of Order Execution Out of Order example:
Resolve by moving ADD R6, R3, R8 up to fill bubble due to R1

30 WAR and WAW
Out of order execution means new dangers
  Write After Read
  Write After Write
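
A minimal sketch of how the three hazard types can be told apart for a pair of instructions, given only the destination and source registers. The `Instr` tuple and the four example instructions are hypothetical; they are not the sample program from the slides.

```python
from collections import namedtuple

# One destination register and a tuple of source registers per instruction.
Instr = namedtuple("Instr", "dest srcs")


def hazards(earlier, later):
    """Return the set of hazards between two instructions in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found


i1 = Instr("r1", ("r2", "r3"))   # e.g. add r1, r2, r3
i2 = Instr("r4", ("r1", "r5"))   # reads r1
i3 = Instr("r2", ("r6", "r7"))   # writes r2, which i1 reads
i4 = Instr("r1", ("r8", "r9"))   # writes r1 again

print(hazards(i1, i2), hazards(i1, i3), hazards(i1, i4))
```

RAW exists even with in-order execution; WAR and WAW only become dangerous once the later instruction is allowed to run ahead of the earlier one.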

31 Out Of Order Execution
Reservation stations: holding pens for instructions until needed resources are ready

32 Out Of Order Execution Retire : put instructions back into order before writing out

33 Sample Program RAW Dependencies WAR WAW

34 Sample Program RAW Dependencies WAR WAW

35 Timing
RAW Dependencies
  Need to follow by 2+ time stages

36 Timing
RAW Dependencies
  Need to follow by 2+ time stages
WAR Dependencies
  Can be issued at same time
  But I2 can't be before I1

37 Timing
RAW Dependencies
  Need to follow by 2+ time stages
WAR Dependencies
  Can be issued at same time
  But I2 can't be before I1
WAW Dependencies
  The earlier write must come no later than the later write
  Or retire buffer used to fix writeback order

38 Sample Program
Out of Order Execution
Red = RAW: 2 stage delay
Green = WAR: must come at same or later time
WAW handled by reorder buffer

39 Out of Order Issue
Time 1: Issue 1 & 2
Time 2: Issue 3 & 6 to avoid 4 blocking on 1

40 Out of Order Issue Time 3 Now safe to issue 4 and 8

41 Out of Order Issue Time 4 Now safe to issue 9 5 not ready to start

42 Out of Order Issue Time 5 Now safe to issue 5

43 Out of Order Issue Time 6+ Can't start 7 until 5 is writing
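
The issue times from slides 39 to 43 can be checked against the timing rules with a small validator. This is a sketch under assumptions: only the RAW pairs 1 to 4, 4 to 5, and 5 to 7 are taken from the transcript (the blocking notes on the in-order slides); the full dependency picture is in a figure that did not survive, and instruction 7 is assumed to issue at time 7 since the slide says only "6+".

```python
RAW_DELAY = 2  # "need to follow by 2+ time stages"


def check_schedule(issue_time, raw_edges, war_edges=()):
    """Return a list of (kind, first, second) rule violations."""
    violations = []
    for producer, consumer in raw_edges:
        if issue_time[consumer] - issue_time[producer] < RAW_DELAY:
            violations.append(("RAW", producer, consumer))
    for reader, writer in war_edges:
        # Same time step is allowed; earlier is not.
        if issue_time[writer] < issue_time[reader]:
            violations.append(("WAR", reader, writer))
    return violations


# Issue times read off the out-of-order walkthrough (7 is an assumption).
issue_time = {1: 1, 2: 1, 3: 2, 6: 2, 4: 3, 8: 3, 9: 4, 5: 5, 7: 7}
raw_chain = [(1, 4), (4, 5), (5, 7)]

print(check_schedule(issue_time, raw_chain))   # no violations expected

# Issuing 4 one step earlier would break the 1 -> 4 RAW spacing.
too_early = dict(issue_time)
too_early[4] = 2
print(check_schedule(too_early, raw_chain))
```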

44 Analysis
Out of Order
  Can help keep pipelines full
  Can't shorten time for a critical path like 1 → 4 → 5 → 7
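
The critical-path limit can be sketched as a longest-chain computation: even with unlimited issue width, each instruction must wait for its slowest producer. The sketch assumes the 2-step RAW spacing from the timing slides and uses only the 1, 4, 5, 7 chain named above; the function name is illustrative.

```python
def earliest_issue_times(n_instructions, raw_edges, raw_delay=2):
    """Earliest issue time per instruction with unlimited pipeline width.

    Assumes every producer comes before its consumer in program order,
    so visiting instructions 1..n is a valid topological order.
    """
    producers = {i: [] for i in range(1, n_instructions + 1)}
    for producer, consumer in raw_edges:
        producers[consumer].append(producer)

    times = {}
    for i in range(1, n_instructions + 1):
        times[i] = max((times[p] + raw_delay for p in producers[i]),
                       default=1)
    return times


times = earliest_issue_times(9, [(1, 4), (4, 5), (5, 7)])
print(times[7])   # the last link of the chain cannot issue any earlier
```

Instructions off the chain can all issue at time 1 in this model, but instruction 7 still cannot issue before time 7, so reordering alone does not shorten the program below that bound.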

45 Register Renaming
Programmers/compilers reuse registers for different jobs:
    ldr r1,[r0]    ;get x
    ldr r2,[r3]    ;get y in r2 (first use of r2)
    mlt r4,r1,r2   ;z = x·y
    sub r5,r1,#4   ;q = x - 4
    div r2,r1,r5   ;x/(x - 4) (reuse of r2)
    add r6,r4,r2   ;s = x·y + x/(x - 4)

46 Register Renaming
Register renaming:
Avoiding data conflicts by reassigning registers to
  Other physical registers
  Hidden shadow registers

47 Register Renaming
r2 renamed to r7
    ldr r1,[r0]    ;get x
    ldr r2,[r3]    ;get y in r2 (first use of r2)
    mlt r4,r1,r2   ;z = x·y
    sub r5,r1,#4   ;q = x - 4
    div r7,r1,r5   ;x/(x - 4) (changed to r7)
    add r6,r4,r7   ;s = x·y + x/(x - 4)
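
A minimal sketch of a renaming pass over the six-instruction example. Unlike the slide, which renames only the second use of r2, this version gives every write a fresh physical register, which is one common way to describe the technique. The names `p0`, `p1`, ... and the tuple encoding of instructions are assumptions made for illustration.

```python
from itertools import count


def rename_registers(program):
    """program: list of (op, dest, srcs). Every write gets a fresh register."""
    mapping = {}          # architectural name -> current physical name
    fresh = count()
    renamed = []
    for op, dest, srcs in program:
        # Sources read whatever the latest earlier write was mapped to;
        # names never written here (r0, r3, #4) pass through unchanged.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((op, mapping[dest], new_srcs))
    return renamed


program = [
    ("ldr", "r1", ("r0",)),
    ("ldr", "r2", ("r3",)),
    ("mlt", "r4", ("r1", "r2")),
    ("sub", "r5", ("r1", "#4")),
    ("div", "r2", ("r1", "r5")),   # reuse of r2
    ("add", "r6", ("r4", "r2")),
]

for instr in rename_registers(program):
    print(instr)
```

After renaming, the two writes to r2 land in different physical registers, so the div no longer has to wait for the mlt to finish reading the old value; only the true RAW dependencies remain.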

48 Register Renaming Before: After:

49 Superscalar Pro/Con
Good
  The hardware solves everything:
    Hardware solves scheduling/registers/etc.
    Compiler/programmer can still help matters
  Binary compatibility
    New hardware issues old instructions in a more efficient way

50 Transistor Count Vast majority of transistor count is to support doing work faster

51 Superscalar Pro/Con
Good
  The hardware solves everything:
    Hardware solves scheduling/registers/etc.
    Compiler/programmer can still help matters
  Binary compatibility
    New hardware issues old instructions in a more efficient way
Bad
  Complex hardware
  Limit to scale

52 VLIW: Superscalar Alternative
VLIW: Very Long Instruction Word
  One bundle contains multiple instructions
  Each bundle designed to schedule cleanly

53 Who does work? Compiler assembles long instructions
Reorders at compile time Compiler has more time, information
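
Compile-time bundling can be sketched as a greedy pass that fills each bundle with instructions whose producers were all placed in earlier bundles. This is an illustration only: it tracks RAW dependencies alone, assumes every result is ready by the next bundle, and uses an assumed bundle width of 3. The edge list is derived by hand from the six-instruction register renaming example, numbered 1 to 6 in program order.

```python
def pack_bundles(n_instructions, raw_edges, width=3):
    """Greedy packing: each bundle holds up to `width` ready instructions."""
    producers = {i: set() for i in range(1, n_instructions + 1)}
    for producer, consumer in raw_edges:
        producers[consumer].add(producer)

    bundles, done = [], set()
    remaining = list(range(1, n_instructions + 1))
    while remaining:
        # Ready = every producer already sits in an earlier bundle.
        bundle = [i for i in remaining if producers[i] <= done][:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [i for i in remaining if i not in done]
    return bundles


# RAW edges of the register renaming example (hand-derived).
edges = [(1, 3), (2, 3), (1, 4), (1, 5), (4, 5), (3, 6), (5, 6)]
print(pack_bundles(6, edges))
```

Even with room for three instructions per bundle, the dependency chain leaves most slots empty in this example, which hints at why the compiler's job is hard.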

54 VLIW Uses
Itanium: EPIC (Explicitly Parallel Instruction Computing)
  3 instruction bundles

55 VLIW Pro/Con
Good
  Simple hardware
    No scheduling responsibilities
  Potentially better optimization in compiler
Bad
  Binary compatibility: compiler builds for one specific hardware
  Good compilers are HARD to write

