
1 Lecture 8: Data-Capture Instruction Schedulers

2 The goal is to execute instructions in dataflow order, as opposed to the sequential order specified by the ISA. The renamer plays a critical role by removing all of the false register dependencies. The scheduler is responsible for:
– for each instruction, detecting when all dependencies have been satisfied (and the instruction is therefore ready to execute)
– propagating readiness information between instructions

3 (Figure) Static Program → Fetch → Dynamic Instruction Stream → Rename → Renamed Instruction Stream → Schedule → Dynamically Scheduled Instructions. Out-of-order = out of the original sequential order.

4 Example instruction sequence (A misses in the cache):
A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 – 4
E: R7 = Load 20[R5]
F: R4 = R4 – 1
G: BEQ R4, #0
(Figure) Execution schedules for the same sequence on four machines: 1-wide in-order (10 cycles), 2-wide in-order (8 cycles), 1-wide out-of-order (7 cycles), and 2-wide out-of-order (5 cycles). The out-of-order machines hide the cache miss on A by executing the independent instructions underneath it.
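
A minimal Python sketch of the dataflow idea behind this slide: it records the register dependences of the seven-instruction example and computes the earliest cycle each instruction could start on an idealized machine (unbounded issue width, perfect bypass). The latencies, including the 5-cycle miss penalty for A, are illustrative assumptions rather than the slide's machine models, so the cycle counts differ from the figure's.

# Earliest-start computation for the slide's example on an idealized machine.
insts = {                    # name: (dest, sources, assumed latency)
    "A": ("R1", ["R2"],        5),   # R1 = Load 16[R2]  (assume cache miss)
    "B": ("R3", ["R1", "R4"],  1),   # R3 = R1 + R4
    "C": ("R6", ["R9"],        1),   # R6 = Load 8[R9]
    "D": ("R5", ["R2"],        1),   # R5 = R2 - 4
    "E": ("R7", ["R5"],        1),   # R7 = Load 20[R5]
    "F": ("R4", ["R4"],        1),   # R4 = R4 - 1
    "G": (None, ["R4"],        1),   # BEQ R4, #0
}

ready_at = {}          # register -> cycle its value becomes available
start = {}             # instruction -> earliest cycle it can begin executing
for name, (dest, srcs, lat) in insts.items():   # processed in program order
    begin = max([ready_at.get(r, 0) for r in srcs], default=0)
    start[name] = begin
    if dest is not None:
        ready_at[dest] = begin + lat

print(start)   # e.g. B waits for the missing load; C, D, F can start immediately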

5 At dispatch, instructions read all available operands from the register files and store a copy in the scheduler. Unavailable operands will be captured from the functional-unit outputs. When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files.
(Figure) Fetch & Dispatch → Data-Capture Scheduler → Functional Units, with operands read from the ARF and PRF/ROB at dispatch, a bypass path from the functional-unit outputs back into the scheduler, and a physical register update path.
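
A minimal Python sketch (with illustrative names) of what dispatch into a data-capture scheduler does: copy whatever operand values are already available into the entry, and leave tags behind for values that will be captured later from the result broadcast.

# Sketch of dispatch into a data-capture scheduler (names are illustrative).
class Entry:
    def __init__(self, opcode, dest_tag, src_tags):
        self.opcode   = opcode
        self.dest_tag = dest_tag
        self.src_tag  = list(src_tags)            # producer tag per source
        self.value    = [None]  * len(src_tags)   # captured value per source
        self.ready    = [False] * len(src_tags)
        self.issued   = False

def dispatch(entry, regfile, valid_bits):
    """Read already-available operands from the register file(s) at dispatch."""
    for i, tag in enumerate(entry.src_tag):
        if valid_bits.get(tag, False):            # value already produced
            entry.value[i] = regfile[tag]
            entry.ready[i] = True

def capture(entry, bcast_tag, result):
    """Snoop a result broadcast; capture the value on a tag match."""
    for i, tag in enumerate(entry.src_tag):
        if not entry.ready[i] and tag == bcast_tag:
            entry.value[i] = result
            entry.ready[i] = True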

6 (Figure) An alternative organization: Fetch & Dispatch, ARF/PRF, Scheduler, Functional Units, with a physical register update path. More on this next lecture!

7 What a scheduler needs (the entries are variously called Scheduler Entries, the Issue Queue (IQ), or Reservation Stations (RS)):
– a buffer for unexecuted instructions
– a method for tracking the state of dependencies (resolved or not)
– a method for notification of dependency resolution
– a method (an arbiter) for choosing between multiple ready instructions competing for the same resource

8 Scheduling Loop or Wakeup-Select Loop
– Wake-Up part: instructions selected for execution notify dependents (children) that the dependency has been resolved. For each instruction, check whether all input dependencies have been resolved; if so, the instruction is woken up.
– Select part: choose which instructions get to execute. If 3 add instructions are ready at the same time, but there are only two adders, someone must wait and try again in the next cycle (and again, and again, until selected).
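
A software sketch of one iteration of the wakeup-select loop, reusing the Entry fields from the dispatch sketch above. This is only illustrative: in hardware, select and wakeup are parallel combinational logic, not sequential loops.

def select(entries, issue_width):
    """Pick up to issue_width ready, not-yet-issued entries (position priority)."""
    ready = [e for e in entries if all(e.ready) and not e.issued]
    return ready[:issue_width]

def wakeup(entries, selected):
    """Selected instructions broadcast their destination tags; dependents
    mark the matching source operands as resolved."""
    for producer in selected:
        producer.issued = True
        for consumer in entries:
            for i, tag in enumerate(consumer.src_tag):
                if tag == producer.dest_tag:
                    consumer.ready[i] = True

def schedule_cycle(entries, issue_width=2):
    chosen = select(entries, issue_width)   # select part
    wakeup(entries, chosen)                 # wake-up part
    return chosen                           # these proceed to execute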

9 (Figure) CAM-style wakeup: each scheduler entry holds its source tags (T14, T16, T39, T6, T17, T15, ...), and a bank of comparators matches them against the tags on the Tag Broadcast Bus (T8, T17, T42). Matches set ready bits; the Select Logic then picks which ready entries are sent To Execute Logic.

10 (Figure) One scheduler entry in detail: the tags/ready logic plus the select logic. The entry holds, for each source operand, a source tag (SrcL/SrcR), a ready bit (RdyL/RdyR), and a captured value (ValL/ValR), along with the destination tag (Dst) and an Issued bit. Comparators snoop the Tag Broadcast Buses; when ready, the entry bids to the Select Logic and waits for a grant.

11 (Figure) The Payload RAM: alongside the tag/ready portion, each entry stores the opcode and the captured operand values (ValL/ValR). When the Select Logic grants an entry, its payload is read out and sent to the functional unit.

12 (Figure) The same datapath with a second instruction (B) added: each Payload RAM entry holds an opcode and the captured left/right operand values. The scheduler captures the data, hence "data-capture".

13 The maximum number of instructions selected for execution each cycle is the issue width.
– The previous slide showed an issue width of two; the slide before that showed the details of a scheduler entry for an issue width of four.
Hardware requirements:
– typically, an issue width of N requires N tag broadcast buses
– not always true: the design can be specialized such that, for example, one issue slot can only handle branches

14 (Figure) Ideal timing for the dependence chain A → B → C: while A is selected and reads its payload in cycle i, its tag broadcast enables capture on a tag match in B and wakes it up; A's result broadcast supplies the value. B is then selected, reads its payload, and executes in cycle i+1, waking C the same way, so dependents execute in back-to-back cycles.

15 (Figure) The same chain once the scheduler is pipelined: A's select, payload read, and execute are spread across cycles i and i+1, with B waking up and capturing a cycle later and C following behind, out through cycle i+3. Note: can't read and write the payload RAM at the same time; may need to bypass the results.

16 The previous slide placed the pipeline boundary at the writing of the ready bits. This slide shows a pipeline where latches are placed right before the tag broadcast.
(Figure) With this boundary, A's tag broadcast (enable capture) and result broadcast reach B across the cycle boundary: A goes Select/Payload/Execute, and B wakes up, captures, and then goes through its own Select/Payload/Execute over cycles i through i+2.

17 (Figure) Timing for A → B → C with the latch before the tag broadcast: A's tag broadcast and result broadcast/bypass enable capture in B (tag match on the first operand); B's broadcast then provides the tag match on the second operand, so now C is ready. The payload RAM no longer has a simultaneous read/write, but a second level of bypassing is needed.

18 (Figure) Example with A, B, C, D: A and B are both ready, but only A is selected, so B bids again. The A→C and C→D results must be bypassed, but no bypass is needed for B→D. The bottom line: with this pipelining, dependent instructions cannot execute in back-to-back cycles!

19 The Wakeup-Select Loop cannot be trivially pipelined while maintaining back-to-back execution of dependent instructions.
(Figure) Regular scheduling versus no back-to-back scheduling of the chain A → B → C.
– Worst-case IPC reduction of ½
– Shouldn't be that bad in practice (the previous slide had an IPC of 4/3)
– Studies indicate a 10-15% IPC penalty ("Loose Loops Sink Chips", Borch et al.)

20 A 10-15% IPC drop doesn't seem bad if we can double the clock frequency: 2.0 IPC at 1 GHz is 2 BIPS, while 1.7 IPC at 2 GHz is 3.4 BIPS.
But frequency doesn't actually double:
– latch/pipeline overhead and unbalanced stages (the figure illustrates with 900 ps, 450 ps, and 500 ps stage delays)
There are other sources of IPC penalties:
– branches: deeper pipeline, predictor size, predict-to-update latency
– caches/memory: a miss takes the same time in seconds, so at a higher frequency it costs more cycles
And power limitations: more logic and higher frequency, with P = ½CV²f
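
For concreteness, the arithmetic behind the slide's trade-off as a small Python snippet (treating C and V as constants in the power equation, which is a simplification):

def bips(ipc, freq_ghz):
    # billions of instructions per second = IPC x frequency (in GHz)
    return ipc * freq_ghz

print(bips(2.0, 1.0))   # 2.0 BIPS at 2.0 IPC, 1 GHz
print(bips(1.7, 2.0))   # 3.4 BIPS at 1.7 IPC, 2 GHz: faster despite lower IPC

def dynamic_power(c, v, f):
    return 0.5 * c * v * v * f   # P = 1/2*C*V^2*f doubles when f doubles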

21 Goal: minimize DFG height (execution time). This is NP-Hard:
– it is the Precedence Constrained Scheduling problem
– and it is even harder here because the entire DFG is not known at scheduling time; scheduling decisions made now may affect the scheduling of instructions not even fetched yet
So we use heuristics:
– for performance
– for ease of implementation

22 (Figure) Select logic over the scheduler entries. A simple priority chain over S entries yields O(S) gate delay:
Grant0 = 1
Grant1 = !Bid0
Grant2 = !Bid0 & !Bid1
Grant3 = !Bid0 & !Bid1 & !Bid2
…
Grant(n-1) = !Bid0 & … & !Bid(n-2)
The figure's tree-structured version computes the same grants (each x_i formed from Bid_i and grant_i) in O(log S) gates.
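
The same grant chain modeled in Python (illustrative only): entry 0 has the highest priority, and each later entry is granted only if it bids and no earlier entry bid. Hardware computes the "no earlier bid" prefix with a tree, which is where the O(log S) delay comes from.

def ripple_select(bids):
    """Priority select: grant[i] = bid[i] AND no earlier entry bid."""
    grants = []
    no_earlier_bid = True            # running !Bid0 & !Bid1 & ... prefix
    for bid in bids:
        grants.append(bid and no_earlier_bid)
        no_earlier_bid = no_earlier_bid and not bid
    return grants

print(ripple_select([False, True, False, True]))  # only entry 1 is granted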

23 Instructions may be located in scheduler entries in no particular order:
– the first ready entry may be the oldest, the youngest, or anywhere in between
– simple select therefore produces an essentially random schedule; the schedule is still correct in that no dependencies are violated, it just may be far from optimal

24 Intuition:
– an instruction that has just entered the scheduler will likely have few, if any, dependents (only intra-group dependents)
– similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents
– selecting the oldest instructions therefore has a higher chance of satisfying more dependencies, making more instructions ready to execute → more parallelism

25 (Figure) A compacting ("compress up") scheduler: instructions A-H are written into the scheduler in program order; when some of them issue, the remaining entries (E, F, G, H) compress up toward the top and newly dispatched instructions (I, J, K, L) fill in behind them.

26 Compressing buffers are very complex:
– gates, wiring, area, power
(Figure) Example: in a 4-wide machine, entries may need to shift by up to 4 positions per cycle, and each shift moves an entire instruction's worth of data: tags, opcodes, immediates, readiness bits, etc.

27 (Figure) The alternative: leave instructions where they are (A-H scattered across arbitrary entries) and make the select logic age-aware, so the grant goes to the oldest ready instruction.
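
A sketch of age-aware selection, assuming each entry carries an extra age field (its dispatch sequence number) on top of the earlier Entry sketch; real hardware does this with priority logic rather than a sort.

def age_aware_select(entries, issue_width):
    """Grant the oldest ready, not-yet-issued entries first."""
    ready = [e for e in entries if all(e.ready) and not e.issued]
    oldest_first = sorted(ready, key=lambda e: e.age)   # smallest age = oldest
    return oldest_first[:issue_width]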

28 (Figure) Pipeline diagrams (Sched → PayLd → Exec) for two cases. With a single-cycle producer it works: Add R1 = R2 + R3 wakes Xor R4 = R1 ^ R5, which executes right behind it. But with a multi-cycle producer, Mul R1 = R2 × R3, the dependent Add R4 = R1 + R5 attempts to execute too early: the multiply's result is not ready for another two cycles.

29 Delay the producer's tag broadcast to match its latency. It works, but…
– must make sure the tag broadcast bus will be available N cycles in the future when needed
– bypass and data-capture potentially get more complex
(Figure) Mul R1 = R2 × R3 followed by Add R4 = R1 + R5, assuming the pipeline is arranged so that the tag broadcast occurs at the cycle boundary.

30 (Figure) Assume the issue width equals 2. With Mul R1 = R2 × R3, Add R4 = R1 + R5, Sub R7 = R8 – #1, and Xor R9 = R9 ^ R6 in flight, there is a cycle in which three instructions need to broadcast their tags, but only two tag buses exist.

31 Possible solutions:
1. Have one select for issuing and another select for tag broadcast; this messes up the timing of data-capture.
2. Pre-reserve the bus; the select logic becomes more complicated, since it must track usage in future cycles in addition to the current cycle.
3. Hold the issue slot from initial launch until tag broadcast (sch/payload/exec); the issue width is effectively reduced by one for three cycles.
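
A sketch of option 2 (pre-reserving the tag bus), with made-up names and a simple per-cycle reservation table; it also reproduces the previous slide's conflict, where a third broadcast in the same cycle cannot be accommodated.

from collections import defaultdict

ISSUE_WIDTH = 2                          # number of tag broadcast buses
bus_reservations = defaultdict(int)      # future cycle -> buses already booked

def try_issue(inst, now):
    """inst = (dest_tag, latency). Returns the broadcast cycle, or None if no
    tag bus is free in that cycle (the select logic must then stall this op)."""
    dest_tag, latency = inst
    bcast_cycle = now + latency - 1      # broadcast so dependents wake in time
    if bus_reservations[bcast_cycle] >= ISSUE_WIDTH:
        return None                      # bus already fully booked that cycle
    bus_reservations[bcast_cycle] += 1
    return bcast_cycle

print(try_issue(("T7", 3), now=0))       # multiply: broadcast at cycle 2
print(try_issue(("T9", 1), now=2))       # add: also broadcasts at cycle 2
print(try_issue(("T4", 1), now=2))       # third tag for cycle 2 -> None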

32 Push the delay to the consumer instead.
(Figure) The tag broadcast for R1 = R2 × R3 arrives at the consumer R5 = R1 + R4, but the consumer waits three cycles before acknowledging it (before declaring that operand ready). This also means the consumer needs to know its parent's latency.

33 The problem with the previous approaches is that they assume all instruction latencies are known at the time of scheduling.
– Variable latency makes things uglier for the delayed broadcast
– and it pretty much kills the delayed (consumer-side) wakeup approach
Examples:
– Load instructions: latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}, and DRAM_lat is not a constant either (queuing delays)
– Some architecture-specific cases: the PowerPC 603 has an early-out for multiplication with a low-bit-width multiplicand, and the Intel Core 2's divider also has an early-out

34 Option: just wait and see whether a load hits or misses in the cache before waking dependents.
(Figure) R1 = Load 16[$sp] followed by R2 = R1 + #4: the scheduler waits until the DL1 tag check determines hit/miss before broadcasting the load's tag, so the load-to-use latency increases by 2 cycles (a 3-cycle load appears as 5 cycles). It may be possible to design the cache so that hit/miss is known before the data arrives, which reduces the penalty to 1 cycle.

35 Caches work pretty well:
– hit rates are high (otherwise caches wouldn't be too useful)
– so just assume all loads will hit in the cache
(Figure) R1 = Load 16[$sp] followed by R2 = R1 + #4: the load's tag broadcast is delayed by the DL1 latency; on a cache hit the data is forwarded and the dependent executes right on time. Um, ok, but what happens when there's a load miss?

36 (Figure) On a cache miss, the dependent was scheduled assuming a hit, so when the miss is detected the value at the cache output is bogus; the dependent instruction is invalidated (its ALU output ignored) and rescheduled, now assuming a hit at the DL2 cache, with the broadcast delayed by the L2 latency. There could be a miss at the L2, and again at the L3 cache, so a single load can waste multiple issuing opportunities. Each mis-scheduling wastes an issue slot: the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used by another instruction.
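
A sketch of the assume-hit policy and its recovery, reusing the Entry fields from the earlier dispatch sketch; the squash granularity here (only direct dependents of the load) is a simplifying assumption.

def speculative_wakeup(load, entries):
    """Wake dependents of `load` as if the L1 access will hit; return them."""
    woken = []
    for e in entries:
        for i, tag in enumerate(e.src_tag):
            if tag == load.dest_tag and not e.ready[i]:
                e.ready[i] = True
                woken.append(e)
    return woken

def on_miss(load, woken):
    """Miss detected: squash everything woken under the hit assumption so it
    re-bids and replays when the data really arrives."""
    for e in woken:
        e.issued = False                       # must issue again later
        for i, tag in enumerate(e.src_tag):
            if tag == load.dest_tag:
                e.ready[i] = False             # wait for the real broadcast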

37 Normally, as soon as an instruction issues, it can vacate its scheduler entry.
– The sooner an entry is deallocated, the sooner another instruction can reuse it, which leads to that instruction executing earlier.
In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag.
– This decreases the effective size of the scheduler.

38 (Figure) A DL1 miss triggers a squash: not only do the load's children get squashed, there may be grandchildren to squash as well. All of them waste issue slots, all must be rescheduled, and all waste power; and none of them may leave the scheduler until the load's hit/miss is known.

39 The number of cycles' worth of dependents that must be squashed is equal to the cache-miss-known latency minus one.
– In the previous example, the miss-known latency was 3 cycles, so there are two cycles' worth of mis-scheduled dependents.
– Early miss detection helps reduce the number of mis-scheduled instructions.
– A load may have many children, but the issue width limits how many can possibly be mis-scheduled: Max = Issue-width × (Miss-known-lat – 1).

40 Simple recovery: squash anything in flight between schedule and execute.
(Figure) Every Sched → PayLd → Exec stream in that window gets squashed.
– This may include non-dependent instructions.
– All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash.

41 Selective squashing: use load colors.
– Each load is assigned a unique color.
– Every dependent inherits its parents' colors.
– On a load miss, the load broadcasts its color and anyone in the same color group gets squashed.
An instruction may end up with many colors, and explicitly tracking each color would require a huge number of comparisons.

42 Colors can instead be listed in unary (bit-vector) form.
(Figure) Example sequence (Load R1 = 16[R2], Add R3 = R1 + …, Load R5 = 12[R7], Load R8 = 0[R1], Load R7 = 8[R4], Add R6 = R8 + R7) with a color bit-vector per instruction. Each instruction's color vector is the bitwise OR of its parents' vectors, so a load miss now squashes only the dependent instructions. The hardware cost increases quadratically with the number of load instructions.
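
A small Python sketch of load colors as bit vectors (the names and the example chain are made up): each load takes one bit, color vectors are OR-ed down the dependence chain, and a miss squashes exactly the instructions whose vector contains the missing load's bit.

next_color_bit = 0

def new_load_color():
    """Give the next outstanding load its own bit position."""
    global next_color_bit
    bit = 1 << next_color_bit
    next_color_bit += 1
    return bit

def color_of(parent_colors, own_load_bit=0):
    """An instruction's color vector is the OR of its parents' vectors,
    plus its own bit if it is itself a load."""
    vec = own_load_bit
    for p in parent_colors:
        vec |= p
    return vec

def must_squash(inst_color, missing_load_bit):
    return bool(inst_color & missing_load_bit)

# Example chain: loadA -> add -> loadB -> add2; loadA misses.
loadA = new_load_color()                   # bit 0
add   = color_of([loadA])
loadB = color_of([add], new_load_color())  # inherits loadA's bit, adds bit 1
add2  = color_of([loadB])
print(must_squash(add2, loadA))            # True: transitively depends on loadA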

43 Circular-buffer scheduler (head and tail pointers): allocate in-order, deallocate in-order.
– Very simple!
– But the effective scheduler size is smaller: instructions may have already executed out of order, yet their RS entries cannot be reused.
– This can be very bad if a load goes to main memory.

44 With arbitrary placement, entries are much better utilized, but:
– the allocator is more complex: it must scan the entry-availability bit-vector and find N free entries
– the write logic is more complex: it must route N instructions to arbitrary entries of the scheduler
(Figure) RS allocator driven by an entry-availability bit-vector.

45 Compromise: segment the entries.
– Only one entry per segment may be allocated per cycle.
– Instead of a 4-of-16 allocation (previous slide), each allocator only does a 1-of-4 pick.
– The write logic is simplified as well.
Still possible inefficiencies:
– a full segment leads to allocating fewer than N instructions per cycle, even though free RS entries exist; they are just not in the correct segment.
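
A sketch of segmented allocation with an assumed segment size of 4: each segment contributes at most one entry per cycle, and the example reproduces the slide's inefficiency where free entries exist but not in the right segment.

SEG_SIZE = 4

def allocate(free_bits, ninsts):
    """free_bits: one boolean per RS entry. Returns the entry indices
    allocated this cycle (possibly fewer than ninsts if a segment is full)."""
    allocated = []
    for seg_start in range(0, len(free_bits), SEG_SIZE):
        if len(allocated) == ninsts:
            break
        seg = range(seg_start, seg_start + SEG_SIZE)
        free = [i for i in seg if free_bits[i]]
        if free:                       # 1-of-SEG_SIZE pick within the segment
            allocated.append(free[0])
            free_bits[free[0]] = False
    return allocated

free = [False]*4 + [True]*12           # segment 0 is completely full
print(allocate(free, 4))               # only 3 allocated: [4, 8, 12]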

