Lecture 11: Memory Scheduling

If R1 != R7, then Load R8 gets the correct value from the cache. If R1 == R7, then Load R8 should have gotten its value from the Store, but it didn't!

Load R3 = 0[R6]      ← issues, cache miss!
Add R7 = R3 + R9
Store R4 → 0[R7]     ← issues once the miss is serviced… but there was a later load…
Sub R1 = R1 – R2
Load R8 = 0[R1]      ← issues early, cache hit!

The ordering problem is a data-dependence violation. Why can't this happen with non-memory insts?
–Operand specifiers in non-memory insts are absolute: "R1" refers to one specific location.
–Operand specifiers in memory insts are ambiguous: "R1" refers to a memory location specified by the value of R1. As pointers change, so does this location.
Determining whether it is safe to issue a load OOO requires disambiguating the operand specifiers.

Memory disambiguation:
–Are there any earlier unexecuted stores to the same address as myself? (I'm a load)
–Binary question: the answer is yes or no.
Store-to-load forwarding problem:
–Which earlier store do I get my value from? (I'm a load)
–Which later load(s) do I forward my value to? (I'm a store)
–Non-binary question: the answer is one or more instruction identifiers.
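As a concrete illustration, here is a minimal software sketch of both searches over a queue of in-flight memory ops. The `LsqEntry` structure and field names are hypothetical stand-ins for whatever a real LSQ holds:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical LSQ entry for one in-flight memory instruction.
struct LsqEntry {
    bool     is_store;
    bool     addr_known;   // effective address computed yet?
    uint64_t addr;
    uint64_t seq;          // program-order age: smaller = older
};

// Disambiguation (binary): does any earlier store possibly conflict
// with this load? A store with an unknown address must count as "maybe".
bool older_store_may_conflict(const std::vector<LsqEntry>& lsq,
                              uint64_t load_seq, uint64_t load_addr) {
    for (const auto& e : lsq)
        if (e.is_store && e.seq < load_seq &&
            (!e.addr_known || e.addr == load_addr))
            return true;
    return false;
}

// Forwarding (non-binary): which earlier store supplies the value?
// The answer is an instruction identifier: the youngest store older
// than the load with a matching, resolved address. (A real design
// must also worry about unresolved stores in between.)
std::optional<uint64_t> forwarding_source(const std::vector<LsqEntry>& lsq,
                                          uint64_t load_seq,
                                          uint64_t load_addr) {
    std::optional<uint64_t> best;
    for (const auto& e : lsq)
        if (e.is_store && e.seq < load_seq && e.addr_known &&
            e.addr == load_addr && (!best || e.seq > *best))
            best = e.seq;
    return best;
}
```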

[Figure: LSQ contents, ordered oldest to youngest, with columns L/S, PC, Seq, Addr, Value, shown alongside the data cache.]

No memory reordering:
–The LSQ is still needed for forwarded data (last slide).
–Easy to schedule: each memory op becomes ready, bids, and is granted strictly in program order.
–Least IPC: all memory is executed sequentially.

Let loads execute OOO w.r.t. each other, but with no reordering past earlier unexecuted stores. [Figure: a load's ready bit is set only once all earlier stores in the queue have executed.]
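This policy reduces to a simple per-load readiness predicate. A sketch, with a hypothetical `MemOp` record:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical queue entry.
struct MemOp {
    bool     is_store;
    bool     executed;  // address and data resolved, op performed
    uint64_t seq;       // program order: smaller = older
};

// Partial ordering: a load may issue only when every older store has
// executed; loads may still pass each other freely.
bool load_may_issue(const std::vector<MemOp>& lsq, uint64_t load_seq) {
    for (const auto& e : lsq)
        if (e.is_store && e.seq < load_seq && !e.executed)
            return false;
    return true;
}
```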

Stores normally don't "execute" until both inputs are ready: the address and the data. But only the address is needed to disambiguate. [Figure: a load may proceed once every earlier store's address is ready, even if the store data is not.]

The most aggressive approach:
–Relies on the fact that store → load forwarding is not the common case.
–Greatest potential IPC: loads never stall.
–Potential for incorrect execution.

Case 1: the older store executes before the younger load.
–No problem; if they have the same address, st → ld forwarding happens.
Case 2: the older store executes after the younger load.
–The store scans all younger loads.
–An address match means an ordering violation.
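A sketch of the Case 2 scan, with a hypothetical `LoadEntry` record; real hardware does this as a parallel CAM match rather than a loop:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical load-queue entry.
struct LoadEntry {
    bool     executed;   // has this load already read a value?
    uint64_t addr, seq;  // resolved address; program-order age
};

enum class Outcome { None, Forwarded, OrderingViolation };

// Case 2: an older store's address resolves after younger loads may
// already have run. Scan all younger loads; a match on an executed
// load means it read stale data, so the pipeline must flush from it
// (everything after the violating load goes anyway).
Outcome store_resolves(std::vector<LoadEntry>& loads,
                       uint64_t st_addr, uint64_t st_seq) {
    Outcome result = Outcome::None;
    for (auto& ld : loads) {
        if (ld.seq < st_seq || ld.addr != st_addr)
            continue;                       // older load, or no match
        if (ld.executed)
            return Outcome::OrderingViolation;
        result = Outcome::Forwarded;        // not yet executed: it can
    }                                       // capture the broadcast value
    return result;
}
```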

[Figure: walk-through on the LSQ contents from the earlier slide.]
–The store broadcasts its value, address, and sequence # (e.g., (-17, 0x3290, 41775)).
–Loads CAM-match on the address, and only care if the store's seq # is lower than their own (a load with a lower seq # ignores the broadcast).
–If a younger load hadn't executed and the address matches, it grabs the broadcast value.
–If a younger load has executed and the address matches, that is an ordering violation! Grab the value and flush the pipeline after the load (e.g., (0, 0x3290, 41779)).
–An instruction may be involved in more than one ordering violation.

Instructions using the load's stale/wrong value will propagate more wrong values; these must somehow be re-executed. Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch. The load uses the forwarded value, and the correct value propagates when the instructions re-execute.

When flushing only part of the pipeline (everything after the load), the RAT must be repaired to the state just after the load was renamed. Solutions?
–Checkpoint at every load: not so good; between loads and branches, a very large number of checkpoints is needed.
–Roll back to the previous branch (which has its own checkpoint): make sure the load doesn't misspeculate the 2nd time around, and you have to redo the work between the branch and the load, which was all correct the first time around.
–Works with an undo-list style of recovery.
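A sketch of the rollback-to-branch option, assuming hypothetical `Rat` snapshots keyed by sequence number (the real repair mechanism is machine-specific):

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <stdexcept>

constexpr int kArchRegs = 32;
using Rat = std::array<uint16_t, kArchRegs>;  // arch reg -> phys reg map

// Hypothetical recovery bookkeeping for "roll back to the previous
// branch": RAT snapshots are taken only at branches.
struct RecoveryState {
    std::map<uint64_t, Rat> branch_checkpoints;  // branch seq -> snapshot

    // On a load ordering violation, restore the youngest branch
    // checkpoint at or before the load. Everything between that branch
    // and the load was correct but must be redone anyway: the cost of
    // not checkpointing at every load.
    Rat recover_for_load(uint64_t load_seq) const {
        auto it = branch_checkpoints.upper_bound(load_seq);
        if (it == branch_checkpoints.begin())
            throw std::runtime_error("no checkpoint older than the load");
        return std::prev(it)->second;
    }
};
```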

–Not all later instructions are dependent on the bogus load value.
–The pipeline latency due to refetch is exposed.
–Hunting down the RS entries to squash is tricky.

The ideal case w.r.t. maintaining high IPC, but very complicated:
–Need to hunt down only the data-dependent insts.
–Messier because some instructions may have already executed (now in the ROB) while others may not have executed yet (still in the RS).
–Iteratively walk the dependence graph? Use some sort of load/store coloring scheme?
The P4 uses replay for load-latency misspeculation, but replay wouldn't work in this case (why?).
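Hunting down only the data-dependent instructions amounts to a transitive closure over the in-flight dataflow graph. A BFS sketch over a hypothetical instruction-window record, which also hints at why hardware can't afford it:

```cpp
#include <cstdint>
#include <queue>
#include <unordered_set>
#include <vector>

// Hypothetical ROB/RS record.
struct InflightInst {
    uint64_t id;
    std::vector<uint64_t> src_ids;  // producers of this inst's inputs
};

// Selective replay: mark the bogus load and everything transitively
// data-dependent on it, whether already executed (ROB) or not (RS).
std::unordered_set<uint64_t>
find_dependents(const std::vector<InflightInst>& window, uint64_t bad_load) {
    std::unordered_set<uint64_t> poisoned{bad_load};
    std::queue<uint64_t> frontier;
    frontier.push(bad_load);
    while (!frontier.empty()) {
        uint64_t p = frontier.front(); frontier.pop();
        for (const auto& inst : window)        // O(n) scan per producer;
            for (uint64_t src : inst.src_ids)  // hardware can't afford this
                if (src == p && poisoned.insert(inst.id).second)
                    frontier.push(inst.id);
    }
    return poisoned;  // these must re-execute with the corrected value
}
```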

"SimpleScalar" style: crack memory ops at dispatch time.
–A store is cracked into an ea-comp op and a st-data op, which are independently scheduled out of the RS and independently executed; once both reach the LSQ, the store "completes" and forwards its value to later loads.
–A load is similar, but the ld-data portion is data-dependent on the load's ea-comp.
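A sketch of the crack step, with hypothetical uop records; the key point is that the two store halves are mutually independent, while ld-data depends on its own ea-comp:

```cpp
#include <vector>

// Hypothetical uop record: "crack at dispatch time" turns one memory
// instruction into two independently scheduled pieces that rendezvous
// at the same LSQ entry.
enum class UopKind { StoreEaComp, StoreData, LoadEaComp, LoadData };

struct Uop {
    UopKind kind;
    int     lsq_index;       // both halves target the same LSQ entry
    bool    dep_on_sibling;  // ld-data waits on its own ea-comp
};

std::vector<Uop> crack(bool is_store, int lsq_index) {
    if (is_store)
        return { {UopKind::StoreEaComp, lsq_index, false},
                 {UopKind::StoreData,   lsq_index, false} };  // independent
    return { {UopKind::LoadEaComp, lsq_index, false},
             {UopKind::LoadData,   lsq_index, true} };
}
```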

The LSQ needs data-capture support:
–Store-data needs to capture its value.
–EA-comps can write to LSQ entries directly using the LSQ index (no associative search).
–A store normally doesn't have a destination, so the dest field is overloaded to hold the LSQ index (e.g., St-ea's dest is Lsq-5).
–The load's ea-comp is done the same way; the load's LSQ entry handles the "real" destination tag broadcast.

Select: a load must bid/select twice.
–Once for the ea-comp portion.
–Once for the cache access (includes the LSQ check); the data cache and LSQ are searched in parallel.

"Pentium" style: the store is cracked at dispatch/alloc into a STA ("store address") op and a STD ("store data") op.
–STA and STD still execute independently.
–The LSQ does not need data-capture: it uses the RS's data-capture (for a data-capture scheduler), or RS → PRF → LSQ.
–Potentially adds a little delay from STD-ready to ST → LD forwarding.

Select: only one select/bid per load. The LSQ search happens in parallel with the data cache access. The load-queue part doesn't "execute"; it just holds the address for detecting ordering violations.

STA and STD independently issue from the RS:
–STA does the ea comp.
–STD just reads its operand and moves it to the LSQ.
When both have executed and reached the LSQ, perform the LSQ search for younger loads that have already executed (i.e., ordering violations).

The CAM logic is harder than a regular scheduler's because we need address + age information. Age information is not needed for physical registers, since register renaming guarantees one writer per address; there is no easy way to prevent more than one store to the same memory address.

[Figure: forwarding search for LD 0x4000 against older ST 0x4000 and ST 0x4120. An Address Bank of comparators flags "addr match" on each "valid store"; a "no earlier matches" priority chain selects the youngest older matching store ("use this store"), which supplies the value from the Data Bank.]
–Need to adjust this so that the load need not be at the bottom, and so that the LSQ can wrap around.
–If |LSQ| is large, the logic can be adapted to have log delay.
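The wrap-around adjustment is typically handled by comparing positions relative to the queue head. A sketch over a circular store queue (the `StqEntry` layout and the walk-by-age loop are illustrative; hardware uses a parallel priority chain):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical store-queue slot.
struct StqEntry {
    bool     valid, addr_known;
    uint64_t addr, value;
};

// Circular store queue: head = oldest entry. Relative position
// (head + i) mod size gives a wrap-free age, so the load need not sit
// at the "bottom" of the structure.
std::optional<uint64_t>
search_forward(const std::vector<StqEntry>& stq, size_t head,
               size_t load_rel_age,   // # of stores older than the load
               uint64_t load_addr) {
    size_t n = stq.size();
    // Walk from the youngest older store back toward the head; the
    // first match is the forwarding source (priority by age).
    for (size_t i = load_rel_age; i-- > 0; ) {
        const StqEntry& e = stq[(head + i) % n];
        if (e.valid && e.addr_known && e.addr == load_addr)
            return e.value;
    }
    return std::nullopt;  // no hit: take the value from the data cache
}
```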

Similar logic to the previous slide, for the store-side search. [Figure: ST 0x4000, ST 0x4120, ST 0x4000, LD 0x4000, with per-entry "Addr Match", "Is Load", "Capture Value", and "Overwritten" signals against the Data Bank.] This logic is ugly, complicated, slow, and power hungry!

Each store is assigned a unique, increasing number (its color); loads inherit the color of the most recently allocated store.
–All loads between two stores share the same color: we only care about ordering w.r.t. stores, not other loads.
–A load ignores a store's broadcast if the store's color > its own.
–Special care is needed to deal with the eventual overflow/wrap-around of the color/age counter.
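A sketch of the coloring rule, using a deliberately narrow counter to show the modular comparison that handles wrap-around (the width is illustrative):

```cpp
#include <cstdint>

// Store coloring: every store gets the next color; a load inherits the
// color of the most recently allocated store. Colors wrap, so the age
// test must be modular. Hypothetical 8-bit color, assuming fewer than
// 128 stores in flight so the two's-complement difference is unambiguous.
using Color = uint8_t;

// Equal colors mean "the store whose color this load inherited", which
// is older than the load, so <= 0 counts as older.
bool store_is_older(Color store_color, Color load_color) {
    return static_cast<int8_t>(store_color - load_color) <= 0;
}

// A load ignores the broadcast if the store is younger than itself.
bool ignore_broadcast(Color store_color, Color load_color) {
    return !store_is_older(store_color, load_color);
}
```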

When a load receives forwarded data, it still needs to wake up its dependents; the value itself is not needed until the dependents reach the execute stage. Alternative timing/implementation:
–Broadcast the address only.
–When the load wakes up, search the LSQ again (it should hit this time).

[Figure: cycle-by-cycle timing comparing the ideal case (the load, predicted dependent on the store, waits for the STA and forwards immediately) against decoupled scheduling with an LSQ re-search; in both cases the dependent add reaches execute in the same cycle.]
–Even if the load's value is ready early, the dependent op hasn't been scheduled yet.
–So there is no performance benefit to direct ST → LD forwarding at the time of the address broadcast.

We should all know by now that associative searches do not scale well. So how do we manage this?

Split the LSQ into a Store Queue (STQ) and a Load Queue (LDQ):
–Stores don't need to broadcast their address to other stores, so the associative search for later loads (for ST → LD forwarding) only needs to check entries that actually contain loads.
–Loads don't need to check for collisions against earlier loads, so the associative search for earlier stores only needs to check entries that actually contain stores.

Load issue → EA computation → DL1 access and LSQ search in parallel. Typical latencies:
–DL1: 3 cycles
–LSQ search: 1 cycle (more?)
Remember: instructions are speculatively scheduled!

[Figure: pipeline timing for a LOAD and a dependent ADD, assuming an LSQ hit (short latency) vs. a DL1 hit (longer latency).] But at the time of scheduling, how do we know whether it will be an LSQ hit or a DL1 hit?

Can predict the latency:
–Similar to predicting L1 hit vs. L2 hit vs. going to DRAM.
–If we predict an LSQ hit but are wrong: scheduling replay.
–If we predict an L1 hit but are wrong: waste a few cycles.
Or normalize the latencies:
–Make an LSQ hit and an L1 hit have the same latency.
–Greatly simplifies the scheduler.
–Loses some performance, since in theory you could do ST → LD forwarding in less time than the L1 latency; the loss is not too great since most loads do not hit in the LSQ.

Dependence violations can be predicted.
–First time around: an ordering violation between store A and load B is detected; make a note of it.
–Next time around: don't let B issue before all previous STAs are known; once all previous STAs are known, it is safe to issue B.
–The table has a finite number of entries; eventually all will be set to "do not speculate", which is equivalent to a machine with no ordering speculation.
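A minimal sketch of the simplest version: a PC-hashed table of sticky "do not speculate" bits (the size and hash are made up). Its flaw is visible in the code: nothing ever clears a bit.

```cpp
#include <array>
#include <cstdint>

// Naive per-PC "do not speculate" table, 4K entries, PC-hashed.
class ConflictTable {
    std::array<bool, 4096> no_spec_{};
    static size_t index(uint64_t pc) { return (pc >> 2) & 4095; }
public:
    bool may_speculate(uint64_t load_pc) const {
        return !no_spec_[index(load_pc)];
    }
    void record_violation(uint64_t load_pc) {
        no_spec_[index(load_pc)] = true;  // sticky: never cleared, which
    }                                     // is exactly the saturation
};                                        // problem the slide points out
```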

Do similar to branch predictors: use counters.
–Asymmetric costs: mispredicting a T-branch as NT, or an NT-branch as T, makes no difference (need to flush and re-fetch either way); but predicting a no-conflict load as a conflict causes the load to stall unnecessarily while other insts may still execute, whereas predicting a conflict as no-conflict causes a pipeline flush.
–Asymmetric frequencies: no-conflict loads are much more common than conflicting loads.

Asymmetric updates:
–When there is no ordering violation, decrement the counter by 1.
–On an ordering violation, increment by X > 1; choose X based on the frequency of misspeculations and the penalty/performance cost of a misspeculation.
Periodic reset:
–Every K cycles, reset the entire table.
–Works reasonably well, with lower hardware cost than using saturating counters.
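A sketch combining the counters with the periodic reset for illustration; the table size, increment X = 8, counter width, and reset policy are all made-up parameters:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

class ConflictPredictor {
    std::array<uint8_t, 4096> ctr_{};
    static size_t idx(uint64_t pc) { return (pc >> 2) & 4095; }
public:
    // Nonzero counter: predict a conflict, so stall the load until
    // earlier store addresses are known.
    bool predict_conflict(uint64_t pc) const { return ctr_[idx(pc)] > 0; }

    // Asymmetric update: a violation (flush) is far costlier than an
    // unnecessary stall, so it moves the counter much further.
    void update(uint64_t pc, bool violated) {
        uint8_t& c = ctr_[idx(pc)];
        if (violated)   c = static_cast<uint8_t>(std::min(255, c + 8));
        else if (c > 0) --c;
    }

    // Every K cycles, wipe the whole table so stale "conflict" entries
    // cannot pin loads forever.
    void periodic_reset() { ctr_.fill(0); }
};
```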

Explicitly remember which load conflicted with which store.
–An ordering violation between store A and load B is detected; make a note of the pair (A, B).
–Next time around, don't let B issue before A's STA is known (don't have to wait for X and Z).
–Once A's STA is known, even if X and Z are still unknown, it is hopefully safe to issue B.

A load may have conflicts with more than one previous store. [Figure: basic block #1 ends with Store R1 → 0x4000 (A), basic block #2 ends with Store R4 → 0x4000 (B), and basic block #3 contains Load R2 ← 0x4000 (C), so C may conflict with either A or B depending on the path taken.]

–An ordering violation between store A and load C is detected; make a note of it. Later, another violation between B and C is detected; note that too.
–Next time around, don't let C issue before both A's and B's STAs are known (don't have to wait for Z).
–Once A's and B's STAs are known, even if Z is still unknown, it is hopefully safe to issue C.

Store sets use two tables.
–Store Sets Identification Table (SSIT): PCs hash into the SSIT, and the entry indicates the store set; here A, B, and C belong to the same store set (SSID = 4).
–Last Fetched Store Table (LFST): when store A is fetched, the SSIT lookup yields SSID = 4, and the LFST is updated with A's LSQ index (A:L12). When load C is fetched, the SSIT lookup also yields SSID = 4, and the LFST says the load should wait on LSQ entry 12 before issuing.
–If B is fetched before C, then B waits on A and updates the LFST, so C will wait on B.
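A sketch of the two-table mechanism (after Chrysos & Emer); the field widths, table sizes, and SSID allocation policy are illustrative:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// SSIT: PC-indexed table of store-set ids (SSID, 0 = none).
// LFST: per-SSID LSQ index of the last fetched store in that set.
class StoreSets {
    std::array<uint16_t, 4096> ssit_{};
    std::array<std::optional<uint16_t>, 256> lfst_{};
    static size_t idx(uint64_t pc) { return (pc >> 2) & 4095; }
public:
    // Fetched store: wait on the previous store of its set (keeps
    // same-set stores in order), then become the set's last store.
    std::optional<uint16_t> on_store_fetch(uint64_t pc, uint16_t lsq_index) {
        uint16_t ssid = ssit_[idx(pc)];
        if (!ssid) return std::nullopt;
        auto prev = lfst_[ssid];
        lfst_[ssid] = lsq_index;
        return prev;   // LSQ entry this store must wait on, if any
    }
    // Fetched load: must wait on the set's last fetched store, if any.
    std::optional<uint16_t> on_load_fetch(uint64_t pc) const {
        uint16_t ssid = ssit_[idx(pc)];
        return ssid ? lfst_[ssid] : std::nullopt;
    }
    // Ordering violation: put the load and store into the same set.
    // (Simplistic SSID allocation; real designs merge sets carefully.)
    void on_violation(uint64_t load_pc, uint64_t store_pc) {
        static uint16_t next_ssid = 1;
        uint16_t& s = ssit_[idx(store_pc)];
        if (!s) { s = next_ssid; next_ssid = next_ssid % 255 + 1; }
        ssit_[idx(load_pc)] = s;
    }
};
```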

Few processors actually support this.
–The 21264 did; it used the "load wait table".
–The Core 2 supports this now… so this is becoming much more important.
Many machines only use the wait-for-earlier-STAs approach, which becomes a bottleneck as the instruction window size increases.