Lecture 20: Speculation


1 Lecture 20: Speculation
Papers:
• Is SC+ILP=RC?, Purdue, ISCA’99
• Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04
• Selective, Accurate, and Timely Self-Invalidation using Last Touch Prediction, Purdue, ISCA’00

2 Consistency: Summary
To preserve sequential consistency:
• hardware must preserve program order for all memory operations (including waiting for acks)
• writes to a location must be serialized
• the value of a write cannot be read unless all have seen the write (it is ok if writes to different locations are not seen in the same order, as long as conflicting reads do not happen)
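The sequential consistency requirements above can be illustrated with a small litmus test. The following is a minimal sketch, not taken from the slides: it assumes two hypothetical threads P0 and P1 (a store-buffering pattern), enumerates every interleaving that preserves program order, and collects the values the two loads can return. Under SC, both loads returning 0 should never appear.

```python
from itertools import combinations

# Hypothetical two-thread litmus test; all locations start at 0, every store writes 1.
# Each op is (kind, location, destination register or None).
P0 = [("st", "A", None), ("ld", "B", "r0")]
P1 = [("st", "B", None), ("ld", "A", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(order):
    """Execute the ops one at a time against a single shared memory."""
    mem = {"A": 0, "B": 0}
    regs = {}
    for kind, loc, reg in order:
        if kind == "st":
            mem[loc] = 1
        else:
            regs[reg] = mem[loc]
    return regs["r0"], regs["r1"]

def sc_outcomes():
    """Set of (r0, r1) results reachable under sequential consistency."""
    return {run(order) for order in interleavings(P0, P1)}
```

Calling `sc_outcomes()` returns the set of reachable `(r0, r1)` pairs; the pair `(0, 0)` is absent because it would require each load to precede the other thread's store while also following its own thread's store.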

3 Relaxations
• IBM 370: a read can complete before an earlier write to a different address, but a read cannot return the value of a write unless all processors have seen the write
• SPARC V8 Total Store Ordering (TSO): a read can complete before an earlier write to a different address, but a read cannot return the value of a write by another processor unless all processors have seen the write (it returns the value of own write before others see it)
• Processor Consistency (PC): a read can complete before an earlier write (by any processor to any memory location) has been made visible to all

Relaxation | W→R Order | W→W Order | R→RW Order | Rd others’ Wr early | Rd own Wr early
IBM 370 | X | | | |
TSO | X | | | | X
PC | X | | | X | X
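The W→R relaxation can be sketched with per-processor store buffers. The model below is an illustrative assumption, not the slides' hardware: each thread has a FIFO store buffer, a load first checks its own buffer (reading its own write early, as in TSO) and otherwise reads memory, and buffered stores drain to memory at arbitrary times. The programs P0 and P1 are hypothetical.

```python
# Hypothetical litmus test; all locations start at 0 and every store writes 1.
P0 = [("st", "A", None), ("ld", "B", "r0")]
P1 = [("st", "B", None), ("ld", "A", "r1")]

def tso_outcomes(p0, p1, locations=("A", "B")):
    """Explore every execution of two threads with FIFO store buffers (TSO-like)."""
    progs = (p0, p1)
    outcomes = set()

    def explore(pcs, bufs, mem, regs):
        progressed = False
        for t in (0, 1):
            # Option 1: thread t executes its next instruction.
            if pcs[t] < len(progs[t]):
                progressed = True
                kind, loc, reg = progs[t][pcs[t]]
                pcs2 = list(pcs)
                pcs2[t] += 1
                if kind == "st":
                    # The store is buffered; other processors cannot see it yet.
                    bufs2 = [list(b) for b in bufs]
                    bufs2[t].append(loc)
                    explore(pcs2, bufs2, mem, regs)
                else:
                    # The load bypasses buffered stores to other addresses,
                    # and reads its own buffered write early.
                    val = 1 if loc in bufs[t] else mem[loc]
                    explore(pcs2, bufs, mem, {**regs, reg: val})
            # Option 2: the oldest buffered store of thread t becomes visible.
            if bufs[t]:
                progressed = True
                bufs2 = [list(b) for b in bufs]
                loc = bufs2[t].pop(0)
                explore(pcs, bufs2, {**mem, loc: 1}, regs)
        if not progressed:
            outcomes.add(tuple(regs[r] for r in sorted(regs)))

    explore([0, 0], [[], []], {loc: 0 for loc in locations}, {})
    return outcomes
```

With these programs, `tso_outcomes(P0, P1)` includes `(0, 0)`, the result that sequential consistency forbids. A model of IBM 370 would differ in the load path: the load could not return its own buffered value early.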

4 Release Consistency
More distinctions among memory operations:
• Instructions are classified as non-special and special; special instructions are further classified as acquires and releases
• All instructions after an acquire can begin only after the acquire has completed
• A release can begin only after all previous instructions have completed
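The two ordering rules above can be written as a small checker. This is a sketch under simplifying assumptions: each operation is treated as atomic (it begins and completes at one point), and data dependences and same-address ordering are ignored. The function name and the example program are hypothetical.

```python
def is_legal_rc_order(program, order):
    """Check an execution order against the two release-consistency rules.

    program: list of op kinds in program order ("load", "store", "acquire", "release").
    order:   permutation of program indices, giving the order in which ops execute.
    """
    position = {idx: pos for pos, idx in enumerate(order)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            # Rule 1: nothing after an acquire may move above it.
            # Rule 2: a release may not move above anything before it.
            must_stay_ordered = program[i] == "acquire" or program[j] == "release"
            if must_stay_ordered and position[i] > position[j]:
                return False
    return True

# Hypothetical critical section surrounded by ordinary accesses.
program = ["store", "acquire", "load", "store", "release", "load"]
```

For this program, the order `[1, 0, 2, 3, 5, 4]` is accepted: the first store sinks below the acquire and the last load rises above the release. The order `[2, 1, 0, 3, 4, 5]` is rejected because a load inside the critical section runs before the acquire.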

5 An Out-of-Order Processor Implementation
[Figure: out-of-order pipeline. Components shown: Branch prediction and instr fetch, Instr Fetch Queue, Decode & Rename, Reorder Buffer (ROB), Issue Queue (IQ), ALU, Register File R1-R32. Results are written to the ROB and tags are broadcast to the IQ.]
Original code: R1 ← R1+R2; R2 ← R1+R3; BEQZ R2; R3 ← R1+R2; R1 ← R3+R2
ROB entries: Instr 1 to Instr 6, with tags T1 to T6
Renamed code: T1 ← R1+R2; T2 ← T1+R3; BEQZ T2; T4 ← T1+T2; T5 ← T4+T2
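The renaming step in the figure can be reproduced with a few lines of code. This is a minimal sketch, assuming that instruction i is given the tag Ti of its ROB entry and that a rename table maps each architectural register to the most recent tag that writes it; the tuple encoding and function name are hypothetical.

```python
def rename(instrs):
    """Rename architectural registers to ROB tags T1, T2, ...

    instrs: list of (opcode, dest_or_None, [source registers]) in program order.
    Returns the same list with sources and destinations renamed.
    """
    rename_table = {}  # architectural register -> tag of its latest producer
    renamed = []
    for i, (op, dest, srcs) in enumerate(instrs, start=1):
        # Sources read the latest in-flight producer, or the register file if none.
        new_srcs = [rename_table.get(s, s) for s in srcs]
        tag = f"T{i}" if dest is not None else None
        if dest is not None:
            rename_table[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed

# The five instructions from the slide.
code = [
    ("add", "R1", ["R1", "R2"]),
    ("add", "R2", ["R1", "R3"]),
    ("beqz", None, ["R2"]),
    ("add", "R3", ["R1", "R2"]),
    ("add", "R1", ["R3", "R2"]),
]
```

Running `rename(code)` yields T1 ← R1+R2, T2 ← T1+R3, BEQZ T2, T4 ← T1+T2, T5 ← T4+T2, matching the renamed code on the slide. The branch occupies the third ROB entry but writes no register, so no instruction names T3 as a source.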

6 SC Execution
ROB contents (in program order): St A, Ld B, Ld C. The world hasn’t yet seen the store to A.
• Ld-B should really execute only when it reaches the head of the ROB
• It can execute early, but when it reaches the head of the ROB, it must make sure that the read value is still valid
• A table can keep track of speculatively read values and flag a violation if it detects a write (similar to LL-SC)
• An overflow of the table will conservatively flag a violation
• No problem if Ld-C executes sooner than Ld-B
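The tracking table described above might look like the following. This is a sketch under stated assumptions, not the slides' design: the class name, the per-load entries, and the fixed capacity are all hypothetical. A load that executes early is recorded; a write by another processor to the same block marks it as violated; a load that does not fit in the table is conservatively treated as violated.

```python
class SpeculativeLoadTable:
    """Tracks loads that executed before reaching the head of the ROB."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}      # load id -> block address it read
        self.violated = set()  # load ids whose value can no longer be trusted

    def record(self, load_id, block):
        """Called when a load executes speculatively."""
        if len(self.entries) >= self.capacity:
            # Overflow: the load cannot be tracked, so flag it conservatively.
            self.violated.add(load_id)
        else:
            self.entries[load_id] = block

    def remote_write(self, block):
        """Called when another processor's write to this block is observed."""
        for load_id, tracked in self.entries.items():
            if tracked == block:
                self.violated.add(load_id)

    def commit(self, load_id):
        """Called at the ROB head; returns True if the early value is still valid."""
        self.entries.pop(load_id, None)
        if load_id in self.violated:
            self.violated.discard(load_id)
            return False  # squash and re-execute from this load
        return True
```

For example, after recording `Ld-B` and `Ld-C`, a remote write to block B makes `commit("Ld-B")` return False while `commit("Ld-C")` still returns True.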

7 SC Execution
ROB contents (in program order): Ld A, St B, St C. Ld-A hasn’t yet executed.
• St-B can acquire permissions early; if the permissions are lost before the store reaches the ROB head, permissions are re-acquired
• St-B can also write the result into the cache/memory early, but the old value must be preserved somewhere; if someone else asks for the data, a violation is flagged

8 Three Implementations
SC:
• Loads can execute speculatively, but must check during commit
• Relatively small ROB size can hold up commit
• Stores can fetch permissions speculatively
RC:
• Software-assigned constraints
• Orderings allowed among non-special ops
• Loads can execute speculatively across fences (as in SC)
SC++:
• Stores are allowed to write to cache early (old values are maintained in a log)
• Instructions are copied out of the ROB into a Speculative History Queue (SHiQ), which effectively provides a much larger ROB
• (Long-latency loads can tie up the ROB, but long-latency stores cannot)
• To check for conflicts, a Block Look-up Table (BLT) maintains the relevant block addresses for instructions in the SHiQ
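The SC++ idea of writing stores to the cache early while keeping old values in a log can be sketched as an undo log. This is an illustrative assumption about the mechanism, not the paper's implementation: the cache is modeled as a dictionary, and the class and method names are hypothetical.

```python
from collections import deque

_ABSENT = object()  # marks an address that was not in the cache before the store

class SpeculativeStoreLog:
    """Lets stores update the cache early and undoes them on a violation."""

    def __init__(self, cache):
        self.cache = cache   # dict: address -> value
        self.log = deque()   # (address, old value), oldest store first

    def speculative_store(self, addr, value):
        """Write the cache now, remembering what was there before."""
        self.log.append((addr, self.cache.get(addr, _ABSENT)))
        self.cache[addr] = value

    def commit_oldest(self):
        """The oldest speculative store is now safe; its old value can be dropped."""
        self.log.popleft()

    def rollback(self):
        """On a violation, restore old values from youngest to oldest."""
        while self.log:
            addr, old = self.log.pop()
            if old is _ABSENT:
                del self.cache[addr]
            else:
                self.cache[addr] = old
```

Undoing the log from the youngest entry backward matters when two speculative stores hit the same address: the value left in the cache is the one written by the last committed store.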

9 Results

10 Violations
Sources of violations:
• true sharing: two threads that r/w the same word
• false sharing: two threads that r/w different words in the same block
• conflict/capacity misses in the cache: if a block is evicted, a violation is conservatively flagged
Events that trigger a violation:
• a speculatively loaded block is written by another node
• a speculatively loaded block is evicted from the cache
• a speculatively stored block is accessed by another node before the store commits (need not cause rollback; just have to re-acquire permissions when ready to commit)

11 Coherence Decoupling
• In addition to the correct cache coherence protocol, introduce a Speculative Cache Lookup (SCL) protocol
• In essence, value prediction for the coherence protocol
• Mechanisms incorporated for correct SC execution can be leveraged to handle violations
• SC++ allows us to overlap memory instructions, but a load cannot proceed with execution until data+perms arrive
• Coherence decoupling attempts to pass the results of the load to dependents before arrival of permissions/data

12 Coherence Decoupling
Why it works:
• False sharing: the line is invalidated, but the relevant word is unchanged
• Speculative updates: updated values are selectively pushed to improve the chances of finding correct data
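The false-sharing case can be sketched as a speculative read of a stale word followed by a check once the coherent data arrives. This is a simplified illustration with hypothetical function names; a cache line is modeled as a list of words.

```python
def speculative_lookup(stale_line, word_index):
    """Return the word from an invalidated line so dependents can start early."""
    return stale_line[word_index]

def speculation_correct(stale_line, coherent_line, word_index):
    """Once the coherent line arrives, only the accessed word has to match."""
    return stale_line[word_index] == coherent_line[word_index]

# Hypothetical 4-word line: a remote processor rewrote word 1 only.
stale = [10, 20, 30, 40]
coherent = [10, 99, 30, 40]
```

A load of word 3 gets the value 40 early and the later check succeeds, because the invalidation was caused by a write to a different word. A load of word 1 fails the check and would be squashed using the same recovery mechanism that handles SC violations.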

13 Read and Update Components
• Filter: branch-pred-like confidence mechanism to avoid mis-speculations
• IA: send data with invalidation message (higher bandwidth cost)
• C: send data only if it is {-1, 0, 1}
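The filter and the C update component can be sketched as follows. The saturating-counter organization, its 2-bit width, the threshold, and the choice to reset on a mis-speculation are assumptions made for illustration; the slides only state that the filter resembles a branch predictor's confidence mechanism. All names are hypothetical.

```python
class ConfidenceFilter:
    """Saturating counters that gate speculation, one per key (e.g. block address)."""

    def __init__(self, bits=2, threshold=2):
        self.max_count = (1 << bits) - 1
        self.threshold = threshold
        self.counters = {}

    def should_speculate(self, key):
        """Speculate only after enough recent correct outcomes."""
        return self.counters.get(key, 0) >= self.threshold

    def update(self, key, correct):
        """Count up on a correct speculation; reset on a mis-speculation."""
        count = self.counters.get(key, 0)
        self.counters[key] = min(count + 1, self.max_count) if correct else 0

def fits_compressed_update(value):
    """C component: the new value is sent only if it is -1, 0, or 1."""
    return value in (-1, 0, 1)
```

With the default settings, a key needs two correct outcomes before `should_speculate` returns True, and a single mis-speculation turns speculation off again for that key.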

14 Potential and Accuracy

15 Performance

16 Self Invalidation
• To reduce the cost of invalidating shared blocks, blocks can “prefetch” this effect by invalidating themselves in advance
• Can perform self invalidation when leaving a critical section
• Can perform self invalidation by observing the sequence of instructions that eventually leads to the last touch (truncated addition is used to generate a signature; may require complex branch predictor-like structures)
• Cache decay predictors have good accuracy in a single-threaded setting and may have similar benefits in multi-threaded settings as well
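The signature idea can be sketched as a running truncated sum of the PCs that touch a block. The table organization, the 12-bit signature width, and the names below are assumptions for illustration only. When a block is invalidated by another node, the signature accumulated so far is remembered as a last-touch signature; when the same signature recurs, the block is predicted dead and can self-invalidate.

```python
class LastTouchPredictor:
    """Predicts the last touch to a block from a signature of accessing PCs."""

    def __init__(self, sig_bits=12):
        self.mask = (1 << sig_bits) - 1  # truncation width is an arbitrary choice
        self.current = {}     # block -> running signature since the block was fetched
        self.last_touch = {}  # block -> signatures observed just before invalidation

    def access(self, block, pc):
        """Record an access; return True if this looks like the last touch."""
        sig = (self.current.get(block, 0) + pc) & self.mask  # truncated addition
        self.current[block] = sig
        if sig in self.last_touch.get(block, set()):
            del self.current[block]  # block self-invalidates; the trace restarts
            return True
        return False

    def invalidated(self, block):
        """An external invalidation arrived: learn the signature that preceded it."""
        sig = self.current.pop(block, None)
        if sig is not None:
            self.last_touch.setdefault(block, set()).add(sig)
```

On the first pass through a sequence of accesses the predictor only learns; on a repeat of the same sequence it flags the final access as the last touch.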

17 Summary
• May need large windows to hide the cost of long memory operations
• Can overlap operations, can speculate on values
• Will need checks to make sure SC is not violated and values are speculated correctly
