CS5102 High Performance Computer Systems Memory Consistency


CS5102 High Performance Computer Systems Memory Consistency Prof. Chung-Ta King Department of Computer Science National Tsing Hua University, Taiwan (Slides are from textbook, Prof. O. Mutlu, Prof. S. Adve, Prof. K. Pingali, Prof. A. Schuster, Prof. R. Gupta)

Outline Introduction Centralized shared-memory architectures (Sec. 5.2) Distributed shared-memory and directory-based coherence (Sec. 5.4) Synchronization: the basics (Sec. 5.5) Models of memory consistency (Sec. 5.6)

As a Programmer, You Expect ... Threads/processors P1 and P2 see the same memory. Initially A = Flag = 0.

P1: A = 23; Flag = 1;
P2: while (Flag != 1) {}; ... = A;

P1 writes data into variable A and then sets Flag to tell P2 that the data value can be read from A. P2 waits till Flag is set and then reads the data from A.

As a Compiler, You May ... Perform the following code movement:

Before: A = 23; Flag = 1;
After: Flag = 1; A = 23;

This is considered safe because there is no data dependence between the two statements. Processors also reorder operations for performance. Note the constraints on reordering: dependences must be obeyed. Data dependences must be respected; in particular, loads/stores to a given memory address must be executed in program order. Control dependences must be respected. Reordering can be performed either by the compiler or by the processor (out-of-order, OOO, architecture), as the sketch below illustrates.
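To see why this matters, here is a minimal C++ sketch (illustrative only; whether the reordering actually happens depends on the compiler and optimization level). Because A and Flag are plain non-atomic globals, the compiler sees no dependence between the two stores in producer() and may legally swap them; the spin loop in consumer() is also a data race, so the program may read 0 or even hang, which is exactly the hazard described above.

    #include <thread>

    int A = 0;
    int Flag = 0;   // plain globals: no ordering guarantees

    void producer() {
        A = 23;     // the compiler may emit this store after the next one,
        Flag = 1;   // since it sees no dependence between A and Flag
    }

    void consumer() {
        while (Flag != 1) { }   // data race on Flag: undefined behavior in C++
        int r = A;              // may observe 0 if the stores were reordered
        (void)r;
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }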

As a Computer Architect, You May ... Use load bypassing in the write buffer (WB). The WB holds stores that still need to be sent to memory. Loads have higher priority than stores because their results are needed to keep the processor busy, so loads may bypass the WB. The load address is checked against the addresses in the WB, and the WB satisfies the load if there is an address match. If there is no match, the load bypasses the buffered stores and accesses memory directly, as sketched below. (Figure: processor, write buffer, and memory system, with loads bypassing the write buffer.)
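The lookup just described can be modeled in a few lines. This is a hypothetical software sketch (names such as WBEntry and do_load, and the map standing in for the memory system, are invented for illustration); a real write buffer is a hardware structure:

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct WBEntry { uint64_t addr; uint64_t data; };

    std::deque<WBEntry> write_buffer;               // pending stores, oldest first
    std::unordered_map<uint64_t, uint64_t> memory;  // stand-in for the memory system

    void do_store(uint64_t addr, uint64_t data) {
        write_buffer.push_back({addr, data});       // the store waits in the WB
    }

    uint64_t do_load(uint64_t addr) {
        // Check the load address against the buffered stores, youngest first.
        for (auto it = write_buffer.rbegin(); it != write_buffer.rend(); ++it)
            if (it->addr == addr)
                return it->data;                    // WB satisfies the load (forwarding)
        // No match: the load bypasses the buffered stores and goes to memory,
        // i.e., it completes ahead of older stores to other addresses.
        return memory[addr];
    }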

So, as a Programmer, You Expect ... Initially A = Flag = 0.

P1: A = 23; Flag = 1;
P2: while (Flag != 1) {}; ... = A;

Expected execution sequence:

P1: st A, 23
P2: ld Flag //get 0
P1: st Flag, 1
P2: ld Flag //get 1
P2: ld A //get 23

Problem: if the two writes in P1 can be reordered, it is possible for P2 to read 0 from variable A.

Problem in Multiprocessor Context The problem becomes even more complex for multiprocessors: if a processor is allowed to reorder independent operations in its own instruction stream (which is safe and allowed in sequential programs), will the execution of a parallel program on a multiprocessor produce correct results as expected by the programmers? Answer: no! There are data dependences across processors!

Example: Primitive Mutual Exclusion Initially Flag1 = Flag2 = 0.

P1: Flag1 = 1; if (Flag2 == 0) critical section
P2: Flag2 = 1; if (Flag1 == 0) critical section

Possible execution sequence:

P1: st Flag1,1
P1: ld Flag2 //get 0
P2: st Flag2,1
P2: ld Flag1 //get ?

Most people would say that P2 will read 1 as the value of Flag1: since P1 reads 0 as the value of Flag2, P1's read of Flag2 must happen before P2 writes to Flag2, and intuitively we would expect P1's write of Flag1 to happen before P2's read of Flag1.

Example: Primitive Mutual Exclusion Possible execution sequence:

P1: st Flag1,1
P1: ld Flag2 //get 0
P2: st Flag2,1
P2: ld Flag1 //get ?

Intuition: P1's write to Flag1 should happen before P2's read of Flag1. This is true only if reads and writes on the same processor to different locations are not reordered. Unfortunately, such reordering is very common on modern processors (e.g., a write buffer with load bypass).
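For reference, modern C++ lets the programmer forbid exactly this reordering: if the flags are sequentially consistent atomics (the default memory order), the store and the subsequent load on each processor cannot be swapped, so at least one thread must observe the other's flag as 1 and the two critical sections cannot overlap. A minimal sketch:

    #include <atomic>
    #include <thread>

    std::atomic<int> Flag1{0}, Flag2{0};

    void p1() {
        Flag1.store(1);             // seq_cst by default: not reordered past the load
        if (Flag2.load() == 0) {
            // critical section
        }
    }

    void p2() {
        Flag2.store(1);
        if (Flag1.load() == 0) {
            // critical section
        }
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join();
        t2.join();
    }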

Mutual Exclusion with Write Buffer Note: we have not yet considered optimizations such as caching, prefetching, speculative execution, ... (Figure: P1 and P2 each hold their store, st Flag1,1 and st Flag2,1, in their write buffers (WB) while their loads, ld Flag2 and ld Flag1, bypass the buffers onto the shared bus; memory still shows Flag1: 0 and Flag2: 0, so both loads return 0 and both P1 and P2 enter the critical section.)

Summary Uniprocessors can reorder instructions subject only to control and data dependence constraints. However, these constraints are not sufficient in shared-memory multiprocessors and may give counter-intuitive results for parallel programs. Question: what constraints must we put on instruction reordering so that parallel programs are executed according to the expectations of the programmers? One important aspect of these constraints is the memory (consistency) model supported by the processor.

Memory Consistency Models A contract between HW/compiler and programmer: hardware and compiler optimizations will not violate the ordering specified in the model, and the programmer should obey the rules specified by the model. Example: if programmers demand getting 23 at P2:

P1: A = 23; Flag = 1;
P2: while (Flag != 1) {}; ... = A; // get 23

To satisfy the required memory access ordering, the hardware or compiler must not perform certain optimizations on the program, e.g. a write buffer with load bypassing, even if many reorderings look safe → it may miss optimization chances.

Memory Consistency Models On the other hand, programmers may acknowledge that such coding can get wrong results:

P1: A = 23; Flag = 1;
P2: while (Flag != 1) {}; ... = A; // get ??

and be willing to make their intentions explicit:

P1: A = 23; V(S);
P2: P(S); ... = A; // get 23

Under such a relaxed memory model, the hardware or compiler can perform optimizations on the unprotected code sequences. One rendering of the semaphore version follows below.
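One concrete rendering of the V(S)/P(S) idiom, assuming C++20's std::binary_semaphore (V corresponds to release and P to acquire): the semaphore carries all the required ordering, so the compiler and hardware remain free to optimize the unprotected accesses.

    #include <semaphore>
    #include <thread>

    int A = 0;
    std::binary_semaphore S{0};    // initially unavailable

    void p1() {
        A = 23;
        S.release();               // V(S): publishes the write to A
    }

    void p2() {
        S.acquire();               // P(S): waits for p1's release
        int r = A;                 // guaranteed to read 23
        (void)r;
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join();
        t2.join();
    }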

Memory Consistency Models Memory consistency model is a contract. Drafting a "good" contract requires careful tradeoffs between programmability and machine performance. Preserving an "expected" (more accurately, "agreed upon") order usually simplifies the programmer's life: ease of debugging, state recovery, exception handling. Preserving an "expected" order often makes the hardware designer's life difficult for optimizations. (Figure: stronger models impose stronger constraints and allow fewer memory reorderings, making programs easier to reason about but lowering performance.)

Let's Start with the Uniprocessor Model Sequential programs running on uniprocessors expect sequential order: hardware executes the load and store operations in the order specified by the sequential program. Out-of-order execution does not change the semantics, because hardware retires (reports to software the results of) loads and stores in the order specified by the sequential program. Advantages: the architectural state is precise within an execution, and the architectural state is consistent across different runs of the program → easier to debug programs. Disadvantage: overhead for preserving order (ROB?).

How about Multiprocessors? Intuitive view from the programmers: operations from a given processor are executed in program order, and memory operations from different processors appear to be interleaved in some order at the memory. Memory switch analogy: the switch services one load or store at a time from any processor; all processors see the currently serviced load/store at the same time; each processor's operations are serviced in program order. (Figure: processors P1, P2, P3, ..., Pn connected through a single switch to memory, with no caches and no write buffers. Equivalently, all processors share a common cache.)

Sequential Consistency Memory Model A multiprocessor system is sequentially consistent (SC) if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, AND the operations of each individual processor appear in this sequence in the order specified by its program. This is a memory ordering model, or memory model. All processors see the same order of memory ops, i.e., all memory ops happen in an order (called the global total order) that is consistent across all processors. Within this global order, each processor's operations appear in sequential order. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, 1979. Consider the memory switch analogy.

Example of Sequential Consistency Program (initially x = y = 0):

P1: a1: x = 1; a2: y = 1;
P2: b1: Ry = y; b2: Rx = x;

One sequentially consistent interleaving is b1: Ry = y; b2: Rx = x; a1: x = 1; a2: y = 1, which yields (Rx = 0, Ry = 0). Any execution that produces the same result is equivalent (≡) to this interleaving, even if the operations were actually serviced in a different order. The equivalence can be understood using the memory switch analogy: the memory switch stays at P1 for a1 and a2, with no other processor's operation in the middle.

Consequences of Sequential Consistency Simple and intuitive: consistent with programmers' intuition and easy to reason about program behavior. Within the same execution, all processors see the same global order of operations to memory, so there is no correctness issue; it satisfies the "happened before" intuition. Across different executions, different global orders can be observed (each of which is sequentially consistent), so debugging is still difficult (as the order changes across runs).

Understanding Program Order Initially X = 2. Both processors increment X:

P1: LD r0,X; ADD r0,r0,#1; ST r0,X
P2: LD r1,X; ADD r1,r1,#1; ST r1,X

Possible execution sequences:

Final X = 3: P1: LD r0,X → P2: LD r1,X → P1: ADD r0,r0,#1 → P1: ST r0,X → P2: ADD r1,r1,#1 → P2: ST r1,X
Final X = 4: P2: LD r1,X → P2: ADD r1,r1,#1 → P2: ST r1,X → P1: LD r0,X → P1: ADD r0,r0,#1 → P1: ST r0,X

Both sequences are sequentially consistent, yet the first loses an update: sequential consistency has nothing to do with atomicity.
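In C++ terms: even if every individual load and store is sequentially consistent, the three-step increment is not atomic and the lost update above can still occur; an atomic read-modify-write is what rules it out. A minimal sketch:

    #include <atomic>

    std::atomic<int> X{2};

    // Not atomic as a whole: two threads can both load 2 and both store 3,
    // so the final value may be 3 even though every access is seq_cst.
    void unsafe_increment() {
        int r = X.load();   // LD r,X
        X.store(r + 1);     // ADD + ST r,X
    }

    // A single atomic read-modify-write: with two increments the final
    // value is always 4.
    void safe_increment() {
        X.fetch_add(1);
    }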

Coherence vs. Consistency Coherence concerns only one memory location; consistency concerns ordering across all locations. A memory system is coherent if: all operations to a given location can be serialized; operations by any processor appear in program order; and a read returns the value written by the last store to that location. Coherence gives NO guarantees on when an update will be seen or in what order updates to different locations will be seen. A memory system is consistent if it follows the rules of its memory model, so that operations on memory locations appear in some defined order. Serialization: assume a shared bus or directory. Last store: who decides which one is the "last store"? Does sequential consistency also imply coherence? Coherence makes the caches in a multicore system functionally invisible, as in a single-core system; once caches are invisible, the behavior that remains is defined by consistency. A consistency model can be defined separately, without coherence.

Coherence vs. Consistency: Example Is this execution sequence possible on an SC multiprocessor? Can this sequence happen on a coherent multiprocessor? How would one build a machine to make it happen?

Parallel program:
P1: a1: x = 1; a2: y = 1;
P2: b1: Ry = y; b2: Rx = x;

Execution sequence: a2: y = 1; b1: Ry = y; b2: Rx = x; a1: x = 1;

Both P1 and P2 see the same load/store sequence to x as well as to y, but coherence says nothing about when a store will be seen, e.g. when x = 1 should be seen! A machine to generate this execution sequence: (1) reads and writes to different data can be reordered, and (2) a read takes more time than a write. We thus say that sequential consistency is a stronger, or stricter, model than coherence.

Problems with SC Memory Model Difficult to implement efficiently in hardware. Straightforward implementations allow no concurrency among memory accesses and enforce strict ordering of memory accesses at each node, essentially precluding out-of-order processors. Also unnecessarily restrictive: many code sequences in a parallel program could actually be reordered safely, but are prohibited by the SC model. What can we do about it? Ask the programmer to do more, e.g. give explicit hints, so that the hardware/compiler can optimize more freely → weaken or relax the memory consistency model.

Weaker Memory Consistency Models Programmers take on more of the burden: they explicitly tell the machine which sequences of code cannot be reordered and which can. The multiprocessor must provide programmers some means, e.g. instructions or libraries, to make such specifications. It is now the responsibility of the programmers to give proper specifications for parallel programs to execute correctly. What is the simplest thing programmers can do in order to relax SC?

Relaxed Model: Weak Consistency Programmers insert fence instructions in programs to order memory operations before and after the fence: all data operations before the fence in program order must complete before the fence executes; all data operations after the fence in program order must wait for the fence to complete; and fences are performed in program order. Synchronization primitives, such as barriers, can serve as fences. A fence propagates writes to and from a machine at appropriate points, similar to flushing the memory operations. Example:

P1: p = new A(...); FENCE; flag = true;

The FENCE forces the write ordering of p and flag. A C++ rendering follows below.
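A sketch of this idiom using C++ fences (C++ splits the full fence into release and acquire halves, which is sufficient here; flag is made atomic so the spin loop itself is race-free):

    #include <atomic>

    struct A { int v = 0; };

    A* p = nullptr;
    std::atomic<bool> flag{false};

    void p1() {
        p = new A{};
        std::atomic_thread_fence(std::memory_order_release); // orders the write to p before flag
        flag.store(true, std::memory_order_relaxed);
    }

    void p2() {
        while (!flag.load(std::memory_order_relaxed)) { }
        std::atomic_thread_fence(std::memory_order_acquire); // orders flag before the read of p
        A* q = p;   // guaranteed to see the fully constructed object
        (void)q;
    }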

Weak Ordering Implementation of a fence: the processor has a counter that is incremented when a data operation is issued and decremented when a data operation completes; the fence completes only when the counter reaches zero (a toy model is sketched below). (Figure: program execution as a timeline punctuated by fences; memory operations within the regions between fences can be reordered by hardware or compiler optimizations.)
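A toy software model of that counter (hypothetical names; in real hardware this is a small counter next to the load/store unit, not code):

    #include <atomic>

    std::atomic<int> pending_ops{0};   // outstanding data operations

    void issue_data_op()    { pending_ops.fetch_add(1); /* access starts */ }
    void complete_data_op() { pending_ops.fetch_sub(1); /* access done  */ }

    void fence() {
        // The fence completes only after all earlier data ops have drained;
        // operations after the fence are not issued until then.
        while (pending_ops.load() != 0) { /* stall */ }
    }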

Example of Fence Initially A = Flag = 0.

P1: A = 23; fence; Flag = 1;
P2: while (Flag != 1) {}; ... = A;

Execution: P1 writes data into A; the fence waits till the write to A is completed; P1 then writes data to Flag. If P2 sees Flag == 1, it is guaranteed to read 23 from A even if the memory operations in P1 before and after the fence are reordered by HW or compiler.

Tradeoffs: Weak Consistency Advantage: no need to guarantee a very strict order of memory operations, which keeps the hardware implementation of performance enhancement techniques simpler and can give higher performance than stricter ordering. Disadvantage: more burden on the programmer or software (need to get the "fences" correct). Another example of the programmer-microarchitect tradeoff.

More Relaxed Model: Release Consistency Divide the fence into two one-way fences: one guards against reordering with what comes before it, the other with what comes after it. Acquire: must complete (be acquired) before all following memory accesses, e.g. operations like lock. Release: can proceed only after all memory operations before the release are completed (released), e.g. operations like unlock. However, an acquire does not wait for memory accesses preceding it, and memory accesses after a release in program order do not have to wait for the release (see the spinlock sketch below). K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
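The lock/unlock correspondence maps directly onto C++ memory orders. A minimal spinlock sketch in which lock() is an acquire operation and unlock() is a release:

    #include <atomic>

    std::atomic_flag lk = ATOMIC_FLAG_INIT;

    void lock() {
        // Acquire: later memory accesses may not move above this point,
        // but the acquire does not wait for accesses that precede it.
        while (lk.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }

    void unlock() {
        // Release: completes only after all earlier memory accesses,
        // but accesses after it in program order need not wait for it.
        lk.clear(std::memory_order_release);
    }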

Release Consistency Doing an acquire means that writes on other processors to protected variables will be known. Doing a release means that writes to protected variables are exported and will be seen by other processors when they do an acquire (lazy release consistency) or immediately (eager release consistency).

Understanding Release Consistency (Figure: a timeline in which P1 writes X ← 1 and then performs rel(L1), while P2 reads X before and after its acq(L1); initially X = 0. From the release point on in this execution, all processors must see the value 1 in X.) The programmer is sure that the read after acq(L1) returns 1 because the release and the acquire were of the same lock; if they belonged to different locks, the programmer could not assume that the acquire happens after the release. For the read before the acquire, it is undefined what value is read: it can be any value written by some processor. 1 is read in the current execution, but the programmer cannot be sure 1 will be read in all executions. For the read after the acquire, the programmer knows that in all executions it returns 1.

Acquire and Release Acquire and release are used not only for synchronization of execution but also for synchronization of memory, i.e. for the propagation of writes from/to other processors. A release serves as a memory-synch operation, or a flush of local modifications to all other processors. A release followed by an acquire of the same lock guarantees to the programmer that all writes before the release will be seen by all reads following the acquire.

Happened-Before Relation (Figure: a timeline of three processors, P1, P2, and P3, each performing reads r(...), writes w(...), and acquire/release operations acq/rel on locks L1 and L2; matching rel/acq pairs on the same lock induce the happened-before ordering across processors.) There is no direct dependency between a release and an acquire of different locks, for example between the first rel on one processor and the first acq on another when they involve different locks.

Example of Acquire and Release (Figure: a sequence of loads/stores L/S, then ACQ, then L/S, then REL, then L/S. Which operations can be overlapped? The orderings that must be observed are: the ACQ must complete before the loads/stores that follow it, and the loads/stores before the REL must complete before it. Loads/stores before the ACQ and after the REL can be overlapped with the rest.)

Comments In the literature there are a large number of other consistency models: processor consistency, total store order (TSO), etc. It is important to remember that these are all concerned with the reordering of independent memory operations within a processor. It is easy to come up with shared-memory programs that behave differently under each consistency model. There is an emerging consensus that weak/release consistency is adequate.

Summary The consistency model is multiprocessor specific, and programmers will often implement explicit synchronization. The model constrains the reordering of memory operations within each processor; it has nothing really to do with memory operations from different processors/threads. Sequential consistency: perform global memory operations in program order. Relaxed consistency models: all of them rely on some notion of a fence operation that delineates regions within which reordering is permissible.