
1 CS 162 Memory Consistency Models

2 Reordering in Uniprocessors
Memory operations are reordered to improve performance
- Hardware (e.g., store buffer, reorder buffer)
- Compiler (e.g., code motion, caching a value in a register)
The program behaves the same as long as dependences are respected
- a1: St x; a2: Ld y ≡ a2: Ld y; a1: St x

3 Reordering in Multiprocessors
Reordering can cause counter-intuitive program behavior
Initially x = y = 0
P1: a1: x = 1; a2: y = 1;
P2: b1: Ry = y; b2: Rx = x;
Possible outcomes
- (Rx=1, Ry=1), e.g., a1; a2; b1; b2
- (Rx=1, Ry=0), e.g., a1; b1; b2; a2
- (Rx=0, Ry=0), e.g., b1; b2; a1; a2
- (Rx=0, Ry=1), only if memory operations are reordered
Intuitively, y=1 ⇒ x=1, so (Rx=0, Ry=1) is unexpected

4 Reordering in Multiprocessors
Initially p = NULL, flag = false
P1: p = new A(…); flag = true;
P2: if (flag) a = p->var;
flag is supposed to be set after p is allocated. If the two stores on P1 are reordered, P2 can see flag == true while p is still NULL: counter-intuitive program behavior.
Reordering also breaks lock-free algorithms, e.g., Dekker, Peterson
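One way to express the intended ordering in C++ is a release store on flag paired with an acquire load. This is a minimal sketch, not something the slides prescribe: the struct `A` with a single `var` field and the function name `publish_and_read` are made up for illustration, and P2 spins on flag instead of testing it once so that the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct A { int var; };   // hypothetical payload type

// P1 allocates and publishes; P2 waits for flag and then reads p->var.
int publish_and_read(int value) {
    A* p = nullptr;                       // plain pointer, published through flag
    std::atomic<bool> flag{false};
    int a = -1;

    std::thread p1([&] {
        p = new A{value};
        // Release: the write to p cannot be reordered after this store.
        flag.store(true, std::memory_order_release);
    });
    std::thread p2([&] {
        // Acquire: once true is observed, the write to p is visible too.
        while (!flag.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        assert(p != nullptr);
        a = p->var;
    });
    p1.join();
    p2.join();
    delete p;
    return a;
}
```

With plain (non-atomic) variables the same program would have a data race, and both the compiler and the hardware would be free to reorder the two stores.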

5 Reordering in Multiprocessors: Dekker Algorithm (mutual exclusion)
Initially flag1 = flag2 = 0
P1: flag1 = 1; if (flag2 == 0) critical section
P2: flag2 = 1; if (flag1 == 0) critical section
On P1, the store to flag1 (St flag1) can be reordered after the load of flag2 (Ld flag2); the same holds on P2.
After reordering, both loads can return 0, so both processors enter the critical section: counter-intuitive program behavior
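The effect of a store buffer on this example can be checked with a small exhaustive model. The sketch below is a simplification built on stated assumptions: each processor has a one-entry store buffer that drains to memory at an arbitrary later point, loads read memory directly, and the names `State`, `explore`, and `dekker_outcomes` are made up for illustration. It enumerates every schedule and collects the pair of values the two loads return.

```cpp
#include <array>
#include <cassert>
#include <set>
#include <utility>

struct State {
    std::array<int, 2> mem{0, 0};               // mem[i] holds flag(i+1)
    std::array<bool, 2> buffered{false, false}; // store of processor i still in its buffer
    std::array<int, 2> pc{0, 0};                // 0: before St, 1: before Ld, 2: done
    std::array<int, 2> loaded{-1, -1};          // value returned by each processor's Ld
};

using Outcome = std::pair<int, int>;  // (flag2 as read by P1, flag1 as read by P2)

// Explore every possible next event from state s.
void explore(State s, bool store_buffer, std::set<Outcome>& out) {
    bool progressed = false;
    for (int i = 0; i < 2; ++i) {
        if (s.pc[i] == 0) {                     // St own flag
            State n = s;
            if (store_buffer) n.buffered[i] = true; else n.mem[i] = 1;
            n.pc[i] = 1;
            explore(n, store_buffer, out);
            progressed = true;
        } else if (s.pc[i] == 1) {              // Ld the other processor's flag
            State n = s;
            n.loaded[i] = s.mem[1 - i];
            n.pc[i] = 2;
            explore(n, store_buffer, out);
            progressed = true;
        }
        if (s.buffered[i]) {                    // buffered store drains to memory
            State n = s;
            n.mem[i] = 1;
            n.buffered[i] = false;
            explore(n, store_buffer, out);
            progressed = true;
        }
    }
    if (!progressed) out.insert({s.loaded[0], s.loaded[1]});
}

std::set<Outcome> dekker_outcomes(bool store_buffer) {
    std::set<Outcome> out;
    explore(State{}, store_buffer, out);
    return out;
}
```

Calling `dekker_outcomes(false)` models stores that reach memory immediately and never yields (0, 0); `dekker_outcomes(true)` adds (0, 0), the case in which both processors enter the critical section.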

6 Memory Consistency Models
Specify the ordering of loads and stores to different memory locations
- Ld → Ld, Ld → St, St → Ld, St → St
Contract between hardware, compiler, and programmer
- hardware and compiler will not violate the ordering specified
- the programmer will not assume a stricter order than that of the model

7 Memory Consistency Models
Model                     Allowed reordering   Commercial architecture
Sequential Consistency    None                 does not exist
Total Store Ordering      St → Ld              x86, SPARC
Relaxed Memory Order      All                  ARM, PowerPC
Performance: low (Sequential Consistency) to high (Relaxed Memory Order)
Programmability: high (Sequential Consistency) to low (Relaxed Memory Order)
Stronger models
- Stronger constraints
- Fewer memory reorderings
- Easier to reason about
- Lower performance

8 Cache Coherence vs. Memory Model
Cache coherence ensures a consistent view of memory
- Guarantees that an update to memory by one processor will eventually be seen by other processors
But how consistent?
- NO guarantee on when an update will be seen
- NO guarantee on the order in which updates to different locations will be seen

9 Cache Coherence vs. Memory Model
Initially A = B = 0
P1: A = 1;
P2: while (A != 1) ; B = 1;
P3: while (B != 1) ; tmp = A;
tmp = 1? or tmp = 0?
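In C++ the answer depends on how A and B are declared. As a minimal sketch, assuming both are `std::atomic<int>` accessed with the default sequentially consistent ordering, the language guarantees tmp = 1; the function name `run_chain` is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs P1, P2, P3 from the slide and returns the value P3 reads into tmp.
int run_chain() {
    std::atomic<int> A{0}, B{0};
    int tmp = -1;

    std::thread p1([&] { A.store(1); });
    std::thread p2([&] {
        while (A.load() != 1) std::this_thread::yield();
        B.store(1);
    });
    std::thread p3([&] {
        while (B.load() != 1) std::this_thread::yield();
        tmp = A.load();   // ordered after P2's read of A == 1, so this sees 1
    });
    p1.join();
    p2.join();
    p3.join();
    return tmp;
}
```

With plain `int` variables the program would contain data races, and neither the compiler nor a relaxed hardware model would promise tmp = 1.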

10 Sequential Consistency (SC)
Definition [Lamport]
(1) the result of any execution is the same as if the operations of all processors were executed in some sequential order;
(2) the operations of each individual processor appear in this sequence in the order specified by its program.
Processors P1, P2, P3, …, Pn share a single MEMORY
The system behaves as if it repeats:
(1) pick a processor by any method (e.g., randomly)
(2) that processor completes one load/store operation

11 SC Example
P1: a1: x = 1; a2: y = 1;
P2: b1: Ry = y; b2: Rx = x;
Every SC execution is equivalent (≡) to some interleaving that keeps each processor's program order, e.g.
- a1; a2; b1; b2
- a1; b1; b2; a2
- b1; b2; a1; a2, which gives (Rx=0, Ry=0)
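The set of SC outcomes for this program can be computed by enumerating every interleaving that preserves program order. The sketch below assumes the initial values x = y = 0 from the earlier slide; the names `interleave` and `sc_outcomes` are made up for illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Outcome = std::pair<int, int>;   // (Rx, Ry)

// i = number of P1 operations completed (a1, a2),
// j = number of P2 operations completed (b1, b2).
void interleave(int i, int j, int x, int y, int rx, int ry,
                std::set<Outcome>& out) {
    if (i == 2 && j == 2) { out.insert({rx, ry}); return; }
    if (i == 0) interleave(1, j, 1, y, rx, ry, out);   // a1: x = 1
    if (i == 1) interleave(2, j, x, 1, rx, ry, out);   // a2: y = 1
    if (j == 0) interleave(i, 1, x, y, rx, y, out);    // b1: Ry = y
    if (j == 1) interleave(i, 2, x, y, x, ry, out);    // b2: Rx = x
}

std::set<Outcome> sc_outcomes() {
    std::set<Outcome> out;
    interleave(0, 0, 0, 0, 0, 0, out);
    return out;
}
```

The result contains (0, 0), (1, 0), and (1, 1) but never (Rx=0, Ry=1), which therefore requires a reordering that SC forbids.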

12 Sequential Consistency (SC)
Simple and intuitive
- consistent with programmers' intuition
- easy to reason about program behavior
However, the simplicity comes at the cost of performance
- prevents aggressive compiler optimizations (e.g., load reordering, store reordering, caching a value in a register)
- constrains the use of hardware features (e.g., store buffer)

13 SC Violation
P1: a1: x = 1; a2: y = 1
P2: b1: R1 = y; b2: R2 = x
Edges: program order (a1 → a2, b1 → b2) and conflict relation (a2 and b1 on y, b2 and a1 on x)
SC violation
- A cycle formed by program orders and conflict orders [Shasha and Snir, 1988], e.g., (a2, b1, b2, a1, a2)
- Executing in the order (a2, b1, b2, a1) produces R1=1, R2=0, which is not an SC outcome
Insert fences to break the cycle
- a2 cannot be executed before a1
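The cycle condition can be illustrated with an ordinary graph search. This is a simplified sketch, not the static analysis of Shasha and Snir: it takes one execution, directs each conflict edge by the order in which the two accesses actually took effect, and runs a depth-first search for a cycle. The names `Graph`, `has_cycle`, `violating_execution`, and `sc_execution` are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Depth-first search; state 0 = unvisited, 1 = on the current path, 2 = finished.
bool dfs(const Graph& g, const std::string& node,
         std::map<std::string, int>& state) {
    state[node] = 1;
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& next : it->second) {
            if (state[next] == 1) return true;               // back edge: cycle
            if (state[next] == 0 && dfs(g, next, state)) return true;
        }
    }
    state[node] = 2;
    return false;
}

bool has_cycle(const Graph& g) {
    std::map<std::string, int> state;
    for (const auto& kv : g)
        if (state[kv.first] == 0 && dfs(g, kv.first, state)) return true;
    return false;
}

// Execution order (a2, b1, b2, a1): R1 = 1, R2 = 0.
Graph violating_execution() {
    return {
        {"a1", {"a2"}},   // program order on P1
        {"b1", {"b2"}},   // program order on P2
        {"a2", {"b1"}},   // conflict on y: b1 reads the value a2 wrote
        {"b2", {"a1"}},   // conflict on x: b2 reads x before a1 writes it
    };
}

// Execution order (a1, a2, b1, b2): R1 = 1, R2 = 1.
Graph sc_execution() {
    return {
        {"a1", {"a2", "b2"}},   // program order, plus conflict on x
        {"a2", {"b1"}},         // conflict on y
        {"b1", {"b2"}},         // program order on P2
    };
}
```

Here `has_cycle(violating_execution())` is true and `has_cycle(sc_execution())` is false, matching the slide's claim that the non-SC outcome corresponds to a cycle.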

14 Fence Instructions
P1: p = new A(…); FENCE; flag = true;
Fence instructions
- Order memory operations before and after the fence
- Inevitable: required for building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL'11]
- Expensive: Cilk-5's THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI'98]
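In C++ a full fence is available as `std::atomic_thread_fence(std::memory_order_seq_cst)`. The sketch below applies it to the simplified Dekker entry protocol from the earlier slide; the flags use relaxed atomics so that the fence alone provides the ordering. The function names `dekker_round` and `max_entered` are made up for illustration, and because the protocol has no retry loop it is possible for neither thread to enter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one round and returns how many threads entered the critical section.
int dekker_round() {
    std::atomic<int> flag1{0}, flag2{0};
    int entered1 = 0, entered2 = 0;

    std::thread p1([&] {
        flag1.store(1, std::memory_order_relaxed);             // St flag1
        std::atomic_thread_fence(std::memory_order_seq_cst);   // FENCE
        if (flag2.load(std::memory_order_relaxed) == 0)        // Ld flag2
            entered1 = 1;
    });
    std::thread p2([&] {
        flag2.store(1, std::memory_order_relaxed);             // St flag2
        std::atomic_thread_fence(std::memory_order_seq_cst);   // FENCE
        if (flag1.load(std::memory_order_relaxed) == 0)        // Ld flag1
            entered2 = 1;
    });
    p1.join();
    p2.join();
    return entered1 + entered2;
}

// Largest number of simultaneous entries seen over several rounds.
int max_entered(int rounds) {
    int worst = 0;
    for (int r = 0; r < rounds; ++r) {
        int n = dekker_round();
        if (n > worst) worst = n;
    }
    return worst;
}
```

With both fences in place the C++ memory model rules out the case where both loads return 0, so `max_entered` never exceeds 1. Removing the fences makes a result of 2 permissible.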

15 Conservativeness of Fences
Fences are inserted statically and conservatively
- a1: St x; Fence1; a2: Ld y
- b1: St y; Fence2; b2: Ld x
At time T, a1 and a2 have completed; b1 and b2 only execute after time T.
No cycle is formed at runtime

16 Conservativeness of Fences
Fences are inserted statically and conservatively, even when no cycle is formed at runtime
a1 is in a conditional branch
- if (cond) a1: St x; Fence1; a2: Ld y
- b1: St y; Fence2; b2: Ld x
p and q may point to the same memory location
- a1: St *p; Fence1; a2: Ld x
- b1: St x; Fence2; b2: Ld *q

17 Processor-centric Fence
Traditional fence
- Processor-centric: unaware of memory accesses in other processors
However, the purpose of fences is to
- Prevent memory accesses from being reordered and observed by other processors (i.e., a cycle formed at runtime)

18 Address-aware Fences
- Consider the memory locations accessed around fences at runtime
- Fences only take effect when a cycle is about to form

19 Detect and Avoid Cycles
[Figure: Proc 1 executes a1, Fence1, a2; Proc 2 executes b1, Fence2, b2; labels A1, A2, B1, B2; conflict edges c1 and c2]
How to detect c2 efficiently?

20 Detect and Avoid Cycles
[Figure: Proc 1 executes a1, Fence1, a2; Proc 2 executes b1, Fence2, b2; labels A1, A2, B1, B2; conflict edges c1 and c2; a watchlist attached to the fence]
How to detect c2 efficiently?
Collect a watchlist for each fence
A completing memory operation checks the watchlist
- bypass, if its address is not in the watchlist
- stall, otherwise
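The bypass-or-stall check can be modelled in software as a set lookup. This is only an illustrative sketch of a hardware mechanism: the slides do not say how the watchlist is populated, so the caller supplies the addresses, and the names `FenceWatchlist`, `watch`, `check`, and `Action` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

enum class Action { Bypass, Stall };

// Software model of the per-fence watchlist (hypothetical interface).
struct FenceWatchlist {
    std::unordered_set<std::uintptr_t> addresses;

    // Record an address the fence must watch; how it is chosen is an assumption.
    void watch(std::uintptr_t addr) { addresses.insert(addr); }

    // A completing memory operation consults the watchlist.
    Action check(std::uintptr_t addr) const {
        return addresses.count(addr) != 0 ? Action::Stall : Action::Bypass;
    }
};
```

An operation whose address is absent from the watchlist proceeds past the fence, which is why the fence costs little when the accesses around it do not conflict.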

21 Performance: Execution Time
Traditional fence (T) vs. address-aware fence (A)
With address-aware fences, fence overhead becomes negligible

22 Further Reading
- L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691.
- S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29:66–76.
- D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.
- D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture.
- C. Lin, V. Nagarajan, and R. Gupta. Address-aware fences. ICS '13, pages 313–324, 2013.

