
1 Are We Trading Consistency Too Easily? A Case for Sequential Consistency
Madan Musuvathi (Microsoft Research), Dan Marino, Todd Millstein (UCLA), Abhay Singh, Satish Narayanasamy (University of Michigan)

2 Memory Consistency Model
A memory consistency model abstracts the program runtime (compiler + hardware):
  Hides compiler transformations
  Hides hardware optimizations, cache hierarchy, ...
Sequential consistency (SC) [Lamport '79]: "The result of any execution is the same as if the operations were executed in some sequential order, and the operations of each individual thread appear in this sequence in program order."

3 Sequential Consistency Explained

    int X = F = 0;   // F = 1 implies X is initialized

    Thread 1:      Thread 2:
    X = 1;         t = F;
    F = 1;         u = X;

Under SC, the two threads' operations interleave with each thread's operations in program order, so the possible outcomes are t=1,u=1; t=0,u=1; and t=0,u=0. In particular, t=1 implies u=1.
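To make the slide's example concrete, here is a minimal runnable rendering, assuming POSIX threads; the plain int accesses are a data race in ISO C terms and are kept only to mirror the slide, not as a recommended idiom.

    #include <pthread.h>
    #include <stdio.h>

    int X = 0, F = 0;   /* F = 1 is meant to imply X is initialized */

    void *writer(void *arg) {
        (void)arg;
        X = 1;
        F = 1;
        return NULL;
    }

    void *reader(void *arg) {
        (void)arg;
        int t = F;
        int u = X;
        /* Under SC, t == 1 implies u == 1; a relaxed model may print t=1 u=0. */
        printf("t=%d u=%d\n", t, u);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }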

4 Conventional Wisdom
SC is slow:
  Disables important compiler optimizations
  Disables important hardware optimizations
Relaxed memory models are faster

5 Conventional Wisdom
"SC is slow":
  Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. '91, ..., Blundell et al. '09]
  Compiler optimizations that break SC provide negligible performance improvement [PLDI '11]
"Relaxed memory models are faster":
  Need fences for correctness
  Programmers conservatively add more fences than necessary
  Libraries use the strongest fence necessary for all clients
  Fence implementations are slow
  Efficient fence implementations require speculation support
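To illustrate "need fences for correctness": on a relaxed model, the slide-3 example needs fences on both sides to restore "F = 1 implies X is initialized". A minimal sketch using C11 fences (assuming a C11 toolchain; the plain int accesses remain formally racy and are used only to mirror the slide):

    #include <stdatomic.h>

    int X = 0, F = 0;

    void writer(void) {
        X = 1;
        atomic_thread_fence(memory_order_seq_cst);  /* keep the store to X before the store to F */
        F = 1;
    }

    int reader(void) {
        int t = F;
        atomic_thread_fence(memory_order_seq_cst);  /* keep the load of F before the load of X */
        int u = X;
        return (t == 1) ? u : -1;  /* with the fences, t == 1 implies u == 1 */
    }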

6 Implementing Sequential Consistency Efficiently
Pipeline: source (src: t = X;) -> SC-Preserving Compiler -> binary (asm: mov eax, [X];) -> SC Hardware.
SC-preserving compiler: every SC behavior of the binary is an SC behavior of the source.
SC hardware: every observed runtime behavior is an SC behavior of the binary.
This talk focuses on the SC-preserving compiler.

7 Challenge: Important Compiler Optimizations Are Not SC-Preserving
Example: common subexpression elimination (CSE). Here t, u, v are local variables; X, Y are possibly shared.

    before:             after CSE:
    L1: t = X*5;        L1: t = X*5;
    L2: u = Y;          L2: u = Y;
    L3: v = X*5;        L3: v = t;

8 Common Subexpression Elimination Is Not SC-Preserving

    Init: X = Y = 0;

    Thread 1 (before CSE):    Thread 2:
    L1: t = X*5;              M1: X = 1;
    L2: u = Y;                M2: Y = 1;
    L3: v = X*5;

    Thread 1 (after CSE):
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

In the original program, u == 1 implies v == 5 under SC: if L2 sees M2's store, then L3 must see M1's store. After CSE, u == 1 && v == 0 is possible, because L3 reuses the value X had at L1.

9 Implementing CSE in an SC-Preserving Compiler

    L1: t = X*5;        L1: t = X*5;
    L2: u = Y;    =>    L2: u = Y;
    L3: v = X*5;        L3: v = t;

Enable this transformation when X is a local variable or Y is a local variable; in these cases the transformation is SC-preserving.
Identifying local variables:
  Compiler-generated temporaries
  Stack-allocated variables whose address is not taken
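A sketch of the legality check this slide describes, using hypothetical IR types and helper predicates (Value, is_compiler_temp, is_stack_slot, address_taken are illustrative, not the actual LLVM API):

    typedef struct Value Value;

    int is_compiler_temp(const Value *v);   /* compiler-generated temporary   */
    int is_stack_slot(const Value *v);      /* stack-allocated variable       */
    int address_taken(const Value *v);      /* address escapes, may be shared */

    /* A variable is treated as thread-local if it is a compiler temporary or
     * a stack slot whose address is never taken. */
    static int is_local(const Value *v) {
        return is_compiler_temp(v) || (is_stack_slot(v) && !address_taken(v));
    }

    /* CSE of X*5 across the intervening access to Y (the pattern above) is
     * SC-preserving when either X or Y is thread-local. */
    int cse_is_sc_preserving(const Value *X, const Value *Y) {
        return is_local(X) || is_local(Y);
    }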

10 An SC-Preserving LLVM Compiler for C Programs
Modify each of the ~70 phases in LLVM to be SC-preserving, without any additional analysis:
  Enable trace-preserving optimizations, which do not change the order of memory operations (e.g., loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, ...); see the sketch below.
  Enable transformations on local variables.
  Enable transformations involving a single shared variable (e.g., t = X; u = X; v = X;  =>  t = X; u = t; v = t;).
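A small sketch of a trace-preserving transformation, the first category above: full loop unrolling changes neither the number nor the order of shared memory accesses, so every SC behavior of the result is an SC behavior of the original. Names are illustrative; X is a possibly shared array, i is local.

    int X[2];

    void before(void) {
        for (int i = 0; i < 2; i++)     /* shared stores: X[0], then X[1] */
            X[i] = i;
    }

    void after_unrolling(void) {        /* same shared stores, same order */
        X[0] = 0;
        X[1] = 1;
    }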

11 Average Performance Overhead Is ~2%
Baseline: LLVM -O3. Experiments on an Intel Xeon, 8 cores, 2 threads/core, 6 GB RAM.
[Bar chart of per-benchmark overheads of the SC-preserving compiler relative to the baseline; chart data not recoverable from the transcript.]

12 How Far Can an SC-Preserving Compiler Go?

Source:
    float s, *x, *y; int i;
    s=0;
    for( i=0; i<n; i++ ){
      s += (x[i]-y[i]) * (x[i]-y[i]);
    }

No optimization (explicit address arithmetic, redundant loads):
    float s, *x, *y; int i;
    s=0;
    for( i=0; i<n; i++ ){
      s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float)))
         * (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
    }

SC-preserving compiler (strength reduction on local induction variables, but the shared loads are not reused):
    float s, *x, *y;
    float *px, *py, *e;
    s=0; py=y; e = &x[n];
    for( px=x; px<e; px++, py++ ){
      s += (*px-*py) * (*px-*py);
    }

Full optimization (also reuses the shared loads via the local temporary t):
    float s, *x, *y;
    float *px, *py, *e, t;
    s=0; py=y; e = &x[n];
    for( px=x; px<e; px++, py++ ){
      t = (*px-*py);
      s += t*t;
    }

13 We Can Reduce the FaceSim Overhead (If We Cheat a Bit)
30% of the overhead comes from the inability to perform CSE in the expression below. But argument evaluation in C is nondeterministic: the specification explicitly allows overlapped evaluation of function arguments.

    return MATRIX_3X3 (
      x[0]*A.x[0]+x[3]*A.x[1]+x[6]*A.x[2],  x[1]*A.x[0]+x[4]*A.x[1]+x[7]*A.x[2],  x[2]*A.x[0]+x[5]*A.x[1]+x[8]*A.x[2],
      x[0]*A.x[3]+x[3]*A.x[4]+x[6]*A.x[5],  x[1]*A.x[3]+x[4]*A.x[4]+x[7]*A.x[5],  x[2]*A.x[3]+x[5]*A.x[4]+x[8]*A.x[5],
      x[0]*A.x[6]+x[3]*A.x[7]+x[6]*A.x[8],  x[1]*A.x[6]+x[4]*A.x[7]+x[7]*A.x[8],  x[2]*A.x[6]+x[5]*A.x[7]+x[8]*A.x[8] );
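A minimal sketch of the "cheat" (f, X, Y are illustrative, not from the slide's code): in straight-line code, reusing the first X*5 for the third argument across the load of Y would not be SC-preserving. But C leaves the evaluation order of function arguments unspecified and allows their evaluations to overlap, so the compiler may evaluate both X*5 arguments back to back and then Y, and merging the two loads of X is then an SC behavior of a permitted evaluation order.

    extern int X, Y;                    /* possibly shared */
    int f(int a, int b, int c);

    int call(void) {
        return f(X * 5, Y, X * 5);      /* compiler may load X once for both uses */
    }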

14 Improving Performance of the SC-Preserving Compiler
  Request programmers to reduce shared accesses in hot loops (see the sketch below).
  Use sophisticated static analysis: infer more thread-local variables; infer data-race-free shared variables.
  Use program annotations: requires changing the programming language; minimal annotations suffice to optimize the hot loops.
  Perform load optimizations speculatively: hardware exposes the speculative-load optimization to the software; load optimizations reduce the maximum overhead to 6%.
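A sketch of the first suggestion (names are illustrative, not from the talk): the programmer caches a shared value in a local before a hot loop. The local copy is thread-private, so the compiler can reuse it freely while remaining SC-preserving, and the loop issues one shared load of the scale instead of n.

    float shared_scale;                         /* possibly shared across threads */

    float sum_scaled(const float *a, int n) {
        float scale = shared_scale;             /* single shared load of the scale */
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += scale * a[i];                  /* loop reuses only the local copy */
        return s;
    }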

15 Conclusion
  Hardware should support strong memory models: TSO is efficiently implementable [Mark Hill]; speculation support for SC over TSO is not currently justifiable; can we quantify the programmability cost of TSO?
  Compiler optimizations should preserve the hardware memory model.
  High-level programming models can abstract TSO/SC, further enabling compiler/hardware optimizations and improving programmer productivity, testability, and debuggability.

16 Eager-Load Optimizations
Eagerly perform loads, or reuse values from previous loads or stores.

Common subexpression elimination:
    L1: t = X*5;        L1: t = X*5;
    L2: u = Y;    =>    L2: u = Y;
    L3: v = X*5;        L3: v = t;

Constant/copy propagation:
    L1: X = 2;          L1: X = 2;
    L2: u = Y;    =>    L2: u = Y;
    L3: v = X*5;        L3: v = 10;

Loop-invariant code motion:
    L1:                 L1: u = X*5;
    L2: for(...)  =>    L2: for(...)
    L3:   t = X*5;      L3:   t = u;

17 Performance Overhead
Allowing eager-load optimizations alone reduces the maximum overhead to 6%.
[Bar chart of per-benchmark overheads; chart data not recoverable from the transcript.]

18 Correctness Criteria for Eager-Load Optimizations
Eager-load optimizations rely on a variable remaining unmodified in a region of code.

    L1: t = X*5;     // establishes the invariant t == X*5
    L2: *p = q;      // must maintain the invariant t == X*5
    L3: v = X*5;     // uses the invariant to transform L3 into v = t;

Sequential validity: no modifications to X by the current thread in L1-L3.
SC-preservation: no modifications to X by any other thread in L1-L3.
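To illustrate the sequential-validity requirement with a hypothetical fragment (names follow the slide): if p can point to X, the store at L2 destroys the invariant t == X*5, so rewriting L3 to v = t is not even sequentially valid; alias analysis must rule this out. SC-preservation additionally requires that no other thread writes X between L1 and L3.

    int X = 1;
    int *p = &X;            /* p aliases X */
    int q = 7;

    void example(void) {
        int t, v;
        t = X * 5;          /* L1: establishes t == X*5 (t == 5)                */
        *p = q;             /* L2: writes X, so t == X*5 no longer holds        */
        v = X * 5;          /* L3: must reload X (v == 35); v = t would be stale */
    }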

19 Speculatively Performing Eager-Load Optimizations

    L1: t = X*5;          L1: t = monitor.load(X, tag) * 5;
    L2: u = Y;      =>    L2: u = Y;
    L3: v = X*5;          L3: v = t;
                          C4: if (interference.check(tag))
                          C5:     v = X*5;

On a monitor.load, the hardware starts tracking coherence messages on X's cache line. The interference check fails if X's cache line has been downgraded since the monitor.load, in which case the compiler-emitted recovery code at C5 re-executes the load. In our implementation, a single instruction checks interference on up to 32 tags.

