The Case for a SC-preserving Compiler. Madan Musuvathi (Microsoft Research); Dan Marino and Todd Millstein (UCLA); Abhay Singh and Satish Narayanasamy (University of Michigan).


1 The Case for a SC-preserving Compiler
Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)

2 Talk Summary
SC-preserving compiler
- Every SC behavior of the binary is a SC behavior of the source
- Guarantees SC assuming SC hardware
A SC-preserving compiler is acceptably efficient
- Enable optimizations only when provably SC-preserving
- With simple, scalable, and readily implementable analysis
- 2% avg, 30% max overhead on SPLASH & PARSEC benchmarks
Static and dynamic analyses can further reduce the performance overhead

3 Many Compiler Optimizations Are Not SC-Preserving
Example: Common Subexpression Elimination (CSE)
t, u, v are local variables; X, Y are possibly shared

Before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

After CSE:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

4 Common Subexpression Elimination Is Not SC-Preserving
Init: X = Y = 0;

Thread 1, before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

Thread 1, after CSE:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

Thread 2 (the same in both cases):
    M1: X = 1;
    M2: Y = 1;

Before: u == 1 ⇒ v == 5
After CSE: possibly u == 1 && v == 0
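The claim on this slide can be checked by brute force. The following is a minimal sketch, not the authors' tooling: it enumerates every sequentially consistent interleaving of the two threads (10 schedules) and reports whether u == 1 && v == 0 is reachable. The names run_schedule and bad_outcome_reachable are hypothetical, and each statement is assumed to execute atomically.

```c
#include <assert.h>
#include <stdbool.h>

/* Final values of thread 1's locals after one execution. */
typedef struct { int u, v; } Outcome;

/* Run one interleaving. schedule[i] == 1 means thread 1 takes the next
 * step, 2 means thread 2 does. Thread 1 has 3 steps (L1..L3), thread 2
 * has 2 (M1, M2). If use_cse is true, L3 is the transformed v = t. */
static Outcome run_schedule(const int schedule[5], bool use_cse)
{
    int X = 0, Y = 0;          /* Init: X = Y = 0 */
    int t = 0, u = 0, v = 0;   /* thread 1 locals */
    int pc1 = 0, pc2 = 0;

    for (int i = 0; i < 5; i++) {
        if (schedule[i] == 1) {
            assert(pc1 < 3);
            if (pc1 == 0)      t = X * 5;                /* L1 */
            else if (pc1 == 1) u = Y;                    /* L2 */
            else               v = use_cse ? t : X * 5;  /* L3 */
            pc1++;
        } else {
            assert(pc2 < 2);
            if (pc2 == 0) X = 1;                         /* M1 */
            else          Y = 1;                         /* M2 */
            pc2++;
        }
    }
    return (Outcome){ u, v };
}

/* True if some SC interleaving ends with u == 1 && v == 0. */
bool bad_outcome_reachable(bool use_cse)
{
    /* Pick which 3 of the 5 slots belong to thread 1. */
    for (unsigned mask = 0; mask < 32; mask++) {
        int schedule[5];
        int thread1_steps = 0;
        for (int i = 0; i < 5; i++) {
            bool is_t1 = ((mask >> i) & 1u) != 0;
            schedule[i] = is_t1 ? 1 : 2;
            thread1_steps += is_t1 ? 1 : 0;
        }
        if (thread1_steps != 3)
            continue;
        Outcome o = run_schedule(schedule, use_cse);
        if (o.u == 1 && o.v == 0)
            return true;
    }
    return false;
}
```

Under these assumptions, bad_outcome_reachable(false) is false and bad_outcome_reachable(true) is true; a witness schedule for the transformed program is L1, M1, M2, L2, L3.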

5 Implementing CSE in a SC-Preserving Compiler
Enable this transformation when
- X is a local variable, or
- Y is a local variable
In these cases, the transformation is SC-preserving
Identifying local variables:
- Compiler-generated temporaries
- Stack-allocated variables whose address is not taken

Before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

After CSE:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;
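The rule above can be written as a small predicate. This is a sketch under stated assumptions: VarInfo, is_local, and cse_across_load_allowed are hypothetical names, and the three flags stand in for whatever facts a real compiler pass would look up. It is not LLVM's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the compiler knows about a variable. */
typedef struct {
    bool compiler_temporary;  /* introduced by the compiler itself */
    bool stack_allocated;     /* lives in the current function's frame */
    bool address_taken;       /* its address is taken, so it may escape */
} VarInfo;

/* A variable is treated as local if it is a compiler-generated temporary,
 * or a stack-allocated variable whose address is not taken. */
bool is_local(const VarInfo *v)
{
    assert(v != NULL);
    return v->compiler_temporary ||
           (v->stack_allocated && !v->address_taken);
}

/* Reusing t for the second X*5 across "u = Y" is enabled only when
 * X is local or Y is local. */
bool cse_across_load_allowed(const VarInfo *x, const VarInfo *y)
{
    return is_local(x) || is_local(y);
}
```

The check is deliberately conservative: anything it cannot prove local is treated as possibly shared, and the optimization is skipped.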

6 A SC-Preserving LLVM Compiler for C Programs
Modify each of ~70 phases in LLVM to be SC-preserving
Enable trace-preserving optimizations
- These do not change the order of memory operations
- e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
Enable transformations on local variables
Enable transformations involving a single shared variable
- e.g. t = X; u = X; v = X;  →  t = X; u = t; v = t;
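One way to picture "trace-preserving" is to record the sequence of possibly shared memory operations before and after a transformation and compare them. The sketch below does only that comparison; MemOp, OpKind, and same_trace are hypothetical names, and a real compiler would reason about all executions rather than one recorded trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_LOAD, OP_STORE } OpKind;

/* One access to a possibly shared variable; var is a hypothetical id. */
typedef struct {
    OpKind kind;
    int    var;
} MemOp;

/* True when both versions perform the same shared accesses in the
 * same order. */
bool same_trace(const MemOp *before, size_t n, const MemOp *after, size_t m)
{
    assert(before != NULL || n == 0);
    assert(after != NULL || m == 0);
    if (n != m)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (before[i].kind != after[i].kind || before[i].var != after[i].var)
            return false;
    }
    return true;
}
```

For instance, unrolling a loop that loads X on each iteration leaves the trace unchanged, while rewriting v = X*5 to v = t removes a load of X and therefore changes the trace. That is why the latter needs the separate local-variable or single-shared-variable argument.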

7 Performance Overhead
Baseline: LLVM -O3
Experiments on Intel Xeon, 8 cores, 2 threads/core, 6GB RAM

8 The Overhead in Facesim
Optimizations in non-hot-loops do not buy much performance
A SC-preserving compiler slows down a program if
- The hot loops involve more than one shared variable, and
- Aliasing constraints do not prevent optimizations in the loop

Original hot loop:
    float s, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        s += (x[i]-y[i]) * (x[i]-y[i]);
        …
    }

Transformed hot loop:
    float s, t, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        t = (x[i]-y[i]);
        s += t*t;
        …
    }

This transformation reduces the overhead from 34% to 6%
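The two loop shapes above can be written as complete functions. This is an illustrative sketch only: the slide elides the loop header and body, so the plain for loop, the length parameter n, and the function names sum_sq_diff and sum_sq_diff_cse are assumptions, not Facesim code.

```c
#include <assert.h>
#include <stddef.h>

/* Original shape: x[i] and y[i] each appear twice per iteration, so a
 * compiler that may not reuse loads of possibly shared data emits four
 * loads per iteration. */
float sum_sq_diff(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += (x[i] - y[i]) * (x[i] - y[i]);
    return s;
}

/* Hand-applied CSE: each element is loaded once and the difference is
 * kept in the local t, which the compiler is free to optimize. */
float sum_sq_diff_cse(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float t = x[i] - y[i];
        s += t * t;
    }
    return s;
}
```

In a single-threaded run both functions compute the same sum. The difference only matters to a compiler that treats x[i] and y[i] as possibly shared and therefore cannot merge the repeated loads on its own.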

9 Improving Performance of SC-Preserving Compiler
Request programmers to reduce shared accesses in hot loops
Use sophisticated static analysis
- Infer more thread-local variables
- Infer data-race-free shared variables
Use program annotations
- Requires changing the programming language
- Minimum annotations sufficient to optimize the hot loops
Perform load optimizations speculatively
- Hardware exposes speculative-load optimization to the software
- Load optimizations reduce the max overhead to 6%

10 Eager-Load Optimizations
Eagerly perform loads or use values from previous loads or stores

Common Subexpression Elimination
Before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;
After:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

Constant/Copy Propagation
Before:
    L1: X = 2;
    L2: u = Y;
    L3: v = X*5;
After:
    L1: X = 2;
    L2: u = Y;
    L3: v = 10;

Loop-Invariant Code Motion
Before:
    L1:
    L2: for(…)
    L3:     t = X*5;
After:
    L1: u = X*5;
    L2: for(…)
    L3:     t = u;

11 Performance Overhead
Allowing eager-load optimizations alone reduces max overhead to 6%

12 Correctness Criteria for Eager-Load Optimizations
Eager-load optimizations rely on a variable remaining unmodified in a region of code
- Sequential validity: no modifications to X by the current thread in L1-L3
- SC-preservation: no modifications to X by any other thread in L1-L3

    L1: t = X*5;    Enable invariant "t == 5*X"
    L2: *p = q;     Maintain invariant "t == 5*X"
    L3: v = X*5;    Use invariant "t == 5*X" to transform L3 to v = t;

13 Speculatively Performing Eager-Load Optimizations
On monitor.load, hardware starts tracking coherence messages on X's cache line
The interference check fails if X's cache line has been downgraded since the monitor.load
In our implementation, a single instruction checks interference on up to 32 tags

Before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

After:
    L1: t = monitor.load(X, tag) * 5;
    L2: u = Y;
    L3: v = t;
    C4: if (interference.check(tag))
    C5:     v = X*5;
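The monitor/check/recover pattern can be modeled in plain software to see the control flow. The sketch below is a single-threaded stand-in, not the proposed hardware: Monitor, monitor_load, remote_store, interference_check, and speculative_cse are hypothetical names, the model watches exact addresses rather than cache lines, and a store by another core is simulated by an explicit call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAGS 32

/* Software stand-in for the hardware state: the address each tag
 * watches, plus one "interfered" bit per tag. */
typedef struct {
    const int *watched[MAX_TAGS];
    uint32_t   interfered;
} Monitor;

void monitor_init(Monitor *m)
{
    for (unsigned tag = 0; tag < MAX_TAGS; tag++)
        m->watched[tag] = NULL;
    m->interfered = 0;
}

/* Models monitor.load(X, tag): load X and start watching it. */
int monitor_load(Monitor *m, const int *addr, unsigned tag)
{
    assert(tag < MAX_TAGS);
    m->watched[tag] = addr;
    m->interfered &= ~(UINT32_C(1) << tag);
    return *addr;
}

/* Models a store by another core: perform it and flag every watcher. */
void remote_store(Monitor *m, int *addr, int value)
{
    *addr = value;
    for (unsigned tag = 0; tag < MAX_TAGS; tag++)
        if (m->watched[tag] == addr)
            m->interfered |= UINT32_C(1) << tag;
}

/* Models interference.check: one test covers any subset of 32 tags. */
bool interference_check(const Monitor *m, uint32_t tag_mask)
{
    return (m->interfered & tag_mask) != 0;
}

/* The transformed L1..C5 sequence; optionally a simulated remote
 * store X = 1 lands between L1 and L2. Returns the final v. */
int speculative_cse(int *X, const int *Y, bool simulate_remote_store)
{
    Monitor m;
    const unsigned tag = 0;
    monitor_init(&m);

    int t = monitor_load(&m, X, tag) * 5;             /* L1 */
    if (simulate_remote_store)
        remote_store(&m, X, 1);
    int u = *Y;                                       /* L2 */
    (void)u;
    int v = t;                                        /* L3 */
    if (interference_check(&m, UINT32_C(1) << tag))   /* C4 */
        v = *X * 5;                                   /* C5 */
    return v;
}
```

With X initially 0, the call without a simulated store keeps the speculated value 0, and the call with the store takes the recovery path and recomputes v from the new X, giving 5. The bitmask argument mirrors the slide's point that one check can cover many tags at once.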

14 Conclusion(s)
Performance cost of SC = 5%
- Cost of SC hardware = 3% [Milo's talk yesterday]
- Cost of SC-preserving compiler = 2%

