Published by Caiden Lifton. Modified over 3 years ago.

1
The Case for a SC-preserving Compiler
Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)

2
Talk Summary
SC-preserving compiler
    Every SC behavior of the binary is a SC behavior of the source
    Guarantees SC assuming SC hardware
A SC-preserving compiler is acceptably efficient
    Enable optimizations only when provably SC-preserving
    With simple, scalable, and readily implementable analysis
    2% avg, 30% max overhead on SPLASH & PARSEC benchmarks
Static and dynamic analyses can further reduce the performance overhead

3
Many Compiler Optimizations Are Not SC-Preserving
Example: Common Subexpression Elimination (CSE)
t, u, v are local variables
X, Y are possibly shared

    Before CSE:          After CSE:
    L1: t = X*5;         L1: t = X*5;
    L2: u = Y;           L2: u = Y;
    L3: v = X*5;         L3: v = t;

4
Common Subexpression Elimination Is Not SC-Preserving
Init: X = Y = 0;

    Thread 1, before CSE:    Thread 1, after CSE:    Thread 2:
    L1: t = X*5;             L1: t = X*5;            M1: X = 1;
    L2: u = Y;               L2: u = Y;              M2: Y = 1;
    L3: v = X*5;             L3: v = t;

Before CSE: u == 1 implies v == 5
After CSE: possibly u == 1 && v == 0
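The claim on this slide can be checked mechanically. The sketch below is not from the talk: it assumes each labeled statement executes atomically and enumerates every SC interleaving of L1-L3 with M1-M2, asking whether the outcome u == 1 && v == 0 is reachable. The names State, reach, and bad_outcome_reachable are hypothetical helpers introduced only for this illustration.

```c
#include <stdbool.h>

/* Shared variables X, Y plus thread-1 locals t, u, v for one simulated run. */
typedef struct { int X, Y, t, u, v; } State;

/* Thread 1, statement i (0..2). use_cse selects the optimized form of L3. */
static void thread1_step(State *s, int i, bool use_cse) {
    if (i == 0)      s->t = s->X * 5;                   /* L1 */
    else if (i == 1) s->u = s->Y;                       /* L2 */
    else             s->v = use_cse ? s->t : s->X * 5;  /* L3 */
}

/* Thread 2, statement j (0..1). */
static void thread2_step(State *s, int j) {
    if (j == 0) s->X = 1;  /* M1 */
    else        s->Y = 1;  /* M2 */
}

/* Depth-first search over all interleavings. Returns true if at least
   one complete interleaving ends with u == 1 && v == 0. */
static bool reach(State s, int i, int j, bool use_cse) {
    if (i == 3 && j == 2) return s.u == 1 && s.v == 0;
    if (i < 3) {
        State n = s;
        thread1_step(&n, i, use_cse);
        if (reach(n, i + 1, j, use_cse)) return true;
    }
    if (j < 2) {
        State n = s;
        thread2_step(&n, j);
        if (reach(n, i, j + 1, use_cse)) return true;
    }
    return false;
}

/* Entry point: start from Init: X = Y = 0. */
bool bad_outcome_reachable(bool use_cse) {
    State init = {0, 0, 0, 0, 0};
    return reach(init, 0, 0, use_cse);
}
```

Under these assumptions, bad_outcome_reachable(false) reports that the source program cannot produce u == 1 && v == 0, while bad_outcome_reachable(true) finds the interleaving L1, M1, M2, L2, L3 in which the CSE version does.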

5
Implementing CSE in a SC-Preserving Compiler
Enable this transformation when
    X is a local variable, or
    Y is a local variable
In these cases, the transformation is SC-preserving
Identifying local variables:
    Compiler-generated temporaries
    Stack-allocated variables whose address is not taken

    Before CSE:          After CSE:
    L1: t = X*5;         L1: t = X*5;
    L2: u = Y;           L2: u = Y;
    L3: v = X*5;         L3: v = t;
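The enabling condition above can be written as a small predicate. This is a minimal sketch, not the LLVM implementation: VarInfo and both function names are hypothetical, and the three flags are assumed to have been computed by an earlier compiler pass.

```c
#include <stdbool.h>

/* Hypothetical per-variable facts; field names are illustrative only. */
typedef struct {
    bool compiler_temp;    /* compiler-generated temporary */
    bool stack_allocated;  /* lives on the current thread's stack */
    bool address_taken;    /* its address escapes into a pointer */
} VarInfo;

/* A variable is treated as local if it is a compiler temporary, or a
   stack variable whose address is never taken. */
bool is_provably_local(VarInfo v) {
    return v.compiler_temp || (v.stack_allocated && !v.address_taken);
}

/* CSE of X*5 across the load of Y is enabled only when X or Y is local. */
bool cse_across_load_allowed(VarInfo x, VarInfo y) {
    return is_provably_local(x) || is_provably_local(y);
}
```

The check is deliberately conservative: any variable that is not provably local is treated as possibly shared, so the transformation is simply skipped in that case.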

6
A SC-Preserving LLVM Compiler for C Programs
Modify each of ~70 phases in LLVM to be SC-preserving
Enable trace-preserving optimizations
    These do not change the order of memory operations
    e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
Enable transformations on local variables
Enable transformations involving a single shared variable
    e.g. t = X; u = X; v = X;  becomes  t = X; u = t; v = t;

7
Performance Overhead
Baseline: LLVM -O3
Experiments on Intel Xeon, 8 cores, 2 threads/core, 6 GB RAM
[Figure: bar chart of per-benchmark performance overhead; bar labels not recoverable]

8
The Overhead in Facesim
Optimizations in non-hot loops do not buy much performance
A SC-preserving compiler slows down a program if
    The hot loops involve more than one shared variable, and
    Aliasing constraints do not prevent optimizations in the loop

Original loop:

    float s, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        s += (x[i]-y[i]) * (x[i]-y[i]);
        …
    }

Manually transformed loop:

    float s, t, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        t = (x[i]-y[i]);
        s += t*t;
        …
    }

This transformation reduces the overhead from 34% to 6%

9
Improving Performance of SC-Preserving Compiler
Request programmers to reduce shared accesses in hot loops
Use sophisticated static analysis
    Infer more thread-local variables
    Infer data-race-free shared variables
Use program annotations
    Requires changing the programming language
    Minimum annotations sufficient to optimize the hot loops
Perform load optimizations speculatively
    Hardware exposes speculative-load optimization to the software
    Load optimizations reduce the max overhead to 6%

10
Eager-Load Optimizations
Eagerly perform loads or use values from previous loads or stores

Common Subexpression Elimination

    Before:              After:
    L1: t = X*5;         L1: t = X*5;
    L2: u = Y;           L2: u = Y;
    L3: v = X*5;         L3: v = t;

Constant/Copy Propagation

    Before:              After:
    L1: X = 2;           L1: X = 2;
    L2: u = Y;           L2: u = Y;
    L3: v = X*5;         L3: v = 10;

Loop-Invariant Code Motion

    Before:              After:
    L1:                  L1: u = X*5;
    L2: for(…)           L2: for(…)
    L3:   t = X*5;       L3:   t = u;

11
Performance Overhead
Allowing eager-load optimizations alone reduces max overhead to 6%
[Figure: bar chart of per-benchmark performance overhead; bar labels not recoverable]

12
Correctness Criteria for Eager-Load Optimizations
Eager-load optimizations rely on a variable remaining unmodified in a region of code
    Sequential validity: no modifications to X by the current thread in L1-L3
    SC-preservation: no modifications to X by any other thread in L1-L3

    L1: t = X*5;     Enable invariant "t == 5*X"
    L2: *p = q;      Maintain invariant "t == 5*X"
    L3: v = X*5;     Use invariant "t == 5*X" to transform L3 to v = t;

13
Speculatively Performing Eager-Load Optimizations
On monitor.load, hardware starts tracking coherence messages on X's cache line
The interference check fails if X's cache line has been downgraded since the monitor.load
In our implementation, a single instruction checks interference on up to 32 tags

Before:

    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

After:

    L1: t = monitor.load(X, tag) * 5;
    L2: u = Y;
    L3: v = t;
    C4: if (interference.check(tag))
    C5:     v = X*5;
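The control flow of this scheme can be illustrated with a purely software model. The sketch below is an assumption-laden toy, not the proposed hardware: Monitor, monitor_load, remote_store, interference_check, and run_speculative are hypothetical names, interference is tracked per exact address rather than per cache line, and the "other thread" is simulated by a call made between L1 and L2 in the same thread.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_TAGS 32  /* the slide mentions checking up to 32 tags */

/* Toy stand-in for the hardware monitor: one watched address per tag. */
typedef struct {
    const int *addr[NUM_TAGS];  /* address watched under each tag */
    bool       hit[NUM_TAGS];   /* set when a simulated remote write hits it */
} Monitor;

/* Models monitor.load: start watching addr under tag and return its value. */
int monitor_load(Monitor *m, const int *addr, int tag) {
    m->addr[tag] = addr;
    m->hit[tag] = false;
    return *addr;
}

/* Models a write by another thread, which would downgrade the line. */
void remote_store(Monitor *m, int *addr, int value) {
    *addr = value;
    for (int k = 0; k < NUM_TAGS; k++)
        if (m->addr[k] == addr) m->hit[k] = true;
}

/* Models interference.check: true means the speculation must be repaired. */
bool interference_check(const Monitor *m, int tag) {
    return m->hit[tag];
}

/* The transformed code from the slide. If racing_writer is set, M1 and M2
   are simulated between L1 and L2. Returns v and stores u in *u_out. */
int run_speculative(int *X, int *Y, bool racing_writer, int *u_out) {
    Monitor m = {{NULL}, {false}};
    int t = monitor_load(&m, X, 0) * 5;        /* L1 */
    if (racing_writer) {
        remote_store(&m, X, 1);                /* M1 */
        remote_store(&m, Y, 1);                /* M2 */
    }
    *u_out = *Y;                               /* L2 */
    int v = t;                                 /* L3: speculative reuse */
    if (interference_check(&m, 0))             /* C4 */
        v = *X * 5;                            /* C5: redo the load */
    return v;
}
```

In this model the recovery path at C4-C5 restores the source-level guarantee: when the simulated writer intervenes, u is 1 and v is recomputed as 5 instead of keeping the stale value 0.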

14
Conclusion(s)
Performance cost of SC = 5%
    Cost of SC hardware = 3% [Milo's talk yesterday]
    Cost of SC-preserving compiler = 2%
