Idempotent Code Generation: Implementation, Analysis, and Evaluation. Marc de Kruijf, Karthikeyan Sankaralingam. CGO 2013, Shenzhen.

1 Idempotent Code Generation: Implementation, Analysis, and Evaluation
Marc de Kruijf, Karthikeyan Sankaralingam
CGO 2013, Shenzhen

2 Example: source code
    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

3 Example: assembly code
    R2 = load [R1]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3
(figure annotations: faults, exceptions, mis-speculations)

4 Example: assembly code
    R2 = load [R1]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3
BAD STUFF HAPPENS!

5 Example: assembly code
    R2 = load [R1]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3
convention: use checkpoints/buffers
R0 and R1 are unmodified: just re-execute!
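
The "just re-execute" idea can be sketched at the C level. This is only an illustration under an assumed fault model: `sum_region`, `sum_with_recovery`, and the `fault_after` parameter are hypothetical names, not part of the talk. Because the region never overwrites its inputs, recovery needs no checkpoint; it simply starts the region again.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed fault model (hypothetical): the region aborts once it has
 * completed `fault_after` loop iterations; a negative value means
 * "never fault". */
static bool sum_region(const int *array, int len, int fault_after, int *out)
{
    int x = 0;                    /* local state, rebuilt on every attempt */
    for (int i = 0; i < len; ++i) {
        if (i == fault_after)
            return false;         /* simulated fault: partial work is dropped */
        x += array[i];
    }
    *out = x;
    return true;
}

/* Recovery by re-execution: the inputs (array, len) are never
 * overwritten, so running the region again from its entry is enough. */
int sum_with_recovery(const int *array, int len, int fault_after)
{
    int result = 0;
    if (!sum_region(array, len, fault_after, &result)) {
        bool ok = sum_region(array, len, -1, &result);  /* fault-free replay */
        assert(ok);
        (void)ok;
    }
    return result;
}
```

Calling `sum_with_recovery` with or without a simulated fault returns the same sum, which is the property the slide relies on.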

6 It’s Idempotent!
idempoh… what…?
    int sum(int *data, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += data[i];
        return x;
    }

7 Idempotent Region Construction: previously… in PLDI ’12
idempotent regions ALL THE TIME
(figure: code before and after region construction)

8 Idempotent Code Generation: now… in CGO ’13
how do we get from here...
    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

9 Idempotent Code Generation: now… in CGO ’13
to here...
    R2 = load [R1]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3

10 Idempotent Code Generation: now… in CGO ’13
not here (this is not idempotent: the accumulator overwrites the input register R1)...
    R2 = load [R1]
    R1 = 0
    LOOP:
    R4 = load [R0 + R2]
    R1 = add R1, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R1
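
To see why reusing R1 for the accumulator breaks re-execution, the two register allocations can be replayed on a toy machine. This is a sketch under assumptions: the `Regs` struct, `run_region`, the word-addressed memory layout, and the fault injected after one loop iteration are all hypothetical, chosen only to mirror the register names on the slides.

```c
#include <assert.h>
#include <stdbool.h>

enum { MEM_WORDS = 8 };                 /* toy word-addressed memory */
typedef struct { int r0, r1, r2, r3, r4; } Regs;

static int load(const int *mem, int addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    return mem[addr];
}

/* Runs the sum region. With clobber_input the accumulator lives in R1
 * (an input register); otherwise it lives in R3. Returns false when the
 * simulated fault fires after `fault_after` iterations (negative: never). */
static bool run_region(Regs *r, const int *mem, bool clobber_input,
                       int fault_after)
{
    int *acc = clobber_input ? &r->r1 : &r->r3;
    r->r2 = load(mem, r->r1);           /* R2 = load [R1] */
    *acc = 0;                           /* R3 = 0   (or R1 = 0) */
    for (int done = 0; r->r2 != 0; ++done) {
        if (done == fault_after)
            return false;
        r->r4 = load(mem, r->r0 + r->r2);   /* R4 = load [R0 + R2] */
        *acc += r->r4;                      /* acc = add acc, R4 */
        r->r2 -= 1;                         /* R2 = sub R2, 1 */
    }
    return true;
}

/* Faults once, then re-executes from the region entry using whatever
 * the registers hold at that point (no checkpoint is restored). */
int sum_after_one_fault(bool clobber_input)
{
    const int mem[MEM_WORDS] = {3, 4, 1, 2};  /* mem[0] = len, mem[1..3] = data */
    Regs r = { .r0 = 0, .r1 = 0 };
    if (!run_region(&r, mem, clobber_input, 1))
        run_region(&r, mem, clobber_input, -1);
    return clobber_input ? r.r1 : r.r3;
}
```

With the accumulator in R3 the replay still yields 7. With the accumulator in R1 the replay loads the length from the wrong address and returns 4 for this particular memory image.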

11 Idempotent Code Generation: now… in CGO ’13
and not here (this is slow)...
    R3 = R1
    R2 = load [R3]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3

12 Idempotent Code Generation: now… in CGO ’13
here...
    R2 = load [R1]
    R3 = 0
    LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
    EXIT:
    return R3

13 Idempotent Code Generation: applications to prior work
(figure: faults, exceptions, mis-speculations)
- Hampton & Asanović, ICS ’06
- De Kruijf & Sankaralingam, MICRO ’11
- Menon et al., ISCA ’12
- Kim et al., TOPLAS ’06
- Zhang et al., ASPLOS ’13
- De Kruijf et al., ISCA ’10
- Feng et al., MICRO ’11
- De Kruijf et al., PLDI ’12

14 Idempotent Code Generation: executive summary
(1) how do we generate efficient idempotent code? (not covered in this talk)
    algorithms made available in source code form: http://research.cs.wisc.edu/vertical/iCompiler
(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more

15 Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

16 Analysis: (a) idempotent region size
(plot: overhead vs. region size)
- number of inputs increasing
- likelihood of spills growing
- maximum spill cost reached
- amortized over more instructions

17 Analysis: (b) ISA characteristics
(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 -> R3    idempotent? YES!
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)
(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
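
The difference between the two ADD forms in point (1) can be checked by executing each one twice, as a re-execution would. The helper names below (`add_two_address`, `add_three_address`, and the `replay_*` wrappers) are hypothetical and only model the instructions in C.

```c
#include <assert.h>

/* Two-address form, ADD R1, R2 -> R1: the destination is also a source. */
static void add_two_address(int *r1, int r2) { *r1 = *r1 + r2; }

/* Three-address form, ADD R1, R2 -> R3: the sources are left untouched. */
static void add_three_address(int r1, int r2, int *r3) { *r3 = r1 + r2; }

/* Execute the instruction, then execute it again as a recovery would. */
int replay_two_address(int r1, int r2)
{
    add_two_address(&r1, r2);
    add_two_address(&r1, r2);     /* second run sees the overwritten R1 */
    return r1;
}

int replay_three_address(int r1, int r2)
{
    int r3 = 0;
    add_three_address(r1, r2, &r3);
    add_three_address(r1, r2, &r3);   /* second run recomputes the same R3 */
    return r3;
}

/* Small self-check: 5 + 3 replayed in each form. */
void check_add_forms(void)
{
    assert(replay_two_address(5, 3) == 11);   /* 5 + 3 + 3: not idempotent */
    assert(replay_three_address(5, 3) == 8);  /* still 5 + 3: idempotent   */
}
```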

18 Analysis: (c) control flow side-effects
    x = ...
    ... = f(x)
    y = ...
(figure: region boundaries, x’s live interval, and x’s “shadow interval” given no side-effects)

19 Analysis: (c) control flow side-effects
    x = ...
    ... = f(x)
    y = ...
(figure: region boundaries, x’s live interval, and x’s “shadow interval” given side-effects)

20 Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

21 Evaluation: methodology
measurements
– performance overhead: dynamic instruction count
  – for x86, using PIN
  – for ARM, using gem5
– region size: instructions between boundaries (path length)
benchmarks
– SPEC 2006, PARSEC, and Parboil suites

22 Evaluation: (a) idempotent region size
(plot: overhead vs. region size)
YOU ARE HERE (baseline: typically 10-30 instructions)
10+ instructions: 13.1% (geometric mean)

23 Evaluation: (a) idempotent region size
(plot: overhead vs. region size, with detection latency marked; labeled value: 13.1%)

24 Evaluation: (a) idempotent region size
(plot: overhead vs. region size, with detection latency and re-execution time marked; labeled values: 0.06%, 13.1%, 11.1%)

25 Evaluation: (b) ISA characteristics
(chart: percentage overhead, x86-64 vs. ARMv7)
Three-address support matters more for FP benchmarks
Register-memory matters more for integer benchmarks

26 Evaluation: (c) control flow side-effects
(chart: percentage overhead, no side-effects vs. side-effects)
substantial only in two cases; insubstantial otherwise
intuition: typically compiler already spills for control flow divergence

27 Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation

28 Conclusions
(a) region size
– matters a lot; large regions are ideal if recovery is infrequent
  overheads approach zero as regions grow
(b) instruction set
– matters when region sizes must be small
  overheads drop below 10% only with careful co-design
(c) control flow side-effects
– generally does not matter
  supporting control flow side-effects is not expensive

29 Conclusions
code generation and static analysis algorithms: http://research.cs.wisc.edu/vertical/iCompiler
applicability not limited to architecture design: see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”
thank you!

30 Back-up Slides

31 ISA Characteristics: more registers isn’t always enough
C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;
machine code:
    R0 = 0
    if (R1 > 0) R0 = 1
    R2 = R0 + R1

32 ISA Characteristics: more registers isn’t always enough
C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;
machine code:
    R0 = 0
    if (R1 > 0) R3 = 1
    else        R3 = R0
    R2 = R3 + R1
need an extra instruction no matter what

33 ISA Characteristics: idempotence vs. fewer registers
(chart: percentage overhead; idempotence vs. no idempotence with #GPR reduced from 16)
data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)

34 Very Large Regions: how do we get there?
Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; hurts loops
Problem #2: loop optimizations
– boundaries in loops are bad for everyone (next slides)
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures
– awareness of array access patterns can help (next slides)
Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above

35 Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) { ... }
CFG + SSA:
    i0 = φ(0, i1)
    i1 = i0 + 1
    if (i1 < X)

36 Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) { ... }
machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)
NO BOUNDARIES = NO PROBLEM

37 Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) { ... }
machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

38 Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) { ... }
machine code:
    R1 = 0
    R0 = R1
    R1 = R0 + 1
    if (R1 < X)
– “redundant” copy
– extra boundary (pressure)
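
The cost of a cut inside the loop can be illustrated by replaying each region once. The sketch below assumes (this placement is not spelled out in the transcript) that the boundary falls between the copy `R0 = R1` and the increment `R1 = R0 + 1`, so each statement forms its own region. The function names and the `iters` counter are hypothetical instrumentation that sits outside the regions.

```c
#include <assert.h>

/* Single-register induction variable: R0 = R0 + 1 reads and then
 * overwrites R0, so replaying the region increments twice. */
int iterations_single_register(int x)
{
    int r0 = 0, iters = 0;
    do {
        r0 = r0 + 1;              /* first execution */
        r0 = r0 + 1;              /* replay sees the clobbered R0 */
        ++iters;                  /* instrumentation only */
    } while (r0 < x);
    return iters;
}

/* With the "redundant" copy and an assumed boundary between the copy
 * and the increment, each region overwrites only a register it does
 * not read, so replaying it is harmless. */
int iterations_with_copy(int x)
{
    int r0, r1 = 0, iters = 0;
    do {
        r0 = r1;                  /* region A */
        r0 = r1;                  /* region A replayed */
        r1 = r0 + 1;              /* region B */
        r1 = r0 + 1;              /* region B replayed */
        ++iters;                  /* instrumentation only */
    } while (r1 < x);
    return iters;
}

/* Small self-check for a loop bound of 4. */
void check_loop_regions(void)
{
    assert(iterations_with_copy(4) == 4);        /* correct trip count */
    assert(iterations_single_register(4) == 2);  /* replay skips iterations */
}
```

The copy and the added boundary are the price the slide points at: one more instruction per iteration and shorter regions.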

39 Very Large Regions: Re: Problem #3 (array access patterns)
    [x] = a;
    b = [x];
    [x] = c;
becomes:
    [x] = a;
    b = a;
    [x] = c;
non-clobber antidependences… GONE!
PLDI ’12 algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays
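
The rewrite on this slide forwards the stored value to the later load, so the region no longer reads the location it overwrites. A minimal C sketch of the two forms follows; the `Outcome` struct and the function names are hypothetical, and `[x]` is modeled as a plain `int *`.

```c
#include <assert.h>

typedef struct { int b; int x_final; } Outcome;

/* Original form: store, load, store to the same location. */
Outcome region_original(int *x, int a, int c)
{
    *x = a;                       /* [x] = a */
    int b = *x;                   /* b = [x] */
    *x = c;                       /* [x] = c */
    return (Outcome){ b, *x };
}

/* Forwarded form: the load is replaced by the value just stored,
 * so the region contains no read of [x] at all. */
Outcome region_forwarded(int *x, int a, int c)
{
    *x = a;                       /* [x] = a */
    int b = a;                    /* b = a   */
    *x = c;                       /* [x] = c */
    return (Outcome){ b, *x };
}

/* Returns 1 when both forms agree on b and on the final contents of
 * [x], starting from the same initial memory value. */
int same_outcome(int initial, int a, int c)
{
    int cell1 = initial, cell2 = initial;
    Outcome o1 = region_original(&cell1, a, c);
    Outcome o2 = region_forwarded(&cell2, a, c);
    return o1.b == o2.b && o1.x_final == o2.x_final && cell1 == cell2;
}
```

Both forms compute the same values; the forwarded one simply leaves the analysis with nothing to flag on `[x]`.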

40 Very Large Regions: Re: Problem #3 (array access patterns)
not really practical for large arrays
but if we don’t do it, non-clobber antidependences remain
solution: handle potential non-clobbers in a post-pass (same way we deal with loop clobbers in static analysis)
    // initialize:
    int array[100];
    memset(array, 0, 100 * sizeof(int));
    // accumulate:
    for (...) array[i] += foo(i);

