Idempotent Code Generation: Implementation, Analysis, and Evaluation
Marc de Kruijf, Karthikeyan Sankaralingam
CGO 2013, Shenzhen


Example 1: source code
    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

Example 2: assembly code
      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3
what can go wrong: faults, exceptions, mis-speculations

Example 3: assembly code
      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3
BAD STUFF HAPPENS!

Example 4: assembly code
      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3
convention: use checkpoints/buffers
here, R0 and R1 are unmodified: just re-execute!
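The "just re-execute" idea can be sketched at the source level. This is a minimal illustration, not the authors' implementation: `sum_with_fault`, `sum_recover`, and the `fail_at` parameter are hypothetical names used to simulate a fault partway through the region. Because the region never overwrites its inputs (`array`, `len`), recovery needs no checkpoint; it simply runs the region again from its entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: runs the sum region, but simulates a fault just
 * before processing element `fail_at` (pass -1 for a fault-free run).
 * Returns false if the simulated fault fired; *out is then untouched. */
static bool sum_with_fault(const int *array, int len, int fail_at, int *out)
{
    assert(array != NULL && out != NULL && len >= 0);
    int x = 0;                      /* region-local state only */
    for (int i = 0; i < len; ++i) {
        if (i == fail_at)
            return false;           /* simulated fault: partial work is lost */
        x += array[i];
    }
    *out = x;
    return true;
}

/* Recovery by re-execution: the inputs (array, len) are never overwritten
 * inside the region, so running it again yields the fault-free result. */
static int sum_recover(const int *array, int len, int fail_at)
{
    int result = 0;
    if (!sum_with_fault(array, len, fail_at, &result)) {
        /* Re-execute from the region entry; this run is fault-free. */
        bool ok = sum_with_fault(array, len, -1, &result);
        assert(ok);
        (void)ok;
    }
    return result;
}
```

Calling `sum_recover` with any `fail_at` inside the array bounds returns the same value as a fault-free run, which is the property the slide relies on.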

It’s Idempotent!
idempoh… what…?
    int sum(int *data, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += data[i];
        return x;
    }

Idempotent Region Construction
previously… in PLDI ’12: idempotent regions ALL THE TIME
[Figure: before / after]

Idempotent Code Generation
now… in CGO ’13: how do we get from here...
    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

Idempotent Code Generation
now… in CGO ’13: to here...
      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

Idempotent Code Generation
now… in CGO ’13: not here (this is not idempotent: R1 is an input that gets overwritten)...
      R2 = load [R1]
      R1 = 0
    LOOP:
      R4 = load [R0 + R2]
      R1 = add R1, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R1
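The difference between the two register allocations can be checked mechanically. The sketch below is an assumption-laden simplification, not the algorithm from the paper: `struct insn`, `region_is_idempotent`, and the register encoding are hypothetical, the scan is linear (control flow is ignored), and memory is not modelled. It flags a region as non-idempotent when a register that is read before being written (a region input) is later overwritten.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_REG   (-1)
#define NUM_REGS 8

/* Hypothetical, simplified instruction: one destination, up to two sources. */
struct insn {
    int dst;      /* register written, or NO_REG */
    int src[2];   /* registers read, or NO_REG */
};

/* Returns true if no region input (a register read before any write to it)
 * is overwritten later in the region. Linear scan only: this sketch
 * ignores control flow and memory. */
static bool region_is_idempotent(const struct insn *code, size_t n)
{
    bool written[NUM_REGS] = { false };
    bool live_in[NUM_REGS] = { false };

    for (size_t i = 0; i < n; ++i) {
        for (int s = 0; s < 2; ++s) {
            int r = code[i].src[s];
            if (r == NO_REG)
                continue;
            assert(r >= 0 && r < NUM_REGS);
            if (!written[r])
                live_in[r] = true;   /* read before any write: an input */
        }
        int d = code[i].dst;
        if (d != NO_REG) {
            assert(d >= 0 && d < NUM_REGS);
            if (live_in[d])
                return false;        /* overwrites a region input */
            written[d] = true;
        }
    }
    return true;
}

/* The idempotent allocation of the sum loop (accumulator in R3). */
static const struct insn good_alloc[] = {
    { 2,      { 1, NO_REG } },       /* R2 = load [R1]      */
    { 3,      { NO_REG, NO_REG } },  /* R3 = 0              */
    { 4,      { 0, 2 } },            /* R4 = load [R0 + R2] */
    { 3,      { 3, 4 } },            /* R3 = add R3, R4     */
    { 2,      { 2, NO_REG } },       /* R2 = sub R2, 1      */
    { NO_REG, { 2, NO_REG } },       /* bnez R2, LOOP       */
};

/* The non-idempotent allocation (accumulator reuses input register R1). */
static const struct insn bad_alloc[] = {
    { 2,      { 1, NO_REG } },       /* R2 = load [R1]      */
    { 1,      { NO_REG, NO_REG } },  /* R1 = 0              */
    { 4,      { 0, 2 } },            /* R4 = load [R0 + R2] */
    { 1,      { 1, 4 } },            /* R1 = add R1, R4     */
    { 2,      { 2, NO_REG } },       /* R2 = sub R2, 1      */
    { NO_REG, { 2, NO_REG } },       /* bnez R2, LOOP       */
};
```

Under these assumptions the check accepts `good_alloc` and rejects `bad_alloc` at `R1 = 0`, because `R1` was already read by the first load. A real compiler pass would need dataflow analysis over the control flow graph rather than a linear scan.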

Idempotent Code Generation
now… in CGO ’13: and not here (this is slow)...
      R3 = R1
      R2 = load [R3]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

Idempotent Code Generation
now… in CGO ’13: here...
      R2 = load [R1]
      R3 = 0
    LOOP:
      R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
    EXIT:
      return R3

Idempotent Code Generation: applications to prior work
faults, exceptions, mis-speculations:
- Hampton & Asanović, ICS ’06
- De Kruijf & Sankaralingam, MICRO ’11
- Menon et al., ISCA ’12
- Kim et al., TOPLAS ’06
- Zhang et al., ASPLOS ’13
- De Kruijf et al., ISCA ’10
- Feng et al., MICRO ’11
- De Kruijf et al., PLDI ’12

Idempotent Code Generation: executive summary
(1) how do we generate efficient idempotent code?
(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more
algorithms made available in source code form: http://research.cs.wisc.edu/vertical/iCompiler
(not covered in this talk)

Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
    (a) idempotent region size
    (b) ISA characteristics
    (c) control flow side-effects

Analysis: (a) idempotent region size
[Figure: overhead vs. region size]
- number of inputs increasing
- likelihood of spills growing
- maximum spill cost reached
- amortized over more instructions

Analysis: (b) ISA characteristics
(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 -> R3    idempotent? YES!
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)
(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
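The two-address versus three-address point can be checked with a few lines of C. This is only an illustrative model: `struct regs`, the two `add_*` functions, and `same_after_reexecution` are hypothetical names, and "re-execution" is modelled by applying the same instruction a second time to the state it left behind.

```c
#include <assert.h>

/* Hypothetical three-register machine state. */
struct regs { int r1, r2, r3; };

/* Two-address form: ADD R1, R2 -> R1 (the destination overwrites a source). */
static void add_two_address(struct regs *r)   { r->r1 = r->r1 + r->r2; }

/* Three-address form: ADD R1, R2 -> R3 (both sources are preserved). */
static void add_three_address(struct regs *r) { r->r3 = r->r1 + r->r2; }

/* Executes `op` once, then once more to model recovery by re-execution.
 * Returns 1 if the second run leaves the same state as the first. */
static int same_after_reexecution(void (*op)(struct regs *), struct regs start)
{
    assert(op != 0);
    struct regs once = start;
    op(&once);
    struct regs twice = once;
    op(&twice);
    return once.r1 == twice.r1 && once.r2 == twice.r2 && once.r3 == twice.r3;
}
```

With a start state such as `{5, 7, 0}`, the two-address form yields `R1 = 12` after one run and `R1 = 19` after two, while the three-address form yields `R3 = 12` both times.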

Analysis: (c) control flow side-effects
[Figure: code sequence "x = ...", "... = f(x)", "y = ..." with region boundaries, showing x’s live interval and x’s “shadow interval” given no side-effects]

Analysis: (c) control flow side-effects
[Figure: code sequence "x = ...", "... = f(x)", "y = ..." with region boundaries, showing x’s live interval and x’s “shadow interval” given side-effects]

Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation
    (a) idempotent region size
    (b) ISA characteristics
    (c) control flow side-effects

Evaluation: methodology
measurements
- performance overhead: dynamic instruction count
    - for x86, using PIN
    - for ARM, using gem5
- region size: instructions between boundaries (path length)
benchmarks
- SPEC 2006, PARSEC, and Parboil suites

Evaluation: (a) idempotent region size
[Figure: overhead vs. region size]
YOU ARE HERE (baseline: typically 10-30 instructions)
10+ instructions: 13.1% overhead (geometric mean)

Evaluation: (a) idempotent region size
[Figure: overhead vs. region size, annotated with detection latency; labelled point: 13.1%]

Evaluation: (a) idempotent region size
[Figure: overhead vs. region size, annotated with detection latency and re-execution time; labelled points: 0.06%, 13.1%, 11.1%]

Evaluation: (b) ISA characteristics
[Figure: percentage overhead, x86-64 vs. ARMv7]
- three-address support matters more for FP benchmarks
- register-memory matters more for integer benchmarks

Evaluation: (c) control flow side-effects
[Figure: percentage overhead, no side-effects vs. side-effects]
- substantial only in two cases; insubstantial otherwise
- intuition: typically the compiler already spills for control flow divergence

Presentation Overview
❶ Introduction
❷ Analysis
❸ Evaluation

Conclusions
(a) region size: matters a lot; large regions are ideal if recovery is infrequent
    overheads approach zero as regions grow
(b) instruction set: matters when region sizes must be small
    overheads drop below 10% only with careful co-design
(c) control flow side-effects: generally do not matter
    supporting control flow side-effects is not expensive

Conclusions
- code generation and static analysis algorithms: http://research.cs.wisc.edu/vertical/iCompiler
- applicability not limited to architecture design: see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”
thank you!

Back-up Slides

ISA Characteristics: more registers isn’t always enough
C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;
machine code:
    R0 = 0
    if (R1 > 0) R0 = 1
    R2 = R0 + R1

ISA Characteristics: more registers isn’t always enough
C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;
machine code:
    R0 = 0
    if (R1 > 0) R3 = 1
    else        R3 = R0
    R2 = R3 + R1
need an extra instruction no matter what

ISA Characteristics: idempotence vs. fewer registers
[Figure: percentage overhead; comparison point: no idempotence, #GPR reduced from 16]
data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)

Very Large Regions: how do we get there?
Problem #1: aliasing analysis
- no flow-sensitive analysis in LLVM; hurts loops
Problem #2: loop optimizations
- boundaries in loops are bad for everyone (next slides)
- loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures
- awareness of array access patterns can help (next slides)
Problem #4: intra-procedural scope
- limited scope aggravates all effects listed above

Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) {... }
CFG + SSA:
    i0 = φ(0, i1)
    i1 = i0 + 1
    if (i1 < X)

Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) {... }
machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)
NO BOUNDARIES = NO PROBLEM

Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) {... }
machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

Very Large Regions: Re: Problem #2 (cuts in loops are bad)
C code:
    for (i = 0; i < X; i++) {... }
machine code:
    R1 = 0
    R0 = R1
    R1 = R0 + 1
    if (R1 < X)
- “redundant” copy
- extra boundary (pressure)

Very Large Regions: Re: Problem #3 (array access patterns)
before:
    [x] = a;
    b = [x];
    [x] = c;
after:
    [x] = a;
    b = a;
    [x] = c;
non-clobber antidependences… GONE!
PLDI ’12 algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays
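The distinction between a non-clobber and a clobber antidependence can be made concrete in C. This is a sketch under stated assumptions: the three function names are hypothetical, `*x` stands in for the memory location `[x]`, and `region_clobber` is not on the slide but is added here for contrast.

```c
#include <assert.h>

/* Non-clobber antidependence: the read of *x is preceded by a write to *x
 * inside the region, so re-execution re-creates the value that is read. */
static int region_non_clobber(int *x, int a, int c)
{
    assert(x != 0);
    *x = a;
    int b = *x;   /* antidependence with the store below */
    *x = c;
    return b;
}

/* Same region after forwarding the stored value: no load of *x remains,
 * so the antidependence is gone. */
static int region_forwarded(int *x, int a, int c)
{
    assert(x != 0);
    *x = a;
    int b = a;
    *x = c;
    return b;
}

/* Clobber antidependence (added for contrast): *x is a region input that
 * is read and then overwritten, so a second run reads a different value. */
static int region_clobber(int *x, int c)
{
    assert(x != 0);
    int b = *x;
    *x = c;
    return b;
}
```

Running either of the first two regions twice on the same location returns `a` both times. Running `region_clobber` twice returns the original contents the first time and `c` the second time, which is the case that forces a region boundary.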

Very Large Regions: Re: Problem #3 (array access patterns)
    // initialize:
    int array[100];
    memset(array, 0, sizeof array);  // memset takes (pointer, fill value, byte count)
    // accumulate:
    for (...) array[i] += foo(i);
- eliminating the reads is not really practical for large arrays
- but if we don’t do it, non-clobber antidependences remain
- solution: handle potential non-clobbers in a post-pass (same way we deal with loop clobbers in static analysis)
