Idempotent Code Generation: Implementation, Analysis, and Evaluation Marc de Kruijf ( ) Karthikeyan Sankaralingam CGO 2013, Shenzhen
Example 1 int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } source code
Example 2 R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 assembly code FFF F 0 faults exceptions x load ? mis-speculations
Example 3 R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 assembly code B AD S TUFF H APPENS !
R0 and R1 are unmodified R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 Example 4 assembly code just re-execute! convention: use checkpoints/buffers
It’s Idempotent! 5 idempoh… what…? int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x; } =
Idempotent Region Construction 6 previously… in PLDI ’12 idempotent regions A LL T HE T IME before: after:
Idempotent Code Generation 7 now… in CGO ’13 int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } how do we get from here...
Idempotent Code Generation 8 now… in CGO ’13 to here... R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
Idempotent Code Generation 9 now… in CGO ’13 not here (this is not idempotent)... R2 = load [R1] R1 = 0 LOOP: R4 = load [R0 + R2] R1 = add R1, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
Idempotent Code Generation 10 now… in CGO ’13 and not here (this is slow)... R3 = R1 R2 = load [R3] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
Idempotent Code Generation 11 now… in CGO ’13 here... R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
12 FFF F 0 faults exceptions x load ? mis-speculations Hampton & Asanović, ICS ’06 De Kruijf & Sankaralingam, MICRO ’11 Menon et al., ISCA ’12 Kim et al., TOPLAS ’06 Zhang et al., ASPLOS ‘13 De Kruijf et al., ISCA ’10 Feng et al., MICRO ’11 De Kruijf et al., PLDI ’12 Idempotent Code Generation applications to prior work
Idempotent Code Generation 13 executive summary (1)how do we generate efficient idempotent code? (2) how do external factors affect overhead? (a)idempotent region size (b)instruction set (ISA) characteristics (c)control flow side-effects each can affect overheads by 10% or more algorithms made available in source code form: not covered in this talk
14 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation (a)idempotent region size (b)ISA characteristics (c)control flow side-effects
Analysis 15 (a) idempotent region size region size overhead - number of inputs increasing - likelihood of spills growing - maximum spill cost reached - amortized over more instructions
Analysis 16 (b) ISA characteristics (1) two-address (e.g. x86) vs. three-address (e.g. ARM) ADD R1, R2 -> R1 Idempotent? NO! (2) register-memory (e.g. x86) vs. register-register (e.g. ARM) (3) number of available registers ADD R1, R2 = R3 idempotent? YES! for register-memory, register spills may be less costly (microarchitecture dependent) impact is obvious, but… more registers is not always enough (see back-up slide)
Analysis (c) control flow side-effects x = = f(x) y = region boundaries x ’s “shadow interval” given no side-effects x ’s live interval
Analysis (c) control flow side-effects x = = f(x) y = region boundaries x ’s “shadow interval” given side-effects x ’s live interval
19 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation (a)idempotent region size (b)ISA characteristics (c)control flow side-effects
Evaluation 20 methodology measurements – performance overhead: dynamic instruction count – for x86, using PIN – for ARM, using gem5 – region size: instructions between boundaries (path length) benchmarks – SPEC 2006, PARSEC, and Parboil suites
Evaluation region size overhead Y OU ARE HERE (baseline: typically instructions) ? (a) idempotent region size instructions 13.1% (geometric mean)
region size overhead 22 detection latency ? ? Evaluation (a) idempotent region size 13.1%
region size overhead 23 detection latency re-execution time ? Evaluation (a) idempotent region size 0.06% 13.1% 11.1%
Evaluation 24 percentage overhead x86-64ARMv7 Three-address support matters more for FP benchmarks Register-memory matters more for integer benchmarks (b) ISA characteristics
Evaluation 25 percentage overhead no side-effectsside-effects (c) control flow side-effects substantial only in two cases; insubstantial otherwise intuition: typically compiler already spills for control flow divergence
26 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation
Conclusions 27 (a) region size – matters a lot; large regions are ideal if recovery is infrequent overheads approach zero as regions grow overheads drop below 10% only with careful co-design (b) instruction set – matters when region sizes must be small supporting control flow side-effects is not expensive (c) control flow side-effects – generally does not matter
Conclusions 28 code generation and static analysis algorithms applicability not limited to architecture design see Zhang et al., ASPLOS ‘13: “ConAir: Featherweight Concurrency Bug Recovery [...]” thank you!
Back-up Slides 29
ISA Characteristics 30 more registers isn’t always enough x = 0; if (y > 0) x = 1; z = x + y; C code R0 = 0 if (R1 > 0) R0 = 1 R2 = R0 + R1
ISA Characteristics 31 more registers isn’t always enough R0 = 0 if (R1 > 0) R3 = R0 x = 0; if (y > 0) x = 1; z = x + y; C code R3 = 1 R2 = R3 + R1 need an extra instruction no matter what
32 data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only) percentage overhead ISA Characteristics idempotence vs. fewer registers no idempotence, #GPR reduced from 16
Very Large Regions 33 how do we get there? Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; hurts loops Problem #2: loop optimizations – boundaries in loops are bad for everyone (next slides) – loop blocking, fission/fusion, inter-change, peeling, unrolling, scalarization, etc. can all help Problem #3: large array structures – awareness of array access patterns can help (next slides) Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
Very Large Regions 34 Re: Problem #2 (cut in loops are bad) i 0 = φ(0, i 1 ) i 1 = i if (i 1 < X) for (i = 0; i < X; i++) {... } C codeCFG + SSA
Very Large Regions 35 Re: Problem #2 (cut in loops are bad) R0 = 0 R0 = R0 + 1 if (R0 < X) for (i = 0; i < X; i++) {... } C codemachine code N O B OUNDARIES = N O P ROBLEM
Very Large Regions 36 Re: Problem #2 (cut in loops are bad) R0 = 0 R0 = R0 + 1 if (R0 < X) for (i = 0; i < X; i++) {... } C codemachine code
Very Large Regions 37 Re: Problem #2 (cut in loops are bad) R1 = 0 R0 = R1 R1 = R0 + 1 if (R1 < X) for (i = 0; i < X; i++) {... } C codemachine code – “redundant” copy – extra boundary (pressure)
Very Large Regions 38 Re: Problem #3 (array access patterns) [x] = a; b = [x]; [x] = c; [x] = a; b = a; [x] = c; non-clobber antidependences… GONE! PLDI ‘12 algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays
Very Large Regions 39 Re: Problem #3 (array access patterns) not really practical for large arrays but if we don’t do it, non-clobber antidependences remain solution: handle potential non-clobbers in a post-pass (same way we deal with loop clobbers in static analysis) // initialize: int[100] array; memset(&array, 100*4, 0); // accumulate: for (...) array[i] += foo(i);