Compiler Construction of Idempotent Regions and Applications in Architecture Design
Marc de Kruijf
Advisor: Karthikeyan Sankaralingam
PhD Defense, 07/20/2012

Example (source code):
int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Example (assembly code):
R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
R3 = add R3, R4
R2 = sub R2, 1
bnez R2, LOOP
EXIT: return R3
[figure: faults, exceptions, and mis-speculations striking the load]

Example (assembly code):
R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
R3 = add R3, R4
R2 = sub R2, 1
bnez R2, LOOP
EXIT: return R3
BAD STUFF HAPPENS!

Example (assembly code):
R2 = load [R1]
R3 = 0
LOOP: R4 = load [R0 + R2]
R3 = add R3, R4
R2 = sub R2, 1
bnez R2, LOOP
EXIT: return R3
R0 and R1 are unmodified: just re-execute!
(the convention is to use checkpoints/buffers)

It's Idempotent! (idempoh… what…?)
int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}

Thesis (the thing that I am defending)
idempotent regions, ALL THE TIME. specifically…
– using compiler analysis (intra-procedural)
– transparent; no programmer intervention
– hardware/software co-design: software analysis, hardware execution

Thesis (prelim.pptx → defense.pptx)
preliminary exam (11/2010)
– idempotence: concept and simple empirical analysis
– compiler: preliminary design & partial implementation
– architecture: some area and power savings…?
defense (07/2012)
– idempotence: formalization and detailed empirical analysis
– compiler: complete design, source code release*
– architecture: compelling benefits (various)*

Contributions & Findings (a summary)
contribution areas
– idempotence: models and analysis framework
– compiler: design, implementation, and evaluation
– architecture: design and evaluation
findings
– potentially large idempotent regions exist in applications
– for compilation, larger is better
– small regions (5-15 instructions): 10-15% overheads
– large regions (50+ instructions): 0-2% overheads
– enables efficient exception and hardware fault recovery

10 Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

Idempotence Models: what does it mean?
DEFINITION:
(1) a region is idempotent iff its re-execution has no side-effects
(2) a region is idempotent iff it preserves its inputs
MODEL A: An input is a variable that is live-in to the region. A region preserves an input if the input is not overwritten.
OK, but what does it mean to preserve an input? four models (next slides): A, B, C, & D

Idempotence Model A (a starting point)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; later: ?? = mem[R4] … ?]
Live-ins: { all registers }, { all memory } \ {R1}

Idempotence Model A (a starting point)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16]
Live-ins: { all registers }, { all memory }

Idempotence Model A (a starting point)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; later: ?? = mem[R4] … ?]
Live-ins: { all registers }, { all memory } \ {mem[R4]}
TWO REGIONS. SOME STUFF NOT COVERED.

Idempotence Model B (varying control flow assumptions)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16; later: ?? = mem[R4] … ?]
live-in but dynamically dead at the time of the write: OK to overwrite if control flow is invariable
ONE REGION. SOME STUFF STILL NOT COVERED.

Idempotence Model C (varying sequencing assumptions)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16]
allow the final instruction to overwrite an input (to include otherwise ineligible instructions)
ONE REGION. EVERYTHING COVERED.

Idempotence Model D (varying isolation assumptions)
[CFG: R1 = R2 + R3; if R1 > 0 (true/false); mem[R4] = R1; SP = SP - 16]
may be concurrently read in another thread: consider it an input
TWO REGIONS. EVERYTHING COVERED.

Idempotence Models: an idempotence taxonomy
– sequencing axis (Model C)
– control axis (Model B)
– isolation axis (Model D)

what are the implications? 19

Empirical Analysis: methodology
measurement
– dynamic region size (path length), subject to axis constraints
– x86 dynamic instruction count (using PIN)
benchmarks
– SPEC 2006, PARSEC, and Parboil suites
experimental configurations
– unconstrained: ideal upper bound (Model C)
– oblivious: actual in normal compiled code (Model C)
– X-constrained: ideal upper bound constrained by axis X

Empirical Analysis: oblivious vs. unconstrained
[chart: average region size, geometrically averaged across suites]
X IMPROVEMENT ACHIEVABLE

Empirical Analysis: control axis sensitivity
[chart: average region size, geometrically averaged across suites]
4X DROP IF VARIABLE CONTROL FLOW

Empirical Analysis: isolation axis sensitivity
[chart: average region size, geometrically averaged across suites]
1.5X DROP IF NO ISOLATION

Empirical Analysis: sequencing axis sensitivity
[chart: non-idempotent instructions, 0.19%, geometrically averaged across suites]
INHERENTLY NON-IDEMPOTENT INSTRUCTIONS: RARE AND OF LIMITED TYPE

Idempotence Models: a summary
a spectrum of idempotence models
– significant opportunity: region sizes of 100+ possible
– 4x reduction constraining the control axis
– 1.5x reduction constraining the isolation axis
two models going forward
– architectural idempotence & contextual idempotence
– both are effectively the ideal case (Model C)
– architectural idempotence: invariable control always
– contextual idempotence: variable control w.r.t. locals

26 Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

Compiler Design (choose your own adventure)
ANALYZE: identify semantic clobber antidependences
PARTITION: cut semantic clobber antidependences
CODE GEN: preserve semantic idempotence
[flow: ANALYZE → PARTITION → CODE GEN → COMPILER EVALUATION]

Compiler Evaluation: preamble
WHAT DO YOU MEAN: PERFORMANCE OVERHEADS?
ANALYZE: identify semantic clobber antidependences (free)
PARTITION: cut semantic clobber antidependences (free)
CODE GEN: preserve semantic idempotence (NOT free)

Compiler Evaluation: preamble
[trade-off chart: region size vs. overhead (register pressure)]
– preserve input values in registers; spill other values (if needed)
– or spill input values to the stack; allocate other values to registers

Compiler Evaluation: methodology
compiler implementation
– LLVM, support for both x86 and ARM
measurements
– performance overhead: dynamic instruction count
– for x86, using PIN; for ARM, using gem5 (just for the ISA comparison at the end)
– region size: instructions between boundaries (path length); x86 only, using PIN
benchmarks
– SPEC 2006, PARSEC, and Parboil suites

Results, Take 1/3: initial results (overhead)
performance overhead: percentage increase in x86 dynamic instruction count, geometrically averaged across suites
13.1%: NON-TRIVIAL OVERHEAD

Results, Take 1/3: analysis of trade-offs
[chart: region size vs. overhead; YOU ARE HERE (typically … instructions); register pressure]

Results, Take 1/3: analysis of trade-offs
[chart: region size vs. overhead; register pressure, detection latency]

Results, Take 2/3: minimizing register pressure
performance overhead: Before 13.1%, After 11.1%
STILL NON-TRIVIAL OVERHEAD

Results, Take 2/3: analysis of trade-offs
[chart: region size vs. overhead; register pressure, detection latency, re-execution time]

Big Regions: how do we get there?
Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; really hurts loops
Problem #2: loop optimizations
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures
– awareness of array access patterns can help
Problem #4: intra-procedural scope
– limited scope aggravates all of the effects listed above

Big Regions: how do we get there?
solutions can be automated
– a lot of work… what would be the gain?
ad hoc for now
– consider the PARSEC and Parboil suites as a case study
– aliasing annotations
– manual loop refactoring, scalarization, etc.
– partitioning algorithm refinements (application-specific)
– inlining annotations

Results, Take 3/3: big regions
performance overhead: Before 13.1%, After 0.06%
GAIN = BIG; TRIVIAL OVERHEAD

Results, Take 3/3: … instructions is good enough
[chart: region size vs. overhead; register pressure; 50+ instructions (mis-optimized)]

ISA Sensitivity (you might be curious: does the ISA matter?)
(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
(3) number of available registers
the short version
– impact of (1) & (2): not significant (+/- 2% overall)
– even less significant as regions grow larger
– impact of (3): to get the same performance with idempotence, increase registers by 0% (large regions) to ~60% (small regions)

Compiler Design & Evaluation: a summary
design and implementation
– static analysis algorithms: modular and perform well
– code-gen algorithms: modular and perform well
– LLVM implementation, source code available*
findings
– pressure-related performance overheads range from 0% (large regions) to ~15% (small regions)
– greatest opportunity: loop-intensive applications
– ISA effects are insignificant

42 Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

Architecture Recovery: It's Real
safety first / speed first (safety second)

Architecture Recovery: It's Real (lots of sharp turns)
[figure: 4-stage pipeline, Fetch → Decode → Execute → Write-back; "closer to the truth"]

Architecture Recovery: It's Real (lots of interaction)
[figure: pipeline, Fetch → Decode → Execute → Write-back; an exception surfaces too late!]

Architecture Recovery: It's Real (bad stuff can happen)
mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.
hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

Architecture Recovery: It's Real (bad stuff can happen)
mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.
hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.
[trade-off axes: register pressure, detection latency, re-execution time]

Architecture Recovery: It's Real (bad stuff can happen)
hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. → high-reliability systems
exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. → integrated GPU, low-power CPU

GPU Exception Support 50

GPU Exception Support: why would we want it?
GPU/CPU integration
– unified address space: support for demand paging
– numerous secondary benefits as well…

GPU Exception Support: why is it hard?
the CPU solution: [figure: pipeline, registers, buffers]

GPU Exception Support: why is it hard?
CPU: 10s of registers/core
GPU: 10s of registers/thread × 32 threads/warp × 48 warps per "core" = 10,000s of registers/core

GPU Exception Support: idempotence on GPUs
GPUs hit the sweet spot:
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)
[trade-off axes: register pressure, detection latency, re-execution time]

GPU Exception Support: idempotence on GPUs
GPU design topics (details follow):
– compiler flow
– hardware support
– exception live-lock
– bonus: fast context switching
GPUs hit the sweet spot:
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)

GPU Exception Support: evaluation methodology
compiler – LLVM targeting ARM
benchmarks – Parboil GPU benchmarks, modified for CPUs
simulation – gem5 for ARM: simple dual-issue in-order (e.g. Fermi); 10-cycle page fault detection latency
measurement – performance overhead in execution cycles

GPU Exception Support: evaluation results
performance overhead: 0.54%
NEGLIGIBLE OVERHEADS

CPU Exception Support 58

CPU Exception Support: why is it a problem?
the CPU solution: [figure: pipeline, registers, buffers]

CPU Exception Support: why is it a problem? (Before)
[pipeline figure: Fetch → Decode, Rename, & Issue → Integer / Multiply / Load-Store / Branch / FP units; RF; IEEE FP; bypass; replay queue; Flush? Replay?]

CPU Exception Support: why is it a problem? (After)
[pipeline figure: Fetch → Decode & Issue → Integer / Branch / Multiply / Load-Store / FP units; RF]

CPU Exception Support: idempotence on CPUs
CPU design simplification: in the ARM Cortex-A8 (dual-issue in-order) we can remove:
– bypass / staging register file, replay queue
– rename pipeline stage
– IEEE-compliant floating point unit
– pipeline flush for exceptions and replays
– all associated control logic
leaner hardware; bonus: cheap (but modest) OoO issue

CPU Exception Support: evaluation methodology
compiler – LLVM targeting ARM, minimizing pressure (take 2/3)
benchmarks – SPEC 2006 & PARSEC suites (unmodified)
simulation – gem5 for ARM: aggressive dual-issue in-order (e.g. A8); stall on potential in-flight exception
measurement – performance overhead in execution cycles

CPU Exception Support: evaluation results
performance overhead: 9.1%
LESS THAN 10%... BUT NOT AS GOOD AS GPU (REGIONS TOO SMALL)

Hardware Fault Tolerance 65

Hardware Fault Tolerance: what is the opportunity?
reliability trends
– CMOS reliability is a growing problem; future CMOS alternatives are no better
architecture trends
– hardware power and complexity are at a premium; desire for simple hardware + efficient recovery
application trends
– emerging workloads consist of large idempotent regions; increasing levels of software abstraction

Hardware Fault Tolerance: design topics
hardware organizations
– homogeneous: idempotence everywhere
– statically heterogeneous: e.g. accelerators
– dynamically heterogeneous: adaptive cores
fault detection capability
– fine-grained in hardware (e.g. Argus, MICRO '07), or
– fine-grained in software (e.g. instruction/region DMR)
fault model (aka ISA semantics)
– similar to pipeline-based (e.g. ROB) recovery

Hardware Fault Tolerance: evaluation methodology
compiler – LLVM targeting ARM (compiled to minimize pressure)
benchmarks – SPEC 2006, PARSEC, and Parboil suites (unmodified)
simulation – gem5 for ARM: simple dual-issue in-order; DMR detection; compared against checkpoint/log and TMR
measurement – performance overhead in execution cycles

Hardware Fault Tolerance: evaluation results
[chart: performance overhead]
IDEMPOTENCE BEATS THE COMPETITION

70 Overview ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

RELATED WORK / CONCLUSIONS

Conclusions
idempotence: not good for everything
– small regions are expensive: preserving register state is difficult with limited flexibility
– large regions are cheap: preserving register state is easy with the amortization effect; preserving memory state is mostly "for free"
idempotence: synergistic with modern trends
– programmability (for GPUs)
– low power (for everyone)
– high-level software, efficient recovery (for everyone)

73 The End

Back-Up: Chronology
[timeline: MapReduce for CELL → SELSE '09: Synergy → ISCA '10: Relax → DSN '10: TS model → MICRO '11: Idempotent Processors → PLDI '12: Static Analysis and Compiler Design → ISCA '12: iGPU → CGO ??: Code Gen → TACO ??: Models; prelim and defense markers]

Choose Your Own Adventure Slides 75

Idempotence Analysis
[code figure] is this idempotent? Yes

Idempotence Analysis
[code figure] how about this? No

Idempotence Analysis
[code figure] maybe this? Yes

Idempotence Analysis: it's all about the data dependences
operation sequence → idempotent?
– write → Yes
– read, write → No
– write, read, write → Yes

Idempotence Analysis: it's all about the data dependences
operation sequence → idempotent?
– write, read → Yes
– read, write → No
– write, read, write → Yes
CLOBBER ANTIDEPENDENCE: an antidependence with an exposed read

Semantic Idempotence: two types of program state
(1) local ("pseudoregister") state:
– can be renamed to remove clobber antidependences*
– does not semantically constrain idempotence
(2) non-local ("memory") state:
– cannot "rename" to avoid clobber antidependences
– semantically constrains idempotence
semantic idempotence = no non-local clobber antidependences
* preserve local state by renaming and careful allocation

Region Partitioning Algorithm: steps one, two, and three
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Step 1: Transform (not one, but two transformations)
Transformation 1: SSA for pseudoregister antidependences
But we still have a problem: [diagram: region identification depends on region boundaries, which depend on clobber antidependences]

Step 1: Transform (not one, but two transformations)
Transformation 1: SSA for pseudoregister antidependences
Transformation 2: scalar replacement of memory variables:
[x] = a; b = [x]; [x] = c;  becomes  [x] = a; b = a; [x] = c;
non-clobber antidependences… GONE!

Step 1: Transform (not one, but two transformations)
Transformation 1: SSA for pseudoregister antidependences
Transformation 2: scalar replacement of memory variables
[diagram: region identification depends on region boundaries, which depend on clobber antidependences]

Region Partitioning Algorithm: steps one, two, and three
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Step 2: Cut the CFG (cut, cut, cut…)
construct regions by "cutting" non-local antidependences
[CFG figure: an antidependence]

Step 2: Cut the CFG (rough sketch)
[chart: region size vs. overhead; sources of overhead; optimal region size?]
larger is (generally) better: large regions amortize the cost of input preservation
but where to cut…?

Step 2: Cut the CFG (but where to cut…?)
goal: the minimum set of cuts that cuts all antidependence paths
intuition: minimum cuts → fewest regions → large regions
approach: a series of reductions:
minimum vertex multi-cut (NP-complete) → minimum hitting set among paths → minimum hitting set among "dominating nodes"
(details omitted)

Region Partitioning Algorithm: steps one, two, and three
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Step 3: Loop-Related Refinements (loops affect correctness and performance)
correctness: not all local antidependences are removed by SSA…
– loop-carried antidependences may clobber
– depends on boundary placement; handled as a post-pass
performance: loops tend to execute multiple times…
– to maximize region size, place cuts outside of the loop
– algorithm modified to prefer cuts outside of loops
(details omitted)

Code Generation Algorithms: idempotence preservation
background & concepts: live intervals, region intervals, and shadow intervals
compiling for architectural idempotence: invariable control flow upon re-execution
compiling for contextual idempotence: potentially variable control flow upon re-execution

Code Generation Algorithms: live intervals and region intervals
[figure: x = …; … = f(x); y = …; region boundaries delimiting a region interval and x's live interval]

Code Generation Algorithms: shadow intervals
shadow interval: the interval over which a variable must not be overwritten specifically to preserve idempotence
different for architectural and contextual idempotence

Code Generation Algorithms: for contextual idempotence
[figure: x = …; … = f(x); y = …; region boundaries; x's shadow interval and live interval]

Code Generation Algorithms: for architectural idempotence
[figure: x = …; … = f(x); y = …; region boundaries; x's shadow interval and live interval]

Code Generation Algorithms: for architectural idempotence
[figure: x = …; … = f(x); y = …; region boundaries; x's shadow interval, x's live interval, and y's live interval]

Big Regions, Re: Problem #2 (cuts in loops are bad)
C code:
for (i = 0; i < X; i++) { ... }
CFG + SSA:
i0 = φ(0, i1)
i1 = i0 + 1
if (i1 < X)

Big Regions, Re: Problem #2 (cuts in loops are bad)
C code:
for (i = 0; i < X; i++) { ... }
machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)
NO BOUNDARIES = NO PROBLEM

Big Regions, Re: Problem #2 (cuts in loops are bad)
C code:
for (i = 0; i < X; i++) { ... }
machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)

Big Regions, Re: Problem #2 (cuts in loops are bad)
C code:
for (i = 0; i < X; i++) { ... }
machine code:
R1 = 0
R0 = R1
R1 = R0 + 1
if (R1 < X)
– a "redundant" copy
– an extra boundary (pressure)

Big Regions, Re: Problem #3 (array access patterns)
the algorithm makes this simplifying assumption:
[x] = a; b = [x]; [x] = c;  becomes  [x] = a; b = a; [x] = c; (non-clobber antidependences… GONE!)
cheap for scalars, expensive for arrays

Big Regions, Re: Problem #3 (array access patterns)
// initialize:
int array[100];
memset(array, 0, 100*4);
// accumulate:
for (...) array[i] += foo(i);
not really practical for large arrays, but if we don't do it, non-clobber antidependences remain
solution: handle potential non-clobbers in a post-pass (the same way we deal with loop clobbers in static analysis)

Big Regions: results (sizes)
Benchmark | Problems | Size Before | Size After
blackscholes | ALIASING, SCOPE | 78.9 | >10,000,000
canneal | SCOPE | |
fluidanimate | ARRAYS, LOOPS, SCOPE | 9.4 | >10,000,000
streamcluster | ALIASING | | ,928
swaptions | ALIASING, ARRAYS | | ,000
cutcp | LOOPS | |
fft | ALIASING | 24.7 | 2,450
histo | ARRAYS, SCOPE | 4.4 | 4,640,000
mri-q | – | | 22,100
sad | ALIASING | | ,000
tpacf | ARRAYS, SCOPE | | ,000

Big Regions: results (overheads)
Benchmark | Problems | Overhead Before | Overhead After
blackscholes | ALIASING, SCOPE | -2.93% | -0.05%
canneal | SCOPE | 5.31% | 1.33%
fluidanimate | ARRAYS, LOOPS, SCOPE | 26.67% | -0.62%
streamcluster | ALIASING | 13.62% | 0.00%
swaptions | ALIASING, ARRAYS | 17.67% | 0.00%
cutcp | LOOPS | 6.344% | -0.01%
fft | ALIASING | 11.12% | 0.00%
histo | ARRAYS, SCOPE | 23.53% | 0.00%
mri-q | – | | 0.00%
sad | ALIASING | 4.17% | 0.00%
tpacf | ARRAYS, SCOPE | 12.36% | -0.02%

Big Regions: problem labels
Problem #1: aliasing analysis (ALIASING)
– no flow-sensitive analysis in LLVM; really hurts loops
Problem #2: loop optimizations (LOOPS)
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures (ARRAYS)
– awareness of array access patterns can help
Problem #4: intra-procedural scope (SCOPE)
– limited scope aggravates all of the effects listed above

ISA Sensitivity: x86-64 vs. ARMv7
[chart: percentage overhead, x86-64 vs. ARMv7; same configuration as take 1/3]

ISA Sensitivity: general purpose register (GPR) sensitivity
[chart: percentage overhead; ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT]

ISA Sensitivity: more registers isn't always enough
C code:
x = 0;
if (y > 0) x = 1;
z = x + y;
machine code:
R0 = 0
if (R1 > 0) R0 = 1
R2 = R0 + R1

ISA Sensitivity: more registers isn't always enough
C code:
x = 0;
if (y > 0) x = 1;
z = x + y;
machine code (renamed to preserve inputs):
R0 = 0
R3 = R0
if (R1 > 0) R3 = 1
R2 = R3 + R1

GPU Exception Support: compiler flow & hardware support
compiler: kernel source → compiler IR (partitioning, preservation) → device code generator → idempotent device code
hardware: [figure: core with fetch/decode, FUs, general purpose registers, RPCs, L1/TLB, L2 cache]

GPU Exception Support: exception live-lock and fast context switching
exception live-lock
– multiple recurring exceptions can cause live-lock
– detection: save the PC and compare
– recovery: single-stepped re-execution or re-compilation
bonus: fast context switching
– boundary locations are configurable at compile time
– observation 1: save/restore only live state
– observation 2: place boundaries to minimize liveness

CPU Exception Support: design simplification
idempotence enables OoO retirement
– simplifies result bypassing
– simplifies exception support for long-latency instructions
– simplifies scheduling of variable-latency instructions
OoO issue?

CPU Exception Support: design simplification
what about branch prediction, etc.? high re-execution costs; live-lock issues
[trade-off axes: register pressure, detection latency, re-execution time]
region placement to minimize re-execution...?

CPU Exception Support: minimizing branch re-execution cost
[chart: percentage overhead; take 2/3: 9.1%, cut at branch: 18.1%]

Hardware Fault Tolerance: fault semantics
hardware fault model (fault semantics)
– side-effects are temporally contained to region execution
– side-effects are spatially contained to target resources
– control flow is legal (follows static CFG edges)

Related Work: on idempotence
Very Related | Year | Domain
Sentinel Scheduling | 1992 | Speculative memory re-ordering
Reference Idempotency | 2006 | Reducing speculative storage
Restart Markers | 2006 | Virtual memory in vector machines
Encore | 2011 | Hardware fault recovery
Somewhat Related | Year | Domain
Multi-Instruction Retry | 1995 | Branch and hardware fault recovery
Atomic Heap Transactions | 1999 | Atomic memory allocation

Related Work: on idempotence, what's new?
– idempotence model classification and analysis
– first work to decompose entire programs
– static analysis in terms of clobber (anti-)dependences
– static analysis and code generation algorithms
– overhead analysis: detection, pressure, re-execution
– comprehensive (and general) compiler implementation
– comprehensive compiler evaluation
– a spectrum of architecture designs & applications