Idempotent Code Generation: Implementation, Analysis, and Evaluation
Marc de Kruijf, Karthikeyan Sankaralingam
CGO 2013, Shenzhen


Example: source code

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

Example: assembly code

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

[figure: faults, exceptions, and mis-speculations striking the code mid-execution]

Example: assembly code

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

BAD STUFF HAPPENS!

Example: assembly code

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

R0 and R1 are unmodified: just re-execute!
(convention: use checkpoints/buffers)

It’s Idempotent!

idempoh… what…?

    int sum(int *data, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += data[i];
      return x;
    }

running it twice = running it once
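
A minimal, self-contained C sketch of the property (example mine, not from the talk): because sum never writes its inputs, a partial run can simply be abandoned and restarted.

    #include <assert.h>

    int sum(int *array, int len) {
      int x = 0;                  /* all written state is region-local */
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

    int main(void) {
      int a[] = {1, 2, 3};
      int once = sum(a, 3);       /* first (possibly interrupted) run  */
      assert(sum(a, 3) == once);  /* re-execution gives the same value */
      return 0;
    }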

Idempotent Region Construction

previously… in PLDI ’12: idempotent regions ALL THE TIME
[figure: code partitioned into idempotent regions, before and after]

Idempotent Code Generation

now… in CGO ’13: how do we get from here...

    int sum(int *array, int len) {
      int x = 0;
      for (int i = 0; i < len; ++i)
        x += array[i];
      return x;
    }

Idempotent Code Generation

to here...

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

Idempotent Code Generation

not here (this is not idempotent)...

          R2 = load [R1]
          R1 = 0
    LOOP: R4 = load [R0 + R2]
          R1 = add R1, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R1

(R1 is a region input; using it as the accumulator destroys it, so the first load cannot be re-executed)
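
The same bug at the C level (illustrative sketch, mine): if code generation reuses a region input as scratch state, re-execution reads a corrupted input.

    /* NOT idempotent: the input len doubles as the loop counter */
    int sum_clobbered(int *array, int len) {
      int x = 0;
      while (len > 0)          /* len (a region input) is destroyed...  */
        x += array[--len];     /* ...so re-executing from the top after */
                               /* a partial run sums too few elements   */
      return x;
    }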

Idempotent Code Generation

and not here (this is slow)...

          R3 = R1
          R2 = load [R3]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

(the extra copy R3 = R1 costs an instruction on every execution)

Idempotent Code Generation

here...

          R2 = load [R1]
          R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

Idempotent Code Generation: applications to prior work

[figure: the faults / exceptions / mis-speculations picture from before, annotated with prior work]
Hampton & Asanović, ICS ’06; Kim et al., TOPLAS ’06; De Kruijf et al., ISCA ’10; De Kruijf & Sankaralingam, MICRO ’11; Feng et al., MICRO ’11; Menon et al., ISCA ’12; De Kruijf et al., PLDI ’12; Zhang et al., ASPLOS ’13

Idempotent Code Generation: executive summary

(1) how do we generate efficient idempotent code?
    (algorithms made available in source code form; details not covered in this talk)
(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more

Presentation Overview

❶ Introduction
❷ Analysis
❸ Evaluation

(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Analysis: (a) idempotent region size

[figure: overhead as a function of region size]
- number of inputs increasing
- likelihood of spills growing
- maximum spill cost reached
- amortized over more instructions
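
A back-of-envelope reading of the curve (numbers illustrative, mine, not from the paper): if preserving a region's live-in values costs at most k extra instructions, overhead is roughly k / s for a region of s instructions once the maximum spill cost is reached; for example, k = 6 gives about 12% at s = 50 but only about 1.2% at s = 500, which is why overhead falls as regions grow.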

Analysis: (b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1   idempotent? NO!
    ADD R1, R2 -> R3   idempotent? YES!
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)
(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
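
A small C sketch of point (1) (example mine; the register mapping in the comments is illustrative): a destructive two-address add overwrites one of its sources, so if that source is a region input the compiler must spend a copy to keep the region idempotent.

    int madd(int x, int y) {
      int t = x;   /* the extra copy a two-address ISA forces:      */
      t += y;      /* the destructive add clobbers only the copy t, */
                   /* never the region input x                      */
      return t;    /* a three-address ISA does this in one          */
                   /* non-destructive instruction: ADD x, y -> t    */
    }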

Analysis: (c) control flow side-effects

[figure: code “x = …; … = f(x); y = …” with region boundaries,
 showing x’s live interval and x’s “shadow interval” given no side-effects]

Analysis: (c) control flow side-effects

[figure: the same code, now showing x’s “shadow interval” given side-effects:
 the interval extends further beyond x’s live interval]
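
A hedged C illustration of the shadow interval (example mine; f and g stand in for arbitrary calls): once control flow can have side effects, re-execution may be triggered after x's last use, so the register holding x must stay untouched until the region ends.

    int f(int);             /* assumed external functions */
    int g(int);

    int region(int x) {
      int y = f(x);          /* last real use of x                        */
      int z = g(y);          /* if this can fault and force re-execution, */
                             /* x must still be intact, so its register   */
                             /* cannot be reused for z yet                */
      return z;              /* region boundary: x's shadow interval ends */
    }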

Presentation Overview

❶ Introduction
❷ Analysis
❸ Evaluation

(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Evaluation: methodology

measurements
- performance overhead: dynamic instruction count
  - for x86, using Pin
  - for ARM, using gem5
- region size: instructions between boundaries (path length)

benchmarks
- SPEC 2006, PARSEC, and Parboil suites

Evaluation: (a) idempotent region size

[figure: overhead vs. region size; “YOU ARE HERE” marks the baseline
 (typically … instructions) at 13.1% overhead (geometric mean)]

Evaluation: (a) idempotent region size

[figure: overhead vs. region size, now annotated with detection latency;
 13.1% overhead at the baseline]

Evaluation: (a) idempotent region size

[figure: overhead vs. region size, annotated with detection latency and
 re-execution time; overhead points at 13.1%, 11.1%, and 0.06%]

Evaluation: (b) ISA characteristics

[figure: percentage overhead, x86-64 vs. ARMv7]
- three-address support matters more for FP benchmarks
- register-memory matters more for integer benchmarks

Evaluation: (c) control flow side-effects

[figure: percentage overhead, no side-effects vs. side-effects]
- substantial only in two cases; insubstantial otherwise
- intuition: typically the compiler already spills for control flow divergence

Presentation Overview

❶ Introduction
❷ Analysis
❸ Evaluation

Conclusions

(a) region size: matters a lot; large regions are ideal if recovery is infrequent,
    and overheads approach zero as regions grow
(b) instruction set: matters when region sizes must be small;
    overheads drop below 10% only with careful co-design
(c) control flow side-effects: generally do not matter;
    supporting them is not expensive

Conclusions

- code generation and static analysis algorithms
- applicability not limited to architecture design:
  see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”

thank you!

Back-up Slides

ISA Characteristics: more registers isn’t always enough

C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;

machine code:
    R0 = 0
    if (R1 > 0) R0 = 1
    R2 = R0 + R1

ISA Characteristics: more registers isn’t always enough

C code:
    x = 0;
    if (y > 0) x = 1;
    z = x + y;

machine code (with a spare register R3):
    R0 = 0
    R3 = R0
    if (R1 > 0) R3 = 1
    R2 = R3 + R1

need an extra instruction no matter what
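
In C terms (sketch mine): the conditional redefinition joins two values for x, and idempotence requires that neither reaching definition be destroyed, so one copy survives no matter how many registers are available.

    int h(int y) {
      int x = 0;
      int x2 = x;          /* the unavoidable extra copy               */
      if (y > 0) x2 = 1;   /* the redefinition clobbers only the copy, */
                           /* so both reaching values stay recoverable */
      return x2 + y;
    }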

ISA Characteristics: idempotence vs. fewer registers

[figure: percentage overhead of idempotence vs. no idempotence with the
 number of GPRs reduced from 16]
data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)

Very Large Regions: how do we get there?

Problem #1: aliasing analysis
- no flow-sensitive analysis in LLVM; hurts loops
Problem #2: loop optimizations
- boundaries in loops are bad for everyone (next slides)
- loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures
- awareness of array access patterns can help (next slides)
Problem #4: intra-procedural scope
- limited scope aggravates all effects listed above

Very Large Regions: Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) { ... }

CFG + SSA:
    i0 = φ(0, i1)
    i1 = i0 + 1
    if (i1 < X)

Very Large Regions: Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) { ... }

machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

NO BOUNDARIES = NO PROBLEM

Very Large Regions: Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) { ... }

machine code:
    R0 = 0
    R0 = R0 + 1
    if (R0 < X)

[figure: the same loop, now with a region boundary cut inside it]

Very Large Regions: Re: Problem #2 (cuts in loops are bad)

C code:
    for (i = 0; i < X; i++) { ... }

machine code:
    R1 = 0
    R0 = R1
    R1 = R0 + 1
    if (R1 < X)

- “redundant” copy
- extra boundary (pressure)
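
One way the loop transformations listed on the earlier slide can help (sketch mine, illustrative only): unrolling lets a boundary fall once per several iterations instead of every iteration, amortizing the copy that an in-loop boundary forces.

    void body(int i);                 /* stand-in for the loop body */

    void run(int X) {
      int i;
      /* unrolled by 4: a boundary falls once per four iterations */
      for (i = 0; i + 4 <= X; i += 4) {
        body(i); body(i + 1); body(i + 2); body(i + 3);
      }
      for (; i < X; i++)              /* remainder iterations */
        body(i);
    }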

Very Large Regions: Re: Problem #3 (array access patterns)

before:              after:
    [x] = a;             [x] = a;
    b = [x];             b = a;
    [x] = c;             [x] = c;

non-clobber antidependences… GONE!
PLDI ’12 algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays

Very Large Regions 39 Re: Problem #3 (array access patterns) not really practical for large arrays but if we don’t do it, non-clobber antidependences remain solution: handle potential non-clobbers in a post-pass (same way we deal with loop clobbers in static analysis) // initialize: int[100] array; memset(&array, 100*4, 0); // accumulate: for (...) array[i] += foo(i);