The Case for a SC-preserving Compiler

Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)


Talk Summary

SC-preserving compiler
  Every SC behavior of the binary is a SC behavior of the source
  Guarantees SC assuming SC hardware
A SC-preserving compiler is acceptably efficient
  Enable optimizations only when provably SC-preserving
  With simple, scalable, and readily implementable analysis
  2% avg, 30% max overhead on SPLASH & PARSEC benchmarks
Static and dynamic analyses can further reduce the performance overhead

Many Compiler Optimizations Are Not SC-Preserving

Example: Common Subexpression Elimination (CSE)
t, u, v are local variables
X, Y are possibly shared

Before:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

After CSE:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

Common Subexpression Elimination Is Not SC-Preserving

Init: X = Y = 0;

Thread 1, original:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

Thread 1, after CSE:
    L1: t = X*5;
    L2: u = Y;
    L3: v = t;

Thread 2 (the same in both cases):
    M1: X = 1;
    M2: Y = 1;

Original: u == 1 implies v == 5
After CSE: possibly u == 1 && v == 0
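The claim on this slide can be checked by brute force. The sketch below is not from the talk: it models each statement as a small Python function (the function names and the `sc_outcomes` helper are made up for illustration) and enumerates every sequentially consistent interleaving of the two threads, collecting the final (u, v) pairs for the original and the CSE-transformed code.

```python
from itertools import combinations

# Each statement mutates one dict that holds shared memory (X, Y)
# and the first thread's local variables (t, u, v).
def l1(s): s["t"] = s["X"] * 5
def l2(s): s["u"] = s["Y"]
def l3_original(s): s["v"] = s["X"] * 5
def l3_after_cse(s): s["v"] = s["t"]
def m1(s): s["X"] = 1
def m2(s): s["Y"] = 1

def sc_outcomes(thread1, thread2):
    """Return the set of final (u, v) pairs over all SC interleavings."""
    total = len(thread1) + len(thread2)
    outcomes = set()
    # Pick which global steps belong to thread 1. Program order is kept
    # inside each thread, so this enumerates exactly the SC executions.
    for slots in combinations(range(total), len(thread1)):
        state = {"X": 0, "Y": 0, "t": None, "u": None, "v": None}
        it1, it2 = iter(thread1), iter(thread2)
        for step in range(total):
            stmt = next(it1) if step in slots else next(it2)
            stmt(state)
        outcomes.add((state["u"], state["v"]))
    return outcomes

original = sc_outcomes([l1, l2, l3_original], [m1, m2])
transformed = sc_outcomes([l1, l2, l3_after_cse], [m1, m2])

# The only outcome that CSE adds is u == 1 with v == 0.
print(sorted(transformed - original))
```

With only five statements there are ten interleavings, so exhaustive enumeration is cheap here; it would not scale to real programs and is meant only to make the counterexample concrete.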

Implementing CSE in a SC-Preserving Compiler

    L1: t = X*5;          L1: t = X*5;
    L2: u = Y;      =>    L2: u = Y;
    L3: v = X*5;          L3: v = t;

Enable this transformation when
  X is a local variable, or
  Y is a local variable
In these cases, the transformation is SC-preserving
Identifying local variables:
  Compiler generated temporaries
  Stack allocated variables whose address is not taken
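As a rough illustration of the rule on this slide, the check below decides whether reusing an earlier load is allowed. It is a hypothetical sketch, not the LLVM implementation: the function name `cse_is_sc_preserving` and its set-based arguments are assumptions, and it covers only the two conditions stated here (the reused variable is local, or every variable accessed in between is local).

```python
def cse_is_sc_preserving(reused_vars, intervening_vars, local_vars):
    """Conservative check for reusing an earlier value of an expression.

    reused_vars:      variables read by the repeated expression (X above)
    intervening_vars: variables accessed between the two occurrences (Y, u)
    local_vars:       variables known to be thread-local
    """
    local_vars = set(local_vars)
    # Case 1: the expression only reads locals, so no other thread can
    # observe or affect the reuse.
    if not (set(reused_vars) - local_vars):
        return True
    # Case 2: nothing shared is accessed in between, so skipping the second
    # load does not reorder it with respect to another shared access.
    return not (set(intervening_vars) - local_vars)

# X and Y both possibly shared: the transformation is rejected.
print(cse_is_sc_preserving({"X"}, {"Y", "u"}, {"t", "u", "v"}))       # False
# Y known to be local: the transformation is allowed.
print(cse_is_sc_preserving({"X"}, {"Y", "u"}, {"t", "u", "v", "Y"}))  # True
```

The check errs on the side of rejecting: any shared access between the two occurrences disables the optimization.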

A SC-Preserving LLVM Compiler for C Programs

Modify each of ~70 phases in LLVM to be SC-preserving
Enable trace-preserving optimizations
  These do not change the order of memory operations
  e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
Enable transformations on local variables
Enable transformations involving a single shared variable
  e.g. t=X; u=X; v=X;  =>  t=X; u=t; v=t;

Performance Overhead

Baseline: LLVM -O3
Experiments on Intel Xeon, 8 cores, 2 threads/core, 6GB RAM

The Overhead in Facesim

Original hot loop:
    float s, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        s += (x[i]-y[i]) * (x[i]-y[i]);
        …
    }

After the transformation:
    float s, t, *x, *y;
    int i;
    …
    hot_for_loop(… i …){
        t = (x[i]-y[i]);
        s += t*t;
        …
    }

This transformation reduces the overhead from 34% to 6%
Optimizations in non-hot-loops do not buy much performance
A SC-preserving compiler slows down a program if
  The hot-loops involve more than one shared variable, and
  Aliasing constraints do not prevent optimizations in the loop

Improving Performance of SC-Preserving Compiler

Request programmers to reduce shared accesses in hot loops
Use sophisticated static analysis
  Infer more thread-local variables
  Infer data-race-free shared variables
Use program annotations
  Requires changing the program language
  Minimum annotations sufficient to optimize the hot loops
Perform load-optimizations speculatively
  Hardware exposes speculative-load optimization to the software
  Load optimizations reduce the max overhead to 6%

Eager-Load Optimizations

Eagerly perform loads or use values from previous loads or stores

Common Subexpression Elimination:
    L1: t = X*5;          L1: t = X*5;
    L2: u = Y;      =>    L2: u = Y;
    L3: v = X*5;          L3: v = t;

Constant/copy Propagation:
    L1: X = 2;            L1: X = 2;
    L2: u = Y;      =>    L2: u = Y;
    L3: v = X*5;          L3: v = 10;

Loop-invariant Code Motion:
    L1:                   L1: u = X*5;
    L2: for(…)      =>    L2: for(…)
    L3:   t = X*5;        L3:   t = u;

Performance Overhead

Allowing eager-load optimizations alone reduces max overhead to 6%

Correctness Criteria for Eager-Load Optimizations

Eager-load optimizations rely on a variable remaining unmodified in a region of code
  Sequential validity: No mods to X by the current thread in L1-L3
  SC-preservation: No mods to X by any other thread in L1-L3

    L1: t = X*5;     Enable invariant "t == 5*X"
    L2: *p = q;      Maintain invariant "t == 5*X"
    L3: v = X*5;     Use invariant "t == 5*X" to transform L3 to v = t;

Speculatively Performing Eager-Load Optimizations

Original:
    L1: t = X*5;
    L2: u = Y;
    L3: v = X*5;

With speculation:
    L1: t = monitor.load(X, tag) * 5;
    L2: u = Y;
    L3: v = t;
    C4: if (interference.check(tag))
    C5:   v = X*5;

On monitor.load, hardware starts tracking coherence messages on X’s cache line
The interference check fails if X’s cache line has been downgraded since the monitor.load
In our implementation, a single instruction checks interference on up to 32 tags
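The monitor/check protocol can be mimicked in software to see why the recovery code at C4-C5 restores the SC result. The toy model below is an assumption-laden sketch, not the proposed hardware or any real API: `ToyMonitorHardware` and `thread1` are hypothetical names, each variable is treated as its own cache line, and a "downgrade" is approximated by a store from another thread.

```python
class ToyMonitorHardware:
    """Software stand-in for monitor.load / interference.check (illustrative)."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.monitored = {}      # tag -> monitored location
        self.interfered = set()  # tags whose location was written remotely

    def monitor_load(self, location, tag):
        # Start tracking the location under this tag, then do the load.
        self.monitored[tag] = location
        self.interfered.discard(tag)
        return self.memory[location]

    def remote_store(self, location, value):
        # A store by another thread; flag every tag watching this location.
        self.memory[location] = value
        for tag, watched in self.monitored.items():
            if watched == location:
                self.interfered.add(tag)

    def interference_check(self, tag):
        # True when the monitored location changed since monitor_load.
        return tag in self.interfered


def thread1(hw, remote_stores=()):
    t = hw.monitor_load("X", tag=0) * 5    # L1
    for location, value in remote_stores:  # the other thread runs here
        hw.remote_store(location, value)
    u = hw.memory["Y"]                     # L2
    v = t                                  # L3: speculative reuse of t
    if hw.interference_check(0):           # C4
        v = hw.memory["X"] * 5             # C5: recovery, redo the load
    return u, v


quiet = thread1(ToyMonitorHardware({"X": 0, "Y": 0}))
raced = thread1(ToyMonitorHardware({"X": 0, "Y": 0}), [("X", 1), ("Y", 1)])
print(quiet, raced)  # (0, 0) (1, 5)
```

In the raced run the other thread's stores land between L1 and L2, which is exactly the interleaving where plain CSE would produce u == 1 with v == 0; the check detects the write to X and the reload yields v == 5 instead.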

Conclusion(s)

Performance cost of SC = 5%
  Cost of SC hardware = 3% [Milo’s talk yesterday]
  Cost of SC-preserving compiler = 2%