A Practical Stride Prefetching Implementation in Global Optimizer


A Practical Stride Prefetching Implementation in Global Optimizer
Hucheng Zhou, Xing Zhou
Tsinghua University
11/28/2018

Outline
- Introduction
- Motivation
- Algorithm
- Phase Ordering
- Prefetching Scheduling
- Experiments
- Future Work

Introduction
What is data prefetching?
- Brings data into the cache ahead of its use
Compiler-controlled prefetching:
- Identification of prefetching candidates
- Determination of prefetching timing
- Elimination of unnecessary prefetches
- Other prefetching tuning

Introduction
Stride data prefetching:
- Massive consecutive memory references cause many cache misses, and thus poor performance
Our focus:
- Compiler-based stride data prefetching
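To make the idea concrete, here is a minimal sketch of compiler-style stride prefetching written by hand. It uses the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance `DIST` is an illustrative tuning value, not one taken from the talk.

```cpp
#include <cstddef>
#include <vector>

// A strided traversal touches a new cache line every few iterations, so a
// cold walk misses on each line's first touch. Issuing a prefetch DIST
// iterations ahead gives the memory system lead time to fetch the line.
double sum_with_prefetch(const std::vector<double>& a) {
    constexpr std::size_t DIST = 16;  // hypothetical prefetch distance
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i + DIST < a.size())
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/3);
        sum += a[i];  // the actual (possibly missing) access
    }
    return sum;
}
```

A real compiler implementation emits the equivalent prefetch instructions itself; this hand-written form only shows what the transformed loop computes.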

Motivation
Dominant stride prefetching algorithm: Loop Nest Optimizer (LNO) based
The LNO-based algorithm performs:
- Locality analysis (reuse analysis, localized iteration space, prefetching predicates)
- Loop splitting (loop peeling and unrolling)
- Scheduling prefetches (iterations ahead of use)
Limitations of the LNO-based approach
Observations

LNO-based algorithm: example
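The slide's example code is not preserved in the transcript. A typical LNO-style transformation of an affine array loop, in the Mowry tradition (one prefetch per cache line via unrolling, plus loop splitting for the epilogue), looks roughly like this sketch; the line size of 8 doubles and the two-line prefetch distance are illustrative assumptions, not values from the talk.

```cpp
#include <cstddef>

// y[i] += a * x[i], with software prefetching after LNO-style loop
// splitting: an unrolled steady state issues one prefetch per cache line,
// and a scalar epilogue handles the final iterations without prefetches.
void axpy_prefetched(double* y, const double* x, double a, std::size_t n) {
    constexpr std::size_t LINE = 8;         // doubles per cache line (assumed)
    constexpr std::size_t DIST = 2 * LINE;  // prefetch two lines ahead
    std::size_t i = 0;
    // Steady state: unrolled by LINE so each line is prefetched exactly once.
    for (; i + DIST < n; i += LINE) {
        __builtin_prefetch(&x[i + DIST]);
        __builtin_prefetch(&y[i + DIST]);
        for (std::size_t j = 0; j < LINE; ++j)
            y[i + j] += a * x[i + j];
    }
    // Epilogue (from loop splitting): remaining iterations, no prefetching.
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```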

Limitations
- Only effective for affine array references
- Only handles DO loop nests, due to the vector-space model
- Focuses only on numerical applications that operate on dense matrices
However, not all strided references are affine array references; consider C++ STL vector traversal and other wrap-around data structures.

Necessity
Four common ways of STL vector traversal
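The four traversal variants on the slide are not in the transcript; the following are the four idiomatic forms such a slide usually contrasts (a reconstruction, not the authors' exact code). All four advance the accessed address by `sizeof(double)` per iteration, so all four are stride references, yet none is an affine array reference once the vector is lowered to its begin/end pointers.

```cpp
#include <cstddef>
#include <vector>

double sum_index(const std::vector<double>& v) {   // 1. index-based
    double s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}
double sum_iter(const std::vector<double>& v) {    // 2. iterator-based
    double s = 0;
    for (auto it = v.begin(); it != v.end(); ++it) s += *it;
    return s;
}
double sum_ptr(const std::vector<double>& v) {     // 3. raw pointer walk
    double s = 0;
    for (const double* p = v.data(); p != v.data() + v.size(); ++p) s += *p;
    return s;
}
double sum_range(const std::vector<double>& v) {   // 4. range-for
    double s = 0;
    for (double x : v) s += x;
    return s;
}
```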

The component flow of Open64

IR after PRE-OPT, for ACCESS1 and ACCESS2

Comparison with array references

Comparison
The LNO-based approach exploits the tight affinity between locality analysis and the vector-space model to identify the prefetching candidates that suffer from cache misses. However, this affinity limits it to affine array references; it cannot handle STL-style stride references.
From another angle, we identify stride prefetching candidates via induction variable recognition, and then exploit phase ordering to avoid unnecessary prefetches.

Definitions and Observations
A linear inductive variable (expression) is an expression whose value is incremented by a nonzero loop-invariant integer on every iteration.
Lemma 1: linear inductive expressions can be defined recursively:
- If v is a linear induction variable with stride s, then v is a linear inductive expression with the same stride s;
- If expr is a linear inductive expression with stride s, then -expr is a linear inductive expression with stride -s;
- If expr is a linear inductive expression with stride s and invar is a loop invariant, then expr + invar and invar + expr are linear inductive expressions with stride s;
- If expr1 and expr2 are linear inductive expressions with strides s1 and s2 respectively, then expr1 + expr2 is a linear inductive expression with stride s1 + s2;
- If expr is a linear inductive expression with stride s and invar is a loop invariant, then expr * invar and invar * expr are linear inductive expressions with stride invar * s;
- If expr is a linear inductive expression with stride s and invar is a loop invariant, then expr / invar is a linear inductive expression with stride s / invar.
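Lemma 1's rules can be folded bottom-up over an expression tree. The sketch below does exactly that for the negation, addition and multiplication rules; the tree representation is our own illustration, not the paper's IR, and division is omitted for brevity.

```cpp
#include <memory>
#include <optional>

// A node is a loop invariant, a basic induction variable (with known
// stride), or an arithmetic combination of sub-expressions.
struct Expr {
    enum Kind { Invar, IndVar, Neg, Add, Mul } kind;
    long value = 0;  // invariant value (Invar) or stride (IndVar)
    std::shared_ptr<Expr> a, b;
};
using E = std::shared_ptr<Expr>;

E mk(Expr::Kind k, long v, E a = {}, E b = {}) {
    auto e = std::make_shared<Expr>();
    e->kind = k; e->value = v; e->a = a; e->b = b;
    return e;
}

// Returns the stride if the expression is linear inductive, nullopt otherwise.
std::optional<long> stride_of(const E& e) {
    switch (e->kind) {
    case Expr::Invar:  return 0;          // loop invariant: stride 0
    case Expr::IndVar: return e->value;   // basic IV with stride s
    case Expr::Neg: {                     // -expr       -> stride -s
        auto s = stride_of(e->a);
        return s ? std::optional<long>(-*s) : std::nullopt;
    }
    case Expr::Add: {                     // expr1+expr2 -> stride s1+s2
        auto s1 = stride_of(e->a), s2 = stride_of(e->b);
        return (s1 && s2) ? std::optional<long>(*s1 + *s2) : std::nullopt;
    }
    case Expr::Mul: {                     // expr*invar  -> stride invar*s
        auto s1 = stride_of(e->a), s2 = stride_of(e->b);
        if (s1 && e->b->kind == Expr::Invar) return *s1 * e->b->value;
        if (s2 && e->a->kind == Expr::Invar) return *s2 * e->a->value;
        return std::nullopt;              // IV * IV is not linear
    }
    }
    return std::nullopt;
}
```

For example, with i a basic induction variable of stride 1, the address expression 4*i + 100 folds to stride 4, while i*i is rejected as non-linear.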

Definitions and Observations
Mathematically, a linear inductive expression is therefore a linear combination of linear induction variables and loop invariants, of the form E = c1*i1 + c2*i2 + … + cn*in + invar, whose stride is c1*s1 + c2*s2 + … + cn*sn, where sk is the stride of ik.
A stride reference is a reference in a loop whose accessed memory address is incremented by a loop-invariant integer on every iteration.
Lemma 2: if the accessed memory address of a reference in a loop can be represented as a linear inductive expression, then it is a stride reference.

Speculative Induction Variable Recognition for Stride Prefetching
Thus stride reference identification reduces to inductive expression recognition. We present a demand-driven algorithm for speculative recognition of inductive expressions.

Speculative Induction Variable Recognition for Stride Prefetching
In SSA form, an induction variable must satisfy the following conditions:
- There must be a live phi in the corresponding loop header BB;
- Among the two operands of the phi, the loop-invariant operand must point to the initialization of the induction variable outside the loop, while the other operand must be defined within the loop body. We call them init and increment respectively;
- After expanding the increment operand of the phi by copy propagation, the expansion result must contain the result of that phi, with a loop-invariant expression as the stride of the induction variable.
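The conditions above can be sketched on a toy SSA model (our own data structures, not Open64's WHIRL/SSA). For i1 = phi(i0, i2) in the loop header, we expand i2 through the body definitions by copy propagation; the pattern matches only if the expansion bottoms out at i1 itself, and the accumulated loop-invariant addend is then the stride.

```cpp
#include <map>
#include <optional>
#include <string>

// Each in-body definition, after lowering, is "lhs = base + addend",
// where addend is a loop-invariant integer.
struct Def { std::string base; long addend; };

// Expand `increment` by copy propagation through the body definitions.
// Returns the stride if the chain reaches the phi's own result (an
// acyclic chain is assumed in this sketch).
std::optional<long> stride_of_phi(const std::string& phi_result,
                                  const std::string& increment,
                                  const std::map<std::string, Def>& body) {
    std::string cur = increment;
    long stride = 0;
    while (true) {
        auto it = body.find(cur);
        if (it == body.end()) break;  // reached a name not defined in the body
        stride += it->second.addend;
        cur = it->second.base;
    }
    if (cur == phi_result) return stride;  // expansion contains the phi result
    return std::nullopt;                   // not an induction variable
}
```

E.g. for i1 = phi(i0, i2) with body defs t1 = i1 + 4 and i2 = t1, the expansion of i2 yields i1 + 4, so i1 is recognized with stride 4.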

Our algorithm


Comparison
Traditional induction variable recognition:
- Equivalent to strongly-connected-component detection
- Works only on variables
- Conservative due to aliasing
- Limited by copy propagation
Our algorithm:
- Demand driven
- Symbolic interpretation
- Speculative determination
- Needs only small modifications to the expansion process of the current implementation

Phase Ordering
Implementing our algorithm after SSAPRE benefits from the strength reduction and PRE optimizations.

Prefetching Scheduling
- Leading reference determination
- Prefetching information collection: stride value, data/loop shape, target cache model
- Prefetching determination for the candidates, based on heuristics such as data and loop size, as well as the number of prefetches in the loop
- Computation of the prefetching distance: the memory latency divided by the estimated time per iteration
- Loop transformations based on locality information, to further reduce the number of prefetches
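The distance computation in the list above is a one-liner: latency divided by per-iteration cost, rounded up so the data arrives before its use. The cycle counts passed in are illustrative assumptions, not measurements from the paper.

```cpp
#include <cstdint>

// Prefetch distance in iterations: ceil(memory latency / time per iteration),
// clamped to at least one iteration of lead time.
std::uint64_t prefetch_distance(std::uint64_t mem_latency_cycles,
                                std::uint64_t cycles_per_iteration) {
    if (cycles_per_iteration == 0) return 1;  // degenerate estimate
    std::uint64_t d = (mem_latency_cycles + cycles_per_iteration - 1)
                      / cycles_per_iteration;  // ceiling division
    return d == 0 ? 1 : d;
}
```

For instance, a 200-cycle memory latency over a 30-cycle loop body gives a distance of 7 iterations.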

Experiments
We conducted experiments with the SPEC2006 benchmarks on an IA-64 quad-processor server: Itanium 2 Madison at 1.6 GHz, with a 6 MB L3 cache and 8 GB of memory, running Red Hat Linux Advanced Server 4.0. The compiler is Open64 4.1.

Normalized results of SPEC2006 FP

Normalized results of SPEC2006 INT

Conclusion and Future Work
We propose an alternative inductive data prefetching algorithm, implemented in the global optimizer at the O2 level, which in theory can prefetch almost all stride references that are statically determinable at compile time.
Future work:
- Extend to prefetching of periodic, polynomial, geometric, monotonic and wrap-around variables
- Fully integrate the stride prefetching algorithm with the strength reduction optimization in SSAPRE
- Coordinate data prefetching with data layout optimization
- Further investigate the interaction between software and hardware prefetching, based on static compiler analysis and feedback information, on the X86 platform

Thanks
Thank you very much. Any questions?