

Automatic Pool Allocation for better Memory System Performance Presented by: Chris Lattner Joint work with: Vikram Adve University of Illinois at Urbana-Champaign Compiler/Architecture Seminar April 26, 2002

Slide #2 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #3 The Problem
Memory system performance is important
– Fast CPU, slow memory, not enough cache
These are hard problems
– Data sets continue to grow
– Traditional scalar optimizations are not enough
Fine-grain approaches have limited gains
– Prefetching recursive structures is hard
– Load/store motion can only fix some symptoms

Slide #4 Our Approach
Analyze & transform entire data structures
– Use a macroscopic approach to get the biggest gains
Handle arbitrarily complex data structures
– Process lists, trees, hash tables, ASTs, etc.
This is fundamentally interprocedural:
– Perform transformations at link-time

Slide #5 Our Approach (continued)
Fully Automatic Pool Allocation
– Improves locality
– Enables new transformations:
– Pointer compression (discussed here)
– Novel prefetching schemes
– More aggressive structure reordering, splitting, …
– Transparent garbage collection
– Others...

Slide #6 High Level Strategy
Logical Data Structure Analysis
– Identify data structures to transform
Automatic Pool Allocation
– Create and destroy memory pools
– Allocate and free from pools instead of the heap
Pointer Compression
– Reduce memory bandwidth by shrinking pointers

Slide #7 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #8 LLVM Infrastructure
Low Level IR with High Level Types
– High-level transformations on 3-address code
Supports Link-Time Optimization
– All code retained in LLVM form until final link
Language and target neutral
– Currently use a GCC-based C front-end
– 64-bit SPARC V9 back-end

Slide #9 Compiling with LLVM
[Pipeline diagram: static compilers 1…N for C, C++, Java, and Fortran emit LLVM code; a linker with an interprocedural optimizer and code generator, together with precompiled libraries, produces LLVM or machine code; a runtime optimizer operates on the running machine code.]

Slide #10 Important properties of LLVM
Type information critical for our analysis
– Simple identification of unsafe type usage
malloc & free instructions built into the ISA
– Simple analysis of memory traffic
– malloc returns a strongly typed pointer result
Expressive enough to represent C codes
– … but explicitly identifies unsafe pointer tricks

Slide #11 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #12 Logical Data Structure Analysis
Identify logical data structures
– Entire lists, trees, heaps, graphs, ...
Capture the data structure graph concisely
Context-sensitive, flow-insensitive analysis
– Related to heap shape analysis, pointer analysis

Slide #13 Example Data Structure Graphs

Slide #14 Data Structure Graph
Each node represents a memory object
– malloc(), alloca(), global variable, ...
Edges represent the "may point to" set
– Set of nodes reachable from a pointer
Track scalar points-to sets:
– Which graph nodes do scalars point to?
– Completely ignore integer values!

Slide #15 Analysis Outline: First Pass
Initial pass over the function:
– Find all pointers, global values, unknowns
– Create graph nodes, initialize scalar pointers
Example Function:
  struct List { Patient *data; List *next; };
  void addList(List *list, Patient *data) {
    List *b, *nlist;
    while (list != NULL) {
      b = list;
      list = list->next;
    }
    nlist = malloc(sizeof(List));
    nlist->data = data;
    nlist->next = NULL;
    b->next = nlist;
  }
[Graph: scalars list and b point to a shadow List node {data, next}; nlist points to a new List node {data, next}; the data fields point to a shadow Patient node.]
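The slide's example can be written out as compilable C. This sketch fills in a minimal Patient payload (an assumption; the slide leaves it opaque) and, like the slide's version, assumes the list passed in is non-empty so that b is initialized before the final store:

```c
#include <stdlib.h>
#include <assert.h>

typedef struct Patient { int id; } Patient;   /* stand-in payload */
typedef struct List List;
struct List { Patient *data; List *next; };

/* The slide's example: walk to the tail, then append a new node. */
void addList(List *list, Patient *data) {
    List *b = NULL, *nlist;
    while (list != NULL) {       /* find the last node */
        b = list;
        list = list->next;
    }
    nlist = malloc(sizeof(List));
    nlist->data = data;
    nlist->next = NULL;
    b->next = nlist;             /* link the new node onto the tail */
}

/* Small helper so the behavior is observable. */
int listLength(List *l) {
    int n = 0;
    for (; l != NULL; l = l->next) n++;
    return n;
}
```

The analysis never runs this code; it builds a graph from it. But having a concrete, runnable version makes the scalar point-to sets on the slide (list, b, nlist) easy to trace by hand.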

Slide #16 Analysis Outline: Worklist
Add all pointer instructions to a worklist. For each:
– Update the graph for the instruction
– If a pointer value changes, add all its uses to the worklist
– Merge indistinguishable nodes
(Same example function as the previous slide.)
[Graph: the example's shadow/new List nodes are progressively merged as the worklist reaches a fixed point; list, b, and nlist end up pointing into the merged List node, whose data field points to a shadow Patient node.]
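The worklist discipline described above (pop an item, update state, re-queue anything whose inputs changed, repeat until nothing changes) can be sketched in a few lines. As a stand-in for points-to updates, this toy propagates reachability over a small adjacency matrix; the names and the problem itself are illustrative, not the analysis's actual data structures:

```c
#include <string.h>
#include <assert.h>

enum { N = 4 };   /* graph size for the sketch */

/* Fixed-point worklist: mark everything reachable from src. */
void reachable_from(int src, const int edge[N][N], int reached[N]) {
    int worklist[N], top = 0;
    memset(reached, 0, N * sizeof(int));
    reached[src] = 1;
    worklist[top++] = src;
    while (top > 0) {
        int n = worklist[--top];      /* pop one work item          */
        for (int m = 0; m < N; m++) {
            if (edge[n][m] && !reached[m]) {
                reached[m] = 1;       /* state changed, so...       */
                worklist[top++] = m;  /* ...enqueue the affected node */
            }
        }
    }
}
```

Each node is enqueued at most once (only when it first becomes reached), which is the same monotonicity argument that bounds the real analysis's worklist.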

Slide #17 Interprocedural Closure
Inline called functions' graphs for interprocedural closure
– Inline a graph in place of the call node
– Link up arguments & return values
– # of inlines proportional to # of function calls in the program!
– Resultant graphs are very compact
Example Function:
  Tree *TreeAlloc(…) {
    if (…) return NULL;
    node = malloc(…);
    node->l = TreeAlloc(…);
    node->r = TreeAlloc(…);
    return node;
  }
[Local graph: node points to a new Tree {left, right} whose edges reach call nodes for the recursive TreeAlloc calls, each returning a shadow Tree. Closed graph: node points to a single new Tree whose left and right edges point back to that same node.]
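The slide elides TreeAlloc's condition and arguments with "…"; those specifics stay elided above. Purely for illustration, here is one runnable completion using a hypothetical depth parameter as the base case (an assumption, not the benchmark's actual signature):

```c
#include <stdlib.h>
#include <assert.h>

typedef struct Tree Tree;
struct Tree { Tree *l, *r; };

/* Recursive allocator in the shape of the slide's example:
   builds a complete binary tree of the given depth. */
Tree *TreeAlloc(int depth) {
    if (depth == 0) return NULL;
    Tree *node = malloc(sizeof(Tree));
    node->l = TreeAlloc(depth - 1);
    node->r = TreeAlloc(depth - 1);
    return node;
}

/* Count nodes, to make the structure observable. */
int TreeCount(Tree *t) {
    return t ? 1 + TreeCount(t->l) + TreeCount(t->r) : 0;
}
```

The interesting point for the analysis is that the recursion collapses: inlining the callee's graph into the caller's yields one Tree node whose left/right edges point to itself, no matter how deep the actual tree is.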

Slide #18 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #19 Pool Allocation Motivation
Increases performance:
– Better spatial locality
– Avoid overhead of most malloc implementations
– All elements of the pool are the same size
Enables new transformations:
– Data structure nodes are all in a known place
– Can transform/rewrite the heap at runtime
– Pointer compression is our first application

Slide #20 Automatic Pool Allocation
Pool allocation is commonly applied manually
– … but never fully automatically (to our knowledge)
We have already identified logical data structures
– Allocate each node to a different pool
Pool allocate when safe and profitable:
– All nodes of the data structure subgraph are allocations
– Lifetime of the DS is contained in a function F
– The "profitable" heuristic is true (future work)

Slide #21 Pool Allocation Transformation
1. Function F is the root of the call graph using the DS
– Transform F and all called functions
2. Initialize a pool descriptor on entry to F
3. Transform malloc & free instructions into poolalloc(&PD) and poolfree(&PD) calls
4. Destroy the pool descriptor on all exits of F

Slide #22 Example: treeadd Benchmark (root doesn't escape main)
Before:
  int main() {
    Tree *root = TreeAlloc(...);
    result = TreeAdd(root);
  }
After:
  int main() {
    PoolDescriptor_t Pool;                  /* allocate pool descriptor */
    poolinit(&Pool, sizeof(Tree));          /* initialize memory pool   */
    Tree *root = pa_TreeAlloc(&Pool, ...);  /* transform function body  */
    result = pa_TreeAdd(&Pool, root);
    pooldestroy(&Pool);                     /* destroy pool on exit     */
  }
[Graph: root points to a new Tree node with left/right self-edges.]
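The runtime calls in the transformed code above can be pictured with a toy bump-pointer pool. The names (PoolDescriptor_t, poolinit, poolalloc, pooldestroy) match the slides; the fixed-capacity implementation is only a sketch of the idea, not the actual runtime, which would grow the pool and support poolfree:

```c
#include <stdlib.h>
#include <assert.h>

/* One pool holds objects of a single size, in contiguous storage. */
typedef struct {
    char  *base;      /* storage for this pool's objects          */
    size_t nodesize;  /* every object in the pool is this size    */
    size_t used;      /* nodes handed out so far                  */
    size_t capacity;  /* nodes the pool can hold                  */
} PoolDescriptor_t;

void poolinit(PoolDescriptor_t *pd, size_t nodesize) {
    pd->nodesize = nodesize;
    pd->capacity = 1024;              /* arbitrary for the sketch */
    pd->used = 0;
    pd->base = malloc(nodesize * pd->capacity);
}

void *poolalloc(PoolDescriptor_t *pd) {
    assert(pd->used < pd->capacity);  /* a real runtime grows here */
    return pd->base + pd->nodesize * pd->used++;
}

void pooldestroy(PoolDescriptor_t *pd) {
    free(pd->base);                   /* frees every node at once  */
    pd->base = NULL;
}
```

Two properties the transformation relies on are visible even in the sketch: consecutive allocations land adjacent in memory (the locality win), and destroying the pool releases the whole data structure in one call (why the DS lifetime must be contained in F).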

Slide #23 More Complex Example: power
Each node gets a separate pool
– Each pool has homogeneous objects
– Improves locality within the pool
Related pool descriptors are linked
– Isomorphic to the data structure graph
Disjoint data structures
– Each has a separate set of pools
– E.g., two disjoint bin-trees live in two pools
[Diagram: four linked pools P1–P4 mirroring the data structure graph.]

Slide #24 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #25 Pointer Compression Motivation
Pointers are big and very sparse
– A full 64 bits of addressability is often unnecessary!
– Often, local data structures contain < 2^16 nodes
– Pointers consume cache capacity and waste memory bandwidth
64-bit architectures are more and more prevalent
– Alpha, SPARC V9, IA64, x86-64, …
– This problem isn't going away any time soon…
– Some vendors report SPEC numbers in 32-bit mode!

Slide #26 Pointer Compression Approach
"Compress" pool pointers into pool indices
– Use a 16- or 32-bit index instead of a 64-bit pointer
– Can dramatically shrink data structure size!
Grow indices as required: 16 → 32 → 64 bits
– At runtime, if overflow is detected, rewrite the pool
– We know where all the references are, so we can do this
– Guarantees the transformation is safe: no loss of generality/power
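What "pool pointers become pool indices" looks like concretely: intra-pool links are stored as small indices into the pool's storage, and a dereference becomes base[index]. This sketch hard-codes 16-bit indices with a reserved NIL sentinel and omits the overflow/rewrite machinery; all names are illustrative:

```c
#include <stdint.h>
#include <assert.h>

#define NIL 0xFFFFu   /* reserved index playing the role of NULL */

/* A tree node whose child "pointers" are 16-bit pool indices. */
typedef struct {
    uint16_t left, right;   /* compressed intra-pool references */
    int      data;
} CTree;

/* The pool: indices address this array. With NIL reserved, a
   16-bit pool holds up to 65535 nodes before it must grow.    */
static CTree    pool[1u << 16];
static uint16_t pool_used = 0;

uint16_t ctree_alloc(int data) {          /* no overflow check: sketch */
    uint16_t i = pool_used++;
    pool[i].left = pool[i].right = NIL;
    pool[i].data = data;
    return i;
}

int ctree_sum(uint16_t n) {               /* dereference = pool[index] */
    if (n == NIL) return 0;
    return pool[n].data + ctree_sum(pool[n].left) + ctree_sum(pool[n].right);
}
```

Note the payoff: this node is 8 bytes where the pointer-based version on a 64-bit target is 24, and because every reference is pool-relative, the runtime could relocate or rewrite the pool (e.g., to widen indices on overflow) without chasing raw addresses.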

Slide #27 Pointer Compression Strategies
#1. Replace pointers with 16- or 32-bit indices:
– Not safe: poolalloc fails when the pool size is exceeded (the index would wrap around)
#2. Generate generic code for any index size:
– Pool descriptor contains the current index size
#3. Runtime specialization of the generic code:
– Index sizes are almost always runtime constants
– Specialize code using the runtime reoptimizer (future work)

Slide #28 Transformation
Transform the function as before, but:
– Replace all pointers into pools with k-bit indices
– Data structure types change:
    tree = { tree*, int, tree* } → newtree = { uint, int, uint }
– Can dramatically reduce memory footprint
– Scalars change as well
Load/store instructions are more complex:
– Compression makes dereferencing more expensive
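The type change on this slide can be checked with sizeof. This assumes the slide's `uint` means a 32-bit unsigned index and that fields pack as on typical ABIs; on an LP64 target the node shrinks from 24 bytes to 12:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Before: two 8-byte pointers plus an int (padded to pointer size). */
struct tree    { struct tree *left; int data; struct tree *right; };

/* After: the pointers become 32-bit pool indices.                  */
struct newtree { uint32_t     left; int data; uint32_t     right;  };

/* Bytes saved per node by compressing the two pointers. */
size_t bytes_saved_per_node(void) {
    return sizeof(struct tree) - sizeof(struct newtree);
}
```

Halving the node size doubles how many nodes fit in each cache line, which is where the bandwidth and capacity savings on the earlier motivation slide come from.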

Slide #29 Transforming Loads and Stores
Must have the pool base to access a DS node
– Maintained by the pool descriptor
Transform: X = t->left;
  t_poolbase = t_pool->base;
  X = t_poolbase[t].left;
Transform: s->right = Y;
  s_poolbase = s_pool->base;
  s_poolbase[s].right = Y;
Problem: too many loads of the pool base!
  t->left = NULL;            →    t_poolbase = t_pool->base;
  t->right = NULL;                t_poolbase[t].left = NULL;
  t->data = 0;                    t_poolbase = t_pool->base;
                                  t_poolbase[t].right = NULL;
                                  t_poolbase = t_pool->base;
                                  t_poolbase[t].data = 0;

Slide #30 Hacking Off Low-Hanging Fruit
The pool base can change: poolalloc and poolfree
– … but it usually doesn't (e.g., the TreeAdd function)
Local (basic block) load elimination pass
– Trivially simple algorithm (~50 LOC implementation)
– Removes most loads
– Better analysis could remove more! Loop-invariant loads would be useful
Before:
  t_poolbase = t_pool->base;
  t_poolbase[t].left = NULL;
  t_poolbase = t_pool->base;
  t_poolbase[t].right = NULL;
  t_poolbase = t_pool->base;
  t_poolbase[t].data = 0;
After:
  t_poolbase = t_pool->base;
  t_poolbase[t].left = NULL;
  t_poolbase[t].right = NULL;
  t_poolbase[t].data = 0;

Slide #31 Talk Overview
› Problems, approach
› LLVM Infrastructure
› Data Structure Analysis
› Fully Automatic Pool Allocation
› Pointer Compression
› Early Results

Slide #32 Early Results
Implementation underway:
– Data structure analysis complete
– Pool allocation complete
– Pointer compression strategy #1: ~85% done (missing a few corner cases)
– Local load elimination complete
– Many other optimizations/cleanups missing!
Implemented on the C version of the Olden suite

Slide #33 Experimental Setup
Baseline:
– Compile with LLC, the SPARC V9 LLVM back-end
PointerComp 32:
– Enable 32-bit pointer compression, compile with LLC
PointerComp 16:
– Enable 16-bit pointer compression, compile with LLC
Each benchmark is run four times:
– The first execution is discarded; the rest are averaged

Slide #34 Pointer Compression Speedup
treeadd, llubench:
– Heap intensive, small objects → big gains
Let's zoom in to see more detail...

Slide #35 Speedups (Zoomed)
16-bit pool allocation is always a win! (1% to 350%)
power: CPU intensive → 0.2% improvement (cache is not the bottleneck)
tsp, bisort: overhead is too much for 32-bit pool allocation!
– Losses of 13% and 29% respectively; what are the sources of the overhead?
Note: these are preliminary numbers; further work is needed!

Slide #36 Local Load Elimination Speedups
Relatively small improvements
– A 4% gain isn't anything to sneeze at, though
What is the 3% loss from?
– Bad interaction with our untuned register allocator…

Slide #37 Sources of Overhead
Extra arguments to functions:
– Must pass the pool descriptor in with pointers
Redundant poolbase loads:
– Use pointer-analysis-based load elimination
Load/store complexity:
– Common subexpressions not eliminated yet!
Before:
  t_poolbase = t_pool->base;
  t_poolbase[t].left = NULL;
  t_poolbase[t].right = NULL;
  t_poolbase[t].data = 0;
After CSE:
  t_poolbase = t_pool->base;
  T = &t_poolbase[t];
  T->left = NULL;
  T->right = NULL;
  T->data = 0;

Slide #38 Future Work
Tuning and refining the implementation:
– A "profitable" heuristic to avoid unnecessary transformations
– Reducing other overheads
More, bigger benchmarks
Investigating pool allocation + prefetching:
– Allocation-order prefetching for free
– History prefetching using compressed pointers
Other applications of pool allocation

Slide #39 Conclusions
Macroscopic data structure transformations
Fully Automatic Pool Allocation
– Uses data structure analysis
– Used primarily to enable new transformations
Example usage: Pointer Compression
– Promising initial results!
Many future applications...

Slide #40 Questions? For more information: – To contact us: Vikram Adve Chris Lattner