Discovery of Locality-Improving Refactorings by Reuse Path Analysis – Kristof Beyls – HPCC06 – 2006-09-13

Presentation transcript:

Slide 1: Discovery of Locality-Improving Refactorings by Reuse Path Analysis – Kristof Beyls, Erik H. D’Hollander (HPCC06)

Slide 2: Overview
– Cache behaviour optimization by reuse analysis: motivation & concepts
– Efficient profiling of reuse paths
– Experimental results
– Conclusion

Slide 3: Overview
– Cache behaviour optimization by reuse analysis: motivation & concepts
– Efficient profiling of reuse paths
– Experimental results
– Conclusion

Slide 4: Motivation: why optimize cache behavior manually?
– In many programs, cache misses cause more than a 2x slowdown.
– Compilers can only eliminate a small part of the misses through automatic code optimizations.
– Therefore: there is a need for profiling tools that enable effective source code optimization.

Slide 5: Illustration: bottlenecks of SPEC2000 on Itanium 1

Slide 6: Good Cache Behavior = High Data Locality
– Caches allow more efficient execution when they retain data between different reuses.
– Cache hits only occur when the distance between reuses (the reuse distance) is small enough (see the sketch below).
– Long reuse distance => poor locality; short reuse distance => good locality.
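
As an illustration of the reuse-distance metric described above (this sketch is not part of the original slides; the trace format and function names are my own), the code below counts, for each access in a short address trace, how many distinct addresses were touched since the previous access to the same address:

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Reuse distance of an access = number of distinct addresses touched since
    // the previous access to the same address (-1 marks a first use).
    // The quadratic rescan is fine for a tiny illustrative trace only.
    std::vector<long> reuse_distances(const std::vector<uint64_t>& trace) {
        std::unordered_map<uint64_t, size_t> last_access;   // address -> last index
        std::vector<long> dist(trace.size(), -1);
        for (size_t i = 0; i < trace.size(); ++i) {
            auto it = last_access.find(trace[i]);
            if (it != last_access.end()) {
                std::unordered_set<uint64_t> between;
                for (size_t j = it->second + 1; j < i; ++j) between.insert(trace[j]);
                dist[i] = static_cast<long>(between.size());
            }
            last_access[trace[i]] = i;
        }
        return dist;
    }

    int main() {
        // Access pattern A B C A: the reuse of A has distance 2 (B and C in between).
        std::vector<uint64_t> trace = {0xA, 0xB, 0xC, 0xA};
        for (long d : reuse_distances(trace)) std::cout << d << ' ';
        std::cout << '\n';   // prints: -1 -1 -1 2
    }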

Slide 7: Optimize temporal locality: how to?
– Optimizing temporal locality = shortening the reuse distance.
– General optimization methodology: profile to find hot spots, then refactor the hot spots to remove the bottleneck.
– However, the hot spot where the misses occur is often not the place where a refactoring is needed to optimize temporal locality! (An example follows in a few slides.)

Slide 8: Methodology to optimize temporal locality
Goal: find the refactorings that improve temporal locality, so that cache misses are eliminated.
– Step 1: locate the cache misses in the source code (e.g. using VTune, Cachegrind, etc.; 1990s tools).
– Step 2: find the previous uses (a cache miss = a long-distance reuse), e.g. using StatCache (Berg 2005) or RDVIS (Beyls 2005).
– Step 3: find a refactoring to shorten the reuse distance, using SLO (Suggestions for Locality Optimizations), by analyzing reuse paths. A reuse path = the code executed between a use and its reuse.

Slide 9: Methodology to optimize locality, example, step 1: locate the cache misses (VTune, Cachegrind, CProf, HPCView, Memspy, …)

double sum( … ) {
  …
  for(int i=0; i<len; i++)
    result += X[i];      // <- all cache misses occur here
  …
}

Slide 10: Methodology to optimize locality, example, step 2: find the previous uses (StatCache, RDVIS)

double sum( … ) {
  …
  for(int i=0; i<len; i++)
    result += X[i];      // <- all cache misses occur here
  …
}

double inproduct( … ) {
  …
  for(int i=0; i<len; i++)
    result += X[i]*Y[i]; // <- previous uses occur here
  …
}

Slide 11: Methodology to optimize locality, example, step 3: find the refactoring (SLO – Suggestions for Locality Optimizations)

double inproduct( … ) {
  …
  for(int i=0; i<len; i++)
    result += X[i]*Y[i];
  …
}

double sum( … ) {
  …
  for(int i=0; i<len; i++)
    result += X[i];
  …
}

Slide 12: Methodology to optimize cache behavior, example, conclusions
– Cache misses occur in function sum.
– The refactoring needed to optimize locality is in function prodsum, i.e. in a different location than where the bottleneck occurs (see the sketch below).
– The place where the previous use occurs (in inproduct) is not enough information to find the refactoring (in prodsum).
– How to automatically pinpoint the location where a refactoring is needed? => the topic of this paper.
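
To make the example concrete, here is a hypothetical caller prodsum of the kind implied by the slides (only the name prodsum appears in the deck; its exact shape is an assumption), together with the fused version that a reuse-path-based suggestion would lead to. Fusing the two loops lets both uses of X[i] happen in the same iteration, so the reuse distance on X shrinks from roughly the whole array to a few elements:

    #include <iostream>
    #include <vector>

    // Simplified versions of the sum() and inproduct() shown on the slides.
    double sum(const std::vector<double>& X) {
        double r = 0.0;
        for (size_t i = 0; i < X.size(); ++i) r += X[i];
        return r;
    }

    double inproduct(const std::vector<double>& X, const std::vector<double>& Y) {
        double r = 0.0;
        for (size_t i = 0; i < X.size(); ++i) r += X[i] * Y[i];
        return r;
    }

    // Hypothetical caller: X is traversed twice, once per callee, so every
    // element of X is reused at a distance of roughly the whole array.
    double prodsum(const std::vector<double>& X, const std::vector<double>& Y) {
        return inproduct(X, Y) + sum(X);
    }

    // Fused version of the kind the analysis suggests: both uses of X[i]
    // happen in the same iteration, so the reuse distance on X becomes tiny.
    double prodsum_fused(const std::vector<double>& X, const std::vector<double>& Y) {
        double r = 0.0;
        for (size_t i = 0; i < X.size(); ++i) r += X[i] * Y[i] + X[i];
        return r;
    }

    int main() {
        std::vector<double> X(1000, 1.0), Y(1000, 2.0);
        std::cout << prodsum(X, Y) << " == " << prodsum_fused(X, Y) << '\n';   // 3000 == 3000
    }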

Slide 13: Overview
– Cache behaviour optimization by reuse analysis: motivation & concepts
– Efficient profiling of reuse paths
– Experimental results
– Conclusion

Slide 14: Reuse paths are complex for many long-distance reuses
– Long-distance reuses need to be shortened.
– All code executed between a use and its long-distance reuse is responsible for the long distance; this is the reuse path.
– Reuse paths often span a large section of the source code, hence they are difficult to analyze manually.
– How can complex reuse paths be refactored so that the distance is drastically reduced?
– The complexity of reuse paths is reduced by giving them a hierarchical structure in the function call and loop hierarchy.

Slide 15: Function Call and Loop Hierarchy
The Function Call and Loop Hierarchy is a tree containing three kinds of internal nodes:
– Function call nodes
– Loop nodes
– Iteration nodes
The tree reflects the function calls, loops and iterations as executed at run-time.

Slide 16: Function Call and Loop Hierarchy (continued)
For each memory reuse (indicated by an arrow in the tree figure), the least common ancestor (LCA) of the use and the reuse is the level at which an overview over both accesses is possible.
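
A minimal sketch of such a run-time hierarchy and of locating the LCA of the nodes enclosing a use and its reuse (the node layout and field names are assumptions for illustration, not the paper's data structures):

    #include <unordered_set>

    enum class Kind { FunctionCall, Loop, Iteration };

    struct Node {
        Kind  kind;
        Node* parent;   // nullptr for the root of the run-time hierarchy
    };

    // LCA of the node enclosing the use and the node enclosing the reuse:
    // collect the ancestors of one node, then walk up from the other until
    // an already-seen ancestor is hit.
    Node* lca(Node* use, Node* reuse) {
        std::unordered_set<Node*> ancestors;
        for (Node* n = use; n != nullptr; n = n->parent) ancestors.insert(n);
        for (Node* n = reuse; n != nullptr; n = n->parent)
            if (ancestors.count(n)) return n;
        return nullptr;   // nodes from different trees; cannot happen within one trace
    }

    int main() {
        Node root{Kind::FunctionCall, nullptr};
        Node loop{Kind::Loop, &root};
        Node it1{Kind::Iteration, &loop}, it2{Kind::Iteration, &loop};
        // A reuse spanning two iterations of the same loop has that loop as LCA.
        return lca(&it1, &it2) == &loop ? 0 : 1;
    }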

Slide 17: Function Call and Loop Hierarchy (continued)
To shorten the reuse distance, the nodes just below the LCA must be fused or interleaved at run-time.

Slide 18: Function Call and Loop Hierarchy (continued)
Mapping back to source code: for each transformation indicated in the source code, its importance can be read from the reuse distance histogram shown alongside it.

Slide 19: Function Call and Loop Hierarchy – which transformation to apply?
– Function call nodes: fuse functions.
– Loop nodes: fuse loops.
– Iteration nodes: fuse iterations, i.e. loop interchange or loop tiling (see the sketch below).
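
To illustrate the kinds of transformations listed above with generic textbook examples (not taken from the paper): loop fusion merges two traversals so a value is reused within one iteration, and loop tiling restructures iterations so reuses across iterations stay within a cache-sized block:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Loop fusion (for loop or function-call nodes below the LCA): two
    // traversals of a[] become one, so a[i] is reused within one iteration.
    void fused(const std::vector<double>& a, std::vector<double>& b, std::vector<double>& c) {
        for (size_t i = 0; i < a.size(); ++i) {
            b[i] = 2.0 * a[i];
            c[i] = a[i] + 1.0;
        }
    }

    // Loop tiling (for iteration nodes below the LCA): iterate over small
    // blocks so the reused part of `in` stays cached between iterations.
    void transpose_tiled(const std::vector<double>& in, std::vector<double>& out,
                         size_t n, size_t tile = 32) {
        for (size_t ii = 0; ii < n; ii += tile)
            for (size_t jj = 0; jj < n; jj += tile)
                for (size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (size_t j = jj; j < std::min(jj + tile, n); ++j)
                        out[j * n + i] = in[i * n + j];
    }

    int main() {
        size_t n = 64;
        std::vector<double> a(n, 1.0), b(n), c(n), in(n * n, 1.0), out(n * n);
        fused(a, b, c);
        transpose_tiled(in, out, n);
        return 0;
    }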

Slide 20: Overview
– Cache behaviour optimization by reuse analysis: motivation & concepts
– Efficient profiling of reuse paths
– Experimental results
– Conclusion

Slide 21: Problems with naively measuring the loop and function hierarchy
– A program run typically contains billions of memory accesses and loop iterations, so the loop and function hierarchy contains billions of nodes.
– There are typically also billions of reuses.

Slide 22: Solutions
1. Only track "open reuse pairs".
2. Use reservoir sampling to track only a limited number of reuses, while guaranteeing accuracy (see the sketch below).
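
A minimal sketch of reservoir sampling as it could be applied to a stream of reuse pairs (the ReusePair record and the reservoir capacity are assumptions for illustration; the paper's bookkeeping is more involved):

    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    struct ReusePair {       // hypothetical record for one sampled reuse
        uint64_t address;
        long     distance;   // reuse distance measured for this pair
    };

    // Classic reservoir sampling (Algorithm R): keep a uniform random sample of
    // at most `capacity` pairs out of an arbitrarily long stream, using memory
    // proportional to the capacity only.
    class Reservoir {
    public:
        explicit Reservoir(size_t capacity) : capacity_(capacity), seen_(0), rng_(42) {}

        void offer(const ReusePair& p) {
            ++seen_;
            if (sample_.size() < capacity_) {
                sample_.push_back(p);
            } else {
                std::uniform_int_distribution<uint64_t> pick(0, seen_ - 1);
                uint64_t slot = pick(rng_);
                if (slot < capacity_) sample_[slot] = p;   // keep with probability capacity/seen
            }
        }

        const std::vector<ReusePair>& sample() const { return sample_; }

    private:
        size_t capacity_;
        uint64_t seen_;
        std::mt19937_64 rng_;
        std::vector<ReusePair> sample_;
    };

    int main() {
        Reservoir r(1000);
        for (uint64_t i = 0; i < 1000000; ++i) r.offer({i, static_cast<long>(i % 64)});
        return r.sample().size() == 1000 ? 0 : 1;
    }

Memory use is bounded by the reservoir capacity no matter how many reuses the program performs, which is consistent with the nearly constant memory overhead reported later in the talk.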

Slide 23: Tracking open reuse pairs
– Sample the open reuses (see the sketch below).
– Details are in the paper.
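
A sketch of what tracking open reuse pairs could look like, assuming an "open" pair is a use whose matching reuse has not yet been observed (the table layout and the closing policy are assumptions; the paper's actual data structure is more elaborate):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    struct OpenReuse {      // hypothetical bookkeeping for a pending reuse
        uint64_t use_time;  // logical time of the last use of this address
    };

    class OpenReuseTable {
    public:
        // Called for every (sampled) memory access. If the address already has an
        // open pair, that pair closes: the use->reuse pair is complete and could
        // be attributed to its LCA (omitted here). A new open pair is then started.
        void access(uint64_t address, uint64_t now) {
            auto it = open_.find(address);
            if (it != open_.end()) open_.erase(it);   // close the completed pair
            open_[address] = OpenReuse{now};
        }

        size_t open_count() const { return open_.size(); }

    private:
        std::unordered_map<uint64_t, OpenReuse> open_;
    };

    int main() {
        OpenReuseTable t;
        t.access(0xA, 0); t.access(0xB, 1); t.access(0xA, 2);   // A's first pair closes at time 2
        return t.open_count() == 2 ? 0 : 1;   // A (reopened) and B are still open
    }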

Slide 24: Overview
– Cache behaviour optimization by reuse analysis: motivation & concepts
– Efficient profiling of reuse paths
– Experimental results
– Conclusion

Slide 25: Experimental evaluation
– Profiling is implemented by extending GCC: GCC-SLO. The results are visualized interactively by SLO.
– Part 1: overhead of profiling SPEC2000: memory overhead and execution time overhead.
– Part 2: attainable program speedups: 5 programs from SPEC2000 were optimized using SLO, and the speedup was evaluated on 5 different platforms.

Slide 26: Overhead reduction of reuse path profiling by reservoir sampling
– From a 1000-fold slowdown to a 5-fold slowdown.
– All SPEC2000 programs can be processed with less than 250 MiB of extra memory.

Slide 27: Locality profiles: examples from SPEC2000: 173.APPLU
– A few refactorings suffice to optimize most long-distance reuses.
– Blue refactoring: fuse function jacld with blts.
– At the algorithmic level: combine jacld (form the lower triangular part of the Jacobian matrix) with blts (solve the lower triangular part).

Slide 28: Locality profiles: other examples from SPEC2000 (figures for Equake, VPR, Art, Crafty, GCC, Galgel)

Slide 29: Using SLO to optimize 5 SPEC2000 programs: Itanium

Slide 30: Using SLO to optimize 5 SPEC2000 programs: 5 platforms

Slide 31: Overview
– Cache behaviour optimization by reuse analysis: concepts
– Efficient measurement of reuse pairs and the hierarchy of functions and loops
– Experimental results
– Conclusion

Slide 32: Conclusions
– To improve temporal locality, a refactoring is often needed in a function other than the one that generates the cache misses.
– Reuse path analysis in the Function Call and Loop Hierarchy discovers the required refactorings.
– Sampling reduces the time overhead from 1000x to 5x, with an almost constant memory overhead.
– Typically fewer than 10 refactorings are required.
– The indicated refactorings are performance-portable over a wide range of architectures.
– Implemented in SLO – Suggestions for Locality Optimizations.

Slide 33: Backup slides

Slide 34: SLO: example (continued)

Slide 35: SLO: example (continued)

Slide 36: SLO: example (continued) – resulting speedup on Pentium 4: 5.72

Slide 37: SLO: example: VPR