Targeted Path Profiling : Lower Overhead Path Profiling for Staged Dynamic Optimization Systems Rahul Joshi, UIUC Michael Bond*, UT Austin Craig Zilles,

Presentation transcript:

Targeted Path Profiling: Lower Overhead Path Profiling for Staged Dynamic Optimization Systems. Rahul Joshi (UIUC), Michael Bond* (UT Austin), Craig Zilles (UIUC)

2 Path information is useful Enlarges scope of optimizations – Superblock formation – Hyperblock formation Improves other optimizations – Code scheduling and register allocation – Dataflow analysis – Software pipelining – Code layout – Static branch prediction

3 Overhead vs. accuracy Edge profiling (SPEC 95 INT)

4 Overhead vs. accuracy Edge profiling (SPEC 95 INT) Ball-Larus path profiling (SPEC 2000 INT)

5 Overhead vs. accuracy Edge profiling (SPEC 95 INT) Ball-Larus path profiling (SPEC 2000 INT) Targeted path profiling (SPEC 2000 INT)

6 Overhead vs. accuracy Edge profiling (SPEC 95 INT) Ball-Larus path profiling (SPEC 2000 INT) Targeted path profiling (SPEC 2000 INT) Profile-guided profiling

7 Outline Background – Staged dynamic optimization and profile-guided profiling – Ball-Larus path profiling – Opportunities for reducing overhead Targeted path profiling Results – Overhead and accuracy

8 Staged dynamic optimization Static optimizations Stage 0

9 Staged dynamic optimization Static optimizations Edge profile Stage 0 Hardware edge profiler

10 Staged dynamic optimization Static optimizations Edge profile Stage 0 Local Optimizations (code layout) Stage 1 Hardware edge profiler

11 Staged dynamic optimization Static optimizations Edge profile Stage 0 Local Optimizations (code layout) Path profiling instrumentation Stage 1 Hardware edge profiler

12 Staged dynamic optimization Static optimizations Edge profile Stage 0 Local Optimizations (code layout) Path profiling instrumentation Stage 1 Path profile Hardware edge profiler

13 Staged dynamic optimization Static optimizations Edge profile Stage 0 Local Optimizations (code layout) Path profiling instrumentation Global Optimizations (superblock formation) Stage 2 Stage 1 Path profile Hardware edge profiler

14 Profile-guided profiling Static optimizations Stage 0 Local Optimizations (code layout) Path profiling instrumentation Global Optimizations (superblock formation) Stage 2 Stage 1 Path profile Hardware edge profiler Edge profile

15 Ball-Larus path profiling Acyclic, intraprocedural paths Handles cyclic CFGs – Paths end at loop back edges Each path computes unique integer

16 Ball-Larus path profiling 4 paths [Figure: CFG with nodes A, B, C, D, E, F, G]

17 Ball-Larus path profiling 4 paths Each path computes unique integer [Figure: CFG with nodes A, B, C, D, E, F, G]

18 Ball-Larus path profiling 4 paths Each path computes unique integer Path 0 [Figure: CFG with nodes A, B, C, D, E, F, G]

19 Ball-Larus path profiling 4 paths Each path computes unique integer Path 0 Path 1 [Figure: CFG with nodes A, B, C, D, E, F, G]

20 Ball-Larus path profiling 4 paths Each path computes unique integer Path 0 Path 1 Path 2 [Figure: CFG with nodes A, B, C, D, E, F, G]

21 Ball-Larus path profiling 4 paths Each path computes unique integer Path 0 Path 1 Path 2 Path 3 [Figure: CFG with nodes A, B, C, D, E, F, G]

22 Ball-Larus path profiling Instrumentation: r=0; r=r+2; r=r+1; count[r]++ (r: path register; count: array of path frequencies) [Figure: CFG with nodes A, B, C, D, E, F, G]
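The instrumentation on slide 22 can be sketched in Python. This is a minimal sketch, not the authors' tool: it assumes the figure's CFG is two back-to-back diamonds (A to B or C, joining at D, then E or F, joining at G), which is a guess based only on the node labels. The function names are hypothetical. The edge values follow the usual Ball-Larus assignment, in which each outgoing edge gets the number of paths already counted through its earlier siblings.

```python
def ball_larus_increments(succs, entry, exit_node):
    """Return (edge increments, number of paths) for an acyclic CFG.

    Summing the increments along any entry-to-exit path gives a unique
    integer in [0, number of paths).
    """
    num_paths = {}
    inc = {}

    def visit(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in succs[v]:
            inc[(v, w)] = total      # paths already numbered via earlier successors
            total += visit(w)
        num_paths[v] = total
        return total

    visit(entry)
    return inc, num_paths[entry]


def path_id(path, inc):
    """Simulate the path register r along one executed path."""
    r = 0                            # r=0 at path entry
    for edge in zip(path, path[1:]):
        r += inc[edge]               # r=r+k on instrumented edges (k may be 0)
    return r


# Assumed CFG shape: two diamonds in sequence.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

INC, N = ball_larus_increments(CFG, "A", "G")
count = [0] * N                      # array of path frequencies
for p in (["A", "B", "D", "E", "G"],
          ["A", "C", "D", "F", "G"],
          ["A", "C", "D", "F", "G"]):
    count[path_id(p, INC)] += 1      # count[r]++ at path exit
```

With this successor order, only A→C (+2) and D→F (+1) receive non-zero increments, matching the two updates shown on the slide. A different successor order would place the increments on different edges but still number the four paths 0 to 3.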

23 Overhead in Ball-Larus path profiling
              SPEC 95   SPEC 2000
gcc           96%       87%
INT Avg       41%       43%
FP Avg        12%       22%
Overall Avg   28%       37%

24 Overhead in Ball-Larus path profiling
              SPEC 95   SPEC 2000
gcc           96%       87%
INT Avg       41%       43%
FP Avg        12%       22%
Overall Avg   28%       37%
Opportunities for reducing overhead? – When there are many paths – When edge profile gives perfect path profile

25 Routines with many paths Many possible paths – Exponential in number of edges – Can’t use array of counters Number of taken paths small – Ball-Larus uses hash table – Hash function call expensive Hashed path ~5 times overhead

26 Edge profile gives perfect path profile

27 Edge profile gives perfect path profile

28 Edge profile gives perfect path profile An obvious path contains an edge that is only on that path – Path uniquely identified by edge – Path freq = edge freq If all paths obvious, edge profile gives perfect path profile
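The definition of an obvious path on slide 28 can be checked mechanically. The sketch below is illustrative only: it enumerates all paths of a small acyclic CFG (fine for toy graphs, exponential in general) and reports, for each path, an edge that lies on no other path. The function names and both example CFGs are assumptions made for the illustration.

```python
from collections import Counter


def enumerate_paths(succs, node, exit_node):
    """List every path from node to exit_node in an acyclic CFG."""
    if node == exit_node:
        return [[node]]
    return [[node] + rest
            for w in succs[node]
            for rest in enumerate_paths(succs, w, exit_node)]


def identifying_edges(succs, entry, exit_node):
    """Map each path to an edge found only on that path, or None."""
    paths = enumerate_paths(succs, entry, exit_node)
    edge_use = Counter()
    for p in paths:
        edge_use.update(zip(p, p[1:]))   # each edge appears once per acyclic path
    result = {}
    for p in paths:
        unique = [e for e in zip(p, p[1:]) if edge_use[e] == 1]
        result[tuple(p)] = unique[0] if unique else None
    return result


def all_paths_obvious(succs, entry, exit_node):
    return all(e is not None
               for e in identifying_edges(succs, entry, exit_node).values())


# One diamond: each of the 2 paths owns its edges, so both are obvious.
ONE_DIAMOND = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Two diamonds in sequence: every edge is shared by 2 of the 4 paths.
TWO_DIAMONDS = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
                "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
```

For the single diamond, the frequency of path A, B, D equals the count of edge A→B, so the edge profile alone gives the path profile. For two diamonds in sequence no path has an identifying edge, so edge counts cannot separate the four paths.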

29 Outline Background – Staged dynamic optimization and profile-guided profiling – Ball-Larus path profiling – Opportunities for reducing overhead Targeted path profiling Results – Overhead and accuracy

30 Targeted path profiling Profile-guided profiling – Use existing edge profile Exploits opportunities for reducing overhead – When there are many paths Remove cold edges – When edge profile gives perfect path profile Don’t instrument obvious routines and loops

31 Removing cold edges Examine relative execution frequency of each branch: if (relFreq < threshold) edge is cold

32 Removing cold edges Examine relative execution frequency of each branch: if (relFreq < threshold) edge is cold

33 Removing cold edges Examine relative execution frequency of each branch: if (relFreq < threshold) edge is cold
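The cold-edge test on these slides could look roughly like the following. The slides do not say what the frequency is relative to or what threshold is used, so this sketch assumes that relFreq is an edge's count divided by the total count of all edges leaving the same block, and it uses an arbitrary threshold of 0.05. Both choices, and the function name, are assumptions.

```python
def cold_edges(edge_counts, threshold=0.05):
    """Return the set of edges whose relative frequency is below threshold.

    edge_counts maps (src, dst) to an execution count from the edge profile.
    Relative frequency is taken per source block (an assumption).
    """
    out_total = {}
    for (src, _dst), n in edge_counts.items():
        out_total[src] = out_total.get(src, 0) + n

    cold = set()
    for (src, dst), n in edge_counts.items():
        total = out_total[src]
        # Assumption: edges leaving a never-executed block count as cold.
        if total == 0 or n / total < threshold:
            cold.add((src, dst))
    return cold


# Hypothetical edge profile for a single diamond.
profile = {("A", "B"): 990, ("A", "C"): 10,
           ("B", "D"): 990, ("C", "D"): 10}
cold = cold_edges(profile)
```

Here only A→C is cold: it carries 1% of the executions leaving A. Edge C→D is rarely executed in absolute terms, but it is the only way out of C, so its relative frequency is 1.0.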

34 Removing cold edges A path that contains a cold edge is a cold path Removing an edge may halve number of paths

35 Removing cold edges A path that contains a cold edge is a cold path Removing an edge may halve number of paths Number of paths: 16 → 4

36 Removing cold edges A path that contains a cold edge is a cold path Removing an edge may halve number of paths Number of paths: 16 → 4 Goal: hashed → non-hashed
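The 16 → 4 reduction can be reproduced with a small path counter. The CFG below is a hypothetical chain of four diamonds, chosen only because it has 16 acyclic paths; the slide's actual figure is not recoverable from the transcript. The function names are likewise assumptions.

```python
from functools import lru_cache


def diamond_chain(k):
    """Build a chain of k if-then-else diamonds: J0 -> {T0, F0} -> J1 -> ..."""
    succs = {}
    for i in range(k):
        succs[f"J{i}"] = [f"T{i}", f"F{i}"]
        succs[f"T{i}"] = [f"J{i + 1}"]
        succs[f"F{i}"] = [f"J{i + 1}"]
    succs[f"J{k}"] = []
    return succs


def count_paths(succs, entry, exit_node, removed=frozenset()):
    """Count entry-to-exit paths, ignoring edges listed in removed."""
    @lru_cache(maxsize=None)
    def n(v):
        if v == exit_node:
            return 1
        return sum(n(w) for w in succs[v] if (v, w) not in removed)
    return n(entry)


CFG = diamond_chain(4)
before = count_paths(CFG, "J0", "J4")
after_one = count_paths(CFG, "J0", "J4", frozenset({("J0", "F0")}))
after_two = count_paths(CFG, "J0", "J4",
                        frozenset({("J0", "F0"), ("J1", "F1")}))
```

Each cold edge removed from a distinct diamond halves the count: 16, then 8, then 4. Once the number of remaining paths is small enough, the counters fit in a plain array, which is the "hashed → non-hashed" goal stated on slide 36.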

37 Removing cold edges Remaining paths potentially hot 4 paths → [0, 3] [Figure: CFG with edge increments 2 and 1]

38 Removing cold edges r=r+2 r=0 r=r+1 count[r]++ Remaining paths potentially hot 4 paths → [0, 3]

39 Removing cold edges What if cold edge taken? r=r+2 r=0 r=r+1 count[r]++

40 Removing cold edges What if cold edge taken? Cold edges poison path r=r+2 r=0 r=poison r=r+1 count[r]++

41 Removing cold edges What if cold edge taken? Cold edges poison path Instrumentation checks for poisoned path r=r+2 r=0 r=poison r=r+1 if (r poisoned) cold_counter++ else count[r]++

42 Checking for poison if (r poisoned) cold_counter++ else count[r]++
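The poisoning scheme from slides 39 to 42 can be simulated as follows. The slides do not state how the poison value is encoded, so this sketch assumes a large negative constant that later increments cannot bring back to zero or above, which lets the check be a single sign test. The CFG, the extra node X reached through a cold edge, and all names are hypothetical.

```python
POISON = -(1 << 30)   # assumed encoding: far below any sum of increments


def record_path(path, inc, cold, count, stats):
    """Simulate the instrumentation for one executed path."""
    r = 0                                 # r=0 at path entry
    for edge in zip(path, path[1:]):
        if edge in cold:
            r = POISON                    # r=poison on a cold edge
        else:
            r += inc.get(edge, 0)         # r=r+k on instrumented hot edges
    if r < 0:                             # if (r poisoned)
        stats["cold_counter"] += 1
    else:
        count[r] += 1                     # count[r]++


# Increments for the 4 remaining hot paths, numbered 0 to 3.
inc = {("A", "C"): 2, ("D", "F"): 1}
cold = {("B", "X")}                       # hypothetical cold edge B -> X
count = [0] * 4
stats = {"cold_counter": 0}

record_path(["A", "C", "D", "F", "G"], inc, cold, count, stats)       # hot path 3
record_path(["A", "B", "X", "D", "F", "G"], inc, cold, count, stats)  # takes cold edge
```

After the second call the register has passed through D→F and gained +1, yet it is still negative, so the path is tallied in cold_counter rather than corrupting an entry of the count array.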

43 Obvious routines All paths obvious We don’t instrument obvious routines Edge profile gives perfect path profile

44 Obvious loops Loop with obvious body Don’t instrument obvious loops with high average trip counts Edge profile yields high-accuracy path profile

45 Obvious loops Loop with obvious body Don’t instrument obvious loops with high average trip counts Edge profile yields high-accuracy path profile

46 Summary of our techniques Remove cold edges – Eliminates many cold paths – Count paths with array (instead of hash table) Don’t instrument obvious routines and loops – Edge profile derives path profile

47 Outline Background – Staged dynamic optimization and profile-guided profiling – Ball-Larus path profiling – Opportunities for reducing overhead Targeted path profiling Results – Overhead and accuracy

48 Implementation Static profiling PP: tool for path profiling TPP: tool for targeted path profiling Tools instrument native SPARC executables – SPEC 95 ref – SPEC 2000 ref

49 Results: SPEC 2000 INT

50 Where does benefit come from? Cold path elimination alone: 60% Add obvious path elimination: + 40% Little benefit from obvious path elimination alone

51 Related work Dynamo [Bala et al. ’00] – Successful online path-guided optimization – “Bails out” when no dominant path Instrumentation sampling [Arnold & Ryder ’01] – Orthogonal to targeted path profiling Selective path profiling [Apiwattanapong & Harrold ’02] – Useful when only a few paths of interest

52 Summary Profile-guided profiling in a staged dynamic optimization system Two synergistic techniques – Remove cold paths – Don’t instrument obvious routines and loops Reduces overhead by half (SPEC 95) to two-thirds (SPEC 2000) High accuracy: ~99%