Phase Detection Jonathan Winter Casey Smith CS 612 04/05/05.



Phase Detection 2 Jonathan Winter and Casey Smith Motivation Large-scale phases exist (order of millions of instructions) –For many programs, if we look at any interesting metric (cache misses, IPC, etc.), we see repeating behavior –Call the regions with similar behavior “phases” Knowledge of phase-based behavior can be used for adaptive optimization –Current hardware doesn’t exploit phase behaviors For instance –A region of execution may only need a small cache—save power/increase performance by shrinking –A region of execution may benefit from data structure reorganization

Phase Detection 3 Jonathan Winter and Casey Smith Basic Methodology 1)Identify phase boundaries 2)Classify phases 3)Determine what optimizations to perform for each phase When can each step be performed? Run time, compile time, offline

Phase Detection 4 Jonathan Winter and Casey Smith Overview We’ll focus on two papers on phase detection –Sherwood, Sair, and Calder, “Phase Tracking and Prediction,” ISCA 2003 –Shen, Zhong, and Ding, “Locality Phase Prediction,” ASPLOS 2004

Phase Detection 5 Jonathan Winter and Casey Smith Sherwood et al Classifies the behavior of a program into phases based on code execution Finds strong correlations between code execution phases and important performance and energy metrics Simulates hardware for real-time detection and prediction of phases Demonstrates usefulness through a variety of optimization techniques made possible by phase detection

Phase Detection 6 Jonathan Winter and Casey Smith Definition of a Phase Previously (stemming from Denning 1972), a phase was defined as an interval of execution where a measured program metric stayed relatively constant. Sherwood et al. consider all sections of code with similar values for the program metric to be part of the same phase, even if the intervals are spread out over the course of the program’s execution.

Phase Detection 7 Jonathan Winter and Casey Smith Key Program Metrics Instructions per cycle (IPC), energy, branch prediction accuracy, data cache misses, instruction cache misses, L2 cache misses are all vital statistics for optimizing speed and power consumption

Phase Detection 8 Jonathan Winter and Casey Smith Single Unified Metric Goal: find a single metric that –Uniquely distinguishes phases –Guides optimization and policy decisions Need some section of code on which to measure this metric—pick 10M instructions –Much longer time span than typical architectural techniques handle –Long enough to capture large-scale behavior –Short enough to capture detailed phase behavior –Size of an OS timeslice

Phase Detection 9 Jonathan Winter and Casey Smith Metric for Classification Based on Basic Blocks –A basic block is a section of code with one entry point and one exit point Basic Block Vector –Count the number of times each basic block is executed in the 10M-instruction interval –Each entry in the vector is the product of the number of times a basic block is executed and that block’s length (BB_1 × L_1, BB_2 × L_2, BB_3 × L_3, …) –This vector is a signature of the phase which correlates well with other metrics of interest: IPC, cache misses, etc.
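The weighted count described on this slide can be sketched in a few lines of Python. The block identifiers, block lengths, and the tiny trace are made-up illustration values, not data from the paper, and `basic_block_vector` is a hypothetical helper name.

```python
from collections import Counter


def basic_block_vector(executed_blocks, block_lengths):
    """Return {block_id: executions * block_length} for one interval."""
    counts = Counter(executed_blocks)
    return {block: n * block_lengths[block] for block, n in counts.items()}


# Hypothetical interval: block ids in execution order (a real interval
# would cover about 10M instructions), plus instructions per block.
trace = ["B1", "B2", "B2", "B3", "B2"]
lengths = {"B1": 4, "B2": 10, "B3": 2}

bbv = basic_block_vector(trace, lengths)
print(bbv)
```

Weighting by block length means each entry counts executed instructions rather than block entries, which is why frequently executed, long blocks dominate the signature.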

Phase Detection 10 Jonathan Winter and Casey Smith Advantages of BBVs Independent of architectural measures and thus unaffected by optimizations Weighting biases the signatures to more frequently executed instructions Creates unique signatures for intervals which execute the same code but in different proportions

Phase Detection 11 Jonathan Winter and Casey Smith Hardware Implementation Don’t want to store and examine the whole vector: compress to a 32-entry vector (footprint)
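One way to picture the compression is to hash every executed block into one of 32 counters and add the block's length to that counter. The sketch below assumes this scheme; the hash function (address shifted right by two, modulo 32) and the addresses are hypothetical choices for illustration, not the hardware described in the paper.

```python
NUM_BUCKETS = 32


def footprint(executed_blocks, block_lengths, num_buckets=NUM_BUCKETS):
    """Accumulate executed instructions into a fixed number of buckets."""
    buckets = [0] * num_buckets
    for addr in executed_blocks:
        index = (addr >> 2) % num_buckets  # hypothetical hash of the block address
        buckets[index] += block_lengths[addr]
    return buckets


# Hypothetical block start addresses and lengths (in instructions).
lengths = {0x400000: 4, 0x400010: 10, 0x400080: 2}
trace = [0x400000, 0x400010, 0x400010, 0x400080]

fp = footprint(trace, lengths)
print(fp)
```

In this toy trace, 0x400000 and 0x400080 land in the same bucket, which shows the price of compression: distinct blocks can alias, so the footprint only approximates the full basic block vector.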

Phase Detection 12 Jonathan Winter and Casey Smith Visualization of the Footprints Footprints for different intervals of gzip

Phase Detection 13 Jonathan Winter and Casey Smith What do we do with our footprint? Store a small sample of representative footprints as phase signatures Compare the current footprint to previously stored footprints If we have a close enough match, we classify them as the same phase If not, we store the new footprint as the representative member of a new phase

Phase Detection 14 Jonathan Winter and Casey Smith Comparing Footprints To save space, only store the top 6 bits of each entry in the 32-vector –Counters were saturating 24-bit counters –The smallest value that the maximum entry could have would occur if all 10M instructions were distributed evenly across the 32 entries –In this case, keeping the top six bits means that a counter value of 10M/32 is stored as 1 Distance between footprints is defined as the Manhattan distance: the sum of the absolute differences between corresponding entries in two vectors
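The arithmetic on this slide can be checked directly. The sketch below keeps the top 6 bits of a saturating 24-bit counter by shifting right by 18 bits and computes the Manhattan distance between two vectors; the function names are illustrative only.

```python
COUNTER_BITS = 24
KEPT_BITS = 6


def compress(footprint):
    """Saturate each counter at 24 bits, then keep only its top 6 bits."""
    cap = (1 << COUNTER_BITS) - 1
    shift = COUNTER_BITS - KEPT_BITS  # 18
    return [min(count, cap) >> shift for count in footprint]


def manhattan(a, b):
    """Sum of absolute differences between corresponding entries."""
    return sum(abs(x - y) for x, y in zip(a, b))


# 10M instructions spread evenly over 32 entries: 312,500 per entry.
even = [10_000_000 // 32] * 32
compressed_even = compress(even)
print(compressed_even[0])              # 312500 >> 18 == 1, since 2**18 == 262144
print(manhattan([3, 0, 5], [1, 2, 5]))  # 2 + 2 + 0 == 4
```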

Phase Detection 15 Jonathan Winter and Casey Smith Finding a Match If the Manhattan distance is less than a threshold, two footprints are classified as being in the same phase Determine threshold by false positives/false negatives as compared to an offline oracle tool. Threshold of 2^20 chosen
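The store-compare-classify loop from the last few slides can be sketched as follows. Picking the closest stored signature when several are under the threshold is an assumption of this sketch, and the three-entry footprints and threshold of 4 are toy values chosen so the result is easy to follow.

```python
def classify(footprint, signatures, threshold):
    """Return the phase id for a footprint, adding a new phase if nothing matches."""
    best_id, best_dist = None, None
    for phase_id, signature in enumerate(signatures):
        dist = sum(abs(x - y) for x, y in zip(footprint, signature))
        if best_dist is None or dist < best_dist:
            best_id, best_dist = phase_id, dist
    if best_dist is not None and best_dist < threshold:
        return best_id
    # No close match: this footprint becomes the representative of a new phase.
    signatures.append(list(footprint))
    return len(signatures) - 1


signatures = []
intervals = [[8, 0, 0], [7, 1, 0], [0, 0, 8], [8, 0, 1]]  # toy footprints
phases = [classify(fp, signatures, threshold=4) for fp in intervals]
print(phases)
```

The fourth interval is assigned to the first phase again even though another phase ran in between, which is the sense in which a phase need not be contiguous.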

Phase Detection 16 Jonathan Winter and Casey Smith Opportunity These classification methods are oversimplified Opportunity to apply better machine learning techniques

Phase Detection 17 Jonathan Winter and Casey Smith Within Phase Homogeneity Within a phase, architectural metrics have nearly constant values (this is what we were aiming for)

Phase Detection 18 Jonathan Winter and Casey Smith Phase Prediction Once we’ve been through an interval, we can identify the phase easily But we want to know what phase we’re going to go to next We need to know what phase we will be in before the interval starts in order to perform useful optimizations (such as changing the cache size)

Phase Detection 19 Jonathan Winter and Casey Smith Simple Prediction We could just predict that the next phase would be the same as the current phase The program tends to change phases more slowly than our 10M intervals, so this actually gives reasonable accuracy However, we can do better Note: standard hardware predictors have not been tried (branch prediction, memory disambiguation, etc.)

Phase Detection 20 Jonathan Winter and Casey Smith Markov Model Predictor Phase changes depend on the set of previous phases and the duration of their execution Phases tend to last many intervals, therefore studying recent previous history doesn’t provide more information than the current state Need to encode how long we’ve been in the current state Predict the length of phase to be the same length it was previously
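A minimal sketch of the run-length idea: the predictor remembers, for each (phase, run length) pair, which phase followed the last time a run of that phase ended at that length, and otherwise predicts that the current phase continues. The class name and the dictionary-based table are assumptions of this sketch; a hardware table would be finite and indexed by a hash.

```python
class RunLengthPredictor:
    def __init__(self):
        self.table = {}    # (phase, run_length) -> phase that followed
        self.phase = None  # phase of the most recent interval
        self.run = 0       # number of consecutive intervals in that phase

    def predict(self):
        """Predict the next interval's phase; default to 'same as now'."""
        return self.table.get((self.phase, self.run), self.phase)

    def update(self, actual):
        """Record the phase that was actually observed."""
        if actual == self.phase:
            self.run += 1
            return
        if self.phase is not None:
            self.table[(self.phase, self.run)] = actual
        self.phase, self.run = actual, 1


predictor = RunLengthPredictor()
history = list("AAABBAAABB")
predictions = []
for actual in history:
    predictions.append(predictor.predict())
    predictor.update(actual)
print(predictions)
```

On the second pass through the A run, the predictor switches to B after exactly three intervals, because that is how long the A run lasted the first time.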

Phase Detection 21 Jonathan Winter and Casey Smith Run Length Encoding

Phase Detection 22 Jonathan Winter and Casey Smith Opportunity RLE Markov model is overly simple Better prediction techniques exist Make use of the order of previous states rather than just the length of the current state

Phase Detection 23 Jonathan Winter and Casey Smith Prediction Accuracy

Phase Detection 24 Jonathan Winter and Casey Smith Applications Frequent Value Locality –Certain data values form bulk of loads Compress to save energy Specialize code segments to common values Dynamic cache size adaptation –Shrink cache size to save energy Dynamic processor width adaptation –Fetch/Decode/Issue fewer instructions per cycle when IPC will be low anyway

Phase Detection 25 Jonathan Winter and Casey Smith Frequent Value Locality

Phase Detection 26 Jonathan Winter and Casey Smith Cache Size Adaptation

Phase Detection 27 Jonathan Winter and Casey Smith Processor Width Adaptation

Phase Detection 28 Jonathan Winter and Casey Smith Summary of BBV method Divide program into 10M instruction intervals Characterize each interval by footprint approximation to basic block vector Classify intervals as phases based on footprint Predict future phases based on RLE Markov predictor Use information about phases to improve frequent value locality and optimize cache size and processor width for performance/energy

Phase Detection 29 Jonathan Winter and Casey Smith Bottom Line Classifying phases based on the frequency of executed basic blocks is effective at partitioning the program into regions of homogeneous architectural behavior Significant energy savings with small performance degradation can be achieved by applying phase-specific optimizations.

Phase Detection 30 Jonathan Winter and Casey Smith Shen et al Defines phases in a totally different way Phases have variable lengths (not 10M intervals) Detects phases by finding likely phase boundaries Uses offline analysis of programs on test inputs to predict behavior on other inputs

Phase Detection 31 Jonathan Winter and Casey Smith Metric of Interest For optimizing cache size, what we really care about is the locality of reference Measure the locality directly, and classify phases based on that Independent of optimizations performed: the phases recovered are independent of the hardware the program runs on.

Phase Detection 32 Jonathan Winter and Casey Smith Reuse Distance Define the reuse distance as the number of distinct data elements (locations in memory) touched between two consecutive references to the same element. Define the reuse distance at the second reference Example: abcbbac Also called LRU Stack Distance
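The definition can be applied to the slide's example trace directly. The sketch below is a straightforward quadratic-time version that favors clarity over speed; first references have no reuse distance and are reported as `None`.

```python
def reuse_distances(trace):
    """Reuse distance at each reference; None for a first reference."""
    last_seen = {}
    distances = []
    for i, element in enumerate(trace):
        if element in last_seen:
            # Distinct elements touched strictly between the two references.
            between = trace[last_seen[element] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[element] = i
    return distances


distances = reuse_distances("abcbbac")
print(distances)
```

For `abcbbac`, the second `b` has distance 1 (only `c` lies in between), the third `b` has distance 0, and the final `a` and `c` each have distance 2.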

Phase Detection 33 Jonathan Winter and Casey Smith Overview Simulate a test run and record reuse distance throughout the program Use this to separate the program into “phases” Insert phase markers into binary code Predict when phase changes will occur Use information about phases to adjust cache size or other hardware parameters

Phase Detection 34 Jonathan Winter and Casey Smith New Definition of Phase Here, a phase is a unit of repeating behavior, rather than a unit of nearly uniform behavior A phase change is an abrupt change in the data reuse pattern

Phase Detection 35 Jonathan Winter and Casey Smith Reuse Trace

Phase Detection 36 Jonathan Winter and Casey Smith Why Offline Analysis? Compilers cannot fully analyze data locality in programs with indirect referencing or dynamic structures Hardware methods like the one presented earlier require many severe approximations for real-time analysis Solution: take method offline and analyze program behavior on test inputs.

Phase Detection 37 Jonathan Winter and Casey Smith Phase Detection Process 1)Record reuse trace 2)Perform signal processing techniques to extract useful information from the trace 3)Use the extracted information to find good places for phase transitions

Phase Detection 38 Jonathan Winter and Casey Smith 1) Record Reuse Trace Nontrivial programs access data locations so many times that an actual full trace would be overwhelming Just sample a representative set of memory locations/reuse distances Threshold to reduce trace size and remove irrelevant data –Throw out short distances (C[i] = C[i] + 2) –Throw out references to nearby memory locations

Phase Detection 39 Jonathan Winter and Casey Smith 2) Signal Processing Use wavelet filtering to find abrupt changes in reuse distance for each recorded memory location

Phase Detection 40 Jonathan Winter and Casey Smith 3) Phase Partitioning Now we have points representing locations of abrupt changes in reuse distance for individual memory locations Want to divide the list with two things in mind: –Maximize phase length –Minimize repetitions of memory locations within a phase (no multiple abrupt changes) Example: abcdeefabdfccabef is divided into abcde, efabdfc, cabef
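The two goals pull in opposite directions, so one way to balance them is a shortest-path style dynamic program. The sketch below assumes a cost of 1 per phase (favoring long phases) plus `alpha` per repeated element inside a phase (penalizing repetitions); this cost function and the value of `alpha` are assumptions of the sketch, not something stated on the slide, and the result need not match the slide's example split.

```python
def partition(seq, alpha=1.0):
    """Split seq into phases minimizing (#phases + alpha * #repeats within phases)."""
    n = len(seq)
    best = [0.0] + [float("inf")] * n  # best[j]: minimal cost of splitting seq[:j]
    prev = [0] * (n + 1)               # prev[j]: start index of the last phase
    for i in range(n):
        seen = set()
        repeats = 0
        for j in range(i, n):
            if seq[j] in seen:
                repeats += 1
            else:
                seen.add(seq[j])
            cost = best[i] + 1 + alpha * repeats
            if cost < best[j + 1]:
                best[j + 1] = cost
                prev[j + 1] = i
    # Walk the back-pointers to recover the phases.
    phases = []
    end = n
    while end > 0:
        start = prev[end]
        phases.append(seq[start:end])
        end = start
    return phases[::-1]


print(partition("abcabc", alpha=10))
print(partition("abcdeefabdfccabef", alpha=1.0))
```

With a large `alpha`, repetitions inside a phase become expensive and the partition approaches "cut whenever an element would repeat"; with a small `alpha`, phases grow longer and tolerate some repeats.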

Phase Detection 41 Jonathan Winter and Casey Smith Missing Link So now we have locations of phase transitions. How do we detect which regions are the same phase? Doesn’t say. Missing section in paper? Assume we can somehow classify the regions into phases

Phase Detection 42 Jonathan Winter and Casey Smith Phase Markers We know how often a phase occurs and approximately where its boundaries are Goal: find markers that tell us when we’re entering a particular phase For each phase, look for basic blocks that occur once near the beginning of each execution of the phase, and nowhere else. Use that basic block as a marker to tell when the program enters that phase

Phase Detection 43 Jonathan Winter and Casey Smith Using Phases Now we know what basic blocks signal phase entry points Run the program with new input When we enter a phase for the first time, we record how long it lasts and its locality properties Assume that these properties will hold for all subsequent executions of the same phase
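The "measure once, then assume it repeats" policy can be sketched with a small lookup table keyed by phase marker. The phase ids and lengths are invented for illustration, and only the length is tracked here; locality properties could be recorded in the same way.

```python
def predict_phase_lengths(executions):
    """For each (phase_id, measured_length) pair, return the length predicted
    before the phase ran: None the first time, the first measurement afterwards."""
    first_measurement = {}
    predictions = []
    for phase_id, measured_length in executions:
        predictions.append(first_measurement.get(phase_id))
        first_measurement.setdefault(phase_id, measured_length)
    return predictions


# Hypothetical run: markers P1 and P2 each fire twice.
runs = [("P1", 500), ("P2", 1200), ("P1", 510), ("P2", 1200)]
predicted = predict_phase_lengths(runs)
print(predicted)
```

The second execution of P1 is predicted to last 500 although it actually lasts 510, which shows the kind of error this assumption can introduce.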

Phase Detection 44 Jonathan Winter and Casey Smith Phase Prediction Performance

Phase Detection 45 Jonathan Winter and Casey Smith Negative Examples Not all programs have phases of repeating behavior that can be identified from test runs

Phase Detection 46 Jonathan Winter and Casey Smith Applications Adaptive Cache Resizing –Potential performance increase –Potential power savings Memory Remapping –Reorder data in memory to speed up execution

Phase Detection 47 Jonathan Winter and Casey Smith Adaptive Cache Resizing Shrink cache without increasing miss ratio Phases have repeating behavior, not uniform behavior Divide phases into 10K intervals First couple of times we execute a phase follow test properties Apply those cache sizes to subsequent executions of the phase
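Reuse distances give a simple way to reason about how far a cache can shrink. The sketch below assumes a fully associative LRU cache whose capacity is counted in data elements, so a reference hits exactly when its reuse distance is smaller than the capacity; real caches with limited associativity and multi-element blocks behave differently. The function names and candidate sizes are hypothetical.

```python
def miss_ratio(distances, capacity):
    """Miss ratio of a fully associative LRU cache holding `capacity` elements.
    A first reference (distance None) always misses."""
    misses = sum(1 for d in distances if d is None or d >= capacity)
    return misses / len(distances)


def smallest_cache_size(distances, sizes, max_miss_increase=0.0):
    """Smallest candidate size whose miss ratio stays within the allowed
    increase over the largest candidate size."""
    baseline = miss_ratio(distances, max(sizes))
    for size in sorted(sizes):
        if miss_ratio(distances, size) <= baseline + max_miss_increase:
            return size
    return max(sizes)


# Reuse distances of the trace "abcbbac" (None marks a first reference).
distances = [None, None, None, 1, 0, 2, 2]
print(smallest_cache_size(distances, sizes=[1, 2, 3, 4]))
print(smallest_cache_size(distances, sizes=[1, 2, 3, 4], max_miss_increase=0.3))
```

Allowing a bounded increase in miss ratio, as in the `max_miss_increase` parameter, lets the cache shrink further at the cost of some extra misses.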

Phase Detection 48 Jonathan Winter and Casey Smith Cache Size Reductions

Phase Detection 49 Jonathan Winter and Casey Smith Cache Size Reductions with 5% Miss Increase

Phase Detection 50 Jonathan Winter and Casey Smith Memory Remapping Reorder data in memory to speed up execution For example, we might interleave arrays that tend to be accessed together. Options: –Analyze whole program to find array affinities –Analyze by phase and reorganize data during execution (should take into account cost of remapping, but the authors don’t)

Phase Detection 51 Jonathan Winter and Casey Smith Memory Remapping

Phase Detection 52 Jonathan Winter and Casey Smith Summary of the locality-based method Record a sampled version of the reuse distance trace on test input Process the trace Find phase boundaries Find basic block markers for each phase Run the program on new data. When we see a new phase marker, record how long it lasts and experiment with optimization parameters for 10K intervals Assume subsequent executions of the phase will have the same length and locality profile, so we can use the determined optimization parameters

Phase Detection 53 Jonathan Winter and Casey Smith Bottom Line Many programs have long repeating patterns of data reuse separated by abrupt changes These repeating patterns can be detected by analyzing the reuse trace Characterizing these patterns can lead to significant energy savings and performance enhancement through cache resizing and memory remapping

Phase Detection 54 Jonathan Winter and Casey Smith Overall Conclusions Many programs exhibit large-scale phase behavior which can be classified and predicted Characterization of the phases can lead to energy savings and performance enhancement through cache resizing and other techniques –But no well-done analysis of just how much power is saved Some of this can be done at compile time (identifying many phase markers), but interval type analysis and phase characterization must be done at runtime

Phase Detection 55 Jonathan Winter and Casey Smith Opportunities More intelligent classification More sophisticated prediction Account for the cost of changing the cache size in the energy/performance analysis Compare results of phase-based adjustments to actual optimal adjustments Examine potential for using compilers for different parts of the analysis