PINTOS: An Execution Phase Based Optimization and Simulation Tool Wei Hsu, Jinpyo Kim,

Presentation transcript:

PINTOS: An Execution Phase Based Optimization and Simulation Tool
Wei Hsu, Jinpyo Kim, Sreekumar Kodak
Computer Science Department, University of Minnesota
October 9, 2004
PIN Tutorial at ASPLOS'04

Outline
- What is Pintos?
- What can Pintos do?
- Phase detection for optimization and simulation
- Optimization (instruction prefetching)
- Fast simulation
- Summary

What is Pintos?
- PINTOS is a PIN-based tool for optimization and simulation.
- It is a research framework that supports adaptive object code optimization:
  - Supports deep analysis of run-time program behavior for object code optimization (e.g. instruction and data prefetching)
  - Integrates hardware performance monitoring (HPM) through pfmon with dynamic instrumentation through PIN
- It also supports fast performance simulation:
  - Identifies program phases (at coarse and fine granularity)
  - Generates simulation strings that capture representative program behaviors

Pintos Framework
[Figure: Pintos framework with two paths. Optimization path components: program, pfmon profile analysis, optimization targets, PIN-based analysis (control flow, cache simulation), filtered optimization targets. Simulation path components: program, pfmon profile analysis, phase targets, PIN-based phase detection, phase info, simulation string generation, simulation strings.]

Our Background
[Figure: ADORE dynamic optimization system. Components shown: main thread, kernel / pfmon, hardware performance monitoring unit, dynamic optimization thread, code cache, phase detection, trace selection, optimization, deployment.]

ADORE Performance: Speedup of ORC2.1 +O2 Compiled SPEC2000 Benchmarks

ADORE Performance at Different Sampling Rates

Future Enhancements to ADORE
- I-cache prefetching
- Helper thread based optimizations
- Value prediction based optimizations
- Dynamically undoing aggressive optimizations (e.g. control/data speculation, indirect array prefetches)
- Software branch prediction

What can Pintos do for us?
- Pintos uses pfmon to identify high-level performance problems (e.g. I-cache misses) and to locate target code (phases) for optimization.
- Pintos then uses a PIN-based analysis tool to focus on the target code (phases) and conduct deep analysis.
- Pintos provides a framework for deep analysis of program behavior, so that we can experiment with new object code optimization techniques and feed them to ADORE.
- Pintos can generate simulation strings, which can be used for more efficient micro-architecture simulations.

Phase based Optimization and Simulation
- In Pintos, a phase is a sequence of code that consistently exhibits certain performance behaviors. For example:
  - Gzip shows consistent and repeated data cache miss patterns
  - Crafty exhibits consistent I-cache misses
- A repeating phase can serve as a unit for dynamic and adaptive optimization, or for fast performance simulation.
  - An optimization unit can be a basic block, trace, procedure, or region (loops and loop nests, including complex control transfers)
  - A simulation unit can be an extended code sequence

Phase Detection
- One phase detection method doesn't fit all needs.
  - Dynamic data cache prefetching requires coarse-grain phases (e.g. loops), while dynamic I-cache prefetching requires fine-grain phases (e.g. frequent calling paths).
- A phase tuple is used to determine the current point of execution in PIN instrumentation.
  - Phase tuple: (phase ID #, ip addr, # of retired insts)
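The phase tuple can be pictured as a small record that an instrumentation callback compares against the running execution state. The sketch below is plain Python, not PIN's C++ API. The names `PhaseTuple` and `locate_phases` are hypothetical, and the matching rule (a phase is reached when execution hits the tuple's instruction address at or after its retired-instruction count) is an assumption, since the slides do not spell out how the tuple is compared.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PhaseTuple:
    phase_id: int  # phase ID number
    ip: int        # instruction address that marks the phase entry
    retired: int   # retired-instruction count recorded for the phase


def locate_phases(trace: Iterable[int], targets: List[PhaseTuple]) -> List[Tuple[int, int]]:
    """Walk a trace of instruction addresses and report (phase_id, retired_count)
    the first time each target is reached.

    Assumed rule: the address must match and the running retired count must be
    at least the count stored in the tuple. Each target fires once.
    """
    pending = sorted(targets, key=lambda t: t.retired)
    hits: List[Tuple[int, int]] = []
    retired = 0
    for ip in trace:
        retired += 1  # one retired instruction per trace entry
        still_pending = []
        for target in pending:
            if ip == target.ip and retired >= target.retired:
                hits.append((target.phase_id, retired))
            else:
                still_pending.append(target)
        pending = still_pending
    return hits
```

In a real tool the trace would come from an instrumentation callback rather than a list, and the retired-instruction count would come from a counter maintained by the instrumentation.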

Pintos for Optimization (I-Prefetch)
- Many applications still suffer from significant I-cache misses (e.g. database applications and some SPEC CPU2000 benchmarks).
- Complex control flow causes a high miss rate from streaming prefetches.
- A predictable call sequence results in a relatively low miss rate.
[Figure: L1I miss rate (%), L1I prefetch miss rate, and L2I miss rate for 176.gcc, crafty, eon, perlbmk, and vortex]

I-Cache Miss Analysis (pfmon)
Miss address based info, Crafty (2125/ ):
  25%   30 (1.41%)
  50%   91 (4.28%)
  75%   228 (10.73%)
  90%   442 (20.80%)
- Each top miss PC was caused by different paths.
Path based info, Crafty (8016/ ):
  25%   28 (0.34%)
  50%   126 (1.57%)
  75%   436 (5.43%)
  90%   1118 (13.94%)
- Each top path leading to an I-cache miss has 1-2 possible prefetch targets.
- The data show that we can reduce the points of interest for instruction prefetching.
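One plausible reading of the percentage rows is cumulative coverage: how many of the hottest miss addresses (or paths) are needed to account for 25%, 50%, 75%, and 90% of the sampled misses. That interpretation is an assumption, not something the slide states. Under it, the counts could be derived from a list of sampled miss addresses roughly as follows; `coverage_counts` is a hypothetical helper, not part of pfmon.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def coverage_counts(
    miss_pcs: Iterable[int],
    levels: Sequence[float] = (0.25, 0.50, 0.75, 0.90),
) -> Dict[float, int]:
    """For each coverage level, return how many of the most frequent miss
    addresses are needed to reach that share of all sampled misses."""
    counts = Counter(miss_pcs)
    total = sum(counts.values())
    if total == 0:
        return {level: 0 for level in levels}
    ranked = sorted(counts.values(), reverse=True)  # hottest addresses first
    result: Dict[float, int] = {}
    for level in levels:
        covered = 0
        needed = 0
        for count in ranked:
            if covered >= level * total:
                break
            covered += count
            needed += 1
        result[level] = needed
    return result
```

If a small number of addresses reaches the 90% level, only those addresses need to be considered as prefetch points, which is the reduction the slide points to.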

Exploring prospective points of instruction prefetching (PIN)
- Pintos generates prospective paths leading to frequent I-cache misses by analyzing the pfmon profile.
- The PIN instrumentation routine constructs a control flow graph and simulates the instruction cache along the execution.
- It inserts I-cache prefetching instructions for the prospective paths, based on control flow edge weights and estimated cache replacement.
[Figure: control flow graph of basic blocks B1-B8 connected to an instruction cache simulator, labeled "Paths frequently causing I-cache misses"]

Exploring prospective points of instruction prefetching (PIN)
- Key observations
  - Most I-cache misses happen in the cache lines that follow the entry to, or the return from, a function call.
  - L1I cache misses are mostly capacity misses, so we need to estimate how a prefetch affects the incoming instruction stream.
- Key idea
  - Run ahead by exploring the CFG with the I-cache simulator
  - Evaluate the prospective paths given by Pintos
[Figure: control flow graph of basic blocks B1-B8 connected to an instruction cache simulator, labeled "Paths frequently causing I-cache misses"]
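The run-ahead idea can be illustrated with a toy model: replay a candidate path through a small I-cache simulator, with and without a prefetch at a given block, and compare demand misses. This is only a sketch under simplifying assumptions (a direct-mapped cache, 64-byte lines, prefetches that complete instantly); the slides do not describe the actual simulator, and `DirectMappedICache`, `lines_of`, and `count_misses` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

LINE_SIZE = 64  # bytes per cache line (assumed)


class DirectMappedICache:
    def __init__(self, num_lines: int) -> None:
        self.num_lines = num_lines
        self.tags: Dict[int, int] = {}  # set index -> line number currently cached

    def touch(self, line: int) -> bool:
        """Access one cache line and install it; return True on a miss."""
        index = line % self.num_lines
        miss = self.tags.get(index) != line
        self.tags[index] = line
        return miss


def lines_of(start: int, size: int) -> range:
    """Cache line numbers covered by a block of `size` bytes at `start`."""
    return range(start // LINE_SIZE, (start + size - 1) // LINE_SIZE + 1)


def count_misses(
    path: List[str],
    blocks: Dict[str, Tuple[int, int]],
    num_lines: int,
    prefetch_at: Optional[Dict[str, str]] = None,
) -> int:
    """Replay `path` (basic block names) and count demand misses.

    blocks maps a block name to (start address, size in bytes).
    prefetch_at maps a block name to the block prefetched when it is entered.
    """
    cache = DirectMappedICache(num_lines)
    misses = 0
    for name in path:
        if prefetch_at and name in prefetch_at:
            start, size = blocks[prefetch_at[name]]
            for line in lines_of(start, size):
                cache.touch(line)  # prefetch fills the line; not a demand miss
        start, size = blocks[name]
        for line in lines_of(start, size):
            if cache.touch(line):
                misses += 1
    return misses
```

Even in this toy model, where the prefetch is placed matters: a prefetch issued too early can be evicted before it is used, or can evict lines that are still needed, which is why the replacement behavior has to be estimated along the path.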

Pintos for Fast Simulation
- Execution-driven micro-architectural simulation is commonly used to evaluate new microarchitecture features and the corresponding code optimizations.
- A complete simulation often takes too long.
- New methods for fast simulation, such as SimPoint and SMARTS, have been proposed.
- PASS (Phase Aware Stratified Sampling) is a different way to generate representative and customized traces for targeted simulations.

Fast Simulation Techniques
- Truncated execution
  - Run Z, FastForward-W-R
- Sampling
  - SMARTS
  - SimPoint
  - Stratified sampling
- Reduced input sets
  - MinneSPEC

Problems of Previous Works
- Truncated execution gives very inaccurate results.
- Reduced input sets do not always behave the same as the reference inputs, so performance estimates based on reduced input sets may be misleading.

Mechanism of SMARTS
- W: warm-up time (fixed to 2000 instructions for SPEC2000)
- U: detailed simulation (fixed to 1000 instructions for SPEC2000)
- (K-1)*U: functional simulation with functional warming
- The tool gives the value of K for which the IPC will be within ±3% of the actual value with 99.7% confidence.
[Figure: program run time line showing repeated W and U segments separated by (K-1)*U]
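As a rough illustration of where K comes from, standard sampling theory gives the number of sampling units needed for a target error as n >= (z * cv / eps)^2, where cv is the coefficient of variation of the per-unit measurement. With eps = 0.03 and z = 3 (about 99.7% confidence) this matches the targets on the slide. The sketch below applies that formula and then spreads the samples evenly over the run. It is not the SMARTS tool itself: `smarts_interval` and `period_layout` are hypothetical names, and carving the W warm-up out of the (K-1)*U gap is an assumption about the timeline.

```python
import math
from typing import Dict, Tuple


def smarts_interval(
    total_insts: int,
    cv: float,
    unit: int = 1000,
    eps: float = 0.03,
    z: float = 3.0,
) -> Tuple[int, int]:
    """Return (n, k): the number of sampling units to measure and the
    sampling interval k, so that every k-th unit of `unit` instructions
    is simulated in detail."""
    population = total_insts // unit           # sampling units in the run
    n = math.ceil((z * cv / eps) ** 2)         # sample size for the target error
    n = max(1, min(n, population))
    k = max(1, population // n)
    return n, k


def period_layout(k: int, unit: int = 1000, warmup: int = 2000) -> Dict[str, int]:
    """Instruction counts in one sampling period of k * unit instructions.

    Assumption: the detailed warm-up W is taken out of the (k-1)*unit gap.
    """
    gap = (k - 1) * unit
    detailed_warm = min(warmup, gap)
    return {
        "functional": gap - detailed_warm,  # functional simulation with warming
        "warmup": detailed_warm,            # detailed simulation, not measured
        "measured": unit,                   # detailed simulation, measured
    }
```

For a run of one billion instructions with cv = 0.455, this yields a sample of a few thousand units, so only a small fraction of the run is simulated in detail.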

Issues in Previous Work
- SMARTS
  - The values of U and W are fixed for the SPEC2000 suite and have to be identified again for every new benchmark suite (very time consuming).
  - Oversamples in steady phases.
  - Does not effectively exploit the existence of phases in programs.
- SimPoint
  - The user chooses the length of the simulation point (100 million, 10 million, or 1 million instructions).
  - Provides simulation points based on clustering of basic block profiles, which are generated using sim-fast or ATOM.

Phase Aware Stratified Sampling (PASS)
- Deploy a hierarchical method to detect coarse- and fine-grain program phases:
  (1) Track the calling stack (stable bottom = coarse-grain phase): inter-procedure
  (2) Detect loops within the procedure: intra-procedure
  (3) Track data access patterns, such as strides, within loops (fine-grain phases)
- Select stratified samples from each phase until high statistical confidence is reached.
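The last step, sampling each phase until the estimate is statistically tight, might look roughly like the sketch below. Each phase is treated as a stratum of candidate intervals. Samples are drawn from a stratum until the confidence interval of its mean is within a relative error bound, and the per-phase means are then weighted by phase size. The stopping rule (z = 3, 3% relative error, at least 5 samples) and the names `sample_phase` and `stratified_estimate` are assumptions for illustration, not details taken from PASS.

```python
import math
import random
import statistics
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def sample_phase(
    intervals: Sequence[int],
    measure: Callable[[int], float],
    eps: float = 0.03,
    z: float = 3.0,
    min_samples: int = 5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Measure randomly chosen intervals of one phase until the confidence
    interval half-width is within eps of the mean, or the phase is exhausted."""
    rng = rng or random.Random(0)
    order = list(intervals)
    rng.shuffle(order)
    values: List[float] = []
    for interval in order:
        values.append(measure(interval))
        if len(values) >= max(2, min_samples):
            mean = statistics.fmean(values)
            half_width = z * statistics.stdev(values) / math.sqrt(len(values))
            if mean != 0 and half_width <= eps * abs(mean):
                break
    return values


def stratified_estimate(
    phases: Dict[str, Sequence[int]],
    measure: Callable[[int], float],
) -> Tuple[float, int]:
    """Combine per-phase sample means, weighted by the share of intervals in
    each phase. Returns (estimate, number of intervals actually measured)."""
    total = sum(len(intervals) for intervals in phases.values())
    estimate = 0.0
    used = 0
    for intervals in phases.values():
        if not intervals:
            continue
        values = sample_phase(intervals, measure)
        estimate += (len(intervals) / total) * statistics.fmean(values)
        used += len(values)
    return estimate, used
```

A steady phase stops after the minimum number of samples, while a noisy phase keeps drawing, which is how stratifying by phase avoids oversampling steady regions.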

IPC vs SimPoint (cc1-166, 1 million insts)

IPC vs Phase Classification on PASS (cc1-166, 1 million insts)

IPC vs SimPoint (cc1-166, 250 million insts)

IPC vs SimPoint (gzip-source, 1 million insts)

IPC vs Phase Classification on PASS (gzip-source, 1 million insts)

IPC vs SimPoint (gzip-source, 250 million insts)

IPC vs SimPoint (mcf-ref, 1 million insts)

IPC vs Phase Classification on PASS (mcf-ref)

IPC vs SimPoint (mcf-ref, 250 million insts)

IPC vs Phase Classification on PASS (gap-ref, 1 million insts)

IPC vs SimPoint (gap-ref, 250 million insts)

Summary
- We show how HPM sampling (pfmon) and dynamic instrumentation (Pin) are combined in our research framework (Pintos) for adaptive object code optimization and micro-architectural simulation.
- PASS (Phase Aware Stratified Sampling) may lead to a more efficient way of simulating the interaction between compiler optimizations and new micro-architectural features.