Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project

Presentation transcript:

Methodologies for Performance Simulation of Super-scalar OOO Processors. Srinivas Neginhal, Anantharaman Kalyanaraman. CprE 585: Survey Project.

Architectural Simulators. Explore the design space: evaluate existing hardware, or predict the performance of proposed hardware, with the designer in control. Functional simulators model the architecture (programmer's focus), e.g., sim-fast, sim-safe. Performance simulators model the microarchitecture (designer's focus), e.g., cycle-by-cycle simulation (sim-outorder).

Simulation Issues. Real applications take too long under cycle-by-cycle simulation. The design space is vast. Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc. Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc. Aims: find design flaws and provide design improvements. This calls for a “robust” simulation methodology.

Two Methodologies. HLS: a hybrid of statistical and symbolic simulation. REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, 71-82, 2000. BBDA: basic block distribution analysis. REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.

HLS: An Overview. HLS is a hybrid processor simulator: it combines a statistical model with symbolic execution to produce performance contours spanned by design space parameters. What can be achieved? Exploration of design changes in architectures and compilers that would be impractical to simulate using conventional simulators.

HLS: Main Idea. Application code goes through statistical profiling, which yields synthetically generated code (instruction stream, data stream). The synthetic code drives a structural simulation of the functional units (FU) and issue pipeline units. Profiled code characteristics: basic block size, dynamic instruction distance, instruction mix. Profiled architecture metrics: cache behavior, branch prediction accuracy.

Statistical Code Generation. Each “synthetic instruction” contains the following parameters, based on the statistical profile: functional unit requirements, dynamic instruction distances, and cache behavior.
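A minimal sketch of how a synthetic instruction could be drawn from a statistical profile. The `SyntheticInstruction` type, the `PROFILE` values, and the function names are hypothetical illustrations, not taken from HLS itself.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticInstruction:
    functional_unit: str      # functional unit requirement
    dependency_distance: int  # dynamic instruction distance to the producer
    l1_hit: bool              # cache behavior assigned to this instruction


# Hypothetical profile; real values would come from statistical profiling.
PROFILE = {
    "instruction_mix": {"int_alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15},
    "dependency_distance": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    "l1_hit_rate": 0.95,
}


def draw(distribution, rng):
    """Sample one key from a {value: probability} mapping."""
    values = list(distribution)
    weights = [distribution[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def generate(profile, count, seed=0):
    """Generate `count` synthetic instructions that follow the profile."""
    rng = random.Random(seed)
    return [
        SyntheticInstruction(
            functional_unit=draw(profile["instruction_mix"], rng),
            dependency_distance=draw(profile["dependency_distance"], rng),
            l1_hit=rng.random() < profile["l1_hit_rate"],
        )
        for _ in range(count)
    ]
```

Seeding the generator keeps a synthetic stream reproducible, which helps when comparing runs across design parameters.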

Validation of HLS against SimpleScalar. For varying combinations of design parameters: run the original benchmark code on SimpleScalar (sim-outorder); run the statistically generated code on HLS; compare the SimpleScalar IPC with the HLS IPC.
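The comparison step can be sketched as a relative IPC error per configuration. The configuration names and IPC values below are made-up placeholders, not measured results.

```python
def relative_ipc_error(reference_ipc, hls_ipc):
    """Relative error of the HLS IPC against the cycle-by-cycle reference."""
    return abs(hls_ipc - reference_ipc) / reference_ipc


# Hypothetical (SimpleScalar IPC, HLS IPC) pairs, one per design configuration.
results = {"config_a": (1.50, 1.44), "config_b": (2.00, 2.10)}

errors = {
    name: relative_ipc_error(reference, hls)
    for name, (reference, hls) in results.items()
}
```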

Validation: Single- and Multi-value Correlations. IPC vs. L1-cache hit rate. For SPECint95, HLS errors are within 5-7% of the cycle-by-cycle results.

HLS: Code Properties. Basic block size vs. L1-cache hit rate. The correlation suggests that increasing block size helps only when the L1 cache hit rate is >96% or <82%.

HLS: Value Prediction. Goal: break true dependencies. Stall penalty for a mispredict vs. value prediction knowledge. DID vs. value predictability.

HLS: Conclusions. The error rate is low only on the SPECint95 benchmark suite; error rates are high on the SPECfp95 and STREAM benchmarks (findings by R. H. Bell et al., 2004). Reason: the workload is modeled at instruction-level granularity. Recommended improvement: basic block-level granularity.

Basic Block Distribution Analysis. REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.

Introduction. Goal: to capture large-scale program behavior in significantly reduced simulation time. Approach: find a representative subset of the full program; find an ideal place to simulate, given a specific number of instructions one has to simulate; give an accurate confidence estimation of the simulation point. [Figure: program execution timeline showing initialization, period, and simulation points.]

Program Behavior. Program behavior has ramifications for architectural techniques. Program behavior differs across parts of the execution: initialization, followed by cyclic (periodic) behavior. Cyclic behavior is not representative of all programs, but it is the common case for compute-bound applications.

BBDA Basics. Fast profiling is used to determine the number of times each basic block executes. The behavior of the program is directly related to the code that it is executing. Profiling gives a basic block fingerprint for a particular interval of time. The chosen interval is ideally representative of the full execution of the program. Profiling information is collected in intervals of 100 million instructions.

Basic Block Vector. The BBV for interval i records the execution frequency of each basic block B1, B2, …, BD during interval i. BBV = fingerprint of an interval. Varying-size intervals: a BBV collected over an interval of N times 100 million instructions is a BBV of duration N.
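A minimal sketch of collecting one BBV per interval from a basic block trace. The trace format, the `block_sizes` mapping, and the function name are assumptions for illustration; a real profiler would use intervals of 100 million instructions rather than the toy lengths this sketch suits.

```python
from collections import Counter


def collect_bbvs(trace, block_sizes, interval_length):
    """Split a trace of executed basic block ids into per-interval BBVs.

    trace: iterable of basic block ids in execution order (assumed format).
    block_sizes: instructions per basic block id (assumed format).
    interval_length: instructions per interval.
    Returns a list of Counters mapping block id -> execution count.
    """
    bbvs = []
    current = Counter()
    executed = 0
    for block in trace:
        current[block] += 1
        executed += block_sizes[block]
        if executed >= interval_length:
            bbvs.append(current)       # interval is full: emit its fingerprint
            current, executed = Counter(), 0
    if current:
        bbvs.append(current)           # trailing partial interval
    return bbvs
```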

Target BBV. BBVs are normalized: each element is divided by the sum of all elements. Target BBV: the BBV for the entire execution of the program. Objective: find a BBV of smallest duration “similar” to the target BBV.
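Normalization and the target BBV can be sketched as follows, assuming each BBV is a list of raw execution counts with one slot per basic block. The function names are illustrative.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    if total == 0:
        raise ValueError("cannot normalize an all-zero BBV")
    return [count / total for count in bbv]


def target_bbv(interval_bbvs):
    """BBV for the entire execution: sum the raw interval counts, then normalize."""
    summed = [sum(column) for column in zip(*interval_bbvs)]
    return normalize(summed)
```

Summing raw counts before normalizing means longer intervals contribute in proportion to the work they contain.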

Basic Block Vector Difference. The difference between two BBVs is measured with either the Euclidean distance or the Manhattan distance.
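Both distances are standard; a small sketch over normalized BBVs stored as equal-length lists (the list representation is an assumption):

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For normalized BBVs the Manhattan distance ranges from 0 (identical) to 2 (no basic blocks in common).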

Basic Block Difference Graph. A plot of how well each individual interval in the program compares to the target BBV. For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV. The graph is used to find the end of the initialization phase and to find the period of the program.
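A sketch of computing the points of a basic block difference graph, assuming raw per-interval count vectors and the Manhattan distance as the difference measure (the choice of distance here is an assumption).

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [count / total for count in bbv]


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def difference_graph(interval_bbvs):
    """Difference of every interval BBV from the target (whole-program) BBV."""
    target = normalize([sum(column) for column in zip(*interval_bbvs)])
    return [manhattan_distance(normalize(bbv), target) for bbv in interval_bbvs]
```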

Basic Block Difference Graph

Initialization. Initialization is not trivial: it is important to simulate representative sections of the initialization code, so detecting the end of the initialization phase matters. Initialization Difference Graph: the Initial Representative Signal (IRS) is the first quarter of the basic block difference graph (BBDG). The IRS is slid across the BBDG, and the difference is calculated at each point over the first half of the BBDG. When the IRS reaches the end of the initialization stage on the BBDG, the difference is maximized.
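One way to read the sliding procedure above, as a sketch: the BBDG is a list of per-interval differences, the window difference is a sum of absolute differences, and the end of initialization is taken as the first shift where that difference peaks. These choices are assumptions for illustration, not details confirmed by the slides.

```python
def initialization_difference_graph(bbdg):
    """Slide the IRS (first quarter of the BBDG) across the first half."""
    width = len(bbdg) // 4
    irs = bbdg[:width]
    return [
        sum(abs(a - b) for a, b in zip(irs, bbdg[shift:shift + width]))
        for shift in range(len(bbdg) // 2)
    ]


def end_of_initialization(bbdg):
    """Index of the first shift at which the difference is maximal."""
    idg = initialization_difference_graph(bbdg)
    return max(range(len(idg)), key=lambda shift: idg[shift])
```

On a toy BBDG with four intervals of initialization followed by steady behavior, the peak is first reached at shift 4, where the window has fully left the initialization stage.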

Initialization

Period. Period Difference Graph: the Period Representative Signal (PRS) is the part of the basic block difference graph (BBDG) that starts at the end of initialization and spans one quarter of the length of program execution. The PRS is slid across half the BBDG; the distance between the minimum Y-axis points is the period. Using larger BBV durations creates a BBDG that emphasizes larger periods.
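A sketch of the period search under simplifying assumptions: the BBDG is a list of per-interval differences, the window difference is a sum of absolute differences, and the period is taken as the smallest non-zero shift with the lowest difference (shift 0 is always a trivial minimum). Real signals are noisy and would need a more careful minimum search.

```python
def period_difference_graph(bbdg, init_end):
    """Slide the PRS (a quarter-length window starting at init_end) forward."""
    n = len(bbdg)
    width = n // 4
    prs = bbdg[init_end:init_end + width]
    # Shift across half the BBDG, but never past the end of the graph.
    max_shift = min(n // 2, n - init_end - width)
    return [
        sum(abs(p - q) for p, q in
            zip(prs, bbdg[init_end + shift:init_end + shift + width]))
        for shift in range(max_shift + 1)
    ]


def estimate_period(bbdg, init_end):
    """Smallest non-zero shift whose difference is minimal (ties: first wins)."""
    pdg = period_difference_graph(bbdg, init_end)
    return min(range(1, len(pdg)), key=lambda shift: pdg[shift])
```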

Period

Summary of Results. The IPC of the chosen period differed from the IPC of the full execution by 5%. BBV-based technique (to be continued…)

Characterizing Program Behavior Through Clustering. REF: Automatically Characterizing Large Scale Program Behavior. T. Sherwood, E. Perelman, G. Hamerly and B. Calder. ASPLOS, 2002.

Clustering Approach. [Figure: N BBVs are clustered into K clusters (#1, #2, …, #K); the clusters yield multiple simulation points (P1, P2, …, Pk).]

Clustering (k-means). The goal is to divide a set of points into groups such that points within each group are similar to one another by a desired metric. Input: N points in D-dimensional space. Output: a partition into k clusters. Algorithm: (1) Randomly choose k points as centroids (initialization). (2) Compute the cluster membership of each point based on its distance from each centroid. (3) Compute the new centroid of each cluster. (4) Iterate steps 2 and 3 until convergence. Runtime complexity is affected by the “curse of dimensionality”.
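The four steps above can be sketched in plain Python. Euclidean distance, the iteration cap, the seed, and the handling of empty clusters (keep the old centroid) are assumptions of this sketch.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric lists."""
    rng = random.Random(seed)
    # Step 1: randomly choose k points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once memberships no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each iteration costs on the order of N × k × D distance terms, which is why reducing D before clustering pays off.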

Random Projection. Reduce the dimension of the BBVs to 15. Two ways to lower dimensionality: dimension selection and dimension reduction. Random linear projection is a dimension reduction technique.
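A sketch of random linear projection down to 15 dimensions. Drawing matrix entries uniformly from [-1, 1] and the fixed seed are assumptions of this sketch; the same matrix must be reused for every BBV so that distances stay comparable.

```python
import random


def random_projection_matrix(source_dims, target_dims=15, seed=0):
    """source_dims x target_dims matrix with entries drawn from [-1, 1]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(target_dims)]
        for _ in range(source_dims)
    ]


def project(bbv, matrix):
    """Multiply a 1 x D vector by a D x target_dims matrix."""
    target_dims = len(matrix[0])
    return [
        sum(value * row[j] for value, row in zip(bbv, matrix))
        for j in range(target_dims)
    ]
```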

BBDA: Conclusions. BBDA provides better sensitivity and lower performance variation in phases. Other related work, such as the instruction working set technique, provides higher “stability”.