Suhas Chakravarty, Zhuoran Zhao, Andreas Gerstlauer

Presentation transcript:

Automated, Retargetable Back-Annotation for Host Compiled Performance and Power Modeling
Suhas Chakravarty, Zhuoran Zhao, Andreas Gerstlauer
Electrical and Computer Engineering, The University of Texas at Austin
http://www.ece.utexas.edu/~gerstl
CODES+ISSS, 9/30/13

Outline
- Introduction
- Related Work
- Retargetable Back-Annotation Flow
- Experimental Results
- Summary and Conclusion

Motivation
- Increasing design complexities
  - Rapid design space exploration desired
  - Fast and accurate performance and power validation
- Traditional simulation models
  - Instruction Set Simulator (ISS)
  - RTL/Gate level
  - Too slow or too inaccurate
- Modeling at higher abstraction levels
  - Higher simulation speed
  - Host-compiled simulation

Host-Compiled Modeling
- Modeling above the ISS level
  - Compile and execute application natively
  - Annotate application with target timing and power
  - Wrap with SystemC code for platform integration
- Fast and accurate simulation to complement ISS
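A minimal sketch of what an annotated, natively executing application could look like. The per-block cycle and energy numbers, the block names, and the helper names (`SimClock`, `annotate`) are hypothetical. In the flow described in these slides the annotated code would be C wrapped in SystemC; Python is used here only to keep the illustration short.

```python
from dataclasses import dataclass

# Hypothetical per-basic-block costs. A real flow would obtain these from a
# target reference model rather than hard-coding them.
BLOCK_CYCLES = {"entry": 4, "loop_body": 7, "exit": 2}
BLOCK_ENERGY_NJ = {"entry": 1.0, "loop_body": 2.5, "exit": 0.5}


@dataclass
class SimClock:
    """Accumulates estimated target time and energy during native execution."""
    cycles: int = 0
    energy_nj: float = 0.0

    def annotate(self, block: str) -> None:
        # Called once each time the named basic block executes.
        self.cycles += BLOCK_CYCLES[block]
        self.energy_nj += BLOCK_ENERGY_NJ[block]


def sum_to(n: int, clock: SimClock) -> int:
    """Application code with one annotation call per basic block."""
    clock.annotate("entry")
    total = 0
    for i in range(1, n + 1):
        clock.annotate("loop_body")
        total += i
    clock.annotate("exit")
    return total


clock = SimClock()
result = sum_to(10, clock)
print(result, clock.cycles, clock.energy_nj)
```

The function still computes its real result on the host, while the accumulated counters follow the data-dependent path actually taken (here, ten loop iterations).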

Related Work
- Source level timing modeling
  - Binary-to-source mapping
    - Obtain estimation at source IR level [Hwang08, Brandolese01]
    - Disable optimization and rely on debug information [Wang09]
    - Mapping ambiguity
  - Reference model
    - Static binary code analysis [Stattelmann11, Wang09, Schnerr08]
    - Apply ISS or abstract pipeline model [Plyaskin11, Lin10]
- Source level power modeling
  - Coarse-grain reference model
    - Complete instructions and source-level operations [Brandolese00, Brandolese11, Calvo11]
    - Fast, but not accurate

Back-Annotation Concerns
- Annotation granularity?
  - Speed vs. accuracy tradeoff
  - Dynamic execution effects
  - Basic Block (BB) granularity
    - Data dependent execution behavior captured
    - Simulation speed still close to native execution
- Compiler optimizations?
  - Mapping between source and binary
  - Work with intermediate representation (IR)
    - Front-end optimizations accounted for
- Dynamic architecture effects?
  - Pipelining, caching, branch prediction
  - Pairwise characterization
  - BB granularity + hybrid simulation (future)

Retargetable Back-Annotator (RBA)
- Intermediate representation (IR)
  - Frontend optimizations [gcc]
  - IR to C conversion
- Timing and energy Back-Annotator
  - Binary-to-IR mapping
  - Timing and power estimation
  - Back-annotation

Timing and Energy Back Annotator
- Binary-to-IR mapping
  - Cross-compiler backend [gcc]
  - Control-flow graph matching
- Timing and power estimation
  - Micro-architecture description language (uADL) or RTL
  - Cycle-accurate timing
  - Reference power model [McPAT]
- Back-annotation
  - IR basic block level

Binary-to-IR mapping
[Figure labels: IR, Binary]
- Backend optimizations
  - Instruction scheduling
  - Blocks added/removed
  - Predicated execution
  - Control flow mismatches
- Establish binary-IR mapping for back-annotation
  - Graph matching heuristic
    - Recursive traversal
    - Identify all legal mappings
    - Resolve ambiguities using debug information

Graph Matching Heuristic
- Loop and branch level computation
  - Loop: nesting level
  - Branch: flow value
- Synchronized, recursive depth-first traversal
  - Enumerate all compatible successor pairings
    - Compatibility: loop and branch nesting levels
    - Including successor skips (hoist successors of successors)
  - Return least-cost mapping
    - Cost: sum of unmatched nodes in subgraphs rooted at node
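One plausible reading of the branch "flow value" is sketched below: the entry block carries a flow of 1, each block splits its flow evenly among its successors, and a join block sums what it receives. The even split, the function name `flow_values`, and the example graph are assumptions made for illustration, not the authors' definition. The edges were chosen so that the results resemble the node labels on the Graph Matching Example slide. The sketch handles acyclic graphs only and leaves loop nesting levels aside.

```python
from collections import deque


def flow_values(cfg, entry):
    """Propagate flow through an acyclic CFG in topological order.

    cfg maps each block name to the list of its successors; every block
    must appear as a key. Assumption: flow is split evenly at branches.
    """
    indegree = {block: 0 for block in cfg}
    for succs in cfg.values():
        for s in succs:
            indegree[s] += 1

    flow = {block: 0.0 for block in cfg}
    flow[entry] = 1.0

    # Kahn-style traversal: a block is processed once all of its
    # predecessors have contributed their share of flow.
    ready = deque(b for b, d in indegree.items() if d == 0)
    while ready:
        block = ready.popleft()
        succs = cfg[block]
        for s in succs:
            flow[s] += flow[block] / len(succs)
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return flow


# Hypothetical IR control-flow graph.
cfg = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E", "F"],
    "E": ["G", "H"],
    "F": ["G"],
    "G": ["I"],
    "H": ["I"],
    "I": [],
}
flows = flow_values(cfg, "A")
print(flows)
```

Under this reading, two blocks with different flow values cannot correspond to each other, which prunes the successor pairings that the traversal has to enumerate.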

Graph Matching Example
[Figure: step-by-step matching of an IR CFG against a binary CFG. IR CFG node labels: A (1), B (0.5), C (0.5), D (1), E (0.5), F (0.5), G (0.75), H (0.25), I (1). Binary CFG node labels: A’ (1), C’ (0.5), D’ (1), E’ (0.5), F’ (0.5), H’ (0.25), I’ (1). Mapping costs shown during the traversal: 5, 2, 1, 0, inf.]

Basic Block Characterization
[Figure: BB1, BB2, BB3; Exec flow 1 and Exec flow 2, with SS = A and SS = B. SS: system state (registers, mem, pipeline)]
- Path-dependent metrics
  - Execution history
  - Architecture state
- Execution path estimation
  - Capture the effects of previously executed code
  - Trade off between accuracy and complexity
  - Pairwise characterization

Pairwise Characterization
- Characterize each block with all immediate predecessors
  - Initialize system state from earlier execution
  - Scoreboarding to resolve dependency between pairs
- Function call characterization
  - Divide caller block into sub-blocks
  - Characterize caller and callee in conjunction with each other
    - On call and return
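The idea of characterizing a block together with each immediate predecessor can be sketched as follows. The function `run` is a toy in-order, single-issue model with a register scoreboard; it stands in for a cycle-accurate reference simulator and is not the model used by the authors. The instruction format, latencies, and block contents are invented for illustration.

```python
def run(instrs, scoreboard=None, start=0):
    """Toy pipeline model: one issue slot per cycle, stall on unready sources.

    instrs is a list of (dest_reg, source_regs, latency) tuples.
    Returns the next free issue cycle and the resulting scoreboard, which
    maps each register to the cycle at which its value becomes available.
    """
    ready = dict(scoreboard or {})
    cycle = start
    for dest, srcs, latency in instrs:
        # Stall until every source register has been produced.
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        ready[dest] = issue + latency
        cycle = issue + 1
    return cycle, ready


def characterize(blocks, preds):
    """Return delay[(pred, cur)] for every block and immediate predecessor."""
    delay = {}
    for cur, pred_list in preds.items():
        for pred in pred_list:
            # Run the predecessor first so the system state is initialized
            # from earlier execution, then attribute the extra cycles to cur.
            t_pred, state = run(blocks[pred])
            t_pair, _ = run(blocks[cur], state, t_pred)
            delay[(pred, cur)] = t_pair - t_pred
    return delay


# Hypothetical blocks: BB1 ends with a 3-cycle operation producing r1,
# BB2 does not touch r1, and BB3 starts by consuming r1.
blocks = {
    "BB1": [("r1", (), 3)],
    "BB2": [("r2", (), 1)],
    "BB3": [("r3", ("r1",), 1), ("r4", ("r3",), 1)],
}
preds = {"BB3": ["BB1", "BB2"]}
delay = characterize(blocks, preds)
print(delay)
```

In this toy setup BB3 costs more cycles when entered from BB1 than from BB2, because the pending result in r1 stalls its first instruction. That inter-block stall is the kind of path dependence a single per-block number would miss.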

Pairwise Execution
- Difference in fetch times
- Intra block stall will propagate and manifest
- Adjust for: inter block stall or overlap

IR Back Annotation
- Path dependent metrics
  - Encoded as global array: delay[pred_bb][cur_bb]
  - Captures static branch prediction
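A minimal sketch of how a `delay[pred_bb][cur_bb]` table could be consulted at run time. Only the array name and its indexing come from the slide. The block numbering, the cycle values, the `ENTRY` pseudo-block, and the `Annotated` helper are hypothetical, and Python stands in for the annotated C code.

```python
# Block IDs for a hypothetical loop: 0 = header, 1 = body, 2 = exit.
ENTRY = 3  # pseudo-predecessor used before the first block has run

# delay[pred_bb][cur_bb] in cycles; 0 marks edges that do not exist.
# The values are made up. The header-to-exit edge is given a larger delay
# to mimic a statically mispredicted branch.
delay = [
    # cur: 0  1  2
    [0, 3, 6],  # pred 0 (header)
    [4, 0, 0],  # pred 1 (body), back edge to header
    [0, 0, 0],  # pred 2 (exit)
    [2, 0, 0],  # pred 3 (ENTRY)
]


class Annotated:
    """Tracks the previously executed block and the accumulated cycles."""

    def __init__(self):
        self.cycles = 0
        self.pred_bb = ENTRY

    def enter(self, cur_bb):
        # Path-dependent lookup: the cost of cur_bb depends on its predecessor.
        self.cycles += delay[self.pred_bb][cur_bb]
        self.pred_bb = cur_bb


def count_down(n, sim):
    """Annotated loop; returns the estimated cycle count for this run."""
    while True:
        sim.enter(0)  # header: loop test
        if n <= 0:
            break
        sim.enter(1)  # body
        n -= 1
    sim.enter(2)  # exit
    return sim.cycles


print(count_down(2, Annotated()))
```

With these made-up numbers, a run with `n` iterations accumulates `8 + 7 * n` cycles: 2 for entering the header, 7 per iteration, and 6 for the exit edge.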

Experimental Results
- Automatic timing and energy back-annotation
  - Telecom & security applications [MiBench]
    - SHA, ADPCM, CRC32 & custom Eratosthenes’ Sieve
    - Small and large data sets, 10 to 700 million instr.
  - One-time back-annotation
    - 3 s to 3 min BA runtime
- Host-compiled simulation vs. traditional ISS
  - 2000 MIPS vs. 0.8-1 MIPS
  - Close to source-level speeds

Accuracy Results
- Host-compiled power and performance simulation
  - Single- (z4-like) and dual-issue (z6-like) e200 PowerPC
    - No cache, static branch prediction
  - Compare against cycle-accurate reference ISS+McPAT
- >98% average timing and energy accuracy @ 2000 MIPS

Summary & Conclusions
- Retargetable power/performance back-annotation
  - Automated ISS-driven estimation and BB characterization
    - Binary-to-IR control flow matching algorithm
    - ADL/ISS/McPAT-based pairwise block-level characterization
  - Back-annotation of timing & energy estimates into IR
    - Scripting to insert source level timing and energy annotations
- Host-compiled simulation performance
  - Running at 2000 MIPS with >98% accuracy
- Future work
  - Integrate other metrics into host-compiled simulation (thermal, reliability)
  - Fully automated host-compiled modeling flow