A Hybrid Energy-Estimation Technique for Extensible Processors Fei, Y.; Ravi, S.; Raghunathan, A.; Jha, N.K. IEEE Transactions on Computer-Aided Design.

Presentation transcript:

A Hybrid Energy-Estimation Technique for Extensible Processors Fei, Y.; Ravi, S.; Raghunathan, A.; Jha, N.K. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Volume: 23 Issue: 5 Pages: May 2004

Abstract
In this paper, we present an efficient and accurate methodology for estimating the energy consumption of application programs running on extensible processors. Extensible processors, which are becoming increasingly popular in embedded system design, allow a designer to customize a base processor core through instruction set extensions. Existing processor energy macromodeling techniques are not applicable to extensible processors, since they assume that the instruction set architecture as well as the underlying structural description of the micro-architecture remain fixed. Our solution to this problem is a hybrid energy macromodel suitably parameterized to estimate the energy consumption of an application running on the corresponding application-specific extended processor instance, which incorporates any custom instruction extension. Such a characterization is facilitated by careful selection of macromodel parameters/variables that can capture both the functional and structural aspects of the execution of a program on an extensible processor.

Abstract (cont.)
Another feature of the proposed energy characterization flow is the use of regression analysis to build the macromodel. Regression analysis allows for in-situ characterization, thus allowing arbitrary test programs to be used during macromodel construction. We validated the proposed methodology by characterizing the energy consumption of a state-of-the-art extensible processor (Tensilica’s Xtensa). We used the macromodel to analyze the energy consumption of several benchmark applications with custom instructions. The mean absolute error in the macromodel estimates is only 3.3% when compared to the energy values obtained by a commercial tool operating on the synthesized register-transfer level (RTL) description of the custom processor. Our approach achieves an average speedup of three orders of magnitude over the commercial RTL energy estimator. Our experiments show that the proposed methodology also achieves good relative accuracy, which is essential in energy optimization studies. Hence, our technique is both efficient and accurate.

Outline
- What’s the problem
- Introduction and related work
- Extensible processor energy macromodel requirements
- Proposed energy estimation methodology
- Experimental results and evaluation
- Conclusions

What’s the Problem
- Existing processor energy estimation frameworks are impractical for the energy optimization done in the ASIP design cycle
  - The extension to the base processor ISA is not fixed
  - The number of configurations/extensions is large
- Energy optimization studies need a fast and accurate energy estimate of an application running on an extensible processor for each candidate configuration

Related Work
- Structural macromodeling
  - Characterizes the energy consumption of the processor's constituent hardware modules
  - $E = \sum_i E_{m_1,i}(\text{bit transitions}) + \sum_i E_{m_2,i}(\text{bit transitions}) + \dots + \sum_i E_{m_k,i}(\text{bit transitions})$
  - $E_{m_1,i}(\text{bit transitions})$ denotes the energy per access of module 1
  - Advantage: high accuracy
  - Disadvantages: 1) low efficiency (RTL simulation of a processor is extremely slow); 2) requires an RTL hardware description of the processor
  - Suitable for energy estimation of a processor core

Related Work (cont.)
- Instruction-level macromodeling
  - Characterizes the energy consumption of each instruction of the processor
  - $E = E_{IC_1} \cdot Cyc_{IC_1} + E_{IC_2} \cdot Cyc_{IC_2} + E_{IC_3} \cdot Cyc_{IC_3} + \dots + E_{IC_k} \cdot Cyc_{IC_k}$
  - $E_{IC_1}$ denotes the average energy consumed by instruction class 1
  - $Cyc_{IC_1}$ denotes the number of cycles taken by instruction class 1
  - The energy coefficient $E_{IC_1}$ is obtained by actual measurement of a chip implementation
  - Advantage: high efficiency (an ISS is used to yield the energy estimate)
  - Disadvantages: 1) low accuracy; 2) requires an actual chip implementation, which is infeasible for power tradeoff studies early in the design cycle
  - Suitable for energy estimation of software on a fixed processor architecture
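The instruction-level sum above is a dot product between per-class energy coefficients and per-class cycle counts. A minimal Python sketch, assuming made-up coefficient values and class names (nothing here is taken from the paper's measurements):

```python
# Hypothetical energy-per-cycle coefficients in arbitrary integer units.
# Real values would come from measurement or characterization.
ENERGY_PER_CYCLE = {"arith": 10, "load": 15, "store": 14, "branch": 12}


def instruction_level_energy(cycles_per_class, energy_per_cycle=ENERGY_PER_CYCLE):
    """Return E = sum over instruction classes of E_IC * Cyc_IC."""
    return sum(energy_per_cycle[cls] * cycles
               for cls, cycles in cycles_per_class.items())


# Cycle counts as an instruction set simulator might report them (illustrative).
cycle_counts = {"arith": 100, "load": 20, "store": 10, "branch": 5}
total_energy = instruction_level_energy(cycle_counts)
print(total_energy)
```

Because only cycle counts per class are needed, the estimate costs one pass over ISS statistics, which is where the efficiency of this family of models comes from.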

Related Work (cont.)
- Statistical analysis and prediction macromodeling
  - Energy coefficients are calculated with regression analysis to build the macromodel
  - $E_i = C_1 M_{1,i} + C_2 M_{2,i} + \dots + C_k M_{k,i} + \Delta_i, \quad i = 1, 2, \dots, n$
  - The total energy consumption $E_i$ is the dependent variable
  - The macromodel parameters $M_{1,i}, \dots, M_{k,i}$ are the independent variables
  - $\Delta_i$ denotes the inaccuracy (residual)
  - A set of given observations $(E_i, M_{1,i}, \dots, M_{k,i})$, $i = 1, 2, \dots, n$, is used to predict the best energy coefficients $C_1, C_2, \dots, C_k$
- Energy macromodel generation
  - $\hat{E} = \hat{C}_1 M_1 + \hat{C}_2 M_2 + \dots + \hat{C}_k M_k$
  - $\hat{C}_1, \dots, \hat{C}_k$ denote the estimates of the energy coefficients
  - $\hat{E}$ denotes the estimate of the total energy consumption
  - The macromodel parameters $M_1, \dots, M_k$ are observable during instruction set simulation
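This slide does not say how the "best" coefficients are chosen. The later slide on regression fitting states that the mean square error is minimized, so ordinary least squares is a reasonable reading. As a sketch, write $M$ for the $n \times k$ matrix whose row $i$ is $(M_{1,i}, \dots, M_{k,i})$ and $E$ for the column vector of observed energies (this matrix notation is borrowed from that later slide and is an assumption here):

```latex
\hat{C} = \arg\min_{C} \sum_{i=1}^{n} \Bigl( E_i - \sum_{j=1}^{k} C_j M_{j,i} \Bigr)^{2}
\quad\Longrightarrow\quad
M^{\mathsf{T}} M \, \hat{C} = M^{\mathsf{T}} E,
\qquad
\hat{C} = \bigl( M^{\mathsf{T}} M \bigr)^{-1} M^{\mathsf{T}} E
\ \text{ when } M^{\mathsf{T}} M \text{ is invertible.}
```

Invertibility requires at least $k$ observations whose parameter vectors are linearly independent, which is one way to see why the test programs must exercise every macromodel parameter.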

Paper Overview and Contributions
- Hybrid energy macromodeling
  - Instruction-level macromodeling for the base processor
  - Structural macromodeling for the custom hardware extensions
  - Regression macromodeling for energy characterization
- Contributions
  - Energy consumption can be determined by instruction set simulation alone
  - Combines the efficiency of instruction-level approaches with the accuracy of structural approaches
  - Needs only the custom instruction descriptions
  - Does not require the custom processor to be synthesized
  - This is the only work that evaluates energy/performance tradeoffs among candidate custom instructions for extensible processors early in the design cycle

Extensible Xtensa Processor
- Xtensa’s ISA consists of a basic set of instructions plus a set of configurable and extensible options
- Extensibility is achieved by specifying application-specific functionality through custom instructions
- The behavior of a custom instruction is described using the TIE (Tensilica Instruction Extension) language
  - TIE is independent of the processor’s pipeline: the designer only needs to describe the semantics of the instructions as if they consisted of combinational logic alone
- The TIE compiler automatically derives
  - The hardware implementation of the custom instructions
  - The corresponding software development kit for the configuration: ANSI C/C++ compiler, linker, assembler, debugger, and a cycle-accurate instruction set simulator (ISS)

Example Containing Three Custom Instructions
- user register statement: specifies the custom state register and indices
- iclass statement: defines a new instruction class with one or more custom instructions
- semantic statement: describes the behavior of the instruction class
- schedule statement (used for multi-cycle instructions): schedules the operation sequence of the custom instruction
  - Needs ars and art at the beginning of the first cycle
  - Needs ACCU at the beginning of the second cycle
  - Produces the new ACCU at the end of the second cycle

Partial Architecture of an Extended Processor
- The base processor is augmented with custom hardware that implements three custom instructions: MULT, MAC, and CUS
- MULT and MAC perform their functions using shared custom hardware, which depends on the base processor operand buses
  - A multiplier (X), a multiplexer (MUX1), and an adder (+1)
- CUS accesses the custom registers CR0…CR2, which are independent of the base processor operand buses

Snapshot of Dynamic Execution of a Program
- The top horizontal bar lists the sequence of processor events dictated by the program's execution
- The bottom bar depicts the side effects in either the base processor or the custom hardware
  - Execution of the base processor instruction add activates custom hardware (X, MUX1, +1) in the second cycle
  - Execution of the custom instructions (I_2 and I_3) activates base processor hardware (ALU) in the second cycle
- Side effects occur because the custom hardware and the ALU of the base processor share the same operand buses

Different Factors of the Energy Macromodel
- Energy consumed by base processor instructions on the base processor core
  - Energy dependency on inter-instruction correlation and other nonideal features (such as stalls and cache misses)
- Energy consumed by custom instructions on the custom hardware
  - Only custom hardware computation energy: the second box in the top bar of I_2, I_3, I_4
- Interplay between the base processor and the custom hardware
  - Active energy of custom hardware owing to base processor instructions
    - Computation side effect in the EXE stage: the bottom bar of instruction I_1
  - Active energy of base processor hardware owing to custom instructions
    - Computation side effect in the EXE stage: the bottom bar of instructions I_2 and I_3
    - Involvement of the base processor in other pipeline stages: the RdReg, Wait, WrReg, and WrCR events in the top bar of instructions I_2, I_3, I_4

Extensible Processor Energy Estimation Flowchart: Characterization Flow
- Construct the macromodel template $E = E_0 X_0 + E_1 X_1 + \dots + E_n X_n$
  - Expresses energy consumption (the dependent variable) as a function of the characteristic parameters (the independent variables)
  - $E_0, \dots, E_n$ are constants called energy coefficients
  - $X_1, \dots, X_n$ are chosen from both the instruction-level and structural domains
- The test program suite incorporates custom instructions to cover all custom hardware library components
- Regression analysis requires knowledge of both the dependent variable and the independent variables
  - Steps 3-7 are repeated for every test program
- Regression analysis finds the estimates of the energy coefficients, which completes energy macromodel construction

Extensible Processor Energy Estimation Flowchart: Estimation Flow
- Step 9 gathers the instruction-level macromodel parameter values (instruction-level execution statistics)
- Step 10 gathers the structural macromodel parameter values (the activation of custom hardware)
- The parameter values are fed to the energy macromodel to yield the energy estimate

Energy Macromodel Template Generation
- $E_{ins}$ is a linear function of the instruction-level parameters and describes the energy consumed on the base processor
- $E_{struc}$ is a linear function of the structural parameters and describes the energy consumed on the custom hardware
- Instruction-level macromodel parameters
  - Reflect the usage of the base processor core due to either base processor or custom instructions
- Energy components of the base processor core
  - Energy of the base processor owing to base processor instructions
    - $E_{arith}, \dots, E_{br\_utk}$ represent the average energy consumption of each instruction class
    - $Cyc_{arith}, \dots, Cyc_{br\_utk}$ represent the number of cycles taken by each instruction class
  - Energy due to inter-instruction correlation and other nonideal features
    - The macromodel parameters $Num_i, \dots, Num_{interlock}$ denote the number of times each nonideal case occurs
  - Energy consumption in the base processor imposed by custom instructions (energy consumed in the four pipeline stages other than the EXE stage)
    - The macromodel parameter $Cyc_{side\_tie}$ accounts for the number of cycles taken by all custom instructions
- $E_{ins} = E_{arith} \cdot Cyc_{arith} + E_{ld} \cdot Cyc_{ld} + E_{st} \cdot Cyc_{st} + E_{j} \cdot Cyc_{j} + E_{br\_tk} \cdot Cyc_{br\_tk} + E_{br\_utk} \cdot Cyc_{br\_utk} + E_{i} \cdot Num_{i} + E_{d} \cdot Num_{d} + E_{uncache} \cdot Num_{uncache} + E_{interlock} \cdot Num_{interlock} + E_{side\_tie} \cdot Cyc_{side\_tie}$
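To make the instruction-level parameters concrete, the sketch below accumulates them from a simulated trace and evaluates $E_{ins}$. The trace format (class, cycles, events), the "tie" class label, and every coefficient value are hypothetical. Reading Num_i and Num_d as instruction-cache and data-cache miss counts is an interpretation of the slide, not something it states.

```python
from collections import Counter

BASE_CLASSES = {"arith", "ld", "st", "j", "br_tk", "br_utk"}
NONIDEAL_EVENTS = {"i", "d", "uncache", "interlock"}


def collect_ins_params(trace):
    """Accumulate instruction-level parameters from (class, cycles, events) records."""
    params = Counter()
    for cls, cycles, events in trace:
        if cls in BASE_CLASSES:
            params["Cyc_" + cls] += cycles
        elif cls == "tie":
            # Every custom instruction contributes to the side-effect cycle count.
            params["Cyc_side_tie"] += cycles
        else:
            raise ValueError(f"unknown instruction class: {cls}")
        for event in events:
            if event not in NONIDEAL_EVENTS:
                raise ValueError(f"unknown nonideal event: {event}")
            params["Num_" + event] += 1
    return params


def e_ins(coeffs, params):
    """Evaluate E_ins as the dot product of coefficients and parameter values."""
    return sum(coeffs[name] * value for name, value in params.items())


# Hypothetical coefficients (arbitrary integer units), keyed by parameter name.
COEFFS = {
    "Cyc_arith": 10, "Cyc_ld": 15, "Cyc_st": 14, "Cyc_j": 12,
    "Cyc_br_tk": 13, "Cyc_br_utk": 11, "Num_i": 50, "Num_d": 60,
    "Num_uncache": 40, "Num_interlock": 5, "Cyc_side_tie": 8,
}

# Hypothetical five-instruction trace.
TRACE = [
    ("arith", 1, []),
    ("ld", 2, ["d"]),
    ("tie", 2, []),
    ("br_tk", 3, ["interlock"]),
    ("arith", 1, []),
]

params = collect_ins_params(TRACE)
print(dict(params))
print(e_ins(COEFFS, params))
```

Keeping parameter collection separate from evaluation mirrors the split between gathering ISS statistics and applying the fitted macromodel.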

Energy Macromodel Template Generation (cont.)
- Structural macromodel parameters
  - Reflect the usage of the custom hardware extensions due to either base processor or custom instructions
  - The macromodel parameters $Cyc_1, \dots, Cyc_{10}$ denote the number of cycles in which each custom hardware component category is active
  - The energy coefficients $E_1, \dots, E_{10}$ represent the average energy consumption of each custom hardware component category
- Energy components of the custom hardware extensions
  - Custom functional blocks are activated whenever custom instructions execute
  - Custom functional blocks can also be activated while base processor instructions are running: the side effect of sharing the same operand buses still affects the custom hardware
- Dynamic resource usage analysis of the execution trace identifies the activated custom functional blocks (hardware components) for each instruction
- Custom hardware energy consumption is expressed as $E_{struc} = E_1 \cdot Cyc_1 + E_2 \cdot Cyc_2 + E_3 \cdot Cyc_3 + \dots + E_{10} \cdot Cyc_{10}$
- Note: the structural macromodel parameters should cover all the components present in the custom hardware library (10 component categories in this paper)
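The dynamic resource usage analysis can be pictured as a per-instruction lookup of activated components, with base instructions activating bus-coupled custom hardware as a side effect. The following Python sketch is an illustration under assumptions: the component names, the instruction-to-component mapping, the set of bus-driving base instructions, and the coefficients are all hypothetical, loosely modelled on the MULT/MAC/CUS example from the earlier slides.

```python
from collections import Counter

# Custom components fed by the base processor operand buses (assumed).
BUS_COUPLED = {"multiplier", "mux", "adder"}

# Components each custom instruction activates (hypothetical mapping).
CUSTOM_USAGE = {
    "MULT": {"multiplier", "mux"},
    "MAC": {"multiplier", "mux", "adder"},
    "CUS": {"custom_reg"},
}

# Base instructions assumed to drive the shared operand buses in the EXE stage.
DRIVES_OPERAND_BUSES = {"add", "sub"}


def structural_params(trace):
    """Count active cycles per component category from (mnemonic, exe_cycles) records."""
    active_cycles = Counter()
    for mnemonic, exe_cycles in trace:
        if mnemonic in CUSTOM_USAGE:
            active = CUSTOM_USAGE[mnemonic]
        elif mnemonic in DRIVES_OPERAND_BUSES:
            active = BUS_COUPLED  # side effect of the shared operand buses
        else:
            active = set()
        for component in active:
            active_cycles[component] += exe_cycles
    return active_cycles


def e_struc(coeffs, active_cycles):
    """Evaluate E_struc = sum over categories of E_k * Cyc_k."""
    return sum(coeffs[c] * n for c, n in active_cycles.items())


STRUCT_COEFFS = {"multiplier": 20, "mux": 2, "adder": 5, "custom_reg": 3}
TRACE = [("add", 1), ("MULT", 1), ("MAC", 1), ("CUS", 1), ("l32i", 1)]

cycles = structural_params(TRACE)
print(dict(cycles))
print(e_struc(STRUCT_COEFFS, cycles))
```

In this toy trace the base instruction add contributes a third of the multiplier's active cycles, which shows why ignoring side effects would bias the structural term.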

Macromodel Fitting Through Regression Analysis
- Determines the energy coefficients in the macromodel template
- Solves the linear matrix equation $M_{n \times 21} \, C_{21 \times 1} = E_{n \times 1}$
  - $E$ is an $n \times 1$ column vector holding the energy consumption data of the $n$ test programs
  - $M$ is an $n \times 21$ matrix holding the corresponding macromodel parameter values
  - $C$ is the energy coefficient vector corresponding to $\{E_{arith}, E_{ld}, E_{st}, E_{j}, E_{br\_tk}, E_{br\_utk}, E_{i}, E_{d}, E_{uncache}, E_{interlock}, E_{side\_tie}, E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8, E_9, E_{10}\}$
  - $\hat{C}$ denotes the estimate of the energy coefficient vector $C$
  - $\hat{E}$ denotes the estimate of the total energy consumption $E$
- Regression yields the energy coefficient vector $\hat{C}$ such that the mean square error is minimized
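A minimal sketch of the fitting step, assuming ordinary least squares solved through the normal equations with Gauss-Jordan elimination in plain Python. The slides do not say which solver the authors used, and a QR-based solver from a numerical library would usually be preferred for conditioning. The example uses three parameters instead of 21 and invented data.

```python
def fit_coefficients(M, E):
    """Least-squares fit: solve (M^T M) C = M^T E by Gauss-Jordan elimination."""
    n, k = len(M), len(M[0])
    # Augmented normal-equation system [M^T M | M^T E].
    A = [
        [sum(M[r][i] * M[r][j] for r in range(n)) for j in range(k)]
        + [sum(M[r][i] * E[r] for r in range(n))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("parameters are not independently excited by the test programs")
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Four hypothetical test programs, three macromodel parameters each.
M = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
# Energies generated from coefficients (2, 3, 5) with no noise, so the fit is exact.
E = [2, 3, 5, 10]

C_hat = fit_coefficients(M, E)
print(C_hat)
```

The singular-matrix check corresponds to the requirement that the test program suite cover every custom hardware component: a parameter that is never exercised leaves its coefficient undetermined.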

Energy Coefficients of the Xtensa Processor
- Energy consumption per cycle for each base processor instruction category
- Energy consumption per cycle for side effects
- Energy consumption per miss or per interlock for execution-time effects
- Energy consumption per cycle for the different custom hardware components

Absolute Accuracy Examination: Application Energy Estimates
- The maximum estimation error is 8.5%
- The average absolute error is only 3.3%
- The proposed energy estimation methodology is very fast
  - WattWatcher needs several more hours for energy estimation (RTL description generation + RTL simulation + power estimation using WattWatcher)
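The two accuracy figures are presumably the mean and the maximum of per-application absolute percentage errors against the RTL-level reference. A small Python sketch of that computation follows. The energy values are invented for illustration and are not the paper's data.

```python
def percent_errors(estimated, reference):
    """Signed percentage error of each estimate relative to its reference value."""
    return [100.0 * (e - r) / r for e, r in zip(estimated, reference)]


def summarize(estimated, reference):
    """Return (mean absolute percentage error, maximum absolute percentage error)."""
    abs_errors = [abs(x) for x in percent_errors(estimated, reference)]
    return sum(abs_errors) / len(abs_errors), max(abs_errors)


# Hypothetical energies for three applications (arbitrary units).
reference = [100, 200, 400]   # e.g. from an RTL-level power estimator
estimated = [104, 194, 400]   # e.g. from the macromodel

mean_abs, max_abs = summarize(estimated, reference)
print(mean_abs, max_abs)
```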

Absolute Accuracy Examination (cont.)
- Energy consumption due to custom hardware can be significant
- The accuracy of the macromodel is high for both the base processor and the custom hardware

Relative Accuracy Examination
- The macromodel shows good relative accuracy
- The proposed energy estimation methodology offers high relative accuracy at low effort (no custom processor generation, no RTL simulation)
- Therefore, it is highly suitable for energy optimization studies
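The slide does not define relative accuracy. One plausible reading is that the macromodel orders candidate configurations the same way the reference tool does, which is what an optimizer comparing candidates needs. The Python sketch below checks that property on invented numbers.

```python
def ranking(values):
    """Indices of the candidates sorted from lowest to highest energy."""
    return sorted(range(len(values)), key=lambda i: values[i])


def same_ranking(estimated, reference):
    """True if both estimators order the candidates identically."""
    return ranking(estimated) == ranking(reference)


# Hypothetical energies of four candidate custom-instruction configurations.
reference = [120, 95, 140, 100]
estimated = [118, 97, 133, 101]

print(ranking(reference))
print(same_ranking(estimated, reference))
```

An estimator can carry a noticeable absolute error and still pass this check, which is why relative accuracy is reported separately from absolute accuracy.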

Conclusions
- Presented an efficient and accurate energy estimation methodology for extensible processors
- High efficiency comes from requiring only instruction-set-simulation-based analysis of the application
- High accuracy comes from dynamic analysis of the custom hardware usage pattern
- Although the methodology speeds up energy estimation, it retains good absolute accuracy (the average absolute error is only 3.3%) and also achieves high relative accuracy