Fast Energy Evaluation of Embedded Applications for Many-core Systems Felipe Rosa, Luciano Ost, Thiago Raupp, Fernando Moraes, Ricardo Reis.

Outline
1. Introduction
2. Open Virtual Platforms
3. Proposed Energy Model
4. Exploration in Large Scale Systems
5. Conclusion and Future Work

1. Introduction
- High-performance many-core systems are a reality
  - up to 256 cores available today
- What about software development?

1. Introduction
- Software development challenges comprise:
  - definition of inter-processor communication protocol stacks
  - OS porting and analysis
  - parallel programming model porting
  - driver development
- Software development costs are increasing (Source: IBS 2013)
- Simulation is the key tool for many-core research

- Full-system simulators are one good option
  - virtual platforms that emulate hardware behaviour, making software believe that it is running on real physical hardware
- Support concurrent HW and SW development
  - improves time-to-market
- Examples of such simulators are:

Simulator | Accuracy | Supported processor architectures | License | Active support
Simics | Functionally accurate | Alpha, ARM, MIPS, PowerPC, SPARC, and x86 | Private | Yes
PTLsim | Cycle-accurate | x86 | Open | No
SimpleScalar | Cycle-accurate | Alpha, ARM, PowerPC, and x86 | Open | No
GEM5 | Cycle-accurate | Alpha, ARM, MIPS, PowerPC, SPARC, and x86 | Open | Yes
QEMU | Instruction-accurate | ARM, MicroBlaze, MIPS, PowerPC, SPARC, x86, and others | Open | Yes
OVPsim | Instruction-accurate | Alpha, ARC, ARM, MIPS, PowerPC, MicroBlaze, and others | Open and Private | Yes

- Why the OVP simulator?

2. Open Virtual Platform - OVP
- Large number of processor architectures (ISAs) supported (e.g. MIPS, ARM, x86, PowerPC)
- Simulation speeds of up to 2200 MIPS
  - relies on just-in-time (JIT) dynamic binary translation
- Complete development environment with APIs
- Open source license
- Extensive documentation and active forum
- Powerful debug environment
- Limitation: OVPsim provides instruction accuracy only, resulting in inaccurate software performance estimation
- Contribution: proposal and integration of fast and accurate energy models into OVPsim

2. Open Virtual Platform - OVP
[Figure: OVP structure with the labels "Proposed Model Location" and "Callback". Source: [Davidmann and Graham 2014]]
- Software stack separated from the simulator
- Target instructions are translated to host machine binary code

3. Energy Model
- Characteristics
  - Instruction-driven energy model
  - Developed on the basis of OVP APIs
  - Run-time based approach
  - ISA-based approach
- Advantages
  - It avoids using a huge amount of memory
  - No trace files are required
  - Model is transparent to the software engineer
    - no pre- or post-processing application/software is required
  - The approach can be applied to other processor architectures
- Model called Watchdog

3. Energy Model
- The instruction energy cost information is not available
  - Characterization phase
- Instructions are organized in groups according to their energy cost similarity
  - Less complexity/computation during the characterization and simulation phases
- Reference CPU
  - PLASMA core
  - MIPS architecture
  - 3-stage pipeline
  - 100 MHz
  - 65 nm low-power library from STMicroelectronics
- Using Cadence tools
  - static and dynamic energy

3. Energy Model - Watchdog
1. The parser module disassembles the binary code and identifies the instruction that must be executed
2. The identified instruction is used as a hash table key to ascertain to which class the instruction belongs
3. The energy cost is computed and the instruction is executed in the CPU
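
The three Watchdog steps can be illustrated with a small Python sketch. It is not the OVP-based implementation: the `Watchdog` class, the mnemonic table, and every energy cost below are hypothetical placeholders chosen for illustration. Only the group names and the parse, look up, accumulate sequence come from the slides.

```python
from collections import Counter

# Hypothetical mnemonic-to-group hash table; group names follow the slides.
INSTRUCTION_GROUP = {
    "add": "Arithmetic", "addi": "Arithmetic", "sub": "Arithmetic",
    "j": "Jump", "beq": "Jump",
    "lw": "Load-Store", "sw": "Load-Store",
    "and": "Logical", "or": "Logical",
    "mfhi": "Move", "mflo": "Move",
    "nop": "NOP",
    "sll": "Shift", "srl": "Shift",
}

# Placeholder energy per instruction in nJ. These are NOT the characterized values.
ENERGY_PER_INST_NJ = {
    "Arithmetic": 1.0, "Jump": 0.75, "Load-Store": 1.5, "Logical": 0.5,
    "Move": 0.5, "NOP": 0.25, "Shift": 0.5,
}


class Watchdog:
    """Accumulates energy for one CPU, one instruction at a time."""

    def __init__(self, groups, costs):
        self.groups = groups
        self.costs = costs
        self.energy_nj = 0.0
        self.counts = Counter()

    def on_instruction(self, disassembly):
        # Step 1: identify the instruction from its disassembled text.
        mnemonic = disassembly.split()[0].lower()
        # Step 2: hash table lookup gives the energy group (KeyError if unknown).
        group = self.groups[mnemonic]
        # Step 3: add the group's cost before the instruction executes.
        self.counts[group] += 1
        self.energy_nj += self.costs[group]


watchdog = Watchdog(INSTRUCTION_GROUP, ENERGY_PER_INST_NJ)
for line in ["addi $t0, $zero, 5", "lw $t1, 0($sp)", "nop", "beq $t0, $t1, 8"]:
    watchdog.on_instruction(line)
```

Because the total is updated while the program runs, no trace file has to be stored, which matches the run-time approach described earlier.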

3. Energy Model
Characterization Flow
- Benchmark Conception
- Activity Measurement
- Power Acquisition
- Energy Calculation
- Energy Model Creation

Groups | Power (mW) | Exec Time (us) | Energy (nJ) | # of inst | Energy per Inst (nJ)
Arithmetic | 6,456 | 342,… | … | … | …
Jump | 6,046 | 102,600 | 620,… | … | …
Load-Store | 4,… | … | … | … | …
Logical | 4,469 | 349,… | … | … | …
Move | 3,129 | 480,… | … | … | …
NOP | 2,141 | 257,155 | 550,… | … | …
Shift | 3,824 | 298,… | … | … | …

What about accuracy?
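
The energy calculation step appears to be the product of average power and execution time, divided by the instruction count to get a per-instruction cost. The Python sketch below checks this against the Jump and NOP rows, reading the decimal commas as decimal points (6,046 mW and 102,600 us; 2,141 mW and 257,155 us). The function names are illustrative, and the instruction count is a made-up value because the original counts did not survive in the table.

```python
def energy_nj(power_mw, exec_time_us):
    """Energy in nJ: 1 mW * 1 us = 1e-3 W * 1e-6 s = 1e-9 J = 1 nJ."""
    return power_mw * exec_time_us


def energy_per_instruction_nj(power_mw, exec_time_us, n_instructions):
    """Average energy of one instruction in a single-group benchmark."""
    return energy_nj(power_mw, exec_time_us) / n_instructions


# Power and execution time as they appear in the Jump and NOP rows.
jump_energy = energy_nj(6.046, 102.600)   # about 620.3 nJ
nop_energy = energy_nj(2.141, 257.155)    # about 550.6 nJ

# Hypothetical instruction count, for illustration only.
ASSUMED_INSTRUCTION_COUNT = 10_000
nop_per_inst = energy_per_instruction_nj(2.141, 257.155, ASSUMED_INSTRUCTION_COUNT)
```

Both products agree with the leading digits that remain in the Energy column (620 and 550), which supports reading that column as power times execution time.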

3. Energy Model – Experimental Setup
- Benchmarks
  - 19 applications from different research domains
  - WCET and other benchmarks created in house
  - Model estimation compared with a gate-level implementation (PLASMA)

# | Name | Suite
A | BFSH | In-House production
B | BinarySearch | Mälardalen WCET
C | BitManipulation | Mälardalen WCET
D | Bubble | Mälardalen WCET
E | Counts | Mälardalen WCET
F | Crc | In-House production
G | Edn | Mälardalen WCET
H | Expint | Mälardalen WCET
I | Factorial | In-House production
J | Fft | Mälardalen WCET
K | Fib | In-House production
L | Hanoi | In-House production
M | Harm | In-House production
N | InsertSort | Mälardalen WCET
O | MatrixInver | Mälardalen WCET
P | Mdc | In-House production
Q | Peakspeed | Imperas
R | Ud | Mälardalen WCET
S | Usqrt | Mälardalen WCET

3. Energy Model – Accuracy Evaluation
- Benchmarks
  - 19 applications from different research domains
  - WCET and other benchmarks created in house
  - Model estimation compared with gate-level
- Mismatch is below 6% in 15 out of the 19 adopted benchmarks
- What about speedup gain?
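
A mismatch figure like the one above is usually a relative error of the model estimate against the gate-level reference. The slides do not give the exact formula, so the Python sketch below assumes that definition. The function names and the sample energy values are hypothetical.

```python
def mismatch_percent(estimated_nj, reference_nj):
    """Relative error of the model estimate against the gate-level value."""
    return 100.0 * abs(estimated_nj - reference_nj) / reference_nj


def count_below(estimates, references, threshold_percent=6.0):
    """Number of benchmarks whose mismatch is below the threshold."""
    return sum(
        1
        for est, ref in zip(estimates, references)
        if mismatch_percent(est, ref) < threshold_percent
    )


# Made-up energy totals (nJ) for three benchmarks, for illustration only.
model_estimates = [103.0, 95.0, 120.0]
gate_level_refs = [100.0, 100.0, 100.0]
within_threshold = count_below(model_estimates, gate_level_refs)
```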

3. Energy Model – Achieved Speedup
- Comparing the Watchdog estimation execution time of each benchmark with the gate-level execution time
  - Achieved speedup varies from 461 to 1577
  - Mean relative gain: 1118
  - The larger the application code, the greater the relative gain
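
The speedup reported here is presumably the ratio of gate-level simulation time to Watchdog simulation time for the same benchmark, averaged over all benchmarks. A minimal Python sketch under that assumption follows. The timing values are invented so that the ratios are easy to check, and they are not measurements from the paper.

```python
from statistics import mean


def speedup(gate_level_seconds, watchdog_seconds):
    """How many times faster the Watchdog run is than the gate-level run."""
    return gate_level_seconds / watchdog_seconds


# Hypothetical (gate-level, Watchdog) wall-clock times in seconds.
timings = {
    "small_benchmark": (922.0, 2.0),
    "large_benchmark": (6308.0, 4.0),
}

speedups = {name: speedup(gate, wd) for name, (gate, wd) in timings.items()}
mean_speedup = mean(speedups.values())
```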

3. Energy Model – Scalability
- Scenario exploration with up to 1000 CPUs
- Each CPU has one Watchdog associated
[Figure labels: "Around 1.8 MIPS", "Improvement", "Watchdog Model"]

4. Exploration in Large Scale Systems
- The proposed instruction-driven energy model was integrated into a NoC-based MPSoC model proposed in [Mandelli et al. 2013]
- Case study: mapping process cost evaluation
  - Nearest Neighbor (NN), First Free (FF) and LECDN
  - 8x8 MPSoC organized in 4x4 clusters
  - Only the heuristic algorithm was observed
  - 5 application instances:
    - 4 partial MPEG decoders containing 5 tasks
    - 1 DTW containing 10 tasks
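
The slides name the mapping heuristics without defining them. The Python sketch below assumes the common textbook versions of two of them on a 2D mesh: First Free takes the first unoccupied processing element in row-major order, and Nearest Neighbor takes the unoccupied element with the fewest hops from a reference element. The function names, the tie-breaking rule, and the occupied elements are illustrative assumptions. LECDN is not sketched, and the version used in [Mandelli et al. 2013] may differ.

```python
from typing import Optional, Set, Tuple

PE = Tuple[int, int]  # (x, y) position of a processing element in the mesh


def first_free(free: Set[PE], width: int, height: int) -> Optional[PE]:
    """Return the first free PE in row-major order, or None if the mesh is full."""
    for y in range(height):
        for x in range(width):
            if (x, y) in free:
                return (x, y)
    return None


def hops(a: PE, b: PE) -> int:
    """Manhattan distance, i.e. hop count under XY routing in a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_neighbor(free: Set[PE], origin: PE) -> Optional[PE]:
    """Return the free PE closest to origin; ties go to the lowest y, then x."""
    if not free:
        return None
    return min(free, key=lambda pe: (hops(pe, origin), pe[1], pe[0]))


# 8x8 mesh with two PEs already occupied (hypothetical state).
WIDTH, HEIGHT = 8, 8
free_pes = {(x, y) for x in range(WIDTH) for y in range(HEIGHT)}
free_pes -= {(0, 0), (4, 4)}

ff_choice = first_free(free_pes, WIDTH, HEIGHT)
nn_choice = nearest_neighbor(free_pes, origin=(4, 4))
```

Counting the instructions each heuristic executes, with one Watchdog per CPU, is the kind of mapping cost the case study evaluates.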

5. Conclusion and Future Work
- Inclusion of fast and accurate energy models into OVPsim
- Extensive evaluation of both models considering several benchmarks, comparing them to a gate-level simulation
- Approach is ISA/CPU-oriented, thus everything is transparent to the software engineer
- Programmers can use the same simulator for fast simulation and accurate software performance evaluation
- Limitation of our approach:
  - we are not considering processors with cache

5. Conclusion and Future Work
- Consider memory access power cost
  - Calibrate our model considering NVSim/CACTI
- Port the proposed model to the OVPsim morphing phase
- Improve overall model accuracy
  - evaluate load and store patterns
  - enhance the estimation for the division and multiplication algorithms
- Support complex processor architectures such as out-of-order or superpipelined

Questions?
