Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems. Greg Stitt, Frank Vahid, Shawn Nematbakhsh, University of California, Riverside.

Presentation transcript:

1 Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems Greg Stitt, Frank Vahid, Shawn Nematbakhsh University of California, Riverside ACM Transactions on Embedded Computing Systems, February 2004

2 General Idea Increase the performance and/or reduce the energy use of a single embedded-system program. Take advantage of embedded properties: –Fairly specific applications that rarely change. –Small loops account for a large portion of execution time. Dedicate a configurable logic device or an ASIC to perform the loop function efficiently. For overall energy savings, the speedup must be great enough to overcome the increase in power during execution.
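A minimal sketch of the trade-off on this slide, not taken from the paper: Amdahl's law gives the overall speedup when only the loop is accelerated, and since energy is power times time, hardware pays off only if the speedup exceeds the ratio by which average power grows. The function names and the numbers (80% loop share, 10x loop speedup, 1.5x power) are illustrative assumptions.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: only the loop's share of execution time gets faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def saves_energy(speedup: float, power_ratio: float) -> bool:
    """Energy = power * time, so E_new / E_old = power_ratio / speedup.

    power_ratio is (average power with hardware) / (power of software only).
    """
    return power_ratio / speedup < 1.0


# Illustrative numbers only (assumptions, not results from the study).
s = overall_speedup(loop_fraction=0.8, loop_speedup=10.0)
print(f"overall speedup: {s:.2f}")
print("saves energy:", saves_energy(s, power_ratio=1.5))
```

With these assumed inputs the program runs faster than the power grows, so total energy drops; a speedup of 1.2 at the same 1.5x power would not pay off.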

3 Study Uniqueness What separates this study from others? –Simple HW/SW partitioning method (no complex search algorithms). –Focus on embedded systems. –Extensive evaluation of energy savings.
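A sketch of what a "simple" profile-driven partitioning could look like, assuming the only input is a per-region execution-time profile. This is not the authors' tool: the function name, the coverage threshold, the loop limit, and the region names are all hypothetical.

```python
def pick_critical_loops(profile: dict[str, float],
                        coverage: float = 0.8,
                        max_loops: int = 3) -> list[str]:
    """Greedily pick the most time-consuming regions until the chosen
    set covers `coverage` of total execution time or `max_loops` is hit."""
    total = sum(profile.values())
    if total <= 0:
        return []
    chosen: list[str] = []
    covered = 0.0
    for name, time_spent in sorted(profile.items(),
                                   key=lambda kv: kv[1], reverse=True):
        if covered / total >= coverage or len(chosen) >= max_loops:
            break
        chosen.append(name)
        covered += time_spent
    return chosen


# Hypothetical profile: region name -> cycles spent (made-up numbers).
profile = {"fir_loop": 620.0, "crc_loop": 210.0, "init": 40.0, "io_wait": 130.0}
print(pick_critical_loops(profile))
```

No search over partitions is involved: the candidates are just the top entries of the profile, which matches the slide's claim that complex search algorithms are not needed.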

4 Critical Loops Ave.

5 Partitioning Methods Applications were modified so that critical loops moved to hardware, written as register-transfer-level VHDL for Synopsys tools. The configurable system logic (CSL) is the bus master. The CSL accesses memory directly or through DMA. CPU–CSL communication is via shared memory (including CSL registers) and direct signals. With an ASIC, no DMA is used.

6 Partitioning Methods Handshaking routines activate the custom hardware (CSL or ASIC) when the program enters a “critical” region.

7 Speedup & E-savings (estimation) Software loops were replaced with handshaking behavior. HW cycles per loop were always calculated as worst case. Simulated: 100 MHz MIPS, 25 MHz 8051, maximum possible CSL speeds after synthesis. Xilinx Virtex power estimator used for the CSL (0.18 um, 1.8 V FPGA: XCV50E). Measured active/idle power in Triscend’s parts: CPUidle = 0.85*CPUactive, CSLidle = 0.125*CSLactive. Power of interconnect and memory gathered through physical measurement of Triscend parts. Total system Energy =
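The energy formula on this slide did not survive in the transcript, so the following is only a sketch of one plausible model, not the authors' equation. It assumes the CPU idles while the CSL runs the loop and vice versa, uses the idle ratios quoted above, and lumps interconnect and memory into a constant `p_other`. All function and parameter names, and the example wattages and times, are assumptions.

```python
CPU_IDLE_RATIO = 0.85   # CPUidle = 0.85 * CPUactive (from the slide)
CSL_IDLE_RATIO = 0.125  # CSLidle = 0.125 * CSLactive (from the slide)


def system_energy(p_cpu_active: float, p_csl_active: float,
                  t_cpu: float, t_csl: float,
                  p_other: float = 0.0) -> float:
    """Energy in joules for power in watts and time in seconds.

    Assumed model: CPU is active for t_cpu and idle for t_csl;
    the CSL is active for t_csl and idle for t_cpu.
    """
    e_cpu = p_cpu_active * t_cpu + CPU_IDLE_RATIO * p_cpu_active * t_csl
    e_csl = p_csl_active * t_csl + CSL_IDLE_RATIO * p_csl_active * t_cpu
    e_other = p_other * (t_cpu + t_csl)
    return e_cpu + e_csl + e_other


# Made-up numbers: a 1 W CPU running 1 s in software only, versus
# 0.2 s on the CPU plus 0.08 s on a 0.5 W CSL after partitioning.
e_sw = system_energy(1.0, 0.0, t_cpu=1.0, t_csl=0.0)
e_hw = system_energy(1.0, 0.5, t_cpu=0.2, t_csl=0.08)
print(f"estimated savings: {100 * (1 - e_hw / e_sw):.1f}%")
```

Because the CPU idles at 85% of its active power, the time spent waiting on the CSL is nearly as costly per second as active execution, so the savings come almost entirely from the shorter total run time.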

8 Speedup & E-savings (estimation)

9 Gates

10 Speedup & E-savings (measurement) Single-chip microprocessor/CSL devices from Triscend: –E5 (25 MHz) –A7 (40 MHz) A digital multimeter was used for current/voltage measurement, and time was measured with a timer (!). A subset of the benchmarks was measured. Good speedups and energy savings; energy “estimates even look conservative” (only on MIPS). But comparing a 100 MHz MIPS (simulated) to a 40 MHz ARM7 (measured)?

11 Speedup & E-savings (measurement)

12 Speedup & E-savings (ASIC) Estimations of a microprocessor and custom logic on a single ASIC. Synopsys synthesis and power estimation tools for 0.18 um. Average estimated speedup increased to 4.0 from 3.2, due to the increase in clock speed. Energy savings up to 50% from 34%. Average number of gates down to 5,738 from 10,507.

13 Voltage Scaling Additional energy saved if voltage scaling factored in. Because of the increased performance, clock speed may be slowed, and voltage reduced to attain equivalent performance. On average, Vscaling gives an additional 14% of E-savings.
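A rough sketch of why slowing the clock saves additional energy, under two textbook simplifications that are assumptions here and not the paper's model: dynamic energy per operation scales with the square of the supply voltage, and the maximum clock frequency scales roughly linearly with voltage. The function names, the 1.8 V nominal supply, and the minimum-voltage floor are illustrative.

```python
def scaled_voltage(v_nominal: float, speedup: float, v_min: float) -> float:
    """Lowest supply voltage that still matches the original performance,
    assuming frequency scales linearly with voltage."""
    return max(v_min, v_nominal / speedup)


def dynamic_energy_ratio(v_nominal: float, v_scaled: float) -> float:
    """Dynamic energy per operation is proportional to V^2."""
    return (v_scaled / v_nominal) ** 2


# Illustrative only: a 1.2x speedup lets a 1.8 V part drop to 1.5 V.
v = scaled_voltage(v_nominal=1.8, speedup=1.2, v_min=1.0)
ratio = dynamic_energy_ratio(1.8, v)
print(f"scaled voltage: {v:.2f} V, dynamic energy ratio: {ratio:.2f}")
```

This covers only the dynamic energy of the scaled component; the 14% average reported on the slide is a whole-system figure and should not be expected to match this idealized ratio.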

14 Voltage Scaling [Figure; axis label: Percent Speed (clock) Reduction]

15 Conclusion Moving a small amount of critical code to hardware can provide speedups and/or energy savings. Single-chip CPU/configurable-logic devices can give large improvements over CPU-only implementations. Extensive hardware/software partitioning exploration is not needed; basic profiling suffices.

16 Discussion Ideas Can the gains seen on this benchmark suite carry over to actual applications? Why did they simulate a 100 MHz MIPS but measure a 40 MHz ARM? How would the results differ on more modern microprocessors (XScale, AVR)? Do these newer CPUs have much better performance/power ratios? Pg. 223: “parallel execution”? Do they actually have parallel execution going on? Pg. 224 says no. How does having a DMA option allow “almost any software region” to be implemented in HW more easily? 85% idle power for the 8051??!! (pg. 225). Obviously not “sleeping.”