A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,


Presentation transcript:

A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems. Andrea Cappelli, F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo. Nice, France, April 22, 2003.

Outline: Motivations; XiRisc: a VLIW Processor; PiCoGA: a Pipelined Configurable Gate Array; Software Development Environment; Results & Measurements; Conclusions.

Motivations: Increased on-chip transistor density. Increased integration costs. Strong limitations in power supply. Severe power consumption constraints. Increased algorithmic complexity. Quest for performance and flexibility. [Figure labels: millions of transistors per chip, technology (nm), algorithm complexity, Moore's law, battery capacity]

Embedded systems: algorithm analysis. 90% of computational complexity is concentrated in small kernels covering small parts of the overall code. Many algorithms show significant instruction-level parallelism: performance is improved by multiple parallel data paths. Operand granularity typically differs from 32 bits: a traditional ALU is power-inefficient. Significant improvements can be obtained by extending embedded processors with application-specific function units. Reconfigurable computing is used to achieve maximum flexibility.

Existing Architectures: a standard processor coupled with embedded programmable logic, where application-specific functions are dynamically remapped depending on the algorithm being performed. Two models: (1) coprocessor model; (2) function unit model.

RISC architecture: 32-bit load/store RISC architecture (5-stage pipeline). VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle. Extended instruction set: a set of specialized function units implementing DSP-specific operations. Function unit approach: the reconfigurable device fits in a classical RISC pipeline, with low communication overhead, and exploits very high resource parallelism.

Architecture: Duplicated instruction decode logic (two symmetrical data channels). Duplicated commonly used function units (ALU and shifter). All other function units are shared (DSP operations, memory handler). A tightly coupled pipelined configurable gate array.

Dynamic Instruction Set Extension. pGA-load: a specific operation to transfer data from a configuration cache to the PiCoGA. Fields: configuration specification, region specification. pGA-op: 32-bit and 64-bit operations to launch the execution inside the PiCoGA (data exchange through the register file). 32-bit pGA-op fields: operation specification, Source 1, Source 2, Dest 1, Dest 2. 64-bit pGA-op fields: operation specification, Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2.

PiCoGA: a Pipelined Configurable Gate Array. Embedded function unit for dynamic extension of the instruction set. Two-dimensional array of LUT-based reconfigurable logic cells. Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor. Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file.
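The register-file interface described on this slide (up to four 32-bit inputs, up to two 32-bit outputs) can be illustrated with a small software model. This is only a sketch under stated assumptions: the `pga_op` structure, the `pga_execute` helper, and the 32-entry register file are hypothetical names and sizes chosen for illustration, not the actual XiRisc encoding or hardware interface. The `shift_add` operation is the example used later in the slides.

```c
#include <stdint.h>

#define NUM_REGS 32  /* assumed register-file size (not given in the slides) */
#define MAX_SRC  4   /* up to 4 x 32-bit inputs, per the slide */
#define MAX_DST  2   /* up to 2 x 32-bit outputs, per the slide */

/* Hypothetical signature for one configured PiCoGA operation. */
typedef void (*pga_fn)(const uint32_t in[MAX_SRC], uint32_t out[MAX_DST]);

/* Hypothetical decoded pGA-op: the function to run and the registers it uses. */
typedef struct {
    pga_fn  fn;
    uint8_t src[MAX_SRC];  /* source register indices */
    uint8_t n_src;         /* number of sources actually used */
    uint8_t dst[MAX_DST];  /* destination register indices */
    uint8_t n_dst;         /* number of destinations actually used */
} pga_op;

/* Example operation from the slides: c = (a << 2) + b. */
static void shift_add(const uint32_t in[MAX_SRC], uint32_t out[MAX_DST])
{
    out[0] = (in[0] << 2) + in[1];
    out[1] = 0;  /* second output unused by this operation */
}

/* Read sources from the register file, run the operation, write results back. */
static void pga_execute(const pga_op *op, uint32_t regs[NUM_REGS])
{
    uint32_t in[MAX_SRC]  = {0};
    uint32_t out[MAX_DST] = {0};

    for (uint8_t i = 0; i < op->n_src; i++)
        in[i] = regs[op->src[i]];
    op->fn(in, out);
    for (uint8_t i = 0; i < op->n_dst; i++)
        regs[op->dst[i]] = out[i];
}
```

For instance, with `regs[1] = 3` and `regs[2] = 5`, executing a `pga_op` that names registers 1 and 2 as sources and register 3 as destination leaves `(3 << 2) + 5` in `regs[3]`.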

DFG-based elaboration: Row elaboration is activated by an embedded control unit, with an execution enable signal for each pipeline stage. PiCoGA operation latency depends on the operation performed.
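As a rough illustration of row-by-row activation, the sketch below models a two-row pipeline for the `shift_add` example (first row: shift, second row: add). It assumes, purely for illustration, that each row takes one clock cycle and that the operation maps onto exactly two rows; the slides do not state either. The `pipeline` structure and `pipeline_step` function are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

/* State latched between row 0 (shift) and row 1 (add). */
typedef struct {
    bool     row0_valid;  /* enable signal: row 0 produced data last cycle */
    uint32_t row0_value;  /* a << 2, latched at the output of row 0 */
    uint32_t row0_b;      /* operand b carried along to row 1 */
} pipeline;

/* Advance one clock cycle. If issue is true, a new (a, b) pair enters row 0.
 * Returns true and stores the result when row 1 fired in this cycle. */
static bool pipeline_step(pipeline *p, bool issue, uint32_t a, uint32_t b,
                          uint32_t *result)
{
    /* Row 1 fires only if row 0 produced data in the previous cycle. */
    bool done = p->row0_valid;
    if (done)
        *result = p->row0_value + p->row0_b;

    /* Row 0 fires only when a new operation is issued. */
    p->row0_valid = issue;
    if (issue) {
        p->row0_value = a << 2;
        p->row0_b = b;
    }
    return done;
}
```

In this model an operation issued in one cycle delivers its result in the next call, so the latency equals the number of rows used, while back-to-back issues still complete one result per cycle.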

PiCoGA Configuration. Goal: to reduce cache misses due to PiCoGA configuration. Multi-context programming (4 cache layers/planes inside the array). Dedicated configuration cache with a high-bandwidth bus to the PiCoGA (192 bits). Partial run-time reconfiguration (one region is configured while another one is active). Configuration is completely concurrent with processor elaboration. [Figure labels: Configuration Cache, PiCoGA, Layer1, Layer2, Layer3, Layer4]
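The idea of loading one configuration while another stays active can be sketched in software. The model below is a simplification under stated assumptions: it treats each of the four layers as holding one whole configuration (the slides also describe per-region reconfiguration, which is not modeled), it uses a round-robin replacement policy that the slides do not specify, and it reserves configuration id 0 to mean "empty". All names (`pga_contexts`, `pga_load`, `pga_activate`) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CONTEXTS 4  /* four configuration layers, per the slide */

typedef struct {
    uint32_t loaded[NUM_CONTEXTS];  /* configuration id per layer, 0 = empty */
    int      active;                /* layer currently driving the array */
    int      next_victim;           /* round-robin pointer (assumed policy) */
} pga_contexts;

/* Model of a configuration load: place config_id (nonzero) in a layer that
 * is not the active one, so execution is never disturbed.
 * Returns the layer that holds the configuration. */
static int pga_load(pga_contexts *c, uint32_t config_id)
{
    for (int i = 0; i < NUM_CONTEXTS; i++)
        if (c->loaded[i] == config_id)
            return i;  /* already resident: nothing to transfer */

    if (c->next_victim == c->active)  /* never overwrite the active layer */
        c->next_victim = (c->next_victim + 1) % NUM_CONTEXTS;

    int layer = c->next_victim;
    c->loaded[layer] = config_id;
    c->next_victim = (layer + 1) % NUM_CONTEXTS;
    return layer;
}

/* Model of launching an operation: switch to the layer holding config_id.
 * Returns false on a configuration miss (a load would be needed first). */
static bool pga_activate(pga_contexts *c, uint32_t config_id)
{
    for (int i = 0; i < NUM_CONTEXTS; i++) {
        if (c->loaded[i] == config_id) {
            c->active = i;
            return true;
        }
    }
    return false;
}
```

The point of the sketch is that `pga_load` never touches the active layer, which mirrors the claim that configuration proceeds concurrently with execution.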

The Software Development Environment. [Flow diagram labels: initial C code, profiling, computation kernel extraction, PiCoGA mapping, latency information, pGA-op, assembler-level scheduler, executable code]

Software Simulation. Goals: check the correctness of the algorithm and evaluate performance. In the source code, a pGA-op is described using a pragma directive:

    #pragma pGA shift_add 0x12 5 c a b
    c = ( a << 2 ) + b
    #pragma end

The slide also shows the corresponding conditional code:

    /* Shift_add mapped on PiCoGA */
    #if defined(PiCoGA)
    ...
    asm("pGA-op 0x12...")
    ...
    /* Emulation function _shift_add */
    #else
    void _shift_add(){
    ...
    c = ( a << 2 ) + b
    ...
    }
    #endif

Software Simulation. Two special instructions are defined to support emulation:

    ...
    topga
    ...
    jal _shift_add
    fmpga

topga saves the current state and passes arguments to the emulation function; the clock cycle count is halted while the function runs. fmpga copies the emulation function result(s) and restores the registers; the cycle count is incremented by the latency value of the pGA-op. Overall performance is evaluated by counting elaboration cycles.
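The cycle accounting around topga and fmpga can be sketched as a small simulator model. This is an illustration only: `sim_state`, `sim_tick`, `sim_topga`, `sim_fmpga`, and `emu_shift_add` are hypothetical names, the state save/restore and argument passing are omitted, and whether topga and fmpga themselves cost cycles is not stated in the slides and is not modeled.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t cycles;    /* elaboration cycles counted so far */
    bool     counting;  /* false while an emulation function runs */
} sim_state;

/* Called once per simulated processor cycle. */
static void sim_tick(sim_state *s)
{
    if (s->counting)
        s->cycles++;
}

/* topga: halt cycle counting before entering the emulation function. */
static void sim_topga(sim_state *s)
{
    s->counting = false;
}

/* fmpga: resume counting and charge the latency of the emulated pGA-op. */
static void sim_fmpga(sim_state *s, uint32_t latency)
{
    s->counting = true;
    s->cycles += latency;
}

/* Emulation function for c = (a << 2) + b. The cycles it spends on the
 * host processor are ticked but not counted, since counting is halted. */
static uint32_t emu_shift_add(sim_state *s, uint32_t a, uint32_t b)
{
    sim_tick(s);  /* shift, executed in software */
    sim_tick(s);  /* add, executed in software */
    return (a << 2) + b;
}
```

With three ordinary cycles before the call and an example latency of 5 charged by `sim_fmpga`, the counter ends at 8 regardless of how many cycles the emulation function itself consumed.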

Results and Measurements. Speed-ups for several signal processing cores (benchmarks and values in the order listed on the slide): DES 13.5x; CRC 4.3x; Median Filter 7.7x; Motion Estimation 12.4x; Motion Prediction 4.5x; Turbo Codes 12x. 75% of the energy consumption of a VLIW architecture is due to accesses to instruction and data memory. Strong reduction of accesses to instruction memory. [Figure: Normalized Energy Histogram]

Conclusions. XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit. PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells. Specific software development toolchain. Speed-ups range from 4.3x to 13.5x. Up to 93% reduction in energy consumption.