A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.


A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT
Nikolaos Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis
Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Algarve, Portugal, February 22-23, 2005

2 Outline
  Motivations
  Proposed Architecture
  Software Development Environment
  Demonstration
  Results
  Conclusions

3 Motivations
  Quest for Performance and Flexibility
  A large portion of the computational complexity is concentrated in small kernels covering small parts of the overall code
    Performance improved by accelerating these kernels
  Many algorithms show relevant Instruction Level Parallelism (ILP)
    Performance improved by parallel execution
  Traditional processors have computation clock slack
    Performance improved by chaining of operations (Spatial Computation)
  Extending Embedded Processors with Application Specific Function Units
  Reconfigurable Instruction Set Processors for Performance with Maximum Flexibility

4 Proposed Architecture
  Reconfigurable Instruction Set Processor (RISP)
  Core Processor
    32-bit load/store RISC architecture
    5 Pipeline Stages
    Single Issue Elaboration
  Reconfigurable Logic Coupling
    Reconfigurable Function Unit (RFU) approach => Low Communication Overhead
    Tightly Coupled => RFU Fits in two RISC pipeline stages => Better Utilization of the Pipeline Stages
  RFU
    1-D Array of Coarse Grain Processing Elements (PEs)
    PE Functionality Configurable at Design Time to meet Application requirements
    Exploits Instruction Level Parallelism – Spatial & Temporal Computation

5 Proposed Architecture
  Core Processor
    Commonly Used Function Units
    Control Logic Properly Extended to Handle Reconfigurable Instructions
    4-Read-1-Write Register File
  Core / RFU Interface
    Receives & Delivers Control and Data Signals
  Tightly Coupled RFU
    Configuration-Processing-Interconnection Layers
    Operates & Delivers Results in two Concurrent Pipeline Stages

6 Standard And Reconfigurable Instructions
  Re=‘0’ => Standard Instruction
    Control Logic: Configure Core Datapath
    Operands: Source1-2 & Destination
    ReOpCode = “nop”
  Re=‘1’ => Reconfigurable Instruction
    Control Logic: Configure Interface
    Operands: Source1-4 & Destination
    ReOpCode = “OpCode”
  Three Types of Reconfigurable Instructions
    Complex Computational Operations
    Complex Addressing Modes
    Complex Control Flow Operations
  Each Instruction can be multicycle
  32-Bit Instruction Word Format
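The Re flag and the four-source operand format can be sketched as a decoder. The slide does not give bit positions, so the field layout below (Re in bit 31, a 6-bit ReOpCode, five 5-bit register fields) and all function names are assumptions made only for illustration.

```python
from dataclasses import dataclass

# HYPOTHETICAL layout of a 32-bit reconfigurable instruction (not from the slides):
# [31] Re | [30:25] ReOpCode | [24:20] dest | [19:15] src1 | [14:10] src2 | [9:5] src3 | [4:0] src4
SRC_SHIFTS = (15, 10, 5, 0)


@dataclass(frozen=True)
class ReconfigInstr:
    re_opcode: int   # selects the operation configured in the RFU
    dest: int        # single destination register (MISO)
    sources: tuple   # up to four source registers


def is_reconfigurable(word: int) -> bool:
    """Re = 1 marks a reconfigurable instruction, Re = 0 a standard one."""
    return bool((word >> 31) & 1)


def encode_reconfigurable(re_opcode: int, dest: int, sources: tuple) -> int:
    """Pack the fields into a 32-bit word using the assumed layout."""
    word = (1 << 31) | ((re_opcode & 0x3F) << 25) | ((dest & 0x1F) << 20)
    for shift, reg in zip(SRC_SHIFTS, sources):
        word |= (reg & 0x1F) << shift
    return word


def decode_reconfigurable(word: int) -> ReconfigInstr:
    """Unpack a reconfigurable instruction; standard ones go to the core decoder."""
    if not is_reconfigurable(word):
        raise ValueError("Re = 0: standard instruction, handled by the core")
    sources = tuple((word >> shift) & 0x1F for shift in SRC_SHIFTS)
    return ReconfigInstr((word >> 25) & 0x3F, (word >> 20) & 0x1F, sources)


word = encode_reconfigurable(re_opcode=5, dest=3, sources=(1, 2, 4, 7))
print(decode_reconfigurable(word))
```

Four source fields match the 4-Read-1-Write register file of the core; the field widths would change with the real register count and opcode space.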

7 Reconfigurable Function Unit (RFU)
  Embedded RFU for Dynamic Extension of the Instruction Set
  Executes Multiple-Input-Single-Output (MISO) Reconfigurable Instructions
  1-D Array of Coarse Grain Reconfigurable Blocks
  Comprised of Three Layers
    Processing Layer
    Interconnection Layer
    Configuration Layer

8 RFU-Processing Layer
  PE Basic Structure
  Configurable PE functionality for the targeted application
  Unregistered Output => Spatial Computation
  Registered Output => Temporal Computation
  Floating PEs => Can operate in both core pipeline stages on demand
  Local Memory for Read Only Values
  Executes Long Chains of Operations in one processor cycle

9 RFU-Interconnection Layer
  1-D Array of PEs
  Operands from Register File
  Constant Values from Local Memory
  Input Network
  Operand Select
  Output Network => Delivers Results to corresponding pipeline stages
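A minimal behavioural sketch of how a MISO instruction could flow through the 1-D array: each PE selects its two inputs from the register-file operands, the local constant memory, or the output of an earlier PE. The feed-forward-only routing, the operation names, and the `PEConfig` / `run_miso` identifiers are assumptions for illustration; the sizes (8 PEs, 4 operands, 8 constants) follow the evaluated model reported later in the talk. Registered versus unregistered outputs and pipeline timing are not modelled.

```python
import operator
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF            # 32-bit granularity
NUM_PES, NUM_OPERANDS, NUM_CONSTANTS = 8, 4, 8

# Illustrative ALU / shifter / multiplier operations (names are hypothetical).
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "shl": lambda a, b: a << (b & 31),
    "shr": lambda a, b: a >> (b & 31),
}


@dataclass(frozen=True)
class PEConfig:
    op: str
    in_a: tuple   # ("reg", i), ("const", i) or ("pe", i)
    in_b: tuple


def run_miso(config, reg_operands, constants):
    """Evaluate a chain of PEs and return the single output (last PE)."""
    if len(config) > NUM_PES:
        raise ValueError("more PEs requested than the array provides")
    if len(reg_operands) > NUM_OPERANDS or len(constants) > NUM_CONSTANTS:
        raise ValueError("too many operands or constants")
    pe_out = []

    def select(src):
        kind, idx = src
        if kind == "reg":
            return reg_operands[idx] & MASK32
        if kind == "const":
            return constants[idx] & MASK32
        if kind == "pe":
            if idx >= len(pe_out):   # assumed: only earlier PEs can feed later ones
                raise ValueError("PE input must come from an earlier PE")
            return pe_out[idx]
        raise ValueError(f"unknown source kind {kind!r}")

    for pe in config:
        pe_out.append(OPS[pe.op](select(pe.in_a), select(pe.in_b)) & MASK32)
    return pe_out[-1]


# (r0 * c0 + r1) >> c1 chained through three PEs
chain = [
    PEConfig("mul", ("reg", 0), ("const", 0)),
    PEConfig("add", ("pe", 0), ("reg", 1)),
    PEConfig("shr", ("pe", 1), ("const", 1)),
]
print(run_miso(chain, reg_operands=(10, 3), constants=(7, 2)))  # (10*7 + 3) >> 2
```

On the core alone the same chain would take three dependent instructions; in the RFU it is one reconfigurable instruction with a single destination.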

10 RFU-Configuration Layer
  Configuration Bits Local Storage Structure
  Multi-Context Configuration Layer
  Coarse Grain => Small Number of Configuration Bits => Negligible Overhead to Download new Contexts

11 Architecture Synthesis & Evaluation
  A Hardware Model (VHDL) was Designed for Evaluation Purposes

  Configuration                        Value
  Granularity                          32 bits
  Number of Processing Elements        8
  Processing Elements Functionality    ALU, Shifter, Multiplier
  Configuration Contexts               16 words of 134 bits
  Local Memory Size                    8 constants of 32 bits
  Number of Provided Local Operands    4

  Component                    Area (mm^2)
  Processor Core               0.134
  RFU Processing Layer         0.186
  RFU Interconnection Layer    0.125
  RFU Configuration Layer      0.137
  RFU Total                    0.448

  The Model was Synthesized with STM 0.13 µm Process
  The RFU Area Overhead is 3.3x the Area of the Core Processor
  No Caches were taken into account
  No Overhead to Core Critical Path
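The reported totals can be checked with a few lines of arithmetic. The figures are the ones on this slide; the variable names are only illustrative.

```python
# Area figures in mm^2, as reported on the slide.
core_area = 0.134
rfu_layers = {
    "processing": 0.186,
    "interconnection": 0.125,
    "configuration": 0.137,
}

rfu_total = sum(rfu_layers.values())   # should match the reported 0.448
overhead = rfu_total / core_area       # should be roughly 3.3x

# Storage implied by the evaluated configuration.
config_bits = 16 * 134                 # 16 contexts of 134 bits each
local_mem_bits = 8 * 32                # 8 constants of 32 bits each

print(f"RFU total area : {rfu_total:.3f} mm^2")
print(f"RFU / core     : {overhead:.2f}x")
print(f"Config storage : {config_bits} bits")
print(f"Local memory   : {local_mem_bits} bits")
```

The three layer areas sum to the stated 0.448 mm^2, and the ratio to the core area is about 3.34, consistent with the 3.3x quoted on the slide.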

12 Software Development Environment

13 Demonstration-RFU Elaboration
  Largest MaxMISO for a Quantization Kernel
  Execution on the Core => six cycles
  Execution on the Core+RFU => one cycle
  Performance Improvements
  Reduced Instruction Memory Accesses

14 Results
  Speed-Ups for Several Kernels – Core Vs. Core+RFU

  Kernel    Speed-Up
  CRC       1.6x
  FIR       1.8x
  FFT       2.8x
  QUANT     1.9x
  VLC       1.7x

  Energy Consumption Dominated by Memory Accesses
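The speed-ups can also be read as a fraction of cycles saved. The sketch below assumes each speed-up is the ratio of core cycle count to core+RFU cycle count, which the slide does not state explicitly; the helper name is hypothetical.

```python
# Kernel speed-ups as reported on the slide (Core vs. Core+RFU).
speedups = {"CRC": 1.6, "FIR": 1.8, "FFT": 2.8, "QUANT": 1.9, "VLC": 1.7}


def cycle_reduction(speedup: float) -> float:
    """Fraction of cycles saved, assuming speedup = cycles_core / cycles_core_rfu."""
    return 1.0 - 1.0 / speedup


reductions = {kernel: cycle_reduction(s) for kernel, s in speedups.items()}

for kernel, r in reductions.items():
    print(f"{kernel:6s} {speedups[kernel]:.1f}x -> {100 * r:.1f}% fewer cycles")
```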

15 Conclusions
  A RISC Processor Enhanced by a Run-Time Reconfigurable Function Unit
  1-D Reconfigurable Array of Coarse Grain Processing Elements
  Multiple-Input-Single-Output Reconfigurable Instructions
  Specific Software Development Environment
  Low Cost Performance and Energy Consumption Improvements
  Next Step => Expand to VLIW Elaboration to Boost Achieved Speed-Ups