Implementation Approaches with FPGAs


Implementation Approaches with FPGAs
Compile-time reconfiguration (CTR): CTR is a static implementation strategy where each application consists of one configuration.
Run-time reconfiguration (RTR): RTR is a dynamic implementation strategy where each application consists of multiple cooperating configurations.

Compile Time Reconfiguration
Consists of a single system-wide configuration
The static hardware configuration remains on the FPGAs for the duration of the application
Similar to an ASIC from the application's point of view
Conventional design tools provide adequate support for application development
Examples: Splash, Nano processor

Run Time Reconfiguration
Applications reconfigure hardware resources during application execution
Each configuration implements some fraction of the application
Makes more efficient use of limited hardware resources
Drawback: lack of sufficient design tools and a well-defined design methodology
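To make the CTR/RTR contrast concrete, here is a minimal host-side sketch, assuming a hypothetical Fpga driver class with configure() and run() methods (these names are illustrative, not from the slides):

```python
# Minimal sketch of compile-time vs. run-time reconfiguration from the host's
# view. The Fpga class and its configure()/run() methods are hypothetical;
# a real system would use a vendor-specific configuration API.

class Fpga:
    def configure(self, bitstream_file):
        """Load a full-chip configuration onto the device (placeholder)."""
        print(f"configuring with {bitstream_file}")

    def run(self, state):
        """Execute the currently loaded circuit; returns intermediate results."""
        print("executing current configuration")
        return state

def ctr_application(fpga, data):
    # Compile-time reconfiguration: one configuration for the whole run.
    fpga.configure("whole_app.bit")
    return fpga.run(data)

def rtr_application(fpga, data, stages=("stage1.bit", "stage2.bit", "stage3.bit")):
    # Run-time reconfiguration: the application is a sequence of cooperating
    # configurations; intermediate results are carried between stages.
    state = data
    for bitstream in stages:
        fpga.configure(bitstream)   # reconfiguration overhead is paid here
        state = fpga.run(state)     # each stage implements a fraction of the app
    return state

if __name__ == "__main__":
    rtr_application(Fpga(), data={"inputs": [1, 2, 3]})
```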

New Design Problems
Divide the algorithm into time-exclusive segments that do not need to (or cannot) run concurrently
–Each segment should remain loaded for a reasonable amount of time, so that the reconfiguration overhead is amortized
–Tasks should be relatively independent of each other
Coordinate the behavior between configurations
–Intermediate results must be passed from one configuration to the next

Two RTR Approaches (1)
Global Approach
–Each phase of the application is implemented as a single system-wide configuration that allocates all hardware resources in each configuration step
–Relatively simple, coarse grained
Implementation issues
–Divide the application into roughly equal-sized partitions
–Interfaces between configurations are fixed
Example: RRANN

Two RTR Approaches (2)
Local Approach
–Applications locally reconfigure subsets of the logic as the application executes
–Flexible, finer granularity
–Ability to create fine-grained functional operators
Implementation Issues
–Interfaces are not fixed
–Designers need to ensure both structural and physical compliance
–No good design-tool support
Examples: DISC, RRANN2

Run-time Reconfiguration Paper (1)
FPGAs and Neural Networks
–Implementation of random topologies
–Training versus operation
–Multiple training algorithms
–Run-time reconfiguration

Run-time Reconfiguration Paper (2)
Problem: The Backpropagation Training Algorithm
–Feed-forward stage
–Backpropagation stage

Run-time Reconfiguration Paper (3)
–Backpropagation stage
–Update stage
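The equations for the three stages appear as images on the original slides and are not reproduced in the transcript. As a reference, the standard backpropagation formulas for a feed-forward network with sigmoid activation (assumed here, not taken from the slides) are:

```latex
% Standard backpropagation equations (assumed; the slide images are not
% reproduced in the transcript). o_j is the output of neuron j, w_{ji} the
% weight from neuron i to neuron j, \eta the learning rate, t_j the target.
\begin{align*}
  \text{Feed-forward:}    \quad & net_j = \sum_i w_{ji}\, o_i, \qquad
                                  o_j = f(net_j) = \frac{1}{1 + e^{-net_j}} \\
  \text{Backpropagation:} \quad & \delta_j =
      \begin{cases}
        f'(net_j)\,(t_j - o_j) & \text{output layer} \\
        f'(net_j)\sum_k \delta_k\, w_{kj} & \text{hidden layers}
      \end{cases} \\
  \text{Update:}          \quad & \Delta w_{ji} = \eta\, \delta_j\, o_i
\end{align*}
```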

Run-time Reconfiguration Paper (4)
Approach 1:
–Combine all three stages of execution into the same circuit module and configure this module onto the FPGAs
–No reconfiguration
Approach 2:
–Combine the feed-forward and update stages into one circuit and the backpropagation stage into another
–Reconfigure twice per training cycle

Run-time Reconfiguration Paper (5)
Approach 3:
–Treat feed-forward, backpropagation, and update as three separate circuit modules
–Need to reconfigure three times per training cycle
–Each stage consists of a global controller occupying one FPGA and many neural processors occupying the balance of the available FPGAs
–6 hardware neurons per FPGA

Run-time Reconfiguration Paper (6)
Global Controller
–Sequences the execution of local hardware subroutines on the neural processors
–Supplies data to the neural processors
Neural Processor
–Performs the computations
–Contains six hardware neurons, pre- and post-processing, memory interfacing, local control, and a local RAM

Run-time Reconfiguration Paper (7)
Multiplexed Interconnection
–A broadcast bus connects the outputs of all neurons on layer m to the inputs of all neurons on layer m+1; neuron outputs are time-multiplexed onto the bus one at a time
The Feed-forward Stage
The Backpropagation Stage
The Update Stage
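A small software model of the time-multiplexed broadcast during the feed-forward stage, assuming one bus cycle per layer-m neuron (the function and data layout are illustrative, not from the paper):

```python
import math

def feed_forward_broadcast(outputs_m, weights, bias):
    """Model of the time-multiplexed broadcast bus between layer m and m+1.

    outputs_m : outputs of the neurons on layer m
    weights   : weights[k][j] = weight from neuron j (layer m) to neuron k (layer m+1)
    bias      : bias[k] for each neuron k on layer m+1
    """
    accum = list(bias)  # each layer m+1 neuron starts from its bias
    # One bus cycle per layer-m neuron: its output is broadcast to every
    # layer m+1 neuron, which accumulates weight * input locally.
    for j, o_j in enumerate(outputs_m):
        for k in range(len(accum)):
            accum[k] += weights[k][j] * o_j
    # Apply the sigmoid activation after all broadcasts complete.
    return [1.0 / (1.0 + math.exp(-net)) for net in accum]

# Example: 3 neurons on layer m feeding 2 neurons on layer m+1.
print(feed_forward_broadcast([0.1, 0.7, 0.3],
                             [[0.5, -0.2, 0.8], [0.1, 0.4, -0.6]],
                             [0.0, 0.0]))
```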

Run-time Reconfiguration Paper (8)
Implementation
–Xilinx XC3090 FPGAs
–Host PC
Comparison of space capacity
–Option 1: one hardware neuron per XC3090
–Option 2: four hardware neurons per XC3090
–Option 3: six hardware neurons per XC3090

Run-time Reconfiguration Paper (9)
Comparison of time efficiency
–Option 1: 0 ms reconfiguration time
–Option 2: 14 ms reconfiguration time per pass
–Option 3: 21 ms reconfiguration time per pass
Time / Space tradeoff
–When more hardware is needed than is physically available, the same space on an FPGA can be reused many times through reconfiguration, but doing so reduces the amount of time the FPGA spends executing

Run-time Reconfiguration Paper (10)
Functional Density Metric D
–Functional density is a composite area-time metric that captures the computational throughput (operations per second) per unit of hardware resources
–Area (A) is measured as the FPGA cell count of the circuit; operating time (T) is measured as the execution time of the system
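The formula for D is not spelled out in the transcript; assuming the usual reciprocal area-time definition of functional density, it would read:

```latex
% Functional density: operations per second per unit of hardware area.
% A = circuit area (FPGA cell count), T = execution time (including any
% reconfiguration time for RTR implementations).
D = \frac{1}{A \cdot T}
```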

RRANN2: Partial Reconfiguration (1)
Run-time reconfiguration (RTR) is an implementation approach that divides an application into a series of sequentially executed stages, with each stage implemented as a separate circuit module
Partial RTR extends the approach by partitioning these stages and designing their circuitry so that they exhibit a high degree of functional and physical commonality
By leaving the common circuitry resident, the transition between configurations can be accomplished by updating only the differences between configurations
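A toy sketch of the "update only the differences" idea, treating a configuration as a flat list of frames (the frame abstraction and names are illustrative and do not correspond to the actual CLAy or Xilinx configuration formats):

```python
def partial_update(current_frames, next_frames):
    """Return only the (address, data) pairs that differ between two
    configurations; common (static) frames are left untouched on the device.

    Both arguments are equal-length lists of per-frame configuration data.
    """
    assert len(current_frames) == len(next_frames)
    return [(addr, data)
            for addr, (old, data) in enumerate(zip(current_frames, next_frames))
            if old != data]

# Example: two stage configurations that share most of their circuitry.
stage_a = ["ctrl", "neuron", "ff_math", "routing", "ram_if"]
stage_b = ["ctrl", "neuron", "bp_math", "routing", "ram_if"]

diff = partial_update(stage_a, stage_b)
print(diff)                                    # [(2, 'bp_math')]
print(f"{len(diff)}/{len(stage_a)} frames reconfigured")
```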

RRANN2: Partial Reconfiguration (2)
Design goal
–To reach the break-even point with fewer neurons per layer
Advantages
–The reduced size of the reconfiguration bit-stream makes it faster to download
–Eliminating part of the routing and control circuitry increases hardware neuron density
Static versus Dynamic Circuitry

RRANN2: Partial Reconfiguration (3)
Fully static circuitry
–Combinational logic
–Storage devices (preserving both the configuration and the current value of the storage device)
Mostly static circuitry
–Precision: two devices differ only in their precision
–Constant value: two blocks differ by a constant value
–Function: two blocks perform logically different functions but their construction is almost identical
–Subsets: one block is structurally and functionally contained within the bounds of the other

RRANN2: Partial Reconfiguration (4)
Physical design issues
–Each common block should have the same physical implementation and occupy the same position on the device
–A common logic block is also constrained by the physical context of its surroundings, much of which might be unknown at design time
–Further constraints have to be placed on the design to group the static circuitry and ensure that the resulting bit-stream actually shrinks
–No good design-tool support

RRANN2: Partial Reconfiguration (5)
Implementation
–Step 1: The circuit modules are placed and routed by hand to physically map the schematics to the corresponding FPGA resources
–Step 2: The physical representation is converted into downloadable configuration bit-streams
Performance (CLAy31)
–Reconfiguration time: 600 µs
–Training performance: 4 times that of RRANN
–FPGA density: 50% more neurons per FPGA than RRANN

Research Issues (1)
Scheduling designs onto a time-multiplexed FPGA
–An algorithm is proposed to split an FPGA design into multiple configurations of a time-multiplexed FPGA
–ASAP (as soon as possible) scheduling
–ALAP (as late as possible) scheduling
–Optimize the schedule by identifying units not on the critical path and rescheduling their evaluation into other cycles (see the sketch below)
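A compact sketch of ASAP/ALAP levelization on a netlist modeled as a DAG; the slack (ALAP minus ASAP cycle) identifies units that are off the critical path and can be moved into other configuration cycles. The data structures and names are illustrative, not the published algorithm:

```python
# ASAP/ALAP levelization of a combinational netlist modeled as a DAG.
# preds: {unit: [predecessor units]}. Slack = ALAP - ASAP; units with
# slack > 0 are off the critical path and can be rescheduled.

def asap_levels(preds):
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = 0 if not preds[n] else 1 + max(level(p) for p in preds[n])
        return levels[n]
    for n in preds:
        level(n)
    return levels

def alap_levels(preds, depth):
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    levels = {}
    def level(n):
        if n not in levels:
            levels[n] = depth if not succs[n] else min(level(s) for s in succs[n]) - 1
        return levels[n]
    for n in preds:
        level(n)
    return levels

if __name__ == "__main__":
    # Tiny example netlist: a -> c -> d, then b and d feed e.
    # Unit b has slack and can be evaluated in a later configuration cycle.
    preds = {"a": [], "b": [], "c": ["a"], "d": ["c"], "e": ["b", "d"]}
    asap = asap_levels(preds)
    alap = alap_levels(preds, depth=max(asap.values()))
    for n in preds:
        print(f"{n}: ASAP cycle {asap[n]}, ALAP cycle {alap[n]}, slack {alap[n] - asap[n]}")
```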

Research Issues (2)
Wormhole run-time reconfiguration
–The means of altering the configuration have traditionally relied on global control strategies, which present a fundamental bottleneck to the potential bandwidth of configuration information flow
–Serial configuration: Xilinx 4000
–Random-access configuration: CLAy
–Wormhole run-time reconfiguration

Research Issues (3)
Interaction of pipelining and reconfigurable FPGAs
–An ideal virtualized FPGA would be capable of executing any hardware design, regardless of the size of that design. Its execution speed would be proportional to the physical capacity of the FPGA and inversely proportional to the size of the hardware design (see the relation below)
–Similar to DISC?
–Granularity of the swapping unit
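Stated as a proportionality (a restatement of the bullet above, not a formula from the source):

```latex
% Ideal virtualization: throughput scales with the physical capacity
% C_{phys} of the device and degrades with the size S_{design} of the
% virtual hardware design being executed.
\text{execution speed} \;\propto\; \frac{C_{\text{phys}}}{S_{\text{design}}}
```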