Runtime Temporal Partitioning Assembly to Reduce FPGA Reconfiguration Time Abelardo Jara-Berrocal, Ann Gordon-Ross HCS Research Laboratory College of Engineering.

Presentation transcript:

Runtime Temporal Partitioning Assembly to Reduce FPGA Reconfiguration Time Abelardo Jara-Berrocal, Ann Gordon-Ross HCS Research Laboratory College of Engineering University of Florida ReConFig'09 December 9-11, 2009, Cancun, Mexico

2 Accelerating Embedded Applications
Hardware accelerators offer 10x-1000x speed-ups over software implementations of the same algorithm
- Algorithms are implemented as digital circuits
- Circuits eliminate fetch and decode cycles while exploiting parallelism
FPGAs are commonly used to implement hardware accelerators
- However, FPGAs are not always big enough
Possible solution: Temporal partitioning
[Figure: a single FPGA device with a processor (off-chip processor also possible), shared memory, off-chip memory, battery, general-purpose I/O, and external I/O; the hardware accelerator does not fit in the available FPGA resources.]

3 Temporal Partitioning Problem
Problem definition: Divide the circuit into a pre-defined number of partitions satisfying a set of design constraints
- Hardware accelerator is decomposed into a set of hardware modules
- Hardware modules are grouped into partitions using static scheduling techniques
- Each partition's resources (slices, BRAMs) must not exceed the available resources
Temporal partitioning allows time-multiplexing of FPGA hardware resources among several partitions
- Modules within a temporal partition execute concurrently
- Intermediate data between partitions is transferred through shared memory and/or a system controller
[Figure: a hardware accelerator with modules M1-M7 (inputs to outputs) grouped into temporal partitions P1-P4, which are loaded one after another onto a single FPGA device by full reconfiguration over JTAG. The system also shows a processor (system controller), shared memory, FPGA bitstream storage memory, battery, and external I/O.]
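The resource constraint on each partition can be sketched as a simple feasibility check. This is a minimal illustration, not the authors' tool: the `Module` record, the per-module resource figures, the device limits, and the grouping of M1-M7 into P1-P4 are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    slices: int
    brams: int


def partition_fits(modules, max_slices, max_brams):
    """Return True if the summed resources of `modules` fit the device."""
    total_slices = sum(m.slices for m in modules)
    total_brams = sum(m.brams for m in modules)
    return total_slices <= max_slices and total_brams <= max_brams


# Hypothetical device limits and module sizes (illustration only).
MAX_SLICES, MAX_BRAMS = 10_000, 72
MODULES = {
    "M1": Module("M1", 4000, 20), "M2": Module("M2", 5000, 30),
    "M3": Module("M3", 6000, 10), "M4": Module("M4", 3000, 40),
    "M5": Module("M5", 2000, 16), "M6": Module("M6", 7000, 24),
    "M7": Module("M7", 9000, 60),
}
# Hypothetical grouping; the slide's actual grouping is not recoverable.
PARTITIONS = {"P1": ["M1", "M2"], "P2": ["M3", "M4"],
              "P3": ["M5", "M6"], "P4": ["M7"]}


def check_partitioning(partitions, modules, max_slices, max_brams):
    """Map each partition name to whether it satisfies the resource limits."""
    return {
        name: partition_fits([modules[m] for m in members],
                             max_slices, max_brams)
        for name, members in partitions.items()
    }


if __name__ == "__main__":
    print(check_partitioning(PARTITIONS, MODULES, MAX_SLICES, MAX_BRAMS))
    # The whole accelerator at once exceeds the device, which is what
    # motivates temporal partitioning in the first place.
    print(partition_fits(MODULES.values(), MAX_SLICES, MAX_BRAMS))
```

With these example numbers every partition fits on its own while the complete accelerator does not.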

4 Module Types and Module Reusability
Reconfiguration of a complete temporal partition is time consuming
- Full reconfiguration of a VLX25 (Virtex-4) FPGA takes close to 3 seconds
- Fortunately, temporal partitions can share modules of the same type
Module types
- Modules are classified based on description (functionality), throughput, and area (slices, BRAMs, DSP48s)
Module reusability
- Replace only modules of a different type between consecutive temporal partitions
Approach
- Leverage Virtex-4 and Virtex-5 partial reconfiguration
- Enables independent reconfiguration of a PRR (partially reconfigurable region)
- Logic outside the PRR continues execution without interruption
[Figure: modules M1-M7 in temporal partitions P1-P4, marked by module type.]
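The idea of replacing only modules whose type changes between consecutive partitions can be illustrated with a multiset difference. This sketch ignores where modules sit on the device (placement is treated on a later slide), and the type names are hypothetical.

```python
from collections import Counter


def modules_to_load(prev_types, next_types):
    """Module types (with multiplicity) needed by the next partition
    that are not already present from the previous partition."""
    # Counter subtraction keeps only positive counts.
    return Counter(next_types) - Counter(prev_types)


if __name__ == "__main__":
    # Hypothetical module types for two consecutive temporal partitions.
    p1 = ["fir", "fir", "fft"]
    p2 = ["fir", "fft", "fft", "cordic"]
    print(modules_to_load(p1, p2))
```

Here only one extra `fft` instance and one `cordic` instance would need to be configured; the `fir` and the first `fft` are candidates for reuse.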

5 Partial Reconfiguration
Hardware modules can span more than one adjacent PRR
- Smaller PRRs allow finer granularity when decomposing partitions into modules
PR allows replacement of modules of different types between consecutive temporal partitions
- If modules of the same type are kept in the same PRRs, no reconfiguration is needed
However, inter-module communication can be different
- A dynamic inter-module communication architecture is required
Problems:
- Placement of modules inside PRRs
- Orchestration of system operation
- Dynamic inter-module communication
[Figure: modules M1-M7 placed in the available adjacent PRRs (PRR1-PRR4) across four temporal partitions.]

6 Sample Application-Specific PR Architecture
1. System controller does not need to be placed in an external device
2. Access to the fast Internal Configuration Access Port (ICAP: 32 bits, 100 MHz)
3. Smaller partial bitstreams
4. No need to halt the complete system when reconfiguring a module
5. Time multiplexing of FPGA resources to load and unload HW modules on demand
This architecture is application-specific; can we design a general-purpose PR architecture?
VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
[Figure: base system configuration loaded over JTAG. The static area holds the controller (MicroBlaze), ICAP, and flash controller; in the reconfigurable area Modules A, B, and C are enabled or disabled on request (e.g., "Module A request"). FPGA bitstream storage, battery, and external I/O are also shown.]

7 VAPRES Base Architecture
Reconfigurable Streaming Blocks (RSBs)
- Leverage a reconfigurable stream-based processing chain between I/O modules
- HW modules can span one or more adjacent PRRs and operate at different clock frequencies
Scalable Communication Architecture for Reconfigurable Embedded Systems (SCORES)
- Linear or ring topology composed of switches
- Dynamic streaming communication between modules
System Control Region
- Orchestrates RSB operation and execution of temporal partitions
- Asynchronous FSL (Fast Simplex Link) interfaces between MicroBlaze and HW modules inside PRRs
- Partial bitstreams stored in external flash memory
[Figure: the System Control Region (MicroBlaze, PLB bus, ICAP, flash controller, UART, SDRAM, connections to external I/O pins) is linked through FSL interfaces to an RSB; one or more RSBs compose the Data Processing Region. The RSB contains PRR1-PRR4 with clocks clk0-clk3, connected through module interfaces and SCORES switches to an I/O module. Example processing chain: Filter 1, Filter 2, Filter 3.]

8 Runtime Assembly of Temporal Partitions
Definition: Modules composing a temporal partition are dynamically mapped to the VAPRES architecture for execution
- Modules are placed inside VAPRES PRRs through PR
- Dynamic inter-module communication through SCORES
Original hardware modules are encapsulated inside module wrappers
- Module wrappers provide communication with SCORES module interfaces
Temporal partition assembly time (t_assembly)
- N_{i->j} = number of switches between the i-th and j-th modules in the same temporal partition
- P_k = 3 = number of clock cycles for a SCORES switch to allocate an output link to a requesting input port
SCORES must provide enough resources to ensure successful temporal partition assembly
- Architectural parameters enable customized SCORES communication based on requirements across all temporal partitions
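The transcript defines the two quantities that enter t_assembly but does not contain the expression itself. The sketch below therefore uses an assumed model, not the authors' formula: every inter-module connection in a partition costs P_k cycles at each of the N_{i->j} switches on its path, and connections are set up one after another. The connection list and switch counts are hypothetical.

```python
P_K = 3  # cycles for one SCORES switch to allocate an output link (from the slide)


def assembly_cycles(switch_counts, p_k=P_K):
    """Assumed additive model: sum of N_{i->j} * P_k over all connections.

    `switch_counts` maps a (source, destination) module pair to the number
    of switches between them, i.e. N_{i->j}.
    """
    return sum(n * p_k for n in switch_counts.values())


if __name__ == "__main__":
    # Hypothetical connections inside one temporal partition.
    connections = {("M1", "M2"): 1, ("M2", "M3"): 2, ("M1", "M4"): 3}
    print(assembly_cycles(connections))  # (1 + 2 + 3) * 3 = 18 cycles
```

Under this assumption a small partition assembles in tens of clock cycles, which is the order of magnitude the conclusions slide reports for SCORES.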

9 Evaluating PRM Placement
Partially reconfigurable module (PRM) placement dictates the number of reconfigured PRRs during a temporal partition transition
Optimization problem
- Cost function (ReducedConfigurationCost): number of reconfigured VAPRES PRRs across all temporal partitions
- Formulation of a PlacementMatrix data structure for TotalCost calculation
Partial reconfiguration avoidance
- PRMs occupy the same PRR(s) in the immediately subsequent temporal partition
- Also applicable between two non-subsequent temporal partitions where intermediate partitions contain empty PRR(s)
[Figure: two example PlacementMatrix layouts (rows: temporal partitions; columns: VAPRES PRRs) with entries such as 1_1, 1_2, 1_3, 2_1, 3_1, 4_1, 4_2, 4_3, and 5_1, giving RedConfCost = 7 and RedConfCost = 6. A negative number (e.g., -2) indicates the number of empty PRRs located to the right of an occupied PRR.]
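The cost function can be sketched as a count of PRR loads over a placement matrix. The encoding here is a simplification and an assumption: each row is a temporal partition, each column a PRR, an entry is a label such as "1_1" (read here as module type 1, part 1) or `None` for an empty PRR, and an empty PRR is assumed to keep its previous configuration. The slide's negative-number encoding for empty PRRs is not reproduced, and the two layouts are hypothetical rather than the ones in the figure.

```python
def reconfig_cost(matrix):
    """Count PRR reconfigurations over all temporal partitions.

    A PRR is reconfigured when the label required in the current partition
    differs from the label most recently loaded into that PRR.
    """
    loaded = [None] * len(matrix[0])  # last label configured in each PRR
    cost = 0
    for row in matrix:
        for prr, label in enumerate(row):
            if label is not None and label != loaded[prr]:
                cost += 1
                loaded[prr] = label
    return cost


# Hypothetical layouts for 3 temporal partitions on 4 PRRs.
LAYOUT_A = [
    ["1_1", "1_2", "2_1", None],
    ["1_1", "1_2", None, "3_1"],
    ["4_1", None, "2_1", "3_1"],  # 2_1 returns to the PRR it left
]
LAYOUT_B = [
    ["1_1", "1_2", "2_1", None],
    ["1_1", "1_2", None, "3_1"],
    ["2_1", None, "4_1", "3_1"],  # 2_1 placed in a different PRR
]

if __name__ == "__main__":
    print(reconfig_cost(LAYOUT_A))  # 5
    print(reconfig_cost(LAYOUT_B))  # 6
```

In `LAYOUT_A` the module labelled 2_1 is reused across a non-subsequent partition because its PRR stayed empty in between, which saves one reconfiguration relative to `LAYOUT_B`.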

10 PRM Placement Optimization
Formulated a placement optimization algorithm based on simulated annealing
- Simulated annealing is commonly used in optimization problems
Placement perturbation function is defined as swapping the placement of two random modules in a given temporal partition
- The temporal partition is also randomly selected
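A minimal simulated-annealing sketch of this optimization follows. It is an illustration under stated assumptions, not the authors' implementation: every module occupies exactly one PRR, the perturbation swaps two random PRR slots (which may include an empty slot) in one randomly chosen partition, the cost is the PRR-load count with empty PRRs keeping their previous contents, and the cooling schedule parameters (`t_start`, `t_end`, `alpha`, `moves_per_temp`) are arbitrary choices.

```python
import math
import random


def reconfig_cost(matrix):
    """Number of PRR loads over all partitions (empty PRRs keep contents)."""
    loaded = [None] * len(matrix[0])
    cost = 0
    for row in matrix:
        for prr, label in enumerate(row):
            if label is not None and label != loaded[prr]:
                cost += 1
                loaded[prr] = label
    return cost


def perturb(matrix, rng):
    """Swap two random PRR slots inside one randomly selected partition."""
    new = [row[:] for row in matrix]
    row = rng.choice(new)
    i, j = rng.sample(range(len(row)), 2)
    row[i], row[j] = row[j], row[i]
    return new


def anneal(matrix, rng, t_start=2.0, t_end=0.01, alpha=0.95,
           moves_per_temp=50):
    """Return (best_matrix, best_cost) found by simulated annealing."""
    current, cur_cost = matrix, reconfig_cost(matrix)
    best, best_cost = current, cur_cost
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            cand = perturb(current, rng)
            cand_cost = reconfig_cost(cand)
            delta = cand_cost - cur_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, cur_cost = cand, cand_cost
                if cur_cost < best_cost:
                    best, best_cost = current, cur_cost
        temp *= alpha  # geometric cooling
    return best, best_cost


# Hypothetical starting placement: 3 partitions, 4 PRRs, module types 1-3.
START = [
    ["1", "2", "3", None],
    ["2", "1", None, "3"],
    ["3", None, "1", "2"],
]

if __name__ == "__main__":
    best, cost = anneal(START, random.Random(0))
    print(reconfig_cost(START), "->", cost)
    for row in best:
        print(row)
```

Supporting modules that span several adjacent PRRs would need a perturbation that keeps each module's PRRs contiguous, which this sketch does not attempt.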

11 Experimental Setup and Results
Benchmark generation using TGFF (Task Graphs for Free)
- Three sample applications (edges correspond to inter-module communication)
  - Small application: 20 modules, 30 edges
  - Medium application: 60 nodes, 120 edges
  - Large application: 60 nodes, 160 edges
- Number of different module types ranged from 1 to [value missing]
Results
- ReducedConfigurationCost (after placement optimization) vs. FullConfigurationCost
- FullConfigurationCost = number of VAPRES PRRs x number of temporal partitions
Reduction in reconfiguration cost increases as the number of PRRs increases or the number of module types decreases
[Figure: result plots for (a) small task graph, (b) medium task graph, and (c) large task graph; values annotated on the plots: 43.7%, 37.6%, 38.4%.]
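The comparison between the two costs reduces to simple arithmetic. In this sketch the percentage definition (savings relative to FullConfigurationCost) is an assumption about how the reduction is computed, and the numbers are hypothetical, not taken from the experiments.

```python
def full_configuration_cost(num_prrs, num_partitions):
    """FullConfigurationCost: every PRR is reconfigured in every partition."""
    return num_prrs * num_partitions


def reduction_percent(reduced_cost, full_cost):
    """Relative saving of the optimized placement, in percent (assumed definition)."""
    return 100.0 * (full_cost - reduced_cost) / full_cost


if __name__ == "__main__":
    full = full_configuration_cost(num_prrs=4, num_partitions=5)  # 20
    print(reduction_percent(reduced_cost=12, full_cost=full))     # 40.0
```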

12 Conclusions
Leverage partial reconfiguration to achieve performance improvements compared to full reconfiguration using temporal partitions
- Reconfiguration time reduction using runtime assembly of temporal partitions
- Full reconfiguration time must exceed the time required to assemble a temporal partition (t_assembly)
  - For SCORES, t_assembly is on the order of tens of clock cycles
  - Experimentally measured partial reconfiguration time through ICAP on a Virtex-4 FPGA: 10,277,796 clock cycles for a 16x10 CLB PRR
Contributions of this work
- Leverage concept for runtime assembly of temporal partitions to reduce configuration time in systems using temporal partitioning
  - Isolate inter-module communication from hardware processing
- Formulation of a methodology for runtime assembly of temporal partitions using the VAPRES architecture
- Formulation of a heuristic algorithm for placement of PRMs inside VAPRES PRRs
40% reduction in reconfiguration time (on average) compared to full reconfiguration
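To put the measured cycle count in the same units as the roughly 3-second full reconfiguration quoted earlier in the talk, the cycles can be converted to seconds. The 100 MHz clock is an assumption: it is the ICAP frequency given on the architecture slide, and the transcript does not state which clock the cycle count refers to.

```python
PARTIAL_RECONFIG_CYCLES = 10_277_796  # measured, 16x10 CLB PRR (from the slide)
CLOCK_HZ = 100e6                      # assumed: ICAP clock of 100 MHz
FULL_RECONFIG_SECONDS = 3.0           # approximate figure from the talk


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count to seconds at the given frequency."""
    return cycles / clock_hz


if __name__ == "__main__":
    partial_s = cycles_to_seconds(PARTIAL_RECONFIG_CYCLES)
    print(f"partial reconfiguration of one PRR: {partial_s:.4f} s")
    print(f"full / partial ratio: {FULL_RECONFIG_SECONDS / partial_s:.1f}")
```

Under this assumption one PRR reconfigures in about 0.10 s, so a transition between partitions is dominated by how many PRRs must be reloaded, while assembly through SCORES (tens of cycles) is negligible by comparison.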

13 Questions