Published by Kristian Stephens. Modified over 8 years ago.
1
Runtime Temporal Partitioning Assembly to Reduce FPGA Reconfiguration Time
Abelardo Jara-Berrocal, Ann Gordon-Ross
HCS Research Laboratory, College of Engineering, University of Florida
ReConFig'09, December 9-11, 2009, Cancun, Mexico
2
Accelerating Embedded Applications
Hardware accelerators offer 10x-1000x speed-ups over software implementations of the same algorithm.
  Algorithms are implemented as digital circuits.
  Circuits eliminate fetch and decode cycles while exploiting parallelism.
FPGAs are commonly used to implement hardware accelerators.
  However, FPGAs are not always big enough to hold the complete accelerator.
Possible solution: temporal partitioning.
[Figure: a single FPGA device whose available resources cannot hold the hardware accelerator ("Does not fit"). Labels: external I/O, general purpose I/O, processor (off-chip processor also possible), shared memory, off-chip memory, battery.]
3
Temporal Partitioning Problem
Problem definition: divide a circuit into a pre-defined number of partitions that satisfy a set of design constraints.
  The hardware accelerator is decomposed into a set of hardware modules.
  Hardware modules are grouped into partitions using static scheduling techniques.
  Each partition's resources (slices, BRAMs) must not exceed the available resources.
Temporal partitioning allows time-multiplexing of FPGA hardware resources among several partitions.
  Modules within a temporal partition execute concurrently.
  Intermediate data between partitions is transferred through shared memory and/or a system controller.
[Figure: a hardware accelerator composed of modules M1-M7, grouped into temporal partitions P1-P4 between inputs and outputs. The partitions are loaded one after another onto a single FPGA device by full reconfiguration over JTAG. Labels: external I/O, processor (system controller), shared memory, FPGA bitstream storage memory, battery, available FPGA resources.]
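The resource constraint on each partition can be sketched as a simple feasibility check. This is a minimal illustration, not the authors' scheduler: the `Resources` type, the function names, and every number in the example are hypothetical, and only slices and BRAMs are tracked because those are the two resources the slide names.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource record: slices and BRAMs only."""
    slices: int
    brams: int


def partition_fits(modules: Iterable[Resources], available: Resources) -> bool:
    """True if the modules of one temporal partition fit on the device together."""
    modules = list(modules)
    used_slices = sum(m.slices for m in modules)
    used_brams = sum(m.brams for m in modules)
    return used_slices <= available.slices and used_brams <= available.brams


def partitioning_is_feasible(partitions: Sequence[Sequence[Resources]],
                             available: Resources) -> bool:
    """Every partition must fit on its own, since partitions are time-multiplexed."""
    return all(partition_fits(p, available) for p in partitions)


# Illustrative numbers only (not taken from the presentation).
device = Resources(slices=1000, brams=8)
p1 = [Resources(400, 2), Resources(500, 4)]   # 900 slices, 6 BRAMs
p2 = [Resources(700, 3), Resources(400, 1)]   # 1100 slices, 4 BRAMs
print(partitioning_is_feasible([p1], device))
print(partitioning_is_feasible([p1, p2], device))
```

Because modules within a partition execute concurrently, their requirements add up. Requirements of different partitions do not add up, which is what makes time-multiplexing worthwhile.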
4
Module Types and Module Reusability
Reconfiguring a complete temporal partition is time consuming.
  Full reconfiguration of a VLX25 (Virtex-4) FPGA takes close to 3 seconds.
  Fortunately, temporal partitions can share modules of the same type.
Module types
  Modules are classified based on description (functionality), throughput, and area (slices, BRAMs, DSP48s).
Module reusability
  Replace only modules of a different type between consecutive temporal partitions.
Approach
  Leverage Virtex-4 and Virtex-5 partial reconfiguration.
  Enables independent reconfiguration of a PRR (partially reconfigurable region).
  Logic outside the PRR continues execution without interruption.
[Figure: modules M1-M7 grouped into temporal partitions P1-P4, each module annotated with its module type (type labels 1, 1, 4, 2, 3, 2, 5 as listed on the slide).]
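Module reusability can be illustrated by comparing the module types of two consecutive partitions as multisets. The sketch below is an assumption-laden illustration: the function name and the example type lists are hypothetical, and it ignores where each module sits on the device.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def types_to_load(previous_types: Iterable[Hashable],
                  next_types: Iterable[Hashable]) -> List[Hashable]:
    """Module types the next partition needs that the previous one cannot supply.

    Counter subtraction keeps only positive counts, so a type that appears
    twice in the next partition but once in the previous one is loaded once.
    """
    missing = Counter(next_types) - Counter(previous_types)
    return sorted(missing.elements())


# Hypothetical type assignments for two consecutive temporal partitions.
partition_a = [1, 1, 4]
partition_b = [1, 2, 3]
print(types_to_load(partition_a, partition_b))   # [2, 3]
```

Only two of the three modules in the second partition need to be loaded here, because one module of type 1 is already on the device. Reuse additionally requires that the shared module stays in the same PRR.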
5
Partial Reconfiguration
Hardware modules can span more than one adjacent PRR.
  Smaller PRRs allow finer granularity when decomposing partitions into modules.
PR allows replacement of modules of different types between consecutive temporal partitions.
  If modules of the same type are kept in the same PRRs, no reconfiguration is needed.
  However, inter-module communication can differ between partitions.
  A dynamic inter-module communication architecture is required.
Problems:
  Placement of modules inside PRRs
  Orchestration of system operation
  Dynamic inter-module communication
[Figure: modules M1-M7 mapped onto the available adjacent PRRs (PRR1-PRR4) across four temporal partitions.]
6
Sample Application-Specific PR Architecture
1. System controller does not need to be placed in an external device.
2. Access to the fast Internal Configuration Access Port (ICAP: 32 bits, 100 MHz).
3. Smaller partial bitstreams.
4. No need to halt the complete system when reconfiguring a module.
5. Time multiplexing of FPGA resources to load and unload HW modules on demand.
This architecture is application-specific. Can we design a general-purpose PR architecture?
VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
[Figure: base system configuration loaded over JTAG. Static area with controller (MicroBlaze), ICAP, and flash controller; reconfigurable area in which Modules A, B, and C are enabled or disabled on request; FPGA bitstream storage, external I/O, battery.]
7
VAPRES Base Architecture
Reconfigurable Streaming Blocks (RSBs)
  Leverage a reconfigurable stream-based processing chain between I/O modules.
  HW modules can span one or more adjacent PRRs and operate at different clock frequencies.
Scalable Communication Architecture for Reconfigurable Embedded Systems (SCORES)
  Linear or ring topology composed of switches.
  Dynamic streaming communication between modules.
System Control Region
  Orchestrates RSB operation and execution of temporal partitions.
  Asynchronous FSL (Fast Simplex Link) interfaces between the MicroBlaze and HW modules inside PRRs.
  Partial bitstreams are stored in external flash memory.
[Figure: VAPRES block diagram. System Control Region: MicroBlaze on a PLB bus with ICAP, flash controller, UART, and SDRAM, connected to external I/O pins. One RSB (one or more RSBs compose the Data Processing Region): PRR1-PRR4 and an I/O module, each attached through a module interface to a SCORES switch, with FSL interfaces to the MicroBlaze and clocks clk0-clk3. Example modules: Filter 1, Filter 2, Filter 3.]
8
Runtime Assembly of Temporal Partitions
Definition: modules composing a temporal partition are dynamically mapped to the VAPRES architecture for execution.
  Modules are placed inside VAPRES PRRs through PR.
  Dynamic inter-module communication is established through SCORES.
Original hardware modules are encapsulated inside module wrappers.
  Module wrappers communicate through the SCORES module interfaces.
Temporal partition assembly time (t_assembly)
  N_i->j = number of switches between the i-th and j-th modules in the same temporal partition.
  P_k = 3 = number of clock cycles for a SCORES switch to allocate an output link to a requesting input port.
SCORES must provide enough resources to ensure successful temporal partition assembly.
  Architectural parameters enable customized SCORES communication based on requirements across all temporal partitions.
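The slide's expression for t_assembly is not preserved in this transcript, so the following is only a rough sketch built from the two quantities it defines. It assumes, purely for illustration, that links are set up one after another and that every switch on a path costs P_k cycles. The function name and the example switch counts are hypothetical, and the authors' actual formula may differ.

```python
from typing import Dict, Tuple

CYCLES_PER_SWITCH = 3  # P_k, as stated on the slide


def serial_assembly_estimate(switch_counts: Dict[Tuple[str, str], int],
                             cycles_per_switch: int = CYCLES_PER_SWITCH) -> int:
    """Estimate assembly time in clock cycles under a serial-setup assumption.

    switch_counts maps a communicating module pair (i, j) to N_i->j, the
    number of switches between the two modules in the same partition.
    """
    return sum(n * cycles_per_switch for n in switch_counts.values())


# Hypothetical partition with three communicating module pairs.
links = {("M1", "M2"): 1, ("M2", "M3"): 2, ("M1", "M3"): 3}
print(serial_assembly_estimate(links))   # 18 clock cycles
```

Under these assumptions the estimate stays in the tens of clock cycles for a handful of links, which is far below the cost of reloading a PRR.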
9
Evaluating PRM Placement
Partially reconfigurable module (PRM) placement dictates the number of PRRs reconfigured during a temporal partition transition.
Optimization problem
  Cost function (ReducedConfigurationCost): number of VAPRES PRRs reconfigured across all temporal partitions.
  Formulation of the PlacementMatrix data structure for TotalCost calculation.
Partial reconfiguration avoidance
  PRMs occupy the same PRR(s) in the immediately subsequent temporal partition.
  Also applicable between two non-subsequent temporal partitions when the intermediate partitions leave those PRR(s) empty.
[Figure: two PlacementMatrix examples (rows: temporal partitions, columns: VAPRES PRRs), one with RedConfCost = 7 and one with RedConfCost = 6. Entries such as 1_1, 1_2, 1_3, 2_1, 3_1, 4_1, 4_2, 4_3, and 5_1 mark occupied PRRs; a negative number indicates the number of empty PRRs located to the right of an occupied PRR; increments such as +1, +2, and +3 are annotated beside the rows.]
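The cost function can be sketched as a walk over a placement table, counting a PRR as reconfigured only when its content has to change. This is a minimal illustration under stated assumptions: the table layout (one row per temporal partition, one column per PRR, `None` for an empty PRR) and the string labels are hypothetical and simpler than the PlacementMatrix encoding shown on the slide.

```python
from typing import List, Optional, Sequence

Slot = Optional[str]  # label of the PRM part in a PRR, or None if the PRR is empty


def reduced_configuration_cost(placement: Sequence[Sequence[Slot]]) -> int:
    """Count PRR reconfigurations over all temporal partitions.

    A PRR left empty keeps whatever was loaded before, so a module that
    returns to the same PRR after an empty partition costs nothing.
    """
    if not placement:
        return 0
    num_prrs = len(placement[0])
    loaded: List[Slot] = [None] * num_prrs
    cost = 0
    for row in placement:
        if len(row) != num_prrs:
            raise ValueError("every temporal partition must list every PRR")
        for prr, slot in enumerate(row):
            if slot is not None and loaded[prr] != slot:
                cost += 1            # this PRR must be reconfigured
                loaded[prr] = slot
    return cost


# Hypothetical placement: 3 temporal partitions on 4 PRRs.
# "A.0"/"A.1" stand for the two halves of a module spanning two PRRs.
example = [
    ["A.0", "A.1", "B",  None],
    ["A.0", "A.1", None, "C"],
    ["D",   None,  "B",  "C"],
]
print(reduced_configuration_cost(example))   # 5
print(len(example[0]) * len(example))        # 12 if every PRR were always reloaded
```

In this example module B returns to the third PRR after a partition in which that PRR was empty, so it is not reloaded. That is the non-subsequent avoidance case described above.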
10
PRM Placement Optimization
Formulated a placement optimization algorithm based on simulated annealing.
  Simulated annealing is commonly used in optimization problems.
  The placement perturbation function swaps the placement of two random modules within a given temporal partition.
  The temporal partition is also selected at random.
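A simulated-annealing loop with this swap perturbation might look like the sketch below. It is not the authors' implementation. The cooling schedule, step count, and function names are assumptions, each module is assumed to occupy exactly one PRR, and the swap picks two PRR positions, so a module may also move into an empty PRR.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Slot = Optional[str]


def cost(placement: Sequence[Sequence[Slot]]) -> int:
    """Number of PRR reconfigurations; empty PRRs keep their old content."""
    loaded: List[Slot] = [None] * len(placement[0])
    total = 0
    for row in placement:
        for prr, slot in enumerate(row):
            if slot is not None and loaded[prr] != slot:
                total += 1
                loaded[prr] = slot
    return total


def anneal(placement: Sequence[Sequence[Slot]], steps: int = 5000,
           t_start: float = 2.0, alpha: float = 0.999,
           seed: int = 0) -> Tuple[List[List[Slot]], int]:
    """Return the best placement found and its cost (assumed parameters)."""
    rng = random.Random(seed)
    current = [list(row) for row in placement]
    current_cost = cost(current)
    best, best_cost = [list(r) for r in current], current_cost
    temperature = t_start
    for _ in range(steps):
        p = rng.randrange(len(current))                  # random temporal partition
        a, b = rng.sample(range(len(current[p])), 2)     # two distinct PRR positions
        current[p][a], current[p][b] = current[p][b], current[p][a]
        new_cost = cost(current)
        delta = new_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = [list(r) for r in current], new_cost
        else:
            current[p][a], current[p][b] = current[p][b], current[p][a]  # undo swap
        temperature = max(temperature * alpha, 1e-9)     # geometric cooling
    return best, best_cost


# Hypothetical starting placement: 3 temporal partitions on 4 PRRs.
initial = [
    ["A", "B",  "C",  None],
    ["B", "A",  None, "C"],
    ["C", None, "A",  "B"],
]
best, best_cost = anneal(initial)
print(cost(initial), "->", best_cost)
```

Swapping only within one partition keeps each partition's set of modules unchanged, so the search explores placements without altering the schedule.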
11
Experimental Setup and Results
Benchmark generation using TGFF (Task Graphs for Free)
Three sample applications (edges correspond to inter-module communication)
  Small application: 20 modules, 30 edges
  Medium application: 60 nodes, 120 edges
  Large application: 60 nodes, 160 edges
Number of different module types ranged from 1 to 20.
Results
  ReducedConfigurationCost (after placement optimization) vs. FullConfigurationCost
  FullConfigurationCost = number of VAPRES PRRs x number of temporal partitions
  The reduction in reconfiguration cost increases as the number of PRRs increases or the number of module types decreases.
[Figure: reconfiguration cost results for (a) small task graph, (b) medium task graph, and (c) large task graph, annotated with 43.7%, 37.6%, and 38.4%.]
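The comparison between the two costs is simple arithmetic, sketched below. The function names are hypothetical. The last lines assume that the three percentages on the slide are the per-benchmark reductions, which is consistent with the roughly 40% average quoted in the conclusions.

```python
def full_configuration_cost(num_prrs: int, num_partitions: int) -> int:
    """Cost when every PRR is reloaded for every temporal partition."""
    return num_prrs * num_partitions


def reduction_percent(reduced_cost: int, full_cost: int) -> float:
    """Relative saving of the optimized placement over full reconfiguration."""
    return 100.0 * (full_cost - reduced_cost) / full_cost


# Illustrative numbers only: 4 PRRs, 3 temporal partitions, 5 PRR reloads needed.
full = full_configuration_cost(4, 3)
print(full, round(reduction_percent(5, full), 1))

# Percentages as annotated on the results slide.
reported = [43.7, 37.6, 38.4]
average = sum(reported) / len(reported)
print(round(average, 1))   # 39.9
```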
12
Conclusions
Leverage partial reconfiguration to achieve performance improvements compared to full reconfiguration using temporal partitions.
Reconfiguration time reduction using runtime assembly of temporal partitions
  Full reconfiguration time must exceed the time required to assemble a temporal partition (t_assembly).
  For SCORES, t_assembly is on the order of tens of clock cycles.
  Experimentally measured partial reconfiguration time through ICAP on a Virtex-4 FPGA: 10,277,796 clock cycles for a 16x10 CLB PRR.
Contributions of this work
  Leverage the concept of runtime assembly of temporal partitions to reduce configuration time in systems using temporal partitioning.
  Isolate inter-module communication from hardware processing.
  Formulation of a methodology for runtime assembly of temporal partitions using the VAPRES architecture.
  Formulation of a heuristic algorithm for placement of PRMs inside VAPRES PRRs.
  40% reduction in reconfiguration time (on average) as compared to full reconfiguration.
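To put the measured cycle count in perspective, it can be converted to seconds. The clock frequency behind the measurement is not stated, so the 100 MHz used below is an assumption borrowed from the ICAP figure quoted earlier in the talk. The 3-second figure is the approximate full reconfiguration time given for the Virtex-4 device, so the ratio is only a rough indication.

```python
PARTIAL_RECONFIG_CYCLES = 10_277_796   # measured, 16x10 CLB PRR (from the slide)
ASSUMED_CLOCK_HZ = 100e6               # assumption: 100 MHz clock
FULL_RECONFIG_SECONDS = 3.0            # "close to 3 seconds" (from the slide)


def cycles_to_seconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to wall-clock time at the given frequency."""
    return cycles / clock_hz


partial_seconds = cycles_to_seconds(PARTIAL_RECONFIG_CYCLES, ASSUMED_CLOCK_HZ)
ratio = FULL_RECONFIG_SECONDS / partial_seconds
print(f"{partial_seconds:.3f} s per PRR, about {ratio:.0f}x below full reconfiguration")
```

Under this assumption a single PRR reload takes roughly a tenth of a second, while t_assembly at tens of clock cycles is negligible next to either figure.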
13
Questions