
1 DAPR: Design Automation for Partially Reconfigurable FPGAs Shaon Yousuf Ph.D. Student NSF CHREC Center, University of Florida Dr. Ann Gordon-Ross Associate Professor of ECE NSF CHREC Center, University of Florida

2 Dynamic Reconfiguration. Dynamic reconfiguration can be beneficial to system designers: it allows run-time hardware adaptation and enables time-multiplexing of FPGA resources. There are two types of dynamic reconfiguration: full reconfiguration (FR) and partial reconfiguration (PR). PR isolates reconfiguration to a portion of the FPGA fabric. Benefits: hardware time-multiplexing and flexibility (designs are loaded on demand on the same fabric) and power savings. [Figure: full reconfiguration example. Full bitstreams for Designs A, B, and C are stored in external memory; a configuration controller loads each design onto the FPGA fabric when it is required, and execution is stalled during reconfiguration.]

3 Partial Reconfiguration (PR). PR divides the FPGA into two types of regions: the static design maps to the static region, and partially reconfigurable modules map to partially reconfigurable regions (PRRs). Regions communicate using partition pins (Parpins). PR enhances the benefits of FR: partial bitstreams are significantly smaller than full bitstreams, which reduces memory requirements, power requirements, and reconfiguration time. PR also increases flexibility, since design functionality is changed by simply loading a new partial bitstream. [Figure: example FPGA fabric with 2 PRRs. The static region holds the static design (central controlling agent, ICAP, memory controller); PRR 1 and PRR 2 hold PR modules (Modules A, B, C, and D). Labels: full bitstream (static design); partial bitstreams (Modules A and B); partial bitstreams (Modules C and D).]

4 PR Challenges and Motivations. PR system design is constantly evolving: it is currently supported only by Xilinx, and Altera support has been announced. The PR system design flow is complex, and PR system performance may suffer if the system design is not carefully considered. Designers must partition the static region and PRRs in the hardware description language (HDL), set PRR and Parpin placement constraints (floorplanning), and analyze the timing results of the floorplanned constraints to ensure acceptable design performance. PR benefits outweigh these complexities for applications such as software-defined radios and image processing. [Figure: Xilinx PR implementation flow, with the mandatory steps for PR marked. Boxes: HDL design description, HDL synthesis, set design constraints, implement static design and PR modules, timing/placement analysis, merge, final generated bitstreams.]

5 Contribution. Currently there is insufficient PR system design support. HDL design partitioning is not straightforward and requires finding a balance between required performance and flexibility. PRR placement constraints must be set and evaluated manually, and the design space is large because no formal process exists for determining optimal PRR and partition pin (Parpin) placement constraints during floorplanning. We present the DAPR (Design Automation for Partial Reconfiguration) design flow, which automatically generates design floorplans (candidate designs) and so eliminates manual design space exploration. It outputs the highest clock frequency design found, along with a Pareto optimal set of PR designs whose design points trade off clock frequency and partial bitstream size. The DAPR design flow can significantly reduce PR design time, making PR design more accessible and amenable to system designers.

6 DAPR Design Flow. The DAPR design flow performs automated design space exploration. It outputs the best found clock frequency design and a Pareto optimal set of design points that trade off clock frequency and partial bitstream size. DAPR tool phases: Phase 1, information identification, extraction, and collection; Phase 2, candidate generation; Phase 3, bitstream generation; Phase 4, design evaluation. [Figure: the PR flow shown beside the DAPR design flow, with steps marked as manual or automated. Both run from an HDL design description to the final generated bitstreams through HDL synthesis, set design constraints, timing/placement analysis, implement base design, implement PR modules, and merge. In the DAPR design flow, system designer annotations produce a modified HDL design description that is passed to the DAPR tool.]

7 DAPR Design Flow Phases. The DAPR tool starts from the initial input: the VHDL top file, modified with designer annotations (modified VHDL top file). Phase 1 (information identification, extraction, and collection): identify the static region, PRRs, and design file names; the DAPR tool generates the directory structure and the PR automation information file (.prai). Phase 2 (candidate generation): synthesize modules and estimate resource requirements, then perform automated floorplanning; the flowchart associates the device information library file (.dil), the user constraints file (.ucf), and the candidate designs with this phase. Phase 3 (bitstream generation): implement and merge the design. Phase 4 (design evaluation): check the design against the design constraints file (.dcf); if the design constraints are met, or if no iterations are left, output the best found PR design's full and partial bitstreams; otherwise return to Phase 2. DAPR tool file descriptions: the PR automation file (.prai) contains port map connection information; the device information library file (.dil) contains target device hardware resource information; the design constraints file (.dcf) contains device-specific design constraints. [Figure: example design constraints file, PR automation file, and Virtex-4 LX25 device information library file.]
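The iteration across Phases 2 to 4 can be sketched as a short loop. This is only an illustrative outline, not the DAPR tool's code: `dapr_explore` and its three callables are hypothetical names, the tool itself is described as running on Perl and Xilinx ISE, and the sketch assumes that a failed implementation still consumes one iteration.

```python
from typing import Any, Callable, Optional, Tuple


def dapr_explore(
    generate_candidate: Callable[[], Any],          # Phase 2 stand-in (hypothetical)
    implement: Callable[[Any], Optional[float]],    # Phase 3 stand-in: MHz, or None on failure
    meets_constraints: Callable[[float], bool],     # Phase 4 stand-in (hypothetical)
    max_iterations: int,
) -> Optional[Tuple[Any, float]]:
    """Return the best (floorplan, clock_mhz) pair found, or None if every run failed."""
    best = None
    for _ in range(max_iterations):
        floorplan = generate_candidate()
        clock_mhz = implement(floorplan)
        if clock_mhz is None:
            continue  # e.g. a place-and-route error: try another candidate
        if best is None or clock_mhz > best[1]:
            best = (floorplan, clock_mhz)
        if meets_constraints(clock_mhz):
            break  # design constraints met: stop early
    return best
```

Passing the phases in as callables keeps the loop independent of how candidates are generated or evaluated, which is the part the slides leave to the vendor tools.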

8 Candidate Floorplan Generation. PRR placements and Parpin placements leverage a simulated annealing (SA) algorithm. The SA algorithm trims design space exploration time by initially using random placements; it then leverages the random placement solution values and works towards optimal placements. SA algorithm for placement: (1) place PRRs randomly across the fabric to find a good initial placement, using a cost function based on PRR aspect ratio and communication distance between PRRs; (2) evaluate the design, using Xilinx utilities to evaluate partial bitstream size and clock frequency; (3) calculate the initial temperature using the SA equation; (4) vary the PRR initial placement size (vary the distance between PRRs and the PRR height); (5) change the PRR placement location to explore better solutions, which depends on a certain number of uphill moves, downhill moves, or total number of moves. [Flowchart: after initialization, the loop runs while Icurr < Imax. While Icurr < Irand, it performs randomized PRR and Parpin placement, evaluates the solution, and stores it if accepted. When Icurr == Irand, it loads the stored solution and calculates the initial temperature. Afterwards it creates a new solution, evaluates it, updates the stored solution if the new solution is accepted, and updates the temperature.] Icurr = current running iteration number, Imax = maximum iteration bound, Irand = random number.
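A minimal sketch of the two-stage search described on this slide (random placements first, then simulated annealing) might look as follows. Everything concrete here is an assumption made for illustration: the 40x40 CLB grid, the fixed PRR sizes, the overlap penalty, the geometric cooling factor `alpha`, and the choice of initial temperature. The real flow evaluates candidates with Xilinx utilities; the `cost` function below is only a stand-in built from the two terms the slide names (aspect ratio and inter-PRR distance).

```python
import math
import random

GRID_W, GRID_H = 40, 40  # assumed fabric size in CLBs (illustrative only)


def random_prr(rng, max_side=8):
    """Random PRR rectangle (x, y, w, h) that fits inside the grid."""
    w, h = rng.randint(2, max_side), rng.randint(2, max_side)
    return (rng.randint(0, GRID_W - w), rng.randint(0, GRID_H - h), w, h)


def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def cost(prrs):
    """Stand-in cost: aspect-ratio term + center-to-center distance + overlap penalty."""
    total = sum(abs(math.log(w / h)) for _, _, w, h in prrs)
    for i in range(len(prrs)):
        for j in range(i + 1, len(prrs)):
            a, b = prrs[i], prrs[j]
            total += abs((a[0] + a[2] / 2) - (b[0] + b[2] / 2))
            total += abs((a[1] + a[3] / 2) - (b[1] + b[3] / 2))
            if overlaps(a, b):
                total += 1000.0  # assumed penalty for an illegal floorplan
    return total


def perturb(prrs, rng):
    """Move one randomly chosen PRR by a few CLBs, keeping it on the grid."""
    new = list(prrs)
    i = rng.randrange(len(new))
    x, y, w, h = new[i]
    x = min(max(x + rng.randint(-3, 3), 0), GRID_W - w)
    y = min(max(y + rng.randint(-3, 3), 0), GRID_H - h)
    new[i] = (x, y, w, h)
    return new


def anneal(n_prrs=2, i_rand=20, i_max=500, alpha=0.95, seed=0):
    rng = random.Random(seed)
    # Stage 1: purely random placements for the first i_rand iterations.
    best = min(([random_prr(rng) for _ in range(n_prrs)] for _ in range(i_rand)), key=cost)
    # Stage 2: anneal from the stored solution; the initial temperature is an assumption.
    current, temp = best, max(cost(best), 1.0)
    for _ in range(i_rand, i_max):
        candidate = perturb(current, rng)
        delta = cost(candidate) - cost(current)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temp = max(temp * alpha, 1e-6)
    return best, cost(best)
```

Calling `anneal(seed=1)` returns the best floorplan found and its cost; swapping `cost` for a function that runs the vendor tools would bring the sketch closer to the flow on the slide.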

9 Experimental Setup. Software (Linux environment): Perl 5.12.1, Dot language interpreter, Xilinx ISE 13.4. Synthesis options: optimization goal Speed, optimization effort Normal. Hardware: Intel® Core™ 2 Duo E6750 2.66 GHz CPU with 3.24 GB of RAM; Xilinx Virtex-5 XUPV5-LX110T FPGA.

10 Results – DAPR Simulated Annealing Placement. A PRR counter design was used for evaluation. The optimal clock frequency was found using an exhaustive search (ES) algorithm. The results report the percentage of the design space that the simulated annealing (SA) algorithm explored before finding that optimal clock frequency. SA was also compared with a random exploration (RE) algorithm. The DAPR SA algorithm improves with increased PRR size and outperforms RE for large PRR sizes. [Plots: PRR size 4x6 CLBs (SA), 5x6 CLBs (SA), 6x6 CLBs (SA), 7x6 CLBs (SA), and 7x6 CLBs (RE). CLBs = configurable logic blocks.]

11 Results – 1K-point FFT. The solution improves with successful iterations, where successful iterations are those that complete without place and route errors.

12 Results – Pareto Optimal Set. Only 3% of the explored design space is interesting, meaning that these design points trade off clock frequency and partial bitstream size.
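To make the notion of a Pareto optimal set concrete, here is a small sketch that filters candidate designs down to the non-dominated ones, where higher clock frequency and smaller partial bitstream size are both better. The function name and the sample numbers are invented for illustration and are not results from the presentation.

```python
def pareto_front(designs):
    """Keep designs not dominated by any other.

    Each design is (clock_mhz, bitstream_bytes). Design j dominates design i
    when j is at least as fast and at least as small, and strictly better in one.
    """
    front = []
    for i, (f_i, s_i) in enumerate(designs):
        dominated = any(
            f_j >= f_i and s_j <= s_i and (f_j > f_i or s_j < s_i)
            for j, (f_j, s_j) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((f_i, s_i))
    return front


# Made-up candidate designs: (clock frequency in MHz, partial bitstream size in KB)
candidates = [(100, 50), (120, 80), (90, 60), (120, 70), (110, 50)]
front = pareto_front(candidates)  # [(120, 70), (110, 50)]
```

Only two of the five made-up candidates survive: every other point is matched or beaten on both axes by one of them, which mirrors why a small fraction of an explored design space ends up being interesting.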

13 Results – Additional PR Designs. The growth rate quickly levels off to within 2.3% of the highest achievable clock frequency within an average of 10 successful iterations.

14 Evolution of DAPR to DAPR++ Tool Suite. The DAPR++ tool suite aids designing RC systems using automation. Tools in the suite: PR Architecture Generator, Network Generator, PR Task Manager, Throughput Profiler, Bitstream Manager, and PRR Floorplanner. Tool functions listed on the slide: creates the master and slave FPGA component layout tree; creates FPGA VHDL black boxes for all components; automatically generates the target device resource mapping; heuristically floorplans PRRs and partition pins; allows PRR partial bitstream manipulation in FPGA memory; creates network protocols for master and slave FPGAs; creates PR task reconfiguration schedules to reduce reconfiguration time; records data packet transfer rates between master and slave FPGAs. [Figure: a master FPGA with a GPP, a switch, and two slave FPGAs (slave FPGA 1 and slave FPGA 2), each containing PRRs.]

15 PR System Design Automation with DAPR++ Tool Suite. DAPR++ architecture generation leverages a master-slave configuration. The master FPGA is used for centralized control: it contains one or more ARM-compatible 32-bit RISC processors (Amber processors) for application control, loads and unloads hardware tasks into the slave FPGAs, and transfers data to and receives data from the slave FPGAs via a Wishbone-interface-compatible system bus. The slave FPGAs are used for hardware acceleration: they leverage PRML output for appropriate hardware architecture generation (number of PRRs, PRR interfaces, and PRR size), their PRRs are loaded with the required application task functionality, and they leverage an on-chip network for inter-PRR communication.

16 PR System Design Automation with DAPR++ Tool Suite. ARM-compatible 32-bit RISC processor, with two available cores. Amber 23: 3-stage pipeline, a unified instruction and data cache, Wishbone interface, and 0.75 Dhrystone MIPS per MHz. Amber 25: 5-stage pipeline, separate data and instruction caches, Wishbone interface, and 1.0 Dhrystone MIPS per MHz. Both cores run the Linux 2.4 kernel. Amber 25 performs 30% to 40% better but is also 30% to 40% larger. [Figures: Amber 25 pipeline; Amber 23 pipeline.]

17 PR System Design Automation with DAPR++ Tool Suite. Amber core initial synthesis results on the Virtex-5 LX110T. Table columns: core type and cache configuration, slices, RAMB16, DSPs, clock frequency. Table rows: A23 32KB 323622233MHz; A25 32KB 380024250MHz; Wishbone Interface 25600350MHz; Ethernet Core 120000180MHz; Overall System with one core with A25 855646133MHz.

18 PR System Design Automation with DAPR++ Tool Suite. Master FPGA system overview. [Block diagram components: Wishbone arbiter; Amber 25 processor core 0 and core 1; primary interrupt controller (irq, firq); boot loader with 8k embedded SRAM that contains the boot loader code; Ethernet MAC with Ethernet interface; statically configurable simple UART with UART interface; Wishbone to Xilinx Virtex-5 SRAM controller bridge; Xilinx Virtex-5 SRAM controller with SRAM interface.]

19 PR System Design Automation with DAPR++ Tool Suite

Wishbone Interface Extended Signal List
Name | Source | Assertion description
ACK_I | Master | Indicates the normal termination of a bus cycle
ADR_O() | Master | Used to pass a binary address
CYC_O | Master | Indicates that a valid bus cycle is in progress
ERR_I | Master | Indicates an abnormal cycle termination
LOCK_O | Master | Indicates that the current bus cycle is uninterruptible
RTY_I | Master | Indicates that the interface is not ready to accept or send data
SEL_O() | Master | Indicates where valid data is expected
STB_O | Master | Indicates a valid data transfer cycle
TGA_O() | Master | Contains information associated with address lines
TGC_O() | Master | Contains information associated with bus cycles
WE_O | Master | Indicates whether the current local bus cycle is a READ or WRITE cycle
ACK_O() | Slave | Indicates the termination of a normal bus cycle
ADR_I() | Slave | Used to pass a binary address
CYC_I | Slave | Indicates that a valid bus cycle is in progress
ERR_O | Slave | Indicates an abnormal cycle termination
LOCK_I | Slave | Indicates that the current bus cycle is uninterruptible
RTY_O | Slave | Indicates that the interface is not ready to accept or send data
SEL_I() | Slave | Indicates where valid data is placed on the data bus during a write
STB_I | Slave | Indicates that the SLAVE is selected
TGA_I | Slave | Contains information associated with address lines
TGC_I() | Slave | Contains information associated with bus cycles
WE_I | Slave | Indicates whether the current local bus cycle is a READ or WRITE cycle

PRR Interface Signal List
Signal name and size | Signal type relative to data-flow controller | Function
p_consumerfsl_rdy (1 bit) | In | Indicates valid input data in consumer FSL
p_producerfsl_rdy (1 bit) | In | Indicates producer FSL is ready for data
rfd (1 bit) | In | Indicates PRM is ready for data
done (1 bit) | In | Indicates PRM will produce valid output data in the next clock cycle
dv (1 bit) | In | Indicates PRM has produced valid output data
input_data (32 bit) | In | Input data signal to PRM
P_producerfsl_data (32 bit) | In | Input data signal from producer FSL
p_consumerfsl_en (1 bit) | Out | Allows reading data from consumer FSL
ce (1 bit) | Out | Halts PRM in current state (overrides start signal)
start (1 bit) | Out | Starts PRM when asserted
p_producerfsl_en (1 bit) | Out | Allows writing data to producer FSL
output_data (32 bit) | Out | Output data signal from PRM
p_consumerfsl_data (32 bit) | Out | Output data signal to consumer FSL
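As a rough illustration of how the slave-side Wishbone signals in the first table interact, the sketch below models a toy memory that responds to a single classic bus cycle. It is a behavioural model written for this note, not part of the DAPR++ tools: the class name is hypothetical, the data signals (DAT_I, DAT_O) come from the Wishbone specification rather than from the table above, and LOCK, RTY, SEL, and the tag signals are omitted for brevity.

```python
class WishboneSlaveMemory:
    """Toy word-addressed memory behind a Wishbone slave interface (hypothetical model)."""

    def __init__(self, words):
        self.mem = [0] * words

    def cycle(self, cyc_i, stb_i, we_i, adr_i, dat_i=0):
        """Evaluate one bus cycle and return (ack_o, err_o, dat_o)."""
        if not (cyc_i and stb_i):
            return (0, 0, 0)            # no valid cycle or slave not selected: stay idle
        if not 0 <= adr_i < len(self.mem):
            return (0, 1, 0)            # ERR_O: abnormal cycle termination
        if we_i:
            self.mem[adr_i] = dat_i     # WE_I high: WRITE cycle
            return (1, 0, 0)            # ACK_O: normal termination
        return (1, 0, self.mem[adr_i])  # WE_I low: READ cycle


slave = WishboneSlaveMemory(words=16)
slave.cycle(cyc_i=1, stb_i=1, we_i=1, adr_i=3, dat_i=0xCAFE)     # write one word
ack, err, data = slave.cycle(cyc_i=1, stb_i=1, we_i=0, adr_i=3)  # read it back
```

The model collapses the handshake into one function call per cycle; a hardware implementation would register these signals against the bus clock.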

20 Task B: PR System Design Automation with DAPR++ Tool Suite. PR Task Manager (PRTM): utilizes hardware reuse (reconfiguration overlapping) and configuration prefetching. The PRTM was tested on a JPEG codec architecture. [Simplified PR task scheduler flowchart. Inputs: modularized C/C++ application, application task flow graph (TFG), slave architecture PRR information. Steps: map tasks to PRRs according to PRR size, check for task hardware reusability, create the task pre-configuration schedule, create the application PR task schedule, execute the application.]
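The effect of the two techniques named on this slide can be illustrated with a simple timing model. This is a sketch under stated assumptions, not the PRTM scheduler: it assumes a linear task sequence, a fixed reconfiguration time per module, least-recently-used eviction, and that a prefetch can overlap the previous task's execution only when it targets a different PRR. The function name, task names, and times are made up.

```python
def reconfig_stall(tasks, exec_time, n_prrs, reconfig_time, reuse=True, prefetch=True):
    """Total time the application stalls waiting for reconfiguration."""
    resident = []  # modules currently loaded in PRRs, least recently used first
    stall = 0.0
    prev = None    # task that executed just before the current one
    for task in tasks:
        if reuse and task in resident:
            resident.remove(task)          # hardware reuse: module is already loaded
        else:
            if task in resident:
                victim = task              # reuse disabled: reload into the same PRR
            elif len(resident) >= n_prrs:
                victim = resident[0]       # evict the least recently used module
            else:
                victim = None              # an empty PRR is still available
            # Prefetching hides reconfiguration behind the previous task's execution,
            # but only if the PRR being rewritten is not the one running that task.
            can_overlap = prefetch and prev is not None and victim != prev
            hidden = exec_time[prev] if can_overlap else 0.0
            stall += max(0.0, reconfig_time - hidden)
            if victim is not None:
                resident.remove(victim)
        resident.append(task)
        prev = task
    return stall


# Made-up example: five task invocations on two PRRs, 3 time units per reconfiguration.
tasks = ["A", "B", "A", "C", "B"]
exec_time = {"A": 4.0, "B": 2.0, "C": 3.0}
baseline = reconfig_stall(tasks, exec_time, 2, 3.0, reuse=False, prefetch=False)  # 15.0
reuse_only = reconfig_stall(tasks, exec_time, 2, 3.0, reuse=True, prefetch=False)  # 12.0
both = reconfig_stall(tasks, exec_time, 2, 3.0, reuse=True, prefetch=True)         # 3.0
```

In this toy example reuse removes one of the five reconfigurations, and prefetching hides all but the first; real savings depend on the task flow graph and on how execution times compare with reconfiguration time.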

21 Task B: PR System Design Automation with DAPR++ Tool Suite. JPEG codec encoder architecture. [Block diagram. Encoder data path blocks: RGB2YCbCr & FDCT2D, ZigZag, Quantizer, Huffman Encoder, Run Length Encoder, Byte Stuffer, Header Generator. Other blocks: Encoder Pipeline Controller, Decoder Pipeline Controller, MUX, DEMUX, RAM, Host IF (HOST PROG, HOST DATA). Legend: data, control signals, PR region control signals, PR module, buffer.]

22 Task B: PR System Design Automation with DAPR++ Tool Suite. JPEG codec decoder architecture. [Block diagram. Decoder data path blocks: Header Decoder, Byte Stripper, Run Length Decoder, Huffman Decoder, Dequantizer, Reorder, YCbCr2RGB & IDCT2D. Other blocks: Byte Stuffer, Header Generator, Decoder Pipeline Controller, MUX, DEMUX, RAM, Host IF (HOST PROG, HOST DATA). Legend: data, control signals, PR region control signals, PR module, buffer.]

23 Task B: PR System Design Automation with DAPR++ Tool Suite. JPEG codec tests reveal that PRTM-based systems achieve an average 40% reconfiguration delay reduction. [Chart annotations: 20% reduction; 60% reduction.]

24 PR System Design Automation with DAPR++ Tool Suite

Networking Tool Overview. Sets up the master and slave FPGA network interfaces. Automatically generates hardware/software controllers: a master FPGA GPP to slave FPGA controller, and a slave FPGA hardware controller for PRRs. (PRR = partially reconfigurable region.) [Figures: simplified master FPGA architecture (Amber processors, WishBone interface, Ethernet); simplified slave FPGA architecture (WishBone interface, HW controller, PRRs, Ethernet).]

Networking Tool Experimental Setup. Master/slave FPGA setup with 2 GPPs/PRRs. Simple transmission test: the Amber processors send data to the PRRs; the PRRs rotate the bit value and transfer back the result. FFT, CORDIC, and matrix multiply cores: the GPPs send data from master FPGA RAM to the PRRs; the PRRs process the data and transfer back the result.

Results: Resource Requirements
Component | Slices | RAMB16 | Max operating frequency
Slave FPGA PRRs | 1,600 | 2 | 250 MHz
100 Mbps Ethernet core | 1,100 | 4 | 180 MHz
Master FPGA processor with 8KB cache | 4,820 | 10 | 250 MHz
Slave FPGA overall system with 2 PRRs | 6,956 | 20 | 155 MHz
Master FPGA system with two GPPs | 11,240 | 28 | 133 MHz

Results: Network Transfers. The tested cores achieved an average throughput of 41 Mbps.

25 Conclusions and Future Work. We presented the DAPR and DAPR++ design flows. DAPR performs automatic design space exploration using an iterative candidate PR floorplan generation methodology. The DAPR design flow's key contributions include making PR design more accessible and amenable to a wide range of system designers, creating high-performance systems with reduced design time effort, and allowing designers to choose between PR designs that trade off clock frequency and partial bitstream size. The DAPR++ tool suite allows automated RC system generation: each tool generates different RC system portions tailored to application needs, and the portions are integrated to build a complete RC system ready for use on selected FPGAs. Future work: enhance portability of the DAPR++ tool suite to multiple devices and vendors.

26 QUESTIONS? This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. We also gratefully acknowledge tools provided by Xilinx.

