2009 Midyear Workshop F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration All Hands Meeting November 10th, 2009 Dr. Ann Gordon-Ross.


1 2009 Midyear Workshop F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration
All Hands Meeting, November 10th, 2009
Dr. Ann Gordon-Ross, Assistant Professor of ECE, University of Florida
Dr. Alan D. George, Professor of ECE, University of Florida
Research Students, University of Florida: Abelardo Jara, Terence Frederick, Rohit Kumar, Shaon Yousuf

2 Outline
- Goals, Motivation and Challenges
- Virtual Architecture for Partially Reconfigurable Embedded Systems (VAPRES)
  - Design methodology
  - Multiple clock domains support
  - Bitstream relocation
- MACS Inter-module Communication Architecture
- Case Study Application: Embedded Target Tracking System on Virtex-4 FPGA board
  - Preliminary non-PR version using Kalman filters
- Design Automation for Partial Reconfiguration (DAPR)
  - DAPR design flow
    - VHDL annotations
    - Connectivity file and graph
    - Device library file
    - Overlay generation

3 Goals, Motivations, and Challenges
GOAL: Leverage partial reconfiguration (PR) for application designers
- Architect and implement a Virtual Architecture (VA) for Partially Reconfigurable Embedded Systems
- Ease PR design via design automation
MOTIVATIONS: Increase productivity and reduce design complexity for PR designs
- VA reduces development time
  - Dynamically load and unload hardware processing modules
  - Processing hardware adapts to external environmental conditions
- Automated design flow makes PR more amenable to system designers
  - Current PR design flow requires a very high level of specialization
  - Simplifies design of systems that time-multiplex FPGA resources, which allows smaller devices
CHALLENGES
- Provide sufficient VA flexibility with architectural parameterization
  - Balance application specialization against exploration complexity
- Create new exploration algorithms/heuristics to automate PR design flow steps with respect to available PR tools
[Figure labels: Sensor Interface, Central Controlling Agent, ICAP, Processed output, Filter repository (Filter A, Filter B), PRR (Filter A), External Trigger, Sensor Coverage Area]

4 F4-09 Approach
- Expand and prototype an FPGA-based architecture for rapid development of PR embedded systems
  - VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
  - MACS: Minimal Adaptive Circuit Switching mesh inter-module communication architecture for VAPRES
    - Improvement over F4-08 SCORES communication architecture
  - Architectural support for hardware module context save and restore
- Formulate and implement an automated PR design flow
  - DAPR: Design Automation for Partial Reconfiguration Tool
- Study Virtex-4 and Virtex-5 bitstreams to leverage additional functionalities
  - Extend bitstream relocation and context save and restore for Virtex-5
[Figure: contrasts a highly specialized PR system design (reconfiguration behavior known at design time; highly optimized system floorplan based on known application) with a flexible and reusable base architecture (not optimized for a specific application; tools to develop both reconfigurable modules and application software). Labels: Design Methodology + VAPRES Builder Tool, VAPRES Base Architecture]

5 VAPRES: Architecture Design
- Flexible, scalable architecture
  - Multiple architectural parameters enable base system specialization
    - N = number of PRRs
    - kr = number of streaming channels going right
    - kl = number of streaming channels going left
    - Some additional parameters presented next
- Base PR embedded system
- Multiple clock domains
  - PRMs can operate at independent clock frequencies
  - PRMs use FIFO-based I/O ports
- High-speed inter-module communication architecture (MACS)
[Figure labels: Control Region (MicroBlaze, PLB Bus, FSL Interface, DCR Bridge, ICAP, Flash controller, UART, SDRAM, Network, clk0); Data Processing Region (PRR1, PRR2, PRR3 in PR Sockets 1 to 3, Module Interfaces, MACS switch, Streaming channels, Slice macros, I/O Module, To external I/O pins, clk1, clk2); example parameter labels kr=1, kl=2]

6 VAPRES: Design Methodology
- System designer chooses VAPRES parameters
- Parametric models for VAPRES and MACS enable customization
- System floorplan defines PRR sizes and shapes
- System definition files: VAPRES VHDL, MHS, MSS, and UCF
- C/C++ libraries for application software development
- PRM implementation is separate from base system implementation
- Application designers work separately from the system designer
[Figure: two flows targeting the FPGA board. Base System Flow (base system designer): Base system specifications, Base system design, Parametric VHDL models, Floorplan, System definition files, Synthesis, Implementation, Static bitstream, VAPRES API (vapres.h). Application Flow (application designers): Application decomposition, Application software, Software design, Software implementation, Executable file, PRMs, PRM design, Synthesis, Implementation, Partial bitstreams]

7 VAPRES: Builder Tool
Overview
- Automates the process of building the VAPRES base system and applications
  - Increases designer productivity
Builder Tool Features
- Some additional parameters used
  - PRR height and width
- Automatic creation of VAPRES base system from parameters
  - Base system floorplanning
  - Slice macro instantiation and placement
- Automatic implementation of static and partial bitstreams
Assisted framework for application designers
- Generates VAPRES SW libraries
- Templates for PRMs and software
[Figure labels: Architectural parameters, Static base system, PR modules (PRMs), Application software, System floorplan (.ucf), Top VHDL entity (.vhd), Software specifications (.mss), Hardware specifications (.mhs)]

8 VAPRES Builder: Results
Post-place-and-route implementation for the base static system, listed for Design 1, Design 2, Design 3, Design 4 in order:
- Number of PRRs: 1, 1, 2, 3
- PRR height: 1 row (16 CLBs), 2 rows (32 CLBs), 2 rows (32 CLBs), 1 row (16 CLBs)
- PRR width: 10 CLBs
- MACS parameters: N=1, kr=1, kl=1; N=2, kr=2, kl=2; N=3, kr=2, kl=2
- Maximum clock: 120.3 MHz, 117.6 MHz, 116.1 MHz, 119.3 MHz
- Static region slices (without MACS): [values not preserved in this transcript]
- MACS slices: N/A [remaining values not preserved in this transcript]
- Slice deltas shown: +0 slices, +284 slices, +263 slices
Notes
- N = number of PRRs = number of MACS switches; kr = number of channels between switches going in the right direction; kl = number of channels between switches going in the left direction
- Adding an extra PRR costs about 280 more slices
- The 100 MHz constraint was met for all placed-and-routed designs
[Figure labels: Set of slice macros (1 set for each PRR), PRR boundary]

9 VAPRES: Bitstream Relocation
In-situ bitstream relocation: alters a partial bitstream (with no external inputs) so that it can run in any PRR
- Only one partial bitstream is necessary for each PRM
- Partial bitstreams are stored in compact flash
- When a PRM is needed, its partial bitstream is loaded into the MicroBlaze and the relocator is called
- The new partial bitstream is loaded into the correct PRR
- Program runs in external memory:
  - Bitstream relocator is stored in non-volatile compact flash
  - System ACE controller loads the relocator from flash and stores it in SDRAM
Advantages
- Reduces bitstream storage requirements (only one partial bitstream per module)
- Saves the step of reading a partial bitstream from external flash memory if a similar partial bitstream was already loaded into memory
- Enables VAPRES to dynamically place and migrate modules
Restriction: PRRs must be homogeneous (ensures sufficient resources)
[Figure labels: System Control Region (MicroBlaze, PLB Bus, FSL Interface, ICAP, SystemACE, Flash, UART, SDRAM, Network, clk0); Data Processing Region, which includes one or more RSBs (Reconfigurable Streaming Blocks) (PRR1, PRR2, Module Interface, SCORES Switch, I/O Module, To external I/O pins, clk1)]

10 Overview: MACS Communication Architecture
MACS: Minimal adaptive circuit switching mesh communication architecture
- VAPRES requires high-bandwidth, low-latency communication channels inside reconfigurable streaming blocks (RSBs)
- A novel communication architecture named SCORES was implemented in 2008
- MACS extends SCORES from a linear array topology to a mesh topology and adds a few other new features
Features of MACS
- Minimal-adaptive routing to explore all possible shortest paths
  - Selects the lowest-cost path that best achieves network load distribution
- Similar interface ports for nodes and neighboring switches
  - Any number (<= 6) of nodes can be attached to a single switch
  - Unused interface ports of switches around the edges of the NoC can be utilized
  - The number of node interface ports available in an MxN NoC is <= 2(M*N + M + N)
  - Reduces the area overhead of the communication architecture per node
- Provides low-latency path(s) between frequently communicating node pairs (if attached to the same switch)
[Figure: 3x3 mesh of MACS switches (S), each with two attached nodes (N)]
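The port bound 2(M*N + M + N) can be checked by counting. The sketch below assumes each switch has six interface ports (consistent with the "<= 6 nodes on a single switch" remark) and that every switch-to-switch link in the mesh uses one port on each end; the function names are illustrative and not part of MACS.

```python
def free_ports_by_counting(rows: int, cols: int, ports_per_switch: int = 6) -> int:
    """Count ports left for nodes in a rows x cols mesh of switches.

    Assumption: every switch has `ports_per_switch` identical ports and
    each inter-switch link occupies one port on both of its endpoints.
    """
    total_ports = ports_per_switch * rows * cols
    horizontal_links = rows * (cols - 1)
    vertical_links = cols * (rows - 1)
    used_ports = 2 * (horizontal_links + vertical_links)
    return total_ports - used_ports


def free_ports_closed_form(rows: int, cols: int) -> int:
    """Bound quoted on the slide: 2(M*N + M + N)."""
    return 2 * (rows * cols + rows + cols)


if __name__ == "__main__":
    for m, n in [(1, 1), (2, 2), (3, 3)]:
        print(m, n, free_ports_by_counting(m, n), free_ports_closed_form(m, n))
```

Under these assumptions the two counts agree for every mesh size, and a single switch recovers the six-node limit.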

11 MACS implementation results (1/2)
- Nine architectural parameters can be varied
  - Plotting all combinations is not feasible
  - Even with only two values per parameter, this would require 2^9 = 512 "area usage" plots and 2^9 = 512 "achievable frequency" plots
Figure 1: Area usage in number of slices per module for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.
Figure 2: Maximum operating frequency for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.

12 MACS implementation results (2/2)
Comparison of NoCs
- Difficult due to the lack of published implementation results from other authors
- Representative packet-switching NoC [1]
  - Designed and realized by Bartic et al.
  - 8 modules attached in a 2D-mesh topology
  - 16-bit wide data
- Similar circuit-switched NoC, i.e. PNoC [2]
  - Programmable Network on Chip, designed and realized by Hilton et al.
  - Single switch with 8 modules attached to it
  - 16-bit wide data
- Comparable configuration of MACS
  - 2x2 mesh of MACS switches
  - W=16, Ku=Kd=Kl=Kr=Kil=Kir=1
Table columns: Network Architecture, Slices, BRAMs, Frequency (MHz); rows: MACS, Packet-Switching, PNoC [cell values not preserved in this transcript]
Comparison Results
- 5x faster and 1.5x less area overhead than the packet-switching NoC
- 2x faster (with slight area overhead) than PNoC
References
1. Bartic, A., Mignolet, J.Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., and Lauwereins, R. "Highly scalable network on chip for reconfigurable systems". In Proceedings of International Symposium on System-on-Chip, 2003, pages 79–82.
2. Hilton C. and Nelson B., "PNoC: a flexible circuit-switched NoC for FPGA-based systems". In Proceedings of Computers and Digital Techniques, 2006, pages

13 Analytical Modeling
Analytical model of SCORES/MACS
- Streaming network
  - FIFO at both ends: producer FIFO (of size D), consumer FIFO (of size C)
  - Pipelined channel/medium: n-stage pipeline
  - Control feedback path: n-stage
- Phase I: analysis of producer-medium and medium-consumer pairs
- Phase II: analysis of medium-consumer with feedback
[Figure: producer FIFO (size D) feeding an n-stage medium feeding a consumer FIFO (size C); rates labelled λ_p, λ_m, µ_m, µ_c]

14 Phase I: Producer-Medium Pair (1/2)
Markov-chain model
- P_k = probability of the queue being in state k, i.e. the queue holding k packets
- λ_p = arrival rate
- μ_m = service rate
- D = system capacity
- Flow = sum of products of λ's, μ's, and P's
- Solving for steady state gives the state probabilities [equations not preserved in this transcript]
[Figure: birth-death chain over states 0, 1, 2, ..., k, k+1, ..., D with probabilities P_0, P_1, P_2, ..., P_k, P_{k+1}, ..., P_D, forward rates λ_{p,k}, and backward rates μ_{m,k}; queue of size D fed at λ_p and served at μ_m]

15 Phase I: Producer-Medium Pair (2/2)
- Total probability of the system should be 1
[Figure: plot of P_D against D (line size); axis markers include 1 and 1/(D+1)]
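The steady-state equations for Phase I did not survive in this transcript. As a stand-in, the following sketch solves a generic birth-death chain by flow balance and normalization; it is the textbook treatment of a finite queue rather than a reconstruction of the authors' exact derivation. With constant rates and λ_p = μ_m, it yields P_D = 1/(D+1), the value that appears on the plot. Function names are illustrative.

```python
def steady_state(arrival_rates, service_rates):
    """Steady-state probabilities P_0..P_D of a birth-death chain.

    arrival_rates[k] is the rate from state k to state k+1.
    service_rates[k] is the rate from state k+1 back to state k.
    Both lists have length D (the system capacity).
    """
    if len(arrival_rates) != len(service_rates):
        raise ValueError("rate lists must have the same length")
    # Flow balance across each cut: P_{k+1} * mu = P_k * lambda.
    weights = [1.0]
    for lam, mu in zip(arrival_rates, service_rates):
        weights.append(weights[-1] * lam / mu)
    # Normalize so the total probability of the system is 1.
    total = sum(weights)
    return [w / total for w in weights]


def full_probability(lam, mu, capacity):
    """P_D for constant arrival rate lam and service rate mu."""
    return steady_state([lam] * capacity, [mu] * capacity)[-1]


if __name__ == "__main__":
    print(full_probability(1.0, 1.0, 4))  # equal rates: 1/(D+1)
    print(full_probability(0.5, 1.0, 2))
```

State-dependent rates such as λ_{p,k} and μ_{m,k} from the chain diagram can be passed directly as lists.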

16 Phase II: Medium-Consumer Pair with Control Feedback, 2D Markov-Chain Model (1/2)
Streaming network state
- Number of packets in the queue (k)
- Whether the threshold was recently reached (Q)
Potential queuing at Q = 0
- Producer is filling with rate λ_p
- Service rate is µ_m
- At k = D-1, the queue switches to the de-queuing state
Potential de-queuing at Q = 1
- Producer is filling with reduced rate λ_{p,1}
- Consumer is emptying with rate µ_m
- Total probability of state Q = 1 gives the packet drop probability
- At k = 1, the queue switches to the queuing state, i.e. Q = 0
[Figure: two-row Markov chain. Row Q=0: states 0, 1, 2, ..., k, ..., D-1 with probabilities P_0 ... P_{D-1}, forward rate λ_p and backward rate µ_m. Row Q=1: states 1, 2, ..., i, ..., D-1, D with probabilities P_{1,1} ... P_{D,1}, forward rate λ_{p,1} and backward rate µ_m]

17 Phase II: Medium-Consumer Pair with Control Feedback, 2D Markov-Chain Model (2/2)
- Probability of the FIFO being filled with k packets when ρ ≠ 1
- Probability of the FIFO being filled with k packets when ρ = 1
- Packet drop probability when ρ ≠ 1
- Packet drop probability when ρ = 1
[Equations not preserved in this transcript]

18 Real-time Simulation and Profiling of MACS
Setup for basic experiment
- One MACS switch with both module interfaces occupied
- Network frequency = module frequency = 100 MHz
- Producer and consumer rates follow a Poisson process
  - ROM holds MATLAB-generated Poisson-distributed intervals based on different λ and µ
  - Producer/consumer loads its counter with a value from ROM and generates/reads a unit of data at counter overflow
- ChipScope ILA core captures all FIFO activity
- System parameters: FIFO sizes = 512 bytes, network BW = 400 MBps, producer rate = 40 MBps, consumer rate = 4 MBps (both operate at Poisson-distributed random intervals), transfer size = 0-128 KB
Results
- Link utilization = 1/10.35 before the consumer FIFO is full (at transfer size ~46 KB)
- Link utilization = 1/ [value not preserved in this transcript] after the consumer FIFO is full (at transfer size > 46 KB)
- Activity of both FIFOs and the probability distribution of the consumer FIFO being "almost" full are also plotted with respect to transfer size
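The ROM contents described above were generated in MATLAB; the following Python sketch shows one plausible way to produce such a table. It assumes the intervals are exponentially distributed gaps (the inter-event times of a Poisson process) rounded to whole clock cycles, and that the mean gap is the ratio of network bandwidth to node rate. The function names, ROM depth, and seed are hypothetical.

```python
import random


def mean_interval_cycles(network_mbps: float, node_mbps: float) -> float:
    """Average number of network cycles between data units for one node."""
    return network_mbps / node_mbps


def poisson_interval_rom(mean_cycles: float, depth: int, seed: int = 0):
    """Return `depth` integer cycle counts with exponential spacing.

    Assumption: each ROM word is the gap, in clock cycles, before the
    next unit of data; gaps are clamped to at least one cycle.
    """
    rng = random.Random(seed)
    rate = 1.0 / mean_cycles
    return [max(1, round(rng.expovariate(rate))) for _ in range(depth)]


if __name__ == "__main__":
    producer_gap = mean_interval_cycles(400, 40)   # 10 cycles on average
    consumer_gap = mean_interval_cycles(400, 4)    # 100 cycles on average
    rom = poisson_interval_rom(producer_gap, depth=16)
    print(producer_gap, consumer_gap, rom)
```

With the slide's parameters the producer averages one data unit every 10 cycles, which is in line with the reported link utilization of roughly 1/10 before the consumer FIFO fills.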

19 Real-time Simulation and Profiling of MACS
Setup for advanced experiment
- 3x3 MACS NoC with both module interfaces occupied for each switch
- Network frequency = module frequency = 100 MHz
- Producer and consumer rates are linear
- ChipScope ILA core captures all activities, such as request establishment, write enables for FIFOs (used in the link utilization calculation), average number of retrials for establishing a channel, average channel establishment latency, etc.
- Observe the aforementioned parameters for various network traffic patterns
Network traffic generation patterns (pattern name: description)
- Uniform Random: module chooses a random destination among all the other modules and sends a packet to that destination. The probability is equal among the other modules.
- Nearest Neighbor: each node sends a packet to a module of its immediate neighbor switch with equal probability.
- Tornado: {X, Y} sends packets to destination {X+k/2−1, Y} mod k for the k-ary network (k=4).
- Transpose: the router at address {X, Y} sends a packet to router {Y, X}.
- Bit Complement: the node with address {b0,b1,b2,b3} in bits sends packets to the destination address NOT{b0,b1,b2,b3} in bits.
- Hot Spot: all nodes send packets to a certain node. The hot spot can act as receiver only or as both transmitter and receiver.
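The deterministic patterns in the table can be written as small destination functions. This is a sketch under stated assumptions: the modulo in the Tornado pattern is applied to the X coordinate only, addresses in Bit Complement are 4-bit integers, and the function names are illustrative rather than taken from the test bench.

```python
def transpose(x: int, y: int) -> tuple:
    """Router {X, Y} sends to router {Y, X}."""
    return (y, x)


def tornado(x: int, y: int, k: int = 4) -> tuple:
    """Router {X, Y} sends to {(X + k/2 - 1) mod k, Y} in a k-ary network."""
    return ((x + k // 2 - 1) % k, y)


def bit_complement(address: int, width: int = 4) -> int:
    """Node b0..b3 sends to the bitwise NOT of its own address."""
    return ~address & ((1 << width) - 1)


def hot_spot(source: int, hot_node: int) -> int:
    """Every node targets the same hot node, whatever the source is."""
    return hot_node


if __name__ == "__main__":
    print(transpose(1, 2), tornado(3, 0), format(bit_complement(0b0101), "04b"))
```

For k=4 the Tornado offset is k/2−1 = 1, so each router targets the next column, wrapping around at the edge.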

20 DAPR Tool Overview: Design Automation for Partial Reconfiguration (DAPR)
Xilinx Early Access (EA) PR Flow provides PR system design support
- Existing PR flow is very specialized
- Requires target device architecture knowledge
- System designer must manually apply steps
  - Hierarchical coding of HDL design description, synthesis, floorplanning, timing analysis, implementation, and merge
DAPR design flow will mitigate existing PR design flow intricacies
- Manual steps
  - Hierarchical HDL design description
  - Modified HDL design description via system designer annotations
  - System designer annotated design constraints (optional)
- Automated steps
  - DAPR inputs: modified HDL design description and design constraints (parameters include bitstream size, timing, power)
  - DAPR design exploration: iteratively generates a candidate design and compares the generated design's performance parameters with the system designer's annotated constraints
  - DAPR output: final bitstreams if the system designer's constraints are met; otherwise, the final bitstreams that match the annotated constraints most closely
[Figure: EA PR Flow (HDL Design Description, HDL Synthesis, Set Design Constraints, Implement Base Design, Implement PR Modules, Timing/Placement Analysis, Merge) compared with DAPR Design Flow (HDL Design Description, Modified HDL Design Description, Design Constraints (optional), HDL Synthesis, Implement Base Design, Implement PR Modules, Timing/Placement Analysis, Merge, Final Generated Bitstreams), with steps marked as manual or automated]
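The exploration step described above (accept a candidate that meets the constraints, otherwise fall back to the closest one) can be sketched as follows. The slide does not define "closest", so the distance used here, a sum of relative overshoots, is an assumption, as are the metric names and the convention that lower values are better for every metric.

```python
def violation(metrics: dict, limits: dict) -> float:
    """Sum of relative overshoots; 0.0 means every constraint is met.

    Assumption: all metrics are "lower is better" (e.g. bitstream size,
    clock period, power), which is a simplification for illustration.
    """
    return sum(
        max(0.0, (metrics[name] - limit) / limit)
        for name, limit in limits.items()
    )


def select_design(candidates, limits):
    """Return the first candidate meeting all limits, else the closest one."""
    best = None
    for candidate in candidates:
        score = violation(candidate, limits)
        if score == 0.0:
            return candidate
        if best is None or score < best[0]:
            best = (score, candidate)
    return best[1] if best else None


if __name__ == "__main__":
    limits = {"bitstream_kb": 100, "power_mw": 50}   # hypothetical constraints
    candidates = [
        {"bitstream_kb": 150, "power_mw": 40},
        {"bitstream_kb": 110, "power_mw": 45},
    ]
    print(select_design(candidates, limits))
```

In a real flow each candidate would come from a full implementation run, so the loop would be far more expensive than this selection logic suggests.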

21 Overview: DAPR Tool Phases and Description
Phase 1: Information Extraction
- Extract static and PR region instantiations and the corresponding HDL design description filenames from the top-level HDL design description file
Phase 2: Information Collection
- Collect and write port connection names and widths within each instantiation to the partial reconfiguration automation information file (*.paif)
Phase 3: Overlay Generation (Resource Estimation and Constraint Generation)
- Synthesize all HDL design description files with the Xilinx XST utility
- Read and record estimated slice requirements from the generated synthesis log file (.srp) to the .paif file
- Generate connectivity information and the PRR floorplan using estimated resources and device information libraries
Phase 4: Bitstream Generation
- Implement the static region and PRMs with Xilinx's ngdbuild, MAP, and PAR utilities
- Merge top, static, and PRMs with Xilinx's PR_verify design and PR_assemble utilities to generate the final full and partial bitstreams
[Figure: the DAPR tool starts from the initial input (modified VHDL top file). Labels: VHDL Top File, PRRs identification, Static region identification, PR automation information file (.paif), Device inf. libraries (.dilf), Run script to synthesize modules and estimate resource requirements, Perform automated floorplanning and write to User Constraint File (UCF), Implement and merge design, Generated full and partial bitstreams]

22 System Designer Annotations and Connectivity Information Examples
A simple example design with two PRRs
- Two 32-bit up and down counter modules map to PRR 1
- Two 8-bit up and down counter modules map to PRR 2
- Connectivity information is gathered from the .paif file and a connectivity graph is generated for system designer verification
Example system designer annotations (case insensitive) and their significance
- --PRR_Start :: filename, filename… : identifies the beginning of a PRR instantiation and the PRM filenames (use commas to specify multiple filenames)
- --Static_Start :: filename, filename… : identifies the static region instantiation and filenames (use commas to specify multiple filenames)
- --bm_start : identifies a slice macro instantiation
- --PRR_clock : identifies the system top-level clock
Annotated VHDL excerpt
    --PRR_start :: prm_up, prm_down
    reconfig : rmodule Port Map( led_in=> rm_in_int, led_out=> rm_out_int);
    --static_start :: static
    led_registers : base Port Map( clk=> clk, led_unreg=> rm_out, led_reg=> rm_in);
    --bm_start
    in0 : busmacro_xc4v_l2r_sync_narrow Port Map( input0 => bml2r(0), input1 => bml2r(1), input2 => bml2r(2),
Connectivity Information Example
[Figure: Design Connectivity Graph. Static Region connected to Counter by 32-bit links and to Counter_sm by 8-bit links]
Design Connectivity Information Table (columns: Module Name/Type, Incoming Connections, Outgoing Connections; some cell values not preserved in this transcript)
- Base/Static: 40
- Counter/PR: 32
- Counter_sm/PR: 8, 8
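The annotation syntax above lends itself to a simple line scanner. The following is a minimal sketch of how the Information Extraction phase might pick out the annotations, assuming each one sits on its own VHDL comment line; it is not the DAPR tool's actual parser, and the function and variable names are hypothetical.

```python
import re

# Matches the four annotation keywords, case-insensitively, with an
# optional ":: name, name" list after the keyword.
ANNOTATION = re.compile(
    r"^\s*--\s*(prr_start|static_start|bm_start|prr_clock)\b\s*(?:::\s*(.*))?$",
    re.IGNORECASE,
)


def parse_annotations(vhdl_text: str):
    """Return (keyword, filenames, line_number) for each annotation line."""
    found = []
    for line_number, line in enumerate(vhdl_text.splitlines(), start=1):
        match = ANNOTATION.match(line)
        if not match:
            continue
        keyword = match.group(1).lower()
        names = [n.strip() for n in (match.group(2) or "").split(",") if n.strip()]
        found.append((keyword, names, line_number))
    return found


if __name__ == "__main__":
    sample = (
        "--PRR_Start :: prm_up, prm_down\n"
        "reconfig : rmodule port map(led_in => rm_in_int);\n"
        "--static_start :: static\n"
        "--bm_start\n"
    )
    for entry in parse_annotations(sample):
        print(entry)
```

The instantiation that follows each annotation would still need to be parsed separately to collect port names and widths for the .paif file.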

23 DAPR V4LX25 Device Library
Device divided into 3 banks
- Bank 0 (left), Bank 1 (right), Bank 2 (center)
- Resource representation
  - Single letter with a prefix of either 1 or 0
  - Letters are S for slices, D for DSP48s, F for FIFO16s, R for RAMB16s, C for DCMs, G for BUFGs
  - Prefix of 0 means the resource is occupied, 1 means the resource is vacant
- Checking individual values identifies both the resource type and its availability
Device library file will be shown in the demo
[Figure labels: Bank 0, Bank 2, Bank 1]
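The resource encoding can be decoded with a small lookup. The actual layout of the device library file is not shown in this transcript, so the sketch below assumes two-character tokens such as `1S` or `0D` separated by whitespace; that layout and the function names are hypothetical.

```python
RESOURCE_NAMES = {
    "S": "slice",
    "D": "DSP48",
    "F": "FIFO16",
    "R": "RAMB16",
    "C": "DCM",
    "G": "BUFG",
}


def decode_token(token: str):
    """Return (resource_name, is_vacant) for a token such as '1S'."""
    if len(token) != 2 or token[0] not in "01" or token[1].upper() not in RESOURCE_NAMES:
        raise ValueError(f"unrecognized resource token: {token!r}")
    # Prefix 1 means vacant, prefix 0 means occupied.
    return RESOURCE_NAMES[token[1].upper()], token[0] == "1"


def count_vacant(tokens):
    """Count vacant resources of each type in an iterable of tokens."""
    counts = {}
    for token in tokens:
        name, vacant = decode_token(token)
        if vacant:
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_vacant("1S 0S 1D 1S 0R".split()))
```

A floorplanner could use such counts to test whether a candidate region still holds enough free slices for a module.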

24 DAPR Overlay Generation
Overlay generation uses a cluster growth algorithm, which works in two steps
- Linear ordering of modules
  - Choose a seed module from the initial set of modules and move it to a new set of ordered modules (initially an empty set)
  - Compute the gain for each remaining module (gain is the number of connecting nets)
  - Move the module with the highest gain to the set of ordered modules and repeat from the gain computation until no modules remain in the initial set
- Place ordered modules on the floorplan space
  - Two types of floorplan growth: vertical and diagonal
Floorplan growth direction
- The current overlay generator builds floorplans vertically
  - Advantage: bitstream size will be smaller
  - Disadvantage: routing is difficult and will take longer
[Figure: floorplan growths, diagonal (left) and vertical (right); colored blocks represent PRMs]
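The linear-ordering step can be sketched in a few lines. This version assumes the gain of a module is the total number of nets connecting it to modules already ordered, and that ties are broken by input order; the slide does not specify either detail, and the net counts in the example are illustrative.

```python
def connection_count(nets: dict, a: str, b: str) -> int:
    """Number of nets between modules a and b, in either direction."""
    return nets.get((a, b), 0) + nets.get((b, a), 0)


def linear_ordering(modules, nets, seed):
    """Order modules by repeatedly taking the one with the highest gain.

    Assumption: gain = nets connecting a remaining module to the
    already-ordered set; ties keep the original input order.
    """
    if seed not in modules:
        raise ValueError("seed must be one of the modules")
    ordered = [seed]
    remaining = [m for m in modules if m != seed]
    while remaining:
        gains = {
            m: sum(connection_count(nets, m, placed) for placed in ordered)
            for m in remaining
        }
        best = max(remaining, key=lambda m: gains[m])
        ordered.append(best)
        remaining.remove(best)
    return ordered


if __name__ == "__main__":
    nets = {("static", "counter"): 32, ("static", "counter_sm"): 8}
    print(linear_ordering(["static", "counter_sm", "counter"], nets, "static"))
```

The resulting order would then feed the placement step, which stacks the modules in the chosen growth direction.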

25 Results: Low-Level DAPR Design Flow
Numerical results
- Case study implementation results with a 32-bit counter
- More designs are under test
  - CORDIC
  - FFT
  - Matrix multiplier
Table columns: Iteration no., Clock (MHz), Power (mW), PRR size (CLBs), Partial bitstream size (KB) [cell values not preserved in this transcript; one PRR size label reads "CLB wide and 16 CLB tall"]

26 Kalman Filter Case Study
Data format
- For the X and Y coordinates
  - 16-bit fixed-point representation: 1 sign bit, 8 integer bits, and 7 fractional bits
- For the 2 FIFOs
  - Implemented using one Virtex-4 BRAM
  - Each one is 32 bits wide (16 for X and 16 for Y) and 512 words deep
[Figure: the process of the system]
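The 16-bit format above (1 sign bit, 8 integer bits, 7 fractional bits) can be illustrated with a short conversion routine. The sketch assumes two's-complement encoding with saturation and places X in the upper half of the 32-bit FIFO word; the slide states neither, so both are assumptions, and the function names are illustrative.

```python
FRACTION_BITS = 7
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def to_fixed(value: float) -> int:
    """Encode a real number as a 16-bit two's-complement word (assumed)."""
    raw = round(value * (1 << FRACTION_BITS))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    raw = max(lowest, min(highest, raw))  # saturate instead of wrapping
    return raw & WORD_MASK


def from_fixed(word: int) -> float:
    """Decode a 16-bit word back to a real number."""
    if word & (1 << (WORD_BITS - 1)):
        word -= 1 << WORD_BITS
    return word / (1 << FRACTION_BITS)


def pack_xy(x: float, y: float) -> int:
    """Pack X (upper 16 bits, assumed) and Y (lower 16 bits) into 32 bits."""
    return (to_fixed(x) << WORD_BITS) | to_fixed(y)


def unpack_xy(word: int):
    """Split a 32-bit FIFO word back into (x, y)."""
    return from_fixed(word >> WORD_BITS), from_fixed(word & WORD_MASK)


if __name__ == "__main__":
    print(hex(to_fixed(1.5)), from_fixed(to_fixed(-2.25)), unpack_xy(pack_xy(3.0, -1.5)))
```

The resolution of this format is 2^-7 = 0.0078125, and the representable range is roughly -256 to +256.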

27 Kalman Filter: Introduction
Application
- Target tracking in a linear system: provide accurate, continuously updated information about the position of a target given a sequence of observations about its position
  - Dynamic model and measurement model are linear
  - Noises are Gaussian distributed
The system model
- The dynamic system model: uniform velocity motion [equation not preserved in this transcript]
- The measurement model [equation not preserved in this transcript]
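The slide's model equations are missing from this transcript. For orientation, a common textbook form of a uniform-velocity (constant-velocity) tracking model with position-only measurements is given below; the state ordering, the sampling period T, and the noise symbols are assumptions and may differ from the authors' notation.

```latex
% State: position and velocity along X and Y (assumed ordering)
\mathbf{x}_k = \begin{bmatrix} p_x & v_x & p_y & v_y \end{bmatrix}^{\mathsf T}

% Dynamic model: uniform velocity motion with process noise w_k
\mathbf{x}_k = F\,\mathbf{x}_{k-1} + \mathbf{w}_k, \qquad
F = \begin{bmatrix}
1 & T & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & T \\
0 & 0 & 0 & 1
\end{bmatrix}, \qquad
\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, Q)

% Measurement model: noisy observation of position only
\mathbf{z}_k = H\,\mathbf{x}_k + \mathbf{v}_k, \qquad
H = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}, \qquad
\mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, R)
```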

28 Kalman Filter Algorithm
Initialization
Predict
- Predicted state
- Predicted covariance
Update
- Innovation measurement
- Innovation covariance
- Optimal Kalman gain
- Update state estimate
- Update estimate covariance
[Equations not preserved in this transcript]
The simplified version: fixed-gain Kalman filter
- Difference: the optimal Kalman gain is computed before processing and kept fixed
- Application: if the system is a stationary stochastic process, the Kalman gain does not change
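The step names on this slide match the standard discrete-time Kalman filter recursion. The equations below are that standard form, offered as a reference since the slide's own equations were not preserved; the symbols F, H, Q, and R denote the state-transition, measurement, process-noise covariance, and measurement-noise covariance matrices, which is conventional notation rather than necessarily the authors' own.

```latex
% Predict
\hat{\mathbf{x}}_{k|k-1} = F\,\hat{\mathbf{x}}_{k-1|k-1}                     % predicted state
P_{k|k-1} = F\,P_{k-1|k-1}\,F^{\mathsf T} + Q                                % predicted covariance

% Update
\tilde{\mathbf{y}}_k = \mathbf{z}_k - H\,\hat{\mathbf{x}}_{k|k-1}            % innovation
S_k = H\,P_{k|k-1}\,H^{\mathsf T} + R                                        % innovation covariance
K_k = P_{k|k-1}\,H^{\mathsf T}\,S_k^{-1}                                     % optimal Kalman gain
\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k\,\tilde{\mathbf{y}}_k % updated state estimate
P_{k|k} = (I - K_k H)\,P_{k|k-1}                                             % updated estimate covariance
```

In the fixed-gain variant, K_k is replaced by a constant K computed offline, so the covariance and gain equations drop out of the per-sample computation.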

29 Type 1: Fixed-gain Kalman filter
- 8 multiplications
- Read and write FIFOs for the Kalman filter part
  - Process control: if the TX FIFO is full, stop writing and stop reading data from the RX FIFO, i.e. stop processing data
  - Time interval guarantee: at least 3 clock cycles
- Parameter input
  - Parameters (fixed Kalman gain, initial values) are supplied as inputs instead of being pre-programmed in the system
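One formulation that uses exactly 8 multiplications per sample is sketched below. It assumes a four-element state (position and velocity along X and Y), a unit time step so that prediction needs only additions, and a full 4x2 fixed gain matrix applied to the two-element innovation. The slide's datapath is not shown, so this is an illustration of the idea rather than the implemented design.

```python
def fixed_gain_step(state, measurement, gain):
    """One fixed-gain Kalman iteration for a constant-velocity target.

    state:       (x, vx, y, vy), assumed ordering
    measurement: (zx, zy), measured position
    gain:        4 rows of (k_x, k_y); fixed, computed offline

    With a unit time step the prediction uses additions only, so the
    gain-times-innovation product accounts for all 8 multiplications.
    """
    x, vx, y, vy = state
    zx, zy = measurement
    predicted = (x + vx, vx, y + vy, vy)
    innovation_x = zx - predicted[0]
    innovation_y = zy - predicted[2]
    return tuple(
        p + k_x * innovation_x + k_y * innovation_y
        for p, (k_x, k_y) in zip(predicted, gain)
    )


if __name__ == "__main__":
    # Hypothetical gain values chosen only for illustration.
    gain = ((0.5, 0.0), (0.25, 0.0), (0.0, 0.5), (0.0, 0.25))
    print(fixed_gain_step((0.0, 1.0, 0.0, 0.0), (2.0, 4.0), gain))
```

Because the gain is an input rather than a constant, the same hardware could be reused with different gain values, which fits the slide's point about supplying parameters at run time.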

30 Results & Analysis
- For flexibility of application, 8 DSP48s are used to instantiate the multipliers
- Resource consumption (V4LX25)
  - Number of slices: 280 (2%)
  - Number of DSP48s: 8 (16%)
  - Maximum frequency: [value not preserved in this transcript] MHz; throughput 52 MSPS (3 cycles)
  - Dynamic power consumption (100 MHz CLK): [value not preserved in this transcript] W
- Estimated results comparison
  - Bouncing ball experiment
  - Fixed-gain Kalman filter is suitable
  - Results calculated by the FPGA are identical to MATLAB

31 Type 2: Basic version of Kalman filter
- Assuming all noises are non-coherent, four elements in the Kalman gain matrix are zero
- 4 divisions and 12 multiplications

32 Results & Analysis
- Reduce the number of dividers and multipliers through resource reuse
- Estimated results comparison
  - Bouncing ball experiment
  - Kalman filter gain updates in each iteration
  - Results calculated by the FPGA are identical to MATLAB

                               4 divs & 12 muls   2 divs & 6 muls    1 div & 3 muls
Slices (V4LX25)                1958 (18%)         1316 (12%)         1033 (9%)
DSP48s                         12 (25%)           6 (12%)            3 (6%)
Max. frequency                 71.4 MHz
Processing time                23 clock cycles    24 clock cycles    26 clock cycles
Throughput                     3.1 MSPS           2.9 MSPS           2.7 MSPS
Dynamic power (50 MHz CLK)     [values not preserved in this transcript]
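The throughput figures follow from dividing the clock frequency by the processing time per sample. The check below assumes all three variants run at the single listed maximum frequency of 71.4 MHz and that the slide truncates to one decimal place; both are inferences from the table, not stated facts.

```python
import math


def throughput_msps(clock_mhz: float, cycles_per_sample: int) -> float:
    """Samples per microsecond when one sample takes `cycles_per_sample` cycles."""
    return clock_mhz / cycles_per_sample


def truncate(value: float, decimals: int = 1) -> float:
    """Drop digits beyond `decimals` without rounding up."""
    scale = 10 ** decimals
    return math.floor(value * scale) / scale


if __name__ == "__main__":
    for cycles in (23, 24, 26):
        print(cycles, truncate(throughput_msps(71.4, cycles)))
```

Halving the number of dividers and multipliers therefore costs only a few percent of throughput while saving roughly a third of the slices.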

