2009 Midyear Workshop F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration All Hands Meeting November 10th, 2009 Dr. Ann Gordon-Ross.

Presentation transcript:

2009 Midyear Workshop
F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration
All Hands Meeting, November 10th, 2009
Dr. Ann Gordon-Ross, Assistant Professor of ECE, University of Florida
Dr. Alan D. George, Professor of ECE, University of Florida
Abelardo Jara, Terence Frederick, Rohit Kumar, Shaon Yousuf, Research Students, University of Florida

Outline
- Goals, Motivations, and Challenges
- Virtual Architecture for Partially Reconfigurable Embedded Systems (VAPRES)
  - Design methodology
  - Multiple clock domains support
  - Bitstream relocation
- MACS inter-module communication architecture
- Case study application: embedded target tracking system on Virtex-4 FPGA board
  - Preliminary non-PR version using Kalman filters
- Design Automation for Partial Reconfiguration (DAPR)
  - DAPR design flow
    - VHDL annotations
    - Connectivity file and graph
    - Device library file
    - Overlay generation

Goals, Motivations, and Challenges
GOAL: Leverage partial reconfiguration (PR) for application designers
- Architect and implement a Virtual Architecture (VA) for Partially Reconfigurable Embedded Systems
- Ease PR design via design automation
MOTIVATIONS: Increase productivity and reduce design complexity for PR designs
- VA reduces development time
  - Dynamically load and unload hardware processing modules
  - Processing hardware adapts to external environmental conditions
- Automated design flow makes PR more amenable to system designers
  - Current PR design flow requires a very high level of specialization
  - Simplifies design of systems that time-multiplex FPGA resources → smaller devices
CHALLENGES
- Provide sufficient VA flexibility with architectural parameterization
  - Balancing enough application specialization with exploration complexity
- Creating new exploration algorithms/heuristics to automate PR design flow steps with respect to available PR tools
[Figure: sensor interface, central controlling agent, ICAP, filter repository (Filter A, Filter B), PRR loaded with Filter A, external trigger, sensor coverage area, processed output]

F4-09 Approach
- Expand and prototype an FPGA-based architecture for rapid development of PR embedded systems
  - VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
  - MACS: Minimal Adaptive Circuit Switching mesh inter-module communication architecture for VAPRES
    - Improvement over F4-08 SCORES communication architecture
  - Architectural support for hardware module context save and restore
- Formulate and implement an automated PR design flow
  - DAPR: Design Automation for Partial Reconfiguration tool
- Study Virtex-4 and Virtex-5 bitstreams to leverage additional functionalities
  - Extend bitstream relocation and context save and restore for Virtex-5
[Figure: two design styles compared]
- Highly specialized PR system design
  - Reconfiguration behavior known at design time
  - Highly optimized system floorplan based on known application
- Flexible and reusable base architecture
  - Not optimized for a specific application
  - Tools to develop both reconfigurable modules and application software
  - Design methodology + VAPRES Builder Tool, VAPRES base architecture

VAPRES: Architecture Design
- Flexible scalable architecture
  - Multiple architectural parameters enable base system specialization
    - N = number of PRRs
    - kr = number of streaming channels going right
    - kl = number of streaming channels going left
    - Some additional parameters presented next
- Base PR embedded system
- Multiple clock domains
  - PRMs can operate at independent clock frequencies
  - PRMs use FIFO-based I/O ports
- High speed inter-module communication architecture (MACS)
[Figure: control region with MicroBlaze, PLB bus, DCR bridge, ICAP, flash controller, UART, SDRAM, and a network I/O module to external I/O pins; data processing region with PRR1, PRR2, PRR3 in PR sockets 1 to 3, each with module interfaces, slice macros, an FSL interface, and a MACS switch; streaming channels between switches; clocks clk0, clk1, clk2; labels kr=1, kl=2]

VAPRES: Design Methodology
- System designer chooses VAPRES parameters
- System definition files: VAPRES VHDL, MHS, MSS, and UCF
- C/C++ libraries for application software development
- PRM implementation is separate from base system implementation
- Application designers work separately from the system designer
- Parametric models for VAPRES and MACS enable customization
- System floorplan defines PRR sizes and shapes
[Figure: two flows ending at the FPGA board. Application Flow (application designers): application decomposition, application software, PRMs, software design, software implementation, PRM design, synthesis, implementation, executable file, partial bitstreams, VAPRES API (vapres.h). Base System Flow (base system designer): base system specifications, base system design, parametric VHDL models, synthesis, implementation, system definition files, floorplan, static bitstream]

VAPRES: Builder Tool
- Overview
  - Automates process of building VAPRES base system and applications
    - Increases designer productivity
- Builder Tool Features
  - Some additional parameters used
    - PRR height and width
  - Automatic creation of VAPRES base system from parameters
    - Base system floorplanning
    - Slice macro instantiation and placement
  - Automatic implementation of static and partial bitstreams
- Assisted framework for application designers
  - Generates VAPRES SW libraries
  - Templates for PRMs and software
[Figure: builder tool diagram with labels: architectural parameters, system floorplan (.ucf), top VHDL entity (.vhd), software specifications (.mss), hardware specifications (.mhs), static base system, PR modules (PRMs), application software]

VAPRES Builder: Results

                   Design 1          Design 2          Design 3          Design 4
Number of PRRs     1                 1                 2                 3
PRR height         1 row (16 CLBs)   2 rows (32 CLBs)  2 rows (32 CLBs)  1 row (16 CLBs)
PRR width          10 CLBs

Post-place and route implementation for base static system
Maximum clock      120.3 MHz         117.6 MHz         116.1 MHz         119.3 MHz

- MACS parameters: N=1, kr=1, kl=1; N=2, kr=2, kl=2; N=3, kr=2, kl=2
- Static region slices (without MACS):
- MACS slices: N/A
- N = number of PRRs = number of MACS switches, kr = number of channels between switches going in the right direction, kl = number of channels between switches going in the left direction
- ≈ 280 more slices when adding an extra PRR (+0 slices, +284 slices, +263 slices)
- 100 MHz constraint met for all placed-and-routed designs
[Figure: floorplans showing PRR boundaries and sets of slice macros (1 set for each PRR)]

VAPRES: Bitstream Relocation
- In-situ bitstream relocation: alters a partial bitstream (with no external inputs) so that it can run in any PRR
- Advantages:
  - Reduces bitstream storage requirements (only one partial bitstream per module)
  - Saves the step of reading a partial bitstream from external flash memory if a similar partial bitstream was already loaded into memory
  - Enables VAPRES to dynamically place and migrate modules
- Restriction: PRRs must be homogeneous (ensures sufficient resources)
- Operation:
  - Only one partial bitstream is necessary for each PRM
  - Partial bitstreams are stored in compact flash
  - When a PRM is needed, its partial bitstream is loaded into the MicroBlaze and the relocator is called
  - The new partial bitstream is loaded into the correct PRR
- Program runs in external memory:
  - Bitstream relocator is stored in non-volatile compact flash
  - System ACE controller loads the relocator from flash and stores it in SDRAM
[Figure: system control region with MicroBlaze, PLB bus, ICAP, SystemACE, flash, UART, SDRAM, and I/O module to external I/O pins and network; data processing region (includes one or more RSBs, Reconfigurable Streaming Blocks) with PRR1, PRR2, FSL interfaces, module interfaces, SCORES switch, clocks clk0 and clk1]

Overview: MACS Communication Architecture
- MACS: Minimal adaptive circuit switching mesh communication architecture
  - VAPRES requires high-bandwidth, low-latency communication channels inside reconfigurable streaming blocks (RSBs)
  - Novel communication architecture named SCORES was implemented in 2008
  - MACS extends SCORES from a linear array topology to a mesh topology, with a few other new features
- Features of MACS
  - Minimal-adaptive routing to explore all possible shortest paths
    - Selects the lowest cost path that best achieves network load distribution
  - Similar interface ports for nodes and neighboring switches
    - Any number (<= 6) of nodes can be put on a single switch
    - Unused interface ports of switches around the edges of the NoC can be utilized
    - Number of node interface ports available in an MxN NoC is <= 2(M*N + M + N)
    - Reduces area overhead of communication architecture per node
  - Provides low-latency path(s) between frequently communicating node pairs (if attached to the same switch)
[Figure: MACS mesh of switches (S), each with attached nodes (N)]
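The port bound on this slide can be checked by counting. The sketch below assumes each MACS switch exposes six identical interface ports (suggested by the "<= 6 nodes on a single switch" remark) and that every switch-to-switch link uses one port on each endpoint; the function names are illustrative and not part of MACS.

```python
def free_node_ports(m: int, n: int, ports_per_switch: int = 6) -> int:
    """Count ports left over for nodes in an m x n mesh of switches.

    Assumption: every switch has `ports_per_switch` identical ports and
    each mesh link consumes one port on each of its two end switches.
    """
    total_ports = ports_per_switch * m * n
    horizontal_links = m * (n - 1)  # links inside each of the m rows
    vertical_links = n * (m - 1)    # links inside each of the n columns
    return total_ports - 2 * (horizontal_links + vertical_links)


def slide_bound(m: int, n: int) -> int:
    """Upper bound quoted on the slide: 2(M*N + M + N)."""
    return 2 * (m * n + m + n)


if __name__ == "__main__":
    for m, n in [(1, 1), (2, 2), (3, 3)]:
        print(m, n, free_node_ports(m, n), slide_bound(m, n))
```

Under these assumptions the count and the quoted bound agree, and a single switch (M = N = 1) gives the six ports mentioned above.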

MACS implementation results (1/2)
- 9 architectural parameters to explore
  - Plotting all combinations is not feasible
  - Assuming two values for each parameter requires 2^9 "area usage" plots and 2^9 "achievable frequency" plots
- Figure 1: Area usage in number of slices per module for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.
- Figure 2: Maximum operating frequency for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.

MACS implementation results (2/2)
- Comparison of NoCs
  - Difficult due to lack of published implementation results from other authors
  - Representative packet-switching NoC [1]
    - Designed and realized by Bartic et al.
    - 8 modules attached in 2D-mesh topology
    - 16-bit wide data
  - Similar circuit-switched NoC, i.e. PNoC [2]
    - Programmable Network on Chip, designed and realized by Hilton et al.
    - Single switch with 8 modules attached to it
    - 16-bit wide data
  - Comparable configuration of MACS
    - 2x2 mesh of MACS switches
    - W=16, Ku=Kd=Kl=Kr=Kil=Kir=1
- Comparison results
  - 5x faster and 1.5x less area overhead than packet-switching NoC
  - 2x faster (with slight area overhead) than PNoC
[Table: network architecture (MACS, packet-switching, PNoC) versus slices, BRAMs, and frequency (MHz)]

1. Bartic, A., Mignolet, J.Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., and Lauwereins, R. "Highly scalable network on chip for reconfigurable systems". In Proceedings of International Symposium on System-on-Chip, 2003, pages 79–82.
2. Hilton C. and Nelson B., "PNoC: a flexible circuit-switched NoC for FPGA-based systems". In Proceedings of Computers and Digital Techniques, 2006, pages

Analytical Modeling
- Analytical model of SCORES/MACS
  - Streaming network
    - FIFO at both ends: producer FIFO (of size D), consumer FIFO (of size C)
    - Pipelined channel/medium: n-stage pipeline
    - Control feedback path: n-stage
- Phases
  - Phase I: analysis of producer-medium and medium-consumer pairs
  - Phase II: analysis of medium-consumer with feedback
[Figure: producer FIFO (size D), n-stage medium, and consumer FIFO (size C), labeled with rates λp, λm, µm, µc]

Phase I: Producer-Medium Pair (1/2)
- Markov-chain model
  - P_k = probability associated with the queue being in state k, i.e. the queue having k packets in it
  - λp = arrival rate
  - μm = service rate
  - D = system capacity
- Flow = sum of products of λ's, μ's and P's
- Solving for steady state gives
  [Equation: steady-state solution for P_k]
[Figure: birth-death chain over states 0, 1, 2, …, k, k+1, …, D with probabilities P0 … PD, arrival rates λp,1 … λp,D-1 and service rates μm,1 … μm,D on the transitions]

Phase I: Producer-Medium Pair (2/2)
- Total probability of the system should be 1
[Plot: P_D versus D (line size), with the values 1 and 1/(D+1) marked]
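The steady-state expressions themselves did not survive in the transcript. As a hedged illustration only, the sketch below uses the textbook finite-capacity single-server queue (M/M/1/D) with constant arrival rate `lam` and service rate `mu`; the slides draw state-dependent rates, so this is a simplification and not necessarily the authors' exact model. It does reproduce the 1/(D+1) value shown on the plot when the two rates are equal.

```python
import math


def state_probabilities(lam: float, mu: float, capacity: int) -> list:
    """Steady-state P_0..P_D of an M/M/1/D queue (constant rates assumed)."""
    rho = lam / mu
    if math.isclose(rho, 1.0):
        # All D+1 states are equally likely when arrivals match service.
        return [1.0 / (capacity + 1)] * (capacity + 1)
    # Normalisation enforces "total probability of the system should be 1".
    norm = (1.0 - rho) / (1.0 - rho ** (capacity + 1))
    return [norm * rho ** k for k in range(capacity + 1)]


def full_probability(lam: float, mu: float, capacity: int) -> float:
    """P_D: probability that the producer FIFO is full."""
    return state_probabilities(lam, mu, capacity)[-1]


if __name__ == "__main__":
    print(state_probabilities(1.0, 2.0, 2))
    print(full_probability(1.0, 1.0, 4))
```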

Phase II: Medium-Consumer Pair with control feedback, 2D Markov Chain Model (1/2)
- Streaming network
  - Number of packets in queue (k)
  - Recently reached threshold (Q)
- Potential queuing at Q = 0
  - Producer is filling with rate λp
  - Service rate is µm
  - At k = D-1, queue switches to de-queuing state
- Potential de-queuing at Q = 1
  - Producer is filling with reduced rate λp,1
  - Consumer is emptying with µm
  - Total probability of state Q = 1 gives the packet drop probability
  - At k = 1, queue switches to queuing state, i.e. Q = 0
[Figure: two-row Markov chain. Row Q=0: states 0 … D-1 with probabilities P0 … P(D-1), arrival rate λp and service rate µm. Row Q=1: states 1 … D with probabilities P(1,1) … P(D,1), reduced arrival rate λp,1 and service rate µm]

Phase II: Medium-Consumer Pair with control feedback, 2D Markov Chain Model (2/2)
- Probability of FIFO being filled with 'k' packets when ρ ≠ 1
- Probability of FIFO being filled with 'k' packets when ρ = 1
- Packet drop probability when ρ ≠ 1
- Packet drop probability when ρ = 1

Real-time Simulation and Profiling of MACS
- Setup for basic experiment
  - One MACS switch with both module interfaces occupied
  - Network frequency = module frequency = 100 MHz
  - Producer and consumer rates are Poisson processes
    - ROM holds MATLAB-generated Poisson distributed intervals based on different λ and µ
    - Producer/consumer loads its counter with a value from ROM and generates/reads a unit of data at counter overflow
  - ChipScope ILA core captures all FIFO activity
  - System parameters: FIFO sizes = 512 bytes, network BW = 400 MBps, producer rate = 40 MBps, consumer rate = 4 MBps (both generate data at Poisson distributed random intervals), transfer size = 0-128 KB
- Results
  - Link utilization = 1/10.35, before consumer FIFO is full (at transfer size ~46 KB)
  - Link utilization = 1/ , after consumer FIFO is full (at transfer size > 46 KB)
  - Activity of both FIFOs and the probability distribution of the consumer FIFO being 'almost' full are also plotted w.r.t. transfer size
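The ROM contents described in the setup can be approximated offline. This is a minimal sketch, assuming a 4-byte data unit (400 MBps at 100 MHz implies 4 bytes per cycle) and using Python's `random.expovariate` in place of the MATLAB script, which is not shown in the slides. All function names are hypothetical.

```python
import random


def poisson_intervals(rate_bytes_per_s, clock_hz, bytes_per_unit, count, seed=0):
    """Return `count` gaps between data units, in clock cycles.

    Exponentially distributed gaps produce Poisson arrivals. Gaps are
    rounded to whole cycles with a minimum of one cycle, as a counter
    loaded from ROM would require.
    """
    rng = random.Random(seed)
    units_per_cycle = rate_bytes_per_s / bytes_per_unit / clock_hz
    return [max(1, round(rng.expovariate(units_per_cycle))) for _ in range(count)]


def ideal_link_utilization(source_rate, network_bandwidth):
    """Fraction of link bandwidth used if the source is never stalled."""
    return source_rate / network_bandwidth


if __name__ == "__main__":
    gaps = poisson_intervals(40e6, 100e6, 4, 20000)
    print(sum(gaps) / len(gaps))               # close to 10 cycles per unit
    print(ideal_link_utilization(40e6, 400e6))  # ideal producer-limited value
```

With the slide's parameters the ideal producer-limited utilization is 1/10, which sits close to the measured 1/10.35 before the consumer FIFO fills.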

Real-time Simulation and Profiling of MACS
- Setup for advanced experiment
  - 3x3 MACS NoC with both module interfaces occupied for each switch
  - Network frequency = module frequency = 100 MHz
  - Producer and consumer rates are linear
  - ChipScope ILA core captures all activities such as request establishment, write enables for FIFO (used in link utilization calculation), average number of retrials for establishing a channel, average channel establishment latency, etc.
  - Observe the aforementioned parameters for various network traffic patterns
- Network traffic generation patterns
  - Uniform Random: module chooses a random destination among all the other modules and sends a packet to that destination. The probability is equal among the other modules.
  - Nearest Neighbor: each node sends a packet to a module of its immediate neighbor switch with equal probability.
  - Tornado: {X, Y} will send packets to destination {X+k/2−1, Y} mod k for the k-ary network (k=4).
  - Transpose: router at address {X, Y} will send a packet to router {Y, X}.
  - Bit Complement: node with address {b0,b1,b2,b3} in bits will send packets to the destination address NOT{b0,b1,b2,b3} in bits.
  - Hot Spot: all the nodes send packets to a certain node. The hot spot can act as receiver only or can be both transmitter and receiver.
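The deterministic patterns in the list above reduce to small address functions. The sketch below is illustrative only: the coordinate and address encodings are assumptions (integer X/Y coordinates, a 4-bit node address), and the function names are not taken from the MACS test bench.

```python
def tornado(x: int, y: int, k: int = 4):
    """Tornado: {X, Y} -> {X + k/2 - 1, Y} mod k for a k-ary network."""
    return ((x + k // 2 - 1) % k, y % k)


def transpose(x: int, y: int):
    """Transpose: router {X, Y} sends to router {Y, X}."""
    return (y, x)


def bit_complement(address: int, bits: int = 4) -> int:
    """Bit complement: invert every bit of a `bits`-wide node address."""
    return ~address & ((1 << bits) - 1)


def hot_spot(source, hot_node):
    """Hot spot: every node targets the same destination."""
    return hot_node


if __name__ == "__main__":
    print(tornado(3, 2))            # wraps around the k = 4 dimension
    print(transpose(1, 2))
    print(bin(bit_complement(0b0101)))
```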

Overview: Design Automation for Partial Reconfiguration (DAPR)
- Xilinx Early Access (EA) PR flow provides PR system design support
  - Existing PR flow is very specialized
  - Requires target device architecture knowledge
  - System designer must manually apply steps
    - Hierarchical coding of HDL design description, synthesis, floorplanning, timing analysis, implementation, and merge
- DAPR design flow will mitigate existing PR design flow intricacies
  - Manual steps
    - Hierarchical HDL design description
    - Modified HDL design description via system designer annotations
    - System designer annotated design constraints (optional)
  - Automated steps
    - DAPR inputs: modified HDL design description and design constraints (parameters include bitstream size, timing, power)
    - DAPR design exploration: iteratively generates a candidate design and compares its performance parameters with the system designer annotated constraints
    - DAPR output: the final bitstreams if the system designer constraints are met; otherwise, the final bitstreams that match the annotated constraints most closely
[Figure: EA PR flow (manual steps): HDL design description, HDL synthesis, set design constraints, implement base design, implement PR modules, merge, timing/placement analysis. DAPR design flow: HDL design description, modified HDL design description, design constraints (optional), DAPR tool (automated steps), final generated bitstreams]

Overview: DAPR Tool Phases and Description
- Information Extraction
  - Extract static and PR region instantiations and corresponding HDL design description filenames from the top-level HDL design description file
- Information Collection
  - Collect and write port connection names and widths within each instantiation to the partial reconfiguration automation information file (*.paif)
- Resource Estimation and Constraint Generation
  - Synthesize all HDL design description files with the Xilinx XST utility
  - Read and record estimated slice requirements from the generated synthesis log file (.srp) to the .paif
  - Generate connectivity information and PRR floorplan using estimated resources and device information libraries
- Bitstream Generation
  - Implement static region and PRMs with Xilinx's ngdbuild, MAP, and PAR utilities
  - Merge top, static, and PRMs with Xilinx's PR_verify design and PR_assemble utilities to generate final full and partial bitstreams
[Figure: initial input is the modified VHDL top file, where the DAPR tool starts. Phase 1, Information Extraction: PRR identification and static region identification from the VHDL top file. Phase 2, Information Collection: run script to synthesize modules and estimate resource requirements; PR automation information file (.paif). Phase 3, Overlay Generation: perform automated floorplanning using device information libraries (.dilf) and write to the User Constraint File (UCF). Phase 4, Bitstream Generation: implement and merge design; generated full and partial bitstreams]

System Designer Annotations and Connectivity Information Examples
- A simple example design with two PRRs
  - Two 32-bit up and down counter modules map to PRR 1
  - Two 8-bit up and down counter modules map to PRR 2
  - Connectivity information is gathered from the .paif file and a connectivity graph is generated for system designer verification
- Example system designer annotations (case insensitive) and their significance
  - --PRR_Start :: filename, filename… : identifies beginning of PRR instantiation and PRM filenames (use comma to specify multiple filenames)
  - --Static_Start :: filename, filename… : identifies static region instantiation and filenames (use comma to specify multiple filenames)
  - --bm_start : identifies slice macro instantiation
  - --PRR_clock : identifies system top-level clock
- Annotated VHDL example
    PRR_start :: prm_up, prm_down
    reconfig : rmodule Port Map( led_in => rm_in_int, led_out => rm_out_int);
    static_start :: static
    led_registers : base Port Map( clk => clk, led_unreg => rm_out, led_reg => rm_in);
    bm_start
    in0 : busmacro_xc4v_l2r_sync_narrow Port Map( input0 => bml2r(0), input1 => bml2r(1), input2 => bml2r(2),
- Connectivity information example
  - Design connectivity information table (module name/type; incoming connections; outgoing connections)
    - Base/Static: 40
    - Counter/PR: 32
    - Counter_sm/PR: 8, 8
[Figure: design connectivity graph with nodes Counter, Static Region, and Counter_sm; edges labeled 32 and 8]
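The information-extraction phase can be pictured as a scan for these comment annotations. The following is a rough sketch, not the DAPR implementation: it assumes each annotation sits on its own line as a VHDL comment beginning with `--`, matches keywords case-insensitively as the slide states, and uses a hypothetical `extract_annotations` helper.

```python
import re

# Keywords come from the slide; the one-annotation-per-line layout is assumed.
ANNOTATION = re.compile(
    r"^\s*--\s*(prr_start|static_start|bm_start|prr_clock)\b\s*(?:::\s*(.*))?$",
    re.IGNORECASE,
)


def extract_annotations(vhdl_text: str):
    """Return (keyword, [filenames]) pairs in the order they appear."""
    found = []
    for line in vhdl_text.splitlines():
        match = ANNOTATION.match(line)
        if match:
            keyword = match.group(1).lower()
            names = (match.group(2) or "").split(",")
            found.append((keyword, [n.strip() for n in names if n.strip()]))
    return found


SAMPLE = """\
--PRR_start :: prm_up, prm_down
reconfig : rmodule port map (led_in => rm_in_int, led_out => rm_out_int);
--static_start :: static
led_registers : base port map (clk => clk, led_unreg => rm_out, led_reg => rm_in);
--bm_start
"""

if __name__ == "__main__":
    for keyword, files in extract_annotations(SAMPLE):
        print(keyword, files)
```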

DAPR V4LX25 Device Library
- Device divided into 3 banks
  - Bank 0 (left), Bank 1 (right), Bank 2 (center)
- Resource representation
  - Single letter with a prefix of either 1 or 0
    - Letters are S for slices, D for DSP48s, F for FIFO16s, R for RAMB16s, C for DCMs, G for BUFGs
    - Prefix of 0 means resource occupied, 1 means resource vacant
  - Checking individual values identifies both the resource type and its availability
- Device library file will be shown in the demo
[Figure: V4LX25 device layout with Bank 0, Bank 2, and Bank 1]
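Since the device library file itself is only shown in the demo, the sketch below guesses at a minimal encoding: each resource is a two-character token such as `1S` or `0D`, separated by whitespace. Only the letter and prefix meanings come from the slide; the token layout and helper names are assumptions.

```python
from collections import Counter

# Letter meanings as listed on the slide.
RESOURCE_NAMES = {
    "S": "slice", "D": "DSP48", "F": "FIFO16",
    "R": "RAMB16", "C": "DCM", "G": "BUFG",
}


def parse_token(token: str):
    """Decode one token into (resource name, is_vacant)."""
    if len(token) != 2 or token[0] not in "01" or token[1].upper() not in RESOURCE_NAMES:
        raise ValueError(f"unrecognised resource token: {token!r}")
    # Prefix 1 means vacant, prefix 0 means occupied.
    return RESOURCE_NAMES[token[1].upper()], token[0] == "1"


def count_vacant(tokens):
    """Tally vacant resources by type."""
    tally = Counter()
    for token in tokens:
        name, vacant = parse_token(token)
        if vacant:
            tally[name] += 1
    return dict(tally)


if __name__ == "__main__":
    print(count_vacant("1S 1S 0S 1D 0R".split()))
```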

DAPR Overlay Generation
- Overlay generation uses the cluster growth algorithm
- Cluster growth algorithm works in two steps
  - Linear ordering of modules
    - Choose a seed module from the initial set of modules and move it to a new set of ordered modules (initially an empty set)
    - Compute the gain for each remaining module (gain is the number of connecting nets)
    - Move the module with the highest gain to the set of ordered modules, and repeat from the gain computation until no modules remain in the initial set
  - Place ordered modules on the floorplan space
- Two types of floorplan growth: vertical and diagonal
  - Current overlay generator builds floorplans vertically
    - Advantage: bitstream size will be smaller
    - Disadvantage: routing is difficult and will take longer
[Figure: floorplan growth direction, diagonal (left) and vertical (right); colored blocks represent PRMs; 1 CLB wide and 16 CLB tall]
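The linear-ordering step can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the DAPR code: gain is taken to be the number of nets between a candidate and the modules already ordered, ties are broken alphabetically, and the net counts in the example are hypothetical values loosely modeled on the counter design.

```python
def linear_order(modules, nets, seed):
    """Greedy linear ordering for cluster growth.

    `nets` maps frozenset({a, b}) to the number of nets between a and b.
    """
    ordered = [seed]
    remaining = set(modules) - {seed}

    def gain(module):
        # Nets connecting this module to everything already ordered.
        return sum(nets.get(frozenset((module, placed)), 0) for placed in ordered)

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        ordered.append(best)
        remaining.remove(best)
    return ordered


if __name__ == "__main__":
    example_nets = {
        frozenset(("static", "counter")): 32,     # hypothetical net counts
        frozenset(("static", "counter_sm")): 8,
    }
    print(linear_order(["static", "counter", "counter_sm"], example_nets, "static"))
```

Placing the resulting order vertically or diagonally is a separate step and is not covered here.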

Results: Low-Level DAPR Design Flow
- Numerical results
  - Case study implementation results with a 32-bit counter
  - More designs are under test
    - CORDIC
    - FFT
    - Matrix multiplier
[Table: iteration no.; clock (MHz); power (mW); PRR size (CLBs); partial bitstream size (KB)]

Kalman Filter Case Study
- Data format
  - For the X and Y coordinates
    - 16-bit fixed-point representation: 1 sign bit, 8 integral bits, and 7 fractional bits
  - For the 2 FIFOs
    - Implemented using one Virtex-4 BRAM
    - Each one is 32 bits wide (16 for X and 16 for Y) and 512 words deep
- The process of the system
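The coordinate format can be exercised in software before committing to hardware. The sketch below assumes two's-complement encoding (the slide only says "1 sign bit") and places X in the upper half of the 32-bit FIFO word, which is also an assumption; the helper names are illustrative.

```python
FRAC_BITS = 7    # fractional bits from the slide
WORD_BITS = 16   # 1 sign + 8 integral + 7 fractional


def to_fixed(value: float) -> int:
    """Encode a real number as a 16-bit word, saturating on overflow."""
    raw = round(value * (1 << FRAC_BITS))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    raw = max(lowest, min(highest, raw))
    return raw & 0xFFFF  # two's-complement bit pattern (assumed encoding)


def from_fixed(word: int) -> float:
    """Decode a 16-bit word back to a real number."""
    if word & 0x8000:
        word -= 1 << WORD_BITS
    return word / (1 << FRAC_BITS)


def pack_xy(x: float, y: float) -> int:
    """Pack X (upper 16 bits, assumed) and Y (lower 16 bits) into one FIFO word."""
    return (to_fixed(x) << 16) | to_fixed(y)


def unpack_xy(word: int):
    return from_fixed(word >> 16), from_fixed(word & 0xFFFF)


if __name__ == "__main__":
    print(hex(to_fixed(1.5)), hex(to_fixed(-1.0)))
    print(unpack_xy(pack_xy(3.25, -2.5)))
```

Under these assumptions the representable range is -256 to just under +256 with a resolution of 1/128.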

Kalman filter: Introduction
- Application
  - Target tracking in a linear system: provide accurate, continuously updated information about the position of a target given a sequence of observations about its position
    - Dynamic model and measurement model are linear
    - Noises are Gaussian distributed
- The system model
  - The dynamic system model
  - Uniform velocity motion
  - The measurement model

Kalman filter algorithm
- Initialization
- Predict
  - Predicted state
  - Predicted covariance
- Update
  - Innovation measurement
  - Innovation covariance
  - Optimal Kalman gain
  - Updated state estimate
  - Updated estimate covariance
- The simplified version: fixed-gain Kalman filter
  - Difference: the optimal Kalman gain is acquired before processing and kept fixed
  - Application: if the system is a stationary stochastic process, the Kalman gain does not change
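The slide's equations are not preserved in the transcript. For orientation, the standard discrete-time Kalman filter without a control input takes the following form; the symbols ($F$ state transition, $H$ measurement matrix, $Q$ and $R$ process and measurement noise covariances, $z_k$ measurement) are the usual textbook notation and may differ from the notation the authors used.

```latex
\begin{align*}
\text{Predicted state:}\quad & \hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1} \\
\text{Predicted covariance:}\quad & P_{k|k-1} = F\,P_{k-1|k-1}\,F^{T} + Q \\
\text{Innovation:}\quad & \tilde{y}_k = z_k - H\,\hat{x}_{k|k-1} \\
\text{Innovation covariance:}\quad & S_k = H\,P_{k|k-1}\,H^{T} + R \\
\text{Optimal Kalman gain:}\quad & K_k = P_{k|k-1}\,H^{T}\,S_k^{-1} \\
\text{Updated state estimate:}\quad & \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,\tilde{y}_k \\
\text{Updated estimate covariance:}\quad & P_{k|k} = (I - K_k H)\,P_{k|k-1}
\end{align*}
```

In the fixed-gain variant, $K_k$ is replaced by a constant $K$ computed beforehand, so the covariance and gain equations drop out of the per-sample loop.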

Type 1: Fixed-gain Kalman filter
- 8 multiplications
- Read and write FIFOs for the Kalman filter part
  - Process control: if the TX FIFO is full, stop writing and stop reading data from the RX FIFO, which stops data processing
  - Guaranteed time interval: at least 3 clock cycles
  - Parameter input: parameters (fixed Kalman gain, initial values) are supplied as inputs instead of being pre-programmed in the system
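A floating-point reference model is a convenient way to check a fixed-gain design against the hardware output. The sketch below handles a single axis with a uniform-velocity model and a two-element gain; the gain values, time step, and function name are illustrative assumptions, and the sketch does not try to reproduce the multiplier count or the X/Y datapath of the FPGA design.

```python
def fixed_gain_step(state, measurement, gain, dt=1.0):
    """One predict/update cycle of a fixed-gain filter for one axis.

    state: (position, velocity); gain: (k_position, k_velocity).
    """
    position, velocity = state
    k_position, k_velocity = gain
    # Predict with the uniform-velocity motion model.
    predicted_position = position + dt * velocity
    predicted_velocity = velocity
    # Innovation: difference between measurement and prediction.
    residual = measurement - predicted_position
    # Update with the precomputed, constant gain.
    return (predicted_position + k_position * residual,
            predicted_velocity + k_velocity * residual)


if __name__ == "__main__":
    state = (0.0, 0.0)
    for step in range(1, 6):
        state = fixed_gain_step(state, float(step), (0.5, 0.25))
        print(step, state)
```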

Results & Analysis
- For flexibility of application, use 8 DSP48s to instantiate the multipliers
- Resource consumption (V4LX25)
  - Number of slices: 280 (2%)
  - Number of DSP48s: 8 (16%)
- Maximum frequency: MHz; throughput: 52 MSPS (3 cycles)
- Dynamic power consumption (100 MHz CLK): W
- Estimated results comparison
  - Bouncing ball experiment
  - Fixed-gain Kalman filter is suitable
  - Results calculated by the FPGA are identical to MATLAB

Type 2: Basic version of Kalman filter
- Assuming all noises are non-coherent, four elements in the Kalman gain matrix are zero
- 4 divisions and 12 multiplications

Results & Analysis
- Reduce the number of dividers and multipliers by resource reuse
- Estimated results comparison
  - Bouncing ball experiment
  - Kalman filter gain updates in each iteration
  - Results calculated by the FPGA are identical to MATLAB

                            4 divs & 12 muls   2 divs & 6 muls   1 div & 3 muls
Slices (V4LX25)             1958 (18%)         1316 (12%)        1033 (9%)
DSP48s                      12 (25%)           6 (12%)           3 (6%)
Max. frequency              71.4 MHz
Processing time             23 clock cycles    24 clock cycles   26 clock cycles
Throughput                  3.1 MSPS           2.9 MSPS          2.7 MSPS
Dynamic power (50 MHz CLK)  W                  W                 W
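The throughput figures in the table follow from dividing the clock rate by the cycles needed per sample. The short check below assumes the single 71.4 MHz figure applies to all three variants and that the slide truncates to one decimal place; both are inferences from the numbers, not statements in the slides.

```python
import math


def throughput_msps(fmax_mhz: float, cycles_per_sample: int) -> float:
    """Samples per second (in millions) when one sample takes a fixed cycle count."""
    return fmax_mhz / cycles_per_sample


def truncate_one_decimal(value: float) -> float:
    """Drop digits after the first decimal place (no rounding up)."""
    return math.floor(value * 10) / 10


if __name__ == "__main__":
    for cycles in (23, 24, 26):
        exact = throughput_msps(71.4, cycles)
        print(cycles, round(exact, 3), truncate_one_decimal(exact))
```

The trade-off is visible in the arithmetic: sharing dividers and multipliers adds only one to three cycles per sample while roughly halving the slice and DSP48 usage at each step.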