HPEC using FPGAs: Challenges and Benefits. Utah State University.


1 HPEC using FPGAs: Challenges and Benefits

2 Utah State University
- Cache Valley, 90 miles north of Salt Lake City
- David Sant Engineering Innovation Building

3 Agenda
- On-board computing for spacecraft
- A primer on FPGAs (5 slides)
- HPEC using FPGAs (26 slides)
  - The Polymorphic Systolic Array Framework
  - Improving productivity
  - Enabling real-time and responsive reconfiguration
- Future technologies for FPGAs
- Acknowledgements

4 On-board Computing
- Civilian and military space missions are getting more complex
  - Need to support several types of data from several types of sensors
- Missions will require the spacecraft computer to be more responsive
  - Need for in-situ data processing (signal processing)
  - Not just compression, but also data analysis, decision making, etc.
- Power budget and form factor of the spacecraft computer are extremely tight
- State of the art: RadHard microprocessor from BAE Systems or RISC processor?
  - Aging workhorse; time for a major upgrade

5 So, what do we upgrade to?
- Commodity microprocessors
  - Cell, GPU, many/multi-core
  - Very powerful
  - Blows out the power budget
  - RadHard parts need to be custom ordered
- Commodity DSP chips
  - Good as long as you stick to just one chip
  - RadHard parts can be custom ordered
- Commodity reconfigurable chips
  - FPGAs (field-programmable gate arrays)
  - Can perform like a custom silicon chip
  - Best performance/power ratios
  - RadHard parts already available, with a steady roadmap from Xilinx

6 Programming perspective (an optimistic viewpoint)
- Microprocessors: frozen pizza
- DSP chips: take 'n' bake
- FPGAs: raw ingredients

7 Quick Primer on FPGAs
- Mixture of blocks on a die
- Some dedicated
  - DSP (MAC units)
  - PPC (optional)
  - RAM
- Some programmable
  - Look-Up Tables (LUTs)
  - Gazillions of network switches
- Hidden
  - Special circuit: ICAP (internal configuration access port)

8 Simple View of Programming an FPGA
- An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic
- All computations are assumed to be based on Boolean logic
- So, problem-solving concept => algorithms
- Algorithms => discrete set of simple tasks (add, multiply, ...)
- Simple tasks => a set of Boolean functions talking to each other
- Boolean function => simple manipulation of 1 and 0 bits
- Each bit is stored in a small memory cell (SRAM)
[Figure: NMOS transistor]

9 Programming an FPGA
- Each Look-Up Table (LUT) has a unique mailing address
  - 16 bits go into each LUT
- Each routing switch has a unique mailing address
  - One bit for each switch
- The executable for an FPGA is a sequence of bits that has to be delivered precisely to each LUT and switch box
- This binary/executable is called the "configuration bitstream", or simply the "bitstream"
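
To make the "16 bits go into each LUT" point concrete, here is a minimal Python sketch of a 4-input LUT modeled as a 16-entry truth table. The names `make_lut_bits` and `lut_eval` and the bit ordering are illustrative assumptions; they do not describe the actual configuration layout of any Xilinx device.

```python
def make_lut_bits(func):
    """Build the 16 configuration bits of a 4-input LUT from a Boolean function.

    Bit i of the result holds the output for the input pattern whose binary
    encoding is i (input a is the least significant bit). This ordering is
    an assumption made for illustration only.
    """
    bits = 0
    for i in range(16):
        a, b, c, d = i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1
        if func(a, b, c, d):
            bits |= 1 << i
    return bits


def lut_eval(bits, a, b, c, d):
    """Evaluate a configured LUT: the four inputs simply index into the 16 bits."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (bits >> index) & 1


# Configure one LUT as a 4-input AND and another as a 4-input XOR (parity).
and4 = make_lut_bits(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut_bits(lambda a, b, c, d: a ^ b ^ c ^ d)
```

The same "hardware" (`lut_eval`) computes either function; only the 16 stored bits differ, which is the sense in which loading SRAM cells programs the logic.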

10 Programming an FPGA
- Programming the FPGA is like having a mailman deliver bits to each address correctly
  - Slow process
- But a bitstream is slightly more complex
  - Each FPGA is like a country (has a unique code)
  - Before entering the chip, a bitstream has to undergo security clearance (CRC, or cyclic redundancy check)
  - Port of entry = ICAP
  - FPGA addresses are hierarchical (state, county, city, suburb, house address)
  - The term used for encoding all this is "frame address"
- All this address stuff is overhead
  - The actual useful stuff is inside the mail envelope
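
The postal analogy can be sketched as a toy packet format: a device code, a hierarchical address per frame, the payload, and a trailing CRC checked at the port of entry. Everything about the layout below is a hypothetical simplification (four 8-bit address levels, a 16-bit length field, and `zlib.crc32` standing in for the device's own CRC); it is not the real bitstream format.

```python
import struct
import zlib


def pack_frame_address(state, county, city, house):
    """Pack a hierarchical address into one 32-bit word (8 bits per level).

    The four levels and their widths are hypothetical; they only mirror the
    postal analogy, not a real device's frame address format.
    """
    for field in (state, county, city, house):
        if not 0 <= field <= 0xFF:
            raise ValueError("each address field must fit in 8 bits")
    return (state << 24) | (county << 16) | (city << 8) | house


def build_bitstream(device_code, frames):
    """Serialize (address, payload) pairs and append a CRC over everything.

    Assumed layout: device code, then for each frame its address, payload
    length and payload bytes, then a trailing CRC-32.
    """
    body = struct.pack(">I", device_code)
    for address, payload in frames:
        body += struct.pack(">IH", address, len(payload)) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def check_bitstream(stream, expected_device_code):
    """The 'port of entry' check: right country code and an intact CRC."""
    body, (crc,) = stream[:-4], struct.unpack(">I", stream[-4:])
    (device_code,) = struct.unpack(">I", body[:4])
    return device_code == expected_device_code and zlib.crc32(body) == crc
```

In this toy format a 2-byte payload travels inside a 16-byte stream, which illustrates the slide's point that addressing and checking are overhead around the useful bits.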

11 So what does a real configured/programmed FPGA look like?
- Before programming: a nice clean plate (empty LUTs, switches, ...)
- After programming: a messy plate of spaghetti (configured LUTs, switches, ...)
- All those green things are wires that have been set up to carry data between LUTs, FFs, etc.

12 High Performance Embedded Computing (HPEC) using FPGAs
- Signal processing algorithms
  - Wildly useful and hence widely used
  - Computationally quite parallel and pipeline-amenable
  - Proven to be accelerable by systolic array designs on FPGAs
- The Good of FPGAs:
  - FPGAs claim an orders-of-magnitude performance advantage over DSP chips (www.xilinx.com)
  - They can be reconfigured partially and dynamically
- The Bad (no, the Ugly):
  - Productivity is the biggest barrier
  - The number of signal processing folks willing to adopt FPGAs is small and stagnant
  - Partial dynamic reconfiguration is very slow compared to processing speeds

13 Elaborating the Good of FPGAs: Extreme DSP computing

14 Elaborating the Good of FPGAs: Partial Dynamic Reconfiguration
- At some point in time, abruptly, say we need to quickly increase parallelism support for application α (from 4 to 5 circuits)
- This comes at the cost of taking away parallelism support for the other application, because
  - we did not have enough space on the chip to support high levels of parallelism for both applications, or
  - there was a power budget we couldn't satisfy
- Can we dynamically reconfigure the chip without disturbing the execution of either application? And do it fast enough?
- Remember, programming the FPGA is a very, very slow process RELATIVE to the execution speeds of applications
[Figure, animated: the FPGA starts with 4 parallel processing circuits for application α and 7 for application β; one β circuit is removed (4 and 6), then that region is reconfigured as an α circuit (5 and 6).]

15 Productivity
- It's a funny thing in the FPGA world
- FPGA programmers are essentially VLSI design guys
  - They don't buy $5K parts to get average performance
  - Every clock cycle is precious
  - Every LUT/FF/MAC/BRAM is precious
  - They don't adopt new programming languages in a hurry
  - They love to have full control over every operation

16 Productivity, so what does it mean?
- The designer wants an entire system on an FPGA modeled, performance predicted, designed, implemented, debugged and verified, with guaranteed timing closure, low power, high throughput, ...
- Done really, really fast, just like software
- And then wants to make some minor changes and quickly do it all over again, just like software

17 Why can't new designs be compiled, loaded onto FPGAs and tested super fast?
- Need to look at the traditional design flow
1. Hardware-software partition (quick)
2. Create macro- and micro-architectures for the hardware portion (a month or two)
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
4. Synthesize, translate, map, place and route (5 to 15 hours)
5. Simulate
   - If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
   - Test again
   - If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 2
8. Good luck trying to finish your project on time and on budget
9. This will still not get you a dynamically reconfigurable design

18 One way to Improve Productivity
- Stick to the traditional design flow as much as possible
  - FPGA users are once bitten, twice shy
  - Very conservative, and believe in the existing flow
- But introduce structure into the flow, i.e. physical structure and macro-architecture structure
- Make Partial Dynamic Reconfiguration (PDR) almost automatic
  - FPGA designers are not conversant with PDR designs

19 Augmented Design Flow: Exclusively for Signal Processing Algorithms
- Hardware-software partitioning (just a concept, and specific to an application)
- Structured macro-architecture via floor planning
  - Generic structure applicable to many algorithms
- Structured micro-architecture design
  - Project and schedule the data-flow model of the signal processing kernel onto the Sockets of the macro-architecture
  - Well-understood process
- Embed dynamic reconfiguration capability
  - New technology
  - Works in tandem with the macro-architecture
- Code, synthesize, ...
- Test on chip

20 Structured Macro-architecture
- Some important terms/elements:
  - Socket: a physical region on the FPGA chip reserved by the designer to be loaded (configured) with a PE. Also called a Partial Reconfiguration Region (PRR)
  - Switch Box: a circuit that makes the array of Sockets re-partitionable
  - PE (Processing Element): a circuit/bitstream that implements the systolic array data-flow functionality of a signal processing kernel. To activate a Socket, a PE must be loaded into it

21 Socket/PRR: Under the Hood
- Yellow box: a socket/PRR
  - It contains BRAMs, MACs and LUTs/FFs (the purple and blue/green/black stuff)
- If you want to dynamically reconfigure the parallelism of systolic arrays on an FPGA, all PRRs must be created with identical resources (MACs, BRAMs, LUTs, FFs)
[Figure: physical fabric of the Virtex SX 35 FPGA]

22 Switch Box: Stuff that Makes the Array of Sockets Re-partitionable
- Simple circuit
  - Need to set mux select lines and FIFO controls
- Resides in the static region of the FPGA
- Change Switch Box (SB) connections to change the partitioning of sockets/PRRs between the nodes of the systolic array kernels

23 OK, time to port the Macro-architecture Framework onto the Chip

24 What really happened when we tried it
[Figure: floorplan of the Virtex 4 SX 35]
- Static region (luminescent green stuff): microprocessor, Switch Boxes, cache controller
- PRRs/Sockets (white boxes): to be filled with systolic array Processing Elements

25 Now to the Micro-architecture... First, Hardware-Software Partitioning
- Example: Extended Kalman Filter (EKF), a critical navigation algorithm and a nasty signal processing kernel
- All the stuff with rounded edges are tasks that can change based on the physics of the problem, so put it all in software (MicroBlaze)
- All else is consistent, so put it in hardware (PolySAF)

26 Designing/Deriving the Processing Element: Example EKF
- Works on the Faddeev algorithm to compute the Schur complement
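
For readers unfamiliar with the Faddeev algorithm, the following is a minimal software sketch of the idea, not the PE design from the talk: stack the blocks as [[A, B], [-C, D]], run Gaussian elimination until the -C block is zero, and the block in D's place then holds D + C·A⁻¹·B (the Schur complement of A in that compound matrix). The function name `faddeev` and the use of exact `Fraction` arithmetic are choices made for this illustration.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inv(A) * B by Gaussian elimination on [[A, B], [-C, D]].

    A is n x n and must be non-singular; B, C and D are lists of rows with
    matching shapes. Exact rationals avoid rounding noise in this sketch.
    """
    n = len(A)
    top = [[Fraction(x) for x in A[i] + B[i]] for i in range(n)]
    bottom = [[Fraction(-x) for x in C[i]] + [Fraction(x) for x in D[i]]
              for i in range(len(C))]
    M = top + bottom

    for col in range(n):
        # Pick a non-zero pivot among the remaining rows of the A block.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every row below the pivot,
        # including the rows of the -C block.
        for r in range(col + 1, len(M)):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]

    # The -C block is now zero; what sits in D's place is the result.
    return [row[n:] for row in M[n:]]
```

Choosing B and C as identity matrices and D as zero makes the routine return the inverse of A, which is one reason a single elimination array can serve several matrix operations.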

27 One of the many possible ways
[Figure label: Port]

28 Code, Synthesize, ... Optimize
- Port: code, synthesize, translate, map, place and route
  - For one Socket/PRR (just a few days' worth of work)
- Move nets around to meet timing: manually pick up a wire in this small bowl of spaghetti and move it around
  - A nuisance of a task, but necessary
  - But you need to do it in only one PRR (just a few hours' worth of work)
- Copy the locally optimized bitstream/circuit of the one PRR to all PRRs
  - Automatically obtain global timing closure for the PolySAF
- If the microprocessor and cache are retained for multiple designs, then global timing closure for the whole chip is also automatically gifted to you

29 Have we answered the Productivity problem? Time to Grade the Approach
- Need to look at the traditional design flow
1. Hardware-software partition (quick)
2. Create macro- and micro-architectures for the hardware portion (a month or two)
   - Applicable to a wide range of signal processing algorithms
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
   - Reuse most of the macro structure and code only for one PRR
4. Synthesize, translate, map, place and route (5 to 15 hours)
   - Do for only one PRR
5. Simulate
   - If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
   - Test again
   - If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 3
8. Good luck trying to finish your project on time and on budget

30 Want the details, the math, the algorithms, etc.?
- Read this paper:
  - A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, "Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms," IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan.
  - Author preprint available online at the Reconfigurable Computing Group

31 Now, onto Partial Dynamic Reconfiguration in the PolySAF
- Start: 3 nodes EKF, 2 nodes DWT
- Detach Socket: 2 nodes EKF, 2 nodes DWT
- Reconfigure, reset new PRR, re-attach: 2 nodes EKF, 3 nodes DWT
- DWT: discrete wavelet transform, the kernel used in JPEG 2000 image compression
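
The detach, reconfigure, reset, re-attach sequence can be mimicked with a small bookkeeping model. This is only a sketch of the ordering of steps: the dictionary layout, the `state` field and the function names are hypothetical, and no real reconfiguration hardware is driven.

```python
from collections import Counter


def reassign_socket(sockets, index, new_kernel):
    """Move one socket to another kernel using the four steps on the slide.

    `sockets` is a list of dicts with keys "kernel", "attached" and "state".
    The dict layout and step names are illustrative assumptions.
    """
    socket = sockets[index]
    steps = []

    socket["attached"] = False      # 1. detach from its current array
    steps.append("detach")

    socket["kernel"] = new_kernel   # 2. load the new PE into the PRR
    steps.append("reconfigure")

    socket["state"] = 0             # 3. reset the freshly loaded PE
    steps.append("reset")

    socket["attached"] = True       # 4. switch boxes splice it back in
    steps.append("re-attach")
    return steps


def active_nodes(sockets):
    """Count attached sockets per kernel."""
    return Counter(s["kernel"] for s in sockets if s["attached"])


# Start as on the slide: 3 EKF nodes and 2 DWT nodes.
sockets = [{"kernel": k, "attached": True, "state": 0}
           for k in ("EKF", "EKF", "EKF", "DWT", "DWT")]
steps = reassign_socket(sockets, 2, "DWT")
```

While the socket is detached, the other four sockets stay attached, which is the property that lets both kernels keep running during the change.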

32 How to Physically Reconfigure a PRR?
- Known methods

33 Comparison of all known options
- Best known technique: from Microsoft Research Labs (2008), eMIPS project
- Too slow, too expensive (hogs valuable on-chip BRAMs)

34 Embedding Dynamic Reconfiguration into the System
- Active bitstream (PRR) to PRR: hardware circuit
[Figure: block diagram of the FPGA showing the ARC, the ICAP and ICAP wrapper, a source PRR with its active bitstream, destination PRRs, and a snoop path]

35 Accelerated Relocation Circuit (ARC)
- Manipulate frame addresses
  - FAR is the Frame Address Register
- Lots of unnecessary overhead can be avoided
  - No need for CRC processing
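
A rough software analogy of relocation by frame-address manipulation: the payload of each frame is copied unchanged and only its address is rewritten to point at the destination PRR, with no CRC recomputed. This is a sketch under simplifying assumptions (flat integer addresses and a constant offset between identical PRRs), and the function names are hypothetical; the real ARC is a hardware circuit.

```python
def relocate_frames(frames, source_base, destination_base):
    """Retarget a PRR's frames by rewriting only their frame addresses.

    `frames` is a list of (frame_address, payload) pairs read from the
    source PRR. Flat integer addresses and a fixed offset between PRRs are
    simplifying assumptions; real frame addresses are hierarchical.
    """
    offset = destination_base - source_base
    relocated = []
    for address, payload in frames:
        if address < source_base:
            raise ValueError("frame lies outside the source PRR")
        # The payload (LUT and switch contents) is copied untouched;
        # only the value destined for the frame address register changes.
        relocated.append((address + offset, payload))
    return relocated


def broadcast(frames, source_base, destination_bases):
    """Copy one optimized PRR to several identical PRRs."""
    return {base: relocate_frames(frames, source_base, base)
            for base in destination_bases}
```

This only works because every PRR was created with identical resources, so a frame that is valid in one PRR is valid at the same relative position in another.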

36 Results: reconfiguration times in milliseconds
- All systems run at 100 MHz
- Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM
- * Estimated values for state-of-the-art competing technologies
[Table, values not recovered. Column headers: test circuit; resources (LUT, FF, DSP, BRAM); bitstream size (bytes); number of frames; ARC; BiRF* (IEEE TVLSI 2009); Microsoft* (Tech. Report 2008). Sub-headers: same side / opposite side; same side; BRAM; same side; opposite side. Rows: PolySAF node circuits (FSA, DSA and Matrix_Mult, each with and without DSP) and RFT cases (DCT, CSC, DWT).]

37 Next steps... Improve, Formalize and Collaborate
- Performance prediction model
  - Predict how big the circuit will be and how it will perform, using Excel and Matlab
  - Big leap in productivity
- Arithmetic precision manipulation is extraordinarily powerful when it comes to FPGAs
  - If the right non-IEEE precision can be chosen for a signal processing application, then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA
  - Great opportunity for small satellites
- Efficient communication between the microprocessor and PolySAF via threads
- Validate and brutally test this on a large number of algorithms (FFTs, filters, hyperspectral processing, ...)
  - NASA can help with this
- Technology is attractive for software-defined radios, precision navigation, ...
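
The precision trade-off can be explored in software before any hardware is built. The sketch below quantizes a few sample values to signed fixed point at several word lengths and reports the worst-case error. The sample values, the "one sign bit, rest fractional" format and the function names are assumptions for illustration; the sketch says nothing about the resulting area or power, which depend on the device and tools.

```python
def to_fixed(value, frac_bits, total_bits):
    """Quantize a real value to signed fixed point and return the stored integer.

    Rounds to nearest and saturates at the ends of the representable range.
    """
    scaled = round(value * (1 << frac_bits))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def from_fixed(stored, frac_bits):
    """Convert the stored integer back to a real value."""
    return stored / (1 << frac_bits)


def worst_error(samples, frac_bits, total_bits):
    """Largest absolute quantization error over a set of sample values."""
    return max(abs(x - from_fixed(to_fixed(x, frac_bits, total_bits), frac_bits))
               for x in samples)


samples = [0.1, -0.75, 0.333, 0.9]
for total_bits in (8, 12, 16):
    frac_bits = total_bits - 1  # one sign bit, the rest fractional
    print(total_bits, worst_error(samples, frac_bits, total_bits))
```

If the shortest word length whose error the application can tolerate is well below 32 bits, every multiplier and register in the mapped circuit can shrink accordingly.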

38 Kaleidoscope: Future of FPGA
- Near term
  - Maybe better tools to program and debug FPGAs?
    - Mentor's Catapult, AutoESL compiler, Synfora compiler, ...
  - Maybe some sort of standardization in FPGA programming
    - Hopefully the DARPA HPCS program will produce something
- Longer term (revolutionary things to come)
  - Vertically integrated FPGA + DRAM on a single chip
  - 1000x improvement in performance/watt
  - Visit the Micron Research Center at USU to learn more

39 Acknowledgements
- Joe Bredekamp and the NASA AISR program
  - Applied Information Systems Research
- Funding from NASA is valuable
  - Focused research
  - Want my technology to be adopted for real missions
- Xilinx and Mentor Graphics (donated > $100K worth of software)
- My grad students

