Graduate Computer Architecture I Lecture 16: FPGA Design.

2 - CSE/ESE 560M – Graduate Computer Architecture I – Emergence of FPGA
Great for Prototyping and Testing
  – Enable logic verification without the high cost of fab
  – Reprogrammable → Research and Education
  – Meets most computational requirements
  – Options for transferring design to ASIC
Technology Advances
  – Huge FPGAs are available
      – Up to 200,000 Logic Units
  – Clock rates above 500 MHz
Competitive Pricing

3 - System on Chip (SoC)
Large Embedded Memories
  – Up to 10 Megabits of on-chip memory (Virtex 4)
  – High bandwidth and reconfigurable
Processor IP Cores
  – Tons of soft processor cores (some open source)
  – Embedded processor cores
      – PowerPC, Nios RISC, etc. – 450+ MHz
  – Simple Digital Signal Processing cores
      – Up to 512 DSPs on Virtex 4
Interconnects
  – High-speed network I/O (10 Gbps)
  – Built-in Ethernet MACs (soft/hard core)
Security
  – Embedded 256-bit AES encryption

4 - Potential Advantages of FPGAs

5 - Designing with FPGAs
Opportunities
  – Hardware logic is programmable
  – Immediate testing on the actual platform
Challenges
  – Programming Environment
      – Think and design in 2-D instead of 1-D
      – Consider hardware limitations
  – Hardware Synthesis
      – Smart language interpreter and translator
      – Efficient HW resource utilization

6 - Today
Programming Environment
  – Object-oriented programming model
  – Template-based language editors
  – Hardware/software co-design
  – Still a disconnect between SW/HW methods
  – Lack of education to bring them together
Hardware Synthesis
  – Getting smarter but not smart enough
  – Tuned specifically for each platform
  – Not able to take full advantage of resources
  – Manual tweaking and using templates

7 - High Performance Design in FPGA
Fine Grain Pipelining
  – Reducing the critical path
  – One level of look-up table between D flip-flops
  – Works best for streaming data with little or no data dependencies
Logic Resource
  – Smaller sizes often yield faster designs
  – Use all available resources
  – Fewer resource map and place conflicts
  – Quicker compilation
Parallel Engines
  – Exploit parallelism in the application
  – Faster place and route

8 - Pipelining
DEFINITION:
  – A K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.
  – A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
CONVENTION:
  – Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).
ALWAYS:
  – The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths + (input) register propagation delay + (output) register setup time.
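The clock constraint on this slide can be written as a one-line sum. The sketch below is only an illustration: the function names and the three delay values are hypothetical, not taken from the lecture. The one figure borrowed from the slides is the 2.198 ns period quoted later, which corresponds to roughly 455 MHz.

```python
def min_clock_period_ns(t_reg_prop_ns, t_comb_ns, t_setup_ns):
    """Smallest clock period for one register-to-register path:
    input register propagation delay + combinational delay
    + output register setup time."""
    return t_reg_prop_ns + t_comb_ns + t_setup_ns


def max_frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


# Hypothetical delays for a single path (assumed values, for illustration only).
period = min_clock_period_ns(t_reg_prop_ns=0.3, t_comb_ns=1.5, t_setup_ns=0.4)
print(f"minimum period: {period:.3f} ns")
print(f"maximum clock:  {max_frequency_mhz(period):.1f} MHz")

# The period quoted later in these slides.
print(f"2.198 ns period: {max_frequency_mhz(2.198):.0f} MHz")
```

In a real design the constraint has to hold for the slowest path in the circuit, so the period is the maximum of this sum over all register-to-register paths.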

9 - Bad pipelining
You cannot just add registers at random
  – Successive inputs get mixed: e.g., B(A(X_{i+1}), Y_i)
  – This happened because some paths from inputs to outputs have 2 registers, and some have only 1!
Not a well-formed K-pipeline!
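Whether a circuit is a well-formed K-pipeline can be checked mechanically by counting registers along every input-to-output path. The following is a minimal sketch under assumed conventions: the circuit is a list of `(source, destination, registers_on_connection)` tuples, and the node names and register placements are a hypothetical reconstruction of the B(A(X), Y) situation, not the slide's actual figure.

```python
from collections import defaultdict


def path_register_counts(edges, inputs, outputs):
    """Return the set of register counts seen over all input-to-output paths.

    edges: list of (src, dst, registers_on_this_connection) for an acyclic circuit.
    """
    successors = defaultdict(list)
    for src, dst, regs in edges:
        successors[src].append((dst, regs))

    counts = set()

    def walk(node, regs_so_far):
        if node in outputs:
            counts.add(regs_so_far)
            return
        for nxt, regs in successors[node]:
            walk(nxt, regs_so_far + regs)

    for node in inputs:
        walk(node, 0)
    return counts


def is_k_pipeline(edges, inputs, outputs, k):
    """True when every input-to-output path carries exactly k registers."""
    return path_register_counts(edges, inputs, outputs) == {k}


# Badly pipelined: X reaches OUT through 2 registers, Y through only 1.
bad = [("X", "A", 0), ("A", "B", 1), ("Y", "B", 0), ("B", "OUT", 1)]
# Repaired: an extra register on the Y connection balances the paths.
good = [("X", "A", 0), ("A", "B", 1), ("Y", "B", 1), ("B", "OUT", 1)]

print(path_register_counts(bad, {"X", "Y"}, {"OUT"}))   # two different counts
print(is_k_pipeline(good, {"X", "Y"}, {"OUT"}, 2))
```

The path enumeration is exponential in the worst case, which is acceptable for a small teaching example but not for a real netlist.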

10 - Adding Pipelines
Method
  – Draw a line that crosses every output in the circuit and mark the endpoints as terminal points.
  – Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction. These lines represent pipeline stages.
Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline
Focus on the slowest part of the circuit
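One way to model the line-drawing method is to record, for each node, how many separating lines lie between the circuit inputs and that node. A connection then needs one register per line it crosses. This is a sketch under that assumption; the circuit, the node names, and the `lines_before` numbering are hypothetical.

```python
def add_registers_along_lines(connections, lines_before):
    """Place one register wherever a separating line crosses a connection.

    connections: list of (src, dst) pairs of an acyclic circuit.
    lines_before: node -> number of separating lines between the inputs and it.
    """
    pipelined = []
    for src, dst in connections:
        crossings = lines_before[dst] - lines_before[src]
        if crossings < 0:
            # The method requires every connection to cross each line
            # in the same direction.
            raise ValueError(f"{src}->{dst} crosses a line in the wrong direction")
        pipelined.append((src, dst, crossings))
    return pipelined


def registers_per_path(pipelined, node, outputs, so_far=0):
    """Set of register counts on all paths from node to any output."""
    if node in outputs:
        return {so_far}
    counts = set()
    for src, dst, regs in pipelined:
        if src == node:
            counts |= registers_per_path(pipelined, dst, outputs, so_far + regs)
    return counts


# Line 1 crosses the output connection B->OUT; line 2 crosses A->B and Y->B.
connections = [("X", "A"), ("A", "B"), ("Y", "B"), ("B", "OUT")]
lines_before = {"X": 0, "Y": 0, "A": 0, "B": 1, "OUT": 2}

pipelined = add_registers_along_lines(connections, lines_before)
print(pipelined)
print(registers_per_path(pipelined, "X", {"OUT"}),
      registers_per_path(pipelined, "Y", {"OUT"}))
```

Because every path from an input to the output crosses the same lines, each path ends up with the same number of registers, which is the K-pipeline property.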

11 - Pipelining Example
8-bit to 256-bit decoder
  – 256 different combinations (each output literal is 256 bits wide)

    library ieee;
    use ieee.std_logic_1164.all;

    entity DECODER is
      port(
        I: in  std_logic_vector(7 downto 0);
        O: out std_logic_vector(255 downto 0));
    end DECODER;

    architecture behavioral of DECODER is
    begin
      process (I)
      begin
        -- one branch per 8-bit input pattern, each driving a 256-bit output
        case I is
          when "..." => O <= "...";
          when "..." => O <= "...";
          when "..." => O <= "...";
          ...
          when "..." => O <= "...";
          when "..." => O <= "...";
        end case;
      end process;
    end behavioral;

12 - Hardware Synthesis
Synthesis
  – Uses at least three 4-input, 1-output look-up tables (LUT4) per output to decode the 256 combinations of I(7:0)
Resource Usage
  – 3 LUT4 × 256
  – 768 LUT4
Critical Path
  – Input/Output pin delays
  – 2 levels of LUT4
  – Sometimes 3 levels?!
  – Virtex 4 – Speed ns → 121 MHz
[Figure: I(7:0) drives LUT4-based combinational logic for each output value "0" through "255", producing O(0), O(1), O(2), O(254:3), and O(255).]

13 - Pipelined Decoder
Input/Output pin DFF
  – Already in most FPGAs
  – Minimizes pin latencies
DFF after every LUT4
  – LUT4 always followed by DFF (why not use it)
  – Only when possible
  – Minimizes logic latency
FPGA Resource
  – 768 LUT4 as before
  – Plus 768 DFF and 264 pin DFF
  – But not really…
Critical Path
  – 1 level of LUT4
  – Plus small DFF propagation delay and setup
  – Virtex 4 – Speed ns → 455 MHz
3.76x Speedup
[Figure: I(7:0) drives LUT4-based combinational logic for each output value "0" through "255", producing O(0), O(1), O(2), O(254:3), and O(255).]
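The resource and speedup figures for the flat decoder follow from simple arithmetic. The sketch below reproduces them. Reading the 264 pin flip-flops as 8 input pins plus 256 output pins is an assumption, since the slide does not spell out the breakdown.

```python
INPUT_BITS = 8          # width of I(7:0)
OUTPUTS = 256           # width of O(255:0)
LUT4_PER_OUTPUT = 3     # "at least three" LUT4 per decoded output

lut4_total = LUT4_PER_OUTPUT * OUTPUTS      # LUT4 count of the flat decoder
logic_dff = lut4_total                      # one DFF after every LUT4
pin_dff = INPUT_BITS + OUTPUTS              # assumed: one DFF per I/O pin

unpipelined_mhz = 121.0                     # clock rate quoted for the plain decoder
pipelined_mhz = 455.0                       # clock rate quoted for the pipelined decoder
speedup = pipelined_mhz / unpipelined_mhz

print(f"LUT4: {lut4_total}, logic DFF: {logic_dff}, pin DFF: {pin_dff}")
print(f"speedup: {speedup:.2f}x")
```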

14 - Logic Resource
Leveraging on FPGA Architecture
  – Similarity with architecture
  – LUT and a few special logic elements followed by DFF
Smaller Design is often Faster
  – Easier for tools to Map, Place, and Route
  – Optimize designs wherever possible
  – In an FPGA, each wire has a large fanout limit
  – Reuse logic and results
Fanout → Capacity for the wire to drive the inputs to other logic (figure labels: Input, Output)

15 - Reusing Logic
Synthesis Tools
  – Obvious duplicate logic is automatically combined
  – Most is not optimized
Decoder Example
  – Two 4-bit to 16-bit decoders
  – Combining decoder outputs
  – Two 16 bits to 256 bits
Critical Path
  – 1 level of LUT4
  – Approximately the same
  – Differences in wire delay
FPGA Resources
  – I/O DFF remain the same
  – 2 × 16 LUT4 and DFF
  – Plus 256 LUT4 and DFF
  – Total 288 LUT4 and DFF!
[Figure: I(7:0) feeds two sets of LUT4-based 4-to-16 decoders; LUT4 AND gates labeled "0,0", "0,1", "0,2", … "15,15" combine one line from each set to produce O(0), O(1), O(2), O(254:3), and O(255).]
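The two-stage structure can be checked against a flat one-hot decoder with a small behavioral model. This is a software sketch of the idea, not the synthesized circuit. The function names are hypothetical, and the assignment of the upper nibble to one 4-to-16 decoder and the lower nibble to the other is an assumption.

```python
def decode4(nibble):
    """4-to-16 one-hot decoder: output bit i is 1 only when i == nibble."""
    return [1 if i == nibble else 0 for i in range(16)]


def decode8_two_stage(value):
    """8-to-256 decoder built from two 4-to-16 decoders and 256 AND gates."""
    high = decode4((value >> 4) & 0xF)   # first-stage decoder for I(7:4)
    low = decode4(value & 0xF)           # first-stage decoder for I(3:0)
    # Second stage: output i is the AND of one line from each decoder.
    return [high[i >> 4] & low[i & 0xF] for i in range(256)]


def decode8_flat(value):
    """Reference 8-to-256 one-hot decoder."""
    return [1 if i == value else 0 for i in range(256)]


# The two implementations agree for every input pattern.
all_match = all(decode8_two_stage(v) == decode8_flat(v) for v in range(256))
print("two-stage decoder matches flat decoder:", all_match)

# Element counts by stage: one LUT4 per first-stage line, one per AND gate.
first_stage_lut4 = 2 * 16
second_stage_lut4 = 256
print("LUT4 used:", first_stage_lut4 + second_stage_lut4)
```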

16 - Virtex 4 – Elementary Logic Block
[Figure labels: 2-to-1 multiplexers, 1-bit D flip-flops, 4-input, 1-output LUT]

17 - Using MUXF as 2-input Gates
[Figure: two MUXF configurations, each with a constant 0 and signal a on the data inputs (0 and 1) and signal b on sel, producing the 2-input gate output z.]
Inverters can be pushed into the LUT4 or DFF (by using inverted Q)
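A 2-to-1 multiplexer with a constant 0 on one data input behaves as a 2-input AND gate. The truth-table sketch below shows this. The helper names are hypothetical, and the OR construction via inverters is one illustrative use of the "push the inverters elsewhere" remark, not a configuration taken from the slide's figure.

```python
def mux2(in0, in1, sel):
    """2-to-1 multiplexer: returns in0 when sel is 0, in1 when sel is 1."""
    return in1 if sel else in0


def and_via_mux(a, b):
    """AND gate: constant 0 on data input 0, a on data input 1, b on sel."""
    return mux2(0, a, b)


def or_via_mux(a, b):
    """OR gate via De Morgan: invert both inputs and the output of the AND mux.
    In hardware the inverters would be absorbed by a LUT4 or an inverted Q."""
    return 1 - and_via_mux(1 - a, 1 - b)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_via_mux(a, b), "OR:", or_via_mux(a, b))
```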

18 - Using Unused Multiplexors
Decoder Example
  – Replace all LUT4 in the 2nd decoder stage with MUX-based 2-input AND gates
Critical Path
  – Same
  – 2.198 ns → 455 MHz
FPGA Resources
  – I/O DFF remain the same
  – 256 MUXF and DFF
  – 32 LUT4 and DFF
[Figure: I(7:0) feeds two sets of LUT4-based 4-to-16 decoders; MUXF AND gates (constant on data input 0, decoder lines on data input 1 and sel) labeled "0,0", "0,1", "0,2", … "15,15" produce O(0), O(1), O(2), O(254:3), and O(255).]

19 - Parallel Design
Use Area to Increase Performance
  – Increase the input bandwidth (input bus width)
      – Processing multiple data at a time
  – Duplicate engines to process independent data sets
      – Thread/object-level parallelism
      – Instruction-level parallelism
  – Loop unroll to expose the parallelism
  – Excellent for streaming data applications
      – Multimedia
      – Network processing
Performance Scalability
  – Linear performance increase with size
      – Achieved for many algorithms
  – Sometimes exponential hardware size
      – Try to scale using a higher level of parallelism
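The linear-scaling claim can be illustrated with a back-of-the-envelope throughput model. It rests on strong assumptions that are not stated on the slide: every engine accepts one input word per clock cycle, the engines are fully independent, and the clock rate does not drop as the design grows. The 32-bit bus width is a hypothetical value; the 455 MHz clock is the rate quoted for the pipelined decoder.

```python
def throughput_gbps(engines, bus_width_bits, clock_mhz):
    """Aggregate input bandwidth in Gbit/s, assuming one word per engine per cycle."""
    return engines * bus_width_bits * clock_mhz / 1000.0


# Hypothetical 32-bit engines clocked at 455 MHz.
for engines in (1, 2, 4):
    print(engines, "engine(s):", throughput_gbps(engines, 32, 455.0), "Gbit/s")
```

Widening the input bus has the same effect in this model as duplicating engines, which is why the slide lists both as ways of trading area for performance.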

20 - Summary
FPGA Designing Methods
  – Fine grain pipelining to increase clock rate
      – If possible, 1 level of LUT followed by DFF
  – Parallel engines to increase bandwidth
      – Duplicate logic to linearly increase the performance
  – Reducing logic resource usage
      – Reusing duplicate logic
      – Using all available embedded logic
There is other logic available (e.g. embedded processors, large memories, optimized primitive gates, and IP cores)
Best Methods Today
  – Learn about the internal architecture of the FPGA
  – Make your own templates and use them
  – Use IP cores
Future Research Topics
  – Integration of generalized pipelining algorithms (in the works)
  – Smarter synthesis tools (understanding HDL)
  – Automatic platform-specific optimization techniques