CDA 4253 FPGA System Design RTL Design Methodology 1 Hao Zheng Comp Sci & Eng USF.

Slides:



Advertisements
Similar presentations
What are FPGA Power Management HDL Coding Techniques Xilinx Training.
Advertisements

Sumitha Ajith Saicharan Bandarupalli Mahesh Borgaonkar.
Logic Synthesis – 3 Optimization Ahmed Hemani Sources: Synopsys Documentation.
Register Cell Design.
BR 1/991 Hints on Meeting Project Constraints Clock Cycle Constraint (12 clocks) – More resources (multipliers, adders), less clocks –Can be done with.
Minimizing Clock Skew in FPGAs
Synchronous Digital Design Methodology and Guidelines
Dr. Turki F. Al-Somani VHDL synthesis and simulation – Part 3 Microcomputer Systems Design (Embedded Systems)
1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.
VHDL Coding Exercise 4: FIR Filter. Where to start? AlgorithmArchitecture RTL- Block diagram VHDL-Code Designspace Exploration Feedback Optimization.
George Mason University ECE 448 – FPGA and ASIC Design with VHDL Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts,
ECE 551 Digital System Design & Synthesis Lecture 11 Verilog Design for Synthesis.
CS3350B Computer Architecture Winter 2015 Lecture 5.2: State Circuits: Circuits that Remember Marc Moreno Maza [Adapted.
Lecture 9 RTL Design Methodology Sorting Example.
Sub-expression elimination Logic expressions: –Performed by logic optimization. –Kernel-based methods. Arithmetic expressions: –Search isomorphic patterns.
ECE Advanced Digital Systems Design Lecture 12 – Timing Analysis Capt Michael Tanner Room 2F46A HQ U.S. Air Force Academy I n t e g r i.
SHA-3 Candidate Evaluation 1. FPGA Benchmarking - Phase Round-2 SHA-3 Candidates implemented by 33 graduate students following the same design.
Chap 8. Sequencing and Control. 8.1 Introduction Binary information in a digital computer –data manipulated in a datapath with ALUs, registers, multiplexers,
ENG241 Digital Design Week #6 Sequential Circuits (Part A)
George Mason University Design of Controllers Finite State Machines and Algorithmic State Machine (ASM) Charts ECE 545 Lecture 14.
SEQUENTIAL CIRCUITS Component Design and Use. Register with Parallel Load  Register: Group of Flip-Flops  Ex: D Flip-Flops  Holds a Word of Data 
Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,
Lecture 14: Processors CS 2011 Fall 2014, Dr. Rozier.
George Mason University Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL Code ECE 448 Lecture 6.
Anurag Dwivedi. Basic Block - Gates Gates -> Flip Flops.
Lecture 9 RTL Design Methodology. Structure of a Typical Digital System Datapath (Execution Unit) Controller (Control Unit) Data Inputs Data Outputs Control.
BR 1/991 DataPath Elements Altera LPM library has many elements useful for building common datapath functions –lpm_ram_dq - recommended for either asynchronous.
RTL Design Methodology Transition from Pseudocode & Interface
Lecture 5B Block Diagrams HASH Example.
Lecture 3 RTL Design Methodology Transition from Pseudocode & Interface to a Corresponding Block Diagram.
Hao Zheng Comp Sci & Eng USF CDA 4253 FPGA System Design Chapter 5 Finite State Machines.
1 ECE 545 – Introduction to VHDL Algorithmic State Machines Sorting Example ECE 656 Lecture 8.
ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.
George Mason University Design of Controllers using Algorithmic State Machine (ASM) Charts ECE 545 Lecture 12.
1 강의노트 09 Logic Design with ASM Charts: Based on Digital Systems Design Using VHDL, Chapter 5, by Charles H. Roth, Jr.
George Mason University Finite State Machines Refresher ECE 545 Lecture 11.
Class Exercise 1B.
RTL Design Methodology Transition from Pseudocode & Interface
ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL Code.
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
Hao Zheng Comp Sci & Eng USF
Basics of digital systems
ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.
RTL Design Methodology
RTL Design Methodology
RTL Design Methodology
SYNTHESIS OF SEQUENTIAL LOGIC
Lecture 18 SORTING in Hardware.
ECE 545 Lecture 12 Design of Controllers Finite State Machines and Algorithmic State Machine (ASM) Charts.
ECE 545 Lecture 10 Design of Controllers Finite State Machines and Algorithmic State Machine (ASM) Charts.
ECE 545 Lecture 9 Design of Controllers Finite State Machines and Algorithmic State Machine (ASM) Charts.
ECE 545 Lecture 11 Design of Controllers Finite State Machines and Algorithmic State Machine (ASM) Charts.
Architecture Synthesis
T.H.A.D.D. GROUP TOM DUAN HELEN YU ANDY LEE DANNY HUANG DAWEY HUANG
RTL Design Methodology Transition from Pseudocode & Interface
ECE 545 Remaining Tasks.
RTL Design Methodology
RTL Design Methodology Transition from Pseudocode & Interface
ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL Code.
RTL Design Methodology
RTL Design Methodology Transition from Pseudocode & Interface
RTL Design Methodology
RTL Design Methodology
RTL Design Methodology Transition from Pseudocode & Interface
RTL Design Methodology
ECE 448 Lecture 6 Finite State Machines State Diagrams, State Tables, Algorithmic State Machine (ASM) Charts, and VHDL code ECE 448 – FPGA and ASIC Design.
RTL Design Methodology
RTL Design Methodology
ECE 448 Lecture 6 Finite State Machines State Diagrams vs. Algorithmic State Machine (ASM) Charts.
RTL Design Methodology
Presentation transcript:

CDA 4253 FPGA System Design RTL Design Methodology 1 Hao Zheng Comp Sci & Eng USF

Structure of a Typical Digital Design Datapath (Execution Unit) Controller (Control Unit) Data Inputs Data Outputs Control Inputs Control Outputs Control Signals Status Signals 2

Hardware Design with RTL VHDL Pseudocode DatapathController Block diagram Block diagram State diagram or ASM chart VHDL code Interface 3

Hardware Design with RTL VHDL 4

Steps of the Design Process 1.Text description 2.Interface 3.Pseudocode 4.Block diagram of the Datapath 5.Interface divided into Datapath and Controller 6.FSM of the Controller 7.RTL VHDL code of the Datapath, Controller, and Top-Level Unit 8.Testbench for the Datapath, Controller, and Top-Level Unit 9.Functional simulation and debugging 10.Synthesis and post-synthesis simulation 11.Implementation and timing simulation 12.Experimental testing using FPGA board 5

Min_Max_Average 6

Pseudocode Input: M[i] Outputs: max, min, average max = 0 min = MAX // the maximal constant sum = 0 for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endif endfor average = sum/32 7 Data M[i] are stored in memory. Results are stored in the internal registers.

Circuit Interface n 5 n 2 clk reset in_data in_addr write START DONE out_data out_addr MIN_MAX_AVR 8

Interface Table PortWidthMeaning clk1System clock reset1System reset – clears internal registers in_datanInput data bus in_addr5Address of the internal memory where input data is stored write1Synchronous write control signal START1Starts the computations DONE1Asserted when all results are ready out_datanOutput data bus used to read results out_addr201 – reading minimum 10 – reading maximum 11 – reading average 9

Datapath 10 Input: M[i] Outputs: max, min, average max = 0 min = MAX sum = 0 for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endif endfor average = sum/32

Datapath 11 Input: M[i] Outputs: max, min, average max = 0 min = MAX sum = 0 for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endif endfor average = sum/32 + d sum average min max d min d max min max < > mux /32

State Diagram for Controller 12 Input: M[i] Outputs: max, min, average max = 0 min = MAX sum = 0 for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endif endfor average = sum/32

State Diagram for Controller 13 Input: M[i] Outputs: max, min, average max = 0 min = MAX sum = 0 for i=0 to 31 do d = M[i]; sum = sum + d if (d < min) then min = d endif if (d > max) then max = d endif endfor average = sum/32 start / rst=1 i==32 / done=1 i < 32 / i++ start’/ done=0 init run end

Sorting 14

15 Before sorting During Sorting After sorting Addr i=0i=0i=0i=1i=1i=2 j=1j=2j=3j=2j=3j=3 MiMi MjMj Legend: position of memory indexed by i position of memory indexed by j Sorting - Example Data

16 Pseudocode for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for

17 Sorting - Interface Sort clock reset din N done addr k we start Memory N dout

18 Datapath for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for Registers to hold A, B, and memory addresses i and j. Incrementor. Comparator.

19 Datapath for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for counter i enable j +1 mux regj sel1 en_regj

20 Datapath for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for i j mux addr B A mux dout sel3 din A B enA > AgtB A B regA regB enB mux B sel2

21 State Diagram for the Controller for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for Nested loops by two FSMs: one for the outer loop controls the one for the inner loop. Reuse the FSM for the single for loop in the previous example.

22 State Diagram for the Controller for i=0 to k-2 do A = M[i] for j=i+1 to k-1 do B = M[j] if A > B then M[i] = B M[j] = A A = B end if end for start / rst=1 j<k-1 / … i<k-2 / A <= M[i] j <= i+1 start’/ done=0 init outer i==k-2 / done=1 end j==k-1 / … inner

23 Optimization for Performance

24 Performance Definitions Throughput: the number of inputs processed per unit time. Latency: the amount of time for an input to be processed. Maximizing throughput and minimizing latency in conflict. Both require timing optimization: -Reduce delay of the critical path

25 Achieving High Throughput: Pipelining Divide data processing into stages Process different data inputs in different stages simultaneously. xpower = 1; for (i = 0; i < 3; i++) xpower = x * xpower; process (clk) begin if rising_edge(clk) then if start=‘1’ then cnt <= 3; end if; if cnt > 0 then cnt <= cnt – 1; xpower <= xpower * x; elsif cnt = 0 then done <= ‘1’; end if; end process; Throughput: 1 data / 3 cycles = 0.33 data / cycle. Latency: 3 cycles. Critical path delay: 1 multiplier delay

26 Achieving High Throughput: Pipelining xpower = 1; for (i = 0; i < 3; i++) xpower = x * xpower; process (clk, rst) begin if rising_edge(clk) then if start=‘1’ then -- stage 1 x1 <= x; xpower1 <= x; done1 <= start; end if; -- stage 2 x2 <= x1; xpower2 <= xpower1 * x1; done2 <= done1; -- stage 3 xpower <= xpower2 * x2; done <= done2; end if; end process; Throughput: 1 data / cycle Latency: 3 cycles + register delays. Critical path delay: 1 multiplier delay

27 Achieving High Throughput: Pipelining Divide data processing into stages Process different data inputs in different stages simultaneously. din dout

28 Achieving High Throughput: Pipelining Divide data processing into stages Process different data inputs in different stages simultaneously. din dout … stage 1 stage 2 stage n Penalty: increase in area as logic needs to be duplicated for different stages registers

29 Reducing Latency Closely related to reducing critical path delay. Reducing pipeline registers reduces latency. din dout … stage 1 stage 2 stage n registers

30 Reducing Latency Closely related to reducing critical path delay. Reducing pipeline registers reduces latency. din dout … stage 1 stage 2 stage n

31 Timing Optimization Maximal clock frequency determined by the longest path delay in any combinational logic blocks. Pipelining is one approach. din dout … stage 1 stage 2 stage n registers din dout

32 Timing Optimization: Spatial Computing Extract independent operations Execute independent operations in parallel. X = A + B + C + D process (clk, rst) begin if rising_edge(clk) then X1 := A + B; X2 := X1 + C; X <= X2 + D; end if; end process; process (clk, rst) begin if rising_edge(clk) then X1 <= A + B; X2 <= C + D; X <= X1 + X2; end if; end process; Critical path delay: 3 adders Critical path delay: 2 adders

33 Timing Optimization: Avoid Unwanted Priority process (clk, rst) begin if rising_edge(clk) then if c0=‘1’ then r0 <= din; elsif c1=‘1’ then r1 <= din; elsif c2=‘1’ then r2 <= din; elsif c3=‘1’ then r3 <= din; end if; end process; Critical path delay: 4-input AND gate + 4x1 MUX.

34 Timing Optimization: Avoid Unwanted Priority Critical path delay: 2x1 MUX process (clk, rst) begin if rising_edge(clk) then if c0=‘1’ then r0 <= din; end if; if c1=‘1’ then r1 <= din; end if; if c2=‘1’ then r2 <= din; end if; if c3=‘1’ then r3 <= din; end if; end if; end process;

35 Timing Optimization: Register Balancing Maximal clock frequency determined by the longest path delay in any combinational logic blocks. din block 1 block 2 dout din block 1 block 2 dout

Timing Optimization: Register Balancing process (clk, rst) begin if rising_edge(clk) then rA <= A; rB <= B; rC <= C; sum <= rA + rB + rC; end if; end process; process (clk, rst) begin if rising_edge(clk) then sumAB <= A + B; rC <= C; sum <= sumAB + rC; end if; end process;

37 Optimization for Area

38 Area Optimization: Resource Sharing Rolling up pipleline: share common resources at different time – a form of temporal computing din dout din dout … stage 1 stage 2 stage n Block including all all logic in stage 1 to n.

39 Area Optimization: Resource Sharing Use registers to hold inputs Develop FSM to select which inputs to process in each cycle. X = A + B + C + D A B C D X

40 Area Optimization: Resource Sharing Use registers to hold inputs Develop FSM to select which inputs to process in each cycle. X = A + B + C + D A B C D X + X A B C D A, B, C, D need to hold steady until X is processed control

41 Area Optimization: Resource Sharing Merge duplicate components together

42 Area Optimization: Resource Sharing Merge duplicate components together

43 Impact of Reset on Area

44 Resetting Block RAM Block RAM only supports synchronous reset. Suppose that Mem is 256x16b RAM. Implementations of Mem with synchronous and asynchronous reset on Xilinx Virtex-4.

45 Optimization for Power

46 Power Reduction Techniques In general, FPGAs are power hungry. Power consumption is determined by where V is voltage, C is load capacitance, and f is switching frequency In FPGAs, V is usually fixed, C depends on the number of switching gates and length of wires connecting all gates. To reduce power, turn off gates not actively used, have multiple clock domains, reduce f.

47 Dual-EdgeTriggered FFs A design that is active on both clock edges can reduce clock frequency by 50%. din dout stage 1 stage 2 stage n stage 4 din dout stage 1 stage 2 stage n stage 4 Example 1 Example 2 positively triggered negatively triggered