John Kubiatowicz (www.cs.berkeley.edu/~kubitron)

CS152 Computer Architecture and Engineering
Lecture 5: High-Level Design, FPGAs/Virtex-E Chipset
February 9, 2004
Just like last time, I would like to start today's lecture with a recap of our last lecture.

Review: Elements of the Design Process
Divide and Conquer (e.g., ALU)
    Formulate a solution in terms of simpler components.
    Design each of the components (subproblems).
Generate and Test (e.g., ALU)
    Given a collection of building blocks, look for ways of putting them together that meet the requirements.
Successive Refinement (e.g., multiplier, divider)
    Solve "most" of the problem (i.e., ignore some constraints or special cases), then examine and correct shortcomings.
Formulate High-Level Alternatives (e.g., shifter)
    Articulate many strategies to "keep in mind" while pursuing any one approach.
Work on the Things you Know How to Do
    The unknown will become “obvious” as you make progress.
Here are some key elements of the design process. First is divide and conquer: (a) formulate a solution in terms of simpler components, then (b) concentrate on designing each component. Once you have the individual components built, you need to find a way to put them together to solve the original problem. Unless you are really good or really lucky, you probably won't have a perfect solution the first time, so you will need to apply successive refinement to your design. While you are pursuing any one approach, keep alternate strategies in mind in case what you are pursuing does not work out. One of the most important pieces of advice I can give you is to work on the things you know how to do first. As you make forward progress, a lot of the unknowns will become clear. If you sit around and wait until you know everything before you start, you will never get anything done.
2/9/034 ©UCB Spring 2004

Review: ALU Design
Bit-slice plus extra on the two ends
Overflow means number too large for the representation
Carry-look ahead and other adder tricks
[Figure: 32-bit ALU built from bit slices ALU0 through ALU31. Inputs A and B (32 bits each) and a 4-bit control M; each slice takes its a and b bits plus cin and produces an s bit and co; result S (32 bits). Ovflw is formed from signed-arith and cin xor co; C/L produces select, comp, and c-in.]

Review: Carry Look Ahead (Design trick: peek)
Per-bit carry behavior:
    A  B  C-out
    0  0  0      “kill”
    0  1  C-in   “propagate”
    1  0  C-in   “propagate”
    1  1  1      “generate”
G = A and B
P = A xor B
C0 = Cin
C1 = G0 + C0·P0
C2 = G1 + G0·P1 + C0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + C0·P0·P1·P2
C4 = . . .
Names: suppose G0 is 1 => carry no matter what else => generates a carry; suppose G0 = 0 and P0 = 1 => carry IFF C0 is 1 => propagates a carry. Like dominoes.
What about more than 4 bits?
[Figure: four adder bit slices (A0 B0 through A3 B3), each producing S, G, and P, feeding the carry lookahead logic, which also produces group G and P outputs.]
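The carry equations above can be checked with a small executable model. The following Python sketch is not part of the original lecture; the names `cla4` and `check_cla4` are hypothetical, and the expression for C4 is an assumption that simply extends the pattern of C1 through C3. It computes G and P per bit, evaluates each carry in two-level form, and compares the result against ordinary integer addition.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add; a and b are integers in 0..15."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]   # generate: G = A and B
    P = [x ^ y for x, y in zip(A, B)]   # propagate: P = A xor B

    # Carries written out in two-level form, as in the equations above.
    c1 = G[0] | (c0 & P[0])
    c2 = G[1] | (G[0] & P[1]) | (c0 & P[0] & P[1])
    c3 = G[2] | (G[1] & P[2]) | (G[0] & P[1] & P[2]) | (c0 & P[0] & P[1] & P[2])
    # Assumption: C4 follows the same pattern as C1..C3.
    c4 = (G[3] | (G[2] & P[3]) | (G[1] & P[2] & P[3])
          | (G[0] & P[1] & P[2] & P[3]) | (c0 & P[0] & P[1] & P[2] & P[3]))

    carries = [c0, c1, c2, c3]
    s = sum((P[i] ^ carries[i]) << i for i in range(4))  # sum bit = P xor carry-in
    return s, c4


def check_cla4():
    """Exhaustively compare against integer addition; returns True on success."""
    for a in range(16):
        for b in range(16):
            for c0 in (0, 1):
                total = a + b + c0
                if cla4(a, b, c0) != (total & 0xF, total >> 4):
                    return False
    return True
```

Because every carry depends only on G, P, and C0, no carry has to wait for the one before it, which is the point of the "peek" trick.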

Review: Design Trick: Guess (or “Precompute”)
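Two n-bit adders in series: CP(2n) = 2*CP(n)
Carry-select adder: CP(2n) = CP(n) + CP(mux)
Use a multiplexor to save time: guess both ways and then select (assumes the mux is faster than the adder)
[Figure: carry-select adder. One n-bit adder computes the low half; two n-bit adders compute the high half, one with carry-in 0 and one with carry-in 1; multiplexors driven by the low half's carry-out select the high sum and Cout.]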
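The "guess both ways, then select" idea can be sketched in a few lines of Python. This is an illustrative model only, not code from the lecture; `ripple_add` and `carry_select_add` are hypothetical names, and the n-bit adders are modeled with integer arithmetic rather than gates.

```python
def ripple_add(a, b, cin, n):
    """Reference n-bit adder: returns (sum mod 2**n, carry-out)."""
    total = a + b + cin
    return total & ((1 << n) - 1), total >> n


def carry_select_add(a, b, cin, n):
    """2n-bit add built from three n-bit adders and a mux."""
    mask = (1 << n) - 1
    lo_sum, lo_cout = ripple_add(a & mask, b & mask, cin, n)
    # Guess both ways: upper half computed for carry-in 0 and for carry-in 1.
    hi0 = ripple_add(a >> n, b >> n, 0, n)
    hi1 = ripple_add(a >> n, b >> n, 1, n)
    hi_sum, cout = hi1 if lo_cout else hi0   # the multiplexor
    return (hi_sum << n) | lo_sum, cout
```

In hardware the three n-bit adders run in parallel, so the critical path is one n-bit add plus the mux, matching CP(2n) = CP(n) + CP(mux).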

Why should you keep a design notebook?
Keep track of the design decisions and the reasons behind them
    Otherwise, it will be hard to debug and/or refine the design
    Write it down so that you can remember in a long project: 2 weeks -> 2 yrs
    Others can review the notebook to see what happened
Record insights you have on certain aspects of the design as they come up
Record the different design & debug experiments
    Memory can fail when very tired
Industry practice: learn from others' mistakes
The goal of this part of the lecture is to convince EACH of you to keep your OWN design notebook. Why? First of all, you need to keep track of all the design decisions you make and, maybe more importantly, the reasons behind them. This may not matter much when your project's life span is only a few weeks, but after you graduate you will work on projects that last 2 to 3 years. If you don't write things down, you may not remember how you did certain things and why, and you may find it very hard to debug and refine your design. Also, while you are working on one part of the design, you may suddenly get an insight about another part. You may not have time to follow up immediately, and if you don't write it down, you may never be able to reconstruct it later when you have time. Finally, it is very important to write down everything you see in the tests or experiments you run when you are debugging your design.

Why do we keep it on-line?
You need to force yourself to take notes
    Open a window and leave an editor running while you work
        1) Acts as a reminder to take notes
        2) Makes it easy to take notes
        1) + 2) => you will actually do it
    Take advantage of the window system's “cut and paste” features
    It is much easier to read your typing than your writing
Also, paper log books have problems
    Limited capacity => end up with many books
    May not have the right book with you at the time vs. networked screens
    Can use the computer to search/index files to find what you are looking for
The next question some of you may want to ask is: OK, I will keep a notebook, but why should I keep it on-line? Let's be honest with ourselves. All of us need a little reminder to force ourselves to take notes while we work. One of the best reminders I have found is the window system of a modern workstation. Keeping an extra window open with an editor running makes taking notes very easy, and the editor also serves as a constant reminder to take notes. Also, by keeping your notebook on-line, you can take advantage of the window system's cut and paste feature to drop important “print outs” into your notebook. Finally, although you may be able to read your own handwriting much better than anybody else can, it is still easier to read your own typing than your own writing.

How should you do it? Keep it simple
DON'T make it so elaborate that you won't use it (fonts, layout, ...)
Separate the entries by dates
    Type the “date” command in another window and cut&paste
Start the day with the problems you are going to work on today
Record output of simulation into the log with cut&paste; add date
    May help sort out which version of the simulation did what
Record key email with cut&paste
Record of what works & doesn't helps the team decide what went wrong after you left
Index: write a one-line summary of what you did at the end of each day
How should you keep your on-line notebook? By all means, Keep It Simple. The on-line notebook should help you track down and solve your problems. It should NOT become one of your problems. To keep the notebook easy to read, separate your entries by date. Furthermore, before you sign off each day, write a one-line summary of what you did; this will serve as the index to your notebook. Let me show you some examples.
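The conventions above (dated entries, a stated goal, and a one-line index at the front of the file) are simple enough to automate. The sketch below is one possible way to do it and is not from the lecture; the function name `add_log_entry`, its parameters, and the exact text layout are assumptions chosen for illustration. It works on strings so that it can be tried without touching any files.

```python
def add_log_entry(log_text, start, end, goal, notes, summary):
    """Return log_text with one dated entry appended and one index line added.

    Assumed layout: the index sits at the top of the log, separated from the
    entries by a blank line; each index line is "<start date>  <summary>".
    """
    index, _, body = log_text.partition("\n\n")
    index_lines = [line for line in index.splitlines() if line]
    index_lines.append(f"{start}  {summary}")
    rule = "=" * 68
    entry = "\n".join([rule, start, f"Goal: {goal}", notes, end, rule])
    parts = ["\n".join(index_lines), body.rstrip("\n"), entry]
    return "\n\n".join(p for p in parts if p) + "\n"
```

A session would call this once at sign-off, passing the output of the `date` command captured at the start and end of the work.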

On-line Notebook Example
Refer to the handout “Example of On-Line Log Book” on the CS 152 handouts page.
Spend 10 minutes on the notebook example: 6 minutes per page.

1st page of On-line notebook (Index + Wed. 9/6/95)
Wed Sep 6 00:47:28 PDT  Created the 32-bit comparator component
Thu Sep 7 14:02:21 PDT  Tested the comparator
Mon Sep 11 12:01:45 PDT  Investigated bug found by Bart in comp32 and fixed it
+ ====================================================================
Wed Sep 6 00:47:28 PDT 1995
Goal: Lay out the schematic for a 32-bit comparator
I've laid out the schematics and made a symbol for the comparator. I named it comp32. The files are
    ~/wv/proj1/sch/comp32.sch
    ~/wv/proj1/sch/comp32.sym
Wed Sep 6 02:29:22 PDT 1995
- ====================================================================
Add a 1-line index entry at the front of the log file at the end of each session: date + summary
Start with date, time of day + goal
Make comments during the day, summary of work
End with date, time of day (and add a 1-line summary at the front of the file)

2nd page of On-line notebook (Thursday 9/7/95)
+ ====================================================================
Thu Sep 7 14:02:21 PDT 1995
Goal: Test the comparator component
I've written a command file to test comp32. I've placed it in ~/wv/proj1/diagnostics/comp32.cmd.
I ran the command file in viewsim and it looks like the comparator is working fine. I saved the output into a log file called ~/wv/proj1/diagnostics/comp32.log
Notified the rest of the group that the comparator is done.
Thu Sep 7 16:15:32 PDT 1995
- ====================================================================

3rd page of On-line notebook (Monday 9/11/95)
+ ====================================================================
Mon Sep 11 12:01:45 PDT 1995
Goal: Investigate bug discovered in comp32 and hopefully fix it
Bart found a bug in my comparator component. He left the following email.
    From Sun Sep 10 01:47:
    Received: by wayne.manor (NX5.67e/NX3.0S) id AA00334; Sun, 10 Sep 95 01:47:
    Date: Wed, 10 Sep 95 01:47:
    From: Bart Simpson
    To:
    Subject: [cs152] bug in comp32
    Status: R
    Hey Bruce,
    I think there's a bug in your comparator. The comparator seems to think that ffffffff and fffffff7 are equal. Can you take a look at this?
    Bart

4th page of On-line notebook (9/11/95 contd)
I verified the bug. Here's a viewsim run of the bug as it appeared (equal should be 0 instead of 1):
    SIM>stepsize 10ns
    SIM>v a_in A[31:0]
    SIM>v b_in B[31:0]
    SIM>w a_in b_in equal
    SIM>a a_in ffffffff\h
    SIM>a b_in fffffff7\h
    SIM>sim
    time = ns A_IN=FFFFFFFF\H B_IN=FFFFFFF7\H EQUAL=1
    Simulation stopped at 10.0ns.
Ah. I've discovered the bug. I mislabeled the 4th net in the comp32 schematic. I corrected the mistake and re-checked all the other labels, just in case. I re-ran the old diagnostic test file and tested it against the bug Bart found. It seems to be working fine. Hopefully there aren't any more bugs :)

5th page of On-line notebook (9/11/95 contd)
On second inspection of the whole layout, I think I can remove one level of gates in the design and make it go faster. But who cares! The comparator is not in the critical path right now. The delay through the ALU is dominating the critical path, so unless the ALU gets a lot faster, we can live with a less than optimal comparator.
I emailed the group that the bug has been fixed.
Mon Sep 11 14:03:41 PDT 1995
- ====================================================================
Perhaps the critical path changes later; what was the idea to make the comparator faster? Check the log book!

Representation Languages
Hardware Representation Languages:
    Block Diagrams: FUs, Registers, & Dataflows
    Register Transfer Diagrams: Choice of busses to connect FUs, Regs
    Flowcharts and State Diagrams: two different ways to describe sequencing & microoperations
Fifth Representation "Language": Hardware Description Languages
    e.g., ISP', VHDL, Verilog
    hw modules described like programs, with i/o ports, internal state, & parallel execution of assignment statements
    Descriptions in these languages can be used as input to
        simulation systems: "software breadboard"
        synthesis systems: generate hw from high level description
"To Design is to Represent"

Simulation Before Construction
"Physical Breadboarding"
    discrete components/lower scale integration precedes actual construction of prototype
    verify initial design concept
No longer possible as designs reach higher levels of integration!
Simulation Before Construction
    high level constructs imply faster to construct
    play "what if" more easily
    limited performance accuracy, however

Levels of Description
Architectural Simulation: models programmer's view at a high level; written in your favorite programming language
Functional/Behavioral/Dataflow: more detailed model, like the block diagram view
Register Transfer: commitment to datapath FUs, registers, busses; register xfer operations are clock phase accurate
Logic: model is in terms of logic gates; higher level MSI functions described in terms of these
Circuit: electrical behavior; accurate waveforms
Moving down the list: Less Abstract, More Accurate, Slower Simulation
Tools: schematic capture + logic simulation package like Xilinx ISE; special languages + simulation systems for describing the inherent parallel activity in hardware

Netlist
A key data structure (or representation) in the design process is the “netlist”: Network List
A netlist lists components and connects them with nodes, ex:
    g1 "and" n1 n2 n5
    g2 "and" n3 n4 n6
    g3 "or" n5 n6 n7
Alternative format:
    n1 g1.in1
    n2 g1.in2
    n3 g2.in1
    n4 g2.in2
    n5 g1.out g3.in1
    n6 g2.out g3.in2
    n7 g3.out
    g1 "and"
    g2 "and"
    g3 "or"
[Figure: AND gates g1 and g2 feeding OR gate g3.]
Netlist is what is needed for simulation and implementation.
Could be at the transistor level, gate level, ...
Could be hierarchical or flat.
How do we generate a netlist?
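To make the netlist idea concrete, here is a minimal Python sketch that stores the three-gate example as data and evaluates it. It is an illustration rather than anything used in the course tools; the tuple layout, the `GATE_FUNCS` table, and the `simulate` function are assumptions introduced for this example.

```python
# Gate-level netlist from the example: instance, gate type, input nets, output net.
NETLIST = [
    ("g1", "and", ("n1", "n2"), "n5"),
    ("g2", "and", ("n3", "n4"), "n6"),
    ("g3", "or",  ("n5", "n6"), "n7"),
]

GATE_FUNCS = {
    "and": lambda x, y: x & y,
    "or":  lambda x, y: x | y,
}


def simulate(netlist, inputs):
    """Evaluate a combinational netlist; gates may be listed in any order."""
    values = dict(inputs)
    pending = list(netlist)
    while pending:
        # A gate is ready once all of its input nets have known values.
        ready = [g for g in pending if all(n in values for n in g[2])]
        if not ready:
            raise ValueError("undriven net or combinational loop")
        for name, kind, ins, out in ready:
            values[out] = GATE_FUNCS[kind](*(values[n] for n in ins))
        pending = [g for g in pending if g not in ready]
    return values
```

For example, `simulate(NETLIST, {"n1": 1, "n2": 1, "n3": 0, "n4": 1})["n7"]` evaluates g1 and g2 first and then g3, regardless of the order in which the gates are listed.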

Xilinx(TM) Design Flow
Design Entry -> High-level Analysis -> Technology Mapping -> Low-level Analysis
Design Entry example:
    Decoder(output x0,x1,x2,x3; inputs a,b)
    {
        wire abar, bbar;
        inv(bbar, b);
        inv(abar, a);
        nand(x0, abar, bbar);
        nand(x1, abar, b );
        nand(x2, a, bbar);
        nand(x3, a, b );
    }

Design Flow
Design Entry -> High-level Analysis -> Technology Mapping -> Low-level Analysis
Design Entry: circuit is described and represented:
    Graphically (Schematics)
    Textually (HDL)
    Other (Special Compilers): Memories, Error Correcting Circuits
Result of circuit specification (and compilation) is a netlist of:
    generic primitives: logic gates, flip-flops, or
    technology specific primitives: LUTs/CLBs, transistors, discrete gates, or
    higher level library elements: adders, ALUs, register files, decoders, etc.

Design Flow
High-level Analysis is used to verify:
    correct function
    rough: timing, power, cost
Common tools used are:
    simulator: checks functional correctness, and
    static timing analyzer: estimates circuit delays based on timing model and delay parameters for library elements (or primitives).

Design Flow
Technology Mapping:
    Converts netlist to implementation technology dependent details
    Expands library elements
    Performs: partitioning, placement, routing
Low-level Analysis:
    Simulation and Analysis Tools perform low-level checks with: accurate timing models, wire delay
    For FPGAs this step could also use the actual device.

Design Flow
Design Entry -> High-level Analysis -> Technology Mapping -> Low-level Analysis
Netlist: used between and internally for all steps.

Design Entry
Schematics are intuitive. They match our use of gate-level or block diagrams.
Somewhat physical. They imply a physical implementation. This is why we use them for datapaths.
Require a special tool (editor).
Unless hierarchy is carefully designed, schematics can be confusing and difficult to follow.

Hardware Description Languages (HDLs)
Basic Idea: Language constructs describe circuits with two basic forms:
    Structural descriptions: similar to hierarchical netlist.
    Behavioral descriptions: use higher-level constructs (similar to conventional programming).
“Structural” example:
    Decoder(output x0,x1,x2,x3; inputs a,b)
    {
        wire abar, bbar;
        inv(bbar, b);
        inv(abar, a);
        nand(x0, abar, bbar);
        nand(x1, abar, b );
        nand(x2, a, bbar);
        nand(x3, a, b );
    }
“Behavioral” example:
    case [a b]
        00: [x0 x1 x2 x3] = 0x1;
        01: [x0 x1 x2 x3] = 0x2;
        10: [x0 x1 x2 x3] = 0x4;
        11: [x0 x1 x2 x3] = 0x8;
    endcase;
Originally designed to help in abstraction and simulation.
    Now “logic synthesis” tools exist to automatically convert from behavioral descriptions to gate netlist.
    Greatly improves designer productivity.
    However, this may lead you to falsely believe that hardware design can be reduced to writing programs!
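The difference between the two description styles can be tried out in ordinary software. The Python sketch below is an illustration, not part of the lecture: `decoder_structural` mirrors the inv/nand gates of the structural Decoder one for one, while `decoder_behavioral` states the same function as a selection. Both names are hypothetical. Note that because the structural version is built from NAND gates, the selected output is the one that goes low (active-low outputs), and the behavioral model here is written to match that polarity.

```python
def nand(x, y):
    return 1 - (x & y)


def inv(x):
    return 1 - x


def decoder_structural(a, b):
    """Gate-by-gate model of the structural Decoder; returns (x0, x1, x2, x3)."""
    abar, bbar = inv(a), inv(b)
    return (nand(abar, bbar), nand(abar, b), nand(a, bbar), nand(a, b))


def decoder_behavioral(a, b):
    """Case-style model: exactly one output, selected by {a, b}, goes low."""
    outputs = [1, 1, 1, 1]
    outputs[2 * a + b] = 0
    return tuple(outputs)
```

Checking that the two functions agree on all four input combinations is the software analogue of verifying a synthesized netlist against its behavioral source.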

Verilog History
Originated at Automated Integrated Design Systems (renamed Gateway). Acquired by Cadence in 1989.
Invented as a simulation language. Synthesis was an afterthought. Many techniques for synthesis were developed at Berkeley in the 80's and applied commercially in the 90's.
Around the same time as the origin of Verilog, the US Department of Defense developed VHDL. Because it was in the public domain it began to grow in popularity. VHDL is still popular within the government, in Europe and Japan, and some universities.
Standardization: Afraid of losing market share, Cadence opened Verilog to the public in 1990. An IEEE working group was established in 1993, and ratified IEEE Standard 1364 (Verilog) in 1995.
Verilog is the language of choice of Silicon Valley companies, initially because of high-quality tool support and its similarity to C-language syntax.
Most major CAD frameworks now support both VHDL and Verilog.

Example: Structural XOR (xor built-in, but..)
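module xor(Z, X, Y);
    input X, Y;                       // says which “ports” are input, output
    output Z;                         // default is 1 bit wide data
    wire notX, notY, XnotY, YnotX;    // “nets” to connect components
    not (notX, X), (notY, Y);
    or  (Z, YnotX, XnotY);
    and (YnotX, notX, Y), (XnotY, X, notY);
endmodule
[Figure: X and Y feed inverters (notX, notY) and two AND gates (YnotX, XnotY), whose outputs are ORed to form Z.]
Note: order of gates doesn't matter, since structure determines relationship
Note: xor is also the name of a built-in gate primitive, so a real design would have to give this module a different name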

Example: Behavioral XOR in Verilog
module xorB(Z, X, Y);
    input X, Y;
    output Z;
    reg Z;
    always @(X or Y)
        Z = X ^ Y;    // ^ is C operator for xor
endmodule
Unusual parts of above Verilog:
    “always @(X or Y)” => whenever X or Y changes, do the following statement
    “reg” is the only type of behavioral data that can be changed in assignment, so must redeclare Z as reg
    Default is single bit data types: X, Y, Z

Verilog big idea: Time in code
Difference from normal prog. lang. is that time is part of the language
    part of what we are trying to describe is when things occur, or how long things will take
Underlying simulation system is event driven:
    Changes on signals cause events
    Events trigger firing of attached components
Changes to signals are never visible in zero time!
    A change to a behavioral signal becomes visible only if time advances at least one ‘tick’
Simulation time does not advance without timing control of some sort. Examples of timing control:
    Gate or wire delays can schedule events in the future
    Delay controls, introduced with the “#” symbol
    Event controls, introduced by the “@” symbol
    The wait statement

Delay Specifications
`timescale 1ns/1ps
//Dataflow description of mux
module mux2 (in0, in1, select, out);
    input in0,in1,select;
    output out;
    assign #(5,10) out = select ? in1 : in0;
endmodule // mux2
Notes:
    Delay specifications are relative to the timescale specification
    May be placed in many different syntactical positions
    #singlenumber: delay specification for both edges
    #(rising,falling): delay specification for rising and falling edges

Time Example
module test(stream);
    output stream;
    reg stream;
    initial
    begin
        stream = 0;
        #2 stream = 1;
        #5 stream = 0;
        #3 stream = 1;
        #4 stream = 0;
    end
endmodule
“Initial” means do this code once
Note: Verilog uses begin … end vs. { … } as in C
#2 stream = 1 means wait 2 ns before changing stream to 1
Output called a “waveform”
[Figure: waveform of stream versus time; stream rises to 1 at time 2, falls at 7, rises at 10, and falls at 14.]
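Each `#` delay in the initial block is relative to the statement before it, so the absolute times of the waveform are running sums of the delays. The short Python sketch below (an illustration only; `waveform` and `STREAM` are hypothetical names) reproduces that bookkeeping.

```python
def waveform(initial, steps):
    """Turn [(delay, value), ...] into [(absolute_time, value), ...]."""
    t = 0
    events = [(t, initial)]
    for delay, value in steps:
        t += delay          # each #delay is relative to the previous statement
        events.append((t, value))
    return events


# stream = 0; #2 stream = 1; #5 stream = 0; #3 stream = 1; #4 stream = 0;
STREAM = waveform(0, [(2, 1), (5, 0), (3, 1), (4, 0)])
# STREAM == [(0, 0), (2, 1), (7, 0), (10, 1), (14, 0)]
```

The transition times 2, 7, 10, and 14 are the ones marked on the time axis of the waveform.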

Time, variable update, and monitor
or #2(Z, X, Y);
[Figure: OR gate with inputs X and Y and output Z.]
The instant before the rising edge of the clock, all outputs and wires have their OLD values. This includes inputs to flip flops. Therefore, if you change the inputs to a flip flop at a particular rising edge, that change will not be reflected at the output until the NEXT rising edge. This is because when the rising edge occurs, the flip flop still sees the old value.
So when simulated time changes in Verilog, ports and registers are updated.

Sequential Logic
// Sequential Logic: involves an edge
module FF (CLK,Q,D);
    input D, CLK;
    output Q;
    reg Q;
    always @(posedge CLK) Q=D;
endmodule // FF
//Parallel to Serial converter
module ParToSer(LD, X, out, CLK);
    input [3:0] X;
    input LD, CLK;
    output out;
    reg [3:0] Q;
    assign out = Q[0];
    always @(posedge CLK)
        if (LD) Q=X;
        else Q = Q>>1;
endmodule // ParToSer
Notes:
    “always @(posedge CLK)” forces the Q register to be rewritten on every rising edge of CLK.
    “>>” operator does right shift (shifts in a zero on the left).
    Shifts on non-reg variables can be done with concatenation:
        wire [3:0] A, B;
        assign B = {1'b0, A[3:1]};
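The behavior of the parallel-to-serial converter can be followed edge by edge with a small software model. The Python class below is a sketch for illustration, not a replacement for the Verilog; the class name `ParToSer` is borrowed from the module, while the `posedge` method and the `serialize` helper are hypothetical.

```python
class ParToSer:
    """Cycle-level model of the 4-bit parallel-to-serial converter."""

    def __init__(self):
        self.q = 0                      # 4-bit register Q

    @property
    def out(self):
        return self.q & 1               # assign out = Q[0];

    def posedge(self, ld, x=0):
        """Apply one rising clock edge."""
        if ld:
            self.q = x & 0xF            # load the parallel word
        else:
            self.q >>= 1                # shift right, zero enters on the left


def serialize(x):
    """Load x, then read out its 4 bits, least significant first."""
    p = ParToSer()
    p.posedge(ld=1, x=x)
    bits = []
    for _ in range(4):
        bits.append(p.out)
        p.posedge(ld=0)
    return bits
```

Calling `serialize(0b1011)` returns `[1, 1, 0, 1]`: the word comes out least significant bit first, one bit per clock edge.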

Verilog: replication, hierarchy
Often in hardware need many copies of an item, connected together in a regular way
    Need way to name each copy
    Need way to specify how many copies
Specify a module with 4 XORs using existing module example

Example: Replicated XOR in Verilog
module xor4(C, A, B);
    input [3:0] A, B;
    output [3:0] C;
    xor foo4xor[3:0] (.X(A), .Y(B), .Z(C) );
endmodule
Note 1: can associate ports explicitly by name, (.X(A), .Y(B), .Z(C)), or implicitly by order (as in C), (C, A, B)
Note 2: must give a name to new instance of xors (foo4xor)
[Figure: four XOR instances; instance i takes A[i] and B[i] and drives C[i], for i = 3 down to 0.]

Basic Example: 2-to1 mux in Structural Form
//2-input multiplexor in gates
module mux2 (in0, in1, select, out);
    input in0, in1, select;
    output out;
    wire s0, w0, w1;
    not (s0, select);
    and (w0, s0, in0),
        (w1, select, in1);
    or (out, w0, w1);
endmodule // mux2
Notes:
    Comments start with //
    “module”
    port list
    declarations: input/output ports are “wires” by default
    wire type
    primitive gates

2-1 Mux in Dataflow Form
//Dataflow description of mux
module mux2 (in0, in1, select, out);
    input in0,in1,select;
    output out;
    assign out = (~select & in0) | (select & in1);
endmodule // mux2
Alternative:
    assign out = select ? in1 : in0;
Notes:
    Dataflow form provides a way to describe combinational logic by its function rather than gate structure (similar to Boolean expressions).
    The assign keyword is used to indicate a continuous assignment. Whenever anything on the RHS changes, the LHS is updated.

2-to-1 mux Behavioral description
// Behavioral model of 2-to-1
// multiplexor.
module mux2 (in0,in1,select,out);
    input in0,in1,select;
    output out;
    reg out;
    always @(in0 or in1 or select)
        if (select) out=in1;
        else out=in0;
endmodule // mux2
Behavioral: use keyword always followed by one procedural statement
    Use begin/end to place more statements after always
@() specifier: wait until an event (here, a change on one of 3 signals)
Output of procedural assignments must be of type reg
    a reg type retains its value until a new value is assigned
    (Not necessarily a real register: may be only a signal)

Combining modules: Hierarchy & Bit Vectors
//Assuming we have already
// defined a 2-input mux (either
// structurally or behaviorally),
//4-input mux built from 3 2-input muxes
module mux4 (in0, in1, in2, in3, select, out);
    input in0,in1,in2,in3;
    input [1:0] select;
    output out;
    wire w0,w1;
    mux2
        m0 (.select(select[0]), .in0(in0), .in1(in1), .out(w0)),
        m1 (.select(select[0]), .in0(in2), .in1(in3), .out(w1)),
        m2 (.select(select[1]), .in0(w0), .in1(w1), .out(out));
endmodule // mux4
Instance names: m0, m1, m2
Notes:
    instantiation similar to primitives
    select is 2 bits wide
    named port assignment

Behavioral 4-to-1 mux
//4-input mux behavioral description
module mux4 (in0, in1, in2, in3, select, out);
    input in0,in1,in2,in3;
    input [1:0] select;
    output out;
    reg out;
    always @(in0 or in1 or in2 or in3 or select)
        case (select)
            2'b00: out=in0;
            2'b01: out=in1;
            2'b10: out=in2;
            2'b11: out=in3;
        endcase
endmodule // mux4
[Figure: 4-to-1 MUX with inputs in0, in1, in2, in3, a 2-bit Select, and output out.]
Notes:
    Case construct equivalent to nested if constructs.
Definition: A structural description is one where the function of the module is defined by the instantiation and interconnection of sub-modules. A behavioral description uses higher level language constructs and operators. Verilog allows modules to mix both behavioral constructs and sub-module instantiation.
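A structural 4-input mux built from three 2-input muxes and a behavioral one written as a case statement should compute the same function. The Python sketch below checks that claim in software; it is an illustration only, and the function names are hypothetical stand-ins for the Verilog modules, with `select` passed as a 2-bit integer.

```python
def mux2(in0, in1, select):
    return in1 if select else in0


def mux4_structural(in0, in1, in2, in3, select):
    """Three mux2 instances wired like m0, m1, m2 in a structural mux4."""
    s0, s1 = select & 1, (select >> 1) & 1
    w0 = mux2(in0, in1, s0)   # m0
    w1 = mux2(in2, in3, s0)   # m1
    return mux2(w0, w1, s1)   # m2


def mux4_behavioral(in0, in1, in2, in3, select):
    """Case-statement form: select indexes the inputs directly."""
    return (in0, in1, in2, in3)[select]
```

Comparing the two over all 16 input patterns and all 4 select values is a small exhaustive test of the kind that is practical for blocks this size.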

Behavioral with Bit Vectors
//Behavioral model of 32-bit
// wide 2-to-1 multiplexor.
module mux32 (in0,in1,select,out);
    input [31:0] in0,in1;
    input select;
    output [31:0] out;
    reg [31:0] out;
    always @(in0 or in1 or select)
        if (select) out=in1;
        else out=in0;
endmodule // Mux
//Behavioral model of 32-bit adder.
module add32 (C,S,A,B);
    input [31:0] A,B;
    output [31:0] S;
    output C;
    reg [31:0] S;
    reg C;
    always @(A or B)
        {C,S} = A + B;
endmodule // Add
Bit Vector Sizing and Ordering (32 bits, bit 31 MSB)
Concatenation Operation: {}
[Figure: 32-bit 2-to-1 MUX (in0, in1, Select, out) and 32-bit Adder (inputs A and B, sum S, carry C).]

Testing: Make sure that things work
Testing methodologies
    Understand what correct behavior is when you design things
    Collect vectors for later use
    Build monitor modules to check assertions of correct values
    Produce a regression test
        Set of tests to run each time something changes
Types of test (Doug Clark):
    Directed Vectors: test explicit behavior
    Random Vectors: apply random values or orderings to device
    Daemons: continuous error insertion
Monitor modules:
    Check to see if invariants are maintained during long running simulations
Alewife Numbers

Monitor Modules: Passthrough testing
module monitorsum32(carry,sum,A,B );
    input [31:0] A,B;
    output [31:0] sum;
    output carry;
    reg [31:0] predsum;
    reg predcarry;
    // The “real” adder
    sum32 mysum (carry,sum,A,B);
`ifndef synthesis // This checker code only for simulation
    always @(A or B)
    begin
        #100 //wait for output to settle (don't make too long!)
        {predcarry,predsum} = A + B;
        if ((carry != predcarry) || (sum != predsum))
            $display(">>> Mismatch: 0x%x+0x%x->0x%x carry %x",
                     A,B,sum,carry);
    end
`endif
endmodule

Testbench: Applying Directed Vectors
module testmux;
    reg a, b, s;
    wire f;
    reg expected;
    // Unit under test.
    mux2 myMux (.select(s), .in0(a), .in1(b), .out(f));
    initial
    begin
        s=0; a=0; b=1; expected=0;
        #10 a=1; b=0; expected=1;
        #10 s=1; a=0; b=1; expected=1;
    end
    initial
        $monitor(
            "select=%b in0=%b in1=%b out=%b, expected out=%b time=%d",
            s, a, b, f, expected, $time);
endmodule // testmux
Top-level modules written specifically to test sub-modules.
Notes:
    initial block similar to always except only executes once (at beginning of simulation)
    #n's needed to advance time
    $monitor: prints output
    A variety of other “system functions”, similar to $monitor, exist for displaying output and controlling the simulation.

Testbench: Randomized Vector Testing
module testbench( );
    reg [31:0] A,B;
    wire [31:0] sum;
    wire carry;
    reg [31:0] predsum;
    reg predcarry;
    // Device under test
    sum32 mysum (carry,sum,A,B);
    always
    begin
        A = $random; B = $random;
        #100 //wait for output to settle
        {predcarry,predsum} = A + B;
        if ((carry != predcarry) || (sum != predsum))
            $display(">>> Mismatch: 0x%x+0x%x->0x%x carry %x",
                     A,B,sum,carry);
        else
            $display("Successful: 0x%x+0x%x=0x%x carry %x",
                     A,B,sum,carry);
    end
endmodule
Source of Vectors: with $random
    Vectors -> predicted result
    Actual vectors -> actual results
    Check actual results against predicted
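The same random-vector strategy can be sketched outside the simulator. The Python below is an illustration under stated assumptions, not course code: `sum32` stands in for the device under test, `broken_sum32` is a deliberately faulty adder invented here to show a failure being caught, and `random_test` plays the role of the always block, with a fixed seed so that a failing run can be repeated.

```python
import random


def sum32(a, b):
    """Stand-in for the device under test: returns (carry, sum)."""
    total = a + b
    return total >> 32, total & 0xFFFFFFFF


def broken_sum32(a, b):
    """Deliberately faulty adder that drops the carry out of bit 15."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    hi = (a >> 16) + (b >> 16)
    return hi >> 16, ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def random_test(dut, trials=1000, seed=152):
    """Apply random vectors and return the list of mismatching (A, B) pairs."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        predicted = ((a + b) >> 32, (a + b) & 0xFFFFFFFF)
        if dut(a, b) != predicted:
            mismatches.append((a, b))
    return mismatches
```

Returning the failing vectors, rather than only printing them, makes it easy to save them as directed vectors for a later regression test.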

More Verilog Help
The lecture notes only cover the very basics of Verilog and mostly just the conceptual issues.
The Mano textbook covers Verilog with many examples.
The Bhasker book is a good tutorial. On reserve in the Engineering library.
Complete language spec from the IEEE available on handouts page
Also, pretty good tutorial available on handouts page
Synplify manual (for when we start using synthesis)

Where are FPGAs in the IC Zoo?
Logic
    Standard Logic
    ASIC
        Programmable Logic Devices (PLDs)
            SPLDs (PALs)
            CPLDs
            FPGAs
        Gate Arrays
        Cell-Based ICs
        Full Custom ICs
Source: Dataquest
Acronyms
    SPLD = Simple Prog. Logic Device
    PAL = Prog. Array of Logic
    CPLD = Complex PLD
    FPGA = Field Prog. Gate Array
    (Standard logic is SSI or MSI buffers, gates)
Common Resources
    Configurable Logic Blocks (CLB)
        Memory Look-Up Table
        AND-OR planes
        Simple gates
    Input / Output Blocks (IOB)
        Bidirectional, latches, inverters, pullup/pulldowns
    Interconnect or Routing
        Local, internal feedback, and global

FPGA Variations
Families of FPGAs differ in:
    physical means of implementing user programmability,
    arrangement of interconnection wires, and
    basic functionality of logic blocks
Most significant difference is in the method for providing flexible blocks and connections:
    Anti-fuse based (ex: Actel)
        Non-volatile, relatively small
        Fixed (non-reprogrammable)
        (Almost used in 150 Lab: only 1-shot at getting it right!)

User Programmability
Latch-based (Xilinx, Altera, …)
    reconfigurable
    volatile
    relatively large die size
    Note: Today 90% of die is interconnect, 10% is gates
Latches are used to:
    1. make or break cross-point connections in interconnect
    2. define function of logic blocks
    3. set user options:
        within the logic blocks
        in the input/output blocks
        global reset/clock
“Configuration bit stream” loaded under user control:
    All latches are strung together in a shift chain
    “Programming” => creating bit stream

Idealized FPGA Logic Block
4-input Look Up Table (4-LUT) implements combinational logic functions
Register optionally stores output of LUT
Latch determines whether the block output is read from the register or directly from the LUT

4-LUT Implementation
n-bit LUT is actually implemented as a 2^n x 1 memory:
    inputs choose one of 2^n memory locations.
    memory locations (latches) are normally loaded with values from user's configuration bit stream.
    Inputs to mux control are the CLB (Configurable Logic Block) inputs.
Result is a general purpose “logic gate”.
    n-LUT can implement any function of n inputs!
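Viewing the LUT as a small memory addressed by its inputs is easy to model. The Python class below is a sketch for illustration only; the class name `LUT`, the `from_function` helper, and the convention that input 0 is the least significant address bit are all assumptions, not details of any particular FPGA.

```python
class LUT:
    """n-input look-up table modeled as a 2**n x 1 memory."""

    def __init__(self, n, bits):
        if len(bits) != 2 ** n:
            raise ValueError("need exactly 2**n configuration bits")
        self.n = n
        self.bits = list(bits)          # contents loaded from the bit stream

    @classmethod
    def from_function(cls, n, func):
        """Fill the memory with func's truth table (input 0 is the address LSB)."""
        bits = [func(*[(addr >> i) & 1 for i in range(n)]) & 1
                for addr in range(2 ** n)]
        return cls(n, bits)

    def read(self, *inputs):
        """The inputs form the address that selects one memory location."""
        addr = sum(bit << i for i, bit in enumerate(inputs))
        return self.bits[addr]
```

For instance, `LUT.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)` stores a 4-input parity function in 16 bits; changing the function changes only the stored bits, never the structure.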

LUT as general logic gate
An n-LUT is a direct implementation of a function truth table
Each latch location holds the value of the function corresponding to one input combination
Example: 4-LUT
Example: 2-LUT
    Implements any function of 2 inputs.
    How many functions of n inputs?
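One way to answer the counting question: an n-input truth table has 2^n rows, and each row can independently hold 0 or 1, so there are 2^(2^n) distinct tables and therefore that many functions. The Python sketch below (illustrative; both function names are hypothetical) states the formula and confirms it by enumeration for n = 2.

```python
def count_functions(n):
    """Number of distinct Boolean functions of n inputs: one per truth table."""
    return 2 ** (2 ** n)


def all_two_input_tables():
    """Enumerate every 2-LUT configuration as a 4-entry truth table."""
    return [tuple((cfg >> addr) & 1 for addr in range(4)) for cfg in range(16)]
```

So a 2-LUT covers 16 functions and a 4-LUT covers 65536, each selected purely by the configuration bits.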

More functionality for “free”?
Given basic idea
    LUT built from RAM
    Latches connected as shift register
What other functions could be provided at very little extra cost?
    Using CLB latches as little RAM vs. logic
    Using CLB latches as shift register vs. logic

1. “Distributed RAM”
CLB LUT configurable as Distributed RAM
    A LUT equals 16x1 RAM
    Implements Single and Dual-Ports
    Cascade LUTs to increase RAM size
Synchronous write
Synchronous/Asynchronous read
    Accompanying flip-flops used for synchronous read
[Figure: a LUT is equivalent to one of several RAM primitives: RAM16X1S (D, WE, WCLK, A0-A3, output O), RAM32X1S (adds A4), RAM16X2S (D0, D1, outputs O0, O1), or RAM16X1D (dual port, with DPRA0-DPRA3 and outputs SPO, DPO).]
Jack says: Below, under “Xilinx says”, are Xilinx's PowerPoint notes for this slide. All I think is important to mention is that the LUTs can store 16 bits of information, so they can be used as a 16x1 RAM. You can also cascade LUTs, or use more than one LUT, to create dual ports or larger RAM sizes. Since memory can always replace logic in a design, this implementation gives the designer the flexibility to use additional memory if needed. As for the students and what will be required in their lab work, you can tell them not to worry; the CAD tools will automatically detect and infer distributed RAM as necessary. Students should just be aware of this tradeoff and the fact that it is available for them to use.
Xilinx says: When the CLB LUT is configured as memory, it can implement 16x1 synchronous RAM. One LUT can implement 16x1 Single-Port RAM. Two LUTs are used to implement 16x1 dual port RAM. The LUTs can be cascaded for desired memory depth and width. The write operation is synchronous. The read operation is asynchronous and can be made synchronous by using the accompanying flip flops of the CLB LUT. The distributed RAM is compact and fast, which makes it ideal for small RAM-based functions.

Block RAM (Extra RAM not using LUTs)
[Figure: Spartan-IIE true dual-port block RAM with Port A and Port B.] Most efficient memory implementation: dedicated blocks of memory. Ideal for most memory requirements. Virtex-E XCV2000 has 160? blocks, 4096 bits per block. Use multiple blocks for larger memories. Builds both single and true dual-port RAMs. CORE Generator provides custom-sized block RAMs and quickly generates an optimized RAM implementation.
The Block RAM is true dual port, which means it has 2 independent read and write ports, and these ports can be read and/or written simultaneously, independent of each other. All control logic is implemented within the RAM, so no additional CLB logic is required to implement a dual-port configuration. The Altera 10KE and ACEX 1K families have only 2-port RAM. To emulate dual-port capability, they would need twice the number of memory blocks, at half the performance.

Virtex-E Block RAM
Flexible 4096-bit block… Variable aspect ratio. Increase memory depth or width by cascading blocks.
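To make “variable aspect ratio” and cascading concrete, here is a rough Python sketch. The 4096-bit block size comes from the slide; the set of port widths (1, 2, 4, 8, 16) is an assumption used for illustration and should be checked against the data sheet, and `blocks_needed` is a hypothetical helper, not a CORE Generator function.

```python
import math

BLOCK_BITS = 4096               # from the slide: 4096 bits per block
ASSUMED_WIDTHS = (1, 2, 4, 8, 16)  # assumption: selectable data widths

def aspect_ratios():
    """List (depth, width) shapes that use exactly one 4096-bit block."""
    return [(BLOCK_BITS // w, w) for w in ASSUMED_WIDTHS]

def blocks_needed(depth, width):
    """Fewest blocks to build a depth x width memory by cascading blocks,
    trying every assumed aspect ratio for the individual blocks."""
    best = None
    for block_depth, block_width in aspect_ratios():
        count = math.ceil(depth / block_depth) * math.ceil(width / block_width)
        best = count if best is None else min(best, count)
    return best
```

For example, `blocks_needed(1024, 32)` is 8: eight blocks configured as 1024x4 sit side by side to form the 32-bit word.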

2. Shift Register: each LUT can be configured as a shift register
[Figure: a LUT configured as a shift register with inputs D (IN), CE, CLK, and DEPTH[3:0], and output Q (OUT).] Serial in, serial out. Saves resources: replaces up to 16 FFs. Faster: no routing. Note: the CAD tools determine whether a CLB is used as a LUT, RAM, or shift register, rather than leaving it up to the designer.
Jack says: Since you program the FPGA by shifting in values for the LUTs, the LUTs already have the capability to be a shift register. Using a LUT as a shift register allows the FPGA to save resources (it doesn’t have to use 16 FFs), and it will also be faster, since it eliminates routing. Again, the students do not have to worry about this, as the tools should automatically infer their shift registers and optimize them to use this. Even if the tools do not recognize it, the logical operation of the design is not affected; the result is just a slower design (if this shift register happens to be in the critical path).
Xilinx says: The LUT can be configured as a shift register (serial in, serial out) with a length programmable from 1 to 16 bits. For example, DEPTH[3:0] = 0010 (binary) means that the shift register is 3 bits long. In the simplest case, a 16-bit shift register can be implemented in a LUT, eliminating the need for 16 flip-flops and also eliminating extra routing resources that would otherwise have lowered the performance.
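The DEPTH behavior in the Xilinx note can be sketched as follows. This is a behavioral Python model written for these notes (the class name `Srl16` is borrowed loosely and the interface is hypothetical): DEPTH selects which of the 16 stored bits drives the output, so a DEPTH value of d gives a shift register of length d + 1.

```python
class Srl16:
    """Behavioral model of a LUT used as a serial-in, serial-out shift register."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently shifted-in value

    def clock_edge(self, data_in, clock_enable=1):
        """On a clock edge with CE set, shift every bit one position along."""
        if clock_enable:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth):
        """DEPTH[3:0] taps one stored bit; effective length is depth + 1."""
        return self.bits[depth & 0xF]


sr = Srl16()
taps = []
for value in (1, 0, 0):          # shift in a single 1 followed by zeros
    sr.clock_edge(value)
    taps.append(sr.out(0b0010))  # DEPTH = 0010 -> 3-bit shift register
```

With DEPTH = 0010 the 1 appears at the output after the third clock edge, so `taps` ends up as `[0, 0, 1]`.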

How to Program: FPGA Generic Design Flow
Design Entry: create your design files using a schematic editor or a hardware description language (Verilog, VHDL). Design “implementation” on FPGA: partition, place, and route (“PPR”) to create the bit-stream file; that is, divide the design into CLB-sized pieces, place them into blocks, and route between blocks. Design verification: use a simulator to check function; other software determines the max clock frequency. Load onto the FPGA device (a cable connects the PC to the board) and check operation at full speed in the real environment.

Example Partition, Placement, and Route
Example schematic circuit: a collection of gates and flip-flops. Idealized FPGA structure: the circuit’s combinational logic must be “covered” by 4-input, 1-output “gates”. Flip-flops from the circuit must map to FPGA flip-flops. (Best to preserve “closeness” to the CL to minimize wiring.) Placement in general attempts to minimize wiring.

Xilinx Virtex-E Routing Hierarchy
[Figure labels: internal busses (internal 3-state bus), long lines and global lines, buffered hex lines (1/6 blocks), single-length lines, direct connections.] Note: CAD tools do PPR, not designers. Xilinx provides automatic place and route tools that efficiently use these routing resources; the designers do not need to worry about this process. Direct connections. 24 single-length lines: route GRM (general routing matrix) signals to adjacent GRMs in 4 directions. 96 buffered hex lines: route GRM signals to other GRMs six blocks away in each of the 4 directions. 12 buffered long lines: routing across top and bottom, left and right.

Virtex-E Configurable Logic Block (CLB)
2 “logic slices” / CLB, two 4-LUTs / slice => Four 4-LUTs / CLB

Virtex-E CLB Slice Structure
Each slice contains two sets of the following. Four-input LUT: any 4-input logic function, or 16-bit x 1 sync RAM, or 16-bit shift register. Carry & Control: fast arithmetic logic, multiplexer logic, multiplier logic. Storage element: latch or flip-flop; set and reset; true or inverted inputs; sync. or async. control.
Xilinx says: Two slices form a CLB. These slices can be used independently or together for wider logic functions. Within each slice, the LUT and the flip-flop can also be used for the same function or for independent functions. The flip-flops do not handcuff the designers into having only a set or only a clear. And for more ASIC-like flows, the flip-flop can be used as a latch, so the designers do not need to re-code the design for the device architecture.

Details of Virtex-E Slice
Very fast ripple carry: 100 MHz). Multiplexors to help combine CLBs into a larger multiplexor.

Virtex-E Dedicated Expansion Multiplexers
Since a 4-LUT has 4 inputs, the max is a 2:1 mux (2 inputs, 1 control line). MUXF5 combines 2 LUTs to create a 4x1 multiplexer, or any 5-input function (5-LUT), or selected functions of up to 9 inputs. MUXF6 combines 2 slices to form an 8x1 multiplexer, or any 6-input function (6-LUT), or selected functions of up to 19 inputs. Dedicated muxes are faster and more space efficient. [Figure: a CLB showing the LUTs in each slice feeding MUXF5, and the slices feeding MUXF6.] The dedicated multiplexers MUXF5 and MUXF6 allow 5-input and 6-input logic functions in only one logic (LUT) level.
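Why two 4-LUTs plus MUXF5 give “any 5-input function” follows from Shannon expansion: fix the fifth input to 0 in one LUT and to 1 in the other, then let the mux choose between them. A small Python sketch of that argument (the helpers `lut4`, `muxf5`, and `five_input` are illustrative names, not Xilinx primitives):

```python
from itertools import product

def lut4(func):
    """Build a 4-LUT: precompute the 16-entry truth table of func."""
    table = [func(*bits) for bits in product((0, 1), repeat=4)]

    def read(a, b, c, d):
        return table[(a << 3) | (b << 2) | (c << 1) | d]

    return read

def muxf5(sel, in0, in1):
    """Dedicated 2:1 mux that selects between the two LUT outputs."""
    return in1 if sel else in0

def five_input(func5):
    """Shannon expansion on the fifth input e:
    f(a,b,c,d,e) = mux(e, f(a,b,c,d,0), f(a,b,c,d,1))."""
    lut_e0 = lut4(lambda a, b, c, d: func5(a, b, c, d, 0))
    lut_e1 = lut4(lambda a, b, c, d: func5(a, b, c, d, 1))

    def read(a, b, c, d, e):
        return muxf5(e, lut_e0(a, b, c, d), lut_e1(a, b, c, d))

    return read

# Example: 5-input parity built from two 4-LUTs and one MUXF5.
parity5 = five_input(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
```

The same construction applied once more (two 5-input halves selected by MUXF6) gives any 6-input function.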

Xilinx Virtex-E Chip Floorplan
Input / Output Blocks (IOBs). Configurable Logic Blocks (CLBs). Block RAMs (BRAMs) (discussed soon). Delay Locked Loop (DLL) (discussed soon). “VersaRing”.
Page 16 is the floorplan of the Virtex-E FPGA. This is the actual FPGA used in lab. This slide shows what is actually in a Virtex-E, as opposed to the idealized FPGA from the previous slides. IOBs are the input/output blocks, and students won’t have to worry about them. BRAMs are block RAMs: RAM built into the FPGA that students can use (note that this is different from the distributed RAM mentioned before). I’m not sure what the VersaRing is. The slide also shows the 4 DLLs that the Virtex-E has. Slide 5 from the PowerPoint elaborates on the different kinds of memory, and you may want to put that after this slide.

Virtex-E Delay Lock Loop (DLL) Capabilities
Easy clock duplication. System clock distribution. Cleans and reconditions the incoming clock. Quick and easy frequency adjustment. A single crystal easily generates multiple clocks. Excellent for advanced memory types. De-skew the incoming clock. Generate fast setup and hold times or fast clock-to-outs.
Clock Mirror duplicates the incoming clock and performs system synchronization. Multiply and Divide functions allow simple frequency adjustments for distribution throughout the board. By using inexpensive crystals, clock frequencies can be multiplied internally to the FPGA, reducing board EMI. Clock Phase Shift provides coarse phase shifts of 0, 90, 180, and 270 degrees; this is excellent for the fast clocking of state machines by utilizing each of the different clock phases. Clock De-skew allows for faster setup, hold, and clock-to-out times, allowing higher overall system performance.

DLL: Multiplication of Clock Speed
Have a faster internal clock relative to the external clock source. Use 1 DLL for 2x multiplication. Combine 2 DLLs for 4x multiplication. Reduce board EMI: route a low-frequency clock externally and multiply the clock on-chip. [Figure: 2x clock multiplication; a DLL multiplies a 66 MHz clock by 2 to produce 132 MHz.]
Clock multiplication gives the designer a number of alternatives. For instance, a 25 MHz clock source can be doubled by a DLL to drive an FPGA design operating at 50 MHz. This technique can simplify board design because the clock path on the board no longer distributes a high-speed signal. Connect two DLL circuits in series to perform a 4x multiplication.

DLL: Division of Clock Speed
Selectable division values: 1.5, 2, 2.5, 3, 4, 5, 8, or 16. Cascade DLLs to combine functions. Combine DLLs to multiply and divide to get the desired speed. 50/50 duty cycle correction available. Using multiple DLLs in the device can create various forms of clock: phase-shifted clocks, divided clocks, and divided and phase-shifted clocks. [Figure: clock x2 and clock ÷2 example, with outputs labeled 60 MHz (multiply by 2), 30 MHz (180° shift), and 15 MHz (divide by 2); one DLL is used for feedback (FB) with a 180° phase shift.]

Clock Management Summary
All-digital DLL implementation. Input noise rejection. 50/50 duty cycle correction. Clock mirror provides system clock distribution. Multiply input clock by 2x or 4x. Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16. De-skew clock for fast setup, hold, or clock-to-out times.
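As a quick arithmetic check of the multiply and divide options listed above, the sketch below tabulates the output frequencies for a given input clock. The multiplier and divisor values are the ones from the slides; the function name `dll_outputs` and the dictionary layout are made up for illustration, and the model ignores any frequency limits a real DLL would have.

```python
MULTIPLIERS = (2, 4)                        # 2x with one DLL, 4x with two in series
DIVISORS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)    # selectable division values

def dll_outputs(f_in_mhz):
    """Map each multiply/divide option to its output frequency in MHz."""
    outputs = {"x%d" % m: f_in_mhz * m for m in MULTIPLIERS}
    outputs.update({"/%g" % d: f_in_mhz / d for d in DIVISORS})
    return outputs

clocks_66 = dll_outputs(66)   # the 66 MHz example from the multiplication slide
clocks_25 = dll_outputs(25)   # the 25 MHz crystal example from the notes
```

For instance, `clocks_66["x2"]` is 132 and `clocks_25["x2"]` is 50, matching the two examples on the multiplication slide.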

Summary I
Design Process. Design Entry: schematics, HDL, compilers. High-Level Analysis: simulation, testing, assertions. Technology Mapping: turn the design into a physical implementation. Low-Level Analysis: check out timing, setup/hold, etc. Verilog: three programming styles. Structural: like a netlist; instantiation of modules plus wires between them. Dataflow: higher level; expressions instead of gates. Behavioral: hardware programming; full flow-control mechanisms; registers, variables; file I/O, console display, etc.

Summary: Xilinx FPGAs
How they differ from the idealized array: In addition to their use as general logic “gates”, LUTs can alternatively be used as general-purpose RAM or as a shift register; each 4-LUT can become a 16x1-bit RAM array. Special circuitry speeds up “ripple carry” in adders and counters; therefore adders assembled by the CAD tools operate much faster than adders built from gates and LUTs alone. Many more wires, including tri-state capabilities.

In conclusion, FPGAs…
FPGAs are basically interconnect plus distributed RAM that can be programmed to act as any logical function of 4 inputs. CAD tools do the partitioning, placement, and routing functions onto CLBs. FPGAs offer a compromise of performance, unit cost, and time to market (TTM) vs. ASICs and microprocessors plus software. [Figure: ASIC, microprocessor, and FPGA ranked from better to worse on TTM, performance, NRE, and unit cost.]
