Reconfigurable Computing - Options in Circuit Design

John Morris
Chung-Ang University, The University of Auckland
[Title image: 'Iolanthe' at 13 knots on Cockburn Sound, Western Australia]

Design Options – so far
'Structural Options'
- Bit serial
  - Most space efficient
  - Slow: one bit of result produced per cycle
  - Sometimes this isn't a problem
  - Examples: small efficient adder, very small multiplier

Serial Circuits: bit serial adder

[Figure: bit serial adder. A full adder (FA) with inputs a, b and cin and outputs sum and cout, plus a 2-bit register driven by clock.]

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY serial_add IS
      PORT( a, b, clk : IN std_logic;
            sum, cout : OUT std_logic );
    END ENTITY serial_add;

    ARCHITECTURE df OF serial_add IS
      SIGNAL cint : std_logic;   -- carry held from one clock cycle to the next
    BEGIN
      PROCESS( clk )
      BEGIN
        IF clk'EVENT AND clk = '1' THEN
          sum  <= a XOR b XOR cint;
          cint <= (a AND b) OR (b AND cint) OR (a AND cint);
        END IF;
      END PROCESS;
      cout <= cint;
    END ARCHITECTURE df;

- Note: The synthesizer will insert the latch on the internal signals!
- Note: Reset or clear needed to frame operands!
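The behaviour of the serial adder can be checked with a small software model. This is only a sketch in Python rather than VHDL: the function name `serial_add` and the LSB-first bit lists are illustrative assumptions, and the carry is assumed to be cleared before the first bit arrives (the "reset or clear" the slide asks for).

```python
def serial_add(a_bits, b_bits):
    """Model a bit-serial adder: one sum bit per clock, carry kept in a register.

    a_bits and b_bits are lists of 0/1 values, least significant bit first
    (an assumption of this sketch).
    """
    carry = 0                      # assumed cleared before the operands start
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)                  # sum <= a XOR b XOR cint
        carry = (a & b) | (b & carry) | (a & carry)     # cint <= majority(a, b, cint)
    return sum_bits, carry


# 5 + 7 with 4-bit operands, LSB first
print(serial_add([1, 0, 1, 0], [1, 1, 1, 0]))   # ([0, 0, 1, 1], 0), i.e. 12
```

Each loop iteration stands for one clock edge, so an n-bit addition takes n cycles.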

Design Options – so far
'Structural Options'
- Bit serial
  - Most space efficient
- Sequential
  - Combinatorial / bit-parallel block + register
  - Example: sequential multiplier (adder + shifter + register)

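Multipliers - Pipelined
- Multiplier arrays need space!
  - $O(n^2)$ full adders: a considerable amount of space!
- Sequential multipliers use $O(n)$ space but $O(n)$ cycles!
[Figure: sequential multiplier with operands a and b; each cycle adds the partial product $(a \wedge b_j)\,2^j$ to the running sum.]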
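The O(n)-cycle behaviour of the sequential multiplier can be sketched in software. This Python model is an illustration only; `sequential_multiply` is a hypothetical name, and unsigned n-bit operands are assumed.

```python
def sequential_multiply(a, b, n):
    """Shift-and-add multiply of two unsigned n-bit integers.

    One loop iteration models one clock cycle of an adder + shifter + register.
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    acc = 0
    for j in range(n):
        if (b >> j) & 1:       # a AND b_j: the partial product is either a or 0
            acc += a << j      # weight the partial product by 2^j
    return acc


print(sequential_multiply(13, 11, 4))   # 143
```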

Design Options – so far
'Structural Options'
- Bit serial
- Sequential
- Pipelined
  - High throughput
  - High latency too though!
  - Need to achieve pipeline balance
    - Every stage should have similar propagation delay
    - More later!
  - Example: pipelined multiplier

Multipliers - Pipelined
- Pipelining will increase throughput (results produced per second) but also increase total latency (time to produce the full result)
- Insert registers to capture partial sums
- Benefits
  * Simple
  * Regular
  * Register width can vary
    - Need to capture operands also!
  * Usual pipeline advantages
- Inserting a register at every stage may not produce a benefit!

Design Options – so far
'Structural Options'
- Bit serial
- Sequential
- Pipelined
- Examine communication patterns
  - Example: eliminate horizontal carry chains in parallel array multiplier

Multipliers
- We can add the partial products with FA blocks
- Try to use a more efficient adder in each row?
- A simpler scheme uses a 'carry save' adder, which pushes the carry outs down to the next row!
- Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above!
[Figure: array of FA blocks with multiplicand bits a3 a2 a1 a0 across the top, one row of FAs per multiplier bit (b0, b1, b2, ...) and product bits p0, p1, ... at the outputs; labels include "Carry select adder".]
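The carry-save idea can be illustrated on whole integers. The sketch below is a Python model under stated assumptions (unsigned operands, hypothetical names `carry_save_add` and `csa_multiply`): each row keeps a separate sum word and carry word, and only the final addition propagates carries.

```python
def carry_save_add(x, y, z):
    """3-to-2 compression: returns (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z                              # per-bit sum, no carry propagation
    c = ((x & y) | (y & z) | (x & z)) << 1     # per-bit carry, pushed one place up
    return s, c


def csa_multiply(a, b, n):
    """Multiply unsigned n-bit a and b, accumulating partial products carry-save."""
    s, c = 0, 0
    for j in range(n):
        partial = (a << j) if (b >> j) & 1 else 0
        s, c = carry_save_add(s, c, partial)   # carries go down to the next row
    return s + c                               # the extra adder below the last row


print(csa_multiply(13, 11, 4))   # 143
```

The final `s + c` plays the role of the extra adder below the last row; in hardware that adder is the only place where a carry chain remains.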

Design Options – so far
'Structural Options'
- Bit serial
- Sequential
- Pipelined
- Examine communication patterns
- Tree structures
  - Example: combine carries in level below (Wallace Tree multiplier)

Multipliers - Tree
- Summing the partial products
- So combine them vertically!
[Figure: dot diagram of the partial products and the first level results.]
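A tree reduction of the partial products might be modelled as follows. This is a sketch, not the slide's circuit: the names `csa` and `tree_reduce` are hypothetical, operands are unsigned integers, and each level simply groups the remaining rows in threes and passes any leftovers through.

```python
def csa(x, y, z):
    """3-to-2 carry-save compressor on whole words."""
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1


def tree_reduce(rows):
    """Reduce a list of addends to two words level by level, then add them.

    Returns (total, number_of_csa_levels).
    """
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3       # rows that form complete groups of 3
        nxt = []
        for i in range(0, full, 3):
            s, c = csa(rows[i], rows[i + 1], rows[i + 2])
            nxt += [s, c]
        nxt += rows[full:]                     # leftovers drop to the next level
        rows = nxt
        levels += 1
    return sum(rows), levels                   # one carry-propagate add at the end


a, b, n = 201, 173, 8
partials = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]
total, levels = tree_reduce(partials)
print(total == a * b, levels)
```

Because every level works on all groups at once, the number of levels grows roughly logarithmically with the number of partial products rather than linearly.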

Signed digit arithmetic – Avoiding the carries!
Terminology
- First, we need to distinguish carefully between digits of a number and bits used in representing the number
- In the standard binary representations, one bit is used to represent each binary digit (0 or 1) of a number
- However, we can use other representation schemes …
- If we use more than one bit to represent each digit of an operand, then we have a redundant system
  - We're using more bits than the minimum $\log_2 n$ needed to represent a number of magnitude $n$
- These redundant number systems generally have the ability to avoid carry propagation
  - This may be exploited in the addition of sequences of numbers
  - Carries are transferred to the following addition
  - Concept similar to that used in the carry-save multiplier, where carries are transferred to the following partial product addition

Booth Recoding
- A binary number can be re-coded according to Booth's scheme to reduce the number of partial products in a multiplier
- Original idea
  - Early computers: shift much faster than add
  - Observe that when there is a 0 in the multiplier, you can skip the addition and just shift the multiplicand
  - In a synchronous computer, this doesn't help: in the worst case, you still have to perform an add for each digit of the multiplier (all or most of them are 1's)
  - but in an asynchronous computer, the ability to skip some additions reduces the average completion time
- Booth observed that when there is a long sequence of 1s, eg digits j through (down to) k are 1s, then
  $2^j + 2^{j-1} + \dots + 2^{k+1} + 2^k = 2^{j+1} - 2^k$

Booth Recoding
- A binary number can be re-coded according to Booth's scheme to reduce the number of partial products in a multiplier
- Booth recoding
  - Booth observed that when there is a long sequence of 1s, eg digits j through (down to) k are 1s, then
    $2^j + 2^{j-1} + \dots + 2^{k+1} + 2^k = 2^{j+1} - 2^k$
  - Thus the sequence of additions can be replaced by
    - An addition of the multiplicand shifted by j+1 positions and
    - A subtraction of the multiplicand shifted by k positions
  - This is equivalent to recoding the multiplier from a representation using {0,1} to one using {-1,0,1}, corresponding to subtract, skip, add
  - The recoding can be done in O(1) time by inspecting neighbouring digits

Booth Recoding
- Booth's scheme: radix-2 Booth recoding
- For each position, j, inspect $x_j$ and $x_{j-1}$ to determine the bits (2 needed!) of $y_j$

| $x_j$ | $x_{j-1}$ | $y_j$ | Note |
|---|---|---|---|
| 0 | 0 | 0 | No 1's |
| 0 | 1 | 1 | End of a string of 1's: add |
| 1 | 0 | -1 | Start of a string of 1's: subtract |
| 1 | 1 | 0 | Middle of a string of 1's: skip |

- Example
    x:   1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0  (0)
    y:  -1  0  1  0  0 -1  1  0 -1  1 -1  1  0  0 -1  0
- In practice, this scheme is no use in a synchronous machine
  - Worst case: sequence of alternating 0 1
  - More additions than necessary!
- but if we use a higher radix Booth recoding …
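The radix-2 rule amounts to $y_j = x_{j-1} - x_j$ with $x_{-1} = 0$. A minimal Python sketch (hypothetical function names; the multiplier is passed as an n-bit pattern and interpreted as two's complement) reproduces the slide's example:

```python
def booth_radix2(x, n):
    """Recode the n-bit pattern x into digits in {-1, 0, 1}, most significant first."""
    digits = []
    prev = 0                          # x_{-1}, the implied 0 to the right of the LSB
    for j in range(n):
        bit = (x >> j) & 1
        digits.append(prev - bit)     # y_j = x_{j-1} - x_j
        prev = bit
    return digits[::-1]


def digits_value(digits):
    """Value of a signed-digit string, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


x = 0b1001110110101110
y = booth_radix2(x, 16)
print(y)
print(digits_value(y) == x - (1 << 16))   # True: x read as 16-bit two's complement
```

Note that the recoded digits give the two's complement value of the bit pattern, which is why the leading 1 in the example maps to -1.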

Higher Radix Multiplication
- Radix-2 multiplier
  - Use 1 bit of the multiplier at a time
  - Form partial product with AND gates
- Radix-4 multiplier
  - Use 2 bits of the multiplier at a time
  - If A is the multiplicand:

| Multiplier bits | Operation |
|---|---|
| 00 | none |
| 01 | +A |
| 10 | +2A (shift A) |
| 11 | +3A (precompute A+2A?) |

- Radix-4 Booth recoding …

Radix-4 Booth Recoding
- Recode multiplier into a signed digit form
  - Use 3 bits of the original multiplier at a time
  - Recoded multiplier has half the number of digits, but each digit is in [-2,2]
  - Operands to the adders are now formed by shifts alone

| $x_{2j+1}$ | $x_{2j}$ | $x_{2j-1}$ | $y_j$ | Operation | Note |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | none | No 1's |
| 0 | 0 | 1 | 1 | +A | End of 1's string |
| 0 | 1 | 0 | 1 | +A | Isolated 1 |
| 0 | 1 | 1 | 2 | +2A | End of 1's string |
| 1 | 0 | 0 | -2 | -2A | Beginning of 1's |
| 1 | 0 | 1 | -1 | -A | End one string, start new one |
| 1 | 1 | 0 | -1 | -A | Start of 1's string |
| 1 | 1 | 1 | 0 | none | Middle of 1's |

- Recode: constant time
- Partial products: shift, and, select
- n/2 partial products generated
- Potentially 2× speed!
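The radix-4 table is equivalent to $y_j = x_{2j-1} + x_{2j} - 2x_{2j+1}$. The Python sketch below (hypothetical names, even n, multiplier given as an n-bit two's complement pattern) recodes the multiplier and forms the product from n/2 shifted multiples of A.

```python
def booth_radix4(x, n):
    """Recode the n-bit pattern x (n even) into n/2 digits in [-2, 2], MSD first."""
    if n % 2:
        raise ValueError("n must be even")
    digits = []
    prev = 0                                   # x_{-1}
    for j in range(n // 2):
        lo = (x >> (2 * j)) & 1                # x_{2j}
        hi = (x >> (2 * j + 1)) & 1            # x_{2j+1}
        digits.append(prev + lo - 2 * hi)      # y_j = x_{2j-1} + x_{2j} - 2*x_{2j+1}
        prev = hi
    return digits[::-1]


def booth_multiply(a, x, n):
    """Multiply a by the two's complement value of x using n/2 partial products."""
    product = 0
    for j, d in enumerate(reversed(booth_radix4(x, n))):
        product += d * (a << (2 * j))          # 0, +-A or +-2A, weighted by 4^j
    return product


print(booth_radix4(0b0111, 4))           # [2, -1], i.e. 2*4 - 1 = 7
print(booth_multiply(13, -6 & 0xFF, 8))  # -78
```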

No carries at all? Residue Number Systems

Residue Arithmetic: Residue Number Systems
- A verse by the Chinese scholar, Sun Tsu, over 1500 years ago posed this problem
  - What number has remainders 2, 3 and 2 when divided by the numbers 7, 5 and 3, respectively?
- This is probably the first documented use of number representations using multiple residues
- In a residue number system, a number, x, is represented by the list of its residues (remainders) with respect to k pairwise relatively prime moduli, $m_{k-1}, m_{k-2}, \dots, m_0$
- Thus x is represented by $(x_{k-1}, x_{k-2}, \dots, x_0)$ where $x_i = x \bmod m_i$
- So the puzzle may be re-written
  - What is the decimal representation of (2,3,2) in RNS(7,5,3)?
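For a system this small the puzzle can be answered by exhaustive search. The following Python sketch uses hypothetical helper names (`to_rns`, `from_rns_search`) and relies on `math.prod`, so it assumes Python 3.8 or later.

```python
from math import prod


def to_rns(x, moduli):
    """Residues of x with respect to each modulus, in the order given."""
    return tuple(x % m for m in moduli)


def from_rns_search(residues, moduli):
    """Find the unique x in [0, M) with the given residues by brute force."""
    for x in range(prod(moduli)):
        if to_rns(x, moduli) == tuple(residues):
            return x
    raise ValueError("no such number; check the residues and moduli")


print(from_rns_search((2, 3, 2), (7, 5, 3)))   # 23
```

Brute force is only practical for tiny ranges; it is used here purely to make the representation concrete.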

Residue Number Systems
- The dynamic range of a RNS is $M = m_{k-1} \times m_{k-2} \times \dots \times m_0$
- For example, in the system RNS(8,7,5,3), $M = 8 \times 7 \times 5 \times 3 = 840$
- Thus we have

| Decimal | RNS(8,7,5,3) |
|---|---|
| 0 or 840 or -840 or … | (0,0,0,0) |
| 1 or 841 or -839 or … | (1,1,1,1) |
| 2 or 842 or … | (2,2,2,2) |
| 8 or 848 or … | (0,1,3,2) |

- Any RNS can be viewed as a weighted representation
- In RNS(8,7,5,3), the weights are: 105 120 336 280
- Thus (1,2,4,0) represents $\langle 105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0 \rangle_{840} = \langle 1689 \rangle_{840} = 9$
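The weighted view can be tried directly with the weights quoted for RNS(8,7,5,3). This is a small Python sketch; the constant and function names are illustrative only.

```python
WEIGHTS = (105, 120, 336, 280)   # position weights quoted for RNS(8,7,5,3)
M = 8 * 7 * 5 * 3                # dynamic range, 840


def rns_to_int(residues):
    """Weighted sum of the residues, reduced modulo the dynamic range."""
    return sum(w * r for w, r in zip(WEIGHTS, residues)) % M


for residues in [(1, 2, 4, 0), (0, 0, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2), (0, 1, 3, 2)]:
    print(residues, "->", rns_to_int(residues))
# (1, 2, 4, 0) -> 9, and the other four give 0, 1, 2 and 8
```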

Residue Number Systems - Operations
- Complement
  - To find -x, complement each of the digits with respect to the modulus for that digit
  - 21 = (5,0,1,0) so -21 = (8-5, 0, 5-1, 0) = (3,0,4,0)
- Addition or subtraction is performed on each digit
  - $(5, 5, 0, 2)_{RNS} = 5_{10}$
  - $(7, 6, 4, 2)_{RNS} = -1_{10}$
  - Sum: $(\langle 5+7 \rangle_8 = 4,\ \langle 5+6 \rangle_7 = 4,\ \langle 0+4 \rangle_5 = 4,\ \langle 2+2 \rangle_3 = 1)_{RNS} = (4, 4, 4, 1)_{RNS} = 4_{10}$
- Multiplication is also achieved by operations on each digit
  - Product: $(\langle 5 \times 7 \rangle_8 = 3,\ \langle 5 \times 6 \rangle_7 = 2,\ 0,\ \langle 2 \times 2 \rangle_3 = 1)_{RNS} = (3, 2, 0, 1)_{RNS} = -5_{10}$
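The digit-wise operations can be reproduced in a few lines. This Python sketch uses hypothetical names (`to_rns`, `rns_neg`, `rns_add`, `rns_mul`) and the moduli (8,7,5,3) from the slide's examples.

```python
MODULI = (8, 7, 5, 3)


def to_rns(x, moduli=MODULI):
    """Residues of x; Python's % already maps negative x into [0, m)."""
    return tuple(x % m for m in moduli)


def rns_neg(x, moduli=MODULI):
    """Complement each digit with respect to its own modulus."""
    return tuple((m - d) % m for d, m in zip(x, moduli))


def rns_add(x, y, moduli=MODULI):
    """Add digit by digit; no carry crosses from one digit to another."""
    return tuple((a + b) % m for a, b, m in zip(x, y, moduli))


def rns_mul(x, y, moduli=MODULI):
    """Multiply digit by digit."""
    return tuple((a * b) % m for a, b, m in zip(x, y, moduli))


print(rns_neg(to_rns(21)))               # (3, 0, 4, 0)
print(rns_add(to_rns(5), to_rns(-1)))    # (4, 4, 4, 1)
print(rns_mul(to_rns(5), to_rns(-1)))    # (3, 2, 0, 1)
```

Each digit is handled independently, which is what lets the hardware use one small adder or multiplier per modulus.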

Residue Arithmetic - Advantages
- Parallel independent operations on small numbers of digits
  - Significant speed ups, especially for multiplication!
  - 4 bit x 4 bit multiplier (moduli up to 15) much simpler than 16 bit x 16 bit one
- Carries are strictly confined to small numbers of bits
  - Each modulus is only a small number of bits
- Can be implemented in Look Up Tables (LUTs)
  - 6 bit residues (moduli up to 64)
  - 64 x 64 x 6 bits required (<4 Kbytes)

Residue Arithmetic – Choosing the moduli
- Largest modulus determines the overall speed: try to make it as small as possible
- Simple strategy
  - Choose sequence of prime numbers until the dynamic range, M, becomes large enough
  - eg Application requires a range of at least $10^5$, ie $M \ge 10^5$
  - For RNS(13,11,7,5,3,2), M = 30,030
  - Range is too low, so add one more modulus: RNS(17,13,11,7,5,3,2), M = 510,510
  - Now each modulus requires a separate circuit and our range is now ~5 times as large as needed, so remove 5: RNS(17,13,11,7,3,2), M = 102,102
  - Six residues, requiring 5 + 4 + 4 + 3 + 2 + 1 = 19 bits
  - The largest modulus (17, requiring 5 bits) determines the speed, so …

Residue Arithmetic – Choosing the moduli
- Application requires a range of at least $10^5$, ie $M \ge 10^5$
- … RNS(17,13,11,7,3,2), M = 102,102
  - Six residues, requiring 5 + 4 + 4 + 3 + 2 + 1 = 19 bits
  - The largest modulus (17, requiring 5 bits) determines the speed, so combine some of the smaller moduli (remember the requirement is that they be relatively prime!)
- Try to produce the largest modulus using only 5 bits
  - Pair 2 and 13, 3 and 7: RNS(26,21,17,11), M = 102,102
  - Four residues, requiring 5 + 5 + 5 + 4 = 19 bits (no improvement in total bit count, but 2 fewer ALUs!)
- Better …?

Residue Arithmetic – Choosing the moduli
- Application requires a range of at least $10^5$, ie $M \ge 10^5$
- … RNS(26,21,17,11), M = 102,102
  - Four residues, requiring 5 + 5 + 5 + 4 = 19 bits (no improvement in total bit count, but 2 fewer ALUs!)
- Include powers of smaller primes before primes, starting with RNS(3,2), M = 6
  - Note that $2^2$ is smaller than the next prime, 5, so move to RNS($2^2$,3), M = 12 (trying to minimize the size of the largest modulus)
  - After including 5 and 7, note that $2^3$ and $3^2$ are smaller than 11: RNS($3^2$,$2^3$,7,5), M = 2,520
  - Add 11: RNS(11,$3^2$,$2^3$,7,5), M = 27,720
  - Add 13: RNS(13,11,$3^2$,$2^3$,7,5), M = 360,360

Residue Arithmetic – Choosing the moduli
- Application requires a range of at least $10^5$, ie $M \ge 10^5$
- … Add 13: RNS(13,11,$3^2$,$2^3$,7,5), M = 360,360
- M is now more than 3× larger than needed, so replace 9 with 3, then combine 5 and 3
  - RNS(15,13,11,$2^3$,7), M = 120,120
  - 5 moduli, 4 + 4 + 4 + 3 + 3 = 18 bits, largest modulus has 4 bits
- You can actually do somewhat better than this!
- Reference: B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000
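The bookkeeping in this moduli search (dynamic range, bits per residue, total bits) is easy to automate. The helper below is a Python sketch with a hypothetical name, `describe`; it assumes a residue for modulus m needs enough bits to hold 0 to m-1, and it uses `math.prod` (Python 3.8 or later).

```python
from itertools import combinations
from math import gcd, prod


def describe(moduli):
    """Return (dynamic range M, bits per residue, total bits) for a moduli set."""
    if any(gcd(p, q) != 1 for p, q in combinations(moduli, 2)):
        raise ValueError("moduli must be pairwise relatively prime")
    bits = [(m - 1).bit_length() for m in moduli]   # residues run from 0 to m-1
    return prod(moduli), bits, sum(bits)


REQUIRED = 10**5
for candidate in [(17, 13, 11, 7, 3, 2), (26, 21, 17, 11), (15, 13, 11, 8, 7)]:
    M, bits, total = describe(candidate)
    print(candidate, "M =", M, "bits =", bits, "total =", total,
          "range ok" if M >= REQUIRED else "range too small")
```

Running the three candidate sets side by side shows the trade-off being made: fewer, wider residue channels against a smaller largest modulus.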

Residue Numbers - Conversion
- Inputs and outputs will invariably be in standard binary or decimal representations, so conversion to and from them is required
- Conversion from binary | decimal to RNS
  - Problem: given a number, y, find its residues wrt moduli, $m_i$
  - Divisions would be too time-consuming!
  - Use this equality:
    $\langle (y_{k-1} y_{k-2} \dots y_1 y_0)_2 \rangle_{m_i} = \langle \langle 2^{k-1} y_{k-1} \rangle_{m_i} + \dots + \langle 2 y_1 \rangle_{m_i} + \langle y_0 \rangle_{m_i} \rangle_{m_i}$
  - So we only need to precompute the residues $\langle 2^j \rangle_{m_i}$ for each of the moduli, $m_i$, used by the RNS

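Residue Numbers - Conversion
- For RNS(8,7,5,3): $\langle y \rangle_8$ is trivially calculated (the 3 least significant bits)
- For 7, 5 and 3, we need the powers of 2 modulo 7, 5 and 3

| $j$ | $2^j$ | $\langle 2^j \rangle_7$ | $\langle 2^j \rangle_5$ | $\langle 2^j \rangle_3$ |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 | 2 |
| 2 | 4 | 4 | 4 | 1 |
| 3 | 8 | 1 | 3 | 2 |
| 4 | 16 | 2 | 1 | 1 |
| 5 | 32 | 4 | 2 | 2 |
| 6 | 64 | 1 | 4 | 1 |
| 7 | 128 | 2 | 3 | 2 |
| 8 | 256 | 4 | 1 | 1 |
| 9 | 512 | 1 | 2 | 2 |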

Residue Numbers - Conversion
- Find $164_{10} = 1010\,0100_2 = 2^7 + 2^5 + 2^2$ in RNS(8,7,5,3):
  - $\langle 164 \rangle_8$ is $100_2 = 4_{10}$
  - $\langle 164 \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = \langle 10 \rangle_7 = 3$ (using the table of powers of 2 modulo 7, 5 and 3)
- Note that the additions are done in a modular adder!
- Worst case: k additions for each residue for a k-bit number
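The table-driven conversion can be sketched as follows. The Python function names are hypothetical; the table is built with the built-in three-argument `pow`, and every addition is reduced modulo m to mimic a modular adder.

```python
def power_residue_table(moduli, k):
    """Precompute <2^j>_m for j = 0..k-1 and each modulus m."""
    return {m: [pow(2, j, m) for j in range(k)] for m in moduli}


def binary_to_rns(y, moduli, k):
    """Convert a k-bit unsigned y to RNS without dividing y by any modulus."""
    table = power_residue_table(moduli, k)
    residues = []
    for m in moduli:
        r = 0
        for j in range(k):
            if (y >> j) & 1:
                r = (r + table[m][j]) % m    # one modular addition per 1 bit
        residues.append(r)
    return tuple(residues)


print(binary_to_rns(164, (8, 7, 5, 3), 8))   # (4, 3, 4, 2)
```

For a power-of-two modulus such as 8 the table entries for $j \ge 3$ are all 0, so the loop reduces to reading the low bits, matching the shortcut on the slide.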

Residue Numbers - Conversion
- Conversion from RNS to binary
  - Digits of an RNS representation can be shown to have position weightings, eg for RNS(8,7,5,3) the weightings are 105 120 336 280
  - The weightings may be calculated using the Chinese Remainder Theorem:
    $x = (x_{k-1} x_{k-2} \dots x_1 x_0)_{RNS} = \left\langle \sum_i M_i \langle a_i x_i \rangle_{m_i} \right\rangle_M$
    where $M_i = M / m_i$ and $a_i = \langle M_i^{-1} \rangle_{m_i}$ is the multiplicative inverse of $M_i$ wrt $m_i$
  - This means that $(x_3, x_2, x_1, x_0)_{RNS} = \langle x_3 \times 105 + x_2 \times 120 + x_1 \times 336 + x_0 \times 280 \rangle_{840}$

Residue Numbers - Conversion
- Conversion from RNS to binary
  - Digits of an RNS representation can be shown to have position weightings, eg for RNS(8,7,5,3) the weightings are 105 120 336 280
  - Calculate position weights with CRT …
  - This means that $(x_3, x_2, x_1, x_0)_{RNS} = \langle x_3 \times 105 + x_2 \times 120 + x_1 \times 336 + x_0 \times 280 \rangle_{840}$
- This is most efficiently done through a LUT
  - Note that the table for RNS(8,7,5,3) requires only 8 + 7 + 5 + 3 = 23 entries
  - In general, this requires only $\sum_{i=0}^{k-1} m_i$ words: a reasonable number!
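Both steps, deriving the weights from the Chinese Remainder Theorem and tabulating weight × residue, can be sketched in Python. The names `crt_weights`, `build_lut` and `lut_rns_to_int` are hypothetical, and the modular inverse uses `pow(x, -1, m)`, which needs Python 3.8 or later.

```python
from math import prod


def crt_weights(moduli):
    """Position weight M_i * <M_i^{-1}>_{m_i} for each modulus, reduced mod M."""
    M = prod(moduli)
    weights = []
    for m in moduli:
        Mi = M // m
        ai = pow(Mi, -1, m)              # multiplicative inverse of M_i modulo m
        weights.append(Mi * ai % M)
    return weights


def build_lut(moduli):
    """One row per modulus; entry r holds <weight * r>_M. Total size: sum of moduli."""
    M = prod(moduli)
    return [[w * r % M for r in range(m)] for w, m in zip(crt_weights(moduli), moduli)]


def lut_rns_to_int(residues, lut, M):
    """Look up one word per digit and add them modulo M."""
    return sum(row[r] for row, r in zip(lut, residues)) % M


moduli = (8, 7, 5, 3)
lut = build_lut(moduli)
print(crt_weights(moduli))                        # [105, 120, 336, 280]
print(sum(len(row) for row in lut))               # 23
print(lut_rns_to_int((4, 3, 4, 2), lut, 840))     # 164
```

The look-up replaces the multiplications by the weights, leaving only k-1 additions modulo M per conversion.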

Residue Arithmetic - Disadvantages
- Range is limited
- Division is hard!
- Comparison <, >, sign (<0?) are hard
- Still suitable for some DSP applications
  - Only use +, x
  - Result range is known
  - Examples: digital filters, Fourier transforms

Multipliers
- 'Long' multiplication
- In binary, the partial products are trivial: if multiplier bit = 1, copy the multiplicand, else 0
- Use an 'and' gate!
[Figure: long multiplication of a3 a2 a1 a0 by b3 b2 b1 b0, with one row of partial products per multiplier bit (b0 to b3); the first row of partial products is a3 a2 a1 a0 ANDed with b0.]

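Multipliers
- We can add the partial products with FA blocks
[Figure: array of FA blocks with multiplicand bits a3 a2 a1 a0 across the top, one row of FAs per multiplier bit (b0, b1, b2, ...) and product bits p0, p1, ... at the outputs.]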

Parallel Array Adder
- We can build this adder in VHDL with two GENERATE loops

    -- bit_matrix is a type name introduced here so that the declaration is legal VHDL
    TYPE bit_matrix IS ARRAY( 0 TO n-1, 0 TO n-1 ) OF std_logic;
    SIGNAL pa, pb, cout : bit_matrix;

    rows: FOR j IN 0 TO n-1 GENERATE        -- For each row
      cols: FOR k IN 0 TO n-1 GENERATE      -- Generate a row
        pjk : full_adder PORT MAP( … );
      END GENERATE cols;
    END GENERATE rows;

- This part (the two loops) is straight-forward!
- … but you need to fill in the PORT MAP using internal signals!

Multipliers
- We can add the partial products with FA blocks
- Optimization 1: Replace this row of FAs
- Time? What's the worst case propagation delay?
[Figure: array of FA blocks with multiplicand bits a3 a2 a1 a0 across the top, one row of FAs per multiplier bit (b0, b1, b2, ...) and product bits p0, p1, ... at the outputs; one row of FAs is marked for replacement.]