396-ps 32-bit Han-Carlson ALU in 180nm TSMC process


1 396-ps 32-bit Han-Carlson ALU in 180nm TSMC process
Liang-Kai Wang, VLSI CAD Lab, University of Wisconsin, Madison. 2019/1/3
Name: Liang-Kai Wang, second-year master's student. Title: 396-ps 32-bit H-C ALU in 180 nm TSMC process. First, thank you for coming; I promise to finish as quickly as possible to save your time. If you have any questions about the content, feel free to interrupt me. I know some of you are already very familiar with adders, so if you spot anything wrong or ambiguous, please let me know and I will explain it as well as I can.

2 Outline
Review of Adders
The Idea of Han-Carlson Adder
The Implementation of Han-Carlson Adder
Simulation Result
Discussion
Comparison between Ling's and H-C Adder
Future Work
Reference
This is the outline of the presentation. Although there are several sections, I will focus on the second, the idea, and the third, the implementation.

3 Review of Adders
Carry Ripple Adder
Carry Ripple Adder: Adds one bit at a time. The total execution time is the number of bits times the delay of one bit position. It saves area but wastes time.
Parallel Prefix Adder: a.k.a. Hybrid Prefix Adder. This kind of adder is a variation of the carry-lookahead adder; both are introduced in the following slides.
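To make the bit-at-a-time behaviour concrete, here is a minimal Python sketch of a ripple-carry adder. The function name and the default width of 32 are illustrative assumptions, not part of the original design.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit integers one bit at a time, rippling the carry."""
    result = 0
    carry = carry_in
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        s_i = a_i ^ b_i ^ carry                      # sum bit of a full adder
        carry = (a_i & b_i) | (carry & (a_i ^ b_i))  # carry out of a full adder
        result |= s_i << i
    return result, carry


total, carry_out = ripple_carry_add(0xFFFFFFFF, 1)
print(hex(total), carry_out)  # 0x0 1
```

The loop runs once per bit, which mirrors why the delay grows linearly with the word length.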

4 Review of Adders (cont.)
Carry Skip Adder
Carry Skip Adder: The bits are separated into several groups, and the carry-in of a group is bypassed directly to its carry-out if all the propagate bits within the group are "1". The adder needs additional logic to implement the bypass. Although this can increase the speed, it helps only in some cases. There are further optimizations of this adder that are not covered in this presentation.
Carry-Select Adder: Compute the outcome for both possible carry-ins and select the correct answer once the actual carry-in is known. Although the speed is improved, the cost is doubled.

5 Review of Adders (cont.)
Carry-Select Adder
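A minimal Python sketch of the carry-select idea follows: each block is added twice, once per possible carry-in, and the real carry picks the result. The block size of 8 and the helper names are illustrative assumptions, and the width is assumed to be a multiple of the block size.

```python
def add_block(a, b, carry_in, width):
    """Plain addition of one block; returns (sum bits, carry out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, carry_in=0, width=32, block=8):
    mask = (1 << block) - 1
    result, carry = 0, carry_in
    for lo in range(0, width, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # Both candidates are computed before the real carry is known.
        sum0, cout0 = add_block(a_blk, b_blk, 0, block)
        sum1, cout1 = add_block(a_blk, b_blk, 1, block)
        # The incoming carry acts as the select line of a multiplexer.
        blk_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= blk_sum << lo
    return result, carry
```

Computing both candidates per block is what doubles the hardware in exchange for a shorter carry path.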

6 Review of Adders (cont.)
Carry-Save Adder
In multiplication, many numbers may have to be added at once; with a regular CLA this addition takes a long time. The carry-save adder provides an alternative. It reduces n binary numbers to two numbers having the same sum in O(log n) levels. If a fast logarithmic-time carry-propagate adder then adds the two resulting numbers, the cost is (n-2)*C_CSA + C_CPA and the time is T = O(tree height + T_CPA) = O(log n + log k), where k is the operand width in the CSA.
To get a simple layout and high speed, several modifications of the carry-lookahead adder have been proposed, among them the Brent-Kung and Kogge-Stone adders. Each has its pros and cons.
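The reduction step can be sketched in Python as a 3:2 compressor applied repeatedly. This sketch applies the compressors one after another rather than in a tree, so it illustrates how three numbers become two with the same sum, not the O(log n) depth; the function names are illustrative assumptions.

```python
def carry_save_step(x, y, z):
    """3:2 compressor: three numbers in, two numbers with the same sum out."""
    partial_sum = x ^ y ^ z
    carries = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carries


def carry_save_sum(operands):
    """Reduce non-negative ints to two values, then do one final add."""
    values = list(operands)
    csa_count = 0
    while len(values) > 2:
        x, y, z = values.pop(), values.pop(), values.pop()
        values.extend(carry_save_step(x, y, z))
        csa_count += 1          # each step removes exactly one operand
    return sum(values), csa_count   # sum() stands in for the final CPA


print(carry_save_sum([3, 5, 7, 9, 11]))  # (35, 3)
```

No carry propagates inside a compressor step; the only carry-propagate addition is the final one.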

7 Review of Adders (cont.)
Carry Lookahead Adder
CLA: The best-known textbook adder implementation. It forms the partial generate and propagate signals and combines them at each level.
Ling Adder: A modified version of the CLA. It achieves significant hardware savings and improves the speed by reformulating the carry equations. In the equations, the "p" signal is replaced by the "t" signal, called the transfer bit. (Ling's reasoning was that the carry-in is also passed to the next stage even if the generate bit is 1, although the term seems redundant here.) Ling's modification also propagates h_i instead of c_i. As the slide shows, the maximum number of inputs to an AND gate is smaller, as is the total number of gate inputs. What are the pros and cons? In terms of speed, fewer inputs mean less load capacitance and higher speed. However, the sum bit becomes more complex.

8 Review of Adders (cont.)
Ling Adder
Observation: Ling adders are a variation of the commonly used CLA. Both calculate whether groups of bits will generate or propagate a carry, and these group signals can be combined hierarchically to calculate the carry into any bit. Ling adders use a simplified version of the group generate signal, which takes less time to compute.
Basic assumption: the carry term is still the critical path, so we can reduce the delay of the carry term by increasing the delay of some other term. The overall delay to form s_i is reduced by moving p_(i-1) out of the critical path. For both adders, the delay can be reduced further by computing directly from the input bits with a single complex gate instead of first generating the local generate and propagate signals. Comparing G0 and H0, the maximum fan-in is 4 for the CLA and 3 for the Ling adder, which means the pull-down path is shortened by one transistor and the speed is improved.
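As a bit-serial illustration of Ling's reformulation, the Python sketch below propagates the pseudo-carry H_i = g_i + t_(i-1) H_(i-1) and recovers the true carry as c_i = H_i t_i only where it is needed. It is a functional model under assumed conventions (the incoming carry is treated as H_(-1) with t_(-1) = 1), not the gate-level circuit from the slides.

```python
def ling_add(a, b, carry_in=0, width=32):
    result = 0
    h_prev = carry_in   # H_{-1}: the incoming carry (assumed convention)
    t_prev = 1          # t_{-1} = 1 so that c_{-1} = H_{-1} & t_{-1} = carry_in
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        g = a_i & b_i   # generate
        t = a_i | b_i   # transfer
        p = a_i ^ b_i   # half-sum
        c_in = h_prev & t_prev          # c_{i-1} = H_{i-1} t_{i-1}
        result |= (p ^ c_in) << i       # the sum needs the extra AND with t
        h_prev = g | c_in               # H_i = g_i + t_{i-1} H_{i-1}
        t_prev = t
    return result, h_prev & t_prev      # carry out = H_{w-1} t_{w-1}
```

The recurrence for H has one fewer factor than the conventional carry recurrence, and the price is the extra term in the sum, which matches the trade-off described in the notes.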

9 Review of Adders (cont.)
Hybrid (Parallel) Prefix Adder
Brent-Kung Adder
Kogge-Stone Adder
Han-Carlson Adder

10 Review of Adders (cont.)
Brent-Kung Adder
Cost: C(k) = C(k/2) + k - 1 = 2k - 2 - log2(k) (number of adder cells)
Time: 2*log2(k) - 2 (in terms of adder levels)
This slide shows what is known as the Brent-Kung parallel prefix adder.
Brent-Kung Adder: Based on a divide-and-conquer approach. The inputs are first combined pairwise to obtain a sequence of length k/2 (say s_1, ..., s_(k/2)), and the even-indexed prefixes are then computed from the odd-indexed ones. The delay recurrence gives 2*log2(k) - 1 and the cost recurrence gives 2k - 2 - log2(k). Less layout area, fewer processing nodes.
Derivation of the cost, assuming k = 2^{k'}:
C(2^1) = C(1) + 2^1 - 1
C(2^2) = C(2^1) + 2^2 - 1
C(2^3) = C(2^2) + 2^3 - 1
...
C(2^{k'}) = C(2^{k'-1}) + 2^{k'} - 1
Summing: C(2^{k'}) = C(1) + (2^1 + 2^2 + ... + 2^{k'}) - k' = 0 + 2(1 - 2^{k'})/(1 - 2) - k'
With k = 2^{k'}, k' = log2(k), so C(k) = 2(k - 1) - log2(k).
For comparison, the first divide-and-conquer scheme:
D(k) = D(k/2) + 1 = log2(k), because we compute half of the bits first and then combine them in one more stage.
C(k) = 2C(k/2) + k/2 = (k/2)*log2(k), because we process each of the two large groups and combine the results with another k/2 cells.
So the B-K cost is C(k) = C(k/2) + k - 1 = 2k - 2 - log2(k): process half of the bits, plus another k - 1 cells (k/2 for the first merge and k/2 - 1 for the second merge).
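The closed form of the cost recurrence can be checked numerically. The short Python sketch below evaluates C(k) = C(k/2) + k - 1 with C(1) = 0 for powers of two and compares it with 2k - 2 - log2(k); the function name is an illustrative assumption.

```python
def bk_cost(k):
    """Brent-Kung cell count from the recurrence C(k) = C(k/2) + k - 1, C(1) = 0."""
    return 0 if k == 1 else bk_cost(k // 2) + k - 1


for exponent in range(1, 7):
    k = 2 ** exponent
    closed_form = 2 * k - 2 - exponent   # exponent = log2(k)
    print(k, bk_cost(k), closed_form)    # the last two columns agree
```

For k = 32 both expressions give 57 cells.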

11 Review of Adders (cont.)
Kogge-Stone Adder
Cost: k*log2(k) - (k-1)
Time: log2(k)
The time here is the number of levels, which equals log2(k). In layout terms the cost grows as n^2 because of the large number of vertical tracks required to embed the wires in the upper stages of the graph.
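The Kogge-Stone structure can be modelled in a few lines of Python. This is a behavioural sketch that also counts prefix cells and levels so the formulas above can be checked; folding the carry-in into bit 0 and the function name are assumptions made for the example, and the width is assumed to be a power of two.

```python
def kogge_stone_add(a, b, carry_in=0, width=32):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    psum = p[:]                      # half-sum bits, needed again at the end
    g[0] |= p[0] & carry_in          # fold the carry-in into bit 0
    cells = levels = 0
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            # Prefix cell: merge the span ending at i with the span ending at i - dist.
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
            cells += 1
        g, p = new_g, new_p
        dist *= 2
        levels += 1
    carries = [carry_in] + g[:-1]    # carry into bit i is the carry out of bit i-1
    total = sum((psum[i] ^ carries[i]) << i for i in range(width))
    return total, g[-1], cells, levels


print(kogge_stone_add(0xFFFFFFFF, 1))  # (0, 1, 129, 5)
```

For a width of 32 the model uses 129 cells in 5 levels, which agrees with k*log2(k) - (k-1) and log2(k).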

12 The idea of Han-Carlson Adder
B-K adder: small area, but slow
K-S adder: large area, but fast
Speed: 2*log2(n) - 2 -> log2(n) (about 1/2 reduction)
Cost: 2k - 2 - log2(k) -> k*log2(k) - k + 1 (about log2(k)/2 increase)
The area-time tradeoff results in the Han-Carlson adder.
Comparing these graph representations for prefix computation, K-S has high area with low time and B-K has low area with high time. A constant-factor reduction in time, from 2*log2(n) - 2 to log2(n), comes at the price of a significant increase in area, from O(2k - 2 - log2(k)) to O(k*log2(k) - k + 1).

13 The idea of Han-Carlson Adder (cont.)
Cost: O((k/2)*log2(k))
Time: O(log2(k) + 1)
The cost is that of a k/2-input Kogge-Stone adder, (k/2)*log2(k/2) - k/2 + 1, plus k - 1, the number of nodes needed for the first and last stages; the total is (k/2)*log2(k). The time is only one stage more than the Kogge-Stone adder. This additional stage is used to produce the odd bits.
Main construction of Han-Carlson Adder
Combines the B-K and K-S adders.
50% fewer carry-merge gates and a 40% reduction in active leakage energy compared to K-S, at the cost of only one additional stage of delay.
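A behavioural Python sketch of the Han-Carlson structure is given below: one Brent-Kung style stage, a Kogge-Stone network on every other bit, and one final stage for the remaining bits. Bits are numbered from 0 here, so the final stage fills in the even-indexed carries; that numbering, the carry-in handling, and the function names are assumptions for the example, and the width is assumed to be a power of two.

```python
def han_carlson_add(a, b, carry_in=0, width=32):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    psum = p[:]
    g[0] |= p[0] & carry_in          # fold the carry-in into bit 0
    cells = levels = 0

    def level(pairs):
        """Apply one stage of prefix cells; each pair (i, j) merges span j into span i."""
        nonlocal g, p, cells, levels
        new_g, new_p = g[:], p[:]
        for i, j in pairs:
            new_g[i] = g[i] | (p[i] & g[j])
            new_p[i] = p[i] & p[j]
        g, p = new_g, new_p
        cells += len(pairs)
        levels += 1

    # First stage: every odd-indexed bit absorbs its lower neighbour.
    level([(i, i - 1) for i in range(1, width, 2)])
    # Middle stages: Kogge-Stone restricted to the odd-indexed bits.
    dist = 2
    while dist < width:
        level([(i, i - dist) for i in range(1 + dist, width, 2)])
        dist *= 2
    # Last stage: every even-indexed bit (except bit 0) takes the carry from below.
    level([(i, i - 1) for i in range(2, width, 2)])

    carries = [carry_in] + g[:-1]
    total = sum((psum[i] ^ carries[i]) << i for i in range(width))
    return total, g[-1], cells, levels


print(han_carlson_add(0xFFFFFFFF, 1))  # (0, 1, 80, 6)
```

For a width of 32 the model uses 80 cells in 6 levels, matching (k/2)*log2(k) and log2(k) + 1.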

14 Review of Adders (cont.)
Optimized Brent-Kung Adder
Cost: C(k) = C(k/2) + k - 1 = 2k - 2 - log2(k)
Time: log2(k) (in terms of adder levels)

15 The idea of Han-Carlson Adder (cont.)
Adder: B-K original | K-S | Optimal H-C
Cost: 2k - 2 - log2(k) | k*log2(k) - (k-1) | (k/2)*log2(k)
Delay: 2*log2(k) - 2 | log2(k) | log2(k) + 1
Max. FO: 2 4

16 The idea of Han-Carlson Adder (cont.)
Produce the Generate, Propagate, and Partial Sum bits in the first stage.
Single-rail circuit, with double-rail in the last stage to perform the XOR function.
Sum = Partial_Sum XOR CarryIn
Improved: Domino circuit with odd stages dynamic and even stages static.
In theory, we compute the sum bit by XORing the two input bits with the carry-in bit in the last stage, but in practice there are ways to improve the speed. For example, we can produce the Partial Sum bit in the very first stage, so that only one XOR is needed in the last stage. We can also make the XOR more efficient by introducing double-rail logic, which means the signal and its complement are available at the output at the same time. What's more, we can make the odd stages dynamic and the even stages static to increase the speed.
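The split into an early partial sum and a single late XOR can be sketched in Python. The carry vector is produced here by a simple ripple loop that stands in for the carry-merge stages; that stand-in and the function name are assumptions for illustration.

```python
def sum_from_partial(a, b, carry_in=0, width=32):
    mask = (1 << width) - 1
    psum = (a ^ b) & mask            # first stage: partial sum, no carries needed

    # Stand-in for the carry-merge tree: the carry INTO each bit position.
    carries, c = 0, carry_in
    for i in range(width):
        carries |= c << i
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        c = (a_i & b_i) | ((a_i | b_i) & c)

    return psum ^ carries            # last stage: one XOR per bit


print(sum_from_partial(7, 9))  # 16
```

Because the partial sum does not depend on any carry, it can be computed while the carries are still being merged.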

17 The implementation of Han-Carlson Adder
Schematics designed with Composer, simulated with Spectre. Both are in the Cadence design kits.
The simulation result is from the schematic (pre-layout).
The best speed is achieved by using the fast mode in the technology file instead of tuning the bulk voltage.
The clock is generated by a ring oscillator with five inverters in the loop.
Cadence tutorials for both tools and for setting up the environment are provided here.
Next, I will introduce my internal circuit. The whole design was done with the Cadence tools Composer and Spectre, and the simulation results are pre-layout for now. The best speed I got was achieved by using the 0.18um fast-mode CMOS technology instead of changing the substrate voltage, because even if I can make Vt smaller by tuning the bulk voltage, that might fail in a real implementation. In regular mode, the speed is about 2GHz. The clock is generated by a ring oscillator with five inverters in the loop. I also wrote a Cadence tutorial, because few people use Cadence at Wisconsin, which drove me crazy whenever I had problems with the tools. I think a tutorial is a good idea so that other people here can use the tools more easily.

18 The implementation of Han-Carlson Adder (cont.)
Clock generation:
Ring oscillator: five inverters followed by many buffers
NMOS trigger output
First, I introduce the clock signal in my circuit. The clock is generated by a ring oscillator with five inverters, followed by many buffers. If we need a slower clock or a non-symmetric clock signal, we can modify some parameters here to get it. As for clock distribution, to use the clock efficiently we need four different clocks, as you see in this diagram. There is also a latch, not shown here, that captures the result; it is controlled by phi-1 to prove that the whole operation completes within one clock.

19 The implementation of Han-Carlson Adder (cont.)
Clock distribution
[Figure: clock distribution over the blocks PG gen., S0, S1, S2, S3, S4, and Sum gen.; clock signals Ø1, Ø2, stclk2, stclk3; outputs Sum, Sum#, Correct; output latch]

20 The implementation of Han-Carlson Adder (cont.)
The whole view
[Figure: inputs A, B, Carry In; blocks PG gen., CM0, CM1, CM2, CM3, CM4, M1, M2, Sum gen.; outputs Sum, Sum#, Correct. Labels: path for P and G bits (single-rail circuit, foot transistor added); path for Psum bit (double rail inside)]
As you can see in this slide, A, B, and Carry In are the three inputs of the circuit. Because of the way the PG generator block works, the two inputs A and B need to be inverted to produce the P, G, and Psum signals. The P and G signals pass through the carry-merge stages, and Psum_i is passed through the M stages to hold its value. At the last stage, all three values are used to generate the final Sum bits and to produce the Correct bit, which indicates whether the result is correct.

21 The implementation of Han-Carlson Adder (cont.)
ALU PG/Partial Sum Circuit
First, we look at the first stage, the PG/partial sum circuit. This is a dynamic stage. The circuit isn't very complex, but for the partial sum portion I used a more straightforward form than the one in the paper, i.e., A XOR B = g# AND p. The path is shorter than in the paper's design, and I will compare the two in the discussion section. Also, the inputs for the P and G bits are A_bar and B_bar instead of A and B.

22 The implementation of Han-Carlson Adder (cont.)
Dynamic and Static Carry Merge Stage:
[Figure labels: Even Stage; Odd Stage; i = 0, 2, ..., 30; i = 1, 3, ..., 31, or the carry at that bit is already available]
As I mentioned a few slides ago, we make the even stages static and the odd stages dynamic. That means we get the complemented values of the input signals, and we need some extra processing to get the right value in the odd stage, as you see here.

23 The implementation of Han-Carlson Adder (cont.)
Dynamic and Static Carry Merge Stage (cont.):
The carry-in of the LSB must be merged in order to do subtraction.
The generate and propagate bits of the MSB are passed to the last stage to produce the carry_out of the ALU (for the check bit).
In order to do subtraction, we need to merge the carry-in bit as well, which means the generate and propagate bits are passed all the way down to the last stage to produce the sum bit and the carry_out bit of the addition.

24 The implementation of Han-Carlson Adder (cont.)
Even/Odd-bits CSG Sum Generation
Complementary signal generator (CSG) logic

25 The implementation of Han-Carlson Adder (cont.)
Even/Odd-bits CSG Sum Generation
Use a latch to increase noise tolerance.
[Figure labels: Carry, Carry_bar]
This is the final stage of the circuit. At first, the carry bit and the carry_bar bit are both precharged to "1". After the Ph2 signal goes up, both signals start going down, and if the carry bit is going down here, we must make sure that carry_bar is going down faster than carry, so that we can still use the latch to pull the carry_bar signal back to "1". With this CSG, the noise will be less than 100mV in a 0.13um process.

26 Simulation Result
Try the worst-case pattern to test this design: A=0, B=-2, Carry-In=1 gives the worst-case delay.
Why? From the structure of the circuit, the worst-case path for the propagate bit is 3N-2P-2N-2P-2N-2P-3N.
1st stage: for the P bit, three NMOS transistors are connected in series, so the worst case here is a result equal to 0.
2nd stage: for the P bit, two series-connected PMOS transistors are the worst case, in order to make the other "P" bit equal to "0".
All the following stages are similar to these two. We should get result = 0 and correct = 1.

27 Simulation Result (cont.)
0th stage: Carry-In=1
1st stage: g=0, p=0, Psum=0 (P/G/Psum, 3N)
2nd stage: g#=1, p#=1 (Static, 2P)
3rd stage: g=0, p=0 (Dynamic, 2N)
4th stage: g#=1, p#=1 (Static, 2P)
5th stage: g=0, p=0 (Dynamic, 2N)
6th stage: g#=1, p#=1 (Static, 2P)
7th stage: Cin31=0 (Dynamic, 3N)
The result should be "2"
Correct = 1

28 Simulation Result (cont.)

29 Simulation Result (cont.)
The result window
As you can see here, all the waveforms are "0" except the correct signal.

30 Simulation Result (cont.)
Test whether the error flag is correct.
1st test pattern: A = -2^31, B = -1. The answer is {1'b0, 31'b1}, which is the wrong answer, so the correct bit should equal 0 (tests the lower bound).
Also check the clock period is about ps

31 Simulation Result (cont.)

32 Simulation Result (cont.)
2nd test pattern: A = 2^31 - 1, B = 2. The answer is {1'b1, 30'b0, 1'b1}, which is the wrong answer, so the correct bit should equal 0 (tests the upper bound).
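The two boundary patterns can be reproduced with a small Python model of 32-bit two's-complement addition. Modelling the correct bit as "1 when the wrapped result equals the true sum" is an assumption about the flag's meaning, and the function name is illustrative.

```python
def add_with_check(a, b, width=32):
    mask = (1 << width) - 1
    raw = (a + b) & mask                 # the bit pattern the adder outputs
    sign = 1 << (width - 1)
    # Reinterpret the bit pattern as a signed two's-complement value.
    result = raw - (1 << width) if raw & sign else raw
    correct = int(result == a + b)       # 0 signals signed overflow
    return raw, result, correct


print(add_with_check(-2**31, -1))    # (2147483647, 2147483647, 0)
print(add_with_check(2**31 - 1, 2))  # (2147483649, -2147483647, 0)
```

The first pattern wraps to 0 followed by 31 ones and the second to 1, 30 zeros, 1, which are the bit patterns quoted on the slides.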

33 Simulation Result (cont.)
As you can see at the bottom of this slide, the clock period is 396ps.

34 Discussion: P/G/Psum Block
[Figure labels: P circuit; G circuit; Psum circuit; Psum = A xor B; Mine]
Now we enter the discussion section. One of the papers in the references implements the XOR logic with these circuits. However, I found that these circuits may not perform better, because the pull-down path is longer. Let's check the next slide.

35 Discussion (cont.)
What might be the problem?
Longer path to ground.
During precharge, both the propagate and the generate bits are "1".
What do we need to consider?
If p=0 and g=0, this circuit may perform well. However, what if g goes from 1 to 0 while p=1? As you can see here, to discharge the G signal to ground we need to discharge through three paths, which may degrade the pull-down speed, as you will see in the next slide.

36 Discussion (cont.)
As you can see here, the pull-down is worse than in the design you just saw. What is worse, the minimum value of G is about 0.2mv, which degrades the Psum bit.

37 Discussion (cont.)
If the longest path is cut, then…
Mine

38 Discussion (cont.)
Mine

39 Comparison between H-C adder and Ling Adder
For an n-bit Ling adder combining r groups:
Critical path: log_r(n) - 1 levels. An r -> 1 reduction results in log_r(n) levels; the "-1" comes from using the CLA expression rather than Ling's expression for the last group, which saves one stage.
The worst-case delay remains the second path from the last block.
For each block, there are r+1 transistors connected in series.
A carry-select block is used to generate the sum bit, so only 2 additional gate delays are needed.

40 Comparison between H-C adder and Ling Adder (cont.)
Td = (log_r(n) - 1)(r + 1) + 2
E.g. r=3, n=32, Td=14 (with log_3(32) rounded up to 4)
This assumes a structure of Group Generation, Lookahead Network, CLA expression in the last stage, and Final Sum (using carry-select).
[Figure labels: Group Generation; Lookahead Network; CLA expression; Carry-Select structure (MUX)]

41 Comparison between H-C adder and Ling Adder (cont.)
P, G generation = 3
Carry merge in each stage (dynamic or static) = 2
CSG Sum = 5
Td = 2*log2(n) + 3 (P, G generation) + 5 (CSG Sum)
E.g. n=32, Td=18
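Both delay estimates can be evaluated with a short Python sketch. Rounding the logarithm up to the next integer is an assumption (it is what reproduces Td = 14 for r = 3, n = 32), and the function names are illustrative.

```python
def ceil_log(n, base):
    """Smallest integer e with base**e >= n, computed without floating point."""
    e, power = 0, 1
    while power < n:
        power *= base
        e += 1
    return e


def ling_delay(n, r):
    # (levels - 1) blocks of r + 1 series transistors, plus 2 for carry-select
    return (ceil_log(n, r) - 1) * (r + 1) + 2


def han_carlson_delay(n):
    # 2 per carry-merge stage, 3 for P/G generation, 5 for the CSG sum stage
    return 2 * ceil_log(n, 2) + 3 + 5


print(ling_delay(32, 3), han_carlson_delay(32))  # 14 18
```

These are series-transistor counts along the worst-case path, so they compare path lengths rather than measured delays.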

42 Comparison between H-C adder and Ling Adder (cont.)
What are the pros and cons?
Ling Adder:
Advantage: shorter worst-case path, so it might be faster in theory.
Disadvantages:
Irregular layout, which wastes area.
Many complex gates, which implies charge-sharing problems.
Many inputs per stage lead to long wires, a delay problem at high frequency.
The carry-select logic makes the area bigger.

43 Comparison between H-C adder and Ling Adder (cont.)
Han-Carlson Adder:
Disadvantage: longer path to the output.
Advantages:
Regular layout for each stage.
Fewer inputs on each path, which eases the interconnection problem.
Simpler gates, which means fewer charge-sharing problems.

44 Future Work
Power reduction by inserting sleep transistors
Speed improvement by inserting discharge transistors at the intermediate stack nodes of the dynamic stages during the precharge phase
Area reduction in layout
SOI model test
Self-resetting to minimize the clock period
There is still a lot of work I can try, such as using sleep transistors to reduce the power, better layout to reduce the area, and using an SOI model or self-resetting to increase the speed. All of the above have already been shown in recent papers.

45 Reference
A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction Scheduler Loop, ISSCC 2002
Sub-500-ps 64-b ALUs in 0.18-um SOI/Bulk CMOS: Design and Scaling Trends, JSSC, Nov. 2001
Fast Area-Efficient VLSI Adders, Proc. 8th Symp. Computer Arithmetic, Sept. 1987

46 Reference (cont.)
Computer Arithmetic: Algorithms and Hardware Designs, Behrooz Parhami, Oxford University Press
Advanced Computer Arithmetic Design, Michael J. Flynn, et al., John Wiley & Sons, Inc.
5 GHz 32b Integer-Execution Core in 130nm Dual-Vt CMOS, ISSCC 2002
Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, JSSC, Aug. 1999

