1 CMPEN 411 VLSI Digital Circuits Spring 2009 Lecture 24: Peripheral Memory Circuits
[Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, © J. Rabaey, A. Chandrakasan, B. Nikolic]

2 Review: Read-Write Memories (RAMs)
Static – SRAM
- data is stored as long as the supply is applied
- large cells (6 FETs/cell), so fewer bits per chip
- fast, so used where speed is important (e.g., caches)
- differential outputs (BL and !BL)
- uses sense amps for performance
- compatible with standard CMOS technology

Dynamic – DRAM
- periodic refresh required (every 1 to 4 ms) to compensate for the charge lost to leakage
- small cells (1 to 3 FETs/cell), so more bits per chip
- slower, so used for main memories
- single-ended output (BL only)
- needs sense amps for correct operation
- not typically compatible with standard CMOS technology

3 Non-Volatile Memories The Floating-gate transistor (FAMOS)
[Figure: schematic symbol and device cross-section of the floating-gate transistor: stacked floating and control gates separated by oxide layers (t_ox), n+ source and drain regions in a p substrate.]

4 Floating-Gate Transistor Programming
[Figure: floating-gate transistor programming in three steps. (1) Applying a high programming voltage (20 V on the gate, with 10 V and 5 V at drain and source) causes avalanche injection of electrons onto the floating gate. (2) Removing the programming voltage (0 V / 5 V) leaves the charge trapped. (3) The trapped charge shifts the device characteristics (5 V of gate drive behaving like 2.5 V): programming results in a higher V_T.]

5 A “Programmable-Threshold” Transistor

6 Peripheral Memory Circuitry
Peripheral circuitry includes:
- row and column decoders
- read bit line precharge logic
- sense amplifiers
- timing and control

Design concerns: speed, power consumption, and area (pitch matching).

Address decoders have a substantial impact on the speed and power consumption of the memory. When designing decoders, it is important to keep the complete memory floorplan in perspective so that the decoder cell dimensions match the geometry of the core cell (pitch matching). Otherwise the result is long wires, which hurt both speed and power consumption.

7 Row Decoders
A collection of 2^M complex logic gates organized in a regular, dense fashion.

(N)AND decoder for 8 address bits:
WL(0) = !A7 & !A6 & !A5 & !A4 & !A3 & !A2 & !A1 & !A0
WL(255) = A7 & A6 & A5 & A4 & A3 & A2 & A1 & A0

NOR decoder for 8 address bits:
WL(0) = !(A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0)
WL(255) = !(!A7 | !A6 | !A5 | !A4 | !A3 | !A2 | !A1 | !A0)

Goals: pitch matched, fast, low power. Note that addresses are treated as unsigned numbers (all bits are used, unlike in the book). A behavioral sketch of both forms follows below.
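A behavioral sketch of both decoder forms in Python (illustrative only; the function and variable names are ours, not from the slides), confirming that each address asserts exactly one word line and the two formulations agree:

    # Behavioral model of an M-bit row decoder (illustrative sketch, M = 8).
    M = 8

    def wl_and(row, a):
        # AND form: WL(row) is 1 when every address bit matches row's binary encoding.
        return all(a[i] == ((row >> i) & 1) for i in range(M))

    def wl_nor(row, a):
        # NOR form: WL(row) = NOR of the mismatching literals (De Morgan of the AND form).
        return not any(a[i] != ((row >> i) & 1) for i in range(M))

    addr = 0xA5
    a = [(addr >> i) & 1 for i in range(M)]
    wls = [wl_and(r, a) for r in range(2**M)]
    assert wls == [wl_nor(r, a) for r in range(2**M)]  # both forms agree
    assert sum(wls) == 1 and wls[addr]                 # exactly WL(0xA5) is asserted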

8 Implementing a Wide NOR Function
Single-stage 8x256-bit decoder (as in Lecture 22): one 8-input NOR gate per row x 256 rows = 256 x (8+8) = 4,096 transistors, with pitch-matching and speed/power issues.

Decompose the logic into multiple levels:
!WL(0) = !(!(A7 | A6) & !(A5 | A4) & !(A3 | A2) & !(A1 | A0))
- The first level is the predecoder: for each pair of address bits, form Ai|Ai-1, Ai|!Ai-1, !Ai|Ai-1, and !Ai|!Ai-1
- The second level is the word line driver

Predecoders reduce the number of transistors required (see the tally script below):
- four sets of four 2-input NOR predecoders = 4 x 4 x (2+2) = 64 transistors
- 256 word line drivers, each a 4-input NAND = 256 x (4+4) = 2,048 transistors
- 4,096 vs 2,112 transistors: almost a 50% savings

The number of inputs to the gates driving the WLs is halved, so the propagation delay is reduced by a factor of about 4. Single-stage speed issues include large fan-in gates (if not dynamic) and a single gate driving the large WL load.

This derivation produces !WL, with four sets of four 2-input NOR predecoder gates feeding 4-input NAND word line drivers; the version on the next slide produces WL, with a predecoder of two sets of eight 3-input NAND gates plus four 2-input NAND gates feeding 3-input NOR word line drivers. The book also shows a three-level decoder with a two-level predecoder built from 2-input NAND gates. All large decoders are realized with at least a two-level implementation. By adding a select signal to the predecoders, it is possible to disable the decoder when the memory block is not selected, saving power.
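The transistor tally, as a quick Python back-of-the-envelope check (per-gate counts are static CMOS, as on the slide):

    def cmos_gate_t(fan_in):
        # Static CMOS NAND/NOR: fan_in nMOS + fan_in pMOS devices.
        return 2 * fan_in

    rows = 256
    single_stage = rows * cmos_gate_t(8)     # 256 8-input NOR gates
    predecoder   = 4 * 4 * cmos_gate_t(2)    # 4 groups of four 2-input NORs
    wl_drivers   = rows * cmos_gate_t(4)     # 256 4-input NAND word line drivers
    two_stage    = predecoder + wl_drivers
    print(single_stage, two_stage)           # 4096 2112: ~48% savings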

9 Hierarchical Decoders
Multi-stage implementation improves performance.
[Figure: NAND decoder using 2-input predecoders: predecoded lines formed from address pairs (A0, A1) and (A2, A3) feed the word line gates WL0 through WL3.]

10 Dynamic Decoders
[Figure: dynamic 2-input NOR decoder and 2-input NAND decoder side by side, each with precharge devices clocked by φ and word lines WL0 through WL3 generated from address inputs A0 and A1.]
Which one is faster? Smaller? Lower power? The NOR decoder is faster (only one transistor in the pull-down path to ground) but larger (as in a ROM, each transistor must connect to GND) and consumes more power (three word lines switch per access vs. one in the NAND decoder).

11 Pass Transistor Based Column Decoder
[Figure: 4:1 pass-transistor column decoder: a 2-input NOR decoder of A0 and A1 generates the select lines S0 through S3, each steering one differential bit line pair (BL/!BL) onto data_out/!data_out.]
Read: connect the BLs to the sense amps (SA). Write: drive one of the BLs low to write a 0 into the cell.
Fast, since there is only one transistor in the signal path, but the transistor count is large: (K+1) x 2^K for the decoder plus 2 x 2^K pass transistors. For K = 2: 3 x 4 (decoder) + 2 x 4 (PTs) = 12 + 8 = 20.
Essentially a 2^K-input multiplexer. The NOR decoder can run while the row decoder and core are working, so only one extra transistor appears in the signal path. Make sure the select lines (S) swing full rail (to VDD) so that a full swing appears on the BLs during a write.
For K = 10 (1024-to-1), the same formulas give 11 x 1,024 + 2 x 1,024 = 13,312 transistors.
Note that this count is for 1-bit data lines. For multi-bit data words, the cost of the decoder is amortized, so the per-word transistor cost of the PT design is lower. Note also the large load on the decoder outputs for multi-bit data words (2 x the number of bits in the data word).

12 Tree Based Column Decoder
[Figure: binary tree column decoder: address bits A0/!A0 and A1/!A1 steer the four differential bit line pairs (BL/!BL) through a two-level pass-transistor tree to data_out/!data_out.]
Number of transistors reduced to 2 x 2 x (2^K - 1); for K = 2: 2 x 2 x (2^2 - 1) = 4 x 3 = 12. No predecoder is needed, unlike the previous design, which is the reason for the transistor count reduction. For K = 10 (1024-to-1), the count comes down to 2 x 2 x (2^10 - 1) = 4,092 transistors.
Delay increases quadratically with the number of sections (K), so the approach is prohibitive for large decoders; this can be fixed with buffers, progressive sizing, or a combination of tree and pass-transistor approaches.
But the transistor count savings do not hold for more than one bit of data! (See the comparison script below.)
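A quick comparison of the two column decoder styles for single-bit data, using the count formulas from these two slides (illustrative Python):

    def pt_count(k):
        # Pass-transistor design: (k+1)-device decoder outputs + differential PTs.
        return (k + 1) * 2**k + 2 * 2**k

    def tree_count(k):
        # Binary tree: 2 devices per level per branch point, x2 for BL and !BL.
        return 2 * 2 * (2**k - 1)

    for k in (2, 3, 4, 10):
        print(k, pt_count(k), tree_count(k))   # k=2: 20 vs 12; k=10: 13312 vs 4092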

13 Decoder Complexity Comparisons
Consider a memory with a 10-bit address and 8-bit data.

Config  Data/Row             Row decoder                  Column decoder
1D      8b                   10b = 10x2^10 decoder        none
                             single stage = 20,480 T
                             two stage    = 10,320 T
2D      32b  (32x256 core)   8b = 8x2^8 decoder           2b = 2x2^2 decoder
                             single stage = 4,096 T       PT   =  76 T
                             two stage    = 2,112 T       tree =  96 T
2D      64b  (64x128 core)   7b = 7x2^7 decoder           3b = 3x2^3 decoder
                             single stage = 1,792 T       PT   = 160 T
                             two stage    = 1,072 T       tree = 224 T
2D      128b (128x64 core)   6b = 6x2^6 decoder           4b = 4x2^4 decoder
                             single stage =   768 T       PT   = 336 T
                             two stage    =   432 T       tree = 480 T

Row decoder (1D architecture): single stage = 1,024 x (10+10) = 20,480 T (10-input gates!); two stage = 5 x 4 x (2+2) (predecoder) + 1,024 x (5+5) = 80 + 10,240 = 10,320 T (5-input gates!).
Column decoder counts cover all 8 data bits. For the 2b case: PT = 12 T in the select decoder + 8 x 8 T of pass transistors = 76 T; tree = 12 T x 8 bits = 96 T.

14 Bit Line Precharge Logic
The first step of a read cycle is to precharge (PC) the bit lines to VDD: every differential signal in the memory must be equalized to the same voltage level before the read. Then PC is turned off and the WL is enabled.
- The grounded-gate PMOS loads limit the bit line swing (speeding up the next precharge cycle).
- The !PC equalization transistor speeds up equalization of the two bit lines (BL and !BL) by allowing the capacitance and pull-up device of the non-discharged bit line to assist in precharging the discharged line.

Static pull-up scheme: the advantage is that it does not require a heavily loaded precharge clock signal routed across the array; the disadvantage is that it is always on, fighting the discharge of any bit line moving low (consuming power).
Clocked scheme: allows much larger precharge devices, so bit line equalization happens more rapidly (note the equalization transistor helping even more); the disadvantage is the power consumed by the heavily loaded precharge clock signal.

What purpose do the two PFETs with their gates tied to ground serve?

15 Sense Amplifiers
- Amplification: resolves data with small bit line swings (in some DRAMs, required for proper functionality)
- Delay reduction: compensates for the limited drive capability of the memory cell to accelerate the BL transition;
  t_p = (C x ΔV) / I_av, where C is large and I_av is small, so make ΔV as small as possible
- Power reduction: eliminates a large part of the power dissipation due to charging and discharging the bit lines
- Signal restoration: for DRAMs, the bit lines must be driven full swing after sensing (read) to refresh the data
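A worked instance of that delay relation (Python; the numbers are hypothetical, chosen only to show the scale of the benefit):

    C_bl    = 1.0e-12   # bit line capacitance: 1 pF (assumed)
    dV      = 0.2       # swing a sense amp can resolve: 200 mV (assumed)
    dV_full = 1.2       # full-rail swing for comparison (assumed VDD)
    I_cell  = 50e-6     # average cell discharge current: 50 uA (assumed)

    tp_sensed = C_bl * dV / I_cell        # 4.0 ns with a sense amp
    tp_full   = C_bl * dV_full / I_cell   # 24 ns waiting for full swing
    print(tp_sensed, tp_full)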

16 Classes of Sense Amplifiers
Differential SA: takes a small-signal differential input (BL and !BL) and amplifies it to a large-signal single-ended output.
- Common-mode rejection: rejects noise that is injected equally into both inputs
- Only suitable for SRAMs (which provide BL and !BL)
- Types: current mirroring, two-stage, latch based

Single-ended SA: needed for DRAMs.

A differential SA is characterized by its ability to reject common-mode noise and amplify the true difference between the signals, but it is only applicable to SRAMs.

17 Differential Sense Amplifier
[Figure: differential sense amplifier: load devices M3 and M4 tied to VDD, input pair M1 and M2 driven by bit and !bit, output Out taken at node y, and tail transistor M5 gated by the sense enable SE.]
Directly applicable to SRAMs.

18 Differential Sensing – SRAM

19 Read/Write Circuitry
D: data (write) bus; R: read bus; W: write signal; CS: column select (from the column decoder).
- Local W (write): BL = D, !BL = !D; enabled by W & CS
- Local R (read): R = BL, !R = !BL; enabled by !W & CS
[Figure: bit line pair with sense amp (SA), a Local R/W block gated by CS, the D write bus, and a precharged read bus R/!R.]
This is one of many ways to implement the read/write circuitry. Precharge logic (and possibly another sense amp, not shown) speeds up the voltage change on the read bus. The Local R/W logic isolates the capacitance of the selected bit line pair from the data/read buses, helping to keep the latency low. A small behavioral model follows below.
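A tiny behavioral model of the enable logic (illustrative Python; signal names follow the slide, the function itself is ours):

    def local_rw(W, CS, D, cell):
        # One column's Local R/W block: returns (BL, nBL, R, nR).
        # None models an undriven (precharged) line; cell is the stored bit.
        BL = nBL = R = nR = None
        if CS and W:                 # write: drive the bit lines from the data bus
            BL, nBL = D, 1 - D
        elif CS and not W:           # read: pass the bit lines onto the read bus
            BL, nBL = cell, 1 - cell
            R, nR = BL, nBL
        return BL, nBL, R, nR

    print(local_rw(W=1, CS=1, D=0, cell=1))  # write 0: BL=0, !BL=1
    print(local_rw(W=0, CS=1, D=0, cell=1))  # read:    R=1, !R=0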

20 Approaches to Memory Timing
DRAM timing: multiplexed addressing. The row address (MSBs) is placed on the address bus and latched with RAS; the column address (LSBs) follows and is latched with CAS (RAS-CAS timing).
SRAM timing: self-timed. An address transition on the address bus initiates the memory operation.

21 Reliability and Yield
Memories operate under low signal-to-noise conditions:
- word line to bit line coupling, which can vary substantially over the memory array; a folded bit line architecture (routing BL and !BL next to each other) ensures a closer match between parasitics and bit line capacitances
- interwire bit line to bit line coupling; a transposed (or twisted) bit line architecture turns the noise into a common-mode signal for the SA
- leakage (in DRAMs), requiring the refresh operation

Memories suffer from low yield due to high density and structural defects; yield is increased by using error correction (e.g., parity bits) and redundancy. They are also susceptible to soft errors due to alpha particles and cosmic rays. Only folded bit line architectures have been shown/considered here.

22 Redundancy in the Memory Structure
[Figure: memory array flanked by a redundant row and redundant columns; the row and column addresses pass through a fuse bank that steers accesses to the spares.]
Replace a bad row or column with a spare, configured by setting the fuse bank. This helps correct faults that affect a large section of the memory; it is not effective for scattered point defects or local errors (use error correction (ECC) logic, e.g., parity bits, for those).

23 Row Redundancy
[Figure: the functional address drives the normal wordline decoder, gated by an enable; in parallel, fused repair addresses are compared (== ?) against the incoming address, and a match fires the redundant wordline instead of a normal wordline.]
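A behavioral sketch of that match-and-replace logic (Python; the fused addresses below are made-up examples, not from the slides):

    # bad row address -> spare row index (hypothetical fuse contents)
    FUSED_REPAIR_ADDRS = {0x0A3: 0, 0x1F0: 1}

    def decode_row(addr):
        # A fused match steers the access to a spare wordline and
        # disables the normal decoder; otherwise decode normally.
        if addr in FUSED_REPAIR_ADDRS:
            return ("redundant_wl", FUSED_REPAIR_ADDRS[addr])
        return ("normal_wl", addr)

    print(decode_row(0x0A3))   # ('redundant_wl', 0)
    print(decode_row(0x005))   # ('normal_wl', 5)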

24 Column Redundancy

25 Error-Correcting Codes
Example: Hamming codes. With m data bits and k check bits, single-error correction requires 2^k >= m + k + 1; for 64 data bits, 7 check bits are needed. The check bits are positioned so that the syndrome of a single-bit error equals the position of the failing bit: e.g., if bit B3 flips, the syndrome reads 3.
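A minimal, illustrative Hamming SEC encoder and syndrome computation in Python (just the arithmetic the slide describes, not a circuit implementation):

    from functools import reduce

    def hamming_encode(data_bits):
        # Positions are 1-indexed; positions that are powers of two hold check bits.
        m = len(data_bits)
        k = 0
        while 2**k < m + k + 1:          # the 2^k >= m + k + 1 condition
            k += 1
        n = m + k
        code = [0] * (n + 1)             # index 0 unused
        it = iter(data_bits)
        for pos in range(1, n + 1):
            if pos & (pos - 1):          # not a power of two: data position
                code[pos] = next(it)
        for j in range(k):               # check bit at 2^j covers positions with bit j set
            p = 2**j
            code[p] = sum(code[q] for q in range(1, n + 1) if q & p) % 2
        return code[1:]

    def syndrome(code):
        # XOR the 1-indexed positions of all set bits: 0 means clean,
        # otherwise the result is the position of the single flipped bit.
        return reduce(lambda x, y: x ^ y,
                      (pos for pos, bit in enumerate(code, start=1) if bit), 0)

    word = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])  # 8 data bits need 4 check bits
    word[2] ^= 1                                     # corrupt position 3 (list index 2)
    print(syndrome(word))                            # 3: the failing bit position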

26 Performance and area overhead for ECC
A circuit failure occurs only when a voltage disturbance changes the logic state of the circuit such that it cannot automatically recover. This can happen before the disturbed node completely charges or discharges: once the node voltage reaches the switching point of any associated logic gate, the false transition starts to propagate along those signal paths. Furthermore, since many circuits contain feedback loops, positive feedback can even accelerate the faulty transition.

Given the physical mechanism of a soft error event, the following circuit design measures reduce particle-induced failure rates:
- increase the storage node charge
- add devices to compensate for charge loss
- minimize the charge collection efficiency at the storage nodes

27 Redundancy and Error Correction

28 Soft Errors
Nonrecurrent, nonpermanent errors caused by:
- alpha particles (from the packaging materials)
- neutrons from cosmic rays

As feature size decreases, the charge stored on each node decreases (due to lower node capacitance and lower VDD), and thus Qcritical (the charge necessary to cause a bit flip) decreases, leading to an increase in the soft error rate (SER). [Data from Semico Research Corp. and Actel.]

FIT = Failure In Time; one FIT is a single failure in 1 billion (1e9) hours. Hence a system that experiences 1 failure every 13,158 hours has a failure rate of 1e9 / 13,158 = 76,000 FITs.

Avionics example: civilian aviation at an altitude of 30,000 feet on a route crossing the north pole (both conditions increase the neutron flux). If an avionics board uses four 1M 130 nm SRAM-based FPGAs, it would be subject to one upset every 324 hours, or about 3 million FITs. Assuming one such system on board each commercial aircraft, 4,000 civilian flights per day, and a 3-hour average flight time, nearly 37 aircraft per day would experience a neutron-induced SRAM-based FPGA configuration failure during their flight.

MTBF (hours)                130 nm   90 nm
Ground-based                   895     448
Civilian avionics system       324     162
Military avionics system        18       9
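The FIT/MTBF bookkeeping from this slide, as a small Python check (the traffic figures are the slide's own assumptions):

    def fits_from_mtbf(mtbf_hours):
        # FIT = failures per 1e9 device-hours.
        return 1e9 / mtbf_hours

    print(fits_from_mtbf(13_158))   # ~76,000 FITs (the slide's example)
    print(fits_from_mtbf(324))      # ~3.1 million FITs (avionics FPGA board)

    flights_per_day  = 4_000        # slide assumption
    avg_flight_hours = 3            # slide assumption
    affected = flights_per_day * avg_flight_hours / 324
    print(affected)                 # ~37 aircraft affected per day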

29 CELL Processor! See class website for web links

30 CELL Processor!

31 CELL Processor!

32 Embedded SRAM (4.6 GHz)
Each SRAM cell is 0.99 µm². Each block has 32 sub-arrays; each sub-array has 128 word lines plus 4 redundant lines; each block has 2 redundant bit lines.

33 Multiplier in CELL

34 Next Lecture and Reminders
Power consumption in datapaths and memories
Reading assignment – Rabaey et al., 11.7 and 12.5

