SAFER: Stuck-At-Fault Error Recovery for Memories
Nak Hee Seong†, Dong Hyuk Woo†, Vijayalakshmi Srinivasan‡, Jude A. Rivers‡, Hsien-Hsin S. Lee†
2 Emerging Memory Technologies
Resistive memories
  – Due to DRAM scaling challenge
Phase Change Memory (PCM)
  Scalability, high density
  Limited write endurance (Avg writes)
  Incurring stuck-at faults
3 Cell Write Endurance
Endurance variation
  – No spatial correlation
  – Increases with technology scaling
Issues
  – Unpredictable cell endurance
    Read verification required for each write
  – The weakest cell dictates memory lifetime!
  – # of stuck-at faults gradually grows!
Multi-bit error recovery scheme is needed!
4 Existing Error Correcting Methods
(72,64) Hamming code
  – For transient faults
  – Single Error Correction Double Error Detection (SECDED)
  – 12.5% overhead
Error-Correcting Pointers (ECP) [Schechter, ISCA37]
  – Dynamically replace failed cells with extra cells
  – Storing multiple fail pointers for each data block
  – Recover from 6 fails with 61-bit overhead (11.9%)
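The two overhead figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the 512-bit data block used later in the evaluation, and it assumes the ECP layout described by Schechter et al.: one pointer wide enough to address any cell, one replacement cell per pointer, and a single "full" flag. The function names are illustrative only.

```python
def overhead_percent(meta_bits, data_bits):
    """Metadata overhead relative to the protected data, in percent."""
    return 100.0 * meta_bits / data_bits

def ecp_bits(n_pointers, block_bits=512):
    """Assumed ECP layout: n * (pointer + replacement cell) + 1 'full' flag."""
    pointer_bits = (block_bits - 1).bit_length()   # 9 bits address 512 cells
    return n_pointers * (pointer_bits + 1) + 1

secded = overhead_percent(72 - 64, 64)   # 8 check bits per 64 data bits
ecp6 = ecp_bits(6)

print(secded)                                  # 12.5
print(ecp6, round(overhead_percent(ecp6, 512), 1))   # 61 11.9
```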
5 SAFER: Stuck-At-Fault Error Recovery
6 Concept of SAFER
Exploit two properties of Stuck-At Faults
  – Permanency
  – Readability
Multiple error correction
  – Fault separation
  – Low-cost Single Error Correction (SEC)
7 SAFER: 1. Fault Separation 2. Single Error Correction
8 Fault Separation
Assuming 2 faults in an 8-bit block
  – C(8,2) = 28 possible fault pairs
How to separate these 2 faults (of all 28 pairs)?
(Figure: Pattern #2, Pattern #1, Pattern #0)
9 Decision for Fault Separation
Use bit pointers for fault separation
(Figure labels: Pattern #2, Pattern #1, Pattern #0; Data Block; Bit Pointer with bit 2, bit 1, bit 0)
10 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
(Figure labels: Pattern #0, Pattern #1, Pattern #2; Data Block; Bit Pointer with bit 2, bit 1, bit 0; Difference Vector)
11 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
(Figure labels: Pattern #0, Pattern #1, Pattern #2; Data Block; Bit Pointer with bit 2, bit 1, bit 0)
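The XOR step can be sketched for the 8-bit example. This is a minimal illustration, assuming that a cell's bit pointer is simply its 3-bit index and that Pattern #i splits the block on pointer bit i; `separating_patterns` is a hypothetical helper name, not something named on the slides.

```python
from itertools import combinations

BLOCK_BITS = 8
POINTER_BITS = 3   # a 3-bit pointer addresses 8 cells

def separating_patterns(a, b):
    """Pointer-bit indices on which fault positions a and b differ."""
    diff = a ^ b   # difference vector of the two bit pointers
    return [i for i in range(POINTER_BITS) if (diff >> i) & 1]

pairs = list(combinations(range(BLOCK_BITS), 2))
print(len(pairs))                                             # 28
print(all(separating_patterns(a, b) for a, b in pairs))       # True
print(separating_patterns(2, 6))   # pointers 010 and 110 -> [2]
```

Because two distinct positions always differ in at least one pointer bit, every one of the 28 pairs has at least one pattern that puts the two faults in different groups.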
12 Extension to Multi-Group Partition
Use two bits for 4 group partition
(Figure labels: Data Block; Bit Pointer with bit 3, bit 2, bit 1, bit 0; bit pairs (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0), (bit 2, bit 1), (bit 2, bit 0), (bit 1, bit 0))
13 Dynamic Partition
4 group partition for a 16-bit data block
(Figure labels: Data Block; Bit Pointer; 1st Partition Field; 2nd Partition Field; Fixed Partition Counter; example 0010 XOR 0000 = 0010)
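One way to read the dynamic partition idea is sketched below for a 16-bit block with at most two partition fields (4 groups). It is a simplified model rather than the authors' exact policy: the names `group_of` and `add_fault` are hypothetical, the length of `fields` stands in for the fixed partition counter, and picking the highest differing pointer bit is an arbitrary choice made for this sketch.

```python
def group_of(pos, fields):
    """Group index built from the pointer bits named by the partition fields."""
    group = 0
    for f in fields:
        group = (group << 1) | ((pos >> f) & 1)
    return group

def add_fault(faults, fields, new, max_fields):
    """Track a new fault position; fix one more partition field on a collision.

    Returns False when the fault cannot be separated with max_fields fields.
    """
    if new in faults:
        return True
    for old in faults:
        if group_of(old, fields) == group_of(new, fields):
            if len(fields) == max_fields:
                return False
            # Any set bit of the difference vector separates the pair.
            fields.append((old ^ new).bit_length() - 1)
            break
    faults.append(new)
    return True

faults, fields = [], []
for position in (5, 13, 12, 4):          # 4-bit pointers in a 16-bit block
    assert add_fault(faults, fields, position, max_fields=2)
groups = [group_of(p, fields) for p in faults]
print(fields, groups)                    # [3, 0] [1, 3, 2, 0]
```

A colliding pair agrees on every field chosen so far, so the differing bit is always a new one and faults that were already separated stay separated.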
14 Dynamic Partition
Objective
  – Separate multiple stuck-at faults into different groups
Additional meta data
  – Assuming an n bit block and a k group partition:
    $\log_2 k \cdot \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$
    where $\log_2 k$ is the # of partition fields, $\lceil \log_2 \log_2 n \rceil$ is the size of each partition field, and $\lceil \log_2(\log_2 k + 1) \rceil$ is the size of the fixed partition counter
Example: n = 512, k = 32
  – Required meta data: 23 bits/block
  – 6 ≤ the number of separable stuck-at faults ≤ 32
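The metadata count for the example can be checked numerically. The sketch below assumes n and k are powers of two and rounds each field width up to a whole number of bits, which is what reproduces the 23 bits/block quoted on the slide; the function name is illustrative.

```python
import math

def partition_metadata_bits(n, k):
    """Metadata bits for dynamically partitioning an n-bit block into k groups."""
    num_fields = k.bit_length() - 1                       # log2 k partition fields
    pointer_bits = n.bit_length() - 1                     # log2 n bits address one cell
    field_bits = math.ceil(math.log2(pointer_bits))       # each field names one pointer bit
    counter_bits = math.ceil(math.log2(num_fields + 1))   # counts 0..log2 k fixed fields
    return num_fields * field_bits + counter_bits

print(partition_metadata_bits(512, 32))        # 23
print(partition_metadata_bits(512, 32) + 32)   # 55
```

The second line adds one flip bit per group (32 groups) to the partition metadata, which gives 55 bits per 512-bit block under these assumptions.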
15 SAFER: 1. Fault Separation 2. Single Error Correction
16 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
(Figure: Write, Verify; data 1010)
17 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
(Figure: Write, Verify)
18 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
(Figure: Write, Verify; Need to recover!!)
19 Low-cost Single Error Correction
Data Inversion as an SEC
  – One additional bit per group (Flip Mark)
(Figure: Inversion & Mark; 2nd Write; 2nd Verify; Recovered from Stuck-At Fault!!)
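Data inversion for a single group can be sketched as follows. This is a toy model, not the hardware design: a stuck cell is represented as an entry in a hypothetical `stuck` dictionary (cell index to stuck value), and a write simply returns what the cells would hold afterwards.

```python
def program(bits, stuck):
    """Model a write: a stuck cell keeps its stuck value whatever is written."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def write_group(data, stuck):
    """Write one group; return (stored_bits, flip_bit), or None if inversion cannot help."""
    stored = program(data, stuck)
    if stored == data:                 # 1st verify passes
        return stored, 0
    inverted = [b ^ 1 for b in data]
    stored = program(inverted, stuck)  # 2nd write with inverted data
    if stored == inverted:             # 2nd verify passes
        return stored, 1
    return None

def read_group(stored, flip_bit):
    """Undo the inversion on read using the group's flip bit."""
    return [b ^ flip_bit for b in stored]

data = [0, 1, 0, 1]
stored, flip = write_group(data, stuck={1: 0})    # cell 1 is stuck at 0
print(stored, flip, read_group(stored, flip))     # [1, 0, 1, 0] 1 [0, 1, 0, 1]
print(write_group(data, stuck={0: 0, 1: 0}))      # None
```

The second call shows why fault separation comes first: with two stuck cells in one group, one agreeing with the data and one not, neither the original nor the inverted pattern can be stored.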
20 Design Issues
21 SAFER Sequence for a Write
(Flowchart nodes: Start; Read; Write (1st); Verify; Error? (Y/N); Inversion; Write (2nd); Verify; Error? (Y/N); Fixed Partition Counter < MAX? (Y/N); Re-partition; Success; Failure)
Drawbacks:
  – accelerating wear-out
  – performance degradation
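The write sequence can be sketched as a loop. This is one reading of the flowchart, under stated assumptions: the initial Read step is omitted, a failed second verify leads to re-partition followed by a fresh first write, and the new partition field is taken from the XOR of two colliding fault positions. All names (`safer_write`, `group_of`, `program`, `mismatches`) are hypothetical.

```python
def group_of(pos, fields):
    """Group index built from the pointer bits named by the partition fields."""
    group = 0
    for f in fields:
        group = (group << 1) | ((pos >> f) & 1)
    return group

def program(bits, stuck):
    """Model a write: stuck cells ignore the written value."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def mismatches(stored, wanted):
    """Model the verify read: positions where the cell differs from the target."""
    return [i for i, (s, w) in enumerate(zip(stored, wanted)) if s != w]

def safer_write(data, stuck, fields, max_fields):
    """Return flip groups, partition fields and write count, or None on failure."""
    fields, writes = list(fields), 0
    while True:
        stored = program(data, stuck); writes += 1             # 1st write
        first = mismatches(stored, data)                       # verify
        if not first:
            return {"flips": set(), "fields": fields, "writes": writes}
        flips = {group_of(i, fields) for i in first}           # inversion
        target = [b ^ (group_of(i, fields) in flips) for i, b in enumerate(data)]
        stored = program(target, stuck); writes += 1           # 2nd write
        second = mismatches(stored, target)                    # verify
        if not second:
            return {"flips": flips, "fields": fields, "writes": writes}
        if len(fields) >= max_fields:                          # counter at MAX
            return None
        bad = second[0]                                        # re-partition
        partner = next(i for i in first
                       if group_of(i, fields) == group_of(bad, fields))
        fields.append((bad ^ partner).bit_length() - 1)

data = [0, 1, 0, 1, 0, 1, 0, 1]
result = safer_write(data, stuck={0: 0, 1: 0}, fields=[], max_fields=2)
print(result)   # {'flips': {1}, 'fields': [0], 'writes': 4}
```

In this example the block needs four writes before it verifies, which illustrates the two drawbacks listed on the slide: extra writes wear the cells faster and lengthen the write.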
22 Fail Information Cache
Objective: avoid the 2nd writes
Solution: early inversion decision
Fail Info. Cache with 1K entries
  – Keep track of recent data blocks with stuck-at faults
  – Store fail positions and their stuck-at values
(Figure labels: Bank #0, Bank #1, ..., Bank #15; entry fields Tag, Valid, Stuck Value; Cache Index; Block Address; Fail Pointer; Tag, Index, Bank Addr)
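The early inversion decision can be sketched as a lookup before the first write. The cache here is a plain dictionary from a made-up block address to known fail positions and stuck values; the banked, tagged organization in the figure is not modeled, and the sketch assumes the cached faults are already separated into different groups.

```python
def group_of(pos, fields):
    """Group index built from the pointer bits named by the partition fields."""
    group = 0
    for f in fields:
        group = (group << 1) | ((pos >> f) & 1)
    return group

def program(bits, stuck):
    """Model a write: stuck cells ignore the written value."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def early_flips(data, cached_faults, fields):
    """Groups to invert up front: those whose known stuck cell disagrees with the data."""
    return {group_of(pos, fields)
            for pos, stuck_value in cached_faults.items()
            if data[pos] != stuck_value}

# Hypothetical cache content: block 0x40 has cells 0 and 1 stuck at 0.
fail_cache = {0x40: {0: 0, 1: 0}}
fields = [0]                          # partition on pointer bit 0
data = [0, 1, 0, 1, 0, 1, 0, 1]

flips = early_flips(data, fail_cache.get(0x40, {}), fields)
target = [b ^ (group_of(i, fields) in flips) for i, b in enumerate(data)]
stored = program(target, fail_cache[0x40])   # cached faults double as the cell model
print(flips, stored == target)               # {1} True
```

With the fault information available in advance, the block verifies after a single write instead of needing a second, inverted write.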
23 Evaluation
24 Evaluation
Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance:
  – IdealECC, ECP, SAFER, SAFER_FC
(Figure: Hardware overhead)
25 Relative Lifetime Improvement
Cell write endurance:
  – μ = 100M writes, σ = 10M writes
(Figure annotation: 14.8%)
26 Conclusion
Need to recover from multiple stuck-at faults
SAFER
  – Efficient recovery scheme
  – Handles the growing stuck-at faults
    Dynamic partition
    Data inversion
  – SAFER32_FC
    11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
    14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)
27 Thank You All!! Questions?
28 SRAM Fail Info. Cache Overhead
Cell size in 2024
  – SRAM = 140 F² (F = 10 nm), PCM = 6 F² (F = 8 nm)
  – 36.6X difference
Compared with an 8 Gbit PCM chip
Table columns: Number of Entries, Tag Size (bits), Entry Size (bits), Cache Size (bits), Area Overhead
Number of Entries    Area Overhead
1K                   0.01%
2K                   0.02%
4K                   0.04%
8K                   0.08%
16K                  0.15%
32K                  0.28%
64K                  0.53%
128K                 1.00%
29 Relative Lifetime Improvement
Need a method measuring relative lifetime
  – independent from and T
Definition
Cell Write Endurance Distribution: 100M writes, 10M writes
Bit Toggle Rate (T) = 0.5
Recovery scheme contribution for lifetime: T = (L F) T = FL
Lifetime Contribution
30 Lifetime Contribution per Meta-bit
31 Average Number of Recovered Fails
32 SAFER with Fail Cache
33 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
(Figure: Write, Verify; data 1010)
34 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
(Figure: Write, Verify; Need to recover!!)
35 Low-cost Single Error Correction
Data Inversion as an SEC
  – One additional bit per group
(Figure: Write 0101, Verify; Inversion & Mark; 2nd Write; 2nd Verify; Recovered from Stuck-At Fault!!)
36 Evaluation
Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance:
  – IdealECC, ECP, SAFER, SAFER_FC
(Figure annotation: 11.9%)