SAFER: Stuck-At-Fault Error Recovery for Memories Nak Hee Seong † Dong Hyuk Woo † Vijayalakshmi Srinivasan ‡ Jude A. Rivers ‡ Hsien-Hsin S. Lee † ‡†

1 SAFER: Stuck-At-Fault Error Recovery for Memories Nak Hee Seong † Dong Hyuk Woo † Vijayalakshmi Srinivasan ‡ Jude A. Rivers ‡ Hsien-Hsin S. Lee † ‡†

2 Emerging Memory Technologies
Resistive memories
–Due to the DRAM scaling challenge
Phase Change Memory (PCM)
Scalability, high density
Limited write endurance (avg. 10^8 writes)
Incurring stuck-at faults

3 Cell Write Endurance
Endurance variation
–No spatial correlation
–Increases with technology scaling
Issues
–Unpredictable cell endurance
Read verification required for each write
–The weakest cell dictates memory lifetime!
–# of stuck-at faults gradually grows!
Multi-bit error recovery scheme is needed!

4 Existing Error Correcting Methods
(72,64) Hamming code
–For transient faults
–Single Error Correction Double Error Detection (SECDED)
–12.5% overhead
Error-Correcting Pointers (ECP) [Schechter, ISCA37]
–Dynamically replace failed cells with extra cells
–Storing multiple fail pointers for each data block
–Recover from 6 fails with 61-bit overhead (11.9%)
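The two overhead figures on this slide can be checked with simple arithmetic. The sketch below assumes a 512-bit data block (the block size used in the evaluation slides) and assumes the ECP6 layout of six entries, each with a 9-bit pointer and one replacement cell, plus one "full" bit; that layout is not spelled out in this deck.

```python
def overhead(check_bits: int, data_bits: int) -> float:
    """Storage overhead expressed as a fraction of the data bits."""
    return check_bits / data_bits

# (72,64) Hamming SECDED: 8 check bits protect 64 data bits.
hamming = overhead(72 - 64, 64)

# ECP6 on a 512-bit block (assumed layout, see the lead-in):
# 6 entries x (9-bit pointer + 1 replacement cell) + 1 "full" bit.
ecp6_bits = 6 * (9 + 1) + 1
ecp6 = overhead(ecp6_bits, 512)

print(f"Hamming (72,64): {hamming:.1%}")
print(f"ECP6: {ecp6_bits} bits, {ecp6:.1%}")
```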

5 SAFER: Stuck-At-Fault Error Recovery

6 Concept of SAFER
Exploit two properties of stuck-at faults
–Permanency
–Readability
Multiple error correction
–Fault separation
–Low-cost Single Error Correction (SEC)

7 SAFER: 1. Fault Separation 2. Single Error Correction

8 Fault Separation
Assuming 2 faults in an 8-bit block
–C(8,2) = 28 possible fault pairs
How to separate these 2 faults (of all 28 pairs)?
[Figure: an 8-bit block (bits 7 to 0) split into two groups in three ways: Pattern #2, Pattern #1, and Pattern #0]

9 Decision for Fault Separation
Use bit pointers for fault separation
Data block bit:      7 6 5 4 3 2 1 0
Bit pointer, bit 2:  1 1 1 1 0 0 0 0  (Pattern #2)
Bit pointer, bit 1:  1 1 0 0 1 1 0 0  (Pattern #1)
Bit pointer, bit 0:  1 0 1 0 1 0 1 0  (Pattern #0)

10 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
Bit pointers over data block bits 7 to 0: bit 2 = 11110000 (Pattern #2), bit 1 = 11001100 (Pattern #1), bit 0 = 10101010 (Pattern #0)
Difference vector (bit 2, bit 1, bit 0): 1 0 0

11 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
Bit pointers over data block bits 7 to 0: bit 2 = 11110000 (Pattern #2), bit 1 = 11001100 (Pattern #1), bit 0 = 10101010 (Pattern #0)
Difference vector (bit 2, bit 1, bit 0): 0 1 1
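The XOR step on the last two slides can be sketched in a few lines. This is only an illustration: the function name and the example fault positions are assumptions chosen to reproduce the two difference vectors shown (1 0 0 and 0 1 1), not values taken from the slides.

```python
from itertools import combinations

def separating_patterns(fault_a: int, fault_b: int, pointer_bits: int = 3) -> list[int]:
    """Return the pattern numbers (bit-pointer bit indices) that place the
    two fault positions into different groups."""
    difference = fault_a ^ fault_b  # the difference vector
    return [bit for bit in range(pointer_bits) if (difference >> bit) & 1]

# Hypothetical fault positions reproducing the two difference vectors.
print(separating_patterns(6, 2))  # 110 ^ 010 = 100 -> [2]
print(separating_patterns(5, 6))  # 101 ^ 110 = 011 -> [0, 1]

# Every one of the C(8,2) = 28 fault pairs has at least one candidate pattern,
# because two distinct positions always differ in at least one pointer bit.
pairs = list(combinations(range(8), 2))
assert len(pairs) == 28
assert all(separating_patterns(a, b) for a, b in pairs)
```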

12 Extension to Multi-Group Partition
Use two bits for 4 group partition
Data block bits: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Bit pointer, bit 3: 1111111100000000
Bit pointer, bit 2: 1111000011110000
Bit pointer, bit 1: 1100110011001100
Bit pointer, bit 0: 1010101010101010
Candidate bit pairs: (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0), (bit 2, bit 1), (bit 2, bit 0), (bit 1, bit 0)

13 Dynamic Partition
4 group partition for a 16-bit data block
[Figure: a 16-bit data block (bits 15 to 0) with its bit pointers, a 1st partition field, a 2nd partition field, and a fixed partition counter, shown over several animation steps]
1000 ⊕ 0010 = 1010
0010 ⊕ 0000 = 0010

14 Dynamic Partition
Objective
–Separate multiple stuck-at faults into different groups
Additional meta data
–Assuming an n bit block and a k group partition
–$\lceil \log_2 k \rceil \times \lceil \log_2 \lceil \log_2 n \rceil \rceil + \lceil \log_2 (\lceil \log_2 k \rceil + 1) \rceil$
–(# of partition fields) × (size of each partition field) + (size of fixed partition counter)
Example: n = 512, k = 32
–Required meta data: 5 × 4 + 3 = 23 bits/block
–6 ≤ the number of separable stuck-at faults ≤ 32
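The meta data formula can be evaluated directly. The sketch below uses integer arithmetic for the ceilings to avoid floating-point rounding; the function names are illustrative, and the count covers only the partition fields and the fixed partition counter named on the slide.

```python
def ceil_log2(x: int) -> int:
    """Ceiling of log2(x) for x >= 1, computed without floating point."""
    return (x - 1).bit_length()

def partition_metadata_bits(n: int, k: int) -> int:
    """Partition meta data for an n-bit block and a k-group partition."""
    num_fields = ceil_log2(k)                # number of partition fields
    field_size = ceil_log2(ceil_log2(n))     # bits needed to name one pointer bit
    counter_size = ceil_log2(num_fields + 1) # fixed partition counter
    return num_fields * field_size + counter_size

print(partition_metadata_bits(512, 32))  # 5 * 4 + 3 = 23
```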

15 SAFER: 1. Fault Separation 2. Single Error Correction

16 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 1010; verify reads back 1010]

17 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 0101 to a group containing a cell stuck at 0; verify reads back 0101]

18 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 1010; verify reads back 1000 because one cell is stuck at 0]
Need to recover!!

19 Low-cost Single Error Correction
Data Inversion as an SEC
[Figure: the data 1010 is inverted to 0101 and the flip mark "F" is set; the 2nd write and 2nd verify succeed; on a read, the set flip mark inverts 0101 back to 1010]
Recovered from Stuck-At Fault!!
Flip mark: one additional bit per group
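The data inversion idea for a single group can be simulated as follows. This is a minimal sketch: the cell model (a dictionary from cell index to stuck value) and the function names are assumptions made for illustration.

```python
def write_cells(bits: list[int], stuck: dict[int, int]) -> list[int]:
    """Simulate a write followed by a read: stuck cells keep their stuck value."""
    return [stuck.get(i, b) for i, b in enumerate(bits)]

def write_with_inversion(data: list[int], stuck: dict[int, int]):
    """Return (stored_bits, flip_mark), or None when inversion cannot recover."""
    if write_cells(data, stuck) == data:          # 1st write, verify
        return data, 0
    inverted = [b ^ 1 for b in data]
    if write_cells(inverted, stuck) == inverted:  # 2nd write, 2nd verify
        return inverted, 1
    return None

def read_group(stored: list[int], flip_mark: int) -> list[int]:
    """Undo the inversion on a read when the flip mark is set."""
    return [b ^ flip_mark for b in stored]

stuck = {2: 0}                                    # third cell from the left stuck at 0
stored, flip = write_with_inversion([1, 0, 1, 0], stuck)
print(stored, flip, read_group(stored, flip))     # [0, 1, 0, 1] 1 [1, 0, 1, 0]
```

With a single stuck cell per group one of the two polarities always matches the stuck value, which is why fault separation has to come first.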

20 Design Issues

21 SAFER Sequence for a Write
[Flowchart: Start → Read → Write (1st) → Verify; on an error, Inversion → Write (2nd) → Verify; on an error again, Re-partition while Fixed Partition Counter < MAX, otherwise Failure; a verify without error ends in Success]
Drawbacks:
- accelerating wear-out
- performance degradation
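The write sequence can be sketched for a whole block with a fixed set of partition fields. The sketch stops where the flowchart would re-partition and simply returns None at that point; the function names and the cell model are assumptions, and the counted writes illustrate the drawback of the 2nd write.

```python
def group_id(position: int, fields: list[int]) -> int:
    """Group number formed from the selected pointer bits of a position."""
    gid = 0
    for field in fields:
        gid = (gid << 1) | ((position >> field) & 1)
    return gid

def safer_write(data: list[int], stuck: dict[int, int], fields: list[int]):
    """Return (cells, flipped_groups, writes), or None if re-partition is needed."""
    def store(bits):
        return [stuck.get(i, b) for i, b in enumerate(bits)]

    cells = store(data)                                      # 1st write
    bad = {group_id(i, fields)                               # verify
           for i in range(len(data)) if cells[i] != data[i]}
    if not bad:
        return cells, set(), 1
    target = [b ^ 1 if group_id(i, fields) in bad else b     # invert failing groups
              for i, b in enumerate(data)]
    cells = store(target)                                    # 2nd write
    if cells != target:                                      # 2nd verify failed
        return None
    return cells, bad, 2

def safer_read(cells: list[int], flipped: set[int], fields: list[int]) -> list[int]:
    """Invert the groups whose flip mark is set."""
    return [c ^ 1 if group_id(i, fields) in flipped else c
            for i, c in enumerate(cells)]

data = [0] * 16
data[8] = 1
cells, flipped, writes = safer_write(data, {8: 0, 2: 1}, fields=[3])
print(writes, sorted(flipped), safer_read(cells, flipped, [3]) == data)
```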

22 Fail Information Cache
Objective: avoid the 2nd writes
Solution: early inversion decision
Fail Info. Cache with 1K entries
–Keep track of recent data blocks with stuck-at faults
–Store fail positions and their stuck-at values
[Figure: a 16-bank cache (Bank #0 to Bank #15); each entry holds Valid, Tag, and Stuck Value; the block address and fail pointer together supply the tag, cache index, and bank address]
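The early inversion decision can be illustrated with a small software model. The class below is a hypothetical LRU map keyed by block address, not the banked, tagged SRAM structure in the figure; it only shows how known fail positions and stuck values let the writer choose the inverted groups before the first write.

```python
from collections import OrderedDict

def group_id(position: int, fields: list[int]) -> int:
    """Group number formed from the selected pointer bits of a position."""
    gid = 0
    for field in fields:
        gid = (gid << 1) | ((position >> field) & 1)
    return gid

class FailInfoCache:
    """Hypothetical LRU map: block address -> {fail position: stuck value}."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def record(self, block_addr: int, position: int, stuck_value: int) -> None:
        self.entries.setdefault(block_addr, {})[position] = stuck_value
        self.entries.move_to_end(block_addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used block

    def lookup(self, block_addr: int) -> dict:
        faults = self.entries.get(block_addr)
        if faults is None:
            return {}
        self.entries.move_to_end(block_addr)
        return faults

def groups_to_invert(data: list[int], known_faults: dict, fields: list[int]) -> set:
    """Groups whose stuck cell disagrees with the data about to be written."""
    return {group_id(pos, fields)
            for pos, value in known_faults.items() if data[pos] != value}

cache = FailInfoCache()
cache.record(block_addr=0x40, position=8, stuck_value=0)
data = [0] * 16
data[8] = 1
print(groups_to_invert(data, cache.lookup(0x40), fields=[3]))  # {1}
```

On a cache hit the block can be written once in its final polarity; on a miss the writer would fall back to the verify-then-invert sequence.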

23 Evaluation

24 Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance:
–IdealECC, ECP, SAFER, SAFER_FC
Hardware overhead

25 Relative Lifetime Improvement
[Chart: relative lifetime improvement of the evaluated schemes; annotated value: 14.8%]
Cell write endurance:
–μ = 100M writes, σ = 10M writes
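A toy Monte Carlo run shows why tolerating a few stuck-at faults extends block lifetime. This is not the simulator behind the results in this deck: it assumes normally distributed cell endurance with a mean of 100M and a standard deviation of 10M writes, assumes every block write wears every cell equally, and models a recovery scheme simply as "the block survives its first f cell failures".

```python
import random

def block_lifetime(n_cells: int, tolerated_faults: int, rng: random.Random,
                   mu: float = 100e6, sigma: float = 10e6) -> float:
    """Writes survived by one block: the block dies when cell number
    (tolerated_faults + 1), in order of failure, wears out."""
    endurance = sorted(rng.gauss(mu, sigma) for _ in range(n_cells))
    return endurance[tolerated_faults]

def mean_lifetime(tolerated_faults: int, trials: int = 200, seed: int = 0) -> float:
    """Average block lifetime over several sampled 512-bit blocks."""
    rng = random.Random(seed)  # same seed, so the same blocks are compared
    total = sum(block_lifetime(512, tolerated_faults, rng) for _ in range(trials))
    return total / trials

baseline = mean_lifetime(0)   # no recovery: the weakest cell dictates lifetime
with_six = mean_lifetime(6)   # a scheme that tolerates 6 stuck-at faults
print(f"relative improvement: {with_six / baseline - 1:.1%}")
```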

26 Conclusion
Need to recover from multiple stuck-at faults
SAFER
–Efficient recovery scheme
–Handles the growing stuck-at faults
Dynamic partition
Data inversion
–SAFER32_FC
11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)

27 Thank You All!! Questions?

28 SRAM Fail Info. Cache Overhead
Cell size in 2024
–SRAM = 140 F^2 @ 10nm, PCM = 6 F^2 @ 8nm
–36.6X difference
Compared with an 8 Gbit PCM chip
Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
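The table rows can be approximately reproduced from the numbers on the slide. The sketch assumes that the area overhead is the cache size in bits, scaled by the 36.6X cell-size ratio, divided by the capacity of the PCM chip, and that 8 Gbit means 8 × 2^30 bits; both are assumptions, so small rounding differences from the table are possible.

```python
SRAM_TO_PCM_CELL_RATIO = 36.6    # "36.6X difference" from the slide
PCM_CHIP_BITS = 8 * 2**30        # 8 Gbit, assuming binary gigabits

def cache_bits(entries: int, entry_bits: int) -> int:
    """Total SRAM bits in the fail information cache."""
    return entries * entry_bits

def area_overhead(entries: int, entry_bits: int) -> float:
    """Cache area relative to the PCM array, as a fraction (assumed model)."""
    return cache_bits(entries, entry_bits) * SRAM_TO_PCM_CELL_RATIO / PCM_CHIP_BITS

# (number of entries, entry size in bits) taken from the table rows
rows = [(1, 25), (2, 24), (4, 23), (8, 22), (16, 21), (32, 20), (64, 19), (128, 18)]
for k_entries, entry_bits in rows:
    entries = k_entries * 1024
    print(f"{k_entries:>4}K  {cache_bits(entries, entry_bits):>9} bits  "
          f"{area_overhead(entries, entry_bits):.2%}")
```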

29 Relative Lifetime Improvement
Need a method measuring relative lifetime
–independent from μ and T
Definition
Cell Write Endurance Distribution: μ = 100M writes, σ = 10M writes
Bit Toggle Rate (T) = 0.5
Recovery scheme contribution for lifetime  T  = (L  F)  T  = FL
Lifetime Contribution

30 Lifetime Contribution per Meta-bit

31 Average Number of Recovered Fails

32 SAFER with Fail Cache

33 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 0101, verify reads back 0101; write 1010, verify reads back 1010]

34 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: with one cell stuck at 0, write 0101 verifies as 0101, but write 1010 reads back 1000]
Need to recover!!

35 Low-cost Single Error Correction
Data Inversion as an SEC
–one additional bit per group
[Figure: write 0101 verifies as 0101; the data 1010 is inverted to 0101 and the flip mark "F" is set, so the 2nd write and 2nd verify succeed; on a read, the set flip mark inverts 0101 back to 1010]
Recovered from Stuck-At Fault!!

36 Evaluation
Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance:
–IdealECC, ECP, SAFER, SAFER_FC
[Chart annotation: 11.9%]

