SAFER: Stuck-At-Fault Error Recovery for Memories
Nak Hee Seong†, Dong Hyuk Woo†, Vijayalakshmi Srinivasan‡, Jude A. Rivers‡, Hsien-Hsin S. Lee†


2 Emerging Memory Technologies
Resistive memories
–Due to DRAM scaling challenge
Phase Change Memory (PCM)
–Scalability, high density
–Limited write endurance (Avg writes)
–Incurring stuck-at faults

3 Cell Write Endurance
Endurance variation
–No spatial correlation
–Increases with technology scaling
Issues
–Unpredictable cell endurance: read verification required for each write
–The weakest cell dictates memory lifetime!
–# of stuck-at faults gradually grows!
Multi-bit error recovery scheme is needed!

4 Existing Error Correcting Methods
(72,64) Hamming code
–For transient faults
–Single Error Correction Double Error Detection (SECDED)
–12.5% overhead
Error-Correcting Pointers (ECP) [Schechter, ISCA37]
–Dynamically replace failed cells with extra cells
–Storing multiple fail pointers for each data block
–Recover from 6 fails with 61-bit overhead (11.9%)
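The two overhead figures on this slide can be checked with a little arithmetic. The sketch below assumes the usual ECP layout, which the slide does not spell out: each correction entry holds a pointer wide enough to address any bit of the block plus one replacement cell, and one extra "full" bit is shared by the block. The function names are hypothetical.

```python
from math import ceil, log2


def secded_overhead(data_bits: int = 64, code_bits: int = 72) -> float:
    """Fraction of extra bits for a (72,64) Hamming SECDED code."""
    return (code_bits - data_bits) / data_bits


def ecp_overhead_bits(block_bits: int, entries: int) -> int:
    """Assumed ECP layout: per entry a pointer plus a replacement cell, plus one full bit."""
    pointer_bits = ceil(log2(block_bits))
    return entries * (pointer_bits + 1) + 1


bits = ecp_overhead_bits(512, 6)
print(bits, f"{100 * bits / 512:.1f}%")      # 61 11.9%
print(f"{100 * secded_overhead():.1f}%")     # 12.5%
```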

5 SAFER: Stuck-At-Fault Error Recovery

6 Concept of SAFER
Exploit two properties of Stuck-At Faults
–Permanency
–Readability
Multiple error correction
–Fault separation
–Low-cost Single Error Correction (SEC)

7 SAFER: 1. Fault Separation 2. Single Error Correction

8 Fault Separation
Assuming 2 faults in an 8-bit block
–C(8,2) = 28 possible fault pairs
How to separate these 2 faults (of all 28 pairs)?
[Figure: partition patterns #0, #1, #2]

9 Decision for Fault Separation
Use bit pointers for fault separation
[Figure: data block with its bit pointers; Pattern #0, #1, #2 shown against pointer bit 0, bit 1, bit 2]

10 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
[Figure: data block, bit pointers, and the difference vector over bit 2, bit 1, bit 0]

11 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
[Figure: data block and bit pointers over bit 2, bit 1, bit 0]
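The XOR step on these slides can be sketched in a few lines. The assumption here, read off the figure labels rather than stated in the text, is that pattern #i splits the 8-bit block into two groups according to bit i of each cell's 3-bit bit pointer. Any set bit in the XOR of two fault pointers (the difference vector) then names a pattern that puts the two faults in different groups. All names are hypothetical.

```python
from itertools import combinations
from typing import List

POINTER_BITS = 3  # an 8-bit block needs 3-bit bit pointers


def separating_patterns(fault_a: int, fault_b: int) -> List[int]:
    """Patterns (pointer-bit indices) that place the two faults in different groups."""
    difference_vector = fault_a ^ fault_b
    return [p for p in range(POINTER_BITS) if (difference_vector >> p) & 1]


pairs = list(combinations(range(8), 2))
print(len(pairs))                                          # 28
print(all(separating_patterns(a, b) for a, b in pairs))    # True
print(separating_patterns(0b010, 0b110))                   # [2]
```

Two distinct positions always differ in at least one pointer bit, so every one of the 28 pairs has at least one candidate pattern.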

12 Extension to Multi-Group Partition
Use two bits for 4 group partition
[Figure: data block and bit pointers; candidate pointer-bit pairs (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0), (bit 2, bit 1), (bit 2, bit 0), (bit 1, bit 0)]

13 Dynamic Partition
4 group partition for a 16-bit data block
[Figure: 16-bit data block with bit pointers, 1st Partition Field, 2nd Partition Field, and Fixed Partition Counter]
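One way to read the dynamic partition is as a greedy procedure over known fault positions. The sketch below is an assumption about the mechanism, not the authors' hardware: each partition field names one pointer bit, a cell's group ID is the concatenation of those bits, and when two faults collide in a group a new field is taken from their difference vector. The helper names `group_id` and `repartition` are hypothetical.

```python
from typing import List, Optional


def group_id(position: int, fields: List[int]) -> int:
    """Concatenate the pointer bits selected by the partition fields."""
    gid = 0
    for field in fields:
        gid = (gid << 1) | ((position >> field) & 1)
    return gid


def repartition(faults: List[int], max_fields: int) -> Optional[List[int]]:
    """Add partition fields until every fault sits alone in its group, or give up."""
    fields: List[int] = []
    while True:
        owner = {}
        collision = None
        for fault in faults:
            gid = group_id(fault, fields)
            if gid in owner:
                collision = (owner[gid], fault)
                break
            owner[gid] = fault
        if collision is None:
            return fields
        if len(fields) == max_fields:
            return None  # partition fields exhausted
        # Colliding faults agree on every bit already used, so this bit is new.
        diff = collision[0] ^ collision[1]
        fields.append((diff & -diff).bit_length() - 1)


# 16-bit block, 4 groups (two partition fields)
print(repartition([0b0000, 0b0010, 0b0110], max_fields=2))           # [1, 2]
print(repartition([0b0000, 0b0010, 0b0110, 0b1110], max_fields=2))   # None
```

The number of fields in use plays the role of the fixed partition counter in this sketch.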

14 Dynamic Partition
Objective
–Separate multiple stuck-at faults into different groups
Additional meta data
–Assuming an n bit block and a k group partition
–$\lceil \log_2 k \rceil \times \lceil \log_2 \lceil \log_2 n \rceil \rceil + \lceil \log_2 ( \lceil \log_2 k \rceil + 1 ) \rceil$ bits
–# of partition fields: $\lceil \log_2 k \rceil$
–Size of each partition field: $\lceil \log_2 \lceil \log_2 n \rceil \rceil$
–Size of fixed partition counter: $\lceil \log_2 ( \lceil \log_2 k \rceil + 1 ) \rceil$
Example: n = 512, k = 32
–Required meta data: 23 bits/block
–6 ≤ the number of separable stuck-at faults ≤ 32
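The metadata count for the example can be reproduced directly from the three terms named on the slide. This is a small check only; `clog2` and `partition_metadata_bits` are hypothetical helper names.

```python
def clog2(x: int) -> int:
    """Ceiling of log2(x) for x >= 1, computed on integers to avoid float rounding."""
    return (x - 1).bit_length()


def partition_metadata_bits(n: int, k: int) -> int:
    """Partition fields plus fixed partition counter for an n-bit block and k groups."""
    num_fields = clog2(k)               # number of partition fields
    field_size = clog2(clog2(n))        # bits to name one pointer bit
    counter_size = clog2(num_fields + 1)
    return num_fields * field_size + counter_size


print(partition_metadata_bits(512, 32))   # 23
```

For n = 512 and k = 32 this is 5 fields of 4 bits each plus a 3-bit counter. The one-bit-per-group flip marks used by data inversion are counted separately.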

15 SAFER: 1. Fault Separation 2. Single Error Correction

16 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: Write 1010, then Verify]

17 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: Write, then Verify]

18 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: Write, then Verify]
Need to recover!!

19 Low-cost Single Error Correction
Data Inversion as an SEC
–One additional bit per group (flip mark)
[Figure: data 1010 fails verification at the stuck cell "F"; Inversion & Mark stores 0101 with the flip mark set; 2nd Write, 2nd Verify]
Recovered from Stuck-At Fault!!
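Data inversion as a single-error correction can be modelled on one group. The sketch below is a software model under stated assumptions: the `Group` class, its fields, and the assumption that the flip mark itself is fault-free are all illustrative and not taken from the slides.

```python
from typing import Dict, Optional


class Group:
    """One group of cells with optional stuck-at faults and one flip-mark bit."""

    def __init__(self, width: int, stuck: Optional[Dict[int, int]] = None) -> None:
        self.mask = (1 << width) - 1
        self.stuck = dict(stuck or {})   # bit position -> stuck value
        self.cells = 0
        self.flip = 0                    # assumed fault-free in this sketch

    def _program(self, value: int) -> int:
        """Write, then read back: stuck cells keep their own value."""
        for pos, bit in self.stuck.items():
            value = (value & ~(1 << pos)) | (bit << pos)
        self.cells = value & self.mask
        return self.cells

    def write(self, data: int) -> int:
        """Return the number of physical writes needed."""
        self.flip = 0
        if self._program(data) == data:          # 1st write + verify
            return 1
        inverted = ~data & self.mask
        self.flip = 1
        if self._program(inverted) == inverted:  # 2nd write + verify
            return 2
        raise RuntimeError("more than one conflicting fault in this group")

    def read(self) -> int:
        return self.cells ^ self.mask if self.flip else self.cells


g = Group(4, stuck={0: 1})        # bit 0 is stuck at 1
print(g.write(0b1010), bin(g.cells), bin(g.read()))   # 2 0b101 0b1010
```

With one fault per group, either the data or its inverse always agrees with the stuck cell, which is why fault separation has to come first.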

20 Design Issues

21 SAFER Sequence for a Write
[Flowchart: Start → Read → Write (1st) → Verify. No error: Success. Error: Inversion → Write (2nd) → Verify. No error: Success. Error: if Fixed Partition Counter < MAX, Re-partition; otherwise Failure.]
Drawbacks:
- accelerating wear-out
- performance degradation
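The write sequence can be walked through in a small block-level model. This is a sketch under several assumptions that are not on the slide: the initial Read step is omitted, a re-partition restarts the sequence from the first write, the new partition field is taken from the XOR of two colliding fail positions, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def group_id(pos: int, fields: List[int]) -> int:
    gid = 0
    for f in fields:
        gid = (gid << 1) | ((pos >> f) & 1)
    return gid


def write_and_verify(target: List[int], stuck: Dict[int, int]) -> List[int]:
    """Positions whose read-back value differs from the value written."""
    return [i for i, bit in enumerate(target) if stuck.get(i, bit) != bit]


def safer_write(data: List[int], stuck: Dict[int, int], fields: List[int],
                max_fields: int) -> Optional[Tuple[List[int], List[int], int]]:
    """Return (partition fields, inverted groups, writes used), or None on failure."""
    fields = list(fields)
    writes = 0
    while True:
        errors1 = write_and_verify(data, stuck)            # 1st write + verify
        writes += 1
        if not errors1:
            return fields, [], writes
        flipped = sorted({group_id(i, fields) for i in errors1})
        target = [b ^ (group_id(i, fields) in flipped) for i, b in enumerate(data)]
        errors2 = write_and_verify(target, stuck)          # 2nd write + verify
        writes += 1
        if not errors2:
            return fields, flipped, writes
        if len(fields) >= max_fields:                      # counter reached MAX
            return None
        # A cell failing after inversion shares its group with an earlier failure.
        bad = errors2[0]
        partner = next(i for i in errors1
                       if group_id(i, fields) == group_id(bad, fields))
        diff = bad ^ partner
        fields.append((diff & -diff).bit_length() - 1)     # re-partition


# 16-bit block of zeros; bit 2 stuck at 1, bit 6 stuck at 0; up to 4 groups
print(safer_write([0] * 16, {2: 1, 6: 0}, [], 2))          # ([2], [0], 4)
```

The example needs four physical writes for one logical write, which is the wear-out and performance cost listed under Drawbacks.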

22 Fail Information Cache
Objective: avoid the 2nd writes
Solution: early inversion decision
Fail Info. Cache with 1K entries
–Keep track of recent data blocks with stuck-at faults
–Store fail positions and their stuck-at values
[Figure: cache banks #0 to #15, each entry holding Tag, Valid, and Stuck Value; the Block Address and Fail Pointer are split into Tag, Index, and Bank Addr]
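The early inversion decision can be illustrated with a simplified cache model. The sketch assumes a direct-mapped cache indexed by block address with one fault per entry; the banked organization in the figure is not modelled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailEntry:
    valid: bool = False
    tag: int = 0
    fail_pointer: int = 0    # bit position of the stuck cell inside the block
    stuck_value: int = 0     # value the cell is stuck at


class FailInfoCache:
    """Direct-mapped sketch; one known fault per cached block (an assumption)."""

    def __init__(self, num_entries: int = 1024) -> None:
        self.entries = [FailEntry() for _ in range(num_entries)]

    def _index_and_tag(self, block_addr: int):
        return block_addr % len(self.entries), block_addr // len(self.entries)

    def record(self, block_addr: int, fail_pointer: int, stuck_value: int) -> None:
        index, tag = self._index_and_tag(block_addr)
        self.entries[index] = FailEntry(True, tag, fail_pointer, stuck_value)

    def lookup(self, block_addr: int) -> Optional[FailEntry]:
        index, tag = self._index_and_tag(block_addr)
        entry = self.entries[index]
        return entry if entry.valid and entry.tag == tag else None


def invert_before_write(cache: FailInfoCache, block_addr: int,
                        data_bits: List[int]) -> Optional[bool]:
    """True or False on a hit (invert the faulty group or not); None on a miss."""
    entry = cache.lookup(block_addr)
    if entry is None:
        return None
    return data_bits[entry.fail_pointer] != entry.stuck_value


cache = FailInfoCache()
cache.record(block_addr=5, fail_pointer=3, stuck_value=1)
print(invert_before_write(cache, 5, [0, 0, 0, 0]))   # True
print(invert_before_write(cache, 5, [0, 0, 0, 1]))   # False
```

On a hit the inversion is decided before the first write, so the second write is only needed on a miss or for a newly appearing fault.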

23 Evaluation

24 Evaluation
Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance distribution
–IdealECC, ECP, SAFER, SAFER_FC
Hardware overhead
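A Monte Carlo lifetime estimate of this kind can be sketched as follows. This is a deliberately simplified model, not the authors' simulator: it assumes normally distributed cell endurance with a mean of 100M writes and a standard deviation of 10M writes (figures taken from the lifetime slides), assumes every cell is written on every block write, and treats a scheme as a fixed number of tolerated faults. The function names are hypothetical.

```python
import random
from typing import List


def block_lifetime(endurances: List[float], tolerated_faults: int) -> float:
    """Writes survived: the block dies when fault number tolerated_faults + 1 appears."""
    return sorted(endurances)[tolerated_faults]


def mean_lifetime(tolerated_faults: int, blocks: int = 200, cells: int = 512,
                  mu: float = 100e6, sigma: float = 10e6, seed: int = 0) -> float:
    """Average block lifetime over randomly sampled cell endurances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(blocks):
        sample = [max(0.0, rng.gauss(mu, sigma)) for _ in range(cells)]
        total += block_lifetime(sample, tolerated_faults)
    return total / blocks


baseline = mean_lifetime(0)    # no recovery: the weakest cell decides
six_fault = mean_lifetime(6)   # a scheme that always tolerates 6 faults
print(six_fault / baseline)    # relative lifetime; greater than 1
```

A scheme such as SAFER recovers a variable number of faults depending on where they fall, so a faithful model would replace the fixed `tolerated_faults` with a per-block recovery check.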

25 Relative Lifetime Improvement
Cell write endurance
–μ = 100M writes, σ = 10M writes
[Figure: relative lifetime improvement, annotated 14.8%]

26 Conclusion
Need to recover from multiple stuck-at faults
SAFER
–Efficient recovery scheme
–Handles the growing stuck-at faults: dynamic partition, data inversion
–SAFER32_FC: 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
–SAFER32_FC: 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)

27 Thank You All!! Questions?

28 SRAM Fail Info. Cache Overhead
Cell size in 2024
–SRAM = 140 F² at 10 nm, PCM = 6 F² at 8 nm
–36.6X difference
Area overhead compared with an 8 Gbit PCM chip

Number of Entries   Area Overhead
1K                  0.01%
2K                  0.02%
4K                  0.04%
8K                  0.08%
16K                 0.15%
32K                 0.28%
64K                 0.53%
128K                1.00%

29 Relative Lifetime Improvement
Need a method measuring relative lifetime
–independent from  and T
Definition
–Cell Write Endurance Distribution: μ = 100M writes, σ = 10M writes
–Bit Toggle Rate (T) = 0.5
Recovery scheme contribution for lifetime
 T  = (L  F)  T  = FL Lifetime Contribution

30 Lifetime Contribution per Meta-bit

31 Average Number of Recovered Fails

32 SAFER with Fail Cache

33 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: Write 1010, then Verify]

34 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: Write, then Verify]
Need to recover!!

35 Low-cost Single Error Correction
Data Inversion as an SEC
–One additional bit per group
[Figure: Write 0101, Verify; Inversion & Mark at the stuck cell "F"; 2nd Write, 2nd Verify]
Recovered from Stuck-At Fault!!

36 Evaluation
Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance distribution
–IdealECC, ECP, SAFER, SAFER_FC
[Figure annotation: 11.9%]