1 Lecture 6: Chipkill, PCM Topics: error correction, PCM basics, PCM writes and errors

2 Chipkill
- Chipkill correct systems can withstand failure of an entire DRAM chip
- For chipkill correctness:
  - the 72-bit word must be spread across 72 DRAM chips
  - or, a 13-bit word (8-bit data and 5-bit ECC) must be spread across 13 DRAM chips
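To make the 13-chip organization concrete, here is a minimal sketch assuming a textbook Hamming SECDED(13,8) layout (8 data bits, 4 Hamming check bits, 1 overall parity bit) with one codeword bit stored per chip; the specific bit positions, chip indexing, and helper names are illustrative assumptions, not details from the lecture.

```python
# Hypothetical sketch: a 13-bit SECDED word (8 data + 5 check bits) striped one
# bit per chip, so a whole-chip failure corrupts only one bit of each word.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]    # data bit positions (1-indexed, non-powers of two)
CHECK_POS = [1, 2, 4, 8]                  # Hamming check bit positions

def encode(data_bits):
    """8 data bits -> 13-bit codeword; index 0 holds the overall parity bit."""
    code = [0] * 13
    for pos, bit in zip(DATA_POS, data_bits):
        code[pos] = bit
    for p in CHECK_POS:                   # each check bit makes its group even-parity
        code[p] = 0
        for i in range(1, 13):
            if i != p and (i & p):
                code[p] ^= code[i]
    for i in range(1, 13):                # overall parity enables double-error detection
        code[0] ^= code[i]
    return code

def decode(code):
    """Return (data bits, status); corrects any single-bit error."""
    code = code[:]
    syndrome = 0
    for i in range(1, 13):
        if code[i]:
            syndrome ^= i
    overall = 0
    for b in code:
        overall ^= b
    if syndrome and overall:              # single-bit error: syndrome is its position
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "double error detected"
    elif overall:
        status = "overall parity bit error"
    else:
        status = "clean"
    return [code[p] for p in DATA_POS], status

# One codeword bit per chip: chip i stores bit i of every codeword.
word = encode([1, 0, 1, 1, 0, 0, 1, 0])
failed_chip = 6                           # hypothetical failed chip
word[failed_chip] ^= 1                    # a chip failure flips at most one bit per word
print(decode(word))                       # -> ([1, 0, 1, 1, 0, 0, 1, 0], 'corrected')
```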

3 RAID-like DRAM Designs
- DRAM chips do not have built-in error detection
- Can employ a 9-chip rank with ECC to detect and recover from a single error; in case of a multi-bit error, rely on a second tier of error correction
- Can do parity across DIMMs (needs an extra DIMM); use ECC within a DIMM to recover from 1-bit errors; use parity across DIMMs to recover from multi-bit errors in 1 DIMM
- Reads are cheap (must only access 1 DIMM); writes are expensive (must read and write 2 DIMMs)
- Used in some HP servers
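A toy model of the parity-across-DIMMs scheme, assuming four data DIMMs, one parity DIMM, and 64-byte lines (all illustrative parameters): reads touch a single DIMM, writes are a read-modify-write of two DIMMs, and any one failed DIMM can be rebuilt by XORing the survivors with the parity.

```python
# Illustrative parity-across-DIMMs model (not from the lecture): 4 data DIMMs + 1 parity DIMM.
LINE_BYTES = 64

def xor_lines(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ParityProtectedMemory:
    def __init__(self, data_dimms=4, lines=1024):
        self.data = [[bytes(LINE_BYTES)] * lines for _ in range(data_dimms)]
        self.parity = [bytes(LINE_BYTES)] * lines   # XOR of the data DIMMs (all zero initially)

    def read(self, dimm, line):
        return self.data[dimm][line]                # cheap: only one DIMM is accessed

    def write(self, dimm, line, new_data):
        # expensive: read old data + old parity, write new data + new parity (2 DIMMs)
        old = self.data[dimm][line]
        self.parity[line] = xor_lines(xor_lines(self.parity[line], old), new_data)
        self.data[dimm][line] = new_data

    def reconstruct(self, failed_dimm, line):
        # Multi-bit failure in one DIMM: rebuild its line from the other DIMMs + parity.
        result = self.parity[line]
        for d, dimm_data in enumerate(self.data):
            if d != failed_dimm:
                result = xor_lines(result, dimm_data[line])
        return result

mem = ParityProtectedMemory()
mem.write(dimm=2, line=7, new_data=bytes(range(64)))
assert mem.reconstruct(failed_dimm=2, line=7) == bytes(range(64))
```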

4 RAID-like DRAM, Udipi et al., ISCA'10
- Add a checksum to every row in DRAM; verified at the memory controller
- Adds area overhead, but provides self-contained error detection
- When a chip fails, data can be reconstructed by reading the remaining chips and an additional parity DRAM chip
- Overheads can be controlled by using one checksum for a large row or one parity chip for many data chips
- Writes are again problematic
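A small sketch of the self-contained detection idea, with CRC-32 standing in for whatever checksum the design actually uses (an assumption): detection needs only the row that was read, and correction falls back to a parity reconstruction such as the XOR rebuild sketched above.

```python
import zlib

# Hypothetical model: each row stores its data plus a checksum that the memory
# controller verifies on every read (CRC-32 here is just a stand-in).
def write_row(rows, row_id, data):
    rows[row_id] = (data, zlib.crc32(data))

def read_row(rows, row_id, reconstruct):
    data, checksum = rows[row_id]
    if zlib.crc32(data) != checksum:        # detection is self-contained in the row
        data = reconstruct(row_id)          # correction uses the parity chip (see above)
    return data

rows = {}
write_row(rows, 0, b"A" * 64)
print(read_row(rows, 0, reconstruct=lambda r: b"A" * 64))
```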

5 SSC-DSD
- The cache line is organized into multi-bit symbols
- Two check symbols are required for error detection and 3-4 check symbols are used for error correction (can handle complete failure in one symbol, i.e., each symbol is fetched from a different DRAM chip)
- 3-symbol codes are not popular because they lead to non-standard DIMMs
- 4-symbol codes are more popular, but are used as 32+4 so that standard ECC DIMMs can be used (at the cost of high activation energy and low rank-level parallelism); a 16+4 organization would require a non-standard DIMM

6 Virtualized ECC, Yoon and Erez, ASPLOS'10
- Also builds a two-tier error protection scheme, but does the second tier in software
- The second-tier codes are stored in the regular physical address space (not in specialized DRAM chips); software has flexibility in terms of the types of codes to use and the types of pages that are protected
- Reads are cheap; writes are expensive as usual, but the second-tier codes can now be cached, which greatly helps reduce the number of DRAM writes
- Requires a 144-bit datapath (increases overfetch)
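A rough sketch of why caching the second-tier codes reduces DRAM writes; the T2EC region base address, the per-line code size, and the CRC stand-in for the real code are all assumptions for illustration, and write-backs of evicted dirty code lines are ignored in this toy model.

```python
import zlib

T2_BASE = 0x8000_0000          # assumed start of the second-tier code region (ordinary memory)
T2_BYTES = 4                   # assumed second-tier code size per 64B line
LINE_BYTES = 64

def t2_code(data):
    return zlib.crc32(data)    # stand-in for the real second-tier code

def t2_addr(line_addr):
    return T2_BASE + (line_addr // LINE_BYTES) * T2_BYTES

class SecondTierWritePath:
    """Counts DRAM accesses needed for second-tier code updates when they are cacheable."""
    def __init__(self):
        self.cache = {}            # models the processor cache holding T2EC lines
        self.t2_dram_writes = 0

    def write_line(self, line_addr, data):
        addr = t2_addr(line_addr)
        if addr not in self.cache:
            self.t2_dram_writes += 1         # cold miss: T2EC line touched in DRAM once
        self.cache[addr] = t2_code(data)     # later writes to this line hit in the cache

path = SecondTierWritePath()
for i in range(1000):                        # 1000 writes to the same hot line
    path.write_line(0x1000, bytes([i % 256]) * LINE_BYTES)
print(path.t2_dram_writes)                   # -> 1 instead of 1000
```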

7 LoT-ECC, Udipi et al., ISCA 2012
- Uses checksums to detect errors and parity codes to fix them
- Requires access to only 9 DRAM chips per read, but the storage overhead grows to 26%

8 Phase Change Memory
- Emerging NVM technology that can replace Flash and DRAM; there are other competing technologies too
- Much higher density; much better scalability; can do multi-level cells
- When the material (GST) is heated with electrical pulses and then cooled, it settles into either a crystalline or an amorphous state depending on the intensity and duration of the pulses; the crystalline material has low resistance (the "1" state) and the amorphous material has high resistance (the "0" state)
- Non-volatile; fast reads (~50 ns); slow and energy-hungry writes; limited lifetime (~10^8 writes per cell); no leakage

9 PCM as a Main Memory, Lee et al., ISCA 2009

10 PCM as a Main Memory, Lee et al., ISCA 2009
- Two main innovations to overcome PCM's latency and energy drawbacks:
  - decoupled row buffers and non-destructive PCM reads
  - multiple narrow row buffers (a row buffer cache)
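A toy model of the row buffer cache idea: a few narrow row buffers managed with LRU replacement; the buffer count and width below are arbitrary choices for illustration, not the values from Lee et al.

```python
from collections import OrderedDict

class NarrowRowBuffers:
    """Toy row-buffer cache: a few narrow buffers with LRU replacement."""
    def __init__(self, num_buffers=4, buffer_bytes=512):
        self.buffers = OrderedDict()          # block number -> buffered data (omitted here)
        self.num_buffers = num_buffers
        self.buffer_bytes = buffer_bytes
        self.array_reads = 0                  # narrow reads from the PCM array

    def access(self, addr):
        block = addr // self.buffer_bytes
        if block in self.buffers:
            self.buffers.move_to_end(block)   # hit in one of the row buffers
            return "hit"
        self.array_reads += 1                 # miss: fetch only a narrow block
        self.buffers[block] = None
        if len(self.buffers) > self.num_buffers:
            self.buffers.popitem(last=False)  # evict LRU; non-destructive reads mean a
                                              # clean buffer needs no write-back to the array
        return "miss"

rb = NarrowRowBuffers()
print([rb.access(a) for a in (0, 64, 4096, 0)])   # -> ['miss', 'hit', 'miss', 'hit']
```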

11 Optimizations for Writes (Energy, Lifetime)
- Read a line before writing and only write the modified bits (Zhou et al., ISCA'09)
- Write either the line or its inverted version, whichever causes fewer bit-flips (Cho and Lee, MICRO'09)
- Only write dirty lines in a PCM page when a page is evicted from a DRAM cache (Lee et al., Qureshi et al., ISCA'09)
- When a page is brought in from disk, place it only in the DRAM cache and write it to PCM upon eviction (Qureshi et al., ISCA'09)
- Wear-leveling: rotate every new page, shift a row periodically, swap segments (Zhou et al., Qureshi et al., ISCA'09)
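A brief sketch combining the first two write optimizations above (read-before-write plus Flip-N-Write-style inversion); representing a line as a Python integer with an explicit flip flag is an assumption for illustration.

```python
LINE_BITS = 512

def popcount(x):
    return bin(x).count("1")

def pcm_write(old_word, old_flip, new_word):
    """Return (stored_word, flip_flag, bits_written).

    Read-before-write: only cells whose value changes are programmed.
    Flip-N-Write: store the inverted word (and set a flip flag) if that
    changes fewer cells.
    """
    mask = (1 << LINE_BITS) - 1
    stored_old = old_word                                   # what the cells currently hold
    cost_plain = popcount(stored_old ^ new_word)            # cells to program if stored as-is
    cost_flip = popcount(stored_old ^ (new_word ^ mask))    # cells to program if stored inverted
    if cost_flip < cost_plain:
        return new_word ^ mask, True, cost_flip
    return new_word, False, cost_plain

# Writing the complement of what the cells hold: inversion programs 0 cells.
stored, flip, cost = pcm_write(old_word=0, old_flip=False, new_word=(1 << LINE_BITS) - 1)
print(flip, cost)   # -> True 0
```

On a read, the line is re-inverted when the flip flag is set, so the logical value is unchanged while roughly half the worst-case cell programming (and wear) is avoided.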

12 Hard Error Tolerance in PCM
- PCM cells will eventually fail; it is important that capacity degrades gradually when this happens
- Pairing (Ipek et al., ASPLOS'10): among the pool of faulty pages, pair two pages that have faults in different locations and replicate data across the two pages
- Errors are detected with parity bits; replica reads are issued if the initial read is faulty
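A small sketch of the pairing idea, with per-page fault maps represented as bitmasks (a hypothetical representation): two faulty pages can be paired if their fault maps do not overlap, and a read takes each bit from whichever replica is healthy at that position.

```python
# Toy 32-bit "pages" keep the example readable.
def can_pair(faults_a, faults_b):
    """Two faulty pages are compatible if no bit position is broken in both."""
    return (faults_a & faults_b) == 0

def paired_read(data_a, faults_a, data_b, faults_b):
    """Take each bit from the replica whose cell works at that position."""
    assert can_pair(faults_a, faults_b)
    return (data_a & ~faults_a) | (data_b & faults_a)

value = 0b1011_0110_0101_1100_1010_0011_1101_0110
faults_a = 0b0000_1000_0000_0000_0000_0000_0001_0000   # broken cells in page A
faults_b = 0b0100_0000_0000_0010_0000_0000_0000_0000   # broken cells in page B
data_a = value & ~faults_a          # page A loses whatever lands in its broken cells
data_b = value & ~faults_b
assert can_pair(faults_a, faults_b)
assert paired_read(data_a, faults_a, data_b, faults_b) == value
```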

13 ECP, Schechter et al., ISCA'10
- Instead of using ECC to handle a few transient faults (as in DRAM), use error-correcting pointers to handle hard errors in specific locations
- For a 512-bit line with 1 failed bit, maintain a 9-bit field to track the failed location and another bit to store the value for that location
- Can store multiple such pointers and can recover from faults in the pointers too
- ECC has similar storage overhead and can handle soft errors; but ECC has high entropy and can hasten wearout
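A compact sketch of error-correcting pointers for a 512-bit line; keeping the (position, replacement-bit) entries in a Python list is an illustrative simplification of the real packed layout.

```python
LINE_BITS = 512   # a 9-bit pointer covers positions 0..511

def ecp_encode(desired_bits, stuck_faults):
    """Store the line and, for each known-bad cell, a (position, correct bit) entry."""
    stored = list(desired_bits)
    entries = []
    for pos, stuck_value in stuck_faults.items():
        stored[pos] = stuck_value                 # the broken cell holds its stuck-at value
        entries.append((pos, desired_bits[pos]))  # 9-bit pointer + 1 replacement bit
    return stored, entries

def ecp_decode(stored_bits, entries):
    """Patch the stuck cells with the replacement bits on every read."""
    out = list(stored_bits)
    for pos, correct in entries:
        out[pos] = correct
    return out

line = [i % 2 for i in range(LINE_BITS)]
faults = {17: 0, 300: 1}                          # two cells stuck at fixed values
stored, entries = ecp_encode(line, faults)
assert ecp_decode(stored, entries) == line
```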

14 SAFER, Seong et al., MICRO 2010
- Most PCM hard errors are stuck-at faults (stuck at 0 or stuck at 1)
- Either write the word or its flipped version so that the failed bit is made to store the stuck-at value
- For multi-bit errors, the line can be partitioned such that each partition has a single error
- Errors are detected by verifying a write; recently failed bit locations are cached so multiple writes can be avoided
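A minimal sketch of the inversion trick for a single stuck-at cell (the partitioning step for multi-bit errors is omitted); the flip-flag bookkeeping shown here is an assumption for illustration.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def safer_write(word, stuck_pos, stuck_value):
    """Store the word as-is or inverted so the stuck cell already holds the right bit.

    Returns (stored_word, flip_flag): the logical value is stored_word,
    re-inverted on read if flip_flag is set.
    """
    desired_bit = (word >> stuck_pos) & 1
    if desired_bit == stuck_value:
        return word, False                  # stuck cell happens to match: store normally
    return word ^ MASK, True                # otherwise store the complement

def safer_read(stored_word, flip_flag):
    return stored_word ^ MASK if flip_flag else stored_word

word = 0x0123_4567_89AB_CDEF
stored, flip = safer_write(word, stuck_pos=3, stuck_value=0)
stored &= ~(1 << 3)                         # model the stuck cell: bit 3 always reads 0
assert safer_read(stored, flip) == word
```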

15 FREE-p, Yoon et al., HPCA 2011
- When a PCM block (64 B) is unusable because the number of hard errors has exceeded the ECC capability, it is remapped to another address; the pointer to this address is stored in the failed block itself; needs another bit per block
- The pointer can be replicated many times in the failed block to tolerate the multiple errors in that block
- Requires two accesses when handling failed blocks; this overhead can be reduced by caching the pointer at the memory controller
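A short sketch of the replicated-pointer idea, assuming the dead block is carved into pointer-sized slots and the pointer is recovered with a bitwise majority vote; the slot layout and the vote are illustrative assumptions, not the exact FREE-p encoding.

```python
POINTER_BITS = 32          # assumed remap-pointer width
COPIES = 7                 # replicas stored in the dead block

def store_pointer(ptr):
    return [ptr] * COPIES                      # write the pointer into every slot

def recover_pointer(slots):
    """Bitwise majority vote across the replicas tolerates errors in some slots."""
    ptr = 0
    for bit in range(POINTER_BITS):
        ones = sum((slot >> bit) & 1 for slot in slots)
        if ones > len(slots) // 2:
            ptr |= 1 << bit
    return ptr

slots = store_pointer(0xDEAD_BEEF)
slots[1] ^= 0x0000_0F00                        # hard errors corrupt a couple of replicas
slots[4] ^= 0x8000_0001
assert recover_pointer(slots) == 0xDEAD_BEEF   # the vote still recovers the remap pointer
```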
