Error Correcting Memory

EECS 373 Jon Beaumont Ben Mason

What is ECC? Error Correcting Code (ECC) is a mechanism that lets a system detect, and in many schemes correct, corrupted data so that stored and transmitted values stay reliable

Why ECC? ECC protects against both soft errors and transmission errors.
This is particularly necessary in systems that must run continuously with very low tolerance for error.

What happens after a Soft Error?
Incorrect values in the instruction or data streams.
Best case: execution of illegal instructions or access to illegal memory addresses, followed by an automatic reboot.
Worst case: the error goes undetected and multiplies as the bad data is used to calculate new data.

ECC vs No-ECC

ECC Considerations
What range of errors?
How much overhead?
Detection versus correction?

Different Methods of Memory Correction
Detection: parity bit
Detection and correction: triple redundancy, Hamming code

Parity Bit (even parity)
For every chunk of data, add a single parity bit, set so that the total number of binary 1's is even. An odd number of binary 1's means an error has occurred.

Parity Bit (even parity)
Raw Data: (4 1’s) (5 1’s) Prepend a parity bit

Parity Bit Pros and Cons
Cons: Can detect only an odd number of bit errors. No way to tell which bit caused the error, so the data can only be discarded.
Pros: Simple to implement (XOR). Low overhead. Good for applications in which the original data can easily be resent or recalculated (e.g. SCSI, PCI, UART).
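The even-parity scheme above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function names and the 7-bit example value are assumptions chosen for the demo.

```python
def even_parity_bit(data: int) -> int:
    """Return the bit that makes the total count of 1s (data + parity) even."""
    return bin(data).count("1") % 2


def add_parity(data: int, width: int) -> int:
    """Prepend the parity bit just above the top data bit (width = data bits)."""
    return (even_parity_bit(data) << width) | data


def parity_ok(word: int) -> bool:
    """True if the word (data plus parity bit) holds an even number of 1s."""
    return bin(word).count("1") % 2 == 0


word = add_parity(0b1011001, 7)          # four 1s, so the parity bit is 0
assert parity_ok(word)
assert not parity_ok(word ^ 0b0000100)   # one flipped bit is detected
assert parity_ok(word ^ 0b0001100)       # two flipped bits go unnoticed
```

The last assertion shows the limitation listed under the cons: an even number of flipped bits leaves the parity unchanged.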

Triple Redundancy
Data is calculated and stored 3 times; on a read, the majority value wins.
Pros: Simple to execute. Can correct errors (potentially multiple bits, as long as no bit position is wrong in two copies).
Cons: Very inefficient (1:2 ratio of data to overhead).
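Majority voting can be done bit by bit with three AND operations and two ORs. The sketch below assumes the three copies are plain integers; `majority_vote` and the sample values are illustrative names, not from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Each result bit takes the value held by at least two of the three copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b11010010
copy_a = stored ^ 0b00000100   # bit 2 corrupted in the first copy
copy_b = stored                # second copy intact
copy_c = stored ^ 0b01000000   # bit 6 corrupted in the third copy

# Errors in different bit positions are all outvoted.
assert majority_vote(copy_a, copy_b, copy_c) == stored

# The same bit wrong in two copies wins the vote and corrupts the result.
assert majority_vote(stored ^ 1, stored ^ 1, stored) == stored ^ 1
```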

Hamming Code
Objective: a concise method of finding the precise location of an error, so that it can be corrected without drastic action.
Intuition: include multiple parity bits, so that each data bit is uniquely identified by the set of parity bits that cover it.

Hamming Code Algorithm:
Assign each position in a chunk of data a binary number, starting from 1. Those positions that are a power of 2 (i.e. have exactly one 1 bit) are parity bits; all other positions hold data bits.

Hamming Code Algorithm:
Each parity bit covers every position whose binary index has that parity bit's 1 bit set. P1 (binary 001) covers [D7, D5, D3, P1]

Hamming Code Algorithm:
Each parity bit covers every position whose binary index has that parity bit's 1 bit set. P2 (binary 010) covers [D7, D6, D3, P2]

Hamming Code Algorithm:
Each parity bit covers every position whose binary index has that parity bit's 1 bit set. P4 (binary 100) covers [D7, D6, D5, P4]
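The coverage rule can be checked mechanically: a position belongs to a parity bit's group when a bitwise AND of the two position numbers is nonzero. A small sketch, with `covered_positions` as an assumed helper name:

```python
def covered_positions(parity_pos: int, n_bits: int) -> list:
    """Positions 1..n_bits whose binary index shares the parity position's 1 bit."""
    return [pos for pos in range(1, n_bits + 1) if pos & parity_pos]


for p in (1, 2, 4):
    print(f"P{p} covers positions {covered_positions(p, 7)}")
# P1 covers positions [1, 3, 5, 7]
# P2 covers positions [2, 3, 6, 7]
# P4 covers positions [4, 5, 6, 7]
```

Each list includes the parity position itself, matching the groups [D7, D5, D3, P1], [D7, D6, D3, P2], and [D7, D6, D5, P4].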

Hamming Code Example: Encoding the nibble b1101 using even parity:
Allocate space for parity bits: b110_1__

Hamming Code Example: Encoding the nibble b1101 using even parity:
P1 covers [D7,D5,D3] b110_1_?

Hamming Code Example: Encoding the nibble b1101 using even parity:
P1 covers [D7,D5,D3] = 1,0,1 (two 1's, so P1 = 0) b110_1_0

Hamming Code Example: Encoding the nibble b1101 using even parity:
P2 covers [D7,D6,D3] b110_1?0

Hamming Code Example: Encoding the nibble b1101 using even parity:
P2 covers [D7,D6,D3] = 1,1,1 (three 1's, so P2 = 1) b110_110

Hamming Code Example: Encoding the nibble b1101 using even parity:
P4 covers [D7,D6,D5] b110?110

Hamming Code Example: Encoding the nibble b1101 using even parity:
P4 covers [D7,D6,D5] = 1,1,0 (two 1's, so P4 = 0) b1100110

Hamming Code Example: Encoding the nibble b1101 using even parity:
Encoded data: b1100110
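The encoding steps can be sketched as follows. The nibble 1101 is read off the layout b110_1__ (D7, D6, D5, D3 from left to right); the function name `hamming74_encode` is an assumption for illustration.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits (D7 D6 D5 D3, most significant first) into 7 bits."""
    d7 = (nibble >> 3) & 1
    d6 = (nibble >> 2) & 1
    d5 = (nibble >> 1) & 1
    d3 = nibble & 1
    # Even parity: each parity bit is the XOR of the data bits it covers.
    p1 = d7 ^ d5 ^ d3
    p2 = d7 ^ d6 ^ d3
    p4 = d7 ^ d6 ^ d5
    # Assemble positions 7 down to 1: D7 D6 D5 P4 D3 P2 P1.
    word = 0
    for bit in (d7, d6, d5, p4, d3, p2, p1):
        word = (word << 1) | bit
    return word


print(f"b{hamming74_encode(0b1101):07b}")  # b1100110
```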

Hamming Code D6 gets flipped between write and read
b1100110 -> b1000110

Hamming Code D6 gets flipped between write and read
b1100110 -> b1000110
Parity bit 1: [D7,D5,D3,P1] = 1,0,1,0. Even number of 1 bits -> No Error

Hamming Code D6 gets flipped between write and read
b1100110 -> b1000110
Parity bit 2: [D7,D6,D3,P2] = 1,0,1,1. Odd number of 1 bits -> ERROR
Parity bits generating error: [P2]

Hamming Code D6 gets flipped between write and read
b1100110 -> b1000110
Parity bit 4: [D7,D6,D5,P4] = 1,0,0,0. Odd number of 1 bits -> ERROR
Parity bits generating error: [P2, P4]

Hamming Code Parity bits generating error: [P2, P4]
X = ERROR, O = NO ERROR
      D3   D5   D6   D7
P1    O    O         O
P2    X         X    X
P4         X    X    X
Only column with just X's is D6, the incorrect bit
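The lookup in the table has a shortcut: adding the position numbers of the failing parity bits gives the position of the flipped bit (here 2 + 4 = 6). A minimal sketch of the check-and-correct step, assuming at most one bit has flipped; `hamming74_correct` is an illustrative name.

```python
def hamming74_correct(word: int):
    """Return (corrected_word, error_position); position 0 means no error found."""
    syndrome = 0
    for parity_pos in (1, 2, 4):
        ones = 0
        for pos in range(1, 8):
            if pos & parity_pos:              # position is in this parity group
                ones += (word >> (pos - 1)) & 1
        if ones % 2:                          # odd count under even parity: check fails
            syndrome |= parity_pos
    if syndrome:
        word ^= 1 << (syndrome - 1)           # flip the bit at the faulty position
    return word, syndrome


fixed, position = hamming74_correct(0b1000110)
print(f"b{fixed:07b}", position)  # b1100110 6
```

With two or more flipped bits the computed position points at the wrong bit, so this plain (7,4) scheme corrects single-bit errors only.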

Hamming Code
Pros: Overhead of only O(log(n)) bits
4 data bits -> 3 parity bits (57% of stored bits are data)
247 data bits -> 8 parity bits (97% of stored bits are data)
Good for large chunks of memory (DRAM)
Cons: Detection logic is more complicated to implement than a simple parity bit
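The logarithmic overhead follows from the requirement that r parity bits must be able to name every position in the codeword plus the "no error" case, i.e. 2^r >= m + r + 1 for m data bits. A short sketch under that assumption (the helper name is illustrative):

```python
def parity_bits_needed(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


for m in (4, 247):
    r = parity_bits_needed(m)
    print(f"{m} data bits -> {r} parity bits ({m / (m + r):.0%} data)")
# 4 data bits -> 3 parity bits (57% data)
# 247 data bits -> 8 parity bits (97% data)
```

Eight parity bits cover at most 247 data bits (a 255-bit codeword); one more data bit would require a ninth parity bit.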

Drawbacks of ECC
More expensive
When an error correcting algorithm acts on a shorter correction code, performance drops abruptly. This loss of performance is known as the “error floor phenomenon”

Recent Developments in ECC
Moving away from the Hamming code scheme towards BCH codes, which are more efficient.
For more information visit BCH_code.html

Questions?