CS 61C: Great Ideas in Computer Architecture Dependability - ECC Nicholas Weaver & Vladimir Stojanovic 1.

Slides:



Advertisements
Similar presentations
A Case for Redundant Arrays Of Inexpensive Disks Paper By David A Patterson Garth Gibson Randy H Katz University of California Berkeley.
Advertisements

What is RAID Redundant Array of Independent Disks.
RAID (Redundant Arrays of Independent Disks). Disk organization technique that manages a large number of disks, providing a view of a single disk of High.
I/O Errors 1 Computer Organization II © McQuain RAID Redundant Array of Inexpensive (Independent) Disks – Use multiple smaller disks (c.f.
parity bit is 1: data should have an odd number of 1's
CSCI 4717/5717 Computer Architecture
 RAID stands for Redundant Array of Independent Disks  A system of arranging multiple disks for redundancy (or performance)  Term first coined in 1987.
RAID- Redundant Array of Inexpensive Drives. Purpose Provide faster data access and larger storage Provide data redundancy.
2P13 Week 11. A+ Guide to Managing and Maintaining your PC, 6e2 RAID Controllers Redundant Array of Independent (or Inexpensive) Disks Level 0 -- Striped.
REDUNDANT ARRAY OF INEXPENSIVE DISCS RAID. What is RAID ? RAID is an acronym for Redundant Array of Independent Drives (or Disks), also known as Redundant.
Instructor: Justin Hsia 8/08/2013Summer Lecture #271 CS 61C: Great Ideas in Computer Architecture Dependability: Parity, RAID, ECC.
CSE 461: Error Detection and Correction. Next Topic  Error detection and correction  Focus: How do we detect and correct messages that are garbled during.
NETWORKING CONCEPTS. ERROR DETECTION Error occures when a bit is altered between transmission& reception ie. Binary 1 is transmitted but received is binary.
Reliability of Disk Systems. Reliability So far, we looked at ways to improve the performance of disk systems. Next, we will look at ways to improve the.
Performance/Reliability of Disk Systems So far, we looked at ways to improve the performance of disk systems. Next, we will look at ways to improve the.
Quantum Error Correction SOURCES: Michele Mosca Daniel Gottesman Richard Spillman Andrew Landahl.
Other Disk Details. 2 Disk Formatting After manufacturing disk has no information –Is stack of platters coated with magnetizable metal oxide Before use,
David Culler Electrical Engineering and Computer Sciences
Reliability and Channel Coding
15-853Page :Algorithms in the Real World Error Correcting Codes I – Overview – Hamming Codes – Linear Codes.
Hamming Code Rachel Ah Chuen. Basic concepts Networks must be able to transfer data from one device to another with complete accuracy. Data can be corrupted.
Storage System: RAID Questions answered in this lecture: What is RAID? How does one trade-off between: performance, capacity, and reliability? What is.
CS 61C: Great Ideas in Computer Architecture Dependability Instructor: David A. Patterson 1Spring Lecture.
Chapter 6 RAID. Chapter 6 — Storage and Other I/O Topics — 2 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f.
CS 352 : Computer Organization and Design University of Wisconsin-Eau Claire Dan Ernst Storage Systems.
N-Tier Client/Server Architectures Chapter 4 Server - RAID Copyright 2002, Dr. Ken Hoganson All rights reserved. OS Kernel Concept RAID – Redundant Array.
Lecture 9 of Advanced Databases Storage and File Structure (Part II) Instructor: Mr.Ahmed Al Astal.
Redundant Array of Inexpensive Disks aka Redundant Array of Independent Disks (RAID) Modified from CCT slides.
CSI-09 COMMUNICATION TECHNOLOGY FAULT TOLERANCE AUTHOR: V.V. SUBRAHMANYAM.
RAID REDUNDANT ARRAY OF INEXPENSIVE DISKS. Why RAID?
CS 61C: Great Ideas in Computer Architecture Lecture 25: Dependability and RAID Instructor: Sagar Karandikar
Information Coding in noisy channel error protection:-- improve tolerance of errors error detection: --- indicate occurrence of errors. Source.
CS 61C: Great Ideas in Computer Architecture Dependability and RAID
Data Link Layer: Error Detection and Correction
Copyright © Curt Hill, RAID What every server wants!
Data and Computer Communications by William Stallings Eighth Edition Digital Data Communications Techniques Digital Data Communications Techniques Click.
Practical Session 10 Error Detecting and Correcting Codes.
Riyadh Philanthropic Society For Science Prince Sultan College For Woman Dept. of Computer & Information Sciences CS 251 Introduction to Computer Organization.
Error Detection and Correction
The concept of RAID in Databases By Junaid Ali Siddiqui.
Implicit-Storing and Redundant- Encoding-of-Attribute Information in Error-Correction-Codes Yiannakis Sazeides 1, Emre Ozer 2, Danny Kershaw 3, Panagiota.
CS 61C: Great Ideas in Computer Architecture Dependability and RAID Lecture 25 Instructors: John Wawrzynek & Vladimir Stojanovic
10.1 Chapter 10 Error Detection and Correction Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Error Detection and Correction
Error-Detecting and Error-Correcting Codes
1 Lecture 27: Disks Today’s topics:  Disk basics  RAID  Research topics.
Practical Session 10 Computer Architecture and Assembly Language.
CS203 – Advanced Computer Architecture Dependability & Reliability.
Reliability of Disk Systems. Reliability So far, we looked at ways to improve the performance of disk systems. Next, we will look at ways to improve the.
CS 110 Computer Architecture Lecture 25: Dependability and RAID Instructor: Sören Schwertfeger School of Information Science.
Network-Attached Storage. Network-attached storage devices Attached to a local area network, generally an Ethernet-based network environment.
CS Introduction to Operating Systems
A Case for Redundant Arrays of Inexpensive Disks (RAID) -1988
parity bit is 1: data should have an odd number of 1's
Computer Architecture and Assembly Language
Vladimir Stojanovic & Nicholas Weaver
Fault Tolerance & Reliability CDA 5140 Spring 2006
Error Correcting Code.
Coding Theory Dan Siewiorek June 2012.
Krste Asanović & Randy H. Katz
Krste Asanović & Randy H. Katz
Sequential circuits and Digital System Reliability
Information Redundancy Fault Tolerant Computing
RAID Redundant Array of Inexpensive (Independent) Disks
UNIT IV RAID.
CS 325: CS Hardware and Software Organization and Architecture
Reliability and Channel Coding
Computer Architecture and Assembly Language
parity bit is 1: data should have an odd number of 1's
Seminar on Enterprise Software
Presentation transcript:

CS 61C: Great Ideas in Computer Architecture Dependability - ECC Nicholas Weaver & Vladimir Stojanovic 1

Great Idea #6: Dependability via Redundancy Applies to everything from datacenters to memory – Redundant datacenters so that can lose 1 datacenter but Internet service stays online – Redundant routes so can lose nodes but Internet doesn’t fail Or at least can recover quickly… – Redundant disks so that can lose 1 disk but not lose data (Redundant Arrays of Independent Disks/RAID) – Redundant memory bits of so that can lose 1 bit but no data (Error Correcting Code/ECC Memory) 2

Dependability Corollary: Fault Detection The ability to determine that something is wrong is often the key to redundancy – "Work correctly or fail" is far easier to deal with than "May work incorrectly on failure" Error detection is generally a necessary prerequisite to error correction – And as we saw with rowhammer: Errors aren't just errors, but can be potential avenues for exploitation! 3

Dependability via Redundancy: Time vs. Space Spatial Redundancy – replicated data or check information or hardware to handle hard and soft (transient) failures Temporal Redundancy – redundancy in time (retry) to handle soft (transient) failures – "Insanity overcoming soft failures is repeatedly doing the same thing and expecting different results" 4

Dependability Measures Reliability: Mean Time To Failure (MTTF) Service interruption: Mean Time To Repair (MTTR) Mean time between failures (MTBF) – MTBF = MTTF + MTTR Availability = MTTF / (MTTF + MTTR) Improving Availability – Increase MTTF: More reliable hardware/software + Fault Tolerance – Reduce MTTR: improved tools and processes for diagnosis and repair 5

Availability Measures Availability = MTTF / (MTTF + MTTR) as % – MTTF, MTBF usually measured in hours Since hope rarely down, shorthand is “number of 9s of availability per year” 1 nine: 90% => 36 days of repair/year 2 nines: 99% => 3.6 days of repair/year 3 nines: 99.9% => 526 minutes of repair/year 4 nines: 99.99% => 53 minutes of repair/year 5 nines: % => 5 minutes of repair/year 6

Reliability Measures Another is average number of failures per year: Annualized Failure Rate (AFR) – E.g., 1000 disks with 100,000 hour MTTF – 365 days * 24 hours = 8760 hours – (1000 disks * 8760 hrs/year) / 100,000 = 87.6 failed disks per year on average – 87.6/1000 = 8.76% annual failure rate Google’s 2007 study* found that actual AFRs for individual drives ranged from 1.7% for first year drives to over 8.6% for three-year old drives 7 *research.google.com/archive/disk_failures.pdf

The "Bathtub Curve" Often failures follow the "bathtub curve" Brand new devices may fail – "Crib death" Old devices fail Random failure in between 8

Dependability Design Principle Design Principle: No single points of failure – “Chain is only as strong as its weakest link” Dependability Corollary of Amdahl’s Law – Doesn’t matter how dependable you make one portion of system – Dependability limited by part you do not improve + Murphy's Law – If you have one thing different, that is the one thing that will fail! 9

Error Detection/Correction Codes Memory systems generate errors (accidentally flipped-bits) – DRAMs store very little charge per bit – “Soft” errors occur occasionally when cells are struck by alpha particles or other environmental upsets – “Hard” errors can occur when chips permanently fail – Problem gets worse as memories get denser and larger Memories protected against failures with EDC/ECC Extra bits are added to each data-word – Used to detect and/or correct faults in the memory system – Each data word value mapped to unique code word – A fault changes valid code word to invalid one, which can be detected 10

Block Code Principles Hamming distance = difference in # of bits p = , q = , Ham. distance (p,q) = 2 p = , q = , distance (p,q) = ? Can think of extra bits as creating a code with the data What if minimum distance between members of code is 2 and get a 1-bit error? Richard Hamming, Turing Award Winner 11

Parity: Simple Error-Detection Coding Each data value, before it is written to memory is “tagged” with an extra bit to force the stored word to have even parity: Each word, as it is read from memory is “checked” by finding its parity (including the parity bit). b7b6b5b4b3b2b1b0b7b6b5b4b3b2b1b0 + b 7 b 6 b 5 b 4 b 3 b 2 b 1 b 0 p + c Minimum Hamming distance of parity code is 2 A non-zero parity check indicates an error occurred: –2 errors (on different bits) are not detected –nor any even number of errors, just odd numbers of errors are detected 12 p

Parity Example Data ones, even parity now Write to memory: to keep parity even Data ones, odd parity now Write to memory: to make parity even Read from memory ones => even parity, so no error Read from memory ones => odd parity, so error What if error in parity bit? 13

Suppose Want to Correct 1 Error? Richard Hamming came up with simple to understand mapping to allow Error Correction at minimum distance of 3 – Single error correction, double error detection Called “Hamming ECC” – Worked weekends on relay computer with unreliable card reader, frustrated with manual restarting – Got interested in error correction; published 1950 – R. W. Hamming, “Error Detecting and Correcting Codes,” The Bell System Technical Journal, Vol. XXVI, No 2 (April 1950) pp

Detecting/Correcting Code Concept Detection: bit pattern fails codeword check Correction: map to nearest valid code word Space of possible bit patterns (2 N ) Sparse population of code words (2 M << 2 N ) - with identifiable signature Error changes bit pattern to non-code 15

Hamming Distance: 8 code words 16

Hamming Distance 2: Detection Detect Single Bit Errors 17 No 1 bit error goes to another valid codeword ½ codewords are valid This is parity Invalid Codewords

Hamming Distance 3: Correction Correct Single Bit Errors, Detect Double Bit Errors 18 No 2 bit error goes to another valid codeword; 1 bit error near 1/4 codewords are valid Nearest 000 (one 1) Nearest 111 (one 0)

Graphic of Hamming Code 19

Hamming ECC Set parity bits to create even parity for each group A byte of data: Create the coded word, leaving spaces for the parity bits: _ _ 1 _ _ a b c – bit position Calculate the parity bits 20

Hamming ECC Position 1 checks bits 1,3,5,7,9,11: ? _ 1 _ _ set position 1 to a _: Position 2 checks bits 2,3,6,7,10,11: 0 ? 1 _ _ set position 2 to a _: Position 4 checks bits 4,5,6,7,12: ? _ set position 4 to a _: Position 8 checks bits 8,9,10,11,12: ? set position 8 to a _: 21

Hamming ECC Position 1 checks bits 1,3,5,7,9,11: ? _ 1 _ _ set position 1 to a 0: 0 _ 1 _ _ Position 2 checks bits 2,3,6,7,10,11: 0 ? 1 _ _ set position 2 to a 1: _ _ Position 4 checks bits 4,5,6,7,12: ? _ set position 4 to a 1: _ Position 8 checks bits 8,9,10,11,12: ? set position 8 to a 0:

Hamming ECC Final code word: Data word:

Hamming ECC Error Check Suppose receive

Hamming ECC Error Check Suppose receive

Hamming ECC Error Check Suppose receive √ X -Parity 2 in error √ X -Parity 8 in error Implies position 8+2=10 is in error

Hamming ECC Error Correct Flip the incorrect bit …

Hamming ECC Error Correct Suppose receive √ √ √ √ 28

One Problem: Malicious "errors" Error Correcting Code and Error Detecting codes designed for random errors But sometimes you need to protect against deliberate errors Enter cryptographic hash functions – Designed to be nonreversible and unpredictable – An attacker should not be able to change, add, or remove any bits without changing the hash output For a 256b cryptographic hash function (e.g. SHA256), need to have items you are comparing before you have a reasonable possibility of a collision – This is also known as a "Message Digest" 29

And, in Conclusion, … Great Idea: Redundancy to Get Dependability – Spatial (extra hardware) and Temporal (retry if error) Reliability: MTTF & Annualized Failure Rate (AFR) Availability: % uptime (MTTF-MTTR/MTTF) Memory – Hamming distance 2: Parity for Single Error Detect – Hamming distance 3: Single Error Correction Code + encode bit position of error Treat disks like memory, except you know when a disk has failed—erasure makes parity an Error Correcting Code RAID-2, -3, -4, -5: Interleaved data and parity 30