1 Solid State Storage (SSS) System Error Recovery LHO 08 For NASA Langley Research Center.



2 Background NASA Langley Research Center is building a system to record streaming video and other data when the Space Shuttle docks with the Space Station. This data will be used to develop algorithms that will enable the next generation of the space station to perform autonomous docking. Due to the harsh environment in space, the data will be stored in a RAID array of solid state SATA drives with the capability of recovering data even if two drives fail. This Solid State Storage (SSS) system is being developed at VCU. We will look at the portion of the system that deals with drive error recovery.

3 Proposed SSS system Overview [Figure: system block diagram; output labeled "To data recorder"]

4 SSS Data Recovery The Solid State Storage (SSS) system will consist of six solid state data drives. The discussion will be directed to this specific configuration. The data will be sector striped across these six drives. A modified RAID 6 system capable of recovering data from two corrupted sectors in a stripe is proposed. –Optimized for long single-thread transfers that are multiples of the entire stripe.

5 RAID 5 To illustrate concepts and implications, consider a RAID 5 implementation. RAID 5 uses a striped array with rotating parity. Optimized for short, multithreaded transfers. Capable of recovering from a single drive failure.

6 RAID 5 system consisting of three data drives and rotating parity. Four stripes for sectors A, B, C, and D are shown.

7 Rotating Parity Why rotating parity? The following steps are necessary to update a single data sector in a stripe. –The old data sector and the parity sector for the stripe must be read. –Compute the new parity using the new data sector, old data sector, and old parity. –Write new data sector and new parity sector. Thus, to write to a data sector, both the data sector and the parity sector must be read and written. Since there are many data drives, a fixed parity drive would be accessed much more frequently than a data drive. This excessive access of a single parity drive is avoided by rotating parity across all drives.
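
The read-modify-write steps above can be sketched in Python. This is a minimal illustration rather than the SSS implementation: the helper names (`xor_bytes`, `update_parity`) and the 4-byte toy sectors are assumptions made for the example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity = old parity with the old data removed and the new data added."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# Toy stripe: three data sectors of 4 bytes each (hypothetical sizes).
stripe = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Update sector 1: read old data and old parity, compute, then write both.
new_sector = bytes([0xAA, 0xBB, 0xCC, 0xDD])
new_parity = update_parity(parity, stripe[1], new_sector)
stripe[1] = new_sector
```

Only the target data drive and the parity drive are touched, which is why every small write lands on the parity drive.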

8 Rotating parity not needed in SSS The SSS is required to store long data streams, not random sectors. Make the size of these streams a multiple of the stripe size. An entire stripe with parity will be buffered. The entire stripe with parity will be simultaneously written to all drives. –It is not necessary to first read the drives. The SSS will always read and write entire stripes. –Easier to implement. –Faster access.

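9 Parity Parity encoding is given by $P = D_0 \oplus D_1 \oplus D_2 \oplus D_3 \oplus D_4 \oplus D_5$ where $D_i$ represents a data byte in a sector on drive $i$. If both sides of the above equation are exclusive-ORed with $P \oplus D_5$, then $D_5$, for example, can be recovered by $D_5 = P \oplus D_0 \oplus D_1 \oplus D_2 \oplus D_3 \oplus D_4$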
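
As a small sketch of the parity relation for the six-drive case, the Python below computes a parity sector and then rebuilds the sector on drive 5 from the parity and the other five drives. The function name `xor_sectors`, the drive numbering 0 to 5, and the 8-byte toy sectors are assumptions for illustration.

```python
from functools import reduce


def xor_sectors(sectors):
    """XOR a list of equal-length sectors together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*sectors))


# Six toy data sectors D0..D5, 8 bytes each (hypothetical contents).
data = [bytes([i * 16 + j for j in range(8)]) for i in range(6)]

# Encode: P is the XOR of all data sectors.
P = xor_sectors(data)

# Recover: if drive 5 is known to be bad, XOR P with the surviving sectors.
recovered_d5 = xor_sectors([P] + data[:5])
```

The same call recovers any single sector, provided the failed drive is known.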

10 Parity problem Using parity it is easy to recover data on a single drive if we know that drive is bad. We may have data corruption on a drive without the entire drive failing. –Undetectable based on parity alone. Propose to include a 32-bit CRC in each sector. –Simple to implement. –Less than 1% overhead. –In RAID 6 this will ensure that, as long as a stripe has no more than two bad sectors, the data in that stripe can be recovered.
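
One way to picture the per-sector CRC is the sketch below. It assumes a 512-byte payload with a 4-byte CRC appended, and it uses the CRC-32 from Python's standard `zlib` module; the slides do not say which 32-bit CRC polynomial or sector layout the SSS uses, so both are assumptions, as are the helper names.

```python
import zlib

SECTOR_PAYLOAD = 512  # assumed payload size in bytes


def add_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(sector: bytes) -> bool:
    """Check the stored CRC against one recomputed from the payload."""
    payload, stored = sector[:-4], sector[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def find_bad_sectors(stripe):
    """Return the indices of sectors whose CRC does not match."""
    return [i for i, sector in enumerate(stripe) if not crc_ok(sector)]


# Build a six-sector stripe, then flip one bit in sector 3.
stripe = [add_crc(bytes([i]) * SECTOR_PAYLOAD) for i in range(6)]
corrupted = bytearray(stripe[3])
corrupted[10] ^= 0x01
stripe[3] = bytes(corrupted)

bad = find_bad_sectors(stripe)
```

The CRC tells the recovery logic which sectors to treat as missing, which parity alone cannot do. With this assumed layout the overhead is 4 bytes in 516, under 1%.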

11 Key Conclusions Write data as entire stripes. Use a fixed parity drive. Include a sector CRC.

12 RAID 6 (modified) Use two fixed parity drives (P and Q). Data can be recovered if two sectors in a stripe are corrupted. P parity is the same as RAID 5 (simple XOR). –Easy to encode and easy to recover data. Q parity is more complicated.

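13 Q parity encoding The Q parity is a Reed-Solomon code given by $Q = (g_0 \otimes D_0) \oplus (g_1 \otimes D_1) \oplus \cdots \oplus (g_{n-1} \otimes D_{n-1})$ where $\otimes$ is Galois Field (GF) multiplication and $g_i$ is a constant. For $i < 8$ it turns out that $g_i = 2^i$. For larger $i$ it is not as simple; for example, $g_8 = 29$. But for the SSS application, with six data drives, Q simplifies to $Q = D_0 \oplus (2 \otimes D_1) \oplus (4 \otimes D_2) \oplus (8 \otimes D_3) \oplus (16 \otimes D_4) \oplus (32 \otimes D_5)$ The problem is how to compute the GF multiplication.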
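
The Q computation can be sketched in Python as follows. The multiply here uses shift-and-XOR with reduction by the polynomial $x^8 + x^4 + x^3 + x^2 + 1$ (0x11D); that polynomial is an assumption, chosen because it reproduces $g_8 = 29$ as stated above. The names `gf_mul`, `g`, and `q_parity` are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by 0x11D (assumed polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # GF addition is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11D       # reduce when the product overflows 8 bits
        b >>= 1
    return result


def g(i: int) -> int:
    """Constant g_i, computed as 2 multiplied by itself i times in the field."""
    value = 1
    for _ in range(i):
        value = gf_mul(value, 2)
    return value


def q_parity(data_bytes) -> int:
    """Q = XOR over i of (g_i GF-multiplied by D_i), for one byte position."""
    q = 0
    for i, d in enumerate(data_bytes):
        q ^= gf_mul(g(i), d)
    return q


# One byte from each of six data drives (hypothetical values).
example_q = q_parity([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
```

For six drives the constants are plain powers of two, so no reduction occurs in `g(i)`; the reduction only matters inside the products with the data bytes.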

14 GF multiplication In ordinary arithmetic, multiplication can be accomplished by summing the logs and taking the inverse log. GF multiplication is typically accomplished using lookup tables to find the GF log and inverse log. The addition is modulo 255. See Xilinx application note XAPP731 “Hardware Accelerator for RAID 6 Parity Generation / Data Recovery Controller”.
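
A minimal sketch of the table-based approach is shown below. The tables are generated in software from powers of 2 under the assumed polynomial 0x11D rather than copied from the application note, and the names `GF_EXP`, `GF_LOG`, and `gf_mul` are hypothetical.

```python
# Inverse-log (antilog) and log tables for GF(2^8), built from powers of 2.
GF_EXP = [0] * 255   # GF_EXP[k] = 2^k in the field
GF_LOG = [0] * 256   # GF_LOG[v] = k such that 2^k = v; entry 0 is unused

value = 1
for power in range(255):
    GF_EXP[power] = value
    GF_LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D   # assumed primitive polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply by adding logs modulo 255 and taking the inverse log."""
    if a == 0 or b == 0:
        return 0         # zero has no log, so handle it as a special case
    return GF_EXP[(GF_LOG[a] + GF_LOG[b]) % 255]
```

The explicit zero check is needed because the log of zero is undefined. Multiplying by 0x01 needs no special case, since its log is 0.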

15

16

17 Examples

18 Examples Note: $A \otimes B = 0$ if $A = 0$ or $B = 0$. This is a special case and cannot be computed using logs. It is also worth noting that $A \otimes 1 = A$. This does follow from using logs since $\log_{GF}(\text{0x01}) = 0$.

19 Elaboration on Galois Field Mathematics Évariste Galois (1832) –Established many of the ideas of group theory. –Left only sixty pages of mathematical writings. –Mortally wounded in a duel at age 20. Most of his major contributions stem from a letter written the night before the duel. His work has had great impact. Provides a powerful tool for investigating fundamental mathematical problems. –Roots of algebraic equations. –GF theory provides a simple proof that an angle cannot be trisected using only compass and unmarked straightedge. »This had baffled mathematicians since the time of Euclid. Recently applied to computer design and data-communication systems.

20 Galois Field Mathematics A Galois Field is an algebraic structure $\langle G, \oplus, \otimes \rangle$ where G is a set consisting of $2^n$ elements, $\oplus$ is addition mod 2 (bitwise XOR) and $\otimes$ is GF multiplication. Math similar to ordinary arithmetic. $\oplus$ and $\otimes$ are commutative and associative. Distributive such that $A \otimes (B \oplus C) = (A \otimes B) \oplus (A \otimes C)$ We are only concerned with GF($2^8$) where the set G has 256 elements. We will use a hex byte to specify the elements. Then $A \oplus A$ = 0x00, $A \otimes$ 0x00 = 0x00, $A \otimes$ 0x01 = $A$
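
These properties can be spot-checked numerically. The sketch below assumes the reduction polynomial 0x11D for GF(2^8) and checks the laws on a sample of byte values; `gf_add`, `gf_mul`, and `samples` are hypothetical names, and a passing run is a sanity check, not a proof.

```python
from itertools import product


def gf_add(a: int, b: int) -> int:
    """GF(2^8) addition is bitwise XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """GF(2^8) multiplication by shift-and-XOR, reducing by 0x11D (assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


samples = range(0, 256, 15)  # 0x00, 0x0F, ..., 0xFF

for a, b, c in product(samples, repeat=3):
    assert gf_mul(a, b) == gf_mul(b, a)                          # commutative
    assert gf_mul(a, gf_mul(b, c)) == gf_mul(gf_mul(a, b), c)    # associative
    assert gf_mul(a, gf_add(b, c)) == gf_add(gf_mul(a, b), gf_mul(a, c))  # distributive

for a in range(256):
    assert gf_add(a, a) == 0x00
    assert gf_mul(a, 0x00) == 0x00
    assert gf_mul(a, 0x01) == a
```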

21 GF($2^8$) The GF log lookup tables are generated based on what in GF theory is called a primitive polynomial. Primitive polynomials have certain properties that lead to the error correction techniques. GF($2^8$) is generated using the primitive polynomial $x^8 + x^4 + x^3 + x^2 + 1$ This is the same primitive polynomial used to determine the feedback path for an 8-bit maximum count linear feedback shift register (LFBSR). The LFBSR can be used to perform GF multiplication.
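
As a short worked example of how the primitive polynomial enters the arithmetic, consider multiplying 0x80 by 0x02. The derivation below assumes the polynomial is $x^8 + x^4 + x^3 + x^2 + 1$, which is consistent with the value $g_8 = 29$ quoted for the Q parity constants; treating bytes as polynomials with bit $k$ as the coefficient of $x^k$ is the usual convention and is assumed here.

```latex
\begin{aligned}
\text{0x80} \otimes \text{0x02} &= x^7 \cdot x = x^8 \\
x^8 &\equiv x^4 + x^3 + x^2 + 1 \pmod{x^8 + x^4 + x^3 + x^2 + 1} \\
x^4 + x^3 + x^2 + 1 &= 00011101_2 = \text{0x1D} = 29
\end{aligned}
```

Coefficients are taken mod 2, so subtracting the polynomial is the same as XORing with 0x11D.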

22 The 8 bit LFBSR [Figure: shift register with stages Q0 through Q7, shown again with the order reversed so that the most significant bit is at the left] A shift has the same effect as $\otimes\, 2$. In VHDL: Q <= Q(6) & Q(5) & Q(4) & (Q(3) XOR Q(7)) & (Q(2) XOR Q(7)) & (Q(1) XOR Q(7)) & Q(0) & Q(7);
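
To see that one shift of this register matches multiplication by 2, the VHDL assignment can be mirrored bit for bit in Python and compared with a shift-and-reduce multiply. The function names are hypothetical, and the reduction constant 0x11D is inferred from the XOR taps in the VHDL (bits 4, 3, 2, and 0).

```python
def lfbsr_shift(q: int) -> int:
    """One shift of the 8-bit register, mirroring the VHDL concatenation."""
    def bit(n: int) -> int:
        return (q >> n) & 1

    q7 = bit(7)
    # Most significant bit first, in the same order as the VHDL expression.
    new_bits = [
        bit(6), bit(5), bit(4),
        bit(3) ^ q7, bit(2) ^ q7, bit(1) ^ q7,
        bit(0), q7,
    ]
    out = 0
    for b in new_bits:
        out = (out << 1) | b
    return out


def times_two(a: int) -> int:
    """Multiply by 2 in GF(2^8): shift left, then reduce by 0x11D on overflow."""
    a <<= 1
    return (a ^ 0x11D) if a & 0x100 else a


# The two agree for every byte value.
matches = all(lfbsr_shift(q) == times_two(q) for q in range(256))
```

Starting from 0x01 and shifting repeatedly steps through the successive powers of 2 in the field.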

23 Shift example
Before shift   Bit     After shift ($\otimes\, 2$)
0              $X_7$   $X_6$
0              $X_6$   $X_5$
0              $X_5$   $X_4$
1              $X_4$   $X_3 \oplus X_7$
1              $X_3$   $X_2 \oplus X_7$
1              $X_2$   $X_1 \oplus X_7$
0              $X_1$   $X_0$
1              $X_0$   $X_7$

24

25 Galois Field Division
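
By analogy with multiplication, GF division can be sketched as subtracting logs modulo 255 and taking the inverse log. The Python below is an assumption about the approach, not a transcription of the slide: the tables are built from powers of 2 under the assumed polynomial 0x11D, and all names are hypothetical. The last lines use division to rebuild one lost data byte from the Q parity alone, taking the constants as $g_i = 2^i$.

```python
# Log and inverse-log tables for GF(2^8), built from powers of 2.
GF_EXP = [0] * 255
GF_LOG = [0] * 256   # entry 0 is unused: zero has no log

value = 1
for power in range(255):
    GF_EXP[power] = value
    GF_LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D   # assumed primitive polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply by adding logs modulo 255."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[(GF_LOG[a] + GF_LOG[b]) % 255]


def gf_div(a: int, b: int) -> int:
    """Divide by subtracting logs modulo 255."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return GF_EXP[(GF_LOG[a] - GF_LOG[b]) % 255]


# One byte per data drive (hypothetical values), with g_i = 2^i for i < 6.
data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
Q = 0
for i, d in enumerate(data):
    Q ^= gf_mul(1 << i, d)

# Suppose drive 3 is lost: strip the surviving terms from Q, then divide by g_3.
lost = 3
partial = Q
for i, d in enumerate(data):
    if i != lost:
        partial ^= gf_mul(1 << i, d)
recovered = gf_div(partial, 1 << lost)
```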