EECS 262a Advanced Topics in Computer Systems, Lecture 3: Filesystems (Cont’d), September 10th, 2012, John Kubiatowicz and Anthony D. Joseph, Electrical Engineering and Computer Sciences, University of California, Berkeley

Presentation transcript:

EECS 262a Advanced Topics in Computer Systems
Lecture 3: Filesystems (Cont’d)
September 10th, 2012
John Kubiatowicz and Anthony D. Joseph
Electrical Engineering and Computer Sciences
University of California, Berkeley

9/10/2012, cs262a-S12 Lecture-04
Today’s Papers
The HP AutoRAID Hierarchical Storage System (2-up version), John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. Appears in ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996.
Finding a needle in Haystack: Facebook’s photo storage, Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel. Appears in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
System design paper and system analysis paper
Thoughts?

Array Reliability
Reliability of N disks = Reliability of 1 disk ÷ N
50,000 hours ÷ 70 disks ≈ 700 hours
Disk system MTTF: drops from about 6 years to about 1 month!
Arrays (without redundancy) are too unreliable to be useful!
Hot spares support reconstruction in parallel with access: very high media availability can be achieved
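The arithmetic on this slide can be checked with a short sketch. It assumes independent disk failures and no redundancy, so the array MTTF is simply the single-disk MTTF divided by the number of disks; the function name `array_mttf_hours` is made up for this illustration.

```python
HOURS_PER_YEAR = 24 * 365


def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an array without redundancy, assuming independent failures."""
    return disk_mttf_hours / n_disks


single_disk = 50_000                      # hours, from the slide
array = array_mttf_hours(single_disk, 70)  # 70 disks, from the slide

print(f"single disk:   {single_disk / HOURS_PER_YEAR:.1f} years")
print(f"70-disk array: {array:.0f} hours, about {array / 24:.0f} days")
```

The slide rounds 714 hours down to 700; either way the array fails roughly once a month.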

RAID Basics (Two optional papers)
Levels of RAID (those in RED are actually used):
  – RAID 0: striping with no parity (just bandwidth)
  – RAID 1: mirroring (simple, fast, but requires 2x storage)
    » Reads faster, writes slower (why?)
  – RAID 2: bit interleaving with error-correcting codes (ECC)
  – RAID 3: dedicated parity disk, byte-level striping
    » Dedicated parity disk is a write bottleneck, since every write also writes parity
  – RAID 4: dedicated parity disk, block-level striping
  – RAID 5: rotating parity disk, block-level striping
    » Most popular; rotating the parity spreads out the parity load
  – RAID 6: RAID 5 with two parity blocks (tolerates two failures)
If you don’t have RAID 6 with today’s drive sizes, you are asking for trouble…!

Redundant Arrays of Disks
RAID 1: Disk Mirroring/Shadowing
Each disk is fully duplicated onto its "shadow"
Very high availability can be achieved
Bandwidth sacrifice on write: logical write = two physical writes
Reads may be optimized
Most expensive solution: 100% capacity overhead
Targeted for high I/O rate, high availability environments
(Figure label: recovery group)

Redundant Arrays of Disks
RAID 5+: High I/O Rate Parity
A logical write becomes four physical I/Os
Independent writes possible because of interleaved parity
Reed-Solomon Codes ("Q") for protection during reconstruction
Targeted for mixed applications
Layout (disk columns left to right, logical disk addresses increasing downward, each entry one stripe unit):
  D0   D1   D2   D3   P
  D4   D5   D6   P    D7
  D8   D9   P    D10  D11
  D12  P    D13  D14  D15
  P    D16  D17  D18  D19
  D20  D21  D22  D23  P
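The rotating-parity layout in this slide's diagram can be reproduced with a small sketch. The rule used here (parity moves one column to the left on each successive row and data fills the remaining columns left to right) is inferred from the diagram; other RAID 5 implementations rotate parity differently, and `raid5_layout` is a name made up for this example.

```python
def raid5_layout(n_disks: int, n_rows: int) -> list[list[str]]:
    """Return rows of stripe-unit labels with parity rotating right to left."""
    rows = []
    next_block = 0
    for r in range(n_rows):
        # Row 0 keeps parity on the last disk; each later row shifts it left.
        parity_col = (n_disks - 1) - (r % n_disks)
        row = []
        for col in range(n_disks):
            if col == parity_col:
                row.append("P")
            else:
                row.append(f"D{next_block}")
                next_block += 1
        rows.append(row)
    return rows


for row in raid5_layout(n_disks=5, n_rows=6):
    print("  ".join(f"{label:<3}" for label in row))
```

Because no single disk holds all the parity, two small writes to different rows can usually proceed in parallel.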

Problems of Disk Arrays: Small Writes
RAID-5: Small Write Algorithm
1 Logical Write = 2 Physical Reads + 2 Physical Writes
(Figure: in the stripe D0 D1 D2 D3 P, the new data D0' is XORed with the old data D0 (1. Read) and the old parity P (2. Read) to form the new parity P'; D0' (3. Write) and P' (4. Write) are then written, leaving the stripe D0' D1 D2 D3 P'.)
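The small-write algorithm can be sketched on in-memory byte strings. This is only an illustration: disks are modeled as Python `bytes` objects, and the helper names `xor_bytes` and `small_write_parity` are made up for the example. The final check confirms that the incrementally updated parity matches a full recomputation over the stripe.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity = old parity XOR old data XOR new data (needs the 2 reads)."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# A stripe of four data blocks plus parity, as in the figure (D0..D3, P).
stripe = [bytes([i] * 4) for i in (1, 2, 3, 4)]
parity = reduce(xor_bytes, stripe)

new_d0 = b"\xff\x00\xff\x00"
new_parity = small_write_parity(stripe[0], parity, new_d0)  # reads: D0 and P
stripe[0] = new_d0                                          # writes: D0' and P'

# The shortcut agrees with recomputing parity from every data block.
assert new_parity == reduce(xor_bytes, stripe)
```

The shortcut touches only two disks regardless of stripe width, which is why the cost is fixed at two reads and two writes.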

System Availability: Orthogonal RAIDs
(Figure: one Array Controller connected to a row of String Controllers)
Redundant Support Components: fans, power supplies, controller, cables
Data Recovery Group: unit of data redundancy
End to End Data Integrity: internal parity protected data paths

System-Level Availability
Fully dual redundant
Goal: No Single Points of Failure
With duplicated paths, higher performance can be obtained when there are no failures
(Figure labels: host, I/O Controller, Array Controller, Recovery Group)

How to get to “RAID 6”?
One option: Reed-Solomon codes (non-systematic):
  – Use of Galois Fields (finite-field equivalent of real numbers)
  – Data as coefficients, code space as values of polynomial:
  – $P(x) = a_0 + a_1 x^1 + \dots + a_4 x^4$
  – Coded: $P(1), P(2), \dots, P(6), P(7)$
  – Advantage: can add as much redundancy as you like: 5 disks?
Problems with Reed-Solomon codes: decoding gets complex quickly – even to add a second disk
Alternatives: lots of them – I’ve posted one possibility.
  – Idea: Use prime number of columns, diagonal as well as straight XOR
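The "data as coefficients, code as polynomial values" idea can be illustrated with a toy sketch. It assumes arithmetic modulo the prime 257 purely for readability (practical codes typically work in GF(2^8)), and the names `encode` and `decode` are made up for this example. Five data symbols are encoded into seven values, and any five surviving values recover the data by solving the Vandermonde system with Gaussian elimination.

```python
P = 257  # prime modulus; an assumption for readability, not what real arrays use


def encode(coeffs: list[int], n_points: int) -> list[int]:
    """Evaluate the polynomial with these coefficients at x = 1..n_points."""
    return [sum(a * pow(x, i, P) for i, a in enumerate(coeffs)) % P
            for x in range(1, n_points + 1)]


def decode(points: list[tuple[int, int]], k: int) -> list[int]:
    """Recover k coefficients from k surviving (x, y) pairs, working mod P."""
    # Augmented Vandermonde matrix: each row is [1, x, ..., x^(k-1) | y].
    m = [[pow(x, i, P) for i in range(k)] + [y] for x, y in points]
    for col in range(k):
        pivot = next(r for r in range(col, k) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        inv = pow(m[col][col], -1, P)          # modular inverse (Python 3.8+)
        m[col] = [v * inv % P for v in m[col]]
        for r in range(k):
            if r != col and m[r][col]:
                factor = m[r][col]
                m[r] = [(a - factor * b) % P for a, b in zip(m[r], m[col])]
    return [row[k] for row in m]


data = [10, 20, 30, 40, 50]                    # a_0 .. a_4
coded = encode(data, 7)                        # P(1) .. P(7)
survivors = [(x, coded[x - 1]) for x in (1, 3, 4, 6, 7)]  # values 2 and 5 lost
assert decode(survivors, 5) == data
```

Even this toy version needs a matrix solve to decode, which hints at why the slide calls decoding complex compared with plain XOR parity.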

HP AutoRAID – Motivation
Goals: automate the efficient replication of data in a RAID
  – RAIDs are hard to set up and optimize
  – Mix fast mirroring (2 copies) with slower, more space-efficient parity disks
  – Automate the migration between these two levels
RAID small-write problem:
  – To overwrite part of a block requires 2 reads and 2 writes!
  – Read data, read parity, write data, write parity
Each kind of replication has a narrow range of workloads for which it is best...
  – Mistake ⇒ 1) poor performance, 2) changing layout is expensive and error prone
  – Also difficult to add storage: new disk ⇒ change layout and rearrange data...

HP AutoRAID – Key Ideas
Key idea: mirror active data (hot), RAID 5 for cold data
  – Assumes only part of data is in active use at one time
  – Working set changes slowly (to allow migration)
How to implement this idea?
  – Sys-admin
    » Make a human move around the files.... BAD. Painful and error prone
  – File system
    » Best choice, but hard to implement/deploy; can’t work with existing systems
  – Smart array controller: (magic disk) block-level device interface
    » Easy to deploy because there is a well-defined abstraction
    » Enables easy use of NVRAM (why?)

HP AutoRAID – Features
Block Map
  – Level of indirection so that blocks can be moved around among the disks
  – Implies you only need one “zero block” (all zeroes), a variation of copy on write
  – In fact could generalize this to have one real block for each unique block
Mirroring of active blocks
  – RAID 5 for inactive blocks or large sequential writes (why?)
  – Start out fully mirrored, then move to 10% mirrored as disks fill
Promote/demote in 64K chunks (8-16 blocks)
  – Hot swap disks, etc. (A hot swap is just a controlled failure.)
  – Add storage easily (goes into the mirror pool)
  – Useful to allow different size disks (why?)
No need for an active hot spare (per se);
  – Just keep enough working space around
Log-structured RAID 5 writes
  – Nice big streams, no need to read old parity for partial writes

AutoRAID Details
PEX (Physical Extent): 1MB chunk of disk space
PEG (Physical Extent Group): size depends on # disks
  – A group of PEXes assigned to one storage class
Stripe: size depends on # disks
  – One row of parity and data segments in a RAID 5 storage class
Segment: 128 KB
  – Stripe unit (RAID 5) or half of a mirroring unit

Closer Look:

Questions
When to demote? When there is too much mirrored storage (>10%)
  – Demotion leaves a hole (64KB). What happens to it? Moved to free list and reused
  – Demoted RBs are written to the RAID5 log, one write for data, a second for parity
Why is logging to RAID5 better than update in place?
  – Update of data requires reading all the old data to recalculate parity.
  – Log ignores old data (which becomes garbage) and writes only new data/parity stripes.
When to promote? When a RAID5 block is written...
  – Just write it to mirrored and the old version becomes garbage.
How big should an RB be?
  – Bigger ⇒ less mapping information, fewer seeks
  – Smaller ⇒ fine-grained mapping information
How do you find where an RB is?
  – Convert addresses to (LUN, offset) and then look up the RB in a table from this pair.
  – Map size = number of RBs and must be proportional to size of total storage.
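The lookup and promote-on-write behaviour described above can be sketched as a dictionary keyed by (LUN, RB index). This is a simplified illustration, not the controller's actual data structure: the class `BlockMap`, its method names, and the location tuples are all hypothetical, and the 64 KB RB size is taken from the slides.

```python
RB_SIZE = 64 * 1024  # relocation block size in bytes, from the slides


def map_entries(capacity_bytes: int) -> int:
    """Number of map entries: one per RB, so proportional to total storage."""
    return capacity_bytes // RB_SIZE


class BlockMap:
    """Toy block map: (lun, rb_index) -> (storage_class, physical_location)."""

    def __init__(self):
        self.table = {}

    def lookup(self, lun: int, byte_offset: int):
        """Find where the RB holding this host address currently lives."""
        return self.table.get((lun, byte_offset // RB_SIZE))

    def write(self, lun: int, byte_offset: int, mirrored_location):
        """Writes land in mirrored storage; an old RAID 5 copy becomes garbage.

        Returns the RAID 5 location that turned into garbage, if any.
        """
        key = (lun, byte_offset // RB_SIZE)
        old = self.table.get(key)
        self.table[key] = ("mirrored", mirrored_location)
        return old[1] if old and old[0] == "raid5" else None


bmap = BlockMap()
bmap.table[(0, 3)] = ("raid5", "log-segment-17")      # a cold RB
garbage = bmap.write(0, 3 * RB_SIZE + 512, "peg-2/slot-9")
print(bmap.lookup(0, 3 * RB_SIZE), "garbage:", garbage)
print("entries for 1 TiB:", map_entries(2**40))
```

Doubling the RB size halves the number of entries, which is the trade-off in the "how big should an RB be?" question.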

Issues
Disk writes go to two disks (since newly written data is “hot”).
  – Must wait for both to complete (why?).
  – Does the host have to wait for both? No, just for NVRAM.
Controller uses cache for reads
Controller uses NVRAM for fast commit, then moves data to disks
  – What if NVRAM is full? Block until NVRAM is flushed to disk, then write to NVRAM.
What happens in the background?
  – 1) compaction, 2) migration, 3) balancing.
Compaction: clean RAID5 and plug holes in the mirrored disks.
  – Do mirrored disks get cleaned? Yes, when a PEG is needed for RAID5; i.e., pick a disk with lots of holes and move its used RBs to other disks. Resulting empty PEG is now usable by RAID5.
  – What if there aren’t enough holes? Write the excess RBs to RAID5, then reclaim the PEG.
Migration: which RBs to demote? Least-recently-written (not LRU)
Balancing: make sure data is evenly spread across the disks. (Most important when you add a new disk)
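The least-recently-written demotion policy can be sketched with an ordered dictionary. This is a toy model under stated assumptions: the mirrored class has a fixed capacity counted in RBs, demotion happens synchronously rather than in the background, and the class name `MirroredPool` is invented for the example. The key point it shows is that reads do not refresh an RB's position, unlike LRU.

```python
from collections import OrderedDict


class MirroredPool:
    """Toy two-level store: mirrored (hot) RBs with RAID 5 as the cold level."""

    def __init__(self, capacity_rbs: int):
        self.capacity = capacity_rbs
        self.mirrored = OrderedDict()  # rb_id -> None, oldest write first
        self.raid5 = set()

    def write(self, rb_id) -> list:
        """Write an RB to mirrored storage and return any RBs demoted."""
        self.raid5.discard(rb_id)          # promotion: old RAID 5 copy is garbage
        self.mirrored[rb_id] = None
        self.mirrored.move_to_end(rb_id)   # most recently written goes last
        demoted = []
        while len(self.mirrored) > self.capacity:
            victim, _ = self.mirrored.popitem(last=False)  # least recently written
            self.raid5.add(victim)
            demoted.append(victim)
        return demoted

    def read(self, rb_id):
        """Reads report the storage class but never change the demotion order."""
        if rb_id in self.mirrored:
            return "mirrored"
        return "raid5" if rb_id in self.raid5 else None


pool = MirroredPool(capacity_rbs=2)
pool.write("a")
pool.write("b")
pool.read("a")             # would protect "a" under LRU, but not here
print(pool.write("c"))     # "a" is demoted: it was written least recently
```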

Is this a good paper?
What were the authors’ goals?
What about the performance metrics?
Did they convince you that this was a good system?
Were there any red-flags?
What mistakes did they make?
Does the system meet the “Test of Time” challenge?
How would you review this paper today?

Finding a needle in Haystack
This is a systems level solution:
  – Takes into account specific application (Photo Sharing)
    » Large files!, Many files!
    » 260 Billion images, 20 PetaBytes ($10^{15}$ bytes!)
    » One billion new photos a week (60 TeraBytes)
  – Takes into account environment (Presence of Content Delivery Network, CDN)
  – Takes into account usage patterns:
    » New photos accessed a lot (cache well)
    » Old photos accessed little, but likely to be requested at any time ⇒ NEEDLES
(Figure: cumulative graph of accesses as function of age)

Old Solution: NFS
Issues with this design?
Long Tail ⇒ Caching does not work for most photos
  – Every access to back end storage must be fast without benefit of caching!
Linear Directory scheme works badly for many photos/directory
  – Many disk operations to find even a single photo
  – Directory’s block map too big to cache in memory
  – “Fixed” by reducing directory size, however still not great
Meta-Data (FFS) requires ≥ 3 disk accesses per lookup
  – Caching all iNodes in memory might help, but iNodes are big
Fundamentally, Photo Storage is different from other storage:
  – Normal file systems fine for developers, databases, etc.

New Solution: Haystack
Finding a needle (old photo) in Haystack
Differentiate between old and new photos
  – How? By looking at “Writeable” vs “Read-only” volumes
  – New photos go to Writeable volumes
Directory: Help locate photos
  – Name (URL) of photo has embedded volume and photo ID
Let CDN or Haystack Cache serve new photos
  – Rather than forwarding them to Writeable volumes
Haystack Store: Multiple “Physical Volumes”
  – Physical volume is a large file (100 GB) which stores millions of photos
  – Data accessed by Volume ID with offset into file
  – Since Physical Volumes are large files, use XFS which is optimized for large files
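The "one large file plus offsets" idea can be sketched as follows. This is a heavily simplified illustration, not Haystack's on-disk format: the class `PhysicalVolume`, its in-memory index of photo ID to (offset, size), and the temporary file path are assumptions made for the example. It shows why a read needs no directory or inode traversal per photo: the index gives the offset, and one seek plus one read returns the data.

```python
import os
import tempfile


class PhysicalVolume:
    """Toy physical volume: one append-only file with an in-memory index."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}                   # photo_id -> (offset, size)
        open(path, "ab").close()          # create the volume file if missing

    def append(self, photo_id: int, data: bytes) -> None:
        """Append a photo to the end of the volume and record its location."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id: int) -> bytes:
        """Fetch a photo with a single seek and read, using only the index."""
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)


with tempfile.TemporaryDirectory() as d:
    volume = PhysicalVolume(os.path.join(d, "volume-42"))
    volume.append(1001, b"first photo bytes")
    volume.append(1002, b"second photo")
    print(volume.index[1002], volume.read(1002))
```

Keeping the index small enough to stay in memory is what makes the long tail of old photos cheap to serve.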

What about these results?
Are these good benchmarks?
  – Why or why not?
Are these good results?
  – Why or why not?

Discussion of Haystack
Did their design address their goals?
  – Why or why not?
Were they successful?
  – Is this a different question?
What about the benchmarking?
  – Good performance metrics?
  – Did they convince you that this was a good system?
Were there any red-flags?
What mistakes did they make?
Will this system meet the “Test of Time” challenge?